<img style="display: block; margin: 0 auto" src="https://images.squarespace-cdn.com/content/v1/645a878d9740963714b8f343/3efb24e3-9fb9-4bc7-b41e-7f36742ae747/2-2.jpg?format=1500w" alt="Lonely Octopus Logo">

**Please create a copy of this notebook in your Google Drive so that you can edit it.**

**You can make a copy from the menu: File > Save a copy in Drive**

# Analyze the US Distribution of Wealth, Economic and Social Gaps by State



You are working as a Research Analyst at a renowned consultancy. After working on a previous project and identifying the US as an ideal country for investment, it’s time to dig deeper into the numbers. <br>
The main objective is to create a development impact assessment. <br>
In order to achieve sustainability goals, it’s important to choose sites with the right population, education, infrastructure, purchasing power and diversification.

You will first select significant variables through visualization and factor analysis, then group similar variables together, and finally create clusters of states with similar characteristics.

Perform your analysis in the following order:

> 1) **Exploratory Data Analysis** <br>
> 2) **Statistics and Probability**<br>
> 3) **Machine Learning**<br>


Your analysis will help rank states on multiple factors, showing which states would benefit from which projects and what they could look like once those projects are implemented.

## **Pre-requisite Actions**

In [2]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Optional: Altair charts can look sharper than matplotlib's
# import altair as alt

## **Get to know the Data**
Transform all data into a cross-sectional format as shown below: one row per state, so that each variable (column) has 50 observations.


| Index | State   | Total Population 2021 | Population Growth or Decline 2010 to 2021 | Other variables |
|-------|---------|-----------------------|-------------------------------------------|-----------------|
| 1     | Alabama | 5,039,877             | 5.40%                                     | ...             |
| 2     | Alaska  | 732,673               | 3.20%                                     | ...             |
| ...   | ...     | ...                   | ...                                       | ...             |
| 50    | Wyoming | 578,803               | 2.70%                                     | ...             |

In [None]:
# Check data types
# Count missing values in each column
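
As a starting point for this cell, here is a minimal sketch that uses only the three rows from the sample table above. It assumes the values arrive as formatted text (thousands separators and percent signs), which may not match how the workbook actually loads. The helper `to_number` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Three rows copied from the sample table above, kept as text the way a
# formatted spreadsheet export might deliver them (an assumption).
raw = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Wyoming"],
    "Total Population 2021": ["5,039,877", "732,673", "578,803"],
    "Population Growth or Decline 2010 to 2021": ["5.40%", "3.20%", "2.70%"],
})


def to_number(series: pd.Series) -> pd.Series:
    """Drop thousands separators and percent signs, then parse as numbers."""
    cleaned = (
        series.astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("%", "", regex=False)
    )
    # Values that still cannot be parsed become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


df = raw.copy()
for col in df.columns.drop("State"):
    df[col] = to_number(df[col])

print(df.dtypes)        # data type of each column
print(df.isna().sum())  # number of missing values per column
```

With the full dataset, the same two `print` calls give a quick view of which columns still need cleaning before any plotting.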

## **Download the Dataset:**

 >* **Excel File**: [Click to download US State Data](https://docs.google.com/spreadsheets/d/1MpIhrer1o0jO8bzChncxkxbSHfEpPzmk/edit?usp=sharing&ouid=116721275725764079012&rtpof=true&sd=true)

# **Exploratory Data Analysis**
Exploratory Data Analysis serves as the critical first step in analyzing the data from your "US State Data.xlsx" file. Through visualization and basic statistical techniques, EDA aims to uncover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. This phase is essential for gaining insights into the dataset's underlying structure, informing subsequent analysis, and guiding more complex statistical and machine learning applications.<br>
<br>
We will employ various visualization methods, including bar charts, side-by-side bar charts, scatter plots, and dual-axis charts, to depict economic, educational, and demographic characteristics across states.
If you use a different chart type, justify your choice.

### **Visualization of Demographics**

In [None]:
# Use scatter plots to explore population growth trends from 2010 to 2021 and the makeup of household sizes in 2020 across states.
# Compare the total population of each state in 2021 using a horizontal bar chart. X-axis: Population, Y-axis: States.
# Create a line chart to compare the population growth rate from 2010 to 2021 for each state. X-axis: States, Y-axis: Growth Rate.
# What is the percentage change in population every 10 years?

# Visualize your findings
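
The bar chart and line chart described in this cell could be laid out roughly as follows. This is only a sketch: it reuses the three sample rows from the table earlier in the notebook, and the short column name `Growth 2010-2021 (%)` is a placeholder, not the workbook's header.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows from the table earlier in the notebook; the real data has 50 states.
pop = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Wyoming"],
    "Total Population 2021": [5_039_877, 732_673, 578_803],
    "Growth 2010-2021 (%)": [5.4, 3.2, 2.7],  # placeholder column name
})

fig, (ax_pop, ax_growth) = plt.subplots(1, 2, figsize=(11, 4))

# Horizontal bars, sorted ascending so the largest state ends up at the top.
by_size = pop.sort_values("Total Population 2021")
ax_pop.barh(by_size["State"], by_size["Total Population 2021"], color="steelblue")
ax_pop.set_xlabel("Population")
ax_pop.set_ylabel("State")
ax_pop.set_title("Total population, 2021")

# Line chart of growth rates; numeric x positions keep the state order explicit.
positions = list(range(len(pop)))
ax_growth.plot(positions, pop["Growth 2010-2021 (%)"], marker="o")
ax_growth.set_xticks(positions)
ax_growth.set_xticklabels(pop["State"], rotation=45, ha="right")
ax_growth.set_xlabel("State")
ax_growth.set_ylabel("Growth rate (%)")
ax_growth.set_title("Population growth, 2010 to 2021")

fig.tight_layout()
```

A line chart across states implies an ordering that does not really exist between categories, so sorting states by growth rate before plotting usually makes it easier to read.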

### **Education Level Distribution**

In [None]:
# Analyze the per capita income, median household income, and poverty rates across different states to identify economic disparities using box plots and histograms
# Identify the percentage of adults 25+ with a high school diploma or more by state in 2020. Use a bar chart for visualization.
# Also compare that with the percentage of adults 25+ with a bachelor's degree or more by state in 2020. Highlight the top 5 states in dark blue.

# Visualize your findings
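
One way to highlight the top 5 states is to sort the data and build a per-bar color list. The sketch below uses randomly generated placeholder percentages and made-up state names, so the values carry no meaning; only the highlighting pattern is the point.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic placeholder data (not from the workbook): 10 made-up states.
rng = np.random.default_rng(0)
edu = pd.DataFrame({
    "State": [f"State {i:02d}" for i in range(1, 11)],
    "Bachelors or more (%)": rng.uniform(20, 45, size=10).round(1),
})

# Sort descending so the first five rows are the top 5.
edu = edu.sort_values("Bachelors or more (%)", ascending=False).reset_index(drop=True)
colors = ["darkblue" if rank < 5 else "lightgray" for rank in range(len(edu))]

fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(edu["State"], edu["Bachelors or more (%)"], color=colors)
ax.set_ylabel("Adults 25+ with a bachelor's degree or more (%)")
ax.set_title("Top 5 states highlighted in dark blue (synthetic data)")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```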

### **Economic Disparity and Poverty**

In [None]:
# Identify patterns in infrastructure and development that correlate with economic health, demographic makeup, and education levels,
# highlighting states with similar characteristics for targeted development or investment. Use a combination plot (bar and line graph) to visualize this study.
# Measure the economic disparity by plotting per capita personal income against the poverty rate for each state. Use a scatter plot for this analysis
# Identify and visualize the states with a poverty rate higher than the national average but with above-average education levels (high school diploma or more, bachelor's degree or more).

# Visualize your findings
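
The filter for states with above-average poverty and above-average education can be written as a boolean mask. The sketch below uses hypothetical placeholder values and column names, and it assumes the unweighted mean across states is an acceptable stand-in for the national average (a population-weighted figure would differ).

```python
import pandas as pd

# Hypothetical placeholder values, not figures from the workbook.
states = pd.DataFrame({
    "State": ["State A", "State B", "State C", "State D", "State E"],
    "Poverty rate (%)": [18.0, 10.0, 16.0, 9.0, 15.0],
    "High school or more (%)": [90.0, 92.0, 85.0, 84.0, 91.0],
})

# Assumption: the unweighted mean across states stands in for the national average.
avg_poverty = states["Poverty rate (%)"].mean()
avg_hs = states["High school or more (%)"].mean()

# Keep states that are above average on both measures.
flagged = states[
    (states["Poverty rate (%)"] > avg_poverty)
    & (states["High school or more (%)"] > avg_hs)
]
print(flagged["State"].tolist())  # ['State A', 'State E']
```

The same mask can be reused to color the flagged states differently in a scatter plot of poverty rate against education level.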

### **Sector Analysis**

In [None]:
# Plot the percentage of all jobs that are in manufacturing by state in 2021, and highlight the top and bottom 5 states.
# Compare the average wage in the manufacturing sector to the overall average wage per job for each state using a dual-axis chart.

# Replicate the analysis above for the Healthcare (Social Assist), Finance (and Insurance), and Transportation (and Warehousing) sectors.
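
A dual-axis chart can be built with matplotlib's `twinx`. The following is a rough sketch with hypothetical wage figures and column names; the real column headers in the workbook will differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical placeholder wages, not figures from the workbook.
jobs = pd.DataFrame({
    "State": ["State A", "State B", "State C", "State D"],
    "Manufacturing wage": [62_000, 55_000, 71_000, 58_000],
    "Overall wage": [56_000, 52_000, 68_000, 61_000],
})

positions = list(range(len(jobs)))
fig, ax_left = plt.subplots(figsize=(8, 4))

# Left axis: manufacturing wage as bars.
ax_left.bar(positions, jobs["Manufacturing wage"], color="steelblue")
ax_left.set_ylabel("Average manufacturing wage ($)", color="steelblue")
ax_left.set_xticks(positions)
ax_left.set_xticklabels(jobs["State"])

# Right axis: overall average wage as a line, sharing the same x-axis.
ax_right = ax_left.twinx()
ax_right.plot(positions, jobs["Overall wage"], color="darkorange", marker="o")
ax_right.set_ylabel("Average wage per job, all sectors ($)", color="darkorange")

fig.tight_layout()
```

Because both series are in dollars, giving the two axes the same limits (or plotting side-by-side bars on one axis) avoids the visual distortion that independent scales can introduce.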

# **Statistics and Probability**

This category delves into the relationships between different variables within the US State dataset, employing statistical methods to test hypotheses, determine correlations, and perform regression analysis. The objective is to understand the statistical significance and predictive power of various socio-economic indicators, guiding more informed decision-making.

We'll explore how closely related different indicators are, such as education levels and median household income, or employment rates and average wages, and predict outcomes based on these relationships.

Plus, by analyzing how changes in education levels affect income and employing probability distributions, we'll assess the economic health and variability of poverty rates across states.

### **Correlation Analysis**

In [None]:
# Analyze the correlation between covered employment in 2021 and average wage per job by state.
# Investigate the relationship between education level and median household income by state.
# In addition, compare that relationship for 3 different states.

# Determine the correlation between the poverty rate in 2020 and the percentage of adults with at least a high school diploma by state.
# Does higher education correlate with lower poverty?

# Analyze the relationship between per capita personal income in 2021 and the percentage of adults with a bachelor's degree or more.
# Is there a significant correlation between higher education and income?
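
For the correlation questions in this cell, `scipy.stats` returns both a coefficient and a p-value. The sketch below runs on synthetic data that was deliberately generated with a positive relationship, so the result says nothing about the real states; the column names are placeholders.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data with a built-in positive relationship (not from the workbook).
rng = np.random.default_rng(42)
bachelors = rng.uniform(20, 45, size=50)
income = 30_000 + 800 * bachelors + rng.normal(0, 3_000, size=50)
df = pd.DataFrame({
    "Bachelors or more (%)": bachelors,
    "Per capita income": income,
})

# Pearson measures linear association; Spearman is rank-based and is less
# sensitive to outliers and skew.
r, p_value = stats.pearsonr(df["Bachelors or more (%)"], df["Per capita income"])
rho, p_rank = stats.spearmanr(df["Bachelors or more (%)"], df["Per capita income"])

print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
print(f"Spearman rho = {rho:.2f} (p = {p_rank:.3g})")

# The full correlation matrix is handy as input to a heatmap.
print(df.corr())
```

A significant correlation across 50 states describes an association at the state level only; it does not show that education causes higher income.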

### **Hypothesis Testing**

In [None]:
# Conduct hypothesis tests to see whether observed differences in economic indicators such as median income or poverty rates across regions are statistically significant.

# Test the hypothesis that states with a higher percentage of manufacturing jobs (compared to the national average) have a higher average wage per job in 2021.
# (Use a t-test for means comparison)

# Evaluate the hypothesis that states with above-average per capita personal income have lower poverty rates than the national average.
# (Use a chi-square test of independence on the above/below-average categories.)
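
The two tests named in this cell could be set up along these lines. This is a sketch on synthetic data with placeholder column names; it assumes Welch's t-test (unequal variances) with a one-sided alternative, and it assumes "average" means the unweighted mean across states. The `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic placeholder data (not from the workbook).
rng = np.random.default_rng(7)
n = 50
df = pd.DataFrame({
    "Manufacturing jobs (%)": rng.uniform(3, 18, n),
    "Average wage": rng.normal(58_000, 6_000, n),
    "Per capita income": rng.normal(60_000, 8_000, n),
})
df["Poverty rate (%)"] = (
    12 - (df["Per capita income"] - 60_000) / 4_000 + rng.normal(0, 1.5, n)
)

# One-sided Welch t-test: do high-manufacturing states have higher mean wages?
high_mfg = df["Manufacturing jobs (%)"] > df["Manufacturing jobs (%)"].mean()
t_stat, p_t = stats.ttest_ind(
    df.loc[high_mfg, "Average wage"],
    df.loc[~high_mfg, "Average wage"],
    equal_var=False,
    alternative="greater",
)

# Chi-square test of independence on two above/below-average indicators.
high_income = df["Per capita income"] > df["Per capita income"].mean()
high_poverty = df["Poverty rate (%)"] > df["Poverty rate (%)"].mean()
table = pd.crosstab(high_income, high_poverty)
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"t = {t_stat:.2f}, one-sided p = {p_t:.3f}")
print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.3g}")
```

With only 50 states, some expected cell counts in the 2x2 table can be small; checking `expected` before trusting the chi-square p-value is a reasonable precaution.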

### **Regression Analysis**

In [None]:
# Perform a regression analysis to understand the impact of increasing the percentage of adults with a bachelor's degree on per capita personal income by state.

# Create a multiple regression model to forecast per capita personal income in 2021.
# (Use median household income in 2020, the percentage of manufacturing jobs, and education levels as predictors.)
# (For education level, you may use the total number of people who have attended University)

# Use regression analysis to examine the impact of education (both high school and bachelor's degree levels) on the average wage per job in 2021.
# (Incorporate interaction terms to explore whether the effect of having a bachelor's degree on income
# differs across levels of high school completion across states.)
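
An interaction term is simply the product of two predictors added as an extra column. The sketch below uses scikit-learn's `LinearRegression` on synthetic data; the variable names (`hs_pct`, `ba_pct`, `avg_wage`) are hypothetical, and centering the predictors before multiplying is a choice made here to keep the main-effect coefficients interpretable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data (not from the workbook).
rng = np.random.default_rng(3)
n = 50
df = pd.DataFrame({
    "hs_pct": rng.uniform(82, 95, n),   # % with high school diploma or more
    "ba_pct": rng.uniform(20, 45, n),   # % with bachelor's degree or more
})
df["avg_wage"] = (
    20_000 + 300 * df["hs_pct"] + 400 * df["ba_pct"] + rng.normal(0, 2_000, n)
)

# Center each predictor, then multiply to form the interaction term.
hs_c = df["hs_pct"] - df["hs_pct"].mean()
ba_c = df["ba_pct"] - df["ba_pct"].mean()
X = pd.DataFrame({"hs_c": hs_c, "ba_c": ba_c, "hs_x_ba": hs_c * ba_c})

model = LinearRegression().fit(X, df["avg_wage"])
coefs = pd.Series(model.coef_, index=X.columns)
r2 = model.score(X, df["avg_wage"])

print(coefs)
print(f"Intercept = {model.intercept_:.0f}, in-sample R^2 = {r2:.2f}")
```

A `hs_x_ba` coefficient that is clearly different from zero would suggest that the wage association of bachelor's attainment changes with the level of high school completion. scikit-learn does not report standard errors, so a package that does would be needed to judge significance.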

# **Machine Learning**
In this segment, apply machine learning algorithms to predict future trends, classify states into meaningful clusters, and forecast economic indicators. This approach allows us to leverage complex computational models to identify patterns and relationships that may not be immediately apparent through traditional statistical methods.

Techniques like linear regression, ensemble methods, and neural networks will be used to forecast median household incomes, employment sector changes, and more, based on current and historical socio-economic data.

Use clustering techniques, such as K-means or hierarchical clustering, to group states by similarities in socio-economic indicators. This analysis can reveal patterns that inform targeted interventions and policy development.

## <ins>**Clustering: Infrastructure and Development Opportunities**</ins><br>


### **Data Preparation**

In [None]:
# Standardize data for total earnings, Personal Contributions for Government Social Insurance, Net Earnings by Place of Residence, and Single Family Permits.
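
Standardizing puts every feature on a mean-0, standard-deviation-1 scale so that no single large-valued column dominates the distance calculations. A minimal sketch with `StandardScaler` follows; the column names mirror the wording of the task and may not match the workbook headers exactly, and the values are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Column names follow the task wording; the workbook headers may differ.
features = [
    "Total Earnings",
    "Personal Contributions for Government Social Insurance",
    "Net Earnings by Place of Residence",
    "Single Family Permits",
]

# Synthetic, right-skewed placeholder values (not from the workbook).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.lognormal(mean=10, sigma=1, size=(50, 4)), columns=features)

# Fit the scaler and keep the result as a DataFrame with the same labels.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[features]), columns=features, index=df.index
)

print(scaled.mean().round(6))       # each column is approximately 0
print(scaled.std(ddof=0).round(6))  # each column is approximately 1
```

`StandardScaler` divides by the population standard deviation, which is why `ddof=0` is used in the check.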

### **Hierarchical Clustering**

In [None]:
# Perform hierarchical clustering on the standardized features using a method such as Ward's method, which minimizes the variance within each cluster.
# Employ hierarchical clustering to group states based on socio-economic indicators like income levels, poverty rates, and education levels.
# (Use this analysis to suggest targeted policy interventions.)
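
Ward linkage is available through `scipy.cluster.hierarchy.linkage`, and `fcluster` turns the resulting tree into flat cluster labels. The sketch below runs on two synthetic groups of 25 points each; the choice of two clusters is an assumption made for the example, not a recommendation for the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 50 states x 3 indicators: two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 1, size=(25, 3)),
    rng.normal(5, 1, size=(25, 3)),
])

# Standardize first so every indicator contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Ward's method merges, at each step, the pair of clusters whose union gives
# the smallest increase in within-cluster variance.
Z = linkage(X_scaled, method="ward")

# Cut the tree into at most 2 flat clusters (labels start at 1).
labels = fcluster(Z, t=2, criterion="maxclust")

print(Z.shape)                 # (49, 4): one row per merge
print(np.bincount(labels)[1:]) # number of points in each cluster
```

`AgglomerativeClustering(n_clusters=2, linkage="ward")` from scikit-learn would be an alternative way to get labels, but the SciPy linkage matrix is what the dendrogram function expects.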

### **Dendrogram Visualization**

In [None]:
# Use a dendrogram to visualize the hierarchical cluster formation.
# Analyze the tree structure to understand the groupings
# Determine the cut-off for the number of clusters.

# Note: Ensure the categorization is done for all states in the dataset

Cluster all states into a dendrogram ending with 50 leaves, one per state (example from online.visual-paradigm.com): ![image.png](https://online.visual-paradigm.com/repository/images/0fe81efd-c6f6-41af-98d5-d9b1f0d33f2f.png)
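
A dendrogram with one leaf per state can be drawn directly from a linkage matrix. The sketch below uses random numbers as a stand-in for the standardized features and made-up state labels. The dashed line at 70% of the maximum merge height is only one possible cut-off; it matches the default color threshold that SciPy's `dendrogram` uses.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Random stand-in for 50 states x 4 standardized features (not real data).
rng = np.random.default_rng(1)
X_scaled = rng.normal(size=(50, 4))
state_labels = [f"State {i:02d}" for i in range(1, 51)]  # made-up names

Z = linkage(X_scaled, method="ward")

fig, ax = plt.subplots(figsize=(14, 5))
tree = dendrogram(Z, labels=state_labels, leaf_rotation=90, ax=ax)
ax.set_ylabel("Ward linkage distance")

# Illustrative cut-off: branches merged below this height share a color.
cut_height = 0.7 * Z[:, 2].max()
ax.axhline(cut_height, color="gray", linestyle="--")

fig.tight_layout()
print(len(tree["ivl"]))  # number of leaves in the plot
```

Long vertical gaps between successive merges are the usual visual cue for where to cut: a horizontal line through the largest gap gives the number of clusters it crosses.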


### **Cluster Interpretation:**

In [None]:
# Analyze the characteristics of each cluster based on the original features
# Understand the commonalities within each group and how they differ from others.
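
Once each state has a cluster label, a `groupby` on the original (unscaled) features gives a readable profile per cluster. The sketch below uses hypothetical values, column names, and cluster assignments.

```python
import pandas as pd

# Hypothetical values and cluster labels, not results from the workbook.
clustered = pd.DataFrame({
    "State": ["State A", "State B", "State C", "State D", "State E", "State F"],
    "Median household income": [52_000, 54_000, 70_000, 74_000, 61_000, 63_000],
    "Poverty rate (%)": [17.0, 16.0, 9.0, 10.0, 12.0, 13.0],
    "cluster": [1, 1, 2, 2, 3, 3],
})
features = ["Median household income", "Poverty rate (%)"]

# Average of each original feature within each cluster.
profile = clustered.groupby("cluster")[features].mean()

# Difference from the overall mean shows what makes a cluster distinctive.
relative = profile - clustered[features].mean()

# Cluster sizes help spot groups that are too small to interpret.
sizes = clustered["cluster"].value_counts().sort_index()

print(profile)
print(relative.round(1))
print(sizes)
```

Profiling on the original units rather than the standardized ones keeps the summary interpretable, for example as dollars and percentage points.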

# **Dashboard**

In [None]:
# Map out the geographical patterns in population growth, economic indicators, and educational levels using maps and spatial data at the US state level.

# Import necessary libraries such as geopandas for spatial data handling and matplotlib or plotly for visualization.
import geopandas as gpd
import matplotlib.pyplot as plt
# import plotly.express as px  # Optional, for interactive plots

# Validate and interpret your findings. Look for patterns or outliers in the geographical distribution of your data and consider what these might indicate.