# Project Final Report 

**Title:** Time series analysis of state- and county-level unemployment rates in the USA 

**Student Id:** den903

**Course:** CS6463

**Index:**  
[Research Objective](#Research-Objective)  
[Setup](#Setup)  
[Data Preparation](#Data-Preparation)   
[Data Exploration](#Data-Exploration)  
[Modeling](#Modeling)  
[Presentation Graphic(s)](#Presentation-Graphic(s))


## Research Objective

The objective of this project is to perform time series analysis of state- and county-level unemployment data in the United States, apply different data science techniques to explore and visualize the data, and build a model to predict future unemployment.

## Data Description

The data used for this study was collected from two different sources. 
- **'unemployment-by-county-us.csv'**- This data set contains the Local Area Unemployment Statistics from January 1990 to December 2016, broken down by state, county, year and month. The raw, unformatted data is available on the United States Bureau of Labor Statistics website. The scraped version of the data in CSV format used for this analysis was obtained from Kaggle. This time series dataset has nearly 900K observations, with roughly 3K observations for each month of each year from 1990 to 2016. It shows the trend in the unemployment rate in 47 states in the US over 27 years. The data is clean, without any null values or duplicates; the states Alaska, Georgia, and Florida are omitted from the study.
The basic information about the dataset attributes is as follows:

| Column name | Data type |
| :-----------------: | :------------: |
|  Year  |  int64   | 
|  Month  |  String   |
|  State  |  String   |
|  County  |  String   | 
|  Rate  |  float64   |


- **'Unemployment.csv'**- The second data set was obtained from the official USDA website. It is broken down by state, county, and year for the period from 2007 to 2017. I extracted the 'Civilian Labour Force' for each county and state, since the first dataset only has unemployment rate information. This time series dataset has nearly 300K observations.

- **'States.csv' and 'County.csv'**- These datasets contain state information (state name abbreviations) and county information (Federal Information Processing Standards (FIPS) codes), which are used for merging the data sets and for visualization.

After cross-referencing the available data in each file, I merged them to generate the data set for this project. The actual unemployed count for each county is calculated from the civilian labour force and the unemployment rate. The final data set has nearly 300K observations with the attributes below:

| Column name | Data type |
| :-----------------: | :------------: |
|  Year  |  int64   | 
|  Month  |  String   |
|  State_code  |  String   |
|  State  |  String   | 
|  County  |  String   |
|  FIPS  |  String   |
|  Civilian_labor_force  |  float64   | 
|  Unemployed  |  float64   |
|  Unemployment_Rate  |  float64   |



## Data Exploration

The final dataset is generated by merging the data files mentioned above. It contains unemployment data for 47 states and 1685 counties over a decade (2007-16), for a total of 316875 records. It also covers the Great Recession period.

The statistical summary is given below:

|| Year | Civilian_labor_force | Unemployment_Rate | Unemployed |
| :-----------------: | :------------: | :------------: | :------------: | :------------: |
|  count  |  316875.0   | 316875.0 | 316875.0 | 316875.0 |
|  mean  |  2012.0   | 47929.0 | 7.0  | 3307.0 |
|  std  |  3.0   | 158294.0 | 3.0  | 12623.0 |
|  min  |  2007.0   | 41.0 | 1.0  | 3.0 |
|  25%  |  2009.0   | 5074.0 | 5.0  | 317.0 |
|  50%  |  2012.0   | 11748.0 | 6.0  | 823.0 |
|  75%  |  2014.0   | 30931.0 | 9.0  | 2168.0 |
|  max  |  2016.0   | 5041430.0 | 32.0  | 649094.0 |


- The Unemployment_Rate varies from 1.0 to 32.0.
- The Civilian_labor_force varies from 41.0 to 5041430.0.

By analyzing the data, I verified the following:

- The unemployment rate was highest in a few counties in California, Alabama and Michigan during the Great Recession period.
- The minimum Civilian_labor_force (<50) is in Loving County, Texas, where the Unemployment_Rate is 7.9.
- The maximum Civilian_labor_force (5041430) is in Los Angeles County, California, where the Unemployment_Rate is around 5.

From these, it is clear that the unemployment rate and the civilian labour force vary drastically at the county level.

Challenges Faced: 

Below are a few of the challenges faced while preparing the data to create the final data set: 

- The original dataset contains only the unemployment rate. We cannot aggregate the unemployment rate state-wise directly, since this value depends on the civilian labour force.

    Solution: Merged the dataset with another dataset containing the civilian labour force for each county. That data is yearly, but since the month-to-month change in county population is trivial, the same value was used for the entire year. The actual unemployed count is then calculated for each county as (Civilian_labor_force * Unemployment_Rate) / 100.
    
- Datasets can't be merged based on County names, as they are not unique (multiple states may have counties with same name). 

    Solution: Collected the FIPS code of each county and merged the datasets based on it. 
    
- Handling missing data: County-wise Civilian_labor_force or Unemployment_Rate values were missing for a few states.

    Solution: Interpolated (method=time) the data for the missing periods at the state level before modeling. 
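
The merge and the unemployed-count calculation described above can be sketched as follows. This is a minimal illustration with pandas, not the code used for the report: the two small frames, their values and the variable names are made up for the example, and only the column names follow the tables in this report.

```python
import pandas as pd

# Hypothetical miniature inputs; the real files are far larger.
monthly_rates = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2009],
    "Month": ["January", "February", "January", "February"],
    "FIPS": ["06037", "06037", "48301", "48301"],  # kept as strings to preserve leading zeros
    "Unemployment_Rate": [10.0, 11.0, 8.0, 6.0],   # illustrative values only
})
yearly_labor_force = pd.DataFrame({
    "Year": [2009, 2009],
    "FIPS": ["06037", "48301"],
    "Civilian_labor_force": [5_000_000.0, 50.0],   # illustrative values only
})

# Merge on (Year, FIPS) rather than county name, which is not unique across states.
# The yearly labour force is repeated for every month of that year.
merged = monthly_rates.merge(
    yearly_labor_force,
    on=["Year", "FIPS"],
    how="inner",
    validate="many_to_one",  # fail loudly if a (Year, FIPS) pair is duplicated
)

# Unemployed count = labour force * rate / 100
merged["Unemployed"] = merged["Civilian_labor_force"] * merged["Unemployment_Rate"] / 100
```

Reading FIPS codes as strings (for example with `dtype={"FIPS": str}` in `pd.read_csv`) avoids losing leading zeros, which would otherwise break the join for states whose codes start with 0.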

### Overall year-wise data distribution 

Plotted box plots to represent the yearly unemployment rate and to check for outliers. Overall, the outliers are negligible. The mean unemployment rate is highest in the 2009-10 time frame, which matches the Great Recession period.
![download%20%283%29.png](attachment:download%20%283%29.png)

    
### County-wise Unemployment rates during recession periods

Since the dataset covers county-wise unemployment rates during a recession period, plotted choropleth maps to represent them visually. 

Choropleth maps have been plotted for three different time periods to visualize the impact of the recession: before, during and after the recession. Darker counties represent areas more heavily affected by unemployment. In the second figure we can see more counties with a higher unemployment rate, which clearly shows the impact of the 2009 recession. In the third figure, it is visible that counties have almost recovered or are recovering from the recession. We can also see that the eastern and western coasts, where technology and business are major job sectors, were highly affected by the Great Recession. 

County wise unemployment rate during 2007 to 2009:
![newplot.png](attachment:newplot.png)

County wise unemployment rate during 2010 to 2012:
![newplot%20%281%29.png](attachment:newplot%20%281%29.png)

County wise unemployment rate during 2013 to 2015:
![newplot%20%282%29.png](attachment:newplot%20%282%29.png)

To analyze the state-level unemployment rate, I aggregated the data by state to find the civilian labour force and the unemployed count. Using these values, I calculated the unemployment rate. Plotted a histogram to visualize this:
![download%20%286%29.png](attachment:download%20%286%29.png)

Here we can see that the unemployment rate deviates less than at the county level. The mean is around 6 and the values vary between 1 and 15 with fewer outliers, whereas at the county level the values vary from 1.0 to 32.
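
The state-level aggregation can be sketched as below. It is an illustrative example on made-up numbers (the frame and its values are assumptions, not project data) showing why the counts are summed first and the rate recomputed, instead of averaging the county rates.

```python
import pandas as pd

# Hypothetical county rows for a single month; values are illustrative only.
county = pd.DataFrame({
    "State": ["CA", "CA", "ND", "ND"],
    "Civilian_labor_force": [1000.0, 9000.0, 200.0, 800.0],
    "Unemployed": [200.0, 450.0, 10.0, 20.0],
    "Unemployment_Rate": [20.0, 5.0, 5.0, 2.5],
})

# Sum the counts per state, then recompute the rate from the totals.
state = county.groupby("State")[["Civilian_labor_force", "Unemployed"]].sum()
state["Unemployment_Rate"] = 100 * state["Unemployed"] / state["Civilian_labor_force"]

# For comparison: a plain mean of county rates ignores county size.
naive_mean = county.groupby("State")["Unemployment_Rate"].mean()
```

In this toy data the small county with a 20% rate pulls the plain mean for CA up to 12.5, while the labour-force-weighted rate is 6.5. With the full dataset the grouping keys would presumably also include Year and Month.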

Plotted the yearly average unemployment rate along with a min-max band for the period 2007-16. The rate values vary between 2 and 15.
![download%20%284%29.png](attachment:download%20%284%29.png)


The bubble plot below is a geographical representation of the average unemployment rate by state from 2007 to 2016. The size of each bubble indicates the average unemployment rate for quick comparison. The plot shows comparatively higher unemployment rates in the East Coast states.

![newplot%20%284%29.png](attachment:newplot%20%284%29.png)


To understand the correlation between civilian labour force and unemployment rate, plotted a graph that shows the variation of both over the same time period. From this we can see that unemployment rate and civilian labour force are directly related in major states like CA, NY, MA, etc. Data were missing for a few states. 

![download%20%281%29.png](attachment:download%20%281%29.png)




To analyze and visualize the state-wise unemployment rate further, I chose five states (California, Michigan, New York, Washington and North Dakota) based on geographical location, civilian labour force and unemployment rate. Rather than using the traditional visualization methods, I used an area plot and a violin plot to represent the data.

a) Area Plot: 

The area chart shows the change in unemployment rate over the same period for the five chosen states. Since the data is stacked, one state on top of another, it is much easier to compare how the data evolved over the time period. 

Among these 5 states, CA has one of the highest unemployment rates, ND the lowest, and NY an average rate. 


![download%20%282%29.png](attachment:download%20%282%29.png)

b) Violin Plot: 

Each violin plot represents the probability density of the data in a given year. By comparing the plots across years for a particular state, we can understand the variation in the underlying distribution, which a box plot does not show. It also gives an idea about the outliers. 

By analyzing the plot for North Dakota, we can see that the mean is almost constant across the years and the distribution is almost stable.


![download%20%285%29.png](attachment:download%20%285%29.png)

## Modeling

Better predictions can be made with the least deviating data, so the better option is to model state-wise.

I used the Autoregressive Integrated Moving Average (ARIMA) method for time-series forecasting. Here we are effectively doing regression on the same feature (unemployment) at different times. To do this, we first check whether there is any correlation between the current and previous values. If so, we remove the trend and seasonality components from the data through seasonal decomposition. After this, we fit the model and check its accuracy. The model can then be used for predicting future values. 

For modeling, we choose the most stable data, since better predictions can be made with the least deviating dataset. From the violin plot, we have already seen that North Dakota has the least deviation among the selected states. So I chose North Dakota to represent a state with a small civilian labour force.

Below are the steps followed: 

- Preparing data for modeling:
The Unemployed values for ND are indexed with a DatetimeIndex. The data doesn't have any missing values. There are 120 records altogether (12 months, 10 years).

- Checking autocorrelation:
Checked the correlation between y(t) and y(t-1) with the help of a lag plot. It shows a clear relationship between the current value and the previous value.

![download%20%281%29.png](attachment:download%20%281%29.png)

Also plotted an autocorrelation plot in order to determine the lag parameter for modeling the data.

![download%20%283%29.png](attachment:download%20%283%29.png)

The lag value can be obtained from this graph. 
The autocorrelation is above the confidence band for lag values below 20. I tried different values (<20) to find the best value for this parameter. 
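
The numbers behind such an autocorrelation plot can be approximated as in the sketch below. It runs on a synthetic monthly series (an assumption, since the ND data is not reproduced here), uses pandas `Series.autocorr`, and compares each lag against the usual large-sample 95% band of 1.96 / sqrt(n). The estimator used by a plotting helper may differ slightly from this one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for 120 monthly "Unemployed" values (trend + yearly cycle + noise).
idx = pd.date_range("2007-01-01", periods=120, freq="MS")
t = np.arange(120)
rng = np.random.default_rng(0)
y = pd.Series(
    1000 + 5 * t + 100 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 20, 120),
    index=idx,
)

# Correlation between y(t) and y(t-k) for k = 1..24.
lags = range(1, 25)
acf = pd.Series([y.autocorr(lag=k) for k in lags], index=lags)

# Approximate 95% confidence band for "no autocorrelation".
bound = 1.96 / np.sqrt(len(y))
significant_lags = acf[acf.abs() > bound].index.tolist()
```

Lags listed in `significant_lags` are candidates for the lag parameter. The final choice in this report was made by trying several values and comparing the error.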

- Seasonal decomposition:
The trend and seasonal components need to be removed from the time series data. This is done using the seasonal_decompose() function. The model is set to 'multiplicative' since the data is not linear.

![download%20%2810%29.png](attachment:download%20%2810%29.png)

- Creating the model:
The data is split into train and test sets and scaled using the StandardScaler() function. The model is created using ARIMA on the train data, with the lag value obtained from the autocorrelation plot. The model is fitted and used to predict values for the test data. The mean squared error (MSE) is calculated between y_test and y_pred. Different values of the lag parameter were tried and 9 was chosen as the best value. The MSE is 0.049. The data is plotted with the predictions below:

![image.png](attachment:image.png)



I also chose California to represent a state with a high Civilian_labor_force.
A few records were missing for this state: only 101 values were present out of the 120 expected records (12 months, 10 years). The missing values were filled using interpolation with method='time'.
The same procedure as above was followed. The model predicted the test values with an MSE of 0.014.

![image.png](attachment:image.png)
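
Filling missing months with time-based interpolation can be sketched as follows. The values are made up. The idea is to reindex the series to a complete monthly index first, so that the missing months appear as NaN, and then call `interpolate(method="time")`, which weights by the actual time gap between observations.

```python
import pandas as pd

# Hypothetical series with February and March missing.
observed = pd.Series(
    [100.0, 130.0, 140.0],
    index=pd.to_datetime(["2007-01-01", "2007-04-01", "2007-05-01"]),
)

# Build the full monthly index, exposing the gaps as NaN.
full_index = pd.date_range("2007-01-01", "2007-05-01", freq="MS")
with_gaps = observed.reindex(full_index)

# Linear in elapsed time, so a 31-day step and a 28-day step get different increments.
filled = with_gaps.interpolate(method="time")
```

`method="time"` requires a DatetimeIndex, which matches the indexing used for the modeling data.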

## Summary

For this project I analyzed state- and county-level unemployment data for the US for a decade (2007-16), performed data visualization and created a model to predict the unemployment rate for a state using the time series technique ARIMA.

The county-wise unemployment rate in the US during this decade ranges between 1 and 32, which indicates drastic variation in county-level unemployment. The states most affected by unemployment are California, Michigan, and Alabama, and the least affected are states like North Dakota. The visualizations show how the counties and states were affected during the Great Recession period. A time series prediction model using ARIMA was developed and trained on the dataset to predict future unemployment rates. 

I also calculated the unemployment rate for the US for this decade, compared it against the unemployment rates of the states, and plotted the 5 best-performing and 5 worst-performing states. This helps to visualize how the top-5 and bottom-5 state unemployment rates relate to the national unemployment rate. 

In practice, there are several other models we could explore, and several other factors or features we could include to model the unemployment rate and make better forecasts, but that is out of the scope of this analysis.


![download%20%289%29.png](attachment:download%20%289%29.png)

## References

1. Bls.gov. (2019). Local Area Unemployment Statistics Home Page. [online] Available at: https://www.bls.gov/lau/ [Accessed 22 Sep. 2019].

2. Kaggle.com. (2019). US Unemployment Rate by County, 1990-2016. [online] Available at: https://www.kaggle.com/jayrav13/unemployment-by-county-us [Accessed 22 Sep. 2019].

3. Towards Data Science. (2019). Modeling Youth Unemployment. [online] Available at: https://towardsdatascience.com/https-medium-com-vikramdevatha-modeling-youth-unemployment-d0f7cbcd078a [Accessed 22 Sep. 2019].

4. Belen Villena Maria, C. A. (2013/14). Statistical Analysis of Unemployment in Europe. Technische Hochschule Nürnberg Georg Simon Ohm, Nürnberg.


## Presentation Credit
Do not put anything below this cell.