# Unemployment is measured by the unemployment rate, which is the number of unemployed people expressed as a percentage of the total labour force. The unemployment rate rose sharply during Covid-19, which makes it a good subject for a data science project. This project walks through the task of unemployment analysis with Python.
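
As a quick illustration of the definition above, the rate can be computed directly from two counts. The function name `unemployment_rate` and the figures used here are made up for illustration and are not taken from the dataset.

```python
def unemployment_rate(unemployed: float, labour_force: float) -> float:
    """Return unemployed people as a percentage of the labour force."""
    if labour_force <= 0:
        raise ValueError("labour_force must be positive")
    return unemployed / labour_force * 100


# Hypothetical figures: 5 unemployed people in a labour force of 100
example_rate = unemployment_rate(5, 100)
print(example_rate)
```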

In [None]:
# Start the unemployment analysis by importing the necessary Python libraries:

# data processing
import numpy as np 
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import warnings 
warnings.filterwarnings('ignore')

In [None]:
#Loading dataset

In [None]:
df = pd.read_csv("Unemployment_Rate_upto_11_2020[1].csv")
df.head()

In [None]:
df.tail()

In [None]:
df.shape

# The unemployment dataset contains 267 instances and 9 variables.

In [None]:
df.rename(columns={'Region.1': 'Area'}, inplace=True)
# Two columns had similar names, so Region.1 was renamed to Area, which is more concise and clear.

#Checking for missing values

In [None]:
#check null values
df.isnull().sum()

In [None]:
df.isna().sum()  # count of NA values in each column (isnull is an alias of isna)

In [None]:
df= df.dropna() # Drop rows with missing values

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

In [None]:
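# Fill any remaining missing numeric values with the column mean.
# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas.
# After the dropna() call above, this step has nothing left to fill.
df = df.fillna(df.mean(numeric_only=True))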

In [None]:
df = df.drop_duplicates()  # Remove duplicate entries

In [None]:
df.shape
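
The cleaning steps above can be tried on a small frame to see what each call returns. This is a minimal sketch on hypothetical toy data (the names `toy`, `cleaned` and `deduplicated` are illustrative), not the unemployment dataset itself. It also shows why order matters: once `dropna()` has run, a later `fillna()` has nothing left to fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the unemployment dataset
toy = pd.DataFrame({
    "Region": ["A", "B", "B", None],
    "Rate": [5.0, 7.5, 7.5, np.nan],
})

missing_per_column = toy.isna().sum()      # count of missing values per column
cleaned = toy.dropna()                     # drop rows containing any missing value
n_duplicates = cleaned.duplicated().sum()  # rows identical to an earlier row
deduplicated = cleaned.drop_duplicates()   # keep the first of each identical row

print(missing_per_column)
print(n_duplicates, deduplicated.shape)
```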

In [None]:
# Summary of the dataframe

In [None]:
df.info()

In [None]:
df.columns = df.columns.str.strip()
df.info()

### We need to ensure that the column names are consistent and free of any unwanted spaces, which can help prevent issues when referencing columns during data analysis.
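
A short sketch of the problem that stray spaces cause, using a hypothetical one-row frame whose column names carry leading spaces (the names and values are illustrative only):

```python
import pandas as pd

# Hypothetical frame whose column names carry stray leading spaces
toy = pd.DataFrame({" Date": ["31-01-2020"], " Frequency": ["M"], "Region": ["A"]})

try:
    toy["Date"]
    lookup_failed = False
except KeyError:
    lookup_failed = True  # "Date" is not found because the real name is " Date"

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()
stripped_names = list(toy.columns)

print(lookup_failed, stripped_names)
```

After stripping, `toy["Date"]` resolves as expected.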

#### Converting data types

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df.dtypes

### The 'Date' column was stored with the object dtype, so we converted it to the datetime format.
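
When the layout of the date strings is known, passing it explicitly avoids relying on pandas to guess between day-first and month-first. The sketch below assumes day-month-year strings with a leading space. That layout is an assumption for illustration, since the raw values are not shown here.

```python
import pandas as pd

# Hypothetical day-month-year strings with a leading space
raw = pd.Series([" 31-01-2020", " 29-02-2020", " 31-03-2020"])

# Strip whitespace first, then parse with an explicit format
dates = pd.to_datetime(raw.str.strip(), format="%d-%m-%Y")

print(dates.dtype)
print(dates.dt.month.tolist())
```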

### Summary Statistics

In [None]:
# selecting the categorical variables
categorical_var = df.select_dtypes(include='object')
# Obtaining summary statistics for the categorical variables
categorical_stat = categorical_var.describe().T
categorical_stat


In [None]:
# selecting numerical variables
numerical_var = df.select_dtypes(exclude='object')
# Obtaining summary statistics for the numerical variables
numerical_stat = numerical_var.describe().T
numerical_stat

### The mean estimated unemployment rate is high. The high unemployment rate could be attributed to the economic disruptions caused by the pandemic, such as lockdowns, reduced economic activity, and job losses in various sectors.

### Dropping irrelevant column

In [None]:
df = df.drop('Frequency', axis=1)
df.head()

# Outlier detection


In [None]:
colors = ['lightblue', 'lightgreen', 'lightcoral']

# Create a figure with three subplots
plt.figure(figsize=(12, 6))

# Subplot 1: Unemployment Rate
plt.subplot(131)
df.boxplot(column='Estimated Unemployment Rate (%)', patch_artist=True)
plt.gca().get_children()[0].set_facecolor(colors[0])  # Set the color of the first box
plt.title('Unemployment Rate')

# Subplot 2: Employed
plt.subplot(132)
df.boxplot(column='Estimated Employed', patch_artist=True)
plt.gca().get_children()[0].set_facecolor(colors[1])  # Set the color of the second box
plt.title('Employed')

# Subplot 3: Labour Participation Rate
plt.subplot(133)
df.boxplot(column='Estimated Labour Participation Rate (%)', patch_artist=True)
plt.gca().get_children()[0].set_facecolor(colors[2])  # Set the color of the third box
plt.title('Labour Participation Rate')

plt.tight_layout()
plt.show()

### Outliers are present. Since we are analyzing unemployment during Covid-19, the extreme values reflect genuine structural changes or events rather than errors, so we do not remove them.
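
The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 × IQR rule, so the same rule can be used to list the outliers numerically. This is a minimal sketch. The helper `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Hypothetical rates with one extreme value
rates = pd.Series([4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 40.0])
outliers = iqr_outliers(rates)
print(outliers.tolist())
```

On the real data the same helper could be applied to a column such as `df['Estimated Unemployment Rate (%)']` to count how many observations fall outside the whiskers.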

### Correlation plot

In [None]:
# Calculate the correlation matrix on the numeric columns only.
# Text columns such as Region and Area cannot be correlated and raise an error in recent pandas.
correlation_matrix = df.corr(numeric_only=True)

# Create a figure and set its size
plt.figure(figsize=(10, 8))

# Create a heatmap of the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

# Set the title
plt.title('Correlation Plot')

# Display the plot
plt.show()


### No strong positive or negative correlation exists among the variables.
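
For reference, this is what strong correlations look like on a hypothetical toy frame: a coefficient near +1 or -1 signals a strong linear relationship, while values near 0 signal a weak one. The column names are illustrative, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

toy = pd.DataFrame({
    "Region": ["A", "B", "C", "D"],   # text column, excluded from the correlation
    "x": [1.0, 2.0, 3.0, 4.0],
    "y_up": [2.0, 4.0, 6.0, 8.0],     # rises exactly with x
    "y_down": [8.0, 6.0, 4.0, 2.0],   # falls exactly as x rises
})

# Pearson correlation over the numeric columns only
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```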

# Unemployment rate in India during Covid-19

In [None]:
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Date', y ='Estimated Unemployment Rate (%)')
plt.xticks(rotation=45)
plt.show()

### April, May and June witnessed a high unemployment rate, which can be associated with the lockdowns that led to reduced economic activity and job losses in various sectors.
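
By default, `sns.lineplot` draws the mean of all rows that share an x value, so the line above is the average across regions at each date. The same averages can be computed as numbers to confirm which month peaks. This sketch uses hypothetical records, not the dataset.

```python
import pandas as pd

# Hypothetical records: two regions observed at three month-ends
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-31", "2020-03-31", "2020-04-30",
                            "2020-04-30", "2020-05-31", "2020-05-31"]),
    "Rate": [6.0, 8.0, 20.0, 24.0, 18.0, 22.0],
})

# Average rate per calendar month, then the month with the highest average
monthly_mean = toy.groupby(toy["Date"].dt.month)["Rate"].mean()
peak_month = monthly_mean.idxmax()

print(monthly_mean)
print(peak_month)
```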

In [None]:
#pairplot

In [None]:
sns.pairplot(df, diag_kind="kde")

# Unemployment rate in each state

In [None]:
import plotly.express as px
plot_unemp = df[['Estimated Unemployment Rate (%)','Region']]
df_unemployed = plot_unemp.groupby('Region').mean().reset_index()

df_unemployed = df_unemployed.sort_values('Estimated Unemployment Rate (%)')

fig = px.bar(
    df_unemployed,
    x='Region',
    y='Estimated Unemployment Rate (%)',
    color='Region',
    title='Average unemployment rate in each state',
    template='seaborn',
)
fig.show()

In [None]:
#highest unemployment rate in

In [None]:
# Visualizing the distribution of unemployment rates across areas

In [None]:
fig = px.violin(
    df,
    x='Area',
    y='Estimated Unemployment Rate (%)',
    title='Distribution of Unemployment Rates by Areas',
    box=True,  # Include box plot inside the violin
    points='all',  # Show individual data points
)

fig.show()

#### Taller violins indicate more variability in unemployment rates within an area, while the inner box shows the typical level. On this plot, the southern and eastern parts of India experienced higher unemployment.
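
Spread and level are separate properties, so it can help to check both numerically alongside the violin plot. In the hypothetical toy data below, the area with the larger spread is not the one with the higher average. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical rates for two areas
toy = pd.DataFrame({
    "Area": ["North", "North", "North", "South", "South", "South"],
    "Rate": [10.0, 11.0, 12.0, 2.0, 10.0, 18.0],
})

# Level (mean, median) and variability (std) per area
summary = toy.groupby("Area")["Rate"].agg(["mean", "median", "std"])
print(summary)
```

On the real data, the equivalent call would group `df` by `'Area'` and aggregate `'Estimated Unemployment Rate (%)'`.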

# Composition of Labour Participation Rates by Region Over time

In [None]:
fig = px.area(
    df,
    x='Date',
    y='Estimated Labour Participation Rate (%)',
    color='Region',
    labels={'Estimated Labour Participation Rate (%)': 'Labour Participation Rate (%)'},
    category_orders={'Region': df['Region'].unique()}  # Preserve the order of regions
)

fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Labour Participation Rate (%)',
    legend_title='Region',
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
)

fig.show()

# During April, labour participation declined all over India.

#### Extracting month from date

In [None]:
df['Month'] = df['Date'].dt.month
df

In [None]:
rate_col = 'Estimated Unemployment Rate (%)'

# Filter data for months 1 to 3 (before lockdown)
before_lock = df[(df['Month'] >= 1) & (df['Month'] <= 3)][['Region', rate_col]]

# Filter data for months 3 to 5 (after lockdown); note that month 3 falls in both windows
after_lock = df[(df['Month'] >= 3) & (df['Month'] < 6)][['Region', rate_col]]

# Average rate per region in each window
before_lock = before_lock.groupby('Region')[rate_col].mean().reset_index().rename(
    columns={rate_col: 'Unemployment Rate before Lock-Down'})

after_lock = after_lock.groupby('Region')[rate_col].mean().reset_index().rename(
    columns={rate_col: 'Unemployment Rate after Lock-Down'})

# Merge on Region so the two averages are matched by name rather than by row position
plot_df = before_lock.merge(after_lock, on='Region')

# Relative change, multiplied by 100 so the column really holds a percentage
plot_df['Percentage Change in Unemployment'] = round(
    (plot_df['Unemployment Rate after Lock-Down'] - plot_df['Unemployment Rate before Lock-Down'])
    / plot_df['Unemployment Rate before Lock-Down'] * 100, 2)
plot_df = plot_df.sort_values('Percentage Change in Unemployment', ascending=False)

plt.figure(figsize=(16, 10))
sns.barplot(data=plot_df, y='Region', x='Percentage Change in Unemployment')
plt.show()

### If the percentage change is positive (+X%), the unemployment rate has increased by X% relative to the pre-lockdown period. In other words, a larger share of the labour force is unemployed.

### If the percentage change is negative (-X%), the unemployment rate has decreased by X% relative to the pre-lockdown period. A smaller share of the labour force is unemployed.

### The magnitude of the percentage change indicates how significant the change is. A larger percentage change suggests a more substantial shift in unemployment rates compared to a smaller percentage change.
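
The percentage change is (after - before) / before × 100. The sketch below works through it on hypothetical numbers and matches the two periods by region name with a merge, so a different row order in either table cannot pair the wrong regions. The frame and column names are illustrative only.

```python
import pandas as pd

# Hypothetical average rates; note the different row order in the two frames
before = pd.DataFrame({"Region": ["A", "B", "C"], "before": [5.0, 10.0, 8.0]})
after = pd.DataFrame({"Region": ["C", "A", "B"], "after": [4.0, 10.0, 15.0]})

# Match rows by region name, then compute the relative change in percent
merged = before.merge(after, on="Region")
merged["pct_change"] = (
    (merged["after"] - merged["before"]) / merged["before"] * 100
).round(2)

print(merged)
```

Region A doubles from 5 to 10, a change of +100%, while region C halves from 8 to 4, a change of -50%.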

### Puducherry's unemployment rate was seriously impacted by the lockdown.

### Sikkim, Chhattisgarh, Jammu & Kashmir and Himachal Pradesh have a negative percentage change. This means their unemployment rates did not rise after the lockdown, so these states were not highly impacted by it.