# Analysis on Unemployment in the Context of COVID-19  
## Notebook 01: Data Understanding

### Purpose
- This notebook focuses on understanding the structure, content, and quality of the raw global unemployment dataset before performing any data cleaning or analysis.


### Column Overview

- **Country Name**: Name of the country
- **Sex**: Gender category
- **Age Group**: Age classification
- **Age Categories**: Broad age classification
- **2014–2024**: Year-wise unemployment rates (wide format)

In [1]:
import pandas as pd
import numpy as np

In [6]:
df = pd.read_csv("../Data/global_unemployment_data.csv")
df.head()
# Preview the first five rows

Unnamed: 0,country_name,sex,age_group,age_categories,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,Afghanistan,Female,15-24,Youth,13.34,15.974,18.57,21.137,20.649,20.154,21.228,21.64,30.561,32.2,33.332
1,Afghanistan,Female,25+,Adults,8.576,9.014,9.463,9.92,11.223,12.587,14.079,14.415,23.818,26.192,28.298
2,Afghanistan,Female,Under 15,Children,10.306,11.552,12.789,14.017,14.706,15.418,16.783,17.134,26.746,29.193,30.956
3,Afghanistan,Male,15-24,Youth,9.206,11.502,13.772,16.027,15.199,14.361,14.452,15.099,16.655,18.512,19.77
4,Afghanistan,Male,25+,Adults,6.463,6.879,7.301,7.728,7.833,7.961,8.732,9.199,11.357,12.327,13.087


In [8]:
df.shape
# (total rows, total columns)

(1134, 15)

In [3]:
df.info()
# Data type and non-null count for each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1134 entries, 0 to 1133
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country_name    1134 non-null   object 
 1   sex             1134 non-null   object 
 2   age_group       1134 non-null   object 
 3   age_categories  1134 non-null   object 
 4   2014            1134 non-null   float64
 5   2015            1134 non-null   float64
 6   2016            1134 non-null   float64
 7   2017            1134 non-null   float64
 8   2018            1134 non-null   float64
 9   2019            1134 non-null   float64
 10  2020            1134 non-null   float64
 11  2021            1134 non-null   float64
 12  2022            1128 non-null   float64
 13  2023            1122 non-null   float64
 14  2024            1122 non-null   float64
dtypes: float64(11), object(4)
memory usage: 133.0+ KB


In [4]:
df.isna().sum()
# Total nulls present in each column

country_name       0
sex                0
age_group          0
age_categories     0
2014               0
2015               0
2016               0
2017               0
2018               0
2019               0
2020               0
2021               0
2022               6
2023              12
2024              12
dtype: int64
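
The per-column counts above show how many values are missing, but not which countries they belong to. A minimal sketch of one way to break the counts down by country follows. The helper name `missing_by_country` and the toy frame (countries `A`, `B`, `C` with made-up values) are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_by_country(df, year_cols):
    """Count missing values per country for each year column.

    Hypothetical helper: returns only countries with at least one NaN.
    """
    # Boolean NaN mask, summed within each country
    missing = df[year_cols].isna().groupby(df["country_name"]).sum()
    # Drop countries that have no missing values at all
    return missing[missing.sum(axis=1) > 0]

# Toy frame with made-up values, used only to demonstrate the helper
toy = pd.DataFrame({
    "country_name": ["A", "A", "B", "B", "C", "C"],
    "2022": [1.0, 2.0, np.nan, np.nan, 4.0, 5.0],
    "2023": [1.5, np.nan, np.nan, 3.0, 4.5, 5.5],
})

result = missing_by_country(toy, ["2022", "2023"])
print(result)
```

On the real data, the same call with the year columns `"2022"`, `"2023"`, and `"2024"` should list the affected countries directly.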

In [14]:
df.describe()
# Summary statistics for the numeric (year) columns

Unnamed: 0,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
count,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1128.0,1122.0,1122.0
mean,11.3878,11.272444,11.122963,10.863516,10.516499,10.311452,11.851285,11.422645,10.340361,9.985181,9.940089
std,11.119002,10.915942,10.742947,10.64098,10.527773,10.297952,11.23158,10.873412,10.26481,9.987778,9.977512
min,0.027,0.034,0.038,0.035,0.044,0.036,0.056,0.064,0.067,0.063,0.06
25%,3.9335,3.9935,3.94525,3.7475,3.67275,3.5385,4.3345,4.1535,3.55525,3.4775,3.45975
50%,7.6975,7.5475,7.5045,7.1405,6.706,6.6275,8.0675,7.5425,6.5715,6.466,6.364
75%,15.05075,14.76625,14.4675,14.142,13.343,13.2855,15.31625,14.8815,13.41,12.9145,12.68775
max,74.485,74.655,74.72,75.416,76.395,77.173,83.99,82.135,78.776,78.541,78.644


In [7]:
df["sex"].unique(), df["age_group"].unique(), df["age_categories"].unique(), df["country_name"].unique()
# Check the categorical columns for consistency (some country names are inconsistent)

(array(['Female', 'Male'], dtype=object),
 array(['15-24', '25+', 'Under 15'], dtype=object),
 array(['Youth', 'Adults', 'Children'], dtype=object),
 array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
        'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
        'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
        'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
        'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
        'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon',
        'Canada', 'Central African Republic', 'Chad', 'Channel Islands',
        'Chile', 'China', 'Colombia', 'Comoros', 'Congo',
        'Congo, Democratic Republic of the', 'Costa Rica', 'Croatia',
        'Cuba', 'Cyprus', 'Czechia', 'Ivory Coast', 'Denmark', 'Djibouti',
        'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
        'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia',
        'Fiji', 'Finla

In [8]:
countries = df["country_name"].unique()
len(countries)
# Number of countries in the data file

189
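
The figures so far are consistent with a fully regular layout: 189 countries × 2 sexes × 3 age groups = 1134 rows, which matches `df.shape`. The sketch below is one way to verify that directly, and to check whether `age_group` and `age_categories` map one-to-one. The function name `check_structure` and the toy frame (countries `A` and `B`) are illustrative assumptions. The age mapping is taken from the rows shown in `df.head()`.

```python
from itertools import product

import pandas as pd

def check_structure(df):
    """Summarise rows per country and the age_group/age_categories mapping."""
    rows_per_country = df.groupby("country_name").size()
    # Distinct partner labels in each direction; 1 everywhere means one-to-one
    cats_per_group = df.groupby("age_group")["age_categories"].nunique()
    groups_per_cat = df.groupby("age_categories")["age_group"].nunique()
    return {
        "countries": int(rows_per_country.size),
        "rows_per_country": sorted(rows_per_country.unique().tolist()),
        "one_to_one": bool((cats_per_group == 1).all() and (groups_per_cat == 1).all()),
    }

# Mapping as it appears in the df.head() output
AGE_MAP = {"15-24": "Youth", "25+": "Adults", "Under 15": "Children"}

# Toy frame: two hypothetical countries, every sex/age combination once
rows = [
    {"country_name": c, "sex": s, "age_group": g, "age_categories": AGE_MAP[g]}
    for c, s, g in product(["A", "B"], ["Female", "Male"], AGE_MAP)
]
toy = pd.DataFrame(rows)

report = check_structure(toy)
print(report)
```

If `rows_per_country` contains a single value (6 would be expected here) and `one_to_one` is true on the real data, one of the two age columns can be dropped without losing information.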

### Observations Summary

- The dataset covers 189 countries over the years 2014–2024.
- Unemployment data is stored in wide format (year columns).
- Missing values are present only for Ukraine (2022–2024) and the Palestinian Territories (2023–2024), likely because of war during those years. This accounts for the 6 nulls in 2022 and the 12 nulls each in 2023 and 2024.
- Data includes demographic splits by age categories and sex.
- *age_group* and *age_categories* appear to carry the same information (15-24 = Youth, 25+ = Adults, Under 15 = Children), so one of them is redundant.
- Reshaping the dataset from wide to long format will be required for time-series analysis.
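
For reference, the wide-to-long reshape mentioned in the last point can also be sketched in pandas with `DataFrame.melt`. This is an alternative illustration, not the method used in this project. The output column names `year` and `unemployment_rate` and the helper name `to_long` are assumptions. The sample rows are the first two rows of the `df.head()` output, limited to 2014 and 2015.

```python
import pandas as pd

ID_COLS = ["country_name", "sex", "age_group", "age_categories"]

def to_long(df):
    """Turn one-column-per-year data into one row per (group, year)."""
    long_df = df.melt(
        id_vars=ID_COLS,
        var_name="year",                 # assumed name for the former column headers
        value_name="unemployment_rate",  # assumed name for the values
    )
    # Year labels come from column headers (strings when read from CSV)
    long_df["year"] = long_df["year"].astype(int)
    return long_df.sort_values(ID_COLS + ["year"]).reset_index(drop=True)

# First two rows from df.head(), restricted to two year columns
sample = pd.DataFrame({
    "country_name": ["Afghanistan", "Afghanistan"],
    "sex": ["Female", "Female"],
    "age_group": ["15-24", "25+"],
    "age_categories": ["Youth", "Adults"],
    "2014": [13.34, 8.576],
    "2015": [15.974, 9.014],
})

long_df = to_long(sample)
print(long_df)
```

In long format each row holds a single observation, which makes grouping by year or plotting a trend per country straightforward.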

### Next Step
Proceed to data cleaning and reshaping, which are done in **Excel Power Query**.
