# **Exploratory Data Analysis Project**
This notebook practices exploratory data analysis and prepares the data for a simple machine learning task. The dataset was downloaded from an official EU database and was originally published by Ireland's Central Statistics Office. It contains quarterly employment rates from 2019 to 2024, broken down by sex, education attainment level and age group.

In [None]:
# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

## **1. Retrieving Data**

In [2]:
# load the dataset and print the first five rows
df = pd.read_csv('dataset.csv')
df.head()

Unnamed: 0,STATISTIC,Statistic Label,TLIST(Q1),Quarter,C02199V02655,Sex,C04283V05060,Education Attainment Level,C02076V02508,Age Group,UNIT,VALUE
0,QLF50C01,Employment rate,20191,2019Q1,-,Both sexes,-,Levels of Education (Levels 0-8),-,All ages,%,74.9
1,QLF50C01,Employment rate,20191,2019Q1,-,Both sexes,-,Levels of Education (Levels 0-8),365,20 - 24 years,%,64.0
2,QLF50C01,Employment rate,20191,2019Q1,-,Both sexes,-,Levels of Education (Levels 0-8),410,25 - 29 years,%,80.1
3,QLF50C01,Employment rate,20191,2019Q1,-,Both sexes,-,Levels of Education (Levels 0-8),4251,25 - 54 years,%,80.1
4,QLF50C01,Employment rate,20191,2019Q1,-,Both sexes,-,Levels of Education (Levels 0-8),440,30 - 34 years,%,81.4


## **2. Understanding the Data**

In [3]:
# extract the number of rows and columns
nrows, ncols = df.shape
print(f"The dataset has {ncols} columns and {nrows} rows.")

The dataset has 12 columns and 4752 rows.


After inspecting the data in Data Wrangler, the following general findings were made:

- the main column of interest is `VALUE`, which gives the employment rate (in %) for a given combination of sex, age group and education level in each quarter from 2019 to 2024
- `TLIST(Q1)` and `Quarter` both encode the quarter, as an integer (e.g. 20191) and as a string (e.g. 2019Q1) respectively
- besides separate entries for each sex, age group and education level, there are also aggregate entries that summarize a whole dimension (e.g. `Both sexes`, `All ages`). This allows for an overall as well as a group-level analysis
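The redundancy between the two time columns can be checked directly. The following is a minimal sketch on a small hand-built frame (not the real `dataset.csv`), assuming the integer code is always `year * 10 + quarter`, as the `20191` / `2019Q1` pair in the preview suggests. The names `toy`, `periods`, `rebuilt` and `columns_agree` are illustrative only.

```python
import pandas as pd

# Toy frame mirroring the two time columns (illustrative rows, not the real file)
toy = pd.DataFrame({
    "TLIST(Q1)": [20191, 20192, 20201],
    "Quarter": ["2019Q1", "2019Q2", "2020Q1"],
})

# Parse the string form into quarterly periods
periods = pd.PeriodIndex(toy["Quarter"], freq="Q")

# Rebuild the integer code (assumed to be year * 10 + quarter)
rebuilt = periods.year * 10 + periods.quarter

# True only if every row of TLIST(Q1) matches the code rebuilt from Quarter
columns_agree = bool((rebuilt == toy["TLIST(Q1)"].to_numpy()).all())
print(columns_agree)
```

If the check holds on the full dataset, one of the two columns can be dropped without losing information, and the parsed periods can serve as a proper time axis.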

In [4]:
# isolate the relevant columns and change the column names to make them more usable
df = df.loc[:, ["Quarter", "Sex", "Education Attainment Level", "Age Group", "VALUE"]]
df.columns = ["quarter", "sex", "education", "age", "employment"]
df.head()

Unnamed: 0,quarter,sex,education,age,employment
0,2019Q1,Both sexes,Levels of Education (Levels 0-8),All ages,74.9
1,2019Q1,Both sexes,Levels of Education (Levels 0-8),20 - 24 years,64.0
2,2019Q1,Both sexes,Levels of Education (Levels 0-8),25 - 29 years,80.1
3,2019Q1,Both sexes,Levels of Education (Levels 0-8),25 - 54 years,80.1
4,2019Q1,Both sexes,Levels of Education (Levels 0-8),30 - 34 years,81.4
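
With the renamed columns, the aggregate rows noted earlier can be separated from the group-specific rows using boolean masks. This is a rough sketch on made-up rows: the labels `Both sexes` and `All ages` appear in the preview, while the `Male` label and the last two employment values are assumptions for illustration only.

```python
import pandas as pd

# Toy rows with the renamed columns; "Male" and its values are placeholders
toy = pd.DataFrame({
    "quarter": ["2019Q1"] * 4,
    "sex": ["Both sexes", "Both sexes", "Male", "Male"],
    "age": ["All ages", "25 - 29 years", "All ages", "25 - 29 years"],
    "employment": [74.9, 80.1, 79.0, 85.0],
})

# Rows that aggregate over both sex and age (overall view)
is_total = (toy["sex"] == "Both sexes") & (toy["age"] == "All ages")
totals = toy[is_total]

# Rows that refer to one specific sex and one specific age group (group-level view)
is_specific = (toy["sex"] != "Both sexes") & (toy["age"] != "All ages")
specific = toy[is_specific]

print(len(totals), len(specific))
```

Keeping the two subsets apart avoids counting the same people twice when averaging or plotting. The `education` column would need the same treatment once its aggregate label is confirmed.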


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4752 entries, 0 to 4751
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   quarter     4752 non-null   object 
 1   sex         4752 non-null   object 
 2   education   4752 non-null   object 
 3   age         4752 non-null   object 
 4   employment  3201 non-null   float64
dtypes: float64(1), object(4)
memory usage: 185.8+ KB


**Result**:
- dataset with 5 columns and 4752 rows
- 4 categorical features (`quarter`, `sex`, `education`, `age`), all stored as `object`
- 1 continuous label (`employment`) with only 3201 non-null values, so 1551 rows have no employment rate
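
The `df.info()` output shows 3201 non-null `employment` values out of 4752 rows, so the missing entries deserve a closer look. A minimal sketch of how they could be counted, using a toy frame with placeholder values rather than the real data (the `Male` label and the variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing employment values (placeholder data)
toy = pd.DataFrame({
    "sex": ["Both sexes", "Both sexes", "Male", "Male"],
    "employment": [74.9, np.nan, 64.0, np.nan],
})

# Total number and share of missing labels
n_missing = int(toy["employment"].isna().sum())
share_missing = toy["employment"].isna().mean()

# Missing labels per category, to see whether the gaps cluster in certain groups
missing_by_sex = toy["employment"].isna().groupby(toy["sex"]).sum()

print(n_missing, share_missing)
print(missing_by_sex)
```

Applied to the real `df`, the same grouping over `education` or `age` would show whether the missing rates are concentrated in particular groups, which matters for deciding between dropping and imputing them.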

In [6]:
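# seaborn must be imported in the active kernel (the imports cell above has no execution count)
import seaborn as sns
# `employment` is the only numeric column, so color by a categorical column instead of by the label
sns.pairplot(df, hue='sex')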