# **Exploratory Data Analysis**

# Introduction

In this notebook, we will focus on evaluating the dataset and its characteristics from a Machine Learning perspective.

The main goal of this process is to enhance the model's performance by addressing data issues such as missing values, anomalies, and potential capture errors.

The validations we will perform include the following:

- **Data Types:** Validating data formats for consistency and compatibility.
- **Missing Data:** Identifying and addressing gaps in the dataset.
- **Descriptive Statistics:** Summarizing key statistical measures.
- **Outliers:** Detecting and analyzing unusual data points.
- **Capture Considerations:** Reviewing data collection methods for potential biases or errors.
- **Class Balance:** Evaluating the distribution of target classes.
- **Potential Correlation:** Assessing relationships between features.
- **Distribution Analysis:** Validating the distribution of the data.

## Load libraries and data

In [42]:
import pandas as pd
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'G:\Mi unidad\###_ ML Zoomcamp 2024\enape_db_formated.csv')

We're already familiar with the dataset, but I'll display it here for quick reference.

In [43]:
df.head()

Unnamed: 0,resident_seq_number,sex,age,school_type,school_grade,finished_grade,em_hw_projects,em_tests,em_multimedia_evidence,em_class_participation,...,ap_depressed,ap_academic_desperation,ap_social_difficulty,ap_no_issues,economic_participation,work_hours,economic_consequences,state_number,period_type,period_number
0,1,male,13,public,primary,True,True,True,False,False,...,False,False,False,False,,,,24,year,6.0
1,2,female,19,public,bachelors,True,True,True,False,False,...,False,False,False,False,studying_other,,,24,year,1.0
2,1,female,8,public,primary,True,True,True,False,False,...,False,True,False,False,,,,22,year,2.0
3,1,male,27,private,bachelors,True,True,True,False,False,...,False,False,False,True,worked_one_hour,25.0,no_consequence,26,quadrimester,2.0
4,2,male,11,private,primary,True,True,True,False,False,...,False,False,False,True,,,,18,year,5.0


In [44]:
df.shape

(19973, 42)

## Data Types

In [45]:
numerical = list(df.select_dtypes(include=['float64', 'int64']).columns)
categorical = list(df.select_dtypes(include=['object', 'category', 'string']).columns)
boolean = list(df.select_dtypes(include=['bool']).columns)

In [46]:
print(numerical)
print(categorical)
print(boolean)

['resident_seq_number', 'age', 'help_hours', 'work_hours', 'state_number', 'period_number']
['sex', 'school_type', 'school_grade', 'expected_grade', 'economic_participation', 'economic_consequences', 'period_type']
['finished_grade', 'em_hw_projects', 'em_tests', 'em_multimedia_evidence', 'em_class_participation', 'em_class_work', 'em_class_attendance', 'em_other', 'em_no_evaluation', 'et_smartphone', 'et_laptop', 'et_desktop_pc', 'et_tablet', 'et_flat_screen', 'et_didactic_material', 'et_other', 'et_none', 'hr_mother', 'hr_father', 'hr_female_relative', 'hr_male_relative', 'hr_female_non_relative', 'hr_male_non_relative', 'hr_none', 'ap_stressed', 'ap_depressed', 'ap_academic_desperation', 'ap_social_difficulty', 'ap_no_issues']


"period_number" is a numerical value, but in our project, it is not a continuous variable. Instead, it is categorical, as it represents the grade period in which the student is enrolled. Therefore, we will convert it into a categorical or string type.

In [47]:
df['period_number'] = df['period_number'].astype('string')


In [48]:
df['period_number'].dtypes

string[python]
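
Since `df.head()` shows `period_number` stored as floats (`6.0`, `1.0`), a direct cast to string keeps the trailing `.0` in the category labels. Here is a minimal sketch of one way to get cleaner labels, assuming every value is integer-valued and non-null. The toy series is illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df['period_number']; the values are illustrative only.
period = pd.Series([6.0, 1.0, 2.0])

# Direct cast: the float representation is carried into the label.
direct = period.astype("string")

# Cast to the nullable integer dtype first, then to string.
via_int = period.astype("Int64").astype("string")

print(direct.tolist())
print(via_int.tolist())
```

Either form works as a categorical feature. The integer route only changes how the labels read in later tables and plots.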

The same applies to "state_number": it identifies the geographical state, so it is not a continuous value.

In [49]:
df['state_number'] = df['state_number'].astype('string')

In [50]:
df['state_number'].dtypes

string[python]

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19973 entries, 0 to 19972
Data columns (total 42 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   resident_seq_number      19973 non-null  int64  
 1   sex                      19973 non-null  object 
 2   age                      19973 non-null  int64  
 3   school_type              19973 non-null  object 
 4   school_grade             19973 non-null  object 
 5   finished_grade           19973 non-null  bool   
 6   em_hw_projects           19973 non-null  bool   
 7   em_tests                 19973 non-null  bool   
 8   em_multimedia_evidence   19973 non-null  bool   
 9   em_class_participation   19973 non-null  bool   
 10  em_class_work            19973 non-null  bool   
 11  em_class_attendance      19973 non-null  bool   
 12  em_other                 19973 non-null  bool   
 13  em_no_evaluation         19973 non-null  bool   
 14  et_smartphone         

At this point I think we have the correct data types. We'll proceed, and if needed we can come back and recheck them.

## Missing Data

### Initial validation

In [52]:
df.isnull().sum()[lambda x: x > 0]

help_hours                 8666
expected_grade             5545
economic_participation    10398
work_hours                17038
economic_consequences     17038
dtype: int64
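
Raw null counts are easier to judge next to the share of rows they represent. The sketch below shows one possible way to build such a summary. The `toy` frame and the `n_missing` / `pct_missing` column names are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for df; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "age": [5, 12, 18, 25],
    "help_hours": [2.0, None, None, None],
    "work_hours": [None, None, None, 25.0],
})

missing = toy.isnull().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually have missing values.
summary = summary[summary["n_missing"] > 0]
print(summary)
```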

### help_hours

The "help_hours" value indicates the amount of time the student received help. According to the survey documentation, this field is relevant for students aged between 3 and 17 years old, so it would be blank for those above this age range. Additionally, for students within this age range who provided no answer, we can assume no help was received. In both scenarios, null values can be replaced with zero.

In [53]:
df['help_hours'] = df['help_hours'].fillna(0)
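
The two scenarios described above (question not asked versus asked but unanswered) can be told apart with an age mask. This is a minimal sketch on made-up data, assuming the 3 to 17 range is inclusive on both ends. On the real data, such a check would have to run before the nulls are filled.

```python
import pandas as pd

# Toy data; ages and hours are illustrative only.
toy = pd.DataFrame({
    "age":        [5, 12, 17, 18, 25, 9],
    "help_hours": [2.0, None, 4.0, None, None, 1.0],
})

in_scope = toy["age"].between(3, 17)   # inclusive on both ends
is_null = toy["help_hours"].isnull()

# Nulls for students outside the age range: the question did not apply.
nulls_out_of_scope = int((is_null & ~in_scope).sum())

# Nulls for students inside the age range: asked, but no answer given.
nulls_in_scope = int((is_null & in_scope).sum())

print(nulls_out_of_scope, nulls_in_scope)
```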

### expected_grade

In the case of "expected_grade", we'll replace null values with "unknown", a category that already exists in the column.

In [54]:
df.groupby('expected_grade').size()

expected_grade
bachelors      11187
high_school      506
masters         1776
primary           27
secondary        173
tech_bacc        183
tech_school       14
univ_tech        205
unknown          357
dtype: int64

In [55]:
df['expected_grade'] = df['expected_grade'].fillna('unknown')

In [56]:
df.groupby('expected_grade').size()

expected_grade
bachelors      11187
high_school      506
masters         1776
primary           27
secondary        173
tech_bacc        183
tech_school       14
univ_tech        205
unknown         5902
dtype: int64

### economic_participation

According to the survey documentation, this field is relevant for students aged between 14 and 29 years old, so it will be blank for those below this age range. We can validate it as follows:

In [63]:
df[df['economic_participation'].isnull()].groupby('age').size()


age
3       32
4      362
5      851
6     1093
7     1007
8     1140
9     1110
10    1233
11    1119
12    1275
13    1176
dtype: int64

Therefore, we will replace null values with the category described as "Studying or in a different situation" (in the preprocessing phase, we filtered the records to include only enrolled students).

| Numeric Value | Description (in Spanish)                                               | Description (Translated)                                      | New Value              |
|---------------|------------------------------------------------------------------------|--------------------------------------------------------------|------------------------|
| 1             | trabajó por lo menos una hora (tenía trabajo pero no trabajó)?         | Worked at least one hour (had a job but didn't work)?         | worked_one_hour       |
| 2             | vendió o hizo algún producto para vender?                              | Sold or made a product to sell?                              | sold_product          |
| 3             | ayudó en las labores del campo, cría de animales, o en el negocio de un familiar o de otra persona? | Helped with farming, animal husbandry, or a family/other's business? | family_business_help |
| 4             | a cambio de un pago realizó otro tipo de actividad? (lavó o planchó ajeno, cuidó niños) | Performed other paid activity? (laundry, ironing, childcare) | paid_other_work       |
| 5             | estuvo de aprendiz o haciendo su servicio social?                      | Was an apprentice or doing community service?                | apprentice_service    |
| 6             | buscó trabajo?                                                        | Searched for a job?                                          | job_search            |
| 7             | Estudia o está en otra situación diferente a las anteriores            | Studying or in a different situation                         | studying_other        |
| b             | No sabe                                                  | Doesn't know                                           | unknown        |
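
The recoding in the table can be expressed as a dictionary lookup. The sketch below is an assumption about how it might be done, not the preprocessing code actually used: it assumes the raw column stores the codes as strings, and the `raw` series is made up.

```python
import pandas as pd

# Code-to-label mapping copied from the table above.
ECONOMIC_PARTICIPATION_LABELS = {
    "1": "worked_one_hour",
    "2": "sold_product",
    "3": "family_business_help",
    "4": "paid_other_work",
    "5": "apprentice_service",
    "6": "job_search",
    "7": "studying_other",
    "b": "unknown",
}

# Hypothetical raw survey codes, including one missing answer.
raw = pd.Series(["1", "7", "b", None, "6"])

# Codes without a dictionary entry (here, the missing one) become NaN.
labeled = raw.map(ECONOMIC_PARTICIPATION_LABELS)

# Same fill rule as in this notebook: missing means "studying_other".
filled = labeled.fillna("studying_other")
print(filled.tolist())
```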


In [64]:
df['economic_participation'] = df['economic_participation'].fillna('studying_other')

In [65]:
df.groupby('economic_participation').size()

economic_participation
apprentice_service        115
family_business_help      430
job_search                158
paid_other_work            19
sold_product               72
studying_other          16880
worked_one_hour          2299
dtype: int64

### work_hours

This field also applies only to students aged 14 and older, so we check how the null values are distributed by age.

In [58]:
df[df['work_hours'].isnull()].groupby('age').size()

age
3       32
4      362
5      851
6     1093
7     1007
8     1140
9     1110
10    1233
11    1119
12    1275
13    1176
14    1077
15    1043
16     836
17     801
18     745
19     465
20     422
21     409
22     327
23     212
24     131
25      71
26      34
27      31
28      18
29      18
dtype: int64

For students aged 14 and older, we assume that a missing answer means there was no economic activity, so we again replace null values with zero.

In [60]:
df['work_hours'] = df['work_hours'].fillna(0)

In [61]:
df['work_hours'].isnull().sum()

0
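
The assumption that a missing `work_hours` means no economic activity can be cross-checked against `economic_participation`. Here is a rough sketch on made-up rows. The `work_hours_missing`, `n_missing` and `n_rows` names are hypothetical, and on the real data the check would need to run before the nulls are filled.

```python
import pandas as pd

# Toy rows; categories mirror the notebook, values are illustrative only.
toy = pd.DataFrame({
    "economic_participation": [
        "worked_one_hour", "studying_other", "job_search",
        "worked_one_hour", "studying_other",
    ],
    "work_hours": [25.0, None, None, None, None],
})

# Count missing work_hours per participation category.
missing_by_activity = (
    toy.assign(work_hours_missing=toy["work_hours"].isnull())
       .groupby("economic_participation")["work_hours_missing"]
       .agg(["sum", "count"])
       .rename(columns={"sum": "n_missing", "count": "n_rows"})
)
print(missing_by_activity)
```

Missing hours in a working category such as `worked_one_hour` would be the cases where filling with zero is least defensible.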

### economic_consequences

We can apply the same approach to "economic_consequences".

In [67]:
df[df['economic_consequences'].isnull()].groupby('age').size()

age
3       32
4      362
5      851
6     1093
7     1007
8     1140
9     1110
10    1233
11    1119
12    1275
13    1176
14    1077
15    1043
16     836
17     801
18     745
19     465
20     422
21     409
22     327
23     212
24     131
25      71
26      34
27      31
28      18
29      18
dtype: int64

We'll fill the null values with the existing `no_consequence` category.

In [69]:
df.groupby('economic_consequences').size()

economic_consequences
allocate_income         274
economic_unsustain      160
education_stop          228
hire_replacement         44
income_decrease         645
no_consequence        18469
other_consequence        27
workload_increase       126
dtype: int64

In [68]:
df['economic_consequences'] = df['economic_consequences'].fillna('no_consequence')

In [70]:
df.groupby('economic_consequences').size()

economic_consequences
allocate_income         274
economic_unsustain      160
education_stop          228
hire_replacement         44
income_decrease         645
no_consequence        18469
other_consequence        27
workload_increase       126
dtype: int64

### Final Validation

In [71]:
df.isnull().sum()[lambda x: x > 0]

Series([], dtype: int64)

## Descriptive Statistics

In [72]:
df.describe()

Unnamed: 0,resident_seq_number,age,help_hours,work_hours
count,19973.0,19973.0,19973.0,19973.0
mean,1.818555,13.468883,4.166675,4.092275
std,0.956192,5.555088,5.790704,11.756149
min,1.0,3.0,0.0,0.0
25%,1.0,9.0,0.0,0.0
50%,2.0,13.0,2.0,0.0
75%,2.0,18.0,6.0,0.0
max,10.0,29.0,99.0,99.0


In [80]:
df.shape[0]  # Total rows (instances) in the DataFrame

19973

In [79]:
(df == 99).sum()[lambda x: x > 0]

help_hours    5
work_hours    2
dtype: int64

We need to adjust the "99" values in help_hours and work_hours: according to the survey documentation, they encode an unknown answer rather than a real number of hours. So that this sentinel does not distort the statistics (it currently sets the maximum of both columns), we will replace these values with zero.

In [81]:
# Replace values in help_hours
df['help_hours'] = df['help_hours'].replace({99: 0})

# Replace values in work_hours
df['work_hours'] = df['work_hours'].replace({99: 0})


In [82]:
(df == 99).sum()[lambda x: x > 0]

Series([], dtype: int64)
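
Replacing the sentinel with zero makes an unknown answer indistinguishable from a true zero. One optional variation, sketched here on made-up data, records the unknowns in a flag column before overwriting them. The `*_unknown` column names are hypothetical and not part of the notebook.

```python
import pandas as pd

SENTINEL = 99  # "unknown" code, per the survey documentation

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "help_hours": [2.0, 99.0, 0.0],
    "work_hours": [0.0, 40.0, 99.0],
})

for col in ["help_hours", "work_hours"]:
    # Hypothetical flag column marking rows where the answer was unknown.
    toy[f"{col}_unknown"] = toy[col].eq(SENTINEL)
    # Same replacement as in the notebook.
    toy[col] = toy[col].replace(SENTINEL, 0)

print(toy)
```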

In [None]:
df.describe(include=['object'])

Unnamed: 0,sex,school_type,school_grade,expected_grade,economic_participation,economic_consequences,state_number,period_type,period_number
count,19973,19973,19973,19973,19973,19973,19973,19973,19973.0
unique,2,2,11,9,7,8,32,6,14.0
top,male,public,primary,bachelors,studying_other,no_consequence,29,year,3.0
freq,10231,17826,7172,11187,16880,18469,1128,16109,4878.0


In [75]:
df.describe(include=['string'])

Unnamed: 0,state_number,period_number
count,19973,19973.0
unique,32,14.0
top,29,3.0
freq,1128,4878.0


In [76]:
df.describe(include=['bool'])

Unnamed: 0,finished_grade,em_hw_projects,em_tests,em_multimedia_evidence,em_class_participation,em_class_work,em_class_attendance,em_other,em_no_evaluation,et_smartphone,...,hr_female_relative,hr_male_relative,hr_female_non_relative,hr_male_non_relative,hr_none,ap_stressed,ap_depressed,ap_academic_desperation,ap_social_difficulty,ap_no_issues
count,19973,19973,19973,19973,19973,19973,19973,19973,19973,19973,...,19973,19973,19973,19973,19973,19973,19973,19973,19973,19973
unique,2,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
top,True,True,True,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,True
freq,19585,17278,11247,13567,16064,10988,16990,19910,19709,13531,...,17942,19031,19849,19894,16852,14101,17792,15241,19036,10199
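
For boolean columns, `describe` reports only the most frequent value and its count. Taking the column mean gives the share of `True` values directly, which is a convenient first look at how balanced each flag is. This is a small sketch on made-up data, with `true_share` as an illustrative name.

```python
import pandas as pd

# Toy flags; column names mirror the notebook, values are illustrative only.
toy = pd.DataFrame({
    "finished_grade": [True, True, True, False],
    "ap_stressed":    [False, True, False, False],
})

bool_cols = toy.select_dtypes(include="bool").columns

# The mean of a boolean column is the fraction of True values.
true_share = toy[bool_cols].mean().mul(100).round(1)
print(true_share)
```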

