### 1. Data loading ###

Importing libraries and loading the data

In [19]:
import pandas as pd
from pathlib import Path
df = pd.read_csv('../data/raw/Student_performance_data.csv')

### 2. Data cleaning ###

As noted in the exploratory analysis notebook, the quality of this dataset is good: it contains no null values and no duplicate rows.
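A check of this kind can be sketched as follows. The small `sample` frame is a hypothetical stand-in so the snippet runs on its own; with the real data, the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only).
sample = pd.DataFrame({
    "StudentID": [1001, 1002, 1003],
    "Age": [17, 18, 15],
    "GPA": [2.93, 3.04, 0.11],
})

# Total number of missing values across all columns.
null_count = int(sample.isna().sum().sum())

# Number of rows that are exact copies of an earlier row.
duplicate_count = int(sample.duplicated().sum())

print(f"nulls: {null_count}, duplicates: {duplicate_count}")
```

If both counts are zero, no cleaning step is needed before recoding the variables.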

### 3. Data generation and modification

We start by changing the data type of the categorical variables and replacing their numeric codes with more descriptive labels, so that the analysis and the plots are easier to read.
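As an alternative to the cell-by-cell approach used in this notebook, the same recoding could be driven by dictionaries and `Series.map`. This is only a sketch: the `sample` frame and the `label_maps` name are hypothetical, and only two of the columns are shown.

```python
import pandas as pd

# Hypothetical sample with integer-coded categorical columns.
sample = pd.DataFrame({"Gender": [0, 1, 1], "Tutoring": [1, 0, 1]})

# Code-to-label dictionaries, matching the labels used in this notebook.
label_maps = {
    "Gender": {0: "Male", 1: "Female"},
    "Tutoring": {0: "No", 1: "Yes"},
}

for column, mapping in label_maps.items():
    # Series.map replaces each code with its label; a code missing from the
    # dictionary becomes NaN, which makes unexpected values easy to spot.
    sample[column] = sample[column].map(mapping)

print(sample)
```

Keeping all mappings in one dictionary avoids the intermediate cast to `str` and keeps the labels in a single place.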

Gender

In [20]:
df = df.astype({"Gender": str})
df.loc[df["Gender"] == "0", "Gender"] = "Male"
df.loc[df["Gender"] == "1", "Gender"] = "Female"

Ethnicity

In [21]:
df = df.astype({"Ethnicity": str})
df.loc[df["Ethnicity"] == "0", "Ethnicity"] = "Caucasian"
df.loc[df["Ethnicity"] == "1", "Ethnicity"] = "African American"
df.loc[df["Ethnicity"] == "2", "Ethnicity"] = "Asian"
df.loc[df["Ethnicity"] == "3", "Ethnicity"] = "Other"

ParentalEducation

In [22]:
df = df.astype({"ParentalEducation": str})
df.loc[df["ParentalEducation"] == "0", "ParentalEducation"] = "None"
df.loc[df["ParentalEducation"] == "1", "ParentalEducation"] = "High School"
df.loc[df["ParentalEducation"] == "2", "ParentalEducation"] = "Some College"
df.loc[df["ParentalEducation"] == "3", "ParentalEducation"] = "Bachelor's"
df.loc[df["ParentalEducation"] == "4", "ParentalEducation"] = "Higher"

Tutoring

In [23]:
df = df.astype({"Tutoring": str})
df.loc[df["Tutoring"] == "0", "Tutoring"] = "No"
df.loc[df["Tutoring"] == "1", "Tutoring"] = "Yes"

ParentalSupport

In [24]:
df = df.astype({"ParentalSupport": str})
df.loc[df["ParentalSupport"] == "0", "ParentalSupport"] = "None"
df.loc[df["ParentalSupport"] == "1", "ParentalSupport"] = "Low"
df.loc[df["ParentalSupport"] == "2", "ParentalSupport"] = "Moderate"
df.loc[df["ParentalSupport"] == "3", "ParentalSupport"] = "High"
df.loc[df["ParentalSupport"] == "4", "ParentalSupport"] = "Very High"

Extracurricular

In [25]:
df = df.astype({"Extracurricular": str})
df.loc[df["Extracurricular"] == "0", "Extracurricular"] = "No"
df.loc[df["Extracurricular"] == "1", "Extracurricular"] = "Yes"

Sports

In [26]:
df = df.astype({"Sports": str})
df.loc[df["Sports"] == "0", "Sports"] = "No"
df.loc[df["Sports"] == "1", "Sports"] = "Yes"

Music

In [27]:
df = df.astype({"Music": str})
df.loc[df["Music"] == "0", "Music"] = "No"
df.loc[df["Music"] == "1", "Music"] = "Yes"

Volunteering

In [28]:
df = df.astype({"Volunteering": str})
df.loc[df["Volunteering"] == "0", "Volunteering"] = "No"
df.loc[df["Volunteering"] == "1", "Volunteering"] = "Yes"

GradeClass

In this particular case, the existing values are replaced with ones derived from the GPA, because the exploratory analysis notebook showed that the GradeClass column contained erroneous data.

In [29]:
df = df.astype({"GradeClass": str})
df.loc[df.GPA >= 3.5 , "GradeClass"] = "A"
df.loc[(df.GPA >= 3.0) & (df.GPA < 3.5) , "GradeClass"] = "B"
df.loc[(df.GPA>= 2.5) & (df.GPA < 3.0) , "GradeClass"] = "C"
df.loc[(df.GPA >= 2.0) & (df.GPA < 2.5) , "GradeClass"] = "D"
df.loc[df.GPA < 2.0 , "GradeClass"] = "F"
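The same thresholds could also be expressed with `pd.cut`, which assigns every row to a bin in one call. The sketch below uses a hypothetical `sample` frame that includes the boundary values; it is an alternative formulation, not the code this notebook runs.

```python
import numpy as np
import pandas as pd

# Hypothetical GPA values, including the boundary cases.
sample = pd.DataFrame({"GPA": [3.5, 3.2, 2.5, 2.0, 1.99, 4.0, 0.0]})

# Bin edges in ascending order; with right=False each bin is [lower, upper),
# so a GPA of exactly 3.5 falls in the top bin.
bins = [-np.inf, 2.0, 2.5, 3.0, 3.5, np.inf]
labels = ["F", "D", "C", "B", "A"]

sample["GradeClass"] = pd.cut(
    sample["GPA"], bins=bins, labels=labels, right=False
).astype(str)

print(sample)
```

Because the bins cover the whole number line, no row can be left with a stale GradeClass value.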

We drop the StudentID column, since it is only an identifier and adds no value to the analysis.

In [31]:
df.drop('StudentID', axis=1, inplace=True)
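Note that an in-place drop raises a `KeyError` if the cell is executed a second time, because the column no longer exists. A re-runnable variant, shown here on a hypothetical `sample` frame, could look like this:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset.
sample = pd.DataFrame({"StudentID": [1001, 1002], "GPA": [2.93, 3.04]})

# errors="ignore" turns the drop into a no-op when the column is already
# gone, so running the cell again does not fail.
sample = sample.drop(columns="StudentID", errors="ignore")
sample = sample.drop(columns="StudentID", errors="ignore")  # second run: no error

print(list(sample.columns))
```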

**Note:** No new variables were generated, as the existing ones were considered sufficient for the intended analysis.

### 4. Review of the resulting dataset

We check that the data types have changed.

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2392 entries, 0 to 2391
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                2392 non-null   int64  
 1   Gender             2392 non-null   object 
 2   Ethnicity          2392 non-null   object 
 3   ParentalEducation  2392 non-null   object 
 4   StudyTimeWeekly    2392 non-null   float64
 5   Absences           2392 non-null   int64  
 6   Tutoring           2392 non-null   object 
 7   ParentalSupport    2392 non-null   object 
 8   Extracurricular    2392 non-null   object 
 9   Sports             2392 non-null   object 
 10  Music              2392 non-null   object 
 11  Volunteering       2392 non-null   object 
 12  GPA                2392 non-null   float64
 13  GradeClass         2392 non-null   object 
dtypes: float64(2), int64(2), object(10)
memory usage: 261.8+ KB


We review the summary statistics. As shown, statistics are no longer computed for fields such as Ethnicity, which is a categorical variable that was previously being treated as a number.

In [33]:
df.describe()

| | Age | StudyTimeWeekly | Absences | GPA |
|---|---|---|---|---|
| count | 2392.0 | 2392.0 | 2392.0 | 2392.0 |
| mean | 16.468645 | 9.771992 | 14.541388 | 1.906186 |
| std | 1.123798 | 5.652774 | 8.467417 | 0.915156 |
| min | 15.0 | 0.001057 | 0.0 | 0.0 |
| 25% | 15.0 | 5.043079 | 7.0 | 1.174803 |
| 50% | 16.0 | 9.705363 | 15.0 | 1.893393 |
| 75% | 17.0 | 14.40841 | 22.0 | 2.622216 |
| max | 18.0 | 19.978094 | 29.0 | 4.0 |
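The recoded columns can still be summarised, just with different statistics. As a sketch on a hypothetical `sample` frame, excluding the numeric columns makes `describe()` report count, unique, top and freq instead:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one categorical and one numeric column.
sample = pd.DataFrame({
    "Ethnicity": ["Caucasian", "Asian", "Caucasian", "Other"],
    "GPA": [2.9, 3.0, 0.1, 2.1],
})

# On a mixed frame, describe() defaults to the numeric columns only.
numeric_summary = sample.describe()

# Excluding numeric dtypes yields count / unique / top / freq for the rest.
categorical_summary = sample.describe(exclude=[np.number])

print(categorical_summary)
```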


### 5. Saving the modified dataset

In [34]:
from pathlib import Path  
filepath = Path('../data/processed/Student_performance_data.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
df.to_csv(filepath, index=False)
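One caveat when reading this file back: the label `"None"` used for ParentalEducation and ParentalSupport may be parsed as a missing value, since recent pandas releases list `None` among the default NA markers of `read_csv` (this depends on the pandas version). A minimal sketch of a round trip that preserves the label, using a hypothetical `sample` frame and a temporary file:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample containing the literal label "None".
sample = pd.DataFrame({
    "ParentalSupport": ["None", "Low", "High"],
    "GPA": [1.2, 2.5, 3.7],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = Path(tmp_dir) / "sample.csv"
    sample.to_csv(filepath, index=False)

    # keep_default_na=False stops read_csv from converting strings such as
    # "None" into NaN; this is reasonable here only because the dataset has
    # no genuinely missing values.
    reloaded = pd.read_csv(filepath, keep_default_na=False)

print(reloaded)
```

Another option would be to pick a label that cannot be mistaken for a missing-value marker, such as "No support".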