# Data Cleaning Process

### Pedro Antonio Ramonetti
### UCID: 12324731

For this project, I'm using data collected by Mexico's National Institute of Statistics and Geography (INEGI) to understand the impact of COVID-19 on Mexican education.

The following files will be cleaned (both are in the `raw_data` folder):

* TMODULO.csv: This file contains the surveys collected individually for each student in the sample. Most of these surveys were answered by a relative who was at home at the time of the survey.
* TVIVIENDA.csv: This file contains the surveys collected at the household level.

## Household Data Cleaning

In [1]:
import pandas as pd
import numpy as np

# First, I read the data.

df_household = pd.read_csv("raw_data/TVIVIENDA.csv", low_memory = False)

# Drop variables that will not be used. These variables count the members of the household (not useful here).

df_household.drop(columns=df_household.columns[1:29], inplace=True)
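
Dropping columns by position can be checked on a small example before running it on the full file. The sketch below uses a toy frame with hypothetical column names (`A`, `B`) standing in for the household-member counts; it is only an illustration of the pattern, not the actual layout of TVIVIENDA.csv.

```python
import pandas as pd

# Toy frame; "A" and "B" are hypothetical stand-ins for the unused count variables.
toy = pd.DataFrame({"ENT": [1, 2], "A": [3, 4], "B": [5, 6], "P2_1_1": [1, 2]})

# Select the labels by position, inspect them, then drop them by label.
cols_to_drop = toy.columns[1:3]
print(list(cols_to_drop))

trimmed = toy.drop(columns=cols_to_drop)
print(list(trimmed.columns))
```

Printing the selected labels first makes it easier to confirm that the positional slice covers exactly the intended variables.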

In [2]:
# Rename columns to identify them more easily. The question asks whether the household has the following goods.

df_household.rename(columns = {'P2_1_1':"P2_desktop", 'P2_1_2':"P2_laptop", 'P2_1_3':"P2_tv", 
                     'P2_1_4':"P2_tablet", 'P2_1_5':"P2_smartphone", 'P2_1_6':"P2_internet"}, inplace = True)

# I don't have any use for the following variables:
df_household.drop(columns=df_household.columns[7:16], inplace=True)



In [3]:
# I'm saving the following columns in a separate df, since their number of missing values is different. These
# questions concern the advantages and disadvantages of remote classes. I'm not sure yet whether I'll use them.

df_household_2 = df_household.drop(columns=["P2_desktop","P2_laptop", "P2_tv", "P2_tablet", "P2_smartphone", "P2_internet"])

# Now I drop them from the original df
df_household.drop(columns=df_household.columns[7:], inplace=True)

# Drop rows with missing values
df_household = df_household.dropna()
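
Since the two groups of columns differ in how many values are missing, it can help to count missing values per column and the rows lost to `dropna()`. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks an unanswered question.
toy = pd.DataFrame({
    "ENT": [1, 2, 3],
    "P2_desktop": [1, np.nan, 2],
    "P2_laptop": [2, 2, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows removed by dropping every row with at least one missing value.
rows_before = len(toy)
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)
print(rows_dropped)
```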

In [4]:
# Convert the variables to a proper dummy format by recoding 2 as 0. ENT (the state code) is left untouched.

for colname in df_household.columns:
    if colname == "ENT":
        continue
    df_household.loc[df_household[colname] == 2, colname] = 0
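
The same recoding can also be written without a loop. The sketch below is one possible vectorized version on a toy frame with made-up values; it assumes, as in the cell above, that every column except `ENT` is a 1/2 variable.

```python
import pandas as pd

# Toy frame with made-up values. ENT deliberately contains a 2 to show it is not recoded.
toy = pd.DataFrame({
    "ENT": [2, 9, 31],
    "P2_desktop": [1, 2, 2],
    "P2_internet": [2, 1, 1],
})

# Every column except the state code is treated as a dummy.
dummy_cols = toy.columns.drop("ENT")

# Recode 2 as 0 in all dummy columns at once.
toy[dummy_cols] = toy[dummy_cols].replace(2, 0)
print(toy)
```

Excluding `ENT` matters because 2 is also a valid state code, so recoding it would silently corrupt the region assignment.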

In [5]:
# Specify the regions in our data

northwest=[2,3,8,10,25,26]
northeast=[5,19,28]
west=[6,14,16,18]
east=[13,21,29,30]
northcenter=[1,11,22,24,32]
southcenter=[9,15,17]
southeast=[4,23,27,31]
southwest=[7,12,20]

df_household["Region"] = "Northwest"
df_household["Region"] = np.where(df_household['ENT'].isin(northeast),'Northeast', df_household["Region"])
df_household["Region"] = np.where(df_household['ENT'].isin(west),'West', df_household["Region"])
df_household["Region"] = np.where(df_household['ENT'].isin(east),'East', df_household["Region"])
df_household["Region"] = np.where(df_household['ENT'].isin(northcenter),'Northcenter', df_household["Region"])
df_household["Region"] = np.where(df_household['ENT'].isin(southcenter),'Southcenter', df_household["Region"])
df_household["Region"] = np.where(df_household['ENT'].isin(southeast),'Southeast', df_household["Region"])
df_household["Region"] = np.where(df_household['ENT'].isin(southwest),'Southwest', df_household["Region"])

df_household_2["Region"] = "Northwest"
df_household_2["Region"] = np.where(df_household_2['ENT'].isin(northeast),'Northeast', df_household_2["Region"])
df_household_2["Region"] = np.where(df_household_2['ENT'].isin(west),'West', df_household_2["Region"])
df_household_2["Region"] = np.where(df_household_2['ENT'].isin(east),'East', df_household_2["Region"])
df_household_2["Region"] = np.where(df_household_2['ENT'].isin(northcenter),'Northcenter', df_household_2["Region"])
df_household_2["Region"] = np.where(df_household_2['ENT'].isin(southcenter),'Southcenter', df_household_2["Region"])
df_household_2["Region"] = np.where(df_household_2['ENT'].isin(southeast),'Southeast', df_household_2["Region"])
df_household_2["Region"] = np.where(df_household_2['ENT'].isin(southwest),'Southwest', df_household_2["Region"])
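
The region assignment is repeated for each df, so it could be wrapped in a small helper. The sketch below is one possible way to do it with a dictionary and `Series.map`. The names `regions`, `state_to_region`, and `add_region` are hypothetical, and the region labels are capitalized uniformly. Unlike the cell above, a state code that appears in no list becomes missing instead of defaulting to "Northwest".

```python
import pandas as pd

# Same state-code lists as above, keyed by region label.
regions = {
    "Northwest": [2, 3, 8, 10, 25, 26],
    "Northeast": [5, 19, 28],
    "West": [6, 14, 16, 18],
    "East": [13, 21, 29, 30],
    "Northcenter": [1, 11, 22, 24, 32],
    "Southcenter": [9, 15, 17],
    "Southeast": [4, 23, 27, 31],
    "Southwest": [7, 12, 20],
}

# Invert the dictionary: state code -> region label.
state_to_region = {code: name for name, codes in regions.items() for code in codes}


def add_region(frame, state_col="ENT"):
    """Return a copy of frame with a Region column derived from the state code."""
    out = frame.copy()
    out["Region"] = out[state_col].map(state_to_region)
    return out


# Toy frame; 99 is not a valid state code and ends up with a missing Region.
toy = add_region(pd.DataFrame({"ENT": [2, 19, 9, 20, 99]}))
print(toy)
```

With a helper like this, the same call could be applied to each df, which keeps the labels consistent across the saved files.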



In [6]:
# Before saving our df's, a count variable is always useful.

df_household["count"] = 1
df_household_2["count"] = 1

# Save both df's. As mentioned, I'm not sure yet whether I'll use the second one.

df_household.to_csv("household_clean.csv", index = False)
df_household_2.to_csv("household_2_clean.csv", index = False)


## Individuals Data Cleaning


In [7]:
# First, I read the data.

df = pd.read_csv("raw_data/TMODULO.csv", low_memory = False)

In [8]:
# All of the following columns are dummy variables that are not correctly coded (2 instead of 0).

dummies = ["P3_5","P3_6","P3_7","P3_8","P3_9_1","P3_9_2","P3_9_3","P3_9_4",
"P3_9_5","P3_9_6","P3_9_7","P3_9_8","P3_9_9","P3_11","P3_12_1",
"P3_12_2","P3_12_3","P3_12_4","P3_12_5","P3_12_6","P3_12_7",
"P3_12_8","P3_14","P3_16","P3_17_1","P3_17_2","P3_17_3","P3_17_4",
"P3_17_5","P3_17_6","P3_17_7","P3_17_8","P3_20","P3_21_1","P3_21_2",
"P3_21_3","P3_21_4","P3_21_5","P3_21_6","P3_21_7","P3_23"]

for dummy in dummies:
    df.loc[df[dummy] == 2, dummy] = 0

In [9]:
# Specify the regions

df["Region"] = "Northwest"
df["Region"] = np.where(df['ENT'].isin(northeast),'Northeast', df["Region"])
df["Region"] = np.where(df['ENT'].isin(west),'West', df["Region"])
df["Region"] = np.where(df['ENT'].isin(east),'East', df["Region"])
df["Region"] = np.where(df['ENT'].isin(northcenter),'Northcenter', df["Region"])
df["Region"] = np.where(df['ENT'].isin(southcenter),'Southcenter', df["Region"])
df["Region"] = np.where(df['ENT'].isin(southeast),'Southeast', df["Region"])
df["Region"] = np.where(df['ENT'].isin(southwest),'Southwest', df["Region"])

In [10]:
# I know I'm not using the following variables. Too many missing values

to_drop = ["CON","N_REN", "PAREN","P3_9_1","P3_9_2","P3_9_3","P3_9_4","P3_9_5","P3_9_6","P3_9_7","P3_9_8","P3_9_9",
           "P3_19_1","P3_19_2","P3_19_3","P3_19_4","P3_19_5","P3_19_6","P3_19_7","P3_19_8"]

df = df.drop(columns=to_drop)
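
The "too many missing values" judgment can be backed by the share of missing values in each column. A minimal sketch on a toy frame follows; the values and the 0.5 `threshold` are hypothetical and not a cutoff used in this project.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "P3_5": [1, 0, 1, 0],
    "P3_9_1": [np.nan, np.nan, np.nan, 1],
    "CON": [np.nan, np.nan, 2, 3],
})

# Share of missing values per column (between 0 and 1).
missing_share = toy.isna().mean()
print(missing_share)

# Hypothetical cutoff: flag columns where more than half of the values are missing.
threshold = 0.5
sparse_cols = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=sparse_cols)
print(list(trimmed.columns))
```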

In [11]:
# Create string versions of the gender and "Public / Private School" variables
df["gender_string"] = "female"
df["gender_string"] = np.where((df["SEXO"] == 1),'male', df["gender_string"])

df["P3_6_string"] = "public"
df["P3_15_string"] = "public"

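# P3_6 is in the dummies list and was already recoded (2 became 0), so the private category is now 0.
df["P3_6_string"] = np.where((df["P3_6"] == 0),'private', df["P3_6_string"])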
df["P3_15_string"] = np.where((df["P3_15"] == 2),'private', df["P3_15_string"])
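
With the default-then-overwrite pattern above, rows with a missing code keep the default label ("female" or "public"). If that is not intended, one alternative is to map each code explicitly so that missing codes stay missing. The sketch below uses a toy frame and assumes the codes are 1 = male, 2 = female and 1 = public, 2 = private, which is inferred from the cell above rather than taken from the survey documentation.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the last student has no answer for P3_15.
toy = pd.DataFrame({"SEXO": [1, 2, 1], "P3_15": [1, 2, np.nan]})

# Assumed code books (not confirmed against the survey documentation).
gender_labels = {1: "male", 2: "female"}
school_labels = {1: "public", 2: "private"}

# Codes absent from the dictionary, including NaN, map to NaN.
toy["gender_string"] = toy["SEXO"].map(gender_labels)
toy["P3_15_string"] = toy["P3_15"].map(school_labels)
print(toy)
```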

In [12]:
# Count variable
df["count"] = 1

# Save the clean file. I'm not yet sure exactly which variables I'll use, so I'm keeping the full file for now.
df.to_csv("students_clean.csv", index = False)