# Project University Mental Health

## Part III: Data Transformation

### Introduction
In this part, we prepare our data for the machine learning work in Part IV.

An interesting feature of this dataset is that many of its categorical columns were engineered from numerical ones.

For example:
- `Friends` (numerical): willingness to seek help from friends when students encounter emotional difficulties
- `Friends_bi` (categorical, Yes/No): whether students are willing to seek help from friends when they encounter emotional difficulties

As such, our approach for this dataset is slightly different. We will prepare three DataFrames:
1. One that contains only the numerical columns
2. One that contains only the dummified categorical columns
3. One that combines the numerical and dummified columns
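
The three DataFrames listed above can be sketched on a tiny, made-up table before applying the same idea to the real file. The `toy` DataFrame and its values are assumptions for illustration only; just the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [24.0, 28.0, 21.0],
    "Friends": [5.0, 7.0, 3.0],
    "Gender": ["Male", "Female", "Female"],
    "Friends_bi": ["Yes", "Yes", "No"],
})

# 1. Numerical columns only
toy_num = toy.select_dtypes(include="number")

# 2. Categorical columns only, converted to 0/1 dummy columns
toy_cat = toy.select_dtypes(exclude="number")
toy_cat_dummy = pd.get_dummies(toy_cat, drop_first=True, dtype=int)

# 3. Numerical and dummified columns side by side
toy_full = pd.concat([toy_num, toy_cat_dummy], axis=1)

print(toy_full.columns.tolist())
# ['Age', 'Friends', 'Gender_Male', 'Friends_bi_Yes']
```

Passing `dtype=int` is a choice made for this sketch so that the dummy columns hold 0/1 integers regardless of the pandas version.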

In [1]:
# Step 1: Import pandas
import pandas as pd

In [2]:
# Step 2: Read the cleaned CSV
df = pd.read_csv('data_cleaned.csv')
df

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
0,Inter,SEA,Male,Grad,24.0,4.0,5.0,Long,3.0,Average,...,Yes,Yes,No,No,No,No,No,No,No,No
1,Inter,SEA,Male,Grad,28.0,5.0,1.0,Short,4.0,High,...,Yes,Yes,No,No,No,No,No,No,No,No
2,Inter,SEA,Male,Grad,25.0,4.0,6.0,Long,4.0,High,...,No,No,No,No,No,No,No,No,No,No
3,Inter,EA,Female,Grad,29.0,5.0,1.0,Short,2.0,Low,...,Yes,Yes,Yes,Yes,No,No,No,No,No,No
4,Inter,EA,Female,Grad,28.0,5.0,1.0,Short,1.0,Low,...,Yes,Yes,No,Yes,No,Yes,Yes,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,Dom,JAP,Female,Under,21.0,3.0,4.0,Long,5.0,High,...,Yes,Yes,No,No,No,No,No,No,No,Yes
264,Dom,JAP,Female,Under,22.0,3.0,3.0,Medium,3.0,Average,...,Yes,Yes,Yes,No,No,No,No,No,No,No
265,Dom,JAP,Female,Under,19.0,2.0,1.0,Short,5.0,High,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No
266,Dom,JAP,Male,Under,19.0,2.0,1.0,Short,5.0,High,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No


In [3]:
# Step 3: Get a DataFrame containing only numbers
df_float = df.select_dtypes(include='number')
df_float

Unnamed: 0,Age,Age_cate,Stay,Japanese,English,ToDep,ToSC,APD,AHome,APH,...,Friends,Parents,Relative,Profess,Phone,Doctor,Reli,Alone,Others,Internet
0,24.0,4.0,5.0,3.0,5.0,0.0,34.0,23.0,9.0,11.0,...,5.0,6.0,3.0,2.0,1.0,4.0,1.0,3.0,4.0,3.0
1,28.0,5.0,1.0,4.0,4.0,2.0,48.0,8.0,7.0,5.0,...,7.0,7.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,3.0
2,25.0,4.0,6.0,4.0,4.0,2.0,41.0,13.0,4.0,7.0,...,3.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,3.0
3,29.0,5.0,1.0,2.0,3.0,3.0,37.0,16.0,10.0,10.0,...,5.0,5.0,5.0,5.0,2.0,2.0,2.0,4.0,4.0,3.0
4,28.0,5.0,1.0,1.0,3.0,3.0,37.0,15.0,12.0,5.0,...,5.0,5.0,2.0,5.0,2.0,5.0,5.0,4.0,4.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,21.0,3.0,4.0,5.0,4.0,8.0,27.0,16.0,9.0,10.0,...,7.0,5.0,1.0,3.0,3.0,3.0,1.0,1.0,1.0,6.0
264,22.0,3.0,3.0,3.0,4.0,2.0,48.0,8.0,10.0,5.0,...,7.0,7.0,7.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0
265,19.0,2.0,1.0,5.0,3.0,9.0,47.0,8.0,7.0,5.0,...,7.0,7.0,6.0,7.0,7.0,7.0,1.0,1.0,1.0,2.0
266,19.0,2.0,1.0,5.0,3.0,1.0,43.0,8.0,12.0,5.0,...,5.0,7.0,5.0,5.0,5.0,5.0,4.0,4.0,4.0,2.0
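
Note that `select_dtypes` decides by storage type, not by meaning. In the output above, `Age_cate` appears among the numerical columns because its category codes are stored as floats. A minimal sketch on made-up rows (the `toy` DataFrame is an assumption, not the real data) shows the same behaviour:

```python
import pandas as pd

# Made-up rows that reuse column names from the dataset; not the real data.
toy = pd.DataFrame({
    "Age": [24.0, 28.0, 21.0],
    "Age_cate": [4.0, 5.0, 3.0],  # category codes stored as floats
    "Stay_Cate": ["Long", "Short", "Long"],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['Age', 'Age_cate']
print(other_cols)    # ['Stay_Cate']
```

Whether a float-coded category such as `Age_cate` should be treated as a number or as a category is a modelling decision left for the next part.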


In [4]:
# Step 4: Export the numerical DataFrame as CSV
df_float.to_csv('data_cleaned_floats.csv', index=False)
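
The export above leaves the row index out of the file. The sketch below, which uses a made-up two-row DataFrame and an in-memory buffer instead of a file on disk, illustrates what that avoids: when the index is written, it comes back as an extra `Unnamed: 0` column on the next `read_csv`.

```python
import io

import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"Age": [24.0, 28.0], "Friends": [5.0, 7.0]})

# to_csv() without a path returns the CSV text as a string.
csv_with_index = toy.to_csv()
csv_without_index = toy.to_csv(index=False)

reloaded_with = pd.read_csv(io.StringIO(csv_with_index))
reloaded_without = pd.read_csv(io.StringIO(csv_without_index))

print(reloaded_with.columns.tolist())     # ['Unnamed: 0', 'Age', 'Friends']
print(reloaded_without.columns.tolist())  # ['Age', 'Friends']
```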

In [5]:
# Step 5: Get a DataFrame that contains only strings
df_object = df.select_dtypes(include='object')
df_object

Unnamed: 0,inter_dom,Region,Gender,Academic,Stay_Cate,Japanese_cate,English_cate,Intimate,Religion,Suicide,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
0,Inter,SEA,Male,Grad,Long,Average,High,,Yes,No,...,Yes,Yes,No,No,No,No,No,No,No,No
1,Inter,SEA,Male,Grad,Short,High,High,,No,No,...,Yes,Yes,No,No,No,No,No,No,No,No
2,Inter,SEA,Male,Grad,Long,High,High,Yes,Yes,No,...,No,No,No,No,No,No,No,No,No,No
3,Inter,EA,Female,Grad,Short,Low,Average,No,No,No,...,Yes,Yes,Yes,Yes,No,No,No,No,No,No
4,Inter,EA,Female,Grad,Short,Low,Average,Yes,No,No,...,Yes,Yes,No,Yes,No,Yes,Yes,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,Dom,JAP,Female,Under,Long,High,High,No,Yes,No,...,Yes,Yes,No,No,No,No,No,No,No,Yes
264,Dom,JAP,Female,Under,Medium,Average,High,Yes,Yes,No,...,Yes,Yes,Yes,No,No,No,No,No,No,No
265,Dom,JAP,Female,Under,Short,High,Average,No,No,No,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No
266,Dom,JAP,Male,Under,Short,High,Average,No,No,No,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No


In [6]:
# Step 6: Dummify your categorical DataFrame
df_object_dummy = pd.get_dummies(df_object, drop_first=True)
df_object_dummy

Unnamed: 0,inter_dom_Inter,Region_JAP,Region_Others,Region_SA,Region_SEA,Gender_Male,Academic_Under,Stay_Cate_Medium,Stay_Cate_Short,Japanese_cate_High,...,Friends_bi_Yes,Parents_bi_Yes,Relative_bi_Yes,Professional_bi_Yes,Phone_bi_Yes,Doctor_bi_Yes,religion_bi_Yes,Alone_bi_Yes,Others_bi_Yes,Internet_bi_Yes
0,1,0,0,0,1,1,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1,1,0,0,0,1,1,0,0,1,1,...,1,1,0,0,0,0,0,0,0,0
2,1,0,0,0,1,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,1,0,...,1,1,1,1,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,1,0,...,1,1,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,0,1,0,0,0,0,1,0,0,1,...,1,1,0,0,0,0,0,0,0,1
264,0,1,0,0,0,0,1,1,0,0,...,1,1,1,0,0,0,0,0,0,0
265,0,1,0,0,0,0,1,0,1,1,...,1,1,1,1,1,1,0,0,0,0
266,0,1,0,0,0,1,1,0,1,1,...,1,1,1,1,1,1,0,0,0,0
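
To see what `drop_first=True` does, here is a minimal sketch on made-up rows (the `toy` DataFrame is an assumption for illustration). A column with three categories becomes two dummy columns, because the first category in sorted order is dropped and serves as the baseline. The sketch also includes one missing value, since `Intimate` has empty entries in the real data: by default, `get_dummies` encodes a missing value as all zeros, which makes it look the same as the baseline category.

```python
import pandas as pd

# Made-up rows; None stands in for a missing answer.
toy = pd.DataFrame({
    "Stay_Cate": ["Long", "Short", "Medium", "Short"],
    "Intimate": [None, "Yes", "No", "Yes"],
})

dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies.columns.tolist())
# ['Stay_Cate_Medium', 'Stay_Cate_Short', 'Intimate_Yes']

# Row 0 is "Long" (the dropped baseline) with a missing Intimate value,
# so every dummy column is 0 for that row.
print(dummies.iloc[0].tolist())  # [0, 0, 0]
```

If missing answers need to stay distinguishable, `pd.get_dummies` also accepts `dummy_na=True`, which adds a separate indicator column for them. Whether that is needed here depends on how the models in the next part should treat missing data.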


In [7]:
# Step 7: Export the DataFrame from Step 6
df_object_dummy.to_csv('data_cleaned_object_dummified.csv', index=False)

In [8]:
# Step 8: Get the full dummified DataFrame
df_full = pd.get_dummies(df, drop_first=True)
df_full

Unnamed: 0,Age,Age_cate,Stay,Japanese,English,ToDep,ToSC,APD,AHome,APH,...,Friends_bi_Yes,Parents_bi_Yes,Relative_bi_Yes,Professional_bi_Yes,Phone_bi_Yes,Doctor_bi_Yes,religion_bi_Yes,Alone_bi_Yes,Others_bi_Yes,Internet_bi_Yes
0,24.0,4.0,5.0,3.0,5.0,0.0,34.0,23.0,9.0,11.0,...,1,1,0,0,0,0,0,0,0,0
1,28.0,5.0,1.0,4.0,4.0,2.0,48.0,8.0,7.0,5.0,...,1,1,0,0,0,0,0,0,0,0
2,25.0,4.0,6.0,4.0,4.0,2.0,41.0,13.0,4.0,7.0,...,0,0,0,0,0,0,0,0,0,0
3,29.0,5.0,1.0,2.0,3.0,3.0,37.0,16.0,10.0,10.0,...,1,1,1,1,0,0,0,0,0,0
4,28.0,5.0,1.0,1.0,3.0,3.0,37.0,15.0,12.0,5.0,...,1,1,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,21.0,3.0,4.0,5.0,4.0,8.0,27.0,16.0,9.0,10.0,...,1,1,0,0,0,0,0,0,0,1
264,22.0,3.0,3.0,3.0,4.0,2.0,48.0,8.0,10.0,5.0,...,1,1,1,0,0,0,0,0,0,0
265,19.0,2.0,1.0,5.0,3.0,9.0,47.0,8.0,7.0,5.0,...,1,1,1,1,1,1,0,0,0,0
266,19.0,2.0,1.0,5.0,3.0,1.0,43.0,8.0,12.0,5.0,...,1,1,1,1,1,1,0,0,0,0
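
Calling `get_dummies` on the whole DataFrame leaves the numerical columns untouched and appends the dummy columns after them. The sketch below checks, on a made-up `toy` DataFrame (an assumption for illustration), that this matches concatenating the numerical part with the dummified categorical part.

```python
import pandas as pd

# Made-up rows with numerical and categorical columns interleaved.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [24.0, 28.0, 21.0],
    "Friends_bi": ["Yes", "Yes", "No"],
    "Friends": [5.0, 7.0, 3.0],
})

# Route A: dummify the whole DataFrame in one call.
full_direct = pd.get_dummies(toy, drop_first=True, dtype=int)

# Route B: split by dtype, dummify the categorical part, then recombine.
num_part = toy.select_dtypes(include="number")
cat_part = pd.get_dummies(
    toy.select_dtypes(exclude="number"), drop_first=True, dtype=int
)
full_concat = pd.concat([num_part, cat_part], axis=1)

print(full_direct.columns.tolist())
# ['Age', 'Friends', 'Gender_Male', 'Friends_bi_Yes']
print(full_direct.equals(full_concat))  # True
```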


In [9]:
# Step 9: Export the third DataFrame as a CSV
df_full.to_csv('data_cleaned_full.csv', index=False)

### End of Part III
This dataset is unusual in that many of its features were already engineered for us.

As such, we only needed to split the DataFrame into different parts.

More specifically, we prepared three DataFrames (numerical only, dummified categorical only, and the two combined) so that we can work with them in the next part, which covers machine learning modelling.