# Project 3: Communicating data findings


## Dataset: Student Performance in Exams
This dataset describes student performance: it lists each student's scores in three subjects, along with background information such as gender and ethnicity.

The dataset contains 8 columns:
* `gender`: a categorical variable (male or female)
* `race/ethnicity`: a categorical variable, taking one of five values: `group A, group B...group E` 
* `parental level of education`: a categorical variable, which can be `some high school`, `high school`, `some college`, `associate's degree`, `bachelor's degree` or `master's degree`
* `lunch`: a categorical variable, which tells the type of lunch the student has. This reflects their financial status. It can be `standard` or `free/reduced`
* `test preparation course`: indicates whether a student completed such a course. The values can be either `completed` or `none`
* The rest of the columns are numeric, indicating the student's score, out of 100, in `math`, `reading` and `writing`

The dataset has **1000 records**, but we will see whether that number changes after some wrangling.

## Research Questions: (TODO)


In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

In [20]:
df = pd.read_csv("dataset/StudentsPerformance.csv")
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


## Data wrangling
In this step, I will perform the following:
* checking that the column names are "code friendly"
* checking for nulls in the data
* checking if there are duplicate rows
* checking if the datatypes are the most suitable for each column

This process is iterative, so I might need to do some more data wrangling later.

#### Code friendly column names
From the dataframe head, we can see that the column names are meaningful, but some are long or contain whitespace or a slash, which is not code friendly. As such, I will rename them.

In [21]:
# replace column names with values in the dictionary
df_mod = df.rename(columns={
    "race/ethnicity":"race",
    "parental level of education": "parent_education",
    "test preparation course": "prep_course"
})

# replace whitespace with underscore
df_mod.rename(columns= lambda x: x.replace(" ", "_"), inplace = True)

df_mod.head()

Unnamed: 0,gender,race,parent_education,lunch,prep_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
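
One way to confirm that the renamed columns are "code friendly" is to test whether each name is a valid Python identifier. The snippet below is a minimal sketch on a small stand-in frame (`sample` is a hypothetical name, with values copied from the head above), not part of the original notebook. It assumes that "code friendly" means "usable with attribute access", which is only one possible definition.

```python
import pandas as pd

# Hypothetical stand-in frame using the original-style column names
sample = pd.DataFrame({
    "gender": ["female", "male"],
    "race/ethnicity": ["group B", "group A"],
    "math score": [72, 47],
})

# Same approach as above: explicit mapping first, then underscores for spaces
renamed = sample.rename(columns={"race/ethnicity": "race"})
renamed = renamed.rename(columns=lambda name: name.replace(" ", "_"))

# Assumption: a name is code friendly if it is a valid Python identifier,
# which also allows attribute access such as renamed.math_score
unfriendly = [name for name in renamed.columns if not name.isidentifier()]
print(unfriendly)
```

An empty list means every column can be accessed both as `renamed["math_score"]` and as `renamed.math_score`.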


#### Nulls and datatypes
This can be easily checked with the `info()` method of the dataframe.

In [22]:
df_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   gender            1000 non-null   object
 1   race              1000 non-null   object
 2   parent_education  1000 non-null   object
 3   lunch             1000 non-null   object
 4   prep_course       1000 non-null   object
 5   math_score        1000 non-null   int64 
 6   reading_score     1000 non-null   int64 
 7   writing_score     1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


So,
* the data has no null values: every column has 1000 non-null entries,
* the test scores are integers, and the categorical data is stored as strings (dtype `object`)
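
Reading the non-null counts from `info()` works, but the same check can be made explicit with `isna()`. The sketch below uses a small made-up frame that does contain missing values (`sample` and its contents are illustrative, not taken from the dataset), just to show what the check reports when nulls are present.

```python
import numpy as np
import pandas as pd

# Illustrative frame with deliberately missing values (not the real data)
sample = pd.DataFrame({
    "lunch": ["standard", "free/reduced", None],
    "math_score": [72.0, np.nan, 90.0],
})

# Number of missing values per column
null_counts = sample.isna().sum()
print(null_counts)

# Single yes/no answer for the whole frame
has_nulls = bool(sample.isna().any().any())
print(has_nulls)
```

On the student dataset, the same two calls applied to `df_mod` should report zero for every column, in line with the `info()` output.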

So far so good. **However**, for ordinal data such as `parent_education`, it is useful to have the values sorted in their natural order when visualizing. We can do that by creating an ordered categorical datatype with `pd.api.types.CategoricalDtype` and converting the dataframe column to that datatype.

###### Ordinal Datatype

In [23]:
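# create a list of education levels, sorted from lowest to highest
# (each label must match the string in the data exactly, otherwise astype() turns it into NaN)
sorted_edu_levels = ["some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"]

# create the ordinal datatype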
parent_education_ordinal = pd.api.types.CategoricalDtype(categories=sorted_edu_levels,
                                                        ordered = True)
# cast to this new type
df_mod["parent_education"] = df_mod["parent_education"].astype(parent_education_ordinal)

The other columns that could be treated as ordinal, like `lunch` or `prep_course`, are binary anyway, so we do not need to convert them.
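
Casting to an ordered categorical has one pitfall worth guarding against: any value that is not in the category list silently becomes `NaN`. The following is a minimal sketch on a short hand-written series (`raw` is a hypothetical example, and the category labels are assumed to match the strings shown in the data head). It checks for unexpected labels before casting and then shows the ordering in action.

```python
import pandas as pd

# Assumed labels, written to match the strings seen in the data head
levels = ["some high school", "high school", "some college",
          "associate's degree", "bachelor's degree", "master's degree"]
edu_dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)

# Hypothetical example values
raw = pd.Series(["master's degree", "some college",
                 "associate's degree", "high school"])

# Labels missing from the category list would be cast to NaN, so check first
unexpected = set(raw.unique()) - set(levels)
print(unexpected)

edu = raw.astype(edu_dtype)

# Sorting and comparisons now follow the category order, not the alphabet
print(edu.sort_values().tolist())
print((edu >= "some college").tolist())
```

If `unexpected` is not empty, the category list needs to be corrected before the cast; comparing `isna().sum()` before and after the conversion is another way to catch the same problem.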

#### Duplicated rows

In [24]:
print("Are there any duplicated rows?")
df.duplicated().any()

Are there any duplicated rows?


False
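
Had the check returned `True`, the duplicates could be counted and dropped as sketched below. The frame here is a made-up example with one repeated row (`sample` is a hypothetical name), since the actual dataset has none.

```python
import pandas as pd

# Made-up frame in which the third row repeats the first
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math_score": [72, 47, 72],
})

# duplicated() flags every repeat of an earlier row
n_dupes = int(sample.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```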

That's it for data wrangling, unless something else comes up later.