In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # plotting

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


# Intro

In this notebook, I walk through an example of the process I follow to analyze data.
In this case, the dataset comes from Kaggle and contains students' marks in math and language tests.

# Let's load our data and have a look at it

In [None]:
#Load the data into pandas dataframe
df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')

First, let's look at what kind of data we have, even if we have already inspected it in spreadsheet software:

In [None]:
# Look at the loaded data (first rows)
df.head()

Let's check if the dataset contains nulls and print the general description (to see if the dataset is healthy)

In [None]:
print("Count of nulls for each column:")
print(df.isnull().sum())

df.describe()

Looks good: no nulls, no max values above 100, acceptable standard deviations and quartiles.
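The same sanity check can be made explicit in code instead of being read off the `describe()` table. The following is only a minimal sketch: the `sample` frame and the `scores_in_range` helper are hypothetical names introduced here, and the 0 to 100 range is an assumption based on the scores being percentages.

```python
import pandas as pd

# Hypothetical sample with the same score columns as the dataset
sample = pd.DataFrame({
    "math score": [72, 69, 90, 47],
    "reading score": [72, 90, 95, 57],
    "writing score": [74, 88, 93, 44],
})

def scores_in_range(frame, columns, low=0, high=100):
    """Return True if the columns hold no nulls and every value lies in [low, high]."""
    values = frame[columns]
    no_nulls = not values.isnull().any().any()
    in_range = bool(values.ge(low).all().all() and values.le(high).all().all())
    return no_nulls and in_range

score_columns = ["math score", "reading score", "writing score"]
print(scores_in_range(sample, score_columns))
```

On the real data, the call would be `scores_in_range(df, score_columns)`.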

Let's add the average score to the dataset; it looks like a nice-to-have figure:

In [None]:
df['avg score'] = (df['math score']+df['reading score']+df['writing score'])/3
df.head()
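An equivalent way to compute the average, sketched here on a hypothetical two-row `sample` frame, is to take the row-wise mean over a list of score columns. This avoids hard-coding the divisor if a column is added later. Note that `mean(axis=1)` skips missing values by default, which differs from the explicit sum when nulls are present.

```python
import pandas as pd

# Hypothetical two-row sample, only to make the sketch runnable
sample = pd.DataFrame({
    "math score": [60, 90],
    "reading score": [70, 90],
    "writing score": [80, 90],
})

score_columns = ["math score", "reading score", "writing score"]

# Row-wise mean over the selected columns
sample["avg score"] = sample[score_columns].mean(axis=1)
print(sample)
```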


# All right, let's see if we find some correlations in the data

Doing this helps me understand the data I'm working with.


## First, with numerical data:

As a Heatmap:

In [None]:
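corr = df.corr(numeric_only=True) # correlation matrix from pandas; numeric_only (pandas >= 1.5) skips the text columns, which raise an error in pandas >= 2.0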
heat_map = sns.heatmap(corr, cmap="YlGnBu") # seaborn heatmap with my favourite tones
corr

And as a pairplot:

In [None]:
sns.pairplot(df) # seaborn pairplot to observe correlation

We see that all scores are highly correlated, and that writing and reading scores show the strongest correlation. Makes sense.

It also makes sense to see a stronger correlation between the average and the reading/writing scores, since those two are highly correlated with each other and make up 2/3 of the average score.

I can also see skewed data: most scores are bunched toward the high end, with a longer tail toward low scores (left-skewed).

> **CONCLUSION:** Writing and reading scores are **highly correlated**.
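Reading the strongest pair off a heatmap works for three columns, but it can also be extracted programmatically. This is a minimal sketch on a hypothetical `toy` frame (the numbers are made up and are not from the dataset): it masks the lower triangle and diagonal of the correlation matrix so each pair appears once, then sorts the remaining values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores, only to make the sketch runnable
toy = pd.DataFrame({
    "math score": [50, 60, 70, 80, 65],
    "reading score": [55, 70, 72, 90, 60],
    "writing score": [54, 71, 73, 91, 58],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) -> correlation, drop the masked cells, sort descending
pairs = upper.stack().dropna().sort_values(ascending=False)
strongest_pair = pairs.index[0]

print(pairs)
print("Strongest pair:", strongest_pair)
```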

## Now we can explore categorical trends:

### Maybe it is gender related...

In [None]:
sns.displot(df, x="math score", hue="gender", kind="kde", fill=True)
print("Median: %d" % df['math score'].median())

Males tend to do slightly better than females here.

In [None]:
sns.displot(df, x="reading score", hue="gender", kind="kde", fill=True)
print("Median: %d" % df['reading score'].median())

Females are noticeably better here.

In [None]:
sns.displot(df, x="writing score", hue="gender", kind="kde", fill=True)
print("Median: %d" % df['writing score'].median())

Females also do better at writing.

>**CONCLUSIONS:**
>* Males tend to be slightly better at math
>* Females tend to be noticeably better at reading and writing
>* Math marks are slightly worse than language marks

>**FURTHER INVESTIGATION:**
>* FI_1: Is there any other feature of the dataset related to gender affecting the scores, e.g. parental education, lunch or test preparation?
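The KDE plots give a visual impression; the gaps can also be put into numbers with a `groupby`. The sketch below uses a hypothetical four-row `toy` frame with made-up scores, so the resulting figures say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [60, 70, 64, 66],
    "reading score": [80, 60, 76, 64],
    "writing score": [82, 58, 78, 62],
})

score_columns = ["math score", "reading score", "writing score"]

# Mean and median of every score, per gender
by_gender = toy.groupby("gender")[score_columns].agg(["mean", "median"])

# Gap between the group means (male minus female); positive means males score higher
means = toy.groupby("gender")[score_columns].mean()
mean_gap = means.loc["male"] - means.loc["female"]

print(by_gender)
print(mean_gap)
```

Replacing `toy` with `df` would give the corresponding table for the full dataset.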

### Could it be ethnicity related?

In [None]:
sns.displot(df, x="avg score", hue="race/ethnicity", kind="kde", fill=True)

It looks like group A tends to score lower than group E. Let's dig deeper into that:

>**FURTHER INVESTIGATIONS:**
>* FI_2: Is ethnicity (mainly groups A and E) related to any other feature?

### Let's do some correlation with categorical data


I will encode the categorical values following these rules:
* Gender and ethnicity will be encoded with an arbitrary number (since they can't be put in any meaningful order)
* Parental education, lunch and test preparation will be encoded from low to high.

Encoding will help me in case I want to do numerical operations with categorical data.

Unique values for the categorical columns are as follows:

In [None]:
# knowing the unique values of each column is important for the encoding:
print(df['gender'].unique())
print(df['race/ethnicity'].unique())
print(df['parental level of education'].unique())
print(df['lunch'].unique())
print(df['test preparation course'].unique())

Now the proper encoding:

In [None]:
# map each category to its numeric code with dictionaries

gender_codes = {'female': 1, 'male': 2}
ethnicity_codes = {"group A": 1, "group B": 2, "group C": 3, "group D": 4, "group E": 5}
parental_codes = {"some high school": 1, "high school": 2, "some college": 3, "associate's degree": 4, "bachelor's degree": 5, "master's degree": 6}
lunch_codes = {"free/reduced": 0, "standard": 1}
test_codes = {"none": 0, "completed": 1}

# write the codes to new columns, keeping the original columns intact
# (Series.map returns NaN for any value missing from the dictionary)

df['gender_code'] = df['gender'].map(gender_codes)
df['ethnicity_code'] = df['race/ethnicity'].map(ethnicity_codes)
df['parental_code'] = df['parental level of education'].map(parental_codes)
df['lunch_code'] = df['lunch'].map(lunch_codes)
df['test_code'] = df['test preparation course'].map(test_codes)

df.head()
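For the ordered features, an alternative to a hand-written dictionary is an ordered pandas `Categorical`, which stores the ranking together with the column. This is only a sketch of that option, shown on a hypothetical three-value `toy` series; it is not what the notebook uses.

```python
import pandas as pd

# Education levels from low to high, as listed in the encoding above
education_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# Hypothetical sample values
toy = pd.Series(
    ["some college", "master's degree", "high school"],
    name="parental level of education",
)

education = pd.Categorical(toy, categories=education_order, ordered=True)

# .codes is zero-based, so add 1 to match the 1..6 codes used above
parental_code = pd.Series(education.codes + 1, index=toy.index)
print(parental_code.tolist())
```

One thing to watch: a value that is not in `education_order` gets the code -1 (0 after the shift) instead of raising an error.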

#### Now I will have a look at the ethnic groups

Let's look at how ethnicity is distributed:

In [None]:
sns.catplot(x="race/ethnicity", kind="count", data=df,  order=["group A", "group B", "group C", "group D", "group E"], palette="Set2")

In [None]:
sns.catplot(x="race/ethnicity", y="avg score", kind="bar", data=df, order=["group A", "group B", "group C", "group D", "group E"], palette="Set2")

For some reason, the groups perform noticeably differently. As the ethnicity distribution shows, this is not related to the number of students in each group.

>**CONCLUSION:** There are differences between ethnic groups. Average score order is, from worst to best, A->B->C->D->E.

>**FURTHER INVESTIGATION:**
>* FI_3: Why do the ethnic groups get better scores from A -> E?
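To compare group size and group average side by side rather than across two plots, both can go in one table. A minimal sketch on a hypothetical `toy` frame with made-up values:

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group B", "group B", "group C",
                       "group A", "group C", "group C"],
    "avg score": [60.0, 65.0, 67.0, 70.0, 62.0, 68.0, 72.0],
})

# Number of students and mean average score per group, sorted from worst to best
summary = (
    toy.groupby("race/ethnicity")["avg score"]
    .agg(["count", "mean"])
    .sort_values("mean")
)
print(summary)
```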

A boxplot, to see if it gives some insight:

In [None]:
ax = sns.boxplot(x="avg score", y="race/ethnicity", hue="gender", 
                 data=df, palette="Set2", order=["group A", "group B", "group C", "group D", "group E"])


Nothing new here, but there are some outliers among females from groups B and C that we might want to have a look at.

>**FURTHER INVESTIGATION:**
>* FI_4: Investigate females from groups B and C with lower marks; this might reveal some hidden problems.
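To list those students instead of only seeing them as dots, the boxplot rule can be applied directly: a point is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles (1.5 is the usual boxplot default). The `iqr_outliers` helper and the `toy` frame below are hypothetical, written as a sketch rather than as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data: one clearly low score in group B
toy = pd.DataFrame({
    "race/ethnicity": ["group B"] * 6 + ["group C"] * 6,
    "gender": ["female"] * 12,
    "avg score": [70, 72, 68, 71, 69, 20, 65, 66, 64, 67, 63, 65],
})

def iqr_outliers(frame, group_cols, value_col, whisker=1.5):
    """Return the rows whose value falls outside the per-group boxplot whiskers."""
    grouped = frame.groupby(group_cols)[value_col]
    # transform broadcasts each group's quartile back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    too_low = frame[value_col] < q1 - whisker * iqr
    too_high = frame[value_col] > q3 + whisker * iqr
    return frame[too_low | too_high]

outliers = iqr_outliers(toy, ["race/ethnicity", "gender"], "avg score")
print(outliers)
```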

#### I want to observe if lunch makes any difference

In [None]:
print("Number of people with or without standard lunch:")
df['lunch'].value_counts()

In [None]:
ax = sns.boxplot(x="avg score", y="lunch", hue="gender", 
                 data=df, palette="Set2")

Eating less does students little good, it seems.

>**CONCLUSION:** Not eating before a test is generally not a good idea.

#### Parental education level may have an impact as well:

The distribution is as follows:

In [None]:
sns.catplot(x="parental level of education", kind="count", data=df,
            order=["some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"],
            palette="Set2")

In [None]:
ax = sns.boxplot(x="avg score", y="parental level of education", hue="gender", 
                 data=df, palette="Set2", order=["some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"])

#### Now with the exam preparation


In [None]:
print("Number of people with test preparation course, or not:")
df['test preparation course'].value_counts()

In [None]:
ax = sns.boxplot(x="avg score", y="test preparation course", hue="gender", 
                 data=df, palette="Set2")

Although the average scores of those who completed the test preparation course are only slightly better, we can observe that the lower tail of those who didn't is much longer, reaching below 50.

>**CONCLUSIONS:**
>* Completing the course is not very useful for getting a higher score, but improves your chances of passing.
>* It's easier to fail for those who haven't taken the course.
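The "chances of passing" claim can be checked by computing the share of students at or above a pass mark in each group. The dataset does not define a pass mark, so `PASS_MARK = 50` below is an assumption taken from the "under 50" remark, and the `toy` frame is hypothetical.

```python
import pandas as pd

PASS_MARK = 50  # assumption: the dataset itself does not define a pass mark

# Hypothetical toy data; the real notebook would use df
toy = pd.DataFrame({
    "test preparation course": ["none", "none", "none", "none",
                                "completed", "completed"],
    "avg score": [45.0, 70.0, 30.0, 66.0, 72.0, 55.0],
})

toy["passed"] = toy["avg score"] >= PASS_MARK

# The mean of a boolean column is the fraction of True values, i.e. the pass rate
pass_rate = toy.groupby("test preparation course")["passed"].mean()
print(pass_rate)
```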

#### I will try to find patterns in ethnicity related to the parental level of education 

In [None]:
print("Number of people in each group sliced with parental education:")
pd.pivot_table(df,index=["race/ethnicity"], values=["avg score"], columns=["parental_code","parental level of education"], aggfunc='count')

* Group A is the one with the least educated parents.
* Group B shows a trend towards the lower half of the education ranking.
* Groups C and D are quite flat in the distribution.
* Group E parental education is concentrated around the some college/associate's degree levels.
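Raw counts are hard to compare across groups of different sizes. One way to make the rows comparable, sketched here on a hypothetical `toy` frame, is `pd.crosstab` with `normalize="index"`, which turns each row into shares that sum to 1.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group A", "group E", "group E"],
    "parental level of education": [
        "some high school", "some high school", "high school",
        "some college", "associate's degree",
    ],
})

# Share of each education level within every ethnic group (rows sum to 1)
shares = pd.crosstab(
    toy["race/ethnicity"],
    toy["parental level of education"],
    normalize="index",
)
print(shares.round(2))
```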

>**CONCLUSIONS:**
>* Parental education seems to affect scores. Further tests will be required for this *hypothesis* (FI_5).
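A possible first step for that hypothesis is a rank correlation between the ordinal `parental_code` and the average score. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, using only pandas. The `toy` values are made up, and a correlation alone is neither a significance test nor evidence of causation.

```python
import pandas as pd

# Hypothetical toy data: one student per education level
toy = pd.DataFrame({
    "parental_code": [1, 2, 3, 4, 5, 6],
    "avg score": [58.0, 61.0, 60.0, 66.0, 70.0, 73.0],
})

# Spearman's rho is the Pearson correlation computed on the ranks
spearman_rho = toy["parental_code"].rank().corr(toy["avg score"].rank())
print(round(spearman_rho, 3))
```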

# Conclusions:  

* The strongest correlation is observed between writing and reading marks.
* Males tend to be slightly better at math, while females tend to be noticeably better at writing and reading.
* Math scores are slightly lower than language scores.
* There are differences between ethnic groups. Average score order is, from worst to best, A->B->C->D->E.
* Not eating before a test is generally not a good idea.
* Completing the course is not very useful for getting a higher score, but improves your chances of passing.
* It's easier to fail for those who haven't taken the preparatory course.
* Parental education seems to affect scores. Further tests will be required for this *hypothesis*.

>**FURTHER INVESTIGATIONS:**
>* FI_1: Is there any other feature of the dataset related to gender affecting the scores, e.g. parental education, lunch or test preparation?
>* FI_2: Is ethnicity (mainly groups A and E) related to any other feature?
>* FI_3: Why do the ethnic groups get better scores from A -> E?
>* FI_4: Investigate females from groups B and C with lower marks; this might reveal some hidden problems.
>* FI_5: Parental education seems to affect scores. Further tests will be required for this *hypothesis*.