# Tirgul 9 - a sample project analysis

We follow the steps of EDA (exploratory data analysis) and modeling:

- Wrangling the data
- Understanding the data 
- Preparing the data
- Modeling


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
plt.rcParams['figure.figsize'] = [12, 8]

#### The Dataset
The dataset contains information on students and their grades in math, reading and writing.
[link to the data source](https://www.kaggle.com/spscientist/students-performance-in-exams)

We read the data from a GitHub repository:

In [None]:
url = 'https://raw.githubusercontent.com/ShaiYona/Data-Science2021B/main/tirgulim/tirgul9/StudentsPerformance.csv'
data = pd.read_csv(url)
data.tail()

## 1. Wrangling the data:

- Treat missing values (if needed)
- Treat column names (if needed)
- Treat data types (if needed)
- Treat any other weird thing your data might have

### Treat missing values

Check if there are missing values:

In [None]:
data.isnull().sum().sort_values(ascending=False)

Apparently there are no _nulls_ in the data.

### Fixing data types
Check if any of the data types need to be fixed:

In [None]:
data.dtypes

We can:
- change 'gender' to be binary
- race/ethnicity to be cat.codes
- lunch to be binary
- test preparation course to be binary

We'll leave them as objects for now, but might change them later, depending on what we will want to do
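
A minimal sketch of what these conversions could look like, on a few made-up rows that only mimic the dataset's column names. The new column names (`gender_bin`, `lunch_bin`, `prep_bin`, `race_code`) are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows that mimic the dataset's columns (made up for illustration)
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female'],
    'race/ethnicity': ['group B', 'group C', 'group A'],
    'lunch': ['standard', 'free/reduced', 'standard'],
    'test preparation course': ['none', 'completed', 'none'],
})

# Binary columns: 1 where the condition holds, 0 otherwise
toy['gender_bin'] = (toy['gender'] == 'female').astype(int)
toy['lunch_bin'] = (toy['lunch'] == 'standard').astype(int)
toy['prep_bin'] = (toy['test preparation course'] == 'completed').astype(int)

# Multi-valued column: integer codes, assigned in sorted order of the labels
toy['race_code'] = toy['race/ethnicity'].astype('category').cat.codes

print(toy[['gender_bin', 'lunch_bin', 'prep_bin', 'race_code']])
```

Which value gets the 1 in a binary column is an arbitrary choice; what matters is to state it and keep it consistent.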

## 2. Understanding the data

Let's see a summary in a pivot table (note that the default is 'mean'):

In [None]:
data.pivot_table(['math score','reading score','writing score' ],'gender') 

- Looks like the male students are leading in Math, but are behind on Reading and Writing
- How many males and how many females?

In [None]:
data['gender'].value_counts()

### Now in a Pie-Chart

In [None]:
data['gender'].value_counts().plot.pie(autopct='%1.1f%%')

## Let's study the differences between males and females:


Separate into two datasets:

In [None]:
female = data.loc[data.gender == 'female']
male = data.loc[data.gender == 'male']
male.head()

In [None]:
plt.hist(male['math score'], alpha=0.4, label='male')
plt.hist(female['math score'], alpha=0.4, label='female')
plt.legend(loc='upper right')


In [None]:
plt.hist(male['reading score'], alpha=0.4, label='male')
plt.hist(female['reading score'], alpha=0.4, label='female')
plt.legend(loc='upper right')


In [None]:
plt.hist(male['writing score'], alpha=0.4, label='male')
plt.hist(female['writing score'], alpha=0.4, label='female')
plt.legend(loc='upper right')

We can see that male students tend to have a smaller variance than the female students.

Let's calculate the standard deviation and the range of scores

In [None]:
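# Select the score columns explicitly: recent pandas versions raise an error when
# std() meets the non-numeric columns. min and max give the range of scores.
data.groupby('gender')[['math score','reading score','writing score']].agg(['std','min','max'])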

### Correlation between scores

In [None]:
scoreData = data[['math score','reading score','writing score']]
scoreData.tail()

In [None]:
scoreData.corr()

In [None]:
# cmap='jet' sets the color map of the heatmap
# vmin=0.0, vmax=1 set the lower and upper boundaries of the color scale
# Correlation lies between -1 and 1, but here all correlations are positive,
#   so vmin can be set to 0
# annot=True displays the value in each square
sns.heatmap(scoreData.corr(), vmin=0.0 , vmax = 1,cmap='jet' , annot=True)

##### Observation: 
>
> The correlation across subjects is quite high, and between reading and writing it is near perfect.
>
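
To make the numbers in the correlation table less of a black box, here is a small sketch that computes the Pearson correlation by hand and compares it with what pandas returns. The five rows are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up scores for five students (illustration only)
toy = pd.DataFrame({
    'reading score': [72, 90, 95, 57, 78],
    'writing score': [74, 88, 93, 44, 75],
})

x = toy['reading score'].to_numpy(dtype=float)
y = toy['writing score'].to_numpy(dtype=float)

# Pearson r: sum of products of the centered values,
# divided by the square root of the product of their sums of squares
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# pandas computes the same quantity (Pearson is its default method)
r_pandas = toy['reading score'].corr(toy['writing score'])
print(r_manual, r_pandas)
```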

In [None]:
sns.regplot(x='reading score', y='writing score', data=data);

> A lower correlation shows up as a wider spread of points around the regression line
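
The link between correlation and spread can be checked on synthetic data. The sketch below is only an illustration: the data is randomly generated, and `residual_spread` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2000)

def residual_spread(noise_scale):
    """Return (correlation, std of the residuals around the fitted line)."""
    y = x + noise_scale * rng.normal(size=x.size)
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return r, residuals.std()

r_high, spread_high = residual_spread(0.3)  # little noise: higher correlation
r_low, spread_low = residual_spread(1.0)    # more noise: lower correlation
print(r_high, spread_high)
print(r_low, spread_low)
```

The case with more noise has both the lower correlation and the larger residual spread, which is what the two regression plots show visually.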

In [None]:
sns.regplot(x='reading score', y='math score', data=data);

### Looking at parental level of education

In [None]:
parentEducData = data[["parental level of education"]]
parentEducData.tail()

In [None]:
parentEducData.value_counts() # counts the number of rows in each category


[pie charts docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html)

In [None]:
# autopct display percents for each part
parentEducData.value_counts().plot.pie(autopct='%1.1f%%')

In [None]:
sns.countplot(x="parental level of education", data=data)

---
#### Project *Pro* tip: When there are more than 2-3 categories, a countplot is ALWAYS BETTER than a pie plot

It's just much easier to read it.

The only problem with our countplot is that the labels overlap. There are many ways to fix it. 

Google it. [for example](https://stackoverflow.com/questions/42528921/how-to-prevent-overlapping-x-axis-labels-in-sns-countplot)

Example of how it could go wrong:

---

In [None]:
plt.figure(figsize=(4,6)) 
sns.countplot(x="parental level of education", data=data)
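
One possible fix, sketched on a few made-up rows so that the snippet stands on its own: put the categories on the y-axis, where long labels have room. In the notebook itself you would pass `data=data` instead of the toy frame.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up rows, only to make the sketch self-contained
toy = pd.DataFrame({'parental level of education': [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
    "some college", "high school",
]})

fig, ax = plt.subplots(figsize=(6, 4))
# Horizontal bars: the category labels are drawn along the y-axis and do not collide
sns.countplot(y='parental level of education', data=toy, ax=ax)
fig.tight_layout()
```

Rotating the x tick labels, for example with `ax.tick_params(axis='x', rotation=45)`, is another common option.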

---

#### Project *Pro* tip: make sure your plots are readable (like we have just done).
You don't need to show both the unreadable version and the readable version. We know you have worked hard and struggled. Present your best!!

---

#### Searching for more correlations

Let's add a column with the mean score across all subjects:

_It's not cheating if it can be deduced from the data_

In [None]:
data['mean score'] = scoreData.mean(axis=1)
data.tail()

Let's check: is there any connection between parental level of education, lunch type, and grades?

In [None]:
# deep copy (not a shallow/reference copy): later changes to the copy will not affect data, and vice versa
EducLunchMean_ScoreData = data[['parental level of education','lunch','mean score']].copy(deep=True)
EducLunchMean_ScoreData.tail()
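
A small sketch of why the deep copy matters, using a made-up two-row frame (`original` and `independent` are hypothetical names): changes made to the copy do not leak back into the frame it was taken from.

```python
import pandas as pd

# Made-up frame (illustration only)
original = pd.DataFrame({'lunch': ['standard', 'free/reduced'],
                         'mean score': [70.0, 60.0]})
independent = original[['lunch', 'mean score']].copy(deep=True)

# Changing or adding columns on the copy leaves the original untouched
independent['mean score'] = independent['mean score'] + 5
independent['lunch_cat'] = [1, 0]

print(original.columns.tolist())        # ['lunch', 'mean score']
print(original['mean score'].tolist())  # [70.0, 60.0]
```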

In [None]:
EducLunchMean_ScoreData.pivot_table('mean score','parental level of education').sort_values("mean score")

##### Don't present in an incomprehensible way.
##### For example (of what NOT to do):

In [None]:
EducLunchMean_ScoreData.groupby('parental level of education')['mean score'].hist(alpha=0.5,legend=True,figsize=(10,10))
# We cannot use 'pivot_table' here since we do not wish to aggregate the data

#### The connection between lunch and the student's mean score:

In [None]:
EducLunchMean_ScoreData.pivot_table('mean score','lunch')

We can see a connection here: the mean score differs between the two lunch types.

In [None]:
EducLunchMean_ScoreData.groupby('lunch')['mean score'].hist(alpha=0.5,legend=True)

##### Observation:
> The lunch type tells us more about the student grades.
> Students with a standard lunch do better.
> This may say more about the student's background than about their real abilities.

#### The connection between parent education level and lunch type:

Turn the lunch column into a numeric category code:

- standard = 1
- free/reduced = 0
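
The mapping above follows from how pandas assigns category codes: the categories of a string column are sorted, and the codes follow that order. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up values (illustration only)
lunch = pd.Series(['standard', 'free/reduced', 'standard', 'free/reduced'])
as_cat = lunch.astype('category')

# Categories are sorted, so 'free/reduced' comes before 'standard'
print(list(as_cat.cat.categories))  # ['free/reduced', 'standard']
print(as_cat.cat.codes.tolist())    # [1, 0, 1, 0]
```

With a 0/1 code, the mean of the column is simply the fraction of students with a standard lunch, which is what makes it convenient to aggregate in a pivot table.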

In [None]:
EducLunchMean_ScoreData['lunch_cat'] = EducLunchMean_ScoreData['lunch'].astype('category').cat.codes
EducLunchMean_ScoreData

In [None]:
ptLunchEduc = EducLunchMean_ScoreData.pivot_table('lunch_cat','parental level of education').sort_values(by='lunch_cat')
ptLunchEduc

In [None]:
v_min = ptLunchEduc['lunch_cat'].min()*.99
v_max = ptLunchEduc['lunch_cat'].max()*1.01

sns.barplot(x=ptLunchEduc.index,y=ptLunchEduc['lunch_cat'])
plt.ylim(v_min,v_max)

> ##### Observation:
> It is interesting to see that the lunch type is spread more or less equally between the parent education levels. 
> Superficially, if lunch represents the parents' financial level, it was not affected by their education.


##### Project tip:
An observation is always better if it is also visual


In [None]:
# We manually ordered the plot according to the degrees
order = [5,2,4,3,1,0]
plt.figure(figsize=(10,5))
plt.scatter(ptLunchEduc.index[order],ptLunchEduc.values[order])
plt.ylim(0.5,.7)

#### The connection between parents' education level and mean score:



The mean score grouped by parent's education:

In [None]:
mean_parent = EducLunchMean_ScoreData.groupby('parental level of education')['mean score'].mean()
mean_parent


In a scatter plot:

In [None]:
# We manually ordered the plot according to the degrees
order = [5,2,4,0,1,3]
plt.figure(figsize=(10,5))
plt.scatter(mean_parent.index[order],mean_parent.values[order])
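
The positional `order` list works, but it silently breaks if the index order ever changes. A sketch of an alternative that orders by label instead; the `levels` list is an assumed ranking of the degrees, and the group means are made up for illustration.

```python
import pandas as pd

# Assumed ordering of the education levels, lowest to highest
levels = ["some high school", "high school", "some college",
          "associate's degree", "bachelor's degree", "master's degree"]

# Made-up group means, indexed alphabetically like a groupby result
toy_means = pd.Series(
    [69.0, 71.0, 63.0, 73.0, 68.0, 65.0],
    index=["associate's degree", "bachelor's degree", "high school",
           "master's degree", "some college", "some high school"],
)

# reindex puts the rows in the order given by `levels`, by label rather than by position
ordered = toy_means.reindex(levels)
print(ordered)
```

On an alphabetically sorted index this gives the same result as the positional order `[5,2,4,0,1,3]`.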


##### Project tip: 
Think of which figure will present your data in the best way

In this case - a boxplot is better than a scatter plot

Present a boxplot, with rotated labels on x-axis

In [None]:
fig, axes = plt.subplots(figsize=(20, 5), ncols=3)
sns.boxplot(ax=axes[0], x='parental level of education', y='reading score', data=data)
sns.boxplot(ax=axes[1], x='parental level of education', y='writing score', data=data)
sns.boxplot(ax=axes[2], x='parental level of education', y='math score', data=data)
for ax in axes:
    ax.tick_params(axis='x', rotation=45) # change to axis='y' and rotation=-45 and see what happens
    # alternative: ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

##### Boxplot for mean score

In [None]:
ax = sns.boxplot(x='parental level of education', y='mean score', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

#### The connection between education level and lunch type and mean score:

The mean score grouped by parents' education level and lunch type:

In [None]:
EducLunchMean_ScoreData.pivot_table('mean score','parental level of education','lunch',margins=True)



- The bottom margin shows the mean score according to the lunch type (free/reduced or standard)
- The right margin shows the mean score according to the parents' degree
- The mean for students with standard lunch is 8.5 points higher!
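
A sketch of how the margins are computed, on made-up rows: the `All` entries are means over the underlying rows, not means of the cell means, so groups with more students weigh more.

```python
import pandas as pd

# Made-up rows (illustration only)
toy = pd.DataFrame({
    'parental level of education': ['high school', 'high school', 'high school',
                                    "master's degree", "master's degree"],
    'lunch': ['standard', 'standard', 'free/reduced', 'standard', 'free/reduced'],
    'mean score': [70.0, 80.0, 60.0, 90.0, 70.0],
})

pt = toy.pivot_table('mean score', 'parental level of education', 'lunch', margins=True)
print(pt)

# Bottom margin for 'standard': mean over the three standard-lunch rows (70, 80, 90)
overall_standard = toy.loc[toy['lunch'] == 'standard', 'mean score'].mean()
# Mean of the two cell means (75 and 90) would give a different number
mean_of_cells = pt.loc[['high school', "master's degree"], 'standard'].mean()
print(overall_standard, mean_of_cells)
```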


> ##### Observation:
> The parent's education level does not have a direct effect on the lunch type. 
>
> The parent's education level does not have a direct effect on the mean score. 
>
> But the parent's education level combined with the lunch type does have an effect on the mean score. 

## 3. Building a model from the data

We will try to predict the mean score using a decision tree, based on gender, race/ethnicity, lunch type and test preparation course. 

#### Preparing the data for learning

In [None]:
X = pd.get_dummies(data[['gender','race/ethnicity','lunch','test preparation course']])
y = data[['mean score']]

X.head()

##### Remove the redundant fields

In [None]:
X = X.drop(columns=['gender_male','lunch_standard','test preparation course_none'])
X.head()
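
Why these fields are redundant can be seen on a tiny made-up example: for a two-valued column, each dummy is fully determined by the other, so one of them carries no extra information.

```python
import pandas as pd

# Made-up column (illustration only)
toy = pd.DataFrame({'gender': ['female', 'male', 'male', 'female']})
dummies = pd.get_dummies(toy, dtype=int)  # columns: gender_female, gender_male

# gender_male is always 1 - gender_female
redundant = (dummies['gender_male'] == 1 - dummies['gender_female']).all()
print(redundant)  # True

# Keeping one column per binary feature loses no information
reduced = dummies.drop(columns=['gender_male'])
print(reduced.columns.tolist())  # ['gender_female']
```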

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42) 
y_test.head()

In [None]:
print("Train STD: {:.3f}".format(y_train['mean score'].std()))
print("Test STD: {:.3f}".format(y_test['mean score'].std()))
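
One way to read these numbers, sketched here on made-up values: a "model" that always predicts the mean of the target has a root mean squared error equal to the target's standard deviation. That makes the STD a rough baseline that a useful model should beat.

```python
import numpy as np

# Made-up target values (illustration only)
y_true = np.array([55.0, 62.0, 70.0, 71.0, 84.0, 90.0])

# Baseline "model": always predict the mean of the target
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline_pred) ** 2))

# This equals the population standard deviation (ddof=0)
print(baseline_rmse, y_true.std(ddof=0))
```

Note that pandas' `.std()` uses `ddof=1` by default, so its value is slightly larger than the one computed here; for a few hundred samples the difference is negligible.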


##### Build the model

In [None]:
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train,y_train)

##### Evaluation

In [None]:
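def eval(x_test,y_test,model):
    pred = model.predict(x_test)
    # The reported value is the root of the MSE (RMSE); np.sqrt avoids the `squared`
    # argument, which newer scikit-learn versions no longer accept
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print("RMSE: {:.3f}".format(rmse))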

In [None]:
eval(X_test,y_test,model)

##### Plot the tree
[plot_tree docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)

Write a function that plots the tree

In [None]:
import sklearn.tree as tree
def plot_tree(tree_model,feat,size=(15,10)):
    fig = plt.figure(figsize=size)
    tree.plot_tree(tree_model, 
                   feature_names=list(feat),  # a plain list works across scikit-learn versions
                   filled=True,
                   fontsize=15)  
    plt.show()

##### Tree Pruning

In [None]:
model = DecisionTreeRegressor(max_depth=3,random_state=42)
model.fit(X_train,y_train)

eval(X_test,y_test,model)
plot_tree(model,X_test.columns,size=(30,20))

In [None]:
model = DecisionTreeRegressor(min_samples_split=100,random_state=42)
model.fit(X_train,y_train)

eval(X_test,y_test,model)
plot_tree(model,X_test.columns,size=(60,20))
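
To see what the two pruning parameters do to the size of the tree, here is a sketch on synthetic data. The data is randomly generated for illustration only, and the names `full`, `shallow` and `coarse` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustration only): 300 samples, 4 binary features
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(300, 4))
y_toy = 60 + 8 * X_toy[:, 0] + 5 * X_toy[:, 1] + rng.normal(scale=10, size=300)

full = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)
coarse = DecisionTreeRegressor(min_samples_split=100, random_state=0).fit(X_toy, y_toy)

# Compare the size of the three trees
for name, m in [('full', full), ('max_depth=3', shallow), ('min_samples_split=100', coarse)]:
    print(name, 'depth:', m.get_depth(), 'leaves:', m.get_n_leaves())
```

`max_depth` caps the number of levels, while `min_samples_split` stops splitting nodes that hold too few samples. Both yield a smaller tree than the unrestricted one, which is usually easier to read and less prone to overfitting.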

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=10000,max_depth=4,max_samples=100,random_state=42)
# RandomForestRegressor fits a number of regression decision trees and averages their predictions
# n_estimators is the number of trees in the forest

model.fit(X_train,y_train.values.ravel())
eval(X_test,y_test,model)

#### Let's check the error percentage
The relative difference of a prediction $a$ from the true value $b$ is:

$$\frac{|a-b|}{b}$$

In [None]:
pred=model.predict(X_test)
(np.abs(pred-y_test.values.ravel())/y_test.values.ravel()).mean()
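
The same computation on three made-up numbers, small enough to check by hand (`pred_toy` and `true_toy` are hypothetical names):

```python
import numpy as np

# Made-up predictions and true values (illustration only)
pred_toy = np.array([90.0, 50.0, 66.0])
true_toy = np.array([100.0, 40.0, 60.0])

# Per-sample relative error |a - b| / b: 10/100, 10/40 and 6/60
rel_err = np.abs(pred_toy - true_toy) / true_toy
print(rel_err)
# Average over all samples: (0.10 + 0.25 + 0.10) / 3, about 0.15
print(rel_err.mean())
```

This average is often called the mean absolute percentage error (MAPE). It is only meaningful when the true values are not zero or close to zero.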