## Tirgul 9: A sample project analysis

- Read the data
- Filter data
- Determine features
- Determine label

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## The Dataset
The dataset contains information on students and their grades in math, reading, and writing.

In [None]:
data = pd.read_csv('StudentsPerformance.csv')
data.tail()

In [None]:
# Clear N/A
data = data.dropna(axis=0)
data.tail()

Apparently there were no N/A values in the data.
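
`data.tail()` shows only the last five rows, so a more direct check is to count the missing values per column and compare the row count before and after `dropna`. A minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value (not the real dataset)
toy = pd.DataFrame({
    "math score": [70, 85, np.nan, 60],
    "reading score": [72, 90, 65, 58],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows removed by dropna; zero would mean there was nothing to clean
rows_removed = len(toy) - len(toy.dropna(axis=0))
print("rows removed:", rows_removed)
```

Running the same two checks on `data` before the `dropna` call would confirm the statement above.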

In [None]:
# let's see some summary
data.pivot_table(['math score','reading score','writing score' ],'gender') # note the default is mean

It looks like the male students lead in math but are behind in reading and writing.
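
Since `pivot_table` aggregates with the mean by default, the summary above should match a plain `groupby(...).mean()`. A small sketch on made-up numbers (the `toy` frame is hypothetical) that compares the two:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real dataset)
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [60, 70, 80, 90],
    "reading score": [75, 65, 85, 55],
})

score_cols = ["math score", "reading score"]

# pivot_table with its default aggregation (mean)
by_pivot = toy.pivot_table(score_cols, "gender")

# The same summary written as an explicit groupby
by_group = toy.groupby("gender")[score_cols].mean()

# Compare the numeric values of the two tables
same = np.allclose(by_pivot.to_numpy(dtype=float), by_group.to_numpy(dtype=float))
print(by_pivot)
print("identical means:", same)
```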

## Plotting a histogram
Let's display a histogram for each subject by gender.

The top set of plots is for the female students.


[pandas_hist](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html)

In [None]:
data.groupby('gender').hist(figsize=(10,5))

Let's calculate the variance of the scores for each gender.

In [None]:
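# Select the numeric score columns first; var() is not defined for the text columns
data.groupby('gender')[['math score','reading score','writing score']].var()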

We can see that the male students' scores tend to have a smaller variance than the female students' scores.
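
The range (maximum minus minimum) is another simple measure of spread and can be computed per gender in the same way. A minimal sketch on made-up data; `score_range` is a hypothetical helper name and is not part of pandas:

```python
import pandas as pd

# Hypothetical toy data (not the real dataset)
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [50, 90, 60, 80],
    "reading score": [70, 95, 55, 65],
})

def score_range(column):
    """Range of a column: largest value minus smallest value."""
    return column.max() - column.min()

score_cols = ["math score", "reading score"]

# Apply the helper to every score column within each gender group
ranges = toy.groupby("gender")[score_cols].agg(score_range)
print(ranges)
```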

## Correlation between scores

In [None]:
scoreData = data[['math score','reading score','writing score']]
scoreData.tail()

In [None]:
scoreData.corr()

The correlation across subjects is quite high, and between reading and writing it is nearly perfect.
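
By default `corr()` reports the Pearson correlation coefficient. To make the number less of a black box, the sketch below computes it by hand on a few made-up scores and compares it with the pandas result (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores (not the real dataset)
toy = pd.DataFrame({
    "reading score": [60, 70, 80, 90, 75],
    "writing score": [58, 72, 79, 93, 74],
})

x = toy["reading score"].to_numpy(dtype=float)
y = toy["writing score"].to_numpy(dtype=float)

# Center both variables around their means
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson r: sum of co-deviations divided by the product of the deviation norms
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same coefficient as reported by pandas
r_pandas = toy.corr().loc["reading score", "writing score"]
print(round(r_manual, 4), round(r_pandas, 4))
```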

In [None]:
# cmap='jet' sets the color map used for the cells
# vmin=0.0 and vmax=1 set the lower and upper boundaries of the color scale
# annot=True displays the value inside each square
sns.heatmap(scoreData.corr(), vmin=0.0, vmax=1, cmap='jet', annot=True)

## Pie Charts
[pie charts docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html)

In [None]:
parentEducData = data[["parental level of education"]]
parentEducData.tail()

In [None]:
parentEducData.value_counts() # counts the number of rows for each category

In [None]:
# autopct displays the percentage of each slice
parentEducData.value_counts().plot.pie(autopct='%1.1f%%')

## Looking for more correlation

In [None]:
data['mean score'] = scoreData.mean(axis=1)
data.tail()

## Is there a connection between parental level of education, lunch type, and grades?

In [None]:
EducLunchMean_ScoreData = data[['parental level of education','lunch','mean score']].copy(deep=True)
EducLunchMean_ScoreData.tail()

In [None]:
EducLunchMean_ScoreData.pivot_table('mean score','parental level of education').sort_values("mean score")

In [None]:
EducLunchMean_ScoreData.groupby('parental level of education')['mean score'].hist(alpha=0.5,legend=True,figsize=(10,10))
# We cannot use 'pivot_table' here since we do not wish to aggregate the data

We can see that from "some college" to "master's degree" the difference is small

In [None]:
EducLunchMean_ScoreData.pivot_table('mean score','lunch')

In [None]:
EducLunchMean_ScoreData.groupby('lunch')['mean score'].hist(alpha=0.5,legend=True)

The lunch type tells us more about a student's potential than the parental level of education does.

In [None]:
EducLunchMean_ScoreData['lunch_cat'] = EducLunchMean_ScoreData['lunch'].astype('category').cat.codes
EducLunchMean_ScoreData

Standard = 1

free/reduced = 0
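
The mapping between codes and labels does not have to be read off the table by eye. For string labels, pandas sorts the categories, and each code is the position of the label in that sorted list. A short sketch on a made-up series:

```python
import pandas as pd

# Hypothetical toy column with the two lunch labels
lunch = pd.Series(["standard", "free/reduced", "standard", "free/reduced"])
lunch_cat = lunch.astype("category")

# Position in the sorted category list -> label
code_to_label = dict(enumerate(lunch_cat.cat.categories))
print(code_to_label)

# The integer codes that would be stored in a 'lunch_cat' column
codes = list(lunch_cat.cat.codes)
print(codes)
```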

In [None]:
EducLunchMean_ScoreData.pivot_table('lunch_cat','parental level of education').sort_values(by='lunch_cat')

### It is interesting to see that the lunch type is spread more or less equally across the education levels
Let's check whether getting a free/reduced lunch is a good indicator of your score.

In [None]:
EducLunchMean_ScoreData.pivot_table('mean score','parental level of education','lunch',margins=True)

- The bottom margin shows the mean score for each lunch type (free/reduced or standard)
- The right margin shows the mean score for each parental level of education
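
One detail worth keeping in mind: a margin is computed from all the rows that fall into that row or column, so with unequal group sizes it is not the average of the cell means. A minimal sketch on made-up data (the labels mirror the dataset, the numbers are invented):

```python
import pandas as pd

# Hypothetical toy data with unequal group sizes
toy = pd.DataFrame({
    "parental level of education": ["high school", "high school", "high school",
                                    "master's degree", "master's degree"],
    "lunch": ["standard", "standard", "free/reduced", "standard", "free/reduced"],
    "mean score": [80.0, 70.0, 60.0, 90.0, 70.0],
})

table = toy.pivot_table("mean score", "parental level of education", "lunch",
                        margins=True)
print(table)

# Right margin for "high school": mean of its three rows (80, 70, 60)
row_margin = table.loc["high school", "All"]

# Average of the two cell means in that row (60 and 75), which is a different number
mean_of_cells = table.loc["high school", ["free/reduced", "standard"]].mean()
print(row_margin, mean_of_cells)
```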

In [None]:
EducLunchMean_ScoreData.pivot_table('mean score','parental level of education','lunch',margins=True,aggfunc='std')

In [None]:
mean_parent = EducLunchMean_ScoreData.groupby('parental level of education')['mean score'].mean()

# We manually ordered the plot from the lowest degree to the highest
order = [5,2,4,0,1,3]
plt.figure(figsize=(10,5))
plt.scatter(mean_parent.index[order],mean_parent.values[order])
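
The positional list `[5,2,4,0,1,3]` depends on the alphabetical order in which `groupby` returns the index. One possible alternative is to reorder by label with `reindex`, which is easier to read and check. The sketch below uses made-up mean values, and `degree_order` is a hypothetical name:

```python
import pandas as pd

# Hypothetical means, indexed alphabetically as groupby would return them
mean_parent = pd.Series(
    [69.0, 72.0, 63.0, 74.0, 68.0, 65.0],
    index=["associate's degree", "bachelor's degree", "high school",
           "master's degree", "some college", "some high school"],
)

# Explicit order from the lowest degree to the highest
degree_order = ["some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree"]

# Reorder by label instead of by position
ordered = mean_parent.reindex(degree_order)
print(ordered)
```

The resulting `ordered.index` and `ordered.values` could then be passed to `plt.scatter` in place of the positional indexing.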


## Boxplot for mean score

In [None]:
EducLunchMean_ScoreData['mean score'].plot(kind='box')

## Trees!!

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as mse

data = pd.read_csv('StudentsPerformance.csv')
data.tail()

## Preparing the data for learning

In [None]:
X = pd.get_dummies(data[['gender','race/ethnicity','lunch','test preparation course']])
y = data[['math score','reading score', 'writing score']]

X.head()

## Let's remove the redundant fields

In [None]:
X = X.drop(columns=['gender_male','lunch_standard','test preparation course_none'])
X.head()
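
A possible shortcut is the `drop_first=True` option of `pd.get_dummies`, which drops one dummy column per feature automatically. Note that it removes the alphabetically first category, so it keeps `gender_male` and `lunch_standard` rather than the columns kept above; the information content is the same. A sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical toy data with two categorical features
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
})

# drop_first=True removes the first category of every feature
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```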

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y['math score'],test_size=0.3,random_state=42)

In [None]:
model = DecisionTreeRegressor(random_state=42)

model.fit(X_train,y_train)

## Eval Function

In [None]:
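def eval(x_test,y_test,model):
    # Note: this name shadows Python's built-in eval
    pred = model.predict(x_test)
    # Report the root of the mean squared error (RMSE), in the same units as the scores
    print("RMSE: {:.3f}".format(np.sqrt(mse(y_test,pred))))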

In [None]:
eval(X_test,y_test,model)
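
The number reported by the evaluation function is the square root of the mean squared error (RMSE), which is expressed in score points rather than squared points. A tiny worked example with made-up values shows the difference between the two quantities:

```python
import numpy as np

# Hypothetical true scores and predictions
y_true = np.array([70.0, 80.0, 90.0])
y_pred = np.array([72.0, 76.0, 90.0])

# Squared errors are 4, 16 and 0, so the MSE is 20 / 3
mse_value = np.mean((y_true - y_pred) ** 2)

# The RMSE is the square root of the MSE
rmse_value = np.sqrt(mse_value)

print("MSE:  {:.3f}".format(mse_value))
print("RMSE: {:.3f}".format(rmse_value))
```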

## Draw Tree function
[plot_tree docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)

In [None]:
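import sklearn.tree as tree

def plot_tree(tree_model,feat,size=(15,10)):
    # Draw the fitted tree; filled=True colors the nodes by their predicted value
    plt.figure(figsize=size)
    tree.plot_tree(tree_model,
                   feature_names=list(feat),  # a plain list works across scikit-learn versions
                   filled=True,
                   fontsize=15)
    plt.show()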

## Tree Pruning

In [None]:
model = DecisionTreeRegressor(max_depth=3,random_state=42)
model.fit(X_train,y_train)

eval(X_test,y_test,model)
plot_tree(model,X_test.columns,size=(30,20))

In [None]:
model = DecisionTreeRegressor(min_samples_split=100,random_state=42)
model.fit(X_train,y_train)

eval(X_test,y_test,model)
plot_tree(model,X_test.columns,size=(60,20))
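
Besides looking at the plots, the effect of each restriction can be summarized by the depth and the number of leaves of the fitted tree. The sketch below uses synthetic data (the features, target and noise level are invented for illustration) and the `get_depth` and `get_n_leaves` methods of the fitted regressor:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the one-hot features and a score-like target
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(300, 5))
y_toy = 60 + 5 * X_toy[:, 0] - 3 * X_toy[:, 1] + rng.normal(0, 10, size=300)

# The same restrictions as in the cells above, plus an unrestricted tree
settings = {
    "unrestricted": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=100": {"min_samples_split": 100},
}

sizes = {}
for name, params in settings.items():
    tree_model = DecisionTreeRegressor(random_state=42, **params).fit(X_toy, y_toy)
    # Record (depth, number of leaves) for each setting
    sizes[name] = (tree_model.get_depth(), tree_model.get_n_leaves())
    print(name, sizes[name])
```

A smaller tree is easier to read and usually less prone to overfitting, which is worth checking against the test error reported by the evaluation function.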