## Familiarize yourself
The first step is to get familiar with your dataset. Python has a library called Pandas that helps us analyze and manipulate data, and it provides many functions for those tasks. The central object in Pandas is the DataFrame, a tabular representation of our data.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [2]:
# It is best practice to save the data's path in a variable for future use.
data_path = "StudentsPerformance.csv"

# We read it with pandas
data = pd.read_csv(data_path)

In [3]:
# Let's take a look at the summary of our data using the describe function

data.describe()

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


- The count row shows the number of non-empty entries in each column.
- The mean row shows the average value of each numerical column.
- The std (standard deviation) row shows how spread out the values in each column are around the mean.
- The min row shows the smallest value in each column, and the max row shows the largest.
- The 25% row (the 25th percentile) shows the value below which a quarter of the entries in each numerical column fall.
- The 50% row (the median) shows the middle value of each numerical column.
- The 75% row (the 75th percentile) shows the value below which three quarters of the entries in each numerical column fall.
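To make the rows of the summary concrete, here is a minimal sketch on a toy DataFrame. The scores are invented for illustration and are not taken from `StudentsPerformance.csv`; one entry is deliberately left missing to show how `count` behaves.

```python
import pandas as pd

# Hypothetical toy scores (not from the real dataset); the last entry is missing
toy = pd.DataFrame({"math score": [40, 60, 80, 100, None]})

summary = toy.describe()
print(summary)

# count ignores the missing entry; the other rows use the four remaining values
print(summary.loc["count", "math score"])  # 4.0
print(summary.loc["mean", "math score"])   # 70.0
print(summary.loc["50%", "math score"])    # 70.0
```

Each statistic can be read from the summary with `.loc[row, column]`, which is handy when a single value is needed rather than the whole table.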


From our summary, we see that the lowest math score is 0 and the highest is 100. Quite impressive!

The mean values show that students scored highest in reading (69.17), followed by writing (68.05), with mathematics lowest (66.09).

In [4]:
# Check out the different columns including the non-numerical ones
data.columns.tolist()

['gender',
 'race/ethnicity',
 'parental level of education',
 'lunch',
 'test preparation course',
 'math score',
 'reading score',
 'writing score']

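Drop any rows that contain missing data. Note that `dropna` does not report what it removed; `data.isna().sum()` shows the number of missing values in each column.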

In [5]:
data = data.dropna(axis = 0)
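Before dropping rows, it can help to see how many values are actually missing. The sketch below uses a small made-up DataFrame (an assumption for illustration, not the real dataset) to count missing values per column and then compare the row count before and after `dropna`.

```python
import pandas as pd

# Hypothetical frame with one missing score
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72.0, None, 90.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = toy.dropna(axis=0)
print(len(toy), "->", len(cleaned))
```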

Next, select the prediction column and the columns to be used for training.
The prediction column (the target) is stored in a variable `y`. The columns used for training are called features, and they are stored in the variable `X`.

First, we rename the columns, replacing the spaces in their names with underscores, so that a column can be selected with dot notation without errors.

In [6]:
data.columns = [col.replace(' ', '_') for col in data.columns]
data.columns

Index(['gender', 'race/ethnicity', 'parental_level_of_education', 'lunch',
       'test_preparation_course', 'math_score', 'reading_score',
       'writing_score'],
      dtype='object')

In [7]:
# Select the prediction column (math_score) with dot notation.
y = data.math_score
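Dot notation only works when a column name is a valid Python identifier. Replacing spaces is enough for `math_score`, but a name such as `race/ethnicity` still contains a slash, so it needs bracket notation. A minimal sketch on a made-up two-row frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group B"],
    "math_score": [72, 69],
})

# Both forms return the same Series when the name is a valid identifier
by_dot = toy.math_score
by_bracket = toy["math_score"]
print(by_dot.equals(by_bracket))

# The slash rules out dot notation here, so brackets are required
groups = toy["race/ethnicity"].tolist()
print(groups)
```

Bracket notation works for every column name, which makes it the safer default when names come from a file.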


To select the columns for `X`, we save the names of the columns we want in a list and use that list to index the DataFrame.

In [8]:
feature_cols = ['reading_score','writing_score']
X = data[feature_cols]
X.describe()

| | reading_score | writing_score |
|---|---|---|
| count | 1000.0 | 1000.0 |
| mean | 69.169 | 68.054 |
| std | 14.600192 | 15.195657 |
| min | 17.0 | 10.0 |
| 25% | 59.0 | 57.75 |
| 50% | 70.0 | 69.0 |
| 75% | 79.0 | 79.0 |
| max | 100.0 | 100.0 |


The description of numeric columns is different from that of non-numeric columns.

If a dataset contains at least one numeric column, the describe function by default summarizes only the numeric columns and ignores the non-numeric ones, as we see above.

If there is no numeric column in the dataset, it provides a different kind of description for the non-numeric columns: the count, the number of unique values, the most frequent value, and how often that value occurs.
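The difference between the two kinds of summary can be seen on a small example. The frame below is invented for illustration; the `include="all"` argument is a standard option of `describe` that the original notebook does not use.

```python
import pandas as pd

# Hypothetical rows mixing one text column and one numeric column
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced"],
    "math_score": [72, 69, 47],
})

# Default behaviour: only the numeric column is summarized
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# With only non-numeric columns, describe reports count, unique, top and freq
text_summary = toy[["lunch"]].describe()
print(text_summary)

# include="all" summarizes both kinds of columns in a single table
print(toy.describe(include="all"))
```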

## Building the model

In [9]:
# Split the data into training data and validation data.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Build the model
student_model = DecisionTreeRegressor()

# Fit the model
student_model.fit(train_X, train_y)

DecisionTreeRegressor()
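When no size is given, `train_test_split` holds out 25% of the rows for validation, and `random_state` makes the shuffle repeatable. The sketch below checks the resulting sizes on a stand-in dataset with 1000 rows; the `toy_` names and values are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the dataset
toy_X = pd.DataFrame({"reading_score": range(1000), "writing_score": range(1000)})
toy_y = pd.Series(range(1000), name="math_score")

# Default split: 75% training rows, 25% validation rows
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=0
)
print(len(toy_train_X), len(toy_val_X))

# The same random_state reproduces the same split
again_train_X, _, _, _ = train_test_split(toy_X, toy_y, random_state=0)
print(toy_train_X.index.equals(again_train_X.index))
```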

### Model Validation
Model validation is a way of evaluating the performance of the model.
One way to validate the model is to measure how close its predictions are to the actual values on data it was not trained on.
A metric used for this is the Mean Absolute Error (MAE).
MAE is the mean of the absolute values of the errors, where

error = actual value - predicted value.
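As a quick check of the definition, the MAE can be computed by hand as the average of the absolute errors and compared with `mean_absolute_error`. The numbers below are made up for illustration.

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted scores
actual = [70, 55, 90, 62]
predicted = [65, 60, 88, 70]

# error = actual value - predicted value
errors = [a - p for a, p in zip(actual, predicted)]

# Average of the absolute errors: (5 + 5 + 2 + 8) / 4
manual_mae = sum(abs(e) for e in errors) / len(errors)
print(manual_mae)  # 5.0

# scikit-learn gives the same value
library_mae = mean_absolute_error(actual, predicted)
print(library_mae)
```

Taking the absolute value keeps positive and negative errors from cancelling each other out.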

In [10]:
val_predictions = student_model.predict(val_X)

print(mean_absolute_error(val_y, val_predictions))

8.799


### Model Optimization
Overfitting is when a model captures patterns that are specific to the training data, including noise, so it performs well on the training data but poorly on validation data and on new data.

Underfitting is when a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.

We can adjust the maximum number of leaf nodes when building the model and compare the validation error for each setting to find the tree size that performs best.
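One way to see both effects is to compare the training MAE with the validation MAE for trees of different sizes. The sketch below uses synthetic data (a noisy linear relationship, which is an assumption and not the student dataset), so the exact numbers are not meaningful; only the pattern is. A very small tree typically has a high error on both sets, while a very large tree drives the training error toward zero without a matching improvement on the validation set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only: y depends on x plus random noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(400, 1))
y_demo = 0.8 * X_demo[:, 0] + rng.normal(0, 10, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for leaves in [2, 20, 300]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"leaves={leaves:<4d} train MAE={train_mae:.2f}  validation MAE={val_mae:.2f}")
```

A large gap between the training and validation columns is the usual sign of overfitting.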


In [13]:
def get_opt_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree limited to max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    pred_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, pred_val)
    return mae

# Note: %d prints the MAE truncated to a whole number
for max_leaf_nodes in [5, 30, 40, 45, 50, 55, 60, 80, 500, 5000]:
    my_mae = get_opt_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))



Max leaf nodes: 5 		 Mean Absolute Error: 8
Max leaf nodes: 30 		 Mean Absolute Error: 7
Max leaf nodes: 40 		 Mean Absolute Error: 7
Max leaf nodes: 45 		 Mean Absolute Error: 7
Max leaf nodes: 50 		 Mean Absolute Error: 7
Max leaf nodes: 55 		 Mean Absolute Error: 7
Max leaf nodes: 60 		 Mean Absolute Error: 7
Max leaf nodes: 80 		 Mean Absolute Error: 8
Max leaf nodes: 500 		 Mean Absolute Error: 8
Max leaf nodes: 5000 		 Mean Absolute Error: 8


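From the above analysis, the lowest MAE is reported when there are 50 leaf nodes in the tree, so the model is optimized by setting the maximum number of leaf nodes to 50. Note that the printed values are truncated to whole numbers by the `%d` format, so every setting from 30 to 60 displays as 7; printing the MAE with decimals (for example with `%.3f`) is needed to tell these settings apart.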
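The effect of the format string, and a way to pick the best setting programmatically, can be sketched as follows. The MAE values and leaf counts here are invented for illustration and are not results from the student dataset.

```python
# Hypothetical MAE values keyed by max_leaf_nodes (invented, not real results)
mae_by_leaves = {10: 7.93, 20: 7.41, 30: 7.88}

# %d truncates the decimals, so all three values print as 7
for leaves, mae in mae_by_leaves.items():
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (leaves, mae))

# %.2f keeps two decimals, so the differences become visible
for leaves, mae in mae_by_leaves.items():
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %.2f" % (leaves, mae))

# Pick the setting with the lowest MAE instead of reading it off the printout
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)
print(best_leaves)  # 20
```

Storing the scores in a dictionary keeps the full precision, so the choice does not depend on how the values happen to be printed.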