This notebook is an introduction to supervised learning. It solves a classification problem using **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0).

# **Classification Problem**
We will follow these steps to solve a machine learning problem.


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve
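
As a rough orientation, the seven steps above can be sketched end to end on a small synthetic table. Everything in this sketch is an assumption made for illustration: the column names `feature_a`, `feature_b`, and `label`, the `toy_` variable names, and the synthetic data itself are not part of the data set used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. "Import" the data: a synthetic table stands in for a real CSV file.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
toy["label"] = (toy["feature_a"] + toy["feature_b"] > 0).astype(int)

# 2. Clean the data (here: drop rows with missing values, if there are any).
toy = toy.dropna()

# 3. Split the data into training and test sets.
X_toy = toy.drop(columns=["label"])
y_toy = toy["label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# 4. Create a model, and 5. train it on the training set only.
toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_tr, y_tr)

# 6. Make predictions for the held-out rows.
toy_predictions = toy_model.predict(X_te)

# 7. Evaluate: fraction of held-out rows predicted correctly.
toy_score = accuracy_score(y_te, toy_predictions)
print(f"Toy accuracy: {toy_score:.2f}")
```

The rest of the notebook applies the same sequence to the real data set.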


# Problem description
In the text cell below, state what you will predict in this classification problem (y) and which columns will be used to make the prediction (X).

In [3]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [4]:
df = pd.read_csv('cleanedfile.csv')

2. Display columns and describe the data set

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   1309 non-null   int64  
 1   Passengerid  1309 non-null   int64  
 2   Age          1309 non-null   float64
 3   Fare         1309 non-null   float64
 4   Sex          1309 non-null   int64  
 5   sibsp        1309 non-null   int64  
 6   zero         1309 non-null   int64  
 7   zero.1       1309 non-null   int64  
 8   zero.2       1309 non-null   int64  
 9   zero.3       1309 non-null   int64  
 10  zero.4       1309 non-null   int64  
 11  zero.5       1309 non-null   int64  
 12  zero.6       1309 non-null   int64  
 13  Parch        1309 non-null   int64  
 14  zero.7       1309 non-null   int64  
 15  zero.8       1309 non-null   int64  
 16  zero.9       1309 non-null   int64  
 17  zero.10      1309 non-null   int64  
 18  zero.11      1309 non-null   int64  
 19  zero.1

In [6]:
df.describe()

,Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,...,zero.11,zero.12,zero.13,zero.14,Pclass,zero.15,zero.16,zero.17,zero.18,2urvived
count,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,...,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0
mean,654.0,655.0,29.503186,33.281086,0.355997,0.498854,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.294882,0.0,0.0,0.0,0.0,0.261268
std,378.020061,378.020061,12.905241,51.7415,0.478997,1.041658,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.837836,0.0,0.0,0.0,0.0,0.439494
min,0.0,1.0,0.17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,327.0,328.0,22.0,7.8958,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
50%,654.0,655.0,28.0,14.4542,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
75%,981.0,982.0,35.0,31.275,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0
max,1308.0,1309.0,80.0,512.3292,1.0,8.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0
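
In the `df.describe()` output above, every `zero*` column that is shown has a minimum, maximum, and standard deviation of 0, so those columns cannot help a decision tree separate the classes. One possible cleaning step, sketched here on a hypothetical three-row miniature of the table (the `sample` frame and its values are made up for illustration), is to drop every column that holds a single distinct value.

```python
import pandas as pd

# Hypothetical miniature of the table above; in the notebook this would be df.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "zero": [0, 0, 0],
    "zero.1": [0, 0, 0],
    "Pclass": [3, 1, 3],
})

# A column with only one distinct value carries no information for the tree.
constant_columns = [col for col in sample.columns
                    if sample[col].nunique() <= 1]

# Keep only the columns that actually vary.
reduced = sample.drop(columns=constant_columns)
print(constant_columns)
print(list(reduced.columns))
```

Whether to apply this to the real data is a judgment call. It does not change what the tree can learn, but it makes `X` and the exported tree easier to read.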


3. Prepare Data

In [23]:
# Run this section to inspect X
X = df.drop(columns = ['Pclass'])
X

,Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,...,zero.10,zero.11,zero.12,zero.13,zero.14,zero.15,zero.16,zero.17,zero.18,2urvived
0,0,1,22.0,7.2500,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,2,38.0,71.2833,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,2,3,26.0,7.9250,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,3,4,35.0,53.1000,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,4,5,35.0,8.0500,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1304,1305,28.0,8.0500,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1305,1305,1306,39.0,108.9000,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1306,1306,1307,38.5,7.2500,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1307,1307,1308,28.0,8.0500,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
# Run this section to inspect y
y = df['Pclass']
y

0       3
1       1
2       3
3       1
4       3
       ..
1304    3
1305    1
1306    3
1307    3
1308    3
Name: Pclass, Length: 1309, dtype: int64

4. Train the model and calculate accuracy

In [25]:
# Train on 80% of the data set and hold out the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.9122137404580153
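
Because `train_test_split` is called without a fixed `random_state`, the accuracy above comes from one random split and will change from run to run. A common way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic three-class data from `make_classification` as a stand-in for `X` and `y`. The data, the `_demo` names, and the choice of 5 folds are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data stands in for the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0)

# Train and score the tree on 5 different train/test partitions.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)

# The mean summarizes accuracy; the spread shows how much it varies by split.
print(f"mean accuracy: {cv_scores.mean():.3f}")
print(f"std deviation: {cv_scores.std():.3f}")
```

With the notebook's own data, the same call would take `X` and `y` in place of `X_demo` and `y_demo`.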

5. Persisting Models

In [27]:
# Save the model to file
joblib.dump(model, 'MODELNAME.joblib')


['MODELNAME.joblib']

5.b. Import the model and make predictions

In [28]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('MODELNAME.joblib')
predictions = model.predict(X_test)
predictions

array([1, 3, 3, 3, 3, 3, 3, 1, 3, 2, 3, 2, 3, 3, 1, 3, 3, 2, 3, 3, 2, 3,
       1, 3, 1, 3, 3, 3, 3, 2, 3, 3, 3, 3, 1, 2, 3, 3, 3, 1, 2, 3, 3, 1,
       1, 2, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 2, 1, 2, 3, 1, 2, 1, 1, 3,
       1, 3, 1, 2, 2, 1, 2, 3, 1, 3, 2, 1, 3, 1, 1, 3, 3, 2, 3, 3, 3, 2,
       3, 3, 2, 2, 2, 1, 2, 3, 3, 1, 3, 3, 1, 3, 3, 3, 2, 3, 2, 1, 1, 3,
       3, 2, 3, 1, 3, 1, 3, 1, 1, 3, 1, 2, 3, 1, 1, 2, 3, 3, 3, 1, 2, 3,
       1, 2, 3, 3, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 1, 3,
       1, 3, 1, 3, 3, 3, 3, 1, 3, 2, 3, 1, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 2, 1, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 2, 3, 3,
       1, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 3, 3, 3, 1, 1,
       3, 3, 2, 3, 2, 1, 2, 1, 3, 2, 3, 3, 3, 2, 1, 3, 3, 2, 3, 1, 3, 3,
       2, 3, 3, 1, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 1, 3, 3, 3, 3, 2])
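
To check that a saved model survives the round trip through `joblib`, one option is to compare predictions before and after loading. This is a minimal sketch on synthetic data; the file name `demo_model.joblib`, the temporary directory, and the `_demo` variables are illustrative assumptions and are not files or names used elsewhere in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small tree stand in for the notebook's model.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=4, random_state=0)
original = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Save to a temporary file, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(
    original.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A loaded model is generally only safe to use with the same scikit-learn version that saved it, so it helps to record that version alongside the file.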

6. (Optional) Visualize decision trees

In [29]:
tree.export_graphviz(model, out_file='MODELNAME.dot',
                     feature_names=list(X.columns),
                     # One string label per class, in the model's own order
                     class_names=[str(c) for c in model.classes_],
                     label='all',
                     rounded=True,
                     filled=True)

# Download the file MODELNAME.dot and open it in VS Code.
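
If Graphviz or a `.dot` viewer is not available, scikit-learn's `export_text` can print the same decision rules as plain text. The sketch below uses the bundled iris data set and a depth limit of 2 so the output stays short. Both are assumptions for illustration; in this notebook the call would take `model` and `list(X.columns)` instead.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# The iris data set stands in for the notebook's data.
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else rules, one line per node.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf line reports the predicted class.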
