This notebook is an introduction to supervised learning. It solves a classification problem using **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0). 

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve
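
The steps above can be sketched end to end as follows. This is a minimal illustration rather than this notebook's actual pipeline: it assumes synthetic data from `make_classification` in place of the CSV file, and every name ending in `_demo`, as well as `X_tr`, `X_te`, `y_tr`, `y_te`, is made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: stand-in for importing and cleaning data (synthetic, not the notebook's CSV)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# Step 3: hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Steps 4-5: create the model and train it on the training rows only
model_demo = DecisionTreeClassifier(random_state=0)
model_demo.fit(X_tr, y_tr)

# Step 6: predict labels for rows the model has not seen
predictions_demo = model_demo.predict(X_te)

# Step 7: compare predictions with the true labels
score_demo = accuracy_score(y_te, predictions_demo)
print(f"accuracy on held-out rows: {score_demo:.2f}")
```

Fixing `random_state` makes the split and the tree reproducible, which helps when comparing changes made in step 7.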


# Problem description
In the text cell below, describe what you will be predicting in this classification problem (y) and which columns will be used for the prediction (X).

In [140]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [141]:
df = pd.read_csv('originalfile.csv')
df

,age,sex,bmi,children,smoker,region,charges
0,19,0,27,0,1,3,16884
1,18,1,33,1,0,2,1725
2,28,1,33,3,0,2,4449
3,33,1,22,0,0,1,21984
4,32,1,28,0,0,1,3866
...,...,...,...,...,...,...,...
343,63,1,36,0,0,0,13981
344,49,0,41,4,0,2,10977
345,34,0,29,3,0,2,6184
346,33,1,35,2,0,2,4889


2. Display columns and describe the data set

In [142]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 348 entries, 0 to 347
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       348 non-null    int64
 1   sex       348 non-null    int64
 2   bmi       348 non-null    int64
 3   children  348 non-null    int64
 4   smoker    348 non-null    int64
 5   region    348 non-null    int64
 6   charges   348 non-null    int64
dtypes: int64(7)
memory usage: 19.2 KB


In [143]:
df.describe()

,age,sex,bmi,children,smoker,region,charges
count,348.0,348.0,348.0,348.0,348.0,348.0,348.0
mean,39.591954,0.508621,30.16092,1.091954,0.232759,1.497126,14015.939655
std,14.417015,0.500646,5.670844,1.192021,0.423198,1.104089,12638.895029
min,18.0,0.0,15.0,0.0,0.0,0.0,1137.0
25%,27.0,0.0,26.0,0.0,0.0,1.0,4887.5
50%,40.0,1.0,30.0,1.0,0.0,2.0,9718.5
75%,53.0,1.0,34.0,2.0,0.0,2.0,19005.75
max,64.0,1.0,49.0,5.0,1.0,3.0,51194.0
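
`df.info()` reports no missing values here, so no cleaning is needed for this file. For other data sets, a check along these lines could cover the "Clean the Data" step. It is only a sketch: the `toy` frame and its values are invented for illustration, and dropping rows is just one possible way to handle missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "age": [19, 18, 18, 28, 33],
    "bmi": [27.0, 33.0, 33.0, np.nan, 22.0],
    "charges": [16884, 1725, 1725, 4449, 21984],
})

# Count missing values per column and fully duplicated rows
missing_per_column = toy.isna().sum()
n_duplicates = int(toy.duplicated().sum())
print(missing_per_column)
print("duplicated rows:", n_duplicates)

# One simple option: drop duplicated rows, then rows with missing values
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)
print(cleaned)
```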


3. Prepare Data

In [144]:
# Run this section to inspect X
X = df.drop(columns = ['charges'])
X

,age,sex,bmi,children,smoker,region
0,19,0,27,0,1,3
1,18,1,33,1,0,2
2,28,1,33,3,0,2
3,33,1,22,0,0,1
4,32,1,28,0,0,1
...,...,...,...,...,...,...
343,63,1,36,0,0,0
344,49,0,41,4,0,2
345,34,0,29,3,0,2
346,33,1,35,2,0,2


In [145]:
# Run this section to inspect y
y = df['charges']
y

0      16884
1       1725
2       4449
3      21984
4       3866
       ...  
343    13981
344    10977
345     6184
346     4889
347     8334
Name: charges, Length: 348, dtype: int64
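
A classifier treats every distinct value of y as a separate class, so it is worth checking how many distinct values the target has before training. The sketch below assumes a stand-in series `y_sample` built from the first five values printed above. On the real data the same calls would be made on `y`.

```python
import pandas as pd

# First five target values from the output above, used as a stand-in for y
y_sample = pd.Series([16884, 1725, 4449, 21984, 3866], name="charges")

# Number of distinct values, i.e. the number of classes a classifier would see
n_classes = y_sample.nunique()

# Share of rows whose value is a class of its own (1.0 means every row is unique)
unique_ratio = n_classes / len(y_sample)

# Size of the most frequent class
largest_class_size = int(y_sample.value_counts().max())

print(n_classes, unique_ratio, largest_class_size)
```

A ratio close to 1.0 suggests the target behaves like a continuous quantity, in which case the classes would have too few examples each for a classifier to learn from.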

4. Split the data, train the model, and calculate accuracy

In [151]:
# Train on 80% of the data set and use the remaining 20% to test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.014285714285714285
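
An accuracy of about 0.014 corresponds to roughly 1 correct prediction out of 70 test rows. A plausible explanation is that `charges` is close to a continuous value, so almost every row is its own class and an exact match is very unlikely. One possible remedy, sketched below, is to bin the target into a few classes before training. The data here is synthetic, the formula for `charges` is invented, and the names `demo`, `charge_band`, and `band_model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data; the charges formula is invented
rng = np.random.default_rng(0)
n = 348
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.integers(15, 50, size=n),
    "smoker": rng.integers(0, 2, size=n),
})
demo["charges"] = (
    250 * demo["age"] + 20000 * demo["smoker"] + rng.integers(0, 3000, size=n)
)

# Bin the continuous target into three roughly equal-sized classes
demo["charge_band"] = pd.qcut(
    demo["charges"], q=3, labels=["low", "medium", "high"]
).astype(str)

X_b = demo[["age", "bmi", "smoker"]]
y_b = demo["charge_band"]

# stratify keeps the class proportions similar in both splits
X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(
    X_b, y_b, test_size=0.2, random_state=0, stratify=y_b
)

band_model = DecisionTreeClassifier(random_state=0)
band_model.fit(X_b_train, y_b_train)
band_score = accuracy_score(y_b_test, band_model.predict(X_b_test))
print(f"accuracy on binned target: {band_score:.2f}")
```

If the exact amount matters rather than a band, a regression model such as `DecisionTreeRegressor` with an error metric would be the more natural fit, but that goes beyond the classification setup used here.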

5. Persisting Models

In [147]:
# Save the model to file
joblib.dump(model, 'MODELNAME.joblib')


['MODELNAME.joblib']

5.b. Import the model and make predictions

In [148]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('MODELNAME.joblib')
predictions = model.predict(X_test)
predictions

array([ 5354,  3392,  7740,  9722,  7726,  4349, 11987,  1837,  2719,
        5125, 35491,  5246, 13880,  5246,  8232,  5989,  5253,  8240,
        1639, 14590,  5246,  1725,  8835, 29523, 38511,  2719,  3645,
       10942,  5400,  7749, 13047,  8444,  7281, 10797,  8240, 36950,
       19444, 18972, 38746,  8444, 47055, 43921,  5989, 21223,  9625,
        2155, 24476, 11741, 13880, 13217, 11837, 11743,  5246,  8232,
       38511,  8606, 10355, 13880, 14001,  6203, 37165,  6272,  3877,
        8116,  8835, 29523,  6272,  9722, 10797, 21984])
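
To confirm that a saved model behaves like the one in memory, the predictions before and after the round trip can be compared. This is a sketch on synthetic data. The file name `demo_model.joblib` and the variables `original` and `restored` are hypothetical, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a fitted tree to stand in for the notebook's model
X_p, y_p = make_classification(n_samples=100, n_features=6, random_state=0)
original = DecisionTreeClassifier(random_state=0).fit(X_p, y_p)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same_predictions = bool(
    np.array_equal(original.predict(X_p), restored.predict(X_p))
)
print("identical predictions:", same_predictions)
```

Note that a loaded file reflects the model as it was when `joblib.dump` ran, so the model should be saved again after any retraining.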

6. (Optional) Visualize decision trees

In [149]:
# class_names must be a list of strings, one per class the model was trained on
tree.export_graphviz(model, out_file = 'MODELNAME.dot',
                    feature_names = list(X.columns),
                    class_names = [str(c) for c in model.classes_],
                    label = 'all',
                    rounded = True,
                    filled = True)

# Download the file MODELNAME.dot and open it in VS Code.
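
If no Graphviz viewer is available, the tree can also be printed as plain text with `sklearn.tree.export_text`. The sketch below assumes synthetic data and a shallow tree (`max_depth=2`) to keep the output short. The feature names are borrowed from this notebook's columns purely as labels.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names reused as labels only; the data itself is synthetic
feature_names = ["age", "sex", "bmi", "children", "smoker", "region"]
X_v, y_v = make_classification(n_samples=100, n_features=6, random_state=0)

# A shallow tree keeps the printed rules readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_v, y_v)

# One line per split or leaf, indented by depth
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```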
