This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0). 

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and improve
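
The steps above can be sketched end to end on a small synthetic data set. This is only an illustration: the column names `feature_a`, `feature_b`, and `label`, and the rule that generates the label, are made up for the example and are not part of the hospital data used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: "import" the data. Hypothetical toy data stands in for a CSV file.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.integers(0, 18, size=200),
    "feature_b": rng.integers(0, 2, size=200),
})
# The label follows a simple rule on feature_a so the tree has a pattern to learn.
toy["label"] = (toy["feature_a"] > 9).astype(int)

# Step 2: clean the data (here: drop rows with missing values, if any).
toy = toy.dropna()

# Step 3: split into training and test sets (80% / 20%).
X_toy = toy.drop(columns=["label"])
y_toy = toy["label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# Steps 4 and 5: create and train the model.
toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_tr, y_tr)

# Steps 6 and 7: make predictions and evaluate them.
toy_predictions = toy_model.predict(X_te)
toy_score = accuracy_score(y_te, toy_predictions)
print(toy_score)
```

The names are prefixed with `toy` so that running this cell does not overwrite the `X`, `y`, or `model` variables used in the rest of the notebook.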


# Problem description
In the text cell below, state what you will be predicting in this classification problem (y) and which columns will be used to make the prediction (X).

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [3]:
df = pd.read_csv('cleanedfile.csv')

2. Display columns and describe the data set

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499 entries, 0 to 498
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   AGE     499 non-null    int64
 1   FEMALE  499 non-null    int64
 2   LOS     499 non-null    int64
 3   RACE    499 non-null    int64
 4   TOTCHG  499 non-null    int64
 5   APRDRG  499 non-null    int64
dtypes: int64(6)
memory usage: 23.5 KB


In [5]:
df.describe()

Unnamed: 0,AGE,FEMALE,LOS,RACE,TOTCHG,APRDRG
count,499.0,499.0,499.0,499.0,499.0,499.0
mean,5.096192,0.511022,2.829659,1.078156,2777.631263,616.312625
std,6.952706,0.50038,3.366657,0.514746,3891.632405,178.491837
min,0.0,0.0,0.0,1.0,532.0,21.0
25%,0.0,0.0,2.0,1.0,1218.5,640.0
50%,0.0,1.0,2.0,1.0,1538.0,640.0
75%,13.0,1.0,3.0,1.0,2530.5,751.0
max,17.0,1.0,41.0,6.0,48388.0,952.0


3. Prepare Data

In [121]:
# Run this section to inspect X
X = df.drop(columns = ['LOS'])
X

Unnamed: 0,AGE,FEMALE,RACE,TOTCHG,APRDRG
0,17,1,1,2660,560
1,17,0,1,1689,753
2,17,1,1,20060,930
3,17,1,1,736,758
4,17,1,1,1194,754
...,...,...,...,...,...
494,0,1,1,5881,636
495,0,1,1,1171,640
496,0,1,1,1171,640
497,0,1,1,1086,640


In [122]:
# Run this section to inspect y
y = df['LOS']
y

0      2
1      2
2      7
3      1
4      1
      ..
494    6
495    2
496    2
497    2
498    4
Name: LOS, Length: 499, dtype: int64
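
Since `LOS` ranges from 0 to 41 in the `describe()` output, the target has many possible classes, and some of them may have very few rows. A quick way to check is `value_counts()`. The sketch below uses only the ten `LOS` values visible in the output above as a stand-in (`y_sample` is a hypothetical name); in the notebook you would call `y.value_counts()` on the full column instead.

```python
import pandas as pd

# Stand-in for the notebook's y: only the ten LOS values shown in the output above.
y_sample = pd.Series([2, 2, 7, 1, 1, 6, 2, 2, 2, 4], name="LOS")

# Count how many rows fall into each class, most frequent first.
class_counts = y_sample.value_counts()
print(class_counts)

# Share of the most common class. A model that always guesses this class
# would reach this accuracy, so it is a useful baseline to compare against.
majority_share = class_counts.max() / len(y_sample)
print(majority_share)
```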

4. Calculate accuracy

In [190]:
# Train on 80% of the data set and use the remaining 20% to test
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.72
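
Because `train_test_split` shuffles the rows randomly unless `random_state` is set, the score above can change from run to run. The sketch below shows two common ways to get a steadier estimate: fixing `random_state`, and averaging over several splits with `cross_val_score`. It runs on synthetic data: the values in `X_demo` and the rule behind `y_demo` are invented for the example, and `holdout_score` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the hospital data set).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "AGE": rng.integers(0, 18, size=300),
    "FEMALE": rng.integers(0, 2, size=300),
})
y_demo = X_demo["AGE"] // 6  # three made-up classes: 0, 1, 2


def holdout_score(seed):
    """Train on 80% and score on 20%, with the split fixed by `seed`."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


# The same seed gives the same split, so the score is repeatable.
first = holdout_score(0)
second = holdout_score(0)

# 5-fold cross-validation: five different train/test splits, one score each.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)
print(first, second, cv_scores.mean())
```

Reporting the mean of `cv_scores` rather than a single split makes it easier to tell whether a later change to the model is a real improvement.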

5. Persisting Models

In [192]:
# Save the model to file
joblib.dump(model, 'hospital-stay-length.joblib')


['hospital-stay-length.joblib']

5.b. Import the model and make predictions

In [193]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('hospital-stay-length.joblib')
predictions = model.predict(X_test)
predictions

array([ 2,  2,  2,  2,  2,  2,  2,  2,  1,  2,  2,  1,  1,  2,  2,  1,  4,
        2,  3,  1,  1,  4,  2,  2,  3,  2,  2,  2,  2,  1,  2,  1,  3,  1,
        3,  4,  3,  4,  3,  3,  3,  2,  0,  1,  4,  3,  2,  7,  2,  3,  2,
        3,  4,  2,  1,  2,  0,  2,  2,  3,  4,  2,  1,  6,  3,  1,  2,  3,
        1,  4,  0,  1,  3,  1,  1,  2,  2,  1,  1,  2,  3, 18,  3,  3,  2,
        3,  3,  2,  2,  2,  3,  1,  3,  0, 18,  3,  2,  2, 18,  2])
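
A loaded model can also score records it has never seen, as long as they have the same columns, in the same order, as the training data. The sketch below is a minimal, self-contained illustration: it trains on four rows copied from the outputs above, saves to a temporary directory instead of the notebook's `hospital-stay-length.joblib`, and predicts for a hypothetical patient whose values are invented for the example.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

feature_columns = ["AGE", "FEMALE", "RACE", "TOTCHG", "APRDRG"]

# Four rows copied from the X and y outputs above (rows 0, 1, 494, 495).
X_small = pd.DataFrame(
    [[17, 1, 1, 2660, 560],
     [17, 0, 1, 1689, 753],
     [0, 1, 1, 5881, 636],
     [0, 1, 1, 1171, 640]],
    columns=feature_columns,
)
y_small = pd.Series([2, 2, 6, 2], name="LOS")

small_model = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)

# Save and reload the model; the temporary directory is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo-model.joblib")
    joblib.dump(small_model, model_path)
    loaded_model = joblib.load(model_path)

# Hypothetical new patient (made-up values), with columns in training order.
new_patient = pd.DataFrame(
    [{"AGE": 5, "FEMALE": 1, "RACE": 1, "TOTCHG": 1500, "APRDRG": 640}]
)[feature_columns]

new_prediction = loaded_model.predict(new_patient)
print(new_prediction)
```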

6. (Optional) Visualize decision trees

In [194]:
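# class_names must be a list with one string per class the model was trained on.
# str(sorted(...)) would produce a single string that gets indexed character by character.
tree.export_graphviz(model, out_file = 'hospital-stay-length.dot',
                    feature_names = list(X.columns),
                    class_names = [str(c) for c in model.classes_],
                    label = 'all',
                    rounded = True,
                    filled = True)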

# Download the file hospital-stay-length.dot and open it in VS Code.
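
If a Graphviz viewer is not available, the tree can also be printed as plain text with `sklearn.tree.export_text`. The sketch below is an illustration only: it fits a tiny tree on four rows copied from the outputs above, so the printed rules will not match those of the model trained on the full data set. In the notebook, the same call would take `model` and `list(X.columns)`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Four rows copied from the X and y outputs above (rows 0, 1, 494, 495).
X_small = pd.DataFrame(
    [[17, 1, 1, 2660, 560],
     [17, 0, 1, 1689, 753],
     [0, 1, 1, 5881, 636],
     [0, 1, 1, 1171, 640]],
    columns=["AGE", "FEMALE", "RACE", "TOTCHG", "APRDRG"],
)
y_small = pd.Series([2, 2, 6, 2], name="LOS")

clf = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)

# One line per node: the split condition, then the predicted class at each leaf.
rules = export_text(clf, feature_names=list(X_small.columns))
print(rules)
```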
