# WiDS Intro to ML

In [1]:
from sklearn import datasets
import pandas as pd #Import the pandas library: Pandas is a Python library for data analysis.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    accuracy_score,
)
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeClassifier # import the class DecisionTreeClassifier which implements the decision tree classifier

%matplotlib inline
import sys

## Data Source:
- Suppose we are working on a Spotify song recommendation system and want to predict whether a customer will like a given song based on the song's features.

- Data Description:
> A dataset of 2017 songs with attributes from Spotify's API. Each song is labeled "1" meaning I like it and "0" for songs I don't like. Let's build a classifier that could predict whether or not I would like a song. 

> The original dataset can be found here: 
https://www.kaggle.com/datasets/geomack/spotifyclassification

<div>
<img src="dataset-cover.jpeg" width="600"/>
</div>

## Step 1: Import the data using pandas' `read_csv()` function
- Inside the parentheses, you must pass the file path of the data.
    - *Note: `index_col=0` tells pandas to use the first column of the CSV as the index (row labels) of the DataFrame. If the dataset doesn't have an id column, you generally don't need to specify it.*
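To see what `index_col=0` changes, here is a minimal sketch on a tiny in-memory CSV. The CSV text is a made-up stand-in for `data.csv` (an assumption for illustration), laid out the same way with an unnamed id column first.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column is an unnamed row id,
# mimicking the layout of data.csv.
csv_text = ",danceability,target\n0,0.833,1\n1,0.743,1\n"

# With index_col=0 the id column becomes the row labels.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps the id column as a regular column named "Unnamed: 0".
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Without `index_col=0`, the id column would be treated as a feature later on, which is usually not what we want.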

In [2]:
spotify_df = pd.read_csv("data.csv", index_col = 0)
spotify_df.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,1,Mask Off,Future
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,1,Redbone,Childish Gambino
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,1,Xanny Family,Future
3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.23,1,Master Of None,Beach House
4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4.0,0.904,1,Parallel Lines,Junior Boys


Now we split the dataset into train and test sets:
> We divide our data into training and test sets in an 80:20 ratio (i.e., if the input dataset has 100 rows, the training set will be 80 rows randomly selected from the input, and the test set will be the remaining 20 rows).

### Golden Rule of Machine Learning

Never mix training and test data. Always keep the test set isolated, so that it is used only to make predictions and evaluate the final model.
> You can think of `train_df` as the practice exams and `test_df` as the FINAL exam.

Now let's look at some code:


In [3]:
# train_df will contain 80% of the input data (spotify_df); test_df will contain the remaining 20%
train_df, test_df = train_test_split(spotify_df, test_size=0.2)

In [4]:
# 1613/(1613 + 404) = 0.8
train_df.shape

(1613, 16)

In [5]:
test_df.shape

(404, 16)
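The split above is random, so the exact rows in each set change from run to run. Below is a minimal sketch of a reproducible split on a toy DataFrame (a stand-in, not the Spotify data). The `random_state` value of 123 is an arbitrary choice, and `stratify` is an optional extra that keeps the share of 0s and 1s the same in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for spotify_df: 100 rows, one feature and a balanced 0/1 target.
toy_df = pd.DataFrame({
    "feature": np.arange(100),
    "target": np.tile([0, 1], 50),
})

# random_state fixes the shuffle; stratify keeps the 0/1 ratio in both parts.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=123, stratify=toy_df["target"]
)

print(toy_train.shape, toy_test.shape)
print(toy_test["target"].mean())
```

Fixing `random_state` makes results comparable between runs, which helps when several people follow the same workshop notebook.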

## Step 2: Clean data
Let's look at the dataset information to see if there are incomplete rows.

Looking at `train_df.info()` shows whether each feature is numeric or textual and how many non-null values each column has, which tells us how many `NaN`s there are:
- Luckily, we don't have any null values in our data, as we can see below:

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1613 entries, 1043 to 1231
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      1613 non-null   float64
 1   danceability      1613 non-null   float64
 2   duration_ms       1613 non-null   int64  
 3   energy            1613 non-null   float64
 4   instrumentalness  1613 non-null   float64
 5   key               1613 non-null   int64  
 6   liveness          1613 non-null   float64
 7   loudness          1613 non-null   float64
 8   mode              1613 non-null   int64  
 9   speechiness       1613 non-null   float64
 10  tempo             1613 non-null   float64
 11  time_signature    1613 non-null   float64
 12  valence           1613 non-null   float64
 13  target            1613 non-null   int64  
 14  song_title        1613 non-null   object 
 15  artist            1613 non-null   object 
dtypes: float64(10), int64(4), object(2)
mem
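If you prefer to count missing values directly rather than read them off `info()`, one possible approach is `isna().sum()`. The sketch below uses a toy DataFrame with one deliberately missing value (an assumption for illustration, since the Spotify data has none).

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (not the Spotify data).
toy_df = pd.DataFrame({
    "energy": [0.43, np.nan, 0.41],
    "tempo": [150.1, 160.1, 75.0],
})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
incomplete_rows = toy_df[toy_df.isna().any(axis=1)]
print(len(incomplete_rows))
```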

In [7]:
train_df.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
1043,0.214,0.638,196422,0.634,0.0,6,0.0866,-6.474,1,0.0468,92.097,4.0,0.445,0,I Could Use a Love Song,Maren Morris
907,0.000885,0.79,312000,0.427,0.446,11,0.0514,-11.361,0,0.0541,116.986,4.0,0.961,1,Nothing,HNNY
51,0.0638,0.7,224480,0.84,0.112,7,0.141,-6.227,0,0.0292,132.475,4.0,0.745,1,The Chase,Future Islands
1107,0.163,0.817,207200,0.869,0.00226,6,0.0497,-4.791,1,0.18,96.029,4.0,0.56,0,Bailame,Nacho
1728,0.105,0.563,235853,0.487,0.0,2,0.0884,-7.775,1,0.0234,142.53,3.0,0.248,0,I'll Make Love To You,Boyz II Men


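We can see that `song_title` and `artist` are text (categorical) variables, so they would need to be transformed into numeric variables before a model can use them. The `target` is categorical too, but it is already encoded as 0 and 1.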

When a categorical variable is encoded with dummy (0/1) variables: number of dummy variables = number of categories - 1
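To make the count concrete, here is a small sketch using `pd.get_dummies` with `drop_first=True`. The `genre` column and its labels are hypothetical and only serve as an example of a variable with three categories.

```python
import pandas as pd

# Toy column with three categories (hypothetical genre labels).
toy_df = pd.DataFrame({"genre": ["rock", "pop", "jazz", "pop"]})

# drop_first=True drops one category, leaving (3 - 1) = 2 dummy columns.
dummies = pd.get_dummies(toy_df["genre"], drop_first=True)
print(dummies.columns.tolist())
```

The dropped category is implied whenever all remaining dummy columns are 0, which is why one fewer column is enough.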

### How can we do this in Python?
With scikit-learn's `ColumnTransformer` (it will be introduced in the future). If you are interested, you can learn about it here: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
<div>
<img src="columntransformer.png" width="800"/>
</div>
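For the curious, here is a minimal sketch of what a `ColumnTransformer` could look like. It is not part of this workshop's pipeline: the toy DataFrame and the choice to one-hot encode `artist` while passing `energy` through unchanged are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data with one text column and one numeric column.
toy_df = pd.DataFrame({
    "artist": ["Future", "Beach House", "Future"],
    "energy": [0.434, 0.338, 0.412],
})

# One-hot encode the text column; pass the remaining columns through unchanged.
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["artist"]),
    ],
    remainder="passthrough",
)

transformed = preprocessor.fit_transform(toy_df)
print(transformed.shape)
```

With two distinct artists the encoder produces two columns, and the numeric column adds a third.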

### For simplicity, we will remove the text features and train the model on the numeric features only, since text features need some preprocessing.

In [8]:
train_df = train_df.drop(columns=["song_title", "artist"])
test_df = test_df.drop(columns=["song_title", "artist"])

## Step 3: Train Model

### Now we are ready to train our model!
The next step is to build a model using a machine learning algorithm. There are many algorithms to choose from, and each has its pros and cons in terms of performance. For this workshop we will use a very simple algorithm called a decision tree.
- We don't have to program this algorithm ourselves: it is already implemented in the scikit-learn library (the imports at the top include the `DecisionTreeClassifier` class, which implements the decision tree classifier).

### Here we separate the target (AKA response variable, y variable, ...) from the features (AKA independent variables, predictors, X variables, ...)


In [9]:
# features, target: used FOR TRAINING; think of them as the practice exams and their answer keys
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]

# features, target: used FOR TESTING; think of them as the FINAL exam and its answer key
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

In [10]:
X_train

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
1043,0.214000,0.638,196422,0.634,0.000000,6,0.0866,-6.474,1,0.0468,92.097,4.0,0.445
907,0.000885,0.790,312000,0.427,0.446000,11,0.0514,-11.361,0,0.0541,116.986,4.0,0.961
51,0.063800,0.700,224480,0.840,0.112000,7,0.1410,-6.227,0,0.0292,132.475,4.0,0.745
1107,0.163000,0.817,207200,0.869,0.002260,6,0.0497,-4.791,1,0.1800,96.029,4.0,0.560
1728,0.105000,0.563,235853,0.487,0.000000,2,0.0884,-7.775,1,0.0234,142.530,3.0,0.248
...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,0.107000,0.645,591707,0.924,0.026600,10,0.0982,-7.697,0,0.0569,90.852,4.0,0.886
1586,0.033600,0.890,129187,0.891,0.000000,2,0.0970,-5.542,1,0.0483,119.993,4.0,0.947
755,0.023300,0.824,299960,0.572,0.000008,11,0.2080,-4.868,1,0.0652,153.977,4.0,0.669
1132,0.001390,0.613,189387,0.874,0.000002,2,0.6560,-3.594,1,0.0697,108.038,4.0,0.532


In [11]:
y_train

1043    0
907     1
51      1
1107    0
1728    0
       ..
126     1
1586    0
755     1
1132    0
1231    0
Name: target, Length: 1613, dtype: int64

Let's talk about decision trees. Our goal here is to determine whether the customer likes the song or not. This is a classification problem, and a decision tree is one algorithm we can use to build a model for it.

### What is a Decision Tree?

- A classification model that predicts the value of the target by learning simple "rules" from patterns in the data.

Let us consider the following data:
<div>
<img src="img.png" width="1000"/>
</div>

This is one possible tree for this problem.
<div>
<img src="DecisionTree.png" width="1000"/>
</div>
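To see such "rules" in text form, here is a minimal sketch that fits a tree on six made-up songs and prints what it learned with scikit-learn's `export_text`. The feature values and labels are invented for illustration and are not taken from the Spotify data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Six made-up songs: the liked ones (1) all have high danceability.
toy_X = pd.DataFrame({
    "danceability": [0.9, 0.8, 0.85, 0.3, 0.2, 0.4],
    "energy":       [0.5, 0.6, 0.4, 0.5, 0.6, 0.4],
})
toy_y = [1, 1, 1, 0, 0, 0]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# export_text prints the learned if/else rules, one line per split or leaf.
rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(rules)
```

Because a single threshold on `danceability` separates the two classes here, the tree needs only one split.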

**Pros of Decision Tree Classifier:**

- simple to understand
- easy to visualize
- generates understandable rules
- performs classification without requiring much computation

**Cons:**

- less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute
- prone to errors in classification problems with many classes and a relatively small number of training examples

### How can we build a decision tree model in Python?

Now let's fit the model:


In [12]:
# We need to create an object, let's call it model, and set it to a new instance of Decision Tree Classifier.
model = DecisionTreeClassifier()
# next we need to train it, so it learns the patterns of the INPUT train-set & OUTPUT train-set. Recall the Golden Rule of ML.
# The input is X_train, the output is y_train
model.fit(X_train, y_train)
# The output is just a visual for the model object you just fitted/trained.

## Step 4: Make Predictions
After fitting the model, we can predict on the testing set, and also evaluate our decision tree by calculating its accuracy.

In [13]:
predict = model.predict(X_test)
predict

array([1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,

In [14]:
# combine the predicted values with a copy of the test set (the copy leaves test_df itself unchanged):
pred_result = test_df.copy()
pred_result['predicted'] = predict.tolist()

In [15]:
pred_result

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,predicted
210,0.827000,0.583,248213,0.776,0.000038,3,0.2980,-5.293,1,0.0661,126.536,4.0,0.782,1,1
1068,0.422000,0.482,230832,0.739,0.000019,7,0.4680,-5.476,1,0.0342,155.932,4.0,0.596,0,0
995,0.000005,0.566,123547,0.869,0.309000,0,0.0503,-7.614,1,0.0335,129.948,4.0,0.656,1,1
1476,0.131000,0.522,210816,0.860,0.020900,10,0.1220,-3.773,0,0.1450,128.019,4.0,0.263,0,0
683,0.055700,0.599,222160,0.677,0.000180,11,0.4340,-6.328,1,0.0314,126.012,4.0,0.108,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,0.539000,0.689,171560,0.830,0.000000,8,0.0882,-8.774,1,0.0954,140.732,4.0,0.822,1,1
477,0.233000,0.328,225627,0.827,0.005170,6,0.6320,-2.545,0,0.2520,103.127,4.0,0.709,1,0
1192,0.056700,0.467,293533,0.564,0.000000,9,0.1940,-4.986,0,0.3530,81.966,4.0,0.304,0,0
333,0.119000,0.716,263817,0.647,0.000014,1,0.1210,-4.008,1,0.0345,131.040,4.0,0.159,1,1


## Step 5: Evaluate the model and improve its performance (accuracy)

In [17]:
accuracy = accuracy_score(y_test, predict)
accuracy

0.6782178217821783

> Test accuracy is one metric for how well our model generalizes to unseen data (this is the key goal in machine learning: our model should generalize well to new observations). We want the model to have high accuracy on the test set, so that we can be more confident in its accuracy after it is deployed in the real world.
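One way to check how well a model generalizes is to compare its accuracy on the training set with its accuracy on the test set. The sketch below does this on synthetic data from `make_classification` (an assumption, used so the example runs without `data.csv`). All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Spotify dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A tree with no depth limit keeps splitting until its leaves are pure,
# so it can fit the training rows exactly.
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_accuracy = accuracy_score(y_tr, demo_tree.predict(X_tr))
test_accuracy = accuracy_score(y_te, demo_tree.predict(X_te))
print(train_accuracy, test_accuracy)
```

A large gap between the two numbers suggests the tree has memorized the training data rather than learned general patterns, which is what the tuning step tries to reduce.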

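### Improving the model accuracy by hyperparameter tuning
Here we tune the `max_depth` hyperparameter of the decision tree: for each candidate depth, we run 10-fold cross-validation on the training set and record the mean training score and the mean validation score: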

In [18]:
results_dict = {
    "max_depth": [],
    "mean_train_score": [],
    "mean_cv_score": []
}

for depth in range(1, 26):
    model = DecisionTreeClassifier(max_depth=depth)
    cv_score = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
    results_dict["max_depth"].append(depth)
    results_dict["mean_cv_score"].append( np.mean(cv_score["test_score"]))
    results_dict["mean_train_score"].append( np.mean(cv_score["train_score"]))

result_df = pd.DataFrame(results_dict)
result_df

Unnamed: 0,max_depth,mean_train_score,mean_cv_score
0,1,0.629744,0.620585
1,2,0.702074,0.676405
2,3,0.724184,0.678867
3,4,0.745747,0.693126
4,5,0.783771,0.703673
5,6,0.818695,0.71544
6,7,0.850658,0.712376
7,8,0.878693,0.701192
8,9,0.903973,0.694349
9,10,0.927532,0.696197
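The loop above is written by hand; scikit-learn's `GridSearchCV` is an alternative that runs the same kind of search and keeps track of the best setting. The sketch below is not the workshop's method. It uses synthetic data from `make_classification` and a depth range of 1 to 10, both chosen here only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Spotify dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Try each max_depth with 10-fold cross-validation, as in the loop above.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=10,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # depth with the highest mean CV score
print(round(search.best_score_, 3))  # that mean CV score
```

After fitting, `search.best_estimator_` is a tree refitted on all of the data passed to `fit` using the best depth, so it can be used for prediction directly.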


In [42]:
# pick the max_depth whose mean cross-validation score is highest
optimized_depth = int(result_df.loc[result_df["mean_cv_score"].idxmax(), "max_depth"])
model_optimized = DecisionTreeClassifier(max_depth = optimized_depth)
model_optimized.fit(X_train, y_train)

In [43]:
optimized_depth

6

In [44]:
predict = model_optimized.predict(X_test)

accuracy = accuracy_score(y_test, predict)

accuracy

0.9473684210526315

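In reality, most classifiers never reach 100% accuracy, and ours doesn't either. Note that the value printed above is not consistent with our 404-song test set: 0.9474 equals 36/38, which is not a multiple of 1/404, so it appears to be left over from a different run. With a mean cross-validation score of about 0.715 at depth 6, a test accuracy near that value is more plausible when the cell is rerun. This is just a quick demo of Step 5, where further improvements to the model can be made using hyperparameter tuning.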

### Exercise for you: Let's dive into some Python code now

Now we will build a model on the iris flower dataset. Start by importing the `datasets` module from `sklearn` and loading the iris dataset.

In [52]:
#Load dataset + create a dataframe of this iris dataset. (you can ignore this step)
iris =  datasets.load_iris()
data = pd.DataFrame({
    'sepal_length': iris.data[:,0],
    'sepal_width': iris.data[:,1],
    'petal_length': iris.data[:,2],
    'petal_width': iris.data[:,3],
    'species': iris.target
})

data.head(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


Let's print the target and feature variable names just to make sure that we are using the right dataset

In [53]:
print(iris.target_names)
print(iris.feature_names)

['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


We use the `train_test_split` function to split the data into training and test sets (75% for training and 25% for testing), then train the model on the training set and make predictions on the test set.

In [54]:
train_df, test_df = train_test_split(data, test_size = 0.25)

Since the species of iris flower is what we want to classify, we first separate the columns into the dependent variable (`species`) and the independent variables.
The steps are as follows:

In [58]:
# Separate columns into dependent and independent variables
X_train, y_train = train_df.drop(columns=["species"]), train_df["species"]
X_test, y_test = test_df.drop(columns=["species"]), test_df["species"]

In [59]:
# create the classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# make the prediction using the Decision Tree model
y_pred = model.predict(X_test)

We now examine the Decision tree model's accuracy using the actual y (species) value and the predicted values given by the model.

In [60]:
print("The accuracy for Decision Tree model is: ", round(accuracy_score(y_test, y_pred),4))

The accuracy for Decision Tree model is:  0.9737


The accuracy is pretty high for such a simple model! To make a prediction for a single flower, we can also use the `predict()` function.  
For example:  
    - sepal length = 3  
    - sepal width = 6  
    - petal length = 6  
    - petal width = 4  
Now we can predict which type of iris flower it is, as shown below.

In [62]:
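# pass a one-row DataFrame with the training column names so the features line up with X_train
model.predict(pd.DataFrame([[3, 6, 6, 4]], columns=X_train.columns))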



array([2])

Here, the output is 2, which corresponds to the iris species virginica (index 2 in `iris.target_names`).
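Instead of looking up the class index by hand, the predicted number can be translated back to a species name with `iris.target_names`. In this short sketch the array `predicted` is typed in by hand to mirror the output above rather than being produced by a model.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Hand-typed stand-in for the array returned by model.predict above.
predicted = np.array([2])

# Index target_names with the predicted class indices to get species names.
predicted_names = iris.target_names[predicted]
print(predicted_names)
```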

<div>
<img src="survey.jpg" width="500"/>
</div>
Congratulations! You have made it this far and now know what a typical decision tree classifier in Python looks like. Please spend a minute filling out this post-workshop survey. Your feedback means a lot to us.