# WiDS Intro to ML

In [None]:
from sklearn import datasets
import pandas as pd #Import the pandas library: Pandas is a Python library for data analysis.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier # import the class DecisionTreeClassifier which implements the decision tree classifier

%matplotlib inline
import sys

## Data Source:
- Suppose we are working on a Spotify song recommendation system and want to predict, based on a song's features, whether a customer will like it or not.

- Data Description:
> A dataset of 2017 songs with attributes from Spotify's API. Each song is labeled "1" meaning I like it and "0" for songs I don't like. Let's build a classifier that could predict whether or not I would like a song. 

> The original dataset can be found here: 
https://www.kaggle.com/datasets/geomack/spotifyclassification

<div>
<img src="dataset-cover.jpeg" width="600"/>
</div>

## Step 1: Import the data using pandas' `read_csv()` function:
- Inside the parentheses, you must include the file path of the data.
    - *Note: `index_col=0` tells pandas which column of the CSV to use as the index (row labels) of the dataframe. If the dataset doesn't have an id column, you generally don't need to specify it.*
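
To see what `index_col=0` changes, here is a minimal sketch that reads a tiny made-up CSV from a string instead of the real `data.csv` (the column values are hypothetical):

```python
import io

import pandas as pd

# A tiny, made-up CSV whose first (unnamed) column is a row id.
csv_text = ",danceability,target\n0,0.83,1\n1,0.74,0\n"

# Without index_col, the id column is read as an ordinary data column.
plain_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain_df.columns.tolist())
print(indexed_df.columns.tolist())
```

Only the second dataframe keeps the id out of the feature columns, which matters later when every remaining column is fed to the model.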

In [None]:
spotify_df = pd.read_csv("data.csv", index_col = 0)
spotify_df.head()

Now we split the dataset into train and test sets:
> We divide our data into training and test sets in an 80:20 ratio (i.e., if your input dataset has 100 rows, the training set will be 80 rows randomly selected from the input, and the test set will have the remaining 20 rows).

### Golden Rule of Machine Learning

Never mix training and test data. Always keep the test set isolated so that the model never sees it during training; we use it only at the end to make predictions and evaluate our model.
> You can think of `train_df` as the Practice exams & `test_df` as the FINAL exam.

Now let's look at some code:


In [None]:
# train_df will contain 80% of the input data (spotify_df), test_df will contain the remaining 20%
train_df, test_df = train_test_split(spotify_df, test_size=0.2)

In [None]:
# 1613 training rows out of 2017 in total: 1613/(1613 + 404) is about 0.8
train_df.shape

In [None]:
test_df.shape
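
Because the split is random, the rows in each set change every time the cell runs. A minimal sketch, assuming a small made-up dataframe rather than the Spotify data, of how the `random_state` parameter of `train_test_split` makes the split repeatable (the value 123 is arbitrary):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 10 rows, standing in for the real dataset.
toy_df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Using the same random_state twice gives the same split twice.
train_a, test_a = train_test_split(toy_df, test_size=0.2, random_state=123)
train_b, test_b = train_test_split(toy_df, test_size=0.2, random_state=123)

print(train_a.shape, test_a.shape)
print(test_a.index.equals(test_b.index))
```

Fixing the seed is handy in a workshop, since everyone then gets the same rows and the same accuracy numbers.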

## Step 2: Clean data
Let's look at the dataset information to see if there are incomplete rows.

`train_df.info()` gives us a rough idea of whether each feature is numeric or textual, and of how many non-null values (and therefore how many `NaN`s) each feature has:
- Luckily, we don't have any null values in our data, as we can see below:

In [None]:
train_df.info()
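
If you prefer an explicit count of missing values per column over reading the `info()` summary, `isna().sum()` gives it directly. A small sketch on a made-up dataframe with one deliberately missing value:

```python
import numpy as np
import pandas as pd

# Made-up data: the second tempo value is missing on purpose.
toy_df = pd.DataFrame({"tempo": [120.0, np.nan, 98.5], "target": [1, 0, 1]})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```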

In [None]:
train_df.head()

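We can see that `song_title` and `artist` are text (categorical) variables, so they would need to be transformed into numeric variables before the model can use them. The `target` is categorical too, but it is already labeled 0/1.

A common way to do this is to create dummy variables, where: number of dummy variables = number of categories - 1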
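
As a rough illustration of this rule, pandas' `get_dummies` with `drop_first=True` turns a column with 3 categories into 2 dummy columns. The `genre` column here is made up for the example; it is not part of the Spotify data:

```python
import pandas as pd

# Made-up categorical column with 3 distinct categories.
toy_df = pd.DataFrame({"genre": ["rock", "pop", "jazz", "pop"]})

# drop_first=True drops one category, leaving (categories - 1) columns.
dummies = pd.get_dummies(toy_df["genre"], drop_first=True)
print(dummies)
```

The dropped category is still represented: it is the row where all the remaining dummy columns are zero.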

### How can we do this in Python?
With a `ColumnTransformer` (it will be introduced in a future session). If you are interested, you can learn about it here: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
<div>
<img src="columntransformer.png" width="800"/>
</div>
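
For the curious, here is a minimal sketch of what a `ColumnTransformer` could look like. It is not used in the rest of this notebook, and the toy dataframe and its values are made up:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up data with one text column and one numeric column.
toy_df = pd.DataFrame({
    "artist": ["A", "B", "A", "C"],
    "tempo": [120.0, 98.5, 130.2, 110.0],
})

# One-hot encode the text column; pass the numeric column through unchanged.
preprocessor = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["artist"])],
    remainder="passthrough",
)
transformed = preprocessor.fit_transform(toy_df)
print(transformed.shape)
```

The three distinct artists become three indicator columns, and `tempo` is appended as is, so the result has four columns.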

### For simplicity, we will remove the text features from the data and train the model on the numeric features only for now, since text features need some preprocessing.

In [None]:
train_df = train_df.drop(columns=["song_title", "artist"])
test_df = test_df.drop(columns=["song_title", "artist"])

## Step 3: Train Model

### Now we are ready to train our model!
The next step is to build a model using a machine learning algorithm. There are many algorithms out there, and each has its pros and cons in terms of performance. For this workshop we will use a very simple algorithm called a Decision Tree.
- We don't have to program this algorithm ourselves; it is already implemented in the scikit-learn library (recall that at the top of the notebook we imported the class `DecisionTreeClassifier`, which implements the decision tree classifier).

### Here we separate the target (AKA response variable, y variable...) from the features (AKA independent variables, predictors, x variables...)


In [None]:
# features, target: used FOR TRAINING; think of them as the practice exams and their answer keys
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]

# features, target: used FOR TESTING; think of them as the FINAL exam and its answer key
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

In [None]:
X_train

In [None]:
y_train

Let's talk about Decision Trees. Our goal here is to determine whether the customer likes the song or not. We can use the Decision Tree algorithm to build a model for classification problems like this one.

### What is a Decision Tree?

- A classification model that predicts the value of the target by learning simple "rules" from patterns in the data.

Let us consider the following data:
<div>
<img src="img.png" width="1000"/>
</div>

This is one possible tree for this problem.
<div>
<img src="DecisionTree.png" width="1000"/>
</div>
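
To make the idea of learned "rules" concrete, here is a minimal sketch that fits a very shallow tree on a tiny made-up dataset (not the data shown in the pictures) and prints its rules with scikit-learn's `export_text`. The feature names and values are assumptions chosen for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: [tempo, danceability] -> liked (1) or not liked (0).
X = [[90, 0.2], [100, 0.3], [120, 0.8], [130, 0.9]]
y = [0, 0, 1, 1]

# max_depth=1 allows a single question (split), i.e. one simple rule.
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X, y)

# export_text renders the fitted tree as indented if/else style rules.
rules = export_text(tree, feature_names=["tempo", "danceability"])
print(rules)
```

Each line of the printed text is a threshold test on one feature, and the leaves show the predicted class.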

**Pros of Decision Tree Classifier:**

- simple to understand
- easy to visualize
- generates understandable rules
- performs classification without requiring much computation

**Cons:**

- less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute
- prone to errors in classification problems with many classes and a relatively small number of training examples

### How can we build a decision tree model in Python?

Now let's fit the model:


In [None]:
# We need to create an object, let's call it model, and set it to a new instance of Decision Tree Classifier.
model = DecisionTreeClassifier()
# next we need to train it, so it learns the patterns of the INPUT train-set & OUTPUT train-set. Recall the Golden Rule of ML.
# The input is X_train, the output is y_train
model.fit(X_train, y_train)
# The output is just a visual for the model object you just fitted/trained.

## Step 4: Make Predictions
After fitting the model, we can predict on the testing set, and also evaluate our decision tree by calculating its accuracy.

In [None]:
predict = model.predict(X_test)
predict

In [None]:
# combine the predicted values with the test data set (copy first so that test_df itself is not modified):
pred_result = test_df.copy()
pred_result['predicted'] = predict.tolist()

In [None]:
pred_result

## Step 5: Evaluate the model and improve its performance (Accuracy)

In [None]:
accuracy = accuracy_score(y_test, predict)
accuracy
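
Accuracy is simply the fraction of predictions that match the true labels. A small sketch with made-up labels, comparing a manual calculation with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions: 3 of the 5 predictions are correct.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# The mean of the element-wise matches is the share of correct predictions.
manual_accuracy = np.mean(y_true == y_pred)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```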

> The test accuracy score is one of the metrics that shows how well our model generalizes to unseen data (this is the key to Machine Learning: our model should generalize well to new observations). We want our model to have high accuracy when predicting the test set, so that we can be more confident in its accuracy once it is deployed to the real world.
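
One way to see why the test score matters is to compare it with the score on the training data. The sketch below uses synthetic, deliberately noisy data from `make_classification` as a stand-in (it is not the Spotify data, and all parameter values are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomly reassigns some labels to add noise.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A tree with no depth limit can keep splitting until it memorizes the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
print(train_acc, test_acc)
```

A perfect training score next to a noticeably lower test score is a sign of overfitting: the tree has learned the noise in the practice exams rather than patterns that carry over to the final exam.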

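### Improving the model accuracy by hyperparameter tuning:
Here we tune the `max_depth` hyperparameter of our Decision Tree algorithm: for each depth from 1 to 25, we run 10-fold cross-validation on the training set and record the mean training and validation scores: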

In [None]:
from sklearn.model_selection import cross_validate
import numpy as np
results_dict = {
    "max_depth": [],
    "mean_train_score": [],
    "mean_cv_score": []
}

for depth in range(1, 26):
    model = DecisionTreeClassifier(max_depth=depth)
    cv_score = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
    results_dict["max_depth"].append(depth)
    results_dict["mean_cv_score"].append( np.mean(cv_score["test_score"]))
    results_dict["mean_train_score"].append( np.mean(cv_score["train_score"]))

result_df = pd.DataFrame(results_dict)
result_df

In [None]:
# pick the max_depth with the highest mean cross-validation score (the first one in case of ties)
optimized_depth = result_df.loc[result_df["mean_cv_score"].idxmax(), "max_depth"]
model_optimized = DecisionTreeClassifier(max_depth = optimized_depth)
model_optimized.fit(X_train, y_train)

In [None]:
optimized_depth
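
The manual loop above can also be expressed with scikit-learn's `GridSearchCV`, which runs the cross-validation for every candidate value and refits the best model for you. A minimal sketch, assuming synthetic data from `make_classification` in place of the Spotify features and a smaller depth range to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; parameter values are arbitrary.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Try max_depth = 1..10, scoring each candidate with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a tree already retrained on all of the data passed to `fit` using the best depth, so it can be used for prediction directly.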

In [None]:
predict = model_optimized.predict(X_test)

accuracy = accuracy_score(y_test, predict)

accuracy

In reality, most classifiers will never reach 100% accuracy, and indeed ours doesn't either. This is just a quick demo of Step 5, showing how the model can be further improved using hyperparameter tuning.

### Exercise for you: Let's Dive into Some Python Code Now

Now we will build a model on the iris flower dataset. We start from the `datasets` module of `sklearn` (imported at the top of the notebook) and load the iris dataset.

In [None]:
#Load dataset + create a dataframe of this iris dataset. (you can ignore this step)
iris =  datasets.load_iris()
data = pd.DataFrame({
    'sepal_length': iris.data[:,0],
    'sepal_width': iris.data[:,1],
    'petal_length': iris.data[:,2],
    'petal_width': iris.data[:,3],
    'species': iris.target
})

data.head(5)

Let's print the target and feature variable names just to make sure that we are using the right dataset.

In [None]:
print(iris.target_names)
print(iris.feature_names)

### Now please complete the `#TODO`s:

We use the `train_test_split` function to split the data into train and test sets (let's use 75% for training and 25% for testing), then train the model on the train set and make predictions on the test set.

In [None]:
train_df, test_df = train_test_split(#TODO)

Since the species of iris flower is what we are interested in classifying, we first separate the columns into dependent and independent variables.
The steps are as follows:

In [None]:
# Separate columns into dependent and independent variables
X_train, y_train = train_df.drop(columns=["species"]), train_df["species"]
X_test, y_test = #TODO

In [None]:
# create the classifier
model = #TODO
model.fit(#TODO)
# make the prediction using the Decision Tree model with the test x values (hint: X_test)
y_pred = model.predict(#TODO)

We now examine the Decision Tree model's accuracy by comparing the actual y (species) values with the values predicted by the model.

In [None]:
print("The accuracy for Decision Tree model is: ", round(accuracy_score(y_test, y_pred),4))

The accuracy is pretty high for such a simple model! To make a prediction on a single observation, we can also use the `predict()` function.
For example:
- sepal length = 3
- sepal width = 6
- petal length = 6
- petal width = 4

Now we can predict which type of iris flower it is, as shown below.

In [None]:
model.predict([[3,6,6,4]])

Here, the output is 2, which corresponds to `iris.target_names[2]`, i.e. the iris type Virginica.
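
Rather than remembering which number stands for which species, the predicted index can be looked up in `iris.target_names`. A minimal sketch, assuming for brevity a tree fitted on the full iris data with an arbitrary `random_state=0` (not the model from the exercise); it also passes the observation as a dataframe so the column names match those used in training:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = pd.DataFrame(iris.data, columns=feature_cols)

# Fit on the whole dataset just for this illustration.
clf = DecisionTreeClassifier(random_state=0).fit(X, iris.target)

# Build a one-row dataframe with the same column names used in fit.
new_flower = pd.DataFrame([[3, 6, 6, 4]], columns=feature_cols)

# predict returns an array of class indices; look the first one up by name.
predicted_index = clf.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_index]
print(predicted_index, predicted_name)
```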

<div>
<img src="survey.jpg" width="500"/>
</div>
Congratulations! You have made it this far and now know what a typical Decision Tree classifier in Python looks like. Please spend a minute filling out this post-workshop survey. Your feedback means a lot to us.