## Week-5 Lab - Logistic Regression and Random Forest Model Development (Classification)
1) We will create a logistic regression model that will predict whether or not a user will click on an ad, based on the given features. As this is a binary classification problem, a logistic regression model is well suited here.

2) There is also a more challenging approach to this problem using random forest.

**Details to be found in the following cells.**

**You are expected to create as many new cells as you think you need.**

Dataset is available at: https://www.kaggle.com/datasets/debdyutidas/advertisingcsv

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Please conduct exploratory data analysis and preprocessing as required. You can follow the steps in our previous workshops and laboratory works.
# Remember that the focus in this session is to build a classification model.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load data to ad_data variable

ad_data = pd.read_csv('dataset/advertising.csv')
ad_data


## Model building

Let us split the data into a training set and a testing set using train_test_split, but first, let’s convert
the ‘Country’ feature to an acceptable form for the model. Country is a categorical string, and we need to find a way to feed this important piece of information into the model.


It is easy to drop this feature, but doing so would sacrifice an important piece of information that helps the model perform more realistically.
We can convert the categorical feature into [dummy variables](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) using pandas.

* We convert categorical features into dummy variables (also called one-hot encoding) because machine learning models work with numerical data and can't directly process categories or labels as inputs.
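As a minimal sketch of what dummy variables look like, the example below encodes a tiny made-up table. The `toy` DataFrame and its values are assumptions for illustration only and are not taken from advertising.csv. `dtype=int` is passed only so the output prints as 0/1 rather than True/False.

```python
import pandas as pd

# Toy data, made up for illustration; not taken from advertising.csv.
toy = pd.DataFrame({
    "Country": ["Peru", "Chile", "Peru", "Kenya"],
    "Age": [31, 45, 27, 38],
})

# One 0/1 column per category. drop_first=True drops the first category
# in sorted order ("Chile"), since it is implied when all others are 0.
dummies = pd.get_dummies(toy["Country"], drop_first=True, dtype=int)

# Replace the string column with its dummy columns.
encoded = pd.concat([toy.drop(columns="Country"), dummies], axis=1)
print(encoded)
```

A row with 0 in both `Kenya` and `Peru` therefore stands for the dropped category, `Chile`.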

In [None]:
# Create the countries dummy variables from the Country column. Look up what a dummy variable is:
# how are dummy variables created, and why do we need them? Search for simple examples.
# Then run the available code cells below and understand how the data operations are done.
ad_data.columns
countries = pd.get_dummies(ad_data['Country'],drop_first=True)

In [None]:
# Concatenating dummy variables with the original dataset, and dropping other features (repetitive ones).
ad_data = pd.concat([ad_data,countries],axis=1)
ad_data.drop(['Country','Ad Topic Line','City','Timestamp'],axis=1,inplace=True)

In [None]:
# Allocate and assign the variables appropriately and prepare for fitting the training data.
# X = everything except the 'Clicked on Ad' column, y = 'Clicked on Ad'.
X =
y =

In [None]:
# Split the dataset appropriately, with the test set being 30% of the dataset.


In [None]:
# Train the model using logistic regression


## Predictions and Evaluations

In [None]:
# Get the prediction results


## Classification report

**Precision** and recall are two important metrics used to evaluate the performance of a classification
model. Precision measures the proportion of positive predictions that are actually true positives. In
other words, it is the ratio of true positive predictions to the total number of positive predictions.
A high precision indicates that the model is making accurate positive predictions.

**Recall** measures the proportion of actual positive cases that are correctly identified by the model. In other words, it is the ratio of true positive predictions to the total number of actual positive cases. A high recall indicates that the model is effectively identifying positive cases.
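The two definitions above can be checked by hand on a small example. The sketch below uses made-up labels (an assumption for illustration, not results from the ad dataset) and a hypothetical helper, `precision_recall`, that counts true positives, false positives and false negatives for whichever class is treated as positive.

```python
# Made-up labels for illustration: 1 = clicked on the ad, 0 = did not click.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]


def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Assumes at least one predicted and one actual positive (no zero division).
    return tp / (tp + fp), tp / (tp + fn)


print(precision_recall(y_true, y_pred, positive=1))  # (0.75, 0.75)
print(precision_recall(y_true, y_pred, positive=0))
```

Switching `positive` from 1 to 0 changes which cells of the confusion matrix count as TP, FP and FN, which is why a classification report lists precision and recall once per class.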

In [None]:
# How good are the predictions? Check with the classification report.
from sklearn.metrics import classification_report


In [None]:
# Print the confusion matrix for the predictions and actual values.
import seaborn as sns
from sklearn.metrics import confusion_matrix


In [None]:
# Using the confusion matrix, can you work out the accuracy, precision, recall and F1-score values for each class with pen and paper?
# Compare your findings in the classification report results. It is important to see which class you are taking as the true class (1 (click-on-ad) or 0 (no-click)) and understand how the precision/recall calculations are affected.
# Hint: Look at your slides.

# 2) Random Forest
## Challenge - Use the same X and y sets to train a Random Forest model

In [None]:
# Train a model using random forest, a tree-based model, with default parameters. Do not forget to import RandomForestClassifier from sklearn.
# First, search for how to import random forest classifier from sklearn.
# Make sure to use new variable names for the model and the prediction outcomes.


## What did you observe from classification report? Is the performance better or worse compared to logistic regression? Any surprises?

*   The performance is about 2-3 percentage points lower than that of logistic regression. Is this surprising? We will come to that later.

In [None]:
# Import libraries for RandomizedSearchCV, randint, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz

In [None]:
# Create the confusion matrix for the above prediction using random forest


## Can you comment on the confusion matrix based on True Class=1? What are your observations for TP, TN, FN and FP? Compare these findings to logistic regression and comment on your findings.

## This is actually a good performance. However, we may be able to get a better performance by optimizing our hyperparameters. Let us see if this can actually help for this problem set.


In [None]:
# You can export the first three decision trees from the forest and visualise using export_graphviz.
# Please search online and see how you can implement this. Observe your features and how they were represented at the tree-based structure.



# Can you explain what the boxes and the values in them indicate? Can you relate this to how a decision is made? There will be a discussion on this during the drop-in session, not to be missed!
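To see what goes into each box, it can help to look at the raw DOT text that `export_graphviz` produces before it is rendered. The sketch below is one possible approach, not the expected lab solution: it trains a small forest on synthetic data (an assumption, standing in for the ad data), and the feature and class names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in for the lab data (assumption, not advertising.csv).
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# A small, shallow forest keeps the exported tree readable.
forest = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=0)
forest.fit(X_toy, y_toy)

# out_file=None makes export_graphviz return the DOT source as a string.
first_tree = forest.estimators_[0]
dot_source = export_graphviz(
    first_tree,
    out_file=None,
    feature_names=feature_names,
    class_names=["no click", "click"],
    filled=True,
)
print(dot_source)
```

Each node label in the printed text lists the split rule (for internal nodes), the Gini impurity, the sample count, the class distribution under `value`, and the majority class; the exact form of `value` may differ between scikit-learn versions. In a notebook with the `graphviz` package installed, `graphviz.Source(dot_source)` should render the same text as a diagram.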

# Hyperparameter Tuning: Can you find the best hyperparameter values for random forest model using RandomizedSearchCV? Please use n_estimators and max_depth, and decide on a range for these two hyperparameters.

## We are using RandomizedSearchCV to search for the best hyperparameter values within a range. We can define the hyperparameters to use and their range in the param_dist dictionary.
- n_estimators: the number of decision trees in the forest. If this is 5, five decision trees are built, each on a random sample of the training data and using random subsets of the features. When a new observation arrives, the trees' predictions are aggregated into a final decision: click or no click.
- max_depth: the maximum depth of each decision tree in the forest, i.e. the maximum number of decision layers between the root and a leaf in each tree.
- RandomizedSearchCV trains many candidate models (the number of sampled hyperparameter combinations is set by n_iter), scores each one with cross-validation, and keeps the best one in its best_estimator_ attribute.
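The points above can be sketched as follows. This is an illustration on synthetic data (an assumption, standing in for the training set), and the ranges in `param_dist`, `n_iter=5` and `cv=3` are arbitrary choices rather than recommended values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption, not advertising.csv).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative ranges only; randint(low, high) samples integers in [low, high).
param_dist = {
    "n_estimators": randint(10, 60),
    "max_depth": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,        # number of hyperparameter combinations to sample
    cv=3,            # each combination is scored with 3-fold cross-validation
    random_state=0,  # fixes which combinations are sampled
)
search.fit(X_toy, y_toy)

# With the default refit=True, the best combination is retrained on all of X_toy.
best_model = search.best_estimator_
print(search.best_params_)
```

Without a fixed `random_state`, each run samples different combinations, which is why repeated runs can report different best values.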

In [None]:
# Create the param_dist dictionary for the two hyperparameters with a range.


# Create a new random forest classifier, maybe a name like: rf_model_hp

# Use RandomizedSearchCV to find the best hyperparameters


# Fit the RandomizedSearchCV object to the training data again.


In [None]:
# Create a variable for the best model, e.g., best_rf is a nice variable name.

# Print the best hyperparameters, you will see that each model run (running above cell) may generate a different hyperparameter value.


In [None]:
# Use the best_rf and generate predictions with the best model


# Create the confusion matrix for the improved-hyperparameter model


# Share your observations compared to the previous random forest model.
- Did you see any improvement in the performance of the model when the best hyperparameter values were applied?
- Different Colab sessions are expected to produce different values, but very likely with similar, if not identical, conclusions. It is therefore important that you interpret these results on your own, then show them to a friend next to you and compare what they got, so that you can discuss.

In [None]:
# Finally, can you now create a series containing feature importances from the model and feature names from the training data?
# Compute feature importances using pd.Series() function

# Set threshold for importance (e.g., features with importance > 0.005)


# Plot a simple bar chart for the important features only!


# Wrap-up
- We have developed a logistic regression model on a simple dataset to predict whether a person will click on an ad (classification problem).
- We then developed a random forest model and compared the findings to logistic regression. Random forest performed slightly worse than logistic regression.
  - Normally, our expectation is that random forest should perform better. Why do you think it was not the case? Let us discuss this during the drop-in session.
- We visualised decision-trees and practised on finding the best hyperparameter values.
  - We still observed that only a slight improvement was achieved; in some cases, perhaps no improvement was observed.
- We interpreted the confusion matrix results and worked out the precision/recall calculations and compared our findings to the classification_report's.
- Finally, we printed the most important features above a certain threshold.