<a href="https://colab.research.google.com/github/TheophilusG/DataBootcamp/blob/main/forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Random Forests

**OBJECTIVES**

- Use `RandomForestClassifier` to extend Decision Tree models
- Compare models in a business use case and select the model that maximizes expected profit

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.datasets import make_classification

### Ensemble of Trees

> A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
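
To make the "averaging" in this definition concrete, here is a minimal sketch that averages the class probabilities of the individual trees by hand and compares the result with the forest's own `predict_proba`. The variable names and `n_estimators=25` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (same generator the notebook uses below)
X_demo, y_demo = make_classification(random_state=11)

# 25 trees is an arbitrary illustrative choice
forest = RandomForestClassifier(n_estimators=25, random_state=11).fit(X_demo, y_demo)

# Each fitted tree lives in `estimators_`; average their class probabilities by hand
tree_probs = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# Compare the hand-computed average with the forest's own probabilities
print(np.allclose(tree_probs, forest.predict_proba(X_demo)))
```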

In [None]:
X, y = make_classification(random_state=11)

In [None]:
#instantiate


In [None]:
#fit


In [None]:
#predict


In [None]:
#confusion matrix
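
One possible way to fill in the four steps above is sketched here. The train/test split, the `random_state` values, and the names `forest`, `preds`, and `cm` are assumptions added for illustration; the exercise itself does not prescribe them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=11)
# Hold out a test set so the confusion matrix reflects unseen data (an added step)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

# instantiate
forest = RandomForestClassifier(random_state=11)
# fit
forest.fit(X_train, y_train)
# predict
preds = forest.predict(X_test)
# confusion matrix: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, preds)
print(cm)
```

To plot the same matrix instead of printing it, `ConfusionMatrixDisplay.from_predictions(y_test, preds)` can be used.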


### Marketing Problem

The data relate to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). [UCI Bank Marketing dataset](https://archive.ics.uci.edu/dataset/222/bank+marketing)

You have been tasked with finding a model that identifies further targets for the incentive offer. To do so, compare a Logistic Regression model and a Random Forest model, and select the one that maximizes expected profit using the following cost-benefit information:

- The cost of calling each customer is 2 dollars.
- A customer who purchases the product gives a profit of 200 dollars.


Recall the expected profit is found by:


$$\text{Expected Profit} = p(Y,p) \cdot b(Y,p) + p(N,p) \cdot b(N,p) + p(N,n) \cdot b(N,n) + p(Y,n) \cdot b(Y,n)$$
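
Each term in this sum pairs the probability of one confusion-matrix cell with the benefit of that cell. A minimal sketch of the calculation follows. The helper `expected_profit`, the example counts, and the benefit values are assumptions: it supposes only customers predicted positive are called, that the 200 dollars is earned before subtracting the 2 dollar call, and that customers who are not called contribute nothing.

```python
import numpy as np

def expected_profit(conf_mat, benefits):
    """Expected profit per customer: sum over cells of p(cell) * b(cell)."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    probs = conf_mat / conf_mat.sum()  # joint probability of each cell
    return float((probs * np.asarray(benefits, dtype=float)).sum())

# Layout follows sklearn's confusion_matrix: [[TN, FP], [FN, TP]]
call_cost, sale_profit = 2, 200
benefits = [[0, -call_cost],               # TN: no call, FP: wasted call
            [0, sale_profit - call_cost]]  # FN: no call, TP: sale minus call (assumed)

# Hypothetical counts for 100 customers: TN=80, FP=10, FN=5, TP=5
example_counts = [[80, 10],
                  [5, 5]]
print(expected_profit(example_counts, benefits))
```

With these made-up counts the result is 0.10 * (-2) + 0.05 * 198, about 9.70 dollars per customer.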


In [None]:
bank_marketing = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/bank.csv')

In [None]:
bank_marketing.head()

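| | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | | 5 | may | 261 | 1 | -1 | 0 | | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | | 5 | may | 151 | 1 | -1 | 0 | | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | | 5 | may | 76 | 1 | -1 | 0 | | no |
| 3 | 47 | blue-collar | married | | no | 1506 | yes | no | | 5 | may | 92 | 1 | -1 | 0 | | no |
| 4 | 33 | | single | | no | 1 | no | no | | 5 | may | 198 | 1 | -1 | 0 | | no |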


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
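# Select feature columns by dtype from the bank data (the target column is excluded)
features = bank_marketing.drop(columns=['target'])
categorical_features = features.select_dtypes(include=['object']).columns
numerical_features = features.select_dtypes(exclude=['object']).columns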


numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Create a ColumnTransformer to apply transformers to respective columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features),
    ])

# Create a pipeline with preprocessing and the model
model = make_pipeline(preprocessor, LogisticRegression())

In [None]:
X = bank_marketing.drop(columns=['target'])
y = bank_marketing['target']

# Encode the target labels as integers ('no' -> 0, 'yes' -> 1)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Fit the model
model.fit(X_train, y_train)

In [None]:
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown='ignore')


preprocessor_rf = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features),
    ])


rf_pipeline = Pipeline([('preprocessor', preprocessor_rf), ('rf', RandomForestClassifier())])



rf_pipeline.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix

# Cost-benefit values from the problem statement (in dollars)
call_cost = 2
product_profit = 200

y_log_reg_prob = model.predict_proba(X_test)[:, 1]  # Probability of class 1 ('yes')
y_log_reg_pred = (y_log_reg_prob >= 0.5).astype(int)  # Classify as 1 if probability >= 0.5, else 0


# Predict probabilities for Random Forest
y_rf_prob = rf_pipeline.predict_proba(X_test)[:, 1]  # Probability of class 1 ('yes')
y_rf_pred = (y_rf_prob >= 0.5).astype(int)  # Classify as 1 if probability >= 0.5, else 0


# Calculate profit for Logistic Regression
tn, fp, fn, tp = confusion_matrix(y_test, y_log_reg_pred).ravel()
p_y_p = tp / (tp + fp) if (tp + fp) != 0 else 0  # Probability of yes, given prediction of yes, with zero division check
p_n_p = tn / (tn + fn) if (tn + fn) != 0 else 0  # Probability of no, given prediction of no, with zero division check
log_reg_profit = p_y_p * product_profit - (1 - p_n_p) * call_cost

# Calculate profit for Random Forest
tn, fp, fn, tp = confusion_matrix(y_test, y_rf_pred).ravel()
p_y_p = tp / (tp + fp) if (tp + fp) != 0 else 0 # Probability of yes, given prediction of yes, with zero division check
p_n_p = tn / (tn + fn) if (tn + fn) != 0 else 0  # Probability of no, given prediction of no, with zero division check
rf_profit = p_y_p * product_profit - (1 - p_n_p) * call_cost

print(f'Expected profit for Logistic Regression: {log_reg_profit:.2f}')
print(f'Expected profit for Random Forest: {rf_profit:.2f}')

if log_reg_profit > rf_profit:
    print('Logistic Regression model is the better choice.')
else:
    print('Random Forest model is the better choice.')

Expected profit for Logistic Regression: 130.48
Expected profit for Random Forest: 132.33
Random Forest model is the better choice.
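
The comparison above fixes the classification cutoff at 0.5. Because a call costs far less than a sale returns, a lower cutoff may be more profitable. The sketch below sweeps a few thresholds on synthetic data as a stand-in for the bank data. The data, the threshold grid, and the helper `profit_per_customer` are all assumptions for illustration, and the profit accounting (pay 2 dollars per call, earn 200 dollars per called subscriber) is one reading of the problem statement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data (roughly 10% positives)
X_syn, y_syn = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

call_cost, sale_profit = 2, 200  # values from the problem statement

def profit_per_customer(threshold):
    """Average profit when every customer scored at or above `threshold` is called."""
    called = probs >= threshold
    # Each call costs call_cost; each called customer who subscribes returns sale_profit
    total = sale_profit * np.sum(called & (y_te == 1)) - call_cost * np.sum(called)
    return total / len(y_te)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
profits = {t: profit_per_customer(t) for t in thresholds}
best = max(profits, key=profits.get)
print(f"best threshold: {best}, profit per customer: {profits[best]:.2f}")
print(f"profit per customer at 0.5: {profits[0.5]:.2f}")
```

The same sweep could be run on `y_log_reg_prob` and `y_rf_prob` with `y_test` to compare the two models at their own best cutoffs.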


In [None]:
# Confusion matrix for the selected model (assumed here to be the Random Forest)
mat = confusion_matrix(y_test, y_rf_pred)

### Summary

Please complete the form [here](https://forms.gle/C4B28692UKzvznq2A) to summarize your group's work and solutions.