# Week 5 Lab Assignment: Advanced Predictive Models

### Objective
In this lab, you will implement and evaluate advanced predictive models in Python, fitting and comparing a Random Forest, a Gradient Boosting classifier, and a Support Vector Machine (SVM).

### 1. Dataset Overview
You will work with a small customer-transactions dataset (30 rows). It contains three features that will be used to predict a binary target variable.

**Dataset Name:** `customer_transactions.csv`

**Attributes:**
- `Feature1`: A numerical feature.
- `Feature2`: A numerical feature.
- `Feature3`: A categorical feature with the values `A`, `B`, and `C`, stored as text and converted to numerical form during data preparation.
- `Target`: The outcome variable we want to predict (binary classification).

### 2. Load and Explore Dataset
**Objective:** Gain a preliminary understanding of the dataset.

**Tasks:**
1. **Load the Dataset:** Import the dataset into a Pandas DataFrame.
2. **Inspect the Data:** Use Pandas functions to inspect the first few rows, check for missing values, and understand the data types.
3. **Summary Statistics:** Generate summary statistics for numerical columns.

In [1]:
# Import necessary packages
%pip install pandas
%pip install matplotlib
%pip install scikit-learn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
%matplotlib inline

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.

In [2]:
# Load the dataset
df = pd.read_csv('customer_transactions.csv')

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Generate summary statistics
print(df.describe())

   Feature1  Feature2 Feature3  Target
0      45.2       1.5        A       0
1      67.1       2.3        B       1
2      56.4       3.2        A       0
3      75.3       2.7        C       1
4      34.5       1.9        B       0
Feature1    0
Feature2    0
Feature3    0
Target      0
dtype: int64
        Feature1   Feature2     Target
count  30.000000  30.000000  30.000000
mean   61.186667   2.583333   0.566667
std    14.440406   0.617606   0.504007
min    34.500000   1.500000   0.000000
25%    50.000000   2.100000   0.000000
50%    60.850000   2.650000   1.000000
75%    72.025000   3.075000   1.000000
max    89.000000   3.600000   1.000000
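
Task 2 also asks about data types, which the cell above does not print. The following is a minimal sketch of that check, together with a quick look at class balance. It rebuilds only the five rows shown by `df.head()` so that it runs without the CSV file; the name `sample` is an illustrative assumption, and on the full dataset the same methods would be called on `df`.

```python
import pandas as pd

# The five rows printed by df.head() above, rebuilt by hand so the
# snippet runs without customer_transactions.csv (illustrative only).
sample = pd.DataFrame({
    "Feature1": [45.2, 67.1, 56.4, 75.3, 34.5],
    "Feature2": [1.5, 2.3, 3.2, 2.7, 1.9],
    "Feature3": ["A", "B", "A", "C", "B"],
    "Target": [0, 1, 0, 1, 0],
})

# Data type of each column: Feature3 is the only non-numeric one.
print(sample.dtypes)

# Share of each class in the target variable.
class_share = sample["Target"].value_counts(normalize=True)
print(class_share)

# How often each category of Feature3 occurs.
category_counts = sample["Feature3"].value_counts()
print(category_counts)
```

Checking class balance early matters because accuracy is only easy to interpret when neither class dominates the target.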


### 3. Data Preparation
**Objective:** Prepare the data for modeling by handling missing values and encoding categorical variables.

**Tasks:**
1. **Handle Missing Values:** Deal with any missing values in the dataset.
2. **Encode Categorical Variables:** Convert categorical variables into numerical format using techniques like one-hot encoding.
3. **Train-Test Split:** Split the data into training and testing sets.

In [3]:
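# Handle missing values (the check above found none, so no rows are dropped here)
df = df.dropna()

# One-hot encode the categorical column Feature3.
# drop_first=True drops the first category (A), which becomes the baseline.
df = pd.get_dummies(df, drop_first=True)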

# Split the data into training and testing sets
X = df.drop('Target', axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'Training set size: {X_train.shape}')
print(f'Test set size: {X_test.shape}')

Training set size: (21, 4)
Test set size: (9, 4)
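
The feature matrix has four columns because one-hot encoding with `drop_first=True` turns the three-valued `Feature3` into two indicator columns. The sketch below illustrates this on the five rows shown by `df.head()`; the names `sample` and `encoded` are illustrative assumptions, not part of the lab code.

```python
import pandas as pd

# The five rows printed by df.head(), rebuilt by hand (illustrative only).
sample = pd.DataFrame({
    "Feature1": [45.2, 67.1, 56.4, 75.3, 34.5],
    "Feature2": [1.5, 2.3, 3.2, 2.7, 1.9],
    "Feature3": ["A", "B", "A", "C", "B"],
    "Target": [0, 1, 0, 1, 0],
})

# One-hot encode the text column; the first category (A) is dropped
# and serves as the baseline.
encoded = pd.get_dummies(sample, drop_first=True)
print(encoded)

# Removing the target leaves the four feature columns used for modeling.
feature_columns = encoded.drop(columns="Target").columns.tolist()
print(feature_columns)
```

A row whose `Feature3_B` and `Feature3_C` indicators are both false belongs to category `A`, so no information is lost by dropping the first column.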


### 4. Model Implementation
**Objective:** Implement and train advanced predictive models.

**Tasks:**
1. **Random Forest:** Fit a Random Forest model to the training data.
2. **Gradient Boosting:** Fit a Gradient Boosting model to the training data.
3. **Support Vector Machine (SVM):** Fit an SVM model to the training data.

In [4]:
# Random Forest Model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Gradient Boosting Model
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

# Support Vector Machine (SVM) Model
# probability=True enables predict_proba, which the ROC-AUC evaluation in Section 5 relies on
svm_model = SVC(probability=True)
svm_model.fit(X_train, y_train)
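
SVMs with the default RBF kernel are distance-based, so features on very different scales (here `Feature1` spans roughly 35 to 89 while `Feature2` spans 1.5 to 3.6) can dominate one another. One common remedy, offered here as an optional sketch rather than as part of the required solution, is to standardize the features inside a pipeline. The synthetic data and the names `X_demo`, `y_demo`, and `scaled_svm` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the lab dataset), four features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=42
)
# Exaggerate the scale of the first feature to mimic mixed units.
X_demo[:, 0] *= 100

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only and then applied to
# whatever is passed to predict/score, which avoids leaking test data.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

demo_predictions = scaled_svm.predict(X_te)
print(f"Scaled SVM accuracy on synthetic data: {scaled_svm.score(X_te, y_te):.3f}")
```

Tree-based models such as Random Forest and Gradient Boosting split on thresholds and are largely insensitive to feature scale, so the pipeline is only applied to the SVM here.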

### 5. Model Evaluation
**Objective:** Evaluate the performance of each model using appropriate metrics.

**Tasks:**
1. **Random Forest Evaluation:** Evaluate using accuracy, ROC-AUC, and a classification report.
2. **Gradient Boosting Evaluation:** Evaluate using accuracy, ROC-AUC, and a classification report.
3. **SVM Evaluation:** Evaluate using accuracy, ROC-AUC, and a classification report.

In [5]:
# Evaluate Random Forest Model
rf_predictions = rf_model.predict(X_test)
rf_probabilities = rf_model.predict_proba(X_test)[:,1]
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_auc = roc_auc_score(y_test, rf_probabilities)
print(f'Random Forest Accuracy: {rf_accuracy}')
print(f'Random Forest AUC: {rf_auc}')
print(classification_report(y_test, rf_predictions))

# Evaluate Gradient Boosting Model
gb_predictions = gb_model.predict(X_test)
gb_probabilities = gb_model.predict_proba(X_test)[:,1]
gb_accuracy = accuracy_score(y_test, gb_predictions)
gb_auc = roc_auc_score(y_test, gb_probabilities)
print(f'Gradient Boosting Accuracy: {gb_accuracy}')
print(f'Gradient Boosting AUC: {gb_auc}')
print(classification_report(y_test, gb_predictions))

# Evaluate SVM Model
svm_predictions = svm_model.predict(X_test)
svm_probabilities = svm_model.predict_proba(X_test)[:,1]
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_auc = roc_auc_score(y_test, svm_probabilities)
print(f'SVM Accuracy: {svm_accuracy}')
print(f'SVM AUC: {svm_auc}')
print(classification_report(y_test, svm_predictions))

Random Forest Accuracy: 1.0
Random Forest AUC: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         5

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9

Gradient Boosting Accuracy: 1.0
Gradient Boosting AUC: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         5

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9

SVM Accuracy: 1.0
SVM AUC: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         5

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9
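
The three evaluation blocks above repeat the same steps for each model. One way to reduce the repetition, sketched here under the assumption that every model exposes `predict` and `predict_proba`, is a small helper applied in a loop. The function `evaluate_model` and the synthetic data are hypothetical and not part of the lab code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate_model(model, X_eval, y_eval):
    """Return accuracy and ROC-AUC for a fitted binary classifier."""
    predictions = model.predict(X_eval)
    # Probability of the positive class (second column).
    probabilities = model.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, predictions),
        "auc": roc_auc_score(y_eval, probabilities),
    }


# Synthetic stand-in data (NOT the lab dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=42
)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}

# Fit each model and collect its metrics in one dictionary.
results = {}
for name, model in models.items():
    model.fit(X_train_demo, y_train_demo)
    results[name] = evaluate_model(model, X_test_demo, y_test_demo)
    print(f"{name}: accuracy = {results[name]['accuracy']:.3f}, "
          f"AUC = {results[name]['auc']:.3f}")
```

Collecting the metrics in a dictionary also makes the later comparison step a matter of iterating over `results` rather than tracking six separate variables.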

### 6. Model Comparison and Conclusion
**Objective:** Compare the performance of the models and draw conclusions.

**Tasks:**
1. **Compare Metrics:** Discuss the performance of each model based on the evaluation metrics.
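2. **Conclusion:** Determine which model is best suited for the problem based on your analysis. If the metrics tie, as can happen on a test set of only nine samples, say so and explain what further evidence would separate the models.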

In [6]:
# Example comparison output
print(f'Random Forest: Accuracy = {rf_accuracy}, AUC = {rf_auc}')
print(f'Gradient Boosting: Accuracy = {gb_accuracy}, AUC = {gb_auc}')
print(f'SVM: Accuracy = {svm_accuracy}, AUC = {svm_auc}')

# Based on the results, provide your conclusion here

Random Forest: Accuracy = 1.0, AUC = 1.0
Gradient Boosting: Accuracy = 1.0, AUC = 1.0
SVM: Accuracy = 1.0, AUC = 1.0
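
All three models score perfectly on the nine test samples, so a single train-test split cannot rank them. A possible way to get a more stable comparison, shown here only as a sketch on synthetic data, is stratified k-fold cross-validation. The data and the names `candidates` and `cv_summary` are assumptions for illustration; on the lab data you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the lab dataset).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=4, n_informative=3, n_redundant=0, random_state=42
)

# Five stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}

# Mean and standard deviation of ROC-AUC across the folds, per model.
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean helps show whether a difference between two models is larger than the variation caused by the choice of split.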


### 7. Submission
**Deliverables:**
- Jupyter Notebook (.ipynb) with all code, visualizations, and model evaluations.
- A brief report (1-2 paragraphs) summarizing your findings and the model selection.

**Deadline:** Submit your completed notebook and report to the course portal by end of class.