# Week 6 Lab Assignment: Model Comparison

### Objective
In this lab, you will implement several predictive models (logistic regression, decision tree, random forest, and gradient boosting) and compare their performance using accuracy, precision, recall, F1 score, and cross-validation. You will learn to select the most appropriate model for a dataset based on these comparisons.

### 1. Setup and Installations
**Objective:** Ensure all necessary packages are installed for the lab.

**Tasks:**
1. Install required Python packages: pandas, scikit-learn, matplotlib, numpy.

In [4]:
# Install necessary packages
%pip install pandas scikit-learn matplotlib numpy

### 2. Import Libraries
**Objective:** Import all necessary libraries for data manipulation, modeling, and evaluation.


In [5]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
%matplotlib inline

### 3. Load and Explore Dataset
**Objective:** Gain a preliminary understanding of the dataset to be used for modeling.

**Tasks:**
1. **Load the Dataset:** Import the dataset into a Pandas DataFrame.
2. **Inspect the Data:** Use Pandas functions to inspect the first few rows, check for missing values, and understand the data types.
3. **Summary Statistics:** Generate summary statistics for numerical columns.

In [6]:
# Load the dataset
df = pd.read_csv('customer_behavior.csv')

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Generate summary statistics
print(df.describe())

   Age  Income  Number_of_Purchases Customer_Category  Target
0   25   50000                    3                 A       0
1   45   80000                    8                 B       1
2   30   54000                    4                 A       0
3   35   60000                    2                 C       1
4   50   95000                    7                 B       1
Age                    0
Income                 0
Number_of_Purchases    0
Customer_Category      0
Target                 0
dtype: int64
             Age         Income  Number_of_Purchases    Target
count  20.000000      20.000000            20.000000  20.00000
mean   37.850000   70400.000000             4.750000   0.65000
std    10.332753   18871.867115             2.291288   0.48936
min    22.000000   49000.000000             1.000000   0.00000
25%    29.750000   54750.000000             3.000000   0.00000
50%    36.500000   66000.000000             5.000000   1.00000
75%    45.250000   80250.000000             6.250
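
Task 2 also asks for the data types, which the cell above does not print. A minimal sketch of that check, plus a look at class balance in `Target`, is shown here on a hypothetical five-row frame copied from the `df.head()` output. It stands in for `customer_behavior.csv`, which is assumed to have the same columns.

```python
import pandas as pd

# Hypothetical stand-in: the five rows shown by df.head() above.
sample = pd.DataFrame({
    "Age": [25, 45, 30, 35, 50],
    "Income": [50000, 80000, 54000, 60000, 95000],
    "Number_of_Purchases": [3, 8, 4, 2, 7],
    "Customer_Category": ["A", "B", "A", "C", "B"],
    "Target": [0, 1, 0, 1, 1],
})

# Column data types: non-numeric columns will need encoding before modeling.
print(sample.dtypes)

# Columns that are not numeric (here, only Customer_Category).
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# Share of each class in the target, to spot class imbalance early.
class_balance = sample["Target"].value_counts(normalize=True)
print(class_balance)
```

In the notebook itself, the same calls would be applied to `df` rather than to `sample`.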

### 4. Data Preparation
**Objective:** Prepare the data for modeling by handling missing values, encoding categorical variables, and splitting the data into training and testing sets.

**Tasks:**
1. **Handle Missing Values:** Deal with any missing values in the dataset.
2. **Encode Categorical Variables:** Convert categorical variables into numerical format using techniques like one-hot encoding.
3. **Train-Test Split:** Split the data into training and testing sets.

In [7]:
# Handle missing values by dropping incomplete rows (the check above found none)
df = df.dropna()

# One-hot encode categorical variables; drop_first=True drops the first level of each
df = pd.get_dummies(df, drop_first=True)

# Split the data into training and testing sets
X = df.drop('Target', axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'Training set size: {X_train.shape}')
print(f'Test set size: {X_test.shape}')

Training set size: (14, 5)
Test set size: (6, 5)
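
With only 20 rows, a purely random split can leave the test set with a class mix quite different from the full data. One option, not used in the cell above, is to pass `stratify=y` to `train_test_split`. The sketch below uses a hypothetical 20-row frame (synthetic values, not the lab dataset) with the same column names and the same 0.65 mean for `Target` reported by `df.describe()`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with the lab's column names (not the real CSV).
rng = np.random.default_rng(42)
n_rows = 20
demo = pd.DataFrame({
    "Age": rng.integers(22, 60, size=n_rows),
    "Income": rng.integers(49000, 100000, size=n_rows),
    "Number_of_Purchases": rng.integers(1, 10, size=n_rows),
    "Customer_Category": rng.choice(["A", "B", "C"], size=n_rows),
    "Target": [1] * 13 + [0] * 7,  # 13 of 20 positive, i.e. mean 0.65
})

# One-hot encode the categorical column, dropping the first level.
demo = pd.get_dummies(demo, drop_first=True)

X_demo = demo.drop("Target", axis=1)
y_demo = demo["Target"]

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Positive rate overall: {y_demo.mean():.2f}")
print(f"Positive rate in train: {y_tr.mean():.2f}")
print(f"Positive rate in test: {y_te.mean():.2f}")
```

Whether stratification is appropriate depends on the assignment's requirements; it changes which rows land in the test set, so the metrics will differ from those recorded in this notebook.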


### 5. Implementing and Evaluating Multiple Models
**Objective:** Build and evaluate multiple predictive models on the dataset.

**Tasks:**
1. **Implement Models:** Create and train logistic regression, decision tree, random forest, and gradient boosting models.
2. **Evaluate Models:** Use accuracy, precision, recall, and F1 score to evaluate the models' performance.

In [8]:
# Implementing Logistic Regression
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)
logistic_predictions = logistic_model.predict(X_test)

# Evaluate Logistic Regression
logistic_accuracy = accuracy_score(y_test, logistic_predictions)
logistic_precision = precision_score(y_test, logistic_predictions)
logistic_recall = recall_score(y_test, logistic_predictions)
logistic_f1 = f1_score(y_test, logistic_predictions)
print(f'Logistic Regression Accuracy: {logistic_accuracy}')
print(f'Logistic Regression Precision: {logistic_precision}')
print(f'Logistic Regression Recall: {logistic_recall}')
print(f'Logistic Regression F1 Score: {logistic_f1}')

Logistic Regression Accuracy: 0.3333333333333333
Logistic Regression Precision: 0.2
Logistic Regression Recall: 1.0
Logistic Regression F1 Score: 0.3333333333333333


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
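
The convergence warning above suggests either raising `max_iter` or scaling the features. Features such as `Income` (tens of thousands) and `Number_of_Purchases` (single digits) sit on very different scales, which tends to slow the default `lbfgs` solver. A minimal sketch of the scaling route follows, using hypothetical synthetic data rather than the lab dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic features on very different scales (not the lab data).
rng = np.random.default_rng(0)
age = rng.integers(22, 60, size=40)
income = age * 1500 + rng.normal(0, 5000, size=40)
purchases = rng.integers(1, 10, size=40)
X_demo = np.column_stack([age, income, purchases])
y_demo = (income > np.median(income)).astype(int)

# Standardize each feature, then fit logistic regression on the scaled values.
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_logistic.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used.
n_iterations = scaled_logistic[-1].n_iter_[0]
print(f"Solver iterations used: {n_iterations}")
print(f"Training accuracy: {scaled_logistic.score(X_demo, y_demo):.2f}")
```

Wrapping the scaler and the model in one pipeline also means that, under cross-validation, the scaler is refit on each training fold rather than on the full dataset.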


In [9]:
# Implementing Decision Tree
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
tree_predictions = tree_model.predict(X_test)

# Evaluate Decision Tree
tree_accuracy = accuracy_score(y_test, tree_predictions)
tree_precision = precision_score(y_test, tree_predictions)
tree_recall = recall_score(y_test, tree_predictions)
tree_f1 = f1_score(y_test, tree_predictions)
print(f'Decision Tree Accuracy: {tree_accuracy}')
print(f'Decision Tree Precision: {tree_precision}')
print(f'Decision Tree Recall: {tree_recall}')
print(f'Decision Tree F1 Score: {tree_f1}')

Decision Tree Accuracy: 0.8333333333333334
Decision Tree Precision: 0.5
Decision Tree Recall: 1.0
Decision Tree F1 Score: 0.6666666666666666
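
`confusion_matrix` is imported in Section 2 but never used, and it helps explain why precision is 0.5 while recall is 1.0. The label arrays below are a hypothetical reconstruction: they are not taken from the notebook, but they are one arrangement of six test rows consistent with the Decision Tree metrics printed above (one true positive, one false positive, four true negatives, no false negatives).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels consistent with the reported Decision Tree metrics.
y_true = [0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Each metric expressed through the confusion-matrix counts.
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")   # (TP + TN) / total
print(f"Precision: {precision_score(y_true, y_pred):.4f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")     # TP / (TP + FN)
print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
```

In the notebook, `confusion_matrix(y_test, tree_predictions)` would give the actual counts.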


In [10]:
# Implementing Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Evaluate Random Forest
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_precision = precision_score(y_test, rf_predictions)
rf_recall = recall_score(y_test, rf_predictions)
rf_f1 = f1_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy}')
print(f'Random Forest Precision: {rf_precision}')
print(f'Random Forest Recall: {rf_recall}')
print(f'Random Forest F1 Score: {rf_f1}')

Random Forest Accuracy: 1.0
Random Forest Precision: 1.0
Random Forest Recall: 1.0
Random Forest F1 Score: 1.0


In [11]:
# Implementing Gradient Boosting
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)

# Evaluate Gradient Boosting
gb_accuracy = accuracy_score(y_test, gb_predictions)
gb_precision = precision_score(y_test, gb_predictions)
gb_recall = recall_score(y_test, gb_predictions)
gb_f1 = f1_score(y_test, gb_predictions)
print(f'Gradient Boosting Accuracy: {gb_accuracy}')
print(f'Gradient Boosting Precision: {gb_precision}')
print(f'Gradient Boosting Recall: {gb_recall}')
print(f'Gradient Boosting F1 Score: {gb_f1}')

Gradient Boosting Accuracy: 0.8333333333333334
Gradient Boosting Precision: 0.5
Gradient Boosting Recall: 1.0
Gradient Boosting F1 Score: 0.6666666666666666
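
The four cells above repeat the same fit, predict, and score steps. One way to reduce the repetition is a small helper. Note that `evaluate_model` is a hypothetical name, not part of scikit-learn, and the data here is generated with `make_classification` purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model and return the four metrics used in this lab."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, zero_division=0),
        "recall": recall_score(y_test, predictions, zero_division=0),
        "f1": f1_score(y_test, predictions, zero_division=0),
    }


# Illustrative synthetic data (an assumption, not the lab dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Evaluate every model with the same helper and the same split.
results = {
    name: evaluate_model(model, X_tr, X_te, y_tr, y_te)
    for name, model in models.items()
}
for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In the notebook, the helper would be called with `X_train`, `X_test`, `y_train`, and `y_test` from Section 4.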


### 6. Comparing Model Performance
**Objective:** Compare the performance of logistic regression, decision tree, random forest, and gradient boosting models.

**Tasks:**
1. **Compare Metrics:** Print and compare the accuracy, precision, recall, and F1 scores of all models.
2. **Model Selection:** Discuss which model performed best and why.

In [12]:
# Compare model performance
print(f'Logistic Regression Accuracy: {logistic_accuracy}, Precision: {logistic_precision}, Recall: {logistic_recall}, F1 Score: {logistic_f1}')
print(f'Decision Tree Accuracy: {tree_accuracy}, Precision: {tree_precision}, Recall: {tree_recall}, F1 Score: {tree_f1}')
print(f'Random Forest Accuracy: {rf_accuracy}, Precision: {rf_precision}, Recall: {rf_recall}, F1 Score: {rf_f1}')
print(f'Gradient Boosting Accuracy: {gb_accuracy}, Precision: {gb_precision}, Recall: {gb_recall}, F1 Score: {gb_f1}')

# Discuss model performance
# (Provide your analysis here based on the results)

Logistic Regression Accuracy: 0.3333333333333333, Precision: 0.2, Recall: 1.0, F1 Score: 0.3333333333333333
Decision Tree Accuracy: 0.8333333333333334, Precision: 0.5, Recall: 1.0, F1 Score: 0.6666666666666666
Random Forest Accuracy: 1.0, Precision: 1.0, Recall: 1.0, F1 Score: 1.0
Gradient Boosting Accuracy: 0.8333333333333334, Precision: 0.5, Recall: 1.0, F1 Score: 0.6666666666666666
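
For the comparison task, a table is often easier to read than four long print lines. The sketch below hard-codes the metric values printed above (rounded to four decimals) into a DataFrame. In the notebook you would pass variables such as `logistic_accuracy` instead.

```python
import pandas as pd

# Metric values copied from the printed output above, rounded to four decimals.
comparison = pd.DataFrame(
    {
        "Accuracy": [0.3333, 0.8333, 1.0, 0.8333],
        "Precision": [0.2, 0.5, 1.0, 0.5],
        "Recall": [1.0, 1.0, 1.0, 1.0],
        "F1 Score": [0.3333, 0.6667, 1.0, 0.6667],
    },
    index=[
        "Logistic Regression",
        "Decision Tree",
        "Random Forest",
        "Gradient Boosting",
    ],
)

# Sort so the model with the highest F1 score appears first.
comparison = comparison.sort_values("F1 Score", ascending=False)
print(comparison)

best_model = comparison["F1 Score"].idxmax()
print(f"Highest F1 score on this test split: {best_model}")
```

With only six test rows, each misclassification shifts accuracy by about 0.17, so this ranking is best read together with the cross-validation results in Section 7.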


### 7. Using Cross-Validation for Model Evaluation
**Objective:** Apply cross-validation to evaluate the stability and reliability of model performance.

**Tasks:**
1. **Perform Cross-Validation:** Use cross-validation to evaluate the models' performance.
2. **Interpret Results:** Analyze cross-validation results to understand model reliability.

In [13]:
# Perform cross-validation
logistic_cv_scores = cross_val_score(logistic_model, X, y, cv=5)
tree_cv_scores = cross_val_score(tree_model, X, y, cv=5)
rf_cv_scores = cross_val_score(rf_model, X, y, cv=5)
gb_cv_scores = cross_val_score(gb_model, X, y, cv=5)

# Print cross-validation results
print(f'Logistic Regression Cross-Validation Scores: {logistic_cv_scores}')
print(f'Decision Tree Cross-Validation Scores: {tree_cv_scores}')
print(f'Random Forest Cross-Validation Scores: {rf_cv_scores}')
print(f'Gradient Boosting Cross-Validation Scores: {gb_cv_scores}')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic Regression Cross-Validation Scores: [0.5 1.  1.  1.  1. ]
Decision Tree Cross-Validation Scores: [0.75 0.75 1.   1.   1.  ]
Random Forest Cross-Validation Scores: [0.75 1.   1.   1.   1.  ]
Gradient Boosting Cross-Validation Scores: [0.75 0.75 1.   1.   1.  ]
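
To interpret the fold scores, it helps to reduce each array to a mean and a standard deviation. The sketch below copies the score arrays printed above. In the notebook you would use `logistic_cv_scores` and the other variables directly.

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
cv_scores = {
    "Logistic Regression": np.array([0.5, 1.0, 1.0, 1.0, 1.0]),
    "Decision Tree": np.array([0.75, 0.75, 1.0, 1.0, 1.0]),
    "Random Forest": np.array([0.75, 1.0, 1.0, 1.0, 1.0]),
    "Gradient Boosting": np.array([0.75, 0.75, 1.0, 1.0, 1.0]),
}

# Mean summarizes typical accuracy; standard deviation summarizes fold-to-fold spread.
for name, scores in cv_scores.items():
    print(f"{name}: mean = {scores.mean():.3f}, std = {scores.std():.3f}")

cv_means = {name: scores.mean() for name, scores in cv_scores.items()}
best_by_cv = max(cv_means, key=cv_means.get)
print(f"Highest mean cross-validation accuracy: {best_by_cv}")
```

Each fold here holds only four rows (20 rows split five ways), so a single misclassification moves a fold's score by 0.25. Differences of that size between models should be interpreted cautiously.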


### 8. Submission
**Deliverables:**
- Jupyter Notebook (.ipynb) with all code and model evaluations.
- A brief report (1-2 paragraphs) summarizing the findings, comparing model performance, and discussing the best model choice based on evaluation metrics and cross-validation results.

**Deadline:** Submit your completed notebook and report to the course portal by the end of class.