# Week 3 Lab Assignment: Decision Trees, Random Forest, Gradient Boosting

### Objective
In this lab, you will implement decision tree, random forest, and gradient boosting classifiers in Python with scikit-learn. You will learn how to build these models, understand how they differ, and evaluate their performance on a dataset.

### 1. Load and Explore Dataset
**Objective:** Gain a preliminary understanding of the dataset to be used for modeling.

**Tasks:**
1. **Load the Dataset:** Import the dataset into a Pandas DataFrame.
2. **Inspect the Data:** Use Pandas functions to inspect the first few rows, check for missing values, and understand the data types.
3. **Summary Statistics:** Generate summary statistics for numerical columns.

In [2]:
# Import necessary packages
%pip install pandas
%pip install scikit-learn
%pip install matplotlib
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
%matplotlib inline



In [3]:
# Load the dataset
df = pd.read_csv('customer_behavior.csv')

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Generate summary statistics
print(df.describe())

   Feature1  Feature2 Feature3  Target
0      35.1       0.5        A       0
1      42.2       1.1        B       1
2      28.7       0.7        A       0
3      54.5       1.5        C       1
4      47.8       1.2        B       1
Feature1    0
Feature2    0
Feature3    0
Target      0
dtype: int64
        Feature1   Feature2     Target
count  30.000000  30.000000  30.000000
mean   44.226667   1.126667   0.466667
std    11.092524   0.435441   0.507416
min    25.900000   0.400000   0.000000
25%    35.225000   0.800000   0.000000
50%    44.050000   1.050000   0.000000
75%    54.300000   1.500000   1.000000
max    62.800000   1.900000   1.000000
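The task list also asks for the data types, which the cell above does not print. A minimal sketch of that check follows. It assumes a hypothetical `sample_df` built from the five rows shown by `df.head()`, because the full `customer_behavior.csv` is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the lab data: only the five rows printed by df.head()
sample_df = pd.DataFrame({
    "Feature1": [35.1, 42.2, 28.7, 54.5, 47.8],
    "Feature2": [0.5, 1.1, 0.7, 1.5, 1.2],
    "Feature3": ["A", "B", "A", "C", "B"],
    "Target": [0, 1, 0, 1, 1],
})

# Data type of each column; Feature3 is the only non-numeric one
print(sample_df.dtypes)

# Names of the columns that will need encoding before modeling
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# Class balance of the target column
class_counts = sample_df["Target"].value_counts()
print(class_counts)
```

In the lab notebook itself, the same calls can be made on `df` directly, for example `df.dtypes` and `df['Target'].value_counts()`.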


### 2. Data Preparation
**Objective:** Prepare the data for modeling by handling missing values and encoding categorical variables.

**Tasks:**
1. **Handle Missing Values:** Drop any rows that contain missing values. The check in Section 1 found none, so this step leaves the data unchanged.
2. **Encode Categorical Variables:** Convert categorical variables into numerical format using techniques like one-hot encoding.
3. **Train-Test Split:** Split the data into training and testing sets.

In [4]:
# Handle missing values
df = df.dropna()

# One-hot encode categorical columns (here, Feature3); drop_first=True drops one redundant dummy per column
df = pd.get_dummies(df, drop_first=True)

# Split the data into training and testing sets
X = df.drop('Target', axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'Training set size: {X_train.shape}')
print(f'Test set size: {X_test.shape}')

Training set size: (21, 4)
Test set size: (9, 4)
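To see why the feature matrix has 4 columns, it can help to look at what `pd.get_dummies(..., drop_first=True)` does to `Feature3`. The sketch below again uses a hypothetical `sample_df` built from the five rows shown in Section 1, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for the lab data: only the five rows printed by df.head()
sample_df = pd.DataFrame({
    "Feature1": [35.1, 42.2, 28.7, 54.5, 47.8],
    "Feature2": [0.5, 1.1, 0.7, 1.5, 1.2],
    "Feature3": ["A", "B", "A", "C", "B"],
    "Target": [0, 1, 0, 1, 1],
})

# One-hot encode the categorical column; category "A" becomes the implicit baseline
encoded = pd.get_dummies(sample_df, drop_first=True)
print(encoded.columns.tolist())

# Separate features from the target, as in the lab cell
X_sample = encoded.drop("Target", axis=1)
y_sample = encoded["Target"]
print(X_sample.shape)
```

A row with `Feature3 == 'A'` is represented by both `Feature3_B` and `Feature3_C` being false, which is why three categories need only two dummy columns.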


### 3. Implementing Decision Tree Model
**Objective:** Build and evaluate a decision tree model on the dataset.

**Tasks:**
1. **Build the Model:** Create a decision tree classifier using Scikit-learn.
2. **Train the Model:** Train the model on the training data.
3. **Evaluate the Model:** Use accuracy and the classification report to evaluate the model's performance.

In [5]:
# Build and train the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
dt_predictions = dt_model.predict(X_test)

# Evaluate the model
dt_accuracy = accuracy_score(y_test, dt_predictions)
print(f'Decision Tree Accuracy: {dt_accuracy}')
print(classification_report(y_test, dt_predictions))

Decision Tree Accuracy: 0.8888888888888888
              precision    recall  f1-score   support

           0       1.00      0.80      0.89         5
           1       0.80      1.00      0.89         4

    accuracy                           0.89         9
   macro avg       0.90      0.90      0.89         9
weighted avg       0.91      0.89      0.89         9
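The numbers in the report can be traced back to a confusion matrix. The sketch below uses hypothetical label vectors `y_true` and `y_pred` that are consistent with the report (5 samples of class 0, 4 of class 1, and one class-0 sample predicted as class 1). The order of the samples is an assumption; only the counts follow from the report.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels matching the report's support and recall values
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Unpack the 2x2 matrix, treating class 1 as the positive class
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)    # 4 / 5: one predicted 1 was really a 0
recall_0 = tn / (tn + fp)       # 4 / 5: one true 0 was missed
accuracy = (tp + tn) / cm.sum() # 8 / 9

print(f"Precision (class 1): {precision_1:.2f}")
print(f"Recall (class 0): {recall_0:.2f}")
print(f"Accuracy: {accuracy:.4f}")
```

With only 9 test samples, a single misclassification moves accuracy by about 11 percentage points, so small differences between models on this split should be read with caution.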



### 4. Implementing Random Forest Model
**Objective:** Build and evaluate a random forest model on the dataset.

**Tasks:**
1. **Build the Model:** Create a random forest classifier using Scikit-learn.
2. **Train the Model:** Train the model on the training data.
3. **Evaluate the Model:** Use accuracy and the classification report to evaluate the model's performance.

In [6]:
# Build and train the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_predictions = rf_model.predict(X_test)

# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy}')
print(classification_report(y_test, rf_predictions))

Random Forest Accuracy: 0.8888888888888888
              precision    recall  f1-score   support

           0       1.00      0.80      0.89         5
           1       0.80      1.00      0.89         4

    accuracy                           0.89         9
   macro avg       0.90      0.90      0.89         9
weighted avg       0.91      0.89      0.89         9
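A fitted random forest also exposes `feature_importances_`, which can help explain what the ensemble relies on. The sketch below is illustrative only: it uses synthetic data from `make_classification` (an assumption, not the lab dataset), and the names `X_demo`, `y_demo`, and `rf_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 4 features, 2 of which carry signal
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# Fit a forest of 100 trees (the scikit-learn default, stated explicitly here)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances: one non-negative value per feature, summing to 1
importances = rf_demo.feature_importances_
for index, importance in enumerate(importances):
    print(f"feature_{index}: {importance:.3f}")
```

In the lab notebook, the equivalent would be something like `pd.Series(rf_model.feature_importances_, index=X_train.columns)`, which labels each value with its column name.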



### 5. Implementing Gradient Boosting Model
**Objective:** Build and evaluate a gradient boosting model on the dataset.

**Tasks:**
1. **Build the Model:** Create a gradient boosting classifier using Scikit-learn.
2. **Train the Model:** Train the model on the training data.
3. **Evaluate the Model:** Use accuracy and the classification report to evaluate the model's performance.

In [7]:
# Build and train the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions
gb_predictions = gb_model.predict(X_test)

# Evaluate the model
gb_accuracy = accuracy_score(y_test, gb_predictions)
print(f'Gradient Boosting Accuracy: {gb_accuracy}')
print(classification_report(y_test, gb_predictions))

Gradient Boosting Accuracy: 0.8888888888888888
              precision    recall  f1-score   support

           0       1.00      0.80      0.89         5
           1       0.80      1.00      0.89         4

    accuracy                           0.89         9
   macro avg       0.90      0.90      0.89         9
weighted avg       0.91      0.89      0.89         9
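Gradient boosting adds trees one at a time, each correcting the errors of the ensemble so far. One way to observe this is `staged_predict`, which yields predictions after each boosting stage. The sketch below assumes synthetic data from `make_classification` rather than the lab dataset, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split the same way as in the lab (70/30)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# 50 boosting stages, each a shallow tree scaled by the learning rate
gb_demo = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# Test accuracy after each stage; the last entry matches gb_demo.predict
staged_accuracy = [
    accuracy_score(y_te, stage_pred) for stage_pred in gb_demo.staged_predict(X_te)
]
print(f"After 1 tree:   {staged_accuracy[0]:.3f}")
print(f"After 50 trees: {staged_accuracy[-1]:.3f}")
```

Plotting `staged_accuracy` with `plt.plot` is a simple way to check whether additional trees are still helping or whether accuracy has flattened out.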



### 6. Comparing Model Performance
**Objective:** Compare the performance of decision tree, random forest, and gradient boosting models.

**Tasks:**
1. **Compare Accuracy:** Print and compare the accuracy of all models.
2. **Model Selection:** Discuss which model performed best and why.

In [8]:
# Compare model accuracy
print(f'Decision Tree Accuracy: {dt_accuracy}')
print(f'Random Forest Accuracy: {rf_accuracy}')
print(f'Gradient Boosting Accuracy: {gb_accuracy}')

# Discuss model performance
# (Provide your analysis here based on the results)

Decision Tree Accuracy: 0.8888888888888888
Random Forest Accuracy: 0.8888888888888888
Gradient Boosting Accuracy: 0.8888888888888888
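All three models score identically on the 9-sample test set, so a single split cannot separate them. Cross-validation is one common way to get a steadier comparison. The sketch below is a suggestion rather than part of the assigned solution: it uses synthetic data from `make_classification`, and `X_demo`, `y_demo`, and `cv_results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the lab, X and y from Section 2 would be used instead
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated accuracy: mean and spread per model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows whether a gap between two models is larger than the fold-to-fold variation.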


### 7. Submission
**Deliverables:**
- Jupyter Notebook (.ipynb) with all code and model evaluations.
- A brief report (1-2 paragraphs) summarizing the findings, comparing model performance, and discussing the application of SEMMA and CRISP-DM methodologies.

**Deadline:** Submit your completed notebook and report to the course portal by the end of class.