# Week 4 Lab Assignment: Predictive Modeling Techniques

### Objective
In this lab, you will implement and evaluate basic predictive models using Python. You will practice fitting linear regression, logistic regression, and decision tree models to a dataset, and compare their performance using key evaluation metrics.

### 1. Dataset Overview
You will work with a small customer-behavior dataset. It contains 18 rows, three features, and one target variable that the features are used to predict.

**Dataset Name:** `customer_behavior.csv`

**Attributes:**
- `Feature1`: A numerical feature.
- `Feature2`: A numerical feature.
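- `Feature3`: A categorical feature stored as string labels (`A`, `B`, `C`), converted to numerical form in Section 3.
- `Target`: The outcome variable we want to predict. In this dataset it is binary (0 or 1), so the classification branches of the code run; a continuous (`float64`) target would trigger the regression branches instead.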

### 2. Load and Explore Dataset
**Objective:** Gain a preliminary understanding of the dataset.

**Tasks:**
1. **Load the Dataset:** Import the dataset into a Pandas DataFrame.
2. **Inspect the Data:** Use Pandas functions to inspect the first few rows, check for missing values, and understand the data types.
3. **Summary Statistics:** Generate summary statistics for numerical columns.

In [1]:
# Import necessary packages
%pip install pandas
%pip install matplotlib
%pip install scikit-learn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, precision_score, recall_score
%matplotlib inline

In [3]:
# Load the dataset
df = pd.read_csv('customer_behavior.csv')

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Generate summary statistics
print(df.describe())

   Feature1  Feature2 Feature3  Target
0      23.5      45.0        A       0
1      34.2      60.1        B       1
2      45.7      78.4        C       0
3      23.4      45.2        A       0
4      36.1      62.3        B       1
Feature1    0
Feature2    0
Feature3    0
Target      0
dtype: int64
        Feature1   Feature2     Target
count  18.000000  18.000000  18.000000
mean   36.372222  64.255556   0.555556
std     9.747024  14.883912   0.511310
min    22.100000  42.700000   0.000000
25%    26.975000  50.775000   0.000000
50%    36.800000  63.600000   1.000000
75%    45.075000  77.550000   1.000000
max    50.000000  88.000000   1.000000
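
Task 2 also asks about data types, which the cell above does not print. A minimal sketch of that check follows, assuming only the five rows shown by `df.head()`; they are re-typed into a small DataFrame named `sample` (a name introduced here for illustration) so the snippet runs without the CSV file.

```python
import pandas as pd

# The five rows printed by df.head() above, re-typed so this sketch runs
# without customer_behavior.csv (the full file has 18 rows).
sample = pd.DataFrame({
    "Feature1": [23.5, 34.2, 45.7, 23.4, 36.1],
    "Feature2": [45.0, 60.1, 78.4, 45.2, 62.3],
    "Feature3": ["A", "B", "C", "A", "B"],
    "Target": [0, 1, 0, 0, 1],
})

# Column data types: Feature3 holds strings, so it must be encoded
# before it can be passed to a scikit-learn model.
print(sample.dtypes)

# Class balance of the target, expressed as proportions.
class_balance = sample["Target"].value_counts(normalize=True)
print(class_balance)

# How often each category of Feature3 occurs.
category_counts = sample["Feature3"].value_counts()
print(category_counts)
```

On the full dataset, the same calls can be applied to `df` directly. The class balance is worth checking early because accuracy is easier to interpret when the two classes are roughly equally common.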


### 3. Data Preparation
**Objective:** Prepare the data for modeling by handling missing values, encoding categorical variables, and splitting the data into training and test sets.

**Tasks:**
1. **Handle Missing Values:** Deal with any missing values in the dataset.
2. **Encode Categorical Variables:** Convert categorical variables into numerical format using techniques like one-hot encoding.
3. **Train-Test Split:** Split the data into training and testing sets.

In [4]:
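# Handle missing values (none were found in Section 2, so this leaves df unchanged)
df = df.dropna()

# One-hot encode the categorical column Feature3.
# drop_first=True drops the first category (A) to avoid a redundant column.
df = pd.get_dummies(df, drop_first=True)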

# Split the data into training and testing sets
X = df.drop('Target', axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'Training set size: {X_train.shape}')
print(f'Test set size: {X_test.shape}')

Training set size: (12, 4)
Test set size: (6, 4)
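
The feature matrix has four columns even though the dataset lists only three features. The sketch below shows why, using the five rows printed by `df.head()` re-typed into a DataFrame named `sample` (an illustrative name, not part of the notebook): one-hot encoding with `drop_first=True` replaces `Feature3` with two indicator columns.

```python
import pandas as pd

# The five rows printed by df.head(), re-typed for a self-contained example.
sample = pd.DataFrame({
    "Feature1": [23.5, 34.2, 45.7, 23.4, 36.1],
    "Feature2": [45.0, 60.1, 78.4, 45.2, 62.3],
    "Feature3": ["A", "B", "C", "A", "B"],
    "Target": [0, 1, 0, 0, 1],
})

# One-hot encode Feature3; category A is dropped and becomes the baseline.
encoded = pd.get_dummies(sample, drop_first=True)

# Separate features from the target, as in the cell above.
X_sample = encoded.drop("Target", axis=1)
print(X_sample.columns.tolist())
print(X_sample.shape)
```

A row with `Feature3 == 'A'` has both indicator columns set to zero (or `False`, depending on the pandas version), so no information is lost by dropping the first category.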


### 4. Model Implementation
**Objective:** Implement and train predictive models.

**Tasks:**
1. **Linear Regression:** Fit a linear regression model to the training data (meaningful when the target is continuous).
2. **Logistic Regression:** Fit a logistic regression model to the training data (if the target is binary).
3. **Decision Tree:** Fit a decision tree regressor or classifier to the training data, depending on the target type.

In [5]:
# Linear Regression Model (intended for a continuous target; it is only evaluated in that case)
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Logistic Regression Model (binary target only; scikit-learn rejects continuous labels)
logistic_model = LogisticRegression()
if y_train.dtype != 'float64':
    logistic_model.fit(X_train, y_train)

# Decision Tree Model: regressor for a continuous target, classifier otherwise
decision_tree_model = DecisionTreeRegressor() if y_train.dtype == 'float64' else DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
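
Logistic regression is sensitive to feature scale, and a fitted classifier can return class probabilities as well as hard labels. The sketch below illustrates both with a scaling pipeline. This is an optional variation rather than the method used in this lab, and it runs on synthetic data from `make_classification` rather than on `customer_behavior.csv`; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the lab dataset): 200 rows, 4 numeric features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Standardize the features, then fit logistic regression.
# The scaler is fitted on the training data only, inside the pipeline.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logistic.fit(Xd_train, yd_train)

# Hard class labels and per-class probabilities for the test rows.
demo_labels = scaled_logistic.predict(Xd_test)
demo_proba = scaled_logistic.predict_proba(Xd_test)
print(demo_labels[:5])
print(demo_proba[:5])
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of the training step, which matters more as datasets get larger than the 18 rows used here.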

### 5. Model Evaluation
**Objective:** Evaluate the performance of each model using appropriate metrics.

**Tasks:**
1. **Linear Regression Evaluation:** Calculate the mean squared error (MSE) for the linear regression model (continuous target only).
2. **Logistic Regression Evaluation:** Calculate accuracy, precision, and recall for the logistic regression model (binary target only).
3. **Decision Tree Evaluation:** Evaluate the decision tree model with MSE for a continuous target or accuracy for a binary target.

In [6]:
# Linear Regression Evaluation
if y_train.dtype == 'float64':
    linear_predictions = linear_model.predict(X_test)
    mse = mean_squared_error(y_test, linear_predictions)
    print(f'Mean Squared Error (Linear Regression): {mse}')

# Logistic Regression Evaluation
else:
    logistic_predictions = logistic_model.predict(X_test)
    accuracy = accuracy_score(y_test, logistic_predictions)
    precision = precision_score(y_test, logistic_predictions)
    recall = recall_score(y_test, logistic_predictions)
    print(f'Accuracy (Logistic Regression): {accuracy}')
    print(f'Precision (Logistic Regression): {precision}')
    print(f'Recall (Logistic Regression): {recall}')

# Decision Tree Evaluation
tree_predictions = decision_tree_model.predict(X_test)
if y_train.dtype == 'float64':
    tree_mse = mean_squared_error(y_test, tree_predictions)
    print(f'Mean Squared Error (Decision Tree): {tree_mse}')
else:
    tree_accuracy = accuracy_score(y_test, tree_predictions)
    print(f'Accuracy (Decision Tree): {tree_accuracy}')

Accuracy (Logistic Regression): 0.8333333333333334
Precision (Logistic Regression): 1.0
Recall (Logistic Regression): 0.75
Accuracy (Decision Tree): 0.8333333333333334
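
To see where these numbers come from, it helps to work backward to the confusion matrix. With six test rows, an accuracy of 5/6, a precision of 1.0, and a recall of 0.75 correspond to 3 true positives, 1 false negative, 0 false positives, and 2 true negatives. The sketch below reproduces the metrics from hypothetical label vectors built to match those counts; the actual `y_test` values and predictions are not shown in the notebook, so the row order here is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for a 6-row test set, chosen to reproduce the printed
# metrics. The real y_test and predictions may be ordered differently.
y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")

# Metrics computed by hand from the counts.
accuracy_demo = (tp + tn) / (tp + tn + fp + fn)   # 5 / 6
precision_demo = tp / (tp + fp)                   # 3 / 3
recall_demo = tp / (tp + fn)                      # 3 / 4
print(accuracy_demo, precision_demo, recall_demo)

# The scikit-learn helpers give the same values.
print(accuracy_score(y_true_demo, y_pred_demo))
print(precision_score(y_true_demo, y_pred_demo))
print(recall_score(y_true_demo, y_pred_demo))
```

With only six test rows, a single misclassified row moves accuracy by about 17 percentage points, so small differences between models should be read with caution.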


### 6. Model Comparison and Conclusion
**Objective:** Compare the performance of the models and draw conclusions.

**Tasks:**
1. **Compare Metrics:** Discuss the performance of each model based on the evaluation metrics.
2. **Conclusion:** Determine which model is best suited for the problem based on your analysis.

In [7]:
# Example comparison output
if y_train.dtype == 'float64':
    print(f'Linear Regression MSE: {mse}')
    print(f'Decision Tree MSE: {tree_mse}')
else:
    print(f'Logistic Regression Accuracy: {accuracy}')
    print(f'Decision Tree Accuracy: {tree_accuracy}')
    print(f'Logistic Regression Precision: {precision}')
    print(f'Decision Tree Precision: {precision_score(y_test, tree_predictions)}')
    print(f'Logistic Regression Recall: {recall}')
    print(f'Decision Tree Recall: {recall_score(y_test, tree_predictions)}')

Logistic Regression Accuracy: 0.8333333333333334
Decision Tree Accuracy: 0.8333333333333334
Logistic Regression Precision: 1.0
Decision Tree Precision: 1.0
Logistic Regression Recall: 0.75
Decision Tree Recall: 0.75
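
Both models score identically on this six-row test set, so a single split cannot separate them. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an optional extension, not part of the required lab steps, and it uses synthetic data from `make_classification` in place of the lab dataset; the fold count and all variable names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the lab dataset).
X_cv, y_cv = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# Five stratified folds: each fold keeps roughly the same class balance.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Accuracy on each held-out fold, plus the mean and spread per model.
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

On the 18-row lab dataset the folds would be very small, so the fold count would need to be reduced and the spread across folds reported alongside the mean.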


### 7. Submission
**Deliverables:**
- Jupyter Notebook (.ipynb) with all code, visualizations, and model evaluations.
- A brief report (1-2 paragraphs) summarizing your findings and the model selection.

**Deadline:** Submit your completed notebook and report to the course portal by end of class.