# Week 18: Classification Algorithms Part 2 - Take Home

## Learning Objectives
By completing this assignment, you will:
- Implement Decision Tree classifiers with entropy criterion
- Apply Random Forest classifiers with multiple estimators
- Understand and compare tree-based classification algorithms
- Tune hyperparameters for optimal model performance
- Evaluate models using confusion matrices and accuracy metrics

---

# Part 1: Tasks

These tasks test your understanding of the fundamental concepts covered in Week 18.

---

## Task 1: Decision Tree Classification

**Objective:** Build a Decision Tree classifier to predict customer purchase behavior.

**Dataset:** `Task-Datasets/task1_decision_tree_customer_data.csv`

### Instructions:
1. Import the necessary libraries (pandas, numpy, sklearn)
2. Load the dataset and explore its structure
3. Separate features (Age, Salary) and target variable (Purchased)
4. Split the data into training (80%) and test (20%) sets
5. Build a Decision Tree classifier with criterion='entropy' and random_state=0
6. Train the model on the training data
7. Make predictions on the test set
8. Evaluate using confusion matrix and accuracy score

### Expected Deliverables:
- Confusion matrix
- Accuracy score
- Brief interpretation of results
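
The workflow above can be sketched as follows. This is a minimal illustration on synthetic data, not a solution for the task dataset: the generated values, the labelling rule, and the `random_state` used for the split are all assumptions made here so that the example runs on its own. Only the column names (`Age`, `Salary`, `Purchased`) and the classifier settings come from the instructions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the task CSV (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Age": rng.integers(18, 61, size=n),
    "Salary": rng.integers(15_000, 150_001, size=n),
})
# Hypothetical noisy labelling rule, only so the problem is not trivial.
score = 0.04 * (df["Age"] - 40) + 0.00002 * (df["Salary"] - 80_000)
df["Purchased"] = (score + rng.normal(0, 0.5, size=n) > 0).astype(int)

X = df[["Age", "Salary"]]
y = df["Purchased"]

# 80/20 split; random_state here is an assumption added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows of the matrix are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"Accuracy: {acc:.3f}")
```

Accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test samples, which is a useful sanity check when interpreting the two outputs together.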

In [None]:
# Task 1: Decision Tree Classification
# Import libraries

In [None]:
# Load and explore the dataset

In [None]:
# Separate features and target variable

In [None]:
# Split data into training and test sets

In [None]:
# Build and train the Decision Tree classifier

In [None]:
# Make predictions and evaluate

---

## Task 2: Random Forest Classification

**Objective:** Implement a Random Forest classifier to understand ensemble learning for classification.

**Dataset:** `Task-Datasets/task2_random_forest_customer_data.csv`

### Instructions:
1. Import the necessary libraries
2. Load the dataset and understand its structure
3. Separate features (Age, Salary) and target variable (Purchased)
4. Split the data into training (80%) and test (20%) sets
5. Build a Random Forest classifier with:
   - n_estimators=10
   - criterion='entropy'
   - random_state=0
6. Train the model and make predictions
7. Evaluate using confusion matrix and accuracy score
8. Compare conceptually with a single Decision Tree

### Expected Deliverables:
- Confusion matrix
- Accuracy score
- Brief explanation of why Random Forest might perform differently than a single Decision Tree
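
As a rough illustration of the comparison in step 8, the sketch below fits a Random Forest and a single Decision Tree on the same synthetic split. The data comes from scikit-learn's `make_classification` and merely stands in for the two features in the task; the split's `random_state` is an assumption. Which model scores higher on any given dataset is not guaranteed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data standing in for (Age, Salary) -> Purchased.
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
).fit(X_train, y_train)
tree = DecisionTreeClassifier(
    criterion="entropy", random_state=0
).fit(X_train, y_train)

forest_pred = forest.predict(X_test)
forest_cm = confusion_matrix(y_test, forest_pred)
print(forest_cm)
print("Forest accuracy:", accuracy_score(y_test, forest_pred))
print("Tree accuracy:  ", accuracy_score(y_test, tree.predict(X_test)))

# The fitted forest holds one DecisionTreeClassifier per estimator.
print("Trees in forest:", len(forest.estimators_))
```

Each tree in the forest is trained on a bootstrap sample of the training data, and the forest combines their predictions, which is the starting point for the conceptual comparison.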

In [None]:
# Task 2: Random Forest Classification
# Import libraries

In [None]:
# Load and explore the dataset

In [None]:
# Separate features and target, split data

In [None]:
# Build and train Random Forest classifier

In [None]:
# Make predictions and evaluate

---

# Part 2: Assignments

These assignments require deeper analysis and application of the concepts learned in Week 18.

---

## Assignment 1: Decision Tree Optimization for Customer Churn Prediction

**Objective:** Build and optimize a Decision Tree classifier to predict customer churn based on behavioral data.

**Dataset:** `Assignment-Dataset/assignment1_decision_tree_optimization.csv`

**Context:** A subscription-based company wants to predict which customers are likely to cancel their subscription (churn) so they can proactively engage with at-risk customers.

### Instructions:
1. Import necessary libraries
2. Load and preprocess the dataset
3. Perform exploratory data analysis (EDA) to understand the data
4. Separate features (Age, Annual_Income, Spending_Score, Years_as_Customer, Online_Purchase_Frequency) and target (Will_Churn)
5. Split the data into training (80%) and test (20%) sets with random_state=42
6. Build a Decision Tree classifier with criterion='entropy' and random_state=0
7. Experiment with different max_depth values (2, 4, 6, 8, 10, None)
8. For each max_depth value, calculate:
   - Training accuracy
   - Test accuracy
9. Plot max_depth vs. accuracy (training and test) to visualize overfitting
10. Select the optimal max_depth and justify your choice
11. Build the final model with the optimal parameters and evaluate it

### Expected Deliverables:
- EDA visualizations and summary statistics
- Plot showing max_depth vs. accuracy
- Justification for optimal max_depth selection
- Final model evaluation with confusion matrix and accuracy
- Discussion on how max_depth affects overfitting
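
One possible shape for the max_depth sweep in steps 7 to 9 is sketched below on synthetic data (the generated features are placeholders, not the churn dataset). Because `None` cannot be placed on a numeric axis, the sketch stores each depth as a string label; that labelling choice is an assumption of this example, not a requirement of the assignment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-feature data with label noise so that deep trees can overfit.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.15, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rows = []
for depth in [2, 4, 6, 8, 10, None]:
    model = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    model.fit(X_train, y_train)
    rows.append({
        "max_depth": str(depth),            # string label so None can be plotted
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
        "actual_depth": model.get_depth(),  # depth the tree actually reached
    })

results = pd.DataFrame(rows)
print(results)

# One way to plot, assuming matplotlib is installed:
# results.plot(x="max_depth", y=["train_acc", "test_acc"], marker="o")
```

A widening gap between `train_acc` and `test_acc` as depth grows is the usual visual signature of overfitting.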

In [None]:
# Assignment 1: Decision Tree Optimization
# Import libraries

In [None]:
# Load and explore the dataset

In [None]:
# Exploratory Data Analysis (EDA)

In [None]:
# Prepare data: separate features/target, split

In [None]:
# Test different max_depth values (2, 4, 6, 8, 10, None)

In [None]:
# Plot max_depth vs accuracy

In [None]:
# Build final model with optimal max_depth and evaluate

### Analysis and Conclusions

*Write your analysis here:*
- What is the optimal max_depth and why?
- How does max_depth affect overfitting and underfitting?
- What features seem most important for predicting churn?
- What business recommendations would you make based on this model?

---

## Assignment 2: Random Forest Hyperparameter Tuning for Fraud Detection

**Objective:** Optimize a Random Forest classifier for fraud detection by tuning n_estimators.

**Dataset:** `Assignment-Dataset/assignment2_random_forest_optimization.csv`

**Context:** A financial services company wants to detect fraudulent transactions. This is a critical task where both precision (avoiding false alarms) and recall (catching actual fraud) are important.

### Instructions:
1. Import necessary libraries
2. Load and preprocess the dataset
3. Perform exploratory data analysis including:
   - Distribution of each feature
   - Class imbalance analysis (note: fraud is rare, ~10%)
   - Feature correlations
4. Separate features (Amount, Time_of_Day, Day_of_Week, Customer_Age, Account_Age_Days, Previous_Transactions) and target (Is_Fraud)
5. Split the data into training (80%) and test (20%) sets with random_state=42
6. Test different n_estimators values: [5, 10, 25, 50, 100, 150, 200]
7. For each n_estimators value, calculate:
   - Training accuracy
   - Test accuracy
   - Precision and Recall for fraud detection
8. Plot n_estimators vs. performance metrics
9. Select the optimal n_estimators considering both overall accuracy and fraud-class precision and recall
10. Build the final model and provide comprehensive evaluation

### Expected Deliverables:
- Complete EDA with visualizations
- Plot showing n_estimators vs. performance metrics (accuracy, precision, recall)
- Analysis of precision/recall trade-off for fraud detection
- Final model evaluation with confusion matrix
- Discussion on model performance for imbalanced classes
- Recommendations for handling class imbalance
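
The sweep in steps 6 and 7 might look like the sketch below. It uses synthetic data with roughly 10% positives as a stand-in for `Is_Fraud`. Two choices here are assumptions rather than part of the instructions: `stratify=y` in the split (to keep the fraud rate equal in both sets) and `random_state=0` on the forest.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.9, 0.1], flip_y=0.01, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rows = []
for n in [5, 10, 25, 50, 100, 150, 200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    rows.append({
        "n_estimators": n,
        "train_acc": rf.score(X_train, y_train),
        "test_acc": accuracy_score(y_test, pred),
        # Precision and recall are computed for the positive (fraud) class.
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    })

results = pd.DataFrame(rows)
print(results.round(3))

# Accuracy of always predicting the majority class, for reference.
baseline = 1 - y_test.mean()
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

With a fraud rate near 10%, a model that never predicts fraud already reaches about 90% accuracy, so test accuracy should be read against that baseline and alongside precision and recall.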

In [None]:
# Assignment 2: Random Forest Optimization for Fraud Detection
# Import libraries

In [None]:
# Load and explore the dataset

In [None]:
# Class imbalance analysis

In [None]:
# Exploratory Data Analysis (EDA)

In [None]:
# Prepare data: separate features/target, split

In [None]:
# Test different n_estimators values

In [None]:
# Plot n_estimators vs performance metrics

In [None]:
# Build final model with optimal n_estimators and evaluate

### Analysis and Conclusions

*Write your analysis here:*
- What is the optimal number of trees (n_estimators) and why?
- How does the model perform on the imbalanced dataset?
- What is the trade-off between precision and recall for fraud detection?
- What strategies could be used to improve fraud detection?

---

## Assignment 3: Decision Tree vs. Random Forest Comparison

**Objective:** Compare Decision Tree and Random Forest classifiers on the same dataset to understand the benefits of ensemble methods.

**Dataset:** `Assignment-Dataset/assignment3_classifier_comparison.csv`

**Context:** A healthcare provider wants to predict diabetes risk based on patient health indicators. They want to understand which classifier provides better predictions and why.

### Instructions:
1. Import necessary libraries
2. Load and preprocess the dataset
3. Perform comprehensive EDA including:
   - Feature distributions
   - Class distribution (Diabetes_Risk: 0 = Low Risk, 1 = High Risk)
   - Feature correlations
   - Analysis by physical activity level
4. Separate features (Age, BMI, Blood_Pressure, Glucose_Level, Insulin_Level, Family_History, Physical_Activity) and target (Diabetes_Risk)
5. Encode the categorical feature Physical_Activity as ordinal values (Low=0, Medium=1, High=2)
6. Split the data into training (80%) and test (20%) sets with random_state=42
7. Implement and evaluate:
   - Decision Tree (criterion='entropy', random_state=0)
   - Decision Tree (criterion='entropy', max_depth=5, random_state=0)
   - Random Forest (n_estimators=10, criterion='entropy', random_state=0)
   - Random Forest (n_estimators=50, criterion='entropy', random_state=0)
8. Compare all classifiers using:
   - Accuracy
   - Confusion matrices
   - Classification reports (precision, recall, f1-score)
9. Determine the best classifier for diabetes risk prediction

### Expected Deliverables:
- Comprehensive EDA visualizations
- Summary table comparing all classifiers
- Individual confusion matrices for each classifier
- Discussion on why Random Forest might outperform a single Decision Tree
- Recommendations for healthcare deployment
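
A compact way to organize steps 5 to 8 is to keep the four classifiers in a dictionary and collect their metrics into one table. The sketch below does this on a small synthetic frame: the feature values and the risk rule are invented for illustration, and only three of the assignment's feature columns are included to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical values, subset of the real columns).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "Glucose_Level": rng.normal(110, 25, size=n),
    "BMI": rng.normal(27, 5, size=n),
    "Physical_Activity": rng.choice(["Low", "Medium", "High"], size=n),
})

# Ordinal encoding with the mapping given in the instructions.
df["Physical_Activity"] = df["Physical_Activity"].map(
    {"Low": 0, "Medium": 1, "High": 2}
)

# Hypothetical noisy risk rule used only to generate a target column.
risk = (0.03 * (df["Glucose_Level"] - 110)
        + 0.10 * (df["BMI"] - 27)
        - 0.40 * (df["Physical_Activity"] - 1))
df["Diabetes_Risk"] = (risk + rng.normal(0, 0.8, size=n) > 0).astype(int)

X = df.drop(columns="Diabetes_Risk")
y = df["Diabetes_Risk"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "DT (full depth)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "DT (max_depth=5)": DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0),
    "RF (10 trees)": RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0),
    "RF (50 trees)": RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary.round(3))
```

The same loop is a natural place to call `confusion_matrix` and `classification_report` from `sklearn.metrics` for each model.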

In [None]:
# Assignment 3: Decision Tree vs. Random Forest Comparison
# Import libraries

In [None]:
# Load and explore the dataset

In [None]:
# Comprehensive Exploratory Data Analysis

In [None]:
# Visualize class distribution and feature correlations

In [None]:
# Prepare data: encode categorical, separate features/target, split

In [None]:
# Implement Decision Tree (no max_depth)

In [None]:
# Implement Decision Tree (max_depth=5)

In [None]:
# Implement Random Forest (n_estimators=10)

In [None]:
# Implement Random Forest (n_estimators=50)

In [None]:
# Create comparison table and visualizations

### Analysis and Conclusions

*Write your analysis here:*
- Which classifier performed best overall?
- Why does Random Forest typically outperform a single Decision Tree?
- What is the effect of max_depth on Decision Tree performance?
- What is the effect of n_estimators on Random Forest performance?
- Which classifier would you recommend for healthcare deployment and why?

---

# Part 3: Assessment

This assessment evaluates your ability to apply all the tree-based classification techniques learned this week.

---

## Assessment: End-to-End Employee Attrition Prediction System

**Objective:** Build a complete machine learning pipeline to predict employee attrition using Decision Tree and Random Forest classifiers.

**Dataset:** `Assessment-Dataset/employee_attrition_prediction.csv`

**Context:** A large technology company is concerned about employee turnover. They want to build a predictive model that can identify employees who are likely to leave the company, so HR can proactively engage with at-risk employees and implement retention strategies.

---

### Section A: Data Loading and Exploration



---

### Section B: Exploratory Data Analysis

1. Analyze the relationship between each feature and employee attrition
2. Create visualizations for:
   - Distribution of numerical features by attrition status
   - Count plots for categorical features by attrition status
   - Correlation heatmap for numerical features
3. Analyze attrition by:
   - Department
   - Job satisfaction level
   - Work-life balance
   - Overtime status
4. Document your findings and insights
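
For the group-wise analysis in step 3, a small pandas sketch is shown below. The frame is invented for illustration. It reuses the column names `Department`, `Overtime`, and `Left_Company` from this assessment and assumes `Left_Company` is coded as 0/1.

```python
import pandas as pd

# Tiny hypothetical frame; the values are made up for illustration.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Engineering", "Engineering",
                   "HR", "HR", "Sales", "Engineering"],
    "Overtime": ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
    "Left_Company": [1, 0, 0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 target within each group is that group's attrition rate.
rate_by_dept = (
    df.groupby("Department")["Left_Company"].mean().sort_values(ascending=False)
)
print(rate_by_dept)

# Row-normalised cross-tab: share of stayers (0) and leavers (1) per overtime status.
overtime_table = pd.crosstab(df["Overtime"], df["Left_Company"], normalize="index")
print(overtime_table)
```

The same pattern applies to any categorical column, such as job satisfaction or work-life balance, once the actual column names in the dataset are known.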

In [None]:
# Section B: Exploratory Data Analysis
# Analyze numerical features by attrition status

In [None]:
# Analyze categorical features by attrition

In [None]:
# Create correlation heatmap

In [None]:
# Analyze attrition by department and other key factors

**EDA Findings:**

*Document your key findings here:*
- 
- 
- 

---

### Section C: Data Preprocessing

1. Handle categorical variables:
   - Encode Gender (Male=1, Female=0)
   - Encode Education_Level (Bachelor=0, Master=1, PhD=2)
   - Encode Department using Label Encoding or One-Hot Encoding
   - Encode Job_Role using Label Encoding
   - Encode Overtime (Yes=1, No=0)
2. Create feature matrix (X) and target vector (y)
   - Features: All columns except Employee_ID and Left_Company
   - Target: Left_Company
3. Split data into training (80%) and test (20%) sets with random_state=42
4. Note: Feature scaling is optional for tree-based methods (discuss why)
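
The encodings listed in step 1 can be sketched as follows, using a three-row hypothetical frame. The category values for `Department` and `Job_Role` are invented here. One-hot encoding is chosen for `Department` only as an example of the two permitted options.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; real category values may differ.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education_Level": ["Master", "Bachelor", "PhD"],
    "Department": ["Sales", "Engineering", "HR"],
    "Job_Role": ["Manager", "Engineer", "Analyst"],
    "Overtime": ["Yes", "No", "Yes"],
})

# Explicit mappings where the instructions fix the codes.
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df["Education_Level"] = df["Education_Level"].map(
    {"Bachelor": 0, "Master": 1, "PhD": 2}
)
df["Overtime"] = df["Overtime"].map({"Yes": 1, "No": 0})

# LabelEncoder assigns integer codes following the sorted order of the labels.
df["Job_Role"] = LabelEncoder().fit_transform(df["Job_Role"])

# One-hot encoding creates one 0/1 column per department.
df = pd.get_dummies(df, columns=["Department"], dtype=int)
print(df)
```

`Series.map` returns `NaN` for any value missing from the mapping, so checking `df.isna().sum()` afterwards is a quick way to catch unexpected categories. Note that scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is its counterpart for feature columns.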

In [None]:
# Section C: Data Preprocessing
# Handle categorical variables

In [None]:
# Create feature matrix and target vector

In [None]:
# Split data into training and test sets

---

### Section D: Model Building

Build and evaluate the following classifiers:

**D1. Decision Tree Classifier**
- Build a basic Decision Tree with criterion='entropy' and random_state=0
- Experiment with max_depth values (3, 5, 7, 10, None)
- Find the optimal max_depth
- Evaluate the best Decision Tree model

**D2. Random Forest Classifier**
- Build Random Forest with n_estimators=10, criterion='entropy', random_state=0
- Experiment with n_estimators values (10, 50, 100, 150)
- Find the optimal n_estimators
- Evaluate the best Random Forest model

**D3. Feature Importance Analysis**
- Extract feature importance from both models
- Identify top 5 most important features
- Visualize feature importance
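
For D3, both scikit-learn model types expose a `feature_importances_` attribute after fitting. The sketch below collects them into one table on synthetic data; the feature names are placeholders, and the `max_depth=5` and `n_estimators=50` settings are arbitrary choices for the example rather than recommended values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 8 features, 3 of which carry signal.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=5, random_state=0
).fit(X, y)
forest = RandomForestClassifier(
    n_estimators=50, criterion="entropy", random_state=0
).fit(X, y)

# Impurity-based importances; each column sums to 1.
importances = pd.DataFrame(
    {
        "decision_tree": tree.feature_importances_,
        "random_forest": forest.feature_importances_,
    },
    index=feature_names,
)

top5 = importances.sort_values("random_forest", ascending=False).head(5)
print(top5.round(3))

# One way to visualize, assuming matplotlib is installed:
# top5.plot.barh()
```

When working with a DataFrame of real features, `X.columns` can replace the placeholder names.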

In [None]:
# Section D1: Decision Tree Classifier
# Test different max_depth values

In [None]:
# Plot max_depth vs accuracy

In [None]:
# Build final Decision Tree model with optimal max_depth

In [None]:
# Section D2: Random Forest Classifier
# Test different n_estimators values

In [None]:
# Plot n_estimators vs accuracy

In [None]:
# Build final Random Forest model with optimal n_estimators

In [None]:
# Section D3: Feature Importance Analysis
# Extract and visualize feature importance

---

### Section E: Model Comparison and Selection

1. Create a comprehensive comparison table including:
   - Accuracy
   - Precision
   - Recall
   - F1-Score
2. Visualize the comparison using bar charts
3. Analyze confusion matrices for both models
4. Select the best model for the employee attrition prediction task
5. Justify your model selection considering:
   - Overall performance metrics
   - Business requirements (cost of false positives vs. false negatives)
   - Model interpretability
   - Feature importance insights
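
To make the false-positive versus false-negative consideration concrete, the sketch below unpacks a confusion matrix into its four cells and applies per-error costs. The labels are a hand-written toy example, and the cost figures (`COST_FP`, `COST_FN`) are purely hypothetical placeholders. Realistic values would have to come from the business context.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels: 1 = employee left, 0 = employee stayed.
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# Hypothetical costs: an unnecessary retention effort vs. a missed departure.
COST_FP = 1_000
COST_FN = 10_000
total_cost = fp * COST_FP + fn * COST_FN

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"Estimated error cost: {total_cost}")
```

Computing the same cost for each candidate model gives one extra column for the comparison table, next to accuracy, precision, recall, and F1-score.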

In [None]:
# Section E: Model Comparison
# Create comparison table

In [None]:
# Visualize comparison using bar charts

In [None]:
# Display confusion matrices for both models

In [None]:
# Final model selection and justification

---

### Section F: Conclusions and Recommendations

Write a comprehensive report addressing:

1. **Summary of Findings:**
   - Key features influencing employee attrition
   - Performance comparison: Decision Tree vs. Random Forest
   - Best performing model and configuration

2. **Business Recommendations:**
   - What are the top factors driving employee attrition?
   - Which employee segments are at highest risk?
   - What retention strategies would you recommend?

3. **Technical Recommendations:**
   - Which model should be deployed and why?
   - How does Random Forest compare to Decision Tree for this problem?
   - What monitoring should be in place?
   - How might tree-based methods compare to other classifiers (e.g., KNN, SVM)?

## Final Report

### 1. Summary of Findings

*Write your summary here:*


### 2. Business Recommendations

*Write your business recommendations here:*


### 3. Technical Recommendations

*Write your technical recommendations here:*



## Provide your publication link below!

Link: 

**Good luck!**