## Part 1: Tasks

These tasks focus on fundamental concepts from this week's module.

### Task 1: Random Forest Regression Basics

**Dataset:** `Task-Datasets/task1_random_forest_data.csv`

**Objective:** Build a Random Forest regression model to predict crop yield based on weather features.

**Requirements:**
1. Load the dataset and explore its structure
2. Split the data into training (80%) and testing (20%) sets with `random_state=42`
3. Create a Random Forest Regressor with:
   - `n_estimators=100`
   - `max_depth=10`
   - `random_state=42`
4. Train the model and make predictions on the test set
5. Calculate and print the R² score
6. Display feature importance in descending order
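
As a rough illustration of the workflow above, the following sketch runs the same steps on synthetic data rather than the task CSV. The column names (`rainfall`, `temperature`, `humidity`, `crop_yield`) and the generated values are assumptions made up for this example; the real dataset may use different columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the task dataset; column names are hypothetical.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "rainfall": rng.uniform(0, 200, n),
    "temperature": rng.uniform(10, 35, n),
    "humidity": rng.uniform(20, 90, n),
})
df["crop_yield"] = (
    0.05 * df["rainfall"] + 0.3 * df["temperature"] + rng.normal(0, 1, n)
)

# Separate features from the target, then hold out 20% for testing.
X = df.drop(columns="crop_yield")
y = df["crop_yield"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest with the hyperparameters listed in the requirements.
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Score on the held-out test set.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² score: {r2:.3f}")

# Feature importances, largest first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

With the real data, the `DataFrame` construction would be replaced by `pd.read_csv` on the task file, and the target column name adjusted to match.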

In [None]:
# Import necessary libraries


In [None]:
# Load and explore the dataset


In [None]:
# Split the data into features and target


In [None]:
# Split into training and testing sets


In [None]:
# Create and train the Random Forest model


In [None]:
# Make predictions and calculate R² score


In [None]:
# Display feature importance


---

### Task 2: Model Performance Evaluation

**Dataset:** `Task-Datasets/task2_model_evaluation_data.csv`

**Objective:** Evaluate a regression model using multiple performance metrics.

**Requirements:**
1. Load the dataset and split into features (temperature, humidity, wind_speed) and target (power_output)
2. Split into training (70%) and testing (30%) sets with `random_state=42`
3. Train a Random Forest Regressor with `n_estimators=50, random_state=42`
4. Make predictions on the test set
5. Calculate and display ALL of the following metrics:
   - R² Score
   - Adjusted R² Score (use formula: 1 - (1-R²) * (n-1)/(n-k-1), where n is the number of samples the score is computed on and k is the number of features)
   - Mean Absolute Error (MAE)
   - Mean Squared Error (MSE)
   - Root Mean Squared Error (RMSE)
6. Create a visualization comparing actual vs predicted values
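
To make the metric definitions concrete, here is a minimal pure-Python sketch that computes all five from two lists. The helper name `regression_metrics` and the sample values are hypothetical; in the task itself, `sklearn.metrics` provides `r2_score`, `mean_absolute_error`, and `mean_squared_error`.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return R², adjusted R², MAE, MSE, and RMSE for paired values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R² penalises R² for the number of features k.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mae = sum(abs(e) for e in errors) / n
    mse = ss_res / n
    return {"r2": r2, "adj_r2": adj_r2, "mae": mae, "mse": mse, "rmse": math.sqrt(mse)}

# Made-up values for illustration only; k = 3 matches the three task features.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 11.5, 13.5]
metrics = regression_metrics(y_true, y_pred, n_features=3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because (n-1)/(n-k-1) is greater than 1 whenever k ≥ 1, the adjusted score is never higher than R² itself.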

In [None]:
# Import necessary libraries


In [None]:
# Load the dataset


In [None]:
# Split into features and target


In [None]:
# Split into training and testing sets


In [None]:
# Train the model


In [None]:
# Make predictions


In [None]:
# Calculate all evaluation metrics


In [None]:
# Visualize actual vs predicted values


---

### Task 3: Binary Classification with Logistic Regression

**Dataset:** `Task-Datasets/task3_classification_data.csv`

**Objective:** Build a logistic regression model to classify emails as spam or not spam.

**Requirements:**
1. Load the dataset and explore its structure
2. Split into features (all columns except 'is_spam') and target ('is_spam')
3. Split into training (80%) and testing (20%) sets with `random_state=42`
4. Create and train a Logistic Regression model with `random_state=42, max_iter=1000`
5. Make predictions on the test set
6. Calculate and display:
   - Confusion Matrix
   - Accuracy Score
   - Precision Score
   - Recall Score
   - F1 Score
7. Interpret the results: Which metric is most important for spam detection and why?
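
The relationship between the confusion matrix and the four scores can be sketched in plain Python. The helper names and the ten labels below are made up for illustration; `sklearn.metrics` offers `confusion_matrix`, `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` for the real task.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted spam, how much is spam
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual spam, how much was caught
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
scores = classification_metrics(tp, fp, fn, tn)
print(tp, fp, fn, tn)
print(scores)
```

A false positive here is a legitimate email flagged as spam, and a false negative is spam that reaches the inbox; which error is more costly is the question the interpretation step asks about.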

In [None]:
# Import necessary libraries


In [None]:
# Load and explore the dataset


In [None]:
# Split into features and target


In [None]:
# Split into training and testing sets


In [None]:
# Create and train the Logistic Regression model


In [None]:
# Make predictions


In [None]:
# Calculate and display classification metrics


**Interpretation:**

*Write your answer here about which metric is most important for spam detection and why*

---

## Part 2: Assignments

These assignments require deeper analysis and comparison of multiple models.

### Assignment 1: Comparative Regression Analysis

**Dataset:** `Assignment-Dataset/assignment1_house_prices.csv`

**Objective:** Compare multiple regression models for house price prediction.

**Requirements:**
1. Load the dataset and perform exploratory data analysis:
   - Display basic statistics
   - Check for missing values
   - Visualize the distribution of house prices
   - Create a correlation heatmap
2. Prepare the data:
   - Split into features and target (price)
   - Split into training (80%) and testing (20%) sets with `random_state=42`
3. Train and evaluate THREE models:
   - Linear Regression
   - Decision Tree Regressor (`max_depth=10, random_state=42`)
   - Random Forest Regressor (`n_estimators=100, max_depth=10, random_state=42`)
4. For each model, calculate:
   - R² Score
   - Adjusted R² Score
   - MAE
   - RMSE
5. Create a comparison table or visualization showing all metrics for all models
6. Provide a written analysis:
   - Which model performed best?
   - Why might Random Forest outperform or underperform compared to simpler models?
   - Which features are most important for price prediction? (use Random Forest feature importance)
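
One possible way to build the comparison table is to loop over the three models and collect their metrics in a `DataFrame`. The sketch below uses synthetic data with hypothetical feature names (`area_sqft`, `bedrooms`, `age_years`); it is not the assignment dataset and its numbers say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the house price data; feature names are hypothetical.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "area_sqft": rng.uniform(500, 3500, n),
    "bedrooms": rng.integers(1, 6, n),
    "age_years": rng.uniform(0, 50, n),
})
y = (
    150 * X["area_sqft"] + 10_000 * X["bedrooms"] - 800 * X["age_years"]
    + rng.normal(0, 20_000, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=10, random_state=42),
    "Random Forest": RandomForestRegressor(
        n_estimators=100, max_depth=10, random_state=42
    ),
}

# Fit each model and record the same four metrics on the test set.
n_test, k = X_test.shape
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    r2 = r2_score(y_test, pred)
    rows.append({
        "model": name,
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n_test - 1) / (n_test - k - 1),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": np.sqrt(mean_squared_error(y_test, pred)),
    })

# One row per model, best (lowest) RMSE first.
comparison = pd.DataFrame(rows).set_index("model").sort_values("rmse")
print(comparison.round(3))
```

Keeping the models in a dictionary means every model is evaluated on the same split with the same code, which keeps the comparison fair.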

In [None]:
# Import necessary libraries


In [None]:
# Load the dataset


In [None]:
# Exploratory Data Analysis


In [None]:
# Data preparation


In [None]:
# Model 1: Linear Regression


In [None]:
# Model 2: Decision Tree Regressor


In [None]:
# Model 3: Random Forest Regressor


In [None]:
# Compare all models


In [None]:
# Feature importance analysis


**Analysis:**

*Write your comparative analysis here*

---

### Assignment 2: Binary Classification with Model Tuning

**Dataset:** `Assignment-Dataset/assignment2_marketing_campaign.csv`

**Objective:** Build and optimize a classification model to predict customer conversion.

**Requirements:**
1. Load and explore the dataset:
   - Check class distribution (converted vs not converted)
   - Identify any class imbalance
   - Visualize key features by conversion status
2. Prepare the data:
   - Split into features and target (converted)
   - Split into training (75%) and testing (25%) sets with `random_state=42`
3. Train TWO Logistic Regression models:
   - Model A: Default parameters with `random_state=42, max_iter=1000`
   - Model B: With class balancing `class_weight='balanced', random_state=42, max_iter=1000`
4. For each model, calculate and display:
   - Confusion Matrix
   - Accuracy, Precision, Recall, F1-Score
   - ROC Curve and AUC Score
5. Compare the two models:
   - Which performs better overall?
   - How does class balancing affect the results?
   - What is the trade-off between precision and recall?
6. Make business recommendations:
   - Which model would you deploy and why?
   - What threshold would you use for classifying conversions?
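
To see what `class_weight='balanced'` does in Model B, the sketch below reproduces the weighting rule documented by scikit-learn, `n_samples / (n_classes * count_of_class)`, in plain Python. The helper name `balanced_class_weights` and the 90/10 class split are assumptions chosen for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Illustrative imbalance: 90 non-converted customers, 10 converted.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)
```

In this example, an error on a minority-class sample counts nine times as much as an error on a majority-class sample during training, which typically raises recall for the minority class at some cost to precision.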

In [None]:
# Import necessary libraries


In [None]:
# Load and explore the dataset


In [None]:
# Check class distribution


In [None]:
# Data preparation


In [None]:
# Model A: Default Logistic Regression


In [None]:
# Evaluate Model A


In [None]:
# Model B: Balanced Logistic Regression


In [None]:
# Evaluate Model B


In [None]:
# ROC Curves comparison


**Comparative Analysis:**

*Write your comparison here*

**Business Recommendations:**

*Write your recommendations here*

---

### Assignment 3: Multi-Class Classification

**Dataset:** `Assignment-Dataset/assignment3_credit_risk.csv`

**Objective:** Build a multi-class classification model to predict credit risk levels.

**Requirements:**
1. Load and explore the dataset:
   - Examine the distribution of risk levels (Low, Medium, High)
   - Analyze key features across different risk categories
   - Check for any data quality issues
2. Prepare the data:
   - Split into features and target (risk_level)
   - Split into training (70%) and testing (30%) sets with `random_state=42`
3. Train THREE classification models:
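   - Logistic Regression (multi-class, one-vs-rest: `multi_class='ovr', random_state=42, max_iter=1000`; the `multi_class` parameter is deprecated as of scikit-learn 1.5, so on newer versions wrap the model in `OneVsRestClassifier` instead)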
   - Decision Tree Classifier (`max_depth=10, random_state=42`)
   - Random Forest Classifier (`n_estimators=100, max_depth=10, random_state=42`)
4. For each model, evaluate using:
   - Confusion Matrix (use heatmap visualization)
   - Classification Report (precision, recall, f1-score for each class)
   - Overall Accuracy
5. Analyze model performance:
   - Which risk level is easiest/hardest to predict?
   - Which model performs best for each risk category?
   - Are there any systematic misclassifications?
6. Feature importance:
   - Display the top 5 most important features from Random Forest
   - Explain how these features relate to credit risk
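
For the question of which risk level is easiest or hardest to predict, per-class precision and recall can be read directly off the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes); the counts are invented for illustration and the helper name `per_class_report` is hypothetical.

```python
def per_class_report(cm, labels):
    """Compute precision and recall per class from a square confusion matrix."""
    report = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        actual = sum(cm[i])                      # row sum: samples truly in this class
        predicted = sum(row[i] for row in cm)    # column sum: samples predicted as this class
        report[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report

labels = ["Low", "Medium", "High"]
# Made-up counts: rows = true class, columns = predicted class.
cm = [
    [50, 8, 2],
    [10, 30, 10],
    [1, 9, 30],
]

report = per_class_report(cm, labels)
accuracy = sum(cm[i][i] for i in range(len(labels))) / sum(map(sum, cm))
hardest = min(report, key=lambda label: report[label]["recall"])

for label, scores in report.items():
    print(label, {k: round(v, 3) for k, v in scores.items()})
print(f"accuracy: {accuracy:.3f}, lowest recall: {hardest}")
```

The off-diagonal cells also show the direction of the errors; in these invented counts the middle class is confused with both neighbours, which is the kind of systematic misclassification requirement 5 asks about.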

In [None]:
# Import necessary libraries


In [None]:
# Load and explore the dataset


In [None]:
# Analyze risk level distribution


In [None]:
# Data preparation


In [None]:
# Model 1: Logistic Regression


In [None]:
# Evaluate Logistic Regression


In [None]:
# Model 2: Decision Tree Classifier


In [None]:
# Evaluate Decision Tree


In [None]:
# Model 3: Random Forest Classifier


In [None]:
# Evaluate Random Forest


In [None]:
# Feature importance analysis


**Performance Analysis:**

*Write your analysis here*

**Feature Importance Interpretation:**

*Explain how the top features relate to credit risk*

---

## Part 3: Assessment Project

This is a comprehensive project that combines all the concepts learned this week.

### Assessment: Customer Churn Prediction - End-to-End ML Project

**Dataset:** `Assessment-Dataset/customer_churn_prediction.csv`

**Business Context:**
You are a data scientist at a telecommunications company. The company is losing customers to competitors and wants to identify which customers are at risk of churning (leaving the service). Your task is to build a comprehensive machine learning solution to:
1. Predict which customers will churn
2. Identify the key factors driving churn
3. Provide actionable insights to the retention team

**Dataset Description:**
The dataset contains 500 customer records with 19 features including:
- Demographics: age, gender
- Account info: tenure_months, contract_type, payment_method
- Service usage: monthly_charges, total_charges, internet_service, phone_service
- Support metrics: support_calls, satisfaction_score, complaint_filed
- Service features: streaming_tv, streaming_movies, online_security, online_backup, device_protection
- Target: churned (0 = active, 1 = churned)

---

### Phase 1: Data Understanding & Exploration

Summarize your EDA findings as 3-5 key insights.

**EDA Summary:**

*Write your key findings here*

1. 
2. 
3. 
4. 
5. 

---

### Phase 2: Data Preprocessing

---

### Phase 3: Model Building & Evaluation

---

### Phase 4: Feature Importance & Insights

---

### Phase 5: Business Recommendations

**Step 5.1:** Model selection justification:

Based on your analysis, answer the following:

1. **Which model would you recommend for deployment and why?**
   - Consider accuracy, interpretability, and business needs
   - Think about the cost of false positives vs false negatives

2. **What probability threshold would you use for classification?**
   - Default is 0.5, but should it be adjusted?
   - Consider the business impact of missing a churner vs false alarms

3. **How confident are you in the model's predictions?**
   - What are the limitations?
   - What additional data might improve performance?
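
One way to reason about the threshold question is to attach a cost to each error type and compare the total cost at several candidate thresholds. The sketch below is only an illustration: the function name `total_cost`, the labels, the probabilities, and the costs (500 for a missed churner, 50 for a false alarm) are all assumptions, not values from the dataset or the business.

```python
def total_cost(y_true, churn_probs, threshold, cost_fn, cost_fp):
    """Sum the cost of missed churners (FN) and false alarms (FP) at a threshold."""
    cost = 0.0
    for actual, prob in zip(y_true, churn_probs):
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 0:
            cost += cost_fn      # churner the model failed to flag
        elif actual == 0 and predicted == 1:
            cost += cost_fp      # loyal customer flagged unnecessarily
    return cost

# Made-up labels and predicted churn probabilities for eight customers.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
churn_probs = [0.80, 0.30, 0.45, 0.55, 0.10, 0.35, 0.20, 0.60]

thresholds = [0.3, 0.4, 0.5, 0.6]
costs = {
    t: total_cost(y_true, churn_probs, t, cost_fn=500, cost_fp=50)
    for t in thresholds
}
best_threshold = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold {t}: total cost {c:.0f}")
print("lowest-cost threshold:", best_threshold)
```

When a missed churner is assumed to cost far more than a false alarm, as here, the lowest-cost threshold falls below the default of 0.5. With a real model, the probabilities would come from `predict_proba` on a validation set.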

**Model Selection:**

*Write your answer here*

**Step 5.2:** Actionable recommendations for the retention team:

Based on your feature importance analysis and model insights, provide 5-7 specific, actionable recommendations:

Example format (the figures are illustrative, not drawn from the dataset):
- **Recommendation 1:** Target customers with month-to-month contracts for conversion to annual contracts
  - *Insight:* 60% of churners had month-to-month contracts
  - *Action:* Offer 10% discount for switching to annual contract
  - *Expected Impact:* Reduce churn by 15-20% in this segment

**Business Recommendations:**

1. **Recommendation 1:**
   - Insight:
   - Action:
   - Expected Impact:

2. **Recommendation 2:**
   - Insight:
   - Action:
   - Expected Impact:

3. **Recommendation 3:**
   - Insight:
   - Action:
   - Expected Impact:

4. **Recommendation 4:**
   - Insight:
   - Action:
   - Expected Impact:

5. **Recommendation 5:**
   - Insight:
   - Action:
   - Expected Impact:

**Step 5.3:** Implementation plan:

Outline how this model would be deployed in production:
1. How often should the model be retrained?
2. What monitoring metrics would you track?
3. How would you measure the business impact?
4. What are the next steps for model improvement?

**Implementation Plan:**

1. **Retraining Schedule:**
   
2. **Monitoring Metrics:**
   
3. **Business Impact Measurement:**
   
4. **Next Steps:**