# **CAPSTONE PROJECT 2 TEMPLATE**
### [Your Project Title Here]

**Student Name:** [Your Name]  
**Date:** [Date]  
**Course:** Intermediate AI & Data Science  
**Instructor:** Amir Charkhi  
**AI Tech Institute**

---

## 📋 Project Overview

**Problem:** [One sentence describing what you're predicting/forecasting]

**Business Value:** [Why this matters - 1-2 sentences]

**Data Source:** [Where you got the data]

**Target Variable:** [What you're predicting]

**Success Metric:** [How you'll measure success]

---

## 1. Setup & Imports

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Plotly for interactive visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline

# Add your specific model imports here
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.linear_model import LogisticRegression
# etc.

In [None]:
# Metrics (adjust based on your problem type)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,  # Classification
    mean_squared_error, mean_absolute_error, r2_score,        # Regression
    classification_report, confusion_matrix                    # Detailed reports
)

In [None]:
# Model persistence
import joblib

---
## 2. Load Data

In [None]:
# Load your dataset
# df = pd.read_csv('your_data.csv')
# OR
# df = pd.read_excel('your_data.xlsx')
# OR
# Load from Kaggle, API, etc.

In [None]:
# Display first few rows
# df.head()

In [None]:
# Dataset shape
# f"Dataset: {df.shape[0]} rows × {df.shape[1]} columns"

---
## 3. Exploratory Data Analysis (EDA)

### 3.1 Basic Statistics

In [None]:
# Dataset info
# df.info()

In [None]:
# Summary statistics
# df.describe()

In [None]:
# Check missing values
# missing = df.isnull().sum()
# missing[missing > 0]
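
Counts alone can be hard to judge, so it may help to report the share of missing values per column as well. The sketch below is one way to do that; `df_demo` and its column names are hypothetical stand-ins for your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your dataset
df_demo = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
    "city": ["Perth", "Sydney", None, "Perth"],
    "id": [1, 2, 3, 4],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": df_demo.isnull().sum(),
    "missing_pct": df_demo.isnull().mean() * 100,
})

# Keep only columns with at least one missing value, worst first
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_pct", ascending=False)
)
print(missing_summary)
```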

### 3.2 Target Variable Analysis

In [None]:
# Analyze your target variable
# For classification: distribution of classes
# For regression: distribution of values
# For time series: plot over time
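
For a classification target, a quick class-balance check might look like the sketch below. The DataFrame `df_demo` and its `churned` column are hypothetical; replace them with your own data and target column.

```python
import pandas as pd

# Hypothetical toy target: 1 = churned, 0 = stayed
df_demo = pd.DataFrame({"churned": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

# Absolute counts and proportions per class
class_counts = df_demo["churned"].value_counts()
class_share = df_demo["churned"].value_counts(normalize=True)

# Ratio of the largest class to the smallest: a rough imbalance indicator
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_share.round(2))
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A strongly imbalanced target is a hint that accuracy alone may be a misleading success metric.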

### 3.3 Feature Distributions

In [None]:
# Visualize distributions of key features
# Use histograms, box plots, etc.

### 3.4 Correlations

In [None]:
# Correlation matrix
# numerical_cols = df.select_dtypes(include=[np.number]).columns
# corr = df[numerical_cols].corr()

# fig = px.imshow(corr, 
#                 text_auto='.2f',
#                 title='Feature Correlations')
# fig.show()

### 3.5 Key Insights from EDA

**Document your findings here:**
- Finding 1: [What you discovered]
- Finding 2: [What you discovered]
- Finding 3: [What you discovered]

---

## 4. Data Preprocessing

### 4.1 Handle Missing Values

In [None]:
# Handle missing values
# Options: drop, fill with mean/median, forward fill, etc.
# df_clean = df.dropna()
# OR
# df_clean = df.fillna(df.mean(numeric_only=True))  # fills numeric columns only

### 4.2 Handle Outliers

In [None]:
# Identify and handle outliers if necessary
# Use IQR method, z-score, or domain knowledge
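
The IQR method mentioned above can be sketched as follows. The series `values` is made-up data, and the 1.5 multiplier is the conventional default rather than a requirement; adjust it to your domain.

```python
import pandas as pd

# Hypothetical feature with one obvious extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="hypothetical_feature")

# Interquartile range and the usual 1.5 * IQR fences
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the fences
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# Option A: drop the flagged rows
values_dropped = values[~is_outlier]

# Option B: clip (winsorize) to the fences instead of dropping
values_clipped = values.clip(lower=lower, upper=upper)

print(f"Fences: [{lower}, {upper}]")
print(outliers)
```

Whether to drop, clip, or keep flagged points depends on whether they are data errors or genuine rare cases.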

### 4.3 Feature Engineering

In [None]:
# Create new features if beneficial
# Examples:
# - Combine existing features
# - Extract date components (year, month, day)
# - Create ratios or differences
# - Bin continuous variables
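
The ideas listed above could look like this in pandas. All column names (`order_date`, `revenue`, `units`, `customer_age`) and the bin edges are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-06-30", "2024-11-05"]),
    "revenue": [200.0, 450.0, 120.0],
    "units": [4, 9, 2],
    "customer_age": [23, 41, 67],
})

# Extract date components
df_demo["order_year"] = df_demo["order_date"].dt.year
df_demo["order_month"] = df_demo["order_date"].dt.month
df_demo["order_dayofweek"] = df_demo["order_date"].dt.dayofweek  # Monday = 0

# Create a ratio from two existing features
df_demo["revenue_per_unit"] = df_demo["revenue"] / df_demo["units"]

# Bin a continuous variable into labelled groups (right edge inclusive)
df_demo["age_group"] = pd.cut(
    df_demo["customer_age"],
    bins=[0, 30, 50, 120],
    labels=["young", "middle", "senior"],
)

print(df_demo)
```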

### 4.4 Encode Categorical Variables

In [None]:
# Encode categorical variables
# One-hot encoding for nominal categories
# Integer (ordinal) encoding for ordinal categories, using an explicit category order

# categorical_cols = df_clean.select_dtypes(include=['object']).columns
# df_encoded = pd.get_dummies(df_clean, columns=categorical_cols, drop_first=True)
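
For ordinal categories, one option is an explicit mapping so the integer codes follow the real order. scikit-learn's `LabelEncoder` assigns codes in sorted order and is intended for target labels, so it may not preserve the ranking you want. The sketch below combines both encodings; `size`, `colour`, and `size_order` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: one ordinal and one nominal column
df_demo = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # ordinal
    "colour": ["red", "blue", "red", "green"],       # nominal
})

# Ordinal: map categories to integers in their natural order
size_order = {"small": 0, "medium": 1, "large": 2}
df_demo["size_encoded"] = df_demo["size"].map(size_order)

# Nominal: one-hot encode, dropping the first level to avoid redundancy
df_encoded_demo = pd.get_dummies(
    df_demo.drop(columns="size"),
    columns=["colour"],
    drop_first=True,
)

print(df_encoded_demo)
```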

### 4.5 Prepare Features and Target

In [None]:
# Separate features and target
# X = df_encoded.drop('target_column', axis=1)
# y = df_encoded['target_column']

In [None]:
# Verify shapes
# X.shape, y.shape

---
## 5. Train-Test Split

In [None]:
# Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42
# )
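
For classification with imbalanced classes, passing `stratify=y` keeps the class proportions the same in both splits. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20% positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Train positive rate: {y_tr.mean():.2f}")
print(f"Test positive rate:  {y_te.mean():.2f}")
```

For time series, a random split is usually not appropriate; a chronological split is the common alternative.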

In [None]:
# Verify split
# f"Training: {X_train.shape[0]} samples | Testing: {X_test.shape[0]} samples"

---
## 6. Modeling

### 6.1 Baseline Model

In [None]:
# Start with a simple baseline model
# Example for classification:
# baseline_pipeline = Pipeline([
#     ('scaler', StandardScaler()),
#     ('model', LogisticRegression(random_state=42))
# ])

In [None]:
# Train baseline
# baseline_pipeline.fit(X_train, y_train)

In [None]:
# Evaluate baseline
# y_pred_baseline = baseline_pipeline.predict(X_test)
# baseline_score = accuracy_score(y_test, y_pred_baseline)
# f"Baseline Accuracy: {baseline_score:.1%}"

### 6.2 Model 2: [Your Second Model]

In [None]:
# Try a second model (e.g., Random Forest)
# model2 = Pipeline([
#     ('scaler', StandardScaler()),
#     ('model', RandomForestClassifier(random_state=42))
# ])

In [None]:
# Train
# model2.fit(X_train, y_train)

In [None]:
# Evaluate
# y_pred_2 = model2.predict(X_test)
# score_2 = accuracy_score(y_test, y_pred_2)
# f"Model 2 Accuracy: {score_2:.1%}"

### 6.3 Model 3: [Your Third Model]

In [None]:
# Try a third model
# Continue pattern from above

### 6.4 Compare Models

In [None]:
# Create comparison DataFrame
# results = pd.DataFrame({
#     'Model': ['Baseline', 'Model 2', 'Model 3'],
#     'Accuracy': [baseline_score, score_2, score_3],
#     'Precision': [...],
#     'Recall': [...],
#     'F1-Score': [...]
# })
# results
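
Rather than filling in each metric list by hand, the comparison table can be built in a loop. The sketch below uses synthetic data and two stand-in models (a `DummyClassifier` and a `LogisticRegression`), which are assumptions for illustration; swap in your own fitted pipelines and test split.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for your X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Stand-in models; replace with your own pipelines
models = {
    "Dummy (most frequent)": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, y_pred),
        # zero_division=0 avoids warnings when a class is never predicted
        "Precision": precision_score(y_te, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_te, y_pred, average="weighted", zero_division=0),
        "F1-Score": f1_score(y_te, y_pred, average="weighted", zero_division=0),
    })

results_demo = pd.DataFrame(rows)
print(results_demo.round(3))
```

Note that weighted recall is mathematically equal to accuracy, so those two columns will match; macro averaging is an alternative if you want each class to count equally.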

In [None]:
# Visualize comparison
# fig = px.bar(results, x='Model', y='Accuracy', 
#              title='Model Performance Comparison')
# fig.show()

---
## 7. Hyperparameter Tuning

### 7.1 Select Best Model for Tuning

In [None]:
# Based on comparison, select best model to tune
# best_model = model2  # Example

### 7.2 Define Parameter Grid

In [None]:
# Define parameters to search
# param_grid = {
#     'model__n_estimators': [50, 100, 200],
#     'model__max_depth': [5, 10, 20, None],
#     'model__min_samples_split': [2, 5, 10]
# }

### 7.3 Grid Search

In [None]:
# Perform grid search
# grid_search = GridSearchCV(
#     best_model,
#     param_grid,
#     cv=5,
#     scoring='accuracy',
#     n_jobs=-1
# )

# grid_search.fit(X_train, y_train)
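
Grid search cost grows quickly: the number of fits is the number of parameter combinations times the number of CV folds. The example grid in 7.2 has 3 × 4 × 3 = 36 combinations, so `cv=5` means 180 fits plus one final refit. The sketch below runs a much smaller grid on synthetic data to show the mechanics; the data and grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data standing in for X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# The step name 'model' is what the 'model__' prefix in the grid refers to
pipe = Pipeline([("model", RandomForestClassifier(random_state=42))])

# Deliberately small grid: 2 x 2 = 4 candidates
param_grid_demo = {
    "model__n_estimators": [10, 25],
    "model__max_depth": [3, None],
}

search = GridSearchCV(pipe, param_grid_demo, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

If the full grid is too slow, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead.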

In [None]:
# Best parameters
# grid_search.best_params_

In [None]:
# Best score
# grid_search.best_score_

### 7.4 Final Model

In [None]:
# Get best model
# final_model = grid_search.best_estimator_

---
## 8. Final Evaluation

### 8.1 Test Set Performance

In [None]:
# Predict on test set
# y_pred_final = final_model.predict(X_test)

In [None]:
# Calculate metrics
# For classification:
# final_accuracy = accuracy_score(y_test, y_pred_final)
# final_precision = precision_score(y_test, y_pred_final, average='weighted')
# final_recall = recall_score(y_test, y_pred_final, average='weighted')
# final_f1 = f1_score(y_test, y_pred_final, average='weighted')

# For regression:
# final_mse = mean_squared_error(y_test, y_pred_final)
# final_rmse = np.sqrt(final_mse)
# final_mae = mean_absolute_error(y_test, y_pred_final)
# final_r2 = r2_score(y_test, y_pred_final)
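
To make the regression metrics concrete, here is a small worked example on made-up numbers. The arrays `y_true` and `y_hat` are hypothetical; the errors are 1, 0, -2 and 1.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 6.0])

mse = mean_squared_error(y_true, y_hat)    # mean of squared errors: (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_hat)   # mean of absolute errors: (1 + 0 + 2 + 1) / 4
r2 = r2_score(y_true, y_hat)               # 1 - SS_res / SS_tot

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"R^2:  {r2:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so a large gap between the two suggests a few big misses.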

In [None]:
# Display metrics
# print("Final Model Performance:")
# print(f"Accuracy: {final_accuracy:.1%}")
# print(f"Precision: {final_precision:.1%}")
# print(f"Recall: {final_recall:.1%}")
# print(f"F1-Score: {final_f1:.1%}")

### 8.2 Classification Report

In [None]:
# Detailed classification report
# print(classification_report(y_test, y_pred_final))

### 8.3 Confusion Matrix

In [None]:
# For classification problems
# cm = confusion_matrix(y_test, y_pred_final)
# fig = px.imshow(cm, 
#                 text_auto=True,
#                 title='Confusion Matrix',
#                 labels=dict(x='Predicted', y='Actual'))
# fig.show()

### 8.4 Feature Importance

In [None]:
# If your model supports feature importance
# importances = final_model.named_steps['model'].feature_importances_
# feature_importance = pd.DataFrame({
#     'Feature': X.columns,
#     'Importance': importances
# }).sort_values('Importance', ascending=False)

# fig = px.bar(feature_importance.head(10), 
#              x='Importance', y='Feature',
#              title='Top 10 Most Important Features',
#              orientation='h')
# fig.show()

---
## 9. Business Insights & Recommendations

### 9.1 Key Findings

**Finding 1:** [Your insight]
- [Supporting detail]
- [Business implication]

**Finding 2:** [Your insight]
- [Supporting detail]
- [Business implication]

**Finding 3:** [Your insight]
- [Supporting detail]
- [Business implication]

### 9.2 Business Recommendations

1. **Recommendation 1:** [What should the business do?]
   - Expected impact: [Quantify if possible]
   
2. **Recommendation 2:** [What should the business do?]
   - Expected impact: [Quantify if possible]
   
3. **Recommendation 3:** [What should the business do?]
   - Expected impact: [Quantify if possible]

### 9.3 Model Limitations

- **Limitation 1:** [What doesn't the model handle well?]
- **Limitation 2:** [What assumptions were made?]
- **Limitation 3:** [What could go wrong?]

### 9.4 Future Improvements

- **Improvement 1:** [How could the model be enhanced?]
- **Improvement 2:** [What additional data would help?]
- **Improvement 3:** [What alternative approaches to try?]

---

## 10. Save Model

In [None]:
# Save final model
# joblib.dump(final_model, 'final_model.pkl')
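
After saving, it is worth confirming that the file loads and predicts the same way. The sketch below does a round trip with a small stand-in model on synthetic data; the model, data, and the file name `demo_model.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
model_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save to a temporary directory, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model_demo, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = (reloaded.predict(X_demo) == model_demo.predict(X_demo)).all()
print(f"Predictions match after reload: {same_predictions}")
```

Saved models are generally tied to the library versions used to create them, so recording the scikit-learn version alongside the file can save trouble later.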

In [None]:
# Save model metadata
# import json
# metadata = {
#     'model_type': 'RandomForestClassifier',
#     'accuracy': float(final_accuracy),
#     'features': X.columns.tolist(),
#     'training_date': '2024-12-XX',
#     'best_params': grid_search.best_params_
# }

# with open('model_metadata.json', 'w') as f:
#     json.dump(metadata, f, indent=2)

---
## 11. Conclusion

### Summary

**Problem:** [Restate the problem]

**Solution:** [Summarize your approach]

**Results:** [Key metrics and outcomes]

**Impact:** [Business value delivered]

### Next Steps

1. [What should happen next?]
2. [How should the model be deployed?]
3. [How should performance be monitored?]

---

**Thank you!**

---

**AI Tech Institute** | *Building Tomorrow's AI Engineers Today*