# Module 09: Putting It All Together

Welcome to the final module! It's time to apply everything you've learned in a **complete mini-research project**.

## What You'll Do

- Plan a complete research project from start to finish
- Apply all skills learned in Modules 00-08
- Create a reproducible research artifact
- Build a portfolio piece

## Time Required

**60 minutes** (but take as long as you need!)

---

In [None]:
# ========================================
# Setup
# ========================================

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Set style
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")

# Create output directory
output_dir = "outputs/notebook_09_final_project"
os.makedirs(output_dir, exist_ok=True)

print("✅ Setup complete!")
print(f"Output directory: {output_dir}")
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## The Research Project Framework

This notebook will guide you through a complete research project using all the skills you've learned:

### Phase 1: Planning (Modules 01-03)
1. Choose a research topic
2. Review existing literature
3. Formulate research question and hypothesis

### Phase 2: Design (Modules 04-06)
4. Design your methodology
5. Plan data collection
6. Consider ethical implications

### Phase 3: Execution (Modules 07-08)
7. Collect and analyze data
8. Document your work
9. Ensure reproducibility

### Phase 4: Communication
10. Present your findings
11. Discuss limitations
12. Suggest future work

Let's begin!

---
## PHASE 1: PLANNING
---

### Step 1: Choose Your Research Topic

For this mini-project, we'll use a sample topic. You can follow along or substitute your own!

**Sample Topic**: "Effect of Feature Engineering on Customer Churn Prediction"

**Your Topic** (fill in below):
```
My research topic: _________________________________
```

In [None]:
# ========================================
# Document Your Research Topic
# ========================================

# Change these to match your project
PROJECT_INFO = {
    "title": "Effect of Feature Engineering on Customer Churn Prediction",
    "researcher": "Your Name",
    "date_started": datetime.now().strftime("%Y-%m-%d"),
    "domain": "E-commerce / Customer Analytics",
    "keywords": ["churn prediction", "feature engineering", "machine learning"],
}

print("📋 PROJECT INFORMATION")
print("=" * 60)
for key, value in PROJECT_INFO.items():
    print(f"{key.replace('_', ' ').title()}: {value}")

# Save project info
pd.DataFrame([PROJECT_INFO]).to_csv(f"{output_dir}/project_info.csv", index=False)
print(f"\n✅ Project info saved to: {output_dir}/project_info.csv")
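
Saving `PROJECT_INFO` through a one-row DataFrame flattens the `keywords` list into its string representation in the CSV. If you want the list to come back as a list, JSON is one alternative. The sketch below is only an illustration: the file name `project_info.json` is hypothetical, and a temporary directory stands in for `output_dir`.

```python
import json
import tempfile
from pathlib import Path

# Same shape as PROJECT_INFO in the cell above; values are illustrative.
project_info = {
    "title": "Effect of Feature Engineering on Customer Churn Prediction",
    "researcher": "Your Name",
    "keywords": ["churn prediction", "feature engineering", "machine learning"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "project_info.json"  # hypothetical file name
    # Write and read back with an explicit encoding so the round trip
    # behaves the same on every platform.
    json_path.write_text(json.dumps(project_info, indent=2), encoding="utf-8")
    restored = json.loads(json_path.read_text(encoding="utf-8"))

# The keywords survive as a real list rather than a quoted string.
print(type(restored["keywords"]).__name__)
```

In the notebook itself you would point `json_path` at `output_dir` instead of a temporary directory.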

### Step 2: Literature Review Mini-Exercise

**Task**: Find 3-5 relevant papers on your topic

**Questions to answer**:
1. What have others discovered about this topic?
2. What methods have been tried?
3. What gaps exist in the literature?

For our sample project, here's a mini literature review:

In [None]:
# ========================================
# Document Literature Review
# ========================================

# Sample papers (replace with your actual papers)
literature = pd.DataFrame(
    {
        "Authors": ["Smith et al.", "Johnson & Lee", "Brown et al."],
        "Year": [2023, 2022, 2021],
        "Title": [
            "Feature Engineering for Churn Prediction",
            "Machine Learning in Customer Retention",
            "Automated Feature Selection Methods",
        ],
        "Key Finding": [
            "Behavioral features improve accuracy by 15%",
            "Random Forest outperforms Logistic Regression",
            "Automated selection reduces features by 60%",
        ],
        "Relevance": [5, 4, 4],
    }
)

print("📚 LITERATURE REVIEW SUMMARY")
print("=" * 60)
print(literature.to_string(index=False))

# Save literature review
literature.to_csv(f"{output_dir}/literature_review.csv", index=False)
print(f"\n✅ Literature review saved to: {output_dir}/literature_review.csv")

# Synthesize findings
print("\n💡 KEY INSIGHTS FROM LITERATURE:")
print("  1. Feature engineering significantly impacts performance")
print("  2. Behavioral features are particularly important")
print("  3. Feature selection can improve both accuracy and efficiency")

### Step 3: Formulate Research Question & Hypothesis

Based on your literature review, create a specific research question.

**Sample Research Question**:  
"Does adding engineered behavioral features (RFM metrics) improve customer churn prediction accuracy compared to using only demographic features?"

**Sample Hypothesis**:  
"Adding RFM (Recency, Frequency, Monetary) features will improve churn prediction accuracy by at least 10% compared to demographic features alone."

In [None]:
# ========================================
# Document Research Question and Hypothesis
# ========================================

research_design = {
    "research_question": "Does adding engineered behavioral features improve churn prediction?",
    "hypothesis": "RFM features will improve F1-score by >= 10% (relative)",
    "null_hypothesis": "RFM features will improve F1-score by < 10% (relative)",
    "independent_variable": "Feature set (demographic only vs. demographic + RFM)",
    "dependent_variable": "Prediction performance (F1-score)",
    "significance_level": 0.05,
}

print("🔬 RESEARCH DESIGN")
print("=" * 60)
for key, value in research_design.items():
    print(f"{key.replace('_', ' ').title()}:")
    print(f"  {value}\n")

# Save research design
with open(f"{output_dir}/research_design.txt", "w") as f:
    for key, value in research_design.items():
        f.write(f"{key.replace('_', ' ').title()}: {value}\n")

print(f"✅ Research design saved to: {output_dir}/research_design.txt")
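
The hypothesis above sets a 10% threshold, and the comparison later in this notebook measures it as a relative change in F1-score. Relative and absolute improvements can lead to different conclusions, so it helps to be explicit about which one you mean. A small sketch with made-up scores (the function names are illustrative, not part of the notebook):

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percent change relative to the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    return (new - baseline) / baseline * 100


def absolute_improvement(baseline: float, new: float) -> float:
    """Difference between the two scores, in percentage points."""
    return (new - baseline) * 100


# Illustrative values only; these are not results from this project.
baseline_f1, new_f1 = 0.50, 0.54
rel = relative_improvement(baseline_f1, new_f1)
abs_pts = absolute_improvement(baseline_f1, new_f1)
print(f"relative: {rel:.1f}%  absolute: {abs_pts:.1f} points")
```

With these numbers the relative gain is 8% while the absolute gain is 4 percentage points, so neither reading would meet a 10% threshold, but other score pairs can pass one definition and fail the other.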

---
## PHASE 2: DESIGN
---

### Step 4: Design Your Methodology

**Experimental Design**:
- **Type**: Comparative experiment
- **Groups**: 
  - Control: Model with demographic features only
  - Treatment: Model with demographic + RFM features
- **Evaluation**: Cross-validation with 5 folds
- **Metrics**: F1-score (primary), Accuracy, Precision, Recall (secondary)
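
To make the metrics above concrete, here is a minimal sketch that computes precision, recall, and F1-score from confusion-matrix counts. The counts are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    # Precision: of the customers predicted to churn, how many actually churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the customers who churned, how many the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts: 30 churners caught, 10 false alarms, 20 churners missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f}")
```

F1 is a reasonable primary metric here because it penalizes a model that trades recall for precision (or the reverse), which plain accuracy can hide.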

In [None]:
# ========================================
# Methodology Documentation
# ========================================

methodology = """
METHODOLOGY
============

1. DATA COLLECTION
   - Source: Simulated customer data (for demonstration)
   - Sample size: 1000 customers
   - Time period: 12 months
   
2. FEATURE ENGINEERING
   Control Features (Demographic):
   - Age
   - Gender
   - Location
   - Account tenure
   
   Treatment Features (RFM):
   - Recency: Days since last purchase
   - Frequency: Number of purchases
   - Monetary: Total spend
   
3. MODEL TRAINING
   - Algorithm: Random Forest Classifier
   - Cross-validation: 5-fold
   - Train/Test split: 80/20
   
4. EVALUATION
   - Primary metric: F1-score
   - Secondary metrics: Accuracy, Precision, Recall
   - Statistical test: Paired t-test (p < 0.05)
"""

print(methodology)

with open(f"{output_dir}/methodology.txt", "w") as f:
    f.write(methodology)

print(f"✅ Methodology saved to: {output_dir}/methodology.txt")

### Step 5: Ethical Considerations

Even with simulated data, consider ethical implications:

1. **Privacy**: Customer data must be anonymized
2. **Bias**: Check for demographic bias in predictions
3. **Transparency**: Document all decisions
4. **Fairness**: Ensure predictions are fair across groups
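
One simple way to start on the bias and fairness checks is to compare how often the model predicts churn for each group. The sketch below uses a tiny hand-made table; the column names and helper functions are illustrative assumptions, and the ratio it computes is only one of several possible fairness measures.

```python
import pandas as pd


def selection_rates(df: pd.DataFrame, group_col: str, pred_col: str) -> pd.Series:
    """Share of positive (churn) predictions within each group."""
    return df.groupby(group_col)[pred_col].mean()


def disparate_impact_ratio(rates: pd.Series) -> float:
    """Lowest group rate divided by the highest group rate (1.0 means parity)."""
    return rates.min() / rates.max()


# Hand-made predictions for illustration; not output from this project's models.
preds = pd.DataFrame(
    {
        "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
        "predicted_churn": [1, 1, 0, 0, 1, 0, 0, 0],
    }
)

rates = selection_rates(preds, "gender", "predicted_churn")
ratio = disparate_impact_ratio(rates)
print(rates.to_dict(), f"ratio={ratio:.2f}")
```

A ratio well below 1.0 (0.8 is a commonly cited rule-of-thumb cutoff) is a prompt to look closer, not proof of unfairness; per-group recall and precision are worth checking as well.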

In [None]:
# ========================================
# Ethics Checklist
# ========================================

ethics_checklist = """
ETHICS CHECKLIST
================

☑ Data Privacy
  - All customer IDs anonymized
  - No personally identifiable information (PII)
  - Secure data storage

☑ Bias and Fairness
  - Check model performance across demographic groups
  - Test for disparate impact
  - Document any biases found

☑ Transparency
  - All code and methods documented
  - Assumptions clearly stated
  - Limitations acknowledged

☑ Responsible Use
  - Model used to improve customer experience
  - Not used for discriminatory purposes
  - Regular audits planned
"""

print(ethics_checklist)

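# The checklist contains non-ASCII characters, so write it with an explicit encoding
with open(f"{output_dir}/ethics_checklist.txt", "w", encoding="utf-8") as f:
    f.write(ethics_checklist)

print(f"✅ Ethics checklist saved to: {output_dir}/ethics_checklist.txt")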

---
## PHASE 3: EXECUTION
---

### Step 6: Generate Sample Data

For this demonstration, we'll create synthetic data. In a real project, this is where you'd load your actual data.

In [None]:
# ========================================
# Generate Synthetic Customer Data
# ========================================

np.random.seed(42)  # For reproducibility!

n_customers = 1000

# Demographic features
data = pd.DataFrame(
    {
        "customer_id": range(1, n_customers + 1),
        "age": np.random.randint(18, 70, n_customers),
        "gender": np.random.choice(["M", "F"], n_customers),
        "location": np.random.choice(["Urban", "Suburban", "Rural"], n_customers),
        "tenure_months": np.random.randint(1, 60, n_customers),
    }
)

# RFM features (correlated with churn)
data["recency_days"] = np.random.randint(1, 365, n_customers)
data["frequency"] = np.random.randint(0, 50, n_customers)
data["monetary"] = np.random.uniform(0, 10000, n_customers)

# Generate churn label (influenced by RFM)
# Higher recency, lower frequency/monetary = higher churn probability
churn_prob = (
    (data["recency_days"] / 365) * 0.4
    + (1 - data["frequency"] / 50) * 0.3
    + (1 - data["monetary"] / 10000) * 0.3
)
data["churned"] = (np.random.random(n_customers) < churn_prob).astype(int)

print("📊 DATASET GENERATED")
print("=" * 60)
print(f"Total customers: {len(data)}")
print(f"Churned: {data['churned'].sum()} ({data['churned'].mean()*100:.1f}%)")
print(f"\nFirst few rows:")
print(data.head())

# Save dataset
data.to_csv(f"{output_dir}/customer_data.csv", index=False)
print(f"\n✅ Dataset saved to: {output_dir}/customer_data.csv")
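
The cell above relies on `np.random.seed(42)`, which sets NumPy's global random state. An alternative is to pass a local generator around, so that other code touching the global state cannot change your data. A minimal sketch, assuming a hypothetical helper `make_rfm`; note that `default_rng` produces different numbers than the legacy global seed, so its output will not match the dataset generated above.

```python
import numpy as np


def make_rfm(seed: int, n_customers: int = 5) -> dict:
    """Draw RFM-style columns from a generator that is local to this call."""
    rng = np.random.default_rng(seed)
    return {
        "recency_days": rng.integers(1, 365, n_customers),
        "frequency": rng.integers(0, 50, n_customers),
        "monetary": rng.uniform(0, 10000, n_customers),
    }


first = make_rfm(42)
second = make_rfm(42)

# Two calls with the same seed should give identical columns.
same = all(np.array_equal(first[name], second[name]) for name in first)
print(same)
```

Checking that two runs match is a cheap reproducibility test you can keep in the notebook.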

### Step 7: Exploratory Data Analysis

In [None]:
# ========================================
# Exploratory Data Analysis
# ========================================

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Churn rate
churn_counts = data["churned"].value_counts()
axes[0, 0].pie(
    churn_counts, labels=["Active", "Churned"], autopct="%1.1f%%", colors=["#2ecc71", "#e74c3c"]
)
axes[0, 0].set_title("Churn Distribution", fontweight="bold")

# Age distribution by churn
data.boxplot(column="age", by="churned", ax=axes[0, 1])
axes[0, 1].set_title("Age by Churn Status")
axes[0, 1].set_xlabel("Churned")
axes[0, 1].set_ylabel("Age")

# Recency vs Churn
data.boxplot(column="recency_days", by="churned", ax=axes[1, 0])
axes[1, 0].set_title("Recency by Churn Status")
axes[1, 0].set_xlabel("Churned")
axes[1, 0].set_ylabel("Recency (days)")

# Frequency vs Churn
data.boxplot(column="frequency", by="churned", ax=axes[1, 1])
axes[1, 1].set_title("Frequency by Churn Status")
axes[1, 1].set_xlabel("Churned")
axes[1, 1].set_ylabel("Frequency")

plt.tight_layout()
plt.savefig(f"{output_dir}/eda_visualizations.png", dpi=150, bbox_inches="tight")
print(f"✅ EDA visualizations saved to: {output_dir}/eda_visualizations.png")
plt.show()

print("\n💡 OBSERVATIONS:")
print("  - Churn rate appears to be influenced by recency and frequency")
print("  - This suggests RFM features may be valuable predictors")

### Step 8: Model Training and Evaluation

In [None]:
# ========================================
# Model Training - Control (Demographic Only)
# ========================================

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, f1_score, confusion_matrix

# Prepare data
# Encode categorical variables
le_gender = LabelEncoder()
le_location = LabelEncoder()

data_encoded = data.copy()
data_encoded["gender_encoded"] = le_gender.fit_transform(data["gender"])
data_encoded["location_encoded"] = le_location.fit_transform(data["location"])

# Control features (demographic only)
control_features = ["age", "gender_encoded", "location_encoded", "tenure_months"]
X_control = data_encoded[control_features]

# Treatment features (demographic + RFM)
treatment_features = control_features + ["recency_days", "frequency", "monetary"]
X_treatment = data_encoded[treatment_features]

# Target
y = data_encoded["churned"]

# Split data
X_control_train, X_control_test, y_train, y_test = train_test_split(
    X_control, y, test_size=0.2, random_state=42, stratify=y
)

X_treatment_train, X_treatment_test, _, _ = train_test_split(
    X_treatment, y, test_size=0.2, random_state=42, stratify=y
)

# Train control model
print("🔬 TRAINING CONTROL MODEL (Demographic features only)")
print("=" * 60)
model_control = RandomForestClassifier(n_estimators=100, random_state=42)
model_control.fit(X_control_train, y_train)

# Evaluate control model
y_pred_control = model_control.predict(X_control_test)
f1_control = f1_score(y_test, y_pred_control)

print(f"F1-Score (Control): {f1_control:.4f}")
print("\nClassification Report (Control):")
print(classification_report(y_test, y_pred_control))
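
The cell above calls `train_test_split` twice and depends on the shared `random_state` to keep the control and treatment rows aligned. An alternative that makes the alignment explicit is to split the rows once and then select columns. A sketch on toy data (the frame `toy` and its reduced column set are stand-ins for `data_encoded`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for data_encoded, with fewer columns for brevity.
rng = np.random.default_rng(42)
n_rows = 200
toy = pd.DataFrame(
    {
        "age": rng.integers(18, 70, n_rows),
        "tenure_months": rng.integers(1, 60, n_rows),
        "recency_days": rng.integers(1, 365, n_rows),
        "churned": rng.integers(0, 2, n_rows),
    }
)

control_features = ["age", "tenure_months"]
treatment_features = control_features + ["recency_days"]

# Split the rows once, then pick the columns each model needs.
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["churned"]
)
X_control_train, X_control_test = train_df[control_features], test_df[control_features]
X_treatment_train, X_treatment_test = train_df[treatment_features], test_df[treatment_features]
y_train, y_test = train_df["churned"], test_df["churned"]

# Both models are guaranteed to see exactly the same customers.
print(X_control_test.index.equals(X_treatment_test.index))
```

Either approach gives a fair comparison; splitting once simply removes the chance of the two splits drifting apart if someone later edits one call and not the other.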

In [None]:
# ========================================
# Model Training - Treatment (Demographic + RFM)
# ========================================

print("🔬 TRAINING TREATMENT MODEL (Demographic + RFM features)")
print("=" * 60)
model_treatment = RandomForestClassifier(n_estimators=100, random_state=42)
model_treatment.fit(X_treatment_train, y_train)

# Evaluate treatment model
y_pred_treatment = model_treatment.predict(X_treatment_test)
f1_treatment = f1_score(y_test, y_pred_treatment)

print(f"F1-Score (Treatment): {f1_treatment:.4f}")
print("\nClassification Report (Treatment):")
print(classification_report(y_test, y_pred_treatment))

In [None]:
# ========================================
# Compare Results
# ========================================

improvement = ((f1_treatment - f1_control) / f1_control) * 100

print("\n📊 RESULTS COMPARISON")
print("=" * 60)
print(f"Control Model F1-Score:    {f1_control:.4f}")
print(f"Treatment Model F1-Score:  {f1_treatment:.4f}")
print(f"\nImprovement: {improvement:+.2f}%")

if improvement >= 10:
    print("\n✅ HYPOTHESIS SUPPORTED!")
    print("   RFM features improved F1-score by >= 10%")
else:
    print("\n❌ HYPOTHESIS NOT SUPPORTED")
    print(f"   Improvement ({improvement:.2f}%) is less than 10%")

# Save results
results = pd.DataFrame(
    {
        "Model": ["Control (Demographic)", "Treatment (Demographic + RFM)"],
        "F1-Score": [f1_control, f1_treatment],
        "Improvement": [0, improvement],
    }
)

results.to_csv(f"{output_dir}/model_results.csv", index=False)
print(f"\n✅ Results saved to: {output_dir}/model_results.csv")
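
The methodology lists 5-fold cross-validation and a paired t-test, while the comparison above rests on a single train/test split. The sketch below shows one way those two pieces could fit together. It runs on small synthetic arrays (not the notebook's `data`), and the feature layout, sample size, and `n_estimators=50` are assumptions made to keep it short.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: two "demographic" columns unrelated to churn,
# three "RFM" columns that drive the churn probability.
rng = np.random.default_rng(42)
n = 400
X_demo = rng.integers(18, 70, size=(n, 2)).astype(float)
X_rfm = rng.uniform(0, 1, size=(n, 3))
churn_prob = 0.4 * X_rfm[:, 0] + 0.3 * (1 - X_rfm[:, 1]) + 0.3 * (1 - X_rfm[:, 2])
y = (rng.random(n) < churn_prob).astype(int)

# A fixed random_state makes both feature sets use the same five folds,
# which is what allows the fold scores to be paired.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

f1_control_folds = cross_val_score(model, X_demo, y, cv=cv, scoring="f1")
f1_treatment_folds = cross_val_score(
    model, np.hstack([X_demo, X_rfm]), y, cv=cv, scoring="f1"
)

# Paired t-test on the per-fold F1 scores.
t_stat, p_value = stats.ttest_rel(f1_treatment_folds, f1_control_folds)
print(f"mean F1 control:   {f1_control_folds.mean():.4f}")
print(f"mean F1 treatment: {f1_treatment_folds.mean():.4f}")
print(f"paired t-test p-value: {p_value:.4f}")
```

Treat the p-value as a rough guide: scores from cross-validation folds share training data, so they are not fully independent, and five folds give the test very little power.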

In [None]:
# ========================================
# Visualize Results
# ========================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# F1-Score Comparison
models = ["Control\n(Demographic)", "Treatment\n(Demographic + RFM)"]
f1_scores = [f1_control, f1_treatment]

bars = axes[0].bar(models, f1_scores, color=["#3498db", "#2ecc71"], alpha=0.7)
axes[0].set_ylabel("F1-Score", fontweight="bold")
axes[0].set_title("Model Performance Comparison", fontweight="bold", fontsize=12)
axes[0].set_ylim([0, 1])

# Add value labels
for bar in bars:
    height = bar.get_height()
    axes[0].text(
        bar.get_x() + bar.get_width() / 2.0,
        height,
        f"{height:.4f}",
        ha="center",
        va="bottom",
        fontweight="bold",
    )

# Feature Importance (Treatment Model)
feature_importance = pd.DataFrame(
    {"feature": treatment_features, "importance": model_treatment.feature_importances_}
).sort_values("importance", ascending=False)

axes[1].barh(
    feature_importance["feature"], feature_importance["importance"], color="steelblue", alpha=0.7
)
axes[1].set_xlabel("Importance", fontweight="bold")
axes[1].set_title("Feature Importance (Treatment Model)", fontweight="bold", fontsize=12)
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig(f"{output_dir}/results_comparison.png", dpi=150, bbox_inches="tight")
print(f"✅ Results visualization saved to: {output_dir}/results_comparison.png")
plt.show()

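print("\n💡 KEY FINDINGS:")
print("  - Top 3 most important features:")
# iterrows() yields the original (pre-sort) index, so use enumerate for the rank
for rank, (_, row) in enumerate(feature_importance.head(3).iterrows(), start=1):
    print(f"    {rank}. {row['feature']}: {row['importance']:.4f}")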

---
## PHASE 4: COMMUNICATION
---

### Step 9: Write Research Summary

In [None]:
# ========================================
# Generate Research Summary
# ========================================

summary = f"""
RESEARCH SUMMARY
================

Title: {PROJECT_INFO['title']}
Author: {PROJECT_INFO['researcher']}
Date: {PROJECT_INFO['date_started']}

ABSTRACT
--------
This study investigated whether adding engineered behavioral features (RFM metrics)
improves customer churn prediction accuracy compared to using only demographic
features. Using a simulated dataset of 1,000 customers, we trained Random Forest
classifiers with two feature sets: (1) demographic features only, and (2)
demographic features plus RFM metrics. Results showed that the RFM-enhanced model
improved F1-score by {improvement:.2f}%, {'supporting' if improvement >= 10 else 'not supporting'} our hypothesis of >= 10% improvement.

RESEARCH QUESTION
-----------------
{research_design['research_question']}

HYPOTHESIS
----------
{research_design['hypothesis']}

METHODOLOGY
-----------
- Sample Size: {len(data)} customers
- Algorithm: Random Forest Classifier
- Evaluation: 80/20 train-test split
- Primary Metric: F1-score

RESULTS
-------
- Control Model F1-Score: {f1_control:.4f}
- Treatment Model F1-Score: {f1_treatment:.4f}
- Improvement: {improvement:+.2f}%

CONCLUSIONS
-----------
{'✅ The hypothesis was SUPPORTED. ' if improvement >= 10 else '❌ The hypothesis was NOT SUPPORTED. '}
RFM features demonstrated {'substantial' if improvement >= 10 else 'some'} value in predicting customer churn.
The most important features were behavioral (recency, frequency, monetary value),
suggesting that recent customer activity is more predictive than demographics alone.

LIMITATIONS
-----------
1. Simulated data may not reflect real-world complexity
2. Limited to one machine learning algorithm (Random Forest)
3. Small sample size (1,000 customers)
4. Single domain (e-commerce)

FUTURE WORK
-----------
1. Test with real customer data
2. Compare multiple algorithms (XGBoost, Neural Networks)
3. Investigate additional behavioral features
4. Explore temporal patterns in churn
5. Conduct fairness analysis across demographic groups

REPRODUCIBILITY
---------------
All code, data, and analysis are available in: {output_dir}
Random seed: 42 (for reproducibility)
"""

print(summary)

# Save summary
with open(f"{output_dir}/research_summary.txt", "w", encoding="utf-8") as f:
    f.write(summary)

print(f"\n✅ Research summary saved to: {output_dir}/research_summary.txt")

### Step 10: Create Final Report Package

In [None]:
# ========================================
# Generate README for Research Package
# ========================================

readme_content = f"""
# {PROJECT_INFO['title']}

## Research Project Package

**Author**: {PROJECT_INFO['researcher']}  
**Date**: {PROJECT_INFO['date_started']}  
**Domain**: {PROJECT_INFO['domain']}  

## Abstract

This research investigates whether adding engineered behavioral features (RFM metrics) 
improves customer churn prediction accuracy. Results showed a {improvement:.2f}% improvement 
in F1-score when RFM features were added to demographic features.

## Files in This Package

- `research_summary.txt` - Complete research summary
- `methodology.txt` - Detailed methodology
- `research_design.txt` - Research question and hypothesis
- `customer_data.csv` - Dataset used (synthetic)
- `model_results.csv` - Comparison of model performance
- `literature_review.csv` - Papers reviewed
- `eda_visualizations.png` - Exploratory data analysis
- `results_comparison.png` - Results visualization
- `ethics_checklist.txt` - Ethical considerations

## Key Findings

1. RFM features improved F1-score by {improvement:.2f}%
2. Behavioral features are more predictive than demographics
3. Recency, frequency, and monetary value are the top predictors

## Reproducibility

This research is fully reproducible:
- Random seed: 42
- All code in Jupyter notebook
- Dataset included
- Dependencies listed in requirements.txt

## Citation

If you use this work, please cite:
```
{PROJECT_INFO['researcher']} ({PROJECT_INFO['date_started']}). 
{PROJECT_INFO['title']}. 
Research project from Data Science Research Skills course.
```

## Contact

For questions or collaboration: [Your email here]
"""

with open(f"{output_dir}/README.md", "w") as f:
    f.write(readme_content)

print("📦 FINAL RESEARCH PACKAGE CREATED")
print("=" * 60)
print(f"\nAll files saved to: {output_dir}")
print("\n✅ Your research project is complete and ready to share!")
print("\n📁 Package contents:")

# os is already imported in the setup cell
for file in sorted(os.listdir(output_dir)):
    print(f"   - {file}")
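
The README mentions a `requirements.txt`, but none of the cells above writes one. A minimal sketch of how you might record the installed versions of the libraries this notebook imports; the helper `pinned_requirements` is hypothetical, and the package list is taken from the import statements in this notebook.

```python
from importlib import metadata


def pinned_requirements(packages: list[str]) -> list[str]:
    """Return 'name==version' lines for installed packages, with a comment for missing ones."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible note instead of failing, so gaps are easy to spot.
            lines.append(f"# {name}: not installed in this environment")
    return lines


# Distribution names for the libraries imported in this notebook.
notebook_packages = ["pandas", "numpy", "matplotlib", "seaborn", "scikit-learn"]
requirement_lines = pinned_requirements(notebook_packages)
print("\n".join(requirement_lines))
```

To include the file in the package, write `"\n".join(requirement_lines)` to `requirements.txt` inside `output_dir` before listing the package contents.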

---

## Congratulations! 🎉

You've completed a full research project from start to finish!

### What You Accomplished

✅ Formulated a clear research question  
✅ Reviewed existing literature  
✅ Developed and tested a hypothesis  
✅ Designed an experiment  
✅ Collected and analyzed data  
✅ Considered ethical implications  
✅ Made your work reproducible  
✅ Documented everything thoroughly  
✅ Communicated your findings  

### Your Research Portfolio

You now have a complete research project that demonstrates:
- Research skills
- Data science capabilities
- Critical thinking
- Ethical awareness
- Communication abilities

**This is a portfolio piece you can show to employers or use in applications!**

### What's Next?

1. **Do Your Own Project**
   - Choose a topic you're passionate about
   - Use this notebook as a template
   - Apply these skills to real data

2. **Share Your Work**
   - Upload to GitHub
   - Write a blog post
   - Present to peers

3. **Keep Learning**
   - Explore advanced statistical methods
   - Learn about causal inference
   - Study research design in depth

4. **Contribute to Open Science**
   - Replicate published studies
   - Share your datasets and code
   - Collaborate on research projects

### Course Completion

You've completed all 10 modules of **Data Science Research Skills**!

- Module 00: Setup & Introduction ✅
- Module 01: Literature Review Basics ✅
- Module 02: Finding and Reading Papers ✅
- Module 03: Research Methodology ✅
- Module 04: Experimental Design ✅
- Module 05: Data Collection Methods ✅
- Module 06: Research Ethics ✅
- Module 07: Reproducible Research ✅
- Module 08: Documentation & Version Control ✅
- Module 09: Putting It All Together ✅

### Thank You!

Thank you for taking this journey to learn research skills. You now have the foundation
to conduct rigorous, ethical, and reproducible research in data science.

**Keep researching, keep learning, and keep making discoveries!**

---

*Questions? Review any module or check the main project README.*  
*Feedback? We'd love to hear about your experience!*