# Work Plan

## Understanding the Project's Objectives and its Data (Frame the Problem)
- What is the company trying to solve?
- Why is the company trying to solve it?
- What are the objectives for the model to achieve?
    - _AUC-ROC $\geq$ 0.85_

## Imported Modules
- Provide a section of all imported modules as in past projects.

## Data Overview
- Description of the datasets and how they relate to each other.
- For each column:
  - Name
  - Type of Data
  - % Missing
  - Noisiness
  - Relevance for the Task
  - Type of Distribution
- **Observe**:
  - Sample rows, shape, duplicates, nulls, dtypes, and descriptive statistics (all).
- **Document Discovered Issues or Concerns for Pre-processing**:
  - Implicit erroneous values, particularly in categorical variables.
  - Opportunities for decomposing or creating features.
- Form Conclusion/Summary for pre-processing
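
A minimal sketch of the per-column overview described above, assuming the data is loaded into a pandas DataFrame. The helper name `summarize_columns` and the sample columns `MonthlyCharges` and `TotalCharges` are hypothetical and only illustrate the idea.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with dtype, % missing, and unique count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "pct_missing": df.isna().mean() * 100,
            "n_unique": df.nunique(dropna=True),
        }
    )


# Hypothetical sample; the blank string mimics an implicit missing value.
sample = pd.DataFrame(
    {
        "CustomerID": ["a1", "a2", "a3", "a4"],
        "MonthlyCharges": [29.85, 56.95, None, 42.30],
        "TotalCharges": ["29.85", " ", "108.15", "1840.75"],
    }
)

summary = summarize_columns(sample)
n_duplicates = int(sample.duplicated().sum())

# Values that fail numeric conversion expose implicit missing entries
# that isna() alone would not report.
n_implicit_missing = int(
    pd.to_numeric(sample["TotalCharges"], errors="coerce").isna().sum()
)
```

Comparing `pct_missing` with the coerced count is one way to surface implicit erroneous values that are stored as text.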

## Pre-Processing
- Merge the datasets on **CustomerID**
- Execute all fixes discovered during exploration
- Perform Feature Engineering if possible
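
The merge step could look roughly like the sketch below. Only the `CustomerID` key comes from this plan; the two tables and their other columns are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical tables; only the CustomerID key is taken from the plan.
contract = pd.DataFrame(
    {"CustomerID": ["a1", "a2", "a3"], "Type": ["Monthly", "Yearly", "Monthly"]}
)
internet = pd.DataFrame(
    {"CustomerID": ["a1", "a3"], "InternetService": ["DSL", "Fiber"]}
)

merged = contract.merge(
    internet,
    on="CustomerID",
    how="left",             # keep every customer from the base table
    validate="one_to_one",  # raise if CustomerID repeats on either side
    indicator=True,         # add a _merge column with the match status
)

# Customers with no row in the second table are marked "left_only".
unmatched = int((merged["_merge"] == "left_only").sum())
```

A left join keeps customers who lack a given service, and the resulting nulls then need an explicit fill decision during pre-processing.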

---

# Exploratory Data Analysis (EDA)

## Numerical Features
- Visualize distributions:
  - Histograms
  - Boxplots
- Compare distributions between churned vs. retained customers.

## Categorical Features
- Visualize distributions:
  - Bar charts
  - Compare distributions between churned vs. retained customers - _using filled charts_.
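
The numbers behind a filled (100% stacked) bar chart can be computed with a row-normalized crosstab. This is a sketch on made-up data; the `Type` and `Churn` column names are assumptions.

```python
import pandas as pd

# Made-up data; column names are assumptions.
df = pd.DataFrame(
    {
        "Type": ["Monthly", "Monthly", "Monthly", "Yearly", "Yearly", "Monthly"],
        "Churn": [1, 0, 1, 0, 0, 1],
    }
)

# normalize="index" makes each category's row sum to 1, so the bars
# compare churn share rather than raw counts.
shares = pd.crosstab(df["Type"], df["Churn"], normalize="index")
```

With matplotlib installed, `shares.plot(kind="bar", stacked=True)` draws the filled chart from this table.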

## Target Class Distributions
- Is there an imbalance in the target class?

## Questions to Ask (Visuals to Provide)
- Are there noticeable differences in monthly payments between classes?
- What percentage of customers use each service type?
- Are there patterns in start and end dates?
    - When were customers joining and leaving?
- Which features should be prioritized for comparison with the target class?

## Correlation Analysis
- Create a correlation matrix to observe relationships between numerical features and the target variable.
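
A small sketch of the correlation step, assuming the target is encoded as 0/1. The data and the column names `MonthlyCharges`, `TenureMonths`, and `Churn` are invented for illustration.

```python
import pandas as pd

# Invented data; the target is assumed to be encoded as 0/1.
df = pd.DataFrame(
    {
        "MonthlyCharges": [20.0, 35.0, 50.0, 65.0, 80.0, 95.0],
        "TenureMonths": [60, 48, 36, 24, 12, 1],
        "Churn": [0, 0, 0, 1, 1, 1],
    }
)

corr = df.corr(numeric_only=True)

# Rank features by the strength of their correlation with the target.
target_corr = corr["Churn"].drop("Churn").sort_values(key=abs, ascending=False)
```

Pearson correlation against a binary target is the point-biserial correlation, so it only captures linear association and can miss threshold effects.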

## Summary
- Summarize findings and key insights.

---

# Modeling

## Model Training
- Reuse functions and processes from previous projects where applicable - attempt to make improvements
    - Possibly create function for preparing and training models
    - Create function for evaluating models
    - Create function for loading and saving models - _saves time_
- Drop any features identified as unnecessary during EDA
- Split data into features and target
- Split data into training, validation, and test sets (80:10:10)
- Apply one-hot encoding and scaling - _use standard scaler, fit on the training set only to avoid leakage_
- Address class imbalance through the most appropriate means - _apply to the training set only_
- Train and evaluate the various models
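
The split, encoding, scaling, and imbalance steps above could be wired together as in the sketch below. It assumes scikit-learn, uses synthetic stand-in data with hypothetical column names, and picks class weights as one of several possible imbalance strategies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(
    {
        "MonthlyCharges": rng.normal(60, 20, n),
        "Type": rng.choice(["Monthly", "Yearly"], n),
    }
)
y = pd.Series((rng.random(n) < 0.25).astype(int), name="Churn")

# 80:10:10 split: hold out 20%, then halve it into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Inside a Pipeline, the scaler and encoder are fit on the training
# data only, which avoids leaking validation/test statistics.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["MonthlyCharges"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
    ]
)
model = Pipeline(
    [
        ("preprocess", preprocess),
        # class_weight="balanced" is one option for handling imbalance.
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)
model.fit(X_train, y_train)
valid_proba = model.predict_proba(X_valid)[:, 1]
```

Stratifying both splits keeps the churn rate similar across the three sets, which matters when the positive class is small.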

## Models to Use
1. Dummy Classifier
2. Tree-based: Decision Tree or Random Forest
3. Logistic Regression (_linear regression would perform poorly on a binary target_)
4. AdaBoost
5. CatBoost
6. XGBoost
7. LightGBM
8. HistGradientBoosting (new model)
9. Neural Network (self-challenge)
10. Stacking Ensemble (self-challenge, new model)
    - **_blend boosting and log-reg_**

## Model Evaluation
- Metrics:
  - **Primary**: AUC-ROC (goal: 0.85 or higher, aim for 0.88).
  - **Secondary**: Accuracy.
- Goals:
  - **Overfitting** is a concern, so we will need to prioritize **regularization** more than in previous projects.
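
A sketch of a reusable evaluation helper that reports both metrics and makes overfitting visible as a train-validation gap. The function name `evaluate_model` and the synthetic data are hypothetical; it assumes a fitted scikit-learn classifier with `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, y_train, X_valid, y_valid):
    """Return AUC-ROC and accuracy on both sets, plus the train-valid AUC gap."""
    scores = {}
    for name, X, y in (("train", X_train, y_train), ("valid", X_valid, y_valid)):
        proba = model.predict_proba(X)[:, 1]  # probability of the positive class
        scores[f"{name}_auc"] = roc_auc_score(y, proba)
        scores[f"{name}_accuracy"] = accuracy_score(y, model.predict(X))
    # A large positive gap suggests the model is overfitting.
    scores["auc_gap"] = scores["train_auc"] - scores["valid_auc"]
    return scores


# Synthetic data for demonstration only.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A smaller C means stronger L2 regularization in LogisticRegression.
model = LogisticRegression(C=0.1, max_iter=1000).fit(X_train, y_train)
scores = evaluate_model(model, X_train, y_train, X_valid, y_valid)
```

Tracking `auc_gap` alongside `valid_auc` for each candidate gives a quick signal for when to increase regularization.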

## Final Model Selection
- Select the model with the best performance on the validation set.
- Justify the choice based on evaluation metrics.

---

# Questions for the Team Lead

1. How should class imbalance be handled? (e.g., oversampling, undersampling, class weights)
2. How will this solution be used by the company?
3. Are there any existing solutions to compare the performance against?
    - Are there any recommendations on specific models that have been successful in similar prediction tasks?
    - Can I reuse any experience or tools?
   > **Yes, I have worked on a churn problem before. In addition, I can reuse many functions and processes from previous projects to aid me in code development and analysis.** 
4. Are there specific features that need more in-depth analysis?
5. Which features should be prioritized during the analysis phase?
6. How would this problem be solved manually?
7. Are there any observable patterns that encourage customer retention?
8. What preferences are there for handling missing values?

---

# Additional Notes

- Revisit the objectives and assumptions after EDA.
- **Document any observations or challenges encountered during the analysis and modeling phases.**
