# Project Plan

Our goal is to build a predictive model that determines when a customer is likely to end their service. Before looking at the data, our best guess is that this is a classification problem: the model will distinguish users whose plans were cancelled from long-standing users and use their data to predict when a user is likely to cancel.

As currently understood, the project will proceed through the following steps, in order:

## Perform Data Preprocessing
    
* Get a grasp of what our four data frames look like
    * Check for duplicates
    * Check for missing values
    * Check for troubling data types

* Based on the results, the necessary preprocessing could be very short or more involved
    * Remove duplicates
    * Deal with missing values, either by filling them or, if there is no practical way to do so, by removing them
    * Convert data into proper types
    * If necessary, encode the data frame for smoother analysis
    
* We may wish to combine all four of our datasets into a single data frame
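
The preprocessing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the four tiny frames, their names (`contract`, `personal`, `internet`, `phone`), the shared key `customerID`, and every column except `EndDate` are hypothetical stand-ins, since the real structure will only be known after the data is loaded.

```python
import pandas as pd

# Hypothetical stand-ins for the four data frames; real names and columns may differ.
contract = pd.DataFrame({
    "customerID": ["a", "b", "b", "c"],
    "EndDate": ["No", "2020-01-01", "2020-01-01", "No"],
    "TotalCharges": ["10.5", "20", "20", " "],
})
personal = pd.DataFrame({"customerID": ["a", "b", "c"], "SeniorCitizen": [0, 1, 0]})
internet = pd.DataFrame({"customerID": ["a", "b"], "InternetService": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customerID": ["a", "c"], "MultipleLines": ["No", "Yes"]})

frames = {"contract": contract, "personal": personal, "internet": internet, "phone": phone}

# Quick overview: duplicates, missing values, and data types for each frame.
for name, df in frames.items():
    n_dupes = int(df.duplicated().sum())
    n_missing = int(df.isna().sum().sum())
    print(f"{name}: duplicates={n_dupes}, missing={n_missing}")
    print(df.dtypes, end="\n\n")

# Remove duplicates and convert a numeric column that was stored as text.
# Values that cannot be parsed (such as a blank string) become NaN.
contract = contract.drop_duplicates().reset_index(drop=True)
contract["TotalCharges"] = pd.to_numeric(contract["TotalCharges"], errors="coerce")

# Combine everything into one frame; left joins keep every customer in `contract`.
merged = contract
for other in (personal, internet, phone):
    merged = merged.merge(other, on="customerID", how="left")

print(merged)
```

Left joins are used here on the assumption that one frame lists every customer; missing values introduced by the merge would then need the same treatment as any other missing values.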

## Perform Exploratory Data Analysis

* Visualize the data
    * Compare specific metrics, which are not yet clear but can be chosen after preprocessing
    * Generally a histogram, box plot, and correlation matrix will be sufficient
* Check for outliers
    * Determine which of our features are likely to have a connection to our target and take a closer look at those
* If encoding is deemed useful, we could check our class imbalance
    * If our classes are sufficiently imbalanced, we may wish to tune the classification threshold when building our models to achieve a higher accuracy
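
A minimal sketch of the class-balance check and threshold search might look like the following. It assumes that `EndDate` holds the string `"No"` for customers who are still active, which has not been verified yet, and the helper `best_threshold` is a hypothetical name. The helper maximizes F1 rather than accuracy, as an assumption, because plain accuracy can be misleading on imbalanced classes; another metric could be swapped in.

```python
import numpy as np
import pandas as pd

# Hypothetical data: assume 'EndDate' is "No" for customers who are still active.
df = pd.DataFrame(
    {"EndDate": ["No", "No", "2019-12-01", "No", "2020-01-01", "No", "No", "No"]}
)

# Encode the target: 1 = the customer left, 0 = still active.
df["churn"] = (df["EndDate"] != "No").astype(int)

# Share of each class; a heavily skewed split signals class imbalance.
class_share = df["churn"].value_counts(normalize=True)
print(class_share)


def best_threshold(y_true, proba, thresholds=np.arange(0.05, 1.0, 0.05)):
    """Return (threshold, f1) for the probability cut-off with the highest F1."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        pred = (proba >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = float(2 * tp / denom) if denom else 0.0
        # Strict comparison keeps the lowest threshold among ties.
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```

In practice `proba` would come from a model's `predict_proba` on the validation set, so that the threshold is not tuned on the test data.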

## Train the Model

* The project can be completed using a supervised learning model
    * Before we begin training, we should do a features/target split (our target is based on the 'EndDate' column)
    * We will also want to split our data for validation and testing purposes
        * We will do a 60/20/20 split with 60% of the data as a training set, 20% as a validation set, and 20% as the test set
* Based on our outline, we will want an AUC-ROC of at least 0.75, but that is the bare minimum
    * We should try to get as high as possible, but will not expect to be able to improve much beyond 0.88
* To get an idea of how long hyperparameter tuning might take, we should try several different models
    * Dummy model
    * Decision Tree
    * Logistic Regression
    * Random Forest
    * CatBoost
    * LightGBM and XGBoost would also be good to test, but since gradient boosting models take longer to train, they will only be considered worth trying if the other models are performing poorly
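
The 60/20/20 split and a first timing comparison could be sketched as below. The data is synthetic (`make_classification`) and stands in for the prepared features and target, the hyperparameters are arbitrary placeholders, and logistic regression is used as the linear model because the task is classification. CatBoost is left out only to keep the sketch dependent on scikit-learn alone.

```python
import time

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and the binary churn target.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)

# 60/20/20 split: hold out 40%, then halve it into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Placeholder hyperparameters; real values would come from tuning.
models = {
    "dummy": DummyClassifier(strategy="prior"),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model, time the fit, and score AUC-ROC on the validation set.
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    results[name] = {"auc": auc, "fit_seconds": fit_seconds}
    print(f"{name:20s} AUC-ROC={auc:.3f} fit={fit_seconds:.3f}s")
```

The dummy model gives the baseline that any real model has to beat, and the recorded fit times give a rough sense of how expensive a hyperparameter search would be for each candidate.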

## Test the Model

* Check the model against the test set for AUC-ROC and accuracy
    * If necessary, the model might need some additional tuning if it performed well against the validation set but poorly with the test set
    * We may need to use a larger training set if the model is performing particularly poorly against the test set, as a training set of 60% is already a bit low. We could try a 70/15/15 split.
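
The final check could be sketched as follows, again on synthetic data with a placeholder random forest standing in for whichever model is eventually chosen. The helper `evaluate` is a hypothetical name. Comparing validation and test scores side by side makes a large gap, which would suggest overfitting to the validation set, easy to spot.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data, split 60/20/20.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Placeholder for the final, tuned model.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)


def evaluate(model, X, y):
    """Return AUC-ROC (from probabilities) and accuracy (from hard predictions)."""
    proba = model.predict_proba(X)[:, 1]
    return {
        "auc_roc": roc_auc_score(y, proba),
        "accuracy": accuracy_score(y, model.predict(X)),
    }


valid_scores = evaluate(model, X_valid, y_valid)
test_scores = evaluate(model, X_test, y_test)

# A large positive gap means the model did better on validation than on test.
auc_gap = valid_scores["auc_roc"] - test_scores["auc_roc"]
print("validation:", valid_scores)
print("test:      ", test_scores)
print(f"AUC-ROC gap (validation - test): {auc_gap:.3f}")
```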

## Draw Conclusions

We will have a better idea of what exactly our conclusions will be after the project's completion, so this section is left somewhat sparse in the outline. However, we do have a list of targets that should be met before the project is considered complete:

- [ ] The code is error free and arranged in order of execution.
- [ ] Documentation explaining the purpose of each piece of code directly precedes it.
- [ ] The data has been downloaded and prepared.
- [ ] Exploratory data analysis with graphical representation has been performed.
- [ ] The models have been trained.
- [ ] A final model has been created with an AUC-ROC of at least 0.75.