# 1. Classification project → Categorical target
### a. Explore the target: how many classes (binary or multiclass)?
### b. Is there a class imbalance? If so, consider techniques to address it
        i. For example SMOTE or class_weight
### c. Determine the business context/problem for the stakeholder
        i. How is your model helping to address/solve the problem?
### d. Will you want some form of inference? (This will affect model choices)
        i. Can be very beneficial
### e. Determine which evaluation metric/metrics are most important
        i. THIS IS BIG
        ii. Think in terms of a cost matrix, FP vs. FN, etc.


$\begin{bmatrix}
TN & FP \\
FN & TP
\end{bmatrix}$

TN = customer predicted not to churn, and did not churn
FP = customer predicted to churn, but did not churn
FN = customer predicted not to churn, but did churn
TP = customer predicted to churn, and did churn
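
To make the matrix layout concrete, here is a minimal sketch that unpacks the four cells with scikit-learn and weighs the two error types. The labels are toy values and the per-error costs (`COST_FP`, `COST_FN`) are hypothetical placeholders, not figures from this project.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = churned, 0 = did not churn.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels ordered [0, 1], ravel() yields TN, FP, FN, TP in that order,
# matching the [[TN, FP], [FN, TP]] layout above.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Hypothetical per-error costs (assumptions for the sketch).
COST_FP = 10  # e.g. campaign spend on a customer who would have stayed
COST_FN = 50  # e.g. revenue lost from a churner who was missed

total_cost = fp * COST_FP + fn * COST_FN
print(f"TN={tn} FP={fp} FN={fn} TP={tp} total cost={total_cost}")
```

Swapping in real cost estimates from the stakeholder is what turns this into a basis for choosing between precision and recall.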

a. Binary
b. Yes, 86% non-churned, 14% churned
    i. going to use SMOTE
c. The model will predict which customers are going to churn, in order to save money by not wasting campaign dollars on customers who would not have churned.
d. Yes, I will want to use inferential statistics to determine the biggest factors affecting churn.
e. The most costly error here is the false positive. We want to limit false positives, so the most important metrics are **Precision** and **Recall**, along with **F1-Score**. **AUC-ROC** will also be used, since there is substantial class imbalance and the costs of false positives and false negatives differ significantly. A confusion matrix will be used to show performance.
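
A quick sketch of how the class split from (b) and the metrics from (e) could be computed. The target below is a toy series built to mirror the 86% / 14% split, and the predictions are made up for illustration; none of it comes from the project data.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy target mirroring the 86% / 14% split noted in (b); 1 = churned.
y_demo = pd.Series([0] * 86 + [1] * 14, name="churn")
class_share = y_demo.value_counts(normalize=True)
print(class_share)

# Toy predictions for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
print(precision, recall, f1)
```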

--------

# 2. EDA → Explore your dataset and features
    a. Basic summary info, .describe, null or missing values, column types
    b. What features are relevant for your model/analysis and why? 
    c. Relationships with target
    d. Exploratory visuals, correlations, pairplots, heatmaps, histograms etc…
    e. Determine how you will handle null values
        i. Impute values, drop rows or columns etc…
        ii. Take a look at the different types of imputers (KNNImputer is powerful but slow)
    f. Determine how you will handle categorical variables
        i. Binary?
        ii. Ordinal or One-Hot encode?
    g. Will you need to scale the numeric data?
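
The basic checks in (a) could be sketched as follows. `df_demo` is a hypothetical four-row stand-in with a few of the column names used later in this notebook; on the real data the same calls would be run on `df`.

```python
import pandas as pd

# Hypothetical stand-in frame, for illustration only.
df_demo = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
    "customer service calls": [1, 1, 0, 2],
    "international plan": ["no", "no", "no", "yes"],
    "churn": [False, False, False, True],
})

column_types = df_demo.dtypes                    # column types
summary = df_demo.describe()                     # summary stats for numeric columns
null_counts = df_demo.isna().sum()               # missing values per column
n_duplicates = int(df_demo.duplicated().sum())   # fully duplicated rows

print(column_types, summary, null_counts, n_duplicates, sep="\n\n")
```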


a. No null values, and no duplicates
b. all features will be used for the model besides:
    i. 'phone number' (unique for every instance and therefore not significant)
    ii. 'area code' & 'state' (does not indicate usage of their service)
    iii. 'total charge' has a high correlation with '..total minutes' and would cause high multicollinearity; total minutes is preferable since it is more indicative of a user's usage.
    iv. number of voicemail messages is dropped since it would cause multicollinearity, and voicemail plan has a higher correlation with churn.
c. All int/float columns measure usage of a customer. Category columns define different plans for the user
d. COME BACK TO THIS LATER
e. no null values
f. handling categorical
    i. categorical columns are binary and will be converted to boolean int (0 & 1)
g. Will need to scale the numeric data since there are several different units being measured.
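
A small sketch of two of the checks behind the answers above: measuring how strongly a charge column tracks its minutes column, and converting a yes/no column to 0/1. The rows are toy values, and the 0.17 rate used to build the charge column is an assumption for the example, not something read from the data.

```python
import pandas as pd

# Toy rows for illustration only.
demo = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "international plan": ["no", "no", "no", "yes", "yes"],
})
# Assumed linear relationship, used only to build the example column.
demo["total day charge"] = demo["total day minutes"] * 0.17

# A correlation near 1 suggests one of the two columns is redundant.
minutes_charge_corr = demo["total day minutes"].corr(demo["total day charge"])
print(minutes_charge_corr)

# Convert a binary yes/no column to 0/1 integers.
demo["international plan"] = demo["international plan"].map({"no": 0, "yes": 1})
print(demo["international plan"].tolist())
```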

# 3. Determine an appropriate validation procedure
    a. Highly recommended to cross_validate with a pipeline (best way)
    b. train_test_split your data

In [4]:
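from sklearn.model_selection import train_test_split

# Separate the features from the target
X = df.drop(columns=['churn', 'phone number', 'area code'])
y = df['churn']

# Stratify so that train and test keep the same churn ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)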
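
For the cross-validation route in (a), a minimal sketch might look like the following. It runs on synthetic data from `make_classification` rather than the churn data, and it uses `class_weight='balanced'` in place of SMOTE so that it depends only on scikit-learn; the step names are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly an 86/14 class split.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.86, 0.14], random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each training fold
# and never sees the validation fold.
pipe = Pipeline(steps=[
    ("ss", StandardScaler()),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring=["precision", "roc_auc"])

print(cv_results["test_precision"].mean(), cv_results["test_roc_auc"].mean())
```

On the real data this would be run on `X_train` and `y_train` only, keeping the hold-out test set untouched until the final model.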

# 4. Develop ColumnTransformer → data preprocessing pipeline based on EDA above
    a. Sub-pipes for numeric and categorical columns
        i. As many as you need if treating some columns differently
        ii. Keep in mind things like handle_unknown='ignore'
    b. Create one pipeline with numeric scaling, another without if needed
    c. Test your column transformer
        i. fit_transform train data
        ii. transform test data
        iii. Should have the same number of columns after transformation

In [None]:
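from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric sub-pipeline: standardize each feature
subpipe_num = Pipeline(steps=[('ss', StandardScaler())])

# Categorical sub-pipeline: one-hot encode, ignoring categories not seen during fit
# (scikit-learn >= 1.2 calls this argument sparse_output; older versions use sparse)
subpipe_cat = Pipeline(steps=[('ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))])

num_cols = ['account length','number vmail messages', 'total day minutes', 'total day calls',
           'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge',
           'total night minutes', 'total night calls', 'total night charge', 'total intl minutes',
           'total intl calls', 'total intl charge', 'customer service calls']

cat_cols = ['international plan', 'voice mail plan', 'state']

CT = ColumnTransformer(transformers=[('subpipe_num', subpipe_num, num_cols),
                                     ('subpipe_cat', subpipe_cat, cat_cols)],
                       remainder='passthrough')

# Fit on the training data only, then apply the fitted transformer to the test data;
# both should come out with the same number of columns
CT.fit_transform(X_train).shape, CT.transform(X_test).shape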
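
To see why `handle_unknown='ignore'` matters for the column-count check, here is a toy sketch in which the test row contains a state that never appears in training. The two tiny frames are made up, and `sparse_threshold=0` is set only so the output is a plain dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames for illustration; "KS" does not occur in the training data.
train_demo = pd.DataFrame({"account length": [100, 120, 80], "state": ["OH", "NJ", "OH"]})
test_demo = pd.DataFrame({"account length": [90], "state": ["KS"]})

ct_demo = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["account length"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["state"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_t = ct_demo.fit_transform(train_demo)  # learns scaling and the categories NJ, OH
test_t = ct_demo.transform(test_demo)        # reuses what was learned on train

print(train_t.shape, test_t.shape)
print(test_t)  # the unseen state is encoded as all zeros in the one-hot columns
```

Without `handle_unknown='ignore'`, the `transform` call would raise an error on the unseen category instead.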

# 5. Modelling Process → Pipeline using ColumnTransformer and model algorithm
    a. Start with a DummyClassifier
        i. Evaluate based on chosen metric/metrics → need to beat this
    b. Create and evaluate first simple model: Simple → Complex
        i. Start with defaults, logistic regression or decision tree based on your data
        ii. Could try other algorithms if you think warranted, KNN, Naive Bayes
    c. Iterate over previous models
        i. Use GridSearch to tune hyperparameters
            1a. Remember you can tweak parts of the CT as well
        ii. If overfit → reduce complexity, add regularization, prune tree
        iii. If underfit → increase complexity, reduce regularization, add new/more features (feature engineering)
        iv. Might require you to go back and adapt the pre-processing pipeline
    d. Consider using an ensemble model for more complexity and predictive power
        i. RandomForest, ExtraTrees, VotingClassifier, StackingClassifier etc…
        ii. Tune these via GridSearch
        iii. Iterate your heart out



### a. Start with a DummyClassifier

In [None]:
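from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, roc_auc_score

# Baseline that guesses according to the class distribution in the training set
dummy_clf = DummyClassifier(strategy='stratified', random_state=42)
dummy_clf.fit(X_train, y_train)
y_pred = dummy_clf.predict(X_test)
y_proba = dummy_clf.predict_proba(X_test)[:, 1]  # probability of the positive (churn) class

print(precision_score(y_test, y_pred))
print(roc_auc_score(y_test, y_proba))  # ROC AUC is computed from scores rather than hard labels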

### b. Create and evaluate first simple model: Simple → Complex

In [None]:
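from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline
from sklearn.tree import DecisionTreeClassifier
imb_pipe = ImPipeline(steps=[('ct', CT),
                             ('sm', SMOTE(random_state=42)),
                             ('dectree', DecisionTreeClassifier(random_state=42))])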

In [None]:
imb_pipe.fit(X_train, y_train)

y_pred = imb_pipe.predict(X_test)
y_proba = imb_pipe.predict_proba(X_test)[:, 1]  # probability of the positive (churn) class
print(precision_score(y_test, y_pred))
print(roc_auc_score(y_test, y_proba))
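
Before iterating, it can help to check whether the first model is overfit by comparing its train and test scores. The sketch below does this on synthetic data (not the churn data) for an unrestricted tree and a depth-limited one; the depth of 3 is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly an 86/14 class split.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.86, 0.14], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

scores = {}
for name, depth in [("full", None), ("depth_3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[name] = {
        "train": precision_score(y_tr, tree.predict(X_tr)),
        "test": precision_score(y_te, tree.predict(X_te)),
    }
    print(name, scores[name])
```

A large gap between train and test for the unrestricted tree is the usual sign of overfitting; limiting depth trades some training fit for a smaller gap.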

### c. Iterate over previous models

In [None]:
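from sklearn.model_selection import GridSearchCV

# Preprocess once with the ColumnTransformer (note: no SMOTE is applied in this cell)
X_train_ct = CT.fit_transform(X_train)
X_test_ct = CT.transform(X_test)

models = {
    'dectree': {
        'model': DecisionTreeClassifier(),
        'params': {'criterion': ['gini', 'entropy'], 'max_depth': [None, 10, 20, 30]}
    }
}

for model_name, model_params in models.items():
    model = model_params['model']
    params = model_params['params']

    # GridSearchCV scores with accuracy unless a scoring argument is passed
    clf = GridSearchCV(model, params, cv=5, n_jobs=-1)
    clf.fit(X_train_ct, y_train)

    print(f"Best parameters for {model_name}: {clf.best_params_}")
    print(f"Accuracy score of {model_name}: {clf.score(X_test_ct, y_test)}")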
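
As a possible next iteration, the grid search can wrap the whole pipeline, which allows tuning parts of the ColumnTransformer as well and scoring on the chosen metric. This is a sketch on synthetic data: the column names `f0`..`f3` and `plan` are hypothetical, and `class_weight='balanced'` stands in for SMOTE so that only scikit-learn is needed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: four numeric columns plus one yes/no column.
X_arr, y_demo = make_classification(
    n_samples=400, n_features=4, weights=[0.86, 0.14], random_state=42
)
num_demo = ["f0", "f1", "f2", "f3"]
X_demo = pd.DataFrame(X_arr, columns=num_demo)
X_demo["plan"] = np.random.default_rng(42).choice(["yes", "no"], size=len(X_demo))

ct_demo = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_demo),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipe = Pipeline(steps=[
    ("ct", ct_demo),
    ("dectree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

# Parameters are addressed as <step>__<param>; nested names reach into the CT.
param_grid = {
    "ct__num__with_mean": [True, False],
    "dectree__criterion": ["gini", "entropy"],
    "dectree__max_depth": [3, 5, None],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="precision")
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Because preprocessing sits inside the pipeline, it is refit within each fold, so the validation folds do not influence the scaling or encoding.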

# 6. Final Model → Thursday AM you should have a good idea here
    a. Choose a final model based on validation scores of the chosen metric/metrics
    b. Fit final model to training set
    c. Evaluate the final model using your hold-out test set
    d. Discuss final model in the context of your business problem and stakeholder
        i. Analyze your predictive power and results: where is the model performing well, and where is it performing poorly (confusion matrix)?
        ii. This could include insights and recommendations from model
            iia. Coefs (logreg)
            iib. Feature importances (trees, rf, boosting etc…)
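
The two kinds of insight in (d.ii) could be pulled out roughly like this. The data is synthetic and the feature names are placeholders, so the numbers themselves mean nothing; the point is only where the coefficients and importances live on a fitted model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.86, 0.14], random_state=42
)

# Logistic regression: scale first so coefficient sizes are comparable.
X_scaled = StandardScaler().fit_transform(X_arr)
logreg = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_scaled, y_demo)
coefs = pd.Series(logreg.coef_[0], index=feature_names).sort_values(key=abs, ascending=False)

# Random forest: impurity-based importances, which sum to 1.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf.fit(X_arr, y_demo)
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)

print(coefs, importances, sep="\n\n")
```

Coefficients carry a sign (direction of the effect on the log-odds of churn), while tree importances only rank features by how much they were used to split.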

# 7. Explanatory visuals for presentation
    a. Back up any inference with visual
    b. Show final model vs. others (esp. Dummy)
    c. Spiced up confusion matrix, need to make sure it's non-technical enough
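
One way to make the confusion matrix less technical is to relabel its rows and columns in plain language before plotting or presenting it. A minimal sketch with toy labels follows; the wording of the row and column labels is just a suggestion.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Rows are the actual outcome, columns are the model's prediction.
cm_table = pd.DataFrame(
    cm,
    index=["Actually stayed", "Actually churned"],
    columns=["Predicted to stay", "Predicted to churn"],
)
print(cm_table)
```

The same labels can be reused as tick labels if the matrix is later drawn as a heatmap.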