### Main Objective of the Analysis: Prediction or Interpretation

Our customer churn analysis focuses on both prediction and interpretation to identify at-risk customers and understand why they churn.

#### Focus on Prediction
- **Early Detection**: Predict at-risk customers for timely retention strategies.
- **Targeted Interventions**: Tailor retention efforts to maximize impact.
- **Cost Efficiency**: Retain customers cost-effectively, saving on acquisition costs.
- **Optimal Resource Allocation**: Efficiently allocate resources to high-impact areas.

#### Focus on Interpretation
- **Understanding Customer Behavior**: Identify behavioral patterns and demographic factors linked to churn.
- **Identifying Churn Drivers**: Uncover root causes and dissatisfaction areas for improvement.
- **Improving Products and Services**: Use insights for continuous enhancement and customer-centric innovation.

### Benefits to the Business and Stakeholders
- **Enhanced Customer Retention**: Implement effective strategies to increase customer lifetime value.
- **Data-Driven Decision Making**: Inform product development, marketing, and customer service strategies.
- **Competitive Advantage**: Stay ahead of trends, address issues proactively, and improve offerings.
- **Stakeholder Confidence**: Build confidence through a comprehensive, proactive approach to churn analysis.




### Brief Description of the Data Set
**Description:**
The dataset used for this analysis is a customer churn dataset containing various attributes related to customer demographics, account information, and usage patterns.

**Summary of Attributes:**
- **state:** The state code where the customer resides.
- **account_length:** The duration of the customer account.
- **area_code:** The area code of the customer.
- **international_plan:** Whether the customer has an international plan (yes/no).
- **voice_mail_plan:** Whether the customer has a voicemail plan (yes/no).
- **number_vmail_messages:** Number of voicemail messages.
- **total_day_minutes, total_day_calls, total_day_charge:** Total usage during the day.
- **total_eve_minutes, total_eve_calls, total_eve_charge:** Total usage during the evening.
- **total_night_minutes, total_night_calls, total_night_charge:** Total usage during the night.
- **total_intl_minutes, total_intl_calls, total_intl_charge:** Total international usage.
- **number_customer_service_calls:** Number of customer service calls.
- **churn:** Whether the customer has churned (yes/no).

**Objective of the Analysis:**
To build a model that predicts customer churn and provides insights into the factors that influence churn.

|    | state   |   account_length | area_code     | international_plan   | voice_mail_plan   |   number_vmail_messages |   total_day_minutes |   total_day_calls |   total_day_charge |   total_eve_minutes |   total_eve_calls |   total_eve_charge |   total_night_minutes |   total_night_calls |   total_night_charge |   total_intl_minutes |   total_intl_calls |   total_intl_charge |   number_customer_service_calls | churn   |
|---:|:--------|-----------------:|:--------------|:---------------------|:------------------|------------------------:|--------------------:|------------------:|-------------------:|--------------------:|------------------:|-------------------:|----------------------:|--------------------:|---------------------:|---------------------:|-------------------:|--------------------:|--------------------------------:|:--------|
|  0 | OH      |              107 | area_code_415 | no                   | yes               |                      26 |               161.6 |               123 |              27.47 |               195.5 |               103 |              16.62 |                 254.4 |                 103 |                11.45 |                 13.7 |                  3 |                3.7  |                               1 | no      |
|  1 | NJ      |              137 | area_code_415 | no                   | no                |                       0 |               243.4 |               114 |              41.38 |               121.2 |               110 |              10.3  |                 162.6 |                 104 |                 7.32 |                 12.2 |                  5 |                3.29 |                               0 | no      |
|  2 | OH      |               84 | area_code_408 | yes                  | no                |                       0 |               299.4 |                71 |              50.9  |                61.9 |                88 |               5.26 |                 196.9 |                  89 |                 8.86 |                  6.6 |                  7 |                1.78 |                               2 | no      |
|  3 | OK      |               75 | area_code_415 | yes                  | no                |                       0 |               166.7 |               113 |              28.34 |               148.3 |               122 |              12.61 |                 186.9 |                 121 |                 8.41 |                 10.1 |                  3 |                2.73 |                               3 | no      |
|  4 | MA      |              121 | area_code_510 | no                   | yes               |                      24 |               218.2 |                88 |              37.09 |               348.5 |               108 |              29.62 |                 212.6 |                 118 |                 9.57 |                  7.5 |                  7 |                2.03 |                               3 | no      |

### Data Exploration and Data Cleaning

#### Exploration
I began the analysis by using the Pandas `describe` method to understand the statistical distribution of each feature in the dataset. This provided insights into central tendencies, dispersions, and outliers. Additionally, I checked the data types of each column using the `dtypes` attribute to distinguish between numerical and categorical columns.
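
The exploration steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the small `df` is a hypothetical stand-in built from a few of the sample rows shown earlier, and the variable names are assumptions. It also includes the missing-value check reported under Actions Taken.

```python
import pandas as pd

# Hypothetical stand-in frame: a few columns and values copied from the
# sample rows shown earlier, not loaded from the real dataset file.
df = pd.DataFrame({
    "state": ["OH", "NJ", "OH"],
    "account_length": [107, 137, 84],
    "international_plan": ["no", "no", "yes"],
    "total_day_minutes": [161.6, 243.4, 299.4],
    "churn": ["no", "no", "no"],
})

# count, mean, std, min, quartiles, and max for every numeric column
summary = df.describe().T

# Separate numerical from categorical columns based on their dtypes
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column; all zeros means nothing to impute
missing_per_column = df.isna().sum()

print(df.dtypes)
print(summary)
print(missing_per_column)
```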

#### Actions Taken

- **Handling Missing Values:**
  - There were no missing values in the dataset, so no imputation or removal was needed.

- **Categorical Data Encoding:**
  - The dataset contained categorical columns (`state`, `area_code`, `international_plan`, `voice_mail_plan`, `churn`). These were converted to numerical values using scikit-learn's `LabelEncoder`.

- **Numerical Data Scaling:**
  - I applied the `MinMaxScaler` to scale numerical columns. This transformed the features to a range between 0 and 1, ensuring consistent scale across all numerical features.

- **Feature Engineering:**
  - Prepared model-ready features, guided by domain knowledge and the exploration above, primarily by encoding categorical variables and scaling numerical features.

### Feature Engineering
Feature engineering involved transforming raw data into meaningful features to improve model performance.

- **Label Encoding:**
  - Converted categorical columns (`state`, `area_code`, `international_plan`, `voice_mail_plan`, `churn`) to numeric values using `LabelEncoder`.

- **Scaling:**
  - Applied `MinMaxScaler` to all numerical features to rescale them to a range of 0 to 1.
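
A minimal sketch of the encoding and scaling steps, assuming one `LabelEncoder` per categorical column and a single `MinMaxScaler` for the numerical columns. The toy `df`, its values, and the `encoders` dictionary are illustrative assumptions, not the notebook's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Illustrative toy frame using a subset of the dataset's column names
df = pd.DataFrame({
    "international_plan": ["no", "no", "yes", "yes"],
    "voice_mail_plan": ["yes", "no", "no", "no"],
    "total_day_minutes": [161.6, 243.4, 299.4, 166.7],
    "churn": ["no", "no", "yes", "no"],
})

categorical_cols = ["international_plan", "voice_mail_plan", "churn"]
numerical_cols = ["total_day_minutes"]

# One encoder per column so each mapping can be inverted later.
# LabelEncoder sorts the classes, so "no" -> 0 and "yes" -> 1.
encoders = {}
for col in categorical_cols:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
    encoders[col] = encoder

# Rescale each numerical column to the [0, 1] range
scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

print(df)
```

Two design points are worth keeping in mind. Label encoding assigns arbitrary integer order to nominal columns such as `state`, which tree-based models tolerate better than linear ones. Fitting the scaler on the training split only, then applying it to the test split, avoids leaking test-set statistics into training.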


**A Brief Analysis of the Dataset** 

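- **Account Length**: Average tenure is about 100 months, ranging from 1 to 243 months.
- **Voicemail Messages**: Mostly zero (the median is 0), with a mean of about 7.6 and a maximum of 52 messages.
- **Day Usage**: Mean of about 180 minutes, with a maximum of 351.5 minutes.
- **Day Calls**: Mean of about 100 calls, ranging up to 165.
- **Day Charges**: Mean charge of about $30.64, ranging up to $59.76.
- **Evening Usage**: Call counts match daytime (mean about 100), with slightly more minutes (mean about 200) but lower charges (mean about $17.02).
- **Night Usage**: Minutes and call counts are similar to the evening (means of about 200 minutes and 100 calls), with the lowest charges (mean about $9.02).
- **International Calls**: Relatively low usage and charges (means of about 10.3 minutes, 4.4 calls, and $2.77).
- **Customer Service Calls**: Mostly 1-2 calls (median 1, 75th percentile 2), with outliers up to 9.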

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>count</th>
      <th>mean</th>
      <th>std</th>
      <th>min</th>
      <th>25%</th>
      <th>50%</th>
      <th>75%</th>
      <th>max</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>account_length</th>
      <td>4250.0</td>
      <td>100.236235</td>
      <td>39.698401</td>
      <td>1.0</td>
      <td>73.0000</td>
      <td>100.00</td>
      <td>127.0000</td>
      <td>243.00</td>
    </tr>
    <tr>
      <th>number_vmail_messages</th>
      <td>4250.0</td>
      <td>7.631765</td>
      <td>13.439882</td>
      <td>0.0</td>
      <td>0.0000</td>
      <td>0.00</td>
      <td>16.0000</td>
      <td>52.00</td>
    </tr>
    <tr>
      <th>total_day_minutes</th>
      <td>4250.0</td>
      <td>180.259600</td>
      <td>54.012373</td>
      <td>0.0</td>
      <td>143.3250</td>
      <td>180.45</td>
      <td>216.2000</td>
      <td>351.50</td>
    </tr>
    <tr>
      <th>total_day_calls</th>
      <td>4250.0</td>
      <td>99.907294</td>
      <td>19.850817</td>
      <td>0.0</td>
      <td>87.0000</td>
      <td>100.00</td>
      <td>113.0000</td>
      <td>165.00</td>
    </tr>
    <tr>
      <th>total_day_charge</th>
      <td>4250.0</td>
      <td>30.644682</td>
      <td>9.182096</td>
      <td>0.0</td>
      <td>24.3650</td>
      <td>30.68</td>
      <td>36.7500</td>
      <td>59.76</td>
    </tr>
    <tr>
      <th>total_eve_minutes</th>
      <td>4250.0</td>
      <td>200.173906</td>
      <td>50.249518</td>
      <td>0.0</td>
      <td>165.9250</td>
      <td>200.70</td>
      <td>233.7750</td>
      <td>359.30</td>
    </tr>
    <tr>
      <th>total_eve_calls</th>
      <td>4250.0</td>
      <td>100.176471</td>
      <td>19.908591</td>
      <td>0.0</td>
      <td>87.0000</td>
      <td>100.00</td>
      <td>114.0000</td>
      <td>170.00</td>
    </tr>
    <tr>
      <th>total_eve_charge</th>
      <td>4250.0</td>
      <td>17.015012</td>
      <td>4.271212</td>
      <td>0.0</td>
      <td>14.1025</td>
      <td>17.06</td>
      <td>19.8675</td>
      <td>30.54</td>
    </tr>
    <tr>
      <th>total_night_minutes</th>
      <td>4250.0</td>
      <td>200.527882</td>
      <td>50.353548</td>
      <td>0.0</td>
      <td>167.2250</td>
      <td>200.45</td>
      <td>234.7000</td>
      <td>395.00</td>
    </tr>
    <tr>
      <th>total_night_calls</th>
      <td>4250.0</td>
      <td>99.839529</td>
      <td>20.093220</td>
      <td>0.0</td>
      <td>86.0000</td>
      <td>100.00</td>
      <td>113.0000</td>
      <td>175.00</td>
    </tr>
    <tr>
      <th>total_night_charge</th>
      <td>4250.0</td>
      <td>9.023892</td>
      <td>2.265922</td>
      <td>0.0</td>
      <td>7.5225</td>
      <td>9.02</td>
      <td>10.5600</td>
      <td>17.77</td>
    </tr>
    <tr>
      <th>total_intl_minutes</th>
      <td>4250.0</td>
      <td>10.256071</td>
      <td>2.760102</td>
      <td>0.0</td>
      <td>8.5000</td>
      <td>10.30</td>
      <td>12.0000</td>
      <td>20.00</td>
    </tr>
    <tr>
      <th>total_intl_calls</th>
      <td>4250.0</td>
      <td>4.426353</td>
      <td>2.463069</td>
      <td>0.0</td>
      <td>3.0000</td>
      <td>4.00</td>
      <td>6.0000</td>
      <td>20.00</td>
    </tr>
    <tr>
      <th>total_intl_charge</th>
      <td>4250.0</td>
      <td>2.769654</td>
      <td>0.745204</td>
      <td>0.0</td>
      <td>2.3000</td>
      <td>2.78</td>
      <td>3.2400</td>
      <td>5.40</td>
    </tr>
    <tr>
      <th>number_customer_service_calls</th>
      <td>4250.0</td>
      <td>1.559059</td>
      <td>1.311434</td>
      <td>0.0</td>
      <td>1.0000</td>
      <td>1.00</td>
      <td>2.0000</td>
      <td>9.00</td>
    </tr>
  </tbody>
</table>
</div>

### Model Selection

I used three classifiers that fit well with the Customer Churn dataset:

- **Logistic Regression:** 
  - Used as a baseline model due to its simplicity and interpretability.
  
- **Random Forest:**
  - Chosen for its ability to handle non-linear relationships and interactions between features.

- **Extreme Gradient Boosting (XGBoost):**
  - Although this model is beyond the scope of this course, I conducted additional research to understand and implement it. XGBoost is known for its high performance and accuracy in classification tasks.
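
As a rough sketch, the two scikit-learn models could be set up as below. The synthetic data from `make_classification` is a stand-in for the churn features, and the settings shown (`max_iter`, `n_estimators`, `random_state`) are assumptions rather than the values used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for the churn data (not the real dataset)
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=6,
    weights=[0.85], random_state=0,
)

# Baseline linear model and a non-linear ensemble, keyed by display name
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    # Training accuracy only; held-out evaluation needs a separate test split
    print(name, "training accuracy:", round(model.score(X, y), 3))
```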







### Training Process

Before training, I visualized the target variable and found a clear imbalance between churn and non-churn customers. Because of this imbalance, I split the data with scikit-learn's `StratifiedShuffleSplit`, which keeps the proportion of churn and non-churn customers the same in the training and testing sets.

With the class distribution preserved in both splits, the model learns from a representative sample and is evaluated on a test set with the same churn rate, which supports better generalization to unseen data.

I used an 80% training / 20% testing split. Given the relatively small dataset, this leaves enough data for training while retaining a representative test set for evaluation. Stratification does not rebalance the classes; it prevents a split that happens to contain too few churners, which makes the evaluation more reliable.

![image.png](attachment:image.png)
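
The stratified 80/20 split described above might look like the following sketch. The arrays are synthetic, and the 15% positive rate and `random_state` are illustrative assumptions, not figures from the dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # synthetic features
y = np.array([1] * 150 + [0] * 850)     # illustrative 15% positive class

# A single shuffled split that preserves the class ratio in both parts
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Both rates should match the overall positive rate
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```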

### Model Evaluation

For the model evaluation phase, I utilized several key metrics and tools from the scikit-learn library to assess the performance of my classifiers. The metrics included in the evaluation were:

- **Classification Report**: This report provides a detailed breakdown of the precision, recall, and F1-score for each class, offering insight into the model's performance on a per-class basis.
- **Accuracy Score**: This metric calculates the overall accuracy of the model, representing the proportion of correct predictions out of the total number of predictions.
- **Confusion Matrix**: This tool visually represents the number of true positive, true negative, false positive, and false negative predictions, providing a clear picture of the model's performance.
- **Precision, Recall, F-score, and Support**: Using the `precision_recall_fscore_support` function, I extracted these metrics to understand the balance between precision and recall for each class.

To streamline the evaluation process, I wrote a helper function, `evaluate_metrics`, which collects accuracy, precision, recall, and F-score in a single dictionary. It calls the corresponding scikit-learn functions; the classification report and confusion matrix are generated separately. The design of this function was inspired by the tutorials of this course.

Below is the code for the `evaluate_metrics` function:

```python
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, precision_recall_fscore_support


def evaluate_metrics(yt, yp):
    """Return accuracy plus per-class precision, recall, and F-score.

    yt: true labels; yp: predicted labels.
    """
    results_pos = {}
    results_pos['accuracy'] = accuracy_score(yt, yp)
    # With the default average=None, each result is an array with one entry per class.
    precision, recall, f_beta, _ = precision_recall_fscore_support(yt, yp)
    results_pos['recall'] = recall
    results_pos['precision'] = precision
    results_pos['f1score'] = f_beta
    return results_pos
```
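
As written, `evaluate_metrics` returns one precision, recall, and F-score value per class, because `precision_recall_fscore_support` defaults to `average=None`. The sketch below shows that behavior on toy labels, next to the `average="weighted"` option that yields a single number per metric. Which averaging produced the single-number scores reported later is not stated, so treat the weighted variant as an assumption.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy labels for illustration only: four non-churners, two churners
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Default: arrays with one entry per class (class 0 first, then class 1)
precision, recall, f_score, support = precision_recall_fscore_support(y_true, y_pred)

# Weighted average: a single number per metric, weighted by class support
precision_w, recall_w, f_score_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(classification_report(y_true, y_pred))
print(cm)
```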


**Model Evaluation: ROC Curve Analysis**

In this section, we present the performance evaluation of the three classifiers: Logistic Regression, Random Forest, and XGBoost, based on their ROC curves and corresponding AUC scores.

**Logistic Regression AUC: 0.79** 

**Random Forest AUC: 0.90**

**XGBoost AUC: 0.91**

- The ROC curve illustrates the trade-off between sensitivity (True Positive Rate) and specificity (1 - False Positive Rate) across different classification thresholds for each model.
- A higher AUC value indicates better discrimination ability of the classifier in distinguishing between the positive and negative classes.
- Based on the AUC scores, both Random Forest and XGBoost outperform Logistic Regression in terms of predictive performance, with XGBoost demonstrating the highest AUC of 0.91.
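
For reference, a ROC curve and its AUC can be computed with scikit-learn as in this minimal sketch. The four labels and scores are toy values; in practice the scores would be predicted churn probabilities (for example the second column of `predict_proba`), not hard class labels.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
churn_scores = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true, churn_scores)

# Area under the ROC curve: the probability that a randomly chosen positive
# is scored higher than a randomly chosen negative
auc = roc_auc_score(y_true, churn_scores)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)
```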

**Conclusion:**
- The ROC curve analysis suggests that Random Forest and XGBoost classifiers exhibit superior performance compared to Logistic Regression in predicting customer churn, with XGBoost being the top-performing model with an AUC of 0.91.


![image.png](attachment:image.png)

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Classifier Name</th>
      <th>Accuracy Score</th>
      <th>f1_score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Logistic Regression</td>
      <td>0.86</td>
      <td>0.82</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Random Forest</td>
      <td>0.96</td>
      <td>0.95</td>
    </tr>
    <tr>
      <th>2</th>
      <td>XGBoost</td>
      <td>0.96</td>
      <td>0.96</td>
    </tr>
  </tbody>
</table>
</div>

**Key Hyperparameters:**

To predict churn with the XGBoost model, we trained on 16 scaled features and tuned the following hyperparameters, which shape the model outcome:
- `objective`: "binary:logistic"
- `colsample_bylevel`: 0.7
- `colsample_bytree`: 0.8 (typical range 0.5 to 1.0)
- `learning_rate`: 0.15
- `gamma`: 1 (default: 0)
- `max_depth`: 4 (kept shallow to avoid overfitting)
- `max_delta_step`: 3
- `min_child_weight`: 1 (higher values make the model more conservative and can lead to underfitting)
- `n_estimators`: 50
- `reg_lambda`: 10 (L2 regularization)
- `scale_pos_weight`: 1.5 (to account for class imbalance)
- `subsample`: 0.9
- `silent`: False
- `n_jobs`: 4

The model was retrained iteratively with different hyperparameter values until the cross-validation error was significantly reduced. We focused on tuning `max_depth` and `min_child_weight` because of their high impact on the model outcome.
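
The settings listed above could be passed to XGBoost's scikit-learn wrapper roughly as follows. This is a sketch under assumptions: `XGB_PARAMS` and `build_xgb_classifier` are hypothetical names, and the `silent` setting is left out because newer XGBoost releases replaced it with `verbosity`.

```python
# Hyperparameters from the list above, keyed by the argument names
# used by xgboost's scikit-learn wrapper.
XGB_PARAMS = {
    "objective": "binary:logistic",
    "colsample_bylevel": 0.7,
    "colsample_bytree": 0.8,
    "learning_rate": 0.15,
    "gamma": 1,
    "max_depth": 4,
    "max_delta_step": 3,
    "min_child_weight": 1,
    "n_estimators": 50,
    "reg_lambda": 10,
    "scale_pos_weight": 1.5,
    "subsample": 0.9,
    "n_jobs": 4,
}


def build_xgb_classifier(params=None):
    """Return an unfitted XGBClassifier (hypothetical helper).

    xgboost is imported lazily so the parameter dictionary can be
    inspected even where the library is not installed.
    """
    from xgboost import XGBClassifier

    return XGBClassifier(**(params or XGB_PARAMS))
```

Calling `build_xgb_classifier().fit(X_train, y_train)` would then train the model on a prepared training split.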

### Model Explainability

The visualization below shows Shapley (SHAP) values on the x-axis and the features on the y-axis. Instances to the left of the center line (negative SHAP values) push the predicted outcome down, while those to the right push it up. A plot with this layout is usually called a SHAP summary (beeswarm) plot. A force plot, by contrast, explains a single prediction, with features that increase the prediction highlighted in red and those that decrease it shown in blue.

- **Total Day Minutes**: A negative SHAP value means the feature pushed that customer's prediction toward a lower churn probability. In this plot, total day minutes shows negative SHAP values, suggesting that higher total day minutes are associated with a lower likelihood of churn. The violet and blue colors represent the range of feature values, with violet indicating lower values and blue indicating higher values.

- **Number of Customer Service Calls**: Similarly, this feature shows negative SHAP values, suggesting that a higher number of customer service calls is associated with a lower likelihood of churn.

![image.png](attachment:image.png)
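
To make the sign convention concrete, the sketch below computes SHAP values by hand for a linear model, where (assuming independent features) the SHAP value of a feature is its coefficient times the feature's deviation from its mean. This is not the explainer used for the XGBoost plot above; the weights, bias, and data are hypothetical and only illustrate that negative values pull a prediction below the average output and positive values push it above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # three hypothetical features
weights = np.array([2.0, -1.0, 0.5])     # hypothetical linear coefficients
bias = 0.3


def predict(data):
    """Linear model output for each row of `data`."""
    return data @ weights + bias


# Base value: the average model output over the dataset
base_value = predict(X).mean()

# Linear-model SHAP values: coefficient * (feature value - feature mean),
# giving one value per (row, feature)
shap_values = weights * (X - X.mean(axis=0))

# Additivity: base value plus a row's SHAP values recovers its prediction
row = 0
reconstructed = base_value + shap_values[row].sum()
print("prediction:   ", predict(X)[row])
print("reconstructed:", reconstructed)
print("per-feature contributions:", shap_values[row])
```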


### Key Findings and Insights

**Insights:**

**Main Drivers of Churn:**
- **International Plan**: Customers with international plans are more likely to churn, which may indicate dissatisfaction with international calling services.
- **Customer Service Calls**: A higher number of calls to customer service correlates strongly with churn, suggesting unresolved issues or dissatisfaction.
- **Total Day Minutes and Charges**: Customers with higher day minutes and charges are more likely to churn, possibly due to dissatisfaction with pricing plans.
- **Number of Voicemail Messages**: Surprisingly, customers with more voicemail messages also churn more, which may indicate dissatisfaction with the voicemail service.

**Leveraging Insights to Reduce Churn:**
- Optimize international plans, enhance customer service efficiency, review pricing strategies, and improve voicemail service.



### Next Steps and Plan of Action

The current analysis has two main limitations: the small dataset (4,250 rows) and the imbalance between churn and non-churn outcomes. To address them, future work includes expanding the dataset through additional data collection, exploring feature engineering techniques that capture more predictive signal, and experimenting with modeling approaches better suited to imbalanced data. Rigorous cross-validation and appropriate evaluation metrics will be used to check model robustness and accuracy. Addressing these issues should refine model performance and make the churn predictions more reliable.
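
One possible shape for the planned cross-validation is sketched below, assuming stratified 5-fold splits, F1 on the positive class as the metric, and a class-weighted Random Forest. The data is synthetic and every setting is an illustrative assumption, not a result from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=6,
    weights=[0.85], random_state=0,
)

# Stratified folds keep the class ratio stable in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

# F1 on the positive class is less flattering than accuracy on imbalanced data
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("F1 per fold:", scores)
print("mean F1:", scores.mean(), "std:", scores.std())
```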
