<div style="border:solid green 2px; padding: 20px"> <h1 style="color:green; margin-bottom:20px">Reviewer's comment v1</h1>

Hello Hyrum!

I'm happy to review your project today 🙌

My name is **Justino Imbert** ([this](https://hub.tripleten.com/u/125e88ae) is my Hub profile).


You can find my comments under the heading **«Review»**. I will categorize my comments in green, yellow, red or blue boxes like this:

<div class="alert alert-success">
    <b>Success:</b> if everything is done successfully
</div>
<div class="alert alert-warning">
    <b>Remarks:</b> if I can give some recommendations or ways to improve the project
</div>
<div class="alert alert-danger">
    <b>Needs fixing:</b> if the block requires some corrections. Work can't be accepted with red comments
</div>

Please don't remove my comments :) If you have any questions, don't hesitate to respond to my comments in a separate section.
<div class="alert alert-info"> <b>Student comments:</b> For example like this</div>    

<div class="alert alert-block alert-info">
<b>Reviewer's comment v1:</b>

Amazing job with this submission! I'm approving this project!

Good luck!
    
</div>

## Final Project: Work Plan 

### 1. Business Understanding

The telecom operator Interconnect wants to forecast customer churn — that is, predict which customers are likely to leave the service soon.

#### Business Value:
If we can identify customers at risk of churning, the company can take preventive actions such as offering promotional discounts, loyalty rewards, or better contract terms to retain them.

#### Goal (Machine Learning Task):
Build a binary classification model that predicts whether a customer will churn (1) or stay (0) based on their contract, service, and personal data.

### 2. About the Data

The data consists of four CSV files linked by the customerID column:

- contract.csv: contract type, start and end dates, billing method, and payment details.

- personal.csv: demographic information such as gender, senior citizen status, and dependents.

- internet.csv: details about internet services (DSL, fiber, security options, etc.).

- phone.csv: details about phone services (multiple lines, call plans, etc.).

Each file contains information as of February 1, 2020.

### 3. Target Definition

The target variable is based on the EndDate column in contract.csv:

If EndDate = "No", the client is still active → 0

If EndDate is a date, the client has left → 1

We’ll transform this column into a binary churn variable for modeling.
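
A minimal sketch of that transformation is shown here on a small hand-made frame rather than the real contract.csv. The frame name `contract_demo` and its rows are illustrative assumptions; only the "No" / date-string convention comes from the data preview.

```python
import pandas as pd

# Toy stand-in for contract.csv; the rows are illustrative, not real customers.
contract_demo = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "EndDate": ["No", "2019-12-01 00:00:00", "No"],
})

# churn = 1 when EndDate holds a date, 0 when it is the literal string "No".
contract_demo["churn"] = (contract_demo["EndDate"] != "No").astype(int)

print(contract_demo[["customerID", "churn"]])
```

Comparing against the literal "No" keeps the flag independent of whether the date columns have already been converted.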

## Preliminary Data Exploration

In [1]:
import pandas as pd

# Load the data
contract = pd.read_csv('/datasets/final_provider/contract.csv')
personal = pd.read_csv('/datasets/final_provider/personal.csv')
internet = pd.read_csv('/datasets/final_provider/internet.csv')
phone = pd.read_csv('/datasets/final_provider/phone.csv')

# Quick inspection
print(contract.shape)
print(contract.head(3))

# Check the balance of the target
print(contract['EndDate'].value_counts(normalize=True))

(7043, 8)
   customerID   BeginDate              EndDate            Type  \
0  7590-VHVEG  2020-01-01                   No  Month-to-month   
1  5575-GNVDE  2017-04-01                   No        One year   
2  3668-QPYBK  2019-10-01  2019-12-01 00:00:00  Month-to-month   

  PaperlessBilling     PaymentMethod  MonthlyCharges TotalCharges  
0              Yes  Electronic check           29.85        29.85  
1               No      Mailed check           56.95       1889.5  
2              Yes      Mailed check           53.85       108.15  
No                     0.734630
2019-11-01 00:00:00    0.068863
2019-12-01 00:00:00    0.066165
2020-01-01 00:00:00    0.065313
2019-10-01 00:00:00    0.065029
Name: EndDate, dtype: float64


In [2]:
# Dtypes & missingness
contract.info()
contract.isna().mean().sort_values(ascending=False).head(10)

# Convert dates (preview only; full pipeline will do this after merges)
date_cols = ['BeginDate', 'EndDate']
for c in date_cols:
    contract[c] = pd.to_datetime(contract[c], errors='coerce')

print(contract[date_cols].min(), contract[date_cols].max())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB
BeginDate   2013-10-01
EndDate     2019-10-01
dtype: datetime64[ns] BeginDate   2020-02-01
EndDate     2020-01-01
dtype: datetime64[ns]


In [3]:
# Key overlap
ids = {
    'contract': set(contract['customerID']),
    'personal': set(personal['customerID']),
    'internet': set(internet['customerID']),
    'phone': set(phone['customerID'])
}
common = set.intersection(*ids.values())
print('Counts:', {k: len(v) for k,v in ids.items()})
print('Common IDs:', len(common))

# Temporary churn flag for exploration only.
# Relies on the previous cell: EndDate "No" was coerced to NaT, so notna() marks churned clients.
tmp = contract[['customerID', 'BeginDate', 'EndDate']].copy()
tmp['churn'] = tmp['EndDate'].notna().astype(int)
print(tmp['churn'].value_counts(normalize=True))

Counts: {'contract': 7043, 'personal': 7043, 'internet': 5517, 'phone': 6361}
Common IDs: 4835
0    0.73463
1    0.26537
Name: churn, dtype: float64


In [4]:
peek_cols = ['Type','PaymentMethod','PaperlessBilling','MonthlyCharges','TotalCharges']
print(contract[peek_cols].head())

# Basic frequencies (trimmed)
for col in ['Type','PaymentMethod','PaperlessBilling']:
    print(f'\n{col} distribution:')
    print(contract[col].value_counts(dropna=False).head(10))

             Type              PaymentMethod PaperlessBilling  MonthlyCharges  \
0  Month-to-month           Electronic check              Yes           29.85   
1        One year               Mailed check               No           56.95   
2  Month-to-month               Mailed check              Yes           53.85   
3        One year  Bank transfer (automatic)               No           42.30   
4  Month-to-month           Electronic check              Yes           70.70   

  TotalCharges  
0        29.85  
1       1889.5  
2       108.15  
3      1840.75  
4       151.65  

Type distribution:
Month-to-month    3875
Two year          1695
One year          1473
Name: Type, dtype: int64

PaymentMethod distribution:
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64

PaperlessBilling distribution:
Yes    4171
No     2872
Name: PaperlessBilling, dtype: int64


## Notes from Quick Exploration

- Rows = 7 043 → dataset covers roughly 7 000 customers.

- Target imbalance: ≈ 73 % active (EndDate = No) and 27 % churned (EndDate is date).

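- Date range: BeginDate spans 2013-10-01 to 2020-02-01; EndDate (churned clients only) spans 2019-10-01 to 2020-01-01.

- Data types: every contract column except MonthlyCharges (float64) is object; dates need conversion → datetime, and TotalCharges needs numeric coercion.

- Missingness: contract.info() reports no null values. Because TotalCharges is stored as object, any non-numeric entries will only show up as missing after numeric coercion.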

- Service flags: “Yes / No / No internet service” → will need cleaning + encoding.

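- IDs: contract and personal each hold 7 043 IDs, internet 5 517 and phone 6 361. Only 4 835 IDs appear in all four files, so merges should be left joins from contract to keep customers without internet or phone service.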

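- Billing and charges: MonthlyCharges is already numeric (float64), while TotalCharges still has to be coerced; the expected span of ≈ $20 – $120 per month should be confirmed with describe(), since only the first rows were inspected here.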

## Clarifying Questions

- Should active customers (EndDate = No) be treated as active as of Feb 1 2020, or should we estimate their tenure another way?

- Is the goal strictly to maximize AUC-ROC, or also to explain the top drivers of churn for business insight?

- May we engineer new features from dates (e.g., contract_duration_months)?

- Should we balance classes with weights or resampling techniques (SMOTE / Random Under-sampling)?

- Are there any fields that should be excluded for privacy or because they wouldn’t be known at prediction time?

## Proposed Work Plan


### Step 1 – Data Preprocessing

- Merge contract, personal, internet, phone on customerID.

- Convert date columns → datetime; coerce numerics; clean “Yes/No” text.

- Fill/flag missing values; verify duplicates and ID consistency.
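
The preprocessing steps above can be sketched on toy frames as follows. This is one possible approach under stated assumptions, not the final pipeline: the column names `Dependents`, `InternetService` and `MultipleLines`, the blank TotalCharges entry, and the "No service" fill label are all illustrative choices.

```python
import pandas as pd

# Toy frames standing in for the four CSV files (illustrative values only).
contract = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "TotalCharges": ["29.85", " ", "108.15"],  # object dtype with one blank entry
})
personal = pd.DataFrame({"customerID": ["A-1", "B-2", "C-3"],
                         "Dependents": ["No", "Yes", "No"]})
internet = pd.DataFrame({"customerID": ["A-1", "C-3"],
                         "InternetService": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customerID": ["B-2", "C-3"],
                      "MultipleLines": ["No", "Yes"]})

# Left joins from contract keep every customer, including those
# who have no internet or no phone record.
df = contract
for part in (personal, internet, phone):
    # validate raises if customerID is duplicated on either side.
    df = df.merge(part, on="customerID", how="left", validate="one_to_one")

# Non-numeric strings become NaN, so they can be filled or flagged explicitly.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Customers absent from a service file get an explicit label instead of NaN.
service_cols = ["InternetService", "MultipleLines"]
df[service_cols] = df[service_cols].fillna("No service")

print(df)
```

An inner join would instead drop every customer missing from either service file, which is why the left join is used in this sketch.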

### Step 2 – Feature Engineering

- Derive tenure_months = EndDate – BeginDate (for active clients, use a cut-off date such as the Feb 1 2020 snapshot in place of EndDate).
    
- Encode categoricals (One-Hot Encoding); binary-encode service flags.

- Create aggregated features (e.g., number of services used, auto-payment indicator).
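
A hedged sketch of these three feature ideas on toy rows follows. Several choices here are assumptions made for illustration: the Feb 1 2020 cut-off for active clients, the 30.44-day average month, and the service-flag columns `OnlineSecurity` and `TechSupport`.

```python
import pandas as pd

# Assumed cut-off for active clients: the snapshot date of the data.
SNAPSHOT_DATE = pd.Timestamp("2020-02-01")

df = pd.DataFrame({
    "BeginDate": ["2020-01-01", "2019-10-01"],
    "EndDate": ["No", "2019-12-01 00:00:00"],
    "PaymentMethod": ["Bank transfer (automatic)", "Mailed check"],
    "OnlineSecurity": ["Yes", "No"],   # hypothetical service-flag columns
    "TechSupport": ["Yes", None],
})

begin = pd.to_datetime(df["BeginDate"])
# "No" becomes NaT under errors="coerce"; active clients are capped at the snapshot.
end = pd.to_datetime(df["EndDate"], errors="coerce").fillna(SNAPSHOT_DATE)

# Months approximated as days / 30.44 (average month length).
df["tenure_months"] = ((end - begin).dt.days / 30.44).round(1)

# 1 when the payment method is one of the "(automatic)" options.
df["auto_payment"] = (
    df["PaymentMethod"].str.contains("automatic", regex=False).astype(int)
)

# Count of services explicitly marked "Yes"; missing values count as not used.
df["n_services"] = df[["OnlineSecurity", "TechSupport"]].eq("Yes").sum(axis=1)

print(df[["tenure_months", "auto_payment", "n_services"]])
```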

### Step 3 – EDA (Exploratory Data Analysis)

- Analyze churn rates by contract type, payment method, and tenure length.

- Identify potential drivers of customer loss (e.g., monthly charges, lack of bundled services).

### Step 4 – Model Development 

- Split the data into train and test sets (stratified, ≈ 80/20).

- Establish baseline (Logistic Regression with class weights).

- Train and tune Random Forest and Gradient Boosting models.

- Select the model with the highest AUC-ROC, targeting ≥ 0.85.
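
The split-and-baseline steps can be sketched as below. This runs on synthetic data, not the Interconnect files, so the printed score says nothing about the real task; the feature matrix, the coefficients used to generate labels, and the `random_state` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the merged feature matrix.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 1.2
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Stratified 80/20 split keeps the churn share equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: logistic regression with class weights to offset the imbalance.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_train, y_train)

# AUC-ROC is computed from predicted probabilities, not hard labels.
auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Baseline AUC-ROC: {auc:.3f}")
```

The Random Forest and Gradient Boosting candidates would slot into the same split and be compared on the same metric.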

### Step 5 – Evaluation & Reporting 

- Evaluate on test set using AUC-ROC (primary) and Accuracy (secondary).

- Plot ROC curve and feature importance.

- Interpret business-relevant factors and write final conclusions.

## Expected Performance & Summary

- Target metric: AUC-ROC ≥ 0.85 (goal ≥ 0.88 for 6 SP).

- Secondary: Accuracy and interpretability.

- Summary: Initial exploration points to a mostly clean, moderately imbalanced (≈ 73/27) binary classification problem. The plan prioritizes feature engineering and boosting-based models for strong predictive power and business insight.

<div class="alert alert-block alert-success">
<b>Reviewer's comment v1:</b>

You did a great job here!

- Your presentation is very well organized: the sections Business Understanding, About the Data, and Target Definition are clear and fully aligned with the churn objective.

- Your five-step work plan is concrete and logical, including both a baseline and tree-based models. It’s great to see how you’ve structured your process thoughtfully.

- I also liked your choice of AUC-ROC as the main metric and your mention of handling class imbalance; that shows a solid understanding of model evaluation.

Excellent work! You’re demonstrating strong analytical thinking and great project organization.

    
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment v1:</b>

To make your work even stronger, I recommend adding these points to your work plan:

- Add some EDA visualizations to better explore the data.

- Create a proper validation set for model evaluation.

- Try finding the best parameters through hyperparameter tuning.

</div>