<div style="border: solid #1e90ff 2px; padding: 15px; margin: 10px">
  <b>Overall Summary of the Project – Work Plan Review</b><br><br>

  Hi <b>Elvis</b>, I’m <b>Victor Camargo</b> (<a href="https://hub.tripleten.com/u/e9cc9c11" target="_blank">profile</a>). I’ve reviewed your <b>code work plan</b>.<br>

  <b>Strong elements observed:</b><br>
  ✔️ Good start inspecting the datasets (<code>head()</code>, <code>info()</code>, <code>isnull()</code>, <code>describe()</code>) — this ensures you understand data types and missing values early.<br>
  ✔️ Correct decision to merge on <code>customerID</code> using a left join from <code>contract</code>, since it contains all clients.<br>
  ✔️ Appropriate cleaning of <code>TotalCharges</code> (handling blanks → NaN → numeric) and rationale for filling with 0.<br>
  ✔️ Well-defined churn target from <code>EndDate</code>, and thoughtful feature engineering with <code>tenure_months</code>.<br>
  ✔️ Clear EDA plan (distribution of churn, contract type, services, charges/tenure, demographics).<br>
  ✔️ Strong modeling path: Logistic Regression baseline → Random Forest → LightGBM with AUC-ROC as the main metric.<br><br>

  <b>Suggestions to strengthen the work:</b><br>
  • In your code, you repeated <code>print(df_contract.head())</code> when loading <code>df_personal</code> — update it to <code>df_personal.head()</code> so you actually preview the personal data.<br>
  • Don’t forget to inspect <code>internet.csv</code> and <code>phone.csv</code> as well — right now, only <code>contract</code> and <code>personal</code> are checked.<br>
  • When creating <code>tenure_months</code>, use the fixed reference date (<b>2020-02-01</b>) for active customers to avoid inconsistencies.<br>
  • Make sure you exclude <code>EndDate</code> (after deriving churn) and any ID-like columns from features to prevent leakage.<br>
  • Consider adding class imbalance handling methods explicitly (e.g., <code>class_weight</code> or upsampling on train set).<br><br>

  <b>Final note:</b> Overall this is a solid plan and the code foundation is correct. Approved to move forward ✅ — just address the small fixes above (dataset preview typo, inspecting all four files, and leakage/imbalance guardrails). With those in place, you’ll be well-prepared for the solution phase.<br>
</div>


<h1 align = "center"><span style = "font-size: 2em; font-weight: bold"> SPRINT 17 - FINAL PROJECT </span></h1>

<h1 align = "center"><span style = "font-size: 1em; font-weight: bold"> PROJECT TITLE : CLIENT CHURN PREDICTION FOR INTERCONNECT  </span></h1>

<h1 align = "center"><span style = "font-size: 2em; font-weight: bold"> WORK PLAN  </span></h1>

In [1]:
import pandas as pd

## Data Loading and Initial Exploration

In [2]:


# Load the contract table and take a first look at its structure
df_contract = pd.read_csv('/datasets/final_provider/contract.csv')
print(df_contract.head())

print("------------------------------------------------------------------------------------")
print("Data info")
df_contract.info()  # info() prints its report itself and returns None

print("------------------------------------------------------------------------------------")
print("Missing values count")
print(df_contract.isnull().sum())
print("------------------------------------------------------------------------------------")
print("Basic descriptive statistics")
print(df_contract.describe())


   customerID   BeginDate              EndDate            Type  \
0  7590-VHVEG  2020-01-01                   No  Month-to-month   
1  5575-GNVDE  2017-04-01                   No        One year   
2  3668-QPYBK  2019-10-01  2019-12-01 00:00:00  Month-to-month   
3  7795-CFOCW  2016-05-01                   No        One year   
4  9237-HQITU  2019-09-01  2019-11-01 00:00:00  Month-to-month   

  PaperlessBilling              PaymentMethod  MonthlyCharges TotalCharges  
0              Yes           Electronic check           29.85        29.85  
1               No               Mailed check           56.95       1889.5  
2              Yes               Mailed check           53.85       108.15  
3               No  Bank transfer (automatic)           42.30      1840.75  
4              Yes           Electronic check           70.70       151.65  
------------------------------------------------------------------------------------
Data info
<class 'pandas.core.frame.DataFrame'>
RangeInd

In [3]:
# Load the personal table and take a first look at its structure
df_personal = pd.read_csv('/datasets/final_provider/personal.csv')
print(df_personal.head())  # preview the personal table itself

print("------------------------------------------------------------------------------------")
print("Data info")
df_personal.info()  # info() prints its report itself and returns None

print("------------------------------------------------------------------------------------")
print("Missing values count")
print(df_personal.isnull().sum())
print("------------------------------------------------------------------------------------")
print("Basic descriptive statistics")
print(df_personal.describe())

------------------------------------------------------------------------------------
Data info
<class 'pandas.core.frame.DataFrame'>
RangeInd
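
The plan merges four tables, so the same inspection could be wrapped in a small helper and reused for each of them. This is only a sketch: `inspect_frame` is a hypothetical helper name, and the toy DataFrame below is invented stand-in data, not the real phone table.

```python
import pandas as pd


def inspect_frame(df: pd.DataFrame, name: str) -> dict:
    """Print a quick overview of a table and return a few summary numbers."""
    print(f"===== {name} =====")
    print(df.head())
    df.info()  # prints directly; the return value is None
    missing = df.isnull().sum()
    print(missing)
    return {"rows": len(df), "columns": df.shape[1], "missing": int(missing.sum())}


# Invented stand-in data, only to show the call
demo = pd.DataFrame({
    "customerID": ["A", "B", "C"],
    "MultipleLines": ["Yes", None, "No"],
})
summary = inspect_frame(demo, "phone (toy data)")
```

Calling the helper once per loaded table keeps the printed overview identical for all of them.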

# Data Merging and Feature Engineering

* Merge the four DataFrames (df_contract, df_personal, df_internet, df_phone) into a single, comprehensive DataFrame using customerID as the key.<br>

We used a left merge starting from df_contract, which contains every customer ID, so no client is dropped. Clients who are absent from one of the other tables get NaN in that table's columns.
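
A minimal sketch of that merge, assuming each table has exactly one row per `customerID`. The four small DataFrames are invented stand-ins, and the service column names are illustrative rather than taken from the real files.

```python
import pandas as pd

# Invented stand-in tables; only customerID is shared between them
df_contract = pd.DataFrame({"customerID": ["A", "B", "C"],
                            "Type": ["Month-to-month", "One year", "Two year"]})
df_personal = pd.DataFrame({"customerID": ["A", "B", "C"],
                            "gender": ["Female", "Male", "Male"]})
df_internet = pd.DataFrame({"customerID": ["A", "C"],
                            "InternetService": ["DSL", "Fiber optic"]})
df_phone = pd.DataFrame({"customerID": ["B", "C"],
                         "MultipleLines": ["No", "Yes"]})

# Start from the contract table and left-join the others one by one.
# validate="one_to_one" raises if a customerID is duplicated on either side.
df = df_contract
for other in (df_personal, df_internet, df_phone):
    df = df.merge(other, on="customerID", how="left", validate="one_to_one")

print(df)
```

The row count stays equal to that of `df_contract`, and the NaN values in the service columns mark clients who do not appear in the corresponding table.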
  
* Clean the TotalCharges column, converting it from a string to a numeric type and handling any missing values.<br>
The column was a string because some entries were blank spaces. We converted these to NaNs and then filled them with 0. This is a reasonable approach because these missing values likely correspond to new customers who haven't accumulated any charges yet.
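
The conversion described above can be sketched as follows, on invented values. `errors="coerce"` turns anything that cannot be parsed as a number, including a blank string, into NaN.

```python
import pandas as pd

# Invented sample: the middle entry is a single blank space
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Unparseable entries become NaN instead of raising an error
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
n_blank = int(df["TotalCharges"].isna().sum())  # how many rows were blank

# Fill the former blanks with 0, as described in the plan
df["TotalCharges"] = df["TotalCharges"].fillna(0)
print(n_blank, df["TotalCharges"].tolist())
```

Counting the blanks before filling them makes it easy to confirm that only a handful of rows are affected.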
  

* Create the target feature, churn, which will be a binary column (0 or 1) based on the EndDate column.<br>

We created a binary target variable (1 for churn, 0 for no churn) directly from the EndDate column: an EndDate of "No" marks an active client (0), and any date marks a client who has left (1). This column defines our classification problem.
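
A short sketch of that rule, using `EndDate` values in the same form as the preview of `df_contract` above. The column name `churn` follows the plan.

```python
import pandas as pd

# EndDate is either the string "No" or a timestamp string
df = pd.DataFrame({
    "EndDate": ["No", "2019-12-01 00:00:00", "No", "2019-11-01 00:00:00"],
})

# Anything other than "No" means the contract has ended
df["churn"] = (df["EndDate"] != "No").astype(int)
print(df)
```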
  

* Engineer new features, such as tenure (the duration of the client's contract), which is a crucial predictor for churn.<br>
  The tenure_months feature is the client's total time with the company in months. For churned clients, this is the period from BeginDate to EndDate; for active clients, it runs from BeginDate to the fixed reference date 2020-02-01.
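
One way to compute `tenure_months` is sketched below. It assumes the reference date 2020-02-01 for active clients and counts whole calendar months, which is an assumption that fits because the previewed dates all fall on the first of a month.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2020-02-01")  # assumed cut-off for active clients

df = pd.DataFrame({
    "BeginDate": ["2020-01-01", "2019-10-01", "2017-04-01"],
    "EndDate": ["No", "2019-12-01 00:00:00", "No"],
})

begin = pd.to_datetime(df["BeginDate"])

# Blank out "No" first so only real dates are parsed, then fill the gaps
end_raw = df["EndDate"].where(df["EndDate"] != "No")
end = pd.to_datetime(end_raw).fillna(REFERENCE_DATE)

# Difference in whole calendar months
df["tenure_months"] = (end.dt.year - begin.dt.year) * 12 + (end.dt.month - begin.dt.month)
print(df)
```

Because `EndDate` directly encodes the target, it should be dropped from the feature set once `churn` and `tenure_months` have been derived.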

# Exploratory Data Analysis (EDA)

* Analyze the Distribution of Churn: We'll start with a simple visualization to see how balanced or imbalanced our target variable is. This is crucial for understanding the challenge of our classification task.

* Churn by Contract Type: We'll examine how the churn rate varies across different contract types (e.g., monthly, 1-year, 2-year). This will likely reveal a strong predictor of churn.

* Impact of Services on Churn: We'll investigate if having certain services (like OnlineSecurity, TechSupport, or StreamingTV) affects the likelihood of a client churning.

* Influence of Charges and Tenure: We'll use visualizations to see the relationship between TotalCharges, tenure_months, and MonthlyCharges with the Churn status. This will help us understand if there are financial or temporal patterns associated with churn.

* Demographic Factors: We'll check if demographic data, such as gender, SeniorCitizen, Partner, and Dependents, plays a role in customer churn.
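
The first two checks in this list reduce to a couple of pandas aggregations. The sketch below runs on invented data and assumes the contract type sits in the `Type` column, as in the preview of `df_contract`.

```python
import pandas as pd

# Invented data: six clients with a contract type and a churn flag
df = pd.DataFrame({
    "Type": ["Month-to-month", "Month-to-month", "One year",
             "Two year", "Month-to-month", "One year"],
    "churn": [1, 1, 0, 0, 0, 1],
})

# Share of each class in the target (how imbalanced is it?)
class_balance = df["churn"].value_counts(normalize=True)

# The mean of a 0/1 column is the churn rate within each group
churn_by_type = df.groupby("Type")["churn"].mean().sort_values(ascending=False)

print(class_balance)
print(churn_by_type)
```

The same `groupby(...)["churn"].mean()` pattern applies to the service and demographic columns, and either result can be passed to `.plot(kind="bar")` for a quick chart.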

# Data Preprocessing for Model Training

* Categorical Feature Encoding:<br> We'll convert categorical text data (like Type, PaymentMethod, etc.) into a numerical format. We'll use one-hot encoding for this, which creates a new binary column for each category.

* Feature and Target Separation:<br> We'll split the data into two parts: the features (X), which are the columns we'll use to make predictions, and the target (y), which is our Churn variable.

* Data Splitting:<br> We'll divide our dataset into three separate sets: a training set, a validation set, and a test set. This is essential for properly evaluating our model's performance and preventing overfitting.

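* Feature Scaling:<br> While not always necessary for tree-based models, it's good practice for others like logistic regression. We'll use StandardScaler to standardize the numerical features (zero mean, unit variance), fitting it on the training set only so that no information from the validation or test set leaks into training.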
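
The four steps above can be sketched end to end on invented data. The 60/20/20 split ratio, `drop_first=True`, and `random_state=42` are assumptions made for this illustration; the plan does not fix them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented data: 20 clients, one categorical and one numerical feature
df = pd.DataFrame({
    "Type": ["Month-to-month", "One year", "Two year", "Month-to-month", "One year"] * 4,
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70] * 4,
    "churn": [1, 0, 0, 1, 0] * 4,
})

# One-hot encode the categorical column, then separate features and target
X = pd.get_dummies(df.drop(columns="churn"), columns=["Type"], drop_first=True)
y = df["churn"]

# 60% train, then split the remaining 40% evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Fit the scaler on the training set only, then apply it to all three sets
numeric = ["MonthlyCharges"]
scaler = StandardScaler().fit(X_train[numeric])
X_train, X_valid, X_test = X_train.copy(), X_valid.copy(), X_test.copy()
for part in (X_train, X_valid, X_test):
    part[numeric] = scaler.transform(part[numeric])

print(len(X_train), len(X_valid), len(X_test))
```

`stratify` keeps the churn ratio roughly equal across the three sets, which matters when the target is imbalanced.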

# Model Selection and Initial Training

* Select Models:<br> We'll use a few different types of classification algorithms to see which performs best on our data. A good starting point includes:

  * Logistic Regression:<br> A simple, fast, and highly interpretable model.

  * Random Forest Classifier:<br> An ensemble model known for its high performance and robustness.

  * LightGBM Classifier:<br> A powerful gradient boosting model that is very effective on tabular data and is known for its speed.

* Initial Training and Evaluation:<br> We'll train each model on the training set and then evaluate its performance on the validation set. Our primary metric will be AUC-ROC, with Accuracy as an additional metric, as specified in the project requirements.
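
A minimal sketch of that training and evaluation loop, run on synthetic data from `make_classification` rather than the Interconnect dataset. The hyperparameters, the class ratio, and `class_weight="balanced"` (one possible way to handle class imbalance) are assumptions for illustration. LightGBM is left out to keep the example dependent on scikit-learn only; its `LGBMClassifier` exposes the same `fit` / `predict_proba` interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the prepared features and target
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.73, 0.27], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                            random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_valid)[:, 1]  # probability of the positive class
    results[name] = {
        "auc_roc": roc_auc_score(y_valid, proba),
        "accuracy": accuracy_score(y_valid, model.predict(X_valid)),
    }
    print(f"{name}: AUC-ROC={results[name]['auc_roc']:.3f}, "
          f"accuracy={results[name]['accuracy']:.3f}")
```

AUC-ROC is computed from predicted probabilities and accuracy from hard class predictions, so the two metrics can rank the models differently.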

# Final Model Evaluation