In [1]:
import pandas as pd

In [5]:
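# Base directory of the provider datasets (no trailing slash, so the joined paths contain a single "/")
inner_path = "/datasets/final_provider"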

contract_df = pd.read_csv(f"{inner_path}/contract.csv")
personal_df = pd.read_csv(f"{inner_path}/personal.csv")
internet_df = pd.read_csv(f"{inner_path}/internet.csv")
phone_df = pd.read_csv(f"{inner_path}/phone.csv")

# info() prints its summary and returns None, so there is nothing useful to assign
contract_df.info()
personal_df.info()
internet_df.info()
phone_df.info()

# Keep the first rows of each table for later inspection
contract_head = contract_df.head()
personal_head = personal_df.head()
internet_head = internet_df.head()
phone_head = phone_df.head()

# Displayed as the cell's result: (rows, columns) for each table
contract_df.shape, personal_df.shape, internet_df.shape, phone_df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   D

((7043, 8), (7043, 5), (5517, 8), (6361, 2))

### Dataset Notes

- 4 datasets: `contract` (7043 rows), `personal` (7043), `internet` (5517), `phone` (6361)
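- Primary file: `contract.csv`, which contains the target source column `EndDate`
- Target: churn → `EndDate != 'No'` is churn (`1`), otherwise (`0`)
- `TotalCharges` has dtype `object` and needs conversion to float
- `BeginDate` and `EndDate` should be converted to datetime; `EndDate` holds `'No'` for active customers, which has no date equivalent and becomes missing (`NaT`)
- `internet` (5517 rows) and `phone` (6361 rows) are smaller than `contract` (7043 rows) because some customers don’t use internet or phone; a left merge on `customerID` will leave missing values for them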
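
The cleaning steps noted above could look roughly like the sketch below. It runs on a tiny hand-made frame, not on the real file: the rows, the `contract_demo` name, and the blank `TotalCharges` value are assumptions made only to show what the conversions do.

```python
import pandas as pd

# Hypothetical rows that mimic part of the contract.csv schema; values are made up.
contract_demo = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "BeginDate": ["2019-01-01", "2018-06-01", "2020-01-01"],
    "EndDate": ["No", "2019-12-01 00:00:00", "No"],
    "MonthlyCharges": [29.85, 56.95, 20.00],
    "TotalCharges": ["358.2", "1025.1", " "],  # assumed: a blank string as a bad value
})

# Target: 1 if the contract has ended, 0 if EndDate is the literal string "No".
contract_demo["churn"] = (contract_demo["EndDate"] != "No").astype(int)

# Strings that cannot be parsed as numbers become NaN instead of raising.
contract_demo["TotalCharges"] = pd.to_numeric(
    contract_demo["TotalCharges"], errors="coerce"
)

contract_demo["BeginDate"] = pd.to_datetime(contract_demo["BeginDate"])

# Blank out "No" first, then parse; active customers end up with NaT.
end_dates = contract_demo["EndDate"].where(contract_demo["EndDate"] != "No")
contract_demo["EndDate"] = pd.to_datetime(end_dates)

print(contract_demo.dtypes)
```

Creating `churn` before converting `EndDate` matters here, because the `'No'` marker is lost once the column is parsed as dates.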

---

### Feature Engineering Ideas

- Create `churn` column from `EndDate`
- Extract `contract_length` in months from `BeginDate` and `EndDate` (active customers have no end date, so a reference date is needed for them)
- Extract `begin_month`, `begin_weekday` from `BeginDate`
- One-hot encode: `PaymentMethod`, `InternetService`, `Type` (contract type), etc.
- Convert binary `Yes`/`No` columns to 1/0
- Consider class imbalance methods (SMOTE, `class_weight`)
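
A minimal sketch of the date and encoding ideas above, on made-up rows that are assumed to be already cleaned. `SNAPSHOT_DATE` is a hypothetical reference date for customers who are still active; the real extraction date would have to be confirmed before using it.

```python
import pandas as pd

# Hypothetical, already-cleaned rows; values are made up for illustration.
demo = pd.DataFrame({
    "BeginDate": pd.to_datetime(["2019-01-01", "2018-06-01"]),
    "EndDate": pd.to_datetime([None, "2019-12-01"]),  # NaT = still active
    "PaperlessBilling": ["Yes", "No"],
    "PaymentMethod": ["Electronic check", "Mailed check"],
})

# Assumption: active contracts are measured up to this reference date.
SNAPSHOT_DATE = pd.Timestamp("2020-02-01")
end_or_snapshot = demo["EndDate"].fillna(SNAPSHOT_DATE)

# Difference in calendar months between begin date and end (or snapshot) date.
demo["contract_length"] = (
    (end_or_snapshot.dt.year - demo["BeginDate"].dt.year) * 12
    + (end_or_snapshot.dt.month - demo["BeginDate"].dt.month)
)

demo["begin_month"] = demo["BeginDate"].dt.month
demo["begin_weekday"] = demo["BeginDate"].dt.dayofweek  # Monday = 0

# Binary Yes/No column to 1/0.
demo["PaperlessBilling"] = demo["PaperlessBilling"].map({"Yes": 1, "No": 0})

# One-hot encode the multi-valued categorical; raw dates are dropped.
features = pd.get_dummies(
    demo.drop(columns=["BeginDate", "EndDate"]), columns=["PaymentMethod"]
)
print(features)
```

Since `contract_length` is computed from `EndDate`, the same column that defines the target, it is worth checking during EDA that it does not leak the label.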

---

### Work Plan

1. **Merge Data**  
   Merge all datasets on `customerID` to build full customer profiles

2. **Clean & Format**  
   Fix column names, convert data types, handle missing values

3. **EDA**  
   Analyze churn distribution, missing patterns, and feature relationships

4. **Feature Engineering**  
   Create new features from dates and service types, encode categoricals

5. **Modeling**  
   Train and evaluate classifiers (Logistic Regression, RandomForest, CatBoost) using AUC-ROC as primary metric
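
Step 1 of the plan might be sketched as a chain of left merges onto the contract table, so that all customers are kept even when they have no internet or phone record. The three small frames and the `MultipleLines` column name are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Hypothetical stand-ins for contract_df, internet_df and phone_df.
contract_demo = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "Type": ["Month-to-month", "One year", "Two year"],
})
internet_demo = pd.DataFrame({
    "customerID": ["A1", "C3"],
    "InternetService": ["DSL", "Fiber optic"],
})
phone_demo = pd.DataFrame({
    "customerID": ["A1", "B2"],
    "MultipleLines": ["No", "Yes"],  # assumed column name
})

merged = contract_demo
for extra in (internet_demo, phone_demo):
    # how="left" keeps every contract row; validate raises if customerID repeats.
    merged = merged.merge(extra, on="customerID", how="left", validate="one_to_one")

# Customers without a given service show up as NaN in that service's columns.
print(merged)
print(merged.isna().sum())
```

The row count of `merged` should equal that of the contract table; a different count would point to duplicated `customerID` values.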

---

### Clarifying Questions

- Should “No internet/phone service” be treated as `"No"` or its own category?
- Should very short-term clients be filtered or treated as churn?
- Do we prioritize reducing false negatives (missing a churn) or false positives?
- Are there any restrictions on which models or libraries we can use?
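
To make the first question concrete: after a left merge, customers without internet have missing values in the internet columns, and the two options differ only in what those gaps are filled with. The sketch below shows both on made-up rows; the `OnlineSecurity` column and the `"No internet service"` label are assumptions for illustration.

```python
import pandas as pd

# Hypothetical merged rows; customer B2 has no internet record.
merged_demo = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "InternetService": ["DSL", None, "Fiber optic"],
    "OnlineSecurity": ["Yes", None, "No"],  # assumed column name
})
internet_cols = ["InternetService", "OnlineSecurity"]

# Option A: treat a missing service the same as "No".
as_no = merged_demo.copy()
as_no[internet_cols] = as_no[internet_cols].fillna("No")

# Option B: keep "no service" as its own category.
own_category = merged_demo.copy()
own_category[internet_cols] = own_category[internet_cols].fillna("No internet service")

print(as_no["OnlineSecurity"].value_counts())
print(own_category["OnlineSecurity"].value_counts())
```

Option B keeps the distinction between "has internet but declined the add-on" and "has no internet at all", at the cost of one extra level per column when one-hot encoding.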


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> Answers to Questions: <br>
1. I would likely express it as its own category. <br>
2. This is up to your discretion. <br>
3. This is also up to you: which way would you prefer the model to be wrong? <br>
4. No, but we do encourage you to use the tools that were taught and practiced in the course!
<a class="tocSkip"></a>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> This looks like a good plan. Best of luck!
<a class="tocSkip"></a>
</div>