# 1. Business Requirements

- Goal: Predict whether a customer will leave the service (churn).
- Significance: Helps the business retain customers, improve its business strategy, etc.
- Data: https://www.kaggle.com/datasets/blastchar/telco-customer-churn (Kaggle)
- Target variable: Churn (0: No, 1: Yes)
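
The raw `Churn` column holds the strings `Yes`/`No`, so the 0/1 encoding listed above has to be produced explicitly. A minimal sketch, assuming a plain pandas mapping; the toy frame is illustrative and this is not necessarily how the project's own cleaning code does it:

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({"Churn": ["No", "Yes", "No", "Yes"]})

# Map the string labels to the 0/1 target described above.
toy["Churn"] = toy["Churn"].map({"No": 0, "Yes": 1})

print(toy["Churn"].tolist())
```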

---

# 2. Data Understanding

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('../data/Telecom_Customer_Churn.csv')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
print(f"Dataset size: {df.shape}")    # Size of the dataset as (rows, columns)

Dataset size: (7043, 21)


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
df.describe()   # Summary statistics for the numeric columns

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


---

### Columns Information

* `tenure`: Number of months the customer has used the service.
* `PhoneService`: Whether the customer has phone service (`Yes`/`No`).
* `MultipleLines`: Whether the customer has multiple phone lines (`Yes`/`No`/`No phone service`).
* `InternetService`: Type of internet service (`DSL`, `Fiber optic`, `No`).
* `OnlineSecurity`: Online security service (`Yes`/`No`/`No internet service`).
* `OnlineBackup`: Online backup service (`Yes`/`No`/`No internet service`).
* `DeviceProtection`: Device protection (`Yes`/`No`/`No internet service`).
* `TechSupport`: Technical support (`Yes`/`No`/`No internet service`).
* `StreamingTV`: TV streaming (`Yes`/`No`/`No internet service`).
* `StreamingMovies`: Movie streaming (`Yes`/`No`/`No internet service`).
* `Contract`: Contract type (`Month-to-month`, `One year`, `Two year`).
* `PaperlessBilling`: Paperless (electronic) billing (`Yes`/`No`).
* `PaymentMethod`: Payment method (`Electronic check`, `Mailed check`, `Bank transfer (automatic)`, `Credit card (automatic)`).
* `MonthlyCharges`: Monthly charge.
* `TotalCharges`: Total amount charged.
* `Churn` (Label): Whether the customer left the service (`Yes`/`No`).

---

In [6]:
df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [7]:
for col in df.select_dtypes(include='object').columns:
    print(f"{col}:\n{df[col].unique()}")

customerID:
['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
gender:
['Female' 'Male']
Partner:
['Yes' 'No']
Dependents:
['No' 'Yes']
PhoneService:
['No' 'Yes']
MultipleLines:
['No phone service' 'No' 'Yes']
InternetService:
['DSL' 'Fiber optic' 'No']
OnlineSecurity:
['No' 'Yes' 'No internet service']
OnlineBackup:
['Yes' 'No' 'No internet service']
DeviceProtection:
['No' 'Yes' 'No internet service']
TechSupport:
['No' 'Yes' 'No internet service']
StreamingTV:
['No' 'Yes' 'No internet service']
StreamingMovies:
['No' 'Yes' 'No internet service']
Contract:
['Month-to-month' 'One year' 'Two year']
PaperlessBilling:
['Yes' 'No']
PaymentMethod:
['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
TotalCharges:
['29.85' '1889.5' '108.15' ... '346.45' '306.6' '6844.5']
Churn:
['No' 'Yes']
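
`TotalCharges` appears in the object-dtype loop above with values such as `'29.85'`, so it is stored as text rather than as a number. A minimal sketch of one common way to convert it, assuming `pd.to_numeric` with `errors="coerce"`; the sample values, including the blank entry, are illustrative and not taken from the dataset:

```python
import pandas as pd

# Illustrative values; the blank entry is an assumption used to show coercion.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns anything unparseable into NaN instead of raising.
total = pd.to_numeric(raw, errors="coerce")

print(total.dtype)         # numeric dtype after conversion
print(total.isna().sum())  # number of entries that failed to parse
```

Counting the resulting NaN values first makes it easier to decide whether to drop or impute those rows.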


In [8]:
import sys
sys.path.append('..')

from src.preprocessing import clean_telco_data, create_preprocessor
from src.modeling import train_all_models, print_result_table, train_single_model

In [9]:
internet_service_cols = [
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
]

In [10]:
df_clean = clean_telco_data(df, internet_service_cols)

X = df_clean.drop("Churn", axis=1)
y = df_clean["Churn"]
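
`clean_telco_data` is imported from `src.preprocessing` and its body is not shown in this notebook. Since the six columns passed to it all contain the extra category `No internet service`, one plausible step is folding that category into `No`. The sketch below is a hypothetical stand-in (`collapse_no_internet` is an invented name), not the project's implementation:

```python
import pandas as pd


def collapse_no_internet(df: pd.DataFrame, service_cols: list) -> pd.DataFrame:
    """Hypothetical helper; the real clean_telco_data may do more or differ."""
    out = df.copy()  # leave the caller's frame untouched
    # Assumption: "No internet service" can be treated the same as "No".
    out[service_cols] = out[service_cols].replace("No internet service", "No")
    return out


# Toy frame with illustrative values only.
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "TechSupport": ["No internet service", "No", "Yes"],
})
cleaned = collapse_no_internet(toy, ["OnlineSecurity", "TechSupport"])
print(cleaned)
```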

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

preprocessor = create_preprocessor()
models, results = train_all_models(preprocessor, X_train, y_train, X_test, y_test)

print_result_table(results)

Model              Acc        Prec       Recall     F1        
LogisticRegression 0.750357   0.518987   0.773585   0.621212
RandomForest       0.795292   0.655556   0.477089   0.552262
GradientBoosting   0.804565   0.683019   0.487871   0.569182
DecisionTree       0.719686   0.470899   0.479784   0.475300
SVM                0.799572   0.675781   0.466307   0.551834
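
In the table above, LogisticRegression has the lowest precision of the top models but by far the highest recall, which gives it the best F1; that is presumably why it is the model saved in the next cell. As a quick sanity check, F1 can be recomputed from the reported precision and recall as their harmonic mean:

```python
# Precision and recall for LogisticRegression, copied from the table above.
precision = 0.518987
recall = 0.773585

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.6212, matching the F1 column
```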


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import joblib

pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=100, class_weight="balanced")),
])
pipeline.fit(X_train, y_train)

joblib.dump(pipeline, "../models/best_model.pkl")

0.7503566333808844 - 0.6212121212121211
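
A saved pipeline can later be restored with `joblib.load` and used for prediction without refitting. The sketch below shows the round trip on synthetic data; the `StandardScaler` step, the toy features, and the file name `toy_model.pkl` are stand-ins chosen for illustration, not the project's preprocessor or data:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic data standing in for the Telco features (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline(steps=[
    ("preprocess", StandardScaler()),  # stand-in for the real preprocessor
    ("model", LogisticRegression(class_weight="balanced")),
])
toy_pipeline.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(toy_pipeline.predict(X_toy), restored.predict(X_toy))
print(same)
```

Because the preprocessing step is stored inside the pipeline, the restored object accepts raw feature rows in the same layout as the training data.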
