# Machine Learning Project Report - Robert Varian

# Project Domain
Customer churn is a significant issue in service industries, especially telecommunications. Churn, the loss of existing customers, directly reduces revenue and raises the cost of acquiring new customers to replace them. Companies therefore need to identify the customers most likely to stop using their services so that they can take preventive action.

Reference:
Amin, A., Anwar, S., Adnan, A., et al. (2019). Customer churn prediction in the telecommunication sector using a rough set approach. Neural Computing and Applications.

# Business Understanding
### Problem Statements
1.   How can customers at high risk of churning be identified?
2.   Which customer features most strongly influence the likelihood of churn?

### Goals
1.   Build an accurate churn prediction model.
2.   Identify the features that contribute most to churn.

### Solution Statements
1. Train Logistic Regression, Random Forest, and XGBoost models and compare their performance.
2. Apply hyperparameter tuning to improve model performance.
3. Evaluate the models with the F1-score as the primary metric.



# Data Understanding
The dataset used is the Telco Customer Churn Dataset, available on Kaggle:
🔗 https://www.kaggle.com/datasets/blastchar/telco-customer-churn

The dataset contains customer information from a fictional telecommunications company.

## Key Variables:
* gender, SeniorCitizen, Partner, Dependents
* tenure, PhoneService, MultipleLines, InternetService
* Contract, PaymentMethod, MonthlyCharges, TotalCharges
* Target: Churn (Yes/No)

## Visualization & EDA (Exploratory Data Analysis)
* Understand the distribution of the data and the correlations between features.
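As a rough illustration of what this exploration could look like, the sketch below computes the class balance, the churn rate per contract type, and the correlation of two numeric features with the target. It runs on a tiny hypothetical sample that only borrows column names from the dataset; the variable names (`sample`, `churn_share`, `churn_by_contract`, `corr_with_churn`) are assumptions made for this example, not part of the original notebook.

```python
import pandas as pd

# Tiny hypothetical sample that reuses column names from the Telco dataset.
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month", "Month-to-month"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# Class balance of the target, as proportions.
churn_share = sample["Churn"].value_counts(normalize=True)

# Churn rate per contract type (share of "Yes" within each group).
churn_flag = sample["Churn"].eq("Yes")
churn_by_contract = churn_flag.groupby(sample["Contract"]).mean()

# Pearson correlation between the numeric features and the binary target.
numeric = sample[["tenure", "MonthlyCharges"]].assign(Churn=churn_flag.astype(int))
corr_with_churn = numeric.corr()["Churn"].drop("Churn")

print(churn_share, churn_by_contract, corr_with_churn, sep="\n\n")
```

On the full dataset the same calls would apply to `df` once it has been loaded; plots such as histograms or a correlation heatmap could be layered on top of these summaries.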


# Data Preparation
Steps:
1. Loading:
 * Load the Kaggle data into a DataFrame.
2. Cleaning:
 * Convert TotalCharges to numeric; values that cannot be parsed become null.
 * Drop the rows with null values in TotalCharges.
3. Encoding:
 * One-hot encode the categorical features.
4. Feature Scaling:
 * Standardize the numeric features using StandardScaler.
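The cleaning and encoding steps above can be sketched on a toy frame. This is a minimal illustration under assumptions: the three rows are hypothetical, and the blank `TotalCharges` entry stands in for a value that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical toy rows; the blank TotalCharges is an unparseable value.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year"],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["Yes", "No", "No"],
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop the rows that could not be converted.
toy = toy.dropna()

# Map the target to 0/1, then one-hot encode the remaining text columns.
toy["Churn"] = (toy["Churn"] == "Yes").astype(int)
encoded = pd.get_dummies(toy, drop_first=True)

print(n_missing)
print(list(encoded.columns))
```

With `drop_first=True`, one dummy column per categorical feature is dropped, which avoids perfectly collinear columns in the Logistic Regression baseline.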

# Modeling
## Model 1: Logistic Regression (baseline)
* A baseline model that is easy to interpret.
## Model 2: Random Forest
* An ensemble algorithm that can capture interactions between features.
## Model 3: XGBoost
* A gradient boosting model known for strong performance and optimized training speed.

All three models start from their default parameters, and hyperparameter tuning is used to search for better results.
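One possible way to carry out the tuning step is a cross-validated grid search scored on F1. The sketch below is an assumption about how this could be done, not the procedure used in this project: it runs on synthetic data from `make_classification` rather than the Telco features, and the parameter grid is hypothetical and deliberately small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced data standing in for the prepared Telco features.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical grid; a real search would usually cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # optimize the same metric used for model comparison
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring the search on F1 keeps the tuning objective aligned with the metric used to compare the models.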

# Evaluation
Evaluation metrics:
* Accuracy
* Precision
* Recall
* F1-score

The F1-score is used as the primary metric because the dataset is imbalanced: far fewer customers churn than stay.
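A small worked example shows why accuracy alone can mislead on imbalanced data. The labels below are hypothetical (90 customers who stay, 10 who churn) and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 90 customers stay (0), 10 churn (1).
y_true = [0] * 90 + [1] * 10

# A "model" that never predicts churn.
y_always_no = [0] * 100

# A model that finds 6 of the 10 churners at the cost of 4 false alarms.
y_model = [0] * 86 + [1] * 4 + [1] * 6 + [0] * 4

acc_naive = accuracy_score(y_true, y_always_no)
f1_naive = f1_score(y_true, y_always_no, zero_division=0)
acc_model = accuracy_score(y_true, y_model)
f1_model = f1_score(y_true, y_model)

print(f"always 'no': accuracy={acc_naive:.2f}, F1={f1_naive:.2f}")
print(f"model:       accuracy={acc_model:.2f}, F1={f1_model:.2f}")
```

The naive predictor reaches 0.90 accuracy while detecting no churners at all, so its F1-score is 0. The second model is only slightly more accurate (0.92) but has an F1-score of 0.60, which reflects that it actually finds churners.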

**Evaluation Results:**
The best model by F1-score is Logistic Regression (0.564), ahead of Random Forest (0.541) and XGBoost (0.514).

# Data Preparation

## Data Loading

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


In [5]:
from google.colab import files
files.upload()


In [8]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
!pip install kaggle



In [10]:
!kaggle datasets download -d blastchar/telco-customer-churn

Dataset URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
License(s): copyright-authors
Downloading telco-customer-churn.zip to /content


In [11]:
import zipfile
with zipfile.ZipFile("telco-customer-churn.zip", "r") as zip_ref:
    zip_ref.extractall("telco_data")

In [12]:
df = pd.read_csv("telco_data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## Data Cleaning

In [13]:
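# Coerce TotalCharges to numeric; unparseable entries become NaN and are dropped
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.dropna(inplace=True)
# Encode the target as 1 (churn) / 0 (no churn)
df['Churn'] = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
# customerID is an identifier, not a predictive feature
df.drop(['customerID'], axis=1, inplace=True)
# One-hot encode the remaining categorical columns
df = pd.get_dummies(df, drop_first=True)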

## Feature Scaling

In [14]:
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Modeling

## Model 1: Logistic Regression

In [15]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)

## Model 2: Random Forest

In [16]:
# Tree-based models do not depend on feature scale, so the unscaled features are used
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

## Model 3: XGBoost

In [17]:
# use_label_encoder is omitted: this XGBoost version does not use it and only warns about it
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)



# Evaluation


In [18]:
def evaluate_model(name, y_true, y_pred):
    return {
        'Model': name,
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1-Score': f1_score(y_true, y_pred)
    }

results = [
    evaluate_model("Logistic Regression", y_test, y_pred_lr),
    evaluate_model("Random Forest", y_test, y_pred_rf),
    evaluate_model("XGBoost", y_test, y_pred_xgb)
]

results_df = pd.DataFrame(results)
results_df.sort_values(by="F1-Score", ascending=False)

| | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.787491 | 0.620579 | 0.516043 | 0.563504 |
| 1 | Random Forest | 0.785359 | 0.626761 | 0.475936 | 0.541033 |
| 2 | XGBoost | 0.763326 | 0.565916 | 0.470588 | 0.513869 |
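
The second goal, identifying the features that contribute most to churn, could be approached by ranking the importances of a fitted tree ensemble. The sketch below is one possible approach under assumptions, not a result from this project: the data is synthetic, the column names are only borrowed from the dataset, and the target is constructed so that it depends on `tenure` alone.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; the names only mirror columns of the Telco dataset.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "MonthlyCharges": rng.uniform(18, 120, size=n),
    "SeniorCitizen": rng.integers(0, 2, size=n),
})

# Synthetic target: "churn" is defined purely by short tenure.
y = (X["tenure"] < 12).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances, labeled by column and sorted descending.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

In the notebook the same pattern would apply to the fitted `rf` with `X_train.columns` as the index. Impurity-based importances are only one option; for the Logistic Regression model, the coefficients on the standardized features could play a similar role.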
