# 🧠 Internship Task 4 – Churn Prediction Model
Project Title: Customer Churn Analysis and Prediction<br/>
Company: Saiket Systems<br/>
Intern: Farida Bashir<br/>
Date: October 2025<br/>

## 📋 Project Overview
The project aims to analyze customer churn in a telecommunications company and develop predictive models to identify at-risk customers.
The goal is to provide actionable insights to reduce churn and improve retention.



## 🧩 Description

In this task, I built machine learning models to predict customer churn using two algorithms, Logistic Regression and Decision Tree. The data was split into training and testing sets, and each model was evaluated on accuracy, precision, recall, and F1-score. Feature selection and hyperparameter tuning were also performed to improve performance.

## 🛠️ Skills Demonstrated
* Machine Learning Algorithms
* Model Training and Evaluation
* Feature Selection
* Hyperparameter Tuning
* Understanding of Classification Metrics

### 🧩 **Step 1: Import Libraries**

In [25]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [26]:
file = 'Telco_Customer_Churn_Dataset  (3).csv'

In [27]:
df = pd.read_csv(file)

In [28]:
# Identify categorical (object-typed) features
cat_cols = df.select_dtypes(include=["object"]).columns
print("categorical Columns:", cat_cols)

# Convert Yes/No to 1/0 in every column
df.replace({"Yes": 1, "No": 0}, inplace=True)

# Drop customerID if it exists (it's not useful for prediction)
if 'customerID' in df.columns:
    df = df.drop('customerID', axis=1)

# One-hot encode the remaining categorical variables.
# Note: TotalCharges is read as an object column (see the output below),
# so it is one-hot encoded here rather than treated as a number.
cat_cols = df.select_dtypes(include=['object']).columns
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

categorical Columns: Index(['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges',
       'Churn'],
      dtype='object')
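The output above lists `TotalCharges` as an object column, so `pd.get_dummies` treats each distinct value as its own category. The following is a minimal sketch of converting such a column to numbers first. The small `sample` frame, its values, and the median fill are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: TotalCharges stored as text, with one blank entry.
sample = pd.DataFrame({
    "tenure": [1, 34, 0, 45],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],
})

# errors="coerce" turns unparseable strings (such as " ") into NaN.
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")

# Count the entries that could not be parsed before filling them.
n_missing = int(sample["TotalCharges"].isna().sum())

# One possible choice (an assumption): fill missing values with the median.
sample["TotalCharges"] = sample["TotalCharges"].fillna(
    sample["TotalCharges"].median()
)

print(sample.dtypes)
print("Blank entries converted to NaN:", n_missing)
```

Applying a conversion like this before encoding would change the feature set, so the metrics reported in Step 4 would need to be recomputed.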




---

### 🧠 **Step 2: Prepare Your Data**

In [29]:
X = df.drop('Churn', axis=1)
y = df['Churn']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale data (important for models like Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
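The split-then-scale pattern above can also be expressed as a scikit-learn pipeline, which fits the scaler on the training data only. This is a sketch on synthetic data, not the churn dataset. The class ratio, the `stratify` argument, and `max_iter=1000` are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (roughly 27% positives, an assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=42
)

# stratify keeps the class ratio similar in the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics when transforming the test data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", round(test_accuracy, 3))
```

Bundling the scaler with the model also means the same object can be passed to cross-validation without leaking test-fold statistics into the scaling step.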

### ⚙️ **Step 3: Train Multiple Models**
#### Logistic Regression

In [30]:
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

#### Decision Tree

In [31]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

---

### 📊 **Step 4: Evaluate Models**

In [32]:
models = {
    'Logistic Regression': (y_test, y_pred_lr),
    'Decision Tree': (y_test, y_pred_dt)
}
for name, (y_true, y_pred) in models.items():
    print(f"Model: {name}")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, pos_label=1))
    print("Recall:", recall_score(y_true, y_pred, pos_label=1))
    print("F1-score:", f1_score(y_true, y_pred, pos_label=1))
    print("-"*40)

Model: Logistic Regression
Accuracy: 0.7849538679914834
Precision: 0.6121794871794872
Recall: 0.5120643431635389
F1-score: 0.5576642335766423
----------------------------------------
Model: Decision Tree
Accuracy: 0.7686302342086586
Precision: 0.5775577557755776
Recall: 0.4691689008042895
F1-score: 0.5177514792899408
----------------------------------------
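To make the four metrics concrete, the sketch below derives them by hand from confusion-matrix counts and compares them with scikit-learn's functions. The ten labels are made up for illustration and are unrelated to the churn results above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = churned, 0 = stayed.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted churners who churned
recall = tp / (tp + fn)                      # share of actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("By hand:", accuracy, precision, recall, f1)
print("sklearn:",
      accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```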



**Summary:**
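Logistic Regression outperformed the Decision Tree on every metric: accuracy 0.785 vs. 0.769, precision 0.612 vs. 0.578, recall 0.512 vs. 0.469, and F1-score 0.558 vs. 0.518. The unconstrained Decision Tree may be overfitting, although training-set scores would be needed to confirm that. Overall, Logistic Regression is the preferred model for predicting customer churn, with the caveat that a recall of about 0.51 means it still misses roughly half of the customers who churn.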
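Whether a tree overfits can be checked by comparing its accuracy on the training data with its accuracy on the test data. The sketch below does this on synthetic data with some label noise (an assumption, not the churn dataset), for an unconstrained tree and for one limited to depth 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of labels flipped to simulate noise.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

gap_results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    gap_results[depth] = {
        "train": tree.score(X_tr, y_tr),
        "test": tree.score(X_te, y_te),
    }
    print(f"max_depth={depth}: train={gap_results[depth]['train']:.3f}, "
          f"test={gap_results[depth]['test']:.3f}")
```

A large gap between the training and test scores is the usual sign of overfitting. Running the same comparison on `dt` with `X_train` and `X_test` would show whether it applies here.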

---

### 🧩 **Step 5: Feature Selection (Optional)**

In [33]:
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("Top 10 Features:", selected_features)

Top 10 Features: Index(['tenure', 'MonthlyCharges', 'InternetService_Fiber optic',
       'OnlineBackup_No internet service',
       'DeviceProtection_No internet service',
       'TechSupport_No internet service', 'StreamingTV_No internet service',
       'StreamingMovies_No internet service', 'Contract_Two year',
       'PaymentMethod_Electronic check'],
      dtype='object')


### 🔹 **Interpretation:**

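The feature selection results highlight the features most strongly associated with customer churn.
**Tenure**, **monthly charges**, and **contract type** are among them, which suggests that customers with shorter tenure, higher monthly costs, or short-term contracts are more likely to churn. The chi-squared score measures only the strength of an association, not its direction, so these directions should be confirmed against churn rates per group.
Service-related features also rank highly: **Fiber optic Internet** and five `No internet service` indicators (online backup, device protection, technical support, streaming TV, and streaming movies). Those five indicators all flag customers without an internet connection, so they largely carry the same information.
Overall, these features are useful inputs for building and improving the churn prediction model.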
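The sketch below shows one way to run the same kind of chi-squared selection on the training split only, inspect the scores, and then check the direction of an association with a group mean. The data and the column names `signal`, `noise_a`, and `noise_b` are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
y_syn = rng.integers(0, 2, n)

# chi2 requires non-negative features; "signal" depends on the label,
# while the two noise columns do not.
X_syn = pd.DataFrame({
    "signal": 3 * y_syn + rng.integers(0, 2, n),
    "noise_a": rng.integers(0, 5, n),
    "noise_b": rng.integers(0, 5, n),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Fit the selector on the training split only, so the test data stays unseen.
selector_demo = SelectKBest(chi2, k=1).fit(X_tr, y_tr)
scores = pd.Series(selector_demo.scores_, index=X_tr.columns)
selected = list(X_tr.columns[selector_demo.get_support()])

# The score gives strength only; a group mean shows the direction.
mean_by_class = X_tr.groupby(y_tr)["signal"].mean()

print(scores.sort_values(ascending=False))
print("Selected:", selected)
print(mean_by_class)
```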


---

### 🔧 **Step 6: Hyperparameter Tuning Example**

***For Decision Tree***

In [None]:
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)

**Summary:**
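Hyperparameter tuning was carried out with a 5-fold cross-validated grid search over `max_depth` and `min_samples_split` to improve the Decision Tree. The best parameters obtained were **max_depth = 5** and **min_samples_split = 2**. The tuned model's test-set metrics are not recorded here, so the size of the improvement over the untuned tree still needs to be measured.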
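A tuned model can be compared with the untuned one by scoring the search's best estimator on the held-out test set. The sketch below uses synthetic data and the same parameter grid. The `scoring="f1"` argument is an assumption added here: `GridSearchCV` otherwise uses the estimator's default score, which is accuracy for classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data.
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

param_grid_demo = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
}

# Select parameters by cross-validated F1 instead of the default accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid_demo, cv=5, scoring="f1"
)
search.fit(X_tr, y_tr)

# With refit=True (the default), predict() uses the best estimator
# refitted on the whole training set.
tuned_f1 = f1_score(y_te, search.predict(X_te))

print("Best parameters:", search.best_params_)
print("Cross-validated F1:", round(search.best_score_, 3))
print("Test F1:", round(tuned_f1, 3))
```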