# 🧠 Internship Task 1 – Data Preparation
Project Title: Customer Churn Analysis and Prediction<br/>
Company: Saiket Systems<br/>
Intern: Farida Bashir<br/>
Date: October 2025<br/>

---

## 📋 Project Overview
The project aims to analyze customer churn in a telecommunications company and develop predictive models to identify at-risk customers.
The goal is to provide actionable insights to reduce churn and improve retention.

---

## 🧩 Description
In this task, you will be responsible for loading the dataset and conducting an initial exploration. Handle missing values and, if necessary, convert categorical variables into numerical representations. Finally, split the dataset into training and testing sets for subsequent model evaluation.

## 🛠️ Skills
* Data loading and data exploration
* Handling missing values
* Data preprocessing
* Categorical variable encoding
* Dataset splitting

#### 1. Import Necessary Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

#### 2. Load the dataset

In [5]:
file = 'Telco_Customer_Churn_Dataset  (3).csv'

In [6]:
df = pd.read_csv(file)
# preview data
print(df.shape)
df.head()

(7043, 21)


|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


#### 3. Initial Exploration

In [7]:
# General info
df.info()
# Summary statistics
df.describe()
# Check missing values
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [8]:
df.describe()

|   | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |
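
`describe()` summarizes only numeric columns by default, which is why `TotalCharges` (listed with `object` dtype in step 5) does not appear above. The following is a minimal sketch on a small made-up frame, not the real dataset, showing how `include="object"` surfaces text-typed columns; the values and the name `toy` are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with a numeric-looking column stored as text
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": pd.Series(["29.85", "1889.5", " "], dtype=object),
})

# Default behaviour: only numeric columns are summarized
numeric_summary = toy.describe()

# Object columns get count / unique / top / freq instead of mean / std
object_summary = toy.describe(include="object")

print(list(numeric_summary.columns))
print(object_summary)
```

A numeric field showing up in the object summary is a hint that it contains values that could not be parsed as numbers.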


#### 4. Handle Missing Values

In [9]:
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

No missing values: `isnull()` reports zero NaN entries in every column.
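
Note that `isnull()` counts only true `NaN`/`None` entries. The categorical-column listing in step 5 shows `TotalCharges` with `object` dtype, which suggests it holds at least some non-numeric strings that `isnull()` would not flag. Below is a hedged sketch of one way to surface and handle such values. It uses a made-up frame rather than the real data, and both the blank string and the median fill are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): one TotalCharges entry is a blank string
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "TotalCharges": pd.Series(["29.85", "1889.5", " "], dtype=object),
})

# Blank strings are not NaN, so they are invisible to isnull()
missing_before = int(toy.isnull().sum().sum())

# Coerce to numbers; anything that cannot be parsed becomes NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isnull().sum())

# One possible policy: fill with the column median (dropping the rows is another)
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

print(missing_before, n_missing)
```

Whether to fill or drop such rows depends on how many there are and what they represent, so the count is worth inspecting before choosing.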

#### 5. Encode Categorical Variables

In [10]:
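# Identify categorical (object-dtype) features
cat_cols = df.select_dtypes(include=["object"]).columns
print("categorical Columns:", cat_cols)

# Convert Yes/No to 1/0 across the whole frame.
# Columns with a third value (e.g. "No phone service") stay object dtype with mixed values.
df.replace({"Yes": 1, "No": 0}, inplace=True)

# Drop customerID if it exists (an identifier, not useful for prediction)
if 'customerID' in df.columns:
    df = df.drop('customerID', axis=1)

# One-hot encode the remaining object-dtype columns.
# Note: TotalCharges is object dtype here (see the listing below), so it is
# one-hot encoded as well, producing an indicator column per distinct value.
cat_cols = df.select_dtypes(include=['object']).columns
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)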

categorical Columns: Index(['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges',
       'Churn'],
      dtype='object')




In [11]:
df.head()

|   | SeniorCitizen | Partner | Dependents | tenure | PhoneService | PaperlessBilling | MonthlyCharges | Churn | gender_Male | MultipleLines_1 | ... | TotalCharges_995.35 | TotalCharges_996.45 | TotalCharges_996.85 | TotalCharges_996.95 | TotalCharges_997.65 | TotalCharges_997.75 | TotalCharges_998.1 | TotalCharges_999.45 | TotalCharges_999.8 | TotalCharges_999.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 29.85 | 0 | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 0 | 0 | 0 | 34 | 1 | 0 | 56.95 | 0 | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 0 | 0 | 0 | 2 | 1 | 1 | 53.85 | 1 | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 0 | 0 | 0 | 45 | 0 | 0 | 42.3 | 0 | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 0 | 0 | 0 | 2 | 1 | 1 | 70.7 | 1 | False | False | ... | False | False | False | False | False | False | False | False | False | False |
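
The column names in this output (`MultipleLines_1`, `TotalCharges_995.35`, ...) follow from the order of operations: the global Yes/No replacement leaves three-valued columns with mixed types, and `TotalCharges` is still text when `get_dummies` runs. The sketch below shows an alternative ordering on a toy frame: coerce numeric text first, map strictly binary columns, then one-hot encode the rest. The values are made up, and the `binary_cols` and `multi_cols` lists are assumptions about which fields are binary, so this is one possible approach rather than the notebook's method.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same kinds of columns as the dataset
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "MultipleLines": ["No phone service", "No", "Yes"],
    "TotalCharges": pd.Series(["29.85", "1889.5", "108.15"], dtype=object),
    "Churn": ["No", "No", "Yes"],
})

# 1. Turn numeric text into real numbers before any encoding
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# 2. Map only the strictly binary Yes/No columns to 1/0
binary_cols = ["Partner", "Churn"]          # assumed binary fields
for col in binary_cols:
    toy[col] = toy[col].map({"Yes": 1, "No": 0})

# 3. One-hot encode the columns with more than two categories
multi_cols = ["MultipleLines"]              # assumed multi-category fields
encoded = pd.get_dummies(toy, columns=multi_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

With this ordering `TotalCharges` stays a single numeric feature, and the dummy columns keep readable category names such as `MultipleLines_Yes`.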


In [12]:
# Feature and target split
X = df.drop("Churn", axis=1)   # Features
y = df["Churn"]                # Target

#### 6. Split Data into Training and Testing sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
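
Because `stratify=y` is passed, the churn rate should be roughly the same in both subsets. `StandardScaler` is imported in step 1 but not used in this task. If scaling is added later, fitting it on the training split only avoids leaking test-set statistics. Here is a minimal sketch of both checks with synthetic data; the arrays and the `_demo` names are generated for illustration and are not the churn dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 rows, 2 features, 25% positive class
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_demo = np.array([1] * 25 + [0] * 75)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Stratification check: class proportions should match across the splits
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

The same pattern would apply to `X_train` and `X_test` above, restricted to the continuous columns such as `tenure` and `MonthlyCharges`.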