# **Project: Customer Churn Analysis and Prediction**
## **Internship Task - SaiKet Systems**
**Developed by: Akhila Vaidya**

**1. Introduction**

Customer churn occurs when customers stop doing business with a company. In the telecommunications sector, where competition is high, identifying "at-risk" customers is vital for maintaining profitability.

**Project Objective**

The goal of this project is to analyze the Telco Customer Churn dataset to:

1. Identify key factors that lead to customer turnover.

2. Visualize demographic and service-related trends.

3. Build a machine learning model to predict future churn.

4. Propose data-driven retention strategies.

**Task 1: Data Cleaning and Preprocessing**

**Objective**

Raw data is rarely perfect. In this step, I prepare the dataset by handling missing values and encoding categorical features, so that every column used for modeling is numeric.

In [None]:
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from google.colab import files

# Step 1: Upload the file from your local machine
print("Please upload 'Telco_Customer_Churn_Dataset .csv'")
uploaded = files.upload()

# Load the dataset from the uploaded bytes, so the read still works
# when Colab renames the saved file (e.g. appends " (1)")
filename = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[filename]))

# 1. Handle missing values
# 'TotalCharges' contains blank strings that prevent it from being numeric;
# errors='coerce' turns them into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Fill missing TotalCharges with the median (dropping those rows is an alternative)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# 2. Categorical encoding
# Convert 'Churn' (target) to binary 0 and 1
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Identify categorical columns (excluding customerID)
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
if 'customerID' in categorical_cols:
    categorical_cols.remove('customerID')

# One-hot encode the categorical columns, dropping the first level of each.
# drop_first=True is inferred from the column names in the output below
# (e.g. 'gender_Male', 'Contract_One year').
df_cleaned = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

Please upload 'Telco_Customer_Churn_Dataset .csv'


Saving Telco_Customer_Churn_Dataset .csv to Telco_Customer_Churn_Dataset  (1).csv
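
The coercion-and-fill step can be checked in isolation. The following is a minimal sketch on a made-up four-row column (not the Telco data), assuming the non-numeric entries are blank strings. It counts how many values `pd.to_numeric` turns into `NaN` before the median fill replaces them.

```python
import pandas as pd

# Toy stand-in for the 'TotalCharges' column; the blank string mimics
# an entry that cannot be parsed as a number (values are made up).
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})

# errors='coerce' converts unparseable entries to NaN instead of raising
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# median() skips NaN by default, so it is computed from the valid rows only
median_value = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_value)

print(f"Coerced to NaN: {n_missing}, filled with median: {median_value}")
```

Running the same `isna().sum()` count on the real column before filling shows how many rows the imputation touches, which helps when deciding between filling and dropping.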


In [None]:
print(f"Cleaned Dataset Shape: {df_cleaned.shape}")
display(df_cleaned.head())

Cleaned Dataset Shape: (7043, 32)
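
The column names in the output (`gender_Male`, `Contract_One year`, and so on) are consistent with one-hot encoding in which the first level of each category is dropped. Under that assumption, a category with k levels contributes k - 1 columns. A minimal sketch on a toy frame with made-up rows shows how the column count grows:

```python
import pandas as pd

# Toy frame with made-up rows; column names mirror the Telco dataset
toy = pd.DataFrame({
    "customerID": ["A", "B", "C"],
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Two year"],  # 3 levels
    "Partner": ["Yes", "No", "No"],                          # 2 levels
})
categorical_cols = ["Contract", "Partner"]

# drop_first=True removes the first (alphabetically sorted) level of each
# category, so k levels become k - 1 indicator columns
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

print(encoded.columns.tolist())
print(encoded.shape)
```

Here the two untouched columns plus (3 - 1) + (2 - 1) indicator columns give five columns in total. Dropping one level per category avoids perfectly collinear indicators, which matters mainly for linear models.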


Unnamed: 0,customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,7590-VHVEG,0,1,29.85,29.85,0,False,True,False,False,...,False,False,False,False,False,False,True,False,True,False
1,5575-GNVDE,0,34,56.95,1889.5,0,True,False,False,True,...,False,False,False,False,True,False,False,False,False,True
2,3668-QPYBK,0,2,53.85,108.15,1,True,False,False,True,...,False,False,False,False,False,False,True,False,False,True
3,7795-CFOCW,0,45,42.3,1840.75,0,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
4,9237-HQITU,0,2,70.7,151.65,1,False,False,False,True,...,False,False,False,False,False,False,True,False,True,False


In [None]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 32 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   customerID                             7043 non-null   object 
 1   SeniorCitizen                          7043 non-null   int64  
 2   tenure                                 7043 non-null   int64  
 3   MonthlyCharges                         7043 non-null   float64
 4   TotalCharges                           7043 non-null   float64
 5   Churn                                  7043 non-null   int64  
 6   gender_Male                            7043 non-null   bool   
 7   Partner_Yes                            7043 non-null   bool   
 8   Dependents_Yes                         7043 non-null   bool   
 9   PhoneService_Yes                       7043 non-null   bool   
 10  MultipleLines_No phone service         7043 non-null   bool   
 11  Mult