# **SETTING UP THE ENVIRONMENT**

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from google.colab import files

# **0 : UPLOADING THE DATASET**

In [2]:
uploaded = files.upload()

Saving Telecom_Churn.csv to Telecom_Churn.csv


# **1 : LOADING THE DATASET**

In [3]:
df = pd.read_csv('/content/Telecom_Churn.csv')

print("shape :",df.shape)

df.sample(5)

shape : (7043, 21)


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
2823,4884-LEVMQ,Male,0,Yes,No,39,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),20.45,790.0,No
1972,7892-QVYKW,Female,0,Yes,Yes,23,Yes,Yes,Fiber optic,No,...,No,No,No,Yes,Month-to-month,No,Electronic check,85.6,1868.4,No
3859,1732-FEKLD,Female,0,No,No,54,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,No,One year,Yes,Bank transfer (automatic),94.75,5121.75,No
4730,7813-ZGGAW,Male,1,No,No,31,Yes,Yes,Fiber optic,No,...,Yes,Yes,Yes,No,Month-to-month,Yes,Bank transfer (automatic),96.6,2877.95,No
637,2077-DDHJK,Female,0,Yes,No,68,Yes,Yes,DSL,Yes,...,Yes,Yes,No,No,Two year,No,Credit card (automatic),70.9,4911.35,No


# **2 : INSPECTING THE DATASET**

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


# **3 : HANDLING MISSING VALUES**

*`(a) : checking for missing values`*

In [6]:
df.isnull().sum()

Unnamed: 0,0
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0


*`NO MISSING VALUES ARE REPORTED AT THIS STAGE`*

*`(b) : converting TotalCharges from object to numeric`*

In [7]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [8]:
df.isnull().sum()

Unnamed: 0,0
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0


*`(c) : dropping the rows with missing values`*

In [9]:
df = df.dropna(subset=["TotalCharges"])

*`(d) : checking the dataset after removing the missing values`*

In [10]:
df.isnull().sum()

Unnamed: 0,0
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0


# **summary**

### Step 3 - Handling Missing Values

During preprocessing, 11 missing values were found in the `TotalCharges` column.  
They appeared only after `TotalCharges` was converted from `object` to numeric: the column contained blank strings, which `pd.to_numeric(..., errors='coerce')` turns into `NaN`.  
The 11 rows with missing `TotalCharges` were dropped, leaving 7032 rows for further analysis.
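
The behaviour described above can be reproduced on a tiny example. This is a minimal sketch on made-up values (the `toy` frame is an assumption for illustration, not part of the dataset); it shows why the blanks are invisible to `isnull()` until the column is converted.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (values are made up for illustration).
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", " "]})

# Blank strings are still strings, so isnull() does not flag them.
missing_before = int(toy["TotalCharges"].isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(toy["TotalCharges"].isnull().sum())

# Drop the rows that could not be parsed.
toy_clean = toy.dropna(subset=["TotalCharges"])

print(missing_before, missing_after, toy_clean.shape)
```

Counting the missing values before and after the conversion, as done here, is a cheap way to see how many rows `errors="coerce"` affected before deciding whether to drop or impute them.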


# **4 : Encoding categorical variables**

In [11]:
df = df.drop("customerID", axis=1)

*`encoding binary categorical columns`*

In [12]:
binary_cols = ["gender", "Partner", "Dependents", "PhoneService",
               "PaperlessBilling", "Churn"]

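for col in binary_cols:
    # Yes/No columns map to 1/0; other binary columns (gender: Male/Female) are label-encoded
    if set(df[col].unique()) <= {"No", "Yes"}:
        df[col] = df[col].map({"No": 0, "Yes": 1})
    else:
        df[col] = LabelEncoder().fit_transform(df[col])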
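
One thing to watch with dictionary mapping: `Series.map` returns `NaN` for every value that is not a key of the dictionary, without raising an error. The sketch below illustrates this on made-up rows; `encode_binary` is a hypothetical helper written for this example, and it uses pandas category codes as a stand-in for `LabelEncoder`.

```python
import pandas as pd

# Toy rows; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "gender": ["Male", "Female", "Male"],
})

yes_no = {"No": 0, "Yes": 1}

def encode_binary(series: pd.Series) -> pd.Series:
    """Hypothetical helper: map Yes/No to 1/0, else fall back to category codes."""
    if set(series.unique()) <= set(yes_no):
        return series.map(yes_no)
    # Categories are sorted, so Female -> 0 and Male -> 1.
    return series.astype("category").cat.codes

# A bare .map finds no matching keys for Male/Female, so every value becomes NaN.
naive = toy["gender"].map(yes_no)
print(naive.isnull().all())

encoded = toy.apply(encode_binary)
print(encoded)
```

Checking the actual values of a column, rather than only how many distinct values it has, avoids this silent loss.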

*`encoding multi-class columns`*

In [13]:
multi_class_cols = ["InternetService", "Contract", "PaymentMethod",
                    "MultipleLines", "OnlineSecurity", "OnlineBackup",
                    "DeviceProtection", "TechSupport", "StreamingTV",
                    "StreamingMovies"]

df = pd.get_dummies(df, columns=multi_class_cols, dtype=int)

In [14]:
df = df.drop(columns=['gender'])

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   SeniorCitizen                            7032 non-null   int64  
 1   Partner                                  7032 non-null   int64  
 2   Dependents                               7032 non-null   int64  
 3   tenure                                   7032 non-null   int64  
 4   PhoneService                             7032 non-null   int64  
 5   PaperlessBilling                         7032 non-null   int64  
 6   MonthlyCharges                           7032 non-null   float64
 7   TotalCharges                             7032 non-null   float64
 8   Churn                                    7032 non-null   int64  
 9   InternetService_DSL                      7032 non-null   int64  
 10  InternetService_Fiber optic              7032 non-nul

In [16]:
df.shape

(7032, 40)

In [17]:
df.sample(5)

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,InternetService_DSL,...,DeviceProtection_Yes,TechSupport_No,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_No internet service,StreamingMovies_Yes
6584,0,1,1,50,1,0,20.55,1070.25,0,0,...,0,0,1,0,0,1,0,0,1,0
1202,0,1,0,53,1,1,94.25,4867.95,1,0,...,0,1,0,0,0,0,1,0,0,1
6082,0,1,0,59,1,0,101.1,6039.9,1,0,...,0,1,0,0,0,0,1,0,0,1
4960,1,0,0,50,1,1,95.7,4816.7,1,0,...,0,1,0,0,0,0,1,0,0,1
1469,0,0,0,37,1,0,98.8,3475.55,1,0,...,0,0,0,1,0,0,1,0,0,1


# **Step 4: summary**

--> In this step, all categorical variables in the dataset were transformed into numerical form so that machine learning algorithms can process them. The following actions were performed:

1) Dropped customerID column

The customerID column was only an identifier and had no predictive value.

It was removed to avoid adding noise to the model.

2) Binary categorical columns encoded as 0/1 (Yes/No mapping, with LabelEncoder as the fallback)

Columns with only two categories (Yes/No or Male/Female) were converted into 0/1 numeric format.

Examples: Partner, Dependents, PhoneService, PaperlessBilling, Churn.

3) Multi-class categorical columns encoded with One-Hot Encoding (pd.get_dummies)

Columns with more than two categories were expanded into multiple binary (dummy) columns.

Examples: InternetService, Contract, PaymentMethod, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies.

For each, new columns like InternetService_DSL, InternetService_Fiber optic, InternetService_No were created.

4) Gender column dropped

After label encoding, it was observed that gender does not provide significant predictive value for churn.

To simplify the dataset and reduce noise, gender was removed.

5) Final dataset structure

The dataset now contains 40 columns.

All features are numerical (int64 or float64).

No missing values remain.

The dataset is now fully preprocessed.
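
To make the one-hot step in this summary concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration). It also shows how the resulting column count can be predicted from the number of distinct values per column.

```python
import pandas as pd

# Toy frame with one numeric and two multi-class columns (made-up rows).
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

multi_class_cols = ["InternetService", "Contract"]
encoded = pd.get_dummies(toy, columns=multi_class_cols, dtype=int)

# Each encoded column is replaced by one indicator column per distinct value.
expected_width = (
    toy.shape[1]
    - len(multi_class_cols)
    + sum(toy[c].nunique() for c in multi_class_cols)
)

print(list(encoded.columns))
print(encoded.shape[1], expected_width)
```

The same arithmetic is consistent with the 40 columns reported for the real dataset, assuming PaymentMethod has four categories and the other nine multi-class columns have three each: 21 columns, minus customerID, minus the 10 multi-class columns, plus 31 indicator columns, minus gender.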

# **5 : SAVING THE PREPROCESSED DATASET**

In [18]:
df.to_csv('telco_processed.csv', index=False)

In [19]:
files.download('telco_processed.csv')


# **6: CHECKING THE PREPROCESSED DATASET**

In [20]:
df.head()

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,InternetService_DSL,...,DeviceProtection_Yes,TechSupport_No,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_No internet service,StreamingMovies_Yes
0,0,1,0,1,0,1,29.85,29.85,0,1,...,0,1,0,0,1,0,0,1,0,0
1,0,0,0,34,1,0,56.95,1889.5,0,1,...,1,1,0,0,1,0,0,1,0,0
2,0,0,0,2,1,1,53.85,108.15,1,1,...,0,1,0,0,1,0,0,1,0,0
3,0,0,0,45,0,0,42.3,1840.75,0,1,...,1,0,0,1,1,0,0,1,0,0
4,0,0,0,2,1,1,70.7,151.65,1,0,...,0,1,0,0,1,0,0,1,0,0


In [21]:
df.shape

(7032, 40)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   SeniorCitizen                            7032 non-null   int64  
 1   Partner                                  7032 non-null   int64  
 2   Dependents                               7032 non-null   int64  
 3   tenure                                   7032 non-null   int64  
 4   PhoneService                             7032 non-null   int64  
 5   PaperlessBilling                         7032 non-null   int64  
 6   MonthlyCharges                           7032 non-null   float64
 7   TotalCharges                             7032 non-null   float64
 8   Churn                                    7032 non-null   int64  
 9   InternetService_DSL                      7032 non-null   int64  
 10  InternetService_Fiber optic              7032 non-nul

# ***FINAL SUMMARY***


**-> Step 1: Load Dataset**

The Telco Customer Churn dataset was loaded into a pandas DataFrame.

The dataset contains 7043 rows and 21 columns.

**-> Step 2: Inspect Dataset**

Used df.info() and df.head() to review the structure.

Found a mix of categorical (object) and numerical (int64, float64) columns.

**-> Key observations:**

customerID is an identifier (not useful for modeling).

Most service-related columns (InternetService, Contract, PaymentMethod, etc.) are categorical.

SeniorCitizen, tenure, MonthlyCharges, and TotalCharges are numerical (TotalCharges is stored as object until it is converted in Step 3).
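
The split between numerical and categorical columns can also be listed programmatically instead of being read off `df.info()`. This is a minimal sketch on a made-up frame that borrows a few of the dataset's column names; the rows are assumptions for illustration.

```python
import pandas as pd

# Toy frame reusing some column names from the dataset (rows are made up).
toy = pd.DataFrame({
    "SeniorCitizen": [0, 1],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": ["29.85", "1889.5"],  # stored as text until converted
    "Contract": ["Month-to-month", "One year"],
})

# "number" selects integer and float columns; everything else is treated as text here.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```

A numeric-looking column that shows up in the text list, as `TotalCharges` does here, is a hint that it needs conversion before modeling.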

**-> Step 3: Handle Missing Values**

Initially, df.isnull().sum() showed no missing values.

After converting TotalCharges from object to float, 11 missing values appeared because some rows had blanks in that column.

The 11 affected rows were then dropped.

Dataset shape after this step: 7032 rows × 21 columns.

**-> Step 4: Encode Categorical Variables**

Dropped customerID since it is only an identifier.

Binary categorical columns (gender, Partner, Dependents, PhoneService, PaperlessBilling, Churn) were converted into 0/1 using LabelEncoder and mapping.

Multi-class categorical columns (InternetService, Contract, PaymentMethod, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies) were encoded using One-Hot Encoding (pd.get_dummies).

The gender column was later dropped as it provided minimal predictive power.
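
One simple way to examine a claim like this is to compare the churn rate per group. The sketch below uses made-up rows, so its numbers say nothing about the real dataset; it only illustrates the kind of check that could be run before dropping a column.

```python
import pandas as pd

# Made-up toy rows, NOT results from the Telco dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female"],
    "Churn": [1, 0, 1, 0],
})

# The mean of a 0/1 target within a group is that group's churn rate.
churn_rate = toy.groupby("gender")["Churn"].mean()

# A small gap between groups suggests the feature carries little signal on its own.
gap = float(churn_rate.max() - churn_rate.min())

print(churn_rate)
print(gap)
```

On the real data this would have to run before `gender` is dropped, and a per-group comparison like this does not capture interactions with other features.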

After encoding:

Dataset shape: 7032 rows × 40 columns.

**All categorical variables are now numerical (int64)**.

Final dataset is fully clean and ready for modeling.
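
The "ready for modeling" statement can be backed by a quick automated check. The following is a minimal sketch; `check_model_ready` is a hypothetical helper written for this example, and the two small frames are made up.

```python
import numpy as np
import pandas as pd

def check_model_ready(frame: pd.DataFrame) -> list:
    """Hypothetical helper: list the problems that would block most estimators."""
    problems = []
    # Any column that is not integer or float still needs encoding.
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    # Any column containing NaN still needs dropping or imputing.
    with_missing = frame.columns[frame.isnull().any()].tolist()
    if with_missing:
        problems.append(f"columns with missing values: {with_missing}")
    return problems

good = pd.DataFrame({"tenure": [1, 2], "Churn": [0, 1]})
bad = pd.DataFrame({"tenure": [1, np.nan], "Contract": ["One year", "Two year"]})

print(check_model_ready(good))
print(check_model_ready(bad))
```

Calling `check_model_ready(df)` on the processed frame should return an empty list if the summary above holds.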