# Customer Churn Data Analysis

## Objective :
The objective of this project is to analyze customer data to understand factors that may lead to customer churn.
Basic data analysis and feature engineering techniques are applied using Python, NumPy, and Pandas.

## Libraries Used :
- Pandas – for data manipulation and analysis  
- NumPy – for numerical computations  

In [1]:
import pandas as pd
import numpy as np

## Dataset Loading :
In this step, the customer churn dataset is loaded into the Jupyter Notebook using Pandas.

In [2]:
df = pd.read_csv(r"D:\WA_Fn-UseC_-Telco-Customer-Churn.csv")  # raw string keeps the backslash literal
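
A slightly more defensive way to load the file is sketched below. The helper name `load_churn_csv` is hypothetical and not part of the original notebook; it only wraps `pd.read_csv` with a path check so that a missing file fails with a clear message.

```python
from pathlib import Path

import pandas as pd


def load_churn_csv(path):
    """Read the churn CSV at `path`, failing early if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Hypothetical error message; adjust to taste.
        raise FileNotFoundError(f"Churn dataset not found: {csv_path}")
    return pd.read_csv(csv_path)
```

It could be called as `df = load_churn_csv(r"D:\WA_Fn-UseC_-Telco-Customer-Churn.csv")`, where the `r` prefix stops Python from treating the backslash as the start of an escape sequence.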

## Viewing the Dataset :
The head() function is used to display the first five rows of the dataset.
This helps in understanding the structure of the data.

In [3]:
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## Dataset Understanding :
In this step, basic functions are used to understand the size, structure, and columns of the dataset.

In [4]:
df.shape

(7043, 21)

In [5]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


## Dataset Preview :
The head() and tail() functions are used to view the first and last few rows of the dataset.

In [7]:
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [8]:
df.tail()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7038 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | ... | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.8 | 1990.5 | No |
| 7039 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.2 | 7362.9 | No |
| 7040 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.6 | 346.45 | No |
| 7041 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 74.4 | 306.6 | Yes |
| 7042 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | ... | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.5 | No |


## Statistical Summary :
The describe() function provides a statistical summary of numerical columns in the dataset.

In [9]:
df.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
| --- | --- | --- | --- |
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |


## Checking Missing Values :
This step checks for missing (null) values present in each column of the dataset.

In [10]:
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
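
Note that `isnull()` only counts true nulls (`NaN` or `None`). The `describe()` summary above omits `TotalCharges`, which suggests pandas read that column as text; if so, any blank strings in it would not show up in this check. The sketch below illustrates the idea on a hypothetical three-row sample (the blank value is invented for illustration), using `pd.to_numeric` with `errors="coerce"` to expose such values as `NaN`.

```python
import pandas as pd

# Hypothetical sample: the middle TotalCharges value is a blank string.
sample = pd.DataFrame({
    "customerID": ["A", "B", "C"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# isnull() treats the blank string as a valid value, so nothing is reported.
missing_before = int(sample["TotalCharges"].isnull().sum())

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")
missing_after = int(sample["TotalCharges"].isnull().sum())

print(missing_before, missing_after)
```

Whether the real dataset contains such blanks would need to be confirmed by running the same conversion on `df['TotalCharges']`.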

## Checking Duplicate Records :
Duplicate rows in the dataset are identified to ensure data quality.

In [11]:
df.duplicated().sum()

np.int64(0)
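
`duplicated()` compares entire rows by default. As a complementary check, one might also look for repeated customer IDs, since two rows can share an ID while differing in another column. The sketch below uses a hypothetical sample in which the third row reuses the first customer's ID with a different charge; it is an illustration, not data from the dataset.

```python
import pandas as pd

# Hypothetical sample: row 2 repeats the customerID of row 0 but not its charge.
sample = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "7590-VHVEG"],
    "MonthlyCharges": [29.85, 56.95, 30.10],
})

# Full-row comparison: no two rows are identical in every column.
full_row_dupes = int(sample.duplicated().sum())

# Key-based comparison: the second occurrence of an ID is flagged.
id_dupes = int(sample.duplicated(subset=["customerID"]).sum())

print(full_row_dupes, id_dupes)
```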

## Removing Duplicate Records :
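No duplicates were found in the previous step, so drop_duplicates() leaves the dataset unchanged here; the call is kept as a safeguard so that duplicate rows cannot distort the analysis.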

In [12]:
df = df.drop_duplicates()

## Selecting Specific Columns :
Specific columns are selected from the dataset for focused analysis.

In [13]:
df[['gender', 'SeniorCitizen', 'Churn']]

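| | gender | SeniorCitizen | Churn |
| --- | --- | --- | --- |
| 0 | Female | 0 | No |
| 1 | Male | 0 | No |
| 2 | Male | 0 | Yes |
| 3 | Male | 0 | No |
| 4 | Female | 0 | Yes |
| ... | ... | ... | ... |
| 7038 | Male | 0 | No |
| 7039 | Female | 0 | No |
| 7040 | Female | 0 | No |
| 7041 | Male | 1 | Yes |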


## Data Selection Using Condition :
This step uses a boolean condition to keep only the rows where Churn is 'Yes', that is, the customers who left.

In [14]:
df[df['Churn'] == 'Yes']

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | Female | 0 | No | No | 8 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 8 | 7892-POOKP | Female | 0 | Yes | No | 28 | Yes | Yes | Fiber optic | No | ... | Yes | Yes | Yes | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 13 | 0280-XJGEX | Male | 0 | No | No | 49 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | Month-to-month | Yes | Bank transfer (automatic) | 103.70 | 5036.3 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7021 | 1699-HPSBG | Male | 0 | No | No | 12 | Yes | No | DSL | No | ... | No | Yes | Yes | No | One year | Yes | Electronic check | 59.80 | 727.8 | Yes |
| 7026 | 8775-CEBBJ | Female | 0 | No | No | 9 | Yes | No | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Bank transfer (automatic) | 44.20 | 403.35 | Yes |
| 7032 | 6894-LFHLY | Male | 1 | No | No | 1 | Yes | Yes | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 75.75 | 75.75 | Yes |
| 7034 | 0639-TSIQW | Female | 0 | No | No | 67 | Yes | Yes | Fiber optic | Yes | ... | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 102.95 | 6886.25 | Yes |
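
Conditions can also be combined. The sketch below is an illustration rather than part of the original notebook: it rebuilds the first five rows shown by `head()` (four columns only) and selects churned customers on a month-to-month contract, using `&` between parenthesised conditions and `.loc` to pick columns in the same step.

```python
import pandas as pd

# First five rows of the dataset, restricted to the columns needed here.
sample = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW", "9237-HQITU"],
    "tenure": [1, 34, 2, 45, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# Each condition needs its own parentheses because & binds tighter than ==.
mask = (sample["Churn"] == "Yes") & (sample["Contract"] == "Month-to-month")

# .loc applies the row filter and the column selection together.
churned_monthly = sample.loc[mask, ["customerID", "tenure"]]
print(churned_monthly)
```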


## Feature Engineering :
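Feature engineering creates new features from existing data to improve analysis.
Here, a new Tenure_Group column labels customers with a tenure above 12 as 'Long Term' and all others as 'Short Term'.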

In [15]:
df['Tenure_Group'] = np.where(df['tenure'] > 12, 'Long Term', 'Short Term')

In [16]:
df[['tenure', 'Tenure_Group']].head()

| | tenure | Tenure_Group |
| --- | --- | --- |
| 0 | 1 | Short Term |
| 1 | 34 | Long Term |
| 2 | 2 | Short Term |
| 3 | 45 | Long Term |
| 4 | 2 | Short Term |
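
If more than two groups are wanted, `pd.cut` is one possible alternative to `np.where`. The sketch below is a variation, not the notebook's method: the band edges and labels are assumptions chosen for illustration, with the upper edge of 72 taken from the maximum tenure in the statistical summary.

```python
import pandas as pd

# Tenure values of the first five customers.
tenure = pd.Series([1, 34, 2, 45, 2], name="tenure")

# Assumed band edges; intervals are right-inclusive, so -1 keeps tenure 0 in the first band.
bins = [-1, 12, 24, 72]
labels = ["0-12", "13-24", "25-72"]

tenure_band = pd.cut(tenure, bins=bins, labels=labels)
print(tenure_band.tolist())
```

The first band covers the same customers as 'Short Term' above, while 'Long Term' is split into two finer bands.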


## Categorical Analysis :
Categorical analysis helps in understanding how categorical variables are distributed in the dataset.

In [17]:
df['Churn'].value_counts()

Churn
No     5174
Yes    1869
Name: count, dtype: int64
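
Raw counts are easier to compare as proportions. The sketch below rebuilds a Series with the same counts as the output above (5174 'No', 1869 'Yes') and passes `normalize=True` to `value_counts()`.

```python
import pandas as pd

# Reconstruct the Churn column from the counts reported above.
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

# normalize=True returns each category's share of the total instead of a count.
shares = churn.value_counts(normalize=True)
print(shares.round(3))
```

On the full dataset the equivalent call would be `df['Churn'].value_counts(normalize=True)`.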

In [18]:
df['gender'].value_counts()

gender
Male      3555
Female    3488
Name: count, dtype: int64

## Churn Analysis Based on Tenure Group :
This analysis shows how customer churn varies between short-term and long-term customers.

In [19]:
df.groupby('Tenure_Group')['Churn'].value_counts()

Tenure_Group  Churn
Long Term     No       4025
              Yes       832
Short Term    No       1149
              Yes      1037
Name: count, dtype: int64
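
Because the two groups differ in size, churn rates are more comparable than counts. The sketch below starts from the counts in the output above and divides the 'Yes' count by each group's total; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the grouped value_counts() output above.
counts = pd.DataFrame(
    {"No": [4025, 1149], "Yes": [832, 1037]},
    index=pd.Index(["Long Term", "Short Term"], name="Tenure_Group"),
)

# Churn rate = churned customers / all customers in the group.
churn_rate = counts["Yes"] / counts.sum(axis=1)
print(churn_rate.round(3))
```

On the full dataset, `df.groupby('Tenure_Group')['Churn'].value_counts(normalize=True)` should give the same within-group proportions directly.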

## Mean Analysis :
The mean() function is used to find the average value of numerical columns.

In [20]:
df[['tenure', 'MonthlyCharges']].mean()

tenure            32.371149
MonthlyCharges    64.761692
dtype: float64

## Median Analysis :
The median() function shows the middle value of numerical columns.

In [21]:
df[['tenure', 'MonthlyCharges']].median()

tenure            29.00
MonthlyCharges    70.35
dtype: float64
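
The overall mean and median can also be split by churn status to see whether churned customers differ from the rest. The sketch below is an illustration on the first five rows of the dataset only, so its numbers say nothing about the full data; it uses `groupby` with `agg` to compute both statistics at once.

```python
import pandas as pd

# First five rows of the dataset, restricted to the columns needed here.
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# One row per Churn value, with a (column, statistic) pair for each result column.
summary = sample.groupby("Churn")[["tenure", "MonthlyCharges"]].agg(["mean", "median"])
print(summary)
```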

## Mode Analysis :
The mode() function identifies the most frequently occurring values.

In [22]:
df[['gender', 'Churn']].mode()

| | gender | Churn |
| --- | --- | --- |
| 0 | Male | No |


## Conclusion :
In this project, customer churn data was analyzed using Python, NumPy, and Pandas.
Basic data analysis techniques such as data inspection, handling duplicates, and conditional selection were applied.
Feature engineering was performed by creating a new tenure-based feature.
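The analysis shows that 1869 of 7043 customers (about 26.5%) churned, and that churn was far more common among short-term customers (1037 of 2186, about 47%) than among long-term customers (832 of 4857, about 17%).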
This project demonstrates the practical use of Python libraries for data analysis.