# Data Uploading



In [10]:
import pandas as pd

# Load dataset
data = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Display first 5 rows
print(data.head())

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

**Description**
pandas is imported for handling tabular data.

pd.read_csv() reads the dataset from a CSV file into a DataFrame.

head() shows the first 5 rows to get an initial look at the dataset.


**Interpretation**
Look at the columns and values.

- What type of data does each column hold (numerical, categorical, text)?

The dataset contains a mix of numerical, categorical, and text data. The column customerID is a text identifier that uniquely represents each customer. Several columns hold numerical values, such as SeniorCitizen (encoded as 0 or 1), tenure (the number of months a customer has stayed), MonthlyCharges, and TotalCharges. Most of the remaining fields are categorical, including demographic information like gender, service-related details such as PhoneService, InternetService, StreamingTV, and TechSupport, as well as account details like Contract, PaperlessBilling, PaymentMethod, and the target column Churn. These categorical fields mostly contain values such as "Yes," "No," or specific service types.
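
One way to check these column types programmatically is `select_dtypes`. The sketch below is only an illustration: `sample` is a hypothetical frame with made-up rows that mimic a few of the columns, not the real file.

```python
import pandas as pd

# Hypothetical rows that mimic a few columns of the dataset (not real records)
sample = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"],
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 0, 1],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Columns pandas stores as numbers
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (strings in this sample)
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Non-numeric:", text_cols)

# A small number of distinct values hints that a column is categorical
print(sample[text_cols].nunique())
```

On the real data, the same calls could be applied to `data` instead of `sample`.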


- Are there any missing or unusual values?

After examining the data, a few unusual values appear. The TotalCharges column sometimes contains blank entries, usually for customers with very short tenures, which causes the column to be read as text instead of as a number. The SeniorCitizen column is stored as 0 and 1, which makes it numerical in format but categorical in meaning, as it simply represents a "Yes" or "No" condition. Also, several service-related columns, like MultipleLines or OnlineSecurity, include values such as "No phone service" or "No internet service." While not missing values, these function as special categories that may need to be recoded depending on the analysis. Lastly, customerID should remain unique for each entry, so any duplicate IDs in the dataset would be unusual.
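
The checks described above can be sketched as follows. The rows in `sample` are hypothetical (not real records) and were made up to reproduce each quirk once. Folding "No internet service" into "No" is just one possible recoding, not necessarily the right one for every analysis.

```python
import pandas as pd

# Hypothetical rows (not real records) reproducing the quirks described above
sample = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC", "0003-CCCCC"],
    "tenure": [0, 34, 2, 2],
    "TotalCharges": [" ", "1889.5", "108.15", "108.15"],
    "OnlineSecurity": ["No internet service", "Yes", "No", "No"],
})

# Blank or whitespace-only strings are not NaN, so isna() would miss them
blank_total = sample["TotalCharges"].str.strip().eq("")
print("Blank TotalCharges:", blank_total.sum())

# Rows whose customer ID already appeared earlier
dup_ids = sample["customerID"].duplicated()
print("Duplicate IDs:", dup_ids.sum())

# One possible recoding: fold the special category into "No"
recoded = sample["OnlineSecurity"].replace("No internet service", "No")
print(recoded.value_counts())
```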

# Viewing Information About the Dataset


In [11]:
# Check dataset shape
print("Shape:", data.shape)
# Get column data types and missing values
data.info()

Shape: (7043, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 n

**Description**
.shape shows the number of rows and columns.

.info() displays column names, data types, and missing value counts.

**Interpretation**
- How many rows and columns are there?

The dataset contains 7,043 rows and 21 columns. Each row corresponds to an individual customer record, while the 21 columns represent different attributes related to demographics, services, account information, and churn status.
From the .info() output, we see that all columns are listed with their respective data types. There are 18 columns stored as object type (the categorical fields plus customerID and TotalCharges), two numerical columns stored as integers (SeniorCitizen and tenure), and one numerical column stored as a float (MonthlyCharges). The TotalCharges column, however, appears as an object instead of a numeric type, which tells us that it may contain non-numeric values, like spaces or formatting inconsistencies, even though it should logically be numerical.

- Are there columns with missing data that may need cleaning?

In terms of missing data, the .info() summary shows that all 21 columns have 7,043 non-null entries, meaning no values are recorded as missing. However, the TotalCharges column may contain blank strings that do not appear as nulls but still require cleaning. Aside from that, no other columns show missing entries, so the dataset appears to be complete, with the main cleaning task being the conversion of TotalCharges into a proper numerical format.
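
A minimal sketch of that conversion, assuming the blanks are whitespace-only strings: `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN so they can be counted. The series below holds hypothetical values, not real records.

```python
import pandas as pd

# Hypothetical values (not real records): one blank entry among numeric strings
total_charges = pd.Series([" ", "29.85", "1889.5"], name="TotalCharges")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
total_numeric = pd.to_numeric(total_charges, errors="coerce")

print(total_numeric.dtype)
print("Missing after conversion:", total_numeric.isna().sum())
```

After the conversion, the former blanks show up in `isna()` and can be dropped or filled, depending on the analysis.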


# Descriptive Statistics

In [12]:
# Summary statistics for numerical columns
print(data.describe())

       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000


**Description**
.describe() calculates common statistics such as mean, standard deviation, minimum, maximum, and quartiles for numeric columns.

**Interpretation**

- What is the average value for each numeric column?

About 16% of customers are senior citizens (mean = 0.16); because this column is encoded as 0 and 1, its mean is simply the share of seniors. The average tenure is approximately 32 months, with a wide range from 0 months (new customers) to 72 months (six years). For MonthlyCharges, the average is around $64.76, with values ranging from as low as $18.25 to as high as $118.75.

- Which column has the largest spread (standard deviation)?

The column with the largest standard deviation is MonthlyCharges (std = 30.09), showing that customers pay widely varying amounts depending on their services and plans. The tenure column also has a high spread (std = 24.56), reflecting a diverse customer base with both short-term and long-term subscribers. By contrast, SeniorCitizen has the smallest spread (std = 0.37), since most entries are 0 (non-senior).
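
Raw standard deviations are in different units (months versus dollars), so a unitless measure can complement the comparison. The sketch below computes the coefficient of variation (std divided by mean) from the figures in the describe() output. This ratio is an added illustration, not part of the original analysis. SeniorCitizen is left out because the ratio is not very meaningful for a 0/1 column.

```python
# Mean and standard deviation copied from the describe() output above
stats = {
    "tenure": (32.371149, 24.559481),
    "MonthlyCharges": (64.761692, 30.090047),
}

# Coefficient of variation = std / mean (unitless, so columns can be compared)
cv = {name: std / mean for name, (mean, std) in stats.items()}

# Print the columns from largest to smallest relative spread
for name, value in sorted(cv.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f}")
```

Relative to its mean, tenure varies more than MonthlyCharges, even though its raw standard deviation is smaller.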

- Are there any outliers (values much higher or lower than most others)?

No clear outliers appear. The minimum tenure of 0 months marks customers who have just joined, and the maximum MonthlyCharges of $118.75 sits at the top of the range but is still less than two standard deviations above the mean ($64.76 + 2 × $30.09 ≈ $124.94). Both values are within the expected service limits, so they are valid extreme cases rather than true anomalies (values beyond the expected range).
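
A common rule of thumb for outliers is the 1.5 × IQR rule, sketched below with the quartiles from the describe() output. The helper `iqr_fences` is a hypothetical name introduced for this illustration, and the 1.5 multiplier is a convention rather than something the dataset prescribes.

```python
# Quartiles, minimum and maximum copied from the describe() output above
summary = {
    "tenure": {"min": 0.0, "q1": 9.0, "q3": 55.0, "max": 72.0},
    "MonthlyCharges": {"min": 18.25, "q1": 35.5, "q3": 89.85, "max": 118.75},
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


has_outliers = {}
for name, s in summary.items():
    lower, upper = iqr_fences(s["q1"], s["q3"])
    # If both min and max lie inside the fences, no value can be flagged
    has_outliers[name] = s["min"] < lower or s["max"] > upper
    print(f"{name}: fences=({lower:.2f}, {upper:.2f}), outliers={has_outliers[name]}")
```

Under this rule, neither tenure nor MonthlyCharges has values outside the fences.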



# Simple Analysis – Value Counts

In [13]:
# Frequency counts for one column (MonthlyCharges is numeric, so each distinct amount is counted)
print(data['MonthlyCharges'].value_counts())

MonthlyCharges
20.05     61
19.85     45
19.95     44
19.90     44
20.00     43
          ..
56.85      1
101.70     1
48.40      1
108.35     1
72.00      1
Name: count, Length: 1585, dtype: int64


**Description**
.value_counts() shows how often each category appears.
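
For comparison, here is a sketch of `.value_counts()` on a categorical column. The `contract` series holds hypothetical values, not real records, and `normalize=True` returns shares instead of raw counts.

```python
import pandas as pd

# Hypothetical values (not real records) for a categorical column
contract = pd.Series(
    ["Month-to-month", "One year", "Month-to-month", "Two year", "Month-to-month"],
    name="Contract",
)

# Absolute counts, most frequent category first
counts = contract.value_counts()
# Relative frequencies (shares that sum to 1)
shares = contract.value_counts(normalize=True)

print(counts)
print(shares)
```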

**Interpretation**

- Which category is most common?

When looking at the frequency counts of the MonthlyCharges column, the most common charge amount is $20.05, which appears 61 times in the dataset. Other frequently occurring values are close to this amount, such as $19.85, $19.95, $19.90, and $20.00, each appearing more than 40 times. This indicates that there are certain price points where many customers cluster, likely due to popular base service packages.

- Does the distribution seem balanced or skewed?


There are 1,585 unique values for MonthlyCharges, which means many charge amounts are rare, some appearing only once in the entire dataset. The counts are therefore skewed rather than balanced: a few amounts (especially lower-priced plans around $20) are common, while most other amounts are scattered and infrequent. Such a pattern reflects the variety of service combinations customers choose, with only a few standard pricing tiers being shared across larger groups of customers.
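
Because MonthlyCharges is continuous, grouping the amounts into ranges can make the shape of the distribution easier to read than counting exact values. The sketch below uses `pd.cut`. Both the `charges` values and the bin `edges` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical charges (not real records), clustered near $20 like the output above
charges = pd.Series(
    [19.85, 20.05, 20.05, 19.95, 45.30, 70.35, 89.85, 101.70, 118.75],
    name="MonthlyCharges",
)

# Hypothetical bin edges chosen for illustration
edges = [0, 30, 60, 90, 120]
binned = pd.cut(charges, bins=edges)

# sort_index() lists the bins in ascending order instead of by count
bin_counts = binned.value_counts().sort_index()
print(bin_counts)

# Positive skewness means a longer right tail, negative a longer left tail
print("Skewness:", charges.skew())
```

On the real data, the same calls could be run on `data['MonthlyCharges']` with bin edges suited to the actual price range.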