The dataset contains information about customers of a telecommunications
company and whether they have churned (i.e., discontinued their services). The dataset
includes various attributes of the customers, such as their demographics, usage patterns, and
account information. The goal is to perform data cleaning and preparation to gain insights
into the factors that contribute to customer churn.
Tasks to Perform:
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations,
and standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to
predicting customer churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modeling.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Telecom_Customer_Churn.csv")  # File name as given in task 1
df

In [None]:
print(df.head())  # View the first few rows
df.info()  # Column dtypes and non-null counts (info() prints directly and returns None)
print(df.describe())  # Descriptive statistics

In [None]:
df.fillna(0, inplace=True)  # handling missing values
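
Filling every column with 0 treats a missing category and a missing charge the same way. As one possible alternative, the sketch below fills numeric columns with their median and text columns with a placeholder label. The helper name `impute_by_dtype`, the "Unknown" label, and the toy frame are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd


def impute_by_dtype(frame, placeholder="Unknown"):
    """Return a copy: numeric NaNs become the column median, other NaNs a placeholder."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(placeholder)
    return out


# Made-up rows for illustration only
toy = pd.DataFrame({
    "Monthly Charge": [20.0, np.nan, 80.0],
    "Gender": ["Male", None, "Female"],
})
filled = impute_by_dtype(toy)
print(filled)
```

Whether median, mean, a placeholder, or dropping rows is appropriate depends on why the values are missing, so this is a starting point rather than a rule.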

In [None]:
df.drop_duplicates(inplace=True)  # Remove exact duplicate rows

In [None]:
df['Gender'] = df['Gender'].str.lower()  # Example: Standardize gender to lowercase
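
Lowercasing alone does not catch stray whitespace or abbreviations. The sketch below shows one way to handle those as well, using a made-up column and a hypothetical mapping (`gender_map`); the variants that actually occur in the dataset would need to be checked first, for example with `value_counts()`.

```python
import pandas as pd

# Made-up values showing typical inconsistencies
toy = pd.DataFrame({"Gender": [" Male", "FEMALE", "m", "female ", "F"]})

# Hypothetical mapping from abbreviations to canonical labels
gender_map = {"m": "male", "f": "female"}

cleaned = (
    toy["Gender"]
    .str.strip()          # drop leading/trailing whitespace
    .str.lower()          # unify case
    .replace(gender_map)  # expand abbreviations (exact matches only)
)
print(cleaned.value_counts())
```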

In [None]:
df['TotalCharges'] = pd.to_numeric(df['Total Charges'], errors='coerce')  # Numeric copy of 'Total Charges'; unparseable values become NaN

In [None]:
# Remove outliers from TotalCharges: keep rows within 3 standard deviations of the mean
z_scores = (df['TotalCharges'] - df['TotalCharges'].mean()) / df['TotalCharges'].std()
df = df[z_scores.abs() < 3].copy()  # copy() avoids SettingWithCopyWarning in later cells
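
The z-score rule depends on the mean and standard deviation, which the outliers themselves distort. As a hedged alternative, the sketch below compares it with the interquartile-range (IQR) rule on made-up charges; the 1.5 multiplier is the conventional choice, not something the dataset prescribes.

```python
import pandas as pd

# Made-up charges with one extreme value
toy = pd.DataFrame({"TotalCharges": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
col = toy["TotalCharges"]

# z-score rule, as in the cell above
z_scores = (col - col.mean()) / col.std()
kept_z = toy[z_scores.abs() < 3]

# IQR rule: keep values within 1.5 * IQR of the quartiles
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept_iqr = toy[col.between(lower, upper)]

print(len(kept_z), len(kept_iqr))
```

In this toy sample the extreme value survives the z-score rule because it inflates the standard deviation, while the IQR rule removes it.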

In [None]:
df['TenureInDays'] = df['Tenure in Months'] * 30  # New feature: approximate tenure in days (assumes 30 days per month)
df

In [None]:
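scaler = StandardScaler()  # Standardize numeric features to zero mean and unit variance
num_cols = ['Monthly Charge', 'TotalCharges', 'Tenure in Months']
scaled_cols = ['MonthlyChargeScaled', 'TotalChargesScaled', 'TenureInMonthsScaled']
df[scaled_cols] = scaler.fit_transform(df[num_cols])  # Write to new columns so the raw values are kept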

In [None]:
# Split the data into training (80%) and testing (20%) sets
X = df.drop(columns='Churn Category')
y = df['Churn Category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
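
The cells above scale the full dataset before splitting, so the test rows influence the scaler's mean and standard deviation. A common way to avoid that leakage is to split first and fit the scaler on the training rows only. The sketch below shows the pattern on made-up numeric data; the array shapes, the alternating labels, and the use of `stratify` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # made-up numeric features
y_toy = np.tile([0, 1], 50)                              # made-up binary labels

# Split first; stratify keeps the class balance similar in both sets
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

toy_scaler = StandardScaler()
Xtr_scaled = toy_scaler.fit_transform(Xtr)  # learn mean/std from training rows only
Xte_scaled = toy_scaler.transform(Xte)      # reuse those statistics on the test rows

print(Xtr_scaled.shape, Xte_scaled.shape)
```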

In [None]:
df.to_csv("Cleaned_Telecom_Customer_Churn.csv", index=False)  # Export the cleaned DataFrame; index=False avoids writing an extra index column