# Title: Data Cleaning and Preparation
<b>Problem Statement:</b> Analyzing Customer Churn in a Telecommunications Company<br>
<b>Dataset:</b> "Telecom_Customer_Churn.csv"<br>
<b>Description:</b> The dataset contains information about customers of a telecommunications company and whether they have churned (i.e., discontinued their services). The dataset includes various attributes of the customers, such as their demographics, usage patterns, and account information. The goal is to perform data cleaning and preparation to gain insights into the factors that contribute to customer churn.<br>

<b>Tasks to Perform:</b>
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to predicting customer churn.
9. Normalize or scale the data if necessary.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Task 1. Import the "Telecom_Customer_Churn.csv" dataset.

In [None]:
data = pd.read_csv("datasets/Telecom_Customer_Churn.csv")
data.index

## Task 2. Explore the dataset to understand its structure and content.

In [None]:
data.info()

In [None]:
data.head()
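
Beyond `info()` and `head()`, a few summary calls can help with exploration. The sketch below uses a small made-up frame (`toy`) rather than the real dataset, so the values are purely illustrative; only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration
toy = pd.DataFrame({
    "tenure": [1, 12, 48, 60],
    "MonthlyCharges": [30.0, 60.0, 100.0, 80.0],
    "Churn": ["Yes", "No", "No", "No"],
})

shape = toy.shape                      # (rows, columns)
numeric_summary = toy.describe()       # count, mean, std, quartiles for numeric columns
churn_share = toy["Churn"].value_counts(normalize=True)  # class balance of the target

print(shape)
print(numeric_summary)
print(churn_share)
```

The class balance is worth checking early, since a heavily imbalanced target affects how later results should be read.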

## Task 3. Handle missing values in the dataset, deciding on an appropriate strategy.

In [None]:
# Convert 'TotalCharges' to numeric, forcing errors to NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

In [None]:
# Check for missing values
print("\nMissing values before handling:")
print(data.isnull().sum())

In [None]:
# Handle missing values in 'TotalCharges' by dropping rows where 'TotalCharges' is NaN
data = data.dropna(subset=['TotalCharges'])
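
Dropping rows is one strategy; another is to inspect the affected rows first and impute where a sensible value exists. The sketch below assumes, hypothetically, that blank `TotalCharges` entries belong to customers with `tenure` of 0, in which case 0 is a defensible fill value. It runs on a made-up frame (`toy`), not on the real data.

```python
import pandas as pd

# Toy stand-in; blank strings mimic unparseable TotalCharges entries
toy = pd.DataFrame({
    "tenure": [0, 12, 24, 0],
    "MonthlyCharges": [50.0, 70.0, 90.0, 20.0],
    "TotalCharges": [" ", "840.5", "2160", " "],
})

# Blank strings become NaN, as in the conversion above
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Look at the rows that are missing before choosing a strategy
missing_rows = toy[toy["TotalCharges"].isna()]
print(missing_rows)

# If every missing row is a brand-new customer, fill with 0; otherwise drop
if (missing_rows["tenure"] == 0).all():
    toy_filled = toy.fillna({"TotalCharges": 0.0})
else:
    toy_filled = toy.dropna(subset=["TotalCharges"])
```

Whether to drop or impute depends on how many rows are affected and whether those customers matter for the churn question.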

## Task 4. Remove any duplicate records from the dataset.

In [None]:
data = data.drop_duplicates()
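
It can be useful to count duplicates before removing them, and to check whether any ID still repeats afterwards with conflicting values. A minimal sketch on a made-up frame (`toy`), assuming `customerID` is meant to be unique per customer:

```python
import pandas as pd

# Toy stand-in: one exact duplicate (B2) and one ID with conflicting values (C3)
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "B2", "C3", "C3"],
    "MonthlyCharges": [50.0, 70.0, 70.0, 90.0, 95.0],
})

# Number of rows that are exact copies of an earlier row
n_exact = int(toy.duplicated().sum())

# Remove exact duplicates, keeping the first occurrence
deduped = toy.drop_duplicates()

# IDs that still appear more than once, i.e. the same customer with different values
repeated = deduped["customerID"].duplicated(keep=False)
conflicting_ids = deduped.loc[repeated, "customerID"].unique().tolist()

print(n_exact, len(deduped), conflicting_ids)
```

`drop_duplicates()` only removes fully identical rows, so conflicting records for the same ID need a separate decision.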

## Task 5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and standardize it.

In [None]:
# Check unique values in categorical columns
categorical_cols = ['gender', 'Partner', 'Dependents', 'PhoneService',
                    'MultipleLines', 'InternetService', 'OnlineSecurity',
                    'OnlineBackup','DeviceProtection', 'TechSupport', 
                    'StreamingTV', 'StreamingMovies','Contract', 
                    'PaperlessBilling', 'PaymentMethod', 'Churn']

for col in categorical_cols:
    print(f"Unique values in {col}: {data[col].unique()}")
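
If the unique values reveal variations in case or whitespace, they could be standardized along these lines. This is a sketch on made-up values; the helper name `standardize_text` is hypothetical, and lowercasing is just one possible convention.

```python
import pandas as pd

# Toy stand-in with deliberately messy labels
toy = pd.DataFrame({
    "gender": ["Male", " male", "FEMALE", "Female "],
    "Contract": ["Month-to-month", "month-to-month", "One year", "one  year"],
})

def standardize_text(series: pd.Series) -> pd.Series:
    """Trim, collapse repeated whitespace, and lowercase a text column."""
    return (
        series.astype("string")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.lower()
    )

for col in ["gender", "Contract"]:
    toy[col] = standardize_text(toy[col])
    print(f"Unique values in {col}: {toy[col].unique().tolist()}")
```

Lowercasing changes the labels themselves, so any later code that compares against values such as `'Yes'` would need to use the lowercase form.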

## Task 6. Convert columns to the correct data types as needed.

In [None]:
# Assuming 'SeniorCitizen' is categorical, not numeric
data['SeniorCitizen'] = data['SeniorCitizen'].astype('category')
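
The same idea can be extended to text columns and to the target. The sketch below, on a made-up frame, converts a text column to `category` and maps `Churn` to a 0/1 flag; the column name `Churn_flag` is a hypothetical addition, and the `'Yes'`/`'No'` labels are assumed.

```python
import pandas as pd

# Toy stand-in for the real data
toy = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Churn": ["No", "Yes", "No"],
})

# Columns with a small, fixed set of values are stored as categories
toy["SeniorCitizen"] = toy["SeniorCitizen"].astype("category")
toy["Contract"] = toy["Contract"].astype("category")

# Map the target to 0/1; any label outside the mapping would become NaN
toy["Churn_flag"] = toy["Churn"].map({"No": 0, "Yes": 1})

print(toy.dtypes)
```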

## Task 7. Identify and handle outliers in the data.

In [None]:
# Example: Visualize outliers in MonthlyCharges
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['MonthlyCharges'])
plt.title('Outliers in MonthlyCharges')
plt.show()
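
A boxplot only shows outliers; to flag or handle them, one common option is the 1.5 × IQR rule that boxplot whiskers are based on. The following is a sketch on made-up values, with capping (clipping to the fences) shown as one possible treatment rather than the required one.

```python
import pandas as pd

# Toy stand-in with one obviously extreme value
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 500.0]
})

# Quartiles and interquartile range
q1 = toy["MonthlyCharges"].quantile(0.25)
q3 = toy["MonthlyCharges"].quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = ~toy["MonthlyCharges"].between(lower, upper)
print(f"Outliers found: {int(is_outlier.sum())}")

# One way to handle them: cap values at the fences instead of dropping rows
toy["MonthlyCharges_capped"] = toy["MonthlyCharges"].clip(lower, upper)
```

Capping keeps every row, which matters when the dataset is small; dropping or leaving outliers untouched are equally valid choices depending on the analysis.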

## Task 8. Perform feature engineering, creating new features that may be relevant to predicting customer churn.

In [None]:
# Create a new feature for tenure in years
data['Tenure_Years'] = data['tenure'] / 12
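
A few more candidate features can be derived from the existing columns. The sketch below is illustrative only: the feature names (`Num_Services`, `Avg_Charge_Per_Month`, `Tenure_Group`), the choice of service columns, and the bin edges are all assumptions, and the frame is made up.

```python
import pandas as pd

# Toy stand-in for the real data
toy = pd.DataFrame({
    "tenure": [1, 12, 48],
    "MonthlyCharges": [30.0, 60.0, 100.0],
    "TotalCharges": [30.0, 700.0, 4800.0],
    "OnlineSecurity": ["No", "Yes", "Yes"],
    "TechSupport": ["No", "No", "Yes"],
    "StreamingTV": ["Yes", "No", "Yes"],
})

# Count how many add-on services each customer subscribes to
service_cols = ["OnlineSecurity", "TechSupport", "StreamingTV"]
toy["Num_Services"] = (toy[service_cols] == "Yes").sum(axis=1)

# Average charge per month of tenure; tenure 0 yields NaN instead of dividing by zero
safe_tenure = toy["tenure"].where(toy["tenure"] > 0)
toy["Avg_Charge_Per_Month"] = toy["TotalCharges"] / safe_tenure

# Bucket tenure (in months) into coarse groups; bin edges are illustrative
toy["Tenure_Group"] = pd.cut(
    toy["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-1y", "1-2y", "2-4y", "4-6y"],
    include_lowest=True,
)

print(toy[["Num_Services", "Avg_Charge_Per_Month", "Tenure_Group"]])
```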

## Task 9. Normalize or scale the data if necessary.

In [None]:
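# Standardize the numeric charge columns to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset before the split below, so test-set
# statistics influence the scaling; for modeling, fit it on the training split only.
scaler = StandardScaler()
data[['MonthlyCharges', 'TotalCharges']] = scaler.fit_transform(data[['MonthlyCharges', 'TotalCharges']])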

# Split the dataset into training and testing sets for further analysis
X = data.drop(columns=['Churn', 'customerID']) # Dropping target and ID columns
y = data['Churn'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
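
When the scaled data feeds a model, a common practice is to split first and fit the scaler on the training portion only, so that no test-set statistics leak into training. A minimal sketch on a made-up frame, restricted to the two numeric columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real data
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0],
    "TotalCharges": [20.0, 300.0, 800.0, 1500.0, 600.0, 2100.0, 4000.0, 900.0, 5000.0, 6600.0],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

num_cols = ["MonthlyCharges", "TotalCharges"]
X = toy[num_cols]
y = toy["Churn"]

# Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit on the training split only, then apply the same transform to the test split
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.describe())
```

The test split is transformed with the training mean and standard deviation, so its own mean will generally not be exactly zero.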

In [None]:
# Export the cleaned dataset for future analysis
data.to_csv("datasets/Cleaned_Telecom_Customer_Churn1.csv", index=False)