# Customer Churn Analysis - Data Preprocessing

This notebook handles the data preprocessing steps including loading the data, cleaning it, and encoding categorical variables.






## Step 1: Import Libraries and Set Up Path

In this step, I import the necessary libraries, load the project configuration, and resolve the data paths used in the rest of the notebook.


### 1. Import Necessary Libraries and Configuration

This cell imports the libraries used for preprocessing, reads `config.json`, and converts the relative paths it contains into absolute paths.


In [25]:
import json
import pandas as pd
import os
from sklearn.preprocessing import OneHotEncoder

# Load configuration
config_path = os.path.join(os.path.dirname(os.path.abspath('')), '..', 'config.json')
print(f"Config path: {config_path}")
with open(config_path, 'r') as f:
    config = json.load(f)

# Convert relative paths to absolute paths
project_root = os.path.dirname(os.path.dirname(os.path.abspath('')))
raw_data_path = os.path.join(project_root, config['raw_data_path'])
interim_cleaned_data_path = os.path.join(project_root, config['interim_cleaned_data_path'])
preprocessed_data_path = os.path.join(project_root, config['preprocessed_data_path'])

# Print the absolute paths for verification
print(f"Raw data path (absolute): {raw_data_path}")
print(f"Interim cleaned data path (absolute): {interim_cleaned_data_path}")
print(f"Preprocessed data path (absolute): {preprocessed_data_path}")


Config path: d:\Customer-Churn-Analysis\notebooks\..\config.json
Raw data path (absolute): d:\Customer-Churn-Analysis\data/raw/Dataset (ATS)-1.csv
Interim cleaned data path (absolute): d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Preprocessed data path (absolute): d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv
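
The path handling above depends on the notebook's working directory. As a rough alternative sketch, the same resolution could be written with `pathlib`. The helper name `resolve_paths` and the temporary directory are assumptions made for illustration; only the three config keys and their relative values come from the output above.

```python
import json
import tempfile
from pathlib import Path


def resolve_paths(config_file, project_root):
    """Read a JSON config and join each *_path entry onto project_root.

    Hypothetical helper for illustration, not part of the project.
    """
    with open(config_file, "r") as f:
        config = json.load(f)
    root = Path(project_root).resolve()
    # Keep only entries whose key ends in "_path" and make them absolute
    return {key: root / value for key, value in config.items() if key.endswith("_path")}


# Demo with a throwaway project directory and a config that uses the notebook's keys
with tempfile.TemporaryDirectory() as tmp:
    demo_config = {
        "raw_data_path": "data/raw/Dataset (ATS)-1.csv",
        "interim_cleaned_data_path": "data/interim/cleaned_dataset.csv",
        "preprocessed_data_path": "Data_Preparation/preprocessed_dataset/cleaned_dataset.csv",
    }
    config_file = Path(tmp) / "config.json"
    config_file.write_text(json.dumps(demo_config))

    paths = resolve_paths(config_file, tmp)
    for name, path in paths.items():
        print(f"{name}: {path}")
```

Because `Path` objects normalize separators for the current platform, this variant would also avoid the mixed `\` and `/` separators visible in the printed paths.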


### 2. Load Data

This cell loads the raw data from the specified path.


In [26]:
# Load the dataset
df = pd.read_csv(raw_data_path)
print("Initial columns after loading dataset:", df.columns.tolist())
# Display the first few rows of the dataset
df.head()


Initial columns after loading dataset: ['gender', 'SeniorCitizen', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'MonthlyCharges', 'Churn']


| | gender | SeniorCitizen | Dependents | tenure | PhoneService | MultipleLines | InternetService | Contract | MonthlyCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | No | 1 | No | No | DSL | Month-to-month | 29.85 | No |
| 1 | Male | 0 | No | 34 | Yes | No | DSL | One year | 56.95 | No |
| 2 | Male | 0 | No | 2 | Yes | No | DSL | Month-to-month | 53.85 | Yes |
| 3 | Male | 0 | No | 45 | No | No | DSL | One year | 42.3 | No |
| 4 | Female | 0 | No | 2 | Yes | No | Fiber optic | Month-to-month | 70.7 | Yes |
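
Before cleaning, it can help to see how much data is actually missing, since the next step drops every incomplete row. Here is a small sketch on a made-up frame. The column names match the dataset, but the rows and the missing entries are invented for illustration.

```python
import pandas as pd

# Illustrative frame with the same kinds of columns as the churn data;
# the None entries are invented to demonstrate the check
sample = pd.DataFrame({
    "gender": ["Female", "Male", None, "Male"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, None, 53.85, 42.30],
})

# Count missing entries per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows that dropna() would remove
rows_with_missing = int(sample.isna().any(axis=1).sum())
print(f"Rows with at least one missing value: {rows_with_missing} of {len(sample)}")
```

Running the same two checks on `df` would show whether dropping rows discards a meaningful share of the data.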


### 3. Clean Data

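In this cell, we clean the data by dropping rows with missing values and one-hot encoding every `object`-dtype column with scikit-learn's `OneHotEncoder`. This includes the target column `Churn`, which becomes `Churn_No` and `Churn_Yes`.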


In [27]:
def clean_data(df):
    """
    Clean the data by handling missing values and encoding categorical variables.

    Parameters:
    df (pd.DataFrame): The data to clean.

    Returns:
    pd.DataFrame: Cleaned data.
    """
    try:
        # Handle missing values by dropping rows with missing values
        df = df.dropna()
        print("Missing values handled by dropping rows with missing values.")

        # Identify and encode categorical variables
        categorical_columns = df.select_dtypes(include=['object']).columns
        print(f"Categorical columns identified: {categorical_columns}")
        if len(categorical_columns) > 0:
            # Initialize the OneHotEncoder
            encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
            
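            # Fit and transform the categorical columns; reuse df's index so the
            # encoded rows stay aligned with df after dropna() removes rows
            encoded_data = pd.DataFrame(
                encoder.fit_transform(df[categorical_columns]),
                columns=encoder.get_feature_names_out(categorical_columns),
                index=df.index
            )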
            
            # Drop the original categorical columns from the DataFrame
            df = df.drop(columns=categorical_columns)
            
            # Concatenate the encoded data with the original DataFrame
            df = pd.concat([df, encoded_data], axis=1)
            print(f"Categorical columns {list(categorical_columns)} encoded.")
        else:
            print("No categorical columns found to encode.")
        
        return df
    except Exception as e:
        print(f"An error occurred during data cleaning: {e}")
        raise  # Re-raise so the caller never receives None silently

# Clean the loaded data
df_cleaned = clean_data(df)
print("Columns after cleaning and encoding data:", df_cleaned.columns.tolist())
# Display the first few rows of the cleaned dataset
df_cleaned.head()


Missing values handled by dropping rows with missing values.
Categorical columns identified: Index(['gender', 'Dependents', 'PhoneService', 'MultipleLines',
       'InternetService', 'Contract', 'Churn'],
      dtype='object')
Categorical columns ['gender', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'Churn'] encoded.
Columns after cleaning and encoding data: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female', 'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No', 'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes', 'InternetService_DSL', 'InternetService_Fiber optic', 'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year', 'Churn_No', 'Churn_Yes']


| | SeniorCitizen | tenure | MonthlyCharges | gender_Female | gender_Male | Dependents_No | Dependents_Yes | PhoneService_No | PhoneService_Yes | MultipleLines_No | MultipleLines_Yes | InternetService_DSL | InternetService_Fiber optic | Contract_Month-to-month | Contract_One year | Contract_Two year | Churn_No | Churn_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0 | 34 | 56.95 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0 | 2 | 53.85 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0 | 45 | 42.3 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0 | 2 | 70.7 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
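
One detail worth checking in a function like `clean_data`: after `dropna()`, the surviving rows keep their original index labels, while a DataFrame built from an encoder's output array gets a fresh 0..n-1 index unless one is passed explicitly. `pd.concat(..., axis=1)` aligns on index labels, so mismatched labels produce extra rows filled with `NaN`. The sketch below uses an invented three-row frame and a hand-written array as a stand-in for the encoder output.

```python
import pandas as pd

# Invented frame: the row labelled 1 has a missing value
df_demo = pd.DataFrame({"tenure": [1.0, None, 2.0], "gender": ["Female", "Male", "Male"]})
df_demo = df_demo.dropna()  # remaining index labels: 0 and 2

# Stand-in for an encoder's output array: one row per surviving row of df_demo
encoded_values = [[1.0, 0.0], [0.0, 1.0]]
encoded_columns = ["gender_Female", "gender_Male"]

# A default RangeIndex (0, 1) does not match df_demo's labels (0, 2)
misaligned = pd.concat(
    [df_demo[["tenure"]], pd.DataFrame(encoded_values, columns=encoded_columns)],
    axis=1,
)

# Reusing df_demo's index keeps every encoded row next to its source row
aligned = pd.concat(
    [
        df_demo[["tenure"]],
        pd.DataFrame(encoded_values, columns=encoded_columns, index=df_demo.index),
    ],
    axis=1,
)

print("misaligned rows:", len(misaligned), "NaN cells:", int(misaligned.isna().sum().sum()))
print("aligned rows:", len(aligned), "NaN cells:", int(aligned.isna().sum().sum()))
```

If the raw dataset contains no missing values, both variants give the same result, which is why this kind of problem can go unnoticed until the input data changes.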


### 4. Save Cleaned Data

This cell saves the cleaned data to the specified interim and preprocessed paths.


In [28]:
# Save the cleaned dataset
df_cleaned.to_csv(interim_cleaned_data_path, index=False)
df_cleaned.to_csv(preprocessed_data_path, index=False)
print(f"Cleaned data saved to interim at {interim_cleaned_data_path}")
print(f"Cleaned data saved to preprocessed at {preprocessed_data_path}")


Cleaned data saved to interim at d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Cleaned data saved to preprocessed at d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv
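
The save step assumes that both target directories already exist, because `to_csv` does not create missing parent directories. A minimal sketch of guarding against that, using a temporary directory as a hypothetical stand-in for the configured paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small illustrative frame standing in for df_cleaned
df_out = pd.DataFrame({"tenure": [1, 34], "MonthlyCharges": [29.85, 56.95]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location standing in for the configured paths
    out_path = Path(tmp) / "data" / "interim" / "cleaned_dataset.csv"

    # Create the parent directories first; exist_ok avoids an error on reruns
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df_out.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(out_path)
    print(reloaded.shape)
```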


## Summary

In this notebook, I have successfully completed the following steps:

1. **Imported necessary libraries and set up paths**: Imported the required libraries, loaded `config.json`, and resolved the raw, interim, and preprocessed data paths.
2. **Loaded raw data**: Loaded the raw customer churn data from the CSV file located in the `data/raw/` directory.
3. **Cleaned the data**: Applied data cleaning procedures to handle missing values and encode categorical variables.
4. **Saved the cleaned data**: Saved the cleaned data to the `data/interim/` and `Data_Preparation/preprocessed_dataset/` directories for further analysis.

### Next Steps

1. **Handle Missing Data Points**: Rows with missing values are currently dropped; revisit this choice (for example, with imputation) if dropping removes too much data.
2. **Encode Categorical Variables**: Review the one-hot encoding applied here, in particular the target `Churn`, which is currently split into `Churn_No` and `Churn_Yes`, so that every variable is in a numerical format suitable for machine learning algorithms.
3. **Perform Feature Scaling and Normalization**: Scale and normalize the features to ensure they are on a comparable scale, which is crucial for the performance of many machine learning models.
4. **Exploratory Data Analysis (EDA)**: Conduct EDA to understand the data distributions, relationships between variables, and identify any anomalies or patterns.
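
As a rough illustration of the feature scaling step listed above, the sketch below standardizes the two continuous columns using only the five rows shown by `df.head()` earlier in the notebook. This is one possible approach rather than a decided design, and the variable names `features` and `scaled` are chosen here for illustration.

```python
import pandas as pd

# The five rows shown by df.head() earlier in the notebook
features = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
})

# Standardize each column: subtract its mean, divide by its standard deviation
# (ddof=0 selects the population standard deviation)
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(3))
```

In a modeling pipeline, the mean and standard deviation would normally be computed on the training split only and then reused for the validation and test splits.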

By documenting each step and summarizing the results, I ensure clarity and reproducibility for all team members and stakeholders involved in the project.
