In [2]:
# Import necessary libraries
import os
import sys
import pandas as pd

# Ensure the utils module can be found
sys.path.append(os.path.join(os.path.dirname(os.path.abspath('')), '..', 'utils'))

# Import custom modules
from data_loader import load_data
from data_cleaner import clean_data


# Data Preprocessing

This notebook performs the initial data loading and preprocessing steps for the Customer Churn Analysis project.

## Step 1: Import Libraries and Set Up Path

In this step, I import the necessary libraries and set up the path to include the custom utility functions for data loading and cleaning.
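As a complement to the path setup above, here is a minimal sketch of the same idea using `pathlib`. The helper name `add_utils_to_path` and the `levels_up` parameter are hypothetical, and the sketch assumes the `utils` folder sits a fixed number of directories above the notebook's working directory; adjust `levels_up` to match the actual project layout.

```python
import sys
from pathlib import Path


def add_utils_to_path(start: Path, levels_up: int = 1, utils_dirname: str = "utils") -> Path:
    """Append <start>/../.../<utils_dirname> to sys.path once and return it.

    Hypothetical helper: the directory layout is an assumption, not the project's confirmed structure.
    """
    base = start.resolve()
    for _ in range(levels_up):
        base = base.parent
    utils_dir = base / utils_dirname
    # Avoid adding the same entry again when the cell is re-run
    if str(utils_dir) not in sys.path:
        sys.path.append(str(utils_dir))
    return utils_dir


utils_dir = add_utils_to_path(Path.cwd())
utils_dir_again = add_utils_to_path(Path.cwd())
```

Checking for an existing entry keeps `sys.path` from growing each time the cell is executed again.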


In [3]:
# Define file paths
raw_data_path = 'D:/CustomerChurnAnalysis/data/raw/Dataset (ATS)-1.csv'

# Load the raw data
df = load_data(raw_data_path)

# Display the first few rows of the dataframe
df.head()


Data loaded successfully from D:/CustomerChurnAnalysis/data/raw/Dataset (ATS)-1.csv


Unnamed: 0,gender,SeniorCitizen,Dependents,tenure,PhoneService,MultipleLines,InternetService,Contract,MonthlyCharges,Churn
0,Female,0,No,1,No,No,DSL,Month-to-month,29.85,No
1,Male,0,No,34,Yes,No,DSL,One year,56.95,No
2,Male,0,No,2,Yes,No,DSL,Month-to-month,53.85,Yes
3,Male,0,No,45,No,No,DSL,One year,42.3,No
4,Female,0,No,2,Yes,No,Fiber optic,Month-to-month,70.7,Yes


## Step 2: Load Data

In this step, I load the raw data from the CSV file located in the `data/raw/` directory. The custom `load_data` function from the `data_loader` module is used to read the CSV file into a pandas DataFrame.
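The source of `load_data` is not shown in this notebook, so the following is only a sketch of what such a loader might look like, assuming it wraps `pd.read_csv` and prints a confirmation message. The name `load_data_sketch` and the `index_col` parameter are hypothetical. The sketch also illustrates where a column such as `Unnamed: 0` comes from: pandas assigns that name when the first header cell of the CSV is empty, which typically happens when a file was saved with its index.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_data_sketch(path, index_col=None):
    """Hypothetical stand-in for load_data: read a CSV and report success."""
    df = pd.read_csv(path, index_col=index_col)
    print(f"Data loaded successfully from {path}")
    return df


# Small illustrative CSV whose first header cell is empty (a saved index)
csv_text = ",gender,tenure\n0,Female,1\n1,Male,34\n"

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    csv_path.write_text(csv_text)
    df_default = load_data_sketch(csv_path)               # keeps an 'Unnamed: 0' column
    df_indexed = load_data_sketch(csv_path, index_col=0)  # uses that column as the index
```

Passing `index_col=0` is one way to keep the saved index out of the feature columns, if that column carries no information.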


In [4]:
# Clean the loaded data
df_cleaned = clean_data(df)

# Display the first few rows of the cleaned dataframe
df_cleaned.head()


Missing values handled by dropping rows with missing values.
Categorical columns ['gender', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'Churn'] encoded.


Unnamed: 0,gender_1,gender_2,SeniorCitizen,Dependents_1,Dependents_2,tenure,PhoneService_1,PhoneService_2,MultipleLines_1,MultipleLines_2,InternetService_1,InternetService_2,Contract_1,Contract_2,Contract_3,MonthlyCharges,Churn_1,Churn_2
0,1,0,0,1,0,1,1,0,1,0,1,0,1,0,0,29.85,1,0
1,0,1,0,1,0,34,0,1,1,0,1,0,0,1,0,56.95,1,0
2,0,1,0,1,0,2,0,1,1,0,1,0,1,0,0,53.85,0,1
3,0,1,0,1,0,45,1,0,1,0,1,0,0,1,0,42.3,1,0
4,1,0,0,1,0,2,0,1,1,0,0,1,1,0,0,70.7,0,1


## Step 3: Clean Data

In this step, I clean the loaded data using the custom `clean_data` function from the `data_cleaner` module. This function handles missing values and encodes categorical variables. The cleaned data is then displayed.
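The implementation of `clean_data` is not included here. The sketch below shows one way the two operations described above could be done with plain pandas: dropping rows with missing values, then one-hot encoding every non-numeric column. The function name `clean_data_sketch` is hypothetical, and the numbering of levels by order of first appearance (producing names like `gender_1`, `gender_2`) is an assumption made to resemble the output shown above, not the confirmed behavior of the real function.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data: drop missing rows, one-hot encode."""
    df = df.dropna().reset_index(drop=True)
    parts = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns pass through unchanged
            parts.append(df[[col]])
        else:
            # Number each level by first appearance (1, 2, ...), then one-hot encode
            codes, _ = pd.factorize(df[col])
            parts.append(pd.get_dummies(codes + 1, prefix=col, dtype=int))
    return pd.concat(parts, axis=1)


sample = pd.DataFrame({
    "gender": ["Female", "Male", "Male", None],
    "tenure": [1, 34, 2, 5],
    "Churn": ["No", "No", "Yes", "No"],
})
encoded = clean_data_sketch(sample)
```

Note that keeping every level of a two-level column, such as both `Churn_1` and `Churn_2`, stores the same information twice; whether to drop one level depends on the downstream model.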


In [5]:
# Define the path to save the processed data
processed_data_path = 'D:/CustomerChurnAnalysis/data/processed/dataset.csv'

# Save the cleaned data to the processed data directory
df_cleaned.to_csv(processed_data_path, index=False)
print(f"Cleaned data saved successfully to {processed_data_path}")


Cleaned data saved successfully to D:/CustomerChurnAnalysis/data/processed/dataset.csv


## Step 4: Save Cleaned Data

In this step, I save the cleaned data to the `data/processed/` directory. The cleaned DataFrame is written to a CSV file using the `to_csv` method of pandas DataFrame.
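A slightly more defensive variant of the save step is sketched below. It assumes the output directory may not exist yet and reads the file back as a quick sanity check. The helper name `save_processed` is hypothetical, and a temporary directory stands in for the real `data/processed/` location.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed(df: pd.DataFrame, path) -> Path:
    """Write df to CSV without the index, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


sample = pd.DataFrame({"tenure": [1, 34, 2], "MonthlyCharges": [29.85, 56.95, 53.85]})

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_processed(sample, Path(tmp) / "data" / "processed" / "dataset.csv")
    # Read the file back to confirm the shape and columns survived the round trip
    reloaded = pd.read_csv(saved_path)
```

Writing with `index=False` keeps the row index out of the file, so reloading it later does not introduce an extra unnamed column.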


## Summary

In this notebook, I completed the following steps:

1. **Imported necessary libraries and set up paths**: Ensured that the required libraries and custom utility functions are correctly imported and accessible.
2. **Loaded raw data**: Loaded the raw customer churn data from the CSV file located in the `data/raw/` directory.
3. **Cleaned the data**: Applied data cleaning procedures to handle missing values and encode categorical variables.
4. **Saved the cleaned data**: Saved the cleaned data to the `data/processed/` directory for further analysis.

### Next Steps

1. **Handle Missing Data Points**: Implement strategies to handle any remaining missing data points, ensuring data quality and completeness.
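2. **Refine Categorical Encoding**: Review the one-hot encoding applied in this notebook, which keeps every level of each categorical variable (for example both `Churn_1` and `Churn_2`), and adjust it so the features and target are in a numerical format suitable for machine learning algorithms.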
3. **Perform Feature Scaling and Normalization**: Scale and normalize the features to ensure they are on a comparable scale, which is crucial for the performance of many machine learning models.
4. **Exploratory Data Analysis (EDA)**: Conduct EDA to understand the data distributions, relationships between variables, and identify any anomalies or patterns.
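As an illustration of the feature scaling step listed above, here is a minimal sketch of z-score standardization using plain pandas. The helper name `standardize` is hypothetical, the choice of z-scores over other scaling methods is an assumption, and the sample values are the five rows of `tenure` and `MonthlyCharges` displayed earlier in this notebook.

```python
import pandas as pd


def standardize(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df with the given columns rescaled to mean 0 and standard deviation 1."""
    scaled = df.copy()
    for col in columns:
        mean = scaled[col].mean()
        std = scaled[col].std(ddof=0)  # population standard deviation
        scaled[col] = (scaled[col] - mean) / std
    return scaled


sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.3, 70.7],
    "Churn_2": [0, 0, 1, 0, 1],
})
scaled = standardize(sample, ["tenure", "MonthlyCharges"])
```

Only the continuous columns are rescaled; binary indicator columns such as `Churn_2` are left as they are. In a modeling workflow, the mean and standard deviation would normally be computed on the training split only and reused for the test split.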

By documenting each step and summarizing the results, I ensure clarity and reproducibility for all team members and stakeholders involved in the project.
