In [45]:
# Task 1: Step 1 - Import libraries and set up paths

# Import necessary libraries
import os
import sys
import json
import pandas as pd

# Ensure the utils module can be found
notebook_dir = os.path.dirname(os.path.abspath(''))
project_root = os.path.abspath(os.path.join(notebook_dir, '..'))
utils_path = os.path.join(project_root, 'utils')

print(f"Notebook directory: {notebook_dir}")
print(f"Project root: {project_root}")
print(f"Utils path: {utils_path}")

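# Add the utils directory only once; re-running this cell would otherwise append duplicates
if utils_path not in sys.path:
    sys.path.append(utils_path)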

print("sys.path:", sys.path)

# Import custom modules
try:
    from data_loader import load_data
    from data_cleaner import clean_data
    print("Modules imported successfully.")
except ModuleNotFoundError as e:
    print(f"ModuleNotFoundError: {e}")

# Load configuration
config_path = os.path.join(project_root, 'config.json')
print(f"Config path: {config_path}")
if not os.path.exists(config_path):
    print(f"Config file does not exist at {config_path}")
else:
    with open(config_path, 'r') as f:
        config = json.load(f)

    raw_data_path = os.path.join(project_root, config['raw_data_path'])
    interim_cleaned_data_path = os.path.join(project_root, config['interim_cleaned_data_path'])
    preprocessed_data_path = os.path.join(project_root, config['preprocessed_data_path'])

    print(f"Raw data path: {raw_data_path}")
    print(f"Interim cleaned data path: {interim_cleaned_data_path}")
    print(f"Preprocessed data path: {preprocessed_data_path}")


Notebook directory: d:\Customer-Churn-Analysis\notebooks
Project root: d:\Customer-Churn-Analysis
Utils path: d:\Customer-Churn-Analysis\utils
sys.path: ['d:\\Customer-Churn-Analysis\\notebooks\\data_preprocessing', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis\\python38.zip', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis\\DLLs', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis\\lib', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis', '', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis\\lib\\site-packages', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis\\lib\\site-packages\\win32', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis\\lib\\site-packages\\win32\\lib', 'c:\\Users\\iambh\\anaconda3\\envs\\churn_analysis\\lib\\site-packages\\Pythonwin', 'd:\\Customer-Churn-Analysis\\utils', 'd:\\Customer-Churn-Analysis\\notebooks\\data_preprocessing\\..\\utils', 'd:\\Customer-Churn-Analysis\\utils', 'd:\\Customer-Churn-Analysis\\utils', 'd:\\Customer-Churn-Analysis\\util

# Data Preprocessing

This notebook performs the initial data loading and preprocessing steps for the Customer Churn Analysis project.

## Step 1: Import Libraries and Set Up Path

In this step, I import the necessary libraries and set up the path to include the custom utility functions for data loading and cleaning.
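The path setup described above can be sketched with `pathlib`. This is a minimal sketch, not the notebook's actual cell: the helper `add_to_sys_path` is a hypothetical name, and the layout (working directory two levels below the project root, with a `utils` folder at the root) is an assumption taken from the printed paths.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Append a directory to sys.path once; return True only if it was added."""
    directory = str(Path(directory).resolve())
    if directory in sys.path:
        return False
    sys.path.append(directory)
    return True


# Assumed layout: <project_root>/notebooks/<subfolder> is the working directory
project_root = Path.cwd().resolve().parent.parent
utils_path = project_root / "utils"

added = add_to_sys_path(utils_path)
print(f"Utils path: {utils_path} (newly added: {added})")
```

Because the helper checks membership first, re-running the cell leaves `sys.path` unchanged.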


In [46]:
# Load the raw data
print(f"Attempting to load raw data from: {raw_data_path}")
df = load_data(raw_data_path)

# Check if data is loaded
if df is not None:
    # Display the first few rows of the dataframe
    display(df.head())
else:
    print(f"File not found at {raw_data_path}. Please check the file path.")


Attempting to load raw data from: d:\Customer-Churn-Analysis\data/raw/Dataset (ATS)-1.csv
Data loaded successfully from d:\Customer-Churn-Analysis\data/raw/Dataset (ATS)-1.csv


Unnamed: 0,gender,SeniorCitizen,Dependents,tenure,PhoneService,MultipleLines,InternetService,Contract,MonthlyCharges,Churn
0,Female,0,No,1,No,No,DSL,Month-to-month,29.85,No
1,Male,0,No,34,Yes,No,DSL,One year,56.95,No
2,Male,0,No,2,Yes,No,DSL,Month-to-month,53.85,Yes
3,Male,0,No,45,No,No,DSL,One year,42.3,No
4,Female,0,No,2,Yes,No,Fiber optic,Month-to-month,70.7,Yes


## Step 2: Load Data

In this step, I load the raw data from the CSV file located in the `data/raw/` directory. The custom `load_data` function from the `data_loader` module is used to read the CSV file into a pandas DataFrame.
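The source of `load_data` is not shown in this notebook. A minimal sketch of a loader with the same observable behaviour (a DataFrame on success, `None` when the file is missing) might look like the following; the name `load_data_sketch` and its internals are assumptions.

```python
import os

import pandas as pd


def load_data_sketch(path):
    """Read a CSV into a DataFrame, or return None if the file does not exist."""
    if not os.path.exists(path):
        print(f"File not found at {path}")
        return None
    df = pd.read_csv(path)
    print(f"Data loaded successfully from {path}")
    return df
```

Returning `None` instead of raising lets the calling cell branch on `if df is not None`, as the cell above does.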


In [47]:
# Clean the loaded data
if df is not None:
    df_cleaned = clean_data(df)

    # Display the first few rows of the cleaned dataframe
    display(df_cleaned.head())
else:
    # Keep the name defined so later cells can test it without a NameError
    df_cleaned = None
    print("Data loading failed, skipping cleaning step.")


Missing values handled by dropping rows with missing values.
Categorical columns ['gender', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract', 'Churn'] encoded.


Unnamed: 0,gender_1,gender_2,SeniorCitizen,Dependents_1,Dependents_2,tenure,PhoneService_1,PhoneService_2,MultipleLines_1,MultipleLines_2,InternetService_1,InternetService_2,Contract_1,Contract_2,Contract_3,MonthlyCharges,Churn_1,Churn_2
0,1,0,0,1,0,1,1,0,1,0,1,0,1,0,0,29.85,1,0
1,0,1,0,1,0,34,0,1,1,0,1,0,0,1,0,56.95,1,0
2,0,1,0,1,0,2,0,1,1,0,1,0,1,0,0,53.85,0,1
3,0,1,0,1,0,45,1,0,1,0,1,0,0,1,0,42.3,1,0
4,1,0,0,1,0,2,0,1,1,0,0,1,1,0,0,70.7,0,1


## Step 3: Clean Data

In this step, I clean the loaded data using the custom `clean_data` function from the `data_cleaner` module. This function handles missing values and encodes categorical variables. The cleaned data is then displayed.
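The two operations reported in the output (dropping rows with missing values, then encoding categorical columns) can be sketched as below. This is an illustration only: `clean_data_sketch` is a hypothetical name, and it uses `pandas.get_dummies`, so its column names (for example `gender_Female`) differ from the `gender_1`, `gender_2` naming produced by the project's own function.

```python
import pandas as pd


def clean_data_sketch(df):
    """Drop rows with missing values, then one-hot encode non-numeric columns."""
    cleaned = df.dropna()
    categorical = cleaned.select_dtypes(exclude=["number"]).columns.tolist()
    if not categorical:
        return cleaned
    return pd.get_dummies(cleaned, columns=categorical, dtype=int)
```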

In [48]:
# Save the cleaned data
if df_cleaned is not None:
    # Save the cleaned data to the interim directory
    df_cleaned.to_csv(interim_cleaned_data_path, index=False)

    # Save the cleaned data to the preprocessed_dataset directory
    df_cleaned.to_csv(preprocessed_data_path, index=False)

    print(f"Cleaned data saved to interim at {interim_cleaned_data_path}")
    print(f"Cleaned data saved to preprocessed_dataset at {preprocessed_data_path}")
else:
    print("Data cleaning failed, skipping save step.")


Cleaned data saved to interim at d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Cleaned data saved to preprocessed_dataset at d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv


## Step 4: Save Cleaned Data

In this step, I save the cleaned data to the `data/interim/` and `Data_Preparation/preprocessed_dataset/` directories. The cleaned DataFrame is written to a CSV file using the pandas `DataFrame.to_csv` method.
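
`to_csv` does not create missing parent directories, so a save step can fail on a fresh checkout. A minimal sketch of a helper that creates them first is shown here; `save_csv` is a hypothetical name, not part of the project's utils.

```python
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path as CSV, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path
```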


# Task 2: Handle Missing Data and Encode Categorical Variables



In [49]:
# Step 5 - Handle Missing Data
# Import the function to handle missing data from the custom module
try:
    from handle_missing_and_encode import handle_missing_data
except ModuleNotFoundError as e:
    print("Module import unsuccessful:", e)

# Apply the function to handle missing data on the cleaned dataset
df_missing_handled = handle_missing_data(df_cleaned)
if df_missing_handled is not None:
    # Display the first few rows of the dataframe after handling missing data
    display(df_missing_handled.head())
else:
    print("Handling missing data failed.")


Missing data handled by mean imputation.


Unnamed: 0,gender_1,gender_2,SeniorCitizen,Dependents_1,Dependents_2,tenure,PhoneService_1,PhoneService_2,MultipleLines_1,MultipleLines_2,InternetService_1,InternetService_2,Contract_1,Contract_2,Contract_3,MonthlyCharges,Churn_1,Churn_2
0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,29.85,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0,34.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,56.95,1.0,0.0
2,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,53.85,0.0,1.0
3,0.0,1.0,0.0,1.0,0.0,45.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,42.3,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,70.7,0.0,1.0


 
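## Step 5: Handle Missing Data

In this step, I handle missing data points in the dataset using mean imputation, which fills missing values in the numerical columns with each column's mean. Because Step 3 already dropped rows with missing values, this pass should find nothing left to fill and acts as a safeguard; the output above also shows every column converted to floating point. The dataset stays complete and ready for further analysis.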

```Python
df_missing_handled = handle_missing_data(df_cleaned)
```
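
The internals of `handle_missing_data` are not shown here. As a minimal sketch of mean imputation on numerical columns, assuming plain pandas and a hypothetical helper name `impute_numeric_means`:

```python
import pandas as pd


def impute_numeric_means(df):
    """Fill missing values in numeric columns with each column's mean."""
    imputed = df.copy()
    numeric_cols = imputed.select_dtypes(include=["number"]).columns
    # DataFrame.mean() returns one mean per column; fillna aligns on column name
    imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].mean())
    return imputed
```

Non-numeric columns are left untouched, so any missing labels would still need separate handling.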


In [50]:
# Step 6 - Encode Categorical Variables
# Import the function to encode categorical variables from the custom module
try:
    from handle_missing_and_encode import encode_categorical_variables
except ModuleNotFoundError as e:
    print("Module import unsuccessful:", e)

# Apply the function to encode categorical variables on the dataframe with handled missing data
df_encoded = encode_categorical_variables(df_missing_handled)
if df_encoded is not None:
    # Display the first few rows of the dataframe after encoding categorical variables
    display(df_encoded.head())
else:
    print("Encoding categorical variables failed.")

No categorical columns found to encode.


Unnamed: 0,gender_1,gender_2,SeniorCitizen,Dependents_1,Dependents_2,tenure,PhoneService_1,PhoneService_2,MultipleLines_1,MultipleLines_2,InternetService_1,InternetService_2,Contract_1,Contract_2,Contract_3,MonthlyCharges,Churn_1,Churn_2
0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,29.85,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0,34.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,56.95,1.0,0.0
2,0.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,53.85,0.0,1.0
3,0.0,1.0,0.0,1.0,0.0,45.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,42.3,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,70.7,0.0,1.0



## Step 6: Encode Categorical Variables

In this step, I encode categorical variables into numerical format, which many machine learning algorithms require as input. The function uses the `OneHotEncoder` from sklearn to transform categorical variables into binary columns. In this run it reports that no categorical columns were found, because `clean_data` had already encoded them in Step 3.

```Python
df_encoded = encode_categorical_variables(df_missing_handled)
```
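
The "No categorical columns found to encode." message can be reproduced with a simple dtype check. The sketch below is an assumption about the logic, not the project's code: the helper names are hypothetical, and `pandas.get_dummies` stands in for sklearn's `OneHotEncoder`.

```python
import pandas as pd


def find_categorical_columns(df):
    """Return the names of columns that are neither numeric nor boolean."""
    return df.select_dtypes(exclude=["number", "bool"]).columns.tolist()


def encode_if_needed(df):
    """One-hot encode categorical columns, or return df unchanged if none exist."""
    categorical = find_categorical_columns(df)
    if not categorical:
        print("No categorical columns found to encode.")
        return df
    return pd.get_dummies(df, columns=categorical, dtype=int)
```

A frame that has already been one-hot encoded contains only numeric columns, so the second function returns it as-is.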

In [51]:
# Step 7 - Save Processed Data
# Save the processed dataframe to the specified interim and preprocessed paths
if df_encoded is not None:
    df_encoded.to_csv(interim_cleaned_data_path, index=False)
    df_encoded.to_csv(preprocessed_data_path, index=False)
    print("Processed data saved to interim at", interim_cleaned_data_path)
    print("Processed data saved to preprocessed_dataset at", preprocessed_data_path)
else:
    print("Processed data saving failed.")

Processed data saved to interim at d:\Customer-Churn-Analysis\data/interim/cleaned_dataset.csv
Processed data saved to preprocessed_dataset at d:\Customer-Churn-Analysis\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv




## Step 7: Save Processed Data

In this step, I save the processed data, with missing values handled and categorical variables encoded, to the interim and preprocessed directories for further use. Both files are written to the same paths used in Step 4, so they overwrite the cleaned data saved there.

```Python
df_encoded.to_csv(interim_cleaned_data_path, index=False)
df_encoded.to_csv(preprocessed_data_path, index=False)
```

# Summary

In this notebook, I have successfully completed the following steps:

1. Imported necessary libraries and set up paths.
2. Loaded raw data.
3. Cleaned data by handling missing values and encoding categorical variables.
4. Saved the cleaned data to the interim and preprocessed paths.
5. Handled missing data points with mean imputation.
6. Encoded categorical variables.
7. Saved processed data to the interim and preprocessed paths.

## Next Steps

1. Perform feature scaling and normalization to ensure that all features are on a comparable scale, which is crucial for the performance of many machine learning models.
2. Conduct exploratory data analysis (EDA) to understand the data distribution, relationships between variables, and identify any anomalies or patterns.
3. Document each step and summarize the results to ensure clarity and reproducibility for all team members and stakeholders involved in the project.
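
As a starting point for the feature scaling step, here is a minimal sketch of z-score standardization in plain pandas. The helper name `standardize` is hypothetical, the choice of z-scores over min-max scaling is an assumption, and the sample values are the five rows displayed earlier in this notebook.

```python
import pandas as pd


def standardize(df, columns):
    """Return a copy of df with the given columns rescaled to mean 0 and unit variance."""
    scaled = df.copy()
    for col in columns:
        mean = scaled[col].mean()
        std = scaled[col].std(ddof=0)  # population standard deviation
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero
        scaled[col] = (scaled[col] - mean) / std if std > 0 else 0.0
    return scaled


sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.3, 70.7],
})
scaled = standardize(sample, ["tenure", "MonthlyCharges"])
print(scaled.round(3))
```

In a modelling workflow the mean and standard deviation would normally be computed on the training split only and then reused for the test split.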
