# Feature Engineering and Data Splitting

In this notebook, we perform feature engineering and data splitting for our customer churn analysis project. The steps are:
1. Importing the necessary libraries and setting up paths for loading the cleaned data.
2. Creating new features intended to improve model performance.
3. Splitting the data into training and testing sets.
4. Saving the processed data for further analysis.


# Task 1: Feature Engineering


## 1. Import Libraries and Set Up Paths

We start by importing necessary libraries and setting up paths to our data files.

In [10]:
# Import necessary libraries
import pandas as pd
import os

# Define the base path relative to the current working directory
# (os.path.abspath('') resolves to the working directory, not to this file's location)
base_path = os.path.dirname(os.path.abspath(''))
project_root = os.path.abspath(os.path.join(base_path, '..'))

# Define paths to the data files
interim_cleaned_data_path = os.path.join(project_root, 'data', 'interim', 'cleaned_dataset.csv')
processed_data_path = os.path.join(project_root, 'data', 'processed', 'processed_dataset_with_features.csv')

# Print the paths for verification
print(f"Interim cleaned data path: {interim_cleaned_data_path}")
print(f"Processed data path: {processed_data_path}")


Interim cleaned data path: d:\Customer-Churn-Analysis\data\interim\cleaned_dataset.csv
Processed data path: d:\Customer-Churn-Analysis\data\processed\processed_dataset_with_features.csv
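
To make the two-level climb in the path setup concrete, here is a minimal sketch using string-only path operations. The working directory shown is hypothetical (the folder names `notebooks` and `feature_engineering` are assumptions, not taken from the project); only the resulting root matches the output above.

```python
import ntpath  # Windows path rules; string-only functions work on any OS

# Hypothetical working directory, two levels below the project root
cwd = r"d:\Customer-Churn-Analysis\notebooks\feature_engineering"

# Step 1: dirname drops the last path component
base_path = ntpath.dirname(cwd)

# Step 2: joining '..' and normalizing climbs one more level
project_root = ntpath.normpath(ntpath.join(base_path, ".."))

print(base_path)
print(project_root)
```

If the notebook is started from a different folder depth, the same two steps land on a different directory, so the printed paths are worth checking before loading data.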


## 2. Load the Cleaned Data

Next, we load the cleaned data from the specified path. This data has already been preprocessed to handle missing values and encode categorical variables.


In [11]:
def load_data(file_path):
    """
    Load data from a CSV file.
    
    Parameters:
    file_path (str): The path to the CSV file to be loaded.
    
    Returns:
    DataFrame: The data loaded from the CSV file, or None if an error occurs.
    """
    try:
        data = pd.read_csv(file_path)
        print(f"Data loaded successfully from {file_path}")
        return data
    except Exception as e:
        # Report the problem and return None explicitly, as documented above
        print(f"An error occurred: {e}")
        return None

# Load the cleaned data
df_cleaned = load_data(interim_cleaned_data_path)


Data loaded successfully from d:\Customer-Churn-Analysis\data\interim\cleaned_dataset.csv
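
Because `load_data` returns `None` on failure, a bad path only surfaces later, when the next cell tries to use the result. As an alternative, the sketch below fails immediately. The name `load_data_strict` and the `required_columns` parameter are hypothetical and not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def load_data_strict(file_path, required_columns=()):
    """Load a CSV and raise instead of returning None when something is wrong."""
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"No CSV file found at {file_path}")
    data = pd.read_csv(file_path)
    # Check that the columns needed by later cells are actually present
    missing = [col for col in required_columns if col not in data.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return data


# Self-contained demonstration using a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    pd.DataFrame(
        {"tenure": [1, 34], "MonthlyCharges": [29.85, 56.95]}
    ).to_csv(demo_path, index=False)
    demo_df = load_data_strict(demo_path, required_columns=("tenure", "MonthlyCharges"))
    print(demo_df.shape)
```

Raising early keeps the error message next to its cause, which tends to be easier to debug in a notebook than a later `AttributeError` on `None`.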


## 3. Create New Features

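We create new features intended to improve the predictive power of our model. `Charges_Per_Tenure` and `TotalCharges` are computed from `MonthlyCharges` and `tenure`. `Contract_Type` and `Payment_Method` are ordinal encodings that are added only when the raw `Contract` and `PaymentMethod` columns are present. The cleaned dataset used here does not contain those raw columns (the contract is already one-hot encoded), so the output below shows only the first two features.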


In [12]:
def create_new_features(df):
    """
    Create new features for the dataset.
    
    Parameters:
    df (DataFrame): The input DataFrame for which new features will be created.
    
    Returns:
    DataFrame: The DataFrame with the newly created features.
    """
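    # Both numeric features need tenure and MonthlyCharges
    if 'tenure' in df.columns and 'MonthlyCharges' in df.columns:
        # Add 1 to tenure so customers with zero tenure do not cause division by zero
        df['Charges_Per_Tenure'] = df['MonthlyCharges'] / (df['tenure'] + 1)
        # Approximate total spend as the monthly charge times months of tenure
        df['TotalCharges'] = df['MonthlyCharges'] * df['tenure']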

    contract_mapping = {
        'Month-to-month': 0,
        'One year': 1,
        'Two year': 2
    }
    if 'Contract' in df.columns:
        df['Contract_Type'] = df['Contract'].map(contract_mapping)

    payment_mapping = {
        'Electronic check': 0,
        'Mailed check': 1,
        'Bank transfer (automatic)': 2,
        'Credit card (automatic)': 3
    }
    if 'PaymentMethod' in df.columns:
        df['Payment_Method'] = df['PaymentMethod'].map(payment_mapping)
    
    return df

# Create new features for the cleaned data
df_cleaned = create_new_features(df_cleaned)

# Display the first few rows of the dataset with new features
df_cleaned.head()


Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,gender_Female,gender_Male,Dependents_No,Dependents_Yes,PhoneService_No,PhoneService_Yes,MultipleLines_No,MultipleLines_Yes,InternetService_DSL,InternetService_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,Churn_No,Churn_Yes,Charges_Per_Tenure,TotalCharges
0,0.0,1.0,29.85,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14.925,29.85
1,0.0,34.0,56.95,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.627143,1936.3
2,0.0,2.0,53.85,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,17.95,107.7
3,0.0,45.0,42.3,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.919565,1903.5
4,0.0,2.0,70.7,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,23.566667,141.4
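
Because the cleaned data already stores the contract as one-hot columns (`Contract_Month-to-month`, `Contract_One year`, `Contract_Two year`), the `Contract` column that `create_new_features` looks for is absent. One possible way to still obtain an ordinal `Contract_Type` is sketched below on a toy frame. This is an illustration under that assumption, not a step the notebook performs.

```python
import pandas as pd

# Toy frame mimicking the one-hot contract columns shown in the output above
toy = pd.DataFrame({
    "Contract_Month-to-month": [1.0, 0.0, 0.0],
    "Contract_One year": [0.0, 1.0, 0.0],
    "Contract_Two year": [0.0, 0.0, 1.0],
})

# Same ordinal mapping as in create_new_features
contract_mapping = {"Month-to-month": 0, "One year": 1, "Two year": 2}
contract_columns = [f"Contract_{name}" for name in contract_mapping]

# idxmax(axis=1) returns, per row, the name of the column holding the largest value
contract_label = (
    toy[contract_columns].idxmax(axis=1).str.replace("Contract_", "", regex=False)
)
toy["Contract_Type"] = contract_label.map(contract_mapping)

print(toy["Contract_Type"].tolist())
```

One caveat: a row in which all three contract columns are zero would be labeled with the first column, so such rows should be checked for before relying on this approach.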


## 4. Save the Processed Dataset

Finally, we save the dataset with the newly created features to the specified path.


In [13]:
# Save the dataset with new features
df_cleaned.to_csv(processed_data_path, index=False)
print(f"Dataset with new features saved to {processed_data_path}")


Dataset with new features saved to d:\Customer-Churn-Analysis\data\processed\processed_dataset_with_features.csv
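
The `Unnamed: 0` column in the preview suggests that an earlier step probably wrote the cleaned CSV together with its row index. The save above passes `index=False`, which avoids adding another such column. The sketch below, on made-up data, shows how the column arises and one way to drop it.

```python
import io

import pandas as pd

frame = pd.DataFrame({"tenure": [1, 34], "MonthlyCharges": [29.85, 56.95]})

# Writing with the default index=True stores the row index as an unnamed first column
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# On reload, pandas names that header-less column 'Unnamed: 0'
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Drop the leftover index column so it is not treated as a feature
cleaned = reloaded.drop(columns=["Unnamed: 0"])
print(list(cleaned.columns))
```

Leaving the column in place would feed a plain row counter to the model as if it were a feature, so removing it before training is usually the safer choice.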


# Task 2: Data Splitting



## 1. Import Libraries and Configuration

This cell imports the necessary libraries and loads the configuration file from the main directory. It finds the project root directory by searching for the `config.json` file and sets the file paths using the configuration settings. The absolute paths are printed for verification.

In [14]:
# Import necessary libraries
import pandas as pd
import os
import json
from sklearn.model_selection import train_test_split

# Function to find the project root directory by looking for config.json
def find_project_root(filename='config.json'):
    current_dir = os.getcwd()
    while True:
        if filename in os.listdir(current_dir):
            return current_dir
        parent_dir = os.path.dirname(current_dir)
        if parent_dir == current_dir:
            raise FileNotFoundError(f"{filename} not found in any parent directories.")
        current_dir = parent_dir

# Find the project root directory
root_dir = find_project_root()
print("Project root directory:", root_dir)

# Load configuration
config_path = os.path.join(root_dir, 'config.json')
print("Config path:", config_path)

# Ensure the config file exists
if not os.path.exists(config_path):
    raise FileNotFoundError(f"Config file not found at {config_path}")

with open(config_path, 'r') as f:
    config = json.load(f)

# Set file paths using the configuration settings
processed_data_path = os.path.join(root_dir, config['processed_data_path'])
train_path = os.path.join(root_dir, 'data', 'train', 'train_dataset.csv')
test_path = os.path.join(root_dir, 'data', 'test', 'test_dataset.csv')
train_path_prep = os.path.join(root_dir, 'Data_Preparation', 'training_sets', 'train_dataset.csv')
test_path_prep = os.path.join(root_dir, 'Data_Preparation', 'testing_sets', 'test_dataset.csv')

# Print the absolute paths for verification
print(f"Processed data path: {processed_data_path}")
print(f"Train path: {train_path}")
print(f"Test path: {test_path}")
print(f"Train path for Data Preparation: {train_path_prep}")
print(f"Test path for Data Preparation: {test_path_prep}")


Project root directory: d:\Customer-Churn-Analysis
Config path: d:\Customer-Churn-Analysis\config.json
Processed data path: d:\Customer-Churn-Analysis\data/processed/processed_dataset_with_features.csv
Train path: d:\Customer-Churn-Analysis\data\train\train_dataset.csv
Test path: d:\Customer-Churn-Analysis\data\test\test_dataset.csv
Train path for Data Preparation: d:\Customer-Churn-Analysis\Data_Preparation\training_sets\train_dataset.csv
Test path for Data Preparation: d:\Customer-Churn-Analysis\Data_Preparation\testing_sets\test_dataset.csv
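
The notebook does not show `config.json` itself. Judging from the printed path, it contains at least a `processed_data_path` key holding a path relative to the project root. The sketch below writes a stand-in file with that single key and checks it on load. The helper name `load_config` and the `REQUIRED_KEYS` tuple are hypothetical.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("processed_data_path",)  # the only key this notebook reads


def load_config(config_path, required_keys=REQUIRED_KEYS):
    """Read a JSON config and check that the keys used later are present."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in required_keys if key not in config]
    if missing:
        raise KeyError(f"Config is missing keys: {missing}")
    return config


# Stand-in config file; the real file may contain more settings
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_config_path = os.path.join(tmp_dir, "config.json")
    with open(demo_config_path, "w", encoding="utf-8") as f:
        json.dump(
            {"processed_data_path": "data/processed/processed_dataset_with_features.csv"},
            f,
        )
    demo_config = load_config(demo_config_path)
    print(demo_config["processed_data_path"])
```

The forward slashes in the configured value explain the mixed separators in the printed processed data path. Wrapping the joined path in `os.path.normpath` would make it consistent on Windows.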


## 2. Load Data

This cell defines a function to load data from a specified CSV file path and prints a confirmation message upon successful loading. The processed dataset with new features is then loaded, and the first few rows of the dataset are displayed for verification.


In [15]:
# Load the processed dataset with new features
def load_data(file_path):
    try:
        data = pd.read_csv(file_path)
        print(f"Data loaded successfully from {file_path}")
        return data
    except Exception as e:
        print(f"An error occurred: {e}")

df = load_data(processed_data_path)
df.head()


Data loaded successfully from d:\Customer-Churn-Analysis\data/processed/processed_dataset_with_features.csv


Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,gender_Female,gender_Male,Dependents_No,Dependents_Yes,PhoneService_No,PhoneService_Yes,MultipleLines_No,MultipleLines_Yes,InternetService_DSL,InternetService_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,Churn_No,Churn_Yes,Charges_Per_Tenure,TotalCharges
0,0.0,1.0,29.85,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14.925,29.85
1,0.0,34.0,56.95,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.627143,1936.3
2,0.0,2.0,53.85,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,17.95,107.7
3,0.0,45.0,42.3,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.919565,1903.5
4,0.0,2.0,70.7,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,23.566667,141.4


## 3. Split Data

This cell defines a function that splits the dataset into training and testing sets, with a configurable test size (default 0.2) and a fixed random state for reproducibility. It then applies the function and prints the size of each set for verification.

In [16]:
# Split the dataset into training and testing sets
def split_data(df, test_size=0.2, random_state=42):
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=random_state)
    print(f"Data split into training (size={len(train_df)}) and testing (size={len(test_df)}) sets.")
    return train_df, test_df

train_df, test_df = split_data(df)


Data split into training (size=5634) and testing (size=1409) sets.
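
The split above is purely random. For churn data, where the positive class is often the minority, a stratified split that keeps the churn rate equal in both sets may be worth considering. The sketch below uses made-up toy data and assumes `Churn_Yes` is the target column. It is a variation on `split_data`, not what the notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a 30% churn rate (made-up numbers for illustration)
toy = pd.DataFrame({
    "MonthlyCharges": [float(i) for i in range(100)],
    "Churn_Yes": [1] * 30 + [0] * 70,
})

# stratify keeps the proportion of each Churn_Yes value the same in both sets
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Churn_Yes"]
)

print(len(train_toy), len(test_toy))
print(train_toy["Churn_Yes"].sum(), test_toy["Churn_Yes"].sum())
```

As a check on the sizes printed earlier: the two sets add up to 5634 + 1409 = 7043 rows, and 20% of 7043 is 1408.6, which `train_test_split` rounds up to 1409 test rows.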


## 4. Save Split Data

This cell saves the training and testing datasets to the specified paths in the main data directory and the Data Preparation directory. The paths are printed so the save locations can be checked.

In [17]:
# Save the training and testing datasets to the specified paths
train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)
train_df.to_csv(train_path_prep, index=False)
test_df.to_csv(test_path_prep, index=False)
print(f"Training and testing datasets saved to {train_path} and {test_path}")
print(f"Training and testing datasets also saved to {train_path_prep} and {test_path_prep}")


Training and testing datasets saved to d:\Customer-Churn-Analysis\data\train\train_dataset.csv and d:\Customer-Churn-Analysis\data\test\test_dataset.csv
Training and testing datasets also saved to d:\Customer-Churn-Analysis\Data_Preparation\training_sets\train_dataset.csv and d:\Customer-Churn-Analysis\Data_Preparation\testing_sets\test_dataset.csv
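
`DataFrame.to_csv` does not create missing folders, so the save cell relies on the four target directories already existing. A minimal sketch of a helper that creates them first is shown below. The name `save_csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, path):
    """Create the parent directory if needed, then write the CSV without the index."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Demonstration in a temporary directory where data/train does not exist yet
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "data", "train", "train_dataset.csv")
    save_csv(pd.DataFrame({"tenure": [1, 2]}), demo_path)
    saved_ok = os.path.isfile(demo_path)
    print(saved_ok)
```

With such a helper, the four `to_csv` calls could be replaced by a loop over the `(DataFrame, path)` pairs, which also makes the notebook runnable on a fresh clone of the project.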


## Overall Summary
In this notebook, we have successfully performed feature engineering and data splitting on our dataset. Below is a summary of the tasks accomplished and the results obtained:

## Task 1: Feature Engineering

### Import Libraries and Set Up Paths

- Imported the necessary libraries (`pandas` and `os`).
- Set up paths for loading and saving datasets.

### Load Data

- Loaded the cleaned dataset from the interim data path.
- Confirmed that the file loaded successfully.

### Create New Features

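- Created `Charges_Per_Tenure` (`MonthlyCharges / (tenure + 1)`) and `TotalCharges` (`MonthlyCharges * tenure`).
- Defined ordinal encodings for contract type and payment method. These were not applied in this run because the cleaned dataset does not contain the raw `Contract` and `PaymentMethod` columns.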

### Save Processed Data

- Saved the dataset with newly created features to a specified path for further analysis and modeling.

## Task 2: Data Splitting

### Import Libraries and Configuration

- Imported necessary libraries and configuration settings.
- Set up paths for loading the processed dataset and saving the split datasets.

### Load Processed Data

- Loaded the processed dataset with new features.
- Verified the structure and columns of the loaded dataset.

### Split Data

- Split the dataset into training and testing sets using an 80-20 split with `random_state=42`.
- Printed the resulting sizes: 5,634 training rows and 1,409 testing rows.

### Save Split Data

- Saved the training and testing datasets to specified paths in both the main data directory and the Data Preparation directory.

## Results Obtained

### Feature Engineering

- The dataset now includes two additional numeric features, `Charges_Per_Tenure` and `TotalCharges`, which may help a model relate churn to spending over the customer's tenure.
- The dataset is encoded and transformed, ready for further analysis and model training.

### Data Splitting

- The dataset is successfully split into training and testing sets, ensuring that we have separate data for model training and evaluation.
- The split datasets are saved in designated directories, making them easily accessible for the next steps in the analysis.

## Next Steps

1. Model Training: Use the training dataset to train various machine learning models.
2. Model Evaluation: Use the testing dataset to evaluate the performance of the trained models and fine-tune them as needed.
3. Exploratory Data Analysis (EDA): Perform detailed EDA on the training dataset to gain more insights and identify potential improvements for feature engineering.
4. Model Selection and Deployment: Select the best-performing model based on evaluation metrics and prepare it for deployment.

This concludes the feature engineering and data splitting tasks, setting a solid foundation for the subsequent stages in our data analysis and machine learning pipeline.
