# Data Splitting and Feature Engineering

This notebook performs feature engineering and data splitting for the customer churn analysis project. The steps are:
1. Import the necessary libraries and set up the paths for loading the data.
2. Create new features to enhance model performance.
3. Split the data into training and testing sets.
4. Save the processed data for further analysis.


## Step 1: Importing Libraries and Setting Up Paths

In this section, we import the required libraries and set up the project paths. We also define a function that finds the project root directory by searching the current working directory and its parents for a `config.json` file. That configuration file supplies the file paths used by the data processing steps.

**Steps**:
1. Import Libraries: We import pandas, os, and json, along with `train_test_split` from scikit-learn for splitting the data.
2. Find Project Root Directory: We define a function `find_project_root` that walks up from the current working directory until it finds `config.json`.
3. Load Configuration: We load `config.json` and build absolute paths for the processed data, the training set, and the testing set.
4. Load the Data: We load the dataset from the processed data path for further processing.
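
As an alternative sketch of the root lookup and path setup described above, the same idea can be written with `pathlib`. The helper names `find_root` and `load_paths` are hypothetical, and the relative paths in the demo config are placeholder values rather than the project's real ones; only the three key names are taken from the notebook's code.

```python
import json
import tempfile
from pathlib import Path


def find_root(start, filename="config.json"):
    """Return the first directory at or above `start` that contains `filename`."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        if (directory / filename).is_file():
            return directory
    raise FileNotFoundError(f"{filename} not found in {start} or any parent directory.")


def load_paths(root, filename="config.json"):
    """Read the config file and resolve every '*_path' entry against the root."""
    with open(root / filename, "r") as f:
        config = json.load(f)
    return {key: root / value for key, value in config.items() if key.endswith("_path")}


# Demo in a throwaway directory; the relative paths are placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve()
    demo_config = {
        "processed_data_path": "data/processed.csv",
        "train_data_path": "data/train.csv",
        "test_data_path": "data/test.csv",
    }
    (project / "config.json").write_text(json.dumps(demo_config))
    notebooks = project / "notebooks"
    notebooks.mkdir()

    root = find_root(notebooks)  # start the search one level below the root
    paths = load_paths(root)
    found_root = root == project
    train_name = paths["train_data_path"].name
```

Passing the start directory explicitly, instead of relying on `os.getcwd()`, makes the lookup easier to test outside a notebook session.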


In [47]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import json
import os

# Function to find the project root directory by looking for config.json
def find_project_root(filename='config.json'):
    """
    Find the project root directory by looking for the specified filename.
    
    Parameters:
    filename (str): The filename to search for (default is 'config.json').
    
    Returns:
    str: The path to the project root directory.
    """
    current_dir = os.getcwd()
    while True:
        if filename in os.listdir(current_dir):
            return current_dir
        parent_dir = os.path.dirname(current_dir)
        if parent_dir == current_dir:
            raise FileNotFoundError(f"{filename} not found in any parent directories.")
        current_dir = parent_dir

# Find the project root directory
root_dir = find_project_root()
print("Project root directory:", root_dir)

# Build the path to the configuration file
config_path = os.path.join(root_dir, 'config.json')
print("Config path:", config_path)

# Fail early with a clear message if the config file is missing
if not os.path.exists(config_path):
    raise FileNotFoundError(f"Config file not found at {config_path}")

# Load the configuration settings into the dictionary `config`
with open(config_path, 'r') as f:
    config = json.load(f)

# Build absolute paths for the processed, training, and testing data
processed_data_path = os.path.join(root_dir, config['processed_data_path'])
train_path = os.path.join(root_dir, config['train_data_path'])
test_path = os.path.join(root_dir, config['test_data_path'])

# Load the dataset from the processed data path into the DataFrame `df`
df = pd.read_csv(processed_data_path)

# Print the column names of the loaded DataFrame for verification
print(df.columns)


Project root directory: d:\Customer-Churn-Analysis
Config path: d:\Customer-Churn-Analysis\config.json
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female',
       'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No',
       'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year',
       'Churn_No', 'Churn_Yes', 'Charges_Per_Tenure', 'TotalCharges'],
      dtype='object')


## Step 2: Feature Engineering: Creating New Features

In this section, we create new features intended to enhance the dataset's predictive power: `Charges_Per_Tenure`, `TotalCharges`, `Contract_Type`, and `Payment_Method`. These features are derived from existing columns. The code below adds only `Charges_Per_Tenure` and `TotalCharges` unconditionally. It adds `Payment_Method` only when the one-hot encoded `PaymentMethod_*` columns are present, and it does not create `Contract_Type`.

**Steps**:
1. Charges_Per_Tenure: MonthlyCharges divided by tenure + 1. Adding 1 to the denominator avoids division by zero for customers with a tenure of 0.
2. TotalCharges: MonthlyCharges multiplied by tenure. It approximates the total charges incurred by the customer, assuming a constant monthly charge.
3. Contract_Type: Intended to map the contract type (month-to-month, one year, two years) to numerical values. The loaded dataset already stores the contract as the one-hot columns Contract_Month-to-month, Contract_One year, and Contract_Two year, and the code does not add a separate Contract_Type column.
4. Payment_Method: Derived from the one-hot encoded PaymentMethod_* columns, where each payment method (electronic check, mailed check, bank transfer, credit card) is mapped to a numerical code. The loaded dataset contains no PaymentMethod_* columns, so this step is skipped in the run shown here.
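
To make the two charge features concrete, here is a minimal sketch on a made-up three-row DataFrame (the values are illustrative, not rows from the churn dataset). It uses the formulas from this step, with `tenure + 1` as the denominator, so a customer with a tenure of 0 still gets a finite value.

```python
import pandas as pd

# Illustrative values only; not rows from the churn dataset
toy = pd.DataFrame({
    "tenure": [0, 1, 9],
    "MonthlyCharges": [50.0, 40.0, 100.0],
})

# MonthlyCharges / (tenure + 1): the +1 keeps the denominator non-zero
toy["Charges_Per_Tenure"] = toy["MonthlyCharges"] / (toy["tenure"] + 1)

# MonthlyCharges * tenure: a proxy for the total billed so far
toy["TotalCharges"] = toy["MonthlyCharges"] * toy["tenure"]

print(toy)
```

For the row with a tenure of 0, `Charges_Per_Tenure` equals `MonthlyCharges` and `TotalCharges` is 0, so the derived total may understate what a first-month customer has actually been billed.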


In [42]:
# Feature Engineering with Debugging
def create_new_features(df):
    """
    Create new features for the dataset.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.

    Returns:
    pd.DataFrame: The DataFrame with new features added.
    """
    # Create a copy of the dataframe to avoid modifying the original one
    df = df.copy()

    # Initial columns
    print("Initial columns:", df.columns.tolist())

    # Add Charges_Per_Tenure and TotalCharges if applicable
    if 'tenure' in df.columns and 'MonthlyCharges' in df.columns:
        df['Charges_Per_Tenure'] = df['MonthlyCharges'] / (df['tenure'] + 1)
        df['TotalCharges'] = df['MonthlyCharges'] * df['tenure']
        print("Added Charges_Per_Tenure and TotalCharges")

    # Mapping from payment method name to a numeric code
    payment_mapping = {
        'Electronic check': 0,
        'Mailed check': 1,
        'Bank transfer (automatic)': 2,
        'Credit card (automatic)': 3
    }
    # Collapse the one-hot PaymentMethod_* columns into a single coded column
    payment_columns = ['PaymentMethod_' + name for name in payment_mapping]
    if all(col in df.columns for col in payment_columns):
        # idxmax returns the name of the active column; strip the prefix
        # so the result matches the keys of payment_mapping
        active = df[payment_columns].idxmax(axis=1).str.replace('PaymentMethod_', '', regex=False)
        df['Payment_Method'] = active.map(payment_mapping)
        print("Mapped Payment_Method")

    # Columns after feature engineering
    print("Columns after feature engineering:", df.columns.tolist())

    print("New features created.")
    return df

# Apply feature engineering
df_features = create_new_features(df)

# Print final columns before saving
print("Final columns before saving:", df_features.columns.tolist())

# Save the DataFrame with new features to the specified path
processed_data_path = os.path.join(root_dir, config['processed_data_path'])
df_features.to_csv(processed_data_path, index=False)
print("Dataset with new features saved.")


Initial columns: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female', 'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No', 'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes', 'InternetService_DSL', 'InternetService_Fiber optic', 'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year', 'Churn_No', 'Churn_Yes', 'Charges_Per_Tenure', 'TotalCharges']
Added Charges_Per_Tenure and TotalCharges
Columns after feature engineering: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female', 'gender_Male', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No', 'PhoneService_Yes', 'MultipleLines_No', 'MultipleLines_Yes', 'InternetService_DSL', 'InternetService_Fiber optic', 'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year', 'Churn_No', 'Churn_Yes', 'Charges_Per_Tenure', 'TotalCharges']
New features created.
Final columns before saving: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender_Female', 'gender_Male', 'Dependents_No', 'D
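
The output above contains no "Mapped Payment_Method" line, because the loaded data has no PaymentMethod_* columns. As a sketch of how one-hot columns can be collapsed into a single coded column, the example below runs the idea on a made-up DataFrame. The rows are illustrative; the column names and codes follow the `payment_mapping` dictionary in the code cell.

```python
import pandas as pd

payment_mapping = {
    "Electronic check": 0,
    "Mailed check": 1,
    "Bank transfer (automatic)": 2,
    "Credit card (automatic)": 3,
}
prefix = "PaymentMethod_"
payment_columns = [prefix + name for name in payment_mapping]

# Made-up one-hot rows: exactly one payment column is 1 in each row
toy = pd.DataFrame(
    [[0, 1, 0, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]],
    columns=payment_columns,
)

# idxmax(axis=1) returns the *column name* holding the row maximum,
# e.g. "PaymentMethod_Mailed check", so the prefix is removed before mapping
active = toy[payment_columns].idxmax(axis=1).str.replace(prefix, "", regex=False)
toy["Payment_Method"] = active.map(payment_mapping)

print(toy["Payment_Method"].tolist())
```

Mapping the full column names directly would produce NaN for every row, since names such as `PaymentMethod_Mailed check` are not keys of the dictionary.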

## Step 3: Splitting the Dataset into Training and Testing Sets

In this section, we split the dataset into training and testing sets to prepare it for model training and evaluation. We define a function that performs the split, then save the resulting datasets to the paths set in Step 1.

**Steps**:
1. Define Split Function: We create a function `split_data` that divides the dataset into training and testing sets using a specified test size and a random seed for reproducibility.
2. Split the Data: Inside `split_data`, we call `train_test_split` from scikit-learn, stratified by the target column so that both sets keep a similar churn rate.
3. Save the Splits: The resulting training and testing datasets are saved to the specified file paths.
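
To illustrate what stratification does, the sketch below splits a made-up DataFrame with a 20% positive rate and compares the rate in each part. The data is illustrative; the `test_size`, `random_state`, and `stratify` arguments mirror the ones used in this step.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positives in a Churn_Yes-style column
toy = pd.DataFrame({
    "MonthlyCharges": [float(i) for i in range(100)],
    "Churn_Yes": [1] * 20 + [0] * 80,
})

train_df, test_df = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    stratify=toy["Churn_Yes"],  # keep the class ratio the same in both splits
)

train_rate = train_df["Churn_Yes"].mean()
test_rate = test_df["Churn_Yes"].mean()
print(len(train_df), len(test_df), train_rate, test_rate)
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate by chance, which makes evaluation metrics harder to compare across runs.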

In [43]:
# Identify the target variable column
target_column = 'Churn_Yes'

# Split the dataset into training and testing sets
def split_data(df, target, test_size=0.2, random_state=42):
    """
    Split the dataset into training and testing sets, stratified by the target variable.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame to be split.
    target (str): The target variable column name.
    test_size (float): The proportion of the dataset to include in the test split. Default is 0.2 (20% testing, 80% training).
    random_state (int): The seed used by the random number generator.
    
    Returns:
    pd.DataFrame, pd.DataFrame: The training and testing DataFrames.
    """
    # Extract the target variable; it is used only to stratify the split
    y = df[target]

    # Split the full DataFrame (features and target together), stratified by the target
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=random_state, stratify=y)
    print(f"Data split into training (size={len(train_df)}) and testing (size={len(test_df)}) sets.")
    return train_df, test_df

# Split data
train_df, test_df = split_data(df_features, target_column)

# Save the training and testing datasets to the specified paths
train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)
print("Training and testing datasets saved.")


Data split into training (size=5634) and testing (size=1409) sets.
Training and testing datasets saved.
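
The saved splits still contain both Churn_No and Churn_Yes. A later modelling step would presumably need to drop both before building the feature matrix, because Churn_No appears to be the one-hot complement of Churn_Yes and would otherwise leak the label. The sketch below assumes that complement relationship holds; the helper name `split_features_target` and the toy rows are hypothetical.

```python
import pandas as pd


def split_features_target(df, target="Churn_Yes", leak_columns=("Churn_No",)):
    """Return (X, y), dropping the target and any columns that encode it."""
    drop = [target, *[c for c in leak_columns if c in df.columns]]
    X = df.drop(columns=drop)
    y = df[target]
    return X, y


# Made-up rows standing in for the saved training split
toy_train = pd.DataFrame({
    "tenure": [1, 24, 60],
    "MonthlyCharges": [70.0, 55.5, 20.0],
    "Churn_No": [0, 1, 1],
    "Churn_Yes": [1, 0, 0],
})

X_train, y_train = split_features_target(toy_train)
print(X_train.columns.tolist())
```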


## Summary of Completed Tasks

### Loading the Cleaned Data:

1. Imported necessary libraries.
2. Loaded the dataset from the processed data path specified in the configuration file.

### Feature Engineering:

1. Created new features to enhance model performance.
   - Added Charges_Per_Tenure as the ratio of MonthlyCharges to tenure + 1.
   - Calculated TotalCharges as the product of MonthlyCharges and tenure.
   - Did not add Contract_Type: the contract type remains in the one-hot Contract_* columns.
   - Skipped Payment_Method: the loaded dataset has no PaymentMethod_* columns to map.

2. Saved the dataset with the new features to the specified path.

### Data Splitting:

1. Split the dataset with the newly created features into training and testing sets.
2. Saved the training and testing datasets to the specified paths.
