# Data Preparation and Processing Documentation

## Overview

This document describes how the data is prepared for the machine learning project. The data is split by date, the 'source' column is encoded and reduced with PCA, and the session-level records are cleaned and aggregated into one row per visitor ('fullVisitorId') for modeling.

## Data Splitting

Data is split into two distinct sets based on a cutoff date:
- **Training Data**: All records before May 1, 2017, are used for model training.
- **Testing Data**: Records from May 1, 2017, onwards are reserved for testing the model.
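
The split above can be sketched on a toy frame. The rows below are made up for illustration; only the 'date' column name and the cutoff come from this project.

```python
import pandas as pd

# Toy sessions; the values are invented, only the 'date' column name is from the dataset.
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "b", "c", "d"],
    "date": pd.to_datetime(["2017-04-29", "2017-04-30", "2017-05-01", "2017-05-02"]),
})

cutoff = pd.Timestamp("2017-05-01")

# Strictly before the cutoff goes to training; the cutoff day itself goes to testing.
train_part = sessions[sessions["date"] < cutoff].copy()
test_part = sessions[sessions["date"] >= cutoff].copy()

print(len(train_part), len(test_part))
```

Taking `.copy()` of each part is a design choice in this sketch: it lets later steps modify the two frames without touching the original.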

## Data Preparation Steps

### One-hot Encoding
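Categorical data from the 'source' column is transformed into a numerical format using one-hot encoding. The encoded columns are aligned to the full set of sources seen in the training and testing data, then summed per visitor ('fullVisitorId').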
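
A minimal sketch of this step on made-up rows follows. The source names and the `source_list` variable are illustrative assumptions, not values from the real data.

```python
import pandas as pd

# Toy visits; the source values are invented for illustration.
visits = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b"],
    "source": ["google", "youtube", "google"],
})

# Hypothetical full list of sources, including one that is absent from this sample.
source_list = ["google", "youtube", "bing"]

# One indicator column per source, as integers.
one_hot = pd.get_dummies(visits["source"], dtype=int)

# Align to the full list so every dataset ends up with identical columns.
one_hot = one_hot.reindex(columns=source_list, fill_value=0)

# Sum the indicators per visitor to get visit counts per source.
one_hot["fullVisitorId"] = visits["fullVisitorId"].values
per_visitor = one_hot.groupby("fullVisitorId").sum().reset_index()

print(per_visitor)
```

Reindexing against a shared list is what keeps the training and testing matrices the same width, even when a source appears in only one of them.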

### Dimensionality Reduction with PCA
Principal Component Analysis (PCA) is applied to the one-hot encoded data to reduce it to three components. The PCA model is fitted on the training data only and then reused to transform the testing data. This retains as much of the variance in the 'source' feature as possible while collapsing the many columns that one-hot encoding produces.
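
The fit-on-train, transform-on-test pattern can be illustrated with random count data. The arrays below are synthetic stand-ins for the per-visitor source counts, so the numbers themselves carry no meaning.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for per-visitor source counts (6 hypothetical sources).
rng = np.random.default_rng(0)
train_counts = rng.integers(0, 5, size=(50, 6))
test_counts = rng.integers(0, 5, size=(10, 6))

# Fit on the training matrix only, then reuse the fitted model on the test matrix.
pca = PCA(n_components=3)
train_pcs = pca.fit_transform(train_counts)
test_pcs = pca.transform(test_counts)

# Share of the training variance kept by the three components.
retained = pca.explained_variance_ratio_.sum()
print(train_pcs.shape, test_pcs.shape, round(retained, 3))
```

Checking `explained_variance_ratio_` is one way to judge whether three components are enough for a given dataset.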

### Data Cleaning and Feature Engineering
Several steps are taken to clean and prepare the data further:
- Missing values in columns like 'pageviews', 'bounces', and 'transactionRevenue' are replaced with zeros.
- A new binary column, 'at_least_one_conversion', is introduced to indicate whether a visitor has made at least one transaction.
- Redundant and constant value columns are identified and removed to streamline the dataset.
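
The cleaning steps listed above might look like this on a toy frame. The rows and the `constant_col` column are hypothetical. The sketch uses `max` over the session-level flag, which gives the same result as summing and testing for a value greater than zero.

```python
import numpy as np
import pandas as pd

# Toy sessions; 'constant_col' is a hypothetical column holding a single value.
raw = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b"],
    "pageviews": [3, np.nan, 1],
    "bounces": [np.nan, np.nan, 1],
    "transactionRevenue": [np.nan, 1000000, np.nan],
    "constant_col": ["same", "same", "same"],
})

cleaned = raw.copy()

# Replace missing values with zeros and store the result as integers.
for col in ["pageviews", "bounces", "transactionRevenue"]:
    cleaned[col] = cleaned[col].fillna(0).astype(int)

# Session-level flag, then lifted to the visitor level.
cleaned["Conversion"] = (cleaned["transactionRevenue"] > 0).astype(int)
cleaned["at_least_one_conversion"] = (
    cleaned.groupby("fullVisitorId")["Conversion"].transform("max")
)

# Drop columns that hold a single distinct value.
constant_cols = [c for c in cleaned.columns if cleaned[c].nunique() == 1]
cleaned = cleaned.drop(columns=constant_cols)

print(cleaned)
```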

### Merging Transformed Data
After PCA, the three components are merged back into the main dataset on 'fullVisitorId'.
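
As a rough illustration, the merge can be written as an inner join keyed on 'fullVisitorId'. The frames below are invented, and the `validate` argument is an extra safety check added in this sketch rather than something the pipeline uses.

```python
import pandas as pd

# Invented per-visitor features and PCA components.
visitor_features = pd.DataFrame({
    "fullVisitorId": ["a", "b", "c"],
    "TotalVisits": [2, 1, 4],
})
pca_features = pd.DataFrame({
    "fullVisitorId": ["a", "b", "c"],
    "Source_PC1": [0.1, -0.3, 0.2],
})

# validate="one_to_one" raises if either side has duplicate visitor IDs.
merged = pd.merge(
    visitor_features,
    pca_features,
    on="fullVisitorId",
    how="inner",
    validate="one_to_one",
)

print(merged)
```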

### Final Dataset Preparation
The final steps involve sorting and merging datasets to ensure all transformations are consistently applied across both training and testing sets.
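
Sorting matters because per-visitor "first" and "last" values depend on row order. The sketch below uses made-up sessions to show the idea. The column names follow the dataset, while the values are illustrative.

```python
import pandas as pd

# Made-up sessions, deliberately out of date order for visitor "a".
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b"],
    "date": pd.to_datetime(["2017-03-02", "2017-03-01", "2017-03-05"]),
    "channelGrouping": ["Direct", "Organic Search", "Referral"],
})

# Sort first so that first()/last() reflect chronological order.
ordered = sessions.sort_values(["fullVisitorId", "date"])
grouped = ordered.groupby("fullVisitorId")["channelGrouping"]

first_last = pd.DataFrame({
    "FirstChannelVisit": grouped.first(),
    "LastChannelVisit": grouped.last(),
}).reset_index()

print(first_last)
```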


In [1]:
import warnings

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('/Users/Abdul/Desktop/MMA/Enterprise Data Science/Revenue-Radar/Data/train_dejsonified.csv', parse_dates=['date'])
pd.set_option('display.max_columns', None)

In [3]:
# Define the cutoff date for the split
test_start_date = pd.Timestamp('2017-05-01')
# Copy each split so that later in-place changes do not act on a slice of df
train_df = df[df['date'] < test_start_date].copy()
test_df = df[df['date'] >= test_start_date].copy()

In [4]:
all_sources = pd.concat([train_df['source'], test_df['source']]).unique()

In [5]:
def apply_pca(data, pca=None, fit=False):
    """Fit a 3-component PCA when fit=True, then transform data with it."""
    if fit:
        pca = PCA(n_components=3)
        pca.fit(data)
    if pca is None:
        raise ValueError("PCA model is not provided for transformation.")
    transformed_data = pca.transform(data)
    return transformed_data, pca

In [6]:
def preprocess_data(df, all_sources, pca_model=None, fit_pca=True):
    # Handle one-hot encoding for 'source'
    source_df = pd.DataFrame(df[['source', 'fullVisitorId']])
    one_hot_encoded_df = pd.get_dummies(source_df['source'], prefix='', prefix_sep='', dtype=int)

    # Ensure all possible columns are included, filling missing with 0
    one_hot_encoded_df = one_hot_encoded_df.reindex(columns=all_sources, fill_value=0)
    one_hot_encoded_df['fullVisitorId'] = source_df['fullVisitorId'].values
    source_df = one_hot_encoded_df.groupby('fullVisitorId').sum().reset_index()

    
    # Remove 'fullVisitorId' temporarily for PCA processing
    pca_data = source_df.drop('fullVisitorId', axis=1).to_numpy()

    # Fit and transform PCA on the training data, or just transform on the test data
    transformed_data, pca_model = apply_pca(pca_data, pca_model, fit=fit_pca)
    
    # Convert PCA output to DataFrame and create meaningful column names
    pca_column_names = ['Source_PC1', 'Source_PC2', 'Source_PC3']
    pca_df = pd.DataFrame(transformed_data, columns=pca_column_names)
    pca_df['fullVisitorId'] = source_df['fullVisitorId'].values  # Ensure correct alignment
    
    # Fill missing values for 'pageviews' and 'bounces'
    # (assign the result back; chained inplace fillna may not update df in newer pandas)
    df['pageviews'] = df['pageviews'].fillna(0)
    df['bounces'] = df['bounces'].fillna(0)
    
    # Replace missing 'transactionRevenue' with 0 and convert to int
    df['transactionRevenue'] = df['transactionRevenue'].fillna(0).astype(int)
    
    # Create a binary 'Conversion' flag
    df['Conversion'] = df['transactionRevenue'].apply(lambda x: 1 if x > 0 else 0)
    df['at_least_one_conversion'] = df.groupby('fullVisitorId')['Conversion'].transform('sum').apply(lambda x: 1 if x > 0 else 0)
    
    # Sort data by 'fullVisitorId' and 'date'
    df.sort_values(by=['fullVisitorId', 'date'], inplace=True)
    
    # Identify and drop constant columns, keeping 'bounces' and 'isTrueDirect'
    constant_columns = [col for col in df.columns if df[col].nunique() == 1 and col not in ['bounces', 'isTrueDirect']]
    df.drop(constant_columns, axis=1, inplace=True)
    
    # Drop other unnecessary columns
    df.drop(['networkDomain', 'visitStartTime', 'visitNumber'], axis=1, inplace=True)
    
    # Merge data frames for channel visit analysis
    first_visit_channel = df.groupby('fullVisitorId')['channelGrouping'].first().reset_index(name='FirstChannelVisit')
    last_visit_channel = df.groupby('fullVisitorId')['channelGrouping'].last().reset_index(name='LastChannelVisit')
    channels_first_last = pd.merge(first_visit_channel, last_visit_channel, on='fullVisitorId', how='inner')
    
    df_unique = df.drop_duplicates(subset='fullVisitorId', keep='first')
    channels_first_last = pd.merge(channels_first_last, df_unique[['fullVisitorId', 'at_least_one_conversion', 'country', 'continent', 'subContinent']].drop_duplicates(), on='fullVisitorId', how='inner')
    
    
    # Calculating the total number of visits by each user
    total_visits_by_user=dict(df['fullVisitorId'].value_counts().reset_index(name='TotalVisits').values)

    
    # Calculating visits by channel for each user
    channel_visits_by_user=dict(df.groupby('fullVisitorId')['channelGrouping'].value_counts())
    list_of_dicts = []
    for (visitor_id, channel), visits in channel_visits_by_user.items():
        list_of_dicts.append({'fullVisitorId': visitor_id, channel: visits})

    # Creating a DataFrame
    df_channel_visits = pd.DataFrame(list_of_dicts)

    # Combining all dictionaries representing the same fullVisitorId
    df_channel_visits = df_channel_visits.groupby('fullVisitorId', as_index=False).sum()

    
    # Total pageviews by user
    total_pageviews_by_user=dict(df.groupby('fullVisitorId')['pageviews'].sum().reset_index(name='TotalPageviews').values)

    # Total bounces by user
    total_bounces_by_user=dict(df.groupby('fullVisitorId')['bounces'].sum().reset_index(name='TotalBounces').values)


    # Calculating visits by device category for each user
    visits_by_device=dict(df.groupby('fullVisitorId')['deviceCategory'].value_counts())
    # Converting the dictionary to a list of dictionaries
    list_of_dicts = []
    for (visitor_id, device), visits in visits_by_device.items():
        list_of_dicts.append({'fullVisitorId': visitor_id, device: visits})

    # Creating a DataFrame
    df_visits_by_device = pd.DataFrame(list_of_dicts)

    # Combining all dictionaries representing the same fullVisitorId
    df_visits_by_device = df_visits_by_device.groupby('fullVisitorId', as_index=False).sum()
    
    # Session pageviews
    first_session_pageviews=dict(df.groupby('fullVisitorId')['pageviews'].first().reset_index(name='FirstSessionPageviews').values)
    last_session_pageviews=dict(df.groupby('fullVisitorId')['pageviews'].last().reset_index(name='LastSessionPageviews').values)
    
    # Campaign data adjustment
    df['campaign'] = df['campaign'].apply(lambda x: 0 if x == '(not set)' else 1)
    
    # Calculating campaign visits
    campaign_visits_by_user=dict(df.groupby('fullVisitorId')['campaign'].sum().reset_index(name='CampaignVisits').values)

    
    # TrueDirect and AdContent data handling
    df['isTrueDirect'] = df['isTrueDirect'].fillna(False)
    true_Direct_by_user=dict(df.groupby('fullVisitorId')['isTrueDirect'].sum().reset_index(name='isTrueDirect').values)

    adcontent_by_user=dict(df.groupby('fullVisitorId')['adContent'].apply(lambda x: x.notnull().sum()).reset_index(name='AdContentVisits').values)


    # Convert dictionaries to DataFrames
    df_total_bounces = pd.DataFrame(list(total_bounces_by_user.items()), columns=['fullVisitorId', 'TotalBounces'])
    df_total_visits = pd.DataFrame(list(total_visits_by_user.items()), columns=['fullVisitorId', 'TotalVisits'])
    df_total_pageviews = pd.DataFrame(list(total_pageviews_by_user.items()), columns=['fullVisitorId', 'TotalPageviews'])
    df_first_session_pageviews = pd.DataFrame(list(first_session_pageviews.items()), columns=['fullVisitorId', 'FirstSessionPageviews'])
    df_last_session_pageviews = pd.DataFrame(list(last_session_pageviews.items()), columns=['fullVisitorId', 'LastSessionPageviews'])
    df_campaign_visits = pd.DataFrame(list(campaign_visits_by_user.items()), columns=['fullVisitorId', 'CampaignVisits'])
    df_true_Direct = pd.DataFrame(list(true_Direct_by_user.items()), columns=['fullVisitorId', 'isTrueDirect'])
    df_adcontent_visits = pd.DataFrame(list(adcontent_by_user.items()), columns=['fullVisitorId', 'AdContentVisits'])

    # Merging all computed DataFrames
    data_frames_to_merge = [
        df_total_visits, df_total_bounces, df_channel_visits,
        df_total_pageviews, df_visits_by_device, df_first_session_pageviews,
        df_last_session_pageviews, df_campaign_visits, df_true_Direct,
        df_adcontent_visits
    ]

    final_df = channels_first_last
    for data_frame in data_frames_to_merge:
        final_df = pd.merge(final_df, data_frame, on='fullVisitorId', how='inner')
    
    # Convert any floating-point columns to integers if necessary
    for col in final_df.select_dtypes(include=['float64']).columns:
        final_df[col] = final_df[col].astype(int)

    return final_df, pca_model, pca_df

In [7]:

# Apply preprocessing to both train and test datasets
processed_train, pca_model, pca_df_train = preprocess_data(train_df, all_sources, fit_pca=True)

In [8]:
processed_test, _, pca_df_test = preprocess_data(test_df, all_sources, pca_model, fit_pca=False)

In [9]:
# Merge the PCA components (pca_df_train, pca_df_test) into processed_train and processed_test
processed_train = pd.merge(processed_train, pca_df_train, on='fullVisitorId', how='inner')
processed_test = pd.merge(processed_test, pca_df_test, on='fullVisitorId', how='inner')

In [13]:
print(processed_train.shape)
print(processed_test.shape)

(567490, 29)
(161282, 29)


In [14]:
processed_test.to_csv('/Users/Abdul/Desktop/MMA/Enterprise Data Science/test_df.csv', index=False)
processed_train.to_csv('/Users/Abdul/Desktop/MMA/Enterprise Data Science/train_df.csv', index=False)