# Data merging

Confluence Documentation: https://openayeye.atlassian.net/wiki/x/AYAs

## Table of Contents

1. [Merging 1](#Merging-1)
    1. [Filling in missing data](#Filling-in-missing-data)
    2. [Check if distribution is preserved](#Check-if-distribution-is-preserved)
2. [Merging 2](#Merging-2)

### Import required packages

In [1]:
import pandas as pd
import numpy as np
import warnings

# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.compose import ColumnTransformer
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder


warnings.filterwarnings("ignore")

## Merging 1

In [9]:
df1 = pd.read_csv("data/bank-full.csv", sep = ";", header = 0) #from UCI Bank Marketing
df2 = pd.read_csv("data/Churn_Modelling.csv").iloc[:, 3:] #from data.gov Total Loans to Non-Bank Customers by Type
# df1

# Lower-case the column names of both frames (each frame uses its own columns)
df1.rename({i: i.lower() for i in df1.columns.values}, axis=1, inplace=True)
df2.rename({i: i.lower() for i in df2.columns.values}, axis=1, inplace=True)

# Record each column's original dtype; if a name appears in both frames, the first occurrence wins
dtype_dict = pd.concat([df1.dtypes, df2.dtypes], axis=0)
dtype_dict = dtype_dict[~dtype_dict.index.duplicated()]
# dtype_dict

In [10]:
# Merge Dataframes
merged_df = pd.concat([df1, df2], axis=0, ignore_index=True)
# print(merged_df['isactivemember'])

# Find numerical & categorical columns
which_object = [i == np.dtype('O') for i in merged_df.dtypes]
categorical_columns = merged_df.columns[which_object].values
numerical_columns = merged_df.columns[np.invert(which_object)].values
all_columns = np.concatenate([numerical_columns, categorical_columns])

# Rearrange column sequence
merged_df = merged_df.loc[:, all_columns]
merged_df = merged_df.reset_index(drop=True)
merged_df[categorical_columns] = merged_df.loc[:, categorical_columns].astype('category')

In [12]:
merged_df.columns

Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'creditscore', 'tenure', 'numofproducts', 'estimatedsalary', 'job',
       'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'subscribed', 'surname', 'geography', 'gender',
       'hascrcard', 'isactivemember', 'exited'],
      dtype='object')

In [None]:
cat_dtypes = merged_df.dtypes[categorical_columns]
num_dtypes = dtype_dict[numerical_columns]
dtype_dict = dict(cat_dtypes) |  dict(num_dtypes)
dtype_dict

In [None]:
merged_df.loc[:, categorical_columns].head()

In [None]:
merged_df.loc[:, numerical_columns].head()

The columns are split into `numerical_columns` and `categorical_columns` because missing data is handled differently in each group.
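
To illustrate where the missing values come from, here is a minimal sketch on toy data. The frames `left` and `right` are hypothetical stand-ins for `df1` and `df2`, and the split uses `select_dtypes` rather than the dtype comparison in the cell above. It is an illustration, not the notebook's actual data.

```python
import pandas as pd

# Two toy frames with no columns in common (hypothetical stand-ins for df1 and df2).
left = pd.DataFrame({"age": [30, 45], "job": ["admin", "technician"]})
right = pd.DataFrame({"creditscore": [619, 608], "gender": ["Female", "Male"]})

# Row-wise concat keeps the union of the columns and fills the gaps with NaN.
toy = pd.concat([left, right], axis=0, ignore_index=True)

# Split by dtype: numeric columns on one side, everything else treated as categorical.
toy_numerical = toy.select_dtypes(include="number").columns.tolist()
toy_categorical = [c for c in toy.columns if c not in toy_numerical]

# Every column is missing for the rows that came from the other frame.
print(toy.isna().sum())
```

Because each row only carries the columns of the dataset it came from, half of every column in this toy frame is empty, which is the situation the imputers have to deal with.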

### Filling in missing data
Different techniques can be used to fill in missing data.  
An imputer estimates a value for each empty cell from the data that is present and writes that estimate into the cell.
Below, I use `IterativeImputer` (multivariate imputation) for numerical data and `SimpleImputer(strategy="most_frequent")` for categorical data.
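
As a rough sketch of what the two imputers do, the toy example below applies each one to a tiny frame. The frames `num` and `cat` are made up for illustration. `IterativeImputer` models the column with gaps from the other numerical columns, while `SimpleImputer(strategy="most_frequent")` writes the mode into every gap.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data: x is roughly y / 2, with one value missing in each frame.
num = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan], "y": [2.0, 4.0, 6.0, 8.0]})
cat = pd.DataFrame({"job": ["admin", "admin", "services", np.nan]}, dtype=object)

# Multivariate imputation: the missing x is predicted from y.
num_filled = IterativeImputer(random_state=0).fit_transform(num)

# Univariate imputation: the missing job becomes the most frequent category.
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_filled)
print(cat_filled.ravel())
```

The numerical estimate depends on the other columns, whereas the categorical fill is the same value for every missing row.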

In [7]:
num_pipeline = Pipeline(
                steps=[
                    ("imputer", IterativeImputer(random_state=0)), # (Multivariate Imputation)
                    # Some examples of other imputation methods:
                    #   ("imputer", SimpleImputer(strategy='mean')), 
                    #   ("imputer", SimpleImputer(strategy='median')), 
                    #   ("imputer", SimpleImputer(strategy='most_frequent')), 
                    # ("scaler", MinMaxScaler()), # Scaling numerical data
                ]
            )

cat_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")), 
        # Some examples of other imputation methods:
        # ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),  
        # ("imputer", KNNImputer(n_neighbors=5, weights="uniform"))  # numeric input only: encode the categories first
    ]
)

preprocessor = ColumnTransformer(
                transformers=[
                    ("num_pipeline", num_pipeline, numerical_columns),
                    ("cat_pipeline", cat_pipeline, categorical_columns),
                ]
            )

In [None]:
# Apply transformation on dataset
processed_data = preprocessor.fit_transform(merged_df)

# Convert processed_data back to a DataFrame
processed_df = pd.DataFrame(processed_data, columns=all_columns)

# Convert numerical columns back to float
processed_df.loc[:, numerical_columns] = processed_df[numerical_columns].apply(pd.to_numeric)


processed_df = processed_df.astype(dtype_dict)
# print(processed_df.shape, merged_df.shape)
# print(processed_df.head())
feat_cols = [i for i in processed_df.columns if (i != 'subscribed' and i != 'exited')]
feat_cols
label_sub = ['subscribed']
# label_exit = ['exited']
X = processed_df[feat_cols]
y = processed_df[label_sub]
X.head(), y.head()
# processed_df.to_csv("data/merged.csv", index=False)

`merged_df.shape` should equal `processed_df.shape`.
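
A check like this can be made explicit with a few assertions. The helper `check_imputation` below is a hypothetical name, not part of the notebook, and the toy frames stand in for `merged_df` and `processed_df`.

```python
import numpy as np
import pandas as pd


def check_imputation(original: pd.DataFrame, processed: pd.DataFrame) -> None:
    """Raise AssertionError if imputation changed the layout or left gaps."""
    assert original.shape == processed.shape, "row or column count changed"
    assert list(original.columns) == list(processed.columns), "column order changed"
    assert not processed.isna().any().any(), "missing values remain"


# Toy stand-ins for merged_df (with gaps) and processed_df (gaps filled).
before = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
after = pd.DataFrame({"a": [1.0, 1.0], "b": ["x", "x"]})

check_imputation(before, after)  # passes silently
```

In the notebook, the equivalent call would be `check_imputation(merged_df, processed_df)` after the transformation cell.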

### Check if distribution is preserved

### Kolmogorov-Smirnov Test for Numerical Columns
#### Interpretation  
- **KS Statistic**: The largest gap between the empirical cumulative distribution functions of the two samples. A value of 0.0 means the original and processed data have identical empirical distributions for that column.  
- **P-value**: A p-value of 1.0 means the data give no evidence against the null hypothesis, which states that both samples come from the same distribution. A low p-value (typically < 0.05) indicates a significant difference.  
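
To make the two extremes concrete, the sketch below runs `ks_2samp` on a synthetic sample compared with itself and with a shifted copy. The data is randomly generated for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
sample = rng.normal(size=500)

# Identical samples: the empirical CDFs coincide, so the statistic is 0.
same = ks_2samp(sample, sample)

# Shifting every value by one standard deviation separates the CDFs.
shifted = ks_2samp(sample, sample + 1.0)

print(f"identical: statistic={same.statistic:.3f}, p-value={same.pvalue:.3f}")
print(f"shifted:   statistic={shifted.statistic:.3f}, p-value={shifted.pvalue:.3g}")
```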

In [None]:
from scipy.stats import ks_2samp

# Kolmogorov-Smirnov test to check if two distributions are the same

def ks_test_column(original_column, processed_column):
    # Drop any missing values from original column
    original_non_missing = original_column.dropna()
    # Kolmogorov-Smirnov test
    ks_stat, p_value = ks_2samp(original_non_missing, processed_column)
    print(f"Kolmogorov-Smirnov test for \033[96m{original_column.name}\033[00m:")
    print(f"KS Statistic: {ks_stat}, p-value: {p_value:.3f}")
    return ks_stat, p_value

# Apply the KS test to all numerical columns
for col in numerical_columns:
    ks_test_column(merged_df[col], processed_df[col])

#### Summary  
The KS test results suggest that the preprocessing applied to the numerical columns did not alter their distributions. In this notebook that preprocessing is imputation only, since the scaling step is commented out in `num_pipeline`. Note that a high p-value means the test found no evidence of a difference between the original and processed data; it does not prove that the two distributions are identical.
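
For contrast, here is a sketch of a case where the same test does flag a change. It uses synthetic data and plain mean imputation, which is an assumption chosen for the demonstration and not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
with_gaps = pd.Series(rng.normal(size=1000))
with_gaps.iloc[:500] = np.nan  # half of the values are missing

# Mean imputation piles every missing row onto a single value.
mean_filled = with_gaps.fillna(with_gaps.mean())

# Compare the observed values with the imputed column, as ks_test_column does.
result = ks_2samp(with_gaps.dropna(), mean_filled)
print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.3g}")
```

The spike at the mean shows up as a large KS statistic and a very small p-value, which is the pattern to look for if an imputer distorts a numerical column.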


### Chi-Square Test for Categorical Columns

#### Interpretation
- **Chi-Square Statistic**: Measures the magnitude of the difference between observed and expected frequencies. A higher value indicates a greater difference.    
- **P-value**: The probability of observing a difference at least this large if the null hypothesis (that the distributions are the same) is true. A low p-value (typically < 0.05) suggests that there is a significant difference between the distributions.
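
One way to set up such a comparison is as a goodness-of-fit test: the processed counts are the observed frequencies, and the original category shares, scaled to the same total, are the expected frequencies. The helper name `chi2_same_distribution` and the toy series are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import chisquare


def chi2_same_distribution(original: pd.Series, processed: pd.Series):
    """Goodness-of-fit test of processed counts against the original shares."""
    expected_share = original.value_counts(normalize=True)  # NaN is ignored
    # Align the observed counts on the same category order.
    observed = processed.value_counts().reindex(expected_share.index, fill_value=0)
    # chisquare expects both vectors to have the same total.
    expected = expected_share * observed.sum()
    return chisquare(f_obs=observed.to_numpy(), f_exp=expected.to_numpy())


# Toy column: two equally common categories, half of the rows missing.
original = pd.Series(["a"] * 50 + ["b"] * 50 + [None] * 100)
mode_filled = original.fillna("a")  # every gap receives the same category

result = chi2_same_distribution(original, mode_filled)
print(f"statistic={result.statistic:.1f}, p-value={result.pvalue:.3g}")
```

Aligning the categories and matching the totals matters because `chisquare` compares the two vectors position by position.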

In [None]:
# from scipy.stats import chi2_contingency

# def chi2_test_column(original_column, processed_column):
#     # Create contingency table
#     contingency_table = pd.crosstab(original_column, processed_column)
    
#     # Perform Chi-Square Test
#     chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    
#     print(f"Chi-Square test for \033[96m{original_column.name}\033[00m:")
#     print(f"Chi-Square Statistic: {chi2_stat}, p-value: {p_value:.3f}")
#     return chi2_stat, p_value

# # Apply the Chi-Square test to all categorical columns
# for col in categorical_columns:
#     chi2_test_column(merged_df[col], processed_df[col])

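from scipy.stats import chisquare

# Put both count vectors side by side, aligned on the same category order
job_counts = pd.concat([merged_df['job'].value_counts(), processed_df['job'].value_counts()], axis=1, keys=['original', 'processed']).fillna(0)
print(job_counts)
# chisquare expects both vectors to share the same total, so rescale the expected counts
expected = job_counts['processed'] * job_counts['original'].sum() / job_counts['processed'].sum()
chisquare(job_counts['original'], expected)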

#### Summary  
Since all p-values are 0.000, the category frequencies after imputation differ significantly from the original frequencies for each categorical column. 
This is expected with `most_frequent` imputation: rows that came from the other dataset lack the column entirely, so all of them receive the same mode value and inflate a single category.

<font color='red'>Hence, we need to find better imputation methods for the categorical columns.</font>
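
One candidate, offered here only as a sketch and not as the method this notebook settles on, is to fill each gap by drawing at random from the observed categories in proportion to their frequencies. The helper name `impute_by_sampling` and the toy column are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_by_sampling(column: pd.Series, seed: int = 0) -> pd.Series:
    """Fill gaps by sampling categories according to their observed shares."""
    rng = np.random.default_rng(seed)
    shares = column.value_counts(normalize=True)  # NaN is ignored
    missing = column.isna()
    filled = column.copy()
    filled[missing] = rng.choice(
        shares.index.to_numpy(), size=int(missing.sum()), p=shares.to_numpy()
    )
    return filled


# Toy column: 70% "a", 30% "b" among the observed values, half of the rows missing.
toy = pd.Series(["a"] * 700 + ["b"] * 300 + [None] * 1000)
filled = impute_by_sampling(toy)
print(filled.value_counts(normalize=True))
```

This keeps the marginal frequencies close to the original ones, but it ignores any relationship between the imputed column and the other features, so it trades one distortion for another.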

## Merging 2
To be continued.