# Importing the Dataset and Packages


## Import Script
The PowerShell script in the associated GitHub repository lets this notebook download the dataset for this project, provided you complete the ".env" file with your Kaggle API key and file path.

The script was written so that this notebook can be run, tested, and modified anywhere.
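
As a rough illustration of how a ".env" value might be checked before loading, the sketch below reads a setting from a mapping of environment variables and fails early with a readable message when it is missing. LOCAL_FILE_LOCATION is the variable name used in the import cell; the helper name `resolve_setting` and the example path are hypothetical and not part of the repository.

```python
import os


def resolve_setting(name, env=None):
    """Return the value of an environment variable, or raise a clear error.

    `resolve_setting` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise KeyError(f"{name} is not set; add it to the .env file")
    return value


# An in-memory mapping stands in for the values loaded from .env
# (the path below is a placeholder, not a real location).
example_env = {"LOCAL_FILE_LOCATION": "data/Churn_Modelling.csv"}
file_path = resolve_setting("LOCAL_FILE_LOCATION", example_env)
print(file_path)
```

Failing early like this would surface a missing setting directly, rather than as a file-not-found error later in the cell.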

In [1]:
%matplotlib inline

import dotenv
import kagglehub
import os
import subprocess
import ipywidgets
import sklearn as sk
import pandas as pd
import tensorflow as tf
import keras as ks
import matplotlib as mlt
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer


try:
    dotenv.load_dotenv()
except Exception:
    # Avoid a bare except, which would also swallow KeyboardInterrupt
    print("--Dotenv not loaded--")


# Check for Kaggle environment and set the file path
if os.path.exists("/kaggle/input/churn-modelling/Churn_Modelling.csv"):
    # Kaggle Environment
    file_path = "/kaggle/input/churn-modelling/Churn_Modelling.csv"
else:
    # Local Environment
    file_path = str(os.getenv("LOCAL_FILE_LOCATION"))

# Load Dataset
try:
    df = pd.read_csv(file_path)
    print("Dataset Loaded Successfully!")
except FileNotFoundError:
    print(f"Error: File not found at : {file_path}")
    try:
        # Run Powershell script to download dataset
        print("Attempting to run download_data_ps1")
        path = os.getenv("SCRIPT_PATH")
        subprocess.run(["powershell", "-ExecutionPolicy", "Bypass", "-File", path],
                       check = True, capture_output =  True, text = True)
        print("Powershell Download Script Run Successfully. Now attempting to reload dataset...")
        df = pd.read_csv(file_path)
        if df is not None and not df.empty:
            print("Dataset Loaded Successfully")
        else:
            print("Data not loaded")
    except Exception as e:
        print(f"Error running powershell script: {e}")
        df = None

# Display the first few rows of the dataset
if df is not None:
    display(df.head())

Error: File not found at : file_path
Attempting to run download_data_ps1
Powershell Download Script Run Successfully. Now attempting to reload dataset...
Dataset Loaded Successfully


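| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |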


# Cleaning Dataset

The target variable, "Exited", is the last column in this dataset. It is binary: a 1 represents a customer closing their account, and a 0 represents a retained customer.

Let's preview the data in order to understand what we have to work with.

First, I will drop the "RowNumber", "CustomerId", and "Surname" variables. They are arbitrary identifiers and are not useful for our algorithm.

In [2]:
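# Predictors: every column after the three identifiers, excluding the target.
# .copy() makes X independent of df, since X is modified in place later on.
X = df.iloc[:, 3:-1].copy()
# Target: the last column ("Exited"), kept as a one-column DataFrame.
Y = df.iloc[:, -1:]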

display(X.head())
display(Y.head())

display(f"{X.shape=}")
display(f"{Y.shape=}")

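| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 |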


| | Exited |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |


'X.shape=(10000, 10)'

'Y.shape=(10000, 1)'

As you can see, we have 10,000 observations for both the predictor and target variables.
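
The positional slice works because the three identifier columns happen to come first and the target comes last. A name-based alternative, sketched below on a few columns from the first preview rows, does not depend on column order. This is an illustrative variant rather than the notebook's own cell; the names `sample`, `X_sample`, and `Y_sample` are hypothetical.

```python
import pandas as pd

# A few columns from the first preview rows, used as a small stand-in for df.
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Exited": [1, 0, 1],
})

id_columns = ["RowNumber", "CustomerId", "Surname"]  # arbitrary identifiers
target = "Exited"

# Select predictors and target by name instead of by position.
X_sample = sample.drop(columns=id_columns + [target])
Y_sample = sample[[target]]  # double brackets keep a one-column DataFrame

print(list(X_sample.columns))
print(Y_sample.shape)
```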

Now, we will check the dataset for null values and duplicates.

In [3]:
print(df.isna().sum(), '\n')
print("Duplicate Count   ", df.duplicated().sum())

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64 

Duplicate Count    0


Since there are no NA values or duplicates, we can proceed with encoding.

In [4]:
adjust_length = 17 # offset for output formatting

for col in X:
    print(f"{col.ljust(adjust_length)} : {X[col].dtypes}")
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"{'Exited'.ljust(adjust_length)} : {Y['Exited'].dtypes}")

CreditScore       : int64
Geography         : object
Gender            : object
Age               : int64
Tenure            : int64
Balance           : float64
NumOfProducts     : int64
HasCrCard         : int64
IsActiveMember    : int64
EstimatedSalary   : float64
Exited            : int64


# Label Encoding
Since we have two categorical variables, we must encode them before proceeding with the analysis.

The "Gender" variable will be label encoded: a 0 represents "Female" and a 1 represents "Male".

The "Geography" variable will be one-hot encoded. Each geographical location receives its own binary column, with a 1 in the column where the observation is located. Because the first category is dropped (drop_first=True), "France" has no column of its own and is represented by a 0 in both remaining columns.

In [5]:
Encoder = LabelEncoder()
# Select the column by name rather than by position, so the cell does not depend on column order
X['Gender'] = Encoder.fit_transform(X['Gender']).astype(int)

display((X.loc[:, 'CreditScore':'Gender']).iloc[0 : 5])


print(f"{str(X['Gender'].name)} : {X['Gender'].dtypes}")


| | CreditScore | Geography | Gender |
|---|---|---|---|
| 0 | 619 | France | 0 |
| 1 | 608 | Spain | 0 |
| 2 | 502 | France | 0 |
| 3 | 699 | France | 0 |
| 4 | 850 | Spain | 0 |


Gender : int64


Now, the "Gender" variable is represented as an integer.
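
To make the mapping concrete, here is a small library-free sketch of the idea behind label encoding: the distinct labels are sorted, and each label is replaced by its position in that order. scikit-learn's LabelEncoder also sorts its classes, which is why "Female" maps to 0 and "Male" to 1. The function name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each distinct label to an integer based on sorted order.

    Hypothetical helper that mirrors the idea behind label encoding.
    """
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


codes, mapping = label_encode(["Female", "Male", "Female", "Male"])
print(mapping)
print(codes)
```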

Next, we must encode the "Geography" variable.

In [6]:
# Encode the geography variable
X = pd.get_dummies(X, columns=['Geography'], drop_first=True)

# Cast geography variables to integers
for col in X.columns:
    if 'Geography_' in col:
        X[col] = X[col].astype('int64')

# display
display(X.head())

# print the column names and their datatypes
for col in X:
    print(f"{col.ljust(adjust_length)} : {X[col].dtypes}")

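| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 0 | 0 |
| 1 | 608 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 |
| 2 | 502 | 0 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 0 | 0 |
| 3 | 699 | 0 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 |
| 4 | 850 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 1 |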


CreditScore       : int64
Gender            : int64
Age               : int64
Tenure            : int64
Balance           : float64
NumOfProducts     : int64
HasCrCard         : int64
IsActiveMember    : int64
EstimatedSalary   : float64
Geography_Germany : int64
Geography_Spain   : int64
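
Why only two Geography columns appear can be shown with a small library-free sketch of one-hot encoding that drops the first category. It assumes the categories are ordered alphabetically, which matches the columns in the output above (France is dropped; Germany and Spain remain). The function name `one_hot_drop_first` is hypothetical.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode `values`, omitting the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the all-zeros baseline
    return [
        {f"{prefix}_{cat}": int(v == cat) for cat in kept}
        for v in values
    ]


rows = one_hot_drop_first(["France", "Spain", "Germany"], prefix="Geography")
for row in rows:
    print(row)
```

Dropping one category loses no information, since a row with zeros in every remaining column can only belong to the baseline category.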


In [7]:
fin_df = X.copy()
fin_df['Exited'] = Y

display(fin_df.iloc[0 : 5])

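| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_Germany | Geography_Spain | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 0 | 0 | 1 |
| 1 | 608 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 0 |
| 2 | 502 | 0 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 0 | 0 | 1 |
| 3 | 699 | 0 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 |
| 4 | 850 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 1 | 0 |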


# Feature Scaling


To-do:

- Perform basic EDA BEFORE encoding/scaling
- Scale features
- Perform EDA AFTER encoding/scaling
- Build neural network
- Optimize NN
- Analyze NN accuracy
- Review + Document
- Publish

