# 3 Pre-processing & Training Data Development <a id='3_Pre-processing_&_training_data_development'></a>
---

## 3.1 Contents <a id='31-contents'></a>

- [3.1 Contents](#31-contents)
- [3.2 Introduction](#32-introduction)
- [3.3 Imports](#33-imports)
- [3.4 Load The Data](#34-load-the-data)
- [3.5 Data Cleaning](#35_Data_cleaning)
    - [3.5.1 Imputing/Removing Missing Values](#351-imputing-missing-values)
    - [3.5.2 Data Type Manipulation](#352-Data-Type-Manipulation)
- [3.6 Encoding Categorical Features](#36-encoding-categorical-features)
     - [3.6.1 One-Hot Encoding](#361-One-Hot-Encoding)
     - [3.6.2 NLP Encoding](#362-NLP-Encoding)
- [3.7 Train/Test Split](#37-traintest-split)
- [3.8 Scale the Data](#38-scale-the-data)
- [3.9 Train/Predict with a "Baseline Model"](#39-trainpredict-with-a-baseline-model)
- [3.10 Setting up Pipelines](#310-setting-up-pipelines)
    - [3.10.1 Define](#3101-define)
- [3.11 Fit/Train/Predict and Assess Models ](#311-fit-train-predict-and-assess)
- [3.12 Final Model Selection](#314-final-model-selection)
    - [3.12.1 Logistic Regression Model Performance](#3141-logistic-regression-model-performance)
    - [3.12.2 Random Forest Regression Model Performance](#3142-random-forest-regression-model-performance)
- [3.13 Conclusion](#315-conclusion)
 

## 3.2 Introduction <a id='32-introduction'></a>

This is a continuation of [2.0-faa-exploratory-data-analysis-cap3.ipynb](https://github.com/OCD0505/Springboard-Capstone-Three/blob/a828038e3a66cf911e11eda9a413b9a661266e08/notebooks/2.0-faa-exploratory-data-analysis-cap3.ipynb) focusing on feature engineering, training and model selection. 

Goals: impute missing values, encode categorical features, split the data into training and test sets, scale the data, build pipelines, and select a model.

### **Problem Statement:**
Enhance the effectiveness of donor engagement and support for a nonprofit organization by analyzing donor lifetime value, predicting churn, and implementing personalized retention strategies.

## 3.3 Imports <a id='33-imports'></a>

In [210]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string 
import datetime
import pickle

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score 
from sklearn.metrics import f1_score, roc_auc_score, multilabel_confusion_matrix
from sklearn.neural_network import MLPClassifier

from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_regression


## 3.4 Load The Data <a id='34-load-the-data'></a>

In [2]:
# Store the file path in a variable, then use pd.read_csv() to load the data as a DataFrame into DonationData

dataFilePath = '/Users/frankyaraujo/Development/springboard_main/\
Capstone Three/Springboard-Capstone-Three/src/data/Donations _ Jan 2015 to Mar 2024_R3 .csv'
DonationData = pd.read_csv(dataFilePath, low_memory = False)

In [3]:
# just a quick column name change to align naming conventions
DonationData.rename(columns={'Gift Date_last': 'Last Gift Date'}, inplace=True)

# dropping 'Unnamed: 0' as it is equivalent to the index
DonationData.drop(columns='Unnamed: 0', inplace=True)

In [4]:
DonationData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17244 entries, 0 to 17243
Data columns (total 23 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Contact Id                     17244 non-null  int64  
 1   Contact Type                   17244 non-null  object 
 2   Contact Tags                   16809 non-null  object 
 3   Contact Primary Full Address   17244 non-null  object 
 4   Contact Primary Address City   17244 non-null  object 
 5   Contact Primary Address State  17244 non-null  object 
 6   Contact Social Score           17243 non-null  float64
 7   Donor Tenure Years             17244 non-null  float64
 8   Churned                        17244 non-null  int64  
 9   Gift Date                      17244 non-null  object 
 10  Amount                         17244 non-null  float64
 11  Gift Type                      17244 non-null  object 
 12  Notes                          5915 non-null  

In [5]:
DonationData.head()

Unnamed: 0,Contact Id,Contact Type,Contact Tags,Contact Primary Full Address,Contact Primary Address City,Contact Primary Address State,Contact Social Score,Donor Tenure Years,Churned,Gift Date,...,Segment Name,Campaign Name,Gift Year,Quarter,Gift Month,Donation Count,Average Amount,LTV,First Gift Date,Last Gift Date
0,1,Household,Do not call;Website Email Submit,Unknown,Unknown,Unknown,73.0,0.0,1,2015-01-01,...,General Segment,General Giving,2015,1,1,1,390.0,390.0,2015-01-01,2015-01-01
1,2,Household,Do not call;Website Email Submit,Unknown,Unknown,Unknown,30.0,0.0,1,2015-01-01,...,General Segment,General Giving,2015,1,1,1,50.0,50.0,2015-01-01,2015-01-01
2,3,Household,Do not call;Website Email Submit;Sustaining Do...,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,91.0,9.21,0,2015-01-01,...,General Segment,General Giving,2015,1,1,87,18.068966,1415.473233,2015-01-01,2024-03-18
3,3,Household,Do not call;Website Email Submit;Sustaining Do...,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,91.0,9.21,0,2015-01-01,...,General Segment,General Giving,2015,1,1,87,18.068966,1415.473233,2015-01-01,2024-03-18
4,3,Household,Do not call;Website Email Submit;Sustaining Do...,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,91.0,9.21,0,2016-01-01,...,General Segment,General Giving,2016,1,1,87,18.068966,1415.473233,2015-01-01,2024-03-18


## 3.5 Data Cleaning <a id='35_Data_cleaning'></a>

Before splitting the data for training and testing, it is good practice to examine it closely for missing values or inconsistencies that could cause problems down the line. Fixing these issues first makes it easier to choose suitable models later.

In [6]:
# From the cells above, the date columns do not display the appropriate data type.
# Convert date columns to datetime64 format

DonationData['Gift Date'] = pd.to_datetime(DonationData['Gift Date'])
DonationData['First Gift Date'] = pd.to_datetime(DonationData['First Gift Date'])
DonationData['Last Gift Date'] = pd.to_datetime(DonationData['Last Gift Date'])

print(DonationData.dtypes)

Contact Id                                int64
Contact Type                             object
Contact Tags                             object
Contact Primary Full Address             object
Contact Primary Address City             object
Contact Primary Address State            object
Contact Social Score                    float64
Donor Tenure Years                      float64
Churned                                   int64
Gift Date                        datetime64[ns]
Amount                                  float64
Gift Type                                object
Notes                                    object
Segment Name                             object
Campaign Name                            object
Gift Year                                 int64
Quarter                                   int64
Gift Month                                int64
Donation Count                            int64
Average Amount                          float64
LTV                                     

<a id='351-imputing-missing-values'></a>
### 3.5.1 Imputing/Removing Missing Values

In [7]:
# look at the missing value %s
pd.DataFrame(DonationData.isnull().sum() / len(DonationData) * 100)\
    .sort_values(by=0, ascending=False)\
    .reset_index()\
    .rename(columns={'index': 'Column', 0: 'Missing Values %'})

Unnamed: 0,Column,Missing Values %
0,Notes,65.698214
1,Contact Tags,2.522617
2,Contact Social Score,0.005799
3,Contact Id,0.0
4,Segment Name,0.0
5,First Gift Date,0.0
6,LTV,0.0
7,Average Amount,0.0
8,Donation Count,0.0
9,Gift Month,0.0


Roughly 66% of the values in the 'Notes' column are missing, which would normally argue for dropping the column. However, both 'Notes' and 'Contact Tags' hold free text, which opens the door to Natural Language Processing (NLP) techniques. Combining the two columns makes it possible to use the available text without dropping or imputing either one.

In [8]:
# Fill missing values in 'Contact Tags' and 'Notes' with empty strings
# so that concatenation succeeds even when either column is missing.
# Assigning the result back avoids chained-assignment warnings in newer pandas.
DonationData['Contact Tags'] = DonationData['Contact Tags'].fillna('')
DonationData['Notes'] = DonationData['Notes'].fillna('')

# Concatenate 'Contact Tags' and 'Notes' columns with a semicolon separator
DonationData['Tags_Notes_Combined'] = DonationData['Contact Tags'] + ';' + DonationData['Notes']

# Drop the old columns
DonationData.drop(columns=['Contact Tags', 'Notes'], inplace=True)
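
One side effect of the concatenation above is that the semicolon is always inserted, so a row with tags but no note ends in a trailing `;`. A possible alternative, sketched here on toy data, joins only the non-empty parts. The helper name `join_non_empty` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy rows: one without a note, one without tags, one with both
toy = pd.DataFrame({
    "Contact Tags": ["Do not call;Website Email Submit", None, "Board Member"],
    "Notes": [None, "Ticket: gala", "Met at event"],
})

def join_non_empty(row, sep=";"):
    """Join the non-missing, non-blank cells of a row with `sep`."""
    parts = [str(v).strip() for v in row if pd.notna(v) and str(v).strip()]
    return sep.join(parts)

toy["Tags_Notes_Combined"] = toy[["Contact Tags", "Notes"]].apply(join_non_empty, axis=1)
print(toy["Tags_Notes_Combined"].tolist())
```

Whether the stray separator matters depends on the downstream tokenizer; the notebook later strips punctuation, so this is a tidiness choice rather than a required fix.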

In [9]:
DonationData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17244 entries, 0 to 17243
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   Contact Id                     17244 non-null  int64         
 1   Contact Type                   17244 non-null  object        
 2   Contact Primary Full Address   17244 non-null  object        
 3   Contact Primary Address City   17244 non-null  object        
 4   Contact Primary Address State  17244 non-null  object        
 5   Contact Social Score           17243 non-null  float64       
 6   Donor Tenure Years             17244 non-null  float64       
 7   Churned                        17244 non-null  int64         
 8   Gift Date                      17244 non-null  datetime64[ns]
 9   Amount                         17244 non-null  float64       
 10  Gift Type                      17244 non-null  object        
 11  Segment Name   

With 'Contact Tags' and 'Notes' concatenated into the new 'Tags_Notes_Combined' column, the textual information is consolidated for further analysis. The missing values in both text columns have now been addressed; the single missing 'Contact Social Score' value is handled later, before the train/test split.

<a id='352-Data-Type-Manipulation'></a>
### 3.5.2 Data Type Manipulation

In preparation for modeling, numerical values will be extracted from the datetime values to ensure compatibility with a wider scope of model types.

In [10]:
# Extract numerical features from datetime columns
DonationData['Gift_Year'] = DonationData['Gift Date'].dt.year
DonationData['Gift_Month'] = DonationData['Gift Date'].dt.month
DonationData['Gift_Day'] = DonationData['Gift Date'].dt.day
DonationData['Gift_Hour'] = DonationData['Gift Date'].dt.hour
DonationData['Gift_Minute'] = DonationData['Gift Date'].dt.minute
DonationData['Gift_Second'] = DonationData['Gift Date'].dt.second

DonationData['First_Gift_Year'] = DonationData['First Gift Date'].dt.year
DonationData['First_Gift_Month'] = DonationData['First Gift Date'].dt.month
DonationData['First_Gift_Day'] = DonationData['First Gift Date'].dt.day
DonationData['First_Gift_Hour'] = DonationData['First Gift Date'].dt.hour
DonationData['First_Gift_Minute'] = DonationData['First Gift Date'].dt.minute
DonationData['First_Gift_Second'] = DonationData['First Gift Date'].dt.second

DonationData['Last_Gift_Year'] = DonationData['Last Gift Date'].dt.year
DonationData['Last_Gift_Month'] = DonationData['Last Gift Date'].dt.month
DonationData['Last_Gift_Day'] = DonationData['Last Gift Date'].dt.day
DonationData['Last_Gift_Hour'] = DonationData['Last Gift Date'].dt.hour
DonationData['Last_Gift_Minute'] = DonationData['Last Gift Date'].dt.minute
DonationData['Last_Gift_Second'] = DonationData['Last Gift Date'].dt.second

# Now these can be used as numerical features for modeling

In [11]:
# The information has been extracted from the datetime values as numerical data,
# so the original datetime columns can be dropped

# Drop the original datetime columns
DonationData_V02 = DonationData.drop(['Gift Date', 'First Gift Date', 'Last Gift Date'], axis=1)
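
The eighteen assignments above follow one pattern, so they could also be written as a loop. The sketch below does that on toy data and then drops any column that turns out to be constant. It assumes the date columns carry no time of day, which is what the zeros in the hour, minute, and second columns of the `X.head()` output in section 3.7 suggest for the rows shown. The column names mirror the notebook's, but the frame and values are illustrative.

```python
import pandas as pd

# Toy frame with date-only values (illustrative, not from the dataset)
toy = pd.DataFrame({
    "Gift Date": pd.to_datetime(["2015-01-01", "2016-03-15"]),
    "Last Gift Date": pd.to_datetime(["2015-01-01", "2024-03-18"]),
})

parts = ["year", "month", "day", "hour", "minute", "second"]
for col in ["Gift Date", "Last Gift Date"]:
    # "Gift Date" -> "Gift", "Last Gift Date" -> "Last_Gift"
    prefix = col.replace(" ", "_").replace("_Date", "")
    for part in parts:
        toy[f"{prefix}_{part.capitalize()}"] = getattr(toy[col].dt, part)

features = toy.drop(columns=["Gift Date", "Last Gift Date"])

# Columns with a single unique value carry no signal for a model
constant_cols = [c for c in features.columns if features[c].nunique() == 1]
features = features.drop(columns=constant_cols)
print(list(features.columns))
```

Note also that `Gift_Year` and `Gift_Month` would duplicate the existing `Gift Year` and `Gift Month` columns; keeping one copy of each is usually enough.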

After extracting numerical data from datetime values, the focus shifts to handling categorical variables.

## 3.6 Encoding Categorical Features <a id="36-encoding-categorical-features"></a> 

Categorical data requires encoding techniques like one-hot or label encoding for machine learning compatibility. Proper treatment of categorical variables is essential for building accurate predictive models, as they contain valuable information affecting model performance.

In [12]:
# A look at the categorical variables in the dataset

DonationData_V02_Cat = DonationData_V02.select_dtypes(include=['object'])
DonationData_V02_Cat.head()

Unnamed: 0,Contact Type,Contact Primary Full Address,Contact Primary Address City,Contact Primary Address State,Gift Type,Segment Name,Campaign Name,Tags_Notes_Combined
0,Household,Unknown,Unknown,Unknown,Other,General Segment,General Giving,Do not call;Website Email Submit;Ticket: 2015 ...
1,Household,Unknown,Unknown,Unknown,Other,General Segment,General Giving,Do not call;Website Email Submit;Ticket: 4th ...
2,Household,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,Other,General Segment,General Giving,Do not call;Website Email Submit;Sustaining Do...
3,Household,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,Other,General Segment,General Giving,Do not call;Website Email Submit;Sustaining Do...
4,Household,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,Other,General Segment,General Giving,Do not call;Website Email Submit;Sustaining Do...


In [129]:
# count of unique values to determine the best encoding technique
DonationData_V02_Cat.nunique().sort_values(ascending=False)

Tags_Notes_Combined              4338
Contact Primary Full Address     2009
Contact Primary Address City      833
Contact Primary Address State      76
Segment Name                       46
Gift Type                           7
Campaign Name                       5
Contact Type                        3
dtype: int64

One hot encoding will be applied to the following variables due to the low number of unique values present:
1. Gift Type
2. Campaign Name
3. Contact Type
4. Contact Primary Address State (in CA vs not in CA only - practical approach given the importance of California in dataset)


The following variables will be encoded using an NLP approach:
1. Tags_Notes_Combined     
2. Contact Primary Full Address
3. Contact Primary Address City
4. Segment Name


#### **3.6.1 One-Hot Encoding** <a id='361-One-Hot-Encoding'></a>  

In [138]:
# create a copy of dataframe for encoded variables
DonationData_V03_Encoded = DonationData_V02.copy(deep=True)

In [139]:
# Instantiate OneHotEncoder
hot_encoder = OneHotEncoder()

# Fit and transform column
encoded_values = hot_encoder.fit_transform(DonationData_V03_Encoded['Gift Type'].values.reshape(-1,1))
# Convert to dense array
encoded_values_dense = encoded_values.toarray()
# Create a DataFrame from the dense array with correct column names
encoded_df = pd.DataFrame(encoded_values_dense, columns=hot_encoder.get_feature_names_out(['Gift Type']))
# Drop the original 'Gift Type' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Gift Type'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, encoded_df], axis=1)


OneHotEncoder is applied to the 'Gift Type' column in the DataFrame, where each unique category becomes its own binary column. The resulting sparse matrix is converted to a dense array. Finally, the original 'Gift Type' column is replaced with the new binary columns, effectively transforming the categorical data into a binary representation.

This same approach will be applied to Campaign Name, Contact Type, and Contact Primary Address State.

In [140]:
# Fit and transform 'Campaign Name' column
campaign_encoded = hot_encoder.fit_transform(DonationData_V03_Encoded['Campaign Name'].values.reshape(-1, 1))
# Convert the sparse matrix to a dense array
campaign_encoded_dense = campaign_encoded.toarray()
# Create a DataFrame from the dense array with appropriate column names
campaign_encoded_df = pd.DataFrame(campaign_encoded_dense, columns=hot_encoder.get_feature_names_out(['Campaign Name']))
# Drop the original 'Campaign Name' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Campaign Name'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, campaign_encoded_df], axis=1)

# Fit and transform 'Contact Type' column
contact_encoded = hot_encoder.fit_transform(DonationData_V03_Encoded['Contact Type'].values.reshape(-1, 1))
# Convert the sparse matrix to a dense array
contact_encoded_dense = contact_encoded.toarray()
# Create a DataFrame from the dense array with appropriate column names
contact_encoded_df = pd.DataFrame(contact_encoded_dense, columns=hot_encoder.get_feature_names_out(['Contact Type']))
# Drop the original 'Contact Type' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Contact Type'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, contact_encoded_df], axis=1)

Encoding Contact Primary Address State will be slightly different since we only want to know CA vs Not In CA. 

In [141]:
# Transform 'Contact Primary Address State' column into 'CA' vs 'Not CA'
DonationData_V03_Encoded['Contact Primary Address State'] = DonationData_V03_Encoded['Contact Primary Address State'].apply(lambda x: 'CA' if x == 'CA' else 'Not CA')

# Fit and transform 'Contact Primary Address State' column
state_encoded = hot_encoder.fit_transform(DonationData_V03_Encoded['Contact Primary Address State'].values.reshape(-1, 1))
# Convert the sparse matrix to a dense array
state_encoded_dense = state_encoded.toarray()
# Create a DataFrame from the dense array with appropriate column names
state_encoded_df = pd.DataFrame(state_encoded_dense, columns=hot_encoder.get_feature_names_out(['Contact Primary Address State']))
# Drop the original 'Contact Primary Address State' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Contact Primary Address State'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, state_encoded_df], axis=1)
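
For a two-level variable such as CA vs Not CA, the two one-hot columns are exact complements of each other, so a single 0/1 flag holds the same information. A minimal sketch, where the column name `In_CA` is a hypothetical choice:

```python
import pandas as pd

# Toy state values (illustrative only)
states = pd.Series(["CA", "Unknown", "NV", "CA"], name="Contact Primary Address State")

# 1 when the donor's primary address is in California, 0 otherwise
in_ca = (states == "CA").astype(int).rename("In_CA")
print(in_ca.tolist())
```

If staying with `OneHotEncoder` is preferred, its `drop="if_binary"` option achieves a similar result by keeping only one column for binary features.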

In [142]:
DonationData_V03_Encoded.head()

Unnamed: 0,Contact Id,Contact Primary Full Address,Contact Primary Address City,Contact Social Score,Donor Tenure Years,Churned,Amount,Segment Name,Gift Year,Quarter,...,Campaign Name_CSEC Advocacy Course,Campaign Name_Capital Campaign,Campaign Name_End of Year 2023,Campaign Name_General Giving,Campaign Name_General Giving 2024,Contact Type_Foundation,Contact Type_Household,Contact Type_Organization,Contact Primary Address State_CA,Contact Primary Address State_Not CA
0,1,Unknown,Unknown,73.0,0.0,1,390.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,2,Unknown,Unknown,30.0,0.0,1,50.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,3,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,91.0,9.21,0,120.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,3,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,91.0,9.21,0,12.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
4,3,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,91.0,9.21,0,144.0,General Segment,2016,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


#### **3.6.2 NLP Encoding** <a id='362-NLP-Encoding'></a>

The following variables will be encoded using an NLP approach:

- Tags_Notes_Combined
- Contact Primary Full Address
- Contact Primary Address City
- Segment Name

In [143]:
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Set stopwords
stop_words = set(stopwords.words('english'))

# Define preprocess function
def preprocess(text):
    text = text.lower()  # Make strings lowercase
    text = ''.join([char for char in text if char not in string.punctuation])  # Remove punctuation characters
    tokens = nltk.tokenize.word_tokenize(text)  # Tokenize text
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)  # Join tokens back into text

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
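
One detail worth checking in `preprocess`: punctuation is deleted rather than replaced, so semicolon-separated tags with no surrounding spaces are fused into one token. The sketch below contrasts the two behaviours using only the standard library. The tiny `STOP_WORDS` set and the function names are illustrative assumptions; the notebook itself uses NLTK's English stop-word list and `word_tokenize`.

```python
import string

# Small illustrative stop-word set (the notebook uses NLTK's English list)
STOP_WORDS = {"do", "not", "the", "a", "to"}

# Translation table mapping every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punct(text):
    """Delete punctuation outright, as the notebook's preprocess step does."""
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

def preprocess_spaced(text):
    """Replace punctuation with spaces, split on whitespace, drop stop words."""
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "Do not call;Website Email Submit"
print(strip_punct(sample))        # do not callwebsite email submit
print(preprocess_spaced(sample))  # call website email submit
```

Replacing punctuation with spaces keeps each tag as its own token, which may give the Word2Vec step cleaner vocabulary to learn from.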


In [144]:
# Variables not encoded in 3.6.1
longer_text_columns = ["Tags_Notes_Combined",
                       "Contact Primary Full Address",
                       "Contact Primary Address City",
                       "Segment Name"]  

feature_name_mapping = {}

# Create a copy of the DataFrame
DonationData_V03_Encoded_copy = DonationData_V03_Encoded.copy()

# Iterate over each text column and each row to preprocess the text data
for col in longer_text_columns:  
    for index, value in DonationData_V03_Encoded_copy[col].items():  
        # Preprocess text data for the specific row and column
        processed_value = preprocess(value)
        DonationData_V03_Encoded_copy.at[index, col] = processed_value

# Initialize a dictionary to store vectorized values
vectorized_data = {}

# Train Word2Vec models for each text column
for col in longer_text_columns:
    # Tokenize text data
    sentences = [sentence.split() for sentence in DonationData_V03_Encoded_copy[col]]
    
    # Train Word2Vec model
    w2v_model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
    
    # Vectorize text data
    vectorized_values = np.array([np.mean([w2v_model.wv[word] for word in sentence if word in w2v_model.wv] or [np.zeros(50)], axis=0) for sentence in sentences])
    
    # Store vectorized values in the dictionary
    for j in range(w2v_model.vector_size):
        vectorized_data[f"{col}_vec_{j}"] = vectorized_values[:, j]

# Create a new DataFrame with vectorized values
DonationData_V03_Encoded_vectorized = pd.concat([DonationData_V03_Encoded_copy, pd.DataFrame(vectorized_data)], axis=1)

# Drop the original text columns
DonationData_V03_Encoded_vectorized.drop(columns=longer_text_columns, inplace=True)
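
The vectorization step above represents each text value as the average of its word vectors, falling back to zeros when no word is found. The sketch below isolates that mean-pooling idea with NumPy only. The three-dimensional `toy_vectors` dictionary is a made-up stand-in for a trained `Word2Vec` model's `wv` lookup, and `mean_vector` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for w2v_model.wv
toy_vectors = {
    "general": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "segment": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0)

sentences = [["general", "segment"], ["unknown"], []]
matrix = np.vstack([mean_vector(s, toy_vectors, dim=3) for s in sentences])
print(matrix)
```

Pulling the fallback into a named helper makes the empty-text case explicit, which is relevant here because rows with no tags and no notes become empty strings.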

In [148]:
DonationData_V03_Encoded_vectorized.head()

Unnamed: 0,Contact Id,Contact Social Score,Donor Tenure Years,Churned,Amount,Gift Year,Quarter,Gift Month,Donation Count,Average Amount,...,Segment Name_vec_40,Segment Name_vec_41,Segment Name_vec_42,Segment Name_vec_43,Segment Name_vec_44,Segment Name_vec_45,Segment Name_vec_46,Segment Name_vec_47,Segment Name_vec_48,Segment Name_vec_49
0,1,73.0,0.0,1,390.0,2015,1,1,1,390.0,...,0.050676,0.003257,-0.036793,0.027339,0.047814,-0.008899,-0.013134,-0.018401,0.033282,0.039686
1,2,30.0,0.0,1,50.0,2015,1,1,1,50.0,...,0.050676,0.003257,-0.036793,0.027339,0.047814,-0.008899,-0.013134,-0.018401,0.033282,0.039686
2,3,91.0,9.21,0,120.0,2015,1,1,87,18.068966,...,0.050676,0.003257,-0.036793,0.027339,0.047814,-0.008899,-0.013134,-0.018401,0.033282,0.039686
3,3,91.0,9.21,0,12.0,2015,1,1,87,18.068966,...,0.050676,0.003257,-0.036793,0.027339,0.047814,-0.008899,-0.013134,-0.018401,0.033282,0.039686
4,3,91.0,9.21,0,144.0,2016,1,1,87,18.068966,...,0.050676,0.003257,-0.036793,0.027339,0.047814,-0.008899,-0.013134,-0.018401,0.033282,0.039686


In [151]:
# store version of preprocessed data that is not scaled
unscaled_df_path = '/Users/frankyaraujo/Development/springboard_main/\
Capstone Three/Springboard-Capstone-Three/src/data/Donations _ Jan 2015 to Mar 2024_R4_vectorized_encoded .csv'

# Write DataFrames to CSV files
DonationData_V03_Encoded_vectorized.to_csv(unscaled_df_path, index=False)

## 3.7 Train/Test Split  <a id="37-traintest-split"></a> 

In [154]:
# Showing that all columns are now of a numerical data type
DonationData_V03_Encoded_vectorized.dtypes.unique()

array([dtype('int64'), dtype('float64'), dtype('float32')], dtype=object)

In [200]:
# Check for missing values

DonationData_V03_Encoded_vectorized[
DonationData_V03_Encoded_vectorized["Contact Social Score"].isna()]

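# Fill the missing value in 'Contact Social Score' with zero
# (assigning the result back avoids chained-assignment warnings in newer pandas)
DonationData_V03_Encoded_vectorized["Contact Social Score"] = DonationData_V03_Encoded_vectorized["Contact Social Score"].fillna(0)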

In [202]:
# 'Churned' is the target variable and the other columns are features
X = DonationData_V03_Encoded_vectorized.drop(columns=['Churned'])
y = DonationData_V03_Encoded_vectorized['Churned']

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
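
The rows displayed earlier show the same `Contact Id` appearing on several gifts with an identical `Churned` value, so a purely random row split can place one donor in both the training and test sets. If that is a concern, a group-aware split is one option. The sketch below uses toy data, and dropping `Contact Id` from the features is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy gift-level rows; several gifts share a donor (illustrative values)
toy = pd.DataFrame({
    "Contact Id": [1, 2, 3, 3, 3, 4, 4, 5, 6, 6],
    "Amount": [390.0, 50.0, 120.0, 12.0, 144.0, 20.0, 25.0, 75.0, 10.0, 15.0],
    "Churned": [1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
})

X = toy.drop(columns=["Churned", "Contact Id"])
y = toy["Churned"]
groups = toy["Contact Id"]

# Hold out roughly 20% of donors (not rows) for testing
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

train_donors = set(groups.iloc[train_idx])
test_donors = set(groups.iloc[test_idx])
print(train_donors & test_donors)  # empty set: no donor appears in both
```

With the plain `train_test_split` used above, passing `stratify=y` is a lighter-weight adjustment that at least keeps the churn rate similar in both sets.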


In [164]:
X.head()

Unnamed: 0,Contact Id,Contact Social Score,Donor Tenure Years,Amount,Gift Year,Quarter,Gift Month,Donation Count,Average Amount,LTV,Gift_Year,Gift_Month,Gift_Day,Gift_Hour,Gift_Minute,Gift_Second,First_Gift_Year,First_Gift_Month,First_Gift_Day,First_Gift_Hour,First_Gift_Minute,First_Gift_Second,Last_Gift_Year,Last_Gift_Month,Last_Gift_Day,Last_Gift_Hour,Last_Gift_Minute,Last_Gift_Second,Gift Type_Cash,Gift Type_Check,Gift Type_Credit,Gift Type_Electronic Funds Transfer,Gift Type_Non-Cash,Gift Type_Other,Gift Type_Stock,Campaign Name_CSEC Advocacy Course,Campaign Name_Capital Campaign,Campaign Name_End of Year 2023,Campaign Name_General Giving,Campaign Name_General Giving 2024,Contact Type_Foundation,Contact Type_Household,Contact Type_Organization,Contact Primary Address State_CA,Contact Primary Address State_Not CA,Tags_Notes_Combined_vec_0,Tags_Notes_Combined_vec_1,Tags_Notes_Combined_vec_2,Tags_Notes_Combined_vec_3,Tags_Notes_Combined_vec_4,Tags_Notes_Combined_vec_5,Tags_Notes_Combined_vec_6,Tags_Notes_Combined_vec_7,Tags_Notes_Combined_vec_8,Tags_Notes_Combined_vec_9,Tags_Notes_Combined_vec_10,Tags_Notes_Combined_vec_11,Tags_Notes_Combined_vec_12,Tags_Notes_Combined_vec_13,Tags_Notes_Combined_vec_14,Tags_Notes_Combined_vec_15,Tags_Notes_Combined_vec_16,Tags_Notes_Combined_vec_17,Tags_Notes_Combined_vec_18,Tags_Notes_Combined_vec_19,Tags_Notes_Combined_vec_20,Tags_Notes_Combined_vec_21,Tags_Notes_Combined_vec_22,Tags_Notes_Combined_vec_23,Tags_Notes_Combined_vec_24,Tags_Notes_Combined_vec_25,Tags_Notes_Combined_vec_26,Tags_Notes_Combined_vec_27,Tags_Notes_Combined_vec_28,Tags_Notes_Combined_vec_29,Tags_Notes_Combined_vec_30,Tags_Notes_Combined_vec_31,Tags_Notes_Combined_vec_32,Tags_Notes_Combined_vec_33,Tags_Notes_Combined_vec_34,Tags_Notes_Combined_vec_35,Tags_Notes_Combined_vec_36,Tags_Notes_Combined_vec_37,Tags_Notes_Combined_vec_38,Tags_Notes_Combined_vec_39,Tags_Notes_Combined_vec_40,Tags_Notes_Combined_vec_41,Tags_Notes_Combined_vec_42,Tags_Notes_Combined_vec_43,Tags_Notes_Combined_vec_44,Tags_Notes_Combined_vec_45,Tags_Notes_Combined_vec_46,Tags_Notes_Combined_vec_47,Tags_Notes_Combined_vec_48,Tags_Notes_Combined_vec_49,Contact Primary Full Address_vec_0,Contact Primary Full Address_vec_1,Contact Primary Full Address_vec_2,Contact Primary Full Address_vec_3,Contact Primary Full Address_vec_4,Contact Primary Full Address_vec_5,Contact Primary Full Address_vec_6,Contact Primary Full Address_vec_7,Contact Primary Full Address_vec_8,Contact Primary Full Address_vec_9,Contact Primary Full Address_vec_10,Contact Primary Full Address_vec_11,Contact Primary Full Address_vec_12,Contact Primary Full Address_vec_13,Contact Primary Full Address_vec_14,Contact Primary Full Address_vec_15,Contact Primary Full Address_vec_16,Contact Primary Full Address_vec_17,Contact Primary Full Address_vec_18,Contact Primary Full Address_vec_19,Contact Primary Full Address_vec_20,Contact Primary Full Address_vec_21,Contact Primary Full Address_vec_22,Contact Primary Full Address_vec_23,Contact Primary Full Address_vec_24,Contact Primary Full Address_vec_25,Contact Primary Full Address_vec_26,Contact Primary Full Address_vec_27,Contact Primary Full Address_vec_28,Contact Primary Full Address_vec_29,Contact Primary Full Address_vec_30,Contact Primary Full Address_vec_31,Contact Primary Full Address_vec_32,Contact Primary Full Address_vec_33,Contact Primary Full Address_vec_34,Contact Primary Full Address_vec_35,Contact Primary Full Address_vec_36,Contact Primary Full Address_vec_37,Contact Primary Full Address_vec_38,Contact Primary Full Address_vec_39,Contact Primary Full Address_vec_40,Contact Primary Full Address_vec_41,Contact Primary Full Address_vec_42,Contact Primary Full Address_vec_43,Contact Primary Full Address_vec_44,Contact Primary Full Address_vec_45,Contact Primary Full Address_vec_46,Contact Primary Full Address_vec_47,Contact Primary Full Address_vec_48,Contact Primary Full Address_vec_49,Contact Primary Address City_vec_0,Contact Primary Address City_vec_1,Contact Primary Address City_vec_2,Contact Primary Address City_vec_3,Contact Primary Address City_vec_4,Contact Primary Address City_vec_5,Contact Primary Address City_vec_6,Contact Primary Address City_vec_7,Contact Primary Address City_vec_8,Contact Primary Address City_vec_9,Contact Primary Address City_vec_10,Contact Primary Address City_vec_11,Contact Primary Address City_vec_12,Contact Primary Address City_vec_13,Contact Primary Address City_vec_14,Contact Primary Address City_vec_15,Contact Primary Address City_vec_16,Contact Primary Address City_vec_17,Contact Primary Address City_vec_18,Contact Primary Address City_vec_19,Contact Primary Address City_vec_20,Contact Primary Address City_vec_21,Contact Primary Address City_vec_22,Contact Primary Address City_vec_23,Contact Primary Address City_vec_24,Contact Primary Address City_vec_25,Contact Primary Address City_vec_26,Contact Primary Address City_vec_27,Contact Primary Address City_vec_28,Contact Primary Address City_vec_29,Contact Primary Address City_vec_30,Contact Primary Address City_vec_31,Contact Primary Address City_vec_32,Contact Primary Address City_vec_33,Contact Primary Address City_vec_34,Contact Primary Address City_vec_35,Contact Primary Address City_vec_36,Contact Primary Address City_vec_37,Contact Primary Address City_vec_38,Contact Primary Address City_vec_39,Contact Primary Address City_vec_40,Contact Primary Address City_vec_41,Contact Primary Address City_vec_42,Contact Primary Address City_vec_43,Contact Primary Address City_vec_44,Contact Primary Address City_vec_45,Contact Primary Address City_vec_46,Contact Primary Address City_vec_47,Contact Primary Address City_vec_48,Contact Primary Address City_vec_49,Segment Name_vec_0,Segment Name_vec_1,Segment Name_vec_2,Segment Name_vec_3,Segment Name_vec_4,Segment Name_vec_5,Segment Name_vec_6,Segment Name_vec_7,Segment Name_vec_8,Segment Name_vec_9,Segment Name_vec_10,Segment Name_vec_11,Segment Name_vec_12,Segment Name_vec_13,Segment Name_vec_14,Segment Name_vec_15,Segment Name_vec_16,Segment Name_vec_17,Segment Name_vec_18,Segment Name_vec_19,Segment Name_vec_20,Segment Name_vec_21,Segment Name_vec_22,Segment Name_vec_23,Segment Name_vec_24,Segment Name_vec_25,Segment Name_vec_26,Segment Name_vec_27,Segment Name_vec_28,Segment Name_vec_29,Segment Name_vec_30,Segment Name_vec_31,Segment Name_vec_32,Segment Name_vec_33,Segment Name_vec_34,Segment Name_vec_35,Segment Name_vec_36,Segment Name_vec_37,Segment Name_vec_38,Segment Name_vec_39,Segment Name_vec_40,Segment Name_vec_41,Segment Name_vec_42,Segment Name_vec_43,Segment Name_vec_44,Segment Name_vec_45,Segment Name_vec_46,Segment Name_vec_47,Segment Name_vec_48,Segment Name_vec_49
0,1,73.0,0.0,390.0,2015,1,1,1,390.0,390.0,2015,1,1,0,0,0,2015,1,1,0,0,0,2015,1,1,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.654299,0.199458,-0.363562,0.159784,-0.403311,-0.45214,0.028981,0.525316,0.060092,0.277174,0.724722,-0.65426,0.851242,0.015334,-0.192282,-0.196604,0.827486,-0.584136,-0.075217,0.374486,0.204932,-0.24679,-0.194309,0.022532,0.435364,-0.534678,0.034977,0.224616,0.003365,-0.386344,0.712138,-0.770285,0.123028,0.194845,-0.309809,-0.491201,0.830734,0.008223,-0.220348,0.064453,-0.246629,-0.491006,-0.132539,-0.054286,-1.048945,-0.783534,0.212765,0.331711,-0.429115,0.486331,-0.016316,0.008992,-0.008274,0.001649,0.016997,-0.008924,0.009035,-0.013574,-0.007097,0.018797,-0.003155,0.000643,-0.008281,-0.015365,-0.003016,0.00494,-0.001776,0.011067,-0.005486,0.00452,0.010912,0.016692,-0.002907,-0.018416,0.008741,0.001144,0.014884,-0.001627,-0.005277,-0.017506,-0.001713,0.005653,0.010803,0.014105,-0.011406,0.003718,0.012178,-0.009596,-0.006215,0.013595,0.003263,0.00038,0.006947,0.000436,0.019238,0.010121,-0.017835,-0.014083,0.001803,0.012785,-0.001072,0.000473,0.010207,0.018019,-0.018606,-0.014234,0.012918,0.017946,-0.010031,-0.007527,0.014761,-0.003067,-0.009073,0.013108,-0.00972,-0.003632,0.005753,0.001984,-0.01657,-0.018898,0.014624,0.010141,0.013515,0.001526,0.012702,-0.006811,-0.001893,0.011537,-0.015043,-0.007872,-0.015023,-0.00186,0.019076,-0.014638,-0.004668,-0.003875,0.016155,-0.011862,9e-05,-0.009507,-0.019207,0.010015,-0.017519,-0.008784,-7e-05,-0.000592,-0.015322,0.019229,0.009964,0.018466,-0.022141,0.022883,0.012653,0.017343,-0.034844,-0.029945,0.04599,-0.008284,-0.054294,-0.033811,0.025937,-0.031531,-0.002927,-0.028796,-0.005654,0.017519,0.007725,0.018633,-0.043528,-0.08255,0.013898,0.044348,0.050147,0.012264,-0.004502,0.032557,-0.021866,0.00748,-0.006221,-0.008299,0.001098,-0.062827,0.046034,-0.050737,-0.019073,-0.015523,0.054553,0.045936,-0.025505,-0.007088,0.050676,0.003257,-0.036793,0.027339,0.047814,-0.008899,-0
.013134,-0.018401,0.033282,0.039686
1,2,30.0,0.0,50.0,2015,1,1,1,50.0,50.0,2015,1,1,0,0,0,2015,1,1,0,0,0,2015,1,1,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.504203,0.229724,-0.366558,0.051531,-0.269202,-0.475554,0.039877,0.434967,0.116327,0.129988,0.726229,-0.644104,0.765907,0.001992,-0.081776,-0.196955,0.510067,-0.570266,0.01964,0.292362,0.078433,-0.08068,-0.296474,-0.02307,0.368441,-0.489794,-0.092535,0.026086,-0.058564,-0.311702,0.675508,-0.665601,-0.093419,0.277825,-0.281421,-0.353449,0.670081,0.144087,-0.077432,0.116892,-0.2111,-0.401764,-0.200086,-0.05301,-0.864323,-0.613467,0.107526,0.277623,-0.361355,0.305839,-0.016316,0.008992,-0.008274,0.001649,0.016997,-0.008924,0.009035,-0.013574,-0.007097,0.018797,-0.003155,0.000643,-0.008281,-0.015365,-0.003016,0.00494,-0.001776,0.011067,-0.005486,0.00452,0.010912,0.016692,-0.002907,-0.018416,0.008741,0.001144,0.014884,-0.001627,-0.005277,-0.017506,-0.001713,0.005653,0.010803,0.014105,-0.011406,0.003718,0.012178,-0.009596,-0.006215,0.013595,0.003263,0.00038,0.006947,0.000436,0.019238,0.010121,-0.017835,-0.014083,0.001803,0.012785,-0.001072,0.000473,0.010207,0.018019,-0.018606,-0.014234,0.012918,0.017946,-0.010031,-0.007527,0.014761,-0.003067,-0.009073,0.013108,-0.00972,-0.003632,0.005753,0.001984,-0.01657,-0.018898,0.014624,0.010141,0.013515,0.001526,0.012702,-0.006811,-0.001893,0.011537,-0.015043,-0.007872,-0.015023,-0.00186,0.019076,-0.014638,-0.004668,-0.003875,0.016155,-0.011862,9e-05,-0.009507,-0.019207,0.010015,-0.017519,-0.008784,-7e-05,-0.000592,-0.015322,0.019229,0.009964,0.018466,-0.022141,0.022883,0.012653,0.017343,-0.034844,-0.029945,0.04599,-0.008284,-0.054294,-0.033811,0.025937,-0.031531,-0.002927,-0.028796,-0.005654,0.017519,0.007725,0.018633,-0.043528,-0.08255,0.013898,0.044348,0.050147,0.012264,-0.004502,0.032557,-0.021866,0.00748,-0.006221,-0.008299,0.001098,-0.062827,0.046034,-0.050737,-0.019073,-0.015523,0.054553,0.045936,-0.025505,-0.007088,0.050676,0.003257,-0.036793,0.027339,0.047814,-0.008899,-0.01
3134,-0.018401,0.033282,0.039686
2,3,91.0,9.21,120.0,2015,1,1,87,18.068966,1415.473233,2015,1,1,0,0,0,2015,1,1,0,0,0,2024,3,18,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.528352,0.196191,-0.287676,0.085154,-0.336827,-0.475813,0.050493,0.46146,-0.018394,0.164027,0.602672,-0.549987,0.695219,-0.024784,-0.206348,-0.169388,0.765493,-0.445313,-0.107635,0.283829,0.205644,-0.194623,-0.057433,0.012599,0.318638,-0.397034,0.012923,0.213754,0.013157,-0.363715,0.542554,-0.588991,0.159334,0.173636,-0.249968,-0.384736,0.785276,-0.035642,-0.243123,0.000942,-0.232384,-0.474961,-0.063062,-0.067352,-0.744242,-0.642701,0.151805,0.286381,-0.310852,0.363256,0.236734,-0.596884,0.0591,0.883696,-0.373541,-1.072118,-0.505115,0.176888,-0.020569,0.693868,-0.789264,-0.628432,-1.005677,0.633899,-0.790053,0.691308,-1.14614,-0.250672,0.095467,-1.249284,0.841842,0.01102,0.115197,-0.020759,0.559159,0.044142,0.269154,0.186147,0.371155,-0.112323,1.056579,-0.125247,0.093316,0.16118,-0.1648,-0.048475,0.842904,-0.053666,-0.299422,-1.025106,0.747352,-0.768605,-0.053896,-0.04126,0.622131,0.692342,-0.526229,-1.033346,-0.048032,1.009912,-0.114835,-0.148513,-0.42472,-0.041384,-0.119968,-0.068098,0.066651,0.132182,-0.413611,-0.089113,-0.054915,-0.023953,0.083507,0.28902,0.136567,0.135773,0.278706,-0.009128,0.083999,-0.14522,0.319137,0.158328,0.213017,0.020888,0.253668,0.181464,-0.195191,0.113205,0.023739,0.028226,-0.059313,0.034832,-0.120293,-0.260826,-0.058455,-0.110418,0.087129,-0.069011,-0.103452,-0.175176,0.029051,-0.026919,0.07617,0.16731,0.125893,0.130031,-0.101138,0.250874,-0.176394,-0.152865,-0.022141,0.022883,0.012653,0.017343,-0.034844,-0.029945,0.04599,-0.008284,-0.054294,-0.033811,0.025937,-0.031531,-0.002927,-0.028796,-0.005654,0.017519,0.007725,0.018633,-0.043528,-0.08255,0.013898,0.044348,0.050147,0.012264,-0.004502,0.032557,-0.021866,0.00748,-0.006221,-0.008299,0.001098,-0.062827,0.046034,-0.050737,-0.019073,-0.015523,0.054553,0.045936,-0.025505,-0.007088,0.050676,0.003257,-0.036793,0.027339,0.0
47814,-0.008899,-0.013134,-0.018401,0.033282,0.039686
3,3,91.0,9.21,12.0,2015,1,1,87,18.068966,1415.473233,2015,1,1,0,0,0,2015,1,1,0,0,0,2024,3,18,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.355276,0.148548,-0.236525,0.091454,-0.27836,-0.584948,0.045952,0.526531,-0.017243,0.24345,0.63469,-0.731267,0.884374,-0.13701,-0.139807,-0.136404,0.865399,-0.414025,-0.087253,0.095768,0.272044,-0.034363,-0.060337,-0.034961,0.488086,-0.394108,0.010063,0.194405,-0.194593,-0.695273,0.421144,-0.554699,0.250112,0.229537,-0.366306,-0.387504,1.044361,0.015671,-0.42818,0.035594,-0.237097,-0.372448,-0.102085,0.035893,-0.66491,-0.658809,0.106839,0.413737,-0.22347,0.593068,0.236734,-0.596884,0.0591,0.883696,-0.373541,-1.072118,-0.505115,0.176888,-0.020569,0.693868,-0.789264,-0.628432,-1.005677,0.633899,-0.790053,0.691308,-1.14614,-0.250672,0.095467,-1.249284,0.841842,0.01102,0.115197,-0.020759,0.559159,0.044142,0.269154,0.186147,0.371155,-0.112323,1.056579,-0.125247,0.093316,0.16118,-0.1648,-0.048475,0.842904,-0.053666,-0.299422,-1.025106,0.747352,-0.768605,-0.053896,-0.04126,0.622131,0.692342,-0.526229,-1.033346,-0.048032,1.009912,-0.114835,-0.148513,-0.42472,-0.041384,-0.119968,-0.068098,0.066651,0.132182,-0.413611,-0.089113,-0.054915,-0.023953,0.083507,0.28902,0.136567,0.135773,0.278706,-0.009128,0.083999,-0.14522,0.319137,0.158328,0.213017,0.020888,0.253668,0.181464,-0.195191,0.113205,0.023739,0.028226,-0.059313,0.034832,-0.120293,-0.260826,-0.058455,-0.110418,0.087129,-0.069011,-0.103452,-0.175176,0.029051,-0.026919,0.07617,0.16731,0.125893,0.130031,-0.101138,0.250874,-0.176394,-0.152865,-0.022141,0.022883,0.012653,0.017343,-0.034844,-0.029945,0.04599,-0.008284,-0.054294,-0.033811,0.025937,-0.031531,-0.002927,-0.028796,-0.005654,0.017519,0.007725,0.018633,-0.043528,-0.08255,0.013898,0.044348,0.050147,0.012264,-0.004502,0.032557,-0.021866,0.00748,-0.006221,-0.008299,0.001098,-0.062827,0.046034,-0.050737,-0.019073,-0.015523,0.054553,0.045936,-0.025505,-0.007088,0.050676,0.003257,-0.036793,0.027339,0.047814,-
0.008899,-0.013134,-0.018401,0.033282,0.039686
4,3,91.0,9.21,144.0,2016,1,1,87,18.068966,1415.473233,2016,1,1,0,0,0,2015,1,1,0,0,0,2024,3,18,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.010284,-0.21362,0.005701,0.255464,-0.246665,-0.495662,0.099346,0.781347,-0.204255,0.667988,0.381557,-0.994455,1.131108,-0.358135,-0.186692,0.028413,1.487947,-0.246462,-0.112064,-0.334259,0.661839,0.051793,0.256616,-0.085069,0.868707,-0.309087,0.219261,0.421195,-0.613843,-1.442493,-0.041404,-0.358839,0.904991,0.121572,-0.563802,-0.450472,1.646506,-0.147998,-0.964659,-0.024371,-0.124551,-0.210566,-0.05369,0.188424,-0.563607,-0.804619,0.220004,0.708776,0.05834,1.352892,0.236734,-0.596884,0.0591,0.883696,-0.373541,-1.072118,-0.505115,0.176888,-0.020569,0.693868,-0.789264,-0.628432,-1.005677,0.633899,-0.790053,0.691308,-1.14614,-0.250672,0.095467,-1.249284,0.841842,0.01102,0.115197,-0.020759,0.559159,0.044142,0.269154,0.186147,0.371155,-0.112323,1.056579,-0.125247,0.093316,0.16118,-0.1648,-0.048475,0.842904,-0.053666,-0.299422,-1.025106,0.747352,-0.768605,-0.053896,-0.04126,0.622131,0.692342,-0.526229,-1.033346,-0.048032,1.009912,-0.114835,-0.148513,-0.42472,-0.041384,-0.119968,-0.068098,0.066651,0.132182,-0.413611,-0.089113,-0.054915,-0.023953,0.083507,0.28902,0.136567,0.135773,0.278706,-0.009128,0.083999,-0.14522,0.319137,0.158328,0.213017,0.020888,0.253668,0.181464,-0.195191,0.113205,0.023739,0.028226,-0.059313,0.034832,-0.120293,-0.260826,-0.058455,-0.110418,0.087129,-0.069011,-0.103452,-0.175176,0.029051,-0.026919,0.07617,0.16731,0.125893,0.130031,-0.101138,0.250874,-0.176394,-0.152865,-0.022141,0.022883,0.012653,0.017343,-0.034844,-0.029945,0.04599,-0.008284,-0.054294,-0.033811,0.025937,-0.031531,-0.002927,-0.028796,-0.005654,0.017519,0.007725,0.018633,-0.043528,-0.08255,0.013898,0.044348,0.050147,0.012264,-0.004502,0.032557,-0.021866,0.00748,-0.006221,-0.008299,0.001098,-0.062827,0.046034,-0.050737,-0.019073,-0.015523,0.054553,0.045936,-0.025505,-0.007088,0.050676,0.003257,-0.036793,0.027339,0.047
814,-0.008899,-0.013134,-0.018401,0.033282,0.039686


In [165]:
y.head()

0    1
1    1
2    0
3    0
4    0
Name: Churned, dtype: int64

## 3.8 Scale the Data <a id='38-scale-the-data'></a>

The original dataset mixes datetime-derived features, class labels, and continuous variables, so Standard Scaling is applied only to the continuous variables. Standardizing them to zero mean and unit variance keeps features with larger magnitudes from dominating model training.

Class labels and one-hot columns are categorical indicators without a meaningful numerical magnitude, so scaling them is unnecessary and inappropriate. The Word2Vec text vectors are also left unscaled: their components are already small (roughly within ±2 in the rows shown above). Word2Vec does not normalize its output vectors by itself, so leaving them untouched is a modeling choice rather than a guarantee of comparable scale.


In [203]:
# Finding the continuous variables to scale
[col for col in DonationData_V03_Encoded_vectorized.columns 
                    if DonationData_V03_Encoded_vectorized[col].dtype == 'float64' 
                    and '_' not in col]

['Contact Social Score',
 'Donor Tenure Years',
 'Amount',
 'Average Amount',
 'LTV']

In [204]:
# Continuous columns
continuous_cols = ['Contact Social Score',
                   'Donor Tenure Years',
                   'Amount',
                   'Average Amount',
                   'LTV']

# Initialize the StandardScaler
num_scaler = StandardScaler()

# Fit and transform the scaler on the training set
X_train[continuous_cols] = num_scaler.fit_transform(X_train[continuous_cols])

# Transform the testing set using the scaler fitted on the training set
X_test[continuous_cols] = num_scaler.transform(X_test[continuous_cols])
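
The same "scale only the continuous columns" idea can also be expressed with scikit-learn's `ColumnTransformer`. The following is a minimal sketch, not the approach used in this notebook: the `toy` frame, its values, and the variable names are made up for illustration, and only two of the continuous columns are included.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the real features (values are invented)
toy = pd.DataFrame({
    "Amount": [390.0, 50.0, 120.0, 12.0, 144.0, 75.0],
    "LTV": [390.0, 50.0, 1415.5, 1415.5, 1415.5, 300.0],
    "Segment Name_vec_0": [0.01, -0.02, 0.03, 0.00, -0.01, 0.02],
})
toy_train, toy_test = toy.iloc[:4], toy.iloc[4:]

continuous = ["Amount", "LTV"]
column_scaler = ColumnTransformer(
    transformers=[("num", StandardScaler(), continuous)],
    remainder="passthrough",  # every other column is passed through unchanged
)

# Means and standard deviations are learned from the training rows only
train_scaled = column_scaler.fit_transform(toy_train)
# The test rows reuse the training statistics, so no test information leaks in
test_scaled = column_scaler.transform(toy_test)

# Output columns are ordered: scaled continuous columns first, then the rest
print(train_scaled[:, :2].mean(axis=0))  # close to 0 for each scaled column
```

Because the transformer carries its own column list, it can later be placed inside a `Pipeline` without assigning back into `X_train` and `X_test`.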


## 3.9 Train/Predict with a "Baseline Model" <a id='39-trainpredict-with-a-baseline-model'></a>

#### Fit the dummy classifier

In [205]:
# Fit the dummy classifier on the training data
dumb_cls = DummyClassifier(strategy='stratified')
dumb_cls.fit(X_train, y_train)

#### Assess dummy classifier performance

In [206]:
# Obtain predictions from the dummy classifier
y_pred_dummy = dumb_cls.predict(X_test)

# Calculate Accuracy
accuracy_dummy = accuracy_score(y_test, y_pred_dummy)
print("Dummy Classifier - Accuracy:", accuracy_dummy)

# Calculate Precision
precision_dummy = precision_score(y_test, y_pred_dummy)
print("Dummy Classifier - Precision:", precision_dummy)

# Calculate Recall
recall_dummy = recall_score(y_test, y_pred_dummy)
print("Dummy Classifier - Recall:", recall_dummy)

# Calculate F1-score
f1_dummy = f1_score(y_test, y_pred_dummy)
print("Dummy Classifier - F1-score:", f1_dummy)

# Calculate ROC AUC (computed here from hard class predictions, not predicted probabilities)
roc_auc_dummy = roc_auc_score(y_test, y_pred_dummy)
print("Dummy Classifier - ROC AUC:", roc_auc_dummy)

Dummy Classifier - Accuracy: 0.5224702812409394
Dummy Classifier - Precision: 0.4180440771349862
Dummy Classifier - Recall: 0.43080198722498225
Dummy Classifier - F1-score: 0.42432715833624607
Dummy Classifier - ROC AUC: 0.5082931504752362


**Accuracy**: About 52.25% of the dummy classifier's predictions are correct.

**Precision**: About 41.80%. Precision is the share of positive predictions (predicted churners) that are actually positive.

**Recall**: About 43.08%. Recall, also known as sensitivity, is the share of actual positive instances that the model correctly identifies.

**F1-score**: About 42.43%. The F1-score is the harmonic mean of precision and recall, so it balances the two metrics. A higher F1-score indicates better overall performance.

**ROC AUC**: About 0.508. ROC AUC (area under the receiver operating characteristic curve) measures how well the classifier distinguishes positive from negative instances. A value close to 0.5 means the classifier performs no better than random guessing, which is expected for a stratified dummy classifier. These figures are the baseline that every candidate model should clearly beat.
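
To make the definitions above concrete, here is a small worked sketch on invented labels (none of these numbers come from the donor data). It derives precision, recall, and F1 from the confusion matrix by hand, checks them against scikit-learn, and shows that ROC AUC generally differs depending on whether hard predictions or scores are passed in.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Invented example: 1 = churned, 0 = retained
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])                      # hard class predictions
y_score = np.array([0.9, 0.4, 0.8, 0.2, 0.6, 0.1, 0.3, 0.35])   # hypothetical probabilities

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)   # correct positives / all predicted positives
recall_manual = tp / (tp + fn)      # correct positives / all actual positives
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# The library functions agree with the hand calculation
assert np.isclose(precision_manual, precision_score(y_true, y_pred))
assert np.isclose(recall_manual, recall_score(y_true, y_pred))
assert np.isclose(f1_manual, f1_score(y_true, y_pred))

# ROC AUC from hard labels uses a single threshold; scores use the full ranking
auc_from_labels = roc_auc_score(y_true, y_pred)
auc_from_scores = roc_auc_score(y_true, y_score)
print(precision_manual, recall_manual, f1_manual, auc_from_labels, auc_from_scores)
```

For models that expose `predict_proba`, passing the positive-class probabilities to `roc_auc_score` is the more common choice, since it evaluates the ranking rather than one fixed threshold.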



## 3.10 Setting up Pipelines <a id="310-setting-up-pipelines"></a>

In this section, pipelines are set up to streamline the model comparison process. Pipelines chain preprocessing steps together with model fitting, which keeps the approach consistent and reproducible. Here, each candidate is stored as a (name, estimator) pair so that all models can be fitted and scored in a single loop.

#### 3.10.1 Define Pipeline(s) <a id="3101-define"></a>

In [207]:
# Pipeline for classification models to be explored

classification_pipelines = [
    ('RandomForest', RandomForestClassifier()),  # Ensemble method based on decision trees
    ('GradientBoosting', GradientBoostingClassifier()),  # Ensemble method that builds trees sequentially
    ('LogisticRegression', LogisticRegression()),  # Linear model for binary classification
    ('DecisionTree', DecisionTreeClassifier()),  # A single decision tree
    ('SVM', SVC()),  # Support Vector Machine classifier
    ('KNN', KNeighborsClassifier()),  # K-Nearest Neighbors classifier
    ('ANN', MLPClassifier())  # Artificial Neural Network classifier
]
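
The list above holds bare estimators. As a hedged sketch of what chaining preprocessing with model fitting could look like, each estimator can be wrapped in a scikit-learn `Pipeline` so that the scaler is refit inside every cross-validation fold. The data below is synthetic (`make_classification`), and `candidate_models`, `demo_pipelines`, and `demo_scores` are illustrative names, not objects from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and churn labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidate_models = [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("DecisionTree", DecisionTreeClassifier(random_state=42)),
]

# One pipeline per model: scaling step first, estimator last
demo_pipelines = {
    name: Pipeline([("scale", StandardScaler()), ("model", estimator)])
    for name, estimator in candidate_models
}

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler never sees the held-out fold
demo_scores = {
    name: cross_val_score(pipe, X_demo, y_demo, cv=5).mean()
    for name, pipe in demo_pipelines.items()
}
print(demo_scores)
```

This sketch scales every column for simplicity; restricting the scaler to the continuous columns would need a `ColumnTransformer` as the first step instead.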



## 3.11 Fit/Train/Predict and Assess Models  <a id="311-fit-train-predict-and-assess"></a>

In [None]:
# Optional: work on a 10% sample to reduce training time while iterating
# (assumes X is the feature DataFrame paired with the target Series y)

X_sample = X.sample(frac=0.1, random_state=42)
y_sample = y.loc[X_sample.index]

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_sample, y_sample, test_size=.3,
                                                            stratify=y_sample, random_state=42)

In [213]:
import time
# Record start time
start_time = time.time()

# Results for classification
results_cls = []
for name, model in classification_pipelines:
    model.fit(X_train, y_train)
    y_pred_cls = model.predict(X_test)
    accuracy_cls = accuracy_score(y_test, y_pred_cls)
    scores = cross_val_score(model, X_train, y_train, cv=5)  # cv=5 for 5-fold cross-validation
    results_cls.append({'Model': name,
                        'Accuracy': accuracy_cls,
                        'Cross-val-score':scores,
                        'Cross-val-score-avg':scores.mean()})

end_time = time.time()
# Calculate elapsed time
elapsed_time = end_time - start_time
print("\nElapsed Time:", elapsed_time, "seconds")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Elapsed Time: 318.285560131073 seconds
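
The convergence message in the output above comes from `LogisticRegression` reaching its default iteration limit. A minimal sketch of the two remedies the message itself suggests (raising `max_iter` and scaling the inputs) is shown below on synthetic data. The helper `fit_and_report` and the inflated first feature are assumptions made for illustration only.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale than the others
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1_000.0

def fit_and_report(estimator):
    """Fit the estimator; return (no_convergence_warning, iterations_used)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        estimator.fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    # For a pipeline, the fitted logistic regression is the last step
    log_reg = estimator[-1] if hasattr(estimator, "steps") else estimator
    return (not warned), int(log_reg.n_iter_[0])

# Default settings on the unscaled data (max_iter defaults to 100)
plain_ok, plain_iters = fit_and_report(LogisticRegression())

# Scaled inputs plus a higher iteration budget
scaled_ok, scaled_iters = fit_and_report(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
)
print("unscaled:", plain_ok, plain_iters, "| scaled:", scaled_ok, scaled_iters)
```

Whether the unscaled fit actually warns depends on the data and the scikit-learn version, so the sketch only reports what happened rather than assuming an outcome.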


In [214]:
# Display results for classification
results_df_cls = pd.DataFrame(results_cls)
print("\nClassification Results:")
results_df_cls


Classification Results:


|   | Model | Accuracy | Cross-val-score | Cross-val-score-avg |
|---|---|---|---|---|
| 0 | RandomForest | 0.99913 | [0.9978252990213845, 0.9971003986951794, 0.996...] | 0.997535 |
| 1 | GradientBoosting | 0.99971 | [0.9992750996737948, 0.9996375498368975, 1.0, ...] | 0.999783 |
| 2 | LogisticRegression | 0.913018 | [0.9057629575933309, 0.893802102210946, 0.9039...] | 0.899166 |
| 3 | DecisionTree | 1.0 | [0.9985501993475897, 1.0, 1.0, 1.0, 1.0] | 0.99971 |
| 4 | SVM | 0.591476 | [0.5846321130844508, 0.5846321130844508, 0.584...] | 0.584632 |
| 5 | KNN | 0.979414 | [0.9782529902138456, 0.9724537876042044, 0.983...] | 0.978108 |
| 6 | ANN | 0.954769 | [0.9492569771656397, 0.8909025009061254, 0.963...] | 0.916636 |
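
The comparison above reports accuracy only, while the baseline in section 3.9 was assessed on five metrics. One possible way to collect the same metrics for every candidate is `cross_validate` with a list of scorers. The sketch below uses synthetic data and only two models to stay short; `demo_models`, `metrics`, and `summary` are illustrative names, and the numbers it prints are unrelated to the results table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic, mildly imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.6, 0.4], random_state=42
)

demo_models = [
    ("RandomForest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
]
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

rows = []
for name, model in demo_models:
    # cross_validate returns one array per metric under the key "test_<metric>"
    cv_result = cross_validate(model, X_demo, y_demo, cv=5, scoring=metrics)
    row = {"Model": name}
    row.update({metric: cv_result[f"test_{metric}"].mean() for metric in metrics})
    rows.append(row)

# One row per model, sorted so the strongest F1 score comes first
summary = pd.DataFrame(rows).sort_values("f1", ascending=False).reset_index(drop=True)
print(summary)
```

Looking at precision, recall, and ROC AUC side by side makes it easier to spot a model that reaches high accuracy mainly by predicting the majority class.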


## 3.12 Final Model Selection <a id='314-final-model-selection'></a>

#### 3.12.2 Classifier Model Performance <a id='3142-random-forest-regression-model-performance'></a>


## 3.13 Conclusion <a id='315-conclusion'></a>
   