# 3 Pre-processing & Training Data Development <a id='3_Pre-processing_&_training_data_development'></a>
---

## 3.1 Contents <a id='31-contents'></a>

- [3.1 Contents](#31-contents)
- [3.2 Introduction](#32-introduction)
- [3.3 Imports](#33-imports)
- [3.4 Load The Data](#34-load-the-data)
- [3.5 Data Cleaning](#35_Data_cleaning)
    - [3.5.1 Imputing/Removing Missing Values](#351-imputing-missing-values)
    - [3.5.2 Data Type Manipulation](#352-Data-Type-Manipulation)
- [3.6 Encoding Categorical Features](#36-encoding-categorical-features)
    - [3.6.1 One-Hot Encoding](#361-One-Hot-Encoding)
    - [3.6.2 NLP Encoding](#362-NLP-Encoding)
- [3.7 Train/Test Split](#37-traintest-split)
- [3.8 Scale the Data](#38-scale-the-data)
- [3.9 Train/Predict with a "Baseline Model"](#39-trainpredict-with-a-baseline-model)
- [3.10 Setting up Pipelines](#310-setting-up-pipelines)
    - [3.10.1 Define Pipeline(s)](#3101-define)
- [3.11 Fit/Train/Predict and Assess Models](#311-fit-train-predict-and-assess)
- [3.12 Final Model Selection](#314-final-model-selection)
- [3.13 Conclusion](#315-conclusion)
 

## 3.2 Introduction <a id='32-introduction'></a>

This is a continuation of [2.0-faa-exploratory-data-analysis-cap3.ipynb](https://github.com/OCD0505/Springboard-Capstone-Three/blob/a828038e3a66cf911e11eda9a413b9a661266e08/notebooks/2.0-faa-exploratory-data-analysis-cap3.ipynb) focusing on feature engineering, training and model selection. 

Goals: impute missing values, encode categorical features, split the data into training and test sets, scale the data, set up a pipeline, and select a model.

### **Problem Statement:**
Enhance the effectiveness of donor engagement and support for a nonprofit organization by analyzing donor lifetime value, predicting churn, and implementing personalized retention strategies.

## 3.3 Imports <a id='33-imports'></a>

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string 
import datetime
import pickle

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score 
from sklearn.metrics import f1_score, roc_auc_score, roc_curve, auc
from sklearn.neural_network import MLPClassifier

from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.feature_selection import SelectKBest, f_regression


## 3.4 Load The Data <a id='34-load-the-data'></a>

In [2]:
# Store the file path in a variable, then use pd.read_csv() to load the data as a dataframe into DonationData

dataFilePath = '/Users/frankyaraujo/Development/springboard_main/\
Capstone Three/Springboard-Capstone-Three/src/data/Donations _ Jan 2015 to Mar 2024_R3 .csv'
DonationData = pd.read_csv(dataFilePath, low_memory = False)

In [3]:
# just a quick column name change to align naming conventions
DonationData.rename(columns={'Gift Date_last': 'Last Gift Date'}, inplace=True)

# dropping 'Unnamed: 0' as it is equivalent to the index
DonationData.drop(columns='Unnamed: 0', inplace=True)

In [4]:
DonationData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17244 entries, 0 to 17243
Data columns (total 23 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Contact Id                     17244 non-null  int64  
 1   Contact Type                   17244 non-null  object 
 2   Contact Tags                   16809 non-null  object 
 3   Contact Primary Full Address   17244 non-null  object 
 4   Contact Primary Address City   17244 non-null  object 
 5   Contact Primary Address State  17244 non-null  object 
 6   Contact Social Score           17243 non-null  float64
 7   Donor Tenure Years             17244 non-null  float64
 8   Churned                        17244 non-null  int64  
 9   Gift Date                      17244 non-null  object 
 10  Amount                         17244 non-null  float64
 11  Gift Type                      17244 non-null  object 
 12  Notes                          5915 non-null  

In [5]:
DonationData.head()

Unnamed: 0,Contact Id,Contact Type,Contact Tags,Contact Primary Full Address,Contact Primary Address City,Contact Primary Address State,Contact Social Score,Donor Tenure Years,Churned,Gift Date,...,Segment Name,Campaign Name,Gift Year,Quarter,Gift Month,Donation Count,Average Amount,LTV,First Gift Date,Last Gift Date
0,1,Household,Do not call;Website Email Submit,Unknown,Unknown,Unknown,73.0,0.0,1,2015-01-01,...,General Segment,General Giving,2015,1,1,1,390.0,390.0,2015-01-01,2015-01-01
1,2,Household,Do not call;Website Email Submit,Unknown,Unknown,Unknown,30.0,0.0,1,2015-01-01,...,General Segment,General Giving,2015,1,1,1,50.0,50.0,2015-01-01,2015-01-01
2,3,Household,Do not call;Website Email Submit;Sustaining Do...,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,91.0,9.21,0,2015-01-01,...,General Segment,General Giving,2015,1,1,87,18.068966,1415.473233,2015-01-01,2024-03-18
3,3,Household,Do not call;Website Email Submit;Sustaining Do...,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,91.0,9.21,0,2015-01-01,...,General Segment,General Giving,2015,1,1,87,18.068966,1415.473233,2015-01-01,2024-03-18
4,3,Household,Do not call;Website Email Submit;Sustaining Do...,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,91.0,9.21,0,2016-01-01,...,General Segment,General Giving,2016,1,1,87,18.068966,1415.473233,2015-01-01,2024-03-18


## 3.5 Data Cleaning <a id='35_Data_cleaning'></a>

Before splitting the data for training and testing, it's good practice to examine it closely for missing values or inconsistencies that could cause problems down the line. Fixing these issues first makes it easier to choose suitable models.

In [6]:
# From the cells above, the date columns do not display the appropriate data type.
# Convert date columns to datetime64 format

DonationData['Gift Date'] = pd.to_datetime(DonationData['Gift Date'])
DonationData['First Gift Date'] = pd.to_datetime(DonationData['First Gift Date'])
DonationData['Last Gift Date'] = pd.to_datetime(DonationData['Last Gift Date'])

print(DonationData.dtypes)

Contact Id                                int64
Contact Type                             object
Contact Tags                             object
Contact Primary Full Address             object
Contact Primary Address City             object
Contact Primary Address State            object
Contact Social Score                    float64
Donor Tenure Years                      float64
Churned                                   int64
Gift Date                        datetime64[ns]
Amount                                  float64
Gift Type                                object
Notes                                    object
Segment Name                             object
Campaign Name                            object
Gift Year                                 int64
Quarter                                   int64
Gift Month                                int64
Donation Count                            int64
Average Amount                          float64
LTV                                     

<a id='351-imputing-missing-values'></a>
### 3.5.1 Imputing/Removing Missing Values

In [7]:
# look at the missing value %s
pd.DataFrame(DonationData.isnull().sum() / len(DonationData) * 100)\
    .sort_values(by=0, ascending=False)\
    .reset_index()\
    .rename(columns={'index': 'Column', 0: 'Missing Values %'})

Unnamed: 0,Column,Missing Values %
0,Notes,65.698214
1,Contact Tags,2.522617
2,Contact Social Score,0.005799
3,Contact Id,0.0
4,Segment Name,0.0
5,First Gift Date,0.0
6,LTV,0.0
7,Average Amount,0.0
8,Donation Count,0.0
9,Gift Month,0.0


Approximately 66% of the values in the 'Notes' column are missing, which would normally make the column a candidate for dropping. However, 'Notes' holds free text, as does 'Contact Tags', so the two columns can be combined and encoded with Natural Language Processing (NLP) techniques. This approach makes use of the available text without dropping the column or imputing values.

In [8]:
# Fill missing values in 'Contact Tags' and 'Notes' with empty strings
# so that concatenation succeeds even when either column is missing.
DonationData['Contact Tags'] = DonationData['Contact Tags'].fillna('')
DonationData['Notes'] = DonationData['Notes'].fillna('')

# Concatenate 'Contact Tags' and 'Notes' columns with a semicolon separator
DonationData['Tags_Notes_Combined'] = DonationData['Contact Tags'] + ';' + DonationData['Notes']

# Drop the old columns
DonationData.drop(columns=['Contact Tags', 'Notes'], inplace=True)

In [9]:
DonationData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17244 entries, 0 to 17243
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   Contact Id                     17244 non-null  int64         
 1   Contact Type                   17244 non-null  object        
 2   Contact Primary Full Address   17244 non-null  object        
 3   Contact Primary Address City   17244 non-null  object        
 4   Contact Primary Address State  17244 non-null  object        
 5   Contact Social Score           17243 non-null  float64       
 6   Donor Tenure Years             17244 non-null  float64       
 7   Churned                        17244 non-null  int64         
 8   Gift Date                      17244 non-null  datetime64[ns]
 9   Amount                         17244 non-null  float64       
 10  Gift Type                      17244 non-null  object        
 11  Segment Name   

With 'Contact Tags' and 'Notes' concatenated into the new 'Tags_Notes_Combined' column, the textual information is consolidated for further analysis. The missing text values have now been addressed; the single missing 'Contact Social Score' value is handled before the train/test split.

<a id='352-Data-Type-Manipulation'></a>
### 3.5.2 Data Type Manipulation

In preparation for modeling, numerical values will be extracted from the datetime values to ensure compatibility with a wider scope of model types.

In [10]:
# Extract numerical features from datetime columns
DonationData['Gift_Year'] = DonationData['Gift Date'].dt.year
DonationData['Gift_Month'] = DonationData['Gift Date'].dt.month
DonationData['Gift_Day'] = DonationData['Gift Date'].dt.day
DonationData['Gift_Hour'] = DonationData['Gift Date'].dt.hour
DonationData['Gift_Minute'] = DonationData['Gift Date'].dt.minute
DonationData['Gift_Second'] = DonationData['Gift Date'].dt.second

DonationData['First_Gift_Year'] = DonationData['First Gift Date'].dt.year
DonationData['First_Gift_Month'] = DonationData['First Gift Date'].dt.month
DonationData['First_Gift_Day'] = DonationData['First Gift Date'].dt.day
DonationData['First_Gift_Hour'] = DonationData['First Gift Date'].dt.hour
DonationData['First_Gift_Minute'] = DonationData['First Gift Date'].dt.minute
DonationData['First_Gift_Second'] = DonationData['First Gift Date'].dt.second

DonationData['Last_Gift_Year'] = DonationData['Last Gift Date'].dt.year
DonationData['Last_Gift_Month'] = DonationData['Last Gift Date'].dt.month
DonationData['Last_Gift_Day'] = DonationData['Last Gift Date'].dt.day
DonationData['Last_Gift_Hour'] = DonationData['Last Gift Date'].dt.hour
DonationData['Last_Gift_Minute'] = DonationData['Last Gift Date'].dt.minute
DonationData['Last_Gift_Second'] = DonationData['Last Gift Date'].dt.second

# Now these can be used as numerical features for modeling
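
The extraction above can also be written as a loop over columns and date parts. The following is a minimal sketch on a made-up two-row frame; the helper name `extract_date_parts` and the toy values are assumptions for illustration, not part of the notebook. It also shows that date-only values carry no time of day, so hour, minute, and second components come out as zero.

```python
import pandas as pd

# Toy frame standing in for the donation data (values are made up)
toy = pd.DataFrame({
    "Gift Date": pd.to_datetime(["2015-01-01", "2016-03-18"]),
    "Last Gift Date": pd.to_datetime(["2024-03-18", "2024-03-18"]),
})

def extract_date_parts(df, columns, parts=("year", "month", "day")):
    """Return a copy of df with one integer column per (date column, part)."""
    out = df.copy()
    for col in columns:
        # "Last Gift Date" -> "Last_Gift", matching the naming used above
        prefix = col.replace(" ", "_").replace("_Date", "")
        for part in parts:
            out[f"{prefix}_{part.capitalize()}"] = getattr(out[col].dt, part)
    return out

toy_parts = extract_date_parts(toy, ["Gift Date", "Last Gift Date"])

# Date-only values have no time component, so the hour is always 0
hours_all_zero = bool((toy["Gift Date"].dt.hour == 0).all())

print(toy_parts.columns.tolist())
print(hours_all_zero)
```

If the source dates never include a time of day, the hour, minute, and second columns are constant and add no information for a model.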

In [11]:
# The information was extracted from the datetime values as numerical data,
# so the initial datetime columns can be dropped

# Drop the original datetime columns
DonationData_V02 = DonationData.drop(['Gift Date', 'First Gift Date', 'Last Gift Date'], axis=1)

After extracting numerical data from datetime values, the focus shifts to handling categorical variables.

## 3.6 Encoding Categorical Features <a id="36-encoding-categorical-features"></a> 

Categorical data requires encoding techniques like one-hot or label encoding for machine learning compatibility. Proper treatment of categorical variables is essential for building accurate predictive models, as they contain valuable information affecting model performance.

In [12]:
# A look at the categorical variables in the dataset

DonationData_V02_Cat = DonationData_V02.select_dtypes(include=['object'])
DonationData_V02_Cat.head()

Unnamed: 0,Contact Type,Contact Primary Full Address,Contact Primary Address City,Contact Primary Address State,Gift Type,Segment Name,Campaign Name,Tags_Notes_Combined
0,Household,Unknown,Unknown,Unknown,Other,General Segment,General Giving,Do not call;Website Email Submit;Ticket: 2015 ...
1,Household,Unknown,Unknown,Unknown,Other,General Segment,General Giving,Do not call;Website Email Submit;Ticket: 4th ...
2,Household,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,Other,General Segment,General Giving,Do not call;Website Email Submit;Sustaining Do...
3,Household,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,Other,General Segment,General Giving,Do not call;Website Email Submit;Sustaining Do...
4,Household,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,CA,Other,General Segment,General Giving,Do not call;Website Email Submit;Sustaining Do...


In [13]:
# count of unique values to determine the best encoding technique
DonationData_V02_Cat.nunique().sort_values(ascending=False)

Tags_Notes_Combined              4338
Contact Primary Full Address     2009
Contact Primary Address City      833
Contact Primary Address State      76
Segment Name                       46
Gift Type                           7
Campaign Name                       5
Contact Type                        3
dtype: int64

One-hot encoding will be applied to the following variables because each has (or can be reduced to) a small number of unique values:
1. Gift Type
2. Campaign Name
3. Contact Type
4. Contact Primary Address State (in CA vs not in CA only - practical approach given the importance of California in dataset)


The following variables will be encoded using an NLP approach:
1. Tags_Notes_Combined     
2. Contact Primary Full Address
3. Contact Primary Address City
4. Segment Name


#### **3.6.1 One-Hot Encoding** <a id='361-One-Hot-Encoding'></a>  

In [14]:
# create a copy of dataframe for encoded variables
DonationData_V03_Encoded = DonationData_V02.copy(deep=True)

In [15]:
# Instantiate OneHotEncoder
hot_encoder = OneHotEncoder()

# Fit and transform column
encoded_values = hot_encoder.fit_transform(DonationData_V03_Encoded['Gift Type'].values.reshape(-1,1))
# Convert to dense array
encoded_values_dense = encoded_values.toarray()
# Create a DataFrame from the dense array with correct column names
encoded_df = pd.DataFrame(encoded_values_dense, columns=hot_encoder.get_feature_names_out(['Gift Type']))
# Drop the original 'Gift Type' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Gift Type'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, encoded_df], axis=1)


OneHotEncoder is applied to the 'Gift Type' column in the DataFrame, where each unique category becomes its own binary column. The resulting sparse matrix is converted to a dense array. Finally, the original 'Gift Type' column is replaced with the new binary columns, effectively transforming the categorical data into a binary representation.

This same approach will be applied to Campaign Name, Contact Type, and Contact Primary Address State.

In [16]:
# Fit and transform 'Campaign Name' column
campaign_encoded = hot_encoder.fit_transform(DonationData_V03_Encoded['Campaign Name'].values.reshape(-1, 1))
# Convert the sparse matrix to a dense array
campaign_encoded_dense = campaign_encoded.toarray()
# Create a DataFrame from the dense array with appropriate column names
campaign_encoded_df = pd.DataFrame(campaign_encoded_dense, columns=hot_encoder.get_feature_names_out(['Campaign Name']))
# Drop the original 'Campaign Name' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Campaign Name'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, campaign_encoded_df], axis=1)

# Fit and transform 'Contact Type' column
contact_encoded = hot_encoder.fit_transform(DonationData_V03_Encoded['Contact Type'].values.reshape(-1, 1))
# Convert the sparse matrix to a dense array
contact_encoded_dense = contact_encoded.toarray()
# Create a DataFrame from the dense array with appropriate column names
contact_encoded_df = pd.DataFrame(contact_encoded_dense, columns=hot_encoder.get_feature_names_out(['Contact Type']))
# Drop the original 'Contact Type' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Contact Type'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, contact_encoded_df], axis=1)

Encoding Contact Primary Address State will be slightly different since we only want to know CA vs Not In CA. 

In [17]:
# Transform 'Contact Primary Address State' column into 'CA' vs 'Not CA'
DonationData_V03_Encoded['Contact Primary Address State'] = DonationData_V03_Encoded['Contact Primary Address State'].apply(lambda x: 'CA' if x == 'CA' else 'Not CA')

# Fit and transform 'Contact Primary Address State' column
state_encoded = hot_encoder.fit_transform(DonationData_V03_Encoded['Contact Primary Address State'].values.reshape(-1, 1))
# Convert the sparse matrix to a dense array
state_encoded_dense = state_encoded.toarray()
# Create a DataFrame from the dense array with appropriate column names
state_encoded_df = pd.DataFrame(state_encoded_dense, columns=hot_encoder.get_feature_names_out(['Contact Primary Address State']))
# Drop the original 'Contact Primary Address State' column from the DataFrame
DonationData_V03_Encoded.drop(columns=['Contact Primary Address State'], inplace=True)
# Concatenate the DataFrame with the encoded columns
DonationData_V03_Encoded = pd.concat([DonationData_V03_Encoded, state_encoded_df], axis=1)

In [18]:
DonationData_V03_Encoded.head()

Unnamed: 0,Contact Id,Contact Primary Full Address,Contact Primary Address City,Contact Social Score,Donor Tenure Years,Churned,Amount,Segment Name,Gift Year,Quarter,...,Campaign Name_CSEC Advocacy Course,Campaign Name_Capital Campaign,Campaign Name_End of Year 2023,Campaign Name_General Giving,Campaign Name_General Giving 2024,Contact Type_Foundation,Contact Type_Household,Contact Type_Organization,Contact Primary Address State_CA,Contact Primary Address State_Not CA
0,1,Unknown,Unknown,73.0,0.0,1,390.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,2,Unknown,Unknown,30.0,0.0,1,50.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,3,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,91.0,9.21,0,120.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,3,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,91.0,9.21,0,12.0,General Segment,2015,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
4,3,"""4423 Rhineland Dr Unit A\r\nFort Irwin, CA 92...",Fort Irwin,91.0,9.21,0,144.0,General Segment,2016,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


#### **3.6.2 NLP Encoding** <a id='362-NLP-Encoding'></a>

The following variables will be encoded using an NLP approach:

- Tags_Notes_Combined
- Contact Primary Full Address
- Contact Primary Address City
- Segment Name

In [19]:
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Set stopwords
stop_words = set(stopwords.words('english'))

# Define preprocess function
def preprocess(text):
    text = text.lower()  # Make strings lowercase
    # Remove punctuation character by character; separators such as ';' are deleted, not replaced with a space
    text = ''.join(char for char in text if char not in string.punctuation)
    tokens = nltk.tokenize.word_tokenize(text)  # Tokenize text
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)  # Join tokens back into text

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
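
One detail of punctuation removal is worth checking on semicolon-separated tags: deleting the separator joins the neighboring words into one token. The sketch below uses only the standard library and whitespace splitting (not NLTK tokenization) to compare deleting punctuation with replacing it by spaces. The function names are hypothetical.

```python
import string

# Map every punctuation character to a space so separators still split tokens
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def strip_punctuation_drop(text):
    """Delete punctuation outright."""
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

def strip_punctuation_space(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.lower().translate(PUNCT_TO_SPACE).split())

sample = "Do not call;Website Email Submit"

dropped_tokens = strip_punctuation_drop(sample).split()
spaced_tokens = strip_punctuation_space(sample).split()

print(dropped_tokens)  # ['do', 'not', 'callwebsite', 'email', 'submit']
print(spaced_tokens)   # ['do', 'not', 'call', 'website', 'email', 'submit']
```

Which behavior is preferable depends on whether merged tokens such as `callwebsite` are meaningful for the downstream embedding.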


In [20]:
# Variables not encoded in 3.6.1
longer_text_columns = ["Tags_Notes_Combined",
                       "Contact Primary Full Address",
                       "Contact Primary Address City",
                       "Segment Name"]  

feature_name_mapping = {}

# Create a copy of the DataFrame
DonationData_V03_Encoded_copy = DonationData_V03_Encoded.copy()

# Iterate over each text column and each row to preprocess the text data
for col in longer_text_columns:  
    for index, value in DonationData_V03_Encoded_copy[col].items():  
        # Preprocess text data for the specific row and column
        processed_value = preprocess(value)
        DonationData_V03_Encoded_copy.at[index, col] = processed_value

# Initialize a dictionary to store vectorized values
vectorized_data = {}

# Train Word2Vec models for each text column
for col in longer_text_columns:
    # Tokenize text data
    sentences = [sentence.split() for sentence in DonationData_V03_Encoded_copy[col]]
    
    # Train Word2Vec model
    w2v_model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
    
    # Vectorize text data
    vectorized_values = np.array([np.mean([w2v_model.wv[word] for word in sentence if word in w2v_model.wv] or [np.zeros(50)], axis=0) for sentence in sentences])
    
    # Store vectorized values in the dictionary
    for j in range(w2v_model.vector_size):
        vectorized_data[f"{col}_vec_{j}"] = vectorized_values[:, j]

# Create a new DataFrame with vectorized values
DonationData_V03_Encoded_vectorized = pd.concat([DonationData_V03_Encoded_copy, pd.DataFrame(vectorized_data)], axis=1)

# Drop the original text columns
DonationData_V03_Encoded_vectorized.drop(columns=longer_text_columns, inplace=True)
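
The vectorization step above averages the word vectors of each text into a single fixed-length vector and falls back to zeros when no word is known. A minimal sketch of that mean pooling follows, using a hypothetical dictionary of 3-dimensional vectors in place of a trained Word2Vec model; the function name `mean_pool` is also an assumption.

```python
import numpy as np

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional embeddings standing in for a trained model
toy_vectors = {
    "general": np.array([1.0, 0.0, 2.0]),
    "segment": np.array([3.0, 2.0, 0.0]),
}

pooled = mean_pool(["general", "segment", "unseen"], toy_vectors, dim=3)
empty = mean_pool(["unseen"], toy_vectors, dim=3)

print(pooled)  # [2. 1. 1.]
print(empty)   # [0. 0. 0.]
```

Mean pooling ignores word order, so two texts with the same words in a different order receive the same vector.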

In [21]:
DonationData_V03_Encoded_vectorized.head()

Unnamed: 0,Contact Id,Contact Social Score,Donor Tenure Years,Churned,Amount,Gift Year,Quarter,Gift Month,Donation Count,Average Amount,...,Segment Name_vec_40,Segment Name_vec_41,Segment Name_vec_42,Segment Name_vec_43,Segment Name_vec_44,Segment Name_vec_45,Segment Name_vec_46,Segment Name_vec_47,Segment Name_vec_48,Segment Name_vec_49
0,1,73.0,0.0,1,390.0,2015,1,1,1,390.0,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
1,2,30.0,0.0,1,50.0,2015,1,1,1,50.0,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
2,3,91.0,9.21,0,120.0,2015,1,1,87,18.068966,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
3,3,91.0,9.21,0,12.0,2015,1,1,87,18.068966,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
4,3,91.0,9.21,0,144.0,2016,1,1,87,18.068966,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426


In [22]:
# store version of preprocessed data that is not scaled
unscaled_df_path = '/Users/frankyaraujo/Development/springboard_main/\
Capstone Three/Springboard-Capstone-Three/src/data/Donations _ Jan 2015 to Mar 2024_R4_vectorized_encoded .csv'

# Write DataFrames to CSV files
DonationData_V03_Encoded_vectorized.to_csv(unscaled_df_path, index=False)

## 3.7 Train/Test Split  <a id="37-traintest-split"></a> 

In [23]:
# Showing that all columns are now of a numerical data type
DonationData_V03_Encoded_vectorized.dtypes.unique()

array([dtype('int64'), dtype('float64'), dtype('float32')], dtype=object)

In [24]:
# Check for missing values

DonationData_V03_Encoded_vectorized[
DonationData_V03_Encoded_vectorized["Contact Social Score"].isna()]

# Fill the single missing value in Contact Social Score with zero
DonationData_V03_Encoded_vectorized["Contact Social Score"] = \
    DonationData_V03_Encoded_vectorized["Contact Social Score"].fillna(0)

In [25]:
# 'Churned' is the target variable and the other columns are features
X = DonationData_V03_Encoded_vectorized.drop(columns=['Churned'])
y = DonationData_V03_Encoded_vectorized['Churned']

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
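
The dataset contains several rows per donor (the same 'Contact Id' appears on multiple gifts), so a random row-level split can place gifts from one donor in both the training and the test set. One possible alternative, shown here as a sketch and not as the method used in this notebook, is a group-aware split with scikit-learn's `GroupShuffleSplit`. The toy frame and its values are made up.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: several gifts per donor (values are made up)
toy = pd.DataFrame({
    "Contact Id": [1, 2, 3, 3, 3, 4, 4, 5],
    "Amount": [390.0, 50.0, 120.0, 12.0, 144.0, 25.0, 30.0, 75.0],
    "Churned": [1, 1, 0, 0, 0, 1, 1, 0],
})
X_toy = toy.drop(columns=["Churned"])
y_toy = toy["Churned"]

# Split by donor so that all gifts of one donor land on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=toy["Contact Id"]))

train_ids = set(toy.iloc[train_idx]["Contact Id"])
test_ids = set(toy.iloc[test_idx]["Contact Id"])

print(train_ids & test_ids)  # set()
```

Whether this matters in practice depends on how the churn label and donor-level features are defined; it is worth comparing both splits before drawing conclusions from test scores.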


In [26]:
X.head()

Unnamed: 0,Contact Id,Contact Social Score,Donor Tenure Years,Amount,Gift Year,Quarter,Gift Month,Donation Count,Average Amount,LTV,...,Segment Name_vec_40,Segment Name_vec_41,Segment Name_vec_42,Segment Name_vec_43,Segment Name_vec_44,Segment Name_vec_45,Segment Name_vec_46,Segment Name_vec_47,Segment Name_vec_48,Segment Name_vec_49
0,1,73.0,0.0,390.0,2015,1,1,1,390.0,390.0,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
1,2,30.0,0.0,50.0,2015,1,1,1,50.0,50.0,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
2,3,91.0,9.21,120.0,2015,1,1,87,18.068966,1415.473233,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
3,3,91.0,9.21,12.0,2015,1,1,87,18.068966,1415.473233,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426
4,3,91.0,9.21,144.0,2016,1,1,87,18.068966,1415.473233,...,0.049966,0.003345,-0.036368,0.026876,0.047895,-0.008773,-0.013022,-0.01827,0.033266,0.039426


In [27]:
y.head()

0    1
1    1
2    0
3    0
4    0
Name: Churned, dtype: int64

## 3.8 Scale the Data <a id='38-scale-the-data'></a>

The dataset now mixes several kinds of features: date components, one-hot indicator columns, Word2Vec text vectors, and continuous variables. Standard scaling will be applied to the continuous variables only, which helps prevent features with larger magnitudes from dominating model training.

One-hot indicator columns take only the values 0 and 1, so scaling them is unnecessary. The Word2Vec vectors are not strictly normalized, but their components already fall in a small, comparable range (the values shown above are on the order of ±0.05), so they are also left unscaled. Integer columns such as 'Donation Count' and the date components are likewise not selected by the filter below.


In [28]:
# Finding the continuous variables to scale
[col for col in DonationData_V03_Encoded_vectorized.columns 
                    if DonationData_V03_Encoded_vectorized[col].dtype == 'float64' 
                    and '_' not in col]

['Contact Social Score',
 'Donor Tenure Years',
 'Amount',
 'Average Amount',
 'LTV']

In [29]:
# Continuous columns
continuous_cols = ['Contact Social Score',
                   'Donor Tenure Years',
                   'Amount',
                   'Average Amount',
                   'LTV']

# Initialize the StandardScaler
num_scaler = StandardScaler()

# Fit and transform the scaler on the training set
X_train[continuous_cols] = num_scaler.fit_transform(X_train[continuous_cols])

# Transform the testing set using the scaler fitted on the training set
X_test[continuous_cols] = num_scaler.transform(X_test[continuous_cols])
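
The same "fit on train, transform on test" rule can be expressed with a `ColumnTransformer`, which scales the chosen columns and passes the rest through unchanged. This is a sketch on made-up numbers; the column names and values in the toy frames are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy train/test frames (values are made up)
X_train_toy = pd.DataFrame({"Amount": [10.0, 20.0, 30.0], "Quarter": [1, 2, 3]})
X_test_toy = pd.DataFrame({"Amount": [40.0], "Quarter": [4]})

# Scale only 'Amount'; every other column is passed through untouched
scaler = ColumnTransformer(
    [("num", StandardScaler(), ["Amount"])],
    remainder="passthrough",
)

# Statistics (mean, standard deviation) are learned from the training set only
train_scaled = scaler.fit_transform(X_train_toy)
# The test set reuses the training statistics, so no test information leaks in
test_scaled = scaler.transform(X_test_toy)

print(train_scaled)
print(test_scaled)
```

In the output array the scaled columns come first, followed by the passthrough columns, so the column order differs from the input frame when the scaled columns are not already leading.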


## 3.9 Train/Predict with a "Baseline Model" <a id='39-trainpredict-with-a-baseline-model'></a>

#### Fit the dummy classifier

In [30]:
# Fit the dummy classifier on the training data
dumb_cls = DummyClassifier(strategy='stratified')
dumb_cls.fit(X_train, y_train)

#### Assess dummy classifier performance

In [31]:
# Obtain predictions from the Dummy Classifier
y_pred_dummy = dumb_cls.predict(X_test)

# Calculate Accuracy
accuracy_dummy = accuracy_score(y_test, y_pred_dummy)
print("Dummy Classifier - Accuracy:", accuracy_dummy)

# Calculate Precision
precision_dummy = precision_score(y_test, y_pred_dummy)
print("Dummy Classifier - Precision:", precision_dummy)

# Calculate Recall
recall_dummy = recall_score(y_test, y_pred_dummy)
print("Dummy Classifier - Recall:", recall_dummy)

# Calculate F1-score
f1_dummy = f1_score(y_test, y_pred_dummy)
print("Dummy Classifier - F1-score:", f1_dummy)

# Calculate ROC AUC
roc_auc_dummy = roc_auc_score(y_test, y_pred_dummy)
print("Dummy Classifier - ROC AUC:", roc_auc_dummy)

Dummy Classifier - Accuracy: 0.5105827776167005
Dummy Classifier - Precision: 0.4076048329779673
Dummy Classifier - Recall: 0.4016106442577031
Dummy Classifier - F1-score: 0.40458553791887125
Dummy Classifier - ROC AUC: 0.49459552499871795


**Accuracy**: The accuracy of the dummy classifier is approximately 51.06%, meaning that about half of its predictions are correct. Because `strategy='stratified'` draws predictions at random according to the training class distribution and no `random_state` is set, all of the values below vary slightly between runs.

**Precision**: The precision of the classifier is approximately 40.76%. Precision measures the proportion of true positives among all positive predictions made by the model, so about 40.76% of the predicted churners actually churned.

**Recall**: The recall of the classifier is approximately 40.16%. Recall, also known as sensitivity, measures the proportion of actual positive instances that the model correctly identified, so about 40.16% of the donors who churned were flagged.

**F1-score**: The F1-score of the classifier is approximately 40.46%. The F1-score is the harmonic mean of precision and recall and balances the two metrics. A higher F1-score indicates better overall performance.

**ROC AUC**: The ROC AUC (area under the Receiver Operating Characteristic curve) of the classifier is approximately 0.495. This metric measures the ability of the classifier to distinguish between positive and negative instances. A value close to 0.5 indicates that the classifier performs no better than random guessing, which is expected for this baseline.
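
The definitions above can be checked by hand on a small example. The sketch below computes accuracy, precision, recall, and F1 from the confusion-matrix counts for two made-up label lists; the function name `binary_metrics` and the labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 4 actual churners, 3 predicted churners
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Here TP = 2, FP = 1, FN = 2, and TN = 3, which gives accuracy 5/8, precision 2/3, recall 1/2, and F1 = 4/7.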



## 3.10 Setting up Pipelines  <a id="310-setting-up-pipelines"></a>

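In this section, the candidate models are set up so that they can be compared in a single loop. In scikit-learn, pipelines chain preprocessing steps together with model fitting, which keeps the approach consistent and reproducible. Here the preprocessing has already been applied above, so the models are collected as a simple list of (name, estimator) pairs.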

#### 3.10.1 Define Pipeline(s) <a id="3101-define"></a>

In [32]:
# Define the pipeline of classification models being evaluated
classification_pipelines = [
    ('RandomForest', RandomForestClassifier()),  # Ensemble method using multiple decision trees, generally robust and less prone to overfitting.
    ('GradientBoosting', GradientBoostingClassifier()),  # Ensemble method that builds trees sequentially, optimizing residuals of previous trees, often yields high accuracy.
    ('LogisticRegression', LogisticRegression(max_iter=10000)),  # Linear model for binary classification, useful for understanding feature importance.
    ('DecisionTree', DecisionTreeClassifier()),  # Simple tree structure where decisions are made based on feature values, easy to interpret but prone to overfitting.
    ('SVM', SVC()),  # Support Vector Machine, finds the optimal hyperplane for classification, works well with clear margin of separation.
    ('KNN', KNeighborsClassifier()),  # K-Nearest Neighbors, classifies based on the majority class among the k-nearest neighbors, sensitive to feature scaling.
    ('ANN', MLPClassifier(max_iter=10000))  # Artificial Neural Network, model inspired by biological neural networks, powerful for capturing complex patterns.
]

## 3.11 Fit/Train/Predict and Assess Models  <a id="311-fit-train-predict-and-assess"></a>

In [48]:
import time
# Record start time
start_time = time.time()

# Results for classification
results_cls = []
trained_models = {}

for name, model in classification_pipelines:
    model.fit(X_train, y_train)
    
    # Store the trained model in memory
    trained_models[name] = model
    
    y_pred_cls = model.predict(X_test)
    accuracy_cls = accuracy_score(y_test, y_pred_cls)
    scores = cross_val_score(model, X_train, y_train, cv=3)  # 3-fold cross-validation on the training set

        
    # Cross-validated predictions on the full dataset (X, y) for the confusion matrix and other metrics
    y_pred = cross_val_predict(model, X, y, cv=3)
    # Precision
    precision = precision_score(y, y_pred)
    # Recall
    recall = recall_score(y, y_pred)
    # F1 Score
    f1 = f1_score(y, y_pred)
    # Confusion Matrix
    conf_matrix = confusion_matrix(y, y_pred)
    # ROC curve and AUC, computed here from hard class predictions
    fpr, tpr, _ = roc_curve(y, y_pred)
    roc_auc = auc(fpr, tpr)
        
    # Classification Report
    class_rep = classification_report(y, y_pred, output_dict=True)    
    
    results_cls.append({
            'Model': name,
            'Accuracy': accuracy_cls, #
            'Precision': precision,
            'Recall': recall,
            'F1 Score': f1,
            'ROC AUC': roc_auc,
            'Cross-val-score': scores,
            'Cross-val-score-avg': scores.mean(),
        })

end_time = time.time()
# Calculate elapsed time
elapsed_time = end_time - start_time
print("\nElapsed Time:", elapsed_time, "seconds")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Elapsed Time: 594.2565040588379 seconds


In [49]:
# Convert results to DataFrame for better visualization
print("\nClassification Results:")
results_df = pd.DataFrame(results_cls)
results_df


Classification Results:


| | Model | Accuracy | Precision | Recall | F1 Score | ROC AUC | Cross-val-score | Cross-val-score-avg |
|---|---|---|---|---|---|---|---|---|
| 0 | RandomForest | 0.997391 | 0.909898 | 0.961899 | 0.935176 | 0.947303 | [0.9973905479849232, 0.9971006088721368, 0.996...] | 0.997004 |
| 1 | GradientBoosting | 0.998985 | 0.996093 | 0.99986 | 0.997973 | 0.998545 | [1.0, 1.0, 0.9997099767981439] | 0.999903 |
| 2 | LogisticRegression | 0.979849 | 0.881846 | 0.931503 | 0.905995 | 0.921664 | [0.9794143229921717, 0.9802841403305306, 0.968...] | 0.976125 |
| 3 | DecisionTree | 0.99884 | 0.916842 | 0.976047 | 0.945519 | 0.956752 | [1.0, 1.0, 1.0] | 1.0 |
| 4 | SVM | 0.585967 | 0.64694 | 0.806976 | 0.71815 | 0.747921 | [0.5859669469411424, 0.5859669469411424, 0.586...] | 0.586024 |
| 5 | KNN | 0.977385 | 0.686318 | 0.701219 | 0.693688 | 0.737398 | [0.9739054798492317, 0.9704262104957959, 0.967...] | 0.970713 |
| 6 | ANN | 0.943027 | 0.895954 | 0.542793 | 0.676029 | 0.74913 | [0.9121484488257466, 0.8967816758480719, 0.931...] | 0.913495 |


Analysis:

Accuracy: Measures the proportion of correctly predicted instances out of the total instances. Higher accuracy indicates better overall performance.

- Best: GradientBoosting (0.998985), DecisionTree (0.998840)
- Worst: SVM (0.585967)
- GradientBoosting and DecisionTree models show almost perfect accuracy, whereas SVM has the lowest, indicating it struggles with this dataset.

Precision: Measures the proportion of true positive predictions out of all positive predictions. High precision indicates fewer false positives.

- Best: GradientBoosting (0.996093)
- Worst: SVM (0.646940)
- GradientBoosting excels in precision, making it reliable when false positives are costly. SVM and KNN (0.686318) are the least reliable in this regard.

Recall: Measures the proportion of true positive predictions out of all actual positives. High recall indicates fewer false negatives.

- Best: GradientBoosting (0.999860)
- Worst: ANN (0.542793)
- GradientBoosting captures nearly all positive instances, whereas ANN misses almost half of them.
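The precision and recall definitions above can be traced back to the four cells of a confusion matrix. The sketch below uses a small set of hypothetical labels and predictions (not taken from the donor data) and compares the manual calculation with scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_demo = tp / (tp + fp)  # share of predicted positives that are correct
recall_demo = tp / (tp + fn)     # share of actual positives that are found

print(f"Precision: {precision_demo:.3f} (sklearn: {precision_score(y_true_demo, y_pred_demo):.3f})")
print(f"Recall:    {recall_demo:.3f} (sklearn: {recall_score(y_true_demo, y_pred_demo):.3f})")
```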

F1 Score: The harmonic mean of precision and recall, providing a balanced metric. High F1 indicates good balance between precision and recall.

- Best: GradientBoosting (0.997973)
- Worst: ANN (0.676029)
- GradientBoosting balances precision and recall exceptionally well, while ANN shows significant imbalance.

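ROC AUC: Measures the model's ability to distinguish between classes. Higher AUC indicates better separability. Here it is computed from hard class predictions rather than predicted probabilities, so each value reflects a single operating point.

- Best: GradientBoosting (0.998545)
- Worst: KNN (0.737398)
- GradientBoosting excels in distinguishing between classes, while KNN, SVM (0.747921), and ANN (0.74913) trail well behind.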
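The evaluation loop passes hard class predictions to `roc_curve`. A common alternative is to score the ROC AUC from predicted probabilities, which ranks instances across all thresholds. The sketch below shows both on synthetic data; the dataset, the reduced `n_estimators`, and the variable names are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption: not the donor dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
clf_demo = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Hard 0/1 predictions, as in the evaluation loop
hard_pred = cross_val_predict(clf_demo, X_demo, y_demo, cv=3)

# Probability of the positive class for each instance
proba = cross_val_predict(clf_demo, X_demo, y_demo, cv=3, method="predict_proba")[:, 1]

auc_from_labels = roc_auc_score(y_demo, hard_pred)
auc_from_scores = roc_auc_score(y_demo, proba)

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

Note that `SVC()` does not expose `predict_proba` unless it is created with `probability=True`; `method="decision_function"` is an alternative for such models.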

Cross-val-score: Reflects the model's performance stability across different subsets of data. Consistent scores indicate reliable performance.

- Best: DecisionTree ([1.0, 1.0, 1.0])
- Worst: SVM ([0.5859669469411424, 0.5859669469411424, 0.586...])
- DecisionTree shows perfect cross-validation scores, indicating highly reliable performance. SVM's scores are nearly identical from fold to fold but uniformly low, so the model is stable yet weak.

Cross-val-score-avg: Summarizes the average performance across cross-validation folds.

- Best: DecisionTree (1.000000)
- Worst: SVM (0.586024)
- DecisionTree again stands out with perfect average performance, while SVM remains the weakest.
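Stability across folds can be summarized with the mean and standard deviation of the fold scores. The following is a minimal sketch on synthetic data; the dataset, the shuffled `StratifiedKFold` splitter, and the variable names are assumptions rather than the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the donor dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Stratified folds keep the class ratio similar in every split
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv_demo
)

print("Fold scores:", fold_scores.round(3))
print(f"Mean: {fold_scores.mean():.3f}, Std: {fold_scores.std():.3f}")
```

A low standard deviation indicates consistent performance, while the mean indicates how good that performance is; the two should be read together.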

## 3.12 Final Model Selection <a id='314-final-model-selection'></a>

Based on the results, GradientBoosting is the best-performing model for this data, leading on accuracy, precision, recall, F1 score, and ROC AUC. DecisionTree also performs well, particularly in cross-validation, making it a stable alternative.
RandomForest and LogisticRegression are close on accuracy but trail on precision and F1 score, while SVM, KNN, and ANN show markedly lower performance on several metrics and may not be suitable for this dataset without further tuning.


In [60]:
# Save the model to a file
import pickle

selected_model = trained_models['GradientBoosting']
pickle_file = "gradient_boosting_model.pkl"
with open(pickle_file, 'wb') as file:
    pickle.dump(selected_model, file)

print(f"Model saved to {pickle_file}")

Model saved to gradient_boosting_model.pkl
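A saved model can be reloaded with `pickle.load` and checked against the in-memory original. The sketch below is self-contained: it trains a small stand-in model on synthetic data and writes to a temporary directory, so the file name `demo_model.pkl` and all `demo` variables are hypothetical rather than the artifacts produced above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small stand-in model on synthetic data (assumption: not the donor dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model_demo = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(demo_path, "wb") as file:
        pickle.dump(model_demo, file)
    with open(demo_path, "rb") as file:
        reloaded_model = pickle.load(file)

# The reloaded model should reproduce the original predictions exactly
predictions_match = np.array_equal(
    model_demo.predict(X_demo), reloaded_model.predict(X_demo)
)
print("Predictions match after reload:", predictions_match)
```

Pickled scikit-learn models are generally only safe to load with the same library version that created them, and pickle files from untrusted sources should not be loaded.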


## 3.13 Conclusion <a id='315-conclusion'></a>
   

Several classification metrics were used to evaluate multiple machine learning models on the dataset. GradientBoosting was the top performer on accuracy, precision, recall, F1 score, and ROC AUC, while DecisionTree had the highest cross-validation average.

The feature engineering process used one-hot encoding for categorical variables with few unique values and an NLP approach for textual features, so that more of the dataset's information was available to the models. Standard scaling was applied to continuous variables so that features with larger magnitudes would not dominate. Evaluation against a dummy classifier served as a benchmark: the baseline's ROC AUC of approximately 50.83% is close to random guessing, and the GradientBoosting model improves on it substantially.

Based on the analysis, it is recommended to deploy the GradientBoosting model for further application and testing, with potential fine-tuning to enhance its performance.

Looking ahead, next steps include hyperparameter tuning, additional feature importance analysis to identify influential features, exploration of other ensemble methods, and model interpretability techniques to support decision-making.
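Two of these next steps, hyperparameter tuning and feature importance analysis, can be sketched with `GridSearchCV` and the fitted model's `feature_importances_` attribute. The synthetic data and the small parameter grid below are assumptions chosen to keep the example fast; they are not recommended settings for the donor dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the donor dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Deliberately tiny grid, for illustration only
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
importances = search.best_estimator_.feature_importances_

print("Best parameters:", best_params)
print("Feature importances:", importances.round(3))
```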

Integrating these steps into the workflow can further refine the GradientBoosting model, leading to more accurate predictions and more actionable insights.