# Logistic Regression: Banking Marketing Campaign

## 1. Data Acquisition & EDA

### 1.1. Sanity check

First things first - let's take a look at the data.

In [1]:
import pandas as pd

# Read the data into a pandas dataframe from the URL
data_url='https://raw.githubusercontent.com/4GeeksAcademy/logistic-regression-project-tutorial/main/bank-marketing-campaign-data.csv'
data_df=pd.read_csv(data_url, sep=';')

# Take a look at what features and data types we have
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

Lots of data to work with here - some numerical features, but also a lot of string/object types which are probably categorical and ordinal or interval. Let's take a closer look:

In [2]:
# Inspect the first few rows, transposing so columns don't get cut off
data_df.head().transpose()

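|             | 0         | 1           | 2           | 3         | 4           |
|-------------|-----------|-------------|-------------|-----------|-------------|
| age         | 56        | 57          | 37          | 40        | 56          |
| job         | housemaid | services    | services    | admin.    | services    |
| marital     | married   | married     | married     | married   | married     |
| education   | basic.4y  | high.school | high.school | basic.6y  | high.school |
| default     | no        | unknown     | no          | no        | no          |
| housing     | no        | no          | yes         | no        | no          |
| loan        | no        | no          | no          | no        | yes         |
| contact     | telephone | telephone   | telephone   | telephone | telephone   |
| month       | may       | may         | may         | may       | may         |
| day_of_week | mon       | mon         | mon         | mon       | mon         |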


Beyond the Pandas datatypes, we have a few different kinds of variables here in the statistical sense:

1. **Ordinal**: 'categorical' variables like *education* which have a natural order, but no quantitative interval or magnitude.
2. **Interval**: for example, *month* has defined degrees of difference between the categories.
3. **Nominal**: things like *marital* which have categories only, no obvious quantity or order.

These types in particular will be interesting, because we will need to handle encoding them to numerical features differently based on their properties.

We also see several discrete and continuous random variables like *age* and *duration*.

There are a couple of other features that pique my interest too, but in terms of data quality:

1. *poutcome*: first 5 values are 'nonexistent' - I assume meaning that we don't know the outcome of the previous marketing campaign?
2. *day_of_week* (and others): is Monday call day and this feature constant?
3. *pdays*: is 999 some kind of placeholder for missing data? Is this value constant too?

Let's check how many unique values we actually have for each feature. This information will help guide us in deciding how to encode the data for regression and whether we should drop any features from the get-go.

### 1.2. Unique and/or missing values

In [3]:
data_df.nunique()

age                 78
job                 12
marital              4
education            8
default              3
housing              3
loan                 3
contact              2
month               10
day_of_week          5
duration          1544
campaign            42
pdays               27
previous             8
poutcome             3
emp.var.rate        10
cons.price.idx      26
cons.conf.idx       26
euribor3m          316
nr.employed         11
y                    2
dtype: int64

OK, not so bad. Looks like calls happen on weekdays - 5 unique values. No constant columns that are begging to be dropped. I'm still curious about *default*, *housing*, *loan* and *poutcome* - it seems like they should be binary but each has three values. Looking at the few rows we inspected above it seems like the string 'unknown' is being used for missing data. Let's find out how much data is missing.

In [4]:
# Define a reusable helper function here, since we will be looking at feature composition a lot
def feature_composition(df: pd.DataFrame, features: list) -> None:
    '''Takes a dataframe and a list of features. Prints out
    the unique levels of that feature with their count and 
    percent.'''

    for column_name in features:
        value_counts=df[column_name].value_counts().to_dict()

        print(f'\nFeature: {column_name}')

        for key, value in value_counts.items():
            percent_value=(value/len(df)) * 100
            print(f' {key}: {value} ({percent_value:.1f}%)')

In [5]:
feature_composition(data_df, ['default','housing','loan','poutcome'])


Feature: default
 no: 32588 (79.1%)
 unknown: 8597 (20.9%)
 yes: 3 (0.0%)

Feature: housing
 yes: 21576 (52.4%)
 no: 18622 (45.2%)
 unknown: 990 (2.4%)

Feature: loan
 no: 33950 (82.4%)
 yes: 6248 (15.2%)
 unknown: 990 (2.4%)

Feature: poutcome
 nonexistent: 35563 (86.3%)
 failure: 4252 (10.3%)
 success: 1373 (3.3%)


We have very, very few *defaults* - this column is a good candidate for exclusion based on low entropy. It's mostly missing or 'no' - probably not adding much information. Interestingly, both *housing* and *loan* are missing exactly 990 rows - I'd bet it's the same rows too. We could maybe drop those examples as missing/bad data. Next, let's look at the weird '999' value in *pdays* - I bet that's a placeholder of some kind too.
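
Before moving on, the guess that *housing* and *loan* are 'unknown' in the same rows could be checked with something like the sketch below. It uses a small made-up dataframe (`toy_df`, a hypothetical stand-in for `data_df`) so that it runs on its own; the same comparison should apply to the real columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for data_df; the values are made up
toy_df = pd.DataFrame({
    'housing': ['yes', 'unknown', 'no', 'unknown', 'yes'],
    'loan':    ['no',  'unknown', 'no', 'unknown', 'yes'],
})

# Boolean masks marking the rows where each feature is 'unknown'
housing_unknown = toy_df['housing'] == 'unknown'
loan_unknown = toy_df['loan'] == 'unknown'

# True only if both features are 'unknown' on exactly the same rows
same_rows = bool((housing_unknown == loan_unknown).all())
print(f'Unknown in the same rows: {same_rows}')

# A cross-tabulation shows the full overlap pattern at a glance
overlap = pd.crosstab(toy_df['housing'], toy_df['loan'])
print(overlap)
```

If the masks match on the real data, dropping on either column removes the same 990 examples.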

In [6]:
feature_composition(data_df, ['pdays'])


Feature: pdays
 999: 39673 (96.3%)
 3: 439 (1.1%)
 6: 412 (1.0%)
 4: 118 (0.3%)
 9: 64 (0.2%)
 2: 61 (0.1%)
 7: 60 (0.1%)
 12: 58 (0.1%)
 10: 52 (0.1%)
 5: 46 (0.1%)
 13: 36 (0.1%)
 11: 28 (0.1%)
 1: 26 (0.1%)
 15: 24 (0.1%)
 14: 20 (0.0%)
 8: 18 (0.0%)
 0: 15 (0.0%)
 16: 11 (0.0%)
 17: 8 (0.0%)
 18: 7 (0.0%)
 22: 3 (0.0%)
 19: 3 (0.0%)
 21: 2 (0.0%)
 25: 1 (0.0%)
 26: 1 (0.0%)
 27: 1 (0.0%)
 20: 1 (0.0%)


Ok, now this makes sense - from the metadata, *pdays* is 'Number of days that elapsed since the last campaign until the customer was contacted (numeric)'. The 999 value is probably missing data here too, but it could also be for people who were not included in the previous campaign - this qualitatively tracks with the large percentage of people for which the outcome of the previous campaign is unknown. Whatever the explanation, this is a very low entropy column too.
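
'Low entropy' has been used informally so far. One way to put a number on it is the Shannon entropy of the level frequencies, sketched below. The helper name `shannon_entropy` and both toy series are assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def shannon_entropy(values: pd.Series) -> float:
    '''Returns the Shannon entropy, in bits, of the level
    frequencies in a series.'''

    # Relative frequency of each observed level
    probabilities = values.value_counts(normalize=True).to_numpy()

    # value_counts only returns observed levels, so there are no zeros to guard against
    return float(-(probabilities * np.log2(probabilities)).sum())

# Hypothetical toy series: one dominated by a placeholder, one evenly spread
mostly_placeholder = pd.Series([999] * 96 + [3, 6, 4, 9])
evenly_spread = pd.Series(['mon', 'tue', 'wed', 'thu'] * 25)

print(f'Mostly placeholder: {shannon_entropy(mostly_placeholder):.2f} bits')
print(f'Evenly spread: {shannon_entropy(evenly_spread):.2f} bits')
```

A column where one level dominates scores close to zero, while a column with evenly used levels scores close to the base-2 logarithm of its number of levels.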

### 1.3. Feature engineering

#### 1.3.1. Feature drops

I think the first thing we should do is just get rid of the features related to the previous campaign - I don't think there is enough data about it to be meaningful. This way we lose some low entropy features and make the model simpler, without giving up any rows. Less is more.

In [7]:
column_drops=['poutcome', 'pdays', 'previous']

data_df.drop(column_drops, axis=1, inplace=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  emp.var.rate    41188 non-null  float64
 13  cons.price.idx  41188 non-null  float64
 14  cons.conf.idx   41188 non-null  float64
 15  euribor3m       41188 non-null  float64
 16  nr.employed     41188 non-null  float64
 17  y               41188 non-null 

#### 1.3.2. Categorical interval feature encoding

Next, let's deal with the time features. Interval time features are better encoded with sin/cos components than single numbers - this method captures the 'near-ness' of Sunday to Monday or December to January, where just numbering the months or days does not. If you are interested in cyclical encoding for time features, check out [this Kaggle notebook](https://www.kaggle.com/code/avanwyk/encoding-cyclical-features-for-deep-learning).
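
To make the 'near-ness' argument concrete, here is a minimal sketch comparing distances between months under raw numbering and under sin/cos encoding. The helper `cyclical_encode` and the 1-12 month numbering are assumptions for this example only.

```python
import numpy as np

def cyclical_encode(value: float, period: float) -> np.ndarray:
    '''Returns the (sin, cos) pair for a value on a cycle
    with the given period.'''

    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Months numbered 1-12, an assumption for this example
december, january, june = 12, 1, 6

# Raw numbering: December looks further from January than from June
print(f'Raw distance dec-jan: {abs(december - january)}')
print(f'Raw distance dec-jun: {abs(december - june)}')

# Sin/cos encoding: Euclidean distance between points on the unit circle
dec_jan = np.linalg.norm(cyclical_encode(december, 12) - cyclical_encode(january, 12))
dec_jun = np.linalg.norm(cyclical_encode(december, 12) - cyclical_encode(june, 12))
print(f'Encoded distance dec-jan: {dec_jan:.2f}')
print(f'Encoded distance dec-jun: {dec_jun:.2f}')
```

With raw numbers, December sits 11 steps from January but only 6 from June. On the unit circle the ordering flips, which is the behavior we want a model to see.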

In [8]:
feature_composition(data_df, ['day_of_week'])


Feature: day_of_week
 thu: 8623 (20.9%)
 mon: 8514 (20.7%)
 wed: 8134 (19.7%)
 tue: 8090 (19.6%)
 fri: 7827 (19.0%)


In [9]:
feature_composition(data_df, ['month'])


Feature: month
 may: 13769 (33.4%)
 jul: 7174 (17.4%)
 aug: 6178 (15.0%)
 jun: 5318 (12.9%)
 nov: 4101 (10.0%)
 apr: 2632 (6.4%)
 oct: 718 (1.7%)
 sep: 570 (1.4%)
 mar: 546 (1.3%)
 dec: 182 (0.4%)


In [10]:
import numpy as np

# Don't worry about downcasting FutureWarning
pd.set_option('future.no_silent_downcasting', True)

# First convert the features to numeric; avoid naming the
# lookup 'dict', which would shadow the Python builtin
day_numbers={'mon' : 1, 'tue' : 2, 'wed': 3, 'thu' : 4, 'fri': 5}
data_df=data_df.replace(day_numbers)

month_numbers={'jan' : 1, 'feb' : 2, 'mar': 3, 'apr' : 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
data_df=data_df.replace(month_numbers)

# And fix the dtypes
data_df['day_of_week']=data_df['day_of_week'].astype(int)
data_df['month']=data_df['month'].astype(int)

# Take a look
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  int64  
 9   day_of_week     41188 non-null  int64  
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  emp.var.rate    41188 non-null  float64
 13  cons.price.idx  41188 non-null  float64
 14  cons.conf.idx   41188 non-null  float64
 15  euribor3m       41188 non-null  float64
 16  nr.employed     41188 non-null  float64
 17  y               41188 non-null 

In [11]:
# Now encode the day and month with sin/cos components
data_df['day_sin'] = np.sin(2 * np.pi * data_df['day_of_week']/7.0)
data_df['day_cos'] = np.cos(2 * np.pi * data_df['day_of_week']/7.0)

data_df['month_sin'] = np.sin(2 * np.pi * data_df['month']/12.0)
data_df['month_cos'] = np.cos(2 * np.pi * data_df['month']/12.0)

data_df.drop(['month', 'day_of_week'], axis=1, inplace=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   duration        41188 non-null  int64  
 9   campaign        41188 non-null  int64  
 10  emp.var.rate    41188 non-null  float64
 11  cons.price.idx  41188 non-null  float64
 12  cons.conf.idx   41188 non-null  float64
 13  euribor3m       41188 non-null  float64
 14  nr.employed     41188 non-null  float64
 15  y               41188 non-null  object 
 16  day_sin         41188 non-null  float64
 17  day_cos         41188 non-null 

#### 1.3.3. Categorical ordinal feature encoding

Next, let's look at *education* and *job* and make sure we encode those in a way that makes sense. Here are their levels:

In [12]:
feature_composition(data_df, ['education'])


Feature: education
 university.degree: 12168 (29.5%)
 high.school: 9515 (23.1%)
 basic.9y: 6045 (14.7%)
 professional.course: 5243 (12.7%)
 basic.4y: 4176 (10.1%)
 basic.6y: 2292 (5.6%)
 unknown: 1731 (4.2%)
 illiterate: 18 (0.0%)


Encode these in order of increasing education level:

In [13]:
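# Note: replace() acts on the whole dataframe, so 'unknown' in the
# other object columns (e.g. job) also becomes 0
education_levels={'unknown': 0, 'illiterate': 1, 'basic.4y': 2, 'basic.6y': 3, 'basic.9y': 4, 'high.school': 5, 'professional.course': 6, 'university.degree': 7}
data_df=data_df.replace(education_levels)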
data_df['education']=data_df['education'].astype(int)


And similarly for employment:

In [14]:
feature_composition(data_df, ['job'])


Feature: job
 admin.: 10422 (25.3%)
 blue-collar: 9254 (22.5%)
 technician: 6743 (16.4%)
 services: 3969 (9.6%)
 management: 2924 (7.1%)
 retired: 1720 (4.2%)
 entrepreneur: 1456 (3.5%)
 self-employed: 1421 (3.5%)
 housemaid: 1060 (2.6%)
 unemployed: 1014 (2.5%)
 student: 875 (2.1%)
 0: 330 (0.8%)


I was thinking that maybe we could do the same trick for employment, but it's hard to order the job types in a way that makes sense. If we could find where these classes came from - maybe some state department of labor statistics or something - we could maybe translate them to income level. But let's leave it alone for now. Note that the '0' level above is the former 'unknown': the replace call used for *education* was applied to the whole dataframe.

#### 1.3.4. Categorical nominal feature encoding

For nominal features, we don't want to simply number the classes - this tells the model that they have a magnitude. For example, if we were to number the *housing* categories as follows:

```text
unknown => 0
yes => 1
no => 2
```

that would imply that 'no' is somehow double 'yes', which makes no sense. The proper way to do this is with 'one hot' encoding. This technique creates a new binary feature for each level of the original feature. Every example then gets a '1' to indicate its level in the original feature. Here's what one hot encoding a few hypothetical examples for *housing* would look like:

|          | housing_unknown | housing_yes | housing_no |
|----------|-----------------|-------------|------------|
| Person 1 |        1        |      0      |     0      |
| Person 2 |        0        |      0      |     1      |
| Person 3 |        0        |      1      |     0      |

However, there is one issue with this: the above table looks intuitively good, but it introduces multicollinearity. Notice how the sum of all three one hot encoded features for any row is one? This is a problem for many machine learning models and is known as the 'dummy variable trap'. Without going into too much detail, we actually only need two one hot encoded variables to encode three feature levels, only four to encode five levels, and so on. It's always one less than the number of original levels we are trying to encode. If you are interested in this topic, check out this [Medium article](https://towardsdatascience.com/one-hot-encoding-multicollinearity-and-the-dummy-variable-trap-b5840be3c41a).

Luckily, this type of encoding is a very common operation and Pandas can handle it for us - just be sure to set *drop_first* to **True**.

In [15]:
encoded_columns=['job','marital','default','housing','loan','contact']
data_df=pd.get_dummies(data_df, columns=encoded_columns, dtype=int, drop_first=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 35 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                41188 non-null  int64  
 1   education          41188 non-null  int64  
 2   duration           41188 non-null  int64  
 3   campaign           41188 non-null  int64  
 4   emp.var.rate       41188 non-null  float64
 5   cons.price.idx     41188 non-null  float64
 6   cons.conf.idx      41188 non-null  float64
 7   euribor3m          41188 non-null  float64
 8   nr.employed        41188 non-null  float64
 9   y                  41188 non-null  object 
 10  day_sin            41188 non-null  float64
 11  day_cos            41188 non-null  float64
 12  month_sin          41188 non-null  float64
 13  month_cos          41188 non-null  float64
 14  job_admin.         41188 non-null  int64  
 15  job_blue-collar    41188 non-null  int64  
 16  job_entrepreneur   411
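
To see the dummy variable trap on a small scale, the sketch below one hot encodes a made-up three-level *housing* feature with and without `drop_first`, then compares the rank of each design matrix once an intercept column is added. The toy dataframe and the rank check are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature with three levels
toy_df = pd.DataFrame({'housing': ['unknown', 'no', 'yes', 'no', 'yes']})

# Full one hot encoding: one column per level
full_dummies = pd.get_dummies(toy_df, columns=['housing'], dtype=int)

# Every row sums to one, so together the columns duplicate an intercept term
print(full_dummies.sum(axis=1).tolist())

# Dropping the first level removes the redundancy without losing information
reduced_dummies = pd.get_dummies(toy_df, columns=['housing'], dtype=int, drop_first=True)
print(reduced_dummies.columns.tolist())

# Rank check: prepend an intercept column of ones to each design matrix
intercept = np.ones((len(toy_df), 1))
full_rank = np.linalg.matrix_rank(np.hstack([intercept, full_dummies.to_numpy()]))
reduced_rank = np.linalg.matrix_rank(np.hstack([intercept, reduced_dummies.to_numpy()]))
print(f'Full encoding: rank {full_rank} of 4 columns')
print(f'Reduced encoding: rank {reduced_rank} of 3 columns')
```

The full encoding has four columns but only three independent ones, while the reduced encoding is full rank.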

#### 1.3.5. Random variable feature standardization

The last feature engineering step will be to standardize our numerical features. The goal here is to bring all of the features into a similar range. Just because the values for *nr.employed* are larger than those for *euribor3m* does not mean it is more important. The important thing is the variation in those features and their relationship to the target variable.

Before we scale our features, we need to split the data into training and testing sets. This is because we will use information about the data (e.g. its mean and standard deviation) to conduct the scaling, and we don't want **ANY** information from or about the test set going into the model. The scaler will be fit on the training data only, and then used to scale both the training and test data.

For more information, check out the article on [preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html) from the scikit-learn documentation.
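
As a small illustration of that fit-on-train-only workflow, here is a sketch on synthetic data. The two toy features are random numbers on very different scales, loosely imitating *nr.employed* and *euribor3m*; the means and spreads used to generate them are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: two features on very different scales
rng = np.random.default_rng(42)
toy_features = np.column_stack([
    rng.normal(loc=5000, scale=70, size=200),  # large values
    rng.normal(loc=3.5, scale=1.7, size=200),  # small values
])

toy_train, toy_test = train_test_split(toy_features, test_size=0.33, random_state=42)

# Fit on the training split only, then apply the same transform to both splits
scaler = StandardScaler().fit(toy_train)
scaled_train = scaler.transform(toy_train)
scaled_test = scaler.transform(toy_test)

# Training features end up with mean ~0 and standard deviation ~1; test
# features land close to that but not exactly, since they were not used in the fit
print(f'Train means: {scaled_train.mean(axis=0).round(3)}')
print(f'Train stds:  {scaled_train.std(axis=0).round(3)}')
print(f'Test means:  {scaled_test.mean(axis=0).round(3)}')
```

The test split is transformed with the training mean and standard deviation, so nothing about the test data leaks into the scaler.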

In [16]:
data_df.head()

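|   | age | education | duration | campaign | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y  | ... | marital_divorced | marital_married | marital_single | default_no | default_yes | housing_no | housing_yes | loan_no | loan_yes | contact_telephone |
|---|-----|-----------|----------|----------|--------------|----------------|---------------|-----------|-------------|----|-----|------------------|-----------------|----------------|------------|-------------|------------|-------------|---------|----------|-------------------|
| 0 | 56  | 2         | 261      | 1        | 1.1          | 93.994         | -36.4         | 4.857     | 5191.0      | no | ... | 0                | 1               | 0              | 1          | 0           | 1          | 0           | 1       | 0        | 1                 |
| 1 | 57  | 5         | 149      | 1        | 1.1          | 93.994         | -36.4         | 4.857     | 5191.0      | no | ... | 0                | 1               | 0              | 0          | 0           | 1          | 0           | 1       | 0        | 1                 |
| 2 | 37  | 5         | 226      | 1        | 1.1          | 93.994         | -36.4         | 4.857     | 5191.0      | no | ... | 0                | 1               | 0              | 1          | 0           | 0          | 1           | 1       | 0        | 1                 |
| 3 | 40  | 3         | 151      | 1        | 1.1          | 93.994         | -36.4         | 4.857     | 5191.0      | no | ... | 0                | 1               | 0              | 1          | 0           | 1          | 0           | 1       | 0        | 1                 |
| 4 | 56  | 5         | 307      | 1        | 1.1          | 93.994         | -36.4         | 4.857     | 5191.0      | no | ... | 0                | 1               | 0              | 1          | 0           | 1          | 0           | 0       | 1        | 1                 |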


In [17]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# First separate the features from the labels
labels=data_df['y']
features=data_df.drop('y', axis=1)

# Do the train-test split
training_features, testing_features, training_labels, testing_labels=train_test_split(
    features,
    labels,
    test_size=0.33, 
    random_state=42
)

# Scale the features
standard_scaler=StandardScaler().fit(training_features)
training_features=standard_scaler.transform(training_features)
testing_features=standard_scaler.transform(testing_features)

print(f'Training features: {training_features.shape}')
print(f'Testing features: {testing_features.shape}')

Training features: (27595, 34)
Testing features: (13593, 34)


Last thing to do for feature engineering is to encode the labels to a binary numerical target variable.

In [18]:
# Last, encode the labels
label_encoder=LabelEncoder().fit(training_labels)
training_labels=label_encoder.transform(training_labels)
testing_labels=label_encoder.transform(testing_labels)

print(f'Training labels: {training_labels}')

Training labels: [0 1 0 ... 1 0 0]...
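
To confirm which string ends up as which integer, the fitted encoder can be inspected. The sketch below does this on a made-up list of labels using the same 'yes'/'no' strings as the *y* column; `toy_labels` and `toy_encoder` are hypothetical names for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels using the same 'yes'/'no' strings as the y column
toy_labels = ['no', 'yes', 'no', 'no', 'yes']

toy_encoder = LabelEncoder().fit(toy_labels)

# classes_ is sorted, so the position of each string is its integer code
mapping = {str(label): code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
encoded = toy_encoder.transform(toy_labels)
decoded = toy_encoder.inverse_transform(encoded).tolist()
print(decoded)
```

Because the classes are sorted alphabetically, 'no' maps to 0 and 'yes' maps to 1, so the positive class for the model is a 'yes'.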
