# Data Cleaning
This section cleans the dataset. We will relabel placeholder values for missing data, handle a special value in `pdays`, and filter out rows with negative values. Clean, consistent data gives the model more reliable inputs.

### Missing Values
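Some columns use the string "unknown" as a placeholder for missing values. We will replace "unknown" with "Other" in `education`, `contact`, and `poutcome`, which keeps every row instead of deleting the ones with missing entries. The `job` column also contains "unknown" values and is left unchanged in this step.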

In [1]:
import pandas as pd

# Load the dataset
file_path = '/Users/evansmac/Downloads/IAT461/Banking_Call_Data.csv'
data = pd.read_csv(file_path)

# Replace "unknown" values with "Other"
columns_with_unknown = ['education', 'contact', 'poutcome']
for column in columns_with_unknown:
    data[column] = data[column].replace('unknown', 'Other')

# Check the changes
data[columns_with_unknown].head()

Unnamed: 0,education,contact,poutcome
0,tertiary,Other,Other
1,secondary,Other,Other
2,secondary,Other,Other
3,Other,Other,Other
4,Other,Other,Other
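
Before choosing which columns to relabel, it can help to count how often "unknown" appears in each column. The snippet below is a minimal sketch of that check on a small made-up frame; the `toy` data and the variable names are assumptions for illustration, not part of the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    'job': ['management', 'unknown', 'technician'],
    'education': ['tertiary', 'unknown', 'unknown'],
    'contact': ['unknown', 'cellular', 'unknown'],
})

# Count how often the literal string "unknown" appears in each column.
unknown_counts = (toy == 'unknown').sum()
print(unknown_counts)

# Keep only the names of columns that contain at least one "unknown".
columns_to_fix = unknown_counts[unknown_counts > 0].index.tolist()
print(columns_to_fix)
```

Running the same two lines on `data` would list every column that still holds the placeholder.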


### Special Values
The column `pdays` uses -1 as a special value that means the customer was not contacted before. We will replace -1 with "Not Contacted" to make it easier to read. After this change the column holds both text and numbers.

In [2]:
# Replace -1 in 'pdays' with "Not Contacted"
data['pdays'] = data['pdays'].apply(lambda x: 'Not Contacted' if x == -1 else x)

# Check the changes
data['pdays'].value_counts()

pdays
Not Contacted    36954
182                167
92                 147
91                 126
183                126
                 ...  
449                  1
452                  1
648                  1
595                  1
530                  1
Name: count, Length: 559, dtype: int64
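
Because the relabelled column mixes strings and integers, numeric operations on it need an extra step. The sketch below shows one possible way to get numbers back with `pd.to_numeric`; the `toy` series is made up, and using `errors='coerce'` here is an assumption about how a later step might handle it, not something this notebook does.

```python
import pandas as pd

# Toy series standing in for the real 'pdays' column.
toy = pd.Series([-1, 182, -1, 92], name='pdays')

# Same relabelling as in the cell above.
labelled = toy.apply(lambda x: 'Not Contacted' if x == -1 else x)
print(labelled.dtype)  # mixed strings and integers are stored as object

# Convert back to numbers; the text label becomes NaN (missing).
numeric_again = pd.to_numeric(labelled, errors='coerce')
print(numeric_again.mean())  # NaN entries are skipped by default
```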

### Balance and Duration
This step removes rows with negative values in `balance` or `duration`. A negative call duration is not meaningful. A negative `balance` could be a real overdrawn account, but this analysis treats it as invalid, so those rows are removed as well.

In [3]:
# Remove rows with negative values in 'balance' and 'duration'
data = data[(data['balance'] >= 0) & (data['duration'] >= 0)]

# Check the changes
print(f"Number of rows after cleaning: {len(data)}")

Number of rows after cleaning: 41445
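
To see how many values are affected before dropping anything, the negatives can be counted per column first. This is a minimal sketch on made-up numbers; `toy`, `negative_counts`, and `rows_removed` are illustrative names.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    'balance': [2143, -50, 0, 310],
    'duration': [261, 151, 76, 92],
})

# Count negative entries per column before removing any rows.
negative_counts = (toy[['balance', 'duration']] < 0).sum()
print(negative_counts)

# Apply the same filter as the cell above and report how many rows it removes.
filtered = toy[(toy['balance'] >= 0) & (toy['duration'] >= 0)]
rows_removed = len(toy) - len(filtered)
print(f"Rows removed: {rows_removed}")
```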


## Summary
We cleaned the data by:
1. Replacing "unknown" with "Other" in `education`, `contact`, and `poutcome`.
2. Changing -1 in `pdays` to "Not Contacted".
3. Removing rows with negative values in `balance` and `duration`.

Now the data is ready for the next step: feature engineering.

# Feature Engineering
This section creates new features that may help the model predict better. We will focus on features that matter for our research question and algorithms.

### Important Features
The dataset already contains many useful features. We will add three new ones that relate directly to our research question: `season`, `duration_per_contact`, and `was_contacted_before`. These features describe when and how often a customer was contacted.

### Season
The column `month` tells us when the customer was contacted. We will group months into seasons like Spring, Summer, Fall, and Winter. This will help us see if the season affects the customer's decision.

In [4]:
# Create a new column for season based on the month
def get_season(month):
    if month in ['mar', 'apr', 'may']:
        return 'Spring'
    elif month in ['jun', 'jul', 'aug']:
        return 'Summer'
    elif month in ['sep', 'oct', 'nov']:
        return 'Fall'
    else:  # 'dec', 'jan', 'feb'; any unrecognized value also lands here
        return 'Winter'

data['season'] = data['month'].apply(get_season)

# Check the new feature
data[['month', 'season']].head()

Unnamed: 0,month,season
0,may,Spring
1,may,Spring
2,may,Spring
3,may,Spring
4,may,Spring
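
The same grouping can also be written as a lookup table. The sketch below is an alternative, not the method used in this notebook: it maps months through a dictionary so that an unexpected month value shows up as missing instead of being counted as Winter. The dictionary name `SEASON_BY_MONTH` and the misspelled value `'mayy'` are made up for illustration.

```python
import pandas as pd

# Lookup table from month abbreviation to season (same grouping as get_season).
SEASON_BY_MONTH = {
    'mar': 'Spring', 'apr': 'Spring', 'may': 'Spring',
    'jun': 'Summer', 'jul': 'Summer', 'aug': 'Summer',
    'sep': 'Fall', 'oct': 'Fall', 'nov': 'Fall',
    'dec': 'Winter', 'jan': 'Winter', 'feb': 'Winter',
}

# Toy data with one deliberately misspelled month.
toy = pd.DataFrame({'month': ['may', 'aug', 'nov', 'jan', 'mayy']})

# Months that are not in the lookup table become NaN instead of 'Winter'.
toy['season'] = toy['month'].map(SEASON_BY_MONTH)

# List the month values that could not be mapped.
unmapped = toy.loc[toy['season'].isna(), 'month'].tolist()
print(unmapped)
```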


### Duration Per Contact
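The column `duration` tells us how long the call lasted, and `campaign` tells us how many times the customer was contacted. We will create a new feature called `duration_per_contact` by dividing `duration` by `campaign + 1`. Because of the added 1, the value is smaller than the plain ratio `duration / campaign`. For example, a 261-second call with `campaign` equal to 1 gives 130.5.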

In [5]:
# Create a new column for duration per contact.
# The denominator is campaign + 1, so this is not the plain ratio duration / campaign.
data['duration_per_contact'] = data['duration'] / (data['campaign'] + 1)

# Check the new feature
data[['duration', 'campaign', 'duration_per_contact']].head()

Unnamed: 0,duration,campaign,duration_per_contact
0,261,1,130.5
1,151,1,75.5
2,76,1,38.0
3,92,1,46.0
4,198,1,99.0
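
To make the effect of the `+ 1` in the denominator concrete, the sketch below computes both versions side by side on a few made-up rows. The column name `duration_over_campaign` is hypothetical and only used here for comparison.

```python
import pandas as pd

# Toy rows (values are illustrative only).
toy = pd.DataFrame({'duration': [261, 151, 300], 'campaign': [1, 1, 3]})

# Formula used in the notebook: duration / (campaign + 1).
toy['duration_per_contact'] = toy['duration'] / (toy['campaign'] + 1)

# Plain ratio for comparison: duration / campaign.
toy['duration_over_campaign'] = toy['duration'] / toy['campaign']

print(toy)
```

For the first row, 261 / (1 + 1) = 130.5, while 261 / 1 = 261.0.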


### Was Contacted Before
The column `pdays` tells us how many days ago the customer was last contacted. If the value is "Not Contacted", it means the customer was never contacted before. We will create a new feature called `was_contacted_before` to show if the customer was contacted before or not.

In [6]:
# Create a new column for whether the customer was contacted before
data['was_contacted_before'] = data['pdays'].apply(lambda x: 0 if x == 'Not Contacted' else 1)

# Check the new feature
data[['pdays', 'was_contacted_before']].head()

Unnamed: 0,pdays,was_contacted_before
0,Not Contacted,0
1,Not Contacted,0
2,Not Contacted,0
3,Not Contacted,0
4,Not Contacted,0


## Summary
We created three new features:
1. `season`: Groups months into seasons to analyze seasonal trends.
2. `duration_per_contact`: Call duration divided by `campaign + 1`.
3. `was_contacted_before`: Indicates if the customer was contacted before.

These features will help us understand customer behavior and improve the model's predictions.

### One-Hot Encoding
We will encode the yes/no columns `default`, `housing`, `loan`, and `y` as numbers so the model can use them. Because `drop_first=True` drops the "no" category, each of these variables becomes a single column with the suffix `_yes` that is 1 for "yes" and 0 for "no".

In [7]:
# Perform One-Hot Encoding on selected columns
encoded_columns = ['default', 'housing', 'loan', 'y']
data = pd.get_dummies(data, columns=encoded_columns, drop_first=True)

# Ensure the encoded columns are 0 and 1
data = data.astype({col: 'int' for col in data.columns if col.endswith('_yes')})

# Check the changes
data.head()

Unnamed: 0,age,job,marital,education,balance,contact,day,month,duration,campaign,pdays,previous,poutcome,season,duration_per_contact,was_contacted_before,default_yes,housing_yes,loan_yes,y_yes
0,58,management,married,tertiary,2143,Other,5,may,261,1,Not Contacted,0,Other,Spring,130.5,0,0,1,0,0
1,44,technician,single,secondary,29,Other,5,may,151,1,Not Contacted,0,Other,Spring,75.5,0,0,1,0,0
2,33,entrepreneur,married,secondary,2,Other,5,may,76,1,Not Contacted,0,Other,Spring,38.0,0,0,1,1,0
3,47,blue-collar,married,Other,1506,Other,5,may,92,1,Not Contacted,0,Other,Spring,46.0,0,0,1,0,0
4,33,unknown,single,Other,1,Other,5,may,198,1,Not Contacted,0,Other,Spring,99.0,0,0,0,0,0
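
The effect of `drop_first=True` on a yes/no column is easier to see on a tiny example. This is a minimal sketch with made-up values; only `pd.get_dummies` and `astype` are used, as in the cell above.

```python
import pandas as pd

# Toy frame with two yes/no columns (values are illustrative only).
toy = pd.DataFrame({'housing': ['yes', 'no', 'yes'], 'y': ['no', 'no', 'yes']})

# drop_first=True drops the first category ("no"), leaving one column per variable.
encoded = pd.get_dummies(toy, columns=['housing', 'y'], drop_first=True)

# Convert the indicator columns to 0/1 integers.
encoded = encoded.astype(int)
print(encoded)
```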


### Save Changes
We will save the cleaned and processed data to a new CSV file, `Processed_Banking_Call_Data.csv`, so all the changes we made are kept. The original `Banking_Call_Data.csv` file is not modified.

In [8]:
# Keep a separate copy of the processed data.
# (A plain assignment would only create a second name for the same DataFrame.)
processed_data = data.copy()

# Write the processed data to a new CSV file without the index column
data.to_csv('/Users/evansmac/Downloads/IAT461/Processed_Banking_Call_Data.csv', index=False)

# Confirm that the save step finished
print("Data saved successfully!")

Data saved successfully!
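
A quick way to verify a saved file is to read it back and compare it with the DataFrame in memory. The sketch below does this on a small made-up frame and a temporary folder, so it does not touch the real files; the file name `processed_sample.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a mixed 'pdays' column, like the processed data.
toy = pd.DataFrame({
    'age': [58, 44],
    'pdays': ['Not Contacted', 182],
    'y_yes': [0, 1],
})

# Write to a temporary folder and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'processed_sample.csv')
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare shape and column names with the frame that was saved.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)

# The mixed column is read back as text, so numbers return as strings.
print(reloaded['pdays'].tolist())
```

Note that numeric `pdays` values come back as strings after the round trip, so any later numeric use of that column needs a conversion step.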


In [10]:
data.head()  # Display the first few rows of the processed data

Unnamed: 0,age,job,marital,education,balance,contact,day,month,duration,campaign,pdays,previous,poutcome,season,duration_per_contact,was_contacted_before,default_yes,housing_yes,loan_yes,y_yes
0,58,management,married,tertiary,2143,Other,5,may,261,1,Not Contacted,0,Other,Spring,130.5,0,0,1,0,0
1,44,technician,single,secondary,29,Other,5,may,151,1,Not Contacted,0,Other,Spring,75.5,0,0,1,0,0
2,33,entrepreneur,married,secondary,2,Other,5,may,76,1,Not Contacted,0,Other,Spring,38.0,0,0,1,1,0
3,47,blue-collar,married,Other,1506,Other,5,may,92,1,Not Contacted,0,Other,Spring,46.0,0,0,1,0,0
4,33,unknown,single,Other,1,Other,5,may,198,1,Not Contacted,0,Other,Spring,99.0,0,0,0,0,0
