**Data Set Information:**

The dataset includes various types of information about customers who have been in contact with a company. The features of the dataset include demographic information (year of birth, education level, marital status, income), household composition (number of small children and teenagers), customer relationship with the company (date of enrollment, number of days since the last purchase, number of visits to the company's website in the last month, whether or not they complained in the last two years), spending habits (amount spent on various product categories in the last two years, number of purchases made with discount, through the company's website, using a catalogue, directly in stores), participation in marketing campaigns (whether they accepted the offer in the previous five campaigns), and financial measures related to the customer contact (cost of the contact, revenue of the contact). The dependent variable, or target, is whether or not the customer accepted the offer in the last campaign (a boolean value).

**Business Problem:**

The primary business problem is to build a predictive model that determines whether a customer will accept an offer in a marketing campaign, which makes this a binary classification task. Such a model can be used to target the customers most likely to accept offers in future campaigns, optimizing marketing resources and potentially increasing revenue. The features may also show which factors most influence a customer's decision to accept an offer, which can guide future marketing strategy.

**Model Type:**

Classification

**Feature Category:**

categorical, continuous, datetime


**Independent Feature X:**
1. ID(unique key id) - unique identifier of each entry; the 2240 entries are a sample from the total dataset of 11191
2. Year_Birth(ordinal) - year of birth
3. Education(categorical) - customer’s level of education
4. Marital_Status(categorical) - customer’s marital status
5. Income(numerical) - customer’s yearly household income(USD)
6. Kidhome(ordinal) - number of small children in customer’s household
7. Teenhome(ordinal) - number of teenagers in customer’s household
8. Dt_Customer(Date/time) - date of customer’s enrolment with the company
9. Recency(numerical) - number of days since the last purchase
10. MntWines(numerical) - amount spent on wine products in the last 2 years
11. MntFruits(numerical) - amount spent on fruits products in the last 2 years
12. MntMeatProducts(numerical) - amount spent on meat products in the last 2 years
13. MntFishProducts(numerical) - amount spent on fish products in the last 2 years
14. MntSweetProducts(numerical) - amount spent on sweet products in the last 2 years
15. MntGoldProds(numerical) - amount spent on gold products in the last 2 years
16. NumDealsPurchases(numerical) - number of purchases made with discount
17. NumWebPurchases(numerical) - number of purchases made through company’s web site
18. NumCatalogPurchases(numerical) - number of purchases made using catalogue  
19. NumStorePurchases(numerical) - number of purchases made directly in stores
20. NumWebVisitsMonth(numerical) - number of visits to company’s web site in the last month
21-25. AcceptedCmp1-5(Boolean) - 1 if customer accepted the offer in the x-th campaign (x = 1 to 5), 0 otherwise
26. Complain(Boolean) - 1 if customer complained in the last 2 years, 0 otherwise
27. Z_CostContact(numerical) - cost of the contact
28. Z_Revenue(numerical) - revenue of the contact


**Dependent Feature Y:**

Response(Boolean) - 1 if customer accepted the offer in the last campaign, 0 otherwise

In [None]:
#@title Load Data
#@markdown Load and showcase Data
import pandas as pd
import numpy as np
from google.colab import drive
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
pd.set_option('display.max_columns', None)
drive.mount('/content/drive')
data=pd.read_csv('drive/MyDrive/data/marketing_campaign.csv', delimiter=';')
data.head(3)

Mounted at /content/drive


Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0


In [None]:
data.shape

(2240, 29)

In [None]:
#@title Missing Values
#@markdown Fill missing values
df=data.copy()
print(df.isnull().sum())
# Income is the only column with missing values; impute it with the median
df['Income'] = df['Income'].fillna(df['Income'].median())

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64
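
As a small illustration of the imputation step, the sketch below fills missing values in a toy `Income` column with the column median. The toy values are made up for illustration and are not taken from the dataset. The filled column is assigned back rather than modified with `inplace=True` on a column selection, which newer pandas versions may warn about.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the marketing dataset)
toy = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 70000.0, np.nan]})

# Median of the observed values; NaNs are skipped by default
income_median = toy["Income"].median()

# Assign the filled column back instead of using inplace=True
toy["Income"] = toy["Income"].fillna(income_median)

print(toy["Income"].tolist())
```

Filling with the median leaves the column median unchanged, which is why the later cells can reuse the median of the already-imputed column without changing the result.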


In [None]:
#@title Feed Example Data to ChatGPT
#@markdown **Prompts0:** This is the sample data for our marketing campaign dataset df, please keep in mind:
#@markdown {"ID":{"0":5524,"1":2174,"2":4141},"Year_Birth":{"0":1957,"1":1954,"2":1965},"Education":{"0":"Graduation","1":"Graduation","2":"Graduation"},"Marital_Status":{"0":"Single","1":"Single","2":"Together"},"Income":{"0":58138.0,"1":46344.0,"2":71613.0},"Kidhome":{"0":0,"1":1,"2":0},"Teenhome":{"0":0,"1":1,"2":0},"Dt_Customer":{"0":"2012-09-04","1":"2014-03-08","2":"2013-08-21"},"Recency":{"0":58,"1":38,"2":26},"MntWines":{"0":635,"1":11,"2":426},"MntFruits":{"0":88,"1":1,"2":49},"MntMeatProducts":{"0":546,"1":6,"2":127},"MntFishProducts":{"0":172,"1":2,"2":111},"MntSweetProducts":{"0":88,"1":1,"2":21},"MntGoldProds":{"0":88,"1":6,"2":42},"NumDealsPurchases":{"0":3,"1":2,"2":1},"NumWebPurchases":{"0":8,"1":1,"2":8},"NumCatalogPurchases":{"0":10,"1":1,"2":2},"NumStorePurchases":{"0":4,"1":2,"2":10},"NumWebVisitsMonth":{"0":7,"1":5,"2":4},"AcceptedCmp3":{"0":0,"1":0,"2":0},"AcceptedCmp4":{"0":0,"1":0,"2":0},"AcceptedCmp5":{"0":0,"1":0,"2":0},"AcceptedCmp1":{"0":0,"1":0,"2":0},"AcceptedCmp2":{"0":0,"1":0,"2":0},"Complain":{"0":0,"1":0,"2":0},"Z_CostContact":{"0":3,"1":3,"2":3},"Z_Revenue":{"0":11,"1":11,"2":11},"Response":{"0":1,"1":0,"2":0}}
#@markdown if you understand the sample data, please say 'I understand the sample data' and do not say anything else
df.head(3).to_json()

'{"ID":{"0":5524,"1":2174,"2":4141},"Year_Birth":{"0":1957,"1":1954,"2":1965},"Education":{"0":"Graduation","1":"Graduation","2":"Graduation"},"Marital_Status":{"0":"Single","1":"Single","2":"Together"},"Income":{"0":58138.0,"1":46344.0,"2":71613.0},"Kidhome":{"0":0,"1":1,"2":0},"Teenhome":{"0":0,"1":1,"2":0},"Dt_Customer":{"0":"2012-09-04","1":"2014-03-08","2":"2013-08-21"},"Recency":{"0":58,"1":38,"2":26},"MntWines":{"0":635,"1":11,"2":426},"MntFruits":{"0":88,"1":1,"2":49},"MntMeatProducts":{"0":546,"1":6,"2":127},"MntFishProducts":{"0":172,"1":2,"2":111},"MntSweetProducts":{"0":88,"1":1,"2":21},"MntGoldProds":{"0":88,"1":6,"2":42},"NumDealsPurchases":{"0":3,"1":2,"2":1},"NumWebPurchases":{"0":8,"1":1,"2":8},"NumCatalogPurchases":{"0":10,"1":1,"2":2},"NumStorePurchases":{"0":4,"1":2,"2":10},"NumWebVisitsMonth":{"0":7,"1":5,"2":4},"AcceptedCmp3":{"0":0,"1":0,"2":0},"AcceptedCmp4":{"0":0,"1":0,"2":0},"AcceptedCmp5":{"0":0,"1":0,"2":0},"AcceptedCmp1":{"0":0,"1":0,"2":0},"AcceptedCmp2":

In [None]:
#@title Feature Engineering for Customer Demographic Group
#@markdown **Prompts1**:  Create new features about customer demographic based on age, income, children amount, Marital Status as long as you think the new features might be helpful for modeling and prediction. give me a line of comment to describe your code and followed by a line of python code
#@markdown 
#@markdown **Prompts2**:  Create some typical household types or family types based on 'Age_Group', 'Income_Per_Family_Size_Group', 'Education', 'Children_amt'

# Customer demographic group
# Age: Subtracting the year of birth from the current year
df['Age'] = pd.Timestamp.now().year - df['Year_Birth']
# Categorize age into groups
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 18, 30, 40, 50, 60, float('inf')], labels=['<18','18-30', '30-39', '40-49', '50-59', '60+'])
# Categorize income into groups
df['Income_Group'] = pd.qcut(df['Income'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
# Binary indicator for income at or above the median
df['Is_High_Income'] = (df['Income'] >= df['Income'].median()).astype(int)
# Total number of children (small children plus teenagers) in the household
df['Children_amt'] = df['Kidhome'] + df['Teenhome']
# Calculate family size based on Kidhome, Teenhome, and Marital_Status
df['Family_Size'] = df['Kidhome'] + df['Teenhome'] + 2  # Add 2 for the customer and a partner; adjusted below for single customers
# Adjust family size based on Marital_Status
single_categories = ['Single', 'Alone', 'Absurd', 'YOLO', 'Divorced', 'Widow']
df.loc[df['Marital_Status'].isin(single_categories), 'Family_Size'] -= 1  # Subtract 1 for customers with single-like marital status
# Binary indicator for customers who are married
df['Is_Married'] = df['Marital_Status'].apply(lambda x: 1 if x in ['Together', 'Married'] else 0)
# Binary indicator for customers who are single
df['Is_Single'] = df['Marital_Status'].apply(lambda x: 1 if x in single_categories else 0)
# Calculate income per family size
df['Income_Per_Family_Size'] = df['Income'] / df['Family_Size']
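# Categorize Income_Per_Family_Size into groups
# NOTE: this line bins df['Income'], not df['Income_Per_Family_Size'], so the column duplicates Income_Group (visible in the outputs below)
df['Income_Per_Family_Size_Group'] = pd.qcut(df['Income'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])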
# Create a feature for high education
df['High_Education'] = df['Education'].isin(['PhD', 'Master']).astype(int)

# Create typical household types based on Age_Group, Income_Per_Family_Size_Group, Education, and Children_amt
conditions = [
    (df['Age_Group'] == '<18') & (df['Income_Per_Family_Size_Group'].isin(['High', 'Very High'])) & (df['Education'].isin(['PhD', 'Master'])) & (df['Children_amt'] >= 2),  # Wealthy Educated Large Family with Young Children
    (df['Age_Group'] == '18-30') & (df['Income_Per_Family_Size_Group'].isin(['Low', 'Very Low'])) & (df['Education'].isin(['Basic'])) & (df['Children_amt'] == 0),  # Poor Young Uneducated Couple
    (df['Age_Group'] == '30-39') & (df['Income_Per_Family_Size_Group'].isin(['Medium'])) & (df['Education'].isin(['Graduation'])) & (df['Children_amt'] >= 1),  # Average Educated Family with Children
    (df['Age_Group'] == '40-49') & (df['Income_Per_Family_Size_Group'].isin(['High', 'Very High'])) & (df['Education'].isin(['Basic', '2n Cycle'])) & (df['Children_amt'] >= 2),  # Wealthy Large Family with Lower Education
    (df['Age_Group'] == '50-59') & (df['Income_Per_Family_Size_Group'].isin(['Medium'])) & (df['Education'].isin(['PhD', 'Master'])) & (df['Children_amt'] == 0),  # Average Educated Empty Nesters
    (df['Age_Group'] == '60+') & (df['Income_Per_Family_Size_Group'].isin(['High', 'Very High'])) & (df['Education'].isin(['Graduation'])) & (df['Children_amt'] == 0),  # Wealthy Educated Senior Couple
]

family_types = [
    'Wealthy Educated Large Family with Young Children',
    'Poor Young Uneducated Couple',
    'Average Educated Family with Children',
    'Wealthy Large Family with Lower Education',
    'Average Educated Empty Nesters',
    'Wealthy Educated Senior Couple',
]

# Assign family types based on conditions
df['Family_Type'] = np.select(conditions, family_types, default='Other')

In [None]:
# Customer demographic features
df[list(set(df.columns).difference(data.columns))].head(3)

Unnamed: 0,High_Education,Is_Single,Is_High_Income,Age_Group,Family_Size,Is_Married,Age,Family_Type,Income_Group,Income_Per_Family_Size_Group,Children_amt,Income_Per_Family_Size
0,0,1,1,60+,1,0,66,Other,Medium,Medium,0,58138.0
1,0,1,0,60+,3,0,69,Other,Medium,Medium,2,15448.0
2,0,0,1,50-59,2,1,58,Other,High,High,0,35806.5
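
The `pd.cut` call above uses the default `right=True`, so each bin includes its upper edge and excludes its lower edge. The sketch below shows how ages that sit exactly on a bin edge are labelled under the default and under `right=False`. The four ages are chosen for illustration only; the bins and labels are copied from the cell above.

```python
import pandas as pd

ages = pd.Series([18, 30, 40, 60])  # illustrative edge cases
bins = [0, 18, 30, 40, 50, 60, float("inf")]
labels = ["<18", "18-30", "30-39", "40-49", "50-59", "60+"]

# Default right=True: intervals are (0, 18], (18, 30], (30, 40], ...
right_closed = pd.cut(ages, bins=bins, labels=labels).astype(str).tolist()

# right=False: intervals are [0, 18), [18, 30), [30, 40), ...
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str).tolist()

print(right_closed)
print(left_closed)
```

With the default, an 18-year-old lands in `<18` and a 60-year-old in `50-59`; with `right=False` the labels match the intervals, except that `18-30` then covers ages 18 to 29. The three sample rows shown above are not on a bin edge, so they are labelled the same either way.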


In [None]:
#@title Customer Engagement Features Engineering
#@markdown **Prompts3:** Create some new features about customer engagement info based on customer behavior features, as long as you think the new features might be helpful for modeling and prediction. give me a line of comment to describe your code and followed by a line of python code.
#'today_date' is the current date
today_date = datetime.now()
# Convert Dt_Customer to datetime object
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])

# Length of relationship with the company in days
df['Days_with_company'] = (today_date - df['Dt_Customer']).dt.days

# Customer Purchase Type Preference: 
# Analyze if the customer prefers buying certain categories of products. 
# This can be done by computing the fraction of each category in the total spending.
product_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df['Total_spent'] = df[product_cols].sum(axis=1)
for col in product_cols:
    df['Fraction_' + col] = df[col] / np.where(df['Total_spent']==0,1,df['Total_spent'])

# Customer Purchase Channel Preference: 
# Understand if the customer has a preference for a particular channel - web, catalog, or store.
purchase_cols = ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
df['Total_purchases'] = df[purchase_cols].sum(axis=1)
for col in purchase_cols:
    df['Fraction_' + col] = df[col] / np.where(df['Total_purchases']==0,1,df['Total_purchases'])

# Discount Affinity: 
# Ratio of purchases made with a discount to total purchases to understand how much a customer is attracted to discounts.
df['Discount_Affinity'] = df['NumDealsPurchases'] / np.where(df['Total_purchases']==0,1,df['Total_purchases'])

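# Customer Engagement Score: 
# Create a custom score based on frequency and amount of purchases, and the recency of the last purchase.
# Recency is added as-is (more days since the last purchase raise the score), and the three terms are not rescaled before summing.
df['Engagement_Score'] = df['Recency'] + df['Total_purchases'] + df['Total_spent']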
df['Engagement_Score_bin'] = pd.cut(df['Engagement_Score'], bins=5, labels=range(1,6)).astype(int)

# Customer Loyalty Score: 
# Score based on the length of the customer relationship and the Customer Engagement.
df['Loyalty_Score'] = df['Days_with_company'] + df['Engagement_Score']
df['Loyalty_Score_bin']=pd.cut(df['Loyalty_Score'], bins=5, labels=range(1,6)).astype(int)

In [None]:
# All engineered features so far (demographic and engagement)
df[list(set(df.columns).difference(data.columns))].head(3)

Unnamed: 0,Days_with_company,Fraction_NumStorePurchases,High_Education,Fraction_NumDealsPurchases,Engagement_Score_bin,Fraction_MntGoldProds,Fraction_MntSweetProducts,Fraction_MntMeatProducts,Is_Single,Is_High_Income,Age_Group,Family_Size,Fraction_MntFruits,Fraction_NumCatalogPurchases,Engagement_Score,Is_Married,Total_spent,Fraction_NumWebPurchases,Age,Total_purchases,Family_Type,Loyalty_Score_bin,Fraction_MntWines,Income_Group,Income_Per_Family_Size_Group,Discount_Affinity,Loyalty_Score,Children_amt,Fraction_MntFishProducts,Income_Per_Family_Size
0,3919,0.16,0,0.12,4,0.054422,0.054422,0.337662,1,1,60+,1,0.054422,0.4,1700,0,1617,0.32,66,25,Other,4,0.392703,Medium,Medium,0.12,5619,0,0.10637,58138.0
1,3369,0.333333,0,0.333333,1,0.222222,0.037037,0.222222,1,0,60+,3,0.037037,0.166667,71,0,27,0.166667,69,6,Other,1,0.407407,Medium,Medium,0.333333,3440,2,0.074074,15448.0
2,3568,0.47619,0,0.047619,2,0.054124,0.027062,0.16366,0,1,50-59,2,0.063144,0.095238,823,1,776,0.380952,58,21,Other,2,0.548969,High,High,0.047619,4391,0,0.143041,35806.5
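
The fraction features above divide by a total that can be zero. A minimal sketch of the same zero-guard on a toy frame follows; the column names mirror the notebook, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the second customer spent nothing
toy = pd.DataFrame({
    "MntWines": [60, 0],
    "MntFruits": [40, 0],
})
cols = ["MntWines", "MntFruits"]
toy["Total_spent"] = toy[cols].sum(axis=1)

# Replace a zero total with 1 so the division yields 0 instead of NaN
safe_total = np.where(toy["Total_spent"] == 0, 1, toy["Total_spent"])
for col in cols:
    toy["Fraction_" + col] = toy[col] / safe_total

print(toy)
```

For a customer with a non-zero total the fractions sum to 1; for a customer with a zero total they are all 0.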


In [None]:
#@title Marketing Campaign Response Engineering
#@markdown **Prompts4:** Create some new features based on historical marketing campaign response, marketing campaign cost and revenue, customer complain, as long as you think the new features might be helpful for modeling and prediction. give me python code and comments

# 1. Total number of campaigns accepted by a customer
df['TotalAcceptedCmp'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']

# 2. Whether a customer accepted any campaign
df['AcceptedAnyCmp'] = df['TotalAcceptedCmp'].apply(lambda x: 1 if x > 0 else 0)

# 3. Whether a customer accepted all campaigns
df['AcceptedAllCmp'] = df['TotalAcceptedCmp'].apply(lambda x: 1 if x == 5 else 0)

# 4. Revenue to Cost ratio for the contact
df['RevenueCostRatio'] = df['Z_Revenue'] / np.where(df['Z_CostContact']==0,1,df['Z_CostContact'])

# 5. Did customer accept last two campaigns?
df['AcceptedLastTwoCmp'] = df[['AcceptedCmp4', 'AcceptedCmp5']].sum(axis=1).apply(lambda x: 1 if x == 2 else 0)

# 6. Did customer accept any of the first three campaigns?
df['AcceptedAnyFirstThreeCmp'] = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3']].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)

# 7. Has customer complained and accepted a campaign?
df['ComplainedAcceptedCmp'] = df.apply(lambda x: 1 if (x['Complain'] == 1 and x['AcceptedAnyCmp'] == 1) else 0, axis=1)

# 8. Has customer accepted a campaign but not complained?
df['AcceptedCmpNotComplained'] = df.apply(lambda x: 1 if (x['Complain'] == 0 and x['AcceptedAnyCmp'] == 1) else 0, axis=1)

# 9. Has customer complained and not accepted any campaign?
df['ComplainedNotAcceptedCmp'] = df.apply(lambda x: 1 if (x['Complain'] == 1 and x['AcceptedAnyCmp'] == 0) else 0, axis=1)

# 10. Average revenue per accepted campaign
df['AvgRevenuePerAcceptedCmp'] = df.apply(lambda x: x['Z_Revenue'] / x['TotalAcceptedCmp'] if x['TotalAcceptedCmp'] > 0 else 0, axis=1)


In [None]:
# All engineered features so far (demographic, engagement, and campaign response)
df[list(set(df.columns).difference(data.columns))].head(3)

Unnamed: 0,Fraction_NumStorePurchases,Fraction_NumDealsPurchases,Engagement_Score_bin,RevenueCostRatio,AcceptedCmpNotComplained,Fraction_MntMeatProducts,Is_High_Income,AvgRevenuePerAcceptedCmp,Age_Group,Family_Size,Fraction_MntFruits,AcceptedAllCmp,TotalAcceptedCmp,Fraction_NumWebPurchases,Total_purchases,Loyalty_Score_bin,Income_Group,AcceptedAnyFirstThreeCmp,Income_Per_Family_Size_Group,Discount_Affinity,Loyalty_Score,Children_amt,Fraction_MntFishProducts,ComplainedAcceptedCmp,Days_with_company,High_Education,Fraction_MntGoldProds,AcceptedLastTwoCmp,Fraction_MntSweetProducts,Is_Single,Fraction_NumCatalogPurchases,Engagement_Score,Is_Married,Total_spent,Age,Family_Type,Fraction_MntWines,AcceptedAnyCmp,ComplainedNotAcceptedCmp,Income_Per_Family_Size
0,0.16,0.12,4,3.666667,0,0.337662,1,0.0,60+,1,0.054422,0,0,0.32,25,4,Medium,0,Medium,0.12,5619,0,0.10637,0,3919,0,0.054422,0,0.054422,1,0.4,1700,0,1617,66,Other,0.392703,0,0,58138.0
1,0.333333,0.333333,1,3.666667,0,0.222222,0,0.0,60+,3,0.037037,0,0,0.166667,6,1,Medium,0,Medium,0.333333,3440,2,0.074074,0,3369,0,0.222222,0,0.037037,1,0.166667,71,0,27,69,Other,0.407407,0,0,15448.0
2,0.47619,0.047619,2,3.666667,0,0.16366,1,0.0,50-59,2,0.063144,0,0,0.380952,21,2,High,0,High,0.047619,4391,0,0.143041,0,3568,0,0.054124,0,0.027062,0,0.095238,823,1,776,58,Other,0.548969,0,0,35806.5
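
In the three sample rows shown, `Z_CostContact` is 3 and `Z_Revenue` is 11 in every row, so `RevenueCostRatio` is 3.666667 throughout. If these columns are constant across the whole dataset (not verified here), they and any feature derived only from them carry no information for the model. The sketch below shows one way to detect such columns, using the sample values from above.

```python
import pandas as pd

# Values copied from the three sample rows shown earlier
toy = pd.DataFrame({
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
    "Recency": [58, 38, 26],
})

# A column with a single distinct value cannot help separate the classes
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)

# Drop the constant columns before modeling
reduced = toy.drop(columns=constant_cols)
print(list(reduced.columns))
```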


In [None]:
#@title Feature Encoding
#@markdown **Prompts5:** I have generated many new features, think about which features need to be encoded to numeric features, and give me the best encoding strategy for them. This is the current sample data for df: (copy the json code below)
df.head(1).to_json()

'{"ID":{"0":5524},"Year_Birth":{"0":1957},"Education":{"0":"Graduation"},"Marital_Status":{"0":"Single"},"Income":{"0":58138.0},"Kidhome":{"0":0},"Teenhome":{"0":0},"Dt_Customer":{"0":1346716800000},"Recency":{"0":58},"MntWines":{"0":635},"MntFruits":{"0":88},"MntMeatProducts":{"0":546},"MntFishProducts":{"0":172},"MntSweetProducts":{"0":88},"MntGoldProds":{"0":88},"NumDealsPurchases":{"0":3},"NumWebPurchases":{"0":8},"NumCatalogPurchases":{"0":10},"NumStorePurchases":{"0":4},"NumWebVisitsMonth":{"0":7},"AcceptedCmp3":{"0":0},"AcceptedCmp4":{"0":0},"AcceptedCmp5":{"0":0},"AcceptedCmp1":{"0":0},"AcceptedCmp2":{"0":0},"Complain":{"0":0},"Z_CostContact":{"0":3},"Z_Revenue":{"0":11},"Response":{"0":1},"Age":{"0":66},"Age_Group":{"0":"60+"},"Income_Group":{"0":"Medium"},"Is_High_Income":{"0":1},"Children_amt":{"0":0},"Family_Size":{"0":1},"Is_Married":{"0":0},"Is_Single":{"0":1},"Income_Per_Family_Size":{"0":58138.0},"Income_Per_Family_Size_Group":{"0":"Medium"},"High_Education":{"0":0},"Fa

In [None]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Ordinal encoding for ordinal features
ordinal_features = ['Education', 'Age_Group', 'Income_Group', 'Income_Per_Family_Size_Group']
ordinal_encoder = OrdinalEncoder()
df[ordinal_features] = ordinal_encoder.fit_transform(df[ordinal_features])

# One-hot encoding for nominal features
nominal_features = ['Marital_Status', 'Family_Type']
df = pd.get_dummies(df, columns=nominal_features, prefix=nominal_features, drop_first=True)
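
By default, `OrdinalEncoder` assigns codes in the sorted order of the unique values, which for strings is alphabetical, so `'High'` would receive a lower code than `'Low'`. The sketch below contrasts alphabetical codes with codes from an explicit ranking, using pandas only. The ranking of the income labels follows the `pd.qcut` call above; the sample values are illustrative.

```python
import pandas as pd

# Intended ranking, taken from the labels passed to pd.qcut above
income_order = ["Very Low", "Low", "Medium", "High", "Very High"]
values = pd.Series(["Medium", "Very Low", "Very High", "Low", "High"])

# Alphabetical order, which is what sorting the unique strings produces
alphabetical = sorted(values.unique())
alphabetical_codes = values.map({v: i for i, v in enumerate(alphabetical)}).tolist()

# Explicit order: codes follow the intended ranking
ordered = pd.Categorical(values, categories=income_order, ordered=True)
ordered_codes = ordered.codes.tolist()

print(alphabetical_codes)
print(ordered_codes)
```

With scikit-learn, the same ranking could be supplied through the `categories` parameter of `OrdinalEncoder`, one list per column. Any ranking chosen for `Education` would be an assumption about how the education levels compare.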


In [None]:
# Drop the identifier and the raw year/date columns, which are already captured by Age and Days_with_company
df1=df.drop(['ID','Year_Birth','Dt_Customer'],axis=1)
df1.dtypes

Education                                                float64
Income                                                   float64
Kidhome                                                    int64
Teenhome                                                   int64
Recency                                                    int64
                                                          ...   
Family_Type_Average Educated Family with Children          uint8
Family_Type_Other                                          uint8
Family_Type_Poor Young Uneducated Couple                   uint8
Family_Type_Wealthy Educated Senior Couple                 uint8
Family_Type_Wealthy Large Family with Lower Education      uint8
Length: 76, dtype: object

In [None]:
# The dataset is imbalanced (about 85% negative, 15% positive), so class weights should be used in model training
df1['Response'].value_counts()/df1['Response'].value_counts().sum()

0    0.850893
1    0.149107
Name: Response, dtype: float64
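
The class proportions above translate directly into a class-weight ratio. The sketch below derives the negative-to-positive ratio, which is a common starting value for XGBoost's `scale_pos_weight` parameter. The row count of 2240 comes from `data.shape`; the ratio is computed on the full dataset, so the ratio inside a particular training split may differ.

```python
# Class proportions reported above
p_negative = 0.850893
p_positive = 0.149107
n_rows = 2240  # from data.shape

# Approximate class counts implied by the proportions
n_positive = round(n_rows * p_positive)
n_negative = n_rows - n_positive

# Ratio of negative to positive examples
neg_to_pos = p_negative / p_positive

print(n_negative, n_positive)
print(round(neg_to_pos, 2))
```

The result is about 5.71, close to the value of 5.67 passed to `scale_pos_weight` in the XGBoost cell.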

In [None]:
#@title Modeling RF on original data
#@markdown Train Model on original data without any feature engineering

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

df2=data.copy()
# Impute missing Income values with the median of this frame
df2['Income'] = df2['Income'].fillna(df2['Income'].median())

# One-hot encoding for nominal features
# Year_Birth is treated as nominal here, which produces one dummy column per birth year
nominal_features = ['Marital_Status','Year_Birth','Education']
df2 = pd.get_dummies(df2, columns=nominal_features, prefix=nominal_features, drop_first=True)

# drop ID and Dt_Customer
df2=df2.drop(['ID','Dt_Customer'],axis=1)

# Split the data into training and testing sets
X = df2.drop('Response', axis=1)  # Features (independent variables)
y = df2['Response']  # Target variable (dependent variable)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the Random Forest model
rf_model = RandomForestClassifier(random_state=42, class_weight= "balanced")
rf_model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rf_model.predict(X_test)

# Evaluate model performance
f1 = f1_score(y_test, y_pred)
print("F1-Score:", f1)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

F1-Score: 0.45454545454545453
Confusion Matrix:
[[363   5]
 [ 55  25]]


In [None]:
#@title Modeling RF on data with feature engineering
#@markdown Train Model on dataframe after feature engineering

# Split the data into training and testing sets
X = df1.drop('Response', axis=1)  # Features (independent variables)
y = df1['Response']  # Target variable (dependent variable)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the Random Forest model
rf_model = RandomForestClassifier(random_state=42, class_weight= "balanced")
rf_model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rf_model.predict(X_test)

# Evaluate model performance
f1 = f1_score(y_test, y_pred)
print("F1-Score:", f1)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

F1-Score: 0.5833333333333334
Confusion Matrix:
[[363   5]
 [ 45  35]]


In [None]:
#@title Modeling Xgboost on data with feature engineering
#@markdown Train Model on dataframe after feature engineering

import xgboost as xgb
# Split the data into training and testing sets
X = df1.drop('Response', axis=1)  # Features (independent variables)
y = df1['Response']  # Target variable (dependent variable)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

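# Define the XGBoost model
# scale_pos_weight up-weights the positive (minority) class; 5.67 approximates the negative-to-positive ratio
xgb_model = xgb.XGBClassifier(random_state=42, scale_pos_weight= 5.67)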

# Train the XGBoost model
xgb_model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = xgb_model.predict(X_test)

# Evaluate model performance
f1 = f1_score(y_test, y_pred)
print("F1-Score:", f1)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

F1-Score: 0.6986301369863013
Confusion Matrix:
[[353  15]
 [ 29  51]]


In [None]:
#@title Modeling with ensemble methods on data with feature engineering
#@markdown Train ensemble Model on dataframe after feature engineering

#!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.metrics import f1_score

#!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

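# Split the data into training and testing sets
# (random_state=42 here, so the test set differs from the random_state=0 split used in the cells above)
train_data, test_data = train_test_split(df1, test_size=0.2, random_state=42)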

# Separate features and target
X = train_data.drop('Response', axis=1)
y = train_data['Response']

smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)

train_balanced = X_sm.copy()
train_balanced['Response'] = y_sm


# Specify the label column name
label = 'Response'

# Initialize the TabularPredictor with the desired evaluation metric
metric = 'f1_macro'
predictor = TabularPredictor(label=label, eval_metric=metric)

# Define the models to include in the modeling process
models = {'GBM': {}, 'NN_TORCH': {}, 'RF': {}, 'XT': {}, 'CAT': {}, 'XGB': {}}

# Perform modeling and prediction using AutoGluon
predictor.fit(train_balanced, presets='best_quality', hyperparameters=models)

# Make predictions on the test data
predictions = predictor.predict(test_data)

# Get the predicted probabilities for each class
probabilities = predictor.predict_proba(test_data)

# Evaluate the model performance on the test data
# (macro-averaged F1 here, whereas the cells above report the F1 of the positive class only)
f1 = f1_score(test_data[label], predictions, average='macro')

# Print the F1-score
print("F1-Score:", f1)

# Generate the confusion matrix
cm = confusion_matrix(test_data[label], predictions)
print("Confusion Matrix:")
print(cm)


No path specified. Models will be saved in: "AutogluonModels/ag-20230529_224219/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230529_224219/"
AutoGluon Version:  0.7.0
Python Version:     3.10.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Sat Apr 29 09:15:28 UTC 2023
Train Data Rows:    3054
Train Data Columns: 75
Label Column: Response
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preproces

F1-Score: 0.7257380025940336
Confusion Matrix:
[[356  23]
 [ 36  33]]
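
The ensemble cell reports a macro-averaged F1, while the earlier cells report the F1 of the positive class, so the printed scores are not directly comparable. The sketch below recomputes both figures from the four confusion matrices printed in this notebook, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout. The helper `f1_per_class` is a hypothetical name introduced here for illustration.

```python
def f1_per_class(cm):
    """Return (F1 of class 0, F1 of class 1) for a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return f1_neg, f1_pos

# Confusion matrices copied from the outputs in this notebook
reported = {
    "RF, original data": [[363, 5], [55, 25]],
    "RF, engineered features": [[363, 5], [45, 35]],
    "XGBoost, engineered features": [[353, 15], [29, 51]],
    "AutoGluon ensemble": [[356, 23], [36, 33]],
}

for name, cm in reported.items():
    f1_neg, f1_pos = f1_per_class(cm)
    macro = (f1_neg + f1_pos) / 2
    print(f"{name}: positive-class F1 = {f1_pos:.3f}, macro F1 = {macro:.3f}")
```

On the positive class, the ensemble's F1 works out to 0.528, compared with 0.699 for XGBoost and 0.583 for the Random Forest on engineered features. The ensemble was also evaluated on a different test split (`random_state=42` rather than `0`), so even the positive-class figures are only a rough comparison.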
