# Problem Statement

Predicting Coupon Redemption

XYZ Credit Card company regularly helps its merchants understand their data better and take key business decisions accurately by providing machine learning and analytics consulting. ABC is an established Brick & Mortar retailer that frequently conducts marketing campaigns for its diverse product range. As a merchant of XYZ, they have asked XYZ to assist them in their discount marketing process using the power of machine learning. Can you wear the AmExpert hat and help out ABC?

Discount marketing and coupon usage are very widely used promotional techniques to attract new customers and to retain & reinforce loyalty of existing customers. The measurement of a consumer’s propensity towards coupon usage and the prediction of the redemption behaviour are crucial parameters in assessing the effectiveness of a marketing campaign.

ABC’s promotions are shared across various channels including email, notifications, etc. A number of these campaigns include coupon discounts that are offered for a specific product/range of products. The retailer would like the ability to predict whether customers redeem the coupons received across channels, which will enable the retailer’s marketing team to accurately design coupon construct, and develop more precise and targeted marketing strategies.

The data available in this problem contains the following information, including the details of a sample of campaigns and coupons used in previous campaigns -

- User demographic details
- Campaign and coupon details
- Product details
- Previous transactions

Based on previous transaction and performance data from the last 18 campaigns, predict, for each coupon and customer combination in the next 10 campaigns (the test set), the probability that the customer will redeem the coupon.

# Dataset Description

Here is the schema for the different data tables available. The detailed data dictionary is provided next.

You are provided with the following files in train.zip:

train.csv: Train data containing the coupons offered to the given customers under the 18 campaigns

| Variable | Definition |
|---|---|
| id | Unique id for coupon customer impression |
| campaign_id | Unique id for a discount campaign |
| coupon_id | Unique id for a discount coupon |
| customer_id | Unique id for a customer |
| redemption_status | (target) (0 - Coupon not redeemed, 1 - Coupon redeemed) |

campaign_data.csv: Campaign information for each of the 28 campaigns

| Variable | Definition |
|---|---|
| campaign_id | Unique id for a discount campaign |
| campaign_type | Anonymised Campaign Type (X/Y) |
| start_date | Campaign Start Date |
| end_date | Campaign End Date |

coupon_item_mapping.csv: Mapping of coupon and items valid for discount under that coupon

| Variable | Definition |
|---|---|
| coupon_id | Unique id for a discount coupon (no order) |
| item_id | Unique id for items for which given coupon is valid (no order) |

customer_demographics.csv: Customer demographic information for some customers

| Variable | Definition |
|---|---|
| customer_id | Unique id for a customer |
| age_range | Age range of customer family in years |
| marital_status | Married/Single |
| rented | 0 - not rented accommodation, 1 - rented accommodation |
| family_size | Number of family members |
| no_of_children | Number of children in the family |
| income_bracket | Label Encoded Income Bracket (Higher income corresponds to higher number) |

customer_transaction_data.csv: Transaction data for all customers for duration of campaigns in the train data

| Variable | Definition |
|---|---|
| date | Date of Transaction |
| customer_id | Unique id for a customer |
| item_id | Unique id for item |
| quantity | Quantity of item bought |
| selling_price | Sales value of the transaction |
| other_discount | Discount from other sources such as manufacturer coupon/loyalty card |
| coupon_discount | Discount availed from retailer coupon |

item_data.csv: Item information for each item sold by the retailer

| Variable | Definition |
|---|---|
| item_id | Unique id for item |
| brand | Unique id for item brand |
| brand_type | Brand Type (local/Established) |
| category | Item Category |

test.csv: Contains the coupon customer combination for which redemption status is to be predicted

| Variable | Definition |
|---|---|
| id | Unique id for coupon customer impression |
| campaign_id | Unique id for a discount campaign |
| coupon_id | Unique id for a discount coupon |
| customer_id | Unique id for a customer |

Note: Campaign, coupon and customer data for the test set are also contained in train.zip

sample_submission.csv: This file contains the format in which you have to submit your predictions.

To summarise the entire process:

- Customers receive coupons under various campaigns and may choose to redeem them.
- They can redeem a given coupon for any product that is valid for that coupon as per the coupon item mapping, within the period between the campaign start date and end date.
- Next, the customer redeems the coupon for an item at the retailer store, and that is reflected in the transaction table in the column coupon_discount.
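
As a rough illustration of that flow, the sketch below flags redeemed transactions in a toy transaction table and looks up the coupons valid for the purchased item. The toy rows are made up, and treating a negative coupon_discount as the sign of a redemption is an assumption based on how other_discount appears in the sample data.

```python
import pandas as pd

# Toy stand-ins for customer_transaction_data.csv and coupon_item_mapping.csv.
transactions = pd.DataFrame({
    "date": ["2013-05-20", "2013-05-21", "2013-05-21"],
    "customer_id": [1, 1, 2],
    "item_id": [37, 75, 37],
    "coupon_discount": [-17.81, 0.0, 0.0],
})
coupon_items = pd.DataFrame({"coupon_id": [105, 107], "item_id": [37, 75]})

# Assumption: a redemption shows up as a negative coupon_discount.
redeemed = transactions[transactions["coupon_discount"] < 0]

# Attach the coupons that are valid for each redeemed item.
redeemed_coupons = redeemed.merge(coupon_items, on="item_id", how="inner")
print(redeemed_coupons[["customer_id", "item_id", "coupon_id", "coupon_discount"]])
```

A fuller version would also check that the transaction date falls between the campaign start and end dates.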



In [1]:
'''
https://www.kaggle.com/bharath901/amexpert-2019/data#
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
import csv

Helper class to import the various files

In [2]:
'''
Define a class to import the data into a pandas DataFrame for analysis
- Method to store the data in a DataFrame
- Method to show the datatype associated with each column
- Method to describe the data
- Method to show the distribution of null values
- Method to show the distribution of unique values
'''

class import_data():
    
    '''
    Method to extract and store the data as pandas dataframe
    '''
    def __init__(self,path):
        self.raw_data = pd.read_csv(path)
        display (self.raw_data.head(10))
        

    '''
    Method to extract information about the data and display
    '''
    def get_info(self):
        # info() prints its summary directly and returns None, so no display() is needed
        self.raw_data.info()
        
    '''
    Method to describe the data
    '''
    def get_describe(self):
        display (self.raw_data.describe())
        
    '''
    Method to understand Null values distribution
    '''
    def null_value(self):
        col_null = pd.DataFrame(self.raw_data.isnull().sum()).reset_index()
        col_null.columns = ['DataColumns','NullCount']
        col_null['NullCount_Pct'] = round((col_null['NullCount']/self.raw_data.shape[0])*100,2)
        display (col_null)
        
    '''
    Method to understand Unique values distribution
    '''
    def unique_value(self):
        col_uniq = pd.DataFrame(self.raw_data.nunique()).reset_index()
        col_uniq.columns = ['DataColumns','UniqCount']
        col_uniq_cnt = pd.DataFrame(self.raw_data.count(axis=0)).reset_index()
        # count() gives the number of non-null values per column
        col_uniq_cnt.columns = ['DataColumns','NonNullCount']
        col_uniq['UniqCount_Pct'] = round((col_uniq['UniqCount']/col_uniq_cnt['NonNullCount'])*100,2)
        display (col_uniq)
    '''
    Method to return the dataset as dataframe
    '''
    def return_data(self):
        return self.raw_data

In [3]:
'''
Evaluation and analysis start here for train.csv
'''
#path = str(input('Enter the path to load the dataset:'))
path = './train.csv'
print ('='*100)
data = import_data(path)
#data.get_info()
#data.null_value()
#data.unique_value()
#data.get_describe()
train_data = data.return_data()



Unnamed: 0,id,campaign_id,coupon_id,customer_id,redemption_status
0,1,13,27,1053,0
1,2,13,116,48,0
2,6,9,635,205,0
3,7,13,644,1050,0
4,9,8,1017,1489,0
5,11,11,795,793,0
6,14,9,444,590,0
7,15,29,538,368,0
8,17,30,857,523,0
9,19,2,559,679,0


In [4]:
'''
Evaluation and analysis start here for Campaign Data
'''
#path = str(input('Enter the path to load the dataset:'))
path = './campaign_data.csv'
print ('='*100)
data = import_data(path)
#data.get_info()
#data.null_value()
#data.unique_value()
#data.get_describe()
campaign_data = data.return_data()



Unnamed: 0,campaign_id,campaign_type,start_date,end_date
0,24,Y,21/10/13,20/12/13
1,25,Y,21/10/13,22/11/13
2,20,Y,07/09/13,16/11/13
3,23,Y,08/10/13,15/11/13
4,21,Y,16/09/13,18/10/13
5,22,X,16/09/13,18/10/13
6,18,X,10/08/13,04/10/13
7,19,Y,26/08/13,27/09/13
8,17,Y,29/07/13,30/08/13
9,16,Y,15/07/13,16/08/13


In [5]:
'''
Evaluation and analysis start here for Coupon Data
'''
#path = str(input('Enter the path to load the dataset:'))
path = './coupon_item_mapping.csv'
print ('='*100)
data = import_data(path)
#data.get_info()
#data.null_value()
#data.unique_value()
#data.get_describe()
coupon_data = data.return_data()



Unnamed: 0,coupon_id,item_id
0,105,37
1,107,75
2,494,76
3,522,77
4,518,77
5,520,77
6,529,77
7,524,77
8,522,81
9,518,81


In [6]:
'''
Evaluation and analysis start here for Item Data
'''
#path = str(input('Enter the path to load the dataset:'))
path = './item_data.csv'
print ('='*100)
data = import_data(path)
#data.get_info()
#data.null_value()
#data.unique_value()
#data.get_describe()
item_data = data.return_data()



Unnamed: 0,item_id,brand,brand_type,category
0,1,1,Established,Grocery
1,2,1,Established,Miscellaneous
2,3,56,Local,Bakery
3,4,56,Local,Grocery
4,5,56,Local,Grocery
5,6,56,Local,Grocery
6,7,56,Local,Pharmaceutical
7,8,56,Local,Bakery
8,9,11,Local,Grocery
9,10,56,Local,Grocery


In [7]:
'''
Evaluation and analysis start here for Customer Demographic
'''
#path = str(input('Enter the path to load the dataset:'))
path = './customer_demographics.csv'
print ('='*100)
data = import_data(path)
#data.get_info()
#data.null_value()
#data.unique_value()
#data.get_describe()
cust_demo_data = data.return_data()



Unnamed: 0,customer_id,age_range,marital_status,rented,family_size,no_of_children,income_bracket
0,1,70+,Married,0,2,,4
1,6,46-55,Married,0,2,,5
2,7,26-35,,0,3,1.0,3
3,8,26-35,,0,4,2.0,6
4,10,46-55,Single,0,1,,5
5,11,70+,Single,0,2,,1
6,12,46-55,Married,0,2,,7
7,13,36-45,Single,0,1,,2
8,14,26-35,Married,1,2,,6
9,15,46-55,Married,0,2,,6


In [8]:
'''
Evaluation and analysis start here for Customer Transaction
'''
#path = str(input('Enter the path to load the dataset:'))
path = './customer_transaction_data.csv'
print ('='*100)
data = import_data(path)
#data.get_info()
#data.null_value()
#data.unique_value()
#data.get_describe()
cust_tran_data = data.return_data()



Unnamed: 0,date,customer_id,item_id,quantity,selling_price,other_discount,coupon_discount
0,2012-01-02,1501,26830,1,35.26,-10.69,0.0
1,2012-01-02,1501,54253,1,53.43,-13.89,0.0
2,2012-01-02,1501,31962,1,106.5,-14.25,0.0
3,2012-01-02,1501,33647,1,67.32,0.0,0.0
4,2012-01-02,1501,48199,1,71.24,-28.14,0.0
5,2012-01-02,1501,57397,1,71.24,-28.14,0.0
6,2012-01-02,857,12424,1,106.5,-14.25,0.0
7,2012-01-02,857,14930,1,110.07,0.0,0.0
8,2012-01-02,857,16657,1,89.05,-35.26,0.0
9,2012-01-02,67,10537,3,32.06,0.0,0.0


In [9]:
'''
Function to write the experiment result to csv file
'''

file_write_cnt = 1

# writing the baseline results to csv file 

def write_file(F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11):
        
    # field names 
    fields = ['Expt No','Outlier Treatment','Skewness Treatment','Null Treatment',
              'No of Features','Feature Selected','Model Used','Precision for 1',
              'Recall for 1','Accuracy','Comment']
  
    # data in the file
    rows = [[F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11]]
        
    # name of csv file 
    filename = "./score_dashboard.csv"
  
    # Experiment 1 starts a fresh file with a header row; later experiments append.
    # newline='' stops the csv module from emitting blank rows on Windows.
    if int(F1) == 1:
        with open(filename, 'w', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(fields)
            csvwriter.writerows(rows)
    elif int(F1) > 1:
        with open(filename, 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerows(rows)
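
The write_file function above decides between writing and appending from the experiment number. A minimal alternative sketch, assuming it is acceptable to key the decision on whether the file already has content, is shown here; append_result and the demo path are hypothetical names, and the logged values are made up.

```python
import csv
import os
import tempfile

# Column names copied from write_file above.
FIELDS = ['Expt No', 'Outlier Treatment', 'Skewness Treatment', 'Null Treatment',
          'No of Features', 'Feature Selected', 'Model Used', 'Precision for 1',
          'Recall for 1', 'Accuracy', 'Comment']

def append_result(filename, row):
    """Append one experiment row; write the header only if the file is new or empty."""
    needs_header = not os.path.exists(filename) or os.path.getsize(filename) == 0
    with open(filename, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        # Keys missing from row are written as empty cells.
        writer.writerow(row)

# Usage with a throwaway file and placeholder rows.
demo_path = os.path.join(tempfile.mkdtemp(), 'score_dashboard_demo.csv')
append_result(demo_path, {'Expt No': 1, 'Model Used': 'Logistic Regression', 'Comment': 'demo row'})
append_result(demo_path, {'Expt No': 2, 'Model Used': 'Random Forest Classifier', 'Comment': 'demo row'})

with open(demo_path, newline='') as csvfile:
    logged = list(csv.DictReader(csvfile))
print(len(logged), "rows logged")
```

With this shape the caller no longer has to keep file_write_cnt in step with the file on disk.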

In [10]:
'''
Function to convert the date into quarter
'''

def date_q(date,split):
    
    if split == '/':
        
        """
        Convert Date to Quarter when separated with /
        """
        qdate = date.strip().split('/')[1:]
        qdate1 = qdate[0]

        if qdate1 in ['01','02','03']:
            return (str('Q1' + '-' + qdate[1]))
        if qdate1 in ['04','05','06']:
            return (str('Q2' + '-' + qdate[1]))
        if qdate1 in ['07','08','09']:
            return (str('Q3' + '-' + qdate[1]))
        if qdate1 in ['10','11','12']:
            return (str('Q4' + '-' + qdate[1]))
        
    if split == '-':
        
        """
        Convert Date to Quarter when separated with -
        """
        qdate = date.strip().split('-')[0:2]
        qdate1 = qdate[1]
        qdate2 = str(qdate[0])
        if qdate1 in ['01','02','03']:
            return (str('Q1' + '-' + qdate2[2:]))
        if qdate1 in ['04','05','06']:
            return (str('Q2' + '-' + qdate2[2:]))
        if qdate1 in ['07','08','09']:
            return (str('Q3' + '-' + qdate2[2:]))
        if qdate1 in ['10','11','12']:
            return (str('Q4' + '-' + qdate2[2:]))
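
The same quarter labels can also be produced with pandas date parsing instead of string slicing. This is a sketch, not the notebook's method: quarter_labels is a hypothetical helper, and the two format strings are assumptions read off the sample rows (dd/mm/yy for campaign dates, yyyy-mm-dd for transaction dates).

```python
import pandas as pd

def quarter_labels(dates: pd.Series, fmt: str) -> pd.Series:
    """Return labels such as 'Q4-13' for a Series of date strings in format fmt."""
    parsed = pd.to_datetime(dates, format=fmt)
    return "Q" + parsed.dt.quarter.astype(str) + "-" + parsed.dt.strftime("%y")

campaign_dates = pd.Series(["21/10/13", "07/09/13"])
transaction_dates = pd.Series(["2012-01-02"])

print(quarter_labels(campaign_dates, "%d/%m/%y").tolist())
print(quarter_labels(transaction_dates, "%Y-%m-%d").tolist())
```

Parsing with an explicit format also raises an error on malformed dates, whereas date_q silently returns None for them.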

In [11]:
'''
Function to aggregate Customer Transaction Data
'''

def tran_summation(column):
    cust_tran_data_expt['tot_'+column] = pd.DataFrame(cust_tran_data_expt.groupby(['customer_id','item_id','coupon_id'])[column].transform('sum'))
    cust_tran_data_expt.drop([column],axis=1,inplace=True)


def tran_summation_1(column):
    cust_tran_data_expt['tot_'+column] = pd.DataFrame(cust_tran_data_expt.groupby(['customer_id','item_id','coupon_id','tran_date_q'])[column].transform('sum'))
    cust_tran_data_expt.drop([column],axis=1,inplace=True)
    
def tran_summation_2(column):
    cust_tran_data_expt['tot_'+column] = pd.DataFrame(cust_tran_data_expt.groupby(['customer_id','coupon_id'])[column].transform('sum'))
    cust_tran_data_expt.drop([column],axis=1,inplace=True)
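
The functions above compute group totals with transform('sum') and rely on a later drop_duplicates to collapse each group to one row. A sketch of the same totals in a single groupby step follows; the toy frame and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the merged transaction and coupon mapping table.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "item_id": [37, 37, 75, 37],
    "coupon_id": [105, 105, 107, 105],
    "quantity": [1, 2, 1, 3],
    "selling_price": [35.26, 70.52, 53.43, 105.78],
})
keys = ["customer_id", "item_id", "coupon_id"]
value_cols = ["quantity", "selling_price"]

# One row per key combination, with summed value columns renamed to tot_<column>.
totals = (
    toy.groupby(keys, as_index=False)[value_cols]
       .sum()
       .rename(columns={c: "tot_" + c for c in value_cols})
)
print(totals)
```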

In [12]:
'''
Function to label encode
'''

def label_encode(column):
    train_data_merge[column] = train_data_merge[column].astype('category').cat.codes
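
For reference, a small sketch of what cat.codes produces on a toy column (the values are made up): codes follow the sorted order of the categories, and missing values become -1.

```python
import pandas as pd

ages = pd.Series(["46-55", "26-35", "70+", "26-35", None])
codes = ages.astype("category").cat.codes

# Categories sort as '26-35' < '46-55' < '70+', so they get codes 0, 1, 2.
print(codes.tolist())  # [1, 0, 2, 0, -1]
```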

In [13]:
'''
Function to convert Categorical column to Integer using Coupon Redemption percentage
'''

def cat_percent(column):
    train_data_merge[column+'_redeem_sum'] = pd.DataFrame(train_data_merge.groupby([column])['redemption_status'].transform('sum'))
    train_data_merge[column+'_redeem_count'] = pd.DataFrame(train_data_merge.groupby([column])['redemption_status'].transform('count'))
    train_data_merge[column+'_redeem_percent'] = pd.DataFrame(train_data_merge[column+'_redeem_sum']*100/train_data_merge[column+'_redeem_count'])
    train_data_merge.drop(column,axis=1,inplace=True)
    train_data_merge.drop([column+'_redeem_sum'],axis=1,inplace=True)
    train_data_merge.drop([column+'_redeem_count'],axis=1,inplace=True)
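
Note that cat_percent computes each redemption percentage from all rows, including the row being encoded and any rows later held out for testing, so the encoded feature carries information about the target. One common way to limit this is out-of-fold encoding. The sketch below is an illustration under assumptions, not the notebook's method: oof_redeem_percent, the random fold assignment and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def oof_redeem_percent(df, column, target="redemption_status", n_splits=5, seed=7):
    """Out-of-fold redemption percentage for one categorical column."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_splits, size=len(df))  # random fold id per row
    encoded = pd.Series(np.nan, index=df.index, dtype="float64")
    global_rate = df[target].mean() * 100
    for k in range(n_splits):
        in_fold = fold == k
        # Rates are learned only from rows outside the current fold.
        rates = df.loc[~in_fold].groupby(column)[target].mean() * 100
        encoded[in_fold] = df.loc[in_fold, column].map(rates).to_numpy()
    # Categories not seen outside their own fold fall back to the global rate.
    return encoded.fillna(global_rate)

toy = pd.DataFrame({
    "coupon_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "redemption_status": [1, 0, 1, 0, 0, 0, 0, 1],
})
toy["coupon_id_redeem_percent"] = oof_redeem_percent(toy, "coupon_id", n_splits=2)
print(toy)
```

For a held-out test split, the rates would be computed on the training rows only and then mapped onto the test rows.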

# Baseline model with basic data preprocessing

In [14]:
cust_demo_data_expt = cust_demo_data.copy()
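# Assign the results back: chained inplace calls on a column may not update the DataFrame in newer pandas versions
cust_demo_data_expt['marital_status'] = cust_demo_data_expt['marital_status'].fillna('Unspecified')
cust_demo_data_expt['no_of_children'] = cust_demo_data_expt['no_of_children'].fillna(0).replace('3+',3)
# Map each age band to its lower bound
cust_demo_data_expt['age_range'] = cust_demo_data_expt['age_range'].replace(['18-25','26-35','36-45','46-55','56-70','70+'],[18,26,36,46,56,70])
cust_demo_data_expt['family_size'] = cust_demo_data_expt['family_size'].replace('5+',5)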

In [15]:
cust_tran_data_expt = cust_tran_data.copy()
cust_tran_data_expt = pd.merge(cust_tran_data_expt,coupon_data,how='inner',on='item_id')
cust_tran_data_expt.drop('date',axis=1,inplace=True)

In [16]:
for column in ['quantity','coupon_discount','other_discount','selling_price']:
    tran_summation(column)

cust_tran_data_expt.drop_duplicates(subset=['customer_id','item_id','coupon_id'], keep='first', inplace=True)

In [17]:
train_data_merge = pd.merge(train_data,cust_tran_data_expt,how='inner',on=['customer_id','coupon_id'])
train_data_merge = pd.merge(train_data_merge,cust_demo_data_expt,how='left',on='customer_id')
train_data_merge = pd.merge(train_data_merge,item_data,how='left',on='item_id')
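
It may help to see what the inner merge on customer_id and coupon_id does to the row count. In this sketch with made-up rows, an impression whose customer bought several items covered by the coupon is repeated once per item, and impressions with no matching transactions are dropped.

```python
import pandas as pd

# Toy stand-ins for train_data and the aggregated transaction table.
train_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "coupon_id": [105, 107, 105],
    "redemption_status": [0, 1, 0],
})
tran_toy = pd.DataFrame({
    "customer_id": [10, 10],
    "coupon_id": [105, 105],
    "item_id": [37, 38],
    "tot_quantity": [2, 1],
})

merged = pd.merge(train_toy, tran_toy, how="inner", on=["customer_id", "coupon_id"])

# How many merged rows each original impression produced.
rows_per_id = merged.groupby("id").size()
# Impressions that disappeared because no transaction matched.
dropped_ids = sorted(set(train_toy["id"]) - set(merged["id"]))

print(rows_per_id)
print("dropped ids:", dropped_ids)
```

The same check on the real tables would show how many of the original impressions survive into train_data_merge.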

In [18]:
train_data_merge.drop('marital_status',axis=1,inplace=True)
train_data_merge.fillna({'age_range':0,'rented':0,'family_size':0,'no_of_children':0,'income_bracket':0},inplace=True)
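# astype returns a new Series, so the result has to be assigned back to take effect
train_data_merge['family_size'] = pd.to_numeric(train_data_merge['family_size']).astype('int8')
train_data_merge['no_of_children'] = pd.to_numeric(train_data_merge['no_of_children']).astype('int8')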
train_data_merge = pd.get_dummies(train_data_merge, columns=['brand_type','category'], drop_first=False)

In [19]:
X = train_data_merge.drop('redemption_status', axis=1)
y = train_data_merge['redemption_status']

X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=7)

In [20]:
# defining the model
classifier = LogisticRegression(solver='lbfgs',max_iter=10000)

# fitting the model
classifier.fit(X_train,y_train)

# predicting test result with model
y_pred = classifier.predict(X_test)

# Creating Classification report for Logistic Regression Baseline model

print ("Classification Report for Baseline Logistic Regression")
print(classification_report(y_test,y_pred))

report = pd.DataFrame(classification_report(y_test,y_pred,output_dict=True)).transpose()

# The report rows are indexed by the class label as a string ('0', '1')
write_file(file_write_cnt,'No','No','Yes',len(X.columns),list(X.columns),'Logistic Regression',report['precision']['1'],report['recall']['1'],accuracy_score(y_test,y_pred),'Baseline Model')
file_write_cnt = file_write_cnt + 1

Classification Report for Baseline Logistic Regression
              precision    recall  f1-score   support

           0       0.91      1.00      0.95     20278
           1       0.67      0.07      0.13      2240

    accuracy                           0.90     22518
   macro avg       0.79      0.54      0.54     22518
weighted avg       0.88      0.90      0.87     22518
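
The report above scores hard 0/1 predictions at the default 0.5 threshold, while the problem statement asks for a redemption probability. A minimal sketch of scoring probabilities with ROC AUC follows. It runs on synthetic imbalanced data from make_classification rather than the competition data, and class_weight='balanced' is an assumption that the notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives, standing in for redemption_status.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=7
)

clf = LogisticRegression(solver="lbfgs", max_iter=1000, class_weight="balanced")
clf.fit(X_tr, y_tr)

# Probability of the positive class for each test row.
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print(f"ROC AUC: {auc:.3f}")
```

ROC AUC depends only on how the rows are ranked, so it is not affected by the choice of threshold.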



In [21]:
# defining the model
classifier = RandomForestClassifier(n_estimators=100)

# fitting the model
classifier.fit(X_train,y_train)

# predicting test result with model
y_pred = classifier.predict(X_test)

# Creating Classification report for RandomForest Classifier Baseline model

print ("Classification Report for Baseline RandomForest Classifier")
print(classification_report(y_test,y_pred))

report = pd.DataFrame(classification_report(y_test,y_pred,output_dict=True)).transpose()

# The report rows are indexed by the class label as a string ('0', '1')
write_file(file_write_cnt,'No','No','Yes',len(X.columns),list(X.columns),'Random Forest Classifier',report['precision']['1'],report['recall']['1'],accuracy_score(y_test,y_pred),'Baseline Model')
file_write_cnt += 1

Classification Report for Baseline RandomForest Classifier
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     20278
           1       0.97      0.82      0.89      2240

    accuracy                           0.98     22518
   macro avg       0.98      0.91      0.94     22518
weighted avg       0.98      0.98      0.98     22518



# One Hot Encoding

In [22]:
del train_data_merge

In [23]:
train_data_merge = pd.merge(train_data,cust_tran_data_expt,how='inner',on=['customer_id','coupon_id'])
train_data_merge = pd.merge(train_data_merge,cust_demo_data,how='left',on='customer_id')
train_data_merge = pd.merge(train_data_merge,item_data,how='left',on='item_id')

In [24]:
# Assign results back: chained inplace calls and a bare astype() do not reliably update the DataFrame
train_data_merge['no_of_children'] = pd.to_numeric(train_data_merge['no_of_children'].fillna(0).replace('3+',3)).astype('int')
train_data_merge.fillna({'marital_status':'Unspecified','rented':'Unspecified','family_size':'Unspecified','age_range':'Unspecified'},inplace=True)
train_data_merge['income_bracket'] = train_data_merge['income_bracket'].fillna(train_data_merge['income_bracket'].mean())
train_data_merge.drop(['id'],axis=1,inplace=True)
train_data_merge = pd.get_dummies(train_data_merge, columns=['age_range','marital_status','rented','family_size','brand_type','category'], drop_first=False)

In [25]:
X = train_data_merge.drop('redemption_status', axis=1)
y = train_data_merge['redemption_status']

X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=7)

In [26]:
# defining the model
classifier = RandomForestClassifier(n_estimators=100)

# fitting the model
classifier.fit(X_train,y_train)

# predicting test result with model
y_pred = classifier.predict(X_test)

# Creating Classification report for RandomForest Classifier with one-hot encoding

print ("Classification Report for RandomForest Classifier")
print(classification_report(y_test,y_pred))

report = pd.DataFrame(classification_report(y_test,y_pred,output_dict=True)).transpose()

# The report rows are indexed by the class label as a string ('0', '1')
write_file(file_write_cnt,'No','No','Yes',len(X.columns),list(X.columns),'Random Forest Classifier',report['precision']['1'],report['recall']['1'],accuracy_score(y_test,y_pred),'with OHE')
file_write_cnt += 1

Classification Report for RandomForest Classifier
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     20278
           1       0.96      0.79      0.87      2240

    accuracy                           0.98     22518
   macro avg       0.97      0.89      0.93     22518
weighted avg       0.98      0.98      0.97     22518



# Label Encoding

In [27]:
del train_data_merge

In [28]:
train_data_merge = pd.merge(train_data,cust_tran_data_expt,how='inner',on=['customer_id','coupon_id'])
train_data_merge = pd.merge(train_data_merge,cust_demo_data,how='left',on='customer_id')
train_data_merge = pd.merge(train_data_merge,item_data,how='left',on='item_id')

In [29]:
# Assign results back: chained inplace calls and a bare astype() do not reliably update the DataFrame
train_data_merge['no_of_children'] = pd.to_numeric(train_data_merge['no_of_children'].fillna(0).replace('3+',3)).astype('int')
train_data_merge.fillna({'marital_status':'Unspecified','rented':'Unspecified','family_size':'Unspecified','age_range':'Unspecified'},inplace=True)
train_data_merge['income_bracket'] = train_data_merge['income_bracket'].fillna(train_data_merge['income_bracket'].mean())
train_data_merge.drop(['id'],axis=1,inplace=True)

In [30]:
for column in ['marital_status','rented','family_size','age_range','brand_type','category']:
    label_encode(column)

In [31]:
X = train_data_merge.drop('redemption_status', axis=1)
y = train_data_merge['redemption_status']

X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=7)

In [32]:
# defining the model
classifier = RandomForestClassifier(n_estimators=100)

# fitting the model
classifier.fit(X_train,y_train)

# predicting test result with model
y_pred = classifier.predict(X_test)

# Creating Classification report for RandomForest Classifier with label encoding

print ("Classification Report for RandomForest Classifier")
print(classification_report(y_test,y_pred))

report = pd.DataFrame(classification_report(y_test,y_pred,output_dict=True)).transpose()

# The report rows are indexed by the class label as a string ('0', '1')
write_file(file_write_cnt,'No','No','Yes',len(X.columns),list(X.columns),'Random Forest Classifier',report['precision']['1'],report['recall']['1'],accuracy_score(y_test,y_pred),'with Label Encoding')
file_write_cnt += 1

Classification Report for RandomForest Classifier
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     20278
           1       0.96      0.82      0.88      2240

    accuracy                           0.98     22518
   macro avg       0.97      0.91      0.94     22518
weighted avg       0.98      0.98      0.98     22518



# Feature Engineering with Treatment of Campaign Date, Transaction Date and using Coupon Redemption percentage as a value to convert categorical columns to integers

In [33]:
del train_data_merge
del cust_tran_data_expt

In [34]:
campaign_data_expt = campaign_data.copy()
campaign_data_expt['start_date_q'] = campaign_data_expt['start_date'].map(lambda x: date_q(x,'/'))
campaign_data_expt['end_date_q'] = campaign_data_expt['end_date'].map(lambda x: date_q(x,'/'))
campaign_data_expt.drop(['start_date','end_date'],axis=1,inplace=True)

In [35]:
cust_tran_data_expt = cust_tran_data.copy()
cust_tran_data_expt = pd.merge(cust_tran_data_expt,coupon_data,how='inner',on='item_id')
cust_tran_data_expt['tran_date_q'] = cust_tran_data_expt['date'].map(lambda x: date_q(x,'-'))
cust_tran_data_expt.drop('date',axis=1,inplace=True)

for column in ['quantity','coupon_discount','other_discount','selling_price']:
    tran_summation_1(column)

cust_tran_data_expt.drop_duplicates(subset=['customer_id','item_id','coupon_id','tran_date_q'], keep='first', inplace=True)

In [36]:
train_data_merge = pd.merge(train_data,cust_tran_data_expt,how='inner',on=['customer_id','coupon_id'])
train_data_merge = pd.merge(train_data_merge,cust_demo_data,how='left',on='customer_id')
train_data_merge = pd.merge(train_data_merge,item_data,how='left',on='item_id')
train_data_merge = pd.merge(train_data_merge,campaign_data_expt,how='left',on='campaign_id')

In [37]:
# Assign the result back: chained inplace calls on a column may not update the DataFrame in newer pandas versions
train_data_merge['no_of_children'] = train_data_merge['no_of_children'].fillna(0).replace('3+',3)
train_data_merge.fillna({'marital_status':'Unspecified','rented':'Unspecified','family_size':'Unspecified','age_range':'Unspecified','income_bracket':'Unspecified'},inplace=True)
train_data_merge.drop('id',axis=1,inplace=True)

In [38]:
for column in ['customer_id','coupon_id','item_id','campaign_id','no_of_children','marital_status','rented','family_size','age_range','income_bracket','start_date_q','end_date_q','tran_date_q','brand','category']:
    cat_percent(column)

In [39]:
train_data_merge = pd.get_dummies(train_data_merge, columns=['campaign_type','brand_type'], drop_first=False)

In [40]:
X = train_data_merge.drop('redemption_status', axis=1)
y = train_data_merge['redemption_status']

In [41]:
feature_sel = [5,10,15,23]

rforc = RandomForestClassifier(n_estimators=100)

for i in feature_sel:
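    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    rfe = RFE(rforc, n_features_to_select=i)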
    rfe.fit(X, y)

# Selecting columns

    sel_cols = []
    for a, b, c in zip(rfe.support_, rfe.ranking_, X.columns):
        if b == 1:
            sel_cols.append(c)
    print ('Number of features selected are ::',i)
    print ('Columns Selected are ::',sel_cols)

# Creating new DataFrame with selected columns only as X

    X_sel = X[sel_cols]

# Split data in to train and test

    X_sel_train, X_sel_test, y_sel_train, y_sel_test = train_test_split(X_sel, y, train_size=0.7, random_state=7)
    
# Fit and Predict the model using selected number of features    
    grid={"n_estimators":[100]}
    rforc_cv = GridSearchCV(rforc,grid,cv=10)
    rforc_cv.fit(X_sel_train, y_sel_train)
    rforc_pred = rforc_cv.predict(X_sel_test)

# Classification Report    
    
    print(classification_report(y_sel_test,rforc_pred))
    
    report = pd.DataFrame(classification_report(y_sel_test,rforc_pred,output_dict=True)).transpose()

    # The report rows are indexed by the class label as a string ('0', '1')
    write_file(file_write_cnt,'No','No','Yes',len(X_sel.columns),list(X_sel.columns),'Random Forest Classifier',report['precision']['1'],report['recall']['1'],accuracy_score(y_sel_test,rforc_pred),'Treating Date and Label Encoding with RFE')
    file_write_cnt += 1

Number of features selected are :: 5
Columns Selected are :: ['customer_id_redeem_percent', 'coupon_id_redeem_percent', 'item_id_redeem_percent', 'campaign_id_redeem_percent', 'brand_redeem_percent']
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     27272
           1       0.97      0.97      0.97      3718

    accuracy                           0.99     30990
   macro avg       0.98      0.98      0.98     30990
weighted avg       0.99      0.99      0.99     30990

Number of features selected are :: 10
Columns Selected are :: ['tot_selling_price', 'customer_id_redeem_percent', 'coupon_id_redeem_percent', 'item_id_redeem_percent', 'campaign_id_redeem_percent', 'family_size_redeem_percent', 'age_range_redeem_percent', 'income_bracket_redeem_percent', 'end_date_q_redeem_percent', 'brand_redeem_percent']
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     27272
           1       0.9