# Problem Statement

A D2C startup develops products using cutting-edge technologies such as Web 3.0. Over the past few months, the company has launched multiple marketing campaigns, both offline and digital. As a result, users have started showing interest in the product on the website. Users who intend to buy one or more products are generally known as leads (potential customers).

Leads are captured in two ways: directly and indirectly.

Direct leads are captured via forms embedded in the website, while indirect leads are inferred from user activity on the platform, such as time spent on the website or the number of user sessions.

The marketing and sales team now wants to identify the leads who are most likely to buy the product, so that the sales team can use its bandwidth efficiently by targeting these leads and increase sales in a shorter span of time.

Can you identify the potential leads for a D2C startup?

# Relevant Libraries

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

# Data Inspection

In [2]:
train = pd.read_csv("train_wn75k28.csv")
test = pd.read_csv("test_Wf7sxXF.csv")

In [3]:
train.shape,test.shape

((39161, 19), (13184, 18))

In [4]:
train.dtypes

id                        int64
created_at               object
campaign_var_1            int64
campaign_var_2            int64
products_purchased      float64
signup_date              object
user_activity_var_1       int64
user_activity_var_2       int64
user_activity_var_3       int64
user_activity_var_4       int64
user_activity_var_5       int64
user_activity_var_6       int64
user_activity_var_7       int64
user_activity_var_8       int64
user_activity_var_9       int64
user_activity_var_10      int64
user_activity_var_11      int64
user_activity_var_12      int64
buy                       int64
dtype: object

In [5]:
test.dtypes

id                        int64
created_at               object
campaign_var_1            int64
campaign_var_2            int64
products_purchased      float64
signup_date              object
user_activity_var_1       int64
user_activity_var_2       int64
user_activity_var_3       int64
user_activity_var_4       int64
user_activity_var_5       int64
user_activity_var_6       int64
user_activity_var_7       int64
user_activity_var_8       int64
user_activity_var_9       int64
user_activity_var_10      int64
user_activity_var_11      int64
user_activity_var_12      int64
dtype: object

# Data Cleaning

Why is missing-value treatment required? Missing data in the training set can reduce the power or fit of a model, or can bias it, because the behavior of the affected variables and their relationships with other variables are not analysed correctly. This can lead to wrong predictions.
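As a minimal sketch of one way to treat missing values without losing the fact that they were missing, the snippet below records an indicator column before filling. The toy frame and the `_missing` column names are assumptions made for illustration; they are not part of the competition data or of the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "products_purchased": [2.0, np.nan, 1.0, np.nan],
    "signup_date": ["2020-09-24", None, "2021-01-05", None],
})

# Share of missing values per column, useful before choosing a treatment
missing_share = toy.isnull().mean()

# Record missingness first so the information survives the fill
# (the "_missing" suffix is a hypothetical naming choice)
for col in ["products_purchased", "signup_date"]:
    toy[col + "_missing"] = toy[col].isnull().astype(int)

# Fill the numeric column with 0, as the notebook does
toy["products_purchased"] = toy["products_purchased"].fillna(0)
```

With roughly half of `products_purchased` and `signup_date` missing in the training set, an indicator like this may let a model distinguish "no purchase recorded" from a genuine zero.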

In [6]:
train.isnull().sum()

id                          0
created_at                  0
campaign_var_1              0
campaign_var_2              0
products_purchased      20911
signup_date             15113
user_activity_var_1         0
user_activity_var_2         0
user_activity_var_3         0
user_activity_var_4         0
user_activity_var_5         0
user_activity_var_6         0
user_activity_var_7         0
user_activity_var_8         0
user_activity_var_9         0
user_activity_var_10        0
user_activity_var_11        0
user_activity_var_12        0
buy                         0
dtype: int64

In [7]:
test.isnull().sum()

id                         0
created_at                 0
campaign_var_1             0
campaign_var_2             0
products_purchased      8136
signup_date             6649
user_activity_var_1        0
user_activity_var_2        0
user_activity_var_3        0
user_activity_var_4        0
user_activity_var_5        0
user_activity_var_6        0
user_activity_var_7        0
user_activity_var_8        0
user_activity_var_9        0
user_activity_var_10       0
user_activity_var_11       0
user_activity_var_12       0
dtype: int64

In [8]:
train.fillna(value=0, inplace=True)

In [9]:
test.fillna(value=0, inplace=True)

# Encoding

In [10]:
train["created_at"] = pd.to_datetime(train["created_at"]).dt.strftime("%Y%m%d")

In [11]:
test["created_at"] = pd.to_datetime(test["created_at"]).dt.strftime("%Y%m%d")

In [12]:
train['created_at']  = train['created_at'].astype('int')

In [13]:
test['created_at']  = test['created_at'].astype('int')

In [14]:
train["signup_date"] = pd.to_datetime(train["signup_date"]).dt.strftime("%Y%m%d")

In [15]:
test["signup_date"] = pd.to_datetime(test["signup_date"]).dt.strftime("%Y%m%d")

In [16]:
train['signup_date']  = train['signup_date'].astype('int')

In [17]:
test['signup_date']  = test['signup_date'].astype('int')
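The cells above turn each date into a `YYYYMMDD` integer after missing dates were filled with 0; how that 0 is parsed by `pd.to_datetime` may depend on the pandas version. The following is a sketch of an alternative that handles missing dates explicitly, on a toy frame. The column names `created_at_int`, `signup_date_int` and `days_since_signup`, and the sentinel values 0 and -1, are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame with one missing signup date (values are made up for illustration)
toy = pd.DataFrame({
    "created_at": ["2021-01-01", "2021-12-31"],
    "signup_date": ["2020-09-24", None],
})

created = pd.to_datetime(toy["created_at"])
# errors="coerce" turns unparseable or missing entries into NaT
signup = pd.to_datetime(toy["signup_date"], errors="coerce")

# Same YYYYMMDD integer encoding as in the notebook
toy["created_at_int"] = created.dt.strftime("%Y%m%d").astype(int)

# strftime yields a missing value for NaT, so fill with an explicit sentinel (0)
toy["signup_date_int"] = signup.dt.strftime("%Y%m%d").fillna("0").astype(int)

# A date difference is often easier for a model to use than a raw date;
# -1 marks leads with no signup date (hypothetical sentinel)
toy["days_since_signup"] = (created - signup).dt.days.fillna(-1).astype(int)
```

A tree-based model can split on `days_since_signup` directly, whereas `YYYYMMDD` integers have uneven gaps at month and year boundaries.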

In [18]:
train.dtypes

id                        int64
created_at                int32
campaign_var_1            int64
campaign_var_2            int64
products_purchased      float64
signup_date               int32
user_activity_var_1       int64
user_activity_var_2       int64
user_activity_var_3       int64
user_activity_var_4       int64
user_activity_var_5       int64
user_activity_var_6       int64
user_activity_var_7       int64
user_activity_var_8       int64
user_activity_var_9       int64
user_activity_var_10      int64
user_activity_var_11      int64
user_activity_var_12      int64
buy                       int64
dtype: object

In [19]:
X=train.iloc[:,:18].values

In [20]:
train.iloc[:,:18].values

array([[1.0000000e+00, 2.0210101e+07, 1.0000000e+00, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [2.0000000e+00, 2.0210101e+07, 2.0000000e+00, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [3.0000000e+00, 2.0210101e+07, 9.0000000e+00, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       ...,
       [3.9159000e+04, 2.0211231e+07, 8.0000000e+00, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00],
       [3.9160000e+04, 2.0211231e+07, 7.0000000e+00, ..., 0.0000000e+00,
        1.0000000e+00, 0.0000000e+00],
       [3.9161000e+04, 2.0211231e+07, 2.0000000e+00, ..., 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00]])

In [21]:
y=train.iloc[:,-1].values

In [22]:
train.iloc[:,-1].values

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [23]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4,random_state=20)
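The feature matrix above is selected by position and therefore includes the `id` column. As a sketch under assumptions, the snippet below selects features by name instead and uses the `stratify` argument of `train_test_split` so both splits keep the same share of buyers. The toy data and the variable names `X_toy`, `y_toy`, `X_tr`, `X_va`, `y_tr`, `y_va` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 200 leads, 10 of which bought (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id": np.arange(1, 201),
    "campaign_var_1": rng.integers(1, 17, size=200),
    "user_activity_var_1": rng.integers(0, 2, size=200),
    "buy": np.r_[np.ones(10, dtype=int), np.zeros(190, dtype=int)],
})

# Select features by name: drop the row identifier and the target
X_toy = toy.drop(columns=["id", "buy"])
y_toy = toy["buy"]

# stratify keeps the class ratio identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=20, stratify=y_toy
)
```

Whether dropping `id` helps on the real data is an open question; a sequential identifier can correlate with `created_at` and may not generalize to new leads.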

In [24]:
from sklearn.ensemble import AdaBoostClassifier
# Create the AdaBoost classifier object
abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=1)
# Train the AdaBoost classifier
model = abc.fit(X_train, y_train)

# Predict the response for the held-out split
y_pred = model.predict(X_test)

In [25]:
from sklearn.metrics import f1_score

In [26]:
f1_score(y_test, y_pred, average='micro')

0.9753590807532716
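For a single-label problem such as this one, `average='micro'` makes the F1 score equal to plain accuracy, so a high value can hide poor performance on the rare `buy = 1` class. The sketch below illustrates this on made-up labels; `y_true` and `y_hat` are hypothetical and do not come from the notebook's data.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 95 non-buyers and 5 buyers
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts "no purchase"
y_hat = [0] * 100

micro = f1_score(y_true, y_hat, average="micro")
acc = accuracy_score(y_true, y_hat)
# F1 of the positive class only; zero_division=0 avoids a warning
# when no positive predictions are made
binary = f1_score(y_true, y_hat, average="binary", zero_division=0)

print(micro, acc, binary)
```

Here micro-F1 and accuracy agree while the positive-class F1 is zero, so reporting `average='binary'` alongside the micro score may give a clearer picture of how well buyers are identified.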

In [27]:
submission = pd.read_csv('sample_submission_2zvVjBu.csv')
# Pass a NumPy array, matching the format the model was fitted on
final_predictions = abc.predict(test.values)
submission['buy'] = final_predictions
# Predictions are class labels (0 or 1); map any negative value to 0 as a safeguard
submission['buy'] = submission['buy'].apply(lambda x: 0 if x<0 else x)
submission.to_csv('submission01.csv', index=False)