## Predict Customer Behavior

This is a binary classification problem: predict whether a customer converts.

In [35]:
# Import libraries
import numpy as np
import pandas as pd

### Import Data

The dataset comes from an online platform and records customers' historical transactions. Each row contains the customer's age, the total number of pages viewed, and whether the customer is new or returning. The output variable indicates whether the customer bought the product online.

In [36]:
#read the data
df = pd.read_csv('online_sales.csv')

In [37]:
df.shape

(316200, 4)

In [38]:
df.head()

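|   | age | new_user | total_pages_visited | converted |
|---|-----|----------|---------------------|-----------|
| 0 | 25 | 1 | 1 | 0 |
| 1 | 23 | 1 | 5 | 0 |
| 2 | 28 | 1 | 4 | 0 |
| 3 | 39 | 1 | 5 | 0 |
| 4 | 30 | 1 | 6 | 0 |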


In [39]:
#target class frequency
df.converted.value_counts()

0    306000
1     10200
Name: converted, dtype: int64

The target class is heavily skewed: only 10,200 of 316,200 customers (about 3.2%) converted. Imbalance of this size usually needs to be handled with undersampling, oversampling, or class weighting.
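Class weighting, which the `class_weight='balanced'` argument used later in this notebook relies on, can be worked out by hand from these counts. The sketch below applies the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the counts shown above; the variable names are illustrative only.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 306000, 1: 10200}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of rows that belong to the positive (converted) class.
positive_rate = counts[1] / n_samples

# "Balanced" heuristic: n_samples / (n_classes * count_of_class).
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(f"positive rate: {positive_rate:.4f}")
print(f"class weights: {weights}")
```

Under this heuristic each converted row counts about 30 times as much as a non-converted row during training, which matches the 30:1 ratio between the two classes.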

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316200 entries, 0 to 316199
Data columns (total 4 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   age                  316200 non-null  int64
 1   new_user             316200 non-null  int64
 2   total_pages_visited  316200 non-null  int64
 3   converted            316200 non-null  int64
dtypes: int64(4)
memory usage: 9.6 MB


In [41]:
df.describe()

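|       | age | new_user | total_pages_visited | converted |
|-------|-----|----------|---------------------|-----------|
| count | 316200.0 | 316200.0 | 316200.0 | 316200.0 |
| mean | 30.569858 | 0.685465 | 4.872966 | 0.032258 |
| std | 8.271802 | 0.464331 | 3.341104 | 0.176685 |
| min | 17.0 | 0.0 | 1.0 | 0.0 |
| 25% | 24.0 | 0.0 | 2.0 | 0.0 |
| 50% | 30.0 | 1.0 | 4.0 | 0.0 |
| 75% | 36.0 | 1.0 | 7.0 | 0.0 |
| max | 123.0 | 1.0 | 29.0 | 1.0 |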


### Preparing Data For Modeling

In [42]:
input_columns = [column for column in df.columns if column != 'converted']
output_column = 'converted'
print(input_columns)
print(output_column)

['age', 'new_user', 'total_pages_visited']
converted


In [43]:
# Input features as a NumPy array
X = df.loc[:, input_columns].values
# Output labels
y = df.loc[:, output_column]
# Shape of the input and output data
print(X.shape, y.shape)

(316200, 3) (316200,)


**Ideally, thorough data exploration and feature engineering come before model training. Because the goal here is to deploy the ML app, the focus is on containerizing the app rather than on improving the model's accuracy.**

### Modeling: Logistic Regression

We will train a simple logistic regression model, make predictions on the test data, and later export the model for deployment.
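As a rough illustration of what the model computes, logistic regression passes a weighted sum of the features through a sigmoid to get a probability, then thresholds it to produce a label. The coefficients and intercept below are hypothetical values chosen for the example, not the fitted model's parameters.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for illustration only; not the fitted model's values.
weights = {"age": -0.07, "new_user": -1.7, "total_pages_visited": 0.76}
intercept = -6.0


def predict_proba(customer):
    """Return the modeled probability of conversion for one customer."""
    score = intercept + sum(weights[name] * customer[name] for name in weights)
    return sigmoid(score)


customer = {"age": 32, "new_user": 1, "total_pages_visited": 1}
probability = predict_proba(customer)
label = int(probability >= 0.5)  # decision threshold of 0.5
print(probability, label)
```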

In [44]:
#import model specific libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [45]:
# Split the data into training and test sets (70/30 ratio), stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=555, stratify=y)

In [46]:
# Validate the shapes of the training and test sets
print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

(221340, 3)
(221340,)
(94860, 3)
(94860,)


In [47]:
# Count the positive examples in the training and test sets
print(np.sum(y_train))
print(np.sum(y_test))

7140
3060
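Because the split was stratified on `y`, both subsets should keep the same share of converted customers as the full dataset. A quick check using only the counts printed above:

```python
# Counts taken from the cell outputs above.
train_rows, train_positives = 221340, 7140
test_rows, test_positives = 94860, 3060

train_rate = train_positives / train_rows
test_rate = test_positives / test_rows
test_fraction = test_rows / (train_rows + test_rows)

print(f"train positive rate: {train_rate:.6f}")
print(f"test positive rate:  {test_rate:.6f}")
print(f"test fraction:       {test_fraction:.2f}")
```

Both rates come out at about 3.2%, the same as the `converted` mean reported by `df.describe()`.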


## Train the Logistic Regression Model

In [48]:
# Fit the logistic regression model on the training set; class_weight='balanced' offsets the class skew
logreg = LogisticRegression(class_weight='balanced').fit(X_train,y_train)

In [49]:
logreg.score(X_train, y_train)

0.9370470768952742

In [50]:
#validate the model performance on unseen data
logreg.score(X_test, y_test)

0.9369175627240144
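Accuracy should be read with care on data this imbalanced. As a point of comparison, the sketch below computes the accuracy of a trivial baseline that always predicts "not converted", using the test-set counts printed earlier.

```python
# Test-set size and positive count from the stratified split above.
test_rows = 94860
test_negatives = test_rows - 3060

# Accuracy of a trivial model that always predicts "not converted".
baseline_accuracy = test_negatives / test_rows

# Value returned by logreg.score(X_test, y_test) above.
model_accuracy = 0.9369175627240144

print(f"majority-class baseline: {baseline_accuracy:.4f}")
print(f"logistic regression:     {model_accuracy:.4f}")
```

The baseline scores higher on accuracy while never identifying a single converter, so per-class precision and recall are more informative measures for this problem.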

In [51]:
# Make predictions on unseen data
predictions = logreg.predict(X_test)

## Results

In [52]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, predictions, target_names=["Non Converted", "Converted"]))

               precision    recall  f1-score   support

Non Converted       1.00      0.94      0.97     91800
    Converted       0.33      0.92      0.49      3060

     accuracy                           0.94     94860
    macro avg       0.66      0.93      0.73     94860
 weighted avg       0.98      0.94      0.95     94860
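The report shows high recall (0.92) but low precision (0.33) for the converted class. To make that trade-off concrete, the sketch below backs out approximate confusion-matrix counts from the reported figures. Because the report rounds to two decimals, the counts are estimates, not the exact values.

```python
# Figures read from the classification report above (rounded to two decimals).
precision, recall = 0.33, 0.92
support_pos, support_neg = 3060, 91800

true_pos = recall * support_pos        # converters correctly flagged
predicted_pos = true_pos / precision   # everyone flagged as a converter
false_pos = predicted_pos - true_pos   # non-converters flagged by mistake
false_neg = support_pos - true_pos     # converters the model missed
true_neg = support_neg - false_pos     # non-converters correctly passed over

accuracy = (true_pos + true_neg) / (support_pos + support_neg)

print(f"TP~{true_pos:.0f}  FP~{false_pos:.0f}  FN~{false_neg:.0f}  TN~{true_neg:.0f}")
print(f"implied accuracy ~{accuracy:.3f}")
```

In other words, roughly two out of three customers flagged as converters did not convert, which is the price paid for catching most of the actual converters. The implied accuracy agrees with the reported 0.94.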



In [53]:
logreg

## Export Model

In [54]:
# Serialize the trained model to a pickle file
import pickle

with open("logreg.pkl", "wb") as pickle_out:
    pickle.dump(logreg, pickle_out)

In [55]:
# Load the model back; the context manager closes the file afterwards
with open("logreg.pkl", "rb") as pickle_in:
    model = pickle.load(pickle_in)

In [56]:
model
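A simple way to gain confidence in the export is to check that the restored model behaves identically to the original. The sketch below does this on small synthetic stand-in data (an assumption, since `online_sales.csv` is not bundled here) and serializes to bytes in memory instead of to a file.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 2] > 0).astype(int)

original = LogisticRegression(class_weight="balanced").fit(X_demo, y_demo)

# Serialize to bytes and back; pickle.dump/pickle.load do the same with files.
restored = pickle.loads(pickle.dumps(original))

same_predictions = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
same_coefficients = np.allclose(original.coef_, restored.coef_)
print(same_predictions, same_coefficients)
```

Two practical cautions apply to pickled models: only unpickle files from a trusted source, and load the model with the same scikit-learn version that was used to save it where possible.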

In [57]:
# Predict for a single customer: [age, new_user, total_pages_visited]
model.predict([[32, 1, 1]])[0]

0

In [58]:
#Group prediction (multiple customers)
df_test = pd.read_csv('test_data.csv')
predictions = model.predict(df_test)

print(list(predictions))

[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]




The reloaded model returns predictions for a single customer as well as for a batch of customers.
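One caveat for batch prediction: the model was fitted on a NumPy array, so it matches features by position, not by column name. The sketch below shows one way to force the training column order before predicting. `prepare_features` is a hypothetical helper, and the example batch is made up, since the actual layout of `test_data.csv` is not shown.

```python
import pandas as pd

# Column order the model was trained on (from input_columns above).
input_columns = ["age", "new_user", "total_pages_visited"]


def prepare_features(frame, columns):
    """Return a 2-D array with the given columns in the given order (hypothetical helper)."""
    missing = [name for name in columns if name not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame.loc[:, columns].values


# Hypothetical batch whose columns arrive in a different order.
batch = pd.DataFrame({
    "total_pages_visited": [1, 15],
    "age": [32, 24],
    "new_user": [1, 0],
})

features = prepare_features(batch, input_columns)
print(features.tolist())
```

With such a helper, the batch call would become `model.predict(prepare_features(df_test, input_columns))`. Passing a plain array may also avoid the feature-name warning that recent scikit-learn versions can raise when a model fitted on an array receives a DataFrame.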

Now we can move on to the next step of building a Flask app to run this model.