# Chapter 8 - Classification Algorithm–Based Recommender Systems

<div style="text-align:center;">
    <img src='images/cluster.jpg' width='400'>
</div>

A classification algorithm-based recommender system is also known as the *buying propensity model*. The goal here is to predict the propensity of customers to buy a product using historical behavior and purchases.

The more accurately you predict future purchases, the better the recommendations and, in turn, the sales. This kind of recommender system is often used to ensure 100% conversion from users who are likely to purchase with a certain probability. Promotions are offered on those products, enticing users to make a purchase.
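To make the idea of a propensity score concrete, here is a minimal sketch on made-up data. The single feature (`past_purchases`) and the labels (`bought`) are hypothetical and are not taken from the chapter's dataset. The sketch only illustrates how a classifier's predicted probability can serve as the buying propensity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature (number of past purchases in a category)
# and a label (1 = bought the promoted product, 0 = did not).
past_purchases = np.array([[0], [1], [2], [3], [4], [5]])
bought = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(past_purchases, bought)

# predict_proba returns one column per class; column 1 is P(bought = 1),
# which plays the role of the buying propensity score.
propensity = model.predict_proba(np.array([[0], [5]]))[:, 1]
print(propensity)
```

Customers can then be ranked by this score, and promotions targeted at those above a chosen threshold.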

In [1]:
# Importing libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import Image
import os

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from imblearn.combine import SMOTETomek

from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

from xgboost import XGBClassifier

import warnings
warnings.filterwarnings("ignore")

### Creating the Dataset

In [2]:
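# Read the Excel workbook: the first sheet holds the transaction records,
# and the 'customer' and 'product' sheets hold the lookup tables

record_df = pd.read_excel("data/Rec_sys_data.xlsx")
customer_df = pd.read_excel("data/Rec_sys_data.xlsx", sheet_name = 'customer')
prod_df = pd.read_excel("data/Rec_sys_data.xlsx", sheet_name = 'product')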

In [3]:
record_df.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [4]:
customer_df.head()

Unnamed: 0,CustomerID,Gender,Age,Income,Zipcode,Customer Segment
0,13089,male,53,High,8625,Small Business
1,15810,female,22,Low,87797,Small Business
2,15556,female,29,High,29257,Corporate
3,13137,male,29,Medium,97818,Middle class
4,16241,male,36,Low,79200,Small Business


In [5]:
prod_df.head()

Unnamed: 0,StockCode,Product Name,Description,Category,Brand,Unit Price
0,22629,Ganma Superheroes Ordinary Life Case For Samsu...,"New unique design, great gift.High quality pla...",Cell Phones|Cellphone Accessories|Cases & Prot...,Ganma,13.99
1,21238,Eye Buy Express Prescription Glasses Mens Wome...,Rounded rectangular cat-eye reading glasses. T...,Health|Home Health Care|Daily Living Aids,Eye Buy Express,19.22
2,22181,MightySkins Skin Decal Wrap Compatible with Ni...,Each Nintendo 2DS kit is printed with super-hi...,Video Games|Video Game Accessories|Accessories...,Mightyskins,14.99
3,84879,Mediven Sheer and Soft 15-20 mmHg Thigh w/ Lac...,The sheerest compression stocking in its class...,Health|Medicine Cabinet|Braces & Supports,Medi,62.38
4,84836,Stupell Industries Chevron Initial Wall D cor,Features: -Made in the USA. -Sawtooth hanger o...,Home Improvement|Paint|Wall Decals|All Wall De...,Stupell Industries,35.99


### Preprocessing

In [6]:
# Group by StockCode and CustomerID and sum the Quantity
group = pd.DataFrame(
    record_df.groupby(['StockCode', 'CustomerID']).Quantity.sum()
)
print(group.shape)
group.head()

(192758, 1)


StockCode,CustomerID,Quantity
10002,12451,12
10002,12510,24
10002,12583,48
10002,12637,12
10002,12673,1


In [7]:
#Check for null values
print(record_df.isnull().sum())
print("--------------")
print(customer_df.isnull().sum())

InvoiceNo       0
StockCode       0
Quantity        0
InvoiceDate     0
DeliveryDate    0
Discount%       0
ShipMode        0
ShippingCost    0
CustomerID      0
dtype: int64
--------------
CustomerID          0
Gender              0
Age                 0
Income              0
Zipcode             0
Customer Segment    0
dtype: int64


In [8]:
# Load the StockCode and CustomerID columns into separate variables d1 and d2
d2 = customer_df['CustomerID']
d1 = record_df["StockCode"]

# Take a random sample of 900 values from each column
row = d1.sample(n= 900)
row1 = d2.sample(n=900)

# Cross product of the two samples: every sampled StockCode is paired
# with every sampled CustomerID (900 x 900 = 810,000 rows)
index = pd.MultiIndex.from_product([row, row1])
a = pd.DataFrame(index = index).reset_index()

a.head()

Unnamed: 0,StockCode,CustomerID
0,85049F,13829
1,85049F,12456
2,85049F,14507
3,85049F,13643
4,85049F,17600
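Because `record_df["StockCode"]` contains one row per invoice line, the same code can be drawn more than once, which produces repeated pairs in the cross product. The following is a sketch of one possible variation, not the chapter's method: it samples from distinct values and fixes `random_state` so the sample is repeatable. The toy series and the seed value 42 are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy columns standing in for record_df["StockCode"]
# and customer_df["CustomerID"].
stock_codes = pd.Series(["A", "A", "A", "B", "C", "C"], name="StockCode")
customer_ids = pd.Series([1, 2, 3, 4], name="CustomerID")

# Sample from the distinct values so no code or customer is drawn twice;
# random_state makes the draw reproducible.
code_sample = stock_codes.drop_duplicates().sample(n=2, random_state=42)
cust_sample = customer_ids.drop_duplicates().sample(n=3, random_state=42)

# Cross product: every sampled code paired with every sampled customer.
pairs = pd.MultiIndex.from_product(
    [code_sample, cust_sample], names=["StockCode", "CustomerID"]
).to_frame(index=False)

print(pairs)
```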


In [9]:
# Right merge: keep every sampled (StockCode, CustomerID) pair;
# pairs with no purchase history get NaN in Quantity
data = pd.merge(group,a, on = ['CustomerID', 'StockCode'], how = 'right')

data.head()

Unnamed: 0,CustomerID,StockCode,Quantity
0,13829,85049F,
1,12456,85049F,
2,14507,85049F,1.0
3,13643,85049F,
4,17600,85049F,


As you can see, Quantity is null for every sampled customer-product pair that has no recorded purchase. Let’s count these nulls.

In [10]:
# Count the null values in the Quantity column
print(data['Quantity'].isnull().sum())

# Check the shape of the data (number of rows and columns)
print(data.shape)

777119
(810000, 3)
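The large share of nulls follows directly from the right merge: any pair in the cross product that never appears in the purchase history has no Quantity to bring over. A minimal sketch on a made-up three-row purchase log (the values are hypothetical) shows the same mechanics end to end, including the zero-fill applied in the next cell.

```python
import pandas as pd

# Hypothetical toy purchase log (not the chapter's data).
records = pd.DataFrame({
    "StockCode": ["A", "A", "B"],
    "CustomerID": [1, 1, 2],
    "Quantity": [2, 3, 5],
})

# Total quantity per (StockCode, CustomerID) pair.
purchased = records.groupby(
    ["StockCode", "CustomerID"], as_index=False
)["Quantity"].sum()

# All candidate pairs: the cross product of products and customers.
pairs = pd.MultiIndex.from_product(
    [["A", "B"], [1, 2]], names=["StockCode", "CustomerID"]
).to_frame(index=False)

# The right merge keeps every candidate pair; unseen pairs get NaN.
toy = pd.merge(purchased, pairs, on=["StockCode", "CustomerID"], how="right")
n_missing = int(toy["Quantity"].isnull().sum())

# Treat "never purchased" as a quantity of 0.
toy["Quantity"] = toy["Quantity"].fillna(0).astype(int)
print(n_missing)
print(toy)
```

Here two of the four candidate pairs were never purchased, so they end up with a Quantity of 0.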


In [11]:
# replacing nan values with 0
data['Quantity'] = data['Quantity'].replace(np.nan, 0).astype(int)

# Check all unique value of quantity column
print(data['Quantity'].unique())

[   0    1   12    6    4    3   22    7   10    5  288   48   24    2
   18   36   96    8    9  120   66   60   92   30   15   16   40   32
   20  100   50  110   90   80  300   13   25  500   21  180  200   68
   84   28   17   72   38   14   64   23  103   52   19   83  104   87
   55   44  156  108  480 1848  102   51  144  240   26   42  216   11
  112  352  122  128  896   35  294   53  448  160   29   49   47  601
  254 3860  476   45  576   70  116  800  140  320   33  130   88  400
   37  192  384  170   75  125  124   76   54   67  126   31  504  372
  408  150  315   41  142   62   56  600   85   34   27  132   61  312
   82  410  248  208 1632  357  101  370  168  138  432   74  131  792
   43   39 2400  626 1080  298  269   46  328   58  880  176  532  360
  276  135  244  348  114  232  204   95  264  924  390   91  768  310
  336  191 1800  224  720 1824  344  295  118 1100 1248   98 1000   59
   57   86  252  440   78  380 1920 1340 1200  700 3900  460   94  220
 4000 

In [12]:
# Drop the Product Name and Description columns
product_data = prod_df.drop(['Product Name', 'Description'], axis = 1)

product_data.head()

Unnamed: 0,StockCode,Category,Brand,Unit Price
0,22629,Cell Phones|Cellphone Accessories|Cases & Prot...,Ganma,13.99
1,21238,Health|Home Health Care|Daily Living Aids,Eye Buy Express,19.22
2,22181,Video Games|Video Game Accessories|Accessories...,Mightyskins,14.99
3,84879,Health|Medicine Cabinet|Braces & Supports,Medi,62.38
4,84836,Home Improvement|Paint|Wall Decals|All Wall De...,Stupell Industries,35.99


In [13]:
# Extract the first word of the Category column
cate = product_data['Category'].str.extract(r"(\w+)", expand=True)

# Join the extracted column with the original dataset
df2 = product_data.join(cate, lsuffix="_left")
df2.drop(['Category'], axis = 1, inplace = True)

# Rename the extracted column to Category
df2 = df2.rename(columns = {0: 'Category'})
print(df2.shape)

df2.head()

(29912, 4)


Unnamed: 0,StockCode,Brand,Unit Price,Category
0,22629,Ganma,13.99,Cell
1,21238,Eye Buy Express,19.22,Health
2,22181,Mightyskins,14.99,Video
3,84879,Medi,62.38,Health
4,84836,Stupell Industries,35.99,Home
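Note that the pattern `(\w+)` captures only the first word, which is why `Cell Phones` becomes `Cell` and `Video Games` becomes `Video` in the output above. As a point of comparison, the sketch below contrasts this with splitting on the `|` separator, which keeps the full top-level category. The sample strings are shortened, made-up stand-ins for the real Category values, and the split is an alternative shown for illustration, not what the chapter uses.

```python
import pandas as pd

# Hypothetical, shortened category paths in the same "A|B|C" layout.
categories = pd.Series([
    "Cell Phones|Cellphone Accessories|Cases",
    "Health|Home Health Care",
    "Video Games|Accessories",
])

# The chapter's approach: first run of word characters only.
first_word = categories.str.extract(r"(\w+)", expand=False)

# Alternative: everything before the first '|' separator.
top_level = categories.str.split("|").str[0]

print(first_word.tolist())
print(top_level.tolist())
```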


In [14]:
# Check for null values (they are dropped in the next cell)
df2.isnull().sum()

StockCode     25954
Brand          1129
Unit Price      118
Category        792
dtype: int64

In [15]:
df2.dropna(inplace = True)
df2.isnull().sum()

StockCode     0
Brand         0
Unit Price    0
Category      0
dtype: int64

In [16]:
## save to csv file
df2.to_csv("data/Products.csv")

In [17]:
# Load product dataset
product = pd.read_csv("data/Products.csv")

product.head()
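One detail of this save-and-reload step is worth noting: `DataFrame.to_csv` writes the index as an unnamed first column by default, and `read_csv` reads it back as `Unnamed: 0`. The sketch below demonstrates this on a made-up two-row frame, using an in-memory buffer in place of `data/Products.csv`, and shows that passing `index=False` avoids the extra column.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for df2.
df = pd.DataFrame({
    "StockCode": ["22629", "21238"],
    "Brand": ["Ganma", "Eye Buy Express"],
})

# Default: the index is written out and comes back as 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

The chapter's code keeps the default and drops `Unnamed: 0` after merging, which works equally well.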

In [18]:
## Merge data and product dataset
final_data = pd.merge(data, product, on= 'StockCode')

# create final dataset by merging customer & final data
final_data1 = pd.merge(customer_df, final_data, on = 'CustomerID')

# Drop Unnamed and zipcode column
final_data1.drop(['Unnamed: 0', 'Zipcode'], axis = 1, inplace = True)

final_data1.head()

Unnamed: 0,CustomerID,Gender,Age,Income,Customer Segment,StockCode,Quantity,Brand,Unit Price,Category
0,15810,female,22,Low,Small Business,85049F,0,Style & Apply,69.95,Home
1,15810,female,22,Low,Small Business,85186A,0,UNOTUX,41.99,Clothing
2,15810,female,22,Low,Small Business,15056N,0,PeanutsÃƒâ€šÃ‚,9999.99,Home
3,15810,female,22,Low,Small Business,15056N,0,PeanutsÃƒâ€šÃ‚,9999.99,Home
4,15810,female,22,Low,Small Business,84029G,0,Ekena Milwork,136.0,Home


In [19]:
print(final_data1.shape)

# Check for null values in each column
final_data1.isnull().sum()

(65700, 10)


CustomerID          0
Gender              0
Age                 0
Income              0
Customer Segment    0
StockCode           0
Quantity            0
Brand               0
Unit Price          0
Category            0
dtype: int64

In [20]:
# Check the unique values in each categorical column (and in Quantity)
print(final_data1['Category'].unique())
print('------------\n')
print(final_data1['Income'].unique())
print('------------\n')
print(final_data1['Brand'].unique())
print('------------\n')
print(final_data1['Customer Segment'].unique())
print('------------\n')
print(final_data1['Gender'].unique())
print('------------\n')
print(final_data1['Quantity'].unique())

['Home' 'Clothing' 'Health' 'Beauty' 'Electronics' 'Auto' 'Party' 'Office'
 'Household' 'Movies' 'Jewelry' 'Baby' 'Cell']
------------

['Low' 'Medium' 'High']
------------

['Style & Apply' 'UNOTUX' 'PeanutsÃƒâ€šÃ‚' 'Ekena Milwork' 'Medi'
 'Tom Ford' 'Mightyskins' 'Stupell Industries' 'Style and Apply' 'Prop?t'
 '2Bhip' 'Zoan' 'PlushDeluxe' 'Envelopes.com' 'Business Essentials'
 'New Way' 'Ames Walker' 'JustVH' 'Aviana' 'Classique'
 'Augusta Sportswear' 'Edwards' 'Eye Buy Express' 'Sport-Tek'
 'WARNER STUDIOS' 'MusicBoxAttic' 'City Shirts' 'ELENXS' 'Casual Nights'
 'Duda Energy' 'Awkward Styles' 'CHOSEN SUPPLIES' 'Ishow Hair'
 'Port Authority']
------------

['Small Business' 'Middle class' 'Corporate']
------------

['female' 'male']
------------

[   0   24    8   52   16   11    9    3    4    2    1   45   47   56
   50  140  120  110  122  108    6   25   10   12   17   20   36   32
   48   53   30   13  768   44    7   15  200  220  896  248  208  500
 1200  400  448   60   23  

In [21]:
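# Text cleaning: remove '?', '(' and ')' and spell out '&'
# (regex=False treats each pattern as a literal string, not a regex)
final_data1['Brand'] = final_data1['Brand'].str.replace('?', '', regex=False)
final_data1['Brand'] = final_data1['Brand'].str.replace('&', 'and', regex=False)
final_data1['Brand'] = final_data1['Brand'].str.replace('(', '', regex=False)
final_data1['Brand'] = final_data1['Brand'].str.replace(')', '', regex=False)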

print(final_data1['Brand'].unique())

['Style and Apply' 'UNOTUX' 'PeanutsÃƒâ€šÃ‚' 'Ekena Milwork' 'Medi'
 'Tom Ford' 'Mightyskins' 'Stupell Industries' 'Propt' '2Bhip' 'Zoan'
 'PlushDeluxe' 'Envelopes.com' 'Business Essentials' 'New Way'
 'Ames Walker' 'JustVH' 'Aviana' 'Classique' 'Augusta Sportswear'
 'Edwards' 'Eye Buy Express' 'Sport-Tek' 'WARNER STUDIOS' 'MusicBoxAttic'
 'City Shirts' 'ELENXS' 'Casual Nights' 'Duda Energy' 'Awkward Styles'
 'CHOSEN SUPPLIES' 'Ishow Hair' 'Port Authority']
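The characters `?`, `(` and `)` are regex metacharacters, so it helps to be explicit about whether `str.replace` should treat a pattern as a literal or as a regular expression. The sketch below is one compact way to do the same cleaning on made-up brand names (`Brand (US)` is a hypothetical value): a literal replacement for `&`, and a single character class for the symbols to strip.

```python
import pandas as pd

# Hypothetical brand names containing the characters being cleaned.
brands = pd.Series(["Style & Apply", "Prop?t", "Brand (US)"])

cleaned = (
    brands
    .str.replace("&", "and", regex=False)   # literal replacement
    .str.replace(r"[?()]", "", regex=True)  # strip '?', '(' and ')' in one pass
)

print(cleaned.tolist())
```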


### Feature Engineering