### Assignment Task - 3
<b>Task:</b> Recommend items to a given customer ID for a given date.

<b>User Story:</b> The user should be able to provide a Customer ID and a date, and the program should recommend items to purchase.

<b>Hint:</b> The approach is given more importance than the result.

### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

### Read Data

In [2]:
data = pd.read_excel("Dataset for Task 1,2,3/Online Retail.xlsx")

In [3]:
data.head()

|  | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [6]:
print(f"The shape of the data: {data.shape}")

The shape of the data: (541909, 8)


In [7]:
print(f"No.of Customers: {len(data['CustomerID'].unique())} " )
print(f"No.of Items: {len(data['StockCode'].unique())}")
print(f"No.of Countries: {len(data['Country'].unique())}")

No.of Customers: 4373 
No.of Items: 4070
No.of Countries: 38


In [9]:
# Information about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


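Here, we can observe that the <b>CustomerID</b> column has <b>135,080</b> missing values (541,909 - 406,829). We remove the affected rows with the <i>dropna()</i> function. Because <i>dropna()</i> is called without arguments, the rows with a missing <b>Description</b> are dropped as well.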
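
As a minimal sketch of the step described above, the snippet below counts missing values per column and contrasts a blanket `dropna()` with one restricted to `CustomerID`. The `toy` frame and its values are made up for illustration and are not taken from the dataset; whether the restricted drop is preferable here is a judgment call, not something the notebook states.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the retail data (values are invented for illustration)
toy = pd.DataFrame({
    "StockCode": ["85123A", "71053", "22138", "84406B"],
    "Description": ["HOLDER", "LANTERN", "BAKING SET", None],
    "CustomerID": [17850.0, np.nan, np.nan, 12680.0],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Blanket drop: removes every row that has a missing value in any column
dropped_any = toy.dropna()

# Restricted drop: removes only rows without a CustomerID
dropped_customer_only = toy.dropna(subset=["CustomerID"])

print(dropped_any.shape, dropped_customer_only.shape)
```

With real data, the same two calls could be compared by row count to see how many rows are lost only because of a missing description.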

### Remove missing values

In [12]:
# Remove rows with missing values
data = data.dropna()

# Convert the CustomerID column from float to a nullable integer type
data["CustomerID"] = data["CustomerID"].astype("Int64")
print(f"Type of CustomerID column: {data['CustomerID'].dtype}")

Type of CustomerID column: Int64


In [55]:
data[data['Quantity'] < 0]

|  | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Date |
|---|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | Discount | -1 | 2010-12-01 09:41:00 | 27.50 | 14527 | United Kingdom | 2010-12-01 |
| 154 | C536383 | 35004C | SET OF 3 COLOURED FLYING DUCKS | -1 | 2010-12-01 09:49:00 | 4.65 | 15311 | United Kingdom | 2010-12-01 |
| 235 | C536391 | 22556 | PLASTERS IN TIN CIRCUS PARADE | -12 | 2010-12-01 10:24:00 | 1.65 | 17548 | United Kingdom | 2010-12-01 |
| 236 | C536391 | 21984 | PACK OF 12 PINK PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom | 2010-12-01 |
| 237 | C536391 | 21983 | PACK OF 12 BLUE PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom | 2010-12-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 540449 | C581490 | 23144 | ZINC T-LIGHT HOLDER STARS SMALL | -11 | 2011-12-09 09:57:00 | 0.83 | 14397 | United Kingdom | 2011-12-09 |
| 541541 | C581499 | M | Manual | -1 | 2011-12-09 10:28:00 | 224.69 | 15498 | United Kingdom | 2011-12-09 |
| 541715 | C581568 | 21258 | VICTORIAN SEWING BOX LARGE | -5 | 2011-12-09 11:57:00 | 10.95 | 15311 | United Kingdom | 2011-12-09 |
| 541716 | C581569 | 84978 | HANGING HEART JAR T-LIGHT HOLDER | -1 | 2011-12-09 11:58:00 | 1.25 | 17315 | United Kingdom | 2011-12-09 |


The data contains some negative values in the <b>Quantity</b> column. This is an issue because a negative quantity <b>does not make sense</b> as a purchase. So, we should remove those rows from the data.
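
In the rows shown, every negative-quantity line has an `InvoiceNo` that starts with `C`, which suggests these are cancellations. The sketch below checks that assumption on a made-up `toy` frame before filtering; the interpretation of the `C` prefix is an assumption and should be verified against the dataset's documentation.

```python
import pandas as pd

# Toy frame (invented rows that mimic the layout of the retail data)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "C536391", "536370"],
    "StockCode": ["85123A", "D", "22556", "22728"],
    "Quantity": [6, -1, -12, 24],
})

# Flag invoices whose number starts with "C" (assumed here to mark cancellations)
is_cancellation = toy["InvoiceNo"].astype(str).str.startswith("C")
is_negative = toy["Quantity"] < 0

# Cross-tabulate the two flags to see whether they coincide
print(pd.crosstab(is_cancellation, is_negative))

# Keep only rows that are neither cancellations nor negative quantities
purchases = toy[~is_cancellation & ~is_negative]
print(purchases)
```

If the two flags do not coincide on the full data, the off-diagonal cells of the cross-tabulation show which rows need a closer look before they are dropped.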

In [58]:
# Consider only positive Quantity values
data = data[data['Quantity'] > 0]

In [59]:
data.shape

(397924, 9)

### Creating item features

In [60]:
# Add a new column called 'Date', extracted from the 'InvoiceDate' column
data['Date'] = data['InvoiceDate'].dt.date

# Create item features
item_features = data[['StockCode', 'Description']].drop_duplicates() # get unique items
print(f"The shape of item features: {item_features.shape}")

The shape of item features: (3897, 2)


In [61]:
# show items
item_features.head()

|  | StockCode | Description |
|---|---|---|
| 0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER |
| 1 | 71053 | WHITE METAL LANTERN |
| 2 | 84406B | CREAM CUPID HEARTS COAT HANGER |
| 3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE |
| 4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. |


In [62]:
# set index as stock code
item_features = item_features.set_index('StockCode')
item_features.head()

| StockCode | Description |
|---|---|
| 85123A | WHITE HANGING HEART T-LIGHT HOLDER |
| 71053 | WHITE METAL LANTERN |
| 84406B | CREAM CUPID HEARTS COAT HANGER |
| 84029G | KNITTED UNION FLAG HOT WATER BOTTLE |
| 84029E | RED WOOLLY HOTTIE WHITE HEART. |


### Creating customer features

In [63]:
# Create customer features
# Get customer transaction details (copy the slice to avoid pandas' SettingWithCopyWarning)
customer_features = data[['CustomerID', 'StockCode', 'InvoiceDate']].copy()

# Assign a rating of 1 for each purchase
customer_features['Rating'] = 1


In [64]:
# Aggregate ratings and dates
customer_features = customer_features.groupby(['CustomerID','StockCode']).agg({'Rating': 'sum', 'InvoiceDate': 'max'})

# Reset index
customer_features = customer_features.reset_index() 

In [65]:
# Show customer transactions and details
customer_features.head()

|  | CustomerID | StockCode | Rating | InvoiceDate |
|---|---|---|---|---|
| 0 | 12346 | 23166 | 1 | 2011-01-18 10:01:00 |
| 1 | 12347 | 16008 | 1 | 2011-04-07 10:43:00 |
| 2 | 12347 | 17021 | 1 | 2011-06-09 13:01:00 |
| 3 | 12347 | 20665 | 1 | 2011-04-07 10:43:00 |
| 4 | 12347 | 20719 | 4 | 2011-12-07 15:52:00 |


### Creating pivot table 

In [66]:
# Pivot the customer features to get a user-item matrix
user_item = customer_features.pivot(index='CustomerID', columns='StockCode', values='Rating').fillna(0)

user_item.head()

| CustomerID (rows) / StockCode (columns) | 10002 | 10080 | 10120 | 10125 | 10133 | 10135 | 11001 | 15030 | 15034 | 15036 | ... | 90214V | 90214W | 90214Y | 90214Z | BANK CHARGES | C2 | DOT | M | PADS | POST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12346 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 12347 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 12348 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 12349 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 12350 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [67]:
# Convert it into array
user_item_array = user_item.to_numpy()
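
To make the user-item matrix and the cosine measure more concrete, here is a small sketch on invented purchase counts. The names `toy_counts`, `toy_user_item` and `sim` are hypothetical and exist only for this illustration. It shows that cosine similarity compares the direction of two purchase vectors and ignores their magnitude.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy purchase counts (made-up values): one row per customer-item pair
toy_counts = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "B", "C"],
    "Rating": [2, 1, 4, 2, 5],
})

# Same pivot as in the notebook: customers as rows, items as columns, 0 for "never bought"
toy_user_item = toy_counts.pivot(
    index="CustomerID", columns="StockCode", values="Rating"
).fillna(0)

# Pairwise cosine similarity between customer rows
sim = pd.DataFrame(
    cosine_similarity(toy_user_item.to_numpy()),
    index=toy_user_item.index,
    columns=toy_user_item.index,
)
print(sim.round(3))
```

Customers 1 and 2 buy the same items in the same proportions, so their similarity is 1 even though customer 2 buys twice as much; customer 3 shares no items with them, so the similarity is 0. The `metric='cosine'` option of `NearestNeighbors` works with the corresponding distance, 1 minus this similarity.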

### Creating Recommendation function

In [80]:
# Define a function to make recommendations based on customer ID and date

def make_recommendations(customer_id, date, n_recommendations = 10):
    # Items the customer bought up to the given date
    customer_data = data[(data['CustomerID'] == customer_id) & (data['InvoiceDate'] <= date)]
    customer_items = set(customer_data['StockCode'])

    # Row position of the customer in the user-item matrix
    customer_index = user_item.index.get_loc(customer_id)

    # Create a knn model using cosine distance as the metric
    knn = NearestNeighbors(metric='cosine', algorithm='brute')
    knn.fit(user_item_array)

    # Find the 10 nearest neighbors of the customer
    # (the customer's own row is normally among them, at distance 0)
    distances, indices = knn.kneighbors(user_item_array[customer_index].reshape(1, -1), n_neighbors=10)

    # Get the items purchased by the neighbors
    indices = indices.flatten()
    neighbor_items = set()
    for i in indices:
        neighbor_row = user_item.iloc[i]
        neighbor_items = neighbor_items.union(set(neighbor_row[neighbor_row > 0].index))

    # Remove the items that the customer already bought
    recommended_items = neighbor_items.difference(customer_items)
    recommended_items = list(recommended_items)

    return recommended_items[:n_recommendations]
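
The function above returns the first `n_recommendations` elements of a set, so the order carries no meaning. One possible refinement, sketched here on an invented matrix, is to rank candidate items by how many neighbors bought them and to skip the customer's own row. `rank_recommendations` and `toy_user_item` are hypothetical names for this illustration, not part of the notebook.

```python
from collections import Counter

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy user-item matrix (made-up values): rows are customers, columns are items
toy_user_item = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1]],
    index=[101, 102, 103, 104],
    columns=["A", "B", "C", "D"],
    dtype=float,
)

def rank_recommendations(customer_id, matrix, n_neighbors=2, n_recommendations=3):
    """Hypothetical helper: rank unseen items by how many neighbors bought them."""
    values = matrix.to_numpy()
    row = matrix.index.get_loc(customer_id)
    knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(values)

    # Ask for one extra neighbor because the query row is itself in the fitted data
    _, idx = knn.kneighbors(values[[row]], n_neighbors=n_neighbors + 1)
    neighbors = [i for i in idx.flatten() if i != row][:n_neighbors]

    # Items the customer already has
    owned = set(matrix.columns[values[row] > 0])

    # Each neighbor casts one vote per item it bought that the customer lacks
    votes = Counter()
    for i in neighbors:
        bought = matrix.columns[values[i] > 0]
        votes.update(item for item in bought if item not in owned)

    return [item for item, _ in votes.most_common(n_recommendations)]

print(rank_recommendations(101, toy_user_item))
```

In this toy case customer 101 owns A and B, both nearest neighbors also bought C, and C is therefore the only recommendation.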

### Evaluate the model

In [86]:
data.tail()

|  | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Date |
|---|---|---|---|---|---|---|---|---|---|
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 0.85 | 12680 | France | 2011-12-09 |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.1 | 12680 | France | 2011-12-09 |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | 12680 | France | 2011-12-09 |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680 | France | 2011-12-09 |
| 541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4.95 | 12680 | France | 2011-12-09 |


In [87]:
recommended_items = make_recommendations(12680,'2011-12-09',10)

# Create dataframe for the recommendations
recommendations_df = pd.DataFrame(recommended_items, columns=['StockCode'])
recommendations_df['Description'] = recommendations_df['StockCode'].apply(lambda x: item_features.loc[x,'Description'])

In [88]:
recommendations_df

|  | StockCode | Description |
|---|---|---|
| 0 | 22531 | MAGIC DRAWING SLATE CIRCUS PARADE |
| 1 | 22538 | MINI JIGSAW GO TO THE FAIR |
| 2 | 22539 | MINI JIGSAW DOLLY GIRL |
| 3 | 22540 | MINI JIGSAW CIRCUS PARADE |
| 4 | 22544 | MINI JIGSAW SPACEBOY |
| 5 | 22545 | MINI JIGSAW BUNNIES |
| 6 | 22547 | MINI JIGSAW DINOSAUR |
| 7 | 22549 | PICTURE DOMINOES |
| 8 | 22550 | HOLIDAY FUN LUDO |
| 9 | 22551 | PLASTERS IN TIN SPACEBOY |


In [89]:
# Test whether purchased item is present in the recommendations or not
purchased_item_code  = '22613'
purchased_item_recommend_df = recommendations_df[recommendations_df['StockCode'] == purchased_item_code]
purchased_item_recommend_df.head()

|  | StockCode | Description |
|---|---|---|


In [90]:
if purchased_item_recommend_df.shape[0] == 0:
    print(f"Purchased item {purchased_item_code} NOT shown in recommendations!!")
else:
    print(f"Purchased item {purchased_item_code} shown in recommendations!!")

Purchased item 22613 NOT shown in recommendations!!
