# SVD for Product Recommendation System
A Recommender System refers to a system that is capable of predicting the future preference of a set of items for a user, and recommend the top items. One of its popular application is Netflix with an enormous collection of movies was able to recommend the best movies for you through various opinions from many people.

The most common method for recommendation systems often comes with Collaborating Filtering (CF) where it relies on the past user and item dataset. Two popular approaches of CF are **latent factor models**, which extract features from user and item matrices and neighborhood models, which finds similarities between products or users.

In this notebook, I am going to use **latent factor model** such as **Singular Value Decomposition (SVD)** extract features and correlation from the user-item matrix.

Singular Value Decomposition (SVD) will allow me to apply **Dimensionality Reduction technique** to derive the tastes and preferences from the raw data, otherwise known as doing low-rank matrix factorization. Why reduce dimensions?

- I can discover hidden correlations / features in the raw data.
- I can remove redundant and noisy features that are not useful.
- I can interpret and visualize the data easier.
- I can also access easier data storage and processing.

With that goal, I'll be using Singular Vector Decomposition (SVD), a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender system to build a solid product recommendation system for a sale/business department dataset

### Import necessary packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import math



###### Disable SettingWithCopyWarning
# reference: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None  # default='warn'

###### display full outputs in Jupyter Notebook, not only the last command's output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Loading the Dataset

In [2]:
df = pd.read_csv('D:\\DATASET\\E-Commerce Data.csv', encoding='cp1252')
df.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


### Preprocesssing & EDA data
With this step I will try to find as much information about the dataset as possible

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [4]:
df[df['CustomerID'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,12/1/2010 11:52,0.00,,United Kingdom
1443,536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,12/1/2010 14:32,2.51,,United Kingdom
1444,536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,12/1/2010 14:32,2.51,,United Kingdom
1445,536544,21786,POLKADOT RAIN HAT,4,12/1/2010 14:32,0.85,,United Kingdom
1446,536544,21787,RAIN PONCHO RETROSPOT,2,12/1/2010 14:32,1.66,,United Kingdom
...,...,...,...,...,...,...,...,...
541536,581498,85099B,JUMBO BAG RED RETROSPOT,5,12/9/2011 10:26,4.13,,United Kingdom
541537,581498,85099C,JUMBO BAG BAROQUE BLACK WHITE,4,12/9/2011 10:26,4.13,,United Kingdom
541538,581498,85150,LADIES & GENTLEMEN METAL SIGN,1,12/9/2011 10:26,4.96,,United Kingdom
541539,581498,85174,S/4 CACTI CANDLES,1,12/9/2011 10:26,10.79,,United Kingdom


We can see that the CustomerID column has a lot of null values. Now we have 2 options: 
1. Drop all null value rows
2. Try to find ways to fill those Nan values

With option 2 in mind, I have a theory to fill in values. For example, InvoiceNo A has CustomerID A and Customer null, I can replace these NaN CustomerID with existing CustomerID A since it's just missing input error. 

If option 2 failed, I will do option 1 which is dropping all null value rows for further analysis.

### Let's create 2 different DataFrame from our original DF (df)
- DataFrame 1: null_df containing only null value CustomerID
- DataFrame 2: df_id containing only non-null value CustomerID

In [5]:
null_df = df[df['CustomerID'].isnull()]
null_df.reset_index(drop=True, inplace=True)
null_df.shape
null_df.head()

df_id=df[df['CustomerID'].isnull() == False]
df_id.reset_index(drop=True, inplace=True)
df_id.shape
df_id.head()

(135080, 8)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536414,22139,,56,12/1/2010 11:52,0.0,,United Kingdom
1,536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,12/1/2010 14:32,2.51,,United Kingdom
2,536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,12/1/2010 14:32,2.51,,United Kingdom
3,536544,21786,POLKADOT RAIN HAT,4,12/1/2010 14:32,0.85,,United Kingdom
4,536544,21787,RAIN PONCHO RETROSPOT,2,12/1/2010 14:32,1.66,,United Kingdom


(406829, 8)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


**Then we find the intersection between these two seperated DataFrame**
If there is nothing, it means these 2 DF have nothing in common -> left us only option 1 -> drop all null-value rows.

In [6]:
##### Tìm intersection giữa 2 set
l = df_id['InvoiceNo'].unique()
df_test = null_df[null_df['InvoiceNo'].isin(l)]
df_test

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


Option 2 failed, choose option 1.

In [7]:
df=df[df['CustomerID'].isnull() == False]
df.reset_index(drop=True, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406829 entries, 0 to 406828
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    406829 non-null  object 
 1   StockCode    406829 non-null  object 
 2   Description  406829 non-null  object 
 3   Quantity     406829 non-null  int64  
 4   InvoiceDate  406829 non-null  object 
 5   UnitPrice    406829 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      406829 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 24.8+ MB


### Clustering Data
Since we have alot of different Description, what I found out is that we have Product and Not Product Description. 

In Product Description, I can distinguish by understandable Description product name and positive value in Quantity.

In Not Product Description, I will divide them into 2 status: Promotion and Cancel since there is a lot of Canceled InvoiceNo (with StockCode begins with 'C' and negative Quantity value).

Last but not least, our Recommendation Analysis will only run on Products with Actual Sales status, Quantity will be selected as the rating scale of product rated by each customer. 

First, as I was browsing through the data, I discovered that *special* Description was either lowercase or short. Therefore, I will check the list of both these feature in Description value to see if it can help me to cluster the data into different groups as meantioned above.

In [8]:
### Check lowercase Description:
list_of_products_has_lowercase_words = []
for i in df['Description']:
    for e in i:
        if e.islower():
            list_of_products_has_lowercase_words.append(i)
            
list_of_products_has_lowercase_words = list(set(list_of_products_has_lowercase_words))
list_of_products_has_lowercase_words

['FRENCH BLUE METAL DOOR SIGN No',
 'Bank Charges',
 'FOLK ART GREETING CARD,pack/12',
 '3 TRADITIONAl BISCUIT CUTTERS  SET',
 'BAG 500g SWIRLY MARBLES',
 'ESSENTIAL BALM 3.5g TIN IN ENVELOPE',
 'POLYESTER FILLER PAD 30CMx30CM',
 'THE KING GIFT BAG 25x24x12cm',
 'FLOWERS HANDBAG blue and orange',
 'NUMBER TILE VINTAGE FONT No ',
 'High Resolution Image',
 'BAG 125g SWIRLY MARBLES',
 'POLYESTER FILLER PAD 45x45cm',
 'NUMBER TILE COTTAGE GARDEN No',
 'Discount',
 'CRUK Commission',
 'POLYESTER FILLER PAD 60x40cm',
 'POLYESTER FILLER PAD 65CMx65CM',
 'POLYESTER FILLER PAD 45x30cm',
 'BAG 250g SWIRLY MARBLES',
 'Manual',
 'POLYESTER FILLER PAD 40x40cm',
 'Next Day Carriage']

It is clearly shown that the description with lots of lowercase words such as discount and Next Day Carriage are not products. So I will have them appended to the not_products list

In [9]:
not_products = ['Next Day Carriage', 
                'Discount', 
                'CRUK Commission', 
                'Bank Charges', 
                'Manual', 
                'High Resolution Image']

In [10]:
### Check short Description:
list_of_products_has_short_length = []
for i in df['Description']:
    if len(i) <= 15:
        list_of_products_has_short_length.append(i)
            
list_of_products_has_short_length = list(set(list_of_products_has_short_length))
list_of_products_has_short_length

['DOORMAT TOPIARY',
 'PINK PARTY BAGS',
 'PHOTO CLIP LINE',
 'RAIN PONCHO ',
 'POPCORN HOLDER',
 'WRAP FOLK ART',
 'FIRST AID TIN',
 'ANIMAL STICKERS',
 'PACKING CHARGE',
 'DAISY HAIR COMB',
 'DOGGY RUBBER',
 'MIRROR CORNICE',
 'SOMBRERO ',
 'LOCAL CAFE MUG',
 'CORDIAL JUG',
 'RIBBONS PURSE ',
 'SPOTTY BUNTING',
 'LED TEA LIGHTS',
 'Discount',
 'WRAP CAROUSEL',
 'OWL DOORSTOP',
 'TUMBLER BAROQUE',
 'WRAP RED DOILEY',
 'BATHROOM HOOK',
 'BLUE FLY SWAT',
 'SPACE FROG',
 'PHOTO CUBE',
 'SANDALWOOD FAN',
 'DAISY HAIR BAND',
 'PARTY BUNTING',
 'MILK MAIDS MUG ',
 'RETROSPOT LAMP',
 'WRAP COWBOYS  ',
 'GOLD TEDDY BEAR',
 'DAISY NOTEBOOK ',
 'SKULLS TAPE',
 'CHAMBRE HOOK',
 'BLUE TILED TRAY',
 'WRAP, CAROUSEL',
 'CHILLI LIGHTS',
 'POTTERING MUG',
 'RETRO MOD TRAY',
 'CUTE CATS TAPE',
 'JUMBO BAG PEARS',
 'PINK DOG BOWL',
 'GOLD WASHBAG',
 'DAISY JOURNAL ',
 'POLKADOT PEN',
 'RED  EGG  SPOON',
 'KEY FOB , SHED',
 'JUMBO BAG OWLS',
 'JUMBO BAG TOYS ',
 'NEWSPAPER STAND',
 'FLAMINGO LIGHTS',
 'D

'POSTAGE' and 'CARRIAGE' are not products. So I will have them appended to the not_products list

In [11]:
not_products.append('POSTAGE')
not_products.append('CARRIAGE')
not_products

['Next Day Carriage',
 'Discount',
 'CRUK Commission',
 'Bank Charges',
 'Manual',
 'High Resolution Image',
 'POSTAGE',
 'CARRIAGE']

So if our Description values are in the not_products list -> Promotion

So if our Description values are not in the not_products list and the Invoice begins with a "C" -> Cancel

So if our Description values are not in the not_products list and is not Cancel -> Actual Sales

In [12]:
df[df['InvoiceNo'].str.contains('C')]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,12/1/2010 9:41,27.50,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,12/1/2010 9:49,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,12/1/2010 10:24,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
...,...,...,...,...,...,...,...,...
406377,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,-11,12/9/2011 9:57,0.83,14397.0,United Kingdom
406461,C581499,M,Manual,-1,12/9/2011 10:28,224.69,15498.0,United Kingdom
406635,C581568,21258,VICTORIAN SEWING BOX LARGE,-5,12/9/2011 11:57,10.95,15311.0,United Kingdom
406636,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,12/9/2011 11:58,1.25,17315.0,United Kingdom


In [13]:
df['Status'] = ['Cancel' if x[0] == 'C' else 'Actual Sales' for x in df['InvoiceNo']]
promotion_index = df[df['Description'].isin(not_products) == True].index
df['Promotion'] = 'No'
for i in promotion_index:
    df['Promotion'][i] = 'Yes'
df_pro = df[df['Promotion'] == 'Yes']
df_pro['Status'] = 'Promotion'
df.drop(promotion_index, inplace=True)
df.reset_index(drop=True,inplace=True)
df = df.drop(['Promotion'], axis=1)
df_pro.reset_index(drop=True, inplace=True)
df = pd.concat([df,df_pro],ignore_index=True)
df = df.drop(['Promotion'], axis=1)
df=df.drop_duplicates()
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Status
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom,Actual Sales
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,Actual Sales
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom,Actual Sales
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,Actual Sales
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,Actual Sales
...,...,...,...,...,...,...,...,...,...
406824,581494,POST,POSTAGE,2,12/9/2011 10:13,18.00,12518.0,Germany,Promotion
406825,C581499,M,Manual,-1,12/9/2011 10:28,224.69,15498.0,United Kingdom,Promotion
406826,581570,POST,POSTAGE,1,12/9/2011 11:59,18.00,12662.0,Germany,Promotion
406827,581574,POST,POSTAGE,2,12/9/2011 12:09,18.00,12526.0,Germany,Promotion


### Recommendation System
With clustering the data into 3 different group, now I select only 'Actual Sales' group to do make a Product Recommendation System for customer

In [14]:
df = df[df['Status'] == 'Actual Sales']
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Status
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom,Actual Sales
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,Actual Sales
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom,Actual Sales
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,Actual Sales
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,Actual Sales
...,...,...,...,...,...,...,...,...,...
404841,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France,Actual Sales
404842,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France,Actual Sales
404843,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France,Actual Sales
404844,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France,Actual Sales


In [15]:
df = df.groupby(['CustomerID', 'Description']).agg({'Quantity':'sum'})
df.reset_index(drop=False, inplace=True)

In [16]:
df[df['Quantity'] < 0]

Unnamed: 0,CustomerID,Description,Quantity


In [17]:
number_cus = len(df['CustomerID'].unique())
print(f'We have {number_cus} customers')

number_pro = len(df['Description'].unique())
print(f'We have {number_pro} products')

We have 4335 customers
We have 3871 products


Now I want the format of my ratings matrix to be one row per customer and one column per product. To do so, I'll pivot ratings to get that and call the new variable Ratings with a capitalized R.

In [18]:
Ratings = df.pivot(index='CustomerID', columns='Description', values='Quantity').fillna(0)
Ratings

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12349.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12350.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18281.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18282.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18283.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


De-normalize the data (normalize by each users mean) and convert it from a dataframe to a numpy array

In [19]:
ratings_matrix = Ratings.values
ratings_matrix.shape
ratings_matrix

(4335, 3871)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### Mean

In [20]:
ratings_mean = np.mean(ratings_matrix, axis=1)
ratings_mean.shape
ratings_mean

(4335,)

array([19.17204857,  0.63497804,  0.60242831, ...,  0.02660811,
        0.35003875,  0.40971325])

### Reshape the Mean

In [21]:
ratings_mean = ratings_mean.reshape(-1,1)
ratings_mean.shape
ratings_mean

(4335, 1)

array([[19.17204857],
       [ 0.63497804],
       [ 0.60242831],
       ...,
       [ 0.02660811],
       [ 0.35003875],
       [ 0.40971325]])

### Demeaned the matrix

In [22]:
ratings_demeaned = ratings_matrix - ratings_mean
ratings_demeaned.shape
ratings_demeaned

(4335, 3871)

array([[-19.17204857, -19.17204857, -19.17204857, ..., -19.17204857,
        -19.17204857, -19.17204857],
       [ -0.63497804,  -0.63497804,  -0.63497804, ...,  -0.63497804,
         -0.63497804,  -0.63497804],
       [ -0.60242831,  -0.60242831,  -0.60242831, ...,  -0.60242831,
         -0.60242831,  -0.60242831],
       ...,
       [ -0.02660811,  -0.02660811,  -0.02660811, ...,  -0.02660811,
         -0.02660811,  -0.02660811],
       [ -0.35003875,  -0.35003875,  -0.35003875, ...,  -0.35003875,
         -0.35003875,  -0.35003875],
       [ -0.40971325,  -0.40971325,  -0.40971325, ...,  -0.40971325,
         -0.40971325,  -0.40971325]])

With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math.

## Model-Based Collaborative Filtering
Model-based Collaborative Filtering is based on matrix factorization (MF) which has received greater exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems where it can deal better with scalability and sparsity than Memory-based CF:

- The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items.
- When you have a very sparse matrix, with a lot of dimensions, by doing matrix factorization, you can restructure the user-item matrix into low-rank structure, and you can represent the matrix by the multiplication of two low-rank matrices, where the rows contain the latent vector.
- You fit this matrix to approximate your original matrix, as closely as possible, by multiplying the low-rank matrices together, which fills in the entries missing in the original matrix.

For example, let's check the sparsity of the ratings dataset:

In [23]:
sparsity = round(1 - len(df)/float(number_cus * number_pro), 3)
print (f'The sparsity level of our dataset is ' 
       +  str(sparsity * 100) + '%')

The sparsity level of our dataset is 98.4%


## Support Vector Decomposition (SVD)
A well-known matrix factorization method is Singular value decomposition (SVD). At a high level, SVD is an algorithm that decomposes a matrix A into the best lower rank (i.e. smaller/simpler) approximation of the original matrix A . Mathematically, it decomposes A into a two unitary matrices and a diagonal matrix:

   ![svd.png](attachment:svd.png)

where A is the input data matrix (users's ratings), U is the left singular vectors (user "features" matrix), Σ is the diagonal matrix of singular values (essentially weights/strengths of each concept), and VT is the right singluar vectors (movie "features" matrix). U and VT are column orthonomal, and represent different things. U represents how much users "like" each feature and VT represents how relevant each feature is to each movie.

To get the lower rank approximation, I take these matrices and **keep only the top k features (in this case, our k/rank = 50)**, which can be thought of as the underlying tastes and preferences vectors.

In [24]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(ratings_demeaned, k = 50)

**Scipy and Numpy both** have functions to do the singular value decomposition. I used the **Scipy function svds** because it let's me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it after).

In [25]:
sigma

array([ 2353.73989029,  2416.00009089,  2456.1727885 ,  2504.81192071,
        2573.30799449,  2643.2450893 ,  2809.60727704,  2836.16962579,
        2931.69736557,  3067.16556516,  3169.2610677 ,  3223.13759446,
        3320.03513507,  3369.61658283,  3464.34369493,  3535.33166427,
        3688.74156858,  3718.88567874,  3877.08676149,  3910.90200852,
        4030.46275468,  4096.29906642,  4240.81624725,  4315.83562676,
        4613.55212615,  4712.60730834,  4791.909316  ,  4842.04662444,
        5079.40936282,  5330.33848098,  5562.64359854,  5849.6155602 ,
        6409.23719993,  6577.52829829,  7002.95227766,  7780.14825982,
        8556.20755683,  8616.50848519,  8802.76665475,  9867.03020466,
        9993.21302901, 11055.94588002, 11332.80710896, 11873.60207622,
       12244.11129845, 12541.99655104, 15353.02450053, 16537.56949493,
       74209.01849474, 80984.59602687])

Let's reshape **sigma** into diagonal matrix for leverage matrix multiplication to get predictions

In [26]:
sigma = np.diag(sigma)
sigma.shape
sigma

(50, 50)

array([[ 2353.73989029,     0.        ,     0.        , ...,
            0.        ,     0.        ,     0.        ],
       [    0.        ,  2416.00009089,     0.        , ...,
            0.        ,     0.        ,     0.        ],
       [    0.        ,     0.        ,  2456.1727885 , ...,
            0.        ,     0.        ,     0.        ],
       ...,
       [    0.        ,     0.        ,     0.        , ...,
        16537.56949493,     0.        ,     0.        ],
       [    0.        ,     0.        ,     0.        , ...,
            0.        , 74209.01849474,     0.        ],
       [    0.        ,     0.        ,     0.        , ...,
            0.        ,     0.        , 80984.59602687]])

### Making Predictions from the Decomposed Matrice
I now have everything I need to make product ratings predictions for every customer. I can do it all at once by following the math and matrix multiply U , Σ , and VT back to get the rank k=50 approximation of A.

**Matrix multiply U , Σ , and VT:**

In [27]:
k_rank_matrix = np.dot(np.dot(U, sigma), Vt)
k_rank_matrix.shape
k_rank_matrix

(4335, 3871)

array([[-1.91776384e+01, -1.91639088e+01, -1.91525115e+01, ...,
        -1.91731548e+01, -1.91757329e+01, -1.91806325e+01],
       [-4.32273157e-01, -8.16645309e-01,  2.77661081e+00, ...,
         4.06270881e-01, -4.39016445e-01, -2.81974968e-01],
       [-7.29073347e-01,  2.94474048e+00, -2.75635484e+00, ...,
        -1.20365594e+00, -7.25831149e-01, -3.75278855e-01],
       ...,
       [-1.52448882e-02,  3.25855989e-02, -4.94508785e-04, ...,
        -1.35691194e-02, -1.56646856e-02, -6.68683443e-03],
       [-3.65970195e-01,  6.73841037e-02,  2.64179171e+00, ...,
         7.25241657e-01, -3.64507108e-01, -3.65844936e-01],
       [-1.01431993e-01,  4.97132465e-02, -2.50308642e-01, ...,
         2.82073509e-02, -9.72358640e-02,  5.06929949e-02]])

Add the customer means back to get the actual star ratings prediction.

In [28]:
predicted_ratings = k_rank_matrix + ratings_mean
predicted_ratings.shape
predicted_ratings

(4335, 3871)

array([[-5.58981525e-03,  8.13973878e-03,  1.95370233e-02, ...,
        -1.10623176e-03, -3.68430680e-03, -8.58388617e-03],
       [ 2.02704885e-01, -1.81667267e-01,  3.41158885e+00, ...,
         1.04124892e+00,  1.95961597e-01,  3.53003074e-01],
       [-1.26645034e-01,  3.54716880e+00, -2.15392652e+00, ...,
        -6.01227622e-01, -1.23402836e-01,  2.27149458e-01],
       ...,
       [ 1.13632234e-02,  5.91937105e-02,  2.61136028e-02, ...,
         1.30389922e-02,  1.09434260e-02,  1.99212772e-02],
       [-1.59314457e-02,  4.17422853e-01,  2.99183046e+00, ...,
         1.07528041e+00, -1.44683585e-02, -1.58061861e-02],
       [ 3.08281259e-01,  4.59426499e-01,  1.59404610e-01, ...,
         4.37920603e-01,  3.12477388e-01,  4.60406247e-01]])

With our predicted_ratings matrix, let's create a prediction DataFrame with our index is the CustomerID and the columns are our Products

In [29]:
prediction_ratings = pd.DataFrame(data=predicted_ratings, index=df.CustomerID.unique(), columns=df.Description.unique())
prediction_ratings

Unnamed: 0,MEDIUM CERAMIC TOP STORAGE JAR,3D DOG PICTURE PLAYING CARDS,3D SHEET OF CAT STICKERS,3D SHEET OF DOG STICKERS,60 TEATIME FAIRY CAKE CASES,72 SWEETHEART FAIRY CAKE CASES,AIRLINE BAG VINTAGE JET SET BROWN,AIRLINE BAG VINTAGE JET SET RED,AIRLINE BAG VINTAGE JET SET WHITE,AIRLINE BAG VINTAGE TOKYO 78,...,S/2 BEACH HUT TREASURE CHESTS,PURPLE FRANGIPANI HAIRCLIP,GOLD PRINT PAPER BAG,LILAC FEATHERS CURTAIN,SET/3 TALL GLASS CANDLE HOLDER PINK,FLOWER SHOP DESIGN MUG,CAPIZ CHANDELIER,BLUE NEW BAROQUE FLOCK CANDLESTICK,CAT WITH SUNGLASSES BLANK CARD,SCALLOP SHELL SOAP DISH
12346.0,-0.005590,0.008140,0.019537,-0.010857,-0.003190,-0.003222,-0.007190,0.047034,-0.033719,-0.004463,...,-0.003242,-0.002655,-0.008356,-0.007799,-0.003866,-0.136390,-0.003332,-0.001106,-0.003684,-0.008584
12347.0,0.202705,-0.181667,3.411589,0.300238,0.197258,0.203371,0.347642,-0.229798,-2.003743,0.297246,...,0.195690,0.200417,0.366339,0.644415,0.194953,1.363778,0.195469,1.041249,0.195962,0.353003
12348.0,-0.126645,3.547169,-2.153927,-0.200981,-0.126440,-0.093060,0.242459,3.339277,3.573855,-0.016661,...,-0.123867,-0.122241,0.262659,0.456593,-0.120974,-0.456040,-0.120455,-0.601228,-0.123403,0.227149
12349.0,0.094483,-0.312302,0.833340,0.123198,0.088096,0.088992,0.083844,-0.310290,1.457185,0.156155,...,0.087862,0.087480,0.084434,0.135364,0.087789,0.243222,0.087869,0.237006,0.088069,0.083634
12350.0,0.035912,0.078433,0.191128,0.028067,0.032588,0.032035,0.030007,0.081887,-0.093889,0.051818,...,0.032590,0.032619,0.032330,0.018380,0.032511,0.005842,0.032588,0.040327,0.032492,0.030558
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280.0,0.008536,-0.005141,0.016333,0.011943,0.008265,0.008185,0.006889,-0.004369,0.023980,0.011773,...,0.008243,0.008309,0.007104,0.007323,0.008235,0.010093,0.008243,0.021692,0.008244,0.007260
18281.0,0.006706,-0.015273,0.130187,0.019939,0.007480,0.007689,0.006391,-0.016282,0.034760,0.011763,...,0.007403,0.007372,0.005608,0.014079,0.007461,0.021611,0.007393,0.040404,0.007432,0.006530
18282.0,0.011363,0.059194,0.026114,0.006313,0.010862,0.011937,0.019749,0.057028,0.055550,0.016655,...,0.010911,0.011045,0.019826,0.035236,0.010950,0.076677,0.010920,0.013039,0.010943,0.019921
18283.0,-0.015931,0.417423,2.991830,0.313450,-0.013241,-0.010156,-0.013042,0.337851,-0.143240,0.174540,...,-0.015588,-0.017257,-0.010729,0.206813,-0.011378,0.726437,-0.015760,1.075280,-0.014468,-0.015806


Now I write a function to recommend the customer the products with the highest rating that that specific customer has not bought yet.


In [30]:
def recommend_products(customerID, number_of_product_for_recommendation):
    # Get and sort the customer's ratings
    sorted_customer_predictions = prediction_ratings.loc[customerID].sort_values(ascending=False).reset_index()
    sorted_customer_predictions = sorted_customer_predictions.rename(columns = {'index': 'Description'})
    
    customer_data = df[df.CustomerID == (customerID)]
    customer_data = customer_data.sort_values(['Quantity'], ascending=False)
    
    # List of products that the customer hasn't rated
    unrated_products = df[~df['Description'].isin(customer_data['Description'])]
    
    # Recommend the highest predicted rating products that the customer hasn't purchased yet.
    recommendations = unrated_products.merge(sorted_customer_predictions, how = 'left', left_on = 'Description', right_on = 'Description')
    recommendations = recommendations.rename(columns = {customerID: 'score'})
    recommendations = recommendations.sort_values('score', ascending = False)
    recommendations = recommendations[['Description', 'score']]
    recommendations = recommendations.rename(columns = {'Description': 'Product'})
    recommendations.drop_duplicates(inplace=True)
    recommendations = recommendations.iloc[:number_of_product_for_recommendation, :-1].reset_index(drop=True)
    
    return customer_data, recommendations

Let's test with customer 18287 to see 10 recommended products

In [31]:
already_bought, recommendation = recommend_products(18287, 10)
recommendation

Unnamed: 0,Product
0,HOT WATER BOTTLE BABUSHKA
1,36 PENCILS TUBE WOODLAND
2,PACK OF 20 NAPKINS RED APPLES
3,CHRISTMAS CARD SINGING ANGEL
4,SET OF 6 HEART CHOPSTICKS
5,GREEN GLASS TASSLE BAG CHARM
6,SET OF 6 RIBBONS VINTAGE CHRISTMAS
7,TRIANGULAR POUFFE VINTAGE
8,SET OF 4 NAPKIN CHARMS HEARTS
9,GIN AND TONIC MUG


Let's see if the recommended products are in the already_bought dataframe or not. if not in, then I think I have succeeded creating a Products Recommendation System for Customer.

In [32]:
for i in recommendation.Product:
    print(already_bought[already_bought['Description'] == i])

Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []
Empty DataFrame
Columns: [CustomerID, Description, Quantity]
Index: []


### There we go, this is how we apply SVD to create a Recommendation System. 