# Recommender system

- A recommender system, or recommendation system, is a subclass of information filtering system that provides suggestions for the items most pertinent to a particular user.
- Typically, the suggestions refer to various decision-making processes, such as what product to purchase, what music to listen to, or what online news to read.


- Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer.


- Recommender systems are used in a variety of areas, with commonly recognised examples taking the form of playlist generators for video and music services, product recommenders for online stores, and content recommenders for social media platforms and the open web.

- These systems can operate using a single input, like music, or multiple inputs within and across platforms like news, books and search queries.
- There are also popular recommender systems for specific topics like restaurants and online dating.
- Recommender systems have also been developed to explore research articles and experts, collaborators, and financial services.

## Overview

- Recommender systems usually use collaborative filtering, content-based filtering (also known as the personality-based approach), or both, as well as other approaches such as knowledge-based systems.



- Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. 
- This model is then used to predict items (or ratings for items) that the user may have an interest in. 



- Content-based filtering approaches utilize a series of discrete, pre-tagged characteristics of an item in order to recommend additional items with similar properties.



- We can demonstrate the differences between collaborative and content-based filtering by comparing two early music recommender systems – Last.fm and Pandora Radio.



- Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. 


- Last.fm will play tracks that do not appear in the user's library, but are often played by other users with similar interests.
- As this approach leverages the behavior of users, it is an example of a collaborative filtering technique.




- Pandora uses the properties of a song or artist (a subset of the 400 attributes provided by the Music Genome Project) to seed a "station" that plays music with similar properties. 
- User feedback is used to refine the station's results, deemphasizing certain attributes when a user "dislikes" a particular song and emphasizing other attributes when a user "likes" a song. This is an example of a content-based approach.



- Each type of system has its strengths and weaknesses.
- In the above example, Last.fm requires a large amount of information about a user to make accurate recommendations.
- This is an example of the cold start problem, and is common in collaborative filtering systems.
- Pandora, by contrast, needs very little information to start, but it is far more limited in scope (for example, it can only make recommendations that are similar to the original seed).


## Approaches

### Collaborative filtering

   
- Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past.
- The system generates recommendations using only information about rating profiles for different users or items. 
- By locating peer users or items with a rating history similar to the current user or item, the system generates recommendations from this neighborhood.
- Collaborative filtering methods are classified as memory-based and model-based. 
- A well-known example of a memory-based approach is the user-based algorithm, while a well-known model-based approach is matrix factorization.


- A key advantage of the collaborative filtering approach is that it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself.
- Many algorithms have been used to measure user similarity or item similarity in recommender systems, for example the k-nearest neighbor (k-NN) approach and the Pearson correlation, as first implemented by Allen.
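To make the neighborhood idea concrete, here is a minimal sketch of user-based collaborative filtering with Pearson correlation. The `ratings` dictionary, the user names, and the `pearson` and `predict` helpers are all hypothetical and exist only for illustration. A production system would also handle rating bias, negative correlations, and scale.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not taken from any dataset.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 1},
}

def pearson(u, v):
    """Pearson correlation computed over the items both users have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    ru = [ratings[u][i] for i in common]
    rv = [ratings[v][i] for i in common]
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    norm = sqrt(sum((a - mu) ** 2 for a in ru)) * sqrt(sum((b - mv) ** 2 for b in rv))
    return cov / norm if norm else 0.0

def predict(user, item, k=2):
    """Similarity-weighted average over the k most similar users who rated the item."""
    neighbours = sorted(
        ((pearson(user, other), other) for other in ratings
         if other != user and item in ratings[other]),
        reverse=True,
    )[:k]
    # Simplifying assumption: ignore neighbours with non-positive similarity.
    positive = [(s, o) for s, o in neighbours if s > 0]
    if not positive:
        return None
    return sum(s * ratings[o][item] for s, o in positive) / sum(s for s, _ in positive)

sim_bob = pearson("alice", "bob")      # positive: similar tastes
sim_carol = pearson("alice", "carol")  # negative: opposite tastes
prediction = predict("alice", "D")     # driven by bob's rating of D
print(sim_bob, sim_carol, prediction)
```

Because only `bob` correlates positively with `alice`, the prediction for item `D` falls back to his rating alone, which also shows how fragile a neighborhood of one user is.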


- When building a model from a user's behavior, a distinction is often made between explicit and implicit forms of data collection.


- Examples of explicit data collection include the following:

    - Asking a user to rate an item on a sliding scale.
    - Asking a user to search.
    - Asking a user to rank a collection of items from favorite to least favorite.
    - Presenting two items to a user and asking them to choose the better one.
    - Asking a user to create a list of items that they like.
    
- Examples of implicit data collection include the following:
    - Observing the items that a user views in an online store.
    - Analyzing item/user viewing times.
    - Keeping a record of the items that a user purchases online.
    - Obtaining a list of items that a user has listened to or watched on their computer.
    - Analyzing the user's social network and discovering similar likes and dislikes.
  
  
- Collaborative filtering approaches often suffer from three problems: cold start, scalability, and sparsity.

    - Cold start:
        - For a new user or item, there isn't enough data to make accurate recommendations. Note: one commonly implemented solution to this problem is the Multi-armed bandit algorithm.
    - Scalability: 
        - There are millions of users and products in many of the environments in which these systems make recommendations. Thus, a large amount of computation power is often necessary to calculate recommendations.
    - Sparsity:
        - The number of items sold on major e-commerce sites is extremely large. 
        - The most active users will only have rated a small subset of the overall database. 
        - Thus, even the most popular items have very few ratings.
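Sparsity can be quantified as the share of empty cells in the user-item matrix. The sketch below uses a handful of hypothetical `(user, item, rating)` triples. The names and values are illustrative assumptions, not data from this notebook.

```python
# Hypothetical ratings as (user, item, rating) triples.
ratings = [
    ("u1", "i1", 5), ("u1", "i3", 4),
    ("u2", "i2", 3),
    ("u3", "i1", 2), ("u3", "i4", 5),
]

users = {u for u, _, _ in ratings}
items = {i for _, i, _ in ratings}

# Every (user, item) pair is a cell of the user-item matrix.
possible = len(users) * len(items)
density = len(ratings) / possible   # share of cells with a rating
sparsity = 1 - density              # share of cells left empty

print(f"{len(users)} users x {len(items)} items, sparsity = {sparsity:.2%}")
```

Even in this tiny example more than half of the cells are empty. In a real catalogue with thousands of items the empty share is far larger.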
        
- One of the most famous examples of collaborative filtering is item-to-item collaborative filtering (people who buy x also buy y), an algorithm popularized by Amazon.com's recommender system.
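The "people who buy x also buy y" idea can be sketched with raw co-occurrence counts over purchase baskets. This is a deliberately simplified illustration, not Amazon's production algorithm. The `baskets` data and the `also_bought` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, one set of items per order.
baskets = [
    {"tea", "cup", "saucer"},
    {"tea", "cup"},
    {"cup", "saucer"},
    {"tea", "biscuits"},
]

# Count how often each ordered pair of items appears in the same basket.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def also_bought(item, top_n=2):
    """Items most often bought together with `item`; ties broken alphabetically."""
    scores = {b: c for (a, b), c in co_counts.items() if a == item}
    return sorted(scores, key=lambda b: (-scores[b], b))[:top_n]

print(also_bought("tea"))
print(also_bought("cup"))
```

Raw counts favor popular items, so practical systems usually normalize them, for example with cosine similarity between item vectors or with the lift metric.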


- Many social networks originally used collaborative filtering to recommend new friends, groups, and other social connections by examining the network of connections between a user and their friends.
- Collaborative filtering is still used as part of hybrid systems.


### Content-based filtering

- Another common approach when designing recommender systems is content-based filtering.
- Content-based filtering methods are based on a description of the item and a profile of the user's preferences.
- These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user.
- Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user's likes and dislikes based on an item's features.


- In this system, keywords are used to describe the items, and a user profile is built to indicate the type of item this user likes. 
- In other words, these algorithms try to recommend items similar to those that a user liked in the past or is examining in the present.
- This approach does not rely on a user sign-in mechanism to generate this often temporary profile. In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended.
- This approach has its roots in information retrieval and information filtering research.


- To create a user profile, the system mostly focuses on two types of information:

    1. A model of the user's preference.

    2. A history of the user's interaction with the recommender system.

- Basically, these methods use an item profile (i.e., a set of discrete attributes and features) characterizing the item within the system.
- To abstract the features of the items in the system, an item presentation algorithm is applied.
- A widely used algorithm is the tf–idf representation (also called vector space representation).
- The system creates a content-based profile of users based on a weighted vector of item features. 
- The weights denote the importance of each feature to the user and can be computed from individually rated content vectors using a variety of techniques. 
- Simple approaches use the average values of the rated item vector while other sophisticated methods use machine learning techniques such as Bayesian Classifiers, cluster analysis, decision trees, and artificial neural networks in order to estimate the probability that the user is going to like the item.
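The profile-building steps above can be sketched in plain Python: items are represented as tf-idf vectors, the user profile is the average of the vectors of liked items, and candidates are ranked by cosine similarity to that profile. The item descriptions, the `liked` list, and the specific tf-idf variant (term frequency times `log(N / df)`) are assumptions chosen for illustration.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical items described by a few keywords each.
items = {
    "m1": "space adventure robots",
    "m2": "space war drama",
    "m3": "romance drama paris",
    "m4": "robots adventure comedy",
}

def tfidf_vectors(docs):
    """Sparse tf-idf vectors: tf = count / doc length, idf = log(N / df)."""
    tokenised = {key: text.split() for key, text in docs.items()}
    n_docs = len(tokenised)
    doc_freq = Counter(t for toks in tokenised.values() for t in set(toks))
    vectors = {}
    for key, toks in tokenised.items():
        tf = Counter(toks)
        vectors[key] = {t: (c / len(toks)) * log(n_docs / doc_freq[t])
                        for t, c in tf.items()}
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = tfidf_vectors(items)

# Simple profile: the average of the vectors of the items the user liked.
liked = ["m1"]
profile = Counter()
for key in liked:
    for term, weight in vectors[key].items():
        profile[term] += weight / len(liked)

# Rank the items the user has not seen by similarity to the profile.
ranked = sorted(
    (key for key in items if key not in liked),
    key=lambda key: cosine(profile, vectors[key]),
    reverse=True,
)
print(ranked)
```

Here `m4` ranks first because it shares two keywords with the liked item, `m2` shares one, and `m3` shares none.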



- A key issue with content-based filtering is whether the system can learn user preferences from users' actions regarding one content source and use them across other content types. 
- When the system is limited to recommending content of the same type as the user is already using, the value from the recommendation system is significantly less than when other content types from other services can be recommended. 
- For example, recommending news articles based on news browsing is useful, but it would be much more useful if music, videos, products, discussions, etc., from different services could be recommended based on news browsing. To overcome this, most content-based recommender systems now use some form of hybrid system.


- Content-based recommender systems can also include opinion-based recommender systems. 
- In some cases, users are allowed to leave text reviews or feedback on the items.
- These user-generated texts are implicit data for the recommender system because they are potentially rich sources of both the features or aspects of an item and users' evaluation of, or sentiment toward, that item.
- Features extracted from user-generated reviews serve as improved metadata for items: like metadata, they reflect aspects of the item, and they are also the aspects that users care about most.
- Sentiments extracted from the reviews can be seen as users' rating scores on the corresponding features.
- Popular approaches to opinion-based recommender systems use various techniques, including text mining, information retrieval, sentiment analysis (see also multimodal sentiment analysis), and deep learning.


### Hybrid recommendation approaches

- Most recommender systems now use a hybrid approach, combining collaborative filtering, content-based filtering, and other approaches.
- There is no reason why several different techniques of the same type could not be hybridized.
- Hybrid approaches can be implemented in several ways:
    - by making content-based and collaborative-based predictions separately and then combining them;
    - by adding content-based capabilities to a collaborative-based approach (and vice versa); or 
    - by unifying the approaches into one model 
- Several studies have empirically compared the performance of hybrid methods with pure collaborative and content-based methods and demonstrated that hybrid methods can provide more accurate recommendations than pure approaches.
- These methods can also be used to overcome some of the common problems in recommender systems such as cold start and the sparsity problem, as well as the knowledge engineering bottleneck in knowledge-based approaches.



- Netflix is a good example of the use of hybrid recommender systems.
- The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).


- Some hybridization techniques include:
    - Weighted: 
        - Combining the score of different recommendation components numerically.
    - Switching: 
        - Choosing among recommendation components and applying the selected one.
    - Mixed: 
        - Recommendations from different recommenders are presented together to give the recommendation.
    - Feature Combination:
        - Features derived from different knowledge sources are combined together and given to a single recommendation algorithm.
    - Feature Augmentation: 
        - Computing a feature or set of features, which is then part of the input to the next technique.
    - Cascade:
        - Recommenders are given strict priority, with the lower priority ones breaking ties in the scoring of the higher ones.
    - Meta-level:
        - One recommendation technique is applied and produces some sort of model, which is then the input used by the next technique.
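Two of the techniques listed above, weighted and switching, are simple enough to sketch directly. The score dictionaries, the `alpha` blend weight, and the `min_ratings` threshold are hypothetical values chosen for illustration.

```python
# Hypothetical scores in [0, 1] from two separate recommenders for one user.
collaborative = {"item_a": 0.9, "item_b": 0.2, "item_c": 0.6}
content_based = {"item_a": 0.3, "item_b": 0.8, "item_c": 0.7}

def weighted(cf, cb, alpha=0.5):
    """Weighted hybrid: numeric blend of the two component scores."""
    return {item: alpha * cf.get(item, 0.0) + (1 - alpha) * cb.get(item, 0.0)
            for item in set(cf) | set(cb)}

def switching(cf, cb, n_user_ratings, min_ratings=5):
    """Switching hybrid: use content-based scores for users with few ratings."""
    return cf if n_user_ratings >= min_ratings else cb

def top(scores):
    """Return the item with the highest score."""
    return max(scores, key=scores.get)

blend = weighted(collaborative, content_based, alpha=0.5)
print(top(blend))                                                        # blended winner
print(top(switching(collaborative, content_based, n_user_ratings=2)))   # cold-start user
print(top(switching(collaborative, content_based, n_user_ratings=10)))  # established user
```

The switching variant is one way to soften the cold start problem: a new user gets content-based results until enough ratings exist for collaborative filtering to be reliable.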
        
        
        

https://en.wikipedia.org/wiki/Recommender_system

In [1]:
import warnings
warnings.filterwarnings("ignore")


In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [3]:
df = pd.read_csv("online_retail_ii.csv")
df.rename(columns = {'Invoice': 'InvoiceNo', 'Customer ID': 'CustomerID', 'Price': 'UnitPrice'}, inplace=True)
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [4]:
df.shape

(1067371, 8)

In [5]:
df.sample(20)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
621781,544485,22722,SET OF 6 SPICE TINS PANTRY DESIGN,4,2011-02-21 11:35:00,3.95,17757.0,United Kingdom
1060217,581167,85099C,JUMBO BAG BAROQUE BLACK WHITE,10,2011-12-07 14:52:00,2.08,13534.0,United Kingdom
217060,510445,21621,VINTAGE UNION JACK BUNTING,1,2010-05-30 15:58:00,8.5,18069.0,United Kingdom
355153,523937,47590B,PINK HAPPY BIRTHDAY BUNTING,3,2010-09-26 10:50:00,5.45,15201.0,United Kingdom
71354,495805,22133,PINK LOVE HEART SHAPE CUP,1,2010-01-26 17:51:00,1.66,,United Kingdom
1006485,577333,22568,FELTCRAFT CUSHION OWL,8,2011-11-18 14:40:00,3.75,14868.0,United Kingdom
1019934,578270,22544,MINI JIGSAW SPACEBOY,1,2011-11-23 13:39:00,0.83,14096.0,United Kingdom
65867,495243,35809B,ENAMEL BLUE RIM TEA CONTAINER,6,2010-01-22 10:28:00,2.1,14209.0,United Kingdom
254307,513942,20727,LUNCH BAG BLACK SKULL.,1,2010-06-29 13:12:00,4.21,,United Kingdom
789399,560047,21936,RED RETROSPOT PICNIC BAG,1,2011-07-14 15:01:00,2.95,17259.0,United Kingdom


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   InvoiceNo    1067371 non-null  object 
 1   StockCode    1067371 non-null  object 
 2   Description  1062989 non-null  object 
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  object 
 5   UnitPrice    1067371 non-null  float64
 6   CustomerID   824364 non-null   float64
 7   Country      1067371 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 65.1+ MB


In [7]:
(df.isna().sum() / len(df)) *100

InvoiceNo       0.000000
StockCode       0.000000
Description     0.410541
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     22.766873
Country         0.000000
dtype: float64

In [8]:
df["InvoiceNo"].nunique()

53628

In [9]:
df["CustomerID"].nunique()

5942

In [10]:

df["StockCode"].nunique()  # Unique products 

5305

In [11]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,1067371.0,1067371.0,824364.0
mean,9.938898,4.649388,15324.638504
std,172.7058,123.5531,1697.46445
min,-80995.0,-53594.36,12346.0
25%,1.0,1.25,13975.0
50%,3.0,2.1,15255.0
75%,10.0,4.15,16797.0
max,80995.0,38970.0,18287.0


In [12]:
len(df[df["Quantity"] < 0])
# Returned orders (rows with a negative quantity)

22950

In [13]:
len(df[df["Quantity"] < 0]) / len(df) * 100
# 2.15% return orders

2.1501427338760375

In [14]:
df = df[df['Quantity']>=0]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1044421 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   InvoiceNo    1044421 non-null  object 
 1   StockCode    1044421 non-null  object 
 2   Description  1042728 non-null  object 
 3   Quantity     1044421 non-null  int64  
 4   InvoiceDate  1044421 non-null  object 
 5   UnitPrice    1044421 non-null  float64
 6   CustomerID   805620 non-null   float64
 7   Country      1044421 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 71.7+ MB


In [15]:
len(df)

1044421

In [16]:
len(df.dropna(axis=0, subset=['InvoiceNo']))


1044421

In [17]:
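# No rows are dropped here (InvoiceNo has no nulls); reassigning avoids modifying a filtered slice in place
df = df.dropna(axis=0, subset=['InvoiceNo'])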


In [18]:
# Convert InvoiceNo to string so that letter-prefixed invoice numbers can be matched
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

In [19]:
# df["InvoiceNo"].sample(20)

In [20]:
df[df['InvoiceNo'].str.contains('C')]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
76799,C496350,M,Manual,1,2010-02-01 08:24:00,373.57,,United Kingdom


In [21]:
df[df['InvoiceNo'].str.contains('A')]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
179403,A506401,B,Adjust bad debt,1,2010-04-29 13:36:00,-53594.36,,United Kingdom
276274,A516228,B,Adjust bad debt,1,2010-07-19 11:24:00,-44031.79,,United Kingdom
403472,A528059,B,Adjust bad debt,1,2010-10-20 12:04:00,-38925.87,,United Kingdom
825443,A563185,B,Adjust bad debt,1,2011-08-12 14:50:00,11062.06,,United Kingdom
825444,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00,-11062.06,,United Kingdom
825445,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00,-11062.06,,United Kingdom


In [22]:
len(df)

1044421

In [23]:
# Drop invoices whose number contains 'C' (cancellations) or 'A' (bad-debt adjustments)
df = df[~df['InvoiceNo'].str.contains('C')]
df = df[~df['InvoiceNo'].str.contains('A')]
df.shape

(1044414, 8)

In [24]:
df.shape

(1044414, 8)

In [25]:
df['Country'].value_counts()

United Kingdom          961217
EIRE                     17354
Germany                  16703
France                   13941
Netherlands               5093
Spain                     3720
Switzerland               3137
Belgium                   3069
Portugal                  2562
Australia                 1815
Channel Islands           1569
Italy                     1468
Norway                    1437
Sweden                    1338
Cyprus                    1155
Finland                   1032
Austria                    922
Denmark                    798
Unspecified                752
Greece                     657
Poland                     512
Japan                      485
United Arab Emirates       467
USA                        409
Israel                     369
Hong Kong                  358
Singapore                  339
Malta                      282
Iceland                    253
Canada                     228
Lithuania                  189
RSA                        168
Bahrain 

In [26]:
# Filter the data to the United Kingdom,
# group by InvoiceNo and Description,
# sum the Quantity, and unstack the result.
# Each invoice/product cell then holds the quantity sold, or 0 if the product is not on that invoice.

In [27]:
df[df['Country'] =="United Kingdom"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum()

InvoiceNo  Description                        
489434      WHITE CHERRY LIGHTS                   12
           15CM CHRISTMAS GLASS BALL 20 LIGHTS    12
           FANCY FONT HOME SWEET HOME DOORMAT     10
           PINK CHERRY LIGHTS                     12
           PINK DOUGHNUT TRINKET POT              24
                                                  ..
581585     ZINC WILLIE WINKIE  CANDLE STICK       24
581586     DOORMAT RED RETROSPOT                  10
           LARGE CAKE STAND  HANGING STRAWBERY     8
           RED RETROSPOT ROUND CAKE TINS          24
           SET OF 3 HANGING OWLS OLLIE BEAK       24
Name: Quantity, Length: 915314, dtype: int64

In [28]:
df[df['Country'] =="United Kingdom"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo').sample(10)

Description,DOORMAT UNION JACK GUNS AND ROSES,3 STRIPEY MICE FELTCRAFT,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,ANIMAL STICKERS,BLACK PIRATE TREASURE CHEST,BROWN PIRATE TREASURE CHEST,Bank Charges,CAMPHOR WOOD PORTOBELLO MUSHROOM,CHERRY BLOSSOM DECORATIVE FLASK,...,tk maxx mix up with pink,to push order througha s stock was,update,website fixed,wrong invc,wrongly coded 20713,wrongly coded 23343,wrongly marked,wrongly marked 23343,wrongly sold (22719) barcode
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
498027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
514499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
559549,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
519832,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
531806,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
579667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
493376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
535777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
502063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
489641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
data = (df[df['Country'] =="United Kingdom"].groupby(['InvoiceNo', 'Description'])['Quantity']
               .sum().unstack().reset_index().fillna(0).set_index('InvoiceNo'))
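The reshaping step is easier to follow on a tiny example. The sketch below applies the same group, sum, and unstack pattern to a hypothetical five-row frame that reuses the notebook's column names. The values are made up, and `unstack(fill_value=0)` is used as a shortcut for `unstack()` followed by `fillna(0)`.

```python
import pandas as pd

# Hypothetical invoice lines using the notebook's column names.
toy = pd.DataFrame({
    "InvoiceNo":   ["1001", "1001", "1001", "1002", "1002"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "BOWL"],
    "Quantity":    [2, 1, 3, 4, 6],
})

# One row per invoice, one column per product, cell = total quantity sold.
basket = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
       .sum()
       .unstack(fill_value=0)
)

# Binary flags: 1 if the product appears on the invoice at all, else 0.
basket_flags = (basket > 0).astype(int)

print(basket)
print(basket_flags)
```

Invoice `1001` lists `MUG` twice, so the two lines are summed into a single cell before the quantities are reduced to presence flags.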


In [30]:
data.sample(15)

Description,DOORMAT UNION JACK GUNS AND ROSES,3 STRIPEY MICE FELTCRAFT,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,ANIMAL STICKERS,BLACK PIRATE TREASURE CHEST,BROWN PIRATE TREASURE CHEST,Bank Charges,CAMPHOR WOOD PORTOBELLO MUSHROOM,CHERRY BLOSSOM DECORATIVE FLASK,...,tk maxx mix up with pink,to push order througha s stock was,update,website fixed,wrong invc,wrongly coded 20713,wrongly coded 23343,wrongly marked,wrongly marked 23343,wrongly sold (22719) barcode
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
549837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
510819,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
511830,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
548260,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
524354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
533366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
499507,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
561359,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
556058,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
521476,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [31]:
len(data)

36752

In [32]:
data.shape

(36752, 5438)

In [33]:
# 36752 invoices and 5438 products

In [34]:
data.sum(axis=1).sample(5)

InvoiceNo
568194    123.0
519533    144.0
561394     48.0
532162     20.0
497317      6.0
dtype: float64

In [35]:
# Where the quantity sold is greater than 0, mark the cell as 1; otherwise keep it as 0.

In [36]:
(data > 0).astype(int)

Description,DOORMAT UNION JACK GUNS AND ROSES,3 STRIPEY MICE FELTCRAFT,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,ANIMAL STICKERS,BLACK PIRATE TREASURE CHEST,BROWN PIRATE TREASURE CHEST,Bank Charges,CAMPHOR WOOD PORTOBELLO MUSHROOM,CHERRY BLOSSOM DECORATIVE FLASK,...,tk maxx mix up with pink,to push order througha s stock was,update,website fixed,wrong invc,wrongly coded 20713,wrongly coded 23343,wrongly marked,wrongly marked 23343,wrongly sold (22719) barcode
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
489434,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489435,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489436,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489437,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489438,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581582,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581583,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581584,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581585,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
data = (data > 0).astype(int)
data

Description,DOORMAT UNION JACK GUNS AND ROSES,3 STRIPEY MICE FELTCRAFT,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,ANIMAL STICKERS,BLACK PIRATE TREASURE CHEST,BROWN PIRATE TREASURE CHEST,Bank Charges,CAMPHOR WOOD PORTOBELLO MUSHROOM,CHERRY BLOSSOM DECORATIVE FLASK,...,tk maxx mix up with pink,to push order througha s stock was,update,website fixed,wrong invc,wrongly coded 20713,wrongly coded 23343,wrongly marked,wrongly marked 23343,wrongly sold (22719) barcode
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
489434,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489435,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489436,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489437,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
489438,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581582,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581583,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581584,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581585,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
data.shape

(36752, 5438)

In [39]:
data.sum(axis=1).sample(5)

InvoiceNo
505790     7
574297    29
560117     8
570424    30
566048    23
dtype: int64

In [40]:
# The row sums now give the number of distinct products sold per invoice.

In [41]:
from mlxtend.frequent_patterns import apriori

In [42]:
frequent_itemsets_plus = apriori(data, min_support=0.021, 
                                 use_colnames=True).sort_values('support', ascending=False).reset_index(drop=True)
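The support of an itemset is the fraction of invoices that contain every item in the set, so `min_support=0.021` keeps itemsets that appear on at least 2.1% of invoices. The sketch below counts support by brute force on hypothetical transactions. It illustrates the definition only and is not how `mlxtend` implements the pruned Apriori search.

```python
from itertools import combinations

# Hypothetical transactions, one set of products per invoice.
transactions = [
    {"bag", "mug"},
    {"bag", "mug", "plate"},
    {"bag"},
    {"mug", "plate"},
]

def frequent_itemsets(transactions, min_support, max_len=2):
    """Brute-force support counting for all itemsets up to max_len items."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            # Support = share of transactions that contain the whole candidate.
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result

itemsets = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Apriori reaches the same result more cheaply by never testing a candidate whose subsets are already known to be infrequent.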





In [43]:
frequent_itemsets_plus['length'] = frequent_itemsets_plus['itemsets'].apply(len)

frequent_itemsets_plus

Unnamed: 0,support,itemsets,length
0,0.143013,(WHITE HANGING HEART T-LIGHT HOLDER),1
1,0.093628,(REGENCY CAKESTAND 3 TIER),1
2,0.082989,(JUMBO BAG RED RETROSPOT),1
3,0.072051,(ASSORTED COLOUR BIRD ORNAMENT),1
4,0.069003,(PARTY BUNTING),1
...,...,...,...
225,0.021114,(FELTCRAFT CUSHION RABBIT),1
226,0.021087,(PAPER CHAIN KIT RETROSPOT),1
227,0.021087,(FOUR HOOK WHITE LOVEBIRDS),1
228,0.021033,"(JUMBO BAG BAROQUE BLACK WHITE, JUMBO STORAGE...",2


In [44]:
frequent_itemsets_plus.loc[frequent_itemsets_plus.length==2]

Unnamed: 0,support,itemsets,length
72,0.031971,"(RED HANGING HEART T-LIGHT HOLDER, WHITE HANGI...",2
90,0.029658,"(WOODEN PICTURE FRAME WHITE FINISH, WOODEN FRA...",2
100,0.028788,"(JUMBO BAG RED RETROSPOT, JUMBO STORAGE BAG SUKI)",2
121,0.026965,"(STRAWBERRY CERAMIC TRINKET BOX, SWEETHEART CE...",2
123,0.026937,"(HEART OF WICKER LARGE, HEART OF WICKER SMALL)",2
127,0.026529,"(JUMBO SHOPPER VINTAGE RED PAISLEY, JUMBO BAG ...",2
141,0.025468,"(ROSES REGENCY TEACUP AND SAUCER , GREEN REGEN...",2
146,0.025033,"(JUMBO BAG PINK POLKADOT, JUMBO BAG RED RETROS...",2
151,0.024488,"(JUMBO BAG RED RETROSPOT, JUMBO BAG STRAWBERRY)",2
155,0.024271,"(JUMBO BAG RED RETROSPOT, JUMBO BAG BAROQUE B...",2


In [47]:
frequent_itemsets_plus.loc[frequent_itemsets_plus.length==2].shape

(21, 3)

In [49]:
frequent_itemsets_plus.loc[frequent_itemsets_plus.length==3]

Unnamed: 0,support,itemsets,length


In [45]:
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets_plus, metric ="lift", min_threshold = 1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(RED HANGING HEART T-LIGHT HOLDER),(WHITE HANGING HEART T-LIGHT HOLDER),0.045521,0.143013,0.031971,0.702331,4.910973,0.025461,2.878996
1,(WHITE HANGING HEART T-LIGHT HOLDER),(RED HANGING HEART T-LIGHT HOLDER),0.143013,0.045521,0.031971,0.223554,4.910973,0.025461,1.229292
2,(WOODEN PICTURE FRAME WHITE FINISH),(WOODEN FRAME ANTIQUE WHITE ),0.053167,0.054419,0.029658,0.55783,10.250686,0.026765,2.138502
3,(WOODEN FRAME ANTIQUE WHITE ),(WOODEN PICTURE FRAME WHITE FINISH),0.054419,0.053167,0.029658,0.545,10.250686,0.026765,2.080951
4,(JUMBO BAG RED RETROSPOT),(JUMBO STORAGE BAG SUKI),0.082989,0.06065,0.028788,0.346885,5.719483,0.023754,1.438262
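The metric columns in the rules table follow from three support values. As a check, the sketch below recomputes them for the first rule (red hanging heart holder implies white hanging heart holder) using the standard definitions of confidence, lift, leverage, and conviction. The `rule_metrics` helper is hypothetical, and the inputs are the rounded values printed in the table, so the results match only to about four decimal places.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Association-rule metrics for the rule A -> C from three support values."""
    confidence = support_ac / support_a            # P(C | A)
    lift = confidence / support_c                  # P(C | A) / P(C)
    leverage = support_ac - support_a * support_c  # P(A and C) - P(A) * P(C)
    conviction = (
        (1 - support_c) / (1 - confidence) if confidence < 1 else float("inf")
    )
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Row 0 of the rules table: antecedent, consequent, and joint support.
m = rule_metrics(support_a=0.045521, support_c=0.143013, support_ac=0.031971)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A lift well above 1, as here, means the two holders are bought together far more often than would be expected if purchases were independent.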


In [46]:
rules.loc[rules.antecedents == rules.antecedents.iloc[5]].sort_values('lift', ascending =False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
41,(JUMBO STORAGE BAG SUKI),(JUMBO BAG BAROQUE BLACK WHITE),0.06065,0.049848,0.021033,0.346792,6.957047,0.01801,1.454595
21,(JUMBO STORAGE BAG SUKI),(JUMBO SHOPPER VINTAGE RED PAISLEY),0.06065,0.057357,0.023917,0.394347,6.875261,0.020438,1.556408
5,(JUMBO STORAGE BAG SUKI),(JUMBO BAG RED RETROSPOT),0.06065,0.082989,0.028788,0.474652,5.719483,0.023754,1.745532


# Content-Based Recommender System

In [232]:
import numpy as np
import pandas as pd
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [233]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
users = pd.read_csv('users.csv')

In [234]:
movies.shape,ratings.shape,users.shape

((10329, 3), (105339, 4), (668, 3))

In [235]:
movies.sample(5)

Unnamed: 0,movieId,title,genres
3718,4748,3 Ninjas (1992),Action|Children|Comedy
3459,4419,All That Heaven Allows (1955),Drama|Romance
1512,1946,Marty (1955),Drama|Romance
1137,1398,In Love and War (1996),Romance|War
3458,4415,Cheech & Chong's Nice Dreams (1981),Comedy


In [236]:
ratings.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [237]:
users.sample(5)

Unnamed: 0,userId,age,time_spent_per_day
43,44,21,1.107918
464,465,28,3.910848
149,150,22,5.142108
308,309,19,2.041917
417,418,21,2.900026


In [238]:
## Item-item similarity based recommendation
## ("You may also like" style recommendations)
## Movie recommendations by genre
movies.sample(5)

Unnamed: 0,movieId,title,genres
7866,61950,Boot Camp (2007),Thriller
3945,5077,Cousins (1989),Comedy|Romance
1467,1890,Little Boy Blue (1997),Drama
4899,6696,Bollywood/Hollywood (2002),Comedy|Drama|Musical|Romance
3111,3953,Dr. T and the Women (2000),Comedy|Romance


In [239]:

# ratings["movieId"].value_counts().head(1000).index.to_list()

# Keep only the 1000 most-rated movies, so that every retained movie has more rating data

In [240]:
selected_movies = ratings["movieId"].value_counts().head(1000).index.to_list()

In [241]:
movies = movies.loc[movies["movieId"].isin(selected_movies)]
ratings = ratings.loc[ratings["movieId"].isin(selected_movies)]


In [242]:
movies.shape,ratings.shape

((1000, 3), (63250, 4))

In [243]:
m = movies.copy()

In [244]:
m.sample()

Unnamed: 0,movieId,title,genres
886,1089,Reservoir Dogs (1992),Crime|Mystery|Thriller


In [245]:
m["genres"] = m["genres"].str.split("|")

In [246]:
m.sample()

Unnamed: 0,movieId,title,genres
2259,2826,"13th Warrior, The (1999)","[Action, Adventure, Fantasy]"


In [247]:
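# One row per (movie, genre) pair, then one column per genre.
# A cell holds the title if the movie has that genre and NaN otherwise.
m = m.explode("genres").pivot(index="movieId",
                              columns="genres",
                              values="title")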

In [248]:
m.sample()

movieId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
541,Blade Runner (1982),,,,,,,,,,,,,,,Blade Runner (1982),Blade Runner (1982),,


In [249]:
m = ~m.isna()

In [250]:
m = m.astype(int)  # 1 if the movie has the genre, 0 otherwise

In [251]:
m.head(2)

movieId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
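
The explode / pivot / `isna` steps above turn each pipe-separated genre string into a row of 0/1 flags. As a minimal sketch of the same idea without pandas, the hypothetical helper `multi_hot` below builds the genre vocabulary and the flag vectors directly. The three genre strings are taken from sample rows shown earlier in this notebook. pandas' `Series.str.get_dummies(sep="|")` should produce an equivalent table in a single call.

```python
def multi_hot(genre_strings, sep="|"):
    """Return (vocabulary, vectors) for separator-delimited genre strings."""
    # Sorted vocabulary gives a stable column order, like the pivoted table.
    vocabulary = sorted({g for s in genre_strings for g in s.split(sep)})
    position = {g: i for i, g in enumerate(vocabulary)}
    vectors = []
    for s in genre_strings:
        row = [0] * len(vocabulary)
        for g in s.split(sep):
            row[position[g]] = 1  # flag every genre the movie belongs to
        vectors.append(row)
    return vocabulary, vectors

vocabulary, vectors = multi_hot(
    ["Crime|Mystery|Thriller", "Drama|Romance", "Romance|War"]
)
print(vocabulary)
for row in vectors:
    print(row)
```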


In [252]:
m.shape  # every feature is a 0/1 genre flag, so Hamming distance is a natural similarity measure

(1000, 19)

In [253]:
# Hamming distance: the number of genre flags on which two movies differ
q1 = m.iloc[100].values
q2 = m.iloc[101].values

def hamming_distance(q1, q2):
    return np.sum(np.abs(q1 - q2))

print(q1)
print(q2)

print(hamming_distance(q1, q2))

q1 = m.iloc[10].values
q2 = m.iloc[11].values

print(q1)
print(q2)

print(hamming_distance(q1, q2))

[1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
4
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0]
2


In [254]:
ranks = []

# Compare each of the first 50 movies (queries) against every other movie
for query in m.index[:50]:
    for candidate in m.index:
        if candidate == query:
            continue
        ranks.append([query, candidate, hamming_distance(m.loc[query], m.loc[candidate])])

In [255]:
ranks = pd.DataFrame(ranks,columns=["query","candidate","hamming_distance"])
ranks.head(3)

Unnamed: 0,query,candidate,hamming_distance
0,1,2,2
1,1,3,5
2,1,5,4


In [256]:

ranks = ranks.merge(movies[['movieId', 'title']], left_on='query', right_on='movieId').rename(columns={'title': 'query_tittle'}).drop(columns=['movieId'])
ranks = ranks.merge(movies[['movieId', 'title']], left_on='candidate', right_on='movieId').rename(columns={'title': 'candidate_tittle'}).drop(columns=['movieId'])
ranks = ranks.sort_values(by=['query', 'hamming_distance'])
ranks.head()

Unnamed: 0,query,candidate,hamming_distance,query_tittle,candidate_tittle
26951,1,2294,0,Toy Story (1995),Antz (1998)
33251,1,3114,0,Toy Story (1995),Toy Story 2 (1999)
39601,1,4886,0,Toy Story (1995),"Monsters, Inc. (2001)"
9351,1,673,1,Toy Story (1995),Space Jam (1996)
27451,1,2355,1,Toy Story (1995),"Bug's Life, A (1998)"


In [257]:
ranks["query_tittle"].unique()

array(['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)',
       'Father of the Bride Part II (1995)', 'Heat (1995)',
       'Sabrina (1995)', 'GoldenEye (1995)',
       'American President, The (1995)', 'Casino (1995)',
       'Sense and Sensibility (1995)',
       'Ace Ventura: When Nature Calls (1995)', 'Get Shorty (1995)',
       'Copycat (1995)', 'Powder (1995)', 'Leaving Las Vegas (1995)',
       'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
       'Dangerous Minds (1995)',
       'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 'Babe (1995)',
       'Dead Man Walking (1995)', 'Clueless (1995)',
       'Mortal Kombat (1995)', 'To Die For (1995)',
       'Seven (a.k.a. Se7en) (1995)', 'Pocahontas (1995)',
       'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)',
       'Postman, The (Postino, Il) (1994)', "Mr. Holland's Opus (1995)",
       'Bio-Dome (1996)', 'From Dusk Till Dawn (1996)',
       'Black Sheep (1996)', 'Broken Arrow (1996)',
    

In [258]:
# Top 5 similarity-based recommendations for the query "Jumanji (1995)"
ranks.loc[ranks["query_tittle"] == "Jumanji (1995)"].head(5)

Unnamed: 0,query,candidate,hamming_distance,query_tittle,candidate_tittle
26152,2,2161,0,Jumanji (1995),"NeverEnding Story, The (1984)"
39652,2,4896,0,Jumanji (1995),Harry Potter and the Sorcerer's Stone (a.k.a. ...
45602,2,41566,0,Jumanji (1995),"Chronicles of Narnia: The Lion, the Witch and ..."
2157,2,158,1,Jumanji (1995),Casper (1995)
11502,2,919,1,Jumanji (1995),"Wizard of Oz, The (1939)"


In [259]:

ranks.loc[ranks["query_tittle"] == "Batman Forever (1995)"].head(5)

Unnamed: 0,query,candidate,hamming_distance,query_tittle,candidate_tittle
1856,153,112,0,Batman Forever (1995),Rumble in the Bronx (Hont faan kui) (1995)
20095,153,1517,1,Batman Forever (1995),Austin Powers: International Man of Mystery (1...
29895,153,2683,1,Batman Forever (1995),Austin Powers: The Spy Who Shagged Me (1999)
37695,153,4027,1,Batman Forever (1995),"O Brother, Where Art Thou? (2000)"
2495,153,170,2,Batman Forever (1995),Hackers (1995)


# User-User Similarity Based Recommender System

In [260]:
users.head(5)

Unnamed: 0,userId,age,time_spent_per_day
0,1,16,3.976315
1,2,24,1.891303
2,3,20,4.521478
3,4,23,2.095284
4,5,35,1.75986


In [261]:
r = ratings.copy()
r.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [262]:
r["hour"] = r["timestamp"].apply(lambda x:datetime.fromtimestamp(x).hour)

In [263]:
r.sample(5)  # hour: the hour of day at which the user watched and rated the movie

Unnamed: 0,userId,movieId,rating,timestamp,hour
94293,622,4011,4.0,1447034871,21
94344,622,8368,4.0,1447965634,15
84513,575,6874,3.5,1086717411,13
77456,543,3793,4.0,995891818,8
96381,632,31658,5.0,1388949751,14
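
One caveat on the `hour` column: `datetime.fromtimestamp(x)` without a `tz` argument converts to the local time zone of the machine running the notebook, so the same timestamps can yield different hours elsewhere. A small sketch that pins the conversion to UTC is shown below. The helper name `utc_hour` is an assumption, and the two timestamps are copied from the tables above.

```python
from datetime import datetime, timezone

def utc_hour(ts):
    """Hour of day (0-23) in UTC for a Unix timestamp given in seconds."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).hour

# Timestamps copied from the ratings tables above.
hours = [utc_hour(ts) for ts in (1217897793, 1447034871)]
print(hours)
```

These UTC hours differ from the values in the `hour` column above, which is consistent with the notebook having been run in a non-UTC time zone.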


In [264]:
users = users.merge(r.groupby("userId")["rating"].mean().reset_index(),on="userId")
users = users.merge(r.groupby("userId")["hour"].mean().reset_index(),on="userId")


In [265]:
users

Unnamed: 0,userId,age,time_spent_per_day,rating,hour
0,1,16,3.976315,3.691589,20.000000
1,2,24,1.891303,3.923077,12.000000
2,3,20,4.521478,3.806452,5.000000
3,4,23,2.095284,4.159420,21.057971
4,5,35,1.759860,2.864865,15.000000
...,...,...,...,...,...
663,664,22,5.288101,3.964286,11.000000
664,665,20,5.220446,3.553763,7.086022
665,666,19,3.262313,3.642857,4.111111
666,667,17,3.674356,3.807692,10.707692


In [266]:
users.shape

(668, 5)

In [267]:
u = users.copy()

In [268]:
u = u.set_index('userId')

In [270]:


u.columns = ['age', 'time_spent_per_day', 'u_avg_rating', 'hour']

In [272]:
# u

In [273]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
u = pd.DataFrame(scaler.fit_transform(u), columns=u.columns, index=u.index)

def euclidian_distance(x, y):
    return np.linalg.norm(x-y)

In [278]:
userid = 1
dist = []
for user in u.index:
    dist.append(euclidian_distance(u.loc[userid], u.loc[user]))

u_rank = pd.DataFrame()
u_rank['id'] = u.index
u_rank['dist'] = dist
u_rank = u_rank.loc[u_rank.id != userid]
u_rank = u_rank.sort_values(by='dist')
u_rank.head() # top 5 similar users 

Unnamed: 0,id,dist
91,92,0.485134
185,186,0.75939
176,177,0.795499
624,625,0.809456
489,490,0.849203


In [292]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)

# Regression based Recommender system 

In [279]:
m

movieId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109374,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
109487,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
111759,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
112852,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [280]:
u

userId,age,time_spent_per_day,u_avg_rating,hour
1,-1.470292,0.341073,-0.073572,1.326258
2,-0.135616,-1.079947,0.426461,-0.067437
3,-0.802954,0.712624,0.174541,-1.286921
4,-0.302450,-0.940926,0.936982,1.510569
5,1.699565,-1.169532,-1.859363,0.455198
...,...,...,...,...
664,-0.469285,1.235109,0.515476,-0.241649
665,-0.802954,1.188999,-0.371286,-0.923511
666,-0.969789,-0.145549,-0.178836,-1.441776
667,-1.303458,0.135276,0.177221,-0.292573


In [282]:
r = ratings[["movieId","userId","rating"]].copy()

In [283]:
r

Unnamed: 0,movieId,userId,rating
0,16,1,4.0
1,24,1,1.5
2,32,1,4.0
3,47,1,4.0
4,50,1,4.0
...,...,...,...
105148,109374,668,4.0
105151,109487,668,4.0
105185,111759,668,3.0
105205,112852,668,4.0


In [285]:
r.shape

(63250, 3)

In [288]:
u = u.reset_index()
m = m.reset_index()

In [290]:
u.columns,m.columns,r.columns

(Index(['userId', 'age', 'time_spent_per_day', 'u_avg_rating', 'hour'], dtype='object'),
 Index(['movieId', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy',
        'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
        'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
        'Western'],
       dtype='object', name='genres'),
 Index(['movieId', 'userId', 'rating'], dtype='object'))

In [294]:
# One row per rating, joined with that user's features and that movie's genre flags
X = r.merge(u, on="userId", how="right").merge(m, on="movieId", how="right")

In [298]:
X.head(10)

Unnamed: 0,movieId,userId,rating,age,time_spent_per_day,u_avg_rating,hour,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,2,5.0,-0.135616,-1.079947,0.426461,-0.067437,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,1,5,4.0,1.699565,-1.169532,-1.859363,0.455198,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,1,8,5.0,0.364888,0.298545,0.160605,-0.241649,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1,11,4.0,-1.303458,0.513712,-0.380602,-1.390299,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,1,14,4.0,-0.30245,1.251552,-0.379415,-1.461133,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
5,1,17,5.0,0.865392,0.600791,0.736627,-0.413925,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
6,1,28,3.0,0.198054,-0.344992,-0.319827,-0.453407,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
7,1,29,4.0,0.364888,0.765658,-0.830785,-0.674662,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
8,1,30,4.5,0.031219,-0.487987,0.669767,0.241581,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9,1,31,4.0,0.531723,-0.238468,0.362826,-1.983768,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [297]:
X.shape

(63250, 26)

In [299]:
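X = X.drop(columns=["movieId","userId"])
# Caution: "rating" is still a column of X at this point, so the model below
# receives the target as an input feature and its scores are not meaningful.
y = X["rating"]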

In [302]:
X.sample(5)

Unnamed: 0,rating,age,time_spent_per_day,u_avg_rating,hour,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
44996,4.0,-0.469285,1.275938,-0.272891,0.228484,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
23468,5.0,0.364888,-0.568321,-0.020622,-1.090563,1,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0
46792,3.5,-0.135616,-0.918348,-1.362894,-0.441028,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
24238,3.0,0.531723,-1.532231,-0.628655,1.253893,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
45605,3.0,0.531723,-0.419353,-0.127406,-0.522324,0,1,1,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0


In [311]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=564)

In [312]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [313]:
from sklearn.metrics import mean_squared_error as mse
mse(y_test, y_pred)**0.5

3.0618394883826014e-05

In [314]:
model.score(X_test,y_test)

0.9999999990751834

In [315]:
X.shape

(63250, 24)
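
The `(63250, 24)` shape and the `X.sample(5)` output show that `X` still includes the `rating` column, which is also the target `y`. An RMSE of about 3e-05 and an R² of about 1.0 are therefore most likely the result of target leakage rather than of a good model. The toy sketch below illustrates the effect on synthetic data: a one-feature least-squares fit is trained once with the target as its own feature and once with an unrelated feature. All names and data are hypothetical, and the linear fit is a stand-in for illustration, not the notebook's gradient boosting model.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(x, y, slope, intercept):
    """Root mean squared error of the fitted line on (x, y)."""
    errors = [(slope * a + intercept - b) ** 2 for a, b in zip(x, y)]
    return (sum(errors) / len(errors)) ** 0.5

random.seed(0)
# Synthetic ratings on a 0.5-5.0 scale and a feature unrelated to them.
toy_ratings = [random.choice([i / 2 for i in range(1, 11)]) for _ in range(500)]
unrelated = [random.gauss(0, 1) for _ in toy_ratings]

train, test = slice(0, 400), slice(400, 500)
leaky = fit_line(toy_ratings[train], toy_ratings[train])  # target used as a feature
honest = fit_line(unrelated[train], toy_ratings[train])   # no target information

rmse_leaky = rmse(toy_ratings[test], toy_ratings[test], *leaky)
rmse_honest = rmse(unrelated[test], toy_ratings[test], *honest)
print(rmse_leaky, rmse_honest)
```

In the notebook itself, a leak-free setup would take `y = X["rating"]` first and then drop `"rating"` together with `"movieId"` and `"userId"` from `X` before the train/test split. The scores would need to be recomputed after that change.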