# <p style="background-color:lightgray; font-family:verdana; font-size:250%; text-align:center; border-radius: 15px 20px;">🟠Recommendation Systems🟠</p>

### **Content is abundant, but each user cares about only a narrow slice of it, so the available content has to be filtered by user interest. Recommendation systems do exactly this: they suggest products or services to users by applying the techniques covered below.**

<center><img src="https://i.imgur.com/e1omXMN.png" width="800" height="800"></center>

# <p style="border-radius:10px; border:#DEB887 solid; padding:25px; background-color: #FFFAF0; font-size:100%;color:#52017A;text-align:center;"> Simple Recommendation System </p>

### **This is the simplest kind of system: general, non-personalized recommendations built from business knowledge or basic techniques rather than a learned model. Typical examples are the top-rated items in a category, the currently trending items, and the all-time favorites (the "legends").**

# <p style="border-radius:10px; border:#DEB887 solid; padding:25px; background-color: #FFFAF0; font-size:100%;color:#52017A;text-align:center;">Association Rule Learning</p>

### **These are product recommendations based on rules learned through association analysis, commonly known as basket analysis. Many companies use this technique, particularly in e-commerce. It applies rule-based machine learning to the patterns and relationships in transaction data to estimate how likely items are to be purchased together, and recommends items accordingly.**

> A well-known anecdote: customers who buy diapers are often said to buy beer as well, so a retailer such as Walmart can place diapers and beer in adjacent aisles to increase sales.

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'> Apriori Algorithm</font></h3>

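### It is a basket analysis method used to uncover product associations. It relies on three metrics: support (how often an itemset appears across all baskets), confidence (how often the consequent appears in baskets that contain the antecedent), and lift (confidence divided by the consequent's support; values above 1 mean the items co-occur more often than chance).
### The Apriori algorithm works iteratively from a support threshold chosen up front. It starts with single items, eliminates those below the threshold, combines the survivors into larger candidate itemsets, and repeats the elimination in each iteration until the final table of frequent itemsets remains.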
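
The three metrics can be checked by hand on a few baskets. The following is a minimal sketch in plain Python; the baskets are made-up toy data, the function names are illustrative, and `frequent_itemsets` is a brute-force stand-in that only applies the support threshold (it does not implement Apriori's candidate pruning).

```python
from itertools import combinations

# Toy baskets (hypothetical data, not taken from the retail dataset).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's own support (1.0 = independence)."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

def frequent_itemsets(baskets, min_support, max_len=2):
    """Brute force: keep every itemset up to `max_len` that meets `min_support`."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            value = support(combo, baskets)
            if value >= min_support:
                result[combo] = value
    return result

print(support({"diapers", "beer"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
print(lift({"diapers"}, {"beer"}, baskets))
print(frequent_itemsets(baskets, min_support=0.5))
```

Raising `min_support` shrinks the result quickly, which is the same lever the `min_support` argument controls in the notebook's `apriori` call.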

<center><img src="https://i.imgur.com/HXOBp5m.png" width="800" height="800"></center>

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'> 1- Data Preprocessing</font></h3>

In [1]:
#importing our libraries
#!pip install mlxtend
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules, fpmax

import warnings
warnings.filterwarnings("ignore")

In [2]:
df_ = pd.read_csv("/kaggle/input/online-retail-ii-data-set-from-ml-repository/Year 2010-2011.csv",encoding='iso-8859-9')

In [3]:
df = df_.copy()
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [4]:
def allinone(dataframe):
    dataframe = dataframe.dropna()                                           #drop rows with missing values
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)] #invoices containing "C" are returns/cancellations, so we exclude them
    dataframe = dataframe[dataframe["Quantity"] > 0]                         #keep rows with quantity greater than 0
    dataframe = dataframe[dataframe["Price"] > 0]                            #keep rows with price greater than 0
    return dataframe


In [5]:
df = allinone(df)
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Quantity,397885.0,12.988208,179.331551,1.0,2.0,6.0,12.0,80995.0
Price,397885.0,3.116525,22.097861,0.001,1.25,1.95,3.75,8142.75
Customer ID,397885.0,15294.416882,1713.144421,12346.0,13969.0,15159.0,16795.0,18287.0


In [6]:
def corr_skew_outliner(df, cols):
    #Cap each column at its 5th and 95th percentiles to limit the effect of outliers
    for col in cols:
        
        lower = df[col].quantile(0.05)
        upper = df[col].quantile(0.95)
        df.loc[df[col] < lower, col] = lower
        df.loc[df[col] > upper, col] = upper
        #df[col] = np.sqrt(df[col])
        
    return df

In [7]:
cols = ["Quantity","Price"]

corr_skew_outliner(df,cols)

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,12/9/2011 12:50,4.95,12680.0,France
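
The same capping idea can be written with pandas' built-in `Series.clip`. This is a small sketch on a made-up series; `clip_to_quantiles` is a hypothetical helper name and is not part of the notebook. Unlike the function above, it returns a new series instead of modifying the input.

```python
import pandas as pd

def clip_to_quantiles(series, lower=0.05, upper=0.95):
    """Return a copy of `series` capped at the given lower/upper quantiles."""
    low = series.quantile(lower)
    high = series.quantile(upper)
    return series.clip(lower=low, upper=high)

# Toy quantities with one extreme value (hypothetical data).
quantity = pd.Series([1, 2, 2, 3, 4, 5, 6, 8, 12, 80995])
clipped = clip_to_quantiles(quantity)
print(clipped)
```

With only ten values the 95th percentile is still pulled up by the extreme value, so on small samples a tighter upper quantile or a fixed cap may be more appropriate.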


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>2- Preparing ARL data (invoice-product Matrix)</font></h3>

In [8]:
#ARL needs a basket matrix that is easy to work with: each invoice is a row (a basket) and each product is a column.
#We restrict the analysis to customers from France:
df_fr = df[df['Country'] == "France"]

In [9]:
fr_inv_pro_df = df_fr.groupby(["Invoice","StockCode"])["Quantity"].count().unstack().notnull()                                               
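
To see what this chain produces, here is the same `groupby` / `unstack` / `notnull` pattern applied to a tiny made-up frame (the invoice and stock codes are hypothetical). Each row is an invoice, each column a product, and each cell says whether the product appears on that invoice.

```python
import pandas as pd

# Toy transactions (hypothetical data).
toy = pd.DataFrame({
    "Invoice":   ["A1", "A1", "A2", "A2", "A3"],
    "StockCode": ["10", "20", "10", "30", "20"],
    "Quantity":  [2, 1, 5, 1, 3],
})

# count() yields one value per (invoice, product) pair that exists,
# unstack() spreads products into columns (missing pairs become NaN),
# notnull() turns the result into True/False flags.
basket = (
    toy.groupby(["Invoice", "StockCode"])["Quantity"]
       .count()
       .unstack()
       .notnull()
)
print(basket)
```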

In [10]:
def check_id(dataframe, stock_code):
    product_name = dataframe[dataframe["StockCode"] == stock_code][["Description"]].values[0].tolist()
    print(product_name)
    
check_id(df_fr, "10120")

['DOGGY RUBBER']


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>3-Association Rules </font></h3>

In [11]:
#First we use the apriori function to compute the support of every itemset that meets the minimum support threshold:
frequent_itemsets = apriori(fr_inv_pro_df, 
                            min_support=0.01, 
                            use_colnames=True)

In [12]:
#Listing the frequent itemsets with their support (the share of invoices that contain them), highest first:
frequent_itemsets.sort_values("support", ascending=False)

Unnamed: 0,support,itemsets
538,0.773779,(POST)
392,0.187661,(23084)
112,0.179949,(21731)
248,0.172237,(22554)
250,0.169666,(22556)
...,...,...
18790,0.010283,"(21086, 22326, 22382, 23256)"
18789,0.010283,"(21086, 22326, 22728, 22382)"
18788,0.010283,"(21086, 22727, 22326, 22382)"
18787,0.010283,"(21086, 22726, 22382, 22326)"


In [13]:
#Extracting the association rules:
rules = association_rules(frequent_itemsets, 
                          metric="support", 
                          min_threshold=0.01)

In [14]:
#For example, we filtered the rules based on support higher than 0.05, confidence higher than 0.1, and lift higher than 5:
rules[(rules["support"]>0.05) & (rules["confidence"]>0.1) & (rules["lift"]>5)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1652,(21080),(21086),0.133676,0.138817,0.102828,0.769231,5.541311,0.084271,3.731791,0.945994
1653,(21086),(21080),0.138817,0.133676,0.102828,0.740741,5.541311,0.084271,3.341535,0.951642
1654,(21080),(21094),0.133676,0.128535,0.102828,0.769231,5.984615,0.085646,3.776350,0.961424
1655,(21094),(21080),0.128535,0.133676,0.102828,0.800000,5.984615,0.085646,4.331620,0.955752
1822,(21086),(21094),0.138817,0.128535,0.123393,0.888889,6.915556,0.105550,7.843188,0.993284
...,...,...,...,...,...,...,...,...,...,...
213954,"(22727, 22728)","(POST, 22726)",0.074550,0.087404,0.059126,0.793103,9.074037,0.052610,4.410883,0.961473
213955,"(22726, 22728)","(POST, 22727)",0.074550,0.089974,0.059126,0.793103,8.814778,0.052418,4.398458,0.957971
213957,(22727),"(POST, 22726, 22728)",0.095116,0.064267,0.059126,0.621622,9.672432,0.053013,2.473008,0.990860
213958,(22726),"(POST, 22727, 22728)",0.097686,0.069409,0.059126,0.605263,8.720273,0.052346,2.357498,0.981172


In [15]:
#Checking the name of the product:
check_id(df_fr, "21086")

['SET/6 RED SPOTTY PAPER CUPS']


In [16]:
#Sorting the obtained results by confidence from high to low using sort values:
rules[(rules["support"]>0.05) & (rules["confidence"]>0.1) & (rules["lift"]>5)]. \
sort_values("confidence", ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
23827,"(21080, 21094)",(21086),0.102828,0.138817,0.100257,0.975000,7.023611,0.085983,34.447301,0.955918
23826,"(21080, 21086)",(21094),0.102828,0.128535,0.100257,0.975000,7.585500,0.087040,34.858612,0.967673
108918,"(POST, 21080, 21086)",(21094),0.084833,0.128535,0.082262,0.969697,7.544242,0.071358,28.758355,0.947858
108919,"(POST, 21080, 21094)",(21086),0.084833,0.138817,0.082262,0.969697,6.985410,0.070486,28.419023,0.936271
1823,(21094),(21086),0.128535,0.138817,0.123393,0.960000,6.915556,0.105550,21.529563,0.981563
...,...,...,...,...,...,...,...,...,...,...
7235,(22629),(22630),0.125964,0.100257,0.071979,0.571429,5.699634,0.059351,2.099400,0.943382
62278,(22630),"(POST, 22629)",0.100257,0.100257,0.053985,0.538462,5.370809,0.043933,1.949443,0.904490
62275,"(POST, 22629)",(22630),0.100257,0.100257,0.053985,0.538462,5.370809,0.043933,1.949443,0.904490
62279,(22629),"(POST, 22630)",0.125964,0.074550,0.053985,0.428571,5.748768,0.044594,1.619537,0.945098


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>4-Providing Product Recommendations to Users at the Basket Stage </font></h3>

In [17]:
#In practice the rules are computed ahead of time and stored as a table of what to recommend for each product.
#As soon as a user enters the site and adds something to the cart, the recommendation is read from that table.
#Sample product ID in the user's cart: 22492
product_id = "22492"
check_id(df, product_id)

['MINI PAINT SET VINTAGE ']


In [18]:
#In this scenario, we sorted based on lift; similarly, it could be sorted based on confidence, etc. It's up to interpretation.
sorted_rules = rules.sort_values("lift", ascending=False)

In [19]:
#We iterate over the 'antecedents' column of the lift-sorted rules. Whenever an antecedent set contains our product, we take
#the first item of the 'consequents' set in the same row and add it to the list. Because the rules are sorted by lift,
#the strongest associations come first. We use a list because several rules may match.
recommendation_list = []

for i, product in enumerate(sorted_rules["antecedents"]):                                
    for j in list(product):                                                              
        if j == product_id:                                                              
            recommendation_list.append(list(sorted_rules.iloc[i]["consequents"])[0])     


In [20]:
#If there are multiple values, we limit the list to the first 10:
recommendation_list[0:10]

['22659',
 '23238',
 '22551',
 '22659',
 '23238',
 '22554',
 '22554',
 '22554',
 '22554',
 '22554']

In [21]:
#Checking the product with the ID 22728:
check_id(df, "22728")

['ALARM CLOCK BAKELIKE PINK']


In [22]:
def arl_recommender(rules_df, product_id, rec=1):
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    for i, product in enumerate(sorted_rules["antecedents"]):
        for j in list(product):
            if j == product_id :
                for k in list(sorted_rules.iloc[i]["consequents"]):
                    if k not in recommendation_list:
                        recommendation_list.append(k)

    return recommendation_list[0:rec]
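
A quick way to check the function is to call it on a hand-made rules table. The sketch below repeats the function so that it runs on its own; `toy_rules` is hypothetical data that only mimics the columns the function reads (`antecedents` and `consequents` as frozensets, plus `lift`).

```python
import pandas as pd

def arl_recommender(rules_df, product_id, rec=1):
    # Same logic as above: walk the rules from highest to lowest lift and
    # collect the consequents of every rule whose antecedent contains the product.
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    for i, product in enumerate(sorted_rules["antecedents"]):
        for j in list(product):
            if j == product_id:
                for k in list(sorted_rules.iloc[i]["consequents"]):
                    if k not in recommendation_list:
                        recommendation_list.append(k)
    return recommendation_list[0:rec]

# Toy rules (hypothetical values).
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"A"}), frozenset({"A", "B"}), frozenset({"C"}), frozenset({"A"})],
    "consequents": [frozenset({"X"}), frozenset({"Y"}), frozenset({"A"}), frozenset({"Y"})],
    "lift": [2.0, 5.0, 9.0, 3.0],
})

print(arl_recommender(toy_rules, "A", rec=2))
print(arl_recommender(toy_rules, "Z"))
```

The highest-lift rule (`C` to `A`) is skipped because `A` appears only as its consequent, and `Y` is listed once even though two rules suggest it.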

# <p style="border-radius:10px; border:#DEB887 solid; padding:25px; background-color: #FFFAF0; font-size:100%;color:#52017A;text-align:center;">Content Based Recommendation</p>

### Recommendations are based on similarities in product content. For instance, if a user reads a book in a certain category, a book with similar content is recommended. To compare texts we represent them mathematically: product descriptions are turned into vectors with the Count Vector or TF-IDF method, which capture their key words, and descriptions whose vectors are similar are treated as similar products.

### **Count Vectorizer:**

> Step 1: Place all unique terms (words) in columns and all documents (such as tweets, product titles) in rows.

> Step 2: Place the frequency of term occurrences in the documents in the cells.

### **TF-IDF:**
It normalizes word counts both by how frequent a word is within its own text and by how common it is across all the texts in the data, so words that appear almost everywhere carry less weight.

> Step 1: Compute the count vectorizer.

> Step 2: Compute TF (Term Frequency) - the frequency of the term 't' in the relevant document divided by the total number of terms in the document.

> Step 3: Compute IDF (Inverse Document Frequency) as follows:
* $\mathrm{IDF}(t) = 1 + \ln\left(\dfrac{N + 1}{\mathrm{df}(t) + 1}\right)$, where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing the term $t$

> Step 4: Compute TF * IDF - multiply TF by IDF.

> Step 5: Perform L2 normalization.
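
The five steps can be traced by hand on a tiny corpus. This is a minimal sketch in plain Python that follows the steps as listed; the three documents are made up, tokenization is a simple whitespace split, and `tfidf_rows` is an illustrative name, not a library function.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data).
docs = ["the cat sat", "the cat saw the dog", "dogs bark"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})

def tfidf_rows(tokenized, vocab):
    n_docs = len(tokenized)
    # Step 1: term counts per document (the count vectorizer).
    counts = [Counter(doc) for doc in tokenized]
    # Number of documents that contain each term.
    doc_freq = {t: sum(t in c for c in counts) for t in vocab}
    # Step 3: smoothed inverse document frequency.
    idf = {t: 1 + math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocab}
    rows = []
    for doc, c in zip(tokenized, counts):
        # Steps 2 and 4: term frequency times IDF.
        weights = [c[t] / len(doc) * idf[t] for t in vocab]
        # Step 5: L2 normalization so every row has length 1.
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return rows

rows = tfidf_rows(tokenized, vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first document, "sat" gets a higher weight than "the" or "cat" because it occurs in fewer documents. Because of the final L2 normalization, dividing by the document length in step 2 does not change the resulting row.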

In [23]:
#importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>1. TF-IDF Matrix building </font></h3>

In [24]:
movies = pd.read_csv("/kaggle/input/the-movies-dataset/movies_metadata.csv",
                    usecols=["id","overview","title","vote_average","vote_count","release_date"],low_memory=False)

movies.head()


In [25]:
movies = movies.reset_index(drop=True)
movies = movies.dropna()
movies = movies.drop_duplicates()
movies = movies.rename(columns={"id":"movieId"})
movies["movieId"] = movies["movieId"].astype("int64")

In [26]:
#We will focus only on the overview from the dataset:
movies["overview"].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [27]:
#We are using the TF-IDF method and setting up the model:
tfidf = TfidfVectorizer(stop_words="english", min_df = 4)
#stop_words="english" removes commonly used words such as 'and', 'the', 'on', 'in', as they carry little meaning;
#min_df=4 drops terms that appear in fewer than 4 overviews.

In [28]:
#We replaced NaN values with blanks as NaNs can cause issues in calculations:
movies['overview'] = movies['overview'].fillna('')

In [29]:
#After fitting, we transform the data:
tfidf_matrix = tfidf.fit_transform(movies['overview'])

In [30]:
#The matrix has one row per movie overview (44,407) and one column per term kept by the vectorizer (23,834):
tfidf_matrix.shape

(44407, 23834)

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>2. Cosine Similarity Matrix </font></h3>

In [31]:
#This is the part where we find which movies are similar to each other, mathematically speaking, using text vectors.
cosine_sim = cosine_similarity(tfidf_matrix,
                               tfidf_matrix)
#cosine_sim[i, j] is the similarity between overview i and overview j, so the matrix is square:
cosine_sim.shape

(44407, 44407)
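
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a small sketch in plain Python on made-up vectors; returning 0.0 for an all-zero vector is a convention chosen here for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

# Toy document vectors (hypothetical values).
vectors = [
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],  # same direction as the first vector
    [0.0, 3.0, 0.0],  # shares no terms with the others
]
sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(x, 3) for x in row])
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors with no terms in common score 0. A dense 44,407 x 44,407 matrix of 64-bit floats takes roughly 16 GB, so the full matrix may not fit in memory on smaller machines.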

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #FFF0F4; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>3.Making Recommendations Based on Similarities </font></h3>

In [33]:
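#To interpret the scores we need names: map each title to its row index so that a title can be looked up in cosine_sim.
#cosine_sim is addressed by row position, so this lookup points at the right row only when movies.index runs 0..n-1 without gaps.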
indices = pd.Series(movies.index, index=movies['title'])

In [34]:
#Some titles appear more than once:
indices.index.value_counts()

title
Cinderella              11
Hamlet                   9
Beauty and the Beast     8
Alice in Wonderland      8
Les Misérables           8
                        ..
No Greater Love          1
A Woman in Berlin        1
Talhotblond              1
Tortilla Flat            1
Queerama                 1
Name: count, Length: 41303, dtype: int64

In [35]:
#We keep one of the duplicate movies and delete the others. We take the last one for freshness.
indices = indices[~indices.index.duplicated(keep='last')]

In [36]:
#We note the index of the movie "Sherlock Holmes".
movie_index = indices["Sherlock Holmes"]

In [37]:
#Accessing cosine_sim with the index of Sherlock Holmes.
cosine_sim[movie_index]

array([0.00630183, 0.00923754, 0.        , ..., 0.        , 0.01089884,
       0.        ])

In [38]:
#We create a dataframe called similarity_scores and retrieve the similar ones, evaluating them as scores.
similarity_scores = pd.DataFrame(cosine_sim[movie_index],columns=["score"])

In [40]:
#Fetching the top 10 movies with the highest scores. The first observation includes the movie itself, so we use 1 to 11.
movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index

In [41]:
#Retrieving the titles of the movies with index information.
movies['title'].iloc[movie_indices]

35745    The Dog of Flanders
16735    The Heart Elsewhere
31594         We Can Do That
30608      Drama of Jealousy
25451             Marvellous
44609             The Mitten
21348      Darling Companion
12104        The Dog Problem
33454       The Empty Canvas
42413         Death by Death
Name: title, dtype: object

In [42]:
def content_based_recommender(title, cosine_sim, dataframe):
    # build a title -> index lookup, keeping the last entry for duplicated titles
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    indices = indices[~indices.index.duplicated(keep='last')]
    # look up the index of the requested title
    movie_index = indices[title]
    # similarity scores of every movie against the requested one
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
    # top 10 most similar movies, skipping the first hit (the movie itself)
    movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index
    return dataframe['title'].iloc[movie_indices]

In [43]:
content_based_recommender("The Matrix", cosine_sim, movies)

27610                So Sweet, So Dead
3534                             Lured
21                             Copycat
2069                            Frenzy
20626       The Wandering Soul Murders
7583             The Stendhal Syndrome
28141               Mark Strikes Again
23816    Tables Turned on the Gardener
26944            Whistling in Brooklyn
28203                Kommissarie Späck
Name: title, dtype: object
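
One detail worth checking in a recommender like this is that the similarity matrix is addressed by row position, while a dataframe's index holds labels. The two can differ when rows are dropped without a final `reset_index`. The sketch below is a hypothetical variant that always works with positions; the toy frame, the similarity values, and the name `recommend_by_position` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame whose index has gaps, as can happen after dropna()/drop_duplicates().
toy = pd.DataFrame({"title": ["A", "B", "C", "D"]}, index=[0, 2, 5, 9])

# Hypothetical similarity matrix; row/column k refers to the k-th row of `toy`.
toy_sim = [
    [1.0, 0.1, 0.8, 0.3],
    [0.1, 1.0, 0.2, 0.9],
    [0.8, 0.2, 1.0, 0.4],
    [0.3, 0.9, 0.4, 1.0],
]

def recommend_by_position(title, sim, dataframe, top_n=2):
    # Map each title to its row position (0..n-1) rather than its index label.
    positions = pd.Series(range(len(dataframe)), index=dataframe["title"])
    positions = positions[~positions.index.duplicated(keep="last")]
    pos = int(positions[title])
    # Scores of every row against the requested one, minus the row itself.
    scores = pd.Series(sim[pos]).drop(pos)
    best = scores.sort_values(ascending=False).head(top_n).index
    return dataframe["title"].iloc[best].tolist()

print(recommend_by_position("B", toy_sim, toy))
print(recommend_by_position("A", toy_sim, toy))
```

Dropping the requested row by position also avoids relying on the movie itself always sorting first, which matters when another row ties at a score of 1.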

# <p style="border-radius:10px; border:#DEB887 solid; padding:25px; background-color: #FFFAF0; font-size:100%;color:#52017A;text-align:center;">Item-Based Collaborative Filtering</p>

### Recommendations are made based on item similarity.

> Example: Suppose a viewer likes a particular movie. Another movie is recommended when its pattern of likes and dislikes across users resembles that of the liked movie.
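
The idea can be sketched on a tiny, fully rated matrix. The ratings below are made up, and unlike the notebook's matrix (which only records whether a user rated a movie) this toy example correlates the rating values themselves.

```python
import pandas as pd

# Toy user x movie ratings (hypothetical data): rows are users, columns are movies.
ratings = pd.DataFrame(
    {
        "m1": [5, 4, 1, 2],
        "m2": [4, 5, 2, 1],
        "m3": [1, 2, 5, 4],
    },
    index=["u1", "u2", "u3", "u4"],
)

target = "m1"
# Correlate every other movie's column with the target movie's column.
similarity = (
    ratings.drop(columns=target)
           .corrwith(ratings[target])
           .sort_values(ascending=False)
)
print(similarity)
```

Here `m2` is rated much like `m1` by the same users and comes out on top, while `m3` is rated in the opposite way and gets a negative correlation.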

In [45]:
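rating = pd.read_csv('/kaggle/input/the-movies-dataset/ratings_small.csv')

#Note: movieId in ratings_small.csv is a MovieLens ID, while movies["movieId"] was renamed from the TMDB "id" column;
#the dataset's links_small.csv maps one to the other. The merge below joins the two ID columns directly.
df = pd.merge(movies,rating, how="inner", on="movieId")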

In [46]:
df.head()

Unnamed: 0,movieId,overview,release_date,title,vote_average,vote_count,userId,rating,timestamp
0,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,23,3.5,1148721092
1,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,102,4.0,956598942
2,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,232,2.0,955092697
3,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,242,5.0,956688825
4,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,263,3.0,1117846575


In [48]:
#A typical user has rated only a few movies and left the rest unrated, so most cells of the user-movie matrix are empty,
#which causes performance issues. On larger data, reductions are usually needed, such as excluding movies with fewer than 1000 ratings.

#The resulting 'user_movie_df' dataframe has users as the index and movie IDs as the columns; each value is True if the user rated the movie and False otherwise
user_movie_df = df.groupby(["userId","movieId"])["rating"].mean().unstack().notnull()

In [49]:
user_movie_df.head()

movieId,2,3,5,6,11,12,13,14,15,16,...,132961,133365,134158,134569,134881,140174,142507,148652,158238,160718
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [50]:
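#picking an ID at random; sample(1) draws a row, so .index[0] is a userId, and the value is used below as a movieId column label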
sample_movie = user_movie_df.sample(1,random_state=45).index[0]

sample_movie

196

In [51]:
#the reference movie's column: which users rated it
filtered = user_movie_df[sample_movie]

In [52]:
#dropping the reference movie's own column before comparing
user_movie_df_wo = user_movie_df.drop(sample_movie,axis=1)

In [53]:
#correlating every remaining movie column with the reference column
movies_similarity = user_movie_df_wo.corrwith(filtered)

In [54]:
movies_similarity.sort_values(ascending=False).head(20)

movieId
160    0.388101
172    0.359755
435    0.346429
173    0.342352
316    0.331515
253    0.331465
22     0.326370
317    0.324579
165    0.324017
592    0.321798
198    0.321120
145    0.321120
587    0.315399
329    0.313288
292    0.307931
204    0.296570
153    0.295394
344    0.292703
379    0.289515
426    0.287628
dtype: float64

In [55]:
#similar movies
movies_similarity = movies_similarity.sort_values(ascending=False).reset_index()
movies_similarity.columns = ["movieId","movies_similarity"]
movies_similarity.head()

Unnamed: 0,movieId,movies_similarity
0,160,0.388101
1,172,0.359755
2,435,0.346429
3,173,0.342352
4,316,0.331515


In [58]:
#keeping the rows of df that belong to the five most correlated movie IDs
filtered_movies = df[df['movieId'].isin([160, 172, 435, 173, 316])]

In [59]:
filtered_movies['title'].value_counts()

title
Grill Point                            145
20,000 Leagues Under the Sea            70
The Arrival of a Train at La Ciotat     63
The Day After Tomorrow                  55
Star Trek V: The Final Frontier         48
Name: count, dtype: int64

# <p style="border-radius:10px; border:#DEB887 solid; padding:25px; background-color: #FFFAF0; font-size:100%;color:#52017A;text-align:center;">User-Based Collaborative Filtering</p>

### Recommendations are made based on the similarities between users.

> Example: We take the movies a user has watched and find other users who have watched many of the same movies. Among the movies those users watched but our initial user has not, the ones favored by the most strongly correlated users are recommended.
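
The scoring step can be sketched in plain Python: each rating is weighted by how strongly its author correlates with the target user, and the weighted ratings are averaged per movie. The correlations and ratings below are made-up values, and `weighted_scores` is an illustrative name.

```python
# Correlation of each similar user with the target user (hypothetical values).
user_corr = {"u1": 0.9, "u2": 0.5}

# (user, movie, rating) triples (hypothetical values).
ratings = [
    ("u1", "m1", 5.0),
    ("u1", "m2", 2.0),
    ("u2", "m1", 4.0),
    ("u2", "m3", 5.0),
    ("u9", "m2", 5.0),  # not a similar user, so this rating is ignored
]

def weighted_scores(user_corr, ratings):
    """Average of correlation * rating per movie, over the similar users only."""
    totals, counts = {}, {}
    for user, movie, rating in ratings:
        if user not in user_corr:
            continue
        totals[movie] = totals.get(movie, 0.0) + user_corr[user] * rating
        counts[movie] = counts.get(movie, 0) + 1
    return {movie: totals[movie] / counts[movie] for movie in totals}

scores = weighted_scores(user_corr, ratings)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

A high rating from a weakly correlated user counts for less than the same rating from a strongly correlated one, which is the point of the weighting.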

In [60]:
df.head()

Unnamed: 0,movieId,overview,release_date,title,vote_average,vote_count,userId,rating,timestamp
0,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,23,3.5,1148721092
1,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,102,4.0,956598942
2,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,232,2.0,955092697
3,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,242,5.0,956688825
4,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,263,3.0,1117846575


In [62]:
#the merged data has 44,823 ratings
df.shape

(44823, 9)

In [63]:
#they cover 2,772 distinct movie titles
df["title"].nunique()

2772

In [65]:
# number of ratings for each movie
comments = df["title"].value_counts()
comments

title
Terminator 3: Rise of the Machines    324
The Million Dollar Hotel              311
Solaris                               305
The 39 Steps                          291
Monsoon Wedding                       274
                                     ... 
Things to Come                          1
Portrait in Black                       1
Les Visiteurs du Soir                   1
The Warped Ones                         1
The One-Man Band                        1
Name: count, Length: 2772, dtype: int64

In [66]:
# movies with fewer than 5 ratings
rare_movies = comments[comments < 5].index

rare_movies

Index(['Hotel Rwanda', 'Double Trouble', 'The African Queen', 'Che: Part Two',
       'Ronja Robbersdaughter', 'The Story of a Cheat', 'Jinxed!',
       'Cruel Intentions 3', 'Enigma', 'Jekyll and Hyde ... Together Again',
       ...
       'The Model Couple', 'Arthur and the Revenge of Maltazard',
       'Blue Like Jazz', 'Kismet', 'Erotic Nights of the Living Dead',
       'Things to Come', 'Portrait in Black', 'Les Visiteurs du Soir',
       'The Warped Ones', 'The One-Man Band'],
      dtype='object', name='title', length=1429)

In [68]:
#removing the movies with fewer than 5 ratings
clean_df = df[~df["title"].isin(rare_movies)]

In [70]:
#building user_title_df: users as rows, titles as columns, True where the user rated the title
user_title_df = clean_df.groupby(["userId","title"])["rating"].mean().unstack().notnull()

In [71]:
user_title_df.head()

title,10 Items or Less,10 Things I Hate About You,15 Minutes,1984,2 Days in Paris,"20,000 Leagues Under the Sea",2001: A Space Odyssey,24 Hour Party People,25th Hour,28 Days Later,...,Young Adam,Young Frankenstein,Young and Innocent,Z,Zatoichi,Zazie dans le métro,Zodiac,eXistenZ,xXx,À nos amours
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False


In [72]:
#picking a random user
lucky_guy = user_title_df.sample(1,random_state=45).index[0]

In [73]:
#the sampled user's row of user_title_df
random_user_df = user_title_df[user_title_df.index == lucky_guy]

In [74]:
#the titles our user rated; user_title_df is boolean and has no NaN, so dropna(axis=1) keeps every column here
movies_watched = random_user_df.dropna(axis=1).columns.tolist()

In [75]:
movies_watched_df = user_title_df[movies_watched]

In [76]:
#number of these movies each user has in common with our user;
#notnull() is True for every cell of a boolean frame, so each count equals the number of columns
user_movie_count = movies_watched_df.notnull().sum(axis=1)

In [77]:
#users who share more than 60% of these movies with our user
users_same_movies = user_movie_count[user_movie_count > (movies_watched_df.shape[1] * 60 ) / 100].index

users_same_movies

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       662, 663, 664, 665, 666, 667, 668, 669, 670, 671],
      dtype='int64', name='userId', length=671)

In [78]:
#keeping only the rows of those users
filted_df = movies_watched_df[movies_watched_df.index.isin(users_same_movies)]

filted_df

title,10 Items or Less,10 Things I Hate About You,15 Minutes,1984,2 Days in Paris,"20,000 Leagues Under the Sea",2001: A Space Odyssey,24 Hour Party People,25th Hour,28 Days Later,...,Young Adam,Young Frankenstein,Young and Innocent,Z,Zatoichi,Zazie dans le métro,Zodiac,eXistenZ,xXx,À nos amours
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
668,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
669,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
670,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [79]:
#pairwise correlation between users; drop_duplicates() drops repeated values, such as the mirrored (a, b) / (b, a) pairs
corr_df = filted_df.T.corr().unstack().drop_duplicates() 

In [80]:
# users ranked by their correlation with our sampled user
corr_df[lucky_guy].sort_values(ascending=False)

userId
541    0.246396
476    0.146818
525    0.132043
592    0.110805
463    0.104639
         ...   
391   -0.037857
559   -0.040912
287   -0.041207
396   -0.047080
434   -0.049696
Length: 359, dtype: float64
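
To make the similarity step concrete, here is a minimal plain-Python sketch of the Pearson correlation that `.corr()` computes between users. The `toy_ratings` data and the `pearson` helper are hypothetical and only illustrate the idea; they are not part of the notebook's data set.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: each user's ratings for the same five movies.
toy_ratings = {
    "u1": [5, 4, 1, 2, 5],
    "u2": [4, 5, 2, 1, 4],
    "u3": [1, 2, 5, 4, 1],
}

target = "u1"
# Correlation of the target user with every other user.
similarities = {
    other: pearson(toy_ratings[target], vec)
    for other, vec in toy_ratings.items()
    if other != target
}
# Most similar users first, mirroring sort_values(ascending=False).
ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Here `u2` rates much like `u1` and gets a high positive correlation, while `u3` has exactly the opposite taste and lands at -1.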

In [83]:
# correlation higher than 0.10
top_users = pd.DataFrame(corr_df[lucky_guy][corr_df[lucky_guy] > 0.10], columns=["corr"])

top_users

userId,corr
294,0.102905
463,0.104639
476,0.146818
525,0.132043
541,0.246396
575,0.101738
592,0.110805


In [84]:
top_users_ratings = pd.merge(top_users, rating[["userId", "movieId", "rating"]], how='inner', on="userId")

top_users_ratings

Unnamed: 0,userId,corr,movieId,rating
0,294,0.102905,1,4.0
1,294,0.102905,5,3.5
2,294,0.102905,7,5.0
3,294,0.102905,10,3.5
4,294,0.102905,11,3.0
...,...,...,...,...
2318,592,0.110805,4299,4.0
2319,592,0.110805,4340,3.0
2320,592,0.110805,4344,5.0
2321,592,0.110805,4369,5.0


In [85]:
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']

In [86]:
recommendation_df = top_users_ratings.pivot_table(values="weighted_rating", index="movieId", aggfunc="mean")

recommendation_df

movieId,weighted_rating
1,0.323708
2,0.335518
5,0.360166
6,0.528171
7,0.446676
...,...
42738,0.411618
44613,0.360166
45028,0.411618
45499,0.411618
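
The scoring above multiplies each rating by the rater's correlation with the target user and then averages per movie. The sketch below repeats that on hypothetical toy data (`user_corr` and `rows` are made up for illustration). It also shows a commonly used normalized variant, which divides by the sum of the correlations. That variant is an assumption added for comparison, not what the notebook does.

```python
from collections import defaultdict

# Hypothetical similar users and their correlation with the target user.
user_corr = {"a": 0.25, "b": 0.15}
# Hypothetical (userId, movieId, rating) rows for those users.
rows = [("a", 10, 5.0), ("a", 20, 3.0), ("b", 10, 4.0), ("b", 30, 5.0)]

weighted = defaultdict(list)   # movieId -> list of corr * rating
corr_seen = defaultdict(list)  # movieId -> list of corr of the raters
for user, movie_id, rating in rows:
    weighted[movie_id].append(user_corr[user] * rating)
    corr_seen[movie_id].append(user_corr[user])

# Notebook-style score: plain mean of corr * rating per movie.
mean_scores = {m: sum(v) / len(v) for m, v in weighted.items()}

# Normalized variant (assumption): weighted average on the rating scale.
normalized_scores = {m: sum(weighted[m]) / sum(corr_seen[m]) for m in weighted}

top_by_mean = sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True)
print(top_by_mean)
print(normalized_scores)
```

The plain mean stays on a small scale (which is why the notebook's cut-off is 0.7 rather than a rating-like value). The normalized variant returns values on the original rating scale and can rank movies differently.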


In [87]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 0.7].sort_values(by="weighted_rating", ascending=False).head(10)

In [88]:
movies["title"][movies["movieId"].isin(movies_to_be_recommend.index)]

3786                  Dancer in the Dark
7234                    Dawn of the Dead
9399              A Very Long Engagement
9805             Elevator to the Gallows
11922                     License to Wed
21863    Frankenstein Conquers the World
Name: title, dtype: object

# <p style="border-radius:10px; border:#DEB887 solid; padding:25px; background-color: #FFFAF0; font-size:100%;color:#52017A;text-align:center;">Model-Based Filtering: Matrix Factorization</p>

### Matrix factorization is a modeling technique for filling the gaps in the user-movie rating matrix, for example when a user has watched movies similar to those of other users but has not rated some of them. It assumes that users and movies share a set of latent features (hidden factors), learns the weights of those factors for every user and movie from the existing ratings, and uses the learned weights to predict the missing ratings.
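
As a rough illustration of the idea, the following plain-Python sketch learns latent factors with stochastic gradient descent on a tiny, hypothetical rating list. It is a simplified stand-in (no bias terms) and not the Surprise implementation used below. The function names, hyperparameters, and `toy_ratings` are all assumptions made for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sse(ratings, P, Q):
    """Sum of squared errors over the observed ratings."""
    return sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.02, reg=0.02, n_epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ r."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    initial_error = sse(ratings, P, Q)
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])
            for f in range(n_factors):
                p_uf, q_if = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
    return P, Q, initial_error

# Hypothetical (user, item, rating) triples; (0, 2) is one of the gaps.
toy_ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5), (1, 2, 1),
               (2, 0, 1), (2, 2, 5), (3, 1, 2), (3, 2, 4)]

P, Q, initial_error = factorize(toy_ratings, n_users=4, n_items=3)
final_error = sse(toy_ratings, P, Q)
missing = dot(P[0], Q[2])  # predicted rating for the unobserved pair (0, 2)
print(initial_error, final_error, missing)
```

The error on the observed ratings drops as the factors are learned, and the same factors then give an estimate for pairs that were never rated.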

In [89]:
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate

In [90]:
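movie = pd.read_csv('/kaggle/input/movielens-20m-dataset/movie.csv')
rating = pd.read_csv('/kaggle/input/movielens-20m-dataset/rating.csv')
# Note: the merge below uses `movies` (the table defined earlier in the notebook),
# not the `movie` frame loaded above, so the output carries its metadata columns.
df = pd.merge(movies, rating, how="inner", on="movieId")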
df.head()

Unnamed: 0,movieId,overview,release_date,title,vote_average,vote_count,userId,rating,timestamp
0,862,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,Toy Story,7.7,5415.0,73,3.0,2000-07-08 04:17:32
1,862,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,Toy Story,7.7,5415.0,614,4.0,2002-11-16 22:38:22
2,862,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,Toy Story,7.7,5415.0,812,3.0,2001-10-10 07:30:34
3,862,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,Toy Story,7.7,5415.0,1188,5.0,1999-12-22 20:44:01
4,862,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,Toy Story,7.7,5415.0,1199,4.0,2003-03-06 20:36:55


In [91]:
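# Pick 4 movies and their ids (titles listed in the same order as the ids).
# Named `movie_titles` so the `movies` DataFrame used above is not overwritten.
movie_ids = [130219, 356, 4422, 541]
movie_titles = ["The Dark Knight (2011)",
                "Forrest Gump (1994)",
                "Cries and Whispers (Viskningar och rop) (1972)",
                "Blade Runner (1982)"]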

In [92]:
# Downsample the data set to the selected movies, since the full set is too large
sample_df = df[df.movieId.isin(movie_ids)]
sample_df.head()


Unnamed: 0,movieId,overview,release_date,title,vote_average,vote_count,userId,rating,timestamp
3174313,541,Frankie is a heroin addict and sits in prison....,1955-12-14,The Man with the Golden Arm,6.9,51.0,1,4.0,2005-04-02 23:30:03
3174314,541,Frankie is a heroin addict and sits in prison....,1955-12-14,The Man with the Golden Arm,6.9,51.0,2,5.0,2000-11-21 15:36:54
3174315,541,Frankie is a heroin addict and sits in prison....,1955-12-14,The Man with the Golden Arm,6.9,51.0,3,5.0,1999-12-11 13:14:07
3174316,541,Frankie is a heroin addict and sits in prison....,1955-12-14,The Man with the Golden Arm,6.9,51.0,11,4.5,2009-01-01 05:25:03
3174317,541,Frankie is a heroin addict and sits in prison....,1955-12-14,The Man with the Golden Arm,6.9,51.0,21,5.0,2001-06-10 16:09:04


In [94]:
# 30,526 ratings
sample_df.shape

(30526, 9)

In [95]:
# Rows = users, Columns = movies
user_movie_df = sample_df.pivot_table(index=["userId"],
                                      columns=["title"],
                                      values="rating")


In [97]:
# We declare the rating scale as 1 to 5. MovieLens ratings start at 0.5, so (0.5, 5) would match the data more closely.
reader = Reader(rating_scale=(1, 5))

In [98]:
#We are converting it into a suitable data format for Surprise:
data = Dataset.load_from_df(sample_df[['userId','movieId','rating']], reader)


In [99]:
#Modelling phase
trainset, testset = train_test_split(data, test_size=.25)
svd_model = SVD()
svd_model.fit(trainset)
predictions = svd_model.test(testset)

In [100]:
accuracy.rmse(predictions)

RMSE: 0.8758


0.8757937238818508
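
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand on hypothetical (true, predicted) pairs. The `rmse` helper and the numbers are illustrative only and do not come from the notebook's test set.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

# Hypothetical pairs: every prediction is off by exactly half a star.
pairs = [(4.0, 3.5), (5.0, 4.5), (3.0, 3.5), (2.0, 2.5)]
error = rmse(pairs)
print(error)  # 0.5
```

Read this way, an RMSE of about 0.88 means the predictions are typically a bit less than one star away from the true rating.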

In [101]:
svd_model.predict(uid=1.0, iid=541, verbose=True)

user: 1.0        item: 541        r_ui = None   est = 4.04   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.0382910972035875, details={'was_impossible': False})

In [102]:
sample_df[sample_df["userId"] == 1]

Unnamed: 0,movieId,overview,release_date,title,vote_average,vote_count,userId,rating,timestamp
3174313,541,Frankie is a heroin addict and sits in prison....,1955-12-14,The Man with the Golden Arm,6.9,51.0,1,4.0,2005-04-02 23:30:03


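Our model predicted 4.04, while the actual rating is 4.0. The two are close for this user, though a single example says little about overall accuracy; the RMSE above is the better guide.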

In [104]:
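# Suggest the `sug` highest-predicted movies that the user has not rated yet
def suggest(df, user_id, sug):
    # Movies the user has already rated
    watched = set(df.loc[df["userId"] == user_id, "movieId"])
    # Candidates: every movie in df that the user has not rated
    didnt_watch = [m for m in df["movieId"].unique() if m not in watched]

    # Predicted rating per candidate (`.est` is the estimate field of Prediction)
    temp_dict = {i: svd_model.predict(uid=user_id, iid=i).est for i in didnt_watch}

    suggestions = (pd.DataFrame(list(temp_dict.items()), columns=["movieId", "possible_rate"])
                   .sort_values(by="possible_rate", ascending=False)
                   .head(sug))
    # Attach titles from the MovieLens `movie` table
    merged = pd.merge(suggestions, movie[["movieId", "title"]], how="inner", on="movieId")

    return merged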

In [106]:
suggest(df,15,50).sort_values(by="title", ascending=False).head(10)

Unnamed: 0,movieId,possible_rate,title
1,89501,4.130886,William S. Burroughs: A Man Within (2010)
49,44671,4.130886,"Wild Blue Yonder, The (2005)"
15,87499,4.130886,Tyler Perry's Why Did I Get Married Too? (2010)
11,78128,4.130886,True Legend (Su Qi-Er) (2010)
48,84834,4.130886,Three Kingdoms: Resurrection of the Dragon (Sa...
16,41863,4.130886,"Three Burials of Melquiades Estrada, The (2006)"
42,126186,4.130886,The Sex and Violence Family Hour (1983)
19,124414,4.130886,The Secret of Convict Lake (1951)
5,55167,4.130886,Tekkonkinkreet (Tekkon kinkurîto) (2006)
3,56508,4.130886,Starting Out in the Evening (2007)
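
Every row above has the same `possible_rate`. A likely explanation is how Surprise's `SVD` scores items it never saw during training. Its estimate is the global mean plus the user bias, plus the item bias and the factor dot product only when the item is known. Since `svd_model` was fit on `sample_df` alone, most candidates in `df` are unknown items and collapse to the same value. The sketch below mimics that rule in plain Python. `svd_estimate` and all the numbers are hypothetical, chosen only to illustrate the effect.

```python
def svd_estimate(global_mean, user_bias=None, item_bias=None,
                 user_factors=None, item_factors=None, rating_scale=(1, 5)):
    """Mimic a biased-SVD estimate; None marks an unknown user or item."""
    est = global_mean
    if user_bias is not None:
        est += user_bias
    if item_bias is not None:
        est += item_bias
    # The factor term only applies when both user and item are known.
    if user_factors is not None and item_factors is not None:
        est += sum(p * q for p, q in zip(user_factors, item_factors))
    low, high = rating_scale
    return min(high, max(low, est))  # clip to the rating scale

# Hypothetical learned values for one known user.
mu, b_u, p_u = 3.9, 0.23, [0.1, -0.2]

# Two different movies that were not in the training set: same estimate.
unseen_a = svd_estimate(mu, user_bias=b_u, user_factors=p_u)
unseen_b = svd_estimate(mu, user_bias=b_u, user_factors=p_u)

# A movie that was in the training set gets its own bias and factors.
seen = svd_estimate(mu, user_bias=b_u, item_bias=0.3,
                    user_factors=p_u, item_factors=[0.5, 0.4])
print(unseen_a, unseen_b, seen)
```

Under this reading, the ranking among unseen movies is arbitrary. Fitting the model on more movies, or restricting the candidates to movies present in the training set, would be one way to get differentiated suggestions.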
