## What is Recommnder System?

<h4 style="background-color:antiquewhite;padding:20px">A recommendation system, also known as a recommender system or recommendation engine, is a machine learning algorithm that uses data to suggest items to users based on their preferences and behaviors</h4>

#### In recommendation systems, filtering refers to the methods used to make personalized recommendations based on user preferences, behaviors, or other data

<h2 style="color:crimson">Collaborative Filtering -</h2>

- __Definition__: Recommendations are made based on the behavior and preferences of similar users or items
- __Types__ : There are two types
  - User-based
  - Item-based
- __If person A has the same opinion as person B on an issue, A is more likely to have B's opinion on the different issues,
  when compared the opinion of a person chosen randomly__

<h2 style="color:crimson">Traditional Collaborative Filtering -</h2>

- __Definition__: System that relies on the historical behavior and preferences of users to recommend items
- Customer as a p-dimensional vector of items :
  - __p__: the number of distinct catalog items
  - __Components__: bought(1)/ not bought(0); ratings; rated(1)/ not rated(0)


Find __Similarity__ between Customers A And B

____

<div>

<img src="images\\Recommender_System_D1.png" alt='image' width="40%" border="2px">
<img src="images\\Recommender_System_D2.png" alt='image' width="40%" border="2px" style="padding:10px">
    
</div>

In [7]:
# Example

<img src="images\\Recommender_System_D3.png" alt='image' width="500px" border="2px" style="padding:10px">


<h2 style="color:crimson">Similarity in Collaborative Filtering -</h2>

In collaborative filtering, __similarity__ refers to the degree of resemblance between two entities (users or items) based on their behavior or attributes. Similarity measures help the system identify relationships, such as:

- Between Users: Finding users who have similar preferences or behavior.
- Between Items: Finding items that are often interacted with by the same users.

Common similarity metrics-
 - Cosine Similarity
 - Pearson Correlation Coefficient
 - Jaccard Similarity
 - Euclidean Distance

<h2 style="color:limegreen">Cosine-Based Similarity -</h2>

- <h4 style="color:purple" >Measures the cosine of the angle between two vectors (user or item vectors) in multi-dimensional space</h4>

  <img src="images\\Recommender_System_Cos1.png" alt="image" width="500px" border="1px">

- __Cos(0)=1__, which means items are exact similar to each other
- For the example, cos(A,B) is 0.94 which means A and B are much similar

<h2 style="color:limegreen">Correlation-Based Similarity -</h2>

- <h4 style="color:purple" >Measures linear correlation between two entities, considering the mean of their ratings</h4>

  <img src="images\\Recommender_System_Corr.png" alt="image" width="500px" border="1px">

In [12]:
# Example

<img src="images\\Recommender_System_D4.png" alt='image' width="500px" border="1px">


<h2 style="color:red">Normalizing Ratings -</h2>

- Multiply the vector components by the __inverse frequency__
- __Inverse frequency__ is the inverse of the number of customers who have purchased or rated the item

____

<img src="images\\Recommender_System_Measures.png" alt='image' width="500px" border="1px">

<h2 style="color:red">Once similar, what items to recommend? -</h2>

- The item that hasn't been baught by the user yet
- You may create a list of multiple items to be considered for recommendation
- Finally, recommend the item he/she most likely to buy
  - Rank each item according to how many similar customers purchased it
  - or rated by most
  - or highest rated
  - or some other popularity criteria

<h2 style="color:red">Negatives of Recommendation System -</h2>

- __Memory-Based / Lazy-learning__
- __Computation-intensive__- as it calculates for n^2 similarities

<h2 style="color:red">How to reduce computation? -</h2>

- Randomly sample customers
- Discard infrequent buyers
- Discard items that are very popular or very unpopular
- __Clustering__ can also reduce - meaning considering similar customers as a single row
- __PCA__ can reduce

<h2 style="color:red">Runtime vs Quality of Recommendation -</h2>

__When to recommend?__

- Recommend while customer is browsing
- Recommnd better but later

The second option may sound good but in order to track all the activities and recommend better, we have to send the notifications like mails,
which gets easily ignored


It's better to recommend while browsing

<h2 style="color:red">Search Based Methods -</h2>

__Based on previous purchases__
  - Books of the same authors
  - DVD titles of same director
  - Products that are identified by similar keywords

<div><img src="images\\Recommender_System_D6.png" alt='image' width="500px" border="1px">
<img src="images\\Recommender_System_D7.png" alt='image' width="500px" border="1px"></div>

<h2 style="color:red">A Critical Limitation of collaborative Filtering: Cold Start -</h2>

__Cold Start__:

- How to create a recommendation for __new users__
- How about new items

____
<h3 style="color:green">How to address Cold Start -</h3>

__Approaches to address cold start with new users__-

- Popular items(get quick reactions of users)
- Demographically relevant items
- Browsing history
- Secondary source of data - social network, subscriptions
- e.g. Netflix - starts with rating a few movies

____

__Approaches to address cold start with new items__-

- Recommend to random users or some selective users based on certain criteria
- How about offering the product to influential people in the social network

____

# Example Code

In [24]:
import numpy as np
import pandas as pd

In [25]:
df = pd.read_csv("Datasets\\Movie.csv")
df

Unnamed: 0,userId,movie,rating
0,3,Toy Story (1995),4.0
1,6,Toy Story (1995),5.0
2,8,Toy Story (1995),4.0
3,10,Toy Story (1995),4.0
4,11,Toy Story (1995),4.5
...,...,...,...
8987,7087,GoldenEye (1995),3.0
8988,7088,GoldenEye (1995),1.0
8989,7105,GoldenEye (1995),2.0
8990,7113,GoldenEye (1995),3.0


In [26]:
# lets try to fetch details with user id say 24

df[df.userId==24]

Unnamed: 0,userId,movie,rating
12,24,Toy Story (1995),4.0
4548,24,Father of the Bride Part II (1995),2.0
5212,24,Heat (1995),4.0
6469,24,Sabrina (1995),3.0
7449,24,GoldenEye (1995),3.0


In [27]:
df.sort_values('userId')

Unnamed: 0,userId,movie,rating
2569,1,Jumanji (1995),3.5
3724,2,Grumpier Old Men (1995),4.0
0,3,Toy Story (1995),4.0
5204,4,Heat (1995),3.0
7444,4,GoldenEye (1995),4.0
...,...,...,...
6463,7117,Heat (1995),5.0
2567,7119,Toy Story (1995),5.0
2568,7120,Toy Story (1995),4.5
3723,7120,Jumanji (1995),4.0


In [28]:
# As we can clearly see users are repeated as well as movies

In [29]:
# lets find out how many unique users are there

len(df.userId.unique())

4081

In [30]:
# or we can use

df.userId.value_counts().shape[0]

4081

In [31]:
# about ratings 

df.rating.value_counts()

rating
3.0    2736
4.0    2660
5.0    1394
3.5     679
2.0     542
4.5     374
2.5     277
1.0     212
1.5      61
0.5      57
Name: count, dtype: int64

In [32]:
# about movies

df.movie.value_counts()

movie
Toy Story (1995)                      2569
GoldenEye (1995)                      1548
Heat (1995)                           1260
Jumanji (1995)                        1155
Sabrina (1995)                         700
Grumpier Old Men (1995)                685
Father of the Bride Part II (1995)     657
Sudden Death (1995)                    202
Waiting to Exhale (1995)               138
Tom and Huck (1995)                     78
Name: count, dtype: int64

In [33]:
len(df.movie.unique())

10

In [34]:
# total 10 movies and 4081 users are there

In [35]:
# change structure of dataset so that we can compute the similarity score

movies_df = df.pivot(index='userId',columns='movie',values='rating')
movies_df

movie,Father of the Bride Part II (1995),GoldenEye (1995),Grumpier Old Men (1995),Heat (1995),Jumanji (1995),Sabrina (1995),Sudden Death (1995),Tom and Huck (1995),Toy Story (1995),Waiting to Exhale (1995)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,,,,,3.5,,,,,
2,,,4.0,,,,,,,
3,,,,,,,,,4.0,
4,,4.0,,3.0,,,,,,
5,,,,,3.0,,,,,
...,...,...,...,...,...,...,...,...,...,...
7115,4.0,,,,,,,,,
7116,3.5,,,,,,,,4.0,
7117,,3.0,4.0,5.0,,3.0,1.0,,4.0,
7119,,,,,,,,,5.0,


In [36]:
# NaN represents the user did not rated the perticular movie

In [37]:
# Let's fill NaNs with value 0 as there are no ratings

movies_df.fillna(0,inplace=True)

In [38]:
movies_df

movie,Father of the Bride Part II (1995),GoldenEye (1995),Grumpier Old Men (1995),Heat (1995),Jumanji (1995),Sabrina (1995),Sudden Death (1995),Tom and Huck (1995),Toy Story (1995),Waiting to Exhale (1995)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,0.0,0.0,0.0,0.0,3.5,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
4,0.0,4.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
7115,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7116,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
7117,0.0,3.0,4.0,5.0,0.0,3.0,1.0,0.0,4.0,0.0
7119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0


In [39]:
# packages for calculating similarities between users

from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation

In [40]:
# User similarity - pairwise - taking 2 rows at a time it will compute distance
# That distance is scaled between 0 to 1
# Suppose distance between 2 rows is 0.  We want similarity, not a distance.
# That is calculated as Similarity=1-distance. So 1-0=1. i.e. similarity in rows is 1.
# Suppose distance is 0.9. So 1-0.9=0.1
user_sim = 1 - pairwise_distances(movies_df.values,metric='cosine')

# pairwise_distances(..., metric='cosine') calculates pairwise distances between rows (users) in this matrix using the cosine distance metric.
# 1 - pairwise_distances(...) converts distances to similarities.
# Cosine distance ranges from 0 to 2, where 0 indicates no distance and 2 indicates maximum distance,
# so subtracting from 1 inverts this to give cosine similarity which ranges from -1 to 1.

In [41]:
user_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.55337157],
       [0.        , 1.        , 0.        , ..., 0.45883147, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.45883147, 1.        ,
        0.62254302],
       ...,
       [0.        , 0.45883147, 0.45883147, ..., 1.        , 0.45883147,
        0.47607054],
       [0.        , 0.        , 1.        , ..., 0.45883147, 1.        ,
        0.62254302],
       [0.55337157, 0.        , 0.62254302, ..., 0.47607054, 0.62254302,
        1.        ]])

In [42]:
# As we can see all diagonal elements are 1 i.e. similarity with self

In [43]:
# Let's replace all the diagonal with 0. This is because later we will consider only values>0, and this self similarity values may create confusion

In [44]:
np.fill_diagonal(user_sim,0)

In [45]:
user_sim

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.55337157],
       [0.        , 0.        , 0.        , ..., 0.45883147, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.45883147, 1.        ,
        0.62254302],
       ...,
       [0.        , 0.45883147, 0.45883147, ..., 0.        , 0.45883147,
        0.47607054],
       [0.        , 0.        , 1.        , ..., 0.45883147, 0.        ,
        0.62254302],
       [0.55337157, 0.        , 0.62254302, ..., 0.47607054, 0.62254302,
        0.        ]])

In [46]:
# Lets convert it to Dataframe

# both rows and columns represents userIds

user_sim_df = pd.DataFrame(user_sim,index=df.userId.unique(),columns=df.userId.unique())
user_sim_df

Unnamed: 0,3,6,8,10,11,12,13,14,16,19,...,6975,6979,6993,7030,7031,7044,7070,7080,7087,7105
3,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,1.000000,0.707107,0.000000,0.000000,0.000000,0.000000,0.000000,0.553372
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.390567,0.707107,0.615457,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.458831,0.000000,0.000000
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.650945,0.000000,0.492366,1.000000,0.874157,...,0.000000,1.000000,0.000000,0.707107,0.000000,0.000000,0.752577,0.458831,1.000000,0.622543
10,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.615457,0.000000,0.388514,...,0.800000,0.000000,0.000000,0.000000,0.989949,0.000000,0.000000,0.619422,0.000000,0.000000
11,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,1.000000,0.707107,0.000000,0.000000,0.000000,0.000000,0.000000,0.553372
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7044,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.658505,0.000000,0.000000,0.000000
7070,0.000000,0.000000,0.752577,0.000000,0.000000,0.489886,0.000000,0.370543,0.752577,0.657870,...,0.000000,0.752577,0.000000,0.532152,0.000000,0.658505,0.000000,0.345306,0.752577,0.468511
7080,0.000000,0.458831,0.458831,0.619422,0.000000,0.701884,0.567775,0.889532,0.458831,0.568212,...,0.344124,0.458831,0.000000,0.324443,0.648886,0.000000,0.345306,0.000000,0.458831,0.476071
7087,0.000000,0.000000,1.000000,0.000000,0.000000,0.650945,0.000000,0.492366,1.000000,0.874157,...,0.000000,1.000000,0.000000,0.707107,0.000000,0.000000,0.752577,0.458831,0.000000,0.622543


In [47]:
# idxmax() method returns a Series with the index of the maximum value for each column. (row 3 anc col 11 has highest value as 1)
# By specifying the column axis (axis='columns' or 1), the idxmax() method returns a Series with the index of the maximum value for each row.
user_sim_df.idxmax(axis=1)

3         11
6        168
8         16
10      4047
11         3
        ... 
7044      80
7070    1808
7080     708
7087       8
7105    4110
Length: 4081, dtype: int64

In [48]:
# 3 and 11 are most similar to each other than others likewise

In [49]:
#Most Similar Users- top 10
user_sim_df.idxmax(axis=1)[0:10]

3       11
6      168
8       16
10    4047
11       3
12    6676
13    5953
14    4138
16       8
19    3603
dtype: int64

In [50]:
user_sim_df.iloc[0:5,0:5]

Unnamed: 0,3,6,8,10,11
3,0.0,0.0,0.0,0.0,1.0
6,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0
11,1.0,0.0,0.0,0.0,0.0


In [51]:
# As per the output 11 and 3 are very similar as similarity is 1

In [52]:
# Now we know customer 6 and 168 are similar

# eg. find movies watched by customer 6 and 168 as they are similar
df[(df['userId']==6) | (df['userId']==168)]
# Both watched Toy Story with good rating, 6 watched 2 more movies.
# Now rating for Sabrina is more than other movie. So we can recommend that movie to 168.

Unnamed: 0,userId,movie,rating
1,6,Toy Story (1995),5.0
60,168,Toy Story (1995),4.5
3725,6,Grumpier Old Men (1995),3.0
6464,6,Sabrina (1995),5.0


# Additional Code

In [54]:
# Another way of getting output as above

In [55]:
user_6 = df[df.userId==6]
user_6

Unnamed: 0,userId,movie,rating
1,6,Toy Story (1995),5.0
3725,6,Grumpier Old Men (1995),3.0
6464,6,Sabrina (1995),5.0


In [56]:
user_168 = df[df.userId==168]
user_168

Unnamed: 0,userId,movie,rating
60,168,Toy Story (1995),4.5


In [57]:
# Lets merge them together 

pd.merge(user_6,user_168,on='movie',how='left') 

# It is exaclty same as SQL query, it's a left join

Unnamed: 0,userId_x,movie,rating_x,userId_y,rating_y
0,6,Toy Story (1995),5.0,168.0,4.5
1,6,Grumpier Old Men (1995),3.0,,
2,6,Sabrina (1995),5.0,,


In [58]:
# Let's try inner join too


pd.merge(user_6,user_168,on='movie',how='inner')

Unnamed: 0,userId_x,movie,rating_x,userId_y,rating_y
0,6,Toy Story (1995),5.0,168,4.5
