# Introduction
Recommender Systems:
1. User Based Recommender Systems
1. Item Based Recommender Systems

<br>What is a recommender system?
   * Based on a user's past behaviour, it predicts how likely the user is to prefer an item.
   * For example, Netflix uses a recommender system. It suggests new movies to people according to their past activity, such as the movies they watched and rated.
   * The purpose of a recommender system is to recommend new things that people have not seen before.
   
<br>
1. User Based Collaborative Filtering
    * Collaborative filtering makes recommendations by combining your own experience with the experiences of other people.
    * First, we build a user vs item matrix.
        * Each row is a user and each column is an item such as a movie, product or website.
    * Second, we compute similarity scores between users.
        * Each row of the matrix is a user, so each user is represented by a vector.
        * Compute the similarity between these rows (users).
    * Third, we find the users who are similar to you based on past behaviour.
    * Finally, the system suggests items that you have not experienced yet but that those similar users have.
    * Let's walk through an example of user based collaborative filtering.
        * Suppose there are two people.
        * The first one watched two movies: Lord of the Rings and Hobbit.
        * The second one watched only Lord of the Rings.
        * User based collaborative filtering computes the similarity of these two people and sees that both watched Lord of the Rings.
        * It then recommends Hobbit to the second person, as shown in the picture.
        * <a href="https://ibb.co/droZMy"><img src="https://preview.ibb.co/feq3EJ/resim_a.jpg" alt="resim_a" border="0"></a>
        
    * User based collaborative filtering has some problems.
        * In this system each row of the matrix is a user, so comparing every pair of users to find similar ones is computationally expensive when there are many users.
        * People's habits also change, so recommendations that were correct and useful can become stale over time.
    * To address these problems, let's look at another recommender system: item based collaborative filtering.
1. Item Based Collaborative Filtering
    * In this system, instead of finding relationships between users, the items themselves (movies, products and so on) are compared with each other.
    * In user based recommender systems, users' habits can change, which makes recommendation hard. In item based recommender systems, the movies or products themselves do not change, so recommendation is easier.
    * In addition, there are billions of potential users in the world and usually far fewer items. Comparing people therefore costs much more computation than comparing items.
    * Item based recommender systems use the same user vs item matrix as user based recommender systems.
        * Each row is a user and each column is an item such as a movie, product or website.
        * This time, however, instead of calculating the similarity between rows, we calculate the similarity between columns, which are the items.
    * Let's look at how it works.
        * First, Lord of the Rings and Hobbit are similar because both are liked by three different people. This gives a similarity score between the two movies.
        * If the similarity is high enough, we can recommend Hobbit to other people who have only watched Lord of the Rings, as shown in the figure below.
        * <a href="https://imgbb.com/"><img src="https://image.ibb.co/maEQdd/resim_b.jpg" alt="resim_b" border="0"></a>
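
The two approaches above can be sketched on a tiny matrix. This is only an illustration under assumptions: the data is made up to match the Lord of the Rings / Hobbit example, the names `watched` and `cosine_similarity` are hypothetical, and cosine similarity is just one possible similarity measure (the notebook itself uses correlation later on).

```python
import numpy as np
import pandas as pd

# Toy user vs item matrix: 1 = watched, 0 = not watched (made-up data).
watched = pd.DataFrame(
    [[1, 1], [1, 1], [1, 1], [1, 0]],
    index=["user_1", "user_2", "user_3", "user_4"],
    columns=["Lord of the Rings", "Hobbit"],
)

def cosine_similarity(frame):
    """Pairwise cosine similarity between the rows of `frame`."""
    values = frame.to_numpy(dtype=float)
    unit = values / np.linalg.norm(values, axis=1, keepdims=True)
    return pd.DataFrame(unit @ unit.T, index=frame.index, columns=frame.index)

user_similarity = cosine_similarity(watched)    # user based: compare rows
item_similarity = cosine_similarity(watched.T)  # item based: compare columns

# User based: take the user most similar to user_4 and suggest
# what that user watched and user_4 did not.
target = "user_4"
nearest = user_similarity[target].drop(target).idxmax()
mask = (watched.loc[nearest] == 1) & (watched.loc[target] == 0)
user_based_recommendations = mask[mask].index.tolist()

# Item based: similarity score between the two movies themselves.
lotr_hobbit_similarity = item_similarity.loc["Lord of the Rings", "Hobbit"]

print(user_based_recommendations)
print(round(lotr_hobbit_similarity, 3))
```

Note that the user based path needs a users x users similarity matrix, while the item based path needs an items x items one, which is where the difference in cost comes from when there are many more users than items.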




# Starting Code

In [2]:
import pandas as pd

import os
print(os.listdir())


['genome_scores.csv', 'genome_tags.csv', 'link.csv', 'movie.csv', 'rating.csv', 'RecommenderSystemEDA.ipynb', 'tag.csv']


In [3]:
movie = pd.read_csv("./movie.csv")
movie.columns

Index(['movieId', 'title', 'genres'], dtype='object')

In [4]:
movie = movie.loc[:,["movieId","title"]]
movie.head()

| | movieId | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | Jumanji (1995) |
| 2 | 3 | Grumpier Old Men (1995) |
| 3 | 4 | Waiting to Exhale (1995) |
| 4 | 5 | Father of the Bride Part II (1995) |


In [7]:
rating = pd.read_csv("./rating.csv")
rating.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

In [17]:
rating = rating.loc[:,["userId","movieId","rating"]]
rating.head()

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 2 | 3.5 |
| 1 | 1 | 29 | 3.5 |
| 2 | 1 | 32 | 3.5 |
| 3 | 1 | 47 | 3.5 |
| 4 | 1 | 50 | 3.5 |


In [9]:

# Inner join on the only shared column, movieId (same result as the default merge)
data = pd.merge(movie, rating, on="movieId")

In [10]:

data.head(10)

| | movieId | title | userId | rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 3 | 4.0 |
| 1 | 1 | Toy Story (1995) | 6 | 5.0 |
| 2 | 1 | Toy Story (1995) | 8 | 4.0 |
| 3 | 1 | Toy Story (1995) | 10 | 4.0 |
| 4 | 1 | Toy Story (1995) | 11 | 4.5 |
| 5 | 1 | Toy Story (1995) | 12 | 4.0 |
| 6 | 1 | Toy Story (1995) | 13 | 4.0 |
| 7 | 1 | Toy Story (1995) | 14 | 4.5 |
| 8 | 1 | Toy Story (1995) | 16 | 3.0 |
| 9 | 1 | Toy Story (1995) | 19 | 5.0 |


* As the data frame above shows, we have 4 features: movie id, title, user id and rating.
* Using this data frame, we will build an item based recommender system.
* Let's look at the shape of the data. The data frame has about 20 million samples, which is a lot. That can cause problems on Kaggle and even in desktop IDEs such as Spyder or PyCharm.
* Therefore, to learn how item based recommendation works, let's use 1 million of the samples.

In [11]:
data.shape

(20000263, 4)

In [12]:
data = data.iloc[:1000000,:]
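
One thing to keep in mind about this subsample: a positional slice keeps only the first rows, and the head of the merged frame above is ordered by movie, so the slice most likely covers only the first few movies. As a hedged alternative, the sketch below compares a positional slice with random sampling on a made-up toy frame. `DataFrame.sample` is a real pandas method; the frame `toy` and the sample sizes are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the merged frame (made-up values, ordered by movieId).
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3, 3, 3],
    "userId":  [3, 6, 8, 3, 6, 3, 8, 9],
    "rating":  [4.0, 5.0, 4.0, 3.0, 2.5, 5.0, 4.5, 3.5],
})

# Positional slice: only the first rows, so only the first movies survive.
head_slice = toy.iloc[:4]

# Random rows drawn from the whole frame (reproducible via random_state).
random_rows = toy.sample(n=4, random_state=0)

# Alternative: keep every rating from a random subset of users,
# so each kept user still has a complete rating history.
kept_users = toy["userId"].drop_duplicates().sample(n=2, random_state=0)
by_user = toy[toy["userId"].isin(kept_users)]

print(sorted(head_slice["movieId"].unique()))
print(len(random_rows), by_user["userId"].nunique())
```

Which option fits best depends on the goal: the slice is the cheapest, while sampling by user keeps each user's row of the pivot table intact.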

In [13]:
# Pivot so that rows are users, columns are movie titles and values are ratings
# (pivot_table averages duplicate user/title pairs by default)
pivot_table = data.pivot_table(index = ["userId"],columns = ["title"],values = "rating")
pivot_table.head(10)

title,Ace Ventura: When Nature Calls (1995),Across the Sea of Time (1995),"Amazing Panda Adventure, The (1995)","American President, The (1995)",Angela (1995),Angels and Insects (1995),Anne Frank Remembered (1995),Antonia's Line (Antonia) (1995),Assassins (1995),Babe (1995),...,Unforgettable (1996),Up Close and Personal (1996),"Usual Suspects, The (1995)",Vampire in Brooklyn (1995),Waiting to Exhale (1995),When Night Is Falling (1995),"White Balloon, The (Badkonake sefid) (1995)",White Squall (1996),Wings of Courage (1995),"Young Poisoner's Handbook, The (1995)"
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,3.5,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,5.0,,,,,,,
4,3.0,,,,,,,,,,...,,,,,,,,,,
5,,,,5.0,,,,,,,...,,2.0,,,,,,,,
6,,,,,,,,,,,...,,4.0,,,,,,,,
7,,,,4.0,,,,,,,...,,,,,,,,,,
8,1.0,,,,,,,,,,...,,,,,,,,,,
10,,,,4.0,,,,,,,...,,,,,,,,,,
11,3.5,,,,,,,,,,...,,,,,,,,,,


* As the table above shows, rows are users, columns are movies and values are ratings.
* For example, user 11 gives a rating of 3.5 to "Ace Ventura: When Nature Calls (1995)" and a rating of 3.0 to "Bad Boys (1995)".
* Now let's set up a scenario: we run a movie web site, and people have watched and rated "Bad Boys (1995)". The question is which movie we should recommend to the people who watched "Bad Boys (1995)".
* To answer this question, we will find the similarity between "Bad Boys (1995)" and the other movies.
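
To make the layout of the pivot table concrete, here is a minimal sketch on made-up ratings. The frame `long_ratings` and the movie names are hypothetical. It shows how one row per (user, movie) pair turns into a users x movies table, with NaN wherever a user has not rated a movie.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) pair.
long_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title":  ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Rows become users, columns become titles, cells hold the rating.
wide = long_ratings.pivot_table(index="userId", columns="title", values="rating")

print(wide)
print(wide.loc[1, "Movie A"])        # rating user 1 gave to Movie A
print(wide["Movie C"].isna().sum())  # users who did not rate Movie C
```

Most cells of the real table are NaN for the same reason: each user rates only a small fraction of the movies.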

In [14]:
movie_watched = pivot_table["Bad Boys (1995)"]
similarity_with_other_movies = pivot_table.corrwith(movie_watched)  # find correlation between "Bad Boys (1995)" and other movies
similarity_with_other_movies = similarity_with_other_movies.sort_values(ascending=False)
similarity_with_other_movies.head()

Bad Boys (1995)                        1.000000
Headless Body in Topless Bar (1995)    0.723747
Last Summer in the Hamptons (1995)     0.607554
Two Bits (1995)                        0.507008
Shadows (Cienie) (1988)                0.494186
dtype: float64

* Based on correlation alone, the top candidate to recommend to people who watched "Bad Boys (1995)" is "Headless Body in Topless Bar (1995)".
* However, this ranking ignores the number of ratings each movie has, which also matters: a correlation computed from only a few shared ratings is not reliable.
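
To illustrate why the number of ratings matters, the sketch below uses made-up ratings in which a movie with only two shared ratings correlates perfectly with the target. It then filters by the number of users who rated both movies. The threshold `min_common` and all data are assumptions for illustration; this is one simple way to handle the issue, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up users x movies ratings; NaN means "not rated".
ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 2.0, 1.0, 3.0],
        "Movie B": [4.5, 4.0, 2.5, 1.5, 3.0],
        "Movie C": [np.nan, np.nan, 2.0, 1.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

target = ratings["Movie A"]

# Pearson correlation of every movie with the target (as in the cell above).
similarity = ratings.corrwith(target)

# Number of users who rated both the target and each movie.
common = ratings[target.notna()].count()

# Keep only movies with enough shared ratings (threshold is an assumption).
min_common = 3
reliable = (
    similarity[common >= min_common]
    .drop("Movie A")
    .sort_values(ascending=False)
)

print(similarity.round(3))
print(common)
print(reliable.round(3))
```

Here "Movie C" gets the highest correlation only because any two points lie on a straight line. After the filter, "Movie B" is the only remaining candidate.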

# Conclusion
So far I have learned:

* User based recommendation systems are not ideal and are more resource intensive
* Item based recommendation systems are more likely to yield better results
* How to find the correlation or similarity between two vectors
* How to build a very basic movie recommendation system on a subsample of the data


**TODO:**

* Revisit the k-means clustering notes to group movies into n clusters of typical user profiles

* Then, reapply this algorithm (or the improved one mentioned in my AI class) to recommend the most similar movies within each user's profile cluster

* Also, I need to determine the best feedback mechanism/loss function and how to incorporate the user liking or disliking a movie. This will likely be an F1 or precision style metric, which relies on the user identifying true positives and false positives
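
As a starting point for the feedback item above, here is a minimal sketch of precision, recall and F1 computed from like/dislike feedback. The function name `precision_recall_f1` and the example movies are hypothetical, and this is only one possible way to score recommendations.

```python
def precision_recall_f1(recommended, liked):
    """Score a recommendation list against the set of movies the user liked."""
    recommended, liked = set(recommended), set(liked)
    true_pos = len(recommended & liked)    # recommended and liked
    false_pos = len(recommended - liked)   # recommended but not liked
    false_neg = len(liked - recommended)   # liked but never recommended

    precision = true_pos / (true_pos + false_pos) if recommended else 0.0
    recall = true_pos / (true_pos + false_neg) if liked else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up feedback: 4 recommendations, the user liked 2 of them
# plus one movie that was never recommended.
recommended = ["Hobbit", "Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"]
liked = ["Hobbit", "Toy Story (1995)", "Babe (1995)"]

precision, recall, f1 = precision_recall_f1(recommended, liked)
print(precision, round(recall, 3), round(f1, 3))
```

True and false positives are enough for precision. Recall, and therefore F1, also needs false negatives, meaning movies the user liked that were never recommended, so the feedback mechanism would have to capture those as well.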

**NOTE:**

This is NOT a perfect approach, but it's an easy way to get pretty good recommendations.

# Reviewing link.csv

In [6]:
ls

 Volume in drive C is OS
 Volume Serial Number is C21C-532E

 Directory of c:\Users\mlar5\OneDrive\Desktop\Code Folder\Python Projects\MoveMe\MoveMe\Recommender System R&D

10/21/2022  03:57 PM    <DIR>          .
10/24/2022  01:38 PM    <DIR>          ..
10/24/2022  01:38 PM       214,322,450 genome_scores.csv
10/24/2022  01:38 PM            20,363 genome_tags.csv
10/24/2022  01:38 PM           539,334 link.csv
10/24/2022  01:38 PM         1,493,648 movie.csv
10/24/2022  01:38 PM       690,353,377 rating.csv
10/21/2022  07:35 PM            32,711 RecommenderSystemEDA.ipynb
10/24/2022  01:38 PM        21,725,514 tag.csv
               7 File(s)    928,487,397 bytes
               2 Dir(s)  581,971,386,368 bytes free


In [8]:
links = pd.read_csv('./link.csv')

In [10]:
links

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |
| ... | ... | ... | ... |
| 27273 | 131254 | 466713 | 4436.0 |
| 27274 | 131256 | 277703 | 9274.0 |
| 27275 | 131258 | 3485166 | 285213.0 |
| 27276 | 131260 | 249110 | 32099.0 |


In [11]:
links.columns

Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')

In [16]:
brokenLinks = links[links['imdbId'].isnull()]

In [17]:
brokenLinks.head()

| | movieId | imdbId | tmdbId |
|---|---|---|---|


In [20]:
a = movie[movie['movieId']==131260]

In [21]:
a

| | movieId | title |
|---|---|---|
| 27276 | 131260 | Rentun Ruusu (2001) |


In [23]:
links.dtypes

movieId      int64
imdbId       int64
tmdbId     float64
dtype: object

# **Important** Link.csv Conclusion

The `imdbId` values that are supposed to start with a 0 have lost their leading zeros, because the column was read as int64!

As a result, we will have to convert the int64 values to strings and left-pad them with zeros to 7 digits (for example, a 6-digit ID gets one 0 prepended).
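
The fix described above can be sketched as follows. The CSV text here is a made-up three-row stand-in for link.csv, and the sketch assumes that valid IDs are 7 digits long and that the raw file still contains the leading zeros. If it does, reading the column as a string avoids the problem in the first place.

```python
import io
import pandas as pd

# Made-up stand-in for link.csv (assumes the raw file keeps leading zeros).
csv_text = (
    "movieId,imdbId,tmdbId\n"
    "1,0114709,862\n"
    "2,0113497,8844\n"
    "131258,3485166,285213\n"
)

# Default parsing: imdbId becomes int64 and the leading zeros are lost.
links_int = pd.read_csv(io.StringIO(csv_text))

# Fix 1: convert to string and left-pad with zeros to 7 characters.
padded = links_int["imdbId"].astype(str).str.zfill(7)

# Fix 2: read the column as a string so nothing is lost.
links_str = pd.read_csv(io.StringIO(csv_text), dtype={"imdbId": str})

print(links_int["imdbId"].tolist())
print(padded.tolist())
print(links_str["imdbId"].tolist())
```

`str.zfill(7)` also covers IDs that lost more than one leading zero, which a "prepend a single 0 when the length is 6" rule would miss.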