# Introduction

This is a movie recommender system built for the Explore-AI competition on Kaggle.

## The importance of making recommendations

<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Long_tail_problem.jpg"
     alt="The Long Tail Problem"
     style="float: center; padding-bottom=0.5em"
     width=600px/>
The Long tail problem is often experienced by content distributors. 
</div>


We exist in a technological era where there is far too much content (movies, news articles, shopping products, websites, etc.) for individual items to receive our consideration. For example, consider that the average Google search returns well over 1 million results, yet when was the last time you looked at the websites past the [first page](https://backlinko.com/google-ctr-stats)?  This fact is often illustrated by what is known as the "long tail problem" (represented in the figure above), where tracking user engagement with items in a large content repository sees a small number of these items receiving a disproportionate amount of attention. In contrast, the majority of items remain unexplored. The truth is that a user doesn't know of each item that exists, nor has the time to inspect each item even if it were known. 

In light of the above challenge, a natural question for service providers becomes: "How do I ensure that an individual is shown a manageable portion of the total content I have available while also ensuring that this content is relevant to and desired by them?" This question turns out to be extremely valuable, both economically and within society. Luckily for us, decades of hard work by very intelligent individuals have largely answered this question through a collection of algorithms and computing techniques known as recommender systems.


Simply put, a recommender system functions by predicting a user’s rating or preference for an item. This allows a service provider to build up a catalogue of items it believes the user will want to examine, thereby increasing their engagement with the service and allowing a wider array of content to be considered.

### Terminology: Users, items, and ratings  

The first thing we need to do when discussing recommender systems is to clarify some terminology. A recommender system has two primary sets of entities: the users and the items.

As we’d expect, **an item is consumed**. It can be watched, read, bought, clicked on, or considered. Items are passive, meaning that their properties or nature do not change.

**Users are individuals who interact with the items in a recommendation system.  Users create ratings for specific items within a recommendation system through their actions.** Ratings can be either *explicit* (such as giving your favourite movie 5/5 stars on a review) or *implicit* (such as watching a movie; even though you haven't rated it directly, by viewing something, you indicate that you have some interest in it).

A given user can have ratings for many items in the system or none at all. Generally, as a user continues to interact with a recommender system, it can capture her preferences and ratings for items more easily.
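
To make the terminology concrete, the sketch below builds a tiny user-item rating matrix with pandas. The user IDs, movie IDs, and ratings are made up for illustration and are not taken from the dataset. A missing entry (`NaN`) means that user has no rating for that item.

```python
import pandas as pd

# Hypothetical explicit ratings: one row per (user, item, rating) interaction.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3, 3],
        "movieId": [10, 20, 10, 10, 20, 30],
        "rating": [5.0, 3.0, 4.0, 2.0, 4.5, 1.0],
    }
)

# Pivot into a user-item matrix: rows are users, columns are items.
user_item = ratings.pivot(index="userId", columns="movieId", values="rating")

# Count how many items each user has rated.
ratings_per_user = user_item.notna().sum(axis=1)

print(user_item)
print(ratings_per_user)
```

User 2 has rated only one item here, so the system knows much less about that user's preferences than it does about user 3's.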

## Measuring similarity 

<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Cosine_similarity.jpg"
     alt="Cosine Similarity "
     style="float: center; padding-bottom=0.5em"
     width=600px/>
Measuring the similarity between the ratings of two users (A) and (B) for the books 'Harry Potter and the Philosopher's Stone' and 'The Diary of a Young Girl', using the Cosine similarity metric.  
</div>


Having learnt about the entities which exist within recommender systems, we may wonder how they function. One fundamental principle we need to understand is that recommender systems are built up by utilising the _existing relations_ between items and users. As such, these systems always require a mechanism to measure how related or _similar_ a user is to another user or an item is to another item. 

We accomplish this measurement through a _similarity metric_.

Generally speaking, a similarity metric can be considered the inverse of a distance measure. If two things are considered very similar, they should be assigned a high similarity value (close to 1), while dissimilar items should receive a low similarity value (close to zero).  Other [important properties](https://online.stat.psu.edu/stat508/lesson/1b/1b.2/1b.2.1) include:
 - (Symmetry) $Sim(A,B) = Sim(B,A)$ 
 - (Identity) $Sim(A,A) = 1$
 - (Uniqueness) $Sim(A,B) = 1 \leftrightarrow A = B$
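
As an illustration of these properties, here is a minimal sketch that turns a Euclidean distance into a similarity via $1 / (1 + d)$. This particular formula and the function name `distance_similarity` are assumptions chosen for the example; they are one simple way to invert a distance, not the metric used later in this notebook.

```python
import numpy as np

def distance_similarity(a, b):
    """Similarity in (0, 1]: 1 when a == b, approaching 0 as the vectors move apart."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 / (1.0 + np.linalg.norm(a - b))

A = [3, 4]
B = [5, 2]

print(distance_similarity(A, B) == distance_similarity(B, A))  # symmetry
print(distance_similarity(A, A) == 1.0)                        # identity
print(distance_similarity(A, B) < 1.0)                         # uniqueness: A != B gives a value below 1
```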
 
While there are many similarity metrics to choose from when building a recommender system (and more than one can certainly be used simultaneously), a popular choice is the **Cosine similarity**. We won't go into the underlying trigonometry here, but recall that as an angle becomes smaller (approaching $0^\circ$), the value of its cosine increases towards 1. Conversely, as the angle increases towards $90^\circ$, the cosine value decreases towards 0. This behaviour makes the cosine of the angle between two p-dimensional vectors desirable as a [similarity metric](https://en.wikipedia.org/wiki/Cosine_similarity) that is easy to compute. Because it depends only on the angle, two vectors that point in the same direction but differ in length also receive a similarity of 1, so for Cosine similarity the uniqueness property above holds only up to positive scaling.

Using the figure above to help guide our understanding, the Cosine similarity between two p-dimensional vectors $A$ and $B$ can be given as:

$$ \begin{align}
Sim(A,B)  &= \frac{A \cdot B}{||A|| \times ||B||} \\ \\
& = \frac{\sum_{i=1}^{p}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{p}A_{i}^2} \sqrt{\sum_{i=1}^{p}B_{i}^2}}
\end{align} $$
  

Let’s work out the Cosine similarity for the example in the figure above to make things a little more concrete. Here, each vector represents the ratings given by one of two *users*, $A$ and $B$, who have each rated two books (rating 1 $\rightarrow r_1$ and rating 2 $\rightarrow r_2$). User $A$ gave ratings of 3 and 4, and user $B$ gave ratings of 5 and 2. To work out how similar these two users are based on their supplied ratings, we can use the Cosine similarity definition as follows:


$$ \begin{align}
Sim(A,B)  & = \frac{(A_{r_1} \times B_{r_1})+(A_{r_2} \times B_{r_2})}{\sqrt{A_{r_1}^2 + A_{r_2}^2} \times \sqrt{B_{r_1}^2 + B_{r_2}^2}} \\ \\
& = \frac{(3 \times 5) + (4 \times 2)}{\sqrt{9 + 16} \times \sqrt{25 + 4}} \\ \\
& \approx \frac{23}{26.93} \\ \\
& \approx 0.854
\end{align} $$

It would be a pain to work this out manually each time! Thankfully, we can obtain the same result using the `cosine_similarity` function provided to us in `sklearn`.

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

In [None]:
A = np.array([[3,4]]) # <-- Rating vector A
B = np.array([[5,2]]) # <-- Rating vector B
cosine_similarity(A,B) # Sim(A,B)
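
As a cross-check on the hand calculation, the Cosine similarity formula can also be written directly in NumPy. This is a minimal sketch: the helper name `cosine_sim` is hypothetical, and raising an error for a zero vector is a design choice made here, not a description of how `sklearn` behaves.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors: (a . b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

sim = cosine_sim([3, 4], [5, 2])
print(round(sim, 3))  # 0.854
```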

# Data Overview

This dataset consists of several million ratings, made on a 5-star scale, obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of recommender systems based on explicit ratings, and now you get to as well!

For this Predict, we'll be using a special version of the MovieLens dataset which has been enriched with additional data and resampled for fair evaluation purposes.

## Source

The data for the MovieLens dataset is maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB.

## Supplied Files

- `genome_scores.csv` - A score mapping the strength between movies and tag-related properties. Read more [here](https://files.grouplens.org/papers/tag_genome.pdf).
- `genome_tags.csv` - Tag descriptions for the tag IDs used in the genome scores.
- `imdb_data.csv` - Additional movie metadata scraped from IMDB using the `links.csv` file.
- `links.csv` - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
- `movies.csv` - Movie titles and genres for each MovieLens ID.
- `sample_submission.csv` - Sample of the submission format for the hackathon.
- `tags.csv` - User-assigned tags for the movies within the dataset.
- `test.csv` - The test split of the dataset. Contains user and movie IDs with no rating data.
- `train.csv` - The training split of the dataset. Contains user and movie IDs with associated rating data.

## Additional Information

The information below is taken directly from the MovieLens dataset description files:

### Ratings Data File Structure (`train.csv`)

All ratings are contained in the file `train.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format: `userId,movieId,rating,timestamp`

The lines within this file are ordered first by `userId`, then, within user, by `movieId`.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
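
The sketch below shows one way to parse rows in this layout and convert the Unix timestamps to readable datetimes. The three sample rows are invented for illustration; with the real data you would read `train.csv` instead of the in-memory string.

```python
import io
import pandas as pd

# Hypothetical rows in the documented `userId,movieId,rating,timestamp` layout.
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,31,2.5,1260759144\n"
    "1,1029,3.0,1260759179\n"
    "2,10,4.0,835355493\n"
)
ratings = pd.read_csv(sample_csv)

# Timestamps are seconds since 1970-01-01 00:00:00 UTC.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

# Sanity check: ratings lie between 0.5 and 5.0 in half-star steps.
valid = ratings["rating"].between(0.5, 5.0) & ((ratings["rating"] * 2) % 1 == 0)

print(ratings[["userId", "movieId", "rating", "rated_at"]])
print(valid.all())
```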

### Tags Data File Structure (`tags.csv`)

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format: `userId,movieId,tag,timestamp`

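### Movies Data File Structure (`movies.csv`)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format: `movieId,title,genres`

Movie titles are entered manually or imported from [The Movie Database (TMDb)](https://www.themoviedb.org/), and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.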

Genres are a pipe-separated list, and are selected from the following:

- Action
- Adventure
- Animation
- Children's
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
- (no genres listed)
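
As a sketch of how the pipe-separated genres and the year in the title might be unpacked with pandas, assuming columns named `movieId`, `title`, and `genres`. The three rows are made up for illustration, and real titles may not always end in a clean `(YYYY)`.

```python
import pandas as pd

# Hypothetical rows with a pipe-separated `genres` column.
movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Unknown Film (2001)"],
        "genres": ["Adventure|Animation|Comedy", "Adventure|Fantasy", "(no genres listed)"],
    }
)

# Split the pipe-separated string into a list of genres per movie.
movies["genre_list"] = movies["genres"].str.split("|")

# One indicator column per genre (1 if the movie has that genre, else 0).
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Pull the release year out of the trailing "(YYYY)" in the title.
movies["year"] = pd.to_numeric(
    movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
)

print(genre_flags)
print(movies[["title", "year"]])
```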

### Links Data File Structure (`links.csv`)

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format: `movieId,imdbId,tmdbId`

- `movieId` is an identifier for movies used by [MovieLens](https://movielens.org). E.g., the movie Toy Story has the link [https://movielens.org/movies/1](https://movielens.org/movies/1).

- `imdbId` is an identifier for movies used by [IMDB](http://www.imdb.com). E.g., the movie Toy Story has the link [http://www.imdb.com/title/tt0114709/](http://www.imdb.com/title/tt0114709/).

- `tmdbId` is an identifier for movies used by [The Movie Database (TMDb)](https://www.themoviedb.org). E.g., the movie Toy Story has the link [https://www.themoviedb.org/movie/862](https://www.themoviedb.org/movie/862).

Use of the resources listed above is subject to the terms of each provider.
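
The URL patterns above can be rebuilt from the three ID columns, as in the sketch below. It assumes `imdbId` is read as a plain integer and must be zero-padded to seven digits to match the Toy Story example. Check this against the actual file, which may also contain missing `tmdbId` values that need handling first.

```python
import pandas as pd

# Hypothetical row in the documented `movieId,imdbId,tmdbId` layout (Toy Story).
links = pd.DataFrame({"movieId": [1], "imdbId": [114709], "tmdbId": [862]})

# Build the MovieLens and TMDb links by appending the ID to the base URL.
links["movielens_url"] = "https://movielens.org/movies/" + links["movieId"].astype(str)
links["tmdb_url"] = "https://www.themoviedb.org/movie/" + links["tmdbId"].astype(str)

# Assumption: IMDB IDs are zero-padded to seven digits and prefixed with "tt".
links["imdb_url"] = (
    "http://www.imdb.com/title/tt" + links["imdbId"].astype(str).str.zfill(7) + "/"
)

print(links[["movielens_url", "imdb_url", "tmdb_url"]].iloc[0].tolist())
```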

### Tag Genome (`genome_scores.csv` and `genome_tags.csv`)

As described in [this article](https://dl.acm.org/doi/10.1145/1864708.1864767), the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

The genome is split into two files:

- The file `genome_scores.csv` contains movie-tag relevance data in the following format: `movieId,tagId,relevance`
- The second file, `genome_tags.csv`, provides the tag descriptions for the tag IDs in the genome file, in the following format: `tagId,tag`
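
A minimal sketch of how the two genome files might be combined into a movie-by-tag relevance matrix. The miniature DataFrames are invented stand-ins for the real files, and the relevance values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature versions of the two genome files.
genome_scores = pd.DataFrame(
    {
        "movieId": [1, 1, 2, 2],
        "tagId": [1, 2, 1, 2],
        "relevance": [0.90, 0.10, 0.20, 0.75],
    }
)
genome_tags = pd.DataFrame({"tagId": [1, 2], "tag": ["atmospheric", "realistic"]})

# Attach the human-readable tag to each (movie, tag) relevance score.
scored = genome_scores.merge(genome_tags, on="tagId", how="left")

# Reshape into a movie-by-tag matrix: one row per movie, one column per tag.
movie_tag_matrix = scored.pivot(index="movieId", columns="tag", values="relevance")

print(movie_tag_matrix)
```

Each row of this matrix is a feature vector describing one movie, which is the kind of input an item-to-item similarity metric needs.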

## Installing packages
Please install all relevant packages before running the notebook. There is no terminal available, so run `pip install` from a notebook cell instead.

You can find a list of recommended packages in the `requirements.txt` file.

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Package to suppress warnings
import warnings
warnings.filterwarnings("ignore")

## Reading in data

In [4]:
train_df = pd.read_csv('alx-movie-recommendation-project-2024/train.csv')
movies_df = pd.read_csv('alx-movie-recommendation-project-2024/movies.csv')
imdb_df = pd.read_csv('alx-movie-recommendation-project-2024/imdb_data.csv')
test_df = pd.read_csv('alx-movie-recommendation-project-2024/test.csv')
links_df = pd.read_csv('alx-movie-recommendation-project-2024/links.csv')
tags = pd.read_csv('alx-movie-recommendation-project-2024/tags.csv')
genome_scores = pd.read_csv('alx-movie-recommendation-project-2024/genome_scores.csv')
genome_tags = pd.read_csv('alx-movie-recommendation-project-2024/genome_tags.csv')
sample_submissions = pd.read_csv('alx-movie-recommendation-project-2024/sample_submission.csv')
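
Because the dataset has to be downloaded separately, it can help to fail early with a clear message when files are missing. The sketch below is one possible way to do that. The helper `load_dataset` is hypothetical and is not part of this notebook. The self-check at the end writes a throwaway CSV to a temporary directory so the example runs without the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_dataset(data_dir, names):
    """Read `<name>.csv` for each name in `names` into a dict of DataFrames."""
    data_dir = Path(data_dir)
    missing = [n for n in names if not (data_dir / f"{n}.csv").is_file()]
    if missing:
        raise FileNotFoundError(f"Missing files in {data_dir}: {missing}")
    return {n: pd.read_csv(data_dir / f"{n}.csv") for n in names}

# Self-check with a temporary directory standing in for the dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({"userId": [1], "movieId": [10], "rating": [4.0]})
    sample.to_csv(Path(tmp) / "train.csv", index=False)
    demo = load_dataset(tmp, ["train"])

print(demo["train"])
```

With the real data in place, a call such as `load_dataset('alx-movie-recommendation-project-2024', ['train', 'test'])` would return the same DataFrames as the corresponding `pd.read_csv` lines above.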

## The dataset is not included in the repo
You can download it from [Kaggle](https://www.kaggle.com/competitions/alx-movie-recommendation-project-2024/data).