# Innovative Technologies and Services
### by Fabian Harmsen, Phillip Krumpholz, Till Waller
### Recommender Service with Python
The goal of this submission is to develop a working recommender system that supports both content-based and collaborative filtering.


### Imports for the recommender
The following cell imports the libraries the recommender needs.

In [14]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Task 1
This section walks through the construction of the content-based recommender step by step. At the end, recommendations for selected movies are printed as examples.

### Importing the "metadata" CSV
The following cell loads movies_metadata.csv together with credits.csv and keywords.csv. To show that the import succeeded, the first three rows of the metadata are printed.

In [15]:
metadata = pd.read_csv('data/movies_metadata.csv', low_memory=False)
credits = pd.read_csv('data/credits.csv', low_memory=False)
keywords = pd.read_csv('data/keywords.csv', low_memory=False)
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


### Content-based filtering and the problem behind it
Every movie has a description; the following cell shows a few examples. The obvious problem is that natural language has to be processed automatically.

In [16]:
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

### Solving the problem
As the name suggests, word vectors are vectorized representations of the words in a document. The vectors carry semantic meaning: for example, man and king will have vector representations close to each other, while man and woman will have representations far from each other.

You will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This gives you a matrix in which each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie.

In its essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that frequently occur in plot overviews and, therefore, their significance in computing the final similarity score.
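The down-weighting can be traced by hand on a tiny corpus. The following is a minimal sketch, not the notebook's actual pipeline: it assumes the smoothed IDF variant ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which scikit-learn documents as the `TfidfVectorizer` defaults. The corpus `toy_docs` and the helpers `smoothed_idf` and `tfidf_vector` are hypothetical names introduced for illustration.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized toy corpus (not taken from the movie data).
toy_docs = [
    ["toy", "cowboy", "toy", "astronaut"],
    ["cowboy", "rides", "town"],
    ["astronaut", "lands", "moon"],
]

n_docs = len(toy_docs)
# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))


def smoothed_idf(term):
    # Assumed formula: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_vector(doc):
    # Raw term count times IDF, then scaled to unit (L2) length.
    counts = Counter(doc)
    weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first = tfidf_vector(toy_docs[0])
print(first)
```

In this sketch "cowboy" occurs in two documents and "moon" in only one, so "cowboy" receives the smaller IDF; within the first document, "toy" ends up with the largest weight because it is both frequent there and rare elsewhere.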

The output of the next cell shows a matrix of shape (45466, 75827): 75,827 distinct words are used to describe the 45,466 movies in the dataset.

With this matrix in hand, you can now compute a similarity score. Several similarity metrics can be used for this, such as the Manhattan distance, the Euclidean distance, the Pearson correlation, and the cosine similarity.


In [17]:
tfidf = TfidfVectorizer(stop_words='english')
metadata['overview'] = metadata['overview'].fillna('')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
print(tfidf_matrix.shape)

(45466, 75827)


### Cosine similarity
With the descriptions vectorized, they can now be compared with one another using cosine similarity.

Because the TF-IDF vectorizer produces vectors normalized to unit length by default, the dot product of two vectors directly gives their cosine similarity. Therefore, you will use sklearn's linear_kernel() instead of cosine_similarity(), since it is faster.

This returns a matrix of shape 45466x45466 that holds the cosine similarity of each movie overview with every other movie overview. Hence, each movie corresponds to a 1x45466 row vector in which each column is the similarity score with one movie.
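The claim that a plain dot product is enough can be checked on two small vectors. This is an illustrative sketch in pure Python; the vectors `u` and `v` and the helper functions are hypothetical and stand in for two rows of the TF-IDF matrix.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Textbook definition: dot product divided by the product of the lengths.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    # Scale a vector to unit (L2) length.
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


# Hypothetical weight vectors for two documents.
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = normalize(u), normalize(v)
print(cosine(u, v), dot(u_hat, v_hat))
```

Both printed values agree up to floating-point rounding: once the vectors have unit length, the denominator of the cosine formula is 1 and only the dot product remains.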



In [18]:
cosine_matrix_description = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_matrix_description.shape)

(45466, 45466)


### Index mapping
First, you need a reverse mapping from movie titles to DataFrame indices. In other words, you need a mechanism to identify the index of a movie in your metadata DataFrame, given its title.

In [19]:
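# Map each title to its row index. Series.drop_duplicates() would compare the
# (already unique) index values, so duplicated titles are filtered on the labels.
index_mapping = pd.Series(metadata.index, index=metadata['title'])
index_mapping = index_mapping[~index_mapping.index.duplicated(keep='first')]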
print(index_mapping[:5])

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64
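A title-to-index mapping has to decide what happens when two movies share a title, because a lookup should return a single row index. The following sketch shows one possible policy, keeping the first occurrence, with a plain dictionary. The list `titles` is hypothetical, and whether this policy suits the real dataset is an assumption.

```python
# Hypothetical titles; "Hamlet" appears twice on purpose.
titles = ["Toy Story", "Hamlet", "Jumanji", "Hamlet"]

first_index = {}
for position, title in enumerate(titles):
    # setdefault only stores the position the first time a title is seen.
    first_index.setdefault(title, position)

print(first_index)
```

With this policy a repeated title always resolves to its earliest row, so later entries with the same title can no longer be queried by title alone.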


### Function
You are going to define a function that takes a movie title as input and returns the 10 most similar movies.

In [20]:
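def get_recommendations(title, cosine_matrix_description=cosine_matrix_description):
    # Look up the row index of the movie with this title.
    index = index_mapping[title]
    # Pair every movie index with its similarity to the given movie.
    score = list(enumerate(cosine_matrix_description[index]))
    # Sort by similarity, highest first.
    score = sorted(score, key=lambda x: x[1], reverse=True)
    # Skip position 0 (normally the movie itself) and keep the next 10.
    score = score[1:11]
    movie_indices = [i[0] for i in score]
    # Return the titles of the 10 most similar movies.
    return metadata['title'].iloc[movie_indices]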

get_recommendations('Toy Story')

15348                                     Toy Story 3
2997                                      Toy Story 2
10301                          The 40 Year Old Virgin
24523                                       Small Fry
23843                     Andy Hardy's Blonde Trouble
29202                                      Hot Splash
43427                Andy Kaufman Plays Carnegie Hall
38476    Superstar: The Life and Times of Andy Warhol
42721    Andy Peters: Exclamation Mark Question Point
8327                                        The Champ
Name: title, dtype: object
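The ranking step sorts all similarity scores even though only ten are needed, and it assumes that the queried movie itself ends up in first place. A possible alternative, sketched here under those observations, selects the best `k` entries with `heapq.nlargest` and excludes the queried movie by index. The function `top_k_similar` and the sample `row` are hypothetical and not part of the notebook.

```python
import heapq


def top_k_similar(scores, own_index, k=10):
    # Pair each score with its position, skipping the query movie itself.
    candidates = ((i, s) for i, s in enumerate(scores) if i != own_index)
    # nlargest keeps only the k best pairs instead of sorting everything.
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [i for i, _ in best]


# Hypothetical similarity row for movie 0.
row = [1.0, 0.2, 0.7, 0.0, 0.5]
print(top_k_similar(row, own_index=0, k=3))
```

Excluding by index rather than by rank keeps the result stable when another movie has an identical overview and therefore the same similarity score as the query itself.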