In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

There are essentially two types of recommender systems: those based on content filtering and those based on collaborative filtering. This project aims to implement both if possible.

1) In content filtering, the similarity between different products is calculated from the attributes of the products. For example, in a content-based film recommender, similarity is calculated from genre, actors, directors, etc.

2) Collaborative filtering takes advantage of the power of volume. The underlying intuition is that if user A likes products X and Y, and user B likes product X, there is a good chance that user B will like Y too. For example, suppose a large number of users have given the same ratings to films X and Y. A new user arrives and gives that same rating to Y but has not yet seen X. The collaborative filtering system will recommend X to that user. There are two approaches: user-based, which relies on the similarity between users, and item-based, which relies on the similarity between items (articles or products).
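
As a rough illustration of the item-based idea, here is a minimal sketch on a made-up rating matrix. The ratings, the film labels and the helper `item_cosine_similarity` are all hypothetical and are not part of this project's data or code.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: films X, Y, Z); 0 = not rated.
# All values are invented for illustration.
ratings = np.array([
    [5.0, 5.0, 1.0],   # user A
    [4.0, 4.0, 2.0],   # user B
    [5.0, 4.0, 1.0],   # user C
    [0.0, 5.0, 0.0],   # new user: has rated only Y
])
films = ["X", "Y", "Z"]

def item_cosine_similarity(r):
    """Cosine similarity between the columns (items) of a rating matrix."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalized = r / norms
    return normalized.T @ normalized

# Learn item-item similarities from the existing users only
item_sim = item_cosine_similarity(ratings[:3])

new_user = ratings[3]
rated = new_user > 0
# Score every film by the similarity-weighted ratings of the films the user has rated
scores = item_sim[:, rated] @ new_user[rated]
scores[rated] = -np.inf  # never recommend a film the user already rated
recommended = films[int(np.argmax(scores))]
print(recommended)
```

Because X and Y received near-identical ratings from the existing users, X scores higher than Z for a user who liked Y.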

To build a recommendation system based on votes, follow the link sent by Ras.

Memory-based approaches: statistical techniques are used to match users to items, for example Pearson correlation, cosine similarity, and Euclidean distance. In model-based approaches, users have to be modeled with machine learning techniques such as regression, clustering, or classification.
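
The three memory-based measures mentioned above can be computed directly with NumPy. This is only a sketch on two invented rating vectors (`u` and `v` are hypothetical users rating the same four films):

```python
import numpy as np

# Ratings given by two hypothetical users to the same four films
u = np.array([5.0, 3.0, 4.0, 4.0])
v = np.array([3.0, 1.0, 2.0, 3.0])

# Pearson correlation: how the ratings co-vary around each user's own mean
pearson = np.corrcoef(u, v)[0, 1]

# Cosine similarity: angle between the two rating vectors
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Euclidean distance: straight-line distance (smaller means more alike)
euclidean = np.linalg.norm(u - v)

print(pearson, cosine, euclidean)
```

Pearson and cosine are similarities (higher means closer), while Euclidean is a distance, so the ranking direction has to be reversed when it is used to find neighbours.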

In [3]:
dfmov = pd.read_csv('../../data/imdb_movies_clean_1st.csv')

In [4]:
dfmov.isnull().sum()

Unnamed: 0            0
imdb_title_id         0
original_title        0
year                  0
genre                 0
duration              0
country               0
language              0
director              0
writer                0
production_company    0
actors                0
description           0
duration_sets         0
dtype: int64

This is an example of a content-filtering recommender system that looks specifically at the films' titles and descriptions.

In [5]:
rec_cont = dfmov[['original_title','description']]

In [6]:
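# Mask cells equal to the integer 0. The outputs of In [7] and In [8] below show that
# no cell matched: the 0 placeholders are still present, which suggests they are
# stored as strings rather than as the integer 0.
rec_cont = rec_cont[rec_cont[['original_title','description']] != 0]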

In [7]:
print(rec_cont[['original_title','description']] != 0)

       original_title  description
0                True         True
1                True         True
2                True         True
3                True         True
4                True         True
...               ...          ...
85850            True         True
85851            True         True
85852            True         True
85853            True         True
85854            True         True

[85855 rows x 2 columns]


In [8]:
rec_cont

Unnamed: 0,original_title,description
0,Miss Jerry,The adventures of a female reporter in the 1890s.
1,The Story of the Kelly Gang,True story of notorious Australian outlaw Ned ...
2,Den sorte drøm,Two men of high rank are both wooing the beaut...
3,Cleopatra,The fabled queen of Egypt's affair with Roman ...
4,L'Inferno,Loosely adapted from Dante's Divine Comedy and...
...,...,...
85850,Le lion,A psychiatric hospital patient pretends to be ...
85851,De Beentjes van Sint-Hildegard,A middle-aged veterinary surgeon believes his ...
85852,Padmavyuhathile Abhimanyu,0
85853,Sokagin Çocuklari,0


## Content (Title)-Based Recommender

In [9]:
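# Flag rows where at least one column differs from 0. As in In [6], the comparison
# against the integer 0 matches nothing, so no row is dropped.
take_out = (rec_cont != 0).any(axis=1)

In [10]:
rec_cont = rec_cont.loc[take_out]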

In [11]:
rec_cont

Unnamed: 0,original_title,description
0,Miss Jerry,The adventures of a female reporter in the 1890s.
1,The Story of the Kelly Gang,True story of notorious Australian outlaw Ned ...
2,Den sorte drøm,Two men of high rank are both wooing the beaut...
3,Cleopatra,The fabled queen of Egypt's affair with Roman ...
4,L'Inferno,Loosely adapted from Dante's Divine Comedy and...
...,...,...
85850,Le lion,A psychiatric hospital patient pretends to be ...
85851,De Beentjes van Sint-Hildegard,A middle-aged veterinary surgeon believes his ...
85852,Padmavyuhathile Abhimanyu,0
85853,Sokagin Çocuklari,0
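
The table above still contains `0` in the description column even though the filter in In [7] reports `True` for those rows. One possible explanation is that the placeholder is the string `"0"`. Under that assumption, the rows could be filtered as sketched below; `toy` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of rec_cont, with "0" standing in for a missing description
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "description": ["A reporter investigates.", "0", "Two rivals meet."],
})

# Comparing text against the integer 0 does not flag the placeholder row
naive_mask = toy["description"] != 0

# Comparing against the string "0" does
has_description = toy["description"].astype(str).str.strip() != "0"
toy_clean = toy.loc[has_description].reset_index(drop=True)
print(toy_clean)
```

Resetting the index after filtering keeps row labels aligned with the row positions of any matrix built from the filtered frame.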


In [14]:
# min_df=1 keeps every n-gram; recent scikit-learn releases reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

In [15]:
matrix = tf.fit_transform(rec_cont['description'])

In [16]:
matrix.shape
# 1,920,179 distinct n-gram features (sequences of 1 to 3 words) across the 85,855 movie descriptions

(85855, 1920179)

In [17]:
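# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf.get_feature_names_out()[6000:6010].tolist()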

['1930s 50s movie',
 '1930s american',
 '1930s american scandalous',
 '1930s american socialite',
 '1930s amoral',
 '1930s amoral blonde',
 '1930s amsterdam',
 '1930s arab',
 '1930s arab states',
 '1930s area']

In [18]:
cosine_similarities = linear_kernel(matrix,matrix)

In [19]:
cosine_similarities[1]

array([0., 1., 0., ..., 0., 0., 0.])
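
`linear_kernel` computes plain dot products, yet the result is used as a cosine similarity. That works because `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`), so the dot product of two rows equals their cosine. A small check on made-up descriptions (`docs` is hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented descriptions, only to demonstrate the equivalence
docs = [
    "A crime family fights for power in New York.",
    "A masked vigilante fights crime in Gotham.",
    "Two friends take a road trip across the desert.",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Each TF-IDF row has unit length under the default norm='l2'
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()

dot = linear_kernel(toy_matrix, toy_matrix)
cos = cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(dot, cos))
```

Skipping the normalisation step is what makes `linear_kernel` the cheaper choice here.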

In [20]:
movie_title = rec_cont['original_title']

In [21]:
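# Map each title to its row position. drop_duplicates() would only remove repeated
# values (row positions), so repeated titles are removed via the index instead.
indices = pd.Series(rec_cont.index, index=rec_cont['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]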

In [22]:
indices[:10]

original_title
Miss Jerry                                             0
The Story of the Kelly Gang                            1
Den sorte drøm                                         2
Cleopatra                                              3
L'Inferno                                              4
From the Manger to the Cross; or, Jesus of Nazareth    5
Madame DuBarry                                         6
Quo Vadis?                                             7
Independenta Romaniei                                  8
Richard III                                            9
dtype: int64

In [23]:
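def movie_recommend(original_title, cosine_similarities=cosine_similarities):
    '''
    Return the 30 movies most similar to the given title,
    ranked by the scores in cosine_similarities.
    '''
    # Row position of the requested movie
    idx = indices[original_title]

    # Pair every movie position with its similarity to the requested movie
    sim_scores = list(enumerate(cosine_similarities[idx]))

    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (normally the movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]

    movie_indices = [i[0] for i in sim_scores]

    return movie_title.iloc[movie_indices]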

In [24]:
movie_recommend('The Godfather').head(10)
# The 10 movies recommended for The Godfather, based on the description column

53975                                    Yangjamoolrihak
45869                                  Romanzo criminale
77259                                       Moving Parts
61578                                         Blood Ties
33539                                              Belly
15497    I familiari delle vittime non saranno avvertiti
43377                                       Sabita naifu
41165                                            Bookies
22349                                 Year of the Dragon
4968                                         Crime, Inc.
Name: original_title, dtype: object

In [25]:
movie_recommend('The Dark Knight Rises').head(10)

43935                 Batman Begins
58269               William Vincent
82580                  Batman Ninja
30399                Batman & Robin
81987    Batman: Gotham by Gaslight
24426                        Batman
48078               The Dark Knight
82239                         Joker
26413                Batman Returns
73755         The Lego Batman Movie
Name: original_title, dtype: object

In [26]:
movie_recommend('American Pie').head(10)

62887                Date and Switch
67690                       Blockers
46938              Another Gay Movie
21993                      Hot Moves
63249                Very Good Girls
16630    Es war nicht die Nachtigall
32869                American Virgin
25549                        Rockula
17386               Cherry Hill High
73450                 The Honor Farm
Name: original_title, dtype: object

In [27]:
movie_recommend('Thelma & Louise').head(10)

64881                          Hell and Back
49690                            Deep Winter
78636    1 Kezban 1 Mahmut: Adana Yollarinda
51533                            Road to Red
27131          The Bikini Carwash Company II
46994                      Falscher Bekenner
68300                            Last Minute
31940            The Boy Who Saved Christmas
23883                              El heroob
4135                              Crossroads
Name: original_title, dtype: object
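
The full similarity matrix above has one entry per pair of movies, which grows quadratically with the catalogue. A possible alternative, sketched here on invented data, is to compute only the row for the requested film at query time. The titles, descriptions and the function `recommend_on_demand` are hypothetical and are not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up catalogue for illustration
toy_titles = ["Mob Story", "Gotham Nights", "Desert Road", "Family Business"]
toy_descriptions = [
    "A crime family fights for power in New York.",
    "A masked vigilante fights crime in Gotham.",
    "Two friends take a road trip across the desert.",
    "The son of a crime family takes over the family business.",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_descriptions)

def recommend_on_demand(title, top_n=2):
    """Return the top_n most similar titles, computing a single similarity row."""
    idx = toy_titles.index(title)
    # Shape (1, n_movies) flattened: similarity of the query film to every film
    scores = linear_kernel(toy_tfidf[idx], toy_tfidf).ravel()
    scores[idx] = -np.inf  # exclude the query film itself
    best = np.argsort(scores)[::-1][:top_n]
    return [toy_titles[i] for i in best]

toy_result = recommend_on_demand("Mob Story")
print(toy_result)
```

This trades a little work per query for not having to hold the whole movie-by-movie matrix in memory.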

## Content (Film)-Based Recommender

The recommender can be refined by adding other features to the system, such as directors, actors, writers, and genres.

In [121]:
dfmov.drop('Unnamed: 0', axis=1)

Unnamed: 0,imdb_title_id,original_title,year,genre,duration,country,language,director,writer,production_company,actors,description,duration_sets
0,tt0000009,Miss Jerry,1894,Romance,45,USA,0,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,0 < 1h
1,tt0000574,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,0,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,1h < 1h30m
2,tt0001892,Den sorte drøm,1911,Drama,53,"Germany, Denmark",0,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,0 < 1h
3,tt0002101,Cleopatra,1912,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,1h30m < 2h
4,tt0002130,L'Inferno,1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,1h < 1h30m
...,...,...,...,...,...,...,...,...,...,...,...,...,...
85850,tt9908390,Le lion,2020,Comedy,95,"France, Belgium",French,Ludovic Colbeau-Justin,"Alexandre Coquelle, Matthieu Le Naour",Monkey Pack Films,"Dany Boon, Philippe Katerine, Anne Serra, Samu...",A psychiatric hospital patient pretends to be ...,1h30m < 2h
85851,tt9911196,De Beentjes van Sint-Hildegard,2020,"Comedy, Drama",103,Netherlands,"German, Dutch",Johan Nijenhuis,"Radek Bajgar, Herman Finkers",Johan Nijenhuis & Co,"Herman Finkers, Johanna ter Steege, Leonie ter...",A middle-aged veterinary surgeon believes his ...,1h30m < 2h
85852,tt9911774,Padmavyuhathile Abhimanyu,2019,Drama,130,India,Malayalam,Vineesh Aaradya,"Vineesh Aaradya, Vineesh Aaradya",RMCC Productions,"Anoop Chandran, Indrans, Sona Nair, Simon Brit...",0,2h < 2h30m
85853,tt9914286,Sokagin Çocuklari,2019,"Drama, Family",98,Turkey,Turkish,Ahmet Faik Akinci,"Ahmet Faik Akinci, Kasim Uçkan",Gizem Ajans,"Ahmet Faik Akinci, Belma Mamati, Metin Keçeci,...",0,1h30m < 2h


In [122]:
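# The dataframe used for this recommender system; .copy() makes it independent
# of dfmov so that later column assignments do not act on a slice
rec_film = dfmov[['actors', 'director', 'writer', 'genre']].copy()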

In [123]:
type(rec_film)

pandas.core.frame.DataFrame

In [81]:
#rec_film.update('"' + rec_film[['actors', 'director', 'writer', 'genre']].astype(str) + '"')
#print(rec_film)

                                                  actors  \
0      "Blanche Bayliss, William Courtenay, Chauncey ...   
1      "Elizabeth Tait, John Tait, Norman Campbell, B...   
2      "Asta Nielsen, Valdemar Psilander, Gunnar Hels...   
3      "Helen Gardner, Pearl Sindelar, Miss Fielding,...   
4      "Salvatore Papa, Arturo Pirovano, Giuseppe de ...   
...                                                  ...   
85850  "Dany Boon, Philippe Katerine, Anne Serra, Sam...   
85851  "Herman Finkers, Johanna ter Steege, Leonie te...   
85852  "Anoop Chandran, Indrans, Sona Nair, Simon Bri...   
85853  "Ahmet Faik Akinci, Belma Mamati, Metin Keçeci...   
85854  "Maria Morera Colomer, Biel Rossell Pelfort, I...   

                                    director  \
0                          "Alexander Black"   
1                             "Charles Tait"   
2                                "Urban Gad"   
3                       "Charles L. Gaskill"   
4      "Francesco Bertolini, Adolfo Pad

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[col] = expressions.where(mask, this, that)
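
The warning above typically appears when a column is assigned on a DataFrame that was itself selected from another DataFrame. A minimal sketch of one common remedy, taking an explicit `.copy()` of the selection, on a made-up frame (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical parent frame
toy_df = pd.DataFrame({"writer": ["Jane Doe", "John Roe"], "year": [1999, 2005]})

# An explicit copy is an independent DataFrame, so adding columns to it is unambiguous
toy_subset = toy_df[["writer"]].copy()
toy_subset["writers"] = toy_subset["writer"].str.lower()

print(toy_subset)
print(toy_df.columns.tolist())  # the parent frame is untouched
```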


In [124]:
import re
import nltk

In [126]:
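# Lower-case the writer names, replace non-letters with spaces, then collapse repeated whitespace.
# Raw strings keep the regex escapes (such as \s) from being read as string escapes.
rec_film['writers'] = rec_film['writer'].str.lower()
rec_film['writers'] = rec_film['writers'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))
rec_film['writers'] = rec_film['writers'].apply(lambda x: re.sub(r'\s+', ' ', x))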


In [82]:
#rec_film = rec_film.update(rec_film[['actors', 'director', 'writer', 'genre']].applymap('"{}"'.format))

In [83]:
type(rec_film['writer'][0])

str

In [84]:
features = ['actors', 'director', 'writer', 'genre']

In [85]:
from ast import literal_eval
# The feature columns hold strings, which have to be transformed into usable data for the system

In [86]:
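def f(x):
    # Parse a Python literal (e.g. "['a', 'b']"); fall back to an empty list on failure
    try:
        return literal_eval(str(x))
    except Exception as e:
        print(e)
        return []

# DataFrame.apply passes each column (a Series) to f, not each cell, so
# literal_eval fails once per column, as the four messages below show
rec_film[['actors', 'director', 'writer', 'genre']] = rec_film[['actors', 'director', 'writer', 'genre']].apply(f)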

EOL while scanning string literal (<unknown>, line 1)
invalid syntax (<unknown>, line 1)
invalid syntax (<unknown>, line 1)
invalid syntax (<unknown>, line 1)
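
The errors above are consistent with the columns holding plain comma-separated text rather than Python literals, in which case `literal_eval` is not the right tool. A minimal sketch of splitting such strings cell by cell; `toy_film`, the names in it and `split_names` are hypothetical, and treating `"0"` as a missing value is an assumption.

```python
import pandas as pd

# Made-up rows in the same comma-separated layout as the feature columns
toy_film = pd.DataFrame({
    "actors": ["Actor One, Actor Two, Actor Three, Actor Four", "Actor Five"],
    "genre": ["Crime, Drama", "0"],
})

def split_names(value, limit=3):
    """Split a comma-separated string into a list of at most `limit` names."""
    # Assumption: non-strings, empty strings and the placeholder "0" mean "missing"
    if not isinstance(value, str) or value.strip() in ("", "0"):
        return []
    return [name.strip() for name in value.split(",") if name.strip()][:limit]

# Series.apply works cell by cell, unlike DataFrame.apply, which works column by column
for column in ["actors", "genre"]:
    toy_film[column] = toy_film[column].apply(split_names)

print(toy_film)
```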




In [None]:
#def literal_return(val):
 #   try:
  #      return ast.literal_eval(val)
   # except (ValueError, SyntaxError) as e:
    #    return val

In [78]:
# Converting the data to be usable in the model

#for feature in features:

 #   rec_film[feature] = rec_film[feature].apply(literal_eval)

In [109]:
# You already have this split out
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [110]:
def get_list(x):
    '''
    Return the first 3 names in the list, or the entire list if it has
    fewer than 3. Applies to the lists of actors, directors, writers or genres.
    Returns an empty list for missing or malformed data.
    '''
    if isinstance(x, list):
        # Assumes each element is a dict with a 'name' key
        names = [i['name'] for i in x]
        # Slicing keeps at most 3 names
        return names[:3]

    # Return empty list in case of bad data
    return []

In [111]:
# Define director, actors, writer and genre features that are in a suitable form.
rec_film['director'] = rec_film['actors'].apply(get_director)

features = ['actors', 'director', 'writer', 'genre']
for feature in features:
    rec_film[feature] = rec_film[feature].apply(get_list)



In [112]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    # Check if the value (e.g. director) exists. If not, return empty string
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ''

In [113]:
# Apply clean_data function to your features.
features = ['actors', 'director', 'writer', 'genre']

for feature in features:
    rec_film[feature] = rec_film[feature].apply(clean_data)



In [114]:
def create_soup(x):
    # Join the tokens of each feature, then join the four features with spaces
    return ' '.join(x['actors']) + ' ' + ' '.join(x['director']) + ' ' + ' '.join(x['writer']) + ' ' + ' '.join(x['genre'])

In [115]:
# Create a new soup feature
rec_film['soup'] = rec_film.apply(create_soup, axis=1)



In [116]:
rec_film

Unnamed: 0,actors,director,writer,genre,soup
0,[],[],[],[],
1,[],[],[],[],
2,[],[],[],[],
3,[],[],[],[],
4,[],[],[],[],
...,...,...,...,...,...
85850,[],[],[],[],
85851,[],[],[],[],
85852,[],[],[],[],
85853,[],[],[],[],


In [59]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(rec_film['soup'])

KeyError: 'soup'

In [None]:
count_matrix.shape


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [None]:
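# Reset index of the main DataFrame and construct the reverse mapping as before,
# keeping the first row for any repeated title
dfmov = dfmov.reset_index(drop=True)
indices = pd.Series(dfmov.index, index=dfmov['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]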

In [None]:
movie_recommend('The Dark Knight Rises', cosine_sim2)


In [None]:
movie_recommend('The Godfather', cosine_sim2)
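
For reference, the metadata-based pipeline of this section (tokens, soup, count matrix, cosine similarity, lookup) can be sketched end to end on a tiny invented catalogue. Everything below (`toy_meta`, the names in it, `to_tokens`, `recommend_by_metadata`) is hypothetical, and it assumes the features are comma-separated strings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up catalogue with comma-separated feature strings
toy_meta = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "actors": ["Ann Lee, Bob Ray", "Ann Lee, Cy Fox", "Dee Poe"],
    "director": ["Max Orr", "Max Orr", "Eve Kim"],
    "writer": ["Max Orr", "Sam Tan", "Eve Kim"],
    "genre": ["Crime, Drama", "Crime, Thriller", "Comedy"],
})

def to_tokens(value):
    """Turn 'Ann Lee, Bob Ray' into ['annlee', 'bobray'] so each name is one token."""
    return [part.strip().lower().replace(" ", "") for part in value.split(",") if part.strip()]

feature_columns = ["actors", "director", "writer", "genre"]
# The soup is every token of every feature, separated by spaces
toy_meta["soup"] = toy_meta[feature_columns].apply(
    lambda row: " ".join(token for value in row for token in to_tokens(value)), axis=1
)

# Raw counts rather than TF-IDF, so a name is not down-weighted for appearing in many films
toy_counts = CountVectorizer().fit_transform(toy_meta["soup"])
toy_sim = cosine_similarity(toy_counts, toy_counts)
toy_index = pd.Series(toy_meta.index, index=toy_meta["original_title"])

def recommend_by_metadata(title, top_n=2):
    """Rank the other films by cosine similarity of their metadata soups."""
    idx = toy_index[title]
    ranked = toy_sim[idx].argsort()[::-1]
    ranked = [i for i in ranked if i != idx][:top_n]
    return toy_meta["original_title"].iloc[ranked].tolist()

print(recommend_by_metadata("Film A", top_n=1))
```

Film A and Film B share an actor, a director and a genre, so Film B ranks first for Film A, while Film C shares nothing with either.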
