# Building a movie recommendation system based on tweets
> You may want to take a look at the previous notebook, "Sentiment_Analysis.ipynb", before we start.

## Importing the necessary libraries

In [1]:
%pip install -U textblob
%pip install tweepy
!python -m textblob.download_corpora
%pip install -U scikit-learn

Requirement already up-to-date: textblob in d:\programdata\anaconda3\lib\site-packages (0.15.3)
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to

Requirement already up-to-date: scikit-learn in d:\programdata\anaconda3\lib\site-packages (0.24.2)
Note: you may need to restart the kernel to use updated packages.


In [2]:
import re
import ast
import pandas as pd
import numpy as np
import sklearn
import nltk
import tweepy

nltk.download(['wordnet', 'punkt', 'averaged_perceptron_tagger', 'stopwords'])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Robert\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Importing the models
> We can load the model "svcmodel.joblib" trained in the previous notebook, or use TextBlob's NaiveBayesAnalyzer (an NLTK classifier trained on a movie reviews corpus).

### SVC model

In [3]:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

classifier = joblib.load('trained_models/svcmodel.joblib')
tfidf = joblib.load('trained_models/tfidf.joblib') #test inputs must be transformed with the same TF-IDF vectorizer that was fitted when the classifier was trained

### TextBlob NaiveBayesAnalyzer model

In [4]:
from textblob import TextBlob, Blobber
from textblob.sentiments import NaiveBayesAnalyzer

classifier2 = Blobber(analyzer=NaiveBayesAnalyzer())

## Importing the Movies dataset

In [5]:
import ast

# convert the genres column (a stringified list of dicts with a 'name' key) into the genre names separated by whitespace
def parse_json(text):
    genres = ast.literal_eval(text)
    return " ".join(genre['name'] for genre in genres)

dataset = pd.read_csv('datasets/movies_metadata.csv', dtype=str).loc[:8000, ['original_title', 'genres']].dropna()
dataset['genres'] = dataset['genres'].apply(parse_json)

print(dataset[1:5])

                original_title                    genres
1                      Jumanji  Adventure Fantasy Family
2             Grumpier Old Men            Romance Comedy
3            Waiting to Exhale      Comedy Drama Romance
4  Father of the Bride Part II                    Comedy
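
To make the parsing step concrete, here is a minimal sketch of what a single `genres` cell is assumed to look like and how it collapses into a space-separated string. The sample value and the helper name `genre_names` are illustrative assumptions, not rows read from `movies_metadata.csv`.

```python
import ast

# assumed sample cell: a stringified list of dicts with 'id' and 'name' keys
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

def genre_names(cell):
    """Parse the stringified list and join the genre names with spaces."""
    return " ".join(genre["name"] for genre in ast.literal_eval(cell))

print(genre_names(raw_genres))  # Animation Comedy Family
```

Because `ast.literal_eval` already returns Python dicts, no JSON round trip is needed in this sketch.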


## Extracting tweets

### Setting up tweepy API

In [6]:
from tweepy import OAuthHandler

consumer_key = 'AAAA'
consumer_secret = 'BBBB'
access_token = 'CCCC'
access_token_secret = 'DDDD'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
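
If you would rather not keep the keys in the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names and the helper `load_credentials` are hypothetical, so adapt them to your own setup.

```python
import os

# hypothetical variable names; use whatever names your environment defines
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_credentials(env=None):
    """Return the four credentials as a tuple, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)
```

The returned tuple can be unpacked into `consumer_key, consumer_secret, access_token, access_token_secret` before the `OAuthHandler` call.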

### Getting tweets

In [7]:
username = 'DevFarzan' #Replace by the desired username
count = 5 #Replace by the desired number of tweets to analyze

tweets = api.user_timeline(screen_name=username, count=count, include_rts = False, tweet_mode = 'extended')
tweets = [tw.full_text for tw in tweets]

for tweet in tweets:
    print(tweet, '\n')
    

Ace Ventura. Neither terrible, boring nor soporific, just not very funny. 

Jumanji, with plenty of laughs, action-packed excitement, great music, spectacular sets, and inspirational themes, this film is an absolutely winning adventure. 

Die Hard. There are good performances from everyone in this long, often funny, very violent but exciting melodrama. 

Meet Joe Black. I've never encountered such dramatic flatulence, never heard so many pregnant silences that don't deliver, never watched so many close-ups that graze on actors' faces until every last trace of expression has been devoured. 

Toy Story is a Pixar classic, one of the best kids' movies of all time. 



## Analyze tweets' sentiment

### Text preprocessing

#### Cleaning links, @ users, html tags and special characters

In [8]:
#replace html tags, possessive 's and every character other than letters and apostrophes (digits, punctuation, @, link symbols) with spaces
def clean_text(raw_text):
    clean = re.compile(r"<.*?>|([^A-Za-z'])|('s)")
    cleantext = re.sub(clean, ' ', raw_text)
    cleantext = " ".join(re.split(r'[!?\., ]', cleantext))
    cleantext = re.sub(r'\s+', ' ', cleantext)
    cleantext = re.sub(r"\s\W+\s", ' ', cleantext)
    return cleantext
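
As a quick check of what the regular expressions in `clean_text` do, the sketch below applies the same substitutions to a made-up string (the sample text is an assumption, not a tweet from the timeline). Note that the `@` of a mention is removed but the username letters stay, because only non-letter characters are replaced.

```python
import re

def clean_text(raw_text):
    # same substitutions as the notebook cell, written with raw strings
    clean = re.compile(r"<.*?>|([^A-Za-z'])|('s)")
    cleantext = re.sub(clean, ' ', raw_text)
    cleantext = " ".join(re.split(r'[!?\., ]', cleantext))
    cleantext = re.sub(r'\s+', ' ', cleantext)
    cleantext = re.sub(r"\s\W+\s", ' ', cleantext)
    return cleantext

sample = "Loved <b>Toy Story</b>! 10/10 @pixar"  # made-up input
cleaned = clean_text(sample)
print(cleaned)  # Loved Toy Story pixar
```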

#### Removing stopwords from text (except useful words like not)

In [9]:
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#remove stopwords from text
def remove_stopwords(text):
    
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not') #excluding not as stopword

    negation = False
    result = []
    delims = "?.,!:; "
    
    # keep negations from being removed from the text
    # (each substitution below starts again from `stripped`, so only the 've expansion reaches `result`)
    for word in text.split():
        stripped = word.strip(delims).lower()
        negated = "not " + stripped if negation else stripped
        negated = re.sub(r"n't", " not", stripped)
        negated = re.sub(r"'ve", " have", stripped)
        result.append(negated)
        
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation

        if any(c in word for c in delims):
            negation = False

    stopword_set = set(all_stopwords)
    text = [word for word in result if word not in stopword_set]
    text = ' '.join(text)

    return text.lower()
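
The idea of dropping stopwords while keeping negations can be illustrated without NLTK. The sketch below is a simplified stand-in, not the notebook's function: `STOPWORDS` is a tiny hypothetical list instead of NLTK's English stopwords, and the helper names are made up. It expands `n't` and `'ve` first, so the resulting `not` survives the filter.

```python
import re

# hypothetical, tiny stand-in for NLTK's English stopword list ('not' is deliberately absent)
STOPWORDS = {"the", "is", "a", "this", "was", "do", "very", "i", "it"}

def expand_contractions(word):
    """Expand trailing n't and 've so the negation becomes its own token."""
    word = re.sub(r"n't$", " not", word)
    word = re.sub(r"'ve$", " have", word)
    return word

def drop_stopwords_keep_negation(text):
    kept = []
    for raw in text.split():
        word = raw.strip("?.,!:; ").lower()
        for token in expand_contractions(word).split():
            if token not in STOPWORDS:
                kept.append(token)
    return " ".join(kept)

print(drop_stopwords_keep_negation("This movie wasn't very funny, I don't recommend it."))
# movie not funny not recommend
```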

#### Text lemmatization

In [10]:
#extract tag from WordNet
def wordnet_tag(text):
    
    #map the POS tag string returned by nltk.pos_tag to the corresponding WordNet tag
    if text.startswith('J'):
        return wordnet.ADJ
    elif text.startswith('V'):
        return wordnet.VERB
    elif text.startswith('N'):
        return wordnet.NOUN
    elif text.startswith('R'):
        return wordnet.ADV
    else:          
        return None

#Text lemmatization
def lemmatize_text(text):

    lem = WordNetLemmatizer()
    tag_text = nltk.pos_tag(nltk.word_tokenize(text))
    text = map(lambda x: (x[0], wordnet_tag(x[1])), tag_text)
    lemmatized_sentence = []
    for word, tag in text:
        if tag is None:
            lemmatized_sentence.append(word)
        else:        
            lemmatized_sentence.append(lem.lemmatize(word, tag))

    return " ".join(lemmatized_sentence)
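
The tag conversion is the part of lemmatization that is easiest to get wrong, so here is a small standalone sketch of it. It assumes the usual single-letter values of the `nltk.corpus.wordnet` constants (`'a'`, `'v'`, `'n'`, `'r'`) and uses a dictionary lookup on the first letter of the Penn Treebank tag; the name `WORDNET_POS` is made up for this example.

```python
# assumed values of nltk.corpus.wordnet.ADJ, VERB, NOUN and ADV
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def wordnet_tag(penn_tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if there is no equivalent."""
    return WORDNET_POS.get(penn_tag[:1])

for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", wordnet_tag(tag))
```

Tags with no WordNet equivalent, such as determiners (`DT`), map to `None`, which is why `lemmatize_text` leaves those words unchanged.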

In [11]:
#assembling previous functions into one
def preprocess_text(text):
    text = remove_stopwords(text)
    text = clean_text(text)
    text = lemmatize_text(text)
    
    return text

test = [preprocess_text(text) for text in tweets]
print(test)

['ace ventura neither terrible boring soporific not funny', 'jumanji plenty laugh action pack excitement great music spectacular set inspirational theme film absolutely win adventure', 'die hard good performance everyone long often funny violent excite melodrama', "meet joe black i have never encounter dramatic flatulence never hear many pregnant silence deliver never watch many close ups graze actor ' face every last trace expression devour", "toy story pixar classic one best kid ' movie time"]


### Get sentiment

In [13]:
# vectorize test array into TF-IDF
tftest = tfidf.transform(test).toarray()

#analyze sentiment
sentiment = ['pos' if x == 1 else 'neg' for x in classifier.predict(tftest)]

print("SVC classifier: ", sentiment)

#try the textblob sentiment classifier
s = []
for x in tweets:
    s.append(classifier2(x).sentiment.classification)

print("TextBlob NB Classifier: ", s)

SVC classifier:  ['neg', 'pos', 'pos', 'pos', 'pos']
TextBlob NB Classifier:  ['neg', 'pos', 'pos', 'pos', 'pos']


In [14]:
#append sentiment to review

tw = np.array(tweets)
sent = np.array(sentiment)

reviews = np.hstack((tw.reshape((len(tw), 1)), sent.reshape((len(sent), 1))))

print(reviews)

[['Ace Ventura. Neither terrible, boring nor soporific, just not very funny.'
  'neg']
 ['Jumanji, with plenty of laughs, action-packed excitement, great music, spectacular sets, and inspirational themes, this film is an absolutely winning adventure.'
  'pos']
 ['Die Hard. There are good performances from everyone in this long, often funny, very violent but exciting melodrama.'
  'pos']
 ["Meet Joe Black. I've never encountered such dramatic flatulence, never heard so many pregnant silences that don't deliver, never watched so many close-ups that graze on actors' faces until every last trace of expression has been devoured."
  'pos']
 ["Toy Story is a Pixar classic, one of the best kids' movies of all time."
  'pos']]


## Get recommendations based on positive tweets

In [15]:
# get the positive tweets
positive_reviews = np.array(list(filter(lambda x: x[1] == 'pos', reviews)))

print(positive_reviews)

[['Jumanji, with plenty of laughs, action-packed excitement, great music, spectacular sets, and inspirational themes, this film is an absolutely winning adventure.'
  'pos']
 ['Die Hard. There are good performances from everyone in this long, often funny, very violent but exciting melodrama.'
  'pos']
 ["Meet Joe Black. I've never encountered such dramatic flatulence, never heard so many pregnant silences that don't deliver, never watched so many close-ups that graze on actors' faces until every last trace of expression has been devoured."
  'pos']
 ["Toy Story is a Pixar classic, one of the best kids' movies of all time."
  'pos']]


### Find the movie (in the dataset) mentioned in the tweet

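> **Warning:** This function may not find the right movie, because it scans the dataset linearly and returns the first title that appears in the tweet. If another title also matches, the earlier one wins: in the output below, the "Meet Joe Black" tweet is matched to "Faces" because the word "faces" appears in the text.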

In [16]:
import re

#search movie titles in the dataset using regex, and return the (index, row) pair of the first match
dataset['original_title'] = dataset['original_title'].str.lower()

def is_movie_tweet(text):

    text = text.lower()
    for series in dataset.iterrows():
        t = r"\b" + re.escape(series[1]['original_title']) + r"\b"
        if re.search(t, text) is not None:
            return series

    return None

movies = [m for m in (is_movie_tweet(text) for text in positive_reviews[:, 0]) if m is not None]

print(movies)

[(1, original_title                     jumanji
genres            Adventure Fantasy Family
Name: 1, dtype: object), (1007, original_title           die hard
genres            Action Thriller
Name: 1007, dtype: object), (688, original_title    faces
genres            Drama
Name: 688, dtype: object), (0, original_title                  toy story
genres            Animation Comedy Family
Name: 0, dtype: object)]
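
The third result above is "faces" rather than "Meet Joe Black". One possible mitigation, sketched here as a heuristic and not as the notebook's method, is to collect every matching title and keep the longest one. The function name `find_longest_title`, the three-title list and the shortened tweet are all assumptions made for the example.

```python
import re

def find_longest_title(text, titles):
    """Return the longest title that appears as a whole phrase in text, or None."""
    text = text.lower()
    matches = [
        title for title in titles
        if re.search(r"\b" + re.escape(title.lower()) + r"\b", text)
    ]
    return max(matches, key=len) if matches else None

titles = ["faces", "meet joe black", "toy story"]  # hypothetical subset of the dataset
tweet = "Meet Joe Black. So many close-ups that graze on actors' faces."  # shortened sample
print(find_longest_title(tweet, titles))  # meet joe black
```

This still picks a wrong movie when a tweet only contains an ordinary word that happens to be a title, so it reduces the problem without removing it.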


### Content based filtering recommendations
We base the recommendations on the genres of the liked movies: each movie's genres are turned into a count vector, and the movies whose vectors have the highest cosine similarity to a liked movie are recommended.
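
For intuition, the cosine similarity between two genre strings can be computed by hand as the dot product of their count vectors divided by the product of their lengths. The sketch below uses only the standard library and splits on whitespace, which is a simplification of what a vectorizer does; `genre_cosine` is a made-up helper name.

```python
import math
from collections import Counter

def genre_cosine(genres_a, genres_b):
    """Cosine similarity between the word-count vectors of two genre strings."""
    a = Counter(genres_a.lower().split())
    b = Counter(genres_b.lower().split())
    dot = sum(a[g] * b[g] for g in a)  # Counter returns 0 for genres missing from b
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

liked = "Adventure Fantasy Family"
print(round(genre_cosine(liked, "Adventure Family Fantasy"), 3))  # 1.0
print(round(genre_cosine(liked, "Comedy Family"), 3))             # 0.408
print(round(genre_cosine(liked, "Action Thriller"), 3))           # 0.0
```

Word order does not matter, so movies that list the same genres in a different order still get a similarity of 1.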

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
    
def similarContentBased(movieId, limit=5):

    # genre counts, one row per movie
    cv = CountVectorizer()
    X = cv.fit_transform(dataset['genres']).toarray()

    # similarity of every movie to the selected one, most similar first
    # (sim is indexed by row position, which matches the dataset labels only if dropna() removed no rows)
    sim = cosine_similarity(X)
    sim = pd.Series(sim[movieId]).sort_values(ascending=False)

    recommendations = []
    for ind in sim.index:
        if ind == movieId:
            continue  # skip the movie itself
        movie = dataset.loc[ind, ['original_title', 'genres']]
        recommendations.append((movie['original_title'], movie['genres'], sim[ind]))
        if len(recommendations) == limit:
            break

    return recommendations

In [18]:
recommendations = [similarContentBased(r[0]) for r in movies]

print(recommendations)

[[('return to oz', 'Adventure Family Fantasy', 1.0000000000000002), ('peter pan', 'Adventure Fantasy Family', 1.0000000000000002), ('harry potter and the prisoner of azkaban', 'Adventure Fantasy Family', 1.0000000000000002), ('jason and the argonauts', 'Adventure Family Fantasy', 1.0000000000000002), ('clash of the titans', 'Adventure Fantasy Family', 1.0000000000000002)], [('on deadly ground', 'Action Thriller', 0.9999999999999998), ('air force one', 'Action Thriller', 0.9999999999999998), ('iron eagle iii', 'Action Thriller', 0.9999999999999998), ('the peacemaker', 'Action Thriller', 0.9999999999999998), ('d-tox', 'Action Thriller', 0.9999999999999998)], [('the hours', 'Drama', 1.0), ('querelle', 'Drama', 1.0), ('dead poets society', 'Drama', 1.0), ('the graduate', 'Drama', 1.0), ('coming apart', 'Drama', 1.0)], [('the great mouse detective', 'Comedy Animation Family', 1.0000000000000002), ('the wrong trousers', 'Animation Comedy Family', 1.0000000000000002), ("bon voyage, charlie br

In [19]:
#printing each movie with its recommendations

for i, movie in enumerate(movies):
    print("\n-- Movie: \033[1m" + movie[1]['original_title'].title() + \
          "\033[0m | Genres: " + movie[1]['genres'] + " | Index: " + str(movie[0]))
    
    for rec in recommendations[i]:
        print("\n\t*Recommendation: \033[1m" + rec[0].title() + \
              "\033[0m | Genres: " + rec[1] + " | Similarity: " + str(round(rec[2], 2)))


-- Movie: Jumanji | Genres: Adventure Fantasy Family | Index: 1

	*Recommendation: Return To Oz | Genres: Adventure Family Fantasy | Similarity: 1.0

	*Recommendation: Peter Pan | Genres: Adventure Fantasy Family | Similarity: 1.0

	*Recommendation: Harry Potter And The Prisoner Of Azkaban | Genres: Adventure Fantasy Family | Similarity: 1.0

	*Recommendation: Jason And The Argonauts | Genres: Adventure Family Fantasy | Similarity: 1.0

	*Recommendation: Clash Of The Titans | Genres: Adventure Fantasy Family | Similarity: 1.0

-- Movie: Die Hard | Genres: Action Thriller | Index: 1007

	*Recommendation: On Deadly Ground | Genres: Action Thriller | Similarity: 1.0

	*Recommendation: Air Force One | Genres: Action Thriller | Similarity: 1.0

	*Recommendation: Iron Eagle Iii | Genres: Action Thriller | Similarity: 1.0

	*Recommendation: The Peacemaker | Genres: Action Thriller | Similarity: 1.0

	*Rec