# Group Assignment Week 4 Search Engines

## Group name: Yahoe

## Students: Jasper van Eck, Ghislaine van den Boogerd, Joris Galema, Lotte Bottema
## Student IDs: 6228194, 10996087, 11335165, 11269642

### Github link: https://github.com/JasperVanEck/ZoekmachinesGroepsOpdracht

### Link to our demo video: https://www.youtube.com/watch?v=R8OJBv0Kur4

### Link to presentation: https://docs.google.com/presentation/d/1vsgE_xHZwQ-ztudv7EKTB0tiEP7fSZqsUIMs9K4zZE0/edit?usp=sharing



# Table of Contents<a name='Top'></a>
[Import data](#ImportData)

[Create the TF Dict](#TFDict)

[Create the TF-IDF and Normalize](#TFIDFNorm)

[Vectorize Query](#InputQuery)

[Results](#Results)

- [WordCloud](#WordCloud) Requirement 3
- [Interact with Filters](#Filters) Requirements 1, 2, 4 and 5

[Cohen's Kappa](#Cohen) Requirement 6

[Graph of Timestamps Hits](#Graph)

[Extra Information](#Extra)



# Import Data<a name='ImportData'></a>

In [None]:
#Imports
import pandas as pd
import math
import numpy as np
from elasticsearch import Elasticsearch
import nltk
import PIL
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import re
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import json
from collections import Counter, defaultdict
from sklearn import preprocessing
from datetime import datetime
import operator
from IPython.core.display import HTML

In [None]:
#Show full column contents; pandas >= 1.0 expects None here instead of -1
pd.set_option('display.max_colwidth', None)

#Open & read JSON file
#Init empty list for json data to be stored
jsonDataReviews = []
with open('IMDB_reviews.json') as json_file:
    #Loop through lines in json file, each review is on a separate line
    for line in json_file:
        #Append to the list of json data
        jsonDataReviews.append(json.loads(line))

#Read the data from the json file
dataReviews = pd.DataFrame(jsonDataReviews)

#Add Review_id column
#Create index range
review_id = list(range(len(dataReviews)))
#Insert the index range into the DF
dataReviews.insert(0,'review_id',review_id,True)
#Cast to string from obj
dataReviews['review_summary'] = dataReviews['review_summary'].astype(str)
dataReviews['review_text'] = dataReviews['review_text'].astype(str)
#Cast to int from str
dataReviews['rating'] = dataReviews['rating'].astype(int)
#Cast to bool from obj
dataReviews['is_spoiler'] = dataReviews['is_spoiler'].astype(bool)
#Create datetime objects from the review_date string
dataReviews['review_date'] = [datetime.strptime(dateString, '%d %B %Y') for dateString in dataReviews['review_date'].values]
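
The loading and casting steps above can be tried without the full dataset. The following is a minimal sketch, assuming two made-up records that only mimic the field names used in this notebook (`review_date`, `rating`, `is_spoiler`); the helper name `parse_reviews` is hypothetical and plain Python stands in for the pandas casts.

```python
import io
import json
from datetime import datetime

# Two made-up records in JSON Lines form (one JSON object per line);
# they only mimic the field names used in this notebook
sample = io.StringIO(
    '{"review_date": "10 February 2006", "rating": "10", "is_spoiler": true}\n'
    '{"review_date": "6 September 2000", "rating": "8", "is_spoiler": false}\n'
)

def parse_reviews(lines):
    """Hypothetical helper: parse each line and convert the field types."""
    reviews = []
    for line in lines:
        record = json.loads(line)
        # Ratings are stored as strings, so cast them to int
        record["rating"] = int(record["rating"])
        # '%d %B %Y' matches dates such as '10 February 2006'
        record["review_date"] = datetime.strptime(record["review_date"], "%d %B %Y")
        reviews.append(record)
    return reviews

reviews = parse_reviews(sample)
print(reviews[0]["review_date"].year, reviews[1]["rating"])
```

Note that `%B` matches the full month name in the current locale, so this sketch assumes an English locale.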

In [None]:
#Open & read TSV file with movie details
dataMovies = pd.read_csv('data.tsv', sep='\t', header=0, dtype={'tconst':str,'titleType':str,
                                                                'primaryTitle':str,'originalTitle':str,
                                                                'isAdult':str,'startYear':str,'endYear':str,
                                                                'runtimeMinutes':str,'genres':str})

In [None]:
movieTitles = dataMovies[dataMovies['tconst'].isin(dataReviews['movie_id'].values)]
movieTitles.head(1)

In [None]:
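#Replace the movie_id with the movie name
#Look up each title through a tconst -> primaryTitle Series instead of filtering movieTitles once per review;
#a movie_id without a match would give NaN here rather than an IndexError
titleByMovieId = movieTitles.drop_duplicates('tconst').set_index('tconst')['primaryTitle']
movieTitlesInsertList = dataReviews['movie_id'].map(titleByMovieId).values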
dataReviews.insert(7, 'movie_title', movieTitlesInsertList, True)

In [None]:
#Example of data
dataReviews.head(10)

# Create the TF Dict<a name='TFDict'></a>

[Top](#Top)

In [None]:
#Init a default dict
tfDict = defaultdict(lambda: defaultdict(int))

#Init Porter Stemmer
ps = nltk.stem.PorterStemmer()

#Use fewer reviews to reduce runtimes for testing/practice
dataReviewsLess = dataReviews.head(50000).copy()

#Retrieve the actual reviews
reviewTexts = dataReviewsLess['review_text'].values

#Loop through reviews
for i in range(len(reviewTexts)):
    #Tokenize reviews and lowercase the text; drop the empty strings re.split leaves at the edges
    line = [token for token in re.split(r'\W+',reviewTexts[i].lower()) if token]
    #Loop through tokens in review
    for word in line:
        #Stem token
        stem = ps.stem(word)
        #Increment frequency
        tfDict[stem][i] += 1

#Add in Corpus Frequency, Document Frequency and reposition the frequencies per document
tfDictXtra = defaultdict(lambda: defaultdict(int))
for word in tfDict:
    tfDictXtra[word]['CorpusFreq'] = sum(tfDict[word].values())
    tfDictXtra[word]['DocFreq'] = len(tfDict[word])
    tfDictXtra[word]['Freq_per_doc'] = tfDict[word]
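
To make the shape of the term-frequency dictionary concrete, here is a minimal sketch on a made-up two-document corpus. It assumes plain lowercase tokens and leaves out the Porter stemmer so that it runs without NLTK; the names `docs`, `tf`, `corpus_freq` and `doc_freq` are illustrative only.

```python
import re
from collections import defaultdict

# Made-up mini corpus standing in for the review texts
docs = ["A great movie, great cast!", "The cast was fine."]

# term -> {document index -> frequency}, the same shape as tfDict
tf = defaultdict(lambda: defaultdict(int))
for doc_id, text in enumerate(docs):
    # Lowercase, split on non-word characters, and skip empty tokens
    for token in re.split(r"\W+", text.lower()):
        if token:
            tf[token][doc_id] += 1

# Corpus frequency sums over documents; document frequency counts them
corpus_freq = {term: sum(freqs.values()) for term, freqs in tf.items()}
doc_freq = {term: len(freqs) for term, freqs in tf.items()}

print(dict(tf["great"]), corpus_freq["cast"], doc_freq["cast"])
```

Because `tf` is a `defaultdict`, merely looking up an unseen term adds an empty entry for it, so lookups after indexing are safer through `tf.get(term)`.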


# Create the TF-IDF and Normalize<a name='TFIDFNorm'></a>

[Top](#Top)

In [None]:
#Get the total number of reviews/documents
totalDocs = len(dataReviewsLess)

#Total unique words
totalUniqueWords = len(tfDictXtra)

#Create np matrix with zeros
tfIdf = np.zeros((totalUniqueWords,totalDocs))

#Create dataframe of words with index list to get the word position in matrix for future reference
wordsIndex = pd.DataFrame(list(tfDictXtra.keys()),columns=['Words'])
#Create index range
wordID = list(range(totalUniqueWords))
#Insert the index range
wordsIndex.insert(0,'Index',wordID,True)
#Index counter, to keep track of location in word list
wordCounter = 0


#loop through words in dict
for word in tfDictXtra:
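    #Loop through frequencies of word in a doc from dict. NOTE: this line sometimes raises
    #AttributeError: 'int' object has no attribute 'keys'. That happens when tfDictXtra holds an entry for a word
    #that was only looked up and never indexed: the nested defaultdict(int) then returns 0 for 'Freq_per_doc'.
    #Rerunning the previous cells rebuilds the dict and usually fixes it.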
    dictLoop = list(tfDictXtra[word]['Freq_per_doc'].keys())
    for doc in dictLoop:
        #Calculate the TF-IDF
        tfIdf[wordCounter,doc] = tfDictXtra[word]['Freq_per_doc'][doc]*math.log((totalDocs/(1+tfDictXtra[word]['DocFreq'])))
    wordCounter += 1


In [None]:
#Transpose the tfIdf matrix and normalize, since the normalize works on rows, and we need to normalize the columns
tfIdfNorm = preprocessing.normalize(tfIdf.T, norm='l2')
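
The weighting and normalization above can be checked by hand on a tiny example. This is a minimal sketch in plain Python, assuming made-up term frequencies and a corpus size of 4; the helper name `tf_idf` is hypothetical, while the formula `tf * log(N / (1 + df))` is the one used in the cell above.

```python
import math

# Made-up term frequencies: term -> {document index -> frequency}
tf = {"great": {0: 2}, "cast": {0: 1, 1: 1}, "plot": {1: 3}}
total_docs = 4  # assumed corpus size; documents 2 and 3 contain none of the terms

def tf_idf(freq, doc_freq, n_docs):
    # Same weighting as in the cell above: tf * log(N / (1 + df))
    return freq * math.log(n_docs / (1 + doc_freq))

terms = sorted(tf)  # fixed term order: ['cast', 'great', 'plot']
vectors = []
for doc_id in range(2):
    vec = [tf_idf(tf[t].get(doc_id, 0), len(tf[t]), total_docs) for t in terms]
    # L2-normalize so that every document vector has length 1
    norm = math.sqrt(sum(w * w for w in vec))
    vectors.append([w / norm for w in vec] if norm else vec)

print(vectors[0])
```

One property of this formula worth keeping in mind: `log(N / (1 + df))` is zero for a term that occurs in all but one document and negative for a term that occurs in every document.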

# Vectorize query<a name='InputQuery'></a>

[Top](#Top)

In [None]:
#Starting/test query
query = input('Enter your query:' )
#query = "kid friendly movie"

#Create a normalized vector of query
def vectorizeQuery(query):
    #Create empty base vector for Term Freq
    queryVector = np.zeros(totalUniqueWords)
    #Tokenize and make lowercase
    line = re.split(r'\W+',query.lower())
    #Loop through words
    for word in line:
        #Stem each word
        stem = ps.stem(word)
        #Increase term freq of query term
        queryVector[wordsIndex[wordsIndex['Words']==stem]['Index'].values] += 1
    
    #Create empty base vector for TF-IDF
    queryVectorTfIdf = np.zeros(totalUniqueWords)
    #Loop through TF vector of query
    for i in range(len(queryVector)):
        #Act where a term frequency was recorded
        if queryVector[i] != 0:
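            #Determine which word it was based on the index. Take the plain string: str() of the
            #array gives "['word']", which is not a key in tfDictXtra and would add a bogus entry to it
            word = wordsIndex[wordsIndex['Index']==i]['Words'].values[0]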
            #Calculate the TF-IDF
            queryVectorTfIdf[i] = queryVector[i]*math.log((totalDocs/(1+tfDictXtra[word]['DocFreq'])))
    
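    #Make the TF-IDF vector a unit vector
    length = np.sqrt(queryVectorTfIdf.dot(queryVectorTfIdf))
    #If no query term occurs in the vocabulary the length is 0; return the zero vector instead of dividing by zero
    if length == 0:
        return queryVectorTfIdf
    queryVectorNorm = queryVectorTfIdf/length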
    
    #Return the unit vector
    return queryVectorNorm


In [None]:
#Cosine similarity matching
def cosineSim(vector, docVector):
    #Only dot product needed since vectors are already unit vectors and therefore the lengths are 1
    return vector.dot(docVector)#/(length vector * length docVector)
    
def rankedList(queryVector):
    #Create empty score list
    scoreList = np.zeros(totalDocs)
    #Loop through each doc
    for i in range(len(tfIdfNorm)):
        #Calculate for each doc the cosine sim. Index of scoreList = review_id
        scoreList[i] = cosineSim(queryVector,tfIdfNorm[i])
    
    #Create new data frame for ranked list based on smaller DF of data
    rankedDocList = dataReviewsLess.copy()
    #Insert the similarity score for each review
    rankedDocList.insert(0,'Score',scoreList,True)
    #Sort the review similarity based on the score and return
    return rankedDocList.sort_values(by='Score',ascending=False)

In [None]:
#Create the ranking list
rankings = rankedList(vectorizeQuery(query))
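
The ranking step relies on the fact that the cosine similarity of two unit vectors is their dot product. Here is a minimal sketch of that idea in plain Python, assuming made-up weights over a three-term vocabulary; the helper names `unit` and `cosine_sim` are hypothetical.

```python
import math

def unit(vec):
    """Scale a vector to length 1; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_sim(a, b):
    # For unit vectors the cosine similarity is just the dot product
    return sum(x * y for x, y in zip(a, b))

# Made-up weights over a three-term vocabulary
doc_vectors = [unit([1.0, 2.0, 0.0]), unit([0.0, 1.0, 3.0]), unit([2.0, 0.0, 0.0])]
query_vector = unit([1.0, 1.0, 0.0])

# Score every document, then sort the document indices by descending score
scores = [cosine_sim(query_vector, doc) for doc in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # prints [0, 2, 1]
```

With NumPy arrays, the same scores for all documents could be obtained in one matrix-vector product, for example `tfIdfNorm.dot(queryVector)`, which avoids the Python loop over documents.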

# Results<a name='Results'></a>

[Top](#Top)

### WordCloud <a name='WordCloud'></a>

[Top](#Top)

In [None]:
#Source: https://stackoverflow.com/questions/16645799/how-to-create-a-word-cloud-from-a-corpus-in-python
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = "WordCloud of Query Results"):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=40,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(' '.join(data)) # join the texts; str(data) would leak the Series index and dtype line into the cloud

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

@interact
def showingWordcloudsOfKRanking(k=(1,50,1)):
    show_wordcloud(rankings.head(k)['review_text'])
    

@interact
#review_id runs from 0 to len(dataReviewsLess)-1, so the slider uses the same range
def showingWordCloudOfOneReview(i=(0,len(dataReviewsLess)-1,1)):
    show_wordcloud(dataReviewsLess[dataReviewsLess['review_id']==i]['review_text'].values,'WordCloud of a review')

### Interact with Filters<a name='Filters'></a>

[Top](#Top)

In [None]:
#Function to filter on the variables created by interact widget
def showResultsTime(start_date, end_date, AmountResults, AtleastRating, spoiler, movie_title):
    start_date = pd.Timestamp(start_date)
    end_date = pd.Timestamp(end_date)
    #Filters that always apply: date range and minimum rating
    mask = ((rankings.review_date > start_date)
            & (rankings.review_date < end_date)
            & (rankings.rating >= AtleastRating))
    #Optional spoiler filter; 'Both' keeps every review
    if spoiler == 'Yes':
        mask = mask & (rankings.is_spoiler == True)
    elif spoiler == 'No':
        mask = mask & (rankings.is_spoiler == False)
    #Optional movie title filter; 'None' keeps every movie
    if movie_title != 'None':
        mask = mask & (rankings.movie_title == movie_title)
    return rankings[mask].head(AmountResults)

#Sort the movieTitles DF
tmp = movieTitles.sort_values(by='primaryTitle')
#Prep a list of movie titles for filter
titles = ['None']
titles.extend(tmp['primaryTitle'].values)
#The interact function for faceted search
_ = interact(showResultsTime,
             start_date=widgets.DatePicker(value=pd.to_datetime('2014-01-01')),
             end_date=widgets.DatePicker(value=pd.to_datetime('2019-01-01')),
             AmountResults=(10, 100, 10),
             AtleastRating=(1,10,1),
             spoiler=['Both','Yes','No'],
             movie_title=titles)

## Cohen's Kappa<a name='Cohen'></a>
[Top](#Top)

In [None]:
query1judge = np.array([[1,1],[1,1],[1,1],[1,1],[1,1],[1,1],[1,1],[1,1],[1,1],[1,1]])
query2judge = np.array([[1,1],[0,1],[1,0],[0,0],[0,1],[1,1],[1,1],[1,1],[0,0],[1,1]])
query3judge = np.array([[1,1],[1,1],[1,1],[1,1],[1,1],[1,1],[0,0],[0,0],[0,0],[1,1]])
query4judge = np.array([[1,1],[1,1],[0,0],[0,0],[0,0],[0,1],[0,0],[0,0],[0,0],[1,1]])
query5judge = np.array([[1,1],[1,1],[0,0],[1,1],[1,1],[0,1],[1,1],[0,1],[0,0],[1,0]])

def AveragePrecision(ranked_list_of_results, list_of_relevant_objects):
    total = len(list_of_relevant_objects)
    sumPk = 0
    rank = 0
    relevant = 0
    for result in ranked_list_of_results:
        rank += 1
        if result in list_of_relevant_objects:
            relevant += 1
            sumPk += relevant/rank
            
    aprecision = sumPk/total
    return aprecision

#def AveragePrecision(ranked_list_of_results, list_of_relevant_objects):
#    begin = 1/len(list_of_relevant_objects)
#    count = 0
#    for i, res in enumerate(ranked_list_of_results):
#        for j, obj in enumerate(list_of_relevant_objects):
#            if obj == res:
#                itera = (j+1) / (i+1)
#            count = count + itera
#    return begin * count

def PE(data):
    '''On input data, return the P(E) (expected agreement).'''
    relevant = 0
    nonrelevant = 0
    # Iterate over the data
    for i in data:
        for j in i:
            
            # Top up the relevant documents by one if 1 is encountered
            if j == 1:
                relevant += 1
            # Top up the nonrelevant documents by one if 0 is encountered
            if j == 0:
                nonrelevant += 1

    # Calculates the total of inspected documents for the judges combined
    total = len(data)*2

    # Calculates the pooled marginals
    rel = relevant/total
    nonrel = nonrelevant/total

    # Calculates the P(E)
    P_E = nonrel**2 + rel **2    
    return    P_E 


def kappa(data, P_E):
    agree = 0
    for i in data:
        temp = None
        for j in i:
            if temp == j:
                agree += 1
            temp = j
    P_A = agree / len(data)
    if P_E == 1:
        kappa = 1
    else:
        kappa = (P_A - P_E)/(1 - P_E)   
    return kappa

P_EQ1 = PE(query1judge)
P_EQ2 = PE(query2judge)
P_EQ3 = PE(query3judge)
P_EQ4 = PE(query4judge)
P_EQ5 = PE(query5judge)

KappaQ1 = kappa(query1judge, P_EQ1)
KappaQ2 = kappa(query2judge, P_EQ2)
KappaQ3 = kappa(query3judge, P_EQ3)
KappaQ4 = kappa(query4judge, P_EQ4)
KappaQ5 = kappa(query5judge, P_EQ5)

KappaVal = [KappaQ1, KappaQ2, KappaQ3, KappaQ4, KappaQ5]

plt.xlabel('Query')
plt.ylabel('Kappa Value')
plt.bar(np.arange(len(KappaVal)), KappaVal)
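
As a worked check of the kappa computation, the following minimal sketch recomputes the value for the second query in plain Python. It assumes the same pooled-marginals definition of P(E) as `PE` above; the function name `cohen_kappa` is hypothetical.

```python
def cohen_kappa(judgements):
    """Kappa for two judges with pooled marginals; judgements holds (judge1, judge2) pairs of 0/1."""
    n = len(judgements)
    # Observed agreement: share of documents on which both judges agree
    p_a = sum(1 for a, b in judgements if a == b) / n
    # Pooled marginals: share of all 2*n judgements that are 1 (relevant)
    p_rel = sum(a + b for a, b in judgements) / (2 * n)
    p_e = p_rel ** 2 + (1 - p_rel) ** 2
    # If chance agreement is already 1, kappa is undefined; follow the notebook and return 1
    return 1.0 if p_e == 1 else (p_a - p_e) / (1 - p_e)

# The judgements for the second query, copied from query2judge above
query2 = [(1, 1), (0, 1), (1, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (0, 0), (1, 1)]
print(round(cohen_kappa(query2), 3))  # prints 0.341
```

For this query the judges agree on 7 of 10 documents, so P(A) = 0.7, and 13 of the 20 judgements are "relevant", so P(E) = 0.65² + 0.35² = 0.545, giving kappa = 0.155 / 0.455 ≈ 0.341. For the first query all judgements are 1, so P(E) = 1 and kappa is set to 1 by convention.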

## Graph of Timestamps Hits<a name='Graph'></a>
[Top](#Top)

In [None]:
def count_occurences(movietitles):
    count = Counter(movietitles) 
    # sort the items from most to least frequent (first one is the highest rank)
    sorted_count = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
    # keep only the counts, not the items themselves
    occurences = []
    for i in sorted_count:
        occurences.append(i[1])
    return occurences

def plot_bar(lists, nameplot):
    y = count_occurences(lists)
    x = np.arange(len(y))
    
    # plots the number of occurrences per item against the rank of that item (most frequent first)
    plt.bar(x,y)
    plt.xlabel(nameplot)
    plt.ylabel('Occurrences')
    plt.show()

plt.ylim(0,250)
plot_bar(dataReviews['review_date'], 'Date of review')

plt.ylim(0,1000)
plot_bar(dataReviews['movie_title'], 'Movie title')



## Extra information <a name='Extra'></a>
[Top](#Top)

We chose the IMDb review dataset because it is an interesting dataset with a lot of different information per review. It also let us combine it with a second dataset, the movie details, to build a fuller picture of each review.

Our search engine works very well, but sometimes returns "funny" results. For example, reviews of 'The Godfather' appear if you search for "Kid friendly movies". On closer inspection this is logical, because the people who wrote the reviews do mention that they think kids should watch it.

Our word clouds also work very well, as does our faceted search. Both run in real time.

We did have problems implementing Elasticsearch.

Overall, we assess the quality of our work as good. We made a functioning search engine that does what we wanted it to do. It was a very meaningful and inspiring exercise and taught us all to put the last four weeks of theory into practice. We definitely think we succeeded in making a working search engine; with more time we would have loved to expand on it.


## A note on Elasticsearch

We do not have a sufficient implementation of Elasticsearch ready. 
Initially we thought that Elasticsearch was not necessary. When we did start on it, we found out that it needed more time than we had expected.

We had problems storing the data on an Elasticsearch server and running queries against it. The data did not seem to fit on the server, even though this seemed unlikely to us. Unfortunately we did not find a solution in time. We hope that we will in the future, because Elasticsearch does seem like a great tool.