# Group Assignment Week 4: Search Engines

## Students: Jasper van Eck, Ghislaine, Joris Galema, Lotte
## Student IDs: 6228194, 10996087, 11335165, 11269642


# Table of Contents<a name='Top'></a>
[Import data](#ImportData)

[Create the TF Dict](#TFDict)

[Create the TF-IDF and Normalize](#TFIDFNorm)

[Vectorize Query](#InputQuery)

[Results](#Results)

- [WordCloud](#WordCloud) Requirement 3
- [Interact with Filters](#Filters) Requirements 1, 2, 4 and 5

[Cohen's Kappa](#Cohen) Requirement 6



# Import Data<a name='ImportData'></a>

In [2]:
#Imports
import pandas as pd
import math
import numpy as np
from elasticsearch import Elasticsearch
import nltk
import PIL
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import re
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import json
from collections import Counter, defaultdict
from sklearn import preprocessing
from datetime import datetime

In [3]:
#Show full column contents; None replaces the deprecated value -1
pd.set_option('display.max_colwidth', None)

#Open & read JSON file
#Init empty list for json data to be stored
jsonDataReviews = []
with open('IMDB_reviews.json') as json_file:
    #Loop through lines in json file, each review is on a separate line
    for line in json_file:
        #Append to the list of json data
        jsonDataReviews.append(json.loads(line))

#Read the data from the json file
dataReviews = pd.DataFrame(jsonDataReviews)

#Add Review_id column
#Create index range
review_id = list(range(len(dataReviews)))
#Insert the index range into the DF
dataReviews.insert(0,'review_id',review_id,True)
#Cast to string from obj
dataReviews['review_summary'] = dataReviews['review_summary'].astype(str)
dataReviews['review_text'] = dataReviews['review_text'].astype(str)
#Cast to int from str
dataReviews['rating'] = dataReviews['rating'].astype(int)
#Cast to bool from obj
dataReviews['is_spoiler'] = dataReviews['is_spoiler'].astype(bool)
#Create datetime objects from the review_date string
dataReviews['review_date'] = [datetime.strptime(dateString, '%d %B %Y') for dateString in dataReviews['review_date'].values]
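
The loading and casting steps above can be tried without the full data set. The following is a minimal sketch on two made-up records; the field values are illustrative, and it assumes ratings are stored as strings and dates in the `'%d %B %Y'` layout, as the casts above suggest.

```python
import io
import json
from datetime import datetime

# Two made-up records in the one-JSON-object-per-line layout (values are illustrative)
raw = io.StringIO(
    '{"review_date": "10 February 2006", "rating": "10", "is_spoiler": true}\n'
    '{"review_date": "6 September 2000", "rating": "8", "is_spoiler": false}\n'
)

# Parse every non-empty line as its own JSON document
records = [json.loads(line) for line in raw if line.strip()]

for rec in records:
    # Cast the rating from str to int
    rec['rating'] = int(rec['rating'])
    # Turn the date string into a datetime object
    rec['review_date'] = datetime.strptime(rec['review_date'], '%d %B %Y')

print(records[0]['review_date'].date(), records[0]['rating'])
```

Note that `%B` matches month names in the current locale, so this sketch assumes an English (or default C) locale.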

In [4]:
#Open & read TSV file with movie details
dataMovies = pd.read_csv('data.tsv', sep='\t', header=0, dtype={'tconst':str,'titleType':str,
                                                                'primaryTitle':str,'originalTitle':str,
                                                                'isAdult':str,'startYear':str,'endYear':str,
                                                                'runtimeMinutes':str,'genres':str})

In [5]:
movieTitles = dataMovies[dataMovies['tconst'].isin(dataReviews['movie_id'].values)]
movieTitles.head(1)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
18072,tt0012349,movie,The Kid,The Kid,0,1921,\N,68,"Comedy,Drama,Family"


In [6]:
#Replace the movie_id with the movie name
movieTitlesInsertList = [movieTitles[movieTitles['tconst']==movie_id]['primaryTitle'].values[0] for movie_id in dataReviews['movie_id'].values]
dataReviews.insert(7, 'movie_title', movieTitlesInsertList, True)

In [7]:
#Example of data
dataReviews.head(10)

Unnamed: 0,review_id,review_date,movie_id,user_id,is_spoiler,review_text,rating,movie_title,review_summary
0,0,2006-02-10,tt0111161,ur1898687,True,"In its Oscar year, Shawshank Redemption (written and directed by Frank Darabont, after the novella Rita Hayworth and the Shawshank Redemption, by Stephen King) was nominated for seven Academy Awards, and walked away with zero. Best Picture went to Forrest Gump, while Shawshank and Pulp Fiction were ""just happy to be nominated."" Of course hindsight is 20/20, but while history looks back on Gump as a good film, Pulp and Redemption are remembered as some of the all-time best. Pulp, however, was a success from the word ""go,"" making a huge splash at Cannes and making its writer-director an American master after only two films. For Andy Dufresne and Co., success didn't come easy. Fortunately, failure wasn't a life sentence.After opening on 33 screens with take of $727,327, the $25M film fell fast from theatres and finished with a mere $28.3M. The reasons for failure are many. Firstly, the title is a clunker. While iconic to fans today, in 1994, people knew not and cared not what a 'Shawshank' was. On the DVD, Tim Robbins laughs recounting fans congratulating him on ""that 'Rickshaw' movie."" Marketing-wise, the film's a nightmare, as 'prison drama' is a tough sell to women, and the story of love between two best friends doesn't spell winner to men. Worst of all, the movie is slow as molasses. As Desson Thomson writes for the Washington Post, ""it wanders down subplots at every opportunity and ignores an abundance of narrative exit points before settling on its finale."" But it is these same weaknesses that make the film so strong.Firstly, its setting. The opening aerial shots of the prison are a total eye-opener. This is an amazing piece of architecture, strong and Gothic in design. Immediately, the prison becomes a character. It casts its shadow over most of the film, its tall stone walls stretching above every shot. It towers over the men it contains, blotting out all memories of the outside world. 
Only Andy (Robbins) holds onto hope. It's in music, it's in the sandy beaches of Zihuatanejo; ""In here's where you need it most,"" he says. ""You need it so you don't forget. Forget that there are places in the world that aren't made out of stone. That there's a - there's a - there's something inside that's yours, that they can't touch."" Red (Morgan Freeman) doesn't think much of Andy at first, picking ""that tall glass o' milk with the silver spoon up his ass"" as the first new fish to crack. Andy says not a word, and losing his bet, Red resents him for it. But over time, as the two get to know each other, they quickly become the best of friends. This again, is one of the film's major strengths. Many movies are about love, many flicks have a side-kick to the hero, but Shawshank is the only one I can think of that looks honestly at the love between two best friends. It seems odd that Hollywood would skip this relationship time and again, when it's a feeling that weighs so much into everyone's day to day lives. Perhaps it's too sentimental to seem conventional, but Shawshank's core friendship hits all the right notes, and the film is much better for it.It's pacing is deliberate as well. As we spend the film watching the same actors, it is easy to forget that the movie's timeline spans well over 20 years. Such a huge measure of time would pass slowly in reality, and would only be amplified in prison. And it's not as if the film lacks interest in these moments. It still knows where it's going, it merely intends on taking its sweet time getting there. It pays off as well, as the tedium of prison life makes the climax that much more exhilarating. For anyone who sees it, it is a moment never to be forgotten.With themes of faith and hope, there is a definite religious subtext to be found here. Quiet, selfless and carefree, Andy is an obvious Christ figure. 
Warden Norton (Bob Gunton) is obviously modeled on Richard Nixon, who, in his day, was as close to a personified Satan as they come. But if you aren't looking for subtexts, the movie speaks to anyone in search of hope. It is a compelling drama, and a very moving film, perfectly written, acted and shot. They just don't come much better than this.OVERALL SCORE: 9.8/10 = A+ The Shawshank Redemption served as a message of hope to Hollywood as well. More than any film in memory, it proved there is life after box office. Besting Forrest and Fiction, it ran solely on strong word of mouth and became the hottest rented film of 1995. It currently sits at #2 in the IMDb's Top 250 Films, occasionally swapping spots with The Godfather as the top ranked film of all time -- redemption indeed. If you haven't seen it yet, what the hell are you waiting for? As Andy says, ""It comes down a simple choice, really. Either get busy living, or get busy dying.""",10,The Shawshank Redemption,A classic piece of unforgettable film-making.
1,1,2000-09-06,tt0111161,ur0842118,True,"The Shawshank Redemption is without a doubt one of the most brilliant movies I have ever seen. Similar to The Green Mile in many respects (and better than it in almost all of them), these two movies have shown us that Stephen King is a master not only of horror but also of prose that shakes the soul and moves the heart. The plot is average, but King did great things with it in his novella that are only furthered by the direction, and the acting is so top-rate it's almost scary.Tim Robbins plays Andy Dufrane, wrongly imprisoned for 20 years for the murder of his wife. The story focuses on Andy's relationship with ""Red"" Redding (Morgan Freeman, in probably his best role) and his attempts to escape from Shawshank. Bob Gunton is positively evil and frightening as Warden Norton, and there are great performances and cameos all around; the most prominent one being Gil Bellows (late as Billy of Ally McBeal) as Tommy, a fellow inmate of Andy's who suffers under the iron will of Norton.If you haven't seen this movie, GO AND RENT IT NOW. You will not be disappointed. It is positively the best movie of the '90's, and one of my Top 3 of all time. This movie is a spectacle to move the mind, soul, and heart. 10/10",10,The Shawshank Redemption,Simply amazing. The best film of the 90's.
2,2,2001-08-03,tt0111161,ur1285640,True,"I believe that this film is the best story ever told on film, and I'm about to tell you why.Tim Robbins plays Andy Dufresne, a city banker, wrongfully convicted of murdering his wife and her lover. He is sent to Shawshank Prison in 1947 and receives a double life sentence for the crime. Andy forms an unlikely friendship with ""Red"" (Morgan Freeman), the man who knows how to get things. Andy faces many trials in prison, but forms an alliance with the wardens because he is able to use his banking experience to help the corrupt officials amass personal fortunes. The story unfolds....I was so impressed with how every single subplot was given a great deal of respect and attention from the director. The acting was world-class. I have never seen Tim Robbins act as well since, Morgan Freeman maybe (e.g. Seven). The twists were unexpected, an although this film had a familiar feel, it wasn't even slightly pretentious or cliched, it was original. The cinematography was grand and expressive. It gave a real impression of the sheer magnitude of this daunting prison.But the one thing which makes THE SHAWSHANK REDEMPTION stand above all other films, is the attention given to the story. The film depends on the story and the way in which it unravels. It's a powerful, poignant, thought-provoking, challenging film like no other. If Andy were to comment on this film, I think he might say: ""Get busy watching, or get busy dying."" Take his advice.Thoroughly recommended.",8,The Shawshank Redemption,The best story ever told on film
3,3,2002-09-01,tt0111161,ur1003471,True,"**Yes, there are SPOILERS here**This film has had such an emotional impact on me, I find it hard to write comments on it. I've read a lot of the previous comments; all those that gush and eulogise as well as those who think it's over-rated or cliched. Most have got good points to make, however the thing that I think everyone is struggling to both explain and come to terms with is just why this film is *so* loved. Loved to the extent that for many it is an almost spiritual experience or for those of a more secular nature like myself, loved as one of the most devastatingly uplifting things that can happen to you while watching a film.So I'm not going to review it, I'm just going to struggle in my own way to explain this film. It took me a few viewings to get why I connect with it so deeply, but here goes.Many people in this world are unhappy. Most people in this world don't want to be unhappy. Lots of people wish, pray and above all hope for that magic wand to wave and wash them of their fears, losses, angers and pains once and for all. They see lots of other people seemingly in this magical state, while they suffer. To borrow the words of another film, they're watching the bluebirds flying over the rainbow.Many unhappy people have learned that the magic wand doesn't exist. They're not destined to join the bluebirds and fairytales don't come true. It's not that no one lives happily ever after, it's just that they're not going to. They're busy dying.In this film, or as some people have quite correctly said, this fairytale, magic wands exist. And that magic wand is Andy Dufrense imitating Houdini. However this film is not about him. Neither is it about the prison, the governor, the guard, the plot, the acting, the cinematography, the script, the direction or the score.It's about Red. 
He is the one who has become institutionally unhappy, he's not only trapped in a prison, not only has he given up on the idea of ever leaving, not only does he have no hope, he knows that if the miracle would ever happen to him, he couldn't cope. He's safe in his unhappiness and that security is what keeps him going. Hope is, as Red say, dangerous. The metaphor for a certain illness here is very clear to me and I know that a rather large number of people suffer from it. A large proportion of those don't understand what's wrong, but they certainly can recognise a fellow sufferer. Those who are mercifully untouched by this illness definitely don't understand what's going on in those who do. They're too busy living.The miracle in this film is not only that Red is redeemed but that the world outside the prison isn't all warm and sandy and sunny and with excellent fishing. Some of it is rocky and uncertain. Fairytales don't get this far. They'd end as Red left the gates of the prison and the credits would say 'and he lived happily ever after'. This is the only film I can currently think of where they show how to get to the living happily ever after bit from your redemption via the rocky and uncertain ground of bagging groceries at the local store. In other words, they're not going to cheat you and tell you everything's going to be alright.This is crucial. For two and half hours, those of us who are quite content to mooch around our own personal prisons can see an escape route quite different to Andy's mapped out on the screen. And it's a real way out. It's hard and upsetting, but ultimately rewarding. The high you get from finding out and knowing that is only comparable diamorphine.The trouble is, if you're already busy living, this film won't mean that much and you'll see it a little more clearly than those busy dying. 
To those fortunate individuals, watch this film and understand what the rest of us are going through.So, yes, this film is a cliched fairytale and maybe as a story it isn't realistic and at second on the IMDb all time list, it is a bit over-rated. However if you could have a chart of films listing the number of lives saved, altered and improved, the Shawshank Redemption would be way out in front at number one.",10,The Shawshank Redemption,Busy dying or busy living?
4,4,2004-05-20,tt0111161,ur0226855,True,"At the heart of this extraordinary movie is a brilliant and indelible performance by Morgan Freeman as Red, the man who knows how to get things, the ""only"" guilty man at Shawshank prison. He was nominated by the Academy for Best Actor in 1995 but didn't win. (Tom Hanks won for Forrest Gump.) What Freeman does so beautifully is to slightly underplay the part so that the eternal boredom and cynicism of the lifer comes through, and yet we can see how very much alive with the warmth of life the man is despite his confinement. Someday Morgan Freeman is going to win an Academy Award and it will be in belated recognition for this performance, which I think was a little too subtle for some Academy members to fully appreciate at the time.But Freeman is not alone. Tim Robbins plays the hero of the story, banker Andy Dufresne, who has been falsely convicted of murdering his wife and her lover. Robbins has a unique quality as an actor in that he lends ever so slightly a bemused irony to the characters he plays. It is as though part of him is amused at what he is doing. I believe this is the best performance of his career, but it might be compared with his work in The Player (1992), another excellent movie, and in Mystic River (2003) for which he won an Oscar as Best Supporting Actor.It is said that every good story needs a villain, and in the Bible-quoting, Bible-thumping, massively hypocritical, sadistic Warden Samuel Norton, played perfectly by Bob Gunton, we have a doozy. I want to tell you that Norton is so evil that fundamentalist Christians actually hate this movie because of how precisely his vile character is revealed. They also hate the movie because of its depiction of violent, predatory homosexual behavior (which is the reason the movie is rated R). 
On the wall of his office (hiding his safe with its ill-gotten contents and duplicitous accounts) is a framed plaque of the words ""His judgment cometh and that right soon."" The irony of these words as they apply to the men in the prison and ultimately to the warden himself is just perfect. You will take delight, I promise.Here is some other information about the movie that may interest you. As most people know, it was adapted from a novella by Stephen King entitled ""Rita Hayworth and the Shawshank Redemption."" Rita Hayworth figures in the story because Red procures a poster of her for Andy that he pins up on the wall of his cell. The poster is a still from the film Gilda (1946) starring her and Glenn Ford. We see a clip from the black and white film as the prisoners watch, cheering and hollering when Rita Hayworth appears. If you haven't seen her, check out that old movie. She really is gorgeous and a forerunner of Marilyn Monroe, who next appears on Andy's wall in a still from The Seven Year Itch (1955). It's the famous shot of her in which her skirt is blown up to reveal her shapely legs. Following her on Andy's wall (and, by the way, these pinups figure prominently in the plot) is Rachel Welsh from One Million Years B.C. (1966). In a simple and effective device these pinups show us graphically how long Andy and Red have been pining away.Frank Darabont's direction is full of similar devices that clearly and naturally tell the story. There is Brooks (James Whitmore) who gets out after fifty years but is so institutionalized that he can't cope with life on the outside and hangs himself. Playing off of this is Red's periodic appearance before the parole board where his parole is summarily REJECTED. Watch how this plays out at the end.The cinematography by Roger Deakins is excellent. The editing superb: there's not a single dead spot in the whole movie. The difference between the good guys (Red, Andy, Brooks, etc.) 
and the bad guys (the warden, the guards, the ""sisters,"" etc.) is perhaps too starkly drawn, and perhaps Andy is a bit too heroic and determined beyond what might be realistic, and perhaps the ""redemption"" is a bit too miraculous in how beautifully it works out. But never mind. We love it.All in all this is a great story vividly told that will leave you with a true sense of redemption in your soul. It is not a chick flick, and that is an understatement. It is a male bonding movie about friendship and the strength of character, about going up against what is wrong and unfair and coming out on top through pure true grit and a little luck.Bottom line: one of the best ever made, currently rated #2 (behind The Godfather) at the IMDb. Don't miss it.(Note: Over 500 of my movie reviews are now available in my book ""Cut to the Chaise Lounge or I Can't Believe I Swallowed the Remote!"" Get it at Amazon!)",8,The Shawshank Redemption,"Great story, wondrously told and acted"
5,5,2004-08-12,tt0111161,ur1532177,True,"In recent years the IMDB top 250 movies has had THE GODFATHER at number 1 while THE SHAWSHANK REDEMPTION has remained at number 2 . The only exception was early in 2002 when FELLOWSHIP OF THE RING topped the chart for a couple of months then dropped down to number 2 for a couple of more months . I`ll probably make myself very unpopular for saying this but I don`t think SHAWSHANK REDEMPTION deserves to be so high !!!!!! SPOILERS !!!!!!What I don`t like about it is the amount of cliches . New prisoner arrives and finds a maggot in his food , prison cliche 37 . New prisoner gives maggot to old prisoner to feed his pet bird , prison cliche 43 . It`s revealed at the end that the prisoner who has spent so many years inside is innocent after all , prison cliche numero uno . Did anyone believe during any part of this movie that Andy Dufresne was guilty ? Neither did I . Maybe that`s why I love the American prison series OZ because all the inmates there are totally guilty . There`s other things wrong with the movie . It`s about half an hour overlong , and there`s rather unrealistic bits like the warder having someone killed after finding out Dufresne is probably innocent. Oh and how many prison friendships has there been between a black man and a white man ? Maybe that last point shouldn`t be taken as a criticism because the performances of Morgan Freeman and Tim Robbins are very good and make the movie . Neither of them give a flashy performance ( Again not a criticism ) but both are very subtle in their roles , can you imagine how different this movie would have been if we`d had Tom Cruise and Denzil Washington as the stars ? Perhaps because Freeman`s character of Red does seem to have been written as a white character he`s so good in the role . 
Am I alone in thinking Freeman has been the best black actor in Hollywood for the last decade because he`s more interested in exploring the character instead of playing someone who`s black ? There`s also some outstanding touches from director Frank Darabont . Witness the scene early in the film where Andy spends his first night in prison with the darkness falling upon the prisoners faces . It`s almost like the artwork of Andy Dogg as the prisoners look out onto the landing as they search for fresh prey , and there is quite a touching sequence as Red leaves prison out into the harsh outside world to the strains of Thomas Newman`s scoreI gave THE SHAWSHANK REDEMPTION eight out of ten . It is a classic feel good movie but unfortunately being a cynic I do think it`s slightly overrated by IMDB voters",8,The Shawshank Redemption,"Good , But It Is Overrated By Some"
6,6,2005-10-09,tt0111161,ur6574726,True,"I have been a fan of this movie for a long time.It seems that ever time my life hits a downward spiral, I can always seem to pop this movie in, and come up with a solution to my pending problem. It somehow gives Me sense of peace and inner strength.So, It wasn't all that strange for me to pop it in when I was going through a rough patch in my marriage. I found myself identifying with many of the characters in this movie.Many of them are trapped in a world of regret and mourning, due to a mistake that had been made early on in their lives. This film gave me the strength to escape from my world of misery. And now I am able to say I, Like Andy broke out of my own personal Shawshank.My Ex isn't too happy about the divorce. But life is much better now. And I feel saved from a life of unhappiness, Due to a mistake of a marriage to early in life.Thank You Frank Durabont.Your film saved my life.",9,The Shawshank Redemption,This Movie Saved My Life.
7,7,2012-02-04,tt0111161,ur31182745,True,"I made my account on IMDb Just to Rate this movie. :-) I had heard from someone that Tim Robins has done a great job in this movie. but when i started watching this movie, i could not move my ass for 142 Min.its not just about Tim Robins or Morgen Freeman.. its the whole storyline,dialogs and cinematography which insist you to watch this whenever you feel low in your life.Movie has some great lines. When Andy(Tim Robins) brake the Jail, Morgon Freeman says ""I have to remind myself that some birds aren't meant to be caged. Their feathers are just too bright"" This movie has entertainment, Feelings, Action, Drama a little comedy. In short everything we want to see in a movie.I have seen this movie around 100 times and sure will watch more then 10000 times before i die. 1 of the best movies i have ever seen in my life.. Wish i could see that sort of movie again in my life",10,The Shawshank Redemption,Movie you can see 1000 times
8,8,2008-10-24,tt0111161,ur9871443,True,"A friend of mine listed ""The Shawshank Redemption"" as one her all time favorite movies and that brought on my curiosity what Shawshank Redemption was, I even thought it was a typo. The next day I went out and brought home the 10 years anniversary DVD without knowing I was going to spend the most worthwhile 2 and a half hours of my life in front of the TV (a melodrama doesn't kill, right?)I don't know how to write a good movie review so I guess telling you what I feel about it may do better. I remembered crying a few times, being astounded by the plot twist for a moment or two and ending up sitting alone, inert and all, just because of the once-in-a-lifetime beauty of Shawshank. Just recall the scene in which Andy locked himself in the room and turned the music on, all the cons stood paralyzed on the ground, listening to the magical tune like the song of hope they had never heard before.Shawshank is more than a feel good movie, it's about life-changing experiences, about the endless struggle against life's harshness and unfairness, about true friends who will stand by you forever... The simple yet hard to convey messages could not be handled more subtly than this.Shawshank should have won every Oscar for any categories it was nominated, but it's all about struggling against unfairness, time's proved its monumental magnitude.Shawshank is a lesson of life that you have to learn in just 150 minutes.",10,The Shawshank Redemption,The Shawshank Redemption
9,9,2011-07-30,tt0111161,ur2707735,True,"Well I guess I'm a little late to the party as far as writing down a review for this picture. I've seen it a couple of times, but that was before I became a regular contributor to the IMDb. When I first discovered this site a few years ago, ""The Godfather"" was in the Number #1 spot, and since then the films have traded places for first and second, with Shawshank maintaining the top spot most of the time. That puzzled me a bit until I watched it again tonight, and I've come away from the picture with a new found appreciation. My favorite movies tend to be the story of underdogs in some way, shape or form, and my personal Top Ten list includes titles like ""On The Waterfront"", ""To Kill a Mockingbird"" and ""One Flew Over the Cuckoo's Nest"". I may have to reconsider that list, an infrequent exercise but one I don't mind doing every now and then as situations warrant.Overall, the film is darn near perfect. I know it's pretty cliché to state it that way, but when you analyze the dialog, the characters, the directing and the tone of the movie, the picture flows flawlessly, even when it detours into side stories like Brooks Hatlen's release and new prisoner Tommy's introduction late in the picture. Every set-up, every nuance has some importance that eventually converges to symbolize Andy's quest for escape and personal redemption. Remember Brooks feeding Jake for the first time and eventually setting him free when he receives his own pardon? How about Andy playing Mozart into the prison yard while settling back with a smile of contentment on his face. The story transcends one man's confinement for a crime he didn't commit, and focuses instead on his reaction to circumstances beyond his control. Paul Newman showed us a different way to react to those kinds of conditions in 1967's ""Cool Hand Luke"", but his method was self destructive. 
Andy Dufresne (Tim Robbins) never loses his ability to keep his eye on the prize, even if it takes him a couple of decades to do so.But even more so, you have here the story of Ellis Boyd Redding (Morgan Freeman), a convict who sees Andy as a person, and over time, an inspiration to himself and the rest of the prisoners who call him friend. From Andy, he comes to understand that even as a prisoner, a man can live life on his own terms if he can keep his mind uncluttered by thoughts of desperation and hopelessness. Not bad for a convict who started out believing that 'hope can drive a man insane'.I really can't recommend this picture highly enough, both for it's masterful story telling and it's technical execution. The actors, even those portraying the most minor characters were seemingly born for their roles. They deliver a seamless performance that's virtually unmatched by most modern films, in a picture that hits all the right notes with an inspiring message of discipline and perseverance.",10,The Shawshank Redemption,"""I'm a convicted murderer who provides sound financial planning""."


# Create the TF Dict<a name='TFDict'></a>

[Top](#Top)

In [8]:
#Init a default dict
tfDict = defaultdict(lambda: defaultdict(int))

#Init Porter Stemmer
ps = nltk.stem.PorterStemmer()

#Use fewer reviews to reduce runtimes for testing/practice
dataReviewsLess = dataReviews.head(50000).copy()

#Retrieve the actual reviews
reviewTexts = dataReviewsLess['review_text'].values

#Loop through reviews
for i in range(len(reviewTexts)):
    #Tokenize reviews and lowercase the text; drop the empty tokens re.split yields at the text boundaries
    line = [token for token in re.split(r'\W+',reviewTexts[i].lower()) if token]
    #Loop through tokens in review
    for word in line:
        #Stem token
        stem = ps.stem(word)
        #Increment frequency
        tfDict[stem][i] += 1

#Add in Corpus Frequency, Document Frequency and reposition the frequencies per document
tfDictXtra = defaultdict(lambda: defaultdict(int))
for word in tfDict:
    tfDictXtra[word]['CorpusFreq'] = sum(tfDict[word].values())
    tfDictXtra[word]['DocFreq'] = len(tfDict[word])
    tfDictXtra[word]['Freq_per_doc'] = tfDict[word]
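
To see what the resulting structure looks like, here is a minimal sketch on a two-document toy corpus. The corpus and the names `docs`, `tf` and `stats` are made up for illustration, and stemming is left out so the sketch runs without nltk. It stores the summary in a plain dict, which is a swapped-in choice: unlike a `defaultdict`, a plain dict does not silently create an entry when an unknown word is looked up.

```python
import re
from collections import defaultdict

# Toy corpus (illustrative only); the index in the list plays the role of review_id
docs = ["Great acting, great story", "Worst acting ever"]

# term -> {doc_id -> frequency}
tf = defaultdict(lambda: defaultdict(int))
for doc_id, text in enumerate(docs):
    # findall returns only the word tokens, so no empty strings appear
    for token in re.findall(r'\w+', text.lower()):
        tf[token][doc_id] += 1

# Add corpus frequency and document frequency per term
stats = {
    term: {'CorpusFreq': sum(per_doc.values()),
           'DocFreq': len(per_doc),
           'Freq_per_doc': dict(per_doc)}
    for term, per_doc in tf.items()
}

print(stats['great'])
print(stats['acting'])
```

Here `great` occurs twice in one document (corpus frequency 2, document frequency 1), while `acting` occurs once in each of two documents (corpus frequency 2, document frequency 2).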


# Create the TF-IDF and Normalize<a name='TFIDFNorm'></a>

[Top](#Top)

In [9]:
#Get the total number of reviews/documents
totalDocs = len(dataReviewsLess)

#Total unique words
totalUniqueWords = len(tfDictXtra)

#Create np matrix with zeros
tfIdf = np.zeros((totalUniqueWords,totalDocs))

#Create dataframe of words with index list to get the word position in matrix for future reference
wordsIndex = pd.DataFrame(list(tfDictXtra.keys()),columns=['Words'])
#Create index range
wordID = list(range(totalUniqueWords))
#Insert the index range
wordsIndex.insert(0,'Index',wordID,True)
#Index counter, to keep track of location in word list
wordCounter = 0


#loop through words in dict
for word in tfDictXtra:
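    #Loop through frequencies of word in a doc from dict; NOTE: this line sometimes raises AttributeError: 'int' object has no attribute 'keys'
    #Likely cause: looking up an unknown word in the tfDictXtra defaultdict inserts an entry without 'Freq_per_doc'
    #If that happens, re-run the previous cells; that usually fixes it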
    dictLoop = list(tfDictXtra[word]['Freq_per_doc'].keys())
    for doc in dictLoop:
        #Calculate the TF-IDF
        tfIdf[wordCounter,doc] = tfDictXtra[word]['Freq_per_doc'][doc]*math.log((totalDocs/(1+tfDictXtra[word]['DocFreq'])))
    wordCounter += 1
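
The weight used in the loop above is `tf * ln(N / (1 + df))`, with `tf` the term frequency in the document, `df` the document frequency and `N` the number of documents. A minimal sketch with made-up numbers (the helper name `tf_idf` is hypothetical) shows how the weight behaves for rare and very common terms:

```python
import math

def tf_idf(term_freq, doc_freq, total_docs):
    # Same formula as in the cell above: tf * ln(N / (1 + df))
    return term_freq * math.log(total_docs / (1 + doc_freq))

# A term occurring 3 times in a document and present in 499 of 50000 documents: 3 * ln(100)
print(tf_idf(3, 499, 50000))

# A term present in all but one document gets weight ln(1) = 0
print(tf_idf(1, 49999, 50000))

# A term present in every document gets a slightly negative weight because of the +1
print(tf_idf(1, 50000, 50000))
```

The last case is a side effect of the `1 +` smoothing in the denominator: it avoids division by zero, but pushes the weight of terms that occur in every document just below zero.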


In [10]:
#Transpose the tfIdf matrix before normalizing: normalize works on rows, and in tfIdf each document is a column
tfIdfNorm = preprocessing.normalize(tfIdf.T, norm='l2')
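
What L2 row normalization does can be sketched with plain NumPy on a made-up matrix. This is an illustration of the idea, not the scikit-learn implementation; the guard for all-zero rows is an assumption added here so that empty documents do not cause a division by zero.

```python
import numpy as np

# Toy document-term matrix: one row per document (values are illustrative)
doc_term = np.array([[3.0, 4.0, 0.0],
                     [0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0]])

# Euclidean length of every row, kept as a column so it broadcasts over the row
norms = np.linalg.norm(doc_term, axis=1, keepdims=True)
# Leave all-zero rows untouched instead of dividing by zero
norms[norms == 0] = 1.0

unit_rows = doc_term / norms
print(unit_rows)
```

After this step every non-empty document vector has length 1, so the dot product of two such vectors equals their cosine similarity.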

# Vectorize query<a name='InputQuery'></a>

[Top](#Top)

In [17]:
#Starting/test query
query = "Worst acting 2015"

#Create a normalized vector of query
def vectorizeQuery(query):
    #Create empty base vector for Term Freq
    queryVector = np.zeros(totalUniqueWords)
    #Tokenize and make lowercase; drop the empty tokens re.split yields at the text boundaries
    line = [token for token in re.split(r'\W+',query.lower()) if token]
    #Loop through words
    for word in line:
        #Stem each word
        stem = ps.stem(word)
        #Increase term freq of query term
        queryVector[wordsIndex[wordsIndex['Words']==stem]['Index'].values] += 1
    
    #Create empty base vector for TF-IDF
    queryVectorTfIdf = np.zeros(totalUniqueWords)
    #Loop through TF vector of query
    for i in range(len(queryVector)):
        #Act where a term frequency was recorded
        if queryVector[i] != 0:
            #Determine which word it was based on the index; take the single matching value, not the string of the array
            word = wordsIndex[wordsIndex['Index']==i]['Words'].values[0]
            #Calculate the TF-IDF
            queryVectorTfIdf[i] = queryVector[i]*math.log((totalDocs/(1+tfDictXtra[word]['DocFreq'])))
    
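    #Make the TF-IDF vector a unit vector
    length = np.sqrt(queryVectorTfIdf.dot(queryVectorTfIdf))
    #If no query term has a non-zero weight the vector is all zeros; return it as is to avoid dividing by zero
    if length == 0:
        return queryVectorTfIdf
    queryVectorNorm = queryVectorTfIdf/length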
    
    #Return the unit vector
    return queryVectorNorm
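
Splitting on `\W+` leaves an empty string at the start or end of the token list whenever the query begins or ends with punctuation. The sketch below shows that effect and one possible alternative, `re.findall`, which only returns the matched words. The helper name `tokenize` and the example query are assumptions for illustration.

```python
import re

def tokenize(text):
    # Hypothetical helper: lowercase and return runs of word characters;
    # unlike re.split, re.findall never yields empty strings
    return re.findall(r'\w+', text.lower())

# Splitting leaves a trailing empty token because the query ends in '!'
splitTokens = re.split(r'\W+', "Worst acting, 2015!".lower())
# Finding the words directly gives only the three real tokens
foundTokens = tokenize("Worst acting, 2015!")
```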


In [18]:
#Cosine similarity matching
def cosineSim(vector, docVector):
    #Only dot product needed since vectors are already unit vectors and therefore the lengths are 1
    return vector.dot(docVector)#/(length vector * length docVector)
    
def rankedList(queryVector):
    #Create empty score list
    scoreList = np.zeros(totalDocs)
    #Loop through each doc
    for i in range(len(tfIdfNorm)):
        #Calculate for each doc the cosine sim. Index of scoreList = review_id
        scoreList[i] = cosineSim(queryVector,tfIdfNorm[i])
    
    #Create new data frame for ranked list based on smaller DF of data
    rankedDocList = dataReviewsLess.copy()
    #Insert the similarity score for each review
    rankedDocList.insert(0,'Score',scoreList,True)
    #Sort the review similarity based on the score and return
    return rankedDocList.sort_values(by='Score',ascending=False)

In [19]:
#Create the ranking list
rankings = rankedList(vectorizeQuery(query))
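
Because the rows of `tfIdfNorm` and the query vector all have unit length, the per-document loop in `rankedList` should be equivalent to a single matrix-vector product. A minimal sketch on made-up vectors (`docMatrix` and `queryVec` are hypothetical stand-ins for `tfIdfNorm` and the query vector):

```python
import numpy as np

# Hypothetical unit-length document vectors (rows) and a unit-length query vector
docMatrix = np.array([[1.0, 0.0],
                      [0.6, 0.8],
                      [0.0, 1.0]])
queryVec = np.array([0.0, 1.0])

# One matrix-vector product gives the cosine similarity of every document at once
scores = docMatrix @ queryVec
# Document indices ordered from most to least similar
ranking = np.argsort(-scores)
```

On the full matrix this would read `scoreList = tfIdfNorm @ queryVector`, which avoids the Python-level loop over all documents.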

# Results<a name='Results'></a>

[Top](#Top)

### WordCloud <a name='WordCloud'></a>

[Top](#Top)

In [20]:
#Source: https://stackoverflow.com/questions/16645799/how-to-create-a-word-cloud-from-a-corpus-in-python
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = "WordCloud of Query Results"):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=40,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(str(data))

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

@interact
def showingWordcloudsOfKRanking(k=(1,50,1)):
    show_wordcloud(rankings.head(k)['review_text'])
    

@interact
def showingWordCloudOfOneReview(i=(0,len(dataReviewsLess)-1,1)):
    show_wordcloud(dataReviewsLess[dataReviewsLess['review_id']==i]['review_text'].values,'WordCloud of a review')


### Interact with Filters<a name='Filters'></a>

[Top](#Top)

In [21]:
#Function to filter on the variables created by interact widget
def showResultsTime(start_date, end_date, AmountResults, AtleastRating, spoiler, movie_title):
    start_date = pd.Timestamp(start_date)
    end_date = pd.Timestamp(end_date)
    #Base filter: date range (exclusive bounds) and minimum rating
    mask = ((rankings.review_date > start_date)
            & (rankings.review_date < end_date)
            & (rankings.rating >= AtleastRating))
    #Optional spoiler filter; 'Both' keeps all reviews
    if spoiler == 'Yes':
        mask &= rankings.is_spoiler
    elif spoiler == 'No':
        mask &= ~rankings.is_spoiler
    #Optional movie title filter; 'None' keeps all movies
    if movie_title != 'None':
        mask &= (rankings.movie_title == movie_title)
    return rankings[mask].head(AmountResults)

#Sort the movieTitles DF
tmp = movieTitles.sort_values(by='primaryTitle')
#Prep a list of movie titles for filter
titles = ['None']
titles.extend(tmp['primaryTitle'].values)
#The interact function for faceted search
_ = interact(showResultsTime,
             start_date=widgets.DatePicker(value=pd.to_datetime('2014-01-01')),
             end_date=widgets.DatePicker(value=pd.to_datetime('2019-01-01')),
             AmountResults=(10, 100, 10),
             AtleastRating=(1,10,1),
             spoiler=['Both','Yes','No'],
             movie_title=titles)


# Cohen's Kappa<a name='Cohen'></a>

[Top](#Top)

In [16]:
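def AveragePrecision(ranked_list_of_results, list_of_relevant_objects):
    '''On input a ranked result list and the relevant objects, return the average precision.'''
    # Without relevant objects the average precision is defined as 0 here
    if len(list_of_relevant_objects) == 0:
        return 0.0
    # Number of relevant results seen so far
    hits = 0
    # Running sum of precision@k over the ranks k that hold a relevant result
    count = 0
    for i, res in enumerate(ranked_list_of_results):
        if res in list_of_relevant_objects:
            hits += 1
            count = count + hits / (i+1)
    return count / len(list_of_relevant_objects)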

def PE(data):
    '''On input data, return the P(E) (expected agreement).'''
    relevant = 0
    nonrelevant = 0
    # Iterate over the data
    for i in data:
        for j in i:
            
            # Top up the relevant documents by one if 1 is encountered
            if j == 1:
                relevant += 1
            # Top up the nonrelevant documents by one if 0 is encountered
            if j == 0:
                nonrelevant += 1

    # Calculates the total of inspected documents for the judges combined
    total = len(data)*2

    # Calculates the pooled marginals
    rel = relevant/total
    nonrel = nonrelevant/total

    # Calculates the P(E)
    P_E = nonrel**2 + rel**2
    return P_E


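def kappa(data, P_E):
    '''On input data (one pair of judgments per document) and P(E), return Cohen's kappa.'''
    # Count the documents on which both judges gave the same judgment
    agree = sum(1 for judge1, judge2 in data if judge1 == judge2)
    # P(A), the observed agreement
    P_A = agree / len(data)
    return (P_A - P_E)/(1 - P_E)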
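
To make the kappa computation concrete, here is a small self-contained sketch with made-up judgments of two judges on four documents. The names `expectedAgreement`, `cohenKappa` and `toyJudgments` are hypothetical; the formulas follow `PE` and `kappa` above, with the share of nonrelevant judgments taken as 1 minus the share of relevant ones, which is the same thing when every judgment is 0 or 1.

```python
def expectedAgreement(pairs):
    # Pooled marginals: share of 'relevant' (1) judgments over both judges
    judgments = [j for pair in pairs for j in pair]
    rel = judgments.count(1) / len(judgments)
    return rel**2 + (1 - rel)**2

def cohenKappa(pairs):
    # Observed agreement: fraction of documents where both judges agree
    observed = sum(1 for a, b in pairs if a == b) / len(pairs)
    expected = expectedAgreement(pairs)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments of two judges on four documents (1 = relevant, 0 = nonrelevant)
toyJudgments = [(1, 1), (1, 0), (0, 0), (1, 1)]
toyKappa = cohenKappa(toyJudgments)
```

Here the judges agree on 3 of 4 documents, so P(A) = 0.75. Five of the eight judgments are "relevant", so P(E) = 0.625² + 0.375² = 0.53125, and kappa = (0.75 - 0.53125) / (1 - 0.53125) = 7/15 ≈ 0.47.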


