# Assignment 6

## Text Analysis

### Step 2: Shakespeare word frequency
- Make a Python string that contains the text of a Shakespeare play (obtained, for example, from Project Gutenberg)
    - You can use requests and BeautifulSoup to get the text or you can read in the content from a file, but do not copy the entire play into a notebook cell
- Tokenize the words and remove stopwords
- Find the top 20 most frequent words in the play
- Comment on whether these words give an accurate sense of the play

In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')
from nltk.collocations import *

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
response = requests.get('https://www.gutenberg.org/ebooks/1531.txt.utf-8')
response

<Response [200]>

In [4]:
# The .txt.utf-8 URL serves the play as plain text, so keep the response body as a string
html_string = response.text
othello = html_string.lower()

# BeautifulSoup is only needed when the response contains HTML tags; this object is not used below
document = BeautifulSoup(othello, "html.parser")
#document
#html_string

In [5]:
sent = sent_tokenize(othello)

In [6]:
words = []
for s in sent:
    for w in word_tokenize(s):
        words.append(w)

In [7]:
myStopWords = list(punctuation) + stopwords.words('english')
wordsNoStop = []
for i in words:
    if i not in myStopWords:
        wordsNoStop.append(i)
#print(words)
#print(wordsNoStop)

In [8]:
wordsNoStopComp = [w for w in words if w not in myStopWords]
#print(wordsNoStopComp)

In [9]:
from collections import Counter
# Use a separate variable name so the Counter class itself is not overwritten
word_counts = Counter(wordsNoStopComp)
most_occur = word_counts.most_common(20)
print(most_occur)

[('’', 871), ('iago', 353), ('othello', 337), ('cassio', 248), ('desdemona', 223), ('thou', 143), ('emilia', 129), ('shall', 100), ('roderigo', 98), ('good', 92), ('lord', 90), ('project', 89), ('tis', 86), ('let', 83), ('come', 82), ('thy', 80), ('would', 80), ('love', 80), ('well', 80), ('may', 75)]


These words give a more accurate sense of the play's characters than of its content or plot.

The character names show who the play revolves around, and the frequent archaic words ("thou", "thy", "tis") hint at the genre for a reader who does not know the source text. However, these words don't tell us much about what the play is actually about. Two of the top entries are also artifacts rather than content: the '’' token is a curly apostrophe, which is not part of `punctuation`, and "project" most likely comes from the Project Gutenberg header and license text included in the download.
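The two artifacts in the top-20 list could be reduced by trimming the Project Gutenberg header and footer and by normalising curly apostrophes before counting. The following is a minimal sketch, not the notebook's actual pipeline: the helper names, the demo stopword set, and the sample text are hypothetical, a simple regular expression stands in for `word_tokenize`, and the marker strings assume the usual "*** START OF" / "*** END OF" lines found in Project Gutenberg text files. The trimming step should run on the original text before it is lowercased.

```python
import re
from collections import Counter

# Assumed marker prefixes (the usual Project Gutenberg convention).
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"

# Tiny illustrative stopword set; in the notebook this would be myStopWords.
DEMO_STOPWORDS = {"the", "of", "and", "a", "is", "to", "in", "my"}


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if both exist."""
    start = text.find(START_MARKER)
    end = text.find(END_MARKER)
    if start == -1 or end == -1 or end <= start:
        return text  # markers not found: leave the text unchanged
    body_start = text.find("\n", start)  # end of the START marker line
    if body_start == -1 or body_start > end:
        return text
    return text[body_start + 1:end]


def top_words(text, stop_words, n=20):
    """Lowercase, normalise curly apostrophes, keep alphabetic tokens, and count."""
    cleaned = text.lower().replace("\u2019", "'")
    # Regex tokenizer used here instead of nltk's word_tokenize (a simplification).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", cleaned)
    return Counter(t for t in tokens if t not in stop_words).most_common(n)


# Hypothetical sample text that imitates the layout of a Gutenberg file.
sample = (
    "The Project Gutenberg eBook of Demo\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Iago: \u2019Tis my love, my lord. Love is the thing.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Project Gutenberg license text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(top_words(body, DEMO_STOPWORDS, n=3))
```

With the real download, `strip_gutenberg_boilerplate(response.text)` would be called first, and the result passed on to the rest of the pipeline.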

### Step 3: Yelp sentiments

- Find your favorite restaurant on Yelp and copy 15 of its reviews into your notebook as Python strings
    - You don't have to use requests for this, you can just copy and paste from a browser
- Also note the numbers of stars for each review in your notebook
- Use Vader to find the polarity of each review
- Compare Vader's scores against user-specified numbers of stars

In [50]:
import nltk
from nltk.sentiment import vader
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [51]:
sia = vader.SentimentIntensityAnalyzer()

In [52]:
#4 stars
review1 = 'Four stars for the quality of the food  and service we received. My wife is Japanese and very much enjoyed the teriyaki bento box she ordered. My kids loved their dishes (one ordered the giyoza dish and one ordered the udon). I got the chicken katsu with yellow tail lunch bento box special, and I was very pleased with my dish. Portions were good sized during our visit and price was right.  Staff were pleasant and attentive. The place definitely had the vibe of a bustling Japanese style diner, which we liked. We were there for lunch, and there was definitely a crowd. Now for the bad. I grew up with parents who lived in Japan. They always talk about how clean things are there. Like, impeccably clean. The place was not dirty, but it definitely could have used some cleaning, dusting, and sprucing up. When paying for my meal, I noted a cloth flag/divider behind the main cash register that was dirty to the naked eye that made me think - that could use a wash. The floor had a slippery yet sticky feel to it'
sia.polarity_scores(review1)

{'neg': 0.012, 'neu': 0.824, 'pos': 0.165, 'compound': 0.9718}

Review 1 had a 4-star rating, meaning the reviewer had a mostly positive experience. The compound score (0.9718) is also strongly positive, which is consistent with the star rating.

In [53]:
#3 stars
review2 = 'Not as good as I remembered it growing up, but still worth eating at if your in the area looking for a bite.  They have daily specials they write on the board near the cash register with trying.  There is plenty of seating indoor and outdoor and it is a very kid friendly.  The sushi is okay here we got the Hawaiian roll, Shrimp tempura and a rainbow roll, nothing out of the ordinary here.  We really love the chicken Katsu here and is our go to meal.  Staff is still friendly and service is still good.'
sia.polarity_scores(review2)

{'neg': 0.043, 'neu': 0.762, 'pos': 0.195, 'compound': 0.9578}

Review 2 had a 3-star rating, meaning the reviewer had a neutral experience. The compound score (0.9578) is strongly positive, which differs from the star rating.

In [54]:
#3 stars
review3 = 'Niban was the first place I tried sushi....seems like a lifetime ago. I remember my 1st visit where I mistakenly ordered a Philadelphia Roll (not knowing it had raw fish) I thought the Wasabi was avacado and my nostrils really felt the burn. I used to love getting the Chicken Katsu and Fried Seafood mix here. Fast forward years later I came back to try Teriyaki Beef with a side of Seafood Mix . The beef was fine and loved the little noodles on the side where I have not seen anywhere else. I missed the taste of little zingy noodles. I think that was one of the reasons why I wanted to come back. The Seafood Mix made me sad. It was overcooked. One piece was so tough it had the consistency of an eraser. I could not even bite into it. I tried the Yellow Tail and it tasted just okay. It was nice to come back but the food was not as good as I remembered.'
sia.polarity_scores(review3)

{'neg': 0.061, 'neu': 0.868, 'pos': 0.071, 'compound': 0.1767}

Review 3 had a 3-star rating, meaning the reviewer had a neutral experience. The compound score (0.1767) is neutral to slightly positive, which is fairly consistent with the star rating.

In [55]:
#5 stars
review4 = 'Been going to this restaurant for over 20 years. Still AWESOME and delicious as always, same great service and quality Japanese food reasonably priced. I am sure I have a lot of their food pictures from the past but this place knows consistency and quality food. They have always made sure their food is fresh and that they hold high value what they serve to you. That is why they have been around for such a long time. Like us, they have many loyal customers that  would travel a good distance just to get their food. Highly recommend!!! ! Make sure you check the restaurant hours as they are closed for a few hours in between the day.'
sia.polarity_scores(review4)

{'neg': 0.0, 'neu': 0.725, 'pos': 0.275, 'compound': 0.9887}

Review 4 had a 5-star rating, meaning the reviewer had a positive experience. The compound score (0.9887) is strongly positive, which is consistent with the star rating.

In [56]:
#3 stars
review5 = 'So you order the sit and wait for it to be delivered.  We had to ask for napkins, utensils & water.  Table was sticky.  Not cleanable.  All tables had same sticky surface like they had been wiped with acid.  Easy fix with shelf paper.  So we used a few napkins because sticky table was so yucky.  No one asked if we wanted more water or needed anything else- like soy sauce or pickled ginger.  The food was fine as was price but it was not comfortable.  Most likey will not come here again, maybe take out. Too bad'
sia.polarity_scores(review5)

{'neg': 0.115, 'neu': 0.821, 'pos': 0.064, 'compound': -0.8176}

Review 5 had a 3-star rating, meaning the reviewer had a neutral experience. The compound score (-0.8176) is strongly negative, which differs from the star rating.

In [57]:
#4 stars
review6 = 'Pretty good Japanese food. Lunch special recommended was the chicken teriyaki special. Just under $10 with plenty of food. I could not finish the salad or rice, however, the chicken was probably one of the best teriyaki dishes I can recall. Good food and good prices. Plenty of sushi options as well to try next trip.' 
sia.polarity_scores(review6)

{'neg': 0.0, 'neu': 0.639, 'pos': 0.361, 'compound': 0.9732}

Review 6 had a 4-star rating, meaning the reviewer had a mostly positive experience. The compound score (0.9732) is strongly positive, which is consistent with the star rating.

In [58]:
#1 star
review7 = 'Do not bother. This place serves substandard food. The "jumbo" shrimp were skinny and coated with think bread crumb. Everything was salty. I ended up not eating most of the food because they were so bad. Wasted $25 for two and will never go back.'
sia.polarity_scores(review7)

{'neg': 0.156, 'neu': 0.804, 'pos': 0.04, 'compound': -0.7866}

Review 7 had a 1-star rating, meaning the reviewer had a negative experience. The compound score (-0.7866) is clearly negative, which is consistent with the star rating.

In [59]:
#5 stars
review8 = 'We love coming here for sushi. The servers are very nice and friendly.'
sia.polarity_scores(review8)

{'neg': 0.0, 'neu': 0.482, 'pos': 0.518, 'compound': 0.8947}

Review 8 had a 5-star rating, meaning the reviewer had a positive experience. The compound score (0.8947) is positive, which is consistent with the star rating.

In [60]:
#1 star
review9 = 'Dude stay away from this place!!! They are unprofessional and liars. They did not answer their phone from opening.  And then lied about it. The girl at the counter attempted to force me to take my order after I requested a refund. I am not going to wait 30 minutes because you chose to send out first the food people which are dining in. And then waste more time refusing  to give me a refund.This is appalling.'
sia.polarity_scores(review9)

{'neg': 0.216, 'neu': 0.784, 'pos': 0.0, 'compound': -0.9549}

Review 9 had a 1-star rating, meaning the reviewer had a negative experience. The compound score (-0.9549) is strongly negative, which is consistent with the star rating.

In [61]:
#5 stars
review10 = 'Absolutely happy my family actually came across this place while getting McDonalds for my son. Food was absolutely delicious, staff was great and friendly. I wish I lived in the area because I would patronize this restaurant regularly.'
sia.polarity_scores(review10)

{'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.9583}

Review 10 had a 5-star rating, meaning the reviewer had a positive experience. The compound score (0.9583) is strongly positive, which is consistent with the star rating.

In [62]:
#4 stars
review11 = 'This used to be my go to spot but it has been years since I have visited! I recently had time to stop by to pick up some lunch before work. I got the orange blossom roll, dragon roll, salmon hand roll and the calamari appetizer. The food still taste the same after all these years. Definitely will be giving them another visit soon!'
sia.polarity_scores(review11)

{'neg': 0.039, 'neu': 0.799, 'pos': 0.162, 'compound': 0.8669}

Review 11 had a 4-star rating, meaning the reviewer had a mostly positive experience. The compound score (0.8669) is positive, which is consistent with the star rating.

In [63]:
#4 stars
review12 = 'Niban has been one of my goto spots for both dine in and takeout for a long time. I love that they always have their daily specials. You can call ahead to ask what the specials are as they are usually busy and it will take a while if you order togo at the restaurant. Normally I like dining in because I feel that they give you more in the combo bentos with the little salmon and bean sprout side dishes that they do not include in takeout. The quality of food has stayed the same for 5+ years now. I enjoy this type of no fuss japanese food and will continue to support this business!'
sia.polarity_scores(review12)

{'neg': 0.018, 'neu': 0.874, 'pos': 0.107, 'compound': 0.8932}

Review 12 had a 4-star rating, meaning the reviewer had a mostly positive experience. The compound score (0.8932) is positive, which is consistent with the star rating.

In [64]:
#2 stars
review13 = 'The lady that took our order on phone was awesome . But when we got there ; this male worker told us we were not allowed to sit because we did not order while sitting down. Even though we paid for our food and everyone else was sitting outside and eating but he decided to tell us we could not sit and said that we can sit somewhere else . The other customers did not understand why my sister and I were told to leave. Never coming back, their miso soup taste like tap water'
sia.polarity_scores(review13)

{'neg': 0.014, 'neu': 0.925, 'pos': 0.061, 'compound': 0.6705}

Review 13 had a 2-star rating, meaning the reviewer had a mostly negative experience. The compound score (0.6705) is positive, which differs from the star rating.

In [65]:
#5 stars
review14 = 'My friend showed me this place and I love coming here! Friendly staff, delicious meals and affordable prices. During this time they offer takeouts, delivery and outdoor seating :)'
sia.polarity_scores(review14)

{'neg': 0.0, 'neu': 0.567, 'pos': 0.433, 'compound': 0.9558}

Review 14 had a 5-star rating, meaning the reviewer had a positive experience. The compound score (0.9558) is strongly positive, which is consistent with the star rating.

In [66]:
#3 stars
review15 = 'Called order in at 11:26. Door for on site diners opened just after 11:30. On site party of 4 diners had their complete bento orders 10 full minutes before our phone order was handed to me, 30 minute after ordering. When I asked about it, I was given an apathetic bs reply. Honesty is preferred.'
sia.polarity_scores(review15)

{'neg': 0.039, 'neu': 0.858, 'pos': 0.103, 'compound': 0.5719}

Review 15 had a 3-star rating, meaning the reviewer had a neutral experience. The compound score (0.5719) is relatively positive, which differs from the star rating.
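To compare Vader's scores against the star ratings as a whole rather than one review at a time, the 15 results can be grouped by star rating. The sketch below is one possible summary, not part of the original analysis: the `results` list simply copies the star ratings and compound scores reported in the cells above, and the function name `mean_compound_by_stars` is hypothetical.

```python
from collections import defaultdict

# (stars, compound) pairs copied from the 15 review cells, in review order.
results = [
    (4, 0.9718), (3, 0.9578), (3, 0.1767), (5, 0.9887), (3, -0.8176),
    (4, 0.9732), (1, -0.7866), (5, 0.8947), (1, -0.9549), (5, 0.9583),
    (4, 0.8669), (4, 0.8932), (2, 0.6705), (5, 0.9558), (3, 0.5719),
]


def mean_compound_by_stars(pairs):
    """Group compound scores by star rating and return the mean for each rating."""
    grouped = defaultdict(list)
    for stars, compound in pairs:
        grouped[stars].append(compound)
    return {stars: sum(scores) / len(scores)
            for stars, scores in sorted(grouped.items())}


means = mean_compound_by_stars(results)
for stars, mean in means.items():
    print(f"{stars} star(s): mean compound = {mean:+.3f}")
```

Grouped this way, the 1-star, 4-star, and 5-star reviews line up with their compound scores, while the 3-star reviews range from -0.8176 to 0.9578 and the single 2-star review scores 0.6705. Most of the disagreement with Vader is therefore in the middle of the star scale.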

### Step 4: Your movie reviews

- Make 5 strings that contain reviews (3 sentences each) of your favorite movie comedies
- Make 5 strings that contain reviews (3 sentences each) of your favorite movie dramas
- Make a Python list that contains these 10 strings
- Replicate the analysis pipeline from "04_news_topics.ipynb"
    - You don't have to open any files
    - Instead of using "listOfNews", use your list of movie reviews
    - Modify the characters in "extrastop" if you want to
    - For the LDA model step, use "num_topics = 2"
- Comment on the words that the model chooses to represent the 2 topics, and whether they match with your split between comedies and dramas

In [67]:
comedy1 = 'The most important thing you might wanna know about this review is that I am certainly NOT in the target demographic group for this film. As a 53 year-old man, I am not the type to ever watch "Pitch Perfect" in the first place and did so only because my daughter insisted that it was a good film and I would enjoy it. And, fortunately, I did enjoy it quite a bit.'
comedy2 = 'Anna Kendrick is solid in controlling the screen. What works best is the chemistry between everybody. They make this movie fun to watch.'
comedy3 = 'First of all, there is absolutely no realistic character development. They are types, thrown together because of their quirkiness, their ethnicity, their race, or even their sexual preference. The sad thing is that doing this does not lead to anything interesting.'
comedy4 = 'While the romance part does not work quite as well (mostly towards the end of the movie) like the rest of it, it does not really take much away from it either. It a fun light movie, that you can watch with almost everybody. And it brings a smile on your face (actually more than one)!'
comedy5 = 'One thing this movie about college level a capella singing competition demonstrates is that the actresses involved are multi talented from the control freak played by Anna Camp to the lovely Brittany Snow and of course Anna Kendrick. But Anna Kendrick as a the lead was not quite right. She is better in a supporting role.'

drama1 = 'Sadly, however, I was left feeling ambivalent about it...and I noticed that my wife and oldest daughter felt pretty much the same way. This is because although the film is more like the book, to do this they also omit a lot of things....making the story seem a bit disjoint and confusing. Overall, a decent story but even with its sticking closer to the book, I much preferred the 1990s version...which was much more charming, fun and likable.'
drama2 = 'It is such a good story and the characters are so good that it remains timeless regardless of how many it is read and how often it is adapted. Some may not be totally enamoured with it as an adaptation, as the chronology is different and there is a lot of back and forth, but on its own terms it left me and my sisters totally satisfied.Gerwig directs with great confidence and the script sparkles, the charm and poignancy of the story and Alcotts text never lost or jarring.'
drama3 = 'Greta Gerwig has changed it up and in the process, has given even more depth and more life to the characters. She has added to the theme of the story. She divides the movie into about half teenage years and half young adult years.'
drama4 = 'It is beautifully shot, beautifully acted, and I understand that in order to appeal to succeeding generations, any classic must have some point that the latest kids will find telling. Nonetheless, Greta Gerwigs efforts to make this appealing to modern, Fourth Wave Feminists by having everyone from Meryl Streep as Aunt March on down go on about the subservient position of women in this era, by having Saoirse Ronan as Jo March write Little Woman, and then get into an argument with the publisher, insisting that the character remain unmarried. This destroys the ending, but of course there are more important things'
drama5 = 'The Louisa May Alcott novel has often been filmed, so I wondered what new take could be given on it. Thankfully this turns out to be a joy, with a wonderful central performance from Saoise Ronan. Certainly a film worth watching.'

In [69]:
movie_reviews = [comedy1, comedy2, comedy3, comedy4, comedy5, drama1, drama2, drama3, drama4, drama5]
#movie_reviews

In [70]:
import pandas as pd
from pathlib import Path  
import glob
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation

In [71]:
extrastop = ['``', "''", "'re", "'s", "'ll", "--", "...",
             "n't", 'one', 'would', 'use', "'m", "'ve"]

In [72]:
myStopWords = list(punctuation) + stopwords.words('english') + extrastop

In [73]:
[w for w in word_tokenize(movie_reviews[0].lower()) if w not in myStopWords]

['important',
 'thing',
 'might',
 'wan',
 'na',
 'know',
 'review',
 'certainly',
 'target',
 'demographic',
 'group',
 'film',
 '53',
 'year-old',
 'man',
 'type',
 'ever',
 'watch',
 'pitch',
 'perfect',
 'first',
 'place',
 'daughter',
 'insisted',
 'good',
 'film',
 'enjoy',
 'fortunately',
 'enjoy',
 'quite',
 'bit']

In [74]:
listOfMovieReviews = []
for i in movie_reviews:
    listOfMovieReviews.append([w for w in word_tokenize(i.lower()) if w not in myStopWords])

In [75]:
#listOfMovieReviews[0]

In [76]:
from nltk.stem.porter import PorterStemmer
#from nltk.stem import LancasterStemmer

In [77]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [78]:
listOfStemmedWords = []
for i in listOfMovieReviews:
    listOfStemmedWords.append([p_stemmer.stem(w) for w in i])

In [79]:
#listOfStemmedWords[0]

In [80]:
!pip install gensim



In [81]:
from gensim import corpora, models
import gensim

In [82]:
dictionary = corpora.Dictionary(listOfStemmedWords)

In [83]:
#print(dictionary.token2id)

In [84]:
corpus = [dictionary.doc2bow(text) for text in listOfStemmedWords]

In [85]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics=2, 
                                           id2word = dictionary, 
                                           passes=20)

In [86]:
for i in ldamodel.print_topics(num_topics=2, num_words=20):
    print(i)

(0, '0.020*"movi" + 0.017*"anna" + 0.017*"much" + 0.013*"watch" + 0.013*"film" + 0.013*"stori" + 0.013*"fun" + 0.013*"kendrick" + 0.009*"even" + 0.009*"quit" + 0.009*"book" + 0.009*"like" + 0.009*"take" + 0.009*"wonder" + 0.009*"work" + 0.009*"control" + 0.009*"everybodi" + 0.009*"given" + 0.009*"half" + 0.009*"year"')
(1, '0.014*"charact" + 0.014*"thing" + 0.014*"good" + 0.010*"march" + 0.010*"beauti" + 0.010*"stori" + 0.010*"appeal" + 0.010*"remain" + 0.010*"adapt" + 0.010*"import" + 0.010*"insist" + 0.010*"total" + 0.010*"first" + 0.010*"type" + 0.010*"enjoy" + 0.010*"film" + 0.006*"greta" + 0.006*"gerwig" + 0.006*"ronan" + 0.006*"alcott"')


For the first topic, we don't see many words that would be specifically associated with a comedy over a drama film, besides maybe the word "fun". Words like "anna" and "kendrick" are specific to the movie itself rather than the genre, since the film stars Anna Kendrick and is about a struggling college a cappella group. Others, such as "quit" (the stem of "quite") and "year", are generic and not tied to either genre.

For the second topic, we also don't see many words that would be specifically associated with the comedy or drama genre. Again, we see words that are associated with the film itself, like "greta", "gerwig", and "alcott", which refer to Greta Gerwig, the film's director, and Louisa May Alcott, the author of the original novel.

Both films that I chose are adapted from books, so we see words like "stori", "adapt", "movi", "film", and "book". However, these words are spread across both topics (with "stori" and "film" appearing in each), because they relate to the films' source material rather than to their genres.

The words the model chose to represent the 2 topics don't match the split between comedies and dramas particularly well. This might be because many reviews discuss the plot of the movies, the actors and producers involved, and the reviewer's general feelings. Words like "like", "good", and "enjoy" are vague and do not distinguish between the two genres, as they can be used to describe either comedies or dramas.
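The overlap between the two topics can be checked directly from the printed output. The sketch below parses the two topic strings shown above (copied verbatim from that run) and lists the words they share. The helper name `parse_topic` is hypothetical, and because LDA training is randomised, a fresh run of the model may produce different topics.

```python
import re

# Topic strings copied from the print_topics output above.
TOPIC_0 = (
    '0.020*"movi" + 0.017*"anna" + 0.017*"much" + 0.013*"watch" + 0.013*"film" + '
    '0.013*"stori" + 0.013*"fun" + 0.013*"kendrick" + 0.009*"even" + 0.009*"quit" + '
    '0.009*"book" + 0.009*"like" + 0.009*"take" + 0.009*"wonder" + 0.009*"work" + '
    '0.009*"control" + 0.009*"everybodi" + 0.009*"given" + 0.009*"half" + 0.009*"year"'
)
TOPIC_1 = (
    '0.014*"charact" + 0.014*"thing" + 0.014*"good" + 0.010*"march" + 0.010*"beauti" + '
    '0.010*"stori" + 0.010*"appeal" + 0.010*"remain" + 0.010*"adapt" + 0.010*"import" + '
    '0.010*"insist" + 0.010*"total" + 0.010*"first" + 0.010*"type" + 0.010*"enjoy" + '
    '0.010*"film" + 0.006*"greta" + 0.006*"gerwig" + 0.006*"ronan" + 0.006*"alcott"'
)


def parse_topic(topic_string):
    """Turn a 'weight*"word" + ...' string into a {word: weight} dictionary."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return {word: float(weight) for weight, word in pairs}


topic0 = parse_topic(TOPIC_0)
topic1 = parse_topic(TOPIC_1)

shared = sorted(set(topic0) & set(topic1))
print("Words in both topics:", shared)
print("Only in topic 0:", sorted(set(topic0) - set(topic1)))
print("Only in topic 1:", sorted(set(topic1) - set(topic0)))
```

Words that appear in both topics cannot help separate comedies from dramas, so a short shared list is expected if the topics are distinct. Here the words unique to each topic are mostly names and generic review vocabulary rather than genre terms.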