# PHD: Sentiment Analysis on Lion King Movie Review (2019)

### What is Sentiment Analysis?

- Sentiment analysis is concerned with extracting emotions and opinions from text, and is also referred to as opinion mining. It identifies the sentiment a person expresses with respect to a given source of content.


- In today's connected world, anyone can easily form an opinion about something and express it online. These opinions act as a litmus test for the many companies that depend on public opinion to sell their products.

### Problem Statement:

- Movie reviews are important for a production house to understand how a movie is being received by the audience. As more than 90% of people book movie tickets online, analysing the reviews and ratings gives a clear idea of the popularity of the movie.


- Gather the reviews of the latest movie "The Lion King" from the review site "Rotten Tomatoes" and perform sentiment analysis on them, so that the production house gets a clear picture of audience opinion on the movie and can address the "negative" reviews to control damage.


### Tasks to be performed:

1. Collect Audience reviews from "www.rottentomatoes.com" for the film: “The Lion King (2019)”

2. Label the review "sentiment" (target): Rating > 3 as "Positive" and Rating < 3 as "Negative".

3. Perform the Visualizations & EDA on the data gathered.

4. Perform Sentiment Classification using supervised learning

5. Cluster the reviews and compare the ‘Cluster Label’ with the train data target ‘sentiment’
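
The labeling rule in task 2 can be sketched as a small function. This is a minimal illustration, not the notebook's own implementation: the function name `label_sentiment` and the sample scores are hypothetical, and returning `None` for a rating of exactly 3 is an assumption, since the rule above does not say how that case is handled.

```python
from typing import Optional


def label_sentiment(score: float) -> Optional[str]:
    """Map a star rating to a sentiment label.

    Ratings above 3 are "Positive" and ratings below 3 are "Negative",
    as stated in task 2. A rating of exactly 3 is not covered by that
    rule, so this sketch returns None for it (an assumption).
    """
    if score > 3:
        return "Positive"
    if score < 3:
        return "Negative"
    return None


# Hypothetical sample scores, for illustration only.
sample_scores = [5, 4.5, 3, 2, 0.5]
labels = [label_sentiment(s) for s in sample_scores]
print(labels)
```

Rows labeled `None` would need an explicit decision later (for example, dropping them before training), which is outside what this sketch shows.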


### Web Scraping and Collecting Data

- Collect 3000 reviews from the "Rotten Tomatoes" website (live)

In [1]:
#Load the libraries
#!pip install requests
import requests
import time
import pandas as pd
import numpy as np

In [2]:
#Create a header for the request.

headers = {
    'Referer': 'https://www.rottentomatoes.com/m/the_lion_king_2019/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

In [3]:
#Mention the URL details.
url = 'https://www.rottentomatoes.com/napi/movie/9057c2cf-7cab-317f-876f-e50b245ca76e/reviews/user' 

#Create the session object
session_object = requests.Session() 

In [4]:
#We need 3000 reviews and each page returns 10 reviews, so request 300 pages in a loop.

all_reviews = []
end_cursor = ''
start_cursor = ''

# Fetch each page with a GET request, passing the cursors returned by the previous page.

for index in range(300):
    payload = {'direction': 'next', 'endCursor': end_cursor, 'startCursor': start_cursor}  # Set the parameters
    time.sleep(5)  # Pause between requests to avoid overloading the server
    r = session_object.get(url, params=payload, headers=headers)  # GET call
    r.raise_for_status()  # Stop on an HTTP error instead of parsing an error response
    data = r.json()
    all_reviews.extend(data['reviews'])
    end_cursor = data['pageInfo']['endCursor']
    start_cursor = data['pageInfo']['startCursor']

all_reviews

[{'rating': 'STAR_5',
  'review': 'I liked most that the animation made the animals look so real',
  'displayName': 'Joanne H',
  'displayImageUrl': None,
  'isVerified': True,
  'isSuperReviewer': False,
  'hasSpoilers': False,
  'hasProfanity': False,
  'createDate': '2019-08-18T08:54:30.664Z',
  'updateDate': '2019-08-18T08:54:30.890Z',
  'user': {'userId': '2c73ed20-5b9f-41b3-a4fd-8dd3ff8bb20a',
   'realm': 'Fandango',
   'displayName': 'Joanne H',
   'accountLink': None},
  'score': 5,
  'timeFromCreation': '31m ago'},
 {'rating': 'STAR_5',
  'review': 'Amazing! So realistic and incredible music',
  'displayName': 'Frankie C',
  'displayImageUrl': 'https://graph.facebook.com/v3.3/594379764/picture',
  'isVerified': False,
  'isSuperReviewer': False,
  'hasSpoilers': False,
  'hasProfanity': False,
  'createDate': '2019-08-18T08:03:49.380Z',
  'updateDate': '2019-08-18T08:03:49.380Z',
  'user': {'userId': '871398953',
   'realm': 'RT',
   'displayName': 'Frankie C',
   'accountLink
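
The loop in cell 4 is a form of cursor-based pagination: each response carries the cursors needed to request the next page. The sketch below isolates that logic so it can be run without the live site. The names `collect_reviews`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, and stopping when a page has no reviews or no `endCursor` is an assumption about how the endpoint signals its last page.

```python
from typing import Callable, Dict, List


def collect_reviews(fetch_page: Callable[[Dict[str, str]], dict],
                    max_pages: int) -> List[dict]:
    """Follow endCursor/startCursor from page to page, up to max_pages."""
    reviews: List[dict] = []
    end_cursor, start_cursor = "", ""
    for _ in range(max_pages):
        params = {"direction": "next",
                  "endCursor": end_cursor,
                  "startCursor": start_cursor}
        data = fetch_page(params)
        page = data.get("reviews", [])
        if not page:
            break  # nothing more to collect
        reviews.extend(page)
        page_info = data.get("pageInfo", {})
        end_cursor = page_info.get("endCursor") or ""
        start_cursor = page_info.get("startCursor") or ""
        if not end_cursor:
            break  # no cursor to continue from (assumed end of data)
    return reviews


# Hypothetical in-memory stand-in for the live endpoint: 3 pages of 2 reviews.
FAKE_PAGES = {
    "": {"reviews": [{"score": 5}, {"score": 4}],
         "pageInfo": {"endCursor": "c1", "startCursor": "c0"}},
    "c1": {"reviews": [{"score": 3}, {"score": 2}],
           "pageInfo": {"endCursor": "c2", "startCursor": "c1"}},
    "c2": {"reviews": [{"score": 1}, {"score": 5}],
           "pageInfo": {}},
}


def fake_fetch(params: Dict[str, str]) -> dict:
    """Return the fake page keyed by the endCursor that was sent."""
    return FAKE_PAGES[params["endCursor"]]


collected = collect_reviews(fake_fetch, max_pages=300)
print(len(collected))  # 6
```

Passing the fetch function as a parameter keeps the cursor handling testable; in the notebook, the equivalent of `fetch_page` is the `session_object.get(...)` call followed by `r.json()`.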

In [5]:
#The JSON above shows that the user information is stored in a separate dictionary nested inside each review dictionary.
#So, copy the user information into top-level key:value pairs and then delete the nested attribute to avoid repetition.


for value in all_reviews:
    value['userId'] = value['user']['userId']
    value['userRealm'] = value['user']['realm']
    value['userDisplayName'] = value['user']['displayName']
    value['userAccountLink'] = value['user']['accountLink']
    del value['user']
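
Because the cell above deletes `value['user']` in place, running it a second time raises a `KeyError`. A non-mutating variant is sketched below as one possible alternative. The helper name `flatten_user` is hypothetical, and filling missing fields with `None` is an assumption; the sample record reuses values from the output shown earlier.

```python
from typing import Any, Dict


def flatten_user(review: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of review with the nested 'user' dict lifted to top-level keys."""
    flat = {k: v for k, v in review.items() if k != "user"}
    user = review.get("user") or {}  # tolerate a missing or empty 'user' entry
    flat["userId"] = user.get("userId")
    flat["userRealm"] = user.get("realm")
    flat["userDisplayName"] = user.get("displayName")
    flat["userAccountLink"] = user.get("accountLink")
    return flat


# Trimmed sample record based on the first review in the output above.
sample = {
    "rating": "STAR_5",
    "score": 5,
    "user": {"userId": "2c73ed20-5b9f-41b3-a4fd-8dd3ff8bb20a",
             "realm": "Fandango",
             "displayName": "Joanne H",
             "accountLink": None},
}

flat_sample = flatten_user(sample)
print(sorted(flat_sample))
```

With this variant the whole list could be rebuilt as `[flatten_user(r) for r in all_reviews]`, leaving the original records untouched.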

In [6]:
#Check the length of all_reviews.

len(all_reviews)

3000

In [7]:
#As we have separated out the user attribute, let's check the content now.

all_reviews

[{'rating': 'STAR_5',
  'review': 'I liked most that the animation made the animals look so real',
  'displayName': 'Joanne H',
  'displayImageUrl': None,
  'isVerified': True,
  'isSuperReviewer': False,
  'hasSpoilers': False,
  'hasProfanity': False,
  'createDate': '2019-08-18T08:54:30.664Z',
  'updateDate': '2019-08-18T08:54:30.890Z',
  'score': 5,
  'timeFromCreation': '31m ago',
  'userId': '2c73ed20-5b9f-41b3-a4fd-8dd3ff8bb20a',
  'userRealm': 'Fandango',
  'userDisplayName': 'Joanne H',
  'userAccountLink': None},
 {'rating': 'STAR_5',
  'review': 'Amazing! So realistic and incredible music',
  'displayName': 'Frankie C',
  'displayImageUrl': 'https://graph.facebook.com/v3.3/594379764/picture',
  'isVerified': False,
  'isSuperReviewer': False,
  'hasSpoilers': False,
  'hasProfanity': False,
  'createDate': '2019-08-18T08:03:49.380Z',
  'updateDate': '2019-08-18T08:03:49.380Z',
  'score': 5,
  'timeFromCreation': '1h ago',
  'userId': '871398953',
  'userRealm': 'RT',
  'user

In [8]:
#Save the file in json format on your system.
import json

with open("Lion_King_Review.json", "w") as write_file:
    json.dump(all_reviews, write_file)
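
A quick way to confirm that a save like the one above preserves the records is to read the file back and compare. The sketch below does this in a temporary directory so it does not touch the real file. The file name `reviews_check.json` and the single trimmed record are hypothetical stand-ins for `all_reviews`.

```python
import json
import os
import tempfile

# Trimmed stand-in for all_reviews, using values from the output above.
records = [{"review": "Amazing! So realistic and incredible music", "score": 5}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reviews_check.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as write_file:
        json.dump(records, write_file)
    with open(path, "r", encoding="utf-8") as read_file:
        reloaded = json.load(read_file)

# True when the reloaded data matches what was written.
round_trip_ok = reloaded == records
print(round_trip_ok)
```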

In [9]:
#Read the json file from the system.

reviews_data = pd.read_json('Lion_King_Review.json')

In [10]:
#Check the top 3 rows of the dataset.

reviews_data.head(3)

| | createDate | displayImageUrl | displayName | hasProfanity | hasSpoilers | isSuperReviewer | isVerified | rating | review | score | timeFromCreation | updateDate | userAccountLink | userDisplayName | userId | userRealm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-08-18T08:54:30.664Z | | Joanne H | False | False | False | True | STAR_5 | I liked most that the animation made the anima... | 5.0 | 31m ago | 2019-08-18T08:54:30.890Z | | Joanne H | 2c73ed20-5b9f-41b3-a4fd-8dd3ff8bb20a | Fandango |
| 1 | 2019-08-18T08:03:49.380Z | https://graph.facebook.com/v3.3/594379764/picture | Frankie C | False | False | False | False | STAR_5 | Amazing! So realistic and incredible music | 5.0 | 1h ago | 2019-08-18T08:03:49.380Z | /user/id/871398953 | Frankie C | 871398953 | RT |
| 2 | 2019-08-18T07:13:32.422Z | | jaycee | False | False | False | False | STAR_5 | Classic. Good remake. Loved it. Glover was out... | 5.0 | 2h ago | 2019-08-18T07:13:32.422Z | | jaycee | DD2453B0-37CE-4B47-A099-D15378FC310E | Fandango |


In [11]:
#Check the shape of the dataset.

reviews_data.shape

(3000, 16)

In [12]:
#Saving the dataframe in CSV file format.

reviews_data.to_csv('Lion_King_Review.csv')

This concludes the data collection part. I will load the saved file ("Lion_King_Review.csv") in the next sheet and perform the preprocessing and EDA steps there.