# TextBlob for predicting sentiment score

In this notebook we use the TextBlob library to predict the sentiment score associated with each movie review posted on the subreddit r/HorrorReviewed [3].

In [1]:
import pandas as pd

reddit = pd.read_csv('data/horror1000.csv')

reddit.head()

,Unnamed: 0,id,author,title,score,comments,selftext,created,pinned,total awards,filter,url,created_date,created_time
0,0,c8eo0c,CulturalHater,Midsommar (2019) [occultism/folk-inspired],85,9,**“Midsommar” basks in its own radiant glory. ...,2019-07-02 20:34:51,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2019-07-02,20:34:51
1,1,sx3u5z,FuturistMoon,"PONTYPOOL (2008) [Zombie Apocalypse, Art House]",83,13,**PONTYPOOL (2008)** \- Last year I watched (o...,2022-02-20 15:37:14,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2022-02-20,15:37:14
2,2,hpyn2c,StacysBlog,Color Out of Space (2020) [Supernatural/Body H...,71,13,"""It's just a color.""\n-Ezra\n\n\n\nThe Gardner...",2020-07-12 17:52:35,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2020-07-12,17:52:35
3,3,cpluy7,cdown13,"10,000 Subscribers!",74,4,**Well we did it! And a whole lot sooner than ...,2019-08-13 01:20:01,False,0,Moderator Post,https://www.reddit.com/r/HorrorReviewed/commen...,2019-08-13,01:20:01
4,4,ix6dkd,FuturistMoon,The Autopsy Of Jane Doe (2016) [Witchcraft],69,9,**THE AUTOPSY OF JANE DOE (2016)**\n\nTommy (B...,2020-09-21 18:38:44,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2020-09-21,18:38:44


In [2]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    1000 non-null   int64 
 1   id            1000 non-null   object
 2   author        987 non-null    object
 3   title         1000 non-null   object
 4   score         1000 non-null   int64 
 5   comments      1000 non-null   int64 
 6   selftext      999 non-null    object
 7   created       1000 non-null   object
 8   pinned        1000 non-null   bool  
 9   total awards  1000 non-null   int64 
 10  filter        997 non-null    object
 11  url           1000 non-null   object
 12  created_date  1000 non-null   object
 13  created_time  1000 non-null   object
dtypes: bool(1), int64(4), object(9)
memory usage: 102.7+ KB


## Preprocessing

The data pulled from Reddit is raw and must be processed before further analysis. Every post title follows the format "Title (year) [Genres]", but the movie title is the only part we need for our project. We use regular expressions to split off the movie title, so that the `title` column contains only the movie name for the rest of the study. Posts that do not follow this format, such as moderator announcements, end up with a missing title.

In [3]:
# Clean up the post titles with regular expressions
reddit['title'] = reddit['title'].str.strip()  # Remove leading/trailing spaces
reddit['title'] = reddit['title'].str.extract(r'^(.*?)\s*\(', expand=False)  # Keep only the text before the first "("
# Caution: 'title' now holds only the bare movie name, so the two patterns
# below find no "(year)" or "[genre]" and both new columns come out as NaN
reddit['year'] = reddit['title'].str.extract(r'\((.*?)\)', expand=False)  # Extract year
reddit['genre'] = reddit['title'].str.extract(r'\[(.*?)\]', expand=False)  # Extract genre

# Remove any remaining "(...)" or "[...]" parts from the title column
reddit['title'] = reddit['title'].str.replace(r'\(.*?\)', '', regex=True).str.strip()  # Remove year
reddit['title'] = reddit['title'].str.replace(r'\[.*?\]', '', regex=True).str.strip()  # Remove genre

reddit.head()


,Unnamed: 0,id,author,title,score,comments,selftext,created,pinned,total awards,filter,url,created_date,created_time,year,genre
0,0,c8eo0c,CulturalHater,Midsommar,85,9,**“Midsommar” basks in its own radiant glory. ...,2019-07-02 20:34:51,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2019-07-02,20:34:51,,
1,1,sx3u5z,FuturistMoon,PONTYPOOL,83,13,**PONTYPOOL (2008)** \- Last year I watched (o...,2022-02-20 15:37:14,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2022-02-20,15:37:14,,
2,2,hpyn2c,StacysBlog,Color Out of Space,71,13,"""It's just a color.""\n-Ezra\n\n\n\nThe Gardner...",2020-07-12 17:52:35,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2020-07-12,17:52:35,,
3,3,cpluy7,cdown13,,74,4,**Well we did it! And a whole lot sooner than ...,2019-08-13 01:20:01,False,0,Moderator Post,https://www.reddit.com/r/HorrorReviewed/commen...,2019-08-13,01:20:01,,
4,4,ix6dkd,FuturistMoon,The Autopsy Of Jane Doe,69,9,**THE AUTOPSY OF JANE DOE (2016)**\n\nTommy (B...,2020-09-21 18:38:44,False,0,Movie Review,https://www.reddit.com/r/HorrorReviewed/commen...,2020-09-21,18:38:44,,
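Note that the `year` and `genre` columns in the output above are empty: `title` is overwritten with the bare movie name before the year and genre patterns are applied, so they have nothing left to match. A minimal sketch of one way to avoid this, assuming the "Title (year) [Genres]" format and a four-digit year, is to extract all three parts in a single pass with named groups. The sample titles are copied from the first rows of the data; the names `titles`, `pattern` and `parts` are illustrative.

```python
import pandas as pd

# Sample post titles taken from the first rows of the Reddit data
titles = pd.Series([
    "Midsommar (2019) [occultism/folk-inspired]",
    "PONTYPOOL (2008) [Zombie Apocalypse, Art House]",
    "10,000 Subscribers!",  # moderator post, does not follow the format
])

# One pattern with named groups: title, four-digit year, optional genre list
pattern = r'^\s*(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*(?:\[(?P<genre>.*?)\])?'

# str.extract returns one column per named group; rows that do not match are NaN
parts = titles.str.extract(pattern)
print(parts)
```

Because all three groups are read from the untouched post title, the order of the assignments no longer matters.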


From here on we consider only the following columns:
<ul>
    <li> title: indicating the title of the movie which has been reviewed </li>
    <li> selftext: containing the text of the review </li>
</ul>

In [4]:
data = reddit[["title","selftext"]]

In [5]:
data.head()

,title,selftext
0,Midsommar,**“Midsommar” basks in its own radiant glory. ...
1,PONTYPOOL,**PONTYPOOL (2008)** \- Last year I watched (o...
2,Color Out of Space,"""It's just a color.""\n-Ezra\n\n\n\nThe Gardner..."
3,,**Well we did it! And a whole lot sooner than ...
4,The Autopsy Of Jane Doe,**THE AUTOPSY OF JANE DOE (2016)**\n\nTommy (B...


In [6]:
movies = pd.read_csv('movies.csv')

In [7]:
movies.head()

,Unnamed: 0,title,averageRating,numVotes
0,11,Juna saapuu asemalle,7.4,12312.0
1,13,O Regador Regado,7.1,5539.0
2,23,The Oxford and Cambridge University Boat Race,3.8,46.0
3,37,Barnet Horse Fair,3.3,32.0
4,40,Seine nehrindeki gemiler,3.9,39.0


### Mapping

The Reddit data and the IMDb data were then mapped on the movie title, so that the resulting table contains only the movies common to both tables. We tried two approaches.

Using difflib: we used the string-comparison functions provided by the library to find the similarities or differences between the titles.

However, we found a better and simpler way to map the IMDb movie titles to the Reddit movie titles: joining. Joining gave us better results than difflib, as we used `str.contains` along with it. The merge in the following cell uses `how='inner'`, so only titles present in both tables are kept.
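The two approaches described above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the helper `closest_title`, the `cutoff` of 0.8, and the column name `imdb_title` are hypothetical and do not come from the original notebook, and the misspelled title is deliberate.

```python
import difflib
import pandas as pd

# Toy tables; "Hereditery" is a deliberate typo to show what fuzzy matching adds
reviews = pd.DataFrame({"title": ["Midsommar", "Hereditery", "Unknown Film"]})
imdb = pd.DataFrame({
    "title": ["Midsommar", "Hereditary"],
    "averageRating": [7.1, 7.3],
})

def closest_title(name, candidates, cutoff=0.8):
    """Return the best fuzzy match for name among candidates, or None."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

candidates = imdb["title"].tolist()

# Approach 1: fuzzy matching with difflib, then join on the matched title
reviews["imdb_title"] = [closest_title(t, candidates) for t in reviews["title"]]
fuzzy = reviews.merge(imdb, left_on="imdb_title", right_on="title",
                      how="inner", suffixes=("", "_imdb"))

# Approach 2: exact inner join on the title column
exact = reviews.merge(imdb, on="title", how="inner")

print(fuzzy[["title", "imdb_title", "averageRating"]])
print(exact[["title", "averageRating"]])
```

The fuzzy variant recovers the misspelled title at the cost of comparing every Reddit title against every IMDb title, whereas the exact join keeps only identical strings.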


In [8]:
r = pd.merge(data, movies, on='title', how='inner')  # keep only titles present in both tables

In [9]:
r.head()

,title,selftext,Unnamed: 0,averageRating,numVotes
0,Midsommar,**“Midsommar” basks in its own radiant glory. ...,6755416,7.1,355330.0
1,Midsommar,"""Tomorrow's a big day.""\n-Pelle\n\n\n\n\nAfter...",6755416,7.1,355330.0
2,Midsommar,Hi there! My name is Mandy and I’m one of the ...,6755416,7.1,355330.0
3,Hereditary,By far the best horror flick of 2018. A master...,6439985,7.3,344077.0
4,Hereditary,**Release Date:** June 8th 2018\n\n**Director:...,6439985,7.3,344077.0


In [10]:
r.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2028 entries, 0 to 2027
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          2028 non-null   object 
 1   selftext       2028 non-null   object 
 2   Unnamed: 0     2028 non-null   int64  
 3   averageRating  2028 non-null   float64
 4   numVotes       2028 non-null   float64
dtypes: float64(2), int64(1), object(2)
memory usage: 95.1+ KB


In [11]:
r

,title,selftext,Unnamed: 0,averageRating,numVotes
0,Midsommar,**“Midsommar” basks in its own radiant glory. ...,6755416,7.1,355330.0
1,Midsommar,"""Tomorrow's a big day.""\n-Pelle\n\n\n\n\nAfter...",6755416,7.1,355330.0
2,Midsommar,Hi there! My name is Mandy and I’m one of the ...,6755416,7.1,355330.0
3,Hereditary,By far the best horror flick of 2018. A master...,6439985,7.3,344077.0
4,Hereditary,**Release Date:** June 8th 2018\n\n**Director:...,6439985,7.3,344077.0
...,...,...,...,...,...
2023,Dr. Jekyll and Mr. Hyde,Basic plot: A scientist (Fredric March) creat...,3267015,7.1,41.0
2024,The Dark,**THE DARK** always stood out in my mind as on...,535469,8.0,95.0
2025,The Dark,**THE DARK** always stood out in my mind as on...,2103739,7.6,6.0
2026,The Dark,**THE DARK** always stood out in my mind as on...,4034783,9.6,8.0
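The last rows above show the same review of The Dark matched to three different IMDb entries, because the IMDb table holds several movies with that title and the join pairs the review with each of them. A minimal sketch of one possible remedy, assuming that the entry with the most votes is the intended movie (a heuristic that is not part of the original notebook), keeps a single IMDb row per title before merging. The names `movies_unique` and `matched` are illustrative; the IMDb values are those shown in the output above and the review text is shortened.

```python
import pandas as pd

# One review and the three IMDb rows it was matched against in the output above
data = pd.DataFrame({
    "title": ["The Dark"],
    "selftext": ["**THE DARK** always stood out in my mind as on..."],
})
movies = pd.DataFrame({
    "Unnamed: 0": [535469, 2103739, 4034783],
    "title": ["The Dark", "The Dark", "The Dark"],
    "averageRating": [8.0, 7.6, 9.6],
    "numVotes": [95.0, 6.0, 8.0],
})

# Assumption: for duplicate titles, keep the IMDb row with the most votes
movies_unique = (
    movies.sort_values("numVotes", ascending=False)
          .drop_duplicates(subset="title", keep="first")
)

# validate="many_to_one" makes pandas raise if a title is still duplicated
matched = pd.merge(data, movies_unique, on="title", how="inner",
                   validate="many_to_one")
print(matched)
```

Without this step each duplicated title contributes one row per IMDb entry, which also inflates the row count reported by `r.info()`.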


In [12]:
scoring = r[["title","selftext","averageRating"]]

In [13]:
scoring.head()

,title,selftext,averageRating
0,Midsommar,**“Midsommar” basks in its own radiant glory. ...,7.1
1,Midsommar,"""Tomorrow's a big day.""\n-Pelle\n\n\n\n\nAfter...",7.1
2,Midsommar,Hi there! My name is Mandy and I’m one of the ...,7.1
3,Hereditary,By far the best horror flick of 2018. A master...,7.3
4,Hereditary,**Release Date:** June 8th 2018\n\n**Director:...,7.3


In [14]:
%pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [15]:
from textblob import TextBlob

def get_sentiment(text):
    """Return TextBlob's polarity for a review, a float in [-1.0, 1.0]."""
    blob = TextBlob(text)
    return blob.sentiment.polarity

# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that plain column assignment raises here because scoring is a slice of r
scoring = scoring.assign(sentiment=scoring['selftext'].apply(get_sentiment))


In [16]:
scoring.head()

,title,selftext,averageRating,sentiment
0,Midsommar,**“Midsommar” basks in its own radiant glory. ...,7.1,0.099604
1,Midsommar,"""Tomorrow's a big day.""\n-Pelle\n\n\n\n\nAfter...",7.1,0.079944
2,Midsommar,Hi there! My name is Mandy and I’m one of the ...,7.1,0.164479
3,Hereditary,By far the best horror flick of 2018. A master...,7.3,-0.030238
4,Hereditary,**Release Date:** June 8th 2018\n\n**Director:...,7.3,0.076175
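TextBlob polarity lies in the range [-1.0, 1.0], while IMDb ratings run from 1 to 10, so the two columns are not directly comparable. A minimal sketch of one way to put them on the same scale, assuming a simple linear mapping (an illustrative choice, not something this notebook establishes), is given here using the five rows above. The function name `polarity_to_rating` and the column names `predicted_rating` and `abs_error` are hypothetical.

```python
import pandas as pd

# The five rows shown above, without the review text
scoring = pd.DataFrame({
    "title": ["Midsommar", "Midsommar", "Midsommar", "Hereditary", "Hereditary"],
    "averageRating": [7.1, 7.1, 7.1, 7.3, 7.3],
    "sentiment": [0.099604, 0.079944, 0.164479, -0.030238, 0.076175],
})

def polarity_to_rating(polarity):
    """Map a polarity in [-1, 1] linearly onto the 1-10 rating scale."""
    return 1 + (polarity + 1) / 2 * 9

scoring["predicted_rating"] = scoring["sentiment"].apply(polarity_to_rating)
scoring["abs_error"] = (scoring["predicted_rating"] - scoring["averageRating"]).abs()

print(scoring)
print("Mean absolute error:", scoring["abs_error"].mean())
```

A fitted model would normally replace the fixed linear mapping; the sketch only makes the scale difference between the two columns explicit.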


In [17]:
avg_sent = scoring.groupby('title')['sentiment'].mean().reset_index()

In [22]:
avg_sent[:10]

,title,sentiment
0,10 Cloverfield Lane,Positive
1,31,Positive
2,47 Meters Down: Uncaged,Positive
3,A Nightmare on Elm Street,Positive
4,A Quiet Place,Positive
5,A Tale of Two Sisters,Positive
6,Abominable,Positive
7,After Midnight,Positive
8,Alien,Positive
9,Alien Abduction,Negative
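The table above contains labels rather than the numeric means computed in cell 17, so a labelling step must have run in between that is not shown in this notebook. A minimal sketch of such a step, assuming the label depends only on the sign of the mean polarity (the rule actually used is not known), could look like this. The titles, the polarity values and the function name `label_polarity` are made up for illustration.

```python
import pandas as pd

# Made-up mean polarities standing in for the output of the groupby step
avg_sent = pd.DataFrame({
    "title": ["Example Movie A", "Example Movie B"],
    "sentiment": [0.12, -0.05],
})

def label_polarity(polarity, threshold=0.0):
    """Label a mean polarity as Positive above the threshold, else Negative."""
    return "Positive" if polarity > threshold else "Negative"

# Replace the numeric column with its label, as in the table above
avg_sent["sentiment"] = avg_sent["sentiment"].apply(label_polarity)
print(avg_sent)
```

Keeping the numeric mean in a separate column alongside the label would preserve the information needed to compare against `averageRating` later.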
