# Data Engineering Group Project

## Team: Data Magicians
Members:
<p>
Arda Putra Ryandika
</p>
<p>
Atthaya Busayaruengrat (Hong)
</p>
<p>
Jingxue (Vera) Cao
</p>
<p>
Katharina Wiedmann  
</p>
<p>
Ying Tung (Debbie) Lau
</p>

## Objective
In this project, we take movie review data from three sources (a CSV file, a TSV file, and web scraping) and create weak labelling functions based on the data. The objective is to compare the performance of a machine-learning classifier with that of the combined weak labelling functions.


## Movie Review Data

We split the work of obtaining data among three group members (Hong, Debbie, Vera). Hong was responsible for scraping movie reviews from the Rotten Tomatoes website, while Debbie and Vera looked for existing datasets in other formats (TSV and CSV) to ensure a sufficiently large sample.

The dataset contains two columns: the text of the review and a label, 1 or 0. 1 indicates a positive review (fresh), and 0 indicates a non-positive review (not fresh).

The final dataset consists of 15000 reviews in total: 5000 from web scraping, 5000 from the CSV file, and 5000 from the TSV file.

We further sampled 2000 data points from the labelled test set for labelling function development (the development set). In the run recorded in this notebook, the sample contains 1187 positive and 813 non-positive reviews, so the two classes are not equally sized.
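
If an exactly balanced development set is wanted, one option is to draw the same number of reviews from each class. The sketch below is a minimal, dependency-free illustration; `balanced_sample`, the toy rows, and the seed are assumptions made for this example and are not code from the notebook.

```python
import random

def balanced_sample(rows, label_key, n_per_class, seed=42):
    """Return n_per_class randomly chosen rows for every label value."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < n_per_class:
            raise ValueError(f"label {label!r} has only {len(group)} rows")
        sample.extend(rng.sample(group, n_per_class))
    rng.shuffle(sample)  # avoid having all rows of one class first
    return sample

# Toy data (hypothetical): 6 fresh and 4 rotten reviews.
toy_rows = [{"Review": f"review {i}", "Freshness": 1 if i < 6 else 0} for i in range(10)]
dev = balanced_sample(toy_rows, "Freshness", n_per_class=3)
counts = {label: sum(r["Freshness"] == label for r in dev) for label in (0, 1)}
print(counts)  # {0: 3, 1: 3}
```

With a DataFrame, recent pandas versions (1.1 or later, newer than the 0.24.2 pinned in this notebook) offer a per-group `sample` on `groupby` objects that expresses the same idea.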


## Building classifiers 

Arda was responsible for building an NLP classifier, which is later compared with the labelling function classifier. The model uses tokens generated from the two types of reviews as features. It yielded 70% accuracy on the test set. 
Arda also set up Spark on Faculty so that the labelling functions can be run in a Spark environment. 
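
The classifier itself is not shown in this section, so the following is only a sketch of what "tokens as features" can look like: a tiny multinomial naive Bayes over word counts, written in plain Python. The model choice, the class name `TinyNaiveBayes`, and the toy reviews are all assumptions for illustration; this is not the team's model and it does not reproduce the 70% figure.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over token counts."""

    def fit(self, texts, labels):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                score += math.log((counts[token] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data (hypothetical): 1 = fresh, 0 = rotten.
train_texts = [
    "a beautiful and charming film",
    "brilliant, witty and touching",
    "a dull and lazy mess",
    "tedious plot and lazy writing",
]
train_labels = [1, 1, 0, 0]
model = TinyNaiveBayes().fit(train_texts, train_labels)
print(model.predict("charming and witty"))  # 1
print(model.predict("lazy and dull"))       # 0
```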

## Building Labelling functions

Meanwhile, the rest of the team worked on weak labelling functions based on the findings from data exploration. 

For data exploration we chose pandas over Spark, because Spark partitions the dataset and this slowed the computation down. We used pandas to look for patterns and differences between positive and non-positive reviews. We faced a few challenges at this stage. For example, we built a lemmatization function (implemented with NLTK's Lancaster stemmer) on the word count dataframe to avoid bias caused by word inflections. However, many words were converted into incorrect or completely different forms, so we decided not to use lemmatization. 

For the labelling functions, Katie first removed stop words from the reviews and counted word occurrences in positive and negative reviews separately. Based on the differences in word occurrences, we built labelling functions that separate positive from non-positive reviews using words exclusive to one class. 

We also looked at the use of capital letters in each type of review, but since capital letters were unrelated to sentiment and mostly not meaningful, we decided not to create a labelling function based on them.

We noticed that keywords like “too” and “far” occurred more often in one type of review, so we created further labelling functions based on these keywords. 
Similarly, exclamation marks and question marks appeared more often in one type of review, so we created labelling functions based on them as well.
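
A keyword or punctuation labelling function can be sketched as a plain function that votes for a class or abstains. The text does not say which class each cue favours, so the directions below ("too" and "?" voting non-positive, "!" voting positive) are assumptions chosen purely for illustration, as are the function names. In the notebook, such functions would be wrapped with Snorkel's `labeling_function` decorator, which is imported there.

```python
import re

# Label convention for this sketch (assumed, mirroring the 1/0 labels of the dataset).
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

def lf_contains_too(review):
    """Vote NEGATIVE when the standalone word 'too' appears (assumed direction)."""
    return NEGATIVE if re.search(r"\btoo\b", review, flags=re.IGNORECASE) else ABSTAIN

def lf_exclamation(review):
    """Vote POSITIVE when the review contains '!' (assumed direction)."""
    return POSITIVE if "!" in review else ABSTAIN

def lf_question(review):
    """Vote NEGATIVE when the review contains '?' (assumed direction)."""
    return NEGATIVE if "?" in review else ABSTAIN

# Hypothetical reviews, one vote list per review.
examples = [
    "Far too long and far too loud.",
    "A joy from start to finish!",
    "Who asked for this sequel?",
    "Deft, daft and wildly endearing.",
]
for text in examples:
    votes = [lf(text) for lf in (lf_contains_too, lf_exclamation, lf_question)]
    print(votes, text)
```

Each function abstains unless its cue is present, which is what makes it a weak signal: coverage is low, and the direction of the vote has to be checked against the development set.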

After finalising the labelling functions, Arda built a classifier that combines them. 
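
The combining step is not shown in this section. The notebook imports Snorkel's `LabelModel`, which learns how much to trust each function; as a simpler stand-in, the sketch below combines votes by majority and ignores abstentions. The label matrix and the tie-breaking rule are assumptions for illustration only.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(votes, tie_value=ABSTAIN):
    """Return the most common non-abstain vote, or tie_value on a tie or when all abstain."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return tie_value
    ranked = counted.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return tie_value
    return ranked[0][0]

# Rows are reviews, columns are labelling functions (hypothetical votes).
label_matrix = [
    [1, 1, -1],
    [0, -1, 0],
    [1, 0, -1],
    [-1, -1, -1],
]
combined = [majority_vote(row) for row in label_matrix]
print(combined)  # [1, 0, -1, -1]
```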

## Results

<>

## GitHub
Every time we made a change, we used the terminal in Faculty to push it to our group repository. 
After committing a change, we used “git status” to double-check the state of the repository. Using “git diff” also let us see all the changes in the repository.

Link to GitHub repository:
https://github.com/KatharinaWiedmann/DataEngGroupProject.git


## JIRA board 
We used JIRA to manage the progress of our project and record our meeting topics. A brief summary of the meetings is shown below:
<p>
First meeting:
•	Rotten Tomatoes labelling functions - good movie or not (positive/ negative rating). 
•	Twitter data: Labelling whether someone is a Boris Johnson supporter or not. 
•	YouTube comments: positive or negative comment 
•	Promotion emails - is an email a spam email or a genuine email. 
</p>
<p>
Second meeting:
•	Labelling functions (Vera, Debbie, Katie) 
•	Classifier (Arda) 
•	Web Scraping (Hong) 
•	GitHub (Katie)
</p>
<p>
Third meeting:
•	Rewriting labelling functions (SparkLFApplier - done together)
•	Combining labelling functions (Hong & Katie)
•	Analyzing summary/ results labelling functions (Hong & Katie)
•	Plug in classifier (Arda)
•	Compare results between using labelling functions and not labelling functions (Arda)
•	Iterate on Mark-up/ write-down (Vera)
•	JIRA cleanup & additional notes (Katie)
•	Github reminder - don't forget to push/ pull (everyone)
•	Make sensitivity analysis work (Debbie) 
•	nt function (Hong)
</p>

Further details can be found at:
http://csjira2.cs.ucl.ac.uk:8080/secure/RapidBoard.jspa?rapidView=316&view=detail&selectedIssue=DED-16


In [3]:
conda install pandas==0.24.2

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [None]:
#conda install networkx==2.3.0
#run once and then need to restart the kernel?

In [4]:
conda upgrade --all -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda/envs/Python3


The following packages will be UPDATED:

  networkx                                         2.3-py_0 --> 2.4-py_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.


In [5]:
conda install snorkel==0.9.0 -c conda-forge

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda/envs/Python3

  added / updated specs:
    - snorkel==0.9.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         149 KB

The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates     pkgs/main::ca-certificates-2020.1.1-0 --> conda-forge::ca-certificates-2019.11.28-hecc5488_0
  certifi                                         pkgs/main --> conda-forge
  openssl              pkgs/main::openssl-1.1.1d-h7b6447c_4 --> conda-forge::openssl-1.1.1d-h516909a_0



Downloading and Extracting Packages
certifi-2019.11.28   | 149 KB    | #

In [6]:
conda install -c conda-forge textblob

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [7]:
from snorkel.labeling import LFAnalysis
from snorkel.labeling import labeling_function
from snorkel.labeling import PandasLFApplier,LabelModel
from snorkel.preprocess import preprocessor
from textblob import TextBlob
import nltk
from itertools import repeat
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
import re

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
from bs4 import BeautifulSoup
from csv import writer
import pickle
import time
import json

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/faculty/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
#Spark 

# Spark Environment
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
import pyspark

number_cores = 4
memory_gb = 16
conf = (
    pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext.getOrCreate(conf=conf)
print(sc)

# get the context
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark) 

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

<SparkContext master=local[4] appName=pyspark-shell>
<pyspark.sql.session.SparkSession object at 0x7fdc4cad5160>


# Web Scraping

## Browse all from DVD releases page

In [9]:
main = 'https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvudu%3Bamazon_prime%3Bfandango_now&certified&sortBy=release&type=dvd-streaming-all&page='

In [10]:
# Get movie url
movie_url = []
start_page = 1
end_page = 1
while start_page <= end_page:
#     time.sleep(7)
    url = main + str(start_page)
    response = requests.get(url)
    if response.status_code != 200:
        print('Request error')
        break
    # The endpoint returns JSON; each entry in 'results' carries a relative movie url
    file = json.loads(response.text)
    for i in file['results']:
        movie_url.append(i['url'])
    start_page += 1

In [11]:
print('Examples for the url:\n')
for i in range(3):
    print(movie_url[i])
print('\nNumber of movies in list: {}'.format(len(movie_url)))

Examples for the url:

/m/frozen_ii
/m/playmobil_the_movie
/m/charlies_angels_2019

Number of movies in list: 32


In [12]:
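# Split into lists of 50 movies to do the scraping
# Note: the upper bound 600 is hard-coded, so a shorter movie_url list produces empty trailing chunks
movie_url_split = [movie_url[i:i+50] for i in range(0,600,50)]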

In [13]:
len(movie_url_split)

12
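
The split above uses a fixed upper bound of 600, so a shorter URL list yields empty trailing chunks (12 chunks for the 32 URLs collected in this run). A length-aware alternative could look like the sketch below; the helper name `chunked` and the sample paths are illustrative assumptions.

```python
def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

sample_paths = [f"/m/movie_{i}" for i in range(32)]  # hypothetical paths
chunks = chunked(sample_paths, 50)
print(len(chunks), len(chunks[0]))  # 1 32
```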

In [14]:
# Get reviews from the web
reviews = []
titles = []
ratings = []
for split in movie_url_split: # Loop through each split
#     time.sleep(7)
    for movie_path in split: # Loop through each relative movie url, e.g. /m/frozen_ii
        url = 'https://www.rottentomatoes.com'+movie_path
#         time.sleep(7)
        response = requests.get(url)
        # Check the request status code
        if response.status_code != 200:
            print('Request error')
            break  # only leaves the inner loop; the next split is still processed
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Get labels from each review (fresh vs. rotten)
        fresh_rotten = soup.find_all(class_="review_quote")
        
        # Get movie title
        title = soup.find(class_="mop-ratings-wrap__title mop-ratings-wrap__title--top").getText()
        
        # Get reviews
        review = soup.find_all('blockquote')
        for i in review:
            j = str(i.contents[1])
            j = j.replace("<p>\n                    \n                        ","")
            j = j.replace("\n                    \n                </p>","")
            reviews = reviews + [j]
            titles = titles + [title]
        
        # Identify labels
        for i in fresh_rotten:
            temp = str(i.findChildren()[2])
            if re.search('rotten',temp):
                ratings = ratings + ['rotten']
            else:
                ratings = ratings + ['fresh']
            

In [15]:
# Create the DataFrame to store the scraped data
df = pd.DataFrame([titles,reviews,ratings],index = ['title','review','rating']).T

In [16]:
# Clean the data (drop duplicates, check missing values etc.)
df = df.drop_duplicates()
df = df.replace([None],np.nan)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 335 entries, 0 to 335
Data columns (total 3 columns):
title     335 non-null object
review    335 non-null object
rating    335 non-null object
dtypes: object(3)
memory usage: 10.5+ KB


In [17]:
df.dropna(inplace=True)
df.head(3)

Unnamed: 0,title,review,rating
0,Frozen II,Frozen II is a worthy follow-up with enough he...,fresh
1,Frozen II,[Some] sequences have a gleam and a rhythm to ...,fresh
2,Frozen II,...one of the most beautifully animated films ...,fresh


In [18]:
# Export to CSV files
# df.to_csv('web_scraping_rotten_tomatoes.csv')

# Reading & Preparing TSV file 

In [19]:
# Read TSV file
tsv_reviews = pd.read_csv('/project/reviews.tsv', sep='\t', header=0, encoding='unicode_escape')

In [20]:
tsv_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [21]:
# Extract review and fresh columns
tsv_reviews = pd.DataFrame(tsv_reviews, columns = ['review', 'fresh'])

In [22]:
tsv_reviews.head()

Unnamed: 0,review,fresh
0,A distinctly gallows take on contemporary fina...,fresh
1,It's an allegory in search of a meaning that n...,rotten
2,... life lived in a bubble in financial dealin...,fresh
3,Continuing along a line introduced in last yea...,fresh
4,... a perverse twist on neorealism...,fresh


In [23]:
tsv_reviews.isnull().sum()

review    5563
fresh        0
dtype: int64

In [24]:
# drop rows with a missing review
tsv_reviews.dropna(subset=['review'], inplace=True)

In [25]:
tsv_reviews.isnull().sum()

review    0
fresh     0
dtype: int64

In [26]:
# recode the labels: fresh -> '1', rotten -> '0' (strings here; cast to int after concatenation)
tsv_reviews['fresh'] = tsv_reviews['fresh'].replace({'fresh':'1', 'rotten':'0'})

In [27]:
#Rename columns
tsv_reviews.rename(columns={'fresh':'Freshness','review':'Review'},inplace=True)
#take a random sample of 5000 reviews
tsv_reviews = tsv_reviews.sample(5000)
tsv_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 50737 to 34957
Data columns (total 2 columns):
Review       5000 non-null object
Freshness    5000 non-null object
dtypes: object(2)
memory usage: 117.2+ KB


# Reading & Preparing CSV file 

In [28]:
# Read CSV file
csv_reviews= pd.read_csv('/project/rotten_tomatoes_reviews.csv')
csv_reviews.head()


Unnamed: 0,Freshness,Review
0,1,"Manakamana doesn't answer any questions, yet ..."
1,1,Wilfully offensive and powered by a chest-thu...
2,0,It would be difficult to imagine material mor...
3,0,Despite the gusto its star brings to the role...
4,0,If there was a good idea at the core of this ...


In [29]:
#Swap Freshness and Review 
columns_titles = ["Review","Freshness"]
csv_reviews=csv_reviews.reindex(columns=columns_titles)
csv_reviews = csv_reviews.sample(5000)

csv_reviews.head()
csv_reviews.info()

Unnamed: 0,Review,Freshness
75458,The narrative dares to break the insular herm...,1
368729,"It's literally hard to watch, as copious amou...",0
38248,"If you insist on seeing the entire, seemingly...",0
449372,70 odd minutes of medical tragedy and cops ma...,0
292863,The Nice Guys is funny enough when it sticks ...,1


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 75458 to 439480
Data columns (total 2 columns):
Review       5000 non-null object
Freshness    5000 non-null int64
dtypes: int64(1), object(1)
memory usage: 117.2+ KB


# Web Scraping & Preparing scraped data 

In [30]:
#Read Web Scraping Data
web_scraping_reviews= pd.read_csv('/project/web_scraping_rotten_tomatoes.csv')
web_scraping_reviews.head()

web_scraping_reviews = web_scraping_reviews.sample(5000)

Unnamed: 0.1,Unnamed: 0,title,review,rating
0,0,My Hindu Friend (Meu amigo Hindu),Babenco's cinematic farewell isn't perfect by ...,fresh
1,1,My Hindu Friend (Meu amigo Hindu),This is a good film if you are looking for som...,fresh
2,2,My Hindu Friend (Meu amigo Hindu),"My Hindu Friend is a celebration of life, love...",fresh
3,3,My Hindu Friend (Meu amigo Hindu),I wouldn't miss it; it's a film that's more th...,fresh
4,4,My Hindu Friend (Meu amigo Hindu),"...surreal, reflective (though never sentiment...",fresh


In [31]:
# recode the labels: fresh -> '1', rotten -> '0'
web_scraping_reviews['rating'] = web_scraping_reviews['rating'].replace({'fresh':'1', 'rotten':'0'})

# Rename rating to Freshness and review to Review
web_scraping_reviews.rename(columns={'rating':'Freshness', 'review':'Review'},inplace=True)
web_scraping_reviews.head()

# Extract Review and Freshness columns
web_scraping_reviews= pd.DataFrame(web_scraping_reviews, columns = ['Review', 'Freshness'])

Unnamed: 0.1,Unnamed: 0,title,Review,Freshness
98,98,First Love (Hatsukoi),If your partner calls First Love the perfect d...,1
432,432,Mrs. Lowry & Son,"On the screen, Mrs Lowry and Son, much like on...",1
3987,3987,"Frankenstein's Monster's Monster, Frankenstein",The main benefit of its brevity is that you ca...,1
1427,1427,The Shed,Without turning into a cheesy after school spe...,1
1923,1923,Emanuel,Forgiveness is a radical act. So is speaking a...,1


In [32]:
web_scraping_reviews.head()

Unnamed: 0,Review,Freshness
98,If your partner calls First Love the perfect d...,1
432,"On the screen, Mrs Lowry and Son, much like on...",1
3987,The main benefit of its brevity is that you ca...,1
1427,Without turning into a cheesy after school spe...,1
1923,Forgiveness is a radical act. So is speaking a...,1


# Combining all the data together 

In [33]:
# csv_reviews.info()
# tsv_reviews.info()
# web_scraping_reviews.info()



# Concat the three sources into all_reviews
all_reviews=pd.concat([csv_reviews, tsv_reviews,web_scraping_reviews],axis=0, sort=False)
all_reviews.head()

Unnamed: 0,Review,Freshness
75458,The narrative dares to break the insular herm...,1
368729,"It's literally hard to watch, as copious amou...",0
38248,"If you insist on seeing the entire, seemingly...",0
449372,70 odd minutes of medical tragedy and cops ma...,0
292863,The Nice Guys is funny enough when it sticks ...,1


In [34]:
all_reviews.shape

(15000, 2)

# Split into test and training set 

In [35]:
from sklearn.model_selection import train_test_split

In [36]:
all_reviews['Freshness'] = all_reviews['Freshness'].astype(int)

In [37]:
train, test = train_test_split(all_reviews,test_size=0.2,stratify = all_reviews['Freshness'])

In [38]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 187186 to 435467
Data columns (total 2 columns):
Review       12000 non-null object
Freshness    12000 non-null int64
dtypes: int64(1), object(1)
memory usage: 281.2+ KB


In [39]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 38911 to 3960
Data columns (total 2 columns):
Review       3000 non-null object
Freshness    3000 non-null int64
dtypes: int64(1), object(1)
memory usage: 70.3+ KB


In [40]:
train.head()

Unnamed: 0,Review,Freshness
187186,One character gets knocked out so many times ...,0
53153,A lot more fun than most of the director's pom...,1
8124,Friends with Kids is funny and likable and whi...,1
56239,[Shows] less 'the majesty of rock' ... than t...,1
42383,If you can follow the plotlines of this swinge...,0


In [41]:
#remove the labels from the training set
train = train.drop('Freshness', axis=1)

In [42]:
test['Freshness'].value_counts()

1    1789
0    1211
Name: Freshness, dtype: int64

In [43]:
#From the labelled test set, extract a sample to find out which labelling functions could be written
#Not sure how big the development split can be --> take a sample of 2000 data points

development_split = test.sample(2000,random_state=42)
development_split.head()


Unnamed: 0,Review,Freshness
351693,"Nerve's biggest problem is that, much like a ...",0
291546,Fledgling director Bong Joon-ho received the ...,1
20579,Unconventional romantic comedy with a slap and...,1
54199,"Make no mistake, this movie is trash, but it's...",1
96286,This film adaptation of his James and the Gia...,1


In [44]:
# test2000.index

In [45]:
# test1000 = pd.merge(test,test2000,how='left',on = 'Review')
# test1000 = test1000[test1000.Freshness_y.isnull()]

In [46]:
# test1000.shape

In [47]:
# test1000.drop('Freshness_y',axis=1, inplace=True)
# test1000.columns = ['Review','Freshness']
# test1000

In [48]:
# test1000.to_csv('1000_labels.csv')

In [49]:
#For finding labelling functions: 
development_split

Unnamed: 0,Review,Freshness
351693,"Nerve's biggest problem is that, much like a ...",0
291546,Fledgling director Bong Joon-ho received the ...,1
20579,Unconventional romantic comedy with a slap and...,1
54199,"Make no mistake, this movie is trash, but it's...",1
96286,This film adaptation of his James and the Gia...,1
3779,Just when it's on the brink of becoming one of...,0
37177,Stiller plays a familiar character with a nice...,1
4614,"Deft, daft and wildly endearing.",1
699,"To Reeder's immense credit, her film glides fo...",1
113837,I've never been a big fan of this Bond flick....,0


In [50]:
development_split.count()

Review       2000
Freshness    2000
dtype: int64

In [51]:
development_split.to_csv('development_split.csv')

In [52]:
# development_split = pd.read_csv('development_split.csv',index_col = 'Unnamed: 0')

In [53]:
development_split.count()

Review       2000
Freshness    2000
dtype: int64

In [54]:
development_split.head()

Unnamed: 0,Review,Freshness
351693,"Nerve's biggest problem is that, much like a ...",0
291546,Fledgling director Bong Joon-ho received the ...,1
20579,Unconventional romantic comedy with a slap and...,1
54199,"Make no mistake, this movie is trash, but it's...",1
96286,This film adaptation of his James and the Gia...,1


In [55]:
#Might have to get rid of the index?

development_split[development_split['Freshness'] !=1].count()

Review       813
Freshness    813
dtype: int64

In [56]:
development_split['Freshness'].value_counts()

1    1187
0     813
Name: Freshness, dtype: int64

## Split into positive and negative reviews

In [57]:
development_split = pd.read_csv('development_split.csv')

In [58]:
development_split_fresh = development_split[development_split['Freshness'] == 1]
development_split_rotten = development_split[development_split['Freshness'] == 0]
development_split_fresh.head()
development_split_rotten.head()

Unnamed: 0.1,Unnamed: 0,Review,Freshness
1,291546,Fledgling director Bong Joon-ho received the ...,1
2,20579,Unconventional romantic comedy with a slap and...,1
3,54199,"Make no mistake, this movie is trash, but it's...",1
4,96286,This film adaptation of his James and the Gia...,1
6,37177,Stiller plays a familiar character with a nice...,1


Unnamed: 0.1,Unnamed: 0,Review,Freshness
0,351693,"Nerve's biggest problem is that, much like a ...",0
5,3779,Just when it's on the brink of becoming one of...,0
9,113837,I've never been a big fan of this Bond flick....,0
10,58684,Most of the segments are as clueless about th...,0
12,7946,This year's prize for most ludicrous set-up a...,0


In [59]:
# fresh reviews 
development_split_fresh.info()
# rotten reviews 
development_split_rotten.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1187 entries, 1 to 1999
Data columns (total 3 columns):
Unnamed: 0    1187 non-null int64
Review        1187 non-null object
Freshness     1187 non-null int64
dtypes: int64(2), object(1)
memory usage: 37.1+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 813 entries, 0 to 1998
Data columns (total 3 columns):
Unnamed: 0    813 non-null int64
Review        813 non-null object
Freshness     813 non-null int64
dtypes: int64(2), object(1)
memory usage: 25.4+ KB


## Labelling Functions

## 1. Word occurrences

### Positive Reviews

In [60]:
# Removing punctuation 
def remove_punctuation(dataframe):
    """Lowercase each review in dataframe.Review and strip punctuation.

    Returns a list of cleaned review strings; reviews that end up empty are skipped.
    """
    cleaned_reviews = []
    for review in dataframe.Review.str.lower():
        cleaned = re.sub(r'[^\w\s]', '', review)
        if cleaned != '':
            cleaned_reviews.append(cleaned)
    return cleaned_reviews

In [61]:
#Remove punctuation from fresh reviews & turn into Series
split_fresh= pd.Series(remove_punctuation(development_split_fresh))
split_fresh.head()

0     fledgling director bong joonho received the b...
1    unconventional romantic comedy with a slap and...
2    make no mistake this movie is trash but its me...
3     this film adaptation of his james and the gia...
4    stiller plays a familiar character with a nice...
dtype: object

In [62]:
stopWordList = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves",\
                "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their",\
                "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was",\
                "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and",\
                "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between",\
                "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off",\
                "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any",\
                "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",\
                "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [63]:
#Removing stopwords from fresh

replacements = dict(zip((fr'\b{word}\b' for word in stopWordList), repeat("")))
split_fresh.replace(replacements, regex=True, inplace=True)
split_fresh.replace({r' +': ' ', r' +\.': '.'}, regex=True, inplace=True)

### Implement lemmatization

In [64]:
# Stem each word and count again (referred to as lemmatization in this notebook)
def stem_recount(df):
    # Stemming with NLTK's Lancaster stemmer
    from nltk import LancasterStemmer
    st = LancasterStemmer()
    newdf = df.copy()
    for i in range(0,len(df)):
        newdf.iloc[i,0] = st.stem(str(df.iloc[i,0])) 
        # Please make sure the word column is the first column in df when using this function
    
    # Recount
    duplicate = newdf[newdf.duplicated(['index'])]
    # Please make sure 'index' is the name of the column containing the words
    # Note: rows of newdf and duplicate are compared position by position,
    # so counts are only merged when the same stem sits at the same position in both
    for i in range(0,len(newdf)):
        if i >= len(duplicate):
            break
        if newdf.iloc[i,0] == duplicate.iloc[i,0]:
            newdf.iloc[i,1] = newdf.iloc[i,1] + duplicate.iloc[i,1]
    return newdf

In [65]:
common_words_fresh = split_fresh.str.split(expand=True).stack().value_counts()
common_words_fresh_df = pd.DataFrame(common_words_fresh)
common_words_fresh_df = common_words_fresh_df.rename({0:'Occurence good review'}, axis='columns')
new_common_words_fresh_df = common_words_fresh_df.reset_index()

In [66]:
stem_recount(new_common_words_fresh_df)

Unnamed: 0,index,Occurence good review
0,film,253
1,movy,115
2,on,88
3,lik,86
4,story,82
5,ful,68
6,span,60
7,review,57
8,good,53
9,best,49


In [67]:
#Get most common words in positive reviews 
top_common_words_fresh = common_words_fresh_df[common_words_fresh_df['Occurence good review'] >=4]
top_common_words_fresh

Unnamed: 0,Occurence good review
film,214
movie,115
one,88
like,86
story,82
full,68
spanish,60
review,57
good,53
best,49


### Why we do not use lemmatization

<div class="alert alert-success">
<b> Reason for not using lemmatization </b>

<p>
    Before counting word occurrences in the movie reviews, we noticed that word inflections may lead to separate counts for the same root and thus bias the counting. For example, "enjoy" and "enjoyed" share the same root but would be counted separately without lemmatization.
    </p> 
    
<p>
    The function "stem_recount" reduces each word to its root and recounts the occurrences. However, it has the disadvantage of normalizing words into incorrect or completely different words. For example, "movie" became "movy", "like" became "lik", and "one" became "on". We judged that this disadvantage outweighs the benefit of merging word inflections, so we decided not to use it.
    </p> 

</div>
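
If stemming were kept, the counts of all words that collapse to the same stem would need to be summed. The sketch below shows that grouping step with a `Counter`. `toy_stem` is a deliberately crude stand-in for a real stemmer such as the Lancaster stemmer used above, and the word counts are made up for illustration.

```python
from collections import Counter

def toy_stem(word):
    """Crude stand-in for a stemmer: strip a few common suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def recount_by_stem(word_counts, stem=toy_stem):
    """Sum the counts of all words that share the same stem."""
    merged = Counter()
    for word, count in word_counts.items():
        merged[stem(word)] += count
    return merged

# Hypothetical word counts.
word_counts = {"film": 10, "films": 4, "enjoy": 3, "enjoyed": 2, "story": 5}
print(recount_by_stem(word_counts).most_common())
```

Grouping by stem merges every inflection regardless of where it sits in the table, which avoids relying on row positions.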

### Negative reviews

In [68]:
#Remove punctuation from rotten & turn into Series
split_rotten= pd.Series(remove_punctuation(development_split_rotten))
split_rotten.head()

0     nerves biggest problem is that much like a co...
1    just when its on the brink of becoming one of ...
2     ive never been a big fan of this bond flick c...
3     most of the segments are as clueless about th...
4     this years prize for most ludicrous setup and...
dtype: object

In [69]:
# Removing stopwords from negative reviews
replacements = dict(zip((fr'\b{word}\b' for word in stopWordList), repeat("")))
split_rotten.replace(replacements, regex=True, inplace=True)
split_rotten.replace({r' +': ' ', r' +\.': '.'}, regex=True, inplace=True)

In [70]:
#Get most common words in negative reviews 

common_words_rotten = split_rotten.str.split(expand=True).stack().value_counts()
common_words_rotten_df = pd.DataFrame(common_words_rotten)
common_words_rotten_df = common_words_rotten_df.rename({0:'Occurence bad review'}, axis='columns')
top_common_words_rotten = common_words_rotten_df[common_words_rotten_df['Occurence bad review'] >=4]

In [71]:
top_common_words_rotten.head()

Unnamed: 0,Occurence bad review
film,129
movie,105
like,75
story,60
one,54


## Comparison of good and bad reviews

<div class="alert alert-success">
We want to find out which of the frequent words in good reviews do not appear among the frequent words in bad reviews, and vice versa, and base labelling functions on these findings. We first need to prepare the data accordingly before we can write the labelling functions.
</div>
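
The idea can be sketched without pandas: count words per class, keep those that reach a minimum count (the notebook uses 4), and take the set difference in each direction. The helper names and the toy reviews are assumptions for illustration.

```python
from collections import Counter

def frequent_words(reviews, min_count):
    """Return the set of words that occur at least min_count times across reviews."""
    counts = Counter(word for review in reviews for word in review.lower().split())
    return {word for word, n in counts.items() if n >= min_count}

def exclusive_frequent_words(fresh_reviews, rotten_reviews, min_count=4):
    """Words frequent in one class but not frequent in the other."""
    fresh = frequent_words(fresh_reviews, min_count)
    rotten = frequent_words(rotten_reviews, min_count)
    return fresh - rotten, rotten - fresh

# Toy reviews (hypothetical); min_count is lowered to 2 to suit the tiny sample.
fresh_reviews = ["beautiful film", "beautiful story", "clever film"]
rotten_reviews = ["dull film", "dull story", "lazy film"]
fresh_only, rotten_only = exclusive_frequent_words(fresh_reviews, rotten_reviews, min_count=2)
print(sorted(fresh_only), sorted(rotten_only))  # ['beautiful'] ['dull']
```

Note that "exclusive" here means "not among the other class's frequent words": a word that appears fewer than `min_count` times in the other class still counts as exclusive.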

In [72]:
# Keep the words that are frequent in one class only ('left_only' rows of the outer merge)
top_fresh_words_exclusive = top_common_words_fresh.merge(top_common_words_rotten, indicator='i', how='outer', left_index=True,
                                                         right_index=True).query('i == "left_only"').drop('i', axis=1)

top_rotten_words_exclusive = top_common_words_rotten.merge(top_common_words_fresh, indicator='i', how='outer', left_index=True,
                                                           right_index=True).query('i == "left_only"').drop('i', axis=1)

In [73]:
top_fresh_words_exclusive.head()

Unnamed: 0,Occurence good review,Occurence bad review
2019,4.0,
absolute,6.0,
absorbing,4.0,
accomplished,4.0,
account,5.0,


In [74]:
top_rotten_words_exclusive.head()

Unnamed: 0,Occurence bad review,Occurence good review
2,4.0,
achieve,4.0,
adaptation,4.0,
amusing,4.0,
apart,5.0,


In [75]:
#Get only positive words 
top_fresh_words_exclusive_list = top_fresh_words_exclusive['Occurence good review'].index.tolist()
top_fresh_words_exclusive_list

['2019',
 'absolute',
 'absorbing',
 'accomplished',
 'account',
 'across',
 'adult',
 'adults',
 'aesthetic',
 'air',
 'ambitious',
 'america',
 'amp',
 'animated',
 'animation',
 'answers',
 'anyone',
 'approach',
 'arent',
 'artist',
 'aside',
 'astonishing',
 'atmosphere',
 'atmospheric',
 'based',
 'beautiful',
 'beautifully',
 'beauty',
 'beginning',
 'begins',
 'behind',
 'bill',
 'bloody',
 'bond',
 'book',
 'break',
 'breaks',
 'breath',
 'brian',
 'bright',
 'brilliant',
 'bring',
 'brings',
 'british',
 'brutal',
 'call',
 'called',
 'captures',
 'catch',
 'certain',
 'cheap',
 'chilling',
 'choice',
 'city',
 'clearly',
 'clever',
 'colorful',
 'comedic',
 'comic',
 'commentary',
 'committed',
 'common',
 'compelling',
 'complex',
 'constructed',
 'content',
 'convincing',
 'cop',
 'courage',
 'craft',
 'crafted',
 'creates',
 'creative',
 'creativity',
 'credit',
 'crime',
 'cultural',
 'culture',
 'cute',
 'dark',
 'darkly',
 'darkness',
 'date',
 'david',
 'de',
 'dead',

In [76]:
#pick out the words that seem to make sense (this replaces the DataFrame above with a hand-picked list): 
top_fresh_words_exclusive = ['absolutely',
 'addition',
 'adventure',
 'affectionate',
 'amazing',
 'ambition',
 'art',
 'artist',
 'arts',
 'atmosphere',
 'attractive',
 'awards',
 'balance',
 'beautiful',
 'beautifully',
 'beauty',
 'bond',
 'bright',
 'brilliant',
 'captivating',
 'captures',
 'celebration',
 'charm',
 'charming',
 'christmas',
 'classic',
 'clever',
 'committed',
 'consistently',
 'contemporary',
 'conventional',
 'convincingly',
 'creates',
 'creating',
 'crowdpleaser',
 'cult',
 'decade',
 'decades',
 'deep',
 'deeper',
 'deeply',
 'definitely',
 'delightful',
 'delightfully',
 'depth',
 'deserves',
 'design',
 'details',
 'different',
 'diverse',
 'dramatic',
 'early',
 'elegant',
 'emotionally',
 'engaging',
 'enjoyable',
 'enjoyed',
 'equal',
 'especially',
 'exploration',
 'extraordinary',
 'extremely',
 'familiar',
 'famous',
 'fan',
 'fantastic',
 'fantasy',
 'fascinating',
 'felt',
 'filled',
 'finest',
 'frank',
 'fresh',
 'friends',
 'friendship',
 'gags',
 'gorgeous',
 'grand',
 'happy',
 'heart',
 'hilarious',
 'honest',
 'hope',
 'huge',
 'impact',
 'insightful',
 'inspiring',
 'intelligent',
 'intense',
 'intrigue',
 'joy',
 'laugh',
 'loved',
 'mature',
 'mind',
 'mystery',
 'nostalgia',
 'novel',
 'opening',
 'passion',
 'perfect',
 'performers',
 'personal',
 'pleasure',
 'poignant',
 'power',
 'powerful',
 'precisely',
 'profound',
 'project',
 'proves',
 'provide',
 'provocative',
 'psychological',
 'quality',
 'remarkable',
 'reveals',
 'rich',
 'riveting',
 'satisfying',
 'sharp',
 'simple',
 'smart',
 'smile',
 'stunning',
 'succeeds',
 'supernatural',
 'surprise',
 'surprises',
 'surprising',
 'surprisingly',
 'sweet',
 'talents',
 'thoughtful',
 'thrills',
 'touch',
 'touching',
 'tragedy',
 'tragic',
 'tribute',
 'unique',
 'universal',
 'warm',
 'watchable',
 'welcome',
 'wit',
 'witty',
 'wonderful',
 'worthwhile',
 'worthy']


In [77]:
#Get only negative words 
top_rotten_words_exclusive_list = top_rotten_words_exclusive['Occurence bad review'].index.tolist()
top_rotten_words_exclusive_list

['2',
 'achieve',
 'adaptation',
 'amusing',
 'apart',
 'appealing',
 'attempt',
 'attention',
 'awful',
 'basic',
 'battle',
 'beats',
 'becoming',
 'bigger',
 'biggest',
 'bore',
 'boring',
 'camp',
 'cartoon',
 'christmas',
 'clichés',
 'close',
 'collection',
 'commercial',
 'conventional',
 'convoluted',
 'couldnt',
 'country',
 'course',
 'days',
 'designed',
 'didnt',
 'die',
 'disappointing',
 'dry',
 'dull',
 'dumb',
 'eat',
 'emotions',
 'empty',
 'ends',
 'entirely',
 'example',
 'except',
 'excuse',
 'execution',
 'fact',
 'fails',
 'falling',
 'falls',
 'final',
 'flat',
 'formulaic',
 'gags',
 'give',
 'ground',
 'hair',
 'half',
 'hardly',
 'head',
 'hollow',
 'hoping',
 'idea',
 'impossible',
 'inevitable',
 'inspired',
 'jordan',
 'keep',
 'lacks',
 'largely',
 'leave',
 'leaves',
 'line',
 'live',
 'loses',
 'macgruber',
 'main',
 'mediocre',
 'mediocrity',
 'mess',
 'michael',
 'misses',
 'moves',
 'mr',
 'nearly',
 'needs',
 'none',
 'nonsense',
 'nuance',
 'obvious

In [78]:
# Keep only a hand-picked set of words that plausibly signal a negative review
top_rotten_words_exclusive = [
 'attempt',
 'awkward',
 'barely',
 'basically',
 'bizarre',
 'bland',
 'boring',
 'clumsy',
 'comedic',
 'disappointing',
 'disappointingly',
 'disappointment',
 'disaster',
 'dull',
 'effort',
 'failed',
 'fails',
 'generic',
 'irritating',
 'lacking',
 'manic',
 'missing',
 'nobody',
 'noir',
 'none',
 'painfully',
 'pointless',
 'poorly',
 'problem',
 'shallow',
 'shame',
 'sloppy',
 'slow',
 'suffers',
 'superficial',
 'try',
 'unfortunately',
 'unfunny',
 'worst']

## Labelling Function

### 1. Word Occurrences

### A. Good / bad exclusive word occurrences

In [79]:
from snorkel.labeling.apply.spark import SparkLFApplier

from pyspark import SparkContext 
from pyspark.sql import SQLContext 
import pandas as pd 
sqlc=SQLContext(sc) 
df=pd.read_csv('/project/development_split.csv',index_col = 'Unnamed: 0')
df_with_punctuation = df.copy()
df['Review'] = remove_punctuation(df)
development_split=sqlc.createDataFrame(df)
development_split_with_punctuation=sqlc.createDataFrame(df_with_punctuation) 

In [80]:
development_split.show(5)

+--------------------+---------+
|              Review|Freshness|
+--------------------+---------+
| nerves biggest p...|        0|
| fledgling direct...|        1|
|unconventional ro...|        1|
|make no mistake t...|        1|
| this film adapta...|        1|
+--------------------+---------+
only showing top 5 rows



In [81]:
# Label values shared by all labelling functions
ABSTAIN = -1
NOTFRESH = 0
FRESH = 1

@labeling_function()
def fresh(x):
    # str(x) is the printed form of the whole Row, not only the Review field
    text = str(x).lower()
    for word in top_fresh_words_exclusive:
        # Pad with spaces so that only whole words are matched
        if " " + word + " " in text:
            return FRESH
    return ABSTAIN

@labeling_function()
def rotten(x):
    text = str(x).lower()
    for word in top_rotten_words_exclusive:
        if " " + word + " " in text:
            return NOTFRESH
    return ABSTAIN
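
Padding each word with spaces only matches a word that has a space on both sides, so a word at the very start or end of a review, or next to punctuation, is missed. One possible alternative is a regular expression with word boundaries. The sketch below is an illustration under that assumption, not the method used in this notebook; `build_word_pattern` and `label_fresh` are hypothetical names.

```python
import re

ABSTAIN = -1
FRESH = 1

def build_word_pattern(words):
    """Compile one regex that matches any of the words as a whole word."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b")

def label_fresh(text, pattern):
    """Return FRESH if any listed word occurs as a whole word, else ABSTAIN."""
    return FRESH if pattern.search(text.lower()) else ABSTAIN

pattern = build_word_pattern(["brilliant", "charming"])

print(label_fresh("Brilliant.", pattern))            # word at start, followed by punctuation
print(label_fresh("unbrilliantly done", pattern))    # only a substring, not a whole word
print(" brilliant " in "brilliant.")                 # the padded-space test on the same text
```

Compiling the pattern once also avoids looping over the word list for every review.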

In [82]:
lfs = [fresh]
applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [83]:
sample_L

array([[-1],
       [-1],
       [-1],
       ...,
       [-1],
       [ 1],
       [-1]])

In [84]:
coverage_fresh = (sample_L != ABSTAIN).mean(axis=0)
print("fresh coverage:{:.1%}".format(coverage_fresh[0]))

fresh coverage:34.6%


In [85]:
lfs = [rotten]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [86]:
coverage_rotten = (sample_L != ABSTAIN).mean(axis=0)
print("rotten coverage:{:.1%}".format(coverage_rotten[0]))

rotten coverage:8.3%
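
Coverage here is the fraction of reviews on which a labelling function returns something other than `ABSTAIN`. The expression `(sample_L != ABSTAIN).mean(axis=0)` computes it per column of the label matrix. A minimal sketch of the same calculation for a single function, with invented label values:

```python
ABSTAIN = -1

def coverage(labels):
    """Fraction of data points on which the labelling function did not abstain."""
    return sum(1 for label in labels if label != ABSTAIN) / len(labels)

toy_labels = [-1, 1, -1, 1, -1, -1, 1, -1]  # invented values, 3 of 8 are labelled
print("coverage: {:.1%}".format(coverage(toy_labels)))
```

Coverage says nothing about whether the assigned labels are correct; it only measures how often the function votes.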


### B. Word 'too' occurrences

In [87]:
common_words_fresh_df[common_words_fresh_df.index == 'too']

Unnamed: 0,Occurence good review
too,39


In [88]:
common_words_rotten_df[common_words_rotten_df.index == 'too']

Unnamed: 0,Occurence bad review
too,54


In [89]:
@labeling_function()
def keyword_too(x):
    return NOTFRESH if 'too' in str(x).lower() else ABSTAIN

In [90]:
lfs = [keyword_too]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [91]:
coverage_keyword_too = (sample_L != ABSTAIN).mean(axis=0)
print("keyword too coverage:{:.1%}".format(coverage_keyword_too[0]))

keyword too coverage:5.1%


### C. Word 'far' occurrences

In [92]:
common_words_fresh_df[common_words_fresh_df.index == 'far']

Unnamed: 0,Occurence good review
far,15


In [93]:
common_words_rotten_df[common_words_rotten_df.index == 'far']

Unnamed: 0,Occurence bad review
far,13


In [94]:
@labeling_function()
def keyword_far(x):
    return FRESH if 'far' in str(x).lower() else ABSTAIN

In [95]:
lfs = [keyword_far]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [96]:
coverage_keyword_far = (sample_L != ABSTAIN).mean(axis=0)
print("keyword far coverage:{:.1%}".format(coverage_keyword_far[0]))

keyword far coverage:2.2%


### D. "n't" words occurrences

In [97]:
# Exploration of contractions ending in n't
# Word occurrences counted with punctuation kept

# Word occurrences dataframe for fresh reviews
development_split_fresh_1 = split_fresh.str.split(expand=True).stack().value_counts()
development_split_fresh_df = pd.DataFrame(development_split_fresh_1).reset_index()

# Words occurrences dataframe for rotten reviews
development_split_rotten_1 = split_rotten.str.split(expand=True).stack().value_counts()
development_split_rotten_df = pd.DataFrame(development_split_rotten_1).reset_index()

In [98]:
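import re

@labeling_function()
def t(x):
    # Flag contractions such as "isn't" or "doesn't" by searching for 't
    # Caution: development_split holds the punctuation-stripped reviews, so
    # apostrophes seen in str(x) may not come from the review text itself
    if re.search("'t", str(x).lower()):
        return NOTFRESH
    return ABSTAIN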
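
Because the labelling functions call `str(x)` on the whole row, the text they search includes the field names and the quote characters around the review. The sketch below illustrates this with a `namedtuple` as a stand-in for a Spark Row. It assumes a Row prints its fields in the same `name='value'` style, and the review text is invented.

```python
import re
from collections import namedtuple

# Stand-in for a Spark Row; the field names mirror the development split
Row = namedtuple("Row", ["Review", "Freshness"])

row = Row(Review="this film has no apostrophes", Freshness=1)

whole_row = str(row)
print(whole_row)

# Searching the printed row finds 't in ='this, although the review has no apostrophe
matches_row = bool(re.search("'t", whole_row.lower()))
# Searching only the Review field does not
matches_review = bool(re.search("'t", row.Review.lower()))
print(matches_row, matches_review)
```

Reading the field directly, for example `x.Review`, would restrict the search to the review text.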

In [99]:
lfs = [t]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [100]:
coverage_t = (sample_L != ABSTAIN).mean(axis=0)
print("n't coverage:{:.1%}".format(coverage_t[0]))

n't coverage:15.4%


### E. Occurrences of good & bad words from external list

<div class="alert alert-success">
We also want to look at an imported list of positive and negative words and see whether we can base labelling functions on them.
    </div>


In [101]:
#Importing good and bad words & preparing for labelling function 

In [102]:
# POSITIVE WORDS
# Source (to be cited properly): http://ptrckprry.com/course/ssd/data/positive-words.txt
positive_word = pd.read_csv('/project/positive_words.csv')

# Sample 500 words
positive_word = positive_word.sample(500)

# Convert the sampled column into a list
# (the column header 'a+' appears to be the first word of a headerless file)
positive_word = positive_word['a+'].tolist()


# NEGATIVE WORDS
# TODO: record the source of the negative word list
negative_word = pd.read_csv('/project/negative_words.csv')

# Sample 500 words
negative_word = negative_word.sample(500)

# Convert the sampled column into a list
negative_word = negative_word['2-faces'].tolist()
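
The two word lists are sampled without a fixed seed, so each run draws a different 500 words and the coverage figures can change between runs. A minimal sketch of seeded sampling with the standard library follows; the word list and the helper `sample_words` are invented for illustration. pandas' `sample` accepts a `random_state` argument for the same purpose.

```python
import random

def sample_words(words, k, seed):
    """Draw k words without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(words, k)

word_list = ["good", "great", "fine", "superb", "solid", "lovely"]  # invented
first = sample_words(word_list, 3, seed=0)
second = sample_words(word_list, 3, seed=0)
print(first == second)
```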


In [103]:
@labeling_function()
def negative(x): 
    for word in negative_word:
        word = " " + word + " "
        if word in str(x).lower():
            return NOTFRESH 
    return ABSTAIN 


@labeling_function()
def positive(x):
    for word in positive_word:
        word = " " + word + " "
        if word in str(x).lower():
            return FRESH 
    return ABSTAIN 

In [104]:
lfs = [negative]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [105]:
coverage_negative = (sample_L != ABSTAIN).mean(axis=0)
print("negative words coverage:{:.1%}".format(coverage_negative[0]))

negative words coverage:8.6%


In [106]:
lfs = [positive]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [107]:
coverage_positive = (sample_L != ABSTAIN).mean(axis=0)
print("positive words coverage:{:.1%}".format(coverage_positive[0]))

positive words coverage:21.1%


### 2. Punctuation occurrences

In [108]:
# Turn review column into Series
development_split_fresh_series = pd.Series(development_split_fresh.Review)
development_split_rotten_series = pd.Series(development_split_rotten.Review)

In [109]:
# Positive reviews
# Split reviews into words
fresh_split = pd.Series(development_split_fresh_series.str.split(expand=True).stack())
fresh_words = [i for i in fresh_split]

# Split words into characters
def split_str():
    return [list(ch) for ch in fresh_words]
fresh_split_words = pd.Series(split_str())

In [110]:
# Negative reviews
# Split reviews into words
rotten_split = pd.Series(development_split_rotten_series.str.split(expand=True).stack())
rotten_words = [i for i in rotten_split]

# Split words into characters
def split_str():
    return [list(ch) for ch in rotten_words]
rotten_split_words = pd.Series(split_str())

In [111]:
# Turn into a flattened list
fresh_flattened_list = [y for x in fresh_split_words for y in x]
rotten_flattened_list = [y for x in rotten_split_words for y in x]

# Count the occurrences of each character
# Positive reviews
fresh_split_characters = pd.Series(fresh_flattened_list).value_counts()
fresh_split_characters = pd.DataFrame(fresh_split_characters).reset_index()

# Negative reviews
rotten_split_characters = pd.Series(rotten_flattened_list).value_counts()
rotten_split_characters = pd.DataFrame(rotten_split_characters).reset_index()

### A. Question mark occurrences

In [112]:
# Count the occurrences of '?' in fresh reviews
fresh_split_characters[fresh_split_characters['index'] == '?']

Unnamed: 0,index,0
64,?,16


In [113]:
# Count the occurrences of '?' in rotten reviews
rotten_split_characters[rotten_split_characters['index'] == '?']

Unnamed: 0,index,0
53,?,27


In [114]:
list_with_question_mark = []
for review in development_split_rotten.Review:
    if '?' in review:
        list_with_question_mark.append(review)
        
print(list_with_question_mark[:3])

["Arctic Tale isn't a documentary. They say it right there in the title, see? It's a tale. [Blu-ray]", "Listen to the audience as they leave the theatre. Are they talking about being moved by honor and justice, or going 'wow' at the bloodshed so spectacularly delivered?", " It's almost like the movie is afraid of what it should be -- a young, frisky love story that should be exuberant and carefree, even if it means risking making a fool of itself. What's love, after all, if it doesn't do exactly that?"]


In [115]:
@labeling_function()
def question_mark(x):
    return NOTFRESH if '?' in str(x).lower() else ABSTAIN

In [116]:
lfs = [question_mark]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split_with_punctuation.rdd)

In [117]:
coverage_question_mark = (sample_L != ABSTAIN).mean(axis=0)
print("question mark coverage:{:.1%}".format(coverage_question_mark[0]))

question mark coverage:1.9%


### B. Exclamation mark occurrences

In [118]:
# Count the occurrences of '!' in fresh reviews
fresh_split_characters[fresh_split_characters['index'] == '!']

Unnamed: 0,index,0
71,!,7


In [119]:
# Count the occurrences of '!' in rotten reviews
rotten_split_characters[rotten_split_characters['index'] == '!']

Unnamed: 0,index,0
66,!,10


In [120]:
@labeling_function()
def exclamation_mark(x):
    return FRESH if '!' in str(x).lower() else ABSTAIN

In [121]:
lfs = [exclamation_mark]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split_with_punctuation.rdd)

In [122]:
coverage_exclamation_mark = (sample_L != ABSTAIN).mean(axis=0)
print("exclamation mark coverage:{:.1%}".format(coverage_exclamation_mark[0]))

exclamation mark coverage:0.7%
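
The occurrence counts shown for 'too', 'far', '?' and '!' give a rough sense of how reliable each single-token labelling function can be. The worked calculation below uses only the counts printed above and a hypothetical helper `class_share`. These are token occurrences rather than numbers of reviews, so the shares are only a proxy for precision.

```python
def class_share(fresh_count, rotten_count, label):
    """Share of a token's occurrences that fall in the class the function assigns."""
    total = fresh_count + rotten_count
    hits = fresh_count if label == "FRESH" else rotten_count
    return hits / total

# (occurrences in fresh reviews, occurrences in rotten reviews, assigned label)
counts = {
    "too": (39, 54, "NOTFRESH"),
    "far": (15, 13, "FRESH"),
    "?": (16, 27, "NOTFRESH"),
    "!": (7, 10, "FRESH"),
}

for token, (n_fresh, n_rotten, label) in counts.items():
    share = class_share(n_fresh, n_rotten, label)
    print("{!r}: {:.1%} of occurrences are in {} reviews".format(token, share, label))
```

With these counts, 'too' and '?' lean towards rotten reviews and 'far' leans slightly towards fresh ones, while '!' occurs more often in rotten reviews even though `exclamation_mark` assigns FRESH.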


### 3. Combining labelling functions

<div class="alert alert-success">
Next, we combine the labelling functions and apply them to the training set. The punctuation-based labelling functions have very low coverage (1.9% for question marks and 0.7% for exclamation marks), so we leave them out.
    </div>

In [123]:
lfs = [fresh,
       rotten,
       keyword_too,
       keyword_far,
       t,
       negative,
       positive]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [124]:
train_prepared = train.copy()
train_prepared['Review'] = remove_punctuation(train_prepared)

In [125]:
from snorkel.labeling import LFAnalysis
LFAnalysis(L=sample_L, lfs = lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
fresh,0,[1],0.3465,0.1885,0.117
rotten,1,[0],0.083,0.053,0.0395
keyword_too,2,[0],0.0505,0.0375,0.029
keyword_far,3,[1],0.022,0.0165,0.011
t,4,[0],0.1545,0.0865,0.0755
negative,5,[0],0.0865,0.0525,0.041
positive,6,[1],0.2115,0.1415,0.069
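
The Coverage, Overlaps and Conflicts columns can be read as follows: the share of reviews a function labels, the share it labels together with at least one other function, and the share where another function assigns a different label. A minimal sketch of these three quantities on an invented label matrix follows; `lf_stats` is a hypothetical helper, not Snorkel's implementation.

```python
ABSTAIN = -1

def lf_stats(L, j):
    """Coverage, overlaps and conflicts of labelling function j over label matrix L."""
    n = len(L)
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Votes cast by the other labelling functions on the same data point
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n

# Invented label matrix: 4 reviews (rows) and 3 labelling functions (columns)
L_toy = [
    [1, -1, -1],
    [1, 0, -1],
    [1, -1, 1],
    [-1, 0, -1],
]

for j in range(3):
    print(j, lf_stats(L_toy, j))
```

Conflicts can never exceed overlaps, and overlaps can never exceed coverage, which matches the ordering of the values in the summary above.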


In [146]:
lfs = [fresh,
       rotten,
       keyword_too,
       keyword_far,
       t,
       negative,
       positive]

applier = SparkLFApplier(lfs)

# Convert the pandas DataFrames to Spark DataFrames
train_spark = sqlc.createDataFrame(train_prepared)

# Note: unlike the training set, the test set is not passed through remove_punctuation here
test_spark = sqlc.createDataFrame(test)

# Apply the labelling functions to obtain the label matrices
L_train = applier.apply(train_spark.rdd)
L_test = applier.apply(test_spark.rdd)


In [147]:
type(L_train)

numpy.ndarray

In [148]:
L_train

array([[-1, -1, -1, ..., -1,  0, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [ 1, -1, -1, ..., -1, -1, -1],
       ...,
       [ 1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1]])

In [149]:
from snorkel.labeling import MajorityLabelVoter

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)

In [150]:
preds_train.shape

(12000,)

In [151]:
preds_train

array([0, 1, 1, ..., 1, 0, 0])
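
A majority vote takes, for each review, the label that most of the non-abstaining labelling functions agree on. The sketch below shows the idea in plain Python. Returning `ABSTAIN` on ties is a choice made for this illustration; other tie-breaking rules, such as the random one used when scoring later on, are possible. `majority_vote` is a hypothetical helper.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    """Most common non-abstain label in a row; ABSTAIN on ties or if nothing voted."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

print(majority_vote([1, -1, 1, 0]))   # two votes for 1, one for 0
print(majority_vote([1, 0, -1]))      # tie
print(majority_vote([-1, -1, -1]))    # nobody voted
```

Every labelling function counts equally in this scheme, regardless of how accurate it is.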

In [152]:
train2 = train.copy()
train2['predicted_train'] = preds_train
train2.to_csv('12000_predicted_labels.csv')

In [153]:
# Check the installed networkx version (2.3 is expected here)
import networkx as nx
nx.__version__

'2.3'

In [154]:
# Fit a LabelModel, which weights the labelling functions instead of counting every vote equally
from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

In [155]:
# Ground-truth labels for the test set
Y_test = test['Freshness']

In [156]:
majority_acc = majority_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority Vote Accuracy:   59.1%
Label Model Accuracy:     59.4%


## Sentiment Analysis

In [157]:
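from snorkel.preprocess import preprocessor
from textblob import TextBlob

@preprocessor(memoize=True)
def textblob_sentiment(x):
    # Attach TextBlob polarity and subjectivity scores to the data point
    scores = TextBlob(str(x))
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x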

In [158]:
def getSentiment(text):
    x = {}
    x["polarity"] = TextBlob(text).sentiment.polarity
    x["subjectivity"] = TextBlob(text).sentiment.subjectivity
    return x

In [159]:
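# Compute the sentiment inside each labelling function: Spark Rows are
# read-only, so the attribute-setting preprocessor above is not used here.
@labeling_function()
def textblob_polarity(x):
    sentiment = getSentiment(str(x.Review))
    return FRESH if sentiment["polarity"] > 0.8 else ABSTAIN

@labeling_function()
def textblob_subjectivity(x):
    sentiment = getSentiment(str(x.Review))
    return FRESH if sentiment["subjectivity"] >= 0.5 else ABSTAIN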

In [None]:
lfs = [textblob_polarity, textblob_subjectivity]

applier = SparkLFApplier(lfs)

# development_split=sqlc.createDataFrame(development_split)

sample_L = applier.apply(development_split.rdd)

In [None]:
LFAnalysis(L=sample_L, lfs=lfs).lf_summary()

# Classifier

In [None]:
import pandas as pd
import numpy as np
import nltk
import string

# Spark Environment
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

import pyspark

number_cores = 4
memory_gb = 16
conf = (
    pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext.getOrCreate(conf=conf)
print(sc)

# get the context
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark) 

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [None]:
# Download files
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

!pip install langid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import preproc as pp

# Register all the functions in Preproc with Spark Context
check_lang_udf = udf(pp.check_lang, StringType())
remove_stops_udf = udf(pp.remove_stops, StringType())
remove_features_udf = udf(pp.remove_features, StringType())
tag_and_remove_udf = udf(pp.tag_and_remove, StringType())
lemmatize_udf = udf(pp.lemmatize, StringType())
check_blanks_udf = udf(pp.check_blanks, StringType())

In [None]:
from pyspark.sql.types import IntegerType

# Read the data (Spark)
review_df = sqlContext.read.csv('/project/development_split.csv', header=True)

# Rename Column
review_df = review_df.withColumnRenamed('Review','text')
review_df = review_df.withColumnRenamed('Freshness','label')
review_df = review_df.withColumnRenamed('_c0','index')

# Change data type to Integer
review_df = review_df.withColumn("label", review_df["label"].cast(IntegerType()))

# Show df information
review_df.show()
review_df.printSchema()
review_df.count()

In [None]:
# remove stop words to reduce dimensionality
review_df = review_df.withColumn("text", remove_stops_udf(review_df["text"]))

# remove other non essential words
review_df = review_df.withColumn("text", remove_features_udf(review_df["text"]))

# tag the words remaining and keep only Nouns, Verbs and Adjectives
review_df = review_df.withColumn("text", tag_and_remove_udf(review_df["text"]))

# lemmatization of remaining words to reduce dimensionality & boost measures
review_df = review_df.withColumn("text", lemmatize_udf(review_df["text"]))

review_df.show()

In [None]:
#Specify Training and Test data
training_df = review_df
test_df = sqlContext.read.csv('/project/1000_labels.csv', header=True)

# Rename Column
test_df = test_df.withColumnRenamed('Review','text')
test_df = test_df.withColumnRenamed('Freshness','label')
test_df = test_df.withColumnRenamed('_c0','index')

# Change data type to Integer
test_df = test_df.withColumn("label", test_df["label"].cast(IntegerType()))


test_df.show()
test_df.printSchema()
test_df.count()

In [None]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator

# Configure an ML pipeline, which consists of four stages: tokenizer, hashingTF, idf, and nb.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol='words', outputCol="features")
idf = IDF(minDocFreq=3, inputCol="features", outputCol="idf")
# Note: NaiveBayes reads its default featuresCol ("features"), i.e. the raw term
# frequencies; the "idf" column is computed but not consumed by the classifier.
nb = NaiveBayes()
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, nb])


paramGrid = ParamGridBuilder().addGrid(nb.smoothing, [0.0, 1.0]).build()


cv = CrossValidator(estimator=pipeline, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=MulticlassClassificationEvaluator(), 
                    numFolds=4)

cvModel = cv.fit(training_df)

result = cvModel.transform(test_df)
prediction_df = result.select("text", "label", "prediction")
prediction_df.show()
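
The pipeline turns each tokenized review into a fixed-length count vector with the hashing trick and then computes inverse document frequencies over those vectors. The sketch below illustrates both steps in plain Python. It uses CRC32 as a stand-in hash and a tiny feature count, both assumptions made for readability (Spark uses its own hash function and a much larger number of features), and it ignores `minDocFreq`. The toy documents are invented.

```python
import math
import zlib

NUM_FEATURES = 16  # tiny on purpose; real pipelines use far more buckets

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to bucket counts with the hashing trick (CRC32 as a stand-in hash)."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

def idf_weights(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    weights = []
    for j in range(num_features):
        doc_freq = sum(1 for vec in tf_vectors if vec[j] > 0)
        weights.append(math.log((m + 1) / (doc_freq + 1)))
    return weights

# Invented, already tokenized documents
docs = [["dull", "plot"], ["clever", "plot"], ["clever", "clever", "cast"]]
tf = [hashing_tf(doc) for doc in docs]
idf = idf_weights(tf)
tfidf = [[count * weight for count, weight in zip(vec, idf)] for vec in tf]
print(tf[2])
print(tfidf[2])
```

Different words can land in the same bucket; a larger feature count makes such collisions rarer at the cost of longer vectors.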

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Evaluate the Accuracy
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(result, {evaluator.metricName: "accuracy"})