# Data Engineering Group Project

## Team: Data Magicians
Members:
<p>
Arda Putra Ryandika
</p>
<p>
Atthaya Busayaruengrat (Hong)
</p>
<p>
Jingxue (Vera) Cao
</p>
<p>
Katharina Wiedmann  
</p>
<p>
Ying Tung (Debbie) Lau
</p>

## Objective
In this project, we take movie review data from three sources (a CSV file, a TSV file, and web scraping) and write weak labelling functions based on the data. The objective is to compare the performance of a machine-learning classifier with that of a combination of weak labelling functions. 


## Movie Review Data

We split the work of obtaining data among three group members (Hong, Debbie, Vera). Hong was responsible for scraping movie reviews from the Rotten Tomatoes website, while Debbie and Vera sourced additional data in TSV and CSV format to ensure a sufficiently large sample. 

The dataset contains two columns: the text of the review, and a label of 1 or 0. A label of 1 indicates a positive review (fresh), and 0 indicates a non-positive review (not fresh).

The final dataset consists of 15000 reviews in total: 5000 from web scraping, 5000 from the CSV file, and 5000 from the TSV file. 

We further sampled 2000 data points for labelling function development (the development data set) and made sure that it contained equal numbers of positive and non-positive reviews. 
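
A balanced sample of this kind can be drawn by sampling each class separately. The sketch below is illustrative only and works on plain Python rows; `balanced_sample` and `toy_reviews` are hypothetical names and are not part of the project code.

```python
import random


def balanced_sample(rows, label_key, n_per_class, seed=42):
    """Draw n_per_class rows for every label value, without replacement."""
    rng = random.Random(seed)

    # Group the rows by their label
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < n_per_class:
            raise ValueError(f"label {label!r} has only {len(group)} rows")
        sample.extend(rng.sample(group, n_per_class))

    # Shuffle so that the classes are not stored in blocks
    rng.shuffle(sample)
    return sample


# Toy data standing in for the labelled reviews (10 fresh, 20 not fresh)
toy_reviews = [
    {"Review": f"review {i}", "Freshness": 1 if i % 3 == 0 else 0}
    for i in range(30)
]
dev = balanced_sample(toy_reviews, "Freshness", n_per_class=5)
```

With `n_per_class=1000` on the real data, this would give a 2000-row development set that is half positive and half non-positive.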


## Building classifiers 

Arda was responsible for building an NLP classifier, which is later compared with the labelling function classifier. The model uses tokens generated from the two types of reviews as features. This NLP classifier achieved 70% accuracy on the test set. 
Arda also set up Spark on Faculty so that the labelling functions could be applied in a Spark environment. 

## Building Labelling functions

Meanwhile, the rest of the team members worked on generating weak labelling functions based on the findings in data exploration. 

During data exploration, we chose pandas over Spark to look for patterns and differences between positive and non-positive reviews, because Spark partitions the dataset and this slowed down the computation. We faced a few challenges at this stage. For example, we built a lemmatization function on the word count dataframe to avoid classifier bias caused by word inflections. However, many words were incorrectly converted into completely different words, so we decided not to use lemmatization. 

As for building the labelling functions, Katie first removed stop words from the reviews and counted the word occurrences in positive and negative reviews. By comparing the word occurrences, we built labelling functions that separate positive and non-positive reviews based on words that are frequent in only one of the two groups. 
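
The idea of exclusive words can be sketched in a few lines of plain Python. This is a simplified illustration rather than the notebook's pandas code: `word_counts`, `exclusive_words` and the `min_count` threshold are names chosen for this example, and the toy reviews are made up.

```python
import re
from collections import Counter


def word_counts(reviews, stop_words=frozenset()):
    """Count lower-cased words across reviews, ignoring punctuation and stop words."""
    counts = Counter()
    for text in reviews:
        tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
        counts.update(tok for tok in tokens if tok not in stop_words)
    return counts


def exclusive_words(counts_a, counts_b, min_count=4):
    """Words that reach min_count in counts_a but not in counts_b."""
    return sorted(
        word
        for word, count in counts_a.items()
        if count >= min_count and counts_b.get(word, 0) < min_count
    )


# Made-up reviews for illustration
fresh_counts = word_counts(
    ["A charming, charming film", "Charming and witty", "so charming; dull at times"]
)
rotten_counts = word_counts(["Dull and boring", "dull, dull, dull", "boring plot"])

fresh_only = exclusive_words(fresh_counts, rotten_counts, min_count=2)
rotten_only = exclusive_words(rotten_counts, fresh_counts, min_count=2)
```

Here "dull" appears once in the fresh reviews but stays below the threshold there, so it still counts as a word that is characteristic of the rotten reviews.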

We also looked at capital letters in each type of review, but since they were unrelated to sentiment and most of the findings were not meaningful, we decided not to create a labelling function based on them.

We noticed that keywords such as “too” and “far” occurred more often in one type of review, so we created further labelling functions based on these keywords. 
Similarly, exclamation marks and question marks appeared more often in one type of review, so we created labelling functions based on them as well.

After finalising the labelling functions, Arda built a classifier to combine them. 
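
As a point of reference for what combining labelling functions means, the simplest combiner is a majority vote over the functions that did not abstain. The sketch below is a baseline for illustration only and is not the classifier Arda built; `majority_vote` and `tie_value` are names assumed for this example.

```python
from collections import Counter

ABSTAIN, NOTFRESH, FRESH = -1, 0, 1


def majority_vote(votes, tie_value=ABSTAIN):
    """Return the most common non-abstaining vote, or tie_value on a tie."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN  # every labelling function abstained
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_value  # equally many FRESH and NOTFRESH votes
    return ranked[0][0]


# One row of votes per review, one column per labelling function
toy_votes = [
    [FRESH, FRESH, NOTFRESH, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN, ABSTAIN],
    [FRESH, NOTFRESH, ABSTAIN, ABSTAIN],
    [NOTFRESH, ABSTAIN, ABSTAIN, ABSTAIN],
]
combined = [majority_vote(row) for row in toy_votes]
```

A majority vote treats every labelling function as equally reliable, which is why a learned combiner is usually preferred once the functions differ in accuracy.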

## Results

<>

## GitHub
Every time we made a change, we used the terminal in Faculty to push it to our group repository. 
After committing a change, we used “git status” to double-check the state of the repository. We also used “git diff” to see all the changes in the repository.

Link to GitHub repository:
https://github.com/KatharinaWiedmann/DataEngGroupProject.git


## JIRA board 
We used JIRA to track the progress of our project and to record our meeting topics. A brief summary of the meetings is shown below:
<p>
First meeting:
•	Rotten Tomatoes labelling functions - good movie or not (positive/ negative rating). 
•	Twitter data: Labelling whether someone is a Boris Johnson supporter or not. 
•	YouTube comments: positive or negative comment 
•	Promotion emails - is an email a spam email or a genuine email. 
</p>
<p>
Second meeting:
•	Labelling functions (Vera, Debbie, Katie) 
•	Classifier (Arda) 
•	Web Scraping (Hong) 
•	GitHub (Katie)
</p>
<p>
Third meeting:
•	Rewriting labelling functions (SparkLFApplier - done together)
•	Combining labelling functions (Hong & Katie)
•	Analyzing summary/ results labelling functions (Hong & Katie)
•	Plug in classifier (Arda)
•	Compare results between using labelling functions and not labelling functions (Arda)
•	Iterate on Mark-up/ write-down (Vera)
•	JIRA cleanup & additional notes (Katie)
•	Github reminder - don't forget to push/ pull (everyone)
•	Make sensitivity analysis work (Debbie) 
•	n't labelling function (Hong)
</p>

Further details can be found at:
http://csjira2.cs.ucl.ac.uk:8080/secure/RapidBoard.jspa?rapidView=316&view=detail&selectedIssue=DED-16


In [1]:
conda install pandas==0.24.2

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [2]:
# conda install networkx==2.3.0
# run once and then need to restart the kernel?

In [4]:
#needs to show version 2.3
import networkx as nx
nx.__version__

'2.3'

In [5]:
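# Caution: this also upgrades networkx from 2.3 to 2.4 (see the output below), undoing the version checked above
conda upgrade --all -y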

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda/envs/Python3


The following packages will be UPDATED:

  networkx                                         2.3-py_0 --> 2.4-py_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.


In [6]:
conda install snorkel==0.9.0 -c conda-forge

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda/envs/Python3

  added / updated specs:
    - snorkel==0.9.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         149 KB

The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates     pkgs/main::ca-certificates-2020.1.1-0 --> conda-forge::ca-certificates-2019.11.28-hecc5488_0
  certifi                                         pkgs/main --> conda-forge
  openssl              pkgs/main::openssl-1.1.1d-h7b6447c_4 --> conda-forge::openssl-1.1.1d-h516909a_0



Downloading and Extracting Packages
certifi-2019.11.28   | 149 KB    | #

In [7]:
conda install -c conda-forge textblob

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [8]:
from snorkel.labeling import LFAnalysis
from snorkel.labeling import labeling_function
from snorkel.labeling import PandasLFApplier,LabelModel
from snorkel.preprocess import preprocessor
from textblob import TextBlob
import nltk
from itertools import repeat
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
import re

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
from bs4 import BeautifulSoup
from csv import writer
import pickle
import time
import json

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/faculty/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
#Spark 

# Spark Environment
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
import pyspark

number_cores = 4
memory_gb = 16
conf = (
    pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext.getOrCreate(conf=conf)
print(sc)

# get the context
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark) 

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

<SparkContext master=local[4] appName=pyspark-shell>
<pyspark.sql.session.SparkSession object at 0x7faa8ace8f60>


# Web Scraping

## Browse all from DVD releases page

In [10]:
main = 'https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvudu%3Bamazon_prime%3Bfandango_now&certified&sortBy=release&type=dvd-streaming-all&page='

In [11]:
# Get the movie URLs from the paginated browse endpoint
movie_url = []
start_page = 1
end_page = 1
while start_page <= end_page:
#     time.sleep(7)
    url = main + str(start_page)
    response = requests.get(url)
    if response.status_code != 200:
        print('Request error')
        break
    file = response.json()
    for i in file['results']:
        movie_url.append(i['url'])
    start_page += 1

In [12]:
print('Examples for the url:\n')
for i in range(3):
    print(movie_url[i])
print('\nNumber of movies in list: {}'.format(len(movie_url)))

Examples for the url:

/m/frozen_ii
/m/playmobil_the_movie
/m/charlies_angels_2019

Number of movies in list: 32


In [13]:
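# Split into lists of 50 movies to do the scraping.
# The bound 600 is hard-coded: if movie_url holds fewer URLs, the later chunks are empty lists.
# range(0, len(movie_url), 50) would create only the chunks that are needed.
movie_url_split = [movie_url[i:i+50] for i in range(0,600,50)]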

In [14]:
len(movie_url_split)

12
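
Because the bound 600 in the cell above is hard-coded, the result always has 12 chunks, and any chunk that starts past the end of `movie_url` is an empty list. A small helper that derives the bound from the list length avoids this; the name `chunked` is an assumption made for this sketch.

```python
def chunked(items, size):
    """Return consecutive slices of items with at most size elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# 32 URLs (as in the run above) fit into a single chunk of up to 50
example_urls = [f"/m/movie_{i}" for i in range(32)]
example_split = chunked(example_urls, 50)
```

With 120 URLs the same call would return chunks of 50, 50 and 20 elements.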

In [15]:
# Get reviews from the web
reviews = []
titles = []
ratings = []
for split in movie_url_split: # Loop through each split
#     time.sleep(7)
    for movie_path in split: # Loop through each movie URL path
        url = 'https://www.rottentomatoes.com' + movie_path
#         time.sleep(7)
        response = requests.get(url)
        # Check the request status code (break only leaves the inner loop)
        if response.status_code != 200:
            print('Request error')
            break
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Get the review elements that carry the label (fresh vs. rotten)
        fresh_rotten = soup.find_all(class_="review_quote")
        
        # Get movie title
        title = soup.find(class_="mop-ratings-wrap__title mop-ratings-wrap__title--top").getText()
        
        # Get reviews
        review = soup.find_all('blockquote')
        for i in review:
            j = str(i.contents[1])
            j = j.replace("<p>\n                    \n                        ","")
            j = j.replace("\n                    \n                </p>","")
            reviews.append(j)
            titles.append(title)
        
        # Identify labels
        for i in fresh_rotten:
            temp = str(i.findChildren()[2])
            if re.search('rotten',temp):
                ratings.append('rotten')
            else:
                ratings.append('fresh')
            

KeyboardInterrupt: 

In [None]:
# Create the DataFrame to store the scraped data
df = pd.DataFrame([titles,reviews,ratings],index = ['title','review','rating']).T

In [None]:
# Clean the data (drop duplicates, check missing values etc.)
df = df.drop_duplicates()
df = df.replace([None],np.nan)
df.info()

In [None]:
df.dropna(inplace=True)
df.head(3)

In [None]:
# Export to CSV files
# df.to_csv('web_scraping_rotten_tomatoes.csv')

# Reading & Preparing TSV file 

In [None]:
# Read TSV file
tsv_reviews = pd.read_csv('/project/reviews.tsv', sep='\t', header=0, encoding='unicode_escape')

In [None]:
tsv_reviews.head()

In [None]:
# Extract review and fresh columns
tsv_reviews = pd.DataFrame(tsv_reviews, columns = ['review', 'fresh'])

In [None]:
tsv_reviews.head()

In [None]:
tsv_reviews.isnull().sum()

In [None]:
# drop rows whose review is NaN
tsv_reviews.dropna(subset=['review'], inplace=True)

In [None]:
tsv_reviews.isnull().sum()

In [None]:
# recode fresh as '1' and rotten as '0' (cast to int later, before the train/test split)
tsv_reviews['fresh'] = tsv_reviews['fresh'].replace({'fresh':'1', 'rotten':'0'})

In [None]:
# Rename columns and take a random sample of 5000 reviews
tsv_reviews.rename(columns={'fresh':'Freshness','review':'Review'},inplace=True)
tsv_reviews = tsv_reviews.sample(5000)
tsv_reviews.info()

# Reading & Preparing CSV file 

In [None]:
# Read CSV file
csv_reviews= pd.read_csv('/project/rotten_tomatoes_reviews.csv')
csv_reviews.head()


In [None]:
# Reorder the columns to Review, Freshness and take a random sample of 5000 reviews
columns_titles = ["Review","Freshness"]
csv_reviews=csv_reviews.reindex(columns=columns_titles)
csv_reviews = csv_reviews.sample(5000)

csv_reviews.head()
csv_reviews.info()

# Reading & Preparing web-scraped data 

In [None]:
#Read Web Scraping Data
web_scraping_reviews= pd.read_csv('/project/web_scraping_rotten_tomatoes.csv')
web_scraping_reviews.head()

web_scraping_reviews = web_scraping_reviews.sample(5000)

In [None]:
# recode fresh as '1' and rotten as '0'
web_scraping_reviews['rating'] = web_scraping_reviews['rating'].replace({'fresh':'1', 'rotten':'0'})

# Rename rating to Freshness and review to Review
web_scraping_reviews.rename(columns={'rating':'Freshness', 'review':'Review'},inplace=True)
web_scraping_reviews.head()

# Extract Review and Freshness columns
web_scraping_reviews= pd.DataFrame(web_scraping_reviews, columns = ['Review', 'Freshness'])

In [None]:
web_scraping_reviews.head()

# Combining all the data together 

In [None]:
# csv_reviews.info()
# tsv_reviews.info()
# web_scraping_reviews.info()



# Concatenate the three sources into all_reviews
all_reviews=pd.concat([csv_reviews, tsv_reviews,web_scraping_reviews],axis=0, sort=False)
all_reviews.head()

In [None]:
all_reviews.shape

# Split into test and training set 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
all_reviews['Freshness'] = all_reviews['Freshness'].astype(int)

In [None]:
train, test = train_test_split(all_reviews,test_size=0.2,stratify = all_reviews['Freshness'])

In [None]:
train.info()

In [None]:
test.info()

In [None]:
train.head()

In [None]:
# Drop the labels from the training set
train = train.drop('Freshness', axis=1)

In [None]:
test['Freshness'].value_counts()

In [None]:
# From the labelled test set, extract a sample to explore which labelling functions could be written
# Not sure how big the development split can be --> take a sample of 2000 data points 

development_split = test.sample(2000,random_state=42)
development_split.head()


In [None]:
# test2000.index

In [None]:
# test1000 = pd.merge(test,test2000,how='left',on = 'Review')
# test1000 = test1000[test1000.Freshness_y.isnull()]

In [None]:
# test1000.shape

In [None]:
# test1000.drop('Freshness_y',axis=1, inplace=True)
# test1000.columns = ['Review','Freshness']
# test1000

In [None]:
# test1000.to_csv('1000_labels.csv')

In [None]:
#For finding labelling functions: 
development_split

In [None]:
development_split.count()

In [None]:
development_split.to_csv('development_split.csv')

In [None]:
# development_split = pd.read_csv('development_split.csv',index_col = 'Unnamed: 0')

In [None]:
development_split.count()

In [None]:
development_split.head()

In [None]:
# Might have to get rid of the index?

development_split[development_split['Freshness'] !=1].count()

In [None]:
development_split['Freshness'].value_counts()

## Split into positive and negative reviews

In [None]:
development_split = pd.read_csv('development_split.csv')

In [None]:
development_split_fresh = development_split[development_split['Freshness'] == 1]
development_split_rotten = development_split[development_split['Freshness'] == 0]
development_split_fresh.head()
development_split_rotten.head()

In [None]:
# fresh reviews 
development_split_fresh.info()
# rotten reviews 
development_split_rotten.info()

## Labelling Functions

## 1. Word occurrences

### Positive Reviews

In [None]:
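# Removing punctuation 
def remove_punctuation(dataframe):
    """Lower-case the Review column and strip punctuation from each review.

    Returns a list of cleaned strings. Reviews that end up empty are skipped,
    so the list can be shorter than the DataFrame.
    """
    new_words = []
    for review in dataframe.Review.str.lower():
        new_review = re.sub(r'[^\w\s]', '', review)
        if new_review != '':
            new_words.append(new_review)
    return new_words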
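
The helper above skips reviews that become empty, so its output can be shorter than its input, and it expects every review to be a string. When the cleaned text is assigned back to a column, a variant that returns exactly one string per input is safer. The sketch below shows one way to do that; `strip_punctuation` and `strip_punctuation_all` are hypothetical names, not functions from this notebook.

```python
import re

# Anything that is neither a word character nor whitespace
_PUNCTUATION = re.compile(r"[^\w\s]")


def strip_punctuation(text):
    """Lower-case text and remove punctuation; non-strings become ''."""
    if not isinstance(text, str):
        return ""  # e.g. None or NaN from a missing review
    return _PUNCTUATION.sub("", text.lower())


def strip_punctuation_all(texts):
    """Clean every entry and keep the output the same length as the input."""
    return [strip_punctuation(text) for text in texts]


cleaned = strip_punctuation_all(["Wow! It's great, isn't it?", None, "..."])
```

Keeping the length unchanged means the result can be written back with `df['Review'] = ...` without misaligning reviews and labels.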

In [None]:
#Remove punctuation from fresh reviews & turn into Series
split_fresh= pd.Series(remove_punctuation(development_split_fresh))
split_fresh.head()

In [None]:
stopWordList = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves",\
                "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their",\
                "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was",\
                "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and",\
                "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between",\
                "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off",\
                "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any",\
                "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",\
                "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [None]:
#Removing stopwords from fresh

replacements = dict(zip((fr'\b{word}\b' for word in stopWordList), repeat("")))
split_fresh.replace(replacements, regex=True, inplace=True)
split_fresh.replace({r' +': ' ', r' +\.': '.'}, regex=True, inplace=True)
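
The cell above removes stop words with one regular expression per word and then collapses the leftover spaces. The same result can be obtained by filtering tokens, as in the sketch below. It assumes the text is already lower-cased and free of punctuation; `remove_stop_words` is a name made up for this example.

```python
def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that is a stop word."""
    stop_set = set(stop_words)  # set lookup is O(1) per token
    return " ".join(tok for tok in text.split() if tok not in stop_set)


example = remove_stop_words(
    "this is a very good movie", ["this", "is", "a", "very"]
)
```

Because the tokens are re-joined with single spaces, no separate whitespace clean-up is needed afterwards.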

### Implement lemmatization

In [None]:
# Stem the words and count again
def stem_recount(df):
    """Stem each word and sum the counts of words that share a stem.

    Expects the words in the first column and their counts in the second.
    Note: LancasterStemmer is a stemmer rather than a true lemmatizer.
    """
    from nltk import LancasterStemmer
    st = LancasterStemmer()
    newdf = df.copy()
    word_col, count_col = newdf.columns[0], newdf.columns[1]
    newdf[word_col] = newdf[word_col].astype(str).map(st.stem)

    # Recount: merge rows that now share the same stem
    recounted = newdf.groupby(word_col, as_index=False)[count_col].sum()
    return recounted.sort_values(count_col, ascending=False).reset_index(drop=True)

In [None]:
common_words_fresh = split_fresh.str.split(expand=True).stack().value_counts()
common_words_fresh_df = pd.DataFrame(common_words_fresh)
common_words_fresh_df = common_words_fresh_df.rename({0:'Occurence good review'}, axis='columns')
new_common_words_fresh_df = common_words_fresh_df.reset_index()

In [None]:
stem_recount(new_common_words_fresh_df)

In [None]:
#Get most common words in positive reviews 
top_common_words_fresh = common_words_fresh_df[common_words_fresh_df['Occurence good review'] >=4]
top_common_words_fresh

### Why we do not use lemmatization

<div class="alert alert-success">
<b> Reason for not using Lemmatization </b>

<p>
    Before counting the occurrences of words in the movie reviews, we noticed that inflected forms of a word are counted separately, which biases the counts. For example, "enjoy" and "enjoyed" share the same root but are counted as two different words unless they are normalized first.
    </p> 
    
<p>
    The function "stem_recount" reduces each word to its root with NLTK's Lancaster stemmer (strictly speaking a stemming algorithm rather than lemmatization) and recounts the occurrences. However, it has the disadvantage of mis-normalizing words into completely different forms. For example, "movie" became "movy", and "like" became "lik". We judged that this disadvantage outweighed the benefit of merging inflected forms, so we decided not to use it.
    </p> 

</div>

### Negative reviews

In [None]:
#Remove punctuation from rotten & turn into Series
split_rotten= pd.Series(remove_punctuation(development_split_rotten))
split_rotten.head()

In [None]:
# Removing stopwords from negative reviews
replacements = dict(zip((fr'\b{word}\b' for word in stopWordList), repeat("")))
split_rotten.replace(replacements, regex=True, inplace=True)
split_rotten.replace({r' +': ' ', r' +\.': '.'}, regex=True, inplace=True)

In [None]:
#Get most common words in negative reviews 

common_words_rotten = split_rotten.str.split(expand=True).stack().value_counts()
common_words_rotten_df = pd.DataFrame(common_words_rotten)
common_words_rotten_df = common_words_rotten_df.rename({0:'Occurence bad review'}, axis='columns')
top_common_words_rotten = common_words_rotten_df[common_words_rotten_df['Occurence bad review'] >=4]

In [None]:
top_common_words_rotten.head()

## Comparison of good and bad reviews

<div class="alert alert-success">
We want to find out which of the frequent words in the good reviews do not appear among the frequent words in the bad reviews, and vice versa, and base labelling functions on these findings. We first need to prepare the data accordingly before we can write the labelling functions.
    </div>

In [None]:
top_fresh_words_exclusive = top_common_words_fresh.merge(top_common_words_rotten, indicator='i', how='outer', left_index=True,\
                                                         right_index=True).query('i == "left_only"').drop('i', axis=1)

top_rotten_words_exclusive = top_common_words_rotten.merge(top_common_words_fresh, indicator='i', how='outer', left_index=True,\
                                                           right_index=True).query('i == "left_only"').drop('i', axis=1)

In [None]:
top_fresh_words_exclusive.head()

In [None]:
top_rotten_words_exclusive.head()

In [None]:
#Get only positive words 
top_fresh_words_exclusive_list = top_fresh_words_exclusive['Occurence good review'].index.tolist()
top_fresh_words_exclusive_list

In [None]:
# Keep only the words that seem to make sense (hand-picked from the list above): 
top_fresh_words_exclusive = ['absolutely',
 'addition',
 'adventure',
 'affectionate',
 'amazing',
 'ambition',
 'art',
 'artist',
 'arts',
 'atmosphere',
 'attractive',
 'awards',
 'balance',
 'beautiful',
 'beautifully',
 'beauty',
 'bond',
 'bright',
 'brilliant',
 'captivating',
 'captures',
 'celebration',
 'charm',
 'charming',
 'christmas',
 'classic',
 'clever',
 'committed',
 'consistently',
 'contemporary',
 'conventional',
 'convincingly',
 'creates',
 'creating',
 'crowdpleaser',
 'cult',
 'decade',
 'decades',
 'deep',
 'deeper',
 'deeply',
 'definitely',
 'delightful',
 'delightfully',
 'depth',
 'deserves',
 'design',
 'details',
 'different',
 'diverse',
 'dramatic',
 'early',
 'elegant',
 'emotionally',
 'engaging',
 'enjoyable',
 'enjoyed',
 'equal',
 'especially',
 'exploration',
 'extraordinary',
 'extremely',
 'familiar',
 'famous',
 'fan',
 'fantastic',
 'fantasy',
 'fascinating',
 'felt',
 'filled',
 'finest',
 'frank',
 'fresh',
 'friends',
 'friendship',
 'gags',
 'gorgeous',
 'grand',
 'happy',
 'heart',
 'hilarious',
 'honest',
 'hope',
 'huge',
 'impact',
 'insightful',
 'inspiring',
 'intelligent',
 'intense',
 'intrigue',
 'joy',
 'laugh',
 'loved',
 'mature',
 'mind',
 'mystery',
 'nostalgia',
 'novel',
 'opening',
 'passion',
 'perfect',
 'performers',
 'personal',
 'pleasure',
 'poignant',
 'power',
 'powerful',
 'precisely',
 'profound',
 'project',
 'proves',
 'provide',
 'provocative',
 'psychological',
 'quality',
 'remarkable',
 'reveals',
 'rich',
 'riveting',
 'satisfying',
 'sharp',
 'simple',
 'smart',
 'smile',
 'stunning',
 'succeeds',
 'supernatural',
 'surprise',
 'surprises',
 'surprising',
 'surprisingly',
 'sweet',
 'talents',
 'thoughtful',
 'thrills',
 'touch',
 'touching',
 'tragedy',
 'tragic',
 'tribute',
 'unique',
 'universal',
 'warm',
 'watchable',
 'welcome',
 'wit',
 'witty',
 'wonderful',
 'worthwhile',
 'worthy']

top_fresh_words_exclusive

In [None]:
#Get only negative words 
top_rotten_words_exclusive_list = top_rotten_words_exclusive['Occurence bad review'].index.tolist()
top_rotten_words_exclusive_list

In [None]:
# Keep only the words that seem to make sense (hand-picked from the list above): 
top_rotten_words_exclusive = [
 'attempt',
 'awkward',
 'barely',
 'basically',
 'bizarre',
 'bland',
 'boring',
 'clumsy',
 'comedic',
 'disappointing',
 'disappointingly',
 'disappointment',
 'disaster',
 'dull',
 'effort',
 'failed',
 'fails',
 'generic',
 'irritating',
 'lacking',
 'manic',
 'missing',
 'nobody',
 'noir',
 'none',
 'painfully',
 'pointless',
 'poorly',
 'problem',
 'shallow',
 'shame',
 'sloppy',
 'slow',
 'suffers',
 'superficial',
 'try',
 'unfortunately',
 'unfunny',
 'worst']
top_rotten_words_exclusive

## Labelling Function

### 1. Word Occurrences

###  A. Good / bad exclusive words occurrences

In [None]:
from snorkel.labeling.apply.spark import SparkLFApplier

from pyspark import SparkContext 
from pyspark.sql import SQLContext 
import pandas as pd 
sqlc=SQLContext(sc) 
df=pd.read_csv('/project/development_split.csv',index_col = 'Unnamed: 0')
df_with_punctuation = df.copy()
df['Review'] = remove_punctuation(df)
development_split=sqlc.createDataFrame(df)
development_split_with_punctuation=sqlc.createDataFrame(df_with_punctuation) 

In [None]:
development_split.show(5)

In [None]:
# development_split = pd.read_csv('/project/development_split.csv')
ABSTAIN = -1
NOTFRESH = 0
FRESH = 1

@labeling_function()
def fresh(x):
    # x is a Spark Row; pad the review with spaces so the first and last words can match too
    text = " " + str(x.Review).lower() + " "
    for word in top_fresh_words_exclusive:
        if " " + word + " " in text:
            return FRESH
    return ABSTAIN

@labeling_function()
def rotten(x):
    text = " " + str(x.Review).lower() + " "
    for word in top_rotten_words_exclusive:
        if " " + word + " " in text:
            return NOTFRESH
    return ABSTAIN
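
Matching a keyword surrounded by spaces relies on punctuation having been stripped from the review beforehand. A regular expression with word boundaries (`\b`) is an alternative that also works on raw text and avoids looping over the word list for every review. The helper names below (`build_keyword_pattern`, `contains_keyword`) are assumptions made for this sketch.

```python
import re


def build_keyword_pattern(words):
    """Compile one case-insensitive pattern that matches any word as a whole word."""
    if not words:
        raise ValueError("words must not be empty")
    # Longest words first so that longer alternatives are tried before shorter ones
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def contains_keyword(text, pattern):
    """True if any keyword occurs in text as a whole word."""
    return pattern.search(text) is not None


pattern = build_keyword_pattern(["art", "charm"])
hits = [
    contains_keyword("Art at its best.", pattern),   # keyword at the start
    contains_keyword("A party of clowns", pattern),  # "art" inside "party"
    contains_keyword("Full of charm!", pattern),     # keyword before punctuation
]
```

Inside a labelling function, the pattern would be compiled once outside the function and only `pattern.search` would run per review.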

In [None]:
lfs = [fresh]
applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [None]:
sample_L

In [None]:
coverage_fresh = (sample_L != ABSTAIN).mean(axis=0)
print("fresh coverage:{:.1%}".format(coverage_fresh[0]))
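
Coverage only measures how often a labelling function votes, not how often the vote is right. Since the development split still carries the `Freshness` labels, the accuracy on the covered reviews can be checked as well. The sketch below uses plain lists to show both quantities; `coverage` and `accuracy_on_covered` are illustrative names, and the vote and gold-label lists are made up.

```python
ABSTAIN = -1


def coverage(votes):
    """Fraction of data points on which the labelling function did not abstain."""
    if not votes:
        raise ValueError("votes must not be empty")
    return sum(v != ABSTAIN for v in votes) / len(votes)


def accuracy_on_covered(votes, gold):
    """Fraction of non-abstaining votes that agree with the gold label."""
    covered = [(v, g) for v, g in zip(votes, gold) if v != ABSTAIN]
    if not covered:
        return None  # the function never voted
    return sum(v == g for v, g in covered) / len(covered)


toy_votes = [1, ABSTAIN, 1, ABSTAIN]
toy_gold = [1, 0, 0, 1]
toy_coverage = coverage(toy_votes)
toy_accuracy = accuracy_on_covered(toy_votes, toy_gold)
```

A function with high coverage but accuracy near 50% adds little signal on a balanced data set, so both numbers are worth reporting together.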

In [None]:
lfs = [rotten]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [None]:
coverage_rotten = (sample_L != ABSTAIN).mean(axis=0)
print("rotten coverage:{:.1%}".format(coverage_rotten[0]))

### B. Word 'too' occurrences

In [None]:
common_words_fresh_df[common_words_fresh_df.index == 'too']

In [None]:
common_words_rotten_df[common_words_rotten_df.index == 'too']

In [None]:
@labeling_function()
def keyword_too(x):
    # Match "too" as a whole word only, so that e.g. "cartoon" does not count
    return NOTFRESH if re.search(r'\btoo\b', str(x.Review).lower()) else ABSTAIN

In [None]:
lfs = [keyword_too]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [None]:
coverage_keyword_too = (sample_L != ABSTAIN).mean(axis=0)
print("keyword too coverage:{:.1%}".format(coverage_keyword_too[0]))

### C. Word 'far' occurrences

In [None]:
common_words_fresh_df[common_words_fresh_df.index == 'far']

In [None]:
common_words_rotten_df[common_words_rotten_df.index == 'far']

In [None]:
@labeling_function()
def keyword_far(x):
    # Match "far" as a whole word only, so that e.g. "farce" does not count
    return FRESH if re.search(r'\bfar\b', str(x.Review).lower()) else ABSTAIN

In [None]:
lfs = [keyword_far]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [None]:
coverage_keyword_far = (sample_L != ABSTAIN).mean(axis=0)
print("keyword far coverage:{:.1%}".format(coverage_keyword_far[0]))

### D. "n't" words occurrences

In [None]:
# Exploration of "n't"
# Word occurrences (note: split_fresh and split_rotten already have punctuation removed)

# Word occurrences dataframe for fresh reviews
development_split_fresh_1 = split_fresh.str.split(expand=True).stack().value_counts()
development_split_fresh_df = pd.DataFrame(development_split_fresh_1).reset_index()

# Words occurrences dataframe for rotten reviews
development_split_rotten_1 = split_rotten.str.split(expand=True).stack().value_counts()
development_split_rotten_df = pd.DataFrame(development_split_rotten_1).reset_index()

In [None]:
@labeling_function()
def t(x):
    # Flag contractions such as "isn't" or "don't"; needs the reviews with punctuation intact
    if re.search(r"n't", str(x.Review).lower()):
        return NOTFRESH
    return ABSTAIN

In [None]:
lfs = [t]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split_with_punctuation.rdd)

In [None]:
coverage_t = (sample_L != ABSTAIN).mean(axis=0)
print("n't coverage:{:.1%}".format(coverage_t[0]))

### E. Occurrences of good & bad words from external list 

<div class="alert alert-success">
We also want to look at an imported list of positive and negative words and see whether we can base labelling functions on them.
    </div>


In [None]:
#Importing good and bad words & preparing for labelling function 

In [None]:
#POSITIVE WORDS 
# Positive words taken from http://ptrckprry.com/course/ssd/data/positive-words.txt (still needs a proper citation)
positive_word = pd.read_csv('/project/positive_words.csv')

#sample 500 words 
positive_word = positive_word.sample(500)

#convert it into a list 
positive_word= positive_word['a+'].tolist()



#NEGATIVE WORDS 
# Negative words: source still to be confirmed (Hong)
negative_word = pd.read_csv('/project/negative_words.csv')

#sample 500 words 
negative_word = negative_word.sample(500)

#convert it into a list 
negative_word= negative_word['2-faces'].tolist()


In [None]:
@labeling_function()
def negative(x): 
    # Pad the review with spaces so the first and last words can match too
    text = " " + str(x.Review).lower() + " "
    for word in negative_word:
        if " " + str(word) + " " in text:
            return NOTFRESH 
    return ABSTAIN 


@labeling_function()
def positive(x):
    text = " " + str(x.Review).lower() + " "
    for word in positive_word:
        if " " + str(word) + " " in text:
            return FRESH 
    return ABSTAIN 

In [None]:
lfs = [negative]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [None]:
coverage_negative = (sample_L != ABSTAIN).mean(axis=0)
print("negative words coverage:{:.1%}".format(coverage_negative[0]))

In [None]:
lfs = [positive]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [None]:
coverage_positive = (sample_L != ABSTAIN).mean(axis=0)
print("positive words coverage:{:.1%}".format(coverage_positive[0]))

## 2. Punctuation occurrences

In [None]:
# Turn review column into Series
development_split_fresh_series = pd.Series(development_split_fresh.Review)
development_split_rotten_series = pd.Series(development_split_rotten.Review)

In [None]:
# Positive reviews
# Split reviews into word
fresh_split = pd.Series(development_split_fresh_series.str.split(expand=True).stack())
fresh_words = [i for i in fresh_split]

# Split words into characters
def split_str():
    return [list(ch) for ch in fresh_words]
fresh_split_words = pd.Series(split_str())

In [None]:
# Negative reviews
# Split reviews into word
rotten_split = pd.Series(development_split_rotten_series.str.split(expand=True).stack())
rotten_words = [i for i in rotten_split]

# Split words into characters
def split_str():
    return [list(ch) for ch in rotten_words]
rotten_split_words = pd.Series(split_str())

In [None]:
# Turn into a flattened list
fresh_flattened_list = [y for x in fresh_split_words for y in x]
rotten_flattened_list = [y for x in rotten_split_words for y in x]

# Count the occurrences of each character
# Positive reviews
fresh_split_characters = pd.Series(fresh_flattened_list).value_counts()
fresh_split_characters = pd.DataFrame(fresh_split_characters).reset_index()

# Negative reviews
rotten_split_characters = pd.Series(rotten_flattened_list).value_counts()
rotten_split_characters = pd.DataFrame(rotten_split_characters).reset_index()
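
The cells above split every review into words and then into characters before counting. If only punctuation is of interest, the same counts can be collected directly with a `Counter`. This is a compact sketch with made-up reviews; `punctuation_counts` is a hypothetical helper name.

```python
import string
from collections import Counter


def punctuation_counts(reviews):
    """Count how often each ASCII punctuation character occurs across reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(ch for ch in text if ch in string.punctuation)
    return counts


toy_counts = punctuation_counts(["Wow!!", "Why? Really?"])
```

Comparing `counts['?']` and `counts['!']` between the fresh and rotten reviews gives the same information as the lookups in the cells that follow.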

### A. Question mark occurrences

In [None]:
# Count the number of occurrences of '?' in fresh reviews
fresh_split_characters[fresh_split_characters['index'] == '?']

In [None]:
# Count the number of occurrences of '?' in rotten reviews
rotten_split_characters[rotten_split_characters['index'] == '?']

In [None]:
# Show a few rotten reviews that contain a question mark
list_with_question_mark = [
    review for review in development_split_rotten.Review if '?' in review
]

print(list_with_question_mark[:3])

In [None]:
# Vote NOTFRESH when the row contains a question mark, otherwise abstain
@labeling_function()
def question_mark(x):
    return NOTFRESH if '?' in str(x) else ABSTAIN

In [None]:
lfs = [question_mark]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split_with_punctuation.rdd)

In [None]:
coverage_question_mark = (sample_L != ABSTAIN).mean(axis=0)
print("question mark coverage:{:.1%}".format(coverage_question_mark[0]))

### B. Exclamation mark occurrences

In [None]:
# Count the number of occurrences of '!' in fresh reviews
fresh_split_characters[fresh_split_characters['index'] == '!']

In [None]:
# Count the number of occurrences of '!' in rotten reviews
rotten_split_characters[rotten_split_characters['index'] == '!']

In [None]:
# Vote FRESH when the row contains an exclamation mark, otherwise abstain
@labeling_function()
def exclamation_mark(x):
    return FRESH if '!' in str(x) else ABSTAIN

In [None]:
lfs = [exclamation_mark]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split_with_punctuation.rdd)

In [None]:
coverage_exclamation_mark = (sample_L != ABSTAIN).mean(axis=0)
print("exclamation mark coverage:{:.1%}".format(coverage_exclamation_mark[0]))

## 3. Combining labelling functions

<div class="alert alert-success">
Next, we combine the labelling functions and apply them to the training set. Because the punctuation-based labelling functions have very low coverage, we leave them out of the combined set.
    </div>

In [None]:
lfs = [fresh,
       rotten,
       keyword_too,
       keyword_far,
       t,
       negative,
       positive]

applier = SparkLFApplier(lfs)
sample_L = applier.apply(development_split.rdd)

In [None]:
# Copy the training set and strip punctuation from the reviews
train_prepared = train.copy()
train_prepared['Review'] = remove_punctuation(train_prepared)

In [None]:
from snorkel.labeling import LFAnalysis
LFAnalysis(L=sample_L, lfs = lfs).lf_summary()

In [None]:
lfs = [fresh,
       rotten,
       keyword_too,
       keyword_far,
       t,
       negative,
       positive]

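applier = SparkLFApplier(lfs)

# Convert the pandas DataFrames to Spark DataFrames
train_sdf = sqlc.createDataFrame(train_prepared)
test_sdf = sqlc.createDataFrame(test)

# Apply the same labelling functions to both sets to obtain the label matrices.
# Note: unlike train_prepared, test is not passed through remove_punctuation here.
L_train = applier.apply(train_sdf.rdd)
L_test = applier.apply(test_sdf.rdd)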


In [None]:
type(L_train)

In [None]:
L_train

In [None]:
from snorkel.labeling import MajorityLabelVoter

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)
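
To make the majority vote concrete, here is a minimal NumPy sketch of the idea rather than Snorkel's implementation. The label constants are assumptions matching the usual convention (`ABSTAIN = -1`, `NOTFRESH = 0`, `FRESH = 1`), and `majority_vote` is a hypothetical helper. It abstains on ties, whereas the evaluation cell in this notebook passes `tie_break_policy="random"`.

```python
import numpy as np

ABSTAIN, NOTFRESH, FRESH = -1, 0, 1  # assumed label constants

def majority_vote(L, cardinality=2):
    """Return the most voted label per row; ABSTAIN on ties or no votes."""
    preds = np.full(L.shape[0], ABSTAIN)
    for i, row in enumerate(L):
        # Count votes per class, ignoring abstentions
        votes = np.bincount(row[row != ABSTAIN], minlength=cardinality)
        winners = np.flatnonzero(votes == votes.max())
        if votes.sum() > 0 and len(winners) == 1:
            preds[i] = winners[0]
    return preds

# Toy label matrix: 4 reviews x 3 labelling functions
L = np.array([
    [FRESH,   FRESH,    NOTFRESH],  # clear majority
    [ABSTAIN, NOTFRESH, ABSTAIN],   # single vote
    [FRESH,   NOTFRESH, ABSTAIN],   # tie
    [ABSTAIN, ABSTAIN,  ABSTAIN],   # no votes
])
preds = majority_vote(L)
print(preds)
```

Rows where every labelling function abstains stay unlabelled, so low-coverage functions leave many predictions at `ABSTAIN`.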

In [None]:
preds_train.shape

In [None]:
preds_train

In [None]:
train2 = train.copy()
train2['predicted_train'] = preds_train
train2.to_csv('12000_predicted_labels.csv')

In [None]:
# Check the installed networkx version (expected: 2.3)
import networkx as nx
nx.__version__

In [None]:
# Fit a LabelModel, which weights the labelling functions instead of counting every vote equally
from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

In [None]:
# Ground-truth labels for the test set as a NumPy array
Y_test = test['Freshness'].values

In [None]:
majority_acc = majority_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

## Sentiment Analysis

In [None]:
from snorkel.preprocess import preprocessor
from textblob import TextBlob

# Attach TextBlob polarity and subjectivity scores to each data point.
# Note: Spark Row objects are read-only, so setting attributes here may fail under
# SparkLFApplier; the cells below compute the scores inside the labelling functions instead.
@preprocessor(memoize=True)
def textblob_sentiment(x):
    # The review text lives in the 'Review' column of development_split.csv
    scores = TextBlob(x.Review)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x

# Vote FRESH for strongly positive reviews (polarity ranges from -1 to 1)
@labeling_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    return FRESH if x.polarity > 0.8 else ABSTAIN

# Vote FRESH for subjective reviews (subjectivity ranges from 0 to 1).
# The preprocessor is memoized, so the scores are not recomputed here.
@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    return FRESH if x.subjectivity >= 0.6 else ABSTAIN

In [None]:
def getSentiment(text):
    # Run TextBlob once and return both scores in a plain dict
    sentiment = TextBlob(text).sentiment
    return {
        "polarity": sentiment.polarity,
        "subjectivity": sentiment.subjectivity,
    }

In [None]:
# Compute the sentiment scores inside each labelling function instead of using the preprocessor.
# The review text lives in the 'Review' column of development_split.csv.

# Vote FRESH for strongly positive reviews (polarity ranges from -1 to 1)
@labeling_function()
def textblob_polarity(x):
    scores = getSentiment(x.Review)
    return FRESH if scores["polarity"] > 0.8 else ABSTAIN

# Vote FRESH for subjective reviews (subjectivity ranges from 0 to 1)
@labeling_function()
def textblob_subjectivity(x):
    scores = getSentiment(x.Review)
    return FRESH if scores["subjectivity"] >= 0.5 else ABSTAIN

In [None]:
lfs = [
    textblob_polarity,
    textblob_subjectivity,
]

applier = SparkLFApplier(lfs)

# Reload the development split and apply the sentiment labelling functions
development_split = sqlContext.read.csv('/project/development_split.csv', header=True)
sample_L = applier.apply(development_split.rdd)



In [None]:
LFAnalysis(L=sample_L, lfs=lfs).lf_summary()
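
The summary table reports coverage, overlaps and conflicts for each labelling function. A minimal NumPy sketch of how these three statistics can be computed from a label matrix, assuming `ABSTAIN = -1`; `lf_stats` is a hypothetical helper, not part of Snorkel:

```python
import numpy as np

ABSTAIN = -1  # assumption: same value as the notebook's ABSTAIN constant

def lf_stats(L):
    """Per-LF coverage, overlaps and conflicts for a label matrix L."""
    labeled = L != ABSTAIN
    n_votes = labeled.sum(axis=1, keepdims=True)
    # Coverage: fraction of rows the LF labels
    coverage = labeled.mean(axis=0)
    # Overlaps: rows the LF labels together with at least one other LF
    overlaps = (labeled & (n_votes > 1)).mean(axis=0)
    # Conflicts: rows the LF labels where another LF gives a different label
    conflicts = np.zeros(L.shape[1])
    for j in range(L.shape[1]):
        others = np.delete(L, j, axis=1)
        disagree = (others != ABSTAIN) & (others != L[:, [j]])
        conflicts[j] = (labeled[:, j] & disagree.any(axis=1)).mean()
    return coverage, overlaps, conflicts

# Toy label matrix: 4 reviews x 2 labelling functions
L = np.array([
    [ 1,  1],
    [ 1,  0],
    [-1,  1],
    [-1, -1],
])
coverage, overlaps, conflicts = lf_stats(L)
print(coverage, overlaps, conflicts)
```

A function with high coverage but many conflicts is a candidate for a stricter threshold, which is one way to read the table when tuning the sentiment cut-offs above.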

# Classifier

In [None]:
import pandas as pd
import numpy as np
import nltk
import string

# Spark Environment
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

import pyspark

number_cores = 4
memory_gb = 16
conf = (
    pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext.getOrCreate(conf=conf)
print(sc)

# get the context
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark) 

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [None]:
# Download files
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

!pip install langid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import preproc as pp

# Register all the functions in Preproc with Spark Context
check_lang_udf = udf(pp.check_lang, StringType())
remove_stops_udf = udf(pp.remove_stops, StringType())
remove_features_udf = udf(pp.remove_features, StringType())
tag_and_remove_udf = udf(pp.tag_and_remove, StringType())
lemmatize_udf = udf(pp.lemmatize, StringType())
check_blanks_udf = udf(pp.check_blanks, StringType())

In [None]:
from pyspark.sql.types import IntegerType

# Read the data (Spark)
review_df = sqlContext.read.csv('/project/development_split.csv', header=True)

# Rename Column
review_df = review_df.withColumnRenamed('Review','text')
review_df = review_df.withColumnRenamed('Freshness','label')
review_df = review_df.withColumnRenamed('_c0','index')

# Change data type to Integer
review_df = review_df.withColumn("label", review_df["label"].cast(IntegerType()))

# Show df information
review_df.show()
review_df.printSchema()
review_df.count()

In [None]:
# remove stop words to reduce dimensionality
review_df = review_df.withColumn("text", remove_stops_udf(review_df["text"]))

# remove other non essential words
review_df = review_df.withColumn("text", remove_features_udf(review_df["text"]))

# tag the words remaining and keep only Nouns, Verbs and Adjectives
review_df = review_df.withColumn("text", tag_and_remove_udf(review_df["text"]))

# lemmatization of remaining words to reduce dimensionality & boost measures
review_df = review_df.withColumn("text", lemmatize_udf(review_df["text"]))

review_df.show()

In [None]:
#Specify Training and Test data
training_df = review_df
test_df = sqlContext.read.csv('/project/1000_labels.csv', header=True)

# Rename Column
test_df = test_df.withColumnRenamed('Review','text')
test_df = test_df.withColumnRenamed('Freshness','label')
test_df = test_df.withColumnRenamed('_c0','index')

# Change data type to Integer
test_df = test_df.withColumn("label", test_df["label"].cast(IntegerType()))


test_df.show()
test_df.printSchema()
test_df.count()

In [None]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator

# Configure an ML pipeline with four stages: tokenizer, hashingTF, idf, and nb.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
# Write the TF-IDF vectors to "features", the column NaiveBayes reads by default;
# otherwise the classifier trains on raw term counts and the IDF stage is unused
idf = IDF(minDocFreq=3, inputCol="rawFeatures", outputCol="features")
nb = NaiveBayes()
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, nb])


paramGrid = ParamGridBuilder().addGrid(nb.smoothing, [0.0, 1.0]).build()


cv = CrossValidator(estimator=pipeline, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=MulticlassClassificationEvaluator(), 
                    numFolds=4)

cvModel = cv.fit(training_df)

result = cvModel.transform(test_df)
prediction_df = result.select("text", "label", "prediction")
prediction_df.show()
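
The feature stages of the pipeline turn each review into a fixed-length TF-IDF vector. A minimal pure-Python sketch of the idea, under stated assumptions: it hashes tokens with `zlib.crc32` into 16 buckets for readability (Spark uses a different hash function and far more buckets by default), and it uses the smoothed IDF formula log((m + 1) / (df + 1)) from Spark's documentation. The `docs` list and helper name are made up for illustration.

```python
import math
import zlib

NUM_FEATURES = 16  # illustrative only; much smaller than Spark's default

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to a fixed-size count vector via the hashing trick."""
    vec = [0] * num_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return vec

# Hypothetical tokenised reviews
docs = [["great", "film"], ["boring", "film"], ["great", "cast"]]
tf = [hashing_tf(d) for d in docs]

# Document frequency per bucket, then smoothed inverse document frequency
m = len(docs)
df = [sum(1 for vec in tf if vec[j] > 0) for j in range(NUM_FEATURES)]
idf = [math.log((m + 1) / (d + 1)) for d in df]

# TF-IDF: term counts scaled by how rare each bucket is across documents
tfidf = [[t * w for t, w in zip(vec, idf)] for vec in tf]
print(tfidf[0])
```

Buckets that appear in every document get a weight of zero, so very common words contribute nothing to the classifier.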

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Evaluate the accuracy on the test set
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(result)
print("Test accuracy: {:.1%}".format(accuracy))