# <center>Assignment 4: Text Processing</center>

Requirements:

1. Define a function "**tokenize**" as follows:
   - takes a string as an input
   - converts the string into lowercase
   - tokenizes the lowercased string into tokens. Each token has at least two characters. A token **only contains letters (i.e. a-z or A-Z), "-" (hyphen), or "_" (underscore)**. Moreover, **a token cannot start or end with "-" or "_"**.
   - removes stop words from the tokens (use the English stop word list from NLTK)
   - returns the resulting token list as the output
   
2. Define a function "**sentiment_analysis**" as follows:
   - takes a string, a list of positive words, and a list of negative words as inputs. Assume the lists are read from positive-words.txt and negative-words.txt outside of this function.
   - tokenizes the string using the NLTK word tokenizer
   - counts positive words and negative words in the tokens using the positive/negative words lists. The final positive/negative words are defined as follows:
     - Positive words:
       * a positive word not preceded by a negation word (i.e. not, n't, no, cannot, neither, nor, too)
       * a negative word preceded by a negation word
     - Negative words:
       * a negative word not preceded by a negation word
       * a positive word preceded by a negation word
   - determines the sentiment of the string as follows:
     - 2: number of positive words > number of negative words
     - 1: number of positive words <= number of negative words
   - returns the sentiment 
    
3. Define a function called **performance_evaluate** to evaluate the accuracy of the sentiment analysis in (2) as follows: 
   - takes an input file ("amazon_review_300.csv"), a list of positive words, and a list of negative words as inputs. The input file has a list of reviews in the format of (label, title, review). Use label (either '2' or '1') and review columns (i.e. columns 1 and 3 only) here.
   - reads the input file to get reviews as a list of (label, review) tuples
   - for each review, predicts its sentiment using the function defined in (2), and compares the prediction with its label
   - returns the accuracy as the number of correct sentiment predictions divided by the total number of reviews
    
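The token rule in requirement 1 can also be written as a single regular expression. The block below is a minimal sketch, not the assignment's reference solution: `tokenize_sketch`, `TOKEN_PATTERN`, and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop word set stands in for NLTK's English list so the sketch runs without any corpus download. It assumes that a string such as `abc_` contributes the token `abc` rather than being discarded, and that a mixed run such as `dvr89w` contributes nothing.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list (assumption:
# kept tiny so the sketch needs no corpus download).
SAMPLE_STOP_WORDS = {"this", "is", "a", "the", "and"}

# A letter, then any mix of letters, hyphens, or underscores, then a letter:
# at least two characters, never starting or ending with "-" or "_".
# The lookarounds stop the pattern from carving a token out of a run that
# also contains digits, such as "dvr89w".
TOKEN_PATTERN = re.compile(r"(?<![a-z0-9])[a-z][a-z_-]*[a-z](?![a-z0-9])")


def tokenize_sketch(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, extract rule-conforming tokens, drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stop_words]


sample = "This is a breath-taking ambitious movie; test text: abc_dcd abc_ dvr89w, abc-dcd -abc"
print(tokenize_sketch(sample))
```

Encoding the start and end constraints in the pattern itself avoids a separate stripping pass, at the cost of a denser expression.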

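The negation rule in requirement 2 can be illustrated on a list that is already tokenized. This is a minimal sketch under stated assumptions: `count_sentiment_words` and `sentiment_from_counts` are hypothetical helper names, the word lists are toy examples rather than the contents of positive-words.txt and negative-words.txt, and the tokens are assumed to still contain negation words (that is, stop words have not been removed).

```python
# Negation words listed in the requirement.
NEGATIONS = {"not", "n't", "no", "cannot", "neither", "nor", "too"}


def count_sentiment_words(tokens, positive_words, negative_words):
    """Return (positive_count, negative_count), flipping polarity after a negation."""
    positive_words = set(positive_words)
    negative_words = set(negative_words)
    pos = neg = 0
    for i, token in enumerate(tokens):
        # Only the immediately preceding token is checked for negation.
        negated = i > 0 and tokens[i - 1] in NEGATIONS
        if token in positive_words:
            counts_as_positive = not negated
        elif token in negative_words:
            counts_as_positive = negated
        else:
            continue  # not a sentiment word
        if counts_as_positive:
            pos += 1
        else:
            neg += 1
    return pos, neg


def sentiment_from_counts(pos, neg):
    """2 when positives outnumber negatives, otherwise 1 (ties included)."""
    return 2 if pos > neg else 1


pos, neg = count_sentiment_words(["this", "is", "not", "good"], ["good"], ["bad"])
print(pos, neg, sentiment_from_counts(pos, neg))
```

Converting the word lists to sets keeps each membership test constant-time, which matters when the lists hold thousands of entries.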
In [22]:
# Manank Valand - 10429101
# references: lecture notes - Natural Language Processing I.ipynb, Regular_Expression.ipynb and Python_II.ipynb (for csv)
#
# The full output takes 2-3 seconds to appear, because the program reads and
# processes the whole review file.
#
# Put all files required for this assignment in the same directory as this program,
# or change the path values in the driver program (__main__).


import string
import nltk
import re
import csv
from nltk.corpus import stopwords

def tokenize(text):
    stop_words = set(stopwords.words('english'))
    # lowercase first so the stop word comparison below is case-insensitive
    tokens = nltk.word_tokenize(text.lower())
    # keep only tokens made entirely of letters, hyphens, or underscores
    tokens = [token for token in tokens if re.fullmatch(r"[a-z_-]+", token)]
    # strip punctuation (including "-" and "_") from the beginning and end of each token
    tokens = [token.strip(string.punctuation) for token in tokens]
    # drop tokens shorter than two characters, then drop stop words
    tokens = [token for token in tokens if len(token) > 1]
    tokens = [token for token in tokens if token not in stop_words]

    return tokens

def sentiment_analysis(text, positive_words, negative_words):

    negations = ['not', 'too', 'n\'t', 'no', 'cannot', 'neither', 'nor']
    # NOTE: tokenize() removes stop words, and NLTK's English stop word list
    # includes 'not', 'no', 'nor', and 'too', so those negations never reach
    # the check below. The requirement asks for nltk.word_tokenize here.
    tokens = tokenize(text)
    positive_tokens = []
    negative_tokens = []
    for idx, token in enumerate(tokens):
        # a sentiment word flips polarity when the previous token is a negation
        negated = idx > 0 and tokens[idx-1] in negations
        if token in positive_words:
            if negated:
                negative_tokens.append(token)
            else:
                positive_tokens.append(token)
        elif token in negative_words:
            if negated:
                positive_tokens.append(token)
            else:
                negative_tokens.append(token)
    # uncomment the two lines below to inspect the words matched in the string
    #print("positive tokens:",positive_tokens)
    #print("negative tokens:",negative_tokens)

    # 2 = positive, 1 = negative (ties count as negative)
    if len(positive_tokens) > len(negative_tokens):
        sentiment = 2
    else:
        sentiment = 1
    return sentiment


def performance_evaluate(input_file, positive_words, negative_words):

    # read (label, review) pairs; the title in the second column is not used
    with open(input_file, "r", newline="") as f:
        reader = csv.reader(f, delimiter=',')
        rows = [(row[0], row[2]) for row in reader]

    # count the reviews whose predicted sentiment matches the label
    cnt = 0
    for label, review in rows:
        if int(label) == sentiment_analysis(review, positive_words, negative_words):
            cnt += 1

    accuracy = cnt/len(rows)
    return accuracy


In [23]:
if __name__ == "__main__":  
    
    text="This is a breath-taking ambitious movie; test text: abc_dcd abc_ dvr89w, abc-dcd -abc"

    tokens=tokenize(text)
    print("tokens:")
    print(tokens)
    
    
    with open("positive-words.txt",'r') as f:
        positive_words=[line.strip() for line in f]
        
    with open("negative-words.txt",'r') as f:
        negative_words=[line.strip() for line in f]
    #print(positive_words)  
    print("\nsentiment")
    sentiment=sentiment_analysis(text, positive_words, negative_words)
    print(sentiment)
    
    accuracy=performance_evaluate("amazon_review_300.csv", positive_words, negative_words)
    print("\naccuracy")
    print(accuracy)

tokens:
['breath-taking', 'ambitious', 'movie', 'test', 'text', 'abc_dcd', 'abc', 'abc-dcd', 'abc']

sentiment
2

accuracy
0.71

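The accuracy reported above is the fraction of reviews whose predicted sentiment matches the label. The block below is a small, self-contained sketch of that calculation on an in-memory CSV. The sample rows, `accuracy_from_rows`, and `toy_predict` are all hypothetical and stand in for amazon_review_300.csv and `sentiment_analysis`, so the number it prints says nothing about the real data set.

```python
import csv
import io


def accuracy_from_rows(rows, predict):
    """rows: iterable of (label, review); predict: callable mapping a review to 1 or 2."""
    rows = list(rows)
    if not rows:
        return 0.0  # avoid dividing by zero on an empty file
    correct = sum(1 for label, review in rows if int(label) == predict(review))
    return correct / len(rows)


# Made-up rows in the (label, title, review) layout described in requirement 3.
sample_csv = (
    '2,"Great","good product"\n'
    '1,"Bad","bad product, broke"\n'
    '2,"Ok","not what I wanted"\n'
)
reader = csv.reader(io.StringIO(sample_csv))
rows = [(row[0], row[2]) for row in reader]  # keep columns 1 and 3 only


def toy_predict(review):
    """Placeholder classifier (assumption): positive only if 'good' appears."""
    return 2 if "good" in review else 1


print(accuracy_from_rows(rows, toy_predict))  # two of the three toy rows match
```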