# INFO206B Fall 2022 Assignment 3

Sentiment analysis uses natural language processing, text analysis, and other methods to systematically identify, extract, quantify, and study affective information in many different contexts. In this assignment, you will perform a simple sentiment analysis on different texts, ranging from tweets to tomes. In the process, you will learn about the design, implementation, and performance evaluation of search algorithms and data structures.

## Part 1. Tweets (5 points)

You are given a [list of famous (or infamous) tweets](https://people.ischool.berkeley.edu/~chuang/i206/b3/tweets.txt). from the colored history of Twitter. Your task is to compute the sentiment of each tweet based on the sentiment scores of the words in the tweet. For this part of the assignment, the sentiment of a tweet is simply the sum of the sentiment scores for each word in the tweet.

You will use a hand-coded [sentiment lexicon developed by Finn Årup Nielsen](https://github.com/fnielsen/afinn). that contains a list of ~2,400 English words with sentiment scores ranging from -5 (most negative) to 5 (most positive). Your program will read the file [AFINN-111.txt](https://people.ischool.berkeley.edu/~chuang/i206/b3/AFINN-111.txt) into a dictionary data structure. (Note that the file is tab delimited, so you will want to use the `“\t”` argument for your split method.)

For each of the ten tweets, you should split the tweet into its component words, removing any whitespaces and punctuations, and converting the words to lowercase. Then, you look up the sentiment score for each word in the dictionary. If a word is not found, its score is zero. The sentiment score of the tweet is the sum of the scores of the individual words. Print the tweet and its sentiment score.

In [11]:
#strip word of punctuation and convert to all lower-case
def stripWord( w ):
    w = w.replace( ".", "" )
    w = w.replace( ",", "" )
    w = w.replace( ";", "" )
    w = w.replace( ":", "" )
    w = w.replace( "'", "" )
    w = w.replace( "&", "" )
    w = w.replace( "!", "" )
    w = w.replace( "?", "" )
    w = w.replace( "\"", "" )
    w = w.replace( "\n", "" )
    w = w.lower()
    return( w )

In [12]:
# make the lexicon
# open the file
lexicon_contents = open("AFINN-111.txt", "r")
# turn it into a list
file_lines = lexicon_contents.readlines()

# trace trace trace
#print(file_lines)

# start the dictionary
lex_dict = {}

# strip the lexicon into list of list 
for i in range(len(file_lines)):
    file_lines[i] = file_lines[i].strip().split("\t")
    # add the keys/values to the dictionary 
    lex_dict[file_lines[i][0]] = int(file_lines[i][1])

#print(file_lines)
#print(lex_dict)

# close the file
lexicon_contents.close()

# define the function
def tweet_sentiment(in_file):


    ### SENTIMENT DICTIONARY IS MADE

    # Now it's time for reading the actual file of tweets
    tweet_contents = open(in_file, "r")
    tweet_lines = tweet_contents.readlines()

    # loop over contents
    for i in range(len(tweet_lines)):
        # strip the newline
        tweet_lines[i] = tweet_lines[i].strip()
        # print out the tweet
        print("Tweet", i, ":", tweet_lines[i])
        # now we strip into component words
        tweet_lines[i] = tweet_lines[i].split()
        
        
        #define the sentiment
        sentiment_max = 0

        # now we need to add up the sentiment of all words
        for j in range(len(tweet_lines[i])):
            # strip word it
            the_word = stripWord(tweet_lines[i][j])
            
            # check to see if it's in there
            if the_word in lex_dict:
                # if it is, add to the value
                sentiment_max += lex_dict[the_word]
                #print(the_word, lex_dict[the_word])
        #print(tweet_lines[i])
        # print it out!
        print("Tweet sentiment:", sentiment_max, "\n")
    tweet_contents.close()

## MAIN 
tweet_sentiment("tweets.txt")

Tweet 0 : just setting up my twttr
Tweet sentiment: 0 

Tweet 1 : there's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy.
Tweet sentiment: -2 

Tweet 2 : Are you ready to celebrate? Well, get ready: We have ICE!!!!! Yes, ICE, *WATER ICE* on Mars! woot!!! Best day ever!!
Tweet sentiment: 7 

Tweet 3 : Arrested
Tweet sentiment: -3 

Tweet 4 : HI TWITTERS . THANK YOU FOR A WARM WELCOME. FEELING REALLY 21ST CENTURY .
Tweet sentiment: 6 

Tweet 5 : Hello Twitterverse! We r now LIVE tweeting from the International Space Station -- the 1st live tweet from Space! :) More soon, send your ?s
Tweet sentiment: 0 

Tweet 6 : OK, What The Hell Is "Weird Twitter"?
Tweet sentiment: -6 

Tweet 7 : Please retweet this to spread awareness for retweets.
Tweet sentiment: 1 

Tweet 8 : If only Bradley's arm was longer. Best photo ever. #oscars
Tweet sentiment: 3 

Tweet 9 : admiring my award winning masterpiece -- super stunning roflcopter tweet ftw woohoo!
Tweet sentiment: 31 



## Part 2. Tomes (9 points)

Moving from tweets to tomes, we want to evaluate the run-time efficiency of different data structures and search algorithms for supporting sentiment analysis as we scale up the size of the input files.

You will measure and compare the run-times of three different search strategies on texts of different sizes.

Python's time package provides a timestamp function that you can use:

```
import time
tstart = time.time()
\# the main loop of your code goes here
tstop = time.time()
elapsed_time = tstop - tstart
```

 

Strategy 1 – dictionary lookup: For each word in a tome, look up its sentiment score from the dictionary you constructed in Part 1, and sum up the scores for all the words in the tome. Divide the sum by the number of words to obtain the normalized sentiment score of the tome. Record the elapsed time for processing all the words in your tome. Do not include the time for reading in the text file and constructing the dictionary. Only include your main loop that performs the dictionary lookups for the words.

Strategy 2 – linear search: First, construct a new sentiment lexicon using two lists. We will take advantage of the fact that the AFINN file's word entries are already sorted alphabetically. The first list contains the word entries, while the second list contains the word's corresponding sentiment scores. Now, for each word in your tome, perform a linear search for the word in the first list. If and when the word is found, use the list index to look up the word's score in the second list. If you reach the end of the first list and cannot find the word, then the word's score is zero. Once again, sum up the scores for all the words, then compute the normalized sentiment score for the tome. Record the elapsed time for processing all the words in your tome.

Strategy 3 – binary search: This strategy is basically the same as Strategy 2, and you should use the same two lists from above. However, for each word in your tome, you perform a binary search instead of a linear search. You can re-use the binary search function that you wrote for Assignment 2. (While your Assignment 2 binary search function was written to search for numbers in a list, it should work for searching for text strings with little or no modification, since the AFINN word entries are already sorted alphabetically.)

Write a function for each of the three strategies. For each function, return (i) the number of words in the tome, (ii) the elapsed time, and (ii) the normalized sentiment score of the tome.

Run your functions for tomes of different sizes. [Project Gutenberg](https://www.gutenberg.org/) is a good source of long texts:

- [The Complete Works of William Shakespeare](https://people.ischool.berkeley.edu/~chuang/i206/b3/shakespeare.txt) (~900k words)

- [Les Misérables](https://people.ischool.berkeley.edu/~chuang/i206/b3/les-miserables.txt) (~500k words)
- [The Odyssey](https://people.ischool.berkeley.edu/~chuang/i206/b3/odyssey.txt) (~100k words)
- [Alice’s Adventure in Wonderland](https://people.ischool.berkeley.edu/~chuang/i206/b3/alice.txt)] (~10k words) 

In addition to these four tomes, choose a few more of your favorite books.

To wrap up your analysis, produce these three outputs:

Report the normalized sentiment scores of all the tomes you have analyzed.
Use `matplotlib` to generate a graph that plots run-time (in the y-axis) versus tome length in words (x-axis), using Red, Green, and Blue for strategies 1, 2, and 3 respectively.
Interpret your results in a few sentences, e.g., how do you interpret the normalized sentiment scores and the graph of run-time vs. tome length for the three strategies.

In [13]:
import time

def dict_sent(in_file):
    file_contents = open(in_file, "r")
    file_words = file_contents.read().split()
    file_contents.close()


    sentiment = 0
    for i in range(len(file_words)):
        file_words[i] = stripWord(file_words[i].strip())
        if file_words[i] in lex_dict:
            sentiment += lex_dict[file_words[i]]
            
    #print(file_words)
    
    #print(sentiment)

    return len(file_words), 

    

dict_sent("dracula.txt")


539
