# Homework 1 – Sentiment Analysis on X (Twitter) Data

#### Student Name: Parneet Kaur
#### OMIS 114
#### Professor Wilson Lin

***

The following cell makes sure that all of the outputs of a cell are printed.

In [249]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#conda install -c conda-forge ipympl

In [250]:
%autosave 1000

Autosaving every 1000 seconds


<b> Step 1: </b> Loading the Data from Each Given File 

In [251]:
# `with open()` automatically closes the file after reading
# encoding='utf-8' lets Python decode the special characters in the tweets;
# errors='ignore' skips any bytes that cannot be decoded
with open('Trump.txt', 'r', encoding='utf-8', errors='ignore') as fhand:
    tweets = fhand.read()

with open('negative.txt', 'r') as f1:
    negativeContents = f1.read()
    # Notice: each word in the file is on its own line, so split on newlines
    negative_words = negativeContents.split('\n')

with open('positive.txt', 'r') as f2:
    positiveContents = f2.read()
    # Notice: each word in the file is on its own line, so split on newlines
    positive_words = positiveContents.split('\n')

with open('stopwords.txt', 'r') as f3:
    stopwordsContents = f3.read()
    # Notice: each word in the file is on its own line, so split on newlines
    stopwords_words = stopwordsContents.split('\n')
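
One detail of the loading step is worth checking: splitting on `'\n'` keeps an empty string for every blank or trailing line. The sketch below is a minimal illustration, assuming one word per line; `load_word_set` and `sample_contents` are hypothetical names and made-up data, not part of the assignment files.

```python
def load_word_set(contents):
    """Turn one-word-per-line text into a set of lowercase words, skipping blanks."""
    return {line.strip().lower() for line in contents.splitlines() if line.strip()}

# Made-up stand-in for the text read from a lexicon file (not the real file)
sample_contents = "good\nGreat\n\nhappy\n"

naive_list = sample_contents.split('\n')   # keeps '' for blank and trailing lines
word_set = load_word_set(sample_contents)  # blanks removed, case normalized

print(naive_list)
print(sorted(word_set))
```

A set also gives exact, fast membership tests, which matters when every token in the tweets is looked up.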

<b> Step 2: </b> Data Cleaning Using Function

In [489]:
import re

def cleanFile(tweets): 
    # Remove non-ASCII characters 
    cleanedTweets = tweets.encode('ascii', 'ignore').decode('ascii') 
    cleanedTweets = ' '.join(cleanedTweets.split())
    
    # Remove byte-literal prefixes (b' or b") and retweet markers (rt)
    # Done manually because each tweet in the file starts with the prefix b' or b"
    # b' - Matches the character b followed by a single quote
    # | - Regex alternation (matches the pattern on either side)
    # b\" - Matches the character b followed by a double quote
    cleanedTweets = re.sub(r"b'|b\"", '', cleanedTweets)

    # Note: this runs before lowercasing, so it only matches a lowercase 'rt' token
    cleanedTweets = re.sub(r'\brt\b', '', cleanedTweets) 

    # Convert all text to lowercase
    cleanedTweets = cleanedTweets.lower()

    # Remove URLs
    # https? - 'http' or 'https'
    # :// - Matches characters '://'
    # \S+ - Matches one or more non-whitespace characters (so, going ahead and checking other characters in link)
    cleanedTweets = re.sub(r'https?://\S+', '', cleanedTweets)

    # Replace new lines with spaces
    cleanedTweets = re.sub(r'\n', ' ', cleanedTweets)

    # Remove extra spaces
    cleanedTweets = re.sub(r'\s+', ' ', cleanedTweets)

    # Remove mentions, hashtags, and other punctuation
    cleanedTweets = re.sub(r'@\w+', '', cleanedTweets)
    cleanedTweets = re.sub(r'#\w+', '', cleanedTweets)
    cleanedTweets = re.sub(r'[^\w\s]', '', cleanedTweets)
    
    # Remove single letters that stand alone between whitespace
    cleanedTweets = re.sub(r'\s+[a-zA-Z]\s+', ' ', cleanedTweets)
    cleanedTweets = re.sub(r'\s+', ' ', cleanedTweets)
    
    # Remove numbers
    cleanedTweets = ''.join([j for j in cleanedTweets if not j.isdigit()])
     
    # Manually remove the unwanted 'xexx' pattern
    cleanedTweets = re.sub(r'xexx', '', cleanedTweets)
    
    # Drop words with a character repeated 3+ times in a row, one-letter words, and words containing non-letters
    cleanedTweets = ' '.join([w for w in cleanedTweets.split() 
                              if not re.search(r'(.)\1{2,}', w) 
                              and len(w) > 1
                              and not re.search(r'[^a-zA-Z]', w)])
    
    
    # Split the cleaned string into words 
    cleanedTweets = cleanedTweets.split()
    
    # Notice: some tokens still have 'trump' fused to other text, so split it out manually:
    newWord = []
    for w in cleanedTweets:
        if 'trump' in w:
            parts = re.split(r'(trump)', w)
            newWord.extend(part for part in parts if part) 
        else:
            newWord.append(w)
    
    return newWord
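
The order of the cleaning steps affects the result. As a minimal sketch (the sample tweet below is made up, not taken from `Trump.txt`), removing the `rt` marker before lowercasing leaves an uppercase `RT` in place, while lowercasing first removes both:

```python
import re

# Hypothetical sample tweet, for illustration only
sample = "RT @user: Great news rt everyone"

# Order used in cleanFile: strip 'rt' first, then lowercase
strip_then_lower = re.sub(r'\brt\b', '', sample).lower()

# Alternative order: lowercase first, then strip 'rt'
lower_then_strip = re.sub(r'\brt\b', '', sample.lower())

print(strip_then_lower.split())  # the leading 'rt' token survives
print(lower_then_strip.split())  # no 'rt' token remains
```

Passing `flags=re.IGNORECASE` to `re.sub` would be another way to get the second behavior without reordering the steps.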

### General Test
***

In [490]:
# Test the code above by printing the first 1000 tokens in the list 
# Call the function: 
newCleanedTweets = cleanFile(tweets)
print(newCleanedTweets[:1000])

['be', 'careful', 'what', 'you', 'wish', 'for', 'new', 'rusty', 'bowers', 'republican', 'arizona', 'house', 'speaker', 'says', 'trump', 'backed', 'gop', 'candidates', 'might', 'send', 'the', 'country', 'cback', 'intanrt', 'should', 'the', 'justice', 'department', 'file', 'criminal', 'charges', 'against', 'the', 'owners', 'of', 'fox', 'news', 'oan', 'and', 'newsmax', 'for', 'helping', 'donaldanrt', 'fun', 'fact', 'every', 'single', 'state', 'that', 'voted', 'for', 'donald', 'trump', 'receives', 'more', 'money', 'from', 'the', 'federal', 'government', 'than', 'it', 'contributeanrt', 'breaking', 'ny', 'ag', 'letitia', 'james', 'just', 'filed', 'civil', 'fraud', 'lawsuit', 'against', 'donald', 'trump', 'donald', 'trump', 'jr', 'eric', 'trump', 'ivankaanrt', 'ny', 'ag', 'trump', 'case', 'is', 'not', 'about', 'subjective', 'real', 'estate', 'valuations', 'these', 'are', 'not', 'inflations', 'of', 'these', 'are', 'inflationsanrt', 'the', 'art', 'of', 'the', 'steal', 'has', 'just', 'been', 'bu

<b> Question 1: </b> What’s the word count for positive/negative/stop word/other?

In [491]:
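def wordCounter(newCleanedTweets, negativeContents, positiveContents, stopwordsContents):
    # Note: when the *Contents arguments are raw file strings (as in the call below),
    # `word in ...` is a substring test, so a short token can match inside a longer entry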
    posWordCount = 0
    negWordCount = 0
    stopWordCount = 0
    otherWordCount = 0

    for word in newCleanedTweets:
        if word in positiveContents:
            posWordCount += 1
        elif word in negativeContents:
            negWordCount += 1
        elif word in stopwordsContents:
            stopWordCount += 1
        else:
            otherWordCount += 1
    
    return posWordCount, negWordCount, stopWordCount, otherWordCount
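
Because `in` behaves differently on a string than on a list or set, the counts depend on which one is passed in. The sketch below uses tiny made-up word lists (not the real lexicon files) and a hypothetical `count_categories` helper that mirrors the logic of `wordCounter`, to compare raw-string matching with exact set membership:

```python
def count_categories(tokens, positive, negative, stop):
    """Count tokens per category; behavior of `in` depends on the container type."""
    counts = {"positive": 0, "negative": 0, "stop": 0, "other": 0}
    for token in tokens:
        if token in positive:
            counts["positive"] += 1
        elif token in negative:
            counts["negative"] += 1
        elif token in stop:
            counts["stop"] += 1
        else:
            counts["other"] += 1
    return counts

# Made-up file contents, one word per line (assumption for illustration)
positive_contents = "beautiful\ngreat\n"
negative_contents = "bad\n"
stop_contents = "be\nthe\n"
tokens = ["be", "careful", "great", "the", "bad", "house"]

# Raw strings: 'be' is found inside 'beautiful', so it counts as positive
raw_counts = count_categories(tokens, positive_contents, negative_contents, stop_contents)

# Sets of whole words: 'be' only matches the stop word 'be'
set_counts = count_categories(
    tokens,
    set(positive_contents.split()),
    set(negative_contents.split()),
    set(stop_contents.split()),
)

print(raw_counts)
print(set_counts)
```

On this toy input the raw-string version reports one extra positive word and one fewer stop word, so it may be worth checking how much substring matching affects the counts on the real data.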


In [492]:
# Call the function: 
posWordCount, negWordCount, stopWordCount, otherWordCount = wordCounter(newCleanedTweets, negativeContents, positiveContents, stopwordsContents)
print ("There are ", posWordCount, " positive words in the tweets.")
print ("There are ", negWordCount, " negative words in the tweets.")
print ("There are ", stopWordCount, " stop words in the tweets.")
print ("There are ", otherWordCount, " other words in the tweets.")

There are  63715  positive words in the tweets.
There are  15492  negative words in the tweets.
There are  12239  stop words in the tweets.
There are  44059  other words in the tweets.


<b> Question 2: </b> What’s the ratio of positive/negative/stop word/other, compared to the total word count?

<b> Question 3: </b> What’s the ratio for positive versus negative word count? (positive words # / negative words #)

In [493]:
def ratios(newCleanedTweets, negativeContents, positiveContents, stopwordsContents):
    # call the wordCounter function to get the word counts
    posWordCount, negWordCount, stopWordCount, otherWordCount = wordCounter(newCleanedTweets, negativeContents, positiveContents, stopwordsContents)

    wordCountSum = posWordCount + negWordCount + stopWordCount + otherWordCount
    # To check, use -> print(wordCountSum)

    # Round each ratio to 2 decimal places for readability 
    posRatio = round(posWordCount/wordCountSum, 2) 
    negRatio = round(negWordCount/wordCountSum, 2)  
    stopRatio = round(stopWordCount/wordCountSum, 2) 
    otherRatio = round(otherWordCount/wordCountSum, 2)
    
    # Ratio of positive words / negative words 
    postiveVersusNegative = round(posWordCount/negWordCount, 2)
    
    print ("The ratio of positive words to total words is:", posRatio)
    print ("The ratio of negative words to total words is:", negRatio)
    print ("The ratio of stop words to total words is:", stopRatio)
    print ("The ratio of other words to total words is:", otherRatio)
    
    print("Positive Word Count / Negative Word Count:", posWordCount, "/", negWordCount, "=", postiveVersusNegative)

In [494]:
# Call the function: 
ratios(newCleanedTweets, negativeContents, positiveContents, stopwordsContents)

The ratio of positive words to total words is: 0.47
The ratio of negative words to total words is: 0.11
The ratio of stop words to total words is: 0.09
The ratio of other words to total words is: 0.33
Positive Word Count / Negative Word Count: 63715 / 15492 = 4.11
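
The arithmetic behind these ratios can be checked directly from the four counts printed for Question 1. This is a minimal sketch; `compute_ratios` is a hypothetical helper that returns the values instead of printing them:

```python
def compute_ratios(pos, neg, stop, other):
    """Return each category's share of the total and the positive/negative ratio."""
    total = pos + neg + stop + other
    return {
        "total": total,
        "positive": round(pos / total, 2),
        "negative": round(neg / total, 2),
        "stop": round(stop / total, 2),
        "other": round(other / total, 2),
        "pos_vs_neg": round(pos / neg, 2),
    }

# Counts taken from the Question 1 output above
result = compute_ratios(63715, 15492, 12239, 44059)
print(result)
```

Returning a dictionary rather than printing makes the numbers easy to reuse in later cells or to compare across runs.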


<b> Question from Answer Sheet (Not included in Homework Document): </b> If we account for "Trump" to not be a positive/negative/stopword, what would be the updated ratio for positive and negative words? (provide the decimal value of positive # / negative #, 2 decimal places)

In [495]:
def newRatios(newCleanedTweets, negativeContents, positiveContents, stopwordsContents):
    # call the wordCounter function to get the word counts
    posWordCount, negWordCount, stopWordCount, otherWordCount = wordCounter(newCleanedTweets, negativeContents, positiveContents, stopwordsContents)
    
    # Check how many times this word occurs in cleanedTweets
    trumpWordCount = sum(1 for word in newCleanedTweets if word.lower() == 'trump')
    # To check, use - print(trumpWordCount)
    
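    # Subtract the 'trump' occurrences from the positive count, since the positive file is the only one that contains this word
    # Note: these tokens are dropped from the total below rather than moved to the "other" category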
    newPosCount = posWordCount - trumpWordCount
    
    wordCountSum = newPosCount + negWordCount + stopWordCount + otherWordCount
    
    # Round each ratio to 2 decimal places for readability 
    posRatio = round(newPosCount/wordCountSum, 2) 
    negRatio = round(negWordCount/wordCountSum, 2)  
    stopRatio = round(stopWordCount/wordCountSum, 2) 
    otherRatio = round(otherWordCount/wordCountSum, 2)
    
    # Ratio of positive words / negative words 
    postiveVersusNegative = round(newPosCount/negWordCount, 2)
    
    print ("The ratio of positive words to total words is:", posRatio)
    print ("The ratio of negative words to total words is:", negRatio)
    print ("The ratio of stop words to total words is:", stopRatio)
    print ("The ratio of other words to total words is:", otherRatio)
    
    print("Positive Word Count / Negative Word Count:", newPosCount, "/", negWordCount, "=", postiveVersusNegative)

In [496]:
# Call the function: 
newRatios(newCleanedTweets, negativeContents, positiveContents, stopwordsContents)

The ratio of positive words to total words is: 0.45
The ratio of negative words to total words is: 0.12
The ratio of stop words to total words is: 0.09
The ratio of other words to total words is: 0.34
Positive Word Count / Negative Word Count: 57696 / 15492 = 3.72
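
There are two reasonable ways to treat the "trump" tokens once they are no longer positive: drop them from the total (as `newRatios` does) or move them to the "other" category. The sketch below compares both using the counts reported above; treating "trump" as "other" is an assumption about how the question could be read, and `summarize` is a hypothetical helper.

```python
def summarize(pos, neg, stop, other):
    """Return category shares of the total and the positive/negative ratio."""
    total = pos + neg + stop + other
    return {
        "positive": round(pos / total, 2),
        "negative": round(neg / total, 2),
        "stop": round(stop / total, 2),
        "other": round(other / total, 2),
        "pos_vs_neg": round(pos / neg, 2),
    }

# Counts from the outputs above
POS, NEG, STOP, OTHER = 63715, 15492, 12239, 44059
TRUMP = 63715 - 57696  # 'trump' occurrences implied by the two positive counts

# Convention 1: drop 'trump' tokens from the total (matches newRatios)
dropped = summarize(POS - TRUMP, NEG, STOP, OTHER)

# Convention 2 (assumption): count 'trump' tokens as "other" instead
moved = summarize(POS - TRUMP, NEG, STOP, OTHER + TRUMP)

print(dropped)
print(moved)
```

The positive/negative ratio is the same under both conventions because it does not depend on the total; only the per-category shares shift.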


<b> Question 4: </b> Do you think that the general sentiment is negative or positive? Weakly or strongly?

Although the statistics suggest otherwise, I believe the general sentiment is negative, based on what I read in the tweets themselves. By the numbers alone, the sentiment appears generally positive: positive words make up 47% of all words, compared with 11% for negative words. 

<b> Question 5: </b> What are some other considerations you might have on this analysis? (e.g., can you take the results above as is?)

There are several other considerations for this sentiment analysis model. First, words classified as positive or negative can carry different connotations depending on the reader and the context. For example, "great" is positive when we are complimenting someone's work, but may be read as negative when someone is being sarcastic. Second, as seen with the word "Trump", which is stored in the positive text file, the same word can appear in neutral, strongly positive, or strongly negative tweets from different individuals. Because political discussions often carry sarcastic notes, it is difficult for a basic sentiment analysis model like this one to gauge how negative or positive the tweets really are. The analysis could be strengthened with a larger dataset and a machine learning model (e.g., logistic regression) to categorize the tweets.  
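
The sarcasm point can be illustrated with a toy example. In this minimal sketch, the word sets and sentences are made up and `lexicon_score` is a hypothetical helper. A word-count approach gives a sincere and a sarcastic use of "great" the same score:

```python
# Tiny made-up lexicons, for illustration only
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "terrible"}

def lexicon_score(text):
    """Score = number of positive words minus number of negative words."""
    tokens = text.lower().split()
    positives = sum(1 for t in tokens if t in POSITIVE)
    negatives = sum(1 for t in tokens if t in NEGATIVE)
    return positives - negatives

sincere = "great work on the report"
sarcastic = "oh great another delay"

print(lexicon_score(sincere), lexicon_score(sarcastic))
```

Both sentences score the same, so the counts alone cannot separate the two tones; that gap is what a model trained on labeled examples would try to close.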