# Sentiment Analysis on Twitter 

## author: Sergey Ovsianyk

The following cell makes sure that all of the outputs of a cell are printed.

In [1]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

The following cell disables autosaves.

In [2]:
%autosave 0

Autosave disabled


## 1. "Trump" keyword

### Data loading and preparation.

After collecting tweets from Twitter with the keyword "*Trump*", I read the file.

In [3]:
tweets_file = open("../data/Trump.txt")

In [4]:
# saving all information from file to the string
tweets_string = tweets_file.read()

In [5]:
# closing the file
tweets_file.close()

In [6]:
# since words in our positive, negative and stop lists are lower-case, I converted all tweets to lower-case.
tweets_string = tweets_string.lower()

All collected tweets are now stored in a single string.

Split that string into a list of "words":

In [7]:
tweets_words_list = tweets_string.split()
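
To make the shape of these "words" concrete, here is a small illustration on a made-up string (the sample tweet is invented for this example and is not taken from `Trump.txt`). It shows that `split()` only cuts on whitespace, so punctuation, mentions and hashtags stay attached to the tokens, which is why a cleaning step is needed afterwards.

```python
# Made-up sample text (an assumption for illustration, not real data)
sample = "RT @some_user: Great news!!! #win"

# Same two steps as in the notebook: lower-case, then split on whitespace
tokens = sample.lower().split()
print(tokens)
# ['rt', '@some_user:', 'great', 'news!!!', '#win']
```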

Total number of words before cleaning:

In [8]:
len(tweets_words_list)

94974

Reading the file of **positive** words:

In [9]:
positive_file = open('../data/positive.txt','r')

In [10]:
positive_words = positive_file.read()

In [11]:
positive_file.close()

All positive words are now stored in a single string, separated by `\n`. Split that string into a list of positive words:

In [12]:
positive_words = positive_words.split(sep = '\n')

Reading the file of **negative** words:

In [13]:
negative_file = open('../data/negative-words.txt','r')

In [14]:
negative_words = negative_file.read()

In [15]:
negative_file.close()

All negative words are now stored in a single string, separated by `\n`. Split that string into a list of negative words:

In [16]:
negative_words = negative_words.split(sep = '\n')

Reading the file of **stop** words:

In [17]:
stop_file = open('../data/stopwords.txt','r')

In [18]:
stop_words = stop_file.read()

In [19]:
stop_file.close()

All stop words are now stored in a single string, separated by `\n`. Split that string into a list of stop words:

In [20]:
stop_words = stop_words.split(sep = '\n')
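
The three word lists are loaded with the same open/read/close/split steps. As a sketch only, these steps could be folded into one helper; `load_word_list` is a hypothetical name, not part of the notebook. It uses a `with` block so the file is closed automatically, and it skips blank lines, because `split(sep='\n')` leaves an empty string in the list when a file ends with a newline. It returns a set, so if it were adopted the later `append` call on the positive list would have to become `add`.

```python
import os
import tempfile


def load_word_list(path):
    """Hypothetical helper: read one word per line, return the non-empty lines as a set."""
    with open(path, "r") as handle:  # the file is closed when the block ends
        return {line.strip() for line in handle if line.strip()}


# Demonstration on a throwaway file, so no real data file is needed
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("good\ngreat\n\nhappy\n")
    demo_path = tmp.name

demo_words = load_word_list(demo_path)
os.remove(demo_path)
print(sorted(demo_words))
# ['good', 'great', 'happy']
```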

### Data cleaning

First, I define a function that strips a fixed set of special characters and digits from both ends of each word ('@' is stripped from the right end only). It then checks every character of every word and removes the word from the list if it still contains a special character; '-' is not in that set, so hyphenated words are kept. Finally, it removes any word that is an empty string, is just '-', or starts with '@'.

In [21]:
def word_cleaning(words_list):
    # '@' is deliberately left out of the special characters.
    # On Twitter, a user is tagged with the @User_name syntax, so a leading '@'
    # is kept in order to recognise these mentions; they are deleted in the last step.
    # string of special characters (digits included)
    special_characters = ':.,?/><`~\\|\"\';]}[{1234567890±§!#$%^&*)(_='
    for index, item in enumerate(words_list):
        # strip special symbols from both sides ('@' only from the right)
        words_list[index] = words_list[index].rstrip(special_characters + '@')
        words_list[index] = words_list[index].lstrip(special_characters)
    # NOTE: the two loops below remove items from the list they iterate over,
    # which can skip the item that follows each removed one
    # for each word in the list
    for word in words_list:
        # for each character of the word
        for c in word:
            # if the character is a special character, delete the word
            # ('-' is not in special_characters, so hyphenated words are kept)
            if (c in special_characters) and (c != '-'):
                words_list.remove(word)
                break
    for word in words_list:
        # if the word is an empty string, a dash, or starts with '@', delete it
        if word == '' or word == '-' or word.startswith('@'):
            words_list.remove(word)
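
One caveat with this approach: calling `remove` on a list while looping over it makes the loop skip the element that follows each removed one, which may be one reason some dirty words survive the cleaning. The sketch below first shows the effect on a toy list, then gives a variant that builds a new list instead. `cleaned_words` is a hypothetical name, and this is an illustration of the alternative, not the function that produced the numbers reported in this notebook; using it would likely change those counts.

```python
# Toy demonstration of the skipping effect
items = ['a1', 'b2', 'ok', 'c3', 'd4']
for item in items:
    if any(ch.isdigit() for ch in item):
        items.remove(item)
print(items)
# ['b2', 'ok', 'd4']  ('b2' and 'd4' were never examined)

# Same special characters as in word_cleaning
SPECIAL = ':.,?/><`~\\|"\';]}[{1234567890±§!#$%^&*)(_='


def cleaned_words(words):
    """Hypothetical variant: return a new list and leave the input unchanged."""
    result = []
    for word in words:
        # strip special characters from both ends ('@' only from the right)
        word = word.rstrip(SPECIAL + '@').lstrip(SPECIAL)
        # drop empty strings, lone dashes and @mentions
        if word in ('', '-') or word.startswith('@'):
            continue
        # drop words that still contain a special character
        if any(ch in SPECIAL for ch in word):
            continue
        result.append(word)
    return result


demo = cleaned_words(['great!!!', '@user:', '#win', 'a1b', '-', 'well-known', "it's"])
print(demo)
# ['great', 'win', 'well-known']
```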

Apply `word_cleaning` to our list of words:

In [22]:
word_cleaning(tweets_words_list)

In [23]:
print('After cleaning we have ' + str(len(tweets_words_list)) + ' words')

After cleaning we have 78985 words


Even after cleaning, some dirty words remain, as in the example below:

<div class="alert alert-block alert-warning">WARNING: this example applies only to the current Trump.txt. You can still inspect the list manually to find other dirty words. </div>

In [24]:
print(tweets_words_list[72])

latest\xe2\x80\xa6\n'b"rt


In [25]:
def dirty_count(words_list):
    dirty = 0
    for word in words_list:
        if not word.isalpha():
            dirty += 1
    print("Roughly calculating (including all hyphen words). There are " + str(round((dirty * 100/len(words_list)), 2)) + "% of dirty words")

In [26]:
dirty_count(tweets_words_list)

Roughly calculating (including all hyphen words). There are 3.1% of dirty words


We treat these remaining dirty words as noise.
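
Because `str.isalpha()` returns `False` for any word containing a hyphen, the figure above is an upper bound that also counts clean hyphenated words. A possible refinement, sketched here under the assumption that hyphens should be tolerated, is to drop hyphens before the check and to return the percentage instead of printing it. `dirty_share` is a hypothetical name, and the word list in the demonstration is made up.

```python
def dirty_share(words):
    """Hypothetical variant: percentage of words that are not alphabetic once hyphens are dropped."""
    if not words:
        return 0.0  # avoid dividing by zero on an empty list
    dirty = sum(1 for w in words if not w.replace('-', '').isalpha())
    return round(dirty * 100 / len(words), 2)


# 'covid-19' (digits) and "b'rt" (quote) are dirty; the other two are clean
print(dirty_share(['covid-19', 'well-known', 'hello', "b'rt"]))
# 50.0
```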

### Sentiment Analysis

Create a function that counts how many positive, negative, stop and other words the list contains. Each word is assigned to the first category it matches, in that order.

In [27]:
def countPosNegStopWords(listOfWords, positiveList, negativeList, stopWordsList):
    positive_count = 0
    negative_count = 0
    stop_word_count = 0
    other_count = 0
    for word in listOfWords:
        if word in positiveList:
            positive_count += 1
            continue
        elif word in negativeList:
            negative_count += 1
            continue
        elif word in stopWordsList:
            stop_word_count += 1
            continue
        else:
            other_count += 1
    return (positive_count, negative_count, stop_word_count, other_count)
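
Each `word in list` test scans the whole list, which becomes slow for tens of thousands of tweets and long lexicons. A minimal sketch of the same counting logic with sets, where membership tests are much faster on average, could look like this. `count_categories` is a hypothetical name, and the category precedence (positive, then negative, then stop, then other) mirrors the function above.

```python
def count_categories(words, positive, negative, stop):
    """Hypothetical set-based variant of countPosNegStopWords."""
    # convert once so that each membership test is fast
    positive, negative, stop = set(positive), set(negative), set(stop)
    pos = neg = stp = other = 0
    for word in words:
        if word in positive:
            pos += 1
        elif word in negative:
            neg += 1
        elif word in stop:
            stp += 1
        else:
            other += 1
    return pos, neg, stp, other


# Made-up toy input
print(count_categories(['good', 'bad', 'the', 'cat', 'good'], ['good'], ['bad'], ['the']))
# (2, 1, 1, 1)
```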

In [28]:
positive_count_trump, negative_count_trump, stop_word_count_trump, other_count_trump\
= countPosNegStopWords(tweets_words_list,positive_words,negative_words,stop_words)

Create a function that performs sentiment analysis on our list of words and prints results.

In [29]:
def sentiment_info(npositive, nnegative, nstop, nother, nall):
    ratio = 0
    info = ''
    sentiment_info = 'In general sentiment is '
    if npositive > nnegative:
        ratio = npositive/nnegative
        info = "For each negative word there is " + str(round(ratio,2)) + " positive words"
        if ratio > 1.4:
            sentiment_info += 'strongly positive'
        elif ratio > 1.1:
            sentiment_info += 'weakly positive'
        else:
            sentiment_info += 'neutral'
    elif nnegative > npositive:
        ratio = nnegative/npositive
        info = "For each positive word there is " + str(round(ratio,2)) + " negative words"
        if ratio > 1.4:
            sentiment_info += 'strongly negative'
        elif ratio > 1.1:
            sentiment_info += 'weakly negative'
        else:
            sentiment_info += 'neutral'
    else:
        ratio = nnegative/npositive
        info = "For each positive word there is " + str(round(ratio,2)) + " negative words"
        sentiment_info += 'neutral'
    print(info)
    print(sentiment_info)
    print()
    print("The sum of positive and negative words = " + str(npositive + nnegative))
    print()
    print('There are ' + str(npositive) + ' positive words')
    print('There are ' + str(nnegative) + ' negative words')
    print('There are ' + str(nstop) + ' stop words')
    print('There are ' + str(nother) + ' other words')
    print()
    print('Ratio of positive words to all words is:' + str(round(npositive/nall,5)))
    print('Ratio of negative words to all words is:' + str(round(nnegative/nall,5)))
    print('Ratio of stop words to all words is:' + str(round(nstop/nall,5)))
    print('Ratio of other words to all words is:' + str(round(nother/nall,5)))
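
The classification rule inside `sentiment_info` divides the larger count by the smaller one and compares the ratio with the thresholds 1.4 and 1.1. The sketch below isolates that rule in a function that returns its result instead of printing it. `classify_sentiment` and its parameter names are hypothetical, and the handling of zero counts is an added assumption: the original function would raise `ZeroDivisionError` if the smaller count were zero.

```python
def classify_sentiment(npositive, nnegative, strong=1.4, weak=1.1):
    """Hypothetical helper: return (ratio, label) using the thresholds from sentiment_info."""
    if npositive == nnegative:
        return 1.0, 'neutral'  # also covers the case where both counts are zero
    larger = max(npositive, nnegative)
    smaller = min(npositive, nnegative)
    # assumption: a zero count on one side is treated as an infinite ratio
    ratio = float('inf') if smaller == 0 else larger / smaller
    direction = 'positive' if npositive > nnegative else 'negative'
    if ratio > strong:
        label = 'strongly ' + direction
    elif ratio > weak:
        label = 'weakly ' + direction
    else:
        label = 'neutral'
    return ratio, label


ratio, label = classify_sentiment(5118, 3140)
print(round(ratio, 2), label)
# 1.63 strongly positive
```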

In [30]:
sentiment_info(positive_count_trump, negative_count_trump, stop_word_count_trump, other_count_trump, len(tweets_words_list))

For each negative word there is 1.63 positive words
In general sentiment is strongly positive

The sum of positive and negative words = 8258

There are 5118 positive words
There are 3140 negative words
There are 38924 stop words
There are 31803 other words

Ratio of positive words to all words is:0.0648
Ratio of negative words to all words is:0.03975
Ratio of stop words to all words is:0.4928
Ratio of other words to all words is:0.40265


### However: "trump" is itself in the list of positive words

In [31]:
'trump' in positive_words

True

In [32]:
positive_words.remove('trump')

In [33]:
positive_count_trump, negative_count_trump, stop_word_count_trump, other_count_trump\
= countPosNegStopWords(tweets_words_list,positive_words,negative_words,stop_words)

In [34]:
sentiment_info(positive_count_trump, negative_count_trump, stop_word_count_trump,\
               other_count_trump, len(tweets_words_list))

For each positive word there is 1.16 negative words
In general sentiment is weakly negative

The sum of positive and negative words = 5858

There are 2718 positive words
There are 3140 negative words
There are 38924 stop words
There are 34203 other words

Ratio of positive words to all words is:0.03441
Ratio of negative words to all words is:0.03975
Ratio of stop words to all words is:0.4928
Ratio of other words to all words is:0.43303


#### Summary: after removing the word "trump" from the list of positive words, the sentiment changed from strongly positive to weakly negative.
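
The two runs differ by exactly the occurrences of "trump": 5118 - 2718 = 2400 tokens move from the positive count to the other count (31803 + 2400 = 34203), while the negative and stop counts are unchanged. To avoid mutating the lexicon and restoring it afterwards, one option is to build a filtered copy that excludes the search keywords. This is only a sketch; `without_keywords` is a hypothetical name and the demo lexicon is made up.

```python
def without_keywords(lexicon, keywords):
    """Hypothetical helper: return a copy of the lexicon without the search keywords."""
    blocked = {k.lower() for k in keywords}
    return [w for w in lexicon if w not in blocked]


demo_positive = ['good', 'trump', 'great']
filtered = without_keywords(demo_positive, ['Trump'])
print(filtered)
# ['good', 'great']

# Consistency check on the counts reported in the two runs
moved = 5118 - 2718
print(moved, 31803 + moved)
# 2400 34203
```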

In [35]:
positive_words.append('trump')

## 2. "covid-19" keyword

### Data loading and preparation.

After collecting tweets from Twitter with the keyword "*covid-19*", I read the file.

In [36]:
covid_19_file = open("../data/covid-19.txt")

In [37]:
# saving all information from file to the string
covid_19_string = covid_19_file.read()

In [38]:
# closing the file
covid_19_file.close()

In [39]:
# since words in our positive, negative and stop lists are lower-case, I converted all tweets to lower-case.
covid_19_string = covid_19_string.lower()

All collected tweets are now stored in a single string.

Split that string into a list of "words":

In [40]:
covid_19_words_list = covid_19_string.split()

Total number of words before cleaning:

In [41]:
len(covid_19_words_list)

45897

### Data cleaning

In [42]:
word_cleaning(covid_19_words_list)

In [43]:
print('After cleaning we have ' + str(len(covid_19_words_list)) + ' words')

After cleaning we have 38980 words


In [44]:
dirty_count(covid_19_words_list)

Roughly calculating (including all hyphen words). There are 7.22% of dirty words


### Sentiment Analysis

In [45]:
positive_count_covid_19, negative_count_covid_19, stop_word_count_covid_19, other_count_covid_19 =\
countPosNegStopWords(covid_19_words_list, positive_words, negative_words, stop_words)

In [46]:
sentiment_info(positive_count_covid_19, negative_count_covid_19, stop_word_count_covid_19,\
               other_count_covid_19, len(covid_19_words_list))

For each positive word there is 1.06 negative words
In general sentiment is neutral

The sum of positive and negative words = 2421

There are 1173 positive words
There are 1248 negative words
There are 18139 stop words
There are 18420 other words

Ratio of positive words to all words is:0.03009
Ratio of negative words to all words is:0.03202
Ratio of stop words to all words is:0.46534
Ratio of other words to all words is:0.47255
