## A sentiment analysis of tweets related to E-levy

#### This project analyzes public sentiment on the introduction of e-levy (electronic levy) charges by the Ghana Government on digital transactions.

It uses the ------ NLP model to analyze the sentiment of hundreds of tweets scraped from Twitter.

Importing the first libraries <br>
<ul>
<li>snscrape ---> a scraper for social networking services </li>
<li>pandas ---> an open source data analysis and manipulation tool </li> 
<li>re ---> Python's regular expression library, used here for string manipulation</li>
</ul>

In [195]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
import itertools
import re
import preprocessor as p


import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, word_tokenize

# fetch the corpora used below (skipped if already downloaded)
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = nltk.stem.WordNetLemmatizer()
stop_words = set(stopwords.words('english'))



[nltk_data] Downloading package wordnet to C:\Users\Innocent
[nltk_data]     Anyaele\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Innocent
[nltk_data]     Anyaele\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We build our query here to gather English-language e-levy tweets posted between 2022-05-01 and 2022-09-20, excluding tweets that contain links.

We store the gathered tweets in a CSV file -> stream.csv

In [196]:
def scrapeTweets():    
    query = '"e-levy" lang:en until:2022-09-20 since:2022-05-01 -filter:links'

    tweets = []
    limit = 500

    data = sntwitter.TwitterSearchScraper(query).get_items()
    for tweet in data:
        if len(tweets) == limit:
            break
        else:
            tweet_text = tweet.content
            tweets.append([tweet.id, tweet_text, tweet.date])
            
    df = pd.DataFrame(tweets, columns=['id', 'Tweet', 'Date'])

    df.to_csv('stream.csv', index=False, columns=['id','Tweet','Date'])
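
The query string and the 500-tweet cap above can also be expressed with small helpers. This is a minimal sketch, not the notebook's code: `build_query`, `take` and `fake_tweet_stream` are hypothetical names, and the generator is a stand-in for the real scraper so the example runs offline. It uses `itertools.islice`, which stops reading from a generator after a fixed number of items.

```python
from itertools import islice


def build_query(keyword, since, until, lang="en", exclude_links=True):
    """Assemble a search string from the operators used in the notebook's query."""
    parts = [f'"{keyword}"', f"lang:{lang}", f"until:{until}", f"since:{since}"]
    if exclude_links:
        parts.append("-filter:links")
    return " ".join(parts)


def fake_tweet_stream():
    """Hypothetical stand-in for the scraper's generator; yields (id, text) pairs forever."""
    n = 0
    while True:
        n += 1
        yield (n, f"sample tweet {n} about e-levy")


def take(items, limit):
    """Return at most `limit` items from any iterable without consuming the rest."""
    return list(islice(items, limit))


query = build_query("e-levy", since="2022-05-01", until="2022-09-20")
rows = take(fake_tweet_stream(), 5)
print(query)
print(len(rows))
```

With the real scraper, the same `take` helper could wrap the generator returned by `get_items()` in place of the manual counter.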


Our dataset contains 500 rows, each holding the tweet id, the tweet content and the tweet date.
We print the first 5 tweets.
The tweets contain a lot of data that is unnecessary for analysis, so we preprocess them in the next step.

In [198]:
df = pd.read_csv('stream.csv')
print(df.shape)
print(df['Tweet'].head(5))

(500, 3)
0    @edburtler @shamimamuslim Notice that Shamima ...
1    @edburtler @shamimamuslim QUESTION: \nWhy was ...
2    @FrankOw18664478 @hearttooclean @mandemthe1st ...
3    @FrankOw18664478 @hearttooclean @mandemthe1st ...
4    @FrankOw18664478 @bra_Kofi__ @hearttooclean @m...
Name: Tweet, dtype: object


The preprocessing stage cleans up our tweets with a regex-based function that removes links, tags, punctuation and extra whitespace.

We also leverage the NLTK library to remove stop words such as "and", "or", "in".

After this step, the printed tweets look much cleaner than before.

In [203]:
def tweet_preprocessing(tweet):
    # regex cleanup: strip t.co links at the start, middle and end of the tweet
    tweet = re.sub(r"^https://t.co/[A-Za-z0-9]*\s", " ", tweet)
    tweet = re.sub(r"\s+https://t.co/[a-zA-Z0-9]*\s", " ", tweet)
    tweet = re.sub(r"\s+https://t.co/[a-zA-Z0-9]*$", " ", tweet)
    # collapse runs of dots, drop a trailing dash and any remaining punctuation
    tweet = re.sub(r"\.\.+", " ", tweet)
    tweet = re.sub(r"-$", "", tweet)
    tweet = re.sub(r"[^\w\s]", "", tweet)
    # trim leading spaces and collapse repeated spaces
    tweet = re.sub(r"^ +", "", tweet)
    tweet = re.sub(r"  +", " ", tweet)

    # use preprocessing library to clean
    tweet = p.clean(tweet)
    tweet = tweet.lower()

    # tokenize, then remove stopwords
    token_tweet = word_tokenize(tweet)
    filtered = [w for w in token_tweet if w not in stop_words]

    return ' '.join(filtered)


# apply the preprocessing function to every tweet
df['Tweet'] = df['Tweet'].apply(tweet_preprocessing)
print(df['Tweet'].tail(5))


495    minister communication ursulaow mocghana moigo...
496    georgekankambo4 hmmm tear even cedis elevy npp...
497    vodafoneghana charged elevy transaction less c...
498                            citi973 like refund elevy
499                    everytime pay elevy dey feel sick
Name: Tweet, dtype: object
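
To make the cleaning steps easier to follow, here is a minimal standalone sketch that runs on a single made-up tweet using only the standard library. It is not the notebook's function: `simple_clean` and the tiny `STOP_WORDS` set are hypothetical, and unlike the code above it removes @mentions before stripping punctuation.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "is", "a", "and", "or", "in", "on", "to", "i"}


def simple_clean(tweet: str) -> str:
    """Remove links and @mentions, strip punctuation, lowercase, drop stop words."""
    tweet = re.sub(r"https?://\S+", " ", tweet)  # links
    tweet = re.sub(r"@\w+", " ", tweet)          # @mentions
    tweet = re.sub(r"[^\w\s]", "", tweet)        # punctuation
    words = tweet.lower().split()                # split() also collapses whitespace
    return " ".join(w for w in words if w not in STOP_WORDS)


# Made-up example tweet, not taken from the dataset.
print(simple_clean("@citi973 The E-levy is too much!! https://t.co/abc123"))
# prints: elevy too much
```

Removing mentions first is a design choice: once punctuation is stripped, the "@" is gone and a handle can no longer be told apart from an ordinary word.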


##### Using the BERT Model

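BERT is a transformer-based machine learning technique for natural language processing pre-training, developed by Google (source: Wikipedia).

We import AutoTokenizer and AutoModelForSequenceClassification from the Hugging Face transformers library and load the pretrained nlptown/bert-base-multilingual-uncased-sentiment checkpoint.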

In [162]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


Downloading: 100%|██████████| 39.0/39.0 [00:00<00:00, 18.7kB/s]
Downloading: 100%|██████████| 953/953 [00:00<00:00, 144kB/s]
Downloading: 100%|██████████| 851k/851k [00:18<00:00, 46.3kB/s] 
Downloading: 100%|██████████| 112/112 [00:00<00:00, 113kB/s]
Downloading: 100%|██████████| 638M/638M [20:59<00:00, 531kB/s]    


We import Torch (an open source machine learning framework that can be used for natural language processing) and create a function that uses our tokenizer and model to find the sentiment score of a particular tweet. The model returns an array of logits, one per class; torch.argmax picks the index of the largest logit, and adding 1 turns that index into the sentiment score.

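In this notebook the scores fall in the range 1 - 3,
with 1 being a negative sentiment, 2 a neutral sentiment and 3 a positive sentiment. Note that the nlptown/bert-base-multilingual-uncased-sentiment checkpoint is published as a 1 to 5 star rating model, so the function can also return 4 or 5.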

Let's try this function on the text "I hate this" to test our model.

It returns 1 (a negative sentiment).

We try the function on the text "I love this".

It returns 3 (a positive sentiment).



In [226]:
import torch

def sentiment_score(tweet):
    # encode the tweet as a tensor of token ids
    tokens = tokenizer.encode(tweet, return_tensors='pt')
    result = model(tokens)
    # index of the largest logit, shifted so scores start at 1
    return int(torch.argmax(result.logits)) + 1

sentiment_score('I hate this')
# sentiment_score('i love this')

1
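
The argmax step inside the function can be illustrated without downloading the model. The sketch below uses made-up logits (hypothetical values, not real model output) and plain Python instead of Torch; `score_from_logits` and `softmax` are illustrative helper names that do not appear in the notebook.

```python
import math


def score_from_logits(logits):
    """Return the 1-based position of the largest logit."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index + 1


def softmax(logits):
    """Turn logits into probabilities that sum to 1 (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


fake_logits = [2.1, 0.3, -1.0]  # hypothetical values: the first class scores highest
print(score_from_logits(fake_logits))
print([round(p, 3) for p in softmax(fake_logits)])
```

Softmax does not change which class wins, so the argmax of the raw logits is enough for picking a label; the probabilities are only useful if a confidence value is wanted as well.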

We apply our sentiment score function to every tweet and store the result in a new column called sentiment_score.

In [218]:
df['sentiment_score'] = df['Tweet'].apply(sentiment_score)

In [223]:
df.tail(10)

| | id | Tweet | Date | sentiment_score |
|---|---|---|---|---|
| 490 | 1567781340363345920 | sikanikwame_ gramof ursula stubborn academy cs... | 2022-09-08 07:46:14+00:00 | 2 |
| 491 | 1567773778842718209 | kevinekowtaylor nakufoaddo happens elevy cst e... | 2022-09-08 07:16:11+00:00 | 2 |
| 492 | 1567756517788598273 | elevy affects transactions less gh | 2022-09-08 06:07:36+00:00 | 2 |
| 493 | 1567685541897928706 | _lawslaw get elevy | 2022-09-08 01:25:34+00:00 | 2 |
| 494 | 1567664697737969665 | everything e npp successfully implemented elev... | 2022-09-08 00:02:44+00:00 | 3 |
| 495 | 1567652269893500929 | minister communication ursulaow mocghana moigo... | 2022-09-07 23:13:21+00:00 | 1 |
| 496 | 1567649835804663809 | georgekankambo4 hmmm tear even cedis elevy npp... | 2022-09-07 23:03:41+00:00 | 2 |
| 497 | 1567630086681108482 | vodafoneghana charged elevy transaction less c... | 2022-09-07 21:45:12+00:00 | 2 |
| 498 | 1567607475356012545 | citi973 like refund elevy | 2022-09-07 20:15:21+00:00 | 2 |
| 499 | 1567605853531897859 | everytime pay elevy dey feel sick | 2022-09-07 20:08:54+00:00 | 1 |


In [222]:
df['sentiment_score'].describe()

count    500.000000
mean       1.810000
std        0.488238
min        1.000000
25%        2.000000
50%        2.000000
75%        2.000000
max        3.000000
Name: sentiment_score, dtype: float64

We add a new column with the text equivalent of our score.
<ul>
<li>1 ---> negative</li>
<li>2 ---> neutral</li>
<li>3 ---> positive</li>
</ul>

In [230]:
def get_sentiment(x):
    if x == 1:
        return 'Negative'
    if x == 2:
        return 'Neutral'
    if x == 3:
        return "Positive"
    
df['sentiment'] = df['sentiment_score'].apply(lambda x: get_sentiment(x))

df['sentiment'].tail(10)

490     Neutral
491     Neutral
492     Neutral
493     Neutral
494    Positive
495    Negative
496     Neutral
497     Neutral
498     Neutral
499    Negative
Name: sentiment, dtype: object
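
A natural next step is to count how many tweets fall under each label. The sketch below is an illustration under stated assumptions rather than part of the notebook: it uses only the ten scores shown in the `df.tail(10)` output above, and the `LABELS` dictionary is a hypothetical alternative to the chain of `if` statements.

```python
from collections import Counter

# Lookup table equivalent to the if-chain in get_sentiment (hypothetical name).
LABELS = {1: "Negative", 2: "Neutral", 3: "Positive"}

# The ten sentiment scores shown in the tail of the DataFrame (rows 490 to 499).
tail_scores = [2, 2, 2, 2, 3, 1, 2, 2, 2, 1]

labels = [LABELS[score] for score in tail_scores]
counts = Counter(labels)

for label, n in counts.most_common():
    share = 100 * n / len(labels)
    print(f"{label}: {n} ({share:.0f}%)")
```

On the full DataFrame, `df['sentiment'].value_counts()` would give the same kind of summary in pandas.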