## A sentiment analysis on tweets related to E-levy.

#### This project analyzes public sentiment toward the e-levy (electronic levy) that the Government of Ghana introduced on digital transactions.

It uses pretrained transformer sentiment models (RoBERTa and BERT) to score the sentiment of hundreds of tweets scraped from Twitter.

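Importing the first set of libraries <br>
<ul>
<li>snscrape ---> a scraper for social networking services </li>
<li>pandas ---> an open source data analysis and manipulation tool </li>
<li>re ---> Python's regular expression module, used here to clean tweet text</li>
<li>preprocessor ---> the tweet-preprocessor package, used to clean tweet-specific tokens</li>
<li>nltk ---> the Natural Language Toolkit, used for tokenization and stop words</li>
</ul>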

In [195]:
import itertools
import re

import nltk
import pandas as pd
import preprocessor as p
import snscrape.modules.twitter as sntwitter
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, word_tokenize

# download the NLTK resources used below
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models needed by word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))



[nltk_data] Downloading package wordnet to C:\Users\Innocent
[nltk_data]     Anyaele\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Innocent
[nltk_data]     Anyaele\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We build our query here to gather English-language e-levy tweets posted between 2022-05-01 and 2022-09-20, excluding tweets that contain links.

We store the gathered tweets in a CSV file -> stream.csv
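
The search string is a plain combination of Twitter search operators, so it can be assembled from its parts. The sketch below is illustrative only: `build_query` and its parameters are hypothetical names, not part of snscrape, and it simply reproduces the string used in `scrapeTweets`.

```python
from datetime import date

def build_query(term, since, until, lang="en", exclude_links=True):
    """Compose a Twitter search string (hypothetical helper, not an snscrape API)."""
    parts = [
        f'"{term}"',                    # exact phrase match
        f"lang:{lang}",                 # language filter
        f"until:{until.isoformat()}",   # upper date bound
        f"since:{since.isoformat()}",   # lower date bound
    ]
    if exclude_links:
        parts.append("-filter:links")   # drop tweets that contain links
    return " ".join(parts)

query = build_query("e-levy", since=date(2022, 5, 1), until=date(2022, 9, 20))
print(query)
```

Keeping the dates as `date` objects makes it harder to swap the bounds or mistype the format by accident.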

In [196]:
def scrapeTweets():
    """Scrape up to `limit` e-levy tweets and save them to stream.csv."""
    query = '"e-levy" lang:en until:2022-09-20 since:2022-05-01 -filter:links'

    tweets = []
    limit = 500

    data = sntwitter.TwitterSearchScraper(query).get_items()
    for tweet in data:
        # stop once we have collected `limit` tweets
        if len(tweets) == limit:
            break
        tweets.append([tweet.id, tweet.content, tweet.date])

    df = pd.DataFrame(tweets, columns=['id', 'Tweet', 'Date'])
    df.to_csv('stream.csv', index=False)
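
The loop in `scrapeTweets` caps a potentially very long generator at `limit` items. A possible alternative is `itertools.islice`, sketched below with a stand-in generator; `fake_stream` and its dictionary items are assumptions made for illustration and do not reflect the objects snscrape returns.

```python
import itertools

def fake_stream():
    """Stand-in for an endless scraper generator (hypothetical)."""
    n = 0
    while True:
        n += 1
        yield {"id": n, "content": f"tweet {n}"}

limit = 5
# islice stops pulling from the generator after `limit` items
rows = [[item["id"], item["content"]]
        for item in itertools.islice(fake_stream(), limit)]
print(len(rows))
```

This removes the manual counter and `break`, at the cost of hiding the stop condition inside the `islice` call.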


Our dataset contains 500 rows, each holding the tweet id, the tweet content and the tweet date.
We print the first 5 tweets.
The raw tweets contain a lot of data that is unnecessary for the analysis, such as mentions and line breaks, so we preprocess them in the next step.
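
To get a rough idea of how much noise the raw text carries, one could count mentions, links and line breaks before cleaning. This is a minimal sketch using only the standard library; the sample tweets and user names are made up, and `noise_report` is a hypothetical helper.

```python
import re

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def noise_report(tweets):
    """Count mentions, links and line breaks across a list of raw tweet strings."""
    return {
        "mentions": sum(len(MENTION.findall(t)) for t in tweets),
        "links": sum(len(URL.findall(t)) for t in tweets),
        "line_breaks": sum(t.count("\n") for t in tweets),
    }

# made-up sample tweets, for illustration only
sample = [
    "@user_a @user_b QUESTION: \nWhy was e-levy charged twice?",
    "@user_c e-levy again... https://t.co/abc123",
]
report = noise_report(sample)
print(report)
```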

In [198]:
df = pd.read_csv('stream.csv')
print(df.shape)
print(df['Tweet'].head(5))

(500, 3)
0    @edburtler @shamimamuslim Notice that Shamima ...
1    @edburtler @shamimamuslim QUESTION: \nWhy was ...
2    @FrankOw18664478 @hearttooclean @mandemthe1st ...
3    @FrankOw18664478 @hearttooclean @mandemthe1st ...
4    @FrankOw18664478 @bra_Kofi__ @hearttooclean @m...
Name: Tweet, dtype: object


The preprocessing stage cleans up our tweets with regular expressions, removing links, tags and extra whitespace.

We also leverage the NLTK library to remove stop words such as "and", "or", "in".

After preprocessing we print some of our tweets, and they look much cleaner than before.
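
The cleaning steps can be sketched with the standard library alone. This is an illustrative variant, not the notebook's function: it uses a tiny hard-coded stop-word set (an assumption standing in for NLTK's English list) and removes @mentions before punctuation, so that user names do not survive as plain words.

```python
import re

# tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"and", "or", "in", "the", "a", "is", "to", "i", "for"}

def clean_tweet(text):
    """Strip links and @mentions first, then punctuation, then stop words."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)          # mentions, while '@' is still present
    text = re.sub(r"[^\w\s]", "", text)        # punctuation and symbols
    tokens = text.lower().split()              # split() also collapses whitespace
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean_tweet("@citi973 I would like a refund for the E-levy... https://t.co/abc123")
print(cleaned)
```

The order of the substitutions matters: once punctuation is stripped, a mention such as `@citi973` can no longer be told apart from an ordinary word.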

In [203]:
def tweet_preprocessing(tweet):
    # regex cleanup: t.co links, runs of dots, trailing dash, punctuation, extra spaces
    tweet = re.sub(r"^https://t\.co/[A-Za-z0-9]*\s", " ", tweet)
    tweet = re.sub(r"\s+https://t\.co/[a-zA-Z0-9]*\s", " ", tweet)
    tweet = re.sub(r"\s+https://t\.co/[a-zA-Z0-9]*$", " ", tweet)
    tweet = re.sub(r"\.\.+", " ", tweet)
    tweet = re.sub(r"-$", "", tweet)
    tweet = re.sub(r"[^\w\s]", "", tweet)
    tweet = re.sub(r"^ +", "", tweet)
    tweet = re.sub(r"  +", " ", tweet)

    # use the preprocessing library to clean what is left
    # note: '@' and '#' were already stripped above, so user names remain as plain words
    tweet = p.clean(tweet)
    tweet = tweet.lower()

    # tokenize and remove stop words
    token_tweet = word_tokenize(tweet)
    filtered_array = [w for w in token_tweet if w not in stop_words]

    return ' '.join(filtered_array)


# applying our preprocessing function to every tweet
df['Tweet'] = df['Tweet'].apply(tweet_preprocessing)
print(df['Tweet'].tail(5))


495    minister communication ursulaow mocghana moigo...
496    georgekankambo4 hmmm tear even cedis elevy npp...
497    vodafoneghana charged elevy transaction less c...
498                            citi973 like refund elevy
499                    everytime pay elevy dey feel sick
Name: Tweet, dtype: object

##### Using the RoBERTa Model


In [204]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [205]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [163]:
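def polarity_scores_roberta(example):
    # tokenize the tweet and run it through the model
    encoded_text = tokenizer(example, return_tensors='pt')
    output = model(**encoded_text)
    # convert the raw logits into probabilities
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    # class order of this model: 0 = negative, 1 = neutral, 2 = positive
    scores_dict = {
        'negative': scores[0],
        'neutral': scores[1],
        'positive': scores[2]
    }
    return scores_dict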


from tqdm import tqdm

# twitter_df is not defined in the cells shown here; it is expected to hold
# the tweets with at least 'id' and 'Tweet' columns
res = {}
for i, row in tqdm(twitter_df.iterrows(), total=len(twitter_df)):
    tweet_id = row['id']
    try:
        res[tweet_id] = polarity_scores_roberta(row['Tweet'])
    except RuntimeError:
        print(f"Broke for {tweet_id}")

print(res)

  0%|          | 0/100 [00:00<?, ?it/s]

{1571994188660744192: {'negative': 0.33652392, 'neutral': 0.20414186, 'positive': 0.1756662}, 1571992159183593473: {'negative': 0.22968423, 'neutral': 0.21303397, 'positive': 0.22564976}, 1571981363699679233: {'negative': 0.17080615, 'neutral': 0.15704201, 'positive': 0.21783318}, 1571981131238543366: {'negative': 0.17584077, 'neutral': 0.12185262, 'positive': 0.15068156}, 1571980917006278656: {'negative': 0.2278475, 'neutral': 0.13903745, 'positive': 0.15184657}, 1571980514512232452: {'negative': 0.4672054, 'neutral': 0.14724894, 'positive': 0.14081949}, 1571960199564460033: {'negative': 0.65381294, 'neutral': 0.17703915, 'positive': 0.10964892}, 1571958217692909568: {'negative': 0.29376715, 'neutral': 0.15764081, 'positive': 0.14446248}, 1571955389612322816: {'negative': 0.76084423, 'neutral': 0.15583694, 'positive': 0.058712028}, 1571955188034056192: {'negative': 0.4105462, 'neutral': 0.2875442, 'positive': 0.18929048}, 1571952217560875010: {'negative': 0.2579873, 'neutral': 0.13353
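
`polarity_scores_roberta` turns the model's raw logits into class probabilities with softmax. The sketch below shows that step in plain Python and picks the most likely class; the logits are made up, and `dominant_label` is a hypothetical helper that is not part of the notebook.

```python
import math

LABELS = ("negative", "neutral", "positive")  # same class order as the notebook's dict

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dominant_label(logits):
    """Return the most likely label together with all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

label, probs = dominant_label([2.0, 0.5, -1.0])  # made-up logits
print(label, [round(p, 3) for p in probs])
```

Softmax probabilities always sum to 1, which gives a quick sanity check on any score dictionary produced this way.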

In [159]:
# one row per tweet id, one column per sentiment class
results_df = pd.DataFrame(res).T
results_df = results_df.reset_index().rename(columns={'index': 'id'})
merged_df = pd.merge(twitter_df, results_df, on='id', how='outer')
print(merged_df)

                     id                                              Tweet  \
0   1571994188660744192  @edburtler @shamimamuslim Notice that Shamima ...   
1   1571992159183593473  @edburtler @shamimamuslim QUESTION: \nWhy was ...   
2   1571981363699679233  @FrankOw18664478 @hearttooclean @mandemthe1st ...   
3   1571981131238543366  @FrankOw18664478 @hearttooclean @mandemthe1st ...   
4   1571980917006278656  @FrankOw18664478 @bra_Kofi__ @hearttooclean @m...   
..                  ...                                                ...   
95  1571510724848934915  @Mandemthe1st @ShoeLhaze @BongoIdeas I should ...   
96  1571510199264985088  @GhanaRevenue why are the Telcos charging e-le...   
97  1571509964140843008  @ShoeLhaze @Mandemthe1st @BongoIdeas Bro this ...   
98  1571509203214434307  @Mandemthe1st @ShoeLhaze @BongoIdeas Is Afro d...   
99  1571500062890663941  @Mandemthe1st @ShoeLhaze @BongoIdeas Lol 😂 😹 y...   

                        Date  negative   neutral  positive  
0 
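
Since the goal is an overall picture of sentiment, the per-tweet scores can be reduced to one label per tweet and then counted. This is a minimal sketch, assuming a dictionary shaped like `res`; the three entries are copied from the printed output above, and `label_counts` is a hypothetical helper.

```python
from collections import Counter

# three entries copied from the printed `res` output
scores = {
    1571994188660744192: {"negative": 0.33652392, "neutral": 0.20414186, "positive": 0.1756662},
    1571981363699679233: {"negative": 0.17080615, "neutral": 0.15704201, "positive": 0.21783318},
    1571960199564460033: {"negative": 0.65381294, "neutral": 0.17703915, "positive": 0.10964892},
}

def label_counts(res):
    """Count how often each class has the highest score."""
    return Counter(max(s, key=s.get) for s in res.values())

counts = label_counts(scores)
shares = {label: n / len(scores) for label, n in counts.items()}
print(counts, shares)
```

Counting the top label discards how confident each prediction was; averaging the three score columns would be an alternative summary.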

##### Using the BERT Model

In [162]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')




In [164]:
import torch

def sentiment_score(tweet):
    # the model predicts a star rating: class index 0..4 maps to 1..5 stars
    tokens = tokenizer.encode(tweet, return_tensors='pt')
    with torch.no_grad():
        result = model(tokens)
    return int(torch.argmax(result.logits)) + 1
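
`sentiment_score` returns a star rating from 1 to 5 by taking the index of the largest logit and adding one. The sketch below shows that mapping in plain Python, plus one possible way to fold stars into the three classes used by the RoBERTa model. The logits are made up and the thresholds in `stars_to_label` are an assumption, not something the notebook defines.

```python
def stars_from_logits(logits):
    """Index of the largest logit (0..4) shifted to a 1..5 star rating."""
    return max(range(len(logits)), key=logits.__getitem__) + 1

def stars_to_label(stars):
    """Fold stars into three classes (thresholds are an assumption)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

stars = stars_from_logits([0.1, 0.2, 0.3, 2.5, 1.0])  # made-up logits
print(stars, stars_to_label(stars))
```

A shared label set like this makes it easier to compare the two models tweet by tweet.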

In [165]:
data1 = twitter_df.copy()  # copy so twitter_df itself is left unchanged
data1['sentiment_score'] = data1['Tweet'].apply(sentiment_score)

In [166]:
data1.head()

Unnamed: 0,id,Tweet,Date,sentiment_score
0,1571994188660744192,@edburtler @shamimamuslim Notice that Shamima ...,2022-09-19 22:46:35+00:00,1
1,1571992159183593473,@edburtler @shamimamuslim QUESTION: \nWhy was ...,2022-09-19 22:38:31+00:00,1
2,1571981363699679233,@FrankOw18664478 @hearttooclean @mandemthe1st ...,2022-09-19 21:55:37+00:00,4
3,1571981131238543366,@FrankOw18664478 @hearttooclean @mandemthe1st ...,2022-09-19 21:54:42+00:00,5
4,1571980917006278656,@FrankOw18664478 @bra_Kofi__ @hearttooclean @m...,2022-09-19 21:53:51+00:00,5
