# <center>Coronavirus Sentiment Analysis (Kenya)</center>

<img src="https://images.unsplash.com/photo-1584118624012-df056829fbd0?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=500&q=60" alt="Credits: CDC on Unsplash" width="950"
         height="350">

## **Project Objective**
* To determine how sentiment/perception of COVID-19 changes over time across different regions of the country (Kenya)

*The regions to be covered include:*

1. *Nairobi*
2. *Mombasa*
3. *Migori*
4. *Kiambu*
5. *Mandera*

*Timespan: 2019-11-01 to 2020-08-01*

## **Sources of Data**
* Twitter

*Search Phrases to look out for include popular hashtags such as:*

1. *#KomeshaCorona*
2. *#COVID19KE*
3. *#UHURUsToughChoices*
4. *#UhuruAddress*
5. *#staysafe*

*Number of tweets to fetch per region = 100000*

## **Tools**
1. Google Colab
2. GitHub
3. Python and its relevant frameworks
4. Docker (**NB: the project will be dockerized at the end**)



# **<center>Issues Arising From Tweepy Approach</center>**

1. ### Tweepy Limitations

* *Twitter offers different types and levels of API access (**Standard, Premium and Enterprise**) for very specific use cases, and tweepy is a Python wrapper around that API. In my case, I was using Standard API access on a free Twitter Developer account:*
   * The Standard API only allows you to scrape tweets up to seven (7) days old.

   * It is limited to scraping only 15K tweets per 15-minute window. This can, however, be increased through [these methods](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./).

   * It can only return a maximum of the 3,200 most recent tweets from a user's timeline.

   * It is suitable when making complex queries or when extensive information for each tweet is needed.
   
2. ### Search Context
* *I could not maintain a search context across the API rate-limit window, which is needed to avoid getting duplicate results when searching repeatedly over a long period of time*

* *Not all tweets matching the search criteria are returned by the API*
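
*One way to cope with duplicates when the same search is repeated is to track the tweet IDs already seen. The following is a minimal sketch in plain Python; the helper `merge_batches` and the sample dictionaries are hypothetical stand-ins for real search results, not part of tweepy.*

```python
def merge_batches(batches):
    """Merge repeated search results, keeping the first copy of each tweet ID."""
    seen_ids = set()
    merged = []
    for batch in batches:
        for tweet in batch:
            if tweet["id"] not in seen_ids:
                seen_ids.add(tweet["id"])
                merged.append(tweet)
    return merged


# Hypothetical results from two overlapping search runs
first_run = [{"id": 101, "text": "a"}, {"id": 102, "text": "b"}]
second_run = [{"id": 102, "text": "b"}, {"id": 103, "text": "c"}]

unique_tweets = merge_batches([first_run, second_run])
print([t["id"] for t in unique_tweets])
```

*This only removes repeats after the fact; it does not recover tweets that the API never returned.*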


# **<center>Adopted Approach: GetOldTweets3</center>**

*The package allows me to work around Twitter's Standard API limitations and is a quick, no-frills way of scraping.*

*However, it does not offer extensive functionality like tweepy.*

In [None]:
#Import Libraries

import GetOldTweets3 as got
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

In [None]:
#Define global variables for the project

'''the phrase to search for 
Can only search for string, not list or otherwise
#Search words below were used only for the case of Nairobi
#KomeshaCorona
#COVID19KE
#UHURUsToughChoices
#UhuruAddress
#staysafe
#UhuruDontLiftLockdown
#CurfewinKenya'''

search_text = 'COVID_19'  

since_date = '2019-11-01' #specifies the date to begin querying/searching from

until_date = '2020-08-01' #specifies the date to end our query/search

count = 100000 #specifies the number of tweets to fetch. Give a high value figure.

# *For the rest of the towns/counties, different hashtags were used as indicated below:*

### Mombasa:

* *COVID_19*

### Migori:
*  *COVID_19*

### Kiambu:
* *COVID_19*
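
*Since the search text must be a single string, each region and search term has to be queried one at a time. The sketch below only builds the list of runs to perform; the helper `build_search_plan` and the dictionary keys are illustrative assumptions, and the actual scraping call is left out.*

```python
from itertools import product

regions = ["Nairobi", "Mombasa", "Migori", "Kiambu", "Mandera"]
search_terms = ["#KomeshaCorona", "#COVID19KE", "COVID_19"]


def build_search_plan(regions, search_terms, since, until):
    """Return one parameter dictionary per (region, search term) pair."""
    return [
        {"near": f"{region},Kenya", "query": term, "since": since, "until": until}
        for region, term in product(regions, search_terms)
    ]


plan = build_search_plan(regions, search_terms, "2019-11-01", "2020-08-01")
print(len(plan))
print(plan[0])
```

*Each entry can then be fed to a separate query, with the results saved to their own CSV file.*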

## <center>**Creation of a Query Object**</center>

#### *I will be using the Python classes Tweet, TweetManager and TweetCriteria of the GetOldTweets3 library*

#### *The search parameters to look out for that suit my purpose in this project are:*
* *Text of the tweet*
* *Location of the user: this gives a general location rather than the precise location of the tweet, since most users do not share the exact tweet location*
* *Date of the tweet*
* *Retweet count: a high count suggests that many people resonate with the tweet*
* *Favorites count*
* *Hashtags: our search_text*

In [None]:
#Execute the code using python classes

#search parameters to be used with the manager class

tweetCriteria = got.manager.TweetCriteria().setQuerySearch(search_text).setSince(since_date)\
    .setUntil(until_date).setNear('Mombasa,Kenya').setMaxTweets(count)

#List of objects get stored in tweets variable

tweets = got.manager.TweetManager.getTweets(tweetCriteria)

#print(tweets + '\n')

#iterate through the tweets list, holding each item temporarily in the tweet variable,
#and store the fields of interest as a row inside tweetList

tweetList = [[tweet.id, tweet.date, tweet.text, tweet.geo, tweet.retweets, tweet.favorites, tweet.hashtags] for tweet in tweets]

#print(tweetList)

In [None]:
#define the columns for your dataframe

columns_new = ['ID', 'DATE', 'TWEET', 'LOCATION', 'RETWEETS', 'FAVORITES', 'HASHTAGS']

#Create a dataframe from the list

df = pd.DataFrame(data=tweetList, columns=columns_new)

df.shape

In [None]:
df.to_csv('nairobi_7.csv')

In [None]:
data1=pd.read_csv('/home/grivine/Desktop/Get_Old_Tweets/nairobi_1.csv')
data2=pd.read_csv('/home/grivine/Desktop/Get_Old_Tweets/nairobi_2.csv')
data3=pd.read_csv('/home/grivine/Desktop/Get_Old_Tweets/nairobi_3.csv')
data4=pd.read_csv('/home/grivine/Desktop/Get_Old_Tweets/nairobi_4.csv')
data5=pd.read_csv('/home/grivine/Desktop/Get_Old_Tweets/nairobi_5.csv')
data6=pd.read_csv('/home/grivine/Desktop/Get_Old_Tweets/nairobi_6.csv')
data7=pd.read_csv('/home/grivine/Desktop/Get_Old_Tweets/nairobi_7.csv')


new_df = pd.concat([data1, data2, data3, data4, data5, data6, data7])

new_df.shape
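
*The batch files can also be loaded in a loop rather than one variable per file. This is a sketch under assumptions: the helper `load_batches` is hypothetical, and temporary sample files stand in for the real `nairobi_*.csv` batches. Reading with `index_col=0` treats the first CSV column as the index, so no `Unnamed: 0` column appears.*

```python
import glob
import os
import tempfile

import pandas as pd


def load_batches(folder, pattern="nairobi_*.csv"):
    """Read every CSV matching the pattern and stack them into one frame."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Write two small sample batches to a temporary folder, then load them back
with tempfile.TemporaryDirectory() as folder:
    for i in (1, 2):
        sample = pd.DataFrame({"ID": [i * 10, i * 10 + 1], "TWEET": ["x", "y"]})
        sample.to_csv(os.path.join(folder, f"nairobi_{i}.csv"))
    combined = load_batches(folder)

print(combined.shape)
```

*`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each batch's own row numbers.*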

In [None]:
df.head(50)

In [None]:
duplicateDFRow = new_df[new_df.duplicated()]
print(duplicateDFRow)
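
*Whole-row comparison can miss repeated tweets when a helper column such as the saved index differs between batches. The sketch below, on made-up sample data, compares rows on the `ID` column only; whether `ID` is the right key for the real data is an assumption.*

```python
import pandas as pd

# Sample data: the same tweet (ID 101) saved in two batches with different row numbers
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 5],
    "ID": [101, 102, 101],
    "TWEET": ["stay safe", "wash hands", "stay safe"],
})

full_row_dupes = sample.duplicated().sum()          # compares every column
id_dupes = sample.duplicated(subset="ID").sum()     # compares the tweet ID only

# Keep the first occurrence of each tweet ID
deduped = sample.drop_duplicates(subset="ID", keep="first")

print(full_row_dupes, id_dupes, len(deduped))
```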

In [None]:
#Checking for null values

new_df.isnull().mean()*100

In [None]:
'''for col in new_df:
    print(col)'''
#drop the index column that to_csv() wrote and read_csv() loaded back as 'Unnamed: 0'
nairobi_final = new_df.drop(['Unnamed: 0'], axis = 1)

In [None]:
nairobi_final.to_csv('nairobi.csv')

In [None]:
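#Check for the max and min dates in the data frame
#DATE is read back from CSV as text, so parse it first; the result may still be off in some instances

dates = pd.to_datetime(new_df['DATE'], errors='coerce')
print(f" Data Available since {dates.min()}")
print(f" Data Available up to {dates.max()}")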

In [None]:
print(f" Maximum number of retweets {new_df.RETWEETS.max()}")
print(f" Maximum number of favorites {new_df.FAVORITES.max()}")

In [None]:

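#wordcloud
#join every tweet into one string; str() on a Series would only give its shortened display text

wordcloud__ = WordCloud(
                          background_color='yellow',
                          stopwords=set(STOPWORDS),
                          max_words=250,
                          max_font_size=40, 
                          random_state=1705
                         ).generate(' '.join(new_df['TWEET'].dropna().astype(str)))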
def cloud_plot(wordcloud):
    fig = plt.figure(1, figsize=(20,15))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
cloud_plot(wordcloud__)
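
*When building one block of text from a column of tweets, note that `str()` on a pandas Series returns its display representation, which is shortened for long Series. The sketch below uses made-up tweets to compare that with joining the values; the display options are set explicitly to pandas' usual defaults.*

```python
import pandas as pd

# 100 made-up tweets, each carrying a unique marker word
tweets = pd.Series([f"tweet text marker_{i}" for i in range(100)])

# str() gives the display form: only the first and last few rows are shown
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    repr_text = str(tweets)

# Joining the values keeps the text of every tweet
full_text = " ".join(tweets.dropna().astype(str))

print("marker_50" in repr_text)
print("marker_50" in full_text)
```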


In [None]:
#The number of tweets according to dates

new_df['DATE'] = pd.to_datetime(new_df['DATE'])  #parse DATE on new_df itself so the .dt accessor works
cnt_srs = new_df['DATE'].dt.date.value_counts()
cnt_srs = cnt_srs.sort_index()
plt.figure(figsize=(14,10))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color='green')
plt.xticks(rotation='vertical')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Number of tweets', fontsize=12)
plt.title("Number of tweets according to dates")
plt.show()
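
*The counting step can be checked on a small sample without plotting. This sketch uses made-up timestamps in the format a timezone-aware date column takes when saved to CSV; the sample values are assumptions, not real tweets.*

```python
import pandas as pd

sample = pd.DataFrame({
    "DATE": [
        "2020-03-13 08:15:00+00:00",
        "2020-03-13 19:40:00+00:00",
        "2020-03-14 07:05:00+00:00",
    ],
})

# Parse the text column into datetimes, then count tweets per calendar day
sample["DATE"] = pd.to_datetime(sample["DATE"], utc=True)
daily_counts = sample["DATE"].dt.date.value_counts().sort_index()

print(daily_counts)
```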