The notion of influence plays an important role in how businesses operate and is closely linked to viral marketing and even to how society functions. Studying influence patterns can help us understand why certain trends or innovations are adopted faster than others, and how advertisers and marketers can design more effective campaigns.

**The goal of this project is to identify the rank positions of influencers in Africa from Twitter data.**

In the next cells, I will scrape data on the top 100 Twitter influencers in Africa and on the most influential government officials.

## Web scraping using Python

#### References
1. [Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
2. [Web Scraping using Python](https://www.datacamp.com/community/tutorials/web-scraping-using-python)

We will need to install two packages:
 1. Requests, for performing HTTP requests
 2. BeautifulSoup4, for handling the HTML processing

In [1]:
!pip install requests BeautifulSoup4 --upgrade

Requirement already up-to-date: requests in /Users/DClinton/opt/anaconda3/lib/python3.7/site-packages (2.24.0)
Requirement already up-to-date: BeautifulSoup4 in /Users/DClinton/opt/anaconda3/lib/python3.7/site-packages (4.9.1)


In [2]:
#importing libraries
import requests
from urllib.request import urlopen as uReq
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup 
import pandas as pd
import os, sys

import fire

# MAKING WEB REQUESTS #

We will make requests to the web using the requests package, wrapped in a few Python helper functions that fetch a page and extract the elements we need.

In [3]:

#%%writefile ../pyscrap_url.py #This jupyter magic helps you to save this codecell as a script.

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content  #.encode(BeautifulSoup.original_encoding)
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
    
def get_elements(url, tag='', search=None, fname=None):
    """
    Downloads a page specified by the url parameter
    and returns a list of strings, one per tag element
    """
    
    if isinstance(url,str):
        response = simple_get(url)
    else:
        #if already it is a loaded html page
        response = url

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        
        res = []
        if tag:    
            for li in html.select(tag):
                for name in li.text.split('\n'):
                    if len(name) > 0:
                        res.append(name.strip())
                       
                
        if search:
            soup = html            
            
            
            r = ''
            if 'find' in search.keys():
                print('finding', search['find'])
                soup = soup.find(**search['find'])
                r = soup

                
            if 'find_all' in search.keys():
                print('finding all of', search['find_all'])
                r = soup.find_all(**search['find_all'])
   
            if r:
                for x in list(r):
                    if len(x) > 0:
                        res.extend(x)
            
        return res

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))    
    
    
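# Expose get_elements on the command line when this cell is saved and run as a
# script; inside IPython/Jupyter get_ipython is defined, so the CLI is skipped.
try:
    get_ipython()
except NameError:
    if __name__ == '__main__':
        fire.Fire(get_elements)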
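
The content-type test can be tried without touching the network. The sketch below is a standalone variant of `is_good_response`, not the notebook's own helper; `FakeResponse` and `looks_like_html` are hypothetical names introduced only for illustration, and reading the header with `.get` is an assumption about how a missing `Content-Type` should be treated.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing the two attributes the check reads."""
    status_code: int
    headers: dict = field(default_factory=dict)


def looks_like_html(resp):
    """Return True for a 200 response whose Content-Type mentions HTML."""
    # .get() avoids a KeyError when no Content-Type header is present.
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


page = FakeResponse(200, {'Content-Type': 'text/html; charset=UTF-8'})
image = FakeResponse(200, {'Content-Type': 'image/png'})
missing = FakeResponse(404)
print(looks_like_html(page), looks_like_html(image), looks_like_html(missing))
```

Only the first response passes: it is the only one that is both a 200 and labelled as HTML.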

## The hundred most influential people in Africa and their Twitter names ##

In [4]:
#We now call the function and pass the tag that holds the data we need.
res = get_elements('https://africafreak.com/100-most-influential-twitter-users-in-africa',tag='h2')
#res

In [5]:
#We write the data we scraped to a csv file, one entry per line
with open('csvfile.csv','w') as file:
    for line in res:
        file.write(line)
        file.write('\n')
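
Writing each entry as a raw line works as long as no entry contains a comma; otherwise `pd.read_csv` would split that entry into extra columns. A minimal sketch, assuming quoting is wanted, uses the standard `csv` module. The sample entries (including the made-up `@example` account) and the in-memory buffer are illustrative only.

```python
import csv
import io

# Illustrative entries; the second one contains a comma on purpose.
entries = ['1. Trevor Noah (@Trevornoah)', '0. Example, Inc. (@example)']

buffer = io.StringIO()          # stands in for open('csvfile.csv', 'w', newline='')
writer = csv.writer(buffer)
for entry in entries:
    writer.writerow([entry])    # one column per row; commas get quoted

buffer.seek(0)
rows = [row[0] for row in csv.reader(buffer)]
print(rows)
```

Reading the buffer back returns the entries unchanged, comma included.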

In [6]:
#Reading the csv file
data=pd.read_csv("csvfile.csv",header=None)

We will drop the rows that we don't need and do some cleaning on the data.

In [7]:
#dropping rows 100-104, which we don't need
data.drop([100,101,102,103,104],axis=0,inplace=True)

In [8]:
#We will rename our column to influncers
data.columns=['influncers']

In [9]:
data

Unnamed: 0,influncers
0,100. Jeffrey Gettleman (@gettleman)
1,99. Africa24 Media (@a24media)
2,98. Scapegoat (@andiMakinana)
3,97. Africa Check (@AfricaCheck)
4,96. James Copnall (@JamesCopnall)
...,...
95,5. Julius Sello Malema (@Julius_S_Malema)
96,4. News24 (@News24)
97,3. Jacob G. Zuma (@SAPresident)
98,2. Gareth Cliff (@GarethCliff)


In [10]:
data['influncers']

0           100. Jeffrey Gettleman (@gettleman)
1                99. Africa24 Media (@a24media)
2                 98. Scapegoat (@andiMakinana)
3               97. Africa Check (@AfricaCheck)
4             96. James Copnall (@JamesCopnall)
                        ...                    
95    5. Julius Sello Malema (@Julius_S_Malema)
96                          4. News24 (@News24)
97              3. Jacob G. Zuma (@SAPresident)
98               2. Gareth Cliff (@GarethCliff)
99                 1. Trevor Noah (@Trevornoah)
Name: influncers, Length: 100, dtype: object

In [11]:
#splitting the rank from the names (split only on the first '.')
data[['Rank','Names']] = data.influncers.str.split('.', n=1, expand=True)

In [12]:
data

Unnamed: 0,influncers,Rank,Names
0,100. Jeffrey Gettleman (@gettleman),100,Jeffrey Gettleman (@gettleman)
1,99. Africa24 Media (@a24media),99,Africa24 Media (@a24media)
2,98. Scapegoat (@andiMakinana),98,Scapegoat (@andiMakinana)
3,97. Africa Check (@AfricaCheck),97,Africa Check (@AfricaCheck)
4,96. James Copnall (@JamesCopnall),96,James Copnall (@JamesCopnall)
...,...,...,...
95,5. Julius Sello Malema (@Julius_S_Malema),5,Julius Sello Malema (@Julius_S_Malema)
96,4. News24 (@News24),4,News24 (@News24)
97,3. Jacob G. Zuma (@SAPresident),3,Jacob G. Zuma (@SAPresident)
98,2. Gareth Cliff (@GarethCliff),2,Gareth Cliff (@GarethCliff)


In [13]:
#splitting the name from the Twitter handle (split only on the first '(')
data[['Name','Twittername']] = data.Names.str.split('(', n=1, expand=True)
data

Unnamed: 0,influncers,Rank,Names,Name,Twittername
0,100. Jeffrey Gettleman (@gettleman),100,Jeffrey Gettleman (@gettleman),Jeffrey Gettleman,@gettleman)
1,99. Africa24 Media (@a24media),99,Africa24 Media (@a24media),Africa24 Media,@a24media)
2,98. Scapegoat (@andiMakinana),98,Scapegoat (@andiMakinana),Scapegoat,@andiMakinana)
3,97. Africa Check (@AfricaCheck),97,Africa Check (@AfricaCheck),Africa Check,@AfricaCheck)
4,96. James Copnall (@JamesCopnall),96,James Copnall (@JamesCopnall),James Copnall,@JamesCopnall)
...,...,...,...,...,...
95,5. Julius Sello Malema (@Julius_S_Malema),5,Julius Sello Malema (@Julius_S_Malema),Julius Sello Malema,@Julius_S_Malema)
96,4. News24 (@News24),4,News24 (@News24),News24,@News24)
97,3. Jacob G. Zuma (@SAPresident),3,Jacob G. Zuma (@SAPresident),Jacob G. Zuma,@SAPresident)
98,2. Gareth Cliff (@GarethCliff),2,Gareth Cliff (@GarethCliff),Gareth Cliff,@GarethCliff)


In [14]:
# We can now drop the Names and influncers columns
data = data.drop("Names", axis=1)
data = data.drop("influncers",axis=1)

In [15]:
#stripping the closing parenthesis from the Twittername column
data['Twittername'] = data['Twittername'].str.rstrip(')')
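
The rank, name and handle can also be pulled out in a single pass with a regular expression. This is a sketch rather than the method used in the cells above; `parse_entry` and the pattern are assumptions based on the `rank. name (@handle)` shape seen in the scraped headings.

```python
import re

# rank, a dot, the display name, then the handle in parentheses
ENTRY = re.compile(r'^\s*(\d+)\.\s*(.+?)\s*\((@\w+)\)\s*$')


def parse_entry(text):
    """Split '100. Jeffrey Gettleman (@gettleman)' into (rank, name, handle)."""
    match = ENTRY.match(text)
    if match is None:
        return None  # the line does not follow the expected shape
    rank, name, handle = match.groups()
    return int(rank), name, handle


print(parse_entry('100. Jeffrey Gettleman (@gettleman)'))
print(parse_entry('3. Jacob G. Zuma (@SAPresident)'))
```

Because only the first dot after the rank is consumed, a name such as `Jacob G. Zuma` keeps its internal dot.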

In [16]:
#saving the data to a csv file (raw string so the backslashes are not read as escapes)
data.to_csv(r'\Desktop\export_dataframe.csv', index=False, header=True)

In [17]:
data

Unnamed: 0,Rank,Name,Twittername
0,100,Jeffrey Gettleman,@gettleman
1,99,Africa24 Media,@a24media
2,98,Scapegoat,@andiMakinana
3,97,Africa Check,@AfricaCheck
4,96,James Copnall,@JamesCopnall
...,...,...,...
95,5,Julius Sello Malema,@Julius_S_Malema
96,4,News24,@News24
97,3,Jacob G. Zuma,@SAPresident
98,2,Gareth Cliff,@GarethCliff


In [18]:
#reverse the rows so that the most influential account comes first
df1=data.reindex(index=data.index[::-1])

In [19]:
df2 = df1.reset_index(drop=True)

In [20]:
df3=df2.head(10)

In [21]:
list100=df3['Twittername'] #keeping only the Twitter handles of the top ten

In [22]:
df3

Unnamed: 0,Rank,Name,Twittername
0,1,Trevor Noah,@Trevornoah
1,2,Gareth Cliff,@GarethCliff
2,3,Jacob G. Zuma,@SAPresident
3,4,News24,@News24
4,5,Julius Sello Malema,@Julius_S_Malema
5,6,Helen Zille,@helenzille
6,7,mailandguardian,@mailandguardian
7,8,5FM,@5FM
8,9,loyiso gola,@loyisogola
9,10,Computicket,@Computicket


In [23]:
df3.to_csv(r'10most_twitter_influencers.csv', index = False, header=True) #saving the top ten influencers 

### Scraping the most influential government officials ###

There are several ways to scrape data from a website. The following cells outline a few of them; only one is actually used.

In [24]:
#If the elements you are scraping share something such as the same class, you can use this code segment.
#The code collects the data and saves it in a csv file
'''
Name = []
Twitter_name = []
for item in soup.findAll('span', {'class': 'css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0'}):
    Name.append(item.get_text(strip=True))
for item in soup.findAll('span', {'class': 'css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0'}):
    Twitter_name.append(item.text)
data = []
for items in zip(Name,Twitter_name):
    data.append(items)

with open('output.csv', 'w+', newline='', encoding='UTF-8-SIG') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Twitter_name'])
    writer.writerows(data)
    print("Operation Completed")
'''


In [25]:
# Attempts to get the content at `url` by making an HTTP GET request.
url= 'https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa'
response = simple_get(url)

In [26]:
''' calling the function that is at the start of this notebook to get the data then through doing python string 
manipulation you can get the list of the top government officials'''
#res = get_elements(response, search={'find_all':{'class_':'wp-block-embed__wrapper'}})
#res


In [27]:
#Getting the content from the url by making a request
url="https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#southern-africa"
r=requests.get(url)
soup=BeautifulSoup(r.content, 'html.parser') #name the parser explicitly to avoid bs4's warning
links=soup.find_all('blockquote') #blockquote is the tag that contains the information we need.


In [28]:
#Changing the content in blockquote to string
list_str = str(links)

In [29]:
list_str

'[<blockquote class="twitter-tweet" data-dnt="true" data-width="550"><p dir="ltr" lang="en">The Deputy Prime Minister Themba Masuku has today met representatives of the private sector and employees\' unions to map a collaborative effort in the fight against <a href="https://twitter.com/hashtag/COVID19?src=hash&amp;ref_src=twsrc%5Etfw">#COVID19</a>. <a href="https://t.co/EIYNGOEKRN">pic.twitter.com/EIYNGOEKRN</a></p>— Eswatini Government (@EswatiniGovern1) <a href="https://twitter.com/EswatiniGovern1/status/1241038139889721346?ref_src=twsrc%5Etfw">March 20, 2020</a></blockquote>, <blockquote class="twitter-tweet" data-dnt="true" data-width="550"><p dir="ltr" lang="en">GUIDELINES FOR SCHOOLS IN <a href="https://twitter.com/hashtag/MALAWI?src=hash&amp;ref_src=twsrc%5Etfw">#MALAWI</a> ON THE PREVENTION AND MANAGEMENT OF <a href="https://twitter.com/hashtag/COVID19?src=hash&amp;ref_src=twsrc%5Etfw">#COVID19</a> <a href="https://twitter.com/hashtag/CORONAVIRUS?src=hash&amp;ref_src=twsrc%5Etf

In [30]:
#using a regex to extract the parenthesised text, which contains the Twitter names
import re
res = re.findall(r'\(.*?\)', list_str) 
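
The pattern `\(.*?\)` captures every parenthesised span, so text such as `(COGTA)` or `(Act No. 57 of 2002)` comes along with the handles and has to be dropped by hand later. A tighter pattern is one possible alternative; the sketch below assumes handles consist of an `@` followed by word characters, and the `sample` string is illustrative rather than the scraped page.

```python
import re

# Illustrative text mixing real handles with other parenthesised spans.
sample = ('Eswatini Government (@EswatiniGovern1) ... (COGTA) ... '
          '(Act No. 57 of 2002) ... (@PresidencyZA)')

# Only parentheses that wrap an @handle are matched; the group drops the parentheses.
handles = re.findall(r'\((@\w+)\)', sample)
print(handles)
```

With this pattern the two strip steps and the manual row deletion would no longer be needed, at the cost of silently skipping any handle written in an unexpected form.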

In [31]:
#changing the string list into a dataframe
df=pd.DataFrame(res)
df.columns=['Twitter_names']

In [32]:
df.head(10)

Unnamed: 0,Twitter_names
0,(@EswatiniGovern1)
1,(@MalawiGovt)
2,(@hagegeingob)
3,(@FinanceSC)
4,(COGTA)
5,(Act No. 57 of 2002)
6,(@PresidencyZA)
7,(@mohzambia)
8,(@edmnangagwa)
9,(@MinSantedj)


Cleaning the data

In [33]:
#stripping the closing parenthesis from the Twitter_names column
df['Twitter_names'] = df['Twitter_names'].str.rstrip(')')

In [34]:
df

Unnamed: 0,Twitter_names
0,(@EswatiniGovern1
1,(@MalawiGovt
2,(@hagegeingob
3,(@FinanceSC
4,(COGTA
5,(Act No. 57 of 2002
6,(@PresidencyZA
7,(@mohzambia
8,(@edmnangagwa
9,(@MinSantedj


In [35]:
#stripping the opening parenthesis from the Twitter_names column
df['Twitter_names'] = df['Twitter_names'].str.lstrip('(')

In [36]:
# Delete rows that are not twitter handles  
newlist = df.drop([df.index[4],df.index[5],df.index[14],df.index[19],df.index[36]])

In [37]:
len(newlist) #Length of the returned list

36

# ANALYZING INFLUENCERS' AND GOVERNMENT OFFICIALS' TWEETS #

We now have the lists of top influencers in Africa and of government officials, so we can extract their data from Twitter.

We will first import the libraries that we will use in this exercise.

In [38]:
#importing libraries
import sys
import os
import time
import csv
import json
import pandas as pd
import matplotlib.pyplot as plt
import re
import string
from datetime import datetime, date, timedelta  # datetime.time is left out so it does not shadow the time module
from collections import Counter
from tweepy import OAuthHandler
from tweepy import API
from tweepy import Cursor
# to view all columns
pd.set_option("display.max.columns", None)

In [39]:
#Import the necessary methods from tweepy library

#install tweepy if you don't have it; this notebook uses the tweepy 3.x API (e.g. StreamListener)
!pip install "tweepy<4"
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#sentiment analysis package
#!pip install textblob
from textblob import TextBlob

#general text pre-processor
!pip install nltk
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')

#tweet pre-processor 
!pip install tweet-preprocessor
import preprocessor as p



[nltk_data] Downloading package punkt to /Users/DClinton/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




In [40]:
def print_full(x):
  '''
  This is to print nicely DataFrame wide tables
  '''
  pd.set_option('display.max_rows', len(x))
  pd.set_option('display.max_columns', None)
  pd.set_option('display.width', 2000)
  pd.set_option('display.float_format', '{:20,.2f}'.format)
  pd.set_option('display.max_colwidth', None)  # None means no limit; -1 is deprecated
  print(x)
  pd.reset_option('display.max_rows')
  pd.reset_option('display.max_columns')
  pd.reset_option('display.width')
  pd.reset_option('display.float_format')
  pd.reset_option('display.max_colwidth')

In [61]:

# Creating the authentication object; the keys are read from environment variables
# so that no credentials are stored in the notebook
auth = tweepy.OAuthHandler(os.environ['consumer_key'], os.environ['consumer_secret'])
# Setting your access token and secret
auth.set_access_token(os.environ['access_token'], os.environ['access_token_secret'])
# Creating the API object while passing in auth information
api = tweepy.API(auth, wait_on_rate_limit=True,
                     wait_on_rate_limit_notify=True)
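
API keys are best kept out of the notebook itself. A minimal sketch, assuming the four environment variable names that this notebook reads with `os.environ.get` (`consumer_key`, `consumer_secret`, `access_token`, `access_token_secret`); `load_credentials` is a hypothetical helper, not part of tweepy.

```python
import os

REQUIRED = ('consumer_key', 'consumer_secret', 'access_token', 'access_token_secret')


def load_credentials(env=os.environ):
    """Return the four Twitter credentials, failing loudly if any is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: env[name] for name in REQUIRED}


# Demonstration with a plain dict instead of the real environment.
demo = dict.fromkeys(REQUIRED, 'placeholder')
print(sorted(load_credentials(demo)))
```

Failing early with the list of missing names is easier to diagnose than an authentication error raised later by the API.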

In [42]:
#To obtain valid screen name avoiding Tweepy error 404

names = []
for influencer in list100:      
    try:
        u=api.get_user(influencer)
        names.append(u.screen_name)
    except Exception:
        pass  #skip handles that cannot be resolved
names[:10]

['Trevornoah',
 'GarethCliff',
 'SAPresident',
 'News24',
 'Julius_S_Malema',
 'helenzille',
 'mailandguardian',
 '5FM',
 'loyisogola',
 'Computicket']

In [43]:
#We can also use this code segment, which writes a csv file with each tweet's id, creation time, retweet_count, favorite_count and text

consumer_key = os.environ.get('consumer_key')
consumer_secret = os.environ.get('consumer_secret')
access_token = os.environ.get('access_token')
access_token_secret = os.environ.get('access_token_secret')

def get_tweets(screen_name):

    #initialize a list to hold all the tweepy Tweets
    alltweets = []

    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name, count=200)

    #keep grabbing tweets until there are no tweets left to grab.
    #The API only serves roughly the 3,200 most recent tweets of a user.
    while len(new_tweets) > 0:
        #save most recent tweets
        alltweets.extend(new_tweets)

        #save the id of the oldest tweet minus one
        oldest = alltweets[-1].id - 1
        print("...%s tweets downloaded so far, getting tweets before %s" % (len(alltweets), oldest))

        #all subsequent requests use the max_id arg to prevent duplicates
        new_tweets = api.user_timeline(screen_name = screen_name,count=200, max_id=oldest)

    #transform the tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at,tweet.retweet_count,tweet.favorite_count, tweet.text] for tweet in alltweets]

    #write the csv (newline='' avoids blank rows on Windows; utf-8 keeps non-ASCII text intact)
    with open('%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["id","created_at","retweet_count","favorite_count","text"])
        writer.writerows(outtweets)
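
The `max_id` paging pattern in `get_tweets` can be tried offline. In the sketch below, `fetch_page` is a hypothetical stand-in for `api.user_timeline` that serves integer ids from an in-memory list, newest first; it only illustrates why asking for ids up to "oldest minus one" walks back through the timeline without duplicates.

```python
def fetch_page(timeline, count, max_id=None):
    """Hypothetical stand-in for api.user_timeline: newest first, ids <= max_id."""
    eligible = [t for t in timeline if max_id is None or t <= max_id]
    return eligible[:count]


def collect_all(timeline, count=200):
    """Page backwards through a timeline using the max_id pattern."""
    collected = []
    page = fetch_page(timeline, count)
    while page:
        collected.extend(page)
        oldest = collected[-1] - 1          # one below the oldest id seen so far
        page = fetch_page(timeline, count, max_id=oldest)
    return collected


fake_timeline = list(range(1000, 0, -1))    # 1000 fake tweet ids, newest first
ids = collect_all(fake_timeline, count=200)
print(len(ids), len(set(ids)))
```

Checking for an empty page before reading `collected[-1]` also keeps the loop safe for an account with no tweets.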

In [44]:
'''# fetch the unique handles from the top_100 and leaders dataframes
# convert to list the merge them to be one list.
#we also pass tweepy error 
the_100 = list100.unique()
the_leaders_response = newlist.unique()
l1 = the_100.astype(str).tolist() 
l2 = the_leaders_response.astype(str).tolist()
accounts = l1 + l2
try:
    if __name__ == '__main__':
    #loop through the handles in the list
        for i,name in enumerate(accounts):
              get_tweets(name)
            
except tweepy.error.TweepError:
    pass
      '''

In [45]:
'''#getting the number of comments , 


comment_count = []    
for influencer in list100:
    influencers_tweets = api.user_timeline(screen_name =influencer, count=200)
    cm = 0
    for tweet in influencers_tweets:
        cm += tweet.reply_count
    comment_count.append(cm)    
    
#! ONLY AVAILABLE FOR PREMIUM USERS ON TWITTER API'''


In [46]:
#collecting mentions for the influencers in list100

mentions = []

for influencer in list100:
    try:
        for status in tweepy.Cursor(api.user_timeline, id=influencer).items():
            #each status carries an entities dict; user_mentions lists the accounts it mentions
            for ent in getattr(status, "entities", {}).get("user_mentions", []):
                name = ent.get("screen_name") if ent else None
                if name is not None:
                    mentions.append(name)
    except Exception:
        pass  #skip accounts whose timeline cannot be read

In [47]:
mention_csv=pd.DataFrame(mentions)
mention_csv.columns=['mention']

mention_csv.to_csv('mention.csv')

mention_csv.head()

Unnamed: 0,mention
0,KingBach
1,franklinleonard
2,SawyerHackett
3,kimlatricejones
4,DEADLINE
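
A quick way to see which accounts are mentioned most often is `collections.Counter`, which this notebook already imports. The list below is illustrative: it reuses a few names from the output above with made-up repetitions, whereas in the notebook the input would be the `mentions` list.

```python
from collections import Counter

# Illustrative list; the repetitions are invented for the example.
sample_mentions = ['KingBach', 'DEADLINE', 'KingBach',
                   'franklinleonard', 'KingBach', 'DEADLINE']

counts = Counter(sample_mentions)
top = counts.most_common(2)     # the two most frequently mentioned accounts
print(top)
```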


In [48]:
#collecting mentions for the top government officials

govt_mentions = []

#iterate over the handles in the column, not over the DataFrame's column names
for official in newlist['Twitter_names']:
    try:
        for status in tweepy.Cursor(api.user_timeline, id=official).items():
            #each status carries an entities dict; user_mentions lists the accounts it mentions
            for ent in getattr(status, "entities", {}).get("user_mentions", []):
                name = ent.get("screen_name") if ent else None
                if name is not None:
                    govt_mentions.append(name)
    except Exception:
        pass  #skip accounts whose timeline cannot be read

In [49]:
#use the officials' mentions here, not the influencers' `mentions` list
govt_mention_csv=pd.DataFrame(govt_mentions, columns=['mention'])
govt_mention_csv.to_csv('African_leaders_mention.csv')

govt_mention_csv.head()


In [54]:
#collecting the hashtags of the influencers
influencers_hashtags = {}

for influencer in list100:
    hs = []
    try:
        for status in tweepy.Cursor(api.user_timeline, id=influencer).items():
            #the hashtags entity holds one dict per hashtag, with the tag under "text"
            for ent in getattr(status, "entities", {}).get("hashtags", []):
                hashtag = ent.get("text") if ent else None
                if hashtag is not None:
                    hs.append(hashtag)
        influencers_hashtags[influencer] = hs
    except Exception:
        pass  #skip accounts whose timeline cannot be read

In [55]:

dataframe1=pd.DataFrame.from_dict(influencers_hashtags, orient='index')
dataframe1=dataframe1.transpose()
dataframe1.head(10) 

Unnamed: 0,@Trevornoah,@GarethCliff,@SAPresident,@News24,@Julius_S_Malema,@helenzille,@mailandguardian,@5FM,@loyisogola,@Computicket
0,BlackPeoplePenance,Lockdown,FridayFelling,RIPKaundaNtunja,EFFTurns7,GBV,SafeHands,ForbesAndFix,FACup,VanPletzen
1,BlackLivesMatter,Covid19,FullLoadWorkPresure,MapitiMatsena,JuliusMalema,Women,COVID19,UnpopularOpinion,CRYMUN,GMABenefitConcert
2,BlackOutTuesday,SoWhatNow,business,MapitiMatsena,BlackLivesMatter,VBSArrests,COVID19,MidMorningsOn5,LaLiga,GMABenefitConcert
3,JusticeForGeorgeFloyd,MoonyeennLee,Self,MapitiMatsena,RIPZindziMandela,VBS,Africa,HappyMonday,LaLiga,MissSa2020
4,CincUp,BlindHistory,Motivation,StateCaptureInquiry,MakhayaNtini,day78oflockdown,SouthAfrica,GoodeMix,ARSLIV,FaceYourPower
5,coronavirus,KnowYourHistory,confidence,StateCaptureInquiry,COVID19,ClassicFawlty,COVID19lockdown,GoodeMorning,Sopranos,EmbraceYourFuture
6,FallonAtHome,SoWhatNow,Tesla,StateCaptureInquiry,MakhayaNtini,FawltyTowers,Covid19,GoodeMorning,SaveLiveComedy,MissSATop15
7,ClubQuarantine,SoWhatNow,,StateCaptureInquiry,TBTChallenge,CoronaCast,Eskom,GoodeMorning,ARSLEI,SARIEGESELS
8,WithMe,SoWhatNow,,StateCaptureInquiry,SAMA26,CoronaCast,67minutes,TheStirUp,ARSLEI,1stOfAll
9,AwholeNewSong,SoWhatNow,,StateCaptureInquiry,Loadshedding,UCT,COVID__19,TheStirUp,CRYCHE,1stOfAll
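
`pd.DataFrame.from_dict(..., orient='index')` followed by `transpose()` gives one column per account and pads the shorter hashtag lists with missing values, which is why some cells in the table are empty. The same column-wise padding can be sketched with the standard library; the dictionary below is illustrative, not collected data.

```python
from itertools import zip_longest

# Illustrative hashtag lists of unequal length.
hashtags = {'@a': ['x', 'y', 'z'], '@b': ['x']}

header = list(hashtags)
# zip_longest walks the lists in parallel and fills the gaps of the shorter ones.
rows = list(zip_longest(*hashtags.values(), fillvalue=''))
print(header)
print(rows)
```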


In [64]:
dataframe1.to_csv('influencers_hashtag.csv')

In [62]:
#collecting the hashtags of the government officials
govt_hashtags = {}

#iterate over the officials' handles, not over the influencers in list100
for official in newlist['Twitter_names']:
    hs = []
    try:
        for status in tweepy.Cursor(api.user_timeline, id=official).items():
            #the hashtags entity holds one dict per hashtag, with the tag under "text"
            for ent in getattr(status, "entities", {}).get("hashtags", []):
                hashtag = ent.get("text") if ent else None
                if hashtag is not None:
                    hs.append(hashtag)
        govt_hashtags[official] = hs
    except Exception:
        pass  #skip accounts whose timeline cannot be read

In [59]:
dataframe2=pd.DataFrame.from_dict(govt_hashtags, orient='index')
dataframe2=dataframe2.transpose()
dataframe2.head(10) 


In [63]:
dataframe2.to_csv('govt_hashtag.csv')

## CONCLUSION ##

In this notebook we scraped data from two websites and collected Twitter data for the accounts that we found on them. The next notebook will analyse the data we have gathered and produce some plots.