The notion of influence plays an important role in how businesses operate and is closely linked to viral marketing and even to how society functions. Studying influence patterns can help us better understand why certain trends or innovations are adopted faster than others and how we could help advertisers and marketers design more effective campaigns.

**The goal of this project is to identify the rank positions of influencers in Africa from Twitter data.**

## Web scraping using Python

#### References
1. [Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
2. [Web Scraping using Python](https://www.datacamp.com/community/tutorials/web-scraping-using-python)

We will need to install two packages:
 1. Requests, for performing HTTP requests
 2. BeautifulSoup4, for handling the HTML processing

In [1]:
!pip install requests BeautifulSoup4 --upgrade

Requirement already up-to-date: requests in /Users/DClinton/opt/anaconda3/lib/python3.7/site-packages (2.24.0)
Requirement already up-to-date: BeautifulSoup4 in /Users/DClinton/opt/anaconda3/lib/python3.7/site-packages (4.9.1)


In [2]:
# importing libraries
import requests
from urllib.request import urlopen as uReq
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup 
import pandas as pd
import os, sys

import fire

# MAKING WEB REQUESTS

We will make requests to the web using the requests package, together with a few Python helper functions that fetch and parse the pages we need.

In [3]:

#%%writefile ../pyscrap_url.py #This jupyter magic helps you to save this codecell as a script.

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content  #.encode(BeautifulSoup.original_encoding)
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200 
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
    
def get_elements(url, tag='', search=None, fname=None):
    """
    Downloads a page specified by the url parameter
    and returns a list of strings, one per tag element
    """
    
    if isinstance(url,str):
        response = simple_get(url)
    else:
        # if it is already a loaded HTML page
        response = url

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        
        res = []
        if tag:    
            for li in html.select(tag):
                for name in li.text.split('\n'):
                    if len(name) > 0:
                        res.append(name.strip())
                       
                
        if search:
            soup = html            
            
            
            r = ''
            if 'find' in search.keys():
                print('finding', search['find'])
                soup = soup.find(**search['find'])
                r = soup

                
            if 'find_all' in search.keys():
                print('finding all of', search['find_all'])
                r = soup.find_all(**search['find_all'])
   
            if r:
                for x in list(r):
                    if len(x) > 0:
                        res.extend(x)
            
        return res

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))    
    
    
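if __name__ == '__main__':
    try:
        get_ipython()  # defined only inside IPython/Jupyter, where no CLI is wanted
    except NameError:
        # Run as a plain script: expose get_elements as a command-line tool
        fire.Fire(get_elements)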
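
To see what the `tag='h2'` path of `get_elements` collects without touching the network, here is a minimal sketch that uses only the standard library's `html.parser` in place of BeautifulSoup. The `HeadingCollector` class and the `sample_html` snippet are made up for illustration and are not part of the scraper above.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the stripped text of every <h2> element it is fed."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so buffer it until </h2>.
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            text = "".join(self._buffer).strip()
            if text:
                self.headings.append(text)
            self._in_h2 = False


# Hypothetical page fragment shaped like the headings we expect to scrape.
sample_html = """
<html><body>
  <h2>100. Jeffrey Gettleman (@gettleman)</h2>
  <p>Some description text.</p>
  <h2>99. Africa24 Media (@a24media)</h2>
</body></html>
"""

collector = HeadingCollector()
collector.feed(sample_html)
headings = collector.headings
```

Only the two heading strings end up in `headings`; the paragraph text is ignored, which is the same filtering we rely on when passing `tag='h2'`.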

The data we scrape is a list of the hundred most influential Twitter users in Africa and their Twitter names.

In [4]:
# We now call the function and pass the tag of the data we need.
res = get_elements('https://africafreak.com/100-most-influential-twitter-users-in-africa',tag='h2')
#res

In [5]:
# We write the data we scraped into a csv file
with open('csvfile.csv','w') as file:
    for line in res:
        file.write(line)
        file.write('\n')
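
Writing raw lines works as long as no scraped heading contains a comma. A possible alternative, sketched here with the standard `csv` module, quotes such values automatically. The list `sample_res` stands in for `res`, and its second entry is invented purely to show a comma being handled.

```python
import csv
import io

# Stand-in for the scraped list `res`; the second entry is invented.
sample_res = [
    "100. Jeffrey Gettleman (@gettleman)",
    "0. Example, With Comma (@example)",
]

# In-memory file for the demo; on disk use open('csvfile.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
for line in sample_res:
    writer.writerow([line])  # one field per row, quoted when needed

# Reading the text back yields exactly one field per row.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```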

In [6]:
#Reading the csv file
data=pd.read_csv("csvfile.csv",header=None)

We will drop the rows that we don't need and do some cleaning on the data.

In [7]:
#dropping the rows we don't need
data.drop([100,101,102,103,104],axis=0,inplace=True)
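
Dropping rows by hard-coded position depends on the page layout at the time of scraping. A more layout-independent option, sketched below in plain Python, keeps only the headings that look like a ranked entry. The notebook does not show what rows 100 to 104 contain, so the two non-matching headings in `headings` are invented.

```python
import re

# Matches headings of the form "<rank>. <name> (@handle)".
RANKED_ENTRY = re.compile(r"^\d+\.\s.+\(@\w+\)\s*$")

# Stand-in for the scraped headings; the last two are invented.
headings = [
    "2. Gareth Cliff (@GarethCliff)",
    "1. Trevor Noah (@Trevornoah)",
    "Related Posts",
    "Leave a Comment",
]

ranked = [h for h in headings if RANKED_ENTRY.match(h)]
```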

In [8]:

data.columns=['influncers']

In [9]:
data

Unnamed: 0,influncers
0,100. Jeffrey Gettleman (@gettleman)
1,99. Africa24 Media (@a24media)
2,98. Scapegoat (@andiMakinana)
3,97. Africa Check (@AfricaCheck)
4,96. James Copnall (@JamesCopnall)
...,...
95,5. Julius Sello Malema (@Julius_S_Malema)
96,4. News24 (@News24)
97,3. Jacob G. Zuma (@SAPresident)
98,2. Gareth Cliff (@GarethCliff)


In [10]:
data['influncers']

0           100. Jeffrey Gettleman (@gettleman)
1                99. Africa24 Media (@a24media)
2                 98. Scapegoat (@andiMakinana)
3               97. Africa Check (@AfricaCheck)
4             96. James Copnall (@JamesCopnall)
                        ...                    
95    5. Julius Sello Malema (@Julius_S_Malema)
96                          4. News24 (@News24)
97              3. Jacob G. Zuma (@SAPresident)
98               2. Gareth Cliff (@GarethCliff)
99                 1. Trevor Noah (@Trevornoah)
Name: influncers, Length: 100, dtype: object

In [11]:
#splitting the rank from the names
data[['Rank','Names']] = data['influncers'].str.split('.', n=1, expand=True)

In [12]:
data

Unnamed: 0,influncers,Rank,Names
0,100. Jeffrey Gettleman (@gettleman),100,Jeffrey Gettleman (@gettleman)
1,99. Africa24 Media (@a24media),99,Africa24 Media (@a24media)
2,98. Scapegoat (@andiMakinana),98,Scapegoat (@andiMakinana)
3,97. Africa Check (@AfricaCheck),97,Africa Check (@AfricaCheck)
4,96. James Copnall (@JamesCopnall),96,James Copnall (@JamesCopnall)
...,...,...,...
95,5. Julius Sello Malema (@Julius_S_Malema),5,Julius Sello Malema (@Julius_S_Malema)
96,4. News24 (@News24),4,News24 (@News24)
97,3. Jacob G. Zuma (@SAPresident),3,Jacob G. Zuma (@SAPresident)
98,2. Gareth Cliff (@GarethCliff),2,Gareth Cliff (@GarethCliff)


In [13]:
#splitting the Names and Twitter account name
data[['Name','Twittername']] = data['Names'].str.split('(', n=1, expand=True)
data

Unnamed: 0,influncers,Rank,Names,Name,Twittername
0,100. Jeffrey Gettleman (@gettleman),100,Jeffrey Gettleman (@gettleman),Jeffrey Gettleman,@gettleman)
1,99. Africa24 Media (@a24media),99,Africa24 Media (@a24media),Africa24 Media,@a24media)
2,98. Scapegoat (@andiMakinana),98,Scapegoat (@andiMakinana),Scapegoat,@andiMakinana)
3,97. Africa Check (@AfricaCheck),97,Africa Check (@AfricaCheck),Africa Check,@AfricaCheck)
4,96. James Copnall (@JamesCopnall),96,James Copnall (@JamesCopnall),James Copnall,@JamesCopnall)
...,...,...,...,...,...
95,5. Julius Sello Malema (@Julius_S_Malema),5,Julius Sello Malema (@Julius_S_Malema),Julius Sello Malema,@Julius_S_Malema)
96,4. News24 (@News24),4,News24 (@News24),News24,@News24)
97,3. Jacob G. Zuma (@SAPresident),3,Jacob G. Zuma (@SAPresident),Jacob G. Zuma,@SAPresident)
98,2. Gareth Cliff (@GarethCliff),2,Gareth Cliff (@GarethCliff),Gareth Cliff,@GarethCliff)


In [14]:
# We can now drop the Names and influncers columns
data = data.drop("Names", axis=1)
data = data.drop("influncers",axis=1)

In [15]:
#stripping the last character from our twittername column
data['Twittername'] = data['Twittername'].str.rstrip(')')
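
The three cleaning steps above (split off the rank, split off the handle, strip the closing parenthesis) can also be expressed as one regular expression. This is only a sketch in plain Python; `parse_entry` is a hypothetical helper name, and the pattern assumes every heading has the form `rank. name (@handle)`.

```python
import re

# rank, then a dot, then the name (lazy), then "(@handle)" at the end.
ENTRY = re.compile(r"^\s*(\d+)\.\s*(.*?)\s*\((@\w+)\)\s*$")


def parse_entry(text):
    """Return (rank, name, handle) for a heading, or None if it does not match."""
    match = ENTRY.match(text)
    if match is None:
        return None
    rank, name, handle = match.groups()
    return int(rank), name, handle


examples = [
    "100. Jeffrey Gettleman (@gettleman)",
    "3. Jacob G. Zuma (@SAPresident)",
]
parsed = [parse_entry(e) for e in examples]
```

Because only the first dot after the rank is consumed, a name that itself contains a dot, such as "Jacob G. Zuma", is kept intact, and the name comes back without surrounding whitespace.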

In [16]:
#saving the data into a csv file
data.to_csv(r'\Desktop\export_dataframe.csv', index=False, header=True)

In [17]:
data

Unnamed: 0,Rank,Name,Twittername
0,100,Jeffrey Gettleman,@gettleman
1,99,Africa24 Media,@a24media
2,98,Scapegoat,@andiMakinana
3,97,Africa Check,@AfricaCheck
4,96,James Copnall,@JamesCopnall
...,...,...,...
95,5,Julius Sello Malema,@Julius_S_Malema
96,4,News24,@News24
97,3,Jacob G. Zuma,@SAPresident
98,2,Gareth Cliff,@GarethCliff


In [18]:
# we reverse the data so that the most influential account comes first
df1=data.reindex(index=data.index[::-1])

In [19]:
df2 = df1.reset_index(drop=True)

In [20]:
df3=df2.head(10)
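
Reversing the rows works because the page lists the accounts from 100 down to 1. If the scraped order cannot be relied on, an explicit numeric sort is an option. The sketch below uses plain Python tuples rather than the DataFrame; `records` is a small hand-made sample built from rows shown earlier.

```python
# (rank, name, handle) with the rank still stored as a string.
records = [
    ("100", "Jeffrey Gettleman", "@gettleman"),
    ("2", "Gareth Cliff", "@GarethCliff"),
    ("1", "Trevor Noah", "@Trevornoah"),
]

# Convert the rank to int for sorting; comparing the raw strings
# would order them as "1", "100", "2".
by_rank = sorted(records, key=lambda record: int(record[0]))
top_two = by_rank[:2]
```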

In [21]:
df3

Unnamed: 0,Rank,Name,Twittername
0,1,Trevor Noah,@Trevornoah
1,2,Gareth Cliff,@GarethCliff
2,3,Jacob G. Zuma,@SAPresident
3,4,News24,@News24
4,5,Julius Sello Malema,@Julius_S_Malema
5,6,Helen Zille,@helenzille
6,7,mailandguardian,@mailandguardian
7,8,5FM,@5FM
8,9,loyiso gola,@loyisogola
9,10,Computicket,@Computicket


In [22]:
df3.to_csv(r'10most_twitter_influencers.csv', index = False, header=True)

Now we will scrape data to get the top influencers among government officials.

In [23]:
#Trial code
'''
import requests
from bs4 import BeautifulSoup
import csv
import re
r = requests.get('https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#west-africa')
soup = BeautifulSoup(r.text, 'html.parser')
'''

In [24]:
# If the data you are scraping shares attributes, such as the same class, you can use this code segment.
#The code gets the data and saves it in a csv file
'''
Name = []
Twitter_name = []
for item in soup.findAll('span', {'class': 'css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0'}):
    Name.append(item.get_text(strip=True))
for item in soup.findAll('span', {'class': 'css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0'}):
    Twitter_name.append(item.text)
data = []
for items in zip(Name,Twitter_name):
    data.append(items)

with open('output.csv', 'w+', newline='', encoding='UTF-8-SIG') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Twitter_name'])
    writer.writerows(data)
    print("Operation Completed")
'''


In [25]:
# Attempts to get the content at `url` by making an HTTP GET request.
url= 'https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa'
response = simple_get(url)
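
Note that the `#east-africa` part of the URL is a fragment. Fragments are resolved by the browser and are not sent to the server, so the request returns the whole page and any regional section has to be picked out after download. A short sketch with the standard `urllib.parse` module shows the split:

```python
from urllib.parse import urldefrag

url = ('https://www.atlanticcouncil.org/blogs/africasource/'
       'african-leaders-respond-to-coronavirus-on-twitter/#east-africa')

# urldefrag separates the part that is requested from the fragment.
page_url, fragment = urldefrag(url)
```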

In [26]:
# calling the get_elements function defined at the start of this notebook to get the data
res = get_elements(response, search={'find_all':{'class_':'wp-block-embed__wrapper'}})
#res

In [None]:
# Getting the content from the url by making a request
url="https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#southern-africa"
r=requests.get(url)
soup=BeautifulSoup(r.content, 'html.parser')
links=soup.find_all('blockquote')


In [None]:
#Changing the content to string
list_str = str(links)

In [None]:
list_str

In [None]:
'''
import re
res1 = re.findall('[^()]+', list_str) 
'''

In [None]:
# using a regex to extract the desired content: the Twitter names in parentheses
import re
res = re.findall(r'\(.*?\)', list_str) 
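
The pattern above returns every parenthesised span, including ones that are not Twitter names. A narrower pattern that only accepts `(@handle)` may be preferable. The `sample` string below is an invented stand-in for `list_str`; the real embedded-tweet markup may differ.

```python
import re

# Invented stand-in for str(links).
sample = (
    '<blockquote><p>Stay home (and wash your hands).</p>'
    '- Example Leader (@ExampleLeader) March 20, 2020</blockquote>'
)

broad = re.findall(r'\(.*?\)', sample)       # every parenthesised span
handles = re.findall(r'\((@\w+)\)', sample)  # only "(@handle)", parentheses dropped
```

With the capture group, the handles come back without the surrounding parentheses, so no further stripping is needed before building the DataFrame.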

In [None]:
#changing the string list into a dataframe
df=pd.DataFrame(res)
df.columns=['Twitter_names']

In [None]:
df