## Web scraping using Python

#### References
1. [Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
2. [Web Scraping using Python](https://www.datacamp.com/community/tutorials/web-scraping-using-python)

In [None]:
# $ python3 -m venv venv
# $ . ./venv/bin/activate

In [8]:
# Alternative: install the dependencies from inside the notebook
!pip install requests BeautifulSoup4 fire

Collecting fire
  Using cached https://files.pythonhosted.org/packages/34/a7/0e22e70778aca01a52b9c899d9c145c6396d7b613719cd63db97ffa13f2f/fire-0.3.1.tar.gz
Collecting termcolor (from fire)
  Using cached https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Building wheels for collected packages: fire, termcolor
  Building wheel for fire (setup.py) ... done
  Stored in directory: /Users/yabebal/Library/Caches/pip/wheels/c1/61/df/768b03527bf006b546dce284eb4249b185669e65afc5fbb2ac
  Building wheel for termcolor (setup.py) ... done
  Stored in directory: /Users/yabebal/Library/Caches/pip/wheels/7c/06/54/bc84598ba1daf8f970247f550b175aaaee85f68b4b0c5ab2c6
Successfully built fire termcolor
Installing collected packages: termcolor, fire
Successfully installed fire-0.3.1 termcolor-1.1.0
You should consider upgrading via the 'pip install --upgrade pip' command.


In [9]:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import pandas as pd
import os, sys

import fire

In [108]:
#%%writefile ../pyscrap_url.py

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the response looks like HTML (see `is_good_response`), return its
    raw content as bytes, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content  #.encode(BeautifulSoup.original_encoding)
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
    
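def get_elements(url, tag='', search=None, fname=None):
    """
    Downloads the page at `url` (or parses it directly if `url` is already
    loaded HTML) and returns a list with the text of each element matching
    the CSS selector `tag`, plus the children of any elements matched by
    `search`, a dict with optional 'find' and 'find_all' keyword arguments
    for BeautifulSoup. `fname` is currently unused.
    """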
    
    if isinstance(url,str):
        response = simple_get(url)
    else:
        #if already it is a loaded html page
        response = url

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        
        res = []
        if tag:    
            for li in html.select(tag):
                for name in li.text.split('\n'):
                    if len(name) > 0:
                        res.append(name.strip())
                       
                
        if search:
            soup = html            
            
            
            r = ''
            if 'find' in search:
                print('finding', search['find'])
                soup = soup.find(**search['find'])
                r = soup

                
            if 'find_all' in search:
                print('finding all of', search['find_all'])
                r = soup.find_all(**search['find_all'])
   
            if r:
                for x in list(r):
                    if len(x) > 0:
                        res.extend(x)
            
        return res

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))    
    
    
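# When this cell is saved as a script (see %%writefile above), expose
# get_elements as a command-line tool; inside IPython/Jupyter this is skipped.
try:
    get_ipython()
except NameError:
    if __name__ == '__main__':
        fire.Fire(get_elements)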

In [107]:
res = get_elements('https://africafreak.com/100-most-influential-twitter-users-in-africa',tag='h2')
res

['100. Jeffrey Gettleman (@gettleman)',
 '99. Africa24 Media (@a24media)',
 '98. Scapegoat (@andiMakinana)',
 '97. Africa Check (@AfricaCheck)',
 '96. James Copnall (@JamesCopnall)',
 '95. Online Africa (@oafrica)',
 '94. Patrick Ngowi (@PatrickNgowi)',
 '93. DOS African Affairs (@StateAfrica)',
 '92. MoadowAJE (@Moadow)',
 '91. Brendan Boyle (@BrendanSAfrica)',
 '90. City of Tshwane (@CityTshwane)',
 '89. VISI Magazine (@VISI_Mag)',
 '88. andBeyond (@andBeyondSafari)',
 '87. This Is Africa (@ThisIsAfricaTIA)',
 '86. Sarah Carter (@sarzss)',
 '85. The EIU Africa team (@TheEIU_Africa)',
 '84. Investing In Africa (@InvestInAfrica)',
 '83. Barry Malone (@malonebarry)',
 '82. ARTsouthAFRICA (@artsouthafrica)',
 '81. Kahn Morbee (@KahnMorbee)',
 '80. Jamal Osman (@JamalMOsman)',
 '79. iamsuede™ (@iamsuede)',
 '78. Mike Stopforth (@mikestopforth)',
 '77. Equal Education (@equal_education)',
 '76. Tristan McConnell (@t_mcconnell)',
 '75. Kate Forbes (@forbeesta)',
 '74. Vanessa Raphaely (@hur
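The `tag='h2'` call above relies on BeautifulSoup's `select`. As a rough, dependency-free illustration of the same idea (collect the text of every element with a given tag name), here is a minimal sketch that swaps in the standard library's `html.parser`. It is not what `get_elements` does internally; the class name `TagTextCollector` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while the parser is inside the target tag
        self.parts = []     # text fragments of the element being read
        self.results = []   # one stripped string per matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = ''.join(self.parts).strip()
                if text:
                    self.results.append(text)
                self.parts = []

    def handle_data(self, data):
        # Text of nested tags (for example <b>) is kept as part of the heading
        if self.depth > 0:
            self.parts.append(data)


# Made-up HTML, shaped like the headings on the page scraped above
sample_html = """
<html><body>
  <h2>100. Jeffrey Gettleman (@gettleman)</h2>
  <p>Some text that is not a heading.</p>
  <h2>99. Africa24 Media (<b>@a24media</b>)</h2>
</body></html>
"""

collector = TagTextCollector('h2')
collector.feed(sample_html)
headings = collector.results
print(headings)
```

For real pages BeautifulSoup remains the more forgiving choice, since it copes with malformed markup and supports full CSS selectors.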

In [109]:
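# Split on the first '.' only, so names that contain a period stay intact,
# and strip the whitespace left around the name
res = [{'Name': x.split('.', 1)[1].split('(')[0].strip(),
        'twitter_name': x.split('(')[1].split(')')[0]} for x in res if '@' in x]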
df100 = pd.DataFrame(res)
df100

| | Name | twitter_name |
|---|---|---|
| 0 | Jeffrey Gettleman | @gettleman |
| 1 | Africa24 Media | @a24media |
| 2 | Scapegoat | @andiMakinana |
| 3 | Africa Check | @AfricaCheck |
| 4 | James Copnall | @JamesCopnall |
| 5 | Online Africa | @oafrica |
| 6 | Patrick Ngowi | @PatrickNgowi |
| 7 | DOS African Affairs | @StateAfrica |
| 8 | MoadowAJE | @Moadow |
| 9 | Brendan Boyle | @BrendanSAfrica |
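The chained `split` calls work for the headings shown, but a single regular expression can make the expected shape explicit and skip anything that does not match. The sketch below assumes every entry looks like `<rank>. <name> (@<handle>)`, which is inferred from the sample output rather than guaranteed by the site; `ENTRY_RE` and `parse_entry` are hypothetical names, and the last two sample strings are invented.

```python
import re

# Assumed shape: "<rank>. <name> (@<handle>)"
ENTRY_RE = re.compile(r'^\s*(\d+)\.\s*(.+?)\s*\((@\w+)\)\s*$')


def parse_entry(text):
    """Return a dict for one ranked heading, or None if it does not match."""
    match = ENTRY_RE.match(text)
    if match is None:
        return None
    rank, name, handle = match.groups()
    return {'rank': int(rank), 'Name': name, 'twitter_name': handle}


samples = [
    '100. Jeffrey Gettleman (@gettleman)',
    '85. The EIU Africa team (@TheEIU_Africa)',
    '7. A. N. Other (@another)',   # invented entry: a name that contains periods
    'Not a ranked entry',          # invented entry: silently skipped
]

rows = [row for row in map(parse_entry, samples) if row is not None]
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame` in the same way as `res`.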


In [110]:
df100.to_csv('africafreak_100_infuencers_2018.csv')

In [103]:
url= 'https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa'
response = simple_get(url)

In [149]:
res_gov = get_elements(response, search={'find_all':{'class_':'wp-block-embed__wrapper'}})
res_gov

['\nhttps://twitter.com/TsholetsaDomi/status/1238324860536922112\n',
 '\nhttps://twitter.com/Azali_officiel/status/1239649350747332613\n',
 '\n',
 <blockquote class="twitter-tweet" data-dnt="true" data-width="550"><p dir="ltr" lang="en">The Deputy Prime Minister Themba Masuku has today met representatives of the private sector and employees' unions to map a collaborative effort in the fight against <a href="https://twitter.com/hashtag/COVID19?src=hash&amp;ref_src=twsrc%5Etfw">#COVID19</a>. <a href="https://t.co/EIYNGOEKRN">pic.twitter.com/EIYNGOEKRN</a></p>— Eswatini Government (@EswatiniGovern1) <a href="https://twitter.com/EswatiniGovern1/status/1241038139889721346?ref_src=twsrc%5Etfw">March 20, 2020</a></blockquote>,
 <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>,
 '\n',
 '\nhttps://twitter.com/SE_Rajoelina/status/1241101811647500288\n',
 '\n',
 <blockquote class="twitter-tweet" data-dnt="true" data-width="550"><p dir="ltr" lang="en">GUIDEL

In [147]:
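def account_blockquote(r):
    """Return (name, handle) from the attribution that follows the tweet text in an embedded blockquote."""
    attribution = r.split('</p>—')[1]
    parts = attribution.split('(')
    name = parts[0].strip()              # display name, without surrounding spaces
    handle = parts[1].split(')')[0]      # '@handle' between the parentheses
    return name, handle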


gov_accounts = []
for rr in res_gov:
    r = str(rr).rstrip()
    if 'https://twitter.com/' in r and '/status/' in r:
        if 'blockquote' in r:
            name, handle = account_blockquote(r)
        elif r:
            name = r.split('https://twitter.com/')[1].split('/status/')[0]
            handle = '@'+name

        gov_accounts.append({'Name':name, 'twitter_handle':handle})
    else:        
        #print('unknown',r)
        pass
   
#print(gov_accounts)
dfgov = pd.DataFrame(gov_accounts)
dfgov

| | Name | twitter_handle |
|---|---|---|
| 0 | TsholetsaDomi | @TsholetsaDomi |
| 1 | Azali_officiel | @Azali_officiel |
| 2 | Eswatini Government | @EswatiniGovern1 |
| 3 | SE_Rajoelina | @SE_Rajoelina |
| 4 | Malawi Government | @MalawiGovt |
| 5 | PKJugnauth | @PKJugnauth |
| 6 | Hage G. Geingob | @hagegeingob |
| 7 | Seychelles Ministry of Finance | @FinanceSC |
| 8 | PresidencyZA | @PresidencyZA |
| 9 | Ministry of Health Zambia | @mohzambia |
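For the plain-URL entries, the handle is just the first path segment of the tweet link. One possible alternative to string splitting is `urllib.parse`, which ignores query strings such as `?ref_src=...` automatically. This is a sketch under the assumption that tweet links have the form `https://twitter.com/<handle>/status/<id>`; the function name `handle_from_status_url` is hypothetical.

```python
from urllib.parse import urlparse


def handle_from_status_url(url):
    """Return '@handle' for a tweet status URL, or None for any other URL."""
    parsed = urlparse(url.strip())
    if parsed.netloc.lower() not in ('twitter.com', 'www.twitter.com'):
        return None
    # Path looks like /<handle>/status/<id>; drop the empty segments
    parts = [p for p in parsed.path.split('/') if p]
    if len(parts) >= 3 and parts[1] == 'status':
        return '@' + parts[0]
    return None


# URLs taken from the output above, including surrounding newlines
urls = [
    '\nhttps://twitter.com/TsholetsaDomi/status/1238324860536922112\n',
    'https://twitter.com/EswatiniGovern1/status/1241038139889721346?ref_src=twsrc%5Etfw',
    'https://twitter.com/hashtag/COVID19?src=hash&ref_src=twsrc%5Etfw',
    'https://t.co/EIYNGOEKRN',
]

handles = [handle_from_status_url(u) for u in urls]
print(handles)
```

Hashtag links and shortened `t.co` links return `None`, so they can be filtered out before building the DataFrame.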


In [148]:
dfgov.to_csv('africa_gov_tweeter_handles.csv')

## Web scraping using a bash script
If the website's HTML is fairly simple, you can fetch the page with `curl` and then extract the values you need with standard command-line tools such as `grep`, `cut`, and `sed`.

This section is adapted from [this Medium article](https://medium.com/@LiliSousa/web-scraping-with-bash-690e4ee7f98d).
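For comparison with the Python section, the same "match a pattern in the raw page text" approach can be sketched with the `re` module. This assumes tweet links of the form `https://twitter.com/<handle>/status/<id>`; `STATUS_RE` is a hypothetical name, and the sample text is a small illustrative snippet shaped like lines of the saved page, with one link repeated on purpose.

```python
import re

# Capture the handle and the numeric status id from each tweet link
STATUS_RE = re.compile(r'https://twitter\.com/(\w+)/status/(\d+)')

# Illustrative snippet; the second TsholetsaDomi line is a deliberate duplicate
sample_text = '''
href="https://twitter.com/PaulKagame/status/1239263206691999748">participation</a>
https://twitter.com/TsholetsaDomi/status/1238324860536922112
href="https://twitter.com/EswatiniGovern1/status/1241038139889721346?ref_src=twsrc%5Etfw">March
https://twitter.com/TsholetsaDomi/status/1238324860536922112
'''

handles = []
for handle, status_id in STATUS_RE.findall(sample_text):
    if handle not in handles:   # keep first occurrence, preserve order
        handles.append(handle)

print(handles)
```

Unlike a whitespace-delimited match, the capture groups drop the surrounding `href="` and link text, so no further clean-up with `cut` or `sed` is needed.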

In [35]:
%%bash

# Download the page once and save it to tmp_file (uncomment to fetch it again)
#url="https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa"
#curl -X GET "$url" -o tmp_file

# write the header row to the CSV file (overwrite so reruns do not duplicate it)
echo "Name, twitter_id" > extractData.csv
n=1
# the loop currently runs exactly once
while [ "$n" -lt 2 ]
do
  echo "started"
  # print every whitespace-delimited token that contains a tweet URL
  grep -o '\S*/status/\S*' tmp_file
  #get author
  #twitter_id=$(cat tmp_file |grep -A1 "class=\"css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0\"" | tail -1)

  #echo "$title, $twitter_id" >> extractData.csv
  #echo "$title, $twitter_id"

  n=$((n + 1))

done

started
href="https://twitter.com/PaulKagame/status/1239263206691999748">participation</a>
https://twitter.com/TsholetsaDomi/status/1238324860536922112
https://twitter.com/Azali_officiel/status/1239649350747332613
href="https://twitter.com/EswatiniGovern1/status/1241038139889721346?ref_src=twsrc%5Etfw">March
https://twitter.com/SE_Rajoelina/status/1241101811647500288
href="https://twitter.com/MalawiGovt/status/1240275631323185152?ref_src=twsrc%5Etfw">March
https://twitter.com/PKJugnauth/status/1240740484714319872
href="https://twitter.com/hagegeingob/status/1240272081805336577?ref_src=twsrc%5Etfw">March
href="https://twitter.com/FinanceSC/status/1241039570608828416?ref_src=twsrc%5Etfw">March
href="https://twitter.com/PresidencyZA/status/1240502027446300674?ref_src=twsrc%5Etfw">March
href="https://twitter.com/mohzambia/status/1240292737892732931?ref_src=twsrc%5Etfw">March
href="https://twitter.com/edmnangagwa/status/1237958955080519680?ref_src=twsrc%5Etfw">March
href="https://twitter.co