## Web scraping using Python

#### References
1. [Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
2. [Web Scraping using Python](https://www.datacamp.com/community/tutorials/web-scraping-using-python)

In [None]:
#$ python3 -m venv venv
#$ . ./venv/bin/activate

In [None]:
# Installing the required modules.
!pip install requests beautifulsoup4 fire pandas  # requests performs the HTTP requests; BeautifulSoup4 handles the HTML processing.

In [None]:
#importing modules
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import pandas as pd
import os, sys

import fire

In [None]:
#%%writefile ../pyscrap_url.py

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content  #.encode(BeautifulSoup.original_encoding)
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
    
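def get_elements(url, tag='', search=None, fname=None):
    """
    Downloads the page specified by the url parameter (or parses it
    directly if `url` is already-loaded HTML) and returns a list of results:
    stripped text lines of the elements matching `tag`, plus the child
    nodes of the elements matched by `search`.
    """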
    
    if isinstance(url,str):
        response = simple_get(url)
    else:
        #if already it is a loaded html page
        response = url

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        
        res = []
        if tag:    
            for li in html.select(tag):
                for name in li.text.split('\n'):
                    if len(name) > 0:
                        res.append(name.strip())
                       
                
        if search:
            soup = html            
            
            
            r = ''
            if 'find' in search.keys():
                print('finding',search['find'])
                soup = soup.find(**search['find'])
                r = soup

                
            if 'find_all' in search.keys():
                print('finding all of',search['find_all'])
                r = soup.find_all(**search['find_all'])
   
            if r:
                for x in list(r):
                    if len(x) > 0:
                        res.extend(x)
            
        return res

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))    
    
    
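# Expose get_elements on the command line when this code runs as a script.
# Inside IPython/Jupyter, get_ipython() exists, so the CLI is skipped there.
try:
    get_ipython()
    _in_notebook = True
except NameError:
    _in_notebook = False

if __name__ == '__main__' and not _in_notebook:
    fire.Fire(get_elements)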
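
The content-type check in `is_good_response` can be tried offline. The following is a minimal sketch, not part of the original notebook: `FakeResponse` and `looks_like_html` are hypothetical names, and the stand-in object only carries the two attributes the check reads. It illustrates why looking the header up with `.get()` matters when a server sends no `Content-Type` at all.

```python
class FakeResponse:
    """Hypothetical stand-in for a requests response (status code and headers only)."""

    def __init__(self, status_code, headers):
        self.status_code = status_code
        # A plain dict is case-sensitive, unlike the headers of a real response,
        # so the sample data below spells the key exactly as 'Content-Type'.
        self.headers = headers


def looks_like_html(resp):
    # .get() with a default returns '' instead of raising KeyError
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


checks = {
    'html page': looks_like_html(FakeResponse(200, {'Content-Type': 'text/html; charset=UTF-8'})),
    'json body': looks_like_html(FakeResponse(200, {'Content-Type': 'application/json'})),
    'not found': looks_like_html(FakeResponse(404, {'Content-Type': 'text/html'})),
    'no header': looks_like_html(FakeResponse(200, {})),
}
print(checks)
```

Only the first case passes; the other three are rejected without raising an exception.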

Scraping the [100 most influential Twitter users in Africa](https://africafreak.com/100-most-influential-twitter-users-in-africa) page to obtain the users' Twitter handles.


In [None]:
res = get_elements('https://africafreak.com/100-most-influential-twitter-users-in-africa',tag='h2')
res


['100. Jeffrey Gettleman (@gettleman)',
 '99. Africa24 Media (@a24media)',
 '98. Scapegoat (@andiMakinana)',
 '97. Africa Check (@AfricaCheck)',
 '96. James Copnall (@JamesCopnall)',
 '95. Online Africa (@oafrica)',
 '94. Patrick Ngowi (@PatrickNgowi)',
 '93. DOS African Affairs (@StateAfrica)',
 '92. MoadowAJE (@Moadow)',
 '91. Brendan Boyle (@BrendanSAfrica)',
 '90. City of Tshwane (@CityTshwane)',
 '89. VISI Magazine (@VISI_Mag)',
 '88. andBeyond (@andBeyondSafari)',
 '87. This Is Africa (@ThisIsAfricaTIA)',
 '86. Sarah Carter (@sarzss)',
 '85. The EIU Africa team (@TheEIU_Africa)',
 '84. Investing In Africa (@InvestInAfrica)',
 '83. Barry Malone (@malonebarry)',
 '82. ARTsouthAFRICA (@artsouthafrica)',
 '81. Kahn Morbee (@KahnMorbee)',
 '80. Jamal Osman (@JamalMOsman)',
 '79. iamsuede™ (@iamsuede)',
 '78. Mike Stopforth (@mikestopforth)',
 '77. Equal Education (@equal_education)',
 '76. Tristan McConnell (@t_mcconnell)',
 '75. Kate Forbes (@forbeesta)',
 '74. Vanessa Raphaely (@hur

In [120]:
new = pd.DataFrame(res).head(100)  # creating a DataFrame
new
#Data manipulation
df1 = new[0].str.split('.', expand=True)
df1.head(100)

df2 = df1[1].str.split('(', expand=True)
df2.head(100)

df2[1] = df2[1].str.strip(')')
df2.head(100)

df2.columns = ['Twitter_name','Twitter_handle']
df2


Influencer_handle = df2['Twitter_handle']
Influencer_handle

Influencer_handle = df2['Twitter_handle'].astype(str).to_list()
Influencer_handle

reversed_list = Influencer_handle[::-1]
reversed_list

reversed_list.pop(2)  #removing None.
reversed_list

Top_10_influential = reversed_list[:10]


#Converting into dataframe 
Top_10_influential_df = pd.DataFrame(Top_10_influential)

#naming  the column
Top_10_influential_df.columns = ["Twitter__Handle"]
Top_10_influential_df

Unnamed: 0,Twitter__Handle
0,@Trevornoah
1,@GarethCliff
2,@News24
3,@Julius_S_Malema
4,@helenzille
5,@mailandguardian
6,@5FM
7,@loyisogola
8,@Computicket
9,@MTVbaseAfrica
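
The split-based steps above can also be sketched with a single regular expression. This is an alternative under stated assumptions, not the notebook's method: it assumes every heading has the shape `rank. name (@handle)`, and `HEADING_RE` and `parse_heading` are hypothetical names. Headings with any other shape come back as `None` and are skipped, so no stray entry has to be removed by position afterwards.

```python
import re

# Sample headings copied from the scraped output shown earlier
headings = [
    '100. Jeffrey Gettleman (@gettleman)',
    '99. Africa24 Media (@a24media)',
    '98. Scapegoat (@andiMakinana)',
    'References',  # a heading without a handle, to show the fallback
]

# rank, a dot, the display name, then the handle in parentheses
HEADING_RE = re.compile(r'^(\d+)\.\s*(.+?)\s*\((@\w+)\)$')


def parse_heading(text):
    """Return (rank, name, handle), or None if the heading has another shape."""
    match = HEADING_RE.match(text.strip())
    if match is None:
        return None
    rank, name, handle = match.groups()
    return int(rank), name, handle


parsed = [parse_heading(h) for h in headings]
handles = [p[2] for p in parsed if p is not None]
print(handles)
```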


In [118]:
#Saving file

Top_10_influential_df.to_csv('Top 10 Influencers Twitter handle.csv',index=False)

from google.colab import files
files.download("Top 10 Influencers Twitter handle.csv")


Scraping the [page on top government officials responding to Coronavirus in East Africa](https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa) to obtain their Twitter handles.

In [None]:
#Obtaining the url
url= 'https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa'
response = simple_get(url)

res = get_elements(response, search={'find_all':{'class_':'twitter-tweet'}})
res
#Obtaining the specific strings with the twitter name and handle.
my_stringtags = []
for tag in res:
    if tag.string is not None:
        my_stringtags.append(tag.string)
my_stringtags


In [112]:
#Extracting the specific twitter handles.
import re
handleregex = re.compile(r'@[a-zA-Z0-9_]{1,15}')  # at least one character after '@', so a bare '@' is not matched
tags = ''.join(my_stringtags)
govt_official = handleregex.findall(tags)
govt_handle = pd.DataFrame(govt_official)
govt_handle

##Data manipulation
#govt_handle[0] = govt_handle[0].str.strip('@')
#govt_handle.head(100)

govt_handle.columns = ['Twitter_handle']
govt_handle

Top_officials = govt_handle['Twitter_handle'].astype(str).to_list()
Top_officials

Top_10_govt_official = Top_officials[:10]

Top_10_govt_official_df = pd.DataFrame(Top_10_govt_official)
Top_10_govt_official_df.columns = ["Twitter_Handle"]
Top_10_govt_official_df


Unnamed: 0,Twitter_Handle
0,@EswatiniGovern1
1,@MalawiGovt
2,@hagegeingob
3,@FinanceSC
4,@PresidencyZA
5,@mohzambia
6,@edmnangagwa
7,@MinSantedj
8,@hawelti
9,@StateHouseKenya
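
The handle pattern can be checked on a small string before running it on the scraped text. The sketch below is illustrative only: the sample sentence is made up, and `HANDLE_RE` is a hypothetical name. It uses `{1,15}` as the repetition count and compares it with `{0,15}`, which also accepts an `@` that has no name after it.

```python
import re

# Up to 15 letters, digits or underscores after the '@'
HANDLE_RE = re.compile(r'@[A-Za-z0-9_]{1,15}')

# Hypothetical sample text, not taken from the scraped page
sample = 'Updates from @StateHouseKenya and @MalawiGovt; contact us @ the office.'

handles = HANDLE_RE.findall(sample)
print(handles)

# With {0,15} the lone '@' before "the office" is returned as a match too
loose = re.findall(r'@[A-Za-z0-9_]{0,15}', sample)
print(loose)
```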


In [119]:
#Saving file
Top_10_govt_official_df.to_csv('Top 10 govt officials Twitter Handle.csv',index=False)

from google.colab import files
files.download("Top 10 govt officials Twitter Handle.csv")


## Web scraping using a bash script
If the website has fairly simple HTML, you can use `curl` to perform the request and then extract the needed values with standard shell tools such as `grep`, `cut` and `sed`.

This tutorial is adapted from [this Medium article](https://medium.com/@LiliSousa/web-scraping-with-bash-690e4ee7f98d).

In [None]:
%%bash

# curl the page and save its content to tmp_file
# (a bash assignment takes no spaces around "=")
url="https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa"
curl -X GET "$url" -o tmp_file

# write headers to CSV file
echo "Name, twitter_id" >> extractData.csv
n=1
while [ "$n" -lt 2 ]
do

  # get title
  title=$(grep "class=\"twitter-tweet\"" tmp_file | cut -d ';' -f1)
  echo "$title"
  # get author
  twitter_id=$(grep -A1 "class=\"css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0\"" tmp_file | tail -1)

  echo "$title, $twitter_id" >> extractData.csv
  echo "$title, $twitter_id"

  n=$((n+1))

done
