# <img style="float: left; padding-right: 10px; width: 45px;" src="https://www.eyeofriyadh.com/includes/image.php?image=/directory/images/2018/04/273d4696fbb5d.png&width=50&height=50"> Web Scraping

**Taibah University**<br/>
**$3^{rd}$ Term 2023**<br/>
**Instructors**: Prof. Dr. Mohammed Al-Sarem
<hr style='height:2px'>

By the end of this lab, you will be able to approach messy real-world data with confidence and get it into a format that you can manipulate.

Specifically, our learning objectives are:

- Understand the structure of an HTML document and use that structure to extract desired information
- Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information
- Identify some other (semi-)structured formats commonly used for storing and transferring data, such as [JSON](https://en.wikipedia.org/wiki/JSON) and [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)
- Practice using Python packages such as [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and Pandas, including how to navigate their documentation to find functionality.
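
To make the difference between the two formats mentioned above concrete, here is a minimal sketch that writes one record as JSON and as CSV using only the standard library. The record itself is hypothetical and exists only for illustration.

```python
import csv
import io
import json

# A hypothetical record, used only for illustration.
record = {"name": "Example Author", "citations": 120, "interests": ["HTML", "Parsing"]}

# JSON keeps nested structure: the list stays a list and the number stays a number.
json_text = json.dumps(record)
restored = json.loads(json_text)

# CSV is flat text, so the list is joined into a single cell first.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "citations", "interests"])
writer.writeheader()
writer.writerow({**record, "interests": "; ".join(record["interests"])})
csv_text = buffer.getvalue()

# Reading the CSV back gives strings for every cell, including the number.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(json_text)
print(csv_text)
print(type(rows[0]["citations"]).__name__)
```

JSON round-trips the nested list unchanged, while CSV needs a convention (here a `"; "` separator) for anything that is not a single value.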

# Table of Contents:
* Prerequisites
* Introduction to web scraping: [Selenium](https://selenium-python.readthedocs.io/getting-started.html) vs [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package
 * Setting the environment
 * Parse the HTML with __BeautifulSoup__
* Concatenate and merge two or more dataframes
* Save a data frame as a __*.csv__ file


## Prerequisites
### Basic knowledge of scraping with CSS selectors

If you haven't scraped with CSS selectors before, the article below gives a brief introduction to using them for web scraping. CSS selectors declare which parts of the markup a style applies to, so the same selectors can be used to extract data from matching tags and attributes.
- [Web scraping with css selectors](https://serpapi.com/blog/web-scraping-with-css-selectors-using-python/)
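
As a small illustration of the idea, the sketch below runs three CSS selectors against a hypothetical HTML snippet with Beautiful Soup's `select()` and `select_one()`. The markup, class names, and ids are made up for this example; it uses the built-in `html.parser` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written for this example only.
html = """
<div class="card" id="first">
  <a class="title" href="/paper/1">Paper one</a>
  <span class="year">2021</span>
</div>
<div class="card">
  <a class="title" href="/paper/2">Paper two</a>
  <span class="year">2023</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.card a.title": every <a class="title"> nested inside a <div class="card">
titles = [a.get_text(strip=True) for a in soup.select("div.card a.title")]

# "#first .year": the element with class "year" inside the element with id "first"
first_year = soup.select_one("#first .year").get_text(strip=True)

# "a[href]": every <a> tag that carries an href attribute
links = [a["href"] for a in soup.select("a[href]")]

print(titles)
print(first_year)
print(links)
```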

In [None]:
from IPython.display import VimeoVideo
VimeoVideo("733383823", h="d6228d4de1", width=600)

**Task 1.1:** In this lab, you need to install the following libraries: [requests](https://realpython.com/python-requests/), [lxml](https://lxml.de/) and [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Ask the instructor how to install the required packages and their versions using the **_requirements.txt_** file.

- [How to install a library in Colab?](https://colab.research.google.com/notebooks/snippets/importing_libraries.ipynb)
- [What are HTML tags?](https://www.simplilearn.com/tutorials/html-tutorial/html-tags)
- [CSS properties reference.](https://www.tutorialrepublic.com/css-reference/css3-properties.php)
- [Installing Python packages using _requirements.txt_](https://note.nkmk.me/en/python-pip-install-requirements/)

In [None]:
!pip install requests lxml beautifulsoup4
#!pip install -r requirements.txt

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import requests  # call REQUESTS lib
from bs4 import BeautifulSoup  # call BEAUTIFULSOUP

from urllib.request import Request, urlopen
import urllib
import urllib.error

import re
from IPython.display import HTML
import pandas as pd

import time
from IPython.display import YouTubeVideo

In [None]:
# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
requests.packages.urllib3.disable_warnings()

### **Additional:** Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at [how to reduce the chance of being blocked while web-scraping](https://serpapi.com/blog/how-to-reduce-chance-of-being-blocked-while-web/), which describes eleven methods for avoiding blocks on most websites. A few of them are illustrated below.

A few examples of bypassing blocks:

* __Proxy__:
```
# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    'http': os.getenv('HTTP_PROXY')  # Or write the proxy URL directly instead of using os.getenv()
}
```
* Browser automation such as [selenium](https://selenium-python.readthedocs.io/getting-started.html) or [playwright](https://playwright.dev/python/docs/api/class-playwright). A proxy can also be added to _selenium_ or _playwright_.

* Render the HTML page with a Python library such as _requests-HTML_.
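
The proxy item above can be sketched a little further. The helper below is hypothetical (`build_proxies` is not part of any library): it reads proxy addresses from environment-style variables and leaves out any scheme that has no proxy configured, so the resulting mapping never contains `None` values. The proxy address is a placeholder.

```python
import os


def build_proxies(env=os.environ):
    """Return a proxies mapping, skipping schemes with no configured proxy."""
    candidates = {
        "http": env.get("HTTP_PROXY"),
        "https": env.get("HTTPS_PROXY"),
    }
    return {scheme: proxy for scheme, proxy in candidates.items() if proxy}


# Placeholder proxy address, for illustration only.
example_env = {"HTTP_PROXY": "http://127.0.0.1:8080"}
proxies = build_proxies(example_env)
print(proxies)

# With requests, the mapping would then be passed as:
#     requests.get(url, headers=headers, proxies=proxies)
```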


## Methodology
Here's a step-by-step outline of this project:

* Download the Google Scholar webpage of the related topic using requests
* Parse the HTML source code using Beautiful Soup
* Extract the paper title, number of citations, authors, year of publication, and place of publication from the page
* Compile the data and create a CSV file using pandas

### Download the __Google scholar citation__ webpage using requests
To begin, we'll use the requests Python library to download the web page. We can use ___requests.get___ to download a page. We also need to pass __headers__ to this function, because Google Scholar may reject requests that do not carry a browser-like user agent.


In [None]:
VimeoVideo("733383823", h="d6228d4de1", width=600)

__Task 1.1__: Find your user agent and assign it to a variable named _user_agent_. Then create a dictionary _headers_ and store your user agent in it.

__Attention:__
>  **Note!⚠️**: To get your own __user agent__, open your web browser and search for "My user agent". Google will return your browser's user agent string.

In [None]:
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
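
To check that a header dictionary is attached the way you expect, you can build a request object without sending it. The sketch below uses `urllib.request.Request` from the standard library; the user agent string is a shortened placeholder, not a recommendation.

```python
from urllib.request import Request

# Placeholder value; use the string reported by your own browser.
user_agent = "Mozilla/5.0 (X11; Linux x86_64)"
headers = {"User-Agent": user_agent}

# Building a Request does not contact the server; nothing is sent until urlopen() is called.
request = Request("https://scholar.google.com/citations", headers=headers)

# urllib stores header names in capitalized form, i.e. "User-agent".
print(request.get_header("User-agent"))
print(request.full_url)
```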

In [None]:
VimeoVideo("733383823", h="d6228d4de1", width=600)

__Task 1.2:__ Let's browse the official __Google Scholar Citation__ page where you can find information about the activities of _[Taibah University Researchers](https://scholar.google.com/citations?view_op=view_org&hl=en&org=8607118205373147890)_. Check the HTTP response status! What does Response [200] mean? To make sure your code keeps running, use a `try-except` block to avoid undesirable crashes.

- Search for the meaning of response 200. All possible status codes are listed [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
- [Error and Exception in Python: Try-except Block](https://docs.python.org/3/tutorial/errors.html)
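
If you prefer to look up status codes from Python, the standard library's `http.HTTPStatus` enum carries a short phrase and description for each standard code. The helper name `describe_status` below is made up for this sketch.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus only knows the standard codes.
        return f"{code}: not a standard HTTP status code"
    return f"{code} {status.phrase}: {status.description or 'no further description'}"


for code in (200, 404, 429, 503):
    print(describe_status(code))
```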

In [None]:
params = {
        "view_op": "view_org",                       # author results
        "org": '8607118205373147890',                # organization ID
        "hl": "en",                                  # language
         }

url = 'https://scholar.google.com/citations?'
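
When a `params` dictionary is passed to `requests.get`, it is encoded into the query string of the URL. A roughly equivalent encoding can be reproduced with the standard library, which is handy for seeing the final address before sending anything. This is a sketch of the idea, not of how `requests` is implemented internally.

```python
from urllib.parse import urlencode

params = {
    "view_op": "view_org",            # organization view
    "org": "8607118205373147890",     # organization ID
    "hl": "en",                       # language
}
base_url = "https://scholar.google.com/citations"

# urlencode joins the key=value pairs with "&" and escapes special characters.
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```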

In [None]:
try:
    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as e:
    # Return code error (e.g. 404, 501, ...)
    print('HTTPError: {}'.format(e.response.status_code))
except requests.exceptions.RequestException as e:
    # Not an HTTP-specific error (e.g. connection refused, timeout)
    print('RequestException: {}'.format(e))
else:
    # 200
    print('The requested page is rendered successfully: the response status is good')

The requested page is rendered successfully: the response status is good


__Task 1.3:__ Get the ___html___ source from the __Google Scholar Citation__ page of _[Taibah University Researchers](https://scholar.google.com/citations?view_op=view_org&hl=en&org=8607118205373147890)_. Write a function named ```google_citation_by_university()``` that takes the ID of the organization as a parameter and returns a Soup object.
> ⓘ **Note** ⓘ: __Organization  id of Taibah University__: <font color="red"> _"8607118205373147890"_ </font>


In [None]:
def google_citation_by_university(org='8607118205373147890')->BeautifulSoup:
    """Get html content of first page of a given university.

    Parameters
    ----------
    org : str
        Organization ID given by Google.

    Returns
    -------
    page_cont : BeautifulSoup object
        Html content of the page.
    """
    url = 'https://scholar.google.com/citations?view_op=view_org&hl=en&org=' + org
    reqest = Request(url,headers=headers)
    page = urlopen(reqest)
    page_cont = BeautifulSoup(page, "lxml")
    return page_cont

ORG= '8607118205373147890'
univ_page= google_citation_by_university(ORG)
print(univ_page)

In [None]:
# Check your work
assert (
    type(univ_page).__name__ == 'BeautifulSoup' and
    len(str(univ_page)) == 91062
), (
    "`univ_page` should be a bs4.BeautifulSoup object with the expected length; "
    f"got {type(univ_page).__name__} with length {len(str(univ_page))}"
)

In [None]:
# help(google_citation_by_university)

__Task 1.4:__ Get the ___html___ source of the __Google Scholar Citation__ page of a single researcher from _[Taibah University Researchers](https://scholar.google.com/citations?view_op=view_org&hl=en&org=8607118205373147890)_. In this case, create a function called __google_citation_by_author( )__ which takes the ID of the author as a parameter and returns the HTML content of that page.
<br>__Attention:__
> **Note!⚠️**: To get the _id_ of a particular researcher from the __Google Citation__ page, watch the YouTube video below.

In [None]:
YouTubeVideo('ITHkaADXgaY')

In [None]:
def google_citation_by_author(AuthID='G2h9d8AAAAJ')->BeautifulSoup:
    """Get HTML content of a researcher's page.

    Parameters
    ----------
    AuthID : str
        Author ID given by Google.

    Returns
    -------
    soup : BeautifulSoup object
        Html content of the page.
    """
    url = 'https://scholar.google.com/citations?user=' + AuthID +\
     '&hl=en&ie=utf-8&oe=utf-8'
    request = Request(url, headers=headers)  # attach the user agent
    page = urlopen(request)                  # download the page
    soup = BeautifulSoup(page, "lxml")       # parse it with the lxml parser
    return soup

DEFAULT_AUTHOR_ID= 'G2h9d8AAAAJ'
author_src= google_citation_by_author(DEFAULT_AUTHOR_ID)
print(author_src.prettify()) # print page in a readable format

__Task 1.5:__ Parse the html content stored in the _univ_page_ variable to get the following information:
* _Author name_
* _Author affiliation_
* _Link to author Google citation page_
* _Author image_
* and, _Total citation count_.

In [None]:
# Access the first author's information, found under the 'gsc_1usr' div tag
div = univ_page.find(class_= "gsc_1usr")
name = div.find('img')['alt'] # Get author name
aff = div.find('div', {'class':'gs_ai_aff'}).contents[0] # Affiliation
link=  ('https://scholar.google.com/'+\
        div.find('a')['href']
        # url link to this author's google citation page
       )
auth_id = (
           str(div.find('a')['href']).
           split('user=')[1]
          ) # Get author id from the hyperlink
img = div.find('img')['src']     # get image link
citation = div.find('div', \
             {'class':'gs_ai_cby'}).contents[0]
             # Total number of citations across all publications

print(f"""
Author name: \t {name}
Affilation:  \t {aff}
Link to google scholar profile: \t {link}
Author ID: \t {auth_id}
....  """)


Author name: 	 Prof. Aly R. Seadawy 
Affilation:  	 Professor of Applied Mathematics ; Taibah University  
Link to google scholar profile: 	 https://scholar.google.com//citations?hl=en&user=HiSowXUAAAAJ 
Author ID: 	 HiSowXUAAAAJ
....  
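
The printed link above contains a double slash (`scholar.google.com//citations`) because the base URL ends with `/` and the relative link starts with `/`. One possible alternative, sketched here with the standard library, is to join the two with `urljoin` and to read the author ID from the query string with `parse_qs` instead of splitting on `user=`. The `href` value is the relative link implied by the output above.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Relative link as implied by the printed output above.
href = "/citations?hl=en&user=HiSowXUAAAAJ"

# urljoin resolves the relative link against the base without doubling the slash.
profile_url = urljoin("https://scholar.google.com/", href)

# parse_qs turns the query string into a dict of lists: {'hl': ['en'], 'user': ['HiSowXUAAAAJ']}
author_id = parse_qs(urlparse(href).query)["user"][0]

print(profile_url)
print(author_id)
```

Reading the parameter by name keeps working even if the order of the query parameters changes.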


__Task 1.6:__ Parse the html content stored in the _univ_page_ variable to get the _Research Interests_ and the _Total citation count_. Extract only the number!
* Check how to write a [list comprehension](https://realpython.com/list-comprehension-python/)
* Check how to extract numbers using [Regular Expressions](https://docs.python.org/3/library/re.html)
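
The two hints above can be combined in a few lines. This sketch assumes citation labels of the form `"Cited by 23438"`; the helper name `parse_citation_count` is made up, and the fallback value of `0` for labels without digits is a choice made for this example.

```python
import re


def parse_citation_count(text: str) -> int:
    """Return the first run of digits in `text` as an int, or 0 if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else 0


labels = ["Cited by 23438", "Cited by 12234", ""]

# List comprehension: apply the helper to every label in one expression.
counts = [parse_citation_count(label) for label in labels]
print(counts)
```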

In [None]:
interst =  [
            tag.string for tag in div.find_all('a',\
            {'class':'gs_ai_one_int'}
            )
           ]   # List of the author's research interests
citation = div.find('div',\
                    {'class':'gs_ai_cby'}).contents[0]
                    # Text such as "Cited by 23438"
cite_count= re.findall(r'\d+',citation)[0]  # keep only the digits

print(f"""
Author Research Intersts: \t {interst}
Citation Count:           \t {cite_count}
....  """)


Author Research Intersts: 	 ['Fluid Mechanics', 'Partial Differential Equations'] 
Citation Count:           	 23438  
....  


__Task 1.7:__ Now, store all extracted data in a list _auth_info_list_ (__name, google citation link, author image, count of citation, etc__). Create a DataFrame ```df_singl_auth``` to display the data appropriately.


In [None]:
auth_info_list = []
auth_info_list.append({
      'name': name,
      'affilation': aff,
      'link': link,
      'authorID':auth_id,
      'img':img,
      'citation':citation,
      'interst': interst,
      })
df_singl_auth= pd.DataFrame(auth_info_list)
df_singl_auth

| | name | affilation | link | authorID | img | citation | interst |
|---|---|---|---|---|---|---|---|
| 0 | Prof. Aly R. Seadawy | Professor of Applied Mathematics ; Taibah Univ... | https://scholar.google.com//citations?hl=en&us... | HiSowXUAAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 23438 | [Fluid Mechanics, Partial Differential Equations] |


In [None]:
# Check your work
assert (
        type(df_singl_auth).__name__ == 'DataFrame' \
        and
        df_singl_auth.shape[0]== 1 \
        and df_singl_auth.shape[1]== 7
        ),\
        f"`df_singl_auth` should be an object of 'Pandas DataFrame' and has shape of {df_singl_auth.shape}"

__Task 1.8:__ Now, repeat all the above tasks to store the extracted data for every author on the page as a list (__name, google citation link, author image, count of citation, etc__). Write a function named ```get_authors_list ( )``` that takes the html source of the university page and returns a list of authors.

In [None]:
def get_authors_list(src: BeautifulSoup)-> list:
    """Get list of all researchers found on the university's first page.

    Parameters
    ----------
    src : BeautifulSoup object
        Html content of the page.

    Returns
    -------
    authors_list: list
        List of Researchers.
    """
    authors_list = []
    divs = src.find_all(class_= "gsc_1usr") # Get user information tags
    for div in divs:
        authors_list.append({
            'name': div.find('img')['alt'],
            'affilation': div.find('div', {'class':'gs_ai_aff'}).contents[0],
            'link': (
                     'https://scholar.google.com/'+\
                     div.find('a')['href']
                    ) ,  # link to this author's page (taken from `div`, not from the whole page)
            'authorID':str(div.find('a')['href']).split('user=')[1],
            'img':div.find('img')['src'],
            'citation': (
                        div.find('div',\
                                  {'class':'gs_ai_cby'}).contents[0]
                         ),
            'interst': [
                tag.string for tag in div.find_all('a',\
                              {'class':'gs_ai_one_int'})
                      ],
            })

    return authors_list

info= get_authors_list(univ_page)

__Task 1.9:__ Convert list of authors information into [_Pandas DataFrame_](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [None]:
f_page_df=pd.DataFrame(info)
f_page_df

| | name | affilation | link | authorID | img | citation | interst |
|---|---|---|---|---|---|---|---|
| 0 | Prof. Aly R. Seadawy | Professor of Applied Mathematics ; Taibah Univ... | https://scholar.google.com//citations?hl=en&us... | HiSowXUAAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 23438 | [Fluid Mechanics, Partial Differential Equations] |
| 1 | Prof. Muhammad Sohail Zafar | Taibah University, Saudi Arabia, Professor, Ph... | https://scholar.google.com//citations?hl=en&us... | N4S8HI8AAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 12234 | [Dentistry, Dental Materials, Biomaterials, Re... |
| 2 | Nadeem Ahmad | College of Medicine, Taibah University | https://scholar.google.com//citations?hl=en&us... | Ao9Y41oAAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 10736 | [Community Medicine] |
| 3 | Hatem Makhdoom | Taibah University | https://scholar.google.com//citations?hl=en&us... | QfhpWW0AAAAJ | /citations/images/avatar_scholar_56.png | Cited by 5168 | [] |
| 4 | Mansour Al Nozha | Professor of Cardiology, King Saud University,... | https://scholar.google.com//citations?hl=en&us... | cCZNX8cAAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 4355 | [Cardiology, Metabolic syndrome, Diabetes] |
| 5 | Omar Alharbi | Biology Department, Faculty of Sciences, Taiba... | https://scholar.google.com//citations?hl=en&us... | X143mTQAAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 3650 | [Environmental Science] |
| 6 | eman alfadhli | Professor of Medicine,Taibah university | https://scholar.google.com//citations?hl=en&us... | fNj3ox8AAAAJ | /citations/images/avatar_scholar_56.png | Cited by 3158 | [diabetes and endocrinology] |
| 7 | Mohammed Ali Ali Al-Mamary | Taibah University- KSA; Sana'a University- Yem... | https://scholar.google.com//citations?hl=en&us... | macDI08AAAAJ | /citations/images/avatar_scholar_56.png | Cited by 2857 | [Biological Activities of Natural and …] |
| 8 | Gias U Ahmmed | Taibah University | https://scholar.google.com//citations?hl=en&us... | cg9jv2UAAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 2820 | [Ion chanel, Calcium signaling] |
| 9 | Mansour Adam Mahmoud (Associate Professor) | Clinical and HospitalPharmacy Department, Taib... | https://scholar.google.com//citations?hl=en&us... | ndpoZQ8AAAAJ | https://scholar.googleusercontent.com/citation... | Cited by 2790 | [Medication Safety and Pharmacovigilance, Phar... |




In [None]:
# Check your work
assert (
        type(f_page_df).__name__ == 'DataFrame'\
        and
        f_page_df.shape[0]== 10 \
        and f_page_df.shape[1]== 7
        ),\
        f"`f_page_df` should be an object of 'Pandas DataFrame' and has shape of {f_page_df.shape}"

## Next steps 🏃

__Task 1.10:__ So far, we have only collected information about the authors found on the first page. Your task now is to get the list of all authors on the following pages. Write ```scrape_all_authors_from_university ()``` that takes the _university id_ as a parameter and returns a _Pandas DataFrame_.
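
Moving to the next page relies on a token for the next batch of authors. The sketch below shows one way to pull such a token out of a button attribute with a regular expression. The `onclick` string is hypothetical: it is modeled on an escaped form in which `=` appears as `\x3d` and `&` as `\x26`, and the real page markup may differ. The helper name `next_page_token` is also made up for this example.

```python
import re

# Hypothetical attribute value, for illustration only; the real markup may differ.
onclick = (
    r"window.location='/citations?view_op\x3dview_org\x26org\x3d123"
    r"\x26after_author\x3dABC123__8J\x26astart\x3d10'"
)


def next_page_token(onclick_text):
    """Return the after_author token, or None when no token is present."""
    # "\\x3d" in the pattern matches a literal backslash followed by "x3d".
    # The non-greedy (.*?) stops at the first escaped "&" after the token.
    match = re.search(r"after_author\\x3d(.*?)\\x26", onclick_text)
    return match.group(1) if match else None


print(next_page_token(onclick))
print(next_page_token("no token here"))
```

Returning `None` instead of raising lets the calling loop treat a missing token as the end of the listing.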



In [None]:
def scrape_all_authors_from_university(org_id: str = '8607118205373147890')-> pd.DataFrame:
    """Get the researchers listed on the pages after the first one.

    Parameters
    ----------
    org_id : str
        Organization ID given by Google.

    Returns
    -------
    pd.DataFrame
        One row per researcher.
    """
    params = {
        "after_author":'4VLjAFv3__8J',                       # pagination token (author to continue after)
        "hl": "en",                                          # language
        "astart": 0                                          # running offset, increased by 10 per page
    }
    authors_lst = []
    profiles_is_present = True
    while profiles_is_present:
        url = ('https://scholar.google.com/citations?view_op=view_org&hl=' + params['hl']
               + '&org=' + org_id
               + '&after_author=' + str(params['after_author'])
               + '&astart=' + str(params['astart']))
        print(url)
        request = Request(url, headers=headers)
        page = urlopen(request)
        soup = BeautifulSoup(page, "lxml")
        print(f"extracting authors at offset {params['astart']}.")

        divs = soup.find_all(class_= "gsc_1usr")
        for div in divs:
            cited_by = div.find('div', {'class':'gs_ai_cby'})
            authors_lst.append({
                'name': div.find('img')['alt'],
                'affilation': div.find('div', {'class':'gs_ai_aff'}).contents[0],
                'link': 'https://scholar.google.com/'+ div.find('a')['href'],
                'img': div.find('img')['src'],
                'citation': cited_by.contents[0] if cited_by.contents else 0,
                # list of interest names, in the same form as in get_authors_list()
                'interst': [tag.string for tag in div.find_all('a', {'class':'gs_ai_one_int'})],
                })

        # Inspect the "next page" button: if it is disabled, this was the last page
        next_author = soup.findChildren("button",{'class':'gs_btnPR', 'disabled':''})
        last_page = re.search(r'disabled=', str(next_author))

        if not last_page:
            # The button markup carries the token of the next page; update it and move the offset by 10
            link_next_author = re.search(r"after_author\\x3d(.*)\\x26", str(next_author)).group(1)
            print(link_next_author)
            params["after_author"] = str(link_next_author)
            params["astart"] += 10
            time.sleep(3)  # pause between requests
        else:
            profiles_is_present = False
    return pd.DataFrame(authors_lst)

df_auth_list= scrape_all_authors_from_university(org_id='8607118205373147890')


__Task 1.11:__ Check the size of the _df_auth_list_ dataframe. _**What do you see?**_
* Check the size. Use the [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) property of the dataframe to explore it
* Refer to the university page to count how many authors there are in total.
* Merge both dataframes: __f_page_df__ (first page) and __df_auth_list__ (remaining authors).
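
When two dataframes do not have exactly the same columns, `pd.concat` keeps the union of the columns and fills the gaps with `NaN`. The small sketch below uses made-up rows to show the effect; `ignore_index=True` is an optional choice that renumbers the rows from 0 instead of keeping each frame's own index.

```python
import pandas as pd

# Made-up rows, for illustration only.
first = pd.DataFrame([{"name": "A", "authorID": "id1", "citation": "Cited by 10"}])
rest = pd.DataFrame([
    {"name": "B", "citation": "Cited by 5"},
    {"name": "C", "citation": "Cited by 3"},
])

# Rows are stacked; columns missing in one frame are filled with NaN.
combined = pd.concat([first, rest], ignore_index=True)

print(combined.shape)
print(combined["authorID"].isna().tolist())
```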

In [None]:
print(f_page_df.shape)
print(df_auth_list.shape)
result= pd.concat([f_page_df, df_auth_list])
# print(result.shape)

(10, 7)
(456, 6)


In [None]:
# Check your work
assert (
        type(result).__name__ == 'DataFrame' and
        result.shape[0]== 466 and result.shape[1]== 7), f"`result` should be an object of 'Pandas DataFrame' and has shape of {result.shape}"

__Task 1.12:__ It is time to save the collected data as a _*.csv file_. Create a function called __save_as_csv()__ which takes as parameters the __name of the file__, __the directory__, and the dataframe that you generated before.
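
Building the dated file name can be tried out separately from writing the file. The helper below is a hypothetical sketch using the standard library (`dated_csv_path` is not part of pandas); the optional `today` argument exists only so the result can be checked against a fixed date.

```python
from datetime import date
from pathlib import Path


def dated_csv_path(file_name, directory=".", today=None):
    """Return directory/YYYY-MM-DD_file_name.csv as a Path (nothing is written)."""
    today = today or date.today()
    return Path(directory) / f"{today.isoformat()}_{file_name}.csv"


path = dated_csv_path("authors_list", "data", today=date(2023, 3, 28))
print(path.as_posix())
```

Using `Path` (or `os.path.join`) avoids assembling the separator by hand.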

In [None]:
import os

def save_as_csv(file_name: str, df: pd.DataFrame, directory: str = ".") -> None:
    """Creates CSV file with all authors' information.

    CSV file name will include today's date, e.g. `'2023-03-28_authors_list.csv'`.

    Parameters
    ----------
    file_name : str
        Base name of the file, without date or extension.
    df : pd.DataFrame
        Data to save.
    directory : str, default='.'
        Location for saved CSV file.

    Returns
    -------
    None
    """
    # Create filename with date
    date_string = pd.Timestamp.now().strftime("%Y-%m-%d")
    final_filename = os.path.join(directory, date_string + '_' + file_name + '.csv')
    os.makedirs(directory, exist_ok=True)  # create the directory if it does not exist
    df.to_csv(final_filename)              # save the dataframe passed in, not a global variable

save_as_csv('authors_list', result)

---
Copyright 2023 Taibah University. This content was created and prepared by [Prof. Dr. Mohammed Al-Sarem](https://sites.google.com/site/alsaremmh) and is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.