#<img style="float: left; padding-right: 10px; width: 45px;" src="https://www.eyeofriyadh.com/includes/image.php?image=/directory/images/2018/04/273d4696fbb5d.png&width=50&height=50"> Web Scraping (Lab 03)

**Taibah University**<br/>
**$3^{rd}$ Term 2023**<br/>
**Instructors**: Prof. Dr. Mohammed Al-Sarem
<hr style='height:2px'>

#Table of Contents:
* Prerequisites
* Introduction to web scraping: [Selenium](https://selenium-python.readthedocs.io/getting-started.html) vs [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package
 * Setting the environment
 * Parse the HTML with __BeautifulSoup__
* Concatenate and merge two or more dataframes
* Save data frame as __*.csv__ file

In [None]:
import warnings
warnings.filterwarnings("ignore")

import requests
import re
from bs4 import BeautifulSoup
from IPython.display import HTML
import pandas as pd

from urllib.request import Request, urlopen
#import urllib
import urllib.error


import time
from IPython.display import YouTubeVideo

# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
requests.packages.urllib3.disable_warnings()

## Methodology
Here is a step-by-step outline of this project:

* Download the Google Scholar web page of the related topic using requests
* Parse the HTML source code using Beautiful Soup
* Extract the title of the paper, number of citations, author of the paper, year of publication, and place of publication from the page
* Compile the data and create a CSV file using pandas

### Download the __Google scholar citation__ webpage using requests
To begin, we'll use the requests Python library to download the web page. We can use ___requests.get___ to download a page. We also need to pass __headers__ to this function, because Google Scholar may reject requests that do not identify themselves with a browser-like user agent.
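
As a minimal sketch of how headers are attached to a request, the snippet below builds (but does not send) a request object with the standard library's `urllib.request.Request`, which this lab also imports. With the requests library, the equivalent call would be `requests.get(url, headers=headers)`. The helper name `build_request`, the profile URL, and the user-agent string are placeholders for illustration only.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Return a Request object carrying a browser-like User-Agent header."""
    headers = {"User-Agent": user_agent}
    return Request(url, headers=headers)


# Hypothetical values, for illustration only
example_url = "https://scholar.google.com/citations?hl=en&user=EXAMPLE"
example_agent = "Mozilla/5.0 (X11; Linux x86_64)"

req = build_request(example_url, example_agent)

# urllib stores header names in capitalized form, i.e. 'User-agent'
print(req.get_header("User-agent"))
```

Nothing is downloaded until the request object is passed to `urlopen`, so this can be run safely while checking that the header is set as intended.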


**Task 3.1:** Write a function named _wrangle( )_ that takes the path to the data source (the CSV file that was generated in __LAB 01 (Web Scraping)__) and returns a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) which contains the links to the researchers' Google Scholar citation profiles.

In [None]:
def wrangle(file_path):
  """Read the authors CSV file and return the profile links.

    Parameters
    ----------
    file_path : str
        Path to the CSV file where you saved the data obtained in the
        previous lab, e.g. `'2023-03-28_authors_list.csv'`.

    Returns
    -------
    mask_lnk : pd.Series
        The `link` column, holding the Google Scholar profile URLs.
  """
  # Read CSV file into DataFrame
  df = pd.read_csv(file_path)
  # Subset to the Google Scholar citation links of the researchers' profiles
  mask_lnk = df["link"]

  return mask_lnk

**Task 3.2:** Use your `wrangle` function to create a Series `df` from the CSV file `/content/2023-03-23_authors_list.csv`.
__Attention:__
>  **Note!⚠️**: You have to pass the correct name of your dataset. The name of the *.csv file follows this convention: the date in `%Y-%m-%d` format followed by `_authors_list.csv`.

In [None]:
df = wrangle("/content/2023-03-23_authors_list.csv")
print("df shape:", df.shape)
df.head()

df shape: (404,)


0    https://scholar.google.com//citations?hl=en&us...
1    https://scholar.google.com/citations?hl=en&use...
2    https://scholar.google.com/citations?hl=en&use...
3    https://scholar.google.com//citations?hl=en&us...
4    https://scholar.google.com//citations?hl=en&us...
Name: link, dtype: object
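
Some of the links above contain a double slash after the host name (`scholar.google.com//citations`). Such URLs may still resolve, but if uniform links are preferred, a sketch like the following collapses repeated slashes in the path. The helper name `normalize_link` and the sample URL are hypothetical; the snippet relies only on the standard library.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_link(url):
    """Collapse repeated slashes in the path part of a URL."""
    parts = urlsplit(url)
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    # Rebuild the URL, leaving scheme, host, and query untouched
    return urlunsplit(parts._replace(path=clean_path))


# Hypothetical link with the same shape as the ones printed above
raw_link = "https://scholar.google.com//citations?hl=en&user=EXAMPLE"
clean_link = normalize_link(raw_link)
print(clean_link)
```

If needed, the helper could be applied to the whole Series with `df.map(normalize_link)` after removing any missing values.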

At this point, your Series `df` should have no more than 404 observations.

__Task 3.3__: Set your user agent and assign it to a variable named _user_agent_. Then create a dictionary _headers_ and store your user agent in it.

In [None]:
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}

__Task 3.4:__ Create a function called __metric()__ that parses the HTML content and returns, for each profile, a record with the following information (the records are then turned into a data frame):
  * _total citations_
  * _citations in the last 5 years_
  * _h-index_
  * _h-index in the last 5 years_
  * _i10-index_
  * and _i10-index in the last 5 years_.


In [None]:
def metric(df):
  """Collect the citation metrics for every profile link in `df`."""
  info = []
  for url in df:
    col_data = []
    # Pause between requests to avoid flooding the server
    time.sleep(0.70)
    print(f'The requested page: {url} is being processed')
    try:
      # Download and parse the page; any failure is handled below
      request = Request(url, headers=headers)
      page = urlopen(request)
      soup = BeautifulSoup(page, "lxml")

      # Each table row holds one metric: [all-time value, value since 2018]
      table = soup.find('table', {'id': 'gsc_rsb_st'})
      table_body = table.find('tbody')
      rows = table_body.find_all('tr')
      for row in rows:
        cols = row.find_all('td', {'class': 'gsc_rsb_std'})
        cols = [ele.text.strip() for ele in cols]
        col_data.append([ele for ele in cols if ele])
      info.append({
                      'auth_prf': url,
                      'all_citation': col_data[0][0],
                      'since_2018': col_data[0][1],
                      'h-index_all': col_data[1][0],
                      'h-index_2018': col_data[1][1],
                      'i10_h_index_all': col_data[2][0],
                      'i10_h_index_2018': col_data[2][1]
                })
    except Exception as e:
      # Record zeros so that the profile still appears in the output
      print('the following error is raised ' + str(e))
      info.append({
                  'auth_prf': url,
                  'all_citation': 0,
                  'since_2018': 0,
                  'h-index_all': 0,
                  'h-index_2018': 0,
                  'i10_h_index_all': 0,
                  'i10_h_index_2018': 0
                  })

  return info

record= metric(df)
dataframe= pd.DataFrame(record)
dataframe.head()
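
The table-parsing step inside `metric()` can be tried offline before sending any requests. The sketch below runs the same `find` / `find_all` calls on a small hand-written HTML fragment. The fragment and its numbers are made up for illustration; only the table id `gsc_rsb_st` and the cell class `gsc_rsb_std` are taken from the function above. The helper name `parse_metrics` is hypothetical, and the built-in `html.parser` is used here instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the table that metric() looks for
sample_html = """
<table id="gsc_rsb_st">
  <thead><tr><th></th><th>All</th><th>Since 2018</th></tr></thead>
  <tbody>
    <tr><td>Citations</td><td class="gsc_rsb_std">120</td><td class="gsc_rsb_std">80</td></tr>
    <tr><td>h-index</td><td class="gsc_rsb_std">6</td><td class="gsc_rsb_std">5</td></tr>
    <tr><td>i10-index</td><td class="gsc_rsb_std">4</td><td class="gsc_rsb_std">3</td></tr>
  </tbody>
</table>
"""


def parse_metrics(html):
    """Return one [all-time, recent] pair of strings per table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "gsc_rsb_st"})
    rows = table.find("tbody").find_all("tr")
    return [
        [td.text.strip() for td in row.find_all("td", {"class": "gsc_rsb_std"})]
        for row in rows
    ]


metrics = parse_metrics(sample_html)
print(metrics)
```

The values come back as strings, so a conversion such as `int(metrics[0][0])` would be needed before doing arithmetic on them.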

__Task 3.5:__ It is time to save the collected data as a _*.csv file_. Create a function called __save_as_csv()__ that takes as parameters the data frame that you generated before, the __name of the file__, and __the directory__.

In [None]:
import os
def save_as_csv(dataframe, file_name: str, directory="."):
  """Save the collected metrics as a CSV file.

    CSV file name will include today's date, e.g. `'2023-03-28_authors_h-index.csv'`.

    Parameters
    ----------
    dataframe : pd.DataFrame
        Data to save.
    file_name : str
        Suggested name, without the date prefix or the `.csv` extension.
    directory : str, default='.'
        Location for saved CSV file.

    Returns
    -------
    None
    """
  # Create filename with date
  date_string = pd.Timestamp.now().strftime("%Y-%m-%d")
  final_filename = os.path.join(directory, date_string + '_' + file_name + '.csv')
  os.makedirs(directory, exist_ok=True)
  dataframe.to_csv(final_filename)

save_as_csv(dataframe, 'authors_h-index')
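
The file-name convention (date, underscore, name, `.csv`) can be checked without writing anything to disk. The sketch below builds the path with the standard library only. The helper name `dated_csv_path`, the `content` directory, and the fixed date are assumptions made for illustration.

```python
from datetime import date
from pathlib import Path


def dated_csv_path(file_name, directory=".", today=None):
    """Build '<directory>/<YYYY-MM-DD>_<file_name>.csv' without creating it."""
    today = today or date.today()
    return Path(directory) / f"{today:%Y-%m-%d}_{file_name}.csv"


# A fixed date is passed here so that the result is reproducible
path = dated_csv_path("authors_h-index", "content", date(2023, 3, 28))
print(path.name)
```

Passing the date as an optional argument keeps the default behavior (today's date) while making the function easy to test.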

In [None]:
from IPython.display import HTML, display
import time

def progress(value, max=100):
    return HTML("""
        <progress
            value='{value}'
            max='{max}'
            style='width: 100%'
        >
            {value}
        </progress>
    """.format(value=value, max=max))

out = display(progress(0, 100), display_id=True)
for ii in range(101):
    time.sleep(0.02)
    out.update(progress(ii, 100))