# <img style="float: left; padding-right: 10px; width: 45px;" src="https://www.eyeofriyadh.com/includes/image.php?image=/directory/images/2018/04/273d4696fbb5d.png&width=50&height=50"> Web Scraping (Lab 02)

**Taibah University**<br/>
**$3^{rd}$ Term 2023**<br/>
**Instructors**: Prof. Dr. Mohammed Al-Sarem
<hr style='height:2px'>

When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.

Specifically, our learning objectives are:

- Understand the structure of an HTML document and use that structure to extract desired information
- Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information
- Identify some other (semi-)structured formats commonly used for storing and transferring data, such as [JSON](https://en.wikipedia.org/wiki/JSON) and [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)
- Practice using Python packages such as [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and Pandas, including how to navigate their documentation to find functionality.

# Table of Contents:
* Prerequisites
* Introduction to web scraping: [Selenium](https://selenium-python.readthedocs.io/getting-started.html) vs [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package
 * Setting the environment
 * Parse the HTML with __BeautifulSoup__
* Concatenate and merge two or more dataframes
* Save data frame as __*.csv__ file


In [None]:
import warnings
warnings.filterwarnings("ignore")

import requests
import re
from bs4 import BeautifulSoup
from IPython.display import HTML
import pandas as pd

from urllib.request import Request, urlopen
#import urllib
import urllib.error


import time
from IPython.display import YouTubeVideo

# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
requests.packages.urllib3.disable_warnings()

## Methodology
Here's a step-by-step outline of this project:

* Download the Google Scholar web page for the topic of interest using requests
* Parse the HTML source code using Beautiful Soup
* Extract the paper title, number of citations, authors, year of publication, and place of publication from the page
* Compile the data and create a CSV file using pandas

### Download the __Google Scholar citation__ webpage using requests
To begin, we'll use the requests Python library to download the web page. We can use ___requests.get___ to download a page. Here we also need to pass __headers__ to this function, because Google Scholar may reject requests that do not look like they come from a browser.
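As a minimal sketch of what "passing headers" means, the snippet below builds a request that carries a browser-like User-Agent without sending it. It uses `urllib.request.Request`, which this notebook also imports; `requests.get` accepts the same kind of dictionary through its `headers` argument. The URL and the user-agent string are placeholders chosen for illustration.

```python
from urllib.request import Request

# Placeholder values for illustration only
example_url = "https://scholar.google.com/citations?hl=en"
example_headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

# Building the Request object does not contact the server
example_request = Request(example_url, headers=example_headers)

# urllib stores header names in capitalized form, i.e. "User-agent"
print(example_request.get_header("User-agent"))
print(example_request.full_url)
```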


In [None]:
from IPython.display import VimeoVideo
VimeoVideo("733383823", h="d6228d4de1", width=600)

# Prepared Data
## Import

In the previous lab, we built our data frame by exploring the organization page on Google Scholar Citations. Among the data that we obtained was the __link attribute__, which refers to each author's Google Scholar Citations page.

In [None]:
VimeoVideo("656703362", h="bae256298f", width=600)

**Task 2.1:** Write a function named _wrangle( )_ that takes the path to the CSV file generated in __LAB 01 (Web Scraping)__ and returns a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) containing the active links to researchers' Google Scholar Citations profiles.

In [None]:
def wrangle(file_path):
  """Reads the authors CSV file from Lab 01 and returns the profile links.

    Parameters
    ----------
    file_path : str
        Path to the CSV file where you saved the data obtained in the previous lab.

    Returns
    -------
    mask_lnk : pd.Series
        Links to the researchers' Google Scholar Citations profiles.
  """
  # Read CSV file into DataFrame
  df = pd.read_csv(file_path)
  # Subset to the Google Scholar Citations links of researchers' profiles
  mask_lnk = df["link"]

  return pd.Series(mask_lnk)

Now that we have a function written, let's test it out.
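One way to try the core logic without the real file is to read a tiny in-memory CSV. The sketch below is only an illustration: the two rows and the example.org links are made up, and it assumes the Lab 01 file has a `link` column, as the `wrangle` code does.

```python
import io

import pandas as pd

# Made-up stand-in for the Lab 01 CSV file
toy_csv = io.StringIO(
    "name,link\n"
    "A Author,https://example.org/a\n"
    "B Author,https://example.org/b\n"
)

# Same two steps as in wrangle: read the CSV, then keep the "link" column
toy_links = pd.read_csv(toy_csv)["link"]
print(type(toy_links).__name__, len(toy_links))
```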

In [None]:
VimeoVideo("656701336", h="c3a3e9bc16", width=600)

**Task 2.2:** Use your `wrangle` function to create a pd.Series `df` from the CSV file `/content/2023-03-23_authors_list.csv`.
__Attention:__
>  **Note!⚠️**: You have to pass the right name of your dataset. The name of the `*.csv` file follows this convention: the date in `%Y-%m-%d` format followed by `_authors_list.csv`.

In [None]:
df = wrangle("/content/2023-03-23_authors_list.csv")
print("df shape:", df.shape)
df.head()

At this point, your Series `df` should have no more than 404 observations.

In [None]:
# Check your work
assert (
    len(df) <= 404
), f"`df` should have no more than 404 observations, not {len(df)}."

__Task 2.3__: Set your user agent and assign it to a variable named _user_agent_. Then create a dictionary _headers_ and store your user agent in it.

In [None]:
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}

__Task 2.4:__ Write a function called `author_page()` that accesses a researcher's profile on Google Scholar Citations and returns the response status and the HTML page content. As an example, take the first record of the pd.Series that you got in __Task 2.2__.

In [None]:
def author_page(lnk: pd.Series):
  """Fetch the first profile in `lnk` and return (status, soup)."""
  url = lnk.iloc[0]
  status = 0
  soup = None
  try:
      # Send the browser-like headers from Task 2.3 with the request
      request = Request(url, headers=headers)
      page = urlopen(request)
  except urllib.error.HTTPError as e:
      # Return code error (e.g. 404, 501, ...)
      status = e.code
      print('HTTPError: {}'.format(e.code))
  except urllib.error.URLError as e:
      # Not an HTTP-specific error (e.g. connection refused)
      print('URLError: {}'.format(e.reason))
  else:
      # Report the status that the server actually returned
      status = page.status
      soup = BeautifulSoup(page, "lxml")

  return status, soup

status, src = author_page(df)
if src is not None:
    print(f'The requested page was retrieved successfully: the response status is {status}')
    print(src.prettify())

## An Issue with Some Google Citation Pages 🏃
You may notice that a "show more" button appears at the bottom of some scholar pages. To load more articles, you first have to click the button; the browser then renders the content again, adding more articles and their associated information to the page.
To capture the whole HTML page, this lab provides a basic solution using the <font color="green">"pagesize" </font> parameter.



__Task 2.5__: **Create** a dictionary _params_ that contains all necessary parameters that you have to pass in the url. Do not forget to add __pagesize__ to the _params_ dictionary.



In [None]:
params = {
        "hl": "en",                                          # language
        "pagesize": 20                                       # page size
    }

__Task 2.6__: **Test** the URL by passing __pagesize__ as a parameter. Use the first record in your `df`.






In [None]:
page_link= df[0]
full_url = page_link +'&pagesize='+ str(params['pagesize'])
print(full_url)
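String concatenation works when the link already ends with a query string. As an alternative sketch, the standard library can set the parameter for you and replace it if it is already present. The helper name `with_query_param` and the profile id `EXAMPLE_ID` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url: str, key: str, value) -> str:
    """Hypothetical helper: return `url` with `key=value` set in its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))   # existing parameters, in order
    query[key] = str(value)                # add the key or overwrite its value
    return urlunsplit(parts._replace(query=urlencode(query)))


example_link = "https://scholar.google.com/citations?hl=en&user=EXAMPLE_ID"
example_full_url = with_query_param(example_link, "pagesize", 20)
print(example_full_url)
```

Calling the helper again with a different page size overwrites the old value instead of appending a second `pagesize`.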

__Task 2.7:__ Parse the HTML content of the URL stored in the __`full_url`__ variable to:
1. get the following information:
  * _paper title_
  * _co-authors_
  * _cited by count_
  * _journal name_
  * and _publication time_.

2. save the obtained information in a list of dictionaries called __info__.

In [None]:
full_url = 'https://scholar.google.com//citations?hl=en&user=Ao9Y41oAAAAJ&pagesize=20'
request = Request(full_url, headers=headers)
page = urlopen(request)
soup = BeautifulSoup(page, "lxml")
print(f"Extracting author page with page size {params['pagesize']}.")
print(full_url)

info = []
# Each paper is one table row with class 'gsc_a_tr'
for tr in soup.find_all('tr', {'class': 'gsc_a_tr'}):
    gray = tr.find_all("div", {'class': 'gs_gray'})          # [0] authors, [1] journal
    cited = tr.find('a', {'class': 'gsc_a_ac gs_ibl'})
    # Search for a four-digit year; a missing or empty span gives no match
    year = re.search(r'(\d\d\d\d)',
                     str(tr.find("span", {'class': 'gsc_a_h gsc_a_hc gs_ibl'})))
    info.append({
        'paper_title': tr.find("a", {'class': 'gsc_a_at'}).text,
        'authors_list': gray[0].text,
        'cited by': cited.text if cited else 0,
        'journal_name': gray[1].text,
        'publication_time': year.group(1) if year else 'None',
    })

author_df = pd.DataFrame(info)
author_df
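To see how these selectors behave without contacting Google Scholar, the sketch below runs them on a small hand-written fragment. The fragment only imitates the row structure that the lab code assumes (the class names are taken from the code above); the real page may differ, and the paper data is made up. It uses Python's built-in `html.parser` so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates one row of the publications table
sample_html = """
<table>
  <tr class="gsc_a_tr">
    <td><a class="gsc_a_at">A sample paper</a>
        <div class="gs_gray">A Author, B Author</div>
        <div class="gs_gray">Journal of Examples 1 (2), 3-4, 2020</div></td>
    <td><a class="gsc_a_ac gs_ibl">12</a></td>
    <td><span class="gsc_a_h gsc_a_hc gs_ibl">2020</span></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
row = sample_soup.find("tr", {"class": "gsc_a_tr"})
gray = row.find_all("div", {"class": "gs_gray"})   # [0] authors, [1] journal

record = {
    "paper_title": row.find("a", {"class": "gsc_a_at"}).text,
    "authors_list": gray[0].text,
    "journal_name": gray[1].text,
    "cited by": row.find("a", {"class": "gsc_a_ac gs_ibl"}).text,
}
print(record)
```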

__Task 2.8__: Now that you know how to pass the _page size_ parameter through the URL, your task is to create a function called __scrape_first_100_records( )__ which takes the __page_link__ of the author as a parameter and returns a data frame containing the full information listed in __Task 2.7__.

__Attention:__
>  **Note!⚠️**: The current solution works only until the page size reaches 100. When an author has more articles than that, you have to seek another solution. This is why the function has this name.
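One possible direction, offered as an assumption rather than a tested solution: Google Scholar profile URLs appear to accept a `cstart` offset next to `pagesize`, so records beyond the first 100 could be requested in blocks. The helper name `paged_urls` and the placeholder link are hypothetical, and the sketch only builds the URLs without sending any request.

```python
def paged_urls(page_link: str, total: int, pagesize: int = 100):
    """Hypothetical helper: yield one URL per block of `pagesize` records.

    Assumes (unverified here) that the server honors a `cstart` offset.
    """
    for start in range(0, total, pagesize):
        yield f"{page_link}&cstart={start}&pagesize={pagesize}"


# Placeholder link; 250 is a made-up number of papers
example_pages = list(paged_urls("https://scholar.google.com/citations?hl=en&user=EXAMPLE_ID", 250))
for example_page in example_pages:
    print(example_page)
```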

In [None]:
def scrape_first_100_records(page_link: str) -> pd.DataFrame:
    """Scrape up to the first 100 papers listed on one author's profile page."""
    pagesize = 20                                            # initial page size
    more_to_load = True
    while more_to_load:
        url = page_link + '&pagesize=' + str(pagesize)
        request = Request(url, headers=headers)
        page = urlopen(request)
        soup = BeautifulSoup(page, "lxml")
        print(f"extracting author page: {url}.")

        # Every request starts from the first paper, so rebuild the list each time
        info = []
        for tr in soup.find_all('tr', {'class': 'gsc_a_tr'}):
            gray = tr.find_all("div", {'class': 'gs_gray'})  # [0] authors, [1] journal
            cited = tr.find('a', {'class': 'gsc_a_ac gs_ibl'})
            year = re.search(r'(\d\d\d\d)',
                             str(tr.find("span", {'class': 'gsc_a_h gsc_a_hc gs_ibl'})))
            info.append({
                'author_page': page_link,
                'paper_title': tr.find("a", {'class': 'gsc_a_at'}).text,
                'authors_list': gray[0].text,
                'cited by': cited.text if cited else 0,
                'journal_name': gray[1].text,
                'publication_time': year.group(1) if year else 'None',
            })

        # A disabled "show more" button means every paper is already on the page
        show_more = soup.find("button", {'id': 'gsc_bpf_more'})
        all_loaded = show_more is not None and show_more.has_attr('disabled')
        if not all_loaded and pagesize != 100:
            pagesize += 20
            time.sleep(3)                                    # pause between requests
        else:
            more_to_load = False
            print('**********')

    return pd.DataFrame(info)

df_auth_list = pd.DataFrame()
for row in df:
    updated = scrape_first_100_records(row)
    df_auth_list = pd.concat([updated, df_auth_list])
    print(df_auth_list.shape)

df_auth_list

__Task 2.9:__ Check the size of the _df_auth_list_ dataframe. _**What do you see?**_
* Use the [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) property of the dataframe to explore its size
* Refer to the [TU home page](https://www.taibahu.edu.sa/Pages/AR/Home.aspx). Under the [Open Data (البيانات المفتوحة)](https://www.taibahu.edu.sa/Pages/AR/CustomPage.aspx?ID=87) section, find the __updated excel file__ that contains the information needed to count how many academic staff there are in total.
* Create a variable called __portion_of_authors__ that refers to the ratio of authors who have an account on Google Scholar Citations.
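Keep in mind that each row of the scraped data frame describes a paper, so the number of rows is not the number of authors. A minimal sketch with made-up data, assuming the `author_page` column produced in Task 2.8, counts distinct profiles before dividing by a staff total (also made up here):

```python
import pandas as pd

# Made-up data: two papers from one profile and one paper from another
toy = pd.DataFrame({
    "author_page": ["link_a", "link_a", "link_b"],
    "paper_title": ["p1", "p2", "p3"],
})
toy_staff_total = 10  # made-up staff count, for illustration only

n_papers = toy.shape[0]                      # rows are papers
n_authors = toy["author_page"].nunique()     # distinct profiles are authors
toy_ratio = n_authors / toy_staff_total
print(n_papers, n_authors, toy_ratio)
```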

In [None]:
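print(df_auth_list.shape)
# Note: every row of df_auth_list is a paper, not an author, so shape[0] / 3720 is
# papers per staff member. df_auth_list['author_page'].nunique() / 3720 would give
# the share of staff who have a profile.
portion_of_authors = df_auth_list.shape[0] / 3720  # the staff count might change year by year
print(f'Ratio of scholars who have an account on Google Scholar Citations is {round(portion_of_authors, 2)}')

(9068, 6)
Ratio of scholars who have an account on Google Scholar Citations is 2.44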


__Task 2.10:__ Count how many authors contributed to each paper. Create a new column called __#of_authors__ that contains this information. Use the [apply function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html).

In [None]:
def count_authors(authors):
  """Count the comma-separated names in one authors string."""
  count = len(str(authors).split(','))
  return count

df_auth_list['#of_authors'] = df_auth_list['authors_list'].apply(count_authors)
df_auth_list
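To see what the comma split does, here is a small standalone check on a made-up author string. It assumes, as `count_authors` does, that names are separated by commas.

```python
# Made-up author string in the comma-separated form the lab code expects
sample_authors = "A Author, B Author, C Author"

# Split on commas and trim the surrounding spaces from each name
names = [name.strip() for name in sample_authors.split(",")]
print(names, len(names))
```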


__Task 2.11:__ It is time to save the collected data as a _*.csv file_. Create a function called __save_as_csv()__ which takes as parameters the __name of the file__, __the directory__, and the dataframe that you generated before.

In [None]:
import os
def save_as_csv(df: pd.DataFrame, file_name: str, directory: str = "."):
  """Saves a DataFrame as a CSV file whose name starts with today's date.

    For example, `file_name='auth_paper_lst'` produces
    `'2023-03-28_auth_paper_lst.csv'` when run on 2023-03-28.

    Parameters
    ----------
    df : pd.DataFrame
        Data to save.
    file_name : str
        Suggested name, without the date prefix or the extension.
    directory : str, default='.'
        Location for saved CSV file.

    Returns
    -------
    None
  """
  # Create filename with date
  date_string = pd.Timestamp.now().strftime("%Y-%m-%d")
  final_filename = os.path.join(directory, date_string + '_' + file_name + '.csv')
  os.makedirs(directory, exist_ok=True)
  df.to_csv(final_filename)

save_as_csv(df_auth_list, 'auth_paper_lst')
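The dated file name can also be assembled with `pathlib`, which takes care of the path separator. This is a sketch under assumptions: the helper name `dated_csv_path` and its `day` argument are hypothetical, and the function only builds the path without writing a file.

```python
from datetime import date
from pathlib import Path


def dated_csv_path(file_name: str, directory: str = ".", day: date = None) -> Path:
    """Hypothetical helper: return '<directory>/<YYYY-MM-DD>_<file_name>.csv'."""
    day = day or date.today()          # default to today's date
    return Path(directory) / f"{day.strftime('%Y-%m-%d')}_{file_name}.csv"


example_path = dated_csv_path("auth_paper_lst", "output", date(2023, 3, 28))
print(example_path.name)
```

Passing the date explicitly makes the result reproducible, which is convenient when you later need to load a file saved on a known day.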

---
Copyright 2023 Taibah University. This content was created and prepared by [Associate Prof. Dr. Mohammed Al-Sarem](https://sites.google.com/site/alsaremmh) and is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

In [None]:
import pandas as pd

df=pd.read_csv('/content/2023-03-23_authors_list.csv')
df.head()

In [None]:
# Keep the text between the first '">' and the first '</a>'; skip 2 characters so '>' is dropped
df['interst_new'] = df['interst'].apply(lambda st: st[st.find('">') + 2:st.find("</a>")])

In [None]:
df.to_csv('new.csv')