## Web Scraping with Selectors

Jingwen Fu\
jf3483@columbia.edu

### 1. Get HTML
Get the content of the page into Python.

In [1]:
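import pandas as pd

# read_html returns a list of DataFrames, one per <table> found on the page
df = pd.read_html('https://en.wikipedia.org/wiki/Mitch_McConnell')
df
# Alternative: fetch the page with requests and parse it with lxml
# wiki_url = 'https://en.wikipedia.org/wiki/Mitch_McConnell'
# wiki_html = requests.get(wiki_url)
# wiki_doc = html.fromstring(wiki_html.content)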

[                    Mitch McConnell  \
 0           Official portrait, 2016   
 1                               NaN   
 2            Senate Minority Leader   
 3                         Incumbent   
 4   Assumed office January 20, 2021   
 ..                              ...   
 56                       Allegiance   
 57                   Branch/service   
 58                 Years of service   
 59                             Unit   
 60                              NaN   
 
                                     Mitch McConnell.1  
 0                             Official portrait, 2016  
 1                                                 NaN  
 2                              Senate Minority Leader  
 3                                           Incumbent  
 4                     Assumed office January 20, 2021  
 ..                                                ...  
 56                                      United States  
 57                                 United States Army  
 58  

In [2]:
import requests
print(requests.get(url = 'https://en.wikipedia.org/wiki/Mitch_McConnell').text)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Mitch McConnell - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"fb2ee752-19a3-4cdf-b1ab-b5c1437b07ef","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Mitch_McConnell","wgTitle":"Mitch McConnell","wgCurRevisionId":1122175073,"wgRevisionId":1122175073,"wgArticleId":350567,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: archived copy as title","Webarchive template wayback links","CS1 maint: url-status","Articles with short description","Short description 

### 2. Get the info box
On the right side of the page is a box of structured content called an infobox. Wikipedia has many types of infoboxes, which present comparable content for a group of articles of the same class (e.g., members of the U.S. Senate, Fortune 500 companies, crime syndicates).

#### a) Find the CSS class of the infobox.
The infobox is a `<table>` element with the classes `infobox vcard`.
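
Because the `class` attribute holds two space-separated names, the table can be matched on either one or on both. A minimal sketch on a hand-written snippet (the HTML string below is a made-up stand-in for the real page; only the class names are taken from it):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page; only the class names mirror the real infobox
sample_html = """
<table class="wikitable"><tr><td>other table</td></tr></table>
<table class="infobox vcard"><tr><th>Whip</th><td>John Thune</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_="infobox" matches any element whose class list contains "infobox"
by_one_class = sample_soup.find(class_="infobox")
# The CSS selector requires both classes on a <table> element
by_both_classes = sample_soup.select_one("table.infobox.vcard")

print(by_one_class["class"])
print(by_both_classes["class"])
```

Both lookups land on the same table here. The CSS selector is the stricter of the two, which may matter on pages that reuse `infobox` on other elements.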

#### b) Extract the part of the HTML document that contains the infobox using the CSS information.

In [3]:
from bs4 import BeautifulSoup

# Download the page and parse it into a BeautifulSoup tree
response = requests.get("https://en.wikipedia.org/wiki/Mitch_McConnell")
soup = BeautifulSoup(response.content, 'html.parser')
# print(soup.title)

In [4]:
info_box = soup.find(class_="infobox")
print(info_box)

<table class="infobox vcard"><tbody><tr><th class="infobox-above" colspan="2" style="font-size: 100%;"><div class="fn" style="font-size:125%;">Mitch McConnell</div></th></tr><tr><td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:Mitch_McConnell_2016_official_photo_(1).jpg"><img alt="Mitch McConnell 2016 official photo (1).jpg" data-file-height="2335" data-file-width="1696" decoding="async" height="303" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Mitch_McConnell_2016_official_photo_%281%29.jpg/220px-Mitch_McConnell_2016_official_photo_%281%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Mitch_McConnell_2016_official_photo_%281%29.jpg/330px-Mitch_McConnell_2016_official_photo_%281%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Mitch_McConnell_2016_official_photo_%281%29.jpg/440px-Mitch_McConnell_2016_official_photo_%281%29.jpg 2x" width="220"/></a><div class="infobox-caption" style="line-height:normal;padding-top:0.2

### 3. Make a data frame
 a) Parse the infobox table HTML you obtained above into a data frame.\
 b) Name the columns of the resulting table `key` and `value`. In the example for Mitch McConnell, "Whip" would be a key, and its content (i.e. the value) is "John Thune".\
 c) Filter the data frame (and rename variables if necessary) to the keys "Full name", "Political party", and "Children". Use this selection of variables for all subsequent questions.

In [5]:
# Collect the text of every label cell and every data cell
infobox_label = [a.text for a in soup.find_all(class_="infobox-label")]
infobox_data = [a.text for a in soup.find_all(class_="infobox-data")]

In [6]:
infobox_label[0:5]

['Whip', 'Preceded by', 'Whip', 'Preceded by', 'Succeeded by']

In [7]:
infobox_data[0:5]

['John Thune',
 'Chuck Schumer',
 'Trent LottJon KylJohn Cornyn',
 'Harry Reid',
 'Harry Reid']
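
The two lists above line up only if every label cell has exactly one matching data cell; if their lengths ever differed, a data frame could not be built from them column by column. One way to guard against that is to walk the table row by row. The sketch below uses a hand-written snippet (hypothetical markup that reuses only the class names seen above):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an infobox; the second row has a label but no data cell
sample_html = """
<table class="infobox vcard">
  <tr><th class="infobox-label">Whip</th><td class="infobox-data">John Thune</td></tr>
  <tr><th class="infobox-label">Orphan label</th></tr>
  <tr><th class="infobox-label">Children</th><td class="infobox-data">3</td></tr>
</table>
"""
sample_box = BeautifulSoup(sample_html, "html.parser").find(class_="infobox")

pairs = []
for row in sample_box.find_all("tr"):
    label = row.find(class_="infobox-label")
    data = row.find(class_="infobox-data")
    # Keep a row only when it has both cells, so keys and values cannot drift apart
    if label is not None and data is not None:
        pairs.append((label.get_text(strip=True), data.get_text(strip=True)))

print(pairs)
```

The resulting list of `(key, value)` tuples could be passed to `pd.DataFrame(pairs, columns=['key', 'value'])`.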

In [8]:
import pandas as pd

# Pair labels and data position by position; both lists must have the same length
df = pd.DataFrame({'key': infobox_label, 'value': infobox_data})

In [9]:
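import re

# The full name sits in the "Born" row, so rename that key only
df['key'] = df['key'].replace('Born', 'Full name')
# .copy() lets the filtered frame be modified later without a SettingWithCopyWarning
df = df[df['key'].isin(['Full name', 'Political party', 'Children'])].copy()
df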

| | key | value |
|---|---|---|
| 22 | Full name | Addison Mitchell McConnell III (1942-02-20) Fe... |
| 23 | Political party | Republican |
| 25 | Children | 3 |


In [10]:
df['value']

22    Addison Mitchell McConnell III (1942-02-20) Fe...
23                                           Republican
25                                                    3
Name: value, dtype: object

In [11]:
is_name = df['key'] == 'Full name'
# Keep only the leading run of words (the name), dropping the birth date and place
df.loc[is_name, 'value'] = re.search(r'(\w+\s)+', df.loc[is_name, 'value'].values[0]).group().strip()
df

| | key | value |
|---|---|---|
| 22 | Full name | Addison Mitchell McConnell III |
| 23 | Political party | Republican |
| 25 | Children | 3 |
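
The pattern `(\w+\s)+` keeps the leading run of words that are each followed by whitespace, so it stops at the first character that is neither a word character nor whitespace. The sketch below, on two made-up strings (hypothetical, not taken from any page), shows where that helps and where it may cut a name short, alongside one possible alternative that splits at the first opening parenthesis:

```python
import re

# Hypothetical infobox strings, not taken from a real page
plain = "Jane Mary Public (1950-01-01) January 1, 1950"
with_initial = "Jane Q. Public (1950-01-01) January 1, 1950"

def leading_words(text):
    """Apply the pattern used above and strip surrounding whitespace."""
    return re.search(r"(\w+\s)+", text).group().strip()

def before_parenthesis(text):
    """Alternative: take everything before the first opening parenthesis."""
    return text.split("(", 1)[0].strip()

print(leading_words(plain))              # Jane Mary Public
print(leading_words(with_initial))       # Jane
print(before_parenthesis(with_initial))  # Jane Q. Public
```

The regular expression stops at the period after the initial, while the split keeps the whole name. The split in turn assumes the name is always followed by a parenthesized date, which may not hold for every infobox.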


### 4. Make a function
a) Use the code above to make a function called `get_wiki_info` that takes a single input `url` (a Wikipedia URL) and returns a data frame in the format above. There is no need to handle exceptions (e.g., no infobox on the page, page does not exist); we will only use members of the U.S. Senate for this exercise.\
b) Show how your function works on the following two URLs:

https://en.wikipedia.org/wiki/Tammy_Duckworth 

https://en.wikipedia.org/wiki/Susan_Collins

Depending on your previous function, you may receive an error message because Susan Collins has no entry for Children. Fix your function so that NA is recorded in such instances.

In [12]:
def get_wiki_info(url):
    """Return the Full name, Political party, and Children rows of a Wikipedia infobox."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    infobox_label = [a.text for a in soup.find_all(class_="infobox-label")]
    infobox_data = [a.text for a in soup.find_all(class_="infobox-data")]
    df = pd.DataFrame({'key': infobox_label, 'value': infobox_data})
    # The full name sits in the "Born" row, so rename that key only
    df['key'] = df['key'].replace('Born', 'Full name')
    df = df[df['key'].isin(['Full name', 'Political party', 'Children'])].copy()
    # Keep only the leading run of words (the name), dropping the birth date and place
    is_name = df['key'] == 'Full name'
    df.loc[is_name, 'value'] = re.search(r'(\w+\s)+', df.loc[is_name, 'value'].values[0]).group().strip()
    return df

In [13]:
get_wiki_info('https://en.wikipedia.org/wiki/Tammy_Duckworth')

| | key | value |
|---|---|---|
| 11 | Full name | Ladda Tammy Duckworth |
| 12 | Political party | Democratic |
| 14 | Children | 2 |


In [14]:
get_wiki_info('https://en.wikipedia.org/wiki/Susan_Collins')

| | key | value |
|---|---|---|
| 5 | Full name | Susan Margaret Collins |
| 6 | Political party | Republican |


In [15]:
def get_wiki_info(url):
    """Like the version above, but adds an 'Na' row for every missing key."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    infobox_label = [a.text for a in soup.find_all(class_="infobox-label")]
    infobox_data = [a.text for a in soup.find_all(class_="infobox-data")]
    df = pd.DataFrame({'key': infobox_label, 'value': infobox_data})
    # The full name sits in the "Born" row, so rename that key only
    df['key'] = df['key'].replace('Born', 'Full name')
    wanted = ['Full name', 'Political party', 'Children']
    df = df[df['key'].isin(wanted)].copy()

    is_name = df['key'] == 'Full name'
    if is_name.any():
        # Keep only the leading run of words (the name)
        df.loc[is_name, 'value'] = re.search(r'(\w+\s)+', df.loc[is_name, 'value'].values[0]).group().strip()

    # Append a placeholder row for each missing key.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    for key in wanted:
        if key not in set(df['key']):
            missing = pd.DataFrame([{'key': key, 'value': 'Na'}])
            df = pd.concat([df, missing], ignore_index=True)
    return df

In [16]:
get_wiki_info('https://en.wikipedia.org/wiki/Tammy_Duckworth')

| | key | value |
|---|---|---|
| 11 | Full name | Ladda Tammy Duckworth |
| 12 | Political party | Democratic |
| 14 | Children | 2 |


In [17]:
get_wiki_info('https://en.wikipedia.org/wiki/Susan_Collins')


| | key | value |
|---|---|---|
| 0 | Full name | Susan Margaret Collins |
| 1 | Political party | Republican |
| 2 | Children | Na |
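
Appending placeholder rows key by key can also be expressed as a single reindexing step, which records a real missing value (`NaN`) rather than the string `'Na'`. A minimal sketch, assuming each key occurs at most once; `complete_keys` and `WANTED_KEYS` are illustrative names, not part of the solution above, and the input frame is built by hand rather than scraped:

```python
import pandas as pd

WANTED_KEYS = ["Full name", "Political party", "Children"]  # keys from step 3c

def complete_keys(df, keys=WANTED_KEYS):
    """Return df with exactly one row per key; missing keys get a NaN value.

    Assumes each key appears at most once in df (reindex rejects duplicate labels).
    """
    return (
        df.set_index("key")
          .reindex(keys)
          .rename_axis("key")
          .reset_index()
    )

# Hand-built frame mirroring the Susan Collins result, which has no Children row
partial = pd.DataFrame({
    "key": ["Full name", "Political party"],
    "value": ["Susan Margaret Collins", "Republican"],
})
completed = complete_keys(partial)
print(completed)
```

A side effect of this approach is that the rows always come back in the order given by `keys`, which may make results from different pages easier to compare or stack.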
