# Data Science Project - Part 1: Data Collection

In this notebook, we collect texts from two categories of Wikipedia articles:
- biographies of [US Presidents](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States)
- biographies of [women scientists of the 21st century](https://en.wikipedia.org/wiki/List_of_female_scientists_in_the_21st_century)

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

## 1. Scraping
### 1.1 Get the links for target texts  
In this first section, we retrieve two lists of links: the links pointing to the biographies of US Presidents and the links pointing to the biographies of women scientists.

In [2]:
BASE_WIKI = "https://en.wikipedia.org"
BASE_URL_PRESIDENTS = "/wiki/List_of_presidents_of_the_United_States"
BASE_URL_WOMEN_SCIENTISTS = "/wiki/List_of_female_scientists_in_the_21st_century"
DATA_DIR = "../../data/part1/"
UA = {'User-agent': 'Mozilla/5.0'}

# Get links of US presidents pages
def get_presidents_links():
    r = requests.get(BASE_WIKI + BASE_URL_PRESIDENTS, headers=UA)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []

    # Get the table containing the presidents list
    pres_table = soup.find("table", class_ = "wikitable sortable")

    # Get the links of the presidents pages : the second link of each row
    for table_row in pres_table.select('tbody > tr'):
        row_links = table_row.select('td a[href]')
        if len(row_links) > 1:
            links.append(row_links[1]['href'])

    return links

# Get links of Women Scientists in the 21st century
def get_women_scientists_links():
    r = requests.get(BASE_WIKI + BASE_URL_WOMEN_SCIENTISTS, headers=UA)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []

    # Get all the links that are in a list in the main content div
    all_lists = soup.select('div.mw-parser-output ul li')
    
    for item in all_lists:  # `item` avoids shadowing the built-in `list`

        # We are only interested in the first link of each list item
        link = item.find('a')

        if link:
            href = link.get('href', '')

            # Skip links that do not point to articles
            if href.startswith(("/wiki/Category:", "/wiki/File:", "/wiki/Portal:")):
                continue

            elif href.startswith("/wiki/"):
                links.append(href)

    return links
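
The filtering rule used in `get_women_scientists_links` can be isolated into a small pure function, which makes it easy to check without querying Wikipedia. The sketch below is only an illustration: `is_article_link` and `unique_article_links` are hypothetical helper names, and the order-preserving de-duplication is an added assumption (the notebook itself does not remove duplicate links).

```python
# Hypothetical helpers (not part of the notebook): same filtering rule as above,
# plus order-preserving de-duplication, which is an assumption of this sketch.
NON_ARTICLE_PREFIXES = ("/wiki/Category:", "/wiki/File:", "/wiki/Portal:")


def is_article_link(href):
    """Return True for internal article links such as '/wiki/Mimoza_Hafizi'."""
    return href.startswith("/wiki/") and not href.startswith(NON_ARTICLE_PREFIXES)


def unique_article_links(hrefs):
    """Keep article links only, dropping duplicates while preserving order."""
    return list(dict.fromkeys(h for h in hrefs if is_article_link(h)))


sample = [
    "/wiki/Mimoza_Hafizi",
    "/wiki/Category:Women_scientists",
    "https://example.org/external",
    "/wiki/Mimoza_Hafizi",
    "/wiki/Yasmine_Amhis",
]
print(unique_article_links(sample))
# ['/wiki/Mimoza_Hafizi', '/wiki/Yasmine_Amhis']
```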

In [3]:
# Visualize a few links
pres = get_presidents_links()[:5]
sci = get_women_scientists_links()[:5]
print(pres, sci)

['/wiki/George_Washington', '/wiki/John_Adams', '/wiki/Thomas_Jefferson', '/wiki/James_Madison', '/wiki/James_Monroe'] ['/wiki/Mimoza_Hafizi', '/wiki/Laura_Mersini-Houghton', '/wiki/Af%C3%ABrdita_Veve%C3%A7ka_Priftaj', '/wiki/Yasmine_Amhis', '/wiki/Sonia_%C3%81lvarez_Leguizam%C3%B3n']
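
Some of the links above are percent-encoded (for example `%C3%AB` stands for `ë`). As a minimal sketch, assuming we only want a readable name for display or logging, the standard library can decode them and build the absolute URL. `readable_name` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from urllib.parse import unquote, urljoin

BASE_WIKI = "https://en.wikipedia.org"  # same constant as in the notebook


def readable_name(href):
    """Turn '/wiki/Some_%C3%81_Name' into 'Some Á Name' (hypothetical helper)."""
    slug = href[len("/wiki/"):] if href.startswith("/wiki/") else href
    return unquote(slug).replace("_", " ")


href = "/wiki/Sonia_%C3%81lvarez_Leguizam%C3%B3n"
print(readable_name(href))       # Sonia Álvarez Leguizamón
print(urljoin(BASE_WIKI, href))  # absolute URL, still percent-encoded
```

The encoded form is the one to keep for HTTP requests; the decoded form is only meant for human-readable output.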


### 1.2 Extract the content of the articles  
Now we extract the content of the articles. More precisely, we extract all the text enclosed in \<p\>\</p\> tags (paragraphs) inside the page's `#content` element.
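
To make "text enclosed in paragraph tags" concrete, here is a small offline illustration on a hand-written HTML snippet. It uses the standard library `html.parser` instead of BeautifulSoup so that it runs without any dependency; the class `ParagraphText` and the function `extract_paragraphs` are hypothetical names for this sketch, and it ignores the `#content` restriction applied in the notebook.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while we are inside a <p> element
        self.paragraphs = []  # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


html = (
    '<div id="content"><h1>Title</h1>'
    "<p>First <b>bold</b> paragraph.</p>"
    "<ul><li>skipped</li></ul>"
    "<p>Second.</p></div>"
)
paragraphs = extract_paragraphs(html)
print(paragraphs)           # ['First bold paragraph.', 'Second.']
print("".join(paragraphs))  # First bold paragraph.Second.
```

Text in headings and list items is ignored, and nested inline tags such as `<b>` contribute their text to the surrounding paragraph. Note that joining with an empty string, as the notebook does, adds no separator between paragraphs.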

In [4]:
def extract_content_from_articles(links_list, category):
    
    data = []

    for link in links_list:
        r = requests.get(BASE_WIKI + link, headers=UA)
        soup = BeautifulSoup(r.text, 'html.parser')

        # Get the title of the article
        title = soup.find("h1", class_ = "firstHeading").text
        # Debug aid: report links that unexpectedly resolve to the Wikipedia Main Page
        if title == "Main Page":
            print(link)

        # Get the content of the article: all paragraphs that are inside the main content div
        content = "".join([p.text for p in soup.select("#content p")])

        # Store the content of the article in a file
        with open(DATA_DIR + category + "/" + title + ".txt", "w", encoding="utf-8") as f:
            f.write(content)

        data.append([title, content, category])
    
    return pd.DataFrame(data, columns = ["title", "content", "category"])
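
The function above builds the file path directly from the article title and assumes that the target folder already exists. The following sketch shows one possible way to make the saving step more robust; it is an assumption about what could go wrong (titles containing characters such as `/`, missing folders), not a fix taken from the original project. `safe_filename` and `save_article` are hypothetical helpers.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or "untitled"


def save_article(data_dir, category, title, content):
    """Write one article to <data_dir>/<category>/<safe title>.txt and return the path."""
    target_dir = Path(data_dir) / category
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    path = target_dir / (safe_filename(title) + ".txt")
    path.write_text(content, encoding="utf-8")
    return path


# Demo in a temporary folder, so nothing is written to the real data directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_article(tmp, "Women_Scientists", "AC/DC: a test?", "some text")
    print(saved.name)  # AC_DC_ a test_.txt
```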


### 1.3 Store the data into files and pandas DataFrames

In [5]:
def collect_texts_from_categories(categories):
    
    # Collect the links 
    president_links = get_presidents_links()
    women_scientist_links = get_women_scientists_links()

    # Store the texts
    df1 = extract_content_from_articles(president_links, categories[0])
    df2 = extract_content_from_articles(women_scientist_links, categories[1])

    # Concatenate the dataframes
    df = pd.concat([df1, df2], ignore_index=True)

    # Shuffle the data
    return shuffle(df, random_state=42)

In [6]:
df = collect_texts_from_categories(["US_Presidents", "Women_Scientists"])
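
The call above issues one HTTP request per article, several hundred in total, and a single network error would interrupt the whole collection. As a hedged sketch, a generic retry wrapper could be placed around the request. `fetch_with_retry` is a hypothetical helper; it takes the fetching function as a parameter so that the demo below can run with a fake fetcher and no network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), making at most `retries` attempts in total."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # linear back-off between attempts
    raise last_error


# Demo with a fake fetcher that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "/wiki/Example", delay=0, sleep=lambda s: None)
print(result, calls["count"])  # <html>ok</html> 3
```

In the notebook, `fetch` could be a small function wrapping `requests.get(BASE_WIKI + link, headers=UA, timeout=10)` followed by `raise_for_status()`; the timeout value is an arbitrary choice for illustration.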

## 2. Inspect the collected data

In [7]:
df

| | title | content | category |
|---|---|---|---|
| 287 | Alice Alldredge | Alice Alldredge is an American oceanographer a... | Women_Scientists |
| 329 | Yolanda T. Moses | Yolanda Theresa Moses (born 1946) is an anthro... | Women_Scientists |
| 323 | Carolyn M. Mazure | Carolyn M. Mazure (born 1949) is an American p... | Women_Scientists |
| 145 | Merieme Chadid | Merieme Chadid (Arabic: مريم شديد; born 11 Oct... | Women_Scientists |
| 55 | Vandika Ervandovna Avetisyan | Vandika Ervandovna Avetisyan (born October 5, ... | Women_Scientists |
| ... | ... | ... | ... |
| 71 | Mary E. White | \nMary Elizabeth White AM (22 February 1926 –... | Women_Scientists |
| 106 | Diane Massam | Diane Massam is a Canadian linguist, Professor... | Women_Scientists |
| 270 | Margaret Stanley (virologist) | \nMargaret Anne Stanley, OBE FMedSci is a Brit... | Women_Scientists |
| 348 | Una Ryan | Una Ryan (born December 18, 1941) is a British... | Women_Scientists |


In [8]:
# Shuffle the data again (uncomment the last line to save it as CSV)
df = shuffle(df, random_state=42)
# df.to_csv(DATA_DIR + "all_articles.csv", index=False)

The two classes are heavily imbalanced: there are far more Women Scientists articles than US Presidents articles. To balance the data, we can keep all 46 US Presidents articles and draw a random sample of 46 Women Scientists articles.
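
The balancing idea (random downsampling of the larger class to the size of the smaller one) can be sketched independently of pandas. The code below is a plain-Python illustration on toy data, not the approach used in the following cells; `downsample_to_minority` is a hypothetical helper and the row dictionaries are made up for the example.

```python
import random
from collections import Counter, defaultdict


def downsample_to_minority(rows, key, seed=42):
    """Randomly trim every class to the size of the smallest class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed for reproducibility

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # mix the classes again
    return balanced


# Toy data: 3 rows in one class, 10 in the other.
rows = (
    [{"title": f"president_{i}", "category": "US_Presidents"} for i in range(3)]
    + [{"title": f"scientist_{i}", "category": "Women_Scientists"} for i in range(10)]
)
balanced = downsample_to_minority(rows, "category")
print(Counter(row["category"] for row in balanced))
```

Deriving the target size from the data avoids hard-coding a number such as 46, which would need to change whenever the list of presidents grows.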

In [9]:
pres_df = df[df['category'] == 'US_Presidents']
women_sci_df = df[df['category'] == 'Women_Scientists']

In [10]:
print(len(pres_df), len(women_sci_df))

46 332


In [11]:
# Draw a random sample of 46 rows, matching the number of US Presidents articles
sample_women_sci_df = women_sci_df.sample(n=46, random_state=42)
len(sample_women_sci_df)

46

In [12]:
balanced_df = pd.concat([pres_df, sample_women_sci_df], ignore_index=True)
# Shuffle the data (uncomment the to_csv line below to save it)
balanced_df = shuffle(balanced_df, random_state=42)
# balanced_df.to_csv(DATA_DIR + "balanced_selection-46.csv", index=False)
balanced_df

| | title | content | category |
|---|---|---|---|
| 40 | George H. W. Bush | \nGeorge Herbert Walker Bush[a] (June 12, 1924... | US_Presidents |
| 22 | Rutherford B. Hayes | \n\nRutherford Birchard Hayes (/ˈrʌðərfərd/; O... | US_Presidents |
| 55 | Alice K. Jacobs | Alice K. Jacobs is a professor at the Boston U... | Women_Scientists |
| 72 | Carme Torras | Carme Torras Genís (born 4 July 1956)[1] is a ... | Women_Scientists |
| 0 | George W. Bush | \nGeorge Walker Bush (born July 6, 1946) is an... | US_Presidents |
| ... | ... | ... | ... |
| 20 | Grover Cleveland | \nStephen Grover Cleveland (March 18, 1837 – J... | US_Presidents |
| 60 | Sara Gill | Sara Gill (Urdu: سارہ گِل) is a Pakistani phys... | Women_Scientists |
| 71 | Lydia Kavraki | Lydia E. Kavraki (Greek: Λύδια Καβράκη) is a G... | Women_Scientists |
| 14 | Martin Van Buren | \nMartin Van Buren (/væn ˈbjʊərən/ van BYURE-ə... | US_Presidents |
