# 🗿 Scraping Static Website

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](<https://colab.research.google.com/github/MYTE21/Red.Blue.States/blob/okidogi/notebooks/scraping_static_website.ipynb>)

This notebook scrapes the static Wikipedia page [🔗 Red states and blue states](https://en.wikipedia.org/wiki/Red_states_and_blue_states)
to collect the results of US presidential elections by state since 1972.

* **Wikipedia Page Link:** [🔗 Red states and blue states](https://en.wikipedia.org/wiki/Red_states_and_blue_states).
* **Scraped Table Link:** [🔗 Table of presidential elections by states since 1972](https://en.wikipedia.org/wiki/Red_states_and_blue_states#:~:text=suburbs%20were%20divided.-,Table%20of%20presidential%20elections%20by%20states%20since%201972,-%5Bedit%5D).



# 1. ⚙️ Imports

Import the necessary libraries and packages.

In [23]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Local Libraries
from data import data

# 2. 📁 Data

Fetch the webpage, locate the target table, and store the extracted data in a predefined location.

## 2.1. 📜 Getting the Webpage

In [2]:
# Define the URL.
url = data.get_dataset_path("rbs", "urls", "red_blue_states_url", 1)

# Send an HTTP GET request to the URL.
response = requests.get(url)

# Parse the HTML content of the page using the 'lxml' parser.
webpage = BeautifulSoup(response.text, "lxml")

print("Title of the web page: ", webpage.title.string)

Title of the web page:  Red states and blue states - Wikipedia
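
The request above assumes the page is always reachable. A minimal sketch of a more defensive fetch, assuming a hypothetical helper `fetch_html` and an illustrative `User-Agent` string (neither is part of this notebook), might look like this:

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: return the page HTML or raise on failure."""
    # Illustrative User-Agent value; replace it with one that identifies your project.
    headers = {"User-Agent": "red-blue-states-notebook/0.1 (educational scraper)"}
    # `timeout` stops the call from hanging if the server never answers.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

The returned string could then be passed to `BeautifulSoup(...)` in the same way as `response.text` above.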


## 2.2. ♟️ Getting the Table

In [3]:
tables = webpage.find_all(name="table", attrs={"class": "wikitable"})

print("Number of tables where class name is 'wikitable': ", len(tables))

Number of tables where class name is 'wikitable':  1
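
Filtering on `class` deserves a quick check: BeautifulSoup treats `class` as a multi-valued attribute, so the filter also matches tables that carry additional classes. A small sketch on made-up HTML (the markup is illustrative, not taken from Wikipedia):

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><th>Year</th></tr></table>
<table class="infobox"><tr><th>Other</th></tr></table>
<table class="wikitable"><tr><th>State</th></tr></table>
"""

# The built-in "html.parser" keeps this sketch free of extra dependencies.
soup = BeautifulSoup(html, "html.parser")

# A class filter matches any element whose class list contains "wikitable".
tables = soup.find_all(name="table", attrs={"class": "wikitable"})
first_headers = [table.th.text for table in tables]
print(len(tables), first_headers)
```

If a page had several matches, indexing with `tables[0]` would silently pick the first one, so checking `len(tables)` first is a sensible guard.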


⚓️ There is only one table with the class name 'wikitable', and it is our target table. ⤵️

In [4]:
table = tables[0]

print("The 'table' contains the full Presidential Elections table with HTML tags. \n"
    "Let's see the first 20 elements contained in that table: \n")

list(table.stripped_strings)[:20]

The 'table' contains the full Presidential Elections table with HTML tags. 
Let's see the first 20 elements contained in that table: 



['Year',
 '1972',
 '1976',
 '1980',
 '1984',
 '1988',
 '1992',
 '1996',
 '2000',
 '2004',
 '2008',
 '2012',
 '2016',
 '2020',
 '2024',
 'Democratic',
 'Republican',
 '(lighter shading indicates win ≤5%)',
 'Winner received plurality of the vote but did not receive an outright majority of the popular vote',
 'Winner']

⚓️ We can see that we have found our desired table. ⤴️

### 2.2.1. 🧢 Header of Table

⚓️ Let's get the column names. ⤵️

In [5]:
headers = table.find_all(name="th")

print("Total number of columns of the table: ", len(headers))

Total number of columns of the table:  16


In [6]:
print("Let's check the last column: ", " ".join(list(headers[-1].stripped_strings)))

Let's check the last column:  Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote


⚓️ The last header cell holds the table's legend rather than a column name, so we ignore it for now. ⤵️

In [7]:
columns = [column.text.strip() for column in headers[:-1]]

print("Table columns are: \n")
columns

Table columns are: 



['Year',
 '1972',
 '1976',
 '1980',
 '1984',
 '1988',
 '1992',
 '1996',
 '2000',
 '2004',
 '2008',
 '2012',
 '2016',
 '2020',
 '2024']

### 2.2.2. 🇺🇸 US Presidential Elections by States

#### 2.2.2.1. 🐬 Body of Table

The first row is the header row, and the 2nd to 4th rows contain helper information,
so we take the 5th through the last row. ⤵️

In [8]:
body = table.find_all(name="tr")

In [9]:
print("The first row of the table: ", " ".join(list(body[4].stripped_strings)))

print("...\nThe last row of the table: ", " ".join(list(body[-1].stripped_strings)))

The first row of the table:  National popular vote Nixon Carter Reagan Reagan Bush Clinton Clinton Gore Bush Obama Obama Clinton Biden Trump
...
The last row of the table:  Wyoming Nixon Ford Reagan Reagan Bush Bush Dole Bush Bush McCain Romney Trump Trump Trump


In [10]:
state_contents = []

for row_id in range(4, len(body)):
    # Use `cell` as the loop name so it does not clash with the imported `data` module.
    row_data = [
        cell.text.replace("\n", "").strip() for cell in body[row_id].find_all("td")
    ]
    state_contents.append(row_data)

print("View the first row of the state_contents: ", state_contents[0])
print("...\nView the last row of the state_contents: ", state_contents[-1])

View the first row of the state_contents:  ['National popular vote', 'Nixon', 'Carter', 'Reagan', 'Reagan', 'Bush', 'Clinton', 'Clinton', 'Gore', 'Bush', 'Obama', 'Obama', 'Clinton', 'Biden', 'Trump']
...
View the last row of the state_contents:  ['Wyoming', 'Nixon', 'Ford', 'Reagan', 'Reagan', 'Bush', 'Bush', 'Dole', 'Bush', 'Bush', 'McCain', 'Romney', 'Trump', 'Trump', 'Trump']
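
The row slicing can be tried on a toy table. The sketch below uses made-up HTML with one header row, one helper row, and two data rows, so the data starts at index 2 rather than 4. `FIRST_DATA_ROW` is a name introduced here for illustration, and `get_text(strip=True)` stands in for the `.text.replace("\n", "").strip()` call used above (the two can differ when a cell contains nested tags).

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Year</th><th>1972</th><th>1976</th></tr>
  <tr><td>Helper row</td><td>a</td><td>b</td></tr>
  <tr><td>Alabama</td><td> Nixon </td><td>Carter</td></tr>
  <tr><td>Alaska</td><td>Nixon</td><td> Ford </td></tr>
</table>
"""

# Hypothetical constant: index of the first state row in this toy table.
FIRST_DATA_ROW = 2

rows = BeautifulSoup(html, "html.parser").find_all(name="tr")

# Slice off the header and helper rows, then read the text of every <td> cell.
state_rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in rows[FIRST_DATA_ROW:]
]
print(state_rows)
```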


#### 2.2.2.2. 🗂️ Whole State Table

Assemble the full state table.

In [11]:
state_dataframe = pd.DataFrame(data=state_contents, columns=columns)

In [12]:
print("Presidential elections by states: \n")
state_dataframe.head()

Presidential elections by states: 



|   | Year | 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 | 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | National popular vote | Nixon | Carter | Reagan | Reagan | Bush | Clinton | Clinton | Gore | Bush | Obama | Obama | Clinton | Biden | Trump |
| 1 | Alabama | Nixon | Carter | Reagan | Reagan | Bush | Bush | Dole | Bush | Bush | McCain | Romney | Trump | Trump | Trump |
| 2 | Alaska | Nixon | Ford | Reagan | Reagan | Bush | Bush | Dole | Bush | Bush | McCain | Romney | Trump | Trump | Trump |
| 3 | Arizona | Nixon | Ford | Reagan | Reagan | Bush | Bush | Clinton | Bush | Bush | McCain | Romney | Trump | Biden | Trump |
| 4 | Arkansas | Nixon | Carter | Reagan | Reagan | Bush | Clinton | Clinton | Bush | Bush | McCain | Romney | Trump | Trump | Trump |


In [13]:
print("Shape of the state_dataframe: ", state_dataframe.shape)
print("\t - Total rows: ", state_dataframe.shape[0])
print("\t - Total columns: ", state_dataframe.shape[1])

Shape of the state_dataframe:  (55, 15)
	 - Total rows:  55
	 - Total columns:  15
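
A cell merged with `colspan` or `rowspan` would yield a row with fewer `td` elements than there are columns. A quick width check before building the DataFrame makes such rows visible. The sketch below uses a hypothetical helper, `find_ragged_rows`, and made-up rows:

```python
def find_ragged_rows(rows, columns):
    """Hypothetical helper: return (index, width) for rows whose width differs from the header."""
    expected = len(columns)
    return [(index, len(row)) for index, row in enumerate(rows) if len(row) != expected]


columns = ["Year", "1972", "1976"]
rows = [
    ["Alabama", "Nixon", "Carter"],
    ["Alaska", "Nixon"],  # Deliberately short row for the demonstration.
]
print(find_ragged_rows(rows, columns))  # → [(1, 2)]
```

An empty list means every row lines up with the column names.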


### 2.2.3. 🤴🏻 US Presidential Election Candidates

⚓️ The candidates' data sits in the 3rd and 4th rows of the table (indices 2 and 3), so we consider only those. ⤵️

In [14]:
body = table.find_all(name="tr")

In [15]:
candidate_contents = []

for row_id in range(2, 4):
    row_data = [
        cell.text.replace("\n", "").strip() for cell in body[row_id].find_all("td")
    ]
    # Pair every column name with the cell in the same position.
    candidate_contents.append(dict(zip(columns, row_data)))

print("View the first row of the candidate_contents: ", candidate_contents[0])
print("\nView the second row of the candidate_contents: ", candidate_contents[-1])

View the first row of the candidate_contents:  {'Year': 'Democratic candidate', '1972': 'George McGovern', '1976': 'Jimmy Carter', '1980': 'Jimmy Carter', '1984': 'Walter Mondale', '1988': 'Michael Dukakis', '1992': 'Bill Clinton', '1996': 'Bill Clinton', '2000': 'Al Gore', '2004': 'John Kerry', '2008': 'Barack Obama', '2012': 'Barack Obama', '2016': 'Hillary Clinton', '2020': 'Joe Biden', '2024': 'Kamala Harris'}

View the second row of the candidate_contents:  {'Year': 'Republican candidate', '1972': 'Richard Nixon', '1976': 'Gerald Ford', '1980': 'Ronald Reagan', '1984': 'Ronald Reagan', '1988': 'George H. W. Bush', '1992': 'George H. W. Bush', '1996': 'Bob Dole', '2000': 'George W. Bush', '2004': 'George W. Bush', '2008': 'John McCain', '2012': 'Mitt Romney', '2016': 'Donald Trump', '2020': 'Donald Trump', '2024': 'Donald Trump'}


#### 2.2.3.1. 🗂️ Whole Candidate Table

Assemble the full candidate table.

In [16]:
candidate_dataframe = pd.DataFrame(data=candidate_contents, columns=columns)

In [17]:
print("Candidates of presidential elections: \n")
candidate_dataframe.head()

Candidates of presidential elections: 



|   | Year | 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 | 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Democratic candidate | George McGovern | Jimmy Carter | Jimmy Carter | Walter Mondale | Michael Dukakis | Bill Clinton | Bill Clinton | Al Gore | John Kerry | Barack Obama | Barack Obama | Hillary Clinton | Joe Biden | Kamala Harris |
| 1 | Republican candidate | Richard Nixon | Gerald Ford | Ronald Reagan | Ronald Reagan | George H. W. Bush | George H. W. Bush | Bob Dole | George W. Bush | George W. Bush | John McCain | Mitt Romney | Donald Trump | Donald Trump | Donald Trump |


#### 2.2.3.2. 🏆 Winner Candidates

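Names of the candidates who won each presidential election. The code below keeps a candidate's name only when it is wrapped in a `<b>` tag, which is how the winner appears in this table; the other party's cell becomes `None`.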

In [18]:
winner_contents = []

for row_id in range(2, 4):
    row_data = []
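    # Use `cell` as the loop name: a top-level loop over `data` would rebind
    # the imported `data` module and break `data.get_dataset_path` later on.
    for cell in body[row_id].find_all("td"):
        # Party label cell: keep its full text.
        if cell.find("span"):
            row_data.append(cell.text.strip())
            continue

        # Candidate cell: keep the name only if it is in bold (the winner).
        b_tag = cell.find("b")

        if b_tag:
            row_data.append(b_tag.text.replace("\n", "").strip())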
        else:
            row_data.append(None)

    winner_contents.append({column: value for column, value in zip(columns, row_data)})

print("View the first row of the winner_contents: ", winner_contents[0])
print("\nView the second row of the winner_contents: ", winner_contents[-1])

View the first row of the winner_contents:  {'Year': 'Democratic candidate', '1972': None, '1976': 'Jimmy Carter', '1980': None, '1984': None, '1988': None, '1992': 'Bill Clinton', '1996': 'Bill Clinton', '2000': None, '2004': None, '2008': 'Barack Obama', '2012': 'Barack Obama', '2016': None, '2020': 'Joe Biden', '2024': None}

View the second row of the winner_contents:  {'Year': 'Republican candidate', '1972': 'Richard Nixon', '1976': None, '1980': 'Ronald Reagan', '1984': 'Ronald Reagan', '1988': 'George H. W. Bush', '1992': None, '1996': None, '2000': 'George W. Bush', '2004': 'George W. Bush', '2008': None, '2012': None, '2016': 'Donald Trump', '2020': None, '2024': 'Donald Trump'}


In [19]:
winner_dataframe = pd.DataFrame(data=winner_contents, columns=columns)

print("Winners of the presidential elections: \n")
winner_dataframe

Winners of the presidential elections: 



|   | Year | 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 | 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Democratic candidate |  | Jimmy Carter |  |  |  | Bill Clinton | Bill Clinton |  |  | Barack Obama | Barack Obama |  | Joe Biden |  |
| 1 | Republican candidate | Richard Nixon |  | Ronald Reagan | Ronald Reagan | George H. W. Bush |  |  | George W. Bush | George W. Bush |  |  | Donald Trump |  | Donald Trump |
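
Because each year has exactly one non-empty cell across the two party rows, the winners can also be collapsed into a single year-to-winner mapping. A minimal sketch, assuming a hypothetical helper `merge_winners` and a three-year excerpt of the rows shown above:

```python
def merge_winners(party_rows, label_column="Year"):
    """Hypothetical helper: map each election year to the single non-empty winner."""
    winners = {}
    for row in party_rows:
        for year, name in row.items():
            # Skip the label column and the empty cells of the losing party.
            if year == label_column or name is None:
                continue
            winners[year] = name
    return winners


# Excerpt of the two party rows, limited to three elections for brevity.
excerpt = [
    {"Year": "Democratic candidate", "1972": None, "1976": "Jimmy Carter", "1980": None},
    {"Year": "Republican candidate", "1972": "Richard Nixon", "1976": None, "1980": "Ronald Reagan"},
]
winners_by_year = merge_winners(excerpt)
print(sorted(winners_by_year.items()))
```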


# 3. 💣 Export the DataFrame

⚓️ Create a reusable function to export a dataframe into a CSV file. ⤵️

In [20]:
def export_dataframe_to_csv(dataframe: pd.DataFrame, name: str) -> None:
    """
    Export the given dataframe into a CSV file with the given name.
    Parameters:
        - dataframe (pd.DataFrame): The dataframe to be exported.
        - name (str): The name of the CSV file.
    Returns:
        - None: Show the successful message with the exported location when the operation succeeds;
        otherwise, show an error message.
    """
    data_path = data.get_dataset_path("rbs", "raw", name, 1)

    try:
        dataframe.to_csv(data_path, index=False)
        print("🎉 Saved data to CSV at: ", data_path)
    except Exception as e:
        print(f"❌ Error: {e}")
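
The function above reports success as soon as `to_csv` returns. One optional extra step is to read the file back and compare shapes. The sketch below is an illustration under assumptions: `export_and_verify` is a hypothetical helper, and a temporary folder stands in for the path returned by `data.get_dataset_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_and_verify(dataframe: pd.DataFrame, path: Path) -> bool:
    """Hypothetical helper: write a CSV, read it back, and compare shapes."""
    # Create the target folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    dataframe.to_csv(path, index=False)
    # dtype=str keeps every value as text when reading the file back.
    restored = pd.read_csv(path, dtype=str)
    return restored.shape == dataframe.shape


frame = pd.DataFrame({"Year": ["Alabama", "Alaska"], "1972": ["Nixon", "Nixon"]})
with tempfile.TemporaryDirectory() as folder:
    ok = export_and_verify(frame, Path(folder) / "raw" / "example.csv")
print(ok)
```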

⚓️ Export the 🇺🇸 US Presidential Elections by States dataframe. ⤵️

In [24]:
export_dataframe_to_csv(state_dataframe, "us_presidential_elections_by_states")

🎉 Saved data to CSV at:  /../../../../../../Volumes/Workstation/Datasets/Red.Blue.States/raw/us_presidential_elections_by_states.csv


⚓️ Export the 🇺🇸 US Presidential Elections Candidate dataframe. ⤵️

In [25]:
export_dataframe_to_csv(candidate_dataframe, "us_presidential_elections_candidate")

🎉 Saved data to CSV at:  /../../../../../../Volumes/Workstation/Datasets/Red.Blue.States/raw/us_presidential_elections_candidate.csv


⚓️ Export the US Presidential Elections Winner dataframe. ⤵️

In [26]:
export_dataframe_to_csv(winner_dataframe, "us_presidential_elections_winner")

🎉 Saved data to CSV at:  /../../../../../../Volumes/Workstation/Datasets/Red.Blue.States/raw/us_presidential_elections_winner.csv


🎉 Congratulations! The `Scraping Static Website` is complete!