<font style='font-size:3em'>**📝 DS105 Week 7 Summative Assessment** </font>

**PURPOSE**: This Jupyter Notebook scrapes the page titles and links of the Wikipedia articles on the US presidential elections from 1944 to 2024. I build a dataframe with the URLs of all 21 pages and export it as a CSV file called *Elections.csv*

<a href="https://en.wikipedia.org/w/index.php?title=Special:Search&limit=56201&offset=0&ns0=1&search=United+States+presidential+election">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Wikipedia-logo-v2-en.svg/300px-Wikipedia-logo-v2-en.svg.png" alt="Wikipedia" width="10%">
</a>

**LAST REVISION:** *16th November 2023*

## ⚙️ Setting Up

### Packages needed for this NB to run  

In [1]:
import pandas as pd
import requests
from scrapy import Selector

Here, I set the limit to 56201 to get all results and set the offset to 0 to start from the first result
- This maximises the scope of my scrape and the results
- I set the User-Agent to my LSE email address

In [2]:
adv_search_url = 'https://en.wikipedia.org/w/index.php?title=Special:Search&limit=56201&offset=0&ns0=1&search=United+States+presidential+election'
headers = {'User-Agent': 'm.filip-turner@lse.ac.uk'}
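
As a minimal sketch, the same search URL could also be assembled from its individual parameters, which makes the limit and offset settings easier to see and change. The names search_params and built_url are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Illustrative names (not in the original notebook): search_params, built_url
search_params = {
    'title': 'Special:Search',
    'limit': 56201,   # upper bound on the number of results requested
    'offset': 0,      # start from the first result
    'ns0': 1,         # search the main (article) namespace
    'search': 'United States presidential election',
}

# safe=':' keeps the colon in 'Special:Search' unescaped; spaces become '+'
built_url = 'https://en.wikipedia.org/w/index.php?' + urlencode(search_params, safe=':')
print(built_url)
```

Built this way, the string should match adv_search_url exactly, so it could be passed to the same functions.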

- I create a function that gets the result links from the advanced search

In [3]:
def get_search_results(url, headers):
    """Return the href of every result heading on the search page."""
    # Fetch the search page and parse its HTML
    response = requests.get(url, headers=headers)
    sel = Selector(text=response.text)
    return sel.css("div.mw-search-result-heading > a::attr(href)").getall()


- I create a function that cleans the data by removing '/wiki/' from the links and replacing '_' with spaces

In [4]:
def clean_title(title):
    """Clean and format the title string."""
    return title.replace("/wiki/", "").replace("_", " ")
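
To illustrate what clean_title does, the sketch below applies it to a sample link of the kind returned by the search. It also shows a hypothetical variant, clean_title_decoded (not in the original notebook), which assumes that some links may contain percent-encoded characters and decodes them with urllib.parse.unquote.

```python
from urllib.parse import unquote

def clean_title(title):
    """Clean and format the title string (same logic as the cell above)."""
    return title.replace("/wiki/", "").replace("_", " ")

def clean_title_decoded(title):
    """Hypothetical variant: also decode percent-encoded characters."""
    return unquote(clean_title(title))

sample_href = "/wiki/1944_United_States_presidential_election"
print(clean_title(sample_href))

# Made-up href, used only to show the decoding step
print(clean_title_decoded("/wiki/Example_%28page%29"))
```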

- I then use a function to filter the titles, keeping only the elections from 1944 to 2024
- I noticed that the results included headings for the elections in each of the 50 states across the 21 election years. That is 1050 headings I did not want, plus 2 standalone headings on predictions and recounts, so I removed these


In [5]:
def filter_elections(titles, start_year=1944, end_year=2024):
    """Filter election titles based on given criteria."""
    elections = []
    for title in titles:
        if "United States presidential election" in title:
            year = title.split(' ')[0]
            try:
                if start_year <= int(year) <= end_year:
                    elections.append(title)
            except ValueError:
                continue
    return [election for election in sorted(elections, reverse=True) 
            if "in" not in election 
            and election not in ["2020 United States presidential election predictions", 
                                 "2016 United States presidential election recounts"]]
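
The filter above relies on substring checks. As a hedged alternative sketch, the same idea could be expressed with a regular expression that accepts only titles of the exact form "year United States presidential election". The names NATIONAL_ELECTION and is_national_election are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical stricter alternative to the substring checks above
NATIONAL_ELECTION = re.compile(r"(\d{4}) United States presidential election")

def is_national_election(title, start_year=1944, end_year=2024):
    """Return True only for titles of the exact form '<year> United States presidential election'."""
    match = NATIONAL_ELECTION.fullmatch(title)
    return bool(match) and start_year <= int(match.group(1)) <= end_year

# Sample titles chosen to show each kind of rejection
sample_titles = [
    "2020 United States presidential election",
    "2020 United States presidential election in Virginia",
    "2020 United States presidential election predictions",
    "1940 United States presidential election",
]
kept = [t for t in sample_titles if is_national_election(t)]
print(kept)
```

Because the whole title must match, state-level pages and pages such as predictions or recounts are rejected without listing them by name.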

- I create a function for getting the URL of each election page

In [6]:
def get_election_urls(titles, base_url):
    """Look up each title via the MediaWiki API and return the URLs of the pages that exist."""
    urls = []
    for title in titles:
        title_for_url = title.replace(' ', '_')
        params = {'action': 'query', 'format': 'json', 'titles': title_for_url}
        response = requests.get(base_url, params=params)
        data = response.json()
        page_id = next(iter(data['query']['pages']))
        if 'missing' not in data['query']['pages'][page_id]:
            page_title = data['query']['pages'][page_id]['title']
            page_url = f"https://en.wikipedia.org/wiki/{page_title.replace(' ', '_')}"
            urls.append(page_url)
    return urls
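
The function above depends on the shape of the JSON that the API returns. The sketch below walks through that shape without any network access, using two hand-written sample responses: one for a page that exists and one for a missing page. The helper page_url_from_response is hypothetical, and the page ID 12345 is made up for illustration.

```python
def page_url_from_response(data):
    """Hypothetical helper: return the article URL, or None if the page is missing."""
    pages = data['query']['pages']
    page_id = next(iter(pages))   # a single-title query returns one entry
    page = pages[page_id]
    if 'missing' in page:
        return None
    return f"https://en.wikipedia.org/wiki/{page['title'].replace(' ', '_')}"

# Hand-written sample responses; the page ID 12345 is made up
found = {'query': {'pages': {'12345': {'pageid': 12345, 'ns': 0,
         'title': '1944 United States presidential election'}}}}
missing = {'query': {'pages': {'-1': {'ns': 0, 'title': 'No such page', 'missing': ''}}}}

print(page_url_from_response(found))
print(page_url_from_response(missing))
```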

- I use a list comprehension to clean all the search result headings and then apply the filter_elections function

In [7]:
# Get all search result headings
all_search_result_headings = get_search_results(adv_search_url, headers)
all_search_result_headings = [clean_title(title) for title in all_search_result_headings]

# Filter the election results
filtered_elections = filter_elections(all_search_result_headings)


In [8]:
# Get URLs for filtered elections
api_base_url = "https://en.wikipedia.org/w/api.php"
election_urls = get_election_urls(filtered_elections, api_base_url)

- I create the dataframe and export it to the Data folder as a CSV file
- I saved this as *"Elections.csv"*

In [9]:
df_elections = pd.DataFrame({'Link': election_urls, 'Elections': filtered_elections})
df_elections.to_csv('Data/Elections.csv', index=False)
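
Because get_election_urls skips titles whose page is missing, election_urls could end up shorter than filtered_elections, and building a dataframe from lists of unequal length would then fail. As a minimal sketch of one way to guard against this, the hypothetical helper pair_titles_with_urls (not part of the original notebook) matches each title to its URL and keeps only complete pairs. The sample data is made up.

```python
def pair_titles_with_urls(titles, urls):
    """Hypothetical helper: keep only the titles that have a matching URL."""
    # Rebuild the title from the last part of each URL so the two can be matched
    url_by_title = {url.rsplit('/', 1)[-1].replace('_', ' '): url for url in urls}
    return [{'Link': url_by_title[t], 'Elections': t} for t in titles if t in url_by_title]

# Made-up sample data: the second title has no URL
sample_titles = ["2024 United States presidential election",
                 "2020 United States presidential election"]
sample_urls = ["https://en.wikipedia.org/wiki/2024_United_States_presidential_election"]

rows = pair_titles_with_urls(sample_titles, sample_urls)
print(rows)
```

The resulting list of rows could then be passed to pd.DataFrame(rows) in place of the two separate lists.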

- The dataframe created in this NB lets me use the URL of each election article to scrape the results table on each page and build a more informative dataframe for my analysis.