<a href="https://colab.research.google.com/github/Diiamon/Election-News-Article-Exploration/blob/main/capstone_get_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import necessary libraries for web scraping and data manipulation

In [27]:
from bs4 import BeautifulSoup  # BeautifulSoup (from the bs4 package) for parsing scraped HTML
import requests  # requests for fetching pages over HTTP
import pandas as pd  # pandas for data manipulation and analysis
import numpy as np  # NumPy for numerical operations
import re  # Regular expressions for pattern matching

## Define a function to scrape news links from a given website

In [28]:
def news_links(link):
  # Download the topic page and parse its HTML
  main_site_resp = requests.get(link)
  main_site_soup = BeautifulSoup(main_site_resp.text, 'html.parser')

  # Extract top news links: the first <a href> inside each <li> of the top news block
  top_items = main_site_soup.find('div', class_="Pt(20px)").find_all('li')
  top_links = [item.find('a')['href'] for item in top_items if item.find('a') and item.find('a').has_attr('href')]

  # Extract additional news links. find_all('li') is used instead of iterating over the <ul>
  # directly, because direct iteration also yields bare text nodes that have no has_attr method.
  more_items = main_site_soup.find('div', class_="D(f) Jc(sb) My(20px)").find('ul').find_all('li', recursive=False)
  links = [item.find('a')['href'] for item in more_items if item.find('a') and item.find('a').has_attr('href')]

  # Combine and deduplicate the list of links
  full_links = set(top_links + links)

  return full_links

In [29]:
full_links_ = news_links('https://uk.yahoo.com/topics/uk-election-2024/')
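
`news_links` deduplicates with a `set`, which discards the order in which links appeared on the page. The sketch below shows one way to keep first-seen order and, assuming some `href` values on the page are relative (not confirmed here), resolve them against the topic URL. The helper name `normalise_links` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

def normalise_links(hrefs, base_url):
    """Resolve each href against base_url and drop duplicates, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs if href)
    # dict.fromkeys keeps insertion order, so the first occurrence of each URL wins
    return list(dict.fromkeys(absolute))

base = 'https://uk.yahoo.com/topics/uk-election-2024/'
# Made-up hrefs for illustration only: two point at the same article, one is missing
sample = [
    '/news/example-a.html',
    'https://uk.yahoo.com/news/example-a.html',
    '/news/example-b.html',
    None,
]
print(normalise_links(sample, base))
```

A list returned this way can be passed to `news_page_info` unchanged, since that function only iterates over its argument.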

## Define a function to scrape information from each news page

In [30]:
def news_page_info(full_links):
  all_news_info = []

  for link in full_links:
    article_site_resp = requests.get(link)
    article_site_soup = BeautifulSoup(article_site_resp.text, 'html.parser')

    try:
      # Attempt to extract relevant information from the news page
      source = article_site_soup.find('span', class_="caas-attr-provider").text
      title = article_site_soup.find('div', class_="caas-title-wrapper").text
      headings = [heading.text for heading in article_site_soup.find('div', class_="caas-body").find_all('h2')]
      author = article_site_soup.find('span', class_="caas-author-byline-collapse").text
      published = article_site_soup.find('div', class_="caas-attr-time-style").text
      article = article_site_soup.find('div', class_="caas-body").text
    except AttributeError:
      # find() returns None when an element is missing, so .text raises AttributeError; set all fields to 'none'
      source = 'none'
      title = 'none'
      headings = 'none'
      author = 'none'
      published = 'none'
      article = 'none'

    # Append the extracted information to the list
    all_news_info.append({'Source': source,
                          'Title' : title,
                          'Headings' : headings,
                          'author' : author,
                          'Published' :  published,
                          'Article': article
                          })

  # Convert the list of dictionaries to a DataFrame
  return pd.DataFrame(all_news_info)
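
The `try` block in `news_page_info` is all-or-nothing: if a single element (for example the author byline) is missing, every field of that article becomes `'none'`. A possible alternative is a per-field fallback, sketched below. `text_or_default` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for what `soup.find(...)` returns (an element with a `.text` attribute, or `None`).

```python
from types import SimpleNamespace

def text_or_default(node, default='none'):
    """Return the stripped text of a parsed element, or `default` when the element is missing."""
    if node is None:
        return default
    return node.text.strip()

# Stand-ins for the result of soup.find(...): one element that was found, one that was not
found = SimpleNamespace(text='  The Guardian ')
missing = None

record = {
    'Source': text_or_default(found),
    'author': text_or_default(missing),
}
print(record)
```

With this approach an article that lacks only a byline would keep its source, title and body, at the cost of needing a different filter than `articles['Source'] != 'none'` to detect pages that failed entirely.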

## Save the articles DataFrame, excluding rows where scraping failed ('none'), to a CSV file for quicker data manipulation

In [31]:
articles = news_page_info(full_links_)
articles = articles.loc[articles['Source'] != 'none']
articles.to_csv('article41_data', index=False)

In [32]:
article_df = pd.read_csv('article41_data')
article_df

Unnamed: 0,Source,Title,Headings,author,Published,Article
0,The Telegraph,When disinformation might really have swung an...,[],Michael Coren,4 July 2024 at 5:28 am·4-min read,I’m not sure that anything can now help poor R...
1,Yahoo Finance UK,What Labour needs to do for pensions and work ...,"['Personal finance', 'Tax', 'Work', 'Pensions'...",Sarah Coles and Helen Morrissey,4 July 2024 at 7:23 am·6-min read,A new government could have a big impact on pe...
2,The Guardian,#ukpolitics: how the 2024 general election has...,['Straight TikTok: ‘traditional’ news for a ne...,"Carmen Aguilar García, Pamela Duncan, Michael ...",4 July 2024 at 6:08 am·9-min read,"If a week is a long time in politics, the five..."
3,The Guardian,UK political parties on track to spend £1m on ...,[],Jim Waterson,4 July 2024 at 7:11 am·3-min read,The UK’s political parties are on track to spe...
4,The Independent,When will we know who’s won? An hour-by-hour g...,['Here is a breakdown of exactly when to expec...,Maryam Zakir-Hussain,4 July 2024 at 6:59 am·14-min read,After a whirlwind six-week general election ca...
5,Yahoo News UK,Who is Ed Davey? The Lib Dem leader whose life...,"['His early life', 'His family', 'His parents'...",Hannah Fearn,Updated 1 July 2024 at 11:04 am·3-min read,"The youngest of three boys, Sir Ed Davey had a..."
6,Yahoo News UK,What time will we know the general election re...,"['When can I vote in the election?', 'When are...",James Harrison,Updated 4 July 2024 at 2:40 am·4-min read,The election race has entered its final furlon...
7,The Telegraph,What time is the exit poll today and when will...,"['What time is the UK exit poll?', 'What is th...",Jordan Young,4 July 2024 at 5:24 am·4-min read,Follow The Telegraph’s latest general election...
8,Yahoo News UK,Who is standing for election in my area? Full ...,"['Who is standing in my constituency?', 'When ...",Jimmy Nsubuga,Updated 4 July 2024 at 3:54 am·2-min read,More than a month of general election campaign...
9,The Independent,General election – live: Keir Starmer and Rish...,"['Key Points', 'Pinned: How to vote in the Gen...",Salma Ouaguira,4 July 2024 at 7:20 am·37-min read,Millions of people across the country are head...
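
Because the DataFrame has been written to CSV and read back, the `Headings` column no longer holds Python lists; each cell is text such as `"['Personal finance', 'Tax']"` or `"[]"`. A minimal sketch of how the lists could be recovered with the standard library follows. `parse_headings` is a hypothetical helper, not part of the notebook.

```python
import ast

def parse_headings(cell):
    """Turn the CSV text form of a list (e.g. "['Tax', 'Work']") back into a list."""
    if isinstance(cell, list):
        return cell
    try:
        # literal_eval only evaluates Python literals, so it is safe on untrusted text
        value = ast.literal_eval(str(cell))
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []

print(parse_headings("['Personal finance', 'Tax', 'Work']"))
print(parse_headings('[]'))
print(parse_headings('none'))
```

It could be applied with `article_df['Headings'].apply(parse_headings)` before any step that needs to treat the headings as a list.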


## Define functions for cleaning and formatting the scraped data

In [33]:
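def list_clean(column):
  # Replace an empty list with an empty string. After the CSV round trip the column
  # holds text such as "[]", so this comparison only matches real list objects.
  if column == []:
    return ''
  else:
    return column

# Regex helpers that pull the day, month, time, year and read time out of the 'Published' text
def reg_day(column):
  # First run of digits, e.g. '4' in '4 July 2024 at 5:28 am'
  return re.findall(r'\d+', str(column))[0]

def reg_month(column):
  # First capitalised word that starts like a month name
  return re.findall(r'(J\w+|F\w+|Ma\w+|A\w\w+|S\w+|O\w+|N\w+|D\w+)', str(column))[0]

def reg_time(column):
  # Clock time as a 24-hour 'H:MM' string; the am/pm marker is applied when present
  hour, minute, meridiem = re.findall(r'(\d+):(\d+)\s*([ap]m)?', str(column))[0]
  hour = int(hour)
  if meridiem:
    hour = hour % 12 + (12 if meridiem == 'pm' else 0)
  return f'{hour}:{minute}'

def reg_year(column):
  return re.findall(r'20\d\d', str(column))[0]

def reg_read(column):
  # Minutes from 'N-min read', or -1 when no read time is given
  match = re.search(r'(\d+)-min read', str(column))
  return match.group(1) if match else -1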

# Define a function to apply cleaning operations to the DataFrame
def cleaning(df):

  # Apply cleaning functions to the respective columns
  df['Headings'] = df['Headings'].apply(list_clean)
  df['Publish_Day'] = df['Published'].apply(reg_day).astype(int)
  df['Publish_Month'] = df['Published'].apply(reg_month).astype(str)
  df['Publish_Year'] = df['Published'].apply(reg_year).astype(int)
  df['Publish_Time'] = df['Published'].apply(reg_time)
  df['Publish_Time'] = pd.to_datetime(df['Publish_Time'], format='%H:%M').dt.time
  df['Read_Time'] = df['Published'].apply(reg_read).astype(int)

  # Reorder and select specific columns for the final DataFrame
  df = df[['Source', 'Title', 'Headings', 'author',
           'Published',
           'Publish_Day',
           'Publish_Month',
           'Publish_Year',
           'Publish_Time',
           'Read_Time',
           'Article'
           ]]
  return df
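
The helpers above each scan the `Published` text separately. As an alternative sketch, a single regular expression with named groups can extract every part in one pass and build a real `datetime`, including the am/pm marker. The pattern assumes the layout seen in the sample rows (`[Updated ]<day> <month> <year> at <h:mm> <am|pm>·<n>-min read`) and English month names; `PUBLISHED_RE` and `parse_published` are hypothetical names.

```python
import re
from datetime import datetime

# Assumed layout, based on the sample rows shown in this notebook
PUBLISHED_RE = re.compile(
    r'(?P<updated>Updated\s+)?'
    r'(?P<day>\d{1,2})\s+(?P<month>[A-Z][a-z]+)\s+(?P<year>20\d{2})'
    r'\s+at\s+(?P<time>\d{1,2}:\d{2})\s*(?P<meridiem>[ap]m)'
    r'(?:\D*(?P<read>\d+)-min read)?'
)

def parse_published(text):
    """Split a 'Published' string into a datetime, an 'updated' flag and a read time."""
    match = PUBLISHED_RE.search(str(text))
    if match is None:
        return None
    stamp = datetime.strptime(
        f"{match['day']} {match['month']} {match['year']} {match['time']} {match['meridiem']}",
        '%d %B %Y %I:%M %p',  # %I with %p handles the 12-hour clock
    )
    read = int(match['read']) if match['read'] else -1  # -1 mirrors reg_read's fallback
    return {
        'published_at': stamp,
        'updated': match['updated'] is not None,
        'read_minutes': read,
    }

print(parse_published('Updated 1 July 2024 at 11:04 am·3-min read'))
```

Returning `None` for text that does not match makes failed parses easy to spot, whereas indexing `[0]` on an empty `re.findall` result raises `IndexError`.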

## Clean and check the articles DataFrame and then export it to a CSV file

In [34]:
articles_df = cleaning(article_df)
articles_df

Unnamed: 0,Source,Title,Headings,author,Published,Publish_Day,Publish_Month,Publish_Year,Publish_Time,Read_Time,Article
0,The Telegraph,When disinformation might really have swung an...,[],Michael Coren,4 July 2024 at 5:28 am·4-min read,4,July,2024,05:28:00,4,I’m not sure that anything can now help poor R...
1,Yahoo Finance UK,What Labour needs to do for pensions and work ...,"['Personal finance', 'Tax', 'Work', 'Pensions'...",Sarah Coles and Helen Morrissey,4 July 2024 at 7:23 am·6-min read,4,July,2024,07:23:00,6,A new government could have a big impact on pe...
2,The Guardian,#ukpolitics: how the 2024 general election has...,['Straight TikTok: ‘traditional’ news for a ne...,"Carmen Aguilar García, Pamela Duncan, Michael ...",4 July 2024 at 6:08 am·9-min read,4,July,2024,06:08:00,9,"If a week is a long time in politics, the five..."
3,The Guardian,UK political parties on track to spend £1m on ...,[],Jim Waterson,4 July 2024 at 7:11 am·3-min read,4,July,2024,07:11:00,3,The UK’s political parties are on track to spe...
4,The Independent,When will we know who’s won? An hour-by-hour g...,['Here is a breakdown of exactly when to expec...,Maryam Zakir-Hussain,4 July 2024 at 6:59 am·14-min read,4,July,2024,06:59:00,14,After a whirlwind six-week general election ca...
5,Yahoo News UK,Who is Ed Davey? The Lib Dem leader whose life...,"['His early life', 'His family', 'His parents'...",Hannah Fearn,Updated 1 July 2024 at 11:04 am·3-min read,1,July,2024,11:04:00,3,"The youngest of three boys, Sir Ed Davey had a..."
6,Yahoo News UK,What time will we know the general election re...,"['When can I vote in the election?', 'When are...",James Harrison,Updated 4 July 2024 at 2:40 am·4-min read,4,July,2024,02:40:00,4,The election race has entered its final furlon...
7,The Telegraph,What time is the exit poll today and when will...,"['What time is the UK exit poll?', 'What is th...",Jordan Young,4 July 2024 at 5:24 am·4-min read,4,July,2024,05:24:00,4,Follow The Telegraph’s latest general election...
8,Yahoo News UK,Who is standing for election in my area? Full ...,"['Who is standing in my constituency?', 'When ...",Jimmy Nsubuga,Updated 4 July 2024 at 3:54 am·2-min read,4,July,2024,03:54:00,2,More than a month of general election campaign...
9,The Independent,General election – live: Keir Starmer and Rish...,"['Key Points', 'Pinned: How to vote in the Gen...",Salma Ouaguira,4 July 2024 at 7:20 am·37-min read,4,July,2024,07:20:00,37,Millions of people across the country are head...


In [35]:
articles_df['Article'][0]

'I’m not sure that anything can now help poor Rishi Sunak, but surely a letter a few days back from the Russian foreign minister alleging Moscow’s backing for Keir Starmer would have done just a little good. And it would have been a fitting way to celebrate the centenary of the infamous Zinoviev Letter.In 1924 the Labour Party had formed its first government. It was a minority administration, dependent on the Liberals, and was under siege from its formation. While party leader and Prime Minister Ramsay MacDonald was moderate, pragmatic, and certainly anti-communist, he and his party were constantly accused of being overly sympathetic to Moscow.Their decision to normalise Britain’s relationship with the Soviets, only seven years after the Russian Revolution, however, fuelled paranoia about “Bolshevik influence”, and this fear even extended to some within the security services. It was partly the notion that Labour was too close to the Soviets that led to the MacDonald government losing a

In [36]:
articles_df.dtypes

Source           object
Title            object
Headings         object
author           object
Published        object
Publish_Day       int64
Publish_Month    object
Publish_Year      int64
Publish_Time     object
Read_Time         int64
Article          object
dtype: object

In [37]:
articles_df.to_csv('article41_data_clean', index=False)