<font style='font-size:3em'>📝 NB03 - WikiNews Scraping  </font>

**PURPOSE**: This Jupyter Notebook contains scraping of wikinews articles on the elections from 2008-2020. There is also some analysis of the content of these articles. 
- There is no Election News on WikiNews before 2008 election

**LAST REVISION:** 16th November 2023  

## ⚙️ Setting Up

### Packages needed for this NB to run 

In [1]:
import pandas as pd
import requests
from scrapy import Selector
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from textblob import TextBlob

Set up the Search URl for WikiNews 
- set limit to 1200 to get the full search result and set offset to 0 to start from beginning 
- get the response and set my user-agent to my LSE email adress

In [2]:
adv_search_url = 'https://en.wikinews.org/w/index.php?title=Special:Search&limit=1200&offset=0&ns0=1&ns14=1&search=United+States+presidential+election'
headers = {'User-Agent': 'm.filip-turner@lse.ac.uk'}

- Create function to scrape 

In [3]:
def fetch_search_results(url, headers):
    response = requests.get(url, headers=headers)
    sel = Selector(text=response.text)
    return sel.css("div.mw-search-result-heading > a::attr(href)").getall()

Create years variable for 1944-2024 going up by 4 
- using this to filer the news articles for ones of interest
- **I notice that only news articles for 2008-2020**

In [4]:
def filter_results_by_years(links, years):
    return [link for link in links if any(year in link for year in years)]

- Create function to clean headlines

In [5]:
def clean_headlines(headlines):
    return [tag.replace('_', ' ').replace('%27', '').replace('/wiki/', '') for tag in headlines]

- Create function to attatch base url to make links valid

In [6]:
def add_base_url(links, base_url):
    return [base_url + link for link in links]

- apply the functions 

In [7]:
all_search_result_headings = fetch_search_results(adv_search_url, headers)


years = [str(year) for year in range(1944, 2025, 4)]
filtered_news = filter_results_by_years(all_search_result_headings, years)

base_url = "https://en.wikinews.org"
news_urls = add_base_url(filtered_news, base_url)

filtered_news = clean_headlines(filtered_news)

## Creating DataFrames for news headlines and links

### + Saving these DFs as CSVs as "df_news.csv"

In [8]:
df_news = pd.DataFrame({'News URL': news_urls , 'Headlines': filtered_news})
df_news.to_csv('Data/df_news.csv', index=False)

- scrape the text from each of the news articles for each year

In [9]:
def extract_text_from_url(url):
    try:
        response = requests.get(url, headers)
        sel = Selector(text=response.text)
        text = sel.css("div.mw-parser-output > p::text").getall()
        return ' '.join(text)
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return ''  # Return an empty string in case of an error

def get_articles_with_year(df, year):
    filtered_urls = df[df['News URL'].str.contains(str(year))]['News URL']
    return [extract_text_from_url(url) for url in filtered_urls]

# Load your DataFrame
df_news = pd.read_csv('Data/df_news.csv')

# Use the function to get articles for the year 2008
year_2008_article_text = get_articles_with_year(df_news, 2008)
year_2012_article_text = get_articles_with_year(df_news, 2012)
year_2016_article_text = get_articles_with_year(df_news, 2016)
year_2020_article_text = get_articles_with_year(df_news, 2020)

## Some text analysis of the news articles 

### + Create wordle the news articles relevent to each election and export them as png to /Data folder 

In [10]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

def generate_wordcloud_for_year(df, year, additional_stopwords=None, save_to_file=False):
    if additional_stopwords is None:
        additional_stopwords = []

    # Filter and extract text for the given year
    year_text = get_articles_with_year(df, year)

    # Combine all texts and preprocess
    combined_text = ' '.join(year_text).lower()
    words = combined_text.split()

    # Filter stopwords
    all_stopwords = stopwords.words('english') + ['said', 'would', 'also', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'] + additional_stopwords
    filtered_words = [word for word in words if word not in all_stopwords]

    # Generate word cloud
    filtered_text = ' '.join(filtered_words)
    wordcloud = WordCloud(width=800, height=800, background_color='white').generate(filtered_text)

    # Save or display word cloud
    if save_to_file:
        plt.figure(figsize=(8, 8), facecolor=None)
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.tight_layout(pad=0)
        plt.savefig(f'Data/wordcloud_{year}.png')
        plt.close()
    else:
        plt.figure(figsize=(8, 8), facecolor=None)
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.tight_layout(pad=0)
        plt.show()

# Example usage
df_news = pd.read_csv('Data/df_news.csv')
for year in [2008, 2012, 2016, 2020]:
    generate_wordcloud_for_year(df_news, year, save_to_file=True)


### 📖 Takeaways from wordle

- The wordle gives brief insight to the key people and topics of each election.
- In the 2008 election we see the word **Police** come up as wuite prominent which would be a good indicator of potential domestic unrest as a key theme for the election
- Similarly, 2020 we see **war** show up which hints of the impact of Russia and Ukraine on the election

### + Calculate an average polarity score for articles relevent to each election

In [11]:
def calculate_average_polarity_for_year(df, year):
    # Filter and extract text for the given year
    year_text = get_articles_with_year(df, year)

    # Combine all texts for each article
    articles = [' '.join(article) for article in year_text]

    # Calculate polarity scores for each article
    polarity_scores = [TextBlob(article).sentiment.polarity for article in articles]

    # Compute average polarity score
    if polarity_scores:
        average_polarity = sum(polarity_scores) / len(polarity_scores)
    else:
        average_polarity = 0  # Default value in case there are no articles

    return average_polarity

# Example usage
df_news = pd.read_csv('Data/df_news.csv')

years = [2008, 2012, 2016, 2020]
average_polarities = {}

for year in years:
    average_polarity = calculate_average_polarity_for_year(df_news, year)
    average_polarities[year] = average_polarity

# Optional: Save the average polarities to a CSV file
df_average_polarities = pd.DataFrame(list(average_polarities.items()), columns=['Year', 'Average Polarity'])
df_average_polarities.to_csv('Data/average_polarity_by_year.csv', index=False)


### 📖 Takeaways from Polarity measure

- 2008: The average polarity score of 0.10625 suggests a slightly positive sentiment in news articles for this election year.
- 2012: This year shows an average polarity score of -0.00864, indicating a nearly neutral but slightly negative sentiment in the news coverage.
- 2016: The average score of 0.0625 again indicates a slightly positive sentiment, though not as strong as in 2008.
- 2020: With an average polarity score of 0.03385, the sentiment in news articles was slightly positive, but closer to neutral compared to other years.

- The WikiNews sources are often very neutral news peices relaying information and little opinion or emotive language.

### I want to merge the polarity measure for election news artiles I have to the Elections_table_data.csv

In [12]:
df_election = pd.read_csv('Data/Election_table_data.csv')
merged_df = pd.merge(df_election, df_average_polarities, on='Year', how='left')
merged_df.to_csv('Data/Elections_polarity_merged_df.csv', index=False)

**PURPOSE**: This Jupyter Notebook contains scraping of wikinews articles on the elections from 2008-2020. There is also some analysis of the content of these articles. 

- The purpose of WikiNews is to back up the polarity of elections with analysis of the news articles relating to elections.
- Higher polaroty score would indicate that it is a close election and turnout should be higher. 
- If news articles were dates back to 1944 on WIkinews, this analysis would have been more informative. i was however limited my the availability of News articles. 