<a id='Q0'></a>
<center> <h1> Notebook 1: Developing a Script for Retrieving Data and Saving it in an AWS S3 Bucket</h1> </center>
<p style="margin-bottom:1cm;"></p>
<center><strong>Angela Niederberger, 2022</strong></center>
<p style="margin-bottom:1cm;"></p>

<div style="background:#EEEDF5;border-top:0.1cm solid #EF475B;border-bottom:0.1cm solid #EF475B;">
    <div style="margin-left: 0.5cm;margin-top: 0.5cm;margin-bottom: 0.5cm;color:#303030">
        <p><strong>Goal:</strong> In this notebook I develop a Python script to retrieve data on the most recent news items from the SRF website and save it in an AWS S3 bucket.</p>
        <strong> Outline:</strong>
        <a id='P0' name="P0"></a>
        <ol>
            <li> <a style="color:#303030" href='#I'>Introduction </a> </li>
            <li> <a style="color:#303030" href='#SU'>Set up</a></li>
            <li> <a style="color:#303030" href='#P1'>Retrieving the Data</a></li>
            <li> <a style="color:#303030" href='#P2'>Saving the Data</a></li>
            <li> <a style="color:#303030" href='#P3'>Code Refactor</a></li>
            <li> <a style="color:#303030" href='#CL'>Conclusion</a></li>
        </ol>
        <strong>Keywords:</strong> Webscraping, BeautifulSoup, AWS S3
    </div>
</div>

<a id='I' name="I"></a>
## [Introduction](#P0)

In this notebook I explore the following steps:
- retrieving data from a website using requests and BeautifulSoup
- performing basic cleaning tasks on this data
- saving it to an AWS S3 bucket

Then I rewrite my code to create a Python script that can be automated.

<a id='SU' name="SU"></a>
## [Set up](#P0)

I created a virtual environment and installed Python 3.10.4 in it for this project. Below are the specifics on all the packages I used. I've also summarized all of this information in the requirements file.

### Packages

In [6]:
# Retrieving & Wrangling Data
import requests  # Version 2.27.1
from bs4 import BeautifulSoup  # Version 4.11.1
import pandas as pd  # Version 1.4.2
from datetime import date  # Built-in Python library

# Saving the Data
import s3fs

# Building the script
import logging  # Built-in Python library

### Magic Commands

In [2]:
%config Completer.use_jedi = False

<a id='P1'></a>
## [Retrieving the Data](#P0)

In this first section I use requests and BeautifulSoup to retrieve data on news items from the SRF website. Then I wrangle the data into the desired format with Pandas.

### Webscraping

The simple web-scraping script below extracts the publishing time and teaser data (kicker, title and lead) of all news articles listed on https://www.srf.ch/news/das-neueste at the time it runs.

Here's an excellent resource for more information on web scraping: https://realpython.com/beautiful-soup-web-scraper-python/.

In [3]:
# This is the static page to scrape
url = "https://www.srf.ch/news/das-neueste"
page = requests.get(url)

# Then parse it
soup = BeautifulSoup(page.content, "html.parser")

# Get the different elements
teaser_lists = soup.find_all("div", class_="js-teaser-data")
kickers = soup.find_all("span", class_="teaser__kicker-text")
titles = soup.find_all("span", class_="teaser__title")
leads = soup.find_all("p", class_="teaser__lead")

# Iterate through the elements and extract the relevant information
news_snippet_dict = {
    "time_published": [],
    "kicker": [],
    "title": [],
    "lead": []
}

for (teaser_list, kicker, title, lead) in zip(teaser_lists, kickers, titles, leads):
    news_snippet_dict["time_published"].append(teaser_list.get("data-date-published"))
    news_snippet_dict["kicker"].append(kicker.get_text())
    news_snippet_dict["title"].append(title.get_text())
    news_snippet_dict["lead"].append(lead.get_text())
    
news_snippets_df = pd.DataFrame(news_snippet_dict)
news_snippets_df.head()

| | time_published | kicker | title | lead |
|---|---|---|---|---|
| 0 | 2022-06-01T14:33:00+02:00 | Krieg in der Ukraine | Zögerliche Zeitenwende: Berlin fremdelt mit se... | Nach langem Zaudern sagt der deutsche Bundeska... |
| 1 | 2022-06-01T14:03:00+02:00 | Gesetz gegen Renditesanierung | Zoff um das Basler Mietgesetz | Für Sanierungen gelten in Basel neu strenge Be... |
| 2 | 2022-06-01T13:44:00+02:00 | Fussball und der Krieg | Als Russen und Ukrainer noch gemeinsam auf dem... | Die Ukraine will an die WM, Russland schaut zu... |
| 3 | 2022-06-01T12:55:00+02:00 | Kampf gegen Prämienexplosion | Nationalrat will Prämienanstieg mit Kostenziel... | Kosten- und Qualitätsziele sollen Prämien eind... |
| 4 | 2022-06-01T12:36:00+02:00 | Nachtragskredite wegen Covid | Hitzige Diskussion um 2.1. Milliarden für Kurz... | Wegen eines Bundesgerichtsurteils erhalten Unt... |
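
One caveat of collecting each element type in its own list: `zip` pairs the lists by position and stops at the shortest one, so a single teaser without a lead would shift all later rows. A possible alternative is to loop over one container per teaser and look up its fields inside it. The sketch below assumes that each teaser is wrapped in an `<article class="teaser">` element; that wrapper and the sample markup are hypothetical and only serve to illustrate the idea, since the real SRF markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real SRF page may be structured differently.
SAMPLE_HTML = """
<article class="teaser">
  <div class="js-teaser-data" data-date-published="2022-06-01T14:33:00+02:00"></div>
  <span class="teaser__kicker-text">Kicker A</span>
  <span class="teaser__title">Title A</span>
  <p class="teaser__lead">Lead A</p>
</article>
<article class="teaser">
  <div class="js-teaser-data" data-date-published="2022-06-01T14:03:00+02:00"></div>
  <span class="teaser__title">Title B</span>
</article>
"""


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if it is missing."""
    element = parent.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


def extract_teasers(html):
    """Build one record per teaser container so that the fields cannot get misaligned."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumption: every teaser lives in its own <article class="teaser"> element
    for teaser in soup.find_all("article", class_="teaser"):
        data = teaser.find("div", class_="js-teaser-data")
        rows.append({
            "time_published": data.get("data-date-published") if data is not None else None,
            "kicker": text_or_none(teaser, "span", "teaser__kicker-text"),
            "title": text_or_none(teaser, "span", "teaser__title"),
            "lead": text_or_none(teaser, "p", "teaser__lead"),
        })
    return rows


rows = extract_teasers(SAMPLE_HTML)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`; missing fields then show up as empty values instead of silently shifting the columns.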


### Data Wrangling

In [4]:
# Turn the publishing time into a timestamp
news_snippets_df["time_published"] = pd.to_datetime(news_snippets_df["time_published"])

# Select only news from this day
news_snippets_df = news_snippets_df[news_snippets_df["time_published"].dt.date == date.today()]

In [5]:
news_snippets_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype                                
---  ------          --------------  -----                                
 0   time_published  20 non-null     datetime64[ns, pytz.FixedOffset(120)]
 1   kicker          20 non-null     object                               
 2   title           20 non-null     object                               
 3   lead            20 non-null     object                               
dtypes: datetime64[ns, pytz.FixedOffset(120)](1), object(3)
memory usage: 800.0+ bytes
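
The filter above compares the publishing date with `date.today()`, which is the date of the machine running the code. If the script later runs on a server whose clock is set to UTC (an assumption about the future deployment, not something tested here), the two dates can disagree around midnight. A minimal sketch of a timezone-aware variant follows; the helper name `filter_by_local_date` and the choice of `Europe/Zurich` are my own assumptions.

```python
from datetime import date

import pandas as pd


def filter_by_local_date(df, day, tz="Europe/Zurich"):
    """Keep only the rows whose publishing time falls on `day` in the timezone `tz`."""
    # Parse to UTC first so that mixed offsets (e.g. around a DST change) share one dtype,
    # then convert to the local timezone before extracting the calendar date
    local_times = pd.to_datetime(df["time_published"], utc=True).dt.tz_convert(tz)
    filtered = df.loc[local_times.dt.date == day].copy()
    filtered["time_published"] = local_times
    return filtered


# Made-up sample timestamps for illustration
sample_df = pd.DataFrame({
    "time_published": [
        "2022-06-01T00:30:00+02:00",  # already 1 June in Zurich, still 31 May in UTC
        "2022-05-31T23:30:00+02:00",
        "2022-06-01T14:33:00+02:00",
    ],
    "title": ["a", "b", "c"],
})
todays_df = filter_by_local_date(sample_df, day=date(2022, 6, 1))
```

In a scheduled run, `day` could be taken from `pd.Timestamp.now(tz="Europe/Zurich").date()` so that "today" always means today in Switzerland.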


<a id='P2' name="P2"></a>
## [Saving the Data](#P0)

### AWS Set-up

In order to save data to an S3 Bucket, I first had to complete the following steps:
- sign up for an AWS account
- with the root user, create an admin user
- with the admin user, create an S3 bucket
- attach a policy to this bucket, which gives the admin user write access

To complete all these steps correctly, I followed this walkthrough provided by AWS: https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example1.html.

<div style="background:#EEEDF5;border:0.1cm solid #00BAE5;color:#303030">
    <div style="margin: 0.2cm 0.2cm 0.2cm 0.2cm">
        <b style="color:#00BAE5">Note:</b>
        This is the most crucial part! Once everything is set up correctly in the AWS cloud, it is actually very easy to save files in the bucket.
    </div>
</div>
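
To give an idea of what such a bucket policy can look like, here is a sketch that builds one as a Python dictionary and serializes it to JSON. It is not the exact policy I attached: the account ID `123456789012` and the user name `admin-user` are placeholders, and the selection of actions (`s3:ListBucket` and `s3:PutObject`) is an assumption about the minimum needed for listing the bucket and writing files.

```python
import json

BUCKET_NAME = "srf-news-snippets"  # bucket name used in this notebook
ACCOUNT_ID = "123456789012"        # placeholder, replace with the real AWS account ID
USER_NAME = "admin-user"           # hypothetical IAM user name


def build_bucket_policy(bucket, account_id, user_name):
    """Return a bucket policy that lets one IAM user list the bucket and write to data/."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:user/{user_name}"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListingTheBucket",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "AllowWritingDataFiles",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/data/*",
            },
        ],
    }


policy = build_bucket_policy(BUCKET_NAME, ACCOUNT_ID, USER_NAME)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON text is what would be pasted into the bucket's permissions editor in the AWS console.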

### Interacting with AWS

Now I can get started with the code below. I am using the s3fs package to interact with the S3 file system from Python; its documentation is at https://s3fs.readthedocs.io/en/latest/. I don't need to call s3fs directly to save files, though: pandas accepts an `s3://` path and uses s3fs behind the scenes, so writing to the bucket looks just like saving a file to my local drive.

To stay organized, I created a `data` folder in my S3 Bucket, so when I write data to the bucket, I need to include this in the file path.

In [6]:
# Define the path for the file
s3_folder_path = "s3://srf-news-snippets/data"
filename = f"{date.today()}_srf_news_snippets.csv"

# Save the file
news_snippets_df.to_csv(f"{s3_folder_path}/{filename}")
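
Because pandas treats local and S3 paths alike, the saving step can be wrapped in a small helper and tried out on a local folder before pointing it at the bucket. The sketch below is one possible way to do this; the helper names `build_snippet_path` and `save_snippets` are hypothetical, and passing `index=False` is my own choice to keep the row index out of the CSV file (the cell above writes it).

```python
from datetime import date

import pandas as pd


def build_snippet_path(folder, day):
    """Join the target folder and the dated file name into one path."""
    return f"{folder.rstrip('/')}/{day.isoformat()}_srf_news_snippets.csv"


def save_snippets(df, folder, day):
    """Write the dataframe as CSV to `folder`, which may be a local path or an s3:// URL."""
    path = build_snippet_path(folder, day)
    # index=False is an assumption: it avoids an extra unnamed column when reading the file back
    df.to_csv(path, index=False)
    return path


# Only the path is built here; nothing is written to S3 by this line
example_path = build_snippet_path("s3://srf-news-snippets/data/", date(2022, 6, 1))
```

Calling `save_snippets(news_snippets_df, "s3://srf-news-snippets/data", date.today())` would then write to the bucket, provided the AWS credentials are configured.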

<a id='P3' name="P3"></a>
## [Code Refactor](#P0)

Now I want to combine all of my code into one Python script that can be automated. To do this, I will write some functions.

### Functions

In [3]:
def scrape_srf_daily_news(url):
    """
    This function takes in the URL of the SRF news page.
    It scrapes this page and returns a dataframe with information on the day's news teasers.
    
    Required arguments:
    - url: string, the URL of the SRF news page to be scraped
    """
    page = requests.get(url)

    # Then parse it
    soup = BeautifulSoup(page.content, "html.parser")

    # Get the different elements
    teaser_lists = soup.find_all("div", class_="js-teaser-data")
    kickers = soup.find_all("span", class_="teaser__kicker-text")
    titles = soup.find_all("span", class_="teaser__title")
    leads = soup.find_all("p", class_="teaser__lead")

    # Iterate through the elements and extract the relevant information
    news_snippet_dict = {
        "time_published": [],
        "kicker": [],
        "title": [],
        "lead": []
    }

    for (teaser_list, kicker, title, lead) in zip(teaser_lists, kickers, titles, leads):
        news_snippet_dict["time_published"].append(teaser_list.get("data-date-published"))
        news_snippet_dict["kicker"].append(kicker.get_text())
        news_snippet_dict["title"].append(title.get_text())
        news_snippet_dict["lead"].append(lead.get_text())

    news_snippets_df = pd.DataFrame(news_snippet_dict)
    
    # Turn the publishing time into a timestamp
    news_snippets_df["time_published"] = pd.to_datetime(news_snippets_df["time_published"])

    # Select only news from this day
    is_from_today = news_snippets_df["time_published"].dt.date == date.today()
    news_snippets_df = news_snippets_df[is_from_today]
    
    return news_snippets_df
  
    
def main(url, s3_folder_path, filename):
    """
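    This function scrapes the day's news teasers and saves them as a CSV file in the S3 bucket.
    It logs an info message after each of the two steps.

    Required arguments:
    - url: string, the URL of the SRF news page to be scraped
    - s3_folder_path: string, the S3 folder to write to
    - filename: string, the name of the CSV file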
    """
    # Start logging
    logger = logging.getLogger(__name__)
    
    # Scrape the data
    todays_news_df = scrape_srf_daily_news(url=url)
    logger.info('Data retrieved')
    
    # Save it to the bucket
    todays_news_df.to_csv(f"{s3_folder_path}/{filename}")
    logger.info('File saved to bucket')
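
For a script that runs unattended, it may be worth making the download step fail loudly. `requests.get` has no default timeout, and an error page (for example a 503 response) would otherwise be parsed like a normal one and simply yield no teasers. The sketch below shows one way to guard against both; the helper name `fetch_page` and the timeout of 10 seconds are assumptions of mine.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and return its raw content, raising on timeouts and HTTP errors."""
    # The timeout value is an arbitrary choice; without one, a stalled connection can hang
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

Inside `scrape_srf_daily_news`, the returned content could be handed to `BeautifulSoup(..., "html.parser")` in place of `page.content`.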


### Check the Functionality

In [2]:
# Define the required arguments
srf_news_site = "https://www.srf.ch/news/das-neueste"
s3_data_folder = "s3://srf-news-snippets/data"
file = f"{date.today()}_srf_news_snippets.csv"

# Configure the logging
log_fmt = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
logging.basicConfig(level=logging.INFO, format=log_fmt)

# Run the main function
main(url=srf_news_site, s3_folder_path=s3_data_folder, filename=file)

2022-06-01 14:56:23,914 - __main__ - INFO - Data retrieved
2022-06-01 14:56:24,251 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2022-06-01 14:56:24,462 - __main__ - INFO - File saved to bucket


In [7]:
# Check if the new file was saved
fs = s3fs.S3FileSystem(anon=False)
fs.ls(s3_data_folder)

2022-06-01 14:57:02,131 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials


['srf-news-snippets/data/',
 'srf-news-snippets/data/2022-06-01_srf_news_snippets.csv']
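
To turn the cells above into a standalone script, the arguments defined in the notebook need an entry point. The following is a minimal sketch using `argparse`, with the notebook's values as defaults; the option names `--url` and `--s3-folder` and the helper `build_job` are hypothetical. It only prepares the keyword arguments for `main` and does not call it, so the block runs without network or AWS access.

```python
import argparse
from datetime import date

DEFAULT_URL = "https://www.srf.ch/news/das-neueste"
DEFAULT_S3_FOLDER = "s3://srf-news-snippets/data"


def parse_args(argv=None):
    """Parse the command-line options, falling back to the values used in this notebook."""
    parser = argparse.ArgumentParser(
        description="Scrape today's SRF news teasers and save them as a CSV file."
    )
    parser.add_argument("--url", default=DEFAULT_URL, help="news page to scrape")
    parser.add_argument("--s3-folder", default=DEFAULT_S3_FOLDER, help="target folder in S3")
    return parser.parse_args(argv)


def build_job(argv=None, today=None):
    """Return the keyword arguments expected by the notebook's main function."""
    args = parse_args(argv)
    day = today or date.today()
    return {
        "url": args.url,
        "s3_folder_path": args.s3_folder,
        "filename": f"{day}_srf_news_snippets.csv",
    }


# Example with explicit arguments and a fixed date
job = build_job(["--s3-folder", "s3://srf-news-snippets/data"], today=date(2022, 6, 1))
```

In the script itself, the logging configuration from the cell above followed by `main(**build_job())` would go under an `if __name__ == "__main__":` guard.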

<a id='CL'></a>
## [Conclusion](#P0)

In this notebook, I developed code that scrapes the SRF news site, stores the information in a dataframe and finally saves it to a CSV file in an AWS S3 bucket. Then I refactored this code into functions, which I can use as a Python script. The next step will be to automate the execution of this script in the cloud.

<div style="border-top:0.1cm solid #EF475B"></div>
    <strong><a href='#Q0'><div style="text-align: right"> <h3>End of this Notebook.</h3></div></a></strong>