# Optional Web Scraping Exercise

This exercise consists of extracting data from a web page, processing it, and saving it to a `csv` file. To do so, you must:

1. Extract the articles on the home page of [https://slashdot.org/](https://slashdot.org/) using `BeautifulSoup`.
2. Process the data and store it in a `DataFrame`.
3. Create a `csv` file from that `DataFrame`.

## Import libraries

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Scrape the articles

In [None]:
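response = requests.get("https://slashdot.org/")
response.raise_for_status()  # stop early if the server returned an HTTP error status
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly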

In [None]:
articles = soup.select("h2")
data = []
for a in articles:
    link_element = a.find_next('a', class_='story-sourcelnk')
    if link_element:
        contenido_link = link_element.get_text().strip()
    else:
        contenido_link = None

    body_element = a.find_next('div', class_='body')
    if body_element:
        contenido_body = body_element.get_text().strip()
    else:
        contenido_body = None

    time_element = a.find_next('time')
    if time_element:
        datetime_attr = time_element.get('datetime')
    else:
        datetime_attr = None

    dict_article = {
        "título": a.get_text(),
        "fuente": contenido_link,
        "descripción": contenido_body,
        "fecha": datetime_attr,
    }
    data.append(dict_article)

    # The if conditionals are needed because some articles lack part of the data.
    # For example, when the source was missing, calling get_text() on None raised an error.
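
One caveat with `find_next`: it searches the rest of the document, not just the current story, so when a field is missing it can pick up the matching element from a later article instead of returning `None`. A minimal sketch of scoping each lookup to a single story follows. It assumes every story sits inside its own `<article>` container, which should be checked against the real page; `SAMPLE_HTML`, `text_or_none`, and `parse_story` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first story has no source link and no <time> tag.
SAMPLE_HTML = """
<article>
  <h2>First story</h2>
  <div class="body">Summary one.</div>
</article>
<article>
  <h2>Second story</h2>
  <a class="story-sourcelnk" href="#">(example.com)</a>
  <time datetime="on Monday June 19, 2023 @01:23PM"></time>
  <div class="body">Summary two.</div>
</article>
"""


def text_or_none(element):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return element.get_text().strip() if element is not None else None


def parse_story(container):
    """Build one record using only elements inside the given container."""
    time_element = container.select_one("time")
    return {
        "título": text_or_none(container.select_one("h2")),
        "fuente": text_or_none(container.select_one("a.story-sourcelnk")),
        "descripción": text_or_none(container.select_one("div.body")),
        "fecha": time_element.get("datetime") if time_element is not None else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_next from the first <h2> reaches into the second story.
leaked = soup.select("h2")[0].find_next("a", class_="story-sourcelnk")

# Scoped lookups keep each record limited to its own story.
stories = [parse_story(article) for article in soup.select("article")]
```

Here `leaked` is the second story's source link, while `stories[0]["fuente"]` stays `None`, which is the behaviour the conditionals above are aiming for.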

In [None]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,título,fuente,descripción,fecha
0,\n PostgreSQL Reconsiders Its Process-Based Mo...,(reuters.com),"Jonathan Corbet, writing at LWN: In the fast-m...","on Monday June 19, 2023 @01:23PM"
1,\n EU To Air Ideas on Guarding Prized Technolo...,(reuters.com),The European Commission will unveil on Tuesday...,"on Monday June 19, 2023 @12:53PM"
2,\n Intel To Spend $33 Billion in Germany in La...,(reuters.com),Intel will invest more than 30 billion euros (...,"on Monday June 19, 2023 @12:00PM"
3,\n Apple Is Taking On Apples in a Truly Weird ...,(wired.com),"Apple, the company, wants rights to the image ...","on Monday June 19, 2023 @11:20AM"
4,"\n Indonesia, SpaceX Launch Satellite To Boost...",(reuters.com),Indonesia and Elon Musk's rocket company Space...,"on Monday June 19, 2023 @10:40AM"


In [None]:
df['título'] = df['título'].str.lstrip('\n')  # remove the leading \n

In [None]:
df['fuente'] = df['fuente'].str[1:-1]  # remove the first and last character (the parentheses) of the 'fuente' column

In [None]:
df['fecha'] = df['fecha'].str[3:]  # remove the leading "on " from the 'fecha' column
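
The slices above assume every value has exactly the expected shape: parentheses around the source and an `on ` prefix on the date. A slightly more defensive sketch, using only the standard library, removes those characters only when they are present and turns the date text into a `datetime`. The function names `clean_source` and `parse_fecha` are hypothetical, and the format string assumes English day and month names like those in the output above.

```python
from datetime import datetime


def clean_source(raw):
    """Drop surrounding whitespace and parentheses, if any; keep None as None."""
    if raw is None:
        return None
    return raw.strip().strip("()")


def parse_fecha(raw):
    """Parse text such as 'on Monday June 19, 2023 @01:23PM' into a datetime."""
    if raw is None:
        return None
    text = raw.strip()
    if text.startswith("on "):
        text = text[3:]  # remove the prefix only when it is really there
    # Assumes an English locale for %A, %B and %p.
    return datetime.strptime(text, "%A %B %d, %Y @%I:%M%p")


source = clean_source("(reuters.com)")
already_clean = clean_source("reuters.com")
parsed = parse_fecha("on Monday June 19, 2023 @01:23PM")
```

Unlike `str[1:-1]`, `clean_source` leaves a value without parentheses intact. Functions like these could be applied column by column, for example with `Series.map`, provided missing values are handled first.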

In [None]:
df.head()

Unnamed: 0,título,fuente,descripción,fecha
0,PostgreSQL Reconsiders Its Process-Based Mode...,reuters.com,"Jonathan Corbet, writing at LWN: In the fast-m...","Monday June 19, 2023 @01:23PM"
1,EU To Air Ideas on Guarding Prized Technology...,reuters.com,The European Commission will unveil on Tuesday...,"Monday June 19, 2023 @12:53PM"
2,Intel To Spend $33 Billion in Germany in Land...,reuters.com,Intel will invest more than 30 billion euros (...,"Monday June 19, 2023 @12:00PM"
3,Apple Is Taking On Apples in a Truly Weird Tr...,wired.com,"Apple, the company, wants rights to the image ...","Monday June 19, 2023 @11:20AM"
4,"Indonesia, SpaceX Launch Satellite To Boost I...",reuters.com,Indonesia and Elon Musk's rocket company Space...,"Monday June 19, 2023 @10:40AM"


## Save the DataFrame

In [None]:
df.to_csv('Ejercico_WebScraping.csv', index=False)
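
To see what `index=False` changes, the following sketch writes a small, made-up `DataFrame` to an in-memory buffer and reads it back. The sample data and variable names are illustrative only; the real `df` from the scraping step would behave the same way.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped articles.
sample = pd.DataFrame({"título": ["A", "B"], "fuente": ["x.com", None]})

# Without the index: the file contains only the data columns.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With the index (the default): the index becomes an unnamed first column.
buffer_with_index = io.StringIO()
sample.to_csv(buffer_with_index)
buffer_with_index.seek(0)
columns_with_index = list(pd.read_csv(buffer_with_index).columns)
```

When the index is written, `read_csv` labels the extra column `Unnamed: 0`. Missing values such as the `None` source are written as empty fields and come back as `NaN`.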