<a href="https://colab.research.google.com/github/Frederikmh90/python_workshop/blob/main/Scripts/Workshop3_scraping_dataanalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Workshop 3: Scraping, data analysis and AI-supported coding**




## **WordPress scraper**

## Install and import packages
Based on this tutorial: https://tutorials.opted.eu/tutorials/wp_scraping/

First, clone (or download) our repository for wp-json-scraper. The module was originally developed by Mickaël “Kilawyn” Walter and provides a convenient wrapper for the WordPress API in Python. We created a fork in our OPTED repository, where we made some minor improvements to the already excellent software.

You can clone the repository into the Colab environment:

In [None]:
!git clone https://github.com/opted-eu/wp-json-scraper.git

In [None]:
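# Move into the cloned repository and make its modules importable
import os
import sys
os.chdir("wp-json-scraper/")  # relative path: run this from the directory where the repo was cloned
sys.path.append('/content/wp-json-scraper/lib')  # Colab path; adjust if you run this elsewhere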

print(os.getcwd())


Install the packages, then restart the session once everything is installed. Remember to set the path again (the cell above) after restarting.

In [None]:
#!pip install -r requirements.txt
!pip install certifi==2020.4.5.1
!pip install chardet==3.0.4
!pip install idna==2.9
!pip install requests==2.23.0
!pip install urllib3==1.25.9
!pip install bs4

## Testing the API
First, we need to ensure that the website that we want to scrape actually uses WordPress and also has the API exposed:

In [2]:
from lib.wpapi import WPApi

In [3]:
### Danish alternative news media sources ###
#target = "https://denkorteavis.dk/"
#target = "https://konfront.dk/"
#target = "https://frihedsbrevet.dk/" # not really but...
target = "https://tv.frihedensstemme.dk/"
#target = "https://redox.dk/"

### English alternative news media sources ###
#target = "https://evolvepolitics.com/"

### Indie music blogs ###
#target = "https://www.regnsky.dk/"
#target = "https://passiveaggressive.dk/"

### Climate change news ###
#target = "https://www.climatechangenews.com/"
#target = "https://insideclimatenews.org/"

### Bubbles
#target = "https://hunderacer.info/"

Checking the availability of the API is rather simple: we create an instance of the WPApi class and pass in our target as the first (and only) argument:

In [4]:
wordpress = WPApi(target)

Next, we fetch the basic information about the website and print it:

In [5]:
info = wordpress.get_basic_info()
print(info)

{'name': 'Frihedens Stemme', 'description': 'Giv en vælgererklæring til Stram Kurs på http://nejtilislam.dk', 'url': 'https://tv.frihedensstemme.dk', 'home': 'https://tv.frihedensstemme.dk', 'gmt_offset': 2, 'timezone_string': 'Europe/Copenhagen', 'namespaces': ['oembed/1.0', 'akismet/v1', 'h5vp/v1', 'jetpack/v4', 'wpcom/v2', 'jetpack/v4/stats-app', 'jetpack/v4/import', 'wpcom/v3', 'jetpack-boost/v1', 'my-jetpack/v1', 'jetpack/v4/blaze-app', 'jetpack/v4/blaze', 'wp/v2', 'wp-site-health/v1', 'wp-block-editor/v1'], 'authentication': {'application-passwords': {'endpoints': {'authorization': 'https://tv.frihedensstemme.dk/wp-admin/authorize-application.php'}}}, 'routes': {'/': {'namespace': '', 'methods': ['GET'], 'endpoints': [{'methods': ['GET'], 'args': {'context': {'default': 'view', 'required': False}}}], '_links': {'self': [{'href': 'https://tv.frihedensstemme.dk/wp-json/'}]}}, '/batch/v1': {'namespace': '', 'methods': ['POST'], 'endpoints': [{'methods': ['POST'], 'args': {'validatio
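As a rough sketch of what "API exposed" means here: the dictionary printed above lists the REST namespaces the site offers, and the core WordPress routes live under `wp/v2`. The helper below is hypothetical (it is not part of wp-json-scraper) and only inspects a dictionary shaped like the one returned by `get_basic_info()`.

```python
def exposes_core_api(info):
    """Return True if the site info lists the core 'wp/v2' namespace."""
    namespaces = info.get("namespaces") or []
    return "wp/v2" in namespaces


# Abridged copy of the output above; in the notebook you would pass `info` directly.
sample_info = {
    "name": "Frihedens Stemme",
    "namespaces": ["oembed/1.0", "akismet/v1", "wp/v2"],
}

print(exposes_core_api(sample_info))
print(exposes_core_api({"name": "Not a WordPress site"}))
```

If the check fails, the later calls for users, categories and posts are unlikely to return anything useful.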

Next, we can list the site's users (authors) and categories:

In [None]:
users = wordpress.get_users()
authors = {user['id']: user['name'] for user in users}
authors

In [None]:
categories = wordpress.get_categories()
categories = {c['id']: c['name'] for c in categories}
categories

In [None]:
import pandas as pd  # pandas library for data management
from bs4 import BeautifulSoup  # BeautifulSoup is a web-scraping package
import numpy as np  # Import numpy to use np.nan for missing values

# Create a list to store cleaned posts
cleaned_posts = []

try:
    users = wordpress.get_users()
    authors = {user['id']: user['name'] for user in users}
except Exception:
    authors = {}  # Empty dictionary if user data cannot be fetched

try:
    categories = wordpress.get_categories()
    categories_dict = {c['id']: c['name'] for c in categories}
except Exception:
    categories_dict = {}  # Empty dictionary if category data cannot be fetched

for post in wordpress.yield_posts(num=1000, search_terms=""):
    post_id = post['id']
    author_name = authors.get(post['author'], np.nan)  # Use np.nan if the author isn't found
    category_names = [categories_dict.get(c, np.nan) for c in post.get('categories', [])]  # Use np.nan for missing categories

    try:
        cleaned_post = {
            'id': post_id,
            'link': post['link'],
            'date_published': post['date'],
            'date_modified': post['modified'],
            'title': BeautifulSoup(post['title']['rendered'], 'html.parser').text,
            'content': BeautifulSoup(post['content']['rendered'], 'html.parser').text,
            'excerpt': BeautifulSoup(post['excerpt']['rendered'], 'html.parser').text,
            'author': author_name,
            'categories': category_names
        }
        cleaned_posts.append(cleaned_post)
    except Exception as e:
        print(f"Error processing post ID {post_id}: {e}")

# Convert the list of dictionaries to a pandas DataFrame
posts_df = pd.DataFrame(cleaned_posts)

# Display the DataFrame
posts_df
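The cleaning step above relies on the shape of a post as returned by the WordPress API: `title`, `content` and `excerpt` are dictionaries whose `rendered` key holds HTML. The sketch below illustrates that step on a made-up post, using the standard library's `HTMLParser` as a stand-in for BeautifulSoup so that it runs without extra packages. The sample post and the `strip_html` helper are illustrative assumptions, not part of the scraper.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical post with the same nesting as a WordPress API response
sample_post = {
    "id": 1,
    "title": {"rendered": "Hello &amp; <em>welcome</em>"},
    "content": {"rendered": "<p>First paragraph.</p>"},
}

cleaned = {
    "id": sample_post["id"],
    "title": strip_html(sample_post["title"]["rendered"]),
    "content": strip_html(sample_post["content"]["rendered"]),
}
print(cleaned)
```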


In [10]:
#posts_df["categories"].value_counts()
posts_df["author"].value_counts()

author
Uwe Max Jensen       884
Rasmus Paludan        45
Peter Neerup Buhl     43
Frihedens Stemme      28
Name: count, dtype: int64

In [None]:
posts_df["title"][4][0:500]
#posts_df["content"][0][0:500]

Using pandas, you can save a DataFrame in CSV format.

Change the file name in the cell below to match your own data before running it.


In [None]:
posts_df.to_csv("frihedensstemme_1000.csv")

# If you want to check where your file will be saved, you can always check
print(os.getcwd())
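One thing to keep in mind when saving: the `categories` column holds Python lists, and a CSV file can only store them as text. The sketch below mimics that round trip with the standard `csv` module and a made-up row, then restores the list with `ast.literal_eval`. The row and its category names are assumptions for illustration.

```python
import ast
import csv
import io

# Hypothetical row; a list cell ends up in the CSV as its string representation.
row = {"id": 1, "categories": ["Politik", "Debat"]}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "categories"])
writer.writeheader()
writer.writerow({"id": row["id"], "categories": str(row["categories"])})

# Read the row back: the list is now a plain string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["categories"]), loaded["categories"])

# Turn the string back into a real list.
restored = ast.literal_eval(loaded["categories"])
print(restored)
```

With a DataFrame read back from disk, the same idea would be `posts_df["categories"].apply(ast.literal_eval)`. Cells that contain `nan` (the placeholder used above for unknown categories) are written as `[nan]`, which `ast.literal_eval` rejects, so those rows would need separate handling.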

# **Data Analysis: Computational Text Analysis**

## Emotion analysis using DaNLP

DaNLP is a repository of Natural Language Processing resources for the Danish language. It is a collection of available datasets and models for a variety of NLP tasks.

The **aim** is to make Danish NLP easier for practitioners in industry to use, which is why the project is licensed to allow commercial use.

https://danlp-alexandra.readthedocs.io/en/latest/tasks.html

https://github.com/alexandrainst/danlp?tab=readme-ov-file


Install the package

In [None]:
!pip install danlp

### **Example**

Import the models and load the emotion model

In [None]:
from danlp.models import load_bert_emotion_model

# Load the BERT Emotion classifier
classifier = load_bert_emotion_model()

In [None]:
classes = classifier._classes()  # label names used by the model (private helper, so it may change between versions)
classes

In [None]:
# Example sentences to classify
sentences = [
    'der er et træ i haven',  # Example without emotion
    'jeg ejer en rød bil og det er en god bil',  # Example with 'Tillid/Accept'
    'jeg ejer en rød bil, men den er gået i stykker',  # Example with 'Sorg/trist'
    'jeg ejer en rød bil og er bange for, at den ikke den klarer næste syn'  # Example with 'Frygt/bekymret'
]

# Predict the emotion for each sentence
for sentence in sentences:
    prediction = classifier.predict(sentence)
    print(f'Text: {sentence}\nPredicted Emotion: {prediction}\n')


In [None]:
#example_title = "jeg ejer en rød bil og det er en god bil"
example_title = "Fremmed muslim taler knapt nok dansk efter 52 år i Danmark"

# Get the output of the predict_proba function
probabilities = classifier.predict_proba(example_title, no_emotion=False)
probabilities_list = probabilities[0].tolist()
classes = classifier._classes()  # private helper returning the model's label lists
classes = classes[0]  # first entry: the emotion labels, paired below with probabilities[0]

# Print the header for the table
print(example_title)
print("{:<30} {:<15}".format('Emotion', 'Probability'))
print("-" * 45)

# Loop through both lists and print each class with its corresponding probability
for emotion, probability in zip(classes, probabilities_list):
    print("{:<30} {:<15.4f}".format(emotion, probability))
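The table printed above lists the classes in the model's own order. A small sketch of how the same label and probability pairs could be ranked, so the most likely emotion comes first, is shown below. The three labels are the ones mentioned in the example comments earlier; the probabilities are invented for illustration and `rank_emotions` is a hypothetical helper.

```python
def rank_emotions(classes, probabilities):
    """Pair each label with its probability and sort from most to least likely."""
    pairs = zip(classes, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Invented values; in the notebook these would be `classes` and `probabilities_list`.
example_classes = ["Tillid/Accept", "Sorg/trist", "Frygt/bekymret"]
example_probabilities = [0.15, 0.25, 0.60]

ranked = rank_emotions(example_classes, example_probabilities)
for emotion, probability in ranked:
    print("{:<30} {:<15.4f}".format(emotion, probability))
```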


### **Emotion classification** with article dataset

First, we read in and inspect our data:

In [16]:
import pandas as pd
#os.chdir("/content/") # change directory path
#print(os.getcwd())
#posts_df = pd.read_csv("/content/wp-json-scraper/passiveaggressive.csv")
posts_df = pd.read_csv("/content/wp-json-scraper/frihedensstemme_1000.csv", index_col = 0)


#url = 'https://raw.githubusercontent.com/Frederikmh90/python_workshop/main/Data/frihedsbrevet_1000.csv'
#url = 'https://raw.githubusercontent.com/Frederikmh90/python_workshop/main/Data/passiveaggressive.csv'
#posts_df = pd.read_csv(url,index_col=0)



In [None]:
posts_df.head()

In [None]:
from danlp.models import load_bert_emotion_model

# Load the BERT Emotion classifier
classifier = load_bert_emotion_model()

# Predict the emotion for each title in the DataFrame
posts_df['predicted_emotion'] = posts_df['title'].apply(classifier.predict)

# Display the DataFrame with the new column of predicted emotions
posts_df[['title', 'predicted_emotion']]

In [None]:
posts_df["predicted_emotion"].value_counts()
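Raw counts are hard to compare between sources with different numbers of posts, so proportions are often more useful. In pandas, `value_counts(normalize=True)` gives these directly; the sketch below shows the same calculation in plain Python on a made-up list of predictions, with `emotion_shares` as a hypothetical helper.

```python
from collections import Counter


def emotion_shares(labels):
    """Return each label's share of all predictions, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Invented predictions standing in for posts_df["predicted_emotion"]
predicted = ["Sorg/trist", "Tillid/Accept", "Sorg/trist", "Frygt/bekymret"]

shares = emotion_shares(predicted)
for label, share in shares.items():
    print(f"{label:<20} {share:.2f}")
```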

## Sentiment and Tone Analysis also using DaNLP

In [None]:
from danlp.models import load_bert_tone_model
classifier_tone = load_bert_tone_model()


In [None]:
classes_tone = classifier_tone._classes()
print(classes_tone)

In [None]:
# Example usage:
text1 = 'Analysen viser, at økonomien bliver forfærdelig dårlig'
result1 = classifier_tone.predict(text1)
# Output: {'analytic': 'objektive', 'polarity': 'negative'}

text2 = 'Jeg tror alligevel, det bliver godt'
result2 = classifier_tone.predict(text2)
# Output: {'analytic': 'subjektive', 'polarity': 'positive'}

# Get probabilities and matching class names
probabilities1 = classifier_tone.predict_proba(text1)
probabilities2 = classifier_tone.predict_proba(text2)
classes_tone = classifier_tone._classes()

print(f"Classes for tone analysis: {classes_tone}")
print(f"Probabilities for '{text1}':\n{probabilities1}")
print(f"Probabilities for '{text2}':\n{probabilities2}")

In [27]:
df = posts_df.sample(10)  # random sample of 10 posts

In [None]:
# Define the function to get sentiment, tone, and their probabilities
def get_sentiment_tone_probs(text):
    result = classifier_tone.predict(text)
    probabilities = classifier_tone.predict_proba(text)
    classes = classifier_tone._classes()
    sentiment = result.get('polarity', 'neutral')  # Fallback to 'neutral'
    tone = result.get('analytic', 'objective')  # Fallback to 'objective'
    sentiment_prob = probabilities[0][classes[0].index(sentiment)]
    tone_prob = probabilities[1][classes[1].index(tone)]
    return sentiment, sentiment_prob, tone, tone_prob

# Example texts
example_text = "USA hjælper Ukraine til at vinde krigen"
#example_text = "Vores venner USA hjælper Ukraine til at vinde krigen"

# Get sentiment, tone, and their probabilities
sentiment, sentiment_prob, tone, tone_prob = get_sentiment_tone_probs(example_text)

# Print the results along with the probabilities
print(f"Text: '{example_text}'\nSentiment: {sentiment} (Probability: {sentiment_prob})\nTone: {tone} (Probability: {tone_prob})")
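The function above finds each probability by looking up the position of the predicted label in the matching class list. The sketch below isolates that lookup with stand-in data so it can be run without the model. The class names and probabilities are assumptions for illustration; the real values come from `classifier_tone._classes()` and `classifier_tone.predict_proba(text)`.

```python
def label_probability(label, class_names, probabilities):
    """Return the probability stored at the same position as `label`."""
    return probabilities[class_names.index(label)]


# Stand-in values (assumed shapes: one list per task, polarity first, then analytic)
classes = (["positive", "neutral", "negative"], ["objective", "subjective"])
probabilities = ([0.10, 0.20, 0.70], [0.80, 0.20])
result = {"polarity": "negative", "analytic": "objective"}

sentiment_prob = label_probability(result["polarity"], classes[0], probabilities[0])
tone_prob = label_probability(result["analytic"], classes[1], probabilities[1])
print(sentiment_prob, tone_prob)
```

Note that `list.index` raises a `ValueError` when the label is not in the list, so the lookup only works if the predicted label is spelled exactly like the class name.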


## Emotion classification with transformer models (Huggingface)

In [83]:
from transformers import pipeline

# Load the BERT-Emotions-Classifier
classifier = pipeline("text-classification", model="ayoubkirouane/BERT-Emotions-Classifier")

# Input text
text = "Your input text here"

# Perform emotion classification
results = classifier(text)

# Display the classification results
print(results)

#classifier.predict("I hate learning new things")

[{'label': 'optimism', 'score': 0.23417861759662628}]


# Load datasets

In [173]:
# Frihedsbrevet data (swap the comment markers to load the Passive/Aggressive data instead)

url = 'https://raw.githubusercontent.com/Frederikmh90/python_workshop/main/Data/frihedsbrevet_1000.csv'
#url = 'https://raw.githubusercontent.com/Frederikmh90/python_workshop/main/Data/passiveaggressive.csv'

df = pd.read_csv(url, index_col=0)


In [None]:
df.head()

# Named-entity recognition with the same dataset

In [89]:
from danlp.models import load_bert_ner_model
bert = load_bert_ner_model()


Downloading file /tmp/tmpap6e6na2



You passed along `num_labels=9` with an incompatible id to label map: {'0': 'LABEL_0', '1': 'LABEL_1'}. The number of labels wil be overwritten to 2.


### Example

In [90]:
# Get lists of tokens and labels in BIO format
tokens, labels = bert.predict("Jens Peter Hansen kommer fra Danmark")
print(tokens)
print(labels)

['jens', 'peter', 'hansen', 'kommer', 'fra', 'danmark']
['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'B-LOC']
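In BIO format, `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. For analysis it is usually handier to have whole entities than per-token labels. The sketch below groups the tokens and labels from the output above into entities; `bio_to_entities` is a hypothetical helper, not part of DaNLP.

```python
def bio_to_entities(tokens, labels):
    """Group BIO-labelled tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then start a new one
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-"):
            # Continuation of the entity that is currently open
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open one if there is one
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ['jens', 'peter', 'hansen', 'kommer', 'fra', 'danmark']
labels = ['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'B-LOC']
entities = bio_to_entities(tokens, labels)
print(entities)
# [('jens peter hansen', 'PER'), ('danmark', 'LOC')]
```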


# Newspaper3k package

In [None]:
!pip3 install newspaper3k

In [33]:
from newspaper import Article

In [34]:
#url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
#url = "https://denkorteavis.dk/2024/graensekontrol-virker-slagvaaben-gav-boede-og-hvidvask-gav-tremmer/"
url = "https://passiveaggressive.dk/anders-vestergaard-der-er-intet-tilbage-til-mig-selv/"
article = Article(url)

In [35]:
article.download()
article.html

'<!DOCTYPE html>\n<!--[if IE 7]><html lang="en" class="no-js ie7"><![endif]-->\n<!--[if IE 8]><html lang="en" class="no-js ie8"><![endif]-->\n<!--[if gt IE 8]><!--><html lang="en" class="no-js"><!--<![endif]-->\n<head>\n\t<meta charset="utf-8" />\n\t<meta name="description" content="" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n\t<meta http-equiv="screenorientation" content="autorotate:disabled">\n\t\n\t<link href="/favicon.ico" rel="icon" type="image/x-icon" />\n\t<meta property="og:title" content="Anders Vestergaard – Der er intet tilbage til mig selv"/>\n<meta property="og:site_name" content="Passive/Aggressive"/>\n<meta property="og:description" content=\'\n\n\n\n\nInterview af Jon Albjerg Ravnholt. PR-foto.\n\n\n\nDen københavnske jazztrommeslager Anders Vestergaard arbejdede sig ud af en personlig krise ved at sidde om aftenen i et øvelokale og spille de samme rytmemønstre om\'/>\n

In [36]:
article.parse()

In [37]:
article.authors

[]

In [38]:
article.publish_date

In [39]:
article.text

'Anders Vestergaard – Der er intet tilbage til mig selv\n\nInterview af Jon Albjerg Ravnholt. PR-foto.\n\nDen københavnske jazztrommeslager Anders Vestergaard arbejdede sig ud af en personlig krise ved at sidde om aftenen i et øvelokale og spille de samme rytmemønstre om og om igen, indtil trancen indfandt sig. Med sin tredje soloplade, “Propeller”, lukker han os andre ind i det rum – og i den proces lukker han sig selv ude af det.\n\nAnders Vestergaards virke strækker sig fra den melodiske fusionsjazz med Girls in Airports over freejazz i gruppen Yes Deer og ambient avantgarde-jazz i trioen Zeuthen/Anderskov/Vestergaard til noiseduoen Laser Nun med Lars Bech Pilgaard.Han er desuden en del af den albumaktuelle trommekvartet Blomsten sammen med Victor Dybbroe fra Girls in Airports, Mads Forsby, som Anders Vestergaard tog over for i samme band, og Anders Bach, der har produceret “Propeller”.\n\nMen hans to seneste plader under eget navn er ikke skabt i et kreativt samspil med andre. De e

In [40]:
article.top_image

'https://passiveaggressive.dk/wp-content/uploads/2024/04/anders-vestergaard-860x573.jpeg'

In [41]:
article.movies

['https://www.youtube.com/embed/J4RaQO00Jjk?feature=oembed',
 'https://www.youtube.com/embed/T_Xy9EyJem4?feature=oembed']

In [43]:
import nltk
nltk.download('punkt')

article.nlp()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [44]:
article.keywords

['den',
 'anders',
 'til',
 'vestergaard',
 'mig',
 'intet',
 'og',
 'er',
 'en',
 'der',
 'med',
 'selv',
 'det',
 'jeg',
 'tilbage',
 'på']
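Most of the keywords above are Danish function words ("og", "er", "det"), so they say little about the article. One plausible reason is that the keyword extraction did not use a Danish stopword list. A minimal sketch of filtering them out afterwards is shown below; the stopword set is a small hand-picked assumption for this example, and a real analysis would use a fuller list.

```python
# Hand-picked for this example only; not a complete Danish stopword list.
DANISH_STOPWORDS = {
    "den", "til", "mig", "og", "er", "en", "der", "med", "det", "jeg", "på",
}

# Keywords copied from the output above
keywords = ['den', 'anders', 'til', 'vestergaard', 'mig', 'intet', 'og', 'er',
            'en', 'der', 'med', 'selv', 'det', 'jeg', 'tilbage', 'på']

content_words = [word for word in keywords if word not in DANISH_STOPWORDS]
print(content_words)
# ['anders', 'vestergaard', 'intet', 'selv', 'tilbage']
```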

## Download articles

In [48]:
import newspaper

In [49]:
#blog = newspaper.build('https://www.theguardian.com/europe')
blog = newspaper.build("https://passiveaggressive.dk/")



CRITICAL:newspaper.network:[REQUEST FAILED] 404 Client Error: Not Found for url: https://passiveaggressive.dk/feeds


In [50]:
for article in blog.articles:
  print(article.url)

https://passiveaggressive.dk/category/feature/
https://passiveaggressive.dk/passive-aggressive-conversations-2-alto-aria/
https://passiveaggressive.dk/anders-vestergaard-der-er-intet-tilbage-til-mig-selv/
https://passiveaggressive.dk/ilinx-mellem-folkemusik-carl-nielsen-og-japansk-pigepunk-mix/
https://passiveaggressive.dk/clarissa-connelly-jeg-vil-gerne-have-at-fortiden-lober-igennem-min-musik-interview/
https://passiveaggressive.dk/open-call-passive-aggressive-is-looking-for-new-board-members/
https://passiveaggressive.dk/barbro-kuypers-og-mija-milovic-om-at-skrive-musik-sammen-protestsange-og-at-vaerne-om-at-vaere-okay-interview/
https://passiveaggressive.dk/mouth-wound-ekstrem-lyd-som-eksorcisme-for-sindets-smertepunkter-interview/
https://passiveaggressive.dk/space-africa-experience-is-at-the-root-of-everything-that-we-produce-interview/
https://passiveaggressive.dk/passive-aggressive-conversations-1-dalin-waldo/
https://passiveaggressive.dk/drift-radio-on-housing-diversities-and-

In [51]:
for category in blog.category_urls():
  print(category)

https://passiveaggressive.dk/
http://passiveaggressive.dk
https://passiveaggressive.dk


In [52]:
blog_article = blog.articles[0]

In [102]:
#dir(blog_article)

In [53]:
blog_article.download()


In [54]:
blog_article.parse()


In [55]:
blog_article.nlp()

In [56]:
blog_article.title


'Aggressive – Passive Aggressive Conversations #2: Alto Aria'

In [57]:
blog_article.authors

[]

In [58]:
blog_article.url

'https://passiveaggressive.dk/category/feature/'

In [59]:
blog_article.top_image

'https://passiveaggressive.dk/favicon.ico'

In [60]:
blog_article.text

'Passive Aggressive Conversations #2: Alto Aria Maite Cintron, with Ivna Franic and Macon Holt Passive Aggressive Conversations is our new podcast series hosted by two Passive/Aggressive journalists, Macon Holt and Ivna Franic, where we dive deep into some of the emerging sounds within the Danish music scene and their unique forms of expression. In each episode, we sit down with an exciting new independent artist, and they share their creative ... Læs resten\n\nAnders Vestergaard – Der er intet tilbage til mig selv Interview af Jon Albjerg Ravnholt. PR-foto. Den københavnske jazztrommeslager Anders Vestergaard arbejdede sig ud af en personlig krise ved at sidde om aftenen i et øvelokale og spille de samme rytmemønstre om og om igen, indtil trancen indfandt sig. Med sin tredje soloplade, “Propeller”, lukker han os andre ind i det rum – og i den proces lukker han sig selv ude af det. Læs resten\n\nSpace Afrika – “Experience is at the root of everything that we produce” (Interview) By Ivn
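The first entry in `blog.articles` is a category page (`/category/feature/`) rather than a single article, which is why its text is a run of teasers and its top image is the favicon. A simple heuristic for skipping such pages before downloading is sketched below. The rule (front page or a path starting with `/category/`) is an assumption based on the URLs printed above and may need adjusting for other sites.

```python
from urllib.parse import urlparse


def is_listing_page(url):
    """Heuristic: the front page and category archives are not single articles."""
    path = urlparse(url).path
    return path in ("", "/") or path.startswith("/category/")


# URLs copied from the output above
urls = [
    "https://passiveaggressive.dk/category/feature/",
    "https://passiveaggressive.dk/passive-aggressive-conversations-2-alto-aria/",
    "https://passiveaggressive.dk/anders-vestergaard-der-er-intet-tilbage-til-mig-selv/",
]

article_urls = [u for u in urls if not is_listing_page(u)]
print(article_urls)
```

In the notebook, the same filter could be applied as `[a for a in blog.articles if not is_listing_page(a.url)]`.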

In [None]:
"""
import pandas as pd
from newspaper import Article, build

# URL of the news category
url = 'https://www.theguardian.com/europe'

# Build a newspaper object that will scrape articles from the specified URL
paper = build(url, memoize_articles=False)

# List to hold dictionaries of article information
articles_list = []

# Loop through the articles in the paper object
for article in paper.articles:
    # Try to download and parse the article
    try:
        article.download()
        article.parse()
    except Exception:
        continue  # Skip the article if there are issues downloading/parsing

    # Create a dictionary of the desired article information
    article_data = {
        "url": article.url,
        "title": article.title,
        "text": article.text,
        "authors": article.authors,
        "top_image": article.top_image
    }
    # Add the dictionary to the list of articles
    articles_list.append(article_data)

# Create a DataFrame from the list of article dictionaries
df_articles = pd.DataFrame(articles_list)

# Display the DataFrame
print(df_articles.head())

# Save DataFrame to a CSV file (optional)
df_articles.to_csv('guardian_articles.csv', index=False)
"""

In [132]:
df_articles

Unnamed: 0,url,title,text,authors,top_image
0,https://www.theguardian.com/world/live/2024/ap...,Middle East crisis live: US denies carrying ou...,From 3h ago 06.08 EDT Iraqi military say it is...,"[Amy Sedghi, Clea Skopeliti]",https://i.guim.co.uk/img/media/43a894ac750aa78...
1,https://www.theguardian.com/us-news/2024/apr/2...,‘Media firestorm’: Israel protest at professor...,During a dinner for students that the dean of ...,[J Oliver Conroy],https://i.guim.co.uk/img/media/635b8a4fdcccb7c...
2,https://www.theguardian.com/us-news/2024/apr/2...,Pro-Israel US groups plan $100m effort to unse...,Pro-Israel groups are pumping millions into th...,"[Joan E Greve, Alice Herman, Will Craft]",https://i.guim.co.uk/img/media/69ba5956beabb12...
3,https://www.theguardian.com/world/2024/apr/20/...,Explosion hits base of Iranian-aligned Iraqi a...,An explosion has hit an Iraqi military base ho...,[],https://i.guim.co.uk/img/media/81b80a9a9732efd...
4,https://www.theguardian.com/us-news/2024/apr/2...,Man dies after setting himself on fire outside...,A man has died after setting himself on fire o...,[],https://i.guim.co.uk/img/media/4d823c913f7a60c...
...,...,...,...,...,...
1183,https://www.theguardian.com/food/2024/mar/16/i...,I didn’t eat proper risotto till I was nearly ...,"I know, I know. How peak middle class to make ...",[Rachel Cooke],https://i.guim.co.uk/img/media/d54dc38e1205e70...
1184,https://www.theguardian.com/food/2024/mar/16/i...,I didn’t eat proper risotto till I was nearly ...,"I know, I know. How peak middle class to make ...",[Rachel Cooke],https://i.guim.co.uk/img/media/d54dc38e1205e70...
1185,https://www.theguardian.com/food/2023/nov/08/e...,Extreme weather pushes global wine production ...,Global wine production has fallen this year to...,[],https://i.guim.co.uk/img/media/29e30c1646d66c0...
1186,https://www.theguardian.com/food/2024/apr/19/b...,Benjamina Ebuehi’s recipe for mango and Tajín ...,I spent two glorious weeks in Mexico City last...,[Benjamina Ebuehi],https://i.guim.co.uk/img/media/b398fa40e64670a...


In [None]:
import pandas as pd
from newspaper import Article, Config, build

# Configuration: remember article URLs between runs and skip image fetching
config = Config()
config.memoize_articles = True  # on re-runs, only URLs not seen before are returned (possibly none)
config.fetch_images = False

# URL of the news category
url = 'https://www.theguardian.com/europe'

# Build a newspaper object with the specified configuration
paper = build(url, config=config)

# List to hold dictionaries of article information
articles_list = []

# Process the articles one at a time (this loop is sequential, not multi-threaded)
for article in paper.articles:
    try:
        article.download()
        article.parse()
        article.nlp()
    except Exception:
        continue  # Skip the article if downloading, parsing or NLP fails

    # Dictionary of desired article information including NLP results and publication date
    article_data = {
        "url": article.url,
        "title": article.title,
        "text": article.text,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "top_image": article.top_image,
        "keywords": article.keywords,
        "summary": article.summary
    }
    articles_list.append(article_data)

# Convert the list of dictionaries into a DataFrame
df_articles = pd.DataFrame(articles_list)

# Ensure datetime is in the correct format if not already
df_articles['publish_date'] = pd.to_datetime(df_articles['publish_date'])

# Display the DataFrame
print(df_articles[['url', 'title', 'publish_date', 'authors', 'top_image']].head())

# Optionally, save to CSV
df_articles.to_csv('guardian_articles.csv', index=False)


In [72]:
import pandas as pd
from newspaper import Article, Config, build

# Configuration: remember article URLs between runs and skip image fetching
config = Config()
config.memoize_articles = True  # on re-runs, only URLs not seen before are returned (possibly none)
config.fetch_images = False

# URL of the news category and the number of articles to scrape
url = 'https://www.theguardian.com/europe'

max_articles = 10  # Set to the desired number of articles

# Build a newspaper object with the specified configuration
paper = build(url, config=config)

# List to hold dictionaries of article information
articles_list = []

# Process the articles one at a time (this loop is sequential, not multi-threaded)
for count, article in enumerate(paper.articles):
    if count >= max_articles:
        break  # Stop once max_articles articles have been attempted

    try:
        article.download()
        article.parse()
        article.nlp()
    except Exception:
        continue  # Skip the article if downloading, parsing or NLP fails

    # Dictionary of desired article information including NLP results and publication date
    article_data = {
        "url": article.url,
        "title": article.title,
        "text": article.text,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "top_image": article.top_image,
        "keywords": article.keywords,
        "summary": article.summary
    }
    articles_list.append(article_data)

# Convert the list of dictionaries into a DataFrame
df_articles = pd.DataFrame(articles_list)

# Ensure datetime is in the correct format if not already
df_articles['publish_date'] = pd.to_datetime(df_articles['publish_date'])

# Display the DataFrame
print(df_articles[['url', 'title', 'publish_date', 'authors', 'top_image']].head())

# Optionally, save to CSV
df_articles.to_csv('guardian_articles.csv', index=False)


CRITICAL:newspaper.network:[REQUEST FAILED] 404 Client Error: Not Found for url: https://www.theguardian.com/feeds
CRITICAL:newspaper.network:[REQUEST FAILED] 404 Client Error: Not Found for url: https://www.theguardian.com/feed


KeyError: "None of [Index(['url', 'title', 'publish_date', 'authors', 'top_image'], dtype='object')] are in the [columns]"

In [67]:
df_articles

Unnamed: 0,url,title,text,authors,publish_date,top_image,keywords,summary
0,https://www.theguardian.com/books/2020/sep/18/...,Diary of an MP’s Wife by Sasha Swire review – ...,"We have come a long way, thankfully, since the...",[],2020-09-18,https://i.guim.co.uk/img/media/f3c621b2ab3b691...,"[camerons, hugo, sasha, wife, review, memoir, ...","We have come a long way, thankfully, since the..."
