# News and Events Scrapping by Event Registry API Services

This notebook deploys the Python package EventRegistry to access the Event Registry API, a service for collecting and analyzing news articles and other news-related data. The data will be used to train news articles classifier model or scrap the news.

In [14]:
from eventregistry import *
api_key = 'eabf582f-dd68-404e-88d8-a448daae96e8'

# Get News Articles

Output of get articles function:

- `"uri"`: A unique identifier for the article.
- `"lang"`: The language of the article.
- `"isDuplicate"`: A boolean value indicating whether the article is a duplicate of another article.
- `"date"`: The publication date of the article.
- `"time"`: The publication time of the article.
- `"dateTime"`: The publication date and time of the article in ISO 8601 format.
- `"dataType"`: The type of the article (e.g., "news", "blog").
- `"url"`: The URL of the article.
- `"title"`: The title of the article.
- `"body"`: The body text of the article.
- `"source"`: A dictionary containing information about the source of the article.
- `"authors"`: A list of dictionaries, each representing an author of the article.
- `"image"`: The URL of an image in the article.
- `"eventUri"`: A unique identifier for the event associated with the article.
- `"sentiment"`: The sentiment score of the article.
- `"wgt"`: A weight value for the article.
- `"relevance"`: The relevance score of the article.

Also, user could search the news articles based on the concept (or area & topics), news categories, and location of source. 

In [3]:
def get_articles(api_key = None, concepts = None, category = None,
                 source_uris = None, sort_by = None, location = None,
                 nums_of_search = None, lang = None, start_date = None,
                 end_date = None):

  # Initialise API
  er = EventRegistry(apiKey = api_key)

  # Convert source names to URIs by OR operator
  source_uri_list = QueryItems.OR([er.getSourceUri(uri) for uri in source_uris]) if source_uris else None

  # Obtain articles
  articles = []

  # Flexibility for single or multiple concept extraction
  for concept in concepts:
    # Query set up
    q = QueryArticlesIter(
      conceptUri = er.getConceptUri(concept),
      categoryUri = er.getCategoryUri(category) if category else None,
      locationUri= er.getLocationUri(location) if location else None,
      sourceUri = source_uri_list,
      lang = lang if lang else lang,
      dateStart = start_date if start_date else None,
      dateEnd = end_date if end_date else None)

    for art in q.execQuery(er, sortBy = sort_by, maxItems = nums_of_search):
      art_dict = art
      art_dict['sourceTitle'] = art['source']['title']
      art_dict['sourceUri'] = art['source']['uri']
      articles.append(art_dict)

  return articles

Example of getting the concept

In [4]:
er = EventRegistry(apiKey = api_key)
er.getConceptUri("Taiwan")

'http://en.wikipedia.org/wiki/Taiwan_(island)'

Example of getting category

In [52]:
er = EventRegistry(apiKey = api_key)
er.getCategoryUri("economy")

'dmoz/Games/Board_Games/Economy_and_Trading'

Next, it sets some global variables before scrapping news articles.

In [10]:
# Define news sources to avoid content farm articles

# International news agencies
inter_news_soruces = ["afp.com", "efe.com", "euronews.com", 
                      "politico.com", "reuters.com", "rferl.org", 
                      "scmp.com", "swissinfo.ch"]

# US news agencies
US_news_sources = ["bloomberg.com", "cnn.com", "cbsnews.com", "foxnews.com",
                   "huffpost.com", "latimes.com", "nbcnews.com", "nytimes.com",
                   "msnbc.com", "sfchronicle.com",
                   "usatoday.com", "washingtonpost.com", "wsj.com"]

# UK news agencies
UK_news_sources = ["bbc.co.uk", "dailymail.co.uk", "express.co.uk", "gbnews.com",
                   "independent.co.uk", "metro.co.uk","mirror.co.uk",
                   "news.sky.com", "telegraph.co.uk",
                   "theguardian.com", "thesun.co.uk"]

# Combine the above news sources
new_sources = inter_news_soruces + UK_news_sources + US_news_sources

Set the period of new search:

In [7]:
# YYYY-mm-dd format
start_date = "2022-11-01"
end_date = "2023-11-21"

# Ecnomomy news articles

In [8]:
# Define the econ concepts
econ_concepts = ["economy", "finance", "marketing", "investment", "management",
                     "entrepreneurship", "startups", "technology", "real estate",
                     "e-commerce", "International trade", "corporate governance",
                     "product development", "risk management", "sales",
                     "public relations", "ethics", "business", "technology",
                     "CEO", "artifical intelligence"]

In [15]:
# get econ articles
econ_articles = get_articles(api_key = api_key,
                        concepts = econ_concepts,
                        category = "business",
                        source_uris = new_sources,
                        sort_by = 'rel',
                        location = None,
                        nums_of_search = 500,
                        lang = "eng",
                        start_date = start_date,
                        end_date = end_date)

In [16]:
# Convert to pd dataframe
import pandas as pd
df_econ_articles = pd.DataFrame(econ_articles)

# Filter the duplicated articles
Flt_econ_articles = df_econ_articles[df_econ_articles["isDuplicate"] == False]

In [18]:
Flt_econ_articles.head(3)

Unnamed: 0,uri,lang,isDuplicate,date,time,dateTime,dateTimePub,dataType,sim,url,...,body,source,authors,image,eventUri,sentiment,wgt,relevance,sourceTitle,sourceUri
3,7799798736,eng,False,2023-10-25,06:27:47,2023-10-25T06:27:47Z,2023-10-25T06:26:54Z,news,0.0,https://www.reuters.com/world/india/india-serv...,...,"NEW DELHI, Oct 25 (Reuters) - Indian tax autho...","{'uri': 'reuters.com', 'dataType': 'news', 'ti...",[],https://www.reuters.com/pf/resources/images/re...,,0.027451,26,26,Reuters,reuters.com
5,7813047304,eng,False,2023-11-01,16:06:24,2023-11-01T16:06:24Z,2023-11-01T15:57:03Z,news,0.623529,https://www.thesun.co.uk/betting/24596527/betg...,...,Commercial content notice: Taking one of the b...,"{'uri': 'thesun.co.uk', 'dataType': 'news', 't...",[],https://www.thesun.co.uk/wp-content/uploads/20...,eng-9019872,0.231373,26,26,The Sun,thesun.co.uk
6,7836276713,eng,False,2023-11-15,10:00:02,2023-11-15T10:00:02Z,2023-11-15T09:57:51Z,news,0.823529,https://metro.co.uk/2023/11/15/gta-6-publisher...,...,The CEO of Rockstar owner Take-Two CEO has tal...,"{'uri': 'metro.co.uk', 'dataType': 'news', 'ti...","[{'uri': 'kenneth_andersen@metro.co.uk', 'name...",https://metro.co.uk/wp-content/uploads/2023/03...,eng-9063302,0.341176,26,26,Metro,metro.co.uk


In [25]:
file_path = "/Users/amosmbp14/Jupyter notebook/News_classifier/econ_news.csv"
Flt_econ_articles.to_csv(file_path, index=False)

# Technology news articles

In [26]:
# Define the tech concepts
tech_concepts = ["technology", "artificial intelligence", "machine learning",
                 "data science", "cybersecurity", "financial technology",
                 "internet of things", "blockchain", "cloud computing",
                 "virtual reality", "augmented reality", "big data",
                 "quantum computing", "robotics", "software development",
                 "network", "bioinformatics", "digital", "automation", "5g",
                 "computing", "autonomous vehicles", "smart", "cryptocurrency",
                 "deep learning", "natural language processing",
                 "computer vision", "e-commerce", "telecommunication"]

In [27]:
# get your articles
tech_articles = get_articles(api_key = api_key,
                        concepts = tech_concepts,
                        category = "technology",
                        source_uris = new_sources,
                        sort_by = 'rel',
                        location = None,
                        nums_of_search = 500,
                        lang = "eng",
                        start_date = start_date,
                        end_date = end_date)

The processing of the request took a lot of time (43 sec). By repeatedly making slow requests your account will be temporarily disabled.


In [28]:
# Convert to pd dataframe
import pandas as pd
df_tech_articles = pd.DataFrame(tech_articles)

# Filter the duplicated articles
Flt_tech_articles = df_tech_articles[df_tech_articles["isDuplicate"] == False]

In [35]:
Flt_tech_articles.head(2)

Unnamed: 0,uri,lang,isDuplicate,date,time,dateTime,dateTimePub,dataType,sim,url,...,body,source,authors,image,eventUri,sentiment,wgt,relevance,sourceTitle,sourceUri
0,7815330104,eng,False,2023-11-02,21:36:46,2023-11-02T21:36:46Z,2023-11-02T21:27:24Z,news,0.741176,https://www.dailymail.co.uk/wires/pa/article-1...,...,"Elon Musk warned of humanoid robots that ""can ...","{'uri': 'dailymail.co.uk', 'dataType': 'news',...",[],https://i.dailymail.co.uk/1s/2023/11/02/21/wir...,eng-9027597,0.082353,74,74,Daily Mail Online,dailymail.co.uk
1,7809839495,eng,False,2023-10-30,23:49:28,2023-10-30T23:49:28Z,2023-10-30T23:48:41Z,news,0.701961,https://www.politico.com/news/2023/10/30/biden...,...,Most observers see three broad groups trying t...,"{'uri': 'politico.com', 'dataType': 'news', 't...",[],https://static.politico.com/a4/3e/e9722f9341bc...,eng-9016248,0.176471,68,68,POLITICO,politico.com


In [30]:
file_path = "/Users/amosmbp14/Jupyter notebook/News_classifier/tech_news.csv"
Flt_tech_articles.to_csv(file_path, index=False)

# Political news articles

In [31]:
# Define the politics concept
politics_concepts = ["politics", "elections", "political parties",
                     "legislation", "international relations",
                     "public policy", "political campaigns",
                     "political economy", "US politics", "conservative party",
                     "labour party", "European Union", "green party", "CCP",
                     "US government" "UK government" "Chinese government",
                     "Taiwan"]

In [32]:
# get politics articles
politic_articles = get_articles(api_key = api_key,
                        concepts = politics_concepts,
                        category = "politics",
                        source_uris = new_sources,
                        sort_by = 'rel',
                        location = None,
                        nums_of_search = 500,
                        lang = "eng",
                        start_date = start_date,
                        end_date = end_date)

In [33]:
# Convert to pd dataframe
import pandas as pd
df_pol_articles = pd.DataFrame(politic_articles)

# Filter the duplicated articles
Flt_pol_articles = df_pol_articles[df_pol_articles["isDuplicate"] == False]

In [36]:
file_path = "/Users/amosmbp14/Jupyter notebook/News_classifier/pol_news.csv"
Flt_pol_articles.to_csv(file_path, index=False)

# Environmental news

In [78]:
# Define the concepts
env_concepts = ["climate change", "global warming", "climate action",
                    "extreme weather", "carbon footprint", "sustainability",
                    "climate resilience", "eco friendly", "carbon emission"]

In [41]:
# get your articles
env_articles = get_articles(api_key = api_key,
                        concepts = env_concepts,
                        category = "environment",
                        source_uris = new_sources,
                        sort_by = 'rel',
                        location = None,
                        nums_of_search = 500,
                        lang = "eng",
                        start_date = start_date,
                        end_date = end_date)

In [42]:
# Convert to pd dataframe
import pandas as pd
df_env_articles = pd.DataFrame(env_articles)

# Filter the duplicated articles
Flt_env_articles = df_env_articles[df_env_articles["isDuplicate"] == False]

In [43]:
file_path = "/Users/amosmbp14/Jupyter notebook/News_classifier/env_news.csv"
Flt_env_articles.to_csv(file_path, index=False)

# Legal news

In [75]:
er = EventRegistry(apiKey = api_key)
er.getConceptUri("competition Law")

'http://en.wikipedia.org/wiki/Competition_law'

In [77]:
# Define the concepts
law_concepts = ["civil law", "finance law", "business law", "commercial law",
               "construction law", "consumer law", "riminal law",
               "employment law", "Environmental law", "human right law",
               "immigration law", "insurance law", "Intellectual Property law",
               "property law", "tax law", "competition law"]

In [79]:
# get your articles
law_articles = get_articles(api_key = api_key,
                        concepts = law_concepts,
                        category = "law",
                        source_uris = new_sources,
                        sort_by = 'rel',
                        location = None,
                        nums_of_search = 500,
                        lang = "eng",
                        start_date = start_date,
                        end_date = end_date)

In [80]:
# Convert to pd dataframe
import pandas as pd
df_law_articles = pd.DataFrame(law_articles)

# Filter the duplicated articles
Flt_law_articles = df_law_articles[df_law_articles["isDuplicate"] == False]

In [81]:
file_path = "/Users/amosmbp14/Jupyter notebook/News_classifier/law_news.csv"
Flt_law_articles.to_csv(file_path, index=False)

# Create dataset for Machine Learning

In [86]:
# Add category column to each dataframe
econ_df = Flt_econ_articles
econ_df['category'] = 'economy'

env_df = Flt_env_articles
env_df['category'] = 'environment'

pol_df = Flt_pol_articles
pol_df["category"] = 'politics'

tech_df = Flt_tech_articles
tech_df["category"] = 'technology'

law_df = Flt_law_articles
law_df["category"] = 'law'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  econ_df['category'] = 'economy'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  env_df['category'] = 'environment'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pol_df["category"] = 'politics'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer]

Explore all datasets:

In [87]:
econ_df.columns

Index(['uri', 'lang', 'isDuplicate', 'date', 'time', 'dateTime', 'dateTimePub',
       'dataType', 'sim', 'url', 'title', 'body', 'source', 'authors', 'image',
       'eventUri', 'sentiment', 'wgt', 'relevance', 'sourceTitle', 'sourceUri',
       'category'],
      dtype='object')

In [89]:
econ_df.loc[10:11, ["title", "body", "category", "sourceTitle"]]

Unnamed: 0,title,body,category,sourceTitle
10,Young hero who walked away from gangs reveals ...,Osmond Gordon-Vernon is this year's National L...,economy,Mirror
11,"Tech, Media & Telecom Roundup: Market Talk","The latest Market Talks covering Technology, M...",economy,The Wall Street Journal


In [90]:
env_df.loc[10:11, ["title", "body", "category", "sourceTitle"]]

Unnamed: 0,title,body,category,sourceTitle
10,Revealed: the industry figures behind 'declara...,Document used to target top EU officials over ...,environment,The Guardian
11,The age of 'climate war' is upon us. Courts ne...,The Rome Statute includes multiple provisions ...,environment,Euronews English


In [95]:
pol_df.loc[20:21, ["title", "body", "category", "sourceTitle"]]

Unnamed: 0,title,body,category,sourceTitle
20,Pieter Omtzigt: centrist outsider who wants to...,The brand new NSC party is top of the polls as...,politics,The Guardian
21,David Cameron's return to politics stuns forei...,David Cameron's shock return to politics has s...,politics,Daily Mail Online


In [94]:
tech_df.loc[14:15, ["title", "body", "category", "sourceTitle"]]

Unnamed: 0,title,body,category,sourceTitle
14,Biden wants to move fast on AI safeguards and ...,WASHINGTON (AP) - President Joe Biden on Monda...,technology,Daily Mail Online
15,Start-Ups Bring Silicon Valley Ethos to a Lumb...,"Small, fast-moving U.S. tech firms are using t...",technology,The New York Times


In [96]:
law_df.loc[44:45, ["title", "body", "category", "sourceTitle"]]

Unnamed: 0,title,body,category,sourceTitle
44,White Island volcano eruption: Whakaari Manage...,New Zealand Justice Evangelos Thomas criticise...,law,The Guardian
45,Texas justices weigh lawsuit by judge censured...,Oct 25 (Reuters) - Members of the Texas Suprem...,law,Reuters


Combine each dataset:

In [97]:
news_df = pd.concat([econ_df,env_df,pol_df,tech_df,law_df],
                    axis = 0, ignore_index = True)

In [99]:
news_df.head()

Unnamed: 0,uri,lang,isDuplicate,date,time,dateTime,dateTimePub,dataType,sim,url,...,source,authors,image,eventUri,sentiment,wgt,relevance,sourceTitle,sourceUri,category
0,7799798736,eng,False,2023-10-25,06:27:47,2023-10-25T06:27:47Z,2023-10-25T06:26:54Z,news,0.0,https://www.reuters.com/world/india/india-serv...,...,"{'uri': 'reuters.com', 'dataType': 'news', 'ti...",[],https://www.reuters.com/pf/resources/images/re...,,0.027451,26,26,Reuters,reuters.com,economy
1,7813047304,eng,False,2023-11-01,16:06:24,2023-11-01T16:06:24Z,2023-11-01T15:57:03Z,news,0.623529,https://www.thesun.co.uk/betting/24596527/betg...,...,"{'uri': 'thesun.co.uk', 'dataType': 'news', 't...",[],https://www.thesun.co.uk/wp-content/uploads/20...,eng-9019872,0.231373,26,26,The Sun,thesun.co.uk,economy
2,7836276713,eng,False,2023-11-15,10:00:02,2023-11-15T10:00:02Z,2023-11-15T09:57:51Z,news,0.823529,https://metro.co.uk/2023/11/15/gta-6-publisher...,...,"{'uri': 'metro.co.uk', 'dataType': 'news', 'ti...","[{'uri': 'kenneth_andersen@metro.co.uk', 'name...",https://metro.co.uk/wp-content/uploads/2023/03...,eng-9063302,0.341176,26,26,Metro,metro.co.uk,economy
3,7831019242,eng,False,2023-11-12,08:20:57,2023-11-12T08:20:57Z,2023-11-12T08:15:18Z,news,0.513726,https://www.scmp.com/lifestyle/entertainment/a...,...,"{'uri': 'scmp.com', 'dataType': 'news', 'title...",[],https://cdn.i-scmp.com/sites/default/files/sty...,eng-9051704,0.380392,26,26,South China Morning Post,scmp.com,economy
4,7845568959,eng,False,2023-11-20,11:03:52,2023-11-20T11:03:52Z,2023-11-20T11:00:39Z,news,0.0,https://www.latimes.com/entertainment-arts/bus...,...,"{'uri': 'latimes.com', 'dataType': 'news', 'ti...",[],https://ca-times.brightspotcdn.com/dims4/defau...,,0.05098,27,27,Los Angeles Times,latimes.com,economy


In [101]:
news_df.tail()

Unnamed: 0,uri,lang,isDuplicate,date,time,dateTime,dateTimePub,dataType,sim,url,...,source,authors,image,eventUri,sentiment,wgt,relevance,sourceTitle,sourceUri,category
8655,2023-11-141538514,eng,False,2023-11-01,17:06:02,2023-11-01T17:06:02Z,2023-11-01T16:53:48Z,news,0.0,https://www.reuters.com/legal/legalindustry/la...,...,"{'uri': 'reuters.com', 'dataType': 'news', 'ti...","[{'uri': 'sara_merken@reuters.com', 'name': 'S...",https://www.reuters.com/resizer/lx3Dqm3R897ELB...,,0.035294,9,9,Reuters,reuters.com,law
8656,2023-10-134467173,eng,False,2023-10-26,22:34:45,2023-10-26T22:34:45Z,2023-10-26T22:20:51Z,news,0.74902,https://www.reuters.com/legal/government/canna...,...,"{'uri': 'reuters.com', 'dataType': 'news', 'ti...",[],https://www.reuters.com/resizer/wPgZczJ3RJerF4...,eng-9010916,-0.137255,8,8,Reuters,reuters.com,law
8657,2023-11-141667333,eng,False,2023-11-01,19:16:04,2023-11-01T19:16:04Z,2023-11-01T17:32:38Z,news,0.827451,https://www.reuters.com/legal/litigation/dc-su...,...,"{'uri': 'reuters.com', 'dataType': 'news', 'ti...","[{'uri': 'mike_scarcella@reuters.com', 'name':...",https://www.reuters.com/resizer/7mGaVctlj-1zNS...,eng-9024690,-0.113725,6,6,Reuters,reuters.com,law
8658,2023-11-140961384,eng,False,2023-11-01,09:12:58,2023-11-01T09:12:58Z,2023-11-01T09:02:44Z,news,0.0,https://www.nytimes.com/2023/11/01/us/politics...,...,"{'uri': 'nytimes.com', 'dataType': 'news', 'ti...","[{'uri': 'charlie_savage@nytimes.com', 'name':...",https://static01.nyt.com/images/2023/10/31/mul...,,0.207843,6,6,The New York Times,nytimes.com,law
8659,7814956236,eng,False,2023-11-02,16:33:17,2023-11-02T16:33:17Z,2023-11-02T16:32:44Z,news,0.0,https://www.washingtonpost.com/technology/2023...,...,"{'uri': 'washingtonpost.com', 'dataType': 'new...",[{'uri': 'caroline_o_donovan@washingtonpost.co...,https://www.washingtonpost.com/wp-apps/imrs.ph...,,-0.058824,3,3,Washington Post,washingtonpost.com,law


In [102]:
# Download dataset 
file_path = "/Users/amosmbp14/Jupyter notebook/News_classifier/news_for_ML.csv"
news_df.to_csv(file_path, index=False)