# Load data about articles from the Guardian Media Group API

## 1. Extract information using the Guardian Media Group API

More information about using the Guardian Media Group API can be found in the [documentation](https://open-platform.theguardian.com/documentation/).

To access the API, you need to [sign up for an API key](https://open-platform.theguardian.com/access/), which must be sent with every request.

I generated an API key and saved it in the `api.cfg` file.

Import required libraries:

In [1]:
import configparser
import json
from datetime import datetime, timedelta, date

In [2]:
import requests

In [3]:
import pandas as pd

Import API key:

In [4]:
config =  configparser.ConfigParser()
config.read('api.cfg')

['api.cfg']
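
The notebook later reads the key as `config['API']['KEY']`, so `api.cfg` presumably holds an `[API]` section with a `KEY` entry. The sketch below shows that assumed layout; the key value is a placeholder and `read_string` stands in for reading the real file.

```python
import configparser

# Assumed layout of api.cfg; the key below is a placeholder, not a real key.
SAMPLE_CFG = """
[API]
KEY = your-api-key-here
"""

config_example = configparser.ConfigParser()
config_example.read_string(SAMPLE_CFG)  # parses text instead of a file on disk

api_key = config_example['API']['KEY']
print(api_key)
```

`config.read` returns the list of files it managed to parse, which is why the cell above prints `['api.cfg']`; an empty list would suggest the file was not found.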

Create the query for the Guardian Media Group API, which extracts articles published between `from_date` and `to_date`:

In [5]:
from_date = '2022-11-12'
to_date = '2022-11-13'

In [6]:
url_querry = 'https://content.guardianapis.com/search?' \
    + 'api-key=' + config['API']['KEY'] + '&' \
    + 'from-date=' + from_date + '&' \
    + 'to-date=' + to_date + '&' \
    + 'type=' + 'article' + '&' \
    + 'show-tags=keyword' + '&' \
    + 'order-by=' + 'oldest'
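
As an alternative to concatenating the string by hand, the same query could be assembled from a dictionary with `urllib.parse.urlencode`, which also escapes any special characters in the values. This is only a sketch: `build_search_url` and `BASE_URL` are hypothetical names that do not appear in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://content.guardianapis.com/search'

def build_search_url(api_key, from_date, to_date):
    """Build the search URL with the same parameters as the query above."""
    params = {
        'api-key': api_key,
        'from-date': from_date,
        'to-date': to_date,
        'type': 'article',
        'show-tags': 'keyword',
        'order-by': 'oldest',
    }
    return BASE_URL + '?' + urlencode(params)

# 'test' is a dummy key used only for illustration.
example_url = build_search_url('test', '2022-11-12', '2022-11-13')
print(example_url)
```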

Get the response from the API and parse it as a JSON object:

In [7]:
response = requests.get(url_querry)
data_json = response.json()
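
The request above assumes that the call succeeds. A small check on the parsed payload could catch a failed request early. The helper below is a hypothetical sketch (the name `extract_response` is not from the notebook) and runs on a hand-made payload rather than a live response.

```python
def extract_response(payload):
    """Return the inner 'response' dict, or raise if the status is not 'ok'."""
    body = payload.get('response', {})
    if body.get('status') != 'ok':
        raise ValueError('Unexpected API status: {!r}'.format(body.get('status')))
    return body

# Hand-made payload that mimics the shape of the API response.
sample_payload = {'response': {'status': 'ok', 'total': 316, 'pages': 32, 'results': []}}
body = extract_response(sample_payload)
print(body['pages'])
```

With a live request, `response.raise_for_status()` could also be called before `response.json()` to surface HTTP errors.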

Get the number of result pages for the query:

In [8]:
# the number of pages can be at most 3800
data_json['response']['pages']

32

In [9]:
data_json['response']

{'status': 'ok',
 'userTier': 'developer',
 'total': 316,
 'startIndex': 1,
 'pageSize': 10,
 'currentPage': 1,
 'pages': 32,
 'orderBy': 'oldest',
 'results': [{'id': 'football/2022/nov/12/harry-kane-world-cup-fitness-england-tottenham-antonio-conte',
   'type': 'article',
   'sectionId': 'football',
   'sectionName': 'Football',
   'webPublicationDate': '2022-11-12T00:04:24Z',
   'webTitle': 'England fans ‘must not worry’ about Harry Kane’s fitness, insists Conte',
   'webUrl': 'https://www.theguardian.com/football/2022/nov/12/harry-kane-world-cup-fitness-england-tottenham-antonio-conte',
   'apiUrl': 'https://content.guardianapis.com/football/2022/nov/12/harry-kane-world-cup-fitness-england-tottenham-antonio-conte',
   'tags': [{'id': 'football/harry-kane',
     'type': 'keyword',
     'sectionId': 'football',
     'sectionName': 'Football',
     'webTitle': 'Harry Kane',
     'webUrl': 'https://www.theguardian.com/football/harry-kane',
     'apiUrl': 'https://content.guardianapis.c
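
The `pages` value in the response appears to follow from `total` and `pageSize`: 316 results at 10 per page need 32 pages, the last one only partly filled. A quick check of that arithmetic:

```python
import math

total = 316      # 'total' from the response above
page_size = 10   # 'pageSize' from the response above

# Round up, because the final page holds the remaining 6 results.
pages = math.ceil(total / page_size)
print(pages)  # 32
```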

Convert the first page of the response to the dataframe `df_response`:

In [10]:
df_response = pd.DataFrame(data_json['response']['results'])

In [11]:
df_response.head()

|  | id | type | sectionId | sectionName | webPublicationDate | webTitle | webUrl | apiUrl | tags | isHosted | pillarId | pillarName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | football/2022/nov/12/harry-kane-world-cup-fitn... | article | football | Football | 2022-11-12T00:04:24Z | England fans ‘must not worry’ about Harry Kane... | https://www.theguardian.com/football/2022/nov/... | https://content.guardianapis.com/football/2022... | [{'id': 'football/harry-kane', 'type': 'keywor... | False | pillar/sport | Sport |
| 1 | australia-news/2022/nov/11/cruise-ship-majesti... | article | australia-news | Australia news | 2022-11-12T00:35:40Z | Majestic Princess: cruise ship passengers dise... | https://www.theguardian.com/australia-news/202... | https://content.guardianapis.com/australia-new... | [{'id': 'australia-news/australia-news', 'type... | False | pillar/news | News |
| 2 | us-news/2022/nov/11/democrat-mark-kelly-arizon... | article | us-news | US news | 2022-11-12T03:46:54Z | Mark Kelly holds on to Arizona seat in critica... | https://www.theguardian.com/us-news/2022/nov/1... | https://content.guardianapis.com/us-news/2022/... | [{'id': 'us-news/us-midterm-elections-2022', '... | False | pillar/news | News |
| 3 | australia-news/2022/nov/12/medibank-hack-clare... | article | australia-news | Australia news | 2022-11-12T03:47:17Z | Medibank hack: Clare O’Neil says new cybercrim... | https://www.theguardian.com/australia-news/202... | https://content.guardianapis.com/australia-new... | [{'id': 'australia-news/australian-politics', ... | False | pillar/news | News |
| 4 | environment/2022/nov/12/replace-animal-farms-m... | article | environment | Environment | 2022-11-12T04:00:30Z | Replace animal farms with micro-organism tanks... | https://www.theguardian.com/environment/2022/n... | https://content.guardianapis.com/environment/2... | [{'id': 'environment/food', 'type': 'keyword',... | False | pillar/news | News |


Fetch the remaining pages, from the second to the last, and append each one to the dataframe `df_response`:

In [12]:
count_of_pages = data_json['response']['pages']
# Page 1 is already in df_response, so start from page 2.
for page in range(2, count_of_pages + 1):
    response = requests.get(url_querry + '&page=' + str(page))
    data_json = response.json()
    df_page = pd.DataFrame(data_json['response']['results'])
    # Append this page's rows to the accumulated dataframe.
    df_response = pd.concat([df_response, df_page])
    print('Processed {}/{} - {:2.2%}'.format(page, count_of_pages, page / count_of_pages))

Processed 2/32 - 6.25%
Processed 3/32 - 9.38%
Processed 4/32 - 12.50%
Processed 5/32 - 15.62%
Processed 6/32 - 18.75%
Processed 7/32 - 21.88%
Processed 8/32 - 25.00%
Processed 9/32 - 28.12%
Processed 10/32 - 31.25%
Processed 11/32 - 34.38%
Processed 12/32 - 37.50%
Processed 13/32 - 40.62%
Processed 14/32 - 43.75%
Processed 15/32 - 46.88%
Processed 16/32 - 50.00%
Processed 17/32 - 53.12%
Processed 18/32 - 56.25%
Processed 19/32 - 59.38%
Processed 20/32 - 62.50%
Processed 21/32 - 65.62%
Processed 22/32 - 68.75%
Processed 23/32 - 71.88%
Processed 24/32 - 75.00%
Processed 25/32 - 78.12%
Processed 26/32 - 81.25%
Processed 27/32 - 84.38%
Processed 28/32 - 87.50%
Processed 29/32 - 90.62%
Processed 30/32 - 93.75%
Processed 31/32 - 96.88%
Processed 32/32 - 100.00%
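
Concatenating inside the loop copies the growing dataframe on every iteration. A common alternative is to collect one dataframe per page in a list and call `pd.concat` once at the end. The sketch below illustrates this on hand-made page data instead of live API responses; `combine_pages` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def combine_pages(pages_of_results):
    """Build one dataframe from a list of per-page result lists."""
    frames = [pd.DataFrame(results) for results in pages_of_results]
    # ignore_index=True renumbers the rows 0..n-1 across all pages.
    return pd.concat(frames, ignore_index=True)

# Hand-made stand-ins for data_json['response']['results'] of two pages.
fake_pages = [
    [{'id': 'a', 'type': 'article'}, {'id': 'b', 'type': 'article'}],
    [{'id': 'c', 'type': 'article'}],
]
df_combined = combine_pages(fake_pages)
print(df_combined)
```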


In [13]:
data_json

{'response': {'status': 'ok',
  'userTier': 'developer',
  'total': 316,
  'startIndex': 311,
  'pageSize': 10,
  'currentPage': 32,
  'pages': 32,
  'orderBy': 'oldest',
  'results': [{'id': 'australia-news/2022/nov/14/medibank-mental-health-data-posted-on-dark-web-as-russian-hackers-vow-to-keep-our-word',
    'type': 'article',
    'sectionId': 'australia-news',
    'sectionName': 'Australia news',
    'webPublicationDate': '2022-11-13T21:39:38Z',
    'webTitle': 'Medibank mental health data posted on dark web as Russian hackers vow to ‘keep our word’',
    'webUrl': 'https://www.theguardian.com/australia-news/2022/nov/14/medibank-mental-health-data-posted-on-dark-web-as-russian-hackers-vow-to-keep-our-word',
    'apiUrl': 'https://content.guardianapis.com/australia-news/2022/nov/14/medibank-mental-health-data-posted-on-dark-web-as-russian-hackers-vow-to-keep-our-word',
    'tags': [{'id': 'australia-news/medibank',
      'type': 'keyword',
      'sectionId': 'australia-news',
      

In [14]:
df_response = df_response.reset_index()

In [15]:
df_response = df_response.drop(columns=['index'])

In [16]:
df_response['id_num'] = range(1, len(df_response) + 1)
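
The three cells above renumber the rows and add a running article number. For reference, `reset_index(drop=True)` does the reset and the removal of the old index in one call. A small sketch on a toy dataframe (`df_demo` is a made-up example):

```python
import pandas as pd

# Duplicated index labels, similar to what repeated pd.concat calls leave behind.
df_demo = pd.DataFrame({'id': ['a', 'b', 'c']}, index=[0, 1, 0])

# drop=True discards the old index instead of keeping it as an 'index' column.
df_demo = df_demo.reset_index(drop=True)
df_demo['id_num'] = range(1, len(df_demo) + 1)
print(df_demo)
```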

## 2. Save general information about articles

In [17]:
df_article = df_response[['id_num',
               'id', 
               'type', 
               'sectionId',
               'sectionName',
               'webPublicationDate',
               'webTitle',
               'isHosted',
               'pillarId',
               'pillarName'
              ]]

In [18]:
df_article.head()

|  | id_num | id | type | sectionId | sectionName | webPublicationDate | webTitle | isHosted | pillarId | pillarName |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | football/2022/nov/12/harry-kane-world-cup-fitn... | article | football | Football | 2022-11-12T00:04:24Z | England fans ‘must not worry’ about Harry Kane... | False | pillar/sport | Sport |
| 1 | 2 | australia-news/2022/nov/11/cruise-ship-majesti... | article | australia-news | Australia news | 2022-11-12T00:35:40Z | Majestic Princess: cruise ship passengers dise... | False | pillar/news | News |
| 2 | 3 | us-news/2022/nov/11/democrat-mark-kelly-arizon... | article | us-news | US news | 2022-11-12T03:46:54Z | Mark Kelly holds on to Arizona seat in critica... | False | pillar/news | News |
| 3 | 4 | australia-news/2022/nov/12/medibank-hack-clare... | article | australia-news | Australia news | 2022-11-12T03:47:17Z | Medibank hack: Clare O’Neil says new cybercrim... | False | pillar/news | News |
| 4 | 5 | environment/2022/nov/12/replace-animal-farms-m... | article | environment | Environment | 2022-11-12T04:00:30Z | Replace animal farms with micro-organism tanks... | False | pillar/news | News |


In [19]:
name_csv = 'general/guardian-article-data_start-date-' + from_date + '_end-date-' + to_date + '_general.csv'

In [20]:
df_article.to_csv(name_csv,index=False,sep=';')
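
Two details of this export are easy to trip over: `to_csv` does not create a missing `general/` folder, and a file written with `sep=';'` has to be read back with the same separator. The sketch below shows both on a toy dataframe in a temporary directory; the file name `example_general.csv` is made up for illustration.

```python
import os
import tempfile

import pandas as pd

df_small = pd.DataFrame({'id_num': [1, 2],
                         'webTitle': ['First; title', 'Second title']})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, 'general')
    os.makedirs(out_dir, exist_ok=True)  # create the folder before writing
    out_path = os.path.join(out_dir, 'example_general.csv')

    df_small.to_csv(out_path, index=False, sep=';')
    # Read back with the same separator; the title containing ';' is quoted on write.
    df_back = pd.read_csv(out_path, sep=';')

print(df_back)
```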

## 3. Save tag information

In [21]:
df_tags = df_response[['id_num',
               'id',
               'tags'
              ]]

In [22]:
df_tags.head()

|  | id_num | id | tags |
|---|---|---|---|
| 0 | 1 | football/2022/nov/12/harry-kane-world-cup-fitn... | [{'id': 'football/harry-kane', 'type': 'keywor... |
| 1 | 2 | australia-news/2022/nov/11/cruise-ship-majesti... | [{'id': 'australia-news/australia-news', 'type... |
| 2 | 3 | us-news/2022/nov/11/democrat-mark-kelly-arizon... | [{'id': 'us-news/us-midterm-elections-2022', '... |
| 3 | 4 | australia-news/2022/nov/12/medibank-hack-clare... | [{'id': 'australia-news/australian-politics', ... |
| 4 | 5 | environment/2022/nov/12/replace-animal-farms-m... | [{'id': 'environment/food', 'type': 'keyword',... |


In [23]:
pd.DataFrame(df_tags.iloc[3]['tags'])

|  | id | type | sectionId | sectionName | webTitle | webUrl | apiUrl | references | description |
|---|---|---|---|---|---|---|---|---|---|
| 0 | australia-news/australian-politics | keyword | australia-news | Australia news | Australian politics | https://www.theguardian.com/australia-news/aus... | https://content.guardianapis.com/australia-new... | [] |  |
| 1 | australia-news/labor-party | keyword | australia-news | Australia news | Labor party | https://www.theguardian.com/australia-news/lab... | https://content.guardianapis.com/australia-new... | [] |  |
| 2 | australia-news/coalition | keyword | australia-news | Australia news | Coalition | https://www.theguardian.com/australia-news/coa... | https://content.guardianapis.com/australia-new... | [] |  |
| 3 | australia-news/australia-news | keyword | australia-news | Australia news | Australia news | https://www.theguardian.com/australia-news/aus... | https://content.guardianapis.com/australia-new... | [] |  |
| 4 | world/russia | keyword | world | World news | Russia | https://www.theguardian.com/world/russia | https://content.guardianapis.com/world/russia | [] |  |
| 5 | australia-news/medibank | keyword | australia-news | Australia news | Medibank | https://www.theguardian.com/australia-news/med... | https://content.guardianapis.com/australia-new... | [] |  |
| 6 | australia-news/crime-australia | keyword | australia-news | Australia news | Crime - Australia | https://www.theguardian.com/australia-news/cri... | https://content.guardianapis.com/australia-new... | [] | Latest news on crime in Australia from the Gua... |


In [24]:
name_json = 'guardian-article-data_start-date-' + from_date + '_end-date-' + to_date + '_pre-tags.json'

In [25]:
df_tags.to_json(name_json,orient="records")

In [26]:
with open(name_json,'r') as f:
    data = json.loads(f.read())

In [27]:
df_tag = pd.json_normalize(data, 
                           record_path =['tags'], 
                           meta =['id_num','id'],
                           record_prefix='tag_')
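
To make the `json_normalize` call easier to follow, the sketch below applies the same arguments to a small hand-made list of records. `record_path` names the nested list to unpack into rows, `meta` lists the parent fields to repeat on each row, and `record_prefix` is added to the nested field names. The article and tag values here are made up for illustration.

```python
import pandas as pd

# Hand-made records with the same shape as the saved JSON: one dict per article.
records = [
    {'id_num': 1, 'id': 'article-a', 'tags': [
        {'id': 'football/england', 'webTitle': 'England'},
        {'id': 'football/football', 'webTitle': 'Football'},
    ]},
    {'id_num': 2, 'id': 'article-b', 'tags': [
        {'id': 'world/russia', 'webTitle': 'Russia'},
    ]},
]

# One output row per tag, with the parent article's id_num and id repeated.
df_flat = pd.json_normalize(records,
                            record_path=['tags'],
                            meta=['id_num', 'id'],
                            record_prefix='tag_')
print(df_flat)
```

On the real data, `df_tags.to_dict(orient='records')` would likely provide the same list of records directly, without the intermediate JSON file; this is a suggestion rather than what the notebook does.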

In [28]:
df_tag = df_tag[['id_num',
               'id',
               'tag_id',
               'tag_sectionName',
               'tag_webTitle',
               'tag_description'
              ]]

In [29]:
df_tag.head()

|  | id_num | id | tag_id | tag_sectionName | tag_webTitle | tag_description |
|---|---|---|---|---|---|---|
| 0 | 1 | football/2022/nov/12/harry-kane-world-cup-fitn... | football/harry-kane | Football | Harry Kane |  |
| 1 | 1 | football/2022/nov/12/harry-kane-world-cup-fitn... | football/world-cup-2022 | Football | World Cup 2022 |  |
| 2 | 1 | football/2022/nov/12/harry-kane-world-cup-fitn... | football/england | Football | England |  |
| 3 | 1 | football/2022/nov/12/harry-kane-world-cup-fitn... | football/tottenham-hotspur | Football | Tottenham Hotspur | Read the latest Tottenham Hotspur news, transf... |
| 4 | 1 | football/2022/nov/12/harry-kane-world-cup-fitn... | football/football | Football | Football |  |


In [30]:
name_csv = 'tags/guardian-article-data_start-date-' + from_date + '_end-date-' + to_date + '_tags.csv'

In [31]:
df_tag.to_csv(name_csv,index=False,sep=';')

In [32]:
df_tag[df_tag['tag_webTitle'] == 'Justin Trudeau']

|  | id_num | id | tag_id | tag_sectionName | tag_webTitle | tag_description |
|---|---|---|---|---|---|---|
