# The Guardian 

In [1]:
import pandas as pd
import requests
import spacy

## Scrape the Data from API

We will use the free [The Guardian API](https://open-platform.theguardian.com/explore/). Our target is the articles published under the "World" section in 2022, so we start by defining the request URL.

### Define URL

Starting from the API URL below, we replace the `{randomDate}` placeholder with a date range for each month of 2022: 'https://content.guardianapis.com/world?{randomDate}&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575'
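As an alternative to formatting the URL string by hand, the same query can be assembled from a dictionary of parameters. This is only a minimal sketch: `build_url` and the environment variable name `GUARDIAN_API_KEY` are hypothetical names introduced here, and the parameters are the ones that already appear in the URL above.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://content.guardianapis.com/world"


def build_url(from_date, to_date, api_key, page=None):
    """Return the query URL for one date range (and optionally one page)."""
    params = {
        "from-date": from_date,
        "to-date": to_date,
        "show-tags": "all",
        "show-blocks": "all",
        "page-size": 10,
        "api-key": api_key,
    }
    if page is not None:
        params["page"] = page
    # urlencode keeps the insertion order of the dictionary
    return f"{BASE_URL}?{urlencode(params)}"


# GUARDIAN_API_KEY is a hypothetical variable name; "YOUR-KEY" is a placeholder
api_key = os.environ.get("GUARDIAN_API_KEY", "YOUR-KEY")
print(build_url("2022-01-01", "2022-01-31", api_key))
```

Reading the key from the environment keeps it out of the notebook, which matters if the notebook is shared.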


In [40]:
# World News, 2022: build one date-range string per month
from datetime import datetime, timedelta

# Define the start and end dates
start_date = datetime(2022, 1, 1)
end_date = datetime(2022, 12, 31)

# generate monthly date ranges
def generate_monthly_ranges(start_date, end_date):
    current_date = start_date
    date_ranges = []
    
    while current_date <= end_date:
        next_month = current_date.replace(day=1) + timedelta(days=32)
        first_day_of_next_month = next_month.replace(day=1)
        last_day_of_month = (first_day_of_next_month - timedelta(days=1))
        
        date_ranges.append(
            f"from-date={current_date.strftime('%Y-%m-%d')}&to-date={last_day_of_month.strftime('%Y-%m-%d')}"
        )
        
        current_date = first_day_of_next_month
    
    return date_ranges

# Generate and print date ranges for each month
dateList = generate_monthly_ranges(start_date, end_date)

#print(dateList)

url_without_date = 'https://content.guardianapis.com/world?{randomDate}&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575'
url_with_date = []
for date in dateList:
    newUrl = url_without_date.format(randomDate = date)
    url_with_date.append(newUrl)
url_with_date

['https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575',
 'https://content.guardianapis.com/world?from-date=2022-02-01&to-date=2022-02-28&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575',
 'https://content.guardianapis.com/world?from-date=2022-03-01&to-date=2022-03-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575',
 'https://content.guardianapis.com/world?from-date=2022-04-01&to-date=2022-04-30&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575',
 'https://content.guardianapis.com/world?from-date=2022-05-01&to-date=2022-05-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575',
 'https://content.guardianapis.com/world?from-date=2022-06-01&to-date=2022-06-30&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e

We decided to collect 100 articles per month. The API returns results page by page, and our requests set the page size to 10 articles, so we sample 10 pages for each month. The last page of a month does not always contain 10 articles (for example, the last page in April had only 4), so we exclude the last page of each month from the selection to keep the sample sizes equal.
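The page-selection rule described above can be sketched in isolation. The helper name `sample_pages` and its `seed` parameter are assumptions made for this illustration; the sketch also caps the sample size so that a month with fewer than 11 pages does not raise an error.

```python
import random


def sample_pages(total_pages, k=10, seed=None):
    """Pick up to k distinct page numbers from 1..total_pages-1 (last page excluded)."""
    candidates = range(1, total_pages)  # range stops before the last page
    rng = random.Random(seed)           # a seed makes the sample reproducible
    k = min(k, len(candidates))         # avoid ValueError when few pages exist
    return rng.sample(candidates, k)


pages = sample_pages(47, seed=0)
print(sorted(pages))
```

Passing a fixed seed is optional, but it allows the same pages to be requested again if the collection has to be rerun.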

In [47]:
# World News, 2022: append sampled page numbers to each monthly URL
import random
url_full = []
for url in url_with_date:
    response = requests.get(url)
    x = response.json()
    total_pages = x['response']['pages']
    # Sample 10 distinct pages from 1..total_pages-1, i.e. every page except the last
    random_pages = random.sample(range(1, total_pages), 10)
    for page in random_pages:
        url_with_page = url + '&page=' + str(page)
        url_full.append(url_with_page)
url_full

['https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page=46',
 'https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page=10',
 'https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page=22',
 'https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page=14',
 'https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-blocks=all&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page=72',
 'https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-block

In [4]:
# eg
url="https://content.guardianapis.com/search?section=world&from-date=2023-11-01&show-blocks=all&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page="

In [32]:
# test: World News, 202201-202301 = 35814
url = 'https://content.guardianapis.com/search?q=world&from-date=2022-01-01&to-date=2023-01-01&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575'

In [31]:
# test: World News, 202201
url = 'https://content.guardianapis.com/search?section=world&from-date=2022-01-01&to-date=2022-01-31&show-tags=all&show-fields=all&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575'

In [15]:
# test: World News, 202201, concatenation
url = 'https://content.guardianapis.com/world?from-date=2022-01-01&to-date=2022-01-31&page-size=10&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page='

## Get Texts 

We use `requests` to fetch every URL in `url_full` and store each parsed JSON response in the `info` list.

In [4]:
info=[]
def json(url1):
    # Fetch one URL and append the parsed JSON response to the global `info` list
    response = requests.get(url1)
    x = response.json()
    info.append(x)

In [48]:
info=[]
# json() returns None, so `output` is a list of None; the responses are collected in `info`
output=[json(url1) for url1 in url_full]

The API returns a great deal of information that is unrelated to our research, so we extract only the relevant fields: type, web title, section name, web publication date, web URL, tags, and elements. The extracted data is saved in the `extracted_data` list.
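The field selection described above can be sketched on a single result. The function name `extract_article` and the `sample` dictionary are made up for illustration; only the key names come from the fields listed above. Note that the body blocks are read from the article itself, so each record holds only its own elements.

```python
def extract_article(item):
    """Keep only the fields used in this project from one API result."""
    tags = item.get("tags", [])
    body_blocks = item.get("blocks", {}).get("body", [])
    return {
        "type": item.get("type", ""),
        "webTitle": item.get("webTitle", ""),
        "sectionName": item.get("sectionName", ""),
        "webPublicationDate": item.get("webPublicationDate", ""),
        "webUrl": item.get("webUrl", ""),
        "tags": [
            {"tagTitle": t.get("webTitle", ""), "tagURL": t.get("webUrl", "")}
            for t in tags
        ],
        "tagCount": len(tags),
        "elements": [
            {
                "id": b.get("id", ""),
                "bodyTextSummary": b.get("bodyTextSummary", ""),
                "lastModifiedDate": b.get("lastModifiedDate", ""),
            }
            for b in body_blocks
        ],
    }


# Made-up result, for illustration only
sample = {
    "type": "article",
    "webTitle": "Example title",
    "tags": [{"webTitle": "World news", "webUrl": "https://example.org/world"}],
    "blocks": {"body": [{"id": "b1", "bodyTextSummary": "Example text"}]},
}
record = extract_article(sample)
print(record["tagCount"], record["elements"][0]["bodyTextSummary"])
```

Using `.get()` with defaults means a result that lacks a field produces an empty value instead of a `KeyError`.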

In [49]:
extracted_data = [
    {
        'type': item['type'],
        'webTitle': item['webTitle'],
        'sectionName': item['sectionName'],
        'webPublicationDate': item['webPublicationDate'],
        'webUrl': item['webUrl'],
        'tags': [
            {
                'tagTitle': tag['webTitle'],
                'tagURL': tag['webUrl'],
            }
            for tag in item['tags']
        ],
        'tagCount': len(item['tags']),
        'elements': [
            {
                'id': element['id'],
                'bodyTextSummary': element.get('bodyTextSummary', ''),
                'lastModifiedDate': element.get('lastModifiedDate', ''),
            }
            # Only the body blocks of the current article, not of the whole page
            for element in item.get('blocks', {}).get('body', [])
        ],
    }
    for x in info
    if 'results' in x['response']
    for item in x['response']['results']
]

#for i in extracted_data:
    #print(i['type'], i['tagCount'], i['webTitle'], '!!!!!!!!!!!!', i['webPublicationDate'], i['webUrl'])


## Save corpus in text files

We intend to use the article title as the filename. Therefore, we need to remove special characters. Additionally, to prevent overwriting, we are employing Counter() to append suffixes to files with the same name.

In [105]:
#save corpus to txt 
import os
from collections import Counter

def cleanFilename(filename):
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename

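output_directory = 'Guardian_corpus_txt'
os.makedirs(output_directory, exist_ok=True)  # create the folder if it does not exist
file_counts = Counter()
file_names = []  # full paths of the files written so far
iteration_count = 0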

for line in extracted_data:
    base_filename = cleanFilename(line['webTitle'])
    extension = '.txt'
    filename = base_filename + extension

    # Check if the filename already exists
    count = file_counts[filename]
    while os.path.join(output_directory, filename) in file_names:
        # If yes, increment the count and add the suffix
        count += 1
        filename = f"{base_filename}-{count}{extension}"

    file_counts[filename] = count
    file_names.append(os.path.join(output_directory, filename))

    with open(os.path.join(output_directory, filename), 'w', encoding='utf-8') as file:
        # Write the data dictionary to the file
        file.write(str(line))

    iteration_count += 1 

print(f"Total iterations: {iteration_count}")


Total iterations: 1200


## Data Quality Check

While collecting the data, we encountered results that did not meet our expectations, and we spent considerable effort identifying and fixing the causes. The debugging steps are omitted from the main sections, but we consider them a valuable part of the data collection experience. In this section, we explain how we identified and resolved the problems.

### 1. Check the Number of Articles per Month

Ideally, we should have a total of 1200 articles. However, we observed only 1194 in total. The discrepancy arose from inadvertently selecting the last page of April, which contains only 4 articles. To confirm that every month contributes 100 articles, we count the number of articles for each month.

In [50]:
# Load the extracted data into a DataFrame and save it as CSV
#print(extracted_data[1])
df = pd.DataFrame(extracted_data)
df.to_csv('Guardian_corpus.csv', index=False)
# Note: corpus_df is not defined in the cells shown here
print(corpus_df.head(2))

                                            Filename  \
0  Mystery deepens as owners say Hong Kong floati...   
1                          Dom Phillips obituary.txt   

                                            Document  
0  {'type': 'article', 'webTitle': 'Mystery deepe...  
1  {'type': 'article', 'webTitle': 'Dom Phillips ...  


In [51]:
# Convert 'webPublicationDate' to datetime object
df['webPublicationDate'] = pd.to_datetime(df['webPublicationDate'])

# Extract the month from the 'webPublicationDate' column
df['month'] = df['webPublicationDate'].dt.to_period('M')

# Count the articles in each month
web_publication_month_counts = df['month'].value_counts()

# Display the unique months and their counts
print("Number of unique months:", len(web_publication_month_counts))
print("\nUnique months and their counts:")
print(web_publication_month_counts)


Number of unique months: 12

Unique months and their counts:
2022-01    100
2022-02    100
2022-03    100
2022-04    100
2022-05    100
2022-06    100
2022-07    100
2022-08    100
2022-09    100
2022-10    100
2022-11    100
2022-12    100
Freq: M, Name: month, dtype: int64
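The same check can also be written as a small function that reports only the months whose count differs from the target. This is a plain-Python sketch, assuming the publication dates are ISO 8601 strings that start with `YYYY-MM`; the name `months_off_target` and the sample dates are made up for illustration.

```python
from collections import Counter


def months_off_target(dates, target=100):
    """Return {month: count} for months whose article count differs from target."""
    counts = Counter(d[:7] for d in dates)  # 'YYYY-MM' prefix of each date string
    return {month: n for month, n in sorted(counts.items()) if n != target}


# Made-up dates: April has 4 articles, May has 5
dates = ["2022-04-02T10:00:00Z"] * 4 + ["2022-05-01T08:30:00Z"] * 5
print(months_off_target(dates, target=5))  # → {'2022-04': 4}
```

An empty result means every month that appears in the data has the target count. A month that is missing entirely is not reported, so the number of unique months still needs to be checked separately.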




### 2. Prevent Overwriting

We noticed that only 1195 text files were written, fewer than the expected 1200. Comparing the file names with the DataFrame showed that some articles share the same web title, which we use as the file name, so later files overwrote earlier ones.
(P.S.: Running the code now will not reproduce this output, because the problematic code has since been fixed.)

In [79]:
# count files
filename_df['filename'].count()

1195

In [78]:
# count dataframe
df['webTitle'].count()

1200

Here, we find the file name that was overwritten.

In [102]:
from collections import Counter
len(set(file_names))
# Use Counter to count occurrences of each element
element_counts = Counter(file_names)
# Find repeated elements (elements with count greater than 1)
repeated_elements = [element for element, count in element_counts.items() if count > 1]

# Print the repeated elements
print(repeated_elements)

['What happened in the Russia-Ukraine war this week_ Catch up with the must-read news and analysis.txt']


Having identified the problem, we added code that checks whether a file name has already been used and appends a suffix to the repeated ones. The solution is presented in the "Save corpus in text files" section.
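The suffix rule can be sketched in isolation as follows. The helper `unique_filename` and the sample titles are hypothetical; the set of invalid characters is the one used in `cleanFilename`.

```python
def unique_filename(title, used, extension=".txt"):
    """Clean a title and return a file name that is not yet in `used`."""
    invalid_chars = '<>:"/\\|?*'
    base = "".join("_" if c in invalid_chars else c for c in title)
    candidate = base + extension
    count = 0
    while candidate in used:
        # Name already taken: try -1, -2, ... until a free one is found
        count += 1
        candidate = f"{base}-{count}{extension}"
    used.add(candidate)
    return candidate


used = set()
names = [unique_filename(t, used) for t in ["A: b", "A: b", "A: b", "C?"]]
print(names)  # → ['A_ b.txt', 'A_ b-1.txt', 'A_ b-2.txt', 'C_.txt']
```

Keeping the used names in a set makes the membership test fast, and the number of returned names always equals the number of titles, so no file is overwritten.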

In [3]:
df = pd.read_csv('Guardian_spacy.csv')
df['Token'] = df['Document'].copy()
print(df['Token'].head(2))

0    {'webTitle': 'Israel-Gaza war live: any attemp...
1    {'webTitle': 'Macron confident Orbán can be pe...
Name: Tokens, dtype: object


In [21]:
# Load the spaCy model once instead of on every call
nlp = spacy.load("en_core_web_sm")

def get_token(text):
    # Strip punctuation, then let spaCy tokenize the remaining text
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    text = ''.join(character for character in text
                   if character not in punctuation)
    return nlp(text)

df['Token'] = df['Document'].apply(get_token)

print(df['Token'].head(2))

0    (webTitle, IsraelGaza, war, live, any, attempt...
1    (webTitle, Macron, confident, Orbán, can, be, ...
Name: Token, dtype: object


In [26]:
def lemma(text):
    # Lemma of every token in a spaCy Doc
    return [token.lemma_ for token in text]
df['Lemma'] = df['Token'].apply(lemma)
print(df['Lemma'].head(2))

0    [webTitle, IsraelGaza, war, live, any, attempt...
1    [webTitle, Macron, confident, Orbán, can, be, ...
Name: Lemma, dtype: object


In [25]:
def pos(text):
    # Coarse part-of-speech tag of every token in a spaCy Doc
    return [token.pos_ for token in text]

df['Pos'] = df['Token'].apply(pos)
print(df['Pos'].head(2))

0    [PROPN, PROPN, NOUN, VERB, DET, NOUN, PART, VE...
1    [PROPN, PROPN, ADJ, PROPN, AUX, AUX, VERB, PAR...
Name: Pos, dtype: object


In [33]:
print(df.head(2))

                                            Filename  \
0  Israel-Gaza war live_ any attempt to isolate G...   
1  Macron confident Orbán can be persuaded to sup...   

                                            Document  \
0  {'webTitle': 'Israel-Gaza war live: any attemp...   
1  {'webTitle': 'Macron confident Orbán can be pe...   

                                               Token  \
0  (webTitle, IsraelGaza, war, live, any, attempt...   
1  (webTitle, Macron, confident, Orbán, can, be, ...   

                                                 Pos  \
0  [PROPN, PROPN, NOUN, VERB, DET, NOUN, PART, VE...   
1  [PROPN, PROPN, ADJ, PROPN, AUX, AUX, VERB, PAR...   

                                               Lemma  
0  [webTitle, IsraelGaza, war, live, any, attempt...  
1  [webTitle, Macron, confident, Orbán, can, be, ...  


In [34]:
df.to_csv('Guardian_pandas_spacy_03.csv', index=False)