# The Guardian 

In [1]:
import pandas as pd
import requests
import spacy

## Scrape the Data from API

I want to collect every article in the 'World' section published in December 2023, limited to the first five pages of API results. First I define the request URL, leaving the page number to be appended at the end.

In [53]:
url="https://content.guardianapis.com/search?section=world&from-date=2023-12-1&to-date=2023-12-31&show-blocks=all&api-key=4c043d21-d53e-4a99-a6f3-1a08745b7575&page="

In [54]:
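# Build one request URL per results page (pages 1 to 5)
urllist = []
for i in range(1, 6):  #115
    urllist.append(url + str(i))

info = []  # parsed JSON responses, one per page

def fetch_page(url1):
    # Request one page and store its parsed JSON in info
    response = requests.get(url1)
    info.append(response.json())

In [57]:
# Fetch every page; the results accumulate in info
for url1 in urllist:
    fetch_page(url1)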
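As a minimal sketch of the same URL construction, the query string can also be assembled from a dictionary of parameters, which keeps each parameter readable and handles encoding. The helper name `build_page_urls` and the placeholder key `YOUR_API_KEY` are assumptions made for this illustration, and the start date is written zero-padded here.

```python
from urllib.parse import urlencode

BASE_URL = "https://content.guardianapis.com/search"

def build_page_urls(api_key, first_page=1, last_page=5):
    """Return one search URL per results page, first_page..last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        params = {
            "section": "world",
            "from-date": "2023-12-01",
            "to-date": "2023-12-31",
            "show-blocks": "all",
            "api-key": api_key,
            "page": page,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls

# "YOUR_API_KEY" is a placeholder, not a real key
page_urls = build_page_urls("YOUR_API_KEY")
print(len(page_urls))
print(page_urls[0])
```

Keeping the key as a function argument also makes it easier to load it from an environment variable instead of writing it into the notebook.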

I only need specific fields from the API response: the title, publication date, URL, and body text. The Guardian also publishes a special type of article called a liveblog, like this [article](https://www.theguardian.com/world/live/2023/dec/31/russia-ukraine-war-live-kharkiv-under-wave-of-drone-attacks-on-new-years-eve), which consists of a list of ongoing updates. To keep content extraction simple, I take only the first block of each article's body, which for a liveblog is the latest report.

In [65]:
extracted_data = [
    {
        'webTitle': item['webTitle'],
        #'sectionName': item['sectionName'],
        'webPublicationDate': item['webPublicationDate'],
        'webUrl': item['webUrl'],
        #'elements':[result['bodyTextSummary'] for result in item['blocks']['body']],
        'bodyTextSummary': item['blocks']['body'][0]['bodyTextSummary'], # extract the first report of the article  
    }
    for response in info if 'results' in response['response']
    for item in response['response']['results']
    #for result in item['blocks']['body']
    
]
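The extraction above can be sketched as a small function that also tolerates pages without results and articles with an empty body. The function name `extract_articles` and the miniature `sample_pages` data are hypothetical, shaped only like the fields used in this notebook.

```python
def extract_articles(pages):
    """Flatten API pages into one row per article, keeping the first body block."""
    rows = []
    for page in pages:
        results = page.get("response", {}).get("results", [])
        for item in results:
            blocks = item.get("blocks", {}).get("body", [])
            # First block only; empty string when an article has no body blocks
            summary = blocks[0]["bodyTextSummary"] if blocks else ""
            rows.append({
                "webTitle": item["webTitle"],
                "webPublicationDate": item["webPublicationDate"],
                "webUrl": item["webUrl"],
                "bodyTextSummary": summary,
            })
    return rows

# Hypothetical miniature responses for illustration only
sample_pages = [
    {"response": {"results": [
        {"webTitle": "Liveblog",
         "webPublicationDate": "2023-12-31T10:00:00Z",
         "webUrl": "https://example.org/live",
         "blocks": {"body": [{"bodyTextSummary": "latest update"},
                             {"bodyTextSummary": "earlier update"}]}},
        {"webTitle": "No body",
         "webPublicationDate": "2023-12-30T09:00:00Z",
         "webUrl": "https://example.org/empty",
         "blocks": {"body": []}},
    ]}},
    {"response": {"status": "error"}},  # a page without results is skipped
]

rows = extract_articles(sample_pages)
print([row["bodyTextSummary"] for row in rows])
```

The guard on an empty body avoids an `IndexError` if an article ever arrives without blocks.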

In [66]:
len(extracted_data)

50

In [69]:
extracted_data[20]

{'webTitle': 'UK has failed to act to free Alaa Abd el-Fattah from jail in Egypt, family says',
 'webPublicationDate': '2023-12-31T05:00:52Z',
 'webUrl': 'https://www.theguardian.com/world/2023/dec/31/uk-failed-act-free-alaa-abd-el-fattah-jail-egypt-family-says',
 'bodyTextSummary': 'The family of the imprisoned British-Egyptian writer and activist Alaa Abd el-Fattah have said the British government has failed to act to free him, a year after the prime minister, Rishi Sunak, told his sister the government was “totally committed to resolving your brother’s case”. A figurehead in Egypt’s 2011 uprising, which overthrew Hosni Mubarak as president, Abd el-Fattah spent most of the past decade behind bars for his activism. He was rearrested in 2019 following a brief period out of prison but under police surveillance, and was sentenced in December 2021 to a further five years in detention for spreading “false news undermining national security”, after resharing a social media post about tortur

## Save corpus in text files

I use each article's web title as its filename, so characters that are invalid in filenames must first be replaced. To avoid overwriting files whose titles are identical, I append a numeric suffix to repeated names and track the counts with Counter().

In [71]:
#save corpus to txt 
import os
from collections import Counter

def cleanFilename(filename):
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename

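output_directory = 'guardian_corpus_spacy_improved/'
os.makedirs(output_directory, exist_ok=True)  # create the folder if it is missing
file_counts = Counter()
file_names = []  # paths already written, used to detect duplicate names
iteration_count = 0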


for line in extracted_data:
    base_filename = cleanFilename(line['webTitle'])
    extension = '.txt'
    filename = base_filename + extension

    # Check if the filename already exists
    count = file_counts[filename]
    while os.path.join(output_directory, filename) in file_names:
        # If yes, increment the count and add the suffix
        count += 1
        filename = f"{base_filename}-{count}{extension}"

    file_counts[filename] = count
    file_names.append(os.path.join(output_directory, filename))

    with open(os.path.join(output_directory, filename), 'w', encoding='utf-8') as file:
        # Write the data dictionary to the file
        file.write(str(line))

    iteration_count += 1 

print(f"Total iterations: {iteration_count}")

Total iterations: 50
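The naming scheme can be checked in isolation, without writing any files. This is a minimal sketch under the assumption that duplicates are detected on the cleaned title; the helper `unique_filenames` and the sample titles are invented for the illustration.

```python
from collections import Counter

def clean_filename(title):
    """Replace characters that are invalid in filenames with underscores."""
    for char in '<>:"/\\|?*':
        title = title.replace(char, '_')
    return title

def unique_filenames(titles, extension=".txt"):
    """Return one filename per title, adding -1, -2, ... to repeated titles."""
    seen = Counter()
    names = []
    for title in titles:
        base = clean_filename(title)
        seen[base] += 1
        # The first occurrence keeps the plain name; later ones get a suffix
        suffix = "" if seen[base] == 1 else f"-{seen[base] - 1}"
        names.append(f"{base}{suffix}{extension}")
    return names

# Hypothetical titles, two of them identical
names = unique_filenames(["War: day 1", "Q&A?", "War: day 1"])
print(names)
```

One limitation of this sketch: a title that already ends in `-1` could still collide with a suffixed duplicate of another title, so a final existence check on disk remains useful.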


## Export metadata to CSV

The metadata is already a list of dictionaries with identical keys, so pandas can convert it into a DataFrame directly.

In [72]:
# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(extracted_data)

# Save the DataFrame to a CSV file
df.to_csv('Guardian_Spacy_3_improved.csv', index=False)

In [73]:
df.head()

|   | webTitle | webPublicationDate | webUrl | bodyTextSummary |
|---|---|---|---|---|
| 0 | Israeli airstrikes kill scores in Gaza as war ... | 2023-12-31T20:44:31Z | https://www.theguardian.com/world/2023/dec/31/... | At least 100 people have been killed in Gaza i... |
| 1 | Queen Margrethe II of Denmark announces surpri... | 2023-12-31T18:27:16Z | https://www.theguardian.com/world/2023/dec/31/... | The queen of Denmark has announced that she is... |
| 2 | US Navy downs missiles in Red Sea after ship a... | 2023-12-31T17:30:00Z | https://www.theguardian.com/world/2023/dec/31/... | The US Navy has shot down two anti-ship missil... |
| 3 | Family of UK mother and son killed in Alps ava... | 2023-12-31T17:27:58Z | https://www.theguardian.com/world/2023/dec/31/... | The family of a mother and son who died in an ... |
| 4 | Venice to limit tourist group size to 25 to pr... | 2023-12-31T16:50:08Z | https://www.theguardian.com/world/2023/dec/31/... | Venice is to limit the size of tourist groups ... |


## Get Tokens with spaCy

In [74]:
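# Load the English model once rather than on every call
nlp = spacy.load("en_core_web_sm")

def get_token(text):
    # Strip the listed punctuation characters, then let spaCy tokenize the text
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    text = ''.join(character for character in text
                   if character not in punctuation)
    return nlp(text)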

df['Token'] = df['bodyTextSummary'].apply(get_token)

print(df['Token'].head(2))

0    (At, least, 100, people, have, been, killed, i...
1    (The, queen, of, Denmark, has, announced, that...
Name: Token, dtype: object
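The punctuation-stripping step can be looked at on its own to see what reaches spaCy. The sketch below uses the same punctuation string with `str.translate`; the function name `strip_punctuation` and the sample sentence are assumptions made for the illustration.

```python
PUNCTUATION = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
STRIP_TABLE = str.maketrans('', '', PUNCTUATION)

def strip_punctuation(text):
    """Delete every character listed in PUNCTUATION from text."""
    return text.translate(STRIP_TABLE)

# Hypothetical sentence with a hyphen, an apostrophe and curly quotes
sample = "The British-Egyptian writer said: “totally committed”, didn't he?"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Two effects are worth noticing: hyphenated words and contractions are fused into single strings, and typographic quotes such as “ and ” survive because they are not in the list.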


In [75]:
df.head(2)

|   | webTitle | webPublicationDate | webUrl | bodyTextSummary | Token |
|---|---|---|---|---|---|
| 0 | Israeli airstrikes kill scores in Gaza as war ... | 2023-12-31T20:44:31Z | https://www.theguardian.com/world/2023/dec/31/... | At least 100 people have been killed in Gaza i... | (At, least, 100, people, have, been, killed, i... |
| 1 | Queen Margrethe II of Denmark announces surpri... | 2023-12-31T18:27:16Z | https://www.theguardian.com/world/2023/dec/31/... | The queen of Denmark has announced that she is... | (The, queen, of, Denmark, has, announced, that... |


## Get Lemmas with spaCy

In [76]:
def lemma(text):
    # text is the spaCy Doc stored in the Token column
    return [token.lemma_ for token in text]

df['Lemma'] = df['Token'].apply(lemma)
print(df['Lemma'].head(2))

0    [at, least, 100, people, have, be, kill, in, G...
1    [the, queen, of, Denmark, have, announce, that...
Name: Lemma, dtype: object


## Get POS with spaCy

In [77]:
def pos(text):
    # text is the spaCy Doc stored in the Token column
    return [token.pos_ for token in text]

df['Pos'] = df['Token'].apply(pos)
print(df['Pos'].head(2))

0    [ADP, ADJ, NUM, NOUN, AUX, AUX, VERB, ADP, PRO...
1    [DET, NOUN, ADP, PROPN, AUX, VERB, SCONJ, PRON...
Name: Pos, dtype: object
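Because the Token, Lemma and Pos columns are produced from the same Doc, their entries line up position by position. As a rough sketch, the first seven values printed above for row 0 can be zipped into per-token records; the helper name `align_annotations` is an assumption for this illustration.

```python
# First seven entries of row 0, copied from the outputs printed above
tokens = ["At", "least", "100", "people", "have", "been", "killed"]
lemmas = ["at", "least", "100", "people", "have", "be", "kill"]
pos_tags = ["ADP", "ADJ", "NUM", "NOUN", "AUX", "AUX", "VERB"]

def align_annotations(tokens, lemmas, pos_tags):
    """Combine parallel annotation lists into one record per token."""
    if not (len(tokens) == len(lemmas) == len(pos_tags)):
        raise ValueError("annotation lists must have the same length")
    return [
        {"token": token, "lemma": lemma, "pos": tag}
        for token, lemma, tag in zip(tokens, lemmas, pos_tags)
    ]

records = align_annotations(tokens, lemmas, pos_tags)

# Tokens whose lemma differs from the lowercased surface form
changed = [r["token"] for r in records if r["token"].lower() != r["lemma"]]
print(changed)
```

A record-per-token layout like this makes it straightforward to filter by part of speech or to compare surface forms with lemmas.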


In [78]:
print(df.head(2))

                                            webTitle    webPublicationDate  \
0  Israeli airstrikes kill scores in Gaza as war ...  2023-12-31T20:44:31Z   
1  Queen Margrethe II of Denmark announces surpri...  2023-12-31T18:27:16Z   

                                              webUrl  \
0  https://www.theguardian.com/world/2023/dec/31/...   
1  https://www.theguardian.com/world/2023/dec/31/...   

                                     bodyTextSummary  \
0  At least 100 people have been killed in Gaza i...   
1  The queen of Denmark has announced that she is...   

                                               Token  \
0  (At, least, 100, people, have, been, killed, i...   
1  (The, queen, of, Denmark, has, announced, that...   

                                               Lemma  \
0  [at, least, 100, people, have, be, kill, in, G...   
1  [the, queen, of, Denmark, have, announce, that...   

                                                 Pos  
0  [ADP, ADJ, NUM, NOUN, AUX, AUX,

In [79]:
df.to_csv('Guardian_Spacy_03_annotated_improved.csv', index=False)
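One thing to keep in mind when reloading this file: a CSV cell can only hold text, so a list column such as Lemma comes back as a string. The sketch below reproduces that with the standard `csv` module, assuming the list is stored as its Python string form, and restores it with `ast.literal_eval`; the in-memory buffer stands in for the real file.

```python
import ast
import csv
import io

lemmas = ["at", "least", "100", "people", "have", "be", "kill"]

# Write a one-row CSV in memory; the list is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Lemma"])
writer.writerow([str(lemmas)])

# Read it back: the cell is now plain text, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["Lemma"]).__name__)

# Parse the text back into a real list
restored = ast.literal_eval(row["Lemma"])
print(restored == lemmas)
```

The same parsing would not recover the Token column, since a spaCy Doc is written out as text and would need to be re-created by running the pipeline again.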