# Group Project

## Processing Data

**This step preprocesses the data scraped from the website and creates a CSV file of metadata and data. The file contains information about articles in the Guardian and the articles themselves, in a form that is convenient for humans to read and for machines to analyze.**

By: Zhuoli Lu




## Installing and Importing

In [1]:
# Import os to list and load the documents and metadata
import os
# Import yaml (PyYAML), used below to parse the JSON text
import yaml
# Import json
import json
# Import pandas for DataFrames
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'



## Create a DataFrame from parsing JSON strings in all files

The data we scraped from the website through the Guardian API are structured and stored in txt files. The data are in JSON (JavaScript Object Notation), a lightweight data-interchange format that is easy for humans to read and write, so we can clearly recognize the metadata and data in these txt files. JSON is also easy for machines to parse and generate. In this step we use Python and the yaml package to parse these JSON strings and then create a DataFrame. This works because YAML is designed as a near-superset of JSON, so yaml.safe_load can read most JSON documents. We get keys such as Web Title, Section Name, Web Publication Date, Web URL, Tags, Tag Count and Body Text Summary, which are needed for the later analysis. Every key maps to a specific value, and we store these keys and values in the corresponding columns of the DataFrame.

In [52]:

# Create empty list for storing data
all_data = []

# Specify the directory where the files are located
folder_path = 'Guardian_corpus_txt_improved'

# Iterate through each file in the folder
for file_name in os.listdir(folder_path):
    # Construct the full file path
    file_path = os.path.join(folder_path, file_name)

    # Make sure to process txt files only
    if os.path.isfile(file_path) and file_path.endswith('.txt'):
        # Read txt files
        with open(file_path, 'r', encoding='utf-8') as file:
            file_content = file.read()

        # Use yaml to parse JSON data
        try:
            data = yaml.safe_load(file_content)
            all_data.append(data)
        except yaml.YAMLError as e:
            print(f"Error parsing JSON in {file_name}: {e}")

# Transform the data into Pandas DataFrame
df = pd.DataFrame(all_data)

# Look at the head of the DataFrame
print(f"Total files processed: {len(all_data)}")
df.head()


Total files processed: 1200


Unnamed: 0,type,webTitle,sectionName,webPublicationDate,webUrl,tags,tagCount,bodyTextSummary
0,article,Russia launches multiple rocket attacks in Kha...,World news,2022-02-28T20:03:36Z,https://www.theguardian.com/world/2022/feb/28/...,"[{'tagTitle': 'Ukraine', 'tagURL': 'https://ww...",19,Russian forces have launched rocket attacks th...
1,article,Hong Kong journalist allowed to travel to UK a...,World news,2022-09-22T14:41:33Z,https://www.theguardian.com/world/2022/sep/22/...,"[{'tagTitle': 'Hong Kong', 'tagURL': 'https://...",11,The head of Hong Kong’s journalists’ associati...
2,article,"Mariupol cemetery images show 1,400 graves dug...",World news,2022-07-15T07:58:35Z,https://www.theguardian.com/world/2022/jul/15/...,"[{'tagTitle': 'Ukraine', 'tagURL': 'https://ww...",9,New satellite images show an expanding grave s...
3,article,‘We need the truth’: the campaign to ‘de-Russi...,World news,2022-06-04T04:00:26Z,https://www.theguardian.com/world/2022/jun/04/...,"[{'tagTitle': 'Ukraine', 'tagURL': 'https://ww...",11,Standing in front of a statue of a gruff-looki...
4,article,"Four dead, one missing after New Zealand fishi...",World news,2022-03-21T03:31:51Z,https://www.theguardian.com/world/2022/mar/21/...,"[{'tagTitle': 'New Zealand', 'tagURL': 'https:...",7,A fishing trip in New Zealand has ended in tra...
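
As a minimal sketch of the parsing step, the example below parses one small JSON string with the standard-library json module, which is an alternative to yaml.safe_load when the files are strict JSON. The sample record is hypothetical: it only imitates the keys listed above and is not taken from the corpus.

```python
import json

# Hypothetical record imitating the structure of one scraped file
sample_text = """
{
  "type": "article",
  "webTitle": "Example headline",
  "sectionName": "World news",
  "webPublicationDate": "2022-02-28T20:03:36Z",
  "tags": [{"tagTitle": "Ukraine"}, {"tagTitle": "Russia"}],
  "tagCount": 2
}
"""

# json.loads turns the JSON text into a Python dict in which keys map to values
record = json.loads(sample_text)

print(record["webTitle"])
print(len(record["tags"]))
```

A list of such dicts can be passed to pd.DataFrame, as the cell above does with all_data.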


## Extracting and parsing data in Dataframe

In this step, we extract the data needed for analysis in the project and parse the values in the tags column. Each of these values is a list of dictionaries, which cannot be used directly for analysis, so we convert it into a comma-separated string of tag titles.

In [53]:
# Used to store extracted data
web_titles = []
pub_time = []
web_tags = []
tag_count = []
web_url = []
article_text = []


# Iterate through the original DataFrame
for index, row in df.iterrows():
    # Extract the fields needed from the current row
    title = row['webTitle']  # The article title is in the 'webTitle' column
    time = row['webPublicationDate']
    count = row['tagCount']
    url = row['webUrl']
    article = row['bodyTextSummary']


    temp_tags = []



    # Check whether this article was already processed in the previous row
    if len(web_titles) > 0 and title == web_titles[-1] and time == pub_time[-1]:
        # If yes, append the new tags to the tags of the last stored row
        if isinstance(row['tags'], list):
            for tag in row['tags']:
                if 'tagTitle' in tag:
                    web_tags[-1] += ", " + tag['tagTitle']  # Append the tag to the last item
    else:
        # Otherwise, process it as a new article in a new row
        temp_tags = [tag['tagTitle'] for tag in row['tags'] if 'tagTitle' in tag] if isinstance(row['tags'], list) else []
        web_titles.append(title)
        pub_time.append(time)
        tag_count.append(count)
        web_url.append(url)
        article_text.append(article)
        web_tags.append(', '.join(temp_tags))  # Convert the list of tags to a comma-separated string


# Create a new DataFrame
d3 = {'WebTitle': web_titles, 'WebUrl': web_url, 'PubTime': pub_time, 'Tags': web_tags, 'TagCounts': tag_count, 'Text': article_text}
metadata_df = pd.DataFrame(d3)

# Display the first few rows of the new DataFrame
metadata_df.head()

Unnamed: 0,WebTitle,WebUrl,PubTime,Tags,TagCounts,Text
0,Russia launches multiple rocket attacks in Kha...,https://www.theguardian.com/world/2022/feb/28/...,2022-02-28T20:03:36Z,"Ukraine, Russia, World news, Nato, European Un...",19,Russian forces have launched rocket attacks th...
1,Hong Kong journalist allowed to travel to UK a...,https://www.theguardian.com/world/2022/sep/22/...,2022-09-22T14:41:33Z,"Hong Kong, Asia Pacific, World news, China, Pr...",11,The head of Hong Kong’s journalists’ associati...
2,"Mariupol cemetery images show 1,400 graves dug...",https://www.theguardian.com/world/2022/jul/15/...,2022-07-15T07:58:35Z,"Ukraine, Russia, Europe, World news, Article, ...",9,New satellite images show an expanding grave s...
3,‘We need the truth’: the campaign to ‘de-Russi...,https://www.theguardian.com/world/2022/jun/04/...,2022-06-04T04:00:26Z,"Ukraine, Russia, Europe, World news, Article, ...",11,Standing in front of a statue of a gruff-looki...
4,"Four dead, one missing after New Zealand fishi...",https://www.theguardian.com/world/2022/mar/21/...,2022-03-21T03:31:51Z,"New Zealand, Asia Pacific, World news, Article...",7,A fishing trip in New Zealand has ended in tra...
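
The tag conversion for a single article can be sketched in isolation as follows. The helper name join_tag_titles and the sample tags are hypothetical and used only for illustration. The sketch covers only the conversion of one tag list into a string, not the merging of consecutive rows that belong to the same article.

```python
def join_tag_titles(tags):
    """Return the tag titles of one article as a comma-separated string."""
    # Values that are not a list (for example missing values) yield ""
    if not isinstance(tags, list):
        return ""
    # Skip tag entries that have no 'tagTitle' key
    return ", ".join(tag["tagTitle"] for tag in tags if "tagTitle" in tag)


# Hypothetical tag list imitating one value of the tags column
sample_tags = [
    {"tagTitle": "Ukraine", "tagURL": "https://example.org/ukraine"},
    {"tagURL": "https://example.org/untitled"},
    {"tagTitle": "Russia"},
]

print(join_tag_titles(sample_tags))
print(repr(join_tag_titles(None)))
```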


## Transforming data format (data cleaning)

This step converts the publication time to a datetime format. The original values are strings in ISO 8601 format, an international standard for date and time representations. We use the pandas to_datetime function to convert these ISO 8601 formatted strings into datetime objects. These objects make it convenient to extract specific date or time components or to calculate the difference between two dates, which could be useful for analysis.

In [54]:
metadata_df['PubTime'] = pd.to_datetime(metadata_df['PubTime'])

metadata_df.head()

Unnamed: 0,WebTitle,WebUrl,PubTime,Tags,TagCounts,Text
0,Russia launches multiple rocket attacks in Kha...,https://www.theguardian.com/world/2022/feb/28/...,2022-02-28 20:03:36+00:00,"Ukraine, Russia, World news, Nato, European Un...",19,Russian forces have launched rocket attacks th...
1,Hong Kong journalist allowed to travel to UK a...,https://www.theguardian.com/world/2022/sep/22/...,2022-09-22 14:41:33+00:00,"Hong Kong, Asia Pacific, World news, China, Pr...",11,The head of Hong Kong’s journalists’ associati...
2,"Mariupol cemetery images show 1,400 graves dug...",https://www.theguardian.com/world/2022/jul/15/...,2022-07-15 07:58:35+00:00,"Ukraine, Russia, Europe, World news, Article, ...",9,New satellite images show an expanding grave s...
3,‘We need the truth’: the campaign to ‘de-Russi...,https://www.theguardian.com/world/2022/jun/04/...,2022-06-04 04:00:26+00:00,"Ukraine, Russia, Europe, World news, Article, ...",11,Standing in front of a statue of a gruff-looki...
4,"Four dead, one missing after New Zealand fishi...",https://www.theguardian.com/world/2022/mar/21/...,2022-03-21 03:31:51+00:00,"New Zealand, Asia Pacific, World news, Article...",7,A fishing trip in New Zealand has ended in tra...
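
To illustrate what datetime objects make possible, the sketch below parses two of the publication times shown above with the standard-library datetime module, reads individual components, and computes the gap between the two dates. The helper parse_pub_time is hypothetical and assumes every timestamp ends in "Z" (UTC). In the notebook itself the conversion is done by pd.to_datetime.

```python
from datetime import datetime, timezone


def parse_pub_time(value):
    """Parse a timestamp such as '2022-02-28T20:03:36Z' as a UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' means UTC, so attach that time zone explicitly
    return parsed.replace(tzinfo=timezone.utc)


first = parse_pub_time("2022-02-28T20:03:36Z")
second = parse_pub_time("2022-09-22T14:41:33Z")

# Individual components are available as attributes
print(first.year, first.month, first.day)

# Subtracting two datetimes gives a timedelta
gap = second - first
print(gap.days)
```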


## Removing extra spaces (data cleaning)

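Some of the texts may contain extra whitespace, such as tabs (\t) or newlines (\n), including at the beginning. Each run of whitespace characters is replaced by a single space using the str.replace() method with a regular expression, and str.strip() then removes any whitespace left at the beginning and end of the text.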

In [55]:
# Remove extra whitespace from the article texts
# (a raw string keeps '\s' from being treated as an invalid escape sequence)
metadata_df['Text'] = metadata_df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
metadata_df.head()


Unnamed: 0,WebTitle,WebUrl,PubTime,Tags,TagCounts,Text
0,Russia launches multiple rocket attacks in Kha...,https://www.theguardian.com/world/2022/feb/28/...,2022-02-28 20:03:36+00:00,"Ukraine, Russia, World news, Nato, European Un...",19,Russian forces have launched rocket attacks th...
1,Hong Kong journalist allowed to travel to UK a...,https://www.theguardian.com/world/2022/sep/22/...,2022-09-22 14:41:33+00:00,"Hong Kong, Asia Pacific, World news, China, Pr...",11,The head of Hong Kong’s journalists’ associati...
2,"Mariupol cemetery images show 1,400 graves dug...",https://www.theguardian.com/world/2022/jul/15/...,2022-07-15 07:58:35+00:00,"Ukraine, Russia, Europe, World news, Article, ...",9,New satellite images show an expanding grave s...
3,‘We need the truth’: the campaign to ‘de-Russi...,https://www.theguardian.com/world/2022/jun/04/...,2022-06-04 04:00:26+00:00,"Ukraine, Russia, Europe, World news, Article, ...",11,Standing in front of a statue of a gruff-looki...
4,"Four dead, one missing after New Zealand fishi...",https://www.theguardian.com/world/2022/mar/21/...,2022-03-21 03:31:51+00:00,"New Zealand, Asia Pacific, World news, Article...",7,A fishing trip in New Zealand has ended in tra...
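
The effect of the pattern can be checked on a single string. This is a minimal sketch using the standard-library re module with the same regular expression. The function name normalize_whitespace and the sample string are made up for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse each run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample with a leading newline and tab, repeated spaces and a trailing newline
sample = "\n\tRussian forces   have launched\nrocket attacks \n"

print(repr(normalize_whitespace(sample)))
```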


## Saving the metadata CSV file

In [23]:

metadata_df.to_csv('metadata_improved.csv')