# Objective

After reviewing the data scraped from The Sun website, some inconsistencies were discovered, and the instructions were to scrape from one particular category only. This file contains a script that removes or corrects those inconsistencies and excludes data from every category except the selected one, which in this case is "Politics & Power".

The script then goes further: it lowercases the data in the "text" column, removes punctuation, special characters, extra whitespace and English stop words, and assigns the result to a new column called "cleaned_text".

In [1]:
# importing libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

In [2]:
df = pd.read_csv(r"C:\Users\USER\UIT-webscraping\articles_paragraphs.csv") # loading the file 

In [3]:
df = df.drop('Unnamed: 0', axis=1) # removing the index column

In [4]:
df = df.rename(columns={'paragraph': 'text'}) # renaming the column appropriately

In [5]:
df = df[~df['text'].str.strip().str.match(r'(?i)^(by|from|a)[a-z]')].reset_index(drop=True) # deleting rows that begin with "by", "from", or "a" immediately followed by a letter (case-insensitive)

df['text'] = df['text'].str.replace(r'(?i)^(the)([a-z])', r'\1 \2', regex=True) # inserting a space after "the" where it is glued to the next letter at the beginning of a row (note: this also matches words such as "They")
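
To see what the two patterns in the cell above match, the same regular expressions can be tried on a few strings with the standard `re` module. This is only a sketch: the sample strings and the helper names `is_dropped` and `split_the` are made up for illustration and do not come from the scraped data.

```python
import re

# Same patterns as in the cell above
BYLINE = re.compile(r'(?i)^(by|from|a)[a-z]')
GLUED_THE = re.compile(r'(?i)^(the)([a-z])')

def is_dropped(text):
    """Return True if the row filter above would delete this text."""
    return BYLINE.match(text.strip()) is not None

def split_the(text):
    """Insert a space after a leading 'the' that is glued to the next letter."""
    return GLUED_THE.sub(r'\1 \2', text)

# Hypothetical sample strings, not taken from the dataset
samples = [
    "ByJohn Smith",                      # glued byline
    "By John Smith",                     # byline with a space
    "After the vote, ministers met.",    # ordinary sentence starting with "A"
    "TheGovernment announced a plan.",   # glued "The"
    "They said no.",                     # ordinary word starting with "The"
]

for s in samples:
    print(repr(s), "| dropped:", is_dropped(s), "| after fix:", repr(split_the(s)))
```

Because both patterns are case-insensitive and only require one following letter, they also catch ordinary words such as "After" or "They", so it may be worth checking a sample of affected rows before relying on them.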

In [7]:
selected_category = df[df['category'] == 'Politics & Power'].reset_index(drop=True) # selecting paragraphs from a particular category only

In [8]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
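
The output above shows that the download failed because of a network error, so the following cells only work if the stop word corpus is already available locally. One possible way to make this step more robust is sketched below. The function name `load_stop_words` and the short `FALLBACK_STOP_WORDS` list are assumptions made for illustration; the fallback is far smaller than NLTK's English list.

```python
# Hypothetical minimal fallback list; NLTK's English list is much longer
FALLBACK_STOP_WORDS = ("the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "it")

def load_stop_words(fallback=FALLBACK_STOP_WORDS):
    """Return NLTK's English stop words, or a small fallback set if unavailable."""
    try:
        from nltk.corpus import stopwords
        # Raises LookupError when the corpus has not been downloaded
        return set(stopwords.words('english'))
    except (ImportError, LookupError):
        # NLTK is not installed or the corpus is missing locally
        return set(fallback)

stop_words = load_stop_words()
print(len(stop_words), "stop words loaded")
```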


In [9]:
# Lowercase, remove punctuation/special characters and extra whitespace from the text
selected_category['cleaned_text'] = (selected_category['text'].str.lower()
                                     .str.replace(r'[^\w\s]', '', regex=True)
                                     .str.replace(r'\s+', ' ', regex=True).str.strip())

# Remove stop words
filtered_texts = []
for text in selected_category['cleaned_text']:
    words = text.split()               # Split into words
    filtered_words = []                # Prepare empty list for filtered words
    for word in words:
        if word not in stop_words:    # Check if word is NOT a stop word
            filtered_words.append(word)  # If not stop word, add to list
    cleaned_text = ' '.join(filtered_words)  # Join filtered words back into string
    filtered_texts.append(cleaned_text)       # Append cleaned text to list

# Assign the cleaned column back
selected_category['cleaned_text'] = filtered_texts
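
The cleaning steps in the cell above can be followed on a single string using only the standard library. This is a sketch rather than the notebook's actual code path: `clean_text`, the example sentence, and the tiny `SAMPLE_STOP_WORDS` set are assumptions for illustration (the notebook uses NLTK's English stop word list).

```python
import re

# Tiny illustrative stop word set (assumption; not NLTK's full list)
SAMPLE_STOP_WORDS = {"the", "a", "on", "to", "is", "and"}

def clean_text(text, stop_words):
    """Apply the same steps as the cell above to one string."""
    text = text.lower()                         # lowercase
    text = re.sub(r'[^\w\s]', '', text)         # drop punctuation/special characters
    text = re.sub(r'\s+', ' ', text).strip()    # collapse extra whitespace
    # keep only the words that are not stop words
    return ' '.join(word for word in text.split() if word not in stop_words)

example = "  The Prime Minister's plan   is to cut taxes, and fast!  "
print(clean_text(example, SAMPLE_STOP_WORDS))
# prints: prime ministers plan cut taxes fast
```

Note that punctuation is removed before the stop word check, so apostrophes disappear first ("minister's" becomes "ministers", "don't" becomes "dont"), and stop list entries that contain an apostrophe may no longer match.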

In [10]:
selected_category.to_csv(r"C:\Users\USER\UIT-webscraping\articles_paragraphs_politics.csv", index=False) # saving the cleaned dataset as a new file