
# Dataset Creation

This notebook starts from a dataset in the UCI Machine Learning Repository that contains health news tweets, with URLs, from various health agencies. Each article's content is extracted using the URL in its tweet. The content is then preprocessed and stored for topic modeling.

## Installs and imports

This section contains the package installs and imports needed to run the notebook.

In [2]:
!pip install fake-useragent

Collecting fake-useragent
  Downloading fake_useragent-1.5.1-py3-none-any.whl (17 kB)
Installing collected packages: fake-useragent
Successfully installed fake-useragent-1.5.1


In [3]:
# Importing necessary libraries
import re
import os
import requests
import pandas as pd
import numpy as np
from urllib.parse import urlparse
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Import the tqdm library for progress bar
from tqdm import tqdm

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Loading Data and Article Content Extraction

### Load and Combine Data Files

Here we load and combine the data files for five news agencies: **NPR Health**, **CNN Health**, **CBC Health**, **Everyday Health** and **LA Times Health**.

In [4]:
# Define the directory path containing the .txt files
directory = '/content'

# Initialize an empty list to store DataFrames
dfs = []

# Iterate over all files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        # Read the .txt file into a DataFrame
        df_name = os.path.splitext(filename)[0]  # Extract filename without extension
        try:
            df = pd.read_csv(os.path.join(directory, filename),
                             sep=r"(?<!\s)[|](?!\s)",
                             names=['ID', 'DateTime', 'Tweet'],
                             engine="python")
            # Add the DataFrame to the list
            dfs.append(df)
            print(f"DataFrame '{df_name}' loaded with {len(df)} rows.")
        except UnicodeDecodeError:
            print(f"Error reading {filename}: UnicodeDecodeError. Skipping this file.")

# Concatenate all DataFrames in the list into one DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# Display the combined DataFrame
print("Combined DataFrame:")
print(combined_df)

DataFrame 'cnnhealth' loaded with 4061 rows.
DataFrame 'cbchealth' loaded with 3741 rows.
DataFrame 'nprhealth' loaded with 4837 rows.
DataFrame 'latimeshealth' loaded with 4171 rows.
DataFrame 'everydayhealth' loaded with 3239 rows.
Combined DataFrame:
                       ID                        DateTime  \
0      576880531301801984  Sat Mar 14 23:00:11 +0000 2015   
1      576820122666471424  Sat Mar 14 19:00:08 +0000 2015   
2      576744652717461504  Sat Mar 14 14:00:15 +0000 2015   
3      576736754436304896  Sat Mar 14 13:28:52 +0000 2015   
4      576736614766010368  Sat Mar 14 13:28:18 +0000 2015   
...                   ...                             ...   
20044  289382400222441472  Thu Jan 10 14:45:15 +0000 2013   
20045  289374853075718145  Thu Jan 10 14:15:15 +0000 2013   
20046  289371042584076289  Thu Jan 10 14:00:07 +0000 2013   
20047  289367315647193089  Thu Jan 10 13:45:18 +0000 2013   
20048  289367039930408961  Thu Jan 10 13:44:13 +0000 2013   
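
The `sep` pattern passed to `pd.read_csv` above matches only pipes that have no whitespace on either side, so a pipe surrounded by spaces inside the tweet text should not start a new column. A minimal sketch of that behavior using `re.split`, which approximates what the Python parsing engine does; the sample line is made up for illustration and is not taken from the dataset:

```python
import re

# Same separator pattern as in the loading cell: a pipe that is neither
# preceded nor followed by whitespace
SEP = r"(?<!\s)[|](?!\s)"

# Hypothetical line in the ID|DateTime|Tweet layout (not real data)
sample_line = "123|Sat Mar 14 23:00:11 +0000 2015|Flu season | what to know http://example.com/flu"

fields = re.split(SEP, sample_line)
print(len(fields))  # 3
print(fields[2])    # Flu season | what to know http://example.com/flu
```

A pipe inside a tweet that has no spaces around it would still be treated as a separator.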

             

### Extract URL for each article

We extract the URL from the Tweet column of the DataFrame, which often contains a link to the article the tweet refers to. For simplicity, we use only the tweets that contain exactly one link and discard the rest.

In [None]:
# Extract all URLs from each tweet in combined_df
combined_df['URL'] = [re.findall(r'https?://\S+', text) for text in combined_df['Tweet']]

# Drop rows that do not contain exactly one URL
not_single_url_indices = combined_df.index[combined_df['URL'].apply(len) != 1]
combined_df.drop(not_single_url_indices, inplace=True)

# Replace each single-element list with the URL itself
combined_df['URL'] = [url[0] for url in combined_df['URL']]
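
To illustrate the single-URL rule, here is a small sketch on plain strings rather than on the DataFrame. The tweets and the helper name `single_url` are hypothetical.

```python
import re

URL_PATTERN = r'https?://\S+'  # same pattern as in the cell above

def single_url(text):
    """Return the URL if the text contains exactly one, otherwise None."""
    urls = re.findall(URL_PATTERN, text)
    return urls[0] if len(urls) == 1 else None

# Made-up tweets: one link, two links, no link
tweets = [
    "Sleep and heart health http://example.com/a",
    "Two links http://example.com/b http://example.com/c",
    "No link at all",
]
kept = [single_url(t) for t in tweets]
print(kept)  # ['http://example.com/a', None, None]
```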


### Article Content Extraction Functions

These functions extract the website name from a URL, fetch an article's HTML content from a URL, extract the article's body from that HTML using BeautifulSoup, and apply the above to a list of URLs so the content can be stored in a new column.

In [None]:
# Function to fetch content from a URL
def fetch_content(url):
    # Create a UserAgent instance
    ua = UserAgent()
    try:
        # Define user-agent header using a random user agent
        headers = {'User-Agent': ua.random}

        # Create a session to handle cookies and maintain connection state
        with requests.Session() as session:
            response = session.get(url, headers=headers, timeout=4, allow_redirects=True)
            response.raise_for_status()  # Raise an error for 4xx and 5xx status codes
            final_url = response.url  # Get the final URL after following redirects
            website_name = get_website_name(final_url)
            return (response.text, website_name)
    except Exception as e:
        print("Error fetching content:", e)
        return (None, None)

# Function to extract website name from URL
def get_website_name(url):
    parsed_url = urlparse(url)
    return parsed_url.netloc

# Function to extract body content from HTML using BeautifulSoup
def extract_body(html_content_with_website_name):
    html_content, website_name = html_content_with_website_name
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        if soup:
            if website_name == 'www.cnn.com':
                article_body = soup.find_all('p', class_='paragraph')
            elif website_name == 'www.health.usnews.com':
                article_body = soup.find_all('p')
            elif website_name == 'www.latimes.com':
                article_body = soup.find_all('p')
            elif website_name == 'www.npr.org':
                storytext_div = soup.find('div', id='storytext')
                if storytext_div:
                    article_body = storytext_div.find_all('p')
                else:
                    article_body = None
            elif website_name == 'www.cbc.ca':
                storytext_div = soup.find('div', class_='story')
                if storytext_div:
                    article_body = storytext_div.find_all('p')
                else:
                    article_body = None
            else:
                article_body = None

            if article_body:
                return article_body
            else:
                return None
        else:
            return None
    else:
        return None

# Apply web scraping to extract content and store it in a new column along with progress bar
def extract_content_with_progress(url_list):
    content_list = []
    for url in tqdm(url_list, desc='Extracting content'):
        content = extract_body(fetch_content(url))
        content_list.append(content)
    return content_list
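
The `if`/`elif` chain in `extract_body` dispatches on the host name that `urlparse` returns. One possible alternative, shown here only as a sketch, keeps the per-site rules in a dictionary. The names `SITE_RULES` and `rule_for` are hypothetical, the URLs are made-up examples, and only three of the sites handled above are listed.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: host name -> (container tag, container attrs, paragraph attrs).
# The selectors mirror the ones used in extract_body above.
SITE_RULES = {
    'www.cnn.com': (None, {}, {'class': 'paragraph'}),
    'www.npr.org': ('div', {'id': 'storytext'}, {}),
    'www.cbc.ca': ('div', {'class': 'story'}, {}),
}

def rule_for(url):
    """Return the extraction rule for a URL's host, or None if the site is not listed."""
    return SITE_RULES.get(urlparse(url).netloc)

print(urlparse("https://www.npr.org/some/article").netloc)  # www.npr.org
print(rule_for("https://www.npr.org/some/article"))         # ('div', {'id': 'storytext'}, {})
print(rule_for("https://example.com/other"))                # None
```

With a table like this, supporting another site would mean adding one entry instead of another branch.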

### Extract Article Content Using URLs

The dataset is broken into 10 splits to accommodate the RAM required, and the article content extraction functions are applied to each split.

In [None]:
# Calculate the total number of rows
total_rows = len(combined_df)

# Calculate the size of each split
split_size = total_rows // 10

# Define the start and end indices for each split
split_1 = combined_df.iloc[:split_size].copy()
split_2 = combined_df.iloc[split_size:2*split_size].copy()
split_3 = combined_df.iloc[2*split_size:3*split_size].copy()
split_4 = combined_df.iloc[3*split_size:4*split_size].copy()
split_5 = combined_df.iloc[4*split_size:5*split_size].copy()
split_6 = combined_df.iloc[5*split_size:6*split_size].copy()
split_7 = combined_df.iloc[6*split_size:7*split_size].copy()
split_8 = combined_df.iloc[7*split_size:8*split_size].copy()
split_9 = combined_df.iloc[8*split_size:9*split_size].copy()
split_10 = combined_df.iloc[9*split_size:].copy()
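
The ten slices above follow one rule: every split has `total_rows // 10` rows except the last, which also takes the remainder. A sketch of the same index arithmetic written as a loop; `split_bounds` is a hypothetical helper and the row count of 103 is only an example:

```python
def split_bounds(total_rows, n_splits=10):
    """Return (start, end) index pairs; the last split absorbs the remainder."""
    split_size = total_rows // n_splits
    bounds = []
    for i in range(n_splits):
        start = i * split_size
        # The final split runs to the end, like combined_df.iloc[9*split_size:]
        end = total_rows if i == n_splits - 1 else (i + 1) * split_size
        bounds.append((start, end))
    return bounds

bounds = split_bounds(103)
print(bounds[0], bounds[-1])  # (0, 10) (90, 103)
```

Each pair could then be passed to `combined_df.iloc[start:end].copy()` to build the corresponding split.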

In [None]:
split_1['Content'] = extract_content_with_progress(split_1['URL'])
split_1.to_csv('split_1_preprocessed.csv', index=False)

In [None]:
split_2['Content'] = extract_content_with_progress(split_2['URL'])
split_2.to_csv('split_2_preprocessed.csv', index=False)

In [None]:
split_3['Content'] = extract_content_with_progress(split_3['URL'])
split_3.to_csv('split_3_preprocessed.csv', index=False)

In [None]:
split_4['Content'] = extract_content_with_progress(split_4['URL'])
split_4.to_csv('split_4_preprocessed.csv', index=False)

In [None]:
split_5['Content'] = extract_content_with_progress(split_5['URL'])
split_5.to_csv('split_5_preprocessed.csv', index=False, escapechar='\\')

In [None]:
split_6['Content'] = extract_content_with_progress(split_6['URL'])
split_6.to_csv('split_6_preprocessed.csv', index=False, escapechar='\\')

In [None]:
split_7['Content'] = extract_content_with_progress(split_7['URL'])
split_7.to_csv('split_7_preprocessed.csv', index=False, escapechar='\\')

In [None]:
split_8['Content'] = extract_content_with_progress(split_8['URL'])
split_8.to_csv('split_8_preprocessed.csv', index=False)

In [None]:
split_9['Content'] = extract_content_with_progress(split_9['URL'])
split_9.to_csv('split_9_preprocessed.csv', index=False)

In [None]:
split_10['Content'] = extract_content_with_progress(split_10['URL'])
split_10.to_csv('split_10_preprocessed.csv', index=False)

## Text Preprocessing

### Loading and Combining Data

Here we combine the saved results of all the splits from the previous step.

In [19]:
# List to store DataFrames of all splits
dfs = []

# Read each CSV file and append its DataFrame to the list
for i in range(1, 11):
    filename = f'/content/split_{i}_preprocessed.csv'
    df = pd.read_csv(filename)
    dfs.append(df)

# Concatenate all DataFrames in the list
combined_df = pd.concat(dfs, ignore_index=True)

### Format and Clean Data

This step wraps each stored value in a list, joins the pieces into a single string, cleans the text, and drops rows whose content is missing (the string 'nan' or NaN).

In [20]:
# Wrap each stored value (a string, or NaN for missing content) in a single-element list
combined_df['Content'] = combined_df['Content'].apply(lambda x: [x])

# Join the list items into a single string (a NaN value becomes the string 'nan')
combined_df['Content'] = [" ".join([str(p_tag).strip() for p_tag in content]) for content in combined_df['Content']]
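
Rows whose extraction returned nothing are read back from the CSV as missing values, and `str()` turns a missing value into the literal text `nan`, which is why a later cell looks for that string. A small sketch of the effect, using a plain `float('nan')` to stand in for a missing cell and a made-up content string; `join_content` is a hypothetical name:

```python
def join_content(value):
    """Mirror the cell above: wrap the value in a list, stringify, strip, and join."""
    return " ".join([str(item).strip() for item in [value]])

missing = float('nan')                   # stands in for an empty CSV cell
present = "[<p>First paragraph.</p>] "   # hypothetical stored content

print(repr(join_content(missing)))  # 'nan'
print(repr(join_content(present)))  # '[<p>First paragraph.</p>]'
```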

In [22]:
# Function to clean text
def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()
    # Replace hyphens with spaces
    text = re.sub(r'-', ' ', text)
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert text to lowercase
    text = text.lower()
    return text

# Apply the clean_text function to the 'Content' column
combined_df['Content'] = combined_df['Content'].apply(clean_text)
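
The regular-expression steps in `clean_text` can be tried on their own. This sketch leaves out the BeautifulSoup tag-stripping step so that it needs only the standard library; the sentence and the helper name `strip_non_letters` are made up for illustration.

```python
import re

def strip_non_letters(text):
    """Apply the hyphen, non-alphabetic, and lowercase steps from clean_text."""
    text = re.sub(r'-', ' ', text)           # hyphens become spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # drop digits and punctuation
    return text.lower()

cleaned = strip_non_letters("Well-known COVID-19 cases rose 5%!")
print(cleaned.split())  # ['well', 'known', 'covid', 'cases', 'rose']
```

Digits are removed outright, so a term such as "COVID-19" is reduced to "covid".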

In [23]:
# Replace 'nan' strings (left by missing content) with real NaN values
combined_df['Content'] = combined_df['Content'].replace('nan', np.nan)

# Drop rows with missing content
combined_df.dropna(subset=['Content'], inplace=True)

### Preprocessing Data

As part of preprocessing, we apply word tokenization, POS tagging, lemmatization, and stop word removal to the text.

In [24]:
# Convert to wordnet tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None  # Use default POS for lemmatization
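
`nltk.pos_tag` returns Penn Treebank tags, and the function above looks only at their first letter to choose a WordNet part of speech. The sketch below shows the same mapping as a dictionary lookup. To avoid needing the WordNet data, it uses the single-letter values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` directly; `TAG_PREFIX_TO_POS` and `to_wordnet_pos` are hypothetical names.

```python
# Single-letter values of the WordNet POS constants in NLTK
TAG_PREFIX_TO_POS = {
    'J': 'a',  # adjective (wordnet.ADJ)
    'V': 'v',  # verb (wordnet.VERB)
    'N': 'n',  # noun (wordnet.NOUN)
    'R': 'r',  # adverb (wordnet.ADV)
}

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    return TAG_PREFIX_TO_POS.get(tag[:1])

for tag in ['JJ', 'VBD', 'NNS', 'RB', 'IN']:
    print(tag, to_wordnet_pos(tag))
```

Tags outside these four groups, such as `IN` for prepositions, map to `None`, in which case the lemmatizer falls back to its default part of speech.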

In [25]:
# Tokenize and preprocess text data
def preprocess_text(text):
    tokens = word_tokenize(text)  # Tokenize text
    lemmatizer = WordNetLemmatizer()  # Initialize lemmatizer
    pos_tags = nltk.pos_tag(tokens)  # Get part-of-speech tags
    for i, (token, tag) in enumerate(pos_tags):
        pos = get_wordnet_pos(tag)  # Convert NLTK POS tags to WordNet POS tags
        if pos:
            tokens[i] = lemmatizer.lemmatize(token, pos=pos)  # Lemmatize tokens
        else:
            tokens[i] = lemmatizer.lemmatize(token)  # Use default POS for lemmatization
    stop_words = set(stopwords.words('english'))  # Get stopwords
    custom_stopwords = {    # Custom stopwords: frequent corpus words and single letters
        "patient", "doctor", "say", "year", "state", "day", "need", "come", "well",
        "make", "think", "know", "go", "use", "one", "like", "people", "may",
        "many", "still", "even", "two", "way", "good", "much", "back", "new",
        "time", "first", "really",
        "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
        "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"
    }
    tokens = [token for token in tokens if token not in stop_words and token not in custom_stopwords]  # Remove stopwords
    return ' '.join(tokens)

combined_df['Content'] = combined_df['Content'].apply(preprocess_text)
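
The last step of `preprocess_text` filters the lemmatized tokens against the stopword collections. A minimal sketch of that filter, with a small hand-picked set standing in for NLTK's English list plus the custom words; the set contents, the tokens, and the name `remove_stopwords` are illustrative only.

```python
# Hypothetical, shortened stopword set (the notebook combines NLTK's
# English list with its own custom words)
STOP_WORDS = {"the", "a", "say", "that", "be", "people", "to"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not in the stopword set and join them into one string."""
    return ' '.join(token for token in tokens if token not in stop_words)

tokens = ["study", "say", "that", "vaccine", "protect", "people"]
print(remove_stopwords(tokens))  # study vaccine protect
```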

### Saving Data for Modeling

In [35]:
# Save preprocessed data to CSV
combined_df.to_csv('combined_preprocessed.csv', index=False)