# Assignment 1.1

### Exercise 1.1 - News Article Scraping and Cleaning (20)

Scrape 50 different news articles from various news websites and perform the following steps using basic Python and regular expressions:

1. Write a regular expression to extract all the URLs present in the news articles.
2. How would you use regular expressions to extract the publication dates of the news articles?
3. Create a regex pattern to extract all the author names mentioned in the articles.
4. Explain how you would use regular expressions to extract email addresses of the authors from the articles.
5. Write a regex pattern to identify and extract all the phone numbers mentioned in the news articles.
6. How would you clean the text to remove HTML tags and special characters using regular expressions?
7. Write a regular expression to identify and extract all mentions of organizations or companies in the articles.
8. What approach would you take to clean the text and remove unnecessary whitespace and line breaks using regular expressions?
9. Explain how you would use regular expressions to identify and extract all the headlines or titles of the news articles.
10. Write a regex pattern to detect and extract all the mentions of important events or incidents discussed in the articles.

## Setup

In [1]:
import json
import time
import os
import re
import csv
import random
import requests
import xml.etree.ElementTree as ET
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## STEP 1 - Selenium Scraper

In [2]:
# --- Configuration ---
CSV_INPUT_FILENAME = 'news_urls_updated.csv'
JSON_OUTPUT_FILENAME = 'scraped_articles_final.json'
# --- Setup Parameters ---
WAIT_TIME = 5 # seconds to wait for initial page load
AD_WAIT_TIME = 3 # seconds to specifically wait for the ad/popup to appear

# 1. FUNCTION TO READ URLS FROM CSV FILE
def get_urls_from_csv(filename):
    """Reads the 'URL' column from a CSV file and returns a list of URLs."""
    urls = []
    
    if not os.path.exists(filename):
        print(f"ERROR: Input file not found at '{filename}'.")
        return []
    
    try:
        with open(filename, mode='r', newline='', encoding='utf-8') as file:
            reader = csv.reader(file)
            header = next(reader)
            
            try:
                url_index = header.index('URL')
            except ValueError:
                print("ERROR: 'URL' column not found in the CSV header.")
                return []
            
            for row in reader:
                if len(row) > url_index:
                    urls.append(row[url_index])
        
        print(f"Successfully loaded {len(urls)} URLs from '{filename}'.")
        return urls
        
    except Exception as e:
        print(f"An error occurred while reading the CSV: {e}")
        return []

# --- Get the dynamic list of URLs ---
urls_to_scrape = get_urls_from_csv(CSV_INPUT_FILENAME)

if not urls_to_scrape:
    print("No URLs to scrape. Exiting script.")
    exit()

# --- This list will store the data for each article ---
scraped_articles_data = []

# --- Setup Selenium WebDriver ---
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
print("Selenium WebDriver initiated...")

# Loop through each URL
for url in urls_to_scrape:
    try:
        print(f"\nNavigating to {url}...")
        driver.get(url)
        time.sleep(WAIT_TIME)

        # ----------------------------------------------------
        # 2. AD HANDLING LOGIC (METHOD 1)
        # ----------------------------------------------------
        # We define a common set of XPATHs/CSS selectors for popups/ads.
        # This will try to find and click any of the following elements that appear.
        AD_LOCATORS = [
            (By.XPATH, "//button[contains(text(), 'Close') or contains(text(), 'No thanks') or contains(text(), 'Skip')]"),
            (By.CSS_SELECTOR, "div.ad-close, button.close-ad, button[aria-label='Close']"),
            (By.XPATH, "//a[contains(@class, 'close') or contains(@class, 'ad-dismiss')]")
        ]

        ad_closed = False
        for locator_type, locator_value in AD_LOCATORS:
            try:
                # Wait for up to AD_WAIT_TIME seconds for the element to be clickable
                close_button = WebDriverWait(driver, AD_WAIT_TIME).until(
                    EC.element_to_be_clickable((locator_type, locator_value))
                )
                
                # Execute the click and break the loop if successful
                close_button.click()
                print(f"✨ Ad closed using: {locator_value}")
                time.sleep(1) # Give the page a moment to settle
                ad_closed = True
                break
                
            except Exception:
                # If a specific locator fails, just move on to the next one
                continue
        
        if not ad_closed:
            print("Note: No common ad/popup element found or closed.")
            
        # ----------------------------------------------------
        # END OF AD HANDLING
        # ----------------------------------------------------

        print("Page loaded. Extracting HTML...")
        final_html = driver.page_source

        article_data = {
            "source_url": url,
            "html_content": final_html
        }

        scraped_articles_data.append(article_data)
        print(f"Successfully processed and added data for the article.")

    except Exception as e:
        print(f"Failed to process {url}. Error: {e}")

# --- Close the browser ---
driver.quit()
print("\nBrowser closed.")

# --- Save the collected data to a single JSON file ---
print(f"Saving all collected data to {JSON_OUTPUT_FILENAME}...")

with open(JSON_OUTPUT_FILENAME, 'w', encoding='utf-8') as json_file:
    json.dump(scraped_articles_data, json_file, ensure_ascii=False, indent=4)

print(f"All done. Data saved to {JSON_OUTPUT_FILENAME}")

Successfully loaded 60 URLs from 'news_urls_updated.csv'.
Selenium WebDriver initiated...

Navigating to https://www.bbc.com/news/articles/c2056729058o?at_medium=RSS&at_campaign=rss...
Note: No common ad/popup element found or closed.
Page loaded. Extracting HTML...
Successfully processed and added data for the article.

Navigating to https://www.bbc.com/news/articles/c93xprvdy23o?at_medium=RSS&at_campaign=rss...
Note: No common ad/popup element found or closed.
Page loaded. Extracting HTML...
Successfully processed and added data for the article.

Navigating to https://www.bbc.com/news/articles/crkldd02xg8o?at_medium=RSS&at_campaign=rss...
Note: No common ad/popup element found or closed.
Page loaded. Extracting HTML...
Successfully processed and added data for the article.

Navigating to https://www.bbc.com/news/articles/c62e3pny6p7o?at_medium=RSS&at_campaign=rss...
Note: No common ad/popup element found or closed.
Page loaded. Extracting HTML...
Successfully processed and added data

#### This cell performs the web scraping with Selenium. It reads the target article URLs from `news_urls_updated.csv`, launches an automated Chrome instance, and for each URL loads the page, waits `WAIT_TIME` seconds, and tries to dismiss common ad popups by clicking the first matching close button or link. It then captures the full HTML source of the loaded page and stores it alongside the URL. Finally, it writes all records to a single JSON file named `scraped_articles_final.json`.
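The input and output shapes described above can be sketched without a browser. This is a minimal illustration, not the notebook's code: `load_urls`, `SAMPLE_CSV`, and the `example.com` URLs are hypothetical stand-ins, and the sketch uses `csv.DictReader` where the cell uses `csv.reader` with a header index.

```python
import csv
import io
import json


def load_urls(csv_text, column="URL"):
    """Return the non-empty, de-duplicated values of `column`, in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen, urls = set(), []
    for row in reader:
        # Missing or short rows yield None, which is treated as empty
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical CSV content standing in for news_urls_updated.csv
SAMPLE_CSV = (
    "Source,URL\n"
    "BBC,https://example.com/a\n"
    "CNN,https://example.com/b\n"
    "BBC,https://example.com/a\n"
)

urls = load_urls(SAMPLE_CSV)

# One record per page, mirroring the keys written by the scraper cell
records = [{"source_url": u, "html_content": "<html>...</html>"} for u in urls]
serialized = json.dumps(records, ensure_ascii=False, indent=4)
```

De-duplicating before scraping avoids loading the same page twice when a feed lists an article more than once.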

## STEP 2 - Strip JavaScript

In [5]:
import re
import json
import os

# --- Configuration ---
RAW_INPUT_FILENAME = 'scraped_articles_final.json' # Input from STEP 1 (Selenium scraper)
NO_SCRIPTS_OUTPUT_FILENAME = 'scraped_articles_NO_SCRIPTS.json' # Output consumed by STEP 3

# --- Enhanced Removal Function (Using '' replacement) ---
def strip_unwanted_blocks(html_content):
    """Remove <script> and <noscript> blocks, HTML comments, and inline
       event handlers, replacing each match with an empty string."""
    if html_content is None:
        return ""

    cleaned_html = html_content

    # 1. Remove standard <script> blocks
    SCRIPT_PATTERN = r'<script\b[^>]*>.*?</script>'
    cleaned_html = re.sub(SCRIPT_PATTERN, '', cleaned_html, flags=re.DOTALL | re.IGNORECASE) # Use ''

    # 2. Remove <noscript> blocks
    NOSCRIPT_PATTERN = r'<noscript\b[^>]*>.*?</noscript>'
    cleaned_html = re.sub(NOSCRIPT_PATTERN, '', cleaned_html, flags=re.DOTALL | re.IGNORECASE) # Use ''

    # 3. Remove HTML comments (<!-- ... -->)
    HTML_COMMENT_PATTERN = r'<!--.*?-->'
    cleaned_html = re.sub(HTML_COMMENT_PATTERN, '', cleaned_html, flags=re.DOTALL) # Use ''

    # 4. Remove inline event handlers (like onclick="...")
    INLINE_EVENT_PATTERN = r'\s+on\w+\s*=\s*(?:"[^"]*"|\'[^\']*\')'
    cleaned_html = re.sub(INLINE_EVENT_PATTERN, '', cleaned_html, flags=re.IGNORECASE) # Use ''

    return cleaned_html

# --- Main Processing Logic ---
print(f"--- Starting Script/Noscript/Comment Stripping ---")
try:
    if not os.path.exists(RAW_INPUT_FILENAME):
        print(f"ERROR: File '{RAW_INPUT_FILENAME}' not found. Run Cell 2 first.")
    else:
        with open(RAW_INPUT_FILENAME, 'r', encoding='utf-8') as f:
            raw_data = json.load(f)
        print(f"Successfully loaded {len(raw_data)} articles from {RAW_INPUT_FILENAME}.")

        stripped_data = []
        articles_processed_count = 0
        for article in raw_data:
            raw_html = article.get("html_content", "")
            source_url = article.get("source_url", "N/A")
            html_no_scripts = strip_unwanted_blocks(raw_html) # Apply function
            stripped_entry = {"source_url": source_url, "html_content": html_no_scripts}
            stripped_data.append(stripped_entry)
            articles_processed_count += 1

        with open(NO_SCRIPTS_OUTPUT_FILENAME, 'w', encoding='utf-8') as json_file:
            json.dump(stripped_data, json_file, ensure_ascii=False, indent=4)
        print(f"\n--- Script/Noscript/Comment Stripping Complete ---")
        print(f"Successfully processed {articles_processed_count} articles.")
        print(f"Saved results to {NO_SCRIPTS_OUTPUT_FILENAME}")

        # Display sample (still might have weird spacing from original source)
        if stripped_data:
            sample_output = stripped_data[0]['html_content']
            first_meaningful_chunk = re.search(r'\S.*\S', sample_output, re.DOTALL)
            if first_meaningful_chunk:
                print("\n--- Sample Output after Stripping (First 500 chars of meaningful content) ---")
                print(repr(first_meaningful_chunk.group(0)[:500]) + "...") # Use repr
            else:
                print("\n--- Sample Output after Stripping (First 500 chars) ---")
                print(repr(sample_output[:500]) + "...") # Use repr

except Exception as e:
    print(f"An unexpected error occurred during processing: {e}")

--- Starting Script/Noscript/Comment Stripping ---
Successfully loaded 60 articles from scraped_articles_final.json.

--- Script/Noscript/Comment Stripping Complete ---
Successfully processed 60 articles.
Saved results to scraped_articles_NO_SCRIPTS.json

--- Sample Output after Stripping (First 500 chars of meaningful content) ---
'<html lang="en-GB"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width"><title>Son of hostage Amiram Cooper, whose body remains in Gaza, says \'it\'s not over\'</title><meta name="page.section" content="News"><meta name="page.subsection" content="Middle_east"><meta property="og:title" content="Son of hostage Amiram Cooper, whose body remains in Gaza, says \'it\'s not over\'"><meta name="twitter:title" content="Son of hostage Amiram Cooper, whose body remains in Gaza, says \'it'...


#### This cell performs the first stage of HTML cleaning. It reads the raw HTML data saved in `scraped_articles_final.json`. For each article's HTML content, it calls the `strip_unwanted_blocks` function. This function uses regular expressions (`re.sub`) to find and remove `<script>` blocks, `<noscript>` blocks, HTML comments (`<!-- ... -->`), and inline JavaScript event handlers (like `onclick="..."`), replacing each match with an empty string. The resulting HTML, now stripped of most script-related content but retaining other tags, is saved for each article into a new file named `scraped_articles_NO_SCRIPTS.json`.
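To make the four removals concrete, here is a minimal sketch applied to a toy HTML string. The helper name `strip_demo` and the sample markup are hypothetical, and the comment pattern used here is an assumption about the intended regex rather than a copy of the cell's code.

```python
import re

# Patterns compiled once; DOTALL lets '.' span newlines inside blocks
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script>', re.DOTALL | re.IGNORECASE)
NOSCRIPT_RE = re.compile(r'<noscript\b[^>]*>.*?</noscript>', re.DOTALL | re.IGNORECASE)
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)  # assumed comment pattern
EVENT_RE = re.compile(r'\s+on\w+\s*=\s*(?:"[^"]*"|\'[^\']*\')', re.IGNORECASE)


def strip_demo(html):
    """Apply each removal in turn, replacing matches with an empty string."""
    for pattern in (SCRIPT_RE, NOSCRIPT_RE, COMMENT_RE, EVENT_RE):
        html = pattern.sub('', html)
    return html


# Hypothetical input containing one instance of each unwanted construct
sample = (
    '<p onclick="track()">Hello</p><!-- ad slot -->'
    '<script type="text/javascript">var x = 1;</script>'
    '<noscript><img src="pixel.gif"></noscript><b>World</b>'
)
result = strip_demo(sample)
```

The non-greedy `.*?` matters: a greedy `.*` would swallow everything between the first opening tag and the last closing tag on the page.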

## STEP 3 (Q6) - Remove HTML Tags & Special Characters

In [6]:
import re
import json
import os

# --- Configuration ---
RAW_JSON_INPUT = 'scraped_articles_NO_SCRIPTS.json' # Input from STEP 2
CLEAN_JSON_OUTPUT = 'scraped_articles_CLEANED.json' # Output, normalized in place by STEP 4 (Q8)

# --- Cleaning Regex Patterns ---
# 1. Remove <style> blocks and content
STYLE_TAG_PATTERN = r'<style\b[^>]*>.*?</style>'
# 2. (Optional but kept from previous) Remove common leftover JS patterns
LEFTOVER_JS_PATTERN = r'(?:var|let|const|window\.|document\.)\s*\w+\s*=[^;>]+[;>]|{\s*"?[^"]+"?\s*:[^}]+};?'
# 3. Remove HTML comments (as written, the empty pattern below only matches empty strings, so this step is a no-op)
HTML_COMMENT_PATTERN = r''
# 4. Remove common HTML entities like &nbsp; &amp; etc.
HTML_ENTITY_PATTERN = r'&[a-zA-Z#0-9]+;'
# 5. Remove all *other* remaining HTML tags
HTML_TAG_PATTERN = r'<[^>]+>'
# 6. *** Pattern to remove unwanted special characters (keeping basics + whitespace \s) ***
#    This pattern defines what to *keep*. The regex matches everything ELSE.
SPECIAL_CHAR_PATTERN = r'[^\w\s\.,!?\'\-]' # Keep word chars, whitespace, and basic punctuation ',.!?-'

# --- Load the No-Script/No-Comment Data ---
try:
    with open(RAW_JSON_INPUT, 'r', encoding='utf-8') as f:
        intermediate_data = json.load(f)
    print(f"Successfully loaded {len(intermediate_data)} articles from {RAW_JSON_INPUT}")
except FileNotFoundError:
    print(f"ERROR: File '{RAW_JSON_INPUT}' not found. Ensure Enhanced Cell 3 ran successfully.")
    intermediate_data = []
except Exception as e:
    print(f"ERROR: Could not load JSON data: {e}")
    intermediate_data = []

cleaned_articles_data = []

if intermediate_data:
    print(f"\nStarting refined cleaning process on {len(intermediate_data)} articles...")
    articles_cleaned_count = 0
    for article in intermediate_data:
        text_to_clean = article.get("html_content", "")
        source_url = article.get("source_url", "N/A")

        # --- Apply Removals Sequentially ---
        # 1. Remove Style Blocks (replace with empty string)
        text_to_clean = re.sub(STYLE_TAG_PATTERN, '', text_to_clean, flags=re.DOTALL | re.IGNORECASE)
        # 2. Remove Heuristic JS (replace with empty string)
        text_to_clean = re.sub(LEFTOVER_JS_PATTERN, '', text_to_clean, flags=re.DOTALL | re.IGNORECASE)
        # 3. Remove Comments (replace with empty string)
        text_to_clean = re.sub(HTML_COMMENT_PATTERN, '', text_to_clean, flags=re.DOTALL)
        # 4. Remove HTML Entities (replace with space)
        text_to_clean = re.sub(HTML_ENTITY_PATTERN, ' ', text_to_clean)
        # 5. Remove Remaining HTML Tags (replace with empty string)
        text_to_clean = re.sub(HTML_TAG_PATTERN, '', text_to_clean)
        # 6. *** Remove Special Characters (replace matches with space) ***
        text_no_special_chars = re.sub(SPECIAL_CHAR_PATTERN, ' ', text_to_clean)

        # Basic strip, but major whitespace cleanup happens in STEP 4 (Q8)
        final_intermediate_text = text_no_special_chars.strip()

        # --- Store Cleaned Data ---
        cleaned_entry = {
            "source_url": source_url,
            "clean_text": final_intermediate_text # This text STILL has excess internal whitespace
        }
        cleaned_articles_data.append(cleaned_entry)
        articles_cleaned_count +=1

    # --- Save the Intermediate Cleaned Data ---
    try:
        with open(CLEAN_JSON_OUTPUT, 'w', encoding='utf-8') as json_file:
            json.dump(cleaned_articles_data, json_file, ensure_ascii=False, indent=4)
        print(f"\n--- Cleaning Complete (Q6 - Tags/Styles/Entities/Special Chars Removed) ---")
        print(f"Successfully processed (Q6 - v6) {articles_cleaned_count} articles.")
        print(f"Intermediate cleaned data saved to {CLEAN_JSON_OUTPUT}")
        print("NOTE: Run Cell 5 (Q8) next to normalize whitespace.")

        # --- Display Sample (will likely show excess whitespace) ---
        if cleaned_articles_data:
            sample_text = cleaned_articles_data[0]['clean_text'][:500]
            print("\n--- Sample BEFORE Whitespace Normalization (First 500 characters) ---")
            print(repr(sample_text) + "...") # Use repr()

    except Exception as e:
         print(f"ERROR: Could not save cleaned data to {CLEAN_JSON_OUTPUT}: {e}")

else:
    print("\nNo data loaded from no-scripts file. Cannot proceed with cleaning.")

Successfully loaded 60 articles from scraped_articles_NO_SCRIPTS.json

Starting refined cleaning process on 60 articles...

--- Cleaning Complete (Q6 - Tags/Styles/Entities/Special Chars Removed) ---
Successfully processed (Q6 - v6) 60 articles.
Intermediate cleaned data saved to scraped_articles_CLEANED.json
NOTE: Run Cell 5 (Q8) next to normalize whitespace.

--- Sample BEFORE Whitespace Normalization (First 500 characters) ---
"Son of hostage Amiram Cooper, whose body remains in Gaza, says 'it's not over'Skip to contentAdvertisementWatch LiveBritish Broadcasting CorporationSubscribeSign InHomeNewsSportBusinessInnovationCultureArtsTravelEarthAudioVideoLiveDocumentariesHomeNewsSportBusinessInnovationCultureArtsTravelEarthAudioVideoLiveDocumentariesWeatherNewslettersWatch LiveAdvertisement'It's not over,' says son of hostage whose body remains in Gaza7 days agoShareSaveAlice CuddySenior international reporter, in Tel Aviv"...


#### This cell executes the cleaning steps required by Question 6. It loads the partially cleaned HTML from `scraped_articles_NO_SCRIPTS.json` and applies several `re.sub` substitutions in sequence: removing `<style>` blocks, heuristic JavaScript remnants, HTML comments (though the pattern `r''` is non-functional here), HTML entities (replaced with spaces), and all remaining HTML tags (replaced with empty strings). Finally, it replaces unwanted special characters (anything other than word characters, whitespace, or the basic punctuation `.,!?'-`) with spaces. Because tags are replaced with empty strings, text from adjacent elements is joined without a separator, which is visible in the sample output (`contentAdvertisementWatch`). The resulting text, still containing excess whitespace, is saved to `scraped_articles_CLEANED.json`.
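The sample output shows words from neighbouring elements run together, because each tag is replaced with an empty string. One possible variation, sketched below under assumptions, replaces tags with a space and decodes entities with the standard library's `html.unescape` instead of discarding them. The helper `tags_to_text` and the sample fragment are hypothetical, and this is an alternative to the cell's approach, not a description of it.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')
SPECIAL_RE = re.compile(r"[^\w\s.,!?'\-]")  # same keep-list as the cell above


def tags_to_text(fragment):
    """Strip tags, decode entities, drop special characters, tidy spacing."""
    text = TAG_RE.sub(' ', fragment)   # a space keeps adjacent words apart
    text = html.unescape(text)         # decode entities rather than dropping them
    text = SPECIAL_RE.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical fragment: a heading, a link, and a paragraph with an entity
sample = '<h1>Skip to content</h1><a href="/live">Watch Live</a><p>Caf&eacute; culture</p>'

glued = TAG_RE.sub('', sample)    # empty-string replacement, as in the cell
spaced = tags_to_text(sample)     # space replacement plus entity decoding
```

Here `glued` runs `content` and `Watch` together, while `spaced` keeps them as separate words and preserves the accented character.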

## STEP 4 (Q8) - Remove Unnecessary Whitespace

In [7]:
import re
import json
import os
import time

# --- Configuration ---
CLEAN_JSON_INPUT = 'scraped_articles_CLEANED.json' # Input from STEP 3 (Q6); overwritten in place below

# --- Whitespace Normalization Pattern ---
WHITESPACE_PATTERN = r'\s+' # Matches one or more whitespace characters

# --- Load the intermediate cleaned data ---
try:
    with open(CLEAN_JSON_INPUT, 'r', encoding='utf-8') as f:
        partially_cleaned_data = json.load(f)
    print(f"Successfully loaded {len(partially_cleaned_data)} articles from {CLEAN_JSON_INPUT}")
except FileNotFoundError:
    print(f"ERROR: File '{CLEAN_JSON_INPUT}' not found. Please ensure the Q6 script ran successfully.")
    partially_cleaned_data = []
except Exception as e:
    print(f"ERROR: Could not load data: {e}")
    partially_cleaned_data = []

if partially_cleaned_data:
    print(f"\nStarting whitespace normalization on {len(partially_cleaned_data)} articles...")
    start_time = time.time()
    articles_normalized_count = 0
    # Iterate and update 'clean_text'
    for article in partially_cleaned_data:
        current_text = article.get("clean_text", "")

        # --- Normalize ALL whitespace ---
        normalized_text = re.sub(WHITESPACE_PATTERN, ' ', current_text).strip()

        # Overwrite the field
        article["clean_text"] = normalized_text
        articles_normalized_count += 1

    # --- Save the fully cleaned data back ---
    try:
        with open(CLEAN_JSON_INPUT, 'w', encoding='utf-8') as json_file:
            json.dump(partially_cleaned_data, json_file, ensure_ascii=False, indent=4)
        end_time = time.time()
        print(f"\n--- Whitespace Normalization Complete ---")
        print(f"Successfully normalized whitespace (Q8 v5) for {articles_normalized_count} articles in {end_time - start_time:.2f} seconds.")
        print(f"Final cleaned data saved back to {CLEAN_JSON_INPUT}")

        # Show final sample
        if partially_cleaned_data:
            sample_text = partially_cleaned_data[0]['clean_text'][:500]
            print("\n--- Sample of FINAL Cleaned Text (First 500 characters) ---")
            print(sample_text + "...")

    except Exception as e:
         print(f"ERROR: Could not save final cleaned data to {CLEAN_JSON_INPUT}: {e}")

else:
    print("\nNo intermediate cleaned data loaded. Cannot perform whitespace normalization.")

Successfully loaded 60 articles from scraped_articles_CLEANED.json

Starting whitespace normalization on 60 articles...

--- Whitespace Normalization Complete ---
Successfully normalized whitespace (Q8 v5) for 60 articles in 0.02 seconds.
Final cleaned data saved back to scraped_articles_CLEANED.json

--- Sample of FINAL Cleaned Text (First 500 characters) ---
Son of hostage Amiram Cooper, whose body remains in Gaza, says 'it's not over'Skip to contentAdvertisementWatch LiveBritish Broadcasting CorporationSubscribeSign InHomeNewsSportBusinessInnovationCultureArtsTravelEarthAudioVideoLiveDocumentariesHomeNewsSportBusinessInnovationCultureArtsTravelEarthAudioVideoLiveDocumentariesWeatherNewslettersWatch LiveAdvertisement'It's not over,' says son of hostage whose body remains in Gaza7 days agoShareSaveAlice CuddySenior international reporter, in Tel Aviv...


#### This cell addresses Question 8 by normalizing whitespace. It reads the intermediate `scraped_articles_CLEANED.json` file produced by the previous step. For each article's text, it applies the regular expression `r'\s+'` using `re.sub` to replace any sequence of one or more whitespace characters (spaces, tabs, newlines) with a single space (`' '`). It also uses the `.strip()` method to remove any leading or trailing whitespace. This fully cleaned and normalized text is then saved back, overwriting the `scraped_articles_CLEANED.json` file.
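A small worked example of the same substitution, using a hypothetical input string. In Python 3, `\s` on `str` patterns also matches Unicode whitespace such as the non-breaking space, so those are collapsed as well.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical input mixing spaces, a tab, newlines, CRLF and a non-breaking space
raw = "  Breaking\tnews:\n\n  storm\u00a0hits \r\n coast  "
clean = normalize_whitespace(raw)
```

Note that this also removes paragraph breaks; if paragraph structure is needed later, it has to be captured before this step.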

## Q1 - Extracting URLs (from Raw HTML)

In [9]:
import re
import json

# --- Configuration ---
# Define the name of the file created by your Selenium script
JSON_INPUT_FILENAME = 'scraped_articles_final.json' 

# Load the scraped data from the JSON file
try:
    with open(JSON_INPUT_FILENAME, 'r', encoding='utf-8') as f:
        scraped_articles_data = json.load(f)
except FileNotFoundError:
    print(f"ERROR: File '{JSON_INPUT_FILENAME}' not found. Please ensure the scraping script ran successfully.")
    scraped_articles_data = []
except Exception as e:
    print(f"ERROR: Could not load JSON data: {e}")
    scraped_articles_data = []

# --- Regex Pattern for URLs ---
# This pattern captures http or https, followed by every character up to the
# next whitespace or double quote. It does not stop at single quotes, angle
# brackets, or trailing punctuation, so some matches may carry extra characters.
URL_PATTERN = r"(https?:\/\/[^\s\"]+)" 
# NOTE: We use r"..." for raw string to handle backslashes correctly.

all_extracted_urls = set() # Use a set to automatically store only unique URLs
article_count = 0

if scraped_articles_data:
    print(f"Starting URL extraction from {len(scraped_articles_data)} articles...")
    
    for article in scraped_articles_data:
        article_count += 1
        html_content = article.get("html_content", "")
        
        # Find all matches in the HTML content
        found_urls = re.findall(URL_PATTERN, html_content)
        
        # Add the found URLs to the set
        for url in found_urls:
            all_extracted_urls.add(url)
            
    # --- Output Summary ---
    print("\n--- Extraction Complete ---")
    print(f"Total articles processed: {article_count}")
    print(f"Total unique URLs extracted: {len(all_extracted_urls)}")
    
    # Display the first 10 extracted URLs for verification
    print("\nSample Extracted URLs:")
    for i, url in enumerate(list(all_extracted_urls)[:10]):
        print(f"{i+1}. {url}")

Starting URL extraction from 60 articles...

--- Extraction Complete ---
Total articles processed: 60
Total unique URLs extracted: 3666

Sample Extracted URLs:
1. https://ichef.bbci.co.uk/news/1536/cpsprodpb/23e2/live/24f0dc60-a845-11f0-92db-77261a15b9d2.png.webp
2. https://ichef.bbci.co.uk/news/240/cpsprodpb/81d8/live/d175edc0-a842-11f0-92db-77261a15b9d2.png.webp
3. https://www.bbc.com/news/articles/ckgywnjkrlqo
4. https://media.cnn.com/api/v1/images/stellar/prod/045a27b3-fd48-4c69-bc4f-1409769c4825.jpg?q=h_562,w_1000,x_0,y_0/w_1280
5. https://people.com/movies/jake-gyllenhaal-jamie-lee-curtis-living-together-covid-lockdown/
6. https://bbc.com/news/articles/clykwd9e256o
7. https://ichef.bbci.co.uk/news/800/cpsprodpb/64a5/live/89868750-a9c2-11f0-8da2-811fba9518ff.jpg.webp
8. https://ichef.bbci.co.uk/news/480/cpsprodpb/e725/live/e73966a0-991f-11f0-928c-71dbb8619e94.jpg.webp
9. https://ichef.bbci.co.uk/news/240/cpsprodpb/dda7/live/b22b30a0-a7cb-11f0-b50c-8f62428b85e3.jpg.webp
10. https:/

#### This cell answers Question 1 by extracting URLs. It specifically loads the *raw* HTML data collected by the scraper (`scraped_articles_final.json`). Using the regular expression `r"(https?:\/\/[^\s\"]+)"`, it finds all substrings within the raw HTML that look like standard web links (starting with `http://` or `https://` and continuing until whitespace or a double quote). `re.findall` is used to capture all such occurrences. These extracted URLs are stored in a Python set to automatically keep only unique ones, and a sample of the results is printed.
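A slightly stricter variant can be sketched as follows. It is an illustration under assumptions, not the cell's pattern: `URL_RE` additionally stops at single quotes and angle brackets, `extract_urls` trims trailing punctuation, and the sample HTML is hypothetical.

```python
import re

# Stop at whitespace, either quote style, or an angle bracket
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


def extract_urls(html):
    """Return the unique URLs in `html`, sorted, without trailing punctuation."""
    urls = set()
    for match in URL_RE.findall(html):
        # Sentence punctuation right after a link is rarely part of the URL
        urls.add(match.rstrip('.,;:!?)'))
    return sorted(urls)


# Hypothetical HTML with double-quoted, single-quoted and bare URLs
sample = (
    '<a href="https://example.com/news?id=1">story</a> '
    "<img src='https://example.com/img.png'> "
    'See https://example.org/report.'
)
found = extract_urls(sample)
```

Sorting the set gives a stable order, so the printed sample is reproducible between runs.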

## Q2 - Extract Publication Dates (from Raw HTML)

In [10]:
import re
import json
import random # Used to draw a random sample of results

# --- Configuration ---
JSON_INPUT_FILENAME = 'scraped_articles_final.json' 
extracted_dates_list = []
SAMPLE_SIZE = 10 # Number of extracted dates to display

# Load the scraped data 
try:
    with open(JSON_INPUT_FILENAME, 'r', encoding='utf-8') as f:
        scraped_articles_data = json.load(f)
except Exception as e:
    print(f"❌ ERROR: Could not load JSON data: {e}")
    scraped_articles_data = []

# --- Comprehensive Date Regex Patterns ---
# Run these patterns in order from most predictable to least predictable.
DATE_PATTERNS = [
    # 1. ISO 8601 format (e.g., 2025-10-14T11:51:31Z) - often in meta tags
    r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)',
    # 2. YYYY/MM/DD format (e.g., 2025/10/14)
    r'(\d{4}/\d{2}/\d{2})',
    # 3. Full Month DD, YYYY format (e.g., October 14, 2025)
    r'(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4}',
    # 4. Abbreviated Month DD, YYYY format (e.g., Oct 14, 2025)
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},\s+\d{4}',
]

if scraped_articles_data:
    print(f"Starting date extraction from {len(scraped_articles_data)} articles...")

    # --- 1. Extraction Loop (Finds Dates for ALL Articles) ---
    for article in scraped_articles_data:
        html_content = article.get("html_content", "")
        extracted_date = "NOT FOUND"
        
        for pattern in DATE_PATTERNS:
            # Use re.search for the first match, which is usually the publication date
            match = re.search(pattern, html_content)
            if match:
                extracted_date = match.group(0) # Get the entire matched string
                break # Stop searching once a match is found
        
        extracted_dates_list.append({
            "source_url": article.get("source_url", "N/A"),
            "publication_date": extracted_date
        })

    # --- 2. Random Sampling (Select 10 Dates) ---
    print("\n--- Extraction Complete ---")
    
    # Filter out 'NOT FOUND' entries before sampling for better verification
    found_dates = [item for item in extracted_dates_list if item['publication_date'] != 'NOT FOUND']
    
    if len(found_dates) >= SAMPLE_SIZE:
        # Use random.sample to get exactly SAMPLE_SIZE unique items
        random_sample = random.sample(found_dates, SAMPLE_SIZE)
        print(f"Successfully extracted dates for {len(found_dates)} articles.")
        print(f"Displaying a random sample of {SAMPLE_SIZE} extracted dates:")
    else:
        # If fewer than 10 dates were found, use all of them
        random_sample = found_dates
        print(f"⚠️ Only {len(found_dates)} dates were successfully extracted. Displaying all of them:")


    # --- Output Summary (Random Sample) ---
    print("-" * 30)
    for i, item in enumerate(random_sample):
        print(f"{i+1:02d}. URL: {item['source_url']} | Date: {item['publication_date']}")
    print("-" * 30)

Starting date extraction from 60 articles...

--- Extraction Complete ---
Successfully extracted dates for 52 articles.
Displaying a random sample of 10 extracted dates:
------------------------------
01. URL: https://www.bbc.com/news/articles/cj9zmeerp1xo?at_medium=RSS&at_campaign=rss | Date: October 1, 2025
02. URL: https://www.aljazeera.com/news/2025/10/14/madagascar-president-dissolves-parliament-after-fleeing-army-backed-protest?traffic_source=rss | Date: 2025-10-14T12:38:43Z
03. URL: https://www.aljazeera.com/video/newsfeed/2025/10/14/deadly-storm-batters-alaska-leaving-thousands-displaced?traffic_source=rss | Date: 2025-10-14T13:33:46Z
04. URL: https://www.cnn.com/2023/04/17/entertainment/jamie-foxx-remains-hospitalized/index.html | Date: 2023-04-18T00:27:07Z
05. URL: https://www.aljazeera.com/news/2025/10/14/mapping-the-rise-in-israeli-settler-attacks-across-the-occupied-west-bank?traffic_source=rss | Date: 2025-10-14T14:46:26Z
06. URL: https://www.aljazeera.com/news/2025/10/14

#### This cell addresses Question 2, extracting publication dates, again using the *raw* HTML data (`scraped_articles_final.json`). It defines a list of regex patterns (`DATE_PATTERNS`) matching common date formats (ISO 8601, YYYY/MM/DD, textual month formats). For each article, it tries the patterns in list order and, with `re.search`, takes the first occurrence of the first pattern that matches anywhere in the HTML, so pattern priority outranks position in the page. This match is assumed to be the publication date and is stored. Note that the YYYY/MM/DD pattern can also match date-like URL paths such as `/2025/10/14/`. Finally, the code filters out articles where no date was found and displays a random sample of the extracted dates.
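The priority-ordered search can be sketched as below, with one addition that is not in the cell: each pattern is paired with a `strptime` format so that matches are normalized to `YYYY-MM-DD`. The names `PATTERNS` and `first_date` and the sample strings are hypothetical, and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

MONTHS = ('January|February|March|April|May|June|July|August|'
          'September|October|November|December')
MONTHS_ABBR = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

# (regex, strptime format) pairs, tried in this order
PATTERNS = [
    (r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z', '%Y-%m-%dT%H:%M:%SZ'),
    (rf'(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}', '%B %d, %Y'),
    (rf'(?:{MONTHS_ABBR})\s+\d{{1,2}},\s+\d{{4}}', '%b %d, %Y'),
]


def first_date(text):
    """Return the first match of the highest-priority pattern as YYYY-MM-DD."""
    for pattern, fmt in PATTERNS:
        match = re.search(pattern, text)
        if match:
            # Collapse internal whitespace so the text fits the format string
            cleaned = re.sub(r'\s+', ' ', match.group(0))
            return datetime.strptime(cleaned, fmt).date().isoformat()
    return None


# Hypothetical page: a textual date appears first, an ISO timestamp later
sample = 'Updated October 1, 2025 <meta content="2025-10-14T12:38:43Z">'
picked = first_date(sample)
```

In this example `picked` comes from the ISO timestamp even though the textual date appears earlier, which mirrors the priority behaviour described above.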

## Q3 - Extract Author Names (from Cleaned Text)

In [11]:
import re
import json
import random

# --- Configuration ---
CLEAN_JSON_INPUT = 'scraped_articles_CLEANED.json' 
AUTHOR_SAMPLE_SIZE = 10 

# Load the CLEANED data 
try:
    with open(CLEAN_JSON_INPUT, 'r', encoding='utf-8') as f:
        cleaned_data = json.load(f)
except Exception as e:
    print(f"❌ ERROR: Could not load cleaned JSON data: {e}")
    cleaned_data = []

# --- Author Regex Pattern ---
# Matches common "By [Name]" phrases, expecting 2-6 capitalized words.
AUTHOR_PATTERN = r'(?:By|BY|by|Author|AUTHORED BY|REPORTER|Credit)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,5})'

extracted_authors_list = []
article_count = 0

if cleaned_data:
    print(f"Starting author extraction from {len(cleaned_data)} CLEANED articles...")

    for article in cleaned_data:
        article_count += 1
        clean_text = article.get("clean_text", "")
        extracted_author = "NOT FOUND"
        
        # Search the entire text for the author pattern
        match = re.search(AUTHOR_PATTERN, clean_text)
        
        if match:
            # Group 1 holds the captured name
            extracted_author = match.group(1).strip()
        
        extracted_authors_list.append({
            "source_url": article.get("source_url", "N/A"),
            "author_name": extracted_author
        })

    # --- Random Sampling (Select 10 Authors) ---
    print("\n--- Extraction Complete ---")
    
    found_authors = [item for item in extracted_authors_list if item['author_name'] != 'NOT FOUND']
    
    if len(found_authors) >= AUTHOR_SAMPLE_SIZE:
        random_sample = random.sample(found_authors, AUTHOR_SAMPLE_SIZE)
        print(f"Successfully extracted author names for {len(found_authors)} articles.")
        print(f"Displaying a random sample of {AUTHOR_SAMPLE_SIZE} extracted authors:")
    else:
        random_sample = found_authors
        print(f"Only {len(found_authors)} author names were successfully extracted. Displaying all of them:")


    # --- Output Summary (Random Sample) ---
    print("-" * 30)
    for i, item in enumerate(random_sample):
        print(f"{i+1:02d}. URL: {item['source_url']} | Author: {item['author_name']}")
    print("-" * 30)

Starting author extraction from 60 CLEANED articles...

--- Extraction Complete ---
Successfully extracted author names for 34 articles.
Displaying a random sample of 10 extracted authors:
------------------------------
01. URL: https://www.cnn.com/2023/04/18/media/fox-dominion-settlement/index.html | Author: Marshall Cohen
02. URL: https://www.cnn.com/2023/04/18/politics/white-house-toddler/index.html | Author: Arlette Saenz
03. URL: https://www.aljazeera.com/news/2025/10/14/who-is-in-charge-of-madagascar-after-president-rajoelina-flees?traffic_source=rss | Author: Shola Lawal
04. URL: https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/h_8d51e3ae2714edaa0dace837305d03b8 | Author: Catherine Thorbecke
05. URL: https://www.cnn.com/2023/04/18/entertainment/jake-gyllenhaal-jamie-lee-curtis-pandemic-living/index.html | Author: Marianne Garvey
06. URL: https://www.bbc.com/news/articles/c62e3pny6p7o?at_medium=RSS&at_campaign=rss | Author: Prime Minister Bart
07. URL: http

#### This cell answers Question 3, extracting author names from the *fully cleaned* text (`scraped_articles_CLEANED.json`). It uses `re.search` with the pattern `r'(?:By|...|Credit)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,5})'`. The pattern looks for a common attribution keyword (such as "By" or "Author") followed by whitespace, then captures (in Group 1) a run of 2 to 6 capitalized words assumed to be the author's name. Because `re.search` returns only the first match and the keyword list includes lowercase "by", a mid-sentence phrase can be mistaken for a byline, as with "Prime Minister Bart" in the sample output. Articles with no match are recorded as "NOT FOUND", and a random sample of the found author names is printed.
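One way to reduce mid-sentence false positives such as "Prime Minister Bart" is sketched below. It is an illustration rather than the notebook's method: `STRICT_BYLINE`, `TITLE_WORDS`, and `extract_byline` are hypothetical names, and the title-word list is an assumption that would need tuning on real data.

```python
import re

# Pattern from the cell above, kept for comparison.
AUTHOR_PATTERN = r'(?:By|BY|by|Author|AUTHORED BY|REPORTER|Credit)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,5})'

# Hypothetical stricter variant: only a capitalized, standalone "By".
STRICT_BYLINE = re.compile(r'\bBy\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,3})')

# Assumed list of words that signal an office or title rather than a byline.
TITLE_WORDS = {"Prime", "Minister", "President", "Secretary", "Senator", "Judge"}

def extract_byline(text):
    """Return the first byline candidate that contains no title word, else None."""
    for match in STRICT_BYLINE.finditer(text):
        name = match.group(1)
        if not TITLE_WORDS & set(name.split()):
            return name
    return None

sample = "The plan was rejected by Prime Minister Bart de Wever. By Shola Lawal"
print(re.search(AUTHOR_PATTERN, sample).group(1))  # original pattern: first hit wins
print(extract_byline(sample))                      # stricter variant
```

The trade-off is recall: sites that print bylines as "by Jane Doe" in lowercase would no longer match, so the keyword list should reflect the sources actually scraped.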

## Q4 - Extract Email Addresses (from Cleaned Text)

In [12]:
import re
import json
import random

# --- Configuration ---
# File containing the cleaned text content
CLEAN_JSON_INPUT = 'scraped_articles_CLEANED.json' 
EMAIL_SAMPLE_SIZE = 10 

# Load the cleaned data 
try:
    with open(CLEAN_JSON_INPUT, 'r', encoding='utf-8') as f:
        cleaned_data = json.load(f)
except Exception as e:
    print(f"❌ ERROR: Could not load cleaned JSON data: {e}")
    cleaned_data = []

# --- Email Regex Pattern ---
# Matches standard email structure: username@domain.tld
EMAIL_PATTERN = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}'

all_extracted_emails = set()
article_count = 0

if cleaned_data:
    print(f"Starting email extraction from {len(cleaned_data)} CLEANED articles...")

    for article in cleaned_data:
        article_count += 1
        # Use the 'clean_text' field
        clean_text = article.get("clean_text", "")
        
        # Find all email matches in the cleaned text
        found_emails = re.findall(EMAIL_PATTERN, clean_text)
        
        # Add the found emails to a set to ensure uniqueness
        for email in found_emails:
            # Simple case-insensitive filter to exclude common service (non-author) addresses
            if not email.lower().startswith(('support@', 'info@', 'admin@', 'careers@')):
                all_extracted_emails.add(email)
            
    # --- Random Sampling (Select 10 Emails) ---
    print("\n--- Extraction Complete ---")
    
    if len(all_extracted_emails) >= EMAIL_SAMPLE_SIZE:
        # Convert set to list for random sampling
        random_sample = random.sample(list(all_extracted_emails), EMAIL_SAMPLE_SIZE)
        print(f"Successfully extracted {len(all_extracted_emails)} unique email addresses.")
        print(f"Displaying a random sample of {EMAIL_SAMPLE_SIZE} extracted emails:")
    else:
        random_sample = list(all_extracted_emails)
        print(f"Only {len(all_extracted_emails)} email addresses were extracted. Displaying all of them:")


    # --- Output Summary (Random Sample) ---
    print("-" * 30)
    for i, email in enumerate(random_sample):
        print(f"{i+1:02d}. Email: {email}")
    print("-" * 30)

# The email list is stored in 'all_extracted_emails' if you need it for later.

Starting email extraction from 60 CLEANED articles...

--- Extraction Complete ---
Only 0 email addresses were extracted. Displaying all of them:
------------------------------
------------------------------


#### This cell addresses Question 4 by extracting email addresses from the *cleaned* text (`scraped_articles_CLEANED.json`). It uses `re.findall` with a standard email pattern (`r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}'`) to find every string with the typical `username@domain.tld` structure. Matches that begin with a common service prefix ("support@", "info@", "admin@", "careers@") are discarded, and the remaining addresses are collected in a set to ensure uniqueness. A random sample (or the full set, if fewer than 10 are found) is displayed. On this dataset, no email addresses were found in the cleaned text.
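One possible reason for an empty result is that author addresses, where a site publishes them at all, sit in `mailto:` links that are removed when tags are stripped. This has not been verified against this dataset. The sketch below searches HTML directly instead; `MAILTO_PATTERN`, `SERVICE_PREFIXES`, and `extract_emails` are hypothetical names, and the pattern relaxes the TLD length limit to `{2,}`.

```python
import re

EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# Captures the address part of href="mailto:...", stopping before any ?subject= query.
MAILTO_PATTERN = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)
SERVICE_PREFIXES = ('support@', 'info@', 'admin@', 'careers@')

def extract_emails(raw_html):
    """Collect unique addresses from visible text and mailto links in an HTML string."""
    candidates = set(EMAIL_PATTERN.findall(raw_html))
    candidates.update(MAILTO_PATTERN.findall(raw_html))
    return sorted(
        email for email in candidates
        if EMAIL_PATTERN.fullmatch(email)
        and not email.lower().startswith(SERVICE_PREFIXES)
    )

html_snippet = ('<a href="mailto:Jane.Doe@example.com?subject=Hi">Email the author</a>'
                '<p>General enquiries: INFO@example.com</p>')
print(extract_emails(html_snippet))
```

In the notebook this function would be applied to each article's `html_content` field rather than to `clean_text`.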

## Q5 - Extract Phone Numbers (from Cleaned Text)

In [22]:
import re
import json
import random
import time # Ensure time is imported

# --- Configuration ---
CLEAN_JSON_INPUT = 'scraped_articles_CLEANED.json'
NUMBER_SAMPLE_SIZE = 10

# --- Load the cleaned data ---
try:
    with open(CLEAN_JSON_INPUT, 'r', encoding='utf-8') as f:
        cleaned_data = json.load(f)
    print(f"Successfully loaded {len(cleaned_data)} articles from {CLEAN_JSON_INPUT}")
except Exception as e:
    print(f"ERROR: Could not load cleaned JSON data: {e}")
    cleaned_data = []

# --- Phone Number Regex Pattern ---
# Matches 10-digit North American-style numbers, with an optional +country-code prefix.
PHONE_NUMBER_PATTERN = r'(\+\d{1,3}[-.\s]?)?(\(?\d{3}\)?[-.\s]?)(\d{3}[-.\s]?)(\d{4})'

# --- Test the Regex ---
print("\n--- Testing Regex Pattern ---")
test_strings = [
    "Call 555-123-4567 for info.",
    "Contact us at (123) 456-7890.",
    "Number: +1 987 654 3210.",
    "Maybe 456-7890 or just 1234567.", # 7 digits straight
    "Not a phone 1234.",
    "Text with 2024 year."
]
for test_str in test_strings:
    matches = re.findall(PHONE_NUMBER_PATTERN, test_str)
    reconstructed = ["".join(m).strip() for m in matches if len(re.sub(r'[^\d]', '', "".join(m))) >= 7]
    print(f"String: '{test_str}' -> Found Tuples: {matches} -> Reconstructed & Validated: {reconstructed}")
print("-" * 30)


# --- Extraction from Articles ---
all_extracted_numbers = set()
article_count = 0
found_tuples_count = 0 # Track raw matches

if cleaned_data:
    print(f"\nStarting phone number extraction from {len(cleaned_data)} CLEANED articles...")
    start_time = time.time()

    for article_index, article in enumerate(cleaned_data): # Get index for reporting
        article_count += 1
        clean_text = article.get("clean_text", "")
        if not clean_text: continue # Skip empty articles

        # Find all potential matches (returns list of tuples)
        found_tuples = re.findall(PHONE_NUMBER_PATTERN, clean_text)

        if found_tuples: # If the regex found *anything*
            found_tuples_count += len(found_tuples)
            # Optional: Print raw tuples found for debugging
            # print(f"  Raw tuples found in article {article_index}: {found_tuples}")

            # Process the results
            for match_tuple in found_tuples:
                number_str = "".join(match_tuple).strip()
                digits_only = re.sub(r'[^\d]', '', number_str)

                # Sanity check: at least 7 digits (the pattern itself already requires 10)
                if len(digits_only) >= 7:
                    all_extracted_numbers.add(number_str)
                    # Optional: Print validated number found
                    # print(f"  Validated number in article {article_index}: {number_str}")


    end_time = time.time()
    print(f"\nExtraction loop finished in {end_time - start_time:.2f} seconds.")
    print(f"Total raw regex matches (tuples found): {found_tuples_count}")

    # --- Random Sampling (Select 10 Phone Numbers) ---
    print("\n--- Extraction Complete ---")

    if len(all_extracted_numbers) >= NUMBER_SAMPLE_SIZE:
        random_sample = random.sample(list(all_extracted_numbers), NUMBER_SAMPLE_SIZE)
        print(f"Successfully extracted {len(all_extracted_numbers)} unique validated phone numbers.")
        print(f"Displaying a random sample of {NUMBER_SAMPLE_SIZE} extracted phone numbers:")
    elif len(all_extracted_numbers) > 0:
         random_sample = list(all_extracted_numbers)
         print(f"Successfully extracted {len(all_extracted_numbers)} unique validated phone numbers.")
         print(f"Displaying all extracted phone numbers:")
    else: # No numbers found
        random_sample = []
        print("No validated phone numbers were extracted.")
        print("Possible reasons: No numbers in text, regex too specific, or cleaning issues.")


    # --- Output Summary (Random Sample) ---
    if random_sample:
        print("-" * 30)
        for i, number in enumerate(random_sample):
            print(f"{i+1:02d}. Phone: {number}")
        print("-" * 30)

else:
    print("\nNo cleaned data loaded. Cannot extract phone numbers.")

Successfully loaded 60 articles from scraped_articles_CLEANED.json

--- Testing Regex Pattern ---
String: 'Call 555-123-4567 for info.' -> Found Tuples: [('', '555-', '123-', '4567')] -> Reconstructed & Validated: ['555-123-4567']
String: 'Contact us at (123) 456-7890.' -> Found Tuples: [('', '(123) ', '456-', '7890')] -> Reconstructed & Validated: ['(123) 456-7890']
String: 'Number: +1 987 654 3210.' -> Found Tuples: [('+1 ', '987 ', '654 ', '3210')] -> Reconstructed & Validated: ['+1 987 654 3210']
String: 'Maybe 456-7890 or just 1234567.' -> Found Tuples: [] -> Reconstructed & Validated: []
String: 'Not a phone 1234.' -> Found Tuples: [] -> Reconstructed & Validated: []
String: 'Text with 2024 year.' -> Found Tuples: [] -> Reconstructed & Validated: []
------------------------------

Starting phone number extraction from 60 CLEANED articles...

Extraction loop finished in 0.02 seconds.
Total raw regex matches (tuples found): 0

--- Extraction Complete ---
No validated phone numbers 

#### This cell answers Question 5 by attempting to extract phone numbers from the *cleaned* text (`scraped_articles_CLEANED.json`). It first tests its regex pattern, `r'(\+\d{1,3}[-.\s]?)?(\(?\d{3}\)?[-.\s]?)(\d{3}[-.\s]?)(\d{4})'`, against sample strings. The pattern has an optional country-code group followed by three required groups (3, 3, and 4 digits) with optional separators, so it matches 10-digit North American-style numbers with or without an international prefix; bare 7-digit numbers are not matched. Because the pattern contains capturing groups, `re.findall` returns tuples, which the code joins back into a single string. It then checks that the result contains at least 7 digits (a check every match already satisfies, since the pattern requires 10) before adding it to a set. On this dataset, the pattern produced zero matches.
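The tuple reconstruction step can be avoided by making the groups non-capturing and reading the whole match. The sketch below is a variant, not the notebook's code: `PHONE_RE` and `find_phone_numbers` are hypothetical names, and the digit lookarounds are an added assumption meant to stop the pattern from matching the first 10 digits of a longer number such as an ID.

```python
import re

PHONE_RE = re.compile(
    r'(?<!\d)'                  # not preceded by another digit
    r'(?:\+\d{1,3}[-.\s]?)?'    # optional country code, e.g. "+1 "
    r'\(?\d{3}\)?[-.\s]?'       # area code, optionally in parentheses
    r'\d{3}[-.\s]?'             # three-digit prefix
    r'\d{4}'                    # four-digit line number
    r'(?!\d)'                   # not followed by another digit
)

def find_phone_numbers(text):
    """Return every full phone-number match in text as a string."""
    return [match.group(0) for match in PHONE_RE.finditer(text)]

for sample in ["Call 555-123-4567 for info.",
               "Number: +1 987 654 3210.",
               "Tracking ID 12345678901234"]:
    print(sample, "->", find_phone_numbers(sample))
```

Since the pattern guarantees 10 digits per match, no separate digit-count validation is needed with this form.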

## Q7 - Extract Organizations/Companies (from Cleaned Text)

In [14]:
import re
import json
import random

# --- Configuration ---
# File containing the cleaned text content
CLEAN_JSON_INPUT = 'scraped_articles_CLEANED.json' 
ORG_SAMPLE_SIZE = 10 

# Load the cleaned data 
try:
    with open(CLEAN_JSON_INPUT, 'r', encoding='utf-8') as f:
        cleaned_data = json.load(f)
except Exception as e:
    print(f"ERROR: Could not load cleaned JSON data: {e}")
    cleaned_data = []

# --- Organization Regex Pattern (Suffix-Based) ---
# Targets capitalized phrases ending in a formal business/organization suffix.
ORG_PATTERN = r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+(?:Corp|Inc|Ltd|GmbH|Co|Foundation|Agency|Council|Group|Bank|Hospital|University)\b)'

all_extracted_orgs = set()
article_count = 0

if cleaned_data:
    print(f"Starting organization extraction from {len(cleaned_data)} CLEANED articles...")

    for article in cleaned_data:
        article_count += 1
        clean_text = article.get("clean_text", "")
        
        # Find all organizational matches in the clean text
        # re.MULTILINE is not needed here, but re.IGNORECASE would make it too broad.
        found_orgs = re.findall(ORG_PATTERN, clean_text)
        
        # Add the found organizations to a set for uniqueness
        for org in found_orgs:
            # We enforce a minimum length and filter out common single words to reduce noise
            if len(org.split()) >= 2 and org.lower() not in ['group', 'bank', 'company']:
                all_extracted_orgs.add(org.strip())
            
    # --- Random Sampling (Select 10 Organizations) ---
    print("\n--- Extraction Complete ---")
    
    if len(all_extracted_orgs) >= ORG_SAMPLE_SIZE:
        random_sample = random.sample(list(all_extracted_orgs), ORG_SAMPLE_SIZE)
        print(f"Successfully extracted {len(all_extracted_orgs)} unique organizations/companies.")
        print(f"Displaying a random sample of {ORG_SAMPLE_SIZE} extracted names:")
    else:
        random_sample = list(all_extracted_orgs)
        print(f"Only {len(all_extracted_orgs)} names were extracted. Displaying all of them:")


    # --- Output Summary (Random Sample) ---
    print("-" * 30)
    for i, org in enumerate(random_sample):
        print(f"{i+1:02d}. Organization: {org}")
    print("-" * 30)

Starting organization extraction from 60 CLEANED articles...

--- Extraction Complete ---
Successfully extracted 26 unique organizations/companies.
Displaying a random sample of 10 extracted names:
------------------------------
01. Organization: Tufts University
02. Organization: Asahi Group
03. Organization: Birmingham City Council
04. Organization: Every West Bank
05. Organization: Beef Products Inc
06. Organization: Arturo Jimenez Anadolu Agency
07. Organization: Nasser Hospital
08. Organization: News Corp
09. Organization: World Bank
10. Organization: Ahli Arab Hospital
------------------------------


#### This cell addresses Question 7 by extracting potential organization names from the *cleaned* text (`scraped_articles_CLEANED.json`). It uses `re.findall` with the pattern `r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+(?:Corp|...|University)\b)'`. The pattern searches for one or more capitalized words followed by a common organizational suffix (Inc, Ltd, Agency, University, etc.) ending at a word boundary. The results are filtered to keep only multi-word names and exclude generic matches like "Bank", and the unique names are collected in a set before a random sample is displayed. Because the pattern relies only on capitalization and suffixes, it also captures false positives such as "Every West Bank" (a place name) and "Arturo Jimenez Anadolu Agency" (what appears to be a personal name fused with the agency that follows it).
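A light post-filter can remove some of the noise visible in the sample output, such as "Every West Bank". The sketch below is illustrative only: `LEADING_NOISE`, `NON_ORG_PHRASES`, and `extract_orgs` are hypothetical names, and both word lists are assumptions that would have to be built up by inspecting real matches.

```python
import re

# Same suffix-based pattern as in the cell above.
ORG_PATTERN = (r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+'
               r'(?:Corp|Inc|Ltd|GmbH|Co|Foundation|Agency|Council|Group|Bank|Hospital|University)\b)')

# Assumed capitalized sentence-starters that are not part of a name.
LEADING_NOISE = {"The", "Every", "An", "This", "That", "In", "At", "On"}
# Assumed phrases that end in a suffix word but are not organizations.
NON_ORG_PHRASES = {"West Bank"}

def extract_orgs(text):
    """Return organization candidates with leading noise words removed."""
    orgs = set()
    for candidate in re.findall(ORG_PATTERN, text):
        words = candidate.split()
        while words and words[0] in LEADING_NOISE:
            words.pop(0)
        name = " ".join(words)
        if len(words) >= 2 and name not in NON_ORG_PHRASES:
            orgs.add(name)
    return orgs

text = "Every West Bank town was affected. The World Bank and Tufts University responded."
print(sorted(extract_orgs(text)))
```

A filter like this cannot separate a person's name from an adjacent agency name; that case needs either cleaner input text or a proper named-entity recognizer.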

## Q9 - Extracting Titles/Headlines (from No-Script HTML)

In [15]:
import re
import json
import random

# --- Configuration ---
# *** IMPORTANT: This cell reads the script-stripped HTML file, not the cleaned text ***
JSON_INPUT_FILENAME = 'scraped_articles_NO_SCRIPTS.json' 
HEADLINE_SAMPLE_SIZE = 10 

# Load the script-stripped HTML data
try:
    with open(JSON_INPUT_FILENAME, 'r', encoding='utf-8') as f:
        raw_data = json.load(f)
except Exception as e:
    print(f"ERROR: Could not load RAW JSON data: {e}")
    raw_data = []

# --- Headline Regex Pattern (Targeting <title> tag) ---
# Captures the contents of the first <title> element (assumes the opening tag has no attributes).
HEADLINE_PATTERN = r'<title>(.*?)</title>'

extracted_headlines_list = []
article_count = 0

if raw_data:
    print(f"Starting headline extraction from {len(raw_data)} RAW articles (using <title> tag)...")

    for article in raw_data:
        article_count += 1
        # Use the 'html_content' field from the raw data
        raw_html = article.get("html_content", "")
        extracted_headline = "NOT FOUND"
        
        # re.search returns the first <title> match, which is normally the document title in <head>
        match = re.search(HEADLINE_PATTERN, raw_html, re.DOTALL | re.IGNORECASE)
        
        if match:
            # Group 1 holds the captured title text; strip surrounding whitespace
            headline = match.group(1).strip()
            # Remove site-name suffixes such as ' | CNN' or ' - BBC News'.
            # Note: each substitution cuts at the FIRST separator, so a headline that
            # itself contains ' | ' or ' - ' is truncated at that point.
            headline = re.sub(r' \| .*$', '', headline)
            headline = re.sub(r' - .*$', '', headline)
            extracted_headline = headline
        
        extracted_headlines_list.append({
            "source_url": article.get("source_url", "N/A"),
            "headline": extracted_headline
        })

    # --- Random Sampling (Select 10 Headlines) ---
    print("\n--- Extraction Complete ---")
    
    found_headlines = [item for item in extracted_headlines_list if item['headline'] != 'NOT FOUND']
    
    if len(found_headlines) >= HEADLINE_SAMPLE_SIZE:
        random_sample = random.sample(found_headlines, HEADLINE_SAMPLE_SIZE)
        print(f"Successfully extracted headlines for {len(found_headlines)} articles.")
        print(f"Displaying a random sample of {HEADLINE_SAMPLE_SIZE} extracted headlines:")
    else:
        random_sample = found_headlines
        print(f"Only {len(found_headlines)} headlines were extracted. Displaying all of them:")


    # --- Output Summary (Random Sample) ---
    print("-" * 30)
    for i, item in enumerate(random_sample):
        # Print the source URL alongside the full extracted headline
        print(f"{i+1:01d}. URL: {item['source_url']} | Headline: {item['headline']}")
    print("-" * 30)

Starting headline extraction from 60 RAW articles (using <title> tag)...

--- Extraction Complete ---
Successfully extracted headlines for 60 articles.
Displaying a random sample of 10 extracted headlines:
------------------------------
1. URL: https://www.bbc.com/news/articles/clykwd9e256o?at_medium=RSS&at_campaign=rss | Headline: Cuban dissident José Daniel Ferrer arrives in US exile
2. URL: https://www.cnn.com/2023/04/17/media/dominion-fox-news-allegations/index.html | Headline: Here are the Fox broadcasts and tweets Dominion says were defamatory
3. URL: https://www.cnn.com/2023/04/17/entertainment/jamie-foxx-remains-hospitalized/index.html | Headline: Jamie Foxx remains hospitalized nearly a week after experiencing ‘medical complication’
4. URL: https://www.cnn.com/2023/04/18/politics/mccarthy-biden-debt-ceiling/index.html | Headline: The US economy could depend on McCarthy corralling his extremist Republican troops
5. URL: https://www.bbc.com/news/articles/c0rpwk51qxro?at_medium=R

#### This cell answers Question 9 by extracting article headlines. It reads the partially cleaned (script-stripped) HTML from `scraped_articles_NO_SCRIPTS.json` (output of Cell 8). It applies `re.search` with the pattern `r'<title>(.*?)</title>'` to capture the text between the HTML title tags, using `re.IGNORECASE` to match the tag in any case and `re.DOTALL` so that titles spanning multiple lines are captured. Two `re.sub` calls then remove site-name suffixes often appended to titles (e.g., " | CNN" or " - BBC News") by cutting everything from the first " | " or " - " onward. The extracted headlines are collected, and a random sample is printed.
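A slightly more defensive version of this step is sketched below. It is a variant under stated assumptions, not the notebook's code: `TITLE_RE`, `SITE_SUFFIX_RE`, and `extract_headline` are hypothetical names. It tolerates attributes on the `<title>` tag, decodes HTML entities with the standard-library `html.unescape`, and removes only the last separator-delimited segment.

```python
import html
import re

# Allow attributes on the opening tag, e.g. <title data-rh="true">.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)
# Match only the final " | Site" or " - Site" segment at the end of the title.
SITE_SUFFIX_RE = re.compile(r'\s+[|\-]\s+[^|\-]+$')

def extract_headline(raw_html):
    """Return the cleaned <title> text without its trailing site name, or None."""
    match = TITLE_RE.search(raw_html)
    if not match:
        return None
    title = html.unescape(match.group(1))        # &amp; -> &, etc.
    title = re.sub(r'\s+', ' ', title).strip()   # collapse newlines and runs of spaces
    return SITE_SUFFIX_RE.sub('', title)

page = '<title data-rh="true">Fox &amp; Dominion settle - what happens next | CNN Business</title>'
print(extract_headline(page))
```

For this example, cutting at the first " | " and then the first " - " would leave only "Fox & Dominion settle". The sketch has its own limitation: a site name that itself contains a hyphen or pipe is not removed, and a headline whose only " - " is part of the text still loses its final segment.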

## Q10 - Extracting Important Events/Incidents (from Cleaned Text)

In [16]:
import re
import json
import random

# --- Configuration ---
CLEAN_JSON_INPUT = 'scraped_articles_CLEANED.json' # Input: Fully cleaned text
EVENT_SAMPLE_SIZE = 10

# --- NEW Event Extraction Function ---
def extract_events(text):
    """Extracts potential events using multiple specific regex patterns."""
    events = []
    if text is None: # Handle None input
        return []

    try:
        # Pattern 1: Events with action verbs (attacked, killed, arrested, etc.)
        events.extend(re.findall(r'[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+(?:attacked|killed|arrested|bombed|invaded|struck|destroyed|launched|conducted|announced|declared|signed|passed|voted|elected|resigned|died|crashed|exploded|collapsed)\s+[a-z\s]+', text))

        # Pattern 2: Natural disasters and catastrophes
        events.extend(re.findall(r'(?:earthquake|tsunami|hurricane|tornado|flood|wildfire|storm|cyclone|typhoon|avalanche|landslide|drought|famine)\s+(?:in|hit|struck|devastated|destroyed)\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*', text, re.IGNORECASE))

        # Pattern 3: Political/military events
        events.extend(re.findall(r'(?:summit|conference|protest|rally|demonstration|war|conflict|battle|attack|strike|raid|invasion|election|referendum|coup|revolution)\s+(?:in|at|on)\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*', text, re.IGNORECASE))

        # Pattern 4: Incidents with casualties
        events.extend(re.findall(r'\d+\s+(?:people|civilians|soldiers|protesters|victims)\s+(?:killed|injured|wounded|died|missing|evacuated|arrested|detained)', text, re.IGNORECASE))

        # Pattern 5: Major announcements
        events.extend(re.findall(r'(?:announced|declared|imposed|launched|initiated|suspended|cancelled|postponed)\s+(?:a|an|the)?\s*[a-z\s]{3,30}', text, re.IGNORECASE))

        # Pattern 6: Specific event names (capitalized multi-word phrases ending in keywords)
        events.extend(re.findall(r'\b(?:[A-Z][a-z]+\s+){2,4}(?:Summit|Conference|Agreement|Treaty|Accord|Crisis|Disaster|Attack|Incident)\b', text))

        # --- Filtering within the function ---
        # 1. Clean whitespace
        cleaned_events = [e.strip() for e in events]
        # 2. Filter by length
        cleaned_events = [e for e in cleaned_events if 10 < len(e) < 150]
        # 3. Filter common sentence starters
        cleaned_events = [e for e in cleaned_events if not e.lower().startswith(('the ', 'this ', 'that '))]
        # 4. Remove duplicates
        unique_events = list(set(cleaned_events))

        return unique_events

    except Exception as e:
        print(f"  Warning: Regex error during event extraction - {e}")
        return [] # Return empty list on error for this article


# --- Load the Cleaned Data ---
try:
    with open(CLEAN_JSON_INPUT, 'r', encoding='utf-8') as f:
        cleaned_data = json.load(f)
    print(f"Successfully loaded {len(cleaned_data)} articles from {CLEAN_JSON_INPUT}")
except Exception as e:
    print(f"ERROR: Could not load cleaned JSON data: {e}")
    cleaned_data = []

# --- Main Extraction Loop ---
all_extracted_events = set() # Use a set to store unique events across all articles

if cleaned_data:
    print(f"\nStarting event extraction from {len(cleaned_data)} articles...")
    articles_processed_count = 0
    for article in cleaned_data:
        clean_text = article.get("clean_text", "")

        # Call the new function to get events for this article
        found_events = extract_events(clean_text)

        # Add the unique events found in this article to the overall set
        if found_events:
            all_extracted_events.update(found_events)
        articles_processed_count += 1

    print(f"\nProcessed {articles_processed_count} articles.")

    # --- Random Sampling and Output ---
    print("\n--- Extraction Complete ---")

    if not all_extracted_events:
        random_sample = []  # Keep the name defined for the output step below
        print("No events extracted matching the defined patterns.")
    elif len(all_extracted_events) >= EVENT_SAMPLE_SIZE:
        random_sample = random.sample(list(all_extracted_events), EVENT_SAMPLE_SIZE)
        print(f"Successfully extracted {len(all_extracted_events)} unique potential event mentions.")
        print(f"Displaying a random sample of {EVENT_SAMPLE_SIZE} extracted events:")
    else: # Fewer than SAMPLE_SIZE found
        random_sample = list(all_extracted_events)
        print(f"Only {len(all_extracted_events)} unique potential event mentions were extracted. Displaying all of them:")

    # --- Output Summary (Random Sample) ---
    if random_sample:
        print("-" * 30)
        for i, event in enumerate(random_sample):
            # Limit length for display if needed
            print(f"{i+1:01d}. Event: {event[:100]}{'...' if len(event)>100 else ''}")
        print("-" * 30)

else:
    print("\nNo cleaned data loaded. Cannot extract events.")

Successfully loaded 60 articles from scraped_articles_CLEANED.json

Starting event extraction from 60 articles...

Processed 60 articles.

--- Extraction Complete ---
Successfully extracted 100 unique potential event mentions.
Displaying a random sample of 10 extracted events:
------------------------------
1. Event: Conference on Arms Control
2. Event: cancelled ceremonies
3. Event: Hamas attacked on
4. Event: war in Ukraine and called for peace in the conflict
5. Event: Commerce announced export controls preventing
6. Event: protest in front of Argentina
7. Event: war in Gaza began
8. Event: war on the Palestinian population after the ceasefire
9. Event: strike on alleged drug vessel near VenezuelaNobel PrizeVenezuelaRelatedChinese Nobel laureate and ph...
10. Event: declared in Proclamation
------------------------------


#### This cell answers Question 10 by extracting potential event mentions from the *cleaned* text (`scraped_articles_CLEANED.json`). It defines a function, `extract_events`, that applies six regex patterns, each targeting a different event type: actions described with verbs, natural disasters, political or military keywords followed by a location, casualty reports, announcements, and capitalized names ending in keywords such as "Summit". The function collects all matches, keeps those between 11 and 149 characters long, drops matches that start with "the", "this", or "that", and removes duplicates. The main loop calls this function on each article, aggregates the unique results into a set, and displays a random sample. As the sample shows, the matches are candidate phrases rather than confirmed events, and some are fragments (for example, "Hamas attacked on").
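Some of the run-on matches in the sample (for example, "war in Ukraine and called for peace in the conflict") follow from passing `re.IGNORECASE` to the whole pattern: the flag also makes `[A-Z][a-z]+` match lowercase words, so the "capitalized place name" part keeps consuming ordinary text. The sketch below shows one way to limit the flag to the keyword group using a scoped inline flag, `(?i:...)`. `KEYWORDS` is an illustrative subset of the Pattern 3 keywords, and `LOOSE` and `SCOPED` are hypothetical names.

```python
import re

KEYWORDS = r'war|protest|summit|election'    # illustrative subset of Pattern 3
PLACE = r'[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*'    # one or more capitalized words

# Global flag: [A-Z] and [a-z] both match any letter, so PLACE matches any words.
LOOSE = re.compile(rf'(?:{KEYWORDS})\s+(?:in|at|on)\s+{PLACE}', re.IGNORECASE)
# Scoped flag: only the keyword group ignores case; PLACE stays case-sensitive.
SCOPED = re.compile(rf'(?i:{KEYWORDS})\s+(?:in|at|on)\s+{PLACE}')

text = "The war in Ukraine and called for peace in the conflict."
print(LOOSE.findall(text))
print(SCOPED.findall(text))
```

The same change would apply to Pattern 2, which combines keyword matching with a capitalized-location tail in the same way.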