# Introduction to Scraping Vanderbilt DSI and Blog Posts
<hr>
As the README.md notes, getting the right data from news sources has proven challenging. Any new data, however, should serve equally well as a base for the same underlying theory, namely that world knowledge can be updated. We therefore scrape the Vanderbilt DSI pages and recent blog posts instead. This approach may prove easier than collecting news items and creating ground truth for all of them.


In [90]:
# Importing Libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
import os
import json
from tqdm import tqdm
from urllib.parse import urljoin
from dotenv import load_dotenv
from datetime import datetime

load_dotenv()


True

In [91]:
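# Setting up the environment and constants
DATA_PATH = os.getenv('DATA_PATH')
# os.getenv returns None when the variable is unset, so check that case first
if DATA_PATH is None:
    raise EnvironmentError("DATA_PATH is not set; define it in the environment or in .env.")
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"The directory {DATA_PATH} does not exist.")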

# Base URL to scrape
base_url = "https://www.vanderbilt.edu/datascience/"


In [72]:
def get_filtered_links(response):
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all links with the class "story-list__image-link"
    links = soup.find_all('a', class_='story-list__image-link')
    filtered_links = [link.get('href') for link in links]
    print(f"Found {len(filtered_links)} links: {filtered_links}")
    return filtered_links

url_to_scrape = "https://www.vanderbilt.edu/datascience/2024/page/1"
print(f"Scraping URL: {url_to_scrape}")
response = requests.get(url_to_scrape)
print(f"Response status code: {response.status_code}")
if response.status_code == 200:
    filtered_links = get_filtered_links(response)
    print(f"Filtered links: {filtered_links}")
else:
    print(f"Failed to retrieve {url_to_scrape}: Status code {response.status_code}")
filtered_links

Scraping URL: https://www.vanderbilt.edu/datascience/2024/page/1
Response status code: 200
Found 10 links: ['https://www.vanderbilt.edu/datascience/2024/10/31/8549/', 'https://www.vanderbilt.edu/datascience/2024/10/24/transforming-automotive-insights-the-nissan-vanderbilt-collaboration/', 'https://www.vanderbilt.edu/datascience/2024/10/03/vanderbilt-data-science-postdoctoral-fellows-push-the-boundaries-of-generative-ai-in-protein-science/', 'https://www.vanderbilt.edu/datascience/2024/10/01/deep-dive-preview-leveraging-ai-for-usability-and-safety-in-electronic-health-records/', 'https://www.vanderbilt.edu/datascience/2024/09/12/ai-flash/', 'https://www.vanderbilt.edu/datascience/2024/09/10/8389/', 'https://www.vanderbilt.edu/datascience/2024/09/02/ai-deep-dive-smart-data-synergy/', 'https://www.vanderbilt.edu/datascience/2024/09/01/ai-summer-showcase-innovating-across-health-education-and-science/', 'https://www.vanderbilt.edu/datascience/2024/08/13/predicting-pediatric-malpractice-ris

['https://www.vanderbilt.edu/datascience/2024/10/31/8549/',
 'https://www.vanderbilt.edu/datascience/2024/10/24/transforming-automotive-insights-the-nissan-vanderbilt-collaboration/',
 'https://www.vanderbilt.edu/datascience/2024/10/03/vanderbilt-data-science-postdoctoral-fellows-push-the-boundaries-of-generative-ai-in-protein-science/',
 'https://www.vanderbilt.edu/datascience/2024/10/01/deep-dive-preview-leveraging-ai-for-usability-and-safety-in-electronic-health-records/',
 'https://www.vanderbilt.edu/datascience/2024/09/12/ai-flash/',
 'https://www.vanderbilt.edu/datascience/2024/09/10/8389/',
 'https://www.vanderbilt.edu/datascience/2024/09/02/ai-deep-dive-smart-data-synergy/',
 'https://www.vanderbilt.edu/datascience/2024/09/01/ai-summer-showcase-innovating-across-health-education-and-science/',
 'https://www.vanderbilt.edu/datascience/2024/08/13/predicting-pediatric-malpractice-risks-with-ai/',
 'https://www.vanderbilt.edu/datascience/2024/08/13/enhancing-diagnostics-with-ai-dri

In [80]:
filtered_links = []
for year in tqdm(range(2021, 2025), desc="Years"):
    for page in range(1, 10):
        url_to_scrape = f"{base_url}{year}/page/{page}"
        print(f"Scraping url: {url_to_scrape}")
        response = requests.get(url_to_scrape)
        if response.status_code != 200:
            # A non-200 response is treated as the end of that year's pages
            print(f"Failed to retrieve {url_to_scrape}: Status code {response.status_code}")
            break
        filtered_links += get_filtered_links(response)
print(f"Total links collected: {len(filtered_links)}")
filtered_links

Years:   0%|          | 0/4 [00:00<?, ?it/s]

Scraping url: https://www.vanderbilt.edu/datascience/2021/page/1
Found 1 links: ['https://www.vanderbilt.edu/datascience/2021/09/02/vanderbilt-data-science-institute-and-witt-announce-new-scholarship-and-winner/']
Scraping url: https://www.vanderbilt.edu/datascience/2021/page/2


Years:  25%|██▌       | 1/4 [00:00<00:02,  1.15it/s]

Failed to retrieve https://www.vanderbilt.edu/datascience/2021/page/2: Status code 404
Scraping url: https://www.vanderbilt.edu/datascience/2022/page/1
Found 10 links: ['https://www.vanderbilt.edu/datascience/2022/12/12/surgical-informed-consent-nlp-research/', 'https://www.vanderbilt.edu/datascience/2022/12/02/data-matters-to-host-first-spring-short-course-series-from-march-13-16-tennessee/', 'https://www.vanderbilt.edu/datascience/2022/11/18/ai-winter-intensive-workshop-jan-3-6/', 'https://www.vanderbilt.edu/datascience/2022/11/14/detecting-colliding-black-holes-ai-deep-dive-dec-2/', 'https://www.vanderbilt.edu/datascience/2022/11/14/reminders-for-community-safety/', 'https://www.vanderbilt.edu/datascience/2022/11/09/ai-deep-learning-class-open-to-graduate-students-seniors/', 'https://www.vanderbilt.edu/datascience/2022/11/08/becoming-a-better-negotiator-applying-deep-learning-for-negotiation-research/', 'https://www.vanderbilt.edu/datascience/2022/11/07/ai-deep-dive-nov-11-state-of-

Years:  50%|█████     | 2/4 [00:03<00:03,  1.86s/it]

Failed to retrieve https://www.vanderbilt.edu/datascience/2022/page/7: Status code 404
Scraping url: https://www.vanderbilt.edu/datascience/2023/page/1
Found 10 links: ['https://www.vanderbilt.edu/datascience/2023/12/12/vanderbilt-universitys-data-science-institute-presents-ai-winter-2024/', 'https://www.vanderbilt.edu/datascience/2023/12/05/reflecting-on-an-inspiring-ai-showcase-at-the-dsi/', 'https://news.vanderbilt.edu/2023/11/16/jules-white-appointed-to-senior-advisor-role-in-office-of-the-chancellor/#new_tab', 'https://as.vanderbilt.edu/robert-penn-warren-center/faculty-advisory-committee/faculty-fellowship-opportunities/#new_tab', 'https://law.vanderbilt.edu/vpa-helps-to-chart-the-future-of-ai-governance-in-washington/?j=432515&sfmc_sub=48800757&l=300_HTML&u=12819024&mid=514009777&jb=1001&utm_source=sfmc&utm_medium=email&utm_campaign=Staff+MyVU+111623&utm_term=LH+Link+4&utm_id=432515&sfmc_id=48800757#new_tab', 'https://www.vanderbilt.edu/datascience/2023/11/02/ai-deep-dive-harnes

Years:  75%|███████▌  | 3/4 [00:05<00:02,  2.07s/it]

Found 10 links: ['https://www.vanderbilt.edu/datascience/2023/02/21/data-science-institute-hosting-women-in-data-science-conference-march-7/', 'https://www.vanderbilt.edu/datascience/2023/02/15/vandygraf-hosting-nasa-astrophysicist-mohammad-safarzadeh-to-discuss-ai-applications-to-gravity-waves/', 'https://www.vanderbilt.edu/datascience/2023/02/15/ai-in-the-wild-how-might-ai-reshape-the-work-of-data-scientists/', 'https://www.vanderbilt.edu/datascience/2023/02/12/wondry-hosting-quantum-machine-learning-expert-to-talk-using-quantum-computing-with-machine-learning/', 'https://www.vanderbilt.edu/datascience/2023/02/12/ai-deep-dive-2-3-exploring-prompt-engineering-for-stable-diffusion-models/', 'https://www.vanderbilt.edu/datascience/2023/02/12/ai-deep-dive-2-3-exploring-prompt-engineering-for-stable-diffusion-models/', 'https://www.vanderbilt.edu/datascience/2023/02/01/applications-due-feb-15-for-vanderbilt-universitys-fall-2023-data-science-m-s-program/', 'https://www.vanderbilt.edu/data

Years: 100%|██████████| 4/4 [00:07<00:00,  1.75s/it]

Failed to retrieve https://www.vanderbilt.edu/datascience/2024/page/4: Status code 404
Total links collected: 172





['https://www.vanderbilt.edu/datascience/2021/09/02/vanderbilt-data-science-institute-and-witt-announce-new-scholarship-and-winner/',
 'https://www.vanderbilt.edu/datascience/2022/12/12/surgical-informed-consent-nlp-research/',
 'https://www.vanderbilt.edu/datascience/2022/12/02/data-matters-to-host-first-spring-short-course-series-from-march-13-16-tennessee/',
 'https://www.vanderbilt.edu/datascience/2022/11/18/ai-winter-intensive-workshop-jan-3-6/',
 'https://www.vanderbilt.edu/datascience/2022/11/14/detecting-colliding-black-holes-ai-deep-dive-dec-2/',
 'https://www.vanderbilt.edu/datascience/2022/11/14/reminders-for-community-safety/',
 'https://www.vanderbilt.edu/datascience/2022/11/09/ai-deep-learning-class-open-to-graduate-students-seniors/',
 'https://www.vanderbilt.edu/datascience/2022/11/08/becoming-a-better-negotiator-applying-deep-learning-for-negotiation-research/',
 'https://www.vanderbilt.edu/datascience/2022/11/07/ai-deep-dive-nov-11-state-of-the-art-audio-models-speech
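
The collected list contains at least one repeated URL (the `2023/02/12` Stable Diffusion post appears twice) and several links to other Vanderbilt sites such as `news.vanderbilt.edu`. Below is a minimal sketch of one way to tidy the list before fetching articles, assuming that only posts under `base_url` are wanted. The helper name `dedupe_dsi_links` and the sample list are hypothetical and not part of the notebook.

```python
from urllib.parse import urldefrag

BASE_URL = "https://www.vanderbilt.edu/datascience/"


def dedupe_dsi_links(links, base_url=BASE_URL):
    """Keep the first occurrence of each link that lives under base_url."""
    seen = set()
    kept = []
    for link in links:
        # Drop fragments such as "#new_tab" so variants of one URL compare equal
        url, _fragment = urldefrag(link)
        if url.startswith(base_url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Hypothetical sample: one duplicate and one external link
sample_links = [
    "https://www.vanderbilt.edu/datascience/2024/09/12/ai-flash/",
    "https://www.vanderbilt.edu/datascience/2024/09/12/ai-flash/",
    "https://news.vanderbilt.edu/2023/11/16/jules-white-appointed-to-senior-advisor-role-in-office-of-the-chancellor/#new_tab",
]
print(dedupe_dsi_links(sample_links))
```

Whether the external links should be dropped is a judgment call: they point at real articles, but their pages may not use the `rich-text` and `byline__line2` classes that the extraction step relies on.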

In [81]:

data = []

for link in filtered_links:
    response = requests.get(link)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        
        # Extract title
        title_tag = soup.find('title')
        title = title_tag.get_text(strip=True) if title_tag else 'No Title'
        
        # Extract content
        rich_text_divs = soup.find_all('div', class_='rich-text')
        content = ' '.join(div.get_text(strip=True) for div in rich_text_divs)
        
        # Extract date from p class "byline__line2"
        date_tag = soup.find('p', class_='byline__line2')
        if date_tag:
            date_str = date_tag.get_text(strip=True)
            try:
                date = datetime.strptime(date_str, '%b %d, %Y, %I:%M %p')
            except ValueError:
                date = 'Invalid Date Format'
        else:
            date = 'No Date'
        
        data.append({'url': link, 'title': title, 'content': content, 'date': date})

df = pd.DataFrame(data)
df

Unnamed: 0,url,title,content,date
0,https://www.vanderbilt.edu/datascience/2021/09...,Vanderbilt Data Science Institute and WiTT Ann...,WiTT awards first annual 50% tuition scholarsh...,2021-09-02 18:44:00
1,https://www.vanderbilt.edu/datascience/2022/12...,Surgical Informed Consent NLP Research | DSI |...,Researchers in the Surgical Ethics Program at ...,2022-12-12 19:21:00
2,https://www.vanderbilt.edu/datascience/2022/12...,Data Matters to host first spring short-course...,TheNational Consortium for Data Scienceis a co...,2022-12-02 15:50:00
3,https://www.vanderbilt.edu/datascience/2022/11...,AI Winter Intensive Workshop: Jan 3-6 | DSI | ...,The practice of data science is changing with ...,2022-11-18 17:26:00
4,https://www.vanderbilt.edu/datascience/2022/11...,Detecting Colliding Black Holes: AI Deep Dive ...,Cosmic cataclysmic events such as the collisio...,2022-11-14 17:39:00
...,...,...,...,...
166,https://www.vanderbilt.edu/datascience/2024/03...,Vanderbilt Data Science Students Triumph at Ha...,Vanderbilt Data Science Students Shine at Hack...,2024-03-01 12:26:00
167,https://www.vanderbilt.edu/datascience/2024/02...,AI Deep Dive: Decoding Interactions | DSI | Va...,AI Deep Dive: Decoding InteractionsEvent Overv...,2024-02-15 14:28:00
168,https://www.vanderbilt.edu/datascience/2024/02...,Empowering Change: The ‘Freequalizer’ Project ...,Spotlight on “Freequalizer”: A Beacon of Mento...,2024-02-01 12:59:00
169,https://www.vanderbilt.edu/datascience/2024/01...,AI Winter 2024 Recap | DSI | Vanderbilt Univer...,Reflecting on the AI Winter Workshop: A Journe...,2024-01-10 14:04:00
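
Row 2 above begins `TheNational Consortium for Data Scienceis`, which suggests that words from adjacent tags are being fused. `get_text(strip=True)` strips each text fragment and joins the fragments with an empty string, so text around inline tags such as links loses its surrounding spaces. The sketch below uses a made-up HTML snippet (the markup is an assumption, not the real page source) to show the effect and the `separator` argument that avoids it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a link inside a sentence
html = (
    '<div class="rich-text"><p>The '
    '<a href="#">National Consortium for Data Science</a>'
    ' is a collaboration.</p></div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="rich-text")

# Fragments are stripped and joined with no separator
fused = div.get_text(strip=True)
# Fragments are stripped and joined with a single space
spaced = div.get_text(separator=" ", strip=True)

print(fused)
print(spaced)
```

One side effect of `separator=" "` is that a space can appear before punctuation that directly follows an inline tag, which is usually easier to clean up afterwards than fused words.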


In [87]:
df.date = pd.to_datetime(df.date, errors='coerce')
df.to_parquet(os.path.join(DATA_PATH, 'vanderbilt_dsi_blog_posts.parquet'), index=False, coerce_timestamps='ms', allow_truncated_timestamps=True)
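
The extraction loop stores either a `datetime` or a placeholder string (`'Invalid Date Format'`, `'No Date'`) in the `date` column, and `errors='coerce'` then turns those placeholders into `NaT`. The sketch below shows a parser that returns `None` instead, so the column never mixes types. The function name `parse_byline_date` and the sample strings are assumptions, chosen to match the `'%b %d, %Y, %I:%M %p'` format used in the loop.

```python
from datetime import datetime
from typing import Optional

# Same format string as the extraction loop; %b matches month
# abbreviations according to the current locale
BYLINE_FORMAT = "%b %d, %Y, %I:%M %p"


def parse_byline_date(date_str: Optional[str]) -> Optional[datetime]:
    """Return a datetime for a byline string, or None if it cannot be parsed."""
    if not date_str:
        return None
    try:
        return datetime.strptime(date_str.strip(), BYLINE_FORMAT)
    except ValueError:
        return None


print(parse_byline_date("Sep 2, 2021, 6:44 PM"))  # hypothetical byline text
print(parse_byline_date("yesterday"))
print(parse_byline_date(None))
```

With `None` for missing values, `pd.DataFrame` can build a proper datetime column directly, and the later `pd.to_datetime(..., errors='coerce')` call becomes a safeguard rather than a repair step.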


# Converting Data to QA Pairs

In [107]:
df = pd.read_parquet(os.path.join(DATA_PATH, 'vanderbilt_dsi_blog_posts.parquet'))
df.head()

Unnamed: 0,url,title,content,date
0,https://www.vanderbilt.edu/datascience/2021/09...,Vanderbilt Data Science Institute and WiTT Ann...,WiTT awards first annual 50% tuition scholarsh...,2021-09-02 18:44:00
1,https://www.vanderbilt.edu/datascience/2022/12...,Surgical Informed Consent NLP Research | DSI |...,Researchers in the Surgical Ethics Program at ...,2022-12-12 19:21:00
2,https://www.vanderbilt.edu/datascience/2022/12...,Data Matters to host first spring short-course...,TheNational Consortium for Data Scienceis a co...,2022-12-02 15:50:00
3,https://www.vanderbilt.edu/datascience/2022/11...,AI Winter Intensive Workshop: Jan 3-6 | DSI | ...,The practice of data science is changing with ...,2022-11-18 17:26:00
4,https://www.vanderbilt.edu/datascience/2022/11...,Detecting Colliding Black Holes: AI Deep Dive ...,Cosmic cataclysmic events such as the collisio...,2022-11-14 17:39:00


In [108]:
import re
from langdetect import detect, DetectorFactory
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

# langdetect is non-deterministic by default; a fixed seed makes results reproducible
DetectorFactory.seed = 0

In [109]:
# Clean the text and flag English-language posts
def clean_text(text):
    text = re.sub(r'http\S+', '', text)             # remove URLs
    text = text.encode('ascii', 'ignore').decode()  # drop non-ASCII characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)         # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text)                # collapse repeated whitespace
    text = text.strip()
    return text.lower()

def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        # langdetect raises on empty or feature-less text; treat that as non-English
        return False

df.drop_duplicates(inplace=True)

df['clean_text'] = df['content'].apply(clean_text)

df['is_english'] = df['clean_text'].apply(is_english)

df.head()

Unnamed: 0,url,title,content,date,clean_text,is_english
0,https://www.vanderbilt.edu/datascience/2021/09...,Vanderbilt Data Science Institute and WiTT Ann...,WiTT awards first annual 50% tuition scholarsh...,2021-09-02 18:44:00,witt awards first annual tuition scholarship t...,True
1,https://www.vanderbilt.edu/datascience/2022/12...,Surgical Informed Consent NLP Research | DSI |...,Researchers in the Surgical Ethics Program at ...,2022-12-12 19:21:00,researchers in the surgical ethics program at ...,True
2,https://www.vanderbilt.edu/datascience/2022/12...,Data Matters to host first spring short-course...,TheNational Consortium for Data Scienceis a co...,2022-12-02 15:50:00,thenational consortium for data scienceis a co...,True
3,https://www.vanderbilt.edu/datascience/2022/11...,AI Winter Intensive Workshop: Jan 3-6 | DSI | ...,The practice of data science is changing with ...,2022-11-18 17:26:00,the practice of data science is changing with ...,True
4,https://www.vanderbilt.edu/datascience/2022/11...,Detecting Colliding Black Holes: AI Deep Dive ...,Cosmic cataclysmic events such as the collisio...,2022-11-14 17:39:00,cosmic cataclysmic events such as the collisio...,True
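
`clean_text` is fairly aggressive: besides URLs, it removes every character that is not an ASCII letter or whitespace, so digits, percentages, and punctuation disappear (compare `50% tuition` in `content` with `tuition` in `clean_text` for row 0). The self-contained sketch below repeats the function on a made-up sentence to make the effect visible. The sample text is an assumption, not scraped data.

```python
import re


def clean_text(text):
    text = re.sub(r'http\S+', '', text)             # remove URLs
    text = text.encode('ascii', 'ignore').decode()  # drop non-ASCII characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)         # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text)                # collapse repeated whitespace
    return text.strip().lower()


# Hypothetical input containing a percentage, a date, an en dash, and a URL
sample = "WiTT awards 50% scholarship on Sep. 2, 2021 \u2013 see https://example.com/post"
cleaned = clean_text(sample)
print(cleaned)
```

The date and the percentage are gone from the result, which matters if the questions are later meant to ask when something happened or how much was awarded.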


In [100]:
def create_qa_pair(text):
    # Simplified placeholder: the first four words become the "question"
    # and the remaining words the "answer".
    # Needs to be adapted based on actual text patterns.
    parts = text.split(' ')
    if len(parts) > 4:
        question = ' '.join(parts[:4]) + '?'
        answer = ' '.join(parts[4:])
        return {'question': question, 'answer': answer}
    else:
        return None

df['qa_pair'] = df['clean_text'].apply(create_qa_pair)
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
df = df.dropna(subset=['qa_pair']).copy()

df['question'] = df['qa_pair'].apply(lambda x: x['question'])
df['answer'] = df['qa_pair'].apply(lambda x: x['answer'])

qa_df = df[['question', 'answer']]

# Split the QA data 80/10/10 into train, validation, and test sets
qa_train, qa_temp = train_test_split(qa_df, test_size=0.2, random_state=42)
qa_val, qa_test = train_test_split(qa_temp, test_size=0.5, random_state=42)

# # Save QA datasets
# qa_train.to_json('qa_train.jsonl', orient='records', lines=True)
# qa_val.to_json('qa_validation.jsonl', orient='records', lines=True)
# qa_test.to_json('qa_test.jsonl', orient='records', lines=True)

# # Create Hugging Face datasets for QA
# train_dataset = Dataset.from_pandas(qa_train)
# val_dataset = Dataset.from_pandas(qa_val)
# test_dataset = Dataset.from_pandas(qa_test)

# qa_dataset = DatasetDict({
#     'train': train_dataset,
#     'validation': val_dataset,
#     'test': test_dataset
# })

# # Save QA datasets
# qa_dataset.save_to_disk('qa_dataset')
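
To make the placeholder's behaviour concrete, the sketch below applies the same four-word rule to two made-up strings (assumptions for illustration only). It shows that the "question" is simply the opening words of the post with a question mark appended, and that texts of four words or fewer yield `None` and are therefore dropped by `dropna`.

```python
def create_qa_pair(text):
    # Same placeholder rule as the notebook cell: first four words -> question,
    # remaining words -> answer
    parts = text.split(' ')
    if len(parts) > 4:
        return {'question': ' '.join(parts[:4]) + '?',
                'answer': ' '.join(parts[4:])}
    return None


# Hypothetical cleaned texts
long_text = "researchers in the surgical ethics program study informed consent"
short_text = "ai deep dive"

pair = create_qa_pair(long_text)
print(pair)
print(create_qa_pair(short_text))
```

Pairs like these are enough to check that the split and save steps run end to end, but they are not meaningful ground truth; a real question-generation step would need to replace this function.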