# Step 1: Data Collection/Scraping

## Part 0: Source Brainstorm
- Media formats: Websites, Blogs, Instagram Posts, Twitter Posts, YouTube Videos, Research Papers, **Books**
- Topics: Pitching, Biomechanics, Pitching Training, Strength and Conditioning, Command, Velocity, Stuff, Pitch Analytics
- Credible Sources: Tread Athletics, Driveline Baseball, ArmoredHeat, Medical Research, ConnectedPerformance (Spinal Engine Theory), Baseball Performance, 108Performance
- Starter Sources:
    - Tread, Driveline

### Objective:
- Scrape publicly available baseball training articles from websites, blogs, and social media posts.
- Extract the text content without unnecessary elements (ads, HTML markup).
- Save the clean text to a CSV file.

### Libraries for web scraping
- **requests**
    - Sends HTTP requests to websites to retrieve data
    - Downloads the HTML content of a webpage
    - Reports failures through status codes and exceptions so they can be handled
- **BeautifulSoup**
    - Python package for parsing HTML and XML documents
    - Extracts specific elements (titles/paragraphs/links)
    - Removes unwanted elements (ads, JavaScript)
    - Works with multiple parsers (html.parser, lxml)
- **How they're used together...**
    - **requests** fetches the raw HTML of a webpage.
    - **BeautifulSoup** extracts the useful content
    - Then we have to process, clean and store the data.
- **Selenium**
    - Opens and interacts with web pages like a real user (clicks buttons, scrolls, inputs text).
    - Loads JavaScript-rendered content that requests can't access.
    - Automates web scraping for dynamic sites by simulating browsing behavior.
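
The parse-and-clean step described above can be sketched on an inline HTML string, so it runs without network access. The sample markup and the `extract_paragraphs` helper are illustrative assumptions, not taken from any of the target sites.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two real paragraphs plus script/ad noise.
SAMPLE_HTML = """
<html><body>
  <h1>Velocity Basics</h1>
  <script>trackVisit();</script>
  <aside class="ad">Buy now!</aside>
  <p>Lead leg block matters.</p>
  <p>   </p>
  <p>Train year-round.</p>
</body></html>
"""

def extract_paragraphs(html):
    """Return the non-empty <p> texts after dropping script/style/aside tags."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() removes a tag and everything inside it from the tree
    for tag in soup(["script", "style", "aside"]):
        tag.decompose()
    texts = (p.get_text(strip=True) for p in soup.find_all("p"))
    return [text for text in texts if text]

paragraphs = extract_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

In the real scrapers the HTML would come from `requests.get(...).text` rather than a constant, and the list of tags to drop would depend on each site's markup.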

In [1]:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import random

## Part 1: Websites + Blogs

### Objective:
- Scrape publicly available baseball training articles from websites and blogs.
- Extract the text content without unnecessary elements (ads, HTML markup).
- Save the clean text to a CSV file.
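
A minimal sketch of the "clean, then save to CSV" objective, using only the standard library. The `clean_content` and `write_articles` helpers and the example row are hypothetical; the column names mirror the ones used by the scrapers in this notebook. It also shows that multi-line article text survives a CSV round trip because the writer quotes embedded newlines.

```python
import csv
import io
import re

def clean_content(text):
    """Collapse whitespace inside each line and drop empty or repeated lines."""
    seen = set()
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return "\n".join(cleaned)

def write_articles(rows, handle):
    """Write article dicts to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["title", "date", "url", "content"])
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical article row, for illustration only.
rows = [{
    "title": "Example Post",
    "date": "No Date Found",
    "url": "https://example.com/post/",
    "content": clean_content("First   paragraph.\n\nSecond paragraph.\nFirst paragraph."),
}]

buffer = io.StringIO()          # stands in for a file opened with newline=""
write_articles(rows, buffer)
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip[0]["content"])
```

Dropping repeated lines is a design choice that suits boilerplate such as repeated calls to action; it would be the wrong choice for content where repetition is meaningful.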

### Tread Athletics Blogs:

In [14]:
# User-Agent rotation (prevents getting blocked)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
]

# Base URL of the blog
base_url = "https://treadathletics.com/posts/"
page_url = base_url  # Start from the first page

# List to store only article URLs
all_articles = []

while page_url:
    print(f"Scraping: {page_url}")

    # Use a random User-Agent for each request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(page_url, headers=headers, timeout=30)  # timeout avoids hanging forever

    # Stop scraping if request fails
    if response.status_code != 200:
        print(f"Failed to access {page_url} (Status Code: {response.status_code})")
        break

    # Parse HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all blog post containers
    articles = soup.find_all("article")

    for article in articles:
        # Find the first link with an href within each article
        link_tag = article.find("a", href=True)
        if link_tag:
            url = link_tag["href"]

            # Append only the URL
            all_articles.append(url)

    # Find the "Next" page button and get its link
    next_page = soup.find("a", class_="page-numbers next")

    if next_page:
        next_page_url = next_page["href"]
        print(f"Next page found: {next_page_url}")
        page_url = next_page_url  # Move to the next page
        time.sleep(random.randint(4, 8))  # Add a random delay between requests
    else:
        print("No more pages found.")
        page_url = None  # Stop if no more pages exist

# Print all collected article URLs
print("\nFound Articles:")
for idx, url in enumerate(all_articles, start=1):
    print(f"{idx}. {url}")

Scraping: https://treadathletics.com/posts/
Next page found: https://treadathletics.com/posts/page/2/
Scraping: https://treadathletics.com/posts/page/2/
Next page found: https://treadathletics.com/posts/page/3/
Scraping: https://treadathletics.com/posts/page/3/
Next page found: https://treadathletics.com/posts/page/4/
Scraping: https://treadathletics.com/posts/page/4/
Next page found: https://treadathletics.com/posts/page/5/
Scraping: https://treadathletics.com/posts/page/5/
Next page found: https://treadathletics.com/posts/page/6/
Scraping: https://treadathletics.com/posts/page/6/
Next page found: https://treadathletics.com/posts/page/7/
Scraping: https://treadathletics.com/posts/page/7/
Next page found: https://treadathletics.com/posts/page/8/
Scraping: https://treadathletics.com/posts/page/8/
No more pages found.

Found Articles:
1. https://treadathletics.com/2024-year-in-review/
2. https://treadathletics.com/2024-pro-day/
3. https://treadathletics.com/2023-year-in-review/
4. https:
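
The pagination loop above stops at the first non-200 response. One possible refinement is to retry with a growing delay before giving up. This is a sketch under assumptions: `fetch_with_retries` is a hypothetical helper, and the `fetch` argument is any callable that returns an object with a `status_code` attribute (for example `lambda u: requests.get(u, headers=headers, timeout=30)`).

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or attempts run out.

    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = fetch(url)
            if response.status_code == 200:
                return response
            print(f"Attempt {attempt}: status {response.status_code}")
        except Exception as exc:  # network errors, timeouts, etc.
            print(f"Attempt {attempt} raised {exc!r}")
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; in the notebook the default `time.sleep` would be used.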

In [16]:
# Scrape the articles!
# List to store scraped article data
scraped_articles = []

# Function to scrape each article
def scrape_article(url):
    print(f"Scraping: {url}")
    
    # Use a random User-Agent
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    
    # Send HTTP request
    response = requests.get(url, headers=headers, timeout=30)  # timeout avoids hanging forever

    # If request fails, skip this article
    if response.status_code != 200:
        print(f"Failed to access {url}")
        return None

    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract article title: start with <h1>, then prefer the og:title meta tag when present
    title_tag = soup.find("h1")
    title = title_tag.text.strip() if title_tag else "No Title Found"
    
    meta_title = soup.find("meta", {"property": "og:title"})
    if meta_title and meta_title.get("content"):
        title = meta_title["content"]

    # Extract article date from meta tag (.get avoids a KeyError if "content" is missing)
    date_tag = soup.find("meta", {"property": "article:published_time"})
    date = date_tag.get("content", "No Date Found") if date_tag else "No Date Found"

    # Extract article content (every <p> tag on the page, which may include footer or sidebar text)
    content_paragraphs = soup.find_all("p")
    content = "\n".join([p.text.strip() for p in content_paragraphs if p.text.strip()])

    # Return the scraped data
    return {"title": title, "date": date, "url": url, "content": content}

# Loop through all articles and scrape them
for article_url in all_articles:
    article_data = scrape_article(article_url)
    if article_data:
        scraped_articles.append(article_data)
    time.sleep(random.randint(4, 8))  # Add a random delay to avoid getting blocked

# Convert scraped data into a DataFrame
df = pd.DataFrame(scraped_articles)

# Save data to a CSV file
output_file = "tread_articles.csv"
df.to_csv(output_file, index=False)

print(f"\nAll articles saved to {output_file}")

Scraping: https://treadathletics.com/2024-year-in-review/
Scraping: https://treadathletics.com/2024-pro-day/
Scraping: https://treadathletics.com/2023-year-in-review/
Scraping: https://treadathletics.com/tread-amateur-pro-weekend-23-recap/
Scraping: https://treadathletics.com/2023-pro-day-recap/
Scraping: https://treadathletics.com/2022-year-in-review/
Scraping: https://treadathletics.com/rmbt/
Scraping: https://treadathletics.com/pap/
Scraping: https://treadathletics.com/2021-year-in-review/
Scraping: https://treadathletics.com/energy-flow/
Scraping: https://treadathletics.com/rj-petit/
Scraping: https://treadathletics.com/10-lessons/
Scraping: https://treadathletics.com/2020-year-in-review/
Scraping: https://treadathletics.com/high-velocity-training/
Scraping: https://treadathletics.com/nate-pearson-2/
Scraping: https://treadathletics.com/nate-pearson-1/
Scraping: https://treadathletics.com/korean-baseball/
Scraping: https://treadathletics.com/2019-year-in-review/
Scraping: https://t

In [20]:
# View the text
import pandas as pd

# Load the CSV file
file_path = r"C:\Users\Sean Salvador\Documents\2025 DS Projects\PitchingRAG\ExtractedText\Websites_Blogs\Tread\tread_articles.csv"
df = pd.read_csv(file_path)
pd.set_option("display.max_colwidth", None)  # Show full content without truncation
# Show the first article's content
print(df.iloc[0]["content"]) 

“Never doubt that a small group of thoughtful, committed people can change the world; indeed, it’s the only thing that ever has.“
2024 was arguably Tread’s best year yet.
Tons more pitchers signed & drafted, despite the ever increasingly competitive landscape.
College commitments galore, and dozens of incredibly smart and talented people who joined our team from every corner of the country.
Let’s dive into what made 2024 a year to remember, and what Tread has in store for 2025.
In 2023, we added 14 members to the Tread team.
This past year, we decided to go ahead and flip that number around, adding 41 new and talented members.
With almost every hire relocating away from home to come join our team in Charlotte, we don’t take their decisions for granted – and we’ve made a commitment to each and every one of them that Tread will do everything it can to foster their personal and professional growth.
These new members include:
A few honorable mentions:
Quite a few of our coaches have now ma

### Driveline Blogs:


In [35]:
# Extract Blog URLs, then Scrape the URLs
# User-Agent rotation (prevents getting blocked)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
]

def get_article_urls(base_url):
    """
    Scrapes all article URLs from the given category page (handles pagination).
    
    :param base_url: The URL of the Driveline category page (e.g., Pitching, Health).
    :return: A list of article URLs.
    """
    page_url = base_url
    all_articles = []

    while page_url:
        print(f"Scraping: {page_url}")

        # Use a random User-Agent for each request
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        response = requests.get(page_url, headers=headers, timeout=30)  # timeout avoids hanging forever

        # Stop scraping if request fails
        if response.status_code != 200:
            print(f"Failed to access {page_url} (Status Code: {response.status_code})")
            break

        # Parse HTML content
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all blog post containers
        articles = soup.find_all("article", class_="blog-post")

        for article in articles:
            # Find link within each article
            link_tag = article.find("a", href=True)
            if link_tag:
                url = link_tag["href"]
                all_articles.append(url)

        # Find the "Next" page button and get its link
        next_page = soup.find("a", class_="next page-numbers")

        if next_page:
            page_url = next_page["href"]  # Move to the next page
            print(f"Next page found: {page_url}")
            time.sleep(random.randint(4, 8))  # Add a random delay between requests
        else:
            print("No more pages found.")
            page_url = None  # Stop if no more pages exist

    print(f"\nTotal Articles Found: {len(all_articles)}")
    return all_articles

def scrape_articles(article_urls, category_name):
    """
    Scrapes content from the given list of article URLs.

    :param article_urls: List of article URLs to scrape.
    :param category_name: Name of the category (used for naming output file).
    :return: Saves scraped data to a CSV file.
    """
    scraped_articles = []

    for url in article_urls:
        print(f"Scraping: {url}")

        # Use a random User-Agent
        headers = {"User-Agent": random.choice(USER_AGENTS)}

        # Send HTTP request
        response = requests.get(url, headers=headers, timeout=30)  # timeout avoids hanging forever

        # If request fails, skip this article
        if response.status_code != 200:
            print(f"Failed to access {url}")
            continue

        # Parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract article title: start with <h1>, then prefer the og:title meta tag when present
        title_tag = soup.find("h1", class_="post__title")
        title = title_tag.text.strip() if title_tag else "No Title Found"

        meta_title = soup.find("meta", {"property": "og:title"})
        if meta_title and meta_title.get("content"):
            title = meta_title["content"]

        # Extract article date from meta tag (.get avoids a KeyError if "content" is missing)
        date_tag = soup.find("meta", {"property": "article:published_time"})
        date = date_tag.get("content", "No Date Found") if date_tag else "No Date Found"

        # Extract article content (only <p> tags inside the post body)
        content_div = soup.find("div", class_="post__content")
        content_paragraphs = content_div.find_all("p") if content_div else []
        content = "\n".join([p.text.strip() for p in content_paragraphs if p.text.strip()])

        # Store scraped data
        scraped_articles.append({"title": title, "date": date, "url": url, "content": content})

        # Add delay to avoid getting blocked
        time.sleep(random.randint(4, 8))

    # Convert scraped data into a DataFrame
    df = pd.DataFrame(scraped_articles)

    # Save data to a CSV file
    output_file = f"driveline_{category_name}_articles.csv"
    df.to_csv(output_file, index=False)

    print(f"\n✅ All articles saved to {output_file}")

In [38]:
# Extract the URLs for each category (a loop over the category slugs would be more compact)
# Extract Pitching articles
pitching_urls = get_article_urls("https://www.drivelinebaseball.com/category/pitching/")

# Extract Health articles
health_urls = get_article_urls("https://www.drivelinebaseball.com/category/health/")

# Extract Offseason Training articles 
offseason_training_urls = get_article_urls("https://www.drivelinebaseball.com/category/offseason-training/")

# Extract Pulse articles 
pulse_urls = get_article_urls("https://www.drivelinebaseball.com/category/pulse/")

# Extract Strength Training Articles
strength_urls = get_article_urls("https://www.drivelinebaseball.com/category/strength-training/")

# Extract Research articles
research_urls = get_article_urls("https://www.drivelinebaseball.com/category/research/")

pitching_urls, health_urls, offseason_training_urls, pulse_urls, strength_urls, research_urls

Scraping: https://www.drivelinebaseball.com/category/pitching/
Next page found: https://www.drivelinebaseball.com/category/pitching/page/2/
Scraping: https://www.drivelinebaseball.com/category/pitching/page/2/
Next page found: https://www.drivelinebaseball.com/category/pitching/page/3/
Scraping: https://www.drivelinebaseball.com/category/pitching/page/3/
Next page found: https://www.drivelinebaseball.com/category/pitching/page/4/
Scraping: https://www.drivelinebaseball.com/category/pitching/page/4/
Next page found: https://www.drivelinebaseball.com/category/pitching/page/5/
Scraping: https://www.drivelinebaseball.com/category/pitching/page/5/
Next page found: https://www.drivelinebaseball.com/category/pitching/page/6/
Scraping: https://www.drivelinebaseball.com/category/pitching/page/6/
Next page found: https://www.drivelinebaseball.com/category/pitching/page/7/
Scraping: https://www.drivelinebaseball.com/category/pitching/page/7/
No more pages found.

Total Articles Found: 64
Scraping

(['https://www.drivelinebaseball.com/2024/05/revisiting-stuff-plus/',
  'https://www.drivelinebaseball.com/2022/10/a-quantitative-analysis-of-the-lead-leg-block-and-its-contributions-to-velocity/',
  'https://www.drivelinebaseball.com/2022/03/dillon-tate-using-biomechanics-for-player-development/',
  'https://www.drivelinebaseball.com/2022/01/ultimate-guide-baseball-pitch-grips/',
  'https://www.drivelinebaseball.com/2022/01/changes-in-ball-weight-new-plyo/',
  'https://www.drivelinebaseball.com/2021/12/what-is-stuff-quantifying-pitches-with-pitch-models/',
  'https://www.drivelinebaseball.com/2021/11/pulse-throw-app-updates/',
  'https://www.drivelinebaseball.com/2021/11/thomas-ruwe-from-not-pitching-in-high-school-to-throwing-98-mph/',
  'https://www.drivelinebaseball.com/2021/11/caleb-thielbar-coaching-dii-baseball-to-10-million-dollar-season/',
  'https://www.drivelinebaseball.com/2021/10/optimizing-breaking-ball-shape-through-data-driven-pitch-design-part-ii/',
  'https://www.driv
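
The six near-identical calls above could be collapsed into one loop over category slugs. The sketch below assumes a hypothetical helper, `collect_category_urls`, and uses a stand-in for `get_article_urls` so it runs without network access; the slugs are the ones from the category URLs above.

```python
DRIVELINE_CATEGORY_BASE = "https://www.drivelinebaseball.com/category/"
CATEGORY_SLUGS = [
    "pitching",
    "health",
    "offseason-training",
    "pulse",
    "strength-training",
    "research",
]

def collect_category_urls(slugs, get_urls):
    """Map each category (slug with underscores) to the URLs returned by get_urls."""
    return {
        slug.replace("-", "_"): get_urls(f"{DRIVELINE_CATEGORY_BASE}{slug}/")
        for slug in slugs
    }

# Stand-in for get_article_urls; returns one made-up URL per category page.
def fake_get_urls(category_url):
    return [category_url + "example-post/"]

all_driveline_urls = collect_category_urls(CATEGORY_SLUGS, fake_get_urls)
print(list(all_driveline_urls))
```

With the real scraper, `collect_category_urls(CATEGORY_SLUGS, get_article_urls)` would produce a dictionary keyed by category name, and no category can be forgotten when the results are displayed or passed on.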

In [39]:
# Now make one CSV file per category
all_driveline_urls = {
    "pitching": pitching_urls,
    "health": health_urls,
    "offseason_training": offseason_training_urls,
    "pulse": pulse_urls,
    "strength_training": strength_urls,
    "research": research_urls
}

# Loop through each category and scrape the articles
for category, urls in all_driveline_urls.items():
    if urls:  # Ensure there's something to scrape
        scrape_articles(urls, category)

Scraping: https://www.drivelinebaseball.com/2024/05/revisiting-stuff-plus/
Scraping: https://www.drivelinebaseball.com/2022/10/a-quantitative-analysis-of-the-lead-leg-block-and-its-contributions-to-velocity/
Scraping: https://www.drivelinebaseball.com/2022/03/dillon-tate-using-biomechanics-for-player-development/
Scraping: https://www.drivelinebaseball.com/2022/01/ultimate-guide-baseball-pitch-grips/
Scraping: https://www.drivelinebaseball.com/2022/01/changes-in-ball-weight-new-plyo/
Scraping: https://www.drivelinebaseball.com/2021/12/what-is-stuff-quantifying-pitches-with-pitch-models/
Scraping: https://www.drivelinebaseball.com/2021/11/pulse-throw-app-updates/
Scraping: https://www.drivelinebaseball.com/2021/11/thomas-ruwe-from-not-pitching-in-high-school-to-throwing-98-mph/
Scraping: https://www.drivelinebaseball.com/2021/11/caleb-thielbar-coaching-dii-baseball-to-10-million-dollar-season/
Scraping: https://www.drivelinebaseball.com/2021/10/optimizing-breaking-ball-shape-through-dat

## Part 2: Social Media Posts - Text

### Objective:
- Extract from main Driveline and Tread accounts.
    - Ben's Account, Kyle's Account... 
- Also pull from Driveline and Tread personal coach accounts...
    - ArowThrows, Crider_Performance ...
- Start with main accounts for now!

In [2]:
# Twitter extraction (Approach 1: snscrape)
import snscrape.modules.twitter as sntwitter
import pandas as pd

def scrape_twitter(username, num_tweets=None):
    """
    Scrapes tweets from a given Twitter/X account (all available tweets unless num_tweets is set).
    
    :param username: Twitter handle (e.g., "DrivelineBB")
    :param num_tweets: Number of tweets to scrape (None = all available)
    :return: None; saves tweets to a CSV file named 'twitter_{username}_raw.csv'
    """
    print(f"🔄 Scraping tweets from @{username}...")

    # List to store tweet data
    tweets = []

    # Scrape tweets
    for tweet in sntwitter.TwitterSearchScraper(f'from:{username}').get_items():
        tweets.append([tweet.date, tweet.content, tweet.url])

        # If num_tweets is set, limit the number of tweets scraped
        if num_tweets is not None and len(tweets) >= num_tweets:
            break

    # Convert to DataFrame
    df = pd.DataFrame(tweets, columns=["date", "text", "url"])

    # Save to CSV
    output_file = f"twitter_{username}_raw.csv"
    df.to_csv(output_file, index=False)

    print(f"✅ Scraped {len(df)} tweets from @{username} and saved to {output_file}")


In [4]:
import certifi
import os

# Check where the SSL certificates are stored
print(f"✅ SSL Certificate Path: {certifi.where()}")

# Ensure Python is using the correct SSL certificate
os.environ["SSL_CERT_FILE"] = certifi.where()



✅ SSL Certificate Path: C:\Users\Sean Salvador\.conda\envs\seansalds\Lib\site-packages\certifi\cacert.pem


In [8]:
# Approach 2
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def fetch_tweets(username, num_tweets=20):
    """
    Scrapes tweets from a given Twitter/X account using the mobile site.
    
    :param username: Twitter handle (e.g., "DrivelineBB")
    :param num_tweets: Number of tweets to scrape per account (default=20)
    :return: Saves tweets to a CSV file named 'twitter_{username}_mobile.csv'
    """
    print(f"🔄 Scraping tweets from @{username}...")

    url = f"https://mobile.twitter.com/{username}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    # Send request
    response = requests.get(url, headers=headers, timeout=30)  # timeout avoids hanging forever

    if response.status_code != 200:
        print(f"❌ Failed to fetch tweets from @{username} (Status Code: {response.status_code})")
        return

    soup = BeautifulSoup(response.text, "html.parser")

    # Find tweet containers
    tweet_divs = soup.find_all("div", class_="tweet-text")
    tweets = []

    # Extract text and tweet URLs
    for tweet_div in tweet_divs[:num_tweets]:  # Limit to num_tweets
        tweet_text = tweet_div.get_text(strip=True)
        # Not every tweet block is wrapped in a link, so fall back to None
        parent_link = tweet_div.find_parent("a", href=True)
        tweet_link = f"https://mobile.twitter.com{parent_link['href']}" if parent_link else None

        tweets.append({"username": username, "text": tweet_text, "url": tweet_link})

    # Save to CSV
    df = pd.DataFrame(tweets)
    output_file = f"twitter_{username}_mobile.csv"
    df.to_csv(output_file, index=False)

    print(f"✅ Scraped {len(df)} tweets from @{username} and saved to {output_file}")



In [9]:
# Extract from main accounts to start 
# List of Twitter accounts to scrape
main_accounts = ["DrivelineBB", "drivelinekyle", "TreadHQ", "TreadAthletics"]

# Loop through each account and scrape tweets
for account in main_accounts:
    fetch_tweets(account, num_tweets=20)  # Adjust num_tweets if needed
    time.sleep(5)  # Add delay to prevent getting blocked

🔄 Scraping tweets from @DrivelineBB...
✅ Scraped 0 tweets from @DrivelineBB and saved to twitter_DrivelineBB_mobile.csv
🔄 Scraping tweets from @drivelinekyle...
✅ Scraped 0 tweets from @drivelinekyle and saved to twitter_drivelinekyle_mobile.csv
🔄 Scraping tweets from @TreadHQ...
✅ Scraped 0 tweets from @TreadHQ and saved to twitter_TreadHQ_mobile.csv
🔄 Scraping tweets from @TreadAthletics...
✅ Scraped 0 tweets from @TreadAthletics and saved to twitter_TreadAthletics_mobile.csv


In [16]:
# Approach 3

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time

def fetch_tweets_selenium(username, max_tweets=None):
    """
    Scrapes tweets from a Twitter/X account using Selenium.
    
    :param username: Twitter handle (e.g., "DrivelineBB")
    :param max_tweets: Maximum number of tweets to scrape (None = ALL tweets)
    :return: Saves tweets to 'twitter_{username}_selenium.csv'
    """
    print(f"🔄 Scraping tweets from @{username} using Selenium...")

    url = f"https://twitter.com/{username}"

    # Set up Selenium WebDriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Run in headless mode (no browser UI)
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    tweets = set()  # Use a set to avoid duplicates
    scroll_attempts = 0
    last_tweet_count = 0

    try:
        driver.get(url)
        time.sleep(5)  # Allow time for initial tweets to load

        while True:
            # Find all tweet text elements
            tweet_divs = driver.find_elements(By.XPATH, "//div[@data-testid='tweetText']")

            # Extract text and add to set
            for tweet_div in tweet_divs:
                tweet_text = tweet_div.text.strip()
                if tweet_text:
                    tweets.add(tweet_text)

            # Check if we have enough tweets
            if max_tweets and len(tweets) >= max_tweets:
                break

            # Scroll down multiple times to load more tweets
            for _ in range(3):
                driver.find_element(By.TAG_NAME, "body").send_keys(Keys.PAGE_DOWN)
                time.sleep(2)  # Give time for new tweets to load

            # Check if the number of tweets has increased
            if len(tweets) == last_tweet_count:
                scroll_attempts += 1
            else:
                scroll_attempts = 0  # Reset if new tweets loaded

            last_tweet_count = len(tweets)

            # Stop if no new tweets detected after multiple attempts
            if scroll_attempts > 10:  # Increase this if needed
                print("🚨 Stopping: No new tweets detected.")
                break
    finally:
        # Always close the browser, even if scraping fails or is interrupted
        driver.quit()

    # Save to DataFrame and CSV
    df = pd.DataFrame(list(tweets), columns=["text"])
    output_file = f"twitter_{username}_selenium.csv"
    df.to_csv(output_file, index=False)

    print(f"✅ Scraped {len(df)} tweets from @{username} and saved to {output_file}")


In [17]:
accounts = ["DrivelineBB", "drivelinekyle", "TreadHQ", "TreadAthletics"]
for account in accounts:
    fetch_tweets_selenium(account, max_tweets=None)


🔄 Scraping tweets from @DrivelineBB using Selenium...
🚨 Stopping: No new tweets detected.
✅ Scraped 20 tweets from @DrivelineBB and saved to twitter_DrivelineBB_selenium.csv
🔄 Scraping tweets from @drivelinekyle using Selenium...


KeyboardInterrupt: 
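
The run above had to be interrupted by hand. One way to bound a scroll loop is to stop after a few scrolls with no new items or when a time budget runs out, whichever comes first. The sketch below is browser-free: `collect_until_stalled` is a hypothetical helper, and `read_items` and `scroll` are placeholders for the Selenium calls (for example reading the `tweetText` elements and sending `PAGE_DOWN`).

```python
import time

def collect_until_stalled(read_items, scroll, max_items=None, max_stalls=3,
                          max_seconds=60.0, clock=time.monotonic):
    """Read items and scroll until nothing new appears or the time budget ends.

    read_items: callable returning the items currently visible
    scroll:     callable that triggers loading of more content
    """
    collected = []   # list keeps first-seen order
    seen = set()     # set gives fast duplicate checks
    stalls = 0
    deadline = clock() + max_seconds
    while stalls < max_stalls and clock() < deadline:
        before = len(collected)
        for item in read_items():
            if item not in seen:
                seen.add(item)
                collected.append(item)
        if max_items is not None and len(collected) >= max_items:
            return collected[:max_items]
        # Count consecutive rounds that produced nothing new
        stalls = stalls + 1 if len(collected) == before else 0
        scroll()
    return collected
```

Unlike a plain set, the list preserves the order in which items were first seen, which can matter if the tweets are later read chronologically.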

In [21]:
# Approach 4 - using article https://scrapfly.io/blog/how-to-scrape-twitter/
import subprocess

# Run the Playwright script externally to avoid Jupyter issues
subprocess.run(["python", "scrape_twitter.py"])


CompletedProcess(args=['python', 'scrape_twitter.py'], returncode=0)

In [None]:
subprocess.run(["python", "scrape_twitter.py"], capture_output=True, text=True)
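
When a script is launched from a notebook, it can help to run it with the notebook's own interpreter and to inspect the captured output instead of discarding it. The sketch below is an illustration under assumptions: `run_script` is a hypothetical helper, and an inline `-c` command stands in for `scrape_twitter.py` so the example is self-contained.

```python
import subprocess
import sys

def run_script(args, timeout=60):
    """Run the current Python interpreter with args; return (returncode, stdout, stderr)."""
    result = subprocess.run(
        [sys.executable, *args],   # same interpreter/environment as the caller
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr

# Inline command used in place of the real script.
code, out, err = run_script(["-c", "print('scraper finished')"])
print(code, out.strip())
if code != 0:
    print(err)  # surface the script's error output when it fails
```

For the real script the call would be `run_script(["scrape_twitter.py"])`, with a longer timeout if the scrape is slow.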


## Part 3: Social Media Posts - Videos 

### Objective
- Scrape short- and long-form educational videos (Tread YouTube playlist, Tread Instagram posts, Driveline YouTube playlist, Driveline posts ...)


## Part 4: Free Educational Downloads

- Tread
    - Sample Throwing Program
    - Gain Up To 3-5 Pounds In The Next 3 Weeks
    - Get An In-Season Routine (Free Download)
    - See How You Stack Up (Free Analysis)
- Driveline
    - Research Output/Articles
    - 6 Week Sprinting Program
    - 16 Weeks In-Season and Off-Season Throwing and Lifting Program
    - 8 Weeks Catcher Receiving and Velocity Program
    - Return to Throwing Program
    - List of Tracking Sheets
    - List of Video Sheets
    - Traq Resources