# Step 1: Data Collection/Scraping

### Part 0: Source Brainstorm
- Media formats: Websites, Blogs, Instagram Posts, Twitter Posts, YouTube Videos, Research Papers, **Books**
- Topics: Pitching, Biomechanics, Pitching Training, Strength and Conditioning, Command, Velocity, Stuff, Pitch Analytics
- Credible Sources: Tread Athletics, Driveline Baseball, ArmoredHeat, Medical Research, ConnectedPerformance (Spinal Engine Theory), Baseball Performance, 108Performance
- Starter Sources:
    - Tread, Driveline

### Objective:
- Scrape publicly available baseball training articles from websites, blogs, and social media posts.
- Extract the text content without unnecessary elements (ads, HTML markup).
- Save the clean text to a CSV file.

### Libraries for web scraping 
- **requests**
    - Sends HTTP requests to websites to retrieve data
    - Downloads the HTML content of a webpage
    - Reports errors through status codes and exceptions
- **BeautifulSoup**
    - Python package for parsing HTML and XML documents
    - Extracts specific elements (titles/paragraphs/links)
    - Removes unwanted elements (ads, JavaScript)
    - Works with multiple parsers (html.parser, lxml)
- **How they're used together...**
    - **requests** fetches the raw HTML of a webpage.
    - **BeautifulSoup** extracts the useful content
    - Then we have to process, clean and store the data.
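
The workflow above can be sketched as follows. This is a minimal sketch rather than the notebook's final scraper: `fetch_html` and `extract_clean_text` are hypothetical helper names, the sample HTML is made up, and the list of tags to drop (`script`, `style`, `nav`, `footer`, `aside`) is an assumption that will likely need tuning per site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page's raw HTML; raises requests.HTTPError on a 4xx/5xx response."""
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_clean_text(html):
    """Return the paragraph text of a page with non-content tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed list of non-content tags; adjust per site.
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n".join(text for text in paragraphs if text)


# Made-up HTML so the parsing step can be tried without a network call.
sample_html = """
<html><body>
  <nav><p>Home | Shop</p></nav>
  <script>console.log("ad");</script>
  <p>First paragraph of the article.</p>
  <p>  Second paragraph of the article. </p>
</body></html>
"""
clean_text = extract_clean_text(sample_html)
print(clean_text)
```

For a live page, the two helpers chain as `extract_clean_text(fetch_html(url))`.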

In [3]:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import os

## Part 1: Websites + Blogs

### Objective:
- Scrape publicly available baseball training articles from websites and blogs.
- Extract the text content without unnecessary elements (ads, HTML markup).
- Save the clean text to a CSV file.
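
The last objective, saving clean text to a CSV file, might look like the sketch below. The records, the column names (`title`, `url`, `text`), and the output filename `articles.csv` are all assumptions for illustration; real rows would come from the scraper.

```python
import pandas as pd

# Hypothetical scraped records; the third one has no usable text.
records = [
    {"title": "Example Article", "url": "https://example.com/a", "text": "Some article text."},
    {"title": "Another Article", "url": "https://example.com/b", "text": "Text with, a comma."},
    {"title": "Empty Article", "url": "https://example.com/c", "text": "   "},
]

df = pd.DataFrame(records, columns=["title", "url", "text"])

# Drop rows whose text is blank, then drop repeated URLs.
df = df[df["text"].str.strip() != ""].drop_duplicates(subset="url")

# pandas quotes fields containing commas or newlines, so article text stays in one cell.
output_path = "articles.csv"  # assumed filename
df.to_csv(output_path, index=False, encoding="utf-8")

# Read the file back to confirm it round-trips.
loaded = pd.read_csv(output_path)
print(loaded)
```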

### Tread Athletics:

In [5]:
# Go through each page of their blog section, extract all article titles
base_url = "https://treadathletics.com/posts/"
page_url = base_url  # Start from the first page

# Store titles in a list
all_articles = []

while page_url:
    print(f"Scraping: {page_url}")

    # Send HTTP request
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(page_url, headers=headers, timeout=10)  # timeout avoids hanging forever

    if response.status_code != 200:
        print(f"⚠️ Failed to access {page_url}")
        break

    # Parse HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all article elements (titles and links)
    article_links = soup.find_all("a", href=True)

    for link in article_links:
        title = link.text.strip()
        url = link["href"]

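        # Keep on-site links that are not under /posts/ (this skips pagination links).
        # Note: site navigation links also pass this check and need filtering afterwards.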
        if "/posts/" not in url and "treadathletics.com" in url:
            all_articles.append({"title": title, "url": url})

    # 📌 Find "Next" page button (if available)
    next_page = soup.select_one("a.next.page-numbers")  # Matches both classes regardless of order

    if next_page:
        page_url = next_page["href"]  # Get next page URL
        time.sleep(2)  # Pause to avoid getting blocked
    else:
        page_url = None  # Stop when no "Next" button exists

# 📌 Print all collected article titles and links
print("\n✅ Found Articles:")
for idx, article in enumerate(all_articles, start=1):
    print(f"{idx}. {article['title']} - {article['url']}")

Scraping: https://treadathletics.com/posts/

✅ Found Articles:
1.  - https://treadathletics.com
2. ABOUT - https://treadathletics.com/why-tread/
3. RESULTS - https://treadathletics.com/featured-stories/
4. TRAINING - https://treadathletics.com/coaching/
5. REMOTE COACHING - https://treadathletics.com/coaching/
6. IN-HOUSE TRAINING - https://treadathletics.com/in-house-training/
7. FREE TRAINING (CATCHERS) - https://treadathletics.com/catchers/
8. RESOURCES - https://treadathletics.com/downloads/
9. FREE DOWNLOADS - https://treadathletics.com/downloads/
10. OUR TEAM - https://treadathletics.com/our-team/
11. SHOP - https://treadathletics.com/product-shop/
12. CAREERS - https://treadathletics.com/careers/
13. FULL-TIME - https://treadathletics.com/careers/
14. INTERNSHIPS - https://treadathletics.com/internships/
15. CONTACT - https://treadathletics.com/contact-us/
16. ABOUT - https://treadathletics.com/why-tread/
17. RESULTS - https://treadathletics.com/featured-stories/
18. TRAINING - 
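
The output above is mostly site navigation (ABOUT, SHOP, CAREERS), with repeated entries and one empty title. One possible way to tighten the result, sketched under assumptions: drop links with empty titles, skip a hand-maintained set of navigation paths, and remove duplicate URLs. `NAV_PATHS` is copied from the output above, `filter_article_links` is a hypothetical helper name, and the `example-article` URL in the sample is made up.

```python
from urllib.parse import urlparse

# Navigation paths taken from the output above; extend as new ones appear.
NAV_PATHS = {
    "/", "/posts/", "/why-tread/", "/featured-stories/", "/coaching/",
    "/in-house-training/", "/catchers/", "/downloads/", "/our-team/",
    "/product-shop/", "/careers/", "/internships/", "/contact-us/",
}


def filter_article_links(links, domain="treadathletics.com"):
    """Keep on-site, titled, non-navigation links, one entry per URL."""
    seen = set()
    kept = []
    for link in links:
        title = link["title"].strip()
        parsed = urlparse(link["url"])
        path = parsed.path or "/"  # a bare domain has an empty path
        if not title or not parsed.netloc.endswith(domain):
            continue
        if path in NAV_PATHS or link["url"] in seen:
            continue
        seen.add(link["url"])
        kept.append({"title": title, "url": link["url"]})
    return kept


sample_links = [
    {"title": "", "url": "https://treadathletics.com"},
    {"title": "ABOUT", "url": "https://treadathletics.com/why-tread/"},
    {"title": "SHOP", "url": "https://treadathletics.com/product-shop/"},
    {"title": "ABOUT", "url": "https://treadathletics.com/why-tread/"},
    # Made-up article URL for illustration only.
    {"title": "Example Article", "url": "https://treadathletics.com/example-article/"},
    {"title": "Example Article", "url": "https://treadathletics.com/example-article/"},
]
filtered = filter_article_links(sample_links)
print(filtered)
```

A path blocklist is brittle if the site menu changes; restricting the search to the page's article container would be more robust, but that requires inspecting the site's actual markup first.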

In [4]:
# Print titles, decide which are relevant for Pitching Chatbot
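
One way to make a first pass at relevance is a keyword match against the topics listed in Part 0. This is only a sketch: the keyword stems are derived from that topic list, `is_relevant` is a hypothetical helper, and the sample titles are made up. Broad stems such as "stuff" or "command" may produce false positives, so a manual review of the result is still assumed.

```python
# Keyword stems derived from the Part 0 topic list.
KEYWORDS = (
    "pitch", "biomechanic", "strength", "conditioning",
    "command", "velocity", "stuff", "analytics",
)


def is_relevant(title, keywords=KEYWORDS):
    """Return True if any keyword stem appears in the title (case-insensitive)."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in keywords)


# Made-up titles for illustration.
titles = [
    "How to Improve Pitching Velocity",
    "CAREERS",
    "Strength Training Basics",
    "SHOP",
]
relevant = [title for title in titles if is_relevant(title)]
for title in relevant:
    print(title)
```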

In [None]:
# Go through useful articles and extract text
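
A possible shape for this step is sketched below. `scrape_articles` is a hypothetical helper that takes the download function as a parameter, so a failed page is skipped instead of stopping the run and the loop can be tried offline. The `fake_fetch` function and its URLs are made up; in real use it would be replaced by a function that calls `requests.get` and returns `response.text`.

```python
import time
from bs4 import BeautifulSoup


def scrape_articles(articles, fetch, delay=2.0):
    """Fetch each article URL and collect its paragraph text into row dicts."""
    rows = []
    for index, article in enumerate(articles):
        try:
            html = fetch(article["url"])
        except Exception as exc:  # skip pages that fail to download
            print(f"Skipping {article['url']}: {exc}")
            continue
        soup = BeautifulSoup(html, "html.parser")
        text = "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
        rows.append({"title": article["title"], "url": article["url"], "text": text})
        if index < len(articles) - 1:
            time.sleep(delay)  # pause between requests to avoid getting blocked
    return rows


def fake_fetch(url):
    """Stand-in for a real downloader; simulates one failing page."""
    if url.endswith("/broken/"):
        raise RuntimeError("simulated download failure")
    return "<html><body><p>Paragraph one.</p><p>Paragraph two.</p></body></html>"


useful_articles = [
    {"title": "Example Article", "url": "https://example.com/example-article/"},
    {"title": "Broken Article", "url": "https://example.com/broken/"},
]
rows = scrape_articles(useful_articles, fake_fetch, delay=0)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` for storage.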

## Part 2: Social Media Posts - Text

## Part 3: Social Media Posts - Videos 

### Objective
- Scrape content from short- and long-form educational videos (Tread YouTube Playlist, Tread Instagram Posts, Driveline YouTube Playlist, Driveline Posts ...)


## Part 4: Free Educational Downloads...

- Tread
    - Sample Throwing Program
    - Gain Up To 3-5 Pounds In The Next 3 Weeks
    - Get An In-Season Routine (Free Download)
    - See How You Stack Up (Free Analysis)
- 