# Lesson: Data Sources

In data science, the first step is always getting the data. In this lesson, we'll explore three of the most common ways to bring data into your Python environment using the `pandas` library.

1. **CSV Files (Comma-Separated Values)**
2. **Google Sheets**
3. **Webpages (HTML Tables)**

```python
import pandas as pd
```

## 1. CSV Files

CSV files are one of the most common formats for exchanging data. They are plain text files in which each line is a row and the values in a row are separated by commas. You can load them from a local file on your computer or directly from a URL.

```python
# Loading from a URL (e.g., Indiana Pacers 2024-2025 player stats)
csv_url = "https://raw.githubusercontent.com/Data-Dunkers/data/refs/heads/main/NBA/team/2024-2025/IND_2024-2025_players.csv"
df_csv = pd.read_csv(csv_url)
df_csv.head()
```
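
The paragraph above also mentions local files, which the URL example does not cover. As a minimal sketch, the code here writes a tiny CSV to a temporary file and reads it back with `pd.read_csv()`. The file name `sample_players.csv` and the three rows in it are made up for illustration; in practice you would pass the path of a file you already have.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, invented for this sketch (not real player stats)
csv_text = "player,games,points\nPlayer A,10,152\nPlayer B,8,97\nPlayer C,12,201\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the text to a local file, as if it had been downloaded earlier
    csv_path = Path(tmp_dir) / "sample_players.csv"
    csv_path.write_text(csv_text, encoding="utf-8")

    # read_csv accepts a local path just as it accepts a URL
    df_local = pd.read_csv(csv_path)

print(df_local)
```

The first line of the file becomes the column names, and pandas infers that `games` and `points` are numeric.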

## 2. Google Sheets

Google Sheets is a popular tool for collaborative data collection. To load data from a Google Sheet, you don't need to download it; you can read it directly into Python by following these steps:

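1. **Share the Sheet**: Click the **Share** button and, under general access, choose **"Anyone with the link"**. Viewer permission is enough for reading.
2. **Copy the Link**: Copy the URL of your Google Sheet from the browser's address bar.
3. **Modify the URL**: Replace the end of the URL (everything after the last `/`) with `export?format=csv`. By default this exports only the first tab of the spreadsheet.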

**Example Conversion:**
*   **Original**: `.../d/1ZULKhYzsMd4eYwiprsyGgE9Df3gaVtO8WRalUQDn-xE/edit#gid=0`
*   **Modified**: `.../d/1ZULKhYzsMd4eYwiprsyGgE9Df3gaVtO8WRalUQDn-xE/export?format=csv`

```python
# Example: Loading a public Google Sheet with sample basketball data
sheet_url = "https://docs.google.com/spreadsheets/d/1ZULKhYzsMd4eYwiprsyGgE9Df3gaVtO8WRalUQDn-xE/export?format=csv"
df_sheet = pd.read_csv(sheet_url)
df_sheet.head()
```
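
The URL rewrite from step 3 can also be done in code, which avoids copy-and-paste mistakes. The helper below is a sketch, not part of pandas or any Google library: the name `to_csv_export_url` is hypothetical, and the optional `gid` argument assumes that appending `&gid=...` to the export URL selects a specific tab.

```python
import re


def to_csv_export_url(sheet_url, gid=None):
    """Turn a Google Sheets share/edit URL into a CSV export URL (sketch)."""
    # The spreadsheet ID is the path segment that follows /spreadsheets/d/
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", sheet_url)
    if match is None:
        raise ValueError("This does not look like a Google Sheets URL")

    export_url = (
        f"https://docs.google.com/spreadsheets/d/{match.group(1)}/export?format=csv"
    )
    # Assumption: a gid query parameter picks one tab of the spreadsheet
    if gid is not None:
        export_url += f"&gid={gid}"
    return export_url


original = "https://docs.google.com/spreadsheets/d/1ZULKhYzsMd4eYwiprsyGgE9Df3gaVtO8WRalUQDn-xE/edit#gid=0"
print(to_csv_export_url(original))
```

The returned string can be passed straight to `pd.read_csv()`, exactly like `sheet_url` in the example above.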

## 3. Webpages (HTML Tables)

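Sometimes data is displayed in a table on a website, such as Wikipedia or a sports news site. `pd.read_html()` scans a webpage for HTML `<table>` elements and returns them as a list of DataFrames, one per table. It needs an HTML parser library to be installed, such as `lxml`, or `beautifulsoup4` together with `html5lib`.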

```python
# Example: Scraping a table from Wikipedia
wiki_url = "https://en.wikipedia.org/wiki/Indiana_Pacers"
tables = pd.read_html(wiki_url)

# Show the first table found on the page
tables[0].head()
```
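
Because `pd.read_html()` returns a list, the table you want is not always `tables[0]`, and its position can change when the page is edited. One possible approach is to pick the table by its column names instead of its index. The sketch below uses a hypothetical helper, `find_table`, and two hand-made DataFrames as stand-ins for scraped tables so that it runs without a network connection.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    for table in tables:
        # Convert names to strings so unnamed (integer) columns compare safely
        column_names = {str(col) for col in table.columns}
        if set(required_columns) <= column_names:
            return table
    raise ValueError(f"No table has all of these columns: {required_columns}")


# Hypothetical stand-ins for the list that pd.read_html() returns
example_tables = [
    pd.DataFrame({0: ["some label"], 1: ["some value"]}),  # e.g. a layout table
    pd.DataFrame({"Player": ["Player A", "Player B"], "Pos.": ["G", "F"]}),
]

roster = find_table(example_tables, ["Player", "Pos."])
print(roster)
```

`pd.read_html()` also accepts a `match` argument, which keeps only the tables that contain a given piece of text or regular expression.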

## Reflection Questions

1. **Which data source felt the easiest to use? Why?**
2. **What are some risks of using `pd.read_html()` to get data from a website?**
3. **When would you use a Google Sheet instead of a simple CSV file?**