# Let's Code!

Follow along, run the code cells (Shift + Enter), and let's get ready for the hackathon!

## 1. Setup: Importing Our Tools

First, we need to import the libraries we'll use. We give them standard nicknames (`pd`, `plt`) to make our code shorter.

- `pandas`: For creating and manipulating our main data structure, the DataFrame.
- `matplotlib.pyplot`: For making basic plots.

In [22]:
# !pip install pandas matplotlib
# import libraries

# Let's make sure plots show up nicely in the notebook
%matplotlib inline 

print("Libraries imported! Ready to code!")

Libraries imported! Ready to code!


## 2. Loading the Data

Now, let's load the Aggie Baseball data from the CSV file into a Pandas DataFrame. A DataFrame is like a programmable spreadsheet.

**Important:** Make sure the `aggie_baseball_batting.csv` file is in the same directory as this notebook, or change the `DATA_PATH` variable below to the correct file path on your computer.

In [4]:
# Adjust this path if your file is somewhere else!
DATA_PATH = 'file.csv' 
# ========================================

# Read the CSV files into a pandas dataframe


## 3. Inspecting Our DataFrame: First Look

We loaded *something*, but what does it look like? Let's use some essential Pandas commands to inspect our DataFrame.

- `df.head()`: Shows the first 5 rows.
- `df.tail()`: Shows the last 5 rows.
- `df.shape`: Shows the number of (rows, columns).
- `df.info()`: Shows column names, non-null counts, and data types (Dtypes) for each column (Series).

In [5]:
# Display the first few rows
print("--- First 5 Rows --- ")


--- First 5 Rows --- 


In [6]:
# Display the last few rows
print("--- Last 5 Rows --- ")


--- Last 5 Rows --- 


In [8]:
# Get the dimensions (rows, columns)
print("--- DataFrame Shape --- ")


--- DataFrame Shape --- 


In [9]:
# Get column info (names, non-null counts, Dtypes)
print("--- DataFrame Info --- ")


--- DataFrame Info --- 


**Observation:** Look at the `Dtype` column from `df.info()`. Notice that `Date`, `W/L`, and `Score` are listed as `object`. This usually means Pandas sees them as text strings. We need to fix this to perform calculations and date-based analysis!

## 4. Cleaning Time: Fixing Data Types

Let's convert those `object` columns into more useful types.

### 4.1 Fixing the Date Column

We'll use `pd.to_datetime()` to convert the 'Date' column strings into actual datetime objects. Then, we'll extract the year into a new column called 'Year' using the `.dt` accessor, which is super handy for working with dates.

In [12]:
print("Fixing Date column...")
# Change the Date string to a DateTime Object

# Create a 'Year' column

print("Date fixed, 'Year' column added.")
# Verify the change
# print("Data type of 'Date' column:", df['Date'].dtype)
# print("Data type of 'Year' column:", df['Year'].dtype)

Fixing Date column...
Date fixed, 'Year' column added.


### 4.2 Parsing the Score Column

The 'Score' column (e.g., '5-6') needs to be split into two numerical columns: 'Runs_For' (A&M's score) and 'Runs_Against' (Opponent's score).

1.  Use the `.str.split()` method to split the string at the hyphen ('-'). `expand=True` makes new columns.
2.  Convert these new columns to numbers using `pd.to_numeric()`. We use `errors='coerce'` which will turn any problematic values (like missing scores) into `NaN` (Not a Number) instead of crashing.

In [13]:
print("Parsing Score column...")
# Split the 'Score' string

# Create new columns and convert to numeric

# print("'Runs_For' and 'Runs_Against' columns added.")
# print("Data type of 'Runs_For':", df['Runs_For'].dtype)
# print("Data type of 'Runs_Against':", df['Runs_Against'].dtype)

Parsing Score column...


### 4.3 Parsing the W/L Column

The 'W/L' column is text ('W' or 'L'). For easier analysis (like grouping), let's create a boolean (True/False) column called 'Win'. We can use `.str.startswith('W')` which returns `True` if the string begins with 'W'.

In [15]:
print("Parsing W/L column...")
# Parse the Win-Loss Column and convert it to a boolean

print("'Win' (boolean) column added.")
# print("Data type of 'Win':", df['Win'].dtype)

Parsing W/L column...
'Win' (boolean) column added.


### 4.4 Verify All Cleaning

Let's run `df.info()` again and look at the newly created columns to make sure everything looks good.

In [16]:
print("--- Verifying Cleaned DataFrame Info --- ")


--- Verifying Cleaned DataFrame Info --- 


In [17]:
print("\n--- Checking New/Modified Columns --- ")
# display(df[['Date', 'Year', 'Score', 'Runs_For', 'Runs_Against', 'W/L', 'Win']].head())


--- Checking New/Modified Columns --- 


**Success!** Our key columns now have useful data types (`datetime64`, `int64`, `float64`, `bool`). Now we can analyze!

## 5. Analysis Time! Answering Questions

With our cleaned DataFrame, we can start asking questions using filtering, grouping, sorting, and aggregation methods.

### Q1: How have Home Runs per year changed?

We can `groupby()` the 'Year' column and then calculate the `sum()` of the 'HR' column for each year.

In [18]:
print("--- Q1: Home Runs per Year ---")
# Calculate the Home Runs by year

print("Total Home Runs per Year (showing last 10):")
# Show the last 10 years

--- Q1: Home Runs per Year ---
Total Home Runs per Year (showing last 10):


### Q2: What are the average Runs For/Against in Wins vs. Losses?

Let's `groupby()` our boolean 'Win' column and calculate the `mean()` for both 'Runs_For' and 'Runs_Against'.

In [19]:
print("--- Q2: Average Runs in Wins vs Losses ---")
# Group by 'Win', then select the columns to average, then calculate mean

# The index will be False (Losses) and True (Wins)


--- Q2: Average Runs in Wins vs Losses ---


### Q3: How do we hit against a specific rival (e.g., Texas)?

1.  **Filter:** Create a smaller DataFrame containing only games where the 'Opponent' column matches our rival.
2.  **Calculate:** Perform calculations (like batting average: Total Hits / Total At Bats) on this *filtered* DataFrame.

**Important:** The opponent name (`'Texas'` below) must *exactly* match how it appears in the 'Opponent' column of your CSV file!

In [None]:
print("--- Q3: Performance vs. Specific Rival ---")
# Make sure this name matches your data EXACTLY (case-sensitive!)
RIVAL_NAME = 'Texas' 
# ==================================

# Filter the DataFrame to get only games vs Rival


print(f"Found {len(rival_games)} games against {RIVAL_NAME}.")

# Calculate stats for these games
# hits_vs_rival =
# at_bats_vs_rival = 
# avg_vs_rival = 

# avg_runs_vs_rival = 
# win_rate_vs_rival = 

print(f"\nStats vs {RIVAL_NAME}:")
# print(f"  Batting Avg: {avg_vs_rival:.3f}")
# print(f"  Avg Runs Scored: {avg_runs_vs_rival:.2f}")
# print(f"  Win Rate: {win_rate_vs_rival:.1%}")

# Compare to overall
# overall_avg = 
# overall_win_rate = 
print(f"\nCompare to Overall:")
# print(f"  Overall Batting Avg: {overall_avg:.3f}")
# print(f"  Overall Win Rate: {overall_win_rate:.1%}")

### Q4: What were the highest scoring games for A&M?

We can use `.sort_values()` on the 'Runs_For' column. `ascending=False` puts the highest values first. Then use `.head()` to see the top few.

In [None]:
print("--- Q4: Highest Scoring Games (for A&M) ---")
high_scoring_games = df.sort_values('Runs_For', ascending=False)

# Display the top 5 highest scoring games, showing relevant columns
display(high_scoring_games[['Date', 'Opponent', 'Score', 'W/L', 'Runs_For']].head(5))

## 6. Quick Visualization

Pandas DataFrames and Series can plot themselves using Matplotlib under the hood. Let's make a quick bar chart of the Home Runs per Year data we calculated earlier.

In [21]:
print("--- Visualizing HRs per Year ---")
# hrs_per_year.plot(kind='bar', figsize=(12, 6))

# Add plot details


--- Visualizing HRs per Year ---


## Workshop Complete!

Nice work! We've successfully loaded, inspected, cleaned, analyzed, and even visualized real Aggie Baseball data using Pandas.

Remember the core workflow:
1.  **Load:** `pd.read_csv()`
2.  **Inspect:** `head()`, `tail()`, `shape`, `info()`, `describe()`
3.  **Clean:** Fix data types (`to_datetime`, `to_numeric`), handle missing values (`dropna`, `fillna`), create new columns.
4.  **Analyze:** Filter rows `df[...]`, `groupby()`, aggregate (`sum`, `mean`, `count`), `sort_values()`.
5.  **(Bonus) Visualize:** `.plot()`

You now have the foundational skills to tackle data challenges in the hackathon. Good luck, have fun, and Gig 'Em!