# 1.5 Estimates of Location
Imagine you’re lining up your classmates by height to find the “middle” person—someone who’s not the tallest or shortest, just right in the center. Or picture a vote for the class’s favorite game, where one option stands out as the most popular. That’s what estimates of location are all about—finding the central point or most typical value in a dataset using tools like the mean, median, and mode. It’s like discovering the heartbeat of your data!

## What Are Estimates of Location?
Estimates of location help us summarize where the data clusters. Here’s the trio:

- **Mean**: The average, like sharing all your toys equally among friends. Add up the values and divide by the count.
- **Median**: The middle value when data is sorted, like picking the kid in the exact center of the line.
- **Mode**: The most frequent value, like the game everyone votes for the most.
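
To make the three definitions concrete, here is a minimal sketch using Python’s built-in `statistics` module. The list `toy_counts` is made up for illustration and is not taken from `toy_data.csv`.

```python
import statistics

# Made-up toy counts for five friends (illustrative values only)
toy_counts = [2, 3, 3, 5, 7]

mean_value = statistics.mean(toy_counts)      # (2 + 3 + 3 + 5 + 7) / 5
median_value = statistics.median(toy_counts)  # middle of the sorted list
mode_value = statistics.mode(toy_counts)      # value that appears most often

print(f"Mean: {mean_value:.1f}")    # Mean: 4.0
print(f"Median: {median_value}")    # Median: 3
print(f"Mode: {mode_value}")        # Mode: 3
```

Here the mean (4.0) sits above the median (3) because the single 7 pulls the average up a little.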

Let’s use our existing `toy_data.csv` dataset, which has 100 rows and several columns. We’ll focus on the `Number of Toys` column. Load it like this:

In [None]:
import pandas as pd

# Import the sample dataset from the data folder
data = pd.read_csv('data/toy_data.csv')

# Set 'ID' as index for easy access
data.set_index('ID', inplace=True)

print(data.head())  # Show the first 5 rows to get a peek

The dataset includes `Name`, `Favorite Toy`, `Number of Toys`, `Price per Toy`, and `Is Gifted`. Let’s calculate the estimates for `Number of Toys`:

For mean:

In [None]:
# Calculate mean number of toys
mean_toys = data['Number of Toys'].mean()
print(f"Mean number of toys: {mean_toys:.2f}")  # Outputs something like: Mean number of toys: 10.50 (varies)

For median:

In [None]:
# Calculate median number of toys
median_toys = data['Number of Toys'].median()
print(f"Median number of toys: {median_toys:.2f}")  # Outputs something like: Median number of toys: 10.00 (varies)

For mode (most frequent value):

In [None]:
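# Calculate the mode; pandas returns a Series because several values can tie for most frequent
mode_toys = data['Number of Toys'].mode()
# The Series is only empty when the column has no non-missing values
print(f"Mode(s) of number of toys: {mode_toys.tolist() if not mode_toys.empty else 'No mode'}")  # Outputs the most frequent value(s), e.g., [5]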

Because the 100 toy counts are random values between 1 and 20, the exact mean and median depend on the data file, and the mode is simply whichever count happens to repeat most often. Let’s visualize this with a histogram:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Plot histogram of Number of Toys
data['Number of Toys'].hist(bins=20, color='lightcoral')
plt.axvline(mean_toys, color='red', linestyle='dashed', linewidth=2, label=f'Mean: {mean_toys:.2f}')
plt.axvline(median_toys, color='green', linestyle='dashed', linewidth=2, label=f'Median: {median_toys:.2f}')
plt.title('Distribution of Number of Toys')
plt.xlabel('Number of Toys')
plt.ylabel('Frequency')
plt.legend()
plt.show()

## Why Is This Necessary?

- **In Mathematics**: These measures summarize data, helping us understand trends or compare groups, like averaging test scores.
- **In Machine Learning (ML)**: They provide a baseline to predict values or fill in missing data, guiding model training.

## Relevance in Machine Learning
Estimates of location are the starting line for ML. The mean might set a default prediction (e.g., average toy count), while the median handles skewed data (e.g., if one friend had 50 toys). The mode helps with categorical predictions, like guessing the most popular item. They’re essential for setting expectations before diving into complex models.
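
As a rough sketch of the ideas above, the snippet below fills a missing toy count with the median and uses the mode as a baseline guess for a categorical column. Both Series are made-up examples, not rows from `toy_data.csv`, and this is only one simple way to handle missing values.

```python
import pandas as pd

# Made-up toy counts with one missing value and one extreme value (50)
toys = pd.Series([4, 6, None, 8, 50], name='Number of Toys')

# pandas skips missing values by default when computing these
median_fill = toys.median()  # median of 4, 6, 8, 50
mean_fill = toys.mean()      # mean of 4, 6, 8, 50

# Fill the gap with the median, which is less affected by the 50
toys_filled = toys.fillna(median_fill)

# For a categorical column, the mode serves as both fill value and baseline guess
favorites = pd.Series(['Car', 'Doll', 'Car', None], name='Favorite Toy')
baseline_guess = favorites.mode()[0]
favorites_filled = favorites.fillna(baseline_guess)

print(f"Median fill: {median_fill}, mean fill: {mean_fill}")
print(toys_filled.tolist())
print(favorites_filled.tolist())
```

In this made-up data the mean (17.0) is far from most of the observed counts, while the median (7.0) stays close to them, which is why the median is often preferred for skewed columns.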

## Applications

- **Average Scores**: Teachers use the mean to gauge class performance on a test.
- **Popular Items**: Stores track the mode to stock the most-liked products.

## Step-by-Step Example
Let’s find the center of our toy data:

1. **Load the Data**: Import `toy_data.csv` from `data/`.
2. **Calculate Mean**: Add all `Number of Toys`, divide by 100—our average!
3. **Find Median**: Sort the 100 values. Because 100 is an even count, there is no single middle value, so take the average of the 50th and 51st values.
4. **Check Mode**: Look for the most frequent toy count.

The mean and median give us a sense of balance, while the mode highlights the crowd favorite. The histogram above shows how they fit the distribution!
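
The steps above can be sketched by hand, without pandas. The helper names `manual_mean` and `manual_median` and the `sample` list are hypothetical, chosen for illustration, and the functions assume a non-empty list of numbers.

```python
import statistics
from collections import Counter

def manual_mean(values):
    """Add up the values and divide by how many there are."""
    return sum(values) / len(values)

def manual_median(values):
    """Sort the values, then take the middle one (or average the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: for n = 100 these are the 50th and 51st values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Made-up sample with an even number of values
sample = [7, 3, 9, 3, 12, 6]

most_frequent = Counter(sample).most_common(1)[0][0]

print(manual_mean(sample))
print(manual_median(sample))  # 6.5, the average of 6 and 7
print(most_frequent)          # 3

# Cross-check against the standard library
assert manual_median(sample) == statistics.median(sample)
```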

## Practical Insights

- **Context Matters**: The mean can be skewed by outliers (e.g., 50 toys), so the median might be safer.
- **Frequency Clues**: The mode shines with categorical data, like favorite colors.
- **Combination Power**: Use all three to get a fuller picture of your data.

Let’s test an outlier by adding a friend with 50 toys and recalculating:

In [None]:
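# Add an outlier: one new friend with 50 toys (ID 101 assumes the existing IDs run from 1 to 100)
outlier = pd.DataFrame(
    {'Name': ['Zara'], 'Favorite Toy': ['Robot'], 'Number of Toys': [50],
     'Price per Toy': [25.00], 'Is Gifted': [False]},
    index=[101],
)
data_with_outlier = pd.concat([data, outlier])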

# Recalculate mean and median
mean_with_outlier = data_with_outlier['Number of Toys'].mean()
median_with_outlier = data_with_outlier['Number of Toys'].median()
print(f"Mean with outlier: {mean_with_outlier:.2f}")
print(f"Median with outlier: {median_with_outlier:.2f}")
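
To see the same effect on numbers small enough to check by hand, here is a minimal sketch with a made-up list of five toy counts (not from `toy_data.csv`) before and after adding one friend with 50 toys.

```python
import statistics

# Made-up toy counts, evenly spread around 10
small_counts = [8, 9, 10, 11, 12]
small_with_outlier = small_counts + [50]

mean_before = statistics.mean(small_counts)
median_before = statistics.median(small_counts)
mean_after = statistics.mean(small_with_outlier)
median_after = statistics.median(small_with_outlier)

print(f"Mean:   {mean_before:.2f} -> {mean_after:.2f}")      # Mean:   10.00 -> 16.67
print(f"Median: {median_before:.2f} -> {median_after:.2f}")  # Median: 10.00 -> 10.50
```

One extra value moves the mean by more than six toys, while the median shifts by only half a toy.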

## Common Pitfalls to Avoid

- **Outlier Impact**: A single huge value (e.g., 50) can drag the mean up—watch for this!
- **Even Counts**: With an even number of values, the median is the average of the two middle numbers—don’t forget!
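- **No Mode**: If all values are unique, no value is more frequent than the rest, so the mode tells you nothing. In that case pandas’ `mode()` returns every value, so plan for this.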
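
One possible way to guard against the mode pitfall is sketched below. The helper `mode_or_none` is hypothetical (it is not a pandas function) and it treats "no value repeats" as "no mode", which is a choice made for this example.

```python
import pandas as pd

def mode_or_none(values: pd.Series):
    """Return a sorted list of the most frequent value(s), or None if nothing repeats."""
    counts = values.value_counts()
    if counts.empty or counts.max() == 1:
        return None
    return sorted(counts[counts == counts.max()].index.tolist())

all_unique = pd.Series([1, 2, 3])
two_way_tie = pd.Series([5, 5, 10, 10, 3])

# pandas reports every value as a mode when all values are unique
print(all_unique.mode().tolist())   # [1, 2, 3]
print(mode_or_none(all_unique))     # None

# With a tie, both tied values are modes
print(two_way_tie.mode().tolist())  # [5, 10]
print(mode_or_none(two_way_tie))    # [5, 10]
```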

Let’s check the mode for `Favorite Toy` (categorical data) to see if there’s a popular pick:

In [None]:
# Calculate mode for Favorite Toy
mode_toy = data['Favorite Toy'].mode()[0]
print(f"Most popular toy: {mode_toy}")  # Outputs the most frequent toy, e.g., 'Car'

## What’s Next?
We’ve found the center of our toy world. Next, we’ll explore how spread out the data is with estimates of variability—think of it as measuring how far apart the kids stand in line. Ready to dig deeper?