# 1.8 Exploring Binary Categorical Data
Imagine you’re flipping a coin to decide—heads or tails, yes or no—or taking a class vote on pizza versus burgers. That’s the world of binary categorical data, where we deal with two options, like pass/fail or like/dislike. We use tools like mode, expected value, and probability to uncover patterns in this simple yet powerful data type. It’s like reading the room to see what everyone’s leaning toward!

## What Is Exploring Binary Categorical Data?
Binary categorical data has just two categories, often coded as 0 and 1 or yes/no. We explore it with:

- **Mode**: The most common category, like the winning vote in our pizza debate.
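- **Expected Value**: The long-run average outcome if we repeated the experiment many times. With 0/1 coding it equals the probability of a 1, so a fair coin has an expected value of 0.5 per flip (about 5 heads in 10 flips).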
- **Probability**: The chance of each category, like the likelihood of passing a test.
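
Before computing any of these, yes/no answers usually need to be turned into 0/1 codes. A minimal sketch of that step, assuming a small made-up list of survey answers (the `answers` values and the `coding` dictionary are illustrative, not part of the toy dataset):

```python
# Hypothetical survey answers (made-up values for illustration)
answers = ["yes", "no", "yes", "yes", "no"]

# Map each label to a 0/1 code; an unexpected label raises KeyError
coding = {"no": 0, "yes": 1}
coded = [coding[a] for a in answers]

print(coded)                    # [1, 0, 1, 1, 0]
print(sum(coded) / len(coded))  # share of "yes" answers: 0.6
```

Using a dictionary lookup rather than a default value means a typo such as "Yes " fails loudly instead of being silently counted as a No.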

Let’s use a custom dataset of toy preferences. The cell below generates 100 random entries, saves them as `toy_preferences.csv` in your `data` folder, and then loads the file back:

In [None]:
import pandas as pd
import numpy as np
import os

# Generate 100 random entries; values change on each run unless you set a seed with np.random.seed()
os.makedirs("data", exist_ok=True)
names = ["Aisha", "Ben", "Clara", "David", "Ella", "Farhan", "Grace", "Hiro", "Isla", "Jack"]
n = 100
toy_data = pd.DataFrame({
    "Name": np.random.choice(names, n),
    "Likes Cars (1=Yes, 0=No)": np.random.choice([0, 1], n, p=[0.4, 0.6])  # 60% like cars
})
toy_data.to_csv('data/toy_preferences.csv', index=False)
print("Dataset saved as data/toy_preferences.csv with", n, "entries")

# Load the dataset
data = pd.read_csv('data/toy_preferences.csv')

print(data.head())  # Show the first 5 rows

Let’s calculate these measures. For mode:

In [None]:
# Calculate mode
mode_preference = data['Likes Cars (1=Yes, 0=No)'].mode()[0]
print(f"Mode preference: {'Yes' if mode_preference == 1 else 'No'}")  # Most likely prints: Mode preference: Yes

For the expected value, take the mean of the 0/1 column, which estimates it from the sample:

In [None]:
# Calculate expected value
expected_value = data['Likes Cars (1=Yes, 0=No)'].mean()
print(f"Expected value (probability of Yes): {expected_value:.2f}")  # Outputs something like: 0.60

For the probability of “Yes”: with 0/1 coding the mean is the share of 1s, so this is the same calculation:

In [None]:
# Calculate probability of Yes
probability_yes = data['Likes Cars (1=Yes, 0=No)'].mean()
print(f"Probability of liking cars: {probability_yes:.2f}")  # Outputs something like: 0.60
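
Why the mean doubles as the probability can be shown in two lines. This is a short derivation, writing $X$ for the 0/1 code, $p$ for the probability of Yes, and $x_1, \dots, x_n$ for the observed values (symbols introduced here for illustration):

$$
E[X] = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) = p
$$

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{\text{number of 1s}}{n} = \hat{p}
$$

So calling `.mean()` on a 0/1 column returns the sample share of Yes, $\hat{p}$, which is an estimate of $p$.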

Let’s visualize the distribution with a bar plot:

In [None]:
import matplotlib.pyplot as plt

# Count No (0) and Yes (1); reindex keeps both bars even if one category is absent
counts = data['Likes Cars (1=Yes, 0=No)'].value_counts().reindex([0, 1], fill_value=0)
plt.bar(['No', 'Yes'], counts, color=['lightcoral', 'lightgreen'])
plt.title('Toy Preference Distribution')
plt.ylabel('Count')
plt.show()

Because each entry is drawn with a 60% chance of Yes, a sample of 100 will almost always have mode 1 (Yes) and a mean close to 0.60, though the exact value changes from run to run. The probability of Yes is that same mean, showing a clear preference for cars!

## Why Is This Necessary?

- **In Mathematics**: It helps analyze simple decisions or outcomes, like calculating pass rates in a class.
- **In Machine Learning (ML)**: It’s the foundation for classification tasks, where models predict yes/no labels.

## Relevance in Machine Learning
Binary data is the starting point for classification models, like predicting if a customer will buy (yes/no). The mode gives a baseline: a model that always predicts the most common label is the one any real model has to beat. The share of Yes labels is the base rate that predicted probabilities should be compared against. Checking both up front helps you spot rare categories that a model could otherwise ignore.
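
To make the baseline idea concrete, here is a minimal sketch of a majority-class baseline. The function name `majority_baseline_accuracy` and the 90/10 purchase labels are hypothetical, chosen only for illustration:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Hypothetical purchase labels: 1 = bought, 0 = did not buy
labels = [0] * 90 + [1] * 10
label, accuracy = majority_baseline_accuracy(labels)
print(f"Always predict {label}: accuracy = {accuracy:.2f}")  # accuracy = 0.90
```

A 90% accuracy here says nothing about the 10 buyers, which is why accuracy alone can flatter a model on imbalanced data.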

## Applications

- **Pass/Fail Rates**: Schools track how many students pass a test to adjust teaching.
- **Product Preferences**: Companies use yes/no surveys to see which products are hits.

## Step-by-Step Example
Let’s explore our toy preferences:

1. **Load the Data**: Import `toy_preferences.csv` from `data/`.
2. **Find Mode**: 1 (Yes) wins with more votes—cars are the favorite!
3. **Calculate Expected Value**: A mean around 0.60 means about 60% of the sample likes cars.
4. **Determine Probability**: It matches the expected value, so the estimated chance of Yes is about 60%.
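
The steps above can be bundled into one helper. This is a sketch, not the notebook's own code: `summarize_binary` is a hypothetical name, and it accepts any sequence of 0/1 codes, so a plain list or the `Likes Cars (1=Yes, 0=No)` column should both work:

```python
def summarize_binary(values):
    """Return sample size, mode, and category shares for a sequence of 0/1 codes."""
    values = [int(v) for v in values]
    if not values:
        raise ValueError("need at least one value")
    if any(v not in (0, 1) for v in values):
        raise ValueError("values must be coded as 0 or 1")

    n = len(values)
    p_yes = sum(values) / n  # mean of 0/1 codes = share of Yes
    if p_yes > 0.5:
        mode = 1
    elif p_yes < 0.5:
        mode = 0
    else:
        mode = None  # exact tie: no single most common category
    return {"n": n, "mode": mode, "p_yes": p_yes, "p_no": 1 - p_yes}

summary = summarize_binary([1, 1, 0, 1, 0])
print(summary["n"], summary["mode"], round(summary["p_yes"], 2))  # 5 1 0.6
```

Returning `None` on a tie is a design choice made here to make ties visible.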

Run the code above to confirm—our data leans toward car lovers!

## Practical Insights

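- **Balance Check**: If Yes and No are close (e.g., 50/50), the mode alone can mislead, so report the proportions too. In an exact tie, pandas `mode()` returns both values and `[0]` picks 0 (No).
- **Small Samples**: An estimate of 0.60 from 100 entries is more reliable than the same estimate from 10, because sampling error shrinks as the sample grows.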
- **Binary Coding**: Consistent 0/1 coding avoids confusion—stick to it!
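
How much more reliable a larger sample is can be estimated with the standard error of a proportion, `sqrt(p * (1 - p) / n)`. A small sketch, assuming a true share of 0.6 and three illustrative sample sizes (the helper name `standard_error` is made up here):

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion with true share p and sample size n."""
    return math.sqrt(p * (1 - p) / n)

p = 0.6
for n in (10, 100, 1000):
    print(f"n={n:>4}: standard error ~ {standard_error(p, n):.3f}")
# n=  10: standard error ~ 0.155
# n= 100: standard error ~ 0.049
# n=1000: standard error ~ 0.015
```

With 100 entries, an observed share a few points away from 0.60 is therefore ordinary sampling noise, not a sign that something is wrong.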

Let’s adjust the preference ratio to 50/50 and recalculate:

In [None]:
# Generate data with 50/50 preference
toy_data_50_50 = pd.DataFrame({
    "Name": np.random.choice(names, n),
    "Likes Cars (1=Yes, 0=No)": np.random.choice([0, 1], n, p=[0.5, 0.5])
})
toy_data_50_50.to_csv('data/toy_preferences_50_50.csv', index=False)

# Load and analyze
data_50_50 = pd.read_csv('data/toy_preferences_50_50.csv')
expected_value_50_50 = data_50_50['Likes Cars (1=Yes, 0=No)'].mean()
print(f"Expected value (probability of Yes) with 50/50: {expected_value_50_50:.2f}")  # Outputs something like: 0.50

## Common Pitfalls to Avoid

- **Imbalanced Data**: If 90% say Yes, the mode hides the minority, so always report both categories.
- **Misusing Expected Value**: It’s a long-run average, not a guarantee for any single sample, so don’t over-rely on it.
- **Small Sample Error**: 100 votes keep random sampling error moderate (a standard error of about 0.05 near a 50/50 split), but more data gives a more precise estimate.
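
To see that the expected value is an average rather than a guarantee, it may help to draw several samples from the same source and compare them. A rough simulation sketch, assuming a true Yes probability of 0.6 and five repeated samples of 100 (the seed and the number of repeats are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed so reruns give the same numbers
p_true, n = 0.6, 100

proportions = []
for _ in range(5):
    # Each entry is 1 (Yes) with probability p_true, otherwise 0 (No)
    sample = [1 if random.random() < p_true else 0 for _ in range(n)]
    proportions.append(sum(sample) / n)

print(proportions)  # five sample shares scattered around 0.6
```

Each sample share is a valid estimate of the same 0.6, yet no two need to agree exactly.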

Let’s check the count of Yes/No for the 50/50 case:

In [None]:
# Visualize 50/50 distribution; reindex keeps both bars even if one category is absent
counts_50_50 = data_50_50['Likes Cars (1=Yes, 0=No)'].value_counts().reindex([0, 1], fill_value=0)
plt.bar(['No', 'Yes'], counts_50_50, color=['lightcoral', 'lightgreen'])
plt.title('50/50 Toy Preference Distribution')
plt.ylabel('Count')
plt.show()

## What’s Next?
We’ve cracked the yes/no code. Next, we’ll move to sampling and probability in Module 2—think of it as rolling the dice on bigger datasets. Ready to level up?