# 1.3 Data Frames and Indexes
Imagine you’re in a massive library with thousands of books, and you need to find one fast. You’d flip to the card catalog, where each book has an index number to guide you—row by row, shelf by shelf. That’s what data frames and indexes are in the data world: a slick way to organize and access info using Python’s pandas library, like having a treasure map with labeled dig spots!

## What Are Data Frames and Indexes?
A data frame is like a super-organized table in pandas, where rows are observations (e.g., people) and columns are variables (e.g., names, toy counts). An index is like the library’s card catalog—a special label for each row, making it a breeze to find or filter data. Let’s check out our toy data:

We’ll load it from a file called `toy_data.csv` in a `data` folder. Make sure it’s there first (see the script below to create it if needed). Here’s how to get started:

In [1]:
import pandas as pd

# Import the sample dataset
data = pd.read_csv('data/toy_data.csv')

print(data.head())  # Show the first 5 rows to get a peek

   ID    Name Favorite Toy  Number of Toys  Price per Toy  Is Gifted
0   1  Farhan         Kite               5          11.61      False
1   2   Aisha         Ball              14           8.15      False
2   3    Ella         Lego               6          11.07      False
3   4    Jack         Doll              14          32.68      False
4   5    Jack        Robot               7          10.68      False
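
If `data/toy_data.csv` is missing, a script along these lines could generate a stand-in. It is only a sketch: the names and toys are taken from the outputs in this section, while the seed, the value ranges, and the share of gifted toys are assumptions, so its numbers will not match the outputs shown here.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical generator for a stand-in toy_data.csv.
# The seed, value ranges, and gift probability are assumptions.
rng = np.random.default_rng(seed=42)
n_rows = 100

names = ['Aisha', 'Ben', 'Clara', 'David', 'Ella', 'Farhan', 'Grace', 'Hiro', 'Jack']
toys = ['Ball', 'Block', 'Car', 'Doll', 'Kite', 'Lego', 'Puzzle', 'Robot', 'Teddy', 'Train']

toy_data = pd.DataFrame({
    'ID': np.arange(1, n_rows + 1),
    'Name': rng.choice(names, size=n_rows),
    'Favorite Toy': rng.choice(toys, size=n_rows),
    'Number of Toys': rng.integers(1, 21, size=n_rows),        # 1 to 20 inclusive
    'Price per Toy': rng.uniform(5, 50, size=n_rows).round(2),
    'Is Gifted': rng.random(n_rows) < 0.3,                      # assumed 30% gifted
})

# Create the data folder if needed, then write the CSV without the row index
csv_path = Path('data') / 'toy_data.csv'
csv_path.parent.mkdir(exist_ok=True)
if not csv_path.exists():  # do not overwrite an existing dataset
    toy_data.to_csv(csv_path, index=False)
```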


`toy_data.csv` contains 100 rows with the columns `ID`, `Name`, `Favorite Toy`, `Number of Toys`, `Price per Toy`, and `Is Gifted`. The values are randomly generated, so a different run can produce different rows than the output above. For example:

| ID  | Name   | Favorite Toy | Number of Toys | Price per Toy | Is Gifted |
|-----|--------|--------------|----------------|---------------|-----------|
| 1   | Aisha  | Car          | 5              | 12.34         | False     |
| 2   | Ben    | Doll         | 3              | 8.56          | True      |
| 3   | Clara  | Block        | 7              | 19.87         | False     |

The index starts as row numbers (0, 1, 2, ...), but let’s make it more useful by using `ID` as the index:

In [2]:
# Set 'ID' as the index
data.set_index('ID', inplace=True)

print(data.head())

      Name Favorite Toy  Number of Toys  Price per Toy  Is Gifted
ID                                                               
1   Farhan         Kite               5          11.61      False
2    Aisha         Ball              14           8.15      False
3     Ella         Lego               6          11.07      False
4     Jack         Doll              14          32.68      False
5     Jack        Robot               7          10.68      False
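
Note that `set_index(..., inplace=True)` modifies `data` directly. As a small sketch of two alternatives, the snippet below either assigns the result of `set_index` to a new name or sets the index while loading, using the `index_col` parameter of `pd.read_csv`. The three-row inline CSV is made up so the snippet runs without the real file.

```python
import io

import pandas as pd

# Tiny made-up CSV so the example runs without toy_data.csv
csv_text = """ID,Name,Number of Toys
1,Farhan,5
2,Aisha,14
3,Ella,6
"""

# Option 1: set_index returns a new DataFrame; the original is left unchanged
raw = pd.read_csv(io.StringIO(csv_text))
by_id = raw.set_index('ID')

# Option 2: choose the index column while reading the file
by_id_direct = pd.read_csv(io.StringIO(csv_text), index_col='ID')

print(raw.index)            # default RangeIndex (0, 1, 2)
print(by_id.index)          # labels taken from the ID column
print(by_id.equals(by_id_direct))
```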


Now, finding data for a specific ID is a snap! Here we look up ID 3, which belongs to Ella in the output above:

In [3]:
# Access data for ID 3 (Ella in the run shown here; yours may differ)
id3_data = data.loc[3, ['Favorite Toy', 'Number of Toys', 'Price per Toy']]
print(f"Data for ID 3: {id3_data}")

Data for ID 3: Favorite Toy       Lego
Number of Toys        6
Price per Toy     11.07
Name: 3, dtype: object
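
One detail worth spelling out: once `ID` is the index, `loc` looks rows up by label, while `iloc` still counts positions from 0. Here is a minimal sketch on a made-up three-row frame whose values mirror the first rows printed earlier:

```python
import pandas as pd

# Small stand-in frame; values copied from the first rows shown earlier
toys = pd.DataFrame(
    {
        'Name': ['Farhan', 'Aisha', 'Ella'],
        'Favorite Toy': ['Kite', 'Ball', 'Lego'],
        'Number of Toys': [5, 14, 6],
    },
    index=pd.Index([1, 2, 3], name='ID'),
)

by_label = toys.loc[3, 'Name']     # row whose ID label is 3
by_position = toys.iloc[0, 0]      # first row, first column, whatever the labels
print(by_label)      # Ella
print(by_position)   # Farhan

# A single label gives a Series; a list of labels gives a DataFrame
print(type(toys.loc[3]).__name__)        # Series
print(type(toys.loc[[2, 3]]).__name__)   # DataFrame
```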


## Why Is This Necessary?

- **In Mathematics**: Indexes and data frames let us tweak data fast, like sorting grades or filtering numbers for analysis.
- **In Machine Learning (ML)**: This setup is key for preprocessing—cleaning and shaping data before models get to work.

## Relevance in Machine Learning
Data frames with indexes are the prep station for ML. They let you carve out specific rows (e.g., high toy counts) or columns (e.g., favorite toys) to train models, like guessing who might buy more toys. Without this, you’d be digging through a haystack—indexes are your metal detector!
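
As a rough sketch of what that prep work can look like, the snippet below filters rows and picks columns to form a feature table `X` and a target `y`. The tiny frame and the choice of `Is Gifted` as the target are assumptions made for illustration, and no model is trained here.

```python
import pandas as pd

# Made-up rows in the same shape as toy_data.csv
toys = pd.DataFrame(
    {
        'ID': [1, 2, 3, 4],
        'Number of Toys': [5, 14, 6, 14],
        'Price per Toy': [11.61, 8.15, 11.07, 32.68],
        'Is Gifted': [False, False, False, True],
    }
).set_index('ID')

# Rows: keep only entries with more than 5 toys
subset = toys[toys['Number of Toys'] > 5]

# Columns: numeric features for a model, and the (assumed) target to predict
X = subset[['Number of Toys', 'Price per Toy']]
y = subset['Is Gifted']

# X and y share the ID index, so each feature row stays matched to its label
print(X.index.equals(y.index))   # True
print(X.shape)                   # (3, 2)
```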

## Applications

- **Filtering Datasets**: A retailer might filter sales by region with indexes to nail marketing targets.
- **Time Series Analysis**: Stock prices indexed by date can reveal trends over time.
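
To make the time series case concrete, here is a minimal sketch of date-based selection with a `DatetimeIndex`. The prices and dates are invented for illustration.

```python
import pandas as pd

# Hypothetical closing prices; the dates and values are invented
prices = pd.DataFrame(
    {'Close': [101.0, 102.5, 99.8, 103.2, 104.0, 105.1]},
    index=pd.to_datetime(
        ['2024-01-29', '2024-01-30', '2024-01-31',
         '2024-02-01', '2024-02-02', '2024-02-05']
    ),
)
prices.index.name = 'Date'

# Partial-string indexing: select every row in February 2024
february = prices.loc['2024-02']
print(february)

# Label-based date slice (both endpoints are included)
late_january = prices.loc['2024-01-30':'2024-01-31']
print(late_january['Close'].mean())
```

Because the dates are the index, no separate filter column or comparison is needed to pull out a month or a date range.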

## Step-by-Step Example
Let’s play library detectives with our toy data:

1. **Import the Data**: Load `toy_data.csv` from the `data` folder with `pd.read_csv()`.
2. **Set an Index**: Use `set_index('ID')` to make IDs the key, so ID 3 points straight to one row (Ella’s, in the output shown earlier).
3. **Access It**: Grab that row’s toy count with `data.loc[3, 'Number of Toys']`.

Let’s add a filter—find entries where the number of toys is above 10 and calculate total value:

In [4]:
# Filter for toy counts above 10; .copy() gives an independent DataFrame,
# so adding a column below does not trigger a SettingWithCopyWarning
high_toy_count = data[data['Number of Toys'] > 10].copy()
print(high_toy_count)

# Calculate total value (Number of Toys * Price per Toy) for these entries
high_toy_count['Total Value'] = high_toy_count['Number of Toys'] * high_toy_count['Price per Toy']
print(high_toy_count[['Number of Toys', 'Price per Toy', 'Total Value']])

      Name Favorite Toy  Number of Toys  Price per Toy  Is Gifted
ID                                                               
2    Aisha         Ball              14           8.15      False
4     Jack         Doll              14          32.68      False
6      Ben       Puzzle              15          46.33       True
8     Jack         Lego              11          44.41      False
10   Clara         Ball              14          23.72       True
11    Ella         Lego              15          35.13       True
13  Farhan         Kite              19          31.35      False
14   David        Teddy              18          32.37      False
16  Farhan        Train              17          28.20      False
19   David         Kite              18          39.17       True
20   Aisha          Car              15           6.78       True
21    Hiro         Ball              19          10.07      False
24   Grace        Robot              18          26.23       True
[output truncated]


    Number of Toys  Price per Toy  Total Value
ID                                            
2               14           8.15       114.10
4               14          32.68       457.52
6               15          46.33       694.95
8               11          44.41       488.51
10              14          23.72       332.08
11              15          35.13       526.95
13              19          31.35       595.65
14              18          32.37       582.66
16              17          28.20       479.40
19              18          39.17       705.06
20              15           6.78       101.70
21              19          10.07       191.33
24              18          26.23       472.14
27              14           7.95       111.30
29              20          48.59       971.80
30              13          31.24       406.12
32              14          34.73       486.22
33              18          32.31       581.58
35              18          21.62       389.16
[output truncated]
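
As an alternative sketch of the same filter-then-compute step, `DataFrame.assign` returns a new frame with the extra column and leaves the original untouched. The three rows below are copied from the first rows of the dataset shown earlier, and the names `toys` and `high` are introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the head of the dataset shown earlier
toys = pd.DataFrame(
    {
        'Number of Toys': [5, 14, 14],
        'Price per Toy': [11.61, 8.15, 32.68],
    },
    index=pd.Index([1, 2, 4], name='ID'),
)

# Filter, then add the computed column in one chained expression.
# The column name contains spaces, so it is passed via a dict.
high = toys[toys['Number of Toys'] > 10].assign(
    **{'Total Value': lambda df: df['Number of Toys'] * df['Price per Toy']}
)

print(high)
print('Total Value' in toys.columns)   # False: the original is unchanged
```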

## Practical Insights

- **Custom Indexes**: Labels like IDs or dates beat plain numbers for clarity.
- **Efficiency**: Indexes let you grab data in one line, a lifesaver with 100+ rows.
- **Flexibility**: Reset or switch indexes with `reset_index()` or `set_index()` anytime.

Let’s try resetting the index and sorting by toy count:

In [5]:
# Reset index and sort by Number of Toys
data_reset = data.reset_index().sort_values(by='Number of Toys', ascending=False)
print(data_reset.head())

    ID  Name Favorite Toy  Number of Toys  Price per Toy  Is Gifted
28  29  Hiro        Teddy              20          48.59      False
79  80  Jack         Doll              20          22.32      False
66  67  Ella        Teddy              20          45.56      False
50  51   Ben       Puzzle              20          47.38       True
73  74  Jack        Train              20          41.52      False


## Common Pitfalls to Avoid

- **Duplicate Indexes**: If two rows share an ID, `loc` returns all matching rows instead of a single one, which can quietly break code that expects one result. Keep indexes unique (our IDs already are).
- **Missing Files**: If `toy_data.csv` isn’t in `data/`, `pd.read_csv` raises a `FileNotFoundError`. Check the path!
- **Over-Indexing**: Too many index levels (e.g., ID and Name) make selections harder to write and read. Keep it lean.
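
The duplicate-index pitfall is easy to check for. In this minimal sketch, the frame with a repeated ID is made up for illustration; `index.is_unique` and the `verify_integrity` option of `set_index` are standard pandas features.

```python
import pandas as pd

# Made-up frame in which ID 2 appears twice
dupes = pd.DataFrame(
    {'ID': [1, 2, 2], 'Name': ['Farhan', 'Aisha', 'Ella']}
).set_index('ID')

print(dupes.index.is_unique)   # False
print(dupes.loc[2])            # two rows come back, not one

# verify_integrity=True makes set_index raise instead of accepting duplicates
try:
    dupes.reset_index().set_index('ID', verify_integrity=True)
    duplicates_rejected = False
except ValueError as err:
    duplicates_rejected = True
    print(f"Duplicate IDs rejected: {err}")
```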

Let’s test a scenario with a missing file path (commented out to avoid breaking the notebook):

In [6]:
# This line raises FileNotFoundError because the path does not exist
# data_wrong_path = pd.read_csv('wrong_path/toy_data.csv')  # Uncomment to see the error

# Instead, let’s assume the file exists and check for missing values
print(data.isnull().sum())  # Should show 0 for all columns with this dataset

Name              0
Favorite Toy      0
Number of Toys    0
Price per Toy     0
Is Gifted         0
dtype: int64
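
Rather than uncommenting the failing line, one way to see the missing-file behavior safely is to catch the exception. This is a minimal sketch: `wrong_path/toy_data.csv` is assumed not to exist, and the `load_toy_data` helper is a hypothetical name introduced here.

```python
import pandas as pd


def load_toy_data(path):
    """Return the CSV at `path` as a DataFrame, or None if the file is missing."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"Could not find '{path}'. Check the folder and file name.")
        return None


# Assumed not to exist, so the helper reports the problem instead of crashing
missing = load_toy_data('wrong_path/toy_data.csv')
print(missing is None)
```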


## What’s Next?
We’ve nailed finding treasures with indexes. Next, we’ll tackle nonrectangular data structures—think of it as exploring maps with quirky shapes. Ready for the next adventure?