# Week 9 Lecture 2
## pandas
- [pandas](https://pandas.pydata.org/) is a Python library for working with DataFrames, the pandas equivalent of a Pyret Table

In [1]:
import pandas as pd

A Pyret table:
```arr
orders = table: date, dish, quantity, order_type
  row: "2023-07-01", "Pasta", 2, "dine-in"
  row: "2023-07-01", "Salad", 1, "takeout"
end
```
- A pandas DataFrame built from a dictionary that maps each column name to a list of values

In [2]:
data = {
    'date': ['2023-07-01', '2023-07-01', '2023-07-02'],
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
    'order_type': ['dine-in', 'takeout', 'dine-in']
}

orders = pd.DataFrame(data)
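
As a quick check on the DataFrame built above, the sketch below rebuilds the same dictionary and inspects its size and column names. `shape` and `columns` are standard pandas attributes; the variable names `n_rows`, `n_cols`, and `column_names` are only illustrative.

```python
import pandas as pd

data = {
    'date': ['2023-07-01', '2023-07-01', '2023-07-02'],
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
    'order_type': ['dine-in', 'takeout', 'dine-in']
}
orders = pd.DataFrame(data)

# Each dictionary key becomes a column; each list position becomes a row.
n_rows, n_cols = orders.shape
column_names = list(orders.columns)

print(n_rows, n_cols)   # 3 4
print(column_names)     # ['date', 'dish', 'quantity', 'order_type']
```

Every list in the dictionary needs the same length, since each one supplies one value per row.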

## Loading and Accessing Data
- pandas provides the `read_csv` function for loading CSV files into a DataFrame


Loading in Pyret:
```arr
orders = load-table: date, dish, quantity, order_type
  source: csv-table-file("orders.csv", default-options)
end
```


In [3]:
orders = pd.read_csv("orders.csv")
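
To try `read_csv` without a file on disk, the sketch below reads the same kind of data from an in-memory string. The CSV text is a hypothetical stand-in for `orders.csv`, and `io.StringIO` is used only so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical file contents, standing in for orders.csv.
csv_text = """date,dish,quantity,order_type
2023-07-01,Pasta,2,dine-in
2023-07-01,Salad,1,takeout
2023-07-02,Burger,3,dine-in
"""

# read_csv accepts a path or a file-like object.
# By default, the first line supplies the column names.
orders_from_csv = pd.read_csv(io.StringIO(csv_text))

print(orders_from_csv.shape)              # (3, 4)
print(orders_from_csv['quantity'].sum())  # 6
```

Unlike Pyret's `load-table`, the column names do not have to be listed up front; pandas takes them from the header row.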

- You can view the first five rows with the `head()` method and the last five with `tail()`

In [None]:
orders.head()

In [None]:
orders.tail()
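
Both methods also accept a row count, which may help when five rows is too many or too few. A minimal sketch, assuming a small made-up DataFrame:

```python
import pandas as pd

# Hypothetical rows, just to have more than five of them.
orders = pd.DataFrame({
    'dish': ['Pasta', 'Salad', 'Burger', 'Soup', 'Pizza', 'Tacos', 'Curry'],
    'quantity': [2, 1, 3, 1, 4, 2, 5],
})

first_two = orders.head(2)   # first two rows
last_two = orders.tail(2)    # last two rows
default_head = orders.head() # five rows when no count is given

print(first_two)
print(last_two)
print(len(default_head))     # 5
```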

- Rows can be accessed by position using the `iloc` accessor with square brackets; positions start at 0

Pyret way:
```arr
orders.row-n(1)["dish"]
```

In [None]:
orders.iloc[1]

In [None]:
orders.iloc[1]["dish"]
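
The sketch below shows a few variations on positional access using the sample orders data from earlier. The names `second_row`, `dish_a`, `dish_b`, and `last_dish` are only illustrative.

```python
import pandas as pd

orders = pd.DataFrame({
    'date': ['2023-07-01', '2023-07-01', '2023-07-02'],
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
    'order_type': ['dine-in', 'takeout', 'dine-in']
})

second_row = orders.iloc[1]          # one row, returned as a Series keyed by column name
dish_a = orders.iloc[1]['dish']      # row first, then column by name
dish_b = orders.iloc[1, 1]           # row position 1, column position 1
last_dish = orders.iloc[-1]['dish']  # negative positions count from the end

print(dish_a, dish_b, last_dish)     # Salad Salad Burger
```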

- A column can be extracted with square brackets and the column name; the result is a pandas Series rather than a plain list

Pyret way:
```arr
quantities = orders.get-column("quantity")
```

In [5]:
quantities = orders['quantity']
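
Bracket access returns a pandas Series. When a plain Python list is needed (the closest match to what Pyret's `get-column` gives back), `tolist()` converts it. A minimal sketch with the sample orders data:

```python
import pandas as pd

orders = pd.DataFrame({
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
})

quantities = orders['quantity']       # a pandas Series
quantity_list = quantities.tolist()   # a plain Python list

print(type(quantities).__name__)      # Series
print(quantity_list)                  # [2, 1, 3]
```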

- There are methods for computing statistics from a column

Pyret way:
```arr
mean(orders, "quantity")    # Direct table operation
sum(orders, "quantity")     # Direct table operation
```

In [None]:
orders['quantity'].mean()

In [None]:
orders['quantity'].sum()
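
For a concrete sense of the results, here are both calls on the three sample quantities. The variable names are only illustrative.

```python
import pandas as pd

orders = pd.DataFrame({
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
})

# The statistics are methods on the column (a Series), not on the whole table.
average_quantity = orders['quantity'].mean()
total_quantity = orders['quantity'].sum()

print(average_quantity)  # 2.0
print(total_quantity)    # 6
```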

- You can get an array of the unique values in a column (not a Series) using the [unique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html#pandas.Series.unique) method

In [None]:
orders['order_type'].unique()
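
Two related Series methods are often useful alongside `unique`: `nunique` counts the distinct values and `value_counts` tallies how often each one appears. A minimal sketch with the sample orders data:

```python
import pandas as pd

orders = pd.DataFrame({
    'dish': ['Pasta', 'Salad', 'Burger'],
    'order_type': ['dine-in', 'takeout', 'dine-in']
})

order_types = orders['order_type'].unique()    # distinct values, in order of first appearance
n_types = orders['order_type'].nunique()       # how many distinct values there are
counts = orders['order_type'].value_counts()   # a Series: value -> number of rows

print(list(order_types))   # ['dine-in', 'takeout']
print(n_types)             # 2
print(counts['dine-in'])   # 2
```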

## Class Exercises
### Creating and Loading DataFrames
- Create a DataFrame of `workouts` data manually, with the columns `activity` and `duration`. Make at least 5 rows.

In [12]:
import pandas as pd

workouts = {
    'activity': ['Bicep curl', 'Deadlift', 'Treadmill', 'Battle ropes', 'Situps'],
    'duration': [5, 3, 15, 7, 6]
}

workout_data_frame = pd.DataFrame(workouts)

- Load the CSV from `photos.csv` into a DataFrame. Print the first 5 rows.

In [2]:
import pandas as pd

photos = pd.read_csv("photos.csv")
photos.head()

### Accessing Data
- Get the second row from your `workouts` DataFrame (remember: Python uses 0-based indexing).

In [4]:
import pandas as pd

# Position 1 is the second row of the workouts DataFrame
workout_data_frame.iloc[1]

activity    Deadlift
duration           3
Name: 1, dtype: object

- Extract the `activity` column and print all unique activity names.

In [13]:
import pandas as pd

workouts = {
    'activity': ['Bicep curl', 'Deadlift', 'Treadmill', 'Battle ropes', 'Situps'],
    'duration': [5, 3, 15, 7, 6]
}

workout_data_frame = pd.DataFrame(workouts)

workout_data_frame['activity'].unique()

array(['Bicep curl', 'Deadlift', 'Treadmill', 'Battle ropes', 'Situps'],
      dtype=object)

- Get the duration value from the third workout (combining row and column access).

In [15]:
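import pandas as pd

workouts = {
    'activity': ['Bicep curl', 'Deadlift', 'Treadmill', 'Battle ropes', 'Situps'],
    'duration': [5, 3, 15, 7, 6]
}

workout_data_frame = pd.DataFrame(workouts)

# Row position 2 is the third workout; then pick the column by name
duration_of_third = workout_data_frame.iloc[2]['duration']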

- What happens if you try to access a row that doesn't exist? Try it and note the error.


In [18]:
import pandas as pd
impossible_row = workout_data_frame.iloc[7]

# IndexError: there is no row at position 7, so the single positional indexer is out of bounds.

<class 'IndexError'>: single positional indexer is out-of-bounds
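
One way to deal with this error is to catch it, or to check the position against the number of rows first. The sketch below assumes the five-row workouts DataFrame from the exercises; `row_position` and `is_valid` are hypothetical names.

```python
import pandas as pd

workout_data_frame = pd.DataFrame({
    'activity': ['Bicep curl', 'Deadlift', 'Treadmill', 'Battle ropes', 'Situps'],
    'duration': [5, 3, 15, 7, 6]
})

row_position = 7

# Option 1: attempt the access and handle the failure.
try:
    row = workout_data_frame.iloc[row_position]
except IndexError as error:
    row = None
    print(f"No row at position {row_position}: {error}")

# Option 2: check first. Valid non-negative positions run from 0 to len - 1.
is_valid = 0 <= row_position < len(workout_data_frame)
print(is_valid)  # False
```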

- What happens if you try to access a column that doesn't exist? Try it and note the error.


In [19]:
import pandas as pd
impossible_column = workout_data_frame['repetitions']

# KeyError: there is no column named 'repetitions'.

<class 'KeyError'>: 'repetitions'
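
The error depends on how the column is requested. A minimal sketch, assuming the same workouts data: bracket access with a missing name raises `KeyError`, while passing a string to `iloc` raises `TypeError` because `iloc` only accepts positions. The list `caught` is just a hypothetical way to record what happened.

```python
import pandas as pd

workout_data_frame = pd.DataFrame({
    'activity': ['Bicep curl', 'Deadlift', 'Treadmill', 'Battle ropes', 'Situps'],
    'duration': [5, 3, 15, 7, 6]
})

caught = []

# Column access by name: the name is missing, so pandas raises KeyError.
try:
    workout_data_frame['repetitions']
except KeyError as error:
    caught.append(type(error).__name__)

# iloc expects integer positions, so a string raises TypeError instead.
try:
    workout_data_frame.iloc['repetitions']
except TypeError as error:
    caught.append(type(error).__name__)

# Checking membership first avoids the exception entirely.
has_column = 'repetitions' in workout_data_frame.columns

print(caught)      # ['KeyError', 'TypeError']
print(has_column)  # False
```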

### Extracting Columns & Statistics
- Extract the `duration` column from your `workouts` DataFrame and store it in a variable called `durations`.

In [21]:
durations = workout_data_frame['duration']

- Work with the `durations` Series to find: `.mean()`, `.sum()`, `.max()`, `.min()`.

In [9]:
# `return` is only valid inside a function, so print each result instead
print(durations.mean())
print(durations.sum())
print(durations.max())
print(durations.min())

- Calculate the `range` (difference between `max` and `min`) of workout durations.

In [26]:
# Named duration_range so that Python's built-in range() is not shadowed
duration_range = workout_data_frame['duration'].max() - workout_data_frame['duration'].min()
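
For reference, here is the range worked through on the five sample durations. The name `duration_range` is chosen in this sketch so that Python's built-in `range` stays usable afterwards.

```python
import pandas as pd

durations = pd.Series([5, 3, 15, 7, 6], name='duration')

longest = durations.max()    # 15
shortest = durations.min()   # 3
duration_range = longest - shortest

print(duration_range)        # 12
print(durations.sum())       # 36
print(durations.mean())      # 7.2
```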

- For the photos dataset, extract a numeric column and calculate its median using `.median()`.

In [37]:
# photos has no numeric column, so this takes the median of how often each Subject appears
photos["Subject"].value_counts().median()

np.float64(7.0)
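
To see what `.median()` does on small inputs, the sketch below uses the workout durations plus two made-up Series. The `subjects` values are hypothetical and are not taken from `photos.csv`.

```python
import pandas as pd

# Odd number of values: the median is the middle one after sorting (3, 5, 6, 7, 15).
durations = pd.Series([5, 3, 15, 7, 6])
odd_median = durations.median()

# Even number of values: the median is the average of the two middle ones (2 and 3).
even_median = pd.Series([2, 1, 3, 4]).median()

# Median of category counts, in the style of the photos answer above.
subjects = pd.Series(['Mountain', 'Mountain', 'City', 'Beach', 'Beach', 'Beach'])
count_median = subjects.value_counts().median()  # counts are 3, 2, 1

print(odd_median)    # 6.0
print(even_median)   # 2.5
print(count_median)  # 2.0
```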