
# GroupBy in Pandas

1. Introduction:
   - The `groupby()` function in Pandas is used to group data based on one or more columns.
   - It allows us to perform operations on groups of data, such as aggregation, transformation, or filtering.

2. Syntax:
   - The basic syntax for using `groupby` in Pandas is as follows:
     ```python
     grouped = dataframe.groupby('column_name')
     ```
   - Here, `'column_name'` is the column to group by. To group by several columns, pass a list of names such as `['col_a', 'col_b']`.

3. Common Aggregation Functions:
   - Pandas provides several built-in aggregation functions that can be used with `groupby`:
     - `sum()`: Calculates the sum of the values in each group.
     - `mean()`: Computes the mean (average) of the values in each group.
     - `count()`: Counts the non-null values in each group (per column).
     - `min()`, `max()`: Return the minimum and maximum values in each group.
     - `size()`: Returns the number of rows in each group, including rows with missing values.
     - `agg()`: Allows us to apply custom aggregation functions or perform multiple aggregations simultaneously.


```python
# Example: Grouping and Aggregating Data

# Import the necessary libraries
import pandas as pd

# Create a sample sales DataFrame
data = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Sales': [100, 200, 150, 250, 120, 180]
}
sales_data = pd.DataFrame(data)

# Group the data by 'Region' and calculate the total sales in each region
grouped_data = sales_data.groupby('Region')['Sales'].sum()

# Print the grouped and aggregated data
print(grouped_data)
```
Result:
```
Region
East    370
West    630
```
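The other aggregations listed above follow the same pattern. The sketch below uses a small made-up DataFrame (`toy`, with one missing `Sales` value added on purpose; neither is part of the lesson data) to show how `size()` and `count()` can differ, and how `agg()` can run several aggregations in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one Sales value is missing on purpose
toy = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West'],
    'Sales': [100.0, 200.0, np.nan, 250.0],
})

by_region = toy.groupby('Region')['Sales']

sizes = by_region.size()    # rows per group, missing values included
counts = by_region.count()  # non-null values per group
summary = by_region.agg(['sum', 'mean', 'min', 'max'])  # several aggregations at once

print(sizes)
print(counts)
print(summary)
```

Here `size()` and `count()` disagree for the `East` group because one of its two rows has no `Sales` value.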








## Basic Example

In [3]:
# Import the necessary libraries
import pandas as pd

# Create a sample sales DataFrame
data = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Sales': [100, 200, 150, 250, 120, 180]
}
sales_data = pd.DataFrame(data)

# Group the data by 'Region' and 'Product'
grouped_data = sales_data.groupby(['Region','Product'])['Sales'].sum()

grouped_data

Region  Product
East    A          100
        B          120
        C          150
West    A          250
        B          200
        C          180
Name: Sales, dtype: int64
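
Grouping by two columns returns a Series whose index has two levels (`Region` and `Product`). As a rough sketch of how that result might be used, the block below rebuilds the same sales data and shows a few common ways to get values back out. The variable names `east_a`, `east_only`, `wide`, and `flat` are only illustrative.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Sales': [100, 200, 150, 250, 120, 180],
})
grouped_data = sales_data.groupby(['Region', 'Product'])['Sales'].sum()

east_a = grouped_data.loc[('East', 'A')]  # one value, addressed by both index levels
east_only = grouped_data.loc['East']      # every product in the East region
wide = grouped_data.unstack('Product')    # products become columns, regions stay as rows
flat = grouped_data.reset_index()         # both index levels become ordinary columns

print(east_a)
print(east_only)
print(wide)
print(flat)
```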

# Let's practice


---



1. Import pandas as `pd`.
2. Read the following CSV into a DataFrame called `abnb`:
```
"https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_3/data/sd_listings.csv"
```
  - Check it out using `head()`.



In [4]:
import pandas as pd

abnb = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_3/data/sd_listings.csv")
abnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,6,Stay in the 1st house EVER posted on Airbnb,29,Sara,,North Hills,32.75522,-117.12873,Entire home/apt,371,2,153,2019-09-14,0.87,1,10,0,
1,29967,"Great home, 10 min walk to Beach",129123,Michael,,Pacific Beach,32.80751,-117.2576,Entire home/apt,310,4,89,2022-12-21,0.59,5,235,17,
2,38245,Point Loma: Den downstairs,164137,Melinda,,Roseville,32.74217,-117.21931,Private room,113,1,149,2022-07-24,1.0,3,357,3,
3,54001,"La Jolla Garden Cottage: Blks to Ocn; 2Bdms, 1...",252692,Marsha,,La Jolla,32.81301,-117.26856,Entire home/apt,258,5,301,2022-11-18,2.06,2,67,21,"tier_2, STR-05706L"
4,62274,"charming, colorful, close to beach",302986,Isabel,,Pacific Beach,32.80583,-117.24244,Entire home/apt,103,1,763,2022-12-04,5.2,3,300,68,


In [10]:
abnb['neighbourhood'].value_counts().head(10)

Mission Bay        1843
Pacific Beach      1007
La Jolla            812
East Village        749
Ocean Beach         609
North Hills         601
Midtown             563
Gaslamp Quarter     348
Loma Portal         348
Old Town            267
Name: neighbourhood, dtype: int64
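
`value_counts()` is closely related to grouping: it gives the same numbers as grouping by the column and calling `size()`, sorted from largest to smallest. A minimal sketch on made-up data (the `toy` frame is an assumption, not the Airbnb file):

```python
import pandas as pd

# Hypothetical stand-in for the listings data
toy = pd.DataFrame({
    'neighbourhood': ['La Jolla', 'Old Town', 'La Jolla',
                      'Midtown', 'La Jolla', 'Midtown'],
})

vc = toy['neighbourhood'].value_counts()

# The same counts via groupby: rows per group, then sort descending
gs = toy.groupby('neighbourhood').size().sort_values(ascending=False)

print(vc)
print(gs)
```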

# Create a Grouped Dataset


---


## DataFrameGroupBy Object in Pandas

- A `DataFrameGroupBy` object represents the result of applying the `groupby` operation in Pandas.
- It provides an interface to work with data that has been grouped based on one or more columns.

Key points about a `DataFrameGroupBy` object:

1. Grouping:
   - It represents a collection of groups based on the grouping column(s).
   - It allows you to group data based on one or more columns.

2. Aggregation:
   - It provides an interface for applying aggregate functions like `sum()`, `mean()`, `count()`, etc. to the groups.
   - You can calculate summary statistics for each group or perform custom aggregations.

3. Iteration:
   - It supports iteration over the groups using a `for` loop.
   - You can perform operations on each group individually.

4. Accessing Groups:
   - It allows you to access individual groups using the `get_group()` method.
   - You can access specific groups for further analysis or processing.



Example:

```python
grouped_data = df.groupby('ColumnA')
sum_by_group = grouped_data['ColumnB'].sum()
```
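
The iteration and `get_group()` points above can be sketched the same way. The DataFrame below is invented for illustration and reuses the placeholder column names `ColumnA` and `ColumnB`.

```python
import pandas as pd

# Hypothetical data using the placeholder column names from the example
df = pd.DataFrame({
    'ColumnA': ['x', 'y', 'x', 'y', 'x'],
    'ColumnB': [1, 2, 3, 4, 5],
})
grouped = df.groupby('ColumnA')

# Iteration: each step yields the group key and a DataFrame with that group's rows
group_sizes = {}
for key, group_df in grouped:
    group_sizes[key] = len(group_df)
    print(key, group_df['ColumnB'].tolist())

# Direct access to a single group by its key
x_rows = grouped.get_group('x')
print(x_rows)

# Number of distinct groups
print(grouped.ngroups)
```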

Let's practice with our abnb data

0. Filter for the following neighborhoods:
  - La Jolla
  - Pacific Beach
  - East Village
  - Mission Bay
  - North Hills
  - Gaslamp Quarter

1. Create a grouped dataset by `neighbourhood` called `grouped_neighbourhood`.
2. What is the mean `price` by `neighbourhood`?
3. Use `.agg()` to get the mean, min, and max `price` and the mean and median `minimum_nights`. To aggregate more than one column with `.agg()`, pass a dictionary that maps each column to a list of functions, like this:
  ```python
  grouped_neighbourhood.agg({'price': ["mean", "min", "max"],
                             'minimum_nights': ["mean", "median"]})
  ```
  - Deal with the resulting multi-index columns.




0. Filter for the following neighborhoods:
  - La Jolla
  - Pacific Beach
  - East Village
  - Mission Bay
  - North Hills
  - Gaslamp Quarter

Remember to use `in [ ]` inside your query:

```python
# The column name uses the British spelling: 'neighbourhood'
abnb_filter = abnb.query("neighbourhood in ['La Jolla', 'Pacific Beach', 'East Village', 'Mission Bay', 'North Hills', 'Gaslamp Quarter']")

```
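
If the list of names already lives in a Python variable, two alternatives can be easier to read than a long query string: referencing the variable inside `query()` with `@`, or using a boolean mask built with `isin()`. The sketch below uses a small invented DataFrame and a hypothetical `keep` list; both approaches should select the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the listings data
toy = pd.DataFrame({
    'neighbourhood': ['La Jolla', 'Old Town', 'Mission Bay', 'Midtown'],
    'price': [258, 120, 310, 95],
})
keep = ['La Jolla', 'Mission Bay']

# Option 1: reference the Python list inside the query string with @
via_query = toy.query("neighbourhood in @keep")

# Option 2: boolean mask built with isin()
via_isin = toy[toy['neighbourhood'].isin(keep)]

print(via_query)
print(via_isin)
```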

In [13]:
abnb_filter = abnb.query("neighbourhood in ['La Jolla', \
                         'Pacific Beach', 'East Village', \
                         'Mission Bay', 'North Hills', \
                         'Gaslamp Quarter']")

abnb_filter.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,6,Stay in the 1st house EVER posted on Airbnb,29,Sara,,North Hills,32.75522,-117.12873,Entire home/apt,371,2,153,2019-09-14,0.87,1,10,0,
1,29967,"Great home, 10 min walk to Beach",129123,Michael,,Pacific Beach,32.80751,-117.2576,Entire home/apt,310,4,89,2022-12-21,0.59,5,235,17,
3,54001,"La Jolla Garden Cottage: Blks to Ocn; 2Bdms, 1...",252692,Marsha,,La Jolla,32.81301,-117.26856,Entire home/apt,258,5,301,2022-11-18,2.06,2,67,21,"tier_2, STR-05706L"
4,62274,"charming, colorful, close to beach",302986,Isabel,,Pacific Beach,32.80583,-117.24244,Entire home/apt,103,1,763,2022-12-04,5.2,3,300,68,
5,189785,La Jolla/PB Ocean views! \n1 block to beach & ...,915738,Jeff,,La Jolla,32.81041,-117.26637,Private room,126,31,43,2022-12-09,0.4,1,180,7,


1. Create a grouped dataset by `neighbourhood` called `grouped_neighbourhood`.



In [14]:
grouped_neighbourhood = abnb_filter.groupby('neighbourhood')
print(type(grouped_neighbourhood))
grouped_neighbourhood


<class 'pandas.core.groupby.generic.DataFrameGroupBy'>


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbf04929750>

2. What is the mean `price` by `neighbourhood`?

What do you notice about the data?

In [15]:
res2 = grouped_neighbourhood['price'].mean()
print(f"type ={type(res2)}")
res2

type =<class 'pandas.core.series.Series'>


neighbourhood
East Village       282.423231
Gaslamp Quarter    244.589080
La Jolla           668.667488
Mission Bay        564.046663
North Hills        202.282862
Pacific Beach      328.405164
Name: price, dtype: float64

3. Use `.agg()` to get the mean, min, and max `price` and the mean and median `minimum_nights`. To aggregate more than one column with `.agg()`, pass a dictionary that maps each column to a list of functions, like this:
  ```python
  grouped_neighbourhood.agg({'price': ["mean", "min", "max"],
                             'minimum_nights': ["mean", "median"]})
  ```

In [17]:
grouped_neighbourhood.agg({'price':["mean","min","max"],
      'minimum_nights':["mean","median"]})

Unnamed: 0_level_0,price,price,price,minimum_nights,minimum_nights
Unnamed: 0_level_1,mean,min,max,mean,median
neighbourhood,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
East Village,282.423231,10,10000,10.153538,2.0
Gaslamp Quarter,244.58908,43,4159,7.123563,2.0
La Jolla,668.667488,0,10000,6.115764,3.0
Mission Bay,564.046663,35,100000,3.747151,3.0
North Hills,202.282862,32,3000,5.717138,2.0
Pacific Beach,328.405164,0,2114,6.017875,3.0


## Multi-Index

### Summary of Multi-Indexes with GroupBy in Pandas

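A `groupby` followed by an aggregation can produce a multi-index: grouping by several columns gives a multi-index on the rows, and passing several functions per column to `.agg()` gives a multi-index on the columns.
- A multi-index consists of two or more levels of indexing on one or more axes.
- On the rows, each level corresponds to one of the grouping columns; on the columns, the first level is the original column name and the second is the aggregation function.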

Key points about multi-indexes:

0. They are a PAIN IN THE A#$ to deal with IMO!

1. Structure:
   - Multi-indexes have multiple levels, each representing a different category or grouping.

2. Accessing Data:
   - Use `loc` with a tuple that holds one value per level, such as `('price', 'mean')`, or `iloc` with plain integer positions.

3. Indexing and Slicing:
   - Multi-indexes support advanced indexing and slicing operations, allowing for selective data retrieval.
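
To make the access patterns above concrete, here is a minimal sketch on a tiny made-up DataFrame (`toy` and its values are assumptions, not the Airbnb data). It builds a result with two-level columns and then pulls out a single column, a single cell, and a cross-section.

```python
import pandas as pd

# Hypothetical stand-in for the listings data
toy = pd.DataFrame({
    'neighbourhood': ['La Jolla', 'La Jolla', 'Midtown', 'Midtown'],
    'price': [200, 400, 100, 150],
    'minimum_nights': [2, 4, 1, 3],
})
res = toy.groupby('neighbourhood').agg({'price': ['mean', 'max'],
                                        'minimum_nights': ['mean', 'median']})

# One column as a Series, addressed by a (first level, second level) tuple
one_col = res[('minimum_nights', 'median')]

# One cell: row label plus column tuple
one_cell = res.loc['La Jolla', ('price', 'mean')]

# Cross-section: every 'mean' column, with the function level dropped
all_means = res.xs('mean', axis=1, level=1)

print(one_col)
print(one_cell)
print(all_means)
```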

```python
res_w_multi_index = grouped_neighbourhood.agg({'price': ["mean", "min", "max"],
      'minimum_nights': ["mean", "median", "count"]})

```

Suppose you want to select just the `median` of `minimum_nights` from your result. How the heck do you do that?

```python
# First pick the top-level column, then the aggregation underneath it
selected_data = res_w_multi_index['minimum_nights']['median']
selected_data
```

In [19]:
res_w_multi_index = grouped_neighbourhood.agg({'price':["mean","min","max"],
      'minimum_nights':["mean","median"]})

res_w_multi_index

Unnamed: 0_level_0,price,price,price,minimum_nights,minimum_nights
Unnamed: 0_level_1,mean,min,max,mean,median
neighbourhood,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
East Village,282.423231,10,10000,10.153538,2.0
Gaslamp Quarter,244.58908,43,4159,7.123563,2.0
La Jolla,668.667488,0,10000,6.115764,3.0
Mission Bay,564.046663,35,100000,3.747151,3.0
North Hills,202.282862,32,3000,5.717138,2.0
Pacific Beach,328.405164,0,2114,6.017875,3.0


In [21]:
# Double brackets keep the result a DataFrame; single brackets would return a Series
                                  # level 1         # level 2
selected_data = res_w_multi_index['minimum_nights'][['median']]
selected_data

Unnamed: 0_level_0,median
neighbourhood,Unnamed: 1_level_1
East Village,2.0
Gaslamp Quarter,2.0
La Jolla,3.0
Mission Bay,3.0
North Hills,2.0
Pacific Beach,3.0


# .reset_index()


---

`.reset_index()` flattens the DataFrame somewhat: it moves the row index (`neighbourhood`, in the case of `res_w_multi_index`) into an ordinary column and gives the DataFrame a simple integer index. The columns are still a multi-index, though.

In [22]:
res_w_multi_index.reset_index()

Unnamed: 0_level_0,neighbourhood,price,price,price,minimum_nights,minimum_nights
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,max,mean,median
0,East Village,282.423231,10,10000,10.153538,2.0
1,Gaslamp Quarter,244.58908,43,4159,7.123563,2.0
2,La Jolla,668.667488,0,10000,6.115764,3.0
3,Mission Bay,564.046663,35,100000,3.747151,3.0
4,North Hills,202.282862,32,3000,5.717138,2.0
5,Pacific Beach,328.405164,0,2114,6.017875,3.0


## use .columns

In [23]:
res_w_multi_index.reset_index().columns

MultiIndex([( 'neighbourhood',       ''),
            (         'price',   'mean'),
            (         'price',    'min'),
            (         'price',    'max'),
            ('minimum_nights',   'mean'),
            ('minimum_nights', 'median')],
           )

How can I access the data after `reset_index()`?

In [24]:
res_3 = res_w_multi_index.reset_index()
res_3

Unnamed: 0_level_0,neighbourhood,price,price,price,minimum_nights,minimum_nights
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,max,mean,median
0,East Village,282.423231,10,10000,10.153538,2.0
1,Gaslamp Quarter,244.58908,43,4159,7.123563,2.0
2,La Jolla,668.667488,0,10000,6.115764,3.0
3,Mission Bay,564.046663,35,100000,3.747151,3.0
4,North Hills,202.282862,32,3000,5.717138,2.0
5,Pacific Beach,328.405164,0,2114,6.017875,3.0


One way

In [33]:
res_3["price"]['mean']

0    282.423231
1    244.589080
2    668.667488
3    564.046663
4    202.282862
5    328.405164
Name: mean, dtype: float64

Another way
```python
res_3[[("neighbourhood",""),("price",'mean')]]
```

In [31]:
res_3[[("neighbourhood",""),("price",'mean')]]

Unnamed: 0_level_0,neighbourhood,price
Unnamed: 0_level_1,Unnamed: 1_level_1,mean
0,East Village,282.423231
1,Gaslamp Quarter,244.58908
2,La Jolla,668.667488
3,Mission Bay,564.046663
4,North Hills,202.282862
5,Pacific Beach,328.405164


# Mike's secret recipe for lazy handling of multi-indexes


---

Simply rename the columns by assigning a flat list of names to `.columns`.


In [34]:
res_3.columns

MultiIndex([( 'neighbourhood',       ''),
            (         'price',   'mean'),
            (         'price',    'min'),
            (         'price',    'max'),
            ('minimum_nights',   'mean'),
            ('minimum_nights', 'median')],
           )

In [35]:
res_3.columns = ["neighbourhood", "price_mean","price_min","price_max","minnights_mean","minnights_median" ]
res_3

Unnamed: 0,neighbourhood,price_mean,price_min,price_max,minnights_mean,minnights_median
0,East Village,282.423231,10,10000,10.153538,2.0
1,Gaslamp Quarter,244.58908,43,4159,7.123563,2.0
2,La Jolla,668.667488,0,10000,6.115764,3.0
3,Mission Bay,564.046663,35,100000,3.747151,3.0
4,North Hills,202.282862,32,3000,5.717138,2.0
5,Pacific Beach,328.405164,0,2114,6.017875,3.0


In [36]:
res_3.columns

Index(['neighbourhood', 'price_mean', 'price_min', 'price_max',
       'minnights_mean', 'minnights_median'],
      dtype='object')
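
Typing the new names by hand works, but it has to be redone whenever the aggregation changes. Two possible alternatives are sketched below on a tiny invented DataFrame (`toy` is an assumption, not the Airbnb data): building the flat names by joining the two column levels, and using named aggregation so that the multi-index never appears.

```python
import pandas as pd

# Hypothetical stand-in for the listings data
toy = pd.DataFrame({
    'neighbourhood': ['La Jolla', 'La Jolla', 'Midtown', 'Midtown'],
    'price': [200, 400, 100, 150],
    'minimum_nights': [2, 4, 1, 3],
})

# Option 1: aggregate as before, then join each column tuple into one string
res = toy.groupby('neighbourhood').agg({'price': ['mean', 'max'],
                                        'minimum_nights': ['median']})
flat = res.reset_index()
# Empty second-level names (as for 'neighbourhood') are skipped by the filter
flat.columns = ['_'.join(part for part in col if part) for col in flat.columns]

# Option 2: named aggregation, new_name=(column, function)
named = toy.groupby('neighbourhood').agg(
    price_mean=('price', 'mean'),
    price_max=('price', 'max'),
    minimum_nights_median=('minimum_nights', 'median'),
).reset_index()

print(flat)
print(named)
```

Both options end with the same flat column names. The second one avoids the rename step at the cost of spelling out each output column.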

# Secret Recipe No. 2


---



In [39]:
res_3 = res_w_multi_index
res_3

Unnamed: 0_level_0,price,price,price,minimum_nights,minimum_nights
Unnamed: 0_level_1,mean,min,max,mean,median
neighbourhood,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
East Village,282.423231,10,10000,10.153538,2.0
Gaslamp Quarter,244.58908,43,4159,7.123563,2.0
La Jolla,668.667488,0,10000,6.115764,3.0
Mission Bay,564.046663,35,100000,3.747151,3.0
North Hills,202.282862,32,3000,5.717138,2.0
Pacific Beach,328.405164,0,2114,6.017875,3.0


In [40]:
res_3.stack().reset_index()

Unnamed: 0,neighbourhood,level_1,minimum_nights,price
0,East Village,max,,10000.0
1,East Village,mean,10.153538,282.423231
2,East Village,median,2.0,
3,East Village,min,,10.0
4,Gaslamp Quarter,max,,4159.0
5,Gaslamp Quarter,mean,7.123563,244.58908
6,Gaslamp Quarter,median,2.0,
7,Gaslamp Quarter,min,,43.0
8,La Jolla,max,,10000.0
9,La Jolla,mean,6.115764,668.667488


## Sample Solution


---



In [None]:
# Load the San Diego listings
import pandas as pd

abnb = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_3/data/sd_listings.csv")
abnb.head()

# Summary statistics for the price column
series_price = abnb['price']
mean_price = series_price.mean()
max_price = series_price.max()
min_price = series_price.min()

print(f"San Diego price analysis: \n mean: ${mean_price:.2f}, \n max : ${max_price:.2f}, \n min : ${min_price:.2f}")

# Column-wise sum, mean, and non-null count for three numeric columns
# (print each result, since a cell only displays its last expression)
abnb_slice = abnb[["price", "minimum_nights", "number_of_reviews"]]
print(abnb_slice.sum())
print(abnb_slice.mean())
print(abnb_slice.count())

