# Load Pandas

In [None]:
import pandas as pd

# Load Data Set

- we'll load the housing price data to continue experimenting with Pandas

- [here](https://www.kaggle.com/c/home-data-for-ml-course/data) is the source of the data 
  - find the "Download All" button to download the entire data set



In [None]:
# read the csv from drive (google drive in this case)
data = pd.read_csv('/content/drive/My Drive/Datasets/home-data-for-ml-course/train.csv')
# add your own path above to read the train.csv file


In [None]:
# display the previously loaded DataFrame
data

# Grouping 

- often we want to split our data into groups and then compute something separately for each group

- we do this with the `groupby()` operation

- you can think of each group we generate as a slice of our DataFrame containing only the rows whose value in the grouping column matches that group's label
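
- as a minimal sketch of the "slice" idea, the cell below iterates over a `groupby()` on a small made-up DataFrame
  - `toy` and its values are hypothetical and are not part of the housing data set

```python
import pandas as pd

# hypothetical toy data, not the housing data set
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "A", "B", "C"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# iterating over a groupby yields (group label, sub-DataFrame) pairs
groups = {name: frame for name, frame in toy.groupby("Neighborhood")}

# each sub-DataFrame holds only the rows whose Neighborhood equals the label
print(groups["A"])
```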

### `.count()` with `.groupby()`

- one function we've seen before is the `value_counts()` function 
  - we can reproduce the counts that `value_counts()` gives (though in a different row order) by doing the following

In [None]:
# using `count()` to get the number of entries per Neighborhood (ordered by neighborhood name)
data.groupby('Neighborhood').Neighborhood.count()

In [None]:
# using `value_counts()` to get the number of entries per Neighborhood (ordered by count, largest first)
data.Neighborhood.value_counts()

### `.min()` with `.groupby()`

- to get the smallest lot area in each Neighborhood

In [None]:
# smallest LotArea in each Neighborhood
data.groupby('Neighborhood').LotArea.min()

### `.apply()` with `.groupby()`
- the DataFrame for each group produced by `.groupby()` is accessible to us directly through the `apply()` method
  - we can then manipulate each group's data in any way we see fit

In [None]:
# get the YearBuilt of the first row listed for each neighborhood
data.groupby('Neighborhood').apply(lambda dataframe: dataframe.YearBuilt.iloc[0])

- `.apply()` can be used for even more fine-grained control
  - for example, by using `.groupby()` with two columns:
    - `Neighborhood`
    - `YearBuilt`

In [None]:
# find the row with the best overall condition for each neighborhood and year built
data.groupby(['Neighborhood','YearBuilt']).apply(lambda dataframe: dataframe.loc[dataframe.OverallCond.idxmax()])

- aside: [`.idxmax()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html) documentation page
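
- to see what `.idxmax()` hands to `.loc` in the cell above, here is a small sketch on made-up data
  - the row labels (`house_a`, ...) and values are hypothetical
  - when several rows tie for the maximum, `.idxmax()` returns the label of the first one

```python
import pandas as pd

# hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"OverallCond": [5, 8, 6]},
    index=["house_a", "house_b", "house_c"],
)

# idxmax() returns the index label of the largest value, not the value itself
best_label = toy.OverallCond.idxmax()

# the label can then be passed to .loc to pull out the whole row
best_row = toy.loc[best_label]
print(best_label)
print(best_row)
```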

### `.agg()` with `.groupby()`

- another `groupby()` method is `agg()`
  - it lets you run several different functions on each group in a single call
 
- for instance, we can generate a simple statistical summary of the dataset as follows:

In [None]:
# simple statistical summary of sale price by neighborhood
data.groupby('Neighborhood').SalePrice.agg([len,min,max])

- effective use of `groupby()` will allow you to do lots of really powerful things with your dataset
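
- `agg()` also accepts the names of built-in aggregations as strings; a minimal sketch on made-up data:
  - `toy` and its prices are hypothetical
  - note that `'count'` skips missing values, whereas `len` counts every row, so the two can differ on columns with gaps

```python
import pandas as pd

# hypothetical toy data, not the housing data set
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "SalePrice": [100, 300, 200, 250, 150],
})

# one output column per aggregation name, one row per group
summary = toy.groupby("Neighborhood").SalePrice.agg(["count", "min", "max"])
print(summary)
```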

# Complex Indexing

- in the examples we've seen so far 
  - we've been working with DataFrame or Series objects with a single-label index
  
- `groupby()` is slightly different 
  - depending on the operation we run, it will sometimes result in what is called a *multi-index*

- a *multi-index* differs from a *regular index* in that it has multiple levels

- *multi-index* example:

In [None]:
# count the houses (BldgType entries) built in each year for each neighborhood
building_types_in_neighborhood_per_year = data.groupby(['Neighborhood','YearBuilt']).BldgType.agg([len])
building_types_in_neighborhood_per_year # this generates a multi-index DataFrame

In [None]:
# check the type of this DataFrame's index
type(building_types_in_neighborhood_per_year.index)

- [multi-indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) - pandas documentation page

- multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices

- they also require one label per level (two in this case) to retrieve a single value

- the multi-index method you will use most often is the one for converting back to a regular index, the `reset_index()` method

In [None]:
# convert multi-index to regular-index
building_types_in_neighborhood_per_year.reset_index()
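
- before resetting the index, values can be looked up with one label per level; a small sketch on made-up data:
  - `toy`, `counts` and the values are hypothetical stand-ins for the housing data

```python
import pandas as pd

# hypothetical toy data, not the housing data set
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "YearBuilt": [2000, 2000, 2005, 2000],
    "BldgType": ["1Fam", "1Fam", "Duplex", "1Fam"],
})

# grouping by two columns produces a two-level MultiIndex
counts = toy.groupby(["Neighborhood", "YearBuilt"]).BldgType.agg(["count"])

# a full key is a tuple with one label per index level
a_2000 = counts.loc[("A", 2000), "count"]

# a partial key (first level only) returns every row under that label
a_all = counts.loc["A"]
print(a_2000)
print(a_all)
```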

# Sorting 

- looking at `building_types_in_neighborhood_per_year`,
  - the grouping shows data in index order, not in value order
  - i.e. the rows in the result of a `.groupby()` are ordered by the values in the index, not by the values in the resulting data

- to get the data in the order we want, we can sort it ourselves
  - the `sort_values()` method is used for this

In [None]:
# reset the index and save it back into the same variable
building_types_in_neighborhood_per_year = building_types_in_neighborhood_per_year.reset_index()

# sort values by `len`
building_types_in_neighborhood_per_year.sort_values(by='len')

- `.sort_values()` defaults to an ascending sort, where the lowest values go first

- however, most of the time we want a descending sort, where the higher numbers go first

In [None]:
# sort values by `len` in descending order
building_types_in_neighborhood_per_year.sort_values(by='len', ascending=False)

- to sort by index values, use the `sort_index()` method
- it accepts the same `ascending` argument and also defaults to ascending order

In [None]:
# sort rows by index (after `reset_index()` above, this is the default integer index)
building_types_in_neighborhood_per_year.sort_index()
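
- `sort_index()` is more interesting on a multi-index, where it sorts level by level; a minimal sketch on made-up data:
  - `toy` and its values are hypothetical

```python
import pandas as pd

# hypothetical toy data with a two-level index in scrambled order
toy = pd.DataFrame(
    {"len": [3, 1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("A", 2005), ("B", 1990), ("A", 1999)],
        names=["Neighborhood", "YearBuilt"],
    ),
)

# default: ascending by Neighborhood, then by YearBuilt
ascending = toy.sort_index()

# same argument as sort_values() to reverse the order
descending = toy.sort_index(ascending=False)
print(ascending)
print(descending)
```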

- finally, you can sort by more than one column at a time:

In [None]:
# sort values by Neighborhood first, then by `len` within each neighborhood
building_types_in_neighborhood_per_year.sort_values(by=['Neighborhood','len'])