# Lesson 14: `pandas` Part 4: Grouping and Sorting Code-Along Notebook

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Group data with `groupby()`
2. Sort data with `sort_values()`


## Files Needed for this lesson: `wine.csv`
>- Download this CSV file from Canvas before the lesson

## The general steps to working with pandas:
1. Import pandas: `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- CSV files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read is not in the same folder as your notebook, you can do one of two things:
>>- Move the file into the same folder/directory as the notebook
>>- Type out the full path to the file in the read function
4. After steps 1-3, check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame, or `head(n)` to show the first `n`
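
The steps above can be sketched end to end as follows. This is only a minimal illustration: the `csv_text` string is a made-up stand-in for a file on disk, and `io.StringIO` lets `read_csv()` treat it like a file so the example runs without any download.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the wine data)
csv_text = """id,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,90,30.0
"""

# Steps 2-3: load the data; index_col=0 uses the first column as the index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Step 4: inspect the result
print(df.shape)    # (rows, columns)
print(df.head())   # first 5 records by default
print(df.head(2))  # pass a number to show that many records instead
```

With a real file you would pass the file name or full path to `read_csv()` in place of the `io.StringIO(...)` object.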

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. Import modules and set the working directory
2. Read data in
3. Check the data

In [2]:
import os
import pandas as pd

from google.colab import drive
drive.mount('/content/drive/')

os.chdir('/content/drive/MyDrive/Files_for_pandas/')



Mounted at /content/drive/


In [None]:
# change to desired directory
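
If you are not running in Colab, one possible way to check and change the working directory is sketched below. The temporary folder from `tempfile.gettempdir()` is only a placeholder assumption; substitute the folder that holds your own copy of the data.

```python
import os
import tempfile

start_dir = os.getcwd()   # where the notebook currently looks for files
print(start_dir)

# tempfile.gettempdir() is only a stand-in for your own data folder
os.chdir(tempfile.gettempdir())
print(os.getcwd())
print(os.listdir()[:5])   # a few of the files visible from here

os.chdir(start_dir)       # switch back so later cells are unaffected
```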



# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `wine.csv`
>- Set the index to column 0

In [5]:
wineReviews=pd.read_csv('wine.csv', index_col=0)

### Check how many rows, columns, and data points are in the `wineReviews` DataFrame
>- Use `shape` and its indices to define variables: `shape[0]` is the number of rows and `shape[1]` is the number of columns
>- We can store the values for rows and columns in variables if we want to access them later

In [6]:
wineReviews.shape

(129971, 13)
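
Because `shape` is a `(rows, columns)` tuple, it can also be unpacked in one line, and `size` gives the total number of data points. The sketch below uses a made-up three-row DataFrame (`toy` is a hypothetical name), not the wine data.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({"points": [87, 87, 90], "price": [None, 15.0, 30.0]})

n_rows, n_cols = toy.shape   # tuple unpacking: (rows, columns)
n_cells = toy.size           # rows * columns, counting missing values too

print(n_rows, n_cols, n_cells)
```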

In [8]:
rows = wineReviews.shape[0]

columns = wineReviews.shape[1]

### Check a couple of rows of data

In [9]:
wineReviews.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos


# Descriptive Analytics with `groupby()`
>- General syntax: `dataFrame.groupby(['fields to group by']).fieldToAnalyze.aggregation()`
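
As a rough illustration of that pattern, the sketch below groups a small made-up DataFrame by country and averages the points in each group. The `toy` data and variable names are assumptions for the example only.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country": ["US", "US", "Italy", "Italy", "Italy"],
    "points":  [90, 88, 87, 91, 85],
    "price":   [30.0, 20.0, 15.0, 40.0, None],
})

# dataFrame.groupby([fields to group by]).fieldToAnalyze.aggregation()
mean_points = toy.groupby(["country"]).points.mean()

# Bracket selection is equivalent and also works for column names
# that are not valid Python identifiers (for example, names with spaces)
mean_points_alt = toy.groupby(["country"])["points"].mean()

print(mean_points)
```

The result is a Series indexed by the grouping field, which is why the later cells can look up a single group by its label.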

### Now, what is/are the question(s) being asked of the data?
>- All analytics projects start with questions (from you, your boss, some decision maker, etc.)

### How many wines have been rated at each point value?

In [10]:
wineReviews.groupby(['points']).points.count()

points
80       397
81       692
82      1836
83      3025
84      6480
85      9530
86     12600
87     16933
88     17207
89     12226
90     15410
91     11359
92      9613
93      6489
94      3758
95      1535
96       523
97       229
98        77
99        33
100       19
Name: points, dtype: int64
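
One detail worth knowing: `count()` counts only non-missing values in the selected column, while `size()` counts every row in the group. The two agree above because `points` has no missing values in these groups, but they can differ for a column like `price`. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data with some missing prices
toy = pd.DataFrame({
    "points": [87, 87, 90, 90, 90],
    "price":  [None, 15.0, 30.0, None, 25.0],
})

grouped = toy.groupby(["points"])

non_missing_prices = grouped.price.count()  # skips NaN prices
rows_per_group = grouped.size()             # counts every row

print(non_missing_prices)
print(rows_per_group)
```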

### How much does the least expensive wine for each point rating cost?

In [11]:
minPrice= wineReviews.groupby(['points']).price.min()


minPrice

points
80      5.0
81      5.0
82      4.0
83      4.0
84      4.0
85      4.0
86      4.0
87      5.0
88      6.0
89      7.0
90      8.0
91      7.0
92     11.0
93     12.0
94     13.0
95     20.0
96     20.0
97     35.0
98     50.0
99     44.0
100    80.0
Name: price, dtype: float64

### How much does the most expensive wine for each point rating cost?

In [13]:
maxPrice = wineReviews.groupby(['points']).price.max()

maxPrice

points
80       69.0
81      130.0
82      150.0
83      225.0
84      225.0
85      320.0
86      170.0
87      800.0
88     3300.0
89      500.0
90      510.0
91     2013.0
92      750.0
93      770.0
94     1125.0
95      973.0
96     2500.0
97     2000.0
98     1900.0
99      850.0
100    1500.0
Name: price, dtype: float64

### What is the overall maximum price for all wines?

In [14]:
maxPrice.max()

3300.0

### What is the lowest price for a wine rating of 100?

In [16]:
minPrice[100]

80.0

### What is the highest price for a wine rating of 80?

In [17]:
maxPrice[80]

69.0
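
The lookups above work because the grouped result is a Series whose index holds the point values. Since that index is made of integers, `.loc[]` can make it explicit that the lookup is by label rather than by position. A minimal sketch on made-up data (`toy` and `max_price` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "points": [80, 80, 90, 100],
    "price":  [12.0, 69.0, 510.0, 80.0],
})

max_price = toy.groupby(["points"]).price.max()

print(max_price.loc[80])   # label-based lookup: highest price at 80 points
print(max_price.max())     # overall maximum across all ratings
print(max_price.idxmax())  # the rating at which that maximum occurs
```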

### What is the maximum rating for each country?

In [None]:
countryMax = wineReviews.groupby(['country']).points.max()

countryMax

### What is the maximum rating for China?

In [19]:
countryMax['China']

89

##### Another way to get the maximum rating for China, combining `where()` and `groupby()`

In [20]:
wineReviews.where(wineReviews['country']=='China').groupby(['country']).points.max()

country
China    89.0
Name: points, dtype: float64
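
Notice that this result is `89.0` (a float) rather than `89`. `where()` keeps every row and replaces the non-matching ones with `NaN`, which forces the integer column to become float. Filtering with a boolean mask drops the non-matching rows instead. The sketch below compares the two on made-up data.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country": ["China", "China", "US"],
    "points":  [89, 85, 95],
})

# where() keeps every row and fills non-matching rows with NaN,
# so the integer column is upcast to float
via_where = toy.where(toy["country"] == "China").groupby(["country"]).points.max()

# Boolean indexing drops non-matching rows, so the dtype stays integer
via_mask = toy[toy["country"] == "China"].points.max()

print(via_where)
print(via_mask)
```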

### What are some summary stats for price for each country?
>- Using the `agg()` function for specific summary stats
>>- What is the sample size?
>>- What is the minimum?
>>- What is the maximum?
>>- What is the mean?
>>- What is the median?
>>- What is the standard deviation?

In [None]:
round(wineReviews.groupby(['country']).price.agg(['count', 'min', 'max', 'mean', 'median', 'std']), 2)

In [None]:
countryAgg = wineReviews.groupby(['country']).price.describe()

countryAgg
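
If you want your own column names in the summary table, `agg()` also accepts keyword arguments of the form `new_name="function"`. The column names below (`n`, `lowest`, `highest`, `average`) are arbitrary choices for this sketch, and the data is made up.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy"],
    "price":   [20.0, 30.0, 40.0, 15.0, None],
})

summary = (
    toy.groupby(["country"])
       .price
       .agg(n="count", lowest="min", highest="max", average="mean")
       .round(2)
)
print(summary)
```

Missing prices are skipped by each statistic, so Italy's summary here is based on a single price.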

## What are the descriptive analytics for country and province?
>- We can group by multiple fields by adding more to our groupby() function

In [None]:
wineReviews.groupby(['country','province']).points.describe()
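
Grouping by two fields produces a result indexed by `(country, province)` pairs. A minimal sketch of how such a result can be queried, using made-up data and a single statistic to keep the output small:

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country":  ["US", "US", "US", "Italy"],
    "province": ["Colorado", "Colorado", "Oregon", "Tuscany"],
    "points":   [86, 88, 90, 92],
})

by_region = toy.groupby(["country", "province"]).points.mean()

print(by_region)                          # indexed by (country, province)
print(by_region.loc[("US", "Colorado")])  # one specific pair
print(by_region.loc["US"])                # every province within one country
```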

## What are the descriptive price analytics for the US?
>- Add `get_group()` syntax

In [27]:
wineReviews.groupby('country').get_group('US').price.describe()

count    54265.000000
mean        36.573464
std         27.088857
min          4.000000
25%         20.000000
50%         30.000000
75%         45.000000
max       2013.000000
Name: price, dtype: float64

## What are the summary wine rating stats for Colorado?
>- Note that states are coded in this dataset under province

In [29]:
wineReviews.groupby(['country','province']).get_group(('US','Colorado')).points.describe()

count    68.000000
mean     86.117647
std       1.943450
min      80.000000
25%      85.000000
50%      86.000000
75%      87.000000
max      91.000000
Name: points, dtype: float64
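
When grouping by two fields, `get_group()` takes a tuple with one key per field. The same rows can also be selected with a combined boolean mask, which is sketched next to it below on made-up data.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country":  ["US", "US", "US", "Italy"],
    "province": ["Colorado", "Colorado", "Oregon", "Tuscany"],
    "points":   [86, 88, 90, 92],
})

# get_group() with two grouping fields takes a tuple of keys
via_group = toy.groupby(["country", "province"]).get_group(("US", "Colorado"))

# The same rows selected with a combined boolean mask
mask = (toy["country"] == "US") & (toy["province"] == "Colorado")
via_mask = toy[mask]

print(via_group.points.describe())
```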

# Sorting Results
>- Add `sort_values()` syntax
>- Default is ascending order

## What are the summary stats for points for each country?
>- Sort the results from lowest to highest mean points

In [None]:
wineReviews.groupby(['country']).points.describe().sort_values(by='mean')

### To sort in descending order...
>- Use `ascending=False`
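
A small sketch of both sort directions, using made-up data in place of the wine reviews (`toy`, `lowest_first`, and `highest_first` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country": ["US", "US", "Italy", "Italy", "France"],
    "points":  [90, 88, 87, 85, 95],
})

stats = toy.groupby(["country"]).points.describe()

lowest_first = stats.sort_values(by="mean")                    # ascending (default)
highest_first = stats.sort_values(by="mean", ascending=False)  # descending

print(highest_first[["count", "mean"]])
```

`sort_values()` returns a new, sorted DataFrame and leaves `stats` unchanged, so both orderings can be kept side by side.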