# `pandas` Part 2: this notebook is a 2nd lesson on `pandas`
## The main objective of this tutorial is to slice up some DataFrames using `pandas`
>- Reading data into DataFrames is step 1
>- But most of the time we will want to select specific pieces of data from our datasets

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Read data into a pandas DataFrame
2. Select specific data from a pandas DataFrame
3. Insert data into a DataFrame

## Files Needed for this lesson: `wine.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps to working with pandas:
1. `import pandas as pd`
>- Note the `as pd` is optional but is a common alias used for pandas and makes writing the code a bit easier
2. Create or load data into a pandas DataFrame or Series
>- In practice, you will likely be loading more datasets than creating but we will learn both
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame, or pass a number such as `head(10)` to show more
5. Then you will likely want to slice up your data into smaller subset datasets
>- This step is the focus of this lesson
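
The general steps above can be sketched end to end as follows. This is a minimal sketch, not the lesson's actual data: the CSV text is a hypothetical stand-in for a file on disk, and the names `csv_text`, `sample`, and `us_only` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the real wine.csv).
csv_text = """country,points,price
Italy,87,
Portugal,87,15.0
US,87,14.0
"""

# Steps 2-3: load the data into a DataFrame.
sample = pd.read_csv(io.StringIO(csv_text))

# Step 4: check out the DataFrame.
print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # first two records

# Step 5: slice out a smaller subset.
us_only = sample.loc[sample['country'] == 'US']
print(us_only)
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the file name.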

# First, check your working directory and move to your desired directory

In [1]:
import os

from google.colab import drive

drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
# change to desired directory
os.chdir('/content/drive/MyDrive/Files_for_pandas/')

# Step 1: Import pandas and give it an alias

In [3]:
import pandas as pd

# Step 2: Read Data Into a DataFrame
>- Knowing how to create your own data can be useful
>- However, most of the time we will read data into a DataFrame from a csv or Excel file

## File Needed: `wine.csv`
>- Make sure you download this file from Canvas and place in your working directory

### Read the csv file with `pd.read_csv('fileName.csv')`
>- Set the index to column 0

In [5]:
wineReviews=pd.read_csv('wine.csv', index_col= 0)
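
To see what `index_col=0` changes, here is a small sketch that reads the same text with and without it. It assumes the first column of the file is an unlabeled row number, which is a guess about how `wine.csv` is laid out; the CSV text and the names `without_index` and `with_index` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column has no header (assumed layout).
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Without index_col, the unlabeled column becomes a regular data column.
without_index = pd.read_csv(io.StringIO(csv_text))
print(without_index.columns.tolist())

# With index_col=0, that column becomes the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index.columns.tolist())
print(with_index.index.tolist())
```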

### Check how many rows/records and columns are in the `wineReviews` DataFrame
>- Use `shape`

In [8]:
wineReviews.shape

(129971, 13)

### Check a couple of rows of data

In [9]:
wineReviews.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos


### Now we can access columns in the dataframe using syntax similar to how we access values in a dictionary

In [11]:
wineReviews['country']

0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

### To get a single value...

In [12]:
wineReviews['country'][500]

'Spain'
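
The lookup above chains two steps: it first pulls out the column, then looks up a label in that column. As a sketch on a hypothetical three-row DataFrame (the name `sample` and its values are made up), the same value can be fetched in a single `loc` call:

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US'],
    'points': [87, 87, 87],
})

# Two steps: select the column, then look up index label 1.
chained = sample['country'][1]

# One step: row label and column name together.
direct = sample.loc[1, 'country']

print(chained, direct)
```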

### Using the indexing operator like we did above should seem familiar
>- We have accessed data like this using dictionaries
>- However, pandas also has its own selection/access operators, `loc` and `iloc`
>- For basic operations, we can use the familiar dictionary syntax
>- As we get more advanced, we should use `loc` and `iloc`
>- It might help to think of `loc` as "label based location" and `iloc` as "integer position based location"

### Both `loc` and `iloc` start with the row then the column
#### Use `iloc` for integer position based location, similar to what we have done with lists
#### Use `loc` for label based location. This uses index labels and column names instead of positions to retrieve the data we want.
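
With the default index (0, 1, 2, ...) labels and positions happen to match, which can hide the difference between the two. The sketch below uses a hypothetical DataFrame with index labels 10, 20, 30 (chosen only for illustration) so the distinction is visible.

```python
import pandas as pd

# Hypothetical data with a non-default index so labels differ from positions.
sample = pd.DataFrame(
    {'country': ['Italy', 'Portugal', 'US'], 'points': [87, 88, 89]},
    index=[10, 20, 30],
)

# iloc: first row, first column, by position.
by_position = sample.iloc[0, 0]

# loc: row labeled 10, column named 'country'.
by_label = sample.loc[10, 'country']

# There is no row labeled 0, so loc raises a KeyError here.
try:
    sample.loc[0, 'country']
    missing_label = False
except KeyError:
    missing_label = True

print(by_position, by_label, missing_label)
```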

# First, let's look at index based selection using `iloc`

## As we work these examples, remember we specify row first then column

### Selecting the first row using `iloc`
>- For the wine reviews dataset this is the first record (index 0, the wine from Italy), not the header row; the column names are not counted as a row

In [13]:
wineReviews.iloc[0, : ]

country                                                              Italy
description              Aromas include tropical fruit, broom, brimston...
designation                                                   Vulkà Bianco
points                                                                  87
price                                                                  NaN
province                                                 Sicily & Sardinia
region_1                                                              Etna
region_2                                                               NaN
taster_name                                                  Kerin O’Keefe
taster_twitter_handle                                         @kerinokeefe
title                                    Nicosia 2013 Vulkà Bianco  (Etna)
variety                                                        White Blend
winery                                                             Nicosia
Name: 0, dtype: object

### To return all the rows of a particular column with `iloc`
>- To get everything, just put a `:` for row and/or column

In [14]:
wineReviews.iloc[:, 0]

0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

### To return the first three rows of the first column...

In [15]:
wineReviews.iloc[:3, 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

### To return the second and third rows...

In [17]:
wineReviews.iloc[1:3, 0]

1    Portugal
2          US
Name: country, dtype: object

### We can also pass a list for the rows to get specific values

In [18]:
wineReviews.iloc[[1, 2, 3, 5, 100, 10000], 0]

1        Portugal
2              US
3              US
5           Spain
100            US
10000      France
Name: country, dtype: object

### Can we pass lists for both rows and columns...?

In [19]:
wineReviews.iloc[[1, 4, 5], [0, 1, 2]]

Unnamed: 0,country,description,designation
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block
5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro


### We can also go from the end of the rows just like we did with lists
>- The following gets the last 5 records for country in the dataset

In [20]:
wineReviews.iloc[-5: , 0]

129966    Germany
129967         US
129968     France
129969     France
129970     France
Name: country, dtype: object

### To get the last 5 records for all columns...

In [21]:
wineReviews.iloc[-5:, :]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


# Label-Based Selection with `loc`
## With `loc`, we use index labels for the rows and column names for the columns to retrieve data

In [22]:
wineReviews.loc[0, 'country']

'Italy'
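
One difference worth knowing when slicing: `iloc` slices exclude the end position (like Python lists), while `loc` slices include the end label. A minimal sketch on a hypothetical five-row DataFrame with the default index:

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({'country': ['Italy', 'Portugal', 'US', 'US', 'US']})

# Positions 1 and 2; position 3 is excluded.
by_position = sample.iloc[1:3]

# Labels 1, 2 and 3; the end label is included.
by_label = sample.loc[1:3]

print(len(by_position), len(by_label))
```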

### Get all the records for the following fields/columns using `loc`:
>- taster_name
>- taster_twitter_handle
>- points

In [24]:
wineReviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Unnamed: 0,taster_name,taster_twitter_handle,points
0,Kerin O’Keefe,@kerinokeefe,87
1,Roger Voss,@vossroger,87
2,Paul Gregutt,@paulgwine,87
3,Alexander Peartree,,87
4,Paul Gregutt,@paulgwine,87
...,...,...,...
129966,Anna Lee C. Iijima,,90
129967,Paul Gregutt,@paulgwine,90
129968,Roger Voss,@vossroger,90
129969,Roger Voss,@vossroger,90


# Notice we have been using the default index so far
## We can change the index with `set_index`

In [None]:
wineReviews.set_index('title')
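
Note that `set_index` returns a new DataFrame and leaves the original unchanged, so the result needs to be assigned to a variable to be kept. A small sketch, using a hypothetical two-row DataFrame and the made-up name `by_title`:

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({'title': ['Wine A', 'Wine B'], 'points': [87, 90]})

# Keep the re-indexed result in a new variable.
by_title = sample.set_index('title')

# The original still has 'title' as a regular column.
print(sample.columns.tolist())

# The new DataFrame can be searched by title with loc.
print(by_title.loc['Wine B', 'points'])
```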

# Conditional Selection
>- Suppose we only want to analyze data for one country, reviewer, etc...
>- Or we want to pull the data only for points and/or prices above a certain threshold

In [None]:
wineReviews.loc[wineReviews['country']== 'US']
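
It can help to look at the condition on its own. A comparison such as `df['country'] == 'US'` produces a Series of `True`/`False` values, and `loc` keeps the rows where it is `True`. The sketch below uses hypothetical data and names (`sample`, `is_us`, `is_high`); note that when conditions are written inline, each one needs its own parentheses because `&` binds more tightly than `==` and `>=`.

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'France'],
    'points': [96, 95, 88, 97],
})

# Each comparison yields a boolean Series, one value per row.
is_us = sample['country'] == 'US'
is_high = sample['points'] >= 95
print(is_us.tolist())

# Combine masks with & ("and"); loc keeps rows where the result is True.
us_and_high = sample.loc[is_us & is_high]
print(us_and_high)
```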

## Which wines are from the US with 95 or greater points?

In [32]:
wineReviews.loc[(wineReviews['points']>= 95) & (wineReviews['country']== 'US' ) ].head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
352,US,Citrus-kissed saltiness lies at the core of th...,South River,96,68.0,California,Russian River Valley,Sonoma,Virginie Boone,@vboone,Rochioli 2014 South River Chardonnay (Russian ...,Chardonnay,Rochioli
355,US,A waft of smoky char and toasty oak provide a ...,Sweetwater,96,68.0,California,Russian River Valley,Sonoma,Virginie Boone,@vboone,Rochioli 2014 Sweetwater Chardonnay (Russian R...,Chardonnay,Rochioli


# Some notes on our previous example:
>- We quickly took a dataset that has almost 130K rows and reduced it to one that has 993
>- This tells us that less than 1% of the wines are from the US and have ratings of 95 or higher
>- With some simple slicing using pandas we already have a decent start to an analytics project
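
Counts and percentages like the ones in these notes can be computed with `len`. This is a sketch on a hypothetical four-row DataFrame, so the numbers it prints are not the real dataset's; the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'France'],
    'points': [96, 95, 88, 97],
})

mask = (sample['points'] >= 95) & (sample['country'] == 'US')
subset = sample.loc[mask]

# Row counts before and after filtering.
total_rows = len(sample)
matching_rows = len(subset)

# Share of rows that satisfy both conditions, as a percentage.
share_pct = 100 * matching_rows / total_rows
print(f"{matching_rows} of {total_rows} rows ({share_pct:.1f}%)")
```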

# Q: What are all the wines from Italy or that have a rating of 95 or higher?
>- To return the results for an "or" question use the pipe `|` between your conditions

In [None]:
wineReviews.loc[(wineReviews['country']== 'Italy') | (wineReviews['points']>= 95)]

# Q: What are all the wines from Italy or France?
>- We can do this with an or statement or the `isin()` selector
>- Note: if you know SQL, this is the same thing as the IN () statement
>- Using `isin()` replaces multiple "or" statements and makes your code a little shorter

In [35]:
wineReviews.loc[wineReviews['country'].isin(['Italy', 'France'])].head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
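
To confirm that `isin()` and a chain of "or" conditions select the same rows, the two results can be compared with `equals`. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({
    'country': ['Italy', 'US', 'France', 'Spain'],
    'points': [87, 90, 92, 85],
})

# Two conditions joined with | ("or").
by_or = sample.loc[(sample['country'] == 'Italy') | (sample['country'] == 'France')]

# The same selection written with isin().
by_isin = sample.loc[sample['country'].isin(['Italy', 'France'])]

same_rows = by_or.equals(by_isin)
print(same_rows)
```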


# Q: What are all the wines without prices?
>- Here we can use the `isnull` method to show when values are not entered for a particular column

In [37]:
wineReviews.loc[wineReviews['price'].isnull()].head(3)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
13,Italy,This is dominated by oak and oak-driven aromas...,Rosso,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Masseria Setteporte 2012 Rosso (Etna),Nerello Mascalese,Masseria Setteporte
30,France,Red cherry fruit comes laced with light tannin...,Nouveau,86,,Beaujolais,Beaujolais-Villages,,Roger Voss,@vossroger,Domaine de la Madone 2012 Nouveau (Beaujolais...,Gamay,Domaine de la Madone


# Q: What are all the wines with prices?
>- Use `notnull()`

In [40]:
wine_notnullPrice= wineReviews.loc[wineReviews['price'].notnull()]
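
Since `isnull()` and `notnull()` are opposites, the two selections split the DataFrame into two groups with no overlap. A sketch on hypothetical data (the names `missing` and `priced` are made up), which also shows that summing the boolean mask counts the missing values:

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({
    'title': ['Wine A', 'Wine B', 'Wine C', 'Wine D'],
    'price': [float('nan'), 15.0, float('nan'), 30.0],
})

missing = sample.loc[sample['price'].isnull()]
priced = sample.loc[sample['price'].notnull()]

# True counts as 1, so summing the mask counts the missing prices.
missing_count = sample['price'].isnull().sum()

print(len(missing), len(priced), missing_count)
```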

# We can also add columns/fields to our DataFrames

In [45]:
wineReviews['critic'] = 'everyone is a critic'

In [46]:
wineReviews.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,critic
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,everyone is a critic
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,everyone is a critic
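
A new column does not have to be a single repeated value. It can also be computed from existing columns, row by row. The sketch below uses hypothetical data, and the column names `points_per_dollar` and `is_high_score` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset (not the real wine.csv).
sample = pd.DataFrame({'points': [80, 90], 'price': [20.0, 30.0]})

# A single value is repeated for every row.
sample['critic'] = 'everyone is a critic'

# Arithmetic on columns is applied row by row.
sample['points_per_dollar'] = sample['points'] / sample['price']

# A comparison produces a True/False column.
sample['is_high_score'] = sample['points'] >= 90

print(sample)
```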
