<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

In [None]:
import pandas as pd
import numpy as np
import datetime

pd.options.display.float_format = '{:,.2f}'.format

<br><br><br><br><br>

## Extract a specific column from a dataframe

In [None]:
countries = pd.DataFrame({
    'Letter': ['a', 'b', 'c'],
    'Country': ['Andorra', 'Belgium', 'Croatia']}, 
    index=[5, 6, 7])

countries

In [None]:
# Remember: each column in a DataFrame is a Series

countries['Country']

In [None]:
type(countries['Country'])

In [None]:
data = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv")
data.head()

In [None]:
data['Customer Id']

#### Each column is a `Series`, as `type()` confirms:

In [None]:
type(data['Customer Id'])
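
A minimal sketch of the difference between the Python type of a column and the dtype of the values it holds. The frame `toy` and its values are made up for illustration; they are not the survey data.

```python
import pandas as pd

# Hypothetical toy frame, not the survey dataset
toy = pd.DataFrame({
    'Customer Id': [101, 102, 103],
    'Empathy': [4.5, 3.0, 5.0]})

col = toy['Customer Id']

print(type(col))   # the container: a pandas Series
print(col.dtype)   # the type of the values stored in the Series
print(toy.dtypes)  # the dtype of every column at once
```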

In [None]:
dir(countries['Country'])

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Select multiple data columns (by name)

In [None]:
data['Date'].head()

In [None]:
data['Customer Id'].head()

In [None]:
data.head()

In [None]:
# Here, we want to extract only the `Customer Id` and 
# `Date` columns, so we pass a list of column names.

data[ ['Customer Id', 'Date'] ]
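
The bracket style decides what comes back. As a small sketch on a made-up frame (`toy` is an assumption, not the survey data): a single name returns a `Series`, while a list of names returns a `DataFrame`, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical toy frame, not the survey dataset
toy = pd.DataFrame({
    'Customer Id': [101, 102],
    'Date': ['2019-01-01', '2019-01-02'],
    'Empathy': [4, 5]})

one_col = toy['Date']                    # single name -> Series
one_col_frame = toy[['Date']]            # list with one name -> DataFrame
two_cols = toy[['Customer Id', 'Date']]  # list of names -> DataFrame

print(type(one_col).__name__, type(one_col_frame).__name__)
print(two_cols)
```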

<br/>
<br/>

## Select data columns (by index)

In [None]:
data.head()

In [None]:
# See list of all columns

data.columns

**NOTE:** Remember, index counting starts at 0! So:
* first column is at index 0
* second column is at index 1
* etc.

In [None]:
# Get the name of the second column

data.columns[1]

In [None]:
data[ data.columns[1] ]

In [None]:
# Get the actual data in that column

data[ data.columns[1] ] # equivalent to data['Helpfulness']

In [None]:
data[ ['Customer Id', 'Empathy'] ]

In [None]:
data[ [data.columns[0], data.columns[3]] ].head()
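
When several column positions are needed, `columns` can itself be indexed with a list of positions. The sketch below uses a made-up frame (`toy` is an assumption) to show that this gives the same result as naming the columns.

```python
import pandas as pd

# Hypothetical toy frame, not the survey dataset
toy = pd.DataFrame({
    'Customer Id': [101, 102],
    'Helpfulness': [3, 4],
    'Courtesy': [5, 2],
    'Empathy': [4, 4]})

# Translate positions 0 and 3 into column names, then select by name
names = toy.columns[[0, 3]]
by_position = toy[names]
by_name = toy[['Customer Id', 'Empathy']]

print(list(names))
print(by_position.equals(by_name))
```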

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Select data rows (by row position)

Let's see how many rows / columns are in our `data`

In [None]:
data.shape

In [None]:
data.head()

You can use slicing notation to select data rows (by position):

In [None]:
m = [0, 1, 2, 3, 4, 5]

m[1:3]

In [None]:
data = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv")
data.head()

In [None]:
data[10:20]

<br>

**IMPORTANT NOTE: the numbers between `[]` above are NOT the actual row index labels. They simply indicate row order.**

<br>

Consider the example below:
* `sample` will randomly sample a subset of the dataframe
* when `frac=1` it will randomly sample the entire dataframe
    * in other words, it will shuffle (reorder) the rows in the dataframe

In [None]:
data.sample(frac=1)

In [None]:
x = data.sample(frac=1)
x.head(10)
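
Shuffling changes the order of the rows but each row keeps its original label. The sketch below shows this on a made-up frame (`toy` is an assumption); `random_state` is passed only to make the shuffle repeatable.

```python
import pandas as pd

# Hypothetical toy frame with row labels 5..8
toy = pd.DataFrame({'Letter': ['a', 'b', 'c', 'd']}, index=[5, 6, 7, 8])

# frac=1 samples every row, i.e. shuffles the frame
shuffled = toy.sample(frac=1, random_state=0)

# Each row keeps its label, so sorting by label restores the original order
restored = shuffled.sort_index()

print(shuffled)
print(restored.equals(toy))
```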

The code below returns the 11th through 20th rows in the dataframe (positions 10 to 19), regardless of their row labels:

In [None]:
x[10:20]

The code below returns the fifth, sixth and seventh rows in the dataframe, regardless of their row labels:

In [None]:
x[4:7]

In [None]:
x[10:20]

**QUESTION:** Where else have we seen this notation `[10:20]`?

**ANSWER:** `lists`! `ndarrays`!
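
To make the parallel concrete, here is a small sketch that applies the same slice to a list, an `ndarray`, and a dataframe. The names `values`, `arr` and `frame` are made up for illustration; the dataframe has string row labels to stress that the slice counts positions, not labels.

```python
import numpy as np
import pandas as pd

values = [0, 10, 20, 30, 40, 50]
arr = np.array(values)

# Row labels are letters, so the integers in the slice can only be positions
frame = pd.DataFrame({'value': values}, index=list('fedcba'))

print(values[1:3])   # items at positions 1 and 2
print(arr[1:3])      # same positions in the ndarray
print(frame[1:3])    # rows at positions 1 and 2 (labels 'e' and 'd')
print(frame[::2])    # a step works too: every other row
```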

In [None]:
data.head()

In [None]:
# Get every other row between positions 10 and 19

data[10:20:2]

In [None]:
data.sample(frac=1)[10:20:2]

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Better way to access data using row index labels and column names

Use the dataframe index using `.loc`:
  * instead of using the row order, you can use the **row index labels**
  * label lookups are fast when the row index labels are unique and can be hashed

In [None]:
data.head()

In [None]:
data.loc

#### Get the rows with the row labels `2`, `3`, and `4`

In [None]:
data.head()

In [None]:
x = data.sample(frac=1)
x.head()

In [None]:
x.loc[ [2, 3, 4] ]

In [None]:
x.loc.__getitem__( [2, 3, 4] )

In [None]:
x.loc[ [2, 3, 4] ]

Compare with this:

In [None]:
list(range(2, 5))

In [None]:
x[2:5]
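
The difference is easiest to see on a tiny frame whose rows have been reordered in a known way. The sketch below uses a made-up frame and a fixed row order (both assumptions) instead of a random shuffle, so the result is the same on every run.

```python
import pandas as pd

# Hypothetical toy frame with labels 0..5, reordered in a fixed way
toy = pd.DataFrame({'value': list('abcdef')})
shuffled = toy.iloc[[3, 0, 5, 2, 4, 1]]

by_label = shuffled.loc[[2, 3, 4]]  # rows whose labels are 2, 3, 4
by_position = shuffled[2:5]         # rows at positions 2, 3, 4

print(by_label)
print(by_position)
```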

**You can, of course, just get a single row:**

In [None]:
x.loc[ [3] ]

#### Get a bunch of rows and columns

This uses row order:

In [None]:
x[30:36]

This uses row labels:
  * it will include all the rows between the row (or rows) that has label `30` and the row (or rows) that has the label `35`
  * it may be empty if the row with label `35` shows up before the row with label `30`
  * it may contain all your data if your very first row has the label `30` and your very last row has the label `35`

In [None]:
x.loc[30:35]
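
The three cases above can be reproduced on small frames with hand-picked row orders. This is a sketch under the assumption that the row labels are unique (as they are after shuffling a default index); the helper `make_frame` is made up for illustration.

```python
import pandas as pd

def make_frame(labels):
    """Build a one-column frame whose row labels appear in the given order."""
    return pd.DataFrame({'value': range(len(labels))}, index=labels)

# Label 30 comes first, label 35 later: rows in between are included
between = make_frame([1, 30, 7, 35, 2]).loc[30:35]

# Label 35 comes before label 30: nothing lies "between" them
empty = make_frame([35, 1, 30]).loc[30:35]

# Label 30 is the first row and label 35 the last: everything is returned
everything = make_frame([30, 1, 2, 35]).loc[30:35]

print(between.index.tolist())
print(len(empty))
print(everything.index.tolist())
```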

If you got no data above, try running the code below a few times (it will keep shuffling the rows):

In [None]:
data.sample(frac=1).loc[30:35]

**Using `.loc[]` you can, of course, just get the rows with specific labels (perhaps using a list of row labels you want to extract):**

In [None]:
list(range(30, 39))

In [None]:
x.loc[ range(30, 39) ]

Notice how the row with label 35 is included as many times as you specified, and how the rows come back in the order you listed them:

In [None]:
data.loc[ [30, 35, 35, 33, 34, 35, 400] ]
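
A small sketch of the same behavior on a made-up frame (`toy` is an assumption). It also shows that asking for a label that does not exist raises a `KeyError` in current pandas versions.

```python
import pandas as pd

# Hypothetical toy frame with row labels 30, 33, 34, 35
toy = pd.DataFrame({'value': ['a', 'b', 'c', 'd']}, index=[30, 33, 34, 35])

# Repeated labels produce repeated rows, in the order requested
picked = toy.loc[[30, 35, 35, 33]]
print(picked)

# A label that is not in the index raises a KeyError
try:
    toy.loc[[30, 400]]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```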

**Using `.loc[]` you can also specify which columns you want:**

In [None]:
data.sample(frac=1).loc[ 30:35 ]

In [None]:
x.loc[ 30:35, ['Customer Id', 'Empathy'] ] 

In [None]:
x = data.sample(frac=1)

In [None]:
x.loc[30:35, ['Customer Id', 'Empathy']] 

#### Get a bunch of rows and ALL columns

In [None]:
data.loc[30:35, 'Customer Id':'Empathy']

In [None]:
# no column selector: returns every column, not only `Customer Id` through `Empathy`
data.loc[30:35]

In [None]:
data.loc[ [1, 5, 10], ['Customer Id', 'Helpfulness', 'Empathy']]

In [None]:
data.loc[[1, 5, 10], 'Customer Id':'Rep Id']

In [None]:
data.loc[[1, 5, 10], data.columns[0]:data.columns[4]]

### CAUTION!

`loc` expects the optional column selector to list the columns by **name** or using a **boolean array**

## Using `loc` with rows-only selector

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/df_loc_1.png?1" width="300"/>
</div>

## Using `loc` with rows and columns selectors

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/df_loc_2.png?1" width="500"/>
</div>

**THIS WON'T WORK**

In [None]:
data.head()

In [None]:
data.loc[30:35, 0:3]

**THIS WILL WORK**

In [None]:
data.loc[30:35, 'Customer Id':'Courtesy']

In [None]:
data.loc[30:35, data.columns[0]:data.columns[2]]

In [None]:
data.loc[30:35, data.columns[:3]]

**THIS WON'T WORK**

In [None]:
data.loc[30:35, 0]
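
A sketch of what "won't work" means in practice, on a made-up frame with string column names (`toy` is an assumption). Both an integer slice and a single integer are rejected by `.loc`; the exact exception type can depend on the pandas version, so the code catches both `KeyError` and `TypeError`.

```python
import pandas as pd

# Hypothetical toy frame whose column labels are strings
toy = pd.DataFrame({
    'Customer Id': [101, 102],
    'Helpfulness': [3, 4],
    'Courtesy': [5, 2]})

errors = []
for selector in (slice(0, 2), 0):
    try:
        toy.loc[:, selector]  # integers are not column labels here
    except (KeyError, TypeError) as exc:
        errors.append(type(exc).__name__)

print(errors)

# Translating positions into names first does work
by_name = toy.loc[:, toy.columns[:2]]
print(by_name)
```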

**THIS WILL WORK**

In [None]:
data.loc[30:35, [data.columns[0]] ]

**THIS _WILL_ WORK!**

In [None]:
data.loc[30:35, [True, False, False, False, False, True, False, False, False]]
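
Typing out a long list of `True` / `False` values is error-prone. One possible alternative, sketched here on a made-up frame (`toy` is an assumption), is to build the boolean array from the column names with `Index.isin`.

```python
import pandas as pd

# Hypothetical toy frame, not the survey dataset
toy = pd.DataFrame({
    'Customer Id': [101, 102],
    'Helpfulness': [3, 4],
    'Date': ['2019-01-01', '2019-01-02'],
    'Rep Id': [2105, 2106]})

# One boolean per column: True where the column should be kept
mask = toy.columns.isin(['Customer Id', 'Date'])
print(mask)

selected = toy.loc[:, mask]
print(selected)
```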

In [None]:
data.loc[30:35, ['Customer Id', 'Date']]

In [None]:
data.columns[:2]

In [None]:
data.loc[30:35, data.columns[:2]]

**REMEMBER:**
  * when using `.loc`, the row selector requires **row index labels** or a boolean array and the column selector requires **column index labels (column names)** or a boolean array
  * if your row index is numeric, the row index labels will be numbers
  * if your row index is made of strings, dates, etc. your row index labels will be strings, dates, etc.

In [None]:
df = pd.DataFrame({
    'Month': ['January', 'February', 'March', 'April'], 
    'Temperature': [30, 35, 40, 45]})

In [None]:
df

In [None]:
df = df.set_index('Month')
df

In [None]:
df.columns

In [None]:
df.index

Compare:

In [None]:
df[0:2]

In [None]:
# This won't work (the row index labels are month names, not integers)
df.loc[0:2]

In [None]:
# But this will work
df.loc['January': 'February']

In [None]:
df.loc['January': 'February', ['Temperature']]
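
One detail worth noticing in the comparison above: a label slice with `.loc` includes its end label, while a position slice stops before its end position. A self-contained sketch that rebuilds the same small month frame (named `months` here to avoid clashing with `df`):

```python
import pandas as pd

months = pd.DataFrame({
    'Month': ['January', 'February', 'March', 'April'],
    'Temperature': [30, 35, 40, 45]}).set_index('Month')

label_slice = months.loc['January':'March']  # 'March' IS included
position_slice = months[0:2]                 # position 2 is NOT included

print(label_slice)
print(position_slice)
```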

<br><br>

In [None]:
data = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv")
data.head()

In [None]:
data = data.set_index('Rep Id')

In [None]:
data.head()

In [None]:
data.loc[2105, ['Courtesy', 'Empathy']]
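
A column such as `Rep Id` may well contain repeated values, in which case the index is no longer unique. The sketch below uses a made-up frame (`toy` and its values are assumptions) to show how the shape of the result depends on that: a label that appears several times returns a `DataFrame`, a label that appears once returns a `Series`.

```python
import pandas as pd

# Hypothetical toy frame: rep 2105 appears twice, rep 2106 once
toy = pd.DataFrame({
    'Rep Id': [2105, 2106, 2105],
    'Courtesy': [4, 5, 3],
    'Empathy': [5, 2, 4]}).set_index('Rep Id')

repeated = toy.loc[2105, ['Courtesy', 'Empathy']]  # two matching rows
single = toy.loc[2106, ['Courtesy', 'Empathy']]    # one matching row

print(toy.index.is_unique)
print(type(repeated).__name__, type(single).__name__)
```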

<br><br><br><br><br><br>

## Exercise

(1) Using `pd.read_csv()`, re-import the csv above (located at this address: https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv) and store it in a dataframe called `survey_data`. 

This time, we want to only import the following columns: `Customer Id`, `Helpfulness`, `Courtesy`, `Empathy` and `Rep Id`. 

**HINT:** use the optional `usecols` parameter when calling `read_csv` (more info here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

(2) Using `set_index`, set the `Customer Id` column as the dataframe index and display the top 10 rows of the resulting dataframe. 

(3) Using `.loc[]`, print columns `Courtesy` and `Empathy` of all the rows starting from the beginning of the dataframe and going all the way until the row with a `Customer Id` of 1991. 

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

#### Solution:

#### (1)

In [None]:
survey_data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv',
    usecols=['Customer Id', 'Helpfulness', 'Courtesy', 'Empathy', 'Rep Id']
)

survey_data.head()
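
A small offline sketch of how `usecols` behaves, reading from an in-memory string instead of the URL (the CSV text and the name `subset` are made up for illustration). Note that the resulting columns follow the order in the file, not the order in the `usecols` list.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = (
    "Customer Id,Helpfulness,Courtesy,Date\n"
    "101,3,4,2019-01-01\n"
    "102,5,2,2019-01-02\n")

# Only the listed columns are parsed; the rest are skipped
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=['Courtesy', 'Customer Id'])

print(subset)
```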

#### (2)

In [None]:
# survey_data = survey_data.set_index('Customer Id')

In [None]:
survey_data.set_index('Customer Id', inplace=True)

survey_data.head(10)

In [None]:
survey_data.columns

In [None]:
survey_data.index

#### (3)

In [None]:
survey_data.loc[:1991, ['Courtesy', 'Empathy']]

<br/>
<br/>

## Better way to access data by row position and column position

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv'
)

In [None]:
data.head()

<br><br>

#### How do we access the first 2 columns of the first 5 rows?

* using `.loc` is not very easy - it expects **row labels and column names**

* so we have to do something like this:

In [None]:
data[:5]

In [None]:
data.columns[:2]

In [None]:
data[:5][ data.columns[:2] ]

* Meh, this works, but it's kinda ugly to look at. And we have to type lots of square brackets. Is there a better way?

<br><br>

#### Yep, there is!

* use `.iloc`

* like `.loc`, except that instead of **row labels** and **column names**, it uses **row position** and **column position**.

In [None]:
data.iloc

In [None]:
data.iloc[:5, :2]
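
As a quick check that the two approaches agree, here is a sketch on a made-up frame (`toy` is an assumption) comparing the chained selection with the single `.iloc` call.

```python
import pandas as pd

# Hypothetical toy frame, not the survey dataset
toy = pd.DataFrame({
    'Customer Id': [101, 102, 103],
    'Helpfulness': [3, 4, 5],
    'Courtesy': [5, 2, 1]})

chained = toy[:2][toy.columns[:2]]  # rows first, then columns
direct = toy.iloc[:2, :2]           # rows and columns in one call

print(direct)
print(chained.equals(direct))
```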

#### Using `iloc` with rows-only selector

In [None]:
data.iloc

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/df_iloc_1.png" width="300"/>
</div>

#### Using `iloc` with rows and columns selectors

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/df_iloc_2.png" width="500"/>
</div>

### Get the row at position 4 (fifth row in the dataframe)

In [None]:
data.head()

In [None]:
data.iloc[ [4] ]

In [None]:
x = data.sample(frac=1)

x.head(7)

In [None]:
# With a list selector, `.iloc` returns a one-row DataFrame; the
# leftmost value is the actual row index label of the extracted row.
# (`x.iloc[4]` would return a Series whose `Name` is that label.)

x.iloc[ [4] ]

Compare, again, with `.loc`:

In [None]:
x.loc[ [4] ]
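
A deterministic sketch of the same comparison, using a made-up frame and a fixed row order (both assumptions) so that the outcome does not depend on a random shuffle. With a scalar selector the result is a `Series`, and its `name` is the row label.

```python
import pandas as pd

# Hypothetical toy frame with labels 0..5, reordered in a fixed way
toy = pd.DataFrame({'value': list('abcdef')})
shuffled = toy.iloc[[3, 0, 5, 4, 1, 2]]

by_position = shuffled.iloc[4]  # the fifth row, whatever its label is
by_label = shuffled.loc[4]      # the row labeled 4, wherever it sits

print(by_position.name, by_position['value'])
print(by_label.name, by_label['value'])
```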

#### Get a bunch of rows and the first three columns

In [None]:
x.head(20)

In [None]:
x.iloc[10:14, :3]

In [None]:
x.columns[:3]

In [None]:
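# Look up the labels of the rows at positions 10..13 instead of typing
# them in by hand (they change every time the dataframe is shuffled)
x.loc[ x.index[10:14], x.columns[:3] ]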

#### Get a bunch of rows and a bunch of columns

In [None]:
x.columns[1:4]

In [None]:
x[10:14][x.columns[1:4]]

In [None]:
x.iloc[10:14, 1:4]
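
Positions can always be translated into labels through `index` and `columns`, which makes it possible to express an `.iloc` selection with `.loc`. A sketch on a made-up, reordered frame (`toy` and the row order are assumptions):

```python
import pandas as pd

# Hypothetical toy frame, reordered in a fixed way
toy = pd.DataFrame({
    'Customer Id': [101, 102, 103, 104],
    'Helpfulness': [3, 4, 5, 2],
    'Courtesy': [5, 2, 1, 4]})
shuffled = toy.iloc[[2, 0, 3, 1]]

by_position = shuffled.iloc[1:3, 1:3]

# Same selection, expressed with row labels and column names
row_labels = shuffled.index[1:3]
col_names = shuffled.columns[1:3]
by_label = shuffled.loc[row_labels, col_names]

print(by_position)
print(by_position.equals(by_label))
```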

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Get unique values in a column

In [None]:
data.head()

In [None]:
data['Helpfulness'].unique()

**NOTE:** `.unique()` is a method on a `Series` object
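
A short sketch of what `.unique()` returns, on a made-up `Series` (the values are assumptions): a NumPy array holding each distinct value once, in order of first appearance rather than sorted. `nunique()` gives just the count.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, not taken from the survey dataset
ratings = pd.Series([5, 3, 5, 4, 3])

distinct = ratings.unique()  # order of first appearance, not sorted
count = ratings.nunique()    # number of distinct values

print(distinct)
print(count)
```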

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Exercise

Import the csv file available at `https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv` into a Pandas DataFrame and answer the questions below.

1. Get the unique values in the `Courtesy` column for the 13th, 14th, 15th and 16th row in the `data` dataframe
<br><br>
2. Get the unique values in the `Courtesy` column for the rows with labels 13, 14, 15, 16 in the `data` dataframe

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

**SOLUTION:**

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv'
)

In [None]:
data.head()

**1. Unique values in the `Courtesy` column in the 13th, 14th, 15th and 16th row in the `data` dataframe:**

In [None]:
data.iloc[12:16]

In [None]:
data.iloc[12:16, 2]

In [None]:
# Option 1

data.iloc[12:16, 2].unique()

In [None]:
data.iloc[12:16]['Courtesy']

In [None]:
# Option 2

data.iloc[12:16]['Courtesy'].unique()

In [None]:
data[12:16]

In [None]:
data[12:16]['Courtesy']

In [None]:
# Option 3

data[12:16]['Courtesy'].unique()

In [None]:
data.columns[2]

In [None]:
data[12:16][data.columns[2]]

In [None]:
# Option 4

data[12:16][data.columns[2]].unique()

In [None]:
data.iloc[12:16][data.columns[2]]

In [None]:
# Option 5

data.iloc[12:16][data.columns[2]].unique()

**2. Unique values in the `Courtesy` column for the rows with labels 13, 14, 15, 16 in the data dataframe:**

In [None]:
data.loc[[13, 14, 15, 16], 'Courtesy']

In [None]:
data.loc[[13, 14, 15, 16], 'Courtesy'].unique()

In [None]:
data.loc[13:16, 'Courtesy'] # only correct if the index is sorted!!

In [None]:
data.loc[13:16, 'Courtesy'].unique()

In [None]:
list(range(13, 17))

In [None]:
data.loc[range(13, 17), 'Courtesy'].unique()
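
To see why the label slice is only safe on a sorted index, here is a sketch on a made-up frame whose rows are reordered so that label 16 comes before label 13 (the frame, its values, and the row order are all assumptions). The list of labels still finds the right rows; the slice does not, until the index is sorted again.

```python
import pandas as pd

# Hypothetical toy frame with labels 10..19
toy = pd.DataFrame({'Courtesy': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]},
                   index=range(10, 20))

# Fixed reordering in which label 16 appears before label 13
shuffled = toy.loc[[16, 10, 13, 19, 14, 11, 15, 12, 17, 18]]

by_list = shuffled.loc[[13, 14, 15, 16], 'Courtesy'].unique()
by_slice = shuffled.loc[13:16, 'Courtesy'].unique()
by_sorted_slice = shuffled.sort_index().loc[13:16, 'Courtesy'].unique()

print(by_list)          # values for labels 13, 14, 15, 16
print(by_slice)         # empty: label 16 comes before label 13
print(by_sorted_slice)  # matches the list version once sorted
```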