# Table of Contents
* [Pandas: Examining Data](#Pandas:-Examining-Data)
	* [Learning Objectives:](#Learning-Objectives:)
	* [Set-Up](#Set-Up)
* [Examining Columns](#Examining-Columns)
	* [Read Data into a DataFrame](#Read-Data-into-a-DataFrame)
	* [Column selection](#Column-selection)
	* [Dropping and Adding Columns](#Dropping-and-Adding-Columns)
	* [String Operations](#String-Operations)
	* [Datetime Operations](#Datetime-Operations)
* [Indexing](#Indexing)
	* [Column Indexing](#Column-Indexing)
	* [Boolean arrays](#Boolean-arrays)
	* [The `.loc` accessor](#The-.loc-accessor)
	* [isin with strings](#isin-with-strings)
	* [Statistical Operations through Indexing](#Statistical-Operations-through-Indexing)
	* [Positional Indexing](#Positional-Indexing)
	* [The `.iloc` accessor](#The-.iloc-accessor)
* [Section Review](#Section-Review)


# Pandas: Examining Data

## Learning Objectives:

After completion of this module, learners should be able to:

* load data into a DataFrame
* use column selection methods
* perform string and datetime operations
* make categorical data types
* use boolean and fancy indexing

## Set-Up

In [1]:
# Import modules needed for this lesson
import pandas as pd

# Set Jupyter notebook display options
pd.options.display.max_rows = 6
pd.options.display.max_columns = 8
pd.options.display.width = 100

# Test the version of pandas: it should be 0.16 or greater
pd.__version__

'0.18.1'

***

# Examining Columns

## Read Data into a DataFrame

CSV files, being tabular data, have a natural mapping into Pandas DataFrames. 
* Here we read in beer reviews from beeradvocate.com, via [Stanford SNAP](https://snap.stanford.edu/data/web-BeerAdvocate.html). 
* Pandas infers the compression from the file extension (here `.gz`) and decompresses the file in memory.

In [3]:
df = pd.read_csv('data/beer2.csv.gz',
                  index_col=0,
                  parse_dates=['time'])

In [4]:
df.head()

| | abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 2511 | 287 | Bell's Cherry Stout | ... | blaheath | 4.5 | Batch 8144\tPitch black in color with a 1/2 f... | 2009-10-05 21:31:48 |
| 1 | 5.7 | 19736 | 9790 | Duck-Rabbit Porter | ... | GJ40 | 4.0 | Sampled from a 12oz bottle in a standard pint... | 2009-10-05 21:32:09 |
| 2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
| 3 | 9.5 | 28577 | 3818 | Unearthly (Imperial India Pale Ale) | ... | nick76 | 4.0 | The aroma has pine, wood, citrus, caramel, an... | 2009-10-05 21:32:37 |
| 4 | 5.8 | 398 | 119 | Wolaver's Pale Ale | ... | champ103 | 3.0 | A: Pours a slightly hazy golden/orange color.... | 2009-10-05 21:33:14 |
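The read step can be tried without the real data file. The following is a minimal sketch, assuming a hypothetical three-row sample written to a temporary `.csv.gz` file; the sample rows and the file name `sample.csv.gz` are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical three-row sample standing in for the real beer review file
csv_text = (
    ",abv,beer_name,time\n"
    "0,7.0,Bell's Cherry Stout,2009-10-05 21:31:48\n"
    "1,5.7,Duck-Rabbit Porter,2009-10-05 21:32:09\n"
    "2,,Wolaver's Pale Ale,2009-10-05 21:33:14\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(csv_text)

    # Compression is inferred from the .gz extension (the default behaviour)
    sample = pd.read_csv(path, index_col=0, parse_dates=["time"])

# The empty abv field becomes NaN and the time column is parsed as datetimes
print(sample.dtypes)
```

`index_col=0` uses the first (unnamed) column as the row index, and `parse_dates` converts the listed columns to datetimes while reading.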


Pandas provides useful inspection tools to help understand the dtypes that were generated for each column. 

Notice that some columns do not have 50,000 *non-null* values.

In [5]:
# Notice the column (1) names, (2) number of non-null, (3) dtype, (4) memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 13 columns):
abv                  48389 non-null float64
beer_id              50000 non-null int64
brewer_id            50000 non-null int64
beer_name            50000 non-null object
beer_style           50000 non-null object
review_appearance    50000 non-null float64
review_aroma         50000 non-null float64
review_overall       50000 non-null float64
review_palate        50000 non-null float64
profile_name         50000 non-null object
review_taste         50000 non-null float64
text                 49991 non-null object
time                 50000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 5.3+ MB
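The non-null counts reported by `.info()` can also be turned around to count missing values directly. This is a small sketch on a hypothetical three-row frame (the values are invented), using `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the review table, with two missing values
reviews = pd.DataFrame({
    "abv": [7.0, np.nan, 4.8],
    "beer_name": ["Stout", "Porter", "Pilsener"],
    "text": ["Pitch black", None, "Haystack yellow"],
})

# Number of missing values in each column
missing = reviews.isna().sum()
print(missing)

# Rows that have at least one missing value
incomplete = reviews[reviews.isna().any(axis=1)]
print(incomplete)
```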


In [6]:
# Summary statistics, using the built-in .describe() method

df.describe()



| | abv | beer_id | brewer_id | review_appearance | review_aroma | review_overall | review_palate | review_taste |
|---|---|---|---|---|---|---|---|---|
| count | 48389.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.00000 | 50000.000000 | 50000.000000 |
| mean | 7.232406 | 25379.008540 | 3505.655420 | 3.874010 | 3.778880 | 3.84982 | 3.778840 | 3.832480 |
| std | 2.387132 | 20600.374872 | 5696.746136 | 0.596153 | 0.677396 | 0.70716 | 0.660929 | 0.711728 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50% | | 25880.000000 | 583.000000 | 4.000000 | 4.000000 | 4.00000 | 4.000000 | 4.000000 |
| 75% | | 46302.000000 | 3694.000000 | 4.000000 | 4.000000 | 4.50000 | 4.000000 | 4.500000 |
| max | 27.000000 | 68386.000000 | 27039.000000 | 5.000000 | 5.000000 | 5.00000 | 5.000000 | 5.000000 |


## Column selection

A variety of column operations are available.

In [None]:
# The column names, as an Index object
print( df.columns )

In [None]:
# The column labels can be converted to a list of strings

list(df.columns)

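Columns in a pandas DataFrame can be accessed as attributes, provided the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.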

In [None]:
# View column data using .attribute, column label

df.beer_style

Or with dictionary-like key selection

In [None]:
# View column data using [bracket], column label string

df['beer_style']

A single column (**Series**) can be selected using the column name as an attribute (or accessor) of the **DataFrame** object.

In [None]:
beer_series = df.beer_name
print( type(beer_series) )

In [None]:
beer_series.head()
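The two access styles are not fully interchangeable. The sketch below uses a hypothetical frame whose column names (`review overall`, `count`) were chosen to show where attribute access breaks down, and how a list of labels in brackets returns a DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical frame; "review overall" contains a space and
# "count" shares its name with a DataFrame method
beers = pd.DataFrame({
    "beer_name": ["Stout", "Porter"],
    "review overall": [4.5, 4.0],
    "count": [10, 12],
})

name_series = beers["beer_name"]     # single label   -> Series
name_frame = beers[["beer_name"]]    # list of labels -> DataFrame

# Brackets work for any label, including ones with spaces
overall = beers["review overall"]

# beers.count is the DataFrame.count method, not the "count" column,
# so the column has to be selected with brackets
count_column = beers["count"]

print(type(name_series).__name__, type(name_frame).__name__, callable(beers.count))
```

Bracket selection is the safer default in scripts; attribute access is mostly a convenience for interactive work.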

## Dropping and Adding Columns

In [None]:
df.head(3)

In [None]:
# Adding a new column

df['alcohol'] = 0.0
df.head(3)

In [None]:
df.info()

Overwriting one column with another:

In [None]:
df['alcohol'] = df['abv']
df

In [None]:
# Dropping a column

del df['abv']

In [None]:
df.info()
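The cells above modify `df` in place. As an alternative sketch, the same kind of change can be made with methods that return a new frame and leave the original untouched; the two-row frame here is hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame
beers = pd.DataFrame({"beer_name": ["Stout", "Porter"], "abv": [7.0, 5.7]})

# rename() returns a new frame with the column relabelled
renamed = beers.rename(columns={"abv": "alcohol"})

# drop() also returns a new frame; its input is left unchanged
without_name = renamed.drop(columns=["beer_name"])

# assign() adds a derived column without modifying its input
with_fraction = renamed.assign(alcohol_fraction=renamed["alcohol"] / 100)

print(list(beers.columns), list(renamed.columns), list(without_name.columns))
```

Returning new frames makes it easier to re-run notebook cells out of order, since no cell depends on a column having already been deleted.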

## String Operations

Various string operations can be applied to the values in a column.

In [None]:
# Print the first 4 letters of each string in the column

df.beer_style.str[0:4]

In [None]:
# Print the length of each string in the column

df.beer_style.str.len()

In [None]:
# Get boolean (True/False) mask for strings matching a pattern

df.beer_style.str.contains('[Aa]merican')

In [None]:
# Extract the rows where the mask is True

boolean_mask = df.beer_style.str.contains('[Aa]merican')
df.loc[boolean_mask, 'beer_style']

In [None]:
# To see the full documentation for the .str accessor, uncomment the following:

# help(df.beer_style.str)
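Pattern matching often has to cope with mixed case and missing values. The sketch below runs on a hypothetical Series of style labels and uses the `case` and `na` parameters of `str.contains`, along with two other `.str` helpers.

```python
import pandas as pd

# Hypothetical style labels, including one missing entry
styles = pd.Series(["American IPA", "american porter", "English Pale Ale", None])

# case=False ignores case; na=False treats missing values as "no match"
is_american = styles.str.contains("american", case=False, na=False)

# Other vectorized helpers follow the same pattern
upper = styles.str.upper()
first_word = styles.str.split().str[0]

print(is_american.tolist())
```

Passing `na=False` keeps the result strictly boolean, which matters when the mask is later used for row selection.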

## Datetime Operations

Using the `.dt` accessor, datetime manipulations can be applied to a datetime series or column elementwise.

In [None]:
# select the `time` column, notice the dtype is datetime64[ns]

df.time

Because the dtype is `datetime64[ns]`, we can perform datetime operations on the column.

In [None]:
# Example: Just show the date, not the full date-time

df.time.dt.date

In [None]:
# Just show the time, not the full date-time

df.time.dt.time

In [None]:
# Just show the hour

df.time.dt.hour

In [None]:
# To see the full help documentation for the datetime properties, uncomment the following:

# help(df.time.dt)
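A few more `.dt` properties and methods, sketched on a hypothetical Series of three timestamps (the values are illustrative, not taken from the review data):

```python
import pandas as pd

# Hypothetical timestamps, parsed from strings
times = pd.Series(pd.to_datetime([
    "2009-10-05 21:31:48",
    "2009-10-06 08:15:00",
    "2009-11-01 12:00:00",
]))

years = times.dt.year
months = times.dt.month
weekdays = times.dt.day_name()   # English weekday names by default
daily = times.dt.floor("D")      # truncate to midnight, keeping a datetime dtype

print(weekdays.tolist())
```

Unlike `.dt.date`, which returns plain Python `date` objects, `.dt.floor("D")` keeps the result as a datetime column, so further `.dt` operations remain available.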

# Indexing

Many of the same indexing methods available in NumPy can be applied to Pandas DataFrames and Series.

## Column Indexing

Columns can be indexed with brackets or with the .attribute accessor.

In [None]:
# Recall the column names: they are your "indices"
df.columns

In [None]:
# "Indexing" by brackets and any one of the column names/labels

df['beer_name']

In [None]:
# Indexing into rows of the column

df.loc[0:3,'beer_name']

In [None]:
# Index by .attribute accessor

df.beer_name

In [None]:
# Indexing into rows of the column

df.beer_name.loc[0:3]

`.loc` indexes by row, then column.

In [None]:
df.loc[0,'beer_name']

In [None]:
df.loc[0:3,'beer_name']

## Boolean arrays

In [None]:
# boolean series
df.alcohol < 5

In [None]:
# boolean indexing along rows
# The index value from the full dataframe is preserved
df.loc[df.alcohol < 5]

# Note the number of rows left after the operation
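The following sketch, on a hypothetical four-row frame, shows how to count the rows a mask selects and what "the index is preserved" means in practice.

```python
import pandas as pd

# Hypothetical four-row frame
beers = pd.DataFrame({
    "beer_name": ["Stout", "Porter", "Pilsener", "Pale Ale"],
    "alcohol": [7.0, 5.7, 4.8, 4.5],
})

mask = beers["alcohol"] < 5
light = beers.loc[mask]

# True counts as 1, so summing the mask counts the selected rows
n_selected = int(mask.sum())

# The selected rows keep their original labels (2 and 3 here)
print(n_selected, light.index.tolist())

# reset_index(drop=True) renumbers from 0 when the old labels are not needed
renumbered = light.reset_index(drop=True)
```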

## The `.loc` accessor

The `.loc` accessor selects rows and columns together, by label.

The selection pattern is row first, then column:

```
df.loc[ row , column ]
```

**Remember**: a `.loc` slice is label-based, so its right endpoint is included in the result.
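The endpoint rule is easy to check on a small example. This sketch uses a hypothetical five-row frame and compares a `.loc` slice with an ordinary Python list slice over the same bounds.

```python
import pandas as pd

# Hypothetical five-row frame with the default integer index
beers = pd.DataFrame(
    {"beer_name": ["Stout", "Porter", "Pilsener", "Pale Ale", "IPA"]}
)

# Label-based slice: labels 0, 1, 2 and 3 are all returned
label_slice = beers.loc[0:3, "beer_name"]

# Ordinary Python slice over the same values: positions 0, 1 and 2 only
python_slice = beers["beer_name"].tolist()[0:3]

# The same rule holds for non-integer labels
lettered = beers.set_index(pd.Index(list("abcde")))
letter_slice = lettered.loc["b":"d", "beer_name"]

print(len(label_slice), len(python_slice), letter_slice.index.tolist())
```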

In [None]:
df.head(3)

In [None]:
df.loc[0,'alcohol']

In [None]:
df.loc[2,'time']

The values passed to `.loc` for either rows or columns can be values in the indices or **boolean indexing arrays**.

In [None]:
df.loc[df.alcohol < 5, ['beer_style', 'review_overall']]

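Combining boolean arrays with the bitwise operators `&` (and), `|` (or), and `~` (not) allows complex selection logic. Wrap each comparison in parentheses, because these operators bind more tightly than `<` and `>`.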

In [None]:
# reviews written after the start of June 2009 for beer with alcohol<5
(df.alcohol < 5) & (df.time > '2009-06')

# equivalent to
#(df.alcohol < 5) & (df.time > pd.Timestamp('2009-06'))

In [None]:
# reviews written after the start of June 2009 for beer with alcohol<5, OR reviews with a rating >= 4.5
boolean_chain = ( 
                   ( (df.alcohol < 5)  & (df.time > '2009-06')  )
                 | (df.review_overall >= 4.5)
                 )
df.loc[boolean_chain]
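Naming each condition before combining it can make a long chain easier to read. The sketch below works on a hypothetical four-row frame; it also shows `~` for negation and that the Python keywords `and`/`or` do not work on Series.

```python
import pandas as pd

# Hypothetical four-row frame
reviews = pd.DataFrame({
    "alcohol": [4.5, 4.8, 7.0, 9.5],
    "review_overall": [4.5, 3.0, 5.0, 3.5],
    "time": pd.to_datetime(["2009-05-01", "2009-07-01", "2009-08-01", "2009-03-01"]),
})

# One named mask per condition
light = reviews["alcohol"] < 5
recent = reviews["time"] > pd.Timestamp("2009-06-01")
top_rated = reviews["review_overall"] >= 4.5

both = light & recent                 # element-wise AND
either = (light & recent) | top_rated # element-wise OR
not_light = ~light                    # element-wise NOT

# `and` / `or` ask for a single truth value, which a Series does not have
try:
    light and recent
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(both.tolist(), either.tolist(), not_light.tolist(), keyword_failed)
```

Without parentheses, an expression such as `df.alcohol < 5 & df.time > x` is parsed as `df.alcohol < (5 & df.time) > x`, which is not the intended comparison.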

## isin with strings

In [None]:
# The .str.contains() method returns booleans...

df.beer_style.str.contains('IPA')

In [None]:
#... so .str.contains() can be used with .loc[] as well:

df.loc[df.beer_style.str.contains('IPA')]

In [None]:
# Summary statistics for hoppy (IPA) beers

df.loc[ df.beer_style.str.contains('IPA') ].describe()

In [None]:
# Summary statistics for non-IPA (less hoppy) beers

df.loc[ ~df.beer_style.str.contains('IPA') ].describe()
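The cells in this section use `.str.contains`, which matches substrings. `Series.isin`, named in the section heading, instead tests each value for exact membership in a collection. A small sketch of the difference, on hypothetical style labels:

```python
import pandas as pd

# Hypothetical style labels
styles = pd.Series([
    "American IPA",
    "English Porter",
    "American Double / Imperial IPA",
    "Czech Pilsener",
])

# isin: True only when the whole string equals one of the listed values
exact = styles.isin(["American IPA", "Czech Pilsener"])

# str.contains: True whenever the pattern appears anywhere in the string
substring = styles.str.contains("IPA")

print(exact.tolist())
print(substring.tolist())
```

Both results are boolean Series, so either can be passed to `.loc` for row selection.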

## Statistical Operations through Indexing

In [None]:
# count unique entries of beer_id

beer_ids = df.beer_id.value_counts(sort=True)  # default is sorted, but wanted to be explicit
beer_ids
# left is ID, right is number of times it appears

In [None]:
# All reviews for the top three most reviewed beers
print( "Top 3 \nBeer ID, Reviews \n", beer_ids[0:3], "\n" )

df.loc[df.beer_id.isin(beer_ids[0:3].index)]
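The same pattern, reduced to a hypothetical seven-row frame so the intermediate results are easy to follow: count, take the most frequent IDs, select their rows, then compute a statistic on the selection.

```python
import pandas as pd

# Hypothetical seven-row frame
reviews = pd.DataFrame({
    "beer_id": [10, 20, 10, 30, 20, 10, 40],
    "review_overall": [4.0, 3.5, 4.5, 5.0, 3.0, 4.0, 2.5],
})

# Counts are sorted in descending order by default
counts = reviews["beer_id"].value_counts()

# The index of the counts holds the IDs themselves
top_two_ids = counts.head(2).index

# Keep only the reviews of those two beers
top_reviews = reviews.loc[reviews["beer_id"].isin(top_two_ids)]

# Any statistic can now be computed on the selection
mean_rating = top_reviews["review_overall"].mean()

print(top_two_ids.tolist(), len(top_reviews), mean_rating)
```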

***

## Positional Indexing

In [None]:
df.head(5)

## The `.iloc` accessor

If you prefer NumPy-like indexing, use the `.iloc` accessor.

Like `.loc`, the `.iloc` accessor selects rows and columns, but by integer position rather than by label. Unlike `.loc`, an `.iloc` slice excludes its right endpoint.

In [None]:
# rows only (all columns are implied)
df.iloc[[2,5,10]]

In [None]:
# row, column
df.iloc[[2,5,10],0:3]
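The difference between `.loc` and `.iloc` is clearest when the row labels are not simply 0, 1, 2, and so on. This sketch uses a hypothetical frame with labels 10 through 40.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
beers = pd.DataFrame(
    {
        "beer_name": ["Stout", "Porter", "Pilsener", "Pale Ale"],
        "alcohol": [7.0, 5.7, 4.8, 4.5],
    },
    index=[10, 20, 30, 40],
)

by_position = beers.iloc[0:2]   # positions 0 and 1; endpoint excluded
by_label = beers.loc[10:20]     # labels 10 and 20; endpoint included
first_cell = beers.iloc[0, 0]   # row position 0, column position 0

# After filtering, .iloc[0] is the first remaining row, whatever its label
light = beers.loc[beers["alcohol"] < 5]
first_light_label = light.iloc[0].name

print(by_position.index.tolist(), by_label.index.tolist(), first_cell, first_light_label)
```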

***

# Section Review

Examining Data
* Reading data files
* Inspecting the DataFrame
* Examining Columns
   * Selecting Columns
   * Adding and Removing Columns
   * String operations
   * DateTime operations
* Indexing
   * Boolean indexing
   * Label-based selection with `.loc`
   * Positional selection with `.iloc`