In [2]:
# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 4.1.1: Getting Started with Pandas
***
 - Learn how to use __Pandas Series__
 - Learn how to use __Pandas Dataframes__

Scotch whisky:
- Prized for its complexity and flavors
- Regions where it's produced are believed to have distinct flavor profiles

In this case study, we classify Scotch whiskies based on flavor.

We will use Pandas, NumPy, scikit-learn, and Bokeh.
 

This dataset consists of tasting ratings of readily available single-malt Scotch whiskies from almost every active distillery in Scotland:
- 86 malt whiskies rated between 0 and 4 in 12 different categories, ranked by 10 different tasters

Let's first look at Series in Pandas.

In [3]:
import pandas as pd
x = pd.Series([6,3,8,6])
x

0    6
1    3
2    8
3    6
dtype: int64

Since we didn't specify an index, Pandas is using the default index (a series of integers starting at 0 and increasing one by one for each subsequent row). 
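As a small sketch of what the default index looks like, the snippet below inspects the index of a Series built without one. The variable name `s` is only illustrative.

```python
import pandas as pd

# Build a Series without specifying an index
s = pd.Series([6, 3, 8, 6])

# The default index is a range of integers starting at 0
print(s.index)
print(list(s.index))

# Label-based lookup with the default integer labels
print(s.loc[2])
```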

Now let's specify an index explicitly

In [4]:
x = pd.Series([6,3,8,6], index = ['q','w','e','r'])
x

q    6
w    3
e    8
r    6
dtype: int64

You can use the index to look up a single value or a list of values.
- Example: to look up the value at index 'w', you could do:

In [5]:
x['w']

3

If we want multiple entries, we construct a list of the entries we're interested in:

In [6]:
x[['r', 'w']]

r    6
w    3
dtype: int64
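The same lookups can also be written with `.loc`, which makes it explicit that we are selecting by label. The sketch below also shows two ways to handle a label that may be missing; the label `'z'` and the default `-1` are made up for illustration.

```python
import pandas as pd

x = pd.Series([6, 3, 8, 6], index=['q', 'w', 'e', 'r'])

# Explicit label-based lookup of one value and of several values
single = x.loc['w']
several = x.loc[['r', 'w']]
print(single)
print(several)

# Membership tests check the index labels, not the values
print('w' in x)
print('z' in x)

# .get returns a default instead of raising KeyError for a missing label
fallback = x.get('z', -1)
print(fallback)
```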

There are many ways to construct a Series object in Pandas. A common one is to pass a dictionary:

In [7]:
age = {'Tim':29, 'Jim':31, 'Pam':27, 'Sam':35}
x=pd.Series(age)
x

Tim    29
Jim    31
Pam    27
Sam    35
dtype: int64

Note that the dictionary keys become the index, in their original order, and the dictionary values become the values of the Series.
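When building a Series from a dictionary, you can also pass an `index` argument to select and order the keys you want. The sketch below assumes a key `'Bob'` that is not in the dictionary, to show that missing keys come back as NaN.

```python
import pandas as pd

age = {'Tim': 29, 'Jim': 31, 'Pam': 27, 'Sam': 35}

# Keep only the listed keys, in the listed order.
# 'Bob' is a hypothetical key that is absent from the dictionary.
subset = pd.Series(age, index=['Pam', 'Tim', 'Bob'])
print(subset)
```

Because NaN is a floating-point value, the resulting Series holds floats rather than integers.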

Now let's take a look at DataFrames:
- They represent table-like data and have both a __row__ index and a __column__ index
- As with Series, there are many ways to construct a DataFrame
    - A common way is to pass a dictionary whose values are lists or NumPy arrays of equal length:
    

In [8]:
data = {'name':['Tim', 'Jim', 'Pam', 'Sam'],
        'age':[29,31,27,35],
        'ZIP':['02115', '02130', '67700', '00100']}
x = pd.DataFrame(data, columns = ['name', 'age', 'ZIP'])
x

  name  age    ZIP
0  Tim   29  02115
1  Jim   31  02130
2  Pam   27  67700
3  Sam   35  00100
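A quick way to check what was built is to inspect the shape, column order, and column types. This is a minimal sketch using the same data; the name `df` is only illustrative. Note that the ZIP codes are stored as strings, which is what preserves their leading zeros.

```python
import pandas as pd

data = {'name': ['Tim', 'Jim', 'Pam', 'Sam'],
        'age': [29, 31, 27, 35],
        'ZIP': ['02115', '02130', '67700', '00100']}
df = pd.DataFrame(data, columns=['name', 'age', 'ZIP'])

# Number of rows and columns
print(df.shape)

# Column labels, in the order given by the columns argument
print(list(df.columns))

# Data type of each column
print(df.dtypes)

# ZIP codes are strings, so leading zeros survive
print(df['ZIP'].tolist())
```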


To retrieve a column, we can use dictionary-like notation or specify the column name as an attribute of the DataFrame:

In [9]:
x['name']
x.name

0    Tim
1    Jim
2    Pam
3    Sam
Name: name, dtype: object

0    Tim
1    Jim
2    Pam
3    Sam
Name: name, dtype: object
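The two notations are not always interchangeable. The sketch below uses two hypothetical column names, `'count'` and `'home ZIP'`, to show cases where only the dictionary-like notation retrieves the column: one name collides with an existing DataFrame method, and the other is not a valid Python identifier.

```python
import pandas as pd

# 'count' and 'home ZIP' are made-up column names for illustration
df = pd.DataFrame({'name': ['Tim', 'Jim'],
                   'count': [2, 5],
                   'home ZIP': ['02115', '02130']})

# Dictionary-like notation always returns the column
print(df['count'])
print(df['home ZIP'])

# Attribute access finds the DataFrame.count method, not the column
print(callable(df.count))
```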

We often need to reindex a Series or DataFrame object. Reindexing does not change the relationship between the index labels and their corresponding data; it only reorders the data in the object.

In [10]:
x = pd.Series([6,3,8,6], index = ['q','w','e','r'])
x
x.index

# Sort the index labels into a new Python list, then reindex by that list
print('Reindexed:')
x.reindex(sorted(x.index))

q    6
w    3
e    8
r    6
dtype: int64

Index(['q', 'w', 'e', 'r'], dtype='object')

Reindexed:


e    8
q    6
r    6
w    3
dtype: int64
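Two related details are worth a short sketch. First, for the specific case of sorting by label, `sort_index` gives the same result as reindexing by the sorted labels. Second, reindexing with a label that does not exist (here the made-up label `'z'`) produces NaN unless a `fill_value` is supplied.

```python
import pandas as pd

x = pd.Series([6, 3, 8, 6], index=['q', 'w', 'e', 'r'])

# Sorting by label: equivalent to x.reindex(sorted(x.index))
by_label = x.sort_index()
print(by_label)

# 'z' is a hypothetical label that is not in the original index
with_missing = x.reindex(['e', 'q', 'z'])
print(with_missing)

# fill_value replaces the NaN that would otherwise appear
with_fill = x.reindex(['e', 'q', 'z'], fill_value=0)
print(with_fill)
```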

Pandas supports arithmetic operations between Series. Data alignment happens by index: entries that share an index label are added together.
If a label appears in only one of the two Series, Pandas puts NaN (not a number) in the result for that label.

In [12]:
x = pd.Series([6,3,8,6], index = ['q','w','e','r'])
y = pd.Series([7,3,5,2], index = ['e','q','r','t'])
x
y
x+y

q    6
w    3
e    8
r    6
dtype: int64

e    7
q    3
r    5
t    2
dtype: int64

e    15.0
q     9.0
r    11.0
t     NaN
w     NaN
dtype: float64

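Both Series contain the labels e, q, and r, but t appears only in y and w appears only in x. This is why the sum has NaN entries for those two labels. Because NaN is a floating-point value, the result has dtype float64.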
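If you would rather treat a missing label as zero than get NaN, one option is the `add` method with `fill_value`. This is a sketch of that alternative, not something the case study itself relies on.

```python
import pandas as pd

x = pd.Series([6, 3, 8, 6], index=['q', 'w', 'e', 'r'])
y = pd.Series([7, 3, 5, 2], index=['e', 'q', 'r', 't'])

# The + operator leaves NaN where a label is missing from either Series
plain = x + y
print(plain)

# add with fill_value=0 substitutes 0 for the missing side before adding
total = x.add(y, fill_value=0)
print(total)
```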

## 4.1.2: Loading and Inspecting Data
***
 - Learn how to __load a CSV__ file using Pandas
 - Learn how to __view the beginning and end__ of a Pandas DataFrame
 - Learn how to __index__ a Pandas DataFrame by location
 