# Pandas Introduction

The first step is to import pandas; reading the data file comes later.
By convention, pandas is imported as pd.

In [1]:
import pandas as pd

A DataFrame can be created from many sources. To build one manually, the simplest way is to pass a dictionary whose keys (usually str, though int also works) become the column names and whose values are lists of column entries: dict[str | int, list].

In [2]:
fruits = pd.DataFrame({'Apples':[30], 'Bananas':[21]})

An index can be added to the DataFrame to label its rows, which also helps when sorting or looking up rows:

In [3]:
fruit_index = ['2017 Sales', '2018 Sales']

fruit_data = {
    'Apples':[35, 41],
    'Bananas':[21, 34]
}

fruit_sales = pd.DataFrame(fruit_data, index=fruit_index)
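
As a quick check of how the dictionary and the index map onto the resulting DataFrame, here is a minimal sketch. It reuses the fruit values from the cell above; the variable names sales and apples_2018 are only illustrative.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list holds that column's values.
sales = pd.DataFrame(
    {"Apples": [35, 41], "Bananas": [21, 34]},
    index=["2017 Sales", "2018 Sales"],
)

print(sales.shape)          # (number of rows, number of columns)
print(list(sales.columns))  # column names taken from the dictionary keys
print(list(sales.index))    # row labels taken from the index argument

# Row labels from the index can be combined with a column name in loc.
apples_2018 = sales.loc["2018 Sales", "Apples"]
print(apples_2018)
```

All lists in the dictionary need the same length, and that length must match the length of the index.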

A pandas Series is like one column of a DataFrame with its own index and datatype (dtype).

In [None]:
series_data = ['4 cups', '1 cup', '2 large', '1 can']

series_index = ['Flour', 'Milk', 'Eggs', 'Spam']

ingredients = pd.Series(series_data, index=series_index, name='Dinner')
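
To see how a Series relates to a DataFrame column, the sketch below rebuilds the same Series, reads one value by its label, and converts the Series to a one-column DataFrame. The names milk and as_frame are illustrative only.

```python
import pandas as pd

ingredients = pd.Series(
    ["4 cups", "1 cup", "2 large", "1 can"],
    index=["Flour", "Milk", "Eggs", "Spam"],
    name="Dinner",
)

print(ingredients.name)   # the Series name
print(ingredients.dtype)  # the dtype pandas chose for the string values

# A single value can be looked up by its index label.
milk = ingredients["Milk"]
print(milk)

# Converting to a DataFrame turns the Series name into the column name.
as_frame = ingredients.to_frame()
print(list(as_frame.columns))
```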

Pandas can also read many kinds of existing data. For these examples we will use the file 'winemag-data_first150k.csv', which contains wine review data from a magazine.

In [6]:
reviews = pd.read_csv('winemag-data_first150k.csv')

reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


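Notice that pandas did not automatically recognize that the file's first column, shown as 'Unnamed: 0', is an index column, so it added a fresh index of its own. To tell pandas which column to use as the index, pass its position with the index_col argument (here index_col=0).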

In [7]:
reviews = pd.read_csv('winemag-data_first150k.csv', index_col=0)

reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
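The effect of index_col can be reproduced without the wine file. The sketch below is only an illustration: it reads a tiny made-up CSV from an in-memory string (io.StringIO stands in for a file on disk) whose first column has an empty header, as in the real data.

```python
import io

import pandas as pd

# A small stand-in for the real file; the first header field is empty.
csv_text = ",country,points\n0,US,96\n1,Spain,96\n2,France,95\n"

# Without index_col the unnamed column is kept as ordinary data.
default = pd.read_csv(io.StringIO(csv_text))
print(list(default.columns))

# With index_col=0 the first column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
print(list(indexed.index))
```
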


## Capturing Specific Data

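To refine the data or select parts of it, there is some basic notation that can be used.
A column can be accessed as an attribute of the DataFrame (reviews.country) or with bracket notation (reviews['country']). Both return the same Series, but attribute access only works for column names that are valid Python identifiers and do not clash with existing DataFrame attributes, whereas bracket notation also handles names such as 'maiden name' that contain a space.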

In [9]:
# Individual columns can be accessed 2 ways
reviews.country == reviews['country']

# Individual data points can be accessed through the column name + row index
print(reviews.country[0])

US
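
The limits of attribute access are easier to see on a small frame. In the sketch below, the DataFrame people and its columns are made up for illustration: one column name contains a space and another, count, shares its name with a DataFrame method.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "country": ["US", "Spain"],
        "maiden name": ["Smith", "Garcia"],
        "count": [3, 5],
    }
)

# Attribute and bracket access return the same Series for a simple name.
same = people.country.equals(people["country"])
print(same)

# A name with a space can only be reached with brackets.
maiden = people["maiden name"]
print(list(maiden))

# 'count' is also a DataFrame method, so the attribute gives the method,
# while brackets give the column.
print(callable(people.count))
print(list(people["count"]))
```

Bracket notation is therefore the safer default when column names are not known in advance.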


Pandas has its own accessor operators, loc and iloc, for more advanced selection.

    - iloc: index-based selection, selecting data based on its numerical position in the data
    - loc: label-based selection, selecting data based on row and column labels; it also supports slices and lists of labels

Both are row-first, column-second, which is the opposite of the column-first, row-second access shown above (reviews.country[0]).
KEY DIFFERENCE: iloc uses regular Python slicing, which excludes the end ([0:3] returns the values at positions 0, 1, 2), whereas loc includes the end label ([0:3] returns the values labeled 0, 1, 2, 3: four values!).
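
The slicing difference can be verified on a small frame. This is a minimal sketch with made-up values; df and named are illustrative names, and the string labels in named show why an inclusive end is convenient for loc.

```python
import pandas as pd

# Default integer index: the labels are 0, 1, 2, 3, 4.
df = pd.DataFrame({"points": [96, 96, 96, 96, 95]})

by_position = df.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = df.loc[0:3]      # labels 0, 1, 2, 3 (end included)
print(len(by_position), len(by_label))

# With string labels there is no obvious "one past the end" label,
# so loc includes the end label of the slice.
named = pd.DataFrame({"points": [96, 96, 96, 96, 95]}, index=list("abcde"))
print(list(named.loc["a":"c"].index))
```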

In [18]:
# df.iloc
# Selects the first row in the DataFrame
reviews.iloc[0]

# Select the first column in the DataFrame
reviews.iloc[:, 0]

# Selects the first 3 rows from the first column
reviews.iloc[:3, 0]

# Selects the second and third rows from the first column
reviews.iloc[1:3, 0]

# A list can also be passed to select certain values
reviews.iloc[[0, 5, 15], 0]

# Negative values can also be used to start indexing from the end
reviews.iloc[-5:]

# df.loc
# Selects the 'country' value in the row labeled 0
reviews.loc[0, 'country']

# Selects the columns by name
reviews.loc[:, ['country', 'price', 'points']]

Unnamed: 0,country,price,points
0,US,235.0,96
1,Spain,110.0,96
2,US,90.0,96
3,US,65.0,96
4,France,66.0,95
...,...,...,...
150925,Italy,20.0,91
150926,France,27.0,91
150927,Italy,20.0,91
150928,France,52.0,90


If another column is more useful for lookups or sorting, we can make it the index of the DataFrame with set_index: df.set_index('column name').
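
A minimal sketch of set_index, using a made-up three-row frame rather than the full review data (wines, by_winery and ponzi_price are illustrative names):

```python
import pandas as pd

wines = pd.DataFrame(
    {
        "winery": ["Heitz", "Macauley", "Ponzi"],
        "points": [96, 96, 96],
        "price": [235.0, 90.0, 65.0],
    }
)

# set_index returns a new DataFrame; the original is left unchanged.
by_winery = wines.set_index("winery")

# Rows can now be looked up by winery name with loc.
ponzi_price = by_winery.loc["Ponzi", "price"]
print(ponzi_price)

# The column moved into the index, so it is no longer a regular column.
print(list(by_winery.columns))
```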

To perform more interesting investigations of the data and filter based on certain properties, we can use Python comparison operators.

In [19]:
reviews.country == 'Italy'

0         False
1         False
2         False
3         False
4         False
          ...  
150925     True
150926    False
150927     True
150928    False
150929     True
Name: country, Length: 150930, dtype: bool

This creates a pd.Series of True / False that we can then use with loc to select the relevant data.

In [20]:
reviews.loc[reviews.country == 'Italy']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
10,Italy,"Elegance, complexity and structure come togeth...",Ronco della Chiesa,95,80.0,Northeastern Italy,Collio,,Friulano,Borgo del Tiglio
32,Italy,"Underbrush, scorched earth, menthol and plum s...",Vigna Piaggia,90,,Tuscany,Brunello di Montalcino,,Sangiovese,Abbadia Ardenga
35,Italy,"Forest floor, tilled soil, mature berry and a ...",Riserva,90,135.0,Tuscany,Brunello di Montalcino,,Sangiovese,Carillon
37,Italy,"Aromas of forest floor, violet, red berry and ...",,90,29.0,Tuscany,Vino Nobile di Montepulciano,,Sangiovese,Avignonesi
38,Italy,"This has a charming nose that boasts rose, vio...",,90,23.0,Tuscany,Chianti Classico,,Sangiovese,Casina di Cornia
...,...,...,...,...,...,...,...,...,...,...
150920,Italy,"Rich and mature aromas of smoke, earth and her...",Brut Riserva,91,19.0,Northeastern Italy,Trento,,Champagne Blend,Letrari
150922,Italy,Made by 30-ish Roberta Borghese high above Man...,Superiore,91,,Northeastern Italy,Colli Orientali del Friuli,,Tocai,Ronchi di Manzano
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora


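We can combine multiple conditions using & (and) or | (or). Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators such as == and >=.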

In [21]:
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
10,Italy,"Elegance, complexity and structure come togeth...",Ronco della Chiesa,95,80.0,Northeastern Italy,Collio,,Friulano,Borgo del Tiglio
32,Italy,"Underbrush, scorched earth, menthol and plum s...",Vigna Piaggia,90,,Tuscany,Brunello di Montalcino,,Sangiovese,Abbadia Ardenga
35,Italy,"Forest floor, tilled soil, mature berry and a ...",Riserva,90,135.0,Tuscany,Brunello di Montalcino,,Sangiovese,Carillon
37,Italy,"Aromas of forest floor, violet, red berry and ...",,90,29.0,Tuscany,Vino Nobile di Montepulciano,,Sangiovese,Avignonesi
38,Italy,"This has a charming nose that boasts rose, vio...",,90,23.0,Tuscany,Chianti Classico,,Sangiovese,Casina di Cornia
...,...,...,...,...,...,...,...,...,...,...
150920,Italy,"Rich and mature aromas of smoke, earth and her...",Brut Riserva,91,19.0,Northeastern Italy,Trento,,Champagne Blend,Letrari
150922,Italy,Made by 30-ish Roberta Borghese high above Man...,Superiore,91,,Northeastern Italy,Colli Orientali del Friuli,,Tocai,Ronchi di Manzano
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora


In [22]:
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
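
The two combined filters above are hard to compare on 150,000 rows, so here is a small sketch on made-up data where the results can be counted by hand. Storing each condition in its own variable (is_italian and high_score are illustrative names) is one way to keep the combined expression readable.

```python
import pandas as pd

wines = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "US", "France"],
        "points": [95, 85, 96, 88],
    }
)

# Each comparison yields a boolean Series aligned with the rows.
is_italian = wines.country == "Italy"
high_score = wines.points >= 90

# & keeps rows where both conditions hold; | keeps rows where either holds.
both = wines.loc[is_italian & high_score]
either = wines.loc[is_italian | high_score]

print(list(both.index))
print(list(either.index))
```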


Pandas has some built-in conditional selectors.
    - isin() takes a list (or other list-like) of values and selects the entries whose value "is in" that list.
    - isnull() and notnull() select entries based on whether the value in a given column is null (NaN) or not.
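
A minimal sketch of these selectors on a small made-up frame, where the missing prices are entered as None and stored by pandas as NaN:

```python
import pandas as pd

wines = pd.DataFrame(
    {
        "country": ["Italy", "US", "France", "Spain"],
        "price": [80.0, None, 66.0, None],
    }
)

# isin: keep rows whose country appears in the given list.
italy_or_france = wines.loc[wines.country.isin(["Italy", "France"])]
print(list(italy_or_france.country))

# notnull / isnull: keep rows with or without a price.
priced = wines.loc[wines.price.notnull()]
unpriced = wines.loc[wines.price.isnull()]
print(len(priced), list(unpriced.country))
```
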

Values can be assigned to a DataFrame using basic assignment notation.

In [23]:
# Sets all country entries to 'everyone'
reviews['country'] = 'everyone'

# Assigns from an iterable (here a range counting down), one value per row
reviews['index_backwards'] = range(len(reviews), 0, -1)
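
The same two kinds of assignment can be tried on a small made-up frame, which makes the result easy to inspect. The column name critic is hypothetical and is used here so that the existing country column is not overwritten.

```python
import pandas as pd

wines = pd.DataFrame({"country": ["US", "Spain", "France"]})

# A single value is repeated for every row.
wines["critic"] = "everyone"

# An iterable supplies one value per row and must match the row count.
wines["index_backwards"] = range(len(wines), 0, -1)

print(list(wines["critic"]))
print(list(wines["index_backwards"]))
```

Assigning to an existing column name, as in the cell above, replaces that column's values in place.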