# Moving Beyond the Basics

In [2]:
import pandas as pd
reviews = pd.read_csv('winemag-data_first150k.csv')

## Summary Functions

There are a number of built-in functions that give an overview of the data in a column (or row), similar to the describe() and head() functions.

In [10]:
# Gives the mean value of a numeric column
reviews.points.mean()

# Returns an array of the unique values in a column
reviews.country.unique()

# Returns a Series of the unique values and how many times each occurs
reviews.country.value_counts()

# Returns the index label of the first occurrence of the max value in a Series
reviews.price.idxmax()

34920
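
To make the return types of these summary functions easier to see, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its values are assumptions for illustration and are not taken from the wine dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews
toy = pd.DataFrame({
    "country": ["US", "Spain", "US", "France"],
    "points": [96, 96, 90, 94],
    "price": [235.0, 110.0, 235.0, 66.0],
})

mean_points = toy.points.mean()              # a single float
unique_countries = toy.country.unique()      # array, in order of first appearance
country_counts = toy.country.value_counts()  # Series indexed by the unique values
top_price_label = toy.price.idxmax()         # index label of the first maximum

print(mean_points)             # 94.0
print(list(unique_countries))  # ['US', 'Spain', 'France']
print(country_counts["US"])    # 2
print(top_price_label)         # 0
```

Note that the price 235.0 appears twice in the toy data, and idxmax() reports only the label of its first occurrence.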

## Maps

A map is a mathematical concept: a function that takes one set of values and 'maps' it onto another set of values.
In data science, maps are a common tool for transforming data into the format we want.


Two mapping methods come up often in pandas. The first and simpler one is the .map() method.

    - The function passed to map() takes a single value from the Series and returns a transformed version of that value (it can be a lambda function or a named custom function).

    - apply() is the equivalent method for a whole DataFrame: it calls a custom function on each row (or, with axis='index', on each column).

This is how we can apply a custom function to a column of data, or to every row of a DataFrame.
Both map() and apply() return a new, transformed object (a Series for map(), typically a DataFrame for apply()) and do not edit the original in place.
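
The difference between the two methods can be sketched on a small made-up DataFrame. The `toy` data and the `price_per_point` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({"points": [90, 95, 100], "price": [10.0, 20.0, 30.0]})
points_mean = toy.points.mean()  # 95.0

# map(): the function receives one value of the Series at a time
centered = toy.points.map(lambda p: p - points_mean)

# apply(axis='columns'): the function receives one row (as a Series) at a time
def price_per_point(row):
    return row.price / row.points

ratio = toy.apply(price_per_point, axis="columns")

print(centered.tolist())    # [-5.0, 0.0, 5.0]
print(toy.points.tolist())  # [90, 95, 100], the original is unchanged
```

Because `price_per_point` returns a single number per row, apply() here yields a Series with one entry per row. A function that returns the whole row would yield a DataFrame instead.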

In [None]:
# Takes the points series and shifts the mean to be 0
review_points_mean = reviews.points.mean()

reviews.points.map(lambda p: p - review_points_mean)

# Function that shifts the points so the mean is 0
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')
# If we had called reviews.apply() with axis='index', then instead of passing a function to
# transform each row, we would need to pass a function that transforms each column.

# reviews still holds the original values because we did not assign the output of .apply()
reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
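
The comment above about axis='index' can be illustrated with a short sketch in which the function receives a whole column rather than a row. The `toy` DataFrame and `center_column` are assumptions made for this example, chosen so that every column is numeric.

```python
import pandas as pd

# Hypothetical toy data with numeric columns only
toy = pd.DataFrame({"points": [90, 95, 100], "price": [10.0, 20.0, 30.0]})

# With axis='index' the function is called once per column
# and receives that column as a Series.
def center_column(col):
    return col - col.mean()

centered = toy.apply(center_column, axis="index")

print(centered.points.tolist())  # [-5.0, 0.0, 5.0]
print(centered.price.tolist())   # [-10.0, 0.0, 10.0]
```

On the real reviews DataFrame a function like this would fail on the text columns, so it would need to be applied to a numeric subset only.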


Many common operations are already built into pandas, such as the basic arithmetic operators +, -, * and /.

pandas applies these operators element-wise, either between a Series and a single value or between two Series of equal length, which makes them very versatile.

These basic operators are faster than map() and apply() but not as flexible: they cannot express arbitrary custom logic, such as conditionals, the way a function can.

In [9]:
# Faster way to shift the mean to 0
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

# Create new Series based on other series
reviews.country + ' - ' + reviews.region_1

0                  US - Napa Valley
1                      Spain - Toro
2               US - Knights Valley
3            US - Willamette Valley
4                   France - Bandol
                    ...            
150925    Italy - Fiano di Avellino
150926           France - Champagne
150927    Italy - Fiano di Avellino
150928           France - Champagne
150929           Italy - Alto Adige
Length: 150930, dtype: object
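
The speed claim made earlier can be checked with a rough timing sketch. It runs on a randomly generated Series rather than the wine data, so `points` and its size are assumptions, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for reviews.points: 100,000 random integer scores
rng = np.random.default_rng(0)
points = pd.Series(rng.integers(80, 101, size=100_000))
points_mean = points.mean()

# Same transformation, written with an operator and with map()
via_operator = points - points_mean
via_map = points.map(lambda p: p - points_mean)

# Both approaches produce the same values
print(np.allclose(via_operator, via_map))

# Time 10 repetitions of each approach
t_operator = timeit.timeit(lambda: points - points_mean, number=10)
t_map = timeit.timeit(lambda: points.map(lambda p: p - points_mean), number=10)
print(f"operator: {t_operator:.4f}s, map: {t_map:.4f}s")
```

The operator version works on the whole column at once, while map() calls the Python function once per value, which is where the difference in speed comes from.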