# Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial covers operations we can apply to our data to get the input "just right".

We'll use the Wine Magazine data for demonstration.

In [6]:

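import numpy as np
import pandas as pd

# Show at most 5 rows when displaying a DataFrame or Series.
pd.set_option('display.max_rows', 5)

# Use the first CSV column as the index.
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)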

In [8]:
reviews

id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,85,24.0,Niederösterreich,,,Roger Voss,@vossroger,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
126883,US,$10 for this very drinkable Cab? That's crazy....,,87,10.0,California,North Coast,North Coast,Virginie Boone,@vboone,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18824,US,Zucca has made a fragrant and floral Sangioves...,Sangiovese Rosato,87,18.0,California,Amador County,Sierra Foothills,Virginie Boone,@vboone,Zucca 2010 Sangiovese Rosato Rosé (Amador County),Rosé,Zucca,
88999,Austria,Zweigelt can do easy-drinking styles but in th...,Heideboden,90,26.0,Burgenland,,,Anne Krebiehl MW,@AnneInVino,Nittnaus Hans und Christine 2013 Heideboden Zw...,Zweigelt,Nittnaus Hans und Christine,


# Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method:

In [10]:
reviews.points.describe()

count    119988.000000
mean         88.442236
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [12]:
reviews.taster_name.describe()

count          95071
unique            19
top       Roger Voss
freq           23560
Name: taster_name, dtype: object
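
The type-aware behaviour described above can be tried on a small hand-made frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the wine reviews file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews file.
toy = pd.DataFrame({
    "points": [85, 87, 87, 90],
    "taster_name": ["Roger Voss", "Virginie Boone", "Virginie Boone", None],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy["points"].describe()

# Text column: count (non-missing), unique, top (most frequent), freq.
text_summary = toy["taster_name"].describe()

print(numeric_summary)
print(text_summary)
```

Note that `count` for the text column ignores the missing entry, which is also why the `taster_name` count in the wine data is lower than the `points` count.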

If you want a particular summary statistic for a Series or a DataFrame column, there is usually a pandas method that provides it.

For example, to see the mean of the points allotted (i.e., how well a typical wine scores), we can use the `mean()` method:

In [14]:
reviews.points.mean()

88.44223589025569

To see a list of unique values we can use the `unique()` function:

In [16]:
reviews.taster_name.unique()

array(['Roger Voss', 'Virginie Boone', nan, 'Michael Schachner',
       'Anna Lee C. Iijima', 'Paul Gregutt', 'Sean P. Sullivan',
       'Kerin O’Keefe', 'Anne Krebiehl\xa0MW', 'Lauren Buzzeo',
       'Joe Czerwinski', 'Alexander Peartree', 'Matt Kettmann',
       'Jim Gordon', 'Susan Kostrzewa', 'Mike DeSimone', 'Jeff Jenssen',
       'Christina Pickard', 'Carrie Dykes', 'Fiona Adams'], dtype=object)

To see a list of unique values _and_ how often they occur in the dataset, we can use the `value_counts()` method:

In [18]:
reviews.taster_name.value_counts()

taster_name
Roger Voss           23560
Michael Schachner    14046
                     ...  
Fiona Adams             24
Christina Pickard        6
Name: count, Length: 19, dtype: int64
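
One detail worth noticing in the outputs above is that `unique()` lists `nan` while `value_counts()` reports only 19 entries. The sketch below reproduces this on a made-up Series (the `tasters` values are hypothetical) and shows the `dropna=False` option, which asks `value_counts()` to count missing values too.

```python
import pandas as pd

# Hypothetical toy Series with one missing value.
tasters = pd.Series(
    ["Roger Voss", "Virginie Boone", None, "Roger Voss", "Roger Voss"],
    name="taster_name",
)

distinct = tasters.unique()                            # keeps the missing value
counts = tasters.value_counts()                        # drops missing values by default
counts_with_nan = tasters.value_counts(dropna=False)   # counts them as well

print(len(distinct))
print(counts)
print(counts_with_nan)
```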

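Pandas provides many common mapping operations as built-ins that work element-wise on whole columns. For example, a fast way of remeaning our points column is to subtract its mean directly, as in `reviews.points - reviews.points.mean()`. In the same way, we can combine the country and region information with the `+` operator; rows where either value is missing come out as `NaN`: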

In [20]:
reviews.country + " - " + reviews.region_1

id
94355                    NaN
126883      US - North Coast
                 ...        
18824     US - Amador County
88999                    NaN
Length: 119988, dtype: object

In [22]:
# Plain Python strings concatenate with + in the same way.
first_name = "ali"
last_n = "mohamed"
print(first_name + " " + last_n)

ali mohamed
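
The element-wise operations discussed above can be sketched on a small made-up frame. The `toy` data is hypothetical. The use of `Series.str.cat` with `na_rep` is one possible way to avoid the `NaN` results seen earlier, not something the tutorial itself prescribes.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews file.
toy = pd.DataFrame({
    "country": ["Austria", "US", "US"],
    "region_1": [None, "North Coast", "Amador County"],
    "points": [85, 87, 90],
})

# Remeaning: subtract the column mean from every value at once.
centered = toy["points"] - toy["points"].mean()

# Operator-based concatenation: a missing operand gives a missing result.
combined = toy["country"] + " - " + toy["region_1"]

# str.cat with na_rep substitutes a placeholder for missing values instead.
combined_filled = toy["country"].str.cat(
    toy["region_1"], sep=" - ", na_rep="unknown"
)

print(centered)
print(combined)
print(combined_filled)
```

The operator form is the shortest to write, while the `str.cat` form may be preferable when a missing region should not erase the country.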
