#  Pandas tutorial

Pandas (the name comes from *panel data*) is the second most useful Python library for data analysis and preparation. It lets you work with tabular data and provides efficient, easy-to-use methods for:
- data selection
- data modification
- data indexing
- merging of data from various sources
- transforming data
- feeding data into `scikit-learn` and other ML-related libraries

In this tutorial we will go through the most useful and common operations performed with pandas.

There are two fundamental data structures that you need to understand:

- `pd.Series`: a one-dimensional labelled array, similar to a list, except that all elements share one type and additional methods and operations are available
- `pd.DataFrame`: the data structure that represents tabular data. Each column in a data frame is a `Series` object, and each data frame also has a row index and a column index.
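
The relationship between the two structures can be sketched on a tiny, made-up example (the values and the names `population` and `countries` are illustrative only):

```python
import pandas as pd

# A Series is a one-dimensional labelled array with a single dtype.
population = pd.Series([38.0, 80.0, 65.0], index=['PL', 'DE', 'GB'], name='population_mln')

# Arithmetic is applied element-wise, unlike with a plain Python list.
doubled = population * 2

# A DataFrame is a set of columns (Series) that share one row index.
countries = pd.DataFrame(
    {
        'name': ['Poland', 'Germany', 'Great Britain'],
        'population_mln': [38.0, 80.0, 65.0],
    },
    index=['PL', 'DE', 'GB'],
)

print(type(countries['population_mln']))   # each column is a Series
print(countries.index.tolist())            # row index
print(countries.columns.tolist())          # column index
```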

Let's dive into coding.

## Manual creation of a data frame

By convention, the `pandas` library is imported under the alias `pd`.

The simplest way to create a dataframe is to provide a dictionary of lists. Each key becomes the name of the column, each list becomes the series contained in the column.

In [3]:
import pandas as pd

df = pd.DataFrame(
{
    'Code': ['PL', 'DE', 'GB', 'CZ'],
    'Name': ['Poland', 'Germany', 'Great Britain', 'Czech Republic'],
    'Population': [38000000, 80000000, 65000000, 10000000]
})

df

Unnamed: 0,Code,Name,Population
0,PL,Poland,38000000
1,DE,Germany,80000000
2,GB,Great Britain,65000000
3,CZ,Czech Republic,10000000


Each column is a `pd.Series` object. We can inspect it using either the dot notation or by referring to the column by its name in brackets. Passing a list of names in brackets selects several columns at once and returns a `DataFrame`.

In [248]:
df.Population

In [249]:
df[['Population','Code']]
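
The difference between the three ways of selecting columns shows up in the type of the result. A minimal sketch on a shortened version of the frame above:

```python
import pandas as pd

df = pd.DataFrame({'Code': ['PL', 'DE'], 'Population': [38000000, 80000000]})

by_dot = df.Population                     # Series
by_name = df['Population']                 # the same Series
as_frame = df[['Population']]              # one-column DataFrame
two_columns = df[['Population', 'Code']]   # DataFrame, columns in the requested order

print(type(by_name).__name__, type(as_frame).__name__)
```

The dot notation is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing attribute of the data frame.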

## Reading data from a file

The two most common ways of reading text files into `pandas` are:
- `pd.read_table`: assumes tab-separated text file
- `pd.read_csv`: assumes comma-separated text file
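
Both functions differ mainly in the default separator. The sketch below reads two small in-memory texts (made up for illustration) instead of files, so it runs without network access:

```python
import io
import pandas as pd

tab_text = "order_id\titem_name\n1\tIzze\n2\tChicken Bowl\n"
csv_text = "order_id,item_name\n1,Izze\n2,Chicken Bowl\n"

from_table = pd.read_table(io.StringIO(tab_text))   # default separator: tab
from_csv = pd.read_csv(io.StringIO(csv_text))       # default separator: comma

# read_csv can also read tab-separated text when the separator is given explicitly.
from_csv_tab = pd.read_csv(io.StringIO(tab_text), sep='\t')

print(from_table)
```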

For the sake of reproducibility we will use public online datasets and read them directly from the Web. Please take a moment to investigate these datasets:

- [Chipotle orders](https://bit.ly/chiporders)
- [UFO sighting reports](https://bit.ly/uforeports)
- [IMDB movie ratings](https://bit.ly/imdbratings)
- [Drinking by country](https://bit.ly/drinksbycountry)

In [250]:
orders = pd.read_table('https://bit.ly/chiporders')

orders.head(10)

In [251]:
ufo = pd.read_csv('https://bit.ly/uforeports')

ufo.head()

Series of strings can be concatenated element-wise with `+`, just like strings in Python.

In [252]:
orders.item_name + ' ' + orders.item_price
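
Concatenation with `+` works here because both columns hold strings. A numeric column has to be converted with `astype(str)` first, as in this sketch on made-up order data:

```python
import pandas as pd

orders = pd.DataFrame({
    'item_name': ['Izze', 'Chicken Bowl'],
    'item_price': ['$3.39', '$16.98'],   # prices are stored as text
    'quantity': [1, 2],                  # integers
})

labels = orders.item_name + ' ' + orders.item_price

# Integers cannot be added to strings directly, so convert them first.
with_quantity = orders.quantity.astype(str) + ' x ' + orders.item_name

print(with_quantity.tolist())
```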

## Analyzing a data frame

`pandas` provides simple methods that allow you to investigate the aggregate properties of individual series and the entire data frame.

In [4]:
movies = pd.read_csv('https://bit.ly/imdbratings')

movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


A simple way to quickly learn the distribution of a feature is to use the `describe()` method.

In [254]:
movies.duration.describe()

The output of the `describe()` method depends on whether the feature is numerical or categorical.

In [255]:
movies.genre.describe()
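
To see the difference without downloading anything, here is a sketch on a four-row frame with invented values:

```python
import pandas as pd

movies = pd.DataFrame({
    'genre': ['Crime', 'Crime', 'Action', 'Drama'],
    'duration': [142, 175, 152, 96],
})

numeric_summary = movies.duration.describe()    # count, mean, std, min, quartiles, max
categorical_summary = movies.genre.describe()   # count, unique, top, freq

print(numeric_summary.index.tolist())
print(categorical_summary.index.tolist())
```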

One can apply the `describe()` method to the entire dataframe as well.

In [256]:
movies.describe()

In [257]:
movies.shape

In [258]:
movies.columns

In [259]:
movies.dtypes

For a more advanced analysis of a `pandas` dataframe we can use the excellent `ydata-profiling` library (formerly known as `pandas-profiling`).

In [5]:
from ydata_profiling import ProfileReport

movies_profile = ProfileReport(df=movies, title="Analysis of the Movies dataframe", explorative=True)
movies_profile

# or simply: 
# movies.profile_report(title="Analysis of the Movies dataframe")



## Renaming columns

In [6]:
ufo = pd.read_csv('https://bit.ly/uforeports')

ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

In [7]:
ufo['Colors Reported']

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
18236    NaN
18237    NaN
18238    NaN
18239    RED
18240    NaN
Name: Colors Reported, Length: 18241, dtype: object

If a column name contains a space, it cannot be accessed with the dot notation. There are several ways to rename columns. Note that `rename()` returns a renamed copy and leaves the original data frame unchanged.

In [8]:
ufo.rename(columns={'Colors Reported': 'Colors_Reported', 'Time': 'Date and time'})

Unnamed: 0,City,Colors_Reported,Shape Reported,State,Date and time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45


In [9]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [10]:
col_names = ['city', 'colors_reported', 'shape_reported', 'state', 'time']

# header=0 marks the first line of the file as the old header, so that it is replaced rather than read as data
pd.read_csv('https://bit.ly/uforeports', names=col_names, header=0)

Unnamed: 0,city,colors_reported,shape_reported,state,time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45


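Column names can also be assigned directly through the `columns` attribute. The new list must contain exactly one name per existing column; otherwise `pandas` raises a `ValueError`, as the deliberately too-long list in this cell shows.

In [11]: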
ufo.columns = col_names + ['non existing column']

ufo.head()

ValueError: Length mismatch: Expected axis has 5 elements, new values have 6 elements
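
When many columns need the same kind of change, renaming them one by one is tedious. Two further options are sketched here on a one-row stand-in for the UFO data: passing a function to `rename()`, and applying string methods to the column index.

```python
import pandas as pd

ufo = pd.DataFrame({
    'City': ['Ithaca'],
    'Colors Reported': [None],
    'Shape Reported': ['TRIANGLE'],
})

# Option 1: rename() with a function that is applied to every column name (returns a copy).
renamed = ufo.rename(columns=lambda name: name.lower().replace(' ', '_'))

# Option 2: string methods on the column index, assigned back to the frame.
ufo.columns = ufo.columns.str.lower().str.replace(' ', '_')

print(ufo.columns.tolist())
print(ufo.shape_reported.tolist())   # the dot notation works again
```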

## Dropping rows and columns

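An important concept in `pandas` is the **axis**, the direction along which an operation is performed. `axis=0` refers to the row index: an aggregation runs down the rows and yields one result per column, and `drop` looks for the given labels among the rows. `axis=1` refers to the columns: an aggregation runs across the columns and yields one result per row, and `drop` looks for the given labels among the columns.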

By default, `pandas` expects the rows to be dropped, so if you want to drop a column, you have to explicitly state `axis=1`.
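
A small sketch of how the axis affects both an aggregation and `drop` (the frame `scores` is made up for this illustration):

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [1, 2, 3], 'physics': [10, 20, 30]},
    index=['a', 'b', 'c'],
)

per_column = scores.mean(axis=0)   # runs down the rows: one value per column
per_row = scores.mean(axis=1)      # runs across the columns: one value per row

without_row = scores.drop('a')                  # axis=0 is the default: 'a' is a row label
without_column = scores.drop('math', axis=1)    # 'math' is a column label

print(per_column)
print(per_row)
```

The same call can also be written as `scores.drop(columns='math')`, which avoids having to remember the axis number.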

In [12]:
drinks = pd.read_csv('https://bit.ly/drinksbycountry')

drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [13]:
drinks.shape

(193, 6)

In [50]:
# country and continent are text columns, so restrict the mean to the numeric columns
drinks.mean(numeric_only=True)

In [29]:
# axis=1 computes the mean across the numeric columns of each row
drinks.mean(axis=1, numeric_only=True)

In [31]:
ufo = pd.read_csv('https://bit.ly/uforeports')

In [32]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [33]:
ufo.drop('City', axis=1)

Unnamed: 0,Colors Reported,Shape Reported,State,Time
0,,TRIANGLE,NY,6/1/1930 22:00
1,,OTHER,NJ,6/30/1930 20:00
2,,OVAL,CO,2/15/1931 14:00
3,,DISK,KS,6/1/1931 13:00
4,,LIGHT,NY,4/18/1933 19:00
...,...,...,...,...
18236,,TRIANGLE,IL,12/31/2000 23:00
18237,,DISK,IA,12/31/2000 23:00
18238,,,WI,12/31/2000 23:45
18239,RED,LIGHT,WI,12/31/2000 23:45


In [19]:
ufo_backup = ufo.set_index('City')

In [20]:
ufo_backup.head()

City,Colors Reported,Shape Reported,State,Time
Ithaca,,TRIANGLE,NY,6/1/1930 22:00
Willingboro,,OTHER,NJ,6/30/1930 20:00
Holyoke,,OVAL,CO,2/15/1931 14:00
Abilene,,DISK,KS,6/1/1931 13:00
New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [21]:
ufo_backup.drop('Ithaca', axis=0)

City,Colors Reported,Shape Reported,State,Time
Willingboro,,OTHER,NJ,6/30/1930 20:00
Holyoke,,OVAL,CO,2/15/1931 14:00
Abilene,,DISK,KS,6/1/1931 13:00
New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
Valley City,,DISK,ND,9/15/1934 15:30
...,...,...,...,...
Grant Park,,TRIANGLE,IL,12/31/2000 23:00
Spirit Lake,,DISK,IA,12/31/2000 23:00
Eagle River,,,WI,12/31/2000 23:45
Eagle River,RED,LIGHT,WI,12/31/2000 23:45


In [22]:
ufo.drop('Time', axis=1)

Unnamed: 0,City,Colors Reported,Shape Reported,State
0,Ithaca,,TRIANGLE,NY
1,Willingboro,,OTHER,NJ
2,Holyoke,,OVAL,CO
3,Abilene,,DISK,KS
4,New York Worlds Fair,,LIGHT,NY
...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL
18237,Spirit Lake,,DISK,IA
18238,Eagle River,,,WI
18239,Eagle River,RED,LIGHT,WI


In [23]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [24]:
ufo.drop([1,3,4]).head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
5,Valley City,,DISK,ND,9/15/1934 15:30
6,Crater Lake,,CIRCLE,CA,6/15/1935 0:00
7,Alma,,DISK,MI,7/15/1936 0:00


In [25]:
ufo.index[0:10]

RangeIndex(start=0, stop=10, step=1)

In [26]:
ufo.drop(ufo.index[0:3]).head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
5,Valley City,,DISK,ND,9/15/1934 15:30
6,Crater Lake,,CIRCLE,CA,6/15/1935 0:00
7,Alma,,DISK,MI,7/15/1936 0:00


None of these operations changes the underlying object; each returns a modified copy. If you want to keep the result, you can:
- add `inplace=True`: modifies the object directly and returns `None`; it is usually not faster than assignment, or
- use assignment: rebinds the name to the modified copy; many people prefer this form because it also works with method chaining

In [27]:
ufo.drop('State', axis=1, inplace=True)

ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,Time
0,Ithaca,,TRIANGLE,6/1/1930 22:00
1,Willingboro,,OTHER,6/30/1930 20:00
2,Holyoke,,OVAL,2/15/1931 14:00
3,Abilene,,DISK,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,4/18/1933 19:00


In [28]:
ufo = ufo.drop([0,1,4])

ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,Time
2,Holyoke,,OVAL,2/15/1931 14:00
3,Abilene,,DISK,6/1/1931 13:00
5,Valley City,,DISK,9/15/1934 15:30
6,Crater Lake,,CIRCLE,6/15/1935 0:00
7,Alma,,DISK,7/15/1936 0:00


## Exercise

1. Read the *Titanic* dataset from https://tinyurl.com/y9p968ys into a dataframe called `titanic`
2. Display first 15 rows of the dataset
3. Rename `PassengerId` to `ID`, `Lname` to `last_name`, and `Name` to `first_name`
4. Remove all rows for which the cabin number is not known

In [36]:
titanic = pd.read_csv('https://tinyurl.com/y9p968ys')
titanic.head(15)

Unnamed: 0,PassengerId,Survived,Pclass,Lname,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Braund,Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Heikkinen,Miss. Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,Allen,Mr. William Henry,male,35.0,0,0,373450,8.05,,S
5,6,0,3,Moran,Mr. James,male,,0,0,330877,8.4583,,Q
6,7,0,1,McCarthy,Mr. Timothy J,male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,Palsson,Master. Gosta Leonard,male,2.0,3,1,349909,21.075,,S
8,9,1,3,Johnson,Mrs. Oscar W (Elisabeth Vilhelmina Berg),female,27.0,0,2,347742,11.1333,,S
9,10,1,2,Nasser,Mrs. Nicholas (Adele Achem),female,14.0,1,0,237736,30.0708,,C


In [43]:
titanic.rename(columns={'PassengerId': 'ID', 'Lname': 'last_name', 'Name': 'first_name'}, inplace=True)
titanic.head(15)

Unnamed: 0,ID,Survived,Pclass,last_name,first_name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Braund,Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Heikkinen,Miss. Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,Allen,Mr. William Henry,male,35.0,0,0,373450,8.05,,S
5,6,0,3,Moran,Mr. James,male,,0,0,330877,8.4583,,Q
6,7,0,1,McCarthy,Mr. Timothy J,male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,Palsson,Master. Gosta Leonard,male,2.0,3,1,349909,21.075,,S
8,9,1,3,Johnson,Mrs. Oscar W (Elisabeth Vilhelmina Berg),female,27.0,0,2,347742,11.1333,,S
9,10,1,2,Nasser,Mrs. Nicholas (Adele Achem),female,14.0,1,0,237736,30.0708,,C


In [44]:
titanic.dropna(subset=['Cabin'], inplace=True)
titanic.head(15)

Unnamed: 0,ID,Survived,Pclass,last_name,first_name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,McCarthy,Mr. Timothy J,male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,Sandstrom,Miss. Marguerite Rut,female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,Bonnell,Miss. Elizabeth,female,58.0,0,0,113783,26.55,C103,S
21,22,1,2,Beesley,Mr. Lawrence,male,34.0,0,0,248698,13.0,D56,S
23,24,1,1,Sloper,Mr. William Thompson,male,28.0,0,0,113788,35.5,A6,S
27,28,0,1,Fortune,Mr. Charles Alexander,male,19.0,3,2,19950,263.0,C23 C25 C27,S
31,32,1,1,Spencer,Mrs. William Augustus (Marie Eugenie),female,,1,0,PC 17569,146.5208,B78,C
52,53,1,1,Harper,Mrs. Henry Sleeper (Myna Haxtun),female,49.0,1,0,PC 17572,76.7292,D33,C


## Sorting data frames

You can sort an individual series within a data frame, or the entire data frame by one or more columns. Sorting returns a new object; to make it permanent, assign the result back or pass `inplace=True`.

In [34]:
movies = pd.read_csv('https://bit.ly/imdbratings')

movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [51]:
movies.duration.sort_index()

0      142
1      175
2      200
3      152
4      154
      ... 
974    116
975    118
976    138
977    114
978    126
Name: duration, Length: 979, dtype: int64

In [52]:
movies.duration.sort_values(ascending=False)

476    242
157    238
78     229
142    224
445    220
      ... 
293     68
88      68
258     67
338     66
389     64
Name: duration, Length: 979, dtype: int64

In [53]:
movies.sort_values('title', ascending=False)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
864,7.5,[Rec],R,Horror,78,"[u'Manuela Velasco', u'Ferran Terraza', u'Jorg..."
526,7.8,Zulu,UNRATED,Drama,138,"[u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac..."
615,7.7,Zombieland,R,Comedy,88,"[u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha..."
677,7.7,Zodiac,R,Crime,157,"[u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M..."
955,7.4,Zero Dark Thirty,R,Drama,157,"[u'Jessica Chastain', u'Joel Edgerton', u'Chri..."
...,...,...,...,...,...,...
110,8.3,2001: A Space Odyssey,G,Mystery,160,"[u'Keir Dullea', u'Gary Lockwood', u'William S..."
698,7.6,127 Hours,R,Adventure,94,"[u'James Franco', u'Amber Tamblyn', u'Kate Mara']"
201,8.1,12 Years a Slave,R,Biography,134,"[u'Chiwetel Ejiofor', u'Michael Kenneth Willia..."
5,8.9,12 Angry Men,NOT RATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."


In [54]:
movies.sort_values(['content_rating', 'duration'])

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
713,7.6,The Jungle Book,APPROVED,Animation,78,"[u'Phil Harris', u'Sebastian Cabot', u'Louis P..."
513,7.8,Invasion of the Body Snatchers,APPROVED,Horror,80,"[u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga..."
272,8.1,The Killing,APPROVED,Crime,85,"[u'Sterling Hayden', u'Coleen Gray', u'Vince E..."
703,7.6,Dracula,APPROVED,Horror,85,"[u'Bela Lugosi', u'Helen Chandler', u'David Ma..."
612,7.7,A Hard Day's Night,APPROVED,Comedy,87,"[u'John Lennon', u'Paul McCartney', u'George H..."
...,...,...,...,...,...,...
387,8.0,Midnight Cowboy,X,Drama,113,"[u'Dustin Hoffman', u'Jon Voight', u'Sylvia Mi..."
86,8.4,A Clockwork Orange,X,Crime,136,"[u'Malcolm McDowell', u'Patrick Magee', u'Mich..."
187,8.2,Butch Cassidy and the Sundance Kid,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"


## Filter rows by a value in a column

The filtering in `pandas` works very similarly to the way we do filtering in `NumPy`. We will start with creating a boolean series based on a condition, and then we will pass this series as an indexer to the data frame.

In [55]:
movies.genre == 'Horror'

0      False
1      False
2      False
3      False
4      False
       ...  
974    False
975    False
976    False
977     True
978    False
Name: genre, Length: 979, dtype: bool

In [281]:
movies.shape

In [282]:
movies[movies.genre.isin(['Horror','Thriller','Mystery'])]

If we want to combine several conditions, there are two important things to remember:
- each condition must be in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`
- instead of the `and` and `or` operators we must use `&` and `|`, which work element-wise

In [283]:
movies[ (movies.genre == 'Horror') & (movies.duration > 120)]

We may also negate a condition with the `~` operator.

In [284]:
movies[~(movies.duration > 120)]
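
The rules for combining conditions can be checked on a small frame with invented values. The last part of the sketch shows what happens when `and` is used by mistake.

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Horror', 'Horror', 'Drama', 'Crime'],
    'duration': [90, 130, 150, 100],
})

long_horror = movies[(movies.genre == 'Horror') & (movies.duration > 120)]
horror_or_long = movies[(movies.genre == 'Horror') | (movies.duration > 120)]
not_long = movies[~(movies.duration > 120)]

# The plain `and` operator asks for a single truth value of a whole Series, which is ambiguous.
try:
    movies[(movies.genre == 'Horror') and (movies.duration > 120)]
except ValueError as error:
    print(f"ValueError: {error}")
```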

## Exercise

1. Sort the `titanic` dataframe in the decreasing order of the fare price
2. Create a new dataframe `titanic_f` containing only information on female passengers
3. Create a new dataframe `titanic_x` containing only young passengers (age < 18) who did not embark in Cork.

## String methods

`pandas` makes it very easy to use string methods on the columns of a data frame. Just remember to access these methods via the `str` attribute of the series. Method invocations can be easily chained because each method returns a `Series` object, but the `str` accessor has to be repeated before every string method.

In [285]:
orders = pd.read_table('https://bit.ly/chiporders')

orders.head()

In [286]:
orders.item_name

In [287]:
orders.item_name.str

In [288]:
orders.item_name.str.upper().str.lower().str.len()

In [289]:
orders.item_name.str.replace('Tomato','Pomodoro').str.lower().str.split()
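
String methods skip missing values instead of failing, and `.str[...]` can index into the result of `split()`. A sketch on a made-up series (the name `names` is illustrative):

```python
import pandas as pd

names = pd.Series(['  Chicken Bowl ', 'Steak Burrito', None])

cleaned = names.str.strip().str.lower()

# .str[0] picks the first element of every list produced by split().
first_word = cleaned.str.split().str[0]

# na=False decides what a missing value should count as in the boolean result.
has_bowl = cleaned.str.contains('bowl', na=False)

print(first_word.tolist())
print(has_bowl.tolist())
```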

## Exercise

1. Create a list of names of passengers consisting of the title (Mr., Miss., Mrs.) and the last name.
2. Create a list of names of passengers consisting of the first initial and the last name

## Changing data type of a series

All elements in a `Series` object have the same type. It is possible to cast the entire series to a new type using the `Series.astype()` method. The type can be set either while reading the data or after the data frame has been created.

In [290]:
orders.dtypes

In [291]:
orders.head()

In [292]:
orders.order_id.astype(float)

In [293]:
orders = pd.read_table('https://bit.ly/chiporders', dtype={'quantity': float})

orders.dtypes

If we want to convert `item_price` to a number, we have to first remove the dollar sign from the string representation of the series, and then cast the entire series.

In [294]:
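# regex=False treats '$' as a literal character; as a regular expression it would mean "end of string"
orders['item_price'] = orders.item_price.str.replace('$', '', regex=False).astype(float)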

In [295]:
orders.dtypes
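
The same conversion on a small in-memory frame, together with `pd.to_numeric`, which is an alternative when some values may not be convertible. The data are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    'quantity': ['1', '2'],
    'item_price': ['$2.39 ', '$16.98 '],
})

orders['quantity'] = orders.quantity.astype(int)
orders['item_price'] = (
    orders.item_price
    .str.replace('$', '', regex=False)   # remove the literal dollar sign
    .str.strip()                         # remove surrounding whitespace
    .astype(float)
)

total = (orders.quantity * orders.item_price).sum()

# astype() raises an error on a value it cannot convert;
# to_numeric with errors='coerce' turns such values into NaN instead.
coerced = pd.to_numeric(pd.Series(['1', 'x']), errors='coerce')

print(orders.dtypes)
```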

## Group by

`Pandas` offers a very broad range of methods for advanced data processing. A common operation is to create aggregates of the table based on the grouping of data on a column. This can be easily achieved using a single function call.

In [296]:
drinks = pd.read_csv('https://bit.ly/drinksbycountry')

drinks.head()

In [297]:
drinks.mean(numeric_only=True)

In [None]:
drinks.groupby('continent')

In [None]:
drinks.groupby('continent').beer_servings.mean()

In [None]:
drinks.groupby('continent').std(numeric_only=True)

We can apply several aggregate functions to a grouped data frame using the `agg()` method.

In [None]:
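# country is a text column, for which the mean is undefined, so we drop it first
drinks.drop(columns='country').groupby('continent').agg(['mean', 'min', 'max', 'count'])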
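
`agg()` also accepts keyword arguments of the form `new_name=(column, function)`, known as named aggregation, which gives flat and readable column names. A sketch on a made-up miniature of the `drinks` table:

```python
import pandas as pd

drinks = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'continent': ['Europe', 'Europe', 'Asia', 'Asia'],
    'beer_servings': [100, 300, 10, 30],
})

summary = drinks.groupby('continent').agg(
    mean_beer=('beer_servings', 'mean'),
    max_beer=('beer_servings', 'max'),
    countries=('country', 'count'),
)

print(summary)
```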

The results of the group by operation can be quickly visualized.

In [298]:
%matplotlib inline

drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')

## Exercise

1. Compute the number of passengers and the average ticket fare for each port of embarkation
2. Compare the number of female and male passengers who have survived the sinking
3. Compare the mean age of passengers who survived the sinking with the mean age of passengers who have died

## Exploring the data frame

Here we list some useful functions to run right after reading the data in order to understand it better.

In [299]:
movies = pd.read_csv('https://bit.ly/imdbratings')

movies.head()

In [300]:
movies.star_rating.describe()

In [301]:
movies.genre.describe()

In [302]:
movies.genre.value_counts()

In [303]:
movies.genre.value_counts(normalize=True)

Since the result of the `value_counts()` function is a `Series` object, we can process it further.

In [304]:
movies.genre.value_counts(normalize=True).plot(kind='bar')

In [305]:
movies.duration.plot(kind='hist')

In [306]:
movies.genre.unique()

In [307]:
genres = movies.genre.unique()

for g in genres:
    df = movies[movies.genre == g]
    ...

The `pd.crosstab` function allows you to quickly build a cross-tabulation (a table of co-occurrence counts, similar to a pivot table) from two series.

In [308]:
genres = movies.genre
ratings = movies.content_rating

In [309]:
genres

In [310]:
ratings

In [311]:
pd.crosstab(genres, ratings)
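
What the cross-tabulation counts is easiest to see on a few rows. The values below are invented; `margins=True` adds row and column totals.

```python
import pandas as pd

movies = pd.DataFrame({
    'genre': ['Crime', 'Crime', 'Action', 'Crime'],
    'content_rating': ['R', 'R', 'PG-13', 'PG-13'],
})

# Each cell counts the rows with a given (genre, content_rating) pair.
table = pd.crosstab(movies.genre, movies.content_rating)

# margins=True appends an 'All' row and an 'All' column with the totals.
with_totals = pd.crosstab(movies.genre, movies.content_rating, margins=True)

print(with_totals)
```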

## Handling missing values

When working with a data frame, we must be careful when the data contains missing values. Two methods are very useful here:
- `isnull()`: returns `True` for every value that is missing
- `dropna()`: removes rows and/or columns with missing values

In [312]:
ufo = pd.read_csv('https://bit.ly/uforeports')

ufo.head()

In [313]:
ufo['Colors Reported'].isnull()

In [314]:
ufo['Colors Reported'].isnull().sum()

In [315]:
ufo.isnull().sum()

In [316]:
ufo.shape

In [317]:
ufo.dropna(how='all', axis=0, subset=['City', 'Colors Reported'])  # drop a row only if both listed columns are missing
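
The effect of the `how` and `subset` parameters can be compared on a four-row frame (made up for illustration):

```python
import pandas as pd

ufo = pd.DataFrame({
    'City': ['Ithaca', None, 'Abilene', None],
    'Colors Reported': [None, 'RED', 'GREEN', None],
    'State': ['NY', 'NJ', 'KS', 'CO'],
})

missing_per_column = ufo.isnull().sum()

# how='any': drop a row if at least one value is missing.
complete_rows = ufo.dropna(how='any')

# how='all' with subset: drop a row only if every listed column is missing.
some_info = ufo.dropna(how='all', subset=['City', 'Colors Reported'])

print(missing_per_column)
print(len(complete_rows), len(some_info))
```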

## Exercise

1. Remove from the `titanic` dataframe records which do not have the age of the passenger registered
2. Compute the number of missing cabin numbers for each class of passengers

## What is an index?

An `Index` is a special type that can be used to access rows and columns. There are three main uses for an index:
- identification of rows/columns
- selection of rows/columns
- alignment of rows

In [318]:
drinks = pd.read_csv('https://bit.ly/drinksbycountry')

drinks.head()

In [319]:
drinks.set_index('country', inplace=True)

drinks.head()

Notice that the index of a data frame is inherited by all series.

In [320]:
drinks.continent.head()

In [321]:
drinks.continent.value_counts().values

In [322]:
drinks.continent.value_counts().index

An index is very useful for selecting both rows and columns. All you need to remember is that the `loc` indexer of a data frame expects the index labels of the rows and columns that you want to retrieve.

In [323]:
drinks.head()

In [324]:
drinks.loc['Poland', :]

In [325]:
drinks.loc['Gabon':'Guyana', 'beer_servings']

In [326]:
drinks.columns

In [327]:
drinks.loc[['Poland', 'Germany', 'France'], 'beer_servings':'wine_servings']

Let's create a `Series` object with an index that can be aligned with our `drinks` data frame.

In [328]:
population = pd.Series([4000000, 38000000, 80000000, 70000000], 
                       index=['Albania', 'Poland', 'Germany', 'France'], 
                       name='population')

population

In [329]:
drinks.loc[['Albania', 'Poland', 'Germany', 'France', 'Greece']].beer_servings * population

In [330]:
pd.concat([drinks, population], axis=1)

You can always revert to a default "row number" index and move the index column to the column list.

In [331]:
drinks.reset_index(inplace=True)

drinks.head()

## Indexing with `loc`, `iloc`, and `ix`

The differences between these indexers are easy to confuse. Try to remember the following rules:
- `loc` uses row/column labels (index entries); both ends of a range are **included**
- `iloc` uses integer positions of rows and columns; the end of a range is **excluded**, as in ordinary Python slicing
- `ix` was an older indexer that accepted both labels and integer positions; it was deprecated and then removed in pandas 1.0

In [332]:
ufo = pd.read_csv('https://bit.ly/uforeports')

ufo.head()

In [333]:
ufo.loc[0:3, :]

In [334]:
ufo.loc[[0,2,4], 'City':'State']

In [335]:
ufo.iloc[0:3, 1:3]
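
With the default integer index, labels and positions look the same, which hides the difference between the two indexers. The sketch below uses the first four rows of the UFO data and then switches to a string index to make the inclusive and exclusive ranges visible.

```python
import pandas as pd

ufo = pd.DataFrame({
    'City': ['Ithaca', 'Willingboro', 'Holyoke', 'Abilene'],
    'State': ['NY', 'NJ', 'CO', 'KS'],
})

by_label = ufo.loc[0:2, 'City']     # labels 0, 1 and 2: three rows
by_position = ufo.iloc[0:2, 0]      # positions 0 and 1: two rows

by_state = ufo.set_index('State')
label_slice = by_state.loc['NJ':'CO', 'City']   # 'CO' is included
position_slice = by_state.iloc[1:2, 0]          # position 2 is excluded

print(label_slice.tolist(), position_slice.tolist())
```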

## Categories and ordered categories

For certain types of columns the data frame can be optimized by switching the type of a column (especially a column used in selection or grouping) into a category type.

In [336]:
drinks = pd.read_csv('https://bit.ly/drinksbycountry')

drinks.head()

In [337]:
drinks.info()

In [338]:
drinks.info(memory_usage='deep')

In [339]:
drinks.continent.memory_usage(deep=True)

In [340]:
drinks.continent

In [341]:
drinks.continent = drinks.continent.astype('category')

drinks.continent.memory_usage(deep=True)

In [342]:
drinks.continent

You can perform various operations on a category; just remember to access them via the `cat` accessor.

In [343]:
drinks.continent.cat.codes

In [344]:
drinks.continent.cat.as_ordered()

An ordered category can be used to sort rows in their logical order rather than alphabetically.

In [345]:
df = pd.DataFrame({
    'name': ['Mount Everest', 'Kilimanjaro', 'Rysy'],
    'height': ['very high', 'high', 'low']
})

df

In [346]:
df.sort_values('height')

In [347]:
from pandas.api.types import CategoricalDtype

heights = CategoricalDtype(categories=['low', 'high', 'very high'], ordered=True)

df['height'] = df.height.astype(heights) 

In [348]:
df.sort_values('height')
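
Besides sorting, an ordered category supports comparisons and `min()`/`max()`. A self-contained sketch with the same three mountains:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

heights = CategoricalDtype(categories=['low', 'high', 'very high'], ordered=True)

df = pd.DataFrame({
    'name': ['Mount Everest', 'Kilimanjaro', 'Rysy'],
    'height': pd.Series(['very high', 'high', 'low'], dtype=heights),
})

# Comparisons follow the declared category order, not the alphabet.
taller_than_low = df[df.height > 'low']
tallest = df.height.max()
sorted_names = df.sort_values('height').name.tolist()

print(sorted_names)
```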

## Exercise

1. Change the index of the `titanic` dataframe to the ticket number
2. Change the `Pclass` attribute into a category. 

## Creating binary variables from categorical columns

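Often in data mining we want to binarize categorical features. A common choice is one-hot encoding, where a feature with `n` values is turned into `n` binary columns; this is what `pd.get_dummies` produces by default. Dummy encoding drops one of these columns and keeps `n-1`, which is what `drop_first=True` does.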

In [349]:
ufo = pd.read_csv('https://bit.ly/uforeports')

ufo.head()

In [350]:
ufo.State.map({'NY': 'New York', 'NJ': 'New Jersey', 'CO': 'Colorado'})

In [351]:
pd.get_dummies(ufo.State)

In [352]:
pd.get_dummies(ufo.State, prefix='state')

In [353]:
pd.get_dummies(ufo.State, prefix='state').sum(axis=0)

In [354]:
pd.get_dummies(ufo.State, prefix='state').sum(axis=1)

In [355]:
df = pd.DataFrame({'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'N', 'N']})
df

In [356]:
pd.get_dummies(df.gender)

In [357]:
pd.get_dummies(df.gender, drop_first=True)
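
The difference between the two encodings is just the number of columns. A minimal check on a made-up series:

```python
import pandas as pd

gender = pd.Series(['M', 'F', 'N', 'F'], name='gender')

one_hot = pd.get_dummies(gender, prefix='gender')                  # n columns
dummy = pd.get_dummies(gender, prefix='gender', drop_first=True)   # n-1 columns

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```

With `drop_first=True` the first category (here `F`) becomes the baseline: it is represented by a row in which all remaining columns are zero.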

## Display options 

The way `pandas` dataframes are displayed inside a notebook can be modified by accessing display options. Two functions are handy for that:
- `get_option()`: check the current setting
- `set_option()`: modify the current setting

Let's change the following settings:
- the number of rows displayed
- the precision of floats
- the maximum width of a column

In [358]:
movies = pd.read_csv('https://bit.ly/imdbratings')

movies.head()

In [359]:
pd.set_option('display.max_rows', None)
pd.set_option('display.precision', 0)
pd.set_option('display.max_colwidth', 25)

In [360]:
movies

In [361]:
pd.reset_option('display.max_rows')
pd.reset_option('display.precision')
pd.reset_option('display.max_colwidth')
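
When a setting is needed only for a single cell, `pd.option_context` can be used as an alternative to the set/reset pairs: it restores the previous values automatically on leaving the `with` block. A minimal sketch:

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')

# Options are passed as alternating name/value pairs.
with pd.option_context('display.max_rows', 5, 'display.precision', 1):
    inside_rows = pd.get_option('display.max_rows')

after_rows = pd.get_option('display.max_rows')

print(inside_rows, after_rows == default_rows)
```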

## Applying functions to data frame columns

We can easily apply functions to data frame columns on the fly, or create new columns as the result of applying a function to an existing column. We've seen this behavior before. Two of the most common ways to do it are the `map()` function and the `apply()` function.

In [362]:
drinks = pd.read_csv('https://bit.ly/drinksbycountry')

drinks.head()

In [363]:
drinks.country.str.lower()

In [364]:
drinks.country.map(len)

In [365]:
pd.concat([drinks.country, drinks.country.map(len)], axis=1)

In [366]:
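# restrict to the numeric columns: the built-in max cannot compare text with numbers
drinks.select_dtypes('number').apply(max)

In [367]:
drinks.select_dtypes('number').apply(max, axis=1)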

In [368]:
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1)

In [369]:
def get_serious_drinkers(total_alcohol: float) -> bool:
    # the comparison already evaluates to True or False
    return total_alcohol >= 10.0

In [370]:
drinks['heavy_drinkers'] = drinks.total_litres_of_pure_alcohol.apply(get_serious_drinkers)

In [371]:
drinks.head()

In [372]:
drinks.total_litres_of_pure_alcohol.apply(lambda x: x >= 10)

Instead of writing a separate function, many people prefer to use an anonymous lambda function. What you see below is a very common pattern in `pandas` processing.

In [373]:
drinks['country_initial'] = drinks.country.apply(lambda x: x[0])

drinks.head(20)

Sometimes, we want to apply a min/max function to a set of columns and find which column produces the result. This can be achieved using the `idxmax` function.

In [374]:
drinks.loc[:, 'beer_servings':'wine_servings']

In [375]:
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1)

In [376]:
drinks.loc[:, 'beer_servings':'wine_servings'].idxmax(axis=1)
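
For simple element-wise logic, a vectorized expression usually gives the same result as `apply()` without calling a Python function for every element. The sketch below uses three rows taken from the `drinks` table shown earlier and also contrasts `max` with `idxmax`.

```python
import pandas as pd

drinks = pd.DataFrame({
    'country': ['Albania', 'Algeria', 'Andorra'],
    'beer_servings': [89, 25, 245],
    'spirit_servings': [132, 0, 138],
    'wine_servings': [54, 14, 312],
    'total_litres_of_pure_alcohol': [4.9, 0.7, 12.4],
})

via_apply = drinks.total_litres_of_pure_alcohol.apply(lambda x: x >= 10)
vectorized = drinks.total_litres_of_pure_alcohol >= 10   # same result, no lambda

servings = drinks.loc[:, 'beer_servings':'wine_servings']
row_max = servings.max(axis=1)        # the largest value in each row
favourite = servings.idxmax(axis=1)   # the column in which that value occurs

print(favourite.tolist())
```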

## Exercise

1. Create a new column which contains the age each of the passengers would have had today (the Titanic sank in 1912)
2. Create a new column with the string value *survived* or *died* for each passenger
3. Create a new column `Deck` containing the symbol of the deck on which the passenger was travelling (the first letter of the cabin number)

## Joining data frames

There are multiple methods to join data frames, but we will focus on only two methods and forget about the rest:
- `pd.concat`: joins data frames vertically or horizontally
- `pd.merge`: performs database-like inner, outer, left, and right-joins based on an index or a column

In [377]:
cities = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Poland', 'Poland', 'Russia', 'Russia'],
    'city': ['Berlin', 'Munich', 'Warsaw', 'Cracow', 'Moscow', 'St Petersburg'],
    'is_capital': [True, False, True, False, True, False]
})

banks = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Poland', 'France', 'France'],
    'name': ['Deutsche Bank', 'Commerzbank', 'Santander', 'Credit Agricole', 'BNP Paribas']
    
})

In [378]:
pd.concat([cities, banks], axis=0)

In [379]:
pd.concat([cities, banks], axis=1)

In [380]:
pd.merge(cities, banks)

In [381]:
banks.columns = ['country_name', 'bank_name']

# the two frames no longer share a column name, so this call raises a MergeError
pd.merge(cities, banks)

In [382]:
pd.merge(cities, banks, left_on='country', right_on='country_name')

In [383]:
cities.set_index('country', inplace=True)
banks.set_index('country_name', inplace=True)

pd.merge(cities, banks, left_index=True, right_index=True)

In [384]:
pd.merge(cities, banks, left_index=True, right_index=True, how='inner')

In [385]:
pd.merge(cities, banks, left_index=True, right_index=True, how='left')

In [386]:
pd.merge(cities, banks, left_index=True, right_index=True, how='right')

In [387]:
pd.merge(cities, banks, left_index=True, right_index=True, how='outer')
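
To see which side each row of a join comes from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A sketch on shortened versions of the two frames, here joined on a shared `country` column:

```python
import pandas as pd

cities = pd.DataFrame({
    'country': ['Germany', 'Poland', 'Russia'],
    'city': ['Berlin', 'Warsaw', 'Moscow'],
})
banks = pd.DataFrame({
    'country': ['Germany', 'France'],
    'name': ['Deutsche Bank', 'BNP Paribas'],
})

inner = pd.merge(cities, banks, on='country', how='inner')

# _merge takes the values 'both', 'left_only' and 'right_only'.
outer = pd.merge(cities, banks, on='country', how='outer', indicator=True)
origin = outer.set_index('country')['_merge'].astype(str).to_dict()

print(origin)
```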

## Using pipes for data processing

The usual Python way of performing a sequence of operations is to nest or chain function calls. However, this does not always result in the most readable code. A small library called `pipe` addresses this problem by borrowing the pipe-style chaining syntax known from R.

Before observing `pipe` in action, let us first analyze the behavior of traditional `map()` and `filter()` functions.

In [388]:
numbers = list(range(10))

even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
even_numbers

In [389]:
squares = list(map(lambda x: x**2, numbers))
squares

In [390]:
def square(x): return x**2
def is_even(x): return x % 2 == 0

squares_of_even_numbers = list(map(square, filter(is_even, numbers)))
squares_of_even_numbers

The same result can be achieved much more easily using pipes.

In [391]:
from pipe import where, select

list(numbers 
     | where(is_even)
     | select(square)
    )

The pipe operator `|` simply passes the output of one function as the input to the next function. The functions provided by the library include:
- `where`: keeps only those elements of the iterable which fulfill the condition
- `select`: applies a function to each element of the iterable
- `traverse`: recursively flattens nested iterables
- `groupby`: groups elements of an iterable 
- `dedup`: removes duplicates from an iterable
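
How can `|` work on a plain list? One way to obtain this behavior is Python's reflected-operator hook `__ror__`. The following is a simplified sketch of the idea, not the library's actual code; the class `Pipe` and the helpers `where` and `select` are re-implemented here only for illustration.

```python
class Pipe:
    """Wrap a function so that `iterable | Pipe(function)` calls function(iterable)."""

    def __init__(self, function):
        self.function = function

    def __ror__(self, iterable):
        # Python calls this for `iterable | self` when the left operand
        # (a list or a generator) does not implement `|` for this type.
        return self.function(iterable)


def where(predicate):
    # Lazily keep only the elements for which predicate(x) is true.
    return Pipe(lambda iterable: (x for x in iterable if predicate(x)))


def select(transform):
    # Lazily apply transform to every element.
    return Pipe(lambda iterable: (transform(x) for x in iterable))


numbers = list(range(10))
result = list(numbers | where(lambda x: x % 2 == 0) | select(lambda x: x ** 2))
print(result)  # [0, 4, 16, 36, 64]
```

Each stage returns a generator, so nothing is computed until the final `list(...)` consumes the chain.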

In [392]:
from pipe import dedup

numbers = [1, 2, 3, 4, 5] * 3

print(f"Before deduplication: {numbers}")
print(f"After deduplication: {list(numbers | dedup)}")

In [393]:
from pipe import traverse

nested_numbers = [1, 2, 3, [4, 5], [6, 7], 8, [9, 0]]

print(f"Unnested numbers: {list(nested_numbers | traverse)}")

In [394]:
from pipe import groupby

numbers = list(range(10))

even_odd_numbers = list(
    numbers 
    | groupby(lambda x: "even" if x % 2 == 0 else "odd") 
    | select(lambda x: {x[0]: list(x[1])})
    )
    
print(f"Even and odd numbers: {even_odd_numbers}")

## Exercise

Using pipes, perform the following queries:

1. List unique ages of women who survived the sinking
2. Compute the mean age of passengers for each port of embarkation
3. Create a list of titles (Mr., Mrs., etc.) and last names of passengers who died.