# `pandas`

In [1]:
%pylab inline
plt.style.use('ggplot')

Populating the interactive namespace from numpy and matplotlib


Note the import convention:

In [2]:
import numpy as np
import pandas as pd

In [3]:
np.random.seed(983456)

## Creating `pd.Series`

When creating Pandas `Series` you can provide values only:

In [4]:
s = pd.Series(np.random.randn(10))
s

0   -1.187327
1    0.382796
2   -0.681736
3   -3.534783
4    0.304866
5   -0.899330
6    1.194968
7   -0.446314
8    2.598223
9   -0.795144
dtype: float64

Values and series name:

In [5]:
s = pd.Series(np.random.randn(10), name="random_series")
s

0    2.433533
1   -0.677715
2    0.871098
3    0.128585
4   -1.042224
5    0.228245
6   -0.361624
7    0.447801
8    2.045056
9   -1.291771
Name: random_series, dtype: float64

Values, index and series name:

In [6]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=np.random.randint(23, size=(10,)))
s

10    0.496488
2     0.516279
20   -1.497689
0    -0.256733
1     0.149244
14   -2.084936
16    0.117294
19    1.779378
3     0.820993
3    -0.090797
Name: random_series, dtype: float64

Duplicate index labels are allowed (label 3 appears twice here). They prevent certain operations that require a unique index, but constructing and displaying such a series still works.

In [8]:
s.index

Int64Index([10, 2, 20, 0, 1, 14, 16, 19, 3, 3], dtype='int64')
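The remark about duplicate labels can be checked on a tiny example. This is a minimal sketch: `s_dup` and its values are made up for illustration, and `reindex` is used as just one example of an operation that needs unique labels.

```python
import pandas as pd

# Hypothetical small series with a repeated label (3 appears twice).
s_dup = pd.Series([0.8, -0.1, 1.5], index=[3, 3, 7], name="dup_series")

print(s_dup.index.is_unique)  # tells us whether every label occurs once

# Selecting a duplicated label returns every matching row as a Series,
# while a unique label returns a single scalar.
both_rows = s_dup.loc[3]
single_value = s_dup.loc[7]
print(both_rows)
print(single_value)

# reindex is one operation that requires unique labels.
try:
    s_dup.reindex([3, 7])
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print("reindex failed:", exc)
```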

An index can be created explicitly (and can have its own name):

In [9]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=pd.Index(np.random.randint(23, size=(10,)), name="main_index"))
s

main_index
5    -1.062838
10    0.704245
0    -0.596709
16   -0.244326
21   -0.862128
1    -0.532820
11   -0.364554
17   -2.594341
19   -0.014463
16   -0.012298
Name: random_series, dtype: float64

In [10]:
s.index

Int64Index([5, 10, 0, 16, 21, 1, 11, 17, 19, 16], dtype='int64', name='main_index')

In [11]:
s.to_frame().reset_index()

Unnamed: 0,main_index,random_series
0,5,-1.062838
1,10,0.704245
2,0,-0.596709
3,16,-0.244326
4,21,-0.862128
5,1,-0.53282
6,11,-0.364554
7,17,-2.594341
8,19,-0.014463
9,16,-0.012298


A series can be created from a dictionary as well:

In [12]:
s = pd.Series({'a':3, 'c':6, 'b':2}, name="dict_series")
s

a    3
c    6
b    2
Name: dict_series, dtype: int64

In [13]:
s.index

Index(['a', 'c', 'b'], dtype='object')

## Creating `pd.DataFrame`

Again, we can use just values and Pandas will create an index (both row and column) for us:

In [14]:
df = pd.DataFrame(np.arange(20).reshape((5,4)))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


Easy way to access data types in a dataframe:

In [15]:
df.dtypes

0    int32
1    int32
2    int32
3    int32
dtype: object

We can provide column names:

In [16]:
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'])
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


Values, index and column names:

In [19]:
import string
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

Unnamed: 0,a,b,c,d
s,0,1,2,3
m,4,5,6,7
u,8,9,10,11
t,12,13,14,15
f,16,17,18,19


In [20]:
df.columns

Index(['a', 'b', 'c', 'd'], dtype='object')

In [21]:
df.index

Index(['s', 'm', 'u', 't', 'f'], dtype='object')

Can you guess what `df['a']` means?

In [22]:
df['a']

s     0
m     4
u     8
t    12
f    16
Name: a, dtype: int32

Can we access a row in the same way?

In [None]:
df['s']  # No: this raises KeyError, because [] with a single label looks up columns, not rows

Each column is `pd.Series`:

In [26]:
type(df['a'])

pandas.core.series.Series

# Reading CSV files

We will use [Titanic dataset](https://www.kaggle.com/c/titanic/data):

In [27]:
titanic_train = pd.read_csv('train.csv')

By default, Pandas creates an integer row index and reads column names from the first (0-th) row of the CSV file:

In [28]:
titanic_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C



Glimpse into a dataframe:

In [29]:
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [30]:
titanic_train.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [31]:
titanic_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [32]:
titanic_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292



## Basic indexing of Pandas dataframes

We can set index column in `pd.read_csv`:

In [44]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')  # use PassengerId as the row index
titanic_test = pd.read_csv('test.csv', index_col='PassengerId')
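To see the effect of `index_col` without the Titanic files, here is a small sketch that reads the same kind of data from an in-memory string. The three-row CSV and the names `csv_text`, `default_df` and `indexed_df` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for train.csv.
csv_text = """PassengerId,Survived,Name
1,0,Braund
2,1,Cumings
3,1,Heikkinen
"""

# Default: column names come from the first line, rows get a 0..n-1 index.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.index.tolist())

# index_col promotes a column to the row index; it is no longer a data column.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(indexed_df.index.tolist())
print(indexed_df.columns.tolist())
```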

Accessing a single column:

In [45]:
titanic_train["Survived"]

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

A set of columns:

In [48]:
titanic_train[["Name", "Survived"]]  # a list of column names selects several columns

PassengerId,Name,Survived
1,"Braund, Mr. Owen Harris",0
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
3,"Heikkinen, Miss. Laina",1
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
5,"Allen, Mr. William Henry",0
...,...,...
887,"Montvila, Rev. Juozas",0
888,"Graham, Miss. Margaret Edith",1
889,"Johnston, Miss. Catherine Helen ""Carrie""",0
890,"Behr, Mr. Karl Howell",1


A side note on performance: column order usually does not matter. Pandas stores all columns of one dtype together in a single NumPy array, one array per dtype. Columns of different dtypes (such as `Name` and `Survived`) are therefore taken from different arrays, and the order in which you request them makes no difference. When the requested columns share a dtype and come from the same array, the order might matter.

In [49]:
%timeit titanic_train[["Name", "Survived"]]

831 µs ± 65.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [50]:
%timeit titanic_train[["Survived", "Name"]]

825 µs ± 104 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Integer indexing is also available with `[]` notation, but with some peculiarities:

In [51]:
titanic_train[2:4]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


But:

In [52]:
titanic_train[2]

KeyError: 2

`[]` may be ambiguous, and it's better to use it only for column access. If you want to use row labels, use `.loc`:

In [53]:
titanic_train.loc[2]  # the row labeled 2, not the row at position 2; its position is actually 1. For positions, use .iloc.

Survived                                                    1
Pclass                                                      1
Name        Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                    female
Age                                                      38.0
SibSp                                                       1
Parch                                                       0
Ticket                                               PC 17599
Fare                                                  71.2833
Cabin                                                     C85
Embarked                                                    C
Name: 2, dtype: object

In [None]:
titanic_train.head()

Note that `titanic_train.loc[...]` is label-based, not positional, even though the row labels are integers. The difference is more pronounced for non-monotonic indexes (both the default index and `PassengerId` are unique and monotonic).

Label-based slice (inclusive bounds):

In [54]:
titanic_train.loc[2:4]  # label-based slice

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


Positional slice (exclusive upper bound):

In [55]:
titanic_train[2:4]  # plain [] slicing is position-based

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
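The difference between the two kinds of slice is easier to see side by side. The sketch below uses a made-up five-row frame (`toy`) whose integer labels start at 1, like `PassengerId`, and uses `.iloc` for the positional slice to keep the intent explicit.

```python
import pandas as pd

# Hypothetical frame whose integer labels start at 1, like PassengerId.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 35.0]},
                   index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"))

by_label = toy.loc[2:4]      # labels 2, 3 and 4: both bounds included
by_position = toy.iloc[2:4]  # positions 2 and 3: upper bound excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

The same numbers `2:4` select three rows by label but only two rows by position.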


`.loc` indexing is very flexible and can combine row and column access in one call:

In [56]:
titanic_train.loc[2:4, "Age"]

PassengerId
2    38.0
3    26.0
4    35.0
Name: Age, dtype: float64

In [57]:
titanic_train.loc[2:4, ["Age"]]

PassengerId,Age
2,38.0
3,26.0
4,35.0


This one won't work:

In [58]:
titanic_train[2:4, ["Age"]]

TypeError: '(slice(2, 4, None), ['Age'])' is an invalid key

In [59]:
titanic_train.loc[2:10:2, ["Age"]]

PassengerId,Age
2,38.0
4,35.0
6,
8,2.0
10,14.0


In [None]:
titanic_train.loc[titanic_train["Age"] < 5, ["Name", "Pclass"]]  # SQL analogue: SELECT Name, Pclass FROM titanic_train WHERE Age < 5

In [60]:
titanic_train.loc[(titanic_train["Age"] < 5) & (titanic_train.Pclass == 2), "Name"]

PassengerId
44     Laroche, Miss. Simonne Marie Anne Andree
79                Caldwell, Master. Alden Gates
184                   Becker, Master. Richard F
194                  Navratil, Master. Michel M
341              Navratil, Master. Edmond Roger
408              Richards, Master. William Rowe
531                    Quick, Miss. Phyllis May
619                 Becker, Miss. Marion Louise
751                           Wells, Miss. Joan
756                   Hamalainen, Master. Viljo
828                       Mallet, Master. Andre
832             Richards, Master. George Sibley
Name: Name, dtype: object

This won't work:

In [61]:
titanic_train.loc[titanic_train["Age"] < 5 & titanic_train.Pclass == 2, "Name"]  # missing parentheses: & binds tighter than < and ==

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [62]:
titanic_train["Age"] < 5 & titanic_train.Pclass

PassengerId
1      False
2      False
3      False
4      False
5      False
       ...  
887    False
888    False
889    False
890    False
891    False
Length: 891, dtype: bool

In [63]:
titanic_train["Age"] < 5 & titanic_train.Pclass == 2

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
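The errors above come from Python's operator precedence: `&` binds tighter than `<` and `==`, so the unparenthesized expression becomes a chained comparison that needs the truth value of a whole Series. A minimal sketch on a made-up frame (`people` is hypothetical):

```python
import pandas as pd

# Hypothetical four-row frame with the two columns used in the filter.
people = pd.DataFrame({"Age": [3, 40, 4, 2], "Pclass": [2, 2, 3, 2]})

# With parentheses: two boolean masks combined element-wise.
mask = (people["Age"] < 5) & (people["Pclass"] == 2)
print(mask.tolist())

# Without parentheses Python parses  Age < (5 & Pclass) == 2,
# a chained comparison that calls bool() on a Series and therefore raises.
try:
    people["Age"] < 5 & people["Pclass"] == 2
    raised = False
except ValueError:
    raised = True
print(raised)
```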

`.iloc`, in contrast, is explicitly positional and can combine both row and column positions (and upper bounds are always exclusive):

In [64]:
titanic_train.iloc[:2, 3]  # Note resulting series name: Pandas preserves column name

PassengerId
1      male
2    female
Name: Sex, dtype: object

In [66]:
titanic_train.iloc[:2, 3:6]

PassengerId,Sex,Age,SibSp
1,male,22.0,1
2,female,38.0,1


You cannot mix positional and label-based indexing:

In [67]:
titanic_train.iloc[:2, "Name"]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

But you can still filter with a boolean mask:

In [68]:
titanic_train.iloc[(titanic_train.Age < 10).values, 2]  # titanic_train.iloc[titanic_train.Age < 10, 2] won't work

PassengerId
8                Palsson, Master. Gosta Leonard
11              Sandstrom, Miss. Marguerite Rut
17                         Rice, Master. Eugene
25                Palsson, Miss. Torborg Danira
44     Laroche, Miss. Simonne Marie Anne Andree
                         ...                   
828                       Mallet, Master. Andre
832             Richards, Master. George Sibley
851     Andersson, Master. Sigvard Harald Elias
853                     Boulos, Miss. Nourelain
870             Johnson, Master. Harold Theodor
Name: Name, Length: 62, dtype: object
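As a small sketch of the same idea on made-up data (`toy` is hypothetical): `.iloc` takes the mask as a plain boolean array, here obtained with `to_numpy()`, while `.loc` accepts the boolean Series directly together with a column label.

```python
import pandas as pd

# Hypothetical three-row frame.
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [8.0, 40.0, 3.0]},
                   index=pd.Index([11, 12, 13], name="PassengerId"))

mask = toy["Age"] < 10  # boolean Series that carries the index labels

# .iloc wants a plain boolean array, so strip the index first.
names_by_position = toy.iloc[mask.to_numpy(), 0]

# .loc accepts the Series itself and takes the column by label.
names_by_label = toy.loc[mask, "Name"]

print(names_by_position.tolist())
print(names_by_label.tolist())
```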

## Performance

But how is an index useful? (Note the filtering notation.)

In [69]:
titanic_train = pd.read_csv('train.csv')

In [70]:
%timeit titanic_train[titanic_train.PassengerId==400]

623 µs ± 59.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [71]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')

In [72]:
%timeit titanic_train.loc[400]  # an index lookup via .loc is generally faster than a boolean filter; the gap is modest here but tends to grow with larger dataframes

257 µs ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
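The comparison can be repeated outside the notebook with the standard `timeit` module. This is only a sketch on synthetic data (`flat`, `indexed` and the row count are assumptions); the actual timings depend on the machine and the Pandas version, so none are quoted here.

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic data: an id column plus one value column.
n_rows = 100_000
flat = pd.DataFrame({"id": np.arange(n_rows), "value": np.arange(n_rows) * 2})
indexed = flat.set_index("id")

key = 400
by_mask = flat[flat["id"] == key]  # compares every id against the key
by_index = indexed.loc[key]        # looks the label up in the index

t_mask = timeit.timeit(lambda: flat[flat["id"] == key], number=100)
t_index = timeit.timeit(lambda: indexed.loc[key], number=100)
print(f"boolean filter: {t_mask:.4f} s, index lookup: {t_index:.4f} s")
```

Both expressions find the same row; only the way of locating it differs.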


## Combining dataframes

In [73]:
pd.concat([titanic_train, titanic_test], ignore_index=True)  # ignore_index=True discards the PassengerId labels and builds a fresh 0..n-1 index

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
1304,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1305,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
1306,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
1307,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


Note how Pandas filled the `Survived` column with NaN for the test rows (the column is not even present in `titanic_test`!). Pandas takes the union of the columns and fills missing values with NaN.

A better way to combine dataframes when the index has actual meaning (as it does here, since both are indexed by `PassengerId`):

In [74]:
titanic = pd.concat([titanic_train, titanic_test])

In [75]:
titanic

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S
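The behaviour of `pd.concat` described above can be reproduced on two tiny frames. The frames `train` and `test` below are made up; `verify_integrity=True` is shown as one optional way to catch overlapping labels when the index is meaningful.

```python
import pandas as pd

# Hypothetical stand-ins for the two Titanic files.
train = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]},
                     index=pd.Index([1, 2], name="PassengerId"))
test = pd.DataFrame({"Age": [34.5, 47.0]},
                    index=pd.Index([3, 4], name="PassengerId"))

# Columns are unioned; Survived is absent from `test`, so it is NaN there.
kept = pd.concat([train, test])
print(kept)

# ignore_index=True throws the PassengerId labels away.
renumbered = pd.concat([train, test], ignore_index=True)
print(renumbered.index.tolist())

# verify_integrity=True raises if the combined index has duplicate labels.
try:
    pd.concat([train, train], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)
```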


# Indexing `pd.Series` in depth

In [76]:
np.random.seed(983456)

N_ELEMS = 20

s = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
              index=list(string.ascii_lowercase)[:N_ELEMS],
              name='randint_series')
s

a     8
b    17
c    10
d    10
e    12
f    13
g    10
h    12
i     4
j    13
k    16
l     5
m    10
n    10
o     4
p     1
q    10
r     6
s     3
t    17
Name: randint_series, dtype: int32

## Indexing with `[]`

In [77]:
s

a     8
b    17
c    10
d    10
e    12
f    13
g    10
h    12
i     4
j    13
k    16
l     5
m    10
n    10
o     4
p     1
q    10
r     6
s     3
t    17
Name: randint_series, dtype: int32

In [78]:
s['i']  # Caveat: this returns a single element here, but a Series if the label is duplicated

4

In [79]:
s[['i']]

i    4
Name: randint_series, dtype: int32

Label slicing does not work the way you might expect (both bounds are inclusive):

In [81]:
s['a':'f']  # unlike NumPy slices, the right bound 'f' is included

a     8
b    17
c    10
d    10
e    12
f    13
Name: randint_series, dtype: int32

Indexing with a list of labels works as well, in any order we want:

In [82]:
s[['k', 'q', 'a', 'r']]

k    16
q    10
a     8
r     6
Name: randint_series, dtype: int32

In [83]:
s.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
       'o', 'p', 'q', 'r', 's', 't'],
      dtype='object')

Note that positional slicing works as well:

In [84]:
s[0:5]

a     8
b    17
c    10
d    10
e    12
Name: randint_series, dtype: int32

In [86]:
s[5:3:-1]

f    13
e    12
Name: randint_series, dtype: int32

## Indexing with `.loc`

In [87]:
np.random.seed(983456)

s_int_idx = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
                      index=np.random.choice(N_ELEMS, N_ELEMS, replace=False),
                      name='randint_series')
s_int_idx

16     8
2     17
4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
3      1
9     10
15     6
13     3
1     17
Name: randint_series, dtype: int32

We have an integer index. What if we use slicing here? Will it be positional or will it use the index labels?

In [88]:
s_int_idx[2:15]

4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
Name: randint_series, dtype: int32

Surprising: the slice was positional. But that's the way Pandas works and you'll love it over time (its API is strongly tailored to the most common operations, making them more concise).

In [89]:
s_int_idx

16     8
2     17
4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
3      1
9     10
15     6
13     3
1     17
Name: randint_series, dtype: int32

In [90]:
s_int_idx[2]  # label

17

In [91]:
s_int_idx[2:5]  # position

4     10
19    10
14    12
Name: randint_series, dtype: int32

Boolean mask? Sure.

In [92]:
s_int_idx[s_int_idx.index.isin(range(2,6))]

2    17
4    10
5     5
3     1
Name: randint_series, dtype: int32

In [93]:
s_int_idx

16     8
2     17
4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
3      1
9     10
15     6
13     3
1     17
Name: randint_series, dtype: int32

Again, `[]` may often be ambiguous. Use `.loc` or `.iloc` to make your code readable and clean:

In [94]:
s_int_idx.loc[2:15]  # label-based: everything from label 2 through label 15 in the current row order, not the labels 2, 3, 4, ..., 15

2     17
4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
3      1
9     10
15     6
Name: randint_series, dtype: int32

In [95]:
s_int_idx.iloc[2:15]  # position

4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
Name: randint_series, dtype: int32

What if we take an upper bound that is not in the index? In general it won't work:

In [96]:
s_int_idx.loc[2:456]

KeyError: 456

Because of this:

In [97]:
s_int_idx.index.is_monotonic_increasing  # False: the index is not sorted

False

But we can make it work (or rather you now know when it works and when it doesn't):

In [98]:
s_int_idx.sort_index().loc[2:234]  # On a sorted index, an upper bound beyond the largest label returns everything up to the end instead of failing.
                                #  This is useful in many cases: for example, when slicing by timestamp we do not
                                #  need to know the exact bound to the second.

2     17
3      1
4     10
5      5
6     10
7     16
8     12
9     10
10    10
11    10
12     4
13     3
14    12
15     6
16     8
17     4
18    13
19    10
Name: randint_series, dtype: int32

Because:

In [100]:
s_int_idx.sort_index().index.is_monotonic_increasing

True
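The whole experiment fits into a few lines on a smaller series. This is a sketch with made-up values (`unsorted` is hypothetical), using `is_monotonic_increasing` to check the ordering.

```python
import pandas as pd

# Hypothetical series with an unsorted integer index.
unsorted = pd.Series([10, 20, 30, 40], index=[16, 2, 19, 5])

print(unsorted.index.is_monotonic_increasing)

# On an unsorted index both slice bounds must be existing labels.
try:
    unsorted.loc[2:456]
    missing_bound_failed = False
except KeyError:
    missing_bound_failed = True
print(missing_bound_failed)

# After sorting, a bound that is not in the index simply marks a cut-off.
in_range = unsorted.sort_index().loc[3:456]
print(in_range)
```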

We'll see why this works a bit later. We can do complex filtering/masking/boolean indexing as well:

In [101]:
s_int_idx[s_int_idx.index!=11]

16     8
2     17
4     10
19    10
14    12
0     13
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
3      1
9     10
15     6
13     3
1     17
Name: randint_series, dtype: int32

In [102]:
s_int_idx[(s_int_idx>15) | (s_int_idx<5)]

2     17
12     4
7     16
17     4
3      1
13     3
1     17
Name: randint_series, dtype: int32

In [103]:
s_int_idx.loc[s_int_idx!=14]  # compares values, not labels; no value equals 14, so every row is kept

16     8
2     17
4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
3      1
9     10
15     6
13     3
1     17
Name: randint_series, dtype: int32

# Indexing `pd.DataFrame`

In [99]:
np.random.seed(983456)

df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['d', 'c', 'b', 'a'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


Ok, so `[]` (without `loc` or `iloc`) probably is positional?

In [104]:
df[2:5]

Unnamed: 0,d,c,b,a
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [106]:
df['x'] # Nope, it doesn't work that way

KeyError: 'x'

But here's the thing: **the same** `[]` notation works differently if you're using column labels:

In [107]:
df['a']

o     3
l     7
x    11
g    15
d    19
Name: a, dtype: int32

In [108]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


Note that this one returns a dataframe:

In [109]:
df[['b']]

Unnamed: 0,b
o,2
l,6
x,10
g,14
d,18


... and this one returns `pd.Series`:

In [110]:
df['b']

o     2
l     6
x    10
g    14
d    18
Name: b, dtype: int32

In [111]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [112]:
df.columns

Index(['d', 'c', 'b', 'a'], dtype='object')

In [113]:
df.columns[2:]

Index(['b', 'a'], dtype='object')

In [114]:
df[df.columns[2:]]  # equivalent to df.iloc[:, 2:]

Unnamed: 0,b,a
o,2,3
l,6,7
x,10,11
g,14,15
d,18,19


In [115]:
df.iloc[:, 2:] 

Unnamed: 0,b,a
o,2,3
l,6,7
x,10,11
g,14,15
d,18,19


In [116]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


So, `[]` is positional. Is it?

In [117]:
df[:'g'] # Surprising!

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15


In [118]:
df['a':'u']  # Not really surprising: 'a' is not a row label and the index is not sorted, so the bound cannot be located

KeyError: 'a'

In [119]:
df.sort_index()['a':'u']  # this one works, because the sorted index is monotonic

Unnamed: 0,d,c,b,a
d,16,17,18,19
g,12,13,14,15
l,4,5,6,7
o,0,1,2,3


But neither `a` nor `u` is even in the row index!

In [120]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [121]:
df['d':]  # slices rows, like above: everything from row label 'd' onward

Unnamed: 0,d,c,b,a
d,16,17,18,19


In [122]:
df["x":]

Unnamed: 0,d,c,b,a
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


But:

In [123]:
df.sort_index()['d':]  # the index is sorted here, so we get 'd' and every label after it

Unnamed: 0,d,c,b,a
d,16,17,18,19
g,12,13,14,15
l,4,5,6,7
o,0,1,2,3
x,8,9,10,11


In [124]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [125]:
df['k':'z'] # No, that won't work

KeyError: 'k'

In [126]:
df.sort_index()['k':'zjyyf']

Unnamed: 0,d,c,b,a
l,4,5,6,7
o,0,1,2,3
x,8,9,10,11


Conceptually, Pandas relies on the ordering of the index labels, which we can visualize by ranking them:

In [127]:
df.index.to_series().rank()  # rank of each label in sorted order; on the unsorted index the ranks are out of order

o    4.0
l    3.0
x    5.0
g    2.0
d    1.0
dtype: float64

If the index is monotonic, slice bounds do not need to be present in the index:

In [128]:
df.sort_index().index.to_series().rank()

d    1.0
g    2.0
l    3.0
o    4.0
x    5.0
dtype: float64

In [129]:
df.sort_index()['b':'m']  # Pandas can unambiguously set 'b' to be less than 'd' and 'm' to be between 'l' and 'o'

Unnamed: 0,d,c,b,a
d,16,17,18,19
g,12,13,14,15
l,4,5,6,7
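One way to picture how a missing bound is placed is to ask where it would be inserted to keep the labels sorted. The sketch below uses `Index.searchsorted` and `Index.slice_indexer` on a hand-built sorted index; it is an illustration of the idea, not a claim about how Pandas implements slicing internally.

```python
import pandas as pd

# The sorted row labels from the example above.
sorted_index = pd.Index(["d", "g", "l", "o", "x"])

# Where would the missing labels go to keep the index sorted?
left_pos = sorted_index.searchsorted("b")                 # 'b' sorts before 'd'
right_pos = sorted_index.searchsorted("m", side="right")  # 'm' sorts between 'l' and 'o'
print(left_pos, right_pos)

# slice_indexer turns the two label bounds into a positional slice.
positions = sorted_index.slice_indexer("b", "m")
selected = sorted_index[positions].tolist()
print(positions)
print(selected)
```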


In [131]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [134]:
df.loc['o':'x', 'c'] = 5  # a single df.loc.__setitem__(...) call
df

Unnamed: 0,d,c,b,a
o,0,5,2,3
l,4,5,6,7
x,8,5,10,11
g,12,13,14,15
d,16,17,18,19


In [133]:
df_sub = df['o':'x']
df_sub['c'] = 5  # Not a very good idea.
# This chains df.__getitem__(...) on the first line with df_sub.__setitem__(...) on the second line.
# Pandas warns because whether df_sub is a view or a copy of df can vary, so it is unclear whether df gets modified.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [None]:
df[(df['a']>12) | (df['b']<3)]

In [None]:
df

## Indexing with `.loc`

A general rule (for readability and to avoid subtle bugs):

- use `[]` when accessing columns by label,
- use `.loc` when accessing both rows and columns by label,
- use `.iloc` for positional indexing.
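The three rules above can be sketched on a small frame. The frame `frame` is made up for illustration and mirrors the column order used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame with the same column order as above.
frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     columns=["d", "c", "b", "a"],
                     index=["o", "l", "x"])

col_b = frame["b"]                      # [] for a column by label
cell = frame.loc["l", "b"]              # .loc for row and column labels
block = frame.loc["o":"l", ["d", "a"]]  # label slice (inclusive) plus a column list
same_cell = frame.iloc[1, 2]            # .iloc for positions only

print(col_b.tolist())
print(cell, same_cell)
print(block)
```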

In [135]:
np.random.seed(983456)

df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['d', 'c', 'b', 'a'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [137]:
df.loc['o']

d    0
c    1
b    2
a    3
Name: o, dtype: int32

In [138]:
df.loc['o', 'b']

2

In [139]:
df.loc['o':, 'b']

o     2
l     6
x    10
g    14
d    18
Name: b, dtype: int32

In [140]:
df.loc['g':, 'b':]

Unnamed: 0,b,a
g,14,15
d,18,19


In [141]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [143]:
df.loc['g':, 'c':'d']  # no columns: 'c' comes after 'd' in the column order, so the slice 'c':'d' is empty

g
d


In [None]:
df

Column index is still an index and works in a similar manner.

In [None]:
df.columns.to_series().rank()

In [None]:
df.loc['x':, 'a'::-2]

In [None]:
df.loc['x':, 'c':'d']

In [None]:
df.sort_index(axis='columns').loc['x':, 'c':'d']

In [None]:
df.loc[:, ["a", "b"]]

`.loc` can contain a mask (Pandas will align it for you):

In [None]:
df.loc[df['c']>10, 'c']  # SQL: SELECT c FROM df WHERE c > 10

In [None]:
df.loc[:, df.columns[2:]]

In [None]:
df.loc[[1,2], 'c']  # This won't work: .loc is label-based, and 1 and 2 are positions, not row labels

## `SettingWithCopyWarning`

In [144]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


Each indexing operation returns either a copy of the dataframe or a view into it and, in contrast to NumPy, Pandas gives no guarantee as to which.

In [None]:
df.loc[df['a']>10, 'c']

In [None]:
df.__setitem__?

An assignment like this works the same way as in NumPy and modifies the original dataframe (under the hood it is a single `__setitem__` call on `df.loc`):

In [None]:
df.loc[df['a']>10, 'c'] = 10

In [None]:
df

This one, however, contains two chained `__getitem__` calls:

In [None]:
df.loc[df['a']>10]['c']

The following assignment generates a warning (it's unknown if `df.loc[df['a']>10]` is a view or a copy):

In [None]:
df.loc[df['a']>10]['c'] = 20.

In [None]:
df

Let's decompose it:

In [None]:
df_1 = df.loc[df['a']>10]

In [None]:
df_1

In [None]:
df_1['c'] = 25

In [None]:
df_1

In [None]:
df
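A pattern that avoids the ambiguity is sketched below, under the assumption that you either want to modify the original frame (one `.loc` assignment) or want an independent subset (an explicit `.copy()`). The frame `frame` is a fresh stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Fresh stand-in for df with the same shape and labels.
frame = pd.DataFrame(np.arange(20).reshape(5, 4),
                     columns=["d", "c", "b", "a"],
                     index=["o", "l", "x", "g", "d"])

# One .loc call selects rows and column and assigns in place.
frame.loc[frame["a"] > 10, "c"] = 10
print(frame["c"].tolist())

# For an independent subset, ask for a copy explicitly; edits to it
# do not reach the original frame.
subset = frame.loc[frame["a"] > 10].copy()
subset["c"] = 25
print(subset["c"].tolist())
print(frame["c"].tolist())
```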

# Dataframe arithmetic

In [145]:
df_1 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'b', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_1

Unnamed: 0,a,b,c,d
r,0,1,2,3
s,4,5,6,7
w,8,9,10,11
m,12,13,14,15
t,16,17,18,19
e,20,21,22,23
h,24,25,26,27
o,28,29,30,31
a,32,33,34,35
n,36,37,38,39


In [146]:
df_2 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'e', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_2

Unnamed: 0,a,e,c,d
w,0,1,2,3
u,4,5,6,7
g,8,9,10,11
y,12,13,14,15
p,16,17,18,19
i,20,21,22,23
n,24,25,26,27
l,28,29,30,31
r,32,33,34,35
v,36,37,38,39


In [147]:
# A lot of missing values
df_1 + df_2

Unnamed: 0,a,b,c,d,e
a,,,,,
e,,,,,
g,,,,,
h,,,,,
i,,,,,
l,,,,,
m,,,,,
n,60.0,,64.0,66.0,
o,,,,,
p,,,,,


Values appear only where both the row label and the column label exist in both dataframes; everything else is `NaN`.

We can provide a fill value for missing **operands**:

In [148]:
df_1.add(df_2, fill_value=0)

Unnamed: 0,a,b,c,d,e
a,32.0,33.0,34.0,35.0,
e,20.0,21.0,22.0,23.0,
g,8.0,,10.0,11.0,9.0
h,24.0,25.0,26.0,27.0,
i,20.0,,22.0,23.0,21.0
l,28.0,,30.0,31.0,29.0
m,12.0,13.0,14.0,15.0,
n,60.0,37.0,64.0,66.0,25.0
o,28.0,29.0,30.0,31.0,
p,16.0,,18.0,19.0,17.0
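
A smaller example makes the effect of `fill_value` easier to follow. The two frames below are made up for illustration; note that a cell missing from both operands stays `NaN` even with a fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical frames that share only row 'y' and column 'a'.
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'a': [10, 20], 'e': [30, 40]}, index=['y', 'z'])

# Plain addition: a value only where both frames have the cell.
plain = left + right

# fill_value=0 substitutes 0 for an operand missing on one side only.
filled = left.add(right, fill_value=0)
```

Here `plain` has a single non-missing cell (`y`, `a`), while `filled` is missing only the cells that neither frame defines, such as (`x`, `e`).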


Operations between dataframes and series align the series index with the dataframe's columns by default:

In [149]:
s_1 = pd.Series(np.arange(10),
                name='f',
                index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))

In [150]:
s_1

p    0
m    1
a    2
j    3
i    4
x    5
g    6
n    7
f    8
h    9
Name: f, dtype: int32

In [151]:
df_1

Unnamed: 0,a,b,c,d
r,0,1,2,3
s,4,5,6,7
w,8,9,10,11
m,12,13,14,15
t,16,17,18,19
e,20,21,22,23
h,24,25,26,27
o,28,29,30,31
a,32,33,34,35
n,36,37,38,39


In [152]:
# A Series is one-dimensional, so it could act as a row or as a column.
# By default Pandas treats it as a row: its index is matched against the column labels.
df_1 + s_1

Unnamed: 0,a,b,c,d,f,g,h,i,j,m,n,p,x
r,2.0,,,,,,,,,,,,
s,6.0,,,,,,,,,,,,
w,10.0,,,,,,,,,,,,
m,14.0,,,,,,,,,,,,
t,18.0,,,,,,,,,,,,
e,22.0,,,,,,,,,,,,
h,26.0,,,,,,,,,,,,
o,30.0,,,,,,,,,,,,
a,34.0,,,,,,,,,,,,
n,38.0,,,,,,,,,,,,


In [None]:
s_1 + df_1

The default can be changed:

In [153]:
df_1.add(s_1, axis='index')

Unnamed: 0,a,b,c,d
a,34.0,35.0,36.0,37.0
e,,,,
f,,,,
g,,,,
h,33.0,34.0,35.0,36.0
i,,,,
j,,,,
m,13.0,14.0,15.0,16.0
n,43.0,44.0,45.0,46.0
o,,,,


Such an alignment (along columns) allows many common operations to be written in a short form. For example, to standardize each column, we just do

In [155]:
(df_1 - df_1.mean()) / df_1.std()
# To standardize each column, we subtract the series of column means from the dataframe and divide by the series of column standard deviations.
# This is usually what we want, because in most data sets the columns are the features we analyze.

Unnamed: 0,a,b,c,d
r,-1.486301,-1.486301,-1.486301,-1.486301
s,-1.156012,-1.156012,-1.156012,-1.156012
w,-0.825723,-0.825723,-0.825723,-0.825723
m,-0.495434,-0.495434,-0.495434,-0.495434
t,-0.165145,-0.165145,-0.165145,-0.165145
e,0.165145,0.165145,0.165145,0.165145
h,0.495434,0.495434,0.495434,0.495434
o,0.825723,0.825723,0.825723,0.825723
a,1.156012,1.156012,1.156012,1.156012
n,1.486301,1.486301,1.486301,1.486301


In [156]:
df_1.mean()

a    18.0
b    19.0
c    20.0
d    21.0
dtype: float64
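
Standardizing each row instead needs the alignment switched to the index, because the row statistics are indexed by row label. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df_rows = pd.DataFrame({'a': [1.0, 10.0], 'b': [2.0, 20.0], 'c': [3.0, 30.0]},
                       index=['x', 'y'])

# Statistics along each row (axis=1) are Series indexed by row label.
row_mean = df_rows.mean(axis=1)
row_std = df_rows.std(axis=1)

# axis='index' matches those Series against the rows rather than the columns.
row_scaled = df_rows.sub(row_mean, axis='index').div(row_std, axis='index')
```

Writing `(df_rows - row_mean) / row_std` would try to match the row labels against the column labels and produce only `NaN`.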

# Applying functions to dataframes

In [157]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


`apply` is the main method for applying a function over rows or columns:

In [158]:
df.apply(lambda row: np.sqrt(row.d), axis=1) #we are calculating on column d

o    0.000000
l    2.000000
x    2.828427
g    3.464102
d    4.000000
dtype: float64

Applying the function to the single column is faster, though:

In [160]:
%timeit df['d'].apply(lambda x: np.sqrt(x))

345 µs ± 35.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Pandas lets you use NumPy functions directly:

In [161]:
np.sqrt(df['d'])

o    0.000000
l    2.000000
x    2.828427
g    3.464102
d    4.000000
Name: d, dtype: float64

This is faster still:

In [162]:
%timeit np.sqrt(df['d'])

144 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [163]:
# Even better: work on the underlying NumPy array, which skips the Series overhead
%timeit np.sqrt(df['d'].values)

7.9 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [164]:
np.sqrt(df['d'].values)

array([0.        , 2.        , 2.82842712, 3.46410162, 4.        ])

Often replacing `apply` altogether is the best option:

In [None]:
df_copy = df.copy()
df_copy["d_sqrt"] = np.sqrt(df['d'].values)

Note that we can often mix Pandas and NumPy:

In [166]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [167]:
df.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [168]:
df.values.sum(axis=1)

array([ 6, 22, 38, 54, 70])

In [169]:
df.apply(lambda x: x.sum(), axis=1)

o     6
l    22
x    38
g    54
d    70
dtype: int64

In [170]:
np.sum(df, axis=1)

o     6
l    22
x    38
g    54
d    70
dtype: int64

In [171]:
df.sum(axis=1)

o     6
l    22
x    38
g    54
d    70
dtype: int64

When the applied function returns a `Series`, Pandas combines the results into a dataframe with one column per key:

In [172]:
dfm = df.apply(lambda x: pd.Series({'sum': x.sum(),
                                    'sqrt': np.sqrt(x['d'])}),
               axis=1)

In [173]:
dfm

Unnamed: 0,sum,sqrt
o,6.0,0.0
l,22.0,2.0
x,38.0,2.828427
g,54.0,3.464102
d,70.0,4.0


# Dataframe statistics

In [174]:
titanic['Pclass'].value_counts()

3    709
1    323
2    277
Name: Pclass, dtype: int64

In [175]:
titanic.SibSp.value_counts()

0    891
1    319
2     42
4     22
3     20
8      9
5      6
Name: SibSp, dtype: int64

In [176]:
titanic.Embarked.value_counts() # S = Southampton, C = Cherbourg, Q = Queenstown

S    914
C    270
Q    123
Name: Embarked, dtype: int64

In [None]:
titanic.Sex.value_counts()

In [177]:
print("Average age: %2.2f" % titanic['Age'].mean())
print("STD of age: %2.2f" % titanic['Age'].std())
print("Minimum age: %2.2f" % titanic['Age'].min())
print("Maximum age: %2.2f" % titanic['Age'].max())

Average age: 29.88
STD of age: 14.41
Minimum age: 0.17
Maximum age: 80.00


In [178]:
print("Average number of siblings/spouse: %2.2f" % titanic['SibSp'].mean())
print("Average number of siblings/spouse in class 1: %2.2f" % titanic.loc[titanic.Pclass==1, 'SibSp'].mean())
print("Average number of siblings/spouse in class 2: %2.2f" % titanic.loc[titanic.Pclass==2, 'SibSp'].mean())
print("Average number of siblings/spouse in class 3: %2.2f" % titanic.loc[titanic.Pclass==3, 'SibSp'].mean())

Average number of siblings/spouse: 0.50
Average number of siblings/spouse in class 1: 0.44
Average number of siblings/spouse in class 2: 0.39
Average number of siblings/spouse in class 3: 0.57
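
Per-group statistics like these can also be computed in one step with `groupby`. The sketch below uses a tiny made-up table with the same column names (an assumption for illustration, not the Titanic data) and checks the result against the explicit `.loc` filters.

```python
import pandas as pd

# Hypothetical miniature of the passenger table.
passengers = pd.DataFrame({'Pclass': [1, 1, 2, 3, 3, 3],
                           'SibSp':  [1, 0, 2, 0, 1, 5]})

# One filter per class, as in the cell above.
per_class_loc = {c: passengers.loc[passengers.Pclass == c, 'SibSp'].mean()
                 for c in (1, 2, 3)}

# The same numbers from a single groupby.
per_class = passengers.groupby('Pclass')['SibSp'].mean()
```

`per_class` is a Series indexed by class, so it extends to any number of classes without extra code.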


In [179]:
print("Minimum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].min())
print("Maximum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].max())

Minimum age (not survived): 1.00
Maximum age (not survived): 74.00


In [180]:
print("Minimum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].min())
print("Maximum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].max())

Minimum age (survived): 0.42
Maximum age (survived): 80.00


# Replacing and renaming

In [181]:
titanic.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [182]:
titanic.replace(22, 122).head() # replace returns a new dataframe; rerunning titanic shows the original is unchanged. Assign the result back (or pass inplace=True) to keep the change

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,122.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [185]:
import re # %pylab may already provide it, but an explicit import is safer
titanic.replace(re.compile(r'\(.*\)'), '').head()
# The regex matches a pair of parentheses together with everything between them; the match is replaced with an empty string


Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [186]:
titanic.rename(lambda x: x.lower(), axis=1).head() #renames the columns easily to lower case

Unnamed: 0_level_0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [187]:
titanic.rename({'SibSp':'siblings_spouses'}, axis=1).head() #rename a specific column using a dictionary

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,siblings_spouses,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
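
Both `replace` and `rename` return new dataframes, and a bare `replace(22, 122)` touches every column that contains 22. A short sketch on made-up data showing a column-restricted replacement and how to keep the results:

```python
import pandas as pd

# Hypothetical miniature table; note that 22 occurs in both columns.
people = pd.DataFrame({'Age': [22.0, 38.0], 'Fare': [22.0, 71.28], 'SibSp': [1, 0]})

replaced_everywhere = people.replace(22, 122)             # hits Age and Fare
replaced_age_only = people.replace({'Age': {22: 122}})    # restricted to one column

# Nothing above changed `people`; assign the result to keep a change.
renamed = people.rename(columns={'SibSp': 'siblings_spouses'})
```

`rename(columns=...)` is equivalent to passing the mapping with `axis=1`.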


# String operations

In [None]:
titanic.head()

In [195]:
titanic.replace(re.compile(r'\(.*\)'), '').Name.str.split(",", expand=True) # expand=True puts the parts in separate columns

Unnamed: 0_level_0,0,1
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Braund,Mr. Owen Harris
2,Cumings,Mrs. John Bradley
3,Heikkinen,Miss. Laina
4,Futrelle,Mrs. Jacques Heath
5,Allen,Mr. William Henry
...,...,...
1305,Spector,Mr. Woolf
1306,Oliva y Ocana,Dona. Fermina
1307,Saether,Mr. Simon Sivertsen
1308,Ware,Mr. Frederick


In [None]:
(titanic
 .replace(re.compile(r'\(.*\)'), '')
 .Name.str
 .split(',', expand=True)
 .rename({0:'family_name', 1:'first_name'}, axis=1)
 .head())
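
The split leaves a leading space in the second part (and a trailing one where a parenthesized name was removed). A sketch of the same pipeline on two sample names, working on the `Name` column only and stripping the whitespace at the end; the use of `n=1` is an added assumption that guards against names containing more than one comma.

```python
import pandas as pd

names = pd.Series(['Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
                   'Allen, Mr. William Henry'], name='Name')

parts = (names
         .str.replace(r'\(.*\)', '', regex=True)   # drop the parenthesized part
         .str.split(',', n=1, expand=True)         # split on the first comma only
         .rename(columns={0: 'family_name', 1: 'first_name'})
         .apply(lambda col: col.str.strip()))      # remove surrounding whitespace
```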

# Cleaning data

`isnull` is a very convenient method:

In [197]:
titanic.isnull().head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,True,False


The resulting dataframe can now be used to determine whether there are any missing values (by column or by row):

In [198]:
titanic.isnull().any()

Survived     True
Pclass      False
Name        False
Sex         False
Age          True
SibSp       False
Parch       False
Ticket      False
Fare         True
Cabin        True
Embarked     True
dtype: bool

In [199]:
titanic.isnull().any(axis=1).head() # The row-wise check is the most common one, since we often want to drop rows that have any missing value

PassengerId
1     True
2    False
3     True
4    False
5     True
dtype: bool

Or calculate how many missing values are in a dataframe (by row or by column):

In [200]:
titanic.isnull().sum()

Survived     418
Pclass         0
Name           0
Sex            0
Age          263
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
dtype: int64

In [201]:
titanic.head(15)

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0.0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0.0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Pandas is smart enough to fill missing values by column:

In [202]:
fill_values = titanic[['Age', 'Fare']].mean() # mean() skips missing values by default, so they do not count toward N

In [203]:
fill_values

Age     29.881138
Fare    33.295479
dtype: float64

In [204]:
titanic[titanic.Fare.isnull()]

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S


In [205]:
titanic.fillna(fill_values).head(15)

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0.0,3,"Moran, Mr. James",male,29.881138,0,0,330877,8.4583,,Q
7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0.0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
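
A compact version of the same idea on made-up data: the fill values are a Series keyed by column name, so only those columns are filled and the others keep their missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps in three columns.
sample = pd.DataFrame({'Age':   [20.0, np.nan, 40.0],
                       'Fare':  [10.0, 30.0, np.nan],
                       'Cabin': ['C85', None, None]})

# Column means, computed over the non-missing values only.
fill_values = sample[['Age', 'Fare']].mean()

# Each column is filled with the entry of fill_values that carries its name.
filled = sample.fillna(fill_values)
```

`Cabin` has no entry in `fill_values`, so its two missing values remain.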


# Getting indicators and dummy variables

In [206]:
pd.get_dummies(titanic, columns=['Pclass', 'Sex', 'Embarked']).head()
# These columns are categorical variables. get_dummies is a convenient way to one-hot encode them in Pandas.
# Each listed column is replaced by one indicator column per category, which is why the number of columns grows.

Unnamed: 0_level_0,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,0.0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,0,0,1,0,1,0,0,1
2,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1,0,0,1,0,1,0,0
3,1.0,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,0,0,1,1,0,0,0,1
4,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,1,0,0,1,0,0,0,1
5,0.0,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,0,0,1,0,1,0,0,1
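
A small sketch of how `get_dummies` treats a missing category, on made-up data. Passing `dtype=int` is a choice made here to get 0/1 columns like the output above; newer Pandas versions may default to boolean indicators.

```python
import pandas as pd

# Hypothetical miniature table with one missing port of embarkation.
sample = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                       'Embarked': ['S', 'C', None],
                       'Age': [22.0, 38.0, 26.0]})

# Encoded columns are dropped and their indicators appended after the rest.
encoded = pd.get_dummies(sample, columns=['Sex', 'Embarked'], dtype=int)
```

By default a missing value gets no indicator of its own, so the third row has 0 in every `Embarked_*` column.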
