# Extracting and Transforming Data

## Indexing and Slicing Dataframes

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('./data/sales.csv', index_col='month')
df.head()

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52


The most efficient way of accessing records is through the `.loc` and `.iloc` indexers.

* `loc` - takes labels; ranges include the last value.
* `iloc` - takes integer positions; ranges exclude the last value.

Both use the `[row, column]` syntax.
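
The difference between inclusive and exclusive ranges can be checked on a small in-memory frame. This is a minimal sketch: the DataFrame below is a stand-in for `./data/sales.csv`, built from the values shown in the outputs of this section, and the names `by_label` and `by_position` are only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for ./data/sales.csv, rebuilt from the values shown in the outputs
df = pd.DataFrame(
    {
        'eggs': [47, 110, 221, 77, 132, 205],
        'salt': [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        'spam': [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], name='month'),
)

# Label slice: both end points ('Mar' and 'salt') are included
by_label = df.loc['Feb':'Mar', 'eggs':'salt']

# Position slice: the stop values (row 3 and column 2) are excluded
by_position = df.iloc[1:3, 0:2]

print(by_label.equals(by_position))  # True
```

Both selections cover the same two rows and two columns, even though the label slice names its last element and the position slice stops one short of it.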

In [2]:
df.loc['Jan', 'salt']

12.0

In [3]:
df.iloc[0, 1]

12.0

In [4]:
df.loc['Feb': 'Mar', 'eggs':'salt']

month,eggs,salt
Feb,110,50.0
Mar,221,89.0


In [5]:
df.iloc[1:3, 0:2] # 0 index, range is exclusive

month,eggs,salt
Feb,110,50.0
Mar,221,89.0


In [6]:
df.loc[:, 'eggs':'salt'] # select all rows

month,eggs,salt
Jan,47,12.0
Feb,110,50.0
Mar,221,89.0
Apr,77,87.0
May,132,
Jun,205,60.0


In [7]:
df.iloc[:, 0:2] # alternatively

month,eggs,salt
Jan,47,12.0
Feb,110,50.0
Mar,221,89.0
Apr,77,87.0
May,132,
Jun,205,60.0


In [8]:
df.loc['Jan':'Apr', :] # select rows 'Jan' - 'Apr' inclusively, and all columns

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20


We can use lists with both `iloc` and `loc`, which allows the selection of rows or columns out of sequence.

In [9]:
df.iloc[[5,3,1], 0:2]

month,eggs,salt
Jun,205,60.0
Apr,77,87.0
Feb,110,50.0


In [10]:
df.loc[['Jun', 'Apr', 'Feb'], 'eggs':'salt']

month,eggs,salt
Jun,205,60.0
Apr,77,87.0
Feb,110,50.0


In [11]:
df.iloc[[0,3,5], [1,0]]

month,salt,eggs
Jan,12.0,47
Apr,87.0,77
Jun,60.0,205


In [12]:
df.loc[['Jan', 'Apr', 'Jun'], ['salt', 'eggs']]

month,salt,eggs
Jan,12.0,47
Apr,87.0,77
Jun,60.0,205


## Filtering Dataframes

Filtering selects rows based on a condition rather than on their position within the dataframe. For example, the condition `df.salt > 60` produces a boolean series, and `df[df.salt > 60]` returns all rows with 'salt' values greater than 60.

A boolean condition/series applied in this manner is known as a filter.
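
A filter is an ordinary boolean series, so it can be stored, inspected and reused. The sketch below assumes an in-memory stand-in for the sales data (values taken from the outputs in this section); `mask`, `high_salt` and `eggs_only` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for ./data/sales.csv, rebuilt from the values shown in the outputs
df = pd.DataFrame(
    {
        'eggs': [47, 110, 221, 77, 132, 205],
        'salt': [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        'spam': [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], name='month'),
)

# The condition alone yields a boolean Series aligned with df's index;
# the NaN in 'May' compares as False
mask = df['salt'] > 60
print(mask.tolist())  # [False, False, True, True, False, False]

# The same mask can select whole rows, or rows plus a column via .loc
high_salt = df[mask]
eggs_only = df.loc[mask, 'eggs']
print(eggs_only.tolist())  # [221, 77]
```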

In [18]:
df[df['salt'] > 60]

month,eggs,salt,spam
Mar,221,89.0,72
Apr,77,87.0,20


Filters can be combined with the element-wise operators `&`, `|` and `~`. Python's logical `and`, `or` and `not` do not work on a series. Each condition needs to be wrapped in `()`, because `&` and `|` bind more tightly than comparison operators such as `>`.
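
To see why the logical keywords fail, the sketch below applies `and` to two conditions and captures the resulting error. It assumes an in-memory stand-in for the sales data; `both` and `message` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for ./data/sales.csv, rebuilt from the values shown in the outputs
df = pd.DataFrame(
    {
        'eggs': [47, 110, 221, 77, 132, 205],
        'salt': [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        'spam': [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], name='month'),
)

# Element-wise AND works row by row
both = (df['eggs'] > 100) & (df['salt'] > 50)
print(list(df[both].index))  # ['Mar', 'Jun']

# `and` asks for a single True/False value for a whole Series, which pandas refuses
try:
    (df['eggs'] > 100) and (df['salt'] > 50)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

The error message reports that the truth value of a series is ambiguous, which is the usual sign that `and`, `or` or `not` was used where `&`, `|` or `~` was intended.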

In [19]:
df[(df['eggs'] > 100) & (df['salt'] > 50)]

month,eggs,salt,spam
Mar,221,89.0,72
Jun,205,60.0,55


In [20]:
df[~(df.eggs > 100) & ~(df.salt > 50)]

month,eggs,salt,spam
Jan,47,12.0,17


Alternatively, we can use NumPy's `logical_and`, `logical_or` and `logical_not` functions.

In [21]:
df[np.logical_and(df.eggs > 100, df.salt > 50)]

month,eggs,salt,spam
Mar,221,89.0,72
Jun,205,60.0,55


In [22]:
df[np.logical_or(df.eggs < 100, df.salt > 50)]

month,eggs,salt,spam
Jan,47,12.0,17
Mar,221,89.0,72
Apr,77,87.0,20
Jun,205,60.0,55


In [23]:
df_sup = pd.DataFrame([[0,50,40,60,0]], columns=['eggs', 'salt', 'spam', 'bacon', 'ham'])
df_sup.index = ['July']
df_sup

,eggs,salt,spam,bacon,ham
July,0,50,40,60,0


In [24]:
df2 = df.copy()
df2['bacon'] = [0,0,50,60,70,80]
df2['ham'] = [0,0,0,0,0,0]
df2 = pd.concat([df2, df_sup])  # DataFrame.append was removed in pandas 2.0
df2

,eggs,salt,spam,bacon,ham
Jan,47,12.0,17,0,0
Feb,110,50.0,31,0,0
Mar,221,89.0,72,50,0
Apr,77,87.0,20,60,0
May,132,,52,70,0
Jun,205,60.0,55,80,0
July,0,50.0,40,60,0


Use the pandas `all` method to find the columns whose entries are all non-zero. Used as a column filter, it removes any column that contains a zero value; `NaN` values are ignored.
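
The behaviour of `all` and `any` with zeros and `NaN` can be checked on a tiny frame. This is a sketch with made-up data; the frame and its column names (`no_zero`, `some_zero`, `all_zero`) are hypothetical and chosen only to describe their contents.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column per case of interest
frame = pd.DataFrame({
    'no_zero': [1.0, np.nan, 3.0],  # contains NaN but no zeros
    'some_zero': [0, 5, 6],         # contains one zero
    'all_zero': [0, 0, 0],          # contains only zeros
})

# all(): True only where every non-NaN value is non-zero
print(frame.all().tolist())  # [True, False, False]

# any(): True where at least one value is non-zero
print(frame.any().tolist())  # [True, True, False]

# Used as column filters
print(list(frame.loc[:, frame.all()].columns))  # ['no_zero']
print(list(frame.loc[:, frame.any()].columns))  # ['no_zero', 'some_zero']
```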

In [25]:
df2.all()

eggs     False
salt      True
spam      True
bacon    False
ham      False
dtype: bool

In [26]:
df2.loc[:, df2.all()] # remove any column with a zero value

,salt,spam
Jan,12.0,17
Feb,50.0,31
Mar,89.0,72
Apr,87.0,20
May,,52
Jun,60.0,55
July,50.0,40


In [27]:
df2.loc[:, df2.any()] # remove any column whose values are all zero

,eggs,salt,spam,bacon
Jan,47,12.0,17,0
Feb,110,50.0,31,0
Mar,221,89.0,72,50
Apr,77,87.0,20,60
May,132,,52,70
Jun,205,60.0,55,80
July,0,50.0,40,60


**Find any column with a `NaN` value**.

In [28]:
df.isnull().any()

eggs    False
salt     True
spam    False
dtype: bool

In [29]:
df.loc[:, df.isnull().any()]

month,salt
Jan,12.0
Feb,50.0
Mar,89.0
Apr,87.0
May,
Jun,60.0


**Return columns with no `NaN` values**.

In [30]:
df.notnull().all()

eggs     True
salt    False
spam     True
dtype: bool

In [31]:
df.loc[:, df.notnull().all()]

month,eggs,spam
Jan,47,17
Feb,110,31
Mar,221,72
Apr,77,20
May,132,52
Jun,205,55


**Drop any rows with a `NaN` value**.

In [32]:
df_sup = pd.DataFrame([[np.nan, np.nan, np.nan]], columns=['eggs', 'salt', 'spam'])
df_sup.index = ['July']
df_sup

,eggs,salt,spam
July,,,


In [33]:
df3 = pd.concat([df, df_sup])  # DataFrame.append was removed in pandas 2.0
df3

,eggs,salt,spam
Jan,47.0,12.0,17.0
Feb,110.0,50.0,31.0
Mar,221.0,89.0,72.0
Apr,77.0,87.0,20.0
May,132.0,,52.0
Jun,205.0,60.0,55.0
July,,,


In [34]:
df3.dropna(how='any') # drops 'May' & 'July'

,eggs,salt,spam
Jan,47.0,12.0,17.0
Feb,110.0,50.0,31.0
Mar,221.0,89.0,72.0
Apr,77.0,87.0,20.0
Jun,205.0,60.0,55.0


In [35]:
df3.dropna(how='all') # drop rows with all 'NaN' values

,eggs,salt,spam
Jan,47.0,12.0,17.0
Feb,110.0,50.0,31.0
Mar,221.0,89.0,72.0
Apr,77.0,87.0,20.0
May,132.0,,52.0
Jun,205.0,60.0,55.0
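
Besides `how`, `dropna` also accepts the `subset` and `thresh` parameters, which give finer control over which rows are dropped. The sketch below rebuilds a frame like `df3` in memory from the values shown above; `salt_known` and `at_least_two` are illustrative names.

```python
import numpy as np
import pandas as pd

# In-memory equivalent of df3: the sales data plus an all-NaN 'July' row
df3 = pd.DataFrame(
    {
        'eggs': [47, 110, 221, 77, 132, 205, np.nan],
        'salt': [12, 50, 89, 87, np.nan, 60, np.nan],
        'spam': [17, 31, 72, 20, 52, 55, np.nan],
    },
    index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'July'],
)

# subset: only look at the listed columns when deciding what to drop
salt_known = df3.dropna(subset=['salt'])  # drops 'May' and 'July'

# thresh: keep rows that have at least this many non-NaN values
at_least_two = df3.dropna(thresh=2)       # keeps 'May', drops 'July'

print(list(salt_known.index))
print(list(at_least_two.index))
```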


**Filtering one column based on another column**.

In [36]:
df.salt > 55

month
Jan    False
Feb    False
Mar     True
Apr     True
May    False
Jun     True
Name: salt, dtype: bool

In [37]:
df.eggs[df.salt > 55]

month
Mar    221
Apr     77
Jun    205
Name: eggs, dtype: int64

**Filter one column based on another and modify its value**.

In [38]:
# Assign through .loc in one step; the chained form df.eggs[df.salt > 55] += 20 triggers SettingWithCopyWarning
df.loc[df.salt > 55, 'eggs'] += 20
df


month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,241,89.0,72
Apr,97,87.0,20
May,132,,52
Jun,225,60.0,55


## Transforming Dataframes

The best way to transform dataframes is with the methods built into dataframes. The next best option is to use NumPy universal functions, or `ufuncs`.

**Round numbers down with the pandas `floordiv` method**.

* the operation is applied element-wise to every field in the dataframe
* each field's value is divided by the supplied scalar and rounded down.
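
As a quick check, `floordiv` gives the same result as the `//` operator and leaves `NaN` values as `NaN`. The sketch assumes an in-memory stand-in for `df` as it stands at this point in the notebook (that is, after 20 was added to some 'eggs' values); `by_method` and `by_operator` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for df at this point in the notebook (after the += 20 step)
df = pd.DataFrame(
    {
        'eggs': [47, 110, 241, 97, 132, 225],
        'salt': [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        'spam': [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], name='month'),
)

by_method = df.floordiv(12)  # method form
by_operator = df // 12       # operator form

print(by_method.equals(by_operator))        # True
print(by_method['eggs'].tolist())           # [3, 9, 20, 8, 11, 18]
print(np.isnan(by_method.loc['May', 'salt']))  # True: NaN stays NaN
```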

In [47]:
df.floordiv(12)

month,eggs,salt,spam
Jan,3,1.0,1
Feb,9,4.0,2
Mar,20,7.0,6
Apr,8,7.0,1
May,11,,4
Jun,18,5.0,4


Alternatively, we can use NumPy's `floor_divide` function.

In [49]:
np.floor_divide(df, 12)

month,eggs,salt,spam
Jan,3.0,1.0,1.0
Feb,9.0,4.0,2.0
Mar,20.0,7.0,6.0
Apr,8.0,7.0,1.0
May,11.0,,4.0
Jun,18.0,5.0,4.0


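Alternatively, we can write a Python function to do the job and use the `apply` method to run it on the dataframe. `apply` passes each column to the function as a series; because `//` is itself vectorized, the result is still computed element-wise.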
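
One way to check what `DataFrame.apply` hands to the function is to record the type of its argument. This is a minimal sketch on an in-memory stand-in for `df` at this point in the notebook; the `received` list is a hypothetical helper added only for the demonstration.

```python
import numpy as np
import pandas as pd

# Stand-in for df at this point in the notebook (after the += 20 step)
df = pd.DataFrame(
    {
        'eggs': [47, 110, 241, 97, 132, 225],
        'salt': [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        'spam': [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], name='month'),
)

received = []  # hypothetical helper: records the type of each argument

def dozens(values):
    received.append(type(values).__name__)
    return values // 12  # vectorized floor division on the whole column

result = df.apply(dozens)

print(set(received))             # {'Series'}: one call per column
print(result['eggs'].tolist())   # [3, 9, 20, 8, 11, 18]
```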

In [50]:
def dozens(value):
    return value//12

In [51]:
df.apply(dozens)

month,eggs,salt,spam
Jan,3,1.0,1
Feb,9,4.0,2
Mar,20,7.0,6
Apr,8,7.0,1
May,11,,4
Jun,18,5.0,4


Another option is to use a `lambda` function in combination with `apply`.

In [52]:
df.apply(lambda val: val//12)

month,eggs,salt,spam
Jan,3,1.0,1
Feb,9,4.0,2
Mar,20,7.0,6
Apr,8,7.0,1
May,11,,4
Jun,18,5.0,4


All these operations return a new dataframe and leave the original unchanged. To keep a result, assign it, for example to a new column.

In [54]:
df['dozens_of_eggs'] = df.eggs.apply(lambda val: val//12)
df

month,eggs,salt,spam,dozens_of_eggs
Jan,47,12.0,17,3
Feb,110,50.0,31,9
Mar,241,89.0,72,20
Apr,97,87.0,20,8
May,132,,52,11
Jun,225,60.0,55,18


Series and index objects come with the `str` accessor to perform string operations on their values. A dataframe as a whole has no `str` accessor; select a column first.
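
The `str` accessor exposes vectorized versions of the usual Python string methods. The sketch below uses a small made-up index and series; `months`, `labels` and `frame` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# On an Index
months = pd.Index(['Jan', 'Feb', 'Mar'], name='month')
print(months.str.upper().tolist())  # ['JAN', 'FEB', 'MAR']

# On a Series: calls can be chained, each one going through .str again
labels = pd.Series(['eggs ', ' salt', 'spam'])
print(labels.str.strip().str.title().tolist())  # ['Eggs', 'Salt', 'Spam']
print(labels.str.contains('a').tolist())        # [False, True, True]

# A DataFrame as a whole does not expose .str; select a column first
frame = pd.DataFrame({'name': labels})
print(hasattr(frame, 'str'))                    # False
print(frame['name'].str.len().tolist())         # [5, 5, 4]
```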

In [55]:
df.index = df.index.str.upper()
df

month,eggs,salt,spam,dozens_of_eggs
JAN,47,12.0,17,3
FEB,110,50.0,31,9
MAR,241,89.0,72,20
APR,97,87.0,20,8
MAY,132,,52,11
JUN,225,60.0,55,18


Index objects have no `apply` method. Use `map` instead.

In [56]:
df.index = df.index.map(str.lower)
df

month,eggs,salt,spam,dozens_of_eggs
jan,47,12.0,17,3
feb,110,50.0,31,9
mar,241,89.0,72,20
apr,97,87.0,20,8
may,132,,52,11
jun,225,60.0,55,18


We can also apply arithmetic operations on dataframes.

In [57]:
df['salty_eggs'] = df.eggs + df.salt
df

month,eggs,salt,spam,dozens_of_eggs,salty_eggs
jan,47,12.0,17,3,59.0
feb,110,50.0,31,9,160.0
mar,241,89.0,72,20,330.0
apr,97,87.0,20,8,184.0
may,132,,52,11,
jun,225,60.0,55,18,285.0


When performance is paramount, you should avoid using `.apply()` and `.map()`, because those constructs run Python `for` loops over the data stored in a pandas Series or DataFrame. Vectorized functions instead loop over the data at the speed of compiled code (C, Fortran, etc.). NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).
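
The difference can be measured with the standard library's `timeit` module. This is a rough sketch on made-up data: the series size, the number of repetitions and the names `looped`, `vectorized`, `t_loop` and `t_vec` are arbitrary choices, and the timings depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: 100,000 integers
values = pd.Series(np.arange(100_000, dtype='int64'))

# Both approaches produce the same result
looped = values.apply(lambda v: v // 12)  # Python-level call per element
vectorized = values // 12                 # single vectorized operation
print(looped.equals(vectorized))          # True

# Time three runs of each approach
t_loop = timeit.timeit(lambda: values.apply(lambda v: v // 12), number=3)
t_vec = timeit.timeit(lambda: values // 12, number=3)
print(f'apply: {t_loop:.4f}s, vectorized: {t_vec:.4f}s')
```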

We'll import the `zscore` function from the `scipy.stats` module and use it to express voter turnout in each Pennsylvania county as a deviation from the mean, measured in standard deviations. In statistics, the z-score is the number of standard deviations by which an observation lies above the mean, so a negative value means the observation is below the mean.

The vectorized `zscore` function takes a pandas Series as input and returns a NumPy array. The values of that array are then assigned to a new column in the DataFrame.
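
The z-score itself is simple to compute by hand, which makes the definition concrete. The sketch below uses only the five turnout values visible in the output of this section, so its results differ from the `turnout_zscore` column, which is computed over every county. It uses `ddof=0` (population standard deviation) to match the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Only the five turnout values shown in this section, not the full data set
turnout = pd.Series(
    [68.632677, 66.497575, 67.19814, 69.483401, 66.619031],
    index=['Adams', 'Allegheny', 'Armstrong', 'Beaver', 'Bedford'],
    name='turnout',
)

# z = (x - mean) / standard deviation, computed for every value at once
z = (turnout - turnout.mean()) / turnout.std(ddof=0)

print(z.round(3))
```

By construction the z-scores have a mean of 0 and a standard deviation of 1; counties above the mean turnout get positive values and those below it get negative values.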

In [58]:
from scipy.stats import zscore

# `election` is assumed to be an already-loaded DataFrame with a 'turnout' column
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election.turnout)

# Print the type of turnout_zscore
print(type(turnout_zscore))

# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore

election.head()

<class 'numpy.ndarray'>


county,state,total,Obama,Romney,winner,voters,turnout,margin,turnout_zscore
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667,0.853734
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399,0.439846
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293,0.57565
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012,1.018647
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118,0.463391
