# DS-SF-34 | 02 | The `pandas` Library | Codealong | Answer Key

(http://pandas.pydata.org/pandas-docs/stable)

## Part A - Introduction to `pandas`

In [1]:
import os

import numpy as np

import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

import re

> ## `pd.read_csv()`: load datasets from files (or even over the Internet)

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

In [2]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-02-zillow-starter.csv'))

> ## `DataFrame`

Let's check `df`'s type:

In [3]:
type(df)

pandas.core.frame.DataFrame

`df` is a `DataFrame`.  (http://pandas.pydata.org/pandas-docs/stable/dsintro.html)

A `DataFrame` stores tabular data.  Let's have a look at its content:

In [4]:
df

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,15063471,"55 Vandewater St APT 9, San Francisco, CA",37805103,-122412856,12/4/15,...,550.0,sqft,,,1980.0
1,15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,11/30/15,...,1430.0,sqft,2435.0,sqft,1948.0
2,15063609,"819 Francisco St, San Francisco, CA",37803728,-122419055,11/12/15,...,2040.0,sqft,3920.0,sqft,1976.0
3,15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,12/11/15,...,1060.0,sqft,,,1930.0
4,15064257,"111 Chestnut St APT 403, San Francisco, CA",37804240,-122405509,1/15/16,...,1299.0,sqft,,,1993.0
...,...,...,...,...,...,...,...,...,...,...,...
995,2124214951,"412 Green St APT A, San Francisco, CA",37800040,-122406100,1/15/16,...,264.0,sqft,,,2012.0
996,2126960082,"355 1st St UNIT 1905, San Francisco, CA",37787029,-122393638,11/20/15,...,691.0,sqft,,,2004.0
997,2128308939,"33 Santa Cruz Ave, San Francisco, CA",37709136,-122465332,12/10/15,...,1738.0,sqft,2299.0,sqft,1976.0
998,2131957929,"1821 Grant Ave, San Francisco, CA",37803760,-122408531,12/15/15,...,1048.0,sqft,,,1975.0


> ## `.head()`: first 5 (default) rows

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.head.html)

In [5]:
df.head()

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,15063471,"55 Vandewater St APT 9, San Francisco, CA",37805103,-122412856,12/4/15,...,550.0,sqft,,,1980.0
1,15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,11/30/15,...,1430.0,sqft,2435.0,sqft,1948.0
2,15063609,"819 Francisco St, San Francisco, CA",37803728,-122419055,11/12/15,...,2040.0,sqft,3920.0,sqft,1976.0
3,15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,12/11/15,...,1060.0,sqft,,,1930.0
4,15064257,"111 Chestnut St APT 403, San Francisco, CA",37804240,-122405509,1/15/16,...,1299.0,sqft,,,1993.0


> ## `.tail()`: last 5 (default) rows

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.tail.html)

In [6]:
df.tail()

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
995,2124214951,"412 Green St APT A, San Francisco, CA",37800040,-122406100,1/15/16,...,264.0,sqft,,,2012.0
996,2126960082,"355 1st St UNIT 1905, San Francisco, CA",37787029,-122393638,11/20/15,...,691.0,sqft,,,2004.0
997,2128308939,"33 Santa Cruz Ave, San Francisco, CA",37709136,-122465332,12/10/15,...,1738.0,sqft,2299.0,sqft,1976.0
998,2131957929,"1821 Grant Ave, San Francisco, CA",37803760,-122408531,12/15/15,...,1048.0,sqft,,,1975.0
999,2136213970,"1200 Gough St, San Francisco, CA",37784770,-122424100,1/10/16,...,900.0,sqft,,,1966.0


> ## `.shape`: shape (i.e., number of rows and columns)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.shape.html)

In [7]:
df.shape

(1000, 15)

The first value (at index 0) is the number of rows, the second (at index 1), the number of columns:

In [8]:
df.shape[0]

1000

In [9]:
df.shape[1]

15

You can also use the idiomatic Python `len` function to get the number of rows:

In [10]:
len(df)

1000

> ## `.dtypes`: column types

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html)
- (http://pandas.pydata.org/pandas-docs/stable/basics.html)

In [11]:
df.dtypes

ID               int64
Address         object
Latitude         int64
Longitude        int64
DateOfSale      object
                ...   
Size           float64
SizeUnit        object
LotSize        float64
LotSizeUnit     object
BuiltInYear    float64
dtype: object

> ## `.isnull()` and `.notnull()`: NaN (Not-a-Number)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.notnull.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html)

As data scientists, we will have to decide what to do when encountering missing values (a.k.a. NaNs, for "not-a-number").  We might decide to drop the rows containing them, drop the whole column, or impute them.  Today, let's focus on finding these NaNs.

In [12]:
df.isnull()

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,False,False,False,False,False,...,False,False,True,True,False
1,False,False,False,False,False,...,False,False,False,False,False
2,False,False,False,False,False,...,False,False,False,False,False
3,False,False,False,False,False,...,False,False,True,True,False
4,False,False,False,False,False,...,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,...,False,False,True,True,False
996,False,False,False,False,False,...,False,False,True,True,False
997,False,False,False,False,False,...,False,False,False,False,False
998,False,False,False,False,False,...,False,False,True,True,False


In return, we get a new `DataFrame` of Boolean values: `True` where the value is `NaN`, `False` otherwise.

We can also get the count per column:

In [13]:
df.isnull().sum()

ID               0
Address          0
Latitude         0
Longitude        0
DateOfSale       0
              ... 
Size            33
SizeUnit        33
LotSize        444
LotSizeUnit    444
BuiltInYear     25
dtype: int64

Summing again will return the number of cells in the `DataFrame` with missing values.

In [14]:
df.isnull().sum().sum()

1215

Equivalently, we can use the top-level `pd.isnull()` function:

In [15]:
pd.isnull(df)

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,False,False,False,False,False,...,False,False,True,True,False
1,False,False,False,False,False,...,False,False,False,False,False
2,False,False,False,False,False,...,False,False,False,False,False
3,False,False,False,False,False,...,False,False,True,True,False
4,False,False,False,False,False,...,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,...,False,False,True,True,False
996,False,False,False,False,False,...,False,False,True,True,False
997,False,False,False,False,False,...,False,False,False,False,False
998,False,False,False,False,False,...,False,False,True,True,False


We can also use `.notnull()`, its complement method:

In [16]:
df.notnull()

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,True,True,True,True,True,...,True,True,False,False,True
1,True,True,True,True,True,...,True,True,True,True,True
2,True,True,True,True,True,...,True,True,True,True,True
3,True,True,True,True,True,...,True,True,False,False,True
4,True,True,True,True,True,...,True,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
995,True,True,True,True,True,...,True,True,False,False,True
996,True,True,True,True,True,...,True,True,False,False,True
997,True,True,True,True,True,...,True,True,True,True,True
998,True,True,True,True,True,...,True,True,False,False,True


In [17]:
pd.notnull(df)

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,True,True,True,True,True,...,True,True,False,False,True
1,True,True,True,True,True,...,True,True,True,True,True
2,True,True,True,True,True,...,True,True,True,True,True
3,True,True,True,True,True,...,True,True,False,False,True
4,True,True,True,True,True,...,True,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
995,True,True,True,True,True,...,True,True,False,False,True
996,True,True,True,True,True,...,True,True,False,False,True
997,True,True,True,True,True,...,True,True,True,True,True
998,True,True,True,True,True,...,True,True,False,False,True
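
The three options mentioned at the start of this section (dropping rows, dropping columns, imputing) can be sketched as follows.  This is only an illustration on a small made-up `DataFrame` (`toy` and its values are assumptions, not the housing dataset), and imputing with the column mean is just one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few columns of the housing dataset
toy = pd.DataFrame({
    'Size': [550.0, 1430.0, np.nan, 1060.0],
    'LotSize': [np.nan, 2435.0, 3920.0, np.nan],
    'BuiltInYear': [1980.0, 1948.0, 1976.0, 1930.0],
})

# Option 1: drop every row that contains at least one NaN
rows_dropped = toy.dropna(axis=0)

# Option 2: drop every column that contains at least one NaN
columns_dropped = toy.dropna(axis=1)

# Option 3: impute, here with each column's mean (one choice among many)
imputed = toy.fillna(toy.mean())

print(rows_dropped.shape, columns_dropped.shape)
print(imputed.isnull().sum().sum())
```

Which option is appropriate depends on how many values are missing and on what the column will be used for.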


> ## `.index` and `.columns`: row and column labels

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html)

Use the `.index` property to get the row labels.  For the column labels, use the `.columns` property.

In [18]:
df.index

RangeIndex(start=0, stop=1000, step=1)

In [19]:
type(df.index)

pandas.indexes.range.RangeIndex

In this specific case, rows are simply numbered from 0 to 999.  Like Python's standard `range` function, a `RangeIndex` excludes its `stop` value (here, 1,000).

In [20]:
df.columns

Index([u'ID', u'Address', u'Latitude', u'Longitude', u'DateOfSale',
       u'SalePrice', u'SalePriceUnit', u'IsAStudio', u'Beds', u'Baths',
       u'Size', u'SizeUnit', u'LotSize', u'LotSizeUnit', u'BuiltInYear'],
      dtype='object')

In [21]:
type(df.columns)

pandas.indexes.base.Index

> ## `[ [] ]` and `[]`: subsetting on columns

Selecting specific columns is performed by using the `[]` operator.

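Passing a list of integers to `[]` performs a location-based lookup of the columns.  (This reflects the pandas version used to produce the outputs below.  Recent versions treat the integers as column labels and raise a `KeyError` here, so `.iloc[:, [5, 6]]` is the version-independent spelling.)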

> E.g., columns #5 and #6:

In [22]:
df[ [5, 6] ]

,SalePrice,SalePriceUnit
0,710000.00,$
1,2.15,$M
2,5.60,$M
3,1.50,$M
4,970000.00,$
...,...,...
995,390000.00,$
996,860000.00,$
997,830000.00,$
998,835000.00,$


> Let's check that the column subsetting returns a `DataFrame`:

In [23]:
type(df[ [5, 6] ])

pandas.core.frame.DataFrame

> Note that a `DataFrame` can be made of a single column.  E.g., column #7 only:

In [24]:
df[ [7] ]

,IsAStudio
0,False
1,False
2,False
3,False
4,False
...,...
995,True
996,False
997,False
998,False


In [25]:
type(df[ [7] ])

pandas.core.frame.DataFrame
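
In recent pandas versions, a list of integers passed to `[]` is interpreted as a list of column labels, so the location-based lookups above may raise a `KeyError`.  A minimal sketch of the same idea with `.iloc`, which is always positional, on a small made-up `DataFrame` (`toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data; column positions are 0, 1, 2
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'SalePrice': [0.71, 2.15, 5.6],
    'IsAStudio': [False, False, True],
})

# Columns at positions 1 and 2, all rows; the result is a DataFrame
by_position = toy.iloc[:, [1, 2]]

# A one-element list still yields a (single-column) DataFrame...
one_column_frame = toy.iloc[:, [2]]

# ...whereas a bare integer yields a Series
one_column_series = toy.iloc[:, 2]

print(type(one_column_frame), type(one_column_series))
```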

If the values passed to `[]` are non-integers, the `DataFrame` will attempt to match them to those in the `columns` property.

> Let's subset the `DataFrame` on columns `SalePrice` and `SalePriceUnit`:

In [26]:
df[ ['SalePrice', 'SalePriceUnit'] ]

,SalePrice,SalePriceUnit
0,710000.00,$
1,2.15,$M
2,5.60,$M
3,1.50,$M
4,970000.00,$
...,...,...
995,390000.00,$
996,860000.00,$
997,830000.00,$
998,835000.00,$


However, you cannot mix integers and non-integers.  E.g.,

In [27]:
# "df[ ['SalePrice', 6] ]" errors out...  Try it!

Passing a single value rather than a list always results in a label-based lookup of the column:

In [28]:
df['Address']

0       55 Vandewater St APT 9, San Francisco, CA
1             740 Francisco St, San Francisco, CA
2             819 Francisco St, San Francisco, CA
3        199 Chestnut St APT 5, San Francisco, CA
4      111 Chestnut St APT 403, San Francisco, CA
                          ...                    
995         412 Green St APT A, San Francisco, CA
996       355 1st St UNIT 1905, San Francisco, CA
997          33 Santa Cruz Ave, San Francisco, CA
998             1821 Grant Ave, San Francisco, CA
999              1200 Gough St, San Francisco, CA
Name: Address, dtype: object

> ## `Series`

(http://pandas.pydata.org/pandas-docs/stable/dsintro.html)

> Let's check the result type:

In [29]:
type(df['Address'])

pandas.core.series.Series

Columns can also be retrieved using "attribute" access, because a `DataFrame` exposes each column as a property named after it.  This won't work, however, for columns whose names contain spaces or dots.

> Let's check the value of `df`'s `.Address` property:

In [30]:
df.Address

0       55 Vandewater St APT 9, San Francisco, CA
1             740 Francisco St, San Francisco, CA
2             819 Francisco St, San Francisco, CA
3        199 Chestnut St APT 5, San Francisco, CA
4      111 Chestnut St APT 403, San Francisco, CA
                          ...                    
995         412 Green St APT A, San Francisco, CA
996       355 1st St UNIT 1905, San Francisco, CA
997          33 Santa Cruz Ave, San Francisco, CA
998             1821 Grant Ave, San Francisco, CA
999              1200 Gough St, San Francisco, CA
Name: Address, dtype: object
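
To make the limits of attribute access concrete, here is a small sketch on a made-up `DataFrame`.  The column names `'Sale Price'` and `'count'` are hypothetical and chosen only to trigger the two pitfalls: a name that is not a valid Python identifier, and a name that clashes with an existing `DataFrame` method.

```python
import pandas as pd

toy = pd.DataFrame({
    'Address': ['740 Francisco St', '819 Francisco St'],
    'Sale Price': [2.15, 5.6],  # hypothetical name containing a space
    'count': [1, 2],            # hypothetical name clashing with DataFrame.count()
})

# Attribute access and bracket access return the same Series
same = toy.Address.equals(toy['Address'])

# "toy.Sale Price" would be a syntax error, so brackets are the only option
prices = toy['Sale Price']

# toy.count is the DataFrame method, not the column...
clash = toy.count

# ...so the column has to be fetched with brackets
counts = toy['count']

print(same, callable(clash), list(counts))
```

Bracket access works for every column name, which is why it is the safer default in reusable code.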

> Use the `.name` property of a `Series` (not `.columns`, which belongs to a `DataFrame`) to get the name of the variable it stores.

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.name.html)

In [31]:
df.Address.name

'Address'

> To find the zero-based location of a column, use the `.get_loc()` method of the `.columns` property.  E.g.,

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_loc.html)

In [32]:
df.columns.get_loc('Beds')

8

In [33]:
df[ [df.columns.get_loc('Beds')] ]

,Beds
0,1.0
1,
2,2.0
3,1.0
4,2.0
...,...
995,
996,1.0
997,3.0
998,2.0


We should get the same output as subsetting a `DataFrame` on `Beds`:

In [34]:
df[ ['Beds'] ]

,Beds
0,1.0
1,
2,2.0
3,1.0
4,2.0
...,...
995,
996,1.0
997,3.0
998,2.0


> ## `[]`: slicing on rows

> E.g., on the first five rows:

In [35]:
df[:5]

,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,15063471,"55 Vandewater St APT 9, San Francisco, CA",37805103,-122412856,12/4/15,...,550.0,sqft,,,1980.0
1,15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,11/30/15,...,1430.0,sqft,2435.0,sqft,1948.0
2,15063609,"819 Francisco St, San Francisco, CA",37803728,-122419055,11/12/15,...,2040.0,sqft,3920.0,sqft,1976.0
3,15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,12/11/15,...,1060.0,sqft,,,1930.0
4,15064257,"111 Chestnut St APT 403, San Francisco, CA",37804240,-122405509,1/15/16,...,1299.0,sqft,,,1993.0


> ## `.loc[]` and `.iloc[]`: subsetting rows by index label and location

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.loc.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.iloc.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.reset_index.html)

So far, the index of the `DataFrame` has been a numerical one starting from 0, but you can specify which column(s) should make up the index.

> E.g., `ID`:

In [36]:
df = df.set_index('ID')

In [37]:
df.index

Int64Index([  15063471,   15063505,   15063609,   15064044,   15064257,
              15064295,   15064391,   15064536,   15064640,   15064669,
            ...
            2108546010, 2114498518, 2121221385, 2121978635, 2122992200,
            2124214951, 2126960082, 2128308939, 2131957929, 2136213970],
           dtype='int64', name=u'ID', length=1000)

In [38]:
df

ID,Address,Latitude,Longitude,DateOfSale,SalePrice,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15063471,"55 Vandewater St APT 9, San Francisco, CA",37805103,-122412856,12/4/15,710000.00,...,550.0,sqft,,,1980.0
15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,11/30/15,2.15,...,1430.0,sqft,2435.0,sqft,1948.0
15063609,"819 Francisco St, San Francisco, CA",37803728,-122419055,11/12/15,5.60,...,2040.0,sqft,3920.0,sqft,1976.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,12/11/15,1.50,...,1060.0,sqft,,,1930.0
15064257,"111 Chestnut St APT 403, San Francisco, CA",37804240,-122405509,1/15/16,970000.00,...,1299.0,sqft,,,1993.0
...,...,...,...,...,...,...,...,...,...,...,...
2124214951,"412 Green St APT A, San Francisco, CA",37800040,-122406100,1/15/16,390000.00,...,264.0,sqft,,,2012.0
2126960082,"355 1st St UNIT 1905, San Francisco, CA",37787029,-122393638,11/20/15,860000.00,...,691.0,sqft,,,2004.0
2128308939,"33 Santa Cruz Ave, San Francisco, CA",37709136,-122465332,12/10/15,830000.00,...,1738.0,sqft,2299.0,sqft,1976.0
2131957929,"1821 Grant Ave, San Francisco, CA",37803760,-122408531,12/15/15,835000.00,...,1048.0,sqft,,,1975.0


> E.g., row with index 15063505:

In [39]:
df.loc[15063505]

Address        740 Francisco St, San Francisco, CA
Latitude                                  37804420
Longitude                               -122417389
DateOfSale                                11/30/15
SalePrice                                     2.15
                              ...                 
Size                                          1430
SizeUnit                                      sqft
LotSize                                       2435
LotSizeUnit                                   sqft
BuiltInYear                                   1948
Name: 15063505, dtype: object

In [40]:
type(df.loc[15063505])

pandas.core.series.Series

A single row is also a `Series`.

In [41]:
df.loc[15063505].name

15063505

Its name is its value in the index.

> E.g., rows with indices 15063505 and 15064044:

In [42]:
df.loc[ [15063505, 15064044] ]

ID,Address,Latitude,Longitude,DateOfSale,SalePrice,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,11/30/15,2.15,...,1430.0,sqft,2435.0,sqft,1948.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,12/11/15,1.5,...,1060.0,sqft,,,1930.0


> E.g., rows #1 and #3:

In [43]:
df.iloc[ [1, 3] ]

ID,Address,Latitude,Longitude,DateOfSale,SalePrice,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,11/30/15,2.15,...,1430.0,sqft,2435.0,sqft,1948.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,12/11/15,1.5,...,1060.0,sqft,,,1930.0
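
The difference between label-based and location-based row selection can be checked on a small made-up `DataFrame`.  The IDs below are borrowed from the outputs above, but `toy` and its shortened addresses are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy data indexed by ID, mimicking df after set_index('ID')
toy = pd.DataFrame(
    {'Address': ['55 Vandewater St', '740 Francisco St',
                 '819 Francisco St', '199 Chestnut St']},
    index=pd.Index([15063471, 15063505, 15063609, 15064044], name='ID'),
)

# .loc[] looks rows up by their index labels
by_label = toy.loc[[15063505, 15064044]]

# .iloc[] looks rows up by their zero-based positions
by_position = toy.iloc[[1, 3]]

# Here both selections pick the same two rows
same_rows = by_label.equals(by_position)

# reset_index() moves ID back to a regular column and restores a 0-based index
restored = toy.reset_index()

print(same_rows, list(restored.columns))
```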


> ## `.at[]` and `.iat[]`: scalar lookup by label or location

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.at.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.at.html)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iat.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.iat.html)

Scalar values can be looked up by label using `.at`, by passing both the row label and then the column name/value.

> E.g., row with index 15064044 and column `DateOfSale`.

In [44]:
df.at[15064044, 'DateOfSale']

'12/11/15'

Scalar values can also be looked up by location using `.iat` by passing both the row location and then the column location.

> E.g., row #3 and column #3:

In [45]:
df.iat[3, 3]

'12/11/15'
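
As a quick check that `.at[]`, `.iat[]`, and `.loc[]` agree on a single cell, here is a sketch on a two-row made-up `DataFrame` (`toy` is an assumption; the IDs and dates are copied from the outputs above):

```python
import pandas as pd

toy = pd.DataFrame(
    {'Address': ['740 Francisco St', '199 Chestnut St'],
     'DateOfSale': ['11/30/15', '12/11/15']},
    index=pd.Index([15063505, 15064044], name='ID'),
)

# Scalar lookup by row label and column label
by_label = toy.at[15064044, 'DateOfSale']

# Scalar lookup by row position and column position (both zero-based)
by_position = toy.iat[1, 1]

# .loc[] returns the same scalar; .at[] is simply specialized for single cells
via_loc = toy.loc[15064044, 'DateOfSale']

print(by_label, by_position, via_loc)
```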

> ## Subsetting rows by Boolean selection (a.k.a., masking)

Rows can also be selected by using Boolean selection, using an array calculated from the result of applying a logical condition on the values in any of the columns.  This allows us to build more complicated selections than those based simply upon index labels or positions.

> E.g., which homes were built before 1900?

In [46]:
df.BuiltInYear < 1900

ID
15063471      False
15063505      False
15063609      False
15064044      False
15064257      False
              ...  
2124214951    False
2126960082    False
2128308939    False
2131957929    False
2136213970    False
Name: BuiltInYear, dtype: bool

This results in a Boolean `Series` that can be used to subset the rows whose values are `True`.

> Let's subset on that `Series`:

In [47]:
df[df.BuiltInYear < 1900]

ID,Address,Latitude,Longitude,DateOfSale,SalePrice,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15065140,"1407 Montgomery St APT 2, San Francisco, CA",37802299,-122404941,1/28/16,976000.00,...,1000.0,sqft,,,1870.0
15076156,"1533 Sutter St, San Francisco, CA",37786658,-122426481,11/12/15,5.53,...,7375.0,sqft,2748.0,sqft,1890.0
15078536,"640 Steiner St, San Francisco, CA",37775399,-122432491,12/15/15,1.50,...,1593.0,sqft,,,1895.0
15078866,"753-755 Oak St, San Francisco, CA",37773576,-122431663,11/10/15,375000.00,...,2430.0,sqft,3781.0,sqft,1890.0
15082108,"3016 Sacramento St, San Francisco, CA",37788970,-122442995,12/22/15,1.85,...,1408.0,sqft,,,1890.0
...,...,...,...,...,...,...,...,...,...,...,...
114318108,"1442 Grove St # A, San Francisco, CA",37775894,-122440467,12/24/15,675000.00,...,593.0,sqft,,,1890.0
119684777,"967 Hayes St, San Francisco, CA",37775645,-122432222,12/17/15,1.95,...,3006.0,sqft,,,1885.0
123597223,"1954 Golden Gate Ave, San Francisco, CA",37778246,-122442763,11/24/15,1.10,...,1275.0,sqft,,,1895.0
123597388,"667 Shotwell St # A, San Francisco, CA",37757851,-122415629,2/10/16,775000.00,...,1212.0,sqft,,,1890.0


Multiple conditions can be put together.

> E.g., subset for `BuiltInYear` below 1900 and `Size` over 1500:

In [48]:
df[df.BuiltInYear < 1900][df.Size > 1500]

ID,Address,Latitude,Longitude,DateOfSale,SalePrice,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15076156,"1533 Sutter St, San Francisco, CA",37786658,-122426481,11/12/15,5.53,...,7375.0,sqft,2748.0,sqft,1890.0
15078536,"640 Steiner St, San Francisco, CA",37775399,-122432491,12/15/15,1.5,...,1593.0,sqft,,,1895.0
15078866,"753-755 Oak St, San Francisco, CA",37773576,-122431663,11/10/15,375000.0,...,2430.0,sqft,3781.0,sqft,1890.0
15084954,"1954 Golden Gate Ave, San Francisco, CA",37778420,-122443073,11/24/15,1.1,...,1515.0,sqft,,,1895.0
15145720,"956 S Van Ness Ave, San Francisco, CA",37757832,-122417139,11/23/15,2.67,...,3500.0,sqft,4165.0,sqft,1872.0
15181209,"1001 Diamond St # 1001A, San Francisco, CA",37749461,-122435844,11/24/15,3.1,...,2032.0,sqft,1913.0,sqft,1892.0
82785514,"1394 Mcallister St, San Francisco, CA",37778463,-122434933,11/13/15,2.1,...,2300.0,sqft,,,1890.0
119684777,"967 Hayes St, San Francisco, CA",37775645,-122432222,12/17/15,1.95,...,3006.0,sqft,,,1885.0
2122992200,"129 Octavia St, San Francisco, CA",37773192,-122424037,11/20/15,2.75,...,3655.0,sqft,,,1883.0


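(or, preferably, with both conditions combined into a single mask using `&`.  Each condition needs its own parentheses because `&` binds more tightly than `<` and `>`.  Chaining two `[]` selections as above makes pandas reindex the second mask and emit a `UserWarning`.)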

In [49]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)]

ID,Address,Latitude,Longitude,DateOfSale,SalePrice,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15076156,"1533 Sutter St, San Francisco, CA",37786658,-122426481,11/12/15,5.53,...,7375.0,sqft,2748.0,sqft,1890.0
15078536,"640 Steiner St, San Francisco, CA",37775399,-122432491,12/15/15,1.5,...,1593.0,sqft,,,1895.0
15078866,"753-755 Oak St, San Francisco, CA",37773576,-122431663,11/10/15,375000.0,...,2430.0,sqft,3781.0,sqft,1890.0
15084954,"1954 Golden Gate Ave, San Francisco, CA",37778420,-122443073,11/24/15,1.1,...,1515.0,sqft,,,1895.0
15145720,"956 S Van Ness Ave, San Francisco, CA",37757832,-122417139,11/23/15,2.67,...,3500.0,sqft,4165.0,sqft,1872.0
15181209,"1001 Diamond St # 1001A, San Francisco, CA",37749461,-122435844,11/24/15,3.1,...,2032.0,sqft,1913.0,sqft,1892.0
82785514,"1394 Mcallister St, San Francisco, CA",37778463,-122434933,11/13/15,2.1,...,2300.0,sqft,,,1890.0
119684777,"967 Hayes St, San Francisco, CA",37775645,-122432222,12/17/15,1.95,...,3006.0,sqft,,,1885.0
2122992200,"129 Octavia St, San Francisco, CA",37773192,-122424037,11/20/15,2.75,...,3655.0,sqft,,,1883.0
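
Boolean masks can be combined with `&` (and), `|` (or), and `~` (not).  The sketch below uses a small made-up `DataFrame` (`toy` and its values are assumptions) and also shows that a comparison against `NaN` evaluates to `False`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'BuiltInYear': [1890.0, 1895.0, 1980.0, np.nan],
    'Size': [7375.0, 1000.0, 2040.0, 1600.0],
})

# Each comparison yields a Boolean Series; NaN compares as False
old = toy.BuiltInYear < 1900
large = toy.Size > 1500

old_and_large = toy[old & large]   # both conditions hold
old_or_large = toy[old | large]    # at least one condition holds
not_old = toy[~old]                # negation (includes the NaN row)

print(list(old_and_large.index), list(old_or_large.index), list(not_old.index))
```

Note that `~old` keeps the row whose `BuiltInYear` is missing, so "not built before 1900" and "built in 1900 or later" are not the same selection when NaNs are present.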


It is possible to subset on columns simultaneously.

> E.g., subset (a `DataFrame`) on `Address` for `BuiltInYear` below 1900 and `Size` over 1500:

In [50]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)][ ['Address'] ]

ID,Address
15076156,"1533 Sutter St, San Francisco, CA"
15078536,"640 Steiner St, San Francisco, CA"
15078866,"753-755 Oak St, San Francisco, CA"
15084954,"1954 Golden Gate Ave, San Francisco, CA"
15145720,"956 S Van Ness Ave, San Francisco, CA"
15181209,"1001 Diamond St # 1001A, San Francisco, CA"
82785514,"1394 Mcallister St, San Francisco, CA"
119684777,"967 Hayes St, San Francisco, CA"
2122992200,"129 Octavia St, San Francisco, CA"


> To get a `Series` instead of a `DataFrame`:

In [51]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)]['Address']

ID
15076156               1533 Sutter St, San Francisco, CA
15078536               640 Steiner St, San Francisco, CA
15078866               753-755 Oak St, San Francisco, CA
15084954         1954 Golden Gate Ave, San Francisco, CA
15145720           956 S Van Ness Ave, San Francisco, CA
15181209      1001 Diamond St # 1001A, San Francisco, CA
82785514           1394 Mcallister St, San Francisco, CA
119684777                967 Hayes St, San Francisco, CA
2122992200             129 Octavia St, San Francisco, CA
Name: Address, dtype: object

(or)

In [52]:
df[(df.BuiltInYear < 1900) & (df.Size > 1500)].Address

ID
15076156               1533 Sutter St, San Francisco, CA
15078536               640 Steiner St, San Francisco, CA
15078866               753-755 Oak St, San Francisco, CA
15084954         1954 Golden Gate Ave, San Francisco, CA
15145720           956 S Van Ness Ave, San Francisco, CA
15181209      1001 Diamond St # 1001A, San Francisco, CA
82785514           1394 Mcallister St, San Francisco, CA
119684777                967 Hayes St, San Francisco, CA
2122992200             129 Octavia St, San Francisco, CA
Name: Address, dtype: object
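
Rows and columns can also be selected in a single step by passing both the mask and the column label(s) to `.loc[]`.  A minimal sketch on a made-up `DataFrame` (`toy` is an assumption; the addresses are shortened from the outputs above):

```python
import pandas as pd

toy = pd.DataFrame({
    'Address': ['1533 Sutter St', '640 Steiner St', '819 Francisco St'],
    'BuiltInYear': [1890.0, 1895.0, 1976.0],
    'Size': [7375.0, 1593.0, 2040.0],
})

mask = (toy.BuiltInYear < 1900) & (toy.Size > 1500)

# A single column label returns a Series
as_series = toy.loc[mask, 'Address']

# A list of column labels returns a DataFrame
as_frame = toy.loc[mask, ['Address']]

print(list(as_series), as_frame.shape)
```

Compared with chaining `[...][...]`, a single `.loc[]` call performs one selection instead of two, which also makes it the safer form when assigning values.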

## Part B - Wrangling the SF Housing dataset (take 2) with `pandas`

In [53]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-02-zillow-starter.csv'), index_col = 'ID')

(`pd.read_csv` can load the dataset and set the index column for the `DataFrame` at the same time)
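
To see that `index_col` is equivalent to loading first and calling `.set_index()` afterwards, here is a sketch that reads a tiny in-memory CSV.  The CSV text is made up for illustration, and `io.StringIO` stands in for the dataset file:

```python
import io

import pandas as pd

# Hypothetical three-line CSV standing in for the dataset file
csv_text = (
    "ID,Address,SalePrice\n"
    "15063471,55 Vandewater St,710000\n"
    "15063505,740 Francisco St,2.15\n"
)

# One step: load and set the index at the same time
one_step = pd.read_csv(io.StringIO(csv_text), index_col='ID')

# Two steps: load, then move ID into the index
two_steps = pd.read_csv(io.StringIO(csv_text)).set_index('ID')

identical = one_step.equals(two_steps)
print(identical, one_step.index.name)
```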

> ### Remove the `Latitude` and `Longitude` columns

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)

In [54]:
df.drop(['Latitude', 'Longitude'], axis = 1, inplace = True)

In [55]:
df

ID,Address,DateOfSale,SalePrice,SalePriceUnit,IsAStudio,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
15063471,"55 Vandewater St APT 9, San Francisco, CA",12/4/15,710000.00,$,False,...,550.0,sqft,,,1980.0
15063505,"740 Francisco St, San Francisco, CA",11/30/15,2.15,$M,False,...,1430.0,sqft,2435.0,sqft,1948.0
15063609,"819 Francisco St, San Francisco, CA",11/12/15,5.60,$M,False,...,2040.0,sqft,3920.0,sqft,1976.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",12/11/15,1.50,$M,False,...,1060.0,sqft,,,1930.0
15064257,"111 Chestnut St APT 403, San Francisco, CA",1/15/16,970000.00,$,False,...,1299.0,sqft,,,1993.0
...,...,...,...,...,...,...,...,...,...,...,...
2124214951,"412 Green St APT A, San Francisco, CA",1/15/16,390000.00,$,True,...,264.0,sqft,,,2012.0
2126960082,"355 1st St UNIT 1905, San Francisco, CA",11/20/15,860000.00,$,False,...,691.0,sqft,,,2004.0
2128308939,"33 Santa Cruz Ave, San Francisco, CA",12/10/15,830000.00,$,False,...,1738.0,sqft,2299.0,sqft,1976.0
2131957929,"1821 Grant Ave, San Francisco, CA",12/15/15,835000.00,$,False,...,1048.0,sqft,,,1975.0


> ### `SalePrice`: scale all amounts to `$M`

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)

In [56]:
df.SalePriceUnit.unique()

array(['$', '$M'], dtype=object)

In [57]:
df_1 = df[df.SalePriceUnit == '$']
df_1 = df_1.drop('SalePriceUnit', axis = 1)

# Scaling sale price to $M
df_1.SalePrice /= 10 ** 6

df_6 = df[df.SalePriceUnit == '$M']
df_6 = df_6.drop('SalePriceUnit', axis = 1)

In [58]:
# Concatenate the two DataFrames by rows
df = pd.concat([df_1, df_6])

In [59]:
# Re-sort the new DataFrame by its index
df.sort_index(inplace = True)
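
The split, scale, and concatenate steps above can also be sketched as a single in-place update through a Boolean mask, which keeps the rows in their original order.  This is an alternative formulation on a made-up `DataFrame` (`toy` is an assumption), not the approach taken in the cells above:

```python
import pandas as pd

toy = pd.DataFrame(
    {'SalePrice': [710000.0, 2.15, 970000.0],
     'SalePriceUnit': ['$', '$M', '$']},
    index=pd.Index([15063471, 15063505, 15064257], name='ID'),
)

# Rows whose price is expressed in dollars rather than in $M
in_dollars = toy.SalePriceUnit == '$'

# Scale only those rows to $M
toy.loc[in_dollars, 'SalePrice'] /= 10 ** 6

# The unit column no longer carries any information
toy = toy.drop('SalePriceUnit', axis=1)

print(toy)
```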

> ### `IsAStudio`: convert from a Boolean to a binary variable (i.e., 0 or 1)

In [60]:
df.IsAStudio *= 1

In [61]:
df.IsAStudio

ID
15063471      0
15063505      0
15063609      0
15064044      0
15064257      0
             ..
2124214951    1
2126960082    0
2128308939    0
2131957929    0
2136213970    0
Name: IsAStudio, dtype: object
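
Multiplying by 1 works because `True` and `False` behave like 1 and 0 in arithmetic.  A more explicit spelling is `.astype(int)`, sketched below on a made-up `Series`.  The sketch assumes the column holds only `True`/`False` with no missing values, which should be checked against the real column first:

```python
import pandas as pd

# Hypothetical Boolean column without missing values
toy = pd.Series([False, True, False], name='IsAStudio')

# Explicit conversion to integers
as_int = toy.astype(int)

# The arithmetic trick used above gives the same values
times_one = toy * 1

print(list(as_int), list(times_one))
```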

> ### `Size`

In [62]:
df.SizeUnit.unique()

array(['sqft', nan], dtype=object)

Size is either in square feet or missing, so the only work needed is to drop the size unit column.

In [63]:
df.drop('SizeUnit', axis = 1, inplace = True)

> ### `LotSize`: scale all values to square feet

In [64]:
df.LotSizeUnit.unique()

array([nan, 'sqft', 'ac'], dtype=object)

Lot sizes are either in square feet or in acres.  Let's convert them all to square feet.

> Group #1: the `na` values:

In [65]:
df_na = df[df.LotSizeUnit.isnull()]
df_na = df_na.drop('LotSizeUnit', axis = 1)

df_na.shape[0]

444

> Group #2: the `sqft` values:

In [66]:
df_sqft = df[df.LotSizeUnit == 'sqft']
df_sqft = df_sqft.drop('LotSizeUnit', axis = 1)

df_sqft.shape[0]

552

> Group #3: the `ac` values:

In [67]:
df_ac = df[df.LotSizeUnit == 'ac']
df_ac = df_ac.drop('LotSizeUnit', axis = 1)

df_ac.shape[0]

4

> Let's scale these `acre` values into `sqft`:

In [68]:
# (1 acre = 43,560 sqft)

df_ac.LotSize *= 43560.

Let's now put everything back together...

In [69]:
df = pd.concat([df_na, df_sqft, df_ac]).sort_index()

In [70]:
df

ID,Address,DateOfSale,SalePrice,IsAStudio,Beds,Baths,Size,LotSize,BuiltInYear
15063471,"55 Vandewater St APT 9, San Francisco, CA",12/4/15,0.710,0,1.0,,550.0,,1980.0
15063505,"740 Francisco St, San Francisco, CA",11/30/15,2.150,0,,2.0,1430.0,2435.0,1948.0
15063609,"819 Francisco St, San Francisco, CA",11/12/15,5.600,0,2.0,3.5,2040.0,3920.0,1976.0
15064044,"199 Chestnut St APT 5, San Francisco, CA",12/11/15,1.500,0,1.0,1.0,1060.0,,1930.0
15064257,"111 Chestnut St APT 403, San Francisco, CA",1/15/16,0.970,0,2.0,2.0,1299.0,,1993.0
...,...,...,...,...,...,...,...,...,...
2124214951,"412 Green St APT A, San Francisco, CA",1/15/16,0.390,1,,1.0,264.0,,2012.0
2126960082,"355 1st St UNIT 1905, San Francisco, CA",11/20/15,0.860,0,1.0,1.0,691.0,,2004.0
2128308939,"33 Santa Cruz Ave, San Francisco, CA",12/10/15,0.830,0,3.0,3.0,1738.0,2299.0,1976.0
2131957929,"1821 Grant Ave, San Francisco, CA",12/15/15,0.835,0,2.0,2.0,1048.0,,1975.0


> ## `.to_csv()`: save the `DataFrame` into a `.csv` file

At the end of each phase (e.g., wrangling) of your data science project, it is a good idea to save your dataset to disk.  Then, for the next step, create a new Jupyter notebook and load your updated dataset.

In [71]:
df.to_csv(os.path.join('..', 'datasets', 'dataset-02-zillow.csv'), index_label = 'ID')
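
To see what `index_label` does on the way out and how to get the index back on the way in, here is a small round-trip sketch. It writes to an in-memory buffer instead of a file so that it runs anywhere; the frame `toy` and its values are assumptions for illustration:

```python
import io

import pandas as pd

# Toy frame indexed by ID (hypothetical values)
toy = pd.DataFrame(
    {'SalePrice': [0.71, 2.15]},
    index=pd.Index([15063471, 15063505], name='ID'),
)

# Write to an in-memory buffer; index_label names the index column in the header
buffer = io.StringIO()
toy.to_csv(buffer, index_label='ID')

# Read it back, asking pandas to use the ID column as the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='ID')

print(restored)
```

Without `index_col`, `pd.read_csv()` creates a fresh default index and `ID` comes back as an ordinary column.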

## Part C - More advanced topics

### `.groupby()`

(http://pandas.pydata.org/pandas-docs/stable/groupby.html)

> What is the mean price of houses by number of bedrooms?

In [72]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-02-zillow.csv'))

In [73]:
df[ ['Beds', 'SalePrice'] ].groupby('Beds').mean()

Beds,SalePrice
1.0,0.820425
2.0,1.192398
3.0,1.404323
4.0,1.87933
5.0,2.86772
6.0,2.587222
7.0,5.543167
8.0,1.96
9.0,1.766667
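
A mean over a handful of sales can be misleading, so it can help to show the group sizes next to it. The sketch below does this on toy data (the frame `toy` and its values are assumptions for illustration) by selecting the column after grouping and asking for several aggregates at once:

```python
import numpy as np
import pandas as pd

# Toy data; the last row has a missing number of bedrooms (hypothetical values)
toy = pd.DataFrame({
    'Beds': [1., 1., 2., 2., 2., np.nan],
    'SalePrice': [0.5, 1.5, 2., 3., 4., 9.],
})

# Group by Beds, then compute the mean price and the number of sales per group
summary = toy.groupby('Beds').SalePrice.agg(['mean', 'count'])

print(summary)
```

Rows whose grouping key is missing are left out of the result by default, which is why no `NaN` group appears.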


### `.map()`

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html)

When converting `SalePrice`, `Size`, and `LotSize` into `$M` and sqft, we could also have done the following:

In [74]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-02-zillow-starter.csv'))

In [75]:
df.SalePriceUnit.unique()

array(['$', '$M'], dtype=object)

In [76]:
df.SalePriceUnit.map({'$': 1. / (10 ** 6), '$M': 1.})

0      0.000001
1      1.000000
2      1.000000
3      1.000000
4      0.000001
         ...   
995    0.000001
996    0.000001
997    0.000001
998    0.000001
999    0.000001
Name: SalePriceUnit, dtype: float64

In [77]:
df.SalePrice *= df.SalePriceUnit.map({'$': 1. / (10 ** 6), '$M': 1.})

In [78]:
df.SalePrice

0      0.710
1      2.150
2      5.600
3      1.500
4      0.970
       ...  
995    0.390
996    0.860
997    0.830
998    0.835
999    0.825
Name: SalePrice, dtype: float64

In [79]:
df.drop('SalePriceUnit', axis = 1, inplace = True)

> ### Activity:  Using `.map()`, convert `Size` and `LotSize` to sqft.

In [80]:
df.SizeUnit.unique()

array(['sqft', nan], dtype=object)

In [81]:
df.drop('SizeUnit', axis = 1, inplace = True)

In [82]:
df.LotSizeUnit.unique()

array([nan, 'sqft', 'ac'], dtype=object)

In [83]:
df.LotSize *= df.LotSizeUnit.map({'sqft': 1., 'ac': 43560.})

In [84]:
df.drop('LotSizeUnit', axis = 1, inplace = True)
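
One caveat when passing a dictionary to `.map()`: any value without a matching key becomes `NaN` instead of raising an error, so a mistyped or forgotten key can silently erase data. A quick sketch on a toy `Series` (the values are assumptions for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy unit column with the same three cases as LotSizeUnit (hypothetical values)
units = pd.Series(['sqft', 'ac', np.nan, 'sqft'])

# Complete dictionary: only the genuinely missing unit maps to NaN
factors = units.map({'sqft': 1., 'ac': 43560.})

# Incomplete dictionary: 'ac' has no key, so it also maps to NaN, silently
incomplete = units.map({'sqft': 1.})

print(factors.isnull().sum(), incomplete.isnull().sum())
```

Comparing the number of missing values before and after the mapping is a cheap way to catch this kind of mistake.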

### `pd.to_datetime()`

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)

In [85]:
df.DateOfSale

0       12/4/15
1      11/30/15
2      11/12/15
3      12/11/15
4       1/15/16
         ...   
995     1/15/16
996    11/20/15
997    12/10/15
998    12/15/15
999     1/10/16
Name: DateOfSale, dtype: object

So far, the dates stored in the `DataFrame` are just strings.  We cannot easily extract the day, month, or year.  Thankfully, `pandas` provides some facilities to do so.

In [86]:
pd.to_datetime(df.DateOfSale)

0     2015-12-04
1     2015-11-30
2     2015-11-12
3     2015-12-11
4     2016-01-15
         ...    
995   2016-01-15
996   2015-11-20
997   2015-12-10
998   2015-12-15
999   2016-01-10
Name: DateOfSale, dtype: datetime64[ns]

In [87]:
df.DateOfSale = pd.to_datetime(df.DateOfSale)
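
Here `pandas` infers the date layout on its own. When the layout is known, it can also be stated explicitly with the `format` parameter, which removes any ambiguity between month-first and day-first dates. A minimal sketch on a toy `Series`, assuming the month/day/two-digit-year layout seen above:

```python
import pandas as pd

# A few date strings in the same layout as DateOfSale
dates = pd.Series(['12/4/15', '11/30/15', '1/15/16'])

# %m = month, %d = day, %y = two-digit year
parsed = pd.to_datetime(dates, format='%m/%d/%y')

print(parsed)
```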

### `.apply()`

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html)

In [88]:
df.DateOfSale.apply(lambda date_of_sale: date_of_sale.year)

0      2015
1      2015
2      2015
3      2015
4      2016
       ... 
995    2016
996    2015
997    2015
998    2015
999    2016
Name: DateOfSale, dtype: int64

In [89]:
df['YearOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.year)
df['MonthOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.month)
df['DayOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.day)
# `.weekday_name` was removed in pandas 1.0; `.day_name()` is its replacement
df['WeekDayOfSale'] = df.DateOfSale.apply(lambda date_of_sale: date_of_sale.day_name())

df.drop('DateOfSale', axis = 1, inplace = True)

Now, we have the day, day of the week, month, and year of the sale as features in our dataset.
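
For datetime columns specifically, the same components are also available through the `.dt` accessor, without writing a lambda. This is a sketch of that alternative on a toy `Series`, not the approach used in the cells above; the names `dates` and `parts` are assumptions, and `.dt.day_name()` requires a reasonably recent `pandas` (0.23 or later):

```python
import pandas as pd

# Two sale dates from the first rows of the dataset, parsed as datetimes
dates = pd.to_datetime(pd.Series(['12/4/15', '11/30/15']), format='%m/%d/%y')

# Each component is computed for the whole column at once
parts = pd.DataFrame({
    'YearOfSale': dates.dt.year,
    'MonthOfSale': dates.dt.month,
    'DayOfSale': dates.dt.day,
    'WeekDayOfSale': dates.dt.day_name(),
})

print(parts)
```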

In [90]:
df

Unnamed: 0,ID,Address,Latitude,Longitude,SalePrice,...,BuiltInYear,YearOfSale,MonthOfSale,DayOfSale,WeekDayOfSale
0,15063471,"55 Vandewater St APT 9, San Francisco, CA",37805103,-122412856,0.710,...,1980.0,2015,12,4,Friday
1,15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,2.150,...,1948.0,2015,11,30,Monday
2,15063609,"819 Francisco St, San Francisco, CA",37803728,-122419055,5.600,...,1976.0,2015,11,12,Thursday
3,15064044,"199 Chestnut St APT 5, San Francisco, CA",37804392,-122406590,1.500,...,1930.0,2015,12,11,Friday
4,15064257,"111 Chestnut St APT 403, San Francisco, CA",37804240,-122405509,0.970,...,1993.0,2016,1,15,Friday
...,...,...,...,...,...,...,...,...,...,...,...
995,2124214951,"412 Green St APT A, San Francisco, CA",37800040,-122406100,0.390,...,2012.0,2016,1,15,Friday
996,2126960082,"355 1st St UNIT 1905, San Francisco, CA",37787029,-122393638,0.860,...,2004.0,2015,11,20,Friday
997,2128308939,"33 Santa Cruz Ave, San Francisco, CA",37709136,-122465332,0.830,...,1976.0,2015,12,10,Thursday
998,2131957929,"1821 Grant Ave, San Francisco, CA",37803760,-122408531,0.835,...,1975.0,2015,12,15,Tuesday
