[Pandas API reference](https://pandas.pydata.org/docs/reference/index.html)

[Numpy API reference](https://numpy.org/doc/stable/reference/)

In [1]:
import pandas as pd
import numpy as np

<h1>Series</h1>

A Series is like a “column” of data: a group of observations.

You can optionally provide a name for the series.

If an index isn't provided, Pandas will automatically generate one to uniquely identify each value in the series.
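As a minimal sketch of these options, the example below builds a small Series with and without an explicit index. The three values and parcel ids are taken from the data used in this section; the variable names are only illustrative.

```python
import pandas as pd

values = [279168, 319750, 262959]

# No index supplied: pandas generates the default labels 0, 1, 2
default_index = pd.Series(values, name="Parcel Previous Values")
print(list(default_index.index))  # [0, 1, 2]

# Explicit index supplied: each value is labeled by a parcel id
labeled = pd.Series(values,
                    index=["W29MU7581", "S64GI7738", "K89KV4863"],
                    name="Parcel Previous Values")
print(labeled["S64GI7738"])  # 319750
```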

In [3]:
s = pd.Series([279168,  319750,  262959,  311343,  235132,  169791,  250624,  241461,  298505,  236149,  394668,
                 401353, 440978, 309764, 321404, 422716, 315285, 251290, 312562, 172683], name = 'Parcel Previous Values')
print(s)

0     279168
1     319750
2     262959
3     311343
4     235132
5     169791
6     250624
7     241461
8     298505
9     236149
10    394668
11    401353
12    440978
13    309764
14    321404
15    422716
16    315285
17    251290
18    312562
19    172683
Name: Parcel Previous Values, dtype: int64


You can select an individual value by using its index.

s[index]

You can also filter the series by providing a predicate.

s[predicate]

In [4]:
print(f"The value of item #17 is {s[17]}")
print("Values larger than $350,000")
print(s[s > 350_000])

The value of item #17 is 251290
Values larger than $350,000
10    394668
11    401353
12    440978
15    422716
Name: Parcel Previous Values, dtype: int64
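The two forms of selection can also be written more explicitly. The sketch below, which uses only the first five values of the series, shows `.loc` for label lookups, `.iloc` for positional lookups, and the boolean Series that a predicate produces before it is used as a filter.

```python
import pandas as pd

s = pd.Series([279168, 319750, 262959, 311343, 235132])

# Explicit label-based and position-based lookups
print(s.loc[3])    # label 3 -> 311343
print(s.iloc[-1])  # last position -> 235132

# A predicate produces a boolean Series that is then used as a filter
mask = s > 300_000
print(mask.tolist())     # [False, True, False, True, False]
print(s[mask].tolist())  # [319750, 311343]
```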


Pandas Series are built on top of NumPy arrays and support many similar operations.

In [7]:
# Round all previous values to the nearest $100
print(((s / 100).round() * 100).astype(int))

0     279200
1     319800
2     263000
3     311300
4     235100
5     169800
6     250600
7     241500
8     298500
9     236100
10    394700
11    401400
12    441000
13    309800
14    321400
15    422700
16    315300
17    251300
18    312600
19    172700
Name: Parcel Previous Values, dtype: int32
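To illustrate the NumPy connection a little further, here is a small sketch using the first three values of the series. The $300,000 cap is an arbitrary value chosen for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([279168, 319750, 262959], name="Parcel Previous Values")

# Arithmetic is applied element-wise, as with a NumPy array
print((s * 2).tolist())  # [558336, 639500, 525918]

# NumPy functions accept a Series and return a Series with the same index
capped = np.minimum(s, 300_000)
print(capped.tolist())  # [279168, 300000, 262959]

# The underlying values are available as a NumPy array
arr = s.to_numpy()
print(type(arr).__name__, arr.max())  # ndarray 319750
```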


In [8]:
# Series also have built-in data exploration functions
s.describe()

count        20.000000
mean     297379.250000
std       75040.079191
min      169791.000000
25%      248333.250000
50%      304134.500000
75%      320163.500000
max      440978.000000
Name: Parcel Previous Values, dtype: float64
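Each figure reported by describe() can also be computed on its own. The sketch below rebuilds the same 20 values and calls the individual methods; the results match the summary above.

```python
import pandas as pd

s = pd.Series([279168, 319750, 262959, 311343, 235132, 169791, 250624,
               241461, 298505, 236149, 394668, 401353, 440978, 309764,
               321404, 422716, 315285, 251290, 312562, 172683],
              name="Parcel Previous Values")

print(s.mean())          # 297379.25
print(s.median())        # 304134.5 (the 50% row of describe())
print(s.min(), s.max())  # 169791 440978
print(s.quantile(0.75))  # 320163.5 (the 75% row of describe())
```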

In [14]:
# The default index can be replaced with a more meaningful value.
s.index = ['W29MU7581', 'S64GI7738', 'K89KV4863', 'Q52JT7514', 'A39EA7560', 'V25HQ0513', 'M81SE0853', 'F47JY4077',
           'U58BX6874', 'N43JY5958', 'Y49IM4670', 'N18AF8472', 'K96LF7279', 'I57UF2957', 'N54UV6765', 'D37LA7488', 
           'F48UO4632', 'Y09CT8886', 'K07IP9486', 'J73VD8024']

s

W29MU7581    279168
S64GI7738    319750
K89KV4863    262959
Q52JT7514    311343
A39EA7560    235132
V25HQ0513    169791
M81SE0853    250624
F47JY4077    241461
U58BX6874    298505
N43JY5958    236149
Y49IM4670    394668
N18AF8472    401353
K96LF7279    440978
I57UF2957    309764
N54UV6765    321404
D37LA7488    422716
F48UO4632    315285
Y09CT8886    251290
K07IP9486    312562
J73VD8024    172683
Name: Parcel Previous Values, dtype: int64

In [16]:
# Specific values can be retrieved by their corresponding index.
print(s['D37LA7488'])

# Updates are also applied using the index.
s['D37LA7488'] = 500000
print(s['D37LA7488'])

422716
500000


In [17]:
# The 'in' operator can be used to determine if a specific index label exists within the series.
print('D37LA7488' in s)
print('XXXXXXXXX' in s)

True
False
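Note that `in` checks the index labels, not the stored values. The sketch below uses two entries from the series to show the difference, along with two ways to test for a value.

```python
import pandas as pd

s = pd.Series([422716, 315285], index=["D37LA7488", "F48UO4632"])

# 'in' tests the index labels, not the stored values
print("D37LA7488" in s)  # True
print(422716 in s)       # False: 422716 is a value, not a label

# To test for a value, check the underlying values or use isin()
print(422716 in s.values)      # True
print(s.isin([422716]).any())  # True
```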


<h1>DataFrames</h1>
While a Series is a single column of data, a DataFrame is several columns, one for each variable.

In essence, a DataFrame in pandas is analogous to a (highly optimized) Excel spreadsheet.

Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
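As a rough illustration of the relationship between the two structures, the sketch below builds a small DataFrame from a dictionary of columns and confirms that selecting one column returns a Series. The three rows are copied from the sample parcel data.

```python
import pandas as pd

df = pd.DataFrame({
    "Parcel_Id": ["W29MU7581", "S64GI7738", "K89KV4863"],
    "YearBuilt": [2007, 1959, 1961],
    "LivingArea": [2891, 4298, 3618],
})

print(df.shape)                        # (3, 3): rows, columns
print(list(df.columns))                # ['Parcel_Id', 'YearBuilt', 'LivingArea']
print(type(df["YearBuilt"]).__name__)  # Series
```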

Let’s look at an example that reads data from the CSV file SampleParcelData.csv.

The dataset contains the following variables:

<table style="border: 1px solid; text-align: center;">
    <tr><th style="border: 1px solid"><b>Variable Name</b></th><th style="border: 1px solid; text-align: center;"><b>Description</b></th></tr>
    <tr><td style="border: 1px solid">Parcel_Id</td><td style="border: 1px solid">Parcel identification number</td></tr>
    <tr><td style="border: 1px solid">Address</td><td style="border: 1px solid">Parcel address</td></tr>
    <tr><td style="border: 1px solid">YearBuilt</td><td style="border: 1px solid">Year parcel was constructed</td></tr>
    <tr><td style="border: 1px solid">LivingArea</td><td style="border: 1px solid">Living area under air in sq. ft.</td></tr>
    <tr><td style="border: 1px solid">LandSQFT</td><td style="border: 1px solid">Land size in sq. ft.</td></tr>    
</table>

We can read this data in from a CSV file.

In [2]:
df = pd.read_csv('..//data//SampleParcelData.csv')
df

Unnamed: 0,Parcel_Id,Address,YearBuilt,LivingArea,LandSQFT
0,W29MU7581,459 Aliquam Road,2007,2891,30598
1,S64GI7738,784 Rose Avenue,1959,4298,25765
2,K89KV4863,5341 Porttitor St.,1961,3618,28207
3,Q52JT7514,1230 Magna Ave,2007,2124,13375
4,A39EA7560,4594 Nonummy Road,1981,1683,24157
5,V25HQ0513,6259 Class Road,2004,2733,9811
6,M81SE0853,7972 Rutrum Road,2003,3775,10116
7,F47JY4077,12802 Seashore Av.,1952,3765,32401
8,U58BX6874,3907 Allen Avenue,2004,3200,31792
9,N43JY5958,5703 Metus Ave,2010,3443,34212
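Since the CSV file itself is not included here, the following sketch reads the same kind of data from an in-memory string. The header layout is an assumption based on the variable table above. It also shows the `index_col` parameter of `read_csv`, which sets the row index while loading.

```python
import io
import pandas as pd

# Stand-in for the file contents (assumed layout, first three parcels only)
csv_text = """Parcel_Id,Address,YearBuilt,LivingArea,LandSQFT
W29MU7581,459 Aliquam Road,2007,2891,30598
S64GI7738,784 Rose Avenue,1959,4298,25765
K89KV4863,5341 Porttitor St.,1961,3618,28207
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)                # (3, 5)
print(df.dtypes["YearBuilt"])  # int64

# index_col uses a column as the row index instead of 0, 1, 2, ...
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Parcel_Id")
print(indexed.loc["S64GI7738", "LivingArea"])  # 4298
```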


<h4>Select Data by Position</h4>
A very common task is to find, select, and work with the subset of the data that we are interested in.

Subsetting a DataFrame is known as slicing. Pandas DataFrames offer several methods for slicing data, primarily label-based indexing with .loc and integer-based indexing with .iloc. There are also ways to slice using boolean indexing and callable functions.
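The four approaches can be compared side by side. This is a minimal sketch on a four-row frame built from the sample data; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"YearBuilt": [2007, 1959, 1961, 2007],
     "LivingArea": [2891, 4298, 3618, 2124]},
    index=["W29MU7581", "S64GI7738", "K89KV4863", "Q52JT7514"],
)

by_label = df.loc["S64GI7738":"K89KV4863"]  # label-based, end label included
by_position = df.iloc[1:3]                  # position-based, end position excluded
by_mask = df[df["LivingArea"] > 3000]       # boolean indexing
by_callable = df.loc[lambda d: d["LivingArea"] > 3000]  # callable receives the DataFrame

print(by_label.equals(by_position))  # True
print(by_mask.equals(by_callable))   # True
print(list(by_mask.index))           # ['S64GI7738', 'K89KV4863']
```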

In [None]:
# Slicing using standard Python slicing notation

df[5:8]

Unnamed: 0,Parcel_Id,Address,YearBuilt,LivingArea,LandSQFT
5,V25HQ0513,6259 Class Road,2004,2733,9811
6,M81SE0853,7972 Rutrum Road,2003,3775,10116
7,F47JY4077,12802 Seashore Av.,1952,3765,32401


In [None]:
# column selection using column name
df["YearBuilt"]

Parcel_Id
W29MU7581    2007
S64GI7738    1959
K89KV4863    1961
Q52JT7514    2007
A39EA7560    1981
V25HQ0513    2004
M81SE0853    2003
F47JY4077    1952
U58BX6874    2004
N43JY5958    2010
Y49IM4670    1989
N18AF8472    1994
K96LF7279    1965
I57UF2957    1969
N54UV6765    2013
D37LA7488    1982
F48UO4632    1982
Y09CT8886    1994
K07IP9486    1961
J73VD8024    1998
Name: YearBuilt, dtype: int64

In [21]:
# column selection using a list of strings
df[["YearBuilt", "LivingArea"]]

Unnamed: 0,YearBuilt,LivingArea
0,2007,2891
1,1959,4298
2,1961,3618
3,2007,2124
4,1981,1683
5,2004,2733
6,2003,3775
7,1952,3765
8,2004,3200
9,2010,3443


In [25]:
# Creating a boolean mask (a Series of True/False values)
large_size = df['LivingArea'] > 3000
print(large_size)

0     False
1      True
2      True
3     False
4     False
5     False
6      True
7      True
8      True
9      True
10    False
11    False
12    False
13    False
14     True
15    False
16    False
17    False
18     True
19    False
Name: LivingArea, dtype: bool


In [26]:
# slicing by boolean mask
df[large_size]

Unnamed: 0,Parcel_Id,Address,YearBuilt,LivingArea,LandSQFT
1,S64GI7738,784 Rose Avenue,1959,4298,25765
2,K89KV4863,5341 Porttitor St.,1961,3618,28207
6,M81SE0853,7972 Rutrum Road,2003,3775,10116
7,F47JY4077,12802 Seashore Av.,1952,3765,32401
8,U58BX6874,3907 Allen Avenue,2004,3200,31792
9,N43JY5958,5703 Metus Ave,2010,3443,34212
14,N54UV6765,9780 Vitae Road,2013,3554,33224
18,K07IP9486,3906 Donec Rd.,1961,3153,12491


<h4>.loc (Label-based slicing):</h4>
This method uses labels (row and column names) to select data. It includes the start and stop labels in the slice.
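The inclusive stop label is easy to miss when the index consists of the default integers, because plain slicing on the same numbers excludes the stop. A short sketch of the difference, using five living-area values from the sample data:

```python
import pandas as pd

df = pd.DataFrame({"LivingArea": [2891, 4298, 3618, 2124, 1683]})

# Plain slicing is positional and excludes the stop position
print(len(df[0:3]))      # 3 rows: positions 0, 1, 2

# .loc treats 0 and 3 as labels and includes both ends
print(len(df.loc[0:3]))  # 4 rows: labels 0, 1, 2, 3
```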


In [None]:
# df.loc[ROW SELECTION, COLUMN SELECTION (optional)]
df.loc[0:3]


   Parcel_Id             Address  YearBuilt  LivingArea  LandSQFT
0  W29MU7581    459 Aliquam Road       2007        2891     30598
1  S64GI7738     784 Rose Avenue       1959        4298     25765
2  K89KV4863  5341 Porttitor St.       1961        3618     28207
3  Q52JT7514      1230 Magna Ave       2007        2124     13375


In [None]:
# Single column selection
df.loc[:3, 'LivingArea']

0    2891
1    4298
2    3618
3    2124
Name: LivingArea, dtype: int64

In [11]:
# multiple column selection
df.loc[7:10, ['Parcel_Id', 'YearBuilt', 'LivingArea']]

Unnamed: 0,Parcel_Id,YearBuilt,LivingArea
7,F47JY4077,1952,3765
8,U58BX6874,2004,3200
9,N43JY5958,2010,3443
10,Y49IM4670,1989,1510


In [13]:
# multiple column selection all rows
df.loc[:, ['Parcel_Id', 'YearBuilt', 'LivingArea']]

Unnamed: 0,Parcel_Id,YearBuilt,LivingArea
0,W29MU7581,2007,2891
1,S64GI7738,1959,4298
2,K89KV4863,1961,3618
3,Q52JT7514,2007,2124
4,A39EA7560,1981,1683
5,V25HQ0513,2004,2733
6,M81SE0853,2003,3775
7,F47JY4077,1952,3765
8,U58BX6874,2004,3200
9,N43JY5958,2010,3443


In [24]:
df.loc[1:3, 'YearBuilt':'LivingArea']

Unnamed: 0,YearBuilt,LivingArea
1,1959,4298
2,1961,3618
3,2007,2124


In [None]:
# Change the index to a meaningful value instead of the default
df2 = df.set_index('Parcel_Id')
df2

Unnamed: 0_level_0,Address,YearBuilt,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
W29MU7581,459 Aliquam Road,2007,2891,30598
S64GI7738,784 Rose Avenue,1959,4298,25765
K89KV4863,5341 Porttitor St.,1961,3618,28207
Q52JT7514,1230 Magna Ave,2007,2124,13375
A39EA7560,4594 Nonummy Road,1981,1683,24157
V25HQ0513,6259 Class Road,2004,2733,9811
M81SE0853,7972 Rutrum Road,2003,3775,10116
F47JY4077,12802 Seashore Av.,1952,3765,32401
U58BX6874,3907 Allen Avenue,2004,3200,31792
N43JY5958,5703 Metus Ave,2010,3443,34212


In [11]:
df2.loc['A39EA7560']

Address       4594 Nonummy Road
YearBuilt                  1981
LivingArea                 1683
LandSQFT                  24157
Name: A39EA7560, dtype: object

In [12]:
# row and column slicing
df2.loc['W29MU7581':'I57UF2957', 'LivingArea' : 'LandSQFT']

Unnamed: 0_level_0,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1
W29MU7581,2891,30598
S64GI7738,4298,25765
K89KV4863,3618,28207
Q52JT7514,2124,13375
A39EA7560,1683,24157
V25HQ0513,2733,9811
M81SE0853,3775,10116
F47JY4077,3765,32401
U58BX6874,3200,31792
N43JY5958,3443,34212


In [None]:
# slice list of rows
df2.loc[['U58BX6874', 'Y49IM4670', 'N18AF8472']]

Unnamed: 0_level_0,Address,YearBuilt,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
U58BX6874,3907 Allen Avenue,2004,3200,31792
Y49IM4670,4873 Hendrerit St.,1989,1510,22460
N18AF8472,4645 Gravida Blvd.,1994,2564,27668


In [29]:
# slicing by boolean mask
large_size = df2['LivingArea'] > 3000
df2.loc[large_size]

Unnamed: 0_level_0,Address,YearBuilt,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
S64GI7738,784 Rose Avenue,1959,4298,25765
K89KV4863,5341 Porttitor St.,1961,3618,28207
M81SE0853,7972 Rutrum Road,2003,3775,10116
F47JY4077,12802 Seashore Av.,1952,3765,32401
U58BX6874,3907 Allen Avenue,2004,3200,31792
N43JY5958,5703 Metus Ave,2010,3443,34212
N54UV6765,9780 Vitae Road,2013,3554,33224
K07IP9486,3906 Donec Rd.,1961,3153,12491


In [16]:
# slicing by predicate
df2[df2['LandSQFT'] > 30000]

Unnamed: 0_level_0,Address,YearBuilt,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
W29MU7581,459 Aliquam Road,2007,2891,30598
F47JY4077,12802 Seashore Av.,1952,3765,32401
U58BX6874,3907 Allen Avenue,2004,3200,31792
N43JY5958,5703 Metus Ave,2010,3443,34212
N54UV6765,9780 Vitae Road,2013,3554,33224
Y09CT8886,7001 Taciti Ave,1994,2957,33330
J73VD8024,9037 Auctor Rd.,1998,2282,33800


<h4>.iloc (Integer-based slicing):</h4>
This method uses integer positions to select data, similar to standard Python list indexing. It excludes the stop index in the slice.
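Because .iloc ignores the labels entirely, it behaves the same whatever the index contains, and it accepts negative positions just as Python lists do. A minimal sketch on four rows of the sample data:

```python
import pandas as pd

df2 = pd.DataFrame(
    {"LivingArea": [2891, 4298, 3618, 2124],
     "LandSQFT": [30598, 25765, 28207, 13375]},
    index=pd.Index(["W29MU7581", "S64GI7738", "K89KV4863", "Q52JT7514"],
                   name="Parcel_Id"),
)

# Rows at positions 1 and 2; position 3 is excluded
print(list(df2.iloc[1:3].index))  # ['S64GI7738', 'K89KV4863']

# A single cell: row position, then column position
print(df2.iloc[0, 1])   # 30598

# Negative positions count from the end
print(df2.iloc[-1, 0])  # 2124
```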

In [None]:
# single row: df.iloc[row_index, column_index (optional)]
df2.iloc[2]

Address       5341 Porttitor St.
YearBuilt                   1961
LivingArea                  3618
LandSQFT                   28207
Name: K89KV4863, dtype: object

In [None]:
# single column: df.iloc[row_index1:row_index2, column_index]
df2.iloc[:, 3]

Parcel_Id
W29MU7581    30598
S64GI7738    25765
K89KV4863    28207
Q52JT7514    13375
A39EA7560    24157
V25HQ0513     9811
M81SE0853    10116
F47JY4077    32401
U58BX6874    31792
N43JY5958    34212
Y49IM4670    22460
N18AF8472    27668
K96LF7279    21878
I57UF2957    13685
N54UV6765    33224
D37LA7488    18016
F48UO4632    22250
Y09CT8886    33330
K07IP9486    12491
J73VD8024    33800
Name: LandSQFT, dtype: int64

In [None]:
# List of integers: df.iloc[[row_index1, row_index2], [column_index1, column_index2]]
df2.iloc[[1, 3, 5, 7], [2,3]]


Unnamed: 0_level_0,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1
S64GI7738,4298,25765
Q52JT7514,2124,13375
V25HQ0513,2733,9811
F47JY4077,3765,32401


In [24]:
# Slice of integers: df.iloc[row_index1:row_index2, column_index1:column_index2]
df2.iloc[:4, 2:]

Unnamed: 0_level_0,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1
W29MU7581,2891,30598
S64GI7738,4298,25765
K89KV4863,3618,28207
Q52JT7514,2124,13375


<h4>Select Data by Conditions</h4>
Instead of indexing rows and columns by position or name, we can also select the subset of a DataFrame that satisfies certain (potentially complicated) conditions.
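Conditions are combined with the element-wise operators `&` (and), `|` (or), and `~` (not), and each comparison needs its own parentheses because these operators bind more tightly than `>` or `<`. The sketch below uses four rows of the sample data; the thresholds are arbitrary.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"YearBuilt": [2007, 1959, 2004, 2003],
     "LivingArea": [2891, 4298, 2733, 3775],
     "LandSQFT": [30598, 25765, 9811, 10116]},
    index=["W29MU7581", "S64GI7738", "V25HQ0513", "M81SE0853"],
)

# Both conditions must hold
both = df2[(df2["LivingArea"] > 2500) & (df2["LandSQFT"] < 12000)]
# At least one condition must hold
either = df2[(df2["YearBuilt"] < 1960) | (df2["LandSQFT"] > 30000)]
# Negation of a condition
negated = df2[~(df2["YearBuilt"] >= 2000)]

print(list(both.index))     # ['V25HQ0513', 'M81SE0853']
print(list(either.index))   # ['W29MU7581', 'S64GI7738']
print(list(negated.index))  # ['S64GI7738']
```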


In [31]:
# selecting the row(s) with the largest living area
df2.loc[df2['LivingArea'] == df2['LivingArea'].max()]

Unnamed: 0_level_0,Address,YearBuilt,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
S64GI7738,784 Rose Avenue,1959,4298,25765


In [32]:
# using isin for conditional selection
df2.loc[df2['YearBuilt'].isin([1961, 1965, 1969])]

Unnamed: 0_level_0,Address,YearBuilt,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
K89KV4863,5341 Porttitor St.,1961,3618,28207
K96LF7279,3631 Amet Rd.,1965,2639,21878
I57UF2957,7807 Vivamus Rd.,1969,2943,13685
K07IP9486,3906 Donec Rd.,1961,3153,12491


In [None]:
# using multiple predicates
df2[(df2['LivingArea'] > 2500) & (df2['LandSQFT'] < 12000)]

Unnamed: 0_level_0,Address,YearBuilt,LivingArea,LandSQFT
Parcel_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
V25HQ0513,6259 Class Road,2004,2733,9811
M81SE0853,7972 Rutrum Road,2003,3775,10116


In [None]:
# exporting data
df2[(df2['LivingArea'] > 2500) & (df2['LandSQFT'] < 12000)].to_csv(".//reports//ParcelsToReview.csv")
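One caveat with exporting: `to_csv` does not create missing folders, so the target directory has to exist first. The sketch below writes a two-row result to a temporary directory (an assumption made so the example can run anywhere) and reads it back to confirm the round trip.

```python
import os
import tempfile
import pandas as pd

df2 = pd.DataFrame(
    {"LivingArea": [2733, 3775], "LandSQFT": [9811, 10116]},
    index=pd.Index(["V25HQ0513", "M81SE0853"], name="Parcel_Id"),
)

with tempfile.TemporaryDirectory() as tmp:
    report_dir = os.path.join(tmp, "reports")
    os.makedirs(report_dir, exist_ok=True)  # create the folder before writing
    path = os.path.join(report_dir, "ParcelsToReview.csv")

    df2.to_csv(path)  # the index is written as the first column
    restored = pd.read_csv(path, index_col="Parcel_Id")

print(list(restored.index))                    # ['V25HQ0513', 'M81SE0853']
print(restored.loc["M81SE0853", "LandSQFT"])   # 10116
```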