[Pandas API reference](https://pandas.pydata.org/docs/reference/index.html)

[Numpy API reference](https://numpy.org/doc/stable/reference/)

In [None]:
import pandas as pd
import numpy as np

<h1>Series</h1>

A Series is like a “column” of data: a group of observations.

You can optionally provide a name for the series.

If an index isn't provided, Pandas will automatically generate one to uniquely identify each value in the series.
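
As a small sketch of the two cases (the values and the labels "a", "b", "c" are made up for illustration), the generated index is a simple 0-based range, while an explicit `index` argument supplies your own labels:

```python
import pandas as pd

# Illustrative values only; not taken from the parcel data
default_idx = pd.Series([100, 200, 300], name="Example")
custom_idx = pd.Series([100, 200, 300], index=["a", "b", "c"], name="Example")

print(default_idx.index)  # generated automatically, starting at 0
print(custom_idx.index)   # the labels supplied at construction
print(default_idx.name)   # the optional name
```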

In [None]:
s = pd.Series([279168,  319750,  262959,  311343,  235132,  169791,  250624,  241461,  298505,  236149,  394668,
                 401353, 440978, 309764, 321404, 422716, 315285, 251290, 312562, 172683], name = 'Parcel Previous Values')
print(s)

You can select an individual value by using its index.

s[index]

You can also filter the series by providing a predicate.

s[predicate]

In [None]:
print(f"The value of item #17 is {s[17]}")
print("Values larger than $350,000")
print(s[s > 350_000])

Pandas Series are built on top of NumPy arrays and support many similar operations.

In [None]:
# Round all previous values to the nearest $100
print(((s / 100).round() * 100).astype(int))
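
Beyond arithmetic, NumPy functions can usually be applied to a Series directly. The sketch below uses the first three values from the series above; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([279168, 319750, 262959], name="Parcel Previous Values")

# NumPy universal functions work element-wise and return a Series
log_values = np.log10(values)

# to_numpy() exposes the data as a plain NumPy array
arr = values.to_numpy()

print(type(log_values).__name__)
print(type(arr).__name__)
```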

In [None]:
# Series also have built-in data exploration functions
s.describe()

In [None]:
# The default index can be replaced with more meaningful values.
s.index = ['W29MU7581', 'S64GI7738', 'K89KV4863', 'Q52JT7514', 'A39EA7560', 'V25HQ0513', 'M81SE0853', 'F47JY4077',
           'U58BX6874', 'N43JY5958', 'Y49IM4670', 'N18AF8472', 'K96LF7279', 'I57UF2957', 'N54UV6765', 'D37LA7488', 
           'F48UO4632', 'Y09CT8886', 'K07IP9486', 'J73VD8024']

s

In [None]:
# Specific values can be retrieved by their corresponding index.
print(s['D37LA7488'])

# Updates are also applied using the index.
s['D37LA7488'] = 500000
print(s['D37LA7488'])

In [None]:
# The 'in' operator checks whether a specific index label exists within the series.
print('D37LA7488' in s)
print('XXXXXXXXX' in s)
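
Note that `in` tests the index labels, not the stored values. A minimal sketch, using two parcel IDs and values from the series above (`s_small` is a name chosen just for this example):

```python
import pandas as pd

s_small = pd.Series([279168, 319750], index=["W29MU7581", "S64GI7738"])

print("W29MU7581" in s_small)        # True: it is an index label
print(279168 in s_small)             # False: 279168 is a value, not a label
print(279168 in s_small.to_numpy())  # True: membership test on the values
print(s_small.isin([279168]).any())  # True: the same check with Series.isin
```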

<h1>DataFrames</h1>
While a Series is a single column of data, a DataFrame is several columns, one for each variable.

In essence, a DataFrame in pandas is analogous to a (highly optimized) Excel spreadsheet.

Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
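
To make the "several columns" idea concrete, here is a small sketch that builds a DataFrame from a dictionary. The numbers are made up for illustration; the column names are borrowed from the parcel dataset introduced next.

```python
import pandas as pd

# Made-up values for illustration
small_df = pd.DataFrame(
    {
        "YearBuilt": [1961, 1987, 2004],
        "LivingArea": [1850, 3200, 2700],
    },
    index=["W29MU7581", "S64GI7738", "K89KV4863"],
)

# Each column of a DataFrame is a Series that shares the row index
year_built = small_df["YearBuilt"]
print(type(year_built).__name__)
print(year_built.index.equals(small_df.index))
print(small_df.shape)
```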

Let’s look at an example that reads data from the CSV file SampleParcelData.csv.

The dataset contains the following variables:

<table style="border: 1px solid; text-align: center;">
    <tr><th style="border: 1px solid"><b>Variable Name</b></th><th style="border: 1px solid; text-align: center;"><b>Description</b></th></tr>
    <tr><td style="border: 1px solid">Parcel_Id</td><td style="border: 1px solid">Parcel identification number</td></tr>
    <tr><td style="border: 1px solid">Address</td><td style="border: 1px solid">Parcel address</td></tr>
    <tr><td style="border: 1px solid">YearBuilt</td><td style="border: 1px solid">Year parcel was constructed</td></tr>
    <tr><td style="border: 1px solid">LivingArea</td><td style="border: 1px solid">Sqft under air</td></tr>
    <tr><td style="border: 1px solid">LandSQFT</td><td style="border: 1px solid">Land size in sq. ft.</td></tr>    
</table>

We can read this data in from a CSV file.

In [None]:
df = pd.read_csv('../data/SampleParcelData.csv')
df

<h4>Select Data by Position</h4>
A common task is to find, select, and work with the subset of the data we are interested in.

Subsetting a DataFrame is known as slicing. Pandas DataFrames offer several methods for slicing data, primarily label-based indexing with .loc and integer-based indexing with .iloc. There are also ways to slice using boolean indexing and callable functions.
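
Callable indexing is the one method not demonstrated in the cells that follow, so here is a minimal sketch. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({"YearBuilt": [1961, 1987, 2004], "LivingArea": [1850, 3200, 2700]})

# The callable receives the DataFrame and returns a boolean mask
big = demo.loc[lambda d: d["LivingArea"] > 2500]

# Handy in method chains, where the intermediate result has no name
big_sorted = (
    demo.sort_values("YearBuilt", ascending=False)
        .loc[lambda d: d["LivingArea"] > 2500]
)
print(big_sorted)
```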

In [None]:
# Row slicing using standard Python slice notation (rows at positions 5, 6 and 7)

df[5:8]

In [None]:
# column selection using column name
df["YearBuilt"]

In [None]:
# column selection using a list of strings
df[["YearBuilt", "LivingArea"]]

In [None]:
# Creating a boolean mask (a Series of True/False values, one per row)
large_size = df['LivingArea'] > 3000
print(large_size)

In [None]:
# slicing by boolean mask
df[large_size]

<h4>.loc (Label-based slicing):</h4>
This method uses labels (row and column names) to select data. It includes the start and stop labels in the slice.


In [None]:
# df.loc[ROW SELECTION, COLUMN SELECTION (optional)]
df.loc[0:3]


In [None]:
# Single column selection
df.loc[:3, 'LivingArea']

In [None]:
# multiple column selection
df.loc[7:10, ['Parcel_Id', 'YearBuilt', 'LivingArea']]

In [None]:
# multiple column selection all rows
df.loc[:, ['Parcel_Id', 'YearBuilt', 'LivingArea']]

In [None]:
df.loc[1:3, 'YearBuilt':'LivingArea']

In [None]:
# Change the index to meaningful values instead of the default
df2 = df.set_index('Parcel_Id')
df2
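
One detail worth noting: `set_index` returns a new DataFrame and leaves the original unchanged, which is why the result is assigned to `df2`. A small sketch with made-up years (`demo`, `indexed`, and `restored` are names chosen for this example):

```python
import pandas as pd

# Made-up years for illustration
demo = pd.DataFrame({"Parcel_Id": ["W29MU7581", "S64GI7738"], "YearBuilt": [1961, 1987]})

indexed = demo.set_index("Parcel_Id")

print("Parcel_Id" in demo.columns)     # True: the original is unchanged
print("Parcel_Id" in indexed.columns)  # False: the column became the index

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
print(list(restored.columns))
```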

In [None]:
df2.loc['A39EA7560']

In [None]:
# row and column slicing
df2.loc['W29MU7581':'I57UF2957', 'LivingArea' : 'LandSQFT']

In [None]:
# slice list of rows
df2.loc[['U58BX6874', 'Y49IM4670', 'N18AF8472']]

In [None]:
# slicing by boolean mask
large_size = df2['LivingArea'] > 3000
df2.loc[large_size]

In [None]:
# slicing by predicate
df2[df2['LandSQFT'] > 30000]

<h4>.iloc (Integer-based slicing):</h4>
This method uses integer positions to select data, similar to standard Python list indexing. It excludes the stop index in the slice.
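
The difference in how the two indexers treat the stop value is easy to trip over when the index is the default 0-based range. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({"LivingArea": [1850, 3200, 2700, 2100, 1500]})

by_label = demo.loc[0:3]      # labels 0, 1, 2, 3 (stop included)
by_position = demo.iloc[0:3]  # positions 0, 1, 2 (stop excluded)

print(len(by_label))     # 4
print(len(by_position))  # 3
```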

In [None]:
# single row: df.iloc[row_index, column_index (optional)]
df2.iloc[2]

In [None]:
# single column: df.iloc[row_index1:row_index2, column_index]
df2.iloc[:, 3]

In [None]:
# List of integers: df.iloc[[row_index1, row_index2], [column_index1, column_index2]]
df2.iloc[[1, 3, 5, 7], [2,3]]


In [None]:
# Slice of integers: df.iloc[row_index1:row_index2, column_index1:column_index2]
df2.iloc[:4, 2:]

<h4>Select Data by Conditions</h4>
Instead of indexing rows and columns by integer position or name, we can also select the subset of a DataFrame that satisfies certain (potentially complicated) conditions.


In [None]:
# select the row(s) with the largest living area
df2.loc[df2['LivingArea'] == df2['LivingArea'].max()]

In [None]:
# using isin for conditional selection
df2.loc[df2['YearBuilt'].isin([1961, 1965, 1969])]

In [None]:
# using multiple predicates
df2[(df2['LivingArea'] > 2500) & (df2['LandSQFT'] < 12000)]
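
Predicates can also be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` and `<`. The sketch below uses made-up values; `DataFrame.query` is shown as one alternative way to write the "and" condition.

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame(
    {"LivingArea": [1850, 3200, 2700, 2600], "LandSQFT": [9000, 15000, 11000, 35000]},
    index=["W29MU7581", "S64GI7738", "K89KV4863", "Q52JT7514"],
)

# OR: either condition may hold
either = demo[(demo["LivingArea"] > 3000) | (demo["LandSQFT"] > 30000)]

# NOT: invert a mask
not_large = demo[~(demo["LivingArea"] > 2500)]

# The same AND condition as above, written as a query string
both = demo.query("LivingArea > 2500 and LandSQFT < 12000")

print(either)
print(not_large)
print(both)
```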

In [None]:
# exporting the filtered rows to a CSV file
df2[(df2['LivingArea'] > 2500) & (df2['LandSQFT'] < 12000)].to_csv("./reports/ParcelsToReview.csv")
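
`to_csv` does not create missing folders, so the target directory must exist before the export runs. The sketch below is one way to handle that, assuming made-up data and a temporary directory standing in for the reports folder. It also shows that the index is written as the first column and can be restored with `index_col`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame(
    {"LivingArea": [2700, 2600]},
    index=pd.Index(["K89KV4863", "Q52JT7514"], name="Parcel_Id"),
)

# A temporary directory stands in for the notebook's reports folder
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "reports"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    out_file = out_dir / "ParcelsToReview.csv"
    demo.to_csv(out_file)                       # the index becomes the first column
    round_trip = pd.read_csv(out_file, index_col="Parcel_Id")

print(round_trip)
```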