This notebook handles **most** of the downloading, loading, parsing, joining, and saving of the combined [PLUTO](http://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page), [Rolling Sales](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page), and [RPAD](https://www1.nyc.gov/site/finance/taxes/property-assessments.page) datasets. A bit of the work is done out-of-band; you'll see why.

Note that in all of these datasets the unit of measurement is a borough-block-lot combination, each of which maps uniquely to an individual property. That property may be a building (in the case of actual buildings, small homes, and co-ops) or it may be an apartment (in the case of condominiums). Assessment values and market values for apartments are present in the RPAD and Rolling Sales datasets, respectively, but none of the three datasets provides the size of individual apartments. There is therefore no way to standardize apartment values by area, and these records are removed at a later step. The focus of this project is therefore on whole-building values.

Now on to the datasets.

RPAD is a record of the assessed value, as determined by the New York City Department of Finance and used for taxation purposes, of every building and apartment in New York City. These assessed values are by the department's own admission generally a few cycles or years behind the trend of the market, but are nevertheless a valuable and almost complete record of all property values in New York City.

The PLUTO dataset agglomerates a large number of datasets published by various agencies in New York City into a single master record for categorical information on every property in New York City. It gives RPAD and Rolling Sales information a rich context.

Rolling Sales contains the actual market prices of all buildings sold in New York City in the last twelve months. Past that horizon real estate market trends make this data less concrete. Since market value is our target variable, Rolling Sales contains ground truths about what we would like to model. On the other hand it also contains a significant amount of noise; how to deal with that is discussed later.

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import requests
import zipfile
import io

In [2]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 20)

# Download

Rolling sales data is provided as lightly formatted `xls` files (Excel):

![alt text](./rolling-sales-data-excel-screencap.png "")

Luckily the extraneous details are easily patched up post-import.

In [3]:
rolling_sales_data = dict()
rolling_sales_data_key_pairs = {'Manhattan': 'manhattan',
                                'Brooklyn': 'brooklyn',
                                'Queens': 'queens',
                                'Bronx': 'bronx',
                                'Staten Island': 'statenisland'}
for b_k, b_xls in tqdm(list(rolling_sales_data_key_pairs.items())):
    borough_rsd = pd.read_excel('https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_{0}.xls'.format(b_xls))
    borough_rsd.columns = borough_rsd.iloc[3].values
    borough_rsd = borough_rsd[4:]
    rolling_sales_data[b_k] = borough_rsd

100%|██████████| 5/5 [00:25<00:00,  5.25s/it]
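The header patch-up in the loop above is easier to see on a toy frame. This is a minimal sketch, not the real spreadsheet: the title rows, column names, and values below are made up for illustration, and only the "promote one row to header, drop everything above it" idea is taken from the cell.

```python
import pandas as pd

# Toy stand-in for one borough's sheet: a few title rows, then the real header.
# All values here are invented for illustration.
raw = pd.DataFrame([
    ["Manhattan Rolling Sales", None, None],
    ["All sales, last 12 months", None, None],
    [None, None, None],
    ["BOROUGH", "BLOCK", "SALE PRICE"],
    [1, 376, 3750000],
    [1, 377, 0],
])

header_row = 3  # position of the real header in this toy frame
sheet = raw.iloc[header_row + 1:].copy()      # keep only the data rows
sheet.columns = raw.iloc[header_row].values   # promote the header row to column names
sheet = sheet.reset_index(drop=True)
print(sheet)
```

`pandas.read_excel` also takes a `header` argument naming the row to use for column labels, which may make this step unnecessary at import time; that variant is untested against these particular files.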


PLUTO data is provided as borough-denominated `csv` files packaged into a `zip`. The following bit of code unpacks the data and rekeys the files (`QN.csv`, `BK.csv`, `BX.csv`, `SI.csv`, `Mn.csv`) to match the lexicon used for the rolling sales data, above.

In [4]:
r = requests.get('http://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_15v1.zip')
pluto_key_pairs = {'Manhattan': 'Mn.csv',
                   'Brooklyn': 'BK.csv',
                   'Bronx': 'BX.csv',
                   'Staten Island': 'SI.csv',
                   'Queens': 'QN.csv'}
pluto_data = dict()
for b_k, b_csv in tqdm(list(pluto_key_pairs.items())):
    with zipfile.ZipFile(io.BytesIO(r.content)) as ar:
        borough_pluto = pd.read_csv(ar.open(b_csv))
        pluto_data[b_k] = borough_pluto

100%|██████████| 5/5 [00:24<00:00,  4.58s/it]


Both the PLUTO and Rolling Sales datasets provide their data on a per-borough basis, and since we would like to study the entire city, we must now flatten each set of tables into two big tables. Along the way we encode an additional `Borough` column, to preserve information.

**Note**: the `Borough` mapping is a new column in the rolling sales dataset; a preexisting (numerically encoded) `BOROUGH` column is removed. In the `PLUTO` dataset the mapping overwrites an older (acronym-encoded) column of the same name.

In [5]:
rolling_sales_agglom = pd.DataFrame(columns=rolling_sales_data['Manhattan'].columns)
pluto_data_agglom = pd.DataFrame(columns=pluto_data['Manhattan'].columns)
for b_k in tqdm(pluto_key_pairs.keys()):
    pluto_data[b_k]['Borough'] = rolling_sales_data[b_k]['Borough'] = b_k
    rolling_sales_agglom = pd.concat([rolling_sales_agglom, rolling_sales_data[b_k]], ignore_index=True)
    pluto_data_agglom = pd.concat([pluto_data_agglom, pluto_data[b_k]], ignore_index=True)
del rolling_sales_agglom['BOROUGH']

100%|██████████| 5/5 [00:08<00:00,  2.00s/it]
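For reference, the same flattening can be written as a single `pd.concat` over the per-borough tables. This is a sketch on toy frames (the `per_borough` dictionary and its values are made up), offered as an alternative to growing the table inside the loop rather than as what the cell above does.

```python
import pandas as pd

# Toy per-borough tables, standing in for rolling_sales_data / pluto_data.
per_borough = {
    "Manhattan": pd.DataFrame({"Block": [1, 2], "Lot": [10, 20]}),
    "Bronx": pd.DataFrame({"Block": [3], "Lot": [30]}),
}

# Tag each table with its borough name, then stack them all in one call.
combined = pd.concat(
    [table.assign(Borough=name) for name, table in per_borough.items()],
    ignore_index=True,
)
print(combined)
```

Concatenating once avoids re-copying the accumulated table on every iteration, which matters more for PLUTO than for the much smaller sales tables.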


RPAD data is the trickiest. RPAD is split into two files, one for properties in tax class 1 (one-, two-, and three-family homes) and one for those in tax classes 2/3/4 (everything else). These are provided as compressed `zip` files containing `mdb` database files (for Microsoft Access).

I handled this the easiest way I could: by jumping on a Windows desktop, opening the databases in Microsoft Access, and exporting them to comma-delimited `txt` files using the GUI (be sure to check the box that asks if you'd like to include the field names on the first row!). There are automated ways of doing this but they're sure to be painful and this is easiest. The resulting files can then be read by `pandas`.

Check out [mdbtools](https://github.com/brianb/mdbtools) (the [homebrew installation](http://brewformulas.org/Mdbtools) for Mac OS X) for \*nix systems (not sure what the easiest way to get it on Linux is). If you're working on Windows and don't have access to Microsoft Access (heh), I'm not sure but I think [pyodbc](https://github.com/mkleehammer/pyodbc) will work.
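If you do go the mdbtools route, the export could look something like the sketch below. It assumes mdbtools is installed and its `mdb-export` command (which writes a table as CSV to standard output) is on the `PATH`; the helper names and the file and table names in the commented-out call are hypothetical, and none of this was run against the actual RPAD databases.

```python
import subprocess


def mdb_export_command(mdb_path, table):
    """Build the argument list for mdbtools' `mdb-export <file> <table>`."""
    return ["mdb-export", mdb_path, table]


def export_table(mdb_path, table, out_path):
    """Run mdb-export and save its CSV output. Requires mdbtools on the PATH."""
    with open(out_path, "w") as out:
        subprocess.run(mdb_export_command(mdb_path, table), stdout=out, check=True)


# Hypothetical file and table names, for illustration only:
# export_table("tc1.mdb", "tc1", "tc1.txt")
print(mdb_export_command("tc1.mdb", "tc1"))
```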

The odd filenames come from the odd names the underlying database uses for its tables. They stand for tax class 1 and tax classes 2/3/4.

In [11]:
%ls

both-tabs-and-commas-screencap.png      nyc_buildings.csv
data-munging.ipynb                      nyc_pluto.csv
data-scribbles.ipynb                    nyc_rolling_sales.csv
nyc_building_sales.csv                  rolling-sales-data-excel-screencap.png


In [10]:
rpad_data_agglom = pd.concat([pd.read_csv("tc1.txt"), pd.read_csv("tc234.txt")], ignore_index=True)

# Merge

Now we flatten these two files into one by performing an outer join on the `(Borough, Block, Lot)` unique key (this is *supposedly* [standard](http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup) but turned out to be far far harder than anticipated&mdash;keep reading for more on why).

In order to do this we first remap the column names in the rolling sales dataset from `SPACED ALL CAPS` to `NoSpaceCamelCase` (as used by `PLUTO`).

In [None]:
rolling_sales_agglom.columns = [c.title().replace(' ', '') for c in list(rolling_sales_agglom.columns)]

Not every record in the Rolling Sales dataset represents an actual sale. A large number of records are effectively deed transfers: sales of a building for either `$0` or occasionally some paltry sum, usually between family members, sometimes as a part of a contract, and so on. Since these records do not encode any actual market information they should be dropped.

`$1000` is arbitrarily chosen as a cutoff value.

In [None]:
rolling_sales_agglom = rolling_sales_agglom[rolling_sales_agglom['SalePrice'] > 1000]

The Rolling Sales dataset contains information on the sale of both individual apartments and whole buildings, but does not contain the square footage of the apartments sold. Since we need this for calculating a standardized value per square foot, we unfortunately have to throw apartments out of the dataset entirely.

To do this we remove entries with no defined `LandSquareFeet` (no apartments have this value in any of the datasets, weirdly enough) and restrict apartment number to a lack of one, which is weirdly `'            '`, as in, a long space. See [Cleaning](#Cleaning) for more on why.

In [None]:
# `.ix` has been removed from pandas; `.iloc` selects the first remaining row by position.
rolling_sales_agglom.iloc[0]['ApartmentNumber']

In [None]:
rs_a_f = rolling_sales_agglom[rolling_sales_agglom['ApartmentNumber'] == '            ']

In [None]:
rs_a_f = rs_a_f[rs_a_f['LandSquareFeet'] > 0]
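Matching the exact run of spaces works, but it depends on the padding width staying the same. A slightly more forgiving sketch, on made-up rows, strips the field first and then tests for an empty string; this is an alternative to the comparison above, not what the notebook ran.

```python
import pandas as pd

# Toy rows: padded blanks mean "no apartment number".
sales = pd.DataFrame({
    "ApartmentNumber": ["            ", "4B          ", "   ", "12"],
    "LandSquareFeet": [2000, 0, 1500, 0],
})

# Stripping first makes the test independent of how many spaces pad the field.
no_apartment = sales["ApartmentNumber"].str.strip() == ""
whole_buildings = sales[no_apartment & (sales["LandSquareFeet"] > 0)]
print(whole_buildings)
```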

Our assumption going into this project was that the `Borough`-`Block`-`Lot` columns present in both PLUTO and the Rolling Sales dataset (these are an identifier used for taxation purposes) are equivalent, allowing us to use this combo as a unique key for a join. However, this assumption turns out to be incorrect. Both PLUTO and Rolling Sales break this assumption, and in different ways.

The PLUTO dataset contains information on the square footage of entire buildings, but not on that of individual units. This is because instead of using the usual Tax-Block-Lot system PLUTO defines and uses its own Lot configuration, painting over condominiums with multiple lots in a single building by merging them into one Lot. This is useful for geospatial visualization but voids the use of the dataset for ordinary apartment-based residential lookup: the Rolling Sales dataset contains sales information on individual apartments which, as a consequence of this tweak, cannot be mapped to any of the records in PLUTO.

That doesn't really bother us though&mdash;apartments were already disqualified by the fact that nowhere is there a data stream for their size, without which their price cannot be standardized.

Rolling Sales breaks the uniqueness assumption by, bizarrely enough, allowing the sale of sub-units within a single property block. This is really disorienting: isn't the whole point of having a single identifier that it's atomic? Apparently in the city's eyes, no. And thus it is that individual sales of components of a property are all recorded separately!

How do we handle this? `Rolling Sales` contains both the `SalePrice` and the square footage of the property sold. Instead of working with `SalePrice` directly, we will compute an average `MarketValueSqFt` based on the sum of the `SalePrice` fields divided by the sum of the `LandSquareFeet` fields for each unique `Borough`-`Block`-`Lot` combination in the dataset.

In [None]:
# This one-liner is a doozy! Here's what it does, step-by-step:
# 1. Select our desired slice of the variables from the rolling sales data.
# 2. Aggregate by Borough-Block-Lot, creating a groupby object.
# 3. Merge the non-key variables via summation, converting the groupby object to the hierarchical DataFrame.
# 4. Reset the index to shake off the hierarchical index and recreate a simple numerical one.
# 5. Assign all of that to rs_a_ff.
rs_a_ff = rs_a_f[['Borough', 'Block', 'Lot', 'SalePrice', 'LandSquareFeet']].groupby(by=['Borough', 'Block', 'Lot']).sum().reset_index()
# Now all that's left is to create a new column for market value by broadcasting division.
rs_a_ff['MarketValueSqFt'] = rs_a_ff['SalePrice'] / rs_a_ff['LandSquareFeet']
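To see why the fold sums both columns before dividing, here is a minimal sketch with invented numbers. Dividing summed price by summed area gives an area-weighted value per square foot for the lot, which is generally not the same as averaging the per-sale ratios.

```python
import pandas as pd

# Two partial sales on the same (made-up) lot, plus one ordinary sale.
sales = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens"],
    "Block": [2260, 2260, 10],
    "Lot": [62, 62, 5],
    "SalePrice": [90000.0, 110000.0, 500000.0],
    "LandSquareFeet": [300.0, 1700.0, 2500.0],
})
keys = ["Borough", "Block", "Lot"]

# Fold duplicate keys by summation, then divide the totals.
folded = sales.groupby(keys, as_index=False)[["SalePrice", "LandSquareFeet"]].sum()
folded["MarketValueSqFt"] = folded["SalePrice"] / folded["LandSquareFeet"]

# For comparison: the unweighted mean of the per-sale ratios on the Bronx lot.
bronx = sales[sales["Borough"] == "Bronx"]
mean_of_ratios = (bronx["SalePrice"] / bronx["LandSquareFeet"]).mean()
print(folded)
print(mean_of_ratios)
```

On these toy numbers the folded Bronx lot comes out at 100 per square foot, while the mean of the two per-sale ratios is about 182, because the small expensive slice dominates the unweighted average.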

How many records did we fold this way?

In [None]:
len(rs_a_f) - len(rs_a_ff)

Now the join.

In [None]:
rolling_pluto = pd.merge(rs_a_ff, pluto_data_agglom,
                         how='outer', on=['Borough', 'Block', 'Lot'])

It's not immediately apparent why, but this resulted in slightly more records than expected.

In [None]:
len(rolling_pluto) - len(pluto_data_agglom)

Should we be worried about 71 records amongst 859535 of them? Probably not. A banding procedure I use later on probably wipes most of these remaining pressure points out.
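One way to investigate a surplus like this is the `indicator` option of `pd.merge`, which labels each output row by the side it came from. The sketch below uses toy tables and only illustrates the diagnostic; whether unmatched sales actually explain the extra records here is an assumption that would need checking against the real data.

```python
import pandas as pd

keys = ["Borough", "Block", "Lot"]
# Toy tables: one sale matches a lot, one sale has no matching lot.
sales = pd.DataFrame({"Borough": ["Bronx", "Queens"], "Block": [1, 9], "Lot": [1, 9],
                      "SalePrice": [250000, 400000]})
lots = pd.DataFrame({"Borough": ["Bronx", "Bronx"], "Block": [1, 2], "Lot": [1, 2],
                     "BldgArea": [1200, 900]})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'.
merged = pd.merge(sales, lots, how="outer", on=keys, indicator=True)
unmatched_sales = merged[merged["_merge"] == "left_only"]
print(merged["_merge"].value_counts())
```

When the keys are unique on both sides, the surplus `len(merged) - len(lots)` equals the number of `left_only` rows, so the two counts can be compared directly.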

`SalePrice`-populated mergers fired correctly.

In [None]:
len(rs_a_ff) - len(rolling_pluto[rolling_pluto['SalePrice'] >= 0])

Next up we want to merge in the RPAD dataset. RPAD has a borough column much like the Rolling Sales dataset&mdash;numbers instead of the borough names that we need to join it to `rolling_pluto`. Since the data isn't provided on a borough-denominated basis, in this case we'll have to explicitly map the `BORO` numerical column to a name value.

In [None]:
rpad_key_pairs = {1.0: 'Manhattan',
                  2.0: 'Bronx',
                  3.0: 'Brooklyn',
                  4.0: 'Queens',
                  5.0: 'Staten Island',
                 }
rpad_data_agglom['Borough'] = rpad_data_agglom['BORO'].apply(lambda n: rpad_key_pairs[n])
del rpad_data_agglom['BORO']

Again this dataset has its own format for variables, in this case the format is `ALL_CAPS_SPACERS`. Again we convert to `CamelCase`.

In [None]:
rpad_data_agglom.columns = [c.title().replace('_', '') for c in list(rpad_data_agglom.columns)]

In [None]:
rpad_data_agglom.head(5)

The variables of interest in RPAD (what we want to keep after the merge) are:

* Borough (merge key).
* Block (merge key).
* Lot (merge key).
* CurFvT: Current market value, total, as of assessment sometime in 2015 and as determined by the city. This is the Finance office's best guess as to the value of this property.
* NewFvT: New market value, total, as of prospective assessment in early 2016 and as determined by the city. Note that the Finance office lags, by its own admission, generally a cycle or two behind the movements of the real estate market.
* CuravtA: Current assessed value, total, as of assessment sometime in 2015 and as determined by the city. Assessed value is computed using a complex and abstruse formulaic determination of rental market value. The very similar `Curavt` is a related statistic which is rebalanced to increase no more than `20%` a year, per tax laws ([source](https://www1.nyc.gov/site/finance/taxes/property-determining-your-transitional-assessed-value.page)); it is excluded because `CuravtA` is, therefore, better correlated with actual value. `Curavt`, not market value, is what is used by the city to assess residential tax.

Tentative and final assessment information is excluded because it is incomplete pending the release of this dataset for the most recent financial year.

In [None]:
rpad_columns_of_interest = ['Borough', 'Block', 'Lot', 'CurFvT', 'NewFvT', 'CuravtA']

In [None]:
# rpad_data_agglom[rpad_columns_of_interest]

In [None]:
counts = rpad_data_agglom.groupby(by=['Borough', 'Block', 'Lot']).count().reset_index()

In [None]:
counts = counts.sort_values(by='CurFvT', ascending=False)
counts[counts['Bble'] > 1]

OK now we can merg&mdash;nope!

Merging these two datasets as-is created >2800 extra entries, indicating that `Borough`-`Block`-`Lot` *still* wasn't unique somewhere.

Regrouping the `RPAD` data according to this schema reveals that this is indeed the case:

In [None]:
rpad_data_agglom.groupby(by=['Borough', 'Block', 'Lot']).count()
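A more direct way to pull out just the offending rows is `DataFrame.duplicated` with `keep=False`, which flags every member of a duplicated key rather than only the repeats. A small sketch on invented records:

```python
import pandas as pd

keys = ["Borough", "Block", "Lot"]
# Toy records: one lot appears three times, another once. Values are made up.
rpad = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Bronx", "Queens"],
    "Block": [2260, 2260, 2260, 10],
    "Lot": [62, 62, 62, 5],
    "CurFvT": [900000, 0, 0, 450000],
})

# keep=False marks all rows whose key occurs more than once.
dupes = rpad[rpad.duplicated(subset=keys, keep=False)]
print(dupes)
```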

Zooming in on one of the cases shows why: easements are all listed separately within the file!

In [None]:
rpad_data_agglom[(rpad_data_agglom['Borough'] == 'Bronx') &
                 (rpad_data_agglom['Block'] == 2260) &
                 (rpad_data_agglom['Lot'] == 62)]

Ok...let's fix this one too. Easements indicate encumbrances on the property by others, usually the city having a subway track lying on a piece of your land or something similar (also, this property with six easements isn't even the winner, there's another somewhere with ten!) #useless-factoids.

Easements have no property value, so we can simply remove the entries corresponding to them.

In [None]:
rpad_data_agglom_f = rpad_data_agglom[rpad_data_agglom['Ease'].isnull()]

Finally the merge. Yes! It's really happening!

In [None]:
rolling_pluto_rpad = pd.merge(rolling_pluto, rpad_data_agglom_f[rpad_columns_of_interest],
                              how='inner', on=['Borough', 'Block', 'Lot'])

A bunch of records are lost in the join, which is good: the rolling pluto dataset is by and large cleaner than the RPAD one, so we ought to be refusing bad records, not creating new ones.

In [None]:
len(rolling_pluto_rpad) - len(rolling_pluto)

And the size of the loss is small enough to be acceptable.

In [None]:
(len(rolling_pluto_rpad) - len(rolling_pluto)) / len(rolling_pluto_rpad)

# Cleaning

`BldgArea` is an important variable for us, but quite a few records are missing it.

In [None]:
rolling_pluto_rpad['BldgArea'].value_counts()[0.0]

There's no way to interpolate that information so we'll just drop them. Ditto with records missing an `Address`&mdash;these are usually malformed records or records of places belonging to the city or to the state.

In [None]:
rolling_pluto_rpad = rolling_pluto_rpad[rolling_pluto_rpad['BldgArea'] > 0]

Some entries in the PLUTO dataset are missing an address. Certain city-owned or public spaces&mdash;parks, for example&mdash;don't get one because they're not even technically buildings. The inner join with RPAD *appears* to have taken care of this problem, but for safety's sake let's explicitly disallow it.

In [None]:
rolling_pluto_rpad = rolling_pluto_rpad[rolling_pluto_rpad['Address'].notnull()]

Now we hit the next issue. At the copy step during our merge process `pandas` complained that many of the columns that we are working with have a mixed `dtype`:

> `DtypeWarning: Columns (4,6,7,8,10,11,50,52,53,77,79) have mixed types. Specify dtype option on import or set low_memory=False.`

After casting these columns using `astype(float)` failed I wrote a `try`/`except` block and caught on to a sentinel value in `CT2010` of `'       '`, much like the strange spacer above, that was messing with the column type. The other columns also have variable-length spacers like this floating around in them.

In [None]:
len(rolling_pluto_rpad[rolling_pluto_rpad['CT2010'] == '       '])

In [None]:
len(rolling_pluto_rpad[rolling_pluto_rpad['CB2010'] == '     '])

What a mess. This problem occurs in both float columns and string columns and occurs with strings of variable length, making it difficult to pick out how to cast to get rid of it. The problem isn't immediately evident from `pandas` displays because `pandas` strips empty space from its display, but if you `to_csv()` the file and check it out in a text editor, the reason why emerges:

![alt text](./both-tabs-and-commas-screencap.png "")

Tab delimiters and comma delimiters are two of the most common (if not *the* most common) ways of storing tabular data in a text file. Rather than choose one format or the other, the PLUTO compilers appear to have chosen...both.

The following (very inefficient) loop cleans this up.

In [12]:
def convert_floats_and_whitespace_strings_to_floats_and_strings(series):
    l = []
    for entry in [str(entry).strip() for entry in series]:
        if entry == "":
            l.append(np.nan)
        else:
            try:
                l.append(float(entry))
            except ValueError:
                l.append(entry)
    return l

In [None]:
columns_needing_fixing = rolling_pluto_rpad.columns
for column in columns_needing_fixing:
    rolling_pluto_rpad[column] = convert_floats_and_whitespace_strings_to_floats_and_strings(rolling_pluto_rpad[column])

In [None]:
rolling_pluto_rpad['CT2010'].dtype

In [None]:
rolling_pluto_rpad['CB2010'].dtype
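For comparison, here is a column-at-a-time sketch of the same cleanup using pandas string methods and `pd.to_numeric`. It is not equivalent to the loop above: `clean_column` (a hypothetical helper) casts a column to float only when every non-blank entry is numeric, and otherwise leaves the whole column as stripped strings, whereas the loop decides entry by entry.

```python
import numpy as np
import pandas as pd


def clean_column(series):
    """Strip padding, turn blank entries into NaN, and cast to float if the whole column allows it."""
    stripped = series.astype(str).str.strip()
    # Blank strings (and NaN, which astype(str) renders as 'nan') become missing values.
    stripped = stripped.mask(stripped.isin(["", "nan"]), np.nan)
    try:
        return pd.to_numeric(stripped).astype(float)
    except ValueError:
        # At least one entry is genuinely textual: keep the stripped strings.
        return stripped


# Toy columns mimicking the padded sentinels described above.
tract = clean_column(pd.Series(["  12.5 ", "       ", 7.0]))
company = clean_column(pd.Series(["E033 ", "     "]))
print(tract.dtype, company.dtype)
```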

For a data visualization component to the project I'm applying this fix to the other datasets as well and saving the result.

In [14]:
for dataset in [rolling_sales_agglom, pluto_data_agglom]:
    for column in dataset.columns:
        dataset[column] = convert_floats_and_whitespace_strings_to_floats_and_strings(dataset[column])

In [16]:
pluto_data_agglom.dtypes

Borough        object
Block         float64
Lot           float64
CD            float64
CT2010        float64
CB2010        float64
SchoolDist    float64
Council       float64
ZipCode       float64
FireComp       object
               ...   
YCoord        float64
ZoneMap        object
ZMCode         object
Sanborn        object
TaxMap        float64
EDesigNum      object
APPBBL        float64
APPDate        object
PLUTOMapID    float64
Version        object
dtype: object

In [17]:
rolling_sales_agglom.to_csv("nyc_rolling_sales.csv")
pluto_data_agglom.to_csv("nyc_pluto.csv")
rpad_data_agglom.to_csv("nyc_rpad.csv")

# Partition

At this point we are ready to split the dataset into two partitions. The *predictor partition* contains all of our records with associated sales data&mdash;the ground truths on which we will build our model. The *predicted partition* contains all of the fresh records (the vast majority) which do not have sales data associated with them.

Once we are satisfied with the power of our model (constructed using the predictor data) we will apply it to the rest of the city (predicted data) and visualize the results for explanatory analysis.

In [None]:
r_p_pre = rolling_pluto_rpad[rolling_pluto_rpad['SalePrice'].notnull()]
r_p_post = rolling_pluto_rpad[rolling_pluto_rpad['SalePrice'].isnull()]
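A small sketch of the same split on toy data. The only difference from the cell above is the explicit `.copy()`, which makes each partition an independent frame so that adding columns to it later does not touch the source table or prompt pandas to warn about writing to a slice. The frame and its values are invented.

```python
import numpy as np
import pandas as pd

# Toy merged table: one record has no sale attached.
merged = pd.DataFrame({"SalePrice": [500000.0, np.nan, 750000.0],
                       "BldgArea": [2000.0, 1800.0, 3000.0]})

has_sale = merged["SalePrice"].notnull()
with_sales = merged[has_sale].copy()      # ground-truth partition
without_sales = merged[~has_sale].copy()  # records to predict for

# New columns land on the partition only.
with_sales["MarketValueSqFt"] = with_sales["SalePrice"] / with_sales["BldgArea"]
print(with_sales)
```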

Let's enrich the dataset with calculated variables for market value by square footage.

In [None]:
# pandas may emit a SettingWithCopyWarning here; the assignment still goes through.
mkt_sqft_values = r_p_pre['SalePrice'] / r_p_pre['BldgArea']
r_p_pre['MarketValueSqFt'] = mkt_sqft_values
r_p_post['MarketValueSqFt'] = np.nan

And thusly also for `CurFvT`, `NewFvT`, and `CuravtA`.

In [None]:
# pandas may emit a SettingWithCopyWarning here; the assignments still go through.
for partition in [r_p_pre, r_p_post]:    
    assessed_sqft_values = partition['CuravtA'] / partition['BldgArea']
    pre_assessed_mkt_values = partition['CurFvT'] / partition['BldgArea']
    post_assessed_mkt_values = partition['NewFvT'] / partition['BldgArea']
    partition['AssessmentValueSqFt'] = assessed_sqft_values
    partition['EstPriorMarketValueSqFt'] = pre_assessed_mkt_values
    partition['EstCurentMarketValueSqFt'] = post_assessed_mkt_values

In [None]:
r_p_pre.head(5)

Reset the indices (this isn't strictly necessary, but the resulting data is cleaner and since we can concatenate whilst ignoring indices it won't cause any problems down the line).

In [None]:
r_p_pre.reset_index(drop=True, inplace=True)
r_p_post.reset_index(drop=True, inplace=True)

Before saving to `csv`, we need to do one more thing: name the `Index`. Otherwise this column's header will not be populated.

Note that an `Index` is a `pandas` requirement, not a `csv` one. We could remove it entirely, but don't really gain anything from doing so.

In [None]:
r_p_pre.index.name = r_p_post.index.name = 'Index'
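The effect of naming the index is easiest to see in the first line of the output. A quick sketch that writes a toy frame to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

frame = pd.DataFrame({"SalePrice": [500000, 750000]})
frame.index.name = "Index"

# Write to an in-memory buffer so we can inspect the header line.
buffer = io.StringIO()
frame.to_csv(buffer)
header = buffer.getvalue().splitlines()[0]
print(header)
```

Without the name, the header line would start with an empty field before the first comma.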

# Banding

We expect the true value of the building to be somewhat close to the approximate value calculated by the city. In reality there are a lot of cases where this public figure is as much as 50% off or even 100% off, but after a certain point the source of the error becomes not the city's mis-assessment but a below-market-value sale on the part of the selling parties.

Here are the ten worst offenders, for example:

In [None]:
r_p_pre.sort_values(by = 'SalePrice', ascending=False).tail(10)[::-1][['SalePrice', 'Address', 'AssessmentValueSqFt',
                                                                       'EstPriorMarketValueSqFt',
                                                                       'EstCurentMarketValueSqFt', 'MarketValueSqFt']]

A couple K wouldn't buy you a good used car, let alone a house. We already filtered out the bulk of the in-family or contractual transfers earlier by specifying a lower sale price limit of `$1000`. However this wasn't an aggressive enough culling to deal with cases in which the house was sold for real money, just not real house money, for whatever reason.

On the other end of the spectrum, properties with a high value (especially skyscrapers) tend to be worth more than city estimates:

In [None]:
r_p_pre.sort_values(by = 'SalePrice', ascending=False).head(10)[['SalePrice', 'Address', 'AssessmentValueSqFt',
                                                                       'EstPriorMarketValueSqFt',
                                                                       'EstCurentMarketValueSqFt', 'MarketValueSqFt']]

Here's a rank distribution of over- and underestimates. Looks like the city's model does decently well, actually!

In [None]:
%matplotlib inline

In [None]:
m_ratio = (r_p_pre['EstCurentMarketValueSqFt'] / r_p_pre['MarketValueSqFt']).sort_values().reset_index(drop=True)

In [None]:
m_ratio.plot(logy=True, figsize=(24, 10))

What percentage of records are accurate to within plus/minus 25%?

In [None]:
len(m_ratio[(m_ratio > 0.75) & (m_ratio < 1.25)]) / len(m_ratio)
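The same share can be read off as the mean of a boolean mask, since `True` counts as 1. A toy illustration with invented ratios:

```python
import pandas as pd

# Made-up estimate-to-market ratios.
ratio = pd.Series([0.5, 0.8, 1.0, 1.2, 3.0])

# True where the estimate is within plus/minus 25% of the market value.
within_band = (ratio > 0.75) & (ratio < 1.25)
share = within_band.mean()
print(share)
```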

There we go&mdash;a number to try and beat.

Here's a plot of market value versus predicted value ranked by market value:

A little more information about our dataset before we select a band: is under- or overestimation somehow correlated with actual market value, as I assume is the case (it's easier to be wrong about a big building than a small one)?

In [None]:
r_p_pre_s = r_p_pre.copy()
r_p_pre_s['ValueRatio'] = r_p_pre['EstCurentMarketValueSqFt'] / r_p_pre['MarketValueSqFt']
r_p_pre_s['ASE'] = (r_p_pre['EstCurentMarketValueSqFt'] - r_p_pre['MarketValueSqFt'])**2
r_p_pre_s.sort_values(by='SalePrice').reset_index().plot(figsize=(24, 10), y='ASE', logy=True)

As expected, the city's model does best when predicting intermediate values, and low and high values cause the error to rise. If things seem flat, they're not: remember that this is a logarithmic plot!

Looking at a wrongness ratio:

In [None]:
(r_p_pre_s['ValueRatio']).plot(figsize=(24, 10))

This doesn't make picking a good cutoff easier; mostly it just shows that there are some really extreme outliers out there. At what point do we stop losing misinformation and start losing the real thing? One more look to decide...

In [None]:
pd.Series({n: len(r_p_pre_s[r_p_pre_s['ValueRatio'] > n]) for n in np.linspace(0, 20)}).plot(figsize=(24, 10))

In [None]:
r_p_pre_s['OffRatio'] =  ((np.abs(r_p_pre_s['EstCurentMarketValueSqFt'] - r_p_pre_s['MarketValueSqFt'])) / r_p_pre_s['MarketValueSqFt'])

I decided to be very conservative and band at no more than two standard deviations/the top 1% of the data, corresponding with a cutoff of `ValueRatio = 10` (the column the filter below is actually applied to).

In [None]:
len(r_p_pre_s[r_p_pre_s['ValueRatio'] > 10]) / len(r_p_pre_s)

In [None]:
r_p_pre_s = r_p_pre_s[r_p_pre_s['ValueRatio'] < 10]
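If you would rather derive the cutoff from the data than hard-code it, a quantile does the job. The sketch below uses a toy series of ratios and a 99th-percentile threshold; both are assumptions for illustration, not the numbers behind the cutoff of 10 used above.

```python
import numpy as np
import pandas as pd

# Toy ratios 1, 2, ..., 100 standing in for the ValueRatio column.
value_ratio = pd.Series(np.arange(1, 101, dtype=float))

# Keep everything strictly below the 99th percentile.
cutoff = value_ratio.quantile(0.99)
banded = value_ratio[value_ratio < cutoff]
print(cutoff, len(banded))
```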

Looking at the plot again afterwards:

In [None]:
((r_p_pre_s['EstCurentMarketValueSqFt']) / r_p_pre_s['MarketValueSqFt']).plot(figsize=(24, 10))

We may filter again later in the process.

Now all that remains is deleting our temporary columns and saving the data.

In [None]:
r_p_pre_s = r_p_pre_s[r_p_pre.columns]

# Save

In [None]:
r_p_pre_s.to_csv('nyc_building_sales.csv')
r_p_post.to_csv('nyc_building_nonsales.csv')