This notebook handles **most** of the downloading, loading, parsing, joining, and saving of the combined [PLUTO](http://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page), [Rolling Sales](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page), and [RPAD](https://www1.nyc.gov/site/finance/taxes/property-assessments.page) datasets. A bit of the work is done out-of-band; you'll see why.

Note that in all of these datasets the unit of observation is a borough-block-lot combination, each of which maps uniquely to an individual property. That property may be a building (in the case of actual buildings, small homes, and co-ops) or it may be an apartment (in the case of condominiums). Assessment values and market values for apartments are present in the RPAD and Rolling Sales datasets, respectively, but none of the three datasets provides the size of individual apartments. There is therefore no way to put apartment values on a per-square-foot basis, so apartment records are removed at a later step and the project focuses on whole-building values.
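As a concrete picture of the key: the city's 10-digit BBL identifier packs the borough code (1 digit), the block (5 digits), and the lot (4 digits) into a single number, which is the layout the `Bble` values in the RPAD table further down follow. A minimal sketch, where `make_bbl` and `split_bbl` are hypothetical helper names of my own and not part of any dataset or library used here:

```python
def make_bbl(boro_code, block, lot):
    """Pack a borough code, block, and lot into a 10-digit BBL integer."""
    return int("{:d}{:05d}{:04d}".format(boro_code, block, lot))


def split_bbl(bbl):
    """Unpack a 10-digit BBL into (borough code, block, lot)."""
    digits = "{:010d}".format(int(bbl))
    return int(digits[0]), int(digits[1:6]), int(digits[6:10])


# Borough 1 (Manhattan), block 75, lot 43: the first row of the RPAD table.
print(make_bbl(1, 75, 43))
print(split_bbl(1000750043))
```

The joins in this notebook use the three separate columns rather than the packed number, so this is only meant to show how the pieces relate.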

Now on to the datasets.

RPAD is a record of the assessed value, as determined by the New York City Department of Finance and used for taxation purposes, of every building and apartment in New York City. These assessed values are by the department's own admission generally a few cycles or years behind the trend of the market, but are nevertheless a valuable and almost complete record of all property values in New York City.

The PLUTO dataset agglomerates a large number of datasets published by various agencies in New York City into a single master record for categorical information on every property in New York City. It gives RPAD and Rolling Sales information a rich context.

Rolling Sales contains the actual market prices of all buildings sold in New York City in the last twelve months. Past that horizon, shifts in the real estate market make sale prices a less reliable guide to current value. Since market value is our target variable, Rolling Sales supplies the ground truth for what we would like to model. On the other hand it also contains a significant amount of noise; how to deal with that is discussed later.

In [4]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import requests
import zipfile
import io

In [5]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 20)

# Download

Rolling sales data is provided as lightly formatted `xls` files (Excel):

![alt text](./rolling-sales-data-excel-screencap.png "")

Luckily the extraneous details are easily patched up post-import.

In [6]:
rolling_sales_data = dict()
rolling_sales_data_key_pairs = {'Manhattan': 'manhattan',
                                'Brooklyn': 'brooklyn',
                                'Queens': 'queens',
                                'Bronx': 'bronx',
                                'Staten Island': 'statenisland'}
for b_k, b_xls in tqdm(list(rolling_sales_data_key_pairs.items())):
    borough_rsd = pd.read_excel('https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_{0}.xls'.format(b_xls))
    borough_rsd.columns = borough_rsd.iloc[3].values
    borough_rsd = borough_rsd[4:]
    rolling_sales_data[b_k] = borough_rsd

100%|████████████████████████████████████████████| 5/5 [00:17<00:00,  3.83s/it]


PLUTO data is provided as borough-denominated `csv` files packaged into a `zip`. The following bit of code unpacks the data and rekeys the files (`QN.csv`, `BK.csv`, `BX.csv`, `SI.csv`, `Mn.csv`) to match the borough names used as keys for the rolling sales data, above.

In [7]:
r = requests.get('http://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_15v1.zip')
pluto_key_pairs = {'Manhattan': 'Mn.csv',
                   'Brooklyn': 'BK.csv',
                   'Bronx': 'BX.csv',
                   'Staten Island': 'SI.csv',
                   'Queens': 'QN.csv'}
pluto_data = dict()
for b_k, b_csv in tqdm(list(pluto_key_pairs.items())):
    with zipfile.ZipFile(io.BytesIO(r.content)) as ar:
        borough_pluto = pd.read_csv(ar.open(b_csv))
        pluto_data[b_k] = borough_pluto

100%|████████████████████████████████████████████| 5/5 [00:08<00:00,  1.94s/it]
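The same unpacking can be done with the archive opened only once. Here is a small self-contained sketch of that pattern, using a tiny stand-in archive built in memory instead of the real download (the file contents are made up for illustration). It also passes `low_memory=False`, which is what the `pandas` mixed-type warning itself suggests; that only changes how `pandas` infers types, it does not clean anything.

```python
import io
import zipfile

import pandas as pd

# Build a tiny stand-in archive in memory (the real bytes would be r.content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Mn.csv", "Block,Lot,CT2010\n75,43,15.01\n78,40,       \n")
    archive.writestr("BX.csv", "Block,Lot,CT2010\n2260,62,19\n")

key_pairs = {"Manhattan": "Mn.csv", "Bronx": "BX.csv"}
tables = {}
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:  # opened once
    for borough, csv_name in key_pairs.items():
        with archive.open(csv_name) as handle:
            # low_memory=False: infer each column's dtype from the whole file.
            tables[borough] = pd.read_csv(handle, low_memory=False)

print({name: table.shape for name, table in tables.items()})
```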


Both the PLUTO and Rolling Sales datasets provide their data on a per-borough basis, and since we would like to study the entire city, we must now concatenate each set of per-borough tables into one big table, giving two citywide tables in all. Along the way we encode an additional `Borough` column, to preserve information.

**Note**: the `Borough` mapping is a new column in the rolling sales dataset; a preexisting (numerically encoded) `BOROUGH` column is removed. In `PLUTO`, the mapping overwrites an older (acronym-encoded) column of the same name.

In [8]:
rolling_sales_agglom = pd.DataFrame(columns=rolling_sales_data['Manhattan'].columns)
pluto_data_agglom = pd.DataFrame(columns=pluto_data['Manhattan'].columns)
for b_k in tqdm(pluto_key_pairs.keys()):
    pluto_data[b_k]['Borough'] = rolling_sales_data[b_k]['Borough'] = b_k
    rolling_sales_agglom = pd.concat([rolling_sales_agglom, rolling_sales_data[b_k]], ignore_index=True)
    pluto_data_agglom = pd.concat([pluto_data_agglom, pluto_data[b_k]], ignore_index=True)
del rolling_sales_agglom['BOROUGH']

100%|████████████████████████████████████████████| 5/5 [00:08<00:00,  2.22s/it]
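Growing a frame with repeated `pd.concat` calls works, but the same result can be had in a single call. A minimal sketch on made-up data (the two small frames stand in for the per-borough tables):

```python
import pandas as pd

per_borough = {
    "Manhattan": pd.DataFrame({"Block": [75, 78], "Lot": [43, 40]}),
    "Bronx": pd.DataFrame({"Block": [2260], "Lot": [62]}),
}

# Tag each table with its borough, then concatenate all of them in one call.
# assign() returns a new frame, so the per-borough tables are left untouched.
citywide = pd.concat(
    [table.assign(Borough=borough) for borough, table in per_borough.items()],
    ignore_index=True,
)
print(citywide)
```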


RPAD data is the trickiest. RPAD is split into two files, one for properties in tax class 1 (single-, double-, and triple-family homes) and one for those in tax classes 2/3/4 (everything else). These are provided as compressed `zip` files containing `mdb` database files (for Microsoft Access).

I handled this the easiest way I could: by jumping on a Windows desktop, opening the databases in Microsoft Access, and exporting them to comma-delimited `txt` files using the GUI (be sure to check the box that asks if you'd like to include the field names in the first row!). There are automated ways of doing this but they're sure to be painful and this is easiest. The resulting files can then be read by `pandas`.

Check out [mdbtools](https://github.com/brianb/mdbtools) (the [homebrew installation](http://brewformulas.org/Mdbtools) for Mac OS X) for \*nix systems (not sure what the easiest way to get it on Linux is). If you're working on Windows and don't have access to Microsoft Access (heh), I'm not sure but I think [pyodbc](https://github.com/mkleehammer/pyodbc) will work.
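For what it's worth, here is roughly what the mdbtools route might look like from Python. This is an untested sketch and not what was done for this notebook: `mdb-export` is the mdbtools command that prints one table as CSV, the two wrapper functions are hypothetical helpers, and the `.mdb` file name is a placeholder for whatever the unzipped download is actually called.

```python
import shutil
import subprocess


def mdb_export_command(mdb_path, table):
    """Command line that prints one table of an Access database as CSV."""
    return ["mdb-export", mdb_path, table]


def export_table(mdb_path, table, out_path):
    """Write `table` from `mdb_path` to `out_path`; needs mdbtools installed."""
    if shutil.which("mdb-export") is None:
        raise RuntimeError("mdb-export not found; install mdbtools first")
    with open(out_path, "w") as out:
        subprocess.run(mdb_export_command(mdb_path, table), stdout=out, check=True)


# Placeholder file name; the table name follows the tc1/tc234 naming used here.
print(mdb_export_command("tc1.mdb", "tc1"))
```

Calling `export_table("tc1.mdb", "tc1", "tc1.txt")` would then produce a file in the same spirit as the GUI export, assuming the table really is named `tc1`.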

The odd filenames come from the odd names the underlying database uses for its tables. They stand for tax class 1 and tax classes 2/3/4.

In [10]:
# rolling_sales_agglom.to_csv("test.csv")

In [15]:
%ls # tc1.txt and tc234.txt should be in this list!

 Volume in drive C is SSD_80GB
 Volume Serial Number is 9279-00B2

 Directory of C:\Users\Alex\Desktop\nyc-buildings

04/29/2016  02:34 PM    <DIR>          .
04/29/2016  02:34 PM    <DIR>          ..
04/27/2016  10:52 AM                77 .gitignore
04/29/2016  12:45 PM    <DIR>          .ipynb_checkpoints
04/29/2016  02:30 PM            70,088 both-tabs-and-commas-screencap.png
04/29/2016  02:34 PM            24,951 data-munging.ipynb
04/29/2016  01:28 PM           354,591 data-scribbles.ipynb
04/27/2016  10:52 AM        18,821,028 nyc_building_sales.csv
04/26/2016  01:42 PM       437,163,527 nyc_buildings.csv
04/26/2016  11:59 AM           211,429 rolling-sales-data-excel-screencap.png
04/29/2016  02:20 PM       527,980,770 tc1.txt
04/29/2016  02:17 PM       294,684,839 tc234.txt
               9 File(s)  1,279,311,300 bytes
               3 Dir(s)   6,913,380,352 bytes free


In [61]:
rpad_data_agglom = pd.concat([pd.read_csv("tc1.txt"), pd.read_csv("tc234.txt")], ignore_index=True)



# Merge

Now we flatten the Rolling Sales and PLUTO tables into one by performing an outer join on the `(Borough, Block, Lot)` unique key (this is [standard](http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup)).

In order to do this we must first remap the column names in the rolling sales dataset from `SPACED ALL CAPS` to `NoSpaceCamelCase` (as used by `PLUTO`).

In [18]:
rolling_sales_agglom.columns = [c.title().replace(' ', '') for c in list(rolling_sales_agglom.columns)]

At this point we hit a roadblock.

Our assumption going into this project was that the `Borough`-`Block`-`Lot` columns present in both PLUTO and the Rolling Sales dataset (these are an identifier used for taxation purposes) are equivalent, allowing us to use this combo as a unique key for a join. However, this assumption turns out to be incorrect.

The Rolling Sales dataset contains information on the sale of both individual apartments and of whole buildings, but does not contain the square footage of the apartment sales in the data.

The PLUTO dataset contains information on the square footage of entire buildings, but not on that of individual apartments. This is because instead of using the usual Tax-Block-Lot system PLUTO defines and uses its own Lot configuration, painting over condominiums with multiple lots in a single building by merging them into one Lot. This is useful for geospatial visualization but voids the use of the dataset for ordinary apartment-based residential lookup: the Rolling Sales dataset contains sales information on individual apartments which, as a consequence of this tweak, cannot be mapped to any of the records in PLUTO.

We'll have to remove apartments from the dataset and do without them. Weirdly the sentinel value in this case is `'            '`, as in, a long space. See [Cleaning](#Cleaning) for more on why.

In [19]:
rolling_sales_agglom.loc[0, 'ApartmentNumber']

'            '

In [20]:
rs_a_f = rolling_sales_agglom[rolling_sales_agglom['ApartmentNumber'] == '            ']

Not every record in the Rolling Sales dataset represents an actual sale. A large number of records are effectively deed transfers: sales of a building for either `$0` or occasionally some paltry sum, usually between family members, sometimes as a part of a contract, and so on. Since these records carry no information about market value they should be dropped.

`$1000` is arbitrarily chosen as a cutoff value.

In [21]:
rs_a_f = rs_a_f[rs_a_f['SalePrice'] > 1000]
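The two filters above (keep blank apartment numbers, drop sales at or below the cutoff) can also be written a little more defensively. A minimal sketch on made-up rows; it strips whitespace rather than matching the exact twelve-space sentinel, and coerces prices to numbers first, both of which are my assumptions about what is safe rather than something the data requires:

```python
import pandas as pd

sales = pd.DataFrame({
    "ApartmentNumber": ["            ", "4B          ", "            ", "            "],
    "SalePrice": [5250000, 900000, 0, 10],
})

# Whole-building sales have a blank (whitespace-only) apartment number.
is_whole_building = sales["ApartmentNumber"].str.strip() == ""
# Coerce prices to numbers so the comparison behaves on an object-typed column.
price = pd.to_numeric(sales["SalePrice"], errors="coerce")
is_real_sale = price > 1000

filtered = sales[is_whole_building & is_real_sale]
print(filtered)
```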

In addition to the primary key values there are a handful of columns which are present in both datasets. However, these are encoded somewhat differently. For example, `Address` is present in both `PLUTO` and the rolling sales data, but is not consistently formatted in the former&mdash;many entries have what appears to be a leading space that would need to be stripped first.

Considering that the alphanumerical `Borough-Block-Lot` combination is standardly encoded and fully unique (*for our chosen subcase*), we can simply ignore these additional columns, as adding them to the join wouldn't do anything for us (and would create more problems than it solves).

The rolling sales data contains a number of variables which are more or less copies of the data contained in `PLUTO`. Since we cannot construct a generalized classifier based on variables which are not everywhere present (technically you can `GLM`-encode `NaN` values, but this is a poor idea), we will extract only one column of interest from the rolling sales data, the `SalePrice`.

In [22]:
rolling_non_dups = ['Borough', 'Block', 'Lot', 'SalePrice']

Now the join.

In [23]:
rolling_pluto = pd.merge(rs_a_f[rolling_non_dups], pluto_data_agglom,
                         how='outer', on=['Borough', 'Block', 'Lot'])

It's not immediately apparent why, but this resulted in more records than expected. I chose to defer investigating this until a second stage of the process.

In [24]:
len(rolling_pluto) - len(pluto_data_agglom)

14472

However, the `SalePrice`-populated merges fired correctly.

In [25]:
len(rs_a_f) - len(rolling_pluto[rolling_pluto['SalePrice'] >= 0])

0
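When it comes time to investigate the surplus rows, two candidate explanations are easy to check: lots that sold more than once in the period, and sales whose key has no counterpart in PLUTO. A minimal sketch on made-up data, using the `indicator` option of `pd.merge`; I have not verified which explanation (if either) applies to the real tables:

```python
import pandas as pd

sales = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Bronx"],
    "Block": [97, 97, 2260],
    "Lot": [45, 45, 62],  # 97/45 sold twice; the Bronx lot is absent from `lots`
    "SalePrice": [5250000, 5400000, 700000],
})
lots = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan"],
    "Block": [97, 132],
    "Lot": [45, 24],
    "BldgArea": [2295, 6220],
})
keys = ["Borough", "Block", "Lot"]

# indicator=True adds a `_merge` column saying which side each row came from.
merged = pd.merge(sales, lots, how="outer", on=keys, indicator=True)
extra = len(merged) - len(lots)
repeat_sales = int(sales.duplicated(subset=keys).sum())
unmatched_sales = int((merged["_merge"] == "left_only").sum())
print(extra, repeat_sales, unmatched_sales)
```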

Next up we want to merge in the RPAD dataset. Like the Rolling Sales dataset, RPAD encodes the borough as a number rather than as the borough name we need in order to join it to `rolling_pluto`. Since the data isn't provided on a borough-denominated basis, in this case we'll have to explicitly map the numerical `BORO` column to a name value.

In [62]:
rpad_key_pairs = {1.0: 'Manhattan',
                  2.0: 'Bronx',
                  3.0: 'Brooklyn',
                  4.0: 'Queens',
                  5.0: 'Staten Island',
                 }
rpad_data_agglom['Borough'] = rpad_data_agglom['BORO'].apply(lambda n: rpad_key_pairs[n])
del rpad_data_agglom['BORO']

Again this dataset has its own format for variable names, in this case `ALL_CAPS_SPACERS`. Again we convert to `CamelCase`.

In [69]:
rpad_data_agglom.columns = [c.title().replace('_', '') for c in list(rpad_data_agglom.columns)]
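To see what the two renaming one-liners do to individual names, here is the same transformation as a tiny standalone function. `to_camel_case` is a hypothetical helper name, and the raw column names are my reconstruction from the renamed columns (for example `CUR_FV_T` from `CurFvT`), so treat them as assumptions:

```python
def to_camel_case(name, separator):
    """Title-case each word of `name`, then drop the separator."""
    return name.title().replace(separator, "")


# Rolling Sales style: words separated by spaces.
print(to_camel_case("SALE PRICE", " "))
print(to_camel_case("APARTMENT NUMBER", " "))
# RPAD style: words separated by underscores.
print(to_camel_case("CUR_FV_T", "_"))
```

This works because `str.title()` capitalizes the first letter after any non-letter character, spaces and underscores alike.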

In [70]:
rpad_data_agglom.head(5)

Unnamed: 0,Bble,Block,Lot,Ease,Secvol,District,Year4,CurFvL,CurFvT,NewFvL,NewFvT,FvChgdt,Curavl,Curavt,Curexl,Curext,CuravlA,CuravtA,CurexlA,CurextA,Chgdt,TnAvl,TnAvt,TnExl,TnExt,TnAvlA,TnAvtA,TnExlA,TnExtA,Fchgdt,FnAvl,FnAvt,FnExl,FnExt,FnAvlA,FnAvtA,FnExlA,FnExtA,Txcl,OTxcl,CbnTxcl,Bldgcl,Exmtcl,Owner,HnumLo,HnumHi,StrName,Zip,TotUnit,ResUnit,LfrtDec,LdepDec,LAcre,Irreg,BfrtDec,BdepDec,BldVar,Ext,Story,Bldgs,Corner,LndArea,GrSqft,Zoning,Yrb,YrbFlag,YrbRng,Yra1,Yra1Rng,Yra2,Yra2Rng,CpBoro,CpDist,Limit,OLimit,Status1,Status2,Newlot,Droplot,Delchg,Corchg,Nodesc,Noav,Valref,Mbldg,CondoNm,CondoS1,CondoS2,CondoS3,CondoA,ComintL,ComintB,Aptno,ApBoro,ApBlock,ApLot,ApEase,ApDate,ApTime,Protest,AtGrp,Applic,Protest2,AtGrp2,Applic2,OProtst,OAtGrp,OApplic,Reuc,GeoRc,CoopNum,ExInds,ExCount,ExChgdt,Dchgdt,SmChgdt,Borough
0,1000750043,75.0,43.0,,102.0,1.0,2015.0,5690000.0,7042000.0,5456000.0,6837000.0,12/16/2015,206643.0,255744.0,0.0,0.0,206643.0,255744.0,0.0,0.0,12/16/2015,207764.0,260352.0,0.0,0.0,207764.0,260352.0,0.0,0.0,00/00/0000,207764.0,260352.0,0.0,0.0,207764.0,260352.0,0.0,0.0,1,1,,C0,,"YUEN, SO SAN",26,26,CLIFF STREET,10038.0,3.0,3.0,19.42,89.33,,I,20.0,80.0,,,4.0,1.0,,1690.0,6400.0,C6-4,1974.0,,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,,,,,0.0,0.0,,0.0,0.0,0.0,,00/00/0000,0.0,,0.0,,,0.0,,,0.0,,,0,0.0,,0.0,00/00/0000,03/19/2009,12/16/2015,Manhattan
1,1000780040,78.0,40.0,,102.0,1.0,2015.0,4620000.0,7076000.0,4620000.0,6015000.0,11/10/2015,169234.0,259200.0,0.0,0.0,169234.0,259200.0,0.0,0.0,11/10/2015,199086.0,259200.0,0.0,0.0,199086.0,259200.0,0.0,0.0,00/00/0000,199086.0,259200.0,0.0,0.0,199086.0,259200.0,0.0,0.0,1,1,,S2,,"H.B.S. EQUITIES,",86,86,NASSAU STREET,10038.0,3.0,2.0,16.58,51.75,,I,19.0,52.0,,,5.0,1.0,,854.0,4140.0,C5-5,1910.0,E,0.0,2004.0,0.0,2004.0,0.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,,,,,0.0,0.0,,0.0,0.0,0.0,,00/00/0000,0.0,,0.0,,,0.0,,,0.0,,,0,0.0,,0.0,00/00/0000,03/22/2007,11/10/2015,Manhattan
2,1000970017,97.0,17.0,,103.0,1.0,2015.0,2990000.0,5088000.0,2990000.0,5088000.0,00/00/0000,128944.0,219420.0,0.0,0.0,128944.0,219420.0,0.0,0.0,06/09/2015,136680.0,232585.0,0.0,0.0,136680.0,232585.0,0.0,0.0,00/00/0000,136680.0,232585.0,0.0,0.0,136680.0,232585.0,0.0,0.0,1,1,,S1,,"SPAEDA, DORATHEA S.",211,211,FRONT STREET,10038.0,2.0,1.0,25.0,39.83,,I,25.0,40.0,,,5.0,1.0,NE,992.0,4960.0,C6-2A,1900.0,E,0.0,1981.0,0.0,2005.0,0.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,,,,,0.0,0.0,,0.0,0.0,0.0,,00/00/0000,0.0,8.0,115.0,1.0,,0.0,,8.0,115.0,1.0,,0,0.0,,0.0,00/00/0000,03/17/2015,06/09/2015,Manhattan
3,1000970036,97.0,36.0,,103.0,1.0,2015.0,3950000.0,7401000.0,3950000.0,6291000.0,11/10/2015,77654.0,145499.0,1580.0,1580.0,77654.0,145499.0,1580.0,1580.0,01/08/2016,94325.0,150228.0,1550.0,1550.0,94325.0,150228.0,1550.0,1550.0,00/00/0000,94325.0,150228.0,1550.0,1550.0,94325.0,150228.0,1550.0,1550.0,1,1,,S2,,"BARNET, ANDREA",226,226,FRONT STREET,10038.0,3.0,2.0,25.08,71.0,,I,25.08,71.0,,,4.0,1.0,,1775.0,6600.0,C6-2A,1901.0,E,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,,,,,0.0,0.0,,0.0,0.0,0.0,,00/00/0000,0.0,,0.0,,,0.0,,8.0,18.0,1.0,,0,0.0,EEEE,1.0,00/00/0000,03/26/2008,11/10/2015,Manhattan
4,1000970044,97.0,44.0,,103.0,1.0,2015.0,1530000.0,1983000.0,1836000.0,2677000.0,12/16/2015,45239.0,58633.0,0.0,0.0,45239.0,58633.0,0.0,0.0,12/16/2015,40531.0,59097.0,0.0,0.0,40531.0,59097.0,0.0,0.0,00/00/0000,40531.0,59097.0,0.0,0.0,40531.0,59097.0,0.0,0.0,1,1,,A9,,"136 BEEKMAN, LLC",136,138,BEEKMAN STREET,10038.0,1.0,1.0,18.69,25.58,,I,18.6,25.5,,,4.0,1.0,,477.0,1900.0,C6-2A,1998.0,,1999.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,,,,,0.0,0.0,,1.0,97.0,44.0,,02/18/2003,112255.0,,0.0,,,0.0,,,0.0,,,0,0.0,,0.0,00/00/0000,09/25/2013,12/16/2015,Manhattan


The variables of interest in RPAD (what we want to keep after the merge) are:

* Borough (merge key).
* Block (merge key).
* Lot (merge key).
* CurFvT: current market value, total, as of an assessment sometime in 2015 and as determined by the city. This is the Finance office's best guess as to the value of this property.
* NewFvT: new market value, total, as of the prospective assessment in early 2016 and as determined by the city. Note that the Finance office lags, by its own admission, generally a cycle or two behind the movements of the real estate market.
* CuravtA: current assessed value, total, as of an assessment sometime in 2015 and as determined by the city. Assessed value is computed using a complex and abstruse formulaic determination of rental market value. The closely related `Curavt` is a similar statistic which is rebalanced to increase by no more than `20%` a year, per tax laws ([source](https://www1.nyc.gov/site/finance/taxes/property-determining-your-transitional-assessed-value.page)); it is excluded because `CuravtA` is, therefore, better correlated with actual value. `Curavt`, not market value, is what the city uses to assess residential tax.

Tentative and final assessment information is excluded because it is incomplete pending the release of this dataset for the most recent financial year.

In [74]:
rpad_columns_of_interest = ['Borough', 'Block', 'Lot', 'CurFvT', 'NewFvT', 'CuravtA']

In [81]:
# rpad_data_agglom[rpad_columns_of_interest]

Finally the merge. A few thousand extra records turn up in the result. Again I'm not sure why, but it shouldn't pose too much of a problem.

In [79]:
rolling_pluto_rpad = pd.merge(rpad_data_agglom[rpad_columns_of_interest], rolling_pluto,
                              how='inner', on=['Borough', 'Block', 'Lot'])

In [83]:
len(rolling_pluto_rpad) - len(rolling_pluto)

3245
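An inner join can only come out longer than one of its inputs if the other side repeats some keys, so one thing worth checking is whether any `(Borough, Block, Lot)` combination appears more than once in RPAD. A minimal sketch on made-up data; whether this is what actually happens in the real tables is a guess on my part. The `validate` option of `pd.merge` turns the silent row multiplication into an error:

```python
import pandas as pd

assessments = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Manhattan"],
    "Block": [97, 97, 132],
    "Lot": [45, 45, 24],  # 97/45 appears twice on purpose
    "CurFvT": [2424000, 2424000, 4438000],
})
lots = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan"],
    "Block": [97, 132],
    "Lot": [45, 24],
    "BldgArea": [2295, 6220],
})
keys = ["Borough", "Block", "Lot"]

# Rows whose key occurs more than once on the assessment side.
repeated = assessments[assessments.duplicated(subset=keys, keep=False)]
print(len(repeated))

# validate= makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(assessments, lots, how="inner", on=keys, validate="one_to_one")
    merge_is_one_to_one = True
except pd.errors.MergeError:
    merge_is_one_to_one = False
print(merge_is_one_to_one)
```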

# Cleaning

`BldgArea` is an important variable for us, but quite a few records are missing it.

In [88]:
rolling_pluto_rpad['BldgArea'].value_counts()[0.0]

47609

There's no way to interpolate that information so we'll just drop those records. Ditto with records missing an `Address`: these are usually malformed records or records of places belonging to the city or to the state.

In [91]:
rolling_pluto_rpad = rolling_pluto_rpad[rolling_pluto_rpad['BldgArea'] > 0]

Some entries in the PLUTO dataset are missing an address. Certain city-owned or public spaces&mdash;parks, for example&mdash;don't get one because they're not even technically buildings. The inner join with RPAD *appears* to have taken care of this problem, but for safety's sake let's explicitly disallow it.

In [97]:
rolling_pluto_rpad = rolling_pluto_rpad[rolling_pluto_rpad['Address'].notnull()]

Now we hit the next issue. When the `csv` files were read in, `pandas` complained that many of the columns that we are working with have a mixed `dtype`:

> `DtypeWarning: Columns (4,6,7,8,10,11,50,52,53,77,79) have mixed types. Specify dtype option on import or set low_memory=False.`

After casting these columns using `astype(float)` failed I wrote a `try-except` block and caught on to a sentinel value in `CT2010` of `'       '`, much like the strange spacer above, that was messing with the column type. The other columns also have variable-length spacers like this floating around in them.

In [100]:
len(rolling_pluto_rpad[rolling_pluto_rpad['CT2010'] == '       '])

134

In [101]:
len(rolling_pluto_rpad[rolling_pluto_rpad['CB2010'] == '     '])

488

What a mess. This problem occurs in both float columns and string columns and occurs with strings of variable length, making it difficult to pick out how to cast to get rid of it. The problem isn't immediately evident from `pandas` displays because `pandas` strips empty space from its display, but if you `to_csv()` the file and check it out in a text editor, the reason why emerges:

![alt text](./both-tabs-and-commas-screencap.png "")

Tab delimiters and comma delimiters are two of the most common (if not *the* most common) ways of storing tabular data in a text file. Rather than choose one format or the other, the PLUTO compilers appear to have chosen...both.

The following (very inefficient) loop cleans this up.

In [102]:
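def convert_floats_and_whitespace_strings_to_floats_and_strings(series):
    """Return the entries of `series` as a list, with blanks turned into NaN.

    Numeric-looking entries become floats; everything else stays a stripped string.
    """
    cleaned = []
    for entry in [str(entry).strip() for entry in series]:
        if entry == "":
            # Whitespace-only sentinel: treat as missing.
            cleaned.append(np.nan)
        else:
            try:
                cleaned.append(float(entry))
            except ValueError:
                # Not a number (an address, say): keep the stripped text.
                cleaned.append(entry)
    return cleaned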

In [103]:
columns_needing_fixing = rolling_pluto_rpad.columns
for column in columns_needing_fixing:
    rolling_pluto_rpad[column] = convert_floats_and_whitespace_strings_to_floats_and_strings(rolling_pluto_rpad[column])

In [104]:
rolling_pluto_rpad['CT2010'].dtype

dtype('float64')

In [105]:
rolling_pluto_rpad['CB2010'].dtype

dtype('float64')
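A faster, column-at-a-time variant of the same cleanup might look like the sketch below. It is not a drop-in replacement: the loop above decides entry by entry, while this hypothetical `clean_column` helper converts a column to floats only if every non-blank value parses, and otherwise keeps the stripped strings.

```python
import numpy as np
import pandas as pd


def clean_column(series):
    """Strip whitespace, turn blanks into NaN, convert to float when every value allows it."""
    stripped = series.astype(object).map(
        lambda value: (value.strip() or np.nan) if isinstance(value, str) else value
    )
    try:
        return pd.to_numeric(stripped)
    except (ValueError, TypeError):
        return stripped  # genuinely textual column: keep the cleaned strings


tracts = clean_column(pd.Series(["  15.01", "       ", "21"], dtype=object))
fire_companies = clean_column(pd.Series(["E006", "     ", " L010"], dtype=object))
print(tracts.dtype)
print(fire_companies.tolist())
```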

# Partition

At this point we are ready to split the dataset into two partitions. The *predictor partition* contains all of our records with associated sales data&mdash;the ground truths on which we will build our model. The *predicted partition* contains all of the fresh records (the vast majority) which do not have sales data associated with them.

Once we are satisfied with the power of our model (constructed using the predictor data) we will apply it to the rest of the city (predicted data) and visualize the results for explanatory analysis.

In [106]:
r_p_pre = rolling_pluto_rpad[rolling_pluto_rpad['SalePrice'].notnull()]
r_p_post = rolling_pluto_rpad[rolling_pluto_rpad['SalePrice'].isnull()]

Let's enrich the dataset with calculated variables for market value by square footage.

In [120]:
# Ignore the warning.
mkt_sqft_values = r_p_pre['SalePrice'] / r_p_pre['BldgArea']
r_p_pre['MarketValueSqFt'] = mkt_sqft_values
r_p_post['MarketValueSqFt'] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
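For reference, the warning comes from adding columns to frames that were themselves produced by filtering another frame. Taking an explicit `.copy()` when partitioning is the usual way to make the intent clear. A minimal sketch on made-up rows, with `sold` and `unsold` standing in for the two partitions:

```python
import numpy as np
import pandas as pd

combined = pd.DataFrame({
    "SalePrice": [5250000.0, np.nan, 6000000.0],
    "BldgArea": [2295.0, 3500.0, 6220.0],
})

# .copy() gives each partition its own data, so adding columns is unambiguous.
sold = combined[combined["SalePrice"].notnull()].copy()
unsold = combined[combined["SalePrice"].isnull()].copy()

sold["MarketValueSqFt"] = sold["SalePrice"] / sold["BldgArea"]
unsold["MarketValueSqFt"] = np.nan
print(sold["MarketValueSqFt"].round(2).tolist())
```

The original frame is left without the new column, which is the behavior the notebook relies on anyway.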


And likewise for `CurFvT`, `NewFvT`, and `CuravtA`.

In [117]:
# Ignore the warning.
for partition in [r_p_pre, r_p_post]:    
    assessed_sqft_values = partition['CuravtA'] / partition['BldgArea']
    pre_assessed_mkt_values = partition['CurFvT'] / partition['BldgArea']
    post_assessed_mkt_values = partition['NewFvT'] / partition['BldgArea']
    partition['AssessmentValueSqFt'] = assessed_sqft_values
    partition['EstPriorMarketValueSqFt'] = pre_assessed_mkt_values
    partition['EstCurentMarketValueSqFt'] = post_assessed_mkt_values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [122]:
r_p_pre.head(5)

Unnamed: 0,Borough,Block,Lot,CurFvT,NewFvT,CuravtA,SalePrice,CD,CT2010,CB2010,SchoolDist,Council,ZipCode,FireComp,PolicePrct,HealthArea,Address,ZoneDist1,ZoneDist2,ZoneDist3,ZoneDist4,Overlay1,Overlay2,SPDist1,SPDist2,LtdHeight,AllZoning1,AllZoning2,SplitZone,BldgClass,LandUse,Easements,OwnerType,OwnerName,LotArea,BldgArea,ComArea,ResArea,OfficeArea,RetailArea,GarageArea,StrgeArea,FactryArea,OtherArea,AreaSource,NumBldgs,NumFloors,UnitsRes,UnitsTotal,LotFront,LotDepth,BldgFront,BldgDepth,Ext,ProxCode,IrrLotCode,LotType,BsmtCode,AssessLand,AssessTot,ExemptLand,ExemptTot,YearBuilt,BuiltCode,YearAlter1,YearAlter2,HistDist,Landmark,BuiltFAR,ResidFAR,CommFAR,FacilFAR,BoroCode,BBL,CondoNo,Tract2010,XCoord,YCoord,ZoneMap,ZMCode,Sanborn,TaxMap,EDesigNum,APPBBL,APPDate,PLUTOMapID,Version,AssessmentValueSqFt,EstPriorMarketValueSqFt,EstCurentMarketValueSqFt,MarketValueSqFt
5,Manhattan,97.0,45.0,2424000.0,3272000.0,118368.0,5250000.0,101.0,15.01,3014.0,2.0,1.0,10038.0,E006,1.0,7700.0,134 BEEKMAN STREET,C6-2A,,,,,,LM,,,C6-2A/LM,,N,A9,1.0,0,,SMITH CARTER B,458.0,2295.0,0.0,2295.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,1.0,5.0,1.0,1.0,18.0,25.5,18.0,25.5,,3.0,Y,5.0,2.0,73736.0,118368.0,0.0,0.0,1901.0,,2004.0,0.0,South Street Seaport,,5.01,6.02,6.0,6.5,1.0,1000970000.0,0.0,1501.0,983466.0,197048.0,12b,,101S023,10103.0,,0.0,,1.0,15v1,51.576471,1056.20915,1425.708061,2287.581699
10,Manhattan,132.0,24.0,4438000.0,5549000.0,216696.0,6000000.0,101.0,21.0,2004.0,2.0,1.0,10007.0,L010,1.0,7700.0,79 WARREN STREET,C6-2A,,,,,,TMU,,,C6-2A/TMU,,N,C0,2.0,0,,"79 WARREN ASSOCIATES,",1875.0,6220.0,0.0,6220.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,1.0,3.0,3.0,3.0,25.0,75.0,25.0,73.0,,3.0,N,5.0,1.0,91551.0,216696.0,0.0,0.0,1905.0,E,1984.0,0.0,,,3.32,6.02,6.0,6.5,1.0,1001320000.0,0.0,21.0,981354.0,199799.0,12b,,101S015,10104.0,,0.0,,1.0,15v1,34.838585,713.504823,892.122186,964.630225
16,Manhattan,141.0,9.0,6899000.0,7938000.0,412128.0,15900000.0,101.0,39.0,4010.0,2.0,1.0,10013.0,E007,1.0,7700.0,148 READE STREET,C6-2A,,,,,,TMU,,,C6-2A/TMU,,N,A7,1.0,0,,ALFRED M MERRIN,1370.0,6800.0,0.0,6800.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,1.0,5.0,1.0,1.0,25.0,54.03,25.0,54.0,,3.0,Y,5.0,4.0,222820.0,412128.0,0.0,0.0,1999.0,,0.0,0.0,,,4.96,6.02,6.0,6.5,1.0,1001410000.0,0.0,39.0,981461.0,200410.0,12a,,101S016,10104.0,,1001410000.0,12/4/2000,1.0,15v1,60.607059,1014.558824,1167.352941,2338.235294
19,Manhattan,141.0,13.0,6211000.0,6284000.0,370915.0,13800000.0,101.0,39.0,4010.0,2.0,1.0,10013.0,E007,1.0,7700.0,156 READE STREET,C6-2A,,,,,,TMU,,,C6-2A/TMU,,N,S1,4.0,0,,156 PAPAYA LLC,1350.0,6800.0,0.0,6800.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,1.0,6.0,1.0,2.0,25.0,53.5,25.0,54.0,,3.0,Y,5.0,4.0,200059.0,370915.0,0.0,0.0,1920.0,E,0.0,0.0,,,5.04,6.02,6.0,6.5,1.0,1001410000.0,0.0,39.0,981376.0,200461.0,12a,,101S016,10104.0,,0.0,,1.0,15v1,54.546324,913.382353,924.117647,2029.411765
99,Manhattan,348.0,14.0,4271000.0,3150000.0,43766.0,2850000.0,103.0,14.02,1000.0,1.0,1.0,10002.0,L018,7.0,7400.0,149 RIVINGTON STREET,R7A,,,,,,,,,R7A,,N,S2,4.0,0,,LASIERRA SRVC CO INC,971.0,3500.0,1550.0,1950.0,1550.0,0.0,0.0,0.0,0.0,0.0,7.0,1.0,3.0,2.0,3.0,18.67,52.0,18.67,43.0,,0.0,N,5.0,5.0,17215.0,43766.0,0.0,0.0,1910.0,E,2007.0,0.0,,,3.6,4.0,0.0,4.0,1.0,1003480000.0,0.0,1402.0,988185.0,201285.0,12c,,101N080,10201.0,,0.0,,1.0,15v1,12.504571,1220.285714,900.0,814.285714


Reset the indices (this isn't strictly necessary, but the resulting data is cleaner and since we can concatenate whilst ignoring indices it won't cause any problems down the line).

In [123]:
r_p_pre.reset_index(drop=True, inplace=True)
r_p_post.reset_index(drop=True, inplace=True)

Before saving to `csv`, we need to do one more thing: name the `Index`. Otherwise this column's header will not be populated.

Note that an `Index` is a `pandas` requirement, not a `csv` one. We could remove it entirely, but don't really gain anything from doing so.

In [124]:
r_p_pre.index.name = r_p_post.index.name = 'Index'

That does it!

In [125]:
r_p_pre.to_csv('nyc_building_sales.csv')
r_p_post.to_csv('nyc_buildings.csv')
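As an aside, `to_csv` can also label the index column at write time through its `index_label` argument, which has the same effect on the file without renaming the index on the frame. A small round-trip sketch on made-up rows, written to an in-memory buffer rather than to disk:

```python
import io

import pandas as pd

frame = pd.DataFrame({"Borough": ["Manhattan", "Bronx"], "SalePrice": [5250000, 700000]})

buffer = io.StringIO()
frame.to_csv(buffer, index_label="Index")  # names the index column in the file only
buffer.seek(0)

header = buffer.readline().strip()
buffer.seek(0)
# index_col tells read_csv to use that column as the index again.
restored = pd.read_csv(buffer, index_col="Index")
print(header)
```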