## Using Pandas and Bokeh to Analyze and Visualize NYC Cooperative and Condominium Apartment Sales - PART 1

My goal is to review the NYC rolling sales data to see what patterns emerge in cooperative and condominium apartment sales. My initial question is which cooperative and condominium buildings have the most sales, both by number of sales and as a percentage of the building's units. I hope this will surface some interesting patterns that can be explored further.

The overall plan is below:

    1) Load the rolling sales data into a dataframe and clean it
    2) Use information from the Department of Finance to convert the condominium lot number to building lot number 
       so that sales per building can be determined       
    3) Merge the data with information on number of residential units per building from the 
       PLUTO database so that percentage sales per building can be calculated
    4) Use "groupby" and "sort" to obtain the top 10 sales per building (by both number and 
       percentage)
    5) Create a series of bar charts using Bokeh to visualize this data

This notebook covers steps 1 and 2; a second notebook covers steps 3-5.

The first step is to load the required packages and read the raw rolling sales data into a dataframe.

In [138]:
import sqlite3
import pandas as pd
import pandas.io.sql as pd_sql
import numpy as np

# I previously downloaded the rolling sales data as an Excel file from
# https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

alldf=pd.read_excel('data/rollingsales_manhattanJuly2016.xls',skiprows=4, na_values=[' ',''])

# shorten the awkward rolling sales column names
alldf.rename(columns={'EASE-MENT': 'EASEMENT', 'BUILDING CLASS AT PRESENT': 'BCLASS', 'APARTMENT NUMBER' : 'APT'}, inplace=True)

print(alldf.shape)

(23512, 21)


Since our focus is on individual condominium and cooperative unit sales (not townhouses or office buildings), I will limit the data to the cooperative and condominium building classes.

In [139]:
bclass=['C6','C8','D0','D4','R4','R5']
df=alldf[alldf.BCLASS.isin(bclass)]
print(df.shape)
df.head(5)

(14833, 21)


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASEMENT,BCLASS,ADDRESS,APT,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
25,1,ALPHABET CITY,09 COOPS - WALKUP APARTMENTS,2,373,40,,C6,"327 EAST 3RD STREET, 3E",,...,0,0,0,0,0,1920,2,C6,390000,2016-06-16
26,1,ALPHABET CITY,09 COOPS - WALKUP APARTMENTS,2,373,46,,C6,"317 EAST 3RD STREET, 9",,...,0,0,0,0,0,1925,2,C6,635000,2016-02-18
27,1,ALPHABET CITY,09 COOPS - WALKUP APARTMENTS,2,373,49,,C6,"311 EAST 3RD STREET, 24",,...,0,0,0,0,0,1920,2,C6,465000,2016-01-11
28,1,ALPHABET CITY,09 COOPS - WALKUP APARTMENTS,2,376,18,,C6,"252 EAST 7TH STREET, 21/22",,...,0,0,0,0,0,1928,2,C6,900000,2015-09-09
29,1,ALPHABET CITY,09 COOPS - WALKUP APARTMENTS,2,376,19,,C6,"254 EAST 7TH STREET, 18",,...,0,0,0,0,0,1910,2,C6,600000,2015-08-20


I have created functions to make the apartment numbers consistent. A review of the rolling sales data shows that condominium sales have a separate apartment field, while cooperative sales include the apartment in the ADDRESS field.

In [140]:
pd.options.mode.chained_assignment = None  # default='warn'

# split apt from ADDRESS and update columns in df
cleanaddress=df['ADDRESS'].apply(lambda x: pd.Series(x.split(',')))

df['ADDRESS']=cleanaddress[0]
df['APT2']=cleanaddress[1]

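def fixapt(ldf):
    # use the apartment parsed from ADDRESS when APT is missing or blank
    apt = ldf['APT']
    if pd.isnull(apt) or (isinstance(apt, str) and len(apt.strip()) < 1):
        return(ldf['APT2'])
    return(apt)

def stripapt(ldf):
    # strip surrounding spaces; non-string values (e.g. NaN) pass through unchanged
    x = ldf['APT']
    if isinstance(x, str):
        x = x.strip(' ')
    return(x)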

df['APT']=df.apply(fixapt, axis=1)  
df['APT']=df.apply(stripapt, axis=1)   
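
The same clean-up can probably be done without row-wise `apply`. The sketch below is one possible vectorized version on a small made-up dataframe (`toy` and its rows are illustrative, not taken from the rolling sales file). Unlike the cell above, it splits on the first comma only and also strips the ADDRESS text, so treat it as an alternative under those assumptions rather than an exact equivalent.

```python
import pandas as pd

# Made-up rows: co-ops keep the apartment inside ADDRESS, condos use APT
toy = pd.DataFrame({
    'ADDRESS': ['327 EAST 3RD STREET, 3E', '100 MAIN STREET', '254 EAST 7TH STREET, 18'],
    'APT': [None, ' 6B ', '   '],
})

# split on the first comma only; column 1 holds the apartment, if there is one
parts = toy['ADDRESS'].str.split(',', n=1, expand=True)
toy['ADDRESS'] = parts[0].str.strip()
apt_from_address = parts[1].str.strip()

# keep APT where it is present and non-blank, otherwise fall back to the parsed value
apt = toy['APT'].str.strip()
toy['APT'] = apt.where(apt.notna() & (apt != ''), apt_from_address)

print(toy)
```

On a table of this size the speed difference is small, so the main benefit is that the fallback rule is visible in a single expression.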

Condominiums are treated differently from cooperatives in many ways, including the fact that each condominium unit is assigned its own unique lot number. To analyze condominium sales on a building-wide basis, we need to identify which building each condominium unit is in. Luckily, the Department of Finance provides data on the range of condominium unit lots that belong to each building lot.

First I will clean the lot information and create a separate column, 'CONDOLOT', to save the condo lot numbers.

In [141]:
# cleans lot
'''def fixlot(ldf):  
    try:
        x=ldf['LOT2'] 
        x=str(x)
        x=x.strip('()[]=-,')
        x=int(x)
    except:
        x=ldf['LOT']
    return(x)         
'''
def savecondolot(ldf):
    condoclass=['R4','R5']
    if ldf['BCLASS'] in condoclass:
        condolot = ldf['LOT']
    else:
        condolot = 0
    return(condolot)

#df['LOT']=df.apply(fixlot, axis=1)

# creates a new column in which the condo lot number is saved
df['CONDOLOT']=df.apply(savecondolot, axis=1)

#create df of just condos 
condoclass=['R4','R5']
df_condo=df[df.BCLASS.isin(condoclass)]
print(df_condo.shape)
 

(7865, 23)
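
For reference, the CONDOLOT column could also be built without `apply`. This is a minimal sketch on made-up rows (the `toy` dataframe is illustrative), assuming the rule is exactly the one in `savecondolot`: keep LOT for building classes R4 and R5, and write 0 otherwise.

```python
import pandas as pd

# Made-up rows mixing condo (R4, R5) and co-op (D4, C6) building classes
toy = pd.DataFrame({
    'BCLASS': ['R4', 'D4', 'R5', 'C6'],
    'LOT': [1010, 40, 1224, 46],
})
condoclass = ['R4', 'R5']

# keep LOT where the row is a condo, otherwise write 0
is_condo = toy['BCLASS'].isin(condoclass)
toy['CONDOLOT'] = toy['LOT'].where(is_condo, 0)

print(toy)
```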


I have saved the Department of Finance lot ranges as 'condo_blconverter.csv' and written a function that looks up the billing (building) lot number for each condo unit from that table. I apply the function to the condo-only dataframe created above and then add the non-condo rows back.

In [142]:
dfKey=pd.read_csv('processed/condo_blconverter.csv')

def addbuildinglot(ldf):
    condoclass=['R4','R5']
    if ldf['BCLASS'] in condoclass:
        B=ldf['BLOCK']
        L=ldf['LOT']
        resultall = dfKey.billlot[(dfKey.block == B) & (dfKey.hilot >= L) & (dfKey.lolot <= L)]  
        try:
            result=resultall.iloc[0]
        except IndexError:
            # no lot range in dfKey covers this unit
            result=0              
    else:
        result = ldf['LOT']
        print ('else ',result)
    return(result)   

df_condo['LOT'] = df_condo.apply(addbuildinglot, axis=1)
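
The row-wise lookup above filters `dfKey` once per sale. A merge-based version is one possible alternative; the sketch below uses a made-up key table with the same column names as `dfKey` (`block`, `lolot`, `hilot`, `billlot`) and a hypothetical output column `BUILDING_LOT`. It assumes, as the function above does, that the first matching range wins and that unmatched units get 0.

```python
import pandas as pd

# Made-up key rows using the column names read from condo_blconverter.csv
key = pd.DataFrame({
    'block': [373, 384, 384],
    'lolot': [1001, 1001, 1201],
    'hilot': [1100, 1200, 1300],
    'billlot': [7501, 7502, 7503],
})
sales = pd.DataFrame({'BLOCK': [373, 384, 392], 'LOT': [1010, 1224, 1019]})

# join each sale to every key row on its block, then keep rows whose range contains the unit lot
merged = sales.reset_index().merge(key, how='left', left_on='BLOCK', right_on='block')
in_range = merged[(merged['lolot'] <= merged['LOT']) & (merged['LOT'] <= merged['hilot'])]

# first match per sale; sales without a match get 0, mirroring the row-wise function
billlot = in_range.groupby('index')['billlot'].first()
sales['BUILDING_LOT'] = billlot.reindex(sales.index).fillna(0).astype(int)

print(sales)
```

The merge creates one row per (sale, key row on the same block) pair, which is fine for a single borough but worth keeping in mind if the key table grows.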

In [143]:
df_condo['LOT'].value_counts()

7501.0    3460
7502.0    1516
7503.0    1477
7504.0     374
7509.0     227
7505.0     164
7506.0     151
7508.0     135
7521.0     122
0.0         62
7507.0      45
7510.0      24
7517.0      19
7511.0      17
7514.0      16
7516.0      16
7518.0      14
7513.0      13
7515.0      11
7512.0       2
Name: LOT, dtype: int64
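
The 62 rows with a LOT of 0 are condo sales for which no lot range was found. A quick way to see which units are affected might look like the sketch below; the `toy` rows are made up, and grouping by BLOCK and CONDOLOT is just one reasonable way to summarize them.

```python
import pandas as pd

# Made-up condo rows after the lookup; LOT == 0 marks an unmatched unit
toy = pd.DataFrame({
    'BLOCK': [373, 392, 392, 400],
    'CONDOLOT': [1010, 1019, 1019, 1301],
    'LOT': [7501.0, 0.0, 0.0, 0.0],
})

# rows where the lookup fell through to the 0 placeholder
unmatched = toy[toy['LOT'] == 0]

# count unmatched sales per (block, unit lot) pair to see which ranges are missing
missing = unmatched.groupby(['BLOCK', 'CONDOLOT']).size().reset_index(name='sales')

print(missing)
```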

In [144]:
df_else=df[~df.BCLASS.isin(condoclass)]
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
df = pd.concat([df_condo, df_else])
df.shape

(14833, 23)

In [145]:
df.dtypes

BOROUGH                                    int64
NEIGHBORHOOD                              object
BUILDING CLASS CATEGORY                   object
TAX CLASS AT PRESENT                      object
BLOCK                                      int64
LOT                                      float64
EASEMENT                                 float64
BCLASS                                    object
ADDRESS                                   object
APT                                       object
ZIP CODE                                   int64
RESIDENTIAL UNITS                          int64
COMMERCIAL UNITS                           int64
TOTAL UNITS                                int64
LAND SQUARE FEET                           int64
GROSS SQUARE FEET                          int64
YEAR BUILT                                 int64
TAX CLASS AT TIME OF SALE                  int64
BUILDING CLASS AT TIME OF SALE            object
SALE PRICE                                 int64
SALE DATE           

In [146]:
df = df.drop(['APT2'], axis=1)

In [147]:
a=['BLOCK','LOT','APT','BUILDING CLASS CATEGORY','CONDOLOT']
df[a][1:100]

Unnamed: 0,BLOCK,LOT,APT,BUILDING CLASS CATEGORY,CONDOLOT
86,373,7501.0,6B,13 CONDOS - ELEVATOR APARTMENTS,1010
87,375,7501.0,5A,13 CONDOS - ELEVATOR APARTMENTS,1017
88,375,7501.0,6B,13 CONDOS - ELEVATOR APARTMENTS,1023
89,384,7503.0,PHF,13 CONDOS - ELEVATOR APARTMENTS,1224
90,392,7501.0,5B,13 CONDOS - ELEVATOR APARTMENTS,1019
91,392,7501.0,14E,13 CONDOS - ELEVATOR APARTMENTS,1075
92,392,7501.0,14E,13 CONDOS - ELEVATOR APARTMENTS,1075
93,392,7501.0,PHB,13 CONDOS - ELEVATOR APARTMENTS,1085
94,392,7501.0,PHB,13 CONDOS - ELEVATOR APARTMENTS,1085
95,392,7501.0,PHC,13 CONDOS - ELEVATOR APARTMENTS,1086


In [148]:
print(df.shape)
df.dtypes

(14833, 22)


BOROUGH                                    int64
NEIGHBORHOOD                              object
BUILDING CLASS CATEGORY                   object
TAX CLASS AT PRESENT                      object
BLOCK                                      int64
LOT                                      float64
EASEMENT                                 float64
BCLASS                                    object
ADDRESS                                   object
APT                                       object
ZIP CODE                                   int64
RESIDENTIAL UNITS                          int64
COMMERCIAL UNITS                           int64
TOTAL UNITS                                int64
LAND SQUARE FEET                           int64
GROSS SQUARE FEET                          int64
YEAR BUILT                                 int64
TAX CLASS AT TIME OF SALE                  int64
BUILDING CLASS AT TIME OF SALE            object
SALE PRICE                                 int64
SALE DATE           

In [149]:
df['LOT'].value_counts()

7501.0    3460
7502.0    1516
7503.0    1477
1.0        981
7504.0     374
23.0       234
29.0       227
7509.0     227
7505.0     164
7506.0     151
37.0       150
20.0       146
19.0       143
33.0       138
5.0        137
7508.0     135
17.0       133
14.0       131
7521.0     122
12.0       122
22.0       120
38.0       118
35.0       116
16.0       113
31.0       112
13.0       112
8.0        109
7.0        109
43.0       103
18.0       102
          ... 
105.0        2
522.0        2
523.0        2
7512.0       2
91.0         2
89.0         2
158.0        2
500.0        1
154.0        1
316.0        1
309.0        1
205.0        1
175.0        1
160.0        1
146.0        1
151.0        1
150.0        1
148.0        1
144.0        1
140.0        1
138.0        1
137.0        1
135.0        1
129.0        1
128.0        1
106.0        1
104.0        1
101.0        1
100.0        1
9002.0       1
Name: LOT, dtype: int64

In [150]:
df['LOT'] = pd.to_numeric(df['LOT'], errors='coerce')

In [151]:
df.dtypes

BOROUGH                                    int64
NEIGHBORHOOD                              object
BUILDING CLASS CATEGORY                   object
TAX CLASS AT PRESENT                      object
BLOCK                                      int64
LOT                                      float64
EASEMENT                                 float64
BCLASS                                    object
ADDRESS                                   object
APT                                       object
ZIP CODE                                   int64
RESIDENTIAL UNITS                          int64
COMMERCIAL UNITS                           int64
TOTAL UNITS                                int64
LAND SQUARE FEET                           int64
GROSS SQUARE FEET                          int64
YEAR BUILT                                 int64
TAX CLASS AT TIME OF SALE                  int64
BUILDING CLASS AT TIME OF SALE            object
SALE PRICE                                 int64
SALE DATE           

In [152]:
df.LOT = df.LOT.astype(int)
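
The integer cast works here because LOT has no missing values. If `pd.to_numeric(..., errors='coerce')` had produced any NaN, `astype(int)` would raise an error. The sketch below, on a made-up series, shows two common ways around that; which one fits depends on whether 0 is an acceptable placeholder.

```python
import pandas as pd

# Made-up LOT values with one missing entry
lots = pd.Series([7501.0, 0.0, 633.0, None], name='LOT')
lots = pd.to_numeric(lots, errors='coerce')

# option 1: fill missing values with a placeholder before casting
filled = lots.fillna(0).astype(int)

# option 2: use the nullable integer dtype, which keeps missing values as <NA>
nullable = lots.astype('Int64')

print(filled)
print(nullable)
```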

In [153]:
df['LOT']

85       7501
86       7501
87       7501
88       7501
89       7503
90       7501
91       7501
92       7501
93       7501
94       7501
95       7501
96       7501
97       7502
98          0
99          0
100         0
101         0
102         0
103         0
104         0
105         0
106         0
107         0
108         0
109         0
110      7504
111      7504
112      7503
113      7503
114      7503
         ... 
23433     633
23434     633
23435     633
23436     633
23437     633
23438     633
23439     633
23440     633
23441     633
23442     633
23443     633
23444     633
23445     633
23446     633
23447     633
23448     633
23449     633
23450     633
23451     633
23452     633
23453     633
23454     633
23455     633
23456     633
23457     633
23458     110
23459     120
23460     130
23461     130
23462     130
Name: LOT, dtype: int32

In [154]:
df.dtypes

BOROUGH                                    int64
NEIGHBORHOOD                              object
BUILDING CLASS CATEGORY                   object
TAX CLASS AT PRESENT                      object
BLOCK                                      int64
LOT                                        int32
EASEMENT                                 float64
BCLASS                                    object
ADDRESS                                   object
APT                                       object
ZIP CODE                                   int64
RESIDENTIAL UNITS                          int64
COMMERCIAL UNITS                           int64
TOTAL UNITS                                int64
LAND SQUARE FEET                           int64
GROSS SQUARE FEET                          int64
YEAR BUILT                                 int64
TAX CLASS AT TIME OF SALE                  int64
BUILDING CLASS AT TIME OF SALE            object
SALE PRICE                                 int64
SALE DATE           

In [155]:
df.to_csv('processed/RollingJulyProcessed.csv')
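
Because `to_csv` writes the index as an unnamed first column, reading the file back with default settings adds an 'Unnamed: 0' column. The sketch below demonstrates this on a made-up dataframe, using an in-memory buffer instead of the real file, and shows `index_col=0` as one way to restore the original index.

```python
import io
import pandas as pd

# Made-up frame whose index is not 0..n-1, like the filtered sales data
toy = pd.DataFrame({'BLOCK': [373, 392], 'LOT': [7501, 7501]}, index=[85, 90])

buf = io.StringIO()
toy.to_csv(buf)  # the index is written as an unnamed first column

buf.seek(0)
naive = pd.read_csv(buf)  # index comes back as a column named 'Unnamed: 0'

buf.seek(0)
restored = pd.read_csv(buf, index_col=0)  # first column is used as the index again

print(naive.columns.tolist())
print(restored)
```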