# Cleaning Department of Buildings Complaints dataset

#### By: Mahdi Shadkam-Farrokhi & Jeremy Ondov

This notebook is intended for cleaning the DOB dataset we'll be working with in a separate notebook for EDA and modeling.

### Source Material
- [Data Source](https://data.cityofnewyork.us/Housing-Development/DOB-Complaints-Received/eabe-havv)
- [Complaint Codes](https://www1.nyc.gov/assets/buildings/pdf/complaint_category.pdf)
- [Disposition Codes](https://www1.nyc.gov/assets/buildings/pdf/bis_complaint_disposition_codes.pdf)
- [Data Explains](https://docs.google.com/spreadsheets/d/10p0HLqinKbUrSjKaZC2E0ZTHDXgULT0K/edit#gid=1015257717)

## Loading libraries and data

In [1]:
import pandas as pd
import math

Given the massive size of the dataset, we'll only pull a sample from the relevant observations.

After some outside research, we discovered the DOB website went live in 2009, which drastically altered the shape of the data after that point. Therefore, we will only select observations from 2009 onward, as these are much more applicable to current events.

In [2]:
target_size = 100_000 # desired sample size

In [3]:
data_file = "./datasets/DOB_Complaints_Received.csv"
chunk_size = 100_000 # number of lines used for each iterated read through file
skip = math.ceil(1_300_000 / target_size) # the sample rate. Every "skip"th observation is selected

dtypes = {
    'Complaint Number':"int64",
    'ZIP Code':"object",
    'Special District':"object",
    'Complaint Category':"object",
    'Unit':"object",
    'Date Entered':"object",
    'Status':"object",
    'House Street':"object",
    'House Number':"object"
}

keepers = [
    'Complaint Number',
    'ZIP Code',
    'Special District',
    'Complaint Category',
    'Unit',
    'Date Entered',
    'Inspection Date',
    'Status',
    'House Street',
    'House Number'
]

iteration_obj = pd.read_csv(
                    data_file, 
                    usecols = keepers, 
                    parse_dates=['Date Entered'], 
                    iterator = True,
                    chunksize = chunk_size,
                    dtype = dtypes
                );

db = None
current_n = 0
while db is None or (db.shape[0] < target_size and iteration_obj._currow <= 2_300_000):
    raw_dataframe = iteration_obj.get_chunk()
    # removing anything before 2009
    filtered_dataframe = raw_dataframe["2009" < raw_dataframe["Date Entered"]]
    if db is None:
        db = filtered_dataframe
    # adding to sample
    db = pd.concat([db, filtered_dataframe.iloc[::skip,:]], axis = 0)
    current_n += chunk_size
    print("Working up to row # {} | Current sample length = {}".format(current_n,db.shape[0]))

iteration_obj.close() # release the underlying file handle once sampling is done

Working up to row # 100000 | Current sample length = 55256
Working up to row # 200000 | Current sample length = 59144
Working up to row # 300000 | Current sample length = 63052
Working up to row # 400000 | Current sample length = 66961
Working up to row # 500000 | Current sample length = 70857
Working up to row # 600000 | Current sample length = 74768
Working up to row # 700000 | Current sample length = 78658
Working up to row # 800000 | Current sample length = 82596
Working up to row # 900000 | Current sample length = 86505
Working up to row # 1000000 | Current sample length = 90410
Working up to row # 1100000 | Current sample length = 94332
Working up to row # 1200000 | Current sample length = 98241
Working up to row # 1300000 | Current sample length = 102161
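
Note that the loop above keeps the first filtered chunk in full and then appends its sampled rows again, which is why the sample already holds 55,256 rows after the first 100,000 lines. It also relies on the private `_currow` attribute. Below is a minimal sketch of an alternative that takes every `skip`-th post-2009 row from each chunk exactly once. The helper name `sample_chunks`, the inclusive cutoff of 1 January 2009, and the toy data are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd


def sample_chunks(chunks, skip, cutoff="2009-01-01", target_size=None):
    """Keep every `skip`-th row dated on or after `cutoff` from each chunk."""
    cutoff = pd.Timestamp(cutoff)
    parts = []
    n_rows = 0
    for chunk in chunks:
        # drop anything entered before the cutoff, then sample systematically
        recent = chunk[chunk["Date Entered"] >= cutoff]
        sampled = recent.iloc[::skip]
        parts.append(sampled)
        n_rows += len(sampled)
        # stop early once the (optional) target size is reached
        if target_size is not None and n_rows >= target_size:
            break
    return pd.concat(parts, axis=0) if parts else pd.DataFrame()


# Toy stand-in for the chunks yielded by pd.read_csv(..., chunksize=...)
toy = pd.DataFrame({
    "Complaint Number": range(20),
    "Date Entered": pd.to_datetime(["2008-06-01"] * 5 + ["2010-06-01"] * 15),
})
toy_chunks = [toy.iloc[:10], toy.iloc[10:]]

sample = sample_chunks(toy_chunks, skip=5)
print(sample["Complaint Number"].tolist())  # [5, 10, 15]
```

Because the object returned by `pd.read_csv(..., chunksize=...)` is itself iterable, it could be passed in place of `toy_chunks`, which would remove the need to track the current row manually.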


In [4]:
db.shape

(102161, 10)

In [5]:
db.head()

| | Complaint Number | Status | Date Entered | House Number | ZIP Code | House Street | Special District | Complaint Category | Unit | Inspection Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2193181 | CLOSED | 2013-12-17 | 573 | 10458 | EAST FORDHAM ROAD | | 4B | SEP | 12/24/2013 |
| 6 | 1265849 | CLOSED | 2009-09-21 | 429 | 10075 | EAST 77 STREET | | 73 | MAN. | 10/03/2009 |
| 12 | 1404696 | CLOSED | 2015-09-02 | 21 | 10011 | WEST 8 STREET | | 37 | ERT | 09/03/2015 |
| 13 | 2149422 | CLOSED | 2010-08-25 | 2075 | 10462 | WALLACE AVENUE | | 23 | ERT | 10/09/2010 |
| 14 | 3312533 | CLOSED | 2009-07-20 | 819 | 11220 | 59 STREET | | 90 | CITY | 09/15/2009 |


We'll be working with roughly 100,000 observations.

## Data Cleaning

### Filter only closed complaints

The only relevant observations are cases that are currently closed.

In [6]:
db = db[db["Status"] == "CLOSED"]

### Converting Inspection Date to datetime

In [7]:
db["Inspection Date"].head()

1     12/24/2013
6     10/03/2009
12    09/03/2015
13    10/09/2010
14    09/15/2009
Name: Inspection Date, dtype: object

Some dates are erroneous and out of bounds for conversion to datetime. Although a few observations could be imputed manually by inference, it is impractical to include such a tactic in the main workflow for many hundreds of thousands of observations.

According to the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-timestamp-limits), the earliest timestamp pandas can represent is `pd.Timestamp.min` (in the year 1677 at nanosecond resolution), so a typo such as the year 0212 cannot be converted. We'll also remove any inspection dates prior to 2009, which serves as our hard cutoff for consideration, since the DOB website came online in January 2009 and the data shows a significant deviation from prior years.

In [8]:
years = db["Inspection Date"].apply(lambda a:a.split("/")[2])

In [9]:
invalid_year_indeces = years[years.astype(int) < 2009].index

In [10]:
invalid_year_indeces.shape

(62,)

In [11]:
db.loc[invalid_year_indeces,:].head()

| | Complaint Number | Status | Date Entered | House Number | ZIP Code | House Street | Special District | Complaint Category | Unit | Inspection Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 5658 | 4389173 | CLOSED | 2009-02-09 | 10-93 | 11101 | JACKSON AVENUE | | 31 | QNS. | 12/30/2008 |
| 11560 | 3291457 | CLOSED | 2009-01-06 | 315 | 11207 | PENNSYLVANIA AVENUE | | 5 | BKLYN | 11/26/2008 |
| 18369 | 3291632 | CLOSED | 2009-01-07 | 491 | 11208 | EMERALD STREET | | 55 | BKLYN | 11/20/2008 |
| 19303 | 1310016 | CLOSED | 2011-09-22 | 170 | 10012 | MERCER STREET | | 23 | SCFLD | 09/30/2000 |
| 22506 | 3399780 | CLOSED | 2012-01-01 | 2223 | 11226 | CORTELYOU ROAD | | 37 | ERT | 01/02/0212 |


In [12]:
db.drop(index = invalid_year_indeces, inplace = True)

In [13]:
db["Inspection Date"] = db["Inspection Date"].astype('datetime64[ns]')
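
As a hedged alternative to splitting the year out by hand, the sketch below parses the strings with `pd.to_datetime(..., errors="coerce")`, so anything that fails to parse becomes `NaT`, and then applies the 2009 cutoff in the same mask. The sample strings (including the deliberately invalid `"not a date"`) and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy inspection dates: one valid, one before the cutoff,
# one with a mistyped year, and one that is not a date at all
raw = pd.Series(["12/24/2013", "12/30/2008", "01/02/0212", "not a date"])

# errors="coerce" turns values that cannot be converted into NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# keep only rows that parsed and fall on or after the 2009 cutoff
cutoff = pd.Timestamp("2009-01-01")
valid = parsed.notna() & (parsed >= cutoff)

print(valid.tolist())  # [True, False, False, False]
```

Filtering a DataFrame with a mask like `valid` would drop the bad rows and leave a datetime column ready for arithmetic.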

In [14]:
db.dtypes

Complaint Number               int64
Status                        object
Date Entered          datetime64[ns]
House Number                  object
ZIP Code                      object
House Street                  object
Special District              object
Complaint Category            object
Unit                          object
Inspection Date       datetime64[ns]
dtype: object

In [15]:
db.describe(include = "all")

| | Complaint Number | Status | Date Entered | House Number | ZIP Code | House Street | Special District | Complaint Category | Unit | Inspection Date |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 100217.0 | 100217 | 100217 | 100217.0 | 100217.0 | 100217 | 100217.0 | 100217.0 | 100217 | 100217 |
| unique | | 1 | 3593 | 15221.0 | 204.0 | 7179 | 2.0 | 106.0 | 28 | 3570 |
| top | | CLOSED | 2009-02-12 00:00:00 | 15.0 | 11419.0 | BROADWAY | | 45.0 | QNS. | 2017-12-05 00:00:00 |
| freq | | 100217 | 77 | 256.0 | 1775.0 | 1106 | 99504.0 | 14694.0 | 19081 | 92 |
| first | | | 2009-01-02 00:00:00 | | | | | | | 2009-01-02 00:00:00 |
| last | | | 2018-11-06 00:00:00 | | | | | | | 2018-11-06 00:00:00 |
| mean | 3276906.0 | | | | | | | | | |
| std | 1266253.0 | | | | | | | | | |
| min | 1245559.0 | | | | | | | | | |
| 25% | 2166787.0 | | | | | | | | | |


In [16]:
db.shape

(100217, 10)

We now have over 100,000 cleaned observations.

## Feature Engineering

### Creating target variable
Our target is the number of days until a complaint's inspection date: 

$$\text{Inspection Date} - \text{Date Entered} = \text{Days until Inspection}$$

In [17]:
db["days_until_inspection"] = db["Inspection Date"] - db["Date Entered"]

In [18]:
db.describe()

| | Complaint Number | days_until_inspection |
|---|---|---|
| count | 100217.0 | 100217 |
| mean | 3276906.0 | 69 days 15:30:23.726513 |
| std | 1266253.0 | 189 days 17:28:03.537128 |
| min | 1245559.0 | -1460 days +00:00:00 |
| 25% | 2166787.0 | 1 days 00:00:00 |
| 50% | 3479654.0 | 12 days 00:00:00 |
| 75% | 4496623.0 | 66 days 00:00:00 |
| max | 5138147.0 | 3402 days 00:00:00 |


Some complaints show a negative number of days, which we investigated. Sources at NYC Open Data explained that some complaints were found by inspectors and inspected immediately, and were only officially filed at a later date. Since the wait for these "complaints" cannot be meaningfully predicted, we'll remove them from consideration.

In [19]:
# extracting raw number of days
db["days_until_inspection"] = db["days_until_inspection"].map(lambda x:x.days)

In [20]:
db[db["days_until_inspection"] <= 0].shape

(18081, 11)

We'll be dropping about 18,000 observations whose inspection took place on or before the date the complaint was entered.

In [21]:
db = db[db["days_until_inspection"] > 0]
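
The target computation and filter above can be illustrated on a few rows. This is a minimal sketch on made-up data that uses the `.dt.days` accessor in place of the `map` call. The `toy` frame and its dates are assumptions chosen to show a positive, a negative, and a zero-day case.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date Entered": pd.to_datetime(["2013-12-17", "2009-02-09", "2015-09-02"]),
    "Inspection Date": pd.to_datetime(["2013-12-24", "2008-12-30", "2015-09-02"]),
})

# subtracting two datetime columns yields timedeltas; .dt.days extracts whole days
toy["days_until_inspection"] = (toy["Inspection Date"] - toy["Date Entered"]).dt.days

# keep only complaints inspected strictly after they were entered
kept = toy[toy["days_until_inspection"] > 0]

print(toy["days_until_inspection"].tolist())  # [7, -41, 0]
print(len(kept))                              # 1
```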

### Fixing Special District

The "Special District" column has a blank (whitespace-only) category, which we'll change to "NOT SPECIAL".

In [22]:
db["Special District"].unique()

array(['   ', 'IBZ'], dtype=object)

In [23]:
db["Special District"] = db["Special District"].map(lambda x: x if x != '   ' else "NOT SPECIAL")

In [24]:
db["Special District"].unique()

array(['NOT SPECIAL', 'IBZ'], dtype=object)
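
The mapping above matches exactly three spaces. As a hedged variant, the sketch below strips whitespace first so that blanks of any width are caught. Unlike the original cell, it also trims padding around real labels. The toy values are assumptions for illustration.

```python
import pandas as pd

districts = pd.Series(["   ", "IBZ", "", " IBZ "])

# normalise whitespace, then relabel anything left empty
stripped = districts.str.strip()
cleaned = stripped.where(stripped != "", "NOT SPECIAL")

print(cleaned.tolist())  # ['NOT SPECIAL', 'IBZ', 'NOT SPECIAL', 'IBZ']
```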

## Renaming columns

In [25]:
# removing spaces & forcing all to lowercase
db.columns = [col.lower().replace(" ", "_") for col in db.columns]

## Handling Nulls

In [26]:
db.isnull().sum()

complaint_number         0
status                   0
date_entered             0
house_number             0
zip_code                 0
house_street             0
special_district         0
complaint_category       0
unit                     0
inspection_date          0
days_until_inspection    0
dtype: int64

We found no nulls in this sample. However, as part of our general cleaning process, we still drop any observations containing nulls.

Dropping rows this way assumes that any missing values are missing completely at random (MCAR).

In [27]:
db = db.dropna()

In [28]:
db.shape

(82136, 11)

In [29]:
db.head()

| | complaint_number | status | date_entered | house_number | zip_code | house_street | special_district | complaint_category | unit | inspection_date | days_until_inspection |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2193181 | CLOSED | 2013-12-17 | 573 | 10458 | EAST FORDHAM ROAD | NOT SPECIAL | 4B | SEP | 2013-12-24 | 7 |
| 6 | 1265849 | CLOSED | 2009-09-21 | 429 | 10075 | EAST 77 STREET | NOT SPECIAL | 73 | MAN. | 2009-10-03 | 12 |
| 12 | 1404696 | CLOSED | 2015-09-02 | 21 | 10011 | WEST 8 STREET | NOT SPECIAL | 37 | ERT | 2015-09-03 | 1 |
| 13 | 2149422 | CLOSED | 2010-08-25 | 2075 | 10462 | WALLACE AVENUE | NOT SPECIAL | 23 | ERT | 2010-10-09 | 45 |
| 14 | 3312533 | CLOSED | 2009-07-20 | 819 | 11220 | 59 STREET | NOT SPECIAL | 90 | CITY | 2009-09-15 | 57 |


We now have a complete dataset, which will be used in a separate notebook for EDA and modeling.

## Save cleaned data

In [30]:
# saving db
db.to_csv("./datasets/cleaned.csv")
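
When the cleaned file is read back in the follow-up notebook, the text and date columns may need explicit types, since a CSV does not store dtypes. The sketch below is one possible round trip on toy data. The temporary path, the `index=False` choice, and the sample ZIP code with a leading zero are assumptions for illustration, not part of the original workflow.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "complaint_number": [3291457],
    "zip_code": ["07030"],  # hypothetical ZIP with a leading zero
    "date_entered": pd.to_datetime(["2009-01-06"]),
})

# write to a throwaway location; index=False avoids an extra unnamed column on reload
path = os.path.join(tempfile.mkdtemp(), "cleaned_toy.csv")
toy.to_csv(path, index=False)

# read ZIP codes back as text and parse the date column
reloaded = pd.read_csv(path, dtype={"zip_code": str}, parse_dates=["date_entered"])

print(reloaded["zip_code"].iloc[0])  # 07030
```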