# Cleaning Department of Buildings Complaints dataset

#### By: Mahdi Shadkam-Farrokhi & Jeremy Ondov

### Resources
- [Data Source](https://data.cityofnewyork.us/Housing-Development/DOB-Complaints-Received/eabe-havv)
- [Complaint Codes](https://www1.nyc.gov/assets/buildings/pdf/complaint_category.pdf)
- [Disposition Codes](https://www1.nyc.gov/assets/buildings/pdf/bis_complaint_disposition_codes.pdf)
- [Data Explains](https://docs.google.com/spreadsheets/d/10p0HLqinKbUrSjKaZC2E0ZTHDXgULT0K/edit#gid=1015257717)

## Loading libraries and data

In [1]:
import pandas as pd
import math

Given the massive size of the dataset, we'll only pull a sample from the relevant observations.

After some outside research, we discovered the DOB website went live in 2009, which drastically altered the shape of the data after that point. Therefore, we will only select observations from 2009 onward, as these are much more applicable to current events.

In [None]:
target_size = 100_000 # desired sample size

In [2]:
data_file = "./data/DOB_Complaints_Received.csv"
chunk_size = 100_000 # number of lines used for each iterated read through file
skip = math.ceil(1_300_000 / target_size) # the sample rate. Every "skip"th observation is selected

dtypes = {
    'Complaint Number':"int64",
    'ZIP Code':"object",
    'Special District':"object",
    'Complaint Category':"object",
    'Unit':"object",
    'Date Entered':"object",
    'Status':"object",
    'House Street':"object",
    'House Number':"object"
}

keepers = [
    'Complaint Number',
    'ZIP Code',
    'Special District',
    'Complaint Category',
    'Unit',
    'Date Entered',
    'Inspection Date',
    'Status',
    'House Street',
    'House Number'
]

iteration_obj = pd.read_csv(
                    data_file, 
                    usecols = keepers, 
                    parse_dates=['Date Entered'], 
                    iterator = True,
                    chunksize = chunk_size,
                    dtype = dtypes
                );

db = None
current_n = 0
# note: `_currow` is a private attribute of the pandas reader and may change between versions
while db is None or (db.shape[0] < target_size and iteration_obj._currow <= 2_300_000):
    raw_dataframe = iteration_obj.get_chunk()
    # removing anything before 2009
    filtered_dataframe = raw_dataframe[raw_dataframe["Date Entered"] > "2009"]
    # taking every "skip"th observation from this chunk
    sampled_dataframe = filtered_dataframe.iloc[::skip, :]
    # adding to sample (the first chunk initializes it, so no rows are added twice)
    db = sampled_dataframe if db is None else pd.concat([db, sampled_dataframe], axis = 0)
    current_n += chunk_size
    print("Working up to row # {} | Current sample length = {}".format(current_n,db.shape[0]))

iteration_obj.close() # releasing the file handle held by the reader

Working up to row # 100000 | Current sample length = 0
Working up to row # 200000 | Current sample length = 0
Working up to row # 300000 | Current sample length = 5044
Working up to row # 400000 | Current sample length = 12737
Working up to row # 500000 | Current sample length = 20430
Working up to row # 600000 | Current sample length = 21631
Working up to row # 700000 | Current sample length = 26478
Working up to row # 800000 | Current sample length = 34171
Working up to row # 900000 | Current sample length = 34714
Working up to row # 1000000 | Current sample length = 34714
Working up to row # 1100000 | Current sample length = 36543
Working up to row # 1200000 | Current sample length = 44236
Working up to row # 1300000 | Current sample length = 51929
Working up to row # 1400000 | Current sample length = 59622
Working up to row # 1500000 | Current sample length = 67315
Working up to row # 1600000 | Current sample length = 68248
Working up to row # 1700000 | Current sample length = 6824
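
The chunked read and every-`skip`th-row sampling used above can be sketched on a toy in-memory CSV. This is only an illustration under stated assumptions: the 40-row frame, the chunk size of 10 and the `pieces` list are made up for the example, and it iterates over the reader with a plain `for` loop instead of checking the private `_currow` attribute.

```python
import io
import math

import pandas as pd

# Toy stand-in for the complaints file (hypothetical data): 40 rows, one every 30 days.
dates = pd.date_range("2008-06-01", periods=40, freq="30D")
toy = pd.DataFrame({"Complaint Number": range(1, 41), "Date Entered": dates})
csv_text = toy.to_csv(index=False)

target_size = 5
skip = math.ceil(len(toy) / target_size)  # 40 / 5 -> keep every 8th row

pieces = []
reader = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date Entered"],
    chunksize=10,  # rows read per iteration
)
for chunk in reader:
    # keep only observations entered from 2009 onward
    recent = chunk[chunk["Date Entered"] >= "2009-01-01"]
    # systematic sample: every "skip"th remaining row of this chunk
    pieces.append(recent.iloc[::skip, :])
    if sum(len(piece) for piece in pieces) >= target_size:
        break

sample = pd.concat(pieces, axis=0)
print(sample.shape)
```

As in the loop above, the sampling restarts at the top of each chunk, so the spacing between selected rows is not perfectly uniform across chunk boundaries.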

In [3]:
db.shape

(102991, 8)

In [4]:
db.head()

Unnamed: 0,Complaint Number,Status,Date Entered,ZIP Code,Special District,Complaint Category,Unit,Inspection Date
234435,1245555,CLOSED,2009-01-02,10025,,58,BOILR,06/02/2009
234448,1245568,CLOSED,2009-01-02,10036,,4,ERT,01/02/2009
234461,1245582,CLOSED,2009-01-02,10001,,23,SCFLD,01/02/2009
234474,1245595,CLOSED,2009-01-02,10031,,54,MAN.,01/02/2009
234487,1245608,CLOSED,2009-01-02,10013,,23,SCFLD,10/09/2009


We'll be working with roughly 100,000 observations.

## Data Cleaning

### Filter only closed complaints

In [5]:
db = db[db["Status"] == "CLOSED"]

### Converting Inspection date to datetime

In [6]:
db["Inspection Date"].head()

234435    06/02/2009
234448    01/02/2009
234461    01/02/2009
234474    01/02/2009
234487    10/09/2009
Name: Inspection Date, dtype: object

Some dates are erroneous and out of bounds for conversion to datetime. Although a few observations could be imputed manually by inference, that is impractical in the main workflow for many hundreds of thousands of observations.

According to the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-timestamp-limits), the earliest timestamp pandas can represent is `pd.Timestamp.min`. We'll also remove any dates prior to 2009, which serves as our hard cutoff for consideration.
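
To make the two cutoffs concrete, here is a minimal sketch on a few hypothetical date strings in the file's MM/DD/YYYY format. The values in `raw` are made up for illustration (the last one imitates a data-entry typo), and the sketch compares years only, which is an approximation of the exact `pd.Timestamp.min` bound.

```python
import pandas as pd

# Hypothetical raw values; "01/02/1009" imitates a typo in the year.
raw = pd.Series(["06/02/2009", "12/20/2008", "01/02/1009"])

# Pull the year out of each MM/DD/YYYY string.
years = raw.str.split("/").str[2].astype(int)

# Earliest year representable by a nanosecond-resolution pandas Timestamp.
earliest_year = pd.Timestamp.min.year

out_of_bounds = years < earliest_year  # cannot be converted to datetime64[ns]
before_cutoff = years < 2009           # outside our window of interest

print(earliest_year)           # 1677
print(out_of_bounds.tolist())  # [False, False, True]
print(before_cutoff.tolist())  # [False, True, True]
```

Because 2009 is later than the earliest representable year, filtering on the 2009 cutoff also removes every date that would be too early to convert.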

In [7]:
years = db["Inspection Date"].apply(lambda a:a.split("/")[2])

In [8]:
invalid_year_indeces = years[years.astype(int) < 2009].index

In [9]:
invalid_year_indeces.shape

(45,)

In [10]:
db.loc[invalid_year_indeces,:].head()

Unnamed: 0,Complaint Number,Status,Date Entered,ZIP Code,Special District,Complaint Category,Unit,Inspection Date
234617,1245742,CLOSED,2009-01-05,10002,,23,SCFLD,12/20/2008
234630,1245755,CLOSED,2009-01-05,10009,,23,SCFLD,12/20/2008
234656,1245781,CLOSED,2009-01-05,10036,,23,SCFLD,12/17/2008
234799,1245929,CLOSED,2009-01-07,10024,,23,SCFLD,12/24/2008
234812,1245945,CLOSED,2009-01-07,10023,,23,SCFLD,12/19/2008


In [11]:
db.drop(index = invalid_year_indeces, inplace = True)

In [12]:
db["Inspection Date"] = db["Inspection Date"].astype('datetime64[ns]')

In [13]:
db.dtypes

Complaint Number               int64
Status                        object
Date Entered          datetime64[ns]
ZIP Code                      object
Special District              object
Complaint Category            object
Unit                          object
Inspection Date       datetime64[ns]
dtype: object

In [14]:
db.describe(include = "all")

Unnamed: 0,Complaint Number,Status,Date Entered,ZIP Code,Special District,Complaint Category,Unit,Inspection Date
count,91655.0,91655,91655,91655.0,91655.0,91655.0,91655,91655
unique,,1,3914,209.0,2.0,119.0,31,3881
top,,CLOSED,2016-10-18 00:00:00,11419.0,,45.0,QNS.,2019-02-22 00:00:00
freq,,91655,67,1592.0,91019.0,12924.0,17404,71
first,,,2009-01-02 00:00:00,,,,,2009-01-02 00:00:00
last,,,2019-09-21 00:00:00,,,,,2019-09-22 00:00:00
mean,3268666.0,,,,,,,
std,1255743.0,,,,,,,
min,1245555.0,,,,,,,
25%,2168755.0,,,,,,,


In [15]:
db.shape

(91655, 8)

We now have over 90,000 cleaned observations.

TODO: fix ZIP code (strip trailing whitespace; at this point the column is still named "ZIP Code")
db["ZIP Code"].map(lambda x:x.rstrip()).iloc[0]

## Feature Engineering

### Creating target variable
Our target is the number of days until a complaint's inspection date:

$$\text{Inspection Date} - \text{Date Entered} = \text{Days until Inspection}$$
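
As a quick check of this subtraction, here is a minimal sketch using the entry and inspection dates of two complaints from the sample shown earlier (both entered on 2009-01-02). The `toy` frame and the `wait` variable are names introduced only for this illustration.

```python
import pandas as pd

# Two complaints taken from the head of the sample.
toy = pd.DataFrame({
    "Date Entered": pd.to_datetime(["2009-01-02", "2009-01-02"]),
    "Inspection Date": pd.to_datetime(["2009-06-02", "2009-10-09"]),
})

# Subtracting two datetime columns yields a timedelta column.
wait = toy["Inspection Date"] - toy["Date Entered"]

# .dt.days extracts the whole number of days from each timedelta.
toy["days_until_inspection"] = wait.dt.days

print(toy["days_until_inspection"].tolist())  # [151, 280]
```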

In [16]:
db["days_until_inspection"] = db["Inspection Date"] - db["Date Entered"]

In [17]:
db.describe()

Unnamed: 0,Complaint Number,days_until_inspection
count,91655.0,91655
mean,3268666.0,69 days 04:53:24.276907
std,1255743.0,195 days 16:18:54.739841
min,1245555.0,-1449 days +00:00:00
25%,2168755.0,1 days 00:00:00
50%,3498041.0,12 days 00:00:00
75%,4493822.0,63 days 00:00:00
max,5124824.0,3767 days 00:00:00


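Some complaints show a negative number of days until inspection, which is impossible, so these observations will be removed. Same-day inspections (zero days) are dropped as well, since the filter below keeps only strictly positive values.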

In [18]:
# extracting raw number of days
db["days_until_inspection"] = db["days_until_inspection"].dt.days

In [19]:
db[db["days_until_inspection"] <= 0].shape

(17008, 9)

We'll be dropping about 17,000 observations.

In [20]:
db = db[db["days_until_inspection"] > 0]

### Fixing Special District

The "Special District" column has a blank category (a string of three spaces), which we'll change to "NOT SPECIAL".

In [21]:
db["Special District"].unique()

array(['   ', 'IBZ'], dtype=object)

In [22]:
db["Special District"] = db["Special District"].map(lambda x: x if x != '   ' else "NOT SPECIAL")

In [23]:
db["Special District"].unique()

array(['NOT SPECIAL', 'IBZ'], dtype=object)
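
The replacement above matches exactly three spaces. If blank values of other widths ever showed up, a whitespace-insensitive variant might look like the sketch below. This is an assumption about possible future data, not something observed in this sample, and the `special` series is made up for the example.

```python
import pandas as pd

# Hypothetical values: blanks of different widths plus one real district code.
special = pd.Series(["   ", "IBZ", "", " "])

# Keep values that are non-empty after stripping whitespace; relabel the rest.
cleaned = special.where(special.str.strip() != "", "NOT SPECIAL")

print(cleaned.tolist())  # ['NOT SPECIAL', 'IBZ', 'NOT SPECIAL', 'NOT SPECIAL']
```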

## Renaming columns

In [24]:
# removing spaces & forcing all to lowercase
db.columns = [col.lower().replace(" ", "_") for col in db.columns]

## Handling Nulls

We're not finding nulls in our dataset; however, as a general step in our cleaning process, we will still drop any null observations.

Dropping rows this way assumes the missing values are missing completely at random (MCAR), which we must keep in mind throughout our process.
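
A minimal sketch of this check-then-drop step on a toy frame is shown below. The values are hypothetical, and counting the affected rows before dropping them is a suggestion for monitoring how much data the MCAR assumption touches, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each of two different rows.
toy = pd.DataFrame({
    "zip_code": ["10025", None, "10013"],
    "days_until_inspection": [151, 20, np.nan],
})

null_counts = toy.isnull().sum()                 # nulls per column
rows_with_nulls = toy.isnull().any(axis=1).sum() # rows that dropna() will remove
cleaned = toy.dropna()                           # keeps only fully populated rows

print(null_counts.tolist())  # [1, 1]
print(rows_with_nulls)       # 2
print(cleaned.shape)         # (1, 2)
```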

In [25]:
db.isnull().sum()

complaint_number         0
status                   0
date_entered             0
zip_code                 0
special_district         0
complaint_category       0
unit                     0
inspection_date          0
days_until_inspection    0
dtype: int64

In [26]:
db = db.dropna()

In [27]:
db.shape

(74647, 9)

In [28]:
db.head()

Unnamed: 0,complaint_number,status,date_entered,zip_code,special_district,complaint_category,unit,inspection_date,days_until_inspection
234435,1245555,CLOSED,2009-01-02,10025,NOT SPECIAL,58,BOILR,2009-06-02,151
234487,1245608,CLOSED,2009-01-02,10013,NOT SPECIAL,23,SCFLD,2009-10-09,280
234500,1245621,CLOSED,2009-01-02,10001,NOT SPECIAL,63,ELEVR,2009-01-22,20
234513,1245634,CLOSED,2009-01-03,10013,NOT SPECIAL,56,BOILR,2009-01-07,4
234526,1245648,CLOSED,2009-01-03,10016,NOT SPECIAL,59,ELCTR,2009-01-08,5


## Save cleaned data

In [29]:
# saving db
db.to_csv("./data/cleaned.csv")