# Cleaning Department of Buildings Complaints dataset

#### By: Mahdi Shadkam-Farrokhi & Jeremy Ondov

### Resources
- [Data Source](https://data.cityofnewyork.us/Housing-Development/DOB-Complaints-Received/eabe-havv)
- [Complaint Codes](https://www1.nyc.gov/assets/buildings/pdf/complaint_category.pdf)
- [Disposition Codes](https://www1.nyc.gov/assets/buildings/pdf/bis_complaint_disposition_codes.pdf)
- [Data Explains](https://docs.google.com/spreadsheets/d/10p0HLqinKbUrSjKaZC2E0ZTHDXgULT0K/edit#gid=1015257717)

## Loading libraries and data

In [1]:
import pandas as pd
import googlemaps
import time
import math

Given the massive size of the dataset, we'll only pull a sample from the relevant observations.

After some outside research, we discovered the DOB website went live in 2009, which drastically altered the shape of the data after that point. Therefore, we will only select observations from 2009 onward, as these are much more applicable to current events.

In [67]:
target_size = 100_000 # desired sample size

In [68]:
data_file = "./datasets/DOB_Complaints_Received.csv"
chunk_size = 100_000 # number of lines used for each iterated read through file
skip = math.ceil(1_300_000 / target_size) # the sample rate. Every "skip"th observation is selected

dtypes = {
    'Complaint Number':"int64",
    'ZIP Code':"object",
    'Special District':"object",
    'Complaint Category':"object",
    'Unit':"object",
    'Date Entered':"object",
    'Status':"object",
    'House Street':"object",
    'House Number':"object"
}

keepers = [
    'Complaint Number',
    'ZIP Code',
    'Special District',
    'Complaint Category',
    'Unit',
    'Date Entered',
    'Inspection Date',
    'Status',
    'House Street',
    'House Number'
]

iteration_obj = pd.read_csv(
                    data_file, 
                    usecols = keepers, 
                    parse_dates=['Date Entered'], 
                    iterator = True,
                    chunksize = chunk_size,
                    dtype = dtypes
                );

db = None
current_n = 0 # number of rows read from the file so far
while db is None or (db.shape[0] < target_size and current_n <= 2_300_000):
    raw_dataframe = iteration_obj.get_chunk()
    # removing anything before 2009
    filtered_dataframe = raw_dataframe[raw_dataframe["Date Entered"] > "2009"]
    # taking every "skip"th row of the filtered chunk
    sampled_dataframe = filtered_dataframe.iloc[::skip,:]
    # adding to sample (the first chunk starts the sample rather than being kept whole)
    db = sampled_dataframe if db is None else pd.concat([db, sampled_dataframe], axis = 0)
    current_n += chunk_size
    print("Working up to row # {} | Current sample length = {}".format(current_n,db.shape[0]))

iteration_obj.close() # closing the file handle held by the reader

Working up to row # 100000 | Current sample length = 0
Working up to row # 200000 | Current sample length = 0
Working up to row # 300000 | Current sample length = 5044
Working up to row # 400000 | Current sample length = 12737
Working up to row # 500000 | Current sample length = 20430
Working up to row # 600000 | Current sample length = 21628
Working up to row # 700000 | Current sample length = 26477
Working up to row # 800000 | Current sample length = 34170
Working up to row # 900000 | Current sample length = 34709
Working up to row # 1000000 | Current sample length = 34709
Working up to row # 1100000 | Current sample length = 36542
Working up to row # 1200000 | Current sample length = 44235
Working up to row # 1300000 | Current sample length = 51928
Working up to row # 1400000 | Current sample length = 59621
Working up to row # 1500000 | Current sample length = 67314
Working up to row # 1600000 | Current sample length = 68238
Working up to row # 1700000 | Current sample length = 6823

In [69]:
db.shape

(102989, 10)

In [70]:
db.head()

Unnamed: 0,Complaint Number,Status,Date Entered,House Number,ZIP Code,House Street,Special District,Complaint Category,Unit,Inspection Date
234435,1245555,CLOSED,2009-01-02,930,10025,WEST END AVENUE,,58,BOILR,06/02/2009
234448,1245568,CLOSED,2009-01-02,639,10036,WEST 46 STREET,,4,ERT,01/02/2009
234461,1245582,CLOSED,2009-01-02,34,10001,WEST 32 STREET,,23,SCFLD,01/02/2009
234474,1245595,CLOSED,2009-01-02,515,10031,WEST 139 STREET,,54,MAN.,01/02/2009
234487,1245608,CLOSED,2009-01-02,428,10013,BROADWAY,,23,SCFLD,10/09/2009


We'll be working with roughly 100,000 observations.
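
As a rough check on the sampling arithmetic in the loading loop above, the sketch below recomputes the sample rate and the number of rows a single chunk can contribute. The 1,300,000 figure is the row estimate hard-coded in the loading cell; `estimated_rows` and `rows_per_full_chunk` are illustrative names that do not appear in the notebook.

```python
import math

target_size = 100_000        # desired sample size (from the loading cell)
chunk_size = 100_000         # rows read per iteration (from the loading cell)
estimated_rows = 1_300_000   # row estimate hard-coded in the loading cell

# every "skip"th filtered row is kept
skip = math.ceil(estimated_rows / target_size)

# rows contributed by a chunk in which every row passes the 2009 date filter
rows_per_full_chunk = len(range(0, chunk_size, skip))

print(skip)                  # 13
print(rows_per_full_chunk)   # 7693
```

A chunk made up entirely of post-2009 rows therefore adds 7,693 rows, which matches the jumps in the progress log (for example, from 5,044 to 12,737). Because the loop only checks the sample size after a whole chunk has been processed, the final sample can overshoot the target slightly.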

We are also bringing in a dataset of median household income for each zip code in New York. This data was sourced from the American Community Survey, using their 5-year estimates from 2017, expressed in 2017 inflation-adjusted dollars.

In [71]:
income_db = pd.read_csv("./datasets/ACS_17_5YR_S1901_with_ann.csv", header=1)
income_db.head()

Unnamed: 0,Id,Id2,Geography,Households; Estimate; Median income (dollars),Households; Margin of Error; Median income (dollars),Families; Estimate; Median income (dollars),Families; Margin of Error; Median income (dollars),Married-couple families; Estimate; Median income (dollars),Married-couple families; Margin of Error; Median income (dollars),Nonfamily households; Estimate; Median income (dollars),Nonfamily households; Margin of Error; Median income (dollars)
0,8600000US06390,6390,ZCTA5 06390,150703,86256,151172,67633,131875,104286,-,**
1,8600000US07421,7421,ZCTA5 07421,90412,4718,99948,7733,112639,19572,69906,16080
2,8600000US10001,10001,ZCTA5 10001,85221,9970,103304,29429,149007,18758,75794,8870
3,8600000US10002,10002,ZCTA5 10002,35449,2696,39145,3234,42485,6118,28319,3989
4,8600000US10003,10003,ZCTA5 10003,104441,6666,183657,17463,198650,9364,86768,7078


## Data Cleaning

### Filter only closed complaints

This project focuses on estimating department response times, so we only want to work with entries that have actually received a response. Therefore, we will filter out all cases that are still open.

In [72]:
db = db[db["Status"] == "CLOSED"]

### Converting Inspection date to datetime

In [73]:
db["Inspection Date"].head()

234435    06/02/2009
234448    01/02/2009
234461    01/02/2009
234474    01/02/2009
234487    10/09/2009
Name: Inspection Date, dtype: object

Some inspection dates are erroneous and fall outside the bounds for conversion to datetime. Although a few observations could be imputed manually by inference, it is impractical to include such a tactic in the main workflow for many hundreds of thousands of observations.

According to the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-timestamp-limits), the earliest valid timestamp that pandas can represent is `pd.Timestamp.min`. We'll also remove any inspection dates prior to 2009, which serves as our hard cutoff for consideration.
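
An alternative worth considering, shown here only as a sketch and not as the approach this notebook takes, is to let pandas mark unparseable dates as `NaT` with `errors="coerce"` and then apply the 2009 cutoff. The three sample strings are made up for illustration; the last one is deliberately malformed.

```python
import pandas as pd

# illustrative values only: two real-looking dates and one malformed entry
raw = pd.Series(["06/02/2009", "12/20/2008", "00/00/0000"])

# unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# comparisons against NaT are False, so bad dates drop out with the cutoff
valid = parsed[parsed >= "2009-01-01"]

print(parsed.isna().tolist())  # [False, False, True]
print(valid.tolist())
```

This avoids splitting the date strings by hand, at the cost of silently discarding anything pandas cannot parse, so it is worth counting the `NaT` values before dropping them.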

In [74]:
years = db["Inspection Date"].apply(lambda a:a.split("/")[2])

In [75]:
invalid_year_indeces = years[years.astype(int) < 2009].index

In [76]:
invalid_year_indeces.shape

(37,)

In [77]:
db.loc[invalid_year_indeces,:].head()

Unnamed: 0,Complaint Number,Status,Date Entered,House Number,ZIP Code,House Street,Special District,Complaint Category,Unit,Inspection Date
234617,1245742,CLOSED,2009-01-05,48,10002,CANAL STREET,,23,SCFLD,12/20/2008
234630,1245755,CLOSED,2009-01-05,252,10009,EAST 4 STREET,,23,SCFLD,12/20/2008
234656,1245781,CLOSED,2009-01-05,15,10036,WEST 47 STREET,,23,SCFLD,12/17/2008
234799,1245929,CLOSED,2009-01-07,76,10024,WEST 82 STREET,,23,SCFLD,12/24/2008
234812,1245945,CLOSED,2009-01-07,122,10023,WEST 71 STREET,,23,SCFLD,12/19/2008


In [78]:
db.drop(index = invalid_year_indeces, inplace = True)

In [79]:
db["Inspection Date"] = db["Inspection Date"].astype('datetime64[ns]')

In [80]:
db.dtypes

Complaint Number               int64
Status                        object
Date Entered          datetime64[ns]
House Number                  object
ZIP Code                      object
House Street                  object
Special District              object
Complaint Category            object
Unit                          object
Inspection Date       datetime64[ns]
dtype: object

Now that we have removed the entries inspected before the relevant time frame, we want to look back at the descriptive statistics for the dataset.

In [81]:
db.describe(include = "all")

Unnamed: 0,Complaint Number,Status,Date Entered,House Number,ZIP Code,House Street,Special District,Complaint Category,Unit,Inspection Date
count,91810.0,91810,91810,91810.0,91810.0,91810,91810.0,91810.0,91810,91810
unique,,1,3912,15087.0,209.0,6977,2.0,116.0,32,3877
top,,CLOSED,2016-10-18 00:00:00,1.0,11419.0,BROADWAY,,45.0,QNS.,2018-10-24 00:00:00
freq,,91810,65,228.0,1544.0,993,91170.0,12757.0,17236,67
first,,,2009-01-02 00:00:00,,,,,,,2009-01-02 00:00:00
last,,,2019-09-21 00:00:00,,,,,,,2019-09-21 00:00:00
mean,3269364.0,,,,,,,,,
std,1255153.0,,,,,,,,,
min,1245555.0,,,,,,,,,
25%,2169322.0,,,,,,,,,


In [82]:
db.shape

(91810, 10)

We now have over 90,000 cleaned observations.

For the income dataset, we are only interested in two of the columns, the 'Id2' column which describes the zip code, and the first income column, labeled 'Households; Estimate; Median income (dollars)'. These will be formatted so they can easily be added to the main dataframe.

In [83]:
income_db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 11 columns):
Id                                                                   1795 non-null object
Id2                                                                  1795 non-null int64
Geography                                                            1795 non-null object
Households; Estimate; Median income (dollars)                        1795 non-null object
Households; Margin of Error; Median income (dollars)                 1795 non-null object
Families; Estimate; Median income (dollars)                          1795 non-null object
Families; Margin of Error; Median income (dollars)                   1795 non-null object
Married-couple families; Estimate; Median income (dollars)           1795 non-null object
Married-couple families; Margin of Error; Median income (dollars)    1795 non-null object
Nonfamily households; Estimate; Median income (dollars)              1795 non-null o

Looking at the datatypes, the zip code is currently an integer, though we would prefer it be a string, so that will be converted. For the income column, it is currently being interpreted as an object, so we need to determine what possible non-numeric characters are present and deal with them.
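
The two conversions described above can be sketched on a toy frame. The values below are invented for illustration (including the `"250,000+"` style entry), and the `zfill` padding and `pd.to_numeric` steps are optional extras that this notebook does not apply.

```python
import pandas as pd

# toy stand-in for the income dataset; values are illustrative
income = pd.DataFrame({
    "zip_code": [6390, 10001, 10002],
    "med_inc_zip": ["150703", "-", "250,000+"],
})

# integer zip codes lose their leading zero, so pad back to five characters
income["zip_code"] = income["zip_code"].astype(str).str.zfill(5)

# strip every non-digit character, then convert; empty strings become NaN
digits_only = income["med_inc_zip"].str.replace(r"\D+", "", regex=True)
income["med_inc_zip"] = pd.to_numeric(digits_only, errors="coerce")

print(income)
```

Converting to a numeric type turns placeholder entries such as `"-"` into proper missing values, which makes them visible to `isnull()` later on.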

In [84]:
inc_cols = {"Households; Estimate; Median income (dollars)": "med_inc_zip",
           "Id2": "zip_code"}
income_db = income_db.rename(mapper=inc_cols,
                             axis=1)
income_db.head()

Unnamed: 0,Id,zip_code,Geography,med_inc_zip,Households; Margin of Error; Median income (dollars),Families; Estimate; Median income (dollars),Families; Margin of Error; Median income (dollars),Married-couple families; Estimate; Median income (dollars),Married-couple families; Margin of Error; Median income (dollars),Nonfamily households; Estimate; Median income (dollars),Nonfamily households; Margin of Error; Median income (dollars)
0,8600000US06390,6390,ZCTA5 06390,150703,86256,151172,67633,131875,104286,-,**
1,8600000US07421,7421,ZCTA5 07421,90412,4718,99948,7733,112639,19572,69906,16080
2,8600000US10001,10001,ZCTA5 10001,85221,9970,103304,29429,149007,18758,75794,8870
3,8600000US10002,10002,ZCTA5 10002,35449,2696,39145,3234,42485,6118,28319,3989
4,8600000US10003,10003,ZCTA5 10003,104441,6666,183657,17463,198650,9364,86768,7078


In [85]:
income_db["zip_code"] = income_db["zip_code"].astype("str")

In [86]:
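# stripping every non-digit character; raw string and regex=True are needed on current pandas
income_db["med_inc_zip"] = income_db["med_inc_zip"].str.replace(r"\D+", "", regex=True)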

In [87]:
income_db.head()

Unnamed: 0,Id,zip_code,Geography,med_inc_zip,Households; Margin of Error; Median income (dollars),Families; Estimate; Median income (dollars),Families; Margin of Error; Median income (dollars),Married-couple families; Estimate; Median income (dollars),Married-couple families; Margin of Error; Median income (dollars),Nonfamily households; Estimate; Median income (dollars),Nonfamily households; Margin of Error; Median income (dollars)
0,8600000US06390,6390,ZCTA5 06390,150703,86256,151172,67633,131875,104286,-,**
1,8600000US07421,7421,ZCTA5 07421,90412,4718,99948,7733,112639,19572,69906,16080
2,8600000US10001,10001,ZCTA5 10001,85221,9970,103304,29429,149007,18758,75794,8870
3,8600000US10002,10002,ZCTA5 10002,35449,2696,39145,3234,42485,6118,28319,3989
4,8600000US10003,10003,ZCTA5 10003,104441,6666,183657,17463,198650,9364,86768,7078


## Feature Engineering

### Creating target variable
Our target is the number of days until a complaint's inspection date:

$$\text{Inspection Date} - \text{Date Entered} = \text{Days until Inspection}$$
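
As a quick worked example of this subtraction, the sketch below uses three (date entered, inspection date) pairs copied from rows displayed earlier in this notebook. It uses only the standard library, so the variable names are illustrative and separate from the dataframe columns.

```python
from datetime import date

# (date entered, inspection date) pairs taken from rows shown earlier
examples = [
    (date(2009, 1, 2), date(2009, 6, 2)),
    (date(2009, 1, 2), date(2009, 10, 9)),
    (date(2009, 1, 5), date(2008, 12, 20)),  # inspected before the ticket was entered
]

days_until_inspection = [
    (inspected - entered).days for entered, inspected in examples
]

print(days_until_inspection)  # [151, 280, -16]
```

The last pair shows how an inspection dated before the entry date produces a negative value.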

In [88]:
db["days_until_inspection"] = db["Inspection Date"] - db["Date Entered"]

In [89]:
db.describe()

Unnamed: 0,Complaint Number,days_until_inspection
count,91810.0,91810
mean,3269364.0,69 days 04:28:52.830846
std,1255153.0,193 days 00:47:32.565052
min,1245555.0,-1095 days +00:00:00
25%,2169322.0,1 days 00:00:00
50%,3498805.0,12 days 00:00:00
75%,4493673.0,63 days 00:00:00
max,5124998.0,3767 days 00:00:00


Some complaints have a negative number of days until inspection. The dataset maintainers explain these as cases where an issue was inspected or resolved before any resident opened a ticket, and a DOB employee entered the ticket afterward. Because these cases do not reflect how long a resident waits for an inspection, we will remove them.

In [90]:
# extracting raw number of days
db["days_until_inspection"] = db["days_until_inspection"].dt.days

In [91]:
db[db["days_until_inspection"] <= 0].shape

(17082, 11)

We'll be dropping about 17,000 observations, which covers both negative values and same-day (zero-day) inspections.

In [92]:
db = db[db["days_until_inspection"] > 0]

### Fixing Special District

The "Special District" column has a blank (whitespace-only) category, which we'll change to "NOT SPECIAL".

In [93]:
db["Special District"].unique()

array(['   ', 'IBZ'], dtype=object)

In [94]:
db["Special District"] = db["Special District"].map(lambda x: x if x != '   ' else "NOT SPECIAL")

In [95]:
db["Special District"].unique()

array(['NOT SPECIAL', 'IBZ'], dtype=object)
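
The mapping above matches a string of exactly three spaces. A slightly more defensive variant, sketched here on made-up values, strips whitespace first so that blanks of any length are treated the same way.

```python
import pandas as pd

# illustrative values: three spaces, a real district code, and an empty string
special = pd.Series(["   ", "IBZ", ""])

# strip whitespace, then replace the resulting empty strings
cleaned = special.str.strip().replace("", "NOT SPECIAL")

print(cleaned.tolist())  # ['NOT SPECIAL', 'IBZ', 'NOT SPECIAL']
```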

## Renaming columns

In [96]:
# removing spaces & forcing all to lowercase
db.columns = [col.lower().replace(" ", "_") for col in db.columns]

In [97]:
db.head()

Unnamed: 0,complaint_number,status,date_entered,house_number,zip_code,house_street,special_district,complaint_category,unit,inspection_date,days_until_inspection
234435,1245555,CLOSED,2009-01-02,930,10025,WEST END AVENUE,NOT SPECIAL,58,BOILR,2009-06-02,151
234487,1245608,CLOSED,2009-01-02,428,10013,BROADWAY,NOT SPECIAL,23,SCFLD,2009-10-09,280
234500,1245621,CLOSED,2009-01-02,146,10001,WEST 28 STREET,NOT SPECIAL,63,ELEVR,2009-01-22,20
234513,1245634,CLOSED,2009-01-03,388,10013,BROADWAY,NOT SPECIAL,56,BOILR,2009-01-07,4
234526,1245648,CLOSED,2009-01-03,375,10016,3 AVENUE,NOT SPECIAL,59,ELCTR,2009-01-08,5


## Formatting Zip Codes and Addresses

Since there are some entries with missing or corrupted zip codes, we are going to use the contextual address information to impute the correct zip codes. First, we will concatenate the house number and street name. Then, we will add the borough of the address, which is encoded in the first digit of the complaint number. Finally, the partial address can be sent to Google's Geocoding API, which will return the full address including the zip code.

In [98]:
# removing trailing whitespace from address info
db["zip_code"] = db["zip_code"].str.rstrip()
db["house_street"] = db["house_street"].str.rstrip()
db["house_number"] = db["house_number"].str.rstrip()

In [99]:
# isolating entries with missing zips
zip_db = db[db["zip_code"].str.len() < 5].copy()

In [100]:
# creating borough mapper
borough_codes = {
    "1": "Manhattan",
    "2": "Bronx",
    "3": "Brooklyn",
    "4": "Queens",
    "5": "Staten Island"
}

In [101]:
# adding boroughs (first digit of complaint number) to addresses with missing zips
zip_db["address"] = (zip_db["house_number"] + " " +
                     zip_db["house_street"] + ", " + 
                     zip_db["complaint_number"].apply(lambda x: str(x)[0]).map(borough_codes) + 
                     ", NY")
zip_db.head()

Unnamed: 0,complaint_number,status,date_entered,house_number,zip_code,house_street,special_district,complaint_category,unit,inspection_date,days_until_inspection,address
335867,1351812,CLOSED,2013-07-02,60,,COLLISTER STREET,NOT SPECIAL,59,ELCTR,2013-07-08,6,"60 COLLISTER STREET, Manhattan, NY"
658794,2146737,CLOSED,2010-06-21,450,,HUTCHINSON RIVER PARKWAY,NOT SPECIAL,67,C & D,2010-07-06,15,"450 HUTCHINSON RIVER PARKWAY, Bronx, NY"
1150492,3369005,CLOSED,2011-02-18,639,,VANDALIA AVENUE,NOT SPECIAL,4B,SEP,2011-02-22,4,"639 VANDALIA AVENUE, Brooklyn, NY"
1201547,3424340,CLOSED,2012-09-15,30,,WASHINGTON AVENUE,NOT SPECIAL,04,ERT,2012-09-18,3,"30 WASHINGTON AVENUE, Brooklyn, NY"
1977766,4489830,CLOSED,2011-08-02,57-15,,72 PLACE,NOT SPECIAL,05,QNS.,2011-10-05,64,"57-15 72 PLACE, Queens, NY"


## Scraping Zip Codes - Google Geocoding

For the purposes of automation and future scaling, a function will be built that can take in a dataframe containing addresses and request the zip codes from Google. We will begin by securely importing our API key.

In [102]:
# making var for api key
ENV = pd.read_json("../env.json", typ="series")
API_KEY = ENV["API KEY"]

# setting client with api key
gmap_client = googlemaps.client.Client(key=API_KEY)

The function takes in a dataframe that has an address column. For each partial address, the Google Maps service returns a JSON response, which the client converts into a list of dictionaries in which each component of the full address is stored as a key-value entry. The zip code found in that response is then written back to the dataframe's zip code column.
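
To make the response handling concrete without calling the API, here is a minimal offline sketch. `extract_zip` is a hypothetical helper, and `sample_response` is a hand-written, heavily abbreviated stand-in for a geocoding result; real responses carry many more fields.

```python
from typing import Optional


def extract_zip(geocode_results: list) -> Optional[str]:
    """Return the postal code of the first geocoding result, or None."""
    if not geocode_results:
        # no match was found for the address
        return None
    for component in geocode_results[0].get("address_components", []):
        if component.get("types") == ["postal_code"]:
            return component.get("short_name")
    return None


# hand-written, abbreviated example of the response shape (not real API output)
sample_response = [{
    "address_components": [
        {"long_name": "Brooklyn", "short_name": "Brooklyn",
         "types": ["political", "sublocality", "sublocality_level_1"]},
        {"long_name": "11205", "short_name": "11205",
         "types": ["postal_code"]},
    ],
    "formatted_address": "30 Washington Ave, Brooklyn, NY 11205, USA",
}]

print(extract_zip(sample_response))  # 11205
print(extract_zip([]))               # None
```

Returning `None` when nothing is found lets the caller decide whether to skip the row or retry, instead of reusing a zip code left over from a previous address.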

In [103]:
# building func to fetch zips

def zip_finder(df):
    
    # loop through addresses and run a geocode request for each
    for address in df["address"]:
        print("fetching address: ", address)
        
        # gets full address (an empty list means no match was found)
        full_addr = googlemaps.geocoding.geocode(client=gmap_client,
                             address=address)

        # isolates just the zip from the full address
        zip_code = None  # reset so a zip from a previous address is never reused
        if full_addr:
            for addr_dict in full_addr[0]["address_components"]:
                if addr_dict["types"] == ["postal_code"]:
                    zip_code = addr_dict["short_name"]
        print("found zip: ", zip_code)

        # connecting found zip back to entry with this address
        if zip_code is not None:
            df.loc[df["address"] == address, "zip_code"] = zip_code
        
        # spacing requests to not exceed rate limit
        time.sleep(0.5)
    
    return None

In [104]:
zip_finder(zip_db)

fetching address:  60 COLLISTER STREET, Manhattan, NY
found zip:  10013
fetching address:  450 HUTCHINSON RIVER PARKWAY, Bronx, NY
found zip:  10465
fetching address:  639 VANDALIA AVENUE, Brooklyn, NY
found zip:  11239
fetching address:  30 WASHINGTON AVENUE, Brooklyn, NY
found zip:  11205
fetching address:  57-15   72 PLACE, Queens, NY
found zip:  11385
fetching address:  153 HAWTREE BASIN, Queens, NY
found zip:  11434
fetching address:  3010 VETERANS ROAD WEST, Staten Island, NY
found zip:  10309


In [105]:
zip_db.head()

Unnamed: 0,complaint_number,status,date_entered,house_number,zip_code,house_street,special_district,complaint_category,unit,inspection_date,days_until_inspection,address
335867,1351812,CLOSED,2013-07-02,60,10013,COLLISTER STREET,NOT SPECIAL,59,ELCTR,2013-07-08,6,"60 COLLISTER STREET, Manhattan, NY"
658794,2146737,CLOSED,2010-06-21,450,10465,HUTCHINSON RIVER PARKWAY,NOT SPECIAL,67,C & D,2010-07-06,15,"450 HUTCHINSON RIVER PARKWAY, Bronx, NY"
1150492,3369005,CLOSED,2011-02-18,639,11239,VANDALIA AVENUE,NOT SPECIAL,4B,SEP,2011-02-22,4,"639 VANDALIA AVENUE, Brooklyn, NY"
1201547,3424340,CLOSED,2012-09-15,30,11205,WASHINGTON AVENUE,NOT SPECIAL,04,ERT,2012-09-18,3,"30 WASHINGTON AVENUE, Brooklyn, NY"
1977766,4489830,CLOSED,2011-08-02,57-15,11385,72 PLACE,NOT SPECIAL,05,QNS.,2011-10-05,64,"57-15 72 PLACE, Queens, NY"


Now that our isolated zip code dataframe has been filled, we can assign those zip codes back onto the original entries in the main dataframe.

In [106]:
# assigning the located zips back to the original entries
db.loc[zip_db.index, "zip_code"] = zip_db["zip_code"]
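
The assignment above relies on the subset keeping the row labels of the main dataframe. A small sketch with made-up data shows the mechanics; the names `main` and `missing` and the filled-in zip codes are illustrative stand-ins for `db`, `zip_db`, and the geocoded values.

```python
import pandas as pd

# toy main frame with non-consecutive row labels, as in the sampled data
main = pd.DataFrame({"zip_code": ["10025", "", "11205", ""]},
                    index=[10, 20, 30, 40])

# isolate rows with missing zips; the copy keeps the original labels 20 and 40
missing = main[main["zip_code"].str.len() < 5].copy()
missing["zip_code"] = ["10013", "11385"]  # stand-ins for geocoded zips

# write the values back by label, leaving every other row untouched
main.loc[missing.index, "zip_code"] = missing["zip_code"]

print(main["zip_code"].tolist())  # ['10025', '10013', '11205', '11385']
```

This works cleanly as long as the row labels are unique; duplicated labels would cause the write-back to touch more rows than intended.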

## Joining Income Information

Now that all of our entries have zip codes, we can merge the income database using the zip codes as the common column.

In [107]:
db = pd.merge(db,
         income_db[["zip_code", "med_inc_zip"]],
         how="left",
         on="zip_code")
db.head()

Unnamed: 0,complaint_number,status,date_entered,house_number,zip_code,house_street,special_district,complaint_category,unit,inspection_date,days_until_inspection,med_inc_zip
0,1245555,CLOSED,2009-01-02,930,10025,WEST END AVENUE,NOT SPECIAL,58,BOILR,2009-06-02,151,82352
1,1245608,CLOSED,2009-01-02,428,10013,BROADWAY,NOT SPECIAL,23,SCFLD,2009-10-09,280,106056
2,1245621,CLOSED,2009-01-02,146,10001,WEST 28 STREET,NOT SPECIAL,63,ELEVR,2009-01-22,20,85221
3,1245634,CLOSED,2009-01-03,388,10013,BROADWAY,NOT SPECIAL,56,BOILR,2009-01-07,4,106056
4,1245648,CLOSED,2009-01-03,375,10016,3 AVENUE,NOT SPECIAL,59,ELCTR,2009-01-08,5,109250


In [108]:
db.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74728 entries, 0 to 74727
Data columns (total 12 columns):
complaint_number         74728 non-null int64
status                   74728 non-null object
date_entered             74728 non-null datetime64[ns]
house_number             74728 non-null object
zip_code                 74728 non-null object
house_street             74728 non-null object
special_district         74728 non-null object
complaint_category       74728 non-null object
unit                     74728 non-null object
inspection_date          74728 non-null datetime64[ns]
days_until_inspection    74728 non-null int64
med_inc_zip              74365 non-null object
dtypes: datetime64[ns](2), int64(2), object(8)
memory usage: 7.4+ MB


In [111]:
db[db["med_inc_zip"].str.len() < 1].shape

(25, 12)

In [112]:
db[db["med_inc_zip"].str.len() < 5]["zip_code"].nunique()

12

### Blank Income Entries

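Some zip codes in the income dataset have no income estimate, which leaves blank strings in the median income column. In addition, some complaint zip codes have no match in the income dataset at all, which leaves nulls after the left merge (only 74,365 of the 74,728 rows above are non-null). For now, we are going to drop the complaints in both groups in order to effectively test out the modeling steps. The length filter below removes the nulls as well as the blanks, because a comparison against a null value evaluates to False.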

In [113]:
db = db[db["med_inc_zip"].str.len() >= 1]
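
One way to see which zip codes found no partner during the join is the `indicator` option of `merge`. The sketch below uses two small made-up frames; `complaints` and `income` are illustrative stand-ins for `db` and `income_db`, and the `validate` check is an optional safeguard this notebook does not use.

```python
import pandas as pd

# toy stand-ins for the complaints and income frames
complaints = pd.DataFrame({"zip_code": ["10001", "10001", "99999"],
                           "days_until_inspection": [3, 5, 7]})
income = pd.DataFrame({"zip_code": ["10001", "10002"],
                       "med_inc_zip": ["85221", "35449"]})

# indicator adds a "_merge" column; validate raises if income has duplicate zips
merged = complaints.merge(income, how="left", on="zip_code",
                          indicator=True, validate="many_to_one")

# zip codes present in the complaints but absent from the income data
unmatched = merged.loc[merged["_merge"] == "left_only", "zip_code"].unique().tolist()

print(unmatched)  # ['99999']
```

Listing the unmatched zip codes before dropping them makes it easier to judge whether the lost complaints cluster in particular areas.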

## Handling Nulls

We're not finding any nulls in our dataset. However, as part of our general cleaning process, we will still drop any null observations.

Dropping rows this way assumes that any such values are missing completely at random (MCAR), so we must account for that assumption in our process.

In [114]:
db.isnull().sum()

complaint_number         0
status                   0
date_entered             0
house_number             0
zip_code                 0
house_street             0
special_district         0
complaint_category       0
unit                     0
inspection_date          0
days_until_inspection    0
med_inc_zip              0
dtype: int64

In [115]:
db = db.dropna()

In [116]:
db.shape

(74340, 12)

In [117]:
db.head()

Unnamed: 0,complaint_number,status,date_entered,house_number,zip_code,house_street,special_district,complaint_category,unit,inspection_date,days_until_inspection,med_inc_zip
0,1245555,CLOSED,2009-01-02,930,10025,WEST END AVENUE,NOT SPECIAL,58,BOILR,2009-06-02,151,82352
1,1245608,CLOSED,2009-01-02,428,10013,BROADWAY,NOT SPECIAL,23,SCFLD,2009-10-09,280,106056
2,1245621,CLOSED,2009-01-02,146,10001,WEST 28 STREET,NOT SPECIAL,63,ELEVR,2009-01-22,20,85221
3,1245634,CLOSED,2009-01-03,388,10013,BROADWAY,NOT SPECIAL,56,BOILR,2009-01-07,4,106056
4,1245648,CLOSED,2009-01-03,375,10016,3 AVENUE,NOT SPECIAL,59,ELCTR,2009-01-08,5,109250


## Save cleaned data

In [120]:
# saving db
db.to_csv("./datasets/cleaned.csv", index=False)