# Dataset Information  
Name: DOHMH Dog Bite Data  
Author: New York City Department of Health and Mental Hygiene (NYC DOHMH)  
Source: https://data.cityofnewyork.us/Health/DOHMH-Dog-Bite-Data/rsgh-akpg/about_data  
Accessed: 2024 November 2  
Method of Data Collection:  
* Reports received online, by mail, by fax, or by phone to 311
* NYC DOHMH Animal Bite Unit

# Feature Information
<table style='margin-left: auto; margin-right: auto'>
    <tr>
        <th colspan='3'> DOHMH Dog Bite Data </th>
    </tr>
    <tr>
        <th> Column Name </th>
        <th> Description </th>
        <th> Data Type </th>
    </tr>
    <tr>
        <td> UniqueID </td>
        <td> Unique dog bite case identifier </td>
        <td> Text </td>
    </tr>
    <tr>
        <td> DateOfBite </td>
        <td> Date bitten </td>
        <td> Floating Timestamp </td>
    </tr>
    <tr>
        <td> Species </td>
        <td> Animal Type (Dog) </td>
        <td> Text </td>
    </tr>
    <tr>
        <td> Breed </td>
        <td> Breed type </td>
        <td> Text </td>
    </tr>
    <tr>
        <td> Age </td>
        <td> Dog's age at time of bite. Numbers with 'M' indicate months. </td>
        <td> Text </td>
    </tr>
    <tr>
        <td> Gender </td>
        <td> Sex of Dog. M=Male, F=Female, U=Unknown </td>
        <td> Text </td>
    </tr>
    <tr>
        <td> SpayNeuter </td>
        <td> Surgical removal of dog's reproductive organs. True (reported to DOHMH as Spayed or Neutered), False (Unknown or Not Spayed or Neutered) </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> Borough </td>
        <td> Dog bite Borough. 'Other' indicates that the bite took place outside New York City </td>
        <td> Text </td>
    </tr>
    <tr>
        <td> ZipCode </td>
        <td> Dog bite Zipcode. Blank ZipCode indicates that information was not available </td>
        <td> Text </td>
    </tr>
</table>

# Import and Initial Cleaning

In [51]:
# libraries
import pandas as pd
from utils import breed_mapping, useless_breed_words, nyc_zip_codes

In [52]:
# import data
raw = pd.read_csv('../data/raw/DOHMH_Dog_Bite_Data_20241102.csv')

# display
raw.head()

Unnamed: 0,UniqueID,DateOfBite,Species,Breed,Age,Gender,SpayNeuter,Borough,ZipCode
0,1,January 01 2018,DOG,UNKNOWN,,U,False,Brooklyn,11220.0
1,2,January 04 2018,DOG,UNKNOWN,,U,False,Brooklyn,
2,3,January 06 2018,DOG,Pit Bull,,U,False,Brooklyn,11224.0
3,4,January 08 2018,DOG,Mixed/Other,4.0,M,False,Brooklyn,11231.0
4,5,January 09 2018,DOG,Pit Bull,,U,False,Brooklyn,11224.0


In [53]:
# copy raw to wrang_init
wrang_init = raw.copy()

# snake case column names
snake_case = {
    'UniqueID': 'unique_id',
    'DateOfBite': 'date_of_bite',
    'SpayNeuter': 'spay_neuter',
    'ZipCode': 'zip_code',
}

wrang_init.rename(columns=snake_case, inplace=True)
wrang_init.rename(columns=str.lower, inplace=True)

# display
wrang_init.head()

Unnamed: 0,unique_id,date_of_bite,species,breed,age,gender,spay_neuter,borough,zip_code
0,1,January 01 2018,DOG,UNKNOWN,,U,False,Brooklyn,11220.0
1,2,January 04 2018,DOG,UNKNOWN,,U,False,Brooklyn,
2,3,January 06 2018,DOG,Pit Bull,,U,False,Brooklyn,11224.0
3,4,January 08 2018,DOG,Mixed/Other,4.0,M,False,Brooklyn,11231.0
4,5,January 09 2018,DOG,Pit Bull,,U,False,Brooklyn,11224.0
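
The explicit `snake_case` dictionary above covers the four CamelCase columns in this dataset. As a minimal sketch of a more general approach, the hypothetical helper `to_snake_case` below (not part of the original notebook) derives the same names with a regular expression, assuming column names only use simple CamelCase with trailing acronyms such as `ID`.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a CamelCase name such as 'DateOfBite' to 'date_of_bite'."""
    # insert an underscore wherever a lowercase letter or digit is followed by a capital
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return with_underscores.lower()


# empty frame that only carries the raw column names used in this notebook
demo = pd.DataFrame(columns=['UniqueID', 'DateOfBite', 'SpayNeuter', 'ZipCode', 'Breed'])
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))
```

Passing a function to `rename(columns=...)` applies it to every column, so single-word columns such as `Breed` are simply lower-cased.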


In [54]:
# drop columns
# UniqueID: not useful
# Species: only has one value (dog)
# Age: too many missing values
# Gender: too many missing values

wrang_init = wrang_init.drop(columns=['unique_id', 'species', 'age', 'gender'])

# display
wrang_init.head()

Unnamed: 0,date_of_bite,breed,spay_neuter,borough,zip_code
0,January 01 2018,UNKNOWN,False,Brooklyn,11220.0
1,January 04 2018,UNKNOWN,False,Brooklyn,
2,January 06 2018,Pit Bull,False,Brooklyn,11224.0
3,January 08 2018,Mixed/Other,False,Brooklyn,11231.0
4,January 09 2018,Pit Bull,False,Brooklyn,11224.0


In [55]:
# convert date_of_bite to datetime
wrang_init['date_of_bite'] = pd.to_datetime(wrang_init['date_of_bite'])

# convert spay_neuter to boolean
wrang_init['spay_neuter'] = wrang_init['spay_neuter'].astype('bool')

# lower case all string columns
string_columns = wrang_init.select_dtypes(include='object').columns
wrang_init[string_columns] = wrang_init[string_columns].apply(lambda x: x.str.lower())

# convert all nan into None
wrang_init = wrang_init.where(pd.notnull(wrang_init), None)

# display
wrang_init.head()

Unnamed: 0,date_of_bite,breed,spay_neuter,borough,zip_code
0,2018-01-01,unknown,False,brooklyn,11220.0
1,2018-01-04,unknown,False,brooklyn,
2,2018-01-06,pit bull,False,brooklyn,11224.0
3,2018-01-08,mixed/other,False,brooklyn,11231.0
4,2018-01-09,pit bull,False,brooklyn,11224.0


In [56]:
# check for missing values
wrang_init.isna().sum()

date_of_bite       0
breed           2263
spay_neuter        0
borough            0
zip_code        7167
dtype: int64
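
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (`toy`, not the real data) to show that `isna().mean()` gives the fraction of missing values per column in one step.

```python
import pandas as pd

# made-up rows standing in for the wrangled frame
toy = pd.DataFrame({
    'breed': ['pit bull', None, 'maltese', None],
    'zip_code': ['11220', None, '11224', '11231'],
})

# mean of a boolean mask is the fraction of True values, here the share of missing cells
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```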

# Wrangling Borough
Borough is wrangled first because some rows are outside the scope of the analysis.  
Rows with borough 'other' are removed because they refer to bites reported outside of NYC.  

In [57]:
# copy wrang_init to wrang_borough
wrang_borough = wrang_init.copy()

# drop rows with 'other' Borough
wrang_borough = wrang_borough[wrang_borough['borough'] != 'other']

# display borough count
wrang_borough['borough'].value_counts()

borough
queens           6693
manhattan        6081
brooklyn         5698
bronx            4375
staten island    2140
Name: count, dtype: int64

# Wrangling Date of Bite
Extract year, month, day, and day of week from the bite date.

In [58]:
# copy wrang_borough to wrang_date
wrang_date = wrang_borough.copy()

In [59]:
# extract date values (parse once, then reuse)
bite_dates = pd.to_datetime(wrang_date['date_of_bite'])
wrang_date['year'] = bite_dates.dt.year
wrang_date['month'] = bite_dates.dt.month
wrang_date['day'] = bite_dates.dt.day
# dayofweek is numbered Monday=0 through Sunday=6
wrang_date['day_of_week'] = bite_dates.dt.dayofweek

# display
wrang_date.head()

Unnamed: 0,date_of_bite,breed,spay_neuter,borough,zip_code,year,month,day,day_of_week
0,2018-01-01,unknown,False,brooklyn,11220.0,2018,1,1,0
1,2018-01-04,unknown,False,brooklyn,,2018,1,4,3
2,2018-01-06,pit bull,False,brooklyn,11224.0,2018,1,6,5
3,2018-01-08,mixed/other,False,brooklyn,11231.0,2018,1,8,0
4,2018-01-09,pit bull,False,brooklyn,11224.0,2018,1,9,1
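
As a quick check of the day-of-week encoding, the sketch below parses two dates in the raw file's `Month DD YYYY` style and shows the integer next to the weekday name. The explicit `format` string is an assumption about how the raw dates are written, based on the sample rows above.

```python
import pandas as pd

# two dates written the way the raw CSV shows them
dates = pd.to_datetime(
    pd.Series(['January 01 2018', 'January 06 2018']),
    format='%B %d %Y',
)

parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day': dates.dt.day,
    'day_of_week': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    'day_name': dates.dt.day_name(),    # readable label for the same value
})
print(parts)
```

January 1, 2018 was a Monday, which matches the `0` in the first row of the output above.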


# Wrangling Zip Code
Drop rows with a missing zip code (about 26% of the remaining rows).  
Map each remaining zip code to a latitude and longitude.

In [60]:
# copy wrang_date to wrang_zip
wrang_zip = wrang_date.copy()

In [61]:
# check for missing zip_code values percentage
(wrang_zip['zip_code'].isna().sum() / wrang_zip.shape[0]) * 100

26.169608196262057

In [62]:
# drop rows with missing zip_code values
wrang_zip = wrang_zip.dropna(subset=['zip_code'])

# check for missing zip_code values
(wrang_zip['zip_code'].isna().sum() / wrang_zip.shape[0]) * 100

0.0

In [63]:
# display non-numeric zip_code values
wrang_zip[~wrang_zip['zip_code'].str.isnumeric()]

Unnamed: 0,date_of_bite,breed,spay_neuter,borough,zip_code,year,month,day,day_of_week
21303,2017-07-11,unknown,False,queens,?,2017,7,11,1
25122,2022-09-03,pit bull,False,bronx,1o458,2022,9,3,5


In [64]:
# manually fix non-numeric zip_code values
wrang_zip.loc[wrang_zip['zip_code'] == '1o458', 'zip_code'] = '10458'
wrang_zip.drop(wrang_zip[wrang_zip['zip_code'] == '?'].index, inplace=True)

In [65]:
# display non-numeric zip_code values
wrang_zip[~wrang_zip['zip_code'].str.isnumeric()]

Unnamed: 0,date_of_bite,breed,spay_neuter,borough,zip_code,year,month,day,day_of_week


In [66]:
# convert zip_code to int
wrang_zip['zip_code'] = wrang_zip['zip_code'].astype('Int64')

# remove zip_code values not in NYC
wrang_zip = wrang_zip[wrang_zip['zip_code'].isin(nyc_zip_codes)]
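
The `str.isnumeric()` check above catches values such as `?` and `1o458`, but it does not check length and it also accepts some non-ASCII numeric characters. A stricter alternative, sketched here on made-up values, is to require exactly five ASCII digits with `str.fullmatch`. This is an illustration only and assumes the zip codes are stored as strings.

```python
import pandas as pd

# made-up values, including the two malformed entries found above
toy = pd.Series(['11220', '1o458', '?', '10458', None], name='zip_code')

# True only for exactly five ASCII digits; missing values count as non-matching
is_zip = toy.str.fullmatch(r'[0-9]{5}', na=False)

# present but malformed values, to be fixed or dropped by hand
suspect = toy[~is_zip & toy.notna()]

# well-formed values converted to integers
clean = toy[is_zip].astype('int64')

print(suspect.tolist())
print(clean.tolist())
```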

In [67]:
# read zip code data
zip_data = pd.read_csv('../data/raw/2024_Gaz_zcta_national.txt', sep='\t')

# display
zip_data.head()

Unnamed: 0,GEOID,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,INTPTLAT,INTPTLONG
0,601,166836392,798613,64.416,0.308,18.180555,-66.749961
1,602,78546711,4428428,30.327,1.71,18.361945,-67.175597
2,603,88980555,6253316,34.356,2.414,18.457399,-67.124867
3,606,114825641,12228,44.334,0.005,18.158327,-66.932928
4,610,96150194,4289688,37.124,1.656,18.295304,-67.12518


In [68]:
# strip whitespace from column names and lower-case them
zip_data.columns = zip_data.columns.str.strip().str.lower()

# keep necessary columns
zip_data = zip_data[['geoid', 'intptlat', 'intptlong']]

# rename columns
zip_data.columns = ['zip_code', 'latitude', 'longitude']

# display
zip_data.head()

Unnamed: 0,zip_code,latitude,longitude
0,601,18.180555,-66.749961
1,602,18.361945,-67.175597
2,603,18.457399,-67.124867
3,606,18.158327,-66.932928
4,610,18.295304,-67.12518


In [69]:
# map zip_code to latitude and longitude
wrang_zip = wrang_zip.merge(zip_data, on='zip_code')

# display
wrang_zip.head()

Unnamed: 0,date_of_bite,breed,spay_neuter,borough,zip_code,year,month,day,day_of_week,latitude,longitude
0,2018-01-01,unknown,False,brooklyn,11220,2018,1,1,0,40.641026,-74.016688
1,2018-01-06,pit bull,False,brooklyn,11224,2018,1,6,5,40.577372,-73.988706
2,2018-01-08,mixed/other,False,brooklyn,11231,2018,1,8,0,40.677916,-74.005154
3,2018-01-09,pit bull,False,brooklyn,11224,2018,1,9,1,40.577372,-73.988706
4,2018-01-03,basenji,False,brooklyn,11231,2018,1,3,2,40.677916,-74.005154
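
`merge` defaults to an inner join, so any bite row whose zip code has no match in the lookup table is silently dropped. The sketch below, on made-up frames (`bites`, `centroids`), shows one way to surface such rows with a left join and `indicator=True`. The zip code `99999` is a placeholder chosen to have no match.

```python
import pandas as pd

# made-up bite rows; 99999 has no match in the lookup table
bites = pd.DataFrame({'zip_code': [11220, 11224, 99999]})

# made-up lookup table with one row per zip code
centroids = pd.DataFrame({
    'zip_code': [11220, 11224],
    'latitude': [40.641026, 40.577372],
    'longitude': [-74.016688, -73.988706],
})

merged = bites.merge(
    centroids,
    on='zip_code',
    how='left',              # keep every bite row
    validate='many_to_one',  # raise if the lookup table repeats a zip code
    indicator=True,          # adds a '_merge' column describing each row's origin
)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'zip_code'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner join used above loses no rows.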


# Wrangling Breed
Keep the 10 most common breeds for one-hot encoding.  
Missing, mixed, and all other breeds are grouped under 'mixed/other'.

In [70]:
# copy wrang_zip to wrang_breed
wrang_breed = wrang_zip.copy()

# number of top breeds to keep (besides 'mixed/other')
n_top_breeds = 10

In [71]:
# dataframe for wrangling breed
breed_values = wrang_breed['breed'].copy()

# split breed on '/' and strip whitespace from each part
breed_values = breed_values.str.split('/')
breed_values = breed_values.apply(lambda x: [y.strip() for y in x] if x is not None else x)

# remove useless words
for word in useless_breed_words:
    breed_values = breed_values.apply(lambda x: [y.replace(word, '').strip() for y in x] if x is not None else x)

# remove white spaces
breed_values = breed_values.apply(lambda x: [y.strip() for y in x] if x is not None else x)

# map breed names to standard names
breed_values = breed_values.apply(lambda x: [breed_mapping.get(y, y) for y in x] if x is not None else x)

# map repeating values to one value, ex: ['pit bull', 'pit bull'] to ['pit bull']
breed_values = breed_values.apply(lambda x: [x[0]] if x is not None and len(x) == 2 and x[0] == x[1] else x)

# fill missing values with ['mixed/other']
breed_values = breed_values.apply(lambda x: ['mixed/other'] if x is None else x)

# map breeds with more than one value to ['mixed/other']
breed_values = breed_values.apply(lambda x: ['mixed/other'] if len(x) > 1 else x)

# display breed counts
breed_counts = breed_values.explode().value_counts()
breed_counts.head(n_top_breeds+1)

breed
mixed/other           4891
pit bull              4381
shih tzu               641
german shepherd        628
chihuahua              616
yorkshire terrier      460
bull dog               438
labrador retriever     400
maltese                331
husky                  315
standard poodle        284
Name: count, dtype: int64

In [72]:
# top breeds based on count, with mixed/others
top_breeds = breed_counts.head(n_top_breeds+1).index.tolist()
top_breeds

['mixed/other',
 'pit bull',
 'shih tzu',
 'german shepherd',
 'chihuahua',
 'yorkshire terrier',
 'bull dog',
 'labrador retriever',
 'maltese',
 'husky',
 'standard poodle']

In [73]:
# keep breeds in top_breeds, replace others with 'mixed/other'
breed_values = breed_values.apply(lambda x: [y if y in top_breeds else 'mixed/other' for y in x] if x is not None else x)

# display
breed_values.head()

0    [mixed/other]
1       [pit bull]
2    [mixed/other]
3       [pit bull]
4    [mixed/other]
Name: breed, dtype: object

In [74]:
# one-hot encode breed
for breed in top_breeds:
    wrang_breed[breed] = breed_values.apply(lambda x: breed in x if x is not None else False)

# drop breed column
wrang_breed = wrang_breed.drop(columns='breed')

# display
wrang_breed.head()

Unnamed: 0,date_of_bite,spay_neuter,borough,zip_code,year,month,day,day_of_week,latitude,longitude,...,pit bull,shih tzu,german shepherd,chihuahua,yorkshire terrier,bull dog,labrador retriever,maltese,husky,standard poodle
0,2018-01-01,False,brooklyn,11220,2018,1,1,0,40.641026,-74.016688,...,False,False,False,False,False,False,False,False,False,False
1,2018-01-06,False,brooklyn,11224,2018,1,6,5,40.577372,-73.988706,...,True,False,False,False,False,False,False,False,False,False
2,2018-01-08,False,brooklyn,11231,2018,1,8,0,40.677916,-74.005154,...,False,False,False,False,False,False,False,False,False,False
3,2018-01-09,False,brooklyn,11224,2018,1,9,1,40.577372,-73.988706,...,True,False,False,False,False,False,False,False,False,False
4,2018-01-03,False,brooklyn,11231,2018,1,3,2,40.677916,-74.005154,...,False,False,False,False,False,False,False,False,False,False
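
For comparison, the same idea (keep the most frequent categories, collapse the rest, then one-hot encode) can be sketched with `value_counts`, `where`, and `pd.get_dummies`. This is a simplified illustration on made-up values, not a replacement for the list-based cleaning above, and it skips the breed-name normalization done with `breed_mapping`.

```python
import pandas as pd

# made-up breed values, one per bite
breeds = pd.Series(
    ['pit bull', 'pit bull', 'maltese', 'basenji', None, 'maltese', 'pit bull'],
    name='breed',
)
n_top = 2

# missing breeds are treated as 'mixed/other'
breeds = breeds.fillna('mixed/other')

# the n_top most frequent labels
top = breeds.value_counts().head(n_top).index

# keep top labels, collapse everything else into 'mixed/other'
collapsed = breeds.where(breeds.isin(top), 'mixed/other')

# one boolean column per remaining label
one_hot = pd.get_dummies(collapsed, dtype=bool)
print(one_hot.sum())
```

Each row has exactly one `True`, so the columns are mutually exclusive.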


# Export Data

In [77]:
# copy wrang_breed to wrang_final
wrang_final = wrang_breed.copy()

# convert column names to snake_case
wrang_final.columns = wrang_final.columns.str.replace(' ', '_')

# rearrange columns
columns_order = [
    'date_of_bite', 'year', 'month', 'day', 'day_of_week', 'borough', 'zip_code', 
    'latitude', 'longitude', 'spay_neuter', 'mixed/other', 'pit_bull', 'german_shepherd', 
    'shih_tzu', 'chihuahua', 'yorkshire_terrier', 'bull_dog', 'labrador_retriever', 
    'maltese', 'husky', 'standard_poodle'
]
wrang_final = wrang_final[columns_order]

# display
wrang_final.head()

Unnamed: 0,date_of_bite,year,month,day,day_of_week,borough,zip_code,latitude,longitude,spay_neuter,...,pit_bull,german_shepherd,shih_tzu,chihuahua,yorkshire_terrier,bull_dog,labrador_retriever,maltese,husky,standard_poodle
0,2018-01-01,2018,1,1,0,brooklyn,11220,40.641026,-74.016688,False,...,False,False,False,False,False,False,False,False,False,False
1,2018-01-06,2018,1,6,5,brooklyn,11224,40.577372,-73.988706,False,...,True,False,False,False,False,False,False,False,False,False
2,2018-01-08,2018,1,8,0,brooklyn,11231,40.677916,-74.005154,False,...,False,False,False,False,False,False,False,False,False,False
3,2018-01-09,2018,1,9,1,brooklyn,11224,40.577372,-73.988706,False,...,True,False,False,False,False,False,False,False,False,False
4,2018-01-03,2018,1,3,2,brooklyn,11231,40.677916,-74.005154,False,...,False,False,False,False,False,False,False,False,False,False


In [79]:
# save cleaned data
wrang_final.to_csv('../data/processed/dog_bite_wrangled.csv', index=False)
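
CSV does not store data types, so `date_of_bite` comes back as plain text unless it is parsed on load. The sketch below round-trips a small made-up frame through an in-memory buffer (standing in for the file on disk) to show `parse_dates` restoring the datetime column.

```python
import io

import pandas as pd

# made-up rows with the same kinds of columns as the wrangled data
frame = pd.DataFrame({
    'date_of_bite': pd.to_datetime(['2018-01-01', '2018-01-06']),
    'zip_code': [11220, 11224],
    'pit_bull': [False, True],
})

# write to an in-memory buffer instead of a file
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the named column back to datetime on load
restored = pd.read_csv(buffer, parse_dates=['date_of_bite'])
print(restored.dtypes)
```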

# Feature Information
<table style='margin-left: auto; margin-right: auto'>
    <tr>
        <th colspan='3'> Wrangled DOHMH Dog Bite Data </th>
    </tr>
    <tr>
        <th> Column Name </th>
        <th> Description </th>
        <th> Data Type </th>
    </tr>
    <tr>
        <td> date_of_bite </td>
        <td> Date bitten </td>
        <td> DateTime </td>
    </tr>
    <tr>
        <td> year </td>
        <td> Year of the bite </td>
        <td> Integer </td>
    </tr>
    <tr>
        <td> month </td>
        <td> Month of the bite (1 to 12) </td>
        <td> Integer </td>
    </tr>
    <tr>
        <td> day </td>
        <td> Day of the month of the bite </td>
        <td> Integer </td>
    </tr>
    <tr>
        <td> day_of_week </td>
        <td> Day of the week of the bite (Monday = 0, Sunday = 6) </td>
        <td> Integer </td>
    </tr>
    <tr>
        <td> borough </td>
        <td> Dog bite Borough. </td>
        <td> Text </td>
    </tr>
    <tr>
        <td> zip_code </td>
        <td> Dog bite ZipCode. </td>
        <td> Integer </td>
    </tr>
    <tr>
        <td> latitude </td>
        <td> Latitude of Zip Code </td>
        <td> Float </td>
    </tr>
    <tr>
        <td> longitude </td>
        <td> Longitude of Zip Code </td>
        <td> Float </td>
    </tr>
    <tr>
        <td> spay_neuter </td>
        <td> Surgical removal of dog's reproductive organs. True (reported to DOHMH as Spayed or Neutered), False (Unknown or Not Spayed or Neutered) </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> mixed/other </td>
        <td> Indicates that the dog was a mixed or other breed. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> pit_bull </td>
        <td> Indicates that the dog was a pit bull. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> german_shepherd </td>
        <td> Indicates that the dog was a german shepherd. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> shih_tzu </td>
        <td> Indicates that the dog was a shih tzu. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> chihuahua </td>
        <td> Indicates that the dog was a chihuahua. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> yorkshire_terrier </td>
        <td> Indicates that the dog was a yorkshire terrier. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> bull_dog </td>
        <td> Indicates that the dog was a bull dog. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> labrador_retriever </td>
        <td> Indicates that the dog was a labrador retriever. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> maltese </td>
        <td> Indicates that the dog was a maltese. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> husky </td>
        <td> Indicates that the dog was a husky. </td>
        <td> Boolean </td>
    </tr>
    <tr>
        <td> standard_poodle </td>
        <td> Indicates that the dog was a standard poodle. </td>
        <td> Boolean </td>
    </tr>
</table>