Ossifragus/DataJamboree

Activate the project and instantiate it using the following code

using Pkg; Pkg.activate(".")
# uncomment the following to install all the required packages
# Pkg.instantiate()

Load the required packages

using Dates, DataFrames, CSV, JDF, Arrow, GLM, StatsBase, HypothesisTests
using PythonCall, MLJ, XGBoost, FLoops, Geocoder, CairoMakie # GLMakie
import StatsPlots as sp
import Plots as plt
  • Data cleaning.
  • For ease of comparison across languages, make the column names consistent in style: lowercase, with underscores separating words within a name.

Load the data and rename the columns

julia> @time dat = CSV.read("nyc311_011523-012123_by022023.csv", DataFrame,
                      dateformat="m/d/y H:M:S p")
  0.175945 seconds (1.81 M allocations: 125.091 MiB)
54469×41 DataFrame
   Row │ Unique Key  Created Date         Closed Date          Agency   Agency 
       │ Int64       DateTime             DateTime?            String7  String 
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56532399  2023-01-15T00:00:00  2023-01-17T00:00:01  DOHMH    Depart
     2 │ 56533302  2023-01-15T00:00:00  2023-01-23T00:00:01  DOHMH    Depart
     3 │ 56533451  2023-01-15T00:00:00  2023-01-17T09:34:08  DOHMH    Depart
     4 │ 56536021  2023-01-15T00:00:00  2023-01-17T09:48:19  DOHMH    Depart
     5 │ 56538004  2023-01-15T00:00:00  2023-01-23T15:08:02  DOHMH    Depart
     6 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     New Yo
     7 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     New Yo
     8 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     New Yo
   ⋮   │    ⋮               ⋮                    ⋮              ⋮       ⋮
 54463 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     New Yo
 54464 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     New Yo
 54465 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     New Yo
 54466 │ 56591502  2023-01-22T00:00:00  2023-01-23T10:03:08  DOHMH    Depart
 54467 │ 56592422  2023-01-22T00:00:00  2023-01-23T10:05:39  DOHMH    Depart
 54468 │ 56594948  2023-01-22T00:00:00  2023-02-01T00:00:01  DOHMH    Depart
 54469 │ 56597360  2023-01-22T00:00:00  2023-01-23T10:02:46  DOHMH    Depart
                                               37 columns and 54454 rows omitted

julia> @time rename!(x -> lowercase(replace(x, " " => "_")), dat)
  0.037280 seconds (25.38 k allocations: 1.682 MiB, 99.43% compilation time)
54469×41 DataFrame
   Row │ unique_key  created_date         closed_date          agency   agency 
       │ Int64       DateTime             DateTime?            String7  String 
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56532399  2023-01-15T00:00:00  2023-01-17T00:00:01  DOHMH    Depart
     2 │ 56533302  2023-01-15T00:00:00  2023-01-23T00:00:01  DOHMH    Depart
     3 │ 56533451  2023-01-15T00:00:00  2023-01-17T09:34:08  DOHMH    Depart
     4 │ 56536021  2023-01-15T00:00:00  2023-01-17T09:48:19  DOHMH    Depart
     5 │ 56538004  2023-01-15T00:00:00  2023-01-23T15:08:02  DOHMH    Depart
     6 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     New Yo
     7 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     New Yo
     8 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     New Yo
   ⋮   │    ⋮               ⋮                    ⋮              ⋮       ⋮
 54463 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     New Yo
 54464 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     New Yo
 54465 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     New Yo
 54466 │ 56591502  2023-01-22T00:00:00  2023-01-23T10:03:08  DOHMH    Depart
 54467 │ 56592422  2023-01-22T00:00:00  2023-01-23T10:05:39  DOHMH    Depart
 54468 │ 56594948  2023-01-22T00:00:00  2023-02-01T00:00:01  DOHMH    Depart
 54469 │ 56597360  2023-01-22T00:00:00  2023-01-23T10:02:46  DOHMH    Depart
                                               37 columns and 54454 rows omitted

Save and load the data with other formats

# @time Arrow.write("nyc311.arrow", dat)
@time dat = Arrow.Table("nyc311.arrow") |> DataFrame
@time dat = copy(dat)

# @time JDF.save("nyc311.jdf", dat)
@time dat = JDF.load("nyc311.jdf") |> DataFrame
  • Check for obvious errors or inefficiencies. For example, are there records whose Closed Date is earlier than or exactly the same as the Created Date? Are there invalid values for any columns? Are any columns redundant?
julia> first(dat, 5)
5×41 DataFrame
 Row │ unique_key  created_date         closed_date          agency   agency_n 
     │ Int64       DateTime             DateTime?            String7  String   
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 56532399  2023-01-15T00:00:00  2023-01-17T00:00:01  DOHMH    Departme
   2 │ 56533302  2023-01-15T00:00:00  2023-01-23T00:00:01  DOHMH    Departme
   3 │ 56533451  2023-01-15T00:00:00  2023-01-17T09:34:08  DOHMH    Departme
   4 │ 56536021  2023-01-15T00:00:00  2023-01-17T09:48:19  DOHMH    Departme
   5 │ 56538004  2023-01-15T00:00:00  2023-01-23T15:08:02  DOHMH    Departme
                                                              37 columns omitted

julia> last(dat, 5)
5×41 DataFrame
 Row │ unique_key  created_date         closed_date          agency   agency_n 
     │ Int64       DateTime             DateTime?            String7  String   
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     New York
   2 │ 56591502  2023-01-22T00:00:00  2023-01-23T10:03:08  DOHMH    Departme
   3 │ 56592422  2023-01-22T00:00:00  2023-01-23T10:05:39  DOHMH    Departme
   4 │ 56594948  2023-01-22T00:00:00  2023-02-01T00:00:01  DOHMH    Departme
   5 │ 56597360  2023-01-22T00:00:00  2023-01-23T10:02:46  DOHMH    Departme
                                                              37 columns omitted

julia> n, p = size(dat)
(54469, 41)

julia> ds = describe(dat)
41×7 DataFrame
 Row │ variable                  mean       min                                
     │ Symbol                    Union     Any                                
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ unique_key                5.65614e7  56524167                           
   2 │ created_date                         2023-01-15T00:00:00
   3 │ closed_date                          2022-01-12T10:06:00
   4 │ agency                               DCA
   5 │ agency_name                          Department of Buildings            
   6 │ complaint_type                       AHV Inspection Unit
   7 │ descriptor                           1 or 2
   8 │ location_type                        1-2 Family Dwelling
                                                            
  35 │ bridge_highway_name                  1                                  
  36 │ bridge_highway_direction             1 2 3 Downtown & Brooklyn
  37 │ road_ramp                            10 St & 4 Av NW corner
  38 │ bridge_highway_segment               14th Ave Bay 8 St (Exit 4) - Bay
  39 │ latitude                  40.7367    40.4989                            
  40 │ longitude                 -73.9248   -74.2525
  41 │ location                             (40.498948846168354, -74.2443650
                                                   4 columns and 26 rows omitted

julia> ds[!, [:variable, :nmissing]] # ds[!, [1, 6]]
41×2 DataFrame
 Row │ variable                  nmissing
     │ Symbol                    Int64
─────┼────────────────────────────────────
   1 │ unique_key                       0
   2 │ created_date                     0
   3 │ closed_date                   5176
   4 │ agency                           0
   5 │ agency_name                      0
   6 │ complaint_type                   0
   7 │ descriptor                     394
   8 │ location_type                 7386
                   
  35 │ bridge_highway_name          54126
  36 │ bridge_highway_direction     54290
  37 │ road_ramp                    54334
  38 │ bridge_highway_segment       54126
  39 │ latitude                       978
  40 │ longitude                      978
  41 │ location                       978
                           26 rows omitted

julia> ds.mis_rate .= ds.nmissing ./ n
41-element Vector{Float64}:
 0.0
 0.0
 0.0950265288512732
 0.0
 0.0
 0.0
 0.007233472250270797
 0.13560006609263986
 0.013750940902164534
 0.04743982815913639
 
 0.9994492280012485
 0.989553691090345
 0.9937028401476069
 0.9967137270741155
 0.9975215260056178
 0.9937028401476069
 0.01795516715930162
 0.01795516715930162
 0.01795516715930162

julia> mis_pattern = ds[!, [:variable, :mis_rate]]
41×2 DataFrame
 Row │ variable                  mis_rate
     │ Symbol                    Float64
─────┼──────────────────────────────────────
   1 │ unique_key                0.0
   2 │ created_date              0.0
   3 │ closed_date               0.0950265
   4 │ agency                    0.0
   5 │ agency_name               0.0
   6 │ complaint_type            0.0
   7 │ descriptor                0.00723347
   8 │ location_type             0.1356
                    
  35 │ bridge_highway_name       0.993703
  36 │ bridge_highway_direction  0.996714
  37 │ road_ramp                 0.997522
  38 │ bridge_highway_segment    0.993703
  39 │ latitude                  0.0179552
  40 │ longitude                 0.0179552
  41 │ location                  0.0179552
                             26 rows omitted

agency carries the same information as agency_name, and location is redundant with latitude and longitude, so the redundant columns can be dropped ...
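The claimed redundancy can be checked quickly (a sketch using the `dat` DataFrame loaded above): if every `agency` value maps to exactly one `agency_name`, the cross-tabulation of the two columns has one row per agency.

```julia
using DataFrames

# Cross-tabulate agency against agency_name; a 1:1 mapping means one of
# the two columns adds no information.
pairs_tab = combine(groupby(dat, [:agency, :agency_name]), nrow)
length(unique(dat.agency)) == nrow(pairs_tab)  # true would confirm the redundancy
```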

julia> dat1 = dat[!, [1:4; 6:10; 25:26; 39:40]]
54469×13 DataFrame
   Row │ unique_key  created_date         closed_date          agency   compla 
       │ Int64       DateTime             DateTime?            String7  String 
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56532399  2023-01-15T00:00:00  2023-01-17T00:00:01  DOHMH    Food P
     2 │ 56533302  2023-01-15T00:00:00  2023-01-23T00:00:01  DOHMH    Food P
     3 │ 56533451  2023-01-15T00:00:00  2023-01-17T09:34:08  DOHMH    Food P
     4 │ 56536021  2023-01-15T00:00:00  2023-01-17T09:48:19  DOHMH    Food P
     5 │ 56538004  2023-01-15T00:00:00  2023-01-23T15:08:02  DOHMH    Food P
     6 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     Noise
     7 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     Noise
     8 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     Illega
   ⋮   │    ⋮               ⋮                    ⋮              ⋮       ⋮
 54463 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     Illega
 54464 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     Noise
 54465 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     Noise
 54466 │ 56591502  2023-01-22T00:00:00  2023-01-23T10:03:08  DOHMH    Food P
 54467 │ 56592422  2023-01-22T00:00:00  2023-01-23T10:05:39  DOHMH    Food P
 54468 │ 56594948  2023-01-22T00:00:00  2023-02-01T00:00:01  DOHMH    Food P
 54469 │ 56597360  2023-01-22T00:00:00  2023-01-23T10:02:46  DOHMH    Food P
                                                9 columns and 54454 rows omitted

julia> ds1 = describe(dat1)
13×7 DataFrame
 Row │ variable          mean       min                      median     max    
     │ Symbol            Union     Any                      Union     Any    
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ unique_key        5.65614e7  56524167                 5.65613e7  566356 
   2 │ created_date                 2023-01-15T00:00:00                 2023-0
   3 │ closed_date                  2022-01-12T10:06:00                 2023-0
   4 │ agency                       DCA                                 TLC
   5 │ complaint_type               AHV Inspection Unit                 ZTESTI 
   6 │ descriptor                   1 or 2                              unknow
   7 │ location_type                1-2 Family Dwelling                 Yard
   8 │ incident_zip      10807.1    10000                    11201.0    12345
   9 │ incident_address             -999 FURMANVILLE AVENUE             YELLOW 
  10 │ bbl               2.6957e9   0                        3.01604e9  520042
  11 │ borough                      BRONX                               Unspec
  12 │ latitude          40.7367    40.4989                  40.7309    40.912
  13 │ longitude         -73.9248   -74.2525                 -73.9281   -73.70 
                                                               3 columns omitted

julia> dat2 = dat1[isless.(dat1.created_date, dat1.closed_date), :]
       # isless treats missing as larger than any DateTime, so still-open
       # requests (missing closed_date) are kept here
       # dat2 = filter(r -> isless(r.created_date, r.closed_date), dat1)
52576×13 DataFrame
   Row │ unique_key  created_date         closed_date          agency   compla 
       │ Int64       DateTime             DateTime?            String7  String 
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56532399  2023-01-15T00:00:00  2023-01-17T00:00:01  DOHMH    Food P
     2 │ 56533302  2023-01-15T00:00:00  2023-01-23T00:00:01  DOHMH    Food P
     3 │ 56533451  2023-01-15T00:00:00  2023-01-17T09:34:08  DOHMH    Food P
     4 │ 56536021  2023-01-15T00:00:00  2023-01-17T09:48:19  DOHMH    Food P
     5 │ 56538004  2023-01-15T00:00:00  2023-01-23T15:08:02  DOHMH    Food P
     6 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     Noise
     7 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     Noise
     8 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     Illega
   ⋮   │    ⋮               ⋮                    ⋮              ⋮       ⋮
 52570 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     Illega
 52571 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     Noise
 52572 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     Noise
 52573 │ 56591502  2023-01-22T00:00:00  2023-01-23T10:03:08  DOHMH    Food P
 52574 │ 56592422  2023-01-22T00:00:00  2023-01-23T10:05:39  DOHMH    Food P
 52575 │ 56594948  2023-01-22T00:00:00  2023-02-01T00:00:01  DOHMH    Food P
 52576 │ 56597360  2023-01-22T00:00:00  2023-01-23T10:02:46  DOHMH    Food P
                                                9 columns and 52561 rows omitted
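As a sanity check on this filter (a sketch using `dat1` from above), one can count how many records have a Closed Date at or before the Created Date, and how many are still open:

```julia
# Records with a Closed Date on or before the Created Date are likely data
# errors; records with a missing Closed Date are still-open requests.
n_nonpos = count(r -> !ismissing(r.closed_date) && r.closed_date <= r.created_date,
                 eachrow(dat1))
n_open = count(ismissing, dat1.closed_date)
println("non-positive durations: $n_nonpos; still open: $n_open")
```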
  • Fill in missing values if possible. For example, if incident zip code is missing but the location is not, the zip code could be recovered by geocoding.
julia> idx = findall(ismissing.(dat2.latitude) .& .!ismissing.(dat2.incident_address))
518-element Vector{Int64}:
   539
  1254
  1286
  1337
  1427
  1476
  1526
  1633
  2046
  2065
     
 51561
 51721
 51758
 51811
 51816
 51824
 51942
 52051
 52082

Geocoder.jl can be used as a geocoding client for the Google Maps API and the OpenStreetMap API. It requires an API key to use, so the geocoded coordinates are precomputed and stored in the JDF format.

julia> include("get_coordinates.jl")

julia> coor = geocode("28-07 JACKSON AVENUE, New York", osm_key, "osm")
Dict{String, Any} with 2 entries:
  "lat" => 40.7484
  "lng" => -73.9388

julia> if abspath(PROGRAM_FILE) == @__FILE__ # only run as the main script
           coordinates = DataFrame([c => Float64[] for c in ["lat", "lng"]])
           ads = dat2.incident_address[idx]
           for a in ads
               g = geocode(a * " New York", osm_key, "osm")
               push!(coordinates, g)
           end
           # CSV.write("coordinates.csv", coordinates)
           # Arrow.write("coordinates.arrow", coordinates)
           JDF.save("coordinates.jdf", coordinates)
       else
           coordinates = JDF.load("coordinates.jdf") |> DataFrame
       end
518×2 DataFrame
 Row │ lat      lng
     │ Float64  Float64
─────┼───────────────────
   1 │ 40.8699  -73.8928
   2 │ 40.8012  -73.9598
   3 │ 40.8269  -73.872
   4 │ 40.6853  -73.8468
   5 │ 40.8674  -73.8359
   6 │ 40.758   -73.7894
   7 │ 40.7846  -73.8283
   8 │ 40.6347  -74.1158
  ⋮  │    ⋮        ⋮
 512 │ 40.636   -74.0013
 513 │ 40.7672  -73.9862
 514 │ 40.6273  -74.1102
 515 │ 40.7672  -73.9862
 516 │ 40.6675  -73.7537
 517 │ 40.7277  -73.8491
 518 │ 40.6019  -73.7458
         503 rows omitted

julia> rename!(coordinates, [:latitude, :longitude])
518×2 DataFrame
 Row │ latitude  longitude
     │ Float64   Float64
─────┼─────────────────────
   1 │ 40.8699   -73.8928
   2 │ 40.8012   -73.9598
   3 │ 40.8269   -73.872
   4 │ 40.6853   -73.8468
   5 │ 40.8674   -73.8359
   6 │ 40.758    -73.7894
   7 │ 40.7846   -73.8283
   8 │ 40.6347   -74.1158
  ⋮  │    ⋮         ⋮
 512 │ 40.636    -74.0013
 513 │ 40.7672   -73.9862
 514 │ 40.6273   -74.1102
 515 │ 40.7672   -73.9862
 516 │ 40.6675   -73.7537
 517 │ 40.7277   -73.8491
 518 │ 40.6019   -73.7458
           503 rows omitted

julia> dat2[idx, [:latitude, :longitude]] .= coordinates
518×2 SubDataFrame
 Row │ latitude  longitude
     │ Float64?  Float64?
─────┼─────────────────────
   1 │ 40.8699   -73.8928
   2 │ 40.8012   -73.9598
   3 │ 40.8269   -73.872
   4 │ 40.6853   -73.8468
   5 │ 40.8674   -73.8359
   6 │ 40.758    -73.7894
   7 │ 40.7846   -73.8283
   8 │ 40.6347   -74.1158
  ⋮  │    ⋮         ⋮
 512 │ 40.636    -74.0013
 513 │ 40.7672   -73.9862
 514 │ 40.6273   -74.1102
 515 │ 40.7672   -73.9862
 516 │ 40.6675   -73.7537
 517 │ 40.7277   -73.8491
 518 │ 40.6019   -73.7458
           503 rows omitted
  • Summarize your suggestions to the data curator in several bullet points.

Use a variable dictionary instead of additional columns.
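One way to carry out this suggestion (a hypothetical sketch; the entries shown are illustrative, not part of the dataset's official documentation): keep a small metadata table instead of near-empty columns such as the `bridge_highway_*` group.

```julia
using DataFrames

# A variable dictionary: one row per column, documenting its meaning.
var_dict = DataFrame(
    variable    = ["agency", "agency_name", "bbl"],
    description = ["Responding agency acronym",
                   "Full name of the responding agency",
                   "Borough-Block-Lot parcel identifier"],
)
```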

  • Data manipulation. Focus only on requests made to NYPD.
  • Create a new variable duration, which represents the time period from the Created Date to the Closed Date. Note that duration may be censored for some requests.
julia> dat3 = dat2[dat2.agency .== "NYPD", :]
21535×13 DataFrame
   Row │ unique_key  created_date         closed_date          agency   compla 
       │ Int64       DateTime             DateTime?            String7  String 
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     Noise
     2 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     Noise
     3 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     Illega
     4 │ 56526234  2023-01-15T00:01:02  2023-01-15T05:47:41  NYPD     Noise
     5 │ 56530817  2023-01-15T00:01:07  2023-01-15T01:01:26  NYPD     Noise
     6 │ 56530491  2023-01-15T00:01:12  2023-01-15T01:47:57  NYPD     Noise
     7 │ 56527886  2023-01-15T00:01:22  2023-01-15T00:15:44  NYPD     Blocke
     8 │ 56530623  2023-01-15T00:01:23  2023-01-15T01:07:30  NYPD     Illega
   ⋮   │    ⋮               ⋮                    ⋮              ⋮       ⋮
 21529 │ 56588474  2023-01-21T23:58:36  2023-01-22T01:02:17  NYPD     Blocke
 21530 │ 56590134  2023-01-21T23:58:43  2023-01-22T01:13:54  NYPD     Noise
 21531 │ 56585700  2023-01-21T23:58:48  2023-01-22T01:30:36  NYPD     Noise
 21532 │ 56588661  2023-01-21T23:59:10  2023-01-22T02:50:54  NYPD     Noise
 21533 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     Illega
 21534 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     Noise
 21535 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     Noise
                                                9 columns and 21520 rows omitted

julia> describe(dat3)[!,[1, end-1]]
13×2 DataFrame
 Row │ variable          nmissing
     │ Symbol            Int64
─────┼────────────────────────────
   1 │ unique_key               0
   2 │ created_date             0
   3 │ closed_date              1
   4 │ agency                   0
   5 │ complaint_type           0
   6 │ descriptor               0
   7 │ location_type            0
   8 │ incident_zip           240
   9 │ incident_address       240
  10 │ bbl                   2010
  11 │ borough                  0
  12 │ latitude                46
  13 │ longitude               46

julia> filter!(:closed_date => !ismissing, dat3)
21534×13 DataFrame
   Row │ unique_key  created_date         closed_date          agency   compla 
       │ Int64       DateTime             DateTime?            String7  String 
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     Noise
     2 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     Noise
     3 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     Illega
     4 │ 56526234  2023-01-15T00:01:02  2023-01-15T05:47:41  NYPD     Noise
     5 │ 56530817  2023-01-15T00:01:07  2023-01-15T01:01:26  NYPD     Noise
     6 │ 56530491  2023-01-15T00:01:12  2023-01-15T01:47:57  NYPD     Noise
     7 │ 56527886  2023-01-15T00:01:22  2023-01-15T00:15:44  NYPD     Blocke
     8 │ 56530623  2023-01-15T00:01:23  2023-01-15T01:07:30  NYPD     Illega
   ⋮   │    ⋮               ⋮                    ⋮              ⋮       ⋮
 21528 │ 56588474  2023-01-21T23:58:36  2023-01-22T01:02:17  NYPD     Blocke
 21529 │ 56590134  2023-01-21T23:58:43  2023-01-22T01:13:54  NYPD     Noise
 21530 │ 56585700  2023-01-21T23:58:48  2023-01-22T01:30:36  NYPD     Noise
 21531 │ 56588661  2023-01-21T23:59:10  2023-01-22T02:50:54  NYPD     Noise
 21532 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     Illega
 21533 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     Noise
 21534 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     Noise
                                                9 columns and 21519 rows omitted

julia> dat3.duration .= Dates.value.(dat3.closed_date .- dat3.created_date) ./ 60000
       # Dates.value gives milliseconds; dividing by 60000 yields minutes
       # passmissing(Dates.value)
       # disallowmissing!(dat3, :duration)
21534-element Vector{Float64}:
 157.2
 114.05
  69.1
 346.65
  60.31666666666667
 106.75
  14.366666666666667
  66.11666666666666
 107.73333333333333
  68.03333333333333
   
  16.583333333333332
  96.45
  63.68333333333333
  75.18333333333334
  91.8
 171.73333333333332
 402.5833333333333
 385.26666666666665
 269.28333333333336
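The bullet above notes that duration may be censored. Instead of dropping still-open requests (as the `filter!` call did), a sketch of keeping them with a censoring indicator; the pull date `DateTime(2023, 2, 20)` is an assumption read off the file name (`by022023`), not stated in the data:

```julia
using DataFrames, Dates

pull_date = DateTime(2023, 2, 20)         # assumed data-retrieval date
dat3c = dat2[dat2.agency .== "NYPD", :]   # NYPD subset, missing closed_date kept
dat3c.censored = ismissing.(dat3c.closed_date)
# censored durations run from creation to the pull date
dat3c.duration = Dates.value.(coalesce.(dat3c.closed_date, pull_date) .-
                              dat3c.created_date) ./ 60000  # ms -> minutes
```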
  • Visualize the distribution of uncensored duration by weekdays/weekend and by borough, and test whether the distributions are the same across weekdays/weekends of their creation and across boroughs.
julia> dat3.weekday .= dayofweek.(dat3.created_date) .<= 5;

julia> duration_weekday = dat3.duration[dat3.weekday];

julia> duration_weekend = dat3.duration[.!dat3.weekday];

Pick a backend for plotting

plt.plotlyjs()    # use plotlyjs backend
plt.plotly()      # use plotly backend
plt.pythonplot()  # use pythonplot backend
plt.gr()          # use gr backend; it's the default
julia> sp.@df dat3 sp.density(:duration, group=:weekday, lw=3, label=["weekend" "weekday"])

julia> sp.violin(["duration"], duration_weekday, side=:left, label="weekday")

julia> sp.violin!(["duration"], duration_weekend, side=:right, label="weekend")

julia> sp.@df dat3 sp.boxplot(ifelse.(:weekday, "weekday", "weekend"), :duration,
                              notch=true, label=false)

julia> sp.@df dat3 sp.violin!(ifelse.(:weekday, "weekday", "weekend"), :duration,
                              label=false)

julia> sp.@df dat3 sp.density(:duration, group=:borough, lw=3)

julia> sp.@df dat3 sp.violin(:borough, :duration, label=false)

julia> sp.@df dat3 sp.boxplot(:borough, :duration, label=false)

Plot with Makie.jl. Again, you can activate different backends.

Histograms

julia> g = groupby(dat3, :weekday);

julia> b = combine(g, :duration => Ref);

julia> f = Figure()

julia> for i in 1:2
           ax = Axis(f[1, i], title = "Duration on $(i==1 ? "weekend" : "weekday")")
           hist!(ax, b[i,2], normalization=:probability,
                 bar_labels=:values, label_formatter=x->round(x, digits=3),
                 label_size=15, strokewidth=0.5, strokecolor=(:black, 0.5),
                 color=:values)
           ylims!(ax, 0, 0.8)
       end

julia> f

Density plots

julia> f = Figure()

julia> ax = Axis(f[1, 1], title = "Duration on weekdays and weekends")
Axis with 0 plots:


julia> d1 = density!(duration_weekday, offset=0, strokewidth=2, strokecolor=:black,
                     color=(:red, 0.6))
Combined{Makie.density, Tuple{Vector{Float64}}}

julia> d2 = density!(duration_weekend, strokewidth=2, strokecolor=:black,
                     color=(:green, 0.6))
Combined{Makie.density, Tuple{Vector{Float64}}}

julia> axislegend(ax, [d1, d2], ["weekday", "weekend"], position=:rc)
Legend()

julia> f

by borough

julia> g = groupby(dat3, :borough);

julia> b = combine(g, :duration => Ref);

julia> f = Figure()

julia> ax = Axis(f[1, 1])
Axis with 0 plots:


julia> for i in 1:size(b,1)
           density!(ax, b[i,2], strokewidth=2)
       end

julia> f

Boxplots

julia> f = Figure()

julia> ax = Axis(f[1, 1], xticks = (0:1, ["weekend", "weekday"]), 
                 title = "Duration on weekdays and weekends")
Axis with 0 plots:


julia> d = boxplot!(dat3.weekday, dat3.duration, show_notch=true,
                    color=ifelse.(dat3.weekday, :blue, :green), outliercolor=:red)
Combined{Makie.boxplot, Tuple{BitVector, Vector{Float64}}}

julia> f

HypothesisTests

julia> using HypothesisTests

julia> EqualVarianceTTest(duration_weekday, duration_weekend)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -2.61428
    95% confidence interval: (-6.661, 1.433)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.2055

Details:
    number of observations:   [15106,6428]
    t-statistic:              -1.2661004878706232
    degrees of freedom:       21532
    empirical standard error: 2.064827445712


julia> UnequalVarianceTTest(duration_weekday, duration_weekend)
Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -2.61428
    95% confidence interval: (-6.68, 1.451)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.2075

Details:
    number of observations:   [15106,6428]
    t-statistic:              -1.2605065048545891
    degrees of freedom:       12006.29661889752
    empirical standard error: 2.0739909126341214


julia> MannWhitneyUTest(duration_weekday, duration_weekend)
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -1.125

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0207

Details:
    number of observations in each group: [15106, 6428]
    Mann-Whitney-U statistic:             4.75853e7
    rank sums:                            [1.64354e8, 6.75136e7]
    adjustment for ties:                  239874.0
    normal approximation (μ, σ):          (-965419.0, 4.1744e5)
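The t-tests above compare means and the Mann-Whitney test compares locations; to test whether the two distributions are the same as a whole, HypothesisTests also provides a two-sample Kolmogorov-Smirnov test (a sketch on the vectors defined above):

```julia
using HypothesisTests

# Compares the empirical CDFs of the two samples directly.
ApproximateTwoSampleKSTest(duration_weekday, duration_weekend)
```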


julia> g = groupby(dat3, :borough);

julia> b = combine(g, :duration => Ref)
6×2 DataFrame
 Row │ borough        duration_Ref
     │ String15       SubArray
─────┼──────────────────────────────────────────────────
   1 │ QUEENS         [114.05, 66.1167, 76.4833, 89.0,
   2 │ STATEN ISLAND  [120.567, 75.3833, 68.95, 11.783
   3 │ MANHATTAN      [60.3167, 40.25, 105.35, 223.75,
   4 │ BRONX          [157.2, 346.65, 131.75, 164.817,
   5 │ BROOKLYN       [69.1, 106.75, 14.3667, 107.733,
   6 │ Unspecified    [34.3833, 115.2, 131.1, 182.917,

julia> OneWayANOVATest(b.duration_Ref...)
One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     <1e-99

Details:
    number of observations: [5844, 573, 3826, 3934, 7232, 125]
    F statistic:            390.612
    degrees of freedom:     (5, 21528)


julia> KruskalWallisTest(b.duration_Ref...)
Kruskal-Wallis rank sum test (chi-square approximation)
-------------------------------------------------------
Population details:
    parameter of interest:   Location parameters
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           <1e-99

Details:
    number of observation in each group: [5844, 573, 3826, 3934, 7232, 125]
    χ²-statistic:                        1671.16
    rank sums:                           [7.01507e7, 5.65392e6, 3.12847e7, 5.20025e7, 7.13712e7, 1.40443e6]
    degrees of freedom:                  5
    adjustment for ties:                 1.0
  • Basic information at the zip-code level, such as population density, median home value, and median household income, is available from the US Census. Convenient access is provided by, for example, the R package zipcodeR and the Python package uszipcode; there seems to be no Julia equivalent yet, but Julia can call R or Python easily. Merge the zip-code-level information with the NYPD requests data.
julia> dat4 = filter(:latitude => !ismissing, dat3)
21488×15 DataFrame
   Row │ unique_key  created_date         closed_date          agency   compla 
       │ Int64       DateTime             DateTime?            String7  String 
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     Noise
     2 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     Noise
     3 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     Illega
     4 │ 56526234  2023-01-15T00:01:02  2023-01-15T05:47:41  NYPD     Noise
     5 │ 56530817  2023-01-15T00:01:07  2023-01-15T01:01:26  NYPD     Noise
     6 │ 56530491  2023-01-15T00:01:12  2023-01-15T01:47:57  NYPD     Noise
     7 │ 56527886  2023-01-15T00:01:22  2023-01-15T00:15:44  NYPD     Blocke
     8 │ 56530623  2023-01-15T00:01:23  2023-01-15T01:07:30  NYPD     Illega
   ⋮   │    ⋮               ⋮                    ⋮              ⋮       ⋮
 21482 │ 56588474  2023-01-21T23:58:36  2023-01-22T01:02:17  NYPD     Blocke
 21483 │ 56590134  2023-01-21T23:58:43  2023-01-22T01:13:54  NYPD     Noise
 21484 │ 56585700  2023-01-21T23:58:48  2023-01-22T01:30:36  NYPD     Noise
 21485 │ 56588661  2023-01-21T23:59:10  2023-01-22T02:50:54  NYPD     Noise
 21486 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     Illega
 21487 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     Noise
 21488 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     Noise
                                               11 columns and 21473 rows omitted

PythonCall.jl can be used to interact with Python. You may need to run the following code to install the required Python packages.

using CondaPkg
CondaPkg.add_pip("uszipcode")
CondaPkg.add("python-Levenshtein")
julia> uszipcode = pyimport("uszipcode")
Python: <module 'uszipcode' from '/home/ossifragus/Dropbox/julia/courses/DataJamboree/.CondaPkg/env/lib/python3.12/site-packages/uszipcode/__init__.py'>

julia> sr = uszipcode.SearchEngine()
Python: <uszipcode.search.SearchEngine object at 0x7f7cbb72d850>

julia> sr.by_zipcode("10001")
Python: SimpleZipcode(zipcode='10001', zipcode_type='STANDARD', major_city='New York', post_office_city='New York, NY', common_city_list=['New York'], county='New York County', state='NY', lat=40.75, lng=-74.0, timezone='America/New_York', radius_in_miles=0.9090909090909091, area_code_list='718,917,347,646', population=21102, population_density=33959.0, land_area_in_sqmi=0.62, water_area_in_sqmi=0.0, housing_units=12476, occupied_housing_units=11031, median_home_value=650200, median_household_income=81671, bounds_west=-74.008621, bounds_east=-73.984076, bounds_north=40.759731, bounds_south=40.743451)

julia> sr.by_coordinates(40.8636, -73.8704, returns=1)[0]
Python: SimpleZipcode(zipcode='10467', zipcode_type='STANDARD', major_city='Bronx', post_office_city='Bronx, NY', common_city_list=['Bronx'], county='Bronx County', state='NY', lat=40.87, lng=-73.87, timezone='America/New_York', radius_in_miles=2.0, area_code_list='347,718', population=97060, population_density=41649.0, land_area_in_sqmi=2.33, water_area_in_sqmi=0.0, housing_units=37432, occupied_housing_units=35524, median_home_value=369500, median_household_income=36048, bounds_west=-73.89571, bounds_east=-73.855691, bounds_north=40.908016, bounds_south=40.857266)

Using a for loop to look up one data point at a time in Python from Julia is slow; it could surely be done much faster. A CSV file is created to cache the results and save time.

julia> if abspath(PROGRAM_FILE) == @__FILE__ 
           usnames = ("units", "area", "value", "income", "occupiedunits",
                      "population", "population_density", "water_area", "zip")
           dus = length(usnames)
           us = DataFrame([name => Int[] for name in usnames])
           @time for r in eachrow(dat4)#[1:10]
               z = sr.by_coordinates(r.latitude, r.longitude, returns=1)
               if isempty(z)
                   tmp = fill(missing, dus)
               else
                   z = sr.by_coordinates(r.latitude, r.longitude, returns=1)[0]
                   tmp = (z.housing_units, z.land_area_in_sqmi, z.median_home_value,
                          z.median_household_income, z.occupied_housing_units,
                          z.population, z.population_density, z.water_area_in_sqmi,
                          z.zipcode)
                   tmp = pyconvert(Tuple, tmp)
               end
               push!(us, replace(tmp, nothing => missing), promote=true)
           end
           CSV.write("uszipdata.csv", us)
       else
           us = CSV.read("uszipdata.csv", DataFrame)
       end
21488×9 DataFrame
   Row │ units    area        value    income   occupiedunits  population  pop
       │ Int64?   Float64?    Int64?   Int64?   Int64?         Int64?      Flo
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 37432        2.33   369500    36048          35524       97060
     2 │ 11854        1.63   427000    60691          11031       36256
     3 │ missing  missing    missing  missing        missing     missing
     4 │ 31331        1.48   271300    45864          29855       75784
     5 │ 16493        0.18   805300   102941          14622       26121
     6 │ 31045        1.79   615300    37580          29018       99598
     7 │ 37619        2.25   531000    41639          33886       79132
     8 │ 30978        2.63   480900    45964          28861      109931
   ⋮   │    ⋮        ⋮          ⋮        ⋮           ⋮            ⋮
 21482 │ 29074        1.35   551900    34492          25693       67053
 21483 │ 16724        0.98   464500    40372          15512       49896
 21484 │ 16724        0.98   464500    40372          15512       49896
 21485 │ 18756        1.52   726500    63739          17033       36934
 21486 │ missing  missing    missing  missing        missing     missing
 21487 │ 12184        1.07   464500    53617          11194       38615
 21488 │ 21127        2.4    454700    52347          19772       50502
                                                3 columns and 21473 rows omitted
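The `abspath(PROGRAM_FILE) == @__FILE__` guard above implements a compute-once cache: do the slow per-row lookups only when running as a script, and read the saved CSV otherwise. The same pattern, sketched generically in Python (the file name and the stand-in compute function below are placeholders, not part of the project):

```python
import csv
import os

def cached_rows(path, compute):
    """Load rows from `path` if it exists; otherwise compute, cache, and return them."""
    if os.path.exists(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))
    rows = compute()  # expensive step, e.g. one Python lookup per record
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Example: the "computation" is a stand-in for the per-row zip-code lookups.
rows = cached_rows("uszipdata_demo.csv",
                   lambda: [{"zip": "10467", "population": "97060"}])
print(rows[0]["zip"])  # 10467
```

On the second call with the same path, the lambda is never invoked; the rows come straight from the cached file.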

julia> dat5 = [dat4[!, Not(:bbl, :incident_address, :incident_zip)] us]
21488×21 DataFrame
   Row │ unique_key  created_date         closed_date          agency   compla
       │ Int64       DateTime             DateTime?            String7  String
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     Noise
     2 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     Noise
     3 │ 56525034  2023-01-15T00:00:50  2023-01-15T01:09:56  NYPD     Illega
     4 │ 56526234  2023-01-15T00:01:02  2023-01-15T05:47:41  NYPD     Noise
     5 │ 56530817  2023-01-15T00:01:07  2023-01-15T01:01:26  NYPD     Noise
     6 │ 56530491  2023-01-15T00:01:12  2023-01-15T01:47:57  NYPD     Noise
     7 │ 56527886  2023-01-15T00:01:22  2023-01-15T00:15:44  NYPD     Blocke
     8 │ 56530623  2023-01-15T00:01:23  2023-01-15T01:07:30  NYPD     Illega
   ⋮   │     ⋮               ⋮                    ⋮             ⋮        ⋮
 21482 │ 56588474  2023-01-21T23:58:36  2023-01-22T01:02:17  NYPD     Blocke
 21483 │ 56590134  2023-01-21T23:58:43  2023-01-22T01:13:54  NYPD     Noise
 21484 │ 56585700  2023-01-21T23:58:48  2023-01-22T01:30:36  NYPD     Noise
 21485 │ 56588661  2023-01-21T23:59:10  2023-01-22T02:50:54  NYPD     Noise
 21486 │ 56591186  2023-01-21T23:59:12  2023-01-22T06:41:47  NYPD     Illega
 21487 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     Noise
 21488 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     Noise
                                               17 columns and 21473 rows omitted

julia> dropmissing!(dat5)
20877×21 DataFrame
   Row │ unique_key  created_date         closed_date          agency   compla
       │ Int64       DateTime             DateTime             String7  String
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 56524984  2023-01-15T00:00:18  2023-01-15T02:37:30  NYPD     Noise
     2 │ 56526790  2023-01-15T00:00:21  2023-01-15T01:54:24  NYPD     Noise
     3 │ 56526234  2023-01-15T00:01:02  2023-01-15T05:47:41  NYPD     Noise
     4 │ 56530817  2023-01-15T00:01:07  2023-01-15T01:01:26  NYPD     Noise
     5 │ 56530491  2023-01-15T00:01:12  2023-01-15T01:47:57  NYPD     Noise
     6 │ 56527886  2023-01-15T00:01:22  2023-01-15T00:15:44  NYPD     Blocke
     7 │ 56530623  2023-01-15T00:01:23  2023-01-15T01:07:30  NYPD     Illega
     8 │ 56528076  2023-01-15T00:01:34  2023-01-15T01:49:18  NYPD     Illega
   ⋮   │     ⋮               ⋮                    ⋮             ⋮        ⋮
 20871 │ 56584502  2023-01-21T23:58:29  2023-01-22T01:34:56  NYPD     Noise
 20872 │ 56588474  2023-01-21T23:58:36  2023-01-22T01:02:17  NYPD     Blocke
 20873 │ 56590134  2023-01-21T23:58:43  2023-01-22T01:13:54  NYPD     Noise
 20874 │ 56585700  2023-01-21T23:58:48  2023-01-22T01:30:36  NYPD     Noise
 20875 │ 56588661  2023-01-21T23:59:10  2023-01-22T02:50:54  NYPD     Noise
 20876 │ 56587053  2023-01-21T23:59:24  2023-01-22T06:24:40  NYPD     Noise
 20877 │ 56590382  2023-01-22T00:00:00  2023-01-22T04:29:17  NYPD     Noise
                                               17 columns and 20862 rows omitted
  • Data analysis.
  • Define a binary variable over3h, which is 1 if the duration is greater than 3 hours (180 minutes). Note that it can be determined even for censored durations.
julia> dat5.over3h .= dat5.duration .> 180
20877-element BitVector:
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 1
 1
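The reason over3h is available even under censoring: for a request that is still open, the observed elapsed time is a lower bound on the true duration, so the indicator is still known to be 1 once 180 minutes have passed. A small Python sketch of this logic (durations in minutes; `None` marks a still-open request observed for `elapsed` minutes; both argument names are illustrative):

```python
def over3h(duration, elapsed=None):
    """1 if the request took (or has already taken) more than 180 minutes.

    duration: closed-request duration in minutes, or None if still open.
    elapsed:  minutes observed so far for a still-open (censored) request.
    """
    if duration is not None:
        return int(duration > 180)
    if elapsed is not None and elapsed > 180:
        return 1          # censored, but already past 3 hours
    return None           # censored and not yet determinable

print([over3h(157), over3h(501), over3h(None, elapsed=240)])  # [0, 1, 1]
```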
  • Build a logistic model to predict over3h using the 311 request data as well as those zip-code-level covariates. If your model has tuning parameters, justify their choices. Use appropriate metrics to assess the performance of the model.

julia> using GLM, StatsBase

julia> fm = @formula(over3h ~ latitude + longitude + weekday + value + income
                     + population + water_area)
FormulaTerm
Response:
  over3h(unknown)
Predictors:
  latitude(unknown)
  longitude(unknown)
  weekday(unknown)
  value(unknown)
  income(unknown)
  population(unknown)
  water_area(unknown)

julia> lgst = glm(fm, dat5, Bernoulli())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Bernoulli{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

over3h ~ 1 + latitude + longitude + weekday + value + income + population + water_area

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                   Coef.    Std. Error       z  Pr(>|z|)     Lower 95%     Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0         NaN           NaN       NaN     NaN           NaN
latitude      4.3594        0.224315     19.43    <1e-83    3.91975       4.79905
longitude     2.4171        0.124224     19.46    <1e-83    2.17362       2.66057
weekday       0.0374286     0.0417289     0.90    0.3697   -0.0443586     0.119216
value        -1.38242e-6    1.61184e-7   -8.58    <1e-17   -1.69833e-6   -1.0665e-6
income       -9.57878e-7    1.30116e-6   -0.74    0.4616   -3.50811e-6    1.59235e-6
population    1.66498e-6    8.59956e-7    1.94    0.0529   -2.0506e-8     3.35046e-6
water_area    0.792161      0.271544      2.92    0.0035    0.259944      1.32438
────────────────────────────────────────────────────────────────────────────────────

julia> p̂ = GLM.predict(lgst)
20877-element Vector{Float64}:
 0.3111196600355672
 0.1650622268062843
 0.307377645491299
 0.1035138896558327
 0.07661270152179724
 0.08760770837939519
 0.1941436175776555
 0.06187367866992892
 0.1948301989709319
 0.08697028138264871
 ⋮
 0.10352303880619676
 0.08874324033414512
 0.1107394172020614
 0.12481075944915469
 0.14281041685941806
 0.1395688392682537
 0.10419732204978574
 0.18031446474422425
 0.27307904394614646

julia> mean((p̂ .> 0.5) .== dat5.over3h) # accuracy
0.8344589739905158

julia> function cal_auc(p, y)
           p1, p0 = p[y .== 1], p[y .== 0]
           n1, n0 = length(p1), length(p0)
           s = 0.0
           for i in p0, j in p1
               s += (i < j) + 0.5(i == j)
           end
           s / (n1*n0)
       end
cal_auc (generic function with 1 method)

julia> @time cal_auc(p̂, dat5.over3h)
  0.137854 seconds (214.88 k allocations: 14.718 MiB, 68.03% compilation time)
0.6663342965439752
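cal_auc computes the AUC from its probabilistic definition: the fraction of (negative, positive) pairs in which the positive case receives the higher predicted probability, with ties counted as 1/2. This is the normalized Mann-Whitney U statistic, so the same value can be obtained from mid-ranks in O(n log n) rather than O(n₁·n₀) pairwise comparisons. A Python sketch of both, on a toy score vector:

```python
def auc_pairwise(p, y):
    """Pairwise AUC: fraction of (neg, pos) pairs ranked correctly, ties = 1/2."""
    p1 = [pi for pi, yi in zip(p, y) if yi == 1]
    p0 = [pi for pi, yi in zip(p, y) if yi == 0]
    s = sum((i < j) + 0.5 * (i == j) for i in p0 for j in p1)
    return s / (len(p1) * len(p0))

def auc_ranks(p, y):
    """Same AUC via mid-ranks (Mann-Whitney U), O(n log n)."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    ranks = [0.0] * len(p)
    i = 0
    while i < len(p):                      # assign mid-ranks to tied scores
        j = i
        while j < len(p) and p[order[j]] == p[order[i]]:
            j += 1
        mid = (i + j + 1) / 2              # 1-based mid-rank of the tie group
        for k in range(i, j):
            ranks[order[k]] = mid
        i = j
    n1 = sum(y)
    n0 = len(y) - n1
    r1 = sum(r for r, yi in zip(ranks, y) if yi == 1)
    return (r1 - n1 * (n1 + 1) / 2) / (n1 * n0)

p = [0.1, 0.4, 0.35, 0.8, 0.35]
y = [0, 0, 1, 1, 0]
print(auc_pairwise(p, y), auc_ranks(p, y))  # both 0.75
```

The rank-based route is what makes large-n AUC computations cheap without needing the parallel loop.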

julia> using FLoops

julia> function cal_aucF(p, y)
           p1, p0 = p[y .== 1], p[y .== 0]
           n1, n0 = length(p1), length(p0)
           @floop for i in p0, j in p1
               @reduce(s += (i < j) + 0.5(i == j))
           end
           s / (n1*n0)
       end
cal_aucF (generic function with 1 method)

julia> @time cal_aucF(p̂, dat5.over3h)
  0.280798 seconds (625.57 k allocations: 42.133 MiB, 132.88% compilation time)
0.6663342965439752

Run a second time for better timing (the first call includes compilation).

julia> @time cal_auc(p̂, dat5.over3h)
  0.043704 seconds (11 allocations: 177.047 KiB)
0.6663342965439752

julia> @time cal_aucF(p̂, dat5.over3h)
  0.004009 seconds (1.57 k allocations: 262.844 KiB)
0.6663342965439752

julia> using MLJ

julia> accuracy(p̂ .> 0.5, dat5.over3h)
0.8344589739905159

julia> y_cat = categorical(dat5.over3h);

julia> c = categorical(unique(y_cat))
2-element CategoricalArrays.CategoricalArray{Bool,1,UInt32}:
 false
 true

julia> ŷ = [UnivariateFinite(c, [1.0 - p, p]) for p in p̂];

julia> @time auc(ŷ, y_cat)
  0.009821 seconds (11 allocations: 652.750 KiB)
0.6663342965439752

julia> rc = roc_curve(ŷ, y_cat);

julia> plt.plot(rc[1:2]..., color=:red, lw=3, label="glm", legend=:right,
                xlab="False Positive", ylab="True Positive")
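roc_curve traces the (false positive rate, true positive rate) pairs obtained by sweeping a classification threshold across the predicted probabilities. A small Python sketch of that sweep, on toy scores:

```python
def roc_points(p, y):
    """(FPR, TPR) pairs as the threshold sweeps the unique scores, high to low."""
    n1 = sum(y)
    n0 = len(y) - n1
    pts = [(0.0, 0.0)]                     # strictest threshold: predict nothing
    for t in sorted(set(p), reverse=True):
        tp = sum(1 for pi, yi in zip(p, y) if pi >= t and yi == 1)
        fp = sum(1 for pi, yi in zip(p, y) if pi >= t and yi == 0)
        pts.append((fp / n0, tp / n1))
    return pts

p = [0.9, 0.8, 0.6, 0.4, 0.3]
y = [1, 1, 0, 1, 0]
print(roc_points(p, y))
```

Plotting these points (FPR on x, TPR on y) gives the curve drawn above, and the area under it is the AUC.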

  • Repeat the analysis with another model (e.g., random forest; neural network; etc.).
julia> using XGBoost

julia> X = dat5[!,[:latitude, :longitude, :weekday, :value, :income, :population, :water_area]]
20877×7 DataFrame
   Row │ latitude  longitude  weekday  value   income  population  water_area
       │ Float64   Float64    Bool     Int64   Int64   Int64       Float64
───────┼──────────────────────────────────────────────────────────────────────
     1 │ 40.8636   -73.8704    false  369500   36048       97060        0.0
     2 │ 40.7028   -73.8376    false  427000   60691       36256        0.0
     3 │ 40.8324   -73.8589    false  271300   45864       75784        0.0
     4 │ 40.7737   -73.9478    false  805300  102941       26121        0.0
     5 │ 40.6353   -74.0202    false  615300   37580       99598        0.0
     6 │ 40.5811   -73.9569    false  531000   41639       79132        0.19
     7 │ 40.7381   -73.8583    false  480900   45964      109931        0.04
     8 │ 40.6243   -74.008     false  681700   61893       41788        0.0
   ⋮   │    ⋮          ⋮         ⋮       ⋮       ⋮        ⋮          ⋮
 20871 │ 40.6383   -73.9453    false  426100   48400       76174        0.0
 20872 │ 40.6887   -73.9076    false  551900   34492       67053        0.0
 20873 │ 40.7054   -73.9157    false  464500   40372       49896        0.02
 20874 │ 40.7068   -73.9292    false  464500   40372       49896        0.02
 20875 │ 40.7225   -73.9468    false  726500   63739       36934        0.08
 20876 │ 40.7576   -73.8753    false  464500   53617       38615        0.0
 20877 │ 40.8464   -73.8283    false  454700   52347       50502        0.0
                                                           20862 rows omitted

julia> Y = dat5.over3h
20877-element BitVector:
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 1
 1

julia> bst = xgboost((X,Y), num_round=100, max_depth=10, objective="binary:logistic")
Booster()

julia> p̂_xgb = XGBoost.predict(bst, X)
20877-element Vector{Float32}:
 0.02964254
 0.14859171
 0.7987125
 0.026607411
 0.017226843
 0.025675919
 0.1849252
 0.0009443293
 0.05393643
 0.0038877402
 ⋮
 0.026607411
 0.00012181072
 0.00266249
 0.15025197
 0.15114085
 0.009047651
 0.17287524
 0.6756687
 0.5761649

julia> accuracy(p̂_xgb .> 0.5, y_cat)
0.9007041241557695

julia> cal_auc(p̂_xgb, Y)
0.9478532936116905

julia> ŷ_xgb = [UnivariateFinite(c, [1.0 - p, p]) for p in p̂_xgb];

julia> rc_xgb = roc_curve(ŷ_xgb, y_cat)
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.9991963721944779, 0.9992537741805867, 0.9994833821250215, 0.9995981860972389, 0.9996555880833476, 0.9997129900694564, 0.9998277940416739, 0.9998851960277826, 0.9999425980138913, 1.0], [0.0, 0.00028935185185185184, 0.002025462962962963, 0.0023148148148148147, 0.00318287037037037, 0.004050925925925926, 0.004629629629629629, 0.004918981481481482, 0.005208333333333333, 0.005497685185185185  …  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [0.9785438179969788, 0.9648628234863281, 0.9570191502571106, 0.948989748954773, 0.9487460851669312, 0.9475162625312805, 0.9425473213195801, 0.9362542629241943, 0.9339721202850342, 0.929896354675293  …  0.00013061388744972646, 0.00012431186041794717, 0.00012181072088424116, 0.00011837841157102957, 0.00011661482130875811, 0.00010541961819399148, 9.845692693488672e-5, 9.515414421912283e-5, 8.833550964482129e-5, 8.078639802988619e-5])

julia> plt.plot!(rc_xgb[1:2]..., color=:blue, lw=3, label="xgboost")
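Note that the XGBoost accuracy (0.90) and AUC (0.95) above are computed on the same data the booster was trained on, so with `max_depth=10` and 100 rounds they likely overstate out-of-sample performance relative to the logistic model; a held-out split (or cross-validation) gives a fairer comparison. A minimal sketch of an index-based random split (the 80/20 fraction and the seed are arbitrary choices, not from the original analysis):

```python
import random

def train_test_split_idx(n, test_frac=0.2, seed=42):
    """Shuffle row indices 0..n-1 and split them into train/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seeded so the split is reproducible
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

# 20877 rows, matching the size of dat5 after dropmissing!
train_idx, test_idx = train_test_split_idx(20877)
print(len(train_idx), len(test_idx))  # 16701 4176
```

Fit on `train_idx` rows only, then evaluate accuracy and AUC on the `test_idx` rows.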
