<a href="https://colab.research.google.com/github/CoWoGeo/PUS2022_CWolk/blob/main/ProjectLanthamWolk/PhillyLIViolations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to L&I Violations Notebook

This notebook uses the Philadelphia License and Inspections Code Violations dataset: https://www.opendataphilly.org/dataset/licenses-and-inspections-violations

For our analysis, we retrieved all L&I code violations for locations with junkyards from 2017 through the present. The junkyard locations were retrieved in the Business License notebook.

Code violations range from paperwork issues, such as using a site without the license for it (selling food from your house or not having a rental license, for example) or not having materials safety data sheets (MSDS) available, to imminent physical hazards such as fire code violations or imminently dangerous buildings. The prefix of the violation code indicates some of this: FC is Fire Code, for example.
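As a rough illustration of reading the prefix, here is a minimal sketch that pulls the leading letters off a violation code and tallies them. The sample codes are copied from the first rows of the dataset shown later in this notebook; the helper name `code_prefix` is hypothetical, and the sketch does not attempt to map prefixes to code names.

```python
import re
from collections import Counter

def code_prefix(violation_code):
    """Return the leading letters of a violation code, or None if it has none."""
    match = re.match(r"[A-Za-z]+", violation_code.strip())
    return match.group(0).upper() if match else None

# Codes copied from the head() output of the violations table
sample_codes = ["PM-302.2/4", "CP-01", "PM15-108.1", "PM15-305.1", "PM15-304.1G"]

# Count how many sample codes fall under each prefix
prefix_counts = Counter(code_prefix(code) for code in sample_codes)
print(prefix_counts)
```

The same helper could be applied to the `violationc` column with `.map(code_prefix)` to get a coarse category per violation.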

After finding all locations of junkyard licenses in another notebook, we pull all L&I violations from the past ~5 years for each location. This is not a perfect method: a site can change owners or use, even turning into luxury condos, and some junkyard sites have non-junkyard businesses at the same location. That is a problem to solve in a future version of this project if necessary.


# Installing Libraries

In [1]:
!apt install python3-rtree

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libspatialindex-c4v5 libspatialindex-dev libspatialindex4v5
  python3-pkg-resources
Suggested packages:
  python3-setuptools
The following NEW packages will be installed:
  libspatialindex-c4v5 libspatialindex-dev libspatialindex4v5
  python3-pkg-resources python3-rtree
0 upgraded, 5 newly installed, 0 to remove and 11 not upgraded.
Need to get 671 kB of archives.
After this operation, 3,948 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libspatialindex4v5 amd64 1.8.5-5 [219 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libspatialindex-c4v5 amd64 1.8.5-5 [51.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-pkg

In [2]:
!pip install geopandas
import geopandas as gpd

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting geopandas
  Downloading geopandas-0.12.1-py3-none-any.whl (1.1 MB)
Collecting fiona>=1.8
  Downloading Fiona-1.8.22-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.6 MB)
Collecting pyproj>=2.6.1.post1
  Downloading pyproj-3.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting cligj>=0.5
  Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Collecting munch
  Downloading munch-2.5.0-py2.py3-none-any.whl (10 kB)
Collecting click-plugins>=1.0
  Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Installing collected packages: munch, cligj, click-plugins, pyproj, fiona, geopandas
Successfully installed click-plugins-1.1.1 cligj-0.7.2 fiona-1.

In [3]:
!pip install cartoframes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cartoframes
  Downloading cartoframes-1.2.4-py2.py3-none-any.whl (245 kB)
Collecting carto<2.0,>=1.11.3
  Downloading carto-1.11.3.tar.gz (27 kB)
Collecting semantic-version<3,>=2.8.0
  Downloading semantic_version-2.10.0-py2.py3-none-any.whl (15 kB)
Collecting unidecode<2.0,>=1.1.0
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Collecting pyrestcli==0.6.11
  Downloading pyrestcli-0.6.11.tar.gz (9.1 kB)
Building wheels for collected packages: carto, pyrestcli
  Building wheel for carto (setup.py) ... done
  Created wheel for carto: filename=carto-1.11.3-py3-none-any.whl size=35088 sha256=5727c4d17e464042624aa4bd67984bd585edee5b387ca250969dc950b7671048
  Stored in directory: /root/.cache/pip/wheels/6b/a3/41/90fa4334cd280f91d17226f36db7a34b

In [4]:
import cartoframes as cf

In [5]:
# fiona was already installed with geopandas above, and conda is not
# available on Colab, so only the pip step is kept
!pip install shapely

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
import pandas as pd

In [7]:
import numpy as np

In [8]:
import matplotlib.pyplot as plt

# Importing and Cleaning L&I Violations Data

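I modified the download link so that one request returns every year I want: the WHERE clause in the query keeps violations dated 2017-01-01 or later. I would like to cut out some of the useless columns before importing as well, but for now they are dropped after loading.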

In [9]:
violations = gpd.GeoDataFrame.from_file("https://phl.carto.com/api/v2/sql?filename=violations&format=shp&skipfields=cartodb_id&q=SELECT%20*%20FROM%20violations%20WHERE%20violationdate%20%3E=%20%272017-01-01%27")
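For readers who want to change the date filter, here is a minimal sketch of how a link like the one above can be assembled with the standard library. The function name `build_violations_url` is hypothetical, and the column names in the second call are assumptions for illustration only; they would need to be checked against the real table schema before use.

```python
from urllib.parse import quote, urlencode

CARTO_SQL_ENDPOINT = "https://phl.carto.com/api/v2/sql"

def build_violations_url(start_date="2017-01-01", columns="*"):
    """Build a download URL for violations dated on or after start_date."""
    query = f"SELECT {columns} FROM violations WHERE violationdate >= '{start_date}'"
    params = {
        "filename": "violations",
        "format": "shp",
        "skipfields": "cartodb_id",
        "q": query,
    }
    # quote_via=quote encodes spaces as %20, as in the link used above
    return f"{CARTO_SQL_ENDPOINT}?{urlencode(params, quote_via=quote)}"

url_all_columns = build_violations_url()

# Hypothetical column names: an assumption, not confirmed against the table
url_fewer_columns = build_violations_url(columns="the_geom, violationcode")

print(url_all_columns)
```

Selecting columns in the query itself, if the column names are confirmed, would be one way to avoid downloading fields that are dropped right after loading.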

In [10]:
violations.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 577681 entries, 0 to 577680
Data columns (total 31 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   objectid    577681 non-null  int64   
 1   addressobj  576131 non-null  object  
 2   parcel_id_  275784 non-null  object  
 3   casenumber  577681 non-null  object  
 4   casecreate  574936 non-null  object  
 5   casecomple  432600 non-null  object  
 6   casetype    277673 non-null  object  
 7   casestatus  577681 non-null  object  
 8   caserespon  574387 non-null  object  
 9   casepriori  574927 non-null  object  
 10  violationn  577681 non-null  object  
 11  violationd  577681 non-null  object  
 12  violationc  577351 non-null  object  
 13  violatio_1  577298 non-null  object  
 14  violations  573528 non-null  object  
 15  violationr  177977 non-null  object  
 16  violatio_2  177982 non-null  object  
 17  mostrecent  560398 non-null  object  
 18  opa_accoun  5669

This is a big dataset. Unfortunately, there isn't an obvious set of rows to trim. I might go back in and remove some of the city-owned properties such as the School District of Philadelphia or Philadelphia Housing Authority, but knowing Philly, there could be licensed junkyards there, too. However, I can at least cut some columns!

In [11]:
# removing useless or less useful columns
# (columns= already targets columns, so axis=1 is not needed)
violations = violations.drop(columns=["addressobj", "parcel_id_", "opa_accoun",
                                      "unit_type", "council_di", "posse_jobi"])

In [12]:
violations.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 577681 entries, 0 to 577680
Data columns (total 25 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   objectid    577681 non-null  int64   
 1   casenumber  577681 non-null  object  
 2   casecreate  574936 non-null  object  
 3   casecomple  432600 non-null  object  
 4   casetype    277673 non-null  object  
 5   casestatus  577681 non-null  object  
 6   caserespon  574387 non-null  object  
 7   casepriori  574927 non-null  object  
 8   violationn  577681 non-null  object  
 9   violationd  577681 non-null  object  
 10  violationc  577351 non-null  object  
 11  violatio_1  577298 non-null  object  
 12  violations  573528 non-null  object  
 13  violationr  177977 non-null  object  
 14  violatio_2  177982 non-null  object  
 15  mostrecent  560398 non-null  object  
 16  address     575564 non-null  object  
 17  unit_num    8759 non-null    object  
 18  zip         5755

In [13]:
# removing rows without geometry. Looking at the API, a lot of those were
# redundant kind of notes entries anyway, not new records
violations = violations[violations["geometry"].notna()]

In [14]:
violations.head()

Unnamed: 0,objectid,casenumber,casecreate,casecomple,casetype,casestatus,caserespon,casepriori,violationn,violationd,...,mostrecent,address,unit_num,zip,censustrac,opa_owner,systemofre,geocode_x,geocode_y,geometry
0,13,568777,2017-01-05,,NOTICE OF VIOLATION,IN VIOLATION,CLIP,STANDARD,211935379,2017-01-05,...,,1509 DEAL ST,,19124-4403,293,"HALLIGAN DOLORES, HORLACHER ELWOOD",ECLIPSE,2712718.0,257431.046185,POINT (-75.09284 40.00927)
1,14,568777,2017-01-05,,NOTICE OF VIOLATION,IN VIOLATION,CLIP,STANDARD,211935380,2017-01-05,...,,1509 DEAL ST,,19124-4403,293,"HALLIGAN DOLORES, HORLACHER ELWOOD",ECLIPSE,2712718.0,257431.046185,POINT (-75.09284 40.00927)
2,15,568792,2017-01-05,,NOTICE OF VIOLATION,IN VIOLATION,CSU INVESTIGATOR,UNSAFE,211935404,2017-01-05,...,2022-11-14,3462 JOYCE ST,,19134-2622,188,LUTEK BETH LYNN,ECLIPSE,2709630.0,252891.378189,POINT (-75.10434 39.99707)
3,16,568792,2017-01-05,,NOTICE OF VIOLATION,IN VIOLATION,CSU INVESTIGATOR,UNSAFE,211935405,2017-01-05,...,2022-11-14,3462 JOYCE ST,,19134-2622,188,LUTEK BETH LYNN,ECLIPSE,2709630.0,252891.378189,POINT (-75.10434 39.99707)
4,17,568792,2017-01-05,,NOTICE OF VIOLATION,IN VIOLATION,CSU INVESTIGATOR,UNSAFE,211935406,2017-01-05,...,2022-11-14,3462 JOYCE ST,,19134-2622,188,LUTEK BETH LYNN,ECLIPSE,2709630.0,252891.378189,POINT (-75.10434 39.99707)


In [15]:
violations.iloc[:,8:].head()

Unnamed: 0,violationn,violationd,violationc,violatio_1,violations,violationr,violatio_2,mostrecent,address,unit_num,zip,censustrac,opa_owner,systemofre,geocode_x,geocode_y,geometry
0,211935379,2017-01-05,PM-302.2/4,EXT A-VACANT LOT CLEAN/MAINTAI,OPEN,,,,1509 DEAL ST,,19124-4403,293,"HALLIGAN DOLORES, HORLACHER ELWOOD",ECLIPSE,2712718.0,257431.046185,POINT (-75.09284 40.00927)
1,211935380,2017-01-05,CP-01,CLIP VIOLATION NOTICE,OPEN,,,,1509 DEAL ST,,19124-4403,293,"HALLIGAN DOLORES, HORLACHER ELWOOD",ECLIPSE,2712718.0,257431.046185,POINT (-75.09284 40.00927)
2,211935404,2017-01-05,PM15-108.1,UNSAFE STRUCTURE,OPEN,,,2022-11-14,3462 JOYCE ST,,19134-2622,188,LUTEK BETH LYNN,ECLIPSE,2709630.0,252891.378189,POINT (-75.10434 39.99707)
3,211935405,2017-01-05,PM15-305.1,INTERIOR UNSAFE,OPEN,,,2022-11-14,3462 JOYCE ST,,19134-2622,188,LUTEK BETH LYNN,ECLIPSE,2709630.0,252891.378189,POINT (-75.10434 39.99707)
4,211935406,2017-01-05,PM15-304.1G,EXTERIOR STRUCT UNSAFE COND 7,OPEN,,,2022-11-14,3462 JOYCE ST,,19134-2622,188,LUTEK BETH LYNN,ECLIPSE,2709630.0,252891.378189,POINT (-75.10434 39.99707)


In [None]:
#violations_save = violations.copy()

In [16]:
violations = violations[['caserespon', 'casepriori',        
            'violationc', 'violatio_1',  'geometry']]

In [17]:
len(violations.violationc.unique())

2299

Earlier, I looked at the contents of more columns. Unfortunately, addressobj and the parcel IDs are in a different format than in the business licenses data. It's still unclear what addressobj actually means. The parcels in this dataset were closer to the real parcel format but still not uniform.

## Importing Junkyard Locations

The whole set of licenses attached to each junkyard is in a different file. There is data cleaning in this notebook because I realized that (a) I hadn't flattened the data to one row per target object, and (b) I wanted to preserve, as a column, the number of different business names that have held junkyard licenses at each location.

In [18]:
junklocations = gpd.GeoDataFrame.from_file("https://github.com/CoWoGeo/PUS2022_CWolk/raw/main/ProjectLanthamWolk/PhillyJunkyardLocations.geojson")

In [19]:
junklocations.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 28 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   address          356 non-null    object  
 1   map_section      384 non-null    object  
 2   map_parcel       383 non-null    object  
 3   parcel_full      384 non-null    object  
 4   PWD_review       65 non-null     object  
 5   DOR_review       65 non-null     object  
 6   address_licdata  384 non-null    object  
 7   unit_type        56 non-null     object  
 8   unit_num         56 non-null     object  
 9   zip              384 non-null    object  
 10  opa_accoun       322 non-null    object  
 11  opa_owner        322 non-null    object  
 12  licensenum       384 non-null    object  
 13  licensetyp       384 non-null    object  
 14  initialiss       384 non-null    object  
 15  mostrecent       384 non-null    object  
 16  expiration       384 non-null    obj

In [20]:
junklocations = junklocations[["address", "map_section", "map_parcel", 
                                       "parcel_full", "business_n", "geocode_x", 
                                       "geocode_y", "geometry"]]

In [21]:
junklocations.head(20)

Unnamed: 0,address,map_section,map_parcel,parcel_full,business_n,geocode_x,geocode_y,geometry
0,,047S11,10,047S110010,CHRIS AUTO PARTS II,2678437.0,218972.547295,"POLYGON ((2679160.849 219790.732, 2679137.703 ..."
1,,047S17,35,047S170035,CHRIS AUTO PARTS II,2678437.0,218972.547295,"POLYGON ((2678435.960 218926.501, 2678379.218 ..."
2,,047S17,26,047S170026,CHRIS AUTO PARTS II,2678437.0,218972.547295,"POLYGON ((2679159.954 219790.563, 2679136.835 ..."
3,6101 W PASSYUNK AVE,047S21,23,047S210023,GIANNA SALVAGE CORP,2680928.0,223375.011346,"POLYGON ((2680829.298 223301.310, 2680808.700 ..."
4,6101 W PASSYUNK AVE,047S21,44,047S210044,GIANNA SALVAGE CORP,2680928.0,223375.011346,"POLYGON ((2680825.903 223299.337, 2680819.922 ..."
5,2731 FRANKFORD AVE,021N04,195,021N040195,MELENDEZ WILBERTO,2705116.0,249176.525141,"POLYGON ((2704960.012 249347.583, 2705052.744 ..."
6,631 W FISHER AVE,133N08,174,133N080174,ARPOL INC,2700823.0,265612.568861,"POLYGON ((2700956.026 265722.004, 2700936.794 ..."
7,631 W FISHER AVE,133N08,174,133N080174,POLAM AUTO WORLD INC,2700823.0,265612.568861,"POLYGON ((2700956.026 265722.004, 2700936.794 ..."
8,2100 S 61ST ST,029S07,134,029S070134,T & J AUTO BODY REPAIR & PAINT INC,2674739.0,227311.740363,"POLYGON ((2674950.680 227343.633, 2674774.687 ..."
9,2100 S 61ST ST,030S16,230,030S160230,T & J AUTO BODY REPAIR & PAINT INC,2674739.0,227311.740363,"POLYGON ((2674774.687 227284.231, 2674739.423 ..."


## Making a Column of Unique Business Names Per Location and Deleting Duplicate Locations

In [22]:
junklocations.groupby("parcel_full").head()

Unnamed: 0,address,map_section,map_parcel,parcel_full,business_n,geocode_x,geocode_y,geometry
0,,047S11,0010,047S110010,CHRIS AUTO PARTS II,2.678437e+06,218972.547295,"POLYGON ((2679160.849 219790.732, 2679137.703 ..."
1,,047S17,0035,047S170035,CHRIS AUTO PARTS II,2.678437e+06,218972.547295,"POLYGON ((2678435.960 218926.501, 2678379.218 ..."
2,,047S17,0026,047S170026,CHRIS AUTO PARTS II,2.678437e+06,218972.547295,"POLYGON ((2679159.954 219790.563, 2679136.835 ..."
3,6101 W PASSYUNK AVE,047S21,0023,047S210023,GIANNA SALVAGE CORP,2.680928e+06,223375.011346,"POLYGON ((2680829.298 223301.310, 2680808.700 ..."
4,6101 W PASSYUNK AVE,047S21,0044,047S210044,GIANNA SALVAGE CORP,2.680928e+06,223375.011346,"POLYGON ((2680825.903 223299.337, 2680819.922 ..."
...,...,...,...,...,...,...,...,...
379,6213 W PASSYUNK AVE,047S20,0080,047S200080,VENICE AUTO PARTS,2.679718e+06,223471.726706,"POLYGON ((2680390.397 223711.413, 2680288.056 ..."
380,6247 W PASSYUNK AVE,047S20,0071,047S200071,MATTHEWS ALL FOREIGN AUTO PARTS INC,2.679286e+06,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ..."
381,6247 W PASSYUNK AVE,047S20,0071,047S200071,JIM'S AUTO RECYCLING,2.679286e+06,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ..."
382,6247 W PASSYUNK AVE,047S20,0071,047S200071,LUONG ANH NGOC,2.679286e+06,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ..."


Note that there are parcels without any address!

In [23]:
junklocations.nunique()

address        129
map_section     84
map_parcel     111
parcel_full    168
business_n     241
geocode_x      124
geocode_y      124
geometry       167
dtype: int64

In [24]:
junklocations["business_count"] = junklocations.groupby("parcel_full")["business_n"].transform('nunique')

This column repeats values: every row for a parcel carries that parcel's total count of unique businesses. That's okay, because the duplicate rows per parcel are dropped further down.
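To make the behavior of `transform('nunique')` concrete, here is a small sketch on made-up data. The parcel IDs and business names are hypothetical; only the column names match the notebook.

```python
import pandas as pd

# Hypothetical license rows: several businesses can share one parcel
licenses = pd.DataFrame({
    "parcel_full": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "business_n": ["ACME AUTO", "ACME AUTO", "B&B SALVAGE",
                   "C PARTS", "D SCRAP", "E SCRAP"],
})

# transform returns one value per original row, so the count is repeated
licenses["business_count"] = (
    licenses.groupby("parcel_full")["business_n"].transform("nunique")
)

# Keeping the first row per parcel leaves one count per location
one_per_parcel = (
    licenses.drop_duplicates(subset=["parcel_full"], keep="first")
            .reset_index(drop=True)
)
print(one_per_parcel[["parcel_full", "business_count"]])
```

Because the count is the same on every row of a parcel, it does not matter which row `drop_duplicates` keeps for this column.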

In [25]:
junklocations.tail()

Unnamed: 0,address,map_section,map_parcel,parcel_full,business_n,geocode_x,geocode_y,geometry,business_count
379,6213 W PASSYUNK AVE,047S20,80,047S200080,VENICE AUTO PARTS,2679718.0,223471.726706,"POLYGON ((2680390.397 223711.413, 2680288.056 ...",5
380,6247 W PASSYUNK AVE,047S20,71,047S200071,MATTHEWS ALL FOREIGN AUTO PARTS INC,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",3
381,6247 W PASSYUNK AVE,047S20,71,047S200071,JIM'S AUTO RECYCLING,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",3
382,6247 W PASSYUNK AVE,047S20,71,047S200071,LUONG ANH NGOC,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",3
383,2201-09 SPRING GARDEN ST,004N24,112,004N240112,LINNETT'S GULF INC,2690249.0,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",1


In [26]:
numbiz = junklocations.groupby("business_n")
print("The number of unique businesses overall is", numbiz.ngroups)

The number of unique businesses overall is 241


One thing I'm curious about in the data, but not for this model, is how the total of unique businesses counted per location differs from the number of unique businesses overall. (A business name that changed location, or that has two locations, would be counted repeatedly in the former but only once in the latter.)
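A possible way to check this, sketched on made-up data (the parcel IDs and business names are hypothetical), is to compare the sum of per-parcel unique counts against the overall unique count and then list the names that appear on more than one parcel.

```python
import pandas as pd

# Hypothetical data: ACME AUTO holds licenses on two different parcels
licenses = pd.DataFrame({
    "parcel_full": ["P1", "P1", "P2", "P3"],
    "business_n": ["ACME AUTO", "B&B SALVAGE", "ACME AUTO", "C PARTS"],
})

# Unique businesses per parcel, summed across parcels
per_parcel = licenses.groupby("parcel_full")["business_n"].nunique()
total_by_location = int(per_parcel.sum())

# Unique businesses across the whole table
unique_overall = licenses["business_n"].nunique()

# Names that show up on more than one parcel explain the difference
parcels_per_business = licenses.groupby("business_n")["parcel_full"].nunique()
multi_location = parcels_per_business[parcels_per_business > 1]

print(total_by_location, unique_overall)
print(multi_location)
```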

In [27]:
#dropping duplicate rows
junklocations1 = junklocations.drop_duplicates(subset=["parcel_full"], keep='first')

In [28]:
junklocations1.nunique()

address           129
map_section        84
map_parcel        111
parcel_full       168
business_n        117
geocode_x         121
geocode_y         121
geometry          167
business_count     10
dtype: int64

In [29]:
junklocations1.tail()

Unnamed: 0,address,map_section,map_parcel,parcel_full,business_n,geocode_x,geocode_y,geometry,business_count
373,3111 GRAYS FERRY AVE,009S06,32,009S060032,UC TECH INC,2684602.0,231312.0,"POLYGON ((2684634.992 231397.724, 2684640.581 ...",1
374,5247-57 UNRUH AVE,111N22,10,111N220010,ORTHODOX AUTO CO INC,2727557.0,261081.029566,"POLYGON ((2727858.623 261131.420, 2727876.458 ...",1
375,6213 W PASSYUNK AVE,047S20,80,047S200080,TANGS AUTO PARTS,2679718.0,223471.726706,"POLYGON ((2680390.397 223711.413, 2680288.056 ...",5
380,6247 W PASSYUNK AVE,047S20,71,047S200071,MATTHEWS ALL FOREIGN AUTO PARTS INC,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",3
383,2201-09 SPRING GARDEN ST,004N24,112,004N240112,LINNETT'S GULF INC,2690249.0,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",1


In [30]:
junklocations1 = junklocations1[["address", "map_section", "map_parcel", 
                                 "parcel_full", "business_count", 
                                 "geocode_x", "geocode_y", "geometry"]]

In [None]:
junklocations1.tail()

Unnamed: 0,address,map_section,map_parcel,parcel_full,business_count,geocode_x,geocode_y,geometry
373,3111 GRAYS FERRY AVE,009S06,32,009S060032,1,2684602.0,231312.0,"POLYGON ((2684634.992 231397.724, 2684640.581 ..."
374,5247-57 UNRUH AVE,111N22,10,111N220010,1,2727557.0,261081.029566,"POLYGON ((2727858.623 261131.420, 2727876.458 ..."
375,6213 W PASSYUNK AVE,047S20,80,047S200080,5,2679718.0,223471.726706,"POLYGON ((2680390.397 223711.413, 2680288.056 ..."
380,6247 W PASSYUNK AVE,047S20,71,047S200071,3,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ..."
383,2201-09 SPRING GARDEN ST,004N24,112,004N240112,1,2690249.0,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ..."


In [33]:
#resetting index to be 0-(n-1)
junklocations1.reset_index(inplace=True, drop=True)

In [34]:
junklocations1.tail()

Unnamed: 0,address,map_section,map_parcel,parcel_full,business_count,geocode_x,geocode_y,geometry
163,3111 GRAYS FERRY AVE,009S06,32,009S060032,1,2684602.0,231312.0,"POLYGON ((2684634.992 231397.724, 2684640.581 ..."
164,5247-57 UNRUH AVE,111N22,10,111N220010,1,2727557.0,261081.029566,"POLYGON ((2727858.623 261131.420, 2727876.458 ..."
165,6213 W PASSYUNK AVE,047S20,80,047S200080,5,2679718.0,223471.726706,"POLYGON ((2680390.397 223711.413, 2680288.056 ..."
166,6247 W PASSYUNK AVE,047S20,71,047S200071,3,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ..."
167,2201-09 SPRING GARDEN ST,004N24,112,004N240112,1,2690249.0,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ..."


# Merge Violations with Junkyard Business License Locations and Data

In [35]:
print("The CRS of Junk Locations is", junklocations1.crs)

The CRS of Junk Locations is epsg:2272


In [36]:
#checking violations CRS
print("The CRS of the Junkyard Violations is" , violations.crs)

The CRS of the Junkyard Violations is epsg:4326


In [37]:
#changing the CRS of violations to match
violations_proj = violations.to_crs(2272)
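The reprojection matters because the two layers use different units: EPSG:4326 stores longitude and latitude in degrees, while EPSG:2272 is a Pennsylvania state plane system measured in feet. A minimal sketch with a single point, copied from the first row of the violations table, shows the change in coordinates. The variable names are illustrative only.

```python
import geopandas as gpd
from shapely.geometry import Point

# One violation location in longitude/latitude (EPSG:4326)
pts = gpd.GeoDataFrame(
    {"address": ["1509 DEAL ST"]},
    geometry=[Point(-75.09284, 40.00927)],
    crs="EPSG:4326",
)

# Reproject to the CRS used by the junkyard parcels
pts_proj = pts.to_crs(epsg=2272)

x_ft = pts_proj.geometry.iloc[0].x
y_ft = pts_proj.geometry.iloc[0].y
print(pts_proj.crs, x_ft, y_ft)
```

After reprojection the coordinates should be on the same scale as the `geocode_x` and `geocode_y` columns, which is a quick sanity check before a spatial join.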

In [48]:
#spatial join, keeping extra columns in case I want to compare them
junkviolations = gpd.sjoin(junklocations1,violations_proj, how="left", lsuffix="lic", rsuffix="viol")
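Here is a small sketch of what a left spatial join does, using made-up squares and points rather than the real parcels. A parcel that contains several violations is repeated once per violation, and a parcel with none is kept with empty violation columns. That is consistent with the `info()` output that follows, where 2121 of 2152 rows have violation fields filled in, leaving 31 rows for parcels with no matching violation.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two hypothetical square parcels, 10 ft on a side
parcels = gpd.GeoDataFrame(
    {"parcel_full": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
        Polygon([(20, 0), (30, 0), (30, 10), (20, 10)]),
    ],
    crs="EPSG:2272",
)

# Three hypothetical violations: two inside parcel A, one outside both
points = gpd.GeoDataFrame(
    {"violatio_1": ["UNSAFE STRUCTURE", "WASTE MATERIAL", "OBTAIN LIC INDICATED"]},
    geometry=[Point(1, 1), Point(5, 5), Point(100, 100)],
    crs="EPSG:2272",
)

# how="left" keeps every parcel, matched or not
joined = gpd.sjoin(parcels, points, how="left", lsuffix="lic", rsuffix="viol")

# count() skips missing values, so parcel B reports zero violations
per_parcel = joined.groupby("parcel_full")["violatio_1"].count()
print(per_parcel)
```

One thing to keep in mind with this approach is that overlapping parcel polygons would each pick up the same violation point.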

In [49]:
junkviolations.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2152 entries, 0 to 167
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   address         1899 non-null   object  
 1   map_section     2152 non-null   object  
 2   map_parcel      2151 non-null   object  
 3   parcel_full     2152 non-null   object  
 4   business_count  2152 non-null   int64   
 5   geocode_x       2152 non-null   float64 
 6   geocode_y       2152 non-null   float64 
 7   geometry        2152 non-null   geometry
 8   index_viol      2121 non-null   float64 
 9   caserespon      2121 non-null   object  
 10  casepriori      2121 non-null   object  
 11  violationc      2121 non-null   object  
 12  violatio_1      2121 non-null   object  
dtypes: float64(3), geometry(1), int64(1), object(8)
memory usage: 235.4+ KB


In [50]:
#we decided to drop this because it's the exact violation code. Maybe for future analysis!
junkviolations.drop("violationc", axis=1, inplace=True)

In [51]:
junkviolations.drop(["index_viol", "geocode_x", "geocode_y", "address", "map_parcel"], axis=1, inplace=True)

In [52]:
junkviolations.shape

(2152, 7)

In [53]:
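# reset_index returns a new frame, so assign it back to keep the new index
junkviolations = junkviolations.reset_index(drop=True)
junkviolations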

Unnamed: 0,map_section,parcel_full,business_count,geometry,caserespon,casepriori,violatio_1
0,047S11,047S110010,1,"POLYGON ((2679160.849 219790.732, 2679137.703 ...",CSU INVESTIGATOR,UNSAFE,UNSAFE STRUCTURE
1,047S11,047S110010,1,"POLYGON ((2679160.849 219790.732, 2679137.703 ...",CSU INVESTIGATOR,UNSAFE,ARCHITECT/ENGINEER SERVICES
2,047S11,047S110010,1,"POLYGON ((2679160.849 219790.732, 2679137.703 ...",AUDITS AND INVESTIGATIONS BUILDING CERTS,STANDARD,PIERS & WATERFRONT STRUCTURES INSPECTION REQUIRED
3,047S11,047S110010,1,"POLYGON ((2679160.849 219790.732, 2679137.703 ...",AUDITS AND INVESTIGATIONS BUILDING CERTS,STANDARD,PIERS & WATERFRONT STRUCTURES INSPECTION REQUIRED
4,047S11,047S110010,1,"POLYGON ((2679160.849 219790.732, 2679137.703 ...",AUDITS AND INVESTIGATIONS BUILDING CERTS,STANDARD,PIERS & WATERFRONT STRUCTURES INSPECTION REQUIRED
...,...,...,...,...,...,...,...
2147,047S20,047S200071,3,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",FIRE SAFETY INVESTIGATOR,STANDARD,STORAGE OF SAFETY DATA SHEETS\r\n
2148,047S20,047S200071,3,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",FIRE SAFETY INVESTIGATOR,STANDARD,"VEHICLE SALVAGE, TIRE REBUILDING, & TIRE STORAGE"
2149,004N24,004N240112,1,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",CI,STANDARD,OBTAIN LIC INDICATED
2150,004N24,004N240112,1,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",CI,STANDARD,RENEW LIC INDICATED


# One-Hot Encoding Violation Data

In [54]:
jv_1he = pd.get_dummies(junkviolations[["caserespon","casepriori","violatio_1"]]).reset_index().drop("index", axis=1)
jv_1he

Unnamed: 0,caserespon_AUDITS AND INVESTIGATIONS BUILDING CERTS,caserespon_BU,caserespon_BUILDING COURTS,caserespon_BUILDING INVESTIGATOR,caserespon_CI,caserespon_CLIP,caserespon_CLIP - VACANT LOT INVESTIGATOR,caserespon_CODE ENFORCEMENT COURTS,caserespon_CODE ENFORCEMENT INVESTIGATOR,caserespon_CSU,...,violatio_1_WASTE ACCUMULATION PROHIBITED\r\n,violatio_1_WASTE CANS,violatio_1_WASTE CANS\r\n,violatio_1_WASTE HANDLING\r\n,violatio_1_WASTE MATERIAL,violatio_1_WASTE MATERIAL\r\n,violatio_1_WELDING-PERMIT REQ?D,violatio_1_WET CHEMICAL SYSTEM TEST,violatio_1_WET CHEMICAL SYSTEM TEST\r\n,violatio_1_WORKING SPACE/CLEARANCE
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2147,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2148,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2149,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2150,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
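The column list above contains pairs such as `violatio_1_WASTE CANS` and `violatio_1_WASTE CANS\r\n`, which means trailing line breaks in the source text split one violation title into two features. A possible fix, sketched here on a tiny made-up series, is to strip whitespace before encoding. This is a suggestion, not what the notebook ran.

```python
import pandas as pd

# Titles as they appear in the data, some with a trailing "\r\n"
titles = pd.Series(
    ["WASTE CANS", "WASTE CANS\r\n", "WASTE MATERIAL\r\n", "WASTE MATERIAL"],
    name="violatio_1",
)

# Encoding the raw text treats "X" and "X\r\n" as different categories
raw_dummies = pd.get_dummies(titles, dtype=int)

# Stripping whitespace first merges them into one column each
clean_dummies = pd.get_dummies(titles.str.strip(), dtype=int)

print(raw_dummies.shape[1], clean_dummies.shape[1])
```

In the notebook this would amount to applying `.str.strip()` to `violatio_1` before calling `pd.get_dummies`, which would likely reduce the number of feature columns.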


In [55]:
tmp_ = junkviolations[["parcel_full"]].reset_index().drop("index", axis=1).join(jv_1he)
tmp_.shape

(2152, 338)

In [56]:
tmp_[tmp_.parcel_full == "155N090023"]

Unnamed: 0,parcel_full,caserespon_AUDITS AND INVESTIGATIONS BUILDING CERTS,caserespon_BU,caserespon_BUILDING COURTS,caserespon_BUILDING INVESTIGATOR,caserespon_CI,caserespon_CLIP,caserespon_CLIP - VACANT LOT INVESTIGATOR,caserespon_CODE ENFORCEMENT COURTS,caserespon_CODE ENFORCEMENT INVESTIGATOR,...,violatio_1_WASTE ACCUMULATION PROHIBITED\r\n,violatio_1_WASTE CANS,violatio_1_WASTE CANS\r\n,violatio_1_WASTE HANDLING\r\n,violatio_1_WASTE MATERIAL,violatio_1_WASTE MATERIAL\r\n,violatio_1_WELDING-PERMIT REQ?D,violatio_1_WET CHEMICAL SYSTEM TEST,violatio_1_WET CHEMICAL SYSTEM TEST\r\n,violatio_1_WORKING SPACE/CLEARANCE
1610,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1611,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1612,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1613,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1614,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
1615,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1616,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1617,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1618,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1619,155N090023,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [57]:
tmp_.groupby("parcel_full").sum()

Unnamed: 0_level_0,caserespon_AUDITS AND INVESTIGATIONS BUILDING CERTS,caserespon_BU,caserespon_BUILDING COURTS,caserespon_BUILDING INVESTIGATOR,caserespon_CI,caserespon_CLIP,caserespon_CLIP - VACANT LOT INVESTIGATOR,caserespon_CODE ENFORCEMENT COURTS,caserespon_CODE ENFORCEMENT INVESTIGATOR,caserespon_CSU,...,violatio_1_WASTE ACCUMULATION PROHIBITED\r\n,violatio_1_WASTE CANS,violatio_1_WASTE CANS\r\n,violatio_1_WASTE HANDLING\r\n,violatio_1_WASTE MATERIAL,violatio_1_WASTE MATERIAL\r\n,violatio_1_WELDING-PERMIT REQ?D,violatio_1_WET CHEMICAL SYSTEM TEST,violatio_1_WET CHEMICAL SYSTEM TEST\r\n,violatio_1_WORKING SPACE/CLEARANCE
parcel_full,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
004N240112,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
009S060029,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
009S060032,0,0,0,0,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
009S060037,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
009S060043,0,0,0,0,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150N240018,0,7,0,0,2,0,0,0,3,3,...,0,0,0,0,0,0,1,0,0,0
150N240031,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
150N240050,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
150N240056,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
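Summing one-hot columns within each parcel turns the indicators into counts: each cell becomes the number of violations of that type at that location. A minimal sketch with made-up rows (the parcel IDs are hypothetical) shows the idea, including a parcel with no violations.

```python
import pandas as pd

# Hypothetical joined rows; P3 has no matching violation
joined = pd.DataFrame({
    "parcel_full": ["P1", "P1", "P1", "P2", "P3"],
    "casepriori": ["UNSAFE", "STANDARD", "STANDARD", "STANDARD", None],
})

# One indicator column per category; the missing value becomes all zeros
one_hot = pd.get_dummies(joined[["casepriori"]], dtype=int)

# Attach the parcel ID and add up the indicators per parcel
counts = (
    pd.concat([joined[["parcel_full"]], one_hot], axis=1)
      .groupby("parcel_full")
      .sum()
)
print(counts)
```

Parcels without violations end up as rows of zeros, so they stay in the table rather than disappearing from the model input.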


In [60]:
junkviolations_oh = junkviolations[["map_section","parcel_full","business_count","geometry"]
               ].groupby("parcel_full").first().join(tmp_.groupby("parcel_full").sum())

In [61]:
junkviolations_oh

Unnamed: 0_level_0,map_section,business_count,geometry,caserespon_AUDITS AND INVESTIGATIONS BUILDING CERTS,caserespon_BU,caserespon_BUILDING COURTS,caserespon_BUILDING INVESTIGATOR,caserespon_CI,caserespon_CLIP,caserespon_CLIP - VACANT LOT INVESTIGATOR,...,violatio_1_WASTE ACCUMULATION PROHIBITED\r\n,violatio_1_WASTE CANS,violatio_1_WASTE CANS\r\n,violatio_1_WASTE HANDLING\r\n,violatio_1_WASTE MATERIAL,violatio_1_WASTE MATERIAL\r\n,violatio_1_WELDING-PERMIT REQ?D,violatio_1_WET CHEMICAL SYSTEM TEST,violatio_1_WET CHEMICAL SYSTEM TEST\r\n,violatio_1_WORKING SPACE/CLEARANCE
parcel_full,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
004N240112,004N24,1,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",0,0,0,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
009S060029,009S06,1,"POLYGON ((2684793.810 231346.359, 2684722.606 ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
009S060032,009S06,1,"POLYGON ((2684634.992 231397.724, 2684640.581 ...",0,0,0,0,5,0,0,...,0,0,0,0,0,0,0,0,0,0
009S060037,009S06,1,"POLYGON ((2684793.810 231346.359, 2684722.606 ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
009S060043,009S06,1,"POLYGON ((2684634.992 231397.724, 2684708.683 ...",0,0,0,0,5,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150N240018,150N24,1,"POLYGON ((2672176.106 266044.413, 2672100.590 ...",0,7,0,0,2,0,0,...,0,0,0,0,0,0,1,0,0,0
150N240031,150N24,1,"POLYGON ((2671287.317 266579.049, 2671213.614 ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
150N240050,150N24,1,"POLYGON ((2671285.560 266581.577, 2671198.612 ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
150N240056,150N24,1,"POLYGON ((2672430.931 265735.444, 2672374.887 ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Exporting L&I Violations for Our Model

In [62]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [63]:
drive.mount("content")

Mounted at content


In [64]:
cd /content/content/MyDrive/Colab Notebooks

/content/content/MyDrive/Colab Notebooks


In [65]:
junkviolations_oh.to_file("/content/content/MyDrive/Colab Notebooks/JunkyardViolations_OHE.geojson", driver="GeoJSON")

## Just testing that it worked

In [66]:
cd /content/content/MyDrive/Colab Notebooks

/content/content/MyDrive/Colab Notebooks


In [70]:
ls

 09_26Class_Geopandas.ipynb
 09_28_Class_MoreGeoPandas.ipynb
 10_03_Class_LinearRegression.ipynb
 AutocorrelationCensus.ipynb
'Carl Jung.jpg'
 ClassDemoBusTimes.ipynb
'Copy of MidtermLanthamWolk DEPDataNotebook.ipynb'
'HW3 HW3.ipynb'
'HW4 HW4.ipynb'
'HW5 HW5_TimeSeriesClustering.ipynb'
'HW6\HW6_RandomForestsChicago (1).ipynb'
'HW6\HW6_RandomForestsChicago.ipynb'
'HW6\HW6_RandomForestsDC.ipynb'
'HW6\HW6_RandomForestsDC Take 2.ipynb'
'HW7 HW7_deepdream.ipynb'
'HW HW5_timeSeriesClustering.ipynb'
 JunkyardViolations_OHE.geojson
 kaggle.json
'MidtermLanthamWolk 311 Requests Notebook.ipynb'
'MidtermLanthamWolk DEPDataNotebook.ipynb'
'MidtermLanthamWolk LanthamWolkMidtermData.ipynb'
'MidtermLanthamWolk LIBusinessLicenses.ipynb'
'MidtermLanthamWolk LIViolations.ipynb'
 PhillyJunkLicensesLocations.csv
 PhillyJunkLicensesLocations.geojson
 PhillyJunkyardLocations.geojson
 PUS2022_1031_MoransINYCWomens.ipynb
 PUS22_Class2and3.ipynb
 [0m[01;34mTitanicKaggleFraggleRockData[0m/
 TitanicRandomFore

## You can ignore this old step: Messily Looking at Mismatches in Geocode and Geometry Columns

In [None]:
junkviolations.nunique()

address_lic        129
map_section         84
map_parcel         111
parcel_full        168
business_count      10
geocode_x_lic      121
geocode_y_lic      121
geometry           167
index_viol        1542
objectid          1542
casenumber         403
casecreate         272
casecomple         247
casetype             2
casestatus           3
caserespon          19
casepriori           6
violationn        1542
violationd         285
violationc         332
violatio_1         312
violations           8
violationr         106
violatio_2          10
mostrecent         301
address_viol       120
unit_num             5
zip                113
censustrac          37
opa_owner          122
systemofre           2
geocode_x_viol     162
geocode_y_viol     167
dtype: int64

In [None]:
#selecting just the columns I want to compare
junkgeo = junkviolations[["geocode_x_lic", "geocode_x_viol",
                          "geocode_y_lic", "geocode_y_viol", "geometry"]]

In [None]:
junkgeo.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2152 entries, 0 to 167
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   geocode_x_lic   2152 non-null   float64 
 1   geocode_x_viol  2121 non-null   float64 
 2   geocode_y_lic   2152 non-null   float64 
 3   geocode_y_viol  2121 non-null   float64 
 4   geometry        2152 non-null   geometry
dtypes: float64(4), geometry(1)
memory usage: 100.9 KB


In [None]:
junkgeo["geocode_x_lic"]

0      2.678437e+06
0      2.678437e+06
0      2.678437e+06
0      2.678437e+06
0      2.678437e+06
           ...     
166    2.679286e+06
166    2.679286e+06
167    2.690249e+06
167    2.690249e+06
167    2.690249e+06
Name: geocode_x_lic, Length: 2152, dtype: float64

In [None]:
weirdgeo = junkgeo.groupby(["geocode_x_lic"])

In [None]:
idx = [x[0] for x in weirdgeo.groups.values() if len(x) == 1]

In [None]:
idx

[98, 63, 60, 146, 145, 147, 149, 136, 119, 108, 130, 137, 141, 107, 123, 122]

In [None]:
mismatches = junkgeo.loc[[71, 39, 36, 100, 95, 1, 91, 76, 67, 85, 54, 20, 66, 
                          35, 25, 82, 80, 79, 27]]
mismatches

Unnamed: 0,geocode_x_lic,geocode_x_viol,geocode_y_lic,geocode_y_viol,geometry
71,2.700817e+06,2.700800e+06,248719.992220,248728.625163,"POLYGON ((2700857.520 248711.701, 2700783.745 ..."
71,2.700817e+06,2.700800e+06,248719.992220,248728.625163,"POLYGON ((2700857.520 248711.701, 2700783.745 ..."
71,2.700817e+06,2.700800e+06,248719.992220,248728.625163,"POLYGON ((2700857.520 248711.701, 2700783.745 ..."
71,2.700817e+06,2.700800e+06,248719.992220,248728.625163,"POLYGON ((2700857.520 248711.701, 2700783.745 ..."
71,2.700817e+06,2.700800e+06,248719.992220,248728.625163,"POLYGON ((2700857.520 248711.701, 2700783.745 ..."
...,...,...,...,...,...
27,2.732425e+06,2.732130e+06,265003.685308,263167.735625,"POLYGON ((2732380.592 265561.662, 2732387.279 ..."
27,2.732425e+06,2.732130e+06,265003.685308,263167.735625,"POLYGON ((2732380.592 265561.662, 2732387.279 ..."
27,2.732425e+06,2.732130e+06,265003.685308,263167.735625,"POLYGON ((2732380.592 265561.662, 2732387.279 ..."
27,2.732425e+06,2.732130e+06,265003.685308,263167.735625,"POLYGON ((2732380.592 265561.662, 2732387.279 ..."


So the mismatches seem to be due to NaNs in the violation geocodes, but the rows with data seem to match!

In [None]:
missing = junkgeo[junkgeo.isna().any(axis=1)]
print(missing.index)

Int64Index([  3,   4,  33,  34,  42,  43,  46,  47,  58,  59,  60,  85,  98,
            107, 108, 112, 113, 119, 122, 123, 125, 126, 136, 141, 145, 146,
            147, 149, 155, 156, 157],
           dtype='int64')


In [None]:
missing.count()

geocode_x_lic     31
geocode_x_viol     0
geocode_y_lic     31
geocode_y_viol     0
geometry          31
dtype: int64

It looks like almost all of the mismatches in unique counts, NaNs, etc. are due to missing geocode data in the violations set, though there are a few cases where the same geometry has very slightly different license geocode data. The different numbers of index values I get through these methods are probably down to whether groupby expands the data or not. So I'm not going to worry about those non-matching numbers anymore.
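One way to separate the two causes is to flag rows where the violation geocode is missing and, among the remaining rows, flag those where the license and violation geocodes disagree by more than some tolerance. This is only a sketch: the toy frame is loosely modeled on the rows printed above, and the `tol` value is an assumption, not something taken from the data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for junkgeo (illustrative values, not the real frame)
toy = pd.DataFrame({
    "geocode_x_lic":  [2700817.0, 2732425.0, 2680928.0],
    "geocode_x_viol": [2700800.0, 2732130.0, np.nan],
    "geocode_y_lic":  [248720.0, 265003.7, 223375.0],
    "geocode_y_viol": [248728.6, 263167.7, np.nan],
})

tol = 100.0  # assumed tolerance, in the same units as the coordinates

# Rows with no violation geocode at all
missing_viol = toy["geocode_x_viol"].isna() | toy["geocode_y_viol"].isna()

# Rows that have both geocodes but disagree by more than tol on either axis
dx = (toy["geocode_x_lic"] - toy["geocode_x_viol"]).abs()
dy = (toy["geocode_y_lic"] - toy["geocode_y_viol"]).abs()
far_apart = ~missing_viol & ((dx > tol) | (dy > tol))

summary = pd.Series({
    "missing_viol_geocode": int(missing_viol.sum()),
    "far_apart": int(far_apart.sum()),
    "close_match": int((~missing_viol & ~far_apart).sum()),
})
print(summary)
```

Run on the real `junkgeo`, the three counts would show how much of the mismatch is really just missing data.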

# Cleaning up the Junkyard Violations for the Model

In [59]:
junkviolations.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2152 entries, 0 to 167
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   map_section     2152 non-null   object  
 1   parcel_full     2152 non-null   object  
 2   business_count  2152 non-null   int64   
 3   geometry        2152 non-null   geometry
 4   caserespon      2121 non-null   object  
 5   casepriori      2121 non-null   object  
 6   violatio_1      2121 non-null   object  
dtypes: geometry(1), int64(1), object(5)
memory usage: 134.5+ KB


I'm just going to make my life easier now that I've done that and drop some more columns.

In [None]:
junkviolations = junkviolations.drop(columns=["geocode_x_viol",
                                              "geocode_y_viol", "index_viol",
                                              "objectid", "censustrac"])

In [None]:
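#dropping identifiers bc Fed said to
#note: drop() returns a new frame; without assigning the result back,
#junkviolations keeps these columns (they still appear in later outputs)
junkviolations.drop(columns=["address_lic", "map_parcel"])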

In [None]:
#looks like this is useless? It's sort of a date plus code thing?
print(junkviolations["violationn"].tolist())

['211947484', '211947482', 'VI-2021-034274', 'VI-2021-043645', 'VI-2021-043644', 'VI-2021-034267', '211947483', '211947484', '211947482', 'VI-2021-034274', 'VI-2021-043645', 'VI-2021-043644', 'VI-2021-034267', '211947483', '211947484', '211947482', 'VI-2021-034274', 'VI-2021-043645', 'VI-2021-043644', 'VI-2021-034267', '211947483', nan, nan, '4541650', '4541649', '4541648', 'VI-2021-038382', '4850280', 'VI-2020-041731', 'VI-2020-041726', 'VI-2020-041721', 'VI-2020-041720', '4935901', 'VI-2021-086560', 'VI-2021-086554', '4896024', 'VI-2020-041791', 'VI-2020-041816', '4935899', 'VI-2020-041788', 'VI-2020-041815', 'VI-2020-041735', 'VI-2020-041703', 'VI-2020-041701', 'VI-2020-041696', 'VI-2020-041680', 'VI-2020-041664', 'VI-2020-041754', 'VI-2021-086557', '4935902', 'VI-2020-041820', 'VI-2021-086559', 'VI-2020-041743', 'VI-2021-086558', 'VI-2020-041669', '4935903', 'VI-2020-041657', 'VI-2021-086555', 'VI-2020-041648', 'VI-2020-041643', 'VI-2020-041639', 'VI-2021-086556', '4935898', 'VI-20

In [None]:
# ok, so this is violation code and could be chopped to type
print(junkviolations["violationc"].tolist())

['PM15-108.1', 'A-304.1/1', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-108.1', 'A-304.1/1', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-108.1', 'A-304.1/1', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-314', 'PM15-314', nan, nan, 'CP-312A', 'CP-305', 'CP-01', 'A-301.1/65', 'LO-1', 'FC-13-503', 'FC-13-3401', 'FC-13-315.2', 'FC-13-315.2.2', 'FC-13-2703.5', 'FC-13-906.2', 'FC-13-2701.5', 'LO-1', 'PM15-604.3', 'PM15-403.5', 'FC-13-2701.8', 'FC-13-901.7.2', 'PM15-603.1', 'FC-13-605.4', 'FC-13-308.1.9', 'FC-13-304.1.2', 'FC-13-3003.5.3', 'FC-13-2703.5.1', 'FC-13-2701.8', 'FC-13-906.2', 'FC-13-2703.5.1', 'FC-13-270393', 'FC-13-2704', 'FC-13-605.4', 'FC-13-605.6', 'FC-13-308.1.9', 'FC-13-2703.5', 'FC-13-503', 'FC-13-2510.5', 'LR-1', 'FC-13-2703.7.1', 'FC-13-1503.4.3', 'FC-13-1003.4', 'LR-1', 'FC-13-1006.3', 'FC-13-304.1', 'FC-13-906', 'FC-13-503.3', 'FC-13-270181', 'FC-13-906.3', 'FC-13-915.1', 'FC-13-915.1A', 'PM15-108.1', 'FC-13-300373', 'FC-13-30035

Prefixes look like FC, PM, CP, F, A, LO, LR, and some that just have numbers.
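Chopping the code down to its type could be done with a regular expression that grabs the leading run of letters. This is a rough sketch: most of the sample codes are copied from the printed list, but `"9-3902"` is a made-up stand-in for the number-only codes, and the `NUMERIC` and `MISSING` labels are my own placeholders.

```python
import pandas as pd

# Codes copied from the outputs in this notebook, plus one hypothetical
# all-numeric code ("9-3902") and one missing value
codes = pd.Series(["PM15-108.1", "A-304.1/1", "CP-312A", "LO-1",
                   "FC-13-503", "LR-1", "F-5003.4.1", "9-3902", None])

# The leading run of letters is treated as the code-type prefix
prefix = codes.str.extract(r"^([A-Za-z]+)", expand=False)

# Codes with no leading letters, and missing codes, get their own labels
prefix = prefix.where(prefix.notna(), "NUMERIC")
prefix = prefix.where(codes.notna(), "MISSING")

print(prefix.value_counts())
```

Note that this treats `PM15` as `PM`; if the digits matter, the pattern would need to change.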

In [None]:
junkviolations["violatio_1"].nunique()

312

In [None]:
junkviolations["violatio_1"].unique()

array(['UNSAFE STRUCTURE', 'ARCHITECT/ENGINEER SERVICES',
       'PIERS & WATERFRONT STRUCTURES INSPECTION REQUIRED',
       'PIERS AND WATERFRT STRUCT INSP', nan, 'HIGH WEEDS-CUT',
       'RUBBISH/GARBAGE EXTERIOR-OWNER', 'CLIP VIOLATION NOTICE',
       'NEW USE', 'OBTAIN LIC INDICATED', 'FIRE APPARATUS ROADS',
       'FLAMMABLE & COMBUSTIBLE LIQUIDS', 'STORAGE IN BUILDINGS',
       'STORAGE MEANS OF EGRESS', 'NFPA 704 SIGNS',
       'FIRE EXTINGUISHERS TAGGED', 'HAZMAT LICENSE REQUIRED',
       'ELECTRICAL- HAZARD', 'VENTILATION- DRYER EXHAUST',
       'SAFETY DATA SHEETS/CONTAINER', 'OUT-OF-SERVICE TAG',
       'MECHANICAL- MECHANICAL EQUIPMENT', 'MULTIPLUG ADAPTORS',
       'PORTABLE HEATING/COOKING EQUIP', 'VEGETATION',
       'SECURING COMPRESSED CYLINDERS', 'MARKING OF CONTAINERS',
       'PROTECTION FROM VEHICLES',
       'HAZARDOUS MATERIAL STORAGE REQUIREMENTS', 'OPEN JUNCTION BOXES',
       'TIRE STORAGE', 'FAIL OBTAIN LICENSE\n', 'SMOKING PROHIBITED',
       'WASTE CANS', '

This is a lot!! And what's with the \r\n?
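The `\r\n` and `\n` are line-ending characters stored at the end of some descriptions, so otherwise identical descriptions can count as different values. A small sketch of stripping them follows; the un-suffixed `"FAIL OBTAIN LICENSE"` row is a hypothetical duplicate added to show the effect.

```python
import pandas as pd

# Descriptions copied from the outputs in this notebook, plus one
# hypothetical clean duplicate of "FAIL OBTAIN LICENSE"
desc = pd.Series(["FAIL OBTAIN LICENSE\n", "FAIL OBTAIN LICENSE",
                  "STORAGE OF SAFETY DATA SHEETS\r\n", "TIRE STORAGE", None])

# Strip leading/trailing whitespace (including \r and \n),
# then collapse any inner runs of whitespace to a single space
clean = desc.str.strip().str.replace(r"\s+", " ", regex=True)

print(desc.nunique(), "->", clean.nunique())
```

On the real column this might bring the 312 unique descriptions down a bit, though most of the variety is probably genuine.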

In [None]:
junkviolations.columns

Index(['address_lic', 'map_section', 'map_parcel', 'parcel_full',
       'business_count', 'geocode_x_lic', 'geocode_y_lic', 'geometry',
       'casenumber', 'casecreate', 'casecomple', 'casetype', 'casestatus',
       'caserespon', 'casepriori', 'violationn', 'violationd', 'violationc',
       'violatio_1', 'violations', 'violationr', 'violatio_2', 'mostrecent',
       'address_viol', 'unit_num', 'zip', 'opa_owner'],
      dtype='object')

In [None]:
#comparing case create date and violation date. They are mostly the same.
junkviolations.loc[50:60, ["casecreate","violationd"]].tail()

Unnamed: 0,casecreate,violationd
57,2017-11-17,2017-11-17
57,2018-09-10,2018-09-07
58,,
59,,
60,,
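To put a number on "mostly the same", one option is to parse both columns as dates and compute the share of rows where they are equal, plus the gap in days where they differ. This sketch uses a toy frame modeled on the rows above rather than the full data.

```python
import pandas as pd

# Toy rows modeled on the table above (two dated rows, one empty row)
toy = pd.DataFrame({
    "casecreate": ["2017-11-17", "2018-09-10", None],
    "violationd": ["2017-11-17", "2018-09-07", None],
})

create = pd.to_datetime(toy["casecreate"])
viol = pd.to_datetime(toy["violationd"])

both = create.notna() & viol.notna()      # rows where both dates exist
same_share = (create[both] == viol[both]).mean()

# Positive gap means the case was created after the violation date
gap_days = (create[both] - viol[both]).dt.days

print(f"same date: {same_share:.0%}")
print(gap_days.describe())
```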


In [None]:
#dropping useless violationn, redundant violationd
junkviolations.drop(columns=["violationd", "violationn"], inplace=True)

In [None]:
#looking at data grouped by parcel
junkviolations.groupby("parcel_full").first()

parcel_full,address_lic,map_section,map_parcel,business_count,geocode_x_lic,geocode_y_lic,geometry,casenumber,casecreate,casecomple,...,violatio_1,violations,violationr,violatio_2,mostrecent,address_viol,unit_num,zip,opa_owner,systemofre
004N240112,2201-09 SPRING GARDEN ST,004N24,0112,1,2.690249e+06,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",694977,2019-07-10,2019-10-28,...,OBTAIN LIC INDICATED,COMPLIED,,,2019-10-28,2201-15 SPRING GARDEN ST,,19130-3511,E LINNETT LP,HANSEN
009S060029,1056 S 31ST ST,009S06,0029,1,2.684776e+06,231458.753416,"POLYGON ((2684793.810 231346.359, 2684722.606 ...",652749,2018-09-04,2019-03-22,...,VACANT LOT LICENSE,CLOSEDCASE,,,2019-03-06,1056 S 31ST ST,,19121-0000,GRAYS FERRY HOLDING INC,HANSEN
009S060032,3111 GRAYS FERRY AVE,009S06,0032,1,2.684602e+06,231312.000000,"POLYGON ((2684634.992 231397.724, 2684640.581 ...",CF-2022-002203,2022-01-10,2017-06-21,...,RESPONSIBILITY FOR CLEANUP,OPEN,,,2022-01-10,3111 GRAYS FERRY AVE,,19146-2706,SOMERMAN SHIRLEY B,ECLIPSE
009S060037,1056 S 31ST ST,009S06,0037,1,2.684776e+06,231458.753416,"POLYGON ((2684793.810 231346.359, 2684722.606 ...",652749,2018-09-04,2019-03-22,...,VACANT LOT LICENSE,CLOSEDCASE,,,2019-03-06,1056 S 31ST ST,,19121-0000,GRAYS FERRY HOLDING INC,HANSEN
009S060043,3111 GRAYS FERRY AVE,009S06,0043,1,2.684602e+06,231312.000000,"POLYGON ((2684634.992 231397.724, 2684708.683 ...",CF-2022-002203,2022-01-10,2017-06-21,...,RESPONSIBILITY FOR CLEANUP,OPEN,,,2022-01-10,3111 GRAYS FERRY AVE,,19146-2706,SOMERMAN SHIRLEY B,ECLIPSE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150N240018,,150N24,0018,1,2.671240e+06,266612.192963,"POLYGON ((2672176.106 266044.413, 2672100.590 ...",CF-2021-056096,2021-06-24,2021-08-09,...,FIRE EXTINGUISHERS TAGGED,COMPLIED,2021-08-09,COMPLIED - OWNER REPAIR,2021-08-09,5050 UMBRIA ST,E,19128-4347,ELITE SPORTS FACTORY HOLD,ECLIPSE
150N240031,5112 UMBRIA ST,150N24,0031,1,2.671240e+06,266612.192963,"POLYGON ((2671287.317 266579.049, 2671213.614 ...",,,,...,,,,,,,,,,
150N240050,5112 UMBRIA ST,150N24,0050,1,2.671240e+06,266612.192963,"POLYGON ((2671285.560 266581.577, 2671198.612 ...",,,,...,,,,,,,,,,
150N240056,5010 UMBRIA ST,150N24,0056,1,2.671947e+06,265972.998998,"POLYGON ((2672430.931 265735.444, 2672374.887 ...",653583,2018-09-10,2019-01-22,...,VEGETATION,COMPLIED,,,2019-01-22,5010 UMBRIA ST,,19128-4347,MOYER CHRISTOPHER E KATHRYN H/W,HANSEN


In [None]:
junkviolations.drop(["systemofre"], axis=1, inplace=True)

In [None]:
junkviolations.nunique()

address_lic        129
map_section         84
map_parcel         111
parcel_full        168
business_count      10
geocode_x_lic      121
geocode_y_lic      121
geometry           167
casenumber         403
casecreate         272
casecomple         247
casetype             2
casestatus           3
caserespon          19
casepriori           6
violationn        1542
violationd         285
violationc         332
violatio_1         312
violations           8
violationr         106
violatio_2          10
mostrecent         301
address_viol       120
unit_num             5
zip                113
opa_owner          122
dtype: int64

In [None]:
print("The number of violations with unique IDs is", junkviolations["violationn"].nunique())

The number of violations with unique IDs is 1542


In [None]:
junkviolations[junkviolations["violationn"].isna()]

Unnamed: 0,address_lic,map_section,map_parcel,parcel_full,business_count,geocode_x_lic,geocode_y_lic,geometry,casenumber,casecreate,...,violationc,violatio_1,violations,violationr,violatio_2,mostrecent,address_viol,unit_num,zip,opa_owner
3,6101 W PASSYUNK AVE,047S21,23.0,047S210023,1,2680928.0,223375.011346,"POLYGON ((2680829.298 223301.310, 2680808.700 ...",,,...,,,,,,,,,,
4,6101 W PASSYUNK AVE,047S21,44.0,047S210044,1,2680928.0,223375.011346,"POLYGON ((2680825.903 223299.337, 2680819.922 ...",,,...,,,,,,,,,,
33,4500 N FAIRHILL ST,121N24,19.0,121N240019,3,2700609.0,260924.365235,"POLYGON ((2700644.551 260831.211, 2700616.257 ...",,,...,,,,,,,,,,
34,520 W ANNSBURY ST,133N06,23.0,133N060023,3,2700609.0,260924.365235,"POLYGON ((2700644.551 260831.211, 2700616.257 ...",,,...,,,,,,,,,,
42,4201-19 ARAMINGO AVE,082N02,28.0,082N020028,1,2714910.0,253705.347795,"POLYGON ((2714993.639 253591.465, 2714972.585 ...",,,...,,,,,,,,,,
43,4201 ARAMINGO AVE,082N02,19.0,082N020019,1,2714910.0,253705.347795,"POLYGON ((2714993.639 253591.465, 2714972.585 ...",,,...,,,,,,,,,,
46,4927 ARENDELL AVE,114N19,205.0,114N190205,2,2740101.0,272814.424766,"POLYGON ((2740335.567 272928.392, 2739832.982 ...",,,...,,,,,,,,,,
47,,114N19,150.0,114N190150,2,2740101.0,272814.424766,"POLYGON ((2740335.567 272928.392, 2739832.982 ...",,,...,,,,,,,,,,
58,4031 ORCHARD ST,083N04,39.0,083N040039,2,2713776.0,256055.0,"POLYGON ((2713883.535 256127.871, 2713838.108 ...",,,...,,,,,,,,,,
59,4001-31 ORCHARD ST,083N04,74.0,083N040074,2,2713776.0,256055.0,"POLYGON ((2713688.910 255946.073, 2713731.056 ...",,,...,,,,,,,,,,


All the rows with a null violationn are also missing most of the other violation fields! So I will drop them.

In [None]:
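#note: this calls dropna on the violationn Series only, so no rows are
#removed from junkviolations itself; dropping the rows would need
#junkviolations.dropna(subset=["violationn"]) assigned back to junkviolations
junkviolations["violationn"].dropna(inplace=True)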
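The difference between dropping NaNs from one column and dropping the rows from the frame can be seen on a toy example. The frame below is made up for illustration; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one parcel with a violation, one without (illustrative pairing)
toy = pd.DataFrame({
    "parcel_full": ["047S200071", "047S210023"],
    "violationn": ["VI-2021-034274", np.nan],
})

# dropna on a single column returns a shorter Series; the frame is unchanged
col_only = toy["violationn"].dropna()
print(len(col_only), len(toy))

# dropna with subset= removes the whole row from the frame
dropped = toy.dropna(subset=["violationn"])
print(len(dropped))
```

One thing to weigh before dropping the rows: parcels with no recorded violations would disappear from the frame entirely, which may not be what the model needs.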

Should count case numbers, count 
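Counting cases and violations per parcel could look something like the sketch below. It is a guess at what the note above has in mind, on a toy frame: the case numbers are taken from the `tail()` output, while the `v1`, `v2` and `v3` violation IDs and the result column names are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    "parcel_full": ["004N240112"] * 3 + ["150N240031"],
    "casenumber":  ["694977", "694977", "694981", None],
    "violationn":  ["v1", "v2", "v3", None],   # hypothetical IDs
})

per_parcel = toy.groupby("parcel_full").agg(
    n_cases=("casenumber", "nunique"),      # distinct case numbers, NaN ignored
    n_violations=("violationn", "count"),   # non-null violation rows
)
print(per_parcel)
```

Parcels with no violations keep a row with zeros in both columns, as long as their NaN rows are still in the frame when this runs.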

In [None]:
junkviolations.tail()

Unnamed: 0,address_lic,map_section,map_parcel,parcel_full,business_count,geocode_x_lic,geocode_y_lic,geometry,casenumber,casecreate,...,violationc,violatio_1,violations,violationr,violatio_2,mostrecent,address_viol,unit_num,zip,opa_owner
166,6247 W PASSYUNK AVE,047S20,71,047S200071,3,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",CF-2022-027991,2022-04-05,...,F-5003.4.1,STORAGE OF SAFETY DATA SHEETS\r\n,COMPLIED,2022-06-23,COMPLIED - OWNER REPAIR,2022-06-24,6247 W PASSYUNK AVE,,19153-3509,"NGO RICHARD, NGO STEVEN"
166,6247 W PASSYUNK AVE,047S20,71,047S200071,3,2679286.0,223056.180343,"POLYGON ((2679266.293 223243.072, 2679413.183 ...",CF-2021-099465,2021-10-08,...,FC-13-2501,"VEHICLE SALVAGE, TIRE REBUILDING, & TIRE STORAGE",COMPLIED,2021-11-29,COMPLIED - LICENSE/CERTIFICATE/REPORT OBTAINED,2021-11-29,6247 W PASSYUNK AVE,,19153-3509,"NGO RICHARD, NGO STEVEN"
167,2201-09 SPRING GARDEN ST,004N24,112,004N240112,1,2690249.0,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",694977,2019-07-10,...,LO-1,OBTAIN LIC INDICATED,COMPLIED,,,2019-10-28,2201-15 SPRING GARDEN ST,,19130-3511,E LINNETT LP
167,2201-09 SPRING GARDEN ST,004N24,112,004N240112,1,2690249.0,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",694977,2019-07-10,...,LR-1,RENEW LIC INDICATED,COMPLIED,,,2019-10-28,2201-15 SPRING GARDEN ST,,19130-3511,E LINNETT LP
167,2201-09 SPRING GARDEN ST,004N24,112,004N240112,1,2690249.0,240420.666163,"POLYGON ((2690335.698 240461.367, 2690321.702 ...",694981,2019-07-10,...,FC-13-105611,REPAIR GARAGES/MOTOR FUEL,COMPLIED,,,2020-02-04,2201-15 SPRING GARDEN ST,,19130-3511,E LINNETT LP


In [None]:
junkviolations.drop("")