## General Dataset Cleaning

In this notebook we clean the "Chicago_Crimes_2012_to_2017.csv" dataset so that it is lighter and faster to work with once it is loaded into the page.

In [3]:
import pandas as pd

In [2]:
df = pd.read_csv("Chicago_Crimes_2012_to_2017.csv")

A first look at the data to understand its structure and size.


In [6]:
df.shape

(1456714, 23)

In [3]:
df.head()

,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,3,10508693,HZ250496,05/03/2016 11:40:00 PM,013XX S SAWYER AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,True,...,24.0,29.0,08B,1154907.0,1893681.0,2016,05/10/2016 03:56:50 PM,41.864073,-87.706819,"(41.864073157, -87.706818608)"
1,89,10508695,HZ250409,05/03/2016 09:40:00 PM,061XX S DREXEL AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,...,20.0,42.0,08B,1183066.0,1864330.0,2016,05/10/2016 03:56:50 PM,41.782922,-87.604363,"(41.782921527, -87.60436317)"
2,197,10508697,HZ250503,05/03/2016 11:31:00 PM,053XX W CHICAGO AVE,470,PUBLIC PEACE VIOLATION,RECKLESS CONDUCT,STREET,False,...,37.0,25.0,24,1140789.0,1904819.0,2016,05/10/2016 03:56:50 PM,41.894908,-87.758372,"(41.894908283, -87.758371958)"
3,673,10508698,HZ250424,05/03/2016 10:10:00 PM,049XX W FULTON ST,460,BATTERY,SIMPLE,SIDEWALK,False,...,28.0,25.0,08B,1143223.0,1901475.0,2016,05/10/2016 03:56:50 PM,41.885687,-87.749516,"(41.885686845, -87.749515983)"
4,911,10508699,HZ250455,05/03/2016 10:00:00 PM,003XX N LOTUS AVE,820,THEFT,$500 AND UNDER,RESIDENCE,False,...,28.0,25.0,06,1139890.0,1901675.0,2016,05/10/2016 03:56:50 PM,41.886297,-87.761751,"(41.886297242, -87.761750709)"


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1456714 entries, 0 to 1456713
Data columns (total 23 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   Unnamed: 0            1456714 non-null  int64  
 1   ID                    1456714 non-null  int64  
 2   Case Number           1456713 non-null  object 
 3   Date                  1456714 non-null  object 
 4   Block                 1456714 non-null  object 
 5   IUCR                  1456714 non-null  object 
 6   Primary Type          1456714 non-null  object 
 7   Description           1456714 non-null  object 
 8   Location Description  1455056 non-null  object 
 9   Arrest                1456714 non-null  bool   
 10  Domestic              1456714 non-null  bool   
 11  Beat                  1456714 non-null  int64  
 12  District              1456713 non-null  float64
 13  Ward                  1456700 non-null  float64
 14  Community Area        1456674 non-

## Cleaning the data for later use

First we drop the rows where latitude or longitude is null, since the spatial visualizations need both coordinates.


In [7]:
df.dropna(subset=['Latitude', 'Longitude'], inplace=True)

In [8]:
df.shape

(1419631, 23)
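
The shape goes from 1,456,714 to 1,419,631 rows, so 37,083 rows were dropped. A quick way to confirm that a drop like this removes exactly the rows with a missing coordinate is sketched below on a small made-up frame (the `toy` frame and its values are illustrative assumptions, not data from the CSV):

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Latitude": [41.86, None, 41.89, 41.88],
    "Longitude": [-87.70, -87.60, None, -87.74],
})

# Flag rows where at least one of the two coordinates is missing
missing_mask = toy[["Latitude", "Longitude"]].isna().any(axis=1)
n_missing = int(missing_mask.sum())

# Same call as in the notebook, but returning a new frame instead of using inplace
cleaned = toy.dropna(subset=["Latitude", "Longitude"])

# The number of rows removed should equal the number of flagged rows
rows_removed = len(toy) - len(cleaned)
print(n_missing, rows_removed)
```

Counting the incomplete rows before dropping them makes it easy to notice if far more data disappears than expected.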

We convert the "Date" column from object to a pandas datetime and derive the Hour, Month, and DayOfWeek columns from it, mainly to make the visualization work easier.

In [None]:
# Convert the 'Date' column to datetime, parsing with the explicit 12-hour format
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
# Derive hour, month, and weekday name from the parsed dates
df['Hour'] = df['Date'].dt.hour
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.day_name()
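
To see what the format string does, here is a minimal sketch on two sample strings. The first is the `Date` value of the first row shown by `df.head()`; the second is an invented value added only to show how `%I` together with `%p` handles a time just after midnight.

```python
import pandas as pd

# First string comes from the df.head() output; the second is an illustrative assumption
raw = pd.Series(["05/03/2016 11:40:00 PM", "05/03/2016 12:15:00 AM"])

# %I is the 12-hour clock hour and %p is the AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

hours = parsed.dt.hour.tolist()            # 24-hour values
months = parsed.dt.month.tolist()          # 1 = January ... 12 = December
weekdays = parsed.dt.day_name().tolist()   # English weekday names by default

print(hours, months, weekdays)
```

Because the derived `Hour` values are on a 24-hour scale, they can be grouped or plotted directly without any further AM/PM handling.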

In [11]:
df.info  # without (), this shows the bound-method repr (a row preview); df.info() gives the column summary

<bound method DataFrame.info of          Unnamed: 0        ID Case Number                Date  \
0                 3  10508693    HZ250496 2016-05-03 23:40:00   
1                89  10508695    HZ250409 2016-05-03 21:40:00   
2               197  10508697    HZ250503 2016-05-03 23:31:00   
3               673  10508698    HZ250424 2016-05-03 22:10:00   
4               911  10508699    HZ250455 2016-05-03 22:00:00   
...             ...       ...         ...                 ...   
1456709     6250330  10508679    HZ250507 2016-05-03 23:33:00   
1456710     6251089  10508680    HZ250491 2016-05-03 23:30:00   
1456711     6251349  10508681    HZ250479 2016-05-03 00:15:00   
1456712     6253257  10508690    HZ250370 2016-05-03 21:07:00   
1456713     6253474  10508692    HZ250517 2016-05-03 23:38:00   

                        Block  IUCR            Primary Type  \
0          013XX S SAWYER AVE  0486                 BATTERY   
1          061XX S DREXEL AVE  0486                 BATTERY  

Next we drop some columns (variables) that we will not use, so the visualization work is more efficient.

In [None]:
# Columns that are IDs, codes, or redundant with other columns
columnas_a_eliminar = [
    'Unnamed: 0', 'IUCR', 'FBI Code', 
    'Updated On', 'X Coordinate', 'Y Coordinate'
]
df.drop(columns=columnas_a_eliminar, inplace=True)

In [15]:
df.shape

(1419631, 20)
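
One practical caveat in a notebook: running a `drop(..., inplace=True)` cell a second time raises a `KeyError`, because the columns are already gone. A possible variant, sketched here on a made-up frame (the `toy` frame and the `to_drop` name are illustrative assumptions), passes `errors="ignore"` so the step can be re-run safely:

```python
import pandas as pd

# Hypothetical toy frame with a few of the dataset's column names
toy = pd.DataFrame({
    "ID": [10508693, 10508695],
    "IUCR": ["0486", "0486"],
    "FBI Code": ["08B", "08B"],
    "Primary Type": ["BATTERY", "BATTERY"],
})

# "Unnamed: 0" is deliberately absent from toy to show that missing labels are skipped
to_drop = ["Unnamed: 0", "IUCR", "FBI Code"]

# errors="ignore" drops the labels that exist and silently skips the rest
trimmed = toy.drop(columns=to_drop, errors="ignore")

# Applying the same drop again is a no-op instead of a KeyError
trimmed_again = trimmed.drop(columns=to_drop, errors="ignore")

print(list(trimmed.columns))
```

The trade-off is that a misspelled column name is also skipped silently, so checking `df.shape` afterwards, as the notebook does, remains useful.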

Finally we save this df (DataFrame) as Parquet, a lighter and faster-to-read file format, so it is ready for visualization.

In [None]:
df.to_parquet('crimes_cleaned.parquet') 
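
On the consuming side, the file can be loaded with `pd.read_parquet`, optionally restricted to the columns a given chart needs. The sketch below is a round trip on a made-up frame and assumes a Parquet engine (`pyarrow` or `fastparquet`) is installed. The `parquet_roundtrip` helper, the `sample.parquet` file name, and the sample values are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def parquet_roundtrip(frame, columns=None):
    """Write frame to a temporary Parquet file and read it back (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.parquet"  # illustrative file name
        frame.to_parquet(path)
        # columns=None reads everything; a list reads only those columns
        return pd.read_parquet(path, columns=columns)


# Hypothetical sample with an already-parsed datetime column
sample = pd.DataFrame({
    "ID": [10508693, 10508695],
    "Date": pd.to_datetime(["2016-05-03 23:40:00", "2016-05-03 21:40:00"]),
    "Primary Type": ["BATTERY", "BATTERY"],
})

try:
    restored = parquet_roundtrip(sample, columns=["ID", "Date"])
except ImportError:
    # pandas raises ImportError when no Parquet engine is available
    restored = None

if restored is not None:
    print(restored.dtypes)
```

Unlike CSV, Parquet stores column types, so the `Date` column should come back as a datetime without parsing it again.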