# Chicago Open Data API 

We chose the dataset of crimes reported in Chicago since 2001.
We confirm that the data is accessible by displaying the first records of each variable.
We load 200,000 records from the dataset (the `limit` passed to the API call).

In [1]:
import os
from datetime import date
import numpy as np
import pandas as pd
from sodapy import Socrata

# Credentials are read from the environment instead of being hard-coded.
# The environment variable names below are placeholders; use your own.
MyAppToken = os.environ.get("SOCRATA_APP_TOKEN")
# Load the complementary datasets
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
#client = Socrata("data.cityofchicago.org", None)

# Example authenticated client (needed for non-public datasets):
OnlineDataset = Socrata("data.cityofchicago.org",
                        MyAppToken,
                        username=os.environ.get("SOCRATA_USERNAME"),
                        password=os.environ.get("SOCRATA_PASSWORD"))
print("Cliente de API Autenticado...")

# Up to 200,000 results per dataset, returned as JSON from the API and
# converted to a Python list of dictionaries by sodapy.
CrimesDataset = OnlineDataset.get("6zsd-86xi", limit=200000)
print("Cargado Data Set Principal Chicago Crimes...")
IUCRdataset = OnlineDataset.get("miqk-suf4", limit=200000)
print("Cargado Data Set de IUCR...")
BeatsDataset = OnlineDataset.get("n9it-hstw", limit=200000)
print("Cargado Data Set de Beats...")
# NOTE: this call reuses the dataset ID of the IUCR call above ("miqk-suf4"),
# so DistrictDataset holds the same data as IUCRdataset; the district
# dataset's own ID was probably intended here.
DistrictDataset = OnlineDataset.get("miqk-suf4", limit=200000)

# Convert the list of crime records to a pandas DataFrame.
print("Transformando los Dataset Cargados a Pandas...")
CrimesDatasetP = pd.DataFrame.from_records(CrimesDataset)
print("Data Set principal convertido...")
print(CrimesDatasetP.head(n=5))

Cliente de API Autenticado...
Cargado Data Set Principal Chicago Crimes...
Cargado Data Set de IUCR...
Cargado Data Set de Beats...
Transformando los Dataset Cargados a Pandas...
Data Set principal convertido...
   arrest  beat                    block case_number community_area  \
0   False  1023        024XX W OGDEN AVE    JB150383             28   
1   False  0412          016XX E 86TH PL    JA366925             45   
2   False  0215          003XX E 47TH ST    JA522842             38   
3    True  1034  026XX S CALIFORNIA BLVD    JA529032             30   
4    True  1221  007XX N SACRAMENTO BLVD    JA545986             23   

                      date                            description district  \
0  2018-02-13T08:20:00.000                         $500 AND UNDER      010   
1  2001-01-01T11:00:00.000    FINANCIAL IDENTITY THEFT OVER $ 300      004   
2  2017-11-23T15:14:00.000                    AGGRAVATED: HANDGUN      002   
3  2017-11-28T21:43:00.000  VIOLENT OFFENDER: ANN
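
A `limit` on its own may not determine which rows come back: the sample above mixes records from 2018 and 2001. The following is a minimal sketch of how the most recent records could be requested instead, assuming the client's `get` method accepts the SoQL `order` and `limit` keyword arguments, as sodapy's does. `fetch_latest` and `FakeClient` are hypothetical names introduced here so the example runs without network access; the three rows reuse case numbers and dates from the output above.

```python
def fetch_latest(client, dataset_id, n, date_field="date"):
    """Return the n most recent records, newest first.

    `client` is assumed to behave like sodapy's Socrata client, i.e. its
    get() method forwards `order` and `limit` as SoQL query parameters.
    """
    return client.get(dataset_id, order=f"{date_field} DESC", limit=n)


class FakeClient:
    """Hypothetical in-memory stand-in for the API client (illustration only)."""

    def __init__(self, rows):
        self.rows = rows

    def get(self, dataset_id, order=None, limit=None):
        rows = list(self.rows)
        if order:
            # Parse "<field> <direction>" and sort accordingly.
            field, _, direction = order.partition(" ")
            rows.sort(key=lambda r: r[field], reverse=(direction.upper() == "DESC"))
        return rows[:limit]


rows = [
    {"case_number": "JA366925", "date": "2001-01-01T11:00:00.000"},
    {"case_number": "JB150383", "date": "2018-02-13T08:20:00.000"},
    {"case_number": "JA522842", "date": "2017-11-23T15:14:00.000"},
]
latest = fetch_latest(FakeClient(rows), "6zsd-86xi", n=2)
print([r["case_number"] for r in latest])
```

With a real client, the same helper would be called as `fetch_latest(OnlineDataset, "6zsd-86xi", 200000)`.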

Number of observations per value of each variable.

In [2]:
print(CrimesDatasetP.groupby('primary_type').size())
print(CrimesDatasetP.groupby('fbi_code').size())
print(CrimesDatasetP.groupby('beat').size())
print(CrimesDatasetP.groupby('community_area').size())
print(CrimesDatasetP.groupby('arrest').size())
print(CrimesDatasetP.groupby('year').size())
print(CrimesDatasetP.groupby('location_description').size())
print(CrimesDatasetP.groupby('domestic').size())

primary_type
ARSON                                  379
ASSAULT                              13117
BATTERY                              37203
BURGLARY                              9799
CONCEALED CARRY LICENSE VIOLATION       18
CRIM SEXUAL ASSAULT                    911
CRIMINAL DAMAGE                      22425
CRIMINAL TRESPASS                     5884
DECEPTIVE PRACTICE                    8721
DOMESTIC VIOLENCE                        1
GAMBLING                               216
HOMICIDE                               621
HUMAN TRAFFICKING                        2
INTERFERENCE WITH PUBLIC OFFICER       401
INTIMIDATION                           111
KIDNAPPING                             267
LIQUOR LAW VIOLATION                   518
MOTOR VEHICLE THEFT                  10148
NARCOTICS                            19794
NON-CRIMINAL                            10
NON-CRIMINAL (SUBJECT SPECIFIED)         1
OBSCENITY                               28
OFFENSE INVOLVING CHILDREN            123
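
The eight `groupby(...).size()` calls above repeat the same pattern. As a sketch of a more compact alternative, the counts can be collected in a loop with `value_counts()`, sorted by value so the result matches `groupby(...).size()`. The toy `sample` frame and the helper name `count_by_column` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy sample (made up for illustration, not taken from the Chicago data).
sample = pd.DataFrame({
    "primary_type": ["THEFT", "BATTERY", "THEFT", "ARSON", "THEFT"],
    "arrest": [False, True, False, False, True],
})


def count_by_column(df, columns):
    """Return {column: Series of counts per distinct value, sorted by value}."""
    return {col: df[col].value_counts().sort_index() for col in columns}


counts = count_by_column(sample, ["primary_type", "arrest"])
for col, series in counts.items():
    print(series)
```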

# Converting the 'arrest' and 'domestic' variables

We replace False with 0 and True with 1 in the arrest and domestic variables. Note that the code stores them as the strings '0' and '1'.

In [3]:
CrimesDatasetP.loc[CrimesDatasetP.arrest == False, 'arrest'] = '0'
CrimesDatasetP.loc[CrimesDatasetP.arrest == True, 'arrest'] = '1'
CrimesDatasetP.loc[CrimesDatasetP.domestic == False, 'domestic'] = '0'
CrimesDatasetP.loc[CrimesDatasetP.domestic == True, 'domestic'] = '1'
print(CrimesDatasetP.groupby('arrest').size())
print(CrimesDatasetP.groupby('domestic').size())

arrest
0    142840
1     57160
dtype: int64
domestic
0    172469
1     27531
dtype: int64
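
If numeric 0/1 values are preferred over the strings '0' and '1', one possible alternative is to cast the boolean columns directly. This is a minimal sketch on a made-up `sample` frame, assuming both columns hold genuine booleans, as the comparisons against `False` and `True` above suggest.

```python
import pandas as pd

# Toy sample (made up for illustration).
sample = pd.DataFrame({
    "arrest": [False, True, False],
    "domestic": [True, False, False],
})

# Cast both boolean columns to integers: False -> 0, True -> 1.
encoded = sample.astype({"arrest": int, "domestic": int})
print(encoded.groupby("arrest").size())
```

Integer columns can be fed to most modelling libraries without a further conversion step, which is the main reason to prefer them here.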


# Cleaning the date variable and creating new columns

We convert the date variable to a date data type. We clean it by keeping only the part before the letter 'T' that separates the date from the time, and we add new columns with the year, the month, the week, and the day of the week.

In [4]:
def clean_split_dates(row):
    # Initial date contains the current value for the date column
    initial_date = str(row['date'])

    # Split initial_date into two elements if "T" is found
    split_date = initial_date.split('T') 

    # If a "T"  is found, split_date will contain a list with at least two items
    if len(split_date) > 1:
        final_date = split_date[0]
    # If no "T" is found, split_date will just contain 1 item, the initial_date
    else:
        final_date = initial_date
    
    return final_date

# Assign the results of "clean_split_dates" to the 'date' column. 
# We want Pandas to go row-wise so we set "axis=1".
CrimesDatasetP['date'] = CrimesDatasetP.apply(lambda row: clean_split_dates(row), axis=1)
CrimesDatasetP['date'].head(5)

0    2018-02-13
1    2001-01-01
2    2017-11-23
3    2017-11-28
4    2017-12-11
Name: date, dtype: object
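
The row-wise `apply` above calls a Python function once per record. A vectorized sketch of the same cleaning step, using pandas string methods on the column, might look like this. The `raw` series is a small made-up sample whose first two values are taken from the output shown earlier.

```python
import pandas as pd

raw = pd.Series(
    ["2018-02-13T08:20:00.000", "2001-01-01T11:00:00.000", "2017-11-23"],
    name="date",
)

# Keep everything before the first "T"; values without a "T" are unchanged.
date_only = raw.astype(str).str.split("T").str[0]
print(date_only.tolist())
```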

In [5]:
CrimesDatasetP['date'] = pd.to_datetime(CrimesDatasetP['date'])
CrimesDatasetP['year'] = CrimesDatasetP['date'].dt.year
CrimesDatasetP['month'] = CrimesDatasetP['date'].dt.month
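# Series.dt.week is deprecated; isocalendar().week returns the same ISO week number (pandas >= 1.1).
CrimesDatasetP['week'] = CrimesDatasetP['date'].dt.isocalendar().week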
CrimesDatasetP['weekday'] = CrimesDatasetP['date'].dt.weekday
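
To make the derived columns concrete: `weekday` counts from Monday = 0, and the week is the ISO 8601 week number. The sketch below derives the four columns for three dates that appear in this notebook's output, assuming a pandas version that provides `Series.dt.isocalendar()` (1.1 or later).

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2018-02-13", "2017-11-23", "2017-11-28"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    # ISO 8601 week number, cast to a plain integer dtype.
    "week": dates.dt.isocalendar().week.astype(int),
    # Monday = 0, ..., Sunday = 6.
    "weekday": dates.dt.weekday,
})
print(parts)
```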

# New data frame

We select the variables we will use for our Machine Learning algorithm, keeping only records from 2017 and 2018.

In [6]:
finalcrimedataset = CrimesDatasetP[['description','beat','domestic','primary_type','location_description','fbi_code','arrest',
                                   'year','month','week','weekday']]

In [7]:
finaldf = finalcrimedataset.loc[finalcrimedataset['year'] >= 2017]
print(finaldf.head(5))

                             description  beat domestic   primary_type  \
0                         $500 AND UNDER  1023        0          THEFT   
2                    AGGRAVATED: HANDGUN  0215        0        ASSAULT   
3  VIOLENT OFFENDER: ANNUAL REGISTRATION  1034        0  OTHER OFFENSE   
4                         ARMED: HANDGUN  1221        0        ROBBERY   
5                DOMESTIC BATTERY SIMPLE  1131        1        BATTERY   

      location_description fbi_code arrest  year  month  week  weekday  
0               RESTAURANT       06      0  2018      2     7        1  
2         DEPARTMENT STORE      04A      0  2017     11    47        3  
3  JAIL / LOCK-UP FACILITY       26      1  2017     11    48        1  
4                 SIDEWALK       03      1  2017     12    50        0  
5                APARTMENT      08B      0  2018      1     5        1  


In [8]:
print(len(finaldf))

45684


Of the 200k records we started with, we keep 45,684 (about 46k) for our model.
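
The year filter and the retained share can be sketched as follows. This version uses `between` to bound the range on both sides, which is one way to keep exclusively 2017 and 2018 even if later years were present. The `sample` frame is made up for illustration.

```python
import pandas as pd

# Toy sample (made up for illustration).
sample = pd.DataFrame({
    "primary_type": ["THEFT", "ASSAULT", "BATTERY", "ROBBERY"],
    "year": [2018, 2001, 2017, 2016],
})

# Boolean mask: True for rows whose year is 2017 or 2018 (both ends inclusive).
mask = sample["year"].between(2017, 2018)
recent = sample.loc[mask]

# Fraction of the original rows that survive the filter.
retained_share = len(recent) / len(sample)
print(len(recent), retained_share)
```

Applied to the counts reported above, 45,684 of 200,000 records is about 22.8% of the data.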