## Transforming Zonas de Patrullaje 2018 - clean_ZonasPatrullaje.csv 

This notebook extracts and transforms the clean dataset by adding the new columns our investigation needs: the list of radii, the average radius, and the standard deviation of the radii for each patrol zone.

In [21]:
# We are importing the necessary libraries to start working
import pandas as pd
import numpy as np
from os import path
from ast import literal_eval
from vincenty import vincenty

In [22]:
# Setting filename and location
filename = path.join("..","Clean","clean_ZonasPatrullaje.csv")

# Read the patrol zones file and store it in a pandas DataFrame
zonas_patrullaje_df = pd.read_csv(filename, sep=';')

# Making sure that this is the correct clean dataset
zonas_patrullaje_df.head()

Unnamed: 0,Geopoint,Geoshape,Alcaldía,Sector 18,Área km2,x,y
0,"19.4559485754, -99.1339187632","{""type"": ""Polygon"", ""coordinates"": [[[-99.1373...",CUAUHTEMOC,TLATELOLCO,0.599031,-99.133437,19.455969
1,"19.4489311584, -99.1492549723","{""type"": ""Polygon"", ""coordinates"": [[[-99.1529...",CUAUHTEMOC,BUENAVISTA,0.542691,-99.149153,19.448988
2,"19.4466167038, -99.1372059309","{""type"": ""Polygon"", ""coordinates"": [[[-99.1388...",CUAUHTEMOC,BUENAVISTA,0.139906,-99.136628,19.446343
3,"19.4345863188, -99.1559474685","{""type"": ""Polygon"", ""coordinates"": [[[-99.1587...",CUAUHTEMOC,REVOLUCION,0.263673,-99.156147,19.434677
4,"19.4287224255, -99.1566989128","{""type"": ""Polygon"", ""coordinates"": [[[-99.1544...",CUAUHTEMOC,REVOLUCION,0.294612,-99.156832,19.428315


# Creating new columns 'Avg_Radius' & 'list_Radius'
For each row, we parse the Geoshape dictionary and use its coordinates to measure the distance between the zone's center point (the Geopoint) and every point of the geoshape polygon.

## Step 1 - Extract the Geoshape coordinates

In [23]:
# Select the Geoshape and Geopoint columns from our DataFrame
geoshapes_resultsSet = zonas_patrullaje_df['Geoshape']
geopoint_resultSet = zonas_patrullaje_df['Geopoint']

# This list holds one dictionary per row (center point plus polygon coordinates), in the same order as the DataFrame
geoshapes = list()

for geopoint, geoshape in zip(geopoint_resultSet, geoshapes_resultsSet):
    coordinates = literal_eval(geoshape)['coordinates']
    geoshapes.append({'geopoint': geopoint, 'coordinates': coordinates})


In [24]:
#Example of a Geoshape Dictionary
print(geoshapes[0]['geopoint'])
print(geoshapes[0]['coordinates'][0][0])


19.4559485754, -99.1339187632
[-99.13739031215657, 19.459475890504873]
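
The output above shows that the two columns order their values differently: Geopoint is "latitude, longitude", while each Geoshape coordinate is [longitude, latitude]. The sketch below normalizes both to (latitude, longitude) tuples. It is only an illustration: the shortened Geoshape string and the helper names `parse_geopoint` and `polygon_vertices` are hypothetical, and `json.loads` is used here as an alternative to `literal_eval`, not as the notebook's method.

```python
import json

# Hypothetical, shortened Geoshape string (a real row has many more vertices)
geoshape_text = (
    '{"type": "Polygon", "coordinates": '
    '[[[-99.1374, 19.4595], [-99.1300, 19.4520], [-99.1374, 19.4595]]]}'
)
geopoint_text = "19.4559485754, -99.1339187632"


def parse_geopoint(text):
    """Turn 'lat, lon' text into a (lat, lon) tuple of floats."""
    lat, lon = (float(part) for part in text.split(","))
    return lat, lon


def polygon_vertices(text):
    """Return the outer ring of a GeoJSON Polygon as (lat, lon) tuples."""
    shape = json.loads(text)
    outer_ring = shape["coordinates"][0]
    # GeoJSON stores [lon, lat]; swap so both sources share one ordering
    return [(lat, lon) for lon, lat in outer_ring]


center = parse_geopoint(geopoint_text)
vertices = polygon_vertices(geoshape_text)
print(center)
print(vertices[0])
```

Normalizing once, at parse time, avoids having to remember which index holds the latitude later on.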


## Step 2 - Find the distance between two points (i.e. the radius between the center of the polygon and each of its points)

In [26]:
# The distance is computed with Vincenty's formula for the distance between two coordinate points on Earth
# (https://pypi.org/project/vincenty/)
# The Geopoints and coordinates are given using the WGS84 geolocation standard


# Geopoint is stored as "latitude, longitude", while each Geoshape coordinate is stored as [longitude, latitude]
# Example: Geopoint -> 19.4559485754, -99.1339187632 -> latitude = 19.4559485754 & longitude = -99.1339187632
# vincenty() expects every point as (latitude, longitude)

all_distances = list()

for row in geoshapes:
    origin = row['geopoint'].split(",")
    lat_origin = float(origin[0])
    lon_origin = float(origin[1])

    # There are several sets of polygon points per row; unwrap one more level when the first set is nested
    coordinates = row['coordinates'][0]
    if len(coordinates) == 1:
        coordinates = coordinates[0]

    zona_distances = list()

    # Using Vincenty's distance formula for an ellipsoid to calculate the distance between two points on Earth (units are km)
    for coordinate in coordinates:
        lon_2 = coordinate[0]
        lat_2 = coordinate[1]
        zona_distances.append(vincenty([lat_origin, lon_origin], [lat_2, lon_2]))

    all_distances.append(zona_distances)

print(len(all_distances))

698
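
As a sanity check on the Vincenty distances, a haversine great-circle distance should agree closely at these sub-kilometre scales. The sketch below uses only the standard library and compares the center of row 0 with its first polygon point, for which the notebook reports 0.534167 km. The function name `haversine_km` and the mean Earth radius of 6371.0088 km are assumptions of this sketch; the spherical model is a cross-check, not a replacement for the ellipsoidal formula used above.

```python
import math

# Mean Earth radius in km; an assumption of the spherical model
EARTH_RADIUS_KM = 6371.0088


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Values taken from the Step 1 output: Geopoint and first Geoshape point of row 0
center = (19.4559485754, -99.1339187632)
first_vertex = (19.459475890504873, -99.13739031215657)

check_km = haversine_km(center, first_vertex)
print(round(check_km, 3))
```

A large disagreement between the two values would usually point to swapped latitude and longitude rather than to a difference between the formulas.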


## Step 3 - Calculate the average distance & the standard deviation for each row

In [27]:
final_distances = list()
for row in all_distances:
    mean = 0.0
    sDeviation = 0.0
    n = 0

    # Calculating the mean distance
    for distance in row:
        n += 1
        mean += float(distance)
    mean = mean / n

    # Calculating the spread of all distances around the mean
    # Note: no square root is taken here, so the stored value is the population variance (km^2);
    # its square root is the standard deviation in km
    for distance in row:
        sDeviation += (distance - mean) * (distance - mean)
    sDeviation = sDeviation / n

    final_distances.append({'mean': mean, 'standard_Deviation': sDeviation, 'all_distances': row})

print(len(final_distances))

698


In [28]:
print(final_distances[0])
print('---------')
print(final_distances[512])

{'mean': 0.5274741333333334, 'standard_Deviation': 0.014290675331715555, 'all_distances': [0.534167, 0.437955, 0.262072, 0.536854, 0.648912, 0.542366, 0.566509, 0.531167, 0.514737, 0.434136, 0.47821, 0.405127, 0.751992, 0.733741, 0.534167]}
---------
{'mean': 0.5668734565217389, 'standard_Deviation': 0.032621109029204634, 'all_distances': [0.792035, 0.751079, 0.694341, 0.675988, 0.664544, 0.620499, 0.594059, 0.57521, 0.557674, 0.537924, 0.513261, 0.495752, 0.479599, 0.460686, 0.443342, 0.432325, 0.40875, 0.391636, 0.377104, 0.360853, 0.343991, 0.331809, 0.319572, 0.313666, 0.309548, 0.309595, 0.316558, 0.335548, 0.344294, 0.354286, 0.361979, 0.368164, 0.374585, 0.377578, 0.379129, 0.377809, 0.377035, 0.376807, 0.378103, 0.380459, 0.385494, 0.392155, 0.404151, 0.434919, 0.460641, 0.475781, 0.489684, 0.499196, 0.518249, 0.543564, 0.564442, 0.584796, 0.614568, 0.631514, 0.651995, 0.687955, 0.707256, 0.714792, 0.638823, 0.578057, 0.573799, 0.653556, 0.666405, 0.751784, 0.645595, 0.566734, 
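
The figures printed for row 0 can be reproduced with the standard library's `statistics` module. The sketch below copies the 15 distances from that output and suggests that the `standard_Deviation` value (0.01429...) matches the population variance, whose square root (about 0.12 km) would be the standard deviation. The variable names are illustrative only.

```python
import math
import statistics

# Distances (km) of row 0, copied from the output above
distances = [0.534167, 0.437955, 0.262072, 0.536854, 0.648912,
             0.542366, 0.566509, 0.531167, 0.514737, 0.434136,
             0.47821, 0.405127, 0.751992, 0.733741, 0.534167]

mean = statistics.mean(distances)
# Population variance: mean of squared deviations (divides by n, as the loop above does)
variance = statistics.pvariance(distances)
# Population standard deviation: square root of the variance, back in km
std_dev = statistics.pstdev(distances)

print(mean, variance, std_dev)
```

Because the variance is in km² and the standard deviation in km, only the latter can be compared directly with the mean radius.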

## Step 4 - Add the new data to the original DataFrame and export to CSV

In [32]:
# Reference pattern for adding a column: df1['e'] = Series(np.random.randn(sLength), index=df1.index)

means = list()
sDeviations = list()
# We already have all the distances in all_distances

# We need to turn the dictionaries back into lists
for row in final_distances:
    means.append(row['mean'])
    sDeviations.append(row['standard_Deviation'])
    
zonas_patrullaje_df['mean'] = pd.Series(means, index=zonas_patrullaje_df.index)
zonas_patrullaje_df['standard_Deviation'] = pd.Series(sDeviations, index=zonas_patrullaje_df.index)
zonas_patrullaje_df['distances'] = pd.Series(all_distances, index=zonas_patrullaje_df.index)

zonas_patrullaje_df.head()

Unnamed: 0,Geopoint,Geoshape,Alcaldía,Sector 18,Área km2,x,y,mean,standard_Deviation,distances
0,"19.4559485754, -99.1339187632","{""type"": ""Polygon"", ""coordinates"": [[[-99.1373...",CUAUHTEMOC,TLATELOLCO,0.599031,-99.133437,19.455969,0.527474,0.014291,"[0.534167, 0.437955, 0.262072, 0.536854, 0.648..."
1,"19.4489311584, -99.1492549723","{""type"": ""Polygon"", ""coordinates"": [[[-99.1529...",CUAUHTEMOC,BUENAVISTA,0.542691,-99.149153,19.448988,0.453007,0.006291,"[0.526167, 0.320717, 0.468643, 0.542789, 0.444..."
2,"19.4466167038, -99.1372059309","{""type"": ""Polygon"", ""coordinates"": [[[-99.1388...",CUAUHTEMOC,BUENAVISTA,0.139906,-99.136628,19.446343,0.309541,0.007696,"[0.403358, 0.375808, 0.340076, 0.304342, 0.299..."
3,"19.4345863188, -99.1559474685","{""type"": ""Polygon"", ""coordinates"": [[[-99.1587...",CUAUHTEMOC,REVOLUCION,0.263673,-99.156147,19.434677,0.362777,0.025262,"[0.443341, 0.16611, 0.507736, 0.496656, 0.2699..."
4,"19.4287224255, -99.1566989128","{""type"": ""Polygon"", ""coordinates"": [[[-99.1544...",CUAUHTEMOC,REVOLUCION,0.294612,-99.156832,19.428315,0.37827,0.007552,"[0.469645, 0.426766, 0.407346, 0.198526, 0.457..."


In [33]:
#Exporting clean dataset version 1 of Zonas de Patrullaje
fileExport = path.join("..","Clean","after_transform","transformed_ZonasPatrullaje.csv")
zonas_patrullaje_df.to_csv(fileExport, sep=';', index=False)

In [34]:
#Checking the exported file
exportFile_df = pd.read_csv(fileExport, sep=';')

exportFile_df.head()

Unnamed: 0,Geopoint,Geoshape,Alcaldía,Sector 18,Área km2,x,y,mean,standard_Deviation,distances
0,"19.4559485754, -99.1339187632","{""type"": ""Polygon"", ""coordinates"": [[[-99.1373...",CUAUHTEMOC,TLATELOLCO,0.599031,-99.133437,19.455969,0.527474,0.014291,"[0.534167, 0.437955, 0.262072, 0.536854, 0.648..."
1,"19.4489311584, -99.1492549723","{""type"": ""Polygon"", ""coordinates"": [[[-99.1529...",CUAUHTEMOC,BUENAVISTA,0.542691,-99.149153,19.448988,0.453007,0.006291,"[0.526167, 0.320717, 0.468643, 0.542789, 0.444..."
2,"19.4466167038, -99.1372059309","{""type"": ""Polygon"", ""coordinates"": [[[-99.1388...",CUAUHTEMOC,BUENAVISTA,0.139906,-99.136628,19.446343,0.309541,0.007696,"[0.403358, 0.375808, 0.340076, 0.304342, 0.299..."
3,"19.4345863188, -99.1559474685","{""type"": ""Polygon"", ""coordinates"": [[[-99.1587...",CUAUHTEMOC,REVOLUCION,0.263673,-99.156147,19.434677,0.362777,0.025262,"[0.443341, 0.16611, 0.507736, 0.496656, 0.2699..."
4,"19.4287224255, -99.1566989128","{""type"": ""Polygon"", ""coordinates"": [[[-99.1544...",CUAUHTEMOC,REVOLUCION,0.294612,-99.156832,19.428315,0.37827,0.007552,"[0.469645, 0.426766, 0.407346, 0.198526, 0.457..."
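
The `distances` column looks identical after reloading, but a CSV file stores each list as text, so the reloaded values are strings rather than lists. The sketch below shows the round trip with the standard library's `csv` module and a made-up two-value row; with pandas, passing `converters={'distances': literal_eval}` to `pd.read_csv` should have the same effect, although that is not part of the original notebook.

```python
import csv
import io
from ast import literal_eval

# Write a made-up row whose second field is a Python list
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=";")
writer.writerow(["mean", "distances"])
writer.writerow([0.5, [0.6, 0.4]])

# Read it back: every field comes back as a string
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter=";")
row = next(reader)

raw = row["distances"]
parsed = literal_eval(raw)  # convert the text back into a list of floats

print(type(raw).__name__, raw)
print(type(parsed).__name__, parsed)
```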
