# EDA and Data Cleaning


## Data Dictionary

[Data Dictionary Link](https://developers.google.com/places/web-service/search#PlaceSearchResults)

## Import libraries

In [52]:
import numpy as np
import pandas as pd
import ast 
import re

from sklearn.feature_extraction.text import CountVectorizer

# Display Preference
pd.set_option('display.max_columns', None)

## Read in data

In [2]:
df = pd.read_csv('../data/raw_google_data_nyc.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,formatted_address,geometry,icon,id,name,opening_hours,photos,place_id,plus_code,price_level,rating,reference,types,user_ratings_total,searched_keyword,searched_zipcode
0,0,"138 W 34th St, New York, NY 10001, United States","{'location': {'lat': 40.750269, 'lng': -73.989...",https://maps.gstatic.com/mapfiles/place_api/ic...,cab18c9c7ea5cf2330fdf146a7cbfe9a3ab03d6d,Sprint Store,{'open_now': False},"[{'height': 2592, 'html_attributions': ['<a hr...",ChIJCV0vTalZwokR61PIDIe0gI0,"{'compound_code': 'Q226+44 New York', 'global_...",2.0,3.4,ChIJCV0vTalZwokR61PIDIe0gI0,"['point_of_interest', 'store', 'establishment']",273,stores,10001
1,1,"460 8th Ave, New York, NY 10001, United States","{'location': {'lat': 40.751744, 'lng': -73.993...",https://maps.gstatic.com/mapfiles/place_api/ic...,e386d17b32833d39a246d0f8ed3df43ed5f27252,Duane Reade,{'open_now': True},"[{'height': 4896, 'html_attributions': ['<a hr...",ChIJJwQ3561ZwokRuYknT0uxER8,"{'compound_code': 'Q224+MJ New York', 'global_...",2.0,3.9,ChIJJwQ3561ZwokRuYknT0uxER8,"['convenience_store', 'food', 'point_of_intere...",50,stores,10001
2,2,"151 W 34th St, New York, NY 10001, United States","{'location': {'lat': 40.7508025, 'lng': -73.98...",https://maps.gstatic.com/mapfiles/place_api/ic...,e04114820206890ff0155d2f7a6f7efc0903fb9b,Macy's,{'open_now': True},"[{'height': 2610, 'html_attributions': ['<a hr...",ChIJ3xjWra5ZwokRrwJ0KZ4yKNs,"{'compound_code': 'Q226+86 New York', 'global_...",2.0,4.4,ChIJ3xjWra5ZwokRrwJ0KZ4yKNs,"['department_store', 'shoe_store', 'jewelry_st...",50800,stores,10001
3,3,"5 Pennsylvania Plaza, New York, NY 10001, Unit...","{'location': {'lat': 40.7519205, 'lng': -73.99...",https://maps.gstatic.com/mapfiles/place_api/ic...,76f93a3e6a81e29c91d14ceb783366beedf1c63e,CVS,{'open_now': True},"[{'height': 1836, 'html_attributions': ['<a hr...",ChIJhbi9_npZwokR0rqh_uP6s1U,"{'compound_code': 'Q224+Q9 New York', 'global_...",,3.8,ChIJhbi9_npZwokR0rqh_uP6s1U,"['drugstore', 'convenience_store', 'food', 'he...",45,stores,10001
4,4,"420 9th Ave, New York, NY 10001, United States","{'location': {'lat': 40.7529454, 'lng': -73.99...",https://maps.gstatic.com/mapfiles/place_api/ic...,3b2cbe32c41a5633864a49f9730d1c1388cbc37a,B&H Photo Video - Electronics and Camera Store,{'open_now': True},"[{'height': 2988, 'html_attributions': ['<a hr...",ChIJI93dPbJZwokRIoOEoivEDQs,"{'compound_code': 'Q233+5F New York', 'global_...",,4.6,ChIJI93dPbJZwokRIoOEoivEDQs,"['electronics_store', 'home_goods_store', 'poi...",22644,stores,10001


In [4]:
# Check the shape of the data
df.shape

(9889, 17)

In [20]:
# Check data types
df.dtypes

city                   object
state                  object
zipcode                object
id                     object
name                   object
open_now               object
place_id               object
plus_code              object
price_level           float64
rating                float64
reference              object
types                  object
user_ratings_total      int64
searched_keyword       object
searched_zipcode        int64
location_lat          float64
location_lng          float64
dtype: object

In [21]:
# Check nulls
df.isnull().sum()

city                     0
state                    9
zipcode                  0
id                       0
name                     0
open_now               483
place_id                 0
plus_code                4
price_level           3571
rating                   0
reference                0
types                    0
user_ratings_total       0
searched_keyword         0
searched_zipcode         0
location_lat             0
location_lng             0
dtype: int64

## Initial Data Cleaning

### Drop columns: 'Unnamed: 0', 'icon', and 'photos'

In [5]:
df.drop(columns=['Unnamed: 0', 'icon', 'photos'], inplace=True)

In [6]:
# Parse the stringified dict in 'opening_hours' and keep only its 'open_now' flag (bool);
# missing values (NaN) are left as-is
df['opening_hours'] = df['opening_hours'].apply(
    lambda value: ast.literal_eval(value).get('open_now') if pd.notnull(value) else value
)
df.rename(columns={'opening_hours':'open_now'}, inplace=True )
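
The cell above relies on `ast.literal_eval` to turn a stringified dict back into a Python dict while passing missing values through untouched. A minimal standalone sketch of that pattern follows. The helper name `extract_key` is hypothetical, and the sample strings mirror the `opening_hours` values shown in `df.head()`.

```python
import ast
import math


def extract_key(value, key):
    """Parse a stringified dict and return its entry for `key`; pass NaN/None through."""
    # Missing cells read by pandas arrive as float NaN, not as strings
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    # literal_eval only evaluates Python literals, so it is safer than eval()
    return ast.literal_eval(value).get(key)


open_flag = extract_key("{'open_now': False}", 'open_now')
missing = extract_key(float('nan'), 'open_now')
# 'weekday_text' is a made-up key used only to show the missing-key case
absent_key = extract_key("{'open_now': True}", 'weekday_text')
```

Because `.get()` is used, a key that is absent from the dict yields `None` rather than raising a `KeyError`.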

### Get city, state, and zip code from 'formatted_address', and create new columns

In [7]:
# Regular expression reference: https://regex101.com/
ADDRESS_RE = re.compile(r'^(.*, +)?(?P<city>.*),( +(?P<state>[A-Z]{2}))? +(?P<zipcode>[0-9\-]*), +United States$')

In [8]:
# Define a function that parses an address with the ADDRESS_RE pattern above
def parse_address(string):
    match = ADDRESS_RE.match(string)
    
    # If the match fails, raise an error showing the string that failed to match
    if match is None:
        raise ValueError(string)
        
    # groupdict() returns every named group (city, state, zipcode).
    # An optional group that did not match, such as 'state', is already None,
    # so addresses without a state need no extra handling.
    return match.groupdict()
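
To see what the pattern captures, the sketch below runs it on the first address from `df.head()` and on a made-up address with no state (an assumption for illustration, not a row from the data). It redefines the pattern so the snippet runs on its own.

```python
import re

ADDRESS_RE = re.compile(
    r'^(.*, +)?(?P<city>.*),( +(?P<state>[A-Z]{2}))? +(?P<zipcode>[0-9\-]*), +United States$'
)

# First address shown in df.head()
with_state = ADDRESS_RE.match(
    '138 W 34th St, New York, NY 10001, United States'
).groupdict()

# Hypothetical address without a state abbreviation
without_state = ADDRESS_RE.match('Brooklyn, 11201, United States').groupdict()

print(with_state)
print(without_state)
```

The optional `state` group appears in both dictionaries; when it does not take part in the match, its value is `None`.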

In [9]:
# Apply parse_address to 'formatted_address' and prepend the parsed columns (city, state, zipcode)
parsed_address = pd.DataFrame(df['formatted_address'].apply(parse_address).tolist(), index=df.index)
df = pd.concat([parsed_address, df], axis=1)

In [10]:
# Drop the 'formatted_address' column
df.drop(columns='formatted_address', inplace=True)

### Get lat, lng from 'geometry', and create new columns

In [11]:
# Parse each 'geometry' string once, then pull out the coordinates
locations = df['geometry'].apply(lambda value: ast.literal_eval(value).get('location'))
df['location_lat'] = locations.apply(lambda location: location.get('lat'))
df['location_lng'] = locations.apply(lambda location: location.get('lng'))

In [12]:
df.drop(columns='geometry', inplace=True)
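
An alternative sketch, not the approach used above: `pd.json_normalize` can flatten the parsed dicts in one pass, producing dotted column names such as `location.lat`. The coordinates below are invented for illustration, and real 'geometry' values may contain additional keys.

```python
import ast

import pandas as pd

# Hypothetical rows: these coordinates are made up for the example
raw = pd.Series([
    "{'location': {'lat': 40.75, 'lng': -73.99}}",
    "{'location': {'lat': 40.70, 'lng': -74.01}}",
])

# Parse each string once, then flatten the nested dicts into columns
parsed = raw.apply(ast.literal_eval)
flat = pd.json_normalize(parsed.tolist())

# Keep only the coordinates and rename them to match the notebook's column names
coords = flat[['location.lat', 'location.lng']].rename(
    columns={'location.lat': 'location_lat', 'location.lng': 'location_lng'}
)
coords.index = raw.index
```

The result could then be joined to the main DataFrame with `pd.concat(..., axis=1)`, as was done for the parsed address columns.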

### Get 'compound_code', 'global_code' from 'plus_code', and create new columns

'plus_code' is an encoded location reference, derived from latitude and longitude coordinates, that represents an area of 1/8000th of a degree by 1/8000th of a degree (about 14 m x 14 m at the equator) or smaller. Plus codes can replace street addresses in places where addresses do not exist (where buildings are not numbered or streets are not named). [Reference](https://developers.google.com/places/web-service/search#PlaceSearchResults)
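
As a rough check of the "about 14 m" figure, the sketch below converts 1/8000th of a degree into metres. It assumes a spherical Earth with an equatorial circumference of about 40,075 km. The latitude of 40.75 is taken from the sample rows in `df.head()`.

```python
import math

# Approximate equatorial circumference in metres (assumption for this estimate)
EQUATOR_CIRCUMFERENCE_M = 40_075_000

metres_per_degree = EQUATOR_CIRCUMFERENCE_M / 360
cell_at_equator_m = metres_per_degree / 8000

# The east-west extent shrinks with the cosine of the latitude (spherical approximation)
nyc_latitude_deg = 40.75
cell_width_nyc_m = cell_at_equator_m * math.cos(math.radians(nyc_latitude_deg))

print(round(cell_at_equator_m, 1))
print(round(cell_width_nyc_m, 1))
```

Under these assumptions the cell is slightly under 14 m on a side at the equator and narrower in the east-west direction at New York's latitude.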

In [22]:
# Parse each 'plus_code' string once; missing values (NaN) are left as-is
plus_codes = df['plus_code'].apply(lambda value: ast.literal_eval(value) if pd.notnull(value) else value)
df['compound_code'] = plus_codes.apply(lambda code: code.get('compound_code') if isinstance(code, dict) else code)
df['global_code'] = plus_codes.apply(lambda code: code.get('global_code') if isinstance(code, dict) else code)

In [24]:
df.drop(columns='plus_code', inplace=True)

### Use CountVectorizer to process 'types'

In [47]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['types'])

In [48]:
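# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
vectorizerized_types = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())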

In [49]:
df = pd.concat([df, vectorizerized_types], axis=1)

In [55]:
df.drop(columns='types', inplace=True)
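
`CountVectorizer` works on the stringified lists in 'types' without a custom tokenizer because its default token pattern, `(?u)\b\w\w+\b`, treats underscores as word characters, so each type survives as a single token. The sketch below reproduces only that tokenization step with the standard library. It is an illustration, not scikit-learn's implementation, and the sample string is the 'types' value from the first row of `df.head()`.

```python
import re
from collections import Counter

# Default token pattern documented for CountVectorizer: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

types_string = "['point_of_interest', 'store', 'establishment']"

# CountVectorizer lowercases its input by default before tokenizing
tokens = TOKEN_PATTERN.findall(types_string.lower())
counts = Counter(tokens)

print(tokens)
print(counts)
```

One consequence of this pattern is that a single-character token would be dropped, which is not a concern for the type names shown in the sample rows.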

[Reference: are 'place_id' or 'id' unique? (Stack Overflow)](https://stackoverflow.com/questions/27198283/google-places-api-are-place-id-or-id-unique-to-any-city-in-the-world)