## CAPSTONE PROJECT "BATTLE OF THE NEIGHBORHOODS"

### Applied Data Science by IBM / Coursera

### Project: The Third Place

####    Assessing the Impact of Community Public Venues on Individual Life Satisfaction

## Table of Contents
#### I.    INTRODUCTION
#### II.   DATA
#### III.  METHODOLOGY
#### IV.   ANALYSIS 
#### V.    RESULTS
#### VI.   CONCLUSIONS

## I. INTRODUCTION

In [None]:
The concept of Third Places was introduced in 1991 by Ray Oldenburg, a noted urban sociologist.

Third places are informal public gathering places. They host the regular, voluntary, informal, and happily anticipated gatherings
of individuals beyond the realms of home (The First Place) and work (The Second Place). ... beer gardens, main streets, pubs, cafés, coffeehouses, post offices,
and other third places are the heart of a community's social vitality.

This analysis proposes to use Machine Learning classification to determine if Foursquare data on Public and Social venues 
for New York City communities can be correlated to more conventional measures of Life Quality or Satisfaction.

Insights developed by this type of analysis could provide additional depth to assessments of “well-being”, and identify opportunities 
for businesses, communities, and policy makers in efforts to enhance community quality of life.


### Call Required Python Libraries

In [1]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
import requests
import json
import csv
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import urllib.request
import folium
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim
from sklearn.metrics import classification_report, confusion_matrix
import itertools
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.svm import SVC

## II. DATA               

### 1. Construct Life Satisfaction Indicator Ranks for NYC Neighborhoods and Community Districts

Economic and other Life Satisfaction Indicators (LSI) will be compiled in this section from the U.S. Census Bureau American Community
Survey (ACS) data. This data is aggregated by Census Bureau Public Use Microdata Areas (PUMAs). Each PUMA corresponds to one or two NYC Community Districts, allowing for efficient aggregation of PUMA and Foursquare data.

#### Web scrape NYC Community Districts and component Neighborhoods from Wikipedia

In [105]:
# web scrape NYC Community Districts and component Neighborhoods
req = requests.get("https://en.wikipedia.org/wiki/Neighborhoods_in_New_York_City#Community_areas")
soup = BeautifulSoup(req.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df_nycd = pd.DataFrame(df[0])
df_nycd.head()

Unnamed: 0,Community Board(CB),Areakm2,Pop.Census2010,Pop./km2,Neighborhoods
0,Bronx CB 1,7.17,91497,12761,"Melrose, Mott Haven, Port Morris"
1,Bronx CB 2,5.54,52246,9792,"Hunts Point, Longwood"
2,Bronx CB 3,4.07,79762,19598,"Claremont, Concourse Village, Crotona Park, Mo..."
3,Bronx CB 4,5.28,146441,27735,"Concourse, Highbridge"
4,Bronx CB 5,3.55,128200,36145,"Fordham, Morris Heights, Mount Hope, Universit..."


#### Utilize U.S. Census Bureau API to extract relevant Indicator metrics from Public Use Microdata Area (PUMA) database

In [None]:
#NAME = S0101_C01_001E,S0902_C01_017E,S1501_C01_007E,S1501_C01_015E,S2506_C01_001E,S2701_C05_001E,S0802_C01_090E,S2801_C01_002E,S1602_C04_001E,S2001_C01_002E,S2301_C01_028E
# PUMA codes for the five boroughs, each listed once
area = '03701,03702,03703,03704,03705,03706,03707,03708,03709,03710,03801,03802,03803,03804,03805,03806,03807,03808,03809,03810,03901,03902,03903,04001,04002,04003,04004,04005,04006,04007,04008,04009,04010,04011,04012,04013,04014,04015,04016,04017,04018,04101,04102,04103,04104,04105,04106,04107,04108,04109,04110,04111,04112,04113,04114'

baseAPI = "https://api.census.gov/data/2018/acs/acs1/subject?get=NAME,S0101_C01_001E,S2301_C04_021E,S1501_C02_015E,S2503_C02_028E,S2503_C02_032E,S2503_C02_036E,S2501_C02_008E,S2701_C05_001E,S0802_C01_090E,S2801_C02_002E,S1602_C04_001E,S2001_C01_002E&for=public%20use%20microdata%20area:{}&in=state:36".format(area)
response = requests.get(baseAPI)
formattedResponse = json.loads(response.text)[1:]
NYSC = pd.DataFrame(columns=['PUMA', 'Population', 'Unemployment', 'Bachelor Degree or Above', 'Housing <$20M Yr, >30% Income,', 'Housing <$35M Yr, >30% Income', 'Housing <$50M Yr, >30% Income', 'Housing >1.5 Occupants per Room', '% Uninsured', 'Travel Time to Work', '% Hhlds >1 Computer', '%LtdEnglish', 'Median Earnings', 'State', 'GeoCode'], data=formattedResponse)
NYSC.to_csv(r'C:\MLP Temp\Python\Python_AI_ML\US_Census_PUMA_Data_NYC_2.csv', header=True)
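
The request above could also be assembled from a parameter dictionary, with the header row that the API returns supplying the column names. The following is a minimal sketch under that assumption; `build_census_params` and `rows_to_frame` are hypothetical helper names, and the sample payload holds placeholder values rather than Census figures.

```python
import pandas as pd

def build_census_params(variables, puma_codes, state="36"):
    """Return query parameters for an ACS subject-table request (hypothetical helper)."""
    return {
        "get": ",".join(["NAME"] + list(variables)),
        "for": "public use microdata area:" + ",".join(puma_codes),
        "in": "state:" + state,
    }

def rows_to_frame(payload):
    """Turn the list-of-lists JSON (header row first) into a DataFrame."""
    header, *rows = payload
    return pd.DataFrame(rows, columns=header)

params = build_census_params(["S0101_C01_001E"], ["03701", "03702"])

# Illustrative payload only; the values are placeholders, not Census figures.
sample_payload = [
    ["NAME", "S0101_C01_001E", "state", "public use microdata area"],
    ["Example PUMA A", "100", "36", "03701"],
    ["Example PUMA B", "200", "36", "03702"],
]
sample_frame = rows_to_frame(sample_payload)
```

In a live call the dictionary would be passed as `requests.get(url, params=params)`, and calling `response.raise_for_status()` before parsing would surface HTTP errors early.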

#### Clean and process the PUMA data extract and add to a dataframe

In [107]:
# Read the cleaned PUMA extract (an Excel workbook)
csv_path=r"C:\Users\mlporter\atom\PY4E\US_Census_PUMA_Data_NYC_3.xlsx"
df2=pd.read_excel(csv_path)
# Add the three income-band shares into a single housing-cost-burden column
housing_cols = ["Housing<$20M Yr >30% Income", "Housing<$35M Yr >30% Income", "Housing<$50M Yr >30% Income"]
sum_column = df2[housing_cols[0]] + df2[housing_cols[1]] + df2[housing_cols[2]]
df2["Housing Cost > 30% Income"] = sum_column
df2.drop(columns=housing_cols, inplace=True)
column_names = ['PUMA', 'Population', 'Unemployment', 'Bachelor Degree or Above', 'Housing Cost > 30% Income', '% Uninsured', 'Travel Time to Work', '% Hhlds >1 Computer', '% Ltd English', 'Median Earnings', 'State', 'GeoCode']
df2=df2.reindex(columns=column_names)
df2.head()

Unnamed: 0,PUMA,Population,Unemployment,Bachelor Degree or Above,Housing Cost > 30% Income,% Uninsured,Travel Time to Work,% Hhlds >1 Computer,% Ltd English,Median Earnings,State,GeoCode
0,NYC-Bronx Community District 1 & 2--Hunts Poin...,164003,13.0,12.5,54.3,11.5,42.6,86.1,24.3,23316,36,3710
1,"NYC-Bronx Community District 10--Co-op City, P...",119071,7.1,27.6,27.5,5.5,46.0,86.4,8.8,43360,36,3703
2,NYC-Bronx Community District 11--Pelham Parkwa...,124931,6.3,27.0,35.8,7.7,45.9,88.7,13.2,36323,36,3704
3,"NYC-Bronx Community District 12--Wakefield, Wi...",135799,9.5,26.2,39.7,7.2,48.6,91.5,8.4,32674,36,3702
4,"NYC-Bronx Community District 3 & 6--Belmont, C...",175456,12.5,14.5,54.1,8.6,42.2,89.5,22.5,23933,36,3705


#### Sort the PUMA dataframe

In [108]:
# Keep only the text before the trailing "PUMA, New York" / "PUMA; New York" suffix
df2["PUMA"] = df2["PUMA"].str.split("PUMA, New York", n=1).str[0]
df2["PUMA"] = df2["PUMA"].str.split("PUMA; New York", n=1).str[0]
df2[["PUMA2","PUMA"]] = df2["PUMA"].str.split("-", n=1, expand=True)
df2[["PUMA","Neighborhoods"]] = df2["PUMA"].str.split("--", n=1, expand=True)
df2.drop(["PUMA2"], axis=1, inplace=True)
df2.drop(["State"], axis=1, inplace=True)
col_name="Neighborhoods"
second_col = df2.pop(col_name)
df2.insert(1,col_name,second_col)
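
The chain of splits above can also be expressed as a single regular expression. This is a sketch only: the full label format (`NYC-<district>--<neighborhoods> PUMA, New York`) is an assumption inferred from the truncated previews and the split strings used in this notebook.

```python
import pandas as pd

# Assumed label format, reconstructed from the previews in this notebook.
labels = pd.Series([
    "NYC-Bronx Community District 1 & 2--Hunts Point, Longwood & Melrose PUMA, New York",
    "NYC-Bronx Community District 10--Co-op City, Pelham Bay & Schuylerville PUMA, New York",
])

# Prefix up to the first hyphen, district name up to "--", then the neighborhood list.
pattern = r"^[^-]+-(?P<PUMA>.+?)--(?P<Neighborhoods>.+?)\s*PUMA[,;] New York$"
parts = labels.str.extract(pattern)
```

Named groups give the two output columns their final names directly, so no temporary column has to be dropped afterwards.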

In [109]:
df2.sort_values(by=['PUMA'], inplace=True)
df2 = df2.reset_index(drop=True)
df2.head()

Unnamed: 0,PUMA,Neighborhoods,Population,Unemployment,Bachelor Degree or Above,Housing Cost > 30% Income,% Uninsured,Travel Time to Work,% Hhlds >1 Computer,% Ltd English,Median Earnings,GeoCode
0,Bronx Community District 1 & 2,"Hunts Point, Longwood & Melrose",164003,13.0,12.5,54.3,11.5,42.6,86.1,24.3,23316,3710
1,Bronx Community District 10,"Co-op City, Pelham Bay & Schuylerville",119071,7.1,27.6,27.5,5.5,46.0,86.4,8.8,43360,3703
2,Bronx Community District 11,"Pelham Parkway, Morris Park & Laconia",124931,6.3,27.0,35.8,7.7,45.9,88.7,13.2,36323,3704
3,Bronx Community District 12,"Wakefield, Williamsbridge & Woodlawn",135799,9.5,26.2,39.7,7.2,48.6,91.5,8.4,32674,3702
4,Bronx Community District 3 & 6,"Belmont, Crotona Park East & East Tremont",175456,12.5,14.5,54.1,8.6,42.2,89.5,22.5,23933,3705


#### Extract the additional data from the NYC Community District Planning Commission and insert to DataFrame

In [110]:
data = pd.read_excel(r'C:\Users\mlporter\atom\PY4E\NewYorkCity PUMS Community District Indicators from NYC Planning.xlsx')
df_a = pd.DataFrame(data, columns = ['cd_full_title','area_sqmi','count_hosp_clinic','pct_served_parks','pct_clean_strts','crime_count'])
df_a.head()

Unnamed: 0,cd_full_title,area_sqmi,count_hosp_clinic,pct_served_parks,pct_clean_strts,crime_count
0,Bronx Community District 1,2.2,37,99,90.4,2373
1,Bronx Community District 10,6.4,11,51,97.7,1073
2,Bronx Community District 11,3.6,35,86,95.4,1228
3,Bronx Community District 12,5.6,15,67,93.8,2434
4,Bronx Community District 2,2.2,18,97,92.3,946


#### Combine the 8 Community Districts that share 4 PUMAs

In [135]:
# One combined row for each PUMA that spans two Community Districts
combined_rows = pd.DataFrame([
    {'cd_full_title':'Bronx Community District 1 & 2', 'area_sqmi':4.4, 'count_hosp_clinic':55, 'pct_served_parks':98, 'pct_clean_strts':91.4, 'crime_count':3319},
    {'cd_full_title':'Bronx Community District 3 & 6', 'area_sqmi':3.1, 'count_hosp_clinic':54, 'pct_served_parks':99, 'pct_clean_strts':93.2, 'crime_count':2890},
    {'cd_full_title':'Manhattan Community District 1 & 2', 'area_sqmi':2.9, 'count_hosp_clinic':21, 'pct_served_parks':100, 'pct_clean_strts':95.9, 'crime_count':3452},
    {'cd_full_title':'Manhattan Community District 4 & 5', 'area_sqmi':3.4, 'count_hosp_clinic':42, 'pct_served_parks':95, 'pct_clean_strts':95.0, 'crime_count':7062},
])
df_e = pd.concat([df_a, combined_rows], ignore_index=True)
# Drop the eight single-district rows that the combined rows replace
df_f = df_e.drop(index=[0, 4, 5, 8, 30, 34, 36, 37])
df_f = df_f.sort_values(by=['cd_full_title']).reset_index(drop=True)
df_f = df_f.rename(columns = {'cd_full_title':'PUMA'})
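
The combined values entered by hand above are consistent with summing the totals (area, clinic count, crime count) and averaging the percentages. The sketch below reproduces the Bronx 1 & 2 row from the two rows shown in the `df_a` preview; `combine_districts` is a hypothetical helper, and the unweighted mean of the percentage columns is an assumption about how those figures were derived.

```python
import pandas as pd

# Two rows copied from the df_a preview (Bronx Community Districts 1 and 2).
districts = pd.DataFrame({
    "cd_full_title": ["Bronx Community District 1", "Bronx Community District 2"],
    "area_sqmi": [2.2, 2.2],
    "count_hosp_clinic": [37, 18],
    "pct_served_parks": [99, 97],
    "pct_clean_strts": [90.4, 92.3],
    "crime_count": [2373, 946],
})

def combine_districts(frame, title):
    """Collapse several district rows into one: sum totals, average percentages."""
    summed = frame[["area_sqmi", "count_hosp_clinic", "crime_count"]].sum()
    averaged = frame[["pct_served_parks", "pct_clean_strts"]].mean()
    return {"cd_full_title": title, **summed.to_dict(), **averaged.to_dict()}

combined = combine_districts(districts, "Bronx Community District 1 & 2")
```

A population-weighted average would arguably be more accurate for the percentage columns, but it would need district-level population figures that are not part of this table.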

#### Arrange Columns and Rename to match Census PUMA data

In [None]:
Combine the Census DataFrame and the NYC Community District DataFrame

In [119]:
df_SI = pd.merge(df2, df_f, how="outer", on=["PUMA"])
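
Because an outer join keeps rows whose key appears in only one table, it can help to check which `PUMA` names failed to match. A minimal sketch using the `indicator` option of `pd.merge`, with toy frames and placeholder district names standing in for `df2` and `df_f`:

```python
import pandas as pd

# Toy stand-ins for the Census and Planning tables; names are placeholders.
census = pd.DataFrame({"PUMA": ["District A", "District B"], "Population": [100, 200]})
planning = pd.DataFrame({"PUMA": ["District A", "District C"], "crime_count": [10, 30]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(census, planning, how="outer", on="PUMA", indicator=True)
unmatched = merged.loc[merged["_merge"] != "both", "PUMA"].tolist()
```

An empty `unmatched` list would indicate that every district name lines up between the two sources.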

In [7]:
csv_path=r'C:\MLP Temp\Python\Python_AI_ML\US_Census_and_NYC_Comm_District_New_York_PUMA_Data.csv'
df_SI2=pd.read_csv(csv_path, index_col=0)

In [8]:
df_SI2['Pop Density sqmi'] = df_SI2['Population'] / df_SI2['area_sqmi']

col_name = 'Pop Density sqmi'
first_col = df_SI2.pop(col_name)
df_SI2.insert(3, col_name, first_col)

In [9]:
col_name2 = 'Median Earnings'
sec_col = df_SI2.pop(col_name2)
df_SI2.insert(4, col_name2, sec_col)

#### Inspect prepared Dataframe of Life Satisfaction Indicators by PUMA - Community District, and component Neighborhoods

In [10]:
df_SI2.head()

Unnamed: 0,PUMA,Neighborhoods,Population,Pop Density sqmi,Median Earnings,Unemployment,Bachelor Degree or Above,Housing Cost > 30% Income,% Uninsured,Travel Time to Work,% Hhlds >1 Computer,% Ltd English,GeoCode,area_sqmi,count_hosp_clinic,pct_served_parks,pct_clean_strts,crime_count
0,Bronx Community District 1 & 2,"Hunts Point, Longwood & Melrose",164003,37273.409091,23316,13.0,12.5,54.3,11.5,42.6,86.1,24.3,3710,4.4,55,98,91.4,3319
1,Bronx Community District 10,"Co-op City, Pelham Bay & Schuylerville",119071,18604.84375,43360,7.1,27.6,27.5,5.5,46.0,86.4,8.8,3703,6.4,11,51,97.7,1073
2,Bronx Community District 11,"Pelham Parkway, Morris Park & Laconia",124931,34703.055556,36323,6.3,27.0,35.8,7.7,45.9,88.7,13.2,3704,3.6,35,86,95.4,1228
3,Bronx Community District 12,"Wakefield, Williamsbridge & Woodlawn",135799,24249.821429,32674,9.5,26.2,39.7,7.2,48.6,91.5,8.4,3702,5.6,15,67,93.8,2434
4,Bronx Community District 3 & 6,"Belmont, Crotona Park East & East Tremont",175456,56598.709677,23933,12.5,14.5,54.1,8.6,42.2,89.5,22.5,3705,3.1,54,99,93.2,2890


### Use Cluster Analysis to Group the PUMA - Community Districts by average of LSI Indicator Metrics

#### Prepare Cluster Data Set for Normalization

In [11]:
temp_df_SI2 = df_SI2.drop(['PUMA','Neighborhoods','Population','GeoCode','area_sqmi',], axis=1)
temp_df_SI2.head()

Unnamed: 0,Pop Density sqmi,Median Earnings,Unemployment,Bachelor Degree or Above,Housing Cost > 30% Income,% Uninsured,Travel Time to Work,% Hhlds >1 Computer,% Ltd English,count_hosp_clinic,pct_served_parks,pct_clean_strts,crime_count
0,37273.409091,23316,13.0,12.5,54.3,11.5,42.6,86.1,24.3,55,98,91.4,3319
1,18604.84375,43360,7.1,27.6,27.5,5.5,46.0,86.4,8.8,11,51,97.7,1073
2,34703.055556,36323,6.3,27.0,35.8,7.7,45.9,88.7,13.2,35,86,95.4,1228
3,24249.821429,32674,9.5,26.2,39.7,7.2,48.6,91.5,8.4,15,67,93.8,2434
4,56598.709677,23933,12.5,14.5,54.1,8.6,42.2,89.5,22.5,54,99,93.2,2890


#### Data Processing - Normalization using SKLearn StandardScaler

In [12]:
from sklearn.preprocessing import StandardScaler

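scaled_temp_df_SI2 = temp_df_SI2
# Use every column except the first (Pop Density sqmi) as clustering features
X = scaled_temp_df_SI2.values[:,1:]
# Replace any missing values with zero before scaling
X = np.nan_to_num(X)
# Standardize each feature to zero mean and unit variance
cluster_dataset = StandardScaler().fit_transform(X)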
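
As a quick illustration of what the scaling step does, the sketch below standardizes two columns (Median Earnings and Unemployment) taken from the first three rows of the preview above and confirms that each column ends up with mean 0 and standard deviation 1. The three-row sample is only for demonstration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Median Earnings and Unemployment for the first three preview rows.
sample = np.array([[23316, 13.0], [43360, 7.1], [36323, 6.3]])
scaled = StandardScaler().fit_transform(sample)

# After scaling, every column has (approximately) zero mean and unit variance.
column_means = scaled.mean(axis=0)
column_stds = scaled.std(axis=0)
```

Without this step, a feature measured in dollars would dominate the Euclidean distances that KMeans relies on.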

#### Model Community LSI Ranked Group Clusters with KMeans

In [21]:
num_clusters = 5

k_means = KMeans(init='k-means++', n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_

print(labels)

[3 2 2 2 3 3 3 3 2 2 4 2 0 4 0 2 2 4 2 2 1 4 4 4 1 0 4 4 1 4 4 4 4 1 1 1 1
 4 2 2 2 2 2 2 2 0 0 2 2 0 2 2 2 2 2]


#### Determine # of PUMAs - Community Districts in each Ranked Group

In [22]:
# Note that Counts will change as the cluster algorithm is re-run
counts = np.bincount(labels[labels>=0])
print(counts)

[ 6  7 24  5 13]
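
Since the cluster labels and counts change between runs, fixing `random_state` is one way to make the grouping repeatable. The sketch below is not the configuration used above; it runs on synthetic data shaped like `cluster_dataset` (55 rows, 12 features), and the seed value is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for cluster_dataset: 55 rows, 12 standardized features.
demo_data = rng.normal(size=(55, 12))

def fit_labels(data, seed):
    """Fit KMeans with a fixed seed and return the cluster label of each row."""
    model = KMeans(init="k-means++", n_clusters=5, n_init=12, random_state=seed)
    return model.fit(data).labels_

first_run = fit_labels(demo_data, seed=42)
second_run = fit_labels(demo_data, seed=42)
counts_demo = np.bincount(first_run, minlength=5)
```

With the same seed, both runs assign identical labels, so any rank descriptions written against those labels stay valid on re-execution.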


In [23]:
df_SI2['Rank'] = labels

#### Calculate the Average of each Life Satisfaction Indicators for each LSI Ranked Group

In [24]:
df_SI2.groupby('Rank').mean(numeric_only=True)
               

Unnamed: 0_level_0,Population,Pop Density sqmi,Median Earnings,Unemployment,Bachelor Degree or Above,Housing Cost > 30% Income,% Uninsured,Travel Time to Work,% Hhlds >1 Computer,% Ltd English,GeoCode,area_sqmi,count_hosp_clinic,pct_served_parks,pct_clean_strts,crime_count
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0,167330.333333,44205.847825,31927.833333,4.966667,30.2,37.05,13.166667,44.983333,88.116667,37.1,4059.833333,4.633333,12.333333,78.833333,94.666667,1346.833333
1,159842.571429,70874.928621,83347.714286,3.971429,76.614286,17.4,3.5,31.128571,95.3,4.771429,3863.571429,2.5,29.0,96.285714,96.042857,2725.428571
2,151506.708333,26460.193764,41434.625,5.279167,34.8,28.920833,6.716667,46.516667,91.245833,12.275,3978.375,7.516667,13.416667,70.375,96.083333,1347.833333
3,149922.2,66482.264355,23751.4,11.62,14.02,53.64,9.08,43.94,88.38,25.34,3707.2,2.56,38.2,99.0,92.44,2382.2
4,145412.769231,65564.938494,39162.076923,4.992308,36.169231,36.546154,6.469231,39.476923,88.030769,11.553846,3928.538462,2.592308,23.692308,93.846154,89.953846,1873.769231


#### Summary LSI Groups 
##### Rank 0: Mod-Low LSI, 6 Districts
##### Rank 1: High LSI, 7 Districts
##### Rank 2: Mod-High LSI, 24 Districts
##### Rank 3: Low LSI, 5 Districts
##### Rank 4: Moderate LSI, 13 Districts

### This completes construction of the Life Satisfaction Indicator rankings for PUMAs - CDs

### 2. Compile Foursquare Trending Public Venue data for New York City Neighborhoods

In [None]:
In this section, Foursquare trending Venue data will be extracted and grouped by NYC neighborhood, and then
organized by PUMA and NYC Community District.

#### Determine geo-locators for NYC and Boroughs

In [4]:
with open('C:/Users/mlporter/atom/PY4E/nyu_2451_34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)

In [5]:
newyork_data_mine = newyork_data['features']
newyork_data_mine[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

In [6]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [9]:
# Collect one record per neighborhood, then build the dataframe in a single step
rows = []
for data in newyork_data_mine:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)


In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]))

The dataframe has 5 boroughs and 918 neighborhoods.


In [11]:
# Find the geographic coordinates of NYC
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [12]:
# Create dataframe of Queens neighborhoods
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
queens_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Astoria,40.768509,-73.915654
1,Queens,Woodside,40.746349,-73.901842
2,Queens,Jackson Heights,40.751981,-73.882821
3,Queens,Elmhurst,40.744049,-73.881656
4,Queens,Howard Beach,40.654225,-73.838138


### Extract Foursquare New York Neighborhood Trending Venue Data for All 5 Boroughs

#### Using Foursquare API import Venue Data for NYC Neighborhoods, beginning with Queens Borough

In [None]:
# Create code to import data for all Queens neighborhoods

def getNearbyVenues(names, latitudes, longitudes, radius=1500, limit=250):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL; CLIENT_ID, CLIENT_SECRET and VERSION
        # must already be defined with valid Foursquare credentials
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
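
The function above indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back without a category, and it assumes every response contains a `groups` entry. A more defensive parsing step might look like the sketch below; `parse_explore_items` is a hypothetical helper, and the payload is hand-written in the shape the function above expects, not real API output.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore response into rows, skipping venues without a category."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # no category to record for this venue
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     categories[0]["name"]))
    return rows

# Hand-written payload in the shape the function above indexes into.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 40.0, "lng": -73.0},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 40.1, "lng": -73.1},
               "categories": []}},
]}]}}

rows = parse_explore_items(sample_payload, "Astoria", 40.768509, -73.915654)
empty = parse_explore_items({"response": {}}, "Astoria", 40.768509, -73.915654)
```

Separating parsing from the HTTP request also makes this step testable without API credentials.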

In [None]:
# Run above function for all Queens neighborhoods

queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude']
                                  )

In [None]:
# Determine size of results dataframe for Queens venues
print(queens_venues.shape)

In [None]:
# save Foursquare venue extract for all Queens neighborhoods to local file to limit API calls
queens_venues.to_csv(r'C:\Users\mlporter\atom\PY4E\Capstone_Index\Queens_nearby_venues_all_neighborhoods.txt')

#### Repeat the API call above to extract Foursquare Venue data for the remaining NYC Boroughs and save each result to a local File
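
The per-borough repetition can be written as a loop. The sketch below is one possible arrangement, not the code that produced the saved files: `save_borough_venues` and `fake_fetch` are hypothetical names, the stand-in fetch function replaces `getNearbyVenues` so the example runs without credentials, and the output goes to a temporary directory.

```python
import os
import tempfile
import pandas as pd

def save_borough_venues(neighborhoods, boroughs, fetch, out_dir):
    """Call fetch() once per borough and write each result to its own file."""
    paths = {}
    for borough in boroughs:
        subset = neighborhoods[neighborhoods["Borough"] == borough]
        venues = fetch(subset["Neighborhood"], subset["Latitude"], subset["Longitude"])
        file_name = borough.replace(" ", "_") + "_nearby_venues_all_neighborhoods.txt"
        path = os.path.join(out_dir, file_name)
        venues.to_csv(path)
        paths[borough] = path
    return paths

def fake_fetch(names, latitudes, longitudes):
    """Stand-in for getNearbyVenues so the sketch runs without API credentials."""
    return pd.DataFrame({"Neighborhood": list(names), "Venue": ["placeholder"] * len(names)})

# Two neighborhoods with coordinates taken from the previews in this notebook.
demo_neighborhoods = pd.DataFrame({
    "Borough": ["Bronx", "Staten Island"],
    "Neighborhood": ["Wakefield", "St. George"],
    "Latitude": [40.894705, 40.644982],
    "Longitude": [-73.847201, -74.079353],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    written = save_borough_venues(demo_neighborhoods, ["Bronx", "Staten Island"], fake_fetch, tmp_dir)
    reloaded = pd.read_csv(written["Bronx"], index_col=0)
```

Passing the fetch function as an argument keeps the loop independent of the API, so the real `getNearbyVenues` can be substituted once credentials are set.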

### Consolidate Foursquare Venue data for NYC

#### Import Foursquare Trending Venue Data for all NYC Boroughs by Neighborhood from saved Local Files

#### Brooklyn

In [33]:
# read Foursquare venue data for all Brooklyn neighborhoods from saved local file
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\Brooklyn_nearby_venues_all_neighborhoods.txt"
brooklyn_venues = pd.read_csv(csv_path, index_col=0).reset_index(drop=True)
brooklyn_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bay Ridge,40.625801,-74.030621,Bagel Boy,40.627896,-74.029335,Bagel Shop
1,Bay Ridge,40.625801,-74.030621,Pilo Arts Day Spa and Salon,40.624748,-74.030591,Spa
2,Bay Ridge,40.625801,-74.030621,Pegasus Cafe,40.623168,-74.031186,Breakfast Spot
3,Bay Ridge,40.625801,-74.030621,Ho' Brah Taco Joint,40.62296,-74.031371,Taco Place
4,Bay Ridge,40.625801,-74.030621,Karam,40.622931,-74.028316,Middle Eastern Restaurant


In [34]:
brooklyn_venues.shape

(6550, 7)

#### Queens

In [35]:
# read Foursquare venue data for all Queens neighborhoods from saved local file
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\Queens_nearby_venues_all_neighborhoods.txt"
queens_venues = pd.read_csv(csv_path, index_col=0).reset_index(drop=True)
queens_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Titan Foods Inc.,40.769198,-73.919253,Gourmet Shop
1,Astoria,40.768509,-73.915654,CrossFit Queens,40.769404,-73.918977,Gym
2,Astoria,40.768509,-73.915654,Favela Grill,40.767348,-73.917897,Brazilian Restaurant
3,Astoria,40.768509,-73.915654,Al-sham Sweets and Pastries,40.768077,-73.911561,Middle Eastern Restaurant
4,Astoria,40.768509,-73.915654,Sitan Muay Thai,40.766108,-73.913224,Martial Arts Dojo


In [36]:
queens_venues.shape

(6466, 7)

#### Bronx

In [37]:
# read Foursquare venue data for all Bronx neighborhoods from saved local file
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\Bronx_nearby_venues_all_neighborhoods.txt"
bronx_venues = pd.read_csv(csv_path, index_col=0).reset_index(drop=True)
bronx_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Ripe Kitchen & Bar,40.898152,-73.838875,Caribbean Restaurant
2,Wakefield,40.894705,-73.847201,Ali's Roti Shop,40.894036,-73.856935,Caribbean Restaurant
3,Wakefield,40.894705,-73.847201,Jimbo's,40.89174,-73.858226,Burger Joint
4,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop


In [38]:
bronx_venues.shape

(4519, 7)

#### Manhattan

In [39]:
# read Foursquare venue data for all Manhattan neighborhoods from saved local file
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\Manhattan_nearby_venues_all_neighborhoods.txt"
manhattan_venues = pd.read_csv(csv_path, index_col=0).reset_index(drop=True)
manhattan_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
1,Marble Hill,40.876551,-73.91066,Sam's Pizza,40.879435,-73.905859,Pizza Place
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
4,Marble Hill,40.876551,-73.91066,El Malecon,40.879338,-73.904457,Caribbean Restaurant


In [40]:
manhattan_venues.shape

(4000, 7)

#### Staten Island

In [41]:
# read Foursquare venue data for all Staten Island neighborhoods from saved local file
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\Staten_island_nearby_venues_all_neighborhoods.txt"
staten_island_venues = pd.read_csv(csv_path, index_col=0).reset_index(drop=True)
staten_island_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,St. George,40.644982,-74.079353,Beso,40.643306,-74.076508,Tapas Restaurant
1,St. George,40.644982,-74.079353,Staten Island September 11 Memorial,40.646767,-74.07651,Monument / Landmark
2,St. George,40.644982,-74.079353,A&S Pizzeria,40.64394,-74.077626,Pizza Place
3,St. George,40.644982,-74.079353,Shake Shack,40.64366,-74.075891,Burger Joint
4,St. George,40.644982,-74.079353,Enoteca Maria,40.641941,-74.07732,Italian Restaurant


In [42]:
staten_island_venues.shape

(3618, 7)

### Join / Merge Queens, Brooklyn, Bronx, Manhattan, and Staten Island Borough Venue Dataframes

##### Venue count: Queens 6,466; Brooklyn 6,550; Bronx 4,519; Manhattan 4,000; Staten Island 3,618; Total All Neighborhoods 25,153

In [43]:
df1 = brooklyn_venues.merge(queens_venues, 'outer')


In [44]:
df1.shape

(13016, 7)

In [45]:
df2 = df1.merge(bronx_venues, 'outer')


In [46]:
df2.shape

(17535, 7)

In [47]:
df3 = df2.merge(manhattan_venues, 'outer')


In [48]:
df3.shape

(21535, 7)

In [49]:
df4 = df3.merge(staten_island_venues, 'outer')
df4.tail()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
25148,Fox Hills,40.617311,-74.08174,Penny Beach,40.615868,-74.064456,Beach
25149,Fox Hills,40.617311,-74.08174,Hylan Blvd & Bay Street,40.613853,-74.064934,Intersection
25150,Fox Hills,40.617311,-74.08174,Now or Never Body Works,40.60418,-74.084027,Tattoo Parlor
25151,Fox Hills,40.617311,-74.08174,Steven Nails,40.630058,-74.077025,Cosmetics Shop
25152,Fox Hills,40.617311,-74.08174,MTA SIR - Grasmere,40.603937,-74.083478,Train Station


In [50]:
df4.shape

(25153, 7)
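
Since the five borough frames share the same columns, the chain of outer merges can also be written as a single row-wise concatenation, which states the intent (stacking rows) more directly. A minimal sketch with toy frames in place of the borough data:

```python
import pandas as pd

# Toy stand-ins for two borough venue frames with identical columns.
borough_a = pd.DataFrame({"Neighborhood": ["A1", "A1"], "Venue": ["Cafe", "Cafe"]})
borough_b = pd.DataFrame({"Neighborhood": ["B1"], "Venue": ["Park"]})

# Stack the frames and renumber the index from zero.
all_venues = pd.concat([borough_a, borough_b], ignore_index=True)
```

The row count of the result is simply the sum of the inputs, matching the pattern seen above where 6,550 + 6,466 rows gave 13,016.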

#### This completes construction of a dataframe with 25,153 public venues from Foursquare for all 330 NYC neighborhoods

#### Process the Venue dataframe for Analysis

In [51]:
# process dataframe for analysis

# one hot encoding
df4_onehot = pd.get_dummies(df4[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df4_onehot['Neighborhood'] = df4['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [df4_onehot.columns[-1]] + list(df4_onehot.columns[:-1])
df4_onehot = df4_onehot[fixed_columns]


#### Aggregate Venue Counts by Neighborhood to allow grouping by PUMA Community Districts

In [52]:
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\df4_onehot.csv"
df4_onehot = pd.read_csv(csv_path, index_col=0)


In [53]:
# group rows by Neighborhood and take the mean frequency of occurrence of each venue category
df4_grouped_all = df4_onehot.groupby('Neighborhood').mean().reset_index()
df4_grouped_all.tail()

Unnamed: 0,Neighborhood,Zoo Exhibit,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,Airport Tram,...,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
297,Woodhaven,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
298,Woodlawn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
299,Woodrow,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
300,Woodside,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
301,Yorkville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.01,0.0


In [54]:
df4_grouped_all.shape

(302, 453)
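
To make the encoding and averaging steps concrete, the sketch below applies them to five made-up venue rows. Averaging the one-hot columns per neighborhood yields the share of that neighborhood's venues falling in each category, so each row sums to 1.

```python
import pandas as pd

# Five illustrative venue rows; not taken from the Foursquare extract.
venues = pd.DataFrame({
    "Neighborhood": ["Astoria", "Astoria", "Astoria", "Astoria", "Wakefield"],
    "Venue Category": ["Gym", "Gym", "Gourmet Shop", "Bakery", "Dessert Shop"],
})

# One indicator column per category, then the neighborhood label in front.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of 0/1 indicators = fraction of venues in each category.
frequencies = onehot.groupby("Neighborhood").mean()
```

Because these are shares rather than counts, a neighborhood with few venues and one with many are placed on the same scale.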

#### Import Borough Neighborhoods with Associated PUMA Community District Names

In [55]:
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\NYCD_PUMA_CD_Hoods_Names4.csv"
NYCD_puma = pd.read_csv(csv_path, index_col=0)
NYCD_puma.head()

Unnamed: 0,Neighborhood,Community Board(CB),Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
1,Melrose,Bronx Community District 1 & 2,,,,,
2,Mott Haven,Bronx Community District 1 & 2,,,,,
3,Port Morris,Bronx Community District 1 & 2,,,,,
4,Hunts Point,Bronx Community District 1 & 2,,,,,
5,Longwood,Bronx Community District 1 & 2,,,,,


In [56]:
NYCD_puma.shape

(370, 7)

#### Join PUMA Community District Names to Neighborhood Names

In [57]:
df_puma_all = pd.merge(df4_grouped_all,NYCD_puma[['Neighborhood', 'Community Board(CB)']], on='Neighborhood', how='left')
df_puma_all = df_puma_all.reset_index(drop=True)

In [58]:
df_puma_all.shape

(315, 454)
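
The row count rises from 302 to 315 after the left join. One possible cause, which the notebook does not confirm, is that some neighborhood names appear more than once in the lookup table, so each match adds a row. The sketch below shows the effect and a way to list such names, using toy data with placeholder district names.

```python
import pandas as pd

# Toy frequency table and lookup; "Chelsea" is listed under two districts.
grouped = pd.DataFrame({"Neighborhood": ["Chelsea", "Astoria"], "Park": [0.1, 0.2]})
lookup = pd.DataFrame({
    "Neighborhood": ["Chelsea", "Chelsea", "Astoria"],
    "Community Board(CB)": ["District X", "District Y", "District Z"],
})

merged = pd.merge(grouped, lookup, on="Neighborhood", how="left")

# Names that occur more than once in the lookup table.
is_repeated = lookup["Neighborhood"].duplicated(keep=False)
repeated = lookup.loc[is_repeated, "Neighborhood"].unique().tolist()
```

Running the same check on `NYCD_puma` would show whether repeated names account for the 13 additional rows.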

In [59]:
# move PUMA Community District Name column to the first position
col_name = 'Community Board(CB)'
sec_col = df_puma_all.pop(col_name)
df_puma_all.insert(0, col_name, sec_col)
df_puma_all.head()

Unnamed: 0,Community Board(CB),Neighborhood,Zoo Exhibit,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,...,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Bronx Community District 11,Allerton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Staten Island Community District 3,Annadale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Staten Island Community District 3,Arden Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Staten Island Community District 1,Arlington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0
4,Staten Island Community District 2,Arrochar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [60]:
df_puma_all.shape

(315, 454)

In [62]:
# inspect data types for machine learning models
df_puma_all.dtypes

Community Board(CB)     object
Neighborhood            object
Zoo Exhibit            float64
Accessories Store      float64
Adult Boutique         float64
                        ...   
Winery                 float64
Wings Joint            float64
Women's Store          float64
Yoga Studio            float64
Zoo                    float64
Length: 454, dtype: object

### This completes construction of a Dataframe containing 25,153 venues organized by 59 PUMA-CDs
#### The Dataframe contains 452 unique venue categories (454 columns, including the two identifier columns)

#### Determine Top 10 Trending Venues by Neighborhood

In [81]:
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\df_puma_all_final.csv"
df_puma_all = pd.read_csv(csv_path, index_col=0)
df_puma_all.head()

Unnamed: 0,Neighborhood,Zoo Exhibit,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,Airport Tram,...,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Allerton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Annadale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Arden Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arlington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0
4,Arrochar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [71]:
# sort venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
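
The row-by-row helper above can also be expressed as one vectorized operation over the whole frame. The following is a minimal sketch, not the notebook's method: the function name `top_venues`, the `rank_N` column names, and the toy frame are hypothetical and exist only for illustration. It uses a stable sort, so ties keep the original column order, which may differ from the tie order produced by `sort_values`.

```python
import numpy as np
import pandas as pd

def top_venues(freq: pd.DataFrame, label_col: str, n: int) -> pd.DataFrame:
    """Return the n most frequent venue columns per row (hypothetical helper)."""
    venues = freq.drop(columns=[label_col])
    # Negating the values makes argsort return column positions in descending order.
    order = np.argsort(-venues.to_numpy(dtype=float), axis=1, kind="stable")[:, :n]
    names = venues.columns.to_numpy()[order]
    out = pd.DataFrame(names, index=freq.index,
                       columns=[f"rank_{i + 1}" for i in range(n)])
    out.insert(0, label_col, freq[label_col])
    return out

# Toy frequencies (assumed values, not taken from the Foursquare data)
toy = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Bakery": [0.1, 0.0],
    "Park": [0.3, 0.2],
    "Zoo": [0.0, 0.5],
})
top = top_venues(toy, "Neighborhood", 2)
print(top)
```
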

In [82]:
# Top 10 venues dataframe

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond 3rd fall through to the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
        
# create a new dataframe
df_puma_sort_all = pd.DataFrame(columns=columns)
df_puma_sort_all['Neighborhood'] = df_puma_all['Neighborhood']

for ind in np.arange(df_puma_all.shape[0]):
    df_puma_sort_all.iloc[ind, 1:] = return_most_common_venues(df_puma_all.iloc[ind, :], num_top_venues)
    
df_puma_sort_all.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allerton,Pizza Place,Donut Shop,Sandwich Place,Caribbean Restaurant,Pharmacy,Fast Food Restaurant,Bank,Chinese Restaurant,Garden,Fried Chicken Joint
1,Annadale,Italian Restaurant,Train Station,Sandwich Place,Restaurant,Donut Shop,Fast Food Restaurant,Diner,Sushi Restaurant,Pizza Place,Sports Bar
2,Arden Heights,Pizza Place,Bus Stop,Restaurant,Liquor Store,Diner,Bank,Bakery,Soccer Field,Optical Shop,Park
3,Arlington,Discount Store,Spanish Restaurant,Fast Food Restaurant,Hardware Store,Department Store,Sandwich Place,Pharmacy,Donut Shop,Convenience Store,Latin American Restaurant
4,Arrochar,Italian Restaurant,Baseball Field,Grocery Store,Beach,Pharmacy,Ice Cream Shop,Chinese Restaurant,Bank,Mediterranean Restaurant,Middle Eastern Restaurant
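
The column labels in the table above are built with a three-item suffix list and an exception fallback, which works for ten ranks. A small helper can make the intent explicit and stay correct for larger values such as 11th or 22nd. This is only a sketch; `ordinal` and `venue_columns` are hypothetical names, not part of the notebook.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def venue_columns(label: str, num_top_venues: int) -> list:
    """Build the header list: the label column followed by ranked venue columns."""
    ranked = [f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)]
    return [label] + ranked

columns = venue_columns("Neighborhood", 10)
print(columns)
```

The same call with `"LSI"` or `"Community Board(CB)"` as the label would cover the other two top-10 tables in this notebook.
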


In [83]:
df_puma_sort_all.shape

(302, 11)

In [84]:
df_puma_sort_all.to_csv(r'C:\Users\mlporter\atom\PY4E\Capstone_Index\df_puma_sort_all.csv')

### Join Life Satisfaction Indicator (LSI) Ranks Computed Earlier to Venue DataFrame

In [109]:
# read LSI Rank data by PUMA CD from saved local file
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\df_cd_lsi.csv"
df_lsi = pd.read_csv(csv_path, index_col=0).reset_index(drop=True)
df_lsi.head()

Unnamed: 0,LSI,Community Board(CB)
0,0,Brooklyn Community District 9
1,0,Brooklyn Community District 3
2,0,Manhattan Community District 12
3,0,Brooklyn Community District 5
4,0,Manhattan Community District 10


In [114]:
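# Add LSI to the venue frequencies by PUMA CD.
# With no `on=` argument, merge() joins on the column names the two frames share;
# the MergeError below is raised because the frames have no column name in common.
df_puma_all_lsi = df_puma_all.merge(df_lsi, how='inner')

# move the LSI column to the first position
col_name = 'LSI'
sec_col = df_puma_all_lsi.pop(col_name)
df_puma_all_lsi.insert(0, col_name, sec_col)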

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
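
The error above occurs when the two frames share no column name. One way to guard against it is to name the join key explicitly and check that both frames carry it before merging. The sketch below uses small made-up frames (`venues`, `lsi`) rather than the notebook's data, and assumes the intended key is `Community Board(CB)`, the column shown in the LSI table.

```python
import pandas as pd

KEY = "Community Board(CB)"  # assumed join key

# Toy stand-ins for the venue and LSI frames (values are illustrative only)
venues = pd.DataFrame({
    KEY: ["District A", "District A", "District B"],
    "Neighborhood": ["N1", "N2", "N3"],
    "Coffee Shop": [0.02, 0.00, 0.05],
})
lsi = pd.DataFrame({KEY: ["District A", "District B"], "LSI": [1, 3]})

# Fail early with a readable message if either frame lacks the key column.
missing = [name for name, frame in (("venues", venues), ("lsi", lsi))
           if KEY not in frame.columns]
if missing:
    raise KeyError(f"{KEY!r} is missing from: {', '.join(missing)}")

# validate="many_to_one" asserts that each district appears once in the LSI frame.
merged = venues.merge(lsi, on=KEY, how="inner", validate="many_to_one")

# Move the LSI column to the front, as in the cell above.
merged.insert(0, "LSI", merged.pop("LSI"))
print(merged)
```
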

In [113]:
df_puma_all_lsi.head(40)

Unnamed: 0,LSI,Zoo Exhibit,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,Airport Tram,...,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,0,0.002553,0.000426,0.0,0.0,0.003364,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000851,0.006809,0.01109,0.0,0.000741,0.000938,0.010488,0.002553
1,1,0.0,0.000233,0.000698,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00093,0.0,0.00093,0.009839,0.017907,0.00093,0.000698,0.00186,0.009714,0.000465
2,2,0.0,0.000476,0.0,0.0,0.0,0.002857,0.002857,0.0,0.0,...,0.0,0.0,0.0,0.001429,0.003998,0.0,0.001155,0.0,0.000574,0.000952
3,3,0.006522,0.0,0.0,0.0,0.004746,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.001304,0.001304,0.0,0.002411,0.000537,0.00087,0.008261
4,4,0.0,0.000819,0.0,0.000299,0.000199,5.9e-05,0.000118,5.9e-05,6.1e-05,...,5.6e-05,0.000381,5.6e-05,0.001428,0.002762,0.0,0.002188,0.000868,0.002203,0.000227


#### The Public Venue Dataframe now includes the LSI Rank by PUMA-Community District and Neighborhood

In [101]:
df_puma_all_lsi.shape

(5, 453)

### Determine Top 10 Trending Venues by Life Satisfaction Indicator (LSI) Rank groups

#### Determine Mean Frequency of Occurrence of Venues by LSI Group

In [102]:
# group rows by LSI Rank and take mean frequency of occurrence
df_puma_all_lsi = df_puma_all_lsi.groupby('LSI').mean().reset_index()
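
When the frame still carries text columns such as the neighborhood name, recent pandas versions raise an error on `groupby(...).mean()` unless the aggregation is restricted to numeric columns. A minimal sketch with a toy frame (hypothetical values) shows the `numeric_only=True` form:

```python
import pandas as pd

# Toy frame: one text label column plus two venue-frequency columns
toy = pd.DataFrame({
    "LSI": [0, 0, 1],
    "Neighborhood": ["N1", "N2", "N3"],
    "Coffee Shop": [0.02, 0.04, 0.10],
    "Park": [0.00, 0.02, 0.06],
})

# numeric_only=True drops the text column instead of trying to average it.
by_lsi = toy.groupby("LSI").mean(numeric_only=True).reset_index()
print(by_lsi)
```
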


In [103]:
# sort venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [115]:
# Top 10 venues dataframe

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['LSI']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond 3rd fall through to the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
        
# create a new dataframe
df_puma_top_sort = pd.DataFrame(columns=columns)
df_puma_top_sort['LSI'] = df_puma_all_lsi['LSI']

for ind in np.arange(df_puma_all_lsi.shape[0]):
    df_puma_top_sort.iloc[ind, 1:] = return_most_common_venues(df_puma_all_lsi.iloc[ind, :], num_top_venues)
    
df_puma_top_sort.head()

Unnamed: 0,LSI,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Pizza Place,Coffee Shop,Bar,Park,Bakery,Café,Caribbean Restaurant,Donut Shop,Sandwich Place,Deli / Bodega
1,1,Coffee Shop,Park,Italian Restaurant,Pizza Place,Bakery,American Restaurant,Gym,Hotel,Bar,Gym / Fitness Center
2,2,Pizza Place,Bakery,Italian Restaurant,Chinese Restaurant,Donut Shop,Park,Pharmacy,Bank,Sushi Restaurant,Coffee Shop
3,3,Pizza Place,Donut Shop,Park,Fast Food Restaurant,Grocery Store,Sandwich Place,Mexican Restaurant,Discount Store,Pharmacy,Latin American Restaurant
4,4,Pizza Place,Italian Restaurant,Donut Shop,Pharmacy,Sandwich Place,Deli / Bodega,Bank,Ice Cream Shop,Grocery Store,Park


### Determine Top 10 Trending Venues by Community Board Districts (PUMA)

In [116]:
# read Community District PUMA venue data from saved local file
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\df_puma.csv"
df_puma = pd.read_csv(csv_path, index_col=0).reset_index(drop=True)

In [117]:
# sort venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [118]:
# Top 10 venues dataframe

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Community Board(CB)']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond 3rd fall through to the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
        
# create a new dataframe
df_puma_CB_sort = pd.DataFrame(columns=columns)
df_puma_CB_sort['Community Board(CB)'] = df_puma['Community Board(CB)']

for ind in np.arange(df_puma.shape[0]):
    df_puma_CB_sort.iloc[ind, 1:] = return_most_common_venues(df_puma.iloc[ind, :], num_top_venues)
    
df_puma_CB_sort.head()

Unnamed: 0,Community Board(CB),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bronx Community District 1 & 2,Pizza Place,Mexican Restaurant,Discount Store,Park,Donut Shop,Gym,Pharmacy,Sandwich Place,Fast Food Restaurant,Grocery Store
1,Bronx Community District 10,Pizza Place,Italian Restaurant,Donut Shop,Sandwich Place,Bar,Pharmacy,Diner,Bakery,American Restaurant,Mexican Restaurant
2,Bronx Community District 11,Pizza Place,Donut Shop,Coffee Shop,Italian Restaurant,Pharmacy,Sandwich Place,Bank,Chinese Restaurant,Deli / Bodega,Supermarket
3,Bronx Community District 12,Caribbean Restaurant,Pizza Place,Pharmacy,Donut Shop,Fast Food Restaurant,Supermarket,Bakery,Sandwich Place,Mobile Phone Shop,Bank
4,Bronx Community District 3 & 6,Pizza Place,Donut Shop,Italian Restaurant,Discount Store,Zoo,Fast Food Restaurant,Grocery Store,Mexican Restaurant,Mobile Phone Shop,Park


## III. METHODOLOGY

In [None]:
The objective of this section is to apply a sample of Machine Learning classification algorithms to determine whether a community's LSI Rank
can be predicted from its Public Venue data, that is, from the data sets constructed in the previous sections.
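
As a rough illustration of that setup, the sketch below fits a decision tree to synthetic venue frequencies and scores it on held-out rows. Everything in it is an assumption made for illustration: the toy data, the three feature columns, the `max_depth=3` setting, and the 25% test share. Because every neighborhood in a district shares that district's LSI rank, the split keeps whole districts together with `GroupShuffleSplit` so that no district appears on both sides.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_rows = 60

# Synthetic stand-in for the training frame: 12 districts, 5 neighborhoods each.
district_id = np.arange(n_rows) % 12
toy = pd.DataFrame({
    "LSI": district_id % 5,  # one rank per district
    "CommunityBoard(CB)": [f"District {d}" for d in district_id],
    "CoffeeShop": rng.random(n_rows),
    "PizzaPlace": rng.random(n_rows),
    "Park": rng.random(n_rows),
})

X = toy.drop(columns=["LSI", "CommunityBoard(CB)"])
y = toy["LSI"]
groups = toy["CommunityBoard(CB)"]

# Hold out entire districts so the test rows come from unseen districts.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X.iloc[train_idx], y.iloc[train_idx])
accuracy = clf.score(X.iloc[test_idx], y.iloc[test_idx])
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

The features here are random, so the printed accuracy carries no meaning beyond showing that the pipeline runs end to end.
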

### ANALYSIS: MACHINE LEARNING MODELS

#### Call Ranked Data Sets from saved Local File

In [120]:
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\df_puma_all_lsi.csv"
df_puma_all_lsi = pd.read_csv(csv_path, index_col=0, sep=",")
df_puma_all_lsi.tail()

Unnamed: 0,LSI,Community Board(CB),Neighborhood,Zoo Exhibit,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,...,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
308,1,Manhattan Community District 7,Lincoln Square,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0
309,1,Manhattan Community District 7,Manhattan Valley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.02,0.0,0.01,0.0,0.01,0.0
310,1,Manhattan Community District 7,Upper West Side,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0
311,2,Brooklyn Community District 7,Sunset Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
312,2,Brooklyn Community District 7,Windsor Terrace,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.0,0.0


#### Designate DataFrame for Training Sets

In [121]:
# work on a copy so that later column renaming does not alter df_puma_all_lsi
df_train = df_puma_all_lsi.copy()

In [122]:
df_train.shape

(313, 455)

In [123]:
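# strip all whitespace from column names; regex=True is required for the
# pattern to be treated as a regular expression in current pandas versions
df_train.columns = df_train.columns.str.replace(r'\s+', '', regex=True)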
from IPython.display import display
pd.set_option('display.max_rows', 2500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 2500)

display(df_train.columns)

Index(['LSI', 'CommunityBoard(CB)', 'Neighborhood', 'ZooExhibit',
       'AccessoriesStore', 'AdultBoutique', 'AfghanRestaurant',
       'AfricanRestaurant', 'AirportLounge', 'AirportService',
       ...
       'Waterfront', 'WeightLossCenter', 'WhiskyBar', 'WineBar', 'WineShop',
       'Winery', 'WingsJoint', 'Women'sStore', 'YogaStudio', 'Zoo'],
      dtype='object', length=455)
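
Removing whitespace can in principle make two different column names collide, so it may be worth checking for duplicates right after the rename. The sketch below applies the same pattern to a few column names from this notebook; the duplicate check is an added suggestion, not something the original cell performs.

```python
import pandas as pd

cols = pd.Index(["Community Board(CB)", "Zoo Exhibit", "Deli / Bodega", "Women's Store"])

# Same cleanup as above: drop every run of whitespace.
clean = cols.str.replace(r"\s+", "", regex=True)

# Any name listed here would have been produced by more than one original column.
duplicates = clean[clean.duplicated()].tolist()
print(list(clean), duplicates)
```
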

In [124]:
df_train.to_csv(r'C:\Users\mlporter\atom\PY4E\Capstone_Index\df_train.csv')

In [125]:
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\df_train.csv"
df_train = pd.read_csv(csv_path, index_col=0, sep=",")
df_train.tail()

Unnamed: 0,LSI,CommunityBoard(CB),Neighborhood,ZooExhibit,AccessoriesStore,AdultBoutique,AfghanRestaurant,AfricanRestaurant,AirportLounge,AirportService,AirportTerminal,AirportTram,AmericanRestaurant,Amphitheater,AnimalShelter,AntiqueShop,Aquarium,Arcade,ArepaRestaurant,ArgentinianRestaurant,ArtGallery,ArtMuseum,Arts&CraftsStore,Arts&Entertainment,AsianRestaurant,Athletics&Sports,Auditorium,AustralianRestaurant,AustrianRestaurant,AutoDealership,AutoGarage,AutoWorkshop,AutomotiveShop,BBQJoint,BagelShop,Bakery,Bank,Bar,BaseballField,BaseballStadium,BasketballCourt,BasketballStadium,BathHouse,Beach,BeachBar,Bed&Breakfast,BeerBar,BeerGarden,BeerStore,BigBoxStore,BikeRental/BikeShare,BikeShop,BikeTrail,Bistro,BoardShop,BoatorFerry,Bookstore,BorderCrossing,BotanicalGarden,Boutique,BowlingAlley,BoxingGym,BrazilianRestaurant,BreakfastSpot,Brewery,BridalShop,Bridge,BubbleTeaShop,Buffet,Building,BurgerJoint,BurmeseRestaurant,BurritoPlace,BusLine,BusStation,BusStop,BusinessService,Butcher,Cafeteria,Café,Cajun/CreoleRestaurant,CambodianRestaurant,CameraStore,Campground,CandyStore,CantoneseRestaurant,CarWash,CaribbeanRestaurant,Casino,CaucasianRestaurant,CheckCashingService,CheeseShop,ChineseRestaurant,ChocolateShop,Church,Circus,ClimbingGym,ClothingStore,ClubHouse,CocktailBar,CoffeeShop,CollegeAcademicBuilding,CollegeArtsBuilding,CollegeBasketballCourt,CollegeGym,CollegeTheater,ColombianRestaurant,ComedyClub,ComfortFoodRestaurant,ComicShop,CommunityCenter,ConcertHall,Construction&Landscaping,ConvenienceStore,ConventionCenter,CookingSchool,CosmeticsShop,CoworkingSpace,Creperie,CubanRestaurant,CulturalCenter,CupcakeShop,CycleStudio,CzechRestaurant,DanceStudio,Deli/Bodega,DepartmentStore,DesignStudio,DessertShop,DimSumRestaurant,Diner,DiscountStore,Distillery,DiveBar,Doctor'sOffice,DogRun,DonutShop,DosaPlace,DryCleaner,DumplingRestaurant,EasternEuropeanRestaurant,EgyptianRestaurant,ElectronicsStore,EmergencyRoom,EmpanadaRestaurant,EnglishRestaurant,EthiopianRestaurant,EventService,EventSpace,Exhibit,Factory,FalafelRestaurant,Farm,FarmersMarket,FastFoodRestaurant,Field,FilipinoRestaurant,FilmStudio,FinancialorLegalService,Fish&ChipsShop,FishMarket,FishingStore,FleaMarket,FlowerShop,Food,Food&DrinkShop,FoodCourt,FoodService,FoodTruck,Fountain,FrenchRestaurant,FriedChickenJoint,FrozenYogurtShop,Fruit&VegetableStore,Furniture/HomeStore,GamingCafe,Garden,GardenCenter,GasStation,Gastropub,GayBar,GeneralEntertainment,GermanRestaurant,GiftShop,Gluten-freeRestaurant,GoKartTrack,GolfCourse,GolfDrivingRange,GourmetShop,GovernmentBuilding,GreekRestaurant,GroceryStore,GunRange,Gym,Gym/FitnessCenter,GymPool,GymnasticsGym,HalalRestaurant,Harbor/Marina,HardwareStore,HawaiianRestaurant,Health&BeautyService,HealthFoodStore,Heliport,Herbs&SpicesStore,HighSchool,HimalayanRestaurant,HistoricSite,HistoryMuseum,HobbyShop,HomeService,HookahBar,Hostel,HotDogJoint,Hotel,HotelBar,HotpotRestaurant,ITServices,IceCreamShop,IndianRestaurant,IndieMovieTheater,IndieTheater,IndonesianRestaurant,Intersection,IrishPub,Island,IsraeliRestaurant,ItalianRestaurant,JapaneseCurryRestaurant,JapaneseRestaurant,JazzClub,JewelryStore,JewishRestaurant,JuiceBar,KaraokeBar,KebabRestaurant,KidsStore,KitchenSupplyStore,KoftePlace,KoreanRestaurant,KosherRestaurant,Lake,LatinAmericanRestaurant,Laundromat,LaundryService,Lawyer,LebaneseRestaurant,Library,Lighthouse,LightingStore,LingerieStore,LiquorStore,Locksmith,Lounge,LuggageStore,Mac&CheeseJoint,MalayRestaurant,Market,MartialArtsDojo,MassageStudio,MattressStore,MedicalCenter,MediterraneanRestaurant,MemorialSite,Men'sStore,MetroStation,MexicanRestaurant,MiddleEasternRestaurant,MiniGolf,MiscellaneousShop,MobilePhoneShop,ModernGreekRestaurant,MolecularGastronomyRestaurant,Monument/Landmark,MoroccanRestaurant,Motel,MotorcycleShop,MovieTheater,MovingTarget,Multiplex,Museum,MusicSchool,MusicStore,MusicVenue,NailSalon,NationalPark,NewAmericanRestaurant,Newsstand,Nightclub,NightlifeSpot,Non-Profit,NoodleHouse,Office,OperaHouse,OpticalShop,OrganicGrocery,OtherGreatOutdoors,OtherNightlife,OtherRepairShop,OutdoorSculpture,Outdoors&Recreation,OutletMall,PaellaRestaurant,PaintballField,PakistaniRestaurant,Paper/OfficeSuppliesStore,Park,Parking,PedestrianPlaza,PerformingArtsVenue,PersianRestaurant,PeruvianRestaurant,PetCafé,PetService,PetStore,Pharmacy,PhotographyStudio,PieShop,Pier,PiercingParlor,PilatesStudio,PizzaPlace,Planetarium,Playground,Plaza,PokePlace,PolishRestaurant,Pool,PoolHall,PortugueseRestaurant,Pub,PublicArt,Racetrack,RamenRestaurant,RecordShop,RecordingStudio,RecreationCenter,RentalCarLocation,RentalService,Reservoir,ResidentialBuilding(Apartment/Condo),Resort,RestArea,Restaurant,River,RockClub,RoofDeck,RussianRestaurant,SakeBar,SaladPlace,Salon/Barbershop,SalvadoranRestaurant,SandwichPlace,ScandinavianRestaurant,ScenicLookout,School,ScienceMuseum,SculptureGarden,SeafoodRestaurant,ShanghaiRestaurant,ShippingStore,ShoeRepair,ShoeStore,Shop&Service,ShoppingMall,ShoppingPlaza,SkatePark,SkatingRink,SmokeShop,SmoothieShop,SnackPlace,SoccerField,SocialClub,SoupPlace,SouthAmericanRestaurant,Southern/SoulFoodRestaurant,SouvenirShop,SouvlakiShop,Spa,SpanishRestaurant,Speakeasy,SportingGoodsShop,SportsBar,SportsClub,SriLankanRestaurant,Stables,Stadium,State/ProvincialPark,StationeryStore,Steakhouse,StorageFacility,StreetArt,StreetFoodGathering,Supermarket,SupplementShop,SurfSpot,SushiRestaurant,SwissRestaurant,SzechuanRestaurant,TVStation,TacoPlace,TailorShop,TaiwaneseRestaurant,TanningSalon,TapasRestaurant,TattooParlor,TeaRoom,TechStartup,TennisCourt,TennisStadium,Tex-MexRestaurant,ThaiRestaurant,Theater,ThemePark,ThemeParkRide/Attraction,ThemeRestaurant,Thrift/VintageStore,TibetanRestaurant,TikiBar,TollPlaza,TourProvider,TouristInformationCenter,Toy/GameStore,Track,TrackStadium,Trail,TrainStation,TramStation,Tree,Tunnel,TurkishRestaurant,UdonRestaurant,UkrainianRestaurant,UsedBookstore,VapeStore,Varenykyrestaurant,Vegetarian/VeganRestaurant,VenezuelanRestaurant,Veterinarian,VideoGameStore,VideoStore,VietnameseRestaurant,VolleyballCourt,WarehouseStore,WasteFacility,Waterfront,WeightLossCenter,WhiskyBar,WineBar,WineShop,Winery,WingsJoint,Women'sStore,YogaStudio,Zoo
308,1,Manhattan Community District 7,Lincoln Square,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.03,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.04,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.04,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0
309,1,Manhattan Community District 7,Manhattan Valley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.03,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.02,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.01,0.0,0.01,0.0
310,1,Manhattan Community District 7,Upper West Side,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0
311,2,Brooklyn Community District 7,Sunset Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.1,0.03,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.03,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
312,2,Brooklyn Community District 7,Windsor Terrace,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.06,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.0,0.0


In [126]:
df_train.shape

(313, 455)

#### Define Features 

In [127]:
Features = df_train

In [128]:
Features.shape

(313, 455)
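
At this point the frame still mixes the target (`LSI`), two text identifier columns, and the numeric venue frequencies. A common way to define features is to separate these three roles explicitly. The sketch below does so on a toy frame; the toy values are made up, and the choice of which columns count as identifiers is an assumption based on the column listing above.

```python
import pandas as pd

# Toy stand-in for df_train (illustrative values only)
toy_train = pd.DataFrame({
    "LSI": [1, 1, 2],
    "CommunityBoard(CB)": ["District A", "District A", "District B"],
    "Neighborhood": ["N1", "N2", "N3"],
    "CoffeeShop": [0.04, 0.06, 0.01],
    "Park": [0.02, 0.00, 0.03],
})

target = "LSI"
id_cols = ["CommunityBoard(CB)", "Neighborhood"]  # labels, not predictors

# Every remaining column is a venue-frequency feature.
feature_names = toy_train.columns.drop(id_cols + [target])
X = toy_train[feature_names]
y = toy_train[target]
print(X.shape, y.shape)
```
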

In [138]:
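# Feature names: every column of Xa_train except the target 'LSI'
# (Xa_train is loaded from disk in cell In [134] below)
Features = Xa_train.columns.drop('LSI')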

In [132]:
X_train.shape

NameError: name 'X_train' is not defined

In [None]:
X_train.to_csv(r'C:\Users\mlporter\atom\PY4E\Capstone_Index\X_train.csv')

In [134]:
csv_path=r"C:\Users\mlporter\atom\PY4E\Capstone_Index\Xa_train.csv"
Xa_train = pd.read_csv(csv_path, index_col=0, sep=",")
Xa_train.tail()

Neighborhood,LSI,Beach,CaribbeanRestaurant,Park,KoreanRestaurant,SeafoodRestaurant,Trail,ChineseRestaurant,TennisStadium,Theater,ItalianRestaurant,Bar,Harbor/Marina,BusStop,PizzaPlace,CoffeeShop,Deli/Bodega,BusStation,DonutShop,DiscountStore,ThemeParkRide/Attraction,BoatorFerry,TrainStation,Campground,GroceryStore,Pharmacy,SurfSpot,Bakery,Hotel,SandwichPlace,ThaiRestaurant,GolfCourse,IndianRestaurant,Bank,Café,Exhibit,MexicanRestaurant,ThemePark,ZooExhibit,ArtGallery,FastFoodRestaurant,FriedChickenJoint,GreekRestaurant,SpanishRestaurant,Supermarket,SushiRestaurant,BaseballField,ClothingStore,RentalCarLocation,IceCreamShop,AmericanRestaurant,BubbleTeaShop,CocktailBar,Diner,Garden,Gym,Spa,Zoo,BagelShop,HardwareStore,Athletics&Sports,BorderCrossing,BusinessService,Food,Intersection,Lounge,MetroStation,Nightclub,PeruvianRestaurant,Pub,Resort,Restaurant,Shop&Service,SportingGoodsShop,TollPlaza,Construction&Landscaping,DiveBar,Gym/FitnessCenter,OtherNightlife,Playground,Pool,SouthAmericanRestaurant,SportsClub,MobilePhoneShop,LatinAmericanRestaurant,AirportLounge,AirportService,ConvenienceStore,FoodTruck,WineShop,YogaStudio,CosmeticsShop,HookahBar,BasketballCourt,DogRun,Monument/Landmark,MovingTarget,Thrift/VintageStore,Southern/SoulFoodRestaurant,Brewery,CantoneseRestaurant,HotpotRestaurant,JapaneseRestaurant,NewAmericanRestaurant,RussianRestaurant,ShoppingMall,WineBar,ArtMuseum,HistoryMuseum,Arcade,GiftShop,BurgerJoint,GasStation,ShoeStore,JuiceBar,LiquorStore,DepartmentStore,Furniture/HomeStore,Museum,BaseballStadium,ConcertHall,DumplingRestaurant,FrenchRestaurant,GourmetShop,HotDogJoint,JazzClub,MediterraneanRestaurant,Paper/OfficeSuppliesStore,Plaza,ScenicLookout,PetStore,Pier,BigBoxStore,BreakfastSpot,Farm,MovieTheater,OtherGreatOutdoors,PolishRestaurant,School,SkatePark,SriLankanRestaurant,StorageFacility,Women'sStore,AutomotiveShop,BeachBar,FlowerShop,GoKartTrack,Market,SkatingRink,TurkishRestaurant,VideoGameStore,Toy/GameStore,BikeTrail,ComfortFoodRestaurant,ComicShop,MiscellaneousShop,OtherRepairShop,SportsBar,Stables,Steakhouse,SupplementShop,TouristInformationCenter,VideoStore,WingsJoint,Arts&CraftsStore,BowlingAlley,ComedyClub,DanceStudio,EventSpace,MartialArtsDojo,OpticalShop,SmokeShop,SoccerField,BBQJoint,AsianRestaurant,FarmersMarket,TennisCourt,BoardShop,DessertShop,AccessoriesStore,JewelryStore,FleaMarket,IrishPub,NationalPark,PublicArt,Salon/Barbershop,SnackPlace,VolleyballCourt,Aquarium,ArgentinianRestaurant,BeerBar,Bookstore,BoxingGym,BrazilianRestaurant,Bridge,CycleStudio,EasternEuropeanRestaurant,FalafelRestaurant,FilipinoRestaurant,Food&DrinkShop,Fountain,IndieTheater,IndonesianRestaurant,KidsStore,MemorialSite,MiddleEasternRestaurant,PerformingArtsVenue,RecordShop,SaladPlace,TacoPlace,TapasRestaurant,Vegetarian/VeganRestaurant,VietnameseRestaurant,HotelBar,LingerieStore,AfricanRestaurant,Butcher,Lawyer,MusicStore,MusicVenue,Bed&Breakfast,Building,DimSumRestaurant,HistoricSite,KitchenSupplyStore,Outdoors&Recreation,AutoGarage,RentalService,ShippingStore,Cajun/CreoleRestaurant,ElectronicsStore,Gastropub,MotorcycleShop,Racetrack,AutoDealership,HomeService,ITServices,Lake,CandyStore,ClimbingGym,Fruit&VegetableStore,GymnasticsGym,PaintballField,RecreationCenter,TeaRoom,WasteFacility,Health&BeautyService,CheckCashingService,HobbyShop,Laundromat,NightlifeSpot,River,SculptureGarden,BusLine,Fish&ChipsShop,FishingStore,SocialClub,RecordingStudio,ArepaRestaurant
Lincoln Square,1,0.0,0.0,0.03,0.0,0.01,0.0,0.0,0.0,0.05,0.04,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.05,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.01,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.04,0.0,0.0,0.04,0.03,0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Manhattan Valley,1,0.0,0.01,0.09,0.01,0.01,0.0,0.04,0.0,0.01,0.02,0.02,0.0,0.0,0.03,0.06,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.03,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.02,0.01,0.0,0.01,0.02,0.02,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Upper West Side,1,0.0,0.0,0.11,0.0,0.01,0.01,0.01,0.0,0.03,0.03,0.0,0.0,0.0,0.01,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.04,0.01,0.0,0.01,0.0,0.01,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.01,0.0,0.03,0.02,0.0,0.01,0.01,0.02,0.03,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sunset Park,2,0.0,0.0,0.02,0.0,0.02,0.0,0.04,0.0,0.0,0.01,0.01,0.0,0.0,0.06,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.1,0.01,0.0,0.0,0.0,0.0,0.03,0.04,0.0,0.07,0.0,0.0,0.02,0.0,0.01,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Windsor Terrace,2,0.0,0.0,0.05,0.0,0.0,0.0,0.01,0.0,0.0,0.06,0.06,0.0,0.0,0.02,0.04,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.05,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.03,0.0,0.0,0.03,0.01,0.02,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.04,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [135]:
Xa_train.shape

(313, 259)

In [136]:
Xa_train.columns

Index(['LSI', 'Beach', 'CaribbeanRestaurant', 'Park', 'KoreanRestaurant',
       'SeafoodRestaurant', 'Trail', 'ChineseRestaurant', 'TennisStadium',
       'Theater',
       ...
       'Laundromat', 'NightlifeSpot', 'River', 'SculptureGarden', 'BusLine',
       'Fish&ChipsShop', 'FishingStore', 'SocialClub', 'RecordingStudio',
       'ArepaRestaurant'],
      dtype='object', length=259)

#### Define Labels

In [282]:
# set target variable labels as Y and inspect sample
Y_train = Xa_train['LSI'].values
Y_train[0:310], Y_train.shape

(array([0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 3, 3, 3, 0, 0, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 4, 4, 4, 4,
        4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4,
        4, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 2, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4,
        4, 4, 4, 2, 2, 2, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 0, 0, 0,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
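
The column listing above shows `LSI` among the 259 columns of `Xa_train`, so the target is also present in the feature matrix. A minimal sketch of separating the two before scaling or model fitting follows. The frame `toy`, its venue columns, and its values are hypothetical and are not the project data.

```python
import pandas as pd

# Hypothetical miniature of the venue-frequency table; values are made up.
toy = pd.DataFrame({
    "LSI": [0, 4, 4, 2],
    "Park": [0.03, 0.09, 0.11, 0.02],
    "CoffeeShop": [0.03, 0.06, 0.05, 0.01],
})

# Target vector: the LSI rank only.
y = toy["LSI"].values

# Feature matrix: every column except the target.
X = toy.drop(columns=["LSI"])

print(X.columns.tolist())
print(X.shape, y.shape)
```

With the target removed, any score a classifier reaches has to come from the venue columns alone.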

#### Normalize Data Set

In [179]:
# normalize feature set data
Xa_train = preprocessing.StandardScaler().fit(Xa_train).transform(Xa_train).astype(float)
Xa_train[0:1]

array([[-1.7609965 , -0.22980307,  0.98647122, -0.18100742, -0.16845265,
        -0.46669969,  1.17729835,  0.65460838, -0.09641126, -0.237815  ,
        -0.03233738, -0.41168891, -0.22094164, -0.41681751,  2.01155802,
        -0.24455896, -0.53929157, -0.30938627,  2.23173606,  0.47107708,
        -0.10918731, -0.13836527, -0.30222984, -0.10285648, -0.56639804,
         0.91425885, -0.13830038, -0.12597547, -0.44189196,  1.932135  ,
        -0.46770633, -0.23940689, -0.42336985,  0.6747972 , -0.74717303,
        -0.12638842,  0.27508073, -0.11710061, -0.12351321, -0.31162068,
         0.82175629,  1.51039335, -0.35325811, -0.5316123 , -0.14809216,
        -0.01082857, -0.4084418 ,  0.37130918, -0.35801202,  0.68274227,
        -0.94209255, -0.30599627, -0.48579337,  0.58214089,  3.37262121,
        -0.96798785,  0.45960433, -0.16035701, -0.87441111, -0.26949577,
        -0.3528599 , -0.12747483, -0.0739122 ,  1.5473174 , -0.3391987 ,
        -0.45312686, -0.19904067, -0.31724033, -0.3
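
The cell above fits `StandardScaler` on the full `Xa_train` table. One common alternative is to fit the scaler on the training split only and reuse its statistics for the test split, so that no information from the test rows enters preprocessing. The sketch below uses synthetic data; `X_demo`, `y_demo`, and the other names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(4)
X_demo = rng.rand(40, 3)          # synthetic stand-in for venue frequencies
y_demo = rng.randint(0, 2, 40)    # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4
)

scaler = StandardScaler().fit(X_tr)    # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test rows reuse the training mean and variance

print(X_tr_scaled.mean(axis=0).round(6))
```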

## MACHINE LEARNING CLASSIFICATION MODELS

### SUPPORT VECTOR MACHINES 

In [None]:
The objective of this section is to construct machine learning classification models and test their ability to predict
LSI rank from venue data.

#### Define Training and Test data sets for SVM

In [139]:
from sklearn.model_selection import train_test_split
xxs_train, xxs_test, yys_train, yys_test = train_test_split(Xa_train,Y_train,test_size=0.3,random_state=4)
print ('Train set: ', xxs_train.shape, yys_train.shape)
print ('Test set: ', xxs_test.shape, yys_test.shape)

Train set:  (219, 259) (219,)
Test set:  (94, 259) (94,)
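
The LSI classes are imbalanced (rank 4 dominates the label array shown earlier), so a stratified split may be worth considering. The sketch below uses the `stratify` argument of `train_test_split` on synthetic labels; `X_demo` and `y_demo` are illustrative placeholders, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 60 samples of class 4, 10 each of classes 0-3.
y_demo = np.array([4] * 60 + [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4, stratify=y_demo
)

# Each class keeps its overall share in both splits.
print(np.bincount(y_tr), np.bincount(y_te))
```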


#### Establish SVM instance and train model

In [156]:
# train SVM model with an RBF kernel, leaving C and gamma at their defaults
from sklearn import svm
clf = svm.SVC(kernel='rbf')

clf.fit(xxs_train, yys_train) 



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [157]:
# svm prediction on test data
yys_hat = clf.predict(xxs_test)

In [158]:
# determine f1 score for SVM
from sklearn.metrics import f1_score
f1_score(yys_test, yys_hat, average='micro') 

0.6808510638297872

#### Identify Optimum Parameters for SVM using GridSearchCV

#### SVM instance using Optimized Parameters

In [169]:
# train SVM model - provide gamma variable for multiclass classification
from sklearn import svm
clf = svm.SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

clf.fit(xxs_train, yys_train) 

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [170]:
#Determine best hyperparameters
from sklearn.model_selection import GridSearchCV

#reuse the SVM instance as the base estimator for Grid Search
svm_model = clf

param_grid = { 
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.001, 0.0001],
    'kernel': ['rbf','poly','sigmoid']
}

#search for optimum SVM parameter values with Grid Search
svmgrid_model = GridSearchCV(svm_model, param_grid, cv=5)

svmgrid_model.fit(xxs_train, yys_train)

print('Grid Search Best Estimator: ', svmgrid_model.best_estimator_)
print('Grid Search Best Hyperparameter Results: ', svmgrid_model.best_params_)
print('Grid Search Best Score: ', svmgrid_model.best_score_)

Grid Search Best Estimator:  SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
Grid Search Best Hyperparameter Results:  {'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
Grid Search Best Score:  1.0
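
The grid search above reports `C=0.1, gamma=1`, while the fitted `clf` uses `C=100, gamma=0.001`. With the default `refit=True`, `GridSearchCV` keeps the winning model in `best_estimator_`, so it can be scored directly instead of retyping parameters. The sketch below shows this on synthetic data from `make_classification`; the data, the reduced grid, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic three-class data standing in for the venue features.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=10, n_informative=5, n_classes=3, random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4
)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refit on all of X_tr.
best_svm = grid.best_estimator_
test_accuracy = best_svm.score(X_te, y_te)
print(grid.best_params_, round(test_accuracy, 2))
```

Scoring `best_estimator_` on the held-out split keeps the reported parameters and the evaluated model in agreement.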




#### Prediction with SVM model

In [171]:
# svm prediction on test data
yys_hat = clf.predict(xxs_test)

In [172]:
#inspect prediction results and data set shape
print(yys_hat[:], yys_hat.shape) 
print(yys_test[:], yys_test.shape)

[4 4 4 1 4 4 4 4 4 4 4 0 4 4 4 4 4 3 4 0 4 4 4 4 1 4 1 4 0 4 4 1 4 4 2 3 4
 4 1 4 4 0 3 1 4 4 4 1 4 0 4 1 2 3 1 3 0 1 1 4 4 0 3 4 0 4 4 4 4 1 1 4 2 4
 2 3 4 3 4 0 4 4 4 0 2 4 2 4 2 4 4 0 3 1] (94,)
[4 4 4 1 4 4 4 4 4 4 4 0 4 4 4 4 4 3 4 0 4 4 4 4 1 4 1 4 0 4 4 1 4 4 2 3 4
 4 1 4 4 0 3 1 4 4 4 1 4 0 4 1 2 3 1 3 0 1 1 4 4 0 3 4 0 4 4 4 4 1 1 4 2 4
 2 3 4 3 4 0 4 4 4 0 2 4 2 4 2 4 4 0 3 1] (94,)


#### Evaluate Results of SVM test data model - Jaccard, Classification Report, and Confusion Matrix

In [173]:
# Pass a selected averaging method appropriate for multiclass/multilabel targets (non-binary classification)
from sklearn.metrics import jaccard_score

jaccard_score(yys_test, yys_hat, average='micro')

1.0

In [174]:
# determine f1 score for SVM
from sklearn.metrics import f1_score
f1_score(yys_test, yys_hat, average='micro') 

1.0

#### Cross Validation of Accuracy Scores

In [175]:
## Inspect average scores with Cross Validation
from sklearn.model_selection import cross_val_score

#Determine average Accuracy Score using Cross Validation
svmm_model = clf

#Train the model with cv=5
cv_scores = cross_val_score(svmm_model, xxs_train, yys_train, cv=5)
print(cv_scores)
print('cv_scores mean: {}'.format(np.mean(cv_scores)))
print('cv_scores standard deviation: {}'.format(np.std(cv_scores)))

[1. 1. 1. 1. 1.]
cv_scores mean: 1.0
cv_scores standard deviation: 0.0


In [176]:
# call confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
import itertools

conf_mx = confusion_matrix(yys_test, yys_hat)
conf_mx

array([[11,  0,  0,  0,  0],
       [ 0, 14,  0,  0,  0],
       [ 0,  0,  7,  0,  0],
       [ 0,  0,  0,  9,  0],
       [ 0,  0,  0,  0, 53]], dtype=int64)

In [177]:
# SVM model performance
count_misclassified = (yys_test != yys_hat).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(yys_test, yys_hat)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 0
Accuracy: 1.00
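
Rank 4 accounts for 53 of the 94 test samples, so a majority-class baseline gives useful context for any accuracy figure. The sketch below rebuilds label vectors from the class supports reported in this notebook and scores scikit-learn's `DummyClassifier`. The placeholder feature column and the variable names are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label vectors rebuilt from the class supports reported in this notebook
# (train: 36/29/14/14/126, test: 11/14/7/9/53 for LSI ranks 0-4).
y_tr = np.repeat([0, 1, 2, 3, 4], [36, 29, 14, 14, 126])
y_te = np.repeat([0, 1, 2, 3, 4], [11, 14, 7, 9, 53])

# DummyClassifier ignores the features, so a placeholder column is enough.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_tr), 1)), y_tr)

baseline_accuracy = baseline.score(np.zeros((len(y_te), 1)), y_te)
print(round(baseline_accuracy, 3))
```

A model that always predicts rank 4 is right on 53 of 94 test samples, which is the floor the classifiers need to beat.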


#### SVM Results and Conclusions

Classification with the SVM at default parameters produced a micro-averaged F1 score of 0.68.
Tuning the SVM parameters (C=100, gamma=0.001) increased F1 to 1.0 and Jaccard to 1.0.
The confusion matrix indicates no misclassifications.
Cross validation and visual inspection confirm the results.

### DECISION TREE

#### Prepare Test and Train Data sets for Decision Tree model

In [178]:
from sklearn.model_selection import train_test_split
xxd_train, xxd_test, yyd_train, yyd_test = train_test_split(Xa_train,Y_train,test_size=0.3,random_state=4)
print ('Train set: ', xxd_train.shape, yyd_train.shape)
print ('Test set: ', xxd_test.shape, yyd_test.shape)

Train set:  (219, 259) (219,)
Test set:  (94, 259) (94,)


#### Initial Decision Tree Instance for Grid Search

In [182]:
#Create decision tree instance and show default parameters
venueTree = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4)
                       
                                   
venueTree

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

#### Determine best hyperparameter settings for the Decision Tree Classifier model

In [183]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

#reuse the Decision Tree instance as the base estimator for Grid Search
dt_venue_cv = venueTree

param_grid = { 
    'criterion': ['gini','entropy'],
    'max_depth': [4,5,6,7,8,9,10,40,100]
}

#search for optimum Decision Tree parameter values with Grid Search
dt_venue_model = GridSearchCV(dt_venue_cv, param_grid, cv=5)

dt_venue_model.fit(xxd_train, yyd_train)

print('Grid Search Best Hyperparameter Results: ', dt_venue_model.best_estimator_)
print('Grid Search Best Score: ', dt_venue_model.best_score_)

Grid Search Best Hyperparameter Results:  DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Grid Search Best Score:  1.0


#### Decision Tree Instance for Training Set using optimum parameters from GridSearchCV results

In [184]:
#Create decision tree instance with the parameters selected by Grid Search
venueTree = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
venueTree

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [185]:
#Train the model for features x and result vector y on test_train split
venueTree.fit(xxd_train, yyd_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

#### Prediction on Test set with Decision Tree algorithm

In [189]:
#Prediction on test_train_split data
predTree = venueTree.predict(xxd_test)

print(predTree[0:350], predTree.shape)
print(yyd_test[0:350], yyd_test.shape)

[4 4 4 1 4 4 4 4 4 4 4 0 4 4 4 4 4 3 4 0 4 4 4 4 1 4 1 4 0 4 4 1 4 4 2 3 4
 4 1 4 4 0 3 1 4 4 4 1 4 0 4 1 2 3 1 3 0 1 1 4 4 0 3 4 0 4 4 4 4 1 1 4 2 4
 2 3 4 3 4 0 4 4 4 0 2 4 2 4 2 4 4 0 3 1] (94,)
[4 4 4 1 4 4 4 4 4 4 4 0 4 4 4 4 4 3 4 0 4 4 4 4 1 4 1 4 0 4 4 1 4 4 2 3 4
 4 1 4 4 0 3 1 4 4 4 1 4 0 4 1 2 3 1 3 0 1 1 4 4 0 3 4 0 4 4 4 4 1 1 4 2 4
 2 3 4 3 4 0 4 4 4 0 2 4 2 4 2 4 4 0 3 1] (94,)


In [191]:
Xa_train.shape

(313, 259)

In [192]:
print(Xa_train[1:1])

Empty DataFrame
Columns: [LSI, Beach, CaribbeanRestaurant, Park, KoreanRestaurant, SeafoodRestaurant, Trail, ChineseRestaurant, TennisStadium, Theater, ItalianRestaurant, Bar, Harbor/Marina, BusStop, PizzaPlace, CoffeeShop, Deli/Bodega, BusStation, DonutShop, DiscountStore, ThemeParkRide/Attraction, BoatorFerry, TrainStation, Campground, GroceryStore, Pharmacy, SurfSpot, Bakery, Hotel, SandwichPlace, ThaiRestaurant, GolfCourse, IndianRestaurant, Bank, Café, Exhibit, MexicanRestaurant, ThemePark, ZooExhibit, ArtGallery, FastFoodRestaurant, FriedChickenJoint, GreekRestaurant, SpanishRestaurant, Supermarket, SushiRestaurant, BaseballField, ClothingStore, RentalCarLocation, IceCreamShop, AmericanRestaurant, BubbleTeaShop, CocktailBar, Diner, Garden, Gym, Spa, Zoo, BagelShop, HardwareStore, Athletics&Sports, BorderCrossing, BusinessService, Food, Intersection, Lounge, MetroStation, Nightclub, PeruvianRestaurant, Pub, Resort, Restaurant, Shop&Service, SportingGoodsShop, TollPlaza, Constructi

#### Evaluate results of Decision Tree model

In [204]:
# determine f1 score for Decision Tree test set prediction
from sklearn.metrics import f1_score
f1_score(yyd_test, predTree, average='micro')

1.0

In [205]:
# verify Jaccard score
from sklearn.metrics import jaccard_score
print('Jaccard score: ', jaccard_score(yyd_test, predTree, average='micro'))

Jaccard score:  1.0


In [206]:
# Classification report
print(classification_report(yyd_test, predTree))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         7
           3       1.00      1.00      1.00         9
           4       1.00      1.00      1.00        53

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94



In [207]:
# Decision Tree model performance
count_misclassified = (yyd_test != predTree).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(yyd_test, predTree)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 0
Accuracy: 1.00


In [208]:
## Inspect average scores with Cross Validation
from sklearn.model_selection import cross_val_score

#Determine average Accuracy Score using Cross Validation
cvdt_model = venueTree

#Train the model with cv=5
cvdt_scores = cross_val_score(cvdt_model, xxd_train, yyd_train, cv=5)
print(cvdt_scores)
print('cvdt_scores mean: {}'.format(np.mean(cvdt_scores)))
print('cvdt_scores standard deviation: {}'.format(np.std(cvdt_scores)))

[1. 1. 1. 1. 1.]
cvdt_scores mean: 1.0
cvdt_scores standard deviation: 0.0


#### Text representation of Decision Tree model logic

In [None]:
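from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

# export_text needs the fitted estimator (venueTree), not the prediction array (predTree)
# feature_names must hold one name per column the tree was trained on;
# this assumes Features has the same columns as the training matrix
r = export_text(venueTree, feature_names=list(Features.columns))
print(r)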

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(Features, labels)
# the fitted attribute is lowercase: feature_importances_
for name, importance in zip(Features.columns, classifier.feature_importances_):
    print(name, importance)
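
Sorted feature importances also work as a quick leakage check: if one column carries nearly all of the importance, it may be a copy of the target. The sketch below deliberately leaks a synthetic label into a hypothetical feature frame to show what that looks like. All names and data here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
labels_demo = rng.randint(0, 5, 200)      # synthetic LSI ranks 0-4

# Hypothetical feature frame: two noise columns plus a copy of the label.
features_demo = pd.DataFrame({
    "Park": rng.rand(200),
    "CoffeeShop": rng.rand(200),
    "LSI": labels_demo,                   # deliberately leaked target
})

tree_demo = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_demo.fit(features_demo, labels_demo)

# Rank features by importance; a single dominant column is a warning sign.
importances = pd.Series(tree_demo.feature_importances_, index=features_demo.columns)
print(importances.sort_values(ascending=False))
```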

#### Graphical representation of Decision Tree Model

In [None]:
from io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline

In [None]:
#Visualize Decision Tree Model for the train_test_split data set
dot_data = StringIO()
filename = "predTree.png"
# feature names must match the columns the tree was trained on, in the same order
featureNames = Features.columns[0:258]
# export_graphviz expects class names as strings, in ascending label order
targetNames = [str(c) for c in np.unique(yyd_train)]
# pass the fitted estimator (venueTree), not the prediction array (predTree)
out=tree.export_graphviz(venueTree,feature_names=featureNames, out_file=dot_data, class_names=targetNames, filled=True,  special_characters=True,rotate=False)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(200, 500))
plt.imshow(img,interpolation='nearest')

#### Results and Conclusions for Decision Tree model

In [None]:
Classification with the Decision Tree model at the optimum parameters found
using Grid Search CV resulted in an F1 score of 1.0 and a Jaccard score of 1.0.
The confusion matrix indicates no misclassifications.
Cross validation and visual inspection confirmed the results.

#### Conclusion:

In [None]:
100% accuracy in predicting test set LSI with no misclassification. Next step would be validation on a larger sample set.

### LOG REGRESSION

#### Prepare Test and Train data sets for Log Regression classification model

In [283]:
from sklearn.model_selection import train_test_split
xxs_train, xxs_test, yys_train, yys_test = train_test_split(Xa_train,Y_train,test_size=0.3,random_state=4)
print ('Train set: ', xxs_train.shape, yys_train.shape)
print ('Test set: ', xxs_test.shape, yys_test.shape)

Train set:  (219, 259) (219,)
Test set:  (94, 259) (94,)


In [284]:
# provide gamma variable for multiclass classification
from sklearn import svm
clf = svm.SVC(kernel='rbf',gamma='auto')
clf.fit(xxs_train, yys_train) 

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [285]:
# svm prediction on test data
yys_hat = clf.predict(xxs_test)

In [286]:
# Pass a selected averaging method appropriate for multiclass/multilabel targets (non-binary classification)
from sklearn.metrics import jaccard_score

jaccard_score(yys_test, yys_hat, average='micro')

0.5161290322580645

In [287]:
# call confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
import itertools

conf_mx = confusion_matrix(yys_test, yys_hat)
conf_mx

array([[11,  0,  0,  0,  0],
       [14,  0,  0,  0,  0],
       [ 7,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  9],
       [ 0,  0,  0,  0, 53]], dtype=int64)

In [288]:
# model performance on the test set (yys_hat comes from the SVC fitted in the cells above)
count_misclassified = (yys_test != yys_hat).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(yys_test, yys_hat)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 30
Accuracy: 0.68
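
The accuracy of 0.68 and the micro-averaged Jaccard score of 0.516 reported above both follow from the same confusion matrix. The sketch below reproduces the arithmetic with NumPy only. It also shows why micro-averaged F1 equals accuracy for single-label multiclass targets, while micro-averaged Jaccard is lower whenever there are errors.

```python
import numpy as np

# Confusion matrix reported above (rows: true LSI rank, columns: predicted).
conf_mx = np.array([
    [11, 0, 0, 0, 0],
    [14, 0, 0, 0, 0],
    [7, 0, 0, 0, 0],
    [0, 0, 0, 0, 9],
    [0, 0, 0, 0, 53],
])

correct = np.trace(conf_mx)   # samples on the diagonal
total = conf_mx.sum()         # all test samples
wrong = total - correct       # misclassified samples

accuracy = correct / total
# Pooled over classes, each error counts once as a false positive
# and once as a false negative.
micro_f1 = 2 * correct / (2 * correct + wrong + wrong)
micro_jaccard = correct / (correct + wrong + wrong)

print(round(accuracy, 4), round(micro_f1, 4), round(micro_jaccard, 4))
```

Here 64 of 94 samples are correct, giving 64/94 for accuracy and micro F1, and 64/124 for micro Jaccard.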


In [407]:
from sklearn.metrics import log_loss
log_loss(yyl_test, yyl_hat)
LR=round(log_loss(yyl_test, yyl_hat),2)
print(LR)

0.86


#### Conclusion

In [None]:
The results from the Log Regression model are much weaker than those of the other models.
All scores and misclassification counts compare unfavorably with the other methods.
A possible cause is the difficulty of scaling Log Regression to high-dimensional classification problems.
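
The cells in this section fit `svm.SVC`, and the `log_loss` cell uses `yyl_test` and `yyl_hat`, whose definitions are not shown. As a hedged sketch of what a logistic regression fit could look like, the example below uses scikit-learn's `LogisticRegression` on synthetic data. The `xxl_*`/`yyl_*` names follow the notebook's naming pattern but are assumptions, as are `C=1.0` and `max_iter=1000`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the venue features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=4
)
xxl_train, xxl_test, yyl_train, yyl_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4
)

lr_model = LogisticRegression(C=1.0, max_iter=1000)
lr_model.fit(xxl_train, yyl_train)

yyl_hat = lr_model.predict(xxl_test)          # hard class predictions
yyl_prob = lr_model.predict_proba(xxl_test)   # per-class probabilities

test_accuracy = accuracy_score(yyl_test, yyl_hat)
# log_loss is computed from probabilities, not from predicted labels
test_log_loss = log_loss(yyl_test, yyl_prob, labels=lr_model.classes_)

print('Accuracy: {:.2f}'.format(test_accuracy))
print('Log loss: {:.2f}'.format(test_log_loss))
```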

### KNN K - Nearest Neighbor

#### Prepare the Test and Train data sets for KNN classification model

In [337]:
from sklearn.model_selection import train_test_split
xxk_train, xxk_test, yyk_train, yyk_test = train_test_split(Xa_train,Y_train,test_size=0.3,random_state=4)
print ('Train set: ', xxk_train.shape, yyk_train.shape)
print ('Test set: ', xxk_test.shape, yyk_test.shape)

Train set:  (219, 259) (219,)
Test set:  (94, 259) (94,)


#### KNN model instance

In [338]:
#Train the KNN model on the training split using k = 5 (k values are compared further below)
k = 5
kNN_model = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=k, p=2,
           weights='uniform').fit(xxk_train,yyk_train)
kNN_model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

#### KNN model prediction on Test data set

In [339]:
# prediction of Y for test set
yyk_hat = kNN_model.predict(xxk_test)
yyk_hat[0:350],yyk_test[0:350]

(array([4, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4,
        4, 4, 1, 4, 1, 4, 0, 4, 4, 1, 4, 4, 2, 3, 4, 4, 1, 4, 4, 0, 3, 1,
        4, 4, 4, 1, 4, 0, 4, 1, 2, 3, 1, 3, 0, 1, 1, 4, 4, 0, 3, 4, 0, 4,
        4, 4, 4, 1, 1, 4, 2, 4, 2, 3, 4, 3, 4, 0, 4, 4, 4, 0, 2, 4, 2, 4,
        2, 4, 4, 0, 3, 1], dtype=int64),
 array([4, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4,
        4, 4, 1, 4, 1, 4, 0, 4, 4, 1, 4, 4, 2, 3, 4, 4, 1, 4, 4, 0, 3, 1,
        4, 4, 4, 1, 4, 0, 4, 1, 2, 3, 1, 3, 0, 1, 1, 4, 4, 0, 3, 4, 0, 4,
        4, 4, 4, 1, 1, 4, 2, 4, 2, 3, 4, 3, 4, 0, 4, 4, 4, 0, 2, 4, 2, 4,
        2, 4, 4, 0, 3, 1], dtype=int64))

In [340]:
# evaluate accuracy score on train and test split sets

print('Accuracy KNN Train set: ', metrics.accuracy_score(yyk_train, kNN_model.predict(xxk_train)))
print('Accuracy KNN Test set: ', metrics.accuracy_score(yyk_test, yyk_hat))

Accuracy KNN Train set:  1.0
Accuracy KNN Test set:  1.0


In [341]:
metrics.jaccard_score(yyk_test, yyk_hat,average='micro')

1.0

In [342]:
# determine f1 score for KNN test split set
from sklearn.metrics import f1_score
f1_score(yyk_test, yyk_hat, average='micro')

1.0

In [343]:
# Determining optimal k value for train_test_split
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
ConfusionMx = [];
for n in range(1,Ks):
    kNN_model = KNeighborsClassifier(n_neighbors = n).fit(xxk_train, yyk_train)
    yyk_hat=kNN_model.predict(xxk_test)
    mean_acc[n-1] = metrics.accuracy_score(yyk_test, yyk_hat)

    std_acc[n-1]=np.std(yyk_hat==yyk_test)/np.sqrt(yyk_hat.shape[0])

mean_acc

array([1., 1., 1., 1., 1., 1., 1., 1., 1.])
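
The loop above scores each `k` against the test split. A hedged alternative is to choose `k` by cross-validation on the training portion, which leaves the test split untouched for the final evaluation. The sketch below uses synthetic data, with `X_demo` and `y_demo` standing in for the training split. All names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the training split.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=4
)

k_values = list(range(1, 10))
cv_means = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean 5-fold accuracy, computed on the training portion only
    scores = cross_val_score(knn, X_demo, y_demo, cv=5)
    cv_means.append(scores.mean())

# pick the k with the highest mean cross-validated accuracy
best_k = k_values[int(np.argmax(cv_means))]
print(best_k, round(max(cv_means), 3))
```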

#### Evaluation of KNN Classification model results

In [344]:
# Classification report for KNN train split set
print(classification_report(yyk_train, kNN_model.predict(xxk_train)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        36
           1       1.00      1.00      1.00        29
           2       1.00      1.00      1.00        14
           3       1.00      1.00      1.00        14
           4       1.00      1.00      1.00       126

    accuracy                           1.00       219
   macro avg       1.00      1.00      1.00       219
weighted avg       1.00      1.00      1.00       219



In [345]:
# inspect classification report for KNN test split set
print(classification_report(yyk_test, yyk_hat))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         7
           3       1.00      1.00      1.00         9
           4       1.00      1.00      1.00        53

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94



In [346]:
# Inspect average scores with cross-validation
from sklearn.model_selection import cross_val_score

# Determine average accuracy score using cross-validation
kNN_model = KNeighborsClassifier(n_neighbors=5)

# Score the model with 5-fold cross-validation (cv=5)
cv_scores = cross_val_score(kNN_model, xxk_train, yyk_train, cv=5)
print(cv_scores)
print('cv_scores mean: {}'.format(np.mean(cv_scores)))
print('cv_scores standard deviation: {}'.format(np.std(cv_scores)))

[1. 1. 1. 1. 1.]
cv_scores mean: 1.0
cv_scores standard deviation: 0.0


In [347]:
# KNN model performance
count_misclassified = (yyk_test != yyk_hat).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(yyk_test, yyk_hat)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 0
Accuracy: 1.00


#### Conclusion:

In [None]:
100% accuracy in predicting test set LSI, with no misclassifications. The next step would be validation on a larger sample set.

### GAUSSIAN NAIVE BAYES 

#### Prepare Test and Training data sets for Naive Bayes model

In [348]:
from sklearn.model_selection import train_test_split
xxn_train, xxn_test, yyn_train, yyn_test = train_test_split(Xa_train,Y_train,test_size=0.3,random_state=4)
print ('Train set: ', xxn_train.shape, yyn_train.shape)
print ('Test set: ', xxn_test.shape, yyn_test.shape)

Train set:  (219, 259) (219,)
Test set:  (94, 259) (94,)


#### Gaussian Naive Bayes instance and model Prediction

In [349]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(xxn_train, yyn_train) 
gnb_pred = gnb.predict(xxn_test)

In [355]:
yyn_test[0:],gnb_pred[0:]

(array([4, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4,
        4, 4, 1, 4, 1, 4, 0, 4, 4, 1, 4, 4, 2, 3, 4, 4, 1, 4, 4, 0, 3, 1,
        4, 4, 4, 1, 4, 0, 4, 1, 2, 3, 1, 3, 0, 1, 1, 4, 4, 0, 3, 4, 0, 4,
        4, 4, 4, 1, 1, 4, 2, 4, 2, 3, 4, 3, 4, 0, 4, 4, 4, 0, 2, 4, 2, 4,
        2, 4, 4, 0, 3, 1], dtype=int64),
 array([4, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4,
        4, 4, 1, 4, 1, 4, 0, 4, 4, 1, 4, 4, 2, 3, 4, 4, 1, 4, 4, 0, 3, 1,
        4, 4, 4, 1, 4, 0, 4, 1, 2, 3, 1, 3, 0, 1, 1, 4, 4, 0, 3, 4, 0, 4,
        4, 4, 4, 1, 1, 4, 2, 4, 2, 3, 4, 3, 4, 0, 4, 4, 4, 0, 2, 4, 2, 4,
        2, 4, 4, 0, 3, 1], dtype=int64))

#### Evaluate performance of Gaussian Naive Bayes model

In [356]:
# determine f1 score 
from sklearn.metrics import f1_score
f1_score(yyn_test, gnb_pred, average='micro')

1.0

In [357]:
# jaccard similarity score
metrics.jaccard_score(yyn_test, gnb_pred,average='micro')

1.0

In [358]:
# confusion matrix
cm_gnb = confusion_matrix(yyn_test,  gnb_pred)
cm_gnb

array([[11,  0,  0,  0,  0],
       [ 0, 14,  0,  0,  0],
       [ 0,  0,  7,  0,  0],
       [ 0,  0,  0,  9,  0],
       [ 0,  0,  0,  0, 53]], dtype=int64)

In [353]:
# Naive Bayes model performance
count_misclassified = (yyn_test != gnb_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
# use the Naive Bayes test labels and predictions (not the KNN variables)
accuracy = metrics.accuracy_score(yyn_test, gnb_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 0
Accuracy: 1.00


#### Conclusion:

In [None]:
100% accuracy in predicting test set LSI, with no misclassifications. The next step would be validation on a larger sample set.

### RANDOM FOREST

#### Prepare Train and Test data sets for Random Forest classification model

In [309]:
from sklearn.model_selection import train_test_split
xxf_train, xxf_test, yyf_train, yyf_test = train_test_split(Xa_train,Y_train,test_size=0.3,random_state=4)
print ('Train set: ', xxf_train.shape, yyf_train.shape)
print ('Test set: ', xxf_test.shape, yyf_test.shape)

Train set:  (219, 259) (219,)
Test set:  (94, 259) (94,)


#### Create Random Forest model instance

In [310]:
from sklearn.ensemble import RandomForestClassifier
clfRF = RandomForestClassifier(n_estimators=10)

#### Train the Random Forest model

In [311]:
clfRF.fit(xxf_train,yyf_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [312]:
#Determine best hyperparameters
from sklearn.model_selection import GridSearchCV

# use the Random Forest estimator as the base model for Grid Search (GridSearchCV clones it internally)
rf_cv_model = clfRF

param_grid = { 
    'n_estimators': [15, 50, 100],
    'max_features': ['sqrt', .025, 0.5, 1.0],
    'min_samples_split': [2, 4, 6]
}

# search for optimum Random Forest parameter values with Grid Search
rf_cv_model = GridSearchCV(rf_cv_model, param_grid, cv=5)

rf_cv_model.fit(xxf_train, yyf_train)

print('Grid Search Best Estimator: ', rf_cv_model.best_estimator_)
print('Grid Search Best Hyperparameter Results: ', rf_cv_model.best_params_)
print('Grid Search Best Score: ', rf_cv_model.best_score_)

Grid Search Best Estimator:  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=1.0, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=15,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
Grid Search Best Hyperparameter Results:  {'max_features': 1.0, 'min_samples_split': 2, 'n_estimators': 15}
Grid Search Best Score:  1.0
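
By default `GridSearchCV` refits the winning parameter combination on the whole training split (`refit=True`), so the fitted search object can be used for prediction without copying the parameters by hand. A minimal sketch on synthetic data; the `make_classification` data, the reduced parameter grid, and the names `search` and `best_rf` are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the venue feature matrix (assumption, not project data)
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=3, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

search = GridSearchCV(
    RandomForestClassifier(random_state=4),
    param_grid={'n_estimators': [15, 50], 'min_samples_split': [2, 4]},
    cv=5,
)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already trained on all of X_tr
best_rf = search.best_estimator_
test_accuracy = best_rf.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

Fixing `random_state` on the forest also makes repeated runs comparable, which the default `random_state=None` does not.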




#### Predictions with the Random Forest model

In [313]:
yyf_pred=clfRF.predict(xxf_test)

#### Evaluate Random Forest model predictions

In [314]:
# jaccard similarity score
metrics.jaccard_score(yyf_test, yyf_pred,average='micro')

0.7904761904761904

In [315]:
# determine accuracy (equal to the micro-averaged f1 score for single-label multiclass data)
print("Accuracy:",metrics.accuracy_score(yyf_test, yyf_pred))

Accuracy: 0.8829787234042553
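
The Jaccard and accuracy scores above describe the same predictions. For single-label multiclass data each error counts once as a false positive and once as a false negative, so the micro-averaged Jaccard score works out to correct / (correct + 2 × errors). A short check, assuming 83 correct predictions out of 94 (inferred from the reported accuracy); the helper `micro_jaccard` is hypothetical.

```python
def micro_jaccard(n_correct, n_total):
    """Micro-averaged Jaccard for single-label multiclass predictions."""
    n_errors = n_total - n_correct
    # TP = n_correct; every error adds one FP and one FN
    return n_correct / (n_correct + 2 * n_errors)

n_total = 94      # test set size
n_correct = 83    # inferred from accuracy 0.8829787... * 94
accuracy = n_correct / n_total
jaccard = micro_jaccard(n_correct, n_total)
print(accuracy, jaccard)
```

Both values agree with the scores reported above, so the lower Jaccard figure does not indicate a separate problem; it is the same error count on a stricter scale.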


#### Predictions with the Grid Model with tuned parameters

In [316]:
clf_csv_RF = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=1.0, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=15,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [317]:
clf_csv_RF.fit(xxf_train,yyf_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=1.0, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=15,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [318]:
yyf2_pred=clf_csv_RF.predict(xxf_test)

In [319]:
# jaccard similarity score
metrics.jaccard_score(yyf_test, yyf2_pred,average='micro')

1.0

In [320]:
# determine accuracy (equal to the micro-averaged f1 score for single-label multiclass data)
print("Accuracy:",metrics.accuracy_score(yyf_test, yyf2_pred))

Accuracy: 1.0


In [321]:
# confusion matrix
yyf2_test = yyf_test
cm2_RF = confusion_matrix(yyf2_test, yyf2_pred)
cm2_RF

array([[11,  0,  0,  0,  0],
       [ 0, 14,  0,  0,  0],
       [ 0,  0,  7,  0,  0],
       [ 0,  0,  0,  9,  0],
       [ 0,  0,  0,  0, 53]], dtype=int64)

In [322]:
#  model performance
RF_csv_count_misclassified = (yyf_test != yyf2_pred).sum()
print('CSV Misclassified samples: {}'.format(RF_csv_count_misclassified))
CSV_accuracy = metrics.accuracy_score(yyf_test, yyf2_pred)
print('Accuracy: {:.2f}'.format(CSV_accuracy))

CSV Misclassified samples: 0
Accuracy: 1.00


In [323]:
# confusion matrix
cm_RF = confusion_matrix(yyf_test, yyf_pred)
cm_RF

array([[ 7,  4,  0,  0,  0],
       [ 1, 13,  0,  0,  0],
       [ 0,  0,  5,  0,  2],
       [ 2,  0,  1,  6,  0],
       [ 0,  1,  0,  0, 52]], dtype=int64)

In [324]:
# model performance of the untuned Random Forest
RFcount_misclassified = (yyf_test != yyf_pred).sum()
# print the Random Forest count (not the stale count_misclassified from an earlier model)
print('Misclassified samples: {}'.format(RFcount_misclassified))
accuracy = metrics.accuracy_score(yyf_test, yyf_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 11
Accuracy: 0.88
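
The misclassification count can also be read directly from the confusion matrix, since every off-diagonal entry is an error. A small sketch using the untuned Random Forest matrix `cm_RF` printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the untuned Random Forest, copied from the output above
cm = np.array([[ 7,  4,  0,  0,  0],
               [ 1, 13,  0,  0,  0],
               [ 0,  0,  5,  0,  2],
               [ 2,  0,  1,  6,  0],
               [ 0,  1,  0,  0, 52]])

n_total = int(cm.sum())            # all test samples
n_correct = int(np.trace(cm))      # diagonal = correct predictions
n_misclassified = n_total - n_correct
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print('Misclassified samples: {}'.format(n_misclassified))
print('Accuracy: {:.2f}'.format(n_correct / n_total))
print('Recall per class:', np.round(per_class_recall, 2))
```

This gives 11 misclassified samples out of 94, the figure listed for Random Forest in the results report.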


#### Conclusion:  

In [None]:
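The untuned Random Forest (n_estimators=10) reached 88% accuracy on the test set, with 11 misclassifications. The model refit with the Grid Search parameters reached 100% accuracy with no misclassifications. The next step would be validation on a larger sample set.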

# =======================================

### Report of Results for Classification Models

In [17]:
report_data = [['KNN',1.0,1.0,'N/A',0],
                  ['Decision Tree',1.0,1.0,'N/A',0],
                  ['SVM',1.0,1.0,'N/A',0],
                  ['Log Regression',.5168,.680,.86,30],
                  ['Naive Bayes',1.0,1.0,'N/A',0],
                  ['Random Forest',0.79,0.88,'N/A',11]]         

In [18]:
report_df2 = pd.DataFrame(report_data, columns = ['Algorithm','Jaccard','F1-score','LogLoss','# Misclassified'])
report_df2 = round(report_df2,2)
report_df2 = report_df2.set_index('Algorithm')

In [19]:
print('Test Set: 94 Neighborhoods, 259 Venues')
print("Target Classes: LSI Ranks, 1-5")
print("")
report_df2

Test Set: 94 Neighborhoods, 259 Venues
Target Classes: LSI Ranks, 1-5



Algorithm,Jaccard,F1-score,LogLoss,# Misclassified
KNN,1.0,1.0,N/A,0
Decision Tree,1.0,1.0,N/A,0
SVM,1.0,1.0,N/A,0
Log Regression,0.52,0.68,0.86,30
Naive Bayes,1.0,1.0,N/A,0
Random Forest,0.79,0.88,N/A,11


### CONCLUSIONS

In [None]:
Model results indicate that the Economic profiles of Communities (Life Satisfaction Indicators) can be predicted accurately
from their respective sets of Public Venues.

    * The KNN, Decision Tree, SVM, and Naive Bayes models achieved 100% accuracy on the test data set,  
    after the model parameters were optimized using GridSearchCV. 

    * The Log Regression and Random Forest models showed lower accuracy and precision scores, with 30 and 11 misclassifications respectively.

    * All model results need to be validated on a larger data set.
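
Until a larger data set is available, one possible intermediate check is repeated stratified cross-validation compared against a majority-class baseline. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` with a class mix loosely modeled on the 313 neighborhoods (about 57% in the largest class), not the project data, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced 5-class stand-in (assumption, not project data)
X, y = make_classification(n_samples=313, n_features=30, n_informative=10,
                           n_classes=5, weights=[0.15, 0.14, 0.07, 0.07, 0.57],
                           random_state=4)

# 5 folds repeated 3 times gives 15 scores per model, each fold keeping the class mix
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=4)
knn_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
base_scores = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y, cv=cv)

print('KNN      mean={:.3f} std={:.3f}'.format(knn_scores.mean(), knn_scores.std()))
print('Baseline mean={:.3f} std={:.3f}'.format(base_scores.mean(), base_scores.std()))
```

The baseline shows how much accuracy comes from the dominant class alone, and the spread across repeats indicates how sensitive a score is to the particular split.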
