# Capstone Project - The Battle of the Neighborhoods 

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>

The objective of this project is to give people a tool for exploring neighborhoods, helping those who want to find the best place to live in Carabobo, Venezuela.

Carabobo is an industrial state, so many people need to migrate to it. For this reason, it is important to have a tool that helps people make better, smarter decisions.

In this project I create an analysis of neighborhood features for people who are searching for the best places, through a comparative analysis between neighborhoods.

It will help people make better decisions about the places they visit, improving their satisfaction.




## Data <a name="data"></a>

**Foursquare API:**

In this project I use the Foursquare API. It has a large database that provides locations and details about the businesses at those locations.
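
As a rough illustration of how a venue query can be assembled, the sketch below builds a request URL without sending it. The endpoint and parameter names are the ones used by the request later in this notebook; `build_explore_url` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint as used by the request later in this notebook
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180605", radius=500, limit=100):
    """Hypothetical helper: return the explore URL for venues around (lat, lng)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude pair
        "radius": radius,       # search radius in meters
        "limit": limit,         # maximum number of venues requested
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials and a sample coordinate, for illustration only
url = build_explore_url("YOUR_ID", "YOUR_SECRET", 10.174, -68.259)
query = parse_qs(urlparse(url).query)
print(query["ll"], query["radius"])
```

Letting `urlencode` build the query string avoids hand-formatting mistakes such as a stray `?&` or unescaped characters.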

**Clustering Approach:**

To compare the neighborhoods, I decided to explore them, segment them, and group them into clusters to find similarities.

To cluster the data I need a form of unsupervised machine learning: the k-means clustering algorithm.
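
To make the idea concrete, here is a minimal from-scratch sketch of the k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is an illustration on made-up toy points, not the scikit-learn implementation used in this notebook, and the `kmeans` function name and its parameters are assumptions.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means: returns (labels, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point gets the index of its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two well-separated toy groups (made-up values)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary. Only which points share a label is meaningful.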

**Libraries**

Pandas: for working with dataframes.

Folium: a Python visualization library, used to show the distribution of the neighborhood clusters on an interactive Leaflet map.

Scikit-learn: for the k-means clustering implementation.

Geocoder: to retrieve location data.

Beautiful Soup and Requests: to scrape web pages.

Matplotlib: to make plots and color maps.

NumPy: to work with numerical arrays.

## Methodology <a name="methodology"></a>

In this project I identified areas in Carabobo state (capital: Valencia) and then found the most important venues near each of them.

In the first step I collected the required data: the location of each zone in Carabobo. I also identified the main venues in each zone (according to Foursquare categorization).

The second step was the calculation and exploration of 'venue density' across different areas of Carabobo.

In the third and final step I created clusters of locations. I present a map of all such locations and also use k-means clustering to group them, identifying general zones / neighborhoods / addresses that should serve as a starting point for people who want to live in Carabobo.

In [None]:
!pip install geocoder
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests
import geocoder
import folium 
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim



In [None]:
url=('https://es.wikipedia.org/wiki/Anexo:Municipios_de_Carabobo')
table = pd.read_html(url)
df=pd.DataFrame(np.concatenate(table))
df.columns=['PRI','SEG','Borough','TER','CUAR','QUIN','SEX','Neighborhood']
df2=df.drop(['PRI', 'SEG', 'TER','CUAR','QUIN','SEX'], axis=1)
df2.dropna(subset = ["Neighborhood"], inplace=True)
df2["Neighborhood"] = df2["Neighborhood"].str[2:]
df2.groupby(['Borough'])['Neighborhood'].apply(','.join).reset_index()


Unnamed: 0,Borough,Neighborhood
0,Bejuma,"Bejuma,Canoabo,Simón Bolívar"
1,Carlos Arvelo,"Belén,Güigüe,Tacarigua"
2,Diego Ibarra,"Aguas Calientes,Mariara"
3,Guacara,"Ciudad Alianza, Guacara, Yagua"
4,Juan José Mora,"Morón, Urama"
5,Libertador,"Independencia, Tocuyito"
6,Los Guayos,Los Guayos
7,Miranda,Miranda
8,Montalbán,Montalbán
9,Naguanagua,Naguanagua


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
coordinate=pd.read_csv('/content/drive/MyDrive/Coordenadas4.csv')
c2=coordinate.drop(['Unnamed: 0'], axis=1)
c2

Unnamed: 0,Latitude,Longitude
0,10.174,-68.259
1,10.083,-67.783
2,10.294,-67.711
3,10.226,-67.877
4,10.484,-68.204
5,10.114,-68.066
6,10.183,-67.933
7,10.147,-68.396
8,10.203,-68.3
9,10.254,-68.01


In [None]:
df_final=pd.merge(df2, c2, left_index=True, right_index=True)
df_final

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
1,Bejuma,Bejuma,10.083,-67.783
2,Bejuma,Canoabo,10.294,-67.711
3,Bejuma,Simón Bolívar,10.226,-67.877
5,Carlos Arvelo,Belén,10.114,-68.066
6,Carlos Arvelo,Güigüe,10.183,-67.933
7,Carlos Arvelo,Tacarigua,10.147,-68.396
9,Diego Ibarra,Aguas Calientes,10.254,-68.01
10,Diego Ibarra,Mariara,10.46,-68.01
12,Guacara,Ciudad Alianza,10.2611,-67.7928
13,Guacara,Guacara,10.1741,-67.9998


In [None]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (do not publish real credentials)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return (nearby_venues)

In [None]:
carabobo_venues=getNearbyVenues(names=df_final['Neighborhood'], latitudes=df_final['Latitude'],longitudes=df_final['Longitude'])
carabobo_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Guacara,3,3,3,3,3,3
Aguas Calientes,7,7,7,7,7,7
Bejuma,2,2,2,2,2,2
Ciudad Alianza,2,2,2,2,2,2
Mariara,4,4,4,4,4,4
Simón Bolívar,2,2,2,2,2,2
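
The counts above can be turned into a rough 'venue density' figure. The sketch below assumes density means venues per square kilometer of the circular search area, using the 500 m default radius of `getNearbyVenues`. That definition is an assumption made for illustration, and the counts are copied by hand from the table.

```python
import math

# Venue counts per neighborhood, copied from the table above
venue_counts = {
    "Guacara": 3,
    "Aguas Calientes": 7,
    "Bejuma": 2,
    "Ciudad Alianza": 2,
    "Mariara": 4,
    "Simón Bolívar": 2,
}

RADIUS_M = 500  # default search radius of getNearbyVenues, in meters
area_km2 = math.pi * (RADIUS_M / 1000) ** 2  # area of the search circle

# Assumed definition: venues per square kilometer of searched area
density = {name: count / area_km2 for name, count in venue_counts.items()}

for name, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.1f} venues per km²")
```

Because every neighborhood uses the same radius, the ranking is the same as for the raw counts. The density only becomes more informative if different radii are used per area.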


In [None]:
# one hot encoding
carabobo_onehot = pd.get_dummies(carabobo_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
carabobo_onehot['Neighborhood'] = carabobo_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [carabobo_onehot.columns[-1]] + list(carabobo_onehot.columns[:-1])
carabobo_onehot = carabobo_onehot[fixed_columns]

carabobo_grouped = carabobo_onehot.groupby('Neighborhood').mean().reset_index()
carabobo_grouped

Unnamed: 0,Neighborhood,BBQ Joint,Bakery,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Furniture / Home Store,Government Building,Metro Station,Park,Pharmacy,Pie Shop,Pizza Place,Plaza,Restaurant,Salad Place,Shopping Mall,Soccer Field
0,Guacara,0.0,0.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Aguas Calientes,0.142857,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.142857,0.142857,0.0,0.0,0.0,0.142857
2,Bejuma,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Ciudad Alianza,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0
4,Mariara,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.25,0.0,0.0
5,Simón Bolívar,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0
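
Each value in this table is the share of a neighborhood's venues that fall in a given category: one-hot encoding followed by a per-neighborhood mean is equivalent to counting categories and dividing by the number of venues. A small plain-Python sketch of that equivalence, using the three Guacara categories read from the table (`category_frequencies` is a hypothetical helper name):

```python
from collections import Counter

# Categories of the three venues found for Guacara, read from the table above
guacara_categories = ["Bar", "Department Store", "Metro Station"]


def category_frequencies(categories):
    """Share of venues per category, i.e. the mean of the one-hot columns."""
    counts = Counter(categories)
    total = len(categories)
    return {category: count / total for category, count in counts.items()}


freqs = category_frequencies(guacara_categories)
print(freqs)
```

With only two or three venues per neighborhood, each frequency can only take a few values (for example 0.5 or 0.333), so the resulting profiles are quite coarse.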


In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond 'st', 'nd', 'rd' use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = carabobo_grouped['Neighborhood']

for ind in np.arange(carabobo_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(carabobo_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Guacara,Metro Station,Bar,Department Store,Shopping Mall,Bakery,Coffee Shop,Fast Food Restaurant,Furniture / Home Store,Government Building,Soccer Field
1,Aguas Calientes,Soccer Field,Pizza Place,Bakery,Coffee Shop,Park,BBQ Joint,Plaza,Pie Shop,Pharmacy,Shopping Mall
2,Bejuma,Government Building,Pharmacy,Soccer Field,Bakery,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Furniture / Home Store,Metro Station
3,Ciudad Alianza,Shopping Mall,Bakery,Soccer Field,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Furniture / Home Store,Government Building,Metro Station
4,Mariara,Salad Place,Restaurant,Pie Shop,Fast Food Restaurant,Soccer Field,Furniture / Home Store,Bakery,Bar,Coffee Shop,Department Store
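
Note that several ranked entries in this table have a frequency of zero: Bejuma, for example, has only two venues (Government Building and Pharmacy), so its 3rd to 10th "most common" venues are categories that do not occur there at all. One possible way to avoid this, sketched here in plain Python with a hypothetical `top_venues` helper, is to rank only the categories that are actually present:

```python
def top_venues(frequencies, num_top_venues):
    """Return up to num_top_venues categories, skipping zero frequencies."""
    present = [(cat, freq) for cat, freq in frequencies.items() if freq > 0]
    # sort is stable, so categories with equal frequency keep their input order
    present.sort(key=lambda item: item[1], reverse=True)
    return [cat for cat, _ in present[:num_top_venues]]


# Part of the Bejuma row from the frequency table (zeros included on purpose)
bejuma = {
    "Government Building": 0.5,
    "Pharmacy": 0.5,
    "Soccer Field": 0.0,
    "Bakery": 0.0,
}
top = top_venues(bejuma, 10)
print(top)
```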


In [None]:
# set number of clusters
kclusters = 5

# drop the label column so only the numeric category frequencies are clustered
carabobo_grouped_clustering = carabobo_grouped.drop(columns='Neighborhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(carabobo_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 1, 3, 0, 1, 2], dtype=int32)
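
A quick sanity check on these labels is to count how many neighborhoods land in each cluster. The sketch below hard-codes the labels printed above and the neighborhood order of `carabobo_grouped`. It is only a diagnostic, not part of the original pipeline.

```python
from collections import Counter

# Neighborhood order as in carabobo_grouped, labels as printed above
neighborhoods = ["Guacara", "Aguas Calientes", "Bejuma",
                 "Ciudad Alianza", "Mariara", "Simón Bolívar"]
labels = [4, 1, 3, 0, 1, 2]

sizes = Counter(labels)  # cluster label -> number of neighborhoods
singletons = [name for name, label in zip(neighborhoods, labels)
              if sizes[label] == 1]

print(dict(sizes))
print(singletons)
```

With six neighborhoods and `kclusters = 5`, four of the five clusters contain a single neighborhood, so the grouping says little about similarity. Trying a smaller number of clusters may be worth considering.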

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

carabobo_merged = df_final

# join the cluster labels and top venues onto df_final, which holds latitude/longitude for each neighborhood
carabobo_merged = carabobo_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

carabobo_merged.head() 

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Bejuma,Bejuma,10.083,-67.783,3.0,Government Building,Pharmacy,Soccer Field,Bakery,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Furniture / Home Store,Metro Station
2,Bejuma,Canoabo,10.294,-67.711,,,,,,,,,,,
3,Bejuma,Simón Bolívar,10.226,-67.877,2.0,Plaza,Furniture / Home Store,Soccer Field,Government Building,Bakery,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Metro Station
5,Carlos Arvelo,Belén,10.114,-68.066,,,,,,,,,,,
6,Carlos Arvelo,Güigüe,10.183,-67.933,,,,,,,,,,,
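
Rows such as Canoabo, Belén and Güigüe have no cluster label because no venues were returned for them, so they never entered the clustering. The left join therefore leaves `NaN`, which also turns the label column into floats (3.0, 2.0). A small sketch of one way to separate the two groups, using a hand-made three-row sample rather than the real dataframes:

```python
import pandas as pd

# Hand-made sample mirroring the first rows of the merged table above
neighborhoods = pd.DataFrame({
    "Neighborhood": ["Bejuma", "Canoabo", "Simón Bolívar"],
    "Latitude": [10.083, 10.294, 10.226],
    "Longitude": [-67.783, -67.711, -67.877],
})
clusters = pd.DataFrame({
    "Neighborhood": ["Bejuma", "Simón Bolívar"],
    "Cluster Labels": [3, 2],
})

# Left join: neighborhoods without venues get NaN as their label
merged = neighborhoods.join(clusters.set_index("Neighborhood"), on="Neighborhood")

# Neighborhoods that could not be clustered
unclustered = merged.loc[merged["Cluster Labels"].isna(), "Neighborhood"].tolist()

# Keep only clustered rows and restore integer labels
clustered = merged.dropna(subset=["Cluster Labels"]).copy()
clustered["Cluster Labels"] = clustered["Cluster Labels"].astype(int)

print(unclustered)
print(clustered)
```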


In [None]:
address = 'Carabobo, VE'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster
for lat, lon, poi, cluster in zip(carabobo_merged['Latitude'], carabobo_merged['Longitude'], carabobo_merged['Neighborhood'], carabobo_merged['Cluster Labels']):
    # neighborhoods without venues have no cluster label (NaN); draw them in gray
    color = rainbow[int(cluster)] if pd.notna(cluster) else 'gray'
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis by listing the neighborhoods that fall in each cluster.

In [None]:
carabobo_merged.loc[carabobo_merged['Cluster Labels'] == 0, carabobo_merged.columns[[1] + list(range(5, carabobo_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Ciudad Alianza,Shopping Mall,Bakery,Soccer Field,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Furniture / Home Store,Government Building,Metro Station


In [None]:
carabobo_merged.loc[carabobo_merged['Cluster Labels'] == 1, carabobo_merged.columns[[1] + list(range(5, carabobo_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Aguas Calientes,Soccer Field,Pizza Place,Bakery,Coffee Shop,Park,BBQ Joint,Plaza,Pie Shop,Pharmacy,Shopping Mall
10,Mariara,Salad Place,Restaurant,Pie Shop,Fast Food Restaurant,Soccer Field,Furniture / Home Store,Bakery,Bar,Coffee Shop,Department Store


In [None]:
carabobo_merged.loc[carabobo_merged['Cluster Labels'] == 2, carabobo_merged.columns[[1] + list(range(5, carabobo_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Simón Bolívar,Plaza,Furniture / Home Store,Soccer Field,Government Building,Bakery,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Metro Station


In [None]:
carabobo_merged.loc[carabobo_merged['Cluster Labels'] == 3, carabobo_merged.columns[[1] + list(range(5, carabobo_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Bejuma,Government Building,Pharmacy,Soccer Field,Bakery,Bar,Coffee Shop,Department Store,Fast Food Restaurant,Furniture / Home Store,Metro Station


In [None]:
carabobo_merged.loc[carabobo_merged['Cluster Labels'] == 4, carabobo_merged.columns[[1] + list(range(5, carabobo_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,Guacara,Metro Station,Bar,Department Store,Shopping Mall,Bakery,Coffee Shop,Fast Food Restaurant,Furniture / Home Store,Government Building,Soccer Field


## Results and Discussion <a name="results"></a>

The analysis shows that, although there are many venues in Carabobo, the highest concentration of venues was detected in Ciudad Alianza, Aguas Calientes, Mariara, Guacara, and Bejuma, so I focused my attention on these areas.
Those location candidates were then clustered to create zones of interest that contain the greatest number of venues. Addresses of the centers of those zones were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors.
This, of course, does not imply that those zones are actually the best places for newcomers to live. The purpose of this analysis was only to provide information about the zones with the greatest variety of venues; they may well be a good starting point when looking for the zones that best fit each person.
Recommended zones should therefore be considered only as a starting point for more detailed analysis that can integrate other information, such as house prices, schools, etc.




## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify the zones with the most venues, to give people who want to live in Carabobo a better perspective on each zone. By calculating the venue density distribution from Foursquare data, I first identified the zones with venues. Clustering of those locations was then performed in order to create major zones of interest (containing the greatest number of venues).

