# Clustering
This file contains instructions on how to approach the challenge.

We can use different types of clustering algorithms:

- KMeans
- Hierarchical
- DBSCAN
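
As a rough illustration of how the three algorithm families listed above are called in scikit-learn, here is a minimal sketch on synthetic data. The blob centers, `eps`, and `min_samples` values are assumptions chosen for the toy data, not recommendations for the NYC dataset.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
    # DBSCAN does not take a number of clusters; it infers them from density.
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5),
}

labels_by_model = {name: model.fit_predict(X) for name, model in models.items()}

for name, labels in labels_by_model.items():
    # DBSCAN marks noise points with the label -1, so exclude it from the count.
    n_found = len(set(labels) - {-1})
    print(name, "found", n_found, "clusters")
```

KMeans and hierarchical clustering need the number of clusters up front, while DBSCAN needs density parameters instead, so the three are tuned quite differently.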

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

In [48]:
nyc_health = pd.read_csv("NYC_care.csv")
df = pd.read_csv('cities_df.csv')

In [49]:
df

Unnamed: 0,neighborhood,longitude,latitude,borough
0,Wakefield,-73.847201,40.894705,Bronx
1,Co-op City,-73.829939,40.874294,Bronx
2,Eastchester,-73.827806,40.887556,Bronx
3,Fieldston,-73.905643,40.895437,Bronx
4,Riverdale,-73.912585,40.890834,Bronx
...,...,...,...,...
301,Hudson Yards,-74.000111,40.756658,Manhattan
302,Hammels,-73.805530,40.587338,Queens
303,Bayswater,-73.765968,40.611322,Queens
304,Queensbridge,-73.945631,40.756091,Queens


In [50]:
df = df.sort_values('neighborhood')

In [51]:
df

Unnamed: 0,neighborhood,longitude,latitude,borough
298,Allerton,-73.859319,40.865788,Bronx
215,Annadale,-74.178549,40.538114,Staten Island
241,Arden Heights,-74.185887,40.549286,Staten Island
227,Arlington,-74.165104,40.635325,Staten Island
228,Arrochar,-74.067124,40.596313,Staten Island
...,...,...,...,...
146,Woodhaven,-73.858110,40.689887,Queens
7,Woodlawn,-73.867315,40.898273,Bronx
216,Woodrow,-74.205246,40.541968,Staten Island
130,Woodside,-73.901842,40.746349,Queens


In [52]:
nyc_health = nyc_health.drop(columns='Unnamed: 0')


In [53]:
nyc_health['Neighborhood'] = nyc_health['Neighborhood'].astype('string')
nyc_health.groupby('Neighborhood').count()



Neighborhood,Place_name,Latitude,Longitude,Address,Borough,Category_name
Adelphi,27,27,27,27,27,27
Allerton,32,32,32,32,32,32
Annadale,47,47,47,47,47,47
Arverne,2,2,2,2,2,2
Aspen Knolls,28,28,28,28,28,28
...,...,...,...,...,...,...
Williamsburg,25,25,25,25,25,25
Willowbrook,10,10,10,10,10,10
Wingate,27,27,27,27,27,27
Woodhaven,61,61,61,61,61,61


In [54]:
# one hot encoding
manhattan_onehot = pd.get_dummies(nyc_health[['Category_name']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe 
manhattan_onehot['Neighborhood'] = nyc_health['Neighborhood'] 

# Moving neighborhood to first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]





In [55]:
manhattan_onehot.shape

(13502, 21)

In [56]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Chiropractor,Dentist,Doctor's Office,Family Medicine Doctor,General Surgeon,Health and Medicine,Home Health Care Service,Hospital,Internal Medicine Doctor,...,Medical Lab,Nursing Home,Nutritionist,Ophthalmologist,Optometrist,Pediatrician,Physical Therapy Clinic,Psychiatrist,Urgent Care Center,Veterinarian
0,Adelphi,0.000000,0.222222,0.111111,0.037037,0.000000,0.037037,0.000000,0.111111,0.148148,...,0.037037,0.074074,0.000000,0.000000,0.148148,0.037037,0.000000,0.037037,0.000000,0.000000
1,Allerton,0.062500,0.125000,0.218750,0.031250,0.000000,0.187500,0.031250,0.062500,0.093750,...,0.000000,0.156250,0.031250,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,Annadale,0.042553,0.063830,0.170213,0.000000,0.063830,0.148936,0.000000,0.106383,0.106383,...,0.000000,0.106383,0.042553,0.000000,0.063830,0.000000,0.021277,0.000000,0.021277,0.042553
3,Arverne,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,Aspen Knolls,0.035714,0.250000,0.214286,0.035714,0.000000,0.035714,0.000000,0.107143,0.071429,...,0.000000,0.000000,0.000000,0.000000,0.071429,0.000000,0.000000,0.035714,0.071429,0.071429
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271,Williamsburg,0.040000,0.040000,0.120000,0.040000,0.000000,0.080000,0.000000,0.280000,0.080000,...,0.040000,0.080000,0.000000,0.040000,0.000000,0.000000,0.040000,0.040000,0.080000,0.000000
272,Willowbrook,0.000000,0.000000,0.100000,0.100000,0.000000,0.300000,0.000000,0.100000,0.000000,...,0.000000,0.100000,0.000000,0.000000,0.000000,0.100000,0.000000,0.000000,0.100000,0.100000
273,Wingate,0.037037,0.111111,0.148148,0.000000,0.000000,0.296296,0.000000,0.222222,0.000000,...,0.000000,0.037037,0.037037,0.037037,0.037037,0.000000,0.000000,0.000000,0.000000,0.000000
274,Woodhaven,0.000000,0.081967,0.131148,0.016393,0.032787,0.213115,0.016393,0.098361,0.065574,...,0.016393,0.114754,0.016393,0.016393,0.081967,0.000000,0.000000,0.016393,0.032787,0.016393


In [57]:
num_top_care = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_care))
    print('\n')

----Adelphi----
                      venue  freq
0                   Dentist  0.22
1               Optometrist  0.15
2  Internal Medicine Doctor  0.15
3           Doctor's Office  0.11
4                  Hospital  0.11


----Allerton----
                      venue  freq
0           Doctor's Office  0.22
1       Health and Medicine  0.19
2              Nursing Home  0.16
3                   Dentist  0.12
4  Internal Medicine Doctor  0.09


----Annadale----
                      venue  freq
0           Doctor's Office  0.17
1       Health and Medicine  0.15
2                  Hospital  0.11
3  Internal Medicine Doctor  0.11
4              Nursing Home  0.11


----Arverne----
                venue  freq
0        Nursing Home   1.0
1        Chiropractor   0.0
2             Dentist   0.0
3  Urgent Care Center   0.0
4        Psychiatrist   0.0


----Aspen Knolls----
                      venue  freq
0                   Dentist  0.25
1           Doctor's Office  0.21
2                  Hosp

In [58]:

def return_most_common_venues(row, num_top_care):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_care]

In [59]:
num_top_care = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_care):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_care_sorted = pd.DataFrame(columns=columns)
neighborhoods_care_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_care_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_care)

neighborhoods_care_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelphi,Dentist,Optometrist,Internal Medicine Doctor,Doctor's Office,Hospital,Nursing Home,Medical Lab,Family Medicine Doctor,Health and Medicine,Psychiatrist
1,Allerton,Doctor's Office,Health and Medicine,Nursing Home,Dentist,Internal Medicine Doctor,Chiropractor,Hospital,Family Medicine Doctor,Home Health Care Service,Nutritionist
2,Annadale,Doctor's Office,Health and Medicine,Hospital,Internal Medicine Doctor,Nursing Home,Dentist,Optometrist,General Surgeon,Chiropractor,Nutritionist
3,Arverne,Nursing Home,Chiropractor,Dentist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist
4,Aspen Knolls,Dentist,Doctor's Office,Hospital,Veterinarian,Internal Medicine Doctor,Urgent Care Center,Optometrist,Psychiatrist,Chiropractor,Health and Medicine


In [60]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([1, 1, 1, 2, 1, 1, 2, 2, 1, 1])

In [61]:
# add clustering labels
neighborhoods_care_sorted.insert(0, 'Cluster tag', kmeans.labels_)


 

In [62]:
neighborhoods_care_sorted

Unnamed: 0,Cluster tag,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Adelphi,Dentist,Optometrist,Internal Medicine Doctor,Doctor's Office,Hospital,Nursing Home,Medical Lab,Family Medicine Doctor,Health and Medicine,Psychiatrist
1,1,Allerton,Doctor's Office,Health and Medicine,Nursing Home,Dentist,Internal Medicine Doctor,Chiropractor,Hospital,Family Medicine Doctor,Home Health Care Service,Nutritionist
2,1,Annadale,Doctor's Office,Health and Medicine,Hospital,Internal Medicine Doctor,Nursing Home,Dentist,Optometrist,General Surgeon,Chiropractor,Nutritionist
3,2,Arverne,Nursing Home,Chiropractor,Dentist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist
4,1,Aspen Knolls,Dentist,Doctor's Office,Hospital,Veterinarian,Internal Medicine Doctor,Urgent Care Center,Optometrist,Psychiatrist,Chiropractor,Health and Medicine
...,...,...,...,...,...,...,...,...,...,...,...,...
271,1,Williamsburg,Hospital,Doctor's Office,Nursing Home,Urgent Care Center,Health and Medicine,Internal Medicine Doctor,Chiropractor,Psychiatrist,Physical Therapy Clinic,Ophthalmologist
272,2,Willowbrook,Health and Medicine,Veterinarian,Doctor's Office,Family Medicine Doctor,Urgent Care Center,Hospital,Pediatrician,Nursing Home,Ophthalmologist,Psychiatrist
273,1,Wingate,Health and Medicine,Hospital,Doctor's Office,Dentist,Chiropractor,Optometrist,Medical Center,Nursing Home,Nutritionist,Ophthalmologist
274,1,Woodhaven,Health and Medicine,Doctor's Office,Nursing Home,Hospital,Dentist,Optometrist,Internal Medicine Doctor,Urgent Care Center,General Surgeon,Medical Center


In [63]:
manhattan_merged = df.copy()

# merge to add latitude/longitude for each neighborhood
manhattan_merged1 = manhattan_merged.merge(neighborhoods_care_sorted, left_on='neighborhood', right_on='Neighborhood')
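
Because `merge` defaults to an inner join, neighborhoods present in only one of the two tables are dropped without warning. The sketch below uses two small hypothetical frames (the names and values are invented) to show how `indicator=True` can surface the unmatched rows before clustering results are mapped.

```python
import pandas as pd

# Hypothetical stand-ins for the coordinates table and the cluster table.
left = pd.DataFrame({
    "neighborhood": ["A", "B", "C"],
    "latitude": [40.1, 40.2, 40.3],
})
right = pd.DataFrame({
    "Neighborhood": ["A", "B", "D"],
    "Cluster tag": [1, 2, 1],
})

# An outer join with indicator=True adds a `_merge` column that records
# whether each row came from the left table, the right table, or both.
check = left.merge(
    right,
    left_on="neighborhood",
    right_on="Neighborhood",
    how="outer",
    indicator=True,
)

unmatched = check[check["_merge"] != "both"]
print(unmatched[["neighborhood", "Neighborhood", "_merge"]])
```

Rows flagged `left_only` or `right_only` usually point to spelling differences between the two sources.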


## Segmentation of NYC neighborhoods

The goal of this project is to segment the neighborhoods of New York City into separate clusters and examine the information about them. For clustering, we can use any available information **except** demographic and economic indicators. We don't want to segment neighborhoods on those indicators; we keep them for the **profiling of clusters**, to see whether there are any important economic differences between the created clusters.

In [64]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

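# create map centered on the mean coordinates of the merged neighborhoods
map_clusters = folium.Map(
    location=[manhattan_merged1['latitude'].mean(), manhattan_merged1['longitude'].mean()],
    zoom_start=11)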

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged1['latitude'], manhattan_merged1['longitude'], manhattan_merged1['neighborhood'], manhattan_merged1['Cluster tag']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [65]:
manhattan_merged1.loc[manhattan_merged1['Cluster tag'] == 0, manhattan_merged1.columns[[1] + list(range(5, manhattan_merged1.shape[1]))]]


Unnamed: 0,longitude,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
90,-73.820878,Kew Gardens Hills,Dentist,Doctor's Office,Nutritionist,General Surgeon,Optometrist,Medical Center,Chiropractor,Urgent Care Center,Psychiatrist,Physical Therapy Clinic
111,-73.916556,Mount Eden,Dentist,Chiropractor,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist,Nursing Home
130,-74.012759,Red Hook,Dentist,Chiropractor,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist,Nursing Home


In [66]:
manhattan_merged1.loc[manhattan_merged1['Cluster tag'] == 1, manhattan_merged1.columns[[1] + list(range(5, manhattan_merged1.shape[1]))]]


Unnamed: 0,longitude,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,-73.859319,Allerton,Doctor's Office,Health and Medicine,Nursing Home,Dentist,Internal Medicine Doctor,Chiropractor,Hospital,Family Medicine Doctor,Home Health Care Service,Nutritionist
1,-74.178549,Annadale,Doctor's Office,Health and Medicine,Hospital,Internal Medicine Doctor,Nursing Home,Dentist,Optometrist,General Surgeon,Chiropractor,Nutritionist
3,-73.915654,Astoria,Health and Medicine,Hospital,Doctor's Office,Internal Medicine Doctor,Dentist,Optometrist,Nursing Home,Home Health Care Service,Chiropractor,Ophthalmologist
5,-73.791762,Auburndale,Dentist,Doctor's Office,Health and Medicine,Hospital,Nursing Home,Internal Medicine Doctor,Chiropractor,Family Medicine Doctor,Pediatrician,Medical Lab
6,-73.998752,Bath Beach,Health and Medicine,Doctor's Office,Nursing Home,Optometrist,Home Health Care Service,Hospital,Internal Medicine Doctor,Dentist,Nutritionist,Physical Therapy Clinic
...,...,...,...,...,...,...,...,...,...,...,...,...
162,-73.814202,Whitestone,Health and Medicine,Dentist,Hospital,Optometrist,Nursing Home,Chiropractor,Nutritionist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic
163,-73.857446,Williamsbridge,Health and Medicine,Doctor's Office,Dentist,Optometrist,Hospital,Chiropractor,Psychiatrist,Internal Medicine Doctor,Nursing Home,Family Medicine Doctor
164,-73.958115,Williamsburg,Hospital,Doctor's Office,Nursing Home,Urgent Care Center,Health and Medicine,Internal Medicine Doctor,Chiropractor,Psychiatrist,Physical Therapy Clinic,Ophthalmologist
166,-73.937187,Wingate,Health and Medicine,Hospital,Doctor's Office,Dentist,Chiropractor,Optometrist,Medical Center,Nursing Home,Nutritionist,Ophthalmologist


In [67]:
manhattan_merged1.loc[manhattan_merged1['Cluster tag'] == 2, manhattan_merged1.columns[[1] + list(range(5, manhattan_merged1.shape[1]))]]


Unnamed: 0,longitude,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,-73.791992,Arverne,Nursing Home,Chiropractor,Dentist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist
4,-73.89468,Astoria Heights,Nursing Home,Health and Medicine,Internal Medicine Doctor,Chiropractor,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist
11,-73.765968,Bayswater,Health and Medicine,Nursing Home,Dentist,Doctor's Office,General Surgeon,Hospital,Home Health Care Service,Internal Medicine Doctor,Optometrist,Chiropractor
12,-73.804365,Beechhurst,Health and Medicine,Dentist,Doctor's Office,Hospital,Nursing Home,Nutritionist,Chiropractor,Ophthalmologist,Urgent Care Center,Psychiatrist
30,-73.994654,Carroll Gardens,Health and Medicine,Dentist,Doctor's Office,Hospital,Internal Medicine Doctor,Chiropractor,Ophthalmologist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic
31,-73.848027,Castle Hill,Health and Medicine,Veterinarian,Nursing Home,Dentist,Hospital,Optometrist,Internal Medicine Doctor,Chiropractor,Doctor's Office,Home Health Care Service
44,-74.084024,Concord,Health and Medicine,Dentist,Doctor's Office,Home Health Care Service,Nursing Home,Hospital,Medical Lab,Nutritionist,Veterinarian,Family Medicine Doctor
55,-73.938858,East Williamsburg,Health and Medicine,Optometrist,Hospital,General Surgeon,Internal Medicine Doctor,Chiropractor,Nutritionist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic
56,-73.848083,Edenwald,Health and Medicine,Optometrist,Veterinarian,Doctor's Office,Internal Medicine Doctor,Nutritionist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician
82,-73.767142,Holliswood,Nutritionist,Chiropractor,Dentist,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nursing Home


In [68]:
manhattan_merged1.loc[manhattan_merged1['Cluster tag'] == 3, manhattan_merged1.columns[[1] + list(range(5, manhattan_merged1.shape[1]))]]


Unnamed: 0,longitude,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,-73.786488,City Island,Health and Medicine,Chiropractor,Nursing Home,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist
61,-73.990947,Flatiron,Health and Medicine,Chiropractor,Nursing Home,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist
125,-74.174645,Port Ivory,Health and Medicine,Chiropractor,Nursing Home,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist


In [69]:
manhattan_merged1.loc[manhattan_merged1['Cluster tag'] == 4, manhattan_merged1.columns[[1] + list(range(5, manhattan_merged1.shape[1]))]]


Unnamed: 0,longitude,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
40,-73.854144,Clason Point,Doctor's Office,Chiropractor,Nursing Home,Urgent Care Center,Psychiatrist,Physical Therapy Clinic,Pediatrician,Optometrist,Ophthalmologist,Nutritionist


### Feature Engineering

Feature engineering plays a crucial role in this problem. We have a limited number of attributes, so we need to create features that will be useful for segmentation.

- Google Places, Yelp and Foursquare APIs: number of venues, density of venues per square mile, number of restaurants, top restaurant category...
- Uber: number of rides per day in the neighborhood
- Meetups: number of events
- etc...
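
A minimal sketch of how venue-based features such as those listed above might be derived with pandas. The venue records, the `area_sq_mi` column, and all values are hypothetical. Real data would come from one of the APIs and from a separate source of neighborhood areas.

```python
import pandas as pd

# Hypothetical venue records, e.g. as collected from a places API.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Category_name": ["Pizza", "Pizza", "Sushi", "Sushi", "Cafe"],
})

# Hypothetical neighborhood areas in square miles.
areas = pd.DataFrame({"Neighborhood": ["A", "B"], "area_sq_mi": [1.5, 0.5]})

features = (
    venues.groupby("Neighborhood")
    .agg(
        n_venues=("Category_name", "count"),
        # mode() returns the most frequent value; on ties it returns all
        # tied values in sorted order, and we keep the first one.
        top_category=("Category_name", lambda s: s.mode().iloc[0]),
    )
    .reset_index()
    .merge(areas, on="Neighborhood", how="left")
)

# Density of venues per square mile.
features["venues_per_sq_mi"] = features["n_venues"] / features["area_sq_mi"]
print(features)
```

Categorical features such as `top_category` still need encoding (for example one-hot) before they can be passed to a distance-based algorithm.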

### Feature Selection / Dimensionality Reduction
We need to try different selection techniques to find out which one works best for our problem.

Original features vs. PCA components?

Don't forget to scale the features for KMeans.
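
KMeans relies on Euclidean distances, so a feature measured in thousands would otherwise dominate one measured in fractions. The following sketch chains scaling, PCA, and KMeans in a scikit-learn pipeline. The feature matrix is randomly generated, and the 90% variance threshold and `n_clusters=3` are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature matrix: 100 neighborhoods, 6 features on very different scales.
X = rng.normal(size=(100, 6)) * np.array([1, 10, 100, 1000, 0.1, 5])

# Standardize, keep enough components to explain 90% of the variance, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.9, svd_solver="full"),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

pca = pipeline.named_steps["pca"]
print("components kept:", pca.n_components_)
print("explained variance:", pca.explained_variance_ratio_.sum().round(3))
```

To compare original features against PCA components, the same pipeline can be run with and without the `PCA` step and the resulting clusters evaluated side by side.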

### Modeling

Use different attributes and clustering techniques and compare the created clusters:

- clustering only on restaurant features
- clustering only on Uber features
- clustering only on location
- combination of all

**Questions:**
1. Which clustering is the best?
2. How are neighborhoods split when we select only 2 clusters?
3. Are there any differences in housing and rental costs in different clusters?
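
For question 3, one possible approach is to join the cluster labels to the economic indicators that were held out of the clustering and compare per-cluster averages. The frames below are hypothetical, and `median_rent` and `median_home_value` are assumed column names with invented values.

```python
import pandas as pd

# Hypothetical cluster assignments produced by a clustering step.
clustered = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "D"],
    "cluster_id": [0, 0, 1, 1],
})

# Hypothetical economic indicators that were NOT used for clustering.
economics = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "D"],
    "median_rent": [2000, 3000, 1200, 1400],
    "median_home_value": [800000, 900000, 400000, 500000],
})

# Profile each cluster by averaging the held-out indicators.
profile = (
    clustered.merge(economics, on="neighborhood", how="left")
    .groupby("cluster_id")[["median_rent", "median_home_value"]]
    .mean()
)
print(profile)
```

Large gaps between the rows of `profile` would suggest that the clusters, although built without economic data, still line up with economic differences.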

### Evaluation

1. Check the segmentation evaluation metrics:
    - inertia
    - silhouette score
2. How did you come up with the correct number of clusters?
3. Is there any relationship between the clusters and economic indicators? If yes, what does it mean?
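
A common way to approach items 1 and 2 is to fit KMeans for a range of `k` and record both metrics. The sketch below does this on synthetic blobs, not the NYC data. Inertia always tends to fall as `k` grows, so it is read as an elbow plot, whereas the silhouette score can simply be maximized.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10], [0, 20]],
    cluster_std=0.6,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

results = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results.append({
        "k": k,
        "inertia": km.inertia_,  # within-cluster sum of squared distances
        "silhouette": silhouette_score(X, km.labels_),  # in [-1, 1], higher is better
    })

for r in results:
    print(r["k"], round(r["inertia"], 2), round(r["silhouette"], 3))

# Pick the k with the highest silhouette score.
best = max(results, key=lambda r: r["silhouette"])
print("best k by silhouette:", best["k"])
```

On real data the two metrics often disagree or show no sharp elbow, so the chosen `k` should also be checked against how interpretable the resulting clusters are.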

You are required to share the file containing all NYC neighborhoods, together with their cluster_id, with LighthouseLabs.