# Massachusetts Colleges Capstone Project - Raul Gonzalez
# Implementation
# April 7th, 2020

Choosing which college or university to attend is one of the biggest decisions you will make for your personal and professional development, and it can be overwhelming if you don’t know where to start. College matters for many reasons, including long-term financial gain, job stability, career satisfaction and success outside of the workplace. In essence, college teaches us how to learn and grow in every aspect of our lives. As Ralph Waldo Emerson said, "The things taught in schools and colleges are not an education, but the means to an education." The quality of the education offered and the overall on-campus experience are, of course, very important factors in deciding which college to attend. Another important factor is the off-campus experience, including the historical, cultural and social background of the college's city, since attending college ultimately prepares us to become active members of our societies.

Prospective college students often cannot visit the city of the college they plan to attend because of time or money constraints, and they may be unsure whether they will like the environment that city offers. Conversely, many students choose a college simply because they really like the city in which it is located. Knowing in advance what kind of environment each city offers would therefore make it easier for prospective students to choose which college or university to attend. One student might prefer a city with many Italian restaurants nearby, while another might prefer a city with many museums and parks to visit on weekends.

With its significant and unique place in American history and culture, from the colonial period onward, Massachusetts continues to play a leading role in American high culture and fine arts. Massachusetts is home to many world-class museums and national historical sites and has produced some of America’s most famous academics, artists, writers, and musicians. Massachusetts’ role in American education is also without equal. The state is home to the United States’ oldest high school, first public library, oldest boarding school, oldest college, and first women’s college. Additionally, top-level universities such as Harvard and MIT, which consistently rank among the world’s best universities, are located in this state. Massachusetts has 12% of the top research universities and 15% of the top 40 liberal arts colleges. Several of the world’s best medical and technology facilities are located here, as well as numerous multinational corporations. In summary, Massachusetts seems like a great place to pursue college studies. This project aims to cluster the Massachusetts cities that house the state’s colleges and universities according to the environments they offer, so that prospective students who want to study there can more easily choose the right college.

### Extracting data from Wikipedia using pandas

In [38]:
# extract tables from wikipedia
from pandas.io.html import read_html

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import re

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')
page = 'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Massachusetts'

wikitables = read_html(page,  attrs={"class":"wikitable sortable"})

print ("Extracted {num} wikitables".format(num=len(wikitables)))

Libraries imported.
Extracted 3 wikitables


We will only need the first table for the implementation.

In [39]:
df=wikitables[0]
print(df.shape)
df.head(10)

(104, 7)


Unnamed: 0,School,Location[note 1],Control[1],Type[1],Enrollment[16],Founded,Accreditation[16]
0,American International College,Springfield,Private not-for-profit,Master's university,"2,177[17]",1885[17],"AOTA, APTA, CCNE, NEASC[17]"
1,Amherst College,Amherst,Private not-for-profit,Baccalaureate college,"1,817[18]",1821[18],NEASC[18]
2,Anna Maria College,Paxton,Private not-for-profit,Master's university,"1,455[19]",1946[19],"NASM, NEASC, NLNAC[19]"
3,Assumption College,Worcester,Private not-for-profit,Master's university,"2,813[20]",1904[20],NEASC[20]
4,Babson College,Wellesley,Private not-for-profit,Special-focus institution,"3,250[21]",1919[21],NEASC[21]
5,Bard College at Simon's Rock,Great Barrington,Private not-for-profit,Baccalaureate/associate's college,354[22],1964[22],NEASC[22]
6,Bay Path University,Longmeadow,Private not-for-profit,Baccalaureate college,"2,370[23]",1897[23],"AOTA, NEASC[23]"
7,Bay State College,Boston,For-profit,Associate's college,"1,721[24]",1946[24],"ABHES, APTA, NEASC, NLNAC[24]"
8,Becker College,Worcester,Private not-for-profit,Baccalaureate college,"1,826[25]",1784[25],"APTA, NEASC, NLNAC[25]"
9,Benjamin Franklin Institute of Technology,Boston,Private not-for-profit,Special-focus institution,475[26],1908[26],NEASC[26]


### Removing unwanted data

In [40]:
data=df.drop(['Accreditation[16]'], axis=1)
data.rename(columns={"Location[note 1]": "Location", "Control[1]": "Control", "Type[1]": "Type", "Enrollment[16]": "Enrollment"}, inplace=True)
# strip footnote markers such as "[17]" and thousands separators from both columns
for col in ['Enrollment', 'Founded']:
    data[col] = data[col].str.replace(r"[\(\[].*?[\)\]]", "", regex=True)
    data[col] = data[col].str.replace(',', '', regex=False)

data.loc[92, 'Founded'] = 1975  # this entry is set by hand
data = data.astype({"Enrollment":'int64', "Founded":'int64'})

print('Data shape:',data.shape)
print('The data consists of {} colleges in {} different cities'.format(data.shape[0], data['Location'].nunique()))
data.head(10)

Data shape: (104, 6)
The data consists of 104 colleges in 53 different cities


Unnamed: 0,School,Location,Control,Type,Enrollment,Founded
0,American International College,Springfield,Private not-for-profit,Master's university,2177,1885
1,Amherst College,Amherst,Private not-for-profit,Baccalaureate college,1817,1821
2,Anna Maria College,Paxton,Private not-for-profit,Master's university,1455,1946
3,Assumption College,Worcester,Private not-for-profit,Master's university,2813,1904
4,Babson College,Wellesley,Private not-for-profit,Special-focus institution,3250,1919
5,Bard College at Simon's Rock,Great Barrington,Private not-for-profit,Baccalaureate/associate's college,354,1964
6,Bay Path University,Longmeadow,Private not-for-profit,Baccalaureate college,2370,1897
7,Bay State College,Boston,For-profit,Associate's college,1721,1946
8,Becker College,Worcester,Private not-for-profit,Baccalaureate college,1826,1784
9,Benjamin Franklin Institute of Technology,Boston,Private not-for-profit,Special-focus institution,475,1908
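
The cleaning step above can be illustrated on a few raw values from the Wikipedia table. This is a minimal standard-library sketch, not the notebook's own code: `clean_number` is a hypothetical helper name, and the pattern is the same one used in the cell above.

```python
import re

# Same pattern as in the cell above: a bracketed or parenthesised
# footnote marker such as "[17]" or "(note 2)".
FOOTNOTE = re.compile(r"[\(\[].*?[\)\]]")

def clean_number(raw: str) -> int:
    """Strip footnote markers and thousands separators, then parse an int."""
    without_notes = FOOTNOTE.sub("", raw)
    return int(without_notes.replace(",", "").strip())

# Raw values copied from the Enrollment and Founded columns above.
samples = ["2,177[17]", "354[22]", "1885[17]"]
print([clean_number(s) for s in samples])
```

Parsing to `int` right after cleaning makes any value that still contains stray text fail loudly instead of silently staying a string.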


### Let's explore the data

In [41]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 6 columns):
School        104 non-null object
Location      104 non-null object
Control       104 non-null object
Type          104 non-null object
Enrollment    104 non-null int64
Founded       104 non-null int64
dtypes: int64(2), object(4)
memory usage: 5.0+ KB


First, let's focus on the numeric features.

In [42]:
data[['Enrollment', 'Founded']].describe()

Unnamed: 0,Enrollment,Founded
count,104.0,104.0
mean,4780.480769,1913.009615
std,5934.864141,57.049713
min,18.0,1636.0
25%,1281.75,1878.0
50%,2496.0,1918.0
75%,6314.75,1963.25
max,32603.0,1997.0


In [46]:
print('The oldest college in Massachusetts is {} founded in {}'.format(data['School'][np.argmin(data['Founded'])],np.min(data['Founded'])))
print('\n')
print('The newest college in Massachusetts is {} founded in {}'.format(data['School'][np.argmax(data['Founded'])],np.max(data['Founded'])))
print('\n')
print('The college with the highest enrollment is {} with {} students'.format(data['School'][np.argmax(data['Enrollment'])],np.max(data['Enrollment'])))
print('\n')
print('The college with the lowest enrollment is {} with {} students'.format(data['School'][np.argmin(data['Enrollment'])],np.min(data['Enrollment'])))
print('\n')
print('The average enrollment for a college in Massachusetts is approximately {} students'.format(round(data['Enrollment'].mean())))
print('\n')
print('The total number of students enrolled in Massachusetts is {} students'.format(data['Enrollment'].sum()))

The oldest college in Massachusetts is Harvard University founded in 1636


The newest college in Massachusetts is Olin College founded in 1997


The college with the highest enrollment is Boston University with 32603 students


The college with the lowest enrollment is Conway School of Landscape Design with 18 students


The average enrollment for a college in Massachusetts is approximately 4780 students


The total number of students enrolled in Massachusetts is 497170 students


In [47]:
# between() includes both endpoints by default
print('{} college was founded in the 17th century'.format(np.count_nonzero(data['Founded'].between(1601, 1700))))
print('\n')
print('{} colleges were founded in the 18th century'.format(np.count_nonzero(data['Founded'].between(1701, 1800))))
print('\n')
print('{} colleges were founded in the 19th century'.format(np.count_nonzero(data['Founded'].between(1801, 1900))))
print('\n')
print('{} colleges were founded in the 20th century'.format(np.count_nonzero(data['Founded'].between(1901, 2000))))
print('\n')

1 college was founded in the 17th century


2 colleges were founded in the 18th century


40 colleges were founded in the 19th century


61 colleges were founded in the 20th century
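
The century boundaries used above (1601 to 1700, 1701 to 1800, and so on) can also be derived arithmetically instead of being written out per century. A minimal sketch, assuming a hypothetical helper `century_of` and using only a handful of founding years from the table shown earlier:

```python
from collections import Counter

def century_of(year: int) -> int:
    """Return the century a year falls in, e.g. 1636 -> 17 and 1900 -> 19."""
    return (year - 1) // 100 + 1

# Founding years of the first ten schools in the table above.
founded = [1885, 1821, 1946, 1904, 1919, 1964, 1897, 1946, 1784, 1908]

per_century = Counter(century_of(year) for year in founded)
for century in sorted(per_century):
    print(century, per_century[century])
```

Subtracting one before the integer division keeps years such as 1700 and 1900 in the century they close, which matches the inclusive ranges used in the cell above.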




Now let's focus on the categorical features.

In [48]:
data[['School', 'Location', 'Control', 'Type']].describe()

Unnamed: 0,School,Location,Control,Type
count,104,104,104,104
unique,104,53,3,6
top,Bay Path University,Boston,Private not-for-profit,Special-focus institution
freq,1,24,72,28


 - There are 104 distinct colleges in the dataset, spread across 53 different locations.
 - The location with the most colleges is Boston, with a total of 24 colleges.
 - Out of the 104 colleges, 72 are Private not-for-profit.
 - The most common type of college is the Special-focus institution.

In [49]:
data_location = data.groupby(['Location']).School.nunique()
data_location.sort_values(ascending=False).head()

Location
Boston         24
Worcester       8
Cambridge       6
Springfield     4
Wellesley       3
Name: School, dtype: int64

The cities with the most colleges are Boston, Worcester and Cambridge.

In [50]:
data_control = data.groupby(['Control']).School.nunique()
data_control.sort_values(ascending=False)

Control
Private not-for-profit    72
Public                    30
For-profit                 2
Name: School, dtype: int64

72 colleges are Private not-for-profit, 30 are Public and 2 are For-profit.

In [51]:
data_type = data.groupby(['Type']).School.nunique()
data_type.sort_values(ascending=False)

Type
Special-focus institution            28
Baccalaureate college                21
Master's university                  20
Associate's college                  20
Research university                  14
Baccalaureate/associate's college     1
Name: School, dtype: int64

28 colleges are Special-focus institutions, 21 are Baccalaureate colleges, 20 are Master's universities, 20 are Associate's colleges, 14 are Research universities and 1 is a Baccalaureate/associate's college.

### Let's prepare the data in a convenient way

In [52]:
data['Number of Colleges'] = data['Location'].map(data['Location'].value_counts())
data.head(10)

Unnamed: 0,School,Location,Control,Type,Enrollment,Founded,Number of Colleges
0,American International College,Springfield,Private not-for-profit,Master's university,2177,1885,4
1,Amherst College,Amherst,Private not-for-profit,Baccalaureate college,1817,1821,3
2,Anna Maria College,Paxton,Private not-for-profit,Master's university,1455,1946,1
3,Assumption College,Worcester,Private not-for-profit,Master's university,2813,1904,8
4,Babson College,Wellesley,Private not-for-profit,Special-focus institution,3250,1919,3
5,Bard College at Simon's Rock,Great Barrington,Private not-for-profit,Baccalaureate/associate's college,354,1964,1
6,Bay Path University,Longmeadow,Private not-for-profit,Baccalaureate college,2370,1897,1
7,Bay State College,Boston,For-profit,Associate's college,1721,1946,24
8,Becker College,Worcester,Private not-for-profit,Baccalaureate college,1826,1784,8
9,Benjamin Franklin Institute of Technology,Boston,Private not-for-profit,Special-focus institution,475,1908,24


In [53]:
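# join only the text columns; ','.join cannot be applied to the numeric columns
result = data.groupby('Location', sort=True)[['School', 'Control', 'Type']].agg(','.join)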
result = result.reset_index(drop=True)

result.head()

Unnamed: 0,School,Control,Type
0,"Amherst College,Hampshire College,University o...","Private not-for-profit,Private not-for-profit,...","Baccalaureate college,Baccalaureate college,Re..."
1,Massachusetts School of Law,Private not-for-profit,Special-focus institution
2,"Endicott College,Montserrat College of Art","Private not-for-profit,Private not-for-profit","Master's university,Special-focus institution"
3,"Bay State College,Benjamin Franklin Institute ...","For-profit,Private not-for-profit,Private not-...","Associate's college,Special-focus institution,..."
4,Massachusetts Maritime Academy,Public,Baccalaureate college


### Let's keep just the cities, with the names and number of colleges in each one

In [54]:
data_clean = pd.DataFrame(data[['Location','Number of Colleges']])
data_clean = data_clean.drop_duplicates()
data_clean.sort_values('Location',inplace=True)
data_clean.rename(columns={"Location": "City"}, inplace=True)
data_clean = data_clean.reset_index(drop=True)
data_clean['Colleges'] = result['School']

data_clean.head(10)

Unnamed: 0,City,Number of Colleges,Colleges
0,Amherst,3,"Amherst College,Hampshire College,University o..."
1,Andover,1,Massachusetts School of Law
2,Beverly,2,"Endicott College,Montserrat College of Art"
3,Boston,24,"Bay State College,Benjamin Franklin Institute ..."
4,Bourne,1,Massachusetts Maritime Academy
5,Bridgewater,1,Bridgewater State University
6,Brighton,1,Saint John's Seminary
7,Brockton,1,Massasoit Community College
8,Brookline,2,"Boston Graduate School of Psychoanalysis,Helle..."
9,Cambridge,6,"Cambridge College,Harvard University,Hult Inte..."


### Now we find the coordinates of each city using geopy and append them to the dataframe

In [55]:
longitudes=[]
latitudes=[]

# create the geocoder once and reuse it for every city
geolocator = Nominatim(user_agent="ny_explorer")

for city in data_clean['City']:
    address = city + ', Massachusetts'
    location = geolocator.geocode(address)
    latitudes.append(location.latitude)
    longitudes.append(location.longitude)

print('Coordinates ready')

Coordinates ready
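
The loop above assumes every lookup succeeds, but geopy's `geocode` returns `None` when an address is not found. Below is a hedged sketch of a more defensive variant. The function `geocode_cities` and its parameters are hypothetical names, and `fake_geocode` is a stand-in so that the example runs without network access.

```python
import time
from types import SimpleNamespace
from typing import Callable, Dict, Iterable, Optional, Tuple

def geocode_cities(
    cities: Iterable[str],
    geocode: Callable[[str], Optional[object]],
    suffix: str = ", Massachusetts",
    delay_seconds: float = 1.0,
) -> Dict[str, Optional[Tuple[float, float]]]:
    """Map each city to (latitude, longitude), or to None when nothing is found."""
    coordinates = {}
    for city in cities:
        location = geocode(city + suffix)
        if location is None:
            coordinates[city] = None
        else:
            coordinates[city] = (location.latitude, location.longitude)
        time.sleep(delay_seconds)  # pause between requests to a shared service
    return coordinates

# Stand-in geocoder for demonstration only; it knows a single city.
def fake_geocode(address: str):
    if address == "Boston, Massachusetts":
        return SimpleNamespace(latitude=42.360253, longitude=-71.058291)
    return None

coords = geocode_cities(["Boston", "Nowhereville"], fake_geocode, delay_seconds=0)
print(coords)
```

With geopy, the second argument would be the bound method `Nominatim(user_agent="ny_explorer").geocode`. The default one-second pause is an assumption based on the public Nominatim service asking clients to keep to about one request per second.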


In [56]:
data_clean['Latitude'] = latitudes
data_clean['Longitude'] = longitudes
print(data_clean.shape)
data_clean.head(10)

(53, 5)


Unnamed: 0,City,Number of Colleges,Colleges,Latitude,Longitude
0,Amherst,3,"Amherst College,Hampshire College,University o...",42.368566,-72.505714
1,Andover,1,Massachusetts School of Law,42.65717,-71.140878
2,Beverly,2,"Endicott College,Montserrat College of Art",42.558428,-70.880049
3,Boston,24,"Bay State College,Benjamin Franklin Institute ...",42.360253,-71.058291
4,Bourne,1,Massachusetts Maritime Academy,41.741217,-70.59892
5,Bridgewater,1,Bridgewater State University,41.990379,-70.975043
6,Brighton,1,Saint John's Seminary,42.350097,-71.156442
7,Brockton,1,Massasoit Community College,42.083433,-71.018379
8,Brookline,2,"Boston Graduate School of Psychoanalysis,Helle...",42.331764,-71.121163
9,Cambridge,6,"Cambridge College,Harvard University,Hult Inte...",42.3751,-71.105616


### Use geopy library to get the latitude and longitude values of the state of Massachusetts 

In [57]:
address = 'Massachusetts, United States'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Massachusetts, United States are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Massachusetts, United States are 42.3788774, -72.032366.


### Create a map of Massachusetts with the cities that have colleges superimposed on top

In [58]:
# create map of Massachusetts using latitude and longitude values
map_massachusetts = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng, city in zip(data_clean['Latitude'], data_clean['Longitude'], data_clean['City']):
    label = '{}, {}'.format(city, 'Massachusetts')
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_massachusetts)  
    
map_massachusetts

### Define Foursquare Credentials and Version

In [59]:
# The code was removed by Watson Studio for sharing.

### Let's create a function to find the closest venues for all the cities

In [60]:
def getNearbyVenues(names, latitudes, longitudes, radius=5000):
    
    LIMIT=1000
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
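
The function above indexes directly into `["response"]['groups'][0]['items']` and `['categories'][0]`, so an empty group list or a venue without a category would raise an exception. The sketch below shows one way to flatten such a response defensively. The helper name `venues_from_response` and the `"Unknown"` fallback are assumptions, and the payload is hand-written to mimic only the keys the function above reads.

```python
from typing import Any, Dict, List, Tuple

def venues_from_response(payload: Dict[str, Any]) -> List[Tuple[str, float, float, str]]:
    """Flatten an explore-style response into (name, lat, lng, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder instead of failing on an empty list.
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows

# Hand-written payload; the first venue reuses values from the notebook output,
# the second is a hypothetical venue without any category.
sample = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Amherst Cinema",
                               "location": {"lat": 42.375219, "lng": -72.520377},
                               "categories": [{"name": "Movie Theater"}]}},
                    {"venue": {"name": "Uncategorized Spot",
                               "location": {"lat": 42.37, "lng": -72.52},
                               "categories": []}},
                ]
            }
        ]
    }
}

rows = venues_from_response(sample)
for row in rows:
    print(row)
```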

In [61]:
#get all nearby venues
massachusetts_venues = getNearbyVenues(names=data_clean['City'],
                                 latitudes=data_clean['Latitude'],
                                 longitudes=data_clean['Longitude']
                                 )
print('Venues ready')

Venues ready


### Let's visualize the data extracted from Foursquare

In [62]:
print(massachusetts_venues.shape)
massachusetts_venues.head(10)

(4270, 7)


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Amherst,42.368566,-72.505714,Spirit Haus,42.373262,-72.502193,Liquor Store
1,Amherst,42.368566,-72.505714,Amherst Cinema,42.375219,-72.520377,Movie Theater
2,Amherst,42.368566,-72.505714,Amherst Coffee,42.375694,-72.520501,Coffee Shop
3,Amherst,42.368566,-72.505714,Antonio's Pizza,42.376193,-72.51981,Pizza Place
4,Amherst,42.368566,-72.505714,Amherst Common,42.373922,-72.519379,Park
5,Amherst,42.368566,-72.505714,Pita Pockets,42.377343,-72.519499,Halal Restaurant
6,Amherst,42.368566,-72.505714,Lone Wolf,42.375735,-72.518509,Breakfast Spot
7,Amherst,42.368566,-72.505714,Lord Jeffery Inn,42.374501,-72.518794,Hotel
8,Amherst,42.368566,-72.505714,Amherst Books,42.375742,-72.519568,Bookstore
9,Amherst,42.368566,-72.505714,Emily Dickinson Museum,42.376222,-72.514377,History Museum


Let's check how many venues were returned for each city

In [63]:
massachusetts_venues.groupby('City').count().head()

Unnamed: 0_level_0,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Amherst,100,100,100,100,100,100
Andover,95,95,95,95,95,95
Beverly,100,100,100,100,100,100
Boston,100,100,100,100,100,100
Bourne,70,70,70,70,70,70


#### Let's find out how many unique categories can be curated from all the returned venues

In [64]:
print('There are {} unique categories.'.format(len(massachusetts_venues['Venue Category'].unique())))

There are 319 unique categories.


### Now let's analyze each city

In [65]:
# one hot encoding
massachusetts_onehot = pd.get_dummies(massachusetts_venues[['Venue Category']], prefix="", prefix_sep="")

# add city column back to dataframe
massachusetts_onehot['City'] = massachusetts_venues['City'] 

# move city column to the first column
fixed_columns = [massachusetts_onehot.columns[-1]] + list(massachusetts_onehot.columns[:-1])
massachusetts_onehot = massachusetts_onehot[fixed_columns]

massachusetts_onehot.head()

Unnamed: 0,City,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Automotive Shop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Garden,Beer Store,Belgian Restaurant,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Bowling Green,Boxing Gym,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Business Service,Butcher,Café,Cajun / Creole Restaurant,Cambodian Restaurant,Campground,Candy Store,Caribbean Restaurant,Casino,Castle,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Cidery,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Academic Building,College Arts Building,College Basketball Court,College Bookstore,College Cafeteria,College Hockey Rink,College Quad,College Stadium,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Cuban Restaurant,Cupcake Shop,Cycle Studio,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Disc Golf,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Eastern European Restaurant,Electronics Store,Event Space,Fabric Shop,Fair,Farm,Farmers Market,Fast Food Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market,Fishing Store,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Football Stadium,Forest,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,German Restaurant,Gift Shop,Go Kart Track,Golf Course,Golf Driving Range,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gymnastics Gym,Halal Restaurant,Harbor / Marina,Hardware Store,Health Food Store,High School,Historic Site,History Museum,Hobby Shop,Hockey Arena,Hockey Field,Hockey Rink,Home Service,Hostel,Hot Dog Joint,Hotel,Hotel Bar,Hotpot Restaurant,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Kitchen Supply Store,Korean Restaurant,Kosher Restaurant,Lake,Latin American Restaurant,Library,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Marijuana Dispensary,Market,Martial Arts Dojo,Mattress Store,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Meze Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Monument / Landmark,Moroccan Restaurant,Motel,Motorsports Shop,Movie Theater,Moving Target,Multiplex,Museum,Music Store,Music Venue,National Park,Nature Preserve,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,Outdoors & Recreation,Paintball Field,Paper / Office Supplies Store,Park,Pedestrian Plaza,Peking Duck Restaurant,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Photography Studio,Pizza Place,Playground,Plaza,Poke Place,Polish Restaurant,Pool,Pool Hall,Portuguese Restaurant,Pub,Ramen Restaurant,Record Shop,Rental Car Location,Rental Service,Resort,Rest Area,Restaurant,River,Rock Club,Roller Rink,Romanian Restaurant,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shipping Store,Shoe Store,Shop & Service,Shopping Mall,Skating Rink,Ski Area,Ski Lodge,Ski Shop,Ski Trail,Smoke Shop,Smoothie Shop,Snack Place,Soba Restaurant,Soccer Field,Soup Place,South American Restaurant,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stadium,State / Provincial Park,Steakhouse,Storage Facility,Student Center,Summer Camp,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tapas Restaurant,Tattoo Parlor,Tea Room,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Toll Booth,Toll Plaza,Tourist Information Center,Toy / Game Store,Trail,Train,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Amherst,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Amherst,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Amherst,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Amherst,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Amherst,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [66]:
massachusetts_onehot.shape

(4270, 320)
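
To make the encoding concrete, the sketch below one-hot encodes a handful of venues by hand and then averages the rows per city, which is what the group-by-mean in the following step computes. It uses only the standard library. The Amherst categories come from the venue table above, while the Boston rows are made up for illustration.

```python
from collections import defaultdict

# (city, category) pairs: Amherst rows from the venue table above,
# Boston rows invented for this example.
venues = [
    ("Amherst", "Liquor Store"),
    ("Amherst", "Movie Theater"),
    ("Amherst", "Coffee Shop"),
    ("Amherst", "Pizza Place"),
    ("Boston", "Coffee Shop"),
    ("Boston", "Coffee Shop"),
    ("Boston", "Park"),
    ("Boston", "Pizza Place"),
]

# One column per distinct category, in alphabetical order.
categories = sorted({category for _, category in venues})

# One row per venue: 1 in the column of its category, 0 elsewhere.
onehot = [
    (city, [1 if category == column else 0 for column in categories])
    for city, category in venues
]

# Averaging the one-hot rows of a city gives the share of each category there.
sums = defaultdict(lambda: [0] * len(categories))
counts = defaultdict(int)
for city, vector in onehot:
    counts[city] += 1
    sums[city] = [total + value for total, value in zip(sums[city], vector)]

frequencies = {
    city: [total / counts[city] for total in sums[city]] for city in sums
}

print(categories)
for city, shares in frequencies.items():
    print(city, shares)
```

Each city's shares sum to 1, so cities with very different venue counts become comparable before clustering.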

#### Next, let's group rows by City, taking the mean of the frequency of occurrence of each category

In [67]:
massachusetts_grouped = massachusetts_onehot.groupby('City').mean().reset_index()
massachusetts_grouped.head()

Unnamed: 0,City,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Automotive Shop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Garden,Beer Store,Belgian Restaurant,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Bowling Green,Boxing Gym,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Business Service,Butcher,Café,Cajun / Creole Restaurant,Cambodian Restaurant,Campground,Candy Store,Caribbean Restaurant,Casino,Castle,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Cidery,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Academic Building,College Arts Building,College Basketball Court,College Bookstore,College Cafeteria,College Hockey Rink,College Quad,College Stadium,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Cuban Restaurant,Cupcake Shop,Cycle Studio,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Disc Golf,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Eastern European Restaurant,Electronics Store,Event Space,Fabric Shop,Fair,Farm,Farmers Market,Fast Food Restaurant,Financial or Legal Service,Fish & Chips Shop,Fish Market,Fishing Store,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Football Stadium,Forest,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,German Restaurant,Gift Shop,Go Kart Track,Golf Course,Golf Driving Range,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gymnastics Gym,Halal Restaurant,Harbor / Marina,Hardware Store,Health Food Store,High School,Historic Site,History Museum,Hobby Shop,Hockey Arena,Hockey Field,Hockey Rink,Home Service,Hostel,Hot Dog Joint,Hotel,Hotel Bar,Hotpot Restaurant,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Kitchen Supply Store,Korean Restaurant,Kosher Restaurant,Lake,Latin American Restaurant,Library,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Marijuana Dispensary,Market,Martial Arts Dojo,Mattress Store,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Meze Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Monument / Landmark,Moroccan Restaurant,Motel,Motorsports Shop,Movie Theater,Moving Target,Multiplex,Museum,Music Store,Music Venue,National Park,Nature Preserve,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,Outdoors & Recreation,Paintball Field,Paper / Office Supplies Store,Park,Pedestrian Plaza,Peking Duck Restaurant,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Photography Studio,Pizza Place,Playground,Plaza,Poke Place,Polish Restaurant,Pool,Pool Hall,Portuguese Restaurant,Pub,Ramen Restaurant,Record Shop,Rental Car Location,Rental Service,Resort,Rest Area,Restaurant,River,Rock Club,Roller Rink,Romanian Restaurant,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shipping Store,Shoe Store,Shop & Service,Shopping Mall,Skating Rink,Ski Area,Ski Lodge,Ski Shop,Ski Trail,Smoke Shop,Smoothie Shop,Snack Place,Soba Restaurant,Soccer Field,Soup Place,South American Restaurant,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stadium,State / Provincial Park,Steakhouse,Storage Facility,Student Center,Summer Camp,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tapas Restaurant,Tattoo Parlor,Tea Room,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Toll Booth,Toll Plaza,Tourist Information Center,Toy / Game Store,Trail,Train,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Amherst,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.03,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.01,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.01,0.02,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.05,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0
1,Andover,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.021053,0.0,0.0,0.0,0.0,0.010526,0.021053,0.010526,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.021053,0.010526,0.0,0.0,0.0,0.0,0.021053,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.073684,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.010526,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.031579,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.031579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.021053,0.010526,0.0,0.0,0.010526,0.010526,0.031579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.021053,0.0,0.0,0.010526,0.0,0.042105,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.010526,0.010526,0.0,0.010526,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.021053,0.0,0.0,0.0,0.0,0.0,0.0,0.021053,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.031579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.031579,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.010526,0.010526,0.0,0.0,0.010526,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0,0.010526,0.010526,0.0,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.0
2,Beverly,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.04,0.02,0.02,0.0,0.0,0.0,0.06,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.0
3,Boston,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.04,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.04,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.03,0.0,0.01,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.01,0.0
4,Bourne,0.0,0.0,0.0,0.0,0.0,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.014286,0.0,0.014286,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.014286,0.0,0.0,0.0,0.028571,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.014286,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042857,0.0,0.0,0.0,0.0,0.0,0.028571,0.014286,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.014286,0.028571,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.042857,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Let's confirm the new size

In [68]:
massachusetts_grouped.shape

(53, 320)

Let's write a function that sorts a city's venue categories by frequency, in descending order, and returns the most common ones.

In [69]:
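def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the city name) and keep only the venue frequencies
    row_categories = row.iloc[1:]
    # rank the categories from most to least frequent
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]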

Now let's create the new dataframe and display the top 10 venues for each city.

In [70]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the third take the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
massachusetts_venues_sorted = pd.DataFrame(columns=columns)
massachusetts_venues_sorted['City'] = massachusetts_grouped['City']

for ind in np.arange(massachusetts_grouped.shape[0]):
    massachusetts_venues_sorted.iloc[ind, 1:] = return_most_common_venues(massachusetts_grouped.iloc[ind, :], num_top_venues)

massachusetts_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amherst,Coffee Shop,Grocery Store,Sandwich Place,Hotel,American Restaurant,Bakery,Pizza Place,Department Store,Liquor Store,Breakfast Spot
1,Andover,Coffee Shop,American Restaurant,Pizza Place,Italian Restaurant,Sandwich Place,Restaurant,Gym / Fitness Center,Fast Food Restaurant,Donut Shop,Burger Joint
2,Beverly,Coffee Shop,Italian Restaurant,Park,Pizza Place,Bakery,Ice Cream Shop,Pub,Sandwich Place,Indie Movie Theater,Brewery
3,Boston,Park,Bakery,Seafood Restaurant,Gym,Coffee Shop,Hotel,Historic Site,Pizza Place,Sandwich Place,New American Restaurant
4,Bourne,Seafood Restaurant,Donut Shop,Convenience Store,American Restaurant,Park,Sandwich Place,Restaurant,Beach,Gas Station,Breakfast Spot


### Now let's cluster the cities with colleges using k-means clustering

Run *k*-means to cluster the cities into 4 clusters.

In [71]:
# set number of clusters
kclusters = 4

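# keep only the numeric frequency columns; the positional axis argument
# (drop('City', 1)) is no longer accepted by recent pandas versions
massachusetts_grouped_clustering = massachusetts_grouped.drop(columns='City')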

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(massachusetts_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 1, 0, 0, 1, 2, 0, 1, 0, 0], dtype=int32)
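
The first ten labels give only a partial picture, so it can help to count how many rows fall into each cluster before interpreting them. The sketch below does this on a small synthetic array; the data and the names `toy`, `toy_kmeans` and `cluster_sizes` are assumptions made for illustration, and `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic frequency-like rows forming two well-separated groups
toy = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy)

# number of rows assigned to each cluster label
cluster_sizes = np.bincount(toy_kmeans.labels_, minlength=2)
print(cluster_sizes)
```

The same `np.bincount(kmeans.labels_)` call on the real model would show at a glance whether any cluster holds only one or two cities.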

In [72]:
# add clustering labels
massachusetts_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

massachusetts_merged = data_clean

# join the sorted venues (with their cluster labels) onto data_clean, which holds the latitude/longitude of each city
massachusetts_merged = massachusetts_merged.join(massachusetts_venues_sorted.set_index('City'), on='City')

massachusetts_merged.head()

Unnamed: 0,City,Number of Colleges,Colleges,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amherst,3,"Amherst College,Hampshire College,University o...",42.368566,-72.505714,0,Coffee Shop,Grocery Store,Sandwich Place,Hotel,American Restaurant,Bakery,Pizza Place,Department Store,Liquor Store,Breakfast Spot
1,Andover,1,Massachusetts School of Law,42.65717,-71.140878,1,Coffee Shop,American Restaurant,Pizza Place,Italian Restaurant,Sandwich Place,Restaurant,Gym / Fitness Center,Fast Food Restaurant,Donut Shop,Burger Joint
2,Beverly,2,"Endicott College,Montserrat College of Art",42.558428,-70.880049,0,Coffee Shop,Italian Restaurant,Park,Pizza Place,Bakery,Ice Cream Shop,Pub,Sandwich Place,Indie Movie Theater,Brewery
3,Boston,24,"Bay State College,Benjamin Franklin Institute ...",42.360253,-71.058291,0,Park,Bakery,Seafood Restaurant,Gym,Coffee Shop,Hotel,Historic Site,Pizza Place,Sandwich Place,New American Restaurant
4,Bourne,1,Massachusetts Maritime Academy,41.741217,-70.59892,1,Seafood Restaurant,Donut Shop,Convenience Store,American Restaurant,Park,Sandwich Place,Restaurant,Beach,Gas Station,Breakfast Spot
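
One thing worth checking after a join like this is whether every city actually received a cluster label. `DataFrame.join` keeps all rows of the left frame, so a city with no venue data would end up with `NaN` in `Cluster Labels`, and the column would become floating point. The sketch below shows the effect on two hypothetical frames (`cities` and `labels` are made-up stand-ins, not the project data) and one possible way to handle it.

```python
import pandas as pd

# hypothetical stand-ins for the city table and the labeled venue table
cities = pd.DataFrame({
    'City': ['A', 'B', 'C'],
    'Latitude': [42.0, 42.1, 42.2],
})
labels = pd.DataFrame({
    'City': ['A', 'B'],          # city 'C' has no venue data
    'Cluster Labels': [0, 1],
})

merged = cities.join(labels.set_index('City'), on='City')

# count cities left without a label by the left join
n_missing = int(merged['Cluster Labels'].isna().sum())

# one option: drop unlabeled cities and restore integer labels
cleaned = merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
print(n_missing, cleaned['Cluster Labels'].tolist())
```

Integer labels matter later on, because the labels are used to index the list of marker colors.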


### Finally, let's visualize the resulting clusters

In [73]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=9)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per city, colored by its cluster label
for lat, lon, poi, cluster in zip(massachusetts_merged['Latitude'], massachusetts_merged['Longitude'], massachusetts_merged['City'], massachusetts_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
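
In the color setup above, `ys` is used only for its length, which equals `kclusters`. A minimal sketch of the same idea, assuming only that one distinct color per cluster is wanted, samples the `rainbow` colormap directly; the name `palette` is introduced here for illustration.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 4

# sample kclusters evenly spaced colors from the rainbow colormap
# and convert each RGBA tuple to a hex string usable by folium
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]
print(palette)
```

Each entry is a hex string such as those folium accepts for `color` and `fill_color`.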

### Examine Clusters

#### Cluster 0

In [74]:
massachusetts_merged.loc[massachusetts_merged['Cluster Labels'] == 0, massachusetts_merged.columns[[0, 1, 2] + list(range(5, massachusetts_merged.shape[1]))]]

Unnamed: 0,City,Number of Colleges,Colleges,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amherst,3,"Amherst College,Hampshire College,University o...",0,Coffee Shop,Grocery Store,Sandwich Place,Hotel,American Restaurant,Bakery,Pizza Place,Department Store,Liquor Store,Breakfast Spot
2,Beverly,2,"Endicott College,Montserrat College of Art",0,Coffee Shop,Italian Restaurant,Park,Pizza Place,Bakery,Ice Cream Shop,Pub,Sandwich Place,Indie Movie Theater,Brewery
3,Boston,24,"Bay State College,Benjamin Franklin Institute ...",0,Park,Bakery,Seafood Restaurant,Gym,Coffee Shop,Hotel,Historic Site,Pizza Place,Sandwich Place,New American Restaurant
6,Brighton,1,Saint John's Seminary,0,Pizza Place,Bakery,Grocery Store,Ice Cream Shop,Trail,Park,Indie Movie Theater,Chinese Restaurant,Sandwich Place,Rock Club
8,Brookline,2,"Boston Graduate School of Psychoanalysis,Helle...",0,Park,Pizza Place,Bakery,Trail,Sandwich Place,American Restaurant,Seafood Restaurant,Grocery Store,Chinese Restaurant,Brewery
9,Cambridge,6,"Cambridge College,Harvard University,Hult Inte...",0,Bakery,Café,New American Restaurant,Pizza Place,Park,Brewery,Seafood Restaurant,Indie Movie Theater,Coffee Shop,Spa
10,Chestnut Hill,2,"Boston College,Pine Manor College",0,Ice Cream Shop,Pizza Place,American Restaurant,Park,Gym,Gym / Fitness Center,Pub,Sandwich Place,Grocery Store,Thai Restaurant
19,Framingham,1,Framingham State University,0,Grocery Store,Bakery,Furniture / Home Store,Brazilian Restaurant,Brewery,Indian Restaurant,Ice Cream Shop,Pizza Place,Department Store,Deli / Bodega
29,Medford,1,Tufts University,0,Bakery,Café,Italian Restaurant,Mexican Restaurant,Ice Cream Shop,Pizza Place,Breakfast Spot,Park,American Restaurant,Brewery
31,Needham,1,Olin College,0,Pizza Place,Park,Italian Restaurant,Bakery,Chinese Restaurant,Japanese Restaurant,Coffee Shop,Burger Joint,Thai Restaurant,Golf Course


#### Cluster 1

In [75]:
massachusetts_merged.loc[massachusetts_merged['Cluster Labels'] == 1, massachusetts_merged.columns[[0, 1, 2] + list(range(5, massachusetts_merged.shape[1]))]]

Unnamed: 0,City,Number of Colleges,Colleges,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Andover,1,Massachusetts School of Law,1,Coffee Shop,American Restaurant,Pizza Place,Italian Restaurant,Sandwich Place,Restaurant,Gym / Fitness Center,Fast Food Restaurant,Donut Shop,Burger Joint
4,Bourne,1,Massachusetts Maritime Academy,1,Seafood Restaurant,Donut Shop,Convenience Store,American Restaurant,Park,Sandwich Place,Restaurant,Beach,Gas Station,Breakfast Spot
7,Brockton,1,Massasoit Community College,1,Donut Shop,Convenience Store,Pizza Place,Coffee Shop,Pharmacy,Discount Store,American Restaurant,Breakfast Spot,Gym / Fitness Center,Pub
11,Chicopee,1,Elms College,1,Pizza Place,American Restaurant,Pharmacy,Discount Store,Gym / Fitness Center,Donut Shop,Grocery Store,Bakery,Ice Cream Shop,Bar
13,Danvers,1,North Shore Community College,1,American Restaurant,Sandwich Place,Italian Restaurant,Pizza Place,Department Store,Chinese Restaurant,Steakhouse,Liquor Store,Cosmetics Shop,Ice Cream Shop
14,Dartmouth,1,University of Massachusetts Dartmouth,1,Pizza Place,Clothing Store,Breakfast Spot,Donut Shop,American Restaurant,Café,Convenience Store,Sandwich Place,Pet Store,Lingerie Store
17,Fall River,1,Bristol Community College,1,Restaurant,American Restaurant,Pizza Place,Sandwich Place,Bakery,Breakfast Spot,Department Store,Chinese Restaurant,Grocery Store,Coffee Shop
18,Fitchburg,1,Fitchburg State University,1,Coffee Shop,Convenience Store,Sandwich Place,Restaurant,Donut Shop,Ice Cream Shop,Pub,Pizza Place,Park,American Restaurant
20,Franklin,1,Dean College,1,American Restaurant,Donut Shop,Sandwich Place,Spa,Italian Restaurant,Hotel,Gym,Breakfast Spot,Pizza Place,Fast Food Restaurant
22,Great Barrington,1,Bard College at Simon's Rock,1,American Restaurant,Ski Area,Café,Liquor Store,Golf Course,Mediterranean Restaurant,Pharmacy,Ski Trail,Mexican Restaurant,Health Food Store


#### Cluster 2

In [76]:
massachusetts_merged.loc[massachusetts_merged['Cluster Labels'] == 2, massachusetts_merged.columns[[0, 1, 2] + list(range(5, massachusetts_merged.shape[1]))]]

Unnamed: 0,City,Number of Colleges,Colleges,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Bridgewater,1,Bridgewater State University,2,Donut Shop,Pizza Place,Sandwich Place,Pharmacy,Convenience Store,Grocery Store,Coffee Shop,Thai Restaurant,Mobile Phone Shop,Breakfast Spot
15,Dudley,1,Nichols College,2,Donut Shop,Pizza Place,Convenience Store,Restaurant,Sandwich Place,Trail,Pharmacy,Seafood Restaurant,Chinese Restaurant,Nightclub
16,Easton,1,Stonehill College,2,Grocery Store,Pharmacy,Sandwich Place,Pizza Place,Donut Shop,Gym,Salon / Barbershop,American Restaurant,Golf Course,Liquor Store
21,Gardner,1,Mount Wachusett Community College,2,Donut Shop,Pharmacy,Gas Station,Sandwich Place,Discount Store,Restaurant,Pizza Place,American Restaurant,Fried Chicken Joint,Grocery Store
36,Norton,1,Wheaton College,2,Donut Shop,Pizza Place,Restaurant,Pharmacy,Video Store,American Restaurant,Gym / Fitness Center,Ice Cream Shop,Convenience Store,Sandwich Place
37,Paxton,1,Anna Maria College,2,Donut Shop,State / Provincial Park,Pizza Place,Moving Target,Gastropub,Campground,Golf Course,Market,Gym / Fitness Center,Breakfast Spot


#### Cluster 3

In [77]:
massachusetts_merged.loc[massachusetts_merged['Cluster Labels'] == 3, massachusetts_merged.columns[[0, 1, 2] + list(range(5, massachusetts_merged.shape[1]))]]

Unnamed: 0,City,Number of Colleges,Colleges,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Conway,1,Conway School of Landscape Design,3,River,Farm,Construction & Landscaping,Trail,Photography Studio,Bar,Food & Drink Shop,Food,Flower Shop,Flea Market
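
Reading the four tables side by side can be tedious, so a compact per-cluster summary may help when comparing them. The sketch below is one possible way to build it, shown on a hypothetical frame `toy_merged` that only mimics the column names of `massachusetts_merged`; the cities and venues in it are invented.

```python
import pandas as pd

# hypothetical frame with the same column names as the merged table
toy_merged = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Cluster Labels': [0, 0, 0, 1, 1, 1],
    '1st Most Common Venue': [
        'Park', 'Park', 'Bakery',
        'Donut Shop', 'Donut Shop', 'Pizza Place',
    ],
})

# per cluster: number of cities and the venue that most often ranks first
summary = toy_merged.groupby('Cluster Labels').agg(
    cities=('City', 'count'),
    top_venue=('1st Most Common Venue', lambda s: s.value_counts().idxmax()),
)
print(summary)
```

Applied to the real table, a summary like this would make it easy to see, for instance, which clusters are led by donut shops and which by parks or bakeries.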
