In this project I address the following question: is there any correlation between how good a university is and the kind of neighborhood it is located in? One should not expect a strong correlation here, since many other factors affect the ranking, so it is better to reformulate the question slightly. Namely, in terms of location, what do universities in the same rank cluster have in common? What distinguishes the best universities from the others?

To do this, I first obtain the list of the world's 1000 best universities according to the 2018 QS World University Rankings. Then, using the googlemaps client for the Google geocoding service, I obtain the latitude and longitude of each university and put it all together into a dataframe.

The next step is to explore the neighborhoods of the universities using Foursquare and the geographical data obtained here. This will be done in the next assignment.

In [29]:
#importing all libraries we will use
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import ssl
import re
import json # library to handle JSON files
#!conda install -c conda-forge geotext --yes
from geotext import GeoText

import googlemaps
from datetime import datetime

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
geolocator = Nominatim(user_agent="uni_expl")

from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=3)

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Now I load the ranking spreadsheet and clean up the data, keeping only the following information: rank, name of the university, the country where it is located, and its size.

In [2]:
# read the ranking spreadsheet; the column names are in the fourth row
unidata_full=pd.read_excel('Copy of 2019-QS-World-University-Rankings-v1.0.xlsx',
                           sheet_name=0,
                           header=3,
                           index_col=False,
                           keep_default_na=True
                           )
# drop the first row below the header
unidata_full.drop(0,axis=0, inplace=True)
# keep only the first five columns
top_uni=unidata_full.iloc[:, :5].copy()


In [3]:
top_uni.columns=['rank','2018', 'University','Country','Size']

In [4]:
top_uni.drop('2018',axis=1,inplace=True)
top_uni.set_index('rank',inplace=True)
top_uni.dropna(inplace=True)
top_uni.head()


rank,University,Country,Size
1,MASSACHUSETTS INSTITUTE OF TECHNOLOGY (MIT),United States,M
2,STANFORD UNIVERSITY,United States,L
3,HARVARD UNIVERSITY,United States,L
4,CALIFORNIA INSTITUTE OF TECHNOLOGY (CALTECH),United States,S
5,UNIVERSITY OF OXFORD,United Kingdom,L


Now I introduce the function that produces the geodata: latitude and longitude. To avoid confusion between places with similar names in different countries, I also include the country in the query.

In [31]:
API_KEY='MY_KEY'

In [33]:
gmaps = googlemaps.Client(key=API_KEY)

In [35]:
geocode_result = gmaps.geocode('1600 Amphitheatre Parkway, Mountain View, CA')

In [43]:
def uni_loc(university, country):
    location = gmaps.geocode(str(university)+','+str(country))
    latitude = location[0]['geometry']['location']['lat']
    longitude = location[0]['geometry']['location']['lng']
    return [latitude, longitude]
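
The function above assumes that the geocoder always returns at least one match. Below is a minimal sketch of a more defensive way to read the response. It assumes the response has the same shape that uni_loc indexes into (a list of dicts with geometry, location, lat and lng); the helper name first_lat_lng and the sample response are hypothetical and only serve as an illustration.

```python
from typing import Optional


def first_lat_lng(results: list) -> Optional[list]:
    """Return [lat, lng] of the first match, or None if there is no match."""
    if not results:
        return None
    loc = results[0]['geometry']['location']
    return [loc['lat'], loc['lng']]


# Hypothetical response, shaped like the structure uni_loc indexes into
sample = [{'geometry': {'location': {'lat': 51.754816, 'lng': -1.254367}}}]

print(first_lat_lng(sample))
print(first_lat_lng([]))
```

Returning None instead of raising makes it possible to tell "no match" apart from other failures, such as a network error.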


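Now I create a dictionary to hold the geodata as it is obtained. If geocoding by the full name fails, the loop below retries with the abbreviation given in parentheses at the end of the name; universities that still fail are collected in the list bad. I then turn the dictionary into a pandas dataframe and join it with the top universities dataframe to obtain the dataframe we will work with later.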

In [44]:
geo_uni=dict()

In [None]:
bad=list()  # universities that could not be geocoded
for university,country in zip(top_uni['University'],top_uni['Country']):
    try:
        geo_uni[university]=uni_loc(university,country)
        print(university)
    except Exception:
        # fall back to the trailing parenthesized abbreviation, e.g. "(MIT)"
        abbrev=re.findall(r'\(.+\)\Z',str(university))
        try:
            geo_uni[university]=uni_loc(abbrev[0],country)
            print(university)
        except Exception:
            bad.append(university)
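
The fallback logic of the loop can be sketched in isolation as follows. This is only an illustration under stated assumptions: the helper names abbreviation and geocode_with_fallback are hypothetical, fake_lookup is a stand-in for uni_loc so that the sketch runs without an API key, and, unlike the loop above, the sketch passes only the text inside the parentheses to the geocoder.

```python
import re


def abbreviation(name):
    """Return the text inside a trailing '(...)', or None if there is none."""
    match = re.search(r'\(([^()]+)\)\s*\Z', name)
    return match.group(1) if match else None


def geocode_with_fallback(name, country, lookup):
    """Try the full name first, then the trailing abbreviation.

    `lookup` is any callable (query, country) -> [lat, lng] that raises on failure.
    """
    try:
        return lookup(name, country)
    except Exception:
        short = abbreviation(name)
        if short is None:
            raise  # nothing to fall back to
        return lookup(short, country)


# Stand-in for uni_loc (hypothetical data); a KeyError plays the role of a failed lookup
def fake_lookup(query, country):
    known = {'MIT': [42.360091, -71.09416]}
    return known[query]


name = 'MASSACHUSETTS INSTITUTE OF TECHNOLOGY (MIT)'
print(abbreviation(name))
print(geocode_with_fallback(name, 'United States', fake_lookup))
```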
        

In [48]:
# one row per university, with the [latitude, longitude] list spread over two columns
geodata=pd.DataFrame.from_dict(geo_uni, orient='index', columns=['Latitude','Longitude'])




In [49]:
# name the index so that it is written out as the 'University' column
geodata.index.name='University'
geodata.to_csv("GEONEW.csv")


In [50]:
geodata.head(10)
        

University,Latitude,Longitude
MASSACHUSETTS INSTITUTE OF TECHNOLOGY (MIT),42.360091,-71.09416
STANFORD UNIVERSITY,37.427475,-122.169719
HARVARD UNIVERSITY,42.377003,-71.11666
CALIFORNIA INSTITUTE OF TECHNOLOGY (CALTECH),34.137658,-118.125269
UNIVERSITY OF OXFORD,51.754816,-1.254367
UNIVERSITY OF CAMBRIDGE,52.204267,0.114908
ETH ZURICH (SWISS FEDERAL INSTITUTE OF TECHNOLOGY),47.376313,8.54767
IMPERIAL COLLEGE LONDON,51.4988,-0.174877
UNIVERSITY OF CHICAGO,41.788608,-87.598713
UCL (UNIVERSITY COLLEGE LONDON),51.524275,-0.13336


Now let us join the two tables that we have. The resulting table is the one we are going to work with in the next part of the project. 

In [58]:
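# re-index the ranking table by university name so it can be joined with geodata
top_uni.reset_index(inplace=True)
top_uni.set_index('University',inplace=True)
# drop a leftover 'index' column if one is present
top_uni.drop('index',axis=1,inplace=True,errors='ignore')
top_uni.head()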

University,rank,Country,Size
MASSACHUSETTS INSTITUTE OF TECHNOLOGY (MIT),1,United States,M
STANFORD UNIVERSITY,2,United States,L
HARVARD UNIVERSITY,3,United States,L
CALIFORNIA INSTITUTE OF TECHNOLOGY (CALTECH),4,United States,S
UNIVERSITY OF OXFORD,5,United Kingdom,L


In [59]:
data_geo=top_uni.join(geodata, on='University')
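
As a small illustration of what this join does, here is a sketch on a hypothetical two-row table (the names UNI A and UNI B and their values are made up). It assumes that both tables are indexed by university name, as above, and shows that a university missing from the geodata ends up with NaN coordinates, which is why rows with missing values are dropped before plotting.

```python
import pandas as pd

# Hypothetical ranking table indexed by university name
ranks = pd.DataFrame(
    {'rank': [1, 2], 'Country': ['United States', 'United States']},
    index=pd.Index(['UNI A', 'UNI B'], name='University'),
)

# Geodata for only one of them, as if geocoding had failed for 'UNI B'
geo = pd.DataFrame.from_dict(
    {'UNI A': [42.36, -71.09]}, orient='index', columns=['Latitude', 'Longitude']
)
geo.index.name = 'University'

joined = ranks.join(geo)  # left join on the shared index
print(joined)             # 'UNI B' has NaN in Latitude and Longitude
print(joined.dropna())    # only 'UNI A' remains
```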

In [67]:
data_geo.reset_index(inplace=True)

In [68]:
data_geo.head()

,University,rank,Country,Size,Latitude,Longitude
0,MASSACHUSETTS INSTITUTE OF TECHNOLOGY (MIT),1,United States,M,42.360091,-71.09416
1,STANFORD UNIVERSITY,2,United States,L,37.427475,-122.169719
2,HARVARD UNIVERSITY,3,United States,L,42.377003,-71.11666
3,CALIFORNIA INSTITUTE OF TECHNOLOGY (CALTECH),4,United States,S,34.137658,-118.125269
4,UNIVERSITY OF OXFORD,5,United Kingdom,L,51.754816,-1.254367


In [None]:
data_geo.to_csv(r'C:\Users\arhip\Desktop\IBM\projects\Coursera_Capstone\UNI_data_geo.csv')

In [89]:
df=data_geo.dropna()

Now let us visualize the universities we will work with. 

In [73]:

# create a world map
map_uni = folium.Map(location=[46.2050242,6.1090692], zoom_start=2)

# add one circle marker per university
for lat, lng, university in zip(df['Latitude'], df['Longitude'], df['University']):
    label = folium.Popup(str(university), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_uni)

map_uni

Now let us outline the next steps of the project: 

1. We will divide the universities into rank clusters (e.g. top 100, 101-200, etc.).
2. We will use Foursquare to explore the neighborhood of each university.
3. Independently of the rank clustering, we will cluster the universities based on the neighborhood data.
4. We will look for any correlation between the neighborhood clusters and rank.
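
The first step could be sketched as follows. This is a minimal sketch, not the final implementation: the function name rank_cluster and the bucket width of 100 are assumptions, and so is the handling of rank strings such as '=20' or '501-510', which is a guess about how tied or banded ranks might appear rather than something the data above confirms.

```python
import re


def rank_cluster(rank, width=100):
    """Map a rank to a label such as '1-100' or '101-200'.

    Accepts ints or strings; for strings the first run of digits is used,
    so a band such as '501-510' is placed by its lower bound (an assumption).
    """
    match = re.search(r'\d+', str(rank))
    if match is None:
        raise ValueError(f'no rank number in {rank!r}')
    value = int(match.group())
    if value < 1:
        raise ValueError(f'rank must be positive, got {rank!r}')
    lower = (value - 1) // width * width + 1
    return f'{lower}-{lower + width - 1}'


for r in [1, 100, 101, '=20', '501-510']:
    print(r, '->', rank_cluster(r))
```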