# Neighbourhoods
### A comparative analysis of Lagos and Kigali

### Install Dependencies

In [None]:
!pip3 install bs4
!pip3 install requests
!pip3 install html5lib

### Import Dependencies
We import BeautifulSoup for scraping the Wikipedia page, requests for making HTTP calls, html5lib (an HTML parser that BeautifulSoup can use) and pandas for working with the extracted data as a DataFrame.
 

In [None]:
import html5lib
import pandas as pd

## Data Collection - Import Files

In [None]:
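# use the full option name; the short form "max_rows" is ambiguous in recent pandas versions
pd.set_option("display.max_rows", None)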
cost_of_living_data = pd.read_csv("cost_of_living.csv")
neighbourhoods_data = pd.read_csv("neighbourhoods.csv")
cost_of_living_data.head()


In [None]:
neighbourhoods_data

## Data Preprocessing - Convert Files into DataFrame

We need to clean cost_of_living_data to remove the extra currencies. I decided to use the Rwandan franc for the comparison, so we will drop the alternative sum in Naira.

In [None]:
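def remove_unneccesary_amount(value):
    # keep only the text before the first "R", i.e. the leading amount
    value = value.split("R")[0]
    # drop surrounding whitespace and thousands separators before converting
    value = value.strip()
    value = value.replace(",", "")
    value = float(value)
    return value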


In [None]:
cost_of_living_data["Kigali"]= cost_of_living_data["Kigali"].apply(remove_unneccesary_amount)
cost_of_living_data["Lagos"] = cost_of_living_data["Lagos"].apply(remove_unneccesary_amount)
cost_of_living_data.head(10)
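
The cleaning step above depends on the amount we keep appearing before the first "R" in each cell. As a hedged illustration, the sketch below extracts the leading number with a regular expression instead. The sample strings and the helper name `parse_leading_amount` are assumptions made for this example and are not taken from the CSV file.

```python
import re

# Matches a leading number such as "1,234.50" (thousands separators optional).
_AMOUNT = re.compile(r"^\s*([0-9][0-9,]*(?:\.[0-9]+)?)")


def parse_leading_amount(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = _AMOUNT.match(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Hypothetical cell values; the real file may be formatted differently.
samples = ["1,500.00 RWF", " 250 RWF (100.00 NGN)", "n/a"]
parsed = [parse_leading_amount(s) for s in samples]
print(parsed)
```

Returning None for unparseable cells, rather than raising, makes it easy to inspect the rows that need manual attention.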

We only need to compare the amounts for Kigali and Lagos, not the type of goods, so we create a new DataFrame with just the information we need.

In [None]:
living_cost = cost_of_living_data[["Kigali","Lagos"]]
living_cost.head()

Likewise, we process the neighbourhoods data, adding the latitude and longitude of each area so that it can be saved for future reuse.

In [None]:
from geopy.geocoders import Nominatim

In [None]:
def get_location_data(value):
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(value)
    if location is not None: 
        latitude = location.latitude
        longitude = location.longitude
        return latitude, longitude
    return None, None
  


In [None]:

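# geocode each "Neighborhoods,City" string once and split the (latitude, longitude) pairs
queries = neighbourhoods_data[['Neighborhoods','City']].agg(",".join, axis=1)
coordinates = pd.DataFrame(queries.apply(get_location_data).tolist(), index=neighbourhoods_data.index, columns=["Latitude", "Longitude"])
neighbourhoods_data["Latitude"] = coordinates["Latitude"]
neighbourhoods_data["Longitude"] = coordinates["Longitude"]
# count the neighbourhoods that could not be geocoded
neighbourhoods_data["Latitude"].isnull().sum()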
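
Because every geocoding lookup is a network request, it can help to look up each distinct query only once and to list the ones that fail. The sketch below shows that idea with a stand-in geocoder so that it runs offline. `geocode_unique`, `fake_geocode` and the coordinates are assumptions made for illustration, not part of geopy.

```python
def geocode_unique(queries, geocode):
    """Look up each distinct query once and return {query: (lat, lon)}."""
    cache = {}
    for query in queries:
        if query not in cache:
            cache[query] = geocode(query)
    return cache


# Stand-in for a real geocoder (made-up coordinates) that records its calls.
calls = []


def fake_geocode(query):
    calls.append(query)
    known = {"Ikeja,Lagos": (6.6, 3.35)}
    return known.get(query, (None, None))


queries = ["Ikeja,Lagos", "Ikeja,Lagos", "Unknown,Kigali"]
coords = geocode_unique(queries, fake_geocode)
# Queries without coordinates are the ones that need a manual fallback.
missing = [q for q, (lat, lon) in coords.items() if lat is None]
print(coords, missing)
```

In the notebook, `get_location_data` could be passed as the `geocode` argument, and the `missing` list would correspond to the rows later filled from a manually prepared file.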




In [None]:
not_found_coordinates = pd.read_csv("Missing_Coordinates.csv")
not_found_coordinates

In [None]:
neighbourhoods_data.set_index("Neighborhoods", inplace=True)
not_found_coordinates.set_index("Neighborhoods", inplace=True)

In [None]:
neighbourhoods_data

In [None]:

for val in not_found_coordinates.index:
    neighbourhoods_data.loc[val,["Latitude","Longitude"]] = not_found_coordinates.loc[val,["Latitude", "Longitude"]]
neighbourhoods_data.reset_index(inplace=True)
neighbourhoods_data[neighbourhoods_data["Latitude"].isnull()]


In [None]:
neighbourhoods_data = neighbourhoods_data[[neighbourhoods_data.columns[1]] +[neighbourhoods_data.columns[2]]  + [neighbourhoods_data.columns[0]]+ list(neighbourhoods_data.columns[3:])]
neighbourhoods_data

We see that all areas now have a latitude and longitude.

## Exploring Cost Of Living
  
We want to analyze the cost of living in Lagos versus Kigali to understand how prices are distributed and which city is more costly to live in. We use an independent two-sample t-test to check whether there is a significant difference between the cost of living in Lagos and in Kigali, and a correlation analysis to see whether prices follow the same order in both cities.

In [None]:
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline 
import plotly.express as px

### First we create a box plot to check whether location causes a significant difference in the price trend

In [None]:
# reshape to long format: one row per price, labelled with its city
# (rename returns a new DataFrame, which avoids SettingWithCopyWarning)
kigali_prices = living_cost[["Kigali"]].rename(columns={'Kigali': "Prices"})
kigali_prices["Location"] = "Kigali"
lagos_prices = living_cost[["Lagos"]].rename(columns={'Lagos': "Prices"})
lagos_prices["Location"] = "Lagos"
result = pd.concat([kigali_prices, lagos_prices], ignore_index=True)
# scaler =preprocessing.StandardScaler()
# result['Prices']= scaler.fit_transform(result[["Prices"]])
sns.boxplot(x= "Location", y="Prices", data=result)

The box plot suggests that the price distributions are broadly similar for Lagos and Kigali. For example, a meal for two costs more than a meal for one in Kigali, just as it does in Lagos. Location therefore appears to have little effect on the overall trend of product prices.
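
To make the comparison concrete, the values a box plot draws can also be computed directly. The sketch below is a minimal example on made-up prices. `five_number_summary` is a hypothetical helper, and the "inclusive" quantile method is an assumption chosen because it interpolates linearly between data points.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot is built from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up prices for illustration only.
prices = [100.0, 200.0, 300.0, 400.0, 500.0]
summary = five_number_summary(prices)
print(summary)
```

Comparing the medians and the interquartile ranges (Q3 minus Q1) of the two cities gives a numeric counterpart to what the plot shows.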

In [None]:
living_cost[['Kigali','Lagos']].corr()

In [None]:
pearson_coef, p_value = stats.pearsonr(living_cost['Lagos'], living_cost['Kigali'])
# show the p-value alongside the coefficient
pearson_coef, p_value
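
For reference, the Pearson coefficient is the covariance of the two series divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers. `pearson_r` is a hypothetical helper, not the SciPy implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up prices for illustration only.
lagos = [1.0, 2.0, 3.0, 4.0]
kigali = [2.0, 4.0, 6.0, 8.5]
r = pearson_r(lagos, kigali)
print(r)
```

A value close to 1 means items that are expensive in one city tend to be expensive in the other; it says nothing about which city is more expensive overall.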

We see that the price trends in the two cities are correlated, but not strongly. A final independent t-test will help us understand whether there is nevertheless a significant difference in the cost of living in Kigali versus Lagos.

In [None]:
stats.ttest_ind(living_cost["Kigali"], living_cost["Lagos"], equal_var=False)

A large p-value of 0.6729 means that we cannot reject the null hypothesis of identical means. There is therefore no significant difference between the average cost of living in Lagos and that of Kigali.
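
Passing `equal_var=False` to `stats.ttest_ind` selects Welch's t-test, which does not assume equal variances in the two groups. The sketch below works through the statistic and the Welch-Satterthwaite degrees of freedom on made-up data. `welch_t` is a hypothetical helper, and turning the statistic into a p-value is left to SciPy.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up samples for illustration only.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t, df)
```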

## Exploring Similarities (Areas/Neighbourhoods)
Having seen that there is no significant difference in the cost of living between the two cities, we explore their areas and neighbourhoods to find similarities between them.

We import the necessary packages for our exploration.

In [None]:
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import json
import matplotlib.colors as colors
import matplotlib.cm as cm
import folium
import requests

In [None]:

def get_coordinates(place):
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(place)
    # geocode returns None when the place cannot be found
    if location is None:
        raise ValueError('Could not geocode {}'.format(place))
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(place, latitude, longitude))
    return (latitude,longitude)



### Map of the cities with their neighbourhoods superimposed

In [None]:
# we use a place that is at the middle of both Nigeria and Rwanda so we can easily represent both places on the map
cities_map = folium.Map(location=get_coordinates("Republic of Congo"), zoom_start=5)
for lat, lng, label in zip(neighbourhoods_data['Latitude'], neighbourhoods_data['Longitude'], neighbourhoods_data['Neighborhoods']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(cities_map)
cities_map
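
The map above is centred by geocoding a place that lies roughly between the two countries. A possible alternative, sketched below, is to centre the map on the average of the neighbourhood coordinates. `map_centre` is a hypothetical helper and the coordinates are rough, illustrative values. A plain average is only reasonable here because the points do not straddle the antimeridian.

```python
def map_centre(latitudes, longitudes):
    """Average the coordinates, skipping entries that are missing."""
    points = [
        (lat, lon)
        for lat, lon in zip(latitudes, longitudes)
        if lat is not None and lon is not None
    ]
    if not points:
        raise ValueError("no valid coordinates to average")
    mean_lat = sum(lat for lat, _ in points) / len(points)
    mean_lon = sum(lon for _, lon in points) / len(points)
    return mean_lat, mean_lon


# Rough, illustrative coordinates: one point near Lagos, one near Kigali,
# and one missing entry that is ignored.
centre = map_centre([6.5, -1.9, None], [3.4, 30.1, None])
print(centre)
```

The returned tuple could be passed as the `location` argument of `folium.Map`.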

### Using the Foursquare API
Using the Foursquare API, we collect data about places near a specific latitude and longitude.

In [None]:
CLIENT_ID = '************************' # your Foursquare ID
CLIENT_SECRET = '*********************' # your Foursquare Secret
ACCESS_TOKEN = "***************" # your FourSquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100

Let's explore the neighbourhoods by getting the top nearby venues for each neighbourhood in Lagos and Kigali.

In [None]:

def getNearbyVenues(city, names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for city_name, name, lat, lng in zip(city, names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request and keep the list of recommended venues
        results = requests.get(url).json()["response"]
        results = results['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            city_name,
            name, 
            lat, 
            lng, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']) for venue in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ["City",'Neighborhoods',
                  'Neighborhoods Latitude', 
                  'Neighborhoods Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
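
The request URL in the function above is assembled with string formatting. As a hedged alternative, the query string can be built from a dictionary so that every value is URL-encoded. `build_explore_url` is a hypothetical helper, and the credentials in the example are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret, version,
                      radius=1000, limit=100):
    """Build the venues/explore URL used above from a dict of parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; substitute your own before making a request.
url = build_explore_url(6.5, 3.4, "ID", "SECRET", "20180605")
print(url)
```

The same dictionary could instead be passed to `requests.get` through its `params` argument, which performs the encoding itself.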
        

In [None]:
nearby_venues = getNearbyVenues(neighbourhoods_data["City"], neighbourhoods_data["Neighborhoods"], neighbourhoods_data["Latitude"], neighbourhoods_data["Longitude"])

Let's split nearby_venues into two sets, one for Kigali and one for Lagos, since we are trying to compare the two cities.

In [None]:
nearby_venues_lagos = nearby_venues[nearby_venues['City'] == "Lagos"]
nearby_venues_kigali = nearby_venues[nearby_venues['City'] == "Kigali"]

In [None]:
nearby_venues_kigali

In [None]:
nearby_venues_lagos

In [None]:

nearby_venues_kigali.groupby("Neighborhoods").count().sort_values(["City"], ascending=False).head(10)



In [None]:
nearby_venues_lagos.groupby("Neighborhoods").count().sort_values(["City"], ascending=False).head(10)

## Analyzing Neighbourhoods
To be able to use this information for clustering, we create dummy variables for each venue category.

In [None]:
import numpy as np

In [None]:

# one-hot encode the venue categories and rank them for each neighbourhood
def analyse_neighbourhood(city_venues, num_top_venues):
    neighbourhood_dummies = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")
    # add neighborhood column back to dataframe
    neighbourhood_dummies['Neighborhoods'] = city_venues['Neighborhoods']
    # move neighborhood column to the first column
    fixed_columns = [neighbourhood_dummies.columns[-1]] + list(neighbourhood_dummies.columns[:-1])
    neighbourhood_dummies = neighbourhood_dummies[fixed_columns]
    # the mean of each dummy column is the share of venues in that category
    neighbourhood_grouped = neighbourhood_dummies.groupby("Neighborhoods").mean().reset_index()
    columns = ["Neighborhoods"]
    indicators = ['st', 'nd', 'rd']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            columns.append('{}th Most Common Venue'.format(ind+1))
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhoods'] = neighbourhood_grouped['Neighborhoods']
    for ind in np.arange(neighbourhood_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(neighbourhood_grouped.iloc[ind, :], num_top_venues)

    return neighbourhood_grouped, neighborhoods_venues_sorted


def return_most_common_venues(row, num_top_venues):
    # skip the leading "Neighborhoods" entry and rank the categories by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    # keep the category name where it occurs at all, otherwise mark it as missing
    categories_list = [
        category if frequency > 0.0 else np.nan
        for category, frequency in row_categories_sorted.items()
    ]
    return categories_list[:num_top_venues]
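
Averaging the dummy columns per neighbourhood gives, for each category, the share of that neighbourhood's venues that fall in it. The sketch below reproduces that calculation on a toy list without pandas. `category_frequencies`, `top_categories` and the venue data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """venues: iterable of (neighbourhood, category) pairs.

    Returns {neighbourhood: {category: share of that neighbourhood's venues}}.
    """
    counts = defaultdict(Counter)
    for neighbourhood, category in venues:
        counts[neighbourhood][category] += 1
    return {
        neighbourhood: {
            category: n / sum(counter.values())
            for category, n in counter.items()
        }
        for neighbourhood, counter in counts.items()
    }


def top_categories(frequencies, num_top):
    """Return the `num_top` most frequent categories, most common first."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:num_top]]


# Toy data for illustration only.
venues = [("Ikeja", "Hotel"), ("Ikeja", "Hotel"), ("Ikeja", "Bar"), ("Remera", "Café")]
freqs = category_frequencies(venues)
print(freqs)
print(top_categories(freqs["Ikeja"], 1))
```

These shares are what the clustering step later uses as features, so two neighbourhoods are similar when their mixes of venue types are similar, regardless of how many venues each has.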



Let's print the top 10 venues for Lagos neighbourhoods.

In [None]:
_,top_10_venues_in_lagos = analyse_neighbourhood( nearby_venues_lagos, 10)
top_10_venues_in_lagos

We also get the top 10 venues in Kigali.

In [None]:
_,top_10_venues_in_kigali = analyse_neighbourhood( nearby_venues_kigali, 10)
top_10_venues_in_kigali

### Get Most Common Places

In [None]:
# get the unique list of most common places across all neighbourhoods
def get_most_common_place(neighborhoods_venues_sorted,val):
    common_places_list = [venue for venues in neighborhoods_venues_sorted.iloc[:,val:].to_numpy() for venue in venues if str(venue) != 'nan' and str(venue)!=""]
    common_venues = pd.Series(np.array(common_places_list)).value_counts()
    most_common_venues = common_venues.to_frame()
    most_common_venues.reset_index(inplace =True)
    most_common_venues.columns = ["Venues","Count"]
    return most_common_venues

In [None]:
most_common_venues_in_lagos = get_most_common_place(top_10_venues_in_lagos ,1)
Ten_most_common_venues_in_lagos = most_common_venues_in_lagos.head(10)
Ten_most_common_venues_in_lagos

Likewise, we get the 10 most common places in Kigali.

In [None]:
most_common_venues_in_kigali = get_most_common_place(top_10_venues_in_kigali ,1)
Ten_most_common_venues_in_kigali = most_common_venues_in_kigali.head(10)
Ten_most_common_venues_in_kigali

## Clustering Neighborhoods

We want to cluster similar neighbourhoods across both Lagos and Kigali. We use k-means clustering, an unsupervised machine learning method, to cluster these neighbourhoods.

First we determine the number of clusters that is the best fit for clustering the neighbourhoods

In [None]:
neighbourhood_grouped,neighborhoods_venues_sorted = analyse_neighbourhood(nearby_venues,10)
neighborhoods_venues_sorted


In [None]:
# pass the column by keyword; the positional axis argument was removed in pandas 2.0
neighbourhood_clustering_data = neighbourhood_grouped.drop(columns="Neighborhoods")

In [None]:
inertia = []
K = range(1,12)
for k in K:
    kmeans = KMeans(n_clusters=k, init="k-means++").fit(neighbourhood_clustering_data)
    inertia.append(kmeans.inertia_) 

In [None]:
plt.plot(K, inertia, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
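
Reading the elbow off a plot is a judgment call. One simple heuristic, sketched below on made-up inertia values, is to pick the k at which the curve bends most sharply, measured by the discrete second difference. `elbow_by_second_difference` is a hypothetical helper and only one of several possible criteria, so its result is best treated as a cross-check on the plot.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k where the inertia curve bends most sharply."""
    best_k, best_bend = None, float("-inf")
    # The first and last points have no neighbours on both sides, so skip them.
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up inertia values for illustration only.
ks = list(range(1, 8))
inertias = [100.0, 60.0, 35.0, 15.0, 12.0, 10.0, 9.0]
elbow = elbow_by_second_difference(ks, inertias)
print(elbow)
```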

So we use K = 4 as our number of clusters.


In [None]:
kmeans = KMeans(n_clusters=4, init="k-means++").fit(neighbourhood_clustering_data)
if 'Cluster Labels' in neighborhoods_venues_sorted.columns:
    del neighborhoods_venues_sorted["Cluster Labels"]
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# add the cluster label and top venues to each neighbourhood's coordinates
ny_merged = neighbourhoods_data.join(neighborhoods_venues_sorted.set_index('Neighborhoods'), on='Neighborhoods')

# remove the neighbourhoods without any nearby venues (they have no cluster label)
ny_merged.drop(ny_merged[ny_merged["Cluster Labels"].isna()].index, inplace=True)
ny_merged["Cluster Labels"] = ny_merged["Cluster Labels"].astype(int)
ny_merged.reset_index(drop=True, inplace=True)
# np.nan replaces np.NaN, which was removed in NumPy 2.0
ny_merged.replace(np.nan, '', inplace=True)
ny_merged

In [None]:
map_clusters = folium.Map(location=get_coordinates("Republic of Congo"), zoom_start=5)
kclusters =4
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhoods'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
first_cluster =ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[0]+[2] + list(range(5, ny_merged.shape[1]))]]
first_cluster

In [None]:
second_cluster =ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[0]+[2] + list(range(5, ny_merged.shape[1]))]]
second_cluster

In [None]:
third_cluster =ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[0]+[2] + list(range(5, ny_merged.shape[1]))]]
third_cluster

From the clustering we see that the first cluster is a very busy neighbourhood, the second cluster is moderately busy with fewer places, the third is a quieter neighbourhood with more venues like parks and movie theaters, and the last has relatively few places to visit nearby.

In [None]:
fourth_cluster =ny_merged.loc[ny_merged['Cluster Labels'] == 3, ny_merged.columns[[0]+[2] + list(range(5, ny_merged.shape[1]))]]
fourth_cluster

In [None]:
fifth_cluster =ny_merged.loc[ny_merged['Cluster Labels'] == 4, ny_merged.columns[[0]+[2] + list(range(5, ny_merged.shape[1]))]]
fifth_cluster

In [None]:
clusters = [first_cluster, second_cluster, third_cluster, fourth_cluster, fifth_cluster]
for cluster in clusters:
    print(get_most_common_place(cluster, 3).head(10),"\n")