## Capstone Project - The Battle of the Neighborhoods 
### Applied Data Science Capstone by IBM/Coursera

## Introduction: The Problem <a name="introduction"></a>

This project aims to identify the safest areas of Barcelona for driving based on the **total accidents**, explore the **neighbourhoods** with the most accidents to find their **most accident-prone streets**, and finally cluster the districts using **k-means clustering**.

This report is aimed at people who, like me, are looking for **the safest routes for driving**. When choosing which neighbourhoods to drive through, **safety** is a key concern. The **accident statistics** provide insight into this issue.


## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* The total number of accidents that happened on each street during the year.
* The most common neighbourhoods in each district.

The following data source is needed to extract/generate the required information:

- A real-world data set of accidents in Barcelona in 2017, from Kaggle, which we preprocess below.



### Part 1: Preprocessing a data set from Kaggle accidents in Barcelona<a name="part1"></a>


The data set lists the accidents handled by the local police in the city of Barcelona. It includes the number of injuries by severity, the number of vehicles involved and the point of impact.

The data come from the Open Data BCN portal, the Ajuntament de Barcelona's open data service.

Open Data BCN was born as a project in 2010 and launched its portal in 2011. It has since evolved and is now part of the Barcelona Ciutat Digital strategy, which fosters a pluralistic digital economy and develops a new model of urban innovation. That model is based on the digital transformation of the public sector and on the involvement of companies, administrations, the academic world, organizations, communities and people, with clear public and citizen leadership.

https://www.kaggle.com/xvivancos/barcelona-data-sets?select=accidents_2017.csv




#### Import necessary libraries

In [None]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from bs4 import BeautifulSoup # library for web scraping

#!conda install -c conda-forge geocoder --yes
import geocoder

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML

# transforming a JSON file into a pandas dataframe
from pandas import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


#### Define Foursquare Credentials and Version
Make sure that you have created a Foursquare developer account and have your credentials handy.

In [None]:
import os

LIMIT = 100
# Read the credentials from environment variables instead of hard-coding them.
# The variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumption.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
VERSION = '20200801'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET set: ' + str(bool(CLIENT_SECRET)))

#### Read in the dataset

In [None]:
df = pd.read_csv("accidents.csv")

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.info()
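
Before going further it may be worth confirming that the file contains the columns used later in this notebook. The following is only a minimal sketch: `missing_columns` is a hypothetical helper (not part of the original notebook) and `sample` is a tiny stand-in frame rather than the real CSV.

```python
import pandas as pd

# Columns referenced later in this notebook.
EXPECTED_COLUMNS = [
    'Id', 'District Name', 'Neighborhood Name', 'Street', 'Weekday',
    'Month', 'Day', 'Hour', 'Part of the day', 'Mild injuries',
    'Serious injuries', 'Victims', 'Vehicles involved',
    'Longitude', 'Latitude',
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Hypothetical helper: return the expected columns absent from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Stand-in data; with the real data set you would pass `df` instead.
sample = pd.DataFrame({'Id': ['a1'], 'Street': ['Diagonal'], 'Victims': [2]})
missing = missing_columns(sample)
print(missing)
```

An empty list would mean every column the later cells rely on is present.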

#### Total number of accidents in each Neighborhood 

In [None]:
df['Neighborhood Name'].value_counts()

#### The total accidents per month of the year

In [None]:
df['Month'].value_counts()

#### The total accidents per district, day and hour.

In [None]:
df['District Name'].value_counts()

In [None]:
df['Weekday'].value_counts()

In [None]:
df['Hour'].value_counts()

#### Pivoting the table to view the number of victims per weekday in each neighbourhood

In [None]:
Barcelona_accidents = pd.pivot_table(df,values=['Victims'],
                               index=['Neighborhood Name'],
                               columns=['Weekday'],
                               aggfunc='sum',fill_value=0)
Barcelona_accidents.head()

In [None]:
# Reset the index
Barcelona_accidents.reset_index(inplace = True)

In [None]:
# Total per neighbourhood (sum of victims over all weekdays; used below as the accident total)
Barcelona_accidents['Total'] = Barcelona_accidents.sum(axis=1, numeric_only=True)
Barcelona_accidents.head(33)
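
To make the pivot-and-total step easier to follow, here is a minimal sketch on made-up rows (the neighbourhood names and victim counts are illustrative, not taken from the data set). It keeps the neighbourhood name in the index, so the row sum only touches the weekday counts, and passes `numeric_only=True` as an extra guard.

```python
import pandas as pd

# Made-up accident records, one row per accident.
toy = pd.DataFrame({
    'Neighborhood Name': ['A', 'A', 'B', 'B', 'B'],
    'Weekday': ['Monday', 'Friday', 'Monday', 'Monday', 'Sunday'],
    'Victims': [1, 2, 1, 3, 1],
})

# Victims per neighbourhood (rows) and weekday (columns).
toy_pivot = pd.pivot_table(toy, values='Victims',
                           index='Neighborhood Name',
                           columns='Weekday',
                           aggfunc='sum', fill_value=0)

# Row-wise total over the weekday columns only.
toy_pivot['Total'] = toy_pivot.sum(axis=1, numeric_only=True)
print(toy_pivot)
```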

#### Renaming the columns

In [None]:
Barcelona_accidents.columns = ['Neighborhood Name','Friday','Monday','Saturday','Sunday','Thursday','Tuesday','Wednesday','Total']
Barcelona_accidents.head()

In [None]:
# Shape of the data set 
Barcelona_accidents.shape

### Exploratory Data Analysis <a name="EDA"></a>

#### Descriptive statistics of the data

In [None]:
Barcelona_accidents.describe()

In [None]:
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') 


print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0


import matplotlib.cm as cm
import matplotlib.colors as colors

#### Check if the column names are strings 

In [None]:
Barcelona_accidents.columns = list(map(str, Barcelona_accidents.columns))

# let's check the column labels types now
all(isinstance(column, str) for column in Barcelona_accidents.columns)

#### Sort the total accidents in descending order to see the 5 neighbourhoods with the highest number of accidents

In [None]:
Barcelona_accidents.sort_values(['Total'], ascending = False, axis = 0, inplace = True )

df_top5 = Barcelona_accidents.head() 
df_top5

#### Visualize the five Neighbourhoods with the highest number of accidents

In [None]:
df_tt = df_top5[['Neighborhood Name','Total']]

df_tt.set_index('Neighborhood Name',inplace = True)

ax = df_tt.plot(kind='bar', figsize=(10, 10), rot=0)

ax.set_ylabel('Number of accidents') # add y-label to the plot
ax.set_xlabel('Neighbourhood') # add x-label to the plot
ax.set_title('Barcelona Neighbourhoods with the Highest no. of accidents') # add title to the plot

# Annotate each bar with its height (the total).

for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 20
               )

plt.show()

### We'll steer clear of these places :)

#### Sort the total accidents in ascending order to see the 5 neighbourhoods with the lowest number of accidents

In [None]:
Barcelona_accidents.sort_values(['Total'], ascending = True, axis = 0, inplace = True )

df_bot5 = Barcelona_accidents.head() 
df_bot5

#### Visualize the five Neighbourhoods with the least number of accidents 

In [None]:
df_bt = df_bot5[['Neighborhood Name','Total']]

df_bt.set_index('Neighborhood Name',inplace = True)

ax = df_bt.plot(kind='bar', figsize=(10, 6), rot=0)

ax.set_ylabel('Number of accidents') # add y-label to the plot
ax.set_xlabel('Neighborhood Name') # add x-label to the plot
ax.set_title('Barcelona Neighbourhoods with the least no. of accidents') # add title to the plot

# Annotate each bar with its height (the total).

for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 14
               )

plt.show()

La Dreta de l'Eixample has the highest number of accidents recorded for the year 2017, so let's look into the details of this neighbourhood.

In [None]:
df_col = df_top5[df_top5['Neighborhood Name'] == 'la Dreta de l\'Eixample']
df_col = df_col[['Neighborhood Name','Total','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']]
df_col

In [None]:
df_eixample = df[df['Neighborhood Name'] == 'la Dreta de l\'Eixample']

Eixample_accidents = pd.pivot_table(df_eixample,values=['Victims'],
                               index=['Street'],
                               columns=['Part of the day'],
                               aggfunc='sum',fill_value=0)
Eixample_accidents



In [None]:
# Reset the index
Eixample_accidents.reset_index(inplace = True)

In [None]:
# Total per street (sum of victims over all parts of the day)
Eixample_accidents['Total'] = Eixample_accidents.sum(axis=1, numeric_only=True)


Eixample_accidents.sort_values(['Total'], ascending = False, axis = 0, inplace = True )
Eixample_accidents.head(336)

In [None]:
Eixample_accidents.sort_values(['Total'], ascending = False, axis = 0, inplace = True )

df_top5eix = Eixample_accidents.head() 
df_top5eix

URL: https://es.wikipedia.org/wiki/La_Dreta_de_l%27Eixample


### Visualizing the streets with the most accidents in the neighbourhood 'la Dreta de l'Eixample'

In [None]:
df_bt = df_top5eix[['Street','Total']]

df_bt.set_index('Street',inplace = True)

ax = df_bt.plot(kind='bar', figsize=(15, 6), rot=0)

ax.set_ylabel('Number of accidents') # add y-label to the plot
for tick in ax.xaxis.get_majorticklabels():
    tick.set_horizontalalignment("left")
ax.set_xlabel('Street') # add x-label to the plot
ax.set_title("Streets in la Dreta de l'Eixample with the highest no. of accidents") # add title to the plot

# Annotate each bar with its height (the total).

for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 14
               )

plt.show()

Based on the recorded totals, la Dreta de l'Eixample is the most dangerous neighbourhood to drive in compared with the other neighbourhoods in Barcelona.

### Part 3: Creating a map of the accidents using their co-ordinates. <a name="part3"></a>



In [None]:
coord_df = df.drop(['Id', 'Street', 'Weekday', 'Month', 'Day', 'Hour', 'Part of the day', 'Mild injuries', 'Victims','Serious injuries','Vehicles involved'], axis=1)

coord_df

In [None]:
address = 'Barcelona, Barcelona, Spain'

geolocator = Nominatim(user_agent="ld_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Barcelona, Spain are {}, {}.'.format(latitude, longitude))

In [None]:

# create map of Barcelona using latitude and longitude values
map_lon = folium.Map(location=[latitude, longitude], zoom_start=12)

# add one marker per accident, labelled with its neighbourhood and district
for lat, lng, district, neighborhood in zip(coord_df['Latitude'], coord_df['Longitude'], coord_df['District Name'], coord_df['Neighborhood Name']):
    label = '{}, {}'.format(neighborhood, district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_lon)  
    
map_lon

### Modelling <a name="modelling"></a>

- Keeping the location columns (district, neighbourhood, street, coordinates) of each accident.
- Performing one-hot encoding on the neighbourhood of each accident.
- Grouping the accidents by district and calculating the mean of each neighbourhood indicator.
- Performing k-means clustering (with K = 5).
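
The encoding and grouping steps above can be sketched on a toy table. This is only an illustration with invented district and neighbourhood names; the cells below apply the same idea to the accidents data.

```python
import pandas as pd

# Invented records, one row per accident.
toy = pd.DataFrame({
    'District Name': ['D1', 'D1', 'D1', 'D2', 'D2'],
    'Neighborhood Name': ['n1', 'n1', 'n2', 'n3', 'n3'],
})

# One indicator column per neighbourhood, one row per accident.
onehot = pd.get_dummies(toy['Neighborhood Name'], dtype=float)
onehot.insert(0, 'District Name', toy['District Name'])

# Mean of the indicators = share of each district's accidents per neighbourhood.
profile = onehot.groupby('District Name').mean()
print(profile.round(2))
```

Each row of `profile` sums to 1, so a district is described by how its accidents are distributed over its neighbourhoods rather than by how many accidents it has.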

In [None]:
coord_venues = df.drop(['Id', 'Weekday', 'Month', 'Day', 'Hour', 'Part of the day', 'Mild injuries', 'Victims','Serious injuries','Vehicles involved'], axis=1)



In [None]:
coord_venues = coord_venues[coord_venues['District Name'] != 'Unknown']
coord_df = coord_df[coord_df['District Name'] != 'Unknown']



In [None]:
print(coord_venues.shape)
coord_venues.head()

In [None]:
coord_venues.groupby('District Name').count()

In [None]:
print('There are {} Neighbourhoods.'.format(len(coord_venues['Neighborhood Name'].unique())))

In [None]:
# one hot encoding of the neighbourhood of each accident
coord_onehot = pd.get_dummies(coord_venues[['Neighborhood Name']], prefix="", prefix_sep="")

# add district column back to dataframe
coord_onehot['District Name'] = coord_venues['District Name'] 

# move district column to the first column
fixed_columns = [coord_onehot.columns[-1]] + list(coord_onehot.columns[:-1])
coord_onehot = coord_onehot[fixed_columns]

coord_onehot.head()

#### Grouping rows by district and taking the mean of the frequency of occurrence of each neighbourhood

In [None]:
coord_grouped = coord_onehot.groupby('District Name').mean().reset_index()
coord_grouped

In [None]:
coord_grouped.shape

In [None]:
num_top_venues = 5

for hood in coord_grouped['District Name']:
    print("----"+hood+"----")
    temp = coord_grouped[coord_grouped['District Name'] == hood].T.reset_index()
    temp.columns = ['Neighbourhood','Accidents Frequency']
    temp = temp.iloc[1:]
    temp['Accidents Frequency'] = temp['Accidents Frequency'].astype(float)
    temp = temp.round({'Accidents Frequency': 2})
    print(temp.sort_values('Accidents Frequency', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Function to sort the neighbourhoods of a district by accident frequency, in descending order.

In [None]:
def return_most_common_Neighbourhoods(row, num_top_Neighbourhoods):
    # skip the first entry (the district name) and rank the neighbourhood frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_Neighbourhoods]

Create the new dataframe and display the top 10 neighbourhoods for each district

In [None]:
num_top_Neighbourhoods = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top Neighbourhoods
columns = ['District Name']
for ind in np.arange(num_top_Neighbourhoods):
    try:
        columns.append('{}{} Most Common Neighbourhood'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        columns.append('{}th Most Common Neighbourhood'.format(ind+1))

# create a new dataframe
neighborhoods_Neighbourhoods_sorted = pd.DataFrame(columns=columns)
neighborhoods_Neighbourhoods_sorted['District Name'] = coord_grouped['District Name']

for ind in np.arange(coord_grouped.shape[0]):
    neighborhoods_Neighbourhoods_sorted.iloc[ind, 1:] = return_most_common_Neighbourhoods(coord_grouped.iloc[ind, :], num_top_Neighbourhoods)

neighborhoods_Neighbourhoods_sorted.head()

### Clustering similar districts together using k - means clustering

In [None]:
from sklearn.cluster import KMeans

kclusters = 5

coord_grouped_clustering = coord_grouped.drop(columns='District Name')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(coord_grouped_clustering)

kmeans.labels_[0:10] 
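
The notebook fixes K = 5 without comparing alternatives. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (inertia) for several values of K. The sketch below does this on random stand-in data: `toy_profiles` is a made-up substitute for `coord_grouped_clustering`, so the numbers it prints say nothing about Barcelona.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the district profiles: 10 rows, each summing to 1.
toy_profiles = rng.dirichlet(np.ones(6), size=10)

# Fit k-means for a range of K and record the inertia of each fit.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_profiles)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 4))
```

A value of K after which the inertia stops dropping sharply is often taken as a reasonable choice, although with only a handful of districts this is a rough guide at best.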

In [None]:
# add clustering labels
neighborhoods_Neighbourhoods_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

coord_merged = coord_df

coord_merged = coord_merged.join(neighborhoods_Neighbourhoods_sorted.set_index('District Name'), on='District Name')

coord_merged.head()

In [None]:
coord_merged.info()

In [None]:
# Dropping the row with the NaN value 
coord_merged.dropna(inplace = True)

In [None]:
coord_merged.shape

In [None]:
coord_merged['Cluster Labels'] = coord_merged['Cluster Labels'].astype(int)

In [None]:
coord_merged.info()

### Visualize the clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(coord_merged['Latitude'], coord_merged['Longitude'], coord_merged['District Name'], coord_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5).add_to(map_clusters)
       
map_clusters

## Analysis <a name="analysis"></a>

Analyse each of the clusters to identify the characteristics of each cluster and the neighborhoods in them.

#### Examine the first cluster

In [None]:
coord_merged[coord_merged['Cluster Labels'] == 0]

This cluster contains the second most accident-prone district in the city.

#### Examine the second cluster

In [None]:
coord_merged[coord_merged['Cluster Labels'] == 1]

#### Examine the third cluster

In [None]:
coord_merged[coord_merged['Cluster Labels'] == 2]

#### Examine the fourth cluster

In [None]:
coord_merged[coord_merged['Cluster Labels'] == 3]

#### Examine the fifth cluster

In [None]:
coord_merged[coord_merged['Cluster Labels'] == 4]

The fifth cluster contains the most accident-prone neighbourhoods of the city.
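
Statements like the ones above are easier to verify with a per-cluster count. A minimal sketch, using an invented stand-in for `coord_merged` (the district names and labels below are illustrative only):

```python
import pandas as pd

# Stand-in for coord_merged: one row per accident with its district and cluster.
toy_merged = pd.DataFrame({
    'District Name': ['D1', 'D1', 'D1', 'D2', 'D3', 'D3'],
    'Cluster Labels': [0, 0, 0, 1, 1, 1],
})

# Number of accident rows per cluster and district, largest first.
summary = (toy_merged
           .groupby(['Cluster Labels', 'District Name'])
           .size()
           .rename('Accidents')
           .reset_index()
           .sort_values('Accidents', ascending=False, ignore_index=True))
print(summary)
```

Running the same grouping on `coord_merged` would show which districts fall into each cluster and how many accidents each contributes.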

## Results and Discussion <a name="results"></a>

The aim of this project is to help people who want to drive along the less accident-prone streets of Barcelona. The clusters suggest that crossing the city through its centre, for example along Diagonal Street, or all the way up, from or towards the port, is considerably riskier than the routes around it. If you drive in from outside the city, it may be worth choosing which side to approach from, rather than reaching the city through the middle and then rerouting.

## Conclusion <a name="conclusion"></a>

This project gives a better understanding of traffic accidents inside the city of Barcelona and of how important it is to take not only the fastest route, but also the potentially safest one.