# Capstone Project - The Battle of Neighborhoods (Week 1)

## Part 1 [Week 1]



Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

## Part 2 [Week 1]

Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.


## Part 3 [Week 2]

A full report consisting of all of the following components (15 marks):
    
- Introduction where you discuss the business problem and who would be interested in this project.
- Data where you describe the data that will be used to solve the problem and the source of the data.
- Methodology section, the main component of the report, where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.
- Results section where you discuss the results.
- Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
- Conclusion section where you conclude the report.
- A link to your Notebook, pushed to your GitHub repository, showing your code. (15 marks)
- Your choice of a presentation or blogpost. (10 marks)

## Section 1: Introduction 

### Background

The London housing market is in decline and faces several economic challenges. Taxes may rise, and the Bank of England is warning homeowners that property values could fall by up to 30% in the event of an exit from the European Union. Four warning signs in the London housing market (hidden price falls, record low sales, an exodus from homebuilders, and tax increases) should concern citizens in England and Wales.


### Project Idea

To help home buyers in the London housing market, we can use machine learning to help them analyze the market and make better purchase decisions. The question we ask is: how can we help these buyers purchase a property in spite of the financial and economic challenges? To answer it, we will cluster London neighborhoods, find the average house price in each, and recommend properties that are likely to be a positive-value investment. We define "positive value" in terms of property amenities, schools, grocery stores, venues, shopping malls, and entertainment.

## Section 2: Data

Our data on London properties comes from http://landregistry.data.gov.uk/. From the Price Paid Data we will use and analyze the following fields: Postcode, PAON (Primary Addressable Object Name), and SAON (Secondary Addressable Object Name).


To analyze and recommend locations based on nearby venues of "positive" value, we will use the Foursquare API to retrieve venue data, load it into a dataframe, and visualize it. We can then combine the Price Paid Data for London properties with the Foursquare data on "positive" value venues near those properties. This will allow us to recommend houses that are likely to be a positive-value investment.
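As a rough illustration of the Foursquare step, the sketch below only builds a request URL for the legacy v2 `venues/explore` endpoint and does not send it. The credentials, the version date, and the helper name `build_explore_url` are placeholders introduced here, not part of the original notebook, and the legacy endpoint may not be available to every account.

```python
from urllib.parse import urlencode

# Placeholder credentials (hypothetical): substitute your own Foursquare keys.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
API_VERSION = "20180605"  # assumed version date in YYYYMMDD form


def build_explore_url(lat, lng, radius=500, limit=100):
    """Build a venues/explore request URL for venues around one coordinate."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": API_VERSION,
        "ll": f"{lat},{lng}",  # latitude,longitude of the property
        "radius": radius,      # search radius in metres
        "limit": limit,        # maximum number of venues to return
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = build_explore_url(51.4803265, -0.1667607)
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()`, and the venue list flattened into a dataframe.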

## Section 3: Methodology

This section divides the analysis into four main parts:

- Collection and Inspection of Data
- Exploring and Understanding Data
- Data Preparation/Preprocessing
- Modeling

We can then make our predictions based on our modeling.

**i. Collection and Inspection of Data**

In [1]:
import os # Operating System
import numpy as np
import pandas as pd
import datetime as dt # Datetime
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions: from pandas.io.json import json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


In [2]:
#Read the data for examination (Source: http://landregistry.data.gov.uk/)
df_ppd = pd.read_csv("http://prod2.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018.csv")

**ii. Exploring and Understanding Data**

In [3]:
df_ppd.head(11)


Unnamed: 0,{666758D7-43A9-3363-E053-6B04A8C0D74E},405000,2018-01-25 00:00,WR15 8LH,D,N,F,RAMBLERS WAY,Unnamed: 8,Unnamed: 9,BORASTON,TENBURY WELLS,SHROPSHIRE,SHROPSHIRE.1,A,A.1
0,{666758D7-43AA-3363-E053-6B04A8C0D74E},315000,2018-01-23 00:00,SY7 8QA,D,N,F,MONT CENISE,,,CLUN,CRAVEN ARMS,SHROPSHIRE,SHROPSHIRE,A,A
1,{666758D7-43AD-3363-E053-6B04A8C0D74E},165000,2018-01-19 00:00,SY1 2BF,T,Y,F,42,,PENSON WAY,,SHREWSBURY,SHROPSHIRE,SHROPSHIRE,A,A
2,{666758D7-43B0-3363-E053-6B04A8C0D74E},370000,2018-01-22 00:00,SY8 4DF,D,N,F,WILLOW HEY,,,ASHFORD CARBONEL,LUDLOW,SHROPSHIRE,SHROPSHIRE,A,A
3,{666758D7-43B3-3363-E053-6B04A8C0D74E},320000,2018-01-19 00:00,TF10 7ET,D,N,F,3,,PRINCESS GARDENS,,NEWPORT,WREKIN,WREKIN,A,A
4,{666758D7-43B4-3363-E053-6B04A8C0D74E},180000,2018-01-31 00:00,SY3 0NQ,S,N,F,79,,LYTHWOOD ROAD,BAYSTON HILL,SHREWSBURY,SHROPSHIRE,SHROPSHIRE,A,A
5,{666758D7-43B6-3363-E053-6B04A8C0D74E},205000,2018-01-26 00:00,SY4 5ES,D,N,F,7,,WELLGATE,WEM,SHREWSBURY,SHROPSHIRE,SHROPSHIRE,A,A
6,{666758D7-43B7-3363-E053-6B04A8C0D74E},255000,2018-01-26 00:00,TF11 9LS,D,N,F,MONTROSE,,GROOMS LANE,KEMBERTON,SHIFNAL,SHROPSHIRE,SHROPSHIRE,A,A
7,{666758D7-43B8-3363-E053-6B04A8C0D74E},177500,2018-01-31 00:00,SY1 3DG,S,N,F,27,,POWIS DRIVE,,SHREWSBURY,SHROPSHIRE,SHROPSHIRE,A,A
8,{666758D7-43B9-3363-E053-6B04A8C0D74E},174995,2018-02-02 00:00,TF1 5GN,D,N,F,89,,HENDY AVENUE,KETLEY,TELFORD,WREKIN,WREKIN,A,A
9,{666758D7-43BB-3363-E053-6B04A8C0D74E},194950,2018-02-02 00:00,SY2 5SZ,S,Y,F,80,,REDWING FIELDS,,SHREWSBURY,SHROPSHIRE,SHROPSHIRE,A,A


In [4]:
df_ppd.shape

(1021214, 16)

The dataset has 1,021,214 rows and 16 columns.
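The column labels in the `head` output above are themselves a sale record (405000, WR15 8LH, ...), which suggests the CSV has no header row and that `read_csv` consumed the first record as the header. A minimal sketch of the difference, using two shortened records with made-up IDs rather than the real file:

```python
import io

import pandas as pd

# Two shortened records in a headerless layout (the IDs are made up for illustration).
raw = "id-1,405000,2018-01-25 00:00,WR15 8LH\nid-2,315000,2018-01-23 00:00,SY7 8QA\n"
names = ["TUID", "Price", "Date_Transfer", "Postcode"]

# Default behaviour: the first record is consumed as the header row.
df_default = pd.read_csv(io.StringIO(raw))

# With header=None and names=..., every record stays in the data.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(df_default.shape)
print(df_named.shape)
```

Passing `header=None` when reading the full file would keep that first sale in the data, at the cost of one extra row compared with the shape reported above.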

**iii. Data Preparation/Preprocessing**

We will prepare the dataset for the modeling process using the following steps:

In [20]:
# 1. Rename and assign column names to better understand the data
df_ppd.columns = ['TUID', 'Price', 'Date_Transfer', 'Postcode', 'Prop_Type', 'Old_New', 'Duration', 'PAON',
                  'SAON', 'Street', 'Locality', 'Town_City', 'District', 'County', 'PPD_Cat_Type', 'Record_Status']

In [21]:
#2. Format the date column (vectorized conversion of the whole column)
df_ppd['Date_Transfer'] = pd.to_datetime(df_ppd['Date_Transfer'])

#3. Delete all obsolete transactions which were done before 2016
df_ppd.drop(df_ppd[df_ppd.Date_Transfer.dt.year < 2016].index, inplace=True)

#4. Sort by Date of Sale
df_ppd.sort_values(by=['Date_Transfer'],ascending=[False],inplace=True)

In [22]:
df_ppd_london = df_ppd.query("Town_City == 'LONDON'")

#5.  Make a list of street names in LONDON
streets = df_ppd_london['Street'].unique().tolist()

In [23]:
df_grp_price = df_ppd_london.groupby(['Street'])['Price'].mean().reset_index()

#6. Give meaningful names to the columns
df_grp_price.columns = ['Street', 'Avg_Price']

In [9]:
#7. Set your budget's lower and upper limits to find the streets in df_grp_price that fit your budget
df_affordable = df_grp_price.query("(Avg_Price >= 2200000) & (Avg_Price <= 2500000)")

In [10]:
#8. Display the dataframe
df_affordable

Unnamed: 0,Street,Avg_Price
196,ALBION SQUARE,2.450000e+06
391,ANHALT ROAD,2.435000e+06
406,ANSDELL TERRACE,2.250000e+06
421,APPLEGARTH ROAD,2.400000e+06
699,AYLESTONE AVENUE,2.286667e+06
853,BARONSMEAD ROAD,2.375000e+06
979,BEAUCLERC ROAD,2.480000e+06
1100,BELVEDERE DRIVE,2.340000e+06
1213,BICKENHALL STREET,2.208500e+06
1251,BIRCHLANDS AVENUE,2.217000e+06
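The budget limits in step 7 are hard-coded into the query string. One possible way to make them parameters is sketched below; the helper name `filter_by_budget` and the third sample street are hypothetical and only serve the illustration.

```python
import pandas as pd


def filter_by_budget(prices, lower, upper):
    """Return the rows whose Avg_Price lies in [lower, upper] (both ends inclusive)."""
    mask = prices["Avg_Price"].between(lower, upper)
    # Return a copy so later column assignments do not touch `prices`.
    return prices.loc[mask].copy()


# Two streets from the table above plus one made-up street outside the budget.
sample = pd.DataFrame({
    "Street": ["ALBION SQUARE", "ANHALT ROAD", "EXAMPLE STREET"],
    "Avg_Price": [2450000.0, 2435000.0, 900000.0],
})

affordable = filter_by_budget(sample, 2_200_000, 2_500_000)
print(affordable)
```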


In [11]:
import pandas as pd
import numpy as np
import datetime as DT
from geopy.geocoders import Nominatim
from geopy.distance import geodesic # replaces vincenty, which was removed in geopy 2.0
# import k-means for the clustering stage
from sklearn.cluster import KMeans

In [12]:
for index, item in df_affordable.iterrows():
    print(f"index: {index}")
    print(f"item: {item}")
    print(f"item.Street only: {item.Street}")

index: 196
item: Street       ALBION SQUARE
Avg_Price         2.45e+06
Name: 196, dtype: object
item.Street only: ALBION SQUARE
index: 391
item: Street       ANHALT ROAD
Avg_Price      2.435e+06
Name: 391, dtype: object
item.Street only: ANHALT ROAD
index: 406
item: Street       ANSDELL TERRACE
Avg_Price           2.25e+06
Name: 406, dtype: object
item.Street only: ANSDELL TERRACE
index: 421
item: Street       APPLEGARTH ROAD
Avg_Price            2.4e+06
Name: 421, dtype: object
item.Street only: APPLEGARTH ROAD
index: 699
item: Street       AYLESTONE AVENUE
Avg_Price         2.28667e+06
Name: 699, dtype: object
item.Street only: AYLESTONE AVENUE
index: 853
item: Street       BARONSMEAD ROAD
Avg_Price          2.375e+06
Name: 853, dtype: object
item.Street only: BARONSMEAD ROAD
index: 979
item: Street       BEAUCLERC ROAD
Avg_Price          2.48e+06
Name: 979, dtype: object
item.Street only: BEAUCLERC ROAD
index: 1100
item: Street       BELVEDERE DRIVE
Avg_Price           2.34e+06
Name

In [13]:
# geopy >= 2.0 requires an explicit user_agent; the name below is a placeholder
geolocator = Nominatim(user_agent="london_explorer")



In [14]:
df_affordable['city_coord'] = df_affordable['Street'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
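The warning above appears because `df_affordable` was produced by `query` on another dataframe, so pandas cannot tell whether the new column should also affect the original. A common way to avoid it, sketched here on a two-row stand-in rather than the real data, is to take an explicit copy before adding columns. The column `Price_Millions` is made up for the example.

```python
import pandas as pd

prices = pd.DataFrame({
    "Street": ["ANHALT ROAD", "ANSDELL TERRACE"],
    "Avg_Price": [2435000.0, 2250000.0],
})

# .copy() makes the filtered frame independent of `prices`,
# so adding a column to it is unambiguous and raises no warning.
subset = prices.query("Avg_Price >= 2300000").copy()
subset["Price_Millions"] = subset["Avg_Price"] / 1e6

print(subset)
```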


In [15]:
df_affordable

Unnamed: 0,Street,Avg_Price,city_coord
196,ALBION SQUARE,2.450000e+06,"(-41.27375755, 173.289393239104)"
391,ANHALT ROAD,2.435000e+06,"(51.4803265, -0.1667607)"
406,ANSDELL TERRACE,2.250000e+06,"(51.4998899, -0.1891027)"
421,APPLEGARTH ROAD,2.400000e+06,"(53.749244, -0.32678)"
699,AYLESTONE AVENUE,2.286667e+06,"(51.5409157, -0.2178742)"
853,BARONSMEAD ROAD,2.375000e+06,"(51.4773147, -0.239457)"
979,BEAUCLERC ROAD,2.480000e+06,"(51.4995771, -0.2290331)"
1100,BELVEDERE DRIVE,2.340000e+06,"(51.4249173, -0.2120774)"
1213,BICKENHALL STREET,2.208500e+06,"(51.5211969, -0.1589341)"
1251,BIRCHLANDS AVENUE,2.217000e+06,"(51.4483941, -0.1604676)"
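Two of the coordinates above are clearly not in London: ALBION SQUARE has a southern-hemisphere latitude near -41, and APPLEGARTH ROAD sits at latitude 53.7, well north of the city. This is likely because only the bare street name was sent to the geocoder. The sketch below shows one way to qualify each query with the city and to tolerate streets that return no match. The helper `geocode_in_city` and the offline stand-in geocoder are hypothetical names introduced for this example.

```python
def geocode_in_city(street, geocode, city="London, UK"):
    """Geocode `street` qualified with `city`; return (lat, lon) or None."""
    location = geocode(f"{street}, {city}")
    if location is None:  # geopy geocoders return None when nothing matches
        return None
    return (location.latitude, location.longitude)


# Stand-in for geolocator.geocode so the sketch runs offline (hypothetical).
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    known = {"ANHALT ROAD, London, UK": FakeLocation(51.4803265, -0.1667607)}
    return known.get(query)


found = geocode_in_city("ANHALT ROAD", fake_geocode)
missing = geocode_in_city("NO SUCH STREET", fake_geocode)
print(found)
print(missing)
```

With the real geocoder, `geolocator.geocode` would be passed in place of `fake_geocode`, for example `df_affordable['Street'].apply(lambda s: geocode_in_city(s, geolocator.geocode))`.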


In [16]:
df_affordable[['Latitude', 'Longitude']] = df_affordable['city_coord'].apply(pd.Series)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


In [17]:
df_affordable

Unnamed: 0,Street,Avg_Price,city_coord,Latitude,Longitude
196,ALBION SQUARE,2.450000e+06,"(-41.27375755, 173.289393239104)",-41.273758,173.289393
391,ANHALT ROAD,2.435000e+06,"(51.4803265, -0.1667607)",51.480326,-0.166761
406,ANSDELL TERRACE,2.250000e+06,"(51.4998899, -0.1891027)",51.499890,-0.189103
421,APPLEGARTH ROAD,2.400000e+06,"(53.749244, -0.32678)",53.749244,-0.326780
699,AYLESTONE AVENUE,2.286667e+06,"(51.5409157, -0.2178742)",51.540916,-0.217874
853,BARONSMEAD ROAD,2.375000e+06,"(51.4773147, -0.239457)",51.477315,-0.239457
979,BEAUCLERC ROAD,2.480000e+06,"(51.4995771, -0.2290331)",51.499577,-0.229033
1100,BELVEDERE DRIVE,2.340000e+06,"(51.4249173, -0.2120774)",51.424917,-0.212077
1213,BICKENHALL STREET,2.208500e+06,"(51.5211969, -0.1589341)",51.521197,-0.158934
1251,BIRCHLANDS AVENUE,2.217000e+06,"(51.4483941, -0.1604676)",51.448394,-0.160468


In [18]:
df = df_affordable.drop(columns=['city_coord'])


In [19]:
df


Unnamed: 0,Street,Avg_Price,Latitude,Longitude
196,ALBION SQUARE,2.450000e+06,-41.273758,173.289393
391,ANHALT ROAD,2.435000e+06,51.480326,-0.166761
406,ANSDELL TERRACE,2.250000e+06,51.499890,-0.189103
421,APPLEGARTH ROAD,2.400000e+06,53.749244,-0.326780
699,AYLESTONE AVENUE,2.286667e+06,51.540916,-0.217874
853,BARONSMEAD ROAD,2.375000e+06,51.477315,-0.239457
979,BEAUCLERC ROAD,2.480000e+06,51.499577,-0.229033
1100,BELVEDERE DRIVE,2.340000e+06,51.424917,-0.212077
1213,BICKENHALL STREET,2.208500e+06,51.521197,-0.158934
1251,BIRCHLANDS AVENUE,2.217000e+06,51.448394,-0.160468
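Before mapping, it may be worth dropping rows whose coordinates fall far from London. The sketch below uses a plain haversine great-circle distance, not geopy's distance functions, and a 50 km cut-off that is an assumption chosen for illustration. The reference point is the London coordinate that Nominatim returns in the modeling step.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


LONDON = (51.4893335, -0.144055084527687)
MAX_KM = 50  # assumed cut-off, not taken from the notebook

# Four rows from the table above
streets = {
    "ALBION SQUARE": (-41.273758, 173.289393),
    "ANHALT ROAD": (51.480326, -0.166761),
    "APPLEGARTH ROAD": (53.749244, -0.326780),
    "BICKENHALL STREET": (51.521197, -0.158934),
}

# Keep only the streets within MAX_KM of the London reference point.
in_london = {
    name: coord
    for name, coord in streets.items()
    if haversine_km(*LONDON, *coord) <= MAX_KM
}
print(sorted(in_london))
```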


**iv. Modeling**

In [24]:
address = 'London, UK'

# geopy >= 2.0 requires an explicit user_agent; the name below is a placeholder
geolocator = Nominatim(user_agent="london_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))


The geographical coordinates of London are 51.4893335, -0.144055084527687.


In [26]:
# create map of London using latitude and longitude values
map_london = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, price, street in zip(df['Latitude'], df['Longitude'], df['Avg_Price'], df['Street']):
    label = '{}, {}'.format(street, price)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)

map_london
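The notebook imports `KMeans` but stops at the map. As a hedged sketch of what a clustering step could look like, the block below groups the eight streets that geocoded inside London by coordinates alone. Clustering on latitude and longitude only, and the choice of three clusters, are assumptions made for illustration; the project idea calls for clustering on venue features, which are not shown in this section.

```python
import numpy as np
from sklearn.cluster import KMeans

# (latitude, longitude) of the eight streets above that geocoded inside London
coords = np.array([
    [51.480326, -0.166761],  # ANHALT ROAD
    [51.499890, -0.189103],  # ANSDELL TERRACE
    [51.540916, -0.217874],  # AYLESTONE AVENUE
    [51.477315, -0.239457],  # BARONSMEAD ROAD
    [51.499577, -0.229033],  # BEAUCLERC ROAD
    [51.424917, -0.212077],  # BELVEDERE DRIVE
    [51.521197, -0.158934],  # BICKENHALL STREET
    [51.448394, -0.160468],  # BIRCHLANDS AVENUE
])

# n_clusters=3 is an arbitrary choice; random_state fixes the initialisation.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_

print(labels)
print(kmeans.cluster_centers_)
```

The cluster labels could then be added as a column of `df` and used to color the markers on the folium map.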