# <center>Capstone Project - The Battle of Neighbourhoods: Welsh Towns Review</center>
## <center>Part 1 - Data Collection</center>
### Applied Data Science Capstone by IBM
### Part of our IBM Data Science Professional Certificate

***

## Table of contents
* [The Problem](#section1)
* [Discussion of the background](#section2)
* [Data](#section3)
* [Data Collection](#section4)

## The Problem <a name="section1"></a>

A couple with young children is looking for a safe and quiet place to live. They want a good state school for their children and a small but vibrant town for the family. They would like to settle either in that town or very close to it. They are flexible about the location because they both work from home, with only occasional business travel to a city. But where to start? Where are the good schools, and which towns could be nice to live in?

## Discussion of the background <a name="section2"></a>

Given the couple’s needs, they could potentially settle anywhere in a large area. They therefore need help narrowing down their search by finding good schools and classifying towns to highlight the ones that could suit them. Of course they will have to visit the place, but is it possible to visit hundreds of places around the country? Perhaps the classification will shed some light on the local amenities and help select similar places.
Moving to a new place involves a costly commitment in the form of either renting or purchasing a property. Can they afford a place to live? House price data should reflect rental costs to some degree, so an average house price should be helpful. The price of a house varies greatly with its specific location, condition and many other factors, but an average price should indicate affordability and help with budgeting.

## Data <a name="section3"></a>

The first task is to obtain a list of towns or localities in Wales. Wikipedia holds a list of 446 localities with their population in a single table. This source is reliable, easy to scrape, and easy to filter for localities of an appropriate size. Because the couple is looking for a quiet place with a vibrant community, it should be neither a small village nor a large town. This report focuses on localities with a population ranging from 2,000 to 20,000. 
The Foursquare database contains comprehensive information about various types of venues, which can be used to identify similar towns based on their mix of businesses. 
Information about school performance can be downloaded from the internet as a structured dataset. All state schools in Wales are graded using a colour code, where excellent schools are ‘Green/Gwyrdd’ and good schools are ‘Yellow/Melyn’. Parents should read detailed information about a school rather than rely on the colour code alone, but that goes beyond the scope of this project.
This report uses the latest secondary school dataset, from 2019, because of the assumed age of the children. 
The UK government website provides a structured dataset of average house prices in CSV format. This report uses the average price for a detached house per county in Wales on 01/04/2020 (the latest available).
List of resources:
1. Localities with population:<br>
https://en.wikipedia.org/wiki/List_of_localities_in_Wales_by_population 
2. Foursquare (will be used with a free account):<br>
https://foursquare.com/ 
3. List of schools in Wales including their 2019 rating:<br>
https://gov.wales/sites/default/files/publications/2020-02/national-school-categorisation-system-support-categories-2019-v2.xlsx 
4. Average house prices on government website:<br>
http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Average-prices-Property-Type-2020-04.csv?utm_medium=GOV.UK&utm_source=datadownload&utm_campaign=average_price_property_price&utm_term=9.30_19_08_20


## Data Collection<a name="section4"></a>

#### Import required libraries

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page
import geocoder # import the geocoder

In [2]:
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)
# install .ods reader in console:
# pip install pandas_ods_reader
from pandas_ods_reader import read_ods

#### 1. Scrape data from HTML table

In [3]:
# Use get to download the contents of the webpage in text format and store in a variable called data:
url = "https://en.wikipedia.org/wiki/List_of_localities_in_Wales_by_population" # defines the source of the localities/towns
data  = requests.get(url).text

In [4]:
# Create a BeautifulSoup object using the BeautifulSoup constructor
soup = BeautifulSoup(data,"html5lib")

In [5]:
# Check what we've got, the page title:
soup.title.string

'List of localities in Wales by population - Wikipedia'

In [6]:
# Find the right table
table = soup.find("table",class_="wikitable sortable")

Using the web browser's Inspector, we can identify the following data required for our analysis.

The table's headings are within 'th' markers:

     <th>Built-up Area</th>
     <th>Population<br>(2011 Census)</th>
     </th>

The information we want is in rows marked by 'tr', while columns are separated by 'td'.

In [7]:
rows = []

for row in table.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # header rows contain only 'th' cells, so skip them
        rows.append({"Town": col[1].text, "Population": col[2].text})

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
df1 = pd.DataFrame(rows, columns=["Town", "Population"])

df1.head()

Unnamed: 0,Town,Population
0,Cardiff,"447,287[1]\n"
1,Newport,"306,844[2]\n"
2,Swansea,"300,352[3]\n"
3,Wrexham,"65,692[4]\n"
4,Tonypandy,"62,545[5]\n"


In [8]:
# Wikipedia lists 446 towns; check the completeness:
print('There are ',df1.shape[0],' rows in the table')

There are  446  rows in the table


In [9]:
# A quick clean to remove the reference numbers, line breaks and thousands separators:
df1 = df1.replace(to_replace=r'\[.*\]\n', value='', regex=True)
df1['Population'] = df1['Population'].str.replace(',', '', regex=False)
df1.head()

Unnamed: 0,Town,Population
0,Cardiff,447287
1,Newport,306844
2,Swansea,300352
3,Wrexham,65692
4,Tonypandy,62545
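
To see what the two replacements do to a single cell, the same patterns can be applied with the standard `re` module. This is only a minimal sketch: the helper name `clean_population` is hypothetical and not part of the notebook, and it assumes every raw value ends with a reference marker and a newline, as in the sample output above.

```python
import re

def clean_population(raw: str) -> int:
    """Turn a scraped cell such as '447,287[1]' plus newline into an integer."""
    # Same pattern as in the notebook: a bracketed reference followed by a newline
    without_ref = re.sub(r'\[.*\]\n', '', raw)
    # Remove the thousands separators before converting to an integer
    return int(without_ref.replace(',', ''))

print(clean_population("447,287[1]\n"))  # prints 447287
```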


In [10]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 446 entries, 0 to 445
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Town        446 non-null    object
 1   Population  446 non-null    object
dtypes: object(2)
memory usage: 7.1+ KB


In [11]:
df1['Population'] = df1['Population'].astype('int64')
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 446 entries, 0 to 445
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Town        446 non-null    object
 1   Population  446 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 7.1+ KB


In [None]:
#save the results:
#df1.to_csv('towns.csv', index=False)

#### 1. Geolocate the towns

In [12]:
df2 = pd.read_csv('towns.csv')
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 446 entries, 0 to 445
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Town        446 non-null    object
 1   Population  446 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 7.1+ KB


Limit the list of towns to the ones fitting the criteria: population between 2,000 and 20,000.

In [13]:
df2 = df2[df2['Population'].between(2000, 20000)].copy()  # copy so that new columns can be added safely
print('There are ',df2.shape[0],' towns with population between 2 and 20 thousand')

There are  132  towns with population between 2 and 20 thousand


In [14]:
# Define two empty lists to store the coordinates, one for latitude, one for longitude:
lati=[]
longi=[]

# Loop through the towns to obtain their geolocation. We use ArcGIS, because Google is not free anymore:
for town in df2['Town']:
    g = geocoder.arcgis('{}, Wales, UK'.format(town))
    # retry until the service returns coordinates
    while (g.latlng is None):
        g = geocoder.arcgis('{}, Wales, UK'.format(town))
    latlng = g.latlng
    lati.append(latlng[0])
    longi.append(latlng[1])

# Append the coordinates to the dataframe
df2['Latitude'] = lati
df2['Longitude'] = longi

# Check results:
print("Table size: ", df2.shape)
print("Type of data frame objects:")
df2.dtypes

Table size:  (132, 4)
Type of data frame objects:


Town           object
Population      int64
Latitude      float64
Longitude     float64
dtype: object
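
The loop above retries until ArcGIS returns coordinates, so it would never finish if a town could not be resolved. One possible alternative is a bounded retry. The sketch below is illustrative only: `geocode_with_retry`, `lookup`, `max_attempts` and `pause` are hypothetical names, and `lookup` stands in for whatever call returns `[latitude, longitude]` or `None`.

```python
import time
from typing import Callable, List, Optional

def geocode_with_retry(
    query: str,
    lookup: Callable[[str], Optional[List[float]]],
    max_attempts: int = 5,
    pause: float = 1.0,
) -> Optional[List[float]]:
    """Call lookup(query) up to max_attempts times; return None if every attempt fails."""
    for attempt in range(max_attempts):
        latlng = lookup(query)
        if latlng is not None:
            return latlng
        # Wait briefly before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(pause)
    return None
```

With the `geocoder` package used in this notebook, `lookup` could be `lambda q: geocoder.arcgis(q).latlng`. Towns that come back as `None` can then be reported instead of blocking the loop.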

In [None]:
# save the results:
#df2.to_csv('towns_geo.csv', index=False)

#### 2. Obtain local information from Foursquare

In [15]:
# Load the geolocated towns:
df2 = pd.read_csv('towns_geo.csv')
df2.info()
df2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132 entries, 0 to 131
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Town        132 non-null    object 
 1   Population  132 non-null    int64  
 2   Latitude    132 non-null    float64
 3   Longitude   132 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 4.2+ KB


Unnamed: 0,Town,Population,Latitude,Longitude
0,Neath,19258,51.66423,-3.803404
1,Ystrad Mynach,19204,51.641258,-3.236328
2,Aberystwyth,18749,52.415008,-4.083685
3,Kinmel Bay/Abergele,18705,53.298754,-3.533572
4,Bangor,17988,53.227974,-4.128214


Define your Foursquare access credentials

In [16]:
CLIENT_ID = 'XXXX' # your Foursquare ID
CLIENT_SECRET = 'XXXX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)
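
Hard-coded credentials are easy to leak when a notebook is shared. A common alternative is to read them from environment variables. The sketch below assumes two hypothetical variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; neither the names nor the helper `load_foursquare_credentials` come from Foursquare or from this notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if missing."""
    # The variable names below are hypothetical; pick whatever suits your setup
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    return client_id, client_secret
```

The returned pair could then be assigned to `CLIENT_ID` and `CLIENT_SECRET` so that the rest of the notebook stays unchanged.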

Request venue information from Foursquare

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=3000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # print(url)  
        
        # make the GET request and keep only the list of recommended items
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
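
The function above builds the URL by string formatting and assumes that every venue has at least one category. The sketch below shows one way to make both steps a little more defensive, using only the standard library. The endpoint and parameter names are copied from `getNearbyVenues`; the helpers `build_explore_url` and `venue_row` are hypothetical, and whether the service still accepts such requests is not checked here.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=3000, limit=100):
    """Assemble the explore URL with the same parameters as getNearbyVenues."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-escapes values, so the comma in ll becomes %2C
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

def venue_row(name, lat, lng, item):
    """Flatten one response item into a tuple; the category is None if absent."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (name, lat, lng, venue["name"],
            venue["location"]["lat"], venue["location"]["lng"], category)
```

Returning `None` for a missing category keeps the row in the table, so such venues can be inspected or filtered later rather than raising an `IndexError` in the middle of the download.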

In [18]:
# run the above function on each town and collect the results in a new dataframe
# called venues
venues = getNearbyVenues(names=df2['Town'],
                                   latitudes=df2['Latitude'],
                                   longitudes=df2['Longitude']
                                  )

Neath
Ystrad Mynach
Aberystwyth
Kinmel Bay/Abergele
Bangor
Connah's Quay
Chepstow
Carmarthen
Porthcawl
Llandudno
Risca
Tredegar
Abergavenny
Ely, Cardiff
Haverfordwest
Llantrisant
Llantwit Major
Pyle
Milford Haven
Splott
Treharris
Bargoed
Holyhead
Newtown
Mountain Ash
Cil-y-coed
Bryn Pydew
Llanrumney
Abertillery
Bedwas
Monmouth
Mold
Pembroke Dock
Caernarfon
Newbridge, Caerphilly
Tonyrefail
Pencoed
Pontarddulais
Llandaff
Caerleon
Rhymney
Denbigh
Burry Port
Brecon
Pembroke
Ferndale
Hirwaun
Brynna
Abertridwr
Radyr
Rhoose
Abercynon
Broughton
Welshpool
Undy
Coedpoeth
Gwaun-Cae-Gurwen/Brynamman
Blaenavon
Ogmore Vale/Nantymoel
Ruthin
Glyn-neath
Cwmavon
Llandrindod Wells
Cardigan
Gresford
New Tredegar
Menai Bridge
Llangefni
Bethesda
Hope
Tenby
Llay
Gilfach Goch
Glanaman
Pontycymer
Pwllheli
Glyncoch
Conwy
Cowbridge
Llanbradach
Rhuddlan
Neyland
Blaenau Ffestiniog
Llanfairfechan
Penyffordd
Aberfan
Ynysybwl
Murton
Llangollen
Fishguard
Saundersfoot
St Asaph
Llanrwst
Amlwch
Maerdy
Llanfair Pwllgwyngy

In [19]:
print(venues.shape)
venues.head()

(2000, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Neath,51.66423,-3.803404,Gwyn Hall,51.662846,-3.804022,Performing Arts Venue
1,Neath,51.66423,-3.803404,Neath Market,51.662951,-3.80538,Market
2,Neath,51.66423,-3.803404,Neath Rugby Football Club,51.663931,-3.798201,Athletics & Sports
3,Neath,51.66423,-3.803404,Cwrt Herbert Sports Centre,51.665143,-3.816853,Gym / Fitness Center
4,Neath,51.66423,-3.803404,Lidl,51.668438,-3.806313,Supermarket


In [None]:
#save the results for evaluation later:
#venues.to_csv('venues.csv', index=False)

#### 3. Fetch the List of schools in Wales with their rating and store it

The Welsh Government website publishes the school information in an Excel file: <br>
https://gov.wales/sites/default/files/publications/2020-02/national-school-categorisation-system-support-categories-2019-v2.xlsx <br>
It is a large file with complex headings and multiple tabs. The most efficient way of loading the data is to pre-process it in Microsoft Excel and save the results as `schools.xlsx`. 
The *schools* table provides the 2019 ratings of state secondary schools in Wales:

Colour Code  | Rating
-------------|----------------------------------
Green/Gwyrdd | Highly Effective
Yellow/Melyn | Effective
Amber/Oren   | In Need of Improvement
Red/Coch     | In Need of Greatest Improvement

Primary and special schools have been removed and the headings simplified.
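
The colour codes in the table can be kept in a small lookup so that later steps can select the schools of interest. This is a minimal sketch: the names `RATINGS`, `PREFERRED` and `is_preferred` are hypothetical, and treating Green and Yellow as the preferred categories is an assumption based on the project's stated focus on excellent and good schools.

```python
# Colour codes and ratings as listed in the table above
RATINGS = {
    "Green/Gwyrdd": "Highly Effective",
    "Yellow/Melyn": "Effective",
    "Amber/Oren": "In Need of Improvement",
    "Red/Coch": "In Need of Greatest Improvement",
}

# Assumption: only the top two categories are of interest to the family
PREFERRED = {"Green/Gwyrdd", "Yellow/Melyn"}

def is_preferred(colour_code: str) -> bool:
    """Return True for schools rated Green or Yellow."""
    return colour_code in PREFERRED
```

On the `schools` dataframe, the same selection could be written as `schools[schools['Rating'].isin(PREFERRED)]`.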



In [20]:
schools = pd.read_excel('schools.xlsx')
schools.head()

Unnamed: 0,School_code,School_name,Local_authority,Consortium,Rating
0,6604025,Ysgol Syr Thomas Jones,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Yellow/Melyn
1,6604026,Ysgol Uwchradd Caergybi,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Yellow/Melyn
2,6604027,Ysgol Gyfun Llangefni,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Green/Gwyrdd
3,6604028,Ysgol David Hughes,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Yellow/Melyn
4,6604029,Ysgol Uwchradd Bodedern,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Red/Coch


In [21]:
schools.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   School_code      205 non-null    int64 
 1   School_name      205 non-null    object
 2   Local_authority  205 non-null    object
 3   Consortium       205 non-null    object
 4   Rating           205 non-null    object
dtypes: int64(1), object(4)
memory usage: 8.1+ KB


In [22]:
school_address = read_ods('address-list-schools-wales.ods', 10)

In [23]:
school_address.head()

Unnamed: 0,School Number,School Name,LA Code,Local Authority,Sector,Governance - see notes,WM Code,Welsh Medium Type - see notes,School Type,Religious Character,Address 1,Address 2,Address 3,Address 4,Postcode,Phone Number,Pupils - see notes
0,6602130.0,Ysgol Gynradd Amlwch,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Amlwch,Ynys Môn,,,LL68 9DU,01407 830414,279.0
1,6602131.0,Ysgol Gynradd Beaumaris,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Maeshyfryd,Beaumaris,Ynys Môn,,LL58 8HL,01248 810451,48.0
2,6602132.0,Ysgol Gynradd Bodedern,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bodedern,Caergybi,Ynys Môn,,LL65 3TZ,01407 740201,95.0
3,6602133.0,Ysgol Gymuned Bodffordd,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bodffordd,Llangefni,Ynys Môn,,LL77 7LZ,01248 723384,80.0
4,6602135.0,Ysgol Gymuned Bryngwran,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bryngwran,Caergybi,Ynys Môn,,LL65 3PP,01407 720400,56.0


In [24]:
school_address.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1477 entries, 0 to 1476
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   School Number                  1474 non-null   float64
 1   School Name                    1476 non-null   object 
 2   LA Code                        1474 non-null   float64
 3   Local Authority                1474 non-null   object 
 4   Sector                         1474 non-null   object 
 5   Governance - see notes         1474 non-null   object 
 6   WM Code                        1474 non-null   object 
 7   Welsh Medium Type - see notes  1474 non-null   object 
 8   School Type                    1474 non-null   object 
 9   Religious Character            1474 non-null   object 
 10  Address 1                      1473 non-null   object 
 11  Address 2                      1431 non-null   object 
 12  Address 3                      1191 non-null   o

In [25]:
# Find missing data
missing_data = school_address.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("") 

School Number
False    1474
True        3
Name: School Number, dtype: int64

School Name
False    1476
True        1
Name: School Name, dtype: int64

LA Code
False    1474
True        3
Name: LA Code, dtype: int64

Local Authority
False    1474
True        3
Name: Local Authority, dtype: int64

Sector
False    1474
True        3
Name: Sector, dtype: int64

Governance - see notes
False    1474
True        3
Name: Governance - see notes, dtype: int64

WM Code
False    1474
True        3
Name: WM Code, dtype: int64

Welsh Medium Type - see notes
False    1474
True        3
Name: Welsh Medium Type - see notes, dtype: int64

School Type
False    1474
True        3
Name: School Type, dtype: int64

Religious Character
False    1474
True        3
Name: Religious Character, dtype: int64

Address 1
False    1473
True        4
Name: Address 1, dtype: int64

Address 2
False    1431
True       46
Name: Address 2, dtype: int64

Address 3
False    1191
True      286
Name: Address 3, dtype: int64

Add

Three rows have no school number, and one of them also has no school name.
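
The per-column `value_counts` output above is long. A more compact summary is `isnull().sum()`, which gives one count of missing values per column. The sketch below uses a made-up toy frame, not the real `school_address` table.

```python
import pandas as pd

# Toy frame standing in for school_address; the values are made up
toy = pd.DataFrame({
    "School Number": [6602130.0, None, None],
    "School Name": ["Ysgol Gynradd Amlwch", None, "(a) a footnote"],
})

# One count of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```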

In [26]:
missing_data = missing_data[(missing_data['School Number'] == True)]
missing_data.head()

Unnamed: 0,School Number,School Name,LA Code,Local Authority,Sector,Governance - see notes,WM Code,Welsh Medium Type - see notes,School Type,Religious Character,Address 1,Address 2,Address 3,Address 4,Postcode,Phone Number,Pupils - see notes
1474,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1475,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1476,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True


Check the data and decide whether these rows could be dropped

In [27]:
print(school_address.loc[[1474,1475,1476]])

      School Number                                        School Name  \
1474            NaN                                               None   
1475            NaN  (a) Whilst most of the usual data validation p...   
1476            NaN  Typically, overall numbers of pupils and teach...   

      LA Code Local Authority Sector Governance - see notes WM Code  \
1474      NaN            None   None                   None    None   
1475      NaN            None   None                   None    None   
1476      NaN            None   None                   None    None   

     Welsh Medium Type - see notes School Type Religious Character Address 1  \
1474                          None        None                None      None   
1475                          None        None                None      None   
1476                          None        None                None      None   

     Address 2 Address 3 Address 4 Postcode Phone Number  Pupils - see notes  
1474      None    

In [28]:
school_address.tail()

Unnamed: 0,School Number,School Name,LA Code,Local Authority,Sector,Governance - see notes,WM Code,Welsh Medium Type - see notes,School Type,Religious Character,Address 1,Address 2,Address 3,Address 4,Postcode,Phone Number,Pupils - see notes
1472,6817019.0,The Hollies School,681.0,Cardiff,Special,Not Applicable,,Not applicable,Special (without post-16 provision),---,Bryn Heulog,Pentwyn,Cardiff,,CF23 7XG,02920 734411,102.0
1473,6817021.0,Meadowbank Special School,681.0,Cardiff,Special,Not Applicable,,Not applicable,Special (without post-16 provision),---,Colwill Rd,Gabalfa,Cardiff,,CF14 2QQ,02920 616018,33.0
1474,,,,,,,,,,,,,,,,,
1475,,(a) Whilst most of the usual data validation p...,,,,,,,,,,,,,,,
1476,,"Typically, overall numbers of pupils and teach...",,,,,,,,,,,,,,,


These rows are either empty or contain footnotes rather than school information, so they can be dropped.

In [29]:
school_address = school_address.drop([1474,1475,1476])
school_address['School Number'] = school_address['School Number'].astype(np.int64)
school_address.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1474 entries, 0 to 1473
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   School Number                  1474 non-null   int64  
 1   School Name                    1474 non-null   object 
 2   LA Code                        1474 non-null   float64
 3   Local Authority                1474 non-null   object 
 4   Sector                         1474 non-null   object 
 5   Governance - see notes         1474 non-null   object 
 6   WM Code                        1474 non-null   object 
 7   Welsh Medium Type - see notes  1474 non-null   object 
 8   School Type                    1474 non-null   object 
 9   Religious Character            1474 non-null   object 
 10  Address 1                      1473 non-null   object 
 11  Address 2                      1431 non-null   object 
 12  Address 3                      1191 non-null   o
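
Dropping rows by their index labels works for this particular file. An alternative that does not depend on row positions is `dropna` with a `subset`. The sketch below uses a made-up toy frame rather than the real `school_address` table, with school names taken from the tail of the table shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "School Number": [6817019.0, 6817021.0, None, None],
    "School Name": ["The Hollies School", "Meadowbank Special School", None, "(a) a footnote"],
})

# Keep only rows that have a school number, then store the number as an integer
cleaned = toy.dropna(subset=["School Number"]).copy()
cleaned["School Number"] = cleaned["School Number"].astype("int64")
print(cleaned)
```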

In [30]:
school_address.head()

Unnamed: 0,School Number,School Name,LA Code,Local Authority,Sector,Governance - see notes,WM Code,Welsh Medium Type - see notes,School Type,Religious Character,Address 1,Address 2,Address 3,Address 4,Postcode,Phone Number,Pupils - see notes
0,6602130,Ysgol Gynradd Amlwch,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Amlwch,Ynys Môn,,,LL68 9DU,01407 830414,279.0
1,6602131,Ysgol Gynradd Beaumaris,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Maeshyfryd,Beaumaris,Ynys Môn,,LL58 8HL,01248 810451,48.0
2,6602132,Ysgol Gynradd Bodedern,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bodedern,Caergybi,Ynys Môn,,LL65 3TZ,01407 740201,95.0
3,6602133,Ysgol Gymuned Bodffordd,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bodffordd,Llangefni,Ynys Môn,,LL77 7LZ,01248 723384,80.0
4,6602135,Ysgol Gymuned Bryngwran,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bryngwran,Caergybi,Ynys Môn,,LL65 3PP,01407 720400,56.0


Add `Postcode` to the `schools` table using `School Number`

In [31]:
# Rename School Number to School_code so that both tables share the same merge key
school_address.rename(columns = {'School Number':'School_code'}, inplace = True)
school_address.head()

Unnamed: 0,School_code,School Name,LA Code,Local Authority,Sector,Governance - see notes,WM Code,Welsh Medium Type - see notes,School Type,Religious Character,Address 1,Address 2,Address 3,Address 4,Postcode,Phone Number,Pupils - see notes
0,6602130,Ysgol Gynradd Amlwch,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Amlwch,Ynys Môn,,,LL68 9DU,01407 830414,279.0
1,6602131,Ysgol Gynradd Beaumaris,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Maeshyfryd,Beaumaris,Ynys Môn,,LL58 8HL,01248 810451,48.0
2,6602132,Ysgol Gynradd Bodedern,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bodedern,Caergybi,Ynys Môn,,LL65 3TZ,01407 740201,95.0
3,6602133,Ysgol Gymuned Bodffordd,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bodffordd,Llangefni,Ynys Môn,,LL77 7LZ,01248 723384,80.0
4,6602135,Ysgol Gymuned Bryngwran,660.0,Isle of Anglesey,Primary,Community,WM,Welsh medium,"Nursery, Infants & Juniors",---,Bryngwran,Caergybi,Ynys Môn,,LL65 3PP,01407 720400,56.0


In [32]:
# merge 'left' on `School_code`
schools_adr = pd.merge(schools, school_address, on = 'School_code', how = 'left')
schools_adr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 204
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   School_code                    205 non-null    int64  
 1   School_name                    205 non-null    object 
 2   Local_authority                205 non-null    object 
 3   Consortium                     205 non-null    object 
 4   Rating                         205 non-null    object 
 5   School Name                    204 non-null    object 
 6   LA Code                        204 non-null    float64
 7   Local Authority                204 non-null    object 
 8   Sector                         204 non-null    object 
 9   Governance - see notes         204 non-null    object 
 10  WM Code                        204 non-null    object 
 11  Welsh Medium Type - see notes  204 non-null    object 
 12  School Type                    204 non-null    obj

One `Postcode` entry is missing. Let's find it.

In [33]:
missing_data = schools_adr.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("") 
    
missing_data = missing_data[(missing_data['Postcode'] == True)]
missing_data.head()

School_code
False    205
Name: School_code, dtype: int64

School_name
False    205
Name: School_name, dtype: int64

Local_authority
False    205
Name: Local_authority, dtype: int64

Consortium
False    205
Name: Consortium, dtype: int64

Rating
False    205
Name: Rating, dtype: int64

School Name
False    204
True       1
Name: School Name, dtype: int64

LA Code
False    204
True       1
Name: LA Code, dtype: int64

Local Authority
False    204
True       1
Name: Local Authority, dtype: int64

Sector
False    204
True       1
Name: Sector, dtype: int64

Governance - see notes
False    204
True       1
Name: Governance - see notes, dtype: int64

WM Code
False    204
True       1
Name: WM Code, dtype: int64

Welsh Medium Type - see notes
False    204
True       1
Name: Welsh Medium Type - see notes, dtype: int64

School Type
False    204
True       1
Name: School Type, dtype: int64

Religious Character
False    204
True       1
Name: Religious Character, dtype: int64

Address 1
False    

Unnamed: 0,School_code,School_name,Local_authority,Consortium,Rating,School Name,LA Code,Local Authority,Sector,Governance - see notes,...,Welsh Medium Type - see notes,School Type,Religious Character,Address 1,Address 2,Address 3,Address 4,Postcode,Phone Number,Pupils - see notes
55,False,False,False,False,False,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


In [34]:
print(schools_adr.loc[[55]])

    School_code             School_name Local_authority Consortium  \
55      6664001  Llanfyllin High School   Powys / Powys  ERW / ERW   

        Rating School Name  LA Code Local Authority Sector  \
55  Amber/Oren         NaN      NaN             NaN    NaN   

   Governance - see notes  ... Welsh Medium Type - see notes School Type  \
55                    NaN  ...                           NaN         NaN   

   Religious Character Address 1 Address 2 Address 3 Address 4 Postcode  \
55                 NaN       NaN       NaN       NaN       NaN      NaN   

   Phone Number Pupils - see notes  
55          NaN                NaN  

[1 rows x 21 columns]


Because this school is rated Amber, it will not be required in the analysis and the missing data can be ignored.
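
Unmatched rows in a left merge can also be found directly with the `indicator` parameter of `pd.merge`. The sketch below is a small illustration on two toy frames that reuse two school codes from the tables above; it is not the notebook's own method.

```python
import pandas as pd

ratings = pd.DataFrame({
    "School_code": [6604025, 6664001],
    "Rating": ["Yellow/Melyn", "Amber/Oren"],
})
addresses = pd.DataFrame({
    "School_code": [6604025],
    "Postcode": ["LL68 9TH"],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged = pd.merge(ratings, addresses, on="School_code", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["School_code", "Rating"]])
```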

In [35]:
schools_adr.head()

Unnamed: 0,School_code,School_name,Local_authority,Consortium,Rating,School Name,LA Code,Local Authority,Sector,Governance - see notes,...,Welsh Medium Type - see notes,School Type,Religious Character,Address 1,Address 2,Address 3,Address 4,Postcode,Phone Number,Pupils - see notes
0,6604025,Ysgol Syr Thomas Jones,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Yellow/Melyn,Ysgol Syr Thomas Jones,660.0,Isle of Anglesey,Secondary,Community,...,Bilingual (Type B),Secondary (ages 11-19),---,Pentrefelin,Amlwch,Ynys Mon,,LL68 9TH,01407 830287,507.0
1,6604026,Ysgol Uwchradd Caergybi,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Yellow/Melyn,Ysgol Uwchradd Caergybi,660.0,Isle of Anglesey,Secondary,Community,...,English with significant Welsh,Secondary (ages 11-19),---,Caergybi,Ynys Môn,,,LL65 1NP,01407 762219,816.0
2,6604027,Ysgol Gyfun Llangefni,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Green/Gwyrdd,Ysgol Gyfun Llangefni,660.0,Isle of Anglesey,Secondary,Community,...,Bilingual (Type B),Secondary (ages 11-19),---,Llangefni,Ynys Môn,,,LL77 7NG,01248 723441,654.0
3,6604028,Ysgol David Hughes,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Yellow/Melyn,Ysgol David Hughes,660.0,Isle of Anglesey,Secondary,Community,...,Bilingual (Type B),Secondary (ages 11-19),---,Ffordd Pentraeth,Porthaethwy,Ynys Môn,,LL59 5SS,01248 712287,1068.0
4,6604029,Ysgol Uwchradd Bodedern,Isle of Anglesey / Sir Ynys Môn,GwE / GwE,Red/Coch,Ysgol Uwchradd Bodedern,660.0,Isle of Anglesey,Secondary,Community,...,Bilingual (Type B),Secondary (ages 11-19),---,Bro Alaw,Bodedern,Caergybi,Ynys Môn,LL65 3TL,01407 741000,703.0


In [None]:
# Save the dataframe for further evaluation:
#schools_adr.to_csv('schools_rated.csv', index = None)

#### 4. Fetch the average property price data

The UK Government's official website holds the data in a .csv file: <br>
https://www.gov.uk/government/statistics/uk-house-price-index-wales-january-2021/uk-house-price-index-wales-january-2021 <br>
It was downloaded and saved in the project folder in raw form for evaluation: <br> `Wales-annual-price-change-by-local-authority-2021-01.csv` The file is not in the default `utf-8` encoding, therefore it requires the parameter `encoding='ANSI'` or `engine='python'`. Note that `'ANSI'` is recognised as an encoding name only on Windows; `encoding='cp1252'` is the likely equivalent on other systems. 
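
The encoding issue can be reproduced without the file. The sketch below assumes the CSV was saved in the Windows-1252 code page, which has not been verified here. It shows that the '£' symbol is a single byte in that code page and that the same bytes cannot be decoded as UTF-8.

```python
pound = "£92888"

# In cp1252 (the usual Western European "ANSI" code page) '£' is the single byte 0xA3
raw = pound.encode("cp1252")

# The same bytes are not valid UTF-8, which is why the default decoding fails
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(raw, utf8_ok, raw.decode("cp1252"))
```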



In [36]:
df4 = pd.read_csv('Wales-annual-price-change-by-local-authority-2021-01.csv', encoding='ANSI')
df4.head(25)

Unnamed: 0,Local authorities,January 2021,January 2020,Difference
0,Blaenau Gwent,£92888,£90884,2.2%
1,Bridgend,£167325,£156840,6.7%
2,Caerphilly,£149358,£141484,5.6%
3,Cardiff,£224555,£213409,5.2%
4,Carmarthenshire,£162338,£153222,5.9%
5,Ceredigion,£200147,£184545,8.5%
6,Conwy,£188111,£171569,9.6%
7,Denbighshire,£174724,£161290,8.3%
8,Flintshire,£192791,£177425,8.7%
9,Gwynedd,£175287,£160893,8.9%


Drop the last two columns, which are not used, strip the '£' symbol, and convert the values to integers.

In [37]:
df4 = df4.drop(['January 2020', 'Difference'], axis=1)
df4.rename(columns = {'Local authorities':'County', 'January 2021':'Ave_price_2021'}, inplace = True)
# strip the '£' symbol and convert the prices to integers
df4['Ave_price_2021'] = df4['Ave_price_2021'].str.replace('£', '', regex=False).astype('int64')
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   County          23 non-null     object
 1   Ave_price_2021  23 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 496.0+ bytes


In [None]:
# Save the dataframe for further evaluation:
#df4.to_csv('Prices_Wales.csv', index = None)

## This completes the data collection.