<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Leuven City (part 1 by Levent Bingol)</font></h1>

## Introduction
In this project we explore, segment, and cluster the neighborhoods in the city of Leuven. The neighborhood data comes from the Leuven page on Vlaanderen.be, which lists the postal codes and sub-municipalities (deelgemeenten) we need. We scrape that web page, clean the data, and read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we will analyze the dataset to explore and cluster the neighborhoods in the city of Leuven.

## Table of Contents


<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

 <a href="#item1"> Data Gathering and PreProcessing (Webscrape the Data from web page into Dataframe)(Part 1)</a>
    
    
</font>
</div>

### Import necessary libraries for part 1
Before we get the data and start exploring it, let's import all the dependencies that we will need.
In this project we rely on BeautifulSoup and requests for web scraping.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

#import webscraping libraries
from bs4 import BeautifulSoup
import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
print('Libraries imported.')

Libraries imported.


## 1. Data Gathering and PreProcessing (Webscrape the Data from web page into Dataframe)
There are two methods to scrape the data, as shown below. We can use either one to get the necessary HTML and data from the Vlaanderen.be page.
    
    a. Pandas method
    
    b. BeautifulSoup method


## a. Pandas method
In this method we fetch the page with the requests library and let pandas convert the HTML table in the response text into a dataframe.


In [2]:
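# Method 1: fetch the page and let pandas parse the HTML tables
from io import StringIO

url="https://www.vlaanderen.be/gemeenten-en-provincies/provincie-vlaams-brabant/leuven"
LeuvenSource = requests.get(url).text

# wrapping the text in StringIO avoids the deprecation warning that newer
# pandas versions emit when read_html receives a literal HTML string
tables = pd.read_html(StringIO(LeuvenSource))
# the first table on the page holds the postal codes and sub-municipalities
df=pd.DataFrame(tables[0])
df
#LeuvenSource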

Unnamed: 0,0,1
0,Postcode(s),"3000, 3001, 3010, 3012, 3018"
1,Deelgemeenten,"Leuven, Heverlee, Kessel-Lo, Wilsele, Wijgmaal"


In [5]:

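# keep only the value column, then split each comma-separated string into
# separate columns, stripping the space that follows each comma
df = pd.DataFrame(df, columns=[1])
df= df[1].apply(lambda x : pd.Series([part.strip() for part in x.split(',')])).head()
df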

Unnamed: 0,0,1,2,3,4
0,3000,3001,3010,3012,3018
1,Leuven,Heverlee,Kessel-Lo,Wilsele,Wijgmaal
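
A plain `split(',')` keeps the space that follows each comma in the source string (`' 3001'` rather than `'3001'`), which matters once the values are compared or merged with other data. Below is a minimal sketch in plain Python, assuming the two value strings shown in the table above; `split_clean` is a hypothetical helper name, not part of pandas.

```python
def split_clean(text, sep=","):
    """Split text on sep and strip surrounding whitespace from each part."""
    return [part.strip() for part in text.split(sep)]


# The two value strings as they appear in the scraped table.
postcodes_raw = "3000, 3001, 3010, 3012, 3018"
names_raw = "Leuven, Heverlee, Kessel-Lo, Wilsele, Wijgmaal"

naive = postcodes_raw.split(",")        # keeps the leading spaces
postcodes = split_clean(postcodes_raw)  # whitespace removed
names = split_clean(names_raw)

print(naive[1] == "3001")      # False, the value is ' 3001'
print(postcodes[1] == "3001")  # True
```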


In [6]:
#pre-processing the data frame
neighborhoods=df.transpose()
neighborhoods[2]=neighborhoods[1]
#we know that borough is Leuven city so we set it into the table
neighborhoods[1]=['Leuven','Leuven','Leuven','Leuven','Leuven']
neighborhoods.columns = ['PostalCode','Borough', 'Neighborhood'] 

In [7]:
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood
0,3000,Leuven,Leuven
1,3001,Leuven,Heverlee
2,3010,Leuven,Kessel-Lo
3,3012,Leuven,Wilsele
4,3018,Leuven,Wijgmaal
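
The same table can also be assembled directly from two lists, which avoids the transpose and the column shuffling. This is only a sketch, assuming the postal codes and names have already been split into Python lists; the variable names are illustrative. Passing the scalar `'Leuven'` lets pandas repeat it for every row.

```python
import pandas as pd

postal_codes = ["3000", "3001", "3010", "3012", "3018"]
names = ["Leuven", "Heverlee", "Kessel-Lo", "Wilsele", "Wijgmaal"]

# A scalar value in the dict is broadcast to the length of the list columns.
neighborhoods_alt = pd.DataFrame(
    {"PostalCode": postal_codes, "Borough": "Leuven", "Neighborhood": names}
)
print(neighborhoods_alt)
```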


## b. BeautifulSoup method:

We can download the web page as an HTML file and then parse it with the lxml parser. The download can be done with the !wget shell command or with the get method of the requests library; BeautifulSoup then does the parsing.

In [8]:
# The HTML can be obtained in two alternative ways; either way BeautifulSoup parses it with the lxml parser
# Alternative 1: download the HTML file with wget

!wget -q -O 'Leuvenpost.html' https://www.vlaanderen.be/gemeenten-en-provincies/provincie-vlaams-brabant/leuven
print('Data downloaded!')
with open('Leuvenpost.html') as can_html_file:
    soup=BeautifulSoup(can_html_file,'lxml')
    
# Alternative 2: fetch the HTML with requests (this replaces the soup built above)
LeuvenSource2= requests.get("https://www.vlaanderen.be/gemeenten-en-provincies/provincie-vlaams-brabant/leuven").text
soup=BeautifulSoup(LeuvenSource2,'lxml')
#soup

Data downloaded!


With the following code we extract the first table in the HTML into the tableLeuven element using BeautifulSoup. In the following steps we will turn this information into a pandas dataframe. Prettifying the element lets you examine the hierarchy in the data.

In [9]:
tableLeuven=soup.find('table')
print(tableLeuven.prettify())

<table class="data-table data-table--no-header">
 <tbody>
  <tr>
   <td class="data-table__body-title">
    Postcode(s)
   </td>
   <td>
    3000, 3001, 3010, 3012, 3018
   </td>
  </tr>
  <tr>
   <td class="data-table__body-title">
    Deelgemeenten
   </td>
   <td>
    Leuven, Heverlee, Kessel-Lo, Wilsele, Wijgmaal
   </td>
  </tr>
 </tbody>
</table>


When we examine the tableLeuven data we see that the information we need sits inside the td cells. Each row has a title cell (Postcode(s) or Deelgemeenten) followed by a cell with the comma-separated values. To pull the cells out of the HTML element we will use the find_all method of the BeautifulSoup library.

In [10]:
#Table cells in a list: each title cell is followed by its value cell
tableContent=tableLeuven.find_all ('td')
tableContent[0:9]

[<td class="data-table__body-title">Postcode(s)</td>,
 <td>3000, 3001, 3010, 3012, 3018</td>,
 <td class="data-table__body-title">Deelgemeenten</td>,
 <td>Leuven, Heverlee, Kessel-Lo, Wilsele, Wijgmaal</td>]

Now we have a list, and we can explore the data inside it. The postal codes are in the second cell and the neighborhoods are in the fourth. The borough is Leuven for every row.
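
Because title cells and value cells alternate, they can be paired by position instead of relying on hard-coded indices. Below is a small sketch in plain Python, assuming the cell texts have already been extracted into a list (copied here from the output above); `cell_texts` and `table` are illustrative names.

```python
# Text of the four <td> cells, in document order.
cell_texts = [
    "Postcode(s)",
    "3000, 3001, 3010, 3012, 3018",
    "Deelgemeenten",
    "Leuven, Heverlee, Kessel-Lo, Wilsele, Wijgmaal",
]

# Even positions hold titles, odd positions hold the matching values.
table = dict(zip(cell_texts[0::2], cell_texts[1::2]))

print(table["Postcode(s)"])
print(table["Deelgemeenten"])
```

With real BeautifulSoup output, such a list could be built with `[cell.text.strip() for cell in tableContent]`.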

Next we get ready to put the data into a dataframe. Let's define the columns of our neighborhoods dataframe, which will hold the Leuven postal code and neighborhood information.

In [12]:
# define the dataframe columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods2= pd.DataFrame(columns=column_names)
neighborhoods2

Unnamed: 0,PostalCode,Borough,Neighborhood


In [36]:
# split the comma-separated strings and strip the space after each comma
a=[code.strip() for code in tableContent[1].text.split(',')]
b=[name.strip() for name in tableContent[3].text.split(',')]
neighborhoods2['PostalCode']=a
neighborhoods2['Neighborhood']=b
# the borough is Leuven for every row, so we set it for the whole column
neighborhoods2['Borough']='Leuven'

In [37]:
neighborhoods2

Unnamed: 0,PostalCode,Borough,Neighborhood
0,3000,Leuven,Leuven
1,3001,Leuven,Heverlee
2,3010,Leuven,Kessel-Lo
3,3012,Leuven,Wilsele
4,3018,Leuven,Wijgmaal


The dataframe is now filled with the information parsed from the tableContent list.

Now we can take a look at the content of both dataframes. Then we will do some preprocessing and cleaning in the following section.

In [38]:
#Pandas method
neighborhoods.head()
#neighborhoods.shape

Unnamed: 0,PostalCode,Borough,Neighborhood
0,3000,Leuven,Leuven
1,3001,Leuven,Heverlee
2,3010,Leuven,Kessel-Lo
3,3012,Leuven,Wilsele
4,3018,Leuven,Wijgmaal


In [39]:
# BeautifulSoup method
# Comparing the two solutions: both methods end with the same dataframe
neighborhoods2.head()
#neighborhoods2.shape

Unnamed: 0,PostalCode,Borough,Neighborhood
0,3000,Leuven,Leuven
1,3001,Leuven,Heverlee
2,3010,Leuven,Kessel-Lo
3,3012,Leuven,Wilsele
4,3018,Leuven,Wijgmaal
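
Rather than comparing the two printed tables by eye, the check can be done in code. Below is a hedged sketch using `DataFrame.equals` on small stand-in frames (not the scraped ones). It also shows that a single unstripped value is enough to make two frames differ.

```python
import pandas as pd

columns = ["PostalCode", "Borough", "Neighborhood"]
rows = [
    ["3000", "Leuven", "Leuven"],
    ["3001", "Leuven", "Heverlee"],
    ["3010", "Leuven", "Kessel-Lo"],
    ["3012", "Leuven", "Wilsele"],
    ["3018", "Leuven", "Wijgmaal"],
]

# Stand-ins for the results of the two scraping methods.
from_pandas = pd.DataFrame(rows, columns=columns)
from_soup = pd.DataFrame(rows, columns=columns)

# Same frame, except one value keeps a leading space from a naive split.
unstripped = from_soup.copy()
unstripped.loc[1, "PostalCode"] = " 3001"

print(from_pandas.equals(from_soup))   # True
print(from_pandas.equals(unstripped))  # False
```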


Our dataframe now consists of three columns: PostalCode, Borough, and Neighborhood. Every row has an assigned borough, so no rows need to be dropped. We will save the dataframe to a CSV file.

In [48]:
neighborhoods.shape
len(neighborhoods['PostalCode'].unique())
LeuvenDF=neighborhoods
LeuvenDF

Unnamed: 0,PostalCode,Borough,Neighborhood
0,3000,Leuven,Leuven
1,3001,Leuven,Heverlee
2,3010,Leuven,Kessel-Lo
3,3012,Leuven,Wilsele
4,3018,Leuven,Wijgmaal


In [47]:
LeuvenDF.to_csv('LeuvenDFpart1.csv', encoding='utf-8', index=False)
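
When the CSV is read back, pandas infers numeric types by default, so the postal codes would come back as integers. Below is a minimal sketch that keeps them as strings by passing `dtype` to `pd.read_csv`. It assumes a temporary file with a hypothetical name rather than the notebook's `LeuvenDFpart1.csv`.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame(
    {
        "PostalCode": ["3000", "3001", "3010", "3012", "3018"],
        "Borough": "Leuven",
        "Neighborhood": ["Leuven", "Heverlee", "Kessel-Lo", "Wilsele", "Wijgmaal"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "leuven_example.csv")  # hypothetical file name
    frame.to_csv(path, encoding="utf-8", index=False)

    inferred = pd.read_csv(path)                             # PostalCode parsed as integers
    restored = pd.read_csv(path, dtype={"PostalCode": str})  # PostalCode kept as text

print(inferred["PostalCode"].tolist())
print(restored["PostalCode"].tolist())
```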

## This is the end of part 1. Answer for the first question