### JSON vs XML

JavaScript Object Notation (`JSON`) and Extensible Markup Language (`XML`) are two common formats for exchanging structured data on the web.<br>
JSON is well suited to representing and accessing nested data hierarchies. In a nutshell, it is like Python dictionaries for the web.

#### JSON Structure:
JSON is built on two key structures:

1. JSON objects, which are collections of key-value pairs. JSON objects are surrounded by curly braces, like a Python dictionary. In Python, JSON objects are parsed into dictionaries, and we can access them as we would a standard Python `dict`.

2. JSON arrays, which are ordered lists of values. In Python, JSON arrays are parsed into lists and accessed in the same way as any other list.

* While the keys of a JSON object must be strings, the values in both JSON objects and arrays can be any valid JSON data type.
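As a minimal sketch of both structures, the snippet below parses a small JSON document with Python's standard `json` module. The sample data and the names `raw` and `record` are made up for illustration.

```python
import json

# Hypothetical sample: one JSON object that contains a JSON array and a nested object
raw = '{"lga": "Epe", "districts": ["Agbowa", "Ejinrin"], "postal_codes": {"Agbowa": 106104}}'

record = json.loads(raw)

# The JSON object is parsed into a dict, the JSON array into a list
print(type(record).__name__)
print(type(record["districts"]).__name__)

# Access works just like ordinary Python containers
print(record["districts"][0])
print(record["postal_codes"]["Agbowa"])
```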

### A little about LXML

The `lxml` module is a Python `XML` toolkit that provides Pythonic bindings for two `C` libraries: `libxslt` and `libxml2`. It combines a rich set of `XML` features with the speed of those underlying libraries. Its functions make it easy to parse `HTML` and `XML` files, which is why it is useful for web scraping and for extracting information from web pages.
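To give a feel for using `lxml` directly, here is a small sketch that parses an HTML fragment and queries it with XPath. The fragment and the names `fragment`, `tree` and `cells` are illustrative assumptions; the rest of this notebook uses `lxml` only indirectly, as the parser behind BeautifulSoup.

```python
from lxml import html

# Hypothetical fragment shaped like one row of the table scraped below
fragment = "<table><tr><td>Epe</td><td>Agbowa</td><td>106104</td></tr></table>"

# Parse the fragment into an element tree
tree = html.fromstring(fragment)

# XPath query: the text of every <td> element under the parsed fragment
cells = tree.xpath(".//td/text()")
print(cells)
```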

In this notebook, we will scrape a table from the internet and use it for data analysis. The table contains the name, district and postal code of Local Government Areas (LGAs) in Lagos State, Nigeria.

The table we will scrape is here: [lagos-link](https://en.wikipedia.org/wiki/Lagos_State#Cities_and_towns)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import glob
from bs4 import BeautifulSoup
print('imported')

imported


In [2]:
lagos_link = 'https://en.wikipedia.org/wiki/Lagos_State#Cities_and_towns'

Get the HTML source of the page from the website

In [3]:
source = requests.get(lagos_link).text
type(source)

str
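The request above assumes the server answers quickly and successfully. A slightly more defensive variant, sketched below, adds a timeout and raises on HTTP errors. The helper name `fetch_html` and the 10-second default are assumptions made for this example, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `source = fetch_html(lagos_link)`.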

Let's use BeautifulSoup to parse it

In [4]:
soup = BeautifulSoup(source, 'lxml')

In [5]:
type(soup)

bs4.BeautifulSoup

Next let's get the table that contains the data we want to scrape

In [6]:
My_table = soup.find('table',{'class':'wikitable sortable'})
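One caveat worth knowing: passing `'wikitable sortable'` as the class matches the exact string value of the `class` attribute, so it can miss a table whose classes appear in another order or alongside extra classes. The sketch below contrasts it with a CSS selector on a made-up page; `sample`, `sample_soup`, `by_find` and `by_select` are illustrative names, and `html.parser` is used here only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical page: same classes, different order, plus one extra class
sample = '<table class="sortable wikitable jquery-tablesorter"><tr><td>Epe</td></tr></table>'
sample_soup = BeautifulSoup(sample, "html.parser")

# Exact-string match on the class attribute: no match when order or extras differ
by_find = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches any table carrying both classes, in any order
by_select = sample_soup.select_one("table.wikitable.sortable")

print(by_find is None)
print(by_select is not None)
```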

Let's view the table

In [7]:
type(My_table)

bs4.element.Tag

In [8]:
My_table

<table class="wikitable sortable">
<tbody><tr>
<th>LGA</th>
<th>District</th>
<th>Postal code
</th></tr>
<tr>
<td>Ajeromi Ifelodun</td>
<td>Ajeromi</td>
<td>102103
</td></tr>
<tr>
<td>Amuwo Odofin</td>
<td>Amuwo odofin</td>
<td>102102
</td></tr>
<tr>
<td>Amuwo Odofin</td>
<td>Trade fair complex</td>
<td>102101
</td></tr>
<tr>
<td>Badagry</td>
<td>Badagry</td>
<td>103101
</td></tr>
<tr>
<td>Epe</td>
<td>Agbowa</td>
<td>106104
</td></tr>
<tr>
<td>Epe</td>
<td>Ejinrin</td>
<td>106102
</td></tr>
<tr>
<td>Epe</td>
<td>Epe</td>
<td>106101
</td></tr>
<tr>
<td>Epe</td>
<td>Erodo</td>
<td>106103
</td></tr>
<tr>
<td>Ibeju-Lekki</td>
<td>Ibeju</td>
<td>105101
</td></tr>
<tr>
<td>Ibeju-Lekki</td>
<td>Lekki</td>
<td>105102
</td></tr>
<tr>
<td>Ikorodu</td>
<td>Ikorodu rural</td>
<td>104101
</td></tr>
<tr>
<td>Ikorodu</td>
<td>Irepodun</td>
<td>104102
</td></tr>
<tr>
<td>Ojo</td>
<td>Ajangbadi Afromedia</td>
<td>102104
</td></tr>
<tr>
<td>Ojo</td>
<td>Ajangbadi Ikemba house</td>
<td>102107
</td></tr>

*Let's define a dictionary to hold the cleaned values from the table we extracted*

In [9]:
details_dict = {'lga':[], 'district':[], 'postal_code':[]}

We can see that all the data we want is between the `<td>` tags, so let's collect every `<td>` element in the table

In [10]:
links = My_table.find_all('td')

In [11]:
links[:10]  # let's see the first 10 links

[<td>Ajeromi Ifelodun</td>,
 <td>Ajeromi</td>,
 <td>102103
 </td>,
 <td>Amuwo Odofin</td>,
 <td>Amuwo odofin</td>,
 <td>102102
 </td>,
 <td>Amuwo Odofin</td>,
 <td>Trade fair complex</td>,
 <td>102101
 </td>,
 <td>Badagry</td>]

Next let's loop through links and extract only the text elements to a new list called text_links

In [12]:
type(links[0])

bs4.element.Tag

In [13]:
text_links = []

for link in links:
    text_links.append(link.text)
    
# view text_links
text_links

['Ajeromi Ifelodun',
 'Ajeromi',
 '102103\n',
 'Amuwo Odofin',
 'Amuwo odofin',
 '102102\n',
 'Amuwo Odofin',
 'Trade fair complex',
 '102101\n',
 'Badagry',
 'Badagry',
 '103101\n',
 'Epe',
 'Agbowa',
 '106104\n',
 'Epe',
 'Ejinrin',
 '106102\n',
 'Epe',
 'Epe',
 '106101\n',
 'Epe',
 'Erodo',
 '106103\n',
 'Ibeju-Lekki',
 'Ibeju',
 '105101\n',
 'Ibeju-Lekki',
 'Lekki',
 '105102\n',
 'Ikorodu',
 'Ikorodu rural',
 '104101\n',
 'Ikorodu',
 'Irepodun',
 '104102\n',
 'Ojo',
 'Ajangbadi Afromedia',
 '102104\n',
 'Ojo',
 'Ajangbadi Ikemba house',
 '102107\n',
 'Ojo',
 'Alaba',
 '102115\n',
 'Ojo',
 'Iba town new site',
 '102112\n',
 'Ojo',
 'Igbede',
 '102109\n',
 'Ojo',
 'Igbo Elerin',
 '102106\n',
 'Ojo',
 'Ilemba Awori',
 '102108\n',
 'Ojo',
 'Ilogbo',
 '102110\n',
 'Ojo',
 'Ira',
 '102114\n',
 'Ojo',
 'Ojo',
 '102101\n',
 'Ojo',
 'Okokomaiko',
 '102105\n',
 'Ojo',
 'Olojo',
 '102113\n',
 'Ojo',
 'Shibiri Ekune',
 '102111\n']
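An alternative to flattening every cell into one list is to walk the table row by row, which keeps each LGA, district and postal code together. The sketch below assumes a small hand-written excerpt (`sample`) rather than the live page, and the names `table` and `rows` are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row excerpt shaped like the table above
sample = """
<table class="wikitable sortable">
<tr><th>LGA</th><th>District</th><th>Postal code
</th></tr>
<tr><td>Epe</td><td>Agbowa</td><td>106104
</td></tr>
<tr><td>Ojo</td><td>Alaba</td><td>102115
</td></tr>
</table>
"""
table = BeautifulSoup(sample, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    # get_text(strip=True) drops the trailing newline seen in the postal codes
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

Each inner list is one table row, so no later reshaping or newline cleaning is needed with this approach.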

*Now, let's define two functions:*

1. To extract all the LGA, District and Postal Code values from `text_links` to the respective lists in `details_dict`
2. To clean the newline characters from the postal code values

In [14]:
def list_toDict(arr, dicta):
    """Distribute the elements of a flat list across the lists
        in a dictionary, one element per key in turn
        
    @param arr: A flat list whose length is a multiple of the number of keys
    @param dicta: A dictionary whose values are lists
    @return: The same dictionary with populated list values
    """
    
    keys = list(dicta.keys())
    
    # Guard against an empty dictionary or an incomplete final row,
    # which would otherwise loop forever or raise an IndexError
    if not keys or len(arr) % len(keys) != 0:
        raise ValueError('len(arr) must be a multiple of the number of keys in dicta')
    
    # Element i of the list belongs to key number i modulo the number of keys
    for i, value in enumerate(arr):
        dicta[keys[i % len(keys)]].append(value)
    
    # finally return the updated dictionary
    return dicta
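The round-robin split performed by `list_toDict` can also be expressed with extended slicing, since every third element of the flat list belongs to the same column. This is only a sketch on made-up data; `flat`, `keys` and `columns` are illustrative names.

```python
# Hypothetical flat list in the same LGA, district, postal code order
flat = ['Epe', 'Agbowa', '106104\n', 'Ojo', 'Alaba', '102115\n']
keys = ['lga', 'district', 'postal_code']

# flat[i::3] takes every third element starting at position i
columns = {key: flat[i::len(keys)] for i, key in enumerate(keys)}

print(columns['lga'])
print(columns['district'])
```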

In [15]:
def clean_text(arr):
    """Takes a list of strings and removes the newline
        characters from every string in it
    @param arr: A list of strings
    @return: A new list of cleaned strings
    """
    arr = [f.replace('\n','') for f in arr]
    return arr
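If the scraped values might also carry spaces or carriage returns, `str.strip()` is a common alternative to replacing `'\n'` alone. The values below are invented to show the difference.

```python
# Hypothetical values with different kinds of stray whitespace
values = ['102103\n', ' 102102 ', '102101\r\n']

# str.strip() with no arguments removes leading and trailing whitespace,
# including spaces, tabs, newlines and carriage returns
cleaned = [value.strip() for value in values]

print(cleaned)
```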

In [16]:
# Let's update the details_dict 

details_dict = list_toDict(text_links, details_dict)

In [17]:
# Let's see the details_dict first key

details_dict['lga']

['Ajeromi Ifelodun',
 'Amuwo Odofin',
 'Amuwo Odofin',
 'Badagry',
 'Epe',
 'Epe',
 'Epe',
 'Epe',
 'Ibeju-Lekki',
 'Ibeju-Lekki',
 'Ikorodu',
 'Ikorodu',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo',
 'Ojo']

Finally, let's clean the postal code column by removing the newline characters. We will pass the `postal_code` list from `details_dict` to `clean_text`

In [18]:
details_dict['postal_code'] = clean_text(details_dict['postal_code'])

*Let's display `details_dict` as a DataFrame, using the keys of `details_dict` as the columns of the DataFrame*

In [19]:
lagos_df = pd.DataFrame(details_dict, columns=details_dict.keys())

In [20]:
lagos_df

|   | lga | district | postal_code |
|---|---|---|---|
| 0 | Ajeromi Ifelodun | Ajeromi | 102103 |
| 1 | Amuwo Odofin | Amuwo odofin | 102102 |
| 2 | Amuwo Odofin | Trade fair complex | 102101 |
| 3 | Badagry | Badagry | 103101 |
| 4 | Epe | Agbowa | 106104 |
| 5 | Epe | Ejinrin | 106102 |
| 6 | Epe | Epe | 106101 |
| 7 | Epe | Erodo | 106103 |
| 8 | Ibeju-Lekki | Ibeju | 105101 |
| 9 | Ibeju-Lekki | Lekki | 105102 |


*Now that we have scraped the table from the website into a pandas DataFrame, we can go ahead and do any other processing we want, such as converting the string values in the postal_code column to integers. That's how to do web scraping!*
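As one possible follow-up, the postal codes can be converted from strings to integers with `astype`. The sketch uses a small stand-in frame (`sample_df`, an assumed name) with the same columns as `lagos_df`, so it runs without the scraping steps.

```python
import pandas as pd

# Hypothetical two-row frame with the same columns as lagos_df
sample_df = pd.DataFrame({
    'lga': ['Epe', 'Ojo'],
    'district': ['Agbowa', 'Alaba'],
    'postal_code': ['106104', '102115'],
})

# Convert the postal codes from strings to integers
sample_df['postal_code'] = sample_df['postal_code'].astype(int)

print(sample_df.dtypes)
```

Whether integers are the right type is a judgment call: postal codes are identifiers, so keeping them as strings is safer if any code could start with a zero.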