# **Week 4 Applied Session: Data Parsing**

In [3]:
# ignore FutureWarning, UserWarning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [4]:
# Only run this cell when you are using Colab.
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google'

## **1. Parsing Excel Files**

This week we provide some general instructions on how to scrape data from Excel files. Such files are often generated purely for human consumption.
The person who generated the data often tries to make it easy for humans to read,
disregarding the importance of releasing it in a machine-readable format.
You will find that the scraping process is much more difficult and time-consuming than dealing with machine-readable files, e.g. JSON, CSV, XML.
But the ultimate goal stays the same, i.e., <font color = "red"> extracting data and converting it into a machine-readable format</font>.
* * *

### 1.1 Introduction to Excel

Excel is a popular spreadsheet application originally
developed for Windows.
You can also find free alternatives that run on Mac OS and Linux,
for example, LibreOffice Calc and OpenOffice Calc can both work with Excel files.
An Excel document is also called a workbook.
It is usually saved in a file with either .xlsx extension or .xls extension,
depending on the Excel version you use.
A workbook can contain multiple worksheets, each of which is a grid of cells
where you keep and manipulate the data.
Those cells are arranged in numbered rows and letter-named columns.
Excel can display not only tabular data but also data like line graphs, histograms and charts.
It also provides a set of data analysis functions for statistical, engineering and financial needs.
Presumably, most of you know what an Excel file looks like.
If not, please find some Excel files online and have a look or open the Excel file used in this tutorial.

There are many ways of manipulating data stored in Excel spreadsheets.
For instance,
"[Working with Excel Files in Python](http://www.python-excel.org/)" contains pointers to
the best information available about working with Excel files in Python.
The website lists the following Python packages that deal with Excel:

* `openpyxl`: Reads/writes Excel 2010 xlsx/xlsm/xltx/xltm files.
* `xlsxwriter`: Writes text, numbers, formulas and hyperlinks to multiple worksheets in an Excel 2007+ XLSX file.
* `xlrd`: Extracts data from legacy Excel files (.xls). Since version 2.0, `xlrd` no longer reads .xlsx files.
* `xlwt`: Writes and formats Excel files compatible with Microsoft Excel versions 95 to 2003.
* `xlutils`: Contains a set of advanced tools for manipulating Excel files (requires `xlrd` and `xlwt`).

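You would need to install each separately if you want to use them.
In this tutorial we will use the Pandas `ExcelFile` class to demonstrate how to
parse Excel files. Pandas relies on one of these packages as its engine: `openpyxl` for .xlsx files in recent versions, and `xlrd` for .xls files.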

Some tutorials on working with Excel files that might be of interest:
* [Working with Excel Spreadsheets](https://automatetheboringstuff.com/chapter12/): It uses `openpyxl` to read
data from spreadsheets. Read the following sections:
    * Reading Excel Documents 📖
    * Project: Reading Data from a Spreadsheet 📖
* [How to read Excel files with Python (xlrd tutorial)](https://www.youtube.com/watch?v=p0DNcTnreuY):
a YouTube video on extracting data from a simple Excel file. (Optional)


This tutorial will use a running example to show
you how to extract data from Excel spreadsheets step-by-step using Pandas.
The example we use in this tutorial is "[Table 2: Nutrition](https://data.unicef.org/resources/state-worlds-children-2016-statistical-tables/)" from UNICEF's report on
[The State of the World's Children](https://www.unicef.org/reports/state-worlds-children-2014) for 2014.


* <font color = "red"> Note: </font>
    * <font color = "red">Our task is to extract the statistical data table on the prevalence of
underweight, stunting, wasting and overweight among children in different countries.</font>

    * <font color = "red">The link for Table 2 points to the 2016 dataset. However, we will be using the 2014 dataset, which is provided in the zip folder.</font>
* * *

### 1.2 Parsing Excel with Pandas
In this section we will walk through the process of parsing our example Excel file with Pandas.
A short tutorial on how to use the Pandas `read_excel` function and the `ExcelFile` class can
be found on the Pandas [webpage on IO](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). 📖  (Just read the section "Reading Excel Files".)

Before we start parsing our Excel file,
we need to make sure the Excel engine that Pandas relies on is installed.
Older versions of Pandas used `xlrd` for every Excel file.
Since `xlrd` 2.0 reads only .xls files, recent versions of Pandas use `openpyxl` to read .xlsx files such as ours.
Both packages run on Linux and Mac as well as Windows.
If you use Anaconda, you probably do not need to worry about this,
as Anaconda includes the most popular Python packages for data analysis.
Otherwise, you might need to install the engine yourself in order to run `read_excel`.
To install it, you can use [pip](https://pypi.python.org/pypi/pip),
a Python package management system.
In your command line, simply type
```shell
    pip install openpyxl xlrd
```

Now to start our script,
we need to import Pandas
and open our Excel file by creating a Pandas `ExcelFile` object.
    

In [None]:
# install the Excel engines if they are missing (openpyxl for .xlsx, xlrd for .xls)

# !pip install openpyxl xlrd

In [5]:
import pandas as pd

# excel_data = pd.ExcelFile('/content/drive/Shareddrives/FIT5196_S2_2025/week4/SOWC 2014 Stat Tables_Table 2.xlsx')
excel_data = pd.ExcelFile('C:/Personal Stuff/Monash/Sem3/FIT5196DataWrangling/Labs/data/week4/SOWC 2014 Stat Tables_Table 2.xlsx')
# excel_data = pd.ExcelFile('SOWC 2014 Stat Tables_Table 2.xlsx')
excel_data

<pandas.io.excel._base.ExcelFile at 0x19ad374d010>

By running the code above, we have loaded the Excel file as a Pandas' ExcelFile object into Python.

Are we ready to parse our Excel File? Before starting to parse the file,
we probably need to ask ourselves a couple of questions. For instance,
<font color = "red">
* How many sheets does our Excel file have?
* Which sheet contains our data? What is the name of the sheet? Or what is the index of the sheet?
 </font>

Unlike CSV files, an Excel file can have multiple worksheets.
For example, our Excel file contains two worksheets, one contains data notes,
and the other contains the data we want.
In order to get our data, we will just pull the sheet with the data we want.

If your Excel file has a couple of worksheets and you can guess the index of
the worksheet that contains the data you want, or you have been told from which
worksheet you are going to extract data, you can directly use Pandas'
[`read_excel`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html#pandas.read_excel)
function
```python
    pandas.read_excel()
```
This function reads an Excel table in a given worksheet into a Pandas DataFrame,
where you can start further manipulating the data.

However, in some cases, particularly when an Excel file has a lot of worksheets,
it is helpful to view all the sheets by name.
So, let's check the names of the sheets in our Excel file:

In [6]:
# show the names of all the sheets

excel_data.sheet_names

['Data Notes', 'Table 2 ']

There are two worksheets in our Excel file.
The one that we are looking for is "Table 2 ".
So, let's read the second worksheet into a Pandas DataFrame.
Note that <font color = "red"> there is an extra space in the worksheet name </font>.
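Without this space, running the following parsing code
will result in an error. With the `xlrd` engine the message reads
```
    XLRDError: No sheet named <'Table 2'>
```
Other engines word the message differently, as the output of the next cell shows.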

In [7]:
try:
    # try to parse the sheet using its name WITHOUT the trailing space; this fails
    df = excel_data.parse('Table 2')
except Exception as e:
    print(e)

Worksheet named 'Table 2' not found
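
Stray whitespace in sheet names is easy to miss. One way to guard against it is to look up the real name from `sheet_names` before parsing. The sketch below is only an illustration: `find_sheet_name` is a hypothetical helper, not part of Pandas, and it works on the list of names printed above.

```python
def find_sheet_name(sheet_names, wanted):
    """Return the actual sheet name matching `wanted`,
    ignoring surrounding whitespace and letter case."""
    target = wanted.strip().lower()
    for name in sheet_names:
        if name.strip().lower() == target:
            return name
    raise KeyError(f"No sheet matching {wanted!r} in {sheet_names}")


# Sheet names as reported by excel_data.sheet_names above
sheet_names = ['Data Notes', 'Table 2 ']

actual_name = find_sheet_name(sheet_names, 'Table 2')
print(repr(actual_name))  # 'Table 2 '
```

The returned name can then be passed to `excel_data.parse(actual_name)`.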


In [8]:
# excel_data is an ExcelFile object;
# its parse() method returns the chosen worksheet as a DataFrame
df = excel_data.parse('Table 2 ')
df.head(100)

Unnamed: 0.1,Unnamed: 0,TABLE 2. NUTRITION,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,,,TABLEAU 2. NUTRITION,,,,,,,,...,,,,,,,,,,
1,,,,TABLA 2. NUTRICIÓN,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,Countries and areas,,,Low birthweight (%),,Early initiation of breastfeeding (%),,Exclusive breastfeeding\n<6 months (%),,...,Stunting (%),,Wasting (%),,Overweight (%),,"Vitamin A supplementation, full coverageΔ (%)",,Iodized salt consumption (%),
4,,,,,,,,,,,...,moderate \nand severeθ,,moderate \nand severeθ,,moderate and severeθ,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,,Indonesia,Indonésie,Indonesia,9,x,29.3,,41.5,,...,35.6,,13.3,,12.3,,73,,62.3,"x,y"
96,,Iran (Islamic Republic of),Iran (République islamique d’),Irán (República Islámica de),7,x,55.6,x,23,x,...,–,,–,,–,,–,,98.7,"x,y"
97,,Iraq,Iraq,Iraq,13.4,,42.8,,19.6,,...,22.6,,7.4,,11.8,,–,,29,
98,,Ireland,Irlande,Irlanda,–,,–,,–,,...,–,,–,,–,,–,,–,


In [9]:
print(f"The shape of the dataframe is: {df.shape}")

The shape of the dataframe is: (322, 28)


We have loaded the target worksheet into Python.
There are 322 rows and 28 columns (You can use `df.shape` to
see the dimensionality of the DataFrame).

If you scroll through the output, you will notice <font color = "red">  that the loaded data table is quite messy </font>.
The messiness includes:
* Some rows contain only missing values, which are shown as NaN in a Pandas DataFrame.
* Column heads are in three languages, i.e., English, French and Spanish.
* Column heads in one language are spread over multiple rows.
* Country names also appear in three languages.
* Notes shown in the original Excel file appear in rows towards the end of the data frame.

<font color = "red"> Remember that our goal is to extract the data table in English. </font>
It is clear that we need to further process the data frame.
For demonstration purposes,
we will try to keep the example as simple as possible,
so we will not extract column heads here.
Instead, if you are interested in programmatically extracting column heads,
you can try it yourself.


### **1.3 Tasks**


#### **Task 1: drop useless columns and rows**

You can start by removing the country names in French and Spanish,
which corresponds to removing two columns, labeled "Unnamed: 2" and "Unnamed: 3" in our data frame.
To do this, you can use DataFrame's [`drop()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) function,
which returns a new object with the given labels removed from the requested axis.
You will frequently use this function later in this section.

In [10]:
df = df.drop(['Unnamed: 2', 'Unnamed: 3'], axis=1)

In [11]:
df

Unnamed: 0.1,Unnamed: 0,TABLE 2. NUTRITION,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,Countries and areas,Low birthweight (%),,Early initiation of breastfeeding (%),,Exclusive breastfeeding\n<6 months (%),,"Introduction of solid, semi-solid or soft food...",,...,Stunting (%),,Wasting (%),,Overweight (%),,"Vitamin A supplementation, full coverageΔ (%)",,Iodized salt consumption (%),
4,,,,,,,,,,,...,moderate \nand severeθ,,moderate \nand severeθ,,moderate and severeθ,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
317,,,,,,,,,,,...,,,,,,,,,,
318,,,,,,,,,,,...,,,,,,,,,,
319,,,,,,,,,,,...,,,,,,,,,,
320,,,,,,,,,,,...,,,,,,,,,,


Now you should have 26 columns.

Next, you can <font color = "red"> remove all the rows and columns that are empty </font>, i.e., contain only NaNs.

In [12]:
df = df.dropna(axis=0, how = 'all') # rows
df = df.dropna(axis=1, how = 'all') # columns
df.head(10)
# when deleting rows and columns like this, indices are not reset, so you get missing indices

Unnamed: 0,TABLE 2. NUTRITION,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
3,Countries and areas,Low birthweight (%),,Early initiation of breastfeeding (%),,Exclusive breastfeeding\n<6 months (%),,"Introduction of solid, semi-solid or soft food...",,Breastfeeding at age 2 (%),...,Stunting (%),,Wasting (%),,Overweight (%),,"Vitamin A supplementation, full coverageΔ (%)",,Iodized salt consumption (%),
4,,,,,,,,,,,...,moderate \nand severeθ,,moderate \nand severeθ,,moderate and severeθ,,,,,
6,,2008–2012*,,2008–2012*,,,,,,,...,,,,,,,2012,,2008–2012*,
7,FRENCH HEADINGS,Insuffisance\npondérale à la\nnaissance\n(%)\n,,Initiation\nprécoce de\nl’allaitement\n(%),,Allaitement\nexclusivement\nau sein <6 mois\n(%),,"Introduction\nd’aliments\nsolides, semisolides...",,Nourris au sein\nà l’âge de\n2 ans (%),...,Retard de\ncroissance (%)\n,,Émaciation (%)\n,,Surpoids (%)\n,,Couverture\ntotale par la\nsupplémentation en\...,,Consommation\nde sel iodé (%),
8,,,,,,,,,,,...,modérée et graveθ,,modérée et graveθ,,modérée et graveθ,,,,,
10,,2008–2012*,,2008−2012*,,,,,,,...,,,,,,,2012,,2008–2012*,
11,SPANISH HEADINGS,Bajo peso al nacer (%),,Iniciación temprana a la lactancia materna (%),,Lactancia materna exclusiva\n<6 meses (%),,"Incorporación de alimentos sóli- dos, semisóli...",,Lactancia materna a los 2 anos (%),...,Cortedad de\ntalla (%),,Emaciación (%),,Sobrepeso (%),,Suplementos de vitamina A cobertura completa Δ...,,Consumo de sal yodada (%),
12,,,,,,,,,,,...,moderada y graveθ,,moderada y graveθ,,moderada y graveθ,,,,,
14,,2008–2012*,,2008−2012*,,,,,,,...,,,,,,,2012,,2008–2012*,
16,Afghanistan,–,,–,,–,,29,x,54,...,59,x,9,x,4.6,x,–,,20.4,


In [13]:
df.shape

(245, 25)

Here you can use the [`dropna`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) function of DataFrame. The first argument is `axis` (0 or 'index' means rows, and 1 or 'columns' means columns),
and the second argument, `how = 'all'`, deletes only those rows/columns whose values are all NaN.
This step removed a further 77 rows and 1 column.
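
To see what `how = 'all'` does in isolation, here is a minimal sketch on a made-up three-row frame (the column names `a`, `b`, `c` are purely illustrative). It also contrasts `how = 'all'` with `how = 'any'`, which would be far too aggressive for our table.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, np.nan],  # an entirely empty column
        "c": ["x", np.nan, np.nan],
    }
)

# how='all' drops a row/column only if every value in it is NaN
cleaned = toy.dropna(axis=0, how="all").dropna(axis=1, how="all")
print(cleaned.shape)        # (2, 2)
print(list(cleaned.index))  # [0, 2]  (row labels are not renumbered)

# how='any' drops a row as soon as it has a single NaN
print(len(toy.dropna(axis=0, how="any")))  # 0
```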

Now, you can <font color = "red"> reset the row indices with a list of integers. </font>

After resetting the row indices, print out the
first 15 rows using the slicing method:

In [14]:
df.index = range(len(df.index))
df[:15]

Unnamed: 0,TABLE 2. NUTRITION,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,Countries and areas,Low birthweight (%),,Early initiation of breastfeeding (%),,Exclusive breastfeeding\n<6 months (%),,"Introduction of solid, semi-solid or soft food...",,Breastfeeding at age 2 (%),...,Stunting (%),,Wasting (%),,Overweight (%),,"Vitamin A supplementation, full coverageΔ (%)",,Iodized salt consumption (%),
1,,,,,,,,,,,...,moderate \nand severeθ,,moderate \nand severeθ,,moderate and severeθ,,,,,
2,,2008–2012*,,2008–2012*,,,,,,,...,,,,,,,2012,,2008–2012*,
3,FRENCH HEADINGS,Insuffisance\npondérale à la\nnaissance\n(%)\n,,Initiation\nprécoce de\nl’allaitement\n(%),,Allaitement\nexclusivement\nau sein <6 mois\n(%),,"Introduction\nd’aliments\nsolides, semisolides...",,Nourris au sein\nà l’âge de\n2 ans (%),...,Retard de\ncroissance (%)\n,,Émaciation (%)\n,,Surpoids (%)\n,,Couverture\ntotale par la\nsupplémentation en\...,,Consommation\nde sel iodé (%),
4,,,,,,,,,,,...,modérée et graveθ,,modérée et graveθ,,modérée et graveθ,,,,,
5,,2008–2012*,,2008−2012*,,,,,,,...,,,,,,,2012,,2008–2012*,
6,SPANISH HEADINGS,Bajo peso al nacer (%),,Iniciación temprana a la lactancia materna (%),,Lactancia materna exclusiva\n<6 meses (%),,"Incorporación de alimentos sóli- dos, semisóli...",,Lactancia materna a los 2 anos (%),...,Cortedad de\ntalla (%),,Emaciación (%),,Sobrepeso (%),,Suplementos de vitamina A cobertura completa Δ...,,Consumo de sal yodada (%),
7,,,,,,,,,,,...,moderada y graveθ,,moderada y graveθ,,moderada y graveθ,,,,,
8,,2008–2012*,,2008−2012*,,,,,,,...,,,,,,,2012,,2008–2012*,
9,Afghanistan,–,,–,,–,,29,x,54,...,59,x,9,x,4.6,x,–,,20.4,



You will find that the data starts from row index 9.
The first 9 rows contain column heads in three different languages.
As mentioned above, to keep the script simple, we do not extract the column heads here;
instead, you need to delete them.

Similarly, if you print out the last 40 rows, you will see that the data we want ends at row 205.
Therefore, you need to <font color = "red">delete the first 9 rows and the
last 39 rows, and then reindex all the rows left.</font>

In [15]:
df[-40:]

Unnamed: 0,TABLE 2. NUTRITION,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
205,Zimbabwe,11,,65.2,,31.4,,86,,19.5,...,32.0,,3.0,,5.5,,61,,94,y
206,SUMMARY INDICATORS#,,,,,,,,,,...,,,,,,,,,,
207,Sub-Saharan Africa,13.052633,,45.297565,,36.056241,,56.498691,,50.482704,...,37.97,,9.05,,5.54,,68.456849,,53.993565,
208,Eastern and Southern Africa,11.254036,,59.847888,,51.540234,,72.309396,,60.837445,...,38.82,,7.26,,4.92,,55.955394,,60.509784,
209,West and Central Africa,14.170071,,34.508419,,24.691961,,45.081478,,43.714518,...,36.91,,10.6,,5.9,,77.446049,,52.612735,
210,Middle East and North Africa,–,,–,,–,,–,,–,...,18.17,,7.74,,10.97,,–,,–,
211,South Asia,27.75447,,41,,48.957027,,57,,78.168149,...,37.69,,16.04,,3.59,,69.131567,,71.165538,
212,East Asia and the Pacific,5.525951,,41.345298,,30.164535,,50.562937,,45.213429,...,11.84,,3.52,,5.29,,81,**,90.917443,
213,Latin America and the Caribbean,8.861783,,48.815863,,39.053019,,–,,–,...,11.23,,1.39,,7.46,,–,,–,
214,CEE/CIS,–,,–,,–,,–,,–,...,11.25,,1.49,,15.11,,–,,–,


In [None]:
# Delete the first 9 rows
df = df.drop(df.index[0:9])
# Delete the last 39 rows
df = df.drop(df.index[-39:])

In [17]:
# Reindex rows
df.index = range(len(df.index)) # df.reset_index(drop=True)
df

Unnamed: 0,TABLE 2. NUTRITION,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,Afghanistan,–,,–,,–,,29,x,54,...,59,x,9,x,4.6,x,–,,20.4,
1,Albania,3.6,,42.9,,38.6,,78.278804,,31,...,19,,9,,21.7,,–,,75.6,
2,Algeria,6,x,49.5,x,7,x,39,"x,y",22.2,...,15,x,4,x,12.9,x,–,,60.7,x
3,Andorra,–,,–,,–,,–,,–,...,–,,–,,–,,–,,–,
4,Angola,12,x,54.9,x,11,x,77.2,x,37,...,29,x,8.2,x,–,,44,,44.7,x
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
192,Venezuela (Bolivarian Republic of),8,,–,,–,,–,,–,...,15.6,x,5,x,6.1,x,–,,–,
193,Viet Nam,5.1,,39.7,,17,,50.4,,19.4,...,22.7,,4.1,,4.4,,98,w,45.1,
194,Yemen,–,,30,x,12,x,76.3,"x,y",–,...,57.7,x,15.2,x,5,x,11,,29.5,x
195,Zambia,11,x,56.5,x,61,x,94.341801,x,41.7,...,45.4,x,5.2,x,7.9,x,–,,77.4,x
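
As an alternative sketch, the trimming and reindexing can be combined into a single positional slice with `iloc` followed by `reset_index(drop=True)`. The toy frame and the names `n_head` and `n_tail` below are illustrative only; for our table they would be 9 and 39.

```python
import pandas as pd

# Toy frame: 2 heading rows, 3 data rows, 2 trailing note rows
toy = pd.DataFrame(
    {"label": ["head1", "head2", "A", "B", "C", "note1", "note2"]}
)

n_head, n_tail = 2, 2

# Keep only the rows between the headings and the notes, then renumber from 0
trimmed = toy.iloc[n_head:len(toy) - n_tail].reset_index(drop=True)

print(trimmed["label"].tolist())  # ['A', 'B', 'C']
print(list(trimmed.index))        # [0, 1, 2]
```

Writing the upper bound as `len(toy) - n_tail` rather than `-n_tail` keeps the slice correct even when `n_tail` is 0.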


####  **Task 2: Set country index**

So far you have extracted all the records (or rows) for the 197 countries and areas in the Excel file (245 − 9 − 39 = 197).
Let's set the country names as row indices, and reset the column labels.

In [22]:
# Set country names as row index
df2 = df.set_index(df['TABLE 2. NUTRITION'].values)

# Delete now redundant column
df2 = df2.drop('TABLE 2. NUTRITION', axis=1)

# Reindex column names
df2.columns = list(range(len(df2.columns))) # overwrites column labels with consecutive integers 0, 1, 2, ...
df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
Afghanistan,–,,–,,–,,29,x,54,x,...,59,x,9,x,4.6,x,–,,20.4,
Albania,3.6,,42.9,,38.6,,78.278804,,31,,...,19,,9,,21.7,,–,,75.6,
Algeria,6,x,49.5,x,7,x,39,"x,y",22.2,x,...,15,x,4,x,12.9,x,–,,60.7,x
Andorra,–,,–,,–,,–,,–,,...,–,,–,,–,,–,,–,
Angola,12,x,54.9,x,11,x,77.2,x,37,x,...,29,x,8.2,x,–,,44,,44.7,x
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela (Bolivarian Republic of),8,,–,,–,,–,,–,,...,15.6,x,5,x,6.1,x,–,,–,
Viet Nam,5.1,,39.7,,17,,50.4,,19.4,,...,22.7,,4.1,,4.4,,98,w,45.1,
Yemen,–,,30,x,12,x,76.3,"x,y",–,,...,57.7,x,15.2,x,5,x,11,,29.5,x
Zambia,11,x,56.5,x,61,x,94.341801,x,41.7,x,...,45.4,x,5.2,x,7.9,x,–,,77.4,x


#### **Task 3: Tidy up all columns**

However, those records are still messy.
As you can see in the printout, there are a lot of NaNs,
and cell values with both numbers and letters (e.g., "6 x", "39 x,y") are spread over two columns.
Therefore, <font color = "red">we need to merge every two columns together</font>.

How can you do that?

Let us have a look at the first 10 rows and 2 columns.

In [23]:
df2.iloc[:10, :2]

Unnamed: 0,0,1
Afghanistan,–,
Albania,3.6,
Algeria,6,x
Andorra,–,
Angola,12,x
Antigua and Barbuda,5,x
Argentina,7.2,
Armenia,8,
Australia,7,x
Austria,7,x


A close look at the printout reveals the following patterns:
* If the original cell contains only a float or '–', the corresponding cell in the odd-numbered column is `NaN`.
See the rows labeled "**Afghanistan**", "**Albania**", etc.
* If the original cell contains a float and a couple of letters, the cell in the even-numbered column contains the float, and the one in the odd-numbered column contains the letters.
See the rows labeled "**Algeria**", "**Angola**", etc.

Assume that you are going to merge the two cells containing a float and letters respectively.
You need a `for` loop iterating over either the odd- or the even-numbered columns.
Within this `for` loop, another `for` loop is needed to iterate over rows.
For each row, you can check whether the cell in the odd-numbered column contains `NaN`.
If it does not, you can merge it with the cell in the corresponding even-numbered column on its left.

In [24]:
# Iterate over the odd-numbered columns, which hold the footnote letters
for col_idx in range(1, len(df2.columns), 2):
    for row_idx in range(len(df2)):
        # If the footnote cell is not NaN, append it to the value in the column on its left.
        # iloc addresses cells by position and avoids chained assignment.
        if not pd.isnull(df2.iloc[row_idx, col_idx]):
            df2.iloc[row_idx, col_idx - 1] = str(df2.iloc[row_idx, col_idx - 1]) + ' ' + str(df2.iloc[row_idx, col_idx])
df2.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
Afghanistan,–,,–,,–,,29 x,x,54 x,x,...,59 x,x,9 x,x,4.6 x,x,–,,20.4,
Albania,3.6,,42.9,,38.6,,78.278804,,31,,...,19,,9,,21.7,,–,,75.6,
Algeria,6 x,x,49.5 x,x,7 x,x,"39 x,y","x,y",22.2 x,x,...,15 x,x,4 x,x,12.9 x,x,–,,60.7 x,x
Andorra,–,,–,,–,,–,,–,,...,–,,–,,–,,–,,–,
Angola,12 x,x,54.9 x,x,11 x,x,77.2 x,x,37 x,x,...,29 x,x,8.2 x,x,–,,44,,44.7 x,x
Antigua and Barbuda,5 x,x,–,,–,,–,,–,,...,–,,–,,–,,–,,–,
Argentina,7.2,,–,,54,,–,,28 x,x,...,8.2 x,x,1.2 x,x,9.9 x,x,–,,–,
Armenia,8,,35.7,,34.6,,75,,22.8,,...,19.3,,4,,15.3,,–,,97 x,x
Australia,7 x,x,–,,–,,–,,–,,...,–,,–,,–,,–,,–,
Austria,7 x,x,–,,–,,–,,–,,...,–,,–,,–,,–,,–,
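
The nested loops above touch one cell at a time. As a hedged alternative, the same merge can be written column-wise with `Series.where`. The sketch below runs on a small made-up frame that mimics the value/footnote layout. Note one difference from the loop version: every value is converted to a string, including those without a footnote.

```python
import numpy as np
import pandas as pd

# Toy frame: even columns hold values, odd columns hold footnote letters
toy = pd.DataFrame(
    [
        ["–", np.nan, 29, "x"],
        [3.6, np.nan, "–", np.nan],
        [6, "x", 39, "x,y"],
    ],
    index=["Afghanistan", "Albania", "Algeria"],
)

merged = pd.DataFrame(index=toy.index)
for col_idx in range(0, len(toy.columns), 2):
    values = toy.iloc[:, col_idx].astype(str)
    notes = toy.iloc[:, col_idx + 1]
    # Keep the value where the footnote is NaN, otherwise append the footnote
    merged[col_idx // 2] = values.where(
        notes.isna(), values + " " + notes.astype(str)
    )

print(merged)
```

Because the result is built in a new frame, the footnote columns never need to be dropped afterwards.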


The next step is to remove the odd-numbered columns in the data frame, as they are redundant now.
To do this, you can use DataFrame's `drop()` function again, passing the labels of the odd-numbered columns with `axis=1`.
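
A minimal sketch of this step, shown on a made-up two-row frame rather than on `df2` itself: build the list of odd column labels, drop them, and relabel the remaining columns.

```python
import numpy as np
import pandas as pd

# Toy frame after merging: even columns hold merged values,
# odd columns are the leftover footnote letters
toy = pd.DataFrame(
    [["29 x", "x", "54 x", "x"], ["3.6", np.nan, "42.9", np.nan]],
    index=["Afghanistan", "Albania"],
)

odd_labels = list(range(1, len(toy.columns), 2))  # [1, 3]
toy = toy.drop(odd_labels, axis=1)

# Relabel the remaining columns as 0, 1, ...
toy.columns = range(len(toy.columns))

print(toy.shape)                        # (2, 2)
print(toy.loc["Afghanistan"].tolist())  # ['29 x', '54 x']
```

Applied to `df2`, which has 24 columns, the same pattern would leave 12 columns.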

Now the data is in pretty good shape aside from the column heads.
You can extract the column heads from the Excel file either manually or programmatically.
Here, you can do it manually. Since you need to save the results in a CSV file, you can use the long names from the raw data.
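
Setting column heads manually amounts to assigning a list of names to `DataFrame.columns`. The sketch below uses a made-up two-column frame and only the first two English headings visible in the raw table. For the full table the list would need one name per remaining column.

```python
import pandas as pd

toy = pd.DataFrame(
    [["–", "–"], ["3.6", "42.9"]],
    index=["Afghanistan", "Albania"],
)

# Long names copied by hand from the English heading row of the raw table;
# the list length must equal the number of columns
toy.columns = ["Low birthweight (%)", "Early initiation of breastfeeding (%)"]

print(toy.loc["Albania", "Low birthweight (%)"])  # 3.6
```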



Finally, you have extracted the data table from our Excel file, and put it into a Pandas DataFrame.
The DataFrame has 197 rows and 12 columns, where rows correspond to records for individual countries
and columns are variables (or attributes).
Our last step is to save the data table in a CSV file.

What problem can arise when saving? Let's check the type of some values in the DataFrame using
```
    type(df.iloc[i,j])
```
where i indicates row index, and j indicates column index.
You will find that Pandas' `read_excel` function has parsed all strings and special characters,
like '–', into Unicode strings.
If you print the DataFrame, you will see the rendered characters.
In contrast, evaluating a value at a specific location, for example,
```python
    df.iloc[0,0]
```
gives you the escaped Unicode code point on Python 2 (Python 3 displays the character itself, `'–'`),
```
    u'\u2013'
```

The default encoding in `DataFrame.to_csv()` is 'ascii' on Python 2 and 'utf-8' on Python 3.

Therefore, you need to <font color = "red"> specify the encoding method </font> while saving
the DataFrame into a CSV file.
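
A minimal sketch of saving with an explicit encoding, using a made-up one-column frame and a temporary directory so that it does not touch your files. The file name `nutrition_sample.csv` is hypothetical. Reading the file back confirms that the en dash survives the round trip.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"Low birthweight (%)": ["–", "3.6", "6 x"]},
    index=["Afghanistan", "Albania", "Algeria"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "nutrition_sample.csv")

    # State the encoding explicitly instead of relying on the default
    toy.to_csv(csv_path, encoding="utf-8")

    # Read the file back with the same encoding; the first column is the index
    restored = pd.read_csv(csv_path, index_col=0, encoding="utf-8")

print(restored.iloc[0, 0] == "\u2013")  # True
```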

## 2. Parsing CSV & JSON Files

Due to advances in technologies for data storage, data from various sources is often stored in different formats
and file types.
Some data formats store data in a way that can be easily handled by a machine, such as CSV, JSON, and XML.
Those formats are usually referred to as machine-readable formats.
In contrast, some other data formats or file types store data in a way meant to be read by a human
using front-end desktop tools.
Those formats or file types are often referred to as hard-to-parse formats.
We will use a series of examples to demonstrate how to extract data stored in
both machine-readable and hard-to-parse formats,
and then store the extracted data in formats that can be easily adopted by downstream data wrangling tasks.
This chapter will cover how to read the common machine-readable formats:
* **CSV**: Comma Separated Values
* **JSON**: JavaScript Object Notation

In most cases, these two formats, together with XML, are the best available resource when you are scraping data from
the web or requesting data directly from an organization or agency.
They are more easily used and ingested by programming languages, like Python.
Our suggestion is that you should try your best to get data in these formats before you start looking
into other formats that might be hard to parse, like PDFs.

There are many ways of reading and storing data in those formats,
depending on the programming language you use.
Here we are going to focus on Python.
Searching the Internet, you will find there are a lot of online tutorials on handling data stored in different
data formats with Python.
We suggest the following:
* "*Data Loading, Storage, and File Formats*", Chapter 6 of "**Python for Data Analysis**": This chapter covers reading files in a variety of formats, loading data from databases and interacting with Internet via APIs. Please read pages 155-166, and download and run the Python scripts from [the author's github site](https://github.com/pydata/pydata-book). 📖

The dataset used in this chapter was downloaded from
[data.gov.au](https://data.melbourne.vic.gov.au/explore/dataset/melbourne-bike-share-station-readings-2011-2017/information/).
It is available in the following formats: CSV, JSON, XML, RDF, etc.
The first two formats are used, i.e., the following two files
* Melbourne_bike_share.csv
* Melbourne_bike_share.json

In the following sections, you will learn how to scrape data from the two
example files, and store the extracted data into Pandas DataFrame.

### 2.1 Example scenario
Assume that you are going to analyze and predict bicycle hubway station status to answer the following questions:
* What do usage patterns look like with respect to specific stations and how that translates to imbalances in the system?
* Can we integrate these explanatory variables and these usage patterns into a predictive algorithm that would predict empty and full stations in the near future?
* What form should that algorithm take?
* How do environmental variables affect the future state of Hubway stations?

See <a href="http://cs109hubway.github.io/classp/"><font color="red">Predicting Hubway Stations status in Boston</font></a> for more discussion.

The first step is to acquire the hub station data as well as the weather data. Here, for demonstration purposes, we use the Melbourne bike share data published by the government. The files have been downloaded and are provided along with this notebook.

* * *

### 2.2 Parsing CSV file
A CSV is a Comma Separated Values file, which allows data to be saved in a tabular format.
Each row of the file is a data record; each column is a field (or an attribute).
Each data record consists of one or more fields, separated by commas.
As one of the most popular file formats,
it is supported by almost all spreadsheet programs, such as
Microsoft Excel, Open Office Calc, and Google Spreadsheets.
Because of its simplicity,
it differs from other spreadsheet file types, such as Excel, in that one can only store a single sheet in a file.
It cannot be used to store cell, column or row styling, figures or formulas.
To make our CSV file, i.e., Melbourne_bike_share.csv, easier to view here,
a sample of the data with trimmed down records is shown below.
You should see something similar to this when you open the CSV file in your text editor:
![csv1.png](./csv1.png)

Note that tabs can also be used to separate values of different fields.
This type of file is usually called TSV, Tab Separated Values.
Sometimes TSVs get classified as CSVs.
The only difference between CSVs and TSVs is the delimiter.
Essentially, the two types of files behave the same in Python and most other
programming languages.
It is worth mentioning that both are plain text files in which the fields are
separated by a delimiter.
This section will show you how to use Pandas
[read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to
load our CSV file, and how to tidy the loaded data a bit.
Before we start importing our CSV file, it might be good for you to read [Pandas tutorial
on reading CSV files](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) 📖.
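
To illustrate that only the delimiter differs, the sketch below parses the same made-up records once as CSV and once as TSV, using in-memory text instead of files. The column names `ID` and `NAME` and the station names are illustrative assumptions, not taken from the bike share file.

```python
import io

import pandas as pd

csv_text = "ID,NAME\n1,Station A\n2,Station B\n"
tsv_text = csv_text.replace(",", "\t")

# Same function, different delimiter
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv.equals(from_tsv))  # True
```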

### 2.3 Alternative approach to inspect your data

In [None]:
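with open("/content/drive/Shareddrives/FIT5196_S2_2025/week4/Melbourne_bike_share.csv", 'r') as f:
    # print the first 10 lines; strip the trailing newline to avoid blank lines in the output
    for line in f.readlines()[:10]:
        print(line.rstrip('\n'))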

In [None]:
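with open("/content/drive/Shareddrives/FIT5196_S2_2025/week4/Melbourne_bike_share.csv", 'r') as f:
    # print the last 10 lines; strip the trailing newline to avoid blank lines in the output
    for line in f.readlines()[-10:]:
        print(line.rstrip('\n'))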
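
`readlines()` loads the whole file into memory, which is fine for small files. For larger ones, a possible alternative is to read lazily with `itertools.islice` for the head and `collections.deque` for the tail. The helper `head_and_tail` and the throwaway sample file below are hypothetical and only serve as a sketch.

```python
import os
import tempfile
from collections import deque
from itertools import islice


def head_and_tail(path, n=10, encoding="utf-8"):
    """Return the first and last n lines of a text file
    without holding the whole file in memory."""
    with open(path, "r", encoding=encoding) as f:
        first = [line.rstrip("\n") for line in islice(f, n)]
    with open(path, "r", encoding=encoding) as f:
        # deque with maxlen keeps only the most recent n lines
        last = [line.rstrip("\n") for line in deque(f, maxlen=n)]
    return first, last


# Demonstration on a small throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("ID,NAME\n" + "".join(f"{i},Station {i}\n" for i in range(1, 6)))
    first, last = head_and_tail(sample_path, n=2)

print(first)  # ['ID,NAME', '1,Station 1']
print(last)   # ['4,Station 4', '5,Station 5']
```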

### 2.4 Importing CSV data
Importing CSV files with Pandas <a href='http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html'><font color = "blue">read_csv()</font></a> function and converting the data into a form Python can understand
is simple.
It only takes a couple of lines of code.
The imported data will be stored in Pandas DataFrame.

In [None]:
import pandas as pd
csvdf = pd.read_csv("/content/drive/Shareddrives/FIT5196_S2_2025/week4/Melbourne_bike_share.csv")
type(csvdf)

Or you can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html"><font color='blue'>read_table()</font></a> function

In [None]:
csvdf_1 = pd.read_table("/content/drive/Shareddrives/FIT5196_S2_2025/week4/Melbourne_bike_share.csv", sep=",")
type(csvdf_1)

Now, the data should be loaded into Python.
Let's have a look at the first 5 records in the dataset.
There are a couple of ways to retrieve these records.
For example, you can use
* <font color='blue'>csvdf.head(n = 5)</font>: It will return first `n` rows in a DataFrame, n = 5 by default.
* <font color='blue'>csvdf[:5]</font>: It uses the slicing method to retrieve the first 5 rows

Refer to "[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html)"
for how to slice, dice, and generally get and set subsets of pandas objects.
Here, we use the `head` function.

In [None]:
csvdf.head()
#csvdf.loc[:4]
#csvdf[:5]

In [None]:
csvdf.tail()

Currently, the row indices are integers automatically generated by Pandas.
Suppose you want to set IDs as row indices and delete the ID column.
Resetting the row indices can be easily done with the following DataFrame function
```python
    DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```
See its [API webpage](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)
for the detailed usage.
The keys are going to be the IDs in the first column.
By setting `inplace = True`, the change is made in place and no new DataFrame object is returned.

In [None]:
csvdf.set_index(csvdf.ID, inplace = True)
csvdf.head()

To remove the ID column that is now redundant, use the DataFrame `drop` function with `axis=1` and `inplace = True`:
```python
    DataFrame.drop(labels, axis=0, level=None, inplace=False, errors='raise')
```

In [None]:
csvdf.drop('ID', axis=1, inplace = True)
csvdf.head()
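
The two steps above can also be collapsed into one call. As a minimal sketch on a made-up three-row table (the station names and values are invented for illustration and are not taken from the bike-share file), passing the column *label* to `set_index` moves the column into the index and, because `drop=True` is the default, removes it from the columns in the same step:

```python
import pandas as pd

# Hypothetical toy data standing in for the bike-share table.
toy = pd.DataFrame({
    "ID": [2, 4, 6],
    "FeatureName": ["Station A", "Station B", "Station C"],
    "NBBikes": [8, 11, 3],
})

# Passing the label "ID" (not the Series toy.ID) lets drop=True take effect.
indexed = toy.set_index("ID")

print(indexed.index.tolist())    # the former ID values
print(indexed.columns.tolist())  # "ID" is no longer a column
```

By contrast, `set_index(csvdf.ID)` passes a Series rather than a label, so pandas has no column to drop, which is why the separate `drop` call was needed.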

Instead of using the above method of setting row indices to IDs, you can specify which column to
be used as row indices while reading the CSV file. See the API reference page for
[pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).
To do so, you can use the <font color='blue'>index_col</font> argument of <font color='blue'>read_csv()</font>.

In [None]:
csvdf = pd.read_csv("/content/drive/Shareddrives/FIT5196_S2_2025/week4/Melbourne_bike_share.csv", index_col = "ID")
csvdf.head()

Similarly, with the <font color='blue'>read_table()</font> function, you can also set the value of <font color='blue'> index_col</font> to "ID".

### 2.5 Manipulating the Data

So far, you have learned a little bit about the Melbourne_bike_share data.
Let's further process the data by splitting the coordinates into latitude and longitude.
First figure out what type of data we're dealing with, i.e., the data type of the "Coordinates" column.

In [None]:
type(csvdf['Coordinates'])
# type(csvdf.Coordinates)

In [None]:
print(csvdf['Coordinates'].iloc[0])
type(csvdf['Coordinates'].iloc[0])

Those coordinates are indeed strings. Thus, to extract both latitude and longitude, you
can either use regular expressions introduced in the previous chapter or common string operations.

To use regular expressions, the key is figuring out the patterns of characters. Then
according to those patterns, you formulate your regular expressions.
Looking at the first couple of coordinates in the Series object, i.e.:
```
    (-37.814022, 144.939521)
    (-37.817523, 144.967814)
    (-37.84782, 144.948196)
```
You will find that latitudes are always negative real values, and longitudes are positive real values.
That is because Australia lies between latitudes 9° and 44°S, and longitudes 112° and 154°E.
The regular expression is
```
    r"-?\d+\.?\d*"
```
![](./regex1.jpg)
It contains four parts
* "-?": optionally matches a single '-'.
* "\d+": matches one or more digits.
* "\\.?": optionally matches a single dot.
* "\d*": matches zero or more digits.

The following code extracts all real values matching this regular expression.
The <font color="blue">re.findall()</font> returns all matched values in a Python list.

In [None]:
import re
str1 = csvdf['Coordinates'].iloc[0] # csvdf.Coordinates
re.findall(r"-?\d+\.?\d*", str1)
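
`re.findall()` returns the matches as strings. As a small follow-up sketch, the hypothetical helper `parse_coordinate` below (not part of the original notebook) wraps the same regular expression and converts the two matches to floats; the sample string is copied from the listing above.

```python
import re

COORD_NUMBER = re.compile(r"-?\d+\.?\d*")

def parse_coordinate(text):
    """Return (latitude, longitude) as floats from a string like '(-37.8, 144.9)'."""
    # Unpacking raises ValueError unless exactly two numbers are matched.
    lat, lon = COORD_NUMBER.findall(text)
    return float(lat), float(lon)

print(parse_coordinate("(-37.814022, 144.939521)"))
```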

Using common string operations might be simpler than using regular expressions.
<font color="blue">str.split()</font> is the function used here to extract both latitudes and longitudes.
However, you should choose a proper delimiter to split a string.
First, split the string by ', ' (a comma followed by a space):

In [None]:
s = csvdf['Coordinates'].iloc[1].split(', ') # assuming they're all '(x, y)'
print ('lat = ', s[0], ' long = ', s[1])

The printout shows that the latitude contains '(', and the longitude contains ')'.
You should consider removing both the left and the right parentheses.
Of course, the `split` function can be used again.
Note that the goal here is to remove the leading and trailing parentheses.
Python's `str` class provides two methods for these two operations,
which are:
* <font color="blue">str.lstrip()</font>: returns a copy of the string with leading characters removed
* <font color="blue">str.rstrip()</font>: returns a copy of the string with trailing characters removed.

Let's try the two functions.

In [None]:
print(s[0].lstrip('('))
print(s[1].rstrip(')'))
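
Since `str.strip()` accepts a set of characters to remove from both ends, the two calls can be merged into one if the parentheses are stripped before splitting. A small sketch on a sample string copied from the coordinates listed earlier:

```python
coord = "(-37.817523, 144.967814)"

# strip("()") removes any leading/trailing '(' or ')' characters in one pass.
lat, lon = coord.strip("()").split(", ")

print(lat, lon)
```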

The latitude and longitude of a single coordinate have been successfully extracted.
Next, we are going to apply the extracting process to every coordinate in the DataFrame.
There are multiple ways of doing that.
The most straightforward way is to write a FOR loop to iterate over all the coordinates,
and apply the above scripts to each individual coordinate.
Two Pandas Series can be then used to store latitudes and longitudes.
However, we are going to show you how to use some advanced Python programming functionality.

Pandas Series class implements an [`apply()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) method that applies a given function
to all values in a Series object, and returns a new one.
Please note that the given function must work on single values.
To apply <font color="blue">str.split()</font> to every coordinate and
get latitudes and longitudes, you can use the following two lines of code:

In [None]:
csvdf['lat'] = csvdf['Coordinates'].apply(lambda x: x.split(', ')[0])
csvdf['lon'] = csvdf['Coordinates'].apply(lambda x: x.split(', ')[1])
csvdf.head()

The first line extracts all the latitudes and stores them in a column in our DataFrame.
The second line extracts all the longitudes.
You might wonder what "lambda" is in the code.
It is a Python keyword used to construct small anonymous functions at runtime. (See [Section 4.7.5. Lambda Expressions](https://docs.python.org/2/tutorial/controlflow.html) 📖 )
You can use a similar approach to remove the leading and trailing parentheses.

In [None]:
csvdf['lat'] = csvdf['lat'].apply(lambda x: x.lstrip('('))
csvdf['lon'] = csvdf['lon'].apply(lambda x: x.rstrip(')'))
csvdf.drop('Coordinates', axis=1, inplace = True)
csvdf.head()
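
As an alternative to `apply` with a lambda, pandas string methods under the `.str` accessor operate on the whole column at once. The following is a sketch on a hypothetical three-row frame (values copied from the sample coordinates shown earlier), not the method used in the rest of this tutorial:

```python
import pandas as pd

toy = pd.DataFrame({"Coordinates": [
    "(-37.814022, 144.939521)",
    "(-37.817523, 144.967814)",
    "(-37.84782, 144.948196)",
]})

# Strip the parentheses, then split each string into two new columns.
parts = toy["Coordinates"].str.strip("()").str.split(", ", expand=True)
toy["lat"] = parts[0]
toy["lon"] = parts[1]

print(toy[["lat", "lon"]])
```

The resulting values are still strings, so the type conversion step is needed either way.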

So far, we have split the "Coordinates" column into two columns, i.e., "lat" and 'lon' in the DataFrame,
and dumped the "Coordinates" column.
The last step is to infer better types for the object columns.
All the numerical values and dates are encoded as strings in the current DataFrame.
We would like to convert those values to types that they are supposed to have.

In [None]:
csvdf.dtypes

In [None]:
csvdf = csvdf.apply(pd.to_numeric, errors='ignore')
csvdf.dtypes
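
Note that recent pandas releases deprecate `errors='ignore'` in `pd.to_numeric` and recommend catching the exception explicitly instead. A minimal sketch of that approach, using a hypothetical helper `to_numeric_if_possible` and an invented toy frame:

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Convert a column to a numeric dtype, or return it unchanged if that fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column  # e.g. names or date strings stay as they are

# Hypothetical data: every value starts out as a string.
toy = pd.DataFrame({
    "FeatureName": ["Station A", "Station B"],
    "NBBikes": ["8", "11"],
    "lat": ["-37.814022", "-37.817523"],
})

converted = toy.apply(to_numeric_if_possible)
print(converted.dtypes)
```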

However, dates are still strings, which means `pd.to_numeric` cannot convert date strings to datetime
objects.
Here you need to force them to be converted to datetime objects with [`pd.to_datetime`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html).

In [None]:
csvdf['UploadDate'] = pd.to_datetime(csvdf['UploadDate'])
print (csvdf.dtypes)
csvdf

Finally, you have loaded the given CSV file into Python with Pandas.
You have also tidied the data a bit by getting latitudes and longitudes out
from the strings.

Besides `read_csv`, there are other parsing functions in pandas for
reading tabular data as a DataFrame object. They include
* [`read_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html): Reads general delimited file into DataFrame. The default delimiter is '\t'.
* [`read_fwf`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html): Reads a table of fixed-width formatted lines into DataFrame.
* [`read_clipboard`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html): Reads text from clipboard and passes to read_table. See read_table for the full argument list.
* * *

### 2.6 Parsing JSON files

JSON (JavaScript Object Notation) is one of the most commonly used formats
for transferring data between web services and other applications via HTTP requests.
Nowadays, many sites have JSON-enabled APIs and
JSON is quickly becoming the encoding protocol of choice.
As a lightweight data-interchange format inspired by JavaScript,
it is clean, easy to read, and easy to parse.
Here is a simple example adapted from [Wikipedia page on JSON](https://en.wikipedia.org/wiki/JSON)
```
[
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
   }
}
]

```

From the above example, you will see that each data record looks like a [Python dictionary](https://docs.python.org/2/tutorial/datastructures.html#dictionaries).
A JSON file usually contains a list of dictionaries, which is delimited by '[' and ']'.
In each of those dictionaries,
there is a key-value pair for each field, and the key and value are separated by a colon.
Different key-value pairs are separated by commas.
Note that a value can also be a dictionary, see "address" in the example.
The basic JSON value types are object, array, string, number, boolean (`true`/`false`) and `null`.
If you would like to know more about JSON, please refer to
* [Introducing JSON](http://www.json.org/): the JSON org website gives a very good diagrammatic explanation
of JSON 📖.
* [Introduction to JSON](https://www.youtube.com/watch?v=WWa0cg_xMC8): a 15-minute YouTube video on JSON, recommended for visual learners.

(Of course, you can also go and find your own materials on JSON by searching the Internet.)
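
To make the correspondence with Python concrete, the record shown earlier in this section can be parsed with the standard `json` module. This sketch simply pastes that example into a string: JSON objects come back as `dict`, arrays as `list`, strings as `str`, and numbers as `int` or `float`.

```python
import json

text = """
[
  {
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
      "streetAddress": "21 2nd Street",
      "city": "New York",
      "state": "NY",
      "postalCode": "10021"
    }
  }
]
"""

records = json.loads(text)   # the outer JSON array
person = records[0]          # the single JSON object inside it

print(type(records).__name__)
print(type(person).__name__)
print(person["age"], person["address"]["city"])
```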

In the rest of this section, we will start from a simple example, walking through the steps of acquiring JSON data from the DBLP search API and normalizing those data into a flat table. Then, we revisit the dataset mentioned in the previous section (except that it is now in JSON format), parsing the data and storing them in a Pandas DataFrame object.
Before we start, it might be good for you to view one of the following tutorials on parsing JSON files:
* [Working with JSON data](https://www.linkedin.com/learning/learning-python-14393370/working-with-json-data?u=2046060): A Lynda tutorial on parsing JSON data. You need a Monash account to access this website.
[Here](http://resources.lib.monash.edu.au/eresources/lynda-guide.pdf) is the Lynda setup guide.
* A [Youtube video](https://www.youtube.com/watch?v=9Xt2e9x4xwQ ) on extracting data from JSON files (**optional**).

#### 2.6.1 Acquiring JSON Data From The Internet
This section starts by showing you how to acquire a small chunk of JSON data
from the Internet via HTTP requests and load it into Python with the `json` library.
The example we use is based on the [DBLP database Access API](https://dblp.org/faq/How+to+use+the+dblp+search+API.html), which is used to search for publications in the DBLP database.

The first step is to make an HTTP request to get the data from the DBLP search API.
Here we are going to use the `urllib.request` module, the Python 3 successor of the [`urllib2`](https://docs.python.org/2/library/urllib2.html) library.
It defines a set of functions and classes that help in opening URLs.

In [None]:
from urllib.request import urlopen, Request # for python 3
request = Request("https://dblp.org/search/publ/api?q=data%20wrangling&format=json")
response = urlopen(request)
elevations = response.read()
elevations.splitlines()

In the above code, we:
1. Import the `Request` class and the <font color="blue">urlopen()</font> function from the `urllib.request` module.
2. Create a URL Request object whose query string carries the search term (`q=data%20wrangling`) and the output format (`format=json`). Note that you can change the output format by replacing `format=json` with `format=xml`.
3. Open the URL, which returns a file-like object.
4. Read the data returned from the HTTP request.
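
Rather than typing the percent-encoded query by hand, the URL can be assembled with `urllib.parse`. The sketch below only builds the string and makes no network request; the parameter names `q` and `format` are the ones visible in the URL used in the code cell above.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://dblp.org/search/publ/api"

# quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+').
query = urlencode({"q": "data wrangling", "format": "json"}, quote_via=quote)
url = f"{BASE_URL}?{query}"

print(url)
```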

The returned data is actually stored in a `bytes` object (a byte string).
You can check it out using Python's built-in function `type`,
```python
    type(elevations)
```
What does the data look like?
Instead of printing the data as one single string, one can use
```python
    elevations.splitlines()
```
to print the data as a list of lines in the string, breaking
at line boundaries, i.e., '\n'.

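It is easy to dump the data into a JSON file, which just takes three lines of code.
Because `elevations` holds raw bytes, which `json.dump` cannot serialize, parse it first:
```python
    import json
    with open("elevations.json", "w") as outfile:
         json.dump(json.loads(elevations), outfile)
```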

To read the acquired JSON data, you can use the `json` module as follows:

In [None]:
import json
data = json.loads(elevations)
print(type(data))
data

It loads the data into a Python dictionary.
The data we want is stored under the `result` key,
whose value is itself a nested dictionary holding the search results.
See [JSON encoder and decoder](https://docs.python.org/2/library/json.html) for more on reading
JSON files.

As mentioned earlier in this section,
we will convert the JSON data into Pandas DataFrame.
Therefore, Pandas functions on reading JSON are to be used.
If you would like to know about those functions, you can read Pandas tutorial on [Reading JSON](http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader) (**optional**).
Let's first try the <font color="blue">read_json()</font> function.

In [None]:
# pandas reads JSON from a file-like object, so decode the bytes and wrap the text in StringIO
from io import StringIO
json_str = elevations.decode('utf-8')
df = pd.read_json(StringIO(json_str))
df

Unfortunately, the DataFrame returned by `read_json` is not the one we want.
You might wonder why.
There is a straightforward answer.
Let's try to build a DataFrame from `data` returned by
```
    data = json.loads(elevations)
```
What do you get?

In [None]:
pd.DataFrame(data)

You have got a DataFrame that is exactly the same as the one returned by `read_json`.
This is due to Pandas' way of constructing a DataFrame from a dictionary.
See [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
for constructing a DataFrame from a dictionary
and "Object Creation" in [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) 📖.
It is not hard to figure out that dictionary keys
are used as column
labels, and values of whatever data types are put as column values.

What we want is to flatten the JSON object into a flat table.
Fortunately, Pandas provides a JSON normalization function [(<font color="blue">json_normalize()</font>)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html)
that takes a dict or list of dicts and normalizes semi-structured data into a flat table.

In [None]:
#from pandas.io.json import json_normalize
from pandas import json_normalize
json_normalize(data['result'])

Eventually, the <font color="blue">json_normalize()</font> function returns the DataFrame we want.
However, flattening objects with embedded arrays/lists is less trivial.
See [Flattening JSON objects in Python](https://gist.github.com/amirziai/2808d06f59a38138fa2d)
for more information.
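
For the common case of a single embedded list, `json_normalize` itself can help through its `record_path` and `meta` arguments. The sketch below uses an invented nested record (the keys `hits`, `info`, and the paper titles are illustrative assumptions, not the actual API response):

```python
import pandas as pd

# Hypothetical nested record with an embedded list, for illustration only.
data = {
    "query": "data wrangling",
    "hits": [
        {"id": 1, "info": {"title": "Paper A", "year": "2019"}},
        {"id": 2, "info": {"title": "Paper B", "year": "2021"}},
    ],
}

# record_path picks the embedded list to expand into rows;
# meta copies the named top-level fields onto every row.
flat = pd.json_normalize(data, record_path="hits", meta=["query"])

print(flat.columns.tolist())
print(flat)
```

Each element of the embedded list becomes one row, and nested dictionaries inside an element are flattened into dotted column names such as `info.title`.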

#### 2.6.2 Parsing the "Melbourne_bike_share.json"  File

You have now learned how to use the `json` module and Pandas together to parse a simple JSON file.
In this section we will walk you through the process of extracting bike hub station statistical data from "Melbourne_bike_share.json", and then produce the same DataFrame as the one built from the CSV file in Section 2.5.

Remember that the first step is always to glance through the JSON file with your favorite editor.
Below are the first 20 lines from our JSON file.

<img src = "./json20.png" width = "700", hight = "800">

This JSON file is much more complex than the one used in the previous section.
It might take a bit of time to figure out that this file is a dictionary with
two large entries, one with key "meta", and another with key "data".
The "meta" entry contains all the meta information, including column names.
The "data" entry actually contains the data we want.
In the following subsection, we will show you how to extract records from the "data"
entry, while leaving the task of extracting column labels from the "meta" entry as an exercise.
Similarly, our JSON data can be read into Python as follows.

In [None]:
import json
with open("/content/drive/Shareddrives/FIT5196_S2_2025/week4/Melbourne_bike_share.json") as json_file:
    json_data = json.load(json_file)
print (type(json_data))
json_data['meta']['view']

The loaded JSON data has been saved in a Python dictionary with two entries, one for "data" and another for "meta".
Using `json_normalize`, you can flatten the "data" entry into a table and save it in a DataFrame.

In [None]:
df = pd.json_normalize(json_data,'data')
df.head()

We seem to have a lot of extra columns.
The data we want starts at column 8.
Therefore, dump all the irrelevant preceding columns.

In [None]:
df.drop(range(8), axis=1, inplace=True) # For python 3
df.head()

Rename all the columns with the field names given by the CSV file.
You can programmatically extract field names from the "meta" dictionary.
We will leave it for you to do as an exercise.
As with the CSV file, IDs are unique and can be set as row indices.

In [None]:
df.columns = ['id','featurename','terminalname','nbbikes','nbemptydoc','uploaddate','coordinates']
df.set_index(df.id, inplace= True)
df.drop('id', axis=1, inplace = True)
df.head()

What's in the last two columns?

Let's first convert the integers in the "uploaddate" column into standard datetime.
The following Python code converts
one of these integers into a standard datetime using Python's
[`datetime`](https://docs.python.org/2/library/datetime.html) module:
```python
    import datetime
    date = datetime.datetime.fromtimestamp(df.iloc[0,4])
    print(date)
```
The output is
```
    2016-01-28 23:45:05
```
Similar to the way of splitting coordinates in Section 2.5,
one can use `pandas.Series.apply` to invoke  `datetime.datetime.fromtimestamp`
on each individual integer in the column.
Please try this method by yourself.

Instead, we will show you a pandas-specific way of converting
timestamp values in seconds into standard datetime.
Here we use Pandas [`to_datetime`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
function.

In [None]:
df['uploaddate'] = pd.to_datetime(df['uploaddate'], unit='s')
df.head()
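
The `unit` argument matters because an epoch timestamp is just an integer: the same instant is written with three more digits when it is counted in milliseconds. A standard-library sketch with a made-up timestamp (not a value read from the file), pinned to UTC so the result does not depend on the local time zone:

```python
from datetime import datetime, timezone

ts_seconds = 1453985105            # hypothetical epoch timestamp, in seconds
ts_millis = ts_seconds * 1000      # the same instant, in milliseconds

from_seconds = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
from_millis = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)

print(from_seconds)
print(from_seconds == from_millis)
```

Without the `tz` argument, `fromtimestamp` converts to the machine's local time zone, whereas `pd.to_datetime(..., unit='s')` interprets the value as UTC.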

In [None]:
print(csvdf.iloc[0,4]) # the csv date
print(df.iloc[0,4])

The difference arises because the two files were downloaded one after another.
However, the time format is the same.

The last step is to extract latitudes and longitudes into two columns.
Each coordinate in the last column of the DataFrame is a Python list.
The second and the third entries are latitude and longitude respectively.
It is very easy to get the two entries into a list.
We will apply the following anonymous function to all the coordinates one after another
```python
    lambda col: col[i]
```
where i = 1 or 2. When i = 1, it returns latitudes; when i = 2, it returns longitudes.

In [None]:
df['lat'] = df['coordinates'].apply(lambda col: col[1]) # second entry: latitude
df['lon'] = df['coordinates'].apply(lambda col: col[2]) # third entry: longitude
df.head()

Now, dump the "coordinates" column and change the data type of each new column.

In [None]:
df.drop('coordinates', axis=1, inplace = True)
df['lat'] = pd.to_numeric(df['lat'], errors='coerce') # whole column at once; unparseable entries become NaN
df['lon'] = pd.to_numeric(df['lon'], errors='coerce')

print(df.dtypes)
df

## 3. Summary

Files in either CSV or JSON format are the easiest ones to preview, understand and parse.
In this chapter, you have learned how to pull data out of files stored in those two formats
using Pandas. You should now be familiar with these two formats.
