# Reading Data from files
###### **19.01.2022**

### Reading data from csv files

Let’s start by creating a toy example of a CSV file. You can do this directly in JupyterLab by opening a new text file, entering the data as comma-separated text, and saving it as c2_file.csv in the working directory.

Let’s now import the CSV file as a pandas DataFrame. To do this, we use the pandas function read_csv(), which reads the data from a CSV file and stores it as a DataFrame. Since we saved the file in our working directory, we can pass the file name directly.

In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv("c2_file.csv")
df

Unnamed: 0,a,b,c,d
0,yellow,10,2,3.2
1,green,2,3,8.1
2,blue,7,1,0.4
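
The text of the toy file is not reproduced above. Judging from the output, it could be recreated from Python as in the sketch below. The file contents are an assumption inferred from the displayed DataFrame; only the file name c2_file.csv comes from the original code.

```python
from pathlib import Path

import pandas as pd

# Assumed contents of the toy file, inferred from the DataFrame shown above.
csv_text = """a,b,c,d
yellow,10,2,3.2
green,2,3,8.1
blue,7,1,0.4
"""

# Write the file into the current working directory.
Path("c2_file.csv").write_text(csv_text, encoding="utf-8")

# Reading it back gives the columns a, b, c, d and three rows.
df = pd.read_csv("c2_file.csv")
print(df)
```

The first line of the file becomes the column labels, and pandas adds a default integer index on the left.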


In [3]:
# If the file contained no header row (that is, its first row already held data values), we would pass header=None so that pandas generates integer column labels
pd.read_csv("c2_file.csv", header=None)

Unnamed: 0,0,1,2,3
0,a,b,c,d
1,yellow,10,2,3.2
2,green,2,3,8.1
3,blue,7,1,0.4


In [4]:
# We can also specify the column labels ourselves as follows
pd.read_csv("c2_file.csv", names=["column 1", "column 2", "column 3", "column 4"])

Unnamed: 0,column 1,column 2,column 3,column 4
0,a,b,c,d
1,yellow,10,2,3.2
2,green,2,3,8.1
3,blue,7,1,0.4
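
Note that in the output above the original header row (a, b, c, d) was kept as a row of data. If the intent is to replace the existing header rather than add a new one on top, passing header=0 together with names should do that. The sketch below uses an in-memory copy of the toy data (an assumption made so that it runs on its own).

```python
from io import StringIO

import pandas as pd

# In-memory copy of the toy data, assumed to match c2_file.csv.
csv_text = "a,b,c,d\nyellow,10,2,3.2\ngreen,2,3,8.1\nblue,7,1,0.4\n"

new_names = ["column 1", "column 2", "column 3", "column 4"]

# header=0 marks the first line as a header row, which is then
# discarded in favour of the labels given in names.
renamed = pd.read_csv(StringIO(csv_text), names=new_names, header=0)
print(renamed)
```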


In [5]:
# We can specify a specific column to be the index by using the argument index_col and choosing the position of the column we want as an index.
pd.read_csv("c2_file.csv", index_col=0)

a,b,c,d
yellow,10,2,3.2
green,2,3,8.1
blue,7,1,0.4


In [6]:
# pandas automatically assigns each column the most general data type among the values encountered in that column; we can inspect the result with dtypes
df.dtypes

a     object
b      int64
c      int64
d    float64
dtype: object

In [7]:
# We can also force the data types of the columns to whatever we like when creating the DataFrame
df2 = pd.read_csv("c2_file.csv", dtype={"b": np.float64})  # the requested type must be compatible with the values in the column
df2.dtypes

a     object
b    float64
c      int64
d    float64
dtype: object
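
To illustrate what a compatible type means here, the following sketch requests a float type for a column of integers (which works) and for a column of colour names (which fails). It uses an in-memory copy of the toy data, which is an assumption, and relies on pandas raising a ValueError when the conversion is impossible.

```python
from io import StringIO

import numpy as np
import pandas as pd

# In-memory copy of the toy data, assumed to match c2_file.csv.
csv_text = "a,b,c,d\nyellow,10,2,3.2\ngreen,2,3,8.1\nblue,7,1,0.4\n"

# Integers convert cleanly to floats, so this request succeeds.
ok = pd.read_csv(StringIO(csv_text), dtype={"b": np.float64})
print(ok.dtypes)

# Colour names cannot be parsed as floats, so the request fails.
try:
    pd.read_csv(StringIO(csv_text), dtype={"a": np.float64})
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("Conversion failed:", err)
```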

In [8]:
# Load only part of the data: usecols selects the columns to read
pd.read_csv("c2_file.csv", usecols=["a", "b"])

Unnamed: 0,a,b
0,yellow,10
1,green,2
2,blue,7


### Reading data from Excel files

We can import data from an Excel file using the pandas function pd.read_excel(). Let’s try this out:

In [9]:
import pandas as pd

pd.read_excel("c2_data.xls")

Unnamed: 0,varA,varB,varC
0,0.391723,-0.155122,0.381104
1,0.575125,-0.105817,0.232245
2,0.672305,0.424688,-0.694795
3,0.766115,-0.79135,-0.028739
4,0.677259,-0.817543,-0.537088
5,-0.029702,-0.891848,-0.682719
6,-0.161366,-0.6596,-0.727898
7,0.031672,0.016607,-0.940479
8,0.833212,-0.503236,-0.88721
9,0.907753,0.265177,-0.390762


In [11]:
# To load data from a different sheet of the workbook, we specify its name or index with sheet_name.
pd.read_excel("c2_data.xls", sheet_name="Sheet2")

Unnamed: 0,varD,varE,varF
0,0.907753,0.265177,-0.390762
1,0.755019,-0.768056,-0.528307
2,0.850692,-0.537159,-0.601387
3,0.131663,0.941327,0.240073
4,0.5744,0.091735,-0.395277
5,0.81663,0.875612,-0.880044
6,0.536732,0.175428,-0.473053
7,-0.084641,-0.042827,0.053344
8,0.268271,-0.010628,-0.090952
9,0.166792,-0.872579,-0.556899


### JSON data

JSON (JavaScript Object Notation) is a text-based format used to transmit information between web servers and clients or browsers in a logical and structured manner.

#### Syntax and structure

JSON data is built from two kinds of structures: objects (collections of key/value pairs) and arrays (ordered lists of values). We can import the data from a JSON file, here frame.json, much as we did with CSV and Excel files.

In [14]:
pd.read_json("frame.json")

Unnamed: 0,col1,col2,col3,col4
row1,0,1,2,3
row2,4,5,6,7
row3,8,9,10,11
row4,12,13,14,15
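
The contents of frame.json are not shown. One layout that would produce the table above is a JSON object with one entry per column, each mapping row labels to values. The sketch below builds such a document in memory; the layout is an assumption, not necessarily how the original file was written.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical contents for a file like frame.json: one JSON object per
# column, each mapping row labels to values.
data = {
    f"col{c + 1}": {f"row{r + 1}": 4 * r + c for r in range(4)}
    for c in range(4)
}
json_text = json.dumps(data)

# read_json interprets this layout as columns -> {index -> value}.
df = pd.read_json(StringIO(json_text))
print(df)
```

Because the outer keys are column names and the inner keys are row labels, the document is already tabular and no further processing is needed.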


Now in this example, the JSON file that we used was already in what is called tabular form. This means that we could directly load it as a DataFrame. However, this is not usually the case with JSON files. To explore this, let’s create a second JSON file. In this case we will store some bibliography entries. 

In [15]:
pd.read_json("c2_books.json")

Unnamed: 0,books
0,"{'isbn': '9781593275846', 'title': 'Eloquent J..."
1,"{'isbn': '9781449331818', 'title': 'Learning J..."
2,"{'isbn': '9781449365035', 'title': 'Speaking J..."


We can see that all of the information about a book was stored in a single field, so we cannot distinguish between the different pieces of information such as the author, the title, and so on. It would be much more useful to create a DataFrame with a separate column for each type of data. This requires some additional steps. We start by importing Python’s json library and a special function from pandas called json_normalize():



In [16]:
import json
from pandas import json_normalize

Let’s load the data from the JSON file and convert it to an object (a dictionary, really) which we store in the variable dictionary:

In [17]:
with open("c2_books.json", "r") as f:
    json_string = f.read()
    dictionary = json.loads(json_string)

If we now evaluate dictionary, we can indeed see the data from our file. Once the data is in this form, we can apply a process known as normalization, so called because it “normalizes” JSON data, which can be quite complex in structure, into a flat table (a DataFrame, to be precise). To do this, we use the json_normalize() function, which turns an array of nested JSON objects into a DataFrame whose columns correspond to the fields stored in the JSON file.

We pass two arguments: the variable dictionary, which holds the data, and the key under which the list of entries is stored (the record_path argument of json_normalize()). To find this key, look at the JSON file and note the name that appears before the list of entries. In our case, this name is books.

In [18]:
json_normalize(dictionary, 'books')

Unnamed: 0,isbn,title,subtitle,author,published,publisher,pages,description,website
0,9781593275846,"Eloquent JavaScript, Second Edition",A Modern Introduction to Programming,Marijn Haverbeke,2014-12-14T00:00:00.000Z,No Starch Press,472,JavaScript lies at the heart of almost every m...,http://eloquentjavascript.net/
1,9781449331818,Learning JavaScript Design Patterns,A JavaScript and jQuery Developer's Guide,Addy Osmani,2012-07-01T00:00:00.000Z,O'Reilly Media,254,"With Learning JavaScript Design Patterns, you'...",http://www.addyosmani.com/resources/essentialj...
2,9781449365035,Speaking JavaScript,An In-Depth Guide for Programmers,Axel Rauschmayer,2014-02-01T00:00:00.000Z,O'Reilly Media,460,"Like it or not, JavaScript is everywhere these...",http://speakingjs.com/
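
As a self-contained illustration of the same call, the sketch below normalizes a small dictionary defined inline. The records, including the nested publisher object, are hypothetical and only mimic the shape of c2_books.json.

```python
import pandas as pd

# Hypothetical data shaped like c2_books.json: a top-level object whose
# "books" key holds an array of records.
dictionary = {
    "books": [
        {
            "isbn": "0000000000001",
            "title": "Book A",
            "publisher": {"name": "Press X", "city": "Bern"},
        },
        {
            "isbn": "0000000000002",
            "title": "Book B",
            "publisher": {"name": "Press Y", "city": "Basel"},
        },
    ]
}

# record_path names the key that holds the list of records; nested
# objects are flattened into dotted column names such as publisher.name.
books = pd.json_normalize(dictionary, record_path="books")
print(books.columns.tolist())
print(books)
```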


### HTML files

The webpage we will work with (an archived copy of a page from en.proverbia.net) contains some famous quotes. Our goal is to extract these quotes and store them in a DataFrame. Choosing the “view source” option in the browser opens a new page with the HTML source code. This source code might seem very complicated at first sight: the quotes we want are in there, but they are surrounded by a lot of extra markup. To extract them, we first need to know some basic facts about HTML code.



In [21]:
# 1. Read the webpage into python
import requests
# 2. To download the HTML code, the requests library makes a GET request to the web server
page = requests.get(
    "https://web.archive.org/web/20180908144902/en.proverbia.net/shortfamousquotes.asp"
)
# 3. The variable page is a Response object. Its text attribute contains the HTML code. We can verify this by peeking at the first 100 characters
page.text[0:100]
# 4. It also has an attribute called status_code which indicates if the page was downloaded correctly
page.status_code
# A status code of 200 means that the page was downloaded correctly.

200

### Web scraping

In the previous unit, we downloaded the data from a webpage into an object called page using the requests library. We saw that this object has an attribute called text, which contains the HTML source code of the webpage. We will now look at how to extract the information we want from this code, using a Python library called BeautifulSoup. Let’s start by importing this library:

In [22]:
from bs4 import BeautifulSoup
# We parse the HTML code stored in the object page and create a BeautifulSoup object
soup = BeautifulSoup(page.text, "html.parser")

This soup object now contains all of the HTML code of the original document. Now comes the fun part: extracting the information we want from the HTML code. In our case, **our goal is to extract the quotes**. To do this, we have to do some investigative work: we look in the HTML code for patterns that we can exploit to find all the quotes, that is, tags or identifiers that occur around every quote. If we inspect the section in which the quotes appear (for example by displaying the soup object), we see that each quote starts with `<blockquote>` and ends with `</blockquote>`. Once we have identified such a pattern, we can use BeautifulSoup to find all instances of it with the find_all() method. Let’s give it a try:



In [25]:
quotes = soup.find_all("blockquote")
# It returns a ResultSet object which we can treat as a list of tags
quotes

[<blockquote>There is a natural aristocracy among men. The grounds of this are virtue and talents. </blockquote>,
 <blockquote>All our words from loose using have lost their edge. </blockquote>,
 <blockquote>God couldn't be everywhere, so he created mothers </blockquote>,
 <blockquote>Be not afraid of going slowly, be afraid only of standing still. </blockquote>,
 <blockquote>Learn from yesterday, live for today, hope for tomorrow. </blockquote>,
 <blockquote>Do not confine your children to your own learning, for they were born in another time. </blockquote>,
 <blockquote>I hear and I forget, I see and I remember. I do and I understand. </blockquote>,
 <blockquote>In teaching others we teach ourselves. </blockquote>,
 <blockquote>Happiness will never come to those who fail to appreciate what they already have. </blockquote>,
 <blockquote>Without His love I can do nothing, with His love there is nothing I cannot do. </blockquote>]

We can iterate over it or access entries at a specific index. It also has a very useful attribute called text.

In [26]:
quotes[0].text

'There is a natural aristocracy among men. The grounds of this are virtue and talents. '

The attribute text extracted the text of the tag for us, and returned a string object containing the quote. We can now create a loop to extract all the quotes:

In [27]:
quote_list = []
for quote in quotes:
    string = quote.text
    quote_list.append(string)
    
# Finally, we can create a DataFrame to store our strings:
df = pd.DataFrame(quote_list, columns=["Quote"])
df

Unnamed: 0,Quote
0,There is a natural aristocracy among men. The ...
1,All our words from loose using have lost their...
2,"God couldn't be everywhere, so he created moth..."
3,"Be not afraid of going slowly, be afraid only ..."
4,"Learn from yesterday, live for today, hope for..."
5,Do not confine your children to your own learn...
6,"I hear and I forget, I see and I remember. I d..."
7,In teaching others we teach ourselves.
8,Happiness will never come to those who fail to...
9,"Without His love I can do nothing, with His lo..."


We will now continue by extracting the **authors** of each quote. The first step is as before: identify a pattern of tags that surrounds the text we want. If we repeat the logic that we used for the quotes, we find that each author is contained between the tags `<p class="a">` and `</p>`. Let’s try to use the find_all() method again. But wait - this time we cannot just search for all instances of the tag `<p>...</p>`, because this tag occurs in more places than just around the authors of the quotes! Instead, we have to take advantage of the class attribute. The HTML code is structured in such a way that all `<p>` tags surrounding an author have class="a". We can pass this to find_all() as follows:



In [28]:
authors = soup.find_all("p", class_="a")

Once again we can treat the object authors as a list and access the attribute text, which returns the text inside the tag for each item of the list.

In [29]:
authors[0].text

'\nThomas Jefferson (1743-1826) Third president of the United States.\n'

We can see that there are some extra characters that we would like to remove from the text. We can do this by just selecting the indices of the string that we want to keep. In this case, we want to get rid of the first and last characters (note that \n is considered as a single character - it’s the new line character) - so the string we want to keep is:

In [30]:
authors[0].text[1:-1]

'Thomas Jefferson (1743-1826) Third president of the United States.'
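
Slicing with fixed indices works here because every author string starts and ends with exactly one newline. As a possible alternative, Python’s built-in str.strip() removes all leading and trailing whitespace without assuming how much there is. The short comparison below uses the string shown above plus one value taken from the author column.

```python
raw = "\nThomas Jefferson (1743-1826) Third president of the United States.\n"

# Slicing drops exactly one character from each end, whatever it is.
sliced = raw[1:-1]

# strip() removes all leading and trailing whitespace, and nothing else.
stripped = raw.strip()

print(sliced == stripped)  # True for this string

# The two approaches differ when the surrounding newlines are missing.
print("Proverb"[1:-1])    # rover
print("Proverb".strip())  # Proverb
```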

In [31]:
author_list = []
for author in authors:
    string = author.text[1:-1]
    author_list.append(string)
df["Author"] = author_list
df

Unnamed: 0,Quote,Author
0,There is a natural aristocracy among men. The ...,Thomas Jefferson (1743-1826) Third president o...
1,All our words from loose using have lost their...,Ernest Hemingway (1898-1961) American Writer.
2,"God couldn't be everywhere, so he created moth...",Jewish proverb
3,"Be not afraid of going slowly, be afraid only ...",Chinese proverb
4,"Learn from yesterday, live for today, hope for...",Unknown Source
5,Do not confine your children to your own learn...,Chinese proverb
6,"I hear and I forget, I see and I remember. I d...",Chinese proverb
7,In teaching others we teach ourselves.,Proverb
8,Happiness will never come to those who fail to...,Unknown Source
9,"Without His love I can do nothing, with His lo...",Unknown Source
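
The two extraction steps can also be tried without downloading anything. The sketch below runs them on a small HTML snippet defined inline. The snippet is a simplified, hypothetical imitation of the tag pattern described above, not the real page, and get_text(strip=True) is used here in place of index slicing.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Simplified, hypothetical HTML that mimics the tag pattern described above.
html = """
<div>
  <blockquote>In teaching others we teach ourselves. </blockquote>
  <p class="a">
Proverb
</p>
  <p>Unrelated paragraph without the class.</p>
  <blockquote>Learn from yesterday, live for today, hope for tomorrow. </blockquote>
  <p class="a">
Unknown Source
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
quotes = soup.find_all("blockquote")
authors = soup.find_all("p", class_="a")  # skips the unrelated <p>

# Pairing by position only makes sense if both lists have the same length.
if len(quotes) != len(authors):
    raise ValueError("quotes and authors do not line up")

rows = [
    {"Quote": q.get_text(strip=True), "Author": a.get_text(strip=True)}
    for q, a in zip(quotes, authors)
]
df = pd.DataFrame(rows)
print(df)
```

Checking the lengths before pairing guards against a page where a quote has no author tag, in which case assigning the author list as a new column would either fail or misalign the rows.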


#### **In summary**

* Download HTML code using the requests library
* Create a BeautifulSoup object to hold the parsed HTML code
* Look for patterns identifying the information that you want to extract from the code
* Search for specific tags using the find_all() method
* Iterate over the object returned by find_all() and use the text attribute to extract the text between each set of tags
* Store the strings in a Python list and convert to a DataFrame for further analysis

A word of caution: the legal aspects of web scraping can be highly complex, and one has to be very careful with the terms and conditions of each website. In fact, many websites state clearly in their terms and conditions that web scraping is not allowed. Others provide what is known as an API (“Application Programming Interface”), a set of predefined functions and procedures that control the access we can have to a site’s data. Whenever such an API is provided, it is best practice to use it instead of developing your own scraping methods.

#### Special case: scraping tables

It may happen that you want to scrape tables directly out of webpages. For that, you could apply the method described above, i.e. collect the HTML from a webpage, parse it using BeautifulSoup, and insert the result into a DataFrame. Fortunately, for this specific use case pandas has a built-in and straightforward function, pd.read_html(), which automatically retrieves the tables in a webpage (as identified by `<table>` tags in its HTML code) and returns them as a list of DataFrames. For example, if you want to collect the table of additives from the Open Food Facts website, you can call pd.read_html() with the URL as input:

In [32]:
tables = pd.read_html("https://world.openfoodfacts.org/additives")
print(len(tables))  # 1
print(tables[0].head())

1
                   Additive  Products    * Risk
0        E330 - Citric acid    138624  NaN  NaN
1          E322 - Lecithins     95545  NaN  NaN
2          E322i - Lecithin     86705  NaN  NaN
3  E500 - Sodium carbonates     58968  NaN  NaN
4        E415 - Xanthan gum     52039  NaN  NaN


You can use the same method on a webpage with many tables, for instance the Wikipedia page on the world record progression for the 50 metres freestyle, and it returns all the available tables:



In [33]:
tables = pd.read_html(
    "https://en.wikipedia.org/wiki/World_record_progression_50_metres_freestyle"
)
print(len(tables))
print(tables[4].head())
print(tables[-2].head())

9
   Pos   Time                   Swimmer              Date          Venue   Ref
0    1  20.91         Cesar Cielo (BRA)  17 December 2009         Brazil   NaN
1    2  20.94  Frederick Bousquet (FRA)     22 April 2009         France   NaN
2    3  21.04      Caeleb Dressel (USA)      27 July 2019    South Korea   NaN
3    3  21.04      Caeleb Dressel (USA)      20 June 2021          Omaha  [19]
4    4  21.11      Benjamin Proud (GBR)     3 August 2018  Great Britain   NaN
   Pos   Time                    Swimmer              Date       Venue  Ref
0    1  22.93  Ranomi Kromowidjojo (NED)     7 August 2017     Germany  NaN
1    2  23.00       Sarah Sjöström (SWE)     7 August 2017     Germany  NaN
2    3  23.19        Cate Campbell (AUS)   27 October 2017      Russia  NaN
3    4  23.25     Marleen Veldhuis (NED)     13 April 2008  Manchester  NaN
4    5  23.27    Therese Alshammar (SWE)  21 November 2009   Singapore  NaN


And if you’re only interested in tables mentioning “Switzerland”, the match parameter is made exactly for that. It keeps only the tables containing text that matches the given string or regular expression:

In [34]:
tables = pd.read_html(
    "https://en.wikipedia.org/wiki/World_record_progression_50_metres_freestyle",
    match="Switzerland",
)
print(len(tables))  # 1
print(tables[0][10:15][["Time", "Name", "Nationality"]])


1
     Time          Name    Nationality
10  22.54   Robin Leamy  United States
11  22.52  Dano Halsall    Switzerland
12  22.40     Tom Jager  United States
13  22.33   Matt Biondi  United States
14  22.33   Matt Biondi  United States
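
Since live pages change over time, the effect of match can also be checked on a fixed HTML string. The sketch below uses a hypothetical two-table page defined inline, with a few values borrowed from the outputs above. It assumes that one of the HTML parsers supported by pd.read_html() (for example lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two small tables; only the first mentions Switzerland.
html = """
<table>
  <thead><tr><th>Time</th><th>Name</th><th>Nationality</th></tr></thead>
  <tbody>
    <tr><td>22.52</td><td>Dano Halsall</td><td>Switzerland</td></tr>
    <tr><td>22.40</td><td>Tom Jager</td><td>United States</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Pos</th><th>Swimmer</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Cesar Cielo (BRA)</td></tr>
  </tbody>
</table>
"""

# Without match, every table on the page is returned.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are kept.
swiss_tables = pd.read_html(StringIO(html), match="Switzerland")

print(len(all_tables), len(swiss_tables))
print(swiss_tables[0])
```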
