# Reading data from CSV files

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('test_file.csv')
df

Unnamed: 0,a,b,c,d
0,yellow,10,2,3.2
1,green,2,3,8.1
2,blue,7,1,0.4


In [3]:
pd.read_csv('test_file.csv',names=['column 1','column 2','column 3','column 4'])

Unnamed: 0,column 1,column 2,column 3,column 4
0,a,b,c,d
1,yellow,10,2,3.2
2,green,2,3,8.1
3,blue,7,1,0.4
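
Because the file already has a header row, passing names on its own makes pandas treat that row as data, as the output above shows. One way to rename the columns and drop the original header at the same time is to add header=0. The sketch below assumes the file contents reconstructed from the earlier output and reads them from an in-memory string instead of test_file.csv.

```python
import io

import pandas as pd

# Assumed contents of test_file.csv, reconstructed from the output shown earlier
csv_text = "a,b,c,d\nyellow,10,2,3.2\ngreen,2,3,8.1\nblue,7,1,0.4\n"

new_names = ['column 1', 'column 2', 'column 3', 'column 4']

# header=0 marks the first row as a header, which `names` then replaces
renamed = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)
print(renamed)
```

With header=0 the DataFrame keeps three data rows, and the original a, b, c, d labels no longer appear as a row.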


In [4]:
pd.read_csv('test_file.csv', index_col=0)

Unnamed: 0_level_0,b,c,d
a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
yellow,10,2,3.2
green,2,3,8.1
blue,7,1,0.4


In [5]:
df.dtypes

a     object
b      int64
c      int64
d    float64
dtype: object

In [6]:
# Force the data type of column b

df2 = pd.read_csv('test_file.csv',  dtype = { 'b' : np.float64})
df2.dtypes

a     object
b    float64
c      int64
d    float64
dtype: object
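
The same conversion can also be done after loading, with astype(). The sketch below compares the two approaches on an in-memory copy of the data. The CSV text is an assumption reconstructed from the output shown earlier.

```python
import io

import numpy as np
import pandas as pd

# Assumed contents of test_file.csv, reconstructed from the output shown earlier
csv_text = "a,b,c,d\nyellow,10,2,3.2\ngreen,2,3,8.1\nblue,7,1,0.4\n"

# Option 1: set the dtype while reading
at_read = pd.read_csv(io.StringIO(csv_text), dtype={'b': np.float64})

# Option 2: read with inferred dtypes, then convert the column
after_read = pd.read_csv(io.StringIO(csv_text)).astype({'b': 'float64'})

print(at_read.dtypes)
print(at_read.equals(after_read))
```

Setting the dtype at read time avoids a second pass over the column, which can matter for large files.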

#### Loading only part of a file

In [7]:
pd.read_csv("test_file.csv", usecols=['a', 'b'])

Unnamed: 0,a,b
0,yellow,10
1,green,2
2,blue,7
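
Partial loading can also limit the number of rows. As a minimal sketch, usecols can be combined with the nrows parameter of read_csv(). The in-memory CSV text is again an assumed copy of test_file.csv.

```python
import io

import pandas as pd

# Assumed contents of test_file.csv, reconstructed from the output shown earlier
csv_text = "a,b,c,d\nyellow,10,2,3.2\ngreen,2,3,8.1\nblue,7,1,0.4\n"

# Keep only columns a and b, and stop after the first two data rows
subset = pd.read_csv(io.StringIO(csv_text), usecols=['a', 'b'], nrows=2)
print(subset)
```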


# Reading data from Excel files

In [9]:
import pandas as pd
pd.read_excel('data.xls')

Unnamed: 0,varA,varB,varC
0,0.391723,-0.155122,0.381104
1,0.575125,-0.105817,0.232245
2,0.672305,0.424688,-0.694795
3,0.766115,-0.79135,-0.028739
4,0.677259,-0.817543,-0.537088
5,-0.029702,-0.891848,-0.682719
6,-0.161366,-0.6596,-0.727898
7,0.031672,0.016607,-0.940479
8,0.833212,-0.503236,-0.88721
9,0.907753,0.265177,-0.390762


In [10]:
pd.read_excel('data.xls', sheet_name='Sheet2')

Unnamed: 0,varD,varE,varF
0,0.907753,0.265177,-0.390762
1,0.755019,-0.768056,-0.528307
2,0.850692,-0.537159,-0.601387
3,0.131663,0.941327,0.240073
4,0.5744,0.091735,-0.395277
5,0.81663,0.875612,-0.880044
6,0.536732,0.175428,-0.473053
7,-0.084641,-0.042827,0.053344
8,0.268271,-0.010628,-0.090952
9,0.166792,-0.872579,-0.556899


In [11]:
pd.read_excel('data.xls', sheet_name='Sheet2', usecols = ['varD','varE'])

Unnamed: 0,varD,varE
0,0.907753,0.265177
1,0.755019,-0.768056
2,0.850692,-0.537159
3,0.131663,0.941327
4,0.5744,0.091735
5,0.81663,0.875612
6,0.536732,0.175428
7,-0.084641,-0.042827
8,0.268271,-0.010628
9,0.166792,-0.872579


# JSON data

When dealing with data on the web, the most common format that we will come across is JSON, which stands for JavaScript Object Notation. In a nutshell, JSON is a file format used to transmit information between web servers and clients or browsers in a logical and structured manner. It was first developed in the early 2000s in response to the need for a better server-to-browser communication format. As its name suggests, it was originally derived from the JavaScript programming language; however, unlike JavaScript objects, a JSON object can be transferred between different programming languages in a format that all of them can work with. In fact, almost all programming languages nowadays provide functions or libraries that can read and write JSON data.


#### Syntax and structure
JSON can contain two types of elements:

- JSON objects
- arrays

A JSON object is essentially just a key-value data format that is stored inside curly brackets. Here is an example:

In [12]:
{
  "userID": 12345,
  "userName": "John Smith"
}

{'userID': 12345, 'userName': 'John Smith'}

An array is an ordered collection that can contain values of different data types. The main syntactical difference between JSON objects and arrays is that arrays are stored inside square brackets. We can use an array as the value field of a JSON object, as shown below:

In [13]:
{
  "userID": 12345,
  "userName": "John Smith",
  "results": [
    {
      "test": "Verbal Reasoning",
      "score": 140
     },
    {
      "test":"Quantitative Reasoning",
       "score": 165
    },
    {
      "test":"Analytical Writing",
       "score": 5
    }
  ],
  "testCompleted": True
}

{'userID': 12345,
 'userName': 'John Smith',
 'results': [{'test': 'Verbal Reasoning', 'score': 140},
  {'test': 'Quantitative Reasoning', 'score': 165},
  {'test': 'Analytical Writing', 'score': 5}],
 'testCompleted': True}
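
The cells above are typed as Python literals, which is why the boolean is written True. In actual JSON text the same value is written true in lowercase, and Python's json module handles the translation. The sketch below parses a shortened version of the record from a string to show how objects and arrays map onto Python types.

```python
import json

# JSON text: note the lowercase `true`, unlike Python's `True`
json_text = '''
{
  "userID": 12345,
  "userName": "John Smith",
  "results": [
    {"test": "Verbal Reasoning", "score": 140},
    {"test": "Quantitative Reasoning", "score": 165}
  ],
  "testCompleted": true
}
'''

record = json.loads(json_text)

print(type(record).__name__)             # dict: a JSON object becomes a dict
print(type(record["results"]).__name__)  # list: a JSON array becomes a list
print(record["testCompleted"])           # True: JSON true becomes Python True

# Serializing back produces valid JSON text again
round_trip = json.dumps(record)
print(round_trip)
```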

In [14]:
{
    "col1":
        {
            "row1":0,"row2":4,"row3":8,"row4":12
        },
    "col2":
        {
            "row1":1,"row2":5,"row3":9,"row4":13
        },
    "col3":
        {
            "row1":2,"row2":6,"row3":10,"row4":14
        },
    "col4":
        {
            "row1":3,"row2":7,"row3":11,"row4":15
        }
}

{'col1': {'row1': 0, 'row2': 4, 'row3': 8, 'row4': 12},
 'col2': {'row1': 1, 'row2': 5, 'row3': 9, 'row4': 13},
 'col3': {'row1': 2, 'row2': 6, 'row3': 10, 'row4': 14},
 'col4': {'row1': 3, 'row2': 7, 'row3': 11, 'row4': 15}}

In [15]:
pd.read_json('frame.json')

Unnamed: 0,col1,col2,col3,col4
row1,0,1,2,3
row2,4,5,6,7
row3,8,9,10,11
row4,12,13,14,15


Now in this example, the JSON file that we used was already in what is called tabular form. This means that we could directly load it as a DataFrame. However, this is not usually the case with JSON files.
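
What "tabular form" means here depends on the orient parameter of read_json(). By default, the outer keys are read as column names and the inner keys as row labels. The sketch below uses a shortened, in-memory version of the data (an assumption, not the actual frame.json) and shows the same table stored row by row, which needs orient='index'.

```python
import io

import pandas as pd

# Outer keys are columns, inner keys are row labels (default orient='columns')
by_column = '{"col1": {"row1": 0, "row2": 4}, "col2": {"row1": 1, "row2": 5}}'
frame = pd.read_json(io.StringIO(by_column))

# The same table stored row by row needs orient='index'
by_row = '{"row1": {"col1": 0, "col2": 1}, "row2": {"col1": 4, "col2": 5}}'
frame_from_rows = pd.read_json(io.StringIO(by_row), orient='index')

print(frame)
print(frame_from_rows)
```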



In [16]:
pd.read_json('books.json')

Unnamed: 0,books
0,"{'isbn': '9781593275846', 'title': 'Eloquent J..."
1,"{'isbn': '9781449331818', 'title': 'Learning J..."
2,"{'isbn': '9781449365035', 'title': 'Speaking J..."


We can see that the data in this file is no longer tabular. If we load it directly with read_json(), the result is not very useful: each row holds an entire nested record in a single column.

To turn these records into a proper table, we need to perform some additional steps. We start by importing the json library for Python and a pandas function called json_normalize():

In [17]:
import json
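# json_normalize is a top-level pandas function since pandas 1.0;
# the older pandas.io.json import path is deprecated
from pandas import json_normalize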

Let's load the data from the JSON file and convert it to an object (a dictionary, really) which we store in the variable dictionary:

In [18]:
with open('books.json', 'r') as f:
    json_string = f.read()
    dictionary = json.loads(json_string)

If we now type the command dictionary, we can indeed see the data from our file. Once we have the data in this format, we can apply a process known as normalization. It is called normalization because it "normalizes" JSON data, which can be quite complex in structure, into a flat table (a DataFrame, to be more precise).

To do this, we use the json_normalize() function. It turns an array of nested JSON objects into a DataFrame whose columns correspond to the fields stored in the JSON file. We pass two arguments: the variable dictionary, which contains the data, and the key under which the list of entries is stored (the record_path parameter). To know which key to use, we look at our JSON file and find the name that precedes the list of entries. In our case, this name is books. Let's try this out:

In [19]:
json_normalize(dictionary, 'books')

Unnamed: 0,author,description,isbn,pages,published,publisher,subtitle,title,website
0,Marijn Haverbeke,JavaScript lies at the heart of almost every m...,9781593275846,472,2014-12-14T00:00:00.000Z,No Starch Press,A Modern Introduction to Programming,"Eloquent JavaScript, Second Edition",http://eloquentjavascript.net/
1,Addy Osmani,"With Learning JavaScript Design Patterns, you'...",9781449331818,254,2012-07-01T00:00:00.000Z,O'Reilly Media,A JavaScript and jQuery Developer's Guide,Learning JavaScript Design Patterns,http://www.addyosmani.com/resources/essentialj...
2,Axel Rauschmayer,"Like it or not, JavaScript is everywhere these...",9781449365035,460,2014-02-01T00:00:00.000Z,O'Reilly Media,An In-Depth Guide for Programmers,Speaking JavaScript,http://speakingjs.com/
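
Normalization also flattens objects nested inside each record. The sketch below uses a small made-up dictionary (the titles and publisher fields are hypothetical, not taken from books.json) to show how a nested object becomes dot-separated columns.

```python
import pandas as pd

# Hypothetical data: each book holds a nested "publisher" object
data = {
    "books": [
        {"title": "Book A", "pages": 100,
         "publisher": {"name": "Press X", "city": "Bern"}},
        {"title": "Book B", "pages": 250,
         "publisher": {"name": "Press Y", "city": "Basel"}},
    ]
}

# record_path names the key that holds the list of records
flat = pd.json_normalize(data, record_path="books")
print(flat.columns.tolist())
print(flat)
```

The nested publisher object ends up as the columns publisher.name and publisher.city. The separator can be changed with the sep parameter.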


# HTML files

The web is one of the major sources of data that you will encounter. Getting data from the web is known as **web scraping**, and it is a very useful skill in any data scientist's toolbox. It allows us to get data from the web that is not available in a well-structured, directly downloadable format such as CSV. **You might wonder, why don't we just copy and paste the data manually? Well, this might work for a small webpage, but in general we will be interested in scraping large amounts of data, which would be extremely time-consuming and completely impractical to do by hand**. Luckily, Python has several tools that help automate this process for us.

Before we get into it, a word of warning: be cautious when crawling the web. In particular, some Terms of Service explicitly prohibit scraping the website, and the data itself may be copyrighted. So be sure to understand what you're doing (here is an interesting analysis of the problem).

#### What exactly is HTML?

We will not get into too much detail here about HTML, the HyperText Markup Language that powers the web, but we will cover some very basic facts that will be sufficient for you to perform successful web scraping. **HTML is the source code that generates a webpage.** When viewing a webpage in our web browser, we can look at its source code by right-clicking and selecting view page source or show page source, depending on the browser we are using. Here is an example:

We will exploit these patterns to retrieve the information that we want. We will be especially interested in the attributes **class and id**. These are special properties that give HTML elements names, and we can take advantage of these names when web scraping. An element can have multiple classes but only one id. However, HTML does not require elements to have classes or ids, so not every web page will provide these attributes.
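
To make the role of class and id concrete, here is a minimal sketch that uses the BeautifulSoup library on a small hand-written HTML snippet. The tag contents, class names and id below are invented for illustration and do not come from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one element with an id, several with classes
html = """
<html><body>
  <h1 id="page-title">Quotes</h1>
  <p class="quote short">Stay curious.</p>
  <p class="quote">Measure twice, cut once.</p>
  <p class="author">Unknown</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# An id identifies a single element
title = soup.find(id="page-title")

# A class can be shared; class_="quote" also matches elements
# that carry additional classes such as "short"
quotes = soup.find_all("p", class_="quote")

print(title.text)
print([q.text for q in quotes])
print(quotes[0]["class"])  # the classes come back as a list
```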

#### The requests library

The first step in web scraping is to read the web page into Python. This is done using the requests library, so we first have to import it as follows:

In [20]:
import requests

In [21]:
page=requests.get('https://web.archive.org/web/20180908144902/http://en.proverbia.net/shortfamousquotes.asp')

In [22]:
page.status_code

200

In [23]:
page.text[0:100]

'\n<!DOCTYPE html>\n\n<html lang="en" xml:lang="en">\n<head><script src="//archive.org/includes/analytics'

# Web scraping

In [24]:
from bs4 import BeautifulSoup

In [25]:
soup = BeautifulSoup(page.text, 'html.parser')

In [26]:
quotes = soup.find_all('blockquote')

In [27]:
quotes

[<blockquote>There is a natural aristocracy among men. The grounds of this are virtue and talents. </blockquote>,
 <blockquote>All our words from loose using have lost their edge. </blockquote>,
 <blockquote>God couldn't be everywhere, so he created mothers </blockquote>,
 <blockquote>Be not afraid of going slowly, be afraid only of standing still. </blockquote>,
 <blockquote>Learn from yesterday, live for today, hope for tomorrow. </blockquote>,
 <blockquote>Do not confine your children to your own learning, for they were born in another time. </blockquote>,
 <blockquote>I hear and I forget, I see and I remember. I do and I understand. </blockquote>,
 <blockquote>In teaching others we teach ourselves. </blockquote>,
 <blockquote>Happiness will never come to those who fail to appreciate what they already have. </blockquote>,
 <blockquote>Without His love I can do nothing, with His love there is nothing I cannot do. </blockquote>]

In [28]:
quotes[0].text

'There is a natural aristocracy among men. The grounds of this are virtue and talents. '

In [29]:
quote_list = []
for quote in quotes:
    string = quote.text
    quote_list.append(string)

In [30]:
quote_list

['There is a natural aristocracy among men. The grounds of this are virtue and talents. ',
 'All our words from loose using have lost their edge. ',
 "God couldn't be everywhere, so he created mothers ",
 'Be not afraid of going slowly, be afraid only of standing still. ',
 'Learn from yesterday, live for today, hope for tomorrow. ',
 'Do not confine your children to your own learning, for they were born in another time. ',
 'I hear and I forget, I see and I remember. I do and I understand. ',
 'In teaching others we teach ourselves. ',
 'Happiness will never come to those who fail to appreciate what they already have. ',
 'Without His love I can do nothing, with His love there is nothing I cannot do. ']

In [31]:
import pandas as pd
df = pd.DataFrame(quote_list, columns=['Quote'])
df

Unnamed: 0,Quote
0,There is a natural aristocracy among men. The ...
1,All our words from loose using have lost their...
2,"God couldn't be everywhere, so he created moth..."
3,"Be not afraid of going slowly, be afraid only ..."
4,"Learn from yesterday, live for today, hope for..."
5,Do not confine your children to your own learn...
6,"I hear and I forget, I see and I remember. I d..."
7,In teaching others we teach ourselves.
8,Happiness will never come to those who fail to...
9,"Without His love I can do nothing, with His lo..."


In [32]:
authors=soup.find_all('p', class_="a")

In [33]:
authors[0].text

'\nThomas Jefferson (1743-1826) Third president of the United States.\n'

In [34]:
authors[0].text[1:-1]

'Thomas Jefferson (1743-1826) Third president of the United States.'

In [35]:
author_list=[]
for author in authors:
    string = author.text[1:-1]
    author_list.append(string)
df['Author']=author_list
df

Unnamed: 0,Quote,Author
0,There is a natural aristocracy among men. The ...,Thomas Jefferson (1743-1826) Third president o...
1,All our words from loose using have lost their...,Ernest Hemingway (1898-1961) American Writer.
2,"God couldn't be everywhere, so he created moth...",Jewish proverb
3,"Be not afraid of going slowly, be afraid only ...",Chinese proverb
4,"Learn from yesterday, live for today, hope for...",Unknown Source
5,Do not confine your children to your own learn...,Chinese proverb
6,"I hear and I forget, I see and I remember. I d...",Chinese proverb
7,In teaching others we teach ourselves.,Proverb
8,Happiness will never come to those who fail to...,Unknown Source
9,"Without His love I can do nothing, with His lo...",Unknown Source


Let's summarize the steps that we did:

- Download HTML code using the requests library
- Create a BeautifulSoup object to contain the parsed HTML code
- Look for patterns identifying the information that you want to extract from the code
- Search for specific tags using the find_all() method
- Iterate over the object returned by find_all() and use the text attribute to extract the text between each set of tags
- Store the strings in a Python list and convert to a DataFrame for further analysis
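
The steps above can be sketched in a more compact form on a small inline HTML string, so that it runs without a network connection. The HTML below is a made-up stand-in that mimics the blockquote and p class="a" structure of the page. get_text(strip=True) is used here instead of slicing with [1:-1] to remove the surrounding whitespace.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the structure of the scraped page
html = """
<blockquote>First example quote. </blockquote>
<p class="a">
Author One
</p>
<blockquote>Second example quote. </blockquote>
<p class="a">
Author Two
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract and clean the text of each matching tag
quotes = [tag.get_text(strip=True) for tag in soup.find_all("blockquote")]
authors = [tag.get_text(strip=True) for tag in soup.find_all("p", class_="a")]

# Building the frame from a dict raises a ValueError if the lists differ
# in length, which is a useful sanity check when scraping
df = pd.DataFrame({"Quote": quotes, "Author": authors})
print(df)
```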

#### A special case: scraping tables

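When the data you want is already stored in an HTML table, pandas can parse it directly: pd.read_html() returns a list with one DataFrame per table found on the page. For example, if you want to collect the table of additives described on this Open Food Facts webpage, you can call pd.read_html() with the URL as input: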

In [36]:
tables = pd.read_html("https://world.openfoodfacts.org/additives")
print(len(tables))  # 1 
print(tables[0].head())

1
                   Additive  Products   * Risk
0        E330 - Citric acid    128133 NaN  NaN
1          E322 - Lecithins     87083 NaN  NaN
2          E322i - Lecithin     80006 NaN  NaN
3  E500 - Sodium carbonates     54548 NaN  NaN
4        E415 - Xanthan gum     48773 NaN  NaN


In [37]:
tables = pd.read_html("https://en.wikipedia.org/wiki/World_record_progression_50_metres_freestyle")
print(len(tables))  # 9

9


In [38]:
print(tables[4].head())

   Pos   Time                   Swimmer              Date          Venue
0    1  20.91         Cesar Cielo (BRA)  17 December 2009         Brazil
1    2  20.94  Frederick Bousquet (FRA)     22 April 2009         France
2    3  21.04      Caeleb Dressel (USA)      27 July 2019    South Korea
3    4  21.11      Benjamin Proud (GBR)     3 August 2018  Great Britain
4    5  21.19       Ashley Callus (AUS)  26 November 2009      Australia


In [39]:
print(tables[-2].head())

   Pos  Swimmer                       Time              Date          Venue
0    1    22.93  Ranomi Kromowidjojo (NED)     7 August 2017        Germany
1    2    23.00       Sarah Sjöström (SWE)     7 August 2017        Germany
2    3    23.19        Cate Campbell (AUS)   27 October 2017         Russia
3    4    23.25     Marleen Veldhuis (NED)     13 April 2008  Great Britain
4    5    23.27    Therese Alshammar (SWE)  21 November 2009      Singapore


**And if you're only interested in tables mentioning "Switzerland", the match parameter does exactly that:**



In [None]:
tables = pd.read_html("https://en.wikipedia.org/wiki/World_record_progression_50_metres_freestyle", match="Switzerland")
print(len(tables))  # 1
print(tables[0][10:15][['Time', 'Name', 'Nationality']])
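
Since web pages change over time, the behaviour of match is easier to check on a small inline example. The sketch below uses two made-up tables (the swimmers and countries are placeholders) and assumes that a parser library usable by read_html(), such as lxml, is installed.

```python
import io

import pandas as pd

# Hypothetical HTML with two tables; only the first mentions Switzerland
html = """
<table>
  <thead><tr><th>Swimmer</th><th>Country</th></tr></thead>
  <tbody><tr><td>Swimmer A</td><td>Switzerland</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Swimmer</th><th>Country</th></tr></thead>
  <tbody><tr><td>Swimmer B</td><td>Brazil</td></tr></tbody>
</table>
"""

# Without match, every table on the page is returned
all_tables = pd.read_html(io.StringIO(html))

# With match, only tables containing text that matches the pattern are kept
swiss_tables = pd.read_html(io.StringIO(html), match="Switzerland")

print(len(all_tables), len(swiss_tables))
print(swiss_tables[0])
```

If no table matches, read_html() raises a ValueError instead of returning an empty list.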

# Getting data from the web using APIs

In its simplest form, an API is a contract between two parties saying that if one party provides input in a pre-defined format, then the other party will provide a pre-defined output. An API basically allows two pieces of code to interact with each other. When it comes to getting data from the web, APIs are extremely useful. Sites like LinkedIn, Reddit, Twitter, and Facebook all offer certain data through their APIs.


When using an API, we are essentially making a request to a remote web server to retrieve the data that we need. The way this request is implemented can vary based on the type of API. The most popular APIs are:

- SOAP - Simple Object Access Protocol
- REST - Representational State Transfer

In SOAP, requests are submitted and received via a file format called XML, while in REST, requests are usually submitted using the HTTP protocol. REST tends to be the far more popular choice. In this unit, we will go through an example with the strava.com API, which uses REST. To achieve this, we will learn how to make HTTP requests with the Python Requests library.
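
To give a first idea of what a REST request looks like, here is a minimal sketch that builds an HTTP GET request with Requests without sending it. The URL, the query parameters and the token are hypothetical placeholders, not the actual Strava endpoints.

```python
import requests

# Hypothetical endpoint and parameters, for illustration only
base_url = "https://api.example.com/v1/activities"
params = {"page": 1, "per_page": 5}
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}

# Build the request without sending it, to inspect what would go over the wire
request = requests.Request("GET", base_url, params=params, headers=headers)
prepared = request.prepare()

print(prepared.method)  # GET
print(prepared.url)     # the query parameters are encoded into the URL
```

To actually send such a request, requests.get(base_url, params=params, headers=headers) returns a response whose json() method decodes a JSON body into Python objects.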

You might wonder what exactly the benefit is of using an API over just scraping the data we need directly. Well, as we mentioned in our last unit, scraping the data might be illegal in certain cases. Public APIs can provide easier, faster (and legal!) data retrieval than web scraping. They are also great for dealing with cases where the data is changing quickly (e.g. stock prices), or where you want only specific aspects or subsets of the data.

#### Use case: Strava API

Known as the athlete’s social network, Strava is a place where you can record all of your athletic activities, share them with your friends, and compete for glory by claiming the fastest times on local segments.

In [48]:
import json

# Load credentials
with open('client-credentials.json') as file:
    client_credentials = json.load(file)



In [49]:
print('Credentials:', list(client_credentials.keys())) 

Credentials: ['client_id', 'client_secret']


In [51]:
print(client_credentials['client_id']) # Client ID

61535


#### Authorize the app
The Strava API is made for applications that need to access user data from Strava. In this scenario, the app (client) has to be authorized by the users to access and manage their Strava data.

In our case, we analyze our own data, but the process is the same - we need to explicitly authorize our app to access and manage our data. Let's see how to do this.

Authorization is done in the browser via a URL that links to the website Authorization Service. For Strava, it is