# Reading data from csv files

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('test_file.csv')
df

Unnamed: 0,a,b,c,d
0,yellow,10,2,3.2
1,green,2,3,8.1
2,blue,7,1,0.4


In [3]:
pd.read_csv('test_file.csv',names=['column 1','column 2','column 3','column 4'])

Unnamed: 0,column 1,column 2,column 3,column 4
0,a,b,c,d
1,yellow,10,2,3.2
2,green,2,3,8.1
3,blue,7,1,0.4
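
Note that in the output above the original header line (a, b, c, d) has become the first data row: passing names on its own tells pandas that the file has no header. If the intent is to rename the columns, one option is to also pass header=0, so that the first line is treated as a header and replaced. A minimal sketch, reading from an in-memory string that is assumed to mirror the contents of test_file.csv:

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for test_file.csv (assumed contents, based on the output above)
csv_text = """a,b,c,d
yellow,10,2,3.2
green,2,3,8.1
blue,7,1,0.4
"""

new_names = ['column 1', 'column 2', 'column 3', 'column 4']

# names alone: the original header line is kept as a data row
kept = pd.read_csv(StringIO(csv_text), names=new_names)

# names + header=0: the original header line is discarded and replaced
renamed = pd.read_csv(StringIO(csv_text), names=new_names, header=0)

print(kept.shape)     # header line + 3 data rows
print(renamed.shape)  # 3 data rows only
```

With header=0 the numeric columns also keep their numeric dtypes, since the text header no longer sits among the values.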


In [4]:
pd.read_csv('test_file.csv', index_col=0)

a,b,c,d
yellow,10,2,3.2
green,2,3,8.1
blue,7,1,0.4


In [5]:
df.dtypes

a     object
b      int64
c      int64
d    float64
dtype: object

In [6]:
# Force the data type of a column

df2 = pd.read_csv('test_file.csv',  dtype = { 'b' : np.float64})
df2.dtypes

a     object
b    float64
c      int64
d    float64
dtype: object

#### Loading partial data

In [7]:
pd.read_csv("test_file.csv", usecols=['a', 'b'])

Unnamed: 0,a,b
0,yellow,10
1,green,2
2,blue,7


# Reading data from Excel files

In [8]:
import pandas as pd
pd.read_excel('data.xls')

Unnamed: 0,varA,varB,varC
0,0.391723,-0.155122,0.381104
1,0.575125,-0.105817,0.232245
2,0.672305,0.424688,-0.694795
3,0.766115,-0.79135,-0.028739
4,0.677259,-0.817543,-0.537088
5,-0.029702,-0.891848,-0.682719
6,-0.161366,-0.6596,-0.727898
7,0.031672,0.016607,-0.940479
8,0.833212,-0.503236,-0.88721
9,0.907753,0.265177,-0.390762


In [9]:
pd.read_excel('data.xls', sheet_name='Sheet2')

Unnamed: 0,varD,varE,varF
0,0.907753,0.265177,-0.390762
1,0.755019,-0.768056,-0.528307
2,0.850692,-0.537159,-0.601387
3,0.131663,0.941327,0.240073
4,0.5744,0.091735,-0.395277
5,0.81663,0.875612,-0.880044
6,0.536732,0.175428,-0.473053
7,-0.084641,-0.042827,0.053344
8,0.268271,-0.010628,-0.090952
9,0.166792,-0.872579,-0.556899


In [10]:
pd.read_excel('data.xls', sheet_name='Sheet2', usecols = ['varD','varE'])

Unnamed: 0,varD,varE
0,0.907753,0.265177
1,0.755019,-0.768056
2,0.850692,-0.537159
3,0.131663,0.941327
4,0.5744,0.091735
5,0.81663,0.875612
6,0.536732,0.175428
7,-0.084641,-0.042827
8,0.268271,-0.010628
9,0.166792,-0.872579


# JSON data

When dealing with data on the web, the most common format that we will come across is JSON, which stands for JavaScript Object Notation. In a nutshell, JSON is a text format used to transmit information between web servers and clients or browsers in a logical and structured manner. It was first developed in the early 2000s in response to the need for a better server-to-browser communication format. As its name suggests, it was originally derived from the JavaScript programming language; however, unlike JavaScript objects, a JSON object can be exchanged between different programming languages in a form that all of them can work with. In fact, almost all programming languages nowadays provide functions or libraries that can read and write JSON data.


#### Syntax and structure
JSON can contain two types of elements:

- JSON objects
- arrays

A JSON object is essentially just a key-value data format that is stored inside curly brackets. Here is an example:

In [11]:
{
  "userID": 12345,
  "userName": "John Smith"
}

{'userID': 12345, 'userName': 'John Smith'}

An array is an ordered collection that can contain values of different data types. The main syntactical difference between JSON objects and arrays is that arrays are stored inside square brackets. We can use arrays as the value field of a JSON object, as shown below:

In [12]:
{
  "userID": 12345,
  "userName": "John Smith",
  "results": [
    {
      "test": "Verbal Reasoning",
      "score": 140
     },
    {
      "test":"Quantitative Reasoning",
       "score": 165
    },
    {
      "test":"Analytical Writing",
       "score": 5
    }
  ],
  "testCompleted": True
}

{'userID': 12345,
 'userName': 'John Smith',
 'results': [{'test': 'Verbal Reasoning', 'score': 140},
  {'test': 'Quantitative Reasoning', 'score': 165},
  {'test': 'Analytical Writing', 'score': 5}],
 'testCompleted': True}
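
The cells above are Python dictionary literals, which is why the boolean is written True. In an actual JSON document the literals are lowercase (true, false, null), and the standard json module converts between the two forms. A small sketch with a made-up JSON string:

```python
import json

# A JSON document as text: note the lowercase true and null
json_text = '{"userID": 12345, "testCompleted": true, "nickname": null}'

# json.loads parses JSON text into Python objects (true -> True, null -> None)
record = json.loads(json_text)
print(record)

# json.dumps serializes Python objects back into JSON text
print(json.dumps({"testCompleted": True, "nickname": None}))
```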

In [13]:
{
    "col1":
        {
            "row1":0,"row2":4,"row3":8,"row4":12
        },
    "col2":
        {
            "row1":1,"row2":5,"row3":9,"row4":13
        },
    "col3":
        {
            "row1":2,"row2":6,"row3":10,"row4":14
        },
    "col4":
        {
            "row1":3,"row2":7,"row3":11,"row4":15
        }
}

{'col1': {'row1': 0, 'row2': 4, 'row3': 8, 'row4': 12},
 'col2': {'row1': 1, 'row2': 5, 'row3': 9, 'row4': 13},
 'col3': {'row1': 2, 'row2': 6, 'row3': 10, 'row4': 14},
 'col4': {'row1': 3, 'row2': 7, 'row3': 11, 'row4': 15}}

In [14]:
pd.read_json('frame.json')

Unnamed: 0,col1,col2,col3,col4
row1,0,1,2,3
row2,4,5,6,7
row3,8,9,10,11
row4,12,13,14,15


Now in this example, the JSON file that we used was already in what is called tabular form. This means that we could directly load it as a DataFrame. However, this is not usually the case with JSON files.



In [15]:
pd.read_json('books.json')

Unnamed: 0,books
0,"{'isbn': '9781593275846', 'title': 'Eloquent J..."
1,"{'isbn': '9781449331818', 'title': 'Learning J..."
2,"{'isbn': '9781449365035', 'title': 'Speaking J..."


We can see that the data in this file is no longer tabular. If we try to load it directly with read_json(), the result is not very useful: every row holds a whole nested object in a single books column.

To turn this into a proper table, we need to perform some additional steps. We start by importing the json library for Python and a special function from pandas called json_normalize():

In [16]:
import json
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

Let's load the data from the JSON file and convert it to an object (a dictionary, really) which we store in the variable dictionary:

In [17]:
with open('books.json', 'r') as f:
    json_string = f.read()
    dictionary = json.loads(json_string)

If we now evaluate dictionary, we can indeed see the data from our file. Once we have the data in this format, we can apply a process known as normalization. It's called normalization because it "normalizes" JSON data, which can be quite complex in structure, into a flat table structure (a DataFrame, to be more precise). To do this, we use the json_normalize() function. This function turns an array of nested JSON objects into a DataFrame, with the columns corresponding to the different variables stored in the JSON file. We pass two arguments: the variable dictionary, which contains the data, and the key under which the list of entries is stored. To know which key to use, we must look at our JSON file and find the name that precedes the entries. In our case, this name is books. Let's try this out:

In [18]:
json_normalize(dictionary, 'books')

Unnamed: 0,author,description,isbn,pages,published,publisher,subtitle,title,website
0,Marijn Haverbeke,JavaScript lies at the heart of almost every m...,9781593275846,472,2014-12-14T00:00:00.000Z,No Starch Press,A Modern Introduction to Programming,"Eloquent JavaScript, Second Edition",http://eloquentjavascript.net/
1,Addy Osmani,"With Learning JavaScript Design Patterns, you'...",9781449331818,254,2012-07-01T00:00:00.000Z,O'Reilly Media,A JavaScript and jQuery Developer's Guide,Learning JavaScript Design Patterns,http://www.addyosmani.com/resources/essentialj...
2,Axel Rauschmayer,"Like it or not, JavaScript is everywhere these...",9781449365035,460,2014-02-01T00:00:00.000Z,O'Reilly Media,An In-Depth Guide for Programmers,Speaking JavaScript,http://speakingjs.com/
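
To see what json_normalize() does without needing the file, here is a small self-contained sketch. The records are invented for illustration and include a nested publisher object, which is not necessarily how books.json is organized:

```python
import pandas as pd

# Invented data: a top-level key holding a list of records, each with a nested object
data = {
    "books": [
        {"title": "Book A", "pages": 100, "publisher": {"name": "Press X", "city": "Paris"}},
        {"title": "Book B", "pages": 250, "publisher": {"name": "Press Y", "city": "Rome"}},
    ]
}

# record_path names the key whose list holds one entry per row;
# nested objects are flattened into dotted column names
flat = pd.json_normalize(data, record_path='books')
print(flat.columns.tolist())
print(flat)
```

Each entry of the list becomes one row, and the nested fields end up in columns such as publisher.name.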


# HTML files

The web is one of the major sources of data that you will encounter. Getting data from the web is known as **web scraping**, and it is a very useful skill in any data scientist's toolbox. It allows us to get data from the web that is not yet available in a well-structured, directly downloadable format such as CSV. **You might wonder, why don't we just copy and paste the data manually? Well, this might work for a small webpage, but in general we will be interested in scraping large amounts of data, which would be extremely time-consuming and completely impractical to do by hand**. Luckily, Python has several tools which help automate this process for us.

Before we get into it, a word of warning: be cautious when crawling the web. In particular, some Terms of Service may explicitly prohibit you from scraping the website, and the data itself may be copyrighted. So be sure to understand what you're doing (here is an interesting analysis of the problem).

#### What exactly is HTML?

We will not get into too much detail here about HTML, the HyperText Markup Language that powers the web, but we will cover some very basic facts that will be sufficient for you to perform successful web scraping. **HTML is the source code that generates a webpage.** When viewing a webpage in our web browser, we can look at its source code by right-clicking and selecting view page source or show page source, depending on the browser we are using. Here is an example:

We will exploit these patterns to retrieve the information that we want. We will be especially interested in the attributes **class and id**. These are special properties that give HTML elements names, and we can take advantage of these names when web scraping. An element can have multiple classes but only one id. However, HTML does not require elements to have classes or ids, so not all web pages will have these attributes.

#### The requests library

The first step in web scraping is to read the web page into python. This is done using the requests library, so we have to make sure that we first import it as follows:

In [19]:
import requests

In [20]:
page=requests.get('https://web.archive.org/web/20180908144902/http://en.proverbia.net/shortfamousquotes.asp')

In [21]:
page.status_code

200

In [22]:
page.text[0:100]

'\n<!DOCTYPE html>\n\n<html lang="en" xml:lang="en">\n<head><script src="//archive.org/includes/analytics'

# Web scraping

In [23]:
from bs4 import BeautifulSoup

In [24]:
soup = BeautifulSoup(page.text, 'html.parser')

In [25]:
quotes = soup.find_all('blockquote')

In [26]:
quotes

[<blockquote>There is a natural aristocracy among men. The grounds of this are virtue and talents. </blockquote>,
 <blockquote>All our words from loose using have lost their edge. </blockquote>,
 <blockquote>God couldn't be everywhere, so he created mothers </blockquote>,
 <blockquote>Be not afraid of going slowly, be afraid only of standing still. </blockquote>,
 <blockquote>Learn from yesterday, live for today, hope for tomorrow. </blockquote>,
 <blockquote>Do not confine your children to your own learning, for they were born in another time. </blockquote>,
 <blockquote>I hear and I forget, I see and I remember. I do and I understand. </blockquote>,
 <blockquote>In teaching others we teach ourselves. </blockquote>,
 <blockquote>Happiness will never come to those who fail to appreciate what they already have. </blockquote>,
 <blockquote>Without His love I can do nothing, with His love there is nothing I cannot do. </blockquote>]

In [27]:
quotes[0].text

'There is a natural aristocracy among men. The grounds of this are virtue and talents. '

In [28]:
quote_list = []
for quote in quotes:
    string = quote.text
    quote_list.append(string)

In [29]:
quote_list

['There is a natural aristocracy among men. The grounds of this are virtue and talents. ',
 'All our words from loose using have lost their edge. ',
 "God couldn't be everywhere, so he created mothers ",
 'Be not afraid of going slowly, be afraid only of standing still. ',
 'Learn from yesterday, live for today, hope for tomorrow. ',
 'Do not confine your children to your own learning, for they were born in another time. ',
 'I hear and I forget, I see and I remember. I do and I understand. ',
 'In teaching others we teach ourselves. ',
 'Happiness will never come to those who fail to appreciate what they already have. ',
 'Without His love I can do nothing, with His love there is nothing I cannot do. ']

In [30]:
import pandas as pd
df = pd.DataFrame(quote_list, columns=['Quote'])
df

Unnamed: 0,Quote
0,There is a natural aristocracy among men. The ...
1,All our words from loose using have lost their...
2,"God couldn't be everywhere, so he created moth..."
3,"Be not afraid of going slowly, be afraid only ..."
4,"Learn from yesterday, live for today, hope for..."
5,Do not confine your children to your own learn...
6,"I hear and I forget, I see and I remember. I d..."
7,In teaching others we teach ourselves.
8,Happiness will never come to those who fail to...
9,"Without His love I can do nothing, with His lo..."


In [31]:
authors=soup.find_all('p', class_="a")

In [32]:
authors[0].text

'\nThomas Jefferson (1743-1826) Third president of the United States.\n'

In [33]:
authors[0].text[1:-1]

'Thomas Jefferson (1743-1826) Third president of the United States.'

In [34]:
author_list=[]
for author in authors:
    string = author.text[1:-1]
    author_list.append(string)
df['Author']=author_list
df

Unnamed: 0,Quote,Author
0,There is a natural aristocracy among men. The ...,Thomas Jefferson (1743-1826) Third president o...
1,All our words from loose using have lost their...,Ernest Hemingway (1898-1961) American Writer.
2,"God couldn't be everywhere, so he created moth...",Jewish proverb
3,"Be not afraid of going slowly, be afraid only ...",Chinese proverb
4,"Learn from yesterday, live for today, hope for...",Unknown Source
5,Do not confine your children to your own learn...,Chinese proverb
6,"I hear and I forget, I see and I remember. I d...",Chinese proverb
7,In teaching others we teach ourselves.,Proverb
8,Happiness will never come to those who fail to...,Unknown Source
9,"Without His love I can do nothing, with His lo...",Unknown Source


Let's summarize the steps that we did:

- Download HTML code using the requests library
- Create a BeautifulSoup object to contain the parsed HTML code
- Look for patterns identifying the information that you want to extract from the code
- Search for specific tags using the find_all() method
- Iterate over the object returned by find_all() and use the text attribute to extract the text between each set of tags
- Store the strings in a Python list and convert to a DataFrame for further analysis
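
The steps above, minus the download, can be tried on a small inline HTML string. The markup below is invented to mimic the structure used in this section (quotes in blockquote tags, authors in p tags with class a), and strip() is used in place of the [1:-1] slice to trim surrounding whitespace:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented HTML imitating the page structure used above
html = """
<html><body>
  <blockquote>In teaching others we teach ourselves. </blockquote>
  <p class="a">
Proverb
</p>
  <blockquote>Learn from yesterday, live for today, hope for tomorrow. </blockquote>
  <p class="a">
Unknown Source
</p>
  <p class="footer">Not an author</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Extract the text of every matching tag, trimming surrounding whitespace
quotes = [tag.text.strip() for tag in soup.find_all('blockquote')]
authors = [tag.text.strip() for tag in soup.find_all('p', class_='a')]

quotes_df = pd.DataFrame({'Quote': quotes, 'Author': authors})
print(quotes_df)
```

The p tag with class footer is ignored because find_all() only returns the tags whose class matches.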

#### A special case: scraping tables

For example, if you want to collect the table of additives as described in this Open Food Facts webpage, you can call pd.read_html() with the URL as input:

In [35]:
tables = pd.read_html("https://world.openfoodfacts.org/additives")
print(len(tables))  # 1 
print(tables[0].head())

1
                   Additive  Products   * Risk
0        E330 - Citric acid    128275 NaN  NaN
1          E322 - Lecithins     87168 NaN  NaN
2          E322i - Lecithin     80061 NaN  NaN
3  E500 - Sodium carbonates     54621 NaN  NaN
4        E415 - Xanthan gum     48802 NaN  NaN


In [36]:
tables = pd.read_html("https://en.wikipedia.org/wiki/World_record_progression_50_metres_freestyle")
print(len(tables))  # 9

9


In [37]:
print(tables[4].head())

   Pos   Time                   Swimmer              Date          Venue
0    1  20.91         Cesar Cielo (BRA)  17 December 2009         Brazil
1    2  20.94  Frederick Bousquet (FRA)     22 April 2009         France
2    3  21.04      Caeleb Dressel (USA)      27 July 2019    South Korea
3    4  21.11      Benjamin Proud (GBR)     3 August 2018  Great Britain
4    5  21.19       Ashley Callus (AUS)  26 November 2009      Australia


In [38]:
print(tables[-2].head())

   Pos  Swimmer                       Time              Date          Venue
0    1    22.93  Ranomi Kromowidjojo (NED)     7 August 2017        Germany
1    2    23.00       Sarah Sjöström (SWE)     7 August 2017        Germany
2    3    23.19        Cate Campbell (AUS)   27 October 2017         Russia
3    4    23.25     Marleen Veldhuis (NED)     13 April 2008  Great Britain
4    5    23.27    Therese Alshammar (SWE)  21 November 2009      Singapore


**And if you're only interested in tables mentioning "Switzerland", the match parameter is made exactly for that:**



In [39]:
tables = pd.read_html("https://en.wikipedia.org/wiki/World_record_progression_50_metres_freestyle", match="Switzerland")
print(len(tables))  # 1
print(tables[0][10:15][['Time', 'Name', 'Nationality']])

1
     Time          Name    Nationality
10  22.54   Robin Leamy  United States
11  22.52  Dano Halsall    Switzerland
12  22.40     Tom Jager  United States
13  22.33   Matt Biondi  United States
14  22.33   Matt Biondi  United States


# Getting data from the web using APIs

In its simplest form, an API is a contract between two parties saying that if one party provides input in a pre-defined format, then the other party will provide a pre-defined output. An API basically allows two pieces of code to interact with each other. When it comes to getting data from the web, APIs are extremely useful. Sites like LinkedIn, Reddit, Twitter, and Facebook all offer certain data through their APIs.


When using an API we are essentially making a request to a remote web server to retrieve the data that we need. The way this request is implemented can vary based on the type of API. The most popular APIs are

- SOAP - Simple Object Access Protocol
- REST - Representational State Transfer

In SOAP, requests and responses are exchanged as XML messages, while REST requests are usually plain HTTP requests. REST tends to be the far more popular choice. In this unit, we will go through an example with the strava.com API, which uses REST. To achieve this, we will learn how to make HTTP requests with the Python Requests library.

You might wonder what exactly the benefit of using an API is over just scraping the data we need directly. Well, as we mentioned in our last unit, scraping the data might be illegal in certain cases. Public APIs can provide easier, faster (and legal!) data retrieval than web scraping. They are also great for dealing with cases where the data is changing quickly (e.g. stock prices), or when you want only specific aspects or subsets of the data.

#### Use case: Strava API

Known as the athlete’s social network, Strava is a place where you can record all of your athletic activities, share them with your friends, and compete for glory by claiming the fastest times on local segments.

In [40]:
import json

# Load credentials
with open('client_credentials.json') as file:
    client_credentials = json.load(file)



In [41]:
from IPython.display import Image
Image(url= "https://d7whxh71cqykp.cloudfront.net/uploads/image/data/4113/APIs.png")

In [42]:
print('Credentials:', list(client_credentials.keys())) 

Credentials: ['client_id', 'client_secret']


In [43]:
print(client_credentials['client_id']) # Client ID

61535


#### Authorize the app
The Strava API is made for applications that need to access user data from Strava. In this scenario, the app (client) has to be authorized by the users to access and manage their Strava data.

In our case, we analyze our own data, but the process is the same - we need to explicitly authorize our app to access and manage our data. Let's see how to do this.

Authorization is done in the browser via a URL that points to the website's Authorization Service. For Strava, it is:

In [44]:
from IPython.display import Image
Image(url= "https://d7whxh71cqykp.cloudfront.net/uploads/image/data/4115/authorization-service.png")

According to the Strava Authentication API reference, we need to provide the following parameters:

- client_id, which identifies our application
- scope, which defines the requested scope/rights for our application, e.g. read user profile, read user activities

When the user authorizes our application via the Authorization Service, Strava sends the user back to some redirect_uri with an Access Code stored in the GET parameters. We will use this code later when accessing this user's data.

This Redirect URI is usually a link to our application website or a related app, e.g. a mobile app. Since we don't have one in our case, we will set it to https://localhost, which is the local address of the computer - we will come back to this point below.

Let's store all the required GET parameters in a Python dictionary

In [45]:
oauth_params = {
    'client_id': client_credentials['client_id'],
    'scope': 'read_all,profile:read_all,activity:read_all',
    'redirect_uri': 'https://localhost',
    'response_type': 'code'
}

We get the client_id from the JSON file and set the scope to get full read access to the user data. Also, as specified in the Strava API Authentication Reference, we need to set the response_type parameter to 'code' to explicitly say that Strava should return an Access Code for this user's data.

Let's now create the authorization link with the GET syntax that we saw above. In practice, URLs follow strict rules and we typically use a function to build them. In Python, we can use the urlencode() function from the urllib.parse module of the standard library:

In [46]:
from urllib.parse import urlencode

# Generate link that users can copy/paste in their browser to authorize our app
print('https://www.strava.com/oauth/authorize' + '?' + urlencode(oauth_params))

https://www.strava.com/oauth/authorize?client_id=61535&scope=read_all%2Cprofile%3Aread_all%2Cactivity%3Aread_all&redirect_uri=https%3A%2F%2Flocalhost&response_type=code


or, hard-coded:

https://www.strava.com/oauth/authorize?
  client_id=....&
  scope=read_all%2Cprofile%3Aread_all%2Cactivity%3Aread_all&
  redirect_uri=https%3A%2F%2Flocalhost&
  response_type=code

As we can see, Strava returned three GET parameters:

- code which is the Access Code needed to retrieve the user data
- scope which lists the scopes accepted by the user
- state which is an optional flag from the Strava API used for security reasons - not important in our case since we're not really building an app but just doing some data analysis

**The important parameter here is the user Access Code. Let's extract it from the URL.**

*Again, the access code is private and shouldn't be shared via our notebook. One solution is to extract it automatically from the URL. First, let's load the URL into memory with the getpass() function from Python's standard getpass module, which hides the input as it is typed*

In [47]:
from getpass import getpass

# After authorizing the app, user is redirected to
authorization_response = getpass(prompt='Full callback URL')

Full callback URL ······························································································································


In [48]:
from urllib.parse import urlparse, parse_qs

# Extract Authorization Code from URL
authorization_code = parse_qs(urlparse(authorization_response).query)['code'][0]

In [49]:
urlparse(authorization_response).query

'state=&code=183ad10fbb76a7531f9d6fae0cfa4b54a33d6b40&scope=read,activity:read_all,profile:read_all,read_all'

In [50]:
parse_qs(urlparse(authorization_response).query)

{'code': ['183ad10fbb76a7531f9d6fae0cfa4b54a33d6b40'],
 'scope': ['read,activity:read_all,profile:read_all,read_all']}
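
Notice that state appears in the raw query string but not in the parsed dictionary: parse_qs() drops parameters with blank values unless keep_blank_values=True is passed. A short sketch with a made-up callback URL (the code value is a placeholder, not a real Access Code):

```python
from urllib.parse import urlparse, parse_qs

# Made-up callback URL; the code value is a placeholder
callback_url = 'https://localhost/?state=&code=abc123&scope=read,activity:read_all'

query = urlparse(callback_url).query

# Default behaviour: the empty state parameter is dropped
print(parse_qs(query))

# Keep empty parameters as empty strings
print(parse_qs(query, keep_blank_values=True))

# Each value is a list because a parameter may be repeated, hence the [0]
code = parse_qs(query)['code'][0]
print(code)
```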

In [51]:
print(authorization_code)

183ad10fbb76a7531f9d6fae0cfa4b54a33d6b40


#### Get access token

Now that our app has been authorized to access data from the user via the **Access Code**, we need to retrieve an **Access Token** from Strava to actually perform API queries and access user resources.

In [52]:
from IPython.display import Image
Image(url= "https://d7whxh71cqykp.cloudfront.net/uploads/image/data/4118/strava-api.png")

Why do we need an Access Token? Isn't the Access Code sufficient? This actually depends on the API. Strava implements the OAuth 2.0 protocol where the app first needs to get some Access Code before getting the final Access Token that can be used to retrieve data. However, you might work later with other APIs that directly provide the Access Token. For instance, this was the case with the Strava API before October 2018 when they adopted the OAuth 2.0 standard.

*APIs can have very different implementations. When working with a new API, you will likely need to first check the API documentation and potentially search online for additional examples*

You can think of GET requests as the addresses that appear in the browser URL field at the top of the window. They are used to get content from the web and are not meant to send data to a web service, apart from short GET variables as we saw above. POST requests, on the other hand, are used to send (post) data to a web service, data that will typically be stored by it. In our API scenario, which exposes entries of user data, we will typically use:

- GET queries to list the entries
- POST queries to create new entries
- PUT queries to update existing entries
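
To make the difference concrete, the sketch below builds (but does not send) a GET and a POST request with the standard library. The URL https://example.com/entries and the parameters are placeholders, not part of any real API:

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'name': 'Morning run', 'distance': 5000}

# GET: the variables travel in the URL itself, after the question mark
get_request = Request('https://example.com/entries?' + urlencode(params), method='GET')

# POST: the variables travel in the request body, the URL stays clean
post_request = Request(
    'https://example.com/entries',
    data=urlencode(params).encode('utf-8'),
    method='POST',
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```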

We can use the Python Requests library to perform those three types of queries via its .get(), .post() and .put() functions. Let's see how to make the POST request that retrieves the access token:

In [53]:
import requests

# Exchange Authorization Code for Access Token
r = requests.post('https://www.strava.com/oauth/token', data={
    'client_id': client_credentials['client_id'],
    'client_secret': client_credentials['client_secret'],
    'code': authorization_code,
    'grant_type': 'authorization_code'
})
r.status_code # 200

200

In [54]:
print(r.text)

{"token_type":"Bearer","expires_at":1613493657,"expires_in":9947,"refresh_token":"7e3e8aa8f6944722ef5f48525301ac822a94a6ab","access_token":"a20219742bfc4baccfddf350ba8b48262bf7b7c6","athlete":{"id":78376368,"username":null,"resource_state":2,"firstname":"Lyes","lastname":"Oudinache","city":"Genève","state":"Genève","country":"Switzerland","sex":"M","premium":false,"summit":false,"created_at":"2021-02-11T13:24:02Z","updated_at":"2021-02-11T13:25:28Z","badge_type_id":0,"profile_medium":"https://d3nn82uaxijpm6.cloudfront.net/assets/avatar/athlete/medium-bee27e393b8559be0995b6573bcfde897d6af934dac8f392a6229295290e16dd.png","profile":"https://d3nn82uaxijpm6.cloudfront.net/assets/avatar/athlete/large-800a7033cc92b2a5548399e26b1ef42414dd1a9cb13b99454222d38d58fd28ef.png","friend":null,"follower":null}}


In [55]:
r.json()

{'token_type': 'Bearer',
 'expires_at': 1613493657,
 'expires_in': 9947,
 'refresh_token': '7e3e8aa8f6944722ef5f48525301ac822a94a6ab',
 'access_token': 'a20219742bfc4baccfddf350ba8b48262bf7b7c6',
 'athlete': {'id': 78376368,
  'username': None,
  'resource_state': 2,
  'firstname': 'Lyes',
  'lastname': 'Oudinache',
  'city': 'Genève',
  'state': 'Genève',
  'country': 'Switzerland',
  'sex': 'M',
  'premium': False,
  'summit': False,
  'created_at': '2021-02-11T13:24:02Z',
  'updated_at': '2021-02-11T13:25:28Z',
  'badge_type_id': 0,
  'profile_medium': 'https://d3nn82uaxijpm6.cloudfront.net/assets/avatar/athlete/medium-bee27e393b8559be0995b6573bcfde897d6af934dac8f392a6229295290e16dd.png',
  'profile': 'https://d3nn82uaxijpm6.cloudfront.net/assets/avatar/athlete/large-800a7033cc92b2a5548399e26b1ef42414dd1a9cb13b99454222d38d58fd28ef.png',
  'friend': None,
  'follower': None}}

This response contains all the information we need to retrieve the data from the Strava API:

- The access_token - we will use it in all of our API requests
- expires_at and expires_in which specify when the token expires
- A refresh_token to get a new Access Token when this one expires
- Strava also sends some basic information about the user in an athlete field

Let's store the token as we will need it later when interacting with the Strava API. One solution is to store it in a .json file that we can later reload in our code and notebooks. We can do this via a token_saver() function that takes the JSON object, creates a token.json file and saves it with the json.dump() function

In [56]:
# Token saver
def token_saver(token_obj):
    with open('token.json', 'w') as file:
        json.dump(token_obj, file, indent=4)

token_saver(r.json())

#### Refreshing Tokens

In [57]:
# Token loader
def get_token():
    with open('token.json', 'r') as file:
        return json.load(file)

token = get_token()
token.keys() 

dict_keys(['token_type', 'expires_at', 'expires_in', 'refresh_token', 'access_token', 'athlete'])

In [58]:
print('Expires in:', token['expires_in']) # initially: 21600 (6 hours)
print('Expires at:', token['expires_at']) # in seconds

Expires in: 9947
Expires at: 1613493657


In [59]:
from datetime import datetime, timedelta

print('Expires at:', datetime.fromtimestamp(token['expires_at'])) # date, time
print('Expires in:', timedelta(seconds=token['expires_in'])) # time delta

Expires at: 2021-02-16 17:40:57
Expires in: 2:45:47
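
Since expires_at is a Unix timestamp in seconds, a small helper can decide whether a refresh is needed before calling the API. This is only a sketch: the function name is_expired and the 60-second safety margin are assumptions, not part of the Strava API:

```python
import time

def is_expired(token, now=None, margin=60):
    """Return True if the token expires within `margin` seconds of `now`."""
    if now is None:
        now = time.time()  # current Unix time in seconds
    return token['expires_at'] - now <= margin

# Example with the expiry timestamp shown above
example_token = {'expires_at': 1613493657}
print(is_expired(example_token, now=1613493657 - 3600))  # one hour before expiry
print(is_expired(example_token, now=1613493657 - 30))    # 30 seconds before expiry
```

The refresh request would then only be sent when the helper returns True.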


In [60]:
# Refresh expired Access Tokens
r = requests.post('https://www.strava.com/oauth/token', data={
    'client_id': client_credentials['client_id'],
    'client_secret': client_credentials['client_secret'],
    'refresh_token': token['refresh_token'],
    'grant_type': 'refresh_token'
})
token_saver(r.json())
token = get_token()

#### Reach API using Python Requests

Finally, let's get some data from the API. Strava exposes several endpoints such as

- /activities to add, retrieve and update the athlete activities
- /clubs to list the athlete clubs
- /routes, /segments and so on

APIs usually provide an **API Reference Documentation** that lists the different endpoints and explains how they work. For Strava, you can take a look at developers.strava.com/docs/reference. For this example, we will simply list some of the athlete's activities with the /athlete/activities endpoint.

As you can see in the related documentation entry List Athlete Activities, this endpoint accepts GET requests and has a few optional parameters, e.g. to specify the date interval or the desired number of results. For this example, we will use the defaults and only specify the Access Token:

In [61]:
# List activities
r = requests.get('https://www.strava.com/api/v3/athlete/activities', params={
    'access_token': token['access_token']
})
r.status_code # 200

200
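
The optional parameters mentioned above are passed the same way, as extra entries in params. Below is a sketch of how such a dictionary might be built, assuming the after and per_page parameter names from the List Athlete Activities documentation (worth double-checking against the current reference). The token value is a placeholder:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Only ask for activities after 1 January 2021 (UTC), expressed as a Unix timestamp
after = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp())

params = {
    'access_token': 'PLACEHOLDER_TOKEN',  # placeholder, not a real token
    'after': after,
    'per_page': 50,                       # number of results per page
}

# Query string corresponding to these parameters
print(urlencode(params))
```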

In [62]:
# Save activities
with open('activities.json', 'w') as file:
    json.dump(r.json(), file, indent=4)

In [63]:
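# Load data into DataFrame
# (the JSON text is wrapped in StringIO because recent pandas versions expect a file-like object)
from io import StringIO
activities_df = pd.read_json(StringIO(r.text))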
activities_df[['name', 'type', 'distance', 'elapsed_time', 'max_speed']]

Unnamed: 0,name,type,distance,elapsed_time,max_speed
0,Course à pied l'après-midi,Run,21000,6397,0


#### With requests_oauthlib

In the example above, we saw how to reach the Strava API by making all the required requests manually with the Python Requests .get() and .post() methods, which should give you a good overview of interacting with a web service in general. However, we mentioned that Strava implements OAuth 2.0, which is a very common protocol for APIs, and we can easily find Python libraries that simplify the entire process a bit. Let's quickly see how using them would change our code!

**For this example, we will use requests-oauthlib, a popular Python library (> 1,000 stars on GitHub) that brings OAuth support to Python Requests, i.e. the library that we used above to manually perform GET/POST requests.** Let's first install it in the course environment: open a new terminal window, activate the environment and install the library. We can then import its session class:

In [64]:
from requests_oauthlib import OAuth2Session

The first step is to create an OAuth2Session object with the client information, redirect URI and requested scope

In [65]:
# 1 Create a session for initialization
init_session = OAuth2Session(
    client_credentials['client_id'],
    redirect_uri='https://localhost',
    scope='read_all,profile:read_all,activity:read_all'
)

In [66]:
# 2 Get authorization link
user_link, state = init_session.authorization_url('https://www.strava.com/oauth/authorize')
print('Visit link:', user_link)

Visit link: https://www.strava.com/oauth/authorize?response_type=code&client_id=61535&redirect_uri=https%3A%2F%2Flocalhost&scope=read_all%2Cprofile%3Aread_all%2Cactivity%3Aread_all&state=CCv2znWH82GqF7DL52V6QQ3uiWIv6D


In [67]:
authorization_response = getpass(prompt='Full callback URL')

Full callback URL ····························································································································································


In [68]:
# 3 Get Access Token
token = init_session.fetch_token(
    'https://www.strava.com/oauth/token',
    authorization_response=authorization_response,
    include_client_id=True,
    client_secret=client_credentials['client_secret']
)

In [69]:
token_saver(token)

In [70]:
# 4 Create a session for reaching the API
api_session = OAuth2Session(
    client_credentials['client_id'],
    token=token, # pass Access Token

    # Automatically refresh expired token
    auto_refresh_url='https://www.strava.com/oauth/token',
    auto_refresh_kwargs={
        'client_id': client_credentials['client_id'],
        'client_secret': client_credentials['client_secret']
    },
    token_updater=token_saver # automatically saves new tokens
)

The object also provides a way to automatically refresh expired tokens with a few additional parameters:

- auto_refresh_url: the URL of the token service
- auto_refresh_kwargs: additional values to pass to the service when needed, e.g. Strava requires the client_id and client_secret in our case
- token_updater: a function that automatically saves the new token
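A function passed as token_updater receives the new token dictionary each time the session refreshes it. As a minimal sketch of what such a function could do, the code below saves the token to a JSON file and reads it back. The names save_token and load_token, the file name and the token values are all assumptions for illustration; the token_saver function used in this notebook may work differently.

```python
import json
import os
import tempfile


def save_token(token, path):
    """Write the token dictionary to a JSON file."""
    with open(path, 'w') as f:
        json.dump(token, f)


def load_token(path):
    """Read a previously saved token, or return None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


# Made-up token values for illustration only
example_token = {'access_token': 'abc', 'refresh_token': 'def', 'expires_at': 1700000000}
token_path = os.path.join(tempfile.mkdtemp(), 'strava_token.json')

save_token(example_token, token_path)
restored = load_token(token_path)
print(restored == example_token)
```

With helpers like these, the session could be given token_updater=lambda t: save_token(t, token_path), so that every refreshed token is written to disk.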

We can now use this new api_session object to make requests

In [71]:
# 5 List activities
r = api_session.get('https://www.strava.com/api/v3/athlete/activities')
r.status_code # 200

200

In [72]:
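from io import StringIO

# Wrap the text in StringIO: recent pandas versions deprecate passing a raw JSON string
activities_df = pd.read_json(StringIO(r.text))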
activities_df[['name', 'type', 'distance', 'elapsed_time', 'max_speed']]


Unnamed: 0,name,type,distance,elapsed_time,max_speed
0,Course à pied l'après-midi,Run,21000,6397,0
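The distance and elapsed_time fields are raw numbers, in meters and seconds respectively. As a small sketch of how they can be turned into more readable quantities, the code below computes the average speed and pace for the activity shown above. The record is copied from the output, the summarize helper is hypothetical, and a real response contains many more fields.

```python
# Sample record copied from the output above
activities = [
    {'name': "Course à pied l'après-midi", 'type': 'Run',
     'distance': 21000, 'elapsed_time': 6397, 'max_speed': 0},
]


def summarize(activity):
    """Convert raw meters and seconds into km, km/h and min/km."""
    distance_km = activity['distance'] / 1000
    hours = activity['elapsed_time'] / 3600
    pace_seconds = activity['elapsed_time'] / distance_km  # seconds per km
    minutes, seconds = divmod(round(pace_seconds), 60)
    return {
        'name': activity['name'],
        'distance_km': distance_km,
        'speed_kmh': round(distance_km / hours, 2),
        'pace': f'{minutes}:{seconds:02d} min/km',
    }


summary = summarize(activities[0])
print(summary['distance_km'], summary['speed_kmh'], summary['pace'])
# 21.0 11.82 5:05 min/km
```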


#### With custom libraries - stravalib

You might also find custom Python libraries for popular APIs. A few examples are

- Tweepy for the Twitter API
- facebook-sdk for Facebook's Graph API
- praw for Reddit API

For Strava, there is stravalib which implements the latest API interface (v3). Let's see how the workflow changes.

First, we need to install the library. The package can be easily installed in the course environment with pip

**Note** - What's the difference between pip and conda? In short, both are package managers. Pip is the default Python one, while Conda is developed by Anaconda. Conda makes package installation much easier because it verifies that package versions are compatible with each other and installs them in an isolated Conda environment. Most data science packages are available in the Anaconda repository, but some are only available on the Python Package Index (PyPI), which pip uses. Conda handles this scenario: pip packages installed inside a Conda environment remain local to that environment. However, we should prefer installing Conda packages when available. You can read more about the differences in Understanding Conda and Pip.

The authentication workflow is very similar to requests-oauthlib. We first create a Client object and use one of its methods to get the Authorization Link

In [74]:
from stravalib import Client

# Create client
client = Client()

# Get Authorization URL
user_link = client.authorization_url(
    client_id=client_credentials['client_id'],
    redirect_uri='https://localhost',
    scope=['read_all', 'profile:read_all', 'activity:read_all']
)
print('Visit link:', user_link)

Visit link: https://www.strava.com/oauth/authorize?client_id=61535&redirect_uri=https%3A%2F%2Flocalhost&approval_prompt=auto&response_type=code&scope=read_all%2Cprofile%3Aread_all%2Cactivity%3Aread_all


In [75]:
authorization_response = getpass(prompt='Full callback URL')
authorization_code = parse_qs(urlparse(authorization_response).query)['code'][0]


Full callback URL ······························································································································


In [76]:
# Get Access Token
token = client.exchange_code_for_token(
    client_id=client_credentials['client_id'],
    client_secret=client_credentials['client_secret'],
    code=authorization_code)
token_saver(token)

In [78]:
import time

# Refresh token if necessary
if time.time() > token['expires_at']:
    token = client.refresh_access_token(
        client_id=client_credentials['client_id'],
        client_secret=client_credentials['client_secret'],
        refresh_token=token['refresh_token'])
    token_saver(token)
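The check above refreshes the token only once it has already expired. A token that expires a few seconds after the check could still fail on the next request, so a common variation is to refresh slightly early. Here is a minimal sketch, assuming the token dictionary has an expires_at entry in seconds since the epoch as above; the helper needs_refresh and the 60-second margin are illustrative choices.

```python
import time


def needs_refresh(token, margin_seconds=60, now=None):
    """Return True if the token is expired or expires within the margin."""
    now = time.time() if now is None else now
    return now >= token['expires_at'] - margin_seconds


# Made-up expiry timestamp for illustration only
example_token = {'expires_at': 1000}
print(needs_refresh(example_token, now=900))  # False: still valid for 100 seconds
print(needs_refresh(example_token, now=950))  # True: expires within the margin
```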

The main advantage of stravalib is that it implements the different API calls as methods of the client object: requests are performed behind the scenes. For instance, to get the last five activities, we can write

In [None]:
# Get activities
activities = client.get_activities(limit=5)
activities # <BatchedResultsIterator entity=Activity>


The method returns an iterator that we can use to iterate over the results

In [81]:
for activity in activities:
    print(activity)

# <Activity id=... name='...' resource_state=..>
# ..

<Activity id=4795989348 name="Course à pied l'après-midi" resource_state=2>
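To give an idea of how an iterator can hide the requests made behind the scenes, here is a minimal sketch of a generator that fetches results page by page and stops once a limit is reached. It is not stravalib's actual implementation: iter_batched, fake_fetch_page and the sample data are hypothetical stand-ins for a paginated API endpoint.

```python
def iter_batched(fetch_page, per_page=2, limit=None):
    """Yield items page by page, stopping at `limit` items or the first empty page."""
    yielded = 0
    page = 1
    while True:
        batch = fetch_page(page, per_page)  # one "request" per page
        if not batch:
            return
        for item in batch:
            if limit is not None and yielded >= limit:
                return
            yield item
            yielded += 1
        page += 1


# Stand-in for an API endpoint: five fake activity names served in pages
FAKE_ACTIVITIES = ['run 1', 'ride 1', 'run 2', 'swim 1', 'run 3']
requested_pages = []


def fake_fetch_page(page, per_page):
    requested_pages.append(page)  # record which pages were requested
    start = (page - 1) * per_page
    return FAKE_ACTIVITIES[start:start + per_page]


first_three = list(iter_batched(fake_fetch_page, per_page=2, limit=3))
print(first_three)      # ['run 1', 'ride 1', 'run 2']
print(requested_pages)  # [1, 2]
```

Only the pages needed to reach the limit are requested, which is why such iterators are convenient for APIs that return long lists.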


As we can see, stravalib automatically parses the API JSON responses into Python objects that we can use to easily work with the data. For instance, we can print the details for the first activity with

In [82]:
a = list(activities)[0] # Get the first activity

print('Activity name:', a.name)
print('Distance:', a.distance)
print('Athlete name:', a.athlete.firstname)
print('Average heart rate:', a.average_heartrate)

Activity name: Course à pied l'après-midi
Distance: 21000.00 m
Athlete name: None
Average heart rate: None


### Summary

In this unit, we went through a concrete example of using APIs and discussed different ways to retrieve data from them. Here are a few takeaways:

- APIs can require some form of Authorization
- Communication with APIs is usually done via Access Tokens
- HTTP requests can be performed with the Python Requests library

As you gain more experience with data science projects, you will most likely come across many different APIs. While the details of each implementation will be different, the general framework above (registering an app, getting a personal token, and setting up the connection using the specific objects and methods found in the documentation) should give you a good starting point.