<font style='font-size:1.5em'>**🧑‍🏫 Week 08 Lecture**</font><br>
<font style='font-size:1.3em;color:#888888'>NOTEBOOK 01: Collecting data from an API that requires authentication</font>

<font style='font-size:1.2em;color:#e26a4f;font-weight:bold'>LSE DS105A – Data for Data Science (2024/25) </font>



<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 21 November 2024 

⌚ **TIME:** 16.00-18.00

📍 **LOCATION:** CLM.5.02
</div>


**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: In Weeks 08 and 09, we will revisit the data science workflow (collection -> storage -> processing -> analysis -> visualization) with new tools and techniques. We will learn how to collect data from APIs that require authentication and revisit the notion of API endpoints. Then, once we have collected the data, we will learn how to store it in a more structured way using databases.

<details style="width:70%;font-size:1em;border: 1px solid #aaa;border-radius: 4px;padding: .5em;margin-left:0em"><summary style="font-weight:bold">🖇️ EXPAND FOR USEFUL LINKS</summary>

- Python 3's [`venv` module documentation](https://docs.python.org/3/library/venv.html)

- W3 Schools' [HTTP Request Methods](https://www.w3schools.com/tags/ref_httpmethods.asp) page

- [Reddit API documentation](https://www.reddit.com/dev/api/)
- [Reddit API Rules](https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki)

- The [JSON Crack Extension](https://marketplace.visualstudio.com/items?itemName=AykutSarac.jsoncrack-vscode) for VS Code to visually inspect JSON files.

- 🐼 pandas' [`pd.json_normalize()` function documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html)

- [The `python-dotenv` library](https://pypi.org/project/python-dotenv/)
- [What is the gitignore file?](https://www.atlassian.com/git/tutorials/saving-changes/gitignore)


Not covered here but relevant to your upcoming assignment:

- [Spotify API documentation](https://developer.spotify.com/documentation/web-api/)
- [Spotify Getting Started Guide](https://developer.spotify.com/documentation/web-api/quick-start/)

</details>


---

**⚙️ SETUP**

Before you continue, set up your Python environment.

<details style="width:70%;font-size:0.85em;border: 1px solid #aaa;border-radius: 4px;padding: .5em;margin-left:0em"><summary style="font-weight:bold">🔧 Click here for virtual environment setup instructions</summary>

It all depends on whether you have conda installed or not. If you type `conda` on your terminal and it says "command not found," then you probably don't have it installed. In that case, you can use Python's built-in `venv` module to create a virtual environment.

1. If you already have conda installed: 

    - you can create a new environment with the following command:

        ```bash
        conda create -n .venv python
        ```

    - Then, activate the environment:

        ```bash
        conda activate .venv
        ```

2. Otherwise, let's use `venv`. 

    - On the command line, run the following commands:

        ```bash
        cd /path/to/ds105a-2024 # go to the root folder (not where this notebook is!)
        python -m venv .venv
        ```

    - Then, activate the virtual environment.

        If on Windows, run:

        ```powershell
        .venv\Scripts\activate
        ```

        If on MacOS or Linux, run:

        ```bash
        source .venv/bin/activate
        ```

You should see a `(.venv)` in your terminal prompt now.

<span style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;width:40%">🤔 **Think about it:** What does that mean?</span>

</details>

Then, install the required libraries:

```bash
cd /path/to/ds105a-2024 # go to the root folder (not where this notebook is!)
pip install -r requirements.txt
```

Finally, change the kernel of this notebook to the virtual environment you just created. (Go to the button on the top right, click on it, and select the kernel you just created.)
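
If you are unsure whether the notebook kernel is really using the virtual environment, you can ask Python where it is running from. Below is a minimal sketch using only the standard library. The helper name `in_virtual_env` is made up for illustration, and the check only detects environments created with `venv` (a conda environment would typically report `False`, so look at the interpreter path instead).

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


# The interpreter path should contain `.venv` if the right kernel is selected
print("Interpreter:", sys.executable)
print("Inside a venv-style environment?", in_virtual_env())
```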

\#TODO: I will eventually move these instructions to the README and make them universal for all notebooks.

---

In [1]:
import os
import requests

import pandas as pd

from dotenv import load_dotenv
from tqdm.notebook import tqdm
tqdm.pandas()

# 1. The Reddit API

In this lecture, I will show you how to collect data from an authenticated API, using Reddit as a case study. You are not required to create a Reddit account if you don’t want to. Pay close attention to my explanations and demonstrations, and think about how you can use these same methods with the Spotify API in the future.

<div style="color: #333333; background-color:#ffffff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 10px 20px 5px 20px; margin: 10px 0 10px 10px; flex: 1 1 calc(65% - 20px);min-width: 250px;max-width: 450px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

Reddit is a social media platform where users can post links, text, images, and videos, which others can upvote, downvote, or comment on. Reddit consists of communities called subreddits, each focused on a specific topic. For example, [r/datascience](https://reddit.com/r/datascience) is dedicated to data science, while [r/aww](https://reddit.com/r/aww) features cute animals. Each subreddit has moderators who enforce its specific rules.

</div>

<details style="width:70%;font-size:1em;border: 1px solid #aaa;border-radius: 4px;padding: .5em;margin-left:0em"><summary>🔵 Click here if you want to set up a Reddit developer account </summary>


If you want to replicate the analysis in this notebook, you will need to:

- Create a [Reddit account](https://www.reddit.com/register/) (or reuse the one you already have)
- Then, follow these [First Steps](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example#first-steps) to create an app and get your credentials.
- Take note of your Reddit username and password, as well as the client ID and client secret of the app you created:

    ![](../figures/reddit/screenshot_reddit_app_details.png)

</details>



👉🏻 Let's browse the [Reddit API documentation](https://www.reddit.com/dev/api/) to see what's in there. 

Listen closely as I explain the different API endpoints and comment on the **pagination** decisions made by the developers of the API.
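
To make the idea of pagination concrete, here is a rough sketch of cursor-based pagination, the style Reddit uses for its listings (each page carries an `after` value that points to the next page). Everything in it is a simplification for illustration: `FAKE_PAGES`, `fake_fetch_page` and `iter_all_items` are made-up names, the pages live in memory instead of coming from HTTP requests, and the items are plain strings rather than the nested objects a real response would contain.

```python
from typing import Callable, Iterator, Optional

# Three made-up pages, loosely shaped like a Reddit "Listing" response.
FAKE_PAGES = {
    None: {"data": {"children": ["post_1", "post_2"], "after": "cursor_a"}},
    "cursor_a": {"data": {"children": ["post_3", "post_4"], "after": "cursor_b"}},
    "cursor_b": {"data": {"children": ["post_5"], "after": None}},
}


def fake_fetch_page(after: Optional[str]) -> dict:
    """Stand-in for a real HTTP request: return the page for a given cursor."""
    return FAKE_PAGES[after]


def iter_all_items(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Yield items page by page until the API stops returning a cursor."""
    after = None  # no cursor means "give me the first page"
    while True:
        page = fetch_page(after)
        yield from page["data"]["children"]
        after = page["data"]["after"]
        if after is None:  # no further pages
            break


all_items = list(iter_all_items(fake_fetch_page))
print(all_items)
```

With a real API, `fake_fetch_page` would be replaced by a function that sends the cursor as a query parameter and returns the parsed JSON.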

## 1.1. Your credentials are sensitive information

Unlike OpenMeteo, Reddit requires authentication to access its API. Before we can make any requests, we need to provide several sensitive pieces of information:

- your Reddit username (_do you want people to know it?_)
- your Reddit password, in plain text (_do you want people to know it?_)
- your Reddit app's client ID (_do you want anyone to send requests on your behalf?_)
- your Reddit app's client secret (_do you want anyone to send requests on your behalf?_)

If I leave this information in the notebook (on GitHub, especially), anyone reading it can impersonate me and send requests to Reddit on my behalf, which is a **serious security risk**.

**NEVER leave your credentials anywhere in your GitHub repository or notebook!**

🔊 Louder for those in the back:

<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0px 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 750px;align-items:top;min-height: calc(75% - 20px); box-sizing: border-box;font-size:2.5em;">

☣️ **NEVER leave your credentials anywhere in your GitHub repository or notebook!** ☣️

Even if you delete it afterwards, it will still be in the repository's history, and ANYONE can access it.
</div>


## 1.2. Using the `python-dotenv` library

I saved those credentials in a `.env` file, which I can load into this notebook using the `python-dotenv` library:

In [2]:
# Load the .env file
load_dotenv()

True

Once the `.env` file is loaded, the credentials are available as environment variables in the `os.environ` dictionary, which keeps them out of the notebook's code and output.

We can then use `os.getenv()` to retrieve each value and pass it to the Reddit API without ever displaying it.

```python
# If I were to run this code, I would expose my credentials to everyone
# as it would be saved in the notebook's output and in the repository's history forever (if I commit it).
os.getenv('REDDIT_USERNAME')
```
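
A safer way to confirm that the `.env` file was picked up is to check which variables are missing, without ever printing their values. The sketch below assumes the four variable names used in this notebook; the helper `missing_env_vars` is a made-up name for illustration.

```python
import os

# The .env file is assumed to contain one KEY=value pair per line, e.g.:
#   REDDIT_USERNAME=your_username
#   REDDIT_PASSWORD=your_password
REQUIRED_VARS = [
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
]


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty (never their values)."""
    return [name for name in names if not os.getenv(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing variables:", ", ".join(missing))
else:
    print("All credentials are loaded.")
```

Because only the variable names are printed, this cell is safe to run and commit.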

## 1.3. Getting an access token

Having a username, password, client ID, and client secret is not enough to get data from Reddit!

You still need to send a first request to the API to get an access token. This token is a string that you will pass in the headers of all subsequent requests to prove that you are who you say you are, and it has an expiration date.

**Set up the credentials before sending the request**


In [10]:
# We will still use the requests library, only this time we have to set up authentication parameters first
client_auth = requests.auth.HTTPBasicAuth(os.getenv("REDDIT_CLIENT_ID"), os.getenv("REDDIT_CLIENT_SECRET"))
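
If you are curious about what HTTP Basic authentication does under the hood, the sketch below builds the same kind of `Authorization` header by hand. It is for illustration only: `basic_auth_header` is a made-up helper, the credentials are placeholders, and in practice you should keep using `HTTPBasicAuth`, which, as far as I know, produces an equivalent header for plain ASCII credentials.

```python
import base64


def basic_auth_header(client_id: str, client_secret: str) -> str:
    """Build the value of an HTTP Basic `Authorization` header."""
    # The two values are joined with a colon and then Base64-encoded
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder values, not real credentials
print(basic_auth_header("my_client_id", "my_client_secret"))
```

Note that Base64 is an encoding, not encryption: anyone who sees the header can decode it, which is one reason these requests must go over HTTPS.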

You also need to send, via [HTTP POST](https://www.w3schools.com/tags/ref_httpmethods.asp), your Reddit username and password:

In [12]:
post_data = {"grant_type": "password", "username": os.getenv('REDDIT_USERNAME'), "password": os.getenv('REDDIT_PASSWORD')}

It is also good practice to identify yourself in the `User-Agent` header, as Reddit's documentation suggests.

In [13]:
headers = {"User-Agent": f"LSE DS105A (2024/25) API practice by {os.getenv('REDDIT_USERNAME')}"}

**Actually send the request**


In [14]:
# From their documentation, I learned this is the endpoint I need
ACCESS_TOKEN_ENDPOINT = "https://www.reddit.com/api/v1/access_token"

# This time we are sending an HTTP POST instead of an HTTP GET
response = requests.post(ACCESS_TOKEN_ENDPOINT, auth=client_auth, data=post_data, headers=headers)
response.json()

{'access_token': 'eyJhbGciOiJSUzI1NiIsImtpZCI6IlNIQTI1NjpzS3dsMnlsV0VtMjVmcXhwTU40cWY4MXE2OWFFdWFyMnpLMUdhVGxjdWNZIiwidHlwIjoiSldUIn0.eyJzdWIiOiJ1c2VyIiwiZXhwIjoxNzMyMTQ0MTA2LjYwNjc1MiwiaWF0IjoxNzMyMDU3NzA2LjYwNjc1MiwianRpIjoiVS1hY0VjOGVlME55eTF0d2hRSi1aX1RLWGRwS2ZBIiwiY2lkIjoiY0RMbjUzTTBZaUR6cEVhUkRIOGhQQSIsImxpZCI6InQyX25kejV5NzNueCIsImFpZCI6InQyX25kejV5NzNueCIsImxjYSI6MTY5OTM0ODU0NDk3NCwic2NwIjoiZUp5S1Z0SlNpZ1VFQUFEX193TnpBU2MiLCJmbG8iOjl9.ClW_LX6_u14D85tpdLdeWmYAAiMb1o89lgg0BJ5uqljLZDfdiX4I9dV5VEbXgIw5IGMs4q2n58bZ4WZ0yNTP79X5ufitKcMcSKQkJRgHq5_S04orZxjQkUI64dO-18oUQrOLBmIJAgcE-_DTcj7ZcAv_YKpBYYiLp2tj6nyVTCpdKsGXk0xLOKmO3sW73LUmS5s7PBCgR4AO7YYOJJZfyZUFew8PvIYa3A8PFi9IW9d8AgO-etDGEzhLaoZl4zlekScedKOF-XKM6pFjlxJkY1XbWGoLQfY_QY4nWpgYm9ll2-T35MQkTaEDaukxnVelz3RcyapfE5pb7iKMEn-How',
 'token_type': 'bearer',
 'expires_in': 86400,
 'scope': '*'}

If you configured everything correctly, you should get a response like this:

```json
{
    "access_token": "a_long_string_of_characters",
    "token_type": "bearer",
    "expires_in": 86400, // in seconds
    "scope": "*"
}
```
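
Since `expires_in` is given in seconds, it can be handy to turn it into an actual point in time so you know when to request a new token. This is a minimal sketch under my own assumptions: the helper names `token_expiry` and `is_expired` and the 60-second safety margin are illustrative choices, not something the Reddit API prescribes.

```python
from datetime import datetime, timedelta, timezone


def token_expiry(issued_at: datetime, expires_in: int) -> datetime:
    """Return the moment at which the token stops being valid."""
    return issued_at + timedelta(seconds=expires_in)


def is_expired(expiry: datetime, now: datetime, margin_seconds: int = 60) -> bool:
    """Treat the token as expired slightly early, to leave a safety margin."""
    return now >= expiry - timedelta(seconds=margin_seconds)


issued_at = datetime.now(timezone.utc)
expiry = token_expiry(issued_at, expires_in=86400)  # 86400 seconds = 24 hours
print("Token valid until:", expiry.isoformat())
print("Expired already?", is_expired(expiry, now=datetime.now(timezone.utc)))
```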

**Prepare a new header for future requests**

Let's store our token in a variable to use it in the next requests.

In [15]:
my_token = response.json()['access_token']
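
If the credentials are wrong, the response may not contain an `access_token` key at all, and the line above would fail with a bare `KeyError`. A slightly more defensive version could look like the sketch below. The helper name `extract_access_token` and the example payloads are made up for illustration; `invalid_grant` is one of the standard OAuth2 error codes, used here only as a plausible example.

```python
def extract_access_token(payload: dict) -> str:
    """Return the access token, or raise a readable error if it is missing."""
    if "access_token" not in payload:
        # Report only the error field; never echo credentials back
        reason = payload.get("error", "unknown error")
        raise ValueError(f"No access token in response: {reason}")
    return payload["access_token"]


# Made-up payloads for illustration
print(extract_access_token({"access_token": "abc123", "token_type": "bearer"}))

try:
    extract_access_token({"error": "invalid_grant"})
except ValueError as err:
    print(err)
```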

From now on, all my requests need to include these HTTP headers:

In [17]:
headers = {"Authorization": f"bearer {my_token}", "User-Agent": f"LSE DS105A (2024/25) API practice by {os.getenv('REDDIT_USERNAME')}"}
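
To avoid retyping this dictionary, the header construction can be wrapped in a small function. This is only a sketch: `build_auth_headers` is a made-up helper, the token and username are placeholders, and the commented-out request at the end assumes the `oauth.reddit.com` host and the `/api/v1/me` endpoint described in Reddit's OAuth2 quick start.

```python
def build_auth_headers(token: str, user_agent: str) -> dict:
    """Return the headers that every authenticated request should carry."""
    return {"Authorization": f"bearer {token}", "User-Agent": user_agent}


# Placeholder values, not a real token or username
demo_headers = build_auth_headers(
    token="abc123",
    user_agent="LSE DS105A (2024/25) API practice by some_user",
)
print(demo_headers)

# With a real token, an authenticated request could then look like this
# (not executed here, because it needs valid credentials and network access):
# response = requests.get("https://oauth.reddit.com/api/v1/me", headers=demo_headers)
```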