# Using APIs to Extract Data from a URL

An API (application programming interface) is a set of software interfaces that lets developers use third-party code or services without knowing their implementation details. For example, say you are asked to evaluate the performance of a marketing campaign for a Consumer Packaged Goods firm. You could extract data using the Twitter Search API, filter tweets that contain the campaign tagline or hashtag, and analyze the text to understand people's reactions. As another example, say you are asked to help identify upcoming technology areas. While this can be achieved by attending conferences and reading academic publications, you could also extract data on the questions being asked using the StackOverflow API and identify emerging topics using text analytics.

# Set up and load Python settings

These settings are provided by the Blueprints repository, so we use them as they are.

In [None]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master'
    os.system(f'wget {GIT_ROOT}/ch02/setup.py')

%run -i setup.py

You are working on Google Colab.
Files will be downloaded to "/content".
Downloading required files ...
!wget -P /content https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/settings.py
!wget -P /content/packages/blueprints https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/packages/blueprints/exploration.py
!wget -P /content/ch02 https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/ch02/requirements.txt

Additional setup ...
!pip install -r ch02/requirements.txt
!python -m nltk.downloader stopwords


In [None]:
%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# to print output of all statements and not just the last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# otherwise text between $ signs will be interpreted as formula and printed in italic
pd.set_option('display.html.use_mathjax', False)

# path to import blueprints packages
sys.path.append(BASE_DIR + '/packages')

In [None]:
# adjust matplotlib resolution for book version
matplotlib.rcParams.update({'figure.dpi': 200 })

# GitHub API

GitHub hosts open source projects such as Python, scikit-learn, TensorFlow, and many others. GitHub API v3 is a web-based REST API, while GitHub API v4 is a GraphQL API. GraphQL overcomes some of the drawbacks in REST. In particular, what takes several API calls in REST may only need one API call in GraphQL. But REST APIs are still much more common.

Here is the link to GitHub API details:
https://docs.github.com/en/rest/overview/resources-in-the-rest-api.

The following API call lists public repositories on GitHub. A single call returns only the first page of results.

In [None]:
import requests

# the Accept header selects the GitHub API v3 JSON media type
response = requests.get('https://api.github.com/repositories',
                        headers={'Accept': 'application/vnd.github.v3+json'})
print(response.status_code)

200


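**Some response codes and their meanings**

200 -- OK: the API call was successful

403 -- Forbidden: the server understands the request but refuses to authorize it; GitHub also returns this code when the rate limit is exceeded

422 -- Unprocessable Entity: the request was understood but is invalid, for example because a required parameter is missing

503 -- Service Unavailable: the server is not ready to handle the request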
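As a small aid for reading status codes, the sketch below looks up the standard reason phrase for a code with Python's built-in `http.HTTPStatus` enum. The helper name `describe_status` is an assumption made for illustration; it is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase' for a known HTTP status code (hypothetical helper)."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # not a status code that the standard library knows about
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"

# the four codes discussed above
for code in (200, 403, 422, 503):
    print(describe_status(code))
```

In the notebook you could call `describe_status(response.status_code)` right after a request.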

The `headers` attribute of the response is a case-insensitive dictionary that contains more detailed information such as the server name, the response timestamp, the content type, and so on.

In [None]:
print (response.encoding)
print (response.headers['Content-Type'])
print (response.headers['server'])

utf-8
application/json; charset=utf-8
GitHub.com


In [None]:
print(response)

<Response [200]>


In [None]:
response.headers
### a dict
type(response.headers)

{'Server': 'GitHub.com', 'Date': 'Wed, 02 Feb 2022 18:00:39 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept, Accept-Encoding, Accept, X-Requested-With', 'ETag': 'W/"3b08785a912fd9158abc75739a1173520f25f493467dd3eaef5d02674b84ca0f"', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Link': '<https://api.github.com/repositories?since=369>; rel="next", <https://api.github.com/repositories{?since}>; rel="first"', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protec

requests.structures.CaseInsensitiveDict

In [None]:
import json
if (response.status_code == 200):
    print (json.dumps(response.json()[0], indent=2)[:300])
### the first element in the response with the first 300 characters for brevity

{
  "id": 1,
  "node_id": "MDEwOlJlcG9zaXRvcnkx",
  "name": "grit",
  "full_name": "mojombo/grit",
  "private": false,
  "owner": {
    "login": "mojombo",
    "id": 1,
    "node_id": "MDQ6VXNlcjE=",
    "avatar_url": "https://avatars.githubusercontent.com/u/1?v=4",
    "gravatar_id": "",
    "url":


In [None]:
len(response.json())
len(response.json()[3])
#response.json()[3]
#response.json()
#print(response.status_code)

100

46

In [None]:
type(response.json())
response.json()

list

[{'archive_url': 'https://api.github.com/repos/mojombo/grit/{archive_format}{/ref}',
  'assignees_url': 'https://api.github.com/repos/mojombo/grit/assignees{/user}',
  'blobs_url': 'https://api.github.com/repos/mojombo/grit/git/blobs{/sha}',
  'branches_url': 'https://api.github.com/repos/mojombo/grit/branches{/branch}',
  'collaborators_url': 'https://api.github.com/repos/mojombo/grit/collaborators{/collaborator}',
  'comments_url': 'https://api.github.com/repos/mojombo/grit/comments{/number}',
  'commits_url': 'https://api.github.com/repos/mojombo/grit/commits{/sha}',
  'compare_url': 'https://api.github.com/repos/mojombo/grit/compare/{base}...{head}',
  'contents_url': 'https://api.github.com/repos/mojombo/grit/contents/{+path}',
  'contributors_url': 'https://api.github.com/repos/mojombo/grit/contributors',
  'deployments_url': 'https://api.github.com/repos/mojombo/grit/deployments',
  'description': '**Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object

In [None]:
response.json()[0]

{'archive_url': 'https://api.github.com/repos/mojombo/grit/{archive_format}{/ref}',
 'assignees_url': 'https://api.github.com/repos/mojombo/grit/assignees{/user}',
 'blobs_url': 'https://api.github.com/repos/mojombo/grit/git/blobs{/sha}',
 'branches_url': 'https://api.github.com/repos/mojombo/grit/branches{/branch}',
 'collaborators_url': 'https://api.github.com/repos/mojombo/grit/collaborators{/collaborator}',
 'comments_url': 'https://api.github.com/repos/mojombo/grit/comments{/number}',
 'commits_url': 'https://api.github.com/repos/mojombo/grit/commits{/sha}',
 'compare_url': 'https://api.github.com/repos/mojombo/grit/compare/{base}...{head}',
 'contents_url': 'https://api.github.com/repos/mojombo/grit/contents/{+path}',
 'contributors_url': 'https://api.github.com/repos/mojombo/grit/contributors',
 'deployments_url': 'https://api.github.com/repos/mojombo/grit/deployments',
 'description': '**Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object oriented re

In [None]:
response.json()[0]["full_name"]
response.json()[0].keys()
response.json()[0].values()

'mojombo/grit'

dict_keys(['id', 'node_id', 'name', 'full_name', 'private', 'owner', 'html_url', 'description', 'fork', 'url', 'forks_url', 'keys_url', 'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url', 'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url', 'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url', 'languages_url', 'stargazers_url', 'contributors_url', 'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url', 'comments_url', 'issue_comment_url', 'contents_url', 'compare_url', 'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url', 'milestones_url', 'notifications_url', 'labels_url', 'releases_url', 'deployments_url'])

dict_values([1, 'MDEwOlJlcG9zaXRvcnkx', 'grit', 'mojombo/grit', False, {'login': 'mojombo', 'id': 1, 'node_id': 'MDQ6VXNlcjE=', 'avatar_url': 'https://avatars.githubusercontent.com/u/1?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/mojombo', 'html_url': 'https://github.com/mojombo', 'followers_url': 'https://api.github.com/users/mojombo/followers', 'following_url': 'https://api.github.com/users/mojombo/following{/other_user}', 'gists_url': 'https://api.github.com/users/mojombo/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/mojombo/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/mojombo/subscriptions', 'organizations_url': 'https://api.github.com/users/mojombo/orgs', 'repos_url': 'https://api.github.com/users/mojombo/repos', 'events_url': 'https://api.github.com/users/mojombo/events{/privacy}', 'received_events_url': 'https://api.github.com/users/mojombo/received_events', 'type': 'User', 'site_admin': False}, 'https://github.com
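Since each element of the response is a nested dictionary, it is often convenient to keep only a few fields. The following is a minimal sketch of that idea; the helper `summarize_repo` is hypothetical, and the sample record is a trimmed copy of the first record shown above so that the example runs without a network call.

```python
def summarize_repo(repo):
    """Pick a few fields from one repository record (hypothetical helper)."""
    return {
        'id': repo['id'],
        'full_name': repo['full_name'],
        'owner_login': repo['owner']['login'],  # nested dictionary access
        'private': repo['private'],
    }

# Trimmed copy of the first record shown above; a real record has many more keys.
sample = {
    'id': 1,
    'name': 'grit',
    'full_name': 'mojombo/grit',
    'private': False,
    'owner': {'login': 'mojombo', 'id': 1},
}
summary = summarize_repo(sample)
print(summary)
```

With a live response, `[summarize_repo(r) for r in response.json()]` would give a flat list that is easy to load into a DataFrame.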

In [None]:
response = requests.get('https://api.github.com/search/repositories') ### Wrong API call: the required search parameter 'q' is missing
print (response.status_code)

422


**Correction**

In [None]:
response = requests.get('https://api.github.com/search/repositories',
    params={'q': 'data_science+language:python'},
    headers={'Accept': 'application/vnd.github.v3.text-match+json'})
print(response.status_code)

200


In [None]:
response.json()["total_count"]
#response.json()['items'][:1]
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

for item in response.json()['items'][:7]:
    printmd('**' + item['name'] + '**' + ': repository ' +
            item['text_matches'][0]['property'] + ' - \"*' +
            item['text_matches'][0]['fragment'] + '*\" matched with ' + '**' +
            item['text_matches'][0]['matches'][0]['text'] + '**')

14409

**data-science-from-scratch**: repository description - "*code for Data Science From Scratch book*" matched with **Data Science**

**data-science-blogs**: repository description - "*A curated list of data science blogs*" matched with **data science**

**galaxy**: repository description - "*Data intensive science for everyone.*" matched with **Data**

**DataCamp**: repository description - "*DataCamp data-science courses*" matched with **data**

**data-scientist-roadmap**: repository description - "*Toturials coming with the "data science roadmap" picture.*" matched with **data science**

**dsp**: repository description - "*data science preparation*" matched with **data science**

**Kaggler**: repository description - "*Code for Kaggle Data Science Competitions*" matched with **Data Science**

# A Use Case

Monitor the comments in a repository, say the PyTorch repository, and ensure that they adhere to community guidelines.

In [None]:
response = requests.get(
    'https://api.github.com/repos/pytorch/pytorch/issues/comments') ### GitHub endpoint for issue comments of the PyTorch repository
print('Response Code', response.status_code)
print('Number of comments', len(response.json()))
response.json()

Response Code 200
Number of comments 30


[{'author_association': 'NONE',
  'body': 'A good reason to use Python 3\n',
  'created_at': '2016-08-16T14:13:50Z',
  'html_url': 'https://github.com/pytorch/pytorch/issues/1#issuecomment-240114487',
  'id': 240114487,
  'issue_url': 'https://api.github.com/repos/pytorch/pytorch/issues/1',
  'node_id': 'MDEyOklzc3VlQ29tbWVudDI0MDExNDQ4Nw==',
  'performed_via_github_app': None,
  'reactions': {'+1': 0,
   '-1': 0,
   'confused': 0,
   'eyes': 0,
   'heart': 0,
   'hooray': 0,
   'laugh': 1,
   'rocket': 0,
   'total_count': 1,
   'url': 'https://api.github.com/repos/pytorch/pytorch/issues/comments/240114487/reactions'},
  'updated_at': '2016-08-16T14:13:50Z',
  'url': 'https://api.github.com/repos/pytorch/pytorch/issues/comments/240114487',
  'user': {'avatar_url': 'https://avatars.githubusercontent.com/u/161935?v=4',
   'events_url': 'https://api.github.com/users/alexbw/events{/privacy}',
   'followers_url': 'https://api.github.com/users/alexbw/followers',
   'following_url': 'https:/

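<b>Question</b>: Are there only 30 comments? That cannot be true. The result is due to pagination: by default, each page is limited to 30 comments. The `links` attribute of the response gives the URLs of the next and the last page, and the page number in the last URL tells us how many pages there are: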

In [None]:
response.links

{'last': {'rel': 'last',
  'url': 'https://api.github.com/repositories/65600975/issues/comments?page=1334'},
 'next': {'rel': 'next',
  'url': 'https://api.github.com/repositories/65600975/issues/comments?page=2'}}
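To turn the `last` link into a page count, the page number can be read from the URL's query string. This is a minimal sketch using only the standard library; `last_page_number` is a hypothetical helper, and the `links` dictionary below is copied from the output above rather than fetched.

```python
from urllib.parse import urlparse, parse_qs

def last_page_number(links):
    """Return the page number of the 'last' link, or 1 if there is none."""
    last = links.get('last')
    if last is None:
        return 1  # no 'last' link: assume everything fit on one page
    query = parse_qs(urlparse(last['url']).query)
    return int(query['page'][0])

# Same structure as the response.links output shown above.
links = {
    'last': {'rel': 'last',
             'url': 'https://api.github.com/repositories/65600975/issues/comments?page=1334'},
    'next': {'rel': 'next',
             'url': 'https://api.github.com/repositories/65600975/issues/comments?page=2'},
}
print(last_page_number(links))
```

In the notebook, `last_page_number(response.links)` would work the same way on a live response.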

The following snippet retrieves all pages of comments posted since Jan. 1, 2022, without any rate limiting. If the server responds with status code 503, try clearing the browsing history and cookies relevant to the URL that the API calls are made to. If the status code is 403, wait for a while before running it again, because the web server may block your API calls when it detects many requests from you within a short period of time. To resolve this problem properly, we need to control the request rate.

This process takes a while, so the code is commented out. Uncomment it when you want to use this snippet.

In [None]:
#def get_all_pages(url, params=None, headers=None):
#    output_json = []
#    response = requests.get(url, params=params, headers=headers)
#    print("the response status is ", response.status_code)
#    if response.status_code == 200:
#        output_json = response.json()
#        if 'next' in response.links:
#            next_url = response.links['next']['url']
#            if next_url is not None:
#                output_json += get_all_pages(next_url, params, headers)
#    return output_json
#
###
#out = get_all_pages(
#    "https://api.github.com/repos/pytorch/pytorch/issues/comments",
#    params={
#        'since': '2022-01-01T10:00:01Z',
#        'sort': 'created',
#        'direction': 'desc'
#    },
#    headers={'Accept': 'application/vnd.github.v3+json'})
#df = pd.DataFrame(out)
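The recursive function makes one nested call per page, so a repository with more than a thousand pages could run into Python's recursion limit. As an alternative, here is a sketch of the same idea written as a loop. The `fetch` argument is an assumption introduced so the example can run offline with fake pages; in the notebook you could pass `requests.get` instead.

```python
from types import SimpleNamespace

def get_all_pages_iter(url, fetch, params=None, headers=None):
    """Follow 'next' links in a loop and collect the JSON items of every page."""
    items = []
    while url is not None:
        response = fetch(url, params=params, headers=headers)
        if response.status_code != 200:
            break  # stop on the first failed page and return what we have
        items.extend(response.json())
        url = response.links.get('next', {}).get('url')
        params = None  # assumption: the 'next' URL already carries the query parameters
    return items

# Offline stand-in for an HTTP client: two fake pages linked by a 'next' URL.
fake_pages = {
    'page1': SimpleNamespace(status_code=200, json=lambda: [{'id': 1}, {'id': 2}],
                             links={'next': {'url': 'page2'}}),
    'page2': SimpleNamespace(status_code=200, json=lambda: [{'id': 3}], links={}),
}

def fake_fetch(url, params=None, headers=None):
    return fake_pages[url]

all_items = get_all_pages_iter('page1', fake_fetch)
print(len(all_items))
```

The loop stops when a page has no `next` link or when a request fails, which mirrors the behavior of the recursive version.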

# Rate limiting

GitHub allows up to 60 unauthenticated requests per hour. This information can be obtained from the headers of the response object.

In [None]:
response = requests.head(
    'https://api.github.com/repos/pytorch/pytorch/issues/comments')
print('X-Ratelimit-Limit', response.headers['X-Ratelimit-Limit'])
print('X-Ratelimit-Remaining', response.headers['X-Ratelimit-Remaining'])

# Convert the reset time (Unix epoch seconds) to a human-readable format
import datetime
print(
    'Rate Limits reset at',
    datetime.datetime.fromtimestamp(int(
        response.headers['X-RateLimit-Reset'])).strftime('%c'))

X-Ratelimit-Limit 60
X-Ratelimit-Remaining 57
Rate Limits reset at Wed Feb  2 19:00:39 2022
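`datetime.fromtimestamp` without a time zone returns local time, so the printed reset time depends on where the code runs. If you prefer an unambiguous result, the sketch below converts the header value to a time-zone-aware UTC datetime. The helper name `reset_time_utc` and the sample epoch value are assumptions for illustration.

```python
from datetime import datetime, timezone

def reset_time_utc(epoch_seconds):
    """Convert an X-RateLimit-Reset value (Unix epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)

# Illustrative header value; headers arrive as strings.
reset = reset_time_utc('1643828439')
print(reset.isoformat())
```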


We ought to honor GitHub's rate limits, so we space our API calls evenly: with 60 requests per hour, that is about one call per minute.

In [None]:
from datetime import datetime
import time

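def handle_rate_limits(response):
    now = datetime.now()
    reset_time = datetime.fromtimestamp(
        int(response.headers['X-RateLimit-Reset']))
    remaining_requests = response.headers['X-Ratelimit-Remaining']
    # seconds until the rate limit window resets; never negative,
    # because time.sleep() raises an error for negative values
    remaining_time = max(0.0, (reset_time - now).total_seconds())
    # spread the remaining requests evenly over the remaining time
    intervals = remaining_time / (1.0 + int(remaining_requests))
    print('Sleeping for', intervals)
    time.sleep(intervals)
    return True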
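The arithmetic behind the sleep interval can be checked in isolation. The following sketch restates it as a pure function that takes plain numbers instead of a response object; the name `seconds_between_calls` and its parameters are hypothetical.

```python
def seconds_between_calls(reset_epoch, remaining_requests, now_epoch):
    """Spread the remaining requests evenly over the time left in the window."""
    remaining_time = max(0.0, reset_epoch - now_epoch)
    return remaining_time / (1.0 + remaining_requests)

# 59 requests left with a full hour to go: 3600 / 60 = 60 seconds per call.
wait = seconds_between_calls(reset_epoch=3600, remaining_requests=59, now_epoch=0)
print(wait)
```

Adding 1 to the number of remaining requests avoids a division by zero when the quota is used up.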

In [None]:
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

retry_strategy = Retry(
    total=5,
    status_forcelist=[500, 503, 504],
    backoff_factor=1
)

retry_adapter = HTTPAdapter(max_retries=retry_strategy)

http = requests.Session()
http.mount("https://", retry_adapter)
http.mount("http://", retry_adapter)

response = http.get('https://api.github.com/search/repositories',
                   params={'q': 'data_science+language:python'})

for item in response.json()['items'][:5]:
    print (item['name'])

data-science-from-scratch
data-science-blogs
galaxy
DataCamp
data-scientist-roadmap
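To give an intuition for what a retry strategy with a backoff factor does, here is a hand-rolled sketch of retrying with exponentially growing waits. It is not urllib3's implementation: the `Retry` class reacts to HTTP status codes and its exact delays may differ between versions, whereas this sketch retries on exceptions. All names in it are assumptions for illustration.

```python
import time

def call_with_backoff(func, retries=5, backoff_factor=1.0, sleep=time.sleep):
    """Call func(); on an exception wait backoff_factor * 2**attempt seconds and retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(backoff_factor * (2 ** attempt))

# Offline demonstration: a function that fails twice, then succeeds.
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'

waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Passing `sleep=waits.append` records the waiting times instead of actually pausing, which keeps the demonstration fast.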


Use the following code to handle both pagination and rate limits.

In [None]:
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

retry_strategy = Retry(
    total=5,
    status_forcelist=[500, 503, 504],
    backoff_factor=1
)

retry_adapter = HTTPAdapter(max_retries=retry_strategy)

http = requests.Session()
http.mount("https://", retry_adapter)
http.mount("http://", retry_adapter)

def get_all_pages(url, param=None, header=None):
    output_json = []
    response = http.get(url, params=param, headers=header)
    print("The response status code is ", response.status_code)
    if response.status_code == 200:
        output_json = response.json()
        if 'next' in response.links:
            next_url = response.links['next']['url']
            if (next_url is not None) and (handle_rate_limits(response)): 
                output_json += get_all_pages(next_url, param, header)
    return output_json

In [None]:
out = get_all_pages("https://api.github.com/repos/pytorch/pytorch/issues/comments",
                    #param={'since': '2021-04-01T00:00:01Z'}
                    param={'since': '2022-01-01T10:00:01Z',
                           'sort': 'created',
                           'direction': 'desc'},
                    header={'Accept': 'application/vnd.github.v3+json'})

df = pd.DataFrame(out)
#print(df)

The response status code is  200
Sleeping for 62.27763908888889
The response status code is  200
Sleeping for 62.252818795454544
The response status code is  200
Sleeping for 62.23140809302325
The response status code is  200
Sleeping for 62.21054973809524
The response status code is  200
Sleeping for 62.20024468292683
The response status code is  200
Sleeping for 62.20077834999999
The response status code is  200
Sleeping for 62.14729974358974
The response status code is  200
Sleeping for 62.07490689473684
The response status code is  200
Sleeping for 62.0477952972973
The response status code is  200
Sleeping for 62.03573191666666
The response status code is  200
Sleeping for 62.00583505714286
The response status code is  200
Sleeping for 61.97500102941177
The response status code is  200
Sleeping for 61.96311112121212
The response status code is  200
Sleeping for 61.950417625
The response status code is  200
Sleeping for 61.9365824516129
The response status code is  200
Sleeping for 

In [None]:
pd.set_option('display.max_colwidth', None) ### None shows the full text (-1 is deprecated in newer pandas versions)
print (df['body'].count())
df[['id','created_at','body']].sample(3, random_state=15)

1350


Unnamed: 0,id,created_at,body
43,1007490600,2022-01-07T15:20:39Z,"\n<!-- ciflow-comment-start -->\n<details><summary>CI Flow Status</summary><br/>\n\n## :atom_symbol: CI Flow\nRuleset - Version: `v1`\nRuleset - File: https://github.com/lithuak/pytorch/blob/ae2f0b202773634f20f428f4c7e10bbf17f8c8ce/.github/generated-ciflow-ruleset.json\nPR ciflow labels: `ciflow/default`\n| Workflows | Labels (bold enabled) | Status |\n| :-------- | :-------------------- | :------ |\n| **Triggered Workflows** |\n| linux-bionic-py3.7-clang9 | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/noarch`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-docs | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/docs`, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-vulkan-bionic-py3.7-clang9 | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk`, `ciflow/vulkan` | :white_check_mark: triggered |\n| linux-xenial-cuda11.3-py3.7-gcc7 | `ciflow/all`, `ciflow/cuda`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | `ciflow/all`, `ciflow/bazel`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-py3-clang5-mobile-build | `ciflow/all`, **`ciflow/default`**, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-py3-clang5-mobile-custom-build-static | `ciflow/all`, **`ciflow/default`**, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-py3.7-clang7-asan | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/sanitizers`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-py3.7-clang7-onnx | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/onnx`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-py3.7-gcc5.4 | `ciflow/all`, 
`ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-py3.7-gcc7 | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| linux-xenial-py3.7-gcc7-no-ops | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | `ciflow/all`, `ciflow/android`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | `ciflow/all`, `ciflow/android`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/linux`, `ciflow/trunk` | :white_check_mark: triggered |\n| win-vs2019-cpu-py3 | `ciflow/all`, `ciflow/cpu`, **`ciflow/default`**, `ciflow/trunk`, `ciflow/win` | :white_check_mark: triggered |\n| win-vs2019-cuda11.3-py3 | `ciflow/all`, `ciflow/cuda`, **`ciflow/default`**, `ciflow/trunk`, `ciflow/win` | :white_check_mark: triggered |\n| **Skipped Workflows** |\n| caffe2-linux-xenial-py3.7-gcc5.4 | `ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk` | :no_entry_sign: skipped |\n| docker-builds | `ciflow/all`, `ciflow/trunk` | :no_entry_sign: skipped |\n| ios-12-5-1-arm64 | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| ios-12-5-1-arm64-coreml | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| ios-12-5-1-arm64-custom-ops | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| ios-12-5-1-arm64-full-jit | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| ios-12-5-1-arm64-metal | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| ios-12-5-1-x86-64 | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | 
:no_entry_sign: skipped |\n| ios-12-5-1-x86-64-coreml | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| ios-12-5-1-x86-64-full-jit | `ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | `ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk` | :no_entry_sign: skipped |\n| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | `ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk` | :no_entry_sign: skipped |\n| linux-bionic-cuda10.2-py3.9-gcc7 | `ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/slow`, `ciflow/trunk` | :no_entry_sign: skipped |\n| linux-docs-push | `ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/scheduled` | :no_entry_sign: skipped |\n| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | `ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/trunk` | :no_entry_sign: skipped |\n| macos-10-15-py3-arm64 | `ciflow/all`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| macos-10-15-py3-lite-interpreter-x86-64 | `ciflow/all`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| macos-11-py3-x86-64 | `ciflow/all`, `ciflow/macos`, `ciflow/trunk` | :no_entry_sign: skipped |\n| parallelnative-linux-xenial-py3.7-gcc5.4 | `ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk` | :no_entry_sign: skipped |\n| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | `ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled` | :no_entry_sign: skipped |\n| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | `ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled` | :no_entry_sign: skipped |\n| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | `ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled` | :no_entry_sign: skipped |\n| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | `ciflow/all`, `ciflow/cuda`, `ciflow/linux`, 
`ciflow/scheduled`, `ciflow/slow`, `ciflow/slow-gradcheck` | :no_entry_sign: skipped |\n| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | `ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled` | :no_entry_sign: skipped |\n| periodic-win-vs2019-cuda11.1-py3 | `ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win` | :no_entry_sign: skipped |\n| periodic-win-vs2019-cuda11.5-py3 | `ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win` | :no_entry_sign: skipped |\n| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | `ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk` | :no_entry_sign: skipped |\n<br/>\nYou can add a comment to the PR and tag @pytorchbot with the following commands:\n<br/>\n\n```sh\n# ciflow rerun, ""ciflow/default"" will always be added automatically\n@pytorchbot ciflow rerun\n\n# ciflow rerun with additional labels ""-l <ciflow/label_name>"", which is equivalent to adding these labels manually and trigger the rerun\n@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow\n```\n\n<br/>\n\nFor more information, please take a look at the [CI Flow Wiki](https://github.com/pytorch/pytorch/wiki/Continuous-Integration#using-ciflow).\n</details><!-- ciflow-comment-end -->"
1242,1005209250,2022-01-04T22:12:34Z,@Linux-cpp-lisp - does the recent nightly resolve your issue as well? If not please create another issue.
622,1006119175,2022-01-05T22:12:49Z,"I have trouble formalizing what the expected behavior would be here.\r\n\r\nThe contract for hooks is that they should be a pair of functions (`pack`, `unpack`) with `pack: torch.Tensor -> Any` and `unpack: Any -> torch.Tensor` and `pack(unpack(x)) = x`.\r\n\r\n- In the case where where output of `pack_1` and `pack_2` are tensors, I can visualize how composing them would work. \r\n```\r\n with saved_tensors_hooks(pack_2, unpack_2):\r\n with saved_tensors_hooks(pack_1, unpack_1):\r\n y = f(x)\r\n```\r\nIf the forward pass needs to a tensor `a`, the above code would save `saved = pack_2(pack_1(a))`.\r\nThen during backward, the tensor would be retrieved with `unpack_1(unpack_2(saved))`.\r\n(I guess even in the above example it's a bit unclear which order the operations should be done in).\r\n\r\n- However, in the case where the output of the inner `pack` function is not a tensor, do we then need to change what is acceptable as an outer hook? Concretely, in your example, the `checkpoint` inner hook outputs an `int`. Clearly, moving the `int` to CPU is not what you'd want. Alternatively, the saved tensor can be unpacked (by the inner unpack) before being packed by the outer pack. But that would void using the inner hook altogether.\r\n\r\nWhat am I missing here?"


# Blueprint - Extracting Twitter data with Tweepy

You need to register as a developer: go to https://developer.twitter.com/en/apps and click on "Create an app". By selecting "create an API" you will obtain an API key and an API secret key. Enter them below.

In [None]:
import tweepy

app_api_key = 'AyWFqRzvENuPUxxxxxxxxxxxx' ### enter yours
app_api_secret_key = '9kbOw3atv3XMm9yh8dTKuoAVr85On6a9W5mxxxxxxxxxxxxxxx' ### enter yours

auth = tweepy.AppAuthHandler(app_api_key, app_api_secret_key)
api = tweepy.API(auth)

print ('API Host:', api.host)
print ('API Cache:', api.cache)

API Host: api.twitter.com
API Cache: None
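Hard-coded keys are easy to leak when a notebook is shared. One common alternative is to read them from environment variables, as in the sketch below. The variable names `TWITTER_API_KEY` and `TWITTER_API_SECRET_KEY` and the helper `load_twitter_credentials` are assumptions chosen for illustration; they are not defined by Tweepy or Twitter.

```python
import os

def load_twitter_credentials(env=os.environ):
    """Read the API key pair from environment variables (names are hypothetical)."""
    key = env.get('TWITTER_API_KEY')
    secret = env.get('TWITTER_API_SECRET_KEY')
    if not key or not secret:
        raise RuntimeError('Set TWITTER_API_KEY and TWITTER_API_SECRET_KEY first')
    return key, secret

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {'TWITTER_API_KEY': 'my-key', 'TWITTER_API_SECRET_KEY': 'my-secret'}
app_api_key, app_api_secret_key = load_twitter_credentials(demo_env)
```

Calling `load_twitter_credentials()` without arguments reads the real environment, and the resulting pair can be passed to `tweepy.AppAuthHandler` as before.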


In [None]:
import pandas as pd

pd.set_option('display.max_colwidth', None) ###
search_term = 'cryptocurrency'

#
#tweets = tweepy.Cursor(api.search_tweets,
#                       q=search_term,
#                       lang="en").items(100)
#
#retrieved_tweets = [tweet._json for tweet in tweets]
#df = pd.json_normalize(retrieved_tweets)
#
#df[['text']].sample(3)

The following code needs Elevated access
--
<code>
tweets = tweepy.Cursor(api.search_tweets, q=search_term, lang="en").items(100)
retrieved_tweets = [tweet._json for tweet in tweets] 
df = pd.json_normalize(retrieved_tweets)
df[['text']].sample(3)
</code>

Running this code returns the following message:

Forbidden: 403 Forbidden
453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve

The following functions allow access to Twitter API v2 endpoints. More information is given at

https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9

In [None]:
import tweepy

### Authentication
client = tweepy.Client(
    bearer_token='AAAAAAAAAAAAAAAAAAAAAMBWXwEAAAAAv4Dg%2FvOZhXo1wQSECvc5%2Bn%2BtPz8%3DdoeiZ2qbU0VKuNnqjhVOLnxxxxxxxxxxxxxxxxxxxxxxxxxxxx')  ### enter yours


AttributeError: ignored

In [None]:
# Replace with your own search query
# query = 'from:suhemparack -is:retweet'
query = search_term

tweets = client.search_recent_tweets(query=query, \
                                     #tweet_fields=['context_annotations', 'created_at'], \
                                     max_results=100)
print(tweets)

#for tweet in tweets.data:
#    print(tweet.text)
#    if len(tweet.context_annotations) > 0:
#        print(tweet.context_annotations)
        


NameError: ignored

Getting more than 100 tweets at a time using Paginator
--

In [None]:
# Replace with your own search query
query = 'covid -is:retweet'

# Replace the limit=1000 with the maximum number of Tweets you want
for tweet in tweepy.Paginator(client.search_recent_tweets, query=query,
                              tweet_fields=['context_annotations', 'created_at'], 
                              max_results=100).flatten(limit=1000):
    print(tweet.id)

1479539956943101954
1479539955957665793
1479539955856781313
1479539955630317568
1479539955122720780
1479539955089219589
1479539954959015937
1479539954946617350
1479539954707357698
1479539954653052929
1479539954518839299
1479539954502017026
1479539954070007817
1479539952581066756
1479539952404807680
1479539951750430720
1479539951482163206
1479539951448600580
1479539951104626690
1479539951024758784
1479539950676856835
1479539950433538051
1479539950118834176
1479539949208670217
1479539949036834830
1479539948386947072
1479539948260773888
1479539946742599683
1479539945903734785
1479539945542848512
1479539945446457357
1479539945190612992
1479539944460795909
1479539944028876802
1479539943542296582
1479539942019719178
1479539941977821191
1479539941415825410
1479539941415665666
1479539941088497664
1479539940820242432
1479539940761419780
1479539940501381125
1479539940480462848
1479539940153253895
1479539939855454212
1479539939461238784
1479539939335458816
1479539938253148160
1479539938165207045


1479539792635473920
1479539792618479618
1479539792283148288
1479539792257843202
1479539792232599555
1479539792194985984
1479539791964389379
1479539791624556547
1479539791481884674
1479539790873780233
1479539790756278274
1479539790592806914
1479539790353678337
1479539790039203841
1479539789619732480
1479539788864790533
1479539788529156097
1479539788407525378
1479539788252393474
1479539788252213248
1479539788026097664
1479539787614892039
1479539787379970048
1479539784846561280
1479539784640913411
1479539784477560832
1479539784423026696
1479539783873482759
1479539783789694983
1479539783592464390
1479539782392946689
1479539782111965184
1479539782019698692
1479539781331787776
1479539781092745222
1479539780715266048
1479539780287442946
1479539780035751936
1479539779985420295
1479539779364474881
1479539779247214603
1479539778630656005
1479539777837813764
1479539777724731392
1479539777389187081
1479539777246535691
1479539777066123265
1479539776810270729
1479539776600559619
1479539776546082825
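
To make the paging behaviour concrete, here is a minimal offline sketch of what a paginator does conceptually: request one page, follow the continuation token to the next page, and stop once the item limit is reached. It does not call Twitter or Tweepy. The names `FAKE_IDS`, `fetch_page`, and `paginate` are hypothetical and exist only for this illustration.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical stand-in data: 250 fake IDs, newest first.
FAKE_IDS: List[int] = list(range(250, 0, -1))


def fetch_page(next_token: Optional[int], max_results: int) -> Tuple[List[int], Optional[int]]:
    """Return one page of IDs and the token for the next page (None when exhausted)."""
    start = next_token or 0
    page = FAKE_IDS[start:start + max_results]
    end = start + len(page)
    return page, (end if end < len(FAKE_IDS) else None)


def paginate(max_results: int, limit: int) -> Iterator[int]:
    """Yield items page by page until `limit` items were produced or no pages remain."""
    produced = 0
    token: Optional[int] = None
    while produced < limit:
        page, token = fetch_page(token, max_results)
        for item in page:
            if produced >= limit:
                return
            yield item
            produced += 1
        if token is None:
            return


# Pages hold at most 100 items, yet the caller can ask for 220 in one loop.
ids = list(paginate(max_results=100, limit=220))
print(len(ids), ids[0], ids[-1])
```

The per-page size and the overall limit are separate knobs here, which mirrors the roles of `max_results` and `limit` in the cell above.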


Writing Tweets to a text file
--

In [None]:
# Replace with your own search query
query = 'covid -is:retweet'

# Name and path of the file where you want the Tweets written to
file_name = 'tweets.txt'

with open(file_name, 'a+') as filehandle:
    for tweet in tweepy.Paginator(client.search_recent_tweets, query=query,
                                  tweet_fields=['context_annotations', 'created_at'], 
                                  max_results=100).flatten(
            limit=1000):
        filehandle.write('%s\n' % tweet.id)
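
The same write pattern can be tried without network access. The sketch below is only an illustration: `save_ids` and `load_ids` are hypothetical helpers, the file lives in a temporary directory, and the three sample IDs are copied from the Paginator output earlier in this notebook.

```python
import os
import tempfile
from typing import Iterable, List


def save_ids(path: str, ids: Iterable[int]) -> int:
    """Append one ID per line to `path` and return how many IDs were written."""
    written = 0
    # Mode "a" appends, so rerunning adds to the file instead of overwriting it.
    with open(path, "a", encoding="utf-8") as handle:
        for tweet_id in ids:
            # write() returns the number of characters written; it is not needed here.
            handle.write(f"{tweet_id}\n")
            written += 1
    return written


def load_ids(path: str) -> List[int]:
    """Read the file back into a list of integer IDs, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [int(line) for line in handle if line.strip()]


sample_ids = [1479539956943101954, 1479539955957665793, 1479539955856781313]
path = os.path.join(tempfile.mkdtemp(), "tweets.txt")
count = save_ids(path, sample_ids)
print(count, load_ids(path) == sample_ids)
```

Because the file is opened in append mode, running the cell twice stores every ID twice. Use mode `"w"` instead if each run should start from an empty file.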


Getting Tweet counts (volume) for a search query
--

In [None]:
# Replace with your own search query
query = 'covid -is:retweet'

counts = client.get_recent_tweets_count(query=query, granularity='day')

for count in counts.data:
    print(count)

{'end': '2022-01-01T00:00:00.000Z', 'start': '2021-12-31T19:54:04.000Z', 'tweet_count': 128593}
{'end': '2022-01-02T00:00:00.000Z', 'start': '2022-01-01T00:00:00.000Z', 'tweet_count': 490157}
{'end': '2022-01-03T00:00:00.000Z', 'start': '2022-01-02T00:00:00.000Z', 'tweet_count': 555998}
{'end': '2022-01-04T00:00:00.000Z', 'start': '2022-01-03T00:00:00.000Z', 'tweet_count': 706760}
{'end': '2022-01-05T00:00:00.000Z', 'start': '2022-01-04T00:00:00.000Z', 'tweet_count': 872745}
{'end': '2022-01-06T00:00:00.000Z', 'start': '2022-01-05T00:00:00.000Z', 'tweet_count': 949672}
{'end': '2022-01-07T00:00:00.000Z', 'start': '2022-01-06T00:00:00.000Z', 'tweet_count': 864923}
{'end': '2022-01-07T19:54:04.000Z', 'start': '2022-01-07T00:00:00.000Z', 'tweet_count': 660070}
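
As a small follow-up, the daily buckets can be summarised with plain Python. The sketch below copies the eight records from the output above into a list (so it runs without API access) and computes the total volume and the busiest bucket. The variable names `buckets`, `total`, and `peak` are chosen for this illustration only.

```python
# Records copied from the output above; no API call is made here.
buckets = [
    {'end': '2022-01-01T00:00:00.000Z', 'start': '2021-12-31T19:54:04.000Z', 'tweet_count': 128593},
    {'end': '2022-01-02T00:00:00.000Z', 'start': '2022-01-01T00:00:00.000Z', 'tweet_count': 490157},
    {'end': '2022-01-03T00:00:00.000Z', 'start': '2022-01-02T00:00:00.000Z', 'tweet_count': 555998},
    {'end': '2022-01-04T00:00:00.000Z', 'start': '2022-01-03T00:00:00.000Z', 'tweet_count': 706760},
    {'end': '2022-01-05T00:00:00.000Z', 'start': '2022-01-04T00:00:00.000Z', 'tweet_count': 872745},
    {'end': '2022-01-06T00:00:00.000Z', 'start': '2022-01-05T00:00:00.000Z', 'tweet_count': 949672},
    {'end': '2022-01-07T00:00:00.000Z', 'start': '2022-01-06T00:00:00.000Z', 'tweet_count': 864923},
    {'end': '2022-01-07T19:54:04.000Z', 'start': '2022-01-07T00:00:00.000Z', 'tweet_count': 660070},
]

# Total number of matching tweets across the whole window.
total = sum(bucket['tweet_count'] for bucket in buckets)

# The bucket with the highest volume.
peak = max(buckets, key=lambda bucket: bucket['tweet_count'])

print(total)
print(peak['start'][:10], peak['tweet_count'])
```

Note that the first and last buckets cover only part of a day (they start or end at 19:54:04), so their counts are not directly comparable with the full-day buckets in between.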


# Coding Assignment 2

Retrieve tweets, using Tweepy on Twitter API v2 endpoints, from a query of your own choice for the last 7 days. Organize retrieved tweets into a Pandas DataFrame so that each tweet is a row entry with the following columns: ID, Name, Text, Created At, Number of Followers, and other columns you deem fit.

In [33]:

import os
import tweepy

# Keys and tokens the program uses to contact the Twitter API. They are read
# from environment variables so that secrets are not stored in the notebook;
# the variable names below are placeholders, use whichever names you have set.

# Don't run this cell too often: if you do, you'll get rate limited, and it
# will look as if the code doesn't work.

API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET = os.environ["TWITTER_API_SECRET"]
API_BEARER = os.environ["TWITTER_BEARER_TOKEN"]

ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_SECRET = os.environ["TWITTER_ACCESS_SECRET"]

# WOEID (Where On Earth ID) for Boston, used to look up local trends
BOSTON_ID = 2367105



class Client:
  """
  Wraps the Tweepy calls used to query the Twitter API and holds the
  keys and tokens needed for authentication.
  """
  def __init__(self, KEY, SECRET, ACCESS_TOK, ACCESS_SEC, location):
    self.KEY, self.SECRET, self.ACCESS_TOK, self.ACCESS_SEC = KEY, SECRET, ACCESS_TOK, ACCESS_SEC
    self.client = None
    self.location = location

  def authentication_routine(self):
    """Begins the OAuth handshake."""
    print("Beginning Authentication Routine")
    auth = tweepy.OAuthHandler(self.KEY, self.SECRET)
    auth.set_access_token(self.ACCESS_TOK, self.ACCESS_SEC)
    self.client = tweepy.API(auth)

  def get_current_trends(self):
    """Gathers the current trends for the configured location (here, Boston)."""
    # trends_place is the Tweepy 3.x name; Tweepy 4.x renamed it to get_place_trends.
    trends = self.client.trends_place(id=self.location)
    all_trend_info = trends[0]['trends']
    return all_trend_info

  def retrieve_tweets(self, query):
    """Returns the search results for a query as Status objects."""
    # search is the Tweepy 3.x name; Tweepy 4.x renamed it to search_tweets.
    # Note: the v1.1 search endpoint's page-size parameter is `count`;
    # `max_results` belongs to the v2 endpoints.
    return self.client.search(query, max_results=50)


user_client = Client(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET, BOSTON_ID)
user_client.authentication_routine()
trends = user_client.get_current_trends()

print("decoding querys into list so that its easier to parse")

# Run one search per trending query, then flatten the per-query results into a single list
querys = [user_client.retrieve_tweets(trend['query']) for trend in trends]
results = [result for query in querys for result in query]


print("translating list of status into dictionary of lists")

# Converts a Status object into a dictionary for easier parsing
def translate_into_dict(status):
  return {"id":status.id,"text":status.text,"user":status.user.name,"date":status.created_at,"followers":status.user.followers_count}
results = list(map(translate_into_dict, results))

# Then pivots the list of dictionaries into a dictionary of columns
data_frame_dict = {
    "ID": list(map(lambda a: a['id'],results)),
    "Name": list(map(lambda a: a['user'],results)),
    "Text": list(map(lambda a: a['text'],results)),
    "Created At": list(map(lambda a: a['date'],results)),
    "Number of Followers": list(map(lambda a: a['followers'],results))
}


from pandas import DataFrame
# construct the dataframe.
data = DataFrame(data=data_frame_dict)
print(data)

Beginning Authentication Routine
decoding querys into list so that its easier to parse
translating list of status into dictionary of lists
                      ID  ... Number of Followers
0    1491502383008043016  ...                  27
1    1491502381024088072  ...                2363
2    1491502380533252096  ...                  12
3    1491502377630842885  ...                   5
4    1491502376125087756  ...                3216
..                   ...  ...                 ...
723  1491501624405245952  ...                 342
724  1491501612036239366  ...                  10
725  1491501578670546945  ...                1699
726  1491501539940311043  ...                  72
727  1491501446877126664  ...                2374

[728 rows x 5 columns]
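
The assignment asks for the v2 endpoints, whereas the cell above goes through the older `tweepy.API` object. On v2, a tweet carries only an `author_id`; the author's name and follower count arrive as separate user records that have to be joined back to the tweets. The sketch below shows that join on hypothetical stand-in records (`FakeTweet`, `FakeUser`, and `build_rows` are invented for this illustration, and the sample values are made up). With Tweepy, the inputs would presumably come from `client.search_recent_tweets(...)` called with `expansions='author_id'` and suitable `user_fields`, with the follower count nested under the user's `public_metrics`; treat those details as assumptions to check against the Tweepy documentation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeUser:
    """Hypothetical stand-in for a v2 user record."""
    id: int
    name: str
    followers_count: int


@dataclass
class FakeTweet:
    """Hypothetical stand-in for a v2 tweet record."""
    id: int
    text: str
    created_at: str
    author_id: int


def build_rows(tweets: List[FakeTweet], users: List[FakeUser]) -> List[Dict[str, object]]:
    """Join each tweet to its author and return one dictionary per tweet."""
    users_by_id: Dict[int, FakeUser] = {user.id: user for user in users}
    rows: List[Dict[str, object]] = []
    for tweet in tweets:
        author: Optional[FakeUser] = users_by_id.get(tweet.author_id)
        rows.append({
            "ID": tweet.id,
            "Name": author.name if author else None,
            "Text": tweet.text,
            "Created At": tweet.created_at,
            "Number of Followers": author.followers_count if author else None,
        })
    return rows


# Made-up sample records, for illustration only.
users = [FakeUser(id=1, name="Ada", followers_count=27)]
tweets = [
    FakeTweet(id=101, text="first", created_at="2022-02-09T12:00:00Z", author_id=1),
    FakeTweet(id=102, text="second", created_at="2022-02-09T12:01:00Z", author_id=2),
]
rows = build_rows(tweets, users)
print(rows[0]["Name"], rows[1]["Name"])
```

Passing `rows` to `pandas.DataFrame(rows)` then gives one row per tweet with the column names used as dictionary keys, so no separate pivot into per-column lists is needed.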
