Today we're going to look at some of the ways we extract and work with data from the internet, as a way of "analyzing content".

We're going to look at two interrelated methods of doing this: a) scraping and b) using APIs (or using apps that are hooked up to APIs).

In general, scraping is a method of extracting data from a website programmatically. For example, let's say I want to download the text of a whole bunch of Wikipedia articles at once (way more than I could reasonably expect to manually copy and paste); I "scrape" these articles from the internet.

An API, or application programming interface, is a system through which a social media platform (or other type of platform/site/etc.) gives you access to their data so that you can "scrape" or extract it more easily. Last week we signed up for the Twitter API.

Today we're going to talk a bit about both scraping and APIs, as we look at the sample code below. But first, we need to remind ourselves, quickly, of one of the better ways of storing and manipulating data: the dataframe, which, in Python, we work with using a package called pandas.

**Pandas**

In [None]:
#packages - last week, you might remember, we discussed how, to use special functions/commands in python, we load "packages" that contain those functions
#to load a package we type import, the name of the package, and then as and the name we want to call it; here we import pandas and call it pd:
import pandas as pd

In [None]:
#pandas is a package we use to manipulate what are called dataframes
#data frames are a nice place to store and work with data; basically, they look like grids or spreadsheets
#because they work like grids or spreadsheets it's easy for me to convert them into excel or csv files
#so, if i want to scrape a bunch of articles from wikipedia, let's say, I can store them in a grid called a dataframe
#maybe i'll use one column for the article's title, one for its text, one for its publication date, and so on
#then i can download that dataframe into an excel (or csv) file on my computer; and i can upload it back into my code as a dataframe to work with it in pandas
#in pandas we can create a dataframe like this:

data = pd.DataFrame()

#this is an empty dataframe with nothing in it; it is like a spreadsheet with zero rows and zero columns


In [None]:
Tess = pd.DataFrame()

In [None]:
#let's say we want to create a dataframe with some things in it, like a few columns and rows with numbers in them
#here's how we could create a dataframe from a dictionary of items, each representing the numbers we want in one column
#(if you were at the coding workshop last week, I'm sorry, we did not learn what "dictionaries" are; but they're a bit like lists, and I'll explain!)


d = {'col1': [1, 2], 'col2': [3, 4]}
data1 = pd.DataFrame(data=d)
print(data1)
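
A dictionary pairs each key with a value. In the cell above, each key becomes a column name and each list becomes that column's values. Here is a minimal sketch of that idea; the names `scores` and `table` are made up for illustration.

```python
import pandas as pd

# A dictionary maps keys to values; here each key is a column name
# and each value is the list of numbers that goes in that column.
scores = {'col1': [1, 2], 'col2': [3, 4]}

# Look up one list by its key, much like picking an item out of a list by position.
print(scores['col2'])

table = pd.DataFrame(data=scores)
print(list(table.columns))   # the column names come from the dictionary keys
print(table.shape)           # (number of rows, number of columns)
```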

In [None]:

d = {'banana': [7, 25], 'terrarium': [32, 2]}
Tess = pd.DataFrame(data=d)
print(Tess)

In [None]:
#there's a lot we can do with dataframes, but let's just highlight a few functions for the moment
#one thing we might want to do is pick out an item in a dataframe
#let's say, e.g., we want to pick out, from the dataframe above called data1, the number 4 that's in the row with index 1 of col2. we do this:

data1["col2"][1]

#what we just did there was type the name of the dataframe, then the name of the column we want in quotes and brackets, and then the index of the row
#for the new coders from last week - note the indexes, or numbers, of the rows start at 0 and not 1, just like the letters in a string or the numbers in a list
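
pandas also offers `.loc`, which takes the row label first and the column name second. The sketch below rebuilds `data1` so it runs on its own and shows both styles picking out the same item; the names `first_way` and `second_way` are made up for illustration.

```python
import pandas as pd

data1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

# Column name first, then the row index (the style used above).
first_way = data1["col2"][1]

# .loc takes the row label first, then the column name.
second_way = data1.loc[1, "col2"]

print(first_way, second_way)
```

Both lines pick out the same cell; `.loc` is often the safer choice when you later want to change a value rather than just read it.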

In [None]:
#Assignment: pick out the number 2 from this dataframe (called data1)

In [None]:
#another thing we often want to do with dataframes is loop through their rows, in a particular column, so we can do the same thing to everything in that column
#like, for example, if we had a dataframe with wikipedia article texts in one column, we might want to loop through each text, and search it for some key word
#here, let's make a for loop to loop through each item in col2 and print it:

for i in range(2):
  print(data1["col2"][i])

  #quiz for the new coders, from last week: can you tell me why I put "2" in the range for the for loop?
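
The loop above has the number of rows typed in by hand. As a sketch of two alternatives that keep working when the dataframe grows, we can ask for the number of rows with `len()`, or loop over the column itself. The list `seen` is a made-up name used only to collect the values.

```python
import pandas as pd

data1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

# len() gives the number of rows, so the range adjusts if rows are added.
for i in range(len(data1)):
    print(data1["col2"][i])

# Looping over the column itself visits each value without needing an index.
seen = []
for value in data1["col2"]:
    seen.append(value)
print(seen)
```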

In [None]:
#Assignment: loop through the first column, banana, of the dataframe we just made called Tess

In [None]:
#another thing it's really valuable to be able to do with pandas is take a file from your computer, like a csv or excel file, and load it into a dataframe
#usually we upload a file into our codebook using lines of code from pandas. In colab, however, we do it a special different way
#so I'm going to show you the normal way we do this, but it's not how it works in colab
#the normal way we do this, with, e.g., an excel file is this:

data3 = pd.read_excel("name of excel file from computer here and full file path if need be.xlsx")

#to read a .csv file, use pd.read_csv in the same way
#what can be tricky here is knowing the full "filepath"; let's not worry about that now because we're using colab, which makes this easier


In [None]:
#in colab, though, we do one of two things. Either we upload the file ourselves, following the instructions here (https://colab.research.google.com/notebooks/io.ipynb)
#if you have an excel file on your comp you can try uploading it this way:

from google.colab import files
uploaded = files.upload()

In [None]:
print(uploaded)
pd.read_excel('put name of your file here.xlsx')

In [None]:
#or, we'll also just use files together that I upload for all of us, so we can all use the same data
#in that case the code will look weird, but it's usually what we're going to do; we'll do that a bit next week, don't worry about it now

In [None]:
#in pandas you also might want to save the dataframe you're working with as an excel or csv file to your computer
#again, colab does this differently
#but here's the normal way:

data1.to_excel("name to give file.xlsx")
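
To see saving and loading work together outside colab, here is a minimal sketch of a round trip using a csv file. It writes to a temporary folder so nothing is left in your working directory; the names `folder`, `path`, and `data_back` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

data1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

# Build a file path inside a temporary folder.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.csv")

# index=False leaves out the row numbers, so they don't come back as an extra column.
data1.to_csv(path, index=False)

# Read the file back into a new dataframe.
data_back = pd.read_csv(path)
print(data_back)
```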


In [None]:
#here's how colab does it:

files.download('#name of file')

In [None]:
#there's a lot more we can do in pandas. like add rows or columns, sort them by quantities, and so on
#you can learn about all that in the pandas documentation here: https://pandas.pydata.org/docs/index.html
#or by googling what you want to do in pandas and then finding how to do it (in e.g. stack exchange, a website where people post solutions to coding questions)
#These days, you can also ask ChatGPT, which knows a lot about basic python packages like pandas and how to use them
#but for now this is all we need to basically understand

**scraping**

we're taking a quick moment of in class time, now, to introduce how data is stored on websites in html, and how we "scrape" that html using Beautiful Soup
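
as a small warm-up, here is a sketch of how html marks up content with tags, and how Beautiful Soup pulls the content back out. The page below is made up for illustration, and it uses python's built-in 'html.parser' so it does not need the internet or any extra parser.

```python
from bs4 import BeautifulSoup

# A tiny made-up web page: tags like <h1> and <p> wrap the content.
html = """
<html>
  <body>
    <h1>Example page</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# 'html.parser' is python's built-in parser.
soup = BeautifulSoup(html, 'html.parser')

# find_all('p') returns every paragraph tag; .text strips the tags away.
paragraphs = [p.text for p in soup.find_all('p')]
print(paragraphs)
print(soup.find('h1').text)
```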

once we've talked about that, we'll look at this sample code below, which scrapes wiki articles and puts them into a dataframe:

In [5]:
# Import packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/John_D._Hunter').read()
# Make a soup
soup = BeautifulSoup(source,'lxml')
soup

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>John D. Hunter - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vec

In [6]:
# Extract the plain text content from paragraphs
text = ''
for paragraph in soup.find_all('p'):
    text += paragraph.text

text

"John D. Hunter (August 1, 1968 – August 28, 2012) was an American neurobiologist and the original author of Matplotlib.[1]\nHunter was brought up in Dyersburg, Tennessee, and attended The McCallie School. He graduated from Princeton University in 1990 and obtained a Ph.D. in neurobiology from the University of Chicago in 2004.[2][3] In 2005, he joined TradeLink Securities as a Quantitative Analyst.[4] Later, he was one of the founding directors of NumFOCUS Foundation.[5]\nHunter initially developed Matplotlib during his postdoctoral research in neurobiology to visualize electrocorticography (ECoG) data of epilepsy patients.[4] The open-source tool emerged as the most widely used plotting library for the Python programming language and a core component of the scientific Python stack, along with NumPy, SciPy and IPython.[6] Matplotlib was used for data visualization during the 2008 landing of the Phoenix spacecraft on Mars and for the creation of the first image of a black hole.[7][8]\n

In [None]:
#what if we wanted to scrape text from multiple wikipedia links? let's say, 3? what might we do?
#hint: we might start by making a list of the links we want to scrape

In [None]:
#challenge problem - let's all make a list of three wikipedia articles and try to write code to scrape all three

In [None]:
#what if we wanted to put the texts from our 3 links into a pandas dataframe? maybe with the link in one column, and then the text in another?
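
one possible shape for that dataframe is sketched below. The links and texts are placeholders, not real scraped results, and the names `links`, `texts`, and `articles` are made up for illustration; in practice the two lists would be filled in by your scraping loop.

```python
import pandas as pd

# Placeholder values standing in for real links and scraped text.
links = ['link one', 'link two', 'link three']
texts = ['text of article one', 'text of article two', 'text of article three']

# One row per article: the link in one column, its text in the next.
articles = pd.DataFrame(data={'link': links, 'text': texts})
print(articles)
```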

**apis**

ok now we'll take a few moments to discuss apis

then we'll play with some sample code with the twitter api, which, as you will recall, you got your bearer token for yesterday

remember, share this with no one! and don't post it in a public place online. you'll need to fill it in the code below for it to work

In the past, it was possible, and easy, to use the Twitter API together. Sadly, that is no longer possible (I will explain!). There is one API that you can still freely access, which is YouTube's, and if you want to do that, come talk to me and I can share the code/help you get the data. Today, however, we're going to take a simpler route, and use an app that's hooked up to the Reddit API to play with Reddit data.

Just so you know, though, my old twitter code for this tutorial is down at the bottom of the notebook. You could still use it to scrape twitter, but you'd have to pay a pretty steep fee first to get an access code, and I doubt anyone wants to do that! So, for our class, we can do api scraping with reddit and YouTube - reddit via the app below, YouTube with my help.

You can find the reddit app here:
https://smile.smm.ncsa.illinois.edu/

In [None]:
#Assignment, step 1: download data to an excel file that you want to play with. Then, upload the file here:

In [None]:
#Assignment, step 2: select one item in your reddit dataset and print it

In [None]:
#Assignment, step 3: loop through one column of your reddit dataset, and print the contents

Old Twitter scraping code is here (again, you'd need an access code that costs money). Ask me for youtube scraping code, which you can access for free (and I'll help you get the passcode/token for it) if you want it

In [None]:
#twitter scrape sample with changing access token

#import libraries

import requests
import os
import json
import time
import pandas as pd

In [None]:
#put in your individual bearer token (THIS IS THE PART THAT NOW COSTS MONEY - YOU HAVE TO PAY TWITTER FOR THE TOKEN - WHICH I WOULDNT ADVISE, ACTUALLY)

bearer_token = ''

In [None]:
#create some functions to prepare for scraping (this does not need to make sense! don't worry!)
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def connect_to_endpoint(url, headers, params):
    # use the url passed into the function rather than relying on a global variable
    response = requests.request("GET", url, headers=headers, params=params)
    # print(response.status_code)
    if response.status_code != 200:
        print('OOPS')
        raise Exception(response.status_code, response.text)
    return response.json()

In [None]:
# SMALL TEST QUERY
# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields

#scrape by user using their handle; if you want two users add an OR after the first 'from:' and put in another

query_params = {'query': 'from:elonmusk lang:en -is:retweet -is:reply',
                'tweet.fields': 'author_id,created_at,conversation_id,public_metrics',
                 'max_results': 10}
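
As a sketch of the two-user version, the code below builds the query string from a list of handles. The handles are hypothetical, and wrapping the OR group in parentheses so that the other filters apply to both users is my assumption about how the query is read; check the api documentation before relying on it.

```python
# Hypothetical handles, used only to show the shape of the query string.
handles = ['example_user_one', 'example_user_two']

# Join the handles into "from:a OR from:b" and wrap them in parentheses
# so the remaining filters are meant to apply to both users.
from_clause = ' OR '.join('from:' + handle for handle in handles)
query = '(' + from_clause + ') lang:en -is:retweet -is:reply'

two_user_params = {'query': query,
                   'tweet.fields': 'author_id,created_at,conversation_id,public_metrics',
                   'max_results': 10}
print(two_user_params['query'])
```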

In [None]:
# Sample run, the output will be json data, stored as a python dictionary called "json_response"

headers = create_headers(bearer_token)
search_url = "https://api.twitter.com/2/tweets/search/recent"
json_response = connect_to_endpoint(search_url, headers, query_params)
print(json_response)

In [None]:
#turn the json response into a pandas dataframe
df = pd.DataFrame(json_response['data'])
print(df)
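
Fields like public_metrics come back as a dictionary nested inside each item, so `pd.DataFrame` keeps that whole dictionary in a single cell. The sketch below uses a made-up response (not real tweets, and only a guess at the general shape of the real one) to show how `pd.json_normalize` spreads nested values into their own columns.

```python
import pandas as pd

# Made-up response in the general shape of a search result (not real tweets).
sample_response = {
    'data': [
        {'id': '1', 'text': 'first example',
         'public_metrics': {'like_count': 5, 'retweet_count': 2}},
        {'id': '2', 'text': 'second example',
         'public_metrics': {'like_count': 0, 'retweet_count': 1}},
    ]
}

# pd.DataFrame keeps each public_metrics dictionary inside one cell.
df_nested = pd.DataFrame(sample_response['data'])

# pd.json_normalize gives each nested value its own column.
df_flat = pd.json_normalize(sample_response['data'])
print(list(df_flat.columns))
```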

search by hashtag vs. user

In [None]:
query_params = {'query': '#metoo lang:en -is:retweet -is:reply',
                'tweet.fields': 'author_id,created_at,conversation_id,public_metrics',
                 'max_results': 10}

In [None]:

headers = create_headers(bearer_token)
search_url = "https://api.twitter.com/2/tweets/search/recent"
json_response = connect_to_endpoint(search_url, headers, query_params)
print(json_response)

learn more at the twitter api documentation:
https://developer.twitter.com/en/docs/twitter-api