<a href="https://colab.research.google.com/github/TessM2/content/blob/main/Workshop_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Today we're going to look at some of the ways we extract and work with data from the internet, as a way of "analyzing content".

We're going to look at two interrelated methods of doing this: a) scraping and b) using APIs.

In general, scraping is a method of programmatically extracting data from a website. For example, let's say I want to download the text of a whole bunch of Wikipedia articles at once (way more than I could reasonably expect to manually copy and paste); I "scrape" these articles from the internet.

An API, or application programming interface, is a system through which a social media platform (or other type of platform/site/etc.) gives you access to its data so that you can "scrape" or extract it more easily. Last week we signed up for the Twitter API.

Today we're going to talk a bit about both scraping and APIs, as we look at the sample code below. But first, we need to remind ourselves, quickly, of one of the better ways of storing and manipulating data: the dataframe, which, in Python, we work with using a package called pandas.

**Pandas**

In [None]:
#packages - for the new coders who came to the workshop last week, you might remember we discussed how, to perform special functions/commands in python, we use "packages" that have those functions
#to load a package that is already installed (colab comes with pandas) we type import, the name of the package, and then as "the name we want to call it"; here we import pandas and call it pd:
import pandas as pd

In [None]:
#pandas is a package we use to manipulate what are called dataframes
#data frames are a nice place to store and work with data; basically, they look like grids or spreadsheets
#because they work like grids or spreadsheets it's easy for me to convert them into excel or csv files
#so, if i want to scrape a bunch of articles from wikipedia, let's say, I can store them in a grid called a dataframe
#maybe i'll use one column for the article's title, one for its text, one for its publication date, and so on
#then i can download that dataframe into an excel file on my computer; and i can upload it back into my code as a dataframe to work with it in pandas
#in pandas we can create a dataframe like this:

data = pd.DataFrame()

#this is an empty dataframe with nothing in it; it is like a spreadsheet with zero rows and zero columns
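
The article example described in the comments above might look something like this. It is only a sketch: the column names and the two rows are made up for illustration, not real scraped data.

```python
import pandas as pd

# hypothetical example data: one row per article, one column per kind of information
articles = pd.DataFrame({
    "title": ["Article A", "Article B"],
    "text": ["text of the first article", "text of the second article"],
    "date": ["2023-01-20", "2023-01-21"],
})

print(articles.shape)          # (number of rows, number of columns)
print(list(articles.columns))  # the column names
```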


In [None]:
Tess = pd.DataFrame()

In [None]:
#let's say we want to create a dataframe with some things in it, like a few columns and rows with numbers in them
#here's how we could create a dataframe from a dictionary of items, each representing the numbers we want in one column
#(if you were at the coding workshop last week, I'm sorry, we did not learn what "dictionaries" are; but they're a bit like lists, and I'll explain!)


d = {'col1': [1, 2], 'col2': [3, 4]}
data1 = pd.DataFrame(data=d)
print(data1)

   col1  col2
0     1     3
1     2     4


In [None]:

d = {'banana': [7, 25], 'terrarium': [32, 2]}
Tess = pd.DataFrame(data=d)
print(Tess)

   banana  terrarium
0       7         32
1      25          2


In [None]:
#there's a lot we can do with dataframes, but let's just highlight a few functions for the moment, before we start scraping things, and putting them in dataframes
#one thing we might want to do is pick out an item in a dataframe
#let's say, e.g., we want to pick out, from the data1 dataframe we made earlier, the number four that's in the row with index 1, in the column called col2. we do this:

data1["col2"][1]

#what we just did there was type the name of the dataframe, then the name of the column we want in quotes and brackets, and then the index of the row 
#for the new coders from last week - note the indexes, or numbers, of the rows start at 0 and not 1, just like the letters in a string or the numbers in a list

4
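
pandas also has two dedicated ways of picking out a single item, `.loc` (by row label and column name) and `.iloc` (by position). Here is a small sketch; it rebuilds data1 so that it runs on its own.

```python
import pandas as pd

# the same dataframe as data1 above
data1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# .loc picks by row label and column name
by_label = data1.loc[1, "col2"]

# .iloc picks by position: row position 1, column position 1
by_position = data1.iloc[1, 1]

print(by_label, by_position)
```

Both lines pick out the same cell as `data1["col2"][1]`.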

In [None]:
#another thing we often want to do with dataframes is loop through their rows, in a particular column, so we can do the same thing to every item in that column
#like, for example, if we had a dataframe with wikipedia article texts in one column, we might want to loop through each text, and search it for some key word
#here, let's make a for loop to loop through each item in col2 and print it:

for i in range(2):
  print(data1["col2"][i])

  #quiz for the new coders, from last week: can you tell me why I put "2" in the range for the for loop?

3
4
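
Here is a sketch of the keyword search idea mentioned above, using a few made-up sentences in place of real article texts. It also shows two variations on the loop: using `len()` so the number of rows doesn't have to be typed in by hand, and looping over the items in a column directly.

```python
import pandas as pd

# made-up texts standing in for scraped articles
df = pd.DataFrame({"text": ["pandas are bears", "python is a language", "pandas is a package"]})

# len(df) gives the number of rows, so the loop still works if rows are added later
for i in range(len(df)):
    print(df["text"][i])

# we can also loop over the items in a column directly, without using indexes
matches = []
for text in df["text"]:
    if "pandas" in text:
        matches.append(text)

print(matches)
```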


In [None]:
#another thing it's really valuable to be able to do with pandas is take a file from your computer, like a csv or excel file, and load it into a dataframe
#actually, we do that sort of weirdly in colab, and especially working as a group
#so I'm going to show you the normal way we do this, but it's not how it works in colab
#the normal way we do this, with, e.g., an excel file is this:

data3 = pd.read_excel("name of excel file from computer here and full file path if need be.xlsx")

#to read a .csv file instead, use pd.read_csv in place of pd.read_excel


FileNotFoundError: ignored
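
The error above just means there is no file with that placeholder name. To see reading work from start to finish, here is a sketch that first writes a tiny csv file to a temporary folder and then reads it back; the file name and its contents are made up for the example.

```python
import os
import tempfile
import pandas as pd

# write a tiny csv file to a temporary folder so there is something to read
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "example.csv")  # hypothetical file name
with open(csv_path, "w") as f:
    f.write("col1,col2\n1,3\n2,4\n")

# read the csv file into a dataframe
data3 = pd.read_csv(csv_path)
print(data3)
```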

In [None]:
#in colab, though, we do one of two things. either we use the upload process described here (https://colab.research.google.com/notebooks/io.ipynb)
#if you have an excel file on your computer you can try uploading it this way:

from google.colab import files
uploaded = files.upload()

KeyboardInterrupt: ignored

In [None]:
print(uploaded)
pd.read_excel('name of excel file here .xlsx')

NameError: ignored

In [None]:
#or, we'll use files together that I upload for all of us, in which case the code will look weird; we'll do that a bit next week, don't worry about it now

In [None]:
#in pandas you also might want to save the dataframe you're working with as an excel or csv file to your computer
#again, colab does this differently
#but here's the normal way:

data1.to_excel("name to give file.xlsx")
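
Saving to csv works the same way, with `to_csv`. A small sketch, assuming we save into a temporary folder (the file name is just an example) and then open the file to see what was written:

```python
import os
import tempfile
import pandas as pd

data1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# save into a temporary folder; the file name is just an example
out_path = os.path.join(tempfile.mkdtemp(), "data1.csv")

# index=False leaves out the row numbers (0, 1, ...) so they don't become an extra column
data1.to_csv(out_path, index=False)

# open the saved file as plain text to see what it looks like
with open(out_path) as f:
    saved_text = f.read()
print(saved_text)
```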


In [None]:
#here's how colab does it:

files.download('#name of file')

FileNotFoundError: ignored

In [None]:
#there's a lot more we can do in pandas, like add rows or columns, sort them by quantities, and so on
#you can learn about all that in the pandas documentation here: https://pandas.pydata.org/docs/index.html
#or by googling what you want to do in pandas and then finding how to do it on, e.g., stack exchange
#but for now this is all we need to basically understand
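
For the curious, here is a short sketch of two of the things just mentioned, adding columns and sorting. The new column names and values are made up for the example.

```python
import pandas as pd

data1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# add a new column by giving it a name and a list with one value per row
data1["col3"] = [10, 5]

# add a column computed from other columns
data1["total"] = data1["col1"] + data1["col2"]

# sort the rows by the values in col3, smallest first
sorted_data = data1.sort_values("col3")
print(sorted_data)
```

Note that `sort_values` returns a new, sorted dataframe and leaves `data1` itself in its original order.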

**scraping**

we're taking a quick moment of in class time, now, to introduce how data is stored on websites in html, and how we "scrape" that html using Beautiful Soup
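
As a small illustration of that idea, here is a sketch that parses a tiny, made-up html page with Beautiful Soup, so no internet connection is needed. The page contents are invented for the example, and it uses python's built-in 'html.parser' rather than 'lxml'.

```python
from bs4 import BeautifulSoup

# a tiny made-up page, so we don't need the internet for this example
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Example page</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <a href="https://example.com">a link</a>
  </body>
</html>
"""

# 'html.parser' comes with python, so nothing extra needs to be installed
soup = BeautifulSoup(html, "html.parser")

# the text inside the <title> tag
title = soup.title.text

# a list with the text of every <p> tag
paragraphs = [p.text for p in soup.find_all("p")]

# the address that the first <a> tag links to
link = soup.find("a")["href"]

print(title)
print(paragraphs)
print(link)
```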

once we've talked about that, we'll look at this sample code below, which scrapes wiki articles and puts them into a dataframe:

In [None]:
# Import packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/John_D._Hunter').read()
# Make a soup 
soup = BeautifulSoup(source,'lxml')
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>John D. Hunter - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"8262c61c-bed5-40eb-a1eb-2c50defe37c7","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"John_D._Hunter","wgTitle":"John D. Hunter","wgCurRevisionId":1102368745,"wgRevisionId":1102368745,"wgArticleId":45714542,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with hCards","1968 births","2012 deaths","American neuroscientists","Princeton University alumni","University of Chicago alumni","People fro

In [None]:
# Extract the plain text content from paragraphs
text = ''
for paragraph in soup.find_all('p'):
    text += paragraph.text
    
text

"\n\t\t\t\tPages for logged out editors learn more\nJohn D. Hunter (August 1, 1968 – August 28, 2012) was an American neurobiologist and the original author of Matplotlib.[1]\nHe was brought up in Dyersburg, Tennessee. He graduated from The McCallie School. He studied initially at Princeton University, later he obtained a Ph.D. in neurobiology from the University of Chicago in 2004.[2][3] In 2005, he joined TradeLink Securities as a Quantitative Analyst.[4] Later, he was one of the founding directors of NumFOCUS Foundation.[5]\nMatplotlib was originally  conceived to visualize electrocorticography (ECoG) data of epilepsy patients during post-doctoral research in neurobiology.[4] The open-source tool  emerged as the most widely used plotting library for the Python programming language, and a core component of the scientific Python stack, along with NumPy, SciPy and IPython.[6] Matplotlib was used for data visualization during landing of the Phoenix spacecraft in 2008 as well as for the 

In [None]:
#what if we wanted to scrape text from multiple wikipedia links? let's say, 3? what might we do?
#hint: we might start by making a list of the links we want to scrape
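
One possible approach, following the hint, is sketched below. The function names are made up for this example, and the second and third links are just suggestions; swap in whatever pages you like. The download step is left commented out so the cell runs without an internet connection.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_page_text(html):
    # join together the text of every <p> tag on one page
    soup = BeautifulSoup(html, "html.parser")
    return "".join(p.text for p in soup.find_all("p"))

def fetch_html(url):
    # download one page (needs an internet connection)
    return urlopen(url).read()

def scrape_links(links, fetch=fetch_html):
    # loop through the list of links and collect one text per link
    texts = []
    for link in links:
        texts.append(get_page_text(fetch(link)))
    return texts

# the first link is the one used above; the other two are example choices
links = [
    "https://en.wikipedia.org/wiki/John_D._Hunter",
    "https://en.wikipedia.org/wiki/Matplotlib",
    "https://en.wikipedia.org/wiki/NumPy",
]

# uncomment to actually download and scrape the pages:
# texts = scrape_links(links)
```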

In [None]:
#what if we wanted to put the texts from our 3 links into a pandas dataframe? maybe with the link in one column, and then the text in another?
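
One way this might look, as a sketch: the texts here are placeholders standing in for the scraped article texts, and the second and third links are example choices.

```python
import pandas as pd

links = [
    "https://en.wikipedia.org/wiki/John_D._Hunter",
    "https://en.wikipedia.org/wiki/Matplotlib",
    "https://en.wikipedia.org/wiki/NumPy",
]

# placeholder texts; in practice these would be the scraped texts, in the same order as the links
texts = ["text of page 1", "text of page 2", "text of page 3"]

# one column for the links and one for the texts; each row is one page
wiki_data = pd.DataFrame({"link": links, "text": texts})
print(wiki_data)
```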

**apis**

ok now we'll take a few moments to discuss apis

then we'll play with some sample code with the twitter api, which, as you will recall, you got your bearer token for yesterday

remember, share this with no one! and don't post it in a public place online. you'll need to fill it in the code below for it to work

In [None]:
#twitter scrape sample with changing access token

#import libraries

import requests
import os
import json
import time
import pandas as pd

In [None]:
#put in your individual bearer token (DO NOT USE MINE!!!!)

bearer_token = ''
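
If you would rather not type your token into a notebook that other people might see, one common alternative is to keep it in an environment variable and read it with `os`. This is only a sketch: the variable name TWITTER_BEARER_TOKEN and the function name are assumptions made for the example, and you would have to set that variable yourself first.

```python
import os

def load_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    # env_var is a hypothetical name; use whatever name you saved your token under
    token = os.environ.get(env_var, "")
    if not token:
        raise ValueError("no token found in the environment variable " + env_var)
    return token

try:
    bearer_token = load_bearer_token()
except ValueError as error:
    # fall back to an empty token and show what went wrong
    bearer_token = ''
    print(error)
```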

In [None]:
#create some functions to prepare for scraping (this does not need to make sense! don't worry!)
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def connect_to_endpoint(url, headers, params):
    response = requests.request("GET", url, headers=headers, params=params)
    # print(response.status_code)
    if response.status_code != 200:
        print('OOPS')
        raise Exception(response.status_code, response.text)
    return response.json()

In [None]:
# SMALL TEST QUERY
# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields

#scrape by user using their handle; if you want two users, add an OR after the first 'from:handle' and put in another 'from:handle'

query_params = {'query': 'from:elonmusk lang:en -is:retweet -is:reply',
                'tweet.fields': 'author_id,created_at,conversation_id,public_metrics',
                 'max_results': 10}
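
To see what these parameters turn into, here is a sketch that builds the request without sending it, so no bearer token is needed. It uses `requests.Request(...).prepare()` to get the full web address, then decodes that address again as a check.

```python
import requests
from urllib.parse import urlparse, parse_qs

search_url = "https://api.twitter.com/2/tweets/search/recent"
query_params = {'query': 'from:elonmusk lang:en -is:retweet -is:reply',
                'tweet.fields': 'author_id,created_at,conversation_id,public_metrics',
                'max_results': 10}

# build the request without sending it
prepared = requests.Request("GET", search_url, params=query_params).prepare()

# the full address, with the parameters encoded after the "?"
print(prepared.url)

# decode the address again to check that the parameters survived the trip
decoded = parse_qs(urlparse(prepared.url).query)
print(decoded)
```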

In [None]:
# Sample run; the output will be the JSON response, parsed into a Python dictionary called "json_response"

headers = create_headers(bearer_token)
search_url = "https://api.twitter.com/2/tweets/search/recent"
json_response = connect_to_endpoint(search_url, headers, query_params)
print(json_response)

{'data': [{'author_id': '44196397', 'public_metrics': {'retweet_count': 14298, 'reply_count': 13595, 'like_count': 232771, 'quote_count': 2440, 'impression_count': 23186252}, 'created_at': '2023-01-21T17:41:22.000Z', 'id': '1616853475220156417', 'text': 'Ads are too frequent on Twitter and too big. Taking steps to address both in coming weeks.', 'conversation_id': '1616853475220156417', 'edit_history_tweet_ids': ['1616853475220156417']}, {'author_id': '44196397', 'public_metrics': {'retweet_count': 6373, 'reply_count': 5326, 'like_count': 71287, 'quote_count': 1860, 'impression_count': 9472798}, 'created_at': '2023-01-21T00:31:38.000Z', 'id': '1616594332907372544', 'text': 'Next Twitter update will remember whether you were on For You (ie recommended), Following or list you made &amp; stop switching you back to recommended tweets', 'conversation_id': '1616594332907372544', 'edit_history_tweet_ids': ['1616594332907372544']}, {'author_id': '44196397', 'public_metrics': {'retweet_count': 

In [None]:
#turn the json response into a pandas dataframe
df = pd.DataFrame(json_response['data'])
print(df)

  author_id                                     public_metrics  \
0  44196397  {'retweet_count': 14298, 'reply_count': 13595,...   
1  44196397  {'retweet_count': 6373, 'reply_count': 5326, '...   
2  44196397  {'retweet_count': 24616, 'reply_count': 18621,...   
3  44196397  {'retweet_count': 2983, 'reply_count': 1956, '...   
4  44196397  {'retweet_count': 10953, 'reply_count': 7681, ...   
5  44196397  {'retweet_count': 10328, 'reply_count': 7884, ...   
6  44196397  {'retweet_count': 8163, 'reply_count': 8221, '...   
7  44196397  {'retweet_count': 17280, 'reply_count': 14155,...   
8  44196397  {'retweet_count': 4858, 'reply_count': 3794, '...   
9  44196397  {'retweet_count': 4167, 'reply_count': 3084, '...   

                 created_at                   id  \
0  2023-01-21T17:41:22.000Z  1616853475220156417   
1  2023-01-21T00:31:38.000Z  1616594332907372544   
2  2023-01-21T00:18:21.000Z  1616590989145313283   
3  2023-01-20T23:23:17.000Z  1616577133945688065   
4  2023-01-20
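
Notice that the public_metrics column holds a whole dictionary in each cell. A sketch of one way to spread those out into their own columns, using `pd.json_normalize`; the two tweets are copied from the response printed earlier and trimmed to a few fields.

```python
import pandas as pd

# two tweets copied from the response printed above (trimmed to a few fields)
json_response = {'data': [
    {'author_id': '44196397',
     'public_metrics': {'retweet_count': 14298, 'reply_count': 13595,
                        'like_count': 232771, 'quote_count': 2440},
     'created_at': '2023-01-21T17:41:22.000Z',
     'id': '1616853475220156417'},
    {'author_id': '44196397',
     'public_metrics': {'retweet_count': 6373, 'reply_count': 5326,
                        'like_count': 71287, 'quote_count': 1860},
     'created_at': '2023-01-21T00:31:38.000Z',
     'id': '1616594332907372544'},
]}

# json_normalize gives each nested item its own column, e.g. public_metrics.like_count
flat = pd.json_normalize(json_response['data'])
print(list(flat.columns))

# turn the created_at strings into real dates so they can be sorted or filtered
flat['created_at'] = pd.to_datetime(flat['created_at'])
print(flat[['created_at', 'public_metrics.like_count']])
```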

search by hashtag vs. user

In [None]:
query_params = {'query': '#metoo lang:en -is:retweet -is:reply',
                'tweet.fields': 'author_id,created_at,conversation_id,public_metrics',
                 'max_results': 10}

In [None]:

headers = create_headers(bearer_token)
search_url = "https://api.twitter.com/2/tweets/search/recent"
json_response = connect_to_endpoint(search_url, headers, query_params)
print(json_response)

{'data': [{'author_id': '355270753', 'text': 'Hope you all can join me this coming weekend! #MeToo https://t.co/mOhXOkAiCk', 'edit_history_tweet_ids': ['1617323075904876545'], 'created_at': '2023-01-23T00:47:24.000Z', 'public_metrics': {'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0, 'impression_count': 8}, 'id': '1617323075904876545', 'conversation_id': '1617323075904876545'}, {'author_id': '17372996', 'text': 'Hmmmm...whom to believe, whom to believe: Pamela Anderson or Tim Allen? #MeToo', 'edit_history_tweet_ids': ['1617322162007805953'], 'created_at': '2023-01-23T00:43:46.000Z', 'public_metrics': {'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0, 'impression_count': 88}, 'id': '1617322162007805953', 'conversation_id': '1617322162007805953'}, {'author_id': '80589197', 'text': 'The latest The GTD AwesomeSauce! https://t.co/JnzPO0ffiT #metoo #tabletop', 'edit_history_tweet_ids': ['1617322037881569281'], 'created_at': '2023-01-23T00:43:16
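
Each request above returns at most max_results tweets. The list of optional params earlier includes next_token, which is how you ask for the next page of results. Below is a rough sketch of a loop that does this. It assumes the response contains a 'meta' item holding a 'next_token' whenever more results are available; check the documentation before relying on that. The function name and max_pages are made up for the example.

```python
def collect_pages(fetch_page, query_params, max_pages=3):
    # fetch_page is any function that takes a params dictionary
    # and returns the parsed json response
    all_tweets = []
    params = dict(query_params)  # copy, so the original dictionary is not changed
    for _ in range(max_pages):
        response = fetch_page(params)
        all_tweets.extend(response.get('data', []))
        # assumption: 'meta' holds a 'next_token' when there are more results
        next_token = response.get('meta', {}).get('next_token')
        if next_token is None:
            break
        params['next_token'] = next_token
    return all_tweets

# with the functions defined earlier in this notebook, fetch_page could be:
# fetch_page = lambda params: connect_to_endpoint(search_url, headers, params)
```

Keeping max_pages small is a simple way to avoid using up your request allowance by accident.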

learn more at the twitter api documentation:
https://developer.twitter.com/en/docs/twitter-api