**Collecting Social Media Data**

Last week, we talked about basic web scraping, where you use the "requests" library to request the HTML from a webpage. Unfortunately, that method doesn't work too well with complex, privacy-protected sites like social media platforms. So today we'll talk about getting social media data.

minilecture...

**Method 1: Official APIs**

The "best" way to deal with social media, which is to say the entirely sanctioned, totally "legal" (or "legalistic"), and most tractable way, is through an officially offered API for academic researchers. This is a tool that platforms share with academic researchers, sometimes through an application process, to give curated access to their data. You usually have to know how to program to actually download and use the data, but the APIs also come with documentation of the systems they use and of other methods of collecting the data (these also involve technical means, but not via Python). The problem here, however, is twofold. First, some platforms don't offer official APIs. Second, even where platforms do offer official APIs, they may be difficult to access for one of several reasons: a) they cost a lot of money (as with Twitter's now); b) they require a cumbersome application process (this is less prohibitive, but can still be annoying, like Facebook and Instagram right now); c) they have weird terms of service (TikTok); or d) they only let you download limited types of data (e.g., YouTube, which doesn't let you download full videos). Here's a rundown of the current official API status of the platforms:

**YouTube** - free, accessible academic API. Hurray! But you can't download videos.

**TikTok** - there is an academic API which you can apply for, and it is probably fine to use, though the terms of service are shady.

**Facebook/Instagram** - after not having an API for some time, they recently released a "content library" through the University of Michigan, which is meant to function as a data API for academic applicants. That said, I've seen few people use this yet, and the application process is currently on hold.

**Twitter** - used to be famous for having an openly accessible academic API, which you could easily apply for access to (applications were often approved within hours); Elon Musk made it exorbitantly expensive.

**Reddit** - also used to be famously open and accessible, and, like YouTube, did not even require an application. But it also started costing money when Musk tanked the Twitter one.

For more info, you can also see this UIUC library guide on the topic that I helped create: https://guides.library.illinois.edu/c.php?g=1281804&p=9406783#s-lg-box-29755025


tldr - if you are doing social media research long term, you can consider applying to use either the TikTok or Facebook/Instagram (Meta) APIs. Though both have their complications, it's potentially worth applying, depending on your use case. Or, if you have lots of (research) money, you could buy access to the Twitter API (though some academic programs are now trying to make Twitter API access more affordable, like a program through a company called Meltwater, which has been providing academics Twitter API access for recent tweets at $10-20 a month). For the purposes of our class, though, the only API immediately usable, which is to say free, accessible, and with no application needed, is YouTube.

**API example: the YouTube API**

So, usually when you use an API, you have to do so programmatically, which means using code.

The steps usually go like this:

First, you have to go through a process of creating what's called an "API key," which is, like, your secret passcode to use the API. Every time you use the API, you need to put this passcode in your code. BUT you should NEVER share it with anyone, or leave it in code that you share with another person, since it's a security risk. If the API is one you have to go through a big long application process for, like TikTok or Meta, you won't get the key until you successfully apply. If the API is more accessible, like YouTube, you can get the key instantaneously after going through a brief process online. Here's where you can find the instructions for how to make an API key to use the YouTube Data API through Google, YouTube's parent company: https://developers.google.com/youtube/v3/getting-started. Follow the links in #2 to do the process.
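
One common way to keep a key out of code you share is to read it from an environment variable, or to type it in when the notebook runs. Here's a minimal sketch of that idea; the helper name `load_api_key` and the variable name `YOUTUBE_API_KEY` are made up for illustration, not something Google requires:

```python
import os
from getpass import getpass


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return an API key without ever writing it into the notebook itself."""
    # First, look for the key in an environment variable (the name is an assumption).
    key = os.environ.get(env_var)
    if not key:
        # Otherwise ask for it; getpass hides what you type or paste.
        key = getpass(f"Paste your API key ({env_var}): ")
    return key


# Usage (uncomment to run; you'll be prompted if the variable isn't set):
# api_key = load_api_key()
```

This way, if you share the notebook, the other person sees the function but not your key.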

Second, once you get the key, you need to run a set of commands to collect data from the API. The commands are supplied through the API documentation, which is like a long, annoying instruction manual. Sorry to say, if you want the data, you have to read it and learn the commands, though ChatGPT can help. The YouTube API documentation is also at the link above. Documentation is good to read because it also includes important restrictions, like telling you a) what your data quotas are (i.e., when you will run out of the ability to collect more data, per day, week, month, etc.) and b) other restrictions, like reminding you not to create and use two different API keys for one project to circumvent your quotas, or you will get kicked off (which happens; I know, I did it a bunch of times, lol).
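
To give a sense of what those "commands" look like, here's a hedged sketch of one keyword search against the YouTube Data API's `search` endpoint. The function names are mine, the key is a placeholder, and it uses Python's built-in `urllib` (the `requests` library from last week would work just as well). Check the documentation for the current parameters and quota costs before relying on it:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Search endpoint of the YouTube Data API v3
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(query, api_key, max_results=5):
    """Build the request URL for a keyword search (no network call yet)."""
    params = {
        "part": "snippet",          # ask for titles, descriptions, dates
        "q": query,                 # the search term
        "type": "video",            # only return videos
        "maxResults": max_results,  # how many results per request
        "key": api_key,             # your secret API key
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def search_videos(query, api_key, max_results=5):
    """Send the request and return the parsed JSON response as a dictionary."""
    with urlopen(build_search_url(query, api_key, max_results)) as response:
        return json.load(response)


# Usage (needs a real key and an internet connection, so it's commented out):
# results = search_videos("sunrise hike", api_key="YOUR_KEY_HERE")
# for item in results["items"]:
#     print(item["id"]["videoId"], item["snippet"]["title"])
```

Each call like this counts against your daily quota, which is one reason to test with a small `max_results` first.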

Third, and finally, once you get the data you want, you will likely have to convert it into a format you can use. Often, API data comes in the form of .json files, which you need to parse with a Python module called json. We'll return to this later, but here's a sample of what JSON looks like:



In [None]:
# Simulated social media API response
social_media_data = {
    "tweets": [
        {
            "id": "153482901",
            "user": {
                "username": "tech_guru99",
                "display_name": "Tech Guru",
                "followers_count": 20500
            },
            "text": "AI is transforming the world! What are your thoughts? #ArtificialIntelligence #Future",
            "created_at": "2025-02-18T14:23:00Z",
            "engagement": {
                "likes": 120,
                "retweets": 45,
                "replies": 10
            }
        },
        {
            "id": "153482902",
            "user": {
                "username": "nature_lover",
                "display_name": "Emma Green",
                "followers_count": 3450
            },
            "text": "Nothing beats a peaceful sunrise hike. 🌄 #NatureLover",
            "created_at": "2025-02-18T06:15:00Z",
            "engagement": {
                "likes": 87,
                "retweets": 20,
                "replies": 8
            }
        },
        {
            "id": "153482903",
            "user": {
                "username": "dev_journey",
                "display_name": "Code Explorer",
                "followers_count": 15000
            },
            "text": "Just deployed my first full-stack app! Feeling accomplished. 🚀 #100DaysOfCode",
            "created_at": "2025-02-17T20:45:00Z",
            "engagement": {
                "likes": 200,
                "retweets": 75,
                "replies": 30
            }
        }
    ]
}



Wow, that all sounds pretty involved and annoying, and yes, the first time, it is. Like everything with coding, though, once you do it once, it's easier every time after.

Also, fortunately for you, sometimes some people do a nice thing and create a GUI wrapper, or front-end application, so that you can use an open platform API without coding. My colleague Chen Wang and I at UIUC's NCSA worked together to develop a YouTube API wrapper in what is called the "SMILE" tool, where you can access the YouTube API without coding. So, let's do it there! Hurray. As a special bonus, the SMILE tool also still curates access to what used to be the free Reddit API, for some reason (Reddit simply hasn't cut it off yet), so you can also use it to collect Reddit data.

**Follow link to SMILE tool here:** https://smile.smm.ncsa.illinois.edu/

In [None]:
#in class exercise, use smile tool to collect YouTube and/or reddit data

Note that you can make your own YouTube API key and punch it into the SMILE tool. The major advantage of doing this is that you can then use your own quota limits, rather than being bound to the tool's limits which, with multiple people using it, will run out quickly. If you plan to do a YouTube project for the class, I'd take this step.

**Method 2: Webscraping**

OK, well let's say there's no official API for your platform that's easy to use. The next method you might try is web scraping, which basically means ripping the data from the platform's front end. This is the process we've been doing, in Melanie's textbook, with pages like Wikipedia, where we grab the HTML and parse it with Beautiful Soup. But with social media platforms, it's a lot more complicated. There are two major differences:

  1. first, because social media companies don't want you to scrape their data, scraping requires writing code that gets past their blocks. For this reason, researchers often don't go through the process of writing all the code to scrape the website themselves, and instead use scraping tools (or program-accessible applications) that others have already created. One prominent one for TikTok right now, for example, is the scraper called "pyktok" by Deen Freelon et al. (see below). You can find these scrapers by googling, or by looking for articles which have recently used data from your platform and seeing if they list the scraper they used.
  2. second, rather than giving you the data in HTML format, scrapers will often give you the data in JSON format, which we'll discuss in a bit. Sometimes, though, they'll be even nicer and just give you the data in straight CSV file format, which, as we'll see, is true of pyktok.

Scraping has three downsides that make it less desirable (sometimes) than using sanctioned platform APIs:

  1. scraping almost always violates the platforms' terms of service, which you automatically agree to if you make an account on the platform or, possibly, even access it at all. This doesn't mean scraping is "illegal," necessarily (that jury is out), but it does mean it technically violates the terms. So the platform could, in theory, send you a cease and desist letter over your research, and that has, in the history of the world, happened to people (often with very high-profile research that is very unflattering to the platform). That almost never happens, and huge numbers of academic researchers scrape platform data every day, publish with it, and don't worry about it; it's a norm. Still, it's good to know. Do research into your particular use case, talk to others, and consider whether you want to undertake the "risk."
  2. scraping means you can only access the data on the platform in the form that it appears on the "front end" or GUI of the website. APIs can be more flexible and let you access or sort through the data from the "back end"; often, that's not a huge problem, but there can be differences (will discuss in class).
  3. You can get your account banned. Pretty easy to reinstate it, usually; but still good to know. I often make a burner account to scrape. Not necessary, I think, with the example below (pyktok).

sample, Pyktok scraper, here:
https://github.com/dfreelon/pyktok

Because of the "sensitive" nature of scraping, I'm going to have you work through the (pretty simple) instructions for how to use this scraper as an optional homework assignment. If you want to work on social media as your research area, I'd suggest doing it; if not, you can skip it.

**Important note, though: for whatever reason, I have trouble running the sample pyktok code in Colab, so you might try it in Jupyter Notebook instead. You may or may not need to download the thing called Playwright; if so, you might do so from the terminal or ask GPT for help. I'd suggest using Firefox as the browser (easier for me) as opposed to Chrome.**

Not all websites are as hard to scrape as the major social media platforms! Some social media platforms (like, say, 4chan) are easier to scrape, and some coder-friendly sites, like Medium.com, don't even bar scraping in their terms of service. Sites like Goodreads, too, or fanfiction.net, have been pretty scrape-able. Mel Walsh and Maria Antoniak made a nice Goodreads scraper, here: https://github.com/maria-antoniak/goodreads-scraper. TLDR: do research into the platform/site you want to scrape, and there should be a way.

**Nota bene:** often, it is best to use not simply an API or scraping, but both together. This is because an API often makes one type of data available (e.g., metadata concerning content from the "back end") while a scraper makes other types available (e.g., full videos).

Examples: you might use the YouTube API to get a list of links to all videos with a certain hashtag in the title, filtered by region (US only), which would be hard to collect from the front end. But because the YouTube API doesn't offer full videos, you might then feed this list of URLs into a scraper and download the videos.

You might use the zeeschuimer tool (below) to gather a list of all URLs to TikToks from a certain page, or called up for a certain keyword, from the front end of the platform. You might then feed that list of URLs into pyktok to scrape the videos.

It's common to combine sources in this way. But, given that we now live in a time when APIs are less and less accessible, it's common to use scraping, alone, as well.
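
As a sketch of the first half of the YouTube example above, here's how you might turn an API search response into a plain list of video URLs to hand to another tool. The response below is made up and trimmed down, but it follows the shape the YouTube search endpoint returns (`items`, each with an `id` holding a `videoId`); the helper name `video_urls` is my own:

```python
# A made-up, trimmed-down response in the shape the YouTube search endpoint returns
sample_response = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "abc123XYZ_0"},
         "snippet": {"title": "First video"}},
        {"id": {"kind": "youtube#channel", "channelId": "UCexample"},
         "snippet": {"title": "A channel, not a video"}},
        {"id": {"kind": "youtube#video", "videoId": "def456UVW_1"},
         "snippet": {"title": "Second video"}},
    ]
}


def video_urls(response):
    """Collect a watch-page URL for every video in an API search response."""
    urls = []
    for item in response.get("items", []):
        video_id = item.get("id", {}).get("videoId")
        if video_id:  # skip channels and playlists, which have no videoId
            urls.append(f"https://www.youtube.com/watch?v={video_id}")
    return urls


urls = video_urls(sample_response)
print(urls)
```

The resulting list is the kind of thing you'd then loop over with whatever scraper or downloader you've chosen.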

**Three other methods:**

APIs and scraping are the norm. But three other methods are worth noting:
1. Corporate purchases: if you're in this research area long term, and get some funding, purchasing data from corporate aggregators, who have API access that academics don't, is worth considering.
2. If you're a famous academic working on a very urgent topic (like fake news), you might be able to reach out to the platform directly to get data. That doesn't apply to any of us, myself included, but it's a thing.
3. Front-end collection tools: there are some tools that, though not quite scrapers, allow you to collect data from the front end of a platform (often not full videos, but useful metadata) while you scroll through the platform browsing. There is some argument (though who knows how specious) that these don't violate terms of service. One is called zeeschuimer; let's check it out now:


**Zeeschuimer**

https://github.com/digitalmethodsinitiative/zeeschuimer

For the rest of class, we're going to work together to see if we can get this front-end browser extension up and running on your computers to download data from the front end of a number of social media platforms. If you also download and get up and running the thing called "4cat," then you can get this data directly as CSV files. If, however, you can't get 4cat going (and I usually don't use it), the data will come to you, like much platform API data, in the form of .json files (in this case, a permutation on JSON files called .ndjson). So, let's talk about JSON for a second:

**json** is the format that a lot of API data from social media platforms "arrives" in. Like other data formats we've talked about, such as the markup languages XML and HTML, JSON is a format that you can "parse," or read, using a program. To read JSON, we don't use Beautiful Soup (that's for markup languages like XML and HTML). Instead, we use the json module in Python. Much like with those markup languages, though, it can be easiest, at least when you're starting out, not to try to "memorize" the whole syntax of JSON, but simply to learn how to read it (that is, examine it for the parts you want to extract) when you need to parse it.

**Here's what json looks like:**

https://json.org/example.html

Note that if you use the "raw" YouTube API, instead of going through the SMILE tool, the files will come in JSON too.
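
To make "parsing" concrete before we touch any real files, here's a tiny sketch with a made-up record. It shows how JSON text maps onto ordinary Python types once the json module reads it: objects become dictionaries, arrays become lists, `true`/`false` become `True`/`False`, and `null` becomes `None`.

```python
import json

# JSON always starts life as plain text (a string), like this made-up record
raw_text = (
    '{"username": "nature_lover", "followers_count": 3450, '
    '"verified": false, "hashtags": ["NatureLover"], "bio": null}'
)

# json.loads ("load string") turns that text into Python objects
record = json.loads(raw_text)

print(type(record))               # a dictionary
print(record["hashtags"][0])      # lists inside dictionaries work as usual
print(record["verified"], record["bio"])
```

Once the text is a dictionary, everything you already know about dictionaries and lists applies.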

Let's get zeeschuimer running for ourselves so we can collect a sample ndjson file and inspect it. Then, we can practice "parsing" some JSON (or, in this case, ndjson, on the assumption we won't use the 4cat instance, though we could do that).

Samples of parsing json:

OK, so let's say we have a JSON file and we want to load it in. We can load it in and then use the Python module called json to turn it into a Python dictionary. Here's how we'd do that:

In [None]:
from google.colab import files
import json

# Upload the JSON file
uploaded = files.upload()  # User selects a file to upload

# Get the filename (first key in uploaded dictionary)
json_filename = list(uploaded.keys())[0]

# Open and parse the JSON file
with open(json_filename, "r", encoding="utf-8") as file:  # utf-8 so emoji and accents read correctly
    data = json.load(file)  # Converts JSON file into a Python dictionary

# Pretty-print the parsed JSON data
print(json.dumps(data, indent=4))
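
The `files.upload()` step above only exists in Colab. If you're working in Jupyter Notebook on your own computer instead, you can open the file by its path. Here's a self-contained sketch; the filename `sample_posts.json` is hypothetical, and the code writes that tiny file first just so the example runs on its own:

```python
import json
import tempfile
from pathlib import Path

# Create a small sample file so this example is runnable anywhere
# (in real use, this would be the path to the file you downloaded)
sample_path = Path(tempfile.gettempdir()) / "sample_posts.json"
sample_path.write_text(
    json.dumps({"tweets": [{"id": "1", "text": "hello 🌄"}]}),
    encoding="utf-8",
)

# Open and parse the JSON file, exactly as in the Colab version
with open(sample_path, "r", encoding="utf-8") as file:
    data = json.load(file)

print(data["tweets"][0]["text"])
```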


In the case of using zeeschuimer, we get an ndjson file, which is roughly the same. Here's how we'd load in and parse that:

In [None]:
from google.colab import files
import json

# Upload the NDJSON file
uploaded = files.upload()

# Get the filename (first key in uploaded dictionary)
ndjson_filename = list(uploaded.keys())[0]

# Read and parse the NDJSON file
data = []
with open(ndjson_filename, "r", encoding="utf-8") as file:
    for line in file:
        if line.strip():  # Skip blank lines, which would otherwise raise an error
            data.append(json.loads(line))  # Parse each line as a separate JSON object

# Print the first few parsed JSON objects
for item in data[:3]:  # Show only the first 3 for readability
    print(json.dumps(item, indent=4))



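What's the difference? A JSON file holds one JSON value (often one big list or dictionary), which can be spread over many lines. An NDJSON ("newline-delimited JSON") file instead puts each JSON object on its own line, one after another.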
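
Here's a small sketch of that difference, using two made-up posts. It writes the same data both ways as text, then reads the NDJSON version back line by line (`io.StringIO` just lets us treat a string like an open file):

```python
import io
import json

posts = [
    {"id": "1", "text": "first post"},
    {"id": "2", "text": "second post"},
]

# Regular JSON: the whole list is ONE JSON value
as_json = json.dumps(posts)

# NDJSON: each post is its own JSON object on its own line
as_ndjson = "\n".join(json.dumps(post) for post in posts)

print(as_json)
print(as_ndjson)

# Reading NDJSON back: parse one line at a time.
# (Calling json.loads on the whole NDJSON text at once would raise an error.)
parsed = [json.loads(line) for line in io.StringIO(as_ndjson) if line.strip()]
print(parsed == posts)
```

That line-by-line loop is why the NDJSON loading code uses `json.loads` inside a `for` loop rather than a single `json.load`.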

OK, but let's assume we've already loaded in the JSON and made it a Python dictionary called social_media_data. Here it is (the same sample we looked at above):

In [None]:
# Simulated social media API response
social_media_data = {
    "tweets": [
        {
            "id": "153482901",
            "user": {
                "username": "tech_guru99",
                "display_name": "Tech Guru",
                "followers_count": 20500
            },
            "text": "AI is transforming the world! What are your thoughts? #ArtificialIntelligence #Future",
            "created_at": "2025-02-18T14:23:00Z",
            "engagement": {
                "likes": 120,
                "retweets": 45,
                "replies": 10
            }
        },
        {
            "id": "153482902",
            "user": {
                "username": "nature_lover",
                "display_name": "Emma Green",
                "followers_count": 3450
            },
            "text": "Nothing beats a peaceful sunrise hike. 🌄 #NatureLover",
            "created_at": "2025-02-18T06:15:00Z",
            "engagement": {
                "likes": 87,
                "retweets": 20,
                "replies": 8
            }
        },
        {
            "id": "153482903",
            "user": {
                "username": "dev_journey",
                "display_name": "Code Explorer",
                "followers_count": 15000
            },
            "text": "Just deployed my first full-stack app! Feeling accomplished. 🚀 #100DaysOfCode",
            "created_at": "2025-02-17T20:45:00Z",
            "engagement": {
                "likes": 200,
                "retweets": 75,
                "replies": 30
            }
        }
    ]
}



Here are some examples of how, once we have this JSON loaded in, we'd search around in it and parse it. First, let's just look at how you'd translate this whole thing into a pandas dataframe. Note that some of the columns will contain nested data if you do this, so it's not the ideal method.

In [None]:

import pandas as pd

# Convert the JSON "tweets" list into a Pandas DataFrame
df = pd.DataFrame(social_media_data["tweets"])

# Display the first few rows
df.head()


Yes, it would likely be better to go in more manually to separate out those nested elements; some samples:


In [None]:
# Loop through each tweet and print the text
for tweet in social_media_data["tweets"]:
    print(tweet["text"])


In [None]:
# Loop through each tweet and print the username of the poster
for tweet in social_media_data["tweets"]:
    print(tweet["user"]["username"])


In [None]:
# Print each tweet's likes
for tweet in social_media_data["tweets"]:
    print(f"Tweet: {tweet['text']}, Likes: {tweet['engagement']['likes']}")


Putting it all together, here's how we'd extract only some fields into a dataframe

In [None]:
df = pd.DataFrame([
    {
        "username": tweet["user"]["username"],
        "display_name": tweet["user"]["display_name"],
        "followers": tweet["user"]["followers_count"],
        "text": tweet["text"],
        "likes": tweet["engagement"]["likes"],
        "retweets": tweet["engagement"]["retweets"],
        "replies": tweet["engagement"]["replies"],
        "timestamp": tweet["created_at"]
    }
    for tweet in social_media_data["tweets"]
])

df.head()
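
If there are many nested fields, writing out each path by hand gets tedious. One alternative is a small recursive helper that flattens every nested dictionary into single-level keys. This is only a sketch: `flatten` is my own helper name, not a library function, and the tweet below is a shortened copy of the sample data. (pandas also has a built-in, `pd.json_normalize`, that does something similar.)

```python
def flatten(record, parent_key="", sep="_"):
    """Turn nested dictionaries into one flat dictionary, e.g. user -> user_username."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Go one level deeper, carrying the parent's name along
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


tweet = {
    "id": "153482902",
    "user": {"username": "nature_lover", "followers_count": 3450},
    "text": "Nothing beats a peaceful sunrise hike.",
    "engagement": {"likes": 87, "retweets": 20, "replies": 8},
}

flat_tweet = flatten(tweet)
print(flat_tweet)
```

With the full sample loaded, `pd.DataFrame([flatten(t) for t in social_media_data["tweets"]])` would give one column per flattened key.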


Here is sample code for loading in an ndjson file collected from zeeschuimer (twitter data) and parsing it. If you download data using zeeschuimer you can use this for assistance, but you'll need to adjust it depending on what platform you're collecting data from since the structure will be different (this, again, is a file of twitter data)

In [None]:
from google.colab import files

# Upload an NDJSON file
uploaded = files.upload()


In [None]:
import json

# Get the uploaded filename
ndjson_filename = list(uploaded.keys())[0]

# Read the NDJSON file and parse each line as a JSON object
zeeschuimer_file = []
with open(ndjson_filename, "r", encoding="utf-8") as file:
    for line in file:
        if line.strip():  # Skip blank lines, which would otherwise raise an error
            zeeschuimer_file.append(json.loads(line))  # Convert each line into a dictionary

# Print the first few entries to check
for item in zeeschuimer_file[:3]:  # Show first 3 entries
    print(json.dumps(item, indent=4))


Examine what a single entry, or item, in the JSON looks like, so you can discern which categories you want to pluck out from each to put in a dataframe:

In [None]:
for item in zeeschuimer_file[:1]:  # Show first single item
    print(json.dumps(item, indent=4))

In [None]:
# For each item in the JSON, pick out the categories you want to put in a dataframe and compile them like this.
# You have to find every level that the field you want is nested within, and then pick it out accordingly:

import pandas as pd

# Extract required fields from "data" → "legacy" in each item
df = pd.DataFrame([
    {
        "favorite_count": item["data"]["legacy"]["favorite_count"],
        "favorited": item["data"]["legacy"]["favorited"],
        "full_text": item["data"]["legacy"]["full_text"],
        "is_quote_status": item["data"]["legacy"]["is_quote_status"],
        "lang": item["data"]["legacy"]["lang"],
        "quote_count": item["data"]["legacy"]["quote_count"],
        "reply_count": item["data"]["legacy"]["reply_count"],
        "retweet_count": item["data"]["legacy"]["retweet_count"],
        "retweeted": item["data"]["legacy"]["retweeted"],
        "user_id_str": item["data"]["legacy"]["user_id_str"],
        "id_str": item["data"]["legacy"]["id_str"]
    }
    for item in zeeschuimer_file  # Loop through all items
])

# Display the first few rows of the DataFrame
df.head()
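
One caveat with the cell above: if any entry is missing one of those keys, Python stops with a `KeyError`. I'm assuming (not promising) that this can happen with collected data, since not every post has every field. A hedged workaround is a small helper that walks down the nested keys and returns a default when something is missing; `get_nested` is my own name, and the items below are made up:

```python
def get_nested(item, *keys, default=None):
    """Follow a chain of keys into nested dictionaries; return default if any is missing."""
    current = item
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up entries: one complete, one missing a field, one missing a whole level
items = [
    {"data": {"legacy": {"full_text": "a complete entry", "favorite_count": 3}}},
    {"data": {"legacy": {"full_text": "no like count here"}}},
    {"data": {}},
]

rows = [
    {
        "full_text": get_nested(item, "data", "legacy", "full_text"),
        "favorite_count": get_nested(item, "data", "legacy", "favorite_count", default=0),
    }
    for item in items
]
print(rows)
```

You could pass `rows` to `pd.DataFrame(rows)` just as before; missing text shows up as an empty value instead of crashing the loop.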
