# Scraping Toots about Switzerland's Migros on Mastodon Social 

This notebook is a prototype showing how to scrape messages, so-called toots, from Mastodon with Python and pandas.

Mastodon is, like Twitter, a free online service that lets you send messages to thousands or even millions of people, especially your followers. The difference: Mastodon is not a corporation. Imagine it as thousands of small Twitters, so-called instances. Anyone can set up and host an instance. Some of these instances are public, but for most of them one has to request access. We are going to scrape data from an open instance, Mastodon Social (mastodon.social).

When you are scraping from Mastodon consider these two things:

1: Strict research rules are in place - whatever you do, do it in line with the Mastodon community: studies have been withdrawn because of privacy violations (Roel Roscam Abbing, Robert W. Gehl, 2024). This is especially true for the closed instances.

2: Lots of instances have a public-facing REST API that allows users to interact with their services through third-party software, which makes it very easy to scrape toots for a data project.

So, it is easy to scrape data from Mastodon, but be ethical!

Migros is Switzerland's largest retail company, its largest supermarket chain and largest employer. It is also one of the forty largest retailers in the world. It is structured in the form of a cooperative federation (the Federation of Migros Cooperatives), with more than two million members.

Let's see what Mastodon "toots" about Migros!

FYI: you can simply adapt this to your hashtag of choice, try it out!

## Load libraries and modules

In [22]:
import json
import requests
import pandas as pd

In [23]:
from bs4 import BeautifulSoup

## Preparation - Defining Parameters, Flags, and a Timeframe

Two things are needed: the hashtag you're interested in (in this case, 'migros') and the API endpoint that returns the posts with that hashtag. We are scraping from the Mastodon Social instance.

In [24]:
hashtag = 'migros'
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'

I am interested in a certain timeframe, which I am going to define in the since variable. It has to be set as a timezone-aware timestamp, so that it can be compared with the creation time of each toot.

In [25]:
since = pd.Timestamp('2022-01-01 00:00:00', tz='UTC')
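
To illustrate why since carries a timezone, here is a minimal sketch. The created_at string is just an example value in the ISO 8601 format with a trailing Z (UTC); the variable names other than since are chosen for illustration only.

```python
import pandas as pd

since = pd.Timestamp('2022-01-01 00:00:00', tz='UTC')

# Example value in the ISO 8601 format of a toot's created_at field
created_at = '2024-09-26T14:07:19.000Z'
timestamp = pd.Timestamp(created_at)  # the trailing Z makes it timezone-aware (UTC)

# Two timezone-aware timestamps can be compared directly
is_recent = timestamp > since
print(timestamp.tzinfo is not None, is_recent)

# A timestamp without timezone cannot be ordered against an aware one
naive_since = pd.Timestamp('2022-01-01 00:00:00')
try:
    timestamp > naive_since
    comparison_failed = False
except TypeError:
    comparison_failed = True
print("Comparison with naive timestamp failed:", comparison_failed)
```

If since were defined without tz='UTC', the comparison inside the scraping loop would be expected to fail in the same way.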

Obviously, I will work with a loop to gather the toots. is_end is a flag that will help terminate the loop once the condition is met (when no more recent posts are available).

In [26]:
is_end = False

I am going to work with Python's requests library. The .get function requires a URL, which I have already defined. Additional parameters are possible. I am going to use params, which should be defined as a dictionary. The first key-value pair is 'limit': 40. Each API request will ask for at most 40 posts, because 40 is the maximum number of toots that can be pulled at once. This also means that the toots come in chunks.

In [27]:
params = {
    'limit': 40
}
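
To see what requests does with the params dictionary, the request can be prepared without being sent. This is only a sketch for inspection; nothing is fetched, and the max_id value is an arbitrary example ID.

```python
import requests

hashtag = 'migros'
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'
params = {'limit': 40}

# Build the request without sending it, to inspect the final URL
prepared = requests.Request('GET', URL, params=params).prepare()
print(prepared.url)

# Further keys added to params end up as additional query parameters
params['max_id'] = '113204241376169101'  # example ID, for illustration only
prepared_next = requests.Request('GET', URL, params=params).prepare()
print(prepared_next.url)
```

The dictionary is simply appended to the URL as a query string, which is why adding a key to params later on changes the next request.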

## Browsing and scraping

The following code now actually fetches recent posts with a specific hashtag from Mastodon and stores them in a Pandas DataFrame. See below for step by step description.

In [28]:
results = []
chunk_no = 1

while True:

    try:
        response = requests.get(URL, params=params, timeout=30)
        # Raise an exception for HTTP error codes (4xx, 5xx)
        response.raise_for_status()
    except requests.RequestException as err:
        print("An error occurred: {}".format(err))
        break

    print("STATUS OF YOUR SCRAPING")
    print("OK - scraping of chunk no {} worked.".format(chunk_no))
    print("--------------------------------")
    chunk_no += 1

    toots = json.loads(response.text)

    if len(toots) == 0:
        print("There were no toots returned. " 
              "Check for spelling or use another hashtag")
        break
    
    for toot in toots:
        timestamp = pd.Timestamp(toot['created_at'], tz='utc')
        if timestamp <= since:
            is_end = True
            break
            
        results.append(toot)
    
    if is_end:
        break
    
    max_id = toots[-1]['id']
    params['max_id'] = max_id

    
df = pd.DataFrame(results)

STATUS OF YOUR SCRAPING
OK - scraping of chunk no 1 worked.
--------------------------------
STATUS OF YOUR SCRAPING
OK - scraping of chunk no 2 worked.
--------------------------------
STATUS OF YOUR SCRAPING
OK - scraping of chunk no 3 worked.
--------------------------------
STATUS OF YOUR SCRAPING
OK - scraping of chunk no 4 worked.
--------------------------------
STATUS OF YOUR SCRAPING
OK - scraping of chunk no 5 worked.
--------------------------------
STATUS OF YOUR SCRAPING
OK - scraping of chunk no 6 worked.
--------------------------------
STATUS OF YOUR SCRAPING
OK - scraping of chunk no 7 worked.
--------------------------------


Let's go through the code step by step:

results - an empty list that will store the fetched posts (toots).

chunk_no - a little extra for the loop: it counts the chunks so that one can follow the process.

while True: - starts a loop that keeps sending requests to the API until one of the break statements is reached.

try - except and status output:

    -- response = requests.get(URL, params=params) - sends a GET request to the Mastodon API.
    -- print statements that let you follow the status; chunk_no is increased by 1.
    -- an except branch that reports when something goes wrong with the request.

toots = json.loads(response.text) - converts the JSON text of the response into a Python object (a list of dictionaries, one per toot).

if len(toots) == 0: break - if the response contains no toots, a message is printed and the loop terminates via break.

for loop - the script loops over each toot in the response. For each toot:

    -- timestamp: the creation time of the toot is converted into a pandas.Timestamp in UTC.
    -- If the toot's timestamp is older than or equal to the since timestamp, is_end is set to True and the for loop breaks.
    -- If the toot is within the timeframe, it is appended to the results list.

if is_end: break - if is_end was set to True (i.e. a post older than the date set in since was found), the while loop terminates as well.

max_id = toots[-1]['id'] and params['max_id'] = max_id - the ID of the last toot in the chunk is stored and added to the params dictionary. The max_id parameter tells the API to return only posts older than the specified ID. This ensures that the next iteration of the loop fetches the next set of toots and not the same ones again.
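
The steps above can also be sketched as a small function that can be tried out without touching the network. This is a sketch under assumptions, not the notebook's actual code: collect_toots, fetch_page and fake_fetch are hypothetical names, and the toots are made-up stand-ins for an API response.

```python
import pandas as pd


def collect_toots(fetch_page, since, limit=40):
    """Collect toots newer than `since`, walking backwards chunk by chunk.

    fetch_page is a hypothetical callable that takes a params dict and
    returns a list of toot dictionaries, newest first.
    """
    params = {'limit': limit}
    results = []
    while True:
        toots = fetch_page(params)
        if not toots:
            break  # nothing (more) to fetch
        for toot in toots:
            if pd.Timestamp(toot['created_at']) <= since:
                return results  # reached the start of the timeframe
            results.append(toot)
        # Ask for toots older than the last one seen
        params['max_id'] = toots[-1]['id']
    return results


# Stand-in for the API: two chunks of made-up toots, keyed by max_id
fake_pages = {
    None: [{'id': '3', 'created_at': '2024-01-03T00:00:00.000Z'},
           {'id': '2', 'created_at': '2023-01-02T00:00:00.000Z'}],
    '2': [{'id': '1', 'created_at': '2021-01-01T00:00:00.000Z'}],
}


def fake_fetch(params):
    return fake_pages.get(params.get('max_id'), [])


since = pd.Timestamp('2022-01-01', tz='UTC')
collected = collect_toots(fake_fetch, since)
print([toot['id'] for toot in collected])
```

Passing the fetching step in as a function keeps the paging logic separate from the HTTP call, so the stop condition can be checked with fake data before running it against a real instance.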

## Let's have a look at results and df

In [29]:
df.shape

(245, 24)

As one can see, there are 24 columns, so a lot of variables are stored for each toot. Let's have a look at these.

In [30]:
df.columns

Index(['id', 'created_at', 'in_reply_to_id', 'in_reply_to_account_id',
       'sensitive', 'spoiler_text', 'visibility', 'language', 'uri', 'url',
       'replies_count', 'reblogs_count', 'favourites_count', 'edited_at',
       'content', 'reblog', 'account', 'media_attachments', 'mentions', 'tags',
       'emojis', 'card', 'poll', 'application'],
      dtype='object')

I am only interested in id, created_at, language, content, account, tags and emojis. Here is what needs to be considered:

The column account stores its values as a dictionary. All kinds of information given by the user are stored there (websites, note, avatar, emojis, etc.). I am only interested in the account id, which I will use later for some minor cleaning up. If you want to use any of the other information stored in account, please make sure to check what may be used and what may not.

content is stored as HTML code.

The column tags holds a list of dictionaries. I am interested in those, which is why I am going to use the explode() method on that column.


## Cleaning up the data in the DataFrame

**First, let's extract the account id and build a small subset with only the columns of interest**

In [31]:
df['account_id'] = df['account'].apply(lambda x: x['id'])
df_small = df[["id", "created_at", "language", "account_id", "content", "tags", "emojis"]]
df_small.head()

,id,created_at,language,account_id,content,tags,emojis
0,113204241376169101,2024-09-26T14:07:19.000Z,de,109669545373361861,<p>Migros rechnet mit minimal höherem Betriebs...,"[{'name': 'migros', 'url': 'https://mastodon.s...",[]
1,113187529649386605,2024-09-23T15:17:19.000Z,de,109471764792431240,<p>Migros veut donner sa chance à Tegut</p><p>...,"[{'name': 'migros', 'url': 'https://mastodon.s...",[]
2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","[{'name': 'qualcomm', 'url': 'https://mastodon...",[]
3,113160216482271211,2024-09-18T19:30:43.000Z,en,109248094876977535,"<p>…wann explodieren die <a href=""https://mast...","[{'name': 'grapefruits', 'url': 'https://masto...",[]
4,113153493068774612,2024-09-17T15:01:22.000Z,de,109669545373361861,<p>Migros schliesst Webshops von Tochterfirmen...,"[{'name': 'migros', 'url': 'https://mastodon.s...",[]


In [32]:
df_small.dtypes

id            object
created_at    object
language      object
account_id    object
content       object
tags          object
emojis        object
dtype: object
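
Since every column is of type object, created_at is still a plain string at this point. If date-based filtering or grouping is needed later, it can be converted first. The following is a minimal sketch on a toy DataFrame (toy is a made-up stand-in, not the scraped data):

```python
import pandas as pd

# Toy stand-in for df_small with two example rows
toy = pd.DataFrame({
    'id': ['113204241376169101', '113187529649386605'],
    'created_at': ['2024-09-26T14:07:19.000Z', '2024-09-23T15:17:19.000Z'],
})

# Parse the ISO 8601 strings into timezone-aware datetimes (UTC)
toy['created_at'] = pd.to_datetime(toy['created_at'], utc=True)
print(toy.dtypes)

# With a datetime column, the .dt accessor becomes available
posts_per_year = toy.groupby(toy['created_at'].dt.year).size()
print(posts_per_year)
```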


**You might want to skip the following step, if you're not using the migros hashtag**

It is always a good idea to browse the posts that carry the researched hashtag on the Mastodon website, in this case Mastodon Social, to get a first impression.

I realized that there is an account which posts about all things business and entrepreneurship in Switzerland, but always in German AND in French. Since the content is the same, I decided to drop the account posting the messages in French and to keep the one posting in German.

In [33]:
id_to_drop = ['109471764792431240']
df_clean = df_small[~df_small["account_id"].isin(id_to_drop)]
df_clean.head()

,id,created_at,language,account_id,content,tags,emojis
0,113204241376169101,2024-09-26T14:07:19.000Z,de,109669545373361861,<p>Migros rechnet mit minimal höherem Betriebs...,"[{'name': 'migros', 'url': 'https://mastodon.s...",[]
2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","[{'name': 'qualcomm', 'url': 'https://mastodon...",[]
3,113160216482271211,2024-09-18T19:30:43.000Z,en,109248094876977535,"<p>…wann explodieren die <a href=""https://mast...","[{'name': 'grapefruits', 'url': 'https://masto...",[]
4,113153493068774612,2024-09-17T15:01:22.000Z,de,109669545373361861,<p>Migros schliesst Webshops von Tochterfirmen...,"[{'name': 'migros', 'url': 'https://mastodon.s...",[]
5,113117546765206911,2024-09-11T06:39:45.000Z,de,109669545373361861,<p>Migros lanciert mit Migros Bank kostenloses...,"[{'name': 'migrosbank', 'url': 'https://mastod...",[]


**Using explode on the tags column**

As mentioned before, the tags column holds a list of dictionaries, which is why I am going to use the explode() method. Every dictionary contains information about one hashtag used in the post, so the number of dictionaries equals the number of hashtags. Each dictionary includes the name and the URL of the hashtag.

What happens when using explode: every dictionary (so every hashtag) gets its own row, while the values of the other columns are copied and stay the same. Every tag therefore ends up on its own line, which is why I keep the original index as a column: with it I can easily see which rows belong to the same post.

In [34]:
df_flat = df_clean.explode("tags").reset_index()
df_flat.head(10)

,index,id,created_at,language,account_id,content,tags,emojis
0,0,113204241376169101,2024-09-26T14:07:19.000Z,de,109669545373361861,<p>Migros rechnet mit minimal höherem Betriebs...,"{'name': 'migros', 'url': 'https://mastodon.so...",[]
1,0,113204241376169101,2024-09-26T14:07:19.000Z,de,109669545373361861,<p>Migros rechnet mit minimal höherem Betriebs...,"{'name': 'news', 'url': 'https://mastodon.soci...",[]
2,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'qualcomm', 'url': 'https://mastodon....",[]
3,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'intel', 'url': 'https://mastodon.soc...",[]
4,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'apple', 'url': 'https://mastodon.soc...",[]
5,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'microsoft', 'url': 'https://mastodon...",[]
6,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'volg', 'url': 'https://mastodon.soci...",[]
7,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'migros', 'url': 'https://mastodon.so...",[]
8,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'schweiz', 'url': 'https://mastodon.s...",[]
9,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,"<p><a href=""https://smnn.ch/tags/Qualcomm"" cla...","{'name': 'eu', 'url': 'https://mastodon.social...",[]
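
The effect of explode() followed by reset_index() is easier to see on a tiny example. The DataFrame below is a made-up stand-in with two posts and shortened tag dictionaries:

```python
import pandas as pd

# Two made-up posts: the first has two tags, the second has one
toy = pd.DataFrame({
    'id': ['a', 'b'],
    'tags': [[{'name': 'migros'}, {'name': 'news'}], [{'name': 'coop'}]],
})

# One row per tag; the old row number is kept in the 'index' column
flat = toy.explode('tags').reset_index()
print(flat)

# Rows sharing the same 'index' value came from the same post
rows_per_post = flat.groupby('index').size().to_dict()
print(rows_per_post)
```

The post with two tags now occupies two rows that share the same value in the index column, while the post with one tag keeps a single row.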


**Cleaning up the tags column a little more**

The tags column includes the name of the tag as well as its URL, which one could click to follow the hashtag. But I only need the name, which is also easier to grasp at first sight. The following code creates a new column tag_name. I will drop the original tags column later on, when I create a new DataFrame.

In [35]:
df_flat['tag_name'] = df_flat['tags'].apply(lambda y: y['name'])

**Extracting the text from the content column**

As pointed out earlier, the content column contains HTML code. I define a function that uses BeautifulSoup to parse the HTML and return only the text, and store the result in a new column extracted_content. The original content column will be dropped in the next step, when I create a new DataFrame.

In [36]:
def extract_text_from_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()

df_flat["extracted_content"] = df_flat["content"].apply(extract_text_from_html)
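
One detail worth knowing: get_text() without arguments joins the text of neighbouring HTML elements without any space in between. The sketch below shows this on a made-up HTML snippet and a possible variant using the separator and strip arguments of get_text(); whether the extra space is wanted depends on the later analysis.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs, similar in shape to a toot's content
html = '<p>Finde den Fehler!</p><p>5 Jahre alte Tasche.</p>'
soup = BeautifulSoup(html, 'html.parser')

# Default: the text of both paragraphs is glued together
joined = soup.get_text()
print(joined)

# Variant: put a space between text fragments and strip surrounding whitespace
spaced = soup.get_text(separator=' ', strip=True)
print(spaced)
```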

**Narrowing down the DataFrame again**

Dropping the original tags and content columns.

In [37]:
df_flat_small = df_flat[[
    "index", 
    "id", 
    "created_at", 
    "language", 
    "account_id", 
    "extracted_content", 
    "tag_name", 
    "emojis"]]

In [38]:
df_flat_small

,index,id,created_at,language,account_id,extracted_content,tag_name,emojis
0,0,113204241376169101,2024-09-26T14:07:19.000Z,de,109669545373361861,Migros rechnet mit minimal höherem Betriebsgew...,migros,[]
1,0,113204241376169101,2024-09-26T14:07:19.000Z,de,109669545373361861,Migros rechnet mit minimal höherem Betriebsgew...,news,[]
2,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,#Qualcomm will #Intel kaufen? Was kommt als Nä...,qualcomm,[]
3,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,#Qualcomm will #Intel kaufen? Was kommt als Nä...,intel,[]
4,2,113179882647913861,2024-09-22T06:52:31.000Z,de,109415182274164306,#Qualcomm will #Intel kaufen? Was kommt als Nä...,apple,[]
...,...,...,...,...,...,...,...,...
753,243,108488064885312947,2022-06-16T16:21:37.000Z,de,733653,Die #Schweiz ist eine #Demokratie in der sehr ...,volk,[]
754,243,108488064885312947,2022-06-16T16:21:37.000Z,de,733653,Die #Schweiz ist eine #Demokratie in der sehr ...,demokratie,[]
755,243,108488064885312947,2022-06-16T16:21:37.000Z,de,733653,Die #Schweiz ist eine #Demokratie in der sehr ...,schweiz,[]
756,244,107823705876568727,2022-02-19T08:26:23.000Z,tr,1411493,#Migros şeye güveniyor olmalı boykot kültürünü...,MigrosaGitmiyorum,[]


## Having a closer look at the hashtags

I am going to use value_counts() in the tag_name column to count the appearance of each hashtag. Creating a new DataFrame with just the hashtags and their individual count.

In [39]:
tags_count = df_flat_small["tag_name"].value_counts().reset_index()
tags_count.head(25)

,tag_name,count
0,migros,210
1,news,49
2,coop,28
3,schweiz,22
4,migrosbank,7
5,denner,6
6,lebensmittel,6
7,micarna,6
8,lidl,4
9,SOCAR,4
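
The column names that value_counts().reset_index() produces may differ between pandas versions. A small sketch of a variant that names both columns explicitly, shown on a made-up Series of tag names:

```python
import pandas as pd

# Made-up tag names standing in for the tag_name column
tags = pd.Series(
    ['migros', 'news', 'migros', 'coop', 'migros', 'news'],
    name='tag_name',
)

# Name the index and the count column explicitly
counts = (
    tags.value_counts()
    .rename_axis('tag_name')
    .reset_index(name='count')
)
print(counts)
```

This way the resulting DataFrame always has the columns tag_name and count, sorted from the most to the least frequent tag.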


**Result**

So, there is not much of interest to see yet. We now cover a time frame from 1 January 2022 up to the newest post, and we have 210 posts tagged migros, which is not that much. There are not many posts about Migros on Mastodon Social. Things to do: try another open Mastodon instance, or find an interesting instance and ask its administrators whether they are okay with scraping...

Nevertheless, and the technique remains the same, we can now have an even closer look at certain hashtags. Of course, there is Migros, as we searched for that one. Then Coop, another Swiss supermarket chain (in fact, also a big cooperative), is mentioned quite often together with Migros. Companies that belong to the Migros group are mentioned as well: Denner, Micarna, Migrolino.

As sustainability is a big topic, I am interested in the posts which mention nachhaltig (sustainable), nachhaltigkeit (sustainability) and greenwashing. Let's have a look at the content of these posts and see what people think. Maybe marketing or management can get some ideas on what could be tackled ;).

In [52]:
sustainability_tags = ["nachhaltig", "nachhaltigkeit", "greenwashing"]
df_sustainable = df_flat_small[df_flat_small["tag_name"].isin(sustainability_tags)]


In [56]:
df_sustainable

,index,id,created_at,language,account_id,extracted_content,tag_name,emojis
518,181,110349159277525674,2023-05-11T08:42:22.000Z,de,503089,Finde den Fehler!5 Jahre alte #Migros vs. neue...,nachhaltig,"[{'shortcode': 'mastodon', 'url': 'https://fil..."
521,181,110349159277525674,2023-05-11T08:42:22.000Z,de,503089,Finde den Fehler!5 Jahre alte #Migros vs. neue...,greenwashing,"[{'shortcode': 'mastodon', 'url': 'https://fil..."
540,189,110212861469012654,2023-04-17T07:00:03.000Z,en,386737,"Was bedeutet #nachhaltig, #klimaneutral oder #...",nachhaltig,[]
555,191,110208752325952479,2023-04-16T13:35:02.000Z,en,386737,« Die grünen Werbeslogans von #migros und #coo...,nachhaltig,[]
556,191,110208752325952479,2023-04-16T13:35:02.000Z,en,386737,« Die grünen Werbeslogans von #migros und #coo...,nachhaltigkeit,[]
587,205,109887961586665040,2023-02-18T21:53:44.000Z,de,108679092840815729,#satire #cartoon #migros #greenwashing #umwelt...,greenwashing,[]
627,215,109726807647827542,2023-01-21T10:50:01.000Z,de,109540716282243994,"Neu auf #Infosperber:Trauben aus Namibia, Heid...",greenwashing,[]
647,220,109597519889070644,2022-12-29T14:50:35.000Z,de,109388126942795815,Martin #Jucker von der Jucker #Farm wirft der ...,nachhaltig,[]
648,220,109597519889070644,2022-12-29T14:50:35.000Z,de,109388126942795815,Martin #Jucker von der Jucker #Farm wirft der ...,nachhaltigkeit,[]
737,238,109340783009520194,2022-11-14T06:39:03.916Z,de,109246062752827438,"Wer es mit der #Nachhaltigkeit ernst meint, ve...",nachhaltigkeit,[]


In [53]:
sustainable = df_sustainable["extracted_content"].to_list()
sustainable

['Finde den Fehler!5 Jahre alte #Migros vs. neue #Rewe Tasche.100% vs. 36% :mastodon:  #nachhaltig #recycling #Plastik #pet #greenwashing @rewe',
 'Finde den Fehler!5 Jahre alte #Migros vs. neue #Rewe Tasche.100% vs. 36% :mastodon:  #nachhaltig #recycling #Plastik #pet #greenwashing @rewe',
 'Was bedeutet #nachhaltig, #klimaneutral oder #natürlich? @ErichBuergler berichtet in der @sonntagszeitung über unsere Kritik an der #Werbung von #Migros und #Coop(Abo-Artikel).https://www.tagesanzeiger.ch/greenpeace-wirft-migros-und-coop-greenwashing-vor-243204184726',
 '«\xa0Die grünen Werbeslogans von #migros und #coop  sind laut der Umweltschutzorganisation Greenpeace undurchsichtig. Auch die Wissenschaft kritisiert das Marketing von Schweizer Unternehmen.#lebensmittel #detailhandel #nachhaltig #nachhaltigkeit https://lnkd.in/eASV8eM6\xa0»— Retweet https://twitter.com/ErichBuergler/status/1647532695696572417',
 '«\xa0Die grünen Werbeslogans von #migros und #coop  sind laut der Umweltschutzorgan
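
Because the exploded DataFrame has one row per tag, a post that carries several of the searched tags shows up several times in such a list. If each post should appear only once, the duplicates can be dropped by post id first. A minimal sketch on made-up data (toy, wanted and unique_posts are illustrative names):

```python
import pandas as pd

# Made-up exploded data: post '1' carries two of the searched tags
toy = pd.DataFrame({
    'id': ['1', '1', '2', '3'],
    'extracted_content': ['post A', 'post A', 'post B', 'post C'],
    'tag_name': ['nachhaltig', 'greenwashing', 'nachhaltig', 'news'],
})

wanted = ['nachhaltig', 'nachhaltigkeit', 'greenwashing']
matches = toy[toy['tag_name'].isin(wanted)]

# Keep only the first row per post before extracting the text
unique_posts = matches.drop_duplicates(subset='id')['extracted_content'].to_list()
print(unique_posts)
```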

In [54]:
df_schlachthof = df_flat_small[df_flat_small["tag_name"] == "schlachthof"]
df_schlachthof

,index,id,created_at,language,account_id,extracted_content,tag_name,emojis
491,169,110615042718162477,2023-06-27T07:40:02.000Z,en,386737,#Schlachthof #Migros: Die Gemeinde St-Aubin ha...,schlachthof,[]
570,196,110144756385512536,2023-04-05T06:20:02.000Z,en,386737,#Migros plant mit ihrer Tochtergesellschaft #M...,schlachthof,[]


In [55]:
schlachthof = df_schlachthof["extracted_content"].to_list()
schlachthof

['#Schlachthof #Migros: Die Gemeinde St-Aubin hat unsere Einsprache abgelehnt. Wir legen nun beim Kanton Freiburg Rekurs ein. Gemeinde und Kanton sollten der Umwelt Priorität einräumen können, nicht den wirtschaftlichen Interessen der Migros-Gruppe.https://act.gp/442XspQ',
 '#Migros plant mit ihrer Tochtergesellschaft #Micarna, einen gigantischen #Schlachthof im Herzen der Westschweiz zu bauen. 🐔 40 Millionen Hühner sollen da jährlich getötet werden. 🐔Wir stellen uns gegen dieses umwelt- und klimaschädliche Projekt. 👇https://act.gp/3zu7qCZ']