# Scraping Toots about IKEA from Mastodon Social 

This notebook is a bit more advanced and explains the code only briefly. If you need more guidance, please refer to the notebook Mastodon_Migros_Prototype.

## Load libraries and modules

In [1]:
import json
import requests
import pandas as pd

In [2]:
from bs4 import BeautifulSoup

## Preparation - Defining the hashtag and the URL for requests, setting the time frame

In [3]:
def ask_for_hashtag():

    hashtag = input("Enter the hashtag which you would like to search: ")
    return hashtag

In [4]:
def ask_for_mastodon_instance():

    instance = input("Enter the Mastodon instance you would like to scrape. " 
    "Good starting points are for example mastodon.social, mastodon.world. "
    "Make sure, you enter the entire name of the instance, e.g. mastodon.world. "
    "Instance of choice: ")
    return instance

In [5]:
def build_url():

    user_hashtag = ask_for_hashtag()
    cleaned_hashtag = user_hashtag.lower()

    user_instance = ask_for_mastodon_instance()
    cleaned_instance = user_instance.lower()

    url = f'https://{cleaned_instance}/api/v1/timelines/tag/{cleaned_hashtag}'

    return url
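
To try the URL construction without typing into `input()` prompts, the same logic can be moved into a pure function. The following is only a minimal sketch: the name `build_tag_url` and the extra normalisation (stripping whitespace, a leading `#`, and a pasted `https://` prefix) are assumptions and not part of the cells above.

```python
def build_tag_url(instance, hashtag):
    """Return the hashtag timeline URL for a Mastodon instance (hypothetical helper)."""
    # Normalise the hashtag: users often type "#IKEA" or add stray spaces
    cleaned_hashtag = hashtag.strip().lstrip("#").lower()

    # Normalise the instance: drop a pasted scheme and trailing slashes
    cleaned_instance = instance.strip().lower()
    for prefix in ("https://", "http://"):
        if cleaned_instance.startswith(prefix):
            cleaned_instance = cleaned_instance[len(prefix):]
    cleaned_instance = cleaned_instance.rstrip("/")

    return f"https://{cleaned_instance}/api/v1/timelines/tag/{cleaned_hashtag}"


print(build_tag_url("mastodon.social", "IKEA"))
```

Keeping the prompts separate from the string handling makes it possible to check the result for a few inputs before any request is sent.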

In [6]:
def ask_for_starting_date():

    user_starting_date = input("Please enter the date you would like to start scraping. "
                          "Tip: Depending on the hashtag, start with a short time period. "
                          "Use the YYYY-MM-DD notation: ")

    return user_starting_date
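
The date is passed on as entered, so a typo only surfaces later when the timestamp is built. A possible check is sketched below; the helper name `parse_starting_date` is an assumption, and returning `None` for invalid input is just one way to signal the problem.

```python
from datetime import datetime

import pandas as pd


def parse_starting_date(text):
    """Return midnight UTC for a YYYY-MM-DD string, or None if it is invalid."""
    try:
        day = datetime.strptime(text.strip(), "%Y-%m-%d")
    except ValueError:
        return None
    # Attach the UTC time zone so the value can be compared with post timestamps
    return pd.Timestamp(day, tz="UTC")


print(parse_starting_date("2024-10-01"))
print(parse_starting_date("01.10.2024"))
```

A caller could repeat the prompt for as long as the helper returns `None`.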

## Scraping the toots

The following function fetches the most recent posts with the chosen hashtag from the chosen Mastodon instance, chunk by chunk, going back until the starting date, and stores them in a pandas DataFrame.

In [7]:
def scraping_toots():

    params = {'limit': 40}  # one can only fetch 40 posts at once

    URL = build_url()
    print(URL)

    date = ask_for_starting_date()
    since = pd.Timestamp(f'{date} 00:00:00', tz='UTC')
    print(since)

    is_end = False
    results = []
    chunk_no = 1

    while True:

        try:
            response = requests.get(URL, params=params, timeout=30)
            response.raise_for_status()
        except requests.RequestException as error:
            # stop here instead of parsing a failed or missing response
            print("An error occurred while requesting the toots:", error)
            break

        print("STATUS OF YOUR SCRAPING: chunk number", chunk_no)
        chunk_no += 1

        toots = response.json()

        if len(toots) == 0:
            # an empty first chunk means the hashtag returned nothing at all;
            # an empty later chunk only means there are no older posts
            if not results:
                print("There were no toots returned. "
                      "Check for spelling or use another hashtag")
            break

        for toot in toots:
            timestamp = pd.Timestamp(toot['created_at'], tz='utc')
            if timestamp <= since:
                is_end = True
                break

            results.append(toot)

        if is_end:
            break

        # continue with posts older than the last one of this chunk
        params['max_id'] = toots[-1]['id']

    df_hashtag = pd.DataFrame(results)

    return df_hashtag

In [8]:
df = scraping_toots()

Enter the hashtag which you would like to search:  IKEA
Enter the Mastodon instance you would like to scrape. Good starting points are for example mastodon.social, mastodon.world. Make sure, you enter the entire name of the instance, e.g. mastodon.world. Instance of choice:  mastodon.social


https://mastodon.social/api/v1/timelines/tag/ikea


Please enter the date you would like to start scraping. Tip: Depending on the hashtag, start with a short time period. Use the YYYY-MM-DD notation:  2024-10-01


2024-10-01 00:00:00+00:00
STATUS OF YOUR SCRAPING: chunk number 1
STATUS OF YOUR SCRAPING: chunk number 2
STATUS OF YOUR SCRAPING: chunk number 3
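
The paging logic of `scraping_toots` (request a chunk, stop at the starting date, otherwise continue with `max_id`) can be tried offline. The sketch below replaces the HTTP request with a hypothetical `fake_fetch` function and three invented posts; `collect_toots` is likewise an assumed name, not part of the notebook.

```python
import pandas as pd


def collect_toots(fetch_page, since):
    """Collect toots newer than `since`, following `max_id` pagination.

    `fetch_page` is a hypothetical callable that receives the request
    parameters and returns a list of toots, newest first.
    """
    params = {"limit": 40}
    results = []
    while True:
        toots = fetch_page(params)
        if not toots:
            return results  # no (more) posts available
        for toot in toots:
            if pd.Timestamp(toot["created_at"]) <= since:
                return results  # reached the starting date
            results.append(toot)
        # Ask for posts older than the last one seen
        params["max_id"] = toots[-1]["id"]


# Invented data standing in for the API: three posts, served two per chunk
FAKE_TOOTS = [
    {"id": "3", "created_at": "2024-10-03T08:00:00.000Z"},
    {"id": "2", "created_at": "2024-10-02T08:00:00.000Z"},
    {"id": "1", "created_at": "2024-09-30T08:00:00.000Z"},
]


def fake_fetch(params):
    max_id = params.get("max_id")
    older = [t for t in FAKE_TOOTS if max_id is None or int(t["id"]) < int(max_id)]
    return older[:2]


since = pd.Timestamp("2024-10-01", tz="UTC")
collected = collect_toots(fake_fetch, since)
print([toot["id"] for toot in collected])
```

The first chunk contains posts 3 and 2, the second chunk starts with post 1, which is older than the starting date, so the loop stops there.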


## Results and DataFrame

In [19]:
df.shape

(83, 24)

In [20]:
df.columns

Index(['id', 'created_at', 'in_reply_to_id', 'in_reply_to_account_id',
       'sensitive', 'spoiler_text', 'visibility', 'language', 'uri', 'url',
       'replies_count', 'reblogs_count', 'favourites_count', 'edited_at',
       'content', 'reblog', 'application', 'account', 'media_attachments',
       'mentions', 'tags', 'emojis', 'card', 'poll'],
      dtype='object')

## Cleaning up the data

In [21]:
def account_id_column(dataframe):
    dataframe['account_id'] = dataframe['account'].apply(lambda x: x['id'])

    return dataframe

In [22]:
def tag_name_column(dataframe):

    dataframe = dataframe.explode("tags").reset_index()
    dataframe["tag_name"] = dataframe["tags"].apply(lambda y: y['name'])

    return dataframe
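
What `explode()` does to the `tags` column is easiest to see on a small example. The two posts below are invented for illustration and only carry the fields needed here.

```python
import pandas as pd

# Two invented posts: the first has two tags, the second has one
toy = pd.DataFrame({
    "id": ["1", "2"],
    "tags": [
        [{"name": "ikea"}, {"name": "furniture"}],
        [{"name": "ikea"}],
    ],
})

# One row per tag; reset_index() keeps the original row number in "index"
exploded = toy.explode("tags").reset_index()
exploded["tag_name"] = exploded["tags"].apply(lambda tag: tag["name"])

print(exploded[["index", "id", "tag_name"]])
```

The first post now appears twice, once per tag, and the `index` column shows which rows belong to the same original post.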

In [23]:
def extract_text_from_html(html):

    """Parsing a string which is in html code to extract actual text.

    Parameters
    ----------
    html : string
        String that stores the content of the post (or any text) in html.

    Returns
    -------
    soup.get_text() : string
        A string with the actual text, without any html tags etc.
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()
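
One thing to be aware of: `get_text()` joins all text pieces without a separator, so the end of one paragraph runs directly into the start of the next. If that matters for the analysis, a variant along the following lines may help. It is a sketch only; the function name `extract_text_keep_breaks` and the sample HTML are assumptions.

```python
from bs4 import BeautifulSoup


def extract_text_keep_breaks(html):
    """Extract text from HTML, keeping paragraph and line breaks (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Turn <br> tags into newline characters
    for br in soup.find_all("br"):
        br.replace_with("\n")

    paragraphs = [p.get_text() for p in soup.find_all("p")]
    if not paragraphs:
        return soup.get_text()
    # Separate paragraphs with an empty line
    return "\n\n".join(paragraphs)


sample = ('<p>Can we make <a href="https://mastodon.social/tags/ikea">'
          '#<span>ikea</span></a> easy?</p><p>Second line<br>third</p>')

print(repr(BeautifulSoup(sample, "html.parser").get_text()))
print(repr(extract_text_keep_breaks(sample)))
```

Hashtags stay intact because the text inside each paragraph is still joined without a separator.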

In [24]:
def parsed_content_column(dataframe):

    dataframe["extracted_content"] = dataframe["content"].apply(extract_text_from_html)

    return dataframe

In [28]:
def choose_languages():

    while True:
        # split() without arguments ignores repeated spaces;
        # an empty answer falls back to English
        languages = input("Which languages should the posts have? "
                          "Leave the field empty to use the default which is English. "
                          "Otherwise refer to the List of ISO 639 language codes. "
                          "Use the following way to write all of them down: en sv fr. ").split() or ["en"]
        print("Chosen languages: ", languages)

        answer = input("If these are the languages you want to use, type yes: ")

        if answer.lower() == "yes":
            print("Chosen languages are: ", languages)
            break

    return languages
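
The parsing of the typed language codes can be checked separately from the prompts. A minimal sketch, assuming a hypothetical helper `parse_language_codes` that additionally accepts commas and upper-case input:

```python
def parse_language_codes(raw, default=("en",)):
    """Turn user input such as 'en sv fr' into a list of language codes."""
    # Treat commas like spaces; split() ignores repeated whitespace
    codes = [code.lower() for code in raw.replace(",", " ").split()]
    # Fall back to the default if nothing usable was entered
    return codes or list(default)


print(parse_language_codes("en sv fr"))
print(parse_language_codes(""))
```

The resulting list can be passed directly to `isin()` when filtering the `language` column.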

In [29]:
def cleaning_data(dataframe):

    dataframe = account_id_column(dataframe)
    dataframe = tag_name_column(dataframe)
    dataframe = parsed_content_column(dataframe)

    
    language_competences = ["en", "de", "fr", "sv"]
    print("The following languages are used: ", language_competences)
    language_answer = input("If you want to use the same languages, type yes, else type no: ")
    
    if language_answer.lower() != "yes":
        language_competences = choose_languages()
    
    new = dataframe[dataframe['language'].isin(language_competences)]

    
    df_small = new[[
    "index", 
    "id", 
    "created_at", 
    "language", 
    "account_id", 
    "extracted_content", 
    "tag_name"]]

    return df_small

In [30]:
clean_df = cleaning_data(df)

The following languages are used:  ['en', 'de', 'fr', 'sv']


If you want to use the same languages, type yes, else type no:  no
Which languages should the posts have? Leave the field empty to use the default which is English. Otherwise refer to the List of ISO 639 language codes. Use the following way to write all of them down: en sv fr.  


Chosen languages:  ['en']


If these are the languages you want to use, type yes:  yes


Chosen languages are:  ['en']


In [31]:
clean_df

|  | index | id | created_at | language | account_id | extracted_content | tag_name |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 113341403324783437 | 2024-10-20T19:29:25.386Z | en | 109358010895553496 | Can we make #ikea #furniture also easy to disa... | ikea |
| 1 | 0 | 113341403324783437 | 2024-10-20T19:29:25.386Z | en | 109358010895553496 | Can we make #ikea #furniture also easy to disa... | furniture |
| 13 | 4 | 113333960822788987 | 2024-10-19T11:56:41.000Z | en | 109284345361269535 | My #HomeAssistant Green has been delayed. The ... | kodi |
| 14 | 4 | 113333960822788987 | 2024-10-19T11:56:41.000Z | en | 109284345361269535 | My #HomeAssistant Green has been delayed. The ... | picoreserver |
| 15 | 4 | 113333960822788987 | 2024-10-19T11:56:41.000Z | en | 109284345361269535 | My #HomeAssistant Green has been delayed. The ... | ikea |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 268 | 78 | 113238225913420391 | 2024-10-02T14:10:03.028Z | en | 110428422327008517 | Chytrá zásuvka INSPELNING vám pomůže hlídat sp... | chytradomacnost |
| 269 | 78 | 113238225913420391 | 2024-10-02T14:10:03.028Z | en | 110428422327008517 | Chytrá zásuvka INSPELNING vám pomůže hlídat sp... | chytrazasuvka |
| 270 | 78 | 113238225913420391 | 2024-10-02T14:10:03.028Z | en | 110428422327008517 | Chytrá zásuvka INSPELNING vám pomůže hlídat sp... | ikea |
| 271 | 78 | 113238225913420391 | 2024-10-02T14:10:03.028Z | en | 110428422327008517 | Chytrá zásuvka INSPELNING vám pomůže hlídat sp... | merenispotreby |


## Having a closer look at the hashtags

In [None]:
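# cleaning_data() returned its result as clean_df; continue working on a copy named df_small
df_small = clean_df.copy()
tags_count = df_small["tag_name"].value_counts().reset_index()
tags_count.head(25)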

**Some Notes**

I recommend looking up any hashtag you do not understand to find out what is behind it. I checked zpravicky, chytradomacnost, cesko and zigbee. The first three come from a single account whose language is marked as "en" although its posts are in Czech. I am going to drop this account, but not without first checking whether all of its posts are actually in Czech.
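
One way to run such a check is to list every post of the account once and read through the texts. The sketch below uses invented data with placeholder account IDs; the helper name `posts_of_account` is an assumption. On the real data, the DataFrame with the `account_id` and `extracted_content` columns would be passed in instead of `toy`.

```python
import pandas as pd


def posts_of_account(dataframe, account_id):
    """Return the distinct post texts of one account (hypothetical helper)."""
    own = dataframe[dataframe["account_id"] == account_id]
    # Each post appears once per tag after explode(), so keep one row per post id
    return own.drop_duplicates("id")["extracted_content"].to_list()


# Invented example data: account "A" wrote two posts, account "B" one
toy = pd.DataFrame({
    "id": ["10", "10", "11", "12"],
    "account_id": ["A", "A", "A", "B"],
    "extracted_content": ["first post", "first post", "second post", "other post"],
    "tag_name": ["ikea", "zigbee", "ikea", "ikea"],
})

for text in posts_of_account(toy, "A"):
    print(text)
    print("---")
```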

In [None]:
accounts_to_drop = ["110428422327008517"]

In [None]:
df_small = df_small[~df_small['account_id'].isin(accounts_to_drop)]

In [None]:
tags_count = df_small["tag_name"].value_counts().reset_index()
tags_count.head(25)

**Checking posts around sustainability**

In [None]:
sustainability = ["greenwashing", "klimaschutz", "klima", "umwelt", "umweltschutz", "nachhaltig", "nachhaltigkeit", "recycling", "deforestation", "environmental", "environment", "forets"]

In [None]:
df_sustainable = df_small[df_small["tag_name"].isin(sustainability)]

In [None]:
# explode() created one row per tag, so a post with several matching tags appears several times
# drop_duplicates() keeps the content of each post only once
sustainable = df_sustainable.drop_duplicates("extracted_content")["extracted_content"].to_list()
for post in sustainable[:10]:
    print(post)
    print("---------------------------------------------")

In [None]:
len(sustainable)
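
The effect of combining `isin()` with `drop_duplicates()` can be seen on a small invented example. The data and the shortened keyword list below are assumptions made for illustration only.

```python
import pandas as pd

# Invented data: post "1" carries two matching tags, post "2" none, post "3" one
toy = pd.DataFrame({
    "id": ["1", "1", "2", "3"],
    "extracted_content": ["post A", "post A", "post B", "post C"],
    "tag_name": ["recycling", "umwelt", "ikea", "klima"],
})
keywords = ["recycling", "umwelt", "klima"]

# Keep only rows whose tag is one of the keywords (exact match)
matches = toy[toy["tag_name"].isin(keywords)]

# Post A matched twice; keep its text only once
unique_posts = matches.drop_duplicates("extracted_content")["extracted_content"].to_list()
print(unique_posts)
```

Note that `isin()` compares whole tag names, so a tag such as `klimawandel` would not be matched by the keyword `klima`.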