# Scraping Toots about IKEA from Mastodon Social 

This notebook is somewhat more advanced and explains the code only briefly. If you need more guidance, please refer to the notebook Mastodon_Migros_Prototype.

## Load libraries and modules

In [1]:
import json
import requests
import pandas as pd

In [2]:
from bs4 import BeautifulSoup

## Preparation - Defining the hashtag and request URL, setting the time frame

In [3]:
def ask_for_hashtag():

    hashtag = input("Enter the hashtag which you would like to search: ")
    return hashtag

In [4]:
def build_url():

    user_hashtag = ask_for_hashtag()
    cleaned_hashtag = user_hashtag.lower()

    url = f'https://mastodon.world/api/v1/timelines/tag/{cleaned_hashtag}'

    return url
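
The tag timeline endpoint takes the bare tag name in the URL path, so an input such as `#IKEA` or one with stray spaces would produce an unintended URL. A minimal sketch of stricter input cleaning, assuming two hypothetical helpers `normalize_hashtag` and `build_url_from` that are not part of this notebook, might look like this:

```python
def normalize_hashtag(raw_hashtag):
    """Hypothetical helper: return a bare, lowercase tag name.

    Strips surrounding whitespace and any leading '#' characters so that
    inputs like ' #IKEA ' and 'ikea' lead to the same URL.
    """
    cleaned = raw_hashtag.strip().lstrip("#").lower()
    if not cleaned:
        raise ValueError("The hashtag must not be empty.")
    return cleaned


def build_url_from(raw_hashtag, base="https://mastodon.world"):
    # Same endpoint as in build_url(), but without the input() call
    return f"{base}/api/v1/timelines/tag/{normalize_hashtag(raw_hashtag)}"


print(build_url_from(" #IKEA "))
```

Separating the cleaning step from `input()` also makes it possible to check the URL building without typing anything.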

In [5]:
def ask_for_starting_date():

    user_starting_date = input("Please enter the date you would like to start scraping. "
                          "Tip: Depending on the hashtag, start with a short time period. "
                          "Use the YYYY-MM-DD notation: ")

    return user_starting_date
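
The function above returns whatever the user typed, so a date in another notation only fails later, inside the scraping function. As a sketch under the assumption that an early check is wanted, a hypothetical helper `parse_starting_date` (not part of this notebook) could validate the YYYY-MM-DD notation with the standard library:

```python
from datetime import datetime, timezone


def parse_starting_date(text):
    """Hypothetical helper: parse 'YYYY-MM-DD' into a UTC midnight datetime.

    Raises ValueError when the text does not match the expected notation.
    """
    parsed = datetime.strptime(text.strip(), "%Y-%m-%d")
    return parsed.replace(tzinfo=timezone.utc)


print(parse_starting_date("2024-09-01"))
```

Wrapping the call in a `try`/`except ValueError` loop would allow asking the user again instead of stopping.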

## Scraping the toots

The following code fetches posts with the chosen hashtag from Mastodon, newest first and going back to the starting date, and stores them in a pandas DataFrame.

In [6]:
def scraping_toots():
    
    params = {'limit': 40}  # the API returns at most 40 posts per request
    
    URL = build_url()
    print(URL)
    
    date = ask_for_starting_date()
    since = pd.Timestamp(f'{date} 00:00:00', tz='UTC')
    print(since)
    
    is_end = False
    
    
    results = []
    chunk_no = 1
    
    while True:
    
        try:
            response = requests.get(URL, params=params, timeout=30)
            # Turn HTTP error codes (4xx, 5xx) into exceptions
            response.raise_for_status()
        except requests.RequestException as error:
            print("An error occurred while requesting the toots:", error)
            break

        print("STATUS OF YOUR SCRAPING: chunk number", chunk_no)
        chunk_no += 1

        toots = response.json()
    
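        if len(toots) == 0:
            # An empty first chunk means nothing was found for the hashtag;
            # an empty later chunk only means there are no older toots left.
            if not results:
                print("There were no toots returned. "
                      "Check for spelling or use another hashtag")
            break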
        
        for toot in toots:
            timestamp = pd.Timestamp(toot['created_at'], tz='utc')
            if timestamp <= since:
                is_end = True
                break
                
            results.append(toot)
        
        if is_end:
            break
        
        max_id = toots[-1]['id']
        params['max_id'] = max_id
    
        
    df_hashtag = pd.DataFrame(results)
    
    return df_hashtag
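
The loop above pages backwards through the timeline: each request passes the ID of the oldest toot seen so far as `max_id`, and the loop stops once a toot is older than the starting date. The following sketch replays that logic without any network access. It assumes a made-up stand-in for the API (`FAKE_TOOTS` and `fake_fetch`) and a hypothetical function `collect_since`; none of these names exist in the notebook or in the Mastodon API.

```python
from datetime import datetime, timezone

# Made-up stand-in for the API: ten toots, newest first, one per day
FAKE_TOOTS = [
    {"id": str(100 - i), "created_at": f"2024-09-{10 - i:02d}T12:00:00+00:00"}
    for i in range(10)
]


def fake_fetch(max_id=None, limit=4):
    """Return up to `limit` toots older than `max_id`, newest first."""
    older = [t for t in FAKE_TOOTS if max_id is None or int(t["id"]) < int(max_id)]
    return older[:limit]


def collect_since(fetch_page, since):
    """Collect toots newer than `since`, one page at a time."""
    results = []
    max_id = None
    while True:
        toots = fetch_page(max_id=max_id)
        if not toots:
            break  # nothing older is left
        for toot in toots:
            if datetime.fromisoformat(toot["created_at"]) <= since:
                return results  # reached the cutoff date
            results.append(toot)
        # Ask the next request for toots older than the last one seen
        max_id = toots[-1]["id"]
    return results


since = datetime(2024, 9, 4, tzinfo=timezone.utc)
collected = collect_since(fake_fetch, since)
print(len(collected), collected[-1]["created_at"])
```

Passing the fetch function as a parameter is a design choice for this sketch only; it keeps the paging logic testable without contacting a server.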

In [7]:
df = scraping_toots()

Enter the hashtag which you would like to search:  IKEA


https://mastodon.world/api/v1/timelines/tag/ikea


Please enter the date you would like to start scraping. Tip: Depending on the hashtag, start with a short time period. Use the YYYY-MM-DD notation:  2024-09-01


2024-09-01 00:00:00+00:00
STATUS OF YOUR SCRAPING: chunk number 1
STATUS OF YOUR SCRAPING: chunk number 2
STATUS OF YOUR SCRAPING: chunk number 3
STATUS OF YOUR SCRAPING: chunk number 4
STATUS OF YOUR SCRAPING: chunk number 5


## Results and DF

In [8]:
df.shape

(166, 24)

In [9]:
df.columns

Index(['id', 'created_at', 'in_reply_to_id', 'in_reply_to_account_id',
       'sensitive', 'spoiler_text', 'visibility', 'language', 'uri', 'url',
       'replies_count', 'reblogs_count', 'favourites_count', 'edited_at',
       'content', 'reblog', 'account', 'media_attachments', 'mentions', 'tags',
       'emojis', 'card', 'poll', 'application'],
      dtype='object')

## Cleaning up the data

In [10]:
def account_id_column(dataframe):
    dataframe['account_id'] = dataframe['account'].apply(lambda x: x['id'])

    return dataframe

In [11]:
def tag_name_column(dataframe):

    dataframe = dataframe.explode("tags").reset_index()
    dataframe["tag_name"] = dataframe["tags"].apply(lambda y: y['name'])

    return dataframe
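
Because `explode` turns every element of the `tags` list into its own row, a toot with several tags appears several times afterwards, and `reset_index()` keeps the original row number in a new `index` column. A small sketch with made-up data (the `toy` DataFrame is an illustration, not scraped output) shows the effect:

```python
import pandas as pd

# Made-up miniature of the scraped data: two toots with nested tag lists
toy = pd.DataFrame({
    "id": ["1", "2"],
    "content": ["first post", "second post"],
    "tags": [
        [{"name": "ikea"}, {"name": "diy"}],
        [{"name": "ikea"}],
    ],
})

# One row per tag; the old row number is kept in the "index" column
exploded = toy.explode("tags").reset_index()
exploded["tag_name"] = exploded["tags"].apply(lambda tag: tag["name"])

print(exploded[["index", "id", "tag_name"]])
```

The first toot now occupies two rows that differ only in `tag_name`, which is why counting posts later requires removing duplicates.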

In [12]:
def extract_text_from_html(html):

    """Parsing a string which is in html code to extract actual text.

    Parameters
    ----------
    html : string
        String that stores the content of the post (or any text) in html.

    Returns
    -------
    soup.get_text() : string
        A string with the actual text, without any html tags etc.
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()
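
`get_text()` joins the text of all elements without a separator, so the end of one paragraph can run directly into the start of the next. If that matters for later analysis, one possible alternative is sketched below. It swaps in the standard library's `html.parser` instead of BeautifulSoup and assumes that paragraphs are marked with `<p>` and line breaks with `<br>`; the names `ParagraphAwareExtractor` and `extract_text_with_breaks` are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class ParagraphAwareExtractor(HTMLParser):
    """Collect text and add a space at <br> and paragraph boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def extract_text_with_breaks(html):
    parser = ParagraphAwareExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace introduced by the extra separators
    return " ".join("".join(parser.parts).split())


sample = "<p>Is anything durable?</p><p>I buy used. #<span>ceramics</span></p>"
print(extract_text_with_breaks(sample))
```

Spaces are only added at paragraph and line breaks, so text split across inline elements, such as a hashtag wrapped in a `<span>`, stays in one piece.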

In [13]:
def parsed_content_column(dataframe):

    dataframe["extracted_content"] = dataframe["content"].apply(extract_text_from_html)

    return dataframe

In [14]:
def choose_languages():

    while True:
        languages = input("Which languages should the posts have? "
                          "Leave the field empty to use the default, which is English. "
                          "Otherwise refer to the list of ISO 639 language codes. "
                          "Separate the codes with spaces, for example: en sv fr. ") or "en"
        # split() without arguments also tolerates repeated spaces
        languages = languages.lower().split()
        print("Chosen languages: ", languages)

        answer = input("If these are the languages you want to use, type yes: ")

        if answer.lower() == "yes":
            print("Chosen languages are: ", languages)
            break

    return languages
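
The function above accepts any text as a language code. As a sketch only, a hypothetical helper `parse_language_codes` (not part of this notebook) could reject obvious typos before the filter is applied. It assumes that valid codes consist of two or three ASCII letters, which is a simplification of ISO 639.

```python
import re


def parse_language_codes(text, default=("en",)):
    """Hypothetical helper: turn 'en sv fr' into ['en', 'sv', 'fr'].

    Assumes codes are two or three ASCII letters; anything else
    raises ValueError. Empty input falls back to the default.
    """
    codes = [code.lower() for code in text.split()]
    if not codes:
        return list(default)
    invalid = [code for code in codes if not re.fullmatch(r"[a-z]{2,3}", code)]
    if invalid:
        raise ValueError(f"Not a valid language code: {invalid}")
    return codes


print(parse_language_codes("en  SV fr"))
```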

In [15]:
def cleaning_data(dataframe):

    dataframe = account_id_column(dataframe)
    dataframe = tag_name_column(dataframe)
    dataframe = parsed_content_column(dataframe)

    
    language_competences = ["en", "de", "fr", "sv"]
    print("The following languages are used: ", language_competences)
    language_answer = input("If you want to use the same languages, type yes, else type no: ")
    
    if language_answer.lower() != "yes":
        language_competences = choose_languages()
    
    new = dataframe[dataframe['language'].isin(language_competences)]

    
    df_small = new[[
        "index",
        "id",
        "created_at",
        "language",
        "account_id",
        "extracted_content",
        "tag_name",
    ]]

    return df_small

In [16]:
clean_df = cleaning_data(df)

The following languages are used:  ['en', 'de', 'fr', 'sv']


If you want to use the same languages, type yes, else type no:  yes


In [17]:
clean_df

| | index | id | created_at | language | account_id | extracted_content | tag_name |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 113470963623530040 | 2024-11-12T16:38:12.000Z | de | 109353323900910324 | Ikea hat doch extra metallbedampfte Scheiben o... | ikea |
| 1 | 0 | 113470963623530040 | 2024-11-12T16:38:12.000Z | de | 109353323900910324 | Ikea hat doch extra metallbedampfte Scheiben o... | verschworung |
| 2 | 1 | 113464391758290608 | 2024-11-11T12:46:56.000Z | de | 109456243616966336 | Lernen und Entwicklung bei IKEA mit Tatevik Mk... | training |
| 3 | 1 | 113464391758290608 | 2024-11-11T12:46:56.000Z | de | 109456243616966336 | Lernen und Entwicklung bei IKEA mit Tatevik Mk... | ikea |
| 4 | 1 | 113464391758290608 | 2024-11-11T12:46:56.000Z | de | 109456243616966336 | Lernen und Entwicklung bei IKEA mit Tatevik Mk... | Weiterbildung |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 542 | 164 | 113064473902306361 | 2024-09-01T21:42:33.000Z | en | 111002268970402978 | More photos from yesterday! Before getting sta... | Chicago |
| 543 | 164 | 113064473902306361 | 2024-09-01T21:42:33.000Z | en | 111002268970402978 | More photos from yesterday! Before getting sta... | pastry |
| 544 | 164 | 113064473902306361 | 2024-09-01T21:42:33.000Z | en | 111002268970402978 | More photos from yesterday! Before getting sta... | cozy |
| 545 | 165 | 113064393892538732 | 2024-09-01T21:22:15.000Z | en | 110676962360830227 | Urge IKEA to Stop Selling Animal Skin #AnimalR... | animalrights |


## Having a closer look at the hashtags

In [19]:
tags_count = clean_df["tag_name"].value_counts().reset_index()
tags_count.head(25)

| | tag_name | count |
|---|---|---|
| 0 | ikea | 128 |
| 1 | tradfri | 6 |
| 2 | diy | 6 |
| 3 | smarthome | 6 |
| 4 | homeassistant | 5 |
| 5 | blahaj | 4 |
| 6 | capital | 3 |
| 7 | m6 | 3 |
| 8 | greenpeace | 3 |
| 9 | furniture | 3 |


**Some Notes**

I recommend looking up any hashtags you do not understand to find out what is behind them. I checked zigbee, m6 and totaberlustig. Zigbee is a wireless protocol used by smart home devices. M6 refers to the French television network, whose show Capital ran a program about IKEA. totaberlustig refers to a comedian.

You might stumble upon incorrectly labelled languages and many other issues. For an example of more extensive cleaning, see my other notebook.

For now, move on.

**Checking posts around sustainability**

In [31]:
sustainability = ["greenwashing", "klimaschutz", "klima", "umwelt", 
                  "umweltschutz", "nachhaltig", "nachhaltigkeit", "recycling", 
                  "deforestation", "environmental", "environment", "forets",
                 "greenpeace", "waldzerstorung"]

In [32]:
df_sustainable = clean_df[clean_df["tag_name"].isin(sustainability)]
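
The `tag_name` values in the cleaned table appear in mixed case (for example `Weiterbildung` and `Chicago`), whereas the `sustainability` list is all lowercase. If sustainability tags also occur with capital letters, an exact `isin` match would miss them. A minimal sketch with made-up rows, assuming that a case-insensitive match is what is wanted:

```python
import pandas as pd

# Made-up rows; only the casing of tag_name matters here
toy = pd.DataFrame({
    "extracted_content": ["post a", "post b", "post c"],
    "tag_name": ["Nachhaltigkeit", "recycling", "Chicago"],
})
keywords = ["nachhaltigkeit", "recycling"]

# Exact match: misses the capitalised tag
exact = toy[toy["tag_name"].isin(keywords)]

# Lowercase the column first so that casing no longer matters
case_insensitive = toy[toy["tag_name"].str.lower().isin(keywords)]

print(len(exact), len(case_insensitive))
```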

In [33]:
# as described before, explode() creates near-identical copies of a row (only the tag name differs)
# each post's content is needed only once, so drop_duplicates() removes the copies
sustainable = df_sustainable.drop_duplicates("extracted_content")["extracted_content"].to_list()
for post in sustainable[:10]:
    print(post)
    print("---------------------------------------------")

@rootsandcalluses Irgendwie bin ich nicht überrascht. Gibt es bei #IKEA überhaupt wertige Produkte?Ich beispielsweise fahre mit gebrauchtem Ceracron, Friesland, Melitta, Rosenthal, Thomas z.B. bisher sehr gut. #Keramik #Nachhaltigkeit #Porzellan #Steingut
---------------------------------------------
Ikea mal nen Brief schicken und nett nachfragen ob sie bitte nicht Europas letzten Urwald fällen mögen...https://act.greenpeace.de/offenerbrief-ikea-g#greenpeace #savetheplanet #ikea #FuckElon #savethefuckingplanetbeforeitstoolate
---------------------------------------------
>#Waldzerstörung: #IKEA ist nachhaltig? Von wegen<Offener Brief an Ikea  - Jeder ist willkommen ihn zu unterschreiben! 🙏#Ecocide #Topocide #Climatecrisis #Biodiversity  #DemocracyForFuture  https://youtube.com/watch?v=05HRzxUnNn8&si=FN0f8FTZ7BtC86f7
---------------------------------------------
#DorotheaEpperl, #Greenpeace-Kolleg:innen aus #Rumänien und Vertreter:innen von #Ikea waren in den Wäldern der #Karpaten. Gem

In [27]:
len(sustainable)

6