# Scraping Toots about IKEA from Mastodon Social 

This notebook is a bit more advanced and comes with only brief explanations of the code. If you need more guidance, please refer to the notebook Mastodon_Migros_Prototype.

## Load libraries and modules

In [2]:
import json
import requests
import pandas as pd

In [3]:
from bs4 import BeautifulSoup

## Preparation - Defining Parameters, Flags, and a Timeframe

In [4]:
hashtag = 'ikea'
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'

## Browsing and scraping

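The following code fetches recent posts with a specific hashtag from Mastodon, 40 per request. It pages backwards through the timeline using `max_id` until it reaches a post created at or before the `since` timestamp, then stores the collected posts in a pandas DataFrame.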

In [48]:
params = {'limit': 40}

since = pd.Timestamp('2024-01-01 00:00:00', tz='UTC')
is_end = False


results = []
chunk_no = 1

while True:

    try:
        # timeout keeps the loop from hanging on a stalled connection
        response = requests.get(URL, params=params, timeout=30)
        # raise an exception for 4xx/5xx responses instead of parsing them
        response.raise_for_status()
    except requests.RequestException as error:
        print("An error occurred: {}".format(error))
        break

    print("STATUS OF YOUR SCRAPING")
    #print("OK - scraping of chunk no {} worked.".format(chunk_no))
    #print("--------------------------------")
    chunk_no += 1

    toots = json.loads(response.text)

    if len(toots) == 0:
        print("There were no toots returned. " 
              "Check for spelling or use another hashtag")
        break
    
    for toot in toots:
        timestamp = pd.Timestamp(toot['created_at'], tz='utc')
        if timestamp <= since:
            is_end = True
            break
            
        results.append(toot)
    
    if is_end:
        break
    
    max_id = toots[-1]['id']
    params['max_id'] = max_id

    
df = pd.DataFrame(results)

STATUS OF YOUR SCRAPING
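
The paging logic of the loop above can be tried without network access. The following is a minimal sketch, not the notebook's actual code: `collect_toots`, `fake_fetch` and `FAKE_TIMELINE` are hypothetical names, and the fake fetcher only imitates how the `max_id` parameter is assumed to behave (return posts with an ID lower than `max_id`, newest first).

```python
import pandas as pd


def collect_toots(fetch_page, since, limit=40):
    """Page backwards through a timeline until a toot is not newer than `since`."""
    results = []
    params = {"limit": limit}
    while True:
        toots = fetch_page(params)
        if not toots:
            # nothing (more) to fetch
            break
        for toot in toots:
            if pd.Timestamp(toot["created_at"]) <= since:
                # reached the start of the timeframe, stop completely
                return results
            results.append(toot)
        # ask for posts older than the last one seen in this chunk
        params["max_id"] = toots[-1]["id"]
    return results


# Hypothetical stand-in data, ordered newest first like a real timeline
FAKE_TIMELINE = [
    {"id": "5", "created_at": "2024-03-01T00:00:00.000Z"},
    {"id": "4", "created_at": "2024-02-15T00:00:00.000Z"},
    {"id": "3", "created_at": "2024-02-01T00:00:00.000Z"},
    {"id": "2", "created_at": "2023-12-31T00:00:00.000Z"},
    {"id": "1", "created_at": "2023-12-01T00:00:00.000Z"},
]


def fake_fetch(params):
    """Imitates one API request: posts older than max_id, at most `limit` of them."""
    max_id = params.get("max_id")
    older = [t for t in FAKE_TIMELINE
             if max_id is None or int(t["id"]) < int(max_id)]
    return older[:params["limit"]]


since = pd.Timestamp("2024-01-01 00:00:00", tz="UTC")
toots = collect_toots(fake_fetch, since, limit=2)
print([toot["id"] for toot in toots])
```

Passing the fetch function in as an argument keeps the stopping rule separate from the HTTP request, so the same loop could be pointed at `requests.get` later.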


## Results and DF

In [49]:
df.shape

(1050, 24)

In [50]:
df.columns

Index(['id', 'created_at', 'in_reply_to_id', 'in_reply_to_account_id',
       'sensitive', 'spoiler_text', 'visibility', 'language', 'uri', 'url',
       'replies_count', 'reblogs_count', 'favourites_count', 'edited_at',
       'content', 'reblog', 'account', 'media_attachments', 'mentions', 'tags',
       'emojis', 'card', 'poll', 'application'],
      dtype='object')

## Cleaning up the data in the DataFrame

In [51]:
df['account_id'] = df['account'].apply(lambda x: x['id'])
#df_small = df[["id", "created_at", "language", "account_id", "content", "tags"]]
#df_small.head()

**Using explode on the tags column**

In [52]:
df = df.explode("tags").reset_index()


In [None]:
#df.head()

**Cleaning up the tags column a little more**

In [53]:
df["tag_name"] = df["tags"].apply(lambda y: y['name'])
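
To see what these two steps do, here is a small sketch on made-up data (the `toy` DataFrame and its values are hypothetical). Each post keeps its original row number in the `index` column created by `reset_index()`, so rows belonging to the same post can still be recognised after `explode`.

```python
import pandas as pd

# Two hypothetical posts; "tags" holds a list of dicts like the Mastodon API returns
toy = pd.DataFrame({
    "id": ["1", "2"],
    "tags": [[{"name": "ikea"}, {"name": "diy"}], [{"name": "ikea"}]],
})

# One row per (post, tag) pair; the old row number moves into the "index" column
exploded = toy.explode("tags").reset_index()

# Each cell in "tags" is now a single dict, so the name can be pulled out directly
exploded["tag_name"] = exploded["tags"].apply(lambda tag: tag["name"])

print(exploded[["index", "id", "tag_name"]])
```

Note that a post with an empty tag list would produce a missing value after `explode`, and the lambda would then fail. That case should not occur here, because every post was found through a hashtag.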

**Extracting the text from the content column**

In [54]:
def extract_text_from_html(html):

    """Parsing a string which is in html code to extract actual text.

    Parameters
    ----------
    html : string
        String that stores the content of the post (or any text) in html.

    Returns
    -------
    soup.get_text() : string
        A string with the actual text, without any html tags etc.
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()
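
One thing to be aware of: `get_text()` joins the text of neighbouring paragraphs without any separator, so the last word of one paragraph can end up glued to the first word of the next. The following is a possible variant, sketched under the assumption that post content is wrapped in `<p>` elements; `extract_text_keep_paragraphs` and the sample HTML are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup


def extract_text_keep_paragraphs(html):
    """Like extract_text_from_html, but puts a newline between <p> elements."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    if not paragraphs:
        # no paragraph markup at all, fall back to the plain behaviour
        return soup.get_text()
    return "\n".join(p.get_text() for p in paragraphs)


# Hypothetical post content: a sentence followed by a link in its own paragraph
sample = (
    '<p>Watch this</p>'
    '<p><a href="https://example.org/v">'
    '<span class="invisible">https://</span><span>example.org/v</span>'
    '</a></p>'
)

plain = BeautifulSoup(sample, "html.parser").get_text()
separated = extract_text_keep_paragraphs(sample)
print(repr(plain))
print(repr(separated))
```

Joining per paragraph, rather than passing a separator to `get_text()`, keeps URLs that are split across several `<span>` elements in one piece.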



In [55]:
df["extracted_content"] = df["content"].apply(extract_text_from_html)

**Narrowing down the DataFrame**

In [56]:
df_small = df[[
    "index", 
    "id", 
    "created_at", 
    "language", 
    "account_id", 
    "extracted_content", 
    "tag_name"]]

In [57]:
df_small.head()

| | index | id | created_at | language | account_id | extracted_content | tag_name |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 113314594177757198 | 2024-10-16T01:51:29.000Z | en | 109455979333972647 | Sometimes, we should learn more from our envir... | cybersecurity |
| 1 | 0 | 113314594177757198 | 2024-10-16T01:51:29.000Z | en | 109455979333972647 | Sometimes, we should learn more from our envir... | ikea |
| 2 | 1 | 113313239562925427 | 2024-10-15T20:07:00.483Z | de | 113283168624628309 | @ninja_turtle und eine neue Duftkerze? Ich nim... | duftkerze |
| 3 | 1 | 113313239562925427 | 2024-10-15T20:07:00.483Z | de | 113283168624628309 | @ninja_turtle und eine neue Duftkerze? Ich nim... | ikea |
| 4 | 2 | 113312810493539002 | 2024-10-15T18:17:53.402Z | de | 113282508677757328 | was wäre ein Ikea-Besuch ohne einen Hotdog #ik... | ikea |


In [69]:
df_small["language"].unique()

array(['en', 'de', 'fr', 'sv'], dtype=object)

In [61]:
# languages I can read, so that I can interpret the texts and hashtags
language_competences = ["en", "de", "fr", "sv"]

In [64]:
df_small = df_small[df_small['language'].isin(language_competences)]

In [68]:
df_small["language"].unique()

array(['en', 'de', 'fr', 'sv'], dtype=object)

## Having a closer look at the hashtags

In [71]:
tags_count = df_small["tag_name"].value_counts().reset_index()
tags_count.head(25)

| | tag_name | count |
|---|---|---|
| 0 | ikea | 868 |
| 1 | blahaj | 25 |
| 2 | zpravicky | 20 |
| 3 | diy | 18 |
| 4 | arte | 17 |
| 5 | furniture | 17 |
| 6 | chytradomacnost | 17 |
| 7 | tradfri | 16 |
| 8 | smarthome | 15 |
| 9 | homeassistant | 14 |
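
The column names that `value_counts().reset_index()` produces depend on the installed pandas version. A sketch that names both columns explicitly, shown on a made-up Series, could look like this:

```python
import pandas as pd

# Hypothetical tag column, one entry per (post, tag) row
tag_names = pd.Series(["ikea", "ikea", "diy"], name="tag_name")

counts = (
    tag_names.value_counts()          # counts per tag, most frequent first
    .rename_axis("tag_name")          # name the index so it becomes a named column
    .reset_index(name="count")        # name the column that holds the counts
)
print(counts)
```

Because `explode` created one row per post and tag, each count is the number of rows carrying that tag, which equals the number of posts as long as a tag appears only once per post.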


**Some Notes**

I recommend looking up any hashtag you do not understand to find out what is behind it. I checked zpravicky, chytradomacnost, cesko and zigbee. The first three come from a single account whose posts are marked with the language "en" but are actually written in Czech. I am going to drop that account, but not without first checking that all of its posts really are in Czech.

In [85]:
accounts_to_drop = ["110428422327008517"]

In [94]:
df_small = df_small[~df_small['account_id'].isin(accounts_to_drop)]
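
The check mentioned in the notes, reading through an account's posts before dropping it, could be sketched as follows. The `posts` DataFrame, its account IDs and its texts are placeholders, not data from the scrape.

```python
import pandas as pd

# Hypothetical stand-in for df_small
posts = pd.DataFrame({
    "account_id": ["A", "A", "A", "B"],
    "language": ["en", "en", "en", "de"],
    "extracted_content": ["first post", "first post", "second post", "third post"],
})

suspect = "A"  # placeholder for the account ID under review

# Every distinct post of the suspect account, for reading through manually
suspect_posts = (
    posts.loc[posts["account_id"] == suspect, "extracted_content"]
    .drop_duplicates()
)
for text in suspect_posts:
    print(text)
    print("---")

# Only after the manual check: keep everything that is NOT from the account
kept = posts[~posts["account_id"].isin([suspect])]
```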

**Checking posts around sustainability**

In [98]:
sustainability = ["greenwashing", "klimaschutz", "klima", "umwelt", "umweltschutz", "nachhaltig", "nachhaltigkeit", "recycling", "deforestation", "environmental"]

In [99]:
df_sustainable = df_small[df_small["tag_name"].isin(sustainability)]

In [102]:
# explode() produced one row per tag, so a post with several tags appears
# several times (identical except for tag_name).
# drop_duplicates keeps each post's text only once.
sustainable = df_sustainable.drop_duplicates("extracted_content")["extracted_content"].to_list()
for post in sustainable[:10]:
    print(post)
    print("---------------------------------------------")

🙅 Refusons la #surconsommation et #greenwashing associé que nous vendent les multinationales comme #ikea 🌲🌳🪓@XR_Bordeaux propose une petite vidéo de sensibilisation sur le sujet 😉#StopDeforestation
---------------------------------------------
⏳️ Sensibilisation sur la déforestation par Extinction Rebellion Bordeauxhttps://tube.extinctionrebellion.fr/videos/watch/d03c725c-80b3-45bc-a35f-25f33bd5961f
---------------------------------------------
Ich wusste gar nicht, dass deren Möbel überhaupt so lange halten.

Ganz abgesehen davon holzt IKEA unsere Wälder ab in Rumänien u. a. ... 🤬 #ikea #umwelt #umweltschutz #klima #holzmafiaRE: https://bsky.app/profile/did:plc:vk2mooi24pafrjmhpg4ymrv3/post/3l2moh4uoci2r
---------------------------------------------
Ich möchte gerne eine semi-geheime Information über #IKEA weitergeben:Die Trinkflasche "ENKELSPÅRIG" hat im Deckel einen Dichtungsring aus Silikon, der leider leicht rausfällt und verloren geht, wodurch die eigentlich unverwüstliche Flasch

In [103]:
len(sustainable)

36
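
Deduplicating on the text also merges different posts that happen to have identical wording. If that matters for the count, deduplicating on the post `id` is an alternative. A small sketch with made-up rows shows the difference:

```python
import pandas as pd

# Hypothetical exploded rows: post 1 has two tags, post 2 repeats the text of post 1
rows = pd.DataFrame({
    "id": ["1", "1", "2", "3"],
    "extracted_content": ["Same text", "Same text", "Same text", "Other text"],
    "tag_name": ["klima", "umwelt", "klima", "recycling"],
})

by_text = rows.drop_duplicates("extracted_content")  # one row per distinct text
by_id = rows.drop_duplicates("id")                   # one row per distinct post

print(len(by_text), len(by_id))
```

Which variant fits depends on the question: distinct texts are enough for reading through the posts, while distinct IDs give the number of posts.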