# Recommender System

In [168]:
#!pip install -U pip setuptools wheel
#!pip install -U spacy
#!python -m spacy download en_core_web_lg
#!python -m spacy download de_core_news_lg

import pandas as pd
import spacy
import os

nlp = spacy.load("en_core_web_lg")
nlp_germ = spacy.load("de_core_news_lg")


### Load the prepared dataset

In [169]:
filename = "all_toots.csv"
path = "../scraper/datasets"
data = pd.read_csv(os.path.join(path, filename), sep=";")
data.head()

Unnamed: 0,toot_id,content,reblogs_count,favourites_count,replies_count,mentions,tags,language,created_at,edited_at,instance
0,110322104651328999,Study makes troubling revelation about the bot...,3,0,0,[],"[{'name': 'ocean', 'url': 'https://mastodon.so...",en,2023-05-06 14:02:02+00:00,,mastodon.social
1,110322072260430195,ふわふわじゃないのに高いの…最悪じゃん〜\n\n,0,0,0,[],[],ja,2023-05-06 13:53:49+00:00,,mastodon.social
2,110322107441942062,\n\n,0,0,0,[],[],en,2023-05-06 14:02:46.388000+00:00,,mastodon.social
3,110322074699247900,2週間ぶり\n\n,1,0,0,[],[],,2023-05-06 13:54:26.666000+00:00,,mastodon.social
4,110322069624451350,せ、生命活動・・・\n[#おうどんラジオ](https://social.vivaldi.n...,0,0,0,[],"[{'name': 'おうどんラジオ', 'url': 'https://mastodon....",ja,2023-05-06 13:53:08+00:00,,mastodon.social


### Select toots in English and German

In [170]:
mask_language = (data["language"] == "en") | (data["language"] == "de")
data = data[mask_language]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60050 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   toot_id           60050 non-null  int64 
 1   content           60050 non-null  object
 2   reblogs_count     60050 non-null  int64 
 3   favourites_count  60050 non-null  int64 
 4   replies_count     60050 non-null  int64 
 5   mentions          60050 non-null  object
 6   tags              60050 non-null  object
 7   language          60050 non-null  object
 8   created_at        60050 non-null  object
 9   edited_at         1880 non-null   object
 10  instance          60050 non-null  object
dtypes: int64(4), object(7)
memory usage: 5.5+ MB


In [184]:
test_toot_df = data
# Drop entries with a duplicate toot_id, keeping the first occurrence
test_toot_df = test_toot_df.drop_duplicates(subset="toot_id", keep="first")
test_toot_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1524 entries, 0 to 56662
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   toot_id           1524 non-null   int64 
 1   content           1524 non-null   object
 2   reblogs_count     1524 non-null   int64 
 3   favourites_count  1524 non-null   int64 
 4   replies_count     1524 non-null   int64 
 5   mentions          1524 non-null   object
 6   tags              1524 non-null   object
 7   language          1524 non-null   object
 8   created_at        1524 non-null   object
 9   edited_at         47 non-null     object
 10  instance          1524 non-null   object
dtypes: int64(4), object(7)
memory usage: 142.9+ KB


### Recommender System for the local timeline

- Step 1. Get relevant toots from people in the local timeline, based on their content and the interests selected after registration
- Step 2. Get toots from people you follow
- Step 3. Get people with similar interests (who to follow)
- Step 4. Get toots by hashtags (filter hashtags by interests)
- Step 5. Mix data
- Step 6. Rank the toots with a ranking system and sort them in descending order
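
Steps 5 and 6 can be sketched as follows. This is a minimal, library-free illustration rather than the notebook's implementation: the function name `mix_and_rank`, the candidate lists, and the toot IDs and scores are all made-up assumptions. It merges candidate toots from several sources, keeps one entry per `toot_id`, and sorts by `ranking_score` in descending order.

```python
def mix_and_rank(*candidate_lists):
    """Merge candidate toots from several sources and rank them (hypothetical helper)."""
    merged = {}
    for candidates in candidate_lists:
        for toot in candidates:
            # Step 5: mix the sources, keeping the first occurrence of each toot_id
            merged.setdefault(toot["toot_id"], toot)
    # Step 6: sort by ranking score, highest first
    return sorted(merged.values(), key=lambda toot: toot["ranking_score"], reverse=True)


# Made-up candidates for illustration only
by_interest = [{"toot_id": 1, "ranking_score": 0.30}, {"toot_id": 2, "ranking_score": 0.12}]
by_following = [{"toot_id": 3, "ranking_score": 0.45}, {"toot_id": 1, "ranking_score": 0.30}]
by_hashtag = [{"toot_id": 4, "ranking_score": 0.05}]

timeline = mix_and_rank(by_interest, by_following, by_hashtag)
print([toot["toot_id"] for toot in timeline])  # [3, 1, 2, 4]
```

Toot 1 appears in two sources but is listed only once in the mixed timeline.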

##### Initial problems on setup:
- missing toots in the local timeline
- missing people with similar interests
- missing toots from people you follow

##### Solutions:

- Create initial content in the local timeline (bot content)
- ....

#### Step 1: Get relevant toots from people in the local timeline, based on their content and the interests selected after registration

In [172]:
interests = ["climbing", "gaming", "datascience", "politics", "math"] #create list of interests after login/registration

##### Similarity check with spaCy

In [173]:
def lemmatize_text(text):
    """Function to lemmatize text data and remove the stopwords."""
    doc = nlp(text)
    
    # Lemmatization and removal of stop words
    processed_tokens = [token.lemma_ for token in doc if not token.is_stop]
    
    # Return the formatted text as a string
    processed_text = ' '.join(processed_tokens)
    
    return processed_text

In [174]:
# Work on an explicit copy so the new column is not assigned to a slice of `data`
test_toot_df = test_toot_df.copy()
# Create new column with lemmatized text
test_toot_df["content_lemma"] = test_toot_df["content"].apply(lemmatize_text)
test_toot_df.head()


Unnamed: 0,toot_id,content,reblogs_count,favourites_count,replies_count,mentions,tags,language,created_at,edited_at,instance,content_lemma
0,110322104651328999,Study makes troubling revelation about the bot...,3,0,0,[],"[{'name': 'ocean', 'url': 'https://mastodon.so...",en,2023-05-06 14:02:02+00:00,,mastodon.social,study make troubling revelation ocean : ' \n t...
2,110322107441942062,\n\n,0,0,0,[],[],en,2023-05-06 14:02:46.388000+00:00,,mastodon.social,\n\n
6,110322056600760455,It’s like a sketch. I can’t quite believe it’s...,0,0,0,[],[],en,2023-05-06 13:49:46+00:00,,mastodon.social,like sketch . believe real . \n\n
7,110322106189475983,Catching up with Dwellings? Add all four back ...,0,0,0,"[{'id': 109297712949714412, 'username': 'jstep...","[{'name': 'crowdfunding', 'url': 'https://mast...",en,2023-05-06 14:02:27.281000+00:00,,mastodon.social,catch dwelling ? add issue \n [ @jstephenscomi...
11,110322085673775812,[#KBOS](https://mastodon.social/tags/KBOS) /\n...,0,0,0,[],"[{'name': 'kbos', 'url': 'https://mastodon.soc...",en,2023-05-06 13:57:14.245000+00:00,,mastodon.social,[ # kbos](https://mastodon.social / tag / KBOS...


In [182]:
def calculate_content_similarity_score(interests, toot_dataframe, sort_dataframe_by_content_similarity=True):
    """Function to calculate the similarity score between the interests and the toot content."""
    
    # Parse each interest once instead of once per toot
    interest_docs = [nlp(interest) for interest in interests]
    
    # Collect one average similarity score per toot, in row order
    similarity_scores = []
    for toot_content in toot_dataframe['content_lemma']:
        toot_doc = nlp(toot_content)
        
        # Sum the similarity between the toot content and each interest
        similarity_scores_sum = 0
        for interest_doc in interest_docs:
            similarity_scores_sum += toot_doc.similarity(interest_doc)
        
        # Calculate the average similarity score
        similarity_scores.append(similarity_scores_sum / len(interests))
    
    # Create a new DataFrame with the additional column content_similarity_score
    result_dataframe = toot_dataframe.copy()
    result_dataframe['content_similarity_score'] = similarity_scores
    
    if sort_dataframe_by_content_similarity:
        # Sort the DataFrame by the column similarity_score (descending) and reset the index
        result_dataframe.sort_values('content_similarity_score', ascending=False, inplace=True)
        result_dataframe.reset_index(drop=True, inplace=True)
    
    
    return result_dataframe
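
For pipelines that ship word vectors, such as `en_core_web_lg`, spaCy's `Doc.similarity` defaults to the cosine similarity of the two document vectors, and a document vector is the average of its token vectors. The sketch below reproduces the averaging over interests in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `average_interest_similarity` are hypothetical helper names, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def average_interest_similarity(toot_vector, interest_vectors):
    """Mean cosine similarity between one toot and every interest (hypothetical helper)."""
    scores = [cosine_similarity(toot_vector, v) for v in interest_vectors]
    return sum(scores) / len(scores)


# Toy vectors standing in for real embeddings (assumed values)
toot_vector = [1.0, 0.0, 0.0]
interest_vectors = [
    [1.0, 0.0, 0.0],   # same direction     -> similarity 1.0
    [0.0, 1.0, 0.0],   # orthogonal         -> similarity 0.0
    [-1.0, 0.0, 0.0],  # opposite direction -> similarity -1.0
]
print(average_interest_similarity(toot_vector, interest_vectors))  # 0.0
```

Cosine similarity ranges from -1 to 1, so an averaged score can be negative, and a toot without usable word vectors (for example one that contains only line breaks) needs the zero guard.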

In [183]:
similar_toots = calculate_content_similarity_score(interests, test_toot_df)
similar_toots.head()

  similarity_scores_sum += toot_doc.similarity(interest_doc)


Unnamed: 0,toot_id,content,reblogs_count,favourites_count,replies_count,mentions,tags,language,created_at,edited_at,instance,content_lemma,content_similarity_score
0,110322098597109624,It's also hilarious that half of these people ...,0,1,1,[],[],en,2023-05-06 14:00:23+00:00,,mastodon.social,hilarious half people make video \n young curr...,0.325887
1,110322062761903802,I can see why people would play with the idea ...,0,0,0,[],[],en,2023-05-06 13:51:23+00:00,,mastodon.social,people play idea App . \n mean people try arch...,0.325529
2,110322107673623618,trivia quiz game show-style videos!: Using mul...,0,0,1,[],[],en,2023-05-06 14:02:49+00:00,,mastodon.social,trivium quiz game - style video ! : multimedia...,0.325447
3,110322097183909229,Watching these idiots learn the value of artis...,14,1,1,[],[],en,2023-05-06 14:00:09+00:00,,mastodon.social,watch idiot learn value artistic labor enterta...,0.322838
4,110322107628697417,online or through apps: There are several mobi...,0,0,1,[],[],en,2023-05-06 14:02:48+00:00,,mastodon.social,online app : mobile app like QuizUp Trivia \n ...,0.320174


### Interaction Score
This section calculates an interaction score as the sum of a toot's interactions (favourites_count, replies_count, reblogs_count). The score is then normalized to the range 0 to 1.

In [185]:
def calculate_interaction_score(toot_df, sort_by_interaction_score=False):
    """Function to calculate the interaction score of a toot."""
    
    # Calculate the interaction score
    toot_df['interaction_score'] = toot_df['favourites_count'] + toot_df['replies_count'] + toot_df['reblogs_count']
    
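    # Normalize the interaction score to the value range [0, 1];
    # skip the division if no toot has any interaction, which would otherwise produce NaN
    max_interaction_score = toot_df['interaction_score'].max()
    if max_interaction_score > 0:
        toot_df['interaction_score'] = toot_df['interaction_score'] / max_interaction_score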
    
    if sort_by_interaction_score:
        # Sort the DataFrame according to the interaction score (descending)
        toot_df.sort_values('interaction_score', ascending=False, inplace=True)
        toot_df.reset_index(drop=True, inplace=True)
    
    return toot_df

In [178]:
similar_toots = calculate_interaction_score(similar_toots)
similar_toots.head()

Unnamed: 0,toot_id,content,reblogs_count,favourites_count,replies_count,mentions,tags,language,created_at,edited_at,instance,content_lemma,content_similarity_score,interaction_score
0,110322104651328999,Study makes troubling revelation about the bot...,3,0,0,[],"[{'name': 'ocean', 'url': 'https://mastodon.so...",en,2023-05-06 14:02:02+00:00,,mastodon.social,study make troubling revelation ocean : ' \n t...,-0.061759,0.009063
2,110322107441942062,\n\n,0,0,0,[],[],en,2023-05-06 14:02:46.388000+00:00,,mastodon.social,\n\n,0.0,0.0
6,110322056600760455,It’s like a sketch. I can’t quite believe it’s...,0,0,0,[],[],en,2023-05-06 13:49:46+00:00,,mastodon.social,like sketch . believe real . \n\n,0.26116,0.0
7,110322106189475983,Catching up with Dwellings? Add all four back ...,0,0,0,"[{'id': 109297712949714412, 'username': 'jstep...","[{'name': 'crowdfunding', 'url': 'https://mast...",en,2023-05-06 14:02:27.281000+00:00,,mastodon.social,catch dwelling ? add issue \n [ @jstephenscomi...,-0.071895,0.0
11,110322085673775812,[#KBOS](https://mastodon.social/tags/KBOS) /\n...,0,0,0,[],"[{'name': 'kbos', 'url': 'https://mastodon.soc...",en,2023-05-06 13:57:14.245000+00:00,,mastodon.social,[ # kbos](https://mastodon.social / tag / KBOS...,-0.083949,0.0
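
The normalization can be checked by hand. The ranking table in this notebook lists a toot with 116 reblogs, 200 favourites, and 15 replies at an interaction score of 1.0, so the maximum raw score is assumed to be 331. The following plain-Python sketch (the helper name `normalized_interaction_score` is hypothetical) recomputes the score for the first row above.

```python
def normalized_interaction_score(favourites, replies, reblogs, max_raw_score):
    """Sum the interaction counts and scale by the largest raw score (hypothetical helper)."""
    raw_score = favourites + replies + reblogs
    # Guard against a dataset in which no toot has any interaction
    return raw_score / max_raw_score if max_raw_score > 0 else 0.0


# Assumption: the toot with 116 reblogs, 200 favourites and 15 replies sets the maximum
max_raw_score = 116 + 200 + 15  # 331

# First row of the table: 3 reblogs, 0 favourites, 0 replies
score = normalized_interaction_score(favourites=0, replies=0, reblogs=3, max_raw_score=max_raw_score)
print(round(score, 6))  # 0.009063
```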


### Preliminary Ranking Score
This section calculates a ranking score as the sum of the weighted scores. The DataFrame is then sorted by the ranking score.

In [179]:
def calculate_ranking_score(toot_df, similarity_weight, interaction_weight):
    """Function to calculate the ranking score of a toot."""
    
    # Calculate the ranking score
    toot_df['ranking_score'] = (similarity_weight * toot_df['content_similarity_score']) + (interaction_weight * toot_df['interaction_score']) 
    
    # Sort the DataFrame according to the ranking score (descending)
    toot_df.sort_values('ranking_score', ascending=False, inplace=True)
    toot_df.reset_index(drop=True, inplace=True)
    
    return toot_df

In [180]:
# Set the weights for Similarity score and Interaction score
similarity_weight = 0.9
interaction_weight = 0.1

# Calculate the ranking score and expand the DataFrame 
toot_df_with_ranking = calculate_ranking_score(similar_toots, similarity_weight, interaction_weight)
toot_df_with_ranking.head()

Unnamed: 0,toot_id,content,reblogs_count,favourites_count,replies_count,mentions,tags,language,created_at,edited_at,instance,content_lemma,content_similarity_score,interaction_score,ranking_score
0,110322106296283536,I’m proud to live in a country where we earn p...,116,200,15,[],[],en,2023-05-06 14:02:28.907000+00:00,,mastodon.social,proud live country earn power democratic way ....,0.261034,1.0,0.334931
1,110322097183909229,Watching these idiots learn the value of artis...,14,1,1,[],[],en,2023-05-06 14:00:09+00:00,,mastodon.social,watch idiot learn value artistic labor enterta...,0.322838,0.048338,0.295388
2,110322098597109624,It's also hilarious that half of these people ...,0,1,1,[],[],en,2023-05-06 14:00:23+00:00,,mastodon.social,hilarious half people make video \n young curr...,0.325887,0.006042,0.293902
3,110322107673623618,trivia quiz game show-style videos!: Using mul...,0,0,1,[],[],en,2023-05-06 14:02:49+00:00,,mastodon.social,trivium quiz game - style video ! : multimedia...,0.325447,0.003021,0.293204
4,110322062761903802,I can see why people would play with the idea ...,0,0,0,[],[],en,2023-05-06 13:51:23+00:00,,mastodon.social,people play idea App . \n mean people try arch...,0.325529,0.0,0.292976
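
With the weights 0.9 and 0.1, the first two rows can be recomputed from their component scores. A minimal sketch in plain Python: `ranking_score` is a hypothetical scalar counterpart of `calculate_ranking_score`, and the inputs are copied from the table above.

```python
def ranking_score(similarity_score, interaction_score, similarity_weight=0.9, interaction_weight=0.1):
    """Weighted sum of the two scores for a single toot (hypothetical scalar helper)."""
    return similarity_weight * similarity_score + interaction_weight * interaction_score


# Rows 0 and 1 of the ranking table: (content_similarity_score, interaction_score)
top = ranking_score(0.261034, 1.0)
second = ranking_score(0.322838, 0.048338)
print(round(top, 6), round(second, 6))  # 0.334931 0.295388
```

The toot with the most interactions ranks first despite its lower similarity score, because its interaction score of 1.0 contributes 0.1 on its own.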


In [181]:
for toot_content in toot_df_with_ranking[:10].content:
    print(toot_content)

I’m proud to live in a country where we earn power in a more democratic way.
By calling Georgia officials and asking them to find a few thousand votes.


Watching these idiots learn the value of artistic labor is more entertaining
than any show.


It's also hilarious that half of these people making these videos are too
young to have used them when they were current tech so they are looking at
them as some sort of retro curiosity.


trivia quiz game show-style videos!: Using multimedia platforms like YouTube
allows one the creative freedom to generate unique content related quizzes
they can then share online with others around the world wishing likewise
quality entertainment within this category - all focused upon favorite themes
central enjoyed commonly together between shared followers alike concerning
any detail worthy enough disclosure overall among peers attention solely
concentrated


I can see why people would play with the idea of an Everything App. By that I
mean people who wo

## Problems:
- Performance
    - The similarity check takes relatively long (>1000 toots)
    - Lemmatizing takes relatively long (>1000 toots)
    -> This can lead to long loading times when opening the app.
    
    Possible solution: categorize the toot content after publishing, persist the categories in a DB, and use pattern matching with regex (matching the interests against the categories)
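
The possible solution could look roughly like the sketch below. It is not taken from the project: the category names, keyword patterns, table layout, and helper names (`categorize`, `publish_toot`, `toots_for_interests`) are all illustrative assumptions, and an in-memory SQLite database stands in for the real store. Each toot is categorized once with regular expressions when it is published, so opening the app only needs a cheap lookup instead of running spaCy over every toot.

```python
import re
import sqlite3

# Hypothetical keyword patterns per category (assumed for illustration)
CATEGORY_PATTERNS = {
    "climbing": re.compile(r"\b(climb\w*|boulder\w*)\b", re.IGNORECASE),
    "gaming": re.compile(r"\b(gam(e|es|ing)|quiz\w*)\b", re.IGNORECASE),
    "politics": re.compile(r"\b(vote\w*|election\w*|democra\w*)\b", re.IGNORECASE),
}


def categorize(content):
    """Return the sorted names of all categories whose pattern occurs in the text."""
    return sorted(name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(content))


def publish_toot(connection, toot_id, content):
    """Categorize a toot once, at publish time, and persist one row per category."""
    rows = [(toot_id, category) for category in categorize(content)]
    connection.executemany("INSERT INTO toot_categories (toot_id, category) VALUES (?, ?)", rows)


def toots_for_interests(connection, interests):
    """Look up the IDs of toots whose stored categories match any interest."""
    placeholders = ", ".join("?" for _ in interests)
    query = (
        "SELECT DISTINCT toot_id FROM toot_categories "
        f"WHERE category IN ({placeholders}) ORDER BY toot_id"
    )
    return [row[0] for row in connection.execute(query, list(interests))]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE toot_categories (toot_id INTEGER, category TEXT)")

# Made-up toots for illustration
publish_toot(connection, 1, "Trivia quiz game show-style videos!")
publish_toot(connection, 2, "Asking officials to find a few thousand votes.")
publish_toot(connection, 3, "Nothing relevant here.")

print(toots_for_interests(connection, ["gaming", "math"]))  # [1]
```

Keyword patterns are much coarser than vector similarity, so a design like this trades recommendation quality for load time. An interest without a matching category, such as "math" here, returns no toots.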