<a id=contents></a>

# Model building

This notebook loads the BERTopic model that I trained on Google Colab, merges the model's output with the rest of the data, and explores the results.


[1. Loading up the training data and model](#one1)

[2. Exploring and reducing the topics](#two)

[3. Linkages in Topics](#three)

[4. Our main topics over time](#four)

[5. Sentiment over time](#five)

[7. Conclusions and model comparison table](#conc)

<a id=one1></a>

## 1. Load up training data and model

[LINK to table of contents](#contents)

In [44]:
%load_ext autoreload
%autoreload 2

from functions.topic_modelling_BertTopic import *

import warnings
warnings.filterwarnings('ignore')
import pandas as pd


import joblib
from umap import UMAP

from scipy.stats import hmean
import plotly
import torch
import pickle, io

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
print('Loading up docs')
filepath = 'data/clean/features/text_for_topics_post_Aug22.csv'
docs = pd.read_csv(filepath, index_col='tweet_id')

# Load the tweet metadata as well
filepath = 'data/clean/dashboard_data.csv'
df = pd.read_csv(filepath, index_col='tweet_id').drop(columns=['Unnamed: 0', 'clean_tweet_text'])
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.join(docs, how='right')
df.info()

Loading up docs
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12037 entries, 1580168615357140992 to 1565189065158311937
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype              
---  ------                       --------------  -----              
 0   Unnamed: 0.2                 12037 non-null  int64              
 1   Unnamed: 0.1                 12037 non-null  int64              
 2   datetime                     12037 non-null  datetime64[ns, UTC]
 3   display_name                 12037 non-null  object             
 4   tweet_text                   12037 non-null  object             
 5   User_id                      12037 non-null  float64            
 6   #likes                       12037 non-null  float64            
 7   #retweets                    12037 non-null  float64            
 8   #responses                   12037 non-null  float64            
 9   language                     12037 non-null  object             
 10
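The `df.info()` output still lists `Unnamed: 0.2` and `Unnamed: 0.1`, leftover index columns from earlier CSV round trips. The sketch below shows one way to drop every such column at once and confirms that a right join keeps only the rows present in the text table. It uses small made-up frames (`meta`, `docs`, and their values are illustrative assumptions), not the real data.

```python
import pandas as pd

# Toy stand-ins for the metadata and text tables; all values are made up.
meta = pd.DataFrame(
    {"Unnamed: 0.1": [0, 1, 2], "#likes": [5.0, 1.0, 9.0]},
    index=pd.Index([101, 102, 103], name="tweet_id"),
)
docs = pd.DataFrame(
    {"clean_tweet_text": ["first tweet", "third tweet"]},
    index=pd.Index([101, 103], name="tweet_id"),
)

# Drop every leftover index column in one step instead of naming each one.
meta = meta.loc[:, ~meta.columns.str.startswith("Unnamed")]

# A right join keeps exactly the rows whose tweet_id appears in `docs`.
merged = meta.join(docs, how="right")
print(merged)
```

Tweet 102 has no cleaned text, so it is absent from `merged`. The same mechanism in the cell above leaves `df` with one row per document in `docs`.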

In [45]:
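class CPU_Unpickler(pickle.Unpickler):
    """Unpickler that maps torch storages saved on a GPU onto the CPU."""
    def find_class(self, module, name):
        # torch pickles tensor storages through this helper; reroute it to the CPU
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        return super().find_class(module, name)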



In [53]:
topic_model = joblib.load('models/topic/BertTopic/topic_model_final.pkl', 'r+')

TypeError: 'bytes' object cannot be interpreted as an integer
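The `CPU_Unpickler` class defined earlier is never actually used in the failing cell. A minimal sketch of how it could be applied follows, assuming the model file was written with plain `pickle` on a GPU machine. That assumption is not confirmed by this notebook, and a file written by `joblib.dump` may not be readable this way. The helper `load_on_cpu` and the demo file name are hypothetical. The class is repeated so the block runs on its own, and `torch` is imported only when a torch storage is encountered.

```python
import io
import os
import pickle
import tempfile


class CPUUnpickler(pickle.Unpickler):
    """Unpickler that redirects torch storages saved on a GPU to the CPU."""

    def find_class(self, module, name):
        if module == "torch.storage" and name == "_load_from_bytes":
            import torch  # imported lazily; only needed for torch payloads
            return lambda b: torch.load(io.BytesIO(b), map_location="cpu")
        return super().find_class(module, name)


def load_on_cpu(path):
    """Hypothetical helper: unpickle `path` with CPUUnpickler."""
    with open(path, "rb") as f:
        return CPUUnpickler(f).load()


# Round-trip check on a plain dict (no torch objects involved).
payload = {"topics": [0, 1, -1]}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(payload, f)
    restored = load_on_cpu(demo_path)
```

If the assumption holds, `load_on_cpu('models/topic/BertTopic/topic_model_final.pkl')` would be the corresponding call for the model file.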

<a id=two></a>

## 2. Exploring and reducing topics

[LINK to table of contents](#contents)

<a id=three></a>

## 3. Linkages in topics

[LINK to table of contents](#contents)

<a id=four></a>

## 4. Our main topics over time

[LINK to table of contents](#contents)

<a id=five></a>

## 5. Sentiment over time

[LINK to table of contents](#contents)

On Google Colab, I also used a pre-trained [sentiment classifier](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), a DistilBERT model fine-tuned on SST-2, to infer the sentiment of the collected tweets. Let's explore how sentiment varies across the other dimensions of our dataset.

In [19]:
sent = pd.read_csv('data/preds/text_and_sentiment_preds.csv', index_col='tweet_id')
sent = sent.join(df[['datetime', 'By_or_at_Musk', 'language', '#likes', '#retweets', '#responses','display_name']])
# Keep only English tweets
sent = sent.loc[sent.language=='en']
sent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11914 entries, 1580168615357140992 to 1565189065158311937
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   clean_tweet_text      11914 non-null  object             
 1   Pred_sentiment_out    11914 non-null  object             
 2   Pred_sentiment_score  11914 non-null  float64            
 3   datetime              11914 non-null  datetime64[ns, UTC]
 4   By_or_at_Musk         11914 non-null  object             
 5   language              11914 non-null  object             
 6   #likes                11914 non-null  float64            
 7   #retweets             11914 non-null  float64            
 8   #responses            11914 non-null  float64            
 9   display_name          11914 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(4), object(5)
memory usage: 1023.9+ KB


In [32]:
# Count tweets per day, author group, and predicted sentiment.
# An earlier draft also averaged #likes, #retweets, and #responses per group.
sent_agg = (
    sent[['datetime', 'By_or_at_Musk', 'Pred_sentiment_out']]
    .groupby([pd.Grouper(key='datetime', freq='D'), 'By_or_at_Musk', 'Pred_sentiment_out'])
    .size()
    .reset_index()
)
sent_agg.head()


|   | datetime | By_or_at_Musk | Pred_sentiment_out | 0 |
|---|---|---|---|---|
| 0 | 2022-09-01 00:00:00+00:00 | By @elonmusk | POSITIVE | 3 |
| 1 | 2022-09-01 00:00:00+00:00 | Mentions @elonmusk | NEGATIVE | 44 |
| 2 | 2022-09-01 00:00:00+00:00 | Mentions @elonmusk | POSITIVE | 23 |
| 3 | 2022-09-02 00:00:00+00:00 | Mentions @elonmusk | NEGATIVE | 25 |
| 4 | 2022-09-02 00:00:00+00:00 | Mentions @elonmusk | POSITIVE | 10 |
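Raw counts are hard to compare across groups of very different size, so one option is to turn them into a daily share of positive tweets per group. The sketch below rebuilds the five rows shown above, pivots them, and computes that share. The names `n_tweets`, `positive_share`, and `plot_positive_share` are illustrative assumptions, and the plotting helper is only one possible way to draw the result with `plotly.express`.

```python
import pandas as pd

# The five rows from sent_agg.head(), rebuilt by hand for a runnable example.
sent_agg = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2022-09-01", "2022-09-01", "2022-09-01", "2022-09-02", "2022-09-02"],
        utc=True,
    ),
    "By_or_at_Musk": ["By @elonmusk", "Mentions @elonmusk", "Mentions @elonmusk",
                      "Mentions @elonmusk", "Mentions @elonmusk"],
    "Pred_sentiment_out": ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
    0: [3, 44, 23, 25, 10],
})

# Give the count column a readable name (hypothetical label).
counts = sent_agg.rename(columns={0: "n_tweets"})

# One row per day and group, one column per sentiment label; missing combos become 0.
wide = counts.pivot_table(
    index=["datetime", "By_or_at_Musk"],
    columns="Pred_sentiment_out",
    values="n_tweets",
    aggfunc="sum",
    fill_value=0,
)
wide.columns.name = None
wide["positive_share"] = wide["POSITIVE"] / (wide["POSITIVE"] + wide["NEGATIVE"])
share = wide.reset_index()


def plot_positive_share(share_df):
    """Line chart of the daily positive share, one line per group."""
    import plotly.express as px  # imported lazily; the aggregation does not need it
    return px.line(share_df, x="datetime", y="positive_share", color="By_or_at_Musk")
```

Calling `plot_positive_share(share).show()` in the notebook would display the chart. On the full `sent_agg` the same steps apply unchanged.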


In [None]:
## sentiment over time, agg'd by day and before and after 

plotly.pl

<a id=conc></a>

## 7. Conclusions and model comparison table
    
[LINK to table of contents](#contents)