### Analysing the DistilBERT ED results

Steps:
* Read the `all_user_bios.csv` file (taking duplicate and multiline bios into account)
* Remove empty bios
* Remove all links
* Remove all user mentions
* Remove all hashtags
* Write each bio to the `words_with_emoji.txt` file
* Read each bio and run DistilBERT on it

See the `fileparser.user_bio_parser()` and `analyses.users_bio_ed_distilbert()` functions for the implementation.
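
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual `fileparser.user_bio_parser()` implementation: the regular expressions and the helper names `clean_bio` and `clean_bios` are assumptions made for the example.

```python
import re

# Hypothetical patterns; the real parser may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_bio(bio: str) -> str:
    """Strip links, user mentions and hashtags, then collapse whitespace."""
    bio = URL_RE.sub(" ", bio)
    bio = MENTION_RE.sub(" ", bio)
    bio = HASHTAG_RE.sub(" ", bio)
    # Collapsing whitespace also flattens multiline bios into one line.
    return WHITESPACE_RE.sub(" ", bio).strip()


def clean_bios(bios):
    """Clean every bio and drop the ones that end up empty."""
    cleaned = (clean_bio(b) for b in bios if isinstance(b, str))
    return [b for b in cleaned if b]


print(clean_bios(["Coffee lover ☕ @someone #mondays https://t.co/abc", "", "@only"]))
```

Emoji are left untouched here, which matches the intent suggested by the `words_with_emoji.txt` file name.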

In [1]:
import pandas as pd

In [2]:
# Read ids as strings so that long numeric ids are preserved exactly
dtypes = {'id': str, 'sadness': float, 'joy': float, 'love': float,
          'anger': float, 'fear': float, 'surprise': float, 'verdict': str}
users_bio_distilbert = pd.read_csv('users_bio_distilbert.csv', encoding='ISO-8859-1', dtype=dtypes)
users_bio_distilbert.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1741912 entries, 0 to 1741911
Data columns (total 8 columns):
 #   Column    Dtype  
---  ------    -----  
 0   id        object 
 1   sadness   float64
 2   joy       float64
 3   love      float64
 4   anger     float64
 5   fear      float64
 6   surprise  float64
 7   verdict   object 
dtypes: float64(6), object(2)
memory usage: 106.3+ MB


In [3]:
users_bio_distilbert.head()

,id,sadness,joy,love,anger,fear,surprise,verdict
0,413080213,0.041237,0.758531,0.005717,0.091492,0.098849,0.004173,joy
1,493832011,0.001672,0.989075,0.001292,0.005647,0.001596,0.000718,joy
2,2989319032,0.031329,0.175984,0.005536,0.732789,0.050099,0.004264,anger
3,1042385216,0.018754,0.298485,0.062315,0.574024,0.042194,0.004227,anger
4,490149888,0.003129,0.971793,0.002522,0.017236,0.004106,0.001214,joy
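
In the rows shown above, `verdict` is the emotion with the highest score. The source does not state how `verdict` is derived, so treating it as the arg-max of the six scores is an assumption. Under that assumption, a check could look like this sketch, which uses only the five rows printed above:

```python
import pandas as pd

EMOTIONS = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# The five rows shown by head(), copied from the output above.
sample = pd.DataFrame(
    [
        ['413080213', 0.041237, 0.758531, 0.005717, 0.091492, 0.098849, 0.004173, 'joy'],
        ['493832011', 0.001672, 0.989075, 0.001292, 0.005647, 0.001596, 0.000718, 'joy'],
        ['2989319032', 0.031329, 0.175984, 0.005536, 0.732789, 0.050099, 0.004264, 'anger'],
        ['1042385216', 0.018754, 0.298485, 0.062315, 0.574024, 0.042194, 0.004227, 'anger'],
        ['490149888', 0.003129, 0.971793, 0.002522, 0.017236, 0.004106, 0.001214, 'joy'],
    ],
    columns=['id'] + EMOTIONS + ['verdict'],
)

# Name of the highest-scoring emotion column for each row.
top_emotion = sample[EMOTIONS].idxmax(axis=1)

# Rows where the stored verdict disagrees with the arg-max (assumed rule).
mismatches = sample[top_emotion != sample['verdict']]
print(len(mismatches))
```

Running the same comparison on the full `users_bio_distilbert` frame would show whether the rule holds for every row.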


In [4]:
# Boolean mask: True for every row whose id already appeared in an earlier row
duplicated_ids = users_bio_distilbert.duplicated(subset='id', keep='first')
duplicated_ids.info()

<class 'pandas.core.series.Series'>
RangeIndex: 1741912 entries, 0 to 1741911
Series name: None
Non-Null Count    Dtype
--------------    -----
1741912 non-null  bool 
dtypes: bool(1)
memory usage: 1.7 MB
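
Calling `.info()` on the mask reports only its length and dtype, not how many ids are repeated. A minimal sketch of how the duplicates could be counted and inspected follows. The `demo` frame is made-up data for illustration, not taken from the real file.

```python
import pandas as pd

# Made-up example data; ids '2' and '1' each appear twice.
demo = pd.DataFrame({
    'id': ['1', '2', '2', '3', '1'],
    'verdict': ['joy', 'anger', 'anger', 'joy', 'fear'],
})

dup_mask = demo.duplicated(subset='id', keep='first')

# True counts as 1, so summing the mask gives the number of duplicate rows.
n_duplicates = int(dup_mask.sum())

# The duplicate rows themselves, and the frame with duplicates removed.
duplicate_rows = demo[dup_mask]
deduplicated = demo.drop_duplicates(subset='id', keep='first')

print(n_duplicates, len(deduplicated))
```

On the real data the equivalent call would be `duplicated_ids.sum()`.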


In [5]:
grp = users_bio_distilbert.groupby('verdict')['id'].count()
grp.head()  # head() shows only the first 5 of the 6 verdict classes

verdict
anger      529758
fear        86905
joy        991504
love        60409
sadness     63194
Name: id, dtype: int64

In [10]:
grp = pd.DataFrame(grp)
# Share of all rows per verdict, in percent
grp['percent'] = grp['id'] * 100 / len(users_bio_distilbert)
grp

verdict,id,percent
anger,529758,30.412443
fear,86905,4.989058
joy,991504,56.920441
love,60409,3.467971
sadness,63194,3.627853
surprise,10142,0.582234
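
The six counts add up to the 1,741,912 rows reported by `info()`, so every row carries a verdict and the percentages sum to 100. The sketch below rebuilds the table from the counts printed above. The names `counts` and `summary` are introduced for the example only.

```python
import pandas as pd

# Counts copied from the table above.
counts = pd.Series(
    {'anger': 529758, 'fear': 86905, 'joy': 991504,
     'love': 60409, 'sadness': 63194, 'surprise': 10142},
    name='id',
)

summary = counts.to_frame()
# Dividing by the total of the counts matches dividing by the row count here,
# because the counts cover every row.
summary['percent'] = summary['id'] * 100 / summary['id'].sum()

print(summary)
print(summary['id'].sum())
```

On the full frame, `users_bio_distilbert['verdict'].value_counts(normalize=True) * 100` should give the same percentages in a single call, sorted by frequency.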


In [6]:
users_bio_distilbert.loc[0, ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']].sum()

1.0000000461004672

In [7]:
# Row-wise sum of the six emotion scores (vectorised instead of a Python loop over every row)
sum_probs = users_bio_distilbert[['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']].sum(axis=1)


In [8]:
users_bio_distilbert['sums']=sum_probs

In [9]:
users_bio_distilbert

,id,sadness,joy,love,anger,fear,surprise,verdict,sums
0,413080213,0.041237,0.758531,0.005717,0.091492,0.098849,0.004173,joy,1.0
1,493832011,0.001672,0.989075,0.001292,0.005647,0.001596,0.000718,joy,1.0
2,2989319032,0.031329,0.175984,0.005536,0.732789,0.050099,0.004264,anger,1.0
3,1042385216,0.018754,0.298485,0.062315,0.574024,0.042194,0.004227,anger,1.0
4,490149888,0.003129,0.971793,0.002522,0.017236,0.004106,0.001214,joy,1.0
...,...,...,...,...,...,...,...,...,...
1741907,889910735073050627,0.001344,0.000766,0.000211,0.995538,0.001940,0.000201,anger,1.0
1741908,836012302910414848,0.000564,0.972632,0.025366,0.000485,0.000307,0.000645,joy,1.0
1741909,1307971887184654337,0.022885,0.505443,0.005269,0.423803,0.040328,0.002273,joy,1.0
1741910,1250435179,0.005195,0.980053,0.002286,0.007358,0.004211,0.000896,joy,1.0


In [13]:
# Sanity check: number of rows whose scores sum to noticeably more than 1
len(users_bio_distilbert[users_bio_distilbert.sums > 1.1])

0
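
The check above only catches sums that are too large. A two-sided variant with a tolerance would also flag rows whose scores sum to less than 1, while still accepting small floating-point deviations such as the `1.0000000461004672` seen earlier. The sketch below is one possible way to do that. The helper name `rows_not_summing_to_one`, the tolerance of `1e-3`, and the second demo row are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EMOTIONS = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


def rows_not_summing_to_one(df, atol=1e-3):
    """Return the rows whose emotion scores do not sum to 1 within atol."""
    sums = df[EMOTIONS].sum(axis=1)
    return df[~np.isclose(sums, 1.0, atol=atol)]


demo = pd.DataFrame(
    [
        # First row of the real output shown earlier.
        ['413080213', 0.041237, 0.758531, 0.005717, 0.091492, 0.098849, 0.004173],
        # Made-up row whose scores sum to 0.8, to show what gets flagged.
        ['hypothetical', 0.3, 0.1, 0.1, 0.1, 0.1, 0.1],
    ],
    columns=['id'] + EMOTIONS,
)

bad_rows = rows_not_summing_to_one(demo)
print(bad_rows['id'].tolist())
```

A loose tolerance is used because the scores in the CSV are rounded; a tighter one may be appropriate if the file stores full precision.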