# Hashtag text enrichment - explained

# 0) Imports

In [1]:
import pandas as pd
import os
from text_enrichment import *

# Parent of the current working directory; the data/ folder is expected there.
wd = os.path.dirname(os.getcwd())

random_state = 69

In [2]:
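cols = ['iro', 'text']

df1 = pd.read_csv(wd + "/data/training_set_sentipolc16.csv", sep=",")
# The gold test file is read with the training set's column names.
df2 = pd.read_csv(wd + "/data/test_set_sentipolc16_gold2000.csv", sep=",", names=list(df1.columns))

# Stack training and test tweets, keeping only the irony label and the text.
df1 = pd.concat([df1, df2])[cols]
df = df1.copy()
df.head(2)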

Unnamed: 0,iro,text
0,0,Intanto la partita per Via Nazionale si compli...
1,0,"False illusioni, sgradevoli realtà Mario Monti..."


# 1) Collecting information

## 1.1) Selecting features
Here we decide what counts as a feature. In our case:
- hashtags
- laughs
- links (if enabled)

The following code adds a new column (`hastag`) containing the features found in each tweet.
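
As a rough illustration of this step, the sketch below extracts hashtags and flags laughs with regular expressions. It is not the actual `collect_info` implementation: the function name `extract_features`, both patterns, and the `risata` marker for laughs are assumptions made for the example.

```python
import re

# Hypothetical patterns; the real collect_info may use different rules.
HASHTAG_RE = re.compile(r"#(\w+)")
LAUGH_RE = re.compile(r"\b(?:ah|ha|eh|he){2,}h?\b", re.IGNORECASE)


def extract_features(text):
    """Return the hashtags in `text`, plus a 'risata' marker if a laugh is found."""
    features = HASHTAG_RE.findall(text)  # hashtag names without the leading '#'
    if LAUGH_RE.search(text):
        features.append("risata")  # assumed placeholder for laughs
    return features


print(extract_features("Mario #Monti ahahah #finanza"))
```

Links are left out here, mirroring `link_as_hastag = False` in the cell that follows.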

In [3]:
COL_NAME = 'hastag'
placeholder_list = []

df[COL_NAME] = collect_info(
    df,
    link_as_hastag=False,  # too many links; this cannot work well
    placeholder_list=placeholder_list,
)
df.head()

Unnamed: 0,iro,text,hastag
0,0,Intanto la partita per Via Nazionale si compli...,"[Saccomanni, Monti]"
1,0,"False illusioni, sgradevoli realtà Mario Monti...",[]
2,0,"False illusioni, sgradevoli realtà #editoriale...","[editoriale, rassegna]"
3,0,Mario Monti: Berlusconi risparmi all'Italia il...,[mariomontipremier]
4,0,Mario Monti: Berlusconi risparmi all'Italia il...,[]


We remove all tweets that do not contain any feature.<br>
This algorithm is meant to work only on tweets that contain at least one feature.
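
A filter of this kind could look like the sketch below. The name `drop_rows_without_features` is hypothetical, and we only assume that the feature column holds Python lists; the real `remove_empty_list` may differ.

```python
import pandas as pd


def drop_rows_without_features(df, col_name):
    """Keep only the rows whose list in `col_name` has at least one element."""
    mask = df[col_name].map(len) > 0
    # The original index is kept, so rows can still be traced back to the source data.
    return df.loc[mask].copy()


toy = pd.DataFrame({"text": ["a", "b", "c"], "hastag": [["x"], [], ["y", "z"]]})
print(drop_rows_without_features(toy, "hastag"))
```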

In [4]:
df = remove_empty_list(df, col_name=COL_NAME)

## 1.2) Occurrence dictionary
The following code counts how many times each feature appears in ironic and non-ironic tweets.
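
To make the counting concrete, here is a minimal sketch built on `collections.Counter`. The function name `count_feature_occurrences` is hypothetical. Lowercasing the features matches the keys shown in the output, but counting every listed occurrence (rather than once per tweet) is an assumption.

```python
from collections import Counter


def count_feature_occurrences(labels, feature_lists):
    """Return {label: Counter(feature -> count)}, with features lowercased."""
    counts = {0: Counter(), 1: Counter()}
    for label, features in zip(labels, feature_lists):
        counts[label].update(f.lower() for f in features)
    return counts


counts = count_feature_occurrences(
    [0, 1, 0],
    [["Monti", "crisi"], ["monti"], ["Monti"]],
)
print(counts[0]["monti"], counts[1]["monti"])
```

A `Counter` returns 0 for a feature it has never seen, which avoids a `KeyError` when a feature appears under only one of the two labels.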

In [5]:
dizionario_occorrenze = create_occurrences_dict(df)
dizionario_occorrenze

{0: {'saccomanni': 1,
  'monti': 665,
  'editoriale': 1,
  'rassegna': 1,
  'mariomontipremier': 1,
  'ottoemezzo': 9,
  'italiaresiste': 4,
  'sarkozy': 1,
  'ballarò': 15,
  'g20': 1,
  'consultazioni': 1,
  'acasa': 4,
  'ciampi': 1,
  'berlusconi': 53,
  'repubblica': 3,
  'elezionisubito': 4,
  'incominciodasel': 1,
  'dimissioni': 5,
  'aeiouy': 2,
  'opencamera': 18,
  'finanza': 6,
  'laresadeiconti': 8,
  'revolution': 1,
  'fb': 10,
  'napolitano': 33,
  'elezioni': 12,
  '308': 10,
  'malimortaccivostraedichinonloritwitta': 1,
  'silvioforever': 1,
  'la7': 9,
  'sapevatelo': 27,
  'fullmonti': 17,
  'stamoagioca': 1,
  'udc': 5,
  'rialzatiitalia': 1,
  'ditelavostra': 10,
  'tgmattina': 4,
  'mario': 38,
  'governo': 194,
  'fatepresto': 13,
  'comedire': 1,
  'anagr': 2,
  'noelezioni': 1,
  'nuovaleggeelettorale': 1,
  'matteorenzi': 5,
  'sondaggio': 4,
  'paese': 2,
  'crisi': 19,
  'colle': 3,
  'lacrisiancoradevearrivare': 1,
  'mariomonti': 6,
  'economia': 5,
  'sp

## 1.3) Probabilities
We evaluate
$$
    P(\text{iro}=1|\text{feature}) =
    \frac{|\text{tweets\_iro\_with\_feature}|}
    {|\text{tweets\_iro\_with\_feature}| + |\text{tweets\_non\_iro\_with\_feature}|}
$$

To avoid working with very low probabilities, we evaluate them only for features that satisfy:
1.   $$
        P(\text{feature}|\text{iro}=1) >\text{p\_thr}
    $$
2.   $$
        |\text{features\_i}| > \text{max\_n\_feature}
    $$

where, in our implementation:
- p_thr = 0.01
- max_n_feature = 15
- |.| denotes cardinality

NB.
- (1) avoids considering features with too few occurrences (e.g. we do not want to consider a feature that appears only once among ironic tweets)
- (2) avoids having too many features. With our dataset this never happens, as it is too small

Clearly, this method does not take co-occurrences into account. In any case, our dataset is too small to consider them.
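
The two filters and the final probability can be sketched as follows. This is only one possible reading, not the actual `calcola_prob_hastag_dato_iro`: we assume that P(feature|iro=1) is normalised by the total number of feature occurrences in ironic tweets, and that condition (2) acts as a cap on how many features are kept. The function names are hypothetical.

```python
def likelihood_given_ironic(counts, p_thr=0.01, max_n_feature=15):
    """P(feature | iro=1) for features above p_thr, keeping at most max_n_feature."""
    total = sum(counts[1].values())  # assumed normaliser: all ironic occurrences
    probs = {f: c / total for f, c in counts[1].items() if c / total > p_thr}
    # Assumed reading of (2): keep only the most frequent features.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:max_n_feature]
    return dict(top)


def posterior_ironic(counts, features):
    """P(iro=1 | feature) = ironic count / (non-ironic count + ironic count)."""
    return {
        f: counts[1][f] / (counts[0].get(f, 0) + counts[1][f])
        for f in features
    }


toy_counts = {0: {"monti": 3, "crisi": 5}, 1: {"monti": 1, "renzi": 1}}
selected = likelihood_given_ironic(toy_counts)
print(posterior_ironic(toy_counts, selected))
```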

In [6]:
iro = 1
p_feature_iro = calcola_prob_hastag_dato_iro(dizionario_occorrenze, iro=iro)        # eg. P(grillo|iro=1)

In [7]:
# P(iro=1 | feature) = ironic count / (non-ironic count + ironic count)
p_iro_hastg = {feature:
                   (dizionario_occorrenze[1][feature])/
                   (dizionario_occorrenze[0][feature] + dizionario_occorrenze[1][feature])
                   for feature in p_feature_iro.keys()}

In [8]:
p_iro_hastg

{'monti': 0.18803418803418803,
 'labuonascuola': 0.11377245508982035,
 'grillo': 0.048514851485148516,
 'governo': 0.11009174311926606,
 'renzi': 0.2564102564102564,
 'risata': 0.15730337078651685,
 'manovra': 0.21311475409836064,
 'serviziopubblico': 0.09090909090909091}

# 2) Hashtag enrichment
The new feature is `prob`. The idea is to feed it into the classifier part of the model, with the intended effect of increasing the classifier's mean activations.

In [9]:
df.loc[:,'prob'] = df['text'].apply(lambda x : get_prob_from_sentence(x, p_iro_hastg))
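
For intuition, a lookup of this kind might work as in the sketch below. It is not the actual `get_prob_from_sentence`: the name `prob_from_sentence`, matching on hashtags only, taking the maximum when several known features occur, and returning 0.0 when none occurs are all assumptions.

```python
import re


def prob_from_sentence(text, p_iro_given_feature):
    """Return the largest P(iro=1 | feature) among hashtags in `text`, else 0.0."""
    hashtags = {h.lower() for h in re.findall(r"#(\w+)", text)}
    probs = [p for f, p in p_iro_given_feature.items() if f in hashtags]
    # Assumed aggregation: take the maximum; 0.0 means no known feature was found.
    return max(probs, default=0.0)


toy_probs = {"monti": 0.188, "renzi": 0.256}
print(prob_from_sentence("Mario #Monti: La lira non era una moneta", toy_probs))
```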

# 3) Examples

In [10]:
df[df['prob']>0].head()

Unnamed: 0,iro,text,hastag,prob
0,0,Intanto la partita per Via Nazionale si compli...,"[Saccomanni, Monti]",0.188034
19,0,Mario #Monti: La lira non era una moneta stran...,[Monti],0.188034
58,0,Perchè Silvio deve lasciare il posto a Mario #...,"[Monti, dimissioni, opencamera, acasa, aeiouy]",0.188034
60,0,Editoriale di Mario #Monti LINK #finanza,"[Monti, finanza]",0.188034
75,0,"@PietroSalvatori @daw_blog verso l'implosione,...",[Monti],0.188034


Our dataset is heavily imbalanced toward non-ironic tweets. As a result, the model will be biased, with high precision for the majority class but poor recall for the minority class.
To counter this, we did not consider
$$
    P(\text{iro}=0|\text{feature})
$$
The effect of this enrichment is to increase the recall of ironic tweets, potentially at the cost of overall precision.


If we wanted to use Bayes' theorem instead:

In [11]:
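n_tweet = len(df)
# Total feature occurrences in ironic tweets, divided by the number of tweets
p_iro = sum(dizionario_occorrenze[1].values())/n_tweet

# P(feature): occurrences of each feature (both classes), divided by the number of tweets
p_tweet = {tweet:(dizionario_occorrenze[0][tweet] + dizionario_occorrenze[1][tweet])/n_tweet for tweet in p_feature_iro.keys()}

# Bayes: P(iro=1|feature) = P(feature|iro=1) * P(iro=1) / P(feature)
p_iro_hastg_2 = {tweet: (p_feature_iro[tweet] * p_iro)/p_tweet[tweet] for tweet in p_feature_iro.keys()}
p_iro_hastg_2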

{'monti': 0.18803418803418806,
 'labuonascuola': 0.11377245508982035,
 'grillo': 0.048514851485148516,
 'governo': 0.11009174311926608,
 'renzi': 0.25641025641025644,
 'risata': 0.15730337078651688,
 'manovra': 0.2131147540983607,
 'serviziopubblico': 0.09090909090909093}
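
A short derivation suggests why the two results coincide. It rests on an assumption not verified here: that `calcola_prob_hastag_dato_iro` returns $c_1(f)/S_1$, where $c_0(f)$ and $c_1(f)$ are the counts of feature $f$ in `dizionario_occorrenze[0]` and `dizionario_occorrenze[1]`, $S_1$ is the sum of all counts in `dizionario_occorrenze[1]`, and $n$ is `n_tweet`. Under that assumption the normalising terms cancel:

$$
    \frac{P(f|\text{iro}=1)\,P(\text{iro}=1)}{P(f)} =
    \frac{\frac{c_1(f)}{S_1}\cdot\frac{S_1}{n}}{\frac{c_0(f)+c_1(f)}{n}} =
    \frac{c_1(f)}{c_0(f)+c_1(f)}
$$

which is the direct ratio computed earlier. The small differences in the last digits of the two outputs are consistent with floating-point rounding.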