<a href="https://colab.research.google.com/github/Huertas97/Sentiment_Analysis/blob/main/tass_models/TASS_data_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook we parse and merge all the data from the Taller de Análisis Semántico en la SEPLN (TASS). The TASS dataset used here is a compilation of tweets from the TASS competitions held from 2012 to 2019, with a total of 53k tweets. It covers various topics (TV, politics, sports) and several Spanish-speaking countries (Spain, Costa Rica, Uruguay, Mexico and Peru).

# Tweet parsing

In [None]:
!pip install -q emoji
!pip install -U -q emot 
!pip install mtranslate
!pip install -U -q Unidecode
!pip install -U -q tweet-preprocessor


## Emoji parsing

The data is preprocessed and cleaned. Firstly, emojis not related to emotions or feelings are deleted from the text, but emojis related to emotions are converted into text. Secondly, URLs and tweet mentions are removed.
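
The URL and mention removal described above can be approximated with two regular expressions. This is only a minimal sketch: the patterns and the helper name `strip_urls_and_mentions` are assumptions made for illustration, and the notebook itself delegates this step to the `tweet-preprocessor` package, whose rules are more thorough.

```python
import re

# Hypothetical patterns for illustration; not the rules used by tweet-preprocessor
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def strip_urls_and_mentions(text: str) -> str:
    """Remove URLs and @mentions, then collapse repeated whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()

print(strip_urls_and_mentions("@user mira esto https://t.co/abc123 ahora"))
```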

In [None]:
import emot
import emoji
import re
from mtranslate import translate
import unidecode
import preprocessor as p
p.set_options(p.OPT.URL, p.OPT.MENTION)

In [None]:

# Update the EMOTICONS dictionary: pad emoticons that start or end with a letter (";D" --> " ;D ")
# Iterate over a copy of the keys because the dictionary is modified inside the loop
for k in list(emot.EMOTICONS.keys()):
  if re.search("^[A-Za-z]|[A-Za-z]$", k):
    new_k = " "+k+" "
    emot.EMOTICONS[new_k] = emot.EMOTICONS.pop(k)

def decode_emot_emoji(string, translation = False):
  """
  Converts the emoticons and emojis in a sentence into their textual
  meaning, optionally translated into Spanish.
  The "emot" package is used to detect emoticons and emojis.

  It works as follows. 1) Spaces are added at the beginning and end of the
  sentence to ease the conversion of emoticons in those positions. 2) The
  heart emoticon "<3", which is missing from the emoticon dictionaries,
  is converted. 3) If emoticons are detected, we loop by index over the
  dictionary that stores the value and meaning of each emoticon
  (value 1 of emoticon 1, and so on).
  4) The same is done for emojis.
  
  Params:
  -------
    string: str
        Sentence to decode
    translation: str or bool
        Pass "es" to translate the meanings into Spanish

  Returns:
  --------
    string: str
      Decoded sentence
  """
  valid_emojis = ["face with steam from nose",
                  "middle finger"
               ] 

  # Delete URLs and mentions, and drop the "#" from hashtags
  string = p.clean(string).replace("#","")


  # Add spaces in case there is an emoticon at the start or end, e.g. "blablabla ;D."
  string = " "+string+" "

  # ----------------- Heart emoticon ------------------
  # Handle the heart emoticon explicitly, since it is very common
  if " <3 " in string:
    if translation == "es":
      string = string.replace("<3", " corazón ")
    else:
      string = string.replace("<3", " heart ")
  if " </3 " in string:
    if translation == "es":
      string = string.replace("</3", " corazón roto ")
    else:
      string = string.replace("</3", " broken heart ")
  
  # -------------- EMOTICONS --------------
  info_emoticons = emot.emoticons(string)
  
  # Check whether emoticons were detected. The result can come back in two
  # shapes (a dict, or a list wrapping a dict), hence the try/except
  try:
    search = info_emoticons["flag"]
  except (TypeError, KeyError):
    search = info_emoticons[0]["flag"]

  if search:
    n = len(info_emoticons["value"])
    for i in range(n):
      value = info_emoticons["value"][i]
      mean = info_emoticons["mean"][i].lower()
      if translation == "es":
        mean = translate(mean, to_language = "es")             # Translation
      string = string.replace(value, " "+mean+" ")
  
  # -------------- EMOJIS --------------
  info_emoji = emot.emoji(string)
  try:
    search = info_emoji["flag"]
  except (TypeError, KeyError):
    search = info_emoji[0]["flag"]

  if search:
    n = len(info_emoji["value"])
    for i in range(n):
      value = info_emoji["value"][i]
      mean = info_emoji["mean"][i]
      mean = mean.replace("_", " ").strip(":") # Format :green_heart: --> green heart
      if mean in valid_emojis or any( e in mean.split() for e in ["heart", 
                                                          "face",
                                                          "cat",
                                                          "dna",
                                                          "microscope",
                                                          "pill",
                                                          "blood", 
                                                          "syringe",
                                                          "soap",
                                                          "no",
                                                          "radioactive",
                                                          "biohazard",
                                                          "warning",
                                                          "prohibited"
                                                          ]): 


        if translation == "es":
          mean = translate(mean, to_language = "es")    # Translation
        string = string.replace(value, " "+mean+" ")
      
      else: 
        string = string.replace(value, "")
        
      
      
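    # Collapse repeated whitespace
    string  = re.sub(r"\s{2,}", " ", string)
    # Transliterate to ASCII and drop the "[?]" placeholder that unidecode
    # may emit for characters it cannot map
    string = unidecode.unidecode(string).replace("[?]", "")

    return string.strip() 

  else: 
    # No emojis found: collapse whitespace and clean again (accents are kept here)
    string  = re.sub(r"\s{2,}", " ", string)
    string = p.clean(string)
    return string.strip()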
    

In [None]:
strings = [
          "I love Python 🐍 <3 :) https://unicode.org/emoji/charts/full-emoji-list.html",
          "Odio las películas de acción 😤",
          "Hoy es San Valentín 👩‍❤️‍💋‍👩!!",
          "Prefiero las pizza 🍕🌭😻 a la lechuga 🥦🤢",
           "No me piengo vacunar 💉", 
           "las farmaceúticas 💊 son una estafa 🚫",
           "Hay que arreglar a estos políticos 🔧",
           "Lo que nadie queire que sepas 🔒🔒🔒🔒",
           "Nos van a encerrar como becerros 😱 https://github.com/s/preprocessor",
           "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl",
           "Las mejores fotos son las que no se ven ✨🌟"

          
          ]

[decode_emot_emoji(s, translation = "es") for s in strings]

['I love Python corazon cara feliz o sonriente',
 'Odio las peliculas de accion cara con vapor de la nariz',
 'Hoy es San Valentin  corazon rojo !!',
 'Prefiero las pizza cara de gato sonriente con ojos de corazon a la lechuga  cara con nauseas',
 'No me piengo vacunar jeringuilla',
 'las farmaceuticas pildora son una estafa prohibido',
 'Hay que arreglar a estos politicos',
 'Lo que nadie queire que sepas',
 'Nos van a encerrar como becerros cara gritando de miedo',
 'raw text data usually has lots of residue.',
 'Las mejores fotos son las que no se ven']
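
The outputs above show that some emojis are turned into text while others (pizza, wrench, lock) are dropped. Inside `decode_emot_emoji` this is decided by checking whether any word of the emoji name belongs to a keyword list. The sketch below isolates that check. The helper name `emoji_name_to_text` is hypothetical, and `KEEP_KEYWORDS` is only a subset of the list used in the function.

```python
# Subset of the keyword list used in decode_emot_emoji (illustrative only)
KEEP_KEYWORDS = {"heart", "face", "cat", "syringe", "pill", "prohibited"}

def emoji_name_to_text(name: str):
    """Return the readable emoji name if any of its words is a kept keyword, else None."""
    readable = name.replace("_", " ").strip(":")   # ":green_heart:" -> "green heart"
    if any(word in KEEP_KEYWORDS for word in readable.split()):
        return readable
    return None

for name in [":green_heart:", ":nauseated_face:", ":pizza:", ":wrench:"]:
    print(name, "->", emoji_name_to_text(name))
```

Matching on whole words rather than substrings avoids keeping names that merely contain a keyword as part of a longer word.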

# TASS data collecting

We merge all the data (test or train), remove the duplicates, and split the data into three parts ourselves.
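
The three-way split is not performed in this notebook. As a rough sketch of what such a split could look like, the snippet below shuffles the items with a fixed seed and cuts them into train, development and test parts. The function name `three_way_split`, the 80/10/10 proportions and the seed are assumptions chosen for illustration.

```python
import random

def three_way_split(items, dev_pct=10, test_pct=10, seed=42):
    """Shuffle items and split them into (train, dev, test) by integer percentages."""
    items = list(items)
    random.Random(seed).shuffle(items)      # fixed seed for reproducibility
    n_dev = len(items) * dev_pct // 100
    n_test = len(items) * test_pct // 100
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test

train, dev, test = three_way_split(range(100))
print(len(train), len(dev), len(test))
```

Deduplicating before splitting, as stated above, prevents the same tweet from landing in more than one part.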

In [None]:
import pandas as pd
pd.set_option('max_colwidth',1000)
# "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/general-train-tagged-3l.xml"

In [None]:
import os
from tqdm.auto import tqdm
os.listdir("/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/")

['general-test-tagged-3l.xml',
 'politics-test-tagged.xml',
 'socialtv-test-tagged.xml',
 'intertass-ES-development-tagged.xml',
 'intertass-CR-development-tagged.xml',
 'intertass-PE-development-tagged.xml',
 'stompol-test-tagged.xml']

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "general-train-tagged-3l.xml"

try:
    df = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    df = pd.DataFrame(columns=('content', 'polarity', 'agreement'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [tweet.content.text, tweet.sentiments.polarity.value.text, tweet.sentiments.polarity.type.text]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
    df.to_csv(f.replace(".xml", '.csv'), index=False, encoding='utf-8')

[Progress bar "Tweets": 7219 items]

In [None]:
sum(pd.read_csv("/content/general-train-tagged-3l.csv").content == "P")

0
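
The parsing cells in this notebook grow the DataFrame one row at a time with `DataFrame.append`, which is slow for large files and was removed in pandas 2.0. A common alternative is to collect plain dictionaries in a list and build the frame once at the end. The sketch below shows the collection step with the standard-library `xml.etree.ElementTree`. The sample XML is hypothetical: its tag names are inferred from the attribute accesses in the cell above (`content`, `sentiments.polarity.value`, `sentiments.polarity.type`) and may not match the real TASS files exactly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the layout is assumed from the accessors used in the notebook
SAMPLE = """<tweets>
  <tweet>
    <content>Me encanta</content>
    <sentiments><polarity><value>P</value><type>AGREEMENT</type></polarity></sentiments>
  </tweet>
  <tweet>
    <content>Que desastre</content>
    <sentiments><polarity><value>N</value><type>DISAGREEMENT</type></polarity></sentiments>
  </tweet>
</tweets>"""

def parse_tweets(xml_text):
    """Return one dict per <tweet> with its content, polarity and agreement."""
    root = ET.fromstring(xml_text)
    rows = []
    for tweet in root.iter("tweet"):
        rows.append({
            "content": tweet.findtext("content"),
            "polarity": tweet.findtext("sentiments/polarity/value"),
            "agreement": tweet.findtext("sentiments/polarity/type"),
        })
    return rows

rows = parse_tweets(SAMPLE)
print(rows)
```

With pandas available, `pd.DataFrame(rows)` would then build the frame in a single call.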

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/"
f = "general-test-tagged-3l.xml"

try:
    df = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    df = pd.DataFrame(columns=('content', 'polarity', 'agreement'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [tweet.content.text, tweet.sentiments.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
    df.to_csv(f.replace(".xml", '.csv'), index=False, encoding='utf-8')

[Progress bar "Tweets": 60798 items]

In [None]:
sum(pd.read_csv("/content/general-test-tagged-3l.csv").content == "NEU")

0

# Stompol

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "stompol-train-tagged.xml"

try:
    df = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    df = pd.DataFrame(columns=('content', 'polarity', "agreement"))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
    df.to_csv(f.replace(".xml", '.csv'), index=False, encoding='utf-8')

[Progress bar "Tweets": 784 items]

In [None]:
sum(pd.read_csv("/content/stompol-train-tagged.csv").content == "P")

0

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/"
f = "stompol-test-tagged.xml"




from lxml import objectify
xml = objectify.parse(open('/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/stompol-test-tagged.xml'))
#sample tweet object
root = xml.getroot()
stompol_tweets_corpus_test = pd.DataFrame(columns=('content', 'polarity'))
tweets = root.getchildren()
for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
    tweet = tweets[i]
    row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
    row_s = pd.Series(row)
    row_s.name = i
    stompol_tweets_corpus_test = stompol_tweets_corpus_test.append(row_s)
stompol_tweets_corpus_test.to_csv('stompol-tweets-test-tagged.csv', index=False, encoding='utf-8')

In [None]:
sum(stompol_tweets_corpus_test.content == "NEU")

0

# Social TV

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "socialtv-train-tagged.xml"

try:
    social_tweets_corpus_train = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    social_tweets_corpus_train = pd.DataFrame(columns=('content', 'polarity', "agreement"))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
        row_s = pd.Series(row)
        row_s.name = i
        social_tweets_corpus_train = social_tweets_corpus_train.append(row_s)
    social_tweets_corpus_train.to_csv('socialtv-tweets-train-tagged.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 1773 items]

In [None]:
sum(social_tweets_corpus_train.content == "NEU")

0

In [None]:
try:
    social_tweets_corpus_test = pd.read_csv("/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/socialtv-test-tagged.xml", encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open("/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/socialtv-test-tagged.xml"))
    #sample tweet object
    root = xml.getroot()
    social_tweets_corpus_test = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
        row_s = pd.Series(row)
        row_s.name = i
        social_tweets_corpus_test = social_tweets_corpus_test.append(row_s)
    
social_tweets_corpus_test.to_csv('socialtv-tweets-test-tagged.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 1000 items]

# Politics

In [None]:
try:
    politic_tweets_corpus_test = pd.read_csv("/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/politics-test-tagged.xml", encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open("/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/politics-test-tagged.xml"))
    #sample tweet object
    root = xml.getroot()
    politic_tweets_corpus_test = pd.DataFrame(columns=('content', 'polarity', "agreement"))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        # Convert the lxml objects to plain strings before storing them
        row = dict(zip(['content', 'polarity', 'agreement'], [str(tweet.content), str(tweet.sentiments.polarity.value), 
                                                              str(tweet.sentiments.polarity.type)]))
        politic_tweets_corpus_test = politic_tweets_corpus_test.append(row, ignore_index=True)
    
politic_tweets_corpus_test.to_csv('politics-test-tagged.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 2500 items]

In [None]:
sum(politic_tweets_corpus_test.content == "N")

0

# Spain

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "TASS2019_country_ES_train.xml"


try:
    TASS_2019_spain = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    TASS_2019_spain = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        TASS_2019_spain = TASS_2019_spain.append(row_s)
    TASS_2019_spain.to_csv('TASS_2019_spain.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 1125 items]

In [None]:
sum( TASS_2019_spain.content == "NEU")

0

# México

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "TASS2019_country_MX_train.xml"


try:
    TASS_2019_mx = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    TASS_2019_mx = pd.DataFrame(columns=('content', 'polarity', "agreement"))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        TASS_2019_mx =TASS_2019_mx.append(row_s)
    TASS_2019_mx.to_csv('TASS_2019_mx.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 989 items]

# Uruguay

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "TASS2019_country_UY_train.xml"


try:
    TASS_2019_uy = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    TASS_2019_uy = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        TASS_2019_uy = TASS_2019_uy.append(row_s)
    TASS_2019_uy.to_csv('TASS_2019_uy.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 943 items]

# Perú

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "TASS2019_country_PE_train.xml"


try:
    TASS_2019_pe = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    TASS_2019_pe = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        TASS_2019_pe = TASS_2019_pe.append(row_s)
    TASS_2019_pe.to_csv('TASS_2019_pe.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 966 items]

# InterTASS ES

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "intertass-ES-train-tagged.xml"


try:
    intertass_spain = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    intertass_spain = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_spain = intertass_spain.append(row_s)
    intertass_spain.to_csv('intertass_spain.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 1008 items]

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/"
f = "intertass-ES-development-tagged.xml"


try:
    intertass_ES_development = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    intertass_ES_development = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_ES_development = intertass_ES_development.append(row_s)
    intertass_ES_development.to_csv('intertass_ES_development.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 506 items]

# InterTASS CR

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "intertass-CR-train-tagged.xml"


try:
    intertass_cr = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    intertass_cr = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_cr = intertass_cr.append(row_s)
    intertass_cr.to_csv('intertass_cr.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 800 items]

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/"
f = "intertass-CR-development-tagged.xml"


try:
    intertass_CR_development = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    intertass_CR_development = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_CR_development = intertass_CR_development.append(row_s)
    intertass_CR_development.to_csv('intertass_CR_development.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 300 items]

# InterTASS PE

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/train/"
f = "intertass-PE-train-tagged.xml"


try:
    intertass_pe = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    intertass_pe = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_pe = intertass_pe.append(row_s)
    intertass_pe.to_csv('intertass_pe.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 1000 items]

In [None]:
path = "/content/drive/MyDrive/Datos_fake_news/Sentiment_Analysis/Data/test/"
f = "intertass-PE-development-tagged.xml"


try:
    intertass_PE_development = pd.read_csv(os.path.join(path, f), encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open(os.path.join(path, f)))
    #sample tweet object
    root = xml.getroot()
    intertass_PE_development = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in tqdm(range(0,len(tweets)), desc = "Tweets"):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity'], [tweet.content, tweet.sentiment.polarity.value]))
        row_s = pd.Series(row)
        row_s.name = i
        intertass_PE_development = intertass_PE_development.append(row_s)
    intertass_PE_development.to_csv('intertass_PE_development.csv', index=False, encoding='utf-8')

[Progress bar "Tweets": 500 items]

# Merge all data

In [None]:
df_list = []
for f in os.listdir():
  if f.endswith(".csv"):
    print(f)
    df_list.append(pd.read_csv(f))

tweets_corpus = pd.concat(df_list)

stompol-tweets-test-tagged.csv
intertass_CR_development.csv
intertass_pe.csv
intertass_ES_development.csv
intertass_PE_development.csv
socialtv-tweets-train-tagged.csv
socialtv-tweets-test-tagged.csv
politics-test-tagged.csv
general-test-tagged-3l.csv
TASS_2019_spain.csv
TASS_2019_mx.csv
TASS_2019_uy.csv
intertass_cr.csv
stompol-train-tagged.csv
TASS_2019_pe.csv
general-train-tagged-3l.csv
intertass_spain.csv


Only tweets whose sentiment label has annotator agreement are kept: after removing duplicates, tweets marked `DISAGREEMENT` and tweets with polarity `NONE` are dropped.

In [None]:
# Remove duplicates
tweets_corpus = tweets_corpus.drop_duplicates(subset=['content'], keep = "first", 
                                            ignore_index= True)
tweets_corpus = tweets_corpus.query('agreement != "DISAGREEMENT" and polarity != "NONE"')
tweets_corpus

Unnamed: 0,content,polarity,agreement
0,"Mapa con el ""batiburrillo"" d partidos que ha absorbido @CiudadanosCs frente a @UPyD que se mantiene #Libres @mpalcedo http://t.co/sYQYv0zSjz",N,
1,Leyendo programas de @CiudadanosCs y @ahorapodemos me encuentro exactamente lo mismo: reestructuración ordenada de la deuda,NEU,
2,"Buenos días, lo que está ocurriendo con el barco ruso es una ínfima muestra de lo que pudo suceder con las prospecciones de @PPopular .",N,
3,@CambiarMadrid_ @AhoraGetafe la verdadera @iunida http://t.co/NJ8cL7wUaL,NEU,
4,"En el @PSOE saben que tendrán que pactar, no saben con quien, si quieren gobernar",NEU,
...,...,...,...
79444,Las mejores fotos son las que no se ven ✨🌟,P,
79445,"Es increíble y tal pero no , no voy a ir al TH",NEU,
79446,"a to esto, el tio de ono no ha venio",N,
79447,Es que me lo como de lo bonito que es,P,


In [None]:
tweets_corpus.polarity.unique()

array(['N', 'NEU', 'P'], dtype=object)

In [None]:
# Now we preprocess the TASS data with the preprocessor declared above

content = tweets_corpus.content.to_list()

# Clean the data
tweets_corpus.content = [decode_emot_emoji(s, translation = "es") for s in tqdm(content, 
                                                                                desc = "Cleaning")]

[Progress bar "Cleaning": 54112 items]

In [None]:
tweets_corpus

Unnamed: 0,content,polarity,agreement
0,"Mapa con el ""batiburrillo"" d partidos que ha absorbido frente a que se mantiene Libres",N,
1,Leyendo programas de y me encuentro exactamente lo mismo: reestructuración ordenada de la deuda,NEU,
2,"Buenos días, lo que está ocurriendo con el barco ruso es una ínfima muestra de lo que pudo suceder con las prospecciones de .",N,
3,la verdadera,NEU,
4,"En el saben que tendrán que pactar, no saben con quien, si quieren gobernar",NEU,
...,...,...,...
79444,Las mejores fotos son las que no se ven,P,
79445,"Es increíble y tal pero no , no voy a ir al TH",NEU,
79446,"a to esto, el tio de ono no ha venio",N,
79447,Es que me lo como de lo bonito que es,P,


In [None]:
tweets_corpus.to_pickle("TASS_data_df.pkl")

In [None]:
import pandas as pd

TASS_data = pd.read_pickle("/content/TASS_data_df.pkl")
TASS_data.sample(20)

Unnamed: 0,content,polarity,agreement
9378,Practicando la ceja para RAFA!!;) vamos!!;)),P,
126,Lógico o ?es que quiere nombrarlos El Diario?Otro asunto es que sea o no aceptable pero es el que gobierna,NEU,
70419,"No,No,No! No sabe de Fútbol El mejor equipo es el Barcelona JAJAJAJAJAJAJA",NEU,
29734,No puede aclararlo pq no lo han decidido. Puede q ni sean penales...,N,
55425,Aspectos esenciales d reforma laboral no se pueden aplicar 1 mes después x falta desarrollo reglamentario. Hasta las elecciones andaluzas?,N,
45296,habrá en los premios consignas contra Ref laboral? Son capaces tras años sin hablar de paro!,P,
50378,"Carlos Floriano pide a Rubalcaba que deje poner palos en la rueda a los que ""tratan de sacar a España de la situación que su partido dejó""",N,
39937,RT“: Rajoy (): «La situación es crítica; el paro aumentará en 2012» Empleo Economía”,N,
68674,Hoy microaventura en kayak con dos expertos kayakistas Ría de Villaviciosa. El Puntal…,P,
15460,"Rajoy anuncia la actualización del poder adquisitivo de las pensiones, desde 1 enero. ""el único aumento de gasto que me van a escuchar hoy""",P,


In [None]:
TASS_data.shape

(54112, 3)

## Data composition

In [None]:
freq_data = TASS_data.groupby(["polarity"], as_index=False).agg({"content": 'count'})
freq_data

Unnamed: 0,polarity,content
0,N,22205
1,NEU,3945
2,P,27962
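
To make the class imbalance explicit, the counts in the table above can be converted into percentages. The snippet below is a small worked example that copies the three counts by hand rather than reading them from `freq_data`.

```python
# Counts copied from the frequency table above
counts = {"N": 22205, "NEU": 3945, "P": 27962}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)    # 54112, matching TASS_data.shape[0]
print(shares)   # {'N': 41.0, 'NEU': 7.3, 'P': 51.7}
```

The neutral class accounts for only about 7% of the tweets, which is worth keeping in mind when splitting the data or choosing evaluation metrics.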


In [None]:
import plotly.express as px

# Use 'count' (not ['count']) so the result has flat column names that plotly can reference
freq_data = TASS_data.groupby(["polarity"], as_index=False).agg({"content": 'count'})

fig = px.bar(freq_data, x="polarity", y="content", color="polarity")

fig.update_layout(
    title="TASS data",
    title_font=dict(
        # family="Courier New, monospace",
        size=20,
    ),
    legend_font = dict(
        # family="Courier New, monospace",
        size=15,
    )
)

fig.update_xaxes(
        # tickangle = 90,
        title_text = "Polarity",
        title_font = {"size": 18},
        title_standoff = 25,
        tickfont=dict(size=14))

fig.update_yaxes(
        title_text = "Count",
        title_font = {"size": 18},
        title_standoff = 25,
        tickfont=dict(size=14))
fig.show()