# Presentation notebook: Davivienda NLP test

This notebook briefly describes the relevant features extracted from the tweets.

In [44]:
#!pip install spacy
#!python -m spacy download es_core_news_sm
#!pip install transformers
#!pip install tensorflow
#!pip install bs4
#!pip install sentencepiece  # needed by the XLM-RoBERTa tokenizer used in the topic analysis

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

import spacy
import es_core_news_sm
nlp=spacy.load("es_core_news_sm")
from spacy import displacy

import urllib.request  # urlopen lives in the urllib.request submodule
from bs4 import BeautifulSoup

import transformers

In [3]:
## Read the data

df_tweets = pd.read_csv("../Datos/davivienda_tweets.csv")
df_tweets.head()

Unnamed: 0.1,Unnamed: 0,UserScreenName,UserName,Timestamp,Text,Embedded_text,Emojis,Comments,Likes,Retweets,Image link,Tweet URL
0,0,Andrés Langebaek,@ALangebaek,2021-12-01T20:43:12.000Z,Andrés Langebaek\n@ALangebaek\n·\n1 dic.,La confianza se afectó. El indicador de confia...,,1.0,7.0,19.0,['https://pbs.twimg.com/media/FFjL57eXMAISBnk?...,https://twitter.com/ALangebaek/status/14661458...
1,1,Plaza Futura,@plaza_futura,2021-12-01T21:18:10.000Z,Plaza Futura\n@plaza_futura\n·\n1 dic.,Buscamos la accesibilidad y mejor atención en ...,✅ ✅ ✅ ✅ ✅,,,,['https://pbs.twimg.com/ext_tw_video_thumb/146...,https://twitter.com/plaza_futura/status/146615...
2,2,Julián Martinez,@JulianM998,2021-12-01T22:49:11.000Z,Julián Martinez\n@JulianM998\n·\n1 dic.,Señores \n@Davivienda\n no he podido ingresar ...,,1.0,,1.0,[],https://twitter.com/JulianM998/status/14661775...
3,3,Ferchis.,@fergomezr28,2021-12-01T12:29:07.000Z,Ferchis.\n@fergomezr28\n·\n1 dic.,Llevo toda una semana sufriendo intento de hur...,,2.0,1.0,2.0,[],https://twitter.com/fergomezr28/status/1466021...
4,4,MirandaL2,@MirandaSuspLo,2021-12-01T20:52:36.000Z,MirandaL2\n@MirandaSuspLo\n·\n1 dic.,Hemos retrocedido tanto en este país con este ...,,3.0,,8.0,[],https://twitter.com/MirandaSuspLo/status/14661...


In [80]:
## Data cleaning
## Start by replacing newlines (and any spaces or digits that follow them) with a space

df_tweets['Embedded_text_1'] = df_tweets['Embedded_text'].apply(lambda x: re.sub(r'\n[ 0-9]*', ' ', x))

## Then extract the links and look up a title for each one
df_tweets['links'] = df_tweets['Embedded_text_1'].apply(lambda x: re.findall(r'http\S+', x))
links_title = []
for i in df_tweets['links']:
    if len(i) == 0:
        links_title.append("")
    else:
        lista_i = []
        for k in i:
            try:
                # Use the page <title> when the link can be fetched
                html_page = urllib.request.urlopen(k)
                soup = BeautifulSoup(html_page, "html.parser")
                lista_i.append(soup.title.string)
            except Exception:
                # Otherwise fall back to the first label of the host name
                new_k = k.replace("https://", "")
                new_k = new_k.replace("http://", "")
                new_k = re.sub(r'\.[A-Za-z\.]*', '', new_k)
                new_k = re.sub(r'\/.*', '', new_k)
                lista_i.append(new_k)
        links_title.append(lista_i)

df_tweets['links_title'] = links_title
df_tweets['conteo_links'] = df_tweets['links'].apply(len)
# Replace each link in the text with a placeholder token
df_tweets['Embedded_text_1'] = df_tweets['Embedded_text_1'].apply(lambda x: re.sub(r'http\S+', 'Link_aqui', x))

## Then extract the hashtags
df_tweets['hashtags'] = df_tweets['Embedded_text_1'].apply(lambda x: re.findall(r'#\S+', x))
df_tweets['conteo_hashtags'] = df_tweets['hashtags'].apply(len)

## Then extract the mentions
df_tweets['menciones'] = df_tweets['Embedded_text_1'].apply(lambda x: re.findall(r'@\S+', x))
df_tweets['conteo_menciones'] = df_tweets['menciones'].apply(len)
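The extraction steps above can be tried on a single string before running them over the whole DataFrame. The following is a minimal sketch, not the notebook's exact code: the helper names `extraer_tokens` and `nombre_dominio` and the sample tweet are hypothetical, the hashtag and mention patterns use `\w+` instead of `\S+` so that trailing punctuation is not captured, and the host-name fallback uses `urllib.parse` rather than the regular expressions in the cell.

```python
import re
from urllib.parse import urlparse


def extraer_tokens(texto):
    """Return the links, hashtags and mentions found in a tweet's text."""
    links = re.findall(r'http\S+', texto)
    # \w+ stops at punctuation, so "#DaviPlata." yields "#DaviPlata"
    hashtags = re.findall(r'#\w+', texto)
    menciones = re.findall(r'@\w+', texto)
    return links, hashtags, menciones


def nombre_dominio(url):
    """Return the first label of the host name, ignoring a leading 'www.'."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


# Hypothetical sample text, written only for this illustration
ejemplo = "Señores @Davivienda, no funciona la app #DaviPlata. https://www.davivienda.com/ayuda"
links, hashtags, menciones = extraer_tokens(ejemplo)
print(links, hashtags, menciones, nombre_dominio(links[0]))
```

Unlike the regular-expression fallback in the cell, `nombre_dominio` skips a leading `www.`, so the two approaches can return different labels for the same URL.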


In [81]:
## Then extract the emojis
## Note: this pattern only matches literal "\u" escape sequences in the text, not decoded emoji characters
df_tweets['emojis'] = df_tweets['Embedded_text_1'].apply(lambda x: re.findall(r'\\u\S+', x))
df_tweets['conteo_emojis'] = df_tweets['emojis'].apply(len)

## Then extract the retweet markers ("RT" as a whole word, so it is not matched inside other words)
df_tweets['RT'] = df_tweets['Embedded_text_1'].apply(lambda x: re.findall(r'\bRT\b', x))
df_tweets['conteo_RT'] = df_tweets['RT'].apply(len)
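If the tweet text contains decoded emoji characters rather than literal `\u` escapes, a character-class pattern over Unicode ranges is one possible alternative. This is only a sketch: the two ranges below are an assumption and do not cover every emoji (flags, skin-tone modifiers and joined sequences are ignored), and `PATRON_EMOJI` and `extraer_emojis` are hypothetical names.

```python
import re

# Assumed, non-exhaustive ranges: pictographs and symbols (U+1F300 to U+1FAFF)
# plus miscellaneous symbols and dingbats (U+2600 to U+27BF)
PATRON_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def extraer_emojis(texto):
    """Return every character in the text that falls inside the assumed ranges."""
    return PATRON_EMOJI.findall(texto)


# Hypothetical sample text, written only for this illustration
ejemplo = "Buscamos la accesibilidad ✅ ✅ gracias 😀"
emojis = extraer_emojis(ejemplo)
print(emojis, len(emojis))
```

The dataset also ships an `Emojis` column, which may be a more reliable source than any pattern applied to the text.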



In [85]:
## Sentiment analysis

## Load the sentiment model

from transformers import pipeline
nlp_sentiment = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

## Apply the model to the tweets (one pass per tweet, then split label and score)

resultados = df_tweets['Embedded_text_1'].apply(lambda x: nlp_sentiment(x)[0])
df_tweets['sentiment'] = resultados.apply(lambda r: r['label'])
df_tweets['score'] = resultados.apply(lambda r: r['score'])


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.



In [86]:
df_tweets['sentiment'].value_counts()

1 star     1250
5 stars     314
4 stars     152
3 stars      51
2 stars      44
Name: sentiment, dtype: int64
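The star labels can be collapsed into a coarser polarity to make the distribution easier to read. The sketch below assumes a mapping the notebook does not define (1 and 2 stars as negative, 3 as neutral, 4 and 5 as positive) and a hypothetical helper name, `estrellas_a_polaridad`. The counts are copied from the output above.

```python
import re


def estrellas_a_polaridad(etiqueta):
    """Map a label such as '1 star' or '5 stars' to a coarse polarity."""
    estrellas = int(re.match(r'\d+', etiqueta).group())
    if estrellas <= 2:
        return "negative"
    if estrellas == 3:
        return "neutral"
    return "positive"


# Counts taken from the value_counts() output above
conteos = {"1 star": 1250, "5 stars": 314, "4 stars": 152, "3 stars": 51, "2 stars": 44}

resumen = {}
for etiqueta, n in conteos.items():
    polaridad = estrellas_a_polaridad(etiqueta)
    resumen[polaridad] = resumen.get(polaridad, 0) + n

print(resumen)
```

Under this assumed mapping, 1294 of the 1811 tweets (about 71%) fall in the negative group.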

In [87]:
df_tweets['score'].describe()

count    1811.000000
mean        0.546778
std         0.188656
min         0.219787
25%         0.389547
50%         0.513138
75%         0.697739
max         0.971901
Name: score, dtype: float64

In [88]:
## Entity analysis

## Load the entity model
## Note: dslim/bert-base-NER was fine-tuned on English data (CoNLL-2003),
## so its output on Spanish tweets should be read with caution

nlp_entidades = pipeline("ner", model="dslim/bert-base-NER")

## Apply the model to the tweets

df_tweets['entidades'] = df_tweets['Embedded_text_1'].apply(lambda x: nlp_entidades(x))


Some layers from the model checkpoint at dslim/bert-base-NER were not used when initializing TFBertForTokenClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at dslim/bert-base-NER.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


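The `entidades` column holds one list of token-level predictions per tweet. A simple way to summarise such a list is to count how many entities of each type it contains. The sketch below assumes the pipeline's usual token-level output, dictionaries with an `entity` key that holds `B-`/`I-` tags. The sample list is hand-written for illustration and is not real model output, and `contar_entidades` is a hypothetical helper.

```python
from collections import Counter


def contar_entidades(salida_ner):
    """Count entities per type; each 'B-' tag marks the start of one entity."""
    conteo = Counter()
    for token in salida_ner:
        etiqueta = token["entity"]
        if etiqueta.startswith("B-"):
            conteo[etiqueta[2:]] += 1
    return conteo


# Hand-written example in the assumed output format (not real model output)
salida_ejemplo = [
    {"entity": "B-ORG", "word": "Dav", "score": 0.98},
    {"entity": "I-ORG", "word": "##ivienda", "score": 0.97},
    {"entity": "B-LOC", "word": "Bogotá", "score": 0.95},
]

print(contar_entidades(salida_ejemplo))
```

Applied with `df_tweets['entidades'].apply(contar_entidades)`, this would give one counter per tweet, assuming the column has the format described above.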

In [95]:
## Topic analysis

## Load the topic model (zero-shot classification)
## Note: the tokenizer for this model requires the sentencepiece package

nlp_temas = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")

All model checkpoint layers were used when initializing TFXLMRobertaForSequenceClassification.

All the layers of TFXLMRobertaForSequenceClassification were initialized from the model checkpoint at joeddav/xlm-roberta-large-xnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForSequenceClassification for predictions without further training.


ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
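The error above says that `sentencepiece` is missing from the environment. A small check before building the pipeline makes the failure easier to diagnose. This is a minimal sketch with a hypothetical helper name, `modulo_disponible`. It only reports whether the package can be found and does not install anything.

```python
import importlib.util


def modulo_disponible(nombre):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(nombre) is not None


if not modulo_disponible("sentencepiece"):
    # Installing the package (and possibly restarting the kernel) should
    # let the tokenizer for xlm-roberta-large-xnli load
    print("sentencepiece is not installed; try `pip install sentencepiece`")
```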