In [14]:
import pandas as pd
import re

import nltk

**Steps to follow**
1. Load and manipulate the tweets
2. Clean each tweet (remove non-alphanumeric characters)
3. Tokenization
4. 

# 1 Data

In [3]:
# Data previously downloaded in November 2017
url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-con-R/master/datos/'
tweets_elon   = pd.read_csv(url + "datos_tweets_@elonmusk.csv")

In [8]:
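# Keep only the columns of interest and rename them (fecha = date, texto = text).
# .copy() makes `tweets` an independent DataFrame rather than a view of tweets_elon.
tweets = tweets_elon[["created_at", "status_id", "text"]].copy()
tweets.columns = ['fecha', 'id', 'texto']
tweets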

| | fecha | id | texto |
|---|---|---|---|
| 0 | 2017-11-09T17:28:57Z | 9.286758e+17 | "If one day, my words are against science, cho... |
| 1 | 2017-11-09T17:12:46Z | 9.286717e+17 | I placed the flowers\n\nThree broken ribs\nA p... |
| 2 | 2017-11-08T18:55:13Z | 9.283351e+17 | Atatürk Anıtkabir https://t.co/al3wt0njr6 |
| 3 | 2017-11-07T19:48:45Z | 9.279862e+17 | @Bob_Richards One rocket, slightly toasted |
| 4 | 2017-10-28T21:36:18Z | 9.243894e+17 | @uncover007 500 ft so far. Should be 2 miles l... |
| ... | ... | ... | ... |
| 2673 | 2013-03-20T00:53:40Z | 3.141779e+17 | Testing separation of F9 rocket fairing (can h... |
| 2674 | 2013-03-19T03:03:05Z | 3.138481e+17 | Sharing a metaphysical milkshake with @RainnWi... |
| 2675 | 2013-03-17T18:32:54Z | 3.133573e+17 | @JBSiegelMD Cool, I'm glad you like it! |
| 2676 | 2013-03-17T18:20:24Z | 3.133541e+17 | Craig Venter talks about flu vaccines and the ... |
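The `id` column is displayed in scientific notation, which suggests it was parsed as a 64-bit float. Floats of that magnitude cannot represent every integer, so the exact tweet ID may be lost. The following is a minimal sketch of the effect; the 18-digit ID is made up for illustration and does not come from the dataset.

```python
# Hypothetical 18-digit tweet ID, chosen only for illustration
tweet_id = 928675812345678901

# Round-tripping through a 64-bit float changes the value, because
# float64 carries only 53 bits of integer precision
as_float = float(tweet_id)
round_tripped = int(as_float)

print(tweet_id)
print(round_tripped)
print(tweet_id == round_tripped)  # False: the low digits were lost
```

If exact IDs matter, one option is to pass `dtype={"status_id": str}` to `pd.read_csv` so the column is kept as text. Whether that recovers the full IDs depends on how they are stored in the CSV file itself.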


**Make some plots**

# 2 Preprocessing and tokenization

## 2.1 Without cleaning the tweet

NLTK provides the function word_tokenize() for splitting strings into tokens (nominally words). It splits on whitespace and punctuation: commas and periods become separate tokens, and contractions are split apart (e.g. "What's" becomes "What" and "'s"). As the outputs below show, it also breaks up Twitter-specific elements such as mentions, hashtags, and URLs.

In [28]:
tweets["texto"].apply(nltk.word_tokenize)

0       [``, If, one, day, ,, my, words, are, against,...
1       [I, placed, the, flowers, Three, broken, ribs,...
2       [Atatürk, Anıtkabir, https, :, //t.co/al3wt0njr6]
3       [@, Bob_Richards, One, rocket, ,, slightly, to...
4       [@, uncover007, 500, ft, so, far, ., Should, b...
                              ...                        
2673    [Testing, separation, of, F9, rocket, fairing,...
2674    [Sharing, a, metaphysical, milkshake, with, @,...
2675    [@, JBSiegelMD, Cool, ,, I, 'm, glad, you, lik...
2676    [Craig, Venter, talks, about, flu, vaccines, a...
2677    [Using, Über, to, order, a, Tesla, Model, S, @...
Name: texto, Length: 2678, dtype: object

In [22]:
nltk.word_tokenize("Esto $ es 1 ejemplo de l'limpieza de6 TEXTO  https://t.co/rnHPgyhx4Z @cienciadedatos #textmining")

['Esto',
 '$',
 'es',
 '1',
 'ejemplo',
 'de',
 "l'limpieza",
 'de6',
 'TEXTO',
 'https',
 ':',
 '//t.co/rnHPgyhx4Z',
 '@',
 'cienciadedatos',
 '#',
 'textmining']

Both tokenizers split a sentence into tokens in much the same way, but TweetTokenizer is designed for social media text. It keeps hashtags, mentions, and URLs intact, whereas word_tokenize splits them into several tokens (for example, "@" and "cienciadedatos"). It also leaves contractions such as "I'm" as a single token.
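To make the difference concrete, here is a minimal regex sketch of tweet-aware tokenization. It is only an illustration of the idea: the pattern and the name `simple_tweet_tokenize` are assumptions made for this example, and NLTK's TweetTokenizer applies a much more complete set of rules.

```python
import re

# Illustrative pattern only; alternatives are tried left to right,
# so URLs, mentions and hashtags are matched before generic words.
TWEET_TOKEN = re.compile(
    r"""
    https?://\S+        # URLs kept whole
    | [@\#]\w+          # mentions and hashtags kept whole
    | \w+(?:'\w+)*      # words, including contractions such as I'm
    | [^\w\s]           # any other single non-space symbol
    """,
    re.VERBOSE,
)


def simple_tweet_tokenize(text):
    """Return the list of tokens found by the illustrative pattern."""
    return TWEET_TOKEN.findall(text)


example = ("Esto $ es 1 ejemplo de l'limpieza de6 TEXTO  "
           "https://t.co/rnHPgyhx4Z @cienciadedatos #textmining")
tokens = simple_tweet_tokenize(example)
print(tokens)
```

The key design choice is the ordering of the alternatives: matching URLs, mentions, and hashtags first is what keeps them from being split at ":" , "@", or "#".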

In [23]:
from nltk.tokenize import TweetTokenizer

In [29]:
?TweetTokenizer

In [30]:
t = TweetTokenizer()
t.tokenize("Esto $ es 1 ejemplo de l'limpieza de6 TEXTO  https://t.co/rnHPgyhx4Z @cienciadedatos #textmining")

['Esto',
 '$',
 'es',
 '1',
 'ejemplo',
 'de',
 "l'limpieza",
 'de6',
 'TEXTO',
 'https://t.co/rnHPgyhx4Z',
 '@cienciadedatos',
 '#textmining']

In [33]:
tweets["texto"].apply(t.tokenize)

0       [", If, one, day, ,, my, words, are, against, ...
1       [I, placed, the, flowers, Three, broken, ribs,...
2           [Atatürk, Anıtkabir, https://t.co/al3wt0njr6]
3       [@Bob_Richards, One, rocket, ,, slightly, toas...
4       [@uncover007, 500, ft, so, far, ., Should, be,...
                              ...                        
2673    [Testing, separation, of, F9, rocket, fairing,...
2674    [Sharing, a, metaphysical, milkshake, with, @R...
2675    [@JBSiegelMD, Cool, ,, I'm, glad, you, like, i...
2676    [Craig, Venter, talks, about, flu, vaccines, a...
2677    [Using, Über, to, order, a, Tesla, Model, S, @...
Name: texto, Length: 2678, dtype: object

# 3 Exploratory analysis

# 4 Sentiment analysis

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License.

Its authors ask that anyone who uses the dataset or any of the VADER sentiment analysis tools (the VADER sentiment lexicon or the Python code for the rule-based sentiment analysis engine) in their research cite the following paper:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

In [36]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [50]:
analyzer = SentimentIntensityAnalyzer()

In [79]:
tweets["texto"].apply(lambda x: analyzer.polarity_scores(x)["compound"])

0       0.0000
1      -0.2263
2       0.0000
3       0.0000
4       0.4019
         ...  
2673    0.0000
2674    0.4215
2675    0.7959
2676   -0.4389
2677    0.0000
Name: texto, Length: 2678, dtype: float64
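The compound score is a single normalized value between -1 (most negative) and +1 (most positive). A common convention, suggested in the VADER documentation, is to treat scores of at least 0.05 as positive, scores of at most -0.05 as negative, and everything in between as neutral. The sketch below applies that convention to the first five scores shown above; the function name `label_sentiment` and the `threshold` parameter are assumptions made for this example.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label.

    The +/-0.05 cut-off is a convention, not a fixed rule; adjust as needed.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# First five compound scores from the output above
sample_scores = [0.0, -0.2263, 0.0, 0.0, 0.4019]
labels = [label_sentiment(score) for score in sample_scores]
print(labels)
```

On the full dataset the same function could be passed to `.apply()` on the series of compound scores.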

In [63]:
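# `sentences` is a list of example strings defined in a cell not shown here
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    # Pad each sentence to 65 characters with '*' so the score dicts line up
    print("{:*<65} {}".format(sentence, str(vs)))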

VADER is smart, handsome, and funny.***************************** {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
VADER is smart, handsome, and funny!***************************** {'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
VADER is very smart, handsome, and funny.************************ {'neg': 0.0, 'neu': 0.299, 'pos': 0.701, 'compound': 0.8545}
VADER is VERY SMART, handsome, and FUNNY.************************ {'neg': 0.0, 'neu': 0.246, 'pos': 0.754, 'compound': 0.9227}
VADER is VERY SMART, handsome, and FUNNY!!!********************** {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342}
VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!********* {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
VADER is not smart, handsome, nor funny.************************* {'neg': 0.646, 'neu': 0.354, 'pos': 0.0, 'compound': -0.7424}
The book was good.*********************************************** {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'co

In [74]:
print("{:-<20}".format(1))

1-------------------
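The format specification used in these cells has the form `{:<fill><align><width>}`: a fill character, an alignment (`<` left, `>` right, `^` center), and a minimum width. A short sketch with an arbitrary sample string:

```python
value = "abc"

# Left-align in a field of width 10, padding with '-'
left = "{:-<10}".format(value)
# Right-align in a field of width 10, padding with '*'
right = "{:*>10}".format(value)
# Center in a field of width 9, padding with '.'
center = "{:.^9}".format(value)

print(left)
print(right)
print(center)
```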


In [15]:
def limpiar_tokenizar(texto):
    '''
    Clean the text and tokenize it into individual words.
    The order of the cleaning steps is not arbitrary.
    The list of punctuation marks was obtained from print(string.punctuation)
    and re.escape(string.punctuation).
    '''

    # Convert the whole text to lowercase
    nuevo_texto = texto.lower()
    # Remove web links (words that start with "http"); this must happen
    # before punctuation is stripped, which would break the URL apart
    nuevo_texto = re.sub(r'http\S+', ' ', nuevo_texto)
    # Remove punctuation marks
    regex = '[\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\\]\\^_\\`\\{\\|\\}\\~]'
    nuevo_texto = re.sub(regex, ' ', nuevo_texto)
    # Remove numbers
    nuevo_texto = re.sub(r"\d+", ' ', nuevo_texto)
    # Collapse runs of whitespace into a single space
    nuevo_texto = re.sub(r"\s+", ' ', nuevo_texto)
    # Tokenize into individual words
    nuevo_texto = nuevo_texto.split(sep=' ')
    # Drop tokens shorter than 2 characters
    nuevo_texto = [token for token in nuevo_texto if len(token) > 1]

    return nuevo_texto

test = "Esto es 1 ejemplo de l'limpieza de6 TEXTO  https://t.co/rnHPgyhx4Z @cienciadedatos #textmining"
print(test)
print(limpiar_tokenizar(texto=test))

Esto es 1 ejemplo de l'limpieza de6 TEXTO  https://t.co/rnHPgyhx4Z @cienciadedatos #textmining
['esto', 'es', 'ejemplo', 'de', 'limpieza', 'de', 'texto', 'cienciadedatos', 'textmining']
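The docstring notes that the punctuation list came from `string.punctuation` and `re.escape`. As a sketch of that idea, the character class can also be built at run time instead of being written out by hand. The variable names here are assumptions for the example, and the generated pattern is equivalent in effect to the hard-coded one rather than character-for-character identical.

```python
import re
import string

# Build a character class from the standard ASCII punctuation list;
# re.escape protects characters such as ']' and '\' inside the brackets
punctuation_regex = "[" + re.escape(string.punctuation) + "]"

sample = "l'limpieza @cienciadedatos #textmining"
cleaned = re.sub(punctuation_regex, " ", sample)
print(cleaned.split())
```

Generating the pattern this way avoids the long run of manual escapes, at the cost of making the exact set of removed characters less visible in the source.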
