This notebook is an introduction to the main concepts and libraries used in natural language processing, following a tutorial from the KGP Talkie channel on YouTube.

The goal is to get familiar with these concepts and libraries by working through one project (sentiment analysis of tweets) from start to finish, and later to apply them to other natural language processing problems.

[Video link](https://www.youtube.com/watch?v=VyDmQggfsZ0)

[Dataset on Kaggle](https://www.kaggle.com/kazanova/sentiment140)

The dataset contains 1.6 million tweets collected through the Twitter API and made available on Kaggle at the link above. It has six columns (numbered 0 to 5):
    
    0 -> sentiment of the tweet (0 is negative, 4 is positive)
    
    1 -> tweet id
    
    2 -> tweet date
    
    3 -> query flag (NO_QUERY in the rows shown below; check the Kaggle page for its exact meaning)
    
    4 -> Twitter user name
    
    5 -> the tweet text
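
Because the file has no header row, it can help to attach readable names to the six columns. The sketch below does this with the standard `csv` module on one row shaped like the preview further down. The names in `COLUMNS` are hypothetical labels chosen for illustration, not names defined by the dataset file.

```python
import csv
import io

# Hypothetical column names; the CSV file itself has no header row.
COLUMNS = ["sentiment", "id", "date", "query", "user", "text"]

# One row shaped like the dataset preview (values taken from the preview table).
sample = (
    '"0","1467811184","Mon Apr 06 22:19:57 PDT 2009",'
    '"NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire"\n'
)

# Pair every field with its column name.
reader = csv.reader(io.StringIO(sample))
rows = [dict(zip(COLUMNS, fields)) for fields in reader]

print(rows[0]["sentiment"], rows[0]["user"])  # prints: 0 ElleCTF
```

With pandas, the same list could be passed as `names=COLUMNS` to `pd.read_csv` instead of renaming the columns afterwards.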

Installing libraries

spaCy is an open-source library distributed under the MIT license. It covers simple to advanced natural language processing (NLP) tasks such as tokenization, part-of-speech tagging, named entity recognition, text classification, semantic similarity between texts, lemmatization, and dependency parsing.

In [4]:
#!pip install -U spacy
#!pip install -U spacy-lookups-data
#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md
#!python -m spacy download en_core_web_lg

In this notebook, we will perform the tasks below.


*   General Feature Extraction

*   File loading

*   Word counts

*   Characters count

*   Average characters per word

*   Stop words count

*   Count #HashTags and @Mentions

*   Whether numeric digits are present in tweets

*   Upper case word counts


Preprocessing and Cleaning

*  Lower case

*  Contraction to Expansion

*  Emails removal and counts

*  URLs removal and counts

*  Removal of RT

*  Removal of Special Characters

*  Removal of multiple spaces

*  Removal of HTML tags

*  Removal of accented characters

*  Removal of Stop Words

*  Conversion into base form of words

*  Common Occurring words Removal

*  Rare Occurring words Removal


*  Word Cloud

*  Spelling Correction

*  Tokenization

*  Lemmatization

*  Detecting Entities using NER

*  Noun Detection

*  Language Detection

*  Sentence Translation

*  Using Inbuilt Sentiment Classifier


Advanced Text Processing and Feature Extraction

*  N-Gram, Bi-Gram etc

*  Bag of Words (BoW)

*  Term Frequency Calculation TF

*  Inverse Document Frequency IDF

*  TFIDF Term Frequency – Inverse Document Frequency

*  Word Embedding Word2Vec using SpaCy

*  Machine Learning Models for Text Classification

*  SGDClassifier

*  LogisticRegression

*  LogisticRegressionCV

*  LinearSVC

*  RandomForestClassifier

Importing libraries

In [5]:
import pandas as pd
import numpy as np

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [6]:
df = pd.read_csv('tweet16m.csv', encoding = 'latin1', header = None)
df.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [7]:
# Column 5 holds the tweet text and column 0 holds its label (positive or negative)
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
df = df[[5, 0]].copy()

In [8]:
df.columns= ['tweet', 'sentiment']
df.head()

Unnamed: 0,tweet,sentiment
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0


In [9]:
# Check how balanced the two classes are
df['sentiment'].value_counts()

4    800000
0    800000
Name: sentiment, dtype: int64

In [10]:
map_sent= {0: 'negativo', 4: 'positivo'}

Word Counts

In [11]:
df['word_counts']= df['tweet'].apply(lambda x: len(str(x).split()))
df.head()

Unnamed: 0,tweet,sentiment,word_counts
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0,19
1,is upset that he can't update his Facebook by ...,0,21
2,@Kenichan I dived many times for the ball. Man...,0,18
3,my whole body feels itchy and like its on fire,0,10
4,"@nationwideclass no, it's not behaving at all....",0,21


Character Counts

In [12]:
df['char_counts']= df['tweet'].apply(lambda x: len(x))
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0,19,115
1,is upset that he can't update his Facebook by ...,0,21,111
2,@Kenichan I dived many times for the ball. Man...,0,18,89
3,my whole body feels itchy and like its on fire,0,10,47
4,"@nationwideclass no, it's not behaving at all....",0,21,111


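Average Word Length

In [13]:
def get_avg_word_len(x):
    # Average length of the whitespace-separated words (spaces are not counted)
    words = x.split()
    if not words:
        return 0.0  # avoid dividing by zero on an empty or whitespace-only tweet
    word_len = sum(len(word) for word in words)
    return word_len / len(words)  # not the same as len(x) / len(words)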

In [14]:
df['avg_word_len']= df['tweet'].apply(lambda x: get_avg_word_len(x))
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0,19,115,5.052632
1,is upset that he can't update his Facebook by ...,0,21,111,4.285714
2,@Kenichan I dived many times for the ball. Man...,0,18,89,3.944444
3,my whole body feels itchy and like its on fire,0,10,47,3.7
4,"@nationwideclass no, it's not behaving at all....",0,21,111,4.285714
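
The comment in `get_avg_word_len` notes that the result differs from `len(x) / len(words)`. A minimal sketch of that difference, using one tweet from the preview (the function name `avg_word_len` is just a local stand-in):

```python
def avg_word_len(text):
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

tweet = "my whole body feels itchy and like its on fire"
words = tweet.split()

per_word = avg_word_len(tweet)          # spaces excluded
with_spaces = len(tweet) / len(words)   # spaces included

print(per_word)     # prints: 3.7
print(with_spaces)  # prints: 4.6
```

The second figure is larger because the nine spaces between the ten words are counted as characters.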


Stop Words Count

In [15]:
print(STOP_WORDS)

# Note: be careful about removing stop words blindly. Check which words are in the set and
# whether the problem needs some of them kept, so that important information is not lost

{'whose', 'ourselves', 'after', 'are', 'whether', 'show', 'become', 'thence', 'have', 'moreover', 'a', 'via', 'much', 'whoever', 'those', 'therefore', 'along', 'also', 'afterwards', "'ve", 'she', 'just', 'really', 'done', 'get', 'then', 'sometime', 'three', 'me', 'former', 'themselves', 'her', 'so', 'us', 'should', 'using', 'across', 'unless', 'on', 'beforehand', 'an', 'several', 'behind', 'of', 'mostly', "'re", 'eleven', 'their', 'indeed', 'to', 'n’t', 'third', 'anywhere', 'yet', 'both', '’re', 'with', 'more', 'none', 'i', 'part', 'our', 'if', 'nobody', 'is', 'see', 'always', 'can', 'often', 'please', 'had', 'without', 'six', 'anything', 'first', 'almost', 'last', 'twelve', 'until', 'will', 'further', 'somehow', 'only', 'did', 'or', 'alone', 'these', 'all', 'say', 'others', 'various', 'serious', 'very', 'myself', '‘ll', '’d', 'quite', 'your', 'by', '‘s', 'hereafter', 'whatever', 'same', 'now', 'ours', 'seems', 'each', 'might', 'rather', 'whereupon', 'otherwise', 'upon', 'his', 'yours'

In [16]:
# Count the stop words in each tweet

df['stop_words_counts']= df['tweet'].apply(lambda x: len([word for word in x.split() if word in STOP_WORDS]))
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0,19,115,5.052632,4
1,is upset that he can't update his Facebook by ...,0,21,111,4.285714,9
2,@Kenichan I dived many times for the ball. Man...,0,18,89,3.944444,7
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5
4,"@nationwideclass no, it's not behaving at all....",0,21,111,4.285714,10
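
The count above compares raw tokens against the set, so capitalised tokens or tokens with attached punctuation may not match. The sketch below shows one possible normalisation step. `SAMPLE_STOP_WORDS` is a small hypothetical set used only to keep the example self-contained; in the notebook it would be spaCy's `STOP_WORDS`.

```python
import string

# Small hypothetical stop-word set for illustration only.
SAMPLE_STOP_WORDS = {"i", "for", "the", "is", "a", "that"}

def count_stop_words(text, stop_words):
    """Count tokens that are stop words, ignoring case and edge punctuation."""
    count = 0
    for token in text.split():
        normalized = token.strip(string.punctuation).lower()
        if normalized in stop_words:
            count += 1
    return count

tweet = "@Kenichan I dived many times for the ball."

# Exact-match count, as in the cell above.
strict = len([w for w in tweet.split() if w in SAMPLE_STOP_WORDS])
# Normalised count.
relaxed = count_stop_words(tweet, SAMPLE_STOP_WORDS)

print(strict, relaxed)  # prints: 2 3
```

Whether the normalised count is preferable depends on the feature being built; computing it after the lower-casing step would have a similar effect on case.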


Count #HashTags and @Mentions

In [17]:
df['hashtags_count']= df['tweet'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count']= df['tweet'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))

df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0,19,115,5.052632,4,0,1
1,is upset that he can't update his Facebook by ...,0,21,111,4.285714,9,0,0
2,@Kenichan I dived many times for the ball. Man...,0,18,89,3.944444,7,0,1
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0
4,"@nationwideclass no, it's not behaving at all....",0,21,111,4.285714,10,0,1
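
Counting with `startswith` only sees a tag when it begins a whitespace-separated token. As a sketch of an alternative, a regular expression can also find tags that follow punctuation, while skipping the `@` inside an e-mail address. The patterns and the sample sentence below are illustrative assumptions, not part of the original tutorial.

```python
import re

# A tag must not be preceded by a word character (this skips "a@b.com").
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def count_tags(text):
    """Return (number of hashtags, number of mentions) found in text."""
    return len(HASHTAG_RE.findall(text)), len(MENTION_RE.findall(text))

text = "thanks @switchfoot,@Kenichan! #nlp rocks (#python) mail me at a@b.com"

# Token-based counts, as in the cell above.
split_hashtags = len([t for t in text.split() if t.startswith("#")])
split_mentions = len([t for t in text.split() if t.startswith("@")])

print(split_hashtags, split_mentions)  # prints: 1 1
print(count_tags(text))                # prints: (2, 2)
```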


Whether numeric digits are present in tweets

In [18]:
df['numerics_count']= df['tweet'].apply(lambda x: len([n for n in x.split() if n.isdigit()]))
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0,19,115,5.052632,4,0,1,0
1,is upset that he can't update his Facebook by ...,0,21,111,4.285714,9,0,0,0
2,@Kenichan I dived many times for the ball. Man...,0,18,89,3.944444,7,0,1,0
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0,0
4,"@nationwideclass no, it's not behaving at all....",0,21,111,4.285714,10,0,1,0
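
`str.isdigit` is true only for tokens made up entirely of digits, so numbers embedded in a word or joined by punctuation are not counted. A short sketch of the difference, with a made-up sentence and hypothetical helper names:

```python
import re

def count_digit_tokens(text):
    """Tokens made up only of digits (what str.isdigit checks)."""
    return len([t for t in text.split() if t.isdigit()])

def count_number_runs(text):
    """Runs of digits anywhere in the text, including inside words."""
    return len(re.findall(r"\d+", text))

text = "hp tx2000 costs 900 dollars, call 555-1234"

print(count_digit_tokens(text))  # prints: 1
print(count_number_runs(text))   # prints: 4
```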


Upper case word counts

In [19]:
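# This count matters for this problem because people who are angry or sad often
# tweet in capital letters.
# Note: len(x) > 3 tests the length of the whole tweet, not of the word u, so short
# upper-case tokens such as 'I' are counted too; use len(u) > 3 to skip them.
df['upper_counts']= df['tweet'].apply(lambda x: len([u for u in x.split() if u.isupper() and len(x)>3]))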
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0,19,115,5.052632,4,0,1,0,1
1,is upset that he can't update his Facebook by ...,0,21,111,4.285714,9,0,0,0,0
2,@Kenichan I dived many times for the ball. Man...,0,18,89,3.944444,7,0,1,0,1
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0,0,0
4,"@nationwideclass no, it's not behaving at all....",0,21,111,4.285714,10,0,1,0,1
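
In the preview, the row starting with "@Kenichan I dived" has an upper-case count of 1, which suggests that one-letter tokens such as "I" are included. A sketch of a word-level length filter, assuming the intent is to keep only upper-case words longer than three characters (`count_upper_words` and `min_len` are hypothetical names):

```python
def count_upper_words(text, min_len=4):
    """Count upper-case tokens that have at least min_len characters."""
    return len([w for w in text.split() if w.isupper() and len(w) >= min_len])

tweet = "SADLY SINCE ITS EASTER I WNT B ABLE 2 DO MUCH  BUT OHH WELL....."

print(count_upper_words(tweet, min_len=1))  # every upper-case token: prints 13
print(count_upper_words(tweet))             # only longer words: prints 6
```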


In [20]:
# Looking at one tweet as an example of the upper-case usage mentioned above

df.iloc[96]['tweet']

"so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER I WNT B ABLE 2 DO MUCH  BUT OHH WELL....."

Preprocessing and Cleaning

Lower case conversion

In [21]:
df['tweet']= df['tweet'].apply(lambda x: x.lower())
df.head(2)

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts
0,"@switchfoot http://twitpic.com/2y1zl - awww, t...",0,19,115,5.052632,4,0,1,0,1
1,is upset that he can't update his facebook by ...,0,21,111,4.285714,9,0,0,0,0


Contraction to Expansion

In [22]:
# Dictionary mapping contractions to their expanded, formal spelling
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and "}

In [23]:
def cont_to_exp(x):
    # Replace every contraction found in x with its expanded form;
    # non-string values (e.g. NaN) are returned unchanged
    if isinstance(x, str):
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x

In [24]:
x= "hi, i'd be happy"

cont_to_exp(x)

'hi, i would be happy'
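
Sequential `str.replace` calls apply the keys in dictionary order, so a short key such as "can't" is replaced before a longer key that contains it, such as "can't've". The sketch below compares that strategy with a single regular expression that tries longer keys first. It uses a few entries copied from the dictionary above; `make_expander` is a hypothetical helper, not part of the original tutorial.

```python
import re

# A few entries copied from the notebook's dictionary.
sample_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "i'd": "i would",
    "it's": "it is",
}

def expand_in_order(text, mapping):
    """Sequential str.replace in dictionary order."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

def make_expander(mapping):
    """Build a function that replaces the longest matching key first."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

expand_longest_first = make_expander(sample_contractions)

text = "i can't've known"
print(expand_in_order(text, sample_contractions))  # prints: i cannot've known
print(expand_longest_first(text))                  # prints: i cannot have known
```

The regex version also makes a single pass over each tweet instead of one pass per key, which may matter for the running time on the full dataset.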

In [25]:
%%time
df['tweet']= df['tweet'].apply(lambda x: cont_to_exp(x))

CPU times: user 30 s, sys: 121 ms, total: 30.1 s
Wall time: 42.2 s


In [26]:
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts
0,"@switchfoot http://twitpic.com/2y1zl - awww, t...",0,19,115,5.052632,4,0,1,0,1
1,is upset that he cannot update his facebook by...,0,21,111,4.285714,9,0,0,0,0
2,@kenichan i dived many times for the ball. man...,0,18,89,3.944444,7,0,1,0,1
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0,0,0
4,"@nationwideclass no, it is not behaving at all...",0,21,111,4.285714,10,0,1,0,1


Emails removal and counts

In [27]:
import re

In [28]:
x= 'hi, my email me at email@email.com another@email.com'

re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x)

['email@email.com', 'another@email.com']

In [29]:
df['email']= df['tweet'].apply(lambda x: re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x))
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts,email
0,"@switchfoot http://twitpic.com/2y1zl - awww, t...",0,19,115,5.052632,4,0,1,0,1,[]
1,is upset that he cannot update his facebook by...,0,21,111,4.285714,9,0,0,0,0,[]
2,@kenichan i dived many times for the ball. man...,0,18,89,3.944444,7,0,1,0,1,[]
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0,0,0,[]
4,"@nationwideclass no, it is not behaving at all...",0,21,111,4.285714,10,0,1,0,1,[]


In [30]:
df['emails_count']= df['email'].apply(lambda x: len(x))
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts,email,emails_count
0,"@switchfoot http://twitpic.com/2y1zl - awww, t...",0,19,115,5.052632,4,0,1,0,1,[],0
1,is upset that he cannot update his facebook by...,0,21,111,4.285714,9,0,0,0,0,[],0
2,@kenichan i dived many times for the ball. man...,0,18,89,3.944444,7,0,1,0,1,[],0
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0,0,0,[],0
4,"@nationwideclass no, it is not behaving at all...",0,21,111,4.285714,10,0,1,0,1,[],0


In [31]:
# Total number of tweets containing at least one e-mail address
len(df[df['emails_count']>0])

564

In [36]:
x= 'hi, my email me at email@email.com another@email.com'

x= re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x)

print(x)

hi, my email me at  
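
The same pattern is used for finding, counting and removing addresses, so it can be compiled once and reused. A minimal sketch, where `EMAIL_RE` and `split_emails` are hypothetical names:

```python
import re

# Same pattern as above, without the outer capture group.
EMAIL_RE = re.compile(r"[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+")

def split_emails(text):
    """Return the addresses found in text and the text with them removed."""
    emails = EMAIL_RE.findall(text)
    cleaned = EMAIL_RE.sub("", text)
    return emails, cleaned

emails, cleaned = split_emails("hi, my email me at email@email.com another@email.com")

print(emails)        # prints: ['email@email.com', 'another@email.com']
print(len(emails))   # prints: 2
print(repr(cleaned)) # prints: 'hi, my email me at  '
```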


In [39]:
# Removing e-mail addresses
df['tweet']= df['tweet'].apply(lambda x: re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x))

In [40]:
df[df['emails_count']>0].head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts,email,emails_count
4054,i want a new laptop. hp tx2000 is the bomb. :...,0,20,103,4.15,6,0,0,0,4,[gabbehhramos@yahoo.com],1
7917,who stole ?,0,3,31,9.0,1,0,0,0,0,[elledell@gmail.com],1
8496,@alexistehpom really? did you send out all th...,0,20,130,5.5,11,0,1,0,0,[missataari@gmail.com],1
10290,@laureystack awh...that is kinda sad lol add ...,0,8,76,8.5,0,0,1,0,0,[hello.kitty.65@hotmail.com],1
16413,"@jilliancyork got 2 bottom of it, human error...",0,21,137,5.428571,7,0,1,1,0,[press@linkedin.com],1


Count URLs and remove them

In [44]:
x= 'hi, watch the video on https://youtube.com/kgptalkie'

re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)

[('https', 'youtube.com', '/kgptalkie')]
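
`re.findall` returns a tuple per match here because the pattern contains three capture groups. If the complete URL is wanted instead, the groups can be made non-capturing. The sketch below is a variant of the same pattern under that assumption; `URL_RE` is a hypothetical name.

```python
import re

# Same structure as the pattern above, with non-capturing groups.
URL_RE = re.compile(
    r"(?:http|ftp|https)://[\w-]+(?:\.[\w-]+)+"
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

text = "hi, watch the video on https://youtube.com/kgptalkie"

urls = URL_RE.findall(text)       # full matches, since there are no capture groups
without_urls = URL_RE.sub("", text)

print(urls)                # prints: ['https://youtube.com/kgptalkie']
print(repr(without_urls))  # prints: 'hi, watch the video on '
```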

In [46]:
df['urls_flag']= df['tweet'].apply(lambda x: len(re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))

In [47]:
# Removing links
df['tweet']= df['tweet'].apply(lambda x: re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))

In [48]:
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts,email,emails_count,urls_flag
0,"@switchfoot - awww, that is a bummer. you sh...",0,19,115,5.052632,4,0,1,0,1,[],0,1
1,is upset that he cannot update his facebook by...,0,21,111,4.285714,9,0,0,0,0,[],0,0
2,@kenichan i dived many times for the ball. man...,0,18,89,3.944444,7,0,1,0,1,[],0,0
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0,0,0,[],0,0
4,"@nationwideclass no, it is not behaving at all...",0,21,111,4.285714,10,0,1,0,1,[],0,0


Remove RT (retweet)

In [49]:
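# The tweets were lowercased earlier, so match a standalone 'rt' token in any case
df['tweet']= df['tweet'].apply(lambda x: re.sub(r'\brt\b', '', x, flags=re.IGNORECASE))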
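
A pattern without word boundaries also matches the letters "RT" inside longer words, and a case-sensitive pattern finds nothing in text that has already been lowercased. A small sketch of both effects on made-up strings (the function names are hypothetical):

```python
import re

def remove_rt_naive(text):
    """Delete every occurrence of the two letters 'RT'."""
    return re.sub("RT", "", text)

def remove_rt_token(text):
    """Delete 'rt' only when it stands alone as a word, in any case."""
    return re.sub(r"\brt\b", "", text, flags=re.IGNORECASE)

print(repr(remove_rt_naive("RT @user: SMART START today")))  # prints: ' @user: SMA STA today'
print(repr(remove_rt_naive("rt @user: smart start today")))  # prints: 'rt @user: smart start today'
print(repr(remove_rt_token("rt @user: smart start today")))  # prints: ' @user: smart start today'
```

The leading space left behind is removed later by the multiple-spaces step.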

Removing special characters and punctuation

In [50]:
# Keep only letters, digits, spaces and hyphens
df['tweet']= df['tweet'].apply(lambda x: re.sub(r'[^A-Za-z0-9 -]+', '', x))

In [51]:
df.head()

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts,email,emails_count,urls_flag
0,switchfoot - awww that is a bummer you shoul...,0,19,115,5.052632,4,0,1,0,1,[],0,1
1,is upset that he cannot update his facebook by...,0,21,111,4.285714,9,0,0,0,0,[],0,0
2,kenichan i dived many times for the ball manag...,0,18,89,3.944444,7,0,1,0,1,[],0,0
3,my whole body feels itchy and like its on fire,0,10,47,3.7,5,0,0,0,0,[],0,0
4,nationwideclass no it is not behaving at all i...,0,21,111,4.285714,10,0,1,0,1,[],0,0
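
Deleting punctuation outright joins words that were separated only by punctuation. As a sketch of an alternative, the characters can be replaced by a space before whitespace is collapsed. The example uses the lowercased start of the tweet shown earlier; `space_special` is a hypothetical variant, not the tutorial's method.

```python
import re

def strip_special(text):
    """Delete every character that is not a letter, digit, space or hyphen."""
    return re.sub(r"[^A-Za-z0-9 -]+", "", text)

def space_special(text):
    """Replace those characters with a space, then collapse whitespace."""
    return " ".join(re.sub(r"[^A-Za-z0-9 -]+", " ", text).split())

text = "so rylee,grace...wana go steve's party or not??"

print(strip_special(text))  # prints: so ryleegracewana go steves party or not
print(space_special(text))  # prints: so rylee grace wana go steve s party or not
```

Neither result is ideal for possessives such as "steve's", so the choice depends on which distortion matters less for the model.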


Removing multiple spaces

In [53]:
x= "Hi,                            what's going on?"

' '.join(x.split())

"Hi, what's going on?"

In [54]:
df['tweet']= df['tweet'].apply(lambda x: ' '.join(x.split()))

In [55]:
df.sample(2)

Unnamed: 0,tweet,sentiment,word_counts,char_counts,avg_word_len,stop_words_counts,hashtags_count,mentions_count,numerics_count,upper_counts,email,emails_count,urls_flag
444338,akrapacs ouch the line will be really really b...,0,26,125,3.807692,16,0,1,0,0,[],0,0
139041,oh work jack is not gonna make it through the ...,0,18,92,4.111111,7,0,0,0,0,[],0,0
