<a href="https://colab.research.google.com/github/ABMHub/NLP/blob/main/preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Initialization

In [28]:
import pandas as pd
import numpy as np
import seaborn as sns
import re
import nltk
from nltk.stem.porter import PorterStemmer
from typing import List

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [29]:
df = pd.read_csv("https://raw.githubusercontent.com/viniciusrpb/cic0269_natural_language_processing/main/corpus_tweets/twitter-2013train-A.txt", sep="\t", names=["ID", "emotion", "text"])
df

Unnamed: 0,ID,emotion,text
0,264183816548130816,positive,Gas by my house hit $3.39!!!! I\u2019m going t...
1,263405084770172928,negative,Theo Walcott is still shit\u002c watch Rafa an...
2,262163168678248449,negative,its not that I\u2019m a GSP fan\u002c i just h...
3,264249301910310912,negative,Iranian general says Israel\u2019s Iron Dome c...
4,262682041215234048,neutral,Tehran\u002c Mon Amour: Obama Tried to Establi...
...,...,...,...
9679,103158179306807296,positive,RT @MNFootNg It's monday and Monday Night Foot...
9680,103157324096618497,positive,All I know is the road for that Lomardi start ...
9681,100259220338905089,neutral,"All Blue and White fam, we r meeting at Golden..."
9682,104230318525001729,positive,@DariusButler28 Have a great game agaist Tam...


A snippet I keep re-running to read tweets at random.

In [30]:
import random
df.text[random.randrange(len(df))]  # derive the upper bound from the data instead of hard-coding 9683

'Tonight was just a preview of what Bob Jones will do to Austin at tomorrow\\u2019s vball game. VBC. Noon. #BeThere #BJP'

In [31]:
text_series = df.text
text_series[0:20]

0     Gas by my house hit $3.39!!!! I\u2019m going t...
1     Theo Walcott is still shit\u002c watch Rafa an...
2     its not that I\u2019m a GSP fan\u002c i just h...
3     Iranian general says Israel\u2019s Iron Dome c...
4     Tehran\u002c Mon Amour: Obama Tried to Establi...
5     I sat through this whole movie just for Harry ...
6     with J Davlar 11th. Main rivals are team Polan...
7     Talking about ACT\u2019s && SAT\u2019s\u002c d...
8     Why is \""Happy Valentines Day\"" trending? It...
9     They may have a SuperBowl in Dallas\u002c but ...
10    Im bringing the monster load of candy tomorrow...
11    Apple software\u002c retail chiefs out in over...
12    @oluoch @victor_otti @kunjand I just watched i...
13    One of my best 8th graders Kory was excited af...
14    #Livewire Nadal confirmed for Mexican Open in ...
15    @MsSheLahY I didnt want to just pop up... but ...
16    @Alyoup005 @addicted2haley hmmmm  November is ...
17    #Iran US delisting MKO from global terrori

# Theory / Lecture

In [32]:
primeira = text_series[0]
primeira_lower = primeira.lower()
primeira_lower

'gas by my house hit $3.39!!!! i\\u2019m going to chapel hill on sat. :)'

Tokenization: split the sentence into words (tokens) and strip characters that carry no useful information, such as punctuation and digits.

In [33]:
tokens = re.split(r'[ !@#.,?\d$]+', primeira_lower)
tokens

['gas',
 'by',
 'my',
 'house',
 'hit',
 'i\\u',
 'm',
 'going',
 'to',
 'chapel',
 'hill',
 'on',
 'sat',
 ':)']
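The `i\\u` and `m` tokens in the output come from the literal `\u2019` escape sequence in the raw tweet: the split pattern consumes the digits `2019` and leaves the backslash and the `u` behind. One possible way around this is sketched below, assuming the corpus stores such characters as literal `\uXXXX` text. The helper name `decode_unicode_escapes` is hypothetical and not part of the notebook. It decodes the escapes before tokenizing:

```python
import re

def decode_unicode_escapes(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they encode."""
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

# Raw string: the backslash is kept literally, as it appears in the corpus.
raw = r"Gas by my house hit $3.39!!!! I\u2019m going to Chapel Hill on Sat. :)"
decoded = decode_unicode_escapes(raw)

# Same split pattern as in the cell above, applied to the decoded text.
tokens = re.split(r'[ !@#.,?\d$]+', decoded.lower())
print(tokens)
```

With the escape decoded, the contraction stays a single token (`i’m`) instead of being split into `i\\u` and `m`.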

Stop words are words that add little or no meaning to the text, such as connectives and articles.

In [34]:
# stopwrd_list = ['and', 'by', 'on', 'to', 'or', 'in']
# stopwrd_list
stopword_list_nltk = nltk.corpus.stopwords.words("english")
# stopword_list_nltk

In [35]:
clean_words = []
for token in tokens:
    if token not in stopword_list_nltk:
        clean_words.append(token)

clean_words

['gas', 'house', 'hit', 'i\\u', 'going', 'chapel', 'hill', 'sat', ':)']
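The same filtering can be sketched with a set and a list comprehension. To keep the example runnable without any download, it uses the small hand-written list from the commented-out line above instead of the NLTK list. That substitution is an assumption made for illustration, so words such as `my` are kept here even though NLTK treats them as stop words.

```python
# Small illustrative stop word set (not the NLTK list).
stopwords = {"and", "by", "on", "to", "or", "in"}

tokens = ["gas", "by", "my", "house", "hit", "going", "to",
          "chapel", "hill", "on", "sat"]

# Set membership is a constant-time check, unlike scanning a list.
clean = [token for token in tokens if token not in stopwords]
print(clean)
```

Converting the NLTK list with `set(...)` once, before the loop, gives the same result as the list-based check while avoiding a linear scan per token.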

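A stemmer is an algorithm that reduces words to their stem, so that inflections of gender, number, and so on make no difference. The underlying assumption is that all inflected forms of a word carry essentially the same meaning. The stem does not have to be a dictionary word: in the output below, "house" becomes "hous".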

In [36]:
stemmer = PorterStemmer()
stem = [stemmer.stem(t) for t in tokens]
stem

['ga',
 'by',
 'my',
 'hous',
 'hit',
 'i\\u',
 'm',
 'go',
 'to',
 'chapel',
 'hill',
 'on',
 'sat',
 ':)']
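To make the idea of collapsing inflections concrete without depending on NLTK, the sketch below uses a deliberately naive suffix stripper. It is not the Porter algorithm used above, which applies a much richer set of rules, and `toy_stem` is a hypothetical name introduced only for this illustration.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes. This is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walk", "walks", "walked", "walking"]
stems = {word: toy_stem(word) for word in words}
print(stems)
```

All four forms map to the same stem, `walk`, so a downstream model would count them as one term.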

# Activity

In [37]:
mp = {
    ":)": "happy",
    "(:": "happy",
    ":D": "happy",
    ":(": "sad",
    "):": "sad",
    "D:": "sad",
    "xd": "fail",
    "gr8": "great",
    "lol": "laugh",
    "plz": "please",
    "m8": "mate",
    "idc": "i don't care",
    "imo": "in my opinion",
    "rt": ""
}

def processamento_de_texto(texto: str) -> str:
  """Pre-process a single tweet.

  Steps, in order:
    1. Convert the string to lower case.
    2. Split it into tokens, dropping special characters and digits.
    3. Replace some slang terms and acronyms with their meanings (see `mp`).
    4. Remove English stop words.
    5. Stem the remaining tokens with the Porter stemmer.

  Args:
      texto (str): Raw text of one tweet.

  Returns:
      str: The processed tokens joined by single spaces.
  """
  lower_case = texto.lower()

  tokens = re.split(r'[ !@#.,?\d$\\/:~"\'\(\)\+\-;_\*<>\[\]\{\}\|\=\&\^\´\`\%]+', lower_case)
  # Dictionary substitution. Emoticon keys such as ":)" never match here,
  # because the split above has already consumed ':', '(' and ')'.
  tokens = [mp.get(token, token) for token in tokens]
  # A set gives constant-time membership tests during stop word removal.
  stopword_set = set(nltk.corpus.stopwords.words("english"))
  # Drop stop words, plus the empty strings left by the split and by "rt" -> "".
  clean_words = [token for token in tokens if token and token not in stopword_set]
  stemmer = PorterStemmer()
  stem = [stemmer.stem(t) for t in clean_words]
  return " ".join(stem)
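One caveat about `processamento_de_texto`: its split pattern consumes `:`, `(` and `)`, and the text is lower-cased first, so emoticon keys in `mp` such as `:)` or `:D` cannot reach the dictionary lookup. A possible workaround, sketched below, replaces emoticons with words before tokenizing. It assumes emoticons are set off by whitespace. The names `emoticons` and `replace_emoticons`, and the shortened split pattern, are hypothetical and used only for this illustration.

```python
import re

# Lower-cased subset of the `mp` dictionary, since the text is lower-cased first.
emoticons = {":)": "happy", "(:": "happy", ":d": "happy",
             ":(": "sad", "):": "sad", "d:": "sad"}

# Match an emoticon only when it is surrounded by whitespace or string edges,
# so that text such as "said: no" is left untouched.
emoticon_pattern = re.compile(
    r'(?<!\S)(?:' + "|".join(re.escape(e) for e in emoticons) + r')(?!\S)'
)

def replace_emoticons(text: str) -> str:
    """Swap each emoticon for its word, padded with spaces so it stays a token."""
    return emoticon_pattern.sub(lambda m: f" {emoticons[m.group(0)]} ", text)

text = "Going to Chapel Hill on Sat. :)".lower()
replaced = replace_emoticons(text)

# Shortened split pattern for the example; empty strings are filtered out.
tokens = [t for t in re.split(r'[ !@#.,?\d$:()]+', replaced) if t]
print(tokens)
```

Running the replacement before the split keeps the sentiment carried by the emoticon, which the split would otherwise discard.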

## Pre-processed tweets

In [38]:
series = text_series.apply(processamento_de_texto)
df.text = series
df

Unnamed: 0,ID,emotion,text
0,264183816548130816,positive,ga hous hit u go chapel hill sat
1,263405084770172928,negative,theo walcott still shit u c watch rafa johnni ...
2,262163168678248449,negative,u gsp fan u c hate nick diaz u wait februari
3,264249301910310912,negative,iranian gener say israel u iron dome u deal mi...
4,262682041215234048,neutral,tehran u c mon amour obama tri establish tie m...
...,...,...,...
9679,103158179306807296,positive,mnfootng monday monday night footbal mind lo...
9680,103157324096618497,positive,know road lomardi start tonight set record pre...
9681,100259220338905089,neutral,blue white fam r meet golden corral dinner ton...
9682,104230318525001729,positive,dariusbutl great game agaist tampa bay tonight


## Vocabulary

In [39]:
# Collect every distinct token across the processed tweets.
st = set()
for wrds in series:
  st.update(wrds.split())

vocabulario = pd.Series(list(st),name="Vocabulario")
vocabulario.sort_values().to_csv("output.csv", index=False)
vocabulario

0                   cuku
1                 parent
2                  tempt
3        mountainsandsea
4                    soy
              ...       
19539          jaggerhop
19540          denverdeb
19541          grupoedeb
19542             streep
19543             damian
Name: Vocabulario, Length: 19544, dtype: object