# Introduction to NLP
In this activity you will work on a multiclass text classification problem. Consider the 20 Newsgroups dataset (fetch_20newsgroups).

"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."

This dataset can be loaded through scikit-learn:

from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train',
                                  shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test', 
                                 shuffle=True, random_state=42)

Given this context, choose a single classifier and, without tuning hyperparameters, train and test models using:
- Bag of Words (counts), without preprocessing
- TF-IDF, without preprocessing
- Bag of Words, with preprocessing
- TF-IDF, with preprocessing

Use accuracy as the metric and compare the results in a table.

The preprocessing steps must include at least:
- lowercasing
- punctuation removal
- number removal
- stopword removal (hint: use the NLTK library)
- lemmatization or stemming (only one of the two)

You may add any other steps you consider necessary. Create one function per step and a function called preprocess() that calls all of them.
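The four experiments described above form a 2 x 2 grid (representation x preprocessing). The sketch below shows one possible way to organise such a grid. It is not the approach taken in the cells that follow: the toy corpus, `toy_preprocess`, and `run_experiment` are hypothetical names introduced only for illustration, and the accuracies it prints say nothing about 20 Newsgroups.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for the 20 Newsgroups splits.
train_docs = ["The car has a fast engine", "Engines and cars, 2 doors",
              "God and faith in church", "Faith, prayer and the church"]
train_labels = [0, 0, 1, 1]
test_docs = ["A car engine", "Church and faith"]
test_labels = [0, 1]

def toy_preprocess(text):
    """Placeholder for the real preprocess(): lowercasing only."""
    return text.lower()

def run_experiment(vectorizer_cls, preprocess=None):
    """Fit a fresh vectorizer and classifier; return test accuracy."""
    prep = preprocess or (lambda t: t)
    docs_train = [prep(d) for d in train_docs]
    docs_test = [prep(d) for d in test_docs]
    vec = vectorizer_cls()                 # new instance per experiment
    X_train = vec.fit_transform(docs_train)
    X_test = vec.transform(docs_test)
    clf = LogisticRegression().fit(X_train, train_labels)
    return accuracy_score(test_labels, clf.predict(X_test))

results = {
    (name, label): run_experiment(cls, prep)
    for name, cls in [("BoW", CountVectorizer), ("TF-IDF", TfidfVectorizer)]
    for label, prep in [("raw", None), ("preprocessed", toy_preprocess)]
}
for key, acc in results.items():
    print(key, acc)
```

One detail worth keeping in mind: `CountVectorizer` and `TfidfVectorizer` lowercase their input by default (`lowercase=True`), so the "without preprocessing" runs already include that one step.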

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ruann\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ruann\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ruann\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

In [4]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [5]:
text_train = twenty_train['data']
y_train = twenty_train['target']
text_test = twenty_test['data']
y_test = twenty_test['target']

In [6]:
text_train;

### Feature extraction
Given this context, choose a single classifier and, without tuning hyperparameters, train and test models using:
- Bag of Words (counts), without preprocessing
- TF-IDF, without preprocessing

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
bagwords = CountVectorizer()
tfidf = TfidfVectorizer()
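To make the difference between the two representations concrete, the following sketch computes raw counts and TF-IDF weights by hand for a hypothetical three-document corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenizer is a simplification, and every name in the block is illustrative.

```python
import math
from collections import Counter

toy_docs = ["car engine car", "car door", "church faith"]  # hypothetical corpus
tokenized = [d.split() for d in toy_docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of Words: one raw count per vocabulary term.
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency and smoothed IDF: ln((1 + n) / (1 + df)) + 1.
n_docs = len(toy_docs)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(row):
    """Weight counts by IDF, then scale the row to unit L2 norm."""
    weighted = [c * idf[w] for c, w in zip(row, vocab)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    return [v / norm for v in weighted]

tfidf_rows = [tfidf_row(row) for row in counts]

print(vocab)
print(counts[0])
print([round(v, 3) for v in tfidf_rows[0]])
```

In the first document "car" appears twice and "engine" once, but "car" also occurs in another document, so its IDF is lower and the gap between the two weights is smaller than the raw counts suggest.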

In [38]:
X_train_b = bagwords.fit_transform(text_train)
X_text_b = bagwords.transform(text_test)
X_train_tf = tfidf.fit_transform(text_train)
X_text_tf = tfidf.transform(text_test)
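The cell above calls `fit_transform` on the training text and only `transform` on the test text. A minimal sketch of why, using a hypothetical two-document training set: the vocabulary is learned from the training data alone, and words that appear only in the test data are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["the car is fast", "the church is old"]   # hypothetical
toy_test = ["the spaceship is fast"]                   # 'spaceship' never seen in training

vec = CountVectorizer()
X_toy_train = vec.fit_transform(toy_train)   # learns the vocabulary
X_toy_test = vec.transform(toy_test)         # reuses it without refitting

print(sorted(vec.vocabulary_))
print(X_toy_train.shape, X_toy_test.shape)
print("spaceship" in vec.vocabulary_)
```

Both matrices have the same number of columns, which is what lets a classifier trained on one be applied to the other.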

In [9]:
d_train_b = X_train_b.shape
d_train_tf = X_train_tf.shape

In [10]:
X_train_b;
X_train_tf;

### Chosen classifier: Logistic Regression

In [11]:
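from sklearn.linear_model import LogisticRegression
# max_iter is raised from scikit-learn's default of 100 to give the solver more iterations on the raw counts
model = LogisticRegression(max_iter=1000)
model.fit(X_train_b, y_train)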

LogisticRegression(max_iter=1000)

In [12]:
model2 = LogisticRegression()
model2.fit(X_train_tf, y_train)

LogisticRegression()

In [13]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_text_b)
y_pred2 = model2.predict(X_text_tf)
# Test-set accuracy as a percentage, rounded to two decimals
acuracia1 = round(accuracy_score(y_test, y_pred) * 100, 2)
acuracia2 = round(accuracy_score(y_test, y_pred2) * 100, 2)
print('Bag of Words accuracy (%):', acuracia1)
print('TF-IDF accuracy (%):', acuracia2)

Bag of Words accuracy (%): 78.93
TF-IDF accuracy (%): 82.74
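Accuracy here is simply the fraction of test documents whose predicted newsgroup matches the true one. A minimal hand-rolled version, on hypothetical toy labels, looks like this (the helper name `accuracy` is illustrative; the notebook itself uses `accuracy_score`):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true_toy = [0, 1, 2, 2, 1]   # hypothetical labels
y_pred_toy = [0, 1, 2, 1, 1]   # one mistake out of five

toy_accuracy = round(accuracy(y_true_toy, y_pred_toy) * 100, 2)
print(toy_accuracy)
```

With 20 roughly balanced classes, accuracy is a reasonable headline metric, although it does not show which newsgroups get confused with one another.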


### Preprocessing

The preprocessing steps must include at least:
- lowercasing
- punctuation removal
- number removal
- stopword removal (hint: use the NLTK library)
- lemmatization or stemming (only one of the two)

In [14]:
def lowering(texto):
    return texto.lower()

In [15]:
import string
def remocao_pontuacao(texto):
    pontuacao_retirada = "".join([i for i in texto if i not in string.punctuation])
    return pontuacao_retirada
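Deleting punctuation outright glues neighbouring tokens together, so a host name such as `rac3.wam.umd.edu` becomes the single token `rac3wamumdedu`. One possible alternative, sketched below, replaces each punctuation character with a space instead. `remocao_pontuacao_espacos` is a hypothetical variant, not the function used in this notebook, and adopting it would change the vocabulary and the reported accuracies.

```python
import string

# Map every ASCII punctuation character to a single space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def remocao_pontuacao_espacos(texto):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(texto.translate(_PUNCT_TO_SPACE).split())

print(remocao_pontuacao_espacos("rac3.wam.umd.edu"))
print(remocao_pontuacao_espacos("e-mail, please!"))
```

Note that this variant also collapses newlines and tabs into single spaces, which is harmless here because the later steps split on whitespace anyway.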

In [16]:
import re
def remocao_numeros(texto):
    # Raw string avoids Python's invalid-escape warning for '\d'
    number_regex = r'\d+'
    x = re.sub(number_regex, '', texto)
    return x

In [17]:
def remocao_stopwords(texto):
    # Build a set so each membership test is O(1) instead of scanning a list
    stopwords = set(nltk.corpus.stopwords.words('english'))
    texto_stopwords = [j for j in texto.split() if j not in stopwords]
    frase = " ".join(texto_stopwords)
    return frase
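The order of the steps matters for stopword removal. NLTK's English list includes contracted forms such as "don't", so when punctuation is stripped first the token becomes "dont" and no longer matches. The sketch below illustrates the effect with a tiny hypothetical stopword set (not the NLTK list) and two helper functions introduced only for this example.

```python
import string

TOY_STOPWORDS = {"i", "don't", "the"}   # hypothetical subset, not the NLTK list

def strip_punct(text):
    """Delete every ASCII punctuation character."""
    return "".join(ch for ch in text if ch not in string.punctuation)

def drop_stopwords(text):
    """Remove whitespace-separated tokens found in TOY_STOPWORDS."""
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

sentence = "i don't like the bumper"
punct_first = drop_stopwords(strip_punct(sentence))   # order used in this notebook
stop_first = strip_punct(drop_stopwords(sentence))    # alternative order

print(punct_first)
print(stop_first)
```

With punctuation removed first, "dont" survives as a token; with stopwords removed first, it does not.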

In [18]:
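from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
def lemmatizacao(texto):
    # lemmatize() defaults to pos='n', so each token is treated as a noun;
    # verb forms such as 'looked' are therefore left unchanged
    lemm_texto = [wordnet_lemmatizer.lemmatize(word) for word in texto.split()]
    frase = " ".join(lemm_texto)
    return frase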

In [19]:
def preprocessamento(texto):
    LO = lowering(texto)
    RP = remocao_pontuacao(LO)
    RN = remocao_numeros(RP)
    RS = remocao_stopwords(RN)
    LE = lemmatizacao(RS)
    return LE

In [20]:
# Apply the full preprocessing pipeline to every training document
X_train_p = [preprocessamento(doc) for doc in text_train]

In [21]:
# Apply the full preprocessing pipeline to every test document
X_test_p = [preprocessamento(doc) for doc in text_test]

In [22]:
text_train[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [23]:
X_train_p[0]

'lerxstwamumdedu wheres thing subject car nntppostinghost racwamumdedu organization university maryland college park line wondering anyone could enlighten car saw day door sport car looked late early called bricklin door really small addition front bumper separate rest body know anyone tellme model name engine spec year production car made history whatever info funky looking car please email thanks il brought neighborhood lerxst'

In [24]:
text_test[0]

'From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)\nSubject: Need info on 88-89 Bonneville\nOrganization: University at Buffalo\nLines: 10\nNews-Software: VAX/VMS VNEWS 1.41\nNntp-Posting-Host: ubvmsd.cc.buffalo.edu\n\n\n I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.\n\n\t\t\tNeil Gandler\n'

In [25]:
X_test_p[0]

'vmbkubvmsdccbuffaloedu neil b gandler subject need info bonneville organization university buffalo line newssoftware vaxvms vnews nntppostinghost ubvmsdccbuffaloedu little confused model bonnevilles heard le se lse sse ssei could someone tell difference far feature performance also curious know book value prefereably model much le book value usually get word much demand time year heard midspring early summer best time buy neil gandler'

### Feature extraction
Given this context, choose a single classifier and, without tuning hyperparameters, train and test models using:
- Bag of Words, with preprocessing
- TF-IDF, with preprocessing

In [26]:
X_train_pb = bagwords.fit_transform(X_train_p)
X_text_pb = bagwords.transform(X_test_p)
X_train_ptf = tfidf.fit_transform(X_train_p)
X_text_ptf = tfidf.transform(X_test_p)
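The cell above refits the same `bagwords` and `tfidf` objects that were fitted on the raw text earlier. Refitting discards the previous vocabulary, so whichever cell ran last determines what the vectorizer "knows". The sketch below shows the effect on hypothetical toy documents and a safer pattern with one vectorizer per experiment. All names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["Call 555 NOW", "Call me later"]   # hypothetical raw text
clean_docs = ["call now", "call later"]        # hypothetical preprocessed text

# Reusing one object: the second fit overwrites the first vocabulary.
shared = CountVectorizer()
shared.fit(raw_docs)
vocab_raw = set(shared.vocabulary_)
shared.fit(clean_docs)
vocab_after_refit = set(shared.vocabulary_)

# One vectorizer per experiment keeps both vocabularies available.
vec_raw = CountVectorizer().fit(raw_docs)
vec_clean = CountVectorizer().fit(clean_docs)

print(sorted(vocab_raw))
print(sorted(vocab_after_refit))
```

Keeping separate instances also makes the notebook independent of the order in which cells are executed.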

In [27]:
d_train_b_pre = X_train_pb.shape
d_train_tf_pre = X_train_ptf.shape

In [28]:
d_train_preprocessada = X_train_pb.shape
d_test_preprocessada = X_text_pb.shape

### Chosen classifier: Logistic Regression

In [29]:
model3 = LogisticRegression(max_iter=1000)
model3.fit(X_train_pb, y_train)

LogisticRegression(max_iter=1000)

In [30]:
model4 = LogisticRegression()
model4.fit(X_train_ptf, y_train)

LogisticRegression()

In [31]:
from sklearn.metrics import accuracy_score
y_pred3 = model3.predict(X_text_pb)
y_pred4 = model4.predict(X_text_ptf)
# Test-set accuracy as a percentage, rounded to two decimals
acuracia3 = round(accuracy_score(y_test, y_pred3) * 100, 2)
acuracia4 = round(accuracy_score(y_test, y_pred4) * 100, 2)
print('Bag of Words accuracy after preprocessing (%):', acuracia3)
print('TF-IDF accuracy after preprocessing (%):', acuracia4)

Bag of Words accuracy after preprocessing (%): 79.63
TF-IDF accuracy after preprocessing (%): 83.23


### Comparison of results: table

In [32]:
# Build a DataFrame to present the results
resultado1 = pd.Series({'Configuration': 'BoW, without preprocessing', 'Accuracy (%)': acuracia1, 'Training matrix shape': d_train_b})
resultado2 = pd.Series({'Configuration': 'TF-IDF, without preprocessing', 'Accuracy (%)': acuracia2, 'Training matrix shape': d_train_tf})
resultado3 = pd.Series({'Configuration': 'BoW, with preprocessing', 'Accuracy (%)': acuracia3, 'Training matrix shape': d_train_b_pre})
resultado4 = pd.Series({'Configuration': 'TF-IDF, with preprocessing', 'Accuracy (%)': acuracia4, 'Training matrix shape': d_train_tf_pre})


df_resultados = pd.DataFrame([resultado1,resultado2,resultado3,resultado4])
df_resultados

| | Configuration | Accuracy (%) | Training matrix shape |
|---|---|---|---|
| 0 | BoW, without preprocessing | 78.93 | (11314, 130107) |
| 1 | TF-IDF, without preprocessing | 82.74 | (11314, 130107) |
| 2 | BoW, with preprocessing | 79.63 | (11314, 111379) |
| 3 | TF-IDF, with preprocessing | 83.23 | (11314, 111379) |
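To read the table more easily, the short calculation below derives the accuracy gain from preprocessing and the relative vocabulary reduction for each representation. The numbers are copied from the results above; the dictionary layout and variable names are illustrative.

```python
# (accuracy in %, vocabulary size) copied from the results table
results = {
    ("BoW", "raw"): (78.93, 130107),
    ("TF-IDF", "raw"): (82.74, 130107),
    ("BoW", "preprocessed"): (79.63, 111379),
    ("TF-IDF", "preprocessed"): (83.23, 111379),
}

summary = {}
for rep in ("BoW", "TF-IDF"):
    acc_raw, vocab_raw = results[(rep, "raw")]
    acc_pre, vocab_pre = results[(rep, "preprocessed")]
    gain = round(acc_pre - acc_raw, 2)                           # percentage points
    shrink = round(100 * (vocab_raw - vocab_pre) / vocab_raw, 1) # percent
    summary[rep] = (gain, shrink)
    print(f"{rep}: {gain:+.2f} points, vocabulary {shrink:.1f}% smaller")
```

Preprocessing improves accuracy by well under one percentage point in both cases while removing about 14% of the features. Switching from counts to TF-IDF accounts for the larger difference, roughly 3.6 to 3.8 points.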


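#### get_feature_names() before preprocessing

In [39]:
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
bagwords.get_feature_names()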

['00',
 '000',
 '0000',
 '00000',
 '000000',
 '00000000',
 '0000000004',
 '0000000005',
 '00000000b',
 '00000001',
 '00000001b',
 '0000000667',
 '00000010',
 '00000010b',
 '00000011',
 '00000011b',
 '0000001200',
 '00000074',
 '00000093',
 '000000e5',
 '00000100',
 '00000100b',
 '00000101',
 '00000101b',
 '00000110',
 '00000110b',
 '00000111',
 '00000111b',
 '00000315',
 '000005102000',
 '00000510200001',
 '000007',
 '00000ee5',
 '00001000',
 '00001000b',
 '00001001',
 '00001001b',
 '00001010',
 '00001010b',
 '00001011',
 '00001011b',
 '000010af',
 '00001100',
 '00001100b',
 '00001101',
 '00001101b',
 '00001110',
 '00001110b',
 '00001111',
 '00001111b',
 '000021',
 '000042',
 '000062david42',
 '000094',
 '0000vec',
 '0001',
 '00010000',
 '00010000b',
 '00010001',
 '00010001b',
 '00010010',
 '00010010b',
 '00010011',
 '00010011b',
 '000100255pixel',
 '00010100',
 '00010100b',
 '00010101',
 '00010101b',
 '00010110',
 '00010110b',
 '00010111',
 '00010111b',
 '00011000',
 '00011000b',
 '00

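#### get_feature_names() after preprocessing

In [37]:
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
bagwords.get_feature_names()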

['aa',
 'aaa',
 'aaaa',
 'aaaaaaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg',
 'aaaaagggghhhh',
 'aaah',
 'aaahh',
 'aaahhhh',
 'aaai',
 'aaaimit',
 'aaalexlcsmitedu',
 'aaamajors',
 'aab',
 'aabcichlidcom',
 'aabluearbortextcom',
 'aacalcomsocalcom',
 'aacc',
 'aachen',
 'aaclevelandfreenetedu',
 'aacrnsuinpfr',
 'aadams',
 'aadangermousemitreorg',
 'aadnsi',
 'aadscrsiemenscom',
 'aaexpolcsmitedu',
 'aafcsymasussexacuk',
 'aafdcbedcbceb',
 'aaffff',
 'aafffff',
 'aafnscclehighedu',
 'aafnzfidonetorg',
 'aafreenetcarleton',
 'aafreenetcarletonca',
 'aagrendalcorpsuncom',
 'aagumbyaltoscom',
 'aah',
 'aaigcapcorg',
 'aainetgwpadeccom',
 'aainsaneapanaorgau',
 'aajscnasagov',
 'aakeplerunhedu',
 'aalborg',
 'aaldoubocopperdenvercoloradoedu',
 'aalmchgurmeyaridiylpehpaifhnfmqqlchvcduajjebndih',
 'aalocutuscscoloradoedu',
 'aalternate',
 'aamazing',
 'aamir',
 'aammmaaaazzzzzziinnnnggggg',
 'aamothrasyredu',
 'aams',
 'aamydualuucp',
 'aan',
