# Natural Language Processing Tutorial
- NLP (Natural Language Processing) is shorthand for a wide array of techniques designed to help machines learn from text.
- In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [3]:
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")

### Analyzing our data: a first look

In [9]:
# Target: 1 = Disaster, Target: 0 = Not a disaster
train_df

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


- Filtering to find a non-disaster example

In [17]:
# Filter the rows / select the column / convert the values to an array / take the second element.
train_df[train_df["target"] == 0]["text"].values[1]

'I love fruits'

In [19]:
# Filter the rows / select the column / convert the values to an array / take the second element.
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'
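
The one-line filters above chain four steps. A minimal sketch that spells them out on a small made-up DataFrame (the rows in `toy_df` are illustrative and not taken from the competition data):

```python
import pandas as pd

# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame({
    "text": [
        "nice weather today",
        "flood warning downtown",
        "I love fruits",
        "wildfire spreading fast",
    ],
    "target": [0, 1, 0, 1],
})

mask = toy_df["target"] == 0          # 1) boolean mask, one True/False per row
non_disaster = toy_df[mask]           # 2) keep only the rows where the mask is True
texts = non_disaster["text"].values   # 3) select the column as a NumPy array
second = texts[1]                     # 4) take the second element (index 1)
print(second)
```

Changing `== 0` to `== 1` in the mask selects the disaster tweets instead.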

### Turning text into vectors (example)
- Premise: The words contained in each tweet are a good indicator of whether they're about a real disaster or not.
- Using: We'll use scikit-learn's CountVectorizer to count the words in each tweet and turn them into data our machine learning model can process.
- Note: a vector is, in this context, a set of numbers that a machine learning model can work with.

In [20]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [30]:
# Kind of object
#example_train_vectors

In [31]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

# Build a DataFrame from this information
#display(pd.DataFrame(example_train_vectors.todense()))

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


- The array above tells us that:
    - there are 54 distinct (unique) words across the first five tweets.
    - the first tweet contains only some of those unique words, each marked with a 1.

### Turning text into vectors (real data)

In [32]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

# For the test tweets we call only "transform", because "count_vectorizer" has already been fitted on the training data.
test_vectors = count_vectorizer.transform(test_df["text"])
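
The difference between `fit_transform` and `transform` is easiest to see on a tiny example. The sketch below uses made-up sentences (`toy_train` and `toy_test` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not from the competition files
toy_train = ["forest fire", "fire alarm"]
toy_test = ["forest earthquake"]

vec = CountVectorizer()
train_X = vec.fit_transform(toy_train)  # learns the vocabulary: alarm, fire, forest
test_X = vec.transform(toy_test)        # reuses it; "earthquake" was never seen, so it is ignored

print(train_X.shape, test_X.shape)
print(test_X.toarray())
```

Both matrices end up with the same number of columns, which is what lets a model trained on one make predictions on the other.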

In [36]:
#train_vectors.todense().shape

### Building our model

In [37]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
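
As a rough illustration of how the ridge penalty pushes weights toward 0, the sketch below fits `RidgeClassifier` with increasing `alpha` on synthetic data and compares the size of the learned weights. The data, the seed, and the `alpha` values are assumptions chosen for the demonstration, not settings used in this notebook:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Synthetic data (an assumption for illustration): 40 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A larger alpha means a stronger L2 penalty, which shrinks the weights toward 0
norms = {}
for alpha in (0.1, 1.0, 100.0):
    model = RidgeClassifier(alpha=alpha).fit(X, y)
    norms[alpha] = np.linalg.norm(model.coef_)
    print(alpha, norms[alpha])
```

The weight norm shrinks as `alpha` grows, but individual weights are not forced to be exactly 0, which matches the goal of not completely discounting any word.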

- Validating our model with cross-validation to estimate the score we can expect on the test set.

In [39]:
# We use cross-validation to evaluate the model.
# Arguments: the model, X (training data), y (target values), the number of folds (cv), and the scoring metric.
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.59421842, 0.56498283, 0.64113893])
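
Each of the three numbers above is the F1 score on one held-out fold. As a small sketch of what that metric measures, the example below computes F1 on made-up labels (`y_true` and `y_pred` are hypothetical) and then averages the three fold scores printed above:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# The three fold scores from the cross-validation output, averaged into one number
fold_scores = np.array([0.59421842, 0.56498283, 0.64113893])
print(f1, fold_scores.mean())
```

The mean of the fold scores (about 0.60) is a more stable summary than any single fold.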

- Training our model.

In [46]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier()

In [49]:
clf.get_params()

{'alpha': 1.0,
 'class_weight': None,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': None,
 'normalize': 'deprecated',
 'positive': False,
 'random_state': None,
 'solver': 'auto',
 'tol': 0.001}

### Submitting our model

In [55]:
sample_submission = pd.read_csv("./data/sample_submission.csv")

sample_submission["target"] = clf.predict(test_vectors)

sample_submission.set_index("id", drop=True, inplace=True)

In [58]:
sample_submission.to_csv("./submits/baseline_RidgeClassifier.csv")
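
Because `id` was set as the index, `to_csv` writes it as the first column of the file. A minimal sketch of the resulting layout, using made-up ids and predictions and an in-memory buffer instead of a real file (`toy_submission` is a hypothetical stand-in for `sample_submission`):

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for sample_submission
toy_submission = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 1]})
toy_submission = toy_submission.set_index("id")

# Write to an in-memory buffer instead of a file, then inspect the text
buffer = io.StringIO()
toy_submission.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The output has an `id,target` header followed by one row per test tweet.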