![](https://d33wubrfki0l68.cloudfront.net/d5cbc4b0e14c20f877366b69b9171649afe11fda/d96a8/assets/images/bigram-hmm/pos-title.jpg)

# Introduction

When building machine learning models for NLP, people tend to use techniques that work directly on the words in the text, such as bag of words. However, one could wonder: is there a difference in the frequency of adjectives, verbs and nouns between disaster tweets and ordinary tweets? This kind of feature usually works best on long texts (above roughly 100 words), but it may be worth trying here. It has also been used to detect fake news with very good accuracy [by the authors of this GitHub repo](https://github.com/roneysco/Fake.br-Corpus).

Quoting Wikipedia:
> In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. 

Basically, we want to use the NLTK library to POS-tag each tweet and count how often each grammatical class appears in it.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import re
import matplotlib.pyplot as plt
from collections import Counter

import os
print("File Folders:")
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

train["isTrain"] = True
test["isTrain"] = False

full = pd.concat([train, test])
full

# Feature engineering

The number of links and tags (mentions like @Someone) seems to be an important factor, as we tend to share links to the actual news when talking about disasters. Hashtags also seem to be important, along with their number in a given tweet, so we'll track those as well.

In [None]:
def get_at(row):
    # raw strings keep "\w" from being read as an invalid string escape
    return re.findall(r"@\w+", row["text"])

def get_http(row):
    return re.findall(r"http[:/.\w]+", row["text"])

def get_hashtags(row):
    return re.findall(r"#\w+", row["text"])

def number_of_tags(row):
    return len(row["tags"])

def number_of_links(row):
    return len(row["links"])

def number_of_hashs(row):
    return len(row["hashtags"])

def clean_text(row):
    clean = row["text"]
    
    if len(row["tags"]) != 0:
        for word in row["tags"]:
            clean = clean.replace(word, "")
    
    if len(row["links"]) != 0:
        for word in row["links"]:
            clean = clean.replace(word, "")
    
    # remove only the # symbol (the hashtag word stays), plus slashes and parentheses
    clean = clean.replace("#", "").replace("/", "").replace("(", "").replace(")", "")
    
    return clean.strip()

full["tags"] = full.apply(lambda row: get_at(row), axis = 1)
full["links"] = full.apply(lambda row: get_http(row), axis = 1)
full["hashtags"] = full.apply(lambda row: get_hashtags(row), axis = 1)

full["number_of_tags"] = full.apply(lambda row: number_of_tags(row), axis = 1)
full["number_of_links"] = full.apply(lambda row: number_of_links(row), axis = 1)
full["number_of_hashs"] = full.apply(lambda row: number_of_hashs(row), axis = 1)

full["clean_text"] = full.apply(lambda row: clean_text(row), axis = 1)
full.sample(5)
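
To make the extraction and cleaning steps concrete, here is a minimal sketch that applies the same patterns to a single string. The tweet is made up for illustration and does not come from the competition data.

```python
import re

# Made-up tweet, used only to illustrate the patterns.
tweet = "@CityAlerts Flooding reported downtown #flood #staysafe http://t.co/abc123"

mentions = re.findall(r"@\w+", tweet)
links = re.findall(r"http[:/.\w]+", tweet)
hashtags = re.findall(r"#\w+", tweet)

# Drop mentions and links entirely, but keep the hashtag words.
clean = tweet
for token in mentions + links:
    clean = clean.replace(token, "")
clean = clean.replace("#", "").strip()

print(mentions)  # ['@CityAlerts']
print(links)     # ['http://t.co/abc123']
print(hashtags)  # ['#flood', '#staysafe']
print(clean)     # Flooding reported downtown flood staysafe
```

Note that removing a mention or link from the middle of a tweet leaves two consecutive spaces behind; the tokenizer used in the next step ignores them.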

We have cleaned our texts and stored links, hashtags and tags; now it's time for the real deal. We'll first tokenize our texts and then use the tokens to get our grammatical classes.

In [None]:
from nltk.tokenize import word_tokenize

def get_tokens(row):
    return word_tokenize(row["clean_text"].lower())

full["tokens"] = full.apply(lambda row: get_tokens(row), axis = 1)
full.sample(5, random_state = 4)

In [None]:
# Sample token list, handy for trying nltk.pos_tag(s) by hand
s = ["screams", "in", "the", "distance"]

def get_postags(row):
    # nltk.pos_tag returns (token, tag) pairs; keep only the tags
    postags = nltk.pos_tag(row["tokens"])
    return [tag for _, tag in postags]

full["postags"] = full.apply(lambda row: get_postags(row), axis = 1)
full.sample(5, random_state = 4)
# nltk.help.upenn_tagset('NNS')

Now we have the POS tags for every text. There are lots of categories, and I'll focus on only a few of them. Here are the meanings:

- NN: noun (there are other categories that can fit within this one for our purposes, such as NNS, NNP, NNPS, which all belong to nouns, containing plurals and proper names)
- RB: adverb
- VB: verb (and similar categories indicating tense or form: VBD, VBG, VBN, VBP, VBZ)
- JJ: adjective or ordinal numeral (JJR and JJS cover comparatives and superlatives)


**Note:** *If you want to get more information about the classes that pos_tag can identify, use the command `nltk.help.upenn_tagset("NNS")` for instance. If you want to see all the similar categories, you can just write the first letter of the class: `nltk.help.upenn_tagset("N")`.*
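
The prefix grouping described above can be sketched on a hand-tagged sentence. The tags below were written by hand for illustration, not produced by `nltk.pos_tag`, and `class_frequency` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written Penn Treebank tags for
# "the storms quickly destroyed two old bridges" (illustrative only).
tags = ["DT", "NNS", "RB", "VBD", "CD", "JJ", "NNS"]

def class_frequency(tags, prefix):
    """Share of tags that start with `prefix` (0.0 for an empty list)."""
    if not tags:
        return 0.0
    counts = Counter(tags)
    matching = sum(n for tag, n in counts.items() if tag.startswith(prefix))
    return matching / len(tags)

# "NN" also catches NNS, NNP and NNPS; "VB" catches VBD, VBG and so on.
for prefix in ("NN", "VB", "JJ", "RB"):
    print(prefix, round(class_frequency(tags, prefix), 3))
```

Dividing by the number of tags turns the counts into frequencies, so short and long tweets become comparable.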

In [None]:
def find_no_class(count, class_name = ""):
    total = 0
    for key in count.keys():
        if key.startswith(class_name):
            total += count[key]
            
            
    return total

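def get_classes(row, grammatical_class = ""):
    # A tweet that is empty after cleaning has no tags; avoid dividing by zero
    if len(row["postags"]) == 0:
        return 0.0
    count = Counter(row["postags"])
    return find_no_class(count, class_name = grammatical_class)/len(row["postags"])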

full["freqAdverbs"] = full.apply(lambda row: get_classes(row, "RB"), axis = 1)
full["freqVerbs"] = full.apply(lambda row: get_classes(row, "VB"), axis = 1)
full["freqAdjectives"] = full.apply(lambda row: get_classes(row, "JJ"), axis = 1)
full["freqNouns"] = full.apply(lambda row: get_classes(row, "NN"), axis = 1)

full.sample(5)

In [None]:
training = full.loc[full["isTrain"] == True, :].copy()
testing = full.loc[full["isTrain"] == False, :].copy()

## A few visualizations of the POS for disaster and non-disaster tweets

Note: non-disaster tweets are shown in blue

### Noun frequency

In [None]:
training.loc[training["target"] == 0.0, "freqNouns"].hist(alpha = 0.5);
training.loc[training["target"] == 1.0, "freqNouns"].hist(alpha = 0.5);

### Verb Frequency

In [None]:
training.loc[training["target"] == 0.0, "freqVerbs"].hist(alpha = 0.5);
training.loc[training["target"] == 1.0, "freqVerbs"].hist(alpha = 0.5);

### Adjective frequency

In [None]:
training.loc[training["target"] == 0.0, "freqAdjectives"].hist(alpha = 0.5);
training.loc[training["target"] == 1.0, "freqAdjectives"].hist(alpha = 0.5);

The frequency distributions of the three grammatical classes look about the same for disaster and non-disaster tweets. The most prominent difference is the number of instances: there are more non-disaster than disaster tweets in the training dataset. This distribution study is important because there are cases (such as the fake news detection mentioned earlier) in which the two classes have statistically different distributions, and that fact may help in detection.
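
Eyeballing histograms only goes so far. One way to check whether two groups differ is a permutation test on the difference of means. The sketch below is a generic illustration of that idea, not something the original analysis ran; the function name and all sample values are made up.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up noun frequencies for three hypothetical groups of tweets.
group_a = [0.30, 0.35, 0.40, 0.32, 0.38, 0.36, 0.33, 0.37]
group_b = [0.31, 0.36, 0.39, 0.33, 0.37, 0.35, 0.34, 0.38]
group_c = [0.60, 0.65, 0.70, 0.62, 0.68, 0.66, 0.63, 0.67]

p_similar = permutation_test(group_a, group_b)
p_shifted = permutation_test(group_a, group_c)
print(p_similar, p_shifted)
```

A large p-value, as for the two similar groups, means the data give no evidence of a difference; a small one, as for the shifted group, suggests the feature may help separate the classes.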

We are ready to build a first model from the variables we have. Let's evaluate it with stratified 5-fold cross-validation.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

x = training.loc[:, ["number_of_tags", "number_of_links", "freqAdverbs", "freqVerbs", "freqAdjectives", "freqNouns"]]
y = training.loc[:, "target"]

skf = StratifiedKFold(n_splits=5)
skf.get_n_splits(x, y)

for train_index, test_index in skf.split(x, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    # skf.split yields positional indices, so select rows with .iloc
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    clf = GradientBoostingClassifier(learning_rate=0.1, max_depth= 5, max_features = 5,random_state = 42)
#     clf = RandomForestClassifier(random_state = 42)
    
    clf.fit(x_train, y_train)
    preds = clf.predict(x_test)
    
    print(accuracy_score(y_test, preds))
    
    print(confusion_matrix(y_test, preds))


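# clf is the model fitted on the last fold; it has already seen most of x,
# so this matrix is more optimistic than the per-fold ones above
total_preds = clf.predict(x)
print("Confusion Matrix:")
confusion_matrix(y, total_preds)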
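
`StratifiedKFold` builds folds that keep the class proportions of the full dataset. The sketch below illustrates that idea in plain Python with a round-robin assignment; it is a simplified stand-in, not scikit-learn's actual algorithm, and the labels are made up.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, spreading every class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Made-up labels: 12 non-disaster (0) and 8 disaster (1) tweets.
labels = [0] * 12 + [1] * 8
folds = stratified_folds(labels, n_splits=4)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives / len(fold))  # each fold: 5 samples, 0.4 positive
```

Because every fold mirrors the overall 40% share of positive labels, accuracy measured on one fold is comparable to accuracy measured on another.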

And the feature importance of the last fitted model is:

In [None]:
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(20,15))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, x.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

After a few tests, our simple model using only these engineered variables reaches approximately 60% accuracy. The feature importance plot also shows that the number of links in a tweet plays a big role in detecting disaster tweets, so it could be a useful variable in other kernels to improve scores. Before ending the notebook, let's check the distribution of the number of links for both disaster and non-disaster tweets.
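
An accuracy figure is easier to interpret next to the accuracy of always predicting the most common class. The sketch below computes that baseline; the helper name and the labels are hypothetical, and on the real data you would pass `training["target"]` instead.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up labels, not the competition data: 7 non-disaster, 3 disaster.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(majority_baseline_accuracy(labels))  # 0.7
```

A model is only adding information when its cross-validated accuracy is above this baseline.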

In [None]:
training.loc[training["target"] == 0.0, "number_of_links"].hist(alpha = 0.5);
training.loc[training["target"] == 1.0, "number_of_links"].hist(alpha = 0.5);

Indeed, most non-disaster tweets have no links at all, while most disaster ones have at least one.

Now we'll just score the test dataset and make the submission.

In [None]:
preds = clf.predict(testing.loc[:, ["number_of_tags", "number_of_links", "freqAdverbs", "freqVerbs", "freqAdjectives", "freqNouns"]])
testing["prediction"] = preds

In [None]:
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
submission["target"] = preds.astype(int)
submission.to_csv("submission.csv", index = False)