# NLP examples
> Using Python and different NLP techniques to predict disaster tweets

- toc: true 
- badges: true
- comments: true
- categories: [NLP, fastai, zero-shot, scikit learn, bag of words, Python, kaggle]

![](ghtop_images/header2.png)

## [Competition Description](https://www.kaggle.com/competitions/nlp-getting-started)

Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. 


### Data

You'll need `train.csv`, `test.csv` and `sample_submission.csv`.

#### What should I expect the data format to be?

Each sample in the train and test set has the following information:

- The text of a tweet
- A keyword from that tweet (although this may be blank!)
- The location the tweet was sent from (may also be blank)

#### What am I predicting?

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

#### Files

- `train.csv` - the training set
- `test.csv` - the test set
- `sample_submission.csv` - a sample submission file in the correct format

#### Columns

- `id` - a unique identifier for each tweet
- `text` - the text of the tweet
- `location` - the location the tweet was sent from (may be blank)
- `keyword` - a particular keyword from the tweet (may be blank)
- `target` - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
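
As a rough illustration of the format described above, the sketch below parses rows in the `train.csv` layout using only the standard library (pandas' `read_csv` would work equally well). The two sample rows and the `load_rows` helper are made up for this example and are not taken from the competition data.

```python
import csv
import io

# Two made-up rows in the train.csv layout (not real competition data).
SAMPLE = """id,keyword,location,text,target
1,,,Hypothetical tweet about a sunny afternoon,0
2,wildfire,Somewhere,Hypothetical tweet about a wildfire evacuation,1
"""


def load_rows(handle):
    """Read tweets from an open CSV file, mapping blank fields to None."""
    rows = []
    for row in csv.DictReader(handle):
        target = row.get("target")  # absent in test.csv
        rows.append({
            "id": int(row["id"]),
            "keyword": row["keyword"] or None,    # may be blank
            "location": row["location"] or None,  # may be blank
            "text": row["text"],
            "target": int(target) if target not in (None, "") else None,
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["keyword"], rows[1]["target"])
```

For the real files, the same helper could be called on `open("train.csv", newline="", encoding="utf-8")`.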



## Analysis methods used

Code used for the three methods:

- https://www.kaggle.com/code/thomassimm/nlp-disaster-tweets-one-shot
- https://www.kaggle.com/code/thomassimm/nlp-disaster-scikit
- https://www.kaggle.com/code/thomassimm/nlp-disaster-nn

### [Zero-shot](https://www.kaggle.com/code/thomassimm/nlp-disaster-tweets-one-shot)

The main advantage of the [zero-shot](https://joeddav.github.io/blog/2020/05/29/ZSL.html) learning method is that it needs neither labelled data nor any task-specific training. The trade-off is lower accuracy than a model trained on the task.

> **What is zero-shot learning?**
>
> Traditionally, zero-shot learning (ZSL) most often referred to a fairly specific type of task: learn a classifier on one set of labels and then evaluate on a different set of labels that the classifier has never seen before. Recently, especially in NLP, it's been used much more broadly to mean get a model to do something that it wasn't explicitly trained to do. A well-known example of this is in the GPT-2 paper where the authors evaluate a language model on downstream tasks like machine translation without fine-tuning on these tasks directly.
>
> Source: https://joeddav.github.io/blog/2020/05/29/ZSL.html
    
The code used here is as follows:

`from transformers import pipeline`

`classifier = pipeline("zero-shot-classification", device=0)`

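The classifier can then be called as follows, where `TEXT` is the string to classify (i.e. the tweet) and `LABEL` is the candidate label (here, "disaster"). Recent versions of transformers name the last argument `multi_label`; `multi_class` is the older, deprecated name:

`classout = classifier(TEXT, LABEL, multi_label=True)`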
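
The classifier returns a dictionary with parallel `labels` and `scores` lists, so a 0/1 prediction still has to be derived from it. The sketch below shows one possible way to do that. The `to_prediction` helper, the 0.5 threshold and the example result are assumptions for illustration, not the notebook's actual code.

```python
def to_prediction(result, label="disaster", threshold=0.5):
    """Map a zero-shot pipeline result to 1 (disaster) or 0 (not a disaster)."""
    # The pipeline returns parallel lists of labels and scores.
    scores = dict(zip(result["labels"], result["scores"]))
    # Threshold of 0.5 is an assumption; it could be tuned on train.csv.
    return int(scores[label] >= threshold)


# Hand-written result in the shape the pipeline returns (score is made up).
example = {
    "sequence": "Hypothetical tweet about a wildfire evacuation",
    "labels": ["disaster"],
    "scores": [0.93],
}
print(to_prediction(example))  # prints 1
```

With the real pipeline this would be called as `to_prediction(classifier(TEXT, "disaster", multi_label=True))`.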


> The accuracy is ~60%, but remember that no training is done.


### [Bag of words with scikit-learn](https://www.kaggle.com/code/thomassimm/nlp-disaster-scikit)


This method uses scikit-learn to classify the tweets with a *Bag of Words* approach. The tweets are encoded as a TF-IDF-weighted document-term sparse matrix, and several classifiers that handle sparse matrices efficiently are compared.

The methodology is taken from the [scikit-learn text classification example](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py).
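
To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It assumes whitespace tokenisation and the smoothed idf formula that scikit-learn documents, `ln((1 + n) / (1 + df)) + 1`, and it skips the final vector normalisation. The `tfidf` helper and the two example documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (unnormalised TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag of words: word order is discarded
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


vectors = tfidf(["fire in the forest", "the forest is quiet"])
# "fire" occurs in one document only, so it outweighs the shared word "the".
print(vectors[0]["fire"] > vectors[0]["the"])  # prints True
```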

| Method | fit_time (s) | score_time (s) | test_accuracy | train_accuracy | test_f1 | train_f1 |
|--------|--------------|----------------|---------------|----------------|---------|----------|
| Logistic Regression | 2.349514 | 0.006580 | 0.739398 | 0.977276 | 0.639793 | 0.973091 |
| Ridge Classifier | 0.079390 | 0.003698 | 0.736246 | 0.979509 | 0.631187 | 0.975799 |
| kNN | 0.004747 | 0.636052 | 0.720220 | 0.775220 | 0.601846 | 0.692391 |
| Random Forest | 50.920274 | 0.207107 | 0.691847 | 0.988999 | 0.500637 | 0.987122 |
| Linear SVC | 0.093184 | 0.004035 | 0.733750 | 0.900926 | 0.605876 | 0.872385 |
| log-loss SGD | 0.002576 | 0.000000 | NaN | NaN | NaN | NaN |
| NearestCentroid | 0.008251 | 0.005659 | 0.664523 | 0.772659 | 0.658099 | 0.751426 |
| Complement naive Bayes | 0.008391 | 0.003308 | 0.690007 | 0.979115 | 0.676552 | 0.975514 |
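
The column names in the table match what scikit-learn's `cross_validate` returns when it is asked for accuracy and F1 with train scores. The sketch below shows how one such row could be produced. The toy texts, the two folds and the default vectoriser settings are assumptions for illustration, not the notebook's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Made-up toy tweets; the real input would be the text column of train.csv.
texts = [
    "wildfire spreading fast near the town",
    "lovely sunny afternoon in the park",
    "earthquake damage reported downtown",
    "my exam was a total disaster lol",
    "flood waters rising evacuate now",
    "this new album is fire",
    "tornado warning take shelter immediately",
    "traffic was a nightmare this morning",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# TF-IDF features feeding a linear classifier, evaluated with cross-validation.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_validate(
    model, texts, labels,
    cv=2,  # assumption: two folds, only because the toy data is tiny
    scoring=["accuracy", "f1"],
    return_train_score=True,
)

# Average each metric over the folds, giving one table row per classifier.
for name, values in sorted(scores.items()):
    print(name, values.mean())
```

Swapping `LogisticRegression` for another estimator in the pipeline would give the other rows.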

> This gives ~80% on the results file.


### [NN with fastai](https://www.kaggle.com/code/thomassimm/nlp-disaster-nn)

This method is a two-step process. Both steps use a recurrent neural network (RNN) with the AWD-LSTM architecture:

- creating a language model for the data
- using this language model with labelled data to classify different texts

This method is demonstrated in the [fastai book](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb), which applies it to sentiment analysis (i.e. are reviews positive or negative) of IMDB data.
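
The first step (the language model) could look roughly like the sketch below. Only the encoder name `finetuned3` comes from this post. The `fine_tune_language_model` helper, the validation split, `drop_mult`, the number of epochs and the learning rate are assumptions in the style of the fastai book, not the notebook's actual settings.

```python
def fine_tune_language_model(df, encoder_name="finetuned3", epochs=1, lr=2e-2):
    """Step 1 sketch: fine-tune a pretrained AWD-LSTM language model on tweets.

    `df` is assumed to be a pandas DataFrame with a 'text' column.
    """
    # Imported lazily so the heavy fastai dependency is only needed when called.
    from fastai.text.all import AWD_LSTM, TextDataLoaders, language_model_learner

    # is_lm=True builds next-token-prediction batches, so no labels are needed.
    dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)
    learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
    learn.fit_one_cycle(epochs, lr)
    # Save only the encoder so the classifier can load it with load_encoder.
    learn.save_encoder(encoder_name)
    # The classifier's DataLoaders would typically reuse dls_lm.vocab so that
    # token ids match the saved encoder.
    return dls_lm
```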

The basics of the code for the classifier part are as follows.

Create the data loaders:

`dls = TextDataLoaders.from_df(df, text_col='text', label_col='target', valid_col='is_valid')`

Create a learner using `text_classifier_learner` and `AWD_LSTM`, and track the F1 score and accuracy:

`learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5)`

`learn.metrics = [FBeta(beta=1), accuracy]`

Load the encoder of the language model that was trained on the tweets in the first step:

`learn = learn.load_encoder('finetuned3')`

Then train with:

`learn.fit_one_cycle(2, 1e-2)`
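
The `FBeta(beta=1)` metric used on the learner is the F1 score, the harmonic mean of precision and recall. As a worked illustration, the sketch below computes it from raw counts. The `f_beta` helper and the counts are made up for this example.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    # beta = 1 weights precision and recall equally (the F1 score).
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts: precision = 6/8 = 0.75, recall = 6/10 = 0.6, so F1 = 2/3.
print(round(f_beta(tp=6, fp=2, fn=4), 4))  # prints 0.6667
```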


What surprised me was how low the final accuracy was, given the complexity of the model and the time taken (over 8 minutes). My implementation may need another look, or other RNN methods may be worth trying.

> This gives about 77% accuracy on the results file.

