<a href="https://colab.research.google.com/github/Ruby1517/NLP-Disaster-Tweets-Prediction/blob/main/NLP_disaster_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP Tutorial
NLP, or Natural Language Processing, is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing

from sklearn import feature_extraction, linear_model, model_selection, preprocessing


For more information about scikit-learn's `feature_extraction` module, see https://scikit-learn.org/stable/modules/feature_extraction.html

In [2]:
train_df = pd.read_csv("drive/MyDrive/NLP-disaster-tweet/data/train.csv")
test_df = pd.read_csv("drive/MyDrive/NLP-disaster-tweet/data/test.csv")

In [3]:
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


### A quick look at our data

Let's look at our data, starting with a tweet that is not about a disaster.

In [4]:
# A tweet that is not about a disaster (second match, by position)
train_df[train_df["target"] == 0]["text"].values[1]

'I love fruits'

In [5]:
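# A disaster tweet (second match, by position).
# .values[1] selects by position; plain [1] would look up the row label 1,
# which only works when that row happens to be a disaster tweet.
train_df[train_df["target"] == 1]["text"].values[1]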

'Forest fire near La Ronge Sask. Canada'

### Building vectors
The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether it is about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data our machine learning model can process.

**Note:** a `vector` is, in this context, a set of numbers that a machine learning model can work with. We'll look at one in just a second.

In [6]:
# Count the words in each tweet and turn them into numbers
count_vectorizer = feature_extraction.text.CountVectorizer()

# Let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])
example_train_vectors

<5x54 sparse matrix of type '<class 'numpy.int64'>'
	with 61 stored elements in Compressed Sparse Row format>

In [7]:
# We use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


The above tells us that:
1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens: each non-zero count above marks a token that does appear in the first tweet.


In [8]:
# Let's create vectors for all of our tweets
train_vectors = count_vectorizer.fit_transform(train_df["text"])

# Note: we're not using .fit_transform() here. Using just .transform()
# makes sure that the tokens in the train vectors are the only ones mapped to the test vectors -
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])

## Our model
As we mentioned above, we think the words contained in each tweet are a good indicator of whether it is about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a *linear* connection. So let's build a linear model and see!

We'll use scikit-learn's `RidgeClassifier`, a classifier built on ridge regression.

This classifier first converts the target values into {-1, 1} and then treats the problem as a regression task (multi-output regression in the multiclass case).
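
The description above can be checked on a small synthetic dataset. The sketch below is an illustration under assumptions (random features, made-up labels, default `alpha=1.0`): it fits a plain `Ridge` regressor on targets recoded as -1 and 1, thresholds its output at 0, and compares the result with `RidgeClassifier`.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeClassifier

# Hypothetical synthetic data: 40 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # made-up binary labels

# The classifier, as used in this notebook
ridge_clf = RidgeClassifier(alpha=1.0).fit(X, y)

# The same idea by hand: recode {0, 1} as {-1, 1}, regress, threshold at 0
y_signed = np.where(y == 1, 1.0, -1.0)
ridge_reg = Ridge(alpha=1.0).fit(X, y_signed)
manual_pred = (ridge_reg.predict(X) > 0).astype(int)

print(np.array_equal(manual_pred, ridge_clf.predict(X)))
```

On this data the two sets of predictions agree, which is what we would expect if the classifier is ridge regression on recoded targets.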

In [9]:
# Our vectors are really big, so we want to push our model's weights
# toward 0 without completely discounting different words - ridge regression
clf = linear_model.RidgeClassifier()

Let's test our model and see how it does on the training data. For this we'll use `cross-validation`, where we train on a portion of the known data, then validate on the rest. If we do this several times (with different portions) we can get a good idea of how a particular model or method performs.
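
To show what happens under the hood, here is a sketch that runs the folds by hand on synthetic data (the dataset and variable names are assumptions for illustration). For a classifier and an integer `cv`, `cross_val_score` splits with unshuffled `StratifiedKFold`, so the manual loop should reproduce its scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic binary classification data
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X, y):
    model = RidgeClassifier()
    model.fit(X[train_idx], y[train_idx])     # train on two thirds
    val_pred = model.predict(X[val_idx])      # validate on the remaining third
    manual_scores.append(f1_score(y[val_idx], val_pred))

library_scores = cross_val_score(RidgeClassifier(), X, y, cv=3, scoring="f1")

print(manual_scores)
print(library_scores)
```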

In [10]:
# The metric for this competition is F1, so let's use that
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.59453669, 0.5642787 , 0.64082434])
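
F1 is the harmonic mean of precision and recall. As a quick worked example on made-up labels (the lists below are assumptions, not competition data), we can compute it by hand and compare with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Count true positives, false positives, and false negatives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of the predicted disasters, how many were real
recall = tp / (tp + fn)     # of the real disasters, how many were found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```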

In [11]:
clf.fit(train_vectors, train_df["target"])
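
One caveat about the cross-validation above: the vectorizer was fitted on the whole training set before the folds were split, so each validation fold's vocabulary was already known. A possible refinement, sketched here on made-up tweets (the texts, labels, and names are assumptions), is to wrap both steps in a scikit-learn pipeline so the vectorizer is refitted inside every fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets and labels (1 = disaster, 0 = not)
toy_texts = [
    "forest fire spreading fast", "flood warning for the valley",
    "earthquake hits the coast", "wildfire evacuation ordered",
    "storm damage across town", "fire crews battle blaze",
    "i love fruits", "what a lovely day", "my cat is sleeping",
    "great coffee this morning", "new song is on fire", "going to the beach",
]
toy_targets = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier are fitted together on each training fold
pipeline = make_pipeline(CountVectorizer(), RidgeClassifier())
toy_scores = cross_val_score(pipeline, toy_texts, toy_targets, cv=3, scoring="f1")

# The fitted pipeline accepts raw text directly
pipeline.fit(toy_texts, toy_targets)
toy_predictions = pipeline.predict(["wildfire spreading near town"])
print(toy_scores, toy_predictions)
```

With only twelve examples the scores themselves mean little; the point is the structure, which avoids calling the vectorizer and the model separately.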

In [12]:
sample_submission = pd.read_csv("drive/MyDrive/NLP-disaster-tweet/data/sample_submission.csv")

In [13]:
sample_submission

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |
| ... | ... | ... |
| 3258 | 10861 | 0 |
| 3259 | 10865 | 0 |
| 3260 | 10868 | 0 |
| 3261 | 10874 | 0 |


In [14]:
sample_submission["target"] = clf.predict(test_vectors)

In [15]:
sample_submission

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |
| ... | ... | ... |
| 3258 | 10861 | 1 |
| 3259 | 10865 | 1 |
| 3260 | 10868 | 1 |
| 3261 | 10874 | 1 |


In [16]:
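# index=False keeps the DataFrame's row index out of the file,
# so it contains only the id and target columns
sample_submission.to_csv("drive/MyDrive/NLP-disaster-tweet/data/sample_submission_new.csv", index=False)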