# KAGGLE: Natural Language Processing with Disaster Tweets
AUTHOR: SungwookLE  
DATE: '21.7/5  

### Competition Description

 It’s not always clear whether a person’s words are actually announcing a disaster. [This tweet](https://storage.googleapis.com/kaggle-media/competitions/tweet_screenshot.png) explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.  
 In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a [quick tutorial](https://www.kaggle.com/philculliton/nlp-getting-started-tutorial) to get you up and running. Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.  
 Another [reference](https://www.kaggle.com/andreshg/nlp-glove-bert-tf-idf-lstm-explained#7.-LSTM) is here.  

### DATA
#### 1) What am I predicting
- Predict whether a given tweet is about a real disaster or not. If so, predict a `1`. If not, predict a `0`.  

#### 2) Columns
- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

### Before Starting, NLP Tutorial
NLP, or Natural Language Processing, is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.  

In this tutorial we'll look at this competition's dataset, use simple techniques to process it, build a machine learning model, and submit predictions for a score!


In [1]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
train_df = pd.read_csv("input/train.csv")
test_df = pd.read_csv("input/test.csv")

#### A quick look at our data
Let's look at our data... first, an example of what is NOT a disaster tweet

In [3]:
train_df.loc[train_df["target"] == 0 ]['text'].values[1]

'I love fruits'

And one that is:

In [4]:
train_df.loc[train_df["target"] == 1 ]['text'].values[1]

'Forest fire near La Ronge Sask. Canada'

#### Building vectors
The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether they're about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data our machine learning model can process.  

Note: a `vector` is, in this context, a set of numbers that a machine learning model can work with. We'll look at one in just a second.
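To make the idea of "counting words" concrete, here is a minimal pure-Python sketch of a bag-of-words encoder. It is not scikit-learn's implementation: the tweets are made up, the helper names (`tokenize`, `build_vocabulary`, `count_vector`) are hypothetical, and the tokenization (lowercasing, tokens of two or more word characters) is only assumed to resemble `CountVectorizer`'s defaults.

```python
import re
from collections import Counter

# Assumption: lowercase the text and keep tokens of 2+ word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split a text into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocabulary):
    """Return one count per vocabulary column for a single text."""
    counts = Counter(tokenize(text))
    # dicts preserve insertion order, so columns come out in sorted order
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical tweets, not taken from the competition data
toy_tweets = ["Forest fire near the lake", "I love the lake", "fire fire fire"]
vocab = build_vocabulary(toy_tweets)
vectors = [count_vector(t, vocab) for t in toy_tweets]

print(vocab)
for row in vectors:
    print(row)
```

Each tweet becomes a list with one slot per vocabulary word, so tweets of different lengths all end up with vectors of the same width.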

In [5]:
count_vectorizer = feature_extraction.text.CountVectorizer()
# let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [6]:
# we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


The above tells us that:
1. There are 54 unique words (or "tokens") in the first five tweets.  
2. The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that DO exist in the first tweet.
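The "sparse" idea can be illustrated with the vector printed above. scikit-learn returns a SciPy sparse matrix; the dictionary used below is only a simplified stand-in for that format, and the helper names `to_sparse` and `to_dense` are hypothetical.

```python
# First-tweet vector, copied from the output printed above
RAW = (
    "0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 "
    "0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0"
)
dense = [int(x) for x in RAW.split()]


def to_sparse(dense_row):
    """Keep only {column: count} entries whose count is non-zero."""
    return {col: n for col, n in enumerate(dense_row) if n != 0}


def to_dense(sparse_row, width):
    """Rebuild the full row, filling the missing columns with zeros."""
    row = [0] * width
    for col, n in sparse_row.items():
        row[col] = n
    return row


sparse = to_sparse(dense)
print(len(dense), "columns,", len(sparse), "stored entries")
print(sorted(sparse))
```

Only the non-zero columns need to be stored, which matters once the vocabulary grows to thousands of tokens and most entries in every row are zero.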

Now let's create vectors for all of our tweets.

In [7]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

# note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])
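The difference between fitting and transforming can be sketched in plain Python. This is a simplified illustration under stated assumptions (whitespace tokenization, made-up texts, hypothetical helper names), not scikit-learn's code: the vocabulary is learned from the training texts only, and tokens that appear only in the test text are skipped.

```python
def fit_vocabulary(texts):
    """Learn a token -> column mapping from the training texts only."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(text, vocabulary):
    """Count tokens using an existing vocabulary; unseen tokens are skipped."""
    row = [0] * len(vocabulary)
    for tok in text.lower().split():
        if tok in vocabulary:
            row[vocabulary[tok]] += 1
    return row


# Hypothetical texts for illustration
train_texts = ["flood warning issued", "love this song"]
test_text = "flood warning tsunami"

vocab = fit_vocabulary(train_texts)
test_row = transform(test_text, vocab)

print(vocab)
print(test_row)  # "tsunami" has no column, so it does not appear
```

Because the test row has exactly as many columns as the training vocabulary, a model trained on the training vectors can score it directly.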

#### Our model
As we mentioned above, we think the words contained in each tweet are a good indicator of whether they're about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.  
What we're assuming here is a linear connection. So let's build a linear model and see!

In [8]:
# Our vectors are really big, so we want to push our model's weights
# toward 0 without completely discounting different words - ridge regression
# is a good way to do this.

clf = linear_model.RidgeClassifier()
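How a ridge penalty pushes weights toward 0 can be seen in the simplest possible case: one feature and no intercept, where the ridge solution has the closed form w = Σxy / (Σx² + α). The sketch below uses made-up numbers and a hypothetical helper name; it illustrates the shrinkage effect only and is not how `RidgeClassifier` is implemented.

```python
def ridge_weight_1d(xs, ys, alpha):
    """Closed-form ridge solution for a single feature without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data that follows y = 2x exactly
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 10.0, 100.0):
    weight = ridge_weight_1d(xs, ys, alpha)
    print(f"alpha={alpha:>5}: weight={weight:.3f}")
```

With `alpha=0` the weight is the unpenalized value of 2.0; larger values of `alpha` shrink it toward 0 without ever setting it exactly to 0, which is the behavior described in the comment above.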

Let's test our model and see how well it does on the training data. For this we'll use `cross-validation` - where we train on a portion of the known data, then validate it with the rest. If we do this several times (with different portions) we can get a good idea of how a particular model or method performs.  
The metric for this competition is F1, so let's use that here.
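For reference, F1 is the harmonic mean of precision and recall. A small worked example in plain Python, using made-up labels and a hypothetical helper name, looks like this:

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1) of a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = real disaster, 0 = not a disaster
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(f1_score_binary(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and F1 is 2/3 as well.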

In [9]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df['target'], cv=3, scoring="f1")
scores

array([0.59421842, 0.56498283, 0.64113893])
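Each of the three numbers above comes from one train/validate split. The sketch below shows how such splits can be generated for plain k-fold cross-validation. It is a simplification with a hypothetical helper name: for a classifier with `cv=3`, scikit-learn uses stratified folds, which additionally keep the class proportions similar in every fold, and that step is omitted here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    # the first `extra` folds get one additional sample
    fold_sizes = [base + (1 if i < extra else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, valid_idx
        start = stop


folds = list(k_fold_indices(7, 3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "validate:", valid_idx)
```

Every sample lands in the validation set exactly once, so each score is measured on data the model did not see during that round of training.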

In [10]:
clf = linear_model.LogisticRegression(penalty='l2')

In [11]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df['target'], cv=3, scoring="f1")
scores

array([0.6387547 , 0.61347869, 0.68350669])
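A single number per model makes the two runs easier to compare. The snippet below simply averages the per-fold F1 scores printed in the outputs above; the variable names are chosen here for illustration.

```python
from statistics import mean

# Per-fold F1 scores copied from the two cross_val_score outputs above
ridge_scores = [0.59421842, 0.56498283, 0.64113893]
logreg_scores = [0.6387547, 0.61347869, 0.68350669]

print(f"RidgeClassifier mean F1:    {mean(ridge_scores):.3f}")    # 0.600
print(f"LogisticRegression mean F1: {mean(logreg_scores):.3f}")  # 0.645
```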

The above scores aren't terrible! It looks like our assumption will score roughly 0.65 on the leaderboard. There are lots of ways to potentially improve on this (TF-IDF, LSA, LSTM / RNNs, the list is long!), so give any of them a shot!  
In the meantime, let's make predictions on our test set and build a submission for the competition.

In [16]:
clf.fit(train_vectors, train_df['target'])

LogisticRegression()

In [17]:
sample_submission = pd.read_csv('input/sample_submission.csv')

In [18]:
sample_submission['target'] = clf.predict(test_vectors)

In [19]:
sample_submission.head()

   id  target
0   0       1
1   2       1
2   3       1
3   9       0
4  11       1


In [20]:
sample_submission.to_csv("output/submission.csv", index=False)
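Before uploading, it can be worth checking that the file has the expected shape. The sketch below assumes the submission consists of exactly two columns, `id` and `target`, as in the `sample_submission.head()` output above; the helper name `validate_submission` is hypothetical and the check runs on an in-memory string rather than the real file.

```python
import csv
import io


def validate_submission(csv_text):
    """Check the header and that every target is 0 or 1; return the row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        int(row["id"])  # raises ValueError if the id is not an integer
        if row["target"] not in {"0", "1"}:
            raise ValueError(f"bad target for id {row['id']}: {row['target']}")
        n_rows += 1
    return n_rows


# The first five rows shown above, written out as CSV text
sample = "id,target\n0,1\n2,1\n3,1\n9,0\n11,1\n"
print(validate_submission(sample), "rows look valid")
```

To check the real file, the same function could be given the contents of `output/submission.csv` read as text.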

Now, in the viewer, you can submit the above file to the competition! Good luck!

In [22]:
count_vectorizer = feature_extraction.text.CountVectorizer()
a=[['WHAT IS'], ['COUNT VECTORIZER'], ['SungwookLE TEST']]
a = pd.DataFrame(a, columns=['text'])
print(a)
print()
example_vectors = count_vectorizer.fit_transform(a["text"])
# we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_vectors[0].todense().shape)
print(example_vectors[0].todense())

               text
0           WHAT IS
1  COUNT VECTORIZER
2   SungwookLE TEST

(1, 6)
[[0 1 0 0 0 1]]
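The row above can be decoded back into words. The sketch assumes that `CountVectorizer` lowercases the text and orders its columns alphabetically by token, which is consistent with the printed output; the vocabulary is written out by hand here instead of being read from the fitted vectorizer.

```python
# Tokens from the three toy texts, lowercased and sorted alphabetically
vocabulary = sorted(["what", "is", "count", "vectorizer", "sungwookle", "test"])

# Vector for the first text ("WHAT IS"), copied from the output above
first_row = [0, 1, 0, 0, 0, 1]

for column, (token, count) in enumerate(zip(vocabulary, first_row)):
    print(column, token, count)

present = [tok for tok, n in zip(vocabulary, first_row) if n > 0]
print(present)
```

Columns 1 and 5 are the only non-zero entries, and they correspond to "is" and "what", the two words of the first text.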
