# Assignment: Text Classification


## Kaggle Competition: https://www.kaggle.com/c/nlp-getting-started/overview

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.

### Dataset
Each sample in the train and test set has the following information:

* The text of a tweet
* A keyword from that tweet (although this may be blank!)
* The location the tweet was sent from (may also be blank)



## Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set:
* What assumptions can we make about the data?
* What problems are we expecting?

## Task 2: Feature Extraction
This week we focus on the extraction of text features. You are free to choose a ***word2vec*** or ***BERT*** embedding (BERT is harder to set up but gives better results).

### word2vec / doc2vec
* Use the implementation from the **gensim** library

#### API
* https://radimrehurek.com/gensim/models/word2vec.html
* https://radimrehurek.com/gensim/models/doc2vec.html
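
word2vec produces one vector per word, while a classifier needs one feature vector per tweet. A common and simple way to bridge that gap is to average the vectors of all tokens in a tweet. The sketch below is only an illustration under stated assumptions: `average_vector` and `toy_vectors` are hypothetical names, and the toy dictionary stands in for a trained model's word vectors. It assumes the vector store supports `in` checks and `[]` lookups by token.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Average the vectors of all known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No token of this tweet is in the vocabulary.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Toy 3-dimensional "embeddings" standing in for a trained model (hypothetical data).
toy_vectors = {
    "forest": np.array([1.0, 0.0, 0.0]),
    "fire": np.array([0.0, 1.0, 0.0]),
}

# Only "forest" and "fire" are known, so the result is the mean of those two vectors.
features = average_vector("forest fire near La Ronge".split(), toy_vectors, dim=3)
print(features)
```

Averaging discards word order, so it is a baseline rather than a strong representation; doc2vec or BERT embeddings are alternatives that encode more context.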


### BERT
* Keras tutorial on using pre-trained BERT: https://keras.io/examples/nlp/text_extraction_with_bert/
* Note: the fine-tuning step can be compute-intensive. For this assignment it is OK to use just the pre-trained model.


In [1]:
import pandas as pd
from time import time 
import spacy
from collections import defaultdict

In [2]:
data = pd.read_csv('train.csv' , encoding = "ISO-8859-1" )

In [3]:
data.head(20)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [5]:
data["target"].unique()

array([1, 0], dtype=int64)

In [6]:
data["location"].unique()

array([nan, 'Birmingham', 'Est. September 2012 - Bristol', ...,
       'Vancouver, Canada', 'London ', 'Lincoln'], dtype=object)

In [7]:
data.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [8]:
#data = data.dropna().reset_index(drop=True)
data["location"] = data["location"].fillna("")
data["keyword"] = data["keyword"].fillna("")

data.isnull().sum()

id          0
keyword     0
location    0
text        0
target      0
dtype: int64

In [9]:
data.drop('id', axis=1, inplace=True)

In [10]:
# Strip selected punctuation characters and all digits from the tweets.
# regex=True is passed explicitly because the default of `regex` differs between pandas versions.
data["text"] = data["text"].str.replace(r"[|@./&+\-():]", "", regex=True)
data["text"] = data["text"].str.replace(r"\d", "", regex=True)
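
To see what this cleaning step does to a single tweet, the same character removals can be reproduced with the standard `re` module. This is a minimal sketch: `clean_tweet` and the two pattern names are hypothetical helpers, and the sample string is made up for illustration.

```python
import re

# Same characters as removed in the cell above: | @ . / & + - ( ) :
PUNCT_PATTERN = re.compile(r"[|@./&+\-():]")
DIGIT_PATTERN = re.compile(r"\d")

def clean_tweet(text):
    """Remove the selected punctuation characters, then all digits."""
    text = PUNCT_PATTERN.sub("", text)
    return DIGIT_PATTERN.sub("", text)

print(clean_tweet("@user see http://t.co/x1"))  # user see httptcox
```

Note that URLs are not removed but collapsed into a single token (`httptcox`), and that removed characters leave their surrounding whitespace behind. Stripping URLs before the punctuation step may be worth considering.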

In [11]:
#data.head(20)
#data.info()

In [12]:
from gensim.models.phrases import Phrases, Phraser

In [13]:
sent = [row.split() for row in data['text']]
phrases = Phrases(sent, min_count=30, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]

In [14]:
# Count how often each token (or detected bigram) occurs in the corpus.
word_freq = defaultdict(int)
for sentence in sentences:
    for token in sentence:
        word_freq[token] += 1
len(word_freq)


28422

In [15]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['the', 'a', 'in', 'of', 'to', 'and', 'I', 'for', 'is', 'on']

## Task 3: Train a Model
* Which ML model would you choose and why?
* Train and evaluate the model using the train data
* Is the data balanced? What are the implications, and how can you deal with them?
* Discuss the results -> possible improvements?
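
The balance question above can be checked directly from the label column. The sketch below computes the class proportions and, as one possible remedy for imbalance, per-class weights using the heuristic `n_samples / (n_classes * class_count)`. The function names and `toy_labels` are hypothetical; the toy list stands in for the real `target` column.

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

# Toy labels (hypothetical): one disaster tweet, three non-disaster tweets.
toy_labels = [1, 0, 0, 0]
print(class_balance(toy_labels))
print(balanced_class_weights(toy_labels))
```

With the notebook's data frame, the same functions could be called as `class_balance(data["target"].tolist())`. Many classifiers accept such weights so that errors on the rarer class count more; resampling the training data is another option.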


In [16]:
import multiprocessing

from gensim.models import Word2Vec

cores = multiprocessing.cpu_count() # Count the number of cores in a computer

In [17]:
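w2v_model = Word2Vec(min_count=20,      # ignore words seen fewer than 20 times
                     window=2,          # context words on each side of the target
                     size=300,          # embedding dimension (named `vector_size` in gensim >= 4.0)
                     sample=6e-5,       # downsampling threshold for frequent words
                     alpha=0.03,        # initial learning rate
                     min_alpha=0.0007,  # final learning rate after linear decay
                     negative=20,       # number of negative samples per positive example
                     workers=cores)     # number of worker threads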

In [18]:
t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

Time to build vocab: 0.01 mins


In [19]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 0.31 mins


In [20]:
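# L2-normalize the vectors in place to save memory (gensim < 4.0 API; deprecated in gensim >= 4.0).
w2v_model.init_sims(replace=True)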

## Task 4: Evaluate
* The test data does not contain labels! -> You can upload your solution to Kaggle to get a score, or build your own test set by splitting the training data
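
Building your own test set by splitting the training data can be sketched with the standard library alone. The example below holds out a fraction of each class (a stratified split) and scores predictions with F1. Both function names are hypothetical, and choosing F1 as the metric is an assumption; check the competition's evaluation page for the official metric.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same label ratio in both."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): five disaster and five non-disaster tweets.
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 8 2
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

The held-out indices should be set aside before any model fitting, so that the reported score reflects tweets the classifier has not seen.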

In [22]:
w2v_model.wv.most_similar(positive=["US"])

[('Be', 0.9996354579925537),
 ('part', 0.9996006488800049),
 ('area', 0.9995798468589783),
 ('No', 0.9995774030685425),
 ('On', 0.9995753765106201),
 ('Nuclear', 0.9995514154434204),
 ('Israeli', 0.9995409846305847),
 ('state', 0.9995357990264893),
 ('Dead', 0.9995285868644714),
 ('Over', 0.999525785446167)]

In [26]:
w2v_model.wv.most_similar(positive=["natural"])

[('Nuclear', 0.9996929168701172),
 ('nuclear', 0.9996321201324463),
 ('state', 0.9996248483657837),
 ('Be', 0.9996000528335571),
 ('part', 0.9995694160461426),
 ('No', 0.9995278120040894),
 ('deaths', 0.9994902610778809),
 ('US', 0.9994351863861084),
 ('area', 0.9994331002235413),
 ('wild', 0.9993730187416077)]