# Disaster Tweet Identification Model Using NLP on Twitter Data

### Synopsis

Given a set of tweets, we need to identify whether each tweet is related to a disaster or not.

- ML Model used : **RidgeClassifier**

- Accuracy : **0.78**


In [1]:
# Importing required libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

### Input Data

The dataset we will be using contains an index (`id`) and its corresponding tweet (`text`).

Training data:
- It should contain
    - index,
    - tweet,
    - category number: 1 if the tweet is categorised as a disaster, otherwise 0
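
The expected layout can be checked before any modelling. The following is a minimal sketch, assuming the columns are named `id`, `text` and `target` as in the competition files; the toy rows and the helper `check_training_frame` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the training data; these rows are made up.
toy_train = pd.DataFrame({
    "id": [1, 4, 5, 6],
    "text": [
        "Forest fire near the town",
        "I love fruits",
        "Flood warning issued for the valley",
        "Summer is lovely",
    ],
    "target": [1, 0, 1, 0],
})


def check_training_frame(df):
    """Validate the expected columns and return the class balance as a dict."""
    required = {"id", "text", "target"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # The category number must be 1 (disaster) or 0 (not a disaster).
    if not df["target"].isin([0, 1]).all():
        raise ValueError("target must contain only 0 and 1")
    return df["target"].value_counts().to_dict()


balance = check_training_frame(toy_train)
print(balance)
```

The same helper could be called on `train_df` once it has been loaded, to see how many tweets fall into each category.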

In [2]:
# Creating dataframes from input data
train_df = pd.read_csv("../input/nlp-getting-started/train.csv")
test_df = pd.read_csv("../input/nlp-getting-started/test.csv")

In [3]:
submission_csv = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")
submission_csv.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [4]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [6]:
train_df[train_df["target"] == 0]["text"].values

array(["What's up man?", 'I love fruits', 'Summer is lovely', ...,
       'These boxes are ready to explode! Exploding Kittens finally arrived! gameofkittens #explodingkittens\x89Û_ https://t.co/TFGrAyuDC5',
       'Sirens everywhere!',
       'I just heard a really loud bang and everyone is asleep great'],
      dtype=object)

### Building the model

Unlike other machine learning models, here we need to focus on only a single column of the dataframe, i.e. "text".

We treat each entry in the text column as a tweet, and we need to classify each tweet as '**disaster type**' or **not**.

In [7]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

print(example_train_vectors[0].todense().shape)

(1, 54)
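
The shape `(1, 54)` means the first tweet is represented as a single row with 54 columns, one per distinct token the vectorizer extracted from the five tweets. A minimal sketch on two made-up sentences (not taken from the dataset) shows how those counts are laid out; the names `toy_docs`, `toy_vectorizer` and `toy_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only to illustrate the bag-of-words layout.
toy_docs = ["fire fire in the forest", "the forest is calm"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
# ['calm', 'fire', 'forest', 'in', 'is', 'the']

# One row per document, one column per token, each cell holding a count.
print(toy_counts.toarray())
# [[0 2 1 1 0 1]
#  [1 0 1 0 1 1]]
```

Word order is discarded: each tweet is reduced to how often each vocabulary token occurs in it.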


In [8]:

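# Learn the vocabulary from the training tweets and convert them to count vectors
train_vectors = count_vectorizer.fit_transform(train_df["text"])
# Reuse the same vocabulary for the test tweets (transform only, no refitting)
test_vectors = count_vectorizer.transform(test_df["text"])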

In [9]:
# Creating Ridge Classifier object
clf = linear_model.RidgeClassifier()

In [10]:
# 3-fold cross-validation on the training data, scored with F1
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.59421842, 0.56498283, 0.64113893])
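
The three fold scores above average to about 0.60. In the cells above, the vectorizer is fitted on all training tweets before cross-validation, so the tweets in each validation fold also contribute to the vocabulary. One possible alternative, sketched here on made-up tweets rather than the competition data, wraps the vectorizer and the classifier in a pipeline so that the vocabulary is relearned from the training part of every fold. The names `toy_texts`, `toy_labels`, `toy_pipeline` and `toy_scores` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up tweets: six labelled as disaster (1), six as not (0).
toy_texts = [
    "forest fire spreading near the village",
    "earthquake damages buildings downtown",
    "flood waters rising after the storm",
    "wildfire evacuation ordered tonight",
    "earthquake and flood warnings issued",
    "storm and fire destroy several homes",
    "i love fruits in summer",
    "summer is lovely this year",
    "what a great movie tonight",
    "my cat is asleep on the sofa",
    "lovely dinner with great friends",
    "new music album is great",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline refits CountVectorizer inside each fold, then trains the classifier.
toy_pipeline = make_pipeline(CountVectorizer(), RidgeClassifier())

toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=3, scoring="f1")
print(toy_scores, toy_scores.mean())
```

On the real data the same idea would pass `train_df["text"]` and `train_df["target"]` to `cross_val_score`; the resulting scores may differ slightly from the ones reported above.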

In [11]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier()

In [12]:
submission_csv["target"] = clf.predict(test_vectors)
submission_csv.head()

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,0
4,11,1


At last, the output dataframe can be written to a CSV file for easier access.

In [13]:
submission_csv.to_csv("submission.csv", index=False)