# **Natural Language Processing with Disaster Tweets**

The goal of this project is to predict which tweets are about real disasters and which are not. I use the dataset from the Kaggle competition, and my model starts from the hypothesis that the words contained in each tweet are a good indicator of whether it is about a real disaster.

## 1) Look at the datasets

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [4]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [5]:
train_df.tail()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |
| 7612 | 10873 | | | The Latest: More Homes Razed by Northern Calif... | 1 |


In [6]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


Five tweets that **are not** about disasters:

In [7]:
train_df[train_df["target"] == 0]["text"].values[0:5]

array(["What's up man?", 'I love fruits', 'Summer is lovely',
       'My car is so fast', 'What a goooooooaaaaaal!!!!!!'], dtype=object)

Five tweets that **are** about disasters:

In [8]:
train_df[train_df["target"] == 1]["text"].values[0:5]

array(['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
       'Forest fire near La Ronge Sask. Canada',
       "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
       '13,000 people receive #wildfires evacuation orders in California ',
       'Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school '],
      dtype=object)

## 2) Preprocessing

### 2-1) CountVectorizer

In [9]:
count_vectorizer = feature_extraction.text.CountVectorizer()

train_vectors = count_vectorizer.fit_transform(train_df["text"])

# Apply the vocabulary and settings learned from the training data to the test data as-is,
# so the test data is transformed in exactly the same way as the training data.
test_vectors = count_vectorizer.transform(test_df["text"])
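
The difference between `fit_transform` on the training text and `transform` on the test text can be seen on a toy corpus. This is a minimal sketch, not part of the original notebook: `toy_train`, `toy_test`, and `vec` are hypothetical names, and the sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for train_df["text"] and test_df["text"].
toy_train = ["forest fire near the town", "I love summer"]
toy_test = ["huge fire near the volcano"]

vec = CountVectorizer()
X_train = vec.fit_transform(toy_train)  # learns the vocabulary, then counts
X_test = vec.transform(toy_test)        # reuses the learned vocabulary only

# Columns follow the alphabetically sorted training vocabulary.
print(sorted(vec.vocabulary_))
# Words unseen during fit ("huge", "volcano") get no column and are dropped.
print(X_test.toarray())
```

Because the test matrix reuses the training vocabulary, both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.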

In [28]:
# text : 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
train_vectors[0]

<1x21637 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>
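
To see where the 13 stored elements come from, here is a rough pure-Python sketch of the counting step. It assumes the default `CountVectorizer` behaviour (lowercasing and keeping only tokens of two or more word characters); `TOKEN_PATTERN` and `tokenize` are illustrative names, not scikit-learn API.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tweet = "Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all"
counts = Counter(tokenize(tweet))

# Each distinct token corresponds to one non-zero entry in the sparse row.
print(len(counts))
print(sorted(counts))
```

The `#` is not a word character, so `#earthquake` is counted as `earthquake`. The other 21,624 columns of the row are zero, which is why a sparse format is used.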

In [19]:
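# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions, use get_feature_names_out() instead.
count_vectorizer.get_feature_names()[:100]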

['00',
 '000',
 '0000',
 '007npen6lg',
 '00cy9vxeff',
 '00end',
 '00pm',
 '01',
 '02',
 '0215',
 '02elqlopfk',
 '02pm',
 '03',
 '030',
 '033',
 '034',
 '039',
 '03l7nwqdje',
 '04',
 '05',
 '05th',
 '06',
 '060',
 '061',
 '06jst',
 '07',
 '073izwx0lb',
 '08',
 '0840728',
 '0853',
 '087809233445',
 '0880',
 '08lngclzsj',
 '09',
 '0abgfglh7x',
 '0ajisa5531',
 '0blkwcupzq',
 '0btniwagt1',
 '0bvk5tub4j',
 '0c1y8g7e9p',
 '0cr74m1uxm',
 '0cxm5tkz8y',
 '0dqjeretxu',
 '0drqlrsgy5',
 '0dxvz7fdh3',
 '0erisq25kt',
 '0f8xa4ih1u',
 '0fekgyby5f',
 '0fs9ksv5xk',
 '0ghk693egj',
 '0gidg9u45j',
 '0gknpy4lua',
 '0h7oua1pns',
 '0iw6drf5x9',
 '0iyuntxduv',
 '0jfnvaxfph',
 '0jmkdtcymj',
 '0kccg1bt06',
 '0keh2treny',
 '0krw1zyahm',
 '0l',
 '0la1aw9uud',
 '0llwuqn8vg',
 '0lmheaex9k',
 '0lpu0gr2j0',
 '0m1tw3datd',
 '0mcxc68gzd',
 '0migwcmtje',
 '0mnpcer9no',
 '0npzp',
 '0nr4dpjgyl',
 '0oms8ri3l1',
 '0pamznyyuw',
 '0q040stkcv',
 '0r03c6njli',
 '0rny349unt',
 '0rokdutyun',
 '0rsverlztm',
 '0s6ydfrwdq',
 '0sa6xx1o

Let's start by assuming that the count vectors and the target value (0 or 1) are linearly related. To evaluate on the training data, I'll use cross-validation with the F1 score as the metric, since that is what this competition uses.

In [21]:
ln_clf = linear_model.RidgeClassifier()

scores = model_selection.cross_val_score(ln_clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.60387232, 0.57607656, 0.64485082])
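
For reference, the F1 score reported above is the harmonic mean of precision and recall on the positive class. The following is a small from-scratch sketch of that definition; `f1_from_labels` and the toy labels are hypothetical and only meant to make the formula concrete, not to replace scikit-learn's implementation.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the binary F1 score, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_f1 = f1_from_labels(toy_true, toy_pred)
print(toy_f1)
```

Unlike plain accuracy, this metric ignores true negatives, so a model cannot score well simply by predicting "not a disaster" for every tweet.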

Additional approaches to try: TF-IDF, LSA, LSTMs, RNNs, ...
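
As one possible way to try the first of these ideas, TF-IDF features could be swapped in for raw counts. The sketch below is an assumption about how that might look, not something the notebook ran: the toy texts and the names `toy_texts`, `toy_targets`, and `tfidf_clf` are hypothetical, and the resulting scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train_df["text"] and train_df["target"].
toy_texts = [
    "forest fire spreading near the town",
    "earthquake damages homes in the city",
    "flood waters force evacuation orders",
    "wildfire smoke pours into a school",
    "I love fruits",
    "summer is lovely",
    "my car is so fast",
    "what a great goal",
]
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is re-fit on the
# training part of each fold, so held-out text never shapes the vocabulary.
tfidf_clf = make_pipeline(TfidfVectorizer(), RidgeClassifier())
toy_scores = cross_val_score(tfidf_clf, toy_texts, toy_targets, cv=2, scoring="f1")
print(toy_scores)
```

On the real data, the same pipeline could be passed `train_df["text"]` and `train_df["target"]` directly, with `cv=3` to match the cross-validation above.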

In [30]:
ln_clf.fit(train_vectors, train_df['target'])

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

In [41]:
mysample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [42]:
mysample_submission["target"] = ln_clf.predict(test_vectors)

mysample_submission.head()

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |


In [44]:
mysample_submission.to_csv("mysubmission.csv", index=False)