# A Kaggle competition

 https://www.kaggle.com/c/nlp-getting-started/overview

In [1]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, model_selection, linear_model, preprocessing

In [2]:
train_df = pd.read_csv('./nlp-getting-started/train.csv')
test_df = pd.read_csv('./nlp-getting-started/test.csv')

In [3]:
train_df.head(10)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | 8 | NaN | NaN | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | 10 | NaN | NaN | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | 13 | NaN | NaN | I'm on top of the hill and I can see a fire in... | 1 |
| 8 | 14 | NaN | NaN | There's an emergency evacuation happening now ... | 1 |
| 9 | 15 | NaN | NaN | I'm afraid that the tornado is coming to our a... | 1 |


### columns:

* id - a unique identifier for each tweet
* keyword - a particular keyword from the tweet (may be blank)
* location - the location the tweet was sent from (may be blank)
* text - the text of the tweet
* target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


There are **7613** tweets. The 'keyword' column has 61 null values and the 'location' column has 2533.

In [5]:
train_df.groupby('target').text.count()

target
0    4342
1    3271
Name: text, dtype: int64

**4342** tweets are not about real disasters and **3271** are about real disasters.
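To judge how balanced the two classes are, the counts can be turned into shares. The sketch below simply copies the two counts from the output above into a dictionary (the names `class_counts` and `shares` are illustrative, not from the original notebook); it suggests a split of roughly 57% to 43%, so the classes are only mildly imbalanced.

```python
# Class counts copied from the groupby output above.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())

# Share of each class in the training set.
shares = {label: count / total for label, count in class_counts.items()}

for label, share in shares.items():
    print(f"target={label}: {share:.1%}")
```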

In the following, I explore the uniqueness of values in location, keyword, text, and id columns.

In [6]:
len(pd.unique(train_df.location))

3342

In [28]:
len(pd.unique(train_df.keyword))

222

In [7]:
len(pd.unique(train_df.text))

7503

In [8]:
len(pd.unique(train_df.id))

7613
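Two things are worth noting about these counts. First, 7613 - 7503 = 110 tweets repeat the text of an earlier tweet, while every id is unique. Second, `pd.unique` treats NaN as a value of its own, so the counts for location and keyword include one entry for the missing values; `Series.nunique()` excludes NaN by default. The sketch below shows the difference on a small hypothetical column, not on the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns standing in for train_df.keyword and train_df.text.
keyword = pd.Series(["ablaze", "ablaze", np.nan, "flood", np.nan])
text = pd.Series(["fire downtown", "what a day", "fire downtown"])

n_unique_with_nan = len(pd.unique(keyword))   # NaN is counted as one value
n_unique_without_nan = keyword.nunique()      # NaN is dropped by default
n_repeated_texts = int(text.duplicated().sum())  # rows repeating an earlier text

print(n_unique_with_nan, n_unique_without_nan, n_repeated_texts)
```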

Next, I explore 10 disaster and 10 non-disaster tweets.

In [15]:
train_df[train_df.target==1].text.values[:10]

array(['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
       'Forest fire near La Ronge Sask. Canada',
       "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
       '13,000 people receive #wildfires evacuation orders in California ',
       'Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ',
       '#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires',
       '#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas',
       "I'm on top of the hill and I can see a fire in the woods...",
       "There's an emergency evacuation happening now in the building across the street",
       "I'm afraid that the tornado is coming to our area..."],
      dtype=object)

In [16]:
train_df[train_df.target==0].text.values[:10]

array(["What's up man?", 'I love fruits', 'Summer is lovely',
       'My car is so fast', 'What a goooooooaaaaaal!!!!!!',
       'this is ridiculous....', 'London is cool ;)', 'Love skiing',
       'What a wonderful day!', 'LOOOOOOL'], dtype=object)

In [46]:
pd.set_option('display.max_colwidth', None)


In [47]:
train_df[(train_df.target==1) & train_df.keyword.notna() & train_df.location.notna()][:10]

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 31 | 48 | ablaze | Birmingham | @bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C | 1 |
| 33 | 50 | ablaze | AFRICA | #AFRICANBAZE: Breaking news:Nigeria flag set ablaze in Aba. http://t.co/2nndBGwyEi | 1 |
| 37 | 55 | ablaze | World Wide!! | INEC Office in Abia Set Ablaze - http://t.co/3ImaomknnA | 1 |
| 46 | 66 | ablaze | GREENSBORO,NORTH CAROLINA | How the West was burned: Thousands of wildfires ablaze in California alone http://t.co/vl5TBR3wbr | 1 |
| 50 | 73 | ablaze | Sheffield Township, Ohio | Deputies: Man shot before Brighton home set ablaze http://t.co/gWNRhMSO8k | 1 |
| 51 | 74 | ablaze | India | Man wife get six years jail for setting ablaze niece\nhttp://t.co/eV1ahOUCZA | 1 |
| 53 | 77 | ablaze | Anaheim | Police: Arsonist Deliberately Set Black Church In North CarolinaåÊAblaze http://t.co/pcXarbH9An | 1 |
| 55 | 79 | ablaze | USA | #Kurds trampling on Turkmen flag later set it ablaze while others vandalized offices of Turkmen Front in #Diyala http://t.co/4IzFdYC3cg | 1 |
| 56 | 80 | ablaze | South Africa | TRUCK ABLAZE : R21. VOORTREKKER AVE. OUTSIDE OR TAMBO INTL. CARGO SECTION. http://t.co/8kscqKfKkF | 1 |
| 59 | 83 | ablaze | Edmonton, Alberta - Treaty 6 | How the West was burned: Thousands of wildfires ablaze in #California alone http://t.co/iCSjGZ9tE1 #climate #energy http://t.co/9FxmN0l0Bd | 1 |


In [48]:
train_df[(train_df.target==0) & train_df.keyword.notna() & train_df.location.notna()][:10]

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 32 | 49 | ablaze | Est. September 2012 - Bristol | We always try to bring the heavy. #metal #RT http://t.co/YAo1e0xngw | 0 |
| 34 | 52 | ablaze | Philadelphia, PA | Crying out for more! Set me ablaze | 0 |
| 35 | 53 | ablaze | London, UK | On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE http://t.co/qqsmshaJ3N | 0 |
| 36 | 54 | ablaze | Pretoria | @PhDSquares #mufc they've built so much hype around new acquisitions but I doubt they will set the EPL ablaze this season. | 0 |
| 39 | 57 | ablaze | Paranaque City | Ablaze for you Lord :D | 0 |
| 40 | 59 | ablaze | Live On Webcam | Check these out: http://t.co/rOI2NSmEJJ http://t.co/3Tj8ZjiN21 http://t.co/YDUiXEfIpE http://t.co/LxTjc87KLS #nsfw | 0 |
| 42 | 62 | ablaze | milky way | Had an awesome time visiting the CFC head office the ancop site and ablaze. Thanks to Tita Vida for taking care of us ?? | 0 |
| 48 | 68 | ablaze | Live On Webcam | Check these out: http://t.co/rOI2NSmEJJ http://t.co/3Tj8ZjiN21 http://t.co/YDUiXEfIpE http://t.co/LxTjc87KLS #nsfw | 0 |
| 49 | 71 | ablaze | England. | First night with retainers in. It's quite weird. Better get used to it; I have to wear them every single night for the next year at least. | 0 |
| 52 | 76 | ablaze | Barbados | SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintendent Lanford Salmon has r ... - http://t.co/vplR5Hka2u http://t.co/SxHW2TNNLf | 0 |


## Vectors

In [57]:
counter_vectorizer = feature_extraction.text.CountVectorizer()

In [62]:
sample_train_vectors = counter_vectorizer.fit_transform(train_df.text[:3])
print(sample_train_vectors[0].todense().shape)
print(sample_train_vectors[0].todense())
print(sample_train_vectors[1].todense().shape)
print(sample_train_vectors[1].todense())
print(sample_train_vectors[2].todense().shape)
print(sample_train_vectors[2].todense())
train_df.text[:3]

(1, 36)
[[1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 0 1]]
(1, 36)
[[0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0]]
(1, 36)
[[1 0 2 1 1 1 0 0 0 1 1 0 0 0 2 0 0 0 1 1 0 1 1 1 1 0 2 0 1 0 0 2 0 0 1 0]]


0    Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all                                                                
1    Forest fire near La Ronge Sask. Canada                                                                                               
2    All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
Name: text, dtype: object

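Each vector has 36 entries, one per unique word across the three tweets (13 + 7 + 16 = 36):

* Tweet 0 has 13 words: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
* Tweet 1 has 7 words: Forest fire near La Ronge Sask. Canada
* Tweet 2 has 22 words, 16 of them new: residents asked to 'shelter in place' being notified by officers. No other evacuation or orders expected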

In [106]:
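# Join the first three tweets into one string.
words = train_df.text[0] + ' ' + train_df.text[1] + ' ' + train_df.text[2]
# Strip '#' and replace periods and apostrophes with spaces.
words = words.replace('#', '').replace('.', ' ').replace("'", ' ')
# Lowercase, split on whitespace, and keep the sorted unique words.
words = sorted(set(words.lower().split()))
print(words)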

['all', 'allah', 'are', 'asked', 'being', 'by', 'canada', 'deeds', 'earthquake', 'evacuation', 'expected', 'fire', 'forest', 'forgive', 'in', 'la', 'may', 'near', 'no', 'notified', 'of', 'officers', 'or', 'orders', 'other', 'our', 'place', 'reason', 'residents', 'ronge', 'sask', 'shelter', 'the', 'this', 'to', 'us']


In [107]:
len(words)

36
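The 36 words match the 36 columns of the sample vectors, which suggests each column is the count of one vocabulary word in alphabetical order. As a minimal sketch, the count vectors can be rebuilt by hand with the standard library. The helper names `tokenize` and `count_vector` are illustrative, and the regular expression is assumed to mirror `CountVectorizer`'s default token pattern (lowercased runs of two or more word characters).

```python
import re
from collections import Counter

# The first three training tweets, copied from the output above.
tweets = [
    "Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all",
    "Forest fire near La Ronge Sask. Canada",
    "All residents asked to 'shelter in place' are being notified by officers. "
    "No other evacuation or shelter in place orders are expected",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted unique tokens across all tweets: one column per token.
vocabulary = sorted({token for tweet in tweets for token in tokenize(tweet)})

def count_vector(text):
    # Count how often each vocabulary word occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [count_vector(tweet) for tweet in tweets]

print(len(vocabulary))
for vector in vectors:
    print(vector)
```

The row sums are 13, 7, and 22, the word counts of the three tweets, and entries of 2 in the third row correspond to words that occur twice there ('are', 'in', 'place', 'shelter').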

In [108]:
train_vectors = counter_vectorizer.fit_transform(train_df.text)
test_vectors = counter_vectorizer.transform(test_df.text)
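Note that the test set goes through `transform`, not `fit_transform`, so the test vectors use the vocabulary learned from the training tweets and share the same columns. A small sketch on made-up sentences (not the competition data) illustrates this: words that never appeared during fitting are dropped from the test vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences for illustration only.
toy_train = ["forest fire near town", "what a lovely day"]
toy_test = ["fire in the forest", "lovely unseen words"]

vectorizer = CountVectorizer()
toy_train_vectors = vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_vectors = vectorizer.transform(toy_test)        # reuses that vocabulary

# Both matrices have one column per word seen during fitting.
print(toy_train_vectors.shape, toy_test_vectors.shape)
print(sorted(vectorizer.vocabulary_))
print(toy_test_vectors.toarray())
```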