# Kaggle Novice to Grandmaster
### Competition #2: Natural Language Processing with Disaster Tweets

- https://www.kaggle.com/c/nlp-getting-started


Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g., disaster relief organizations and news agencies).

But it’s not always clear whether a person’s words are actually announcing a disaster.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

path = '/content/drive/MyDrive/data/Kaggle'

Mounted at /content/drive


In [None]:
# Packages
import os

import pandas as pd
import numpy as np
import random
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings(action='ignore')

!pip install -q kaggle --upgrade
# Kaggle API credentials: substitute your own values and never publish a real key.
os.environ['KAGGLE_USERNAME'] = '<your-kaggle-username>'
os.environ['KAGGLE_KEY'] = '<your-kaggle-api-key>'

## Dataset

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle
!kaggle competitions download -c nlp-getting-started
!mv nlp-getting-started.zip drive/MyDrive/data/Kaggle/

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73051 sha256=88906ecfcf1f486c0a9718cbc1435fe700da91a9228c725691a36bb094dce2d5
  Stored in directory: /root/.cache/pip/wheels/62/d6/58/5853130f941e75b2177d281eb7e44b4a98ed46dd155f556dc5
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.12
    Un

In [None]:
!unzip drive/MyDrive/data/Kaggle/nlp-getting-started.zip

Archive:  drive/MyDrive/data/Kaggle/nlp-getting-started.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [None]:
train_df=pd.read_csv('./train.csv')
test_df=pd.read_csv('./test.csv')

In [None]:
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [None]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |


A `target` of 1 means the tweet is about a real disaster; a `target` of 0 means it is not.

In [None]:
print(train_df[train_df["target"] == 0]["text"].values[1])
print(train_df[train_df["target"] == 1]["text"].values[1])

I love fruits
Forest fire near La Ronge Sask. Canada


In [None]:
print(train_df[train_df["keyword"].notnull()].iloc[0])

id                                                         48
keyword                                                ablaze
location                                           Birmingham
text        @bbcmtd Wholesale Markets ablaze http://t.co/l...
target                                                      1
Name: 31, dtype: object


In [None]:
train_df.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

## Tutorial

### Building Vectors

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()

example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [None]:
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


1. There are 54 unique words (or "tokens") in the first five tweets.

2. The first tweet contains only some of those unique tokens: each non-zero count above marks a token that does appear in the first tweet.
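
To make the two points above concrete, here is a minimal plain-Python sketch of what a count vector is. It only approximates `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters, columns in sorted token order). The toy tweets and the helper names `tokenize`, `fit_vocabulary`, and `to_count_vector` are hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default tokenization:
# lowercase the text, then keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(texts):
    """Map each unique token to a column index, in sorted token order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(text, vocabulary):
    """Return one row of counts; tokens outside the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row


# Hypothetical toy tweets, not taken from train.csv.
toy_tweets = ["Forest fire near the forest", "I love fruits"]
vocab = fit_vocabulary(toy_tweets)
vectors = [to_count_vector(t, vocab) for t in toy_tweets]

print(vocab)    # one column per unique token
print(vectors)  # one row per tweet; mostly zeros
```

Each row has one entry per vocabulary token, so a single tweet fills in only a few columns. The single-character token "I" is dropped by the two-character minimum.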

In [None]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])
test_vectors = count_vectorizer.transform(test_df["text"])

In [None]:
print(train_vectors[0].todense().shape)

(1, 21637)
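
The test tweets are passed through `transform` rather than `fit_transform`, so they reuse the vocabulary learned from the training tweets and get the same 21637 columns. The plain-Python sketch below illustrates that idea under simplified assumptions: the texts are hypothetical, and the tokenization only approximates `CountVectorizer`'s defaults.

```python
import re


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Hypothetical stand-ins for the training and test tweets.
train_texts = ["Forest fire near the forest", "I love fruits"]
test_text = "Wildfire near the forest"

# "Fit": the vocabulary comes from the training texts only.
vocabulary = sorted({tok for text in train_texts for tok in tokenize(text)})
column = {tok: idx for idx, tok in enumerate(vocabulary)}

# "Transform": count test tokens against the frozen vocabulary.
row = [0] * len(column)
ignored = []
for tok in tokenize(test_text):
    if tok in column:
        row[column[tok]] += 1
    else:
        ignored.append(tok)  # token never seen during fitting

print(len(row), row)
print(ignored)
```

The test row keeps the training width, and a token that never appeared in training ("wildfire" here) contributes nothing. The model therefore always sees the features it was trained on.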


### Model

In [None]:
ridge = linear_model.RidgeClassifier()

scores = model_selection.cross_val_score(ridge, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.59453669, 0.56498283, 0.64082434])
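
The three values above are F1 scores, one per cross-validation fold. As a rough illustration of what `scoring="f1"` measures, the sketch below computes F1 for the positive class by hand and averages the fold scores printed above. The labels in `y_true` and `y_pred` are made up, and `f1_binary` is a hypothetical helper rather than the scikit-learn implementation.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
f1 = f1_binary(y_true, y_pred)
print(f1)

# Mean of the three fold scores reported above.
cv_scores = [0.59453669, 0.56498283, 0.64082434]
mean_f1 = sum(cv_scores) / len(cv_scores)
print(round(mean_f1, 4))
```

Averaged over the three folds, the baseline reaches an F1 of roughly 0.60.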

In [None]:
ridge.fit(train_vectors, train_df["target"])

RidgeClassifier()

In [None]:
sample_submission = pd.read_csv("./sample_submission.csv")
sample_submission["target"] = ridge.predict(test_vectors)

In [None]:
sample_submission.head()

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |


In [None]:
sample_submission.to_csv("submission_tutorial.csv", index=False)
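
Before uploading, it can help to sanity-check the written file. The sketch below assumes the expected layout is the two columns shown in `sample_submission.head()` (`id`, `target`) with binary targets. `check_submission` is a hypothetical helper, and it works on CSV text here so that it runs without the real file.

```python
import csv
import io


def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission; an empty list means none."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        return [f"unexpected header: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    # The ids should match the test set, in the same order.
    if [row["id"] for row in rows] != [str(i) for i in expected_ids]:
        problems.append("ids do not match the test set order")
    # Every prediction should be 0 or 1.
    bad = [row["id"] for row in rows if row["target"] not in {"0", "1"}]
    if bad:
        problems.append(f"non-binary targets for ids: {bad}")
    return problems


# The first five rows shown above, written as CSV text.
sample = "id,target\n0,0\n2,1\n3,1\n9,0\n11,1\n"
print(check_submission(sample, [0, 2, 3, 9, 11]))
```

For the real file, the same check could be run on `open("submission_tutorial.csv").read()` with `test_df["id"]` as the expected ids.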