<a href="https://colab.research.google.com/github/Oganesson-118/DMML2020-Orange/blob/Code/Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Orange Team
## Data Mining Project

NLP with disaster tweets. Real or not?


## Exploratory Data Analysis

In [2]:
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import spacy

%matplotlib inline

In [3]:
#read the training data
data = pd.read_csv('https://raw.githubusercontent.com/GeorgesBongibault/Project-Orange/main/data/training_data.csv')
data.head()

Unnamed: 0,id,keyword,location,text,target
0,3738,destroyed,USA,Black Eye 9: A space battle occurred at Star O...,0
1,853,bioterror,,#world FedEx no longer to transport bioterror ...,0
2,10540,windstorm,"Palm Beach County, FL",Reality Training: Train falls off elevated tra...,1
3,5988,hazardous,USA,#Taiwan Grace: expect that large rocks trees m...,1
4,6328,hostage,Australia,New ISIS Video: ISIS Threatens to Behead Croat...,1


In [4]:
# read the test data and save for later
t_df = pd.read_csv('https://raw.githubusercontent.com/GeorgesBongibault/Project-Orange/main/data/test_data.csv')
t_df.head()

Unnamed: 0,id,keyword,location,text
0,9972,tsunami,,Crptotech tsunami and banks.\n http://t.co/KHz...
1,9865,traumatised,"Portsmouth, UK",I'm that traumatised that I can't even spell p...
2,1937,burning%20buildings,,@foxnewsvideo @AIIAmericanGirI @ANHQDC So ... ...
3,3560,desolate,,Me watching Law &amp; Order (IB: @sauldale305)...
4,2731,crushed,bahstun/porta reeko,Papi absolutely crushed that ball


In [5]:
data.describe()

Unnamed: 0,id,target
count,6471.0,6471.0
mean,5446.2896,0.428064
std,3139.343612,0.494836
min,1.0,0.0
25%,2731.0,0.0
50%,5450.0,0.0
75%,8161.5,1.0
max,10873.0,1.0


The training set contains 6471 samples. The target variable is binary, and its mean of 0.428 is the share of tweets labelled 1.

In [6]:
data.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object
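
`describe()` only summarises the numeric columns, so it says nothing about the `keyword` and `location` columns, which contain empty entries in the previews above. A minimal sketch of a missing-value check, using a small hypothetical frame (`toy`) in place of `data`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only
toy = pd.DataFrame({
    "keyword": ["destroyed", "bioterror", np.nan],
    "location": ["USA", np.nan, np.nan],
    "text": ["t1", "t2", "t3"],
})

# Number of missing entries per column
missing_counts = toy.isna().sum()

# Share of missing entries per column
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share.round(3))
```

The same two calls should work unchanged on `data` and `t_df`.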

In [None]:
data.target.value_counts()/data.shape[0]

0    0.571936
1    0.428064
Name: target, dtype: float64

The base rate for the training dataset is 57%: this is the share of tweets that are not about disasters, which is the majority class.
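
The base rate is a useful floor because a model that always predicts the majority class already reaches it. The sketch below illustrates this on toy labels; `toy_target` is a hypothetical stand-in for `data.target`, and `DummyClassifier` is used only to make the point concrete.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels standing in for data.target (hypothetical values, not the real data)
toy_target = pd.Series([0, 0, 0, 0, 1, 1, 1])

# Share of each class; the largest share is the base rate
class_shares = toy_target.value_counts(normalize=True)
base_rate = class_shares.max()

# A classifier that always predicts the most frequent class scores the base rate
dummy_features = [[0]] * len(toy_target)  # features are ignored by this strategy
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(dummy_features, toy_target)
dummy_accuracy = dummy.score(dummy_features, toy_target)

print(f"base rate: {base_rate:.3f}, dummy accuracy: {dummy_accuracy:.3f}")
```

Any classifier trained later in the notebook should be compared against this number rather than against 50%.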

In [9]:
data.keyword.value_counts()

deluge                   39
earthquake               38
collision                37
harm                     37
ambulance                36
                         ..
forest%20fire            18
threat                   10
epicentre                10
radiation%20emergency     9
inundation                6
Name: keyword, Length: 221, dtype: int64

There are 221 distinct keywords.
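
Some keywords are URL-encoded (for example `forest%20fire`). If readable keywords are wanted, one possible approach is to decode them with `urllib.parse.unquote`. This is a sketch on a hypothetical `toy_keywords` series, not a step the notebook performs.

```python
from urllib.parse import unquote

import pandas as pd

# Hypothetical sample of keyword values, including one missing entry
toy_keywords = pd.Series(["forest%20fire", "body%20bag", "deluge", None])

# Decode percent-encoding; na_action="ignore" leaves missing values untouched
decoded = toy_keywords.map(unquote, na_action="ignore")

print(decoded.tolist())
print(decoded.nunique())  # nunique() does not count missing values
```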

In [11]:
data.location.value_counts()

USA                              91
New York                         59
United States                    46
London                           39
Canada                           25
                                 ..
Fukushima city Fukushima.pref     1
? Philly Baby ?                   1
Jakarta, Indonesia                1
GREENSBORO,NORTH CAROLINA         1
Rocky Mountains                   1
Name: location, Length: 2921, dtype: int64

There are 2921 distinct locations. The field is free text, so variants such as "USA" and "United States" are counted separately.

Here are the 10 most common location and keyword pairs among tweets about a real disaster:

In [39]:
pivot = pd.pivot_table(data = data, index = ['location','keyword'], values = 'target', aggfunc= 'sum')
pivot.sort_values(by='target').tail(10)

location,keyword,target
Hong Kong,debris,3
United States,hazardous,4
United States,hail,4
"Bend, Oregon",evacuation,4
Pedophile hunting ground,displaced,4
"Washington, DC",derailed,5
Nigeria,suicide%20bomb,6
India,derailment,7
Mumbai,wreckage,10
USA,sandstorm,15


Here are the 10 most common location and keyword pairs among tweets that are not about a real disaster:

In [43]:
a = data.copy()
a['target'] = 1- a['target']
pivot = pd.pivot_table(data = a, index = ['location','keyword'], values = 'target', aggfunc= 'sum')
pivot.sort_values(by='target').tail(10)

location,keyword,target
Everywhere,crush,4
Road to the Billionaires Club,derail,4
Happily Married with 2 kids,ambulance,4
"Morioh, Japan",detonate,6
USA,destroyed,7
ss,arsonist,8
304,aftershock,9
New York,body%20bag,9
New York,flood,10
Kenya,loud%20bang,10
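
The two pivot tables above can also be produced in one pass, without copying the frame and flipping the target. The following is a sketch on a hypothetical `toy` frame; the column names `real`, `total` and `not_real` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `data` with the three columns the pivot uses
toy = pd.DataFrame({
    "location": ["USA", "USA", "USA", "Mumbai", "Mumbai", "Kenya"],
    "keyword": ["sandstorm", "sandstorm", "destroyed",
                "wreckage", "wreckage", "loud%20bang"],
    "target": [1, 1, 0, 1, 1, 0],
})

# Count real-disaster tweets and all tweets for each (location, keyword) pair
counts = (
    toy.groupby(["location", "keyword"])["target"]
       .agg(real="sum", total="count")
)

# Tweets that are not about a real disaster are the remainder
counts["not_real"] = counts["total"] - counts["real"]

top_real = counts.sort_values("real", ascending=False).head(3)
top_not_real = counts.sort_values("not_real", ascending=False).head(3)

print(top_real)
print(top_not_real)
```

Like `pivot_table`, `groupby` drops rows with a missing location or keyword by default, so both approaches only cover tweets where both fields are present.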


## First submission

In [None]:
# Download the english language model
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


Preprocess the tweets:

- Remove stop words, using the stop-word list from the spaCy package.

- Remove punctuation marks, using the punctuation characters from the `string` module.

- Lowercase all words.

- Lemmatize all words, using the spaCy package.


In [None]:
# Load English language model
sp = spacy.load('en_core_web_sm')

In [None]:
# Create a list of stopwords
stop_words = spacy.lang.en.stop_words.STOP_WORDS

list(stop_words)[:10]

['wherein',
 'seem',
 'under',
 'becoming',
 'something',
 'whereafter',
 'can',
 '’d',
 'does',
 'four']

In [None]:
import string
# Create a list of punctuation marks
punctuations = string.punctuation

punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
# Create tokenizer function
def spacy_tokenizer(sentence):
    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = sp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Return preprocessed list of tokens
    return mytokens
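
One detail of the tokenizer is worth noting: `punctuations` is a string, so `word not in punctuations` is a substring test, not a membership test over single characters. The sketch below shows the difference; `punctuation_set` is a hypothetical alternative and is not used in the notebook.

```python
import string

punctuations = string.punctuation           # a str, as in the notebook
punctuation_set = set(string.punctuation)   # hypothetical alternative: a set of single characters

# On a string, `in` checks for substrings
print("!" in punctuations)    # True
print("" in punctuations)     # True: the empty string is a substring of every string
print("./" in punctuations)   # True: these two characters are adjacent in string.punctuation
print("!!" in punctuations)   # False: a repeated mark is not a substring

# On a set, `in` checks for exact elements
print("!" in punctuation_set)   # True
print("" in punctuation_set)    # False
print("./" in punctuation_set)  # False
```

A side effect of the string version is that empty tokens left over after `strip()` are filtered out, while tokens such as `!!` pass through. With a set, an explicit check for empty tokens would be needed.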

In [None]:
# Select features
X = data.text
y = data.target
X.shape

(6471,)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)
X_train.shape

(5176,)
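
The split above is purely random, so the class proportions in the two parts can drift slightly from the overall 57/43 balance. If identical proportions are wanted, `train_test_split` accepts a `stratify` argument. This is a sketch on hypothetical toy data (`toy_X`, `toy_y`), not the split used in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 texts, 60% class 0 and 40% class 1
toy_X = pd.Series([f"tweet {i}" for i in range(100)])
toy_y = pd.Series([0] * 60 + [1] * 40)

# stratify=toy_y keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=72, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```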

In [None]:
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer
classifier = LogisticRegressionCV(solver="lbfgs", cv=5, max_iter=2000, random_state=72)
pipe = Pipeline([('vectorizer', tfidf_vector), ('classifier', classifier)])
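
`LogisticRegressionCV` picks the regularisation strength `C` by cross-validation over a grid of candidate values, and the selected value is available in `C_` after fitting. The sketch below shows this on hypothetical toy texts. It uses the default `TfidfVectorizer` tokenizer instead of `spacy_tokenizer` so that it runs without the spaCy model, and a smaller grid (`Cs=5`) to keep it quick.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, repeated so every CV fold contains both classes
disaster = ["fire destroys homes", "flood hits town", "earthquake shakes city",
            "storm damages bridge", "wildfire spreads fast"]
everyday = ["love this song", "great pizza tonight", "my exam was a disaster lol",
            "crushed my workout", "this movie is the bomb"]
toy_texts = disaster * 4 + everyday * 4
toy_labels = [1] * 20 + [0] * 20

toy_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),  # default tokenizer, not spacy_tokenizer
    ("classifier", LogisticRegressionCV(Cs=5, cv=5, max_iter=2000, random_state=72)),
])
toy_pipe.fit(toy_texts, toy_labels)

# Regularisation strength chosen by cross-validation
chosen_C = toy_pipe.named_steps["classifier"].C_
print(chosen_C)

# Prediction for an unseen toy tweet
toy_prediction = toy_pipe.predict(["flood destroys bridge"])
print(toy_prediction)
```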

In [None]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...
                                 tokenizer=<function spacy_tokenizer at 0x7f2ba6af22f0>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegressionCV(Cs=10, class_weight=None, cv=5,
                                      dual=Fal

In [None]:
# Predict on the 20% hold-out split (X_test), which the model has not seen
y_pred_holdout = pipe.predict(X_test)

In [None]:
print(f"Hold-out accuracy:\n{accuracy_score(y_test, y_pred_holdout):.4f}")

Hold-out accuracy:
0.8000
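
Accuracy alone does not show how the errors split between the two classes. The other metrics imported at the top of the notebook can be computed the same way. The sketch below uses hypothetical toy labels (`toy_true`, `toy_pred`) rather than the notebook's predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 4 real-disaster tweets and 6 others
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(toy_true, toy_pred)
print(cm)  # [[5 1]
           #  [1 3]]

acc = accuracy_score(toy_true, toy_pred)     # (5 + 3) / 10 = 0.8
prec = precision_score(toy_true, toy_pred)   # 3 / (3 + 1) = 0.75
rec = recall_score(toy_true, toy_pred)       # 3 / (3 + 1) = 0.75
f1 = f1_score(toy_true, toy_pred)            # harmonic mean of 0.75 and 0.75 = 0.75

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```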


Predictions on the actual test dataset (the separate, unlabelled dataset loaded into `t_df`):

In [None]:
y_pred_test = pipe.predict(t_df.text)

In [None]:
y_pred_test = pd.Series(y_pred_test).rename('target')
y_pred_test

0       0
1       0
2       1
3       0
4       0
       ..
1137    1
1138    1
1139    1
1140    0
1141    1
Name: target, Length: 1142, dtype: int64

In [None]:
from google.colab import drive
drive.mount('/drive')
y_pred_test.to_csv('/drive/My Drive/1st_submission.csv', index = False)

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).
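
The file written above contains only the `target` column. If the submission format also requires the tweet `id` (an assumption; the source does not state the expected format), the predictions can be paired with the ids before saving. This is a sketch with hypothetical toy values, written to an in-memory buffer instead of Drive.

```python
import io

import pandas as pd

# Hypothetical stand-ins for t_df and the pipeline's predictions
toy_test = pd.DataFrame({"id": [9972, 9865, 1937], "text": ["a", "b", "c"]})
toy_predictions = [0, 0, 1]

# Pair each prediction with its tweet id (assumed submission layout: id,target)
submission = pd.DataFrame({"id": toy_test["id"], "target": toy_predictions})

# Write to an in-memory buffer; a file path could be passed to to_csv instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```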
