# Binary Classification Problem

NLP

## 01. Load Data

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 


In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
train_data = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/nlp/train.csv')
test_data = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/nlp/test.csv')


## 02. Analyze Data

* **id**: a unique identifier of every tweet
* **keyword**: a particular keyword from the tweet (this can be blank)
* **location**: the location the tweet was sent from (this can be blank)
* **text**: the text of the tweet
* **target**: present only in the train data, and denotes if the tweet is about a real disaster (1) or not (0)



In [4]:
train_data.describe(include = 'all')

Unnamed: 0,id,keyword,location,text,target
count,7613.0,7552,5080,7613,7613.0
unique,,221,3341,7503,
top,,fatalities,USA,11-Year-Old Boy Charged With Manslaughter of T...,
freq,,45,104,10,
mean,5441.934848,,,,0.42966
std,3137.11609,,,,0.49506
min,1.0,,,,0.0
25%,2734.0,,,,0.0
50%,5408.0,,,,0.0
75%,8146.0,,,,1.0


In [5]:
train_data.sample(5)

Unnamed: 0,id,keyword,location,text,target
5260,7521,oil%20spill,,Refugio oil spill may have been costlier bigge...,1
4849,6908,mass%20murderer,Earth-616,[Creel:You must think I'm a real moron Flag ma...,0
3786,5378,fire%20truck,,Our garbage truck really caught on fire lmfao.,0
3489,4986,explosion,,New Explosion-proof Tempered Glass Screen Prot...,0
4256,6048,heat%20wave,"London, Riyadh",Something Frightening is Happening to the Weat...,1


In [6]:
test_data.sample(5)

Unnamed: 0,id,keyword,location,text
2895,9587,thunder,,Tuesday Bolts ÛÒ 8.4.15 http://t.co/mkMSV54eV...
3197,10624,wounded,SWMO,Officer Wounded Suspect Killed in Exchange of ...
1170,3865,detonation,"Palacio, Madrid",@jcenters No uh-oh it was a controlled detonat...
2915,9648,thunderstorm,"Oklahoma City, OK",Severe Thunderstorm Warnings have been cancell...
2526,8426,sandstorm,United States,Watch This Airport Get Swallowed Up By A Sands...


### 02.01. Missing Values

In [7]:
train_data.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [8]:
test_data.isnull().sum()

id             0
keyword       26
location    1105
text           0
dtype: int64

In [9]:
'''
We can still find an approach to include the 'keyword' in our model, but in this
example we are going to use only the text of the tweet.
'''

"\nWe can still find an approach to include the 'keyword' in our model, but in this\nexample we are going to use only the text of the tweet.\n"

In [10]:
train_data.text.sample(10)

1277    burned 129 calories doing 24 minutes of Walkin...
3821    Juneau Empire - First responders turn out for ...
2839    Philippines Must Protect Internally Displaced ...
2007    Nine inmates charged with causing damage in Ca...
1197    Mmmmmm I'm burning.... I'm burning buildings I...
6067    150-Foot Sinkhole Opens In Lowndes County Resi...
1586    News Update Huge cliff landslide on road in Ch...
3189    Cruise's 'M:I 5' emergency plan: Awesome fail ...
3558    @FinancialTimes Ethiopian regimes continue rec...
6938    @astros stunningly poor defense it's not all o...
Name: text, dtype: object
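The keyword column is left out of the model in this notebook. Should one want to bring it back in, a minimal sketch under assumptions follows: the sampled rows show keywords URL-encoded (for example `oil%20spill`), so the hypothetical helper `add_keyword` decodes the keyword and prepends it to the tweet text. The helper name and the prepend strategy are illustrative choices, not part of the pipeline used below.

```python
from urllib.parse import unquote

def add_keyword(text, keyword):
    """Prepend the decoded keyword to the tweet text (hypothetical helper)."""
    # pandas represents a missing keyword as NaN (a float), so only real strings are used.
    if not isinstance(keyword, str) or not keyword.strip():
        return text
    return f"{unquote(keyword)} {text}"

print(add_keyword("Our garbage truck really caught on fire lmfao.", "fire%20truck"))

# With the DataFrame from this notebook, one might apply it row by row:
# train_data['text'] = [add_keyword(t, k) for t, k in zip(train_data.text, train_data.keyword)]
```

Rows without a keyword keep their text unchanged, so the 61 missing keywords in the train data would not need separate handling.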

## 03. Data Processing

### 03.01. Text Preprocessing

#### Lowercasing

In [11]:
train_data['text'] = train_data['text'].str.lower()
train_data.text.sample(10)


5718    video: 'we're picking up bodies from water': r...
2718    @un no more #gujaratriot &amp; #mumbairiot92-9...
6265    kesabaran membuahkan hasil indah pada saat tep...
6416    #deai #??? #??? #??? suicide bomber kills 15 i...
4827    http://t.co/c1h7jecfrv @royalcarribean do your...
2443    #news madhya pradesh train derailment: village...
2687    dorman 917-033 ignition knock (detonation) sen...
970     new summer long thin body bag hip a word skirt...
5129    nuclear-deal: indo-japan pact lies at the hear...
6201    smoke with me baby and lay with me baby and la...
Name: text, dtype: object

In [12]:
test_data['text'] = test_data['text'].str.lower()
test_data.text.sample(10)


1303    the drowning girl by caitlin r. kiernan centip...
601     @trubeque destruction magic's fine just don't ...
2855    cross-border terrorism: pakistan caught red-ha...
688     emergency services in hammondville. near jewel...
1271                              drown me in clementines
847     @benznibeadel_ hehe like u hahaha i'm kidding ...
1984    5 need to-dos seeing as how technical writing ...
1755    father we come 2u &amp; lift up this nation am...
176     wehtwtvwlocal: trial date set for man charged ...
1430    what if the fire up in the catalinas gets wors...
Name: text, dtype: object

#### Entities, URL links and Punctuation Removal

Remove mentions (e.g. @user) and hashtags (e.g. #CA). Any token that starts with `@` or `#` is dropped entirely.

In [13]:
def remove_entities(text):
  entity_prefixes = ['@','#']
  words = []
  for word in text.split():
    w = word.strip()
    if len(w) > 0 and w[0] not in entity_prefixes:
      words.append(w)
  return  ' '.join(words)
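Because `remove_entities` discards the whole token, the word behind a hashtag (often a topical one such as `#earthquake`) is lost as well. A possible alternative, sketched here as an assumption rather than the approach used in this notebook, drops mentions but keeps the hashtag word. The name `strip_entity_markers` is hypothetical.

```python
def strip_entity_markers(text):
    """Drop @mentions entirely, but keep the word behind a leading '#'."""
    words = []
    for word in text.split():
        if word.startswith('@'):
            continue                 # skip mentions completely
        word = word.lstrip('#')      # '#earthquake' becomes 'earthquake'
        if word:
            words.append(word)
    return ' '.join(words)

print(strip_entity_markers("@un no more #gujaratriot now"))
```

Whether keeping hashtag words helps the classifier is an empirical question; comparing the cross-validation scores of both variants would be one way to check.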

In [14]:
train_data['text'] = train_data['text'].apply(lambda x: remove_entities(x))

In [15]:
train_data.text.sample(10)

7388    texas seeks comment on rules for changes to wi...
690     not really. sadly i have come to expect that f...
5333    pandemonium in aba as woman delivers baby with...
2034    training grains of wheat to bare gold in the a...
2194    malaysia confirms reunion island debris is fro...
707     morgan silver dollar 1921 p ch gem bu pl blazi...
5962                                            screaming
7399    police officer wounded suspect dead after exch...
1349    like for the music video i want some real acti...
203     twelve feared killed in pakistani air ambulanc...
Name: text, dtype: object

In [16]:
test_data['text'] = test_data['text'].apply(lambda x: remove_entities(x))

In [17]:
test_data.text.sample(10)

2245    short story about indifference oppression hatr...
2419    if a picture is worth a thousand words what wo...
3032    twister hits 4 villages in quezon province - h...
2157            washi is indeed 'a natural disaster' ????
2392    guess who's got a hilarious new piece on 51 th...
37      there's a construction guy working on the disn...
2411    florida firefighters rescue scared meowing kit...
2898    it doesn't really get much better then summer ...
2152             natural disaster on you half ass rappers
1572        the next chp is titled emmeryn i live in fear
Name: text, dtype: object

Replace URL links with blanks

In [18]:
import re

In [19]:
# 'https?' already covers both http:// and https://
url_pattern = r"(?:@|https?://|www)\S+"
train_data['text'] = train_data['text'].apply(lambda x: re.sub(url_pattern, ' ', x))

test_data['text'] = test_data['text'].apply(lambda x: re.sub(url_pattern, ' ', x))

In [20]:
train_data.text.sample(10)

3197    do you have an emergency drinking water plan? ...
3083    she says that she'd love to come help but the ...
4290    describes piling up thinking it would last as ...
2938    sometimes logic gets drowned out in emotion bu...
1542    downtown emergency service center is hiring! c...
2370    totally agree.she is 23 and know what birth co...
5106       fukushima: the story of a nuclear disaster    
6577    4 those who care about sibling abuse survivors...
3257    why tf did i decide to workout today? my body ...
7282    set a new record.... 7 states in 4 days. i don...
Name: text, dtype: object

In [21]:
test_data.text.sample(10)

3140    {{ whirlwind romance dress | in-store &amp; on...
2799    'i mean if the relationship can't survive thez...
258     fedex no longer to transport bioterror germs i...
2571                                            *screams*
167     arson suspect linked to 30 fires caught in nor...
3136    join me on fb for friday   for my upcoming to ...
459     eish even drake killing niggas eish game is re...
2244    [chaos dancing in the streets | why did god or...
702     video: man rescued after 80ft cliff fall at sh...
1672                 flooding updates from pasco county  
Name: text, dtype: object


Remove punctuation

In [22]:
train_data['text'] = train_data['text'].apply(lambda x: re.sub(r"[^\w\s]"," ",x))

test_data['text'] = test_data['text'].apply(lambda x: re.sub(r"[^\w\s]"," ",x))

In [23]:
train_data.text.sample(10)

1178    listening to blowers and tuffers on the aussie...
7215    iranian warship points weapon at american heli...
7314    deep crew to help with california wild fires  ...
4612                    familia  arm injury or head case 
4671    potential storm surge flooding map by national...
4689    speaking of memorable debates  60 second know ...
116               320  ir  icemoon  aftershock           
2511    just came back from camping and returned with ...
3302    the efak would be designed for building occupa...
7253     the reagan administration had arranged for is...
Name: text, dtype: object

In [24]:
test_data.text.sample(10)

3025    i liked a video   call of duty ghosts walktrou...
648     fox news is the biggest media catastrophe in a...
1403     emergency plumber emergency plumbing services   
1247    drought report lists se texas as  abnormally d...
2868    polls  should the unborn babies of boko haram ...
516     ashes 2015  australia ûªs collapse at trent br...
1440    chesterfield apartment complex evacuated becau...
1496    new explosion proof tempered glass screen prot...
1749    anger is an acid that can do more harm to the ...
2129    my husband was in the military and he does tha...
Name: text, dtype: object

Remove numbers

In [25]:
def remove_numbers(text):
  words = []
  for word in text.split():
    w = word.strip()
    if len(w) > 0 and not w.isnumeric():
      words.append(w)
  return  ' '.join(words)

In [26]:
train_data['text'] = train_data['text'].apply(lambda x: remove_numbers(x))

test_data['text'] = test_data['text'].apply(lambda x: remove_numbers(x))
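The cleaning steps so far are applied one by one, each time to both data frames. One way to keep train and test in sync is to collect them in a single function. The sketch below assumes the same order of steps as above and uses a slightly simplified URL pattern; `clean_text` and the two pattern names are hypothetical.

```python
import re

URL_OR_MENTION = re.compile(r"(?:@|https?://|www)\S+")
NON_WORD = re.compile(r"[^\w\s]")

def clean_text(text):
    """Apply the cleaning steps from this section to a single tweet."""
    text = text.lower()
    # Drop tokens that start with '@' or '#'.
    text = ' '.join(w for w in text.split() if w[0] not in '@#')
    text = URL_OR_MENTION.sub(' ', text)   # URLs become blanks
    text = NON_WORD.sub(' ', text)         # punctuation becomes blanks
    # Drop purely numeric tokens; split/join also collapses repeated whitespace.
    return ' '.join(w for w in text.split() if not w.isnumeric())

print(clean_text("Severe #storm WARNING at 8.4.15 http://t.co/abc @user, stay safe!"))

# Applied to the notebook's data frames, this might look like:
# for df in (train_data, test_data):
#     df['text'] = df['text'].apply(clean_text)
```

Note that the date `8.4.15` disappears completely: punctuation removal splits it into three numeric tokens, which the number filter then drops.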

In [27]:
train_data.sample(10)

Unnamed: 0,id,keyword,location,text,target
2997,4305,dust%20storm,CA via Brum,when the answer my friend isn t blowing in the...,1
5046,7194,mudslide,,even the one that looked like a mudslide,0
7305,10455,wild%20fires,"North Carolina, USA",i was thinking about you today when i was read...,1
1173,1690,bridge%20collapse,,two giant cranes holding a bridge collapse int...,1
7400,10588,wounded,North Cack/919,i also loved bury my heart at wounded knee too,0
677,978,blaze,My contac 27B80F7E 08170156520,say by agree s is a,0
2478,3554,desolate,,thanks a lot roadworks men cos a tube strike w...,1
3495,4995,explosion,,did you miss the explosion don t miss out toni...,0
7577,10829,wrecked,#NewcastleuponTyne #UK,he s gone you can relax i thought the wife who...,0
6324,9040,stretcher,,show me a picture of it,1


In [28]:
test_data.sample(10)

Unnamed: 0,id,keyword,location,text
227,742,attacked,,christian attacked by muslims at the temple mo...
2868,9512,terrorist,"Lagos, Nigeria.",polls should the unborn babies of boko haram t...
1965,6623,inundated,RELEASE THE REIS,my school is inundated and they wont let us go...
717,2331,collapse,Russia,if oikawa was in karasuno i guess i d just col...
1948,6574,injury,All Round The World,enter the world of extreme diving ûó stories u...
1752,5917,harm,,never let the experiences of your past bring h...
1818,6144,hijack,"FCT, Abuja",tension in bayelsa as patience jonathan plans ...
2450,8186,rescuers,"Manado,Sulawesi Utara",video we re picking up bodies from water rescu...
541,1777,buildings%20on%20fire,Wellington,jane kelsey on the fire economy 5th aug ûò7 30...
1348,4446,electrocute,32935,of electricity he had wire and a golf ball hoo...


#### Spelling Correction

In [29]:
!python -m pip install -U symspellpy



In [30]:
import pkg_resources
from symspellpy import SymSpell, Verbosity
sym_spell = SymSpell()
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")
# Column 0 of the dictionary file holds the term, column 1 its frequency count.
sym_spell.load_dictionary(dictionary_path,0,1)
num_prints = 10  # budget for debug printouts of corrections

def spelling_correction(sent):
  global num_prints
  doc_w_correct_spelling = []
  for tok in sent.split(" "):
    tok = tok.strip()
    if len(tok) == 0:
      continue
    # include_unknown=True returns the token itself when no suggestion is found.
    suggestion_list = sym_spell.lookup(tok,Verbosity.CLOSEST, max_edit_distance=2, include_unknown=True)
    # str() of a suggestion has the form "term, distance, count".
    first_suggestion = str(suggestion_list[0])
    y = first_suggestion.split(',')[0]  # suggested term
    z = first_suggestion.split(',')[1]  # edit distance to the original token
    if int(z) >= 1 and num_prints >= 0:
      print(f'Data Type {type(suggestion_list)}')
      print(f'Data Type {type(first_suggestion)}')
      print(f'Spelling Mistake: {tok}')
      print(f'Spelling Correction: {y}')
      print()
      num_prints -= 1
    doc_w_correct_spelling.append(y)
  return ' '.join(doc_w_correct_spelling)

In [31]:
train_data['text'] = train_data['text'].apply(lambda x: spelling_correction(x))

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: ronge
Spelling Correction: range

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: m
Spelling Correction: a

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: s
Spelling Correction: a

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: m
Spelling Correction: a

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: haha
Spelling Correction: hama

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: fvck
Spelling Correction: fuck

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: ve
Spelling Correction: be

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: bago
Spelling Correction: ago

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: bago
Spelling Correction: ago

Data Type <class 'list'>
Data Type <class 'str'>
Spelling Mistake: s
Spelling Correction: a


In [32]:
train_data.text.sample(10)

4285    hellfire is surrounded by desires so be carefu...
5579    the latest more homes razed by northern califo...
7351                     brush that a the lady from milan
1012    woke up to drake body bagging meek again meek ...
2824    thousands of people were displaced injured kil...
5326    would love to see a diabolo map themed after p...
7096    thunder pounds north goes black a deep bruise ...
155     asama bin ladies family dead in airline crash ...
6092      nigga car sinking but he snapping it up for fox
3146    the new quest type is level up quest its an al...
Name: text, dtype: object

In [33]:
test_data['text'] = test_data['text'].apply(lambda x: spelling_correction(x))

In [34]:
test_data.text.sample(10)

938     she stepping up enforcement after mining death...
3184    great time group camping at press file with fa...
593                 burning buildings keep the flames lit
2332    cod advanced warfare reckoning dec quarantine ...
94      well first we strike dreamworld and the minion...
1124    so gist houses farm produce destroyed by flood...
1947                         full game injury cut is here
682     the chemical brothers to play the armoury in o...
1821    chrysler jeep tirelessly hacked over internet ...
511     breaking news australia collapse to a hapless ...
Name: text, dtype: object

#### Lemmatization

Lemmatization converts each word to its dictionary form (its lemma). In the samples below, for example, "twelve feared killed" becomes "twelve fear kill".

In [35]:
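import spacy
import os 
# The 'en' shortcut only exists in spaCy 2.x; the full model name works in 2.x and 3.x.
os.system('python -m spacy download en_core_web_sm')

nlp = spacy.load('en_core_web_sm')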

In [36]:
def lemmatize(sentence):
  doc = nlp(sentence)
  lemmas = [token.lemma_ for token in doc]
  return " ".join(lemmas)

In [37]:
train_data['text'] = train_data['text'].apply(lambda x: lemmatize(x))

test_data['text'] = test_data['text'].apply(lambda x: lemmatize(x))

In [38]:
train_data.text.sample(10)

3129    i a in the shower and i go to go change the so...
4139    i think a lot of celebrity have to treat pal a...
4220    be to dangle pierce crystal potentially hazard...
191     twelve fear kill in pakistani air ambulance he...
4546    how come scott rice do to get another shot hol...
4673    well unfortunately for -PRON- follower stage p...
6494          fuck sake john jesus -PRON- heart just sink
6636    the terrorist try to get out of the car i shoo...
5142         a little filming inside a nuclear reactor at
4530     and -PRON- wonder why -PRON- a injure every year
Name: text, dtype: object

In [39]:
test_data.text.sample(10)

2817    snacks snack snack snack be how -PRON- survive...
1590    news on crew neighbour put out hewlett truck fire
2134           for sale canadian military rifle drop once
2374    drive home after a fairly large rainstorm make...
1610    volunteer serve first responder today during a...
2534                 i just scream to -PRON- because yeah
47                                   in iceman aftershock
395     the whole of new zealand be shout bloody marve...
1996                             lava blast dan power red
1258    what if the drought be just a giant marketing ...
Name: text, dtype: object

#### Stop Words Removal

In [40]:
def remove_stopwords(sentence):
  doc = nlp(sentence)
  all_stopwords = nlp.Defaults.stop_words
  doc_tokens = [token.text for token in doc]
  tokens_without_sw = [word for word in doc_tokens if word not in all_stopwords]
  return " ".join(tokens_without_sw)

In [41]:
train_data['text'] = train_data['text'].apply(lambda x: remove_stopwords(x))

test_data['text'] = test_data['text'].apply(lambda x: remove_stopwords(x))

In [42]:
train_data.text.sample(10)

6443    christian terrorist sure don suicide bombing e...
353     build -PRON- kingdom lead -PRON- army victory ...
3099    pakpattan city news man electrocute -PRON- cor...
6690    flash thunder -PRON- quick amazon kindle soon ...
7565                                   wreck tired asleep
6865                        need plan trip cleveland soon
2289                         imagine root -PRON- demolish
5971                                          love scream
4765       lightning strike de snow patrol de million sun
5511                                  user sub quarantine
Name: text, dtype: object

In [43]:
test_data.text.sample(10)

2981    pts research aim answer question leave future ...
2198           -PRON- good cute kind cute want obliterate
2145                      look like mudslide unreal scene
2960    les win hospital -PRON- break -PRON- leg trap ...
1528    radio free europe radio liberty ukraine famine...
2759    google news suicide bomber kill saudi security...
2620    helen model 295ss siren amplifier police emerg...
3119    gene weapon -PRON- benefit doubt year huge bel...
2939          -PRON- love tragedy -PRON- -PRON- remedyyyy
2460                                wait date ill mag pax
Name: text, dtype: object

### 03.02. Build Vectors

In [44]:
from sklearn import feature_extraction

In [45]:
count_vectorizer = feature_extraction.text.CountVectorizer()

In [46]:
train_vectors = count_vectorizer.fit_transform(train_data.text)

In [47]:
print(train_vectors[88].shape)

(1, 9177)


In [48]:
test_vectors = count_vectorizer.transform(test_data["text"])
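The vectorizer is fitted on the training text only, and the test text is mapped with `transform` so that both matrices share the same columns. A toy illustration of that behaviour, with made-up sentences standing in for the tweets:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, used only for illustration.
toy_train = ["forest fire near town", "love this sunny day", "fire truck on fire"]
toy_test = ["wild fire spreads", "sunny town"]

vec = CountVectorizer()
X_train = vec.fit_transform(toy_train)   # learns the vocabulary from the training text
X_test = vec.transform(toy_test)         # reuses that vocabulary; unseen words are ignored

print(sorted(vec.vocabulary_))           # the learned vocabulary
print(X_train.shape, X_test.shape)       # same number of columns in both matrices
print(X_test.toarray())
```

Words such as "wild" and "spreads" never occur in the toy training text, so they contribute nothing to the test vectors. The same applies to test tweets in this notebook that contain words outside the 9177-term training vocabulary.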

## 04. Model

In [49]:
from sklearn import linear_model,model_selection
clf = linear_model.LogisticRegression()

In [50]:
scores = model_selection.cross_val_score(clf,train_vectors,train_data.target,cv = 3,scoring = "f1")

In [51]:
scores

array([0.61563518, 0.57836198, 0.66534063])
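The three fold scores are easier to compare across experiments when summarised by their mean and spread. A quick sketch using the standard library; the values are copied from the output above, and `fold_scores` is an illustrative name.

```python
from statistics import mean, stdev

# F1 scores copied from the cross-validation output above.
fold_scores = [0.61563518, 0.57836198, 0.66534063]

print(f"mean F1: {mean(fold_scores):.3f}")
print(f"std  F1: {stdev(fold_scores):.3f}")
```

The mean F1 comes out at roughly 0.62, with a noticeable spread between folds.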

In [52]:
clf.fit(train_vectors,train_data.target)

LogisticRegression()
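In the cross-validation above, the vocabulary was built from all training tweets before the folds were split, so each validation fold has already influenced the feature space. One common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn pipeline, which refits the vectorizer inside every fold. A minimal sketch on made-up data (the sentences and labels are illustrative stand-ins for `train_data.text` and `train_data.target`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for train_data.text and train_data.target.
texts = ["forest fire near town", "flood destroy home", "earthquake kill people",
         "storm damage building", "wildfire burn house", "bomb explode city",
         "love sunny day", "watch movie tonight", "eat pizza friend",
         "new song release", "play game weekend", "coffee morning work"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is refit on the training part of each fold only.
model = make_pipeline(CountVectorizer(), LogisticRegression())
toy_scores = cross_val_score(model, texts, labels, cv=3, scoring="f1")
print(toy_scores)
```

The scores on such a tiny toy set carry no meaning; the point is only the structure, which could be applied to the real data by passing `train_data.text` and `train_data.target` instead.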

## 05. Model Testing

In [53]:
from sklearn.metrics import accuracy_score, classification_report, precision_score,recall_score

In [54]:
y_pred = clf.predict(test_vectors)
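Since `test_data` has no `target` column, the metrics imported above cannot be computed on `y_pred` directly. One option is to hold out part of the labelled training data and evaluate there. The sketch below shows the idea on made-up data; the sentences, the split size and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the cleaned tweets and their labels.
texts = ["forest fire near town", "fire destroy home", "flood kill people",
         "flood damage building", "fire burn house", "bomb explode city",
         "love sunny day", "watch movie tonight", "love pizza friend",
         "new song tonight", "play game day", "coffee morning work"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Keep a quarter of the labelled data aside for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

vec = CountVectorizer()
val_clf = LogisticRegression().fit(vec.fit_transform(X_tr), y_tr)
y_val_pred = val_clf.predict(vec.transform(X_val))

val_accuracy = accuracy_score(y_val, y_val_pred)
print(val_accuracy)
print(classification_report(y_val, y_val_pred, zero_division=0))
```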
