# NLP with TensorFlow

#### The data for this project is a Kaggle dataset, available at the link below. The project aims to build a deep learning model that predicts whether a given tweet describes a disaster.
https://www.kaggle.com/competitions/nlp-getting-started/data

In [None]:
#importing the libraries needed for this project.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

#### Let's take a first look at the data.

In [5]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


#### It seems that the training data is not shuffled. Let's shuffle it.

In [6]:
shuffled_train_data = train_data.sample(frac=1, random_state=42)
shuffled_train_data.head(10)

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0
5559,7934,rainstorm,,@Calum5SOS you look like you got caught in a r...,0
1765,2538,collision,,my favorite lady came to our volunteer meeting...,1
1817,2611,crashed,,@brianroemmele UX fail of EMV - people want to...,1
6810,9756,tragedy,"Los Angeles, CA",Can't find my ariana grande shirt this is a f...,0
4398,6254,hijacking,"Athens,Greece",The Murderous Story Of AmericaÛªs First Hijac...,1
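
Note that the shuffled frame keeps the original row labels (2644, 2227, ...), as the output above shows. If a clean 0..n-1 index is preferred, `reset_index(drop=True)` can be chained after `sample`. A minimal sketch on a toy DataFrame (the `toy` frame and its values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data.
toy = pd.DataFrame({"text": ["a", "b", "c", "d"], "target": [1, 0, 1, 0]})

# sample(frac=1) returns all rows in random order but keeps the old index labels.
shuffled = toy.sample(frac=1, random_state=42)

# reset_index(drop=True) discards the old labels and renumbers the rows from 0.
shuffled_reset = shuffled.reset_index(drop=True)

print(list(shuffled_reset.index))
```

Whether to reset the index is a matter of taste here, since the later cells only use the `text` and `target` columns.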


#### Let's look at the test data.

In [7]:
test_data.head(10)
# The test set has no `target` column, i.e. no labels.

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
5,12,,,We're shaking...It's an earthquake
6,21,,,They'd probably still show more life than Arse...
7,22,,,Hey! How are you?
8,27,,,What a nice hat?
9,29,,,Fuck off!


#### Good, now let's check whether the training labels are balanced.

In [9]:
shuffled_train_data['target'].value_counts()
# It's roughly balanced: about 57% for target = 0 and 43% for target = 1.

0    4342
1    3271
Name: target, dtype: int64
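
The class shares can be read off directly rather than estimated by eye. A minimal sketch, assuming a label column with the counts shown above; the `target` Series below is rebuilt from those counts so the snippet runs without `train.csv`:

```python
import pandas as pd

# Rebuild a label column with the class counts reported above
# (4342 zeros, 3271 ones); it stands in for shuffled_train_data['target'].
target = pd.Series([0] * 4342 + [1] * 3271, name="target")

# normalize=True returns class fractions instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions.round(3))
```

On the real data the same call would be `shuffled_train_data['target'].value_counts(normalize=True)`.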

In [13]:
# Let's check the sizes of the training, test, and combined data sets.
print(f"training data set : {len(train_data)}, test data set : {len(test_data)}, total data set : {len(train_data) + len(test_data)}")

training data set : 7613, test data set : 3263, total data set : 10876


In [19]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(shuffled_train_data)-3) # pick a random start index that leaves room for 3 consecutive rows
for row in shuffled_train_data[["text", "target"]][random_index:random_index+3].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(disaster)" if target > 0 else "(no disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (disaster)
Text:
Mourning notices for stabbing arson victims stir Û÷politics of griefÛª in Israel http://t.co/KkbXIBlAH7

---

Target: 1 (disaster)
Text:
Mass murderer Che Guevara greeting a woman in North Korea http://t.co/GlJBNSFGLl'

---

Target: 0 (no disaster)
Text:
Womens Flower Printed Shoulder Handbags Cross Body Metal Chain Satchel Bags Blue http://t.co/rjZw6C8asX http://t.co/WtdIav11ua

---



### Split train data into training and validation sets

Because the test data has no labels and we have to evaluate our trained models, we'll split off some of the training data to create a validation set.

I also convert the split data from pandas Series to NumPy arrays of texts and labels for ease of use later.

In [20]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_targets, val_targets = train_test_split(
    shuffled_train_data["text"].to_numpy(),
    shuffled_train_data["target"].to_numpy(),
    test_size=0.1,    # hold out 10% of the samples for validation
    random_state=42,  # fixed seed so the split is reproducible
)
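
One optional variant, not used in this notebook: since the classes are split roughly 57/43, passing `stratify=` to `train_test_split` keeps that ratio the same in the training and validation sets. The sketch below uses synthetic labels with the class counts of the training set and row numbers as placeholder texts, so `labels`, `texts`, `val_x`, and `val_y` are illustrative names only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts of the training set (4342 zeros, 3271 ones).
labels = np.array([0] * 4342 + [1] * 3271)
# Placeholder "texts": row numbers stand in for the real tweets.
texts = np.arange(len(labels))

_, val_x, _, val_y = train_test_split(
    texts, labels, test_size=0.1, random_state=42,
    stratify=labels,  # keep the class ratio the same in both splits
)

print(f"overall positive rate:    {labels.mean():.3f}")
print(f"validation positive rate: {val_y.mean():.3f}")
```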

In [25]:
# Check the lengths
len(train_texts), len(train_targets), len(val_texts), len(val_targets)

(6851, 6851, 762, 762)
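
These sizes follow from `test_size=0.1`. As far as I understand, scikit-learn rounds the held-out count up and gives the remainder to the training set; the arithmetic below is a sketch of that assumed rule, not a call into the library:

```python
import math

n_samples = 7613   # rows in train.csv, from the output above
test_size = 0.1    # fraction passed to train_test_split

# Assumed rounding rule: ceil for the held-out part, remainder for training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```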