# Natural Language Processing with Disaster Tweets

## Problem Statement and Justification

Twitter has become an essential communication platform, especially during emergencies. The widespread use of smartphones enables people to report emergencies in real-time, providing valuable information to disaster relief organizations, news agencies, and governmental bodies. However, distinguishing between genuine emergency tweets and unrelated messages poses a significant challenge.

Organizations that rely on Twitter for real-time disaster monitoring need an automated way to filter relevant tweets accurately. A machine learning model that distinguishes tweets about real disasters from unrelated ones could shorten response times and improve resource allocation during emergencies.

## Assumptions and Scope

- The dataset used for training the model consists of tweets labeled according to whether they describe a real disaster.

- Tweets may contain disaster-related words used figuratively (for example, "ablaze" describing a sunset rather than a fire), which can lead to misclassification.

- The model assumes that tweets are written in English and that linguistic patterns can be used to determine their relevance.

- The model focuses on text-based features and does not incorporate external data sources (e.g., images, geolocation).

## Hypotheses (NLP-Related)
- Tweets about real disasters have distinct linguistic patterns, including keywords, urgency markers, and direct mentions of locations or events.

- Sentiment analysis and keyword extraction can help differentiate real disaster tweets from metaphorical or unrelated statements.

- NLP techniques such as TF-IDF, word embeddings, and transformer models can effectively classify tweets into relevant and non-relevant categories.
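To make the TF-IDF idea in the last hypothesis concrete, here is a minimal from-scratch sketch. It assumes a naive whitespace tokenizer and the plain `tf * log(N / df)` weighting; the function names and the three example texts are hypothetical and not taken from the dataset. In practice a library implementation such as scikit-learn's `TfidfVectorizer` would likely be used, and its weighting differs slightly from this one.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical example texts, not taken from the dataset.
docs = [
    "forest fire near the town",
    "my mixtape is fire",
    "the town is quiet",
]
weights = tfidf(docs)
```

In this toy corpus "forest" occurs in only one text while "fire" occurs in two, so "forest" receives the higher weight in the first text. That is the property the hypothesis relies on: terms concentrated in a few tweets are treated as more informative than terms spread across many.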

## Data Description

The dataset consists of tweets labeled as either related to a disaster (1) or not (0). It contains the following columns:

   - id: A unique identifier for each tweet.

   - keyword: A keyword extracted from the tweet (may be blank). Multi-word keywords are URL-encoded, e.g. airplane%20accident.

   - location: The location from which the tweet was sent (may be blank).

   - text: The actual content of the tweet, which serves as the main feature for NLP processing.

   - target: Present only in the training data, with values:

     - 1: The tweet is about a real disaster.

     - 0: The tweet is not related to a disaster.
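The column layout above can be sanity-checked before any modeling. The following is a minimal sketch using only the standard library; the `summarize` helper and the three sample rows are hypothetical stand-ins that mimic the described columns, not real rows from `train.csv`.

```python
import csv
import io
from collections import Counter

# Hypothetical rows that mimic the column layout described above.
SAMPLE_CSV = """id,keyword,location,text,target
1,,,Example tweet about a flood,1
2,ablaze,London,Example tweet using fire as a metaphor,0
3,,Lincoln,Another example tweet,0
"""


def summarize(csv_file):
    """Count target labels and blank keyword/location fields."""
    labels = Counter()
    blanks = Counter()
    for row in csv.DictReader(csv_file):
        labels[int(row["target"])] += 1
        for column in ("keyword", "location"):
            if not row[column].strip():
                blanks[column] += 1
    return labels, blanks


labels, blanks = summarize(io.StringIO(SAMPLE_CSV))
print(dict(labels))
print(dict(blanks))
```

Once the real file is loaded with pandas, `df['target'].value_counts()` and `df.isna().sum()` should give the same kind of summary in one line each.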

In [1]:
# libraries
import pandas as pd


In [None]:
# loading the data
df = pd.read_csv('train.csv')
df.tail()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |
| 7612 | 10873 | | | The Latest: More Homes Razed by Northern Calif... | 1 |


In [5]:
df['keyword'].unique()

array([nan, 'ablaze', 'accident', 'aftershock', 'airplane%20accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
       'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
       'blazing', 'bleeding', 'blew%20up', 'blight', 'blizzard', 'blood',
       'bloody', 'blown%20up', 'body%20bag', 'body%20bagging',
       'body%20bags', 'bomb', 'bombed', 'bombing', 'bridge%20collapse',
       'buildings%20burning', 'buildings%20on%20fire', 'burned',
       'burning', 'burning%20buildings', 'bush%20fires', 'casualties',
       'casualty', 'catastrophe', 'catastrophic', 'chemical%20emergency',
       'cliff%20fall', 'collapse', 'collapsed', 'collide', 'collided',
       'collision', 'crash', 'crashed', 'crush', 'crushed', 'curfew',
       'cyclone', 'damage', 'danger', 'dead', 'death', 'deaths', 'debris',
       'deluge', 'deluged', 'demolish', 'demolished', 'demolition',
       'derail', 'der
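Several keywords in this output contain `%20`, which is the URL percent-encoding of a space. A minimal sketch of decoding them with the standard library follows; the `clean_keyword` helper is a hypothetical name, and passing non-string values through unchanged is an assumption made so that missing keywords (NaN) survive the cleanup.

```python
from urllib.parse import unquote


def clean_keyword(keyword):
    """Decode percent-escapes such as %20; pass missing values through."""
    if not isinstance(keyword, str):
        return keyword  # e.g. NaN for tweets without a keyword
    return unquote(keyword)


print(clean_keyword("airplane%20accident"))  # airplane accident
```

On the DataFrame this could be applied with `df['keyword'] = df['keyword'].map(clean_keyword)`.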

In [6]:
df['location'].unique()

array([nan, 'Birmingham', 'Est. September 2012 - Bristol', ...,
       'Vancouver, Canada', 'London ', 'Lincoln'],
      shape=(3342,), dtype=object)
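The 3342 distinct values include entries such as 'London ' with trailing whitespace, which suggests that some of the variety is formatting noise rather than genuinely different places. Below is a minimal sketch of a light normalization step; `normalize_location` is a hypothetical helper, and the plain 'London' entry in the example list is an assumed value added for illustration.

```python
def normalize_location(location):
    """Trim and collapse whitespace, then casefold; pass missing values through."""
    if not isinstance(location, str):
        return location  # e.g. NaN for tweets without a location
    return " ".join(location.split()).casefold()


# 'London ' and 'Vancouver, Canada' appear in the output above;
# the plain 'London' entry is an assumed value for illustration.
raw = ["London ", "London", "Vancouver, Canada"]
normalized = {normalize_location(value) for value in raw}
print(len(normalized))  # 2
```

This only handles whitespace and letter case. Free-text entries such as 'Est. September 2012 - Bristol' would need additional handling, which is outside what this sketch attempts.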