Natural Language Processing with Disaster Tweets

This notebook works through the Kaggle competition Natural Language Processing with Disaster Tweets. The goal is to build a machine learning model that predicts whether a tweet describes a real disaster (target = 1) or not (target = 0).

Dataset: https://www.kaggle.com/competitions/nlp-getting-started/overview

Step 1: Import necessary libraries

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score


Step 2: Load and Explore the Data
We use pandas to load and inspect the datasets.

2.1. Data Collection and Data Preparation

Load Data

In [None]:
train_df = pd.read_csv("train.csv")  # Training set
test_df = pd.read_csv("test.csv")  # Test set
sample_submission_df = pd.read_csv("sample_submission.csv")  # Sample submission format
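The load cell above assumes the CSV files sit next to the notebook. A minimal sketch of a more portable lookup follows. The helper name find_data_dir and the Kaggle mount path /kaggle/input/nlp-getting-started are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_data_dir(candidates, required="train.csv"):
    """Return the first candidate directory that contains `required`, else None."""
    for candidate in candidates:
        directory = Path(candidate)
        if (directory / required).is_file():
            return directory
    return None


# Assumed locations: the working directory first, then the Kaggle input mount.
data_dir = find_data_dir([".", "/kaggle/input/nlp-getting-started"])
print("Data directory:", data_dir)
```

If a directory is found, the files could then be read with pd.read_csv(data_dir / "train.csv") and so on.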


In [None]:
# Quick overview of the data
print(train_df.head())
print(test_df.head())
print(sample_submission_df.head())

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  
   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan
   id  target
0   0  
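Beyond the first rows, it is often useful to see how the two classes are split before training. The sketch below uses toy labels and a hypothetical helper, class_balance, rather than the real data, so the printed shares say nothing about the competition set.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: share of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels for illustration; in the notebook, pass train_df["target"] instead.
toy_labels = [1, 1, 0, 0, 0, 1, 0, 0]
print(class_balance(toy_labels))
```

A strongly skewed split would be a reason to look at precision and recall rather than accuracy alone.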

2.2. Display Data Structure

In [None]:
print("\nStructure of the training dataset:")
print(train_df.info())

print("\nStructure of the test dataset:")
print(test_df.info())

print("\nStructure of the sample submission dataset:")
print(sample_submission_df.info())


Structure of the training dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
None

Structure of the test dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB
None

Structure of the sample submission dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 

3. Check for Missing Values


In [None]:
print("Missing values in training dataset:\n", train_df.isnull().sum())
print("\nMissing values in testing dataset:\n", test_df.isnull().sum())
print("\nMissing values in samplesubmission dataset:\n", sample_submission_df.isnull().sum())

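# Drop every row that has a missing value in any column.
# Caution: this also removes rows from the test set, so test_df will no
# longer contain every id listed in sample_submission_df.
train_df = train_df.dropna()
test_df = test_df.dropna()
sample_submission_df = sample_submission_df.dropna()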


Missing values in training dataset:
 id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Missing values in testing dataset:
 id             0
keyword       26
location    1105
text           0
dtype: int64

Missing values in samplesubmission dataset:
 id        0
target    0
dtype: int64
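Since text and target have no missing values, an alternative to dropping rows is to fill the gaps in keyword and location with a placeholder. The sketch below shows the idea on a made-up toy frame. The placeholder value "unknown" is an assumption, and this is not what the notebook itself does.

```python
import pandas as pd

# Toy frame mimicking the competition columns; values are made up for illustration.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": ["ablaze", None, "wrecked"],
    "location": [None, "London, UK", "TN"],
    "text": ["Set me ablaze", "Forest fire near town", "Car wrecked on the highway"],
})

# Replace missing metadata with an explicit placeholder instead of dropping rows.
filled_df = toy_df.fillna({"keyword": "unknown", "location": "unknown"})

print(len(toy_df.dropna()), "row(s) survive dropna()")
print(len(filled_df), "row(s) survive fillna()")
print(filled_df.isnull().sum().sum(), "missing values remain")
```

Filling keeps every tweet available for training and every test id available for prediction.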


In [None]:
print("Missing values in training dataset:\n", train_df.isnull().sum())
print("\nMissing values in testing dataset:\n", test_df.isnull().sum())
print("\nMissing values in samplesubmission dataset:\n", sample_submission_df.isnull().sum())

Missing values in training dataset:
 id          0
keyword     0
location    0
text        0
target      0
dtype: int64

Missing values in testing dataset:
 id          0
keyword     0
location    0
text        0
dtype: int64

Missing values in samplesubmission dataset:
 id        0
target    0
dtype: int64


4. Drop Duplicates (Keep Only the First Occurrence)

In [None]:

train_df = train_df.drop_duplicates(keep='first')
test_df = test_df.drop_duplicates(keep='first')
sample_submission_df = sample_submission_df.drop_duplicates(keep='first')

print("\nAfter dropping duplicates:")
print("Duplicates in training dataset:", train_df.duplicated().sum())
print("Duplicates in testing dataset:", test_df.duplicated().sum())
print("Duplicates in sample_submission dataset:", sample_submission_df.duplicated().sum())



After dropping duplicates:
Duplicates in training dataset: 0
Duplicates in testing dataset: 0
Duplicates in sample_submission dataset: 0
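Note that drop_duplicates() with no arguments compares whole rows, including id. If every id is unique, repeated tweets are never flagged. The sketch below, on a made-up toy frame, shows a text-only check and how to spot copies of the same tweet that carry different labels.

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["fire downtown", "fire downtown", "lovely sunset", "lovely sunset"],
    "target": [1, 0, 0, 0],
})

# Full-row check: ids differ, so nothing counts as a duplicate.
print("Full-row duplicates:", toy_df.duplicated().sum())

# Text-only check: the same tweet appears more than once.
print("Duplicate texts:", toy_df.duplicated(subset="text").sum())

# Texts whose copies disagree on the label.
labels_per_text = toy_df.groupby("text")["target"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()
print("Conflicting texts:", conflicting)
```

How to resolve conflicting labels (drop them, relabel them, or keep them) is a modeling decision this sketch does not make.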


5. Reset Index

In [None]:
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)
sample_submission_df.reset_index(drop=True, inplace=True)


6. Display First and Last Records

In [None]:
print("\nFirst 5 records in training dataset:")
print(train_df.head())
print("\nLast 5 records in training dataset:")
print(train_df.tail())

print("\nFirst 5 records in testing dataset:")
print(test_df.head())
print("\nLast 5 records in testing dataset:")
print(test_df.tail())

print("\nFirst 5 records in sample_submission dataset:")
print(sample_submission_df.head())
print("\nLast 5 records in sample_submission dataset:")
print(sample_submission_df.tail())




First 5 records in training dataset:
   id keyword                       location  \
0  48  ablaze                     Birmingham   
1  49  ablaze  Est. September 2012 - Bristol   
2  50  ablaze                         AFRICA   
3  52  ablaze               Philadelphia, PA   
4  53  ablaze                     London, UK   

                                                text  target  
0  @bbcmtd Wholesale Markets ablaze http://t.co/l...       1  
1  We always try to bring the heavy. #metal #RT h...       0  
2  #AFRICANBAZE: Breaking news:Nigeria flag set a...       1  
3                 Crying out for more! Set me ablaze       0  
4  On plus side LOOK AT THE SKY LAST NIGHT IT WAS...       0  

Last 5 records in training dataset:
         id  keyword                location  \
5075  10826  wrecked                      TN   
5076  10829  wrecked  #NewcastleuponTyne #UK   
5077  10831  wrecked       Vancouver, Canada   
5078  10832  wrecked                 London    
5079  10833  wreck

7. Display Data Structure After Cleaning

In [None]:
print("\nStructure of the training dataset:")
print(train_df.info())

print("\nStructure of the test dataset:")
print(test_df.info())

print("\nStructure of the sample submission dataset:")
print(sample_submission_df.info())


Structure of the training dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5080 entries, 0 to 5079
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        5080 non-null   int64 
 1   keyword   5080 non-null   object
 2   location  5080 non-null   object
 3   text      5080 non-null   object
 4   target    5080 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 198.6+ KB
None

Structure of the test dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2158 entries, 0 to 2157
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        2158 non-null   int64 
 1   keyword   2158 non-null   object
 2   location  2158 non-null   object
 3   text      2158 non-null   object
dtypes: int64(1), object(3)
memory usage: 67.6+ KB
None

Structure of the sample submission dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0
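The info() outputs before and after cleaning make it possible to quantify how much data was removed. The short calculation below uses only the row counts printed in this notebook. It assumes the sample submission has one row per test id.

```python
# Row counts taken from the info() outputs in this notebook.
rows_before = {"train": 7613, "test": 3263}
rows_after = {"train": 5080, "test": 2158}
submission_rows = 3263

for name in rows_before:
    kept = rows_after[name] / rows_before[name]
    print(f"{name}: kept {rows_after[name]} of {rows_before[name]} rows ({kept:.1%})")

# Assumption: the sample submission has one row per test id.
missing_predictions = submission_rows - rows_after["test"]
print("Test ids without a prediction:", missing_predictions)
```

Roughly a third of each dataset is gone, and the reduced test set no longer lines up with the submission file.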

8. Define a Function to Clean Text

First, define the stopwords (common words such as "the", "is", and "and") that the cleaning function will remove.

In [None]:
custom_stopwords = set([
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
    "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
    "her", "hers", "herself", "it", "its", "itself", "they", "them", "their",
    "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
    "these", "those", "am", "is", "are", "was", "were", "be", "been", "being",
    "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an",
    "the", "and", "but", "if", "or", "because", "as", "until", "while", "of",
    "at", "by", "for", "with", "about", "against", "between", "into", "through",
    "during", "before", "after", "above", "below", "to", "from", "up", "down",
    "in", "out", "on", "off", "over", "under", "again", "further", "then",
    "once", "here", "there", "when", "where", "why", "how", "all", "any", "both",
    "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not",
    "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will",
    "just", "don", "should", "now"
])
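The list above includes the negations "no", "nor", and "not". Removing them can flip the apparent meaning of a tweet. The sketch below illustrates this with a shortened, hypothetical stopword set and a hypothetical helper, remove_stopwords. Whether keeping negations actually helps the classifier would need to be checked on validation data.

```python
# Shortened stopword set for illustration; the notebook's custom_stopwords is larger.
stopwords_with_negations = {"is", "a", "this", "no", "nor", "not"}
stopwords_without_negations = stopwords_with_negations - {"no", "nor", "not"}


def remove_stopwords(text, stopwords):
    """Lowercase the text and drop every word found in `stopwords`."""
    return " ".join(word for word in text.lower().split() if word not in stopwords)


tweet = "this is not a real fire"
print(remove_stopwords(tweet, stopwords_with_negations))     # real fire
print(remove_stopwords(tweet, stopwords_without_negations))  # not real fire
```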


8.1. Define and Apply the Cleaning Function

The function lowercases the text and removes URLs, HTML tags, non-letter characters (digits and punctuation), and stopwords.
The cleaned version is stored in a new clean_text column.

In [None]:
def clean_text(text):
    """Lowercase a tweet, then strip URLs, HTML tags, non-letters, and stopwords."""
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    # Keep only lowercase letters and whitespace. This already removes digits
    # and punctuation, so no separate string.punctuation pass is needed.
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords; split() and join() also collapse repeated whitespace
    text = ' '.join(word for word in text.split() if word not in custom_stopwords)
    return text

train_df['clean_text'] = train_df['text'].apply(clean_text)
test_df['clean_text'] = test_df['text'].apply(clean_text)
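To see what the cleaning steps do, it helps to run them on a single tweet. The sketch below is a standalone variant, named clean_text_demo here to avoid clashing with the notebook's function. It uses a shortened, hypothetical stopword set and a made-up URL, and adds a guard for non-string input such as NaN, which the notebook's version does not have.

```python
import re

# Shortened stopword set for illustration only.
DEMO_STOPWORDS = {"the", "is", "a", "in", "at", "of", "this", "are", "our"}


def clean_text_demo(text):
    """Apply the same cleaning steps as clean_text, with a guard for non-strings."""
    if not isinstance(text, str):  # e.g. NaN from a missing value
        return ""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # Remove URLs
    text = re.sub(r"<.*?>", "", text)                  # Remove HTML tags
    text = re.sub(r"[^a-z\s]", "", text)               # Keep letters and whitespace
    return " ".join(word for word in text.split() if word not in DEMO_STOPWORDS)


sample = "Forest fire near La Ronge Sask. Canada http://t.co/abc123 #wildfires"
print(clean_text_demo(sample))
# forest fire near la ronge sask canada wildfires
```

Note that the hashtag symbol is removed but the hashtag word itself is kept, so "#wildfires" still contributes the token "wildfires".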


9. Convert Text to Numeric Features Using TF-IDF