# NLP with Disaster Tweets - Solution


In this notebook, we will build a machine learning model to predict whether a given tweet is about a real disaster or not.

### Steps:
1. Load and explore the data.
2. Data cleaning and preprocessing.
3. Feature engineering using text vectorization techniques.
4. Model training and evaluation.
5. Final predictions and preparation of the submission file.


## 1. Load and Explore the Data

In [1]:

import pandas as pd

# Load train and test data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Display first few rows of the dataset
train_df.head()


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [2]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


## 2. Data Exploration and Preprocessing

In [3]:

# Check for missing values in the dataset
train_df.isnull().sum()


Unnamed: 0,0
id,0
keyword,61
location,2533
text,0
target,0


In [4]:

import re

# Function to clean text
def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'[^A-Za-z ]+', '', text)
    return text

# Apply the cleaning function to the text data
train_df['text'] = train_df['text'].apply(clean_text)
test_df['text'] = test_df['text'].apply(clean_text)


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# Convert stop_words to a list instead of a set
stop_words = stopwords.words('english')

# Initialize CountVectorizer
vectorizer = CountVectorizer(stop_words=stop_words, max_features=1000)

# Fit and transform the training text data
X_train = vectorizer.fit_transform(train_df['text']).toarray()
X_test = vectorizer.transform(test_df['text']).toarray()
y_train = train_df['target']

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 3. Model Selection and Training

In [6]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Split the training data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_tr, y_tr)

# Predict on validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Validation F1 Score: {f1:.4f}")


Validation Accuracy: 0.7833
Validation F1 Score: 0.7321


## 4. Cross-Validation

In [7]:

from sklearn.model_selection import cross_val_score
import numpy as np

# Evaluate model using cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-Validation F1 Score: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print("Average cross-validation accuracy:", cv_scores.mean())

Cross-Validation F1 Score: 0.5760 ± 0.0680
Average cross-validation accuracy: 0.6731968769709828


## 5. Final Prediction and Submission

In [8]:

# Train final model on entire training data
model.fit(X_train, y_train)

# Predict on test set
predictions = model.predict(X_test)


In [9]:

# Prepare submission file
submission = pd.DataFrame({
    'id': test_df['id'],
    'target': predictions
})

# Save to CSV for submission
submission.to_csv('submission.csv', index=False)
submission.head()


Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,0
4,11,1



### Summary
In this notebook, we performed data preprocessing, trained a Logistic Regression model, evaluated its performance, and made predictions on the test set for submission to Kaggle.

Further improvements could include experimenting with advanced NLP techniques, using other machine learning algorithms, or fine-tuning hyperparameters.



# NLP with Disaster Tweets - LightGBM Solution

In this notebook, we will use a LightGBM model to predict whether a given tweet is about a real disaster or not.

### Steps:
1. Load and explore the data.
2. Data cleaning and preprocessing.
3. Feature engineering using text vectorization techniques.
4. Model training and evaluation with LightGBM.
5. Final predictions and preparation of the submission file.


In [10]:
# Load train and test data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Display first few rows of the dataset
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [11]:
# Check for missing values in the dataset
train_df.isnull().sum()

Unnamed: 0,0
id,0
keyword,61
location,2533
text,0
target,0


In [12]:
import re

# Function to clean text
def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\\S+|www\\S+|https\\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\\n', '', text)
    text = re.sub(r'[^A-Za-z ]+', '', text)
    return text

# Apply the cleaning function to the text data
train_df['text'] = train_df['text'].apply(clean_text)
test_df['text'] = test_df['text'].apply(clean_text)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit and transform the training text data
X_train = vectorizer.fit_transform(train_df['text']).toarray()
X_test = vectorizer.transform(test_df['text']).toarray()
y_train = train_df['target']

In [16]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Split the training data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Initialize LightGBM classifier
lgb_model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)

# Create a LightGBM dataset for validation
eval_set = [(X_val, y_val)]

# Train the model using eval_set and early_stopping_rounds within the fit method
lgb_model.fit(X_tr, y_tr,
              eval_set=eval_set,
              # Use callbacks parameter to specify early stopping
              callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=True)],
              # Removed verbose from here as it is not a direct parameter of fit anymore
             )

# Predict on validation set
y_pred = lgb_model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Validation F1 Score: {f1:.4f}")

[LightGBM] [Info] Number of positive: 2622, number of negative: 3468
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.055998 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6413
[LightGBM] [Info] Number of data points in the train set: 6090, number of used features: 517
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.430542 -> initscore=-0.279641
[LightGBM] [Info] Start training from score -0.279641
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[208]	valid_0's binary_logloss: 0.507171
Validation Accuracy: 0.7617
Validation F1 Score: 0.6998


In [17]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Evaluate model using cross-validation
cv_scores = cross_val_score(lgb_model, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-Validation F1 Score: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")


[LightGBM] [Info] Number of positive: 2616, number of negative: 3474
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.035375 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6378
[LightGBM] [Info] Number of data points in the train set: 6090, number of used features: 499
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.429557 -> initscore=-0.283660
[LightGBM] [Info] Start training from score -0.283660
[LightGBM] [Info] Number of positive: 2617, number of negative: 3473
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.031522 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6457
[LightGBM] [Info] Number of data points in the train set: 6090, number of used features: 504
[LightGBM] [Info] [bin

In [18]:
# Train final model on entire training data
lgb_model.fit(X_train, y_train)

# Predict on test set
predictions = lgb_model.predict(X_test)


[LightGBM] [Info] Number of positive: 3271, number of negative: 4342
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032500 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8784
[LightGBM] [Info] Number of data points in the train set: 7613, number of used features: 649
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.429660 -> initscore=-0.283239
[LightGBM] [Info] Start training from score -0.283239


In [19]:
# Prepare submission file
submission = pd.DataFrame({
    'id': test_df['id'],
    'target': predictions
})

# Save to CSV for submission
submission.to_csv('lgbm_submission.csv', index=False)
submission.head()


Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,0
4,11,1
