<a href="https://colab.research.google.com/github/BaberFaisal/NLP_Text-classification_HW/blob/main/Natural_Language_Processing_with_Disaster_Tweets_using__logistic_and_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Preprocessing Description**

Steps Applied:

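Lowercasing: Ensured uniformity by converting text to lowercase (e.g., "Earthquake" → "earthquake").

URL and Punctuation Removal: Stripped links (anything matching http\S+) and all ASCII punctuation before tokenization.

Tokenization: Used NLTK’s word_tokenize to split text into words. On raw text this tokenizer splits contractions (e.g., "don't" → ["do", "n't"]) and separates punctuation. Because punctuation is removed first in this notebook, "don't" reaches the tokenizer as "dont", with no apostrophe left to split on.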

Stopword Removal: Filtered out common English stopwords (e.g., "the", "and") using NLTK’s list to retain meaningful keywords.

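Lemmatization: Applied WordNetLemmatizer to reduce words to their dictionary forms, preserving semantics better than stemming. The code calls lemmatize without a part-of-speech tag, so every word is treated as a noun: plurals are reduced (e.g., "fires" → "fire"), but verb forms such as "running" are only reduced to "run" when pos='v' is passed.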

Why These Choices?

Tokenizer: word_tokenize is reliable for general-purpose tokenization and handles edge cases (e.g., apostrophes), although that handling is not exercised here because punctuation is stripped beforehand.

Lemmatization over Stemming: Lemmatization retains word meaning (critical for disaster context), whereas stemming might produce invalid roots (e.g., "disaster" → "disast").

Stopword Removal: Focuses the model on keywords like "fire" or "earthquake."

Punctuation Removal: Simplified the input but may discard contextual cues (e.g., "!" in emergencies).
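
The effect of the cleaning order can be checked with a small standard-library sketch. It reproduces only the lowercasing, URL, and punctuation steps; str.split stands in for word_tokenize (an assumption, made to keep the example free of NLTK downloads), and the sample tweet is made up.

```python
import re
import string

# Escape the punctuation so characters such as ']' and '\' are matched
# literally inside the character class.
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")
URL_RE = re.compile(r"http\S+")


def strip_noise(text: str) -> str:
    """Lowercase, drop URLs, then drop ASCII punctuation."""
    text = text.lower()
    text = URL_RE.sub("", text)
    return PUNCT_RE.sub("", text)


sample = "Don't panic! #Earthquake near LA http://t.co/abc123"  # made-up tweet
tokens = strip_noise(sample).split()  # stand-in for word_tokenize
print(tokens)
# → ['dont', 'panic', 'earthquake', 'near', 'la']
```

The contraction, the exclamation mark, and the hashtag marker are all gone before any tokenizer sees the text.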

In [2]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

In [13]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
# Load dataset
df = pd.read_csv('/content/train (1).csv')

In [5]:
df

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |


In [6]:
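# Build these once; rebuilding the stopword list for every word is slow.
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()
# Escape the punctuation so characters such as '\' and ']' are matched literally.
PUNCT_RE = re.compile(f'[{re.escape(string.punctuation)}]')

def preprocess_text(text):
    """
    Cleans and preprocesses text data:
    - Lowercasing
    - Removing URLs & punctuation
    - Tokenization
    - Removing stopwords
    - Lemmatization (default noun part of speech)
    """
    text = text.lower()
    text = re.sub(r'http\S+', '', text)  # drop links
    text = PUNCT_RE.sub('', text)        # drop ASCII punctuation
    words = word_tokenize(text)
    words = [word for word in words if word not in STOP_WORDS]
    words = [LEMMATIZER.lemmatize(word) for word in words]
    return ' '.join(words)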




In [14]:
df['cleaned_text'] = df['text'].apply(preprocess_text)


In [16]:
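# Note: the vectorizer and the selector are fit on the full labeled set, before
# the train/validation split, so the validation scores can be optimistic.
vectorizer_tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=10000)
X_tfidf = vectorizer_tfidf.fit_transform(df['cleaned_text'])

# Select top 5000 features using Chi-Square test
selector = SelectKBest(chi2, k=5000)
X_tfidf = selector.fit_transform(X_tfidf, df['target'])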
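
One way to keep held-out data out of the fitting step is to wrap vectorization, feature selection, and the classifier in a single Pipeline and fit it on the training texts only. The sketch below is illustrative rather than the notebook's method: the toy corpus, the labels, and k=10 are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy corpus (assumption: stands in for df['cleaned_text']).
train_texts = [
    "forest fire near town",
    "earthquake shakes city",
    "flood water rising fast",
    "wildfire evacuation ordered",
    "love this sunny day",
    "new song is fire",
    "great coffee this morning",
    "watching a movie tonight",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
heldout_texts = ["tsunami warning issued", "sunny morning coffee"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=10)),  # k is arbitrary for this toy data
    ("clf", LogisticRegression(max_iter=500)),
])

# Vocabulary, IDF weights, and chi-square scores come from the training texts only.
pipe.fit(train_texts, train_labels)

vocab = pipe.named_steps["tfidf"].vocabulary_
heldout_pred = pipe.predict(heldout_texts)
print("tsunami" in vocab)  # held-out words never enter the vocabulary
print(heldout_pred)
```

Passing such a pipeline to GridSearchCV would also refit the vectorizer and selector inside every cross-validation fold.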



In [17]:
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)


In [18]:
log_reg = LogisticRegression(max_iter=500)
svm = SVC()

log_reg.fit(X_train, y_train)
svm.fit(X_train, y_train)

In [21]:
y_pred_log_reg = log_reg.predict(X_test)
y_pred_svm = svm.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print(classification_report(y_test, y_pred_log_reg))

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

Logistic Regression Accuracy: 0.8233749179251477
              precision    recall  f1-score   support

           0       0.80      0.92      0.86       874
           1       0.86      0.70      0.77       649

    accuracy                           0.82      1523
   macro avg       0.83      0.81      0.81      1523
weighted avg       0.83      0.82      0.82      1523

SVM Accuracy: 0.8305975049244911
              precision    recall  f1-score   support

           0       0.82      0.91      0.86       874
           1       0.85      0.73      0.79       649

    accuracy                           0.83      1523
   macro avg       0.84      0.82      0.82      1523
weighted avg       0.83      0.83      0.83      1523



In [20]:
param_grid_lr = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l2'], 'solver': ['lbfgs', 'liblinear']}
grid_search_lr = GridSearchCV(LogisticRegression(max_iter=500), param_grid_lr, cv=5)
grid_search_lr.fit(X_train, y_train)

param_grid_svm = {'C': [0.01, 0.1, 1, 10], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}
grid_search_svm = GridSearchCV(SVC(), param_grid_svm, cv=5)
grid_search_svm.fit(X_train, y_train)

best_log_reg = grid_search_lr.best_estimator_
best_svm = grid_search_svm.best_estimator_
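
After a grid search it helps to look at which setting won and how it scores on data the search never saw. The sketch below shows the relevant attributes on a synthetic dataset from make_classification (an assumption: it is a stand-in, not the tweet features), with a reduced grid over C only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF matrix (assumption: not the tweet data).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

grid = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_tr, y_tr)

print("best params:", grid.best_params_)
print("mean CV accuracy:", round(grid.best_score_, 3))

# refit=True (the default) retrains the best setting on all of X_tr,
# so the fitted search object can score held-out data directly.
val_acc = grid.score(X_val, y_val)
print("validation accuracy:", round(val_acc, 3))
```

The same three lookups (best_params_, best_score_, and a held-out score) apply to grid_search_lr and grid_search_svm.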


In [24]:
test_df = pd.read_csv('/content/test (1).csv')
test_df['cleaned_text'] = test_df['text'].apply(preprocess_text)


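# Vectorize the Kaggle test set with the already-fitted vectorizer and selector
X_test_tfidf = vectorizer_tfidf.transform(test_df['cleaned_text'])
X_test_tfidf = selector.transform(X_test_tfidf)


# Compare the untuned models on the held-out validation split (X_test, y_test);
# the grid-search estimators best_log_reg and best_svm are not used here.
log_reg_acc = accuracy_score(y_test, log_reg.predict(X_test))
svm_acc = accuracy_score(y_test, svm.predict(X_test))

best_model = log_reg if log_reg_acc > svm_acc else svm  # Select the better model
print(f"Using {'Logistic Regression' if best_model is log_reg else 'SVM'} for submission.")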


test_predictions = best_model.predict(X_test_tfidf)


submission = pd.DataFrame({'id': test_df['id'], 'target': test_predictions})
submission.to_csv('submission.csv', index=False)

print("Submission file 'submission.csv' created successfully!")

Using SVM for submission.
Submission file 'submission.csv' created successfully!


**Conclusion**

Best Model in Terms of Quality/Resources:

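Quality: On the 1,523-tweet validation split, SVM (83.1% accuracy) slightly outperformed Logistic Regression (82.3%), mainly through better recall on the disaster class (0.73 vs. 0.70).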

Resources: Logistic Regression is faster to train and lighter for deployment.

Trade-off: SVM offers a gain of about 0.7 percentage points in validation accuracy at higher computational cost.
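
To put the accuracy gap in perspective, the reported figures can be converted back into counts of correctly classified tweets. The sketch below uses only the numbers printed in the notebook output; the standard-error line is a rough heuristic that assumes independent tweets and ignores that both models were scored on the same split.

```python
import math

n_val = 1523  # validation tweets (total support in the reports)
acc = {"logreg": 0.8233749179251477, "svm": 0.8305975049244911}

# Accuracy times sample size recovers the number of correct predictions.
correct = {name: round(a * n_val) for name, a in acc.items()}
gap = correct["svm"] - correct["logreg"]

# Rough binomial standard error of a single accuracy estimate (assumption:
# independent draws; the paired nature of the comparison is ignored).
std_err = {name: math.sqrt(a * (1 - a) / n_val) for name, a in acc.items()}

print(correct)               # → {'logreg': 1254, 'svm': 1265}
print("gap in tweets:", gap)  # → gap in tweets: 11
print({name: round(s, 3) for name, s in std_err.items()})
```

Both standard errors come out near 0.01, which is larger than the 0.007 gap, so the ranking of the two models may not hold on a different split.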

How to Improve Results:

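Hyperparameter Tuning: The notebook runs GridSearchCV over C and solver (Logistic Regression) and over C, kernel, and gamma (SVM), but the tuned estimators (best_log_reg, best_svm) are never evaluated or used for the submission. Scoring them on the validation split and submitting the better one is the most direct next step.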

Feature Engineering:

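Retain punctuation such as "!" or "?" as urgency signals.

Include hashtags (e.g., #wildfire) as separate features; the current cleaning strips the "#", so a hashtag and the plain word collapse into the same token.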

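Class Balancing: The validation split is mildly imbalanced (874 non-disaster vs. 649 disaster tweets), and both models have lower recall on the disaster class (0.70 and 0.73). Class weights or SMOTE may help close that gap.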

Advanced Vectorization: Experiment with word embeddings (e.g., Word2Vec) alongside TF-IDF.
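
For the class-balancing item, the weights that scikit-learn's class_weight="balanced" option would assign follow the documented heuristic n_samples / (n_classes * class_count). The sketch below works it through by hand, using the validation support counts purely for illustration (an assumption: in practice the weights are computed from the training labels).

```python
from collections import Counter

# Class counts taken from the "support" column of the validation report.
counts = Counter({0: 874, 1: 649})
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}
print({label: round(w, 3) for label, w in weights.items()})
# → {0: 0.871, 1: 1.173}
```

Passing class_weight="balanced" to LogisticRegression or SVC applies this weighting during fitting, so errors on the rarer disaster class cost more.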

Difficulties Encountered:

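Computational Cost: SVM training is slower for large datasets, and the SVM grid search alone fits 16 parameter combinations × 5 folds = 80 models before the final refit.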

Context Loss: Aggressive preprocessing (e.g., removing punctuation) might discard critical signals (e.g., "Help!").

Overfitting Risk: High-dimensional TF-IDF features (10,000 → 5,000 after selection) require careful regularization.
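
One way to limit the context loss described above is to record punctuation and hashtag signals from the raw tweet before cleaning removes them. The function name surface_features and the sample tweet below are hypothetical; this is a minimal standard-library sketch, not part of the notebook.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")


def surface_features(text: str) -> dict:
    """Count signals that the punctuation-stripping step would erase."""
    return {
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
    }


# Made-up tweet for illustration.
feats = surface_features("Help! Fire spreading near #Ruby #Alaska, anyone there?")
print(feats)
# → {'n_exclaim': 1, 'n_question': 1, 'hashtags': ['ruby', 'alaska']}
```

The counts could be appended to the TF-IDF matrix as extra columns, and the hashtags could be kept as distinct tokens (for example with a prefix) so they do not merge with the plain words.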

Final Recommendation:

For practical deployment, use Logistic Regression for its speed and interpretability. For maximal accuracy, opt for SVM with tuned hyperparameters. To bridge the gap, consider hybrid models (e.g., TF-IDF + embeddings) or ensemble techniques (e.g., stacking SVM and Logistic Regression).
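
The stacking idea can be sketched with scikit-learn's StackingClassifier. The example runs on synthetic data from make_classification (an assumption: a stand-in for the TF-IDF features) and uses untuned base models, so the score says nothing about the tweet task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the tweet features (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ("log_reg", LogisticRegression(max_iter=500)),
        ("svm", SVC()),
    ],
    # The final estimator learns how to combine the base models' outputs.
    final_estimator=LogisticRegression(max_iter=500),
    # Base-model outputs fed to the final estimator come from 5-fold
    # cross-validated predictions on the training data.
    cv=5,
)
stack.fit(X_tr, y_tr)

val_acc = stack.score(X_val, y_val)
print("stacked validation accuracy:", round(val_acc, 3))
```

Stacking multiplies training cost, since each base model is fit once per fold and once more on the full training set, so it suits the accuracy-first scenario rather than the lightweight deployment one.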