<a href="https://www.kaggle.com/code/taheriodgewala/nlp-with-tf-idf-and-logistic-regression?scriptVersionId=210285336" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# About the Competition
Twitter plays a vital role in emergencies, allowing real-time announcements of disasters. However, distinguishing between tweets about actual disasters and figurative language (e.g., "ABLAZE" used metaphorically) is challenging for machines.

This competition tasks you with building a machine learning model to classify tweets as disaster-related or not, using a hand-labeled dataset of 10,000 tweets. Note that the dataset may contain offensive language. 

# Evaluation
F1 Score Formula:

$$F_1 = \frac{2 \cdot (\text{Precision} \cdot \text{Recall})}{\text{Precision} + \text{Recall}}$$

Precision: The proportion of correct positive predictions out of all positive predictions:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: The proportion of actual positives correctly identified:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Where:

TP (True Positive): Predicted 1, and the actual label is also 1.
FP (False Positive): Predicted 1, but the actual label is 0.
FN (False Negative): Predicted 0, but the actual label is 1.
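As a quick worked example of the formulas above, the sketch below computes all three metrics from raw counts. The helper name `precision_recall_f1` and the counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Guard against division by zero when there are no positive predictions/labels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * (precision * recall) / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.4f}")
# → Precision=0.75, Recall=0.60, F1=0.6667
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values (0.60 here) than a simple average would.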

# About the Notebook
In this notebook, we build a machine learning pipeline to classify disaster-related tweets. Key steps:

Data Preprocessing: Lowercased the text and removed noise such as special characters and stray whitespace.
Feature Extraction: Used TfidfVectorizer to turn the text into numerical features, with unigrams and bigrams for a richer representation.
Model Selection: Used Logistic Regression for its efficiency and suitability for binary classification tasks.
Pipeline Creation: Combined feature extraction and modeling into a single scikit-learn Pipeline, so training and prediction each take one call.
Evaluation: Achieved a baseline F1 score of 0.7709 on a stratified 20% validation split.
Submission: Generated a predictions file in the required format for submission.

This approach provides a starting point for further optimization, such as hyperparameter tuning or more advanced NLP models.

Import necessary libraries 📚

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

Load datasets 📊

In [2]:
train_data = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_data = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
sample_submission = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')


Data exploration 🗺️

In [3]:
print("Training Data Info:")
print(train_data.info())
print("\nSample Training Data:")
print(train_data.head())


Training Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
None

Sample Training Data:
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4    

Basic preprocessing function 🤖

In [4]:
def preprocess_text(df, text_column):
    # Lowercase transformation
    df[text_column] = df[text_column].str.lower()
    # Remove everything except lowercase letters, digits, and whitespace
    df[text_column] = df[text_column].str.replace(r'[^a-z0-9\s]', '', regex=True)
    # Collapse runs of whitespace into a single space and trim both ends
    df[text_column] = df[text_column].str.replace(r'\s+', ' ', regex=True).str.strip()
    return df
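To see what this cleaning does to a single tweet, here is a minimal sketch that applies the same lowercasing and character-filtering steps with the standard `re` module. The helper `clean` and the sample string are made up for illustration (the hashtag and URL are not from the dataset).

```python
import re

def clean(text):
    """Apply the lowercasing and character filtering to one string."""
    text = text.lower()
    # Keep only lowercase letters, digits, and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.strip()

# Hypothetical tweet, invented for this example
sample = "  Forest FIRE near La Ronge, Sask. #Canada http://t.co/abc  "
cleaned = clean(sample)
print(cleaned)
# → forest fire near la ronge sask canada httptcoabc
```

Note the side effect on URLs: with the punctuation removed, `http://t.co/abc` collapses into the single token `httptcoabc`. Stripping URLs before the character filter is one possible refinement.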

Apply preprocessing 💬

In [5]:
train_data = preprocess_text(train_data, 'text')
test_data = preprocess_text(test_data, 'text')

Splitting data into train and validation sets 🔂

In [6]:
X = train_data['text']
y = train_data['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
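The `stratify=y` argument keeps the class ratio the same in both splits. The sketch below illustrates this on synthetic labels; the 60/40 class balance is an assumption chosen to make the arithmetic exact, not the ratio of the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 600 negatives and 400 positives
y_demo = np.array([0] * 600 + [1] * 400)
X_demo = np.arange(len(y_demo))

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the 40% positive rate
print(f"train positives: {y_tr.mean():.2f}, validation positives: {y_va.mean():.2f}")
```

Without `stratify`, the validation positive rate would vary from seed to seed, which adds noise to a metric like F1 that depends on the positive class.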


Create a pipeline for TF-IDF and Logistic Regression 🗂️

In [7]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('model', LogisticRegression(max_iter=1000, random_state=42))
])

In [8]:
pipeline.fit(X_train, y_train)

Validation predictions 🎰

In [9]:
y_val_pred = pipeline.predict(X_val)
val_f1 = f1_score(y_val, y_val_pred)
print(f"Validation F1 Score: {val_f1:.4f}")


Validation F1 Score: 0.7709
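To connect `f1_score` back to the formulas in the Evaluation section, this sketch derives F1 from confusion-matrix counts and compares it with scikit-learn's result. The label arrays are invented for illustration and are not predictions from this notebook.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Equivalent form of 2*P*R / (P + R), written directly in terms of counts
manual_f1 = 2 * tp / (2 * tp + fp + fn)
sk_f1 = f1_score(y_true, y_pred)
print(f"TP={tp}, FP={fp}, FN={fn}, manual F1={manual_f1:.4f}, sklearn F1={sk_f1:.4f}")
# → TP=3, FP=1, FN=1, manual F1=0.7500, sklearn F1=0.7500
```

Running the same breakdown on `y_val` and `y_val_pred` would show whether the validation score is limited more by false positives or by false negatives.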


Predict on the test set 🧪

In [10]:
test_predictions = pipeline.predict(test_data['text'])


Prepare the submission file 📁

In [11]:
sample_submission['target'] = test_predictions
submission_file_name = 'predicted_target.csv'
sample_submission.to_csv(submission_file_name, index=False)


In [12]:
print(f"\nThe file '{submission_file_name}' has been created and contains the following sample:")
print(pd.read_csv(submission_file_name).head())


The file 'predicted_target.csv' has been created and contains the following sample:
   id  target
0   0       1
1   2       0
2   3       1
3   9       1
4  11       1
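Before uploading, it can help to sanity-check the file. The sketch below is one possible check, assuming the required format is exactly the `id` and `target` columns shown in the sample above; the helper `check_submission` is hypothetical, and the toy frames reuse the five rows printed above.

```python
import pandas as pd

def check_submission(submission, test_ids):
    """Return a list of problems found; an empty list means no issues were detected."""
    problems = []
    if list(submission.columns) != ["id", "target"]:
        problems.append("columns must be exactly ['id', 'target']")
        return problems
    if submission["id"].tolist() != list(test_ids):
        problems.append("ids do not match the test set order")
    if not submission["target"].isin([0, 1]).all():
        problems.append("target must contain only 0 or 1")
    return problems

toy_ids = [0, 2, 3, 9, 11]
good = pd.DataFrame({"id": toy_ids, "target": [1, 0, 1, 1, 1]})
bad = pd.DataFrame({"id": toy_ids, "target": [1, 0, 2, 1, 1]})
print(check_submission(good, toy_ids))  # → []
print(check_submission(bad, toy_ids))   # → ['target must contain only 0 or 1']
```

In the notebook itself, the call would be `check_submission(pd.read_csv(submission_file_name), test_data['id'])`.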


# Conclusion

In this notebook, we used a streamlined pipeline with TfidfVectorizer for feature extraction and Logistic Regression for classification. Key steps included text preprocessing, feature engineering, model training, and validation using the F1 score. Finally, we generated a predictions file ready for submission. This setup forms a solid baseline for further improvement.


Thank you for reading!