### 📝 **Notebook: Toxic Tweet Classification Using TF-IDF and LinearSVC**

#### 📦 **1. Import Required Libraries**
- Import all necessary libraries for data handling, NLP, and machine learning.

In [10]:
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

#### 🔃 **2. Download NLTK Resources**
- Download resources for tokenization, lemmatization, and stopword removal.
- Initialize lemmatizer and stopword list.

In [11]:
print("Downloading NLTK resources...")
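nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)
# Newer NLTK releases look up these resource names instead of the two above;
# on older releases the extra downloads simply return False.
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)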

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

Downloading NLTK resources...


#### 📂 **3. Load Dataset Files**
- Load training, test, and sample submission files.
- Exit if any file is missing.

In [12]:
try:
    train_df = pd.read_csv('train_2kmZucJ.csv')
    test_df = pd.read_csv('test_oJQbWVk.csv')
    sample_submission_df = pd.read_csv('sample_submission_LnhVWA4.csv')
except FileNotFoundError as e:
    print(f"Missing file: {e}")
    # Stop this run without calling exit(), which would shut down the notebook kernel.
    raise SystemExit(1)

#### 🧹 **4. Define Preprocessing Functions**
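- Convert text to lowercase.
- Remove URLs and @mentions, strip the `#` sign from hashtags (the tag text is kept), and replace remaining punctuation with spaces.
- Replace the masked-profanity string `$&@*#` with the token `profane`.
- Tokenize, remove English stopwords, and lemmatize each token using its part-of-speech tag.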

In [13]:
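def get_wordnet_pos(word):
    """Map a word's Penn Treebank POS tag to the matching WordNet constant."""
    # Tag the word in isolation and keep only the first letter of the tag.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    return {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }.get(tag, wordnet.NOUN)  # default to noun for any other tag

def enhanced_preprocess(text):
    """Clean one tweet and return it as a space-joined string of lemmas."""
    if not isinstance(text, str):
        return ""  # NaN or other non-string input
    text = text.lower()
    # Remove URLs.
    text = re.sub(r"http\S+|www\S+", '', text)
    # Replace the masked-profanity string first: the next step strips '#',
    # after which "$&@*#" could no longer match.
    text = text.replace("$&@*#", "profane")
    # Remove @mentions and the '#' sign (hashtag text is kept).
    text = re.sub(r"@\w+|#", '', text)
    # Replace remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [w for w in word_tokenize(text) if w not in stop_words]
    lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokens]
    return " ".join(lemmatized).strip()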
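
The order of the regex steps matters, because the masked-profanity string itself contains `#` and `@`. The sketch below isolates the regex stage only (no tokenizing, stopword removal, or lemmatizing) and runs it on a made-up tweet with the replacement placed before and after the hashtag step. The helper name `regex_clean`, the `replace_first` flag, and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

CENSOR_TOKEN = "$&@*#"  # the masked-profanity string the notebook replaces

def regex_clean(text: str, replace_first: bool = True) -> str:
    """Regex-only part of the cleaning (hypothetical helper for illustration)."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)        # drop URLs
    if replace_first:
        text = text.replace(CENSOR_TOKEN, "profane")
    text = re.sub(r"@\w+|#", "", text)                # drop @mentions and the '#' sign
    if not replace_first:
        text = text.replace(CENSOR_TOKEN, "profane")  # '#' is already gone by now
    text = re.sub(r"[^\w\s]", " ", text)              # remaining punctuation -> space
    return " ".join(text.split())                     # collapse repeated whitespace

sample = "@user This $&@*# phone!! #fail http://t.co/xyz"  # made-up tweet
print(regex_clean(sample, replace_first=True))   # this profane phone fail
print(regex_clean(sample, replace_first=False))  # this phone fail
```

With the replacement applied after `#` has been stripped, the marker no longer matches and the `profane` token is lost.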

#### 🔄 **5. Apply Preprocessing to Dataset**
- Clean tweets using the function defined above.
- Apply it to both training and test data.

In [14]:
print("Preprocessing training data...")
train_df['cleaned_tweet'] = train_df['tweet'].apply(enhanced_preprocess)

print("Preprocessing test data...")
test_df['cleaned_tweet'] = test_df['tweet'].apply(enhanced_preprocess)

Preprocessing training data...
Preprocessing test data...


#### ✨ **6. Vectorize Text using TF-IDF**
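- Create two TF-IDF vectorizers: one for word n-grams (1 to 3 words) and one for character n-grams (3 to 5 characters, within word boundaries).
- Combine them using `FeatureUnion`, which concatenates their feature columns.
- Fit and transform the training data, then transform the test data with the fitted vectorizers.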

In [15]:
print("Vectorizing text...")
word_vect = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), max_features=15000, sublinear_tf=True, strip_accents='unicode')
char_vect = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), max_features=5000, sublinear_tf=True)

vectorizer = FeatureUnion([("word", word_vect), ("char", char_vect)])

X_train = vectorizer.fit_transform(train_df['cleaned_tweet'])
X_test = vectorizer.transform(test_df['cleaned_tweet'])
y_train = train_df['label']

Vectorizing text...
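
To make the effect of `FeatureUnion` concrete, here is a minimal sketch on a three-document toy corpus (the documents are made up, and `max_features` is omitted because the corpus is tiny). It checks that the combined matrix has one column per word feature plus one column per character feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["love new phone", "hate phone battery", "battery die again"]  # toy corpus

word_vect = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), sublinear_tf=True)
char_vect = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), sublinear_tf=True)
union = FeatureUnion([("word", word_vect), ("char", char_vect)])

X = union.fit_transform(docs)  # sparse matrix: word columns followed by char columns

# After fitting, each sub-vectorizer holds its own learned vocabulary.
fitted = dict(union.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)

print(f"rows={X.shape[0]}, columns={X.shape[1]} (word={n_word}, char={n_char})")
```

In the notebook itself the two `max_features` caps bound the combined width at 15000 + 5000 = 20000 columns.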


#### 🤖 **7. Train SVM Model with Cross-Validation**
- Use Linear Support Vector Classifier with balanced class weights.
- Evaluate using 5-fold cross-validation on F1 (weighted) score.
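
As a rough illustration of what `class_weight='balanced'` does, the sketch below computes the weights for a made-up label vector and compares them with the formula scikit-learn documents, `n_samples / (n_classes * np.bincount(y))`. The label counts are assumptions chosen for readability, not the dataset's real class balance.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # toy labels: six of class 0, two of class 1
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))  # same formula written out

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.4f}")
```

The rarer class receives the larger weight, so misclassifying it costs more during training.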

In [16]:
model = LinearSVC(C=0.5, class_weight='balanced', random_state=42, max_iter=5000)

print("Performing cross-validation...")
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"Weighted F1-scores (CV): {cv_scores}")
print(f"Mean Weighted F1-score: {cv_scores.mean():.4f}")

Performing cross-validation...
Weighted F1-scores (CV): [0.88616924 0.89669183 0.89535764 0.88366326 0.88435273]
Mean Weighted F1-score: 0.8892
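
One caveat about the cell above: `X_train` was vectorized on the full training set before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. A possible alternative, sketched here on toy data, is to wrap the vectorizers and the classifier in a `Pipeline` so that they are refitted inside every fold. The texts and labels are made up, and this is only one way to structure it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for train_df['cleaned_tweet'] and train_df['label'].
texts = ["love phone", "great camera", "nice screen", "happy with battery",
         "good sound", "hate phone", "terrible camera", "awful screen",
         "angry about battery", "bad sound"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3), sublinear_tf=True)),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), sublinear_tf=True)),
    ])),
    ("svm", LinearSVC(C=0.5, class_weight="balanced", random_state=42, max_iter=5000)),
])

# Raw text goes in; the vectorizers are fitted on each fold's training part only.
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_weighted")
print(f"Mean weighted F1 on toy data: {scores.mean():.4f}")
```

The same fitted pipeline can later call `predict` directly on cleaned test tweets, without a separate `transform` step.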


#### ✅ **8. Train Final Model and Predict**
- Train final model on the full training set.
- Make predictions on the test set.

In [17]:
print("Training final model...")
model.fit(X_train, y_train)

print("Predicting on test data...")
test_predictions = model.predict(X_test)

Training final model...
Predicting on test data...


#### 📄 **9. Save Predictions to Submission File**
- Create submission file in required format and save it as CSV.

In [18]:
submission_df = pd.DataFrame({'id': test_df['id'], 'label': test_predictions})
submission_df.to_csv('submission_tuned.csv', index=False)

print("Submission file 'submission_tuned.csv' created.")
print(submission_df.head())

Submission file 'submission_tuned.csv' created.
     id  label
0  7921      1
1  7922      1
2  7923      1
3  7924      1
4  7925      1


#### 📊 **10. Evaluate F1 Score on Training Data**
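- Use the model's predictions on the training set to calculate the weighted F1 score.
- This score is optimistic because the model was fitted on these same rows; the cross-validated score from step 7 is the better estimate of performance on unseen tweets.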

In [19]:
from sklearn.metrics import f1_score

train_preds = model.predict(X_train)
train_f1 = f1_score(y_train, train_preds, average='weighted')
print(f"Training F1 Score (Weighted): {train_f1:.4f}")

Training F1 Score (Weighted): 0.9694
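
For readers unfamiliar with the `average='weighted'` option used here and in the cross-validation step, this small worked example computes the per-class F1 scores and then averages them by class support. The four labels are made up for illustration.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1]  # three samples of class 0, one of class 1
y_pred = [0, 0, 1, 1]  # one class-0 sample is misclassified

per_class = f1_score(y_true, y_pred, average=None)       # one F1 per class
weighted = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean

# Class 0: precision 2/2, recall 2/3 -> F1 = 0.8
# Class 1: precision 1/2, recall 1/1 -> F1 = 2/3
support = [3, 1]
manual = sum(f * s for f, s in zip(per_class, support)) / sum(support)

print(f"per-class F1: {per_class}")
print(f"weighted F1: {weighted:.4f}")
```

Because the average is weighted by support, a large majority class can dominate the score; the per-class values show how the minority class fares.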


#### 📁 **11. Read and Display Submission CSV**
- Read the saved `submission_tuned.csv` to verify contents.

In [20]:
submission_check = pd.read_csv('submission_tuned.csv')
submission_check.head()

     id  label
0  7921      1
1  7922      1
2  7923      1
3  7924      1
4  7925      1
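
Beyond eyeballing the first rows, the file format can be checked programmatically. The sketch below round-trips a small hypothetical frame through an in-memory buffer and also shows what happens when `index=False` is forgotten. The sample ids and labels are made up.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real submission frame.
submission_df = pd.DataFrame({"id": [7921, 7922, 7923], "label": [1, 1, 0]})

# Round-trip with index=False, as the notebook does.
buffer = io.StringIO()
submission_df.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

assert list(check.columns) == ["id", "label"]  # exactly the two expected columns
assert check["id"].is_unique                   # one prediction per id
assert check.notna().all().all()               # no missing values

# With the default index=True, the row index is written as an unnamed first
# column, which read_csv then reports as "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(submission_df.to_csv()))
print(list(with_index.columns))
```

Checks like these could be run on `submission_tuned.csv` before uploading it.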
