<a href="https://colab.research.google.com/github/Patcharaporn2093/Data-ANA-ML/blob/main/Negative_Language_Identification_(Semi_Supervised).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Challenge
In this "Negative Language Identification" task, your objective is to build a model that identifies negative language in text. Predictions should match the provided labels as closely as possible. The task combines semi-supervised learning, which uses unlabeled data to improve model performance, with binary classification: each text sample is labeled either 0 (non-toxic) or 1 (toxic).
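
One common way to use unlabeled data is pseudo-labeling: a model trained on the labeled samples predicts class probabilities for the unlabeled samples, and only the confident predictions are kept as extra training labels. The selection step can be sketched as below. The function name `select_pseudo_labels`, the 0.95 default, and the example probabilities are illustrative assumptions, not part of the task definition.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Return (row_index, predicted_label) pairs for confident rows.

    `probabilities` is a list of per-class probability rows, e.g. [[p0, p1], ...],
    where p0 is the probability of class 0 (non-toxic) and p1 of class 1 (toxic).
    """
    selected = []
    for index, row in enumerate(probabilities):
        confidence = max(row)
        # Keep a row only if its most likely class is strictly above the threshold
        if confidence > threshold:
            selected.append((index, row.index(confidence)))
    return selected


# Hypothetical probabilities for four unlabeled comments
example_probabilities = [
    [0.98, 0.02],  # confidently non-toxic
    [0.60, 0.40],  # uncertain, skipped
    [0.03, 0.97],  # confidently toxic
    [0.95, 0.05],  # exactly at the threshold, skipped
]
print(select_pseudo_labels(example_probabilities))
# prints [(0, 0), (2, 1)]
```

A higher threshold admits fewer pseudo-labels but makes each one less likely to be wrong; a lower threshold adds more data at the cost of more label noise.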

In [None]:
!gdown 1YR-Q7kmXcj2OYQi1LVfEwaCxnQ8NaFx5
!gdown 1qdEPsd92kYXXXy-HFNkD5YS6FfqwucYp

Downloading...
From: https://drive.google.com/uc?id=1YR-Q7kmXcj2OYQi1LVfEwaCxnQ8NaFx5
To: /content/train.csv
100% 3.61M/3.61M [00:00<00:00, 142MB/s]
Downloading...
From: https://drive.google.com/uc?id=1qdEPsd92kYXXXy-HFNkD5YS6FfqwucYp
To: /content/test.csv
100% 762k/762k [00:00<00:00, 71.5MB/s]


In [None]:
!pip install nltk
!pip install contractions



In [None]:
import pandas as pd
import numpy as np
import contractions
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from scipy.sparse import vstack

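# Normalize a raw comment into lowercase ASCII text without punctuation
def clean_text(text):
    text = str(text) if pd.notnull(text) else ''  # treat missing values as empty strings
    text = text.lower()
    text = contractions.fix(text)  # expand contractions, e.g. "don't" -> "do not"
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # drop non-ASCII characters
    text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace
    return text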

# Load and preprocess the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Apply the clean_text function
train['comment_text'] = train['comment_text'].apply(clean_text)
test['comment_text'] = test['comment_text'].apply(clean_text)

# Ensure 'toxic' column only contains numeric values. Non-numeric values are set to NaN and then removed.
train['toxic'] = pd.to_numeric(train['toxic'], errors='coerce')
train.dropna(subset=['toxic'], inplace=True)

# Convert 'toxic' to integer type
y_train = train['toxic'].astype(int)

# Vectorize the text with TF-IDF (unigrams and bigrams) and train an initial model on the labeled data
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train['comment_text'])

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predict on the test (unlabeled) data to generate pseudo-labels
X_test = vectorizer.transform(test['comment_text'])
test_predictions_proba = model.predict_proba(X_test)

# Apply a confidence threshold to select high-confidence predictions
confidence_threshold = 0.95
high_confidence_indices = np.where(test_predictions_proba.max(axis=1) > confidence_threshold)[0]
pseudo_labels = np.argmax(test_predictions_proba[high_confidence_indices], axis=1)

# Combine the pseudo-labeled data with the original training data, maintaining sparse format
X_combined = vstack([X_train, X_test[high_confidence_indices]])
y_combined = np.concatenate([y_train, pseudo_labels])  # original labels followed by pseudo-labels

# Re-train the model on the combined dataset
model.fit(X_combined, y_combined)

# Predict on the test dataset for submission
final_predictions = model.predict(X_test)

# Create submission file
submission_df = pd.DataFrame({'id': test['id'], 'toxic': final_predictions})
submission_df.to_csv('submission_semi_supervised.csv', index=False)
print("Submission file with semi-supervised learning saved as 'submission_semi_supervised.csv'.")


Submission file with semi-supervised learning saved as 'submission_semi_supervised.csv'.
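
The cell above writes a submission without measuring quality on labeled data, even though `train_test_split` and `classification_report` are imported. A minimal sketch of a hold-out check follows. The toy `texts` and `labels`, the `val_` prefixed names, and the split parameters are assumptions for illustration; in the notebook they would be replaced by `train['comment_text']` and `y_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical), 1 = toxic, 0 = non-toxic
texts = [
    "you are awful", "i hate you", "what a stupid idea", "you idiot",
    "shut up loser", "this is garbage", "thanks for the help", "great work",
    "have a nice day", "i agree with this", "well written article", "good point",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out part of the labeled data, keeping the class ratio in both parts
fit_texts, val_texts, y_fit, y_val = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=42
)

# Fit the vectorizer on the fitting split only, so the validation
# split does not influence the vocabulary or IDF weights
val_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_fit = val_vectorizer.fit_transform(fit_texts)
X_val = val_vectorizer.transform(val_texts)

val_model = LogisticRegression(max_iter=1000)
val_model.fit(X_fit, y_fit)
val_predictions = val_model.predict(X_val)

# Per-class precision, recall, and F1 on the held-out samples
report = classification_report(y_val, val_predictions, output_dict=True, zero_division=0)
print(classification_report(y_val, val_predictions, zero_division=0))
```

Running the same check once after the initial fit and once after re-training on the pseudo-labeled data would show whether the semi-supervised step helps on held-out labels.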
