# Text Classification Project

This notebook builds a binary text classifier that predicts whether a statement is labeled True or False. We load the data, convert the statements to TF-IDF features, train a logistic regression model, and evaluate it on a held-out test set.

## 1. Importing Libraries

In [22]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fanha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Load and Explore the Data

In [25]:
# Load the training data (local path; adjust for your environment)
file_path = r"C:\Users\fanha\Downloads\train.csv"
data = pd.read_csv(file_path)
print(data.head())


                                           Statement  Label
0  Says the Annies List political group supports ...  False
1  When did the decline of coal start? It started...   True
2  Hillary Clinton agrees with John McCain "by vo...   True
3  Health care reform legislation is likely to ma...  False
4  The economic turnaround started at the end of ...   True
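
The `head()` preview shows only five rows. Before modeling, it usually helps to check the class balance and look for missing values as well. The sketch below runs those checks on a small stand-in DataFrame. The name `toy_data` and its placeholder statements are illustrative assumptions, not the contents of train.csv. On the real data, the same calls would be applied to `data`.

```python
import pandas as pd

# Stand-in for the real dataset: placeholder statements (hypothetical),
# with the same two columns as the preview above.
toy_data = pd.DataFrame({
    "Statement": ["statement a", "statement b", "statement c",
                  "statement d", "statement e"],
    "Label": [False, True, True, False, True],
})

# Number of rows and columns
print(toy_data.shape)

# Share of rows labeled True (the mean of a boolean column)
true_share = toy_data["Label"].mean()
print(f"Share of True labels: {true_share:.2f}")

# Count of missing values per column
missing_per_column = toy_data.isna().sum()
print(missing_per_column)
```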


## 3. Data Preprocessing
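
The statements are split into training and test sets and then converted to numeric features. `TfidfVectorizer` performs all three text-processing steps, and it is fitted on the training split only before being applied to the test split:

- Tokenization
- Stopword removal (using the NLTK English stopword list)
- TF-IDF vectorization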

In [28]:
# Splitting data into features and target
X = data['Statement']
y = data['Label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Text vectorization with TF-IDF
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
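
To make the fit/transform distinction concrete, here is a minimal sketch on a toy corpus. The names `toy_train`, `toy_test`, and `toy_vectorizer` are illustrative. As an assumption made to keep the example self-contained, it uses scikit-learn's built-in `stop_words="english"` list rather than the NLTK list used above. It shows that stopwords are dropped from the vocabulary and that words seen only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical examples, not taken from train.csv)
toy_train = [
    "The economy is growing",
    "The economy is shrinking",
]
toy_test = ["The budget is growing"]

# Assumption: scikit-learn's built-in English stopword list, not NLTK's
toy_vectorizer = TfidfVectorizer(stop_words="english")

# The vocabulary is learned from the training documents only
toy_train_tfidf = toy_vectorizer.fit_transform(toy_train)

# The test document is mapped onto that fixed vocabulary;
# "budget" was never seen during fitting, so it contributes nothing
toy_test_tfidf = toy_vectorizer.transform(toy_test)

vocab = sorted(toy_vectorizer.vocabulary_)
print("Vocabulary:", vocab)
print("Train shape:", toy_train_tfidf.shape)
print("Test shape:", toy_test_tfidf.shape)
print("Non-zero entries in test row:", toy_test_tfidf.nnz)
```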

## 4. Model Training: Logistic Regression

In [31]:
# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = model.predict(X_test_tfidf)

## 5. Model Evaluation

In [34]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('\nConfusion Matrix:\n', conf_matrix)
print('\nClassification Report:\n', class_report)

Accuracy: 0.60

Confusion Matrix:
 [[395 493]
 [316 844]]

Classification Report:
               precision    recall  f1-score   support

       False       0.56      0.44      0.49       888
        True       0.63      0.73      0.68      1160

    accuracy                           0.60      2048
   macro avg       0.59      0.59      0.59      2048
weighted avg       0.60      0.60      0.60      2048
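
The figures in the report can be recomputed directly from the confusion matrix, which is a useful sanity check. The sketch below copies the matrix from the output above and assumes scikit-learn's convention (rows are true labels, columns are predicted labels, ordered False then True). The `majority_baseline` value is an addition for comparison: it is the accuracy of always predicting the more frequent class.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true labels, columns: predicted labels, order [False, True].
conf_matrix = np.array([[395, 493],
                        [316, 844]])

total = conf_matrix.sum()

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(conf_matrix) / total

# Precision per class: correct predictions / all predictions of that class
precision = np.diag(conf_matrix) / conf_matrix.sum(axis=0)

# Recall per class: correct predictions / all true members of that class
recall = np.diag(conf_matrix) / conf_matrix.sum(axis=1)

# Accuracy obtained by always predicting the majority class
majority_baseline = conf_matrix.sum(axis=1).max() / total

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision [False, True]: {precision.round(2)}")
print(f"Recall    [False, True]: {recall.round(2)}")
print(f"Majority-class baseline: {majority_baseline:.3f}")
```

On these numbers the baseline is about 0.57, so the model's accuracy of about 0.60 is only a few points better than always predicting True. The low recall on the False class (0.44) is where most of the errors come from.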



## 6. Predicting New Samples

In [37]:
# Example new statements for prediction
new_statements = [
    'The economy is improving significantly this year.',
    'The healthcare policy will harm small businesses.'
]

# Transform the new statements
new_statements_tfidf = vectorizer.transform(new_statements)

# Predict
predictions = model.predict(new_statements_tfidf)

# Show predictions
for statement, label in zip(new_statements, predictions):
    print(f'Statement: {statement}\nPredicted Label: {label}\n')

Statement: The economy is improving significantly this year.
Predicted Label: True

Statement: The healthcare policy will harm small businesses.
Predicted Label: False
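
A hard True/False label hides how confident the model is. `LogisticRegression` also provides `predict_proba`, which returns one probability per class. The sketch below shows the call on a tiny toy model. The training statements, labels, and the `toy_pipeline` name are hypothetical and exist only to make the example runnable. With the notebook's objects, the equivalent call would be `model.predict_proba(new_statements_tfidf)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set, for illustration only
toy_statements = [
    "taxes went up last year",
    "taxes went down last year",
    "unemployment rose sharply",
    "unemployment fell sharply",
]
toy_labels = [True, False, True, False]

# Vectorizer and classifier chained so raw text can be passed in directly
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_statements, toy_labels)

toy_new_statements = ["taxes went up", "unemployment fell"]

# One row per statement, one column per class
probabilities = toy_pipeline.predict_proba(toy_new_statements)

# Column order matches the classifier's classes_ attribute
classes = toy_pipeline.named_steps["logisticregression"].classes_

for statement, row in zip(toy_new_statements, probabilities):
    scores = ", ".join(f"{c}: {p:.2f}" for c, p in zip(classes, row))
    print(f"Statement: {statement}\nClass probabilities: {scores}\n")
```

Probabilities close to 0.5 indicate that the model has little evidence either way, which is worth keeping in mind given the modest test accuracy reported above.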

