# Text Classification using NLP (TF-IDF)

## Problem Statement
The objective of this project is to classify short text documents as positive or negative in sentiment using Natural Language Processing techniques.

## Dataset
The project uses a small, manually created sentiment dataset of ten short sentences: five labeled positive (1) and five labeled negative (0).

## Approach
Text data is converted into numerical features using TF-IDF vectorization, which weights each term by how often it occurs in a document and how rare it is across all documents. A Logistic Regression classifier is then trained on these features to perform classification.
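To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` uses by default. The `tfidf` helper and the three-sentence corpus are hypothetical, and the whitespace tokenizer is a simplification of the library's regex-based tokenizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical toy corpus, not the notebook's dataset
docs = ["love this movie", "hate this movie", "love this product"]
vocab, rows = tfidf(docs)

for doc, row in zip(docs, rows):
    weights = {tok: round(w, 3) for tok, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In this toy corpus, "this" appears in every document and so receives the lowest weight in each row, while terms that occur in only one document, such as "hate", receive the highest.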

## Outcome
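The model predicts the sentiment of unseen text based on learned patterns. In this run it scored an accuracy of 0.5 on the two held-out sentences and labeled the sample sentence "I really enjoyed this product" as negative.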


In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer


In [3]:
# Create a simple sentiment dataset manually
data = {
    "text": [
        "I love this movie",
        "This product is amazing",
        "The service was excellent",
        "I am very happy with the experience",
        "Absolutely fantastic performance",

        "I hate this movie",
        "This product is terrible",
        "The service was very bad",
        "I am disappointed",
        "Worst experience ever"
    ],
    "label": [
        1, 1, 1, 1, 1,   # Positive
        0, 0, 0, 0, 0    # Negative
    ]
}

df = pd.DataFrame(data)
df


| | text | label |
|---|---|---|
| 0 | I love this movie | 1 |
| 1 | This product is amazing | 1 |
| 2 | The service was excellent | 1 |
| 3 | I am very happy with the experience | 1 |
| 4 | Absolutely fantastic performance | 1 |
| 5 | I hate this movie | 0 |
| 6 | This product is terrible | 0 |
| 7 | The service was very bad | 0 |
| 8 | I am disappointed | 0 |
| 9 | Worst experience ever | 0 |


In [4]:
X = df["text"]
y = df["label"]


In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
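With `test_size=0.2` on ten rows, the test set holds only two sentences, and an unstratified split does not guarantee that both classes appear in it. The sketch below shows one possible variant that passes `stratify` to `train_test_split` so that each split keeps the 50/50 class balance. The `texts` and `labels` lists are copied from the dataset above, and the `_s`-suffixed variable names are hypothetical, chosen so they do not overwrite the notebook's own variables.

```python
from sklearn.model_selection import train_test_split

texts = [
    "I love this movie", "This product is amazing",
    "The service was excellent", "I am very happy with the experience",
    "Absolutely fantastic performance",
    "I hate this movie", "This product is terrible",
    "The service was very bad", "I am disappointed",
    "Worst experience ever",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# stratify=labels keeps the class proportions equal in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Train labels:", sorted(y_train_s))
print("Test labels:", sorted(y_test_s))
```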


In [6]:
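# Drop common English stop words before weighting terms
vectorizer = TfidfVectorizer(stop_words='english')

# Learn the vocabulary and IDF weights from the training split only,
# then reuse them on the test split so no test information leaks into the features
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)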


In [7]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)


In [8]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)


Accuracy: 0.5
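Because the test set contains only two sentences, the accuracy above can only be 0, 0.5, or 1, so a single split says little about the model. One possible way to get a steadier estimate, sketched here as an assumption rather than as part of the original notebook, is stratified k-fold cross-validation over all ten sentences. The pipeline refits the vectorizer inside each fold, so no held-out text influences the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "I love this movie", "This product is amazing",
    "The service was excellent", "I am very happy with the experience",
    "Absolutely fantastic performance",
    "I hate this movie", "This product is terrible",
    "The service was very bad", "I am disappointed",
    "Worst experience ever",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Same vectorizer and classifier settings as the notebook, wrapped in a pipeline
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)

# Five folds of two sentences each, one per class (the fold count is an assumption)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Even with cross-validation, ten sentences are too few for a trustworthy estimate; the sketch only illustrates the evaluation pattern.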


In [9]:
sample_text = ["I really enjoyed this product"]

sample_tfidf = vectorizer.transform(sample_text)
prediction = model.predict(sample_tfidf)

if prediction[0] == 1:
    print("Predicted Sentiment: Positive")
else:
    print("Predicted Sentiment: Negative")


Predicted Sentiment: Negative
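A plausible reason for this misclassification is vocabulary coverage: TF-IDF ignores any word at transform time that it did not see during fitting, so a sentence built mostly from unseen words gives the classifier very little to work with. The sketch below checks which tokens of the sample sentence are known to a vectorizer. As a labeled assumption, it fits on all ten sentences rather than on the notebook's training split, so the exact vocabulary may differ from the one the model used. The name `vec_all` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "I love this movie", "This product is amazing",
    "The service was excellent", "I am very happy with the experience",
    "Absolutely fantastic performance",
    "I hate this movie", "This product is terrible",
    "The service was very bad", "I am disappointed",
    "Worst experience ever",
]

# Assumption: fit on the full dataset, not only the training split
vec_all = TfidfVectorizer(stop_words="english")
vec_all.fit(texts)

# build_analyzer() returns the same preprocessing + tokenization + stop-word
# filtering that the vectorizer applies internally
analyzer = vec_all.build_analyzer()
sample = "I really enjoyed this product"
tokens = analyzer(sample)

known = [t for t in tokens if t in vec_all.vocabulary_]
unknown = [t for t in tokens if t not in vec_all.vocabulary_]

print("Tokens after preprocessing:", tokens)
print("In vocabulary:", known)
print("Out of vocabulary:", unknown)
```

The word "enjoyed" never occurs in the dataset, so it contributes nothing to the feature vector, while "product" occurs in both a positive and a negative sentence and therefore carries little sentiment signal.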


## Conclusion

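This project demonstrates how Natural Language Processing can be used to classify text by sentiment. The text was converted into numerical features using TF-IDF, and a Logistic Regression model was trained on them. With only ten sentences, eight of which are used for training, the results are not reliable: accuracy on the two held-out sentences was 0.5, and the sample sentence "I really enjoyed this product" was predicted as negative. The notebook is best read as a walkthrough of the complete workflow from text processing to sentiment prediction; a larger labeled dataset would be needed for meaningful accuracy.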
