# Text Classification
Text classification is the task of assigning a piece of text to one of a set of predefined labels.
### Examples:

Sentiment Analysis (Positive, Negative, Neutral)

Spam Detection (Spam, Not Spam)

Fake News Detection (Real, Fake)

# Approaches for Text Classification
### 1 Traditional Machine Learning

Uses algorithms such as Naive Bayes, SVM, and Logistic Regression

Requires a hand-built numerical representation of the text, such as TF-IDF or Bag-of-Words
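
To make the Bag-of-Words idea concrete, here is a minimal sketch using only the standard library. The two `toy_reviews` and the helper `bag_of_words` are hypothetical names made up for illustration; they are not part of the notebook's dataset or of any library.

```python
from collections import Counter

# Hypothetical toy reviews, used only to illustrate the representation.
toy_reviews = ["good movie good acting", "boring movie"]

# The vocabulary is every distinct word in the corpus, in a fixed order.
vocabulary = sorted({word for review in toy_reviews for word in review.split()})

def bag_of_words(review):
    """Return one count per vocabulary word; word order in the review is ignored."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(review) for review in toy_reviews]
print(vocabulary)
print(vectors)
```

Each review becomes a fixed-length vector of word counts, which is the kind of input a classifier such as Naive Bayes can consume.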

### 2 Deep Learning

Uses LSTMs, CNNs, and Transformers (BERT, GPT)

Learns features directly from tokenized text instead of relying on a hand-built representation

# Implementing the Traditional Approach (Naive Bayes)

### Load the dataset


In [1]:
import nltk
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [2]:
documents=[
    (list(movie_reviews.words(fileid)),category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]

In [5]:
# check the data size
print('total reviews :',len(documents))
print('example review :',' '.join(documents[0][0][:50]))  # join tokens with spaces so the sample is readable
print('category :',documents[0][1])

total reviews : 2000
example review : plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch
category : neg


### Preprocess the data

In [7]:
import random
import string
from nltk.corpus import stopwords

#shuffle the data to avoid order bias
random.shuffle(documents)

#download stopwords
nltk.download('stopwords')

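# build the stop-word set once instead of on every call
stop_words=set(stopwords.words('english'))

def preprocess_text(words):
  # keep purely alphabetic tokens, lowercased, that are not stop words
  return [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]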

# apply preprocessing
documents=[(preprocess_text(words),category) for words, category in documents]

#check example
print('processed data example :', documents[0][0][:20])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


processed data example : ['kind', 'person', 'goes', 'see', 'movies', 'long', 'overpriced', 'theatre', 'popcorn', 'butter', 'optional', 'movie', 'indeed', 'got', 'either', 'one', 'unimaginative', 'rip', 'offs', 'recent']


### Convert Text to Numerical form

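Machine learning models operate on numbers, so each review is converted into a vector with TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF weights a word by how often it occurs in a review and down-weights words that occur in many reviews.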
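
The weighting can be sketched by hand. The block below assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses with its default settings, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row; `toy_docs` and `tfidf_vectors` are hypothetical names for illustration, and this is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already tokenized reviews.
toy_docs = [["good", "movie"], ["boring", "movie"], ["good", "good", "acting"]]

def tfidf_vectors(docs):
    """Compute smoothed, L2-normalized TF-IDF rows for a list of token lists."""
    n_docs = len(docs)
    vocabulary = sorted({word for doc in docs for word in doc})
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        weights = [term_freq[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0  # guard against empty docs
        vectors.append([x / norm for x in weights])
    return vocabulary, idf, vectors

vocabulary, idf, vectors = tfidf_vectors(toy_docs)
for word in vocabulary:
    print(f"{word}: idf={idf[word]:.3f}")
```

In this toy corpus `acting` appears in one document and `movie` in two, so `acting` receives the larger IDF weight.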

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# convert words back to sentences
texts = [' '.join(words) for words, _ in documents]
labels = [label for _, label in documents]

#convert labels to binary 1= positive and 0 = negative
labels =[1 if label == 'pos' else 0 for label in labels]

#TF-IDF Vectorization
vectorizer=TfidfVectorizer(max_features=5000)

x=vectorizer.fit_transform(texts)

print('TF-IDF Matrix Shape:', x.shape)

TF-IDF Matrix Shape: (2000, 5000)


### Train a Machine Learning Model (Naive Bayes)

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Split data
x_train, x_test, y_train, y_test = train_test_split(x, labels, test_size=0.2, random_state=42)

# Train model
model = MultinomialNB()
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.80
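
One caveat about the steps above: the vectorizer was fitted on all 2000 reviews before the split, so its vocabulary and IDF weights were computed with the test reviews included. A possible alternative, sketched here on a hypothetical toy corpus (`toy_texts` and `toy_labels` stand in for `texts` and `labels`), is to split the raw strings first and let a scikit-learn pipeline fit TF-IDF on the training part only. The accuracy this would give on the movie review data is not known from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the preprocessed `texts` / `labels`.
toy_texts = [
    "great acting wonderful story", "boring plot terrible acting",
    "loved wonderful film", "awful boring waste",
    "great fun enjoyable film", "terrible awful script",
    "wonderful enjoyable story", "boring terrible waste",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw strings first so the test reviews stay unseen by the vectorizer.
toy_texts_train, toy_texts_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# fit() learns the TF-IDF vocabulary and weights from the training split only,
# then trains Naive Bayes on the resulting matrix.
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
pipeline.fit(toy_texts_train, toy_y_train)

# predict() reuses the fitted vectorizer to transform the held-out reviews.
test_predictions = pipeline.predict(toy_texts_test)
print(test_predictions)
```

Because the pipeline bundles both steps, it can also be called directly on raw strings at prediction time.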


### Test with Custom Sentences

In [11]:
import re

def predict_sentiment(review):
  # extract alphabetic tokens so attached punctuation ("boring..!") does not drop the word
  tokens=preprocess_text(re.findall(r'[A-Za-z]+', review)) # Preprocess
  review_vectorized=vectorizer.transform([' '.join(tokens)]) # Vectorize with the fitted TF-IDF
  prediction=model.predict(review_vectorized)
  return 'positive' if prediction[0] == 1 else 'negative'

In [13]:
predict_sentiment('movie is good and the acting was amazing')

'positive'

In [14]:
predict_sentiment('the movie was boring..!')

'negative'

In [15]:
predict_sentiment('this movie is pretty good but acting was horrible but boring one')

'negative'
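
The mixed review above hints at how Naive Bayes decides: every word contributes to each class score independently, and word order or contrast words like "but" play no role. The sketch below is a hand-written multinomial Naive Bayes with Laplace smoothing on a hypothetical four-review training set. `toy_training`, `train_naive_bayes`, and `score` are made-up names for illustration, and the code is not how scikit-learn implements `MultinomialNB`.

```python
import math
from collections import Counter

# Hypothetical toy training data: (tokens, label) with 1 = positive, 0 = negative.
toy_training = [
    (["good", "amazing", "acting"], 1),
    (["pretty", "good", "story"], 1),
    (["horrible", "boring", "acting"], 0),
    (["boring", "horrible", "plot"], 0),
]

def train_naive_bayes(examples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocabulary = {word for tokens, _ in examples for word in tokens}
    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocabulary}
    return log_prior, log_likelihood

def score(tokens, log_prior, log_likelihood):
    """Return the log score per class; words outside the vocabulary are skipped."""
    return {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in log_likelihood[c])
        for c in log_prior
    }

log_prior, log_likelihood = train_naive_bayes(toy_training)
mixed_review = ["pretty", "good", "acting", "horrible", "boring"]
scores = score(mixed_review, log_prior, log_likelihood)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

In this toy setup the two negative words carry more evidence than the two positive ones, so the mixed review is scored as negative (0).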