# Title: Text Classification Algorithms: A Survey
### Focused on: Random Forest Classifier

#### Group Member Names :
#### 1) Rahul Kasturi - 200629568
#### 2) Rohit Sai Kiran Ravula - 200625534



### INTRODUCTION:
In text classification, a major challenge is that the data is unstructured, high-dimensional (every distinct word can become a feature), and sparse (most words appear in only a few documents). This makes it hard for simple models to perform well, especially when the relationship between words and categories is complex.

To address this, Random Forest (RF) models are often used. RF is an ensemble machine learning method that builds many decision trees and predicts the class chosen by the majority of those trees. RF models are generally accurate, tolerate noisy data, and work well even when there are many features, as in text data.

In simple terms, RF models help in sorting or tagging things like documents, emails, or messages by learning patterns from the words in the examples they are trained on.
*********************************************************************************************************************
### AIM :
The main goal of this research paper, focusing on the Random Forest (RF) classifier, is to:

->Explore how Random Forest is used for classifying text and documents.

->Show how RF fits into natural language processing (NLP) workflows, especially after turning text into useful features and reducing the number of features.

->Discuss the strengths of RF, such as its ability to handle complex patterns, resist overfitting, and show which features are important, and also point out its downsides, such as being slower and harder to explain than simpler models.

->Share real-life examples where RF has been used successfully for organizing or tagging text data.

*********************************************************************************************************************
#### GitHub Repo:
https://github.com/kk7nc/Text_Classification.git

*********************************************************************************************************************
#### DESCRIPTION OF PAPER:
This research paper gives a detailed overview of the different methods used to classify text. It explains each step in the process, including:

->Cleaning and preparing the text <br>
->Turning text into numerical features <br>
->Reducing the number of features to make models faster and better <br>
->Using different models to classify the text <br>
->Checking how well the models work

The paper looks at many types of models, such as Naïve Bayes, Logistic Regression, SVM, k-NN, and tree-based models like Random Forest. It explains where each model works well and where it doesn't. It also talks about newer approaches like deep learning and word embedding techniques such as Word2Vec, GloVe, and FastText.
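
The five stages listed above can be sketched with scikit-learn as follows. This is only a minimal illustration, not the survey's own code: the toy corpus, the labels, the choice of a chi-squared feature selector, and `k=10` are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 0 = sports, 1 = technology.
train_texts = [
    "The team won the football match",
    "Players scored two goals in the match",
    "The coach praised the team and players",
    "The new laptop has a fast processor",
    "A software update fixed the laptop bug",
    "The phone software needs a security update",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ["The players won the match", "The laptop needs a software update"]
test_labels = [0, 1]

pipeline = Pipeline([
    # Stages 1-2: lowercase, drop English stop words, and build TF-IDF features.
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    # Stage 3: keep the 10 features with the highest chi-squared score.
    ("select", SelectKBest(chi2, k=10)),
    # Stage 4: classify with a Random Forest.
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)

# Stage 5: evaluate on text the model has not seen.
accuracy = accuracy_score(test_labels, predictions)
print("Predictions:", predictions.tolist())
print("Accuracy:", accuracy)
```

Wrapping the stages in one `Pipeline` keeps the vectorizer and the feature selector fitted on training text only, which avoids leaking information from the test set. A corpus this small says nothing about real accuracy; it only shows how the stages connect.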


*********************************************************************************************************************
#### PROBLEM STATEMENT :
With the huge growth of text data from sources like emails, social media, and articles, it's become very important to sort and label this text correctly—for example, to detect spam, understand opinions, or organize medical records. But this is not easy because text data is messy, has many unique words, and often includes complex patterns.

The main challenges are:

->Finding good ways to turn raw text into useful features <br>
->Picking the right models, like Random Forest, to do the classification <br>
->Handling problems like too many features, uneven category sizes, and the time it takes to process the data

*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
Text classification is used in many areas, such as:

->Healthcare (like sorting medical records) <br>
->Cybersecurity (like spotting phishing messages) <br>
->Social media (like finding harmful or hateful content) <br>
->Search engines and recommendation systems

Older models like Naïve Bayes and Logistic Regression can struggle when the data has complex, non-linear patterns that depend on combinations of words. That is one reason models like Random Forest, which combine multiple decision trees, are popular: they can capture such patterns and often give more accurate results.


*********************************************************************************************************************
#### SOLUTION:
The paper shows how Random Forest (RF), a tree-based model, helps in sorting and classifying documents. Here's how it works:

Random Forest builds many decision trees, each trained on a different random sample of the data (drawn with replacement) and considering a random subset of features at each split. Combining these varied trees improves accuracy and reduces overfitting, which is when a model works well on training data but not on new data.
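
To make the mechanism concrete, the sketch below builds a small forest by hand from individual decision trees. It is an illustration under assumptions, not the paper's code: synthetic data from `make_classification` stands in for vectorized text, and the tree count of 25 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for vectorized text (an assumption for this sketch).
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for i in range(n_trees):
    # Different parts of the data: a bootstrap sample drawn with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Random sets of features: each split considers sqrt(n_features) candidates.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[rows], y_train[rows])
    trees.append(tree)

# Majority vote: each tree predicts a class and the most common class wins.
all_votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)
ensemble_pred = np.array([np.bincount(col).argmax() for col in all_votes.T])

single_tree_acc = (trees[0].predict(X_test) == y_test).mean()
ensemble_acc = (ensemble_pred == y_test).mean()
print(f"Single tree accuracy: {single_tree_acc:.3f}")
print(f"Ensemble accuracy:    {ensemble_acc:.3f}")
```

In practice `sklearn.ensemble.RandomForestClassifier` does all of this internally. One difference worth knowing is that scikit-learn averages the trees' predicted class probabilities instead of counting hard votes, which usually leads to the same label.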

Why it’s useful for text classification:

->RF can handle text data with many features. <br>
->It doesn’t get easily confused by noisy or unimportant words. <br>
->It supports tasks with many categories. <br>
->It works well even with simple text processing, so it's a good starting point. <br>

Some drawbacks:

->It’s harder to understand and explain than simpler models. <br>
->Training takes more time, especially with very large datasets. <br>
->For tasks needing deep understanding of meaning (like context or emotion), deep learning models might do better.




# Background
*********************************************************************************************************************
### Reference of Paper selected
Research Paper: https://www.mdpi.com/2078-2489/10/4/150?source=post_page--------------------------- <br>
Dataset: from sklearn.datasets import fetch_20newsgroups <br>
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html <br>
GitHub link: https://github.com/kk7nc/Text_Classification

### Explanation
I read a research paper called "Text Classification Algorithms: A Survey", which explains different ways to classify text. I found it very useful and wanted to see how these methods work in practice.

I looked it up on Papers with Code and found its GitHub repository, which includes models like CNN, DNN, RNN, and CRF. I chose to focus on the Random Forest Classifier because it's a traditional model that is easy to understand and works well with many types of data.

The GitHub author used the fetch_20newsgroups dataset from sklearn.datasets. This is a popular dataset of roughly 18,000 newsgroup posts spread across 20 topics, often used to test how well models can sort text into categories. It's a good fit for checking how Random Forest performs with text that has lots of different features.

In this setup, standard techniques like TF-IDF were used to turn text into numbers, then the Random Forest model was trained and tested to see how well it could classify the documents.

### Dataset
fetch_20newsgroups dataset from sklearn.datasets


*********************************************************************************************************************






# Implementation of paper code :

### To find Random_Forest.py file:
Text_Classification --> code --> Random_Forest.py



*********************************************************************************************************************
### Contribution Code :
#### 1. Importing Libraries
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, accuracy_score

#### 2. Load the Data
    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test = fetch_20newsgroups(subset='test')

#### 3. Text Vectorization (TF-IDF)
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(newsgroups_train.data)
    X_test = vectorizer.transform(newsgroups_test.data)

#### 4. Train Random Forest Classifier
    clf = RandomForestClassifier()
    clf.fit(X_train, newsgroups_train.target)

#### 5. Make Predictions
    predictions = clf.predict(X_test)

#### 6. Evaluate Performance
    print(classification_report(newsgroups_test.target, predictions))
    print("Accuracy:", accuracy_score(newsgroups_test.target, predictions))

### Results :
Overall accuracy: roughly 77–84% <br>
This suggests that Random Forest can handle sparse text data transformed via TF-IDF reasonably well.


#### Observations :
The Random_Forest.py script shows that TF-IDF plus Random Forest alone can give a solid baseline on a multi-class NLP task like 20 Newsgroups. It is simple to set up and reasonably accurate for a model with no tuning.


# Implementation done by us
We applied the Random Forest Classifier on a new dataset to evaluate the model’s accuracy and see how well it performs on unseen data.

### Dataset 
Email Spam Text Classification Dataset, from Kaggle <br>
Link: https://www.kaggle.com/datasets/tapakah68/email-spam-classification

### GitHub
Please check the code in the repository below. <br>
File name: spam_classifier.py <br>
GitHub link: https://github.com/RahulKasturi/email-spam-classifier.git

### Conclusion and Future Direction :
This project demonstrates the effectiveness of the Random Forest classifier in detecting spam emails using TF-IDF vectorized features. By combining subject and body text, and applying class balancing through SMOTE, we achieved reasonable accuracy and interpretability without deep learning. The model performs well in identifying legitimate emails, but struggles slightly with rare or ambiguous spam messages — a common challenge in real-world datasets.

#### Learnings :
->Learned how to preprocess real-world textual data (title + content). <br>
->Understood how TF-IDF helps convert raw text into meaningful numerical features. <br>
->Gained experience with Random Forest, a powerful ensemble classifier that reports feature importances. <br>
->Applied SMOTE to deal with class imbalance — an essential step in email spam filtering. <br>
->Practiced creating a fully functional and reproducible ML pipeline. <br>
->Learned to document, organize, and publish a public GitHub repository.
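
The idea behind SMOTE mentioned above can be sketched in a few lines. This is a hand-written illustration of the interpolation step, not the project's code (which is in the linked repository) and not the imbalanced-learn implementation. The helper name `smote_like_oversample`, the random data, and the 90/10 split are all hypothetical. It works on dense arrays, so a sparse TF-IDF matrix would need converting first.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    if len(X_minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, len(X_minority) - 1)
    # Ask for k + 1 neighbours: the first one returned is normally the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbour_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_minority), size=n_new)  # random minority rows
    picked = neighbour_idx[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Hypothetical imbalanced training set: 90 "not spam" rows and 10 "spam" rows.
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0.0, 1.0, (90, 4)), rng.normal(3.0, 1.0, (10, 4))])
y_train = np.array([0] * 90 + [1] * 10)

X_spam = X_train[y_train == 1]
X_synthetic = smote_like_oversample(X_spam, n_new=80)

X_balanced = np.vstack([X_train, X_synthetic])
y_balanced = np.concatenate([y_train, np.ones(len(X_synthetic), dtype=int)])
print("Class counts before:", np.bincount(y_train))
print("Class counts after: ", np.bincount(y_balanced))
```

Oversampling should be applied to the training split only. Synthetic rows in the test split would make the reported scores look better than they are.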

#### Results Discussion :
->Accuracy reached around 75% initially and improved after applying SMOTE and combining input features. <br>
->"Not spam" emails were detected more accurately, with higher recall and precision.<br>
->"Spam" emails were harder to classify, likely due to text diversity and smaller class size.<br>
->Overall, the model offers a strong baseline but can be enhanced for production use.
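
The gap between the two classes is easier to see with per-class precision and recall than with accuracy alone. The labels below are invented for illustration and are not the project's actual predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels for 10 emails: 0 = not spam, 1 = spam.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
print("Accuracy:", accuracy_score(y_true, y_pred))

scores = {}
for label, name in [(0, "not spam"), (1, "spam")]:
    p = precision_score(y_true, y_pred, pos_label=label)
    r = recall_score(y_true, y_pred, pos_label=label)
    scores[name] = (p, r)
    print(f"{name:>8}: precision={p:.2f}, recall={r:.2f}")
```

In this toy case the accuracy is 0.7, yet only one of the three spam emails is caught (recall of about 0.33). This is the kind of pattern an imbalanced dataset tends to produce.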

#### Limitations :
->Class imbalance: Despite using SMOTE, the dataset may not fully reflect the diversity of real-world spam. <br>
->TF-IDF ignores word order and context — it’s good for frequency, not meaning. <br>
->Random Forest lacks semantic understanding: It doesn't capture nuanced meanings the way newer models do. <br>
->Interpretability is limited to feature importances: individual decisions cannot be traced as directly as with the coefficients of a logistic regression.

#### Future Extension :
->Try alternative models: Compare with SVM, Logistic Regression, or XGBoost. <br>
->Use word embeddings (like GloVe or Word2Vec) for better text understanding. <br>
->Integrate deep learning (e.g., LSTM or BERT) for sequence and context modeling. <br>
->Add explainability tools like SHAP to interpret model decisions. <br>
->Expand dataset with more spam types or emails from different sources. <br>
->Deploy as a web app or API to allow live email classification.


# References:

Research Paper: https://www.mdpi.com/2078-2489/10/4/150?source=post_page--------------------------- <br>
Paper GitHub link: https://github.com/kk7nc/Text_Classification <br>
Paper Dataset: fetch_20newsgroups dataset from sklearn.datasets

New Dataset: https://www.kaggle.com/datasets/tapakah68/email-spam-classification <br>
Our GitHub link: https://github.com/RahulKasturi/email-spam-classifier.git