In [4]:
!pip install sklearn
!pip install nltk
import nltk

Collecting sklearn
  Using cached sklearn-0.0.post10.tar.gz (3.6 kB)
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.


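The error:

× python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

egg_info is the setup step that generates package metadata, such as the package's name, version, and dependencies. It fails here because the sklearn name on PyPI is a deprecated placeholder rather than the real library. The library is published as scikit-learn (and is still imported as sklearn), so the install command should be !pip install scikit-learn.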
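To confirm which distributions are actually installed, one option is to query package metadata from Python. This is a minimal sketch using only the standard library; the helper name installed_version is made up for this example, and the distribution name to look up is scikit-learn even though the import name is sklearn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a PyPI distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The library is distributed as "scikit-learn" but imported as "sklearn".
for name in ("scikit-learn", "nltk"):
    print(name, "->", installed_version(name) or "not installed")
```

If either line reports "not installed", rerunning the install cell with the correct distribution name should resolve it.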


**1. Load the dataset**

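First, we need to load the dataset that we will use to train our text classification model. In this tutorial, we will use the 20 newsgroups dataset, which scikit-learn can download through its fetch_20newsgroups function. The dataset contains roughly 18,000 newsgroup posts on 20 topics, already split into a training subset and a test subset; we load the training subset here.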

In [3]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_data = fetch_20newsgroups(subset="train")
X, y = newsgroups_data.data, newsgroups_data.target

**2. Preprocess the text data**

Text preprocessing is an essential step in NLP. It helps to clean and standardize the text data, making it easier for the machine learning model to understand and process the input. In this tutorial, we will use the nltk library to preprocess the text data. The preprocessing steps include tokenization, stopword removal, and stemming.
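To make the three steps concrete before bringing in nltk, here is a toy sketch in plain Python. It is only an illustration under stated assumptions: the stopword list is a tiny hand-picked set, and toy_stem is a crude suffix stripper, not the Porter algorithm that the tutorial uses. All names in it are hypothetical.

```python
import re

# Tiny hand-picked stopword list, for illustration only (NLTK's English list is much longer).
TOY_STOPWORDS = {"the", "is", "are", "a", "an", "and", "of", "to", "in", "on"}


def toy_stem(word):
    """Strip a few common suffixes. A crude stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    kept = [t for t in tokens if t not in TOY_STOPWORDS]   # stopword removal
    return [toy_stem(t) for t in kept]                     # stemming


print(toy_preprocess("The cats are chasing the red balls in the garden."))
```

Stems such as "chas" are not dictionary words, which is expected: stemming only needs to map related word forms to the same string.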

In [5]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [6]:
nltk.download("punkt")
nltk.download("stopwords")
# Note: recent NLTK releases may also require nltk.download("punkt_tab") for word_tokenize.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
# Build the stopword set and the stemmer once instead of on every call.
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess_text(text):
    tokens = word_tokenize(text)
    # Keep alphabetic tokens only, lowercase them, and drop stopwords.
    filtered_tokens = [token.lower() for token in tokens if token.isalpha() and token.lower() not in stop_words]
    # Reduce each remaining token to its stem.
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    return " ".join(stemmed_tokens)

X_preprocessed = [preprocess_text(text) for text in X]

**3. Transform text data into numerical features**

Machine learning models work with numerical data, so we need to convert our text data into numerical features. One common approach is to use the Term Frequency-Inverse Document Frequency (TF-IDF) method. The sklearn library provides a TfidfVectorizer class that can be used to transform the text data.
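As a rough illustration of what TF-IDF computes, the sketch below weights raw term counts by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit length. This is meant to mirror what TfidfVectorizer does with its default settings as far as I understand them; treat it as an approximation for intuition, not a replacement. The function name tfidf and the toy documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, given a list of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})  # L2-normalize
    return vectors


docs = [["space", "rocket", "launch"], ["hockey", "game", "team"], ["space", "game"]]
for vec in tfidf(docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

In the first toy document, "rocket" appears in only one document and so receives a higher weight than "space", which appears in two.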

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_transformed = vectorizer.fit_transform(X_preprocessed)

**4. Train the text classification model**

After the text data is preprocessed and transformed, we can train our text classification model. In this tutorial, we will use the Multinomial Naive Bayes classifier from the sklearn library.

**What is the Multinomial Naive Bayes algorithm?**

The Multinomial Naive Bayes algorithm is a probabilistic learning method that is widely used in Natural Language Processing (NLP). It is based on Bayes' theorem and predicts the tag (class) of a text, such as an email or a newspaper article. It calculates the probability of each tag for a given sample and outputs the tag with the highest probability.

It is called "naive" because it assumes that, given the tag, the features are independent of one another: the presence or absence of one word does not affect the presence or absence of any other word.

Source: https://www.upgrad.com/blog/multinomial-naive-bayes-explained/
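The description above can be sketched in a few lines of plain Python. This is a simplified illustration working on word counts, not scikit-learn's implementation: it estimates a prior for each tag, estimates per-word likelihoods with add-one (Laplace) smoothing, and picks the tag with the highest log score. The function names, the alpha parameter name, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from lists of tokens."""
    vocab = {term for doc in docs for term in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Smoothing adds alpha to every word count so unseen words never get probability 0.
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like


def predict_nb(model, doc):
    """Return the tag with the highest log posterior score; unknown words are skipped."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Made-up training data, for illustration only.
train_docs = [["rocket", "orbit", "nasa"], ["orbit", "moon"], ["goal", "team"], ["team", "puck", "goal"]]
train_labels = ["space", "space", "hockey", "hockey"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["moon", "rocket"]))
print(predict_nb(model, ["puck", "team", "unknownword"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents, and the independence assumption is what allows the per-word terms to be added up separately.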

In [9]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_transformed, y)
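The vectorizing and training steps can also be bundled into a single object with scikit-learn's make_pipeline, which some readers may find easier to reuse at prediction time. The sketch below is an optional variation on a tiny made-up corpus, not the newsgroups data; the toy_ names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only (not the newsgroups data).
toy_texts = [
    "rocket launch orbit nasa",
    "moon orbit shuttle launch",
    "hockey team goal puck",
    "team wins hockey game",
]
toy_labels = ["space", "space", "hockey", "hockey"]

# fit() learns the vocabulary, the IDF weights, and the classifier in one call.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# predict() applies the already-fitted vectorizer, then the classifier.
print(toy_model.predict(["nasa shuttle orbit", "puck and goal"]))
```

Because the pipeline accepts raw strings, new text only needs to go through the same preprocessing function before being passed to predict.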

**5. Evaluate the text classification model**

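After training our text classification model, we need to evaluate its performance on data it has not seen. We load the test subset with the fetch_20newsgroups function and apply the same preprocess_text function to it. We then convert it with the vectorizer that was already fitted on the training data, calling transform rather than fit_transform so that the test set does not influence the vocabulary or the IDF weights. Finally, we use the predict method to make predictions and calculate the accuracy of our model.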

In [10]:
from sklearn.metrics import accuracy_score

test_data = fetch_20newsgroups(subset="test")
X_test, y_test = test_data.data, test_data.target

# Reuse the preprocessing function and the vectorizer fitted on the training data.
X_test_preprocessed = [preprocess_text(text) for text in X_test]
X_test_transformed = vectorizer.transform(X_test_preprocessed)

y_pred = classifier.predict(X_test_transformed)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8029739776951673
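Accuracy is simply the fraction of test documents whose predicted label matches the true label. A single number can hide weak classes, so it can also help to look at the share of correct predictions per true class. The sketch below shows both in plain Python on made-up labels; the function names are hypothetical and this is not how accuracy_score is implemented internally.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its examples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels, for illustration only.
toy_true = [0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 2, 0, 2]
print(accuracy(toy_true, toy_pred))
print(per_class_recall(toy_true, toy_pred))
```

Applied to y_test and y_pred from the cell above, the same functions would show which newsgroups the classifier handles well and which it confuses.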


References:
[How to create a text classification model with Python (reintech.io)](https://reintech.io/blog/how-to-create-a-text-classification-model-with-python)