# Text Classification using scikit-learn in NLP
- from [GeeksForGeeks Text Classification using scikit-learn in NLP](https://www.geeksforgeeks.org/nlp/text-classification-using-scikit-learn-in-nlp/)
- for *reasons* I need to figure out how I *would* have categorized different types of vehicle services based on the description.

## Text Classification
- assign predefined categories or labels to text documents
- involves automated sorting and organizing of textual data
- extract valuable information and insights from large volumes of text

## Import Dataset
- fetch the data from the `20 newsgroups` dataset from `sklearn`
- it's a collection of documents from 20 different newsgroups
- specifically grab documents about baseball and space
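- this takes a while on the first call: `fetch_20newsgroups` downloads the whole archive (about 18,000 posts across all 20 groups) and caches it locally, then returns only the requested categories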

In [2]:
from sklearn.datasets import fetch_20newsgroups
# fetch the dataset of news documents
newsgroups = fetch_20newsgroups(
    subset="all",  # get both the test and train datasets
    categories=[
        "rec.sport.baseball",  # get baseball sport documents
        "sci.space",  # get science space documents
    ],
    shuffle=True,  # shuffle the data
    random_state=42,  # shuffle in a predictable order
)

In [68]:
newsgroups.target_names

['rec.sport.baseball', 'sci.space']

In [69]:
import pandas as pd
# split off the data (text) from the target (label)
data = newsgroups.data
target = newsgroups.target
# Create a DataFrame for easy manipulation
df = pd.DataFrame({"text": data, "label": target})
df

| | text | label |
|---|---|---|
| 0 | From: mss@netcom.com (Mark Singer)\nSubject: R... | 0 |
| 1 | From: cuz@chaos.cs.brandeis.edu (Cousin It)\nS... | 0 |
| 2 | From: J019800@LMSC5.IS.LMSC.LOCKHEED.COM\nSubj... | 0 |
| 3 | From: tedward@cs.cornell.edu (Edward [Ted] Fis... | 0 |
| 4 | From: snichols@adobe.com (Sherri Nichols)\nSub... | 0 |
| ... | ... | ... |
| 1976 | From: msb@sq.sq.com (Mark Brader)\nSubject: Re... | 1 |
| 1977 | From: clgs11@vaxa.strath.ac.uk\nSubject: Jack ... | 0 |
| 1978 | From: 18084TM@msu.edu (Tom)\nSubject: Level 5?... | 1 |
| 1979 | From: snichols@adobe.com (Sherri Nichols)\nSub... | 0 |


### explore the data
- it's got 1981 rows
- the articles are shuffled together
- a label of `0` indicates that it's about `baseball`
- a label of `1` indicates that it's about `space`

In [60]:
print("articles labeled 0 are about baseball")
print("label:", df.iloc[0]["label"])
print("text:", "\n".join(df.iloc[0]["text"].split("\n")[0:8]), "\n...")

articles labeled 0 are about baseball
label: 0
text: From: mss@netcom.com (Mark Singer)
Subject: Re: Young Catchers
Article-I.D.: netcom.mssC52qMx.768
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
Lines: 86

In article <7975@blue.cis.pitt.edu> genetic+@pitt.edu (David M. Tate) writes:
>mss@netcom.com (Mark Singer) said: 
...


In [61]:
print("articles labeled 1 are about space")
print("label:", df.iloc[1980]["label"])
print("text:", "\n".join(df.iloc[1980]["text"].split("\n")[0:8]), "\n...")

articles labeled 1 are about space
label: 1
text: From: mancus@sweetpea.jsc.nasa.gov (Keith Mancus)
Subject: Re: Lindbergh and the moon (was:Why not give $1G)
Organization: MDSSC
Lines: 32

In article <1r3nuvINNjep@lynx.unm.edu>, cook@varmit.mdc.com (Layne Cook) writes:
> All of this talk about a COMMERCIAL space race (i.e. $1G to the first 1-year 
> moon base) is intriguing. Similar prizes have influenced aerospace  
...


## Method
- we'll be using:
    - TF-IDF for text vectorization (see [GeeksForGeeks](https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/) and [my notes](#tf-idf) on TF-IDF)
    - SVM (Support Vector Machine) for classification
- while TF-IDF is an NLP technique and SVM is a supervised ML algorithm, they're much simpler than a dense Transformer-based LLM like Devstral.
    - LLMs include steps like attention to work out how words relate to each other, using context to tell apart the same word with different meanings, like `courts` in `tennis courts` and `royal courts`
    - this method will just be looking at the tokens themselves, and their frequency
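
To make the "tokens and frequency only" point concrete, here's a minimal sketch on a made-up two-sentence corpus (the sentences and the `toy_` names are mine, not from the GeeksForGeeks article). Both uses of `courts` land in the same column with the same weight, and shuffling the word order gives the same vector:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences that use "courts" in different senses
docs = [
    "the tennis courts were busy",
    "the royal courts were busy",
]
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(docs)

# "courts" maps to exactly one column, whatever the surrounding words are
courts_col = toy_vectorizer.vocabulary_["courts"]
courts_scores = toy_X[:, courts_col].toarray().ravel()
print("column for 'courts':", courts_col)
print("TF-IDF of 'courts' in each document:", courts_scores)

# Word order is ignored too: a shuffled copy of the first sentence gets the same vector
shuffled = toy_vectorizer.transform(["busy were courts tennis the"])
same_vector = np.allclose(shuffled.toarray(), toy_X[0].toarray())
print("word order ignored:", same_vector)
```

So the classifier only ever sees which terms occur and how heavily they are weighted, never how they relate to their neighbors.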

## Pre-Process with TF-IDF for text vectorization
- we'll use a `TfidfVectorizer` to `vectorize` the set of documents into a `sparse matrix`
- that matrix will be the input `X`, and the document labels will be the output `y`
- the sparse matrix basically lists a TF-IDF for each term in each document, leaving out terms where TF-IDF is zero because a term is absent from a document
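
Here's a small sketch of what ends up in that matrix, on a made-up three-document corpus. It assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), where the weight is the raw term count times `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. The manual calculation is only there to check my understanding against the library:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus, just large enough to have shared and unique terms
docs = [
    "the pitcher threw the ball",
    "the rocket reached orbit",
    "the shuttle reached the station",
]

# Step 1: raw term counts (TF), one row per document, one column per term
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Step 2: smoothed inverse document frequency (IDF) per term
df_per_term = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df_per_term)) + 1

# Step 3: TF * IDF, then scale every row to unit L2 length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

# Compare with the vectorizer's sparse output
X_toy = TfidfVectorizer().fit_transform(docs)
print("shape:", X_toy.shape, "stored (non-zero) values:", X_toy.nnz)
print("matches manual computation:", np.allclose(manual, X_toy.toarray()))
```

Only the non-zero entries are stored, which is why the matrix stays small even when the vocabulary has tens of thousands of terms.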

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize TF-IDF Vectorizer to transform the text
vectorizer = TfidfVectorizer(
    stop_words='english', # list of words to ignore like "a", "the", "and"
    max_df=0.7 # ignore terms that appear in more than 70% of the documents (DF = Document Frequency)
)
# Use the vectorizer to transform the text column into feature vectors
X = vectorizer.fit_transform(df['text'])
# Extract the label column (0 for baseball, 1 for space)
y = df['label']
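
To see what `max_df` does on its own, here's a sketch with four made-up sentences (not the newsgroups data). Terms that occur in more than 70% of the documents get dropped from the vocabulary, the same way the stop-word list drops common English words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "the", "said" and "was" occur in every document
docs = [
    "the pitcher said the game was long",
    "the astronaut said the launch was delayed",
    "the umpire said the game was close",
    "the engineer said the orbit was stable",
]

plain = TfidfVectorizer().fit(docs)
only_max_df = TfidfVectorizer(max_df=0.7).fit(docs)
filtered = TfidfVectorizer(stop_words="english", max_df=0.7).fit(docs)

# Terms removed purely because their document frequency is above 0.7
dropped_by_max_df = set(plain.vocabulary_) - set(only_max_df.vocabulary_)
print("dropped by max_df:", sorted(dropped_by_max_df))
print("kept with both filters:", sorted(filtered.vocabulary_))
```

`game` survives because it appears in only half of the documents, which is under the 0.7 threshold.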

In [51]:
import numpy as np
# Get feature names (terms). They are sorted alphanumerically, in the same order as the matrix columns
feature_names = np.array(vectorizer.get_feature_names_out())
print("Top 5 TF-IDF terms in first few documents:")
# iterate through the first 4 documents
for i, doc in enumerate(df['text']):
    if i == 4:
        break
    print("-" * 44)
    # Get the row for the current document from the sparse matrix
    doc_row = X[i, :]
    # Get the indices of non-zero elements in the row and their corresponding TF-IDF values
    term_indices = doc_row.nonzero()[1] # find elements that appeared in the document (non-zero TF-IDF's)
    tfidf_values = doc_row.data # get the TF-IDF's of the nonzero elements in the row
    # Sort the terms by their TF-IDF values in descending order
    sorted_indices = np.argsort(tfidf_values)[::-1]
    # Get the top 5 terms and their TF-IDF scores
    top_term_indices = term_indices[sorted_indices[:5]]
    top_tfidf_scores = tfidf_values[sorted_indices[:5]]
    top_terms = feature_names[top_term_indices]
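    # print the document ID and second line (subject), then the terms with the top TF-IDF's in that document
    subject_line = doc.split("\n")[1]  # split outside the f-string: nested double quotes fail before Python 3.12
    print(f"Document {i}: '{subject_line}'")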
    for term, score in zip(top_terms, top_tfidf_scores):
        print(f"  - {term:<10} {score:.3f}")


Top 5 TF-IDF terms in first few documents:
--------------------------------------------
Document 0: 'Subject: Re: Young Catchers'
  - lopez      0.267
  - aa         0.224
  - netcom     0.193
  - aaa        0.184
  - season     0.154
--------------------------------------------
Document 1: 'Subject: Re: HBP? BB? BIG-CAT?'
  - plus       0.472
  - cuz        0.254
  - errors     0.209
  - walks      0.193
  - error      0.192
--------------------------------------------
Document 2: 'Subject: re: candlestick'
  - berkeley   0.302
  - pasteur    0.196
  - louven     0.186
  - field      0.178
  - stick      0.177
--------------------------------------------
Document 3: 'Subject: Re: Pleasant Yankee Surprises'
  - team       0.181
  - deserving  0.173
  - offense    0.171
  - curtis     0.168
  - salmon     0.168


## Fit the SVM Model
- use [Support Vector Machine (SVM)](https://www.geeksforgeeks.org/machine-learning/support-vector-machine-algorithm/) model (also see [More Notes on SVM](#ai-categories-ml-classification-regression))
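- first split off 30% of the data into a `test` set.
- make SURE the data is randomized before splitting - originally all of the sports articles came first and all the space articles came afterwards. `train_test_split` shuffles by default, and `random_state=42` makes the shuffle repeatable
- we'll use a linear kernel, which means the decision boundary is a flat hyperplane in the TF-IDF feature space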

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the classifier
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

| parameter | value |
|---|---|
| C | 1.0 |
| kernel | 'linear' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | False |
| tol | 0.001 |
| cache_size | 200 |
| class_weight | None |


## Evaluate the Model
- we'll use [accuracy score](https://www.geeksforgeeks.org/machine-learning/difference-between-score-and-accuracy_score-methods-in-scikit-learn/) and [classification report](https://www.geeksforgeeks.org/machine-learning/compute-classification-report-and-confusion-matrix-in-python/)
- also see [notes on Classification Metrics](#classification-metrics)
- sweet, `99.66%` accuracy! So it correctly categorized `99.66%` of the test documents!
- the Classification Report includes
    - `Precision` : Measures the accuracy of positive predictions.
    - `Recall` : Indicates how many actual positives were correctly identified.
    - `F1-Score` : Balances precision and recall into a single score.
    - `Support` : Shows the number of samples for each class.
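- so there were slightly more space articles (309) than baseball articles (286) in the test set. `0.9966` of 595 is 593 correct, so 2 documents were misclassified, and the `0.99` recall for `sci.space` means both were space articles classified as baseball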
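
To check how those columns are calculated, here's a sketch that rebuilds them from a confusion matrix. The counts are my own inference from the report, not notebook output: 595 test documents at `0.9966` accuracy means 2 errors, and a `sci.space` recall that rounds to `0.99` only works if both errors are space articles predicted as baseball.

```python
import numpy as np

class_names = ["rec.sport.baseball", "sci.space"]

# Rows = true class, columns = predicted class (inferred counts, see above)
cm = np.array([
    [286, 0],   # all 286 baseball articles predicted as baseball
    [2, 307],   # 2 of the 309 space articles predicted as baseball
])

support = cm.sum(axis=1)      # actual documents per class
predicted = cm.sum(axis=0)    # predictions made per class
correct = np.diag(cm)         # correct predictions per class

precision = correct / predicted   # of the documents predicted as X, how many really are X
recall = correct / support        # of the documents that really are X, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = correct.sum() / cm.sum()

print(f"accuracy: {accuracy:.4f}")
for name, p, r, f, s in zip(class_names, precision, recall, f1, support):
    print(f"{name:<20} precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give the real matrix directly.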

In [None]:
from sklearn.metrics import accuracy_score, classification_report
# Predict on the test set
y_pred = clf.predict(X_test)
# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=newsgroups.target_names)
# print it out
print(f'Accuracy: {accuracy:.4f}', '\nClassification Report:\n', report)

Accuracy: 0.9966 
Classification Report:
                     precision    recall  f1-score   support

rec.sport.baseball       0.99      1.00      1.00       286
         sci.space       1.00      0.99      1.00       309

          accuracy                           1.00       595
         macro avg       1.00      1.00      1.00       595
      weighted avg       1.00      1.00      1.00       595



## Create Predictor Function
- new text needs to be vectorized with the same vectorizer that was fit on the original documents
- then it can be predicted with the trained classifier

In [75]:
def predict_category(text):
    """Predict the category of a given text using the trained classifier."""
    # vectorize the text with the fitted TF-IDF vectorizer. Use 'transform' instead of 'fit_transform' so the vocabulary stays the same
    text_vec = vectorizer.transform([text]) # note you can give it more than one text at once
    # use the trained classifier to predict on the vectorized text. Returns an array of label(s)
    prediction = clf.predict(text_vec) # returns an array with as many entries as you passed to transform
    # convert the label number from the only entry to the category name
    return newsgroups.target_names[prediction[0]]

# Example usage
sample_text = "NASA announced the discovery of new exoplanets."
predicted_category = predict_category(sample_text)
print(f"Sample Text:        '{sample_text}'\nPredicted Category: '{predicted_category}'")

Sample Text:        'NASA announced the discovery of new exoplanets.'
Predicted Category: 'sci.space'
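
An alternative sketch, not from the GeeksForGeeks article: scikit-learn's `make_pipeline` can bundle the vectorizer and the classifier into one object, so there's no way to mix up `fit_transform` and `transform`. The training sentences and the `toy_` names are made up to keep the example self-contained; in the notebook you'd fit on `df['text']` and `df['label']` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training data: 0 = baseball, 1 = space
toy_texts = [
    "the pitcher threw a fastball in the ninth inning",
    "the batter hit a home run",
    "the umpire called a strike",
    "the rocket reached orbit after launch",
    "the probe entered lunar orbit",
    "the astronaut boarded the space station",
]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_names = ["rec.sport.baseball", "sci.space"]

# fit() runs fit_transform on the vectorizer, then fits the classifier
model = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="linear"))
model.fit(toy_texts, toy_labels)

# predict() runs transform (never fit) on the vectorizer, then predicts
new_texts = [
    "The rocket launch reached orbit",
    "The pitcher threw a strike in the ninth inning",
]
predictions = model.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(f"{toy_names[label]:<20} {text}")
```

The pipeline takes raw strings in and gives labels out, and it accepts a whole list of texts at once.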
