In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
v = CountVectorizer()
v.fit(["Thor Hathodawala is looking for a job"])
v.vocabulary_

In [None]:
v = CountVectorizer(ngram_range=(1,3))
v.fit(["Thor Hathodawala is looking for a job"])
v.vocabulary_
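
To make the effect of ngram_range=(1,3) concrete, here is a minimal plain-Python sketch of how the vocabulary above could be enumerated. It is not scikit-learn's implementation. It assumes CountVectorizer's default settings (lowercasing, a token pattern that keeps only words of two or more characters, and indices assigned in alphabetical order). The helper names `word_ngrams` and `build_vocabulary` are made up for this illustration.

```python
import re

# Assumption: mirrors CountVectorizer's default token pattern (words of 2+ characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 1)):
    """Return every word n-gram of `text` for n in ngram_range (inclusive)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(docs, ngram_range=(1, 1)):
    """Map each distinct n-gram to an index, assigned in alphabetical order."""
    features = sorted({g for doc in docs for g in word_ngrams(doc, ngram_range)})
    return {gram: idx for idx, gram in enumerate(features)}


sentence = "Thor Hathodawala is looking for a job"
vocab = build_vocabulary([sentence], ngram_range=(1, 3))
print(len(vocab))  # 6 unigrams + 5 bigrams + 4 trigrams = 15
print(vocab)
```

Note that the single-character word "a" never becomes a token under this pattern, so the bigram "for job" appears even though the two words are not adjacent in the original sentence.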

While scikit-learn's CountVectorizer is the standard tool for converting text into a bag-of-n-grams matrix, spaCy is often preferred for the preprocessing stage because it tags every token with linguistic attributes (stop-word and punctuation flags, lemmas) that can be used to clean the text and improve the quality of those n-grams.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return " ".join(filtered_tokens) 

In [None]:
preprocess("Loki is eating pizza")

In [None]:
preprocess("Thor ate pizza")

In [None]:
corpus = [
    "Thor ate pizza",
    "Loki is tall",
    "Loki is eating pizza"
]

In [None]:
corpus_processed = [preprocess(text) for text in corpus]
corpus_processed

By using spaCy for the heavy lifting and CountVectorizer for the final matrix, you get the best of both worlds: linguistic accuracy and mathematical efficiency.

The Workflow:

* Feed text to spaCy: Let it analyze the grammar and context.
* Filter & Transform:
  * Check token.is_stop to drop stop words.
  * Check token.is_punct to remove punctuation.
  * Grab token.lemma_ to get the root word.
* Feed to CountVectorizer: Pass the "cleaned" strings of lemmas into the vectorizer.

Why this specific order?

If you rely on CountVectorizer's built-in stop_words='english', it only compares the exact tokens it is given against a fixed word list. It never lemmatizes.
* The Problem: One inflected form of a word may be on the stop-word list while another form, or its lemma, is not.
* The Result: You end up with "messy" data where some versions of a word are removed and others aren't, and the surviving inflections ("eat", "ate", "eating") are counted as separate features.


In [None]:
v = CountVectorizer(ngram_range=(1,2)) 
v.fit(corpus_processed)
v.vocabulary_


Now generate the bag-of-n-grams vector for a few sample documents.

In [None]:
v.transform(['Thor eat pizza']).toarray()

Let's take a document that contains out-of-vocabulary (OOV) terms and see what bag-of-n-grams vector it produces.

In [None]:
v.transform(['Tanu eat snacks']).toarray()
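
The two transform calls above can be sketched in plain Python to show what happens to OOV terms: an n-gram that was never seen during fit has no column, so it is skipped. This is an illustration, not scikit-learn's code. The vocabulary below is an assumption: it is what fitting with ngram_range=(1,2) would give if the preprocessed corpus were ["Thor eat pizza", "Loki tall", "Loki eat pizza"], and the actual spaCy output may differ. `to_count_vector` is a hypothetical helper name.

```python
import re

# Assumption: mirrors CountVectorizer's default token pattern (words of 2+ characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumed vocabulary for the three-document corpus (alphabetical column order).
VOCAB = {
    "eat": 0, "eat pizza": 1, "loki": 2, "loki eat": 3, "loki tall": 4,
    "pizza": 5, "tall": 6, "thor": 7, "thor eat": 8,
}


def to_count_vector(text, vocab, max_n=2):
    """Count the 1..max_n word n-grams of `text` that exist in `vocab`."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    vector = [0] * len(vocab)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocab:  # n-grams unseen at fit time are silently dropped
                vector[vocab[gram]] += 1
    return vector


in_vocab = to_count_vector("Thor eat pizza", VOCAB)
oov = to_count_vector("Tanu eat snacks", VOCAB)
print(in_vocab)  # [1, 1, 0, 0, 0, 1, 0, 1, 1]
print(oov)       # [1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Only "eat" survives from the second document: "tanu", "snacks", and both bigrams containing them have no column in the fitted vocabulary.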

News Category Classification Problem:

Here we want to classify news by category. We will use bag of n-grams to train a machine learning model that can assign any news item to one of the following categories:

* BUSINESS
* SPORTS
* CRIME
* SCIENCE

In [None]:
import pandas as pd
df = pd.read_json('news_dataset.json')
print(df.shape)
df.head()

In [None]:
df.category.value_counts()

In [None]:
min_samples = 1381

df_business = df[df.category=='BUSINESS'].sample(min_samples,random_state=2022)
df_sports = df[df.category=='SPORTS'].sample(min_samples,random_state=2022)
df_crime = df[df.category=='CRIME'].sample(min_samples,random_state=2022)
df_science = df[df.category=='SCIENCE'].sample(min_samples,random_state=2022)

In [None]:
df_balanced = pd.concat([df_business,df_sports,df_crime,df_science],axis=0)
df_balanced.category.value_counts()
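
The balancing step above is random undersampling: every class is cut down to the size of the smallest one. The sketch below shows the same idea in plain Python on made-up records, deriving the sample size from the data instead of hard-coding it. The function name `undersample` and the toy records are hypothetical. In pandas, the smallest class size could similarly be read from df.category.value_counts().min().

```python
import random
from collections import Counter, defaultdict


def undersample(records, label_key, seed=2022):
    """Randomly keep the same number of records per label (the smallest class size)."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    # The smallest class decides how many records every class keeps.
    min_samples = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, min_samples))
    return balanced


# Hypothetical toy data with deliberately unequal class sizes.
records = (
    [{"text": f"business {i}", "category": "BUSINESS"} for i in range(5)]
    + [{"text": f"sports {i}", "category": "SPORTS"} for i in range(3)]
    + [{"text": f"crime {i}", "category": "CRIME"} for i in range(4)]
    + [{"text": f"science {i}", "category": "SCIENCE"} for i in range(2)]
)

balanced = undersample(records, "category")
counts = Counter(rec["category"] for rec in balanced)
print(counts)
```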

In [None]:
df.head()

In [None]:
df.shape

In [None]:
target = {'BUSINESS':0,'SPORTS':1,'CRIME':2,'SCIENCE':3}

df_balanced['category_num'] = df_balanced['category'].map(target)

In [None]:
df_balanced.head()

In [None]:
df_balanced['preprocessed_txt'] = df_balanced['text'].apply(preprocess)

In [None]:
df_balanced.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df_balanced.preprocessed_txt,
    df_balanced.category_num,
    test_size=0.25,
    random_state=2022,
    stratify=df_balanced.category_num,
)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = Pipeline([
    ('vectorizer',CountVectorizer(ngram_range=(1,2))),
    ('nb',MultinomialNB())
])

In [None]:
clf.fit(X_train,y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred))

In [None]:
cm = confusion_matrix(y_test,y_pred)
cm

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8,6))
sns.heatmap(cm,annot=True,fmt='d')
plt.xlabel('Prediction')
plt.ylabel('Truth')
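
To connect the heatmap to the numbers in the classification report, here is a small sketch of how per-class precision, recall, and F1 can be derived from a confusion matrix whose rows are the true labels and whose columns are the predictions (the layout confusion_matrix returns and the axes labelled above). The matrix values are invented for illustration and are not results from the news dataset. `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(cm):
    """Compute precision, recall and F1 per class from a square confusion matrix.

    cm[i][j] = number of samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[row][k] for row in range(n))  # column sum
        actually_k = sum(cm[k])                               # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


# Made-up 4-class matrix (BUSINESS, SPORTS, CRIME, SCIENCE), 10 samples per class.
cm_example = [
    [8, 1, 1, 0],
    [0, 9, 1, 0],
    [2, 0, 7, 1],
    [0, 0, 1, 9],
]

metrics = per_class_metrics(cm_example)
accuracy = sum(cm_example[k][k] for k in range(4)) / sum(map(sum, cm_example))

for label, m in zip(["BUSINESS", "SPORTS", "CRIME", "SCIENCE"], metrics):
    print(f"{label:9s} precision={m['precision']:.2f} recall={m['recall']:.2f} f1={m['f1']:.2f}")
print(f"accuracy={accuracy:.3f}")
```

Reading the matrix this way, off-diagonal cells in a column lower that class's precision, while off-diagonal cells in a row lower its recall.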