## **The Bag of N-grams**
The Bag of N-grams model extends the Bag of Words (BoW) model in natural language processing (NLP). The traditional BoW model represents a document as an unordered collection of words and their frequencies, so all word order is lost. The Bag of N-grams model also counts sequences of n consecutive words, known as n-grams, which preserves some local word order.

For example, using bigrams only (n = 2) and ignoring case and punctuation:

- Document 1: "The cat in the hat."
- Document 2: "The cat sat on the mat."

Vocabulary: {the cat, cat in, in the, the hat, cat sat, sat on, on the, the mat}

Vector representation:
- Document 1: [1, 1, 1, 1, 0, 0, 0, 0]
- Document 2: [1, 0, 0, 0, 1, 1, 1, 1]
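
The vocabulary and vectors for the two example documents can be reproduced by hand. The following is a minimal sketch in plain Python, not how `CountVectorizer` is implemented; the helper name `bigrams` is made up for illustration, and the vocabulary is ordered by first appearance rather than alphabetically.

```python
import re

def bigrams(text):
    """Lowercase the text, keep alphabetic tokens, and pair consecutive words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

docs = ["The cat in the hat.", "The cat sat on the mat."]

# Build the vocabulary in order of first appearance across both documents.
vocab = []
for doc in docs:
    for gram in bigrams(doc):
        if gram not in vocab:
            vocab.append(gram)

# Count how often each vocabulary bigram occurs in each document.
vectors = [[bigrams(doc).count(gram) for gram in vocab] for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Bigrams are formed within each document only, so no bigram spans the boundary between Document 1 and Document 2.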


In [47]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import pprint
import nltk
import pandas as pd

### Understanding how to generate n-grams using CountVectorizer

In [2]:
# BoW model
cv = CountVectorizer()
cv.fit(['Loki is an amazing movie.'])
print(cv.vocabulary_)

{'loki': 3, 'is': 2, 'an': 1, 'amazing': 0, 'movie': 4}


In [3]:
# n-gram model
cv = CountVectorizer(ngram_range=(1,3))
cv.fit(['Loki is an amazing movie.'])
pprint.pprint(cv.vocabulary_)

{'amazing': 0,
 'amazing movie': 1,
 'an': 2,
 'an amazing': 3,
 'an amazing movie': 4,
 'is': 5,
 'is an': 6,
 'is an amazing': 7,
 'loki': 8,
 'loki is': 9,
 'loki is an': 10,
 'movie': 11}


In [4]:
# Move to another example
corpus = [
    "Thor ate pizza",
    "Loki is tall",
    "Loki is eating pizza"
]

### Preprocess the corpus first (stop-word removal and lemmatization)

In [5]:
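# Requires the NLTK 'stopwords', 'wordnet' and tokenizer data (see nltk.download)
sw = set(nltk.corpus.stopwords.words('english'))
lemma = nltk.stem.WordNetLemmatizer()
clear_corpus = []
for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    # Drop stop words (case-sensitive match), then lemmatize (default POS: noun)
    words = [lemma.lemmatize(w) for w in words if w not in sw]
    clear_corpus.append(" ".join(words))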

In [6]:
print(corpus)
print("\n",clear_corpus)

['Thor ate pizza', 'Loki is tall', 'Loki is eating pizza']

 ['Thor ate pizza', 'Loki tall', 'Loki eating pizza']


In [7]:
# n-gram model
cv = CountVectorizer(ngram_range=(1,2))
cv.fit(clear_corpus)
pprint.pprint(cv.vocabulary_)

{'ate': 0,
 'ate pizza': 1,
 'eating': 2,
 'eating pizza': 3,
 'loki': 4,
 'loki eating': 5,
 'loki tall': 6,
 'pizza': 7,
 'tall': 8,
 'thor': 9,
 'thor ate': 10}


In [8]:
cv.transform(['Thor ate pizza']).toarray()

array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1]], dtype=int64)

In [9]:
cv.transform(['Hulk ate pizza']).toarray()

array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]], dtype=int64)

## **News Category Classification using Bag of N-grams**

This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. 
Each record in the dataset consists of the following attributes:

- category: category in which the article was published.
- headline: the headline of the news article.
- authors: list of authors who contributed to the article.
- link: link to the original news article.
- short_description: abstract of the news article.
- date: publication date of the article.

Dataset: https://www.kaggle.com/datasets/rmisra/news-category-dataset
- To keep things simple, the data has been trimmed to four categories ('BUSINESS', 'SPORTS', 'CRIME', 'SCIENCE'), so it differs from the original Kaggle dataset.


In [25]:
df = pd.read_json('data\\news_dataset.json')
df.head()

| | text | category |
|---|---|---|
| 0 | Watching Schrödinger's Cat Die University of C... | SCIENCE |
| 1 | WATCH: Freaky Vortex Opens Up In Flooded Lake | SCIENCE |
| 2 | Entrepreneurs Today Don't Need a Big Budget to... | BUSINESS |
| 3 | These Roads Could Recharge Your Electric Car A... | BUSINESS |
| 4 | Civilian 'Guard' Fires Gun While 'Protecting' ... | CRIME |


In [26]:
df['category'].value_counts()

BUSINESS    4254
SPORTS      4167
CRIME       2893
SCIENCE     1381
Name: category, dtype: int64

### Handle class imbalance

In [27]:
# Undersample every class to the size of the smallest one (SCIENCE, 1381 rows)
min_samples = 1381

df_business = df[df['category'] == 'BUSINESS'].sample(min_samples,random_state=42)
df_sport = df[df['category'] == 'SPORTS'].sample(min_samples,random_state=42)
df_crime = df[df['category'] == 'CRIME'].sample(min_samples,random_state=42)
df_science = df[df['category'] == 'SCIENCE'].sample(min_samples,random_state=42)

In [33]:
df_balance = pd.concat([df_business,df_sport,df_crime,df_science], axis = 0).reset_index(drop=True)
df_balance['category'].value_counts()

BUSINESS    1381
SPORTS      1381
CRIME       1381
SCIENCE     1381
Name: category, dtype: int64
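
The same undersampling can be written without hard-coding the minority-class size or repeating one line per category. The following is a sketch on a made-up toy frame, not the notebook's data; it assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available.

```python
import pandas as pd

# Toy, imbalanced data standing in for the news DataFrame (values are made up).
toy = pd.DataFrame({
    "text": [f"headline {i}" for i in range(10)],
    "category": ["BUSINESS"] * 5 + ["SPORTS"] * 3 + ["SCIENCE"] * 2,
})

# Size of the smallest class, computed instead of hard-coded.
min_samples = int(toy["category"].value_counts().min())

# Draw min_samples rows from every category in one step.
balanced = (
    toy.groupby("category")
       .sample(n=min_samples, random_state=42)
       .reset_index(drop=True)
)
counts = balanced["category"].value_counts().to_dict()
print(counts)
```

Undersampling discards rows from the larger classes, which is acceptable here because the smallest class still has 1381 examples.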

### Convert categories into numbers

In [41]:
df_balance['category_num'] = df_balance['category'].map({'BUSINESS': 0, 'SPORTS': 1, 'CRIME': 2, 'SCIENCE': 3})
df_balance.head()

| | text | category | category_num |
|---|---|---|---|
| 0 | How to Develop the Next Generation of Innovato... | BUSINESS | 0 |
| 1 | Madoff Victims' Payout Nears $7.2 Billion, Tru... | BUSINESS | 0 |
| 2 | Bay Area Floats 'Sanctuary In Transit Policy' ... | BUSINESS | 0 |
| 3 | Microsoft Agrees To Acquire LinkedIn For $26.2... | BUSINESS | 0 |
| 4 | Inside A Legal, Multibillion Dollar Weed Market | BUSINESS | 0 |


### Build a model with the original text (no preprocessing)

In [46]:
X_train, X_test, y_train, y_test = train_test_split(df_balance['text'], 
                                                    df_balance['category_num'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df_balance['category_num']
                                                    )
X_train.shape, X_test.shape, y_train.value_counts()

((4419,),
 (1105,),
 0    1105
 3    1105
 1    1105
 2    1104
 Name: category_num, dtype: int64)

#### Train and evaluate a model on basic BoW

In [48]:
clf = Pipeline([
    ('vectorizer_bow', CountVectorizer()),
    ('model', MultinomialNB())
])

clf.fit(X_train, y_train)

print(classification_report(y_test,clf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.78      0.92      0.84       276
           1       0.92      0.85      0.88       276
           2       0.91      0.89      0.90       277
           3       0.89      0.82      0.85       276

    accuracy                           0.87      1105
   macro avg       0.88      0.87      0.87      1105
weighted avg       0.88      0.87      0.87      1105
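
The rows 0 to 3 in the report correspond to BUSINESS, SPORTS, CRIME, and SCIENCE, following the `category_num` mapping. As a sketch of how the report could show the names directly, `classification_report` accepts a `target_names` argument; the labels and predictions below are toy values, not the notebook's results.

```python
from sklearn.metrics import classification_report

# Position in this list must match category_num (0 = BUSINESS, ..., 3 = SCIENCE).
label_names = ['BUSINESS', 'SPORTS', 'CRIME', 'SCIENCE']

# Toy labels and predictions for illustration only.
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 3, 1, 1]

print(classification_report(y_true, y_pred,
                            labels=[0, 1, 2, 3],
                            target_names=label_names))

# The same report as a dictionary, which is easier to inspect programmatically.
report = classification_report(y_true, y_pred,
                               labels=[0, 1, 2, 3],
                               target_names=label_names,
                               output_dict=True)
```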



#### Train and evaluate on n-grams

In [50]:
clf = Pipeline([
    ('vectorizer_ngram', CountVectorizer(ngram_range=(1,3))),
    ('model', MultinomialNB())
])

clf.fit(X_train, y_train)

print(classification_report(y_test,clf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.71      0.95      0.81       276
           1       0.92      0.80      0.85       276
           2       0.91      0.88      0.89       277
           3       0.92      0.77      0.84       276

    accuracy                           0.85      1105
   macro avg       0.86      0.85      0.85      1105
weighted avg       0.86      0.85      0.85      1105
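
When comparing the BoW and n-gram pipelines, it may help to look at how quickly the feature space grows with `ngram_range`, since every added n-gram order introduces new and mostly rarer features. The sketch below only measures vocabulary size on the small three-sentence corpus from earlier; it does not by itself explain the scores above.

```python
from sklearn.feature_extraction.text import CountVectorizer

clear_corpus = ["Thor ate pizza", "Loki tall", "Loki eating pizza"]

# Vocabulary size for unigrams only, up to bigrams, and up to trigrams.
sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    cv = CountVectorizer(ngram_range=ngram_range).fit(clear_corpus)
    sizes[ngram_range] = len(cv.vocabulary_)

print(sizes)
```

The same check can be run on the fitted pipeline via `clf.named_steps`, using whatever step name the vectorizer was given.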



### Build a model with text preprocessing (stop-word removal and lemmatization)

In [56]:
# Define the preprocessing function
def preproces_senteces(sentence):
    # Note: the stop-word match is case-sensitive and punctuation tokens are kept
    sw = set(nltk.corpus.stopwords.words('english'))
    lemma = nltk.stem.WordNetLemmatizer()
    words = nltk.word_tokenize(sentence)
    # Lemmatize every remaining token as a verb (pos='v')
    words = [lemma.lemmatize(w, pos='v') for w in words if w not in sw]
    return " ".join(words)

preproces_senteces('Hulk is eating pizza')

'Hulk eat pizza'

In [59]:
df_balance['preprocess_text'] = df_balance['text'].apply(preproces_senteces)

In [61]:
df_balance.head()

| | text | category | category_num | preprocess_text |
|---|---|---|---|---|
| 0 | How to Develop the Next Generation of Innovato... | BUSINESS | 0 | How Develop Next Generation Innovators : Stop ... |
| 1 | Madoff Victims' Payout Nears $7.2 Billion, Tru... | BUSINESS | 0 | Madoff Victims ' Payout Nears $ 7.2 Billion , ... |
| 2 | Bay Area Floats 'Sanctuary In Transit Policy' ... | BUSINESS | 0 | Bay Area Floats 'Sanctuary In Transit Policy '... |
| 3 | Microsoft Agrees To Acquire LinkedIn For $26.2... | BUSINESS | 0 | Microsoft Agrees To Acquire LinkedIn For $ 26.... |
| 4 | Inside A Legal, Multibillion Dollar Weed Market | BUSINESS | 0 | Inside A Legal , Multibillion Dollar Weed Market |
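
The preview shows that punctuation tokens and capitalized stop words such as "To", "In", and "A" survive preprocessing, because the membership test is case-sensitive and punctuation is never filtered. A possible variant of the filtering step is sketched below. It is not what produced the results in this notebook, and the tiny `STOP_WORDS` set and the helper name `filter_tokens` are made up so that the example runs without NLTK data.

```python
import string

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"a", "in", "to", "for", "the", "is", "of"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop punctuation-only tokens and stop words, comparing case-insensitively."""
    kept = []
    for tok in tokens:
        if all(ch in string.punctuation for ch in tok):
            continue  # skips tokens such as "," or "'"
        if tok.lower() in stop_words:
            continue  # catches "To", "In", "A" as well as "to", "in", "a"
        kept.append(tok)
    return kept

tokens = ["Inside", "A", "Legal", ",", "Multibillion", "Dollar", "Weed", "Market"]
filtered = filter_tokens(tokens)
print(filtered)
```

Switching to this kind of filtering would change the preprocessed text, so the classification reports that follow would need to be regenerated.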


In [63]:
X_train, X_test, y_train, y_test = train_test_split(df_balance['preprocess_text'], 
                                                    df_balance['category_num'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df_balance['category_num']
                                                    )
X_train.shape, X_test.shape, y_train.value_counts()

((4419,),
 (1105,),
 0    1105
 3    1105
 1    1105
 2    1104
 Name: category_num, dtype: int64)

#### Train and evaluate a model on basic BoW

In [64]:
clf = Pipeline([
    ('vectorizer_bow', CountVectorizer()),
    ('model', MultinomialNB())
])

clf.fit(X_train, y_train)

print(classification_report(y_test,clf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.85      0.88      0.86       276
           1       0.88      0.87      0.87       276
           2       0.90      0.92      0.91       277
           3       0.89      0.84      0.86       276

    accuracy                           0.88      1105
   macro avg       0.88      0.88      0.88      1105
weighted avg       0.88      0.88      0.88      1105



#### Train and evaluate on n-grams

In [65]:
clf = Pipeline([
    ('vectorizer_ngram', CountVectorizer(ngram_range=(1,3))),
    ('model', MultinomialNB())
])

clf.fit(X_train, y_train)

print(classification_report(y_test,clf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86       276
           1       0.90      0.87      0.88       276
           2       0.91      0.90      0.90       277
           3       0.88      0.83      0.86       276

    accuracy                           0.88      1105
   macro avg       0.88      0.88      0.88      1105
weighted avg       0.88      0.88      0.88      1105

