<a href="https://colab.research.google.com/github/AbdulWahabRaza123/NLP/blob/main/nGramToMakeRelationshipBtwWords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Note:** N-grams let us capture relationships between neighbouring words. They work best after stop words have been removed from the document.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
v=CountVectorizer()
v.fit(["Thor Hathororhe wala is looking for job"])
v.vocabulary_

{'thor': 5,
 'hathororhe': 1,
 'wala': 6,
 'is': 2,
 'looking': 4,
 'for': 0,
 'job': 3}

In [4]:
# change the n-gram window so that only bigrams (2-grams) are extracted
v=CountVectorizer(ngram_range=(2,2))
v.fit(["Thor Hathororhe wala is looking for job"])
v.vocabulary_

{'thor hathororhe': 4,
 'hathororhe wala': 1,
 'wala is': 5,
 'is looking': 2,
 'looking for': 3,
 'for job': 0}
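The bigrams above come from sliding a window of two tokens over the sentence. A minimal library-free sketch of that idea follows; the helper name `ngrams` is my own, and splitting on whitespace is a simplification of `CountVectorizer`'s default tokenizer, which also drops single-character tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# CountVectorizer lowercases by default, so lowercase before splitting.
tokens = "Thor Hathororhe wala is looking for job".lower().split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence of 7 tokens yields 7 - n + 1 n-grams, so 6 bigrams and 5 trigrams here.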

In [6]:
# widen the n-gram window to extract both unigrams and bigrams
v=CountVectorizer(ngram_range=(1,2))
v.fit(["Thor Hathororhe wala is looking for job"])
v.vocabulary_

{'thor': 9,
 'hathororhe': 2,
 'wala': 11,
 'is': 4,
 'looking': 7,
 'for': 0,
 'job': 6,
 'thor hathororhe': 10,
 'hathororhe wala': 3,
 'wala is': 12,
 'is looking': 5,
 'looking for': 8,
 'for job': 1}

In [8]:
# widen the n-gram window further to extract unigrams, bigrams and trigrams
v=CountVectorizer(ngram_range=(1,3))
v.fit(["Thor Hathororhe wala is looking for job"])
v.vocabulary_

{'thor': 12,
 'hathororhe': 2,
 'wala': 15,
 'is': 5,
 'looking': 9,
 'for': 0,
 'job': 8,
 'thor hathororhe': 13,
 'hathororhe wala': 3,
 'wala is': 16,
 'is looking': 6,
 'looking for': 10,
 'for job': 1,
 'thor hathororhe wala': 14,
 'hathororhe wala is': 4,
 'wala is looking': 17,
 'is looking for': 7,
 'looking for job': 11}
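The numbers in `vocabulary_` are column indices, assigned to the distinct n-grams in alphabetical order. The sketch below reproduces the indices shown above without scikit-learn; `build_vocabulary` is a hypothetical helper, and whitespace splitting is assumed to be close enough to the real tokenizer for this sentence.

```python
def build_vocabulary(text, ngram_range):
    """Map each distinct n-gram to its rank in alphabetical order."""
    tokens = text.lower().split()
    low, high = ngram_range
    terms = set()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            terms.add(" ".join(tokens[i:i + n]))
    return {term: index for index, term in enumerate(sorted(terms))}

vocab = build_vocabulary("Thor Hathororhe wala is looking for job", (1, 3))
print(len(vocab))  # 7 unigrams + 6 bigrams + 5 trigrams
print(vocab["for"], vocab["thor"], vocab["wala is looking"])
```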

**Let's try it on a small corpus**

In [9]:
corpos=[
    "Thor ate pizza",
    "Loki is tall",
    "Loki is eating pizza"
]

In [10]:
import spacy
# load the English language model and create the nlp object from it
nlp=spacy.load("en_core_web_sm")



In [17]:
def preprocess(text):
  doc=nlp(text)
  filtered_tokens=[]
  for token in doc:
    if token.is_stop or token.is_punct:
      continue
    # keep the lemma (base form) of each remaining token, e.g. ate -> eat, eating -> eat
    filtered_tokens.append(token.lemma_)
  return " ".join(filtered_tokens)

In [18]:
#let's test this 
preprocess("Thor ate pizza")

'Thor eat pizza'

In [19]:
#let's test this 
preprocess("Loki is eating pizza")

'Loki eat pizza'

In [21]:
# apply the preprocessing to every document in our corpus
corpos_preprocessed=[preprocess(text) for text in corpos]
corpos_preprocessed

['Thor eat pizza', 'Loki tall', 'Loki eat pizza']
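To make the two steps inside `preprocess` concrete without a trained model, here is a toy stand-in. The stop-word set and lemma table are hand-written assumptions that cover only these three sentences; spaCy derives both from its language model and also handles punctuation, which this sketch ignores.

```python
# Hypothetical, hand-written resources covering only this toy corpus.
TOY_STOP_WORDS = {"is"}
TOY_LEMMAS = {"ate": "eat", "eating": "eat"}

def toy_preprocess(text):
    """Drop stop words, then replace each remaining word with its base form."""
    kept = []
    for word in text.split():
        if word.lower() in TOY_STOP_WORDS:
            continue
        kept.append(TOY_LEMMAS.get(word.lower(), word))
    return " ".join(kept)

corpus = ["Thor ate pizza", "Loki is tall", "Loki is eating pizza"]
toy_result = [toy_preprocess(text) for text in corpus]
print(toy_result)
```

After this step "Thor ate pizza" and "Loki is eating pizza" share the bigram "eat pizza", which is what lets the vectorizer treat them as related.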

In [24]:
#let's do vectorization using n-gram
v=CountVectorizer(ngram_range=(1,2))
v.fit(corpos_preprocessed)
v.vocabulary_

{'thor': 7,
 'eat': 0,
 'pizza': 5,
 'thor eat': 8,
 'eat pizza': 1,
 'loki': 2,
 'tall': 6,
 'loki tall': 4,
 'loki eat': 3}

In [25]:
# convert a text into a vector of n-gram counts using the fitted vocabulary
v.transform(["Thor eat pizza"]).toarray()

array([[1, 1, 0, 0, 0, 1, 0, 1, 1]])

In [27]:
# out-of-vocabulary (OOV) problem: 'hulk' is not in the vocabulary, so 'hulk' and 'hulk eat' are silently ignored
v.transform(["hulk eat pizza"]).toarray()

array([[1, 1, 0, 0, 0, 1, 0, 0, 0]])
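The two vectors above can be reproduced by hand, which also shows where the out-of-vocabulary terms go. This is a rough sketch: `to_count_vector` is a hypothetical helper, the vocabulary is copied from the fitted vectorizer's output, and whitespace splitting is assumed.

```python
# Vocabulary copied from the fitted vectorizer's output above.
vocabulary = {
    "thor": 7, "eat": 0, "pizza": 5, "thor eat": 8, "eat pizza": 1,
    "loki": 2, "tall": 6, "loki tall": 4, "loki eat": 3,
}

def to_count_vector(text, vocabulary, ngram_range=(1, 2)):
    """Count known n-grams; n-grams missing from the vocabulary are skipped."""
    tokens = text.lower().split()
    vector = [0] * len(vocabulary)
    unknown = []
    low, high = ngram_range
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            term = " ".join(tokens[i:i + n])
            if term in vocabulary:
                vector[vocabulary[term]] += 1
            else:
                unknown.append(term)
    return vector, unknown

known_vector, known_missing = to_count_vector("Thor eat pizza", vocabulary)
oov_vector, oov_missing = to_count_vector("hulk eat pizza", vocabulary)
print(known_vector, known_missing)
print(oov_vector, oov_missing)
```

The second call returns the skipped terms as well, so the lost information ("hulk" and "hulk eat") is visible instead of disappearing into zeros.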

In [83]:
import pandas as pd
df=pd.read_json("https://raw.githubusercontent.com/codebasics/nlp-tutorials/main/11_bag_of_n_grams/news_dataset.json")
df.shape

(12695, 2)

In [84]:
df.head()

| | text | category |
|---|---|---|
| 0 | Watching Schrödinger's Cat Die University of C... | SCIENCE |
| 1 | WATCH: Freaky Vortex Opens Up In Flooded Lake | SCIENCE |
| 2 | Entrepreneurs Today Don't Need a Big Budget to... | BUSINESS |
| 3 | These Roads Could Recharge Your Electric Car A... | BUSINESS |
| 4 | Civilian 'Guard' Fires Gun While 'Protecting' ... | CRIME |


In [85]:
# count the number of samples in each category
df.category.value_counts()

BUSINESS    4254
SPORTS      4167
CRIME       2893
SCIENCE     1381
Name: category, dtype: int64

In [88]:
# the dataset is imbalanced, so we balance it with one common technique: undersampling
# every category is reduced to the size of the smallest one (SCIENCE, 1381 samples)
min_samples=1381
# draw min_samples random rows from each category
df_business=df[df.category=="BUSINESS"].sample(min_samples,random_state=2023)
df_sports=df[df.category=="SPORTS"].sample(min_samples,random_state=2023)
df_crime=df[df.category=="CRIME"].sample(min_samples,random_state=2023)
df_science=df[df.category=="SCIENCE"].sample(min_samples,random_state=2023)

In [90]:
# each category now holds 1381 randomly drawn samples
df_business.shape

(1381, 2)

In [96]:
# concatenate the four samples row-wise with axis=0 (use axis=1 to concatenate column-wise)
df_balanced=pd.concat([df_business,df_sports,df_crime,df_science],axis=0)
df_balanced.shape

(5524, 2)

In [97]:
df_balanced.head()

| | text | category |
|---|---|---|
| 11110 | Why Trendspotting Still Matters: The Power of ... | BUSINESS |
| 6472 | Software That Helps Travelers and Companies Se... | BUSINESS |
| 7863 | The Secret to Greater Success Is... Learning H... | BUSINESS |
| 7920 | Megyn Kelly Has The Perfect One-Word Response ... | BUSINESS |
| 5459 | How to Find Your Next Super Star Employee The ... | BUSINESS |


In [98]:
df_balanced.category.value_counts()

BUSINESS    1381
SPORTS      1381
CRIME       1381
SCIENCE     1381
Name: category, dtype: int64
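The undersampling above hardcodes 1381, the size of the smallest category. The sketch below shows the same idea with the minimum derived from the data. It is library-free for illustration only: `undersample` and the toy rows are made up, and Python's `random` module stands in for the pandas `sample` call used in the notebook.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_of, seed=2023):
    """Randomly keep as many rows per class as the smallest class has."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    min_samples = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, min_samples))
    return balanced

# Made-up toy rows of the form (text, category).
rows = (
    [("b%d" % i, "BUSINESS") for i in range(6)]
    + [("s%d" % i, "SPORTS") for i in range(4)]
    + [("c%d" % i, "SCIENCE") for i in range(2)]
)

balanced = undersample(rows, label_of=lambda row: row[1])
balanced_counts = Counter(label for _, label in balanced)
print(balanced_counts)
```

The trade-off is that undersampling discards data: in the notebook more than half of the 12695 rows are dropped to reach 5524.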

In [100]:
# convert the category names into numeric labels
df_balanced['category_num']=df_balanced.category.map({
    'BUSINESS':0,
    'SPORTS':1,
    'CRIME':2,
    'SCIENCE':3
})

In [101]:
df_balanced.head()

| | text | category | category_num |
|---|---|---|---|
| 11110 | Why Trendspotting Still Matters: The Power of ... | BUSINESS | 0 |
| 6472 | Software That Helps Travelers and Companies Se... | BUSINESS | 0 |
| 7863 | The Secret to Greater Success Is... Learning H... | BUSINESS | 0 |
| 7920 | Megyn Kelly Has The Perfect One-Word Response ... | BUSINESS | 0 |
| 5459 | How to Find Your Next Super Star Employee The ... | BUSINESS | 0 |


**Model Training**

In [102]:
# split into train and test sets (stratify keeps the class proportions the same in both splits)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(
    df_balanced.text,
    df_balanced.category_num,
    test_size=0.2,
    random_state=2023,
    stratify=df_balanced.category_num
)

In [104]:
print(X_train.shape)
X_train.head(),y_train.head()

(4419,)


(11397    Iran Says OPEC Needs to Make Room for Its Oil ...
 12081                  Rethinking Values in the Workplace 
 1810     The 3 Ways In Which Strategic Influence Is Dif...
 7602     From Whale Songs to the Beatles: Computer Anal...
 7221     Dominique Wilkins On How The NBA Can Fix The D...
 Name: text, dtype: object, 11397    0
 12081    0
 1810     0
 7602     3
 7221     1
 Name: category_num, dtype: int64)

In [111]:
y_train.value_counts()

0    1105
1    1105
2    1105
3    1104
Name: category_num, dtype: int64

In [112]:
y_test.value_counts()

3    277
2    276
1    276
0    276
Name: category_num, dtype: int64
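The near-equal class counts in both splits are the effect of `stratify`. As a rough illustration of what stratification means, the sketch below splits each class separately. `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and its per-class rounding may distribute leftover samples differently from `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=2023):
    """Return (train_indices, test_indices) with about test_size of every class in test."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10  # made-up balanced labels
train, test = stratified_split(labels)
test_counts = Counter(labels[i] for i in test)
print(test_counts)
```

Without stratification a random 20% draw could, by chance, over- or under-represent a class in the test set, which would make the per-class scores less comparable.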

In [115]:
# Multinomial Naive Bayes is a common choice for classifying text from word counts
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
clf=Pipeline([
    ('vectorizer_bow',CountVectorizer()),
    ('Multi NB',MultinomialNB())
])
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.78      0.87      0.82       276
           1       0.88      0.83      0.85       276
           2       0.85      0.87      0.86       276
           3       0.89      0.81      0.85       277

    accuracy                           0.85      1105
   macro avg       0.85      0.85      0.85      1105
weighted avg       0.85      0.85      0.85      1105
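To see what the classifier in the pipeline does with word counts, here is a small from-scratch sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also the default of `MultinomialNB`). The function names and the four training headlines are made up for illustration; this is not scikit-learn's code.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
        model[label] = (math.log(n_docs / len(docs)), log_likelihood)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood
        )
    return max(scores, key=scores.get)

# Made-up training headlines: 0 = BUSINESS, 1 = SPORTS (the notebook's numbering).
docs = ["stocks market profit", "market shares fall",
        "team wins match", "player scores goal"]
labels = [0, 0, 1, 1]
model = train_nb(docs, labels)
print(predict_nb(model, "market profit rises"))
print(predict_nb(model, "team scores"))
```

The smoothing term keeps a word that never appeared in a class from driving that class's probability to zero.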



In [117]:
# train the same classifier on a bag of n-grams (unigrams + bigrams) built from the raw text;
# without preprocessing, accuracy drops from 0.85 to 0.82
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
clf=Pipeline([
    ('vectorizer_bow',CountVectorizer(ngram_range=(1,2))),
    ('Multi NB',MultinomialNB())
])
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.69      0.89      0.78       276
           1       0.90      0.77      0.83       276
           2       0.86      0.85      0.86       276
           3       0.89      0.77      0.83       277

    accuracy                           0.82      1105
   macro avg       0.84      0.82      0.82      1105
weighted avg       0.84      0.82      0.82      1105
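The drop is not uniform across classes. The sketch below simply subtracts the per-class precision and recall printed in the two reports above; the dictionary and function names are my own.

```python
# Per-class (precision, recall) copied from the two reports above.
unigram_report = {0: (0.78, 0.87), 1: (0.88, 0.83), 2: (0.85, 0.87), 3: (0.89, 0.81)}
bigram_report = {0: (0.69, 0.89), 1: (0.90, 0.77), 2: (0.86, 0.85), 3: (0.89, 0.77)}

def report_delta(before, after):
    """Change in (precision, recall) per class, rounded to two decimals."""
    return {
        label: tuple(round(a - b, 2) for a, b in zip(after[label], before[label]))
        for label in before
    }

deltas = report_delta(unigram_report, bigram_report)
for label, (d_precision, d_recall) in deltas.items():
    print(f"class {label}: precision {d_precision:+.2f}, recall {d_recall:+.2f}")
```

Class 0 (BUSINESS) loses precision while gaining a little recall, and the other classes lose recall. This suggests, though it does not prove, that the bigram model assigns more headlines to BUSINESS than the unigram model did.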



In [118]:
y_test[:5]

10873    2
3735     2
4907     3
4586     1
5670     1
Name: category_num, dtype: int64

In [119]:
y_pred[:5]

array([0, 2, 1, 1, 1])
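Comparing these five true labels with the five predictions by hand shows how the columns of `classification_report` are derived. The helper below is a minimal sketch (its name is my own), applied only to the five values printed above, so the numbers are not comparable to the full report.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [2, 2, 3, 1, 1]  # y_test[:5] from above
y_pred = [0, 2, 1, 1, 1]  # y_pred[:5] from above
for label in (1, 2):
    print(label, precision_recall_f1(y_true, y_pred, label))
```

For class 1 (SPORTS) both true examples are found, so recall is 1.0, but one of the three SPORTS predictions is actually SCIENCE, so precision is 2/3.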

**Now let's see the magic of preprocessing**

In [120]:
df_balanced['preprocessed_txt']=df_balanced.text.apply(preprocess)

In [121]:
df_balanced.head()

| | text | category | category_num | preprocessed_txt |
|---|---|---|---|---|
| 11110 | Why Trendspotting Still Matters: The Power of ... | BUSINESS | 0 | trendspotte matter Power look Forward archaic ... |
| 6472 | Software That Helps Travelers and Companies Se... | BUSINESS | 0 | software help traveler company sell travel Pac... |
| 7863 | The Secret to Greater Success Is... Learning H... | BUSINESS | 0 | secret Greater Success learn sell important su... |
| 7920 | Megyn Kelly Has The Perfect One-Word Response ... | BUSINESS | 0 | Megyn Kelly Perfect word response Donald Trump... |
| 5459 | How to Find Your Next Super Star Employee The ... | BUSINESS | 0 | find Super Star Employee pace talent scouting ... |


In [122]:
# split the preprocessed text into train and test sets (stratify keeps the class proportions the same in both splits)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(
    df_balanced.preprocessed_txt,
    df_balanced.category_num,
    test_size=0.2,
    random_state=2023,
    stratify=df_balanced.category_num
)

In [123]:
# train the bag-of-n-grams (unigrams + bigrams) classifier again, this time on the preprocessed text;
# with preprocessing, accuracy improves to 0.86
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
clf=Pipeline([
    ('vectorizer_bow',CountVectorizer(ngram_range=(1,2))),
    ('Multi NB',MultinomialNB())
])
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       276
           1       0.90      0.83      0.86       276
           2       0.83      0.91      0.87       276
           3       0.89      0.82      0.85       277

    accuracy                           0.86      1105
   macro avg       0.86      0.86      0.86      1105
weighted avg       0.86      0.86      0.86      1105

