## What is Natural Language Processing?


Natural Language Processing (NLP) is a branch of data science concerned with systematically analyzing, understanding, and deriving information from text data. With NLP and its components, one can organize large volumes of text, automate numerous tasks, and solve a wide range of problems, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.


Let's get started.

## Step 1: Importing necessary Libraries



In [153]:
# dataset link (https://drive.google.com/drive/folders/1lTdrM9Cbv225M2n3lOSrTbEWCak5YpPw?usp=drive_link)

# NLP

Industry estimates indicate that just 21% of all available data is structured. As data is continuously generated through activities like tweeting, sending WhatsApp messages, and more, the majority of it is in textual form, which is highly unstructured.


Some examples of unstructured data include tweets and posts on social media, user-to-user chat conversations, news, blogs, articles, product or service reviews, and patient records in the healthcare sector. More recent examples include chatbots and other voice-driven bots.

Despite the high-dimensional nature of this data, the information it contains is not directly accessible without manual processing or automated analysis. To extract significant and actionable insights from text data, it is crucial to understand the techniques and principles of Natural Language Processing (NLP).



## Step 2: Text Processing

Text is the most unstructured form of available data, often containing various types of noise, making it difficult to analyze without preprocessing. The process of cleaning and standardizing text to remove noise and prepare it for analysis is known as text preprocessing.

This process primarily involves the following steps:

- Punctuation Removal
- Lowering the text
- Tokenization
- Stop Word Removal
- Stemming
- Lemmatization

### 2.1 Punctuation Removal

This step involves removing all punctuation from the text. The Python string library contains a predefined list of punctuation characters, such as:

``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``

By removing these characters, the text is cleaned and simplified, making it more suitable for further analysis.

In [154]:
import string
print(string.punctuation)


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [156]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [157]:
sentence = "Hello, world! Let's take this example text with punctuations:"

for i in sentence:
  print(i)
#   if i in string.punctuation:
#     sentence = sentence.replace(i,'')
# print(sentence)

H
e
l
l
o
,
 
w
o
r
l
d
!
 
L
e
t
'
s
 
t
a
k
e
 
t
h
i
s
 
e
x
a
m
p
l
e
 
t
e
x
t
 
w
i
t
h
 
p
u
n
c
t
u
a
t
i
o
n
s
:


In [158]:
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree


sentence = "Hello, world! Let's take this example text with punctuations:"
sentence = remove_punctuation(sentence)
sentence

'Hello world Lets take this example text with punctuations'
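
The character-by-character filter above can also be written with `str.translate`, which deletes every listed character in one pass. This is a minimal sketch, not the notebook's own method; the name `remove_punctuation_fast` is made up for illustration. Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as `’` are left in place by either version.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deleted).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Drop ASCII punctuation in one pass using str.translate."""
    return text.translate(_PUNCT_TABLE)

sentence = "Hello, world! Let's take this example text with punctuations:"
print(remove_punctuation_fast(sentence))
# Hello world Lets take this example text with punctuations
```

On long documents this tends to be faster than building a list of characters, although for a handful of sentences the difference is negligible.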

### 2.2 Lowering the text

Converting text to the same case, preferably lowercase, is one of the most common text preprocessing steps in Python. However, it is not always necessary to perform this step for every NLP problem, as lowercasing can lead to a loss of information in certain contexts.

For instance, when analyzing a person's emotions, words written in uppercase can indicate frustration or excitement. Therefore, retaining the original casing can be important for understanding the text's full meaning in such cases.

In [159]:
def lowercase(text):
  return text.lower()

In [160]:
lowercase(sentence)

'hello world lets take this example text with punctuations'
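
If casing carries signal, as in the emotion example above, one option is to record it as a separate feature before lowercasing. The sketch below is only one possible approach; the helper `uppercase_ratio` and the `features` dictionary are hypothetical names, not part of the notebook.

```python
def uppercase_ratio(text):
    """Fraction of alphabetic words written entirely in capitals (length > 1)."""
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return 0.0
    shouting = [w for w in words if len(w) > 1 and w.isupper()]
    return len(shouting) / len(words)

review = "This movie was AMAZING and I LOVED it"

# Keep the casing signal as a number, then lowercase the text as usual.
features = {"caps_ratio": uppercase_ratio(review), "text": review.lower()}
print(features["caps_ratio"])  # 0.25
print(features["text"])        # this movie was amazing and i loved it
```

The single-letter word "I" is excluded on purpose, since it is always capitalized in English.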

### 2.3 Tokenization

Text data needs to be broken down into smaller units, such as words or phrases, for analysis. Tokenization is the process that divides text into these meaningful units, enabling subsequent processing steps like feature extraction.

In [161]:
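import re

sentence = 'That movie was not bad.'


def tokenization(text):
  # Split on runs of non-word characters; the raw string avoids an invalid-escape warning.
  tokens = re.split(r'\W+', text)
  # Drop the empty strings that re.split leaves at the start or end of the text.
  return [token for token in tokens if token]

tokenization(sentence)

['That', 'movie', 'was', 'not', 'bad']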
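
Splitting on non-word characters also breaks contractions apart when punctuation has not been removed first: "wasn't" becomes "wasn" and "t". A possible alternative, sketched here under the assumption that contractions should stay whole, is to match tokens rather than split on separators. The pattern and the function name are illustrative choices.

```python
import re

# Word characters, optionally followed by an apostrophe and more word characters.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?")

def tokenize_keep_contractions(text):
    """Return word tokens, keeping contractions such as "wasn't" in one piece."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_keep_contractions("That movie wasn't bad."))
# ['That', 'movie', "wasn't", 'bad']
```

This variant would have to run before punctuation removal, because that step deletes the apostrophe the pattern relies on.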

### 2.4 Removal of stopwords

We remove commonly used stopwords from the text because they do not add value to the analysis and carry little or no meaning.

The NLTK library includes a list of stopwords in the English language, such as:

`[i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren't, couldn, couldn't, didn't]`

However, the default list need not be used as-is; stopwords should be chosen to suit the project. In sentiment analysis, for example, removing 'not' would turn 'not bad' into 'bad'.

In [162]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [163]:
stop_words_list = nltk.corpus.stopwords.words('english')
stop_words_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [164]:
len(stop_words_list)

179

In [165]:
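stop_words_set = set(stop_words_list)  # set lookups are much faster than scanning a list

def remove_stopwords(text):
    # `text` is a list of tokens; keep only the tokens that are not stopwords
    output = [i for i in text if i not in stop_words_set]
    return output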

In [166]:
sentence = 'The runners are running quickly.'

out1 = remove_punctuation(sentence)
print(out1)

The runners are running quickly


In [167]:
lower_text = lowercase(out1)
print(lower_text)

the runners are running quickly


In [168]:
tokenized_tokens = tokenization(lower_text)
print(tokenized_tokens)

['the', 'runners', 'are', 'running', 'quickly']


In [169]:
cleaned_sent = remove_stopwords(tokenized_tokens)
print(cleaned_sent)

['runners', 'running', 'quickly']
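
Since the stopword list should be chosen per project, here is a minimal sketch of customizing it so that negations survive. The small `default_stopwords` set is a hand-picked stand-in for NLTK's list so that the example runs on its own, and the choice of which words to keep is a hypothetical one for a sentiment task.

```python
# Illustrative subset only; in the notebook this would be built from stop_words_list.
default_stopwords = {"the", "was", "is", "a", "not", "no", "nor", "very"}

# Hypothetical project choice: keep negations because they flip sentiment.
negations = {"not", "no", "nor"}
custom_stopwords = default_stopwords - negations

def remove_custom_stopwords(tokens, stopwords=custom_stopwords):
    """Keep the tokens that are not in the given stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = ["movie", "was", "not", "bad"]
print(remove_custom_stopwords(tokens, default_stopwords))  # ['movie', 'bad']
print(remove_custom_stopwords(tokens))                     # ['movie', 'not', 'bad']
```

With the default set the review reads as "movie bad"; with the customized set the negation is preserved.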


### 2.5 Stemming

Stemming is a form of text standardization that reduces words to a root or base form by stripping suffixes. For example, 'programs' and 'programming' are both reduced to 'program'.

However, stemming can sometimes result in the root form losing its meaning or not reducing to a proper English word. This will be illustrated in the following steps.

In [170]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

def stemming(text):
  stem_text = [porter_stemmer.stem(word) for word in text]
  return stem_text

In [171]:
stemming(cleaned_sent)

['runner', 'run', 'quickli']
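
To see why stems are not always proper words, it can help to look at a deliberately naive suffix stripper. This is a toy sketch and not the Porter algorithm, which applies many more rules (for instance, it handles the doubled consonant in "running"). The suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
SUFFIXES = ("ing", "ly", "ers", "er", "s")  # checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["runners", "running", "quickly", "sing"]])
# ['runn', 'runn', 'quick', 'sing']
```

Rule-based stripping works on characters rather than meaning, so it yields 'runn' here, much as the Porter stemmer yields 'quickli' above.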

### 2.6 Lemmatization

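Lemmatization, unlike stemming, ensures that the root form is a meaningful word by looking each word up in a predefined dictionary (here, WordNet) during normalization. The lookup depends on the word's part of speech; when none is given, NLTK's `WordNetLemmatizer` treats every word as a noun, which is why 'running' is left unchanged in the output below.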

Let's now compare the results of stemming and lemmatization to illustrate the differences.

In [172]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def lemmatizer(text):
  lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
  return lemm_text

lemmatizer(cleaned_sent)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['runner', 'running', 'quickly']
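
`WordNetLemmatizer.lemmatize` accepts a `pos` argument that defaults to noun, which explains why 'running' comes back unchanged above. The toy lookup below mimics that behaviour with a three-entry table so that it runs without downloading WordNet; the table and the function name are illustrative stand-ins, not NLTK internals.

```python
# Tiny hand-written lemma table keyed by (word, part of speech).
# Illustrative only: WordNet holds a far larger mapping.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("runners", "n"): "runner",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the table entry for (word, pos), or the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("running"))           # running  (looked up as a noun)
print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("runners"))           # runner
```

In the notebook, the equivalent call would pass `pos='v'` to `wordnet_lemmatizer.lemmatize`, ideally with the tag supplied by a part-of-speech tagger.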

## Step 3: Apply the text preprocessing steps to the dataset

### 3.1 Load the dataset

In [173]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/LLM_Course_Datasets/IMDB Dataset.csv')
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


### 3.2 Remove punctuation

In [174]:
df['review'].head()

Unnamed: 0,review
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends."
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work."


In [175]:
df['review'] = df['review'].apply(remove_punctuation)

In [176]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production br br The filmin...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically theres a family where a little boy J...,negative
4,Petter Matteis Love in the Time of Money is a ...,positive


### 3.3 Lower case

In [177]:

df['review'] = df['review'].apply(lowercase)

In [178]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


### 3.4 Tokenization

In [179]:
df['review'] = df['review'].apply(tokenization)

In [180]:
df.head()

Unnamed: 0,review,sentiment
0,"[one, of, the, other, reviewers, has, mentione...",positive
1,"[a, wonderful, little, production, br, br, the...",positive
2,"[i, thought, this, was, a, wonderful, way, to,...",positive
3,"[basically, theres, a, family, where, a, littl...",negative
4,"[petter, matteis, love, in, the, time, of, mon...",positive


### 3.5 Stopwords removal

In [181]:
df['review'] = df['review'].apply(remove_stopwords)

In [182]:
df.head()

Unnamed: 0,review,sentiment
0,"[one, reviewers, mentioned, watching, 1, oz, e...",positive
1,"[wonderful, little, production, br, br, filmin...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, theres, family, little, boy, jake,...",negative
4,"[petter, matteis, love, time, money, visually,...",positive


### 3.6 Lemmatization

In [183]:
df['review'] = df['review'].apply(lemmatizer)

In [184]:
df.head()

Unnamed: 0,review,sentiment
0,"[one, reviewer, mentioned, watching, 1, oz, ep...",positive
1,"[wonderful, little, production, br, br, filmin...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, there, family, little, boy, jake, ...",negative
4,"[petter, matteis, love, time, money, visually,...",positive


## Further topics to explore
- **Dependency Parsing**: Sentences are made up of words that are linked together. Dependency grammar analyzes how these words are connected by looking at the relationships between pairs of words, showing how one word depends on another in a sentence.

- **Part of Speech (PoS) Tagging**: Apart from grammatical relations, every word in a sentence is also associated with a part-of-speech (PoS) tag (noun, verb, adjective, adverb, etc.). The PoS tag defines the usage and function of a word in the sentence.

- **Named Entity Recognition**: The process of identifying named entities such as people, places, or organizations in text is called Named Entity Recognition (NER). For instance:

`Sentence: "Elon Musk, the CEO of SpaceX, recently visited Tokyo for a conference."`

**Named Entities**:

("person": "Elon Musk")

("organization": "SpaceX")

("location": "Tokyo")



A typical NER model involves three main steps:

**Noun Phrase Identification**: This step involves finding all the noun phrases in the text using methods like analyzing word relationships and part-of-speech tagging.

**Phrase Classification**: In this step, the identified noun phrases are categorized into types such as names, organizations, or locations. Tools like Google Maps help with location names, and databases like Wikipedia assist in identifying people and companies. Custom lookup tables and dictionaries can also be created from various sources.

**Entity Disambiguation**: Sometimes, entities may be misclassified. To address this, a validation layer is used to ensure accuracy. Knowledge graphs and knowledge bases, such as Google Knowledge Graph and Wikipedia, as well as services such as IBM Watson, can help resolve ambiguities and improve classification.
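
The phrase classification step can be sketched with custom lookup tables. The example below uses plain substring matching against a hand-made gazetteer, which is far simpler than a real NER model; the `GAZETTEER` contents and the function name are assumptions chosen to match the example sentence.

```python
# Hypothetical lookup tables; a real system would build these from larger sources.
GAZETTEER = {
    "person": {"Elon Musk"},
    "organization": {"SpaceX"},
    "location": {"Tokyo"},
}

def classify_phrases(text, gazetteer=GAZETTEER):
    """Return (label, phrase) pairs for every gazetteer phrase found in text."""
    found = []
    for label, phrases in gazetteer.items():
        for phrase in sorted(phrases):
            if phrase in text:
                found.append((label, phrase))
    return found

sentence = "Elon Musk, the CEO of SpaceX, recently visited Tokyo for a conference."
print(classify_phrases(sentence))
# [('person', 'Elon Musk'), ('organization', 'SpaceX'), ('location', 'Tokyo')]
```

A lookup like this cannot tell an ambiguous name apart from its other meanings, which is the gap the disambiguation step is meant to close.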


## Step 4: Vectorization


To process natural language text and extract useful information from a word or sentence using machine learning and deep learning techniques, the text must first be converted into a set of real numbers (a vector).


### 4.1 Vectorizers

In [185]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")



In [186]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(ngram_range = (1,3))
vectorizer.fit_transform(train_set)

<2x16 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [187]:
vectorizer.vocabulary_

{'the': 11,
 'sky': 5,
 'is': 2,
 'blue': 0,
 'the sky': 12,
 'sky is': 6,
 'is blue': 3,
 'the sky is': 13,
 'sky is blue': 7,
 'sun': 8,
 'bright': 1,
 'the sun': 14,
 'sun is': 9,
 'is bright': 4,
 'the sun is': 15,
 'sun is bright': 10}

In [188]:
dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))

{'blue': 0,
 'bright': 1,
 'is': 2,
 'is blue': 3,
 'is bright': 4,
 'sky': 5,
 'sky is': 6,
 'sky is blue': 7,
 'sun': 8,
 'sun is': 9,
 'sun is bright': 10,
 'the': 11,
 'the sky': 12,
 'the sky is': 13,
 'the sun': 14,
 'the sun is': 15}

In [189]:
test_vector = vectorizer.transform(test_set)
test_vector.toarray()

array([[0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 2, 1, 1, 1, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0]])
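
To make the count matrix less opaque, the first test row can be rebuilt by hand. The sketch below assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters) and the same `ngram_range=(1, 3)`; the helper `ngrams` is a made-up name.

```python
import re
from collections import Counter

def ngrams(text, n_min=1, n_max=3):
    """Lowercase, keep tokens of 2+ word characters, and yield space-joined n-grams."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

train_set = ("The sky is blue.", "The sun is bright.")

# The vocabulary is every n-gram seen in training, in sorted order.
vocab = sorted({gram for doc in train_set for gram in ngrams(doc)})

# Count the n-grams of a test sentence; n-grams outside the vocabulary are ignored.
counts = Counter(ngrams("The sun in the sky is bright."))
vector = [counts.get(term, 0) for term in vocab]

print(len(vocab))  # 16
print(vector)      # [0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 2, 1, 1, 1, 0]
```

The word "in" and n-grams such as "sun in" never appeared in training, so they contribute nothing to the vector.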

In [190]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range = (1,3),max_features=10)
tfidf.fit_transform(train_set)

<2x10 sparse matrix of type '<class 'numpy.float64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [191]:
tfidf.vocabulary_

{'the': 9,
 'sky': 5,
 'is': 2,
 'blue': 0,
 'sky is': 6,
 'is blue': 3,
 'sky is blue': 7,
 'sun': 8,
 'bright': 1,
 'is bright': 4}

In [192]:
tf_idf_vector = tfidf.transform(test_set)
print(tf_idf_vector.toarray())

[[0.         0.36439074 0.25926702 0.         0.36439074 0.36439074
  0.36439074 0.         0.36439074 0.51853403]
 [0.         0.37729199 0.         0.         0.         0.
  0.         0.         0.75458397 0.53689271]]
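
The TF-IDF numbers above can be reproduced by hand, assuming `TfidfVectorizer`'s default settings: smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The ten feature names are copied from `tfidf.vocabulary_`; the helper names are illustrative.

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=3):
    """Lowercased 1- to n_max-grams over tokens of 2+ word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

train_set = ("The sky is blue.", "The sun is bright.")
# The ten features kept by max_features=10, in index order.
vocab = ["blue", "bright", "is", "is blue", "is bright",
         "sky", "sky is", "sky is blue", "sun", "the"]

n_docs = len(train_set)
# Document frequency: in how many training documents each n-gram occurs.
doc_freq = Counter(t for doc in train_set for t in set(ngrams(doc)))
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(text):
    """Raw count times idf for each feature, scaled to unit L2 length."""
    counts = Counter(ngrams(text))
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

row = tfidf_row("The sun in the sky is bright.")
print([round(x, 4) for x in row])
# [0.0, 0.3644, 0.2593, 0.0, 0.3644, 0.3644, 0.3644, 0.0, 0.3644, 0.5185]
```

Terms that occur in both training sentences ("is", "the") get the minimum idf of 1, so "is" ends up with a lower weight than the rarer terms that also occur once.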


### 4.2 Word Embedding

Word embedding, or word vectorization, is an NLP technique that maps words or phrases from a vocabulary to corresponding vectors of real numbers, which can then be used for word prediction and for measuring word similarity or semantics.


Word embeddings help in the following use cases.

- Compute similar words
- Text classifications
- Document clustering/grouping
- Feature extraction for text classifications

In [193]:
# story = """There lived a great king called Ashoka in India.
#             After Kalinga battle, he converted to Buddhism.
#             This mighty king ordered his ministers to put together
#             a peaceful treaty with their neighboring kingdom.
#             The emperor ordered his ministers to also build a stupa,
#             a monument with Buddha's teachings."""


with open('/content/drive/MyDrive/LLM_Course_Datasets/story1.txt','r') as f:
  story1 = f.read()


print(story1)

with open('/content/drive/MyDrive/LLM_Course_Datasets/story2.txt','r') as f:
  story2 = f.read()


print(story2)

King Ashoka, the third ruler of the Maurya Dynasty, began his reign with fierce conquests, expanding his empire across India. Witnessing the brutal Kalinga War, he experienced profound remorse and embraced Buddhism, seeking peace and non-violence. Ashoka dedicated his life to spreading Dharma, building stupas, and sending missionaries across Asia. His edicts, carved on pillars and rocks, advocated ethical living, compassion, and respect for all life. Under his rule, the Mauryan Empire flourished in prosperity and cultural harmony. King Ashoka’s transformation from a ruthless conqueror to a benevolent ruler left an enduring legacy of peace and spiritual growth.
Ashoka’s reign marked a golden age in Indian history. He established hospitals, universities, and irrigation systems, improving the welfare of his people. Trade flourished along the Silk Road, fostering cultural exchanges with distant lands. The king’s efforts in environmental conservation were pioneering; he protected wildlife a

In [194]:
single_doc = []
for i in ['/content/drive/MyDrive/LLM_Course_Datasets/story1.txt',
          '/content/drive/MyDrive/LLM_Course_Datasets/story2.txt']:
  with open(i,'r') as f:
    story = f.read()

  punct_story = remove_punctuation(story)
  lower_story = lowercase(punct_story)
  tokenized_story = tokenization(lower_story)
  stopwords_story = remove_stopwords(tokenized_story)
  lemmatized_story = lemmatizer(stopwords_story)
  single_doc.append(lemmatized_story)
print(single_doc)

[['king', 'ashoka', 'third', 'ruler', 'maurya', 'dynasty', 'began', 'reign', 'fierce', 'conquest', 'expanding', 'empire', 'across', 'india', 'witnessing', 'brutal', 'kalinga', 'war', 'experienced', 'profound', 'remorse', 'embraced', 'buddhism', 'seeking', 'peace', 'nonviolence', 'ashoka', 'dedicated', 'life', 'spreading', 'dharma', 'building', 'stupa', 'sending', 'missionary', 'across', 'asia', 'edict', 'carved', 'pillar', 'rock', 'advocated', 'ethical', 'living', 'compassion', 'respect', 'life', 'rule', 'mauryan', 'empire', 'flourished', 'prosperity', 'cultural', 'harmony', 'king', 'ashoka', 'transformation', 'ruthless', 'conqueror', 'benevolent', 'ruler', 'left', 'enduring', 'legacy', 'peace', 'spiritual', 'growth'], ['ashoka', 'reign', 'marked', 'golden', 'age', 'indian', 'history', 'established', 'hospital', 'university', 'irrigation', 'system', 'improving', 'welfare', 'people', 'trade', 'flourished', 'along', 'silk', 'road', 'fostering', 'cultural', 'exchange', 'distant', 'land', 

In [195]:
from gensim.models import Word2Vec

model = Word2Vec(single_doc, min_count = 3,)

print(model.wv.most_similar('king'))

[('stupa', -0.023671666160225868), ('ashoka', -0.05234673619270325)]


In [196]:
model

<gensim.models.word2vec.Word2Vec at 0x7a7f8d1d5b70>

In [197]:
import numpy as np

# embeddings = []
# for doc in single_doc:
#   for word in doc:
#     if word in model.wv:
#       print(word,model.wv.get_vector(word))
#       embeddings.append(model.wv[word])

# embeddings = np.array(embeddings)
# # print(model.wv['king'])
# print(embeddings.mean(axis = 0))
# doc_vectors = np.array([np.mean(embeddings, axis=0)])
# print(doc_vectors)

In [198]:

# Train Word2Vec model
model = Word2Vec(sentences=single_doc, vector_size=100, window=5, min_count=1, workers=4)

# Function to get document vector by averaging word vectors
def document_vector(doc):
    word_vectors = [model.wv[word] for word in doc if word in model.wv]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(model.vector_size)

# Create document vectors
doc_vectors = np.array([document_vector(doc) for doc in single_doc])

# Print document vectors
for i, vector in enumerate(doc_vectors):
    print(f"Document {i} vector:\n{vector}\n")

Document 0 vector:
[-3.9440225e-04 -1.4245392e-05 -5.4409337e-04  1.1728095e-03
 -3.3508774e-04 -1.8417151e-03 -1.7545834e-04  2.1307217e-03
 -1.1939220e-03 -1.5683990e-03 -2.8349162e-04 -1.2617565e-03
 -1.1815055e-03  1.3393172e-03  3.3780266e-04 -4.5160172e-04
  1.5587940e-04 -1.1721924e-03 -9.6207333e-04 -2.4384228e-03
  5.1434850e-04  7.4625330e-04  7.3331717e-04 -5.1559665e-04
  7.8534300e-04  1.0200113e-04 -2.0599894e-03  2.3892290e-04
 -1.0795451e-03  4.2168621e-04  4.4050423e-04  8.4455816e-05
  1.7807268e-03 -1.3818041e-03 -1.0551626e-03  8.8348890e-05
 -2.1526411e-04 -2.1140206e-04 -2.0567486e-04 -1.6796066e-03
 -6.8977502e-06 -1.1673705e-03 -9.6955604e-04  9.2257491e-05
  1.2551512e-03  3.4501741e-04 -1.0040660e-03 -4.2206229e-04
  1.0075581e-03  1.4338953e-03 -3.0883489e-04 -8.1413431e-04
  1.2481400e-04 -3.4985782e-04  9.7626692e-04 -7.3208706e-05
 -3.4958878e-04 -3.8485453e-04 -9.5288287e-04  1.1553048e-03
  3.6683073e-04  3.7600318e-04  5.0664291e-04  7.3824194e-04
 -5.9

In [199]:
print(model.wv.most_similar('king'))

[('principle', 0.18916434049606323), ('dedicated', 0.18856076896190643), ('guide', 0.1836378127336502), ('marked', 0.18359914422035217), ('conqueror', 0.17821064591407776), ('empire', 0.16058917343616486), ('legacy', 0.15974301099777222), ('land', 0.15965142846107483), ('monastery', 0.1586817353963852), ('golden', 0.15200771391391754)]


In [203]:
model.wv.similarity('king','maurya')

0.017129293
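
The value returned by `model.wv.similarity` is a cosine similarity: the dot product of two word vectors divided by the product of their lengths. A minimal sketch with made-up three-dimensional vectors (not taken from the trained model) shows the two extremes.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
king = [3.0, 4.0, 0.0]
ruler = [6.0, 8.0, 0.0]   # same direction as king
river = [0.0, 0.0, 3.0]   # orthogonal to king

print(cosine_similarity(king, ruler))  # 1.0
print(cosine_similarity(king, river))  # 0.0
```

Scores near zero, like the one for 'king' and 'maurya' above, are what one would expect when a model is trained on only two short documents, since the vectors stay close to their random initialization.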