Training a Word2Vec model on text data using gensim

In [1]:
import spacy
import re
import gensim
import pandas as pd
nlp = spacy.load("en_core_web_sm")
stopwords = nlp.Defaults.stop_words

In [2]:
text=("""Mahatma Gandhi is mostly known as “Father of the Nation” and “Bapu” for his incredible contribution. He was a great man who believed in non-violence and social unity.Mahatma Gandhi is the father of the nation because he was the chief architect of the independence struggle.

He had raised the voice for the social development of rural areas in India and inspired the Indians to use native goods, also raised his voice against the British on social issues.

His aim was to destroy the tradition of untouchability and discrimination from Indian culture. Later he joined the Indian independence campaign and started fighting.

In Indian history, he was a great man who transformed the dream of freedom of Indians into reality. Even today, people remember him for his great and incredible work.

He faced a lot of problems in his life but he never gave up, he always kept moving forward. Also, read Mahatma Gandhi Essay 500 words.He started a number of campaigns such as the Non-Cooperation Movement in 1920, the Urban Disobedience Movement in 1930, and finally, the Quit India Movement in 1942 and all these movements proved to be effective in liberating India.

Eventually, India got independence from the British Rule due to his struggles. Mahatma Gandhi’s life was very simple, he did not believe in ageism and caste discrimination.

He also made a lot of efforts to destroy the tradition of untouchables from Indian society and due to this he also gave the name of “Harijan” to the untouchables which mean “People of God”.

Mahatma Gandhi was a great social reformer, freedom fighter & the aim of his life was to liberate India.

And on serving the country, this Mahatma died on 30 January 1948 and was cremated in the presence of millions of supporters at Rajghat, Delhi & this day in his memory celebrated as Martyr’s Day in India

“I never want to think about what will happen in the future, I just worry about the present, God has not given me any control over the moments to come.”Mahatma Gandhi, considered the main architect of the freedom struggle, was born in an ordinary family on 2 October 1869 in Porbandar, Gujarat. His childhood name was Mohandas Karamchand Gandhi.

His father, Karam Chand Gandhi, was the ‘Diwan’ of Rajkot during the British rule, mother’s Putlibai, an obedient woman with religious views, which had a profound impact on Gandhiji.

At the same time, when he was 13 years old, he was married to Kasturba under the practice of child marriage. Gandhiji was a very disciplined and obedient child since childhood.

He completed his early education in Gujarat and then went to England to study law, from where he returned to India to start work in India, however, he did not last long in advocacy.It was during the course of his advocacy that Gandhiji had to suffer separatism in South Africa.

According to an incident with Gandhiji, once he sat in the first-class compartment of the train, he was pushed out of the train compartment.

Along with this, he was also barred from visiting many big hotels in South Africa, after which Gandhiji fought fiercely against separatism.

He entered politics with the aim of destroying discrimination against Indians and then gave a new dimension to the politics of the country with his judiciousness and proper political skills and also played an important role as a freedom fighter.Mahatma Gandhi was a very ideological and idealistic leader. He was a simple, high minded person, due to his nature, people used to call him “Mahatma”.

His great ideas and idealism have also been followed by many greats like Albert Einstein, Rajendra Prasad, Sarojini Naidu, Nelson Mandela, and Martin Luther King.

These people were faithful supporters of Gandhiji. The great personality of Gandhiji had an influence not only in the country but also abroad.

Truth and non-violence were his two powerful weapons and with it, he forced the British to leave India.

He was a great freedom fighter and politician as well as a social worker who also made commendable efforts to remove casteism, untouchability, gender discrimination etc. in India.""").lower()

Splitting the given text into a list of sentences

In [3]:
def split_into_sentences(text):
    alphabets= "([A-Za-z])"
    prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = "(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = "[.](com|net|org|io|gov)"
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
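
The function above relies on a protect-then-split trick: periods that do not end a sentence are swapped for a placeholder, the remaining terminators are marked as boundaries, and the placeholders are restored before splitting. The following is a minimal sketch of that idea only; the name `split_simple` and its short abbreviation list are hypothetical and not part of the notebook.

```python
import re

def split_simple(text):
    """Split text into sentences using the protect-then-split trick."""
    # Protect periods in a few known abbreviations (hypothetical short list).
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr)\.", r"\1<prd>", text)
    # Mark every remaining sentence terminator as a boundary.
    text = re.sub(r"([.?!])", r"\1<stop>", text)
    # Restore the protected periods.
    text = text.replace("<prd>", ".")
    # Split on the markers and drop empty fragments.
    return [s.strip() for s in text.split("<stop>") if s.strip()]

sample = "Dr. Smith arrived. Was he late? No!"
sentences_demo = split_simple(sample)
print(sentences_demo)
```

The abbreviation patterns here, like those in `split_into_sentences`, are case-sensitive, so they only take effect on text that has not been lowercased first.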

Text Pre-processing

In [4]:
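def text_clean(x):
  # Tokenize and lemmatize the given sentence with spaCy
  doc=nlp(x)
  sentence=[]
  for word in doc:
    # Skip punctuation and stopwords
    if not word.is_punct and word.text.lower() not in stopwords:
      # spaCy 2.x lemmatizes pronouns to '-PRON-'; keep their surface form instead
      if word.lemma_!='-PRON-':
          temp=word.lemma_.lower().strip()
      else:
          temp=word.lower_
      # Keep only the first occurrence of each token
      if temp not in sentence:
        sentence.append(temp)
  return sentence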
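
To make the filtering logic easier to follow without loading a spaCy model, here is a small sketch of the same steps on a plain list of tokens: lowercase, drop punctuation and stopwords, keep the first occurrence of each token. It does not lemmatize, and `DEMO_STOPWORDS` and `clean_tokens` are hypothetical names used only for illustration.

```python
# Hypothetical mini stopword list; spaCy's real list is much larger.
DEMO_STOPWORDS = {"the", "a", "of", "was", "he", "is", "and"}

def clean_tokens(tokens, stopwords=DEMO_STOPWORDS):
    """Lowercase tokens, drop stopwords and punctuation, keep first occurrences."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower().strip()
        # Skip empty strings, pure punctuation, and stopwords.
        if not tok or not any(ch.isalnum() for ch in tok) or tok in stopwords:
            continue
        # Keep only the first occurrence, preserving order.
        if tok not in cleaned:
            cleaned.append(tok)
    return cleaned

demo = clean_tokens("He was the father of the nation , the Nation .".split())
print(demo)
```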

In [5]:
sentences = split_into_sentences(text)
df=pd.DataFrame(sentences,columns=['Sentence'])
Sentence = df.Sentence.apply(gensim.utils.simple_preprocess)
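
For intuition about what `gensim.utils.simple_preprocess` produces, the sketch below approximates its documented default behavior: lowercase the text, extract alphabetic tokens, and discard tokens shorter than 2 or longer than 15 characters. The function `simple_tokens` and its regular expression are assumptions made for illustration, not gensim's actual implementation.

```python
import re

def simple_tokens(text, min_len=2, max_len=15):
    """Rough stand-in for simple_preprocess: lowercase alphabetic tokens of bounded length."""
    # Runs of letters only; digits, punctuation, and underscores act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens_demo = simple_tokens("He started the Non-Cooperation Movement in 1920, a success.")
print(tokens_demo)
```

Stopwords such as "he" and "the" survive this step, which is why they can show up later in the model's vocabulary.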

In [6]:
df['Sentence']=df['Sentence'].apply(text_clean)

**Initializing the gensim model**

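size refers to the dimensionality of the word vectors (renamed vector_size in gensim 4.0 and later)  
window refers to the maximum distance between the current word and a context word within a sentence  
min_count refers to the minimum total frequency a word needs to be kept in the vocabulary; rarer words are ignored  
workers refers to the number of worker threads used for training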

In [7]:
model = gensim.models.Word2Vec(size=80,window=10,min_count=2,workers=4)
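
The effect of min_count and window can be illustrated without gensim. The sketch below, on a made-up toy corpus, keeps only words that occur at least `min_count` times and lists the (center, context) pairs that fall inside a fixed window. The helper names `build_vocab` and `context_pairs` are hypothetical, and gensim's real training loop differs in details (sampling, optimization) that are not reproduced here.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep words whose total corpus frequency is at least min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    """Return (center, context) pairs within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# Toy corpus, invented for this example.
toy = [["gandhi", "led", "india"], ["india", "won", "freedom"]]
vocab = build_vocab(toy, min_count=2)
pairs = context_pairs(toy[0], window=1)
print(vocab)
print(pairs)
```

With `min_count=2`, only "india" survives in this toy corpus, which shows how aggressively the setting prunes a small dataset.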

By default the model trains for 5 epochs (model.epochs). To train for a different number of epochs, pass another value as the epochs argument of model.train.

In [8]:
model.build_vocab(Sentence,progress_per=1000)
model.train(Sentence,total_examples=model.corpus_count,epochs=model.epochs)

(789, 3335)

Printing the vector for each word in the model's vocabulary

In [10]:
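# gensim 3.x API; in gensim 4.x the vocabulary is available as model.wv.key_to_index
words=list(model.wv.vocab)
for word in words:
  # Look vectors up through model.wv rather than indexing the model directly
  print(word,model.wv[word])

mahatma [-1.8756409e-03 -1.3392378e-03 -6.1311829e-04  1.7719526e-03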
  4.3567866e-03 -2.9848432e-03  1.9469496e-03  6.2937853e-03
  1.7062065e-03  1.2780682e-03  2.9633639e-03 -4.2518321e-03
  6.5731059e-04 -1.3734412e-03 -3.8902219e-03 -1.6642833e-03
 -1.9637708e-04  2.9201272e-03 -8.7369973e-04 -6.2477174e-03
  6.1798381e-04  3.3903548e-03  5.3836587e-03  4.2199423e-03
 -4.9836091e-03  5.9155412e-03 -3.9941045e-03 -3.8057598e-03
  5.9936526e-03  1.1946043e-03 -2.6256377e-03  2.4875933e-03
 -2.0119420e-03 -8.3389925e-04  1.9338474e-03 -5.0526559e-03
  3.8510766e-03  1.0043922e-03  4.0325541e-03  1.7585481e-03
  5.6328307e-05 -3.4888152e-03  8.1332790e-04  3.3004771e-03
 -2.1702696e-03  4.0515559e-03 -3.4892741e-03  3.7051477e-03
 -3.0009914e-03  4.6489316e-05  5.2845757e-03  4.0486543e-03
  4.5167739e-03  7.7193446e-04  4.1893446e-03  3.9185821e-03
  5.4170024e-03 -3.2768340e-03 -1.5766697e-03  9.2993176e-04
  1.7939890e-03 -2.3530696e-04 -1.3958234e-03 -5.1743896e-03
 -6.4054304e-



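Finding the most similar words. The model was trained on Sentence (the simple_preprocess output), which still contains stopwords, so words such as "the" and "of" can appear in the results.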

In [11]:
model.wv.most_similar("struggle")

[('had', 0.22027194499969482),
 ('efforts', 0.1995162069797516),
 ('voice', 0.17764560878276825),
 ('against', 0.1749974936246872),
 ('is', 0.15615060925483704),
 ('africa', 0.1359127312898636),
 ('not', 0.13466714322566986),
 ('then', 0.13281846046447754),
 ('for', 0.12880919873714447),
 ('violence', 0.12239345908164978)]

In [12]:
model.wv.most_similar("british")

[('politics', 0.37645411491394043),
 ('the', 0.36082693934440613),
 ('he', 0.2660360336303711),
 ('people', 0.23588106036186218),
 ('gujarat', 0.21613532304763794),
 ('indian', 0.20008288323879242),
 ('child', 0.17266935110092163),
 ('was', 0.17216113209724426),
 ('rule', 0.16380399465560913),
 ('of', 0.16201938688755035)]
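
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of that ranking, the code below computes cosine similarity by hand on made-up 2-dimensional vectors; `toy_vectors` and `most_similar_demo` are hypothetical and unrelated to the trained model above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_demo(target, vectors, topn=2):
    """Rank every other word by cosine similarity to `target`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, chosen so the ranking is easy to verify by hand.
toy_vectors = {
    "struggle": [1.0, 0.0],
    "efforts": [0.8, 0.6],
    "rule": [0.0, 1.0],
    "peace": [-1.0, 0.0],
}
ranked = most_similar_demo("struggle", toy_vectors, topn=3)
print(ranked)
```

Scores range from -1 to 1. The low values in the outputs above are what one might expect from a corpus of only a few dozen sentences.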