### Limitations of BOW and TF-IDF models
<li>Semantic information is not stored. Even the TF-IDF model only gives more weight to uncommon words; it does not capture what the words mean.</li>
<li>There is a risk of overfitting. Overfitting happens when a model is tied too closely to its training data: it does well on the training data but performs poorly on new data.</li>
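To make the first limitation concrete, here is a minimal sketch (plain Python, no libraries) that compares bag-of-words vectors for three made-up sentences. The sentences and the helper names `bow_vector` and `cosine` are illustrative assumptions, not part of the notebook. Because BOW only counts shared tokens, the sentence with the opposite meaning scores as more similar than the one with the same meaning.

```python
import math
from collections import Counter

def bow_vector(sentence, vocabulary):
    # Count how often each vocabulary word occurs in the sentence
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    # Cosine similarity between two equal-length count vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

s1 = "the film was great"
s2 = "the movie was excellent"   # same meaning as s1, different words
s3 = "the film was terrible"     # opposite meaning, mostly the same words
vocabulary = sorted(set((s1 + " " + s2 + " " + s3).split()))

sim_synonym = cosine(bow_vector(s1, vocabulary), bow_vector(s2, vocabulary))
sim_antonym = cosine(bow_vector(s1, vocabulary), bow_vector(s3, vocabulary))
print(sim_synonym, sim_antonym)  # the antonym sentence scores higher
```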

### Word2Vec model
<li>In this model, each word is represented as a vector of 32 or more dimensions instead of a single number.</li>
<li>Relations between different words are preserved.</li>
<br>
<b>Steps:</b>
<ol>
    <li>Scrape a huge dataset, such as the whole of Wikipedia.</li>
    <li>Create a matrix over all the unique words in the dataset. The matrix represents the co-occurrence relationships between words.</li>
    <li>Split the matrix into two thin matrices.</li>
    <li>The resulting thin matrices give us the model.</li>
</ol>
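As a rough illustration of step 2 only, the sketch below builds a word-by-word co-occurrence matrix for two toy sentences. The function name `cooccurrence_matrix`, the `window` parameter, and the sentences are assumptions made for this example. The factorization in step 3 is left out, and gensim's `Word2Vec` does not build this matrix explicitly; it learns the vectors by training a shallow neural network.

```python
def cooccurrence_matrix(sentences, window=2):
    # Vocabulary in a fixed (sorted) order so rows and columns are stable
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in vocabulary]
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            # Look at neighbours up to `window` positions to the left and right
            start = max(0, pos - window)
            end = min(len(sentence), pos + window + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:
                    matrix[index[word]][index[sentence[ctx_pos]]] += 1
    return vocabulary, matrix

toy_sentences = [
    ["global", "warming", "raises", "temperatures"],
    ["greenhouse", "gases", "cause", "global", "warming"],
]
vocabulary, matrix = cooccurrence_matrix(toy_sentences, window=1)

row = vocabulary.index("global")
col = vocabulary.index("warming")
print(matrix[row][col])  # "global" and "warming" are adjacent in both sentences
```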

##  ***Building Word2Vec model***

In [33]:
import nltk
import urllib.request
import bs4 as bs
import re
from gensim.models import Word2Vec
from nltk.corpus import stopwords

### Fetching Data from Wikipedia

In [24]:
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Climate_change').read()

In [25]:
soup = bs.BeautifulSoup(source, 'lxml')

In [26]:
# Concatenate the text of every <p> element on the page
text = ''

for paragraph in soup.find_all('p'):
    text += paragraph.text

### Preprocessing the data

In [29]:
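text = re.sub(r'\[[0-9]*\]', ' ', text)  # drop citation markers such as [12]
text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
text = text.lower()
text = re.sub(r'[@#\$\%\*\(\)\{\}\[\]\+]', ' ', text)  # drop selected symbols
text = re.sub(r'\d', ' ', text)  # drop digits
text = re.sub(r'\s+', ' ', text)  # collapse whitespace again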
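To see what these substitutions do, the sketch below wraps the same regular expressions in a function and applies them to a made-up sample string. The function name `clean_text` and the sample are assumptions for illustration. Note that removing digits leaves stray `.` characters behind, and these later become tokens of their own.

```python
import re

def clean_text(text):
    text = re.sub(r'\[[0-9]*\]', ' ', text)   # citation markers like [12]
    text = re.sub(r'\s+', ' ', text)           # collapse whitespace
    text = text.lower()
    text = re.sub(r'[@#\$\%\*\(\)\{\}\[\]\+]', ' ', text)  # selected symbols
    text = re.sub(r'\d', ' ', text)            # digits
    text = re.sub(r'\s+', ' ', text)           # collapse whitespace again
    return text

sample = "Global warming[12] rose by 1.1 degrees (since 1850)."
cleaned = clean_text(sample)
print(cleaned)  # global warming rose by . degrees since .
```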

In [31]:
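# Split into sentences, then each sentence into word tokens
# (needs the NLTK tokenizer data, e.g. nltk.download('punkt'))
sentences = nltk.sent_tokenize(text)
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]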

In [32]:
len(sentences)

517

In [36]:
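# Needs the NLTK stop-word list: nltk.download('stopwords')
english_stopwords = stopwords.words('english')

# Drop stop words; punctuation tokens such as ',' and '.' are kept
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in english_stopwords]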
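The stop-word list does not contain punctuation, so tokens such as `,` and `.` stay in the sentences. One possible variation, sketched below, also drops tokens that contain no letters. The name `filter_tokens` and the tiny `toy_stopwords` set are hypothetical stand-ins so the example runs without NLTK data; this is not the filtering the notebook itself applies.

```python
# Hypothetical stand-in for stopwords.words('english')
toy_stopwords = {"the", "is", "a", "of", "and"}

def filter_tokens(tokens, stop_words):
    # Keep a token only if it is not a stop word and contains at least one letter
    return [
        token for token in tokens
        if token not in stop_words and any(ch.isalpha() for ch in token)
    ]

tokens = ["the", "climate", "is", "warming", ",", "and", "emissions", "rise", "."]
filtered = filter_tokens(tokens, toy_stopwords)
print(filtered)  # ['climate', 'warming', 'emissions', 'rise']
```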

### Train the Model

In [37]:
model = Word2Vec(sentences, min_count=1)  # min_count: ignore all words whose total frequency is lower than this value
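The effect of `min_count` on the vocabulary can be pictured with a small frequency filter. The sketch below is plain Python rather than gensim's implementation, and the helper name `prune_vocabulary` and the toy corpus are assumptions. With `min_count=1` every word is kept; raising it discards rare words.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    # Count each word over the whole corpus and keep those seen at least min_count times
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

toy_corpus = [
    ["global", "warming"],
    ["global", "emissions"],
    ["warming", "trend"],
]
vocab_all = prune_vocabulary(toy_corpus, min_count=1)
vocab_frequent = prune_vocabulary(toy_corpus, min_count=2)
print(sorted(vocab_all))       # all four words
print(sorted(vocab_frequent))  # only the words that occur twice
```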

In [38]:
# dictionary mapping each word in the vocabulary to its index
words = model.wv.key_to_index

In [39]:
# embedding vector for the word 'global'
model.wv['global']

array([-2.32197624e-02,  1.57655235e-02,  2.24995753e-03, -7.62303127e-03,
       -9.01957601e-03, -2.79170088e-02,  1.03491498e-02,  3.80639434e-02,
       -2.19229292e-02, -1.49719482e-02, -2.00697407e-02, -2.95138005e-02,
       -1.50051983e-02, -3.62161594e-03,  1.03122853e-02, -1.58294979e-02,
        1.03270565e-03, -1.94836296e-02, -2.54769321e-03, -3.16581875e-02,
       -4.89174668e-03,  2.87487195e-03,  1.91747099e-02, -5.39449835e-03,
       -1.40529461e-02, -2.23496929e-03, -9.67528298e-03, -1.17570357e-02,
       -9.01293103e-03,  1.34216975e-02,  3.16833667e-02, -1.17369015e-02,
       -6.28552458e-04, -1.83756743e-02, -6.11896487e-03,  2.50285044e-02,
        1.23198098e-02,  8.58989137e-04,  8.69853393e-05, -2.68974639e-02,
        1.19003719e-02, -2.02892404e-02, -1.32058430e-02, -9.65767889e-04,
        1.06649883e-02, -8.94327741e-03, -6.87225815e-03,  6.30669203e-03,
        6.97216718e-03, -1.10534881e-03,  8.60343967e-03, -2.78188139e-02,
        3.14615015e-03,  

In [41]:
# words most similar to 'warming', i.e. the words whose vectors are closest to it (by cosine similarity) in this corpus
model.wv.most_similar('warming')

[('climate', 0.8907241821289062),
 (',', 0.8898768424987793),
 ('.', 0.8662092089653015),
 ('may', 0.8575513362884521),
 ('heat', 0.8561801910400391),
 ('global', 0.8485167026519775),
 ('change', 0.838651180267334),
 ('emissions', 0.8366113305091858),
 ('also', 0.8334764838218689),
 ('greenhouse', 0.8318421840667725)]
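The scores in this output are cosine similarities between word vectors. As a minimal sketch of that ranking, the code below computes cosine similarity by hand over a few made-up 3-dimensional vectors. The vectors, the word `banana`, and the helper names are illustrative assumptions, and the local `most_similar` function only imitates the idea, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, topn=3):
    # Score every other word against the target and sort from most to least similar
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors; real Word2Vec vectors have many more dimensions
toy_vectors = {
    "warming": [1.0, 0.0, 0.0],
    "climate": [0.9, 0.1, 0.0],
    "heat": [0.8, 0.6, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
ranked = most_similar("warming", toy_vectors)
print([name for name, _ in ranked])  # ['climate', 'heat', 'banana']
```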