Machine learning models cannot work with raw text directly, so we need to convert text into numerical vectors. As part of this practice exercise you will implement different techniques for doing so.

In this notebook we will go through some basic text cleaning steps and techniques for encoding text data. We are going to learn about
1. **Understanding the data** - See what the data is all about and what should be considered when cleaning it (punctuation, stopwords, etc.).
2. **Basic Cleaning** - We will see which parameters need to be considered when cleaning the data (such as punctuation and stopwords), along with the code.
3. **Techniques for Encoding** - All the popular encoding techniques that I have personally come across.
    * **Bag of Words**
    * **Binary Bag of Words**
    * **Bigram, Ngram**
    * **TF-IDF** (**T**erm **F**requency - **I**nverse **D**ocument **F**requency)


# 1.Importing Libraries

Libraries used in this notebook along with their version:

google	2.0.3

nltk	3.2.5

numpy	1.18.3

pandas	1.0.3

In [1]:
import nltk
from six import string_types
from nltk.corpus import reuters
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize
import numpy as np
import pandas as pd

# 2.Reading the data

We will use a text categorization dataset based on reviews. Each review is assigned a specific category.
### Implement the code to load the dataset. (Hint: Use the pandas library to load the csv file.)

In [2]:
df = pd.read_csv('Reviews.csv')
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [3]:
df.shape

(568454, 10)

In [4]:
df = df.iloc[:2000,:]   # keep only the first 2000 rows, since the full dataset consumed too much memory to run
df.shape

(2000, 10)

In [5]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [6]:
df["Score"].unique()

array([5, 1, 4, 2, 3], dtype=int64)

1. **Understanding the data**

Our main objective with this dataset is to predict whether a review is **Positive** or **Negative** based on its Text.

The Score column takes the values 1, 2, 3, 4 and 5. We treat scores 1 and 2 as Negative reviews and 4 and 5 as Positive reviews. A Score of 3 is treated as a Neutral review, and we delete those rows so that every remaining review is either Positive or Negative.

HelpfulnessNumerator is the number of people who found the review useful, and HelpfulnessDenominator is the number of people who found it useful plus the number who did not.
So HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator.

In [7]:
data =df[df.Score !=3]
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [8]:
data = data[data.HelpfulnessNumerator <= data.HelpfulnessDenominator].copy()  # copy so that adding a column later does not write to a slice of df
data.shape

(1838, 10)

Convert the Score values into a class label: 1 for Positive (Score 4 or 5) and 0 for Negative (Score 1 or 2).

In [9]:
data['label'] = data['Score'].apply(lambda x: 0 if (x==1 or x==2) else 1)
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,label
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,1
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,1
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,1


2. **Basic Cleaning**
 
**Deduplication** means removing duplicate rows. It is necessary to remove duplicates in order to get unbiased results. We check for duplicates based on UserId, ProfileName, Time and Text: if all of these values are equal, we remove the repeated records. (No user can write a review at exactly the same time for different products.)


We have also seen that HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator, so records that violate this condition are removed as well.


In [10]:
data.shape

(1838, 11)

In [11]:
data = data.drop_duplicates()
data.shape

(1838, 11)
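
Called without arguments, `drop_duplicates()` compares entire rows, so rows that differ only in `Id` or `ProductId` are not treated as duplicates; this may be why the shape is unchanged. The sketch below illustrates the key-based deduplication described in the text, using plain Python on a few made-up records (the record values and the helper name `deduplicate` are assumptions for illustration, not part of the dataset or of pandas).

```python
# Toy records standing in for review rows; the values are made up for illustration.
rows = [
    {"UserId": "U1", "ProfileName": "Ann", "Time": 100, "Text": "Great taffy", "ProductId": "P1"},
    {"UserId": "U1", "ProfileName": "Ann", "Time": 100, "Text": "Great taffy", "ProductId": "P2"},
    {"UserId": "U2", "ProfileName": "Bob", "Time": 200, "Text": "Not as advertised", "ProductId": "P1"},
]

# Columns that define a duplicate, as listed in the deduplication paragraph
KEY_COLUMNS = ("UserId", "ProfileName", "Time", "Text")


def deduplicate(records, key_columns=KEY_COLUMNS):
    """Keep the first record for each combination of the key columns."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


unique_rows = deduplicate(rows)
print(len(rows), "->", len(unique_rows))
```

With pandas, the same idea is usually expressed by passing the key columns through the `subset` argument of `drop_duplicates`.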

### Create a function called "complaint_to_words" to convert each consumer complaint narrative to individual tokens. (Hint: Use a regular expression based tokenizer.)
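
One possible sketch of such a function, using only Python's built-in `re` module, is shown here. The token pattern is an assumption chosen for illustration (lower-case letters and digits, with an optional apostrophe suffix such as `n't`); it is not the tokenizer used in the cells below, which rely on `nltk.word_tokenize`.

```python
import re

# Hypothetical pattern: a run of letters/digits, optionally followed by an
# apostrophe part such as "'t" or "'s". Punctuation is dropped entirely.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def complaint_to_words(text):
    """Lower-case a piece of text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = complaint_to_words("I don't like it. Great taffy at a GREAT price!")
print(tokens)
```

Because the pattern only matches word characters, this variant removes punctuation and lower-cases in a single step, whereas the notebook performs those steps separately.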

# 3.Basic Cleaning

We will use the above function here to create a list of lists that stores each complaint tokenized into separate words.

## 3.1.Tokenize

In [12]:
report = data['Text']
report.shape

(1838,)

In [13]:
report

0       I have bought several of the Vitality canned d...
1       Product arrived labeled as Jumbo Salted Peanut...
2       This is a confection that has been around a fe...
3       If you are looking for the secret ingredient i...
4       Great taffy at a great price.  There was a wid...
                              ...                        
1995    I have to laugh at the reviews that said it wa...
1996    I had read some favorable reviews of this panc...
1997    I was expecting great things based on the revi...
1998    I love this pancake mix.  I bought my first ca...
1999    What can i say??  They are wonderful, and the ...
Name: Text, Length: 1838, dtype: object

In [14]:
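word = []
for file_id in report:
    words = word_tokenize(file_id)   # tokens of a single review
    word.extend(words)               # accumulate the tokens of all reviews
len(word)   # total number of tokens across all reviews

161075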

## 3.2.Lower Case

In [15]:

text_tokens = [w.lower() for w in word]
len(text_tokens)

161075

## 3.3.Removing Stopwords

In [32]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rishabh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 3.3.1.Removing Punctuation

In [16]:
# Remove Punctuation
puncList = [";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]

# Keep every token that is not in the punctuation list
Punc_filtered_sentence = [w for w in text_tokens if w not in puncList]

print(len(text_tokens))
print(len(Punc_filtered_sentence))

161075
150693


### 3.3.2.Removing the Stop Words

In [17]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bapan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not English stop words
filtered_sentence = [w for w in Punc_filtered_sentence if w not in stop_words]

len(filtered_sentence)

84583

## 3.4.Stemming & Lemmatization

### 3.4.1.Stemming

In [19]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

filtered_sentence = list(map(lambda x : porter.stem(x),filtered_sentence))
filtered_sentence

['bought',
 'sever',
 'vital',
 'can',
 'dog',
 'food',
 'product',
 'found',
 'good',
 'qualiti',
 '.',
 'product',
 'look',
 'like',
 'stew',
 'process',
 'meat',
 'smell',
 'better',
 '.',
 'labrador',
 'finicki',
 'appreci',
 'product',
 'better',
 '.',
 'product',
 'arriv',
 'label',
 'jumbo',
 'salt',
 'peanut',
 '...',
 'peanut',
 'actual',
 'small',
 'size',
 'unsalt',
 '.',
 'sure',
 'error',
 'vendor',
 'intend',
 'repres',
 'product',
 '``',
 'jumbo',
 "''",
 '.',
 'confect',
 'around',
 'centuri',
 '.',
 'light',
 'pillowi',
 'citru',
 'gelatin',
 'nut',
 '-',
 'case',
 'filbert',
 '.',
 'cut',
 'tini',
 'squar',
 'liber',
 'coat',
 'powder',
 'sugar',
 '.',
 'tini',
 'mouth',
 'heaven',
 '.',
 'chewi',
 'flavor',
 '.',
 'highli',
 'recommend',
 'yummi',
 'treat',
 '.',
 'familiar',
 'stori',
 'c.',
 '.',
 'lewi',
 "'",
 '``',
 'lion',
 'witch',
 'wardrob',
 "''",
 '-',
 'treat',
 'seduc',
 'edmund',
 'sell',
 'brother',
 'sister',
 'witch',
 '.',
 'look',
 'secret',
 'ingr

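### 3.4.2.Lemmatization

Note: the cell below applies NLTK's `LancasterStemmer` to the already Porter-stemmed tokens. This is a second, more aggressive stemmer rather than a true lemmatizer; a lemmatizer (for example NLTK's `WordNetLemmatizer`) would map words to dictionary forms instead of truncated stems.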

In [20]:
from nltk.stem import LancasterStemmer
lancaster=LancasterStemmer()



filtered_sentence = list(map(lambda x : lancaster.stem(x),filtered_sentence))
filtered_sentence

['bought',
 'sev',
 'vit',
 'can',
 'dog',
 'food',
 'produc',
 'found',
 'good',
 'qualit',
 '.',
 'produc',
 'look',
 'lik',
 'stew',
 'process',
 'meat',
 'smel',
 'bet',
 '.',
 'labrad',
 'finick',
 'apprec',
 'produc',
 'bet',
 '.',
 'produc',
 'ar',
 'label',
 'jumbo',
 'salt',
 'peanut',
 '...',
 'peanut',
 'act',
 'smal',
 'siz',
 'unsalt',
 '.',
 'sur',
 'er',
 'vend',
 'intend',
 'repr',
 'produc',
 '``',
 'jumbo',
 "''",
 '.',
 'confect',
 'around',
 'centur',
 '.',
 'light',
 'pillow',
 'citru',
 'gelatin',
 'nut',
 '-',
 'cas',
 'filbert',
 '.',
 'cut',
 'tin',
 'squ',
 'lib',
 'coat',
 'powd',
 'sug',
 '.',
 'tin',
 'mou',
 'heav',
 '.',
 'chew',
 'flav',
 '.',
 'highl',
 'recommend',
 'yumm',
 'tre',
 '.',
 'famili',
 'stor',
 'c.',
 '.',
 'lew',
 "'",
 '``',
 'lion',
 'witch',
 'wardrob',
 "''",
 '-',
 'tre',
 'seduc',
 'edmund',
 'sel',
 'broth',
 'sist',
 'witch',
 '.',
 'look',
 'secret',
 'ingred',
 'robitussin',
 'believ',
 'found',
 '.',
 'got',
 'addit',
 'root',

## 3.5.PoS

In [21]:
nltk.download('averaged_perceptron_tagger')

nltk.pos_tag(filtered_sentence)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\bapan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('bought', 'VBN'),
 ('sev', 'NNS'),
 ('vit', 'NN'),
 ('can', 'MD'),
 ('dog', 'VB'),
 ('food', 'NN'),
 ('produc', 'NN'),
 ('found', 'VBD'),
 ('good', 'JJ'),
 ('qualit', 'NN'),
 ('.', '.'),
 ('produc', 'JJ'),
 ('look', 'NN'),
 ('lik', 'JJ'),
 ('stew', 'NN'),
 ('process', 'NN'),
 ('meat', 'NN'),
 ('smel', 'NN'),
 ('bet', 'NN'),
 ('.', '.'),
 ('labrad', 'CC'),
 ('finick', 'JJ'),
 ('apprec', 'NN'),
 ('produc', 'NN'),
 ('bet', 'NN'),
 ('.', '.'),
 ('produc', 'NN'),
 ('ar', 'NN'),
 ('label', 'NN'),
 ('jumbo', 'NN'),
 ('salt', 'NN'),
 ('peanut', 'NN'),
 ('...', ':'),
 ('peanut', 'NN'),
 ('act', 'NN'),
 ('smal', 'JJ'),
 ('siz', 'NN'),
 ('unsalt', 'NN'),
 ('.', '.'),
 ('sur', 'JJ'),
 ('er', 'JJ'),
 ('vend', 'NN'),
 ('intend', 'VBP'),
 ('repr', 'NN'),
 ('produc', 'NN'),
 ('``', '``'),
 ('jumbo', 'JJ'),
 ("''", "''"),
 ('.', '.'),
 ('confect', 'VB'),
 ('around', 'IN'),
 ('centur', 'NN'),
 ('.', '.'),
 ('light', 'JJ'),
 ('pillow', 'JJ'),
 ('citru', 'NN'),
 ('gelatin', 'NN'),
 ('nut', 'SYM'),
 ('-'

# Save the data

In [22]:
filter1 =nltk.pos_tag(filtered_sentence)
df_filter =pd.DataFrame(filter1,columns=['pos_tag', 'tag_type'])
#df_filter.to_csv('filter.csv') 

In [23]:
df_filter.head()

Unnamed: 0,pos_tag,tag_type
0,bought,VBN
1,sev,NNS
2,vit,NN
3,can,MD
4,dog,VB


# 4.**Techniques for Encoding**

**BAG OF WORDS**

In BoW we build a dictionary that contains the set of all unique words in our text review dataset, and we count the frequency of each word. If there are **d** unique words in the dictionary, then every sentence or review is represented by a vector of length **d**, and the count of each word in the review is stored at that word's position in the vector. The vector will be highly sparse in such a case.

Ex. pasta is tasty and pasta is good

| Dictionary | a | and | good | is | pasta | tasty | ... |
|---|---|---|---|---|---|---|---|
| Vector | 0 | 1 | 1 | 2 | 2 | 1 | 0 |

(All remaining positions, shown as "...", are zero.)

Using scikit-learn's CountVectorizer we can get the BoW representation. It is worth checking all the parameters it offers; one of them is max_features. Setting max_features=5000 tells it to keep only the 5000 most frequent words in the dictionary, so the dictionary length, and therefore the vector length, will be only 5000.
    


**BINARY BAG OF WORDS**

In binary BoW we do not count the frequency of a word; we simply place **1** if the word appears in the review and **0** otherwise. CountVectorizer has a parameter **binary**; setting **binary=True** turns our BoW into binary BoW.
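
To make the two encodings concrete, here is a minimal sketch in plain Python that reproduces the pasta example by hand. The helper names `build_vocabulary` and `bow_vector` are hypothetical, and the second sentence is a made-up extra document added only so that the word "a" appears in the dictionary; CountVectorizer performs additional tokenization steps that this sketch does not attempt to mirror.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of the unique words across all documents."""
    return sorted({word for doc in documents for word in doc.split()})


def bow_vector(document, vocabulary, binary=False):
    """Count vector for one document; with binary=True every count is capped at 1."""
    counts = Counter(document.split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


# The second sentence is made up so that "a" is part of the dictionary
documents = ["pasta is tasty and pasta is good", "a good pasta"]
vocabulary = build_vocabulary(documents)

print(vocabulary)                                          # the dictionary
print(bow_vector(documents[0], vocabulary))                # count BoW
print(bow_vector(documents[0], vocabulary, binary=True))   # binary BoW
```

The only difference between the two vectors is at the positions of "is" and "pasta", where the count of 2 becomes 1 in the binary version.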
   
  

In [24]:
filtered_sentence

['bought',
 'sev',
 'vit',
 'can',
 'dog',
 'food',
 'produc',
 'found',
 'good',
 'qualit',
 '.',
 'produc',
 'look',
 'lik',
 'stew',
 'process',
 'meat',
 'smel',
 'bet',
 '.',
 'labrad',
 'finick',
 'apprec',
 'produc',
 'bet',
 '.',
 'produc',
 'ar',
 'label',
 'jumbo',
 'salt',
 'peanut',
 '...',
 'peanut',
 'act',
 'smal',
 'siz',
 'unsalt',
 '.',
 'sur',
 'er',
 'vend',
 'intend',
 'repr',
 'produc',
 '``',
 'jumbo',
 "''",
 '.',
 'confect',
 'around',
 'centur',
 '.',
 'light',
 'pillow',
 'citru',
 'gelatin',
 'nut',
 '-',
 'cas',
 'filbert',
 '.',
 'cut',
 'tin',
 'squ',
 'lib',
 'coat',
 'powd',
 'sug',
 '.',
 'tin',
 'mou',
 'heav',
 '.',
 'chew',
 'flav',
 '.',
 'highl',
 'recommend',
 'yumm',
 'tre',
 '.',
 'famili',
 'stor',
 'c.',
 '.',
 'lew',
 "'",
 '``',
 'lion',
 'witch',
 'wardrob',
 "''",
 '-',
 'tre',
 'seduc',
 'edmund',
 'sel',
 'broth',
 'sist',
 'witch',
 '.',
 'look',
 'secret',
 'ingred',
 'robitussin',
 'believ',
 'found',
 '.',
 'got',
 'addit',
 'root',

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
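vectorizer  = CountVectorizer()
# NOTE: filtered_sentence is a flat list of tokens, so each token is treated as
# its own document here (one row per token rather than one row per review).
vectorizer.fit(filtered_sentence)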

CountVectorizer()

In [35]:
len(vectorizer.vocabulary_)

5504

In [36]:
vector = vectorizer.transform(filtered_sentence)
vector.shape

(84583, 5504)

In [37]:
vector.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [38]:
binary_vectorizer  =CountVectorizer(binary=True)
binary_vectorizer.fit(filtered_sentence)
len(binary_vectorizer.vocabulary_)

5504

In [39]:
binary_vector = binary_vectorizer.transform(filtered_sentence)
binary_vector.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

**Drawbacks of BoW / Binary BoW**

Our main objective in these text-to-vector encodings is that texts with similar meanings should have vectors that are close to each other, but in some cases this is not possible with BoW.

For example, consider the two reviews **This pasta is very tasty** and **This pasta is not tasty**. After stopword removal both sentences are reduced to **pasta tasty**, so both end up with exactly the same representation even though their meanings are opposite.

The main problem is that we are not considering the words that come before and after each word. This is where the Bigram and Ngram techniques come in.

## 4.1.N-gram 

### 4.1.1.**BI-GRAM BOW**

Using pairs of consecutive words to build the dictionary is called Bi-Gram; Tri-Gram uses three consecutive words, and so on for N-Gram.

CountVectorizer has a parameter **ngram_range**; if it is set to (1,2), the dictionary contains both single words and Bi-Grams.

But this massively increases our dictionary size.
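
The following sketch builds n-grams by hand for the two pasta reviews to show both effects: the bigrams keep the negation that single words lose, and the dictionary grows. The helper `ngrams` is a hypothetical name, and the reviews are lower-cased and split on spaces as a simplifying assumption.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


positive = "this pasta is very tasty".split()
negative = "this pasta is not tasty".split()

bigrams_pos = ngrams(positive, 2)
bigrams_neg = ngrams(negative, 2)
print(bigrams_pos)
print(bigrams_neg)

# Bigrams that occur in only one of the two reviews
print(sorted(set(bigrams_pos) ^ set(bigrams_neg)))

# Dictionary size with single words only versus single words plus bigrams
vocab_unigram = set(positive) | set(negative)
vocab_uni_bi = vocab_unigram | set(bigrams_pos) | set(bigrams_neg)
print(len(vocab_unigram), len(vocab_uni_bi))
```

Note that the bigrams are built before stopword removal here; if "not" and "very" were removed first, the two reviews would again produce identical bigrams.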

In [40]:
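# bigram

# NOTE: ngram_range=(0,1) only produces single-word features, which is why the
# vocabulary size below equals the plain BoW vocabulary; (1,2) would add Bi-Grams.
CountVec = CountVectorizer(ngram_range=(0,1))
bi = CountVec.fit(filtered_sentence)
len(bi.vocabulary_)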

5504

In [41]:
bigram_vector = CountVec.transform(filtered_sentence)
print(bigram_vector.shape)

(84583, 5504)


## 4.2.**TF-IDF**

**Term Frequency - Inverse Document Frequency** makes sure that less importance is given to the most frequent words and that less frequent words are also taken into account.

**Term Frequency** is the number of times a **particular word (W)** occurs in a review divided by the total number of words **(Wr)** in the review. The term frequency value ranges from 0 to 1.

**Inverse Document Frequency** is calculated as **log(Total Number of Docs (N) / Number of Docs which contain the particular word (n))**. Here Docs refers to Reviews.


**TF-IDF** is **TF * IDF**, that is **(W/Wr) * log(N/n)**


Using scikit-learn's TfidfVectorizer we can get the TF-IDF.

So even here we get a TF-IDF value for every word, and in some cases reviews with different meanings may still be treated as similar after stopword removal. To overcome this we can use Bi-Gram or N-Gram.
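
The formula above can be worked through by hand with a few lines of plain Python. This is a sketch of the textbook definition given in this section, using three made-up, already tokenized reviews; the function names are hypothetical, and the sketch assumes every queried word occurs in at least one review.

```python
import math


def term_frequency(word, document):
    """W / Wr: occurrences of the word divided by the number of words in the review."""
    return document.count(word) / len(document)


def inverse_document_frequency(word, documents):
    """log(N / n): N reviews in total, n of them containing the word (n must be > 0)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


# Three made-up reviews, already split into tokens
reviews = [
    "pasta is tasty and pasta is good".split(),
    "pasta is not tasty".split(),
    "the dog food is good".split(),
]

for word in ("is", "pasta", "and"):
    print(word, round(tf_idf(word, reviews[0], reviews), 4))
```

A word such as "is" that occurs in every review gets an IDF of log(1) = 0 and therefore a TF-IDF of 0, while the rarer "and" scores higher than "pasta". The values from scikit-learn's TfidfVectorizer will differ from this sketch, because by default it uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + n)) + 1, and L2 normalization of each row.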

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [43]:
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True,  
                        smooth_idf=True,  
                        ngram_range=(0,1),stop_words='english')
tf_idf_vec_smooth.fit(filtered_sentence)
len(tf_idf_vec_smooth.vocabulary_)

5337

In [44]:
tf_idf_vector = tf_idf_vec_smooth.transform(filtered_sentence)
tf_idf_vector.shape  # the vector itself is not printed: it is large, and displaying it can cause an out-of-memory error

(84583, 5337)

## 4.3.Word2Vec

Gensim is a free-to-use Python library. It provides APIs for solving various natural language processing problems and is designed to be fast, scalable and robust.

In this practice exercise we will train our own Word2Vec model using the gensim Word2Vec API. The objectives of this practice exercise are:


1.   Train your word2vec word embedding model.
2.   Visualize trained word embedding model using principal component analysis.


First step will be to load the corpus, clean it and tokenize it.

Libraries used in this notebook along with their version:

google	2.0.3

matplotlib	3.2.1

numpy	1.18.3

pandas	1.0.3

In [53]:
#pip install gensim

In [45]:
from gensim.test.utils import common_texts

In [46]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Next step is to import the Word2Vec model from gensim.

In [47]:
from gensim.models import Word2Vec

##### Create your own model using the data_list defined above and gensim Word2Vec API. (Hint: https://radimrehurek.com/gensim/models/word2vec.html)

In [48]:
type(report)

pandas.core.series.Series

In [49]:
train = report.head(5)  # taking 5 values to make computation faster

In [50]:
# Split each review on single spaces so that every review becomes a list of tokens
train = train.apply(lambda sentence: sentence.split(' '))
    

In [52]:
document = []
for val in train:
    document.append(val)
len(document)

5

In [53]:
model = Word2Vec(sentences=document, vector_size=4, window=1, min_count=1, workers=4)
model.save("word2vec_2.model") 

In [54]:
model.wv['Salted']  # vector for Salted

array([-0.00525026,  0.08687656, -0.02292105,  0.2099418 ], dtype=float32)

In [55]:
model.wv.most_similar('Salted', topn=10) # most similar

[('very', 0.9754360318183899),
 ('This', 0.8565337657928467),
 ('out', 0.8558760285377502),
 ('got', 0.8383424878120422),
 ('dog', 0.7904241681098938),
 ('in', 0.7720502614974976),
 ('with', 0.7592442035675049),
 ('you', 0.7581216096878052),
 ('', 0.7547386288642883),
 ('Filberts.', 0.7532793879508972)]

In [58]:
model.wv['food']

array([-0.18129264,  0.23563291,  0.1912711 ,  0.13729256], dtype=float32)

In [59]:
model.wv.most_similar('food', topn=10)

[('Witch.', 0.9921318292617798),
 ('canned', 0.9699839353561401),
 ('you', 0.9456015825271606),
 ('tiny', 0.9127481579780579),
 ('-', 0.8640699982643127),
 ('got', 0.8637152910232544),
 ('more', 0.8624321222305298),
 ('labeled', 0.8432460427284241),
 ('with', 0.8026028871536255),
 ('liberally', 0.7958273887634277)]
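
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration, the sketch below computes that measure by hand for the two vectors printed above for 'Salted' and 'food'; the numbers are copied from those outputs, and the function name `cosine_similarity` is a hypothetical helper, not a gensim call.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors copied from the outputs of model.wv['Salted'] and model.wv['food']
salted = [-0.00525026, 0.08687656, -0.02292105, 0.2099418]
food = [-0.18129264, 0.23563291, 0.1912711, 0.13729256]

print(cosine_similarity(salted, food))
print(cosine_similarity(salted, salted))  # a vector compared with itself
```

With only five reviews and four-dimensional vectors, the neighbours listed above are largely noise, which is probably why unrelated words such as 'very' or 'Witch.' rank so highly.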

##### Use the PCA algorithm from sklearn to convert the high-dimensional word embeddings to two dimensions and save them in the variable "results".

##### Visualizing the word embeddings.

# 5.Emotion and Sentiment Analysis