
# One Hot Encoding:
One hot encoding is a technique that we use to represent categorical variables as numerical values in a machine learning model.

The advantages of using one hot encoding include:
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance because each category gets its own feature, so the model can learn a separate effect for every category.
3. It avoids implying an order that does not exist. Encoding nominal categories (e.g. “red”, “green”, “blue”) as integers such as 0, 1, 2 suggests a ranking, whereas one-hot columns treat every category as equally distinct.

The disadvantages of using one hot encoding include:
1. It can lead to increased dimensionality, as a separate column is created for each category in the variable. This can make the model more complex and slower to train.
2. It can lead to sparse data, as most observations will have a value of 0 in most of the one-hot encoded columns.
3. It can lead to overfitting, especially if there are many categories in the variable and the sample size is relatively small.

One-hot encoding is a powerful technique for handling categorical data, but because of these drawbacks it should be used with care. For variables with many categories, consider other methods such as ordinal encoding or binary encoding.
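To make the idea concrete before turning to library functions, here is a minimal sketch of one-hot encoding written by hand with NumPy. The helper name `one_hot` and the colour values are illustrative assumptions, not part of the original notebook; categories are simply sorted alphabetically to fix the column order.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix); each row of matrix contains exactly one 1."""
    categories = sorted(set(values))                     # fixed column order
    index = {cat: i for i, cat in enumerate(categories)}
    matrix = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        matrix[row, index[value]] = 1                    # switch on the matching column
    return categories, matrix

colours = ["red", "green", "blue", "red"]  # hypothetical example data
categories, encoded = one_hot(colours)
print(categories)  # ['blue', 'green', 'red']
print(encoded)
```

Each category becomes one column, which is also why the number of columns grows with the number of distinct categories.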

In [19]:
import numpy as np
import pandas as pd

In [20]:
df=pd.read_excel('employee.xlsx')

In [21]:
df

| | Gender | Employee_ID | Remarks |
|---|---|---|---|
| 0 | Male | 45 | Nice |
| 1 | Female | 78 | Good |
| 2 | Female | 56 | Great |
| 3 | Male | 12 | Great |
| 4 | Female | 7 | Nice |


In [23]:
print(df['Gender'].unique())


['Male' 'Female']


In [24]:
print(df['Remarks'].unique())

['Nice' 'Good' 'Great']


In [26]:
df['Gender'].value_counts()

Female    3
Male      2
Name: Gender, dtype: int64

In [27]:
df['Remarks'].value_counts()

Nice     2
Great    2
Good     1
Name: Remarks, dtype: int64

We can use the pd.get_dummies() function from pandas to one-hot encode the categorical columns.

In [30]:
one_hot_encoded_data=pd.get_dummies(df,columns=['Remarks','Gender'])
print(one_hot_encoded_data)

   Employee_ID  Remarks_Good  Remarks_Great  Remarks_Nice  Gender_Female  \
0           45             0              0             1              0   
1           78             1              0             0              1   
2           56             0              1             0              1   
3           12             0              1             0              0   
4            7             0              0             1              1   

   Gender_Male  
0            1  
1            0  
2            0  
3            1  
4            0  


**One Hot Encoding using the scikit-learn Library:**

Scikit-learn (sklearn) is a popular machine-learning library in Python that provides numerous tools for data preprocessing. Its OneHotEncoder class encodes categorical variables as binary vectors. Older versions of scikit-learn accepted only integer-coded categories, so the cells below first convert each categorical column to integer codes. Recent versions also accept string categories directly, which makes this step optional.

In [31]:
from sklearn.preprocessing import OneHotEncoder

In [45]:
# Converting type of columns to category
df['Gender']=df['Gender'].astype('category')
df['Remarks']=df['Remarks'].astype('category')

In [46]:
# Assign integer codes and store them in new columns
df['Gen_new']=df['Gender'].cat.codes
df['Rem_new']=df['Remarks'].cat.codes
df

| | Gender | Employee_ID | Remarks | Gen_new | Rem_new |
|---|---|---|---|---|---|
| 0 | Male | 45 | Nice | 1 | 2 |
| 1 | Female | 78 | Good | 0 | 0 |
| 2 | Female | 56 | Great | 0 | 1 |
| 3 | Male | 12 | Great | 1 | 1 |
| 4 | Female | 7 | Nice | 0 | 2 |


In [48]:
# Create an instance of One-hot-encoder
enc=OneHotEncoder()

In [49]:
# Passing encoded columns
enc_data=pd.DataFrame(enc.fit_transform(df[['Gen_new','Rem_new']]).toarray())
enc_data

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [50]:
# Merge with main
New_df=df.join(enc_data)
New_df

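| | Gender | Employee_ID | Remarks | Gen_new | Rem_new | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 45 | Nice | 1 | 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | Female | 78 | Good | 0 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | Female | 56 | Great | 0 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Male | 12 | Great | 1 | 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | Female | 7 | Nice | 0 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |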


We call .toarray() on the result of enc.fit_transform() because OneHotEncoder returns a SciPy sparse matrix by default. The sparse format stores only the non-zero entries, which saves memory when there is a huge number of categories. Converting to a dense array gives up that saving, but it lets us pass the result to pd.DataFrame and join it with the original table.

# Bag of Words (BoW):
Bag of words is a Natural Language Processing technique of text modelling. In technical terms, we can say that it is a method of feature extraction with text data. This approach is a simple and flexible way of extracting features from documents.

A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
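The point about discarded word order can be illustrated with a few lines of standard-library Python. This is only a sketch: the helper name `bag_of_words` and the two example sentences are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each lower-cased, whitespace-separated word."""
    return Counter(sentence.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # 2
print(a == b)    # True: same words, same counts, order is ignored
```

The two sentences mean different things, yet their bags are identical, which is exactly the information a bag-of-words model gives up.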

In [48]:
text="Beans. I was trying to explain to somebody as we were flying in, that’s corn. That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commencement."

In [49]:
# Python3 code for preprocessing text
import nltk
import re
import numpy as np
import pandas as pd
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [50]:
df=nltk.sent_tokenize(text)
df

['Beans.',
 'I was trying to explain to somebody as we were flying in, that’s corn.',
 'That’s beans.',
 'And they were very impressed at my agricultural knowledge.',
 'Please give it up for Amaury once again for that outstanding introduction.',
 'I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here.',
 'I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have.',
 'And it’s great to see you, Governor.',
 'I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today.',
 'And I am deeply honored at the Paul Douglas Award that is being given to me.',
 'He is somebody who set the path for so much outstanding public service here in Illinois.',
 'Now, I want to start by addressing the elephant in the room.',
 'I know people are

In [51]:
for i in range(len(df)):
  df[i] = df[i].lower()
  df[i] = re.sub(r'\W', ' ', df[i])
  df[i] = re.sub(r'\s+', ' ', df[i])
df

['beans ',
 'i was trying to explain to somebody as we were flying in that s corn ',
 'that s beans ',
 'and they were very impressed at my agricultural knowledge ',
 'please give it up for amaury once again for that outstanding introduction ',
 'i have a bunch of good friends here today including somebody who i served with who is one of the finest senators in the country and we re lucky to have him your senator dick durbin is here ',
 'i also noticed by the way former governor edgar here who i haven t seen in a long time and somehow he has not aged and i have ',
 'and it s great to see you governor ',
 'i want to thank president killeen and everybody at the u of i system for making it possible for me to be here today ',
 'and i am deeply honored at the paul douglas award that is being given to me ',
 'he is somebody who set the path for so much outstanding public service here in illinois ',
 'now i want to start by addressing the elephant in the room ',
 'i know people are still wonde

In [52]:
# Creating the Bag of Words model
wordcount={}
for data in df:
  words=nltk.word_tokenize(data)
  for word in words:
    if word not in wordcount:
      wordcount[word]=1
    else:
      wordcount[word]+=1
wordcount

{'beans': 2,
 'i': 12,
 'was': 1,
 'trying': 1,
 'to': 8,
 'explain': 1,
 'somebody': 3,
 'as': 1,
 'we': 2,
 'were': 2,
 'flying': 1,
 'in': 5,
 'that': 4,
 's': 3,
 'corn': 1,
 'and': 7,
 'they': 1,
 'very': 1,
 'impressed': 1,
 'at': 4,
 'my': 1,
 'agricultural': 1,
 'knowledge': 1,
 'please': 1,
 'give': 1,
 'it': 3,
 'up': 1,
 'for': 5,
 'amaury': 1,
 'once': 1,
 'again': 1,
 'outstanding': 2,
 'introduction': 1,
 'have': 3,
 'a': 2,
 'bunch': 1,
 'of': 3,
 'good': 1,
 'friends': 1,
 'here': 5,
 'today': 2,
 'including': 1,
 'who': 4,
 'served': 1,
 'with': 1,
 'is': 4,
 'one': 1,
 'the': 9,
 'finest': 1,
 'senators': 1,
 'country': 1,
 're': 1,
 'lucky': 1,
 'him': 1,
 'your': 1,
 'senator': 1,
 'dick': 1,
 'durbin': 1,
 'also': 1,
 'noticed': 1,
 'by': 2,
 'way': 1,
 'former': 1,
 'governor': 2,
 'edgar': 1,
 'haven': 1,
 't': 2,
 'seen': 1,
 'long': 1,
 'time': 1,
 'somehow': 1,
 'he': 2,
 'has': 1,
 'not': 1,
 'aged': 1,
 'great': 1,
 'see': 1,
 'you': 1,
 'want': 2,
 'thank':

In [53]:
import heapq
fre_words=heapq.nlargest(100,wordcount, key=wordcount.get)
fre_words

['i',
 'the',
 'to',
 'and',
 'in',
 'for',
 'here',
 'that',
 'at',
 'who',
 'is',
 'somebody',
 's',
 'it',
 'have',
 'of',
 'beans',
 'we',
 'were',
 'outstanding',
 'a',
 'today',
 'by',
 'governor',
 't',
 'he',
 'want',
 'me',
 'was',
 'trying',
 'explain',
 'as',
 'flying',
 'corn',
 'they',
 'very',
 'impressed',
 'my',
 'agricultural',
 'knowledge',
 'please',
 'give',
 'up',
 'amaury',
 'once',
 'again',
 'introduction',
 'bunch',
 'good',
 'friends',
 'including',
 'served',
 'with',
 'one',
 'finest',
 'senators',
 'country',
 're',
 'lucky',
 'him',
 'your',
 'senator',
 'dick',
 'durbin',
 'also',
 'noticed',
 'way',
 'former',
 'edgar',
 'haven',
 'seen',
 'long',
 'time',
 'somehow',
 'has',
 'not',
 'aged',
 'great',
 'see',
 'you',
 'thank',
 'president',
 'killeen',
 'everybody',
 'u',
 'system',
 'making',
 'possible',
 'be',
 'am',
 'deeply',
 'honored',
 'paul',
 'douglas',
 'award',
 'being',
 'given',
 'set',
 'path',
 'so']

In [54]:
# Build a binary bag-of-words matrix: one row per sentence, one column per frequent word

x=[]
for data in df:
  tokens=set(nltk.word_tokenize(data))  # tokenize each sentence once
  vector=[]
  for word in fre_words:
    if word in tokens:
      vector.append(1)
    else:
      vector.append(0)
  x.append(vector)
x=np.asarray(x)

In [55]:
x

array([[0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])
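Note that the matrix above records only whether a word is present (1) or absent (0), while a bag of words is often stored with actual counts. The sketch below contrasts the two on a tiny, made-up vocabulary; the helper `vectorize`, the sentences, and the whitespace tokenization are assumptions for illustration.

```python
import numpy as np

sentences = ["beans", "that s beans", "i know i know"]  # hypothetical cleaned sentences
vocab = ["i", "beans", "know", "that"]                  # hypothetical vocabulary

def vectorize(sentence, vocab, binary=False):
    """Return one bag-of-words row: counts, or 0/1 flags when binary=True."""
    tokens = sentence.split()
    counts = [tokens.count(word) for word in vocab]
    if binary:
        counts = [min(c, 1) for c in counts]
    return counts

binary_x = np.array([vectorize(s, vocab, binary=True) for s in sentences])
count_x = np.array([vectorize(s, vocab) for s in sentences])

print(binary_x[2])  # [1 0 1 0]
print(count_x[2])   # [2 0 2 0]
```

The two representations differ only for words that repeat within a sentence.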

# **class sklearn.feature_extraction.text.CountVectorizer**

`CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)`

In [None]:
import numpy as np
import pandas as pd

In [None]:
df=pd.DataFrame({'text':['people watch campusx', 'campusx watch campusx', 'people wirte comment', 'campusx write comment'],'output':[1,1,0,0]})

In [None]:
df

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people wirte comment | 0 |
| 3 | campusx write comment | 0 |


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(stop_words=None)

In [None]:
bow=cv.fit_transform(df['text'])

In [None]:
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'wirte': 4, 'comment': 1, 'write': 5}


In [None]:
print(bow[0].toarray())

[[1 0 1 1 0 0]]


In [None]:
print(bow[1].toarray())

[[2 0 0 1 0 0]]


In [None]:
cv.transform(['campusx watch and write comment of campusx']).toarray()

array([[2, 1, 0, 1, 0, 1]])

# N-Grams:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(2,2))

In [None]:
bow=cv.fit_transform(df['text'])

In [None]:
print(cv.vocabulary_)

{'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people wirte': 3, 'wirte comment': 5, 'campusx write': 1, 'write comment': 6}


In [None]:
print(bow[0].toarray())

[[0 0 1 0 1 0 0]]


In [None]:
print(bow[1].toarray())

[[1 0 0 0 1 0 0]]


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1,2))

In [None]:
bow=cv.fit_transform(df['text'])

In [None]:
print(cv.vocabulary_)

{'people': 4, 'watch': 7, 'campusx': 0, 'people watch': 5, 'watch campusx': 8, 'campusx watch': 1, 'wirte': 9, 'comment': 3, 'people wirte': 6, 'wirte comment': 10, 'write': 11, 'campusx write': 2, 'write comment': 12}


In [None]:
print(bow[0].toarray())

[[1 0 0 0 1 1 0 1 1 0 0 0 0]]


In [None]:
print(bow[1].toarray())

[[2 1 0 0 0 0 0 1 1 0 0 0 0]]


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1,3))

In [None]:
bow=cv.fit_transform(df['text'])

In [None]:
print(cv.vocabulary_)

{'people': 6, 'watch': 11, 'campusx': 0, 'people watch': 7, 'watch campusx': 12, 'people watch campusx': 8, 'campusx watch': 1, 'campusx watch campusx': 2, 'wirte': 13, 'comment': 5, 'people wirte': 9, 'wirte comment': 14, 'people wirte comment': 10, 'write': 15, 'campusx write': 3, 'write comment': 16, 'campusx write comment': 4}


In [None]:
print(bow[0].toarray())

[[1 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 0]]


**NOTE:**

ngram_range=(1,1) --> Unigrams (BoW)

ngram_range=(2,2) --> Bigrams

ngram_range=(3,3) --> Trigrams

ngram_range=(1,2) --> Unigram+Bigram

ngram_range=(1,3) --> Unigram+Bigram+Trigram

and so on...
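What ngram_range selects can be sketched in plain Python. The helpers `ngrams` and `ngram_range_features` below are illustrative names, not scikit-learn functions, and they split on whitespace rather than using CountVectorizer's tokenizer.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range_features(text, min_n, max_n):
    """Collect every n-gram with min_n <= n <= max_n, like ngram_range=(min_n, max_n)."""
    tokens = text.split()
    features = []
    for n in range(min_n, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

print(ngram_range_features("people watch campusx", 1, 1))
# ['people', 'watch', 'campusx']
print(ngram_range_features("people watch campusx", 2, 2))
# ['people watch', 'watch campusx']
print(ngram_range_features("people watch campusx", 1, 3))
# ['people', 'watch', 'campusx', 'people watch', 'watch campusx', 'people watch campusx']
```

These are the same features that appear for the first document in the vocabularies printed above.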

# Tf-Idf:
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection of documents (a corpus). The weight increases proportionally to the number of times the word appears in the document but is offset by how frequently the word occurs across the corpus (data-set).

Terminologies:

1. Term Frequency: The term frequency measures how often a given word t occurs in document d. A word becomes more relevant to a document the more often it appears in it. Since the ordering of terms is not significant, we can describe the text as a vector, as in the bag-of-words model: for each distinct term in the document, there is an entry whose value is the term frequency.
The weight of a term that occurs in a document is simply proportional to the term frequency.

  **tf(t,d) = count of t in d / number of words in d**

2. Document Frequency: This measures how common a term is across the whole corpus, and it is very similar to TF. The only difference is that TF counts the occurrences of a term t within a single document d, while DF counts the documents in the document set N in which the term t occurs. In other words, DF is the number of documents in which the word is present.

  **df(t) = number of documents containing t**

3. Inverse Document Frequency: Mainly, it measures how informative the word is. The key aim of a search is to locate the documents that best fit the query. Since tf considers all terms equally significant, term frequencies alone are not enough to measure the weight of a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:

  df(t) = N(t)

  where

  df(t) = Document frequency of a term t
  
  N(t) = Number of documents containing the term t 

4. Term frequency is the number of occurrences of a term in a single document only, whereas document frequency is the number of separate documents in which the term appears, so it depends on the entire corpus. Now let’s look at the definition of the inverse document frequency. The IDF of a word is the number of documents in the corpus divided by the document frequency of the word.

  **idf(t) = N/ df(t) = N/N(t)**

  A more common word should be considered less significant, but this raw ratio seems too harsh a penalty. We therefore take the logarithm (with base 2) of the inverse document frequency. So the idf of the term t becomes:

  **idf(t) = log(N/ df(t))**

5. Computation: Tf-idf is one of the most widely used metrics for determining how significant a term is to a document in a corpus. It is a weighting system that assigns a weight to each word in a document based on its term frequency (tf) and its inverse document frequency (idf). The words with higher weights are deemed to be more significant.
Usually, the tf-idf weight consists of two terms:

  Normalized Term Frequency (tf)
  
  Inverse Document Frequency (idf)
  
  **tf-idf(t, d) = tf(t, d) * idf(t)**
  
  In Python, tf-idf values can be computed using the TfidfVectorizer class in the sklearn module.
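The formulas above can be sketched directly in plain Python. This is a minimal illustration, assuming whitespace tokenization and the base-2 logarithm mentioned in the text; the function names `tf`, `doc_freq`, `idf` and `tf_idf` are chosen here for readability, and the four documents are the same toy texts used in the CountVectorizer cells.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people wirte comment",
    "campusx write comment",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # number of documents in the corpus

def tf(term, tokens):
    """tf(t, d) = count of t in d / number of words in d."""
    return tokens.count(term) / len(tokens)

def doc_freq(term):
    """df(t) = number of documents containing t."""
    return sum(term in tokens for tokens in tokenized)

def idf(term):
    """idf(t) = log2(N / df(t))."""
    return math.log2(N / doc_freq(term))

def tf_idf(term, tokens):
    """tf-idf(t, d) = tf(t, d) * idf(t)."""
    return tf(term, tokens) * idf(term)

print(doc_freq("campusx"))           # 3
print(idf("wirte"))                  # 2.0
print(tf_idf("campusx", tokenized[1]))
```

A word that appears in three of the four documents ("campusx") gets a much smaller idf than a word that appears in only one ("wirte"), which is the intended down-weighting of common terms.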

In [None]:
df

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people wirte comment | 0 |
| 3 | campusx write comment | 0 |


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ,
        0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ,
        0.        ],
       [0.        , 0.52640543, 0.52640543, 0.        , 0.66767854,
        0.        ],
       [0.44809973, 0.55349232, 0.        , 0.        , 0.        ,
        0.70203482]])

In [None]:
print(tfidf.idf_)

[1.22314355 1.51082562 1.51082562 1.51082562 1.91629073 1.91629073]


In [None]:
print(tfidf.get_feature_names_out())

['campusx' 'comment' 'people' 'watch' 'wirte' 'write']
