Word embedding is the process of representing words for text analysis as real-valued vectors, so that words with similar meanings have vectors that lie close to each other.

They are of 2 types:
    (1) Count or frequency based -> One-hot encoding, Bag of Words, TF-IDF
    (2) Deep learning models -> Word2Vec
        (A) CBOW (Continuous Bag of Words, a neural network)
        (B) Skip-gram

Word2Vec:
This uses a neural network to learn word associations from a large corpus of text. Once trained, the model can detect synonymous words or suggest additional words for a partial sentence.
As the name suggests, Word2Vec represents each word with a list of numbers called a vector.

a) CBOW (Continuous Bag of Words):
    -> We choose a window size (the number of words in the window); the center word of the window is the output and the surrounding words are the input
    -> We slide the window one word at a time, collect a training example at each position, and then train the model on these examples
    -> CBOW is a fully connected neural network; its weights are learned by minimizing a loss function with backpropagation, and the learned weights become the word vectors
    -> As a rule of thumb, this is used for a smaller corpus
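
The sliding window described above can be sketched in plain Python. This is only an illustration of how training examples might be collected, not the actual Word2Vec implementation; the function name `cbow_pairs` is hypothetical, and `window` is assumed here to mean the number of words taken on each side of the center word.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, center_word) pairs for a symmetric window.

    `window` is the number of words taken on each side of the center word
    (an assumption made for this sketch).
    """
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip positions that have no neighbours at all
            pairs.append((context, center))
    return pairs


sentence = "the quick brown fox jumps".split()
for context, center in cbow_pairs(sentence, window=1):
    # input (context words) -> output (center word)
    print(context, "->", center)
```

Each printed line is one training example: the context words are fed in and the network is trained to predict the center word.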

b) Skip-gram:
    -> The input and output are swapped compared to CBOW: the center word is the input and the surrounding context words are the output
    -> As a rule of thumb, this is used for a large corpus
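
A minimal sketch of how Skip-gram training examples might be generated, assuming a symmetric window with `window` words on each side. The function name `skipgram_pairs` is hypothetical and the real training procedure involves more than this.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs, one pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# input (center word) -> output (one context word)
for center, context in skipgram_pairs("the quick brown fox".split(), window=1):
    print(center, "->", context)
```

Here the center word is the input and each neighbouring word becomes a separate target, so one window position yields several training examples.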

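Advantages of Word2Vec:
    -> It produces dense vectors (a dense matrix) instead of sparse ones
    -> The semantic meaning of words is captured, and so is the similarity between words
    -> The vector size is fixed regardless of the vocabulary size [dimensions are around 300]
    -> Out-of-vocabulary words are less of a problem, since with a huge corpus almost every word is captured; a word that never appeared in training still has no vector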


Average Word2Vec:
    Each word in a sentence has its own vector. To represent the whole sentence, we average these word vectors dimension by dimension, which gives a single vector of the same size as a word vector.
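
The averaging step can be sketched with NumPy. The helper `average_vector` and the 3-dimensional toy vectors are made up for illustration; skipping words that have no vector, and returning zeros when none are found, are assumptions of this sketch rather than anything fixed by Word2Vec.

```python
import numpy as np


def average_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have one; zeros if none do."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)


# Toy 3-dimensional vectors, invented for this example only.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

# "unknownword" has no vector and is skipped.
sentence_vec = average_vector(["good", "movie", "unknownword"], toy_vectors, dim=3)
print(sentence_vec)  # [2. 1. 1.]
```

The same function should also work with a gensim `KeyedVectors` object in place of the dictionary, since it supports `in` checks and lookup by word, with `dim` set to its `vector_size`.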




In [None]:
#using google pre-trained models:
#https://huggingface.co/fse/word2vec-google-news-300

# !pip install gensim


In [None]:
from gensim.models import Word2Vec, KeyedVectors


In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king=wv['king'] #this looks up the word and returns its vector

In [None]:
vec_king.shape

In [None]:
wv.most_similar('cricket') #This is used to find the words most similar to the word cricket

In [None]:
wv.similarity("hockey", "sports") #this is to tell how similar the two words are
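
The score returned by `wv.similarity` is the cosine similarity between the two word vectors. A minimal sketch of that computation on toy 2-dimensional vectors (the helper name and the vectors are made up for illustration):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

print(cosine_similarity(a, a))  # same direction
print(cosine_similarity(a, b))  # 45 degrees apart
print(cosine_similarity(a, c))  # orthogonal
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated in this space.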

In [None]:
vec=wv['king']-wv['man']+wv['woman'] #vector arithmetic works because each word is now a numerical vector; 'queen' is expected among the nearest words
wv.most_similar([vec])

Spam and Ham => use Bag of Words and TF-IDF to convert the text to numerical values, then use machine learning to classify each message as spam or not spam (ham)

In [None]:
import pandas as pd
messages=pd.read_csv('SMSSpamCollection.csv',sep='\t',names=["label","message"])

In [None]:
messages

In [None]:
#Data Cleaning And Preprocessing
import re
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
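ps=PorterStemmer()
stop_words=set(stopwords.words('english')) #build the stopword set once instead of reloading it for every word

corpus=[]
for i in range(0,len(messages)):
    review = re.sub('[^a-zA-Z]',' ',messages['message'][i]) #keep letters only
    review=review.lower()
    review=review.split()
    review=[ps.stem(word) for word in review if word not in stop_words] #drop stopwords and stem the rest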
    review=' '.join(review)
    corpus.append(review)
corpus

In [None]:
#The output feature is the label

y=pd.get_dummies(messages['label']).astype(int)
y #this gives two columns (ham and spam); one of them is enough



In [None]:
y=y.iloc[:,1].values #keep the second column (spam, assuming the labels are 'ham' and 'spam'): 1 = spam, 0 = ham
y

Note: as a best practice, follow the steps below in order:
1) Preprocessing and cleaning
2) Train/test split
3) BOW and TF-IDF -> fitted after the split, on the training data only, to prevent data leakage
4) Train the model
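
Why the order matters can be shown with a toy Bag of Words written in plain Python. This is a simplified sketch, not how `CountVectorizer` is implemented; the helper names `fit_vocabulary` and `transform` and the example messages are made up. The point is that the vocabulary is learned from the training documents only, so nothing about the test set leaks into the features.

```python
def fit_vocabulary(train_docs):
    """Build a word -> column index mapping from the training documents only."""
    vocab = {}
    for doc in train_docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def transform(docs, vocab):
    """Count occurrences of known words; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:
                row[vocab[word]] += 1
        rows.append(row)
    return rows


train_docs = ["free prize call", "call me later"]
test_docs = ["free lottery prize"]  # "lottery" never appears in training

vocab = fit_vocabulary(train_docs)          # step 3: fit on training data only
X_train_toy = transform(train_docs, vocab)
X_test_toy = transform(test_docs, vocab)    # test data is only transformed

print(vocab)
print(X_test_toy)
```

The unseen word "lottery" gets no column, which mirrors what happens when a fitted vectorizer is applied to new messages after deployment.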

In [None]:
# Train and Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(corpus,y,test_size=0.20)

In [None]:
#Creating the Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=100,binary=True, ngram_range=(1,2)) #binary=True gives a binary BOW (presence/absence instead of counts)
X_train=cv.fit_transform(X_train).toarray() #learn the vocabulary on the training data only
X_test=cv.transform(X_test).toarray() #reuse the training vocabulary on the test data
cv.vocabulary_

In [None]:
from sklearn.naive_bayes import MultinomialNB #this performs well on sparse matrices
spam_detect_model = MultinomialNB().fit(X_train,y_train)
y_pred=spam_detect_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score,classification_report
accuracy_score(y_test,y_pred)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
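
The classification report lists precision, recall and F1 per class, which matter more than accuracy when one class is much rarer than the other. A minimal sketch of how these numbers are derived for one class; the helper name and the toy labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels: 1 = spam, 0 = ham.
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true_toy, y_pred_toy))
```

In this toy case there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3.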

#Spam and Ham Project using TFIDF


In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(corpus,y,test_size=0.20)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(max_features=2055,ngram_range=(1,2))
X_train=tfidf.fit_transform(X_train).toarray()
X_test=tfidf.transform(X_test).toarray()
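
To make the TF-IDF weighting less of a black box, here is a plain-Python sketch. It assumes the smoothed IDF form, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which is what scikit-learn documents as its default behaviour. Tokenisation is simplified to whitespace splitting, and the function name and toy messages are made up.

```python
import math


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation, on whitespace tokens."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        raw = [toks.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in raw])
    return vocab, rows


vocab_toy, rows_toy = tfidf_rows(["free prize", "free call"])
print(vocab_toy)
print(rows_toy)
```

"free" occurs in both messages, so it receives a lower weight than "prize" or "call", which each occur in only one.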

In [None]:
tfidf.vocabulary_

In [None]:
from sklearn.naive_bayes import MultinomialNB #this performs well on sparse matrices
spam_tfidf_detect_model = MultinomialNB().fit(X_train,y_train)
y_tfidf_pred=spam_tfidf_detect_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score,classification_report
accuracy_score(y_test,y_tfidf_pred)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_tfidf_pred))