<a href="https://colab.research.google.com/github/Metallicode/Math/blob/main/Bayes_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Simple Naive Bayes classifier

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [79]:
df = pd.read_csv("O_movies_reviews.csv")

In [80]:
df.head()

Unnamed: 0,Review,Label
0,The plot was dull and predictable.,0
1,The cinematography was absolutely stunning!,1
2,Terrible acting and poor direction ruined the ...,0
3,The lead actor's performance was breathtaking ...,1
4,I wasted two hours of my life on this film.,0


In [81]:
X = df['Review']
y = df["Label"]

##Create Probability for Each Class

In [82]:
total_documents = y.count()
total_documents

33

In [83]:
y.value_counts()

0    18
1    15
Name: Label, dtype: int64

In [84]:
n_bad_reviews = y.value_counts()[0]
n_good_reviews = y.value_counts()[1]
n_bad_reviews,n_good_reviews

(18, 15)

In [85]:
prob_good_review = n_good_reviews/total_documents
prob_bad_review = n_bad_reviews/total_documents
prob_good_review,prob_bad_review

(0.45454545454545453, 0.5454545454545454)

##Create Vocabulary

In [None]:
vocabulary = set()

for sentence in X:
    for word in sentence.split():
        vocabulary.add(word.lower().strip('.,!?'))  # Added .lower() to ensure case-insensitivity

vocabulary

##Count Vectorization

In [88]:
##add alpha smoothing parameter
alpha = 1 #larger alpha values create more "smoothing"
good_reviews_word_freq = { word: alpha for word in vocabulary }
bad_reviews_word_freq = { word: alpha for word in vocabulary }

In [89]:
for review in df[y==1].values:
  for word in review[0].split():
    good_reviews_word_freq[word.lower().strip('.,!?')] += 1

#good_reviews_word_freq

In [90]:
for review in df[y==0].values:
  for word in review[0].split():
    bad_reviews_word_freq[word.lower().strip('.,!?')] += 1

#bad_reviews_word_freq

##Prediction Function

In [91]:
def Predict(new_review):
  good_probs = [prob_good_review]
  bad_probs = [prob_bad_review]

  for token in new_review:
    t = token.lower().strip('.,!?')
    if t in vocabulary:
      good_probs.append(good_reviews_word_freq[t]/n_good_reviews)
      bad_probs.append(bad_reviews_word_freq[t]/n_bad_reviews)

  P_GOOD_REVIEW = np.prod(good_probs)
  P_BAD_REVIEW = np.prod(bad_probs)
  print("prob this was a good review:\t", P_GOOD_REVIEW)
  print("prob this was a bad review:\t", P_BAD_REVIEW)

In [92]:
Predict("this movie was bad")

prob this was a good review:	 0.0007272727272727274
prob this was a bad review:	 0.006734006734006733


##TF-IDF

Term Frequency-Inverse Document Frequency. It's a statistical measure that indicates the importance of a word in a document relative to a collection of documents, called a corpus. The idea is to weigh terms based on how frequently they appear in a specific document, but also penalize those terms that appear frequently in the entire corpus, as they might not carry much meaningful information (e.g., terms like "and", "is", "the").



##Term Frequency (TF)

This represents the frequency of a term in a document. It can be calculated as

```
TF(t,d) = Number of times term t appears in document d   /   Total number of terms in document d

```



##Inverse Document Frequency (IDF)

This represents the importance of the term in the entire corpus. It can be calculated as

```
IDF(t,D) = log(Total number of documents in corpus D / Number of documents where term t appears)
```



##TF-IDF score

```
TF_IDF(t,d,D)=TF(t,d)*IDF(t,D)
```

##Using sklearn TfidfVectorizer

In [93]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [94]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

In [97]:
#print(X) ##sparse matrix type
print(X.toarray())

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.52468318 0.         ... 0.         0.         0.        ]
 [0.         0.         0.39648826 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
