# Naive Bayes Classification

Naive Bayes is a family of classifiers based on **Bayes' Theorem** and the assumption that the predictors (features) are conditionally independent given the class. It is widely used for text classification, spam detection, and sentiment analysis because it is simple and fast to train.

---

## 1. Conditional Probability

Conditional probability is the probability that an event occurs given that another event has already occurred. For \( P(B) > 0 \), it is calculated as:

\[
P(A|B) = \frac{P(A \cap B)}{P(B)}
\]

where:
- \( P(A|B) \): Probability of event A given B has occurred.
- \( P(A \cap B) \): Probability of both A and B occurring.
- \( P(B) \): Probability of event B occurring.
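
As a quick check of the formula, the following sketch counts outcomes for one roll of a fair six-sided die. The die, the events (A = "even", B = "greater than 3"), and the helper `prob` are illustrative assumptions, not part of the dataset used later.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
outcomes = range(1, 7)

def prob(event):
    """Probability of `event` (a predicate on outcomes) under a uniform die."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def is_even(o):
    return o % 2 == 0

def is_greater_than_3(o):
    return o > 3

# P(A and B): even AND greater than 3 -> {4, 6}
p_a_and_b = prob(lambda o: is_even(o) and is_greater_than_3(o))
# P(B): greater than 3 -> {4, 5, 6}
p_b = prob(is_greater_than_3)

p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 2/3
```

Knowing that the roll is greater than 3 raises the probability of "even" from 1/2 to 2/3.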

---

## 2. Bayes' Theorem

Bayes' Theorem allows us to calculate the probability of a hypothesis given evidence. It is expressed as:

\[
P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}
\]

where:
- \( P(H|E) \): Posterior probability of hypothesis \( H \) given evidence \( E \).
- \( P(E|H) \): Likelihood of evidence given the hypothesis.
- \( P(H) \): Prior probability of the hypothesis.
- \( P(E) \): Marginal probability of the evidence, which acts as a normalizing constant.
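
A small numeric sketch of the theorem, using made-up numbers for a spam filter: the hypothesis \( H \) is "the message is spam" and the evidence \( E \) is "the message contains the word *free*". All three input probabilities below are assumptions chosen only to make the arithmetic easy to follow.

```python
# Assumed inputs (illustrative only).
p_spam = 0.2                 # prior P(H)
p_free_given_spam = 0.6      # likelihood P(E|H)
p_free_given_ham = 0.05      # P(E|not H)

# P(E) by the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(H|E) by Bayes' theorem.
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_free, 2))             # 0.16
print(round(p_spam_given_free, 2))  # 0.75
```

Seeing the word moves the probability of spam from the prior of 0.2 to a posterior of 0.75.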

---

## 3. Naive Bayes Classification

Naive Bayes simplifies the computation by assuming that the features are conditionally independent given the class. For a class \( C \) and feature vector \( X = (x_1, x_2, \dots, x_n) \), the denominator \( P(X) \) is the same for every class, so the classifier only needs to compare:

\[
P(C|X) \propto P(C) \prod_{i=1}^n P(x_i|C)
\]

Steps:
1. **Calculate Prior**: Compute \( P(C) \) for each class.
2. **Calculate Likelihood**: Compute \( P(x_i|C) \) for each feature.
3. **Compute Posterior**: Multiply the prior by the likelihoods to obtain a score proportional to \( P(C|X) \) for each class, and choose the class with the highest score.
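
The three steps can be sketched from scratch on a toy word-count dataset. This is a minimal illustration, not the scikit-learn implementation used later: the training data is made up, add-one (Laplace) smoothing is an assumption added so that unseen words do not zero out the product, and logarithms are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

# Toy training set (made up for illustration): (tokens, label)
train = [
    (["great", "fun", "great"], "pos"),
    (["fun", "plot"], "pos"),
    (["boring", "plot"], "neg"),
    (["boring", "bad", "bad"], "neg"),
]

# Step 1: priors P(C) from class frequencies.
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Step 2: likelihoods P(x_i | C) from per-class word counts.
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def likelihood(word, c, alpha=1.0):
    # Add-one smoothing (an assumption of this sketch).
    total = sum(word_counts[c].values())
    return (word_counts[c][word] + alpha) / (total + alpha * len(vocab))

# Step 3: unnormalised log-posterior for each class, then take the arg max.
def predict(tokens):
    scores = {
        c: math.log(priors[c])
        + sum(math.log(likelihood(w, c)) for w in tokens if w in vocab)
        for c in priors
    }
    return max(scores, key=scores.get)

print(predict(["great", "plot"]))  # pos
print(predict(["bad", "boring"]))  # neg
```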

---

### Advantages of Naive Bayes
- Simple and fast
- Performs well on small datasets and text data
- Works well for both binary and multiclass classification tasks

---

**Note**: Naive Bayes assumes that the features are independent given the class. This assumption rarely holds exactly, but the classifier often performs surprisingly well in practice.

---


In [8]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)


Path to dataset files: /root/.cache/kagglehub/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/versions/1


In [9]:
import os

import pandas as pd

# Load the reviews CSV from the directory returned by kagglehub
data = pd.read_csv(os.path.join(path, "IMDB Dataset.csv"))

In [10]:
data

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


In [11]:
data['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## Text cleaning

- Sample 10,000 rows
- Remove HTML tags
- Remove special characters
- Convert everything to lower case
- Remove stop words
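
Before applying these steps to the whole dataset, the sketch below runs them on a single made-up sentence. The function name `clean_review`, the sample text, and the tiny `STOP_WORDS` set are hypothetical stand-ins; the notebook itself uses separate functions and NLTK's English stop-word list.

```python
import re

# Tiny stand-in stop-word list (hypothetical); the notebook uses NLTK's list.
STOP_WORDS = {"a", "the", "is", "this", "was", "it", "and"}

def clean_review(text):
    text = re.sub(r"<.*?>", "", text)                            # remove HTML tags
    text = "".join(ch if ch.isalnum() else " " for ch in text)   # special chars -> spaces
    text = text.lower()                                          # lower-case
    return [w for w in text.split() if w not in STOP_WORDS]      # drop stop words

sample = "This movie was GREAT!<br /><br />A must-see, and it is fun."
print(clean_review(sample))
# ['movie', 'great', 'must', 'see', 'fun']
```

Replacing a tag with an empty string joins the text on either side of it. Here the `!` before the tag keeps the words apart, but when nothing separates them, two words can merge into one token; substituting a space instead of an empty string would avoid that.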

In [12]:
data = data.sample(10000)

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 25962 to 39475
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     10000 non-null  object
 1   sentiment  10000 non-null  object
dtypes: object(2)
memory usage: 234.4+ KB


In [14]:
data['sentiment'].value_counts()

| sentiment | count |
|---|---|
| negative | 5072 |
| positive | 4928 |


In [15]:
# Encode labels as integers: positive -> 1, negative -> 0.
# Assigning the mapped column back avoids the chained in-place call,
# which pandas warns will stop working in pandas 3.0.
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})


In [16]:
data

| | review | sentiment |
|---|---|---|
| 25962 | This movie brought together some of the old Sp... | 1 |
| 7394 | First let me be honest. I did not watch all th... | 0 |
| 8394 | Saw the film at it's Lawrence, Kansas premiere... | 0 |
| 38408 | Talk Radio sees a man somewhat accidentally st... | 1 |
| 28086 | Wesley Snipes is James Dial, an assassin for h... | 0 |
| ... | ... | ... |
| 15647 | Just saw the movie, and the scary thing was, t... | 1 |
| 33557 | Opening the film with a Bach Toccata is an aur... | 1 |
| 23555 | Jack Frost 2. THE worst "horror film" I have e... | 0 |
| 38129 | Large corporations Vs. Conscientious Do good-e... | 1 |


In [21]:
data.columns

Index(['review', 'sentiment'], dtype='object')

In [24]:
import re

# Compile once: matches anything that looks like an HTML tag, e.g. <br />
HTML_TAG = re.compile(r'<.*?>')

def clean_html(text):
  return HTML_TAG.sub('', text)

In [26]:
data['review'] =data['review'].apply(clean_html)

In [32]:
data['review'].iloc[0]

"this movie brought together some of the old spinal crew for another mockumentary film, this time revolving around the world of the dog show, how their owners prepare and train for the show before moving on to the show itself.we meet several teams as they hope to win the top prize- the fleck's, cookie who seems to have slept with every man ever, and gerry who tries to cope with his wife's old escapades and the fact that he literally has two left feet. harlan, whose dog talks to him, and enjoys ventriloquism. the swan's who have taken far too much coffee and scream at each other. donalan and vanderhoof the gay couple, and cabot and cummings who have won the last two years. fred willard commentates on the show, and is very funny as always. funny scenes include the 'look at me!' scene, and any with levy. unfortunately some of the best scenes were deleted or filmed later- willard interviewing leslie cabot, and the alternative epilogue with gerry is one of the funniest things i have ever se

In [30]:
def convert_lower(text):
  return text.lower()

In [31]:
data['review'] =data['review'].apply(convert_lower)

In [33]:
def remove_special(text):
  # Keep letters and digits; replace every other character with a space
  return ''.join(ch if ch.isalnum() else ' ' for ch in text)

In [34]:
data['review'] =data['review'].apply(remove_special)

In [38]:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [39]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [40]:
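# Build the stop-word set once; the original rebuilt the list for every token
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
  # Return the tokens of `text` that are not English stop words
  return [word for word in text.split() if word not in STOP_WORDS]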

In [41]:
data['review'] =data['review'].apply(remove_stopwords)

In [45]:
from nltk.stem import PorterStemmer
ps= PorterStemmer()

In [46]:
def stem_words(tokens):
  # Stem each token with the Porter stemmer and return a new list
  # (no shared global buffer needed)
  return [ps.stem(token) for token in tokens]

In [47]:
data['review'] =data['review'].apply(stem_words)

In [49]:
def join_back(list_input):
  return " ".join(list_input)

In [50]:
data['review'] =data['review'].apply(join_back)

In [52]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x = cv.fit_transform(data['review']).toarray()

In [54]:
x.shape

(10000, 36352)

In [55]:
x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [58]:
y= data.iloc[:, -1].values

In [59]:
y.shape

(10000,)

In [60]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)

In [61]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
clf1 = GaussianNB()
clf2 = MultinomialNB()
clf3 = BernoulliNB()

In [62]:
clf1.fit(x_train, y_train)
clf2.fit(x_train, y_train)
clf3.fit(x_train, y_train)

In [63]:
y_pred1 = clf1.predict(x_test)
y_pred2 = clf2.predict(x_test)
y_pred3 = clf3.predict(x_test)

In [65]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred1))
print(accuracy_score(y_test, y_pred2))
print(accuracy_score(y_test, y_pred3))

0.627
0.8375
0.822
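
The three numbers above are, in order, the accuracies of `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. Accuracy is simply the fraction of test labels predicted correctly, which the following sketch computes by hand on made-up labels (the helper `accuracy` and the `_demo` lists are illustrative only).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels, only to illustrate the calculation.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true_demo, y_pred_demo))  # 0.75
```

With a 20% split of 10,000 reviews the test set holds 2,000 reviews, so an accuracy of 0.8375 corresponds to 1,675 correct predictions, 0.822 to 1,644, and 0.627 to 1,254.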


In [67]:
clf2.predict([x_test[0]])

array([1])

In [68]:
clf2.predict(x_test)

array([1, 0, 0, ..., 1, 1, 1])
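
To classify a review that was not in the dataset, the text has to go through the same cleaning steps and the same fitted vectorizer (`transform`, not `fit_transform`) before it reaches the classifier. One way to keep vectorizer and model together is a scikit-learn pipeline. The sketch below uses made-up, already-cleaned training text, so its names (`train_texts`, `model`, `new_reviews`) are hypothetical and its predictions say nothing about the IMDB results above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up cleaned reviews and labels (1 = positive, 0 = negative).
train_texts = ["great fun love", "love great act", "bore bad plot", "bad wast time"]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New text is vectorized with the vocabulary learned during fit.
new_reviews = ["great act", "bad plot bore"]
print(model.predict(new_reviews))  # [1 0]
```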