In [42]:
# Dataset from kaggle: Dataset - https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [4]:
import numpy as np
import pandas as pd

In [5]:
temp_df = pd.read_csv('IMDB Dataset.csv')

In [6]:
df = temp_df [: 1000]
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [7]:
df.shape

(1000, 2)

In [8]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [9]:
df['sentiment'].value_counts()

sentiment
positive    501
negative    499
Name: count, dtype: int64

In [10]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [11]:
df.duplicated().value_counts()

False    1000
Name: count, dtype: int64

# Basic Preprocessing

### 1.Remove tages

In [12]:
import re
def remove_tage(raw_text):
    cleaned_text = re.sub(re.compile('<*_.?'), '' , raw_text )
    return cleaned_text

In [13]:
df['review'] = df['review'] .apply(remove_tage)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review'] = df['review'] .apply(remove_tage)


In [14]:
df['review']

0      One of the other reviewers has mentioned that ...
1      A wonderful little production. <br /><br />The...
2      I thought this was a wonderful way to spend ti...
3      Basically there's a family where a little boy ...
4      Petter Mattei's "Love in the Time of Money" is...
                             ...                        
995    Nothing is sacred. Just ask Ernie Fosselius. T...
996    I hated it. I hate self-aware pretentious inan...
997    I usually try to be professional and construct...
998    If you like me is going to see this in a film ...
999    This is like a zoology textbook, given that it...
Name: review, Length: 1000, dtype: object

### 2.Apply Lower Case

In [15]:
df['review'] = df['review'].apply(lambda x : x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review'] = df['review'].apply(lambda x : x.lower())


In [16]:
df['review']

0      one of the other reviewers has mentioned that ...
1      a wonderful little production. <br /><br />the...
2      i thought this was a wonderful way to spend ti...
3      basically there's a family where a little boy ...
4      petter mattei's "love in the time of money" is...
                             ...                        
995    nothing is sacred. just ask ernie fosselius. t...
996    i hated it. i hate self-aware pretentious inan...
997    i usually try to be professional and construct...
998    if you like me is going to see this in a film ...
999    this is like a zoology textbook, given that it...
Name: review, Length: 1000, dtype: object

### 3.Remove stop words

In [17]:
from nltk.corpus import stopwords
sw_list = stopwords.words('english')

In [18]:
sw_list

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [19]:
df['review'] = df['review'].apply(lambda x :[item for item in x.split() if item not in sw_list]).apply(lambda x : " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review'] = df['review'].apply(lambda x :[item for item in x.split() if item not in sw_list]).apply(lambda x : " ".join(x))


In [20]:
df['review']

0      one reviewers mentioned watching 1 oz episode ...
1      wonderful little production. <br /><br />the f...
2      thought wonderful way spend time hot summer we...
3      basically there's family little boy (jake) thi...
4      petter mattei's "love time money" visually stu...
                             ...                        
995    nothing sacred. ask ernie fosselius. days, eve...
996    hated it. hate self-aware pretentious inanity ...
997    usually try professional constructive criticiz...
998    like going see film history class something li...
999    like zoology textbook, given depiction animals...
Name: review, Length: 1000, dtype: object

In [21]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [22]:
X

Unnamed: 0,review
0,one reviewers mentioned watching 1 oz episode ...
1,wonderful little production. <br /><br />the f...
2,thought wonderful way spend time hot summer we...
3,basically there's family little boy (jake) thi...
4,"petter mattei's ""love time money"" visually stu..."
...,...
995,"nothing sacred. ask ernie fosselius. days, eve..."
996,hated it. hate self-aware pretentious inanity ...
997,usually try professional constructive criticiz...
998,like going see film history class something li...


In [23]:
y

0      positive
1      positive
2      positive
3      negative
4      positive
         ...   
995    positive
996    negative
997    negative
998    negative
999    negative
Name: sentiment, Length: 1000, dtype: object

### 4.Encoding

In [24]:
pip install sklearn

Collecting sklearn
  Using cached sklearn-0.0.post12.tar.gz (2.6 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
Note: you may need to restart the kernel to use updated packages.


  error: subprocess-exited-with-error
  
  Getting requirements to build wheel did not run successfully.
  exit code: 1
  
  [15 lines of output]
  The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
  rather than 'sklearn' for pip commands.
  
  Here is how to fix this error in the main use cases:
  - use 'pip install scikit-learn' rather than 'pip install sklearn'
  - replace 'sklearn' by 'scikit-learn' in your pip requirements files
    (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
  - if the 'sklearn' package is used by one of your dependencies,
    it would be great if you take some time to track which package uses
    'sklearn' instead of 'scikit-learn' and report it to their issue tracker
  - as a last resort, set the environment variable
    SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
  
  More information is available at
  https://github.com/scikit-learn/sklearn-pypi-package
  [end of output]
  
  note: This error originates f

In [25]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

In [26]:
y

array([1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,

## Traning test split

In [27]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2 ,random_state = 42)

In [28]:
X_train.shape

(800, 1)

In [29]:
X_test.shape

(200, 1)

## Applying Bag of Words

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [31]:
cv = CountVectorizer()


In [32]:
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()


In [33]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(800, 16048))

In [34]:
X_test_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(200, 16048))

### Aplly Guassian Naive Baise

In [35]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train_bow,y_train)


0,1,2
,priors,
,var_smoothing,1e-09


In [37]:
y_pred = gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.565

In [39]:
confusion_matrix(y_test,y_pred)

array([[55, 49],
       [38, 58]])

In [41]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)

y_pred = rf.predict(X_test_bow)

accuracy_score(y_test,y_pred)

0.765

In [42]:
confusion_matrix(y_test,y_pred)

array([[81, 23],
       [24, 72]])

In [48]:
cv = CountVectorizer(max_features=3000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.77

In [50]:
cv = CountVectorizer(ngram_range = (1,2) , max_features = 5000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.785

## Using TfIdf 

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [55]:
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.fit_transform(X_test['review']).toarray()


In [58]:
X_train_tfidf

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.07039649, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]], shape=(800, 16048))

In [60]:
rf = RandomForestClassifier()

rf = RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test,y_pred)

ValueError: X has 6951 features, but RandomForestClassifier is expecting 16048 features as input.

# 0) আপনার ডেটা এখানে কী ধরে নিচ্ছে

* `X_train['review']` = training review টেক্সট (স্ট্রিং)
* `X_test['review']` = test review টেক্সট
* `y_train`, `y_test` = sentiment label (0/1)

আপনার লক্ষ্য:
**review text → numeric features → model train → predict → accuracy/confusion matrix**

---

# 1) BoW (Bag of Words) কী?

Bag of Words মানে:

* আপনি পুরো training ডেটার সব ইউনিক শব্দ নিয়ে একটা বড় vocabulary বানাবেন
* তারপর প্রতিটা review কে একটা বড় vector বানাবেন
* vector-এর প্রতিটা position একটা শব্দকে represent করবে

  * ওই শব্দটা review-তে কয়বার এসেছে = সেই position এর সংখ্যা

উদাহরণ:
Vocabulary = `["good", "bad", "phone"]`

Review: `"good phone good"`
Vector: `[2, 0, 1]`

এটাই BoW।

---

# 2) CountVectorizer দিয়ে BoW তৈরি

```python
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()
```

এখানে ২টা গুরুত্বপূর্ণ ব্যাপার:

## (ক) `fit_transform` (শুধু train এ)

* `fit` = train set থেকে vocabulary শিখবে (কোন কোন শব্দ আছে)
* `transform` = train data কে vector বানাবে

তাই train এ `fit_transform` করা হয়।

## (খ) `transform` (শুধু test এ)

Test এ কখনো `fit` করা যাবে না।
কারণ তাহলে test data থেকে vocabulary শিখে ফেলবেন → এটা **data leakage** (ভুল পদ্ধতি)।

---

## আপনার আউটপুট:

```python
X_train_bow.shape
(7986, 48282)
```

মানে:

* train samples = **7986**
* vocabulary size = **48282** (এতগুলো ইউনিক শব্দ পাওয়া গেছে)

অর্থাৎ প্রতিটা review এখন 48282-length vector।

---

# 3) Gaussian Naive Bayes দিয়ে ক্লাসিফিকেশন

```python
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train_bow,y_train)
y_pred = gnb.predict(X_test_bow)
```

Naive Bayes কীভাবে কাজ করে (intuition):

* ধরে নেয় feature গুলো (শব্দগুলো) একে অপরের থেকে independent
* probability দিয়ে সিদ্ধান্ত নেয়

কিন্তু একটা গুরুত্বপূর্ণ বিষয়:

## GaussianNB BoW-এর জন্য “ideal” না

BoW হলো integer count feature (0,1,2,…)
GaussianNB ধরে নেয় feature continuous এবং Gaussian distribution follow করে।

BoW-এর জন্য সাধারণত ভালো:

* **MultinomialNB**
* **BernoulliNB**

তাই GaussianNB দিয়ে accuracy কম আসাটা স্বাভাবিক।

### আপনার accuracy:

```python
0.6324
```

---

# 4) Confusion Matrix বুঝে ফেলুন (অতি গুরুত্বপূর্ণ)

```python
confusion_matrix(y_test,y_pred)
array([[717, 235],
       [499, 546]])
```

সাধারণভাবে sklearn confusion matrix:

[
\begin{bmatrix}
TN & FP \
FN & TP
\end{bmatrix}
]

অর্থাৎ:

* **TN = 717**: আসলে 0 ছিল, predict 0
* **FP = 235**: আসলে 0 ছিল, predict 1 (ভুল positive)
* **FN = 499**: আসলে 1 ছিল, predict 0 (miss করেছে)
* **TP = 546**: আসলে 1 ছিল, predict 1

আপনার FN (499) তুলনামূলক বেশি → model অনেক positive কে negative বলছে (বা class-1 miss করছে)।

---

# 5) RandomForest on BoW (full vocab)

```python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)
0.8528
```

এখানে accuracy অনেক বেশি হয়েছে। কারণ:

* RandomForest অনেক decision tree বানিয়ে pattern ধরতে পারে
* BoW high-dimensional হলেও কাজ করতে পারে

**তবে** একটা বড় টেকনিক্যাল ইস্যু আছে:

## BoW matrix খুব sparse, কিন্তু আপনি `.toarray()` করেছেন

`fit_transform()` সাধারণত sparse matrix দেয় (মেমোরি বাঁচে)।
`.toarray()` করলে সেটা dense হয়ে যায় → RAM প্রচুর লাগে।

Dataset বড় হলে crash করতে পারে।
RandomForest dense data চায় না—এটা sparse দিয়েও অনেক সময় ধীর/ভারী হয়।

---

# 6) `max_features=3000` দিলে কী হয়?

```python
cv = CountVectorizer(max_features=3000)
...
accuracy = 0.8373
```

মানে:

* vocabulary এখন **শুধু top 3000 frequent শব্দ**
* dimension কম → training দ্রুত, memory কম
* কিন্তু কম informative শব্দ বাদ পড়লে accuracy একটু কমতে পারে

আপনার ক্ষেত্রে full vocab (48282) এ 0.8528 ছিল, 3000 এ 0.8373 হয়েছে—স্বাভাবিক।

---

# 7) N-gram (1,2) মানে কী?

```python
cv = CountVectorizer(ngram_range=(1,2), max_features=5000)
...
accuracy = 0.8408
```

`ngram_range=(1,2)` মানে:

* unigram: `"good"`
* bigram: `"not good"`, `"very bad"`—এগুলাও feature হবে

Sentiment এ bigram খুব শক্তিশালী, কারণ:

* `"not good"` আলাদা meaning দেয় (good এর উল্টো)

তাই অনেক সময় accuracy বাড়ে। আপনার ক্ষেত্রে সামান্য বাড়েছে (3000 unigram এর চেয়ে)।

---

# 8) TF-IDF কী এবং কেন ব্যবহার হয়?

BoW এ সমস্যা:

* common শব্দ (যেমন “the”, “is”, “product”) অনেকবার আসে → feature dominate করে
* কিন্তু এগুলো sentiment বুঝতে কম কাজে লাগে

TF-IDF ধারণা:

* যে শব্দটা একটা document-এ important কিন্তু সব document-এ common না, তাকে বেশি weight দাও

```python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review'])
```

TF-IDF এ feature value:

* raw count না
* weight (floating point)

আপনার accuracy:

```python
0.8483
```

BoW full-vocab RF (0.8528) থেকে সামান্য কম।

এটা dataset-এর উপর নির্ভর করে—সবসময় TF-IDF বেশি হবে এমন না।

---

# 9) আপনার কোডে ৩টা জিনিস খেয়াল করলে আরও “সঠিক” হবে

## (১) GaussianNB বাদ দিয়ে MultinomialNB ব্যবহার করুন (BoW/TF-IDF এর জন্য)

GaussianNB এর জায়গায়:

```python
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_bow, y_train)
```

এতে সাধারণত accuracy ভালো হয়।

## (২) `.toarray()` না করে sparse রাখুন (বিশেষ করে বড় ডেটায়)

BoW/TF-IDF সাধারণত sparse রাখা উচিত:

```python
X_train_bow = cv.fit_transform(X_train['review'])
X_test_bow = cv.transform(X_test['review'])
```

## (৩) TF-IDF test এ `.toarray()` করেননি—এটা ঠিক আছে, কিন্তু consistency রাখলে ভালো

আপনি train এ `.toarray()` করলেন, test এ করলেন না—দুটোই একই টাইপ হলে ভালো (দুটোই sparse রাখাই best)।

---

# 10) আপনার ফলাফলগুলোকে এক লাইনে ব্যাখ্যা

* **GaussianNB + BoW**: low accuracy (কারণ wrong NB variant)
* **RF + BoW full vocab**: best (0.8528)
* **RF + BoW limited features**: একটু কম (info কম)
* **RF + bigram**: সামান্য benefit
* **RF + TF-IDF**: competitive (0.8483)
