 1. Load the dataset and preprocess the reviews.
 a. Convert all text to lowercase.
 b. Remove non-alphabetic characters (punctuation).
 c. Tokenize the reviews and remove common stopwords.
 d. Apply stemming to reduce words to their root form.
 2. Split the dataset into training and testing sets (80% training, 20% testing).
 3. Use a Naive Bayes classifier to classify the reviews into positive and negative categories.
 a. Implement a Bag-of-Words model using CountVectorizer.
 b. Train the Naive Bayes classifier using the training set.

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score, roc_curve, auc
import matplotlib.pyplot as plt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df = pd.read_csv('/content/drive/MyDrive/Copy of Copy of IMDB Dataset.csv')

In [4]:
length_before_sampling = len(df)
length_before_sampling

50000

In [5]:
df = df.sample(n = 1000)
length_after_sampling = len(df)
length_after_sampling

1000

In [6]:
df.head()

,review,sentiment
13290,"This is probably one of the worst ""wanna be a ...",negative
20613,This movie stinks. IMDb needs negative numbers...,negative
6886,"In Iran, the Islamic Revolution has shaped all...",positive
39334,This is one of those movies that should have b...,negative
39831,I was forced to watch this whole series of fil...,negative


In [7]:
df['sentiment'].value_counts()

sentiment,count
negative,518
positive,482


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 13290 to 11932
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     1000 non-null   object
 1   sentiment  1000 non-null   object
dtypes: object(2)
memory usage: 23.4+ KB


In [9]:
# Build the stemmer and the stopword set once instead of on every iteration.
ps = PorterStemmer()
stop_words = set(stopwords.words("english"))
corpus = []
for text in df["review"]:
  review = re.sub("[^a-zA-Z]", " ", text)  # keep letters only
  review = review.lower().split()  # lowercase, then tokenize on whitespace
  review = [ps.stem(word) for word in review if word not in stop_words]  # drop stopwords, stem the rest
  corpus.append(" ".join(review))

In [10]:
df_copy = df.copy()
df_copy["CleanedText"] = corpus
df_copy.head()

,review,sentiment,CleanedText
13290,"This is probably one of the worst ""wanna be a ...",negative,probabl one worst wanna low budget hit film cr...
20613,This movie stinks. IMDb needs negative numbers...,negative,movi stink imdb need neg number rate system pr...
6886,"In Iran, the Islamic Revolution has shaped all...",positive,iran islam revolut shape part life includ ever...
39334,This is one of those movies that should have b...,negative,one movi way better turn dread think blockbust...
39831,I was forced to watch this whole series of fil...,negative,forc watch whole seri film young child told re...


In [11]:
df_copy["sentiment"].nunique()

2

In [13]:
X = df_copy["CleanedText"]
y = df_copy["sentiment"]

In [14]:
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(X).toarray()

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
model = MultinomialNB()
model.fit(X_train, y_train)
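
Task 3a names a Bag-of-Words model built with CountVectorizer, while the cells above use TF-IDF weights instead. The following is a minimal sketch of the count-based variant, not the notebook's own method. The names `toy_texts`, `toy_labels`, and `bow_model` are hypothetical, and the toy reviews are invented stand-ins for already-cleaned text. Putting the vectorizer inside a pipeline is one possible design choice: the vocabulary is then learned only from the text passed to `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy reviews (not taken from the IMDB sample).
toy_texts = ["great movi love", "love great act", "bad movi hate", "hate bad plot"]
toy_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-Words: each review becomes a vector of raw token counts.
bow_model = make_pipeline(CountVectorizer(max_features=5000), MultinomialNB())
bow_model.fit(toy_texts, toy_labels)

# Vocabulary learned from the training text only.
bow_vocab = sorted(bow_model.named_steps["countvectorizer"].vocabulary_)
toy_prediction = bow_model.predict(["love great plot"])[0]
print(bow_vocab)
print(toy_prediction)
```

On the notebook's data, the same pipeline could presumably be fitted on the cleaned training reviews and their sentiment labels after the train/test split.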

 Part 2:
 1. Evaluate the performance of the model using the following metrics:
 a. Accuracy
 b. Precision, Recall, and F1-score
 c. Confusion Matrix
 d. ROC-AUC Score
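
Accuracy, precision, recall, and F1 can all be derived from the four confusion-matrix counts. As a worked check, the sketch below plugs in the counts this notebook's run reports (`[[93 19] [18 70]]`), assuming scikit-learn's layout of rows as actual classes and columns as predicted classes. The variable names are illustrative only.

```python
# Counts from the notebook's reported confusion matrix: [[tn, fp], [fn, tp]].
tn, fp, fn, tp = 93, 19, 18, 70

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all reviews classified correctly
precision = tp / (tp + fp)                          # of predicted positives, how many are positive
recall = tp / (tp + fn)                             # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

These values agree with the accuracy and the class-1 row of the classification report printed by the notebook.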

In [18]:
y_pred = model.predict(X_test)
y_pred =[1 if label == "positive" else 0 for label in y_pred]
y_test = [1 if label == "positive" else 0 for label in y_test]
print("Accuracy: " , accuracy_score(y_test, y_pred))
print("Classification Report: ", classification_report(y_test, y_pred))

Accuracy:  0.815
Classification Report:                precision    recall  f1-score   support

           0       0.84      0.83      0.83       112
           1       0.79      0.80      0.79        88

    accuracy                           0.81       200
   macro avg       0.81      0.81      0.81       200
weighted avg       0.82      0.81      0.82       200



In [19]:
roc_auc_value=roc_auc_score(y_test, y_pred)
print("ROC AUC SCORE: ", roc_auc_value)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", conf_matrix)

ROC AUC SCORE:  0.8129058441558441
Confusion Matrix: 
 [[93 19]
 [18 70]]
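
The ROC-AUC above is computed from hard 0/1 predictions, which reduces the ROC curve to a single operating point. ROC-AUC is usually computed from continuous scores such as class probabilities. The sketch below contrasts the two on hypothetical labels and scores (`y_true` and `y_score` are invented, not the notebook's data).

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and positive-class probabilities.
y_true = [0, 0, 1, 1, 1]
y_score = [0.2, 0.6, 0.4, 0.7, 0.9]

# Thresholding at 0.5 discards the ranking information in the scores.
y_hard = [1 if s >= 0.5 else 0 for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_hard)   # single threshold only

# Points of the ROC curve, e.g. for plotting with matplotlib.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc_from_scores, auc_from_labels)
```

In this notebook the scores would presumably come from `model.predict_proba(X_test)[:, 1]`, assuming "positive" is the second entry of `model.classes_`.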


Part 1: Data Loading and Preprocessing
 1. Load the Breast Cancer Prognostic Dataset.
 2. The dataset is available in Drive.
 3. Perform basic exploratory data analysis (EDA) to understand the dataset:
 • Summarize key statistics for each feature.
 • Check for missing values and handle them appropriately.
 4. Split the dataset into training (80%) and testing (20%) sets.
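
The EDA steps listed above can be sketched as follows. This is an illustration on a hypothetical toy frame (`toy`, with invented columns `f1` and `f2`), and median imputation is only one of several reasonable ways to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 4.0], "f2": [10.0, 20.0, 30.0, 40.0]})

# Key statistics per feature: count, mean, std, min, quartiles, max.
summary = toy.describe()

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# One simple strategy: fill each column's gaps with that column's median.
toy_filled = toy.fillna(toy.median(numeric_only=True))

print(summary)
print(missing_per_column)
```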

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

In [21]:
pip install ucimlrepo


Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [22]:

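from ucimlrepo import fetch_ucirepo
# Note: UCI id=94 is the Spambase dataset (see the word_freq_* columns in the output),
# not the Breast Cancer Prognostic dataset named in the task. The variable name is kept as in the original.
breast_cancer_prognostic = fetch_ucirepo(id=94)
X = breast_cancer_prognostic.data.features
y = breast_cancer_prognostic.data.targets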


In [23]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 57 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non-null   float64
 13  word_freq_report            4601 

In [24]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Class   4601 non-null   int64
dtypes: int64(1)
memory usage: 36.1 KB


In [25]:
X.isnull().sum()

,0
word_freq_make,0
word_freq_address,0
word_freq_all,0
word_freq_3d,0
word_freq_our,0
word_freq_over,0
word_freq_remove,0
word_freq_internet,0
word_freq_order,0
word_freq_mail,0


In [26]:
print(X.shape)
print(y.shape)

(4601, 57)
(4601, 1)


In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Part 2: Apply a Wrapper Method
 1. Use Recursive Feature Elimination (RFE) with a Logistic Regression model to perform feature selection:
 • Select the top 5 features that contribute the most to predicting the target variable.
 • Visualize the ranking of features.
 2. Train the Logistic Regression model using only the selected features.
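
The "visualize the ranking" step can be approached by pairing `ranking_` with the feature names and sorting. The sketch below uses synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `feature_names`, and `ranking_series` are hypothetical and do not come from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 10 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Keep the top 5 features; RFE drops one feature per round by default.
rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 marks a selected feature; larger ranks were eliminated earlier.
ranking_series = pd.Series(rfe_demo.ranking_, index=feature_names).sort_values()
top_features = ranking_series[ranking_series == 1].index.tolist()
print(ranking_series)
```

Calling `ranking_series.plot.barh()` would then draw the ranking as a horizontal bar chart with matplotlib.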

In [28]:
model = LogisticRegression(max_iter=200)

In [29]:
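n_features_to_select = 2  # note: the task above asks for the top 5; this run keeps 2
rfe = RFE(estimator=model, n_features_to_select=n_features_to_select)
rfe.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array rather than a one-column DataFrame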

In [30]:
RFE(estimator=LogisticRegression(max_iter=200),n_features_to_select=2)

In [31]:
selected_features = rfe.support_
ranking = rfe.ranking_
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

In [32]:
model.fit(X_train_rfe, y_train.values.ravel())  # 1-D target avoids a column-vector warning

Part 3: Model Evaluation
 1. Evaluate the model’s performance using the testing set:
 • Metrics to calculate: Accuracy, Precision, Recall, F1-Score, and ROC-AUC.
 2. Compare the performance of the model trained on all features versus the model trained on the selected features.

 Part 4: Experiment
 1. Experiment with different numbers of selected features (e.g., top 3, top 7).
 2. Discuss how feature selection affects model performance.

In [33]:
y_pred = model.predict(X_test_rfe)
accuracy = accuracy_score(y_test, y_pred)
print(f"Selected features mask:{selected_features}")
print(f"Features ranking: {ranking}")
print(f"Model accuracy with selected features: {accuracy}")

Selected features mask:[False False False False False False  True False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False  True False False False False]
Features ranking: [36 45 40 20 29 23  1 32 17 47 43 48 49 46  7 15 14 44 50  5 37 39  3 28
  6 18  2 34 13 30 54 51 21 42 19 25 53 26 16 38 12 11 22  4 31  9 24  8
 27 41 33 35  1 10 52 55 56]
Model accuracy with selected features: 0.8121606948968513
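
Parts 3 and 4 ask for more metrics than accuracy, a baseline trained on all features, and several feature counts. The sketch below shows one way to organize that comparison. It runs on synthetic data, and the helper `evaluate` and the names `X_tr`, `X_te`, `y_tr`, `y_te`, and `results` are hypothetical, so the printed numbers say nothing about the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 12 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def evaluate(X_train_part, X_test_part):
    """Fit a fresh logistic regression and return the five requested metrics."""
    clf = LogisticRegression(max_iter=1000).fit(X_train_part, y_tr)
    pred = clf.predict(X_test_part)
    proba = clf.predict_proba(X_test_part)[:, 1]  # positive-class probability
    return {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_te, proba),
    }

# Baseline on all features, then RFE with different numbers of kept features.
results = {"all": evaluate(X_tr, X_te)}
for k in (3, 5, 7):
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k)
    selector.fit(X_tr, y_tr)
    results[f"top_{k}"] = evaluate(selector.transform(X_tr), selector.transform(X_te))

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Fitting the selector on the training split only, as done here, keeps the test set out of the feature-selection step.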
