## Task 2
### IMDB Dataset Text - Sentiment Analysis


In [3]:
import pandas as pd
df = pd.read_csv(r'C:\Users\Shahdil\Desktop\IMDB Dataset.csv') 
print(df.head())


                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [4]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
df['tokens'] = df['review'].apply(lambda x: tokenizer.tokenize(x.lower()))


In [8]:
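import nltk
from nltk.corpus import stopwords
# Fetch the stopword list on first use (does nothing if it is already installed)
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
# Tokens were already lowercased during tokenization, so compare them directly
df['filtered'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
df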


Unnamed: 0,review,sentiment,tokens,filtered
0,One of the other reviewers has mentioned that ...,positive,"[one, of, the, other, reviewers, has, mentione...","[one, reviewers, mentioned, watching, 1, oz, e..."
1,A wonderful little production. <br /><br />The...,positive,"[a, wonderful, little, production., <, br, /, ...","[wonderful, little, production., <, br, /, >, ..."
2,I thought this was a wonderful way to spend ti...,positive,"[i, thought, this, was, a, wonderful, way, to,...","[thought, wonderful, way, spend, time, hot, su..."
3,Basically there's a family where a little boy ...,negative,"[basically, there, 's, a, family, where, a, li...","[basically, 's, family, little, boy, (, jake, ..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, 's, ``, love, in, the, time, ...","[petter, mattei, 's, ``, love, time, money, ''..."
...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,"[i, thought, this, movie, did, a, down, right,...","[thought, movie, right, good, job., n't, creat..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"[bad, plot, ,, bad, dialogue, ,, bad, acting, ...","[bad, plot, ,, bad, dialogue, ,, bad, acting, ..."
49997,I am a Catholic taught in parochial elementary...,negative,"[i, am, a, catholic, taught, in, parochial, el...","[catholic, taught, parochial, elementary, scho..."
49998,I'm going to have to disagree with the previou...,negative,"[i, 'm, going, to, have, to, disagree, with, t...","['m, going, disagree, previous, comment, side,..."


In [9]:
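from nltk.stem import WordNetLemmatizer
# The lemmatizer needs the WordNet corpus (does nothing if it is already installed)
nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()
# lemmatize() defaults to pos='n', so every token is lemmatized as a noun
df['lemmatized'] = df['filtered'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])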


In [10]:
df['processed_review'] = df['lemmatized'].apply(lambda x: ' '.join(x))


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

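# Keep only the 5,000 most frequent terms across all reviews
tfidf = TfidfVectorizer(max_features=5000)
# .toarray() builds a dense 50000 x 5000 matrix (about 2 GB as float64);
# LogisticRegression also accepts the sparse matrix returned by fit_transform
X = tfidf.fit_transform(df['processed_review']).toarray()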


In [12]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts class names alphabetically: 'negative' -> 0, 'positive' -> 1
y = LabelEncoder().fit_transform(df['sentiment'])

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [14]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)


In [15]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.90      0.87      0.89      4961
           1       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



### Insights



In this task, we built a sentiment analysis model on the IMDB movie reviews dataset. Each of the 50,000 reviews is labeled as either "positive" or "negative".

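**Step 1: Text Preprocessing**  
- We lowercased each review and tokenized it with NLTK's `TreebankWordTokenizer`, which splits the text into word and punctuation tokens. HTML line-break tags were not stripped first, so fragments such as `<`, `br`, `/`, `>` remain in the token lists.  
- We removed English stopwords (such as "the", "is", "and"), which carry little sentiment on their own.  
- We applied `WordNetLemmatizer` so that inflected variants of a word map to one form. Because `lemmatize` is called without a part-of-speech tag, every token is treated as a noun: "movies" becomes "movie", but verb forms such as "running" are left unchanged.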
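The `filtered` column above still shows `<`, `br`, `/`, `>` tokens left over from HTML line breaks in the raw reviews. One possible way to remove them before tokenizing is sketched below. `strip_html` and `TAG_RE` are hypothetical names that are not part of the notebook, the sample string is made up, and the simple regex assumes the reviews contain no literal `<...>` text that should be kept.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Hypothetical helper: drop HTML tags and decode entities such as &amp;."""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    # Collapse the whitespace runs left behind by the removed tags.
    return re.sub(r"\s+", " ", decoded).strip()


# Made-up sample, not a real review from the dataset.
sample = "Great cast.<br /><br />Weak plot &amp; pacing."
cleaned = strip_html(sample)
print(cleaned)
```

In the notebook this could be applied as `df['review'].apply(strip_html)` before the tokenization cell. Whether it changes the final scores is not measured here.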

**Step 2: Feature Engineering**  
- Joined the lemmatized tokens of each review back into a single string.  
- Used TF-IDF vectorization with `max_features=5000` to convert the text into numerical features. This keeps the 5,000 most frequent terms in the corpus and weights each one by how distinctive it is for a given review.  
- Encoded the sentiment labels as 0 (negative) and 1 (positive).

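**Step 3: Model Training**  
- Split the data 80/20 into training and test sets (`test_size=0.2`, `random_state=42`), which leaves 10,000 reviews for testing.  
- Trained a Logistic Regression model with scikit-learn's default settings on the training data.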
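In the notebook the TF-IDF vectorizer is fitted on all reviews before the split, so term statistics from the test reviews influence the features. A common alternative, sketched here on made-up toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on the training portion only. The texts, labels, `test_size=0.25`, and `max_iter=1000` are assumptions for illustration. This is not the author's method and its effect on the reported scores is not measured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data: 1 = positive, 0 = negative.
texts = [
    "wonderful film great acting", "loved every minute", "brilliant and moving story",
    "great fun wonderful cast", "terrible plot bad acting", "boring and bad",
    "awful waste of time", "bad dialogue terrible pacing",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the vectorizer never sees the test reviews.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF and the classifier on the training split only.
clf = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
print(list(pred))
```

Calling `clf.predict` on new text then applies the same fitted vocabulary automatically, which avoids having to keep the vectorizer and the model in sync by hand.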

**Step 4: Model Evaluation**  
- Used the model to predict sentiment on the test set.  
- Evaluated the predictions with precision, recall, and F1-score. The model reaches 0.89 accuracy on the 10,000 test reviews, with per-class precision, recall, and F1 between 0.87 and 0.90.
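The columns of the classification report can be reproduced by hand from a confusion matrix. The sketch below uses ten made-up labels and predictions (not the notebook's results) and checks the manual formulas for the positive class against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up ground truth and predictions: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The `support` column in the report is simply the number of true samples per class, and the `macro avg` row is the unweighted mean of the per-class values.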

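**Conclusion:**  
The model predicts whether a movie review is positive or negative with about 89% accuracy on held-out reviews, and it performs almost equally well on both classes. The preprocessing steps normalize the text before vectorization, although this notebook does not measure how much each step contributes to the score. Logistic Regression on TF-IDF features is a simple yet effective baseline for this kind of binary text classification.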
