#**Movie Review Classification**

#**1. Importing Libraries**

###1. nltk: A natural language processing library, used here for tokenizing and lemmatizing text.
###2. pandas: A data analysis and manipulation library.
###3. TfidfVectorizer: A scikit-learn class that converts text into numerical features using TF-IDF weighting.
###4. MultinomialNB: A scikit-learn Naive Bayes classifier suited to count-like features such as word frequencies.
###5. FunctionTransformer: A scikit-learn transformer that wraps an arbitrary function so it can be used as a pipeline step.
###6. Pipeline: A scikit-learn class that chains transformers and a final estimator into a single object.
###7. requests: A library for making HTTP requests (listed here, but not imported or used in this notebook).
###8. Beautiful Soup: A library for parsing HTML and XML (listed here, but not imported or used in this notebook).
###9. WordNetLemmatizer: An NLTK class that reduces words to their WordNet lemma; it treats every word as a noun unless a part-of-speech tag is passed.
###10. stopwords: An NLTK corpus of common stop words, used to filter them out of the text.

In [None]:
!pip install nltk



In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [None]:
nltk.download('punkt')
print(nltk.data.path)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


In [None]:
nltk.download('wordnet')
print(nltk.data.path)

[nltk_data] Downloading package wordnet to /root/nltk_data...


['/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


In [None]:
nltk.download('stopwords')
print(nltk.data.path)

['/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#**2. Data Loading and Preprocessing**

###1. Loading Data: This section loads the sample submission (CSV) and the labeled, unlabeled, and test sets (TSV).

In [None]:
# Load the data
df1 = pd.read_csv('/content/sampleSubmission.csv')
df2 = pd.read_csv('/content/labeledTrainData.tsv', sep='\t')
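# Parsed by hand, splitting each line on tabs: the header line becomes row 0,
# the columns are named 0 and 1, and the surrounding quotes are kept
with open('/content/unlabeledTrainData.tsv', 'r') as f:
    df3 = pd.DataFrame([line.strip().split('\t') for line in f])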
df4 = pd.read_csv('/content/testData.tsv', sep='\t')
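
As a possible alternative to splitting lines by hand, `pd.read_csv` can read a tab-separated file together with its header row. The sketch below is only an illustration: it uses a small made-up in-memory sample rather than the real file, and it assumes that quote characters should be read literally (`quoting=csv.QUOTE_NONE`) and then stripped afterwards.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample in the same shape as the unlabeled file
sample_tsv = (
    'id\treview\n'
    '"1_0"\t"A "quoted" word inside a review."\n'
    '"2_0"\t"Another review."\n'
)

# QUOTE_NONE tells the parser not to interpret quote characters,
# so stray quotes inside a review cannot break the row structure
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', quoting=csv.QUOTE_NONE)

# Remove the surrounding quotes that QUOTE_NONE leaves in place
for col in ['id', 'review']:
    sample_df[col] = sample_df[col].str.strip('"')

print(sample_df.columns.tolist())
print(sample_df.shape)
```

Read this way, the header supplies the column names `id` and `review` instead of ending up as a data row.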

###2. Data Shape: Displays the shape of the loaded data.

In [None]:
df1.shape

(25000, 2)

In [None]:
df2.shape

(4609, 3)

In [None]:
df3.shape

(3919, 2)

In [None]:
df4.shape

(4036, 2)

###3. Data Information: Displays information about the data, including data types and missing values.

In [None]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [None]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4609 entries, 0 to 4608
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         4609 non-null   object
 1   sentiment  4609 non-null   int64 
 2   review     4609 non-null   object
dtypes: int64(1), object(2)
memory usage: 108.1+ KB


In [None]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3919 entries, 0 to 3918
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3919 non-null   object
 1   1       3919 non-null   object
dtypes: object(2)
memory usage: 61.4+ KB


In [None]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4036 entries, 0 to 4035
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      4036 non-null   object
 1   review  4036 non-null   object
dtypes: object(2)
memory usage: 63.2+ KB


###4. Data Head and Tail: Displays the first and last few rows of the data.

In [None]:
df1.head()

Unnamed: 0,id,sentiment
0,12311_10,0
1,8348_2,0
2,5828_4,0
3,7186_2,0
4,12128_7,0


In [None]:
df1.tail()

Unnamed: 0,id,sentiment
24995,2155_10,0
24996,59_10,0
24997,2531_1,0
24998,7772_8,0
24999,11465_10,0


In [None]:
df2.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [None]:
df2.tail()

Unnamed: 0,id,sentiment,review
4604,11709_3,0,You've heard it said to live every moment as i...
4605,4199_7,1,The only thing about this film that bums me ou...
4606,5006_1,0,Nothing like a movie about a group of friends ...
4607,6095_1,0,"i've seen a movie thats sort of like this, wer..."
4608,11511_9,1,Weak scripts at times? Yep! Cheesy special eff...


In [None]:
df3.head()

Unnamed: 0,0,1
0,id,review
1,"""9999_0""","""Watching Time Chasers, it obvious that it was..."
2,"""45057_0""","""I saw this film about 20 years ago and rememb..."
3,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B..."
4,"""7161_0""","""I went to see this film with a great deal of ..."


In [None]:
df3.tail()

Unnamed: 0,0,1
3914,"""29962_0""","""Fate and History are not presented as flesh a..."
3915,"""24716_0""","""Well I think it's actually a pretty good show..."
3916,"""8303_0""","""The Attic Expeditions is painful on viewers w..."
3917,"""16534_0""","""Even for pre-Hayes code cinema, this one is p..."
3918,"""13652_0""","""As always, this film occurs in Paris, there h..."


In [None]:
df4.head()

Unnamed: 0,id,review
0,12311_10,Naturally in a film who's main themes are of m...
1,8348_2,This movie is a disaster within a disaster fil...
2,5828_4,"All in all, this is a movie for kids. We saw i..."
3,7186_2,Afraid of the Dark left me with the impression...
4,12128_7,A very accurate depiction of small time mob li...


In [None]:
df4.tail()

Unnamed: 0,id,review
4031,4278_9,I have to admit I have a particular penchant f...
4032,3538_8,I saw MESSIAH 2 a few months ago and didn`t ge...
4033,6499_1,"Set in a post apocalyptic future, the new high..."
4034,7715_1,I used to work in a video store. I saw this ti...
4035,5847_1,"Demi and Woody are married, but they're poor. ..."


###5. Data Null Values: Checks for null values in the data.
###6. Data Null Values Sum: Calculates the total number of null values in the data.

In [None]:
df1.isnull()

Unnamed: 0,id,sentiment
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
24995,False,False
24996,False,False
24997,False,False
24998,False,False


In [None]:
df1.isnull().sum()

id           0
sentiment    0
dtype: int64

In [None]:
df1.isnull().sum().sum()

0

In [None]:
df2.isnull()

Unnamed: 0,id,sentiment,review
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
...,...,...,...
4604,False,False,False
4605,False,False,False
4606,False,False,False
4607,False,False,False


In [None]:
df2.isnull().sum()

id           0
sentiment    0
review       0
dtype: int64

In [None]:
df2.isnull().sum().sum()

0

In [None]:
df3.isnull()

Unnamed: 0,0,1
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
3914,False,False
3915,False,False
3916,False,False
3917,False,False


In [None]:
df3.isnull().sum()

0    0
1    0
dtype: int64

In [None]:
df3.isnull().sum().sum()

0

In [None]:
df4.isnull()

Unnamed: 0,id,review
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
4031,False,False
4032,False,False
4033,False,False
4034,False,False


In [None]:
df4.isnull().sum()

id        0
review    0
dtype: int64

In [None]:
df4.isnull().sum().sum()

0

###7. Data Description: Displays summary statistics for the data.

In [None]:
df1.describe()

Unnamed: 0,sentiment
count,25000.0
mean,0.0
std,0.0
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,0.0


In [None]:
df1.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sentiment,25000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df2.describe()

Unnamed: 0,sentiment
count,4609.0
mean,0.503146
std,0.500044
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,1.0


In [None]:
df2.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sentiment,4609.0,0.503146,0.500044,0.0,0.0,1.0,1.0,1.0


In [None]:
df3.describe()

Unnamed: 0,0,1
count,3919,3919
unique,3919,3917
top,id,"""This movie is so bad that it would stand with..."
freq,1,2


In [None]:
df3.describe().transpose()

Unnamed: 0,count,unique,top,freq
0,3919,3919,id,1
1,3919,3917,"""This movie is so bad that it would stand with...",2


In [None]:
df4.describe()

Unnamed: 0,id,review
count,4036,4036
unique,4036,4028
top,12311_10,Horrible movie. This movie beat out revenge of...
freq,1,2


In [None]:
df4.describe().transpose()

Unnamed: 0,count,unique,top,freq
id,4036,4036,12311_10,1
review,4036,4028,Horrible movie. This movie beat out revenge of...,2


#**3. Preprocessing**

In [None]:
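# Combine the labeled and unlabeled data.
# Note: df3's columns are named 0 and 1, so after the concat its rows
# have NaN in 'id', 'review' and 'sentiment'.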
df = pd.concat([df2, df3])
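
`pd.concat` aligns frames by column name, so frames with different column names do not stack into the same columns. The following sketch, built on made-up miniature frames (`labeled` and `unlabeled` are hypothetical names), shows the effect and one way the columns could be lined up first.

```python
import pandas as pd

# Hypothetical miniature versions of the labeled and unlabeled frames
labeled = pd.DataFrame({'id': ['a_1'], 'sentiment': [1], 'review': ['good film']})
unlabeled = pd.DataFrame([['b_0', 'some review'], ['c_0', 'another review']])  # columns 0 and 1

# Mismatched column names: concat keeps all five columns and fills gaps with NaN
misaligned = pd.concat([labeled, unlabeled], ignore_index=True)
print(misaligned['review'].isna().sum())

# Renaming first lets the 'id' and 'review' columns line up
aligned = pd.concat(
    [labeled, unlabeled.rename(columns={0: 'id', 1: 'review'})],
    ignore_index=True,
)
print(aligned['review'].isna().sum())
```

Even after renaming, the unlabeled rows still have no `sentiment` value, so they cannot be used directly as supervised training examples.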

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(df4.isnull().sum())

id        0
review    0
dtype: int64


In [None]:
df['review'] = df['review'].fillna('')

###1. Preprocessing Function: Defines a function to preprocess the text data by tokenizing, removing stop words, and lemmatizing.

In [None]:
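# Build the stop-word set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define the preprocessing function
def preprocess_text(text):
    # Lowercase and tokenize the text (a custom preprocessor replaces
    # TfidfVectorizer's built-in lowercasing, so it is done here)
    tokens = word_tokenize(text.lower())

    # Remove stop words (the NLTK list is lowercase)
    tokens = [t for t in tokens if t not in stop_words]

    # Lemmatize the words (treated as nouns by default)
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # Join the tokens back into a string
    return ' '.join(tokens)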

###2. Text Vectorization: Converts the text data into a numerical format using TF-IDF vectorization.

In [None]:
# Create the TF-IDF vectorizer; it is fitted later as part of the pipeline
vectorizer = TfidfVectorizer(preprocessor=preprocess_text)
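
To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the calculation on a made-up three-document corpus. It assumes the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's default settings; it skips tokenization details and is not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace
docs = ["good movie", "bad movie", "good good acting"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document, "bad" appears in only one document and therefore gets a higher weight than "movie", which appears in two.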

#**4. Model Training and Evaluation**

###1. Splitting Data: Splits the data into training and validation sets.

In [None]:
# Split the data into training and validation sets
# (the unlabeled rows, which have NaN sentiment, are included in the split)
X = df['review']
y = df['sentiment']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
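
One way to keep missing labels out of both splits is to split only the rows that have a sentiment value. The sketch below uses a made-up frame (`combined` is a hypothetical name) and additionally assumes a stratified split is wanted so both classes appear in the validation set; it illustrates the idea and is not the notebook's own procedure.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical combined frame: ten labeled rows followed by two unlabeled ones
combined = pd.DataFrame({
    'review': [f'review {i}' for i in range(10)] + ['unlabeled a', 'unlabeled b'],
    'sentiment': [0, 1] * 5 + [None, None],
})

# Keep only the rows that actually carry a label
labeled_only = combined.dropna(subset=['sentiment'])

X_tr, X_va, y_tr, y_va = train_test_split(
    labeled_only['review'],
    labeled_only['sentiment'].astype(int),
    test_size=0.2,
    random_state=42,
    stratify=labeled_only['sentiment'],  # preserve the class balance
)

print(len(X_tr), len(X_va), int(y_va.isna().sum()))
```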

###2. Training Model: Trains a Naive Bayes classifier on the training data.

In [None]:
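# Fill the missing labels with the mean of the known labels.
# Caution: that mean lies between 0 and 1, so the int cast further down
# truncates every filled label to 0.
y_train = y_train.fillna(y_train.mean())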

In [None]:
print(y_train.dtype)

float64


In [None]:
y_train = y_train.astype(int)
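
The combined effect of filling with the mean and then casting to `int` is easy to check on a tiny made-up label series (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical labels: three known values and two missing ones
labels = pd.Series([1, 0, 1, None, None], dtype='float64')

filled = labels.fillna(labels.mean())  # missing entries become 0.666...
as_int = filled.astype(int)            # the cast truncates toward zero

print(filled.tolist())
print(as_int.tolist())  # [1, 0, 1, 0, 0]

# Alternative under the assumption that unlabeled rows should be excluded
kept = labels.dropna().astype(int)
print(kept.tolist())    # [1, 0, 1]
```

Every missing label ends up as class 0, so the classifier is trained as if all of those rows were negative reviews.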

In [None]:
# Create the pipeline
pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', MultinomialNB())
])

In [None]:
# Train the model
pipeline.fit(X_train, y_train)

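###3. Model Evaluation: Checks the validation labels for missing values before evaluation. The 776 missing labels come from the unlabeled rows, so accuracy, the classification report, and the confusion matrix can only be computed on the labeled part of the validation set.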

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(y_val.isnull().sum())

776
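
A minimal sketch of how the evaluation step could look once rows without a label are excluded. The data, the `toy_pipeline` name, and the plain `TfidfVectorizer()` without a custom preprocessor are assumptions made for illustration; the real notebook would use its own `pipeline`, `X_val`, and `y_val`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training and validation splits
train_texts = ['great film', 'loved it', 'wonderful acting',
               'terrible film', 'hated it', 'awful acting']
train_labels = [1, 1, 1, 0, 0, 0]
val_texts = pd.Series(['great acting', 'awful film', 'an unlabeled review'])
val_labels = pd.Series([1, 0, None])

toy_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])
toy_pipeline.fit(train_texts, train_labels)

# Score only the validation rows that actually have a label
has_label = val_labels.notna()
y_true = val_labels[has_label].astype(int)
y_hat = toy_pipeline.predict(val_texts[has_label])

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(acc)
print(cm)
print(classification_report(y_true, y_hat, labels=[0, 1], zero_division=0))
```

In the confusion matrix, rows are true classes and columns are predicted classes, in the order given by `labels`.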


#**5. Predicting**

###1. Making Predictions: Makes predictions on the test data.

In [None]:
# Make predictions on the test data
y_pred = pipeline.predict(df4['review'])

###2. Creating Submission File: Creates a submission file with predicted sentiment values.

In [None]:
# Create the submission file
submission = pd.DataFrame({'id': df4['id'], 'sentiment': y_pred})
submission.to_csv('submission.csv', index=False)

#**6. Summary**

###This code classifies movie reviews as positive or negative using natural language processing (NLP) techniques. It loads movie review data, preprocesses the text data, and converts it into a numerical format using TF-IDF vectorization. Then, it trains a Naive Bayes classifier on the preprocessed data. Finally, it makes predictions on the test data and generates a submission file with predicted sentiment values.