**IMDB MOVIE REVIEW**

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Here we load the data and inspect the first five rows.

In [3]:
dataSet = pd.read_excel('IMDB_dataset6k.xlsx')
dataSet.head()

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative


The full dataset has 25,000 rows, which is too slow to process on a local machine, so we work with a subset of about 6,000 records.

In [4]:
dataSet.shape

(6017, 2)

**Question 1:- Preprocess Text Data: Remove punctuation, Perform Tokenization, Remove stopwords and Lemmatize/Stem**

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Here we use lemmatization rather than stemming. Lemmatization is generally considered the better choice because it maps each word to a valid dictionary base form, whereas stemming only strips suffixes and can produce non-words.

In [6]:
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Create a set of stopwords for English language
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """
    Preprocesses the given text by removing punctuation, stop words and lemmatizing the words.
    """
    # Remove punctuation from text
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize the text into individual words
    tokens = text.split()
    
    # Remove stopwords from the list of tokens
    tokens = [token for token in tokens if token.lower() not in stop_words]
    
    # Lemmatize the words in the list of tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Join the lemmatized tokens to form a string again
    preprocessed_text = ' '.join(lemmatized_tokens)
    
    return preprocessed_text

# Apply preprocess_text function to the 'review' column of the dataSet
dataSet['review'] = dataSet['review'].apply(preprocess_text)
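
To see what the punctuation and stopword steps do to a single sentence, here is a small standalone sketch. The stopword set is a tiny hypothetical stand-in for NLTK's English list, the sample sentence is made up, and lemmatization is left out so that the snippet runs without any downloads.

```python
import string

# Hypothetical, tiny stand-in for NLTK's English stopword list (illustration only)
DEMO_STOP_WORDS = {"the", "was", "and", "i", "it"}

def strip_and_filter(text):
    """Remove punctuation, split on whitespace, and drop demo stopwords."""
    # Same punctuation-removal idiom as preprocess_text above
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    tokens = no_punct.split()
    # Compare in lowercase so 'The' and 'the' are treated alike
    return [token for token in tokens if token.lower() not in DEMO_STOP_WORDS]

sample = "The movie was great, and I loved it!"
print(strip_and_filter(sample))
```

Note that the original casing of the kept tokens is preserved; only the stopword comparison is case-insensitive.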

**Question 2:- Perform TFIDF Vectorization**

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataSet['review'], dataSet['sentiment'], test_size=0.5, stratify=dataSet['sentiment'])
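
The `sentiment` column holds the strings `positive` and `negative`. Depending on the installed version, some classifiers (recent XGBoost releases, for example) may expect integer class labels instead. A minimal sketch of such an encoding follows; the mapping and the helper name `encode_labels` are assumptions for illustration, not part of the original notebook.

```python
# Hypothetical mapping from the label strings in the 'sentiment' column to integers
LABEL_TO_INT = {"negative": 0, "positive": 1}

def encode_labels(labels):
    """Convert an iterable of sentiment strings to a list of 0/1 integers."""
    return [LABEL_TO_INT[label] for label in labels]

demo_labels = ["positive", "negative", "negative"]
print(encode_labels(demo_labels))
```

With pandas, `dataSet['sentiment'].map(LABEL_TO_INT)` would apply the same mapping to the whole column before the split.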

In [8]:
# Import the TfidfVectorizer from scikit-learn's feature extraction module
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit the TfidfVectorizer to the training data and transform it
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data using the fitted TfidfVectorizer
X_test_tfidf = tfidf_vectorizer.transform(X_test)
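
As a rough illustration of what the vectorizer computes, the sketch below reimplements TF-IDF in plain Python, following scikit-learn's documented defaults: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows. The function name `tfidf_rows` and the two toy documents are assumptions, and tokenization here is a simple whitespace split, which differs from `TfidfVectorizer`'s default token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Scale each row to unit Euclidean length (toy documents are non-empty)
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = ["great movie great cast", "boring movie"]
toy_rows = tfidf_rows(toy_docs)
for row in toy_rows:
    print(row)
```

In the second toy document, `boring` receives a larger weight than `movie`, because `movie` appears in both documents and so has a lower IDF.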

**Question 3:- Exploring parameter settings using GridSearchCV on Random Forest & Gradient Boosting Classifier. Use Xgboost instead of Gradient Boosting if it's taking a very long time in GridSearchCV**

**Question 4:- Perform Final evaluation of models on the best parameter settings using the evaluation metrics**

**RandomForestClassifier**

In [9]:
# Import necessary modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Create a Random Forest Classifier object
randomForestClassifier = RandomForestClassifier()

# Set the parameters for Grid Search
params = {
    'n_estimators': [100, 200, 300], # Number of trees in the forest
    'max_depth': [None, 10, 20], # Maximum depth of each tree in the forest
    'min_samples_split': [2, 5, 10], # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4], # Minimum number of samples required at each leaf node
    'max_features': ['sqrt', 'log2'] # Maximum number of features to consider for splitting a node ('auto' was an alias for 'sqrt' here and is rejected by newer scikit-learn releases)
}

# Create a Grid Search object with 5-fold cross validation
grid_randomForestClassifier = GridSearchCV(randomForestClassifier, params, cv=5)

# Fit the Grid Search object on the training data
grid_randomForestClassifier.fit(X_train_tfidf, y_train)

# Print the best parameters found by Grid Search
print("Best parameters for Random Forest: ", grid_randomForestClassifier.best_params_)

# Print best_score_, the mean cross-validated accuracy of the best model (labelled "Training accuracy" in the output)
print("Training accuracy for Random Forest: ", grid_randomForestClassifier.best_score_)

# Print the testing accuracy of the best model found by Grid Search
print("Testing accuracy for Random Forest: ", grid_randomForestClassifier.score(X_test_tfidf, y_test))



Best parameters for Random Forest:  {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 300}
Training accuracy for Random Forest:  0.8434132481301928
Testing accuracy for Random Forest:  0.8334995014955134
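
Question 4 asks for evaluation metrics, while the cell above reports accuracy only. As a sketch of what precision, recall, and F1 add, the snippet below computes them by hand for one class. The helper name `binary_metrics` and the toy label lists are assumptions for illustration; they are not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return (precision, recall, f1) for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels, made up for this example
toy_true = ["positive", "positive", "negative", "negative", "positive"]
toy_pred = ["positive", "negative", "negative", "positive", "positive"]
print(binary_metrics(toy_true, toy_pred))
```

In the notebook itself, `sklearn.metrics.classification_report(y_test, grid_randomForestClassifier.predict(X_test_tfidf))` would report the same quantities per class for the tuned model.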


**XGBClassifier**

In [10]:
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# # Create a GradientBoostingClassifier object
# gbClassifier = GradientBoostingClassifier()

# # Define the hyperparameter grid
# params = {
#     'n_estimators': [100, 200, 300],
#     'max_depth': [None, 3, 6],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4],
#     'max_features': ['auto', 'sqrt', 'log2']
# }

# # Create a GridSearchCV object for GradientBoostingClassifier
# grid_gbClassifier = GridSearchCV(gbClassifier, params, cv=5)

# # Fit the GridSearchCV object to the training data
# grid_gbClassifier.fit(X_train_tfidf, y_train)

# # Print the best parameters and accuracy for GradientBoostingClassifier
# print("Best parameters for Gradient Boosting: ", grid_gbClassifier.best_params_)
# print("Training accuracy for Gradient Boosting: ", grid_gbClassifier.best_score_)
# print("Testing accuracy for Gradient Boosting: ", grid_gbClassifier.score(X_test_tfidf, y_test))

# Create an XGBClassifier object
xgbClassifier = XGBClassifier()

# Define the hyperparameter grid for XGBClassifier
params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.1, 0.01, 0.001],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
    'gamma': [0, 1, 5]
}

# Create a GridSearchCV object for XGBClassifier
grid_xgbClassifier = GridSearchCV(xgbClassifier, params, cv=5)

# Fit the GridSearchCV object to the training data
grid_xgbClassifier.fit(X_train_tfidf, y_train)

# Print the best parameters, best_score_ (mean cross-validated accuracy, labelled "Training accuracy"), and the test accuracy for XGBClassifier
print("Best parameters for XGBoost: ", grid_xgbClassifier.best_params_)
print("Training accuracy for XGBoost: ", grid_xgbClassifier.best_score_)
print("Testing accuracy for XGBoost: ", grid_xgbClassifier.score(X_test_tfidf, y_test))




Best parameters for XGBoost: {'colsample_bytree': 0.5, 'gamma': 5, 'learning_rate': 0.1, 'max_depth': 9, 'n_estimators': 300, 'subsample': 1}
Training accuracy for XGBoost: 0.8657264151941433
Testing accuracy for XGBoost: 0.8482793577371045
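
A grid search trains one model per parameter combination per fold, which is why these searches take a long time. The sketch below counts the fits implied by the XGBoost grid above with 5-fold cross-validation; the variable names are assumptions, and the count excludes the final refit on the full training set.

```python
from math import prod

# Same value lists as the XGBoost grid above
xgb_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.1, 0.01, 0.001],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
    'gamma': [0, 1, 5],
}

cv_folds = 5
# Number of distinct parameter combinations in the grid
n_candidates = prod(len(values) for values in xgb_grid.values())
# Each combination is trained once per cross-validation fold
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 729 3645
```

If that is too slow, options include passing `n_jobs=-1` to `GridSearchCV` to use all CPU cores, or using `RandomizedSearchCV`, which samples a fixed number of combinations instead of trying them all.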


**Question 5:- Report the best performing model**

XGBoost is the best performing model. On the held-out test set it reaches an accuracy of about 0.848, compared with about 0.833 for Random Forest. Its best cross-validation score (printed above as training accuracy) is also higher, about 0.866 versus 0.843. Both models were tuned with a 5-fold grid search, so the comparison is between the best settings found for each. XGBoost is therefore the recommended model for this classification task.