<a href="https://colab.research.google.com/github/SOUMEE2000/Sentiment-Analysis-guidelines-IMDB-Datset-/blob/main/3.Stacking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome to Stacking and Ensemble Models !!!**

**In the previous notebook we saw that, of all the classifiers we tried, SVM gave us the best results. However, each of those runs took about 4 hours, and that is definitely not practical. Turns out we can do better. We can stack classifiers one on top of another; the stack takes far less time to train and in some cases gives better accuracy!!**
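To make "stacking" concrete, here is a minimal sketch of roughly what scikit-learn's `StackingClassifier` automates: base models produce out-of-fold predictions, and a final estimator is trained on those predictions. It runs on synthetic data from `make_classification`, not on the IMDB reviews, and the two base models and their settings are assumptions chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is not the IMDB dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = [GaussianNB(), DecisionTreeClassifier(max_depth=4, random_state=0)]

# Level 0: out-of-fold probabilities, so the final estimator never sees a
# prediction made on a row that the base model was trained on.
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the whole training split to build test-time features.
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# Level 1: the final estimator learns how to combine the base-model outputs.
final_estimator = LogisticRegression().fit(meta_train, y_train)
stacked_accuracy = final_estimator.score(meta_test, y_test)
print("Stacked accuracy on the toy data:", stacked_accuracy)
```

Each base model contributes one column to `meta_train`, so the final estimator is fitted on only as many features as there are base models.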

In [None]:
source="https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/"

In [None]:
!unzip IMDB.zip

Archive:  IMDB.zip
  inflating: IMDB Dataset.csv        


In [None]:
import pandas as pd
df= pd.read_csv("IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
# Encode the sentiment labels as integers: positive -> 1, negative -> 0
df["feature"] = df["sentiment"].map({"positive": 1, "negative": 0})

In [None]:
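# Replace every character that is not a letter or '#' with a space.
# regex=True is required: recent pandas versions treat the pattern as a literal string by default.
df['review_processed'] = df['review'].str.replace("[^a-zA-Z#]", " ", regex=True)
df['review_processed'] = df['review_processed'].str.lower()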

# Removing Stopwords Begin
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import word_tokenize
# A set keeps the membership test inside remove_stopwords fast
stop_words = set(stopwords.words('english'))

# Making custom list of words to be removed
add_words = ['movie','film','one','make','even','the']
stop_words.update(add_words)

# Function to remove stop words
def remove_stopwords(rev):
    review_tokenized = word_tokenize(rev)
    rev_new = " ".join([i for i in review_tokenized if i not in stop_words])
    return rev_new

# Removing stopwords
df['review_processed'] = [remove_stopwords(r) for r in df['review_processed']]

# Removing short words (2 characters or fewer)
df['review_processed'] = df['review_processed'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Importing module
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating a TF-IDF matrix of the 2500 most frequent tokens
# (converted to a dense array, which GaussianNB requires)
cv = TfidfVectorizer(max_features = 2500)
X = cv.fit_transform(df.review_processed).toarray()
y = df.feature.values

# Splitting the dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
# Scale every feature to [0, 1]; the scaler is fitted on the training split only
from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler(feature_range=(0,1)).fit(X_train)
X_train = scaling.transform(X_train)
X_test = scaling.transform(X_test)

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# Base estimators for the stack, as (name, estimator) pairs
models= []
models.append(("GNB", GaussianNB()))
models.append(("CNB", ComplementNB()))
models.append(("MNB", MultinomialNB()))
models.append(("RF", RandomForestClassifier(max_depth=500, n_estimators=1000, max_features=5, min_samples_split=5)))
models.append(("LR", LogisticRegression()))
models.append(("SVC", SVC(kernel="linear")))

# No final_estimator is given, so StackingClassifier uses its default, LogisticRegression
model= StackingClassifier(estimators=models)
model.fit(X_train, y_train)
y_pred= model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("The model accuracy is", accuracy )

[[4430  605]
 [ 567 4398]]
The model accuracy is 0.8828
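As a quick check on the numbers printed above, the accuracy can be recomputed by hand from the confusion matrix, together with precision and recall for the positive class. This is a small worked sketch in plain Python; it assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with 0 = negative and 1 = positive as encoded earlier.

```python
# Confusion matrix printed by the notebook: rows = true labels, columns = predictions.
cm = [[4430, 605],
      [567, 4398]]

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total          # share of all reviews classified correctly
precision = tp / (tp + fp)            # of the predicted positives, how many were right
recall = tp / (tp + fn)               # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")  # accuracy  = 0.8828
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
```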


# **Some Results**

* StackingClassifier(estimators=(GNB,CNB,MNB)), 84.94%
* StackingClassifier(estimators=(GNB,CNB,MNB),final_estimator=RandomForestClassifier(max_depth=500)), 81.23%
* StackingClassifier(estimators=(GNB,CNB,MNB,RF(max_depth=500))), 86.64%
* StackingClassifier(estimators=(GNB,CNB,MNB,RF(max_depth=500)),final_estimator=LogisticRegression()), 86.72%
* StackingClassifier(estimators=(GNB,CNB,MNB,RF(max_depth=500), LR)), 88.25%
* StackingClassifier(estimators=(GNB,CNB,MNB,RF(max_depth=500), LSVC)), 88.34%
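
A comparison like the one above can be scripted instead of re-running cells by hand. The sketch below is one possible way to do it, not the procedure used to obtain the figures in the list: `evaluate_stack` is a hypothetical helper, the data is synthetic, and the random forest is deliberately small so that the loop finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, GaussianNB, MultinomialNB

# Synthetic stand-in data (assumption), shifted to be non-negative because
# MultinomialNB and ComplementNB reject negative feature values.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
X = X - X.min(axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


def evaluate_stack(names, final_estimator=None):
    """Hypothetical helper: fit a stack of the named base models, return test accuracy."""
    available = {
        "GNB": GaussianNB(),
        "CNB": ComplementNB(),
        "MNB": MultinomialNB(),
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
        "LR": LogisticRegression(max_iter=1000),
    }
    estimators = [(name, available[name]) for name in names]
    # final_estimator=None falls back to scikit-learn's default, LogisticRegression.
    model = StackingClassifier(estimators=estimators, final_estimator=final_estimator)
    return model.fit(X_train, y_train).score(X_test, y_test)


configs = [
    ("GNB", "CNB", "MNB"),
    ("GNB", "CNB", "MNB", "RF"),
    ("GNB", "CNB", "MNB", "RF", "LR"),
]
results = {config: evaluate_stack(config) for config in configs}
for config, acc in results.items():
    print(", ".join(config), "->", round(acc, 4))
```

Passing a different `final_estimator`, for example `RandomForestClassifier(max_depth=500)`, reproduces the kind of variation listed in the second and fourth bullets.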