The goal of this task is to build a model with `sklearn` that predicts whether a movie review is negative or positive.
The data folder contains two subfolders of .txt files, one holding the "negative" reviews and one holding the "positive" reviews.
Because there are only two categories, the target array encodes them as 0s and 1s; `load_files` numbers the categories in alphabetical order of the subfolder names.

In [1]:
import numpy as np
import re
# import sys
# !{sys.executable} -m pip install nltk
import nltk
from sklearn.datasets import load_files
# nltk.download('stopwords')
# nltk.download('wordnet')
import pickle
from nltk.corpus import stopwords

In [2]:
movie_data = load_files(r"C:\Users\Gulsh\Desktop\Education\test\txt_sentoken")
X, y = movie_data.data, movie_data.target

In [19]:
movie_data.data[0]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is 

In [20]:
movie_data.target

array([0, 1, 1, ..., 1, 0, 0])
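To see which integer stands for which category, `target_names` can be inspected next to `target`. The sketch below builds a tiny throwaway folder rather than touching the real data; the subfolder names `neg` and `pos` and the two review texts are made up for illustration and are not taken from the dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical miniature dataset: one subfolder per category, one file each
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [("neg", "dull and slow"), ("pos", "sharp and funny")]:
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "review.txt").write_text(text, encoding="utf-8")

    demo = load_files(tmp)
    # Folder names, sorted alphabetically; the position in this list is the label
    demo_names = list(demo.target_names)
    # Pair each label with its raw document (load_files returns bytes by default)
    demo_pairs = sorted(zip(demo.target.tolist(), demo.data))

print(demo_names)
print(demo_pairs)
```

Running `movie_data.target_names` on the real data shows the mapping in the same way.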

In [3]:
documents = []

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

for raw_document in X:
    # str() on a bytes object returns its repr ("b'...'"), hence the 'b' prefix handled below
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(raw_document))

    # Remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character from the start
    # (the caret must be unescaped to anchor at the start of the string)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substitute multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove the prefixed 'b' if it is still present
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatization: split into words, lemmatize each one, join back into a string
    document = document.split()

    document = [lemmatizer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)
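One side effect of cleaning `str(bytes)` is worth seeing on a small sample. The sketch below applies the same substitutions (with the start anchor written as `^`) to a fragment of the first review, once through `str()` as in the cell above and once after decoding the bytes. The helper name `clean` and the UTF-8 decoding are assumptions for illustration; the results reported in this notebook were produced with `str()`.

```python
import re


def clean(text: str) -> str:
    """Apply the notebook's regex substitutions to an already-converted string."""
    text = re.sub(r'\W', ' ', text)              # special characters -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)    # drop a single letter at the start
    text = re.sub(r'\s+', ' ', text)             # collapse whitespace
    return text.strip().lower()


raw = b"worse . \nit's hard seeing arnold"

# str() keeps the newline as the two characters '\' and 'n', so the 'n'
# ends up glued to the next word once the backslash is replaced by a space
via_str = clean(str(raw))

# Decoding first turns the newline into real whitespace
via_decode = clean(raw.decode("utf-8"))

print(via_str)
print(via_decode)
```

With `str()`, the word after each line break gains a leading `n` ("nit" instead of "it"), which adds noise tokens to the vocabulary. Decoding the bytes avoids this, although switching would change the features and therefore the scores shown further down.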

In [4]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

In [12]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
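Note that the vectorizer and the TF-IDF transformer above were fitted on all documents before the split, so document frequencies from the test reviews influence the training features. A possible alternative, sketched here on a made-up toy corpus, is to split the raw texts first and put the feature extraction inside a pipeline. `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`; the `toy_*` names and the choice of `LogisticRegression` are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned reviews
toy_docs = ["great film", "awful plot", "great acting", "awful film",
            "great plot", "awful acting", "great story", "awful story"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw texts first so the test documents stay unseen
docs_train, docs_test, y_tr, y_te = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels)

# The vectorizer is fitted only on the training texts when the pipeline is fitted
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))
pipe.fit(docs_train, y_tr)
toy_pred = pipe.predict(docs_test)

print(toy_pred)
```

On the real data the same parameters as above (`max_features`, `min_df`, `max_df`, `stop_words`) could be passed to `TfidfVectorizer`.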

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression


from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


In [17]:
%%time 

model_dict = {
   'LR': LogisticRegression(random_state=42),
   'LDA': LDA(),
   'SVC': SVC(random_state=42),
   'KNN': KNeighborsClassifier(),
   'DT': DecisionTreeClassifier(random_state=42),
   'RF': RandomForestClassifier(random_state=42),
   'GBC': GradientBoostingClassifier(random_state=42)
}

score_list = []

for name, model in model_dict.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Keep the test accuracy so the models can be compared afterwards
    score_list.append((name, accuracy_score(y_test, y_pred)))
    print(f'\n Model name: {name}')
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))



 Model name: LR
[[161  29]
 [ 37 173]]
              precision    recall  f1-score   support

           0       0.81      0.85      0.83       190
           1       0.86      0.82      0.84       210

    accuracy                           0.83       400
   macro avg       0.83      0.84      0.83       400
weighted avg       0.84      0.83      0.84       400


 Model name: LDA
[[119  71]
 [ 76 134]]
              precision    recall  f1-score   support

           0       0.61      0.63      0.62       190
           1       0.65      0.64      0.65       210

    accuracy                           0.63       400
   macro avg       0.63      0.63      0.63       400
weighted avg       0.63      0.63      0.63       400


 Model name: SVC
[[160  30]
 [ 30 180]]
              precision    recall  f1-score   support

           0       0.84      0.84      0.84       190
           1       0.86      0.86      0.86       210

    accuracy                           0.85       400
   mac
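The figures in each classification report follow directly from the confusion matrix printed above it. As a worked check, the sketch below recomputes accuracy, precision and recall from the LR matrix, using the `sklearn` convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed for the LR model (rows: true labels, columns: predictions)
cm = [[161, 29],
      [37, 173]]

total = sum(sum(row) for row in cm)
# Correct predictions sit on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision divides by the column total (everything predicted as that class)
precision = [cm[k][k] / (cm[0][k] + cm[1][k]) for k in range(2)]
# Recall divides by the row total (everything that truly belongs to that class)
recall = [cm[k][k] / sum(cm[k]) for k in range(2)]

print(f"accuracy={accuracy:.3f}")
for k in range(2):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f}")
```

The computed precision and recall round to the values in the LR report, and the accuracy is 334/400 = 0.835.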

In [18]:
%%time

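C = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
# Note: with num=1, linspace returns only the start value, so degree is always [1];
# degree is only used by the 'poly' kernel and is ignored by the default 'rbf' kernel
degree = [int(x) for x in np.linspace(start=1, stop=5, num=1)]
tol = [0.001, 0.002, 0.003, 0.004, 0.005]
# cache_size sets the kernel cache size in MB, which affects run time only
cache_size = [int(x) for x in np.linspace(start=100, stop=500, num=50)]
shrinking = [True, False]
# probability=True additionally fits probability estimates, which slows down training
probability = [True, False]
gamma = ['scale', 'auto']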


param_dist = {'C': C,
              'degree': degree,
              'tol': tol,
              'cache_size': cache_size,
              'shrinking': shrinking,
              'probability': probability,
              'gamma': gamma}

rs = RandomizedSearchCV(SVC(), 
                        param_dist, 
                        n_iter = 100, 
                        cv = 3, 
                        verbose = 1, 
                        n_jobs=-1, 
                        random_state=42)
rs.fit(X_train, y_train)
y_pred_hp = rs.predict(X_test)

print(classification_report(y_test,y_pred_hp))
print(rs.best_score_)
print(rs.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 15.3min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 23.3min finished


              precision    recall  f1-score   support

           0       0.84      0.84      0.84       190
           1       0.85      0.85      0.85       210

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400

0.8187549334438892
{'tol': 0.001, 'shrinking': False, 'probability': False, 'gamma': 'scale', 'degree': 1, 'cache_size': 418, 'C': 1.1}
Wall time: 23min 29s
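In the search above the kernel stays at its default (`'rbf'`), so `degree` is not used, and `cache_size` only affects run time. A narrower search over parameters that change the fitted model might look like the sketch below. It runs on synthetic data from `make_classification` so that it finishes quickly; the `demo_*` names, the parameter ranges and the log-uniform distribution for `C` are assumptions, not settings used in this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for the TF-IDF features
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Only parameters that change the decision function
demo_param_dist = {
    "C": loguniform(1e-2, 1e2),   # sample C across several orders of magnitude
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],   # ignored by the linear kernel
}

demo_search = RandomizedSearchCV(SVC(), demo_param_dist,
                                 n_iter=10, cv=3, random_state=42)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

Fewer, more influential parameters keep the number of fits small, which matters given that the 300 fits above took about 23 minutes.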


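Hyperparameter tuning is not worthwhile here: after a search of more than 23 minutes, the tuned SVC reaches 0.84 accuracy on the test set, no better than the 0.85 obtained with the default settings.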