In [1]:
import pandas as pd
import numpy as np

In [2]:
!pip install nltk



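TfidfVectorizer converts each document into a vector of TF-IDF weights. A word's count in a document is scaled by how rare that word is across all documents, so words that appear almost everywhere get low weights and words specific to a few documents get high ones. The weights therefore indicate how important each word is. Words with low importance can then be removed, which reduces the input dimensions and makes model building less complex.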
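The weighting described above can be sketched in plain Python. This is a minimal illustration, not the library's code: it assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which scikit-learn documents as its defaults, and it uses a hypothetical three-sentence corpus and whitespace tokenisation rather than the reviews file.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from reviews.txt).
docs = [
    "the movie was awesome",
    "the movie was boring",
    "the book was awesome",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    raw = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

# Weights for the second document, highest first.
weights = tfidf(tokenized[1])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.3f}")
```

In this toy corpus "boring" occurs in one document only, so it receives the largest weight, while "the" and "was" occur in every document and receive the smallest.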

In [11]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score, accuracy_score
import pickle

Stopwords are words that occur very frequently in a language but carry little meaning on their own. Examples in English include "the", "a", "an", "and", "is", "are", "of", and "on". They are typically removed from text before natural language processing (NLP) tasks such as text classification and text summarization.

Because stopwords appear in almost every text, they contribute little to telling one document apart from another. Removing them lets the analysis focus on the words that carry the content, and it also shrinks the vocabulary the model has to handle.

Stopwords can be filtered out with Python's NLTK package, which ships stopword lists for several languages. The lists are accessed through the nltk.corpus.stopwords module, for example stopwords.words('english').
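To make the filtering step concrete, here is a small sketch. It assumes a hand-written stop list holding only the example words quoted above, so that it runs without downloading anything. In the notebook the list comes from NLTK instead, and the helper name remove_stopwords is made up for this illustration.

```python
# Hypothetical stop list: only the example words quoted in the text,
# standing in for nltk.corpus.stopwords.words('english').
stop_words = {"the", "a", "an", "and", "is", "are", "of", "on"}

def remove_stopwords(text, stop_words):
    """Lowercase the text, split on whitespace, and drop tokens in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

filtered = remove_stopwords("The Da Vinci Code is an awesome book", stop_words)
print(filtered)  # ['da', 'vinci', 'code', 'awesome', 'book']
```

Lowercasing before the lookup matters because the stop list is lowercase; without it "The" would slip through.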

In [4]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\neeha\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [5]:
# Tab-separated file: 'Reviews' holds the 0/1 label, 'Comments' holds the review text.
data_reviews = pd.read_csv('reviews.txt', sep='\t', names=['Reviews', 'Comments'])

In [6]:
data_reviews

Unnamed: 0,Reviews,Comments
0,1,The Da Vinci Code book is just awesome.
1,1,this was the first clive cussler i've ever rea...
2,1,i liked the Da Vinci Code a lot.
3,1,i liked the Da Vinci Code a lot.
4,1,I liked the Da Vinci Code but it ultimatly did...
...,...,...
6913,0,Brokeback Mountain was boring.
6914,0,So Brokeback Mountain was really depressing.
6915,0,"As I sit here, watching the MTV Movie Awards, ..."
6916,0,Ok brokeback mountain is such a horrible movie.


In [7]:
stopset = set(stopwords.words('english'))

In [8]:
# stop_words is documented as 'english', a list, or None, so convert the set to a list.
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=list(stopset))

In [9]:
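# Note: the vectorizer is fitted on all reviews before the train/test split,
# so its vocabulary and IDF weights also reflect the test reviews.
X = vectorizer.fit_transform(data_reviews.Comments)
y = data_reviews.Reviews
with open('tranform.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)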

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [12]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=7)

In [13]:
accuracy_score(y_test,clf.predict(X_test))

0.9747109826589595

In [15]:
# Refit on the full dataset (train + test) before saving the final model.
clf.fit(X, y)

KNeighborsClassifier(n_neighbors=7)

In [16]:
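# X_test was part of the data used in the refit, so this score is optimistic; In [13] gives the held-out estimate.
accuracy_score(y_test, clf.predict(X_test))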

0.9804913294797688

In [17]:
filename = 'movie_recommender_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)
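A saved classifier is only usable together with the vectorizer that produced its features, which is presumably why both objects are pickled. The sketch below shows the round trip under stated assumptions: the four sentences and their labels are invented stand-ins for the reviews data, an in-memory buffer replaces the two .pkl files, and n_neighbors=1 is chosen only because the toy set is tiny.

```python
import io
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = [
    "I loved this movie",
    "this movie was awesome",
    "what a boring film",
    "this film was horrible",
]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(vec.fit_transform(texts), labels)

# Serialise both objects; the buffer stands in for files on disk.
buf = io.BytesIO()
pickle.dump((vec, knn), buf)
buf.seek(0)
loaded_vec, loaded_knn = pickle.load(buf)

# New text must go through the *loaded* vectorizer's transform (not fit_transform)
# so that it is mapped onto the same vocabulary the classifier was trained on.
new_text = ["an awesome movie"]
prediction = loaded_knn.predict(loaded_vec.transform(new_text))
print(prediction[0])
```

Calling transform rather than fit_transform at prediction time is the important design point: refitting would build a new vocabulary whose columns no longer match the ones the classifier learned from.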