source:
https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

# Initial Data Prep

**Read in data, create new "first_genre" column, and split into train / test**

In [61]:
import pandas as pd
from ast import literal_eval
import numpy as np

In [66]:
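# genre_ids is stored as text such as "[18, 80]"; literal_eval parses it into a list
df = pd.read_csv('data/movie_df.csv', encoding='utf8', converters={'genre_ids':literal_eval})
# Drop movies whose genre list is empty
df = df[df['genre_ids'].str.len() != 0]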

In [69]:
# Use the first listed genre as the single class label for each movie
df['first_genre'] = df['genre_ids'].apply(lambda x: x[0])
df.head()

| Unnamed: 0 | genre_ids | id | overview | popularity | release_date | title | vote_average | vote_count | imdb_id | first_genre |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [18, 80] | 278 | Framed in the 1940s for the double murder of h... | 28.527767 | 1994-09-23 | The Shawshank Redemption | 8.5 | 9773 | tt0111161 | 18 |
| 1 | [18, 80] | 238 | Spanning the years 1945 to 1955, a chronicle o... | 36.965452 | 1972-03-14 | The Godfather | 8.5 | 7394 | tt0068646 | 18 |
| 2 | [18, 36, 10752] | 424 | The true story of how businessman Oskar Schind... | 19.945455 | 1993-11-29 | Schindler's List | 8.4 | 5518 | tt0108052 | 18 |
| 3 | [18, 80] | 240 | In the continuing saga of the Corleone crime f... | 30.191804 | 1974-12-20 | The Godfather: Part II | 8.4 | 4249 | tt0071562 | 18 |
| 4 | [18, 9648] | 452522 | Standalone version of the series pilot with an... | 5.969249 | 1989-12-31 | Twin Peaks | 8.4 | 123 | tt0278784 | 18 |


In [97]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.5, random_state=9001)

# Natural Language Processing

We will use a few scikit-learn classes in this section:

**CountVectorizer**

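This creates a document-term matrix with dimensions [n_samples, n_features]: one row per overview and one column per distinct token, where each entry is the number of times that token occurs in that overview.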
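
To make the shape of that matrix concrete, here is a minimal sketch on three made-up sentences. The toy_docs list and the toy_ variable names are illustrative assumptions, not rows from movie_df.csv.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy overviews, not taken from the movie data
toy_docs = [
    "a detective hunts a killer",
    "a killer robot hunts a detective",
    "two friends fall in love",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix of raw token counts

print(toy_counts.shape)               # (n_samples, n_features)
print(sorted(toy_vect.vocabulary_))   # the tokens that became columns
print(toy_counts.toarray())           # dense view, one row per document
```

Note that the default tokenizer only keeps tokens of two or more characters, so the word "a" never becomes a column.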

**TfidfTransformer**

TF stands for *Term Frequency*: the count of each word divided by the total number of words in the document. TF-IDF stands for *Term Frequency times Inverse Document Frequency*. Multiplying by the inverse document frequency reduces the weight of words that appear in many documents, such as "this", "is", and "an".
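
The down-weighting can be checked on a toy corpus. This is a small sketch under assumed data: tfidf_docs and the other names are made up for illustration. With scikit-learn's default settings (smooth_idf=True), a word that occurs in every document gets the lowest possible IDF, and rarer words get larger values.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus: "the" is in all three documents, "villain" in only one
tfidf_docs = [
    "the hero saves the city",
    "the villain attacks the city",
    "the hero falls in love",
]

tfidf_vect = CountVectorizer()
tfidf_counts = tfidf_vect.fit_transform(tfidf_docs)

toy_tfidf = TfidfTransformer()
toy_weights = toy_tfidf.fit_transform(tfidf_counts)  # rows are L2-normalized by default

# Map each word to its learned IDF value (columns are ordered by vocabulary index)
words_in_column_order = sorted(tfidf_vect.vocabulary_, key=tfidf_vect.vocabulary_.get)
idf_by_word = dict(zip(words_in_column_order, toy_tfidf.idf_))

for word in ("the", "hero", "villain"):
    print(word, round(idf_by_word[word], 3))
```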

**Pipeline**

This chains several transformations and a final estimator into a single object, so one call to fit or predict runs every step in order. It makes our code more concise.
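
As a rough illustration of what the pipeline saves, the sketch below fits the same three steps twice, once by hand and once through a Pipeline, and compares the predictions. The toy sentences and labels are invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data and unseen text
pipe_text = ["space battle aliens", "alien invasion space",
             "wedding love story", "love letters romance"]
pipe_labels = ["scifi", "scifi", "romance", "romance"]
pipe_new_text = ["space aliens attack", "a love story"]

# Manual version: every step is fitted and applied explicitly
manual_vect = CountVectorizer()
manual_tfidf = TfidfTransformer()
manual_clf = MultinomialNB()
manual_train = manual_tfidf.fit_transform(manual_vect.fit_transform(pipe_text))
manual_clf.fit(manual_train, pipe_labels)
manual_pred = manual_clf.predict(
    manual_tfidf.transform(manual_vect.transform(pipe_new_text)))

# Pipeline version: the same steps, driven by a single fit and a single predict
toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
toy_pipe.fit(pipe_text, pipe_labels)
pipe_pred = toy_pipe.predict(pipe_new_text)

print(list(manual_pred), list(pipe_pred))
```

The step names ('vect', 'tfidf', 'clf') are free-form labels; they matter later because grid search addresses parameters through them.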



In [101]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

**Naive Bayes Classifier**

In [105]:
from sklearn.naive_bayes import MultinomialNB


# Token counts -> TF-IDF weights -> Multinomial Naive Bayes
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

text_clf = text_clf.fit(train['overview'], train['first_genre'])
nb_train_predict = text_clf.predict(train['overview'])
nb_test_predict = text_clf.predict(test['overview'])

nb_train_accuracy = np.mean(nb_train_predict == train['first_genre'])
nb_test_accuracy = np.mean(nb_test_predict == test['first_genre'])

print("Train Accuracy: {0} \nTest Accuracy: {1}".format(nb_train_accuracy,nb_test_accuracy))

Train Accuracy: 0.334 
Test Accuracy: 0.372




**Support Vector Machines (SVM)**

In [87]:
from sklearn.linear_model import SGDClassifier

In [109]:
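# loss='hinge' makes SGDClassifier train a linear SVM;
# max_iter=5 with tol=None runs exactly five passes over the training data
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, max_iter=5, tol=None,
                                                   random_state=42))])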
_ = text_clf_svm.fit(train['overview'], train['first_genre'])

svm_train_predict = text_clf_svm.predict(train['overview'])
svm_test_predict = text_clf_svm.predict(test['overview'])

svm_train_accuracy = np.mean(svm_train_predict == train['first_genre'])
svm_test_accuracy = np.mean(svm_test_predict == test['first_genre'])

print("Train Accuracy: {0} \nTest Accuracy: {1}".format(svm_train_accuracy,svm_test_accuracy))



Train Accuracy: 1.0 
Test Accuracy: 0.364


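It looks like extreme over-fitting is occurring: the SVM classifies every training overview correctly (train accuracy 1.0) but only about 36% of the test overviews.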
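
One way to put a number on this is the difference between training and test accuracy. For the SVM results above that gap is 1.0 - 0.364 = 0.636, while for Naive Bayes it is 0.334 - 0.372 = -0.038. The sketch below wraps the comparison in a hypothetical helper, accuracy_gap, and runs it on made-up toy data rather than on the movie overviews.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def accuracy_gap(model, train_text, train_labels, test_text, test_labels):
    """Fit the model, then return (train accuracy, test accuracy, difference)."""
    model.fit(train_text, train_labels)
    train_acc = model.score(train_text, train_labels)
    test_acc = model.score(test_text, test_labels)
    return train_acc, test_acc, train_acc - test_acc


# Hypothetical toy data standing in for the train / test overviews
gap_train_text = ["space battle aliens", "alien invasion space",
                  "wedding love story", "love letters romance"]
gap_train_labels = ["scifi", "scifi", "romance", "romance"]
gap_test_text = ["robots rebel on mars", "two strangers fall in love"]
gap_test_labels = ["scifi", "romance"]

gap_svm = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                              alpha=1e-3, max_iter=5, tol=None,
                                              random_state=42))])

gap_train_acc, gap_test_acc, gap_value = accuracy_gap(
    gap_svm, gap_train_text, gap_train_labels, gap_test_text, gap_test_labels)
print("train:", gap_train_acc, "test:", gap_test_acc, "gap:", gap_value)
```

A gap near zero suggests the model generalizes about as well as it fits; a large positive gap is the pattern seen with the SVM here.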

**Grid Search Cross Validation**

In [94]:
from sklearn.model_selection import GridSearchCV

# Keys use the <step name>__<parameter> form, matching the step names in text_clf
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

In [110]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(train['overview'], train['first_genre'])



In [113]:
gs_clf.best_score_


0.338

In [116]:
gs_clf.best_params_

{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}
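
Here best_score_ is the mean cross-validated accuracy of the best parameter combination, computed on folds of the training data only. To see how every combination scored, not just the winner, the cv_results_ attribute can be inspected. The following is a minimal sketch on invented toy data; the toy sentences, the cv=2 setting, and the search_ variable names are assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four documents per class so that 2-fold CV can stratify
search_text = [
    "space battle with aliens", "alien invasion from space",
    "robots rebel on mars", "a starship crew explores space",
    "a wedding and a love story", "love letters and romance",
    "two strangers fall in love", "a romance blooms in paris",
]
search_labels = ["scifi"] * 4 + ["romance"] * 4

search_pipe = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('clf', MultinomialNB())])

# Same grid as above: 2 x 2 x 2 = 8 parameter combinations
search_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'clf__alpha': (1e-2, 1e-3)}

toy_search = GridSearchCV(search_pipe, search_grid, cv=2)
toy_search.fit(search_text, search_labels)

# One line per combination: mean validation accuracy, then the parameters tried
results = toy_search.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print(round(score, 3), params)
```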

**Removing Stop Words - Naive Bayes**

In [119]:
# Same Naive Bayes pipeline, but common English stop words are dropped before counting
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

text_clf = text_clf.fit(train['overview'], train['first_genre'])
nb_train_predict = text_clf.predict(train['overview'])
nb_test_predict = text_clf.predict(test['overview'])

nb_train_accuracy = np.mean(nb_train_predict == train['first_genre'])
nb_test_accuracy = np.mean(nb_test_predict == test['first_genre'])

print("Train Accuracy: {0} \nTest Accuracy: {1}".format(nb_train_accuracy,nb_test_accuracy))

Train Accuracy: 0.376 
Test Accuracy: 0.372




**Removing Stop Words - Support Vector Machines**

In [114]:
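# Same linear SVM pipeline, with English stop words removed;
# max_iter=5 with tol=None runs exactly five passes over the training data
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, max_iter=5, tol=None,
                                                   random_state=42))])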
_ = text_clf_svm.fit(train['overview'], train['first_genre'])

svm_train_predict = text_clf_svm.predict(train['overview'])
svm_test_predict = text_clf_svm.predict(test['overview'])

svm_train_accuracy = np.mean(svm_train_predict == train['first_genre'])
svm_test_accuracy = np.mean(svm_test_predict == test['first_genre'])

print("Train Accuracy: {0} \nTest Accuracy: {1}".format(svm_train_accuracy,svm_test_accuracy))

Train Accuracy: 1.0 
Test Accuracy: 0.352




**Stemming**

In [None]:
import nltk
from nltk.stem.snowball import SnowballStemmer

# ignore_stopwords=True relies on the NLTK stopwords corpus
nltk.download('stopwords')
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer that stems every token produced by the default analyzer."""
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

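
Stemming maps inflected forms of a word to a common stem, so that, for example, "murder" and "murders" end up in the same column of the document-term matrix. The short sketch below shows the stemmer on a few hand-picked words. The word list is an arbitrary illustration, and ignore_stopwords is left at its default here so that no corpus download is required.

```python
from nltk.stem.snowball import SnowballStemmer

# Default settings: stop words are stemmed like any other word
toy_stemmer = SnowballStemmer("english")

# Hypothetical sample words of the kind found in movie overviews
toy_words = ["running", "murders", "framed", "stories"]
toy_stems = [toy_stemmer.stem(w) for w in toy_words]

for word, stem in zip(toy_words, toy_stems):
    print(word, "->", stem)
```

Stems are not always dictionary words; what matters for the classifier is only that related forms collapse to the same token.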


In [None]:
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                      ('tfidf', TfidfTransformer()),
                      ('mnb', MultinomialNB()),
 ])

text_mnb_stemmed = text_mnb_stemmed.fit(train['overview'], train['first_genre'])