<a href="https://colab.research.google.com/github/Dhruvit/Python_ML/blob/master/Movie_genre_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Multi-label movie genre classification
We will use Python to do this.

https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/?utm_source=blog&utm_medium=7-innovative-machine-learning-github-projects-in-python

In [0]:
import tarfile
movie_tar_data = tarfile.open('/content/MovieSummaries.tar.gz')
movie_tar_data.extractall()

In [0]:
import pandas as pd
import numpy as np
import json
import nltk
import re
import csv
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

%matplotlib inline
pd.set_option('display.max_colwidth', 300)

meta = pd.read_csv('/content/MovieSummaries/movie.metadata.tsv', sep='\t', header=None)
meta.head()

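# The file has no header row, so we name the columns ourselves.
# Only movie_id, movie_name and genre are used later; the other columns keep their positional index.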
meta.columns = ["movie_id",1,"movie_name",3,4,5,6,7,"genre"]

#Now, we will load the movie plot dataset into memory. This data comes in a text file with each row consisting of a movie id and a plot of the movie. We will read it line-by-line:

plots = []

with open("/content/MovieSummaries/plot_summaries.txt", 'r') as f:
  reader = csv.reader(f, dialect = 'excel-tab')
  for row in tqdm(reader):  # tqdm wraps the iterator and prints a progress bar
    plots.append(row)
    
#Next, split the movie ids and the plots into two separate lists. We will use these lists to form a dataframe:
movie_id = []
plot = []

# extract movie Ids and plot summaries
for i in tqdm(plots):
  movie_id.append(i[0])
  plot.append(i[1])

# create dataframe
movies = pd.DataFrame({'movie_id': movie_id, 'plot': plot})

movies.head()

#Perfect! We have both the movie id and the corresponding movie plot
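
# The read-and-split steps above can be sketched on an in-memory file. The two rows below are made up for illustration; only the tab-separated "id, plot" layout is taken from the description of plot_summaries.txt.

```python
import csv
import io

# Two hypothetical rows in the same "movie_id<TAB>plot" layout
toy_file = io.StringIO(
    "101\tA detective hunts a killer.\n"
    "202\tTwo lovers meet in Paris.\n"
)

# 'excel-tab' tells csv.reader to split each line on tab characters
rows = list(csv.reader(toy_file, dialect="excel-tab"))

# Split ids and plots into two parallel lists
toy_ids = [row[0] for row in rows]
toy_plots = [row[1] for row in rows]

print(toy_ids)    # ['101', '202']
print(toy_plots)
```

# Note that the ids come back as strings, which matters for the merge later on.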

# Data Exploration and Pre-processing

# cast 'movie_id' to str so it matches the string ids read from the plot file
meta['movie_id'] = meta['movie_id'].astype(str)

# merge meta with movies
movies = pd.merge(movies, meta[['movie_id', 'movie_name', 'genre']], on = 'movie_id')

movies.head()
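
# A small sketch of why the cast matters, using made-up frames (plots_toy and meta_toy are hypothetical names). pd.merge performs an inner join by default, so only ids present in both frames survive.

```python
import pandas as pd

# ids read with csv.reader are strings
plots_toy = pd.DataFrame({"movie_id": ["1", "2", "3"],
                          "plot": ["p1", "p2", "p3"]})

# ids parsed by read_csv are integers until we cast them
meta_toy = pd.DataFrame({"movie_id": [1, 2, 4],
                         "genre": ["{}", "{}", "{}"]})
meta_toy["movie_id"] = meta_toy["movie_id"].astype(str)

# inner join on the shared key: ids 3 and 4 have no partner and are dropped
merged_toy = pd.merge(plots_toy, meta_toy, on="movie_id")

print(merged_toy["movie_id"].tolist())   # ['1', '2']
```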

movies['genre'][0]

# parse this JSON string into a dict; its values are the genre names

json.loads(movies['genre'][0]).values()

# an empty list to collect one list of genre names per movie
genres = []

# extract genres
for i in movies['genre']:
  genres.append(list(json.loads(i).values()))
  
# add to 'movies' dataframe
movies['genre_new'] = genres

movies.head()
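
# The parsing step in isolation, as a minimal sketch. The sample string only imitates the shape of the 'genre' column (a JSON object mapping an id to a genre name); the ids in it are invented.

```python
import json

# Hypothetical genre string: keys are ids, values are genre names
sample_genre = '{"/m/id_a": "Drama", "/m/id_b": "World cinema"}'

def parse_genres(raw):
    """Return the genre names stored as the values of a JSON object string."""
    return list(json.loads(raw).values())

print(parse_genres(sample_genre))   # ['Drama', 'World cinema']
print(parse_genres('{}'))           # [] (a movie with no genre tags)
```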

# Some of the samples might not contain any genre tags. We should remove those samples as they won’t play a part in our model building process:

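# remove samples with 0 genre tags; .copy() gives an independent frame so that
# adding columns later does not trigger pandas' SettingWithCopyWarning
movies_new = movies[movies['genre_new'].str.len() > 0].copy()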

movies_new.shape , movies.shape
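
# A toy version of the filter, assuming a frame with a list-valued column like genre_new (the rows are made up). .str.len() also works on lists, so it gives the number of tags per movie.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_name": ["A", "B", "C"],
    "genre_new": [["Drama"], [], ["Action", "Comedy"]],
})

# keep rows with at least one tag; copy so later column assignments are safe
kept = toy[toy["genre_new"].str.len() > 0].copy()
kept["n_genres"] = kept["genre_new"].str.len()

print(kept["movie_name"].tolist())   # ['A', 'C']
print(kept["n_genres"].tolist())     # [1, 2]
```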

# get all genre tags in a list 
all_genres = sum(genres,[])
len(set(all_genres))
  
#There are 363 unique genre tags in our dataset. That is quite a big number; I can hardly recall 5-6 genres! Let’s find out what these tags are. We will use FreqDist( ) from the nltk library to create a dictionary of genres and their occurrence count across the dataset:
all_genres = nltk.FreqDist(all_genres)

# create dataframe
all_genres_df = pd.DataFrame({'Genre' : list(all_genres.keys()),
                             'Count': list(all_genres.values())})
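
# The same counting idea on toy data, as a sketch. collections.Counter stands in for nltk.FreqDist here (FreqDist is a Counter subclass), and itertools.chain flattens the nested lists without the repeated list copying that sum(genres, []) does.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-movie genre lists
toy_genres = [["Drama", "Comedy"], ["Drama"], [], ["Action", "Drama"]]

# flatten the list of lists into one list of tags
flat = list(chain.from_iterable(toy_genres))

# count how often each tag occurs
counts = Counter(flat)

print(len(set(flat)))          # 3 unique tags
print(counts.most_common(1))   # [('Drama', 3)]
```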

#I personally feel visualizing the data is a much better method than simply putting out numbers. So, let’s plot the distribution of the movie genres:
'''
g = all_genres_df.nlargest(columns="Count", n = 50)
plt.figure(figsize=(12,15))
ax = sns.barplot(data=g,x="Count",y="Genre")
ax.set(ylabel = 'Genre')
plt.show()
'''

#Next, we will clean our data a bit. I will use some very basic text cleaning steps (as that is not the focus area of this article):
# function for text cleaning
def clean_text(text):
  # remove apostrophes ("\'" is just "'" in Python), so "don't" becomes "dont"
  text = re.sub("'","",text)
  # replace everything except letters with a space
  text = re.sub("[^a-zA-Z]"," ",text)
  # collapse repeated whitespace
  text = ' '.join(text.split())
  # convert text to lowercase
  text = text.lower()
  
  return text

movies_new['clean_plot'] = movies_new['plot'].apply(clean_text)

movies_new.head()
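
# What the cleaning steps do to a single sentence, as a self-contained sketch. The function repeats the same four steps as clean_text above; the sample sentence is made up.

```python
import re

def clean_text(text):
    text = re.sub("'", "", text)             # drop apostrophes
    text = re.sub("[^a-zA-Z]", " ", text)    # non-letters become spaces
    text = " ".join(text.split())            # collapse repeated whitespace
    return text.lower()                      # lowercase

sample = "It's 1999: Neo (a hacker) can't sleep!"
print(clean_text(sample))   # its neo a hacker cant sleep
```

# Digits and punctuation are dropped entirely, so years or numbers in a plot do not survive as features.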

#The function below will visualize the words and their frequency in a set of documents. Let’s use it to find out the most frequent words in the movie plots column:
def freq_words(x, terms = 30):
  all_words = ' '.join([text for text in x])
  all_words = all_words.split()
  fdist = nltk.FreqDist(all_words)
  words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())})
  
  # selecting top n most frequent words
  d = words_df.nlargest(columns="count", n = terms)
  
  # visualize words and frequencies
  plt.figure(figsize=(12,15))
  ax = sns.barplot(data = d, x="count", y="word")
  ax.set(ylabel='Word')
  plt.show()
  
# print 100 most frequent words
#freq_words(movies_new['clean_plot'], 100)

nltk.download('stopwords')
  
#Let’s remove the stopwords:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# function to remove stop words
def remove_stopwords(text):
  no_stopword_text = [w for w in text.split() if w not in stop_words]
  return ' '.join(no_stopword_text)

movies_new['clean_plot'] = movies_new['clean_plot'].apply(remove_stopwords)
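
# The filter on one sentence, as a sketch that runs without downloading anything. toy_stop_words is a tiny hand-picked set used only for this illustration; the notebook itself uses NLTK's English list.

```python
# Hypothetical mini stopword list (NOT the NLTK list)
toy_stop_words = {"the", "a", "is", "in", "of", "and"}

def remove_stopwords(text, stop_words=toy_stop_words):
    # keep only the tokens that are not in the stopword set
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stopwords("the hero is lost in a city of glass"))
# hero lost city glass
```

# Storing the stopwords in a set keeps each membership test fast, which adds up over tens of thousands of plots.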

#freq_words(movies_new['clean_plot'], 100)

#Converting Text to Features

#We will treat this multi-label classification problem as a Binary Relevance problem, i.e. one independent binary classifier per genre. Hence, we will now one-hot encode the target variable, genre_new, using sklearn’s MultiLabelBinarizer( ). Since there are 363 unique genre tags, there are going to be 363 new target variables.

from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(movies_new['genre_new'])

# transform target variable
y = multilabel_binarizer.transform(movies_new['genre_new'])
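
# What the binarizer produces, shown on three made-up movies. Each row gets one column per genre (sorted alphabetically), with a 1 wherever the movie carries that tag.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists for three movies
toy_labels = [["Drama", "Comedy"], ["Action"], ["Drama"]]

mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(toy_labels)

print(list(mlb.classes_))   # ['Action', 'Comedy', 'Drama']
print(y_toy.tolist())       # [[0, 1, 1], [1, 0, 0], [0, 0, 1]]

# inverse_transform maps the indicator rows back to tuples of tags
print(mlb.inverse_transform(y_toy))
```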

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)

# I have used the 10,000 most frequent words in the data as my features. You can try any other number as well for the max_features parameter. max_df=0.8 ignores words that appear in more than 80% of the plots.

# split dataset into training and validation set
xtrain, xval, ytrain, yval = train_test_split(movies_new['clean_plot'], y, test_size = 0.2, random_state=9)

# Now we can create features for the training and validation sets. The vectorizer is fit on the training plots only and then applied to the validation plots.

#create TF-IDF features
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)
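
# A toy illustration of the fit-on-train, transform-on-validation pattern (the documents are made up, and default TfidfVectorizer settings are used instead of max_df/max_features). Words that only occur in the validation text are not in the vocabulary and are silently ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["detective hunts killer",
              "lovers meet paris",
              "killer stalks lovers"]
val_docs = ["detective meets spy paris"]

vec = TfidfVectorizer()
X_train_toy = vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
X_val_toy = vec.transform(val_docs)           # reuses them, learns nothing new

print(X_train_toy.shape, X_val_toy.shape)     # (3, 7) (1, 7)
print("spy" in vec.vocabulary_)               # False
print(X_val_toy.nnz)                          # 2 known words: detective, paris
```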

#Build Your Movie Genre Prediction Model
#We are all set for the model building part! This is what we’ve been waiting for.

#Remember, we will have to build a model for every one-hot encoded target variable. Since we have 363 target variables, we will have to fit 363 different models with the same set of predictors (TF-IDF features).

#As you can imagine, training 363 models can take a considerable amount of time on a modest system. Hence, I will build a Logistic Regression model as it is quick to train on limited computational power:

from sklearn.linear_model import LogisticRegression

# Binary Relevance
from sklearn.multiclass import OneVsRestClassifier

# Performance metric
from sklearn.metrics import f1_score

# We will use scikit-learn’s OneVsRestClassifier class to solve this problem as a Binary Relevance or one-vs-all problem:

lr = LogisticRegression()
clf = OneVsRestClassifier(lr)

# Finally, fit the model on the training set:

clf.fit(xtrain_tfidf, ytrain)

# make predictions for validation set
y_pred = clf.predict(xval_tfidf)
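
# f1_score is imported above but not used yet. One way to score multi-label predictions is the micro-averaged F1, sketched here on synthetic data from make_multilabel_classification (an assumption for illustration; these are not the movie features). In the notebook the analogous call would be f1_score(yval, y_pred, average="micro").

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: X are features, Y is a 0/1 indicator matrix
X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=4, random_state=0)
X_tr, X_va, Y_tr, Y_va = train_test_split(X, Y, test_size=0.2, random_state=9)

# One logistic regression per label (Binary Relevance)
toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
toy_clf.fit(X_tr, Y_tr)
Y_hat = toy_clf.predict(X_va)

# Micro-averaging pools true/false positives over all labels before computing F1
micro_f1 = f1_score(Y_va, Y_hat, average="micro")
print(round(micro_f1, 3))
```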

#Let’s check out a sample from these predictions:

y_pred[3]

#It is a binary one-dimensional array of length 363, with a 1 at the position of every predicted genre tag. We will have to find a way to convert it into movie genre tags.

#Luckily, scikit-learn comes to our rescue once again. We will use the inverse_transform( ) function along with the MultiLabelBinarizer( ) object to convert the predicted arrays into movie genre tags:

multilabel_binarizer.inverse_transform(y_pred)[3]

def infer_tags(q):
    q = clean_text(q)
    q = remove_stopwords(q)
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)
  
for i in range(5):
  k = xval.sample(1).index[0]
  print("Movie: ", movies_new['movie_name'][k], "\nPredicted genre: ", infer_tags(xval[k]))
  print("Actual genre: ", movies_new['genre_new'][k], "\n")
  

42303it [00:00, 52566.50it/s]
100%|██████████| 42303/42303 [00:00<00:00, 812709.92it/s]

Movie:  Shrimad Virat Veerabrahmendra Swami Charitra 
Predicted genre:  [('Drama',)]
Actual genre:  ['History', 'Biographical film', 'Drama', 'Musical'] 

Movie:  The Corn is Green 
Predicted genre:  [('Drama',)]
Actual genre:  ['Film adaptation', 'Drama', 'Television movie'] 

Movie:  Someone to Watch Over Me 
Predicted genre:  [()]
Actual genre:  ['Crime Fiction', 'Thriller', 'Mystery', 'Drama', 'Suspense', 'Crime Thriller', 'Romantic drama'] 

Movie:  The Medallion 
Predicted genre:  [('Action', 'Action/Adventure')]
Actual genre:  ['Thriller', 'Fantasy Adventure', 'Buddy film', 'Adventure', 'Action Comedy', 'Action/Adventure', 'Martial Arts Film', 'Fantasy', 'Comedy', 'Action', 'Chinese Movies'] 

Movie:  A Pure Country Gift 
Predicted genre:  [('Drama', 'Musical')]
Actual genre:  ['Romantic drama', 'Romance Film', 'Drama'] 
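
# An empty prediction such as [()] means no genre reached the default decision threshold of 0.5. One possible remedy, sketched here on synthetic data (an assumption, not the movie features), is to take predict_proba and apply a lower cut-off. The value 0.3 is an arbitrary illustration and would need tuning on validation data.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=4, random_state=0)
X_tr, X_va, Y_tr, Y_va = train_test_split(X, Y, test_size=0.2, random_state=9)

toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
toy_clf.fit(X_tr, Y_tr)

# Default predictions (equivalent to a 0.5 probability cut-off)
Y_hat = toy_clf.predict(X_va)

# Per-label probabilities, thresholded at a lower, hypothetical value
threshold = 0.3
proba = toy_clf.predict_proba(X_va)
Y_hat_low = (proba >= threshold).astype(int)

print("labels predicted at 0.5:", int(Y_hat.sum()))
print("labels predicted at 0.3:", int(Y_hat_low.sum()))
print("micro-F1 at 0.3:", round(f1_score(Y_va, Y_hat_low, average="micro"), 3))
```

# A lower threshold predicts more tags per movie, trading precision for recall; whether that improves F1 depends on the data.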

