<h1>AI-Exam</h1>
<p>For this exam project we decided to work with natural language processing, i.e. text processing. We wanted to determine whether a strong correlation exists between the overview/description of a movie and its genre. The overall idea was to make the genre of each movie our <i>dependent variable</i>, transforming all genres into numeric values so that each number represents a different genre. We would then break down the description of each movie, first removing all stop words, i.e. words that are considered to carry no significant or descriptive meaning in a text. Next, we would check how frequently the remaining words appear in each movie description and assign them a "weight" accordingly. This way we would be able to predict the genre of a movie based on the movie description itself.</p>

<h2>The dataset</h2>
We downloaded a dataset from Kaggle, which can be found [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata). The dataset contains movies from [IMDb](https://www.imdb.com/) and metadata about them. Most importantly, it contains movies with their corresponding genres and a brief description of each. Initially we used another dataset in which each movie was associated with one genre, but we got very poor results and opted to change dataset, because we believed the poor results were primarily a side effect of the size of the dataset. This was somewhat confirmed by the change, seeing as our models improved across the board after moving to a larger dataset. However, the change did not come without issues. As briefly mentioned, in the initial dataset every movie was related to only <b>one</b> genre, whereas in the new dataset a movie can be related to <b>several</b> genres. This is an issue, as we can only have one dependent variable. To get around it, we decided that the first genre in the genre array would be the movie's genre, well knowing that the result could be skewed a bit, as there is no guarantee that the first genre in the array is the one that fits the movie best. Furthermore, it also means that our algorithm may guess the second, third or even fourth genre of a movie, and the prediction would still be counted as wrong in our confusion matrix.
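
The first-genre rule described above can be sketched as follows. This is a minimal illustration, assuming the `genres` column holds a JSON-encoded list of objects with a `name` key; the helper name `first_genre` and the sample string are made up for the example and are not part of the notebook.

```python
import json

def first_genre(genres_json):
    """Return the name of the first listed genre, or None if there is none."""
    try:
        genres = json.loads(genres_json)
    except (TypeError, ValueError):
        # Missing values (e.g. NaN) or malformed strings end up here.
        return None
    if not isinstance(genres, list) or not genres:
        return None
    first = genres[0]
    return first.get("name") if isinstance(first, dict) else None

# Hypothetical sample in the assumed format.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(first_genre(sample))
print(first_genre("[]"))
```

Only the first entry is used, so a movie listed as both Action and Adventure is treated purely as Action, which is the simplification discussed above.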

In [6]:
#Imports

import pandas as pd
import re # Regular expressions
import json
import numpy as np
import nltk #Natural language processing tool kit
from nltk.corpus import stopwords

# Modelling
from sklearn import model_selection, preprocessing, naive_bayes, metrics, svm
from sklearn.model_selection import train_test_split
from sklearn import decomposition, ensemble

# Data preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Validation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


<h2>Helper methods</h2>
<p>Three different helper methods have been developed during this project to get the best possible results.</p>

<h3>train_model</h3>
<p>The 'train_model' method was developed to keep an overview, seeing as we are training a number of different models that are all trained in the same way. The method takes the classifier, the training features, the training labels and the test features as parameters, and returns the accuracy on the test set. This allows us to train several models in bulk whilst maintaining an overview.</p>

<h3>convert_column_array_to_normal_array</h3>
<p>As the method name suggests, the method takes a column-shaped array, i.e. one with the shape (n, 1), and converts it to a flat, one-dimensional list of n values. This is needed for some of the training models.</p>
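
As a small illustration of the conversion described above, here is a sketch in plain Python, assuming the column is given as a list of single-element rows. The name `flatten_column` is hypothetical and only used for this example.

```python
def flatten_column(column):
    """Turn a column such as [[3], [1], [2]] into the flat list [3, 1, 2]."""
    return [row[0] for row in column]

column = [[3], [1], [2]]
flat = flatten_column(column)
print(flat)
```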

<h3>transform_genre</h3>
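<p>The 'transform_genre' method parses the JSON-encoded genre string of each movie and collects the genre names. The indexes of rows that cannot be parsed are returned separately, so they can be deleted later.</p>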

In [7]:
# Helper methods
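def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    
    # fit the classifier on the training features and the training labels
    classifier.fit(feature_vector_train, label)
    
    # predict the labels of the validation dataset
    pred = classifier.predict(feature_vector_valid)

    # compare the predictions with the true validation labels;
    # y_test is the module-level variable created in the pre-processing cell
    return metrics.accuracy_score(y_test, pred)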

def convert_column_array_to_normal_array(array):
    # flatten the input, e.g. an (n, 1) column, into a plain list of n values
    return np.ravel(array).tolist()

def transform_genre(array):
    # Parse the JSON-encoded genre string of every row into a list of genre names.
    arr = []
    failedIndexes = []
    for counter, genreObject in enumerate(array):
        try:
            genreStr = genreObject[0]
            names = [jsonobj["name"] for jsonobj in json.loads(genreStr)]
            if not names:
                raise ValueError("no genres listed")
            arr.append(names)
        except (TypeError, ValueError, KeyError, IndexError):
            # remember the rows that are missing, empty or malformed
            failedIndexes.append(counter)
    failedIndexes = failedIndexes[::-1] # highest index first, ready for deletion
    return arr, failedIndexes

<h2>Data pre-processing</h2>
<p>Brief explanations of choices and challenges</p>

In [8]:
moviesDF = pd.read_csv('tmdb_movies.csv') 

# RETURN TO THIS LATER
# moviesDF.groupby('genres').size() # prints how many of each genre there exists


y = moviesDF[['genres']] # our dependent variable

X = moviesDF[['overview']] # our independent variable

X_list = X.values.tolist()
y_list = y.values.tolist()

#result, failed = TransformGenre(y_list)

y_ = [] # temporary list for the genres that aren't empty
failedIndexes = [] # keeps track of the indexes with an empty genre, used later for deletion

counter = 0
# We iterate over y_list to find all genres that aren't empty. An empty genre raises an exception,
# which triggers our except clause. The except clause saves the index of the empty genre and advances the counter.
while counter < len(y_list):
    try:
        genreObject = y_list[counter]
        genreStr = genreObject[0]
        strstr = genreStr.split()
        indexedstr = strstr[3]
        test = re.findall(r'\w+', indexedstr)    
        y_.append(test[0])
        counter += 1 
    except:
        failedIndexes.append(counter)
        counter += 1
        
failedIndexes.sort()
failedIndexes = failedIndexes[::-1] # reversing the list
# Deleting an index from a Python list shifts all later elements down, so we delete the highest index first to keep the remaining indexes valid.
for index in failedIndexes:
    del X_list[index]
    
y_list = y_ 

X_ =  [] # temp list to contain all strings
failedIndexes = []
counter = 0
# Currently X_list contains a list of lists; this kind of nesting is incompatible with the vectorizers,
# so we use this small loop to extract the strings.
while counter < len(X_list):
    listLine = X_list[counter]
    listString = listLine[0]
    if not listString:
        failedIndexes.append(counter)
        counter += 1
        continue
    X_.append(listString)
    counter += 1
    
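# The empty overviews were never added to X_, so only their genres have to be removed
# from y_list (highest index first) to keep both lists aligned.
failedIndexes = failedIndexes[::-1]
for i in failedIndexes:
    del y_list[i]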
    
X_list = X_

# We had NaN values in our X_list, so we convert X_list back to a dataframe
# in order to run isnull(), which returns True/False for whether each entry is NaN.
# With this we simply save the indexes at which a NaN occurred and delete
# them from both our dependent and independent variable.

tempDf = pd.DataFrame(X_list)
boolList = tempDf.isnull().values

badIndexes = []
counter = 0
while counter < len(boolList):
    if boolList[counter][0]:
        badIndexes.append(counter)               
    counter += 1

badIndexes = badIndexes[::-1]

for i in badIndexes:
    del X_list[i]
    del y_list[i]
    
        

# we are splitting our dataset up here for training and later validation.
X_train, X_test, y_train, y_test = train_test_split(X_list,y_list) 

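# Encoder to encode our dependent variable.
# It is fitted once on all labels, so the same genre maps to the same number in both splits.
encoder = preprocessing.LabelEncoder()
encoder.fit(y_list)

y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)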
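
The index bookkeeping in the cell above exists to keep the overviews and the genres aligned while unusable rows are dropped. A minimal sketch of the same idea follows, assuming two parallel lists; `drop_indexes` and the sample data are hypothetical and only serve the illustration.

```python
def drop_indexes(items, bad_indexes):
    """Delete the given positions from items in place, highest index first."""
    for index in sorted(set(bad_indexes), reverse=True):
        del items[index]
    return items

# Made-up parallel lists: position i in both lists describes the same movie.
overviews = ["a heist goes wrong", None, "two friends travel", ""]
genres = ["Crime", "Drama", "Comedy", "Action"]

# Positions whose overview is missing or empty.
bad = [i for i, text in enumerate(overviews) if not text]

drop_indexes(overviews, bad)
drop_indexes(genres, bad)
print(list(zip(overviews, genres)))
```

Deleting from the highest index downwards means an earlier deletion never shifts the position of a row that still has to be removed, and applying the same indexes to both lists keeps every overview paired with its own genre.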

<h2>Feature Engineering</h2>
<p>A count vector is a matrix representation of the dataset in
which every row represents a document from the corpus,
every column represents a term from the corpus,
and every cell holds the frequency count of a particular term in a particular document.

Here we create the count vectorizer object.
Setting analyzer='word' means that the features are built from words rather than from characters ('char') or from characters within word boundaries ('char_wb').
With analyzer='word' we can also choose a token pattern, expressed as a regular expression;
in this case '\w{1,}' matches any word of at least one character.</p>
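
To make the matrix layout concrete, here is a minimal pure-Python sketch of a count vector, assuming the same `\w{1,}` token pattern and a small hand-picked stop-word set. The function `count_vectors` and the two sample sentences are hypothetical; the notebook itself relies on scikit-learn's `CountVectorizer`.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\w{1,}")

def count_vectors(documents, stop_words=()):
    """Return (vocabulary, matrix) with one row per document, one column per term."""
    stop_words = set(stop_words)
    tokenized = [
        [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in stop_words]
        for doc in documents
    ]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["A spy saves the world", "The spy who loved the spy"]
vocabulary, matrix = count_vectors(docs, stop_words={"a", "the", "who"})
print(vocabulary)
print(matrix)
```

Each row of `matrix` corresponds to one description and each column to one term in `vocabulary`, which mirrors the structure described above.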

In [9]:
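# stopwords.words() already returns a list; recent scikit-learn versions require a list (not a set) for stop_words
stop_words = stopwords.words('english')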

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', stop_words=stop_words, max_features=5000)

count_vect.fit(X_train)



# Now we are gonna transform the training and test data with our vectorized object

X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

# Now we are going to use Term Frequency - Inverse Document Frequency (TF-IDF) vectors as features.
# The TF-IDF score represents the relative importance of a term in a document and in the entire corpus.
# We generate this score in two steps:
# The first computes the normalized term frequency (TF) --- TF(x) = number of times x appears in the document / total number of terms in the document.
# The second computes the inverse document frequency (IDF) --- IDF(x) = log_e(total number of documents / number of documents with term x in it).
# As mentioned earlier, the features can also be built from word n-grams, which the n-gram vectorizer further down does.
# Here we first create the TF-IDF scores at word level.

#tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern='\w{1,}', max_features=5000)
tfidf_vect = TfidfVectorizer(encoding='utf-8',lowercase=True, stop_words=stop_words, sublinear_tf=True, use_idf=True,max_features=5000)
tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3),stop_words=stop_words,max_features=5000,lowercase=True)
tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram =  tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram =  tfidf_vect_ngram.transform(X_test)
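
The two-step TF-IDF score described in the comments of the cell above can be worked through by hand. The following is a minimal sketch that applies those two formulas literally to already tokenized documents; the function `tf_idf` and the sample tokens are hypothetical. Note that scikit-learn's `TfidfVectorizer` applies smoothing and normalisation by default, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document, using TF(x) * IDF(x)."""
    n_docs = len(tokenized_docs)

    # Number of documents that contain each term at least once.
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return scores

docs = [["spy", "saves", "world"], ["spy", "loved", "spy"]]
scores = tf_idf(docs)
print(scores)
```

In this toy corpus "spy" occurs in every document, so its IDF is log(1) = 0 and it carries no weight, while terms that occur in only one description receive a positive score.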


<h2>Training and testing</h2>
<p></p>

In [10]:
label = convert_column_array_to_normal_array(y_train)
accuracy = train_model(classifier = naive_bayes.MultinomialNB(),
                       feature_vector_train = X_train_count,
                       label = label,
                       feature_vector_valid = X_test_count)
print(f'NB, Count vectors: {accuracy}')

accuracy = train_model(classifier = naive_bayes.MultinomialNB(),
                       feature_vector_train = X_train_tfidf,
                       label = label,
                       feature_vector_valid = X_test_tfidf)
print(f'NB, tf-idf vectors: {accuracy}')


accuracy = train_model(classifier = svm.SVC(),
                       feature_vector_train = X_train_count,
                       label = label,
                       feature_vector_valid = X_test_count)
print(f'SVM, Count vectors: {accuracy}')

accuracy = train_model(classifier = svm.SVC(),
                       feature_vector_train = X_train_tfidf,
                       label = label,
                       feature_vector_valid = X_test_tfidf)
print(f'SVM, tf-idf vectors: {accuracy}')

accuracy = train_model(classifier = ensemble.RandomForestClassifier(),
                       feature_vector_train = X_train_count,
                       label = label,
                       feature_vector_valid = X_test_count)
print(f'RF, Count vectors: {accuracy}')

accuracy = train_model(classifier = ensemble.RandomForestClassifier(),
                       feature_vector_train = X_train_tfidf,
                       label = label,
                       feature_vector_valid = X_test_tfidf)
print(f'RF, tf-idf vectors: {accuracy}')

NB, Count vectors: 0.44761106454316846
NB, tf-idf vectors: 0.4082145850796312
SVM, Count vectors: 0.38390611902766136
SVM, tf-idf vectors: 0.4082145850796312
RF, Count vectors: 0.3788767812238055
RF, tf-idf vectors: 0.3897736797988265


<h2>Group</h2>
<b>- Mikkel Wexøe Ertbjerg // cph-me209@cphbusiness.dk</b>

<b>- Nikolai Sjhøholm Christiansen // cph-nc103@cphbusiness.dk</b>

<b>- Nikolaj Dyring Jensen // cph-nj183@cphbusiness.dk</b>