## Overview

### Steps

- Extract entities
- Topic modelling
- Knowledge search

### Algorithmic building blocks

### Textual data

Read textual data and add structured metadata using dependency parsing and part-of-speech (POS) tagging.

Subj -> Verb <- Object 

This (nsubj) is (verb) a (det) sentence (attr)
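
To make the labelling concrete, here is a minimal sketch that hand-labels the example sentence with the tags shown above and pulls out the subject, verb, and complement. The annotations are typed in manually and `extract_relation` is a hypothetical helper; in practice a dependency parser would supply the labels.

```python
# Hand-labelled (token, dependency label) pairs for the example sentence.
# These annotations are written manually for illustration, not produced by a parser.
tagged_sentence = [
    ("This", "nsubj"),
    ("is", "verb"),
    ("a", "det"),
    ("sentence", "attr"),
]


def extract_relation(tagged_tokens):
    """Return (subject, verb, complement) from labelled tokens (hypothetical helper)."""
    by_label = {label: token for token, label in tagged_tokens}
    # Either a direct object ("dobj") or an attribute ("attr") can complete the verb.
    complement = by_label.get("dobj", by_label.get("attr"))
    return by_label.get("nsubj"), by_label.get("verb"), complement


relation = extract_relation(tagged_sentence)
print(relation)
```

The determiner is dropped because it carries no relational information, which is what makes the labelled form useful as structured metadata.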

### Semantic triple

(Subject, Predicate, Object), the statement form used by the W3C RDF data model.
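
A small sketch of how triples might be stored and queried in plain Python. The `Triple` type, the `match` helper, and the example facts are all illustrative assumptions; they are not part of RDF itself or of the dataset.

```python
from collections import namedtuple

# A semantic triple: (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Illustrative facts only; not taken from the dataset.
triples = [
    Triple("Alice", "directed", "Film A"),
    Triple("Bob", "acted_in", "Film A"),
    Triple("Alice", "directed", "Film B"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


directed_by_alice = match(triples, subject="Alice", predicate="directed")
print([t.object for t in directed_by_alice])
```

Leaving a field as `None` turns it into a wildcard, so the same helper answers "what did Alice direct?" and "who is connected to Film A?".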


## Preprocessing data

Criteria for a good dataset:

- Long, well-written texts covering diverse topics
- Extensive and well maintained
- Available

### Exploratory Data Analysis

In [None]:
from pathlib import Path
import pandas as pd

In [None]:
root_dir = Path.cwd()
root_dir

In [None]:
%matplotlib inline

In [None]:
# Read csv
data_file = root_dir / 'data' / 'wiki_movie_plots_deduped.csv'
movie_plots_data = pd.read_csv(data_file)

In [None]:
movie_plots_data.shape

In [None]:
movie_plots_data.head(5)

### Visualize country of origin

In [None]:
movie_plots_data.groupby(['Origin/Ethnicity']).size().sort_values(ascending=True).plot.barh(figsize=(4,8))

In [None]:
# Most movies are American or British, so the dataset has a Western bias

### Visualize release year trend

In [None]:
movie_plots_data.groupby(["Release Year"]).size().plot(kind='bar', figsize=(22,4), grid=True)

### Visualize genre breakdown

In [None]:
movie_plots_data[movie_plots_data['Genre'] != 'unknown'].groupby(['Genre']).size().sort_values(ascending=True).tail(25).plot.barh(figsize=(4,8), grid=True)

In [None]:
# TODO: ignore both spellings of the placeholder ('unknown' and 'Unknown')
# The counts suggest a roughly exponential distribution of movies per genre
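
One possible way to handle both spellings is to lower-case the column before comparing. The sketch below uses a made-up toy frame rather than the real data, so the column values are assumptions.

```python
import pandas as pd

# Toy data standing in for the movie table (values are made up).
toy = pd.DataFrame({"Genre": ["comedy", "unknown", "Unknown", "drama", "comedy"]})

# Lower-case before comparing so both spellings of the placeholder are excluded.
known = toy[toy["Genre"].str.lower() != "unknown"]
genre_counts = known.groupby("Genre").size().sort_values(ascending=True)
print(genre_counts)
```

The same mask could be applied to the `Director` column, where the placeholder is capitalized.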

### Visualize director movie count

In [None]:
movie_plots_data[movie_plots_data['Director'] != 'Unknown'].groupby(['Director']).size().sort_values(ascending=True).tail(25).plot.barh(figsize=(4,8), grid=True)

### Get subset of data released in or after 2005

In [None]:
data_subset = movie_plots_data[movie_plots_data['Release Year'] >= 2005]

In [None]:
data_subset.head(5)

In [None]:

data_subset.shape

### NLP data preprocessing

In [None]:
# Topic modelling is used here for knowledge mining:
# it looks for ways to group texts into clusters of related topics

In [None]:
import numpy as np
np.random.seed(1234)


import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')


### Select comedy films released in or after 2015

In [None]:
movie_plots = movie_plots_data.loc[
    (movie_plots_data['Release Year'] >= 2015) &
    # na=False treats missing genres as non-matches instead of propagating NaN
    (movie_plots_data['Genre'].str.contains("comedy", na=False))
].Plot

In [None]:
movie_plots.head(5)

In [None]:
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def lemmatize(text):
    # lemmatize the token as a verb, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # drop stopwords and tokens shorter than 3 characters
        if token not in STOPWORDS and len(token) > 2:
            result.append(lemmatize(token))
    return result

In [None]:
preprocessed_docs = movie_plots.map(preprocess)
preprocessed_docs.head(5)

In [None]:
# compare sentences before and after preprocessing
with pd.option_context('display.max_colwidth', None):
    display(movie_plots.head(1))
    display(preprocessed_docs.head(1))

### Create bag of words


In [None]:
# Build the dictionary (token -> integer id) that the bag-of-words corpus is based on
dictionary = gensim.corpora.Dictionary(preprocessed_docs)

In [None]:
# Filter out extreme tokens:
# - drop tokens that appear in fewer than 10 documents
# - drop tokens that appear in more than 50% of documents (similar in spirit to TF-IDF)
# - then keep only the 100,000 most frequent remaining tokens
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100_000)
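
The filtering idea can be sketched in plain Python. This is an illustration of the document-frequency rule described in the comments, not gensim's implementation; the helper name, the toy documents, and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above, keep_n):
    """Keep tokens that are neither too rare nor too common (illustrative sketch)."""
    # Document frequency: the number of documents in which each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors; ties are broken alphabetically.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]


toy_docs = [
    ["love", "friend", "wedding"],
    ["love", "friend", "heist"],
    ["love", "road", "trip"],
    ["love", "friend", "road"],
]
vocab = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(vocab)
```

In the toy corpus, "love" appears in every document and is dropped as too common, while the tokens that appear only once are dropped as too rare.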

In [None]:
# create bow corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]

print(len(bow_corpus))
preprocessed_docs.shape
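
Each entry of the corpus is a list of `(token_id, count)` pairs. A rough plain-Python sketch of that conversion, with a made-up vocabulary and a hypothetical `to_bow` helper, looks like this:

```python
from collections import Counter

# Toy vocabulary mapping token -> integer id (made up for illustration).
token2id = {"friend": 0, "love": 1, "road": 2}


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bow = to_bow(["love", "road", "love", "heist"], token2id)
print(bow)
```

Tokens that were filtered out of the vocabulary, such as "heist" here, simply do not appear in the result, so word order and out-of-vocabulary words are both lost in this representation.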

In [None]:
# count word occurrences
word_dict_count = {}
for doc in bow_corpus:
    for i, word_info in enumerate(doc):
        word = dictionary[word_info[0]]
        print(word, word_info)