# Star Trek Captains - an NLP inquiry

The goal of this project was to train different classification models on lines our favorite captains have said and then see how the models fare on yet unseen bits of wisdom.  

The scripts of the following series were included:
- Star Trek: The Original Series (TOS)
- Star Trek: The Next Generation (TNG)
- Star Trek: Deep Space Nine (DS9)
- Star Trek: Voyager (VOY)
- Star Trek: Enterprise (ENT)

All lines of the following captains were included, regardless of whether they were said in the captain's own series or in a crossover episode:
- TOS: James T. Kirk (KIRK)
- TNG: Jean-Luc Picard (PICARD)
- DS9: Benjamin Sisko (SISKO)
- VOY: Kathryn Janeway (JANEWAY)
- ENT: Jonathan Archer (ARCHER)

Notes: Lines from mirror/alternate universe characters were counted with their prime-universe characters. Lines said over comm were counted. Personal and captain's logs were not counted. 

# Overview

- [Introduction](#intro)
- [Importing Packages](#import)
- [Setting Visualization Parameters](#viz_paras)
_________

- [Data Acquisition](#dataacq)
    - [Custom Scraping Functions](#csf)
    - [Data Extraction Functions](#fdataextr)
    - [Combining Scraping and Extraction Functions](#combo_scrape_extr)
    - [Data Scraping](#actual_scraping)
    - [Converting to Dataframe and Saving to CSV](#convert_save)
    
___________
- [Data Cleaning](#dataclean)
    - [Reading in Data](#read_in_data)
    - [Looking at Raw Data](#look_at_raw_data)
    - [Dropping missing pages](#drop_missing_pages)
    - [Separating Lines per Character](#sep_lines_char)
    - [Clean up Lines](#clean_up_lines)
____________
- [EDA](#eda)
    - [Preparation of dataframes](#prep_df)
    - [Total Word Counts and Average Number of Words per Line](#totalwc_avg_num_words)
    - [Count all words including common English words](#wordcount_no_stopwords)
    - [Analysing Frequency of Very Common Words](#analysing_freq_v_common)
    - [Count all words excluding common English words](#wordcount_with_stopwords)
    - [Wordclouds](#wordclouds)

___________    
    
**Basic Models**

- [Preparing Data for Basic Modelling](#prep_base_data)
    - [Baseline Accuracy](#baseline)
    - [Train/Test Splits](#ttsplits)
- [Create Pipelines](#create_pipelines)
    - [Create Vectorizers](#create_vectorizers)
    - [Create Models](#create_models)
    - [Create Pipelines with Param-Grids](#create_pipelines_2)
- [Fitting Basic Models](#basic_models)
    - [Modelling Functions](#modelling_functions)
    - [Fit Basic Models (Vectorized Lines) for Kirk vs Picard](#fit_basic_models_kvp)
    - [Fit Basic Models (Vectorized Lines) for all captains](#fit_basic_models_capt)

- [Evaluating Basic Models](#eval_basic_models)
    - [Model evaluation functions](#eval_functions)
    - [Kirk vs Picard](#kirkvpicard_eval)
        - [Logistic Regression -  Kirk vs Picard](#kirkvpicard_logreg)
        - [Decision Tree -  Kirk vs Picard](#kirkvpicard_dectree)
        - [Bernoulli Naive Bayes - Kirk vs Picard](#kirkvpicard_bayes)
    - [All captains](#allc_eval)
        - [Logistic Regression - all captains](#allc_logreg_eval)
        - [Decision Tree - all captains](#allc_dectree_eval)
        
        
_________
        
**Additional features**

- [Importing GloVe Embeddings](#import_glove)
- [GloVe feature engineering functions](#create_glove_functions)
    - [Find the average word](#find_avg_word)
- [Create Features from GloVe Embeddings](#create_glove_features)
- [Combine Additional Features](#combine_adv_features)
    - [Train-Test Split](#tts_advanced)
    - [Feature Scaling](#scaling_adv)
- [Create a Model only using the Additional Features](#create_model_adv_fea)
- [Fitting a Logistic Regression on only Additional Features](#fit_model_only_adv_fea)
- [Evaluating Logistic Regression with only Additional Features](#eval_models_only_adv_fea)

_________

**Combined Models**

- [Combination of Wordvectors and Advanced Features](#combo_wordvec_adv_fea)
    - [Kirk vs Picard](#combo_kvp)
    - [All Captains](#combo_allc)
    
- [XGBoosted Random Forest](#xgb_main)
    
__________    
- [Interaction Networks](#interaction_networks)

- [Export Data for the Browser Game](#exp_data_game)

<a name="import"></a>
#### Importing packages

In [None]:
# General
import pandas as pd
import numpy as np
import time
import copy

# Scraping
import requests
from bs4 import BeautifulSoup
import re

# Feature Engineering
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import TSNE
from scipy import spatial
from nltk.tag import pos_tag
from nltk.tokenize import WordPunctTokenizer
from sklearn.preprocessing import StandardScaler 

# Modelling General
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.decomposition import PCA

# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import naive_bayes
from sklearn.metrics import accuracy_score

# Boosting
import xgboost as xgb

# Viz
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import graphviz
import scikitplot as skplt
from wordcloud import WordCloud, ImageColorGenerator
import imageio as iio

# note: the three plot_* helpers below were removed in scikit-learn 1.2, so this import needs an older version
from sklearn.metrics import plot_confusion_matrix, plot_roc_curve, plot_precision_recall_curve
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz

# Networks
import networkx as nx

<a name="viz_paras"></a>
#### Setting visualization parameters

In [None]:
plt.rcParams["font.family"] = "DIN Condensed"
plt.rcParams["font.size"]  = 20       
plt.rcParams["axes.edgecolor"] = "black"
plt.rcParams["axes.facecolor"] ="white"
plt.rcParams["figure.facecolor"] = "white"

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<a name="dataacq"></a>
## Data Acquisition

Data was scraped from <a href="http://www.chakoteya.net/StarTrek/index.html">chakoteya.net</a> using a set of custom-made functions and the packages `BeautifulSoup` and `requests`. 

**Section Overview**:
- [Custom Scraping Functions](#csf)
- [Data Extraction Functions](#fdataextr)
- [Combining Scraping and Extraction Functions](#combo_scrape_extr)
- [Data Scraping](#actual_scraping)
- [Converting to Dataframe and Saving to CSV](#convert_save)

<a name="csf"></a>
#### Custom Scraping Functions

In [None]:
def scraping_chakoteyanet_one_script(base_url, page_number):
    """ Scrapes one script page from chakoteya.net
    Arguments:
        base_url = baseline url to scrape with placeholder for page number like this: {}
        page_number = number in the url that identifies a specific page """

    URL = base_url.format(page_number)
    req = requests.get(URL)
    soup = BeautifulSoup(req.text, "html.parser")    

    return soup

In [None]:
def scraping_chakoteyanet_full_series(base_url, max_episode_number, starting_episode_number):
    """ Scrapes all scripts of one series from chakoteya.net
    Uses scraping_chakoteyanet_one_script()
    Arguments:
        base_url = baseline url to scrape with {}-placeholder for page number to iterate over
        max_episode_number = page number of the last episode
        starting_episode_number = page number of the first episode """
    
    soups_all_episodes = []
    
    # iterating over all pages associated with a series
    for i in range(starting_episode_number, max_episode_number+1):   
        page_number = i
        soup = scraping_chakoteyanet_one_script(base_url, page_number)
        soups_all_episodes.append(soup)
        
    return soups_all_episodes

<a name="fdataextr"></a>
#### Functions for Data Extraction

In [None]:
def replace_formatting_characters(text):
    """ Replaces both \r and \n in a string with a whitespace
    Arguments:
        text = text to clean"""
    text = text.replace("\r", " ")
    text = text.replace("\n", " ")
    return text

In [None]:
def extract_episode_information(soup):
    """Extracts the information on top of the page (episode title, date etc).
    Arguments:
        soup = soup of the whole page scraped"""
    return soup.find_all("p")[0].text


def extract_episode_script(soup):
    """Extracts the full text of the episode script.  
    Arguments:
        soup = soup of the whole page scraped"""
    return soup.find_all("center")[0].text


def extract_title(episode_information_cleaned):
    """Extracts the title of the episode from the cleaned information section.
    Arguments:
        episode_information_cleaned = clean episode information (no formatting characters), 
        use replace_formatting_characters() for cleaning."""
    try:
        if "Stardate" in episode_information_cleaned: # used in TOS, TNG, VOY, DS9
            episode_title = re.findall(r".+?(?=Stardate)", episode_information_cleaned)[0].strip()
        elif "Mission date" in episode_information_cleaned: # used in ENT
            episode_title = re.findall(r".+?(?=Mission date)", episode_information_cleaned)[0].strip()
        else: # some episodes do not have a star date assigned
            episode_title = episode_information_cleaned.strip()
    except (IndexError, TypeError, AttributeError):
        episode_title = np.nan    
    
    return episode_title


def extract_stardate(episode_information_cleaned):
    """Extracts the stardate of the episode (if known) from the cleaned information section.
    Arguments:
        episode_information_cleaned = clean episode information (no formatting characters), 
        use replace_formatting_characters() for cleaning."""
    try:
        episode_stardate = re.findall(r"Stardate:\s+(\S+)", episode_information_cleaned)[0].strip()
    except (IndexError, TypeError):
        episode_stardate = np.nan     
    
    return episode_stardate


def extract_mission_date(episode_information_cleaned):
    """Extracts the mission date of the episode (if known) from the cleaned information section.
    (This only applies to ENT.)
    Arguments:
        episode_information_cleaned = clean episode information (no formatting characters), 
        use replace_formatting_characters() for cleaning."""
    try:
        episode_missiondate = re.findall(r"Mission [dD]ate:\s+(.+)Ori", episode_information_cleaned)[0].strip()
    except (IndexError, TypeError):
        episode_missiondate = np.nan     
    
    return episode_missiondate


def extract_airdate(episode_information_cleaned):
    """Extracts the original air date of the episode (if known) from the cleaned information section.
    Arguments:
        episode_information_cleaned = clean episode information (no formatting characters), 
        use replace_formatting_characters() for cleaning."""
    try:
        episode_original_airdate = re.findall(r"Original Airdate:\s(.+)", 
                                              episode_information_cleaned)[0].strip()
    except (IndexError, TypeError):
        episode_original_airdate = np.nan   
        
    return episode_original_airdate
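
To make the extraction patterns above concrete, here is a minimal sketch that applies the same regular expressions to a single header string. The header layout and the variable `toy_header` are assumptions for illustration; the real pages may be formatted differently.

```python
import re

# Hypothetical header text after line breaks were replaced by spaces (layout is an assumption)
toy_header = "The Man Trap  Stardate: 1513.1  Original Airdate: 8 Sep, 1966"

# everything before the word "Stardate" is treated as the title
title = re.findall(r".+?(?=Stardate)", toy_header)[0].strip()

# the first block of non-whitespace characters after "Stardate:" is the stardate
stardate = re.findall(r"Stardate:\s+(\S+)", toy_header)[0]

# everything after "Original Airdate:" is the air date
airdate = re.findall(r"Original Airdate:\s(.+)", toy_header)[0].strip()

print(title, "|", stardate, "|", airdate)
```

If a pattern finds nothing, `re.findall` returns an empty list and the `[0]` lookup raises an `IndexError`, which is why the extraction functions wrap these calls in `try`/`except`.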

<a name="combo_scrape_extr"></a>
#### Combining Scraping and Extraction Functions

In [None]:
def soup_to_series_dictionary(series_name, base_url, series_dictionary, max_episode_number, 
                              starting_episode_number):
    
    """ Calls the scraping function for all episodes of a series, iterates over the raw soups and calls the
    extraction functions for title, stardate, mission date, original airdate and script.
    Arguments:
        series_name = name of the series scraped for print statement
        base_url = baseline url to scrape with {}-placeholder for page number to iterate over
        max_episode_number = page number of the last episode
        starting_episode_number = page number of the first episode
        series_dictionary = empty dictionary to fill with series information."""
    
    # gather full page soups using the functions defined above
    soups_all_episodes = scraping_chakoteyanet_full_series(base_url, max_episode_number, starting_episode_number)
    
    # iterate over the scraped soups to extract information
    for i, soup in enumerate(soups_all_episodes):
        
        try:
            episode_information = extract_episode_information(soup)  # information such as title and star date
            episode_information_cleaned = replace_formatting_characters(episode_information)

            episode_script = extract_episode_script(soup) # actual episode script
            episode_script_cleaned = replace_formatting_characters(episode_script)
            
        except Exception:
            # page could not be parsed: fill the row with NaN and skip to the next page
            for key in series_dictionary.keys():
                series_dictionary[key].append(np.nan)  
            continue
        
        episode_title = extract_title(episode_information_cleaned)
        episode_stardate = extract_stardate(episode_information_cleaned)
        episode_mission_date = extract_mission_date(episode_information_cleaned) # only used in ENT
        episode_original_airdate = extract_airdate(episode_information_cleaned)
        
        episode_production_number = i+starting_episode_number # prod number is not always order of episodes aired

        # collect the information in a dictionary
        series_dictionary["title"].append(episode_title)
        series_dictionary["stardate"].append(episode_stardate)
        series_dictionary["mission_date"].append(episode_mission_date)
        series_dictionary["original_airdate"].append(episode_original_airdate)
        series_dictionary["production_number"].append(episode_production_number)
        series_dictionary["script"].append(episode_script_cleaned)
        
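    # report the number of entries instead of printing the whole dictionary with all scripts
    return f"Successfully scraped {series_name}: dictionary now holds {len(series_dictionary['title'])} entries"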

<a name="actual_scraping"></a>
#### Scraping 

Data was scraped from <a href="http://www.chakoteya.net/StarTrek/index.html">chakoteya.net</a> for the series TOS, TNG, DS9, VOY, ENT. 

In [None]:
# template dictionary for datastructure
dict_series = {"title": [],
            "stardate" : [],
            "mission_date" : [],
            "original_airdate": [],
            "production_number": [],
            "script": []}

In [None]:
# scraping the original series
dict_TOS = copy.deepcopy(dict_series)
url_TOS = "http://www.chakoteya.net/StarTrek/{}.htm"

soup_to_series_dictionary(series_name = "TOS", base_url = url_TOS, series_dictionary = dict_TOS, 
                          max_episode_number = 79, starting_episode_number = 1)
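
The deep copy matters because the values of the template are lists. A minimal sketch with a hypothetical two-key `template` (a stand-in for `dict_series`) shows the difference between `dict.copy()` and `copy.deepcopy()`:

```python
import copy

# Hypothetical two-key stand-in for the dict_series template
template = {"title": [], "script": []}

# deep copy: the new dictionary gets its own lists
deep = copy.deepcopy(template)
deep["title"].append("Emissary")
print(template["title"])   # the template is still empty here

# shallow copy: the new dictionary shares its lists with the template
shallow = template.copy()
shallow["title"].append("Broken Bow")
print(template["title"])   # the template now contains the appended title
```

With a shallow copy, every series scraped afterwards would start from a template that is no longer empty.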

In [None]:
# scraping the next generation
dict_TNG = copy.deepcopy(dict_series)
url_TNG = "http://www.chakoteya.net/NextGen/{}.htm"

soup_to_series_dictionary(series_name = "TNG", base_url = url_TNG, series_dictionary = dict_TNG, 
                          max_episode_number = 277, starting_episode_number = 101)

In [None]:
# scraping Deep Space 9
dict_DS9 = copy.deepcopy(dict_series)
url_DS9 = "http://www.chakoteya.net/DS9/{}.htm"

soup_to_series_dictionary(series_name = "DS9", base_url = url_DS9, series_dictionary = dict_DS9, 
                          max_episode_number = 575, starting_episode_number = 401)

In [None]:
# scraping Voyager season by season (separately because of gaps in URL numbering)
dict_VOY = copy.deepcopy(dict_series)
url_VOY = "http://www.chakoteya.net/Voyager/{}.htm"

# (first page number, last page number) for seasons one to seven
voy_season_page_ranges = [(101, 119), (201, 225), (301, 321), (401, 423),
                          (501, 525), (601, 625), (701, 722)]

for starting_episode_number, max_episode_number in voy_season_page_ranges:
    soup_to_series_dictionary(series_name = "VOY", base_url = url_VOY, series_dictionary = dict_VOY, 
                              max_episode_number = max_episode_number, 
                              starting_episode_number = starting_episode_number)

In [None]:
# scraping Enterprise 01 to 09 (separately because of additional 0 in URL)
dict_ENT = copy.deepcopy(dict_series)  # deep copy, so the lists are not shared with the template
url_ENT = "http://www.chakoteya.net/Enterprise/0{}.htm"

soup_to_series_dictionary(series_name = "ENT", base_url = url_ENT, series_dictionary = dict_ENT, 
                          max_episode_number = 9, starting_episode_number = 1)


# scraping Enterprise 10 to 98
url_ENT = "http://www.chakoteya.net/Enterprise/{}.htm"

soup_to_series_dictionary(series_name = "ENT", base_url = url_ENT, series_dictionary = dict_ENT, 
                          max_episode_number = 98, starting_episode_number = 10)
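
As an alternative to two separate calls, the leading zero could come from the format specification itself. This is only a sketch: `url_ENT_padded` is a hypothetical name, and no request is sent.

```python
# Hypothetical template: {:02d} pads single-digit page numbers with a leading zero
url_ENT_padded = "http://www.chakoteya.net/Enterprise/{:02d}.htm"

example_urls = [url_ENT_padded.format(page_number) for page_number in (1, 9, 10, 98)]
for url in example_urls:
    print(url)
```

Because `scraping_chakoteyanet_one_script()` only calls `base_url.format(page_number)`, a template like this would work with integer page numbers.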

<a name="convert_save"></a>
#### Converting to Dataframe and Saving to CSV

In [None]:
# converting dictionaries to pandas dataframes
df_TOS_scraped = pd.DataFrame(dict_TOS)
df_TNG_scraped = pd.DataFrame(dict_TNG)
df_DS9_scraped = pd.DataFrame(dict_DS9)
df_VOY_scraped = pd.DataFrame(dict_VOY)
df_ENT_scraped = pd.DataFrame(dict_ENT)

In [None]:
saving_destiny = "../scraped_csvs/"

df_TOS_scraped.to_csv(saving_destiny + "_scripts_TOS.csv")
df_TNG_scraped.to_csv(saving_destiny + "_scripts_TNG.csv")
df_DS9_scraped.to_csv(saving_destiny + "_scripts_DS9.csv")
df_VOY_scraped.to_csv(saving_destiny + "_scripts_VOY.csv")
df_ENT_scraped.to_csv(saving_destiny + "_scripts_ENT.csv")

<a name="dataclean"></a>
## Data Cleaning

In this section the gathered episode scripts were processed to collect all lines of the characters of interest and to remove non-spoken text such as stage directions.

**Section overview**:
- [Reading in Data](#read_in_data)  
- [Looking at Raw Data](#look_at_raw_data)  
- [Dropping Missing Pages](#drop_missing_pages)  
- [Separating Lines per Character](#sep_lines_char)  
- [Clean up Lines](#clean_up_lines)   

<a name="read_in_data"></a>
#### Reading in Data

In [None]:
# reading in scraped data from csv
df_TOS_raw = pd.read_csv("./scraped_csvs/_scripts_TOS.csv", index_col=0)
df_TNG_raw = pd.read_csv("./scraped_csvs/_scripts_TNG.csv", index_col=0)
df_DS9_raw = pd.read_csv("./scraped_csvs/_scripts_DS9.csv", index_col=0)
df_VOY_raw = pd.read_csv("./scraped_csvs/_scripts_VOY.csv", index_col=0)
df_ENT_raw = pd.read_csv("./scraped_csvs/_scripts_ENT.csv", index_col=0)

In [None]:
# dropping empty columns (ENT did not use stardate yet)
df_ENT_raw.drop(columns=["stardate"], inplace=True)

<a name="look_at_raw_data"></a>
#### Looking at Raw Data

In [None]:
df_TOS_raw.head()

In [None]:
df_TNG_raw.head()

In [None]:
df_DS9_raw.head()

In [None]:
df_VOY_raw.head()

In [None]:
df_ENT_raw.head()

<a name="drop_missing_pages"></a>
#### Dropping Missing Pages

In [None]:
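def drop_bad_htm_pages(dataframe_raw):
    """Drops (in place) all rows whose title contains "WordPress", i.e. pages without a script.
    Arguments:
        dataframe_raw = dataframe with one scraped page per row and a title column."""
    # boolean mask instead of positions, so missing titles and non-default indices do not break the drop
    is_bad_page = dataframe_raw["title"].astype(str).str.contains("WordPress", regex=False)
    dataframe_raw.drop(index=dataframe_raw[is_bad_page].index, inplace=True)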

In [None]:
# dropping htm pages that did not contain information 
# these exist because of skipped page-numbers in case of double-length episodes
drop_bad_htm_pages(df_TOS_raw)
drop_bad_htm_pages(df_TNG_raw)
drop_bad_htm_pages(df_DS9_raw)
drop_bad_htm_pages(df_VOY_raw)
drop_bad_htm_pages(df_ENT_raw)

<a name="sep_lines_char"></a>
#### Separating Lines per Character

In [None]:
# Designing regex (raw string, so the backslashes reach the regex engine unchanged)
seperate_lines_regex = r"(Personal\s+log|Captain's\s+log|[A-Z]{2,} \[OC\]|Q:|[A-Z]{2,})(.+?(?=[A-Z]{2,}|Q:|[A-Z\s0-9]{2,}\s+\[.+\]|$))"

**Explanation of the regex**

Regex: ```(Personal\s+log|Captain's\s+log|[A-Z]{2,} \[OC\]|Q:|[A-Z]{2,})(.+?(?=[A-Z]{2,}|Q:|[A-Z\s0-9]{2,}\s+\[.+\]|$))```


The basic structure consists of two capture groups:  

Group 1: `(Personal\s+log|Captain's\s+log|[A-Z]{2,} \[OC\]|Q:|[A-Z]{2,})`

This group consists of five alternative patterns (`|` = or):
- `Personal\s+log` matches lines that are a personal log entry, since those are not always preceded by the speaking character's name.
- `Captain's\s+log` matches lines that are a captain's log entry, since those are not always preceded by the speaking character's name.
- `[A-Z]{2,} \[OC\]` matches any character name (two or more capital letters) that is followed by `[OC]`, which stands for "over comm". 
- `Q:` matches the character Q speaking, whose one-letter name is too short for the general name pattern.
- `[A-Z]{2,}` matches any character name (two or more capital letters).

Group 2: `(.+?(?=[A-Z]{2,}|Q:|[A-Z\s0-9]{2,}\s+\[.+\]|$))`
- The goal of group 2 is to capture the line a character said. It captures everything up to one of the specified lookahead patterns. 
- `.+?` lazily matches everything up to the first position where one of the lookahead patterns matches.
- `[A-Z]{2,}`, `Q:`, `[A-Z\s0-9]{2,}\s+\[.+\]` match possible patterns for the name of the character speaking next.
- `$` matches the end of a script.
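
The behaviour of the two capture groups can be checked on a short fragment. The following is a minimal sketch: `toy_script` is made up and not taken from a real episode, and the pattern is the one explained above.

```python
import re

# Same pattern as explained above
seperate_lines_regex = r"(Personal\s+log|Captain's\s+log|[A-Z]{2,} \[OC\]|Q:|[A-Z]{2,})(.+?(?=[A-Z]{2,}|Q:|[A-Z\s0-9]{2,}\s+\[.+\]|$))"

# Made-up script fragment for illustration only
toy_script = "PICARD: Make it so. RIKER [OC]: Aye, sir. DATA: Course laid in."

# re.findall returns one (group 1, group 2) tuple per match
matches = re.findall(seperate_lines_regex, toy_script)
for speaker, line in matches:
    print(f"{speaker!r} -> {line!r}")
```

Each captured line still starts with the colon that followed the speaker's name and may carry trailing whitespace; both are dealt with in the clean-up step.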

In [None]:
def find_names_and_lines(script_one_episode, regex=seperate_lines_regex):
    """Returns a list of (character_name, line_said) tuples using the
    episode script and regex given.
    Arguments:
        script_one_episode = scraped script of one episode
        regex = the regex string to separate lines (default=seperate_lines_regex)
        """
    return re.findall(regex, script_one_episode)

def get_characters_one_series(list_of_scripts_of_series,regex=seperate_lines_regex):
    """Returns a set with all characters speaking in a series.
    Arguments:
        list_of_scripts_of_series = list containing scraped scripts of all episodes in a series
        regex = the regex string to separate lines (default=seperate_lines_regex)"""
    
    series_characters = set()
    for script in list_of_scripts_of_series:
        for line in find_names_and_lines(script, regex=regex):
            character_name = line[0]  #accessing only the character name, captured in the first capture group
            series_characters.add(character_name)
            
    return series_characters

def collect_lines_per_character(list_of_scripts_of_series, regex=seperate_lines_regex):
    """Collects all lines of a character in a dictionary from
    a list of scripts of a series. Output: dictionary of characters and their lines said in a list.
    Arguments:
        list_of_scripts_of_series = list containing scraped scripts of all episodes in a series
        regex = the regex string to separate lines (default=seperate_lines_regex)"""
    series_characters = get_characters_one_series(list_of_scripts_of_series, regex)
    
    # creates a dictionary with all characters of the series as keys
    lines_per_character = {character:[] for character in series_characters}
    
    # iterates over the scripts to add individuals lines to one list per character
    for script in list_of_scripts_of_series:
        for line in find_names_and_lines(script, regex):
            character_name = line[0]
            line_content = line[1]
            lines_per_character[character_name].append(line_content)
    
    return lines_per_character

In [None]:
# creating dictionaries of lines per characters in each series
lines_per_character_TOS = collect_lines_per_character(df_TOS_raw.script)
lines_per_character_TNG = collect_lines_per_character(df_TNG_raw.script)
lines_per_character_DS9 = collect_lines_per_character(df_DS9_raw.script)
lines_per_character_VOY = collect_lines_per_character(df_VOY_raw.script)
lines_per_character_ENT = collect_lines_per_character(df_ENT_raw.script)

<a name="clean_up_lines"></a>
#### Clean up Lines

The lines collected still contain some information that was not spoken by the characters themselves, such as stage directions, for example `(The main door slams shut behind them.)`, which are always enclosed in round parentheses. Additionally, characters in mirror universes, clones etc. were designated with `OTHER`, which the line separation regex cut off into the previous line. Locations are specified within square brackets, e.g. `[BRIDGE]`. Lastly, names with prefixes, for example `T'Pol`, had their prefix cut off into the previous line, which in some cases could constitute data leakage (for example, since T'Pol only exists in ENT, the `T'` at the end of a line would be a clear predictor for Archer having said that line). 

The following functions clean the lines from these artefacts:

In [None]:
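# Designing regex
find_stage_directions_regex = r"\(.+?\)"  # non-greedy, so dialog between two stage directions is kept
find_other_regex = "OTHER"
find_locations_regex = r"\[.+?\]"  # non-greedy, for the same reason
find_orphan_name_prefixes = r"[A-Z]'\s*$"  # anchored to the end of the line, so contractions like "I'm" stay intact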

In [None]:
def remove_stage_directions(line):
    """Removes text in parentheses (stage directions) from a single line.
    Arguments:
        line = one line of dialog to clean."""
    for stage_direction in re.findall(find_stage_directions_regex, line):
        line = line.replace(stage_direction, "")
    return line

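def remove_semicolons(line):
    """Removes colons from a single line (despite the function name, the character
    removed is the colon, e.g. the one left over from the speaker label).
    Arguments:
        line = one line of dialog to clean."""
    return line.replace(":", "")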

def remove_other(line):
    """Removes OTHER, signifying cloned/alternate universe characters from a single line.
    Arguments:
        line = one line of dialog to clean."""
    for other in re.findall(find_other_regex, line):
        line = line.replace(other, "")
    return line
    
def remove_locations(line):
    """Removes text in square brackets [location information] from a single line.
    Arguments:
        line = one line of dialog to clean."""
    for location in re.findall(find_locations_regex, line):
        line = line.replace(location, "")
    return line   

def remove_orphan_name_prefixes(line):
    """Removes orphaned name prefixes (e.g. "O'" if the next line was said by O'Brien).
    Arguments:
        line = one line of dialog to clean."""
    for orphan_name in re.findall(find_orphan_name_prefixes, line):
        line = line.replace(orphan_name, "")
    return line

def clean_lines_per_character(dictionary_lines_per_character):
    """Combines the following cleaning functions: remove_stage_directions,
    remove_semicolons, remove_other, remove_locations, remove_orphan_name_prefixes and applies them
    to all lines in a series dictionary.
    Arguments:
        dictionary_lines_per_character = dictionary with all lines of a character to iterate through."""
    cleaned_lines_per_character = {}
    
    # iterating over all lines in a series dictionary to apply cleaning functions
    for character, lines_one_character in dictionary_lines_per_character.items():
        cleaned_lines_one_character = []
        for line in lines_one_character:
            line = remove_stage_directions(line)
            line = remove_semicolons(line)
            line = remove_other(line)
            line = remove_locations(line)
            line = remove_orphan_name_prefixes(line)
            line = line.strip(" ")  # removes whitespace
            line = line.strip(";")

            cleaned_lines_one_character.append(line)

        cleaned_lines_per_character[character] = cleaned_lines_one_character
    
    return cleaned_lines_per_character

In [None]:
cleaned_lines_per_character_TOS = clean_lines_per_character(lines_per_character_TOS)
cleaned_lines_per_character_TNG = clean_lines_per_character(lines_per_character_TNG)
cleaned_lines_per_character_DS9 = clean_lines_per_character(lines_per_character_DS9)
cleaned_lines_per_character_VOY = clean_lines_per_character(lines_per_character_VOY)
cleaned_lines_per_character_ENT = clean_lines_per_character(lines_per_character_ENT)

<a name="eda"></a>
## EDA

**Section Overview:**
- [Preparation of dataframes](#prep_df)
- [Line Counts per Captain](#line_counts_per_captain)
- [Count all words including common English words](#wordcount_no_stopwords)
- [Count all words excluding common English words](#wordcount_with_stopwords)
- [Wordclouds](#wordclouds)


<a name="prep_df"></a>
### Preparation of dataframes

In this section various dataframes are prepared for later use in modelling and other analyses. These include a collection of all lines per series, collections of all lines per captain, and filtered collections such as only lines with more than 5 words. 
Short lines were filtered out because neither model nor human can be expected to accurately predict whether Kirk or Picard said the line: "Yes.". 

The number of words in a line was extracted as a separate feature.
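
A minimal sketch of both ideas, counting words and filtering out short lines, using made-up lines and plain Python lists instead of the notebook's dataframes. The names `toy_lines`, `word_counts` and `long_lines` are hypothetical, and the threshold of 5 words is an assumption for illustration.

```python
# Made-up lines, not taken from the scraped scripts
toy_lines = ["Yes.",
             "Make it so.",
             "Set a course for the nearest starbase, warp six."]

# number of whitespace-separated words per line
word_counts = [len(line.split()) for line in toy_lines]

# keep only lines with at least 5 words
minimum_words = 5
long_lines = [line for line, count in zip(toy_lines, word_counts) if count >= minimum_words]

print(word_counts)
print(long_lines)
```

Note that `str.split()` without an argument collapses repeated whitespace, whereas `split(" ")` counts an empty string for every extra space, so the two can give different counts on lines that were not stripped.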

In [None]:
def get_lines_one_series(cleaned_lines_per_character_series):
    """Function that gathers all lines from a series in one list.
    Arguments:
        cleaned_lines_per_character_series = dictionary of cleaned lines sorted by character."""
    lines_series = []
    for key in cleaned_lines_per_character_series.keys():
        lines_series.extend(cleaned_lines_per_character_series[key])
    return lines_series

In [None]:
# collect lines per series
lines_TOS = get_lines_one_series(cleaned_lines_per_character_TOS)
lines_TNG = get_lines_one_series(cleaned_lines_per_character_TNG)
lines_DS9 = get_lines_one_series(cleaned_lines_per_character_DS9)
lines_VOY = get_lines_one_series(cleaned_lines_per_character_VOY)
lines_ENT = get_lines_one_series(cleaned_lines_per_character_ENT)

In [None]:
# collect lines per character: Picard
# collecting lines within the TNG series
# (concatenation creates new lists, so the per-series dictionaries are not modified)
lines_picard_TNG = (cleaned_lines_per_character_TNG["PICARD"]
                    + cleaned_lines_per_character_TNG["PICARD [OC]"])  # include lines that were said over comm

# adding lines from Picard in DS9 and ENT (crossover episodes)
lines_picard = (lines_picard_TNG
                + cleaned_lines_per_character_DS9["PICARD"]
                + cleaned_lines_per_character_ENT["PICARD"])

# collect lines per character: Kirk
# collecting lines within the TOS series
lines_kirk_TOS = (cleaned_lines_per_character_TOS["KIRK"]
                  + cleaned_lines_per_character_TOS["KIRK [OC]"])

# adding lines from Kirk in DS9 and ENT
lines_kirk = (lines_kirk_TOS
              + cleaned_lines_per_character_DS9["KIRK"]
              + cleaned_lines_per_character_DS9["KIRK [OC]"]
              + cleaned_lines_per_character_ENT["KIRK"])

# collect lines per character: Sisko (speaks only in DS9)
lines_sisko = (cleaned_lines_per_character_DS9["SISKO"]
               + cleaned_lines_per_character_DS9["SISKO [OC]"])

# collect lines per character: Janeway (speaks only in VOY)
lines_janeway = (cleaned_lines_per_character_VOY["JANEWAY"]
                 + cleaned_lines_per_character_VOY["JANEWAY [OC]"])

# collect lines per character: Archer (speaks only in ENT)
lines_archer = (cleaned_lines_per_character_ENT["ARCHER"]
                + cleaned_lines_per_character_ENT["ARCHER [OC]"])

In [None]:
# Filtering out lines consisting only of numbers
# (the lines are strings at this point, so the check looks at the content rather than the type)
lines_picard = [line for line in lines_picard if not str(line).strip().isdigit()]
lines_kirk = [line for line in lines_kirk if not str(line).strip().isdigit()]
lines_sisko = [line for line in lines_sisko if not str(line).strip().isdigit()]
lines_janeway = [line for line in lines_janeway if not str(line).strip().isdigit()]
lines_archer = [line for line in lines_archer if not str(line).strip().isdigit()]

In [None]:
# Creating a dataframe for each captain
df_lines_picard = pd.DataFrame(lines_picard, columns=["line"])
df_lines_picard["character"] = "picard"

df_lines_kirk = pd.DataFrame(lines_kirk, columns=["line"])
df_lines_kirk["character"] = "kirk"

df_lines_sisko = pd.DataFrame(lines_sisko, columns=["line"])
df_lines_sisko["character"] = "sisko"

df_lines_janeway = pd.DataFrame(lines_janeway, columns=["line"])
df_lines_janeway["character"] = "janeway"

df_lines_archer = pd.DataFrame(lines_archer, columns=["line"])
df_lines_archer["character"] = "archer"

In [None]:
# Creating combined dataframes

# lines of Kirk and Picard
df_lines_kirk_picard = pd.concat([df_lines_picard, df_lines_kirk], axis=0)

df_lines_kirk_picard.reset_index(inplace=True, drop=True)


# lines of all captains
df_lines_all_captains = pd.concat([df_lines_kirk_picard, df_lines_sisko, 
                                   df_lines_janeway, df_lines_archer], axis=0)

df_lines_all_captains.reset_index(inplace=True, drop=True)

In [None]:
# create a filter function to select lines by number of words
def get_lines_greater_x_words(dataframe_lines_character, x=5):
    """Gets lines with at least x words that are also longer than 20 characters.
    Arguments:
        dataframe_lines_character = pandas dataframe containing the lines said by one character
        x = minimum number of words wanted in a line (default=5)."""
    
    lines = dataframe_lines_character["line"]
    number_of_words_per_line = lines.apply(lambda line: len(line.split(" ")))
    number_of_characters_per_line = lines.apply(len)
    
    # keep lines with at least x words and more than 20 characters
    # (a boolean mask works independently of the index labels)
    keep_line = (number_of_words_per_line >= x) & (number_of_characters_per_line > 20)
    
    return dataframe_lines_character[keep_line].copy()


# function to add the counts of words into a new feature
def add_word_count_to_df(dataframe_lines_character):
    """Adds a new column including the word count of a line.
    Arguments:
        dataframe_lines_character = pandas dataframe containing the lines said by one character."""
    
    dataframe_lines_character_word_count = dataframe_lines_character.copy()
    dataframe_lines_character_word_count["num_words"] = np.nan
    
    # for each line, count the words, add the count into a new column
    for index, line in zip(dataframe_lines_character.index, dataframe_lines_character["line"]):
        number_of_words_in_line = len(line.split(" "))
        
        dataframe_lines_character_word_count.loc[index,"num_words"] = number_of_words_in_line
    
    return dataframe_lines_character_word_count   

In [None]:
# create dataframes with all lines with at least 5 words

# combined dataframes
df_lines_kirk_picard_5_words = get_lines_greater_x_words(df_lines_kirk_picard)
df_lines_all_captains_5_words = get_lines_greater_x_words(df_lines_all_captains)

# dataframes of individual captains
df_lines_kirk_5_words = get_lines_greater_x_words(df_lines_kirk)
df_lines_picard_5_words = get_lines_greater_x_words(df_lines_picard)
df_lines_sisko_5_words = get_lines_greater_x_words(df_lines_sisko)
df_lines_janeway_5_words = get_lines_greater_x_words(df_lines_janeway)
df_lines_archer_5_words = get_lines_greater_x_words(df_lines_archer)

In [None]:
# adding word counts as a separate feature for lines >= 5 words

# combined dataframes
df_lines_kirk_picard_word_count = add_word_count_to_df(df_lines_kirk_picard_5_words)
df_lines_all_captains_word_count = add_word_count_to_df(df_lines_all_captains_5_words)

# dataframes of individual captains
df_lines_kirk_word_count= add_word_count_to_df(df_lines_kirk_5_words)
df_lines_picard_word_count = add_word_count_to_df(df_lines_picard_5_words)
df_lines_sisko_word_count = add_word_count_to_df(df_lines_sisko_5_words)
df_lines_janeway_word_count = add_word_count_to_df(df_lines_janeway_5_words)
df_lines_archer_word_count = add_word_count_to_df(df_lines_archer_5_words)

<a name="totalwc_avg_num_words"></a>
#### Total Word Counts and Average Number of Words per Line
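
The quantity computed in this section is the number of space-separated words per line, averaged per captain, and the mean line drawn in the plot is the unweighted mean of those per-captain averages. A minimal sketch on toy data (the lines and the helper `words_per_line` are hypothetical, not taken from the notebook) shows that this differs from the pooled mean over all lines when captains have different numbers of lines:

```python
# Toy data (hypothetical lines, not taken from the scraped scripts)
toy_lines = {
    "kirk": ["Spock, report.", "Beam us up, Scotty, right now."],
    "picard": ["Make it so.", "Engage.", "Tea, Earl Grey, hot."],
}


def words_per_line(lines):
    """Average number of space-separated words per line."""
    return sum(len(line.split(" ")) for line in lines) / len(lines)


# average line length per captain
per_captain = {name: words_per_line(lines) for name, lines in toy_lines.items()}

# unweighted mean of the per-captain averages (every captain counts equally)
mean_of_averages = sum(per_captain.values()) / len(per_captain)

# pooled mean over all lines (captains with more lines weigh more)
all_lines = [line for lines in toy_lines.values() for line in lines]
pooled_mean = words_per_line(all_lines)

print(per_captain)
print(mean_of_averages, pooled_mean)
```

Which of the two means is appropriate depends on the question: the unweighted mean compares captains, the pooled mean describes the corpus.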

In [None]:
# counting the words in each line for all lines and all captains
df_lines_all_captains_all_lines_word_count = add_word_count_to_df(df_lines_all_captains)

# calculating the average word count per line (restricted to the numeric word-count column)
words_by_captain = df_lines_all_captains_all_lines_word_count.groupby("character")[["num_words"]]
total_number_of_words_per_captain = words_by_captain.sum()
total_number_of_lines_per_captain = words_by_captain.count()

average_lenght_of_line = total_number_of_words_per_captain / total_number_of_lines_per_captain

In [None]:
total_number_of_words_per_captain

In [None]:
average_lenght_of_line

In [None]:
# unweighted mean of the per-captain averages
average_lenght_of_line["num_words"].mean()

In [None]:
# Creating a plot of average word count per line per captain
fig, ax = plt.subplots()

# setting title parameters and title
title_params = {"font":"DIN Condensed","verticalalignment":"baseline",
        "size":25,"horizontalalignment": "center"}

plt.title("Average word count per line", title_params)

# sort once and compute the unweighted mean of the per-captain averages
avg_sorted = average_lenght_of_line.sort_values("num_words", ascending=True)
mean_avg = avg_sorted["num_words"].mean()

# create the bar chart object
bar_chart = ax.barh(avg_sorted.index, avg_sorted["num_words"], 
                    color=["#9a99ff","#9a99ff","#cc6698","#cc6698","#cc6698"], edgecolor="black")

# adding text to the bar chart
for bar, num_words in zip(ax.patches, avg_sorted["num_words"]):
    # text within the bar for bars below the mean, to the right of the bar otherwise
    offset = -1.8 if num_words < mean_avg else 0.4
    ax.text(bar.get_x()+bar.get_width()+offset, bar.get_y()+bar.get_height()/2, round(num_words,2), 
            color = 'black', ha = 'left', va = 'center',  fontproperties={"size":20})

# change labels on the y-axis to capitalized captain names (taken from the sorted index)
plt.yticks(font="DIN Condensed", ticks=list(range(len(avg_sorted))), 
           labels=avg_sorted.index.str.capitalize(), size=20)

# change size of labels on the x-axis
plt.xticks(size=15)

# add vertical line at the mean
plt.axvline(mean_avg, c="black")

# add text for the mean
ax.text(mean_avg - 0.6, -1.5, f"\u03bc = {mean_avg:.1f}", fontproperties={"size":15, "style":"italic"})

# set length of x axis
plt.xlim(0,15);

<a name="part_of_speech"></a>
#### Part of Speech Tagging

William Shatner's performance as James T. Kirk, with its dramatic pauses, is iconic. Whether these pauses are reflected in the script through more punctuation remains to be seen. In any case, part-of-speech tagging might pick up on subtle differences in speaking style. 
The following section uses the `nltk` function `pos_tag` to engineer grammatical features.
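
As a rough illustration of the punctuation idea, the sketch below counts punctuation tokens in a line using only the standard library. The regular expression is assumed to mirror the one `nltk`'s `WordPunctTokenizer` is built on, and the helper `punctuation_profile` is hypothetical; it is not part of the notebook's feature pipeline, which tags tokens with `pos_tag` instead.

```python
import re
from collections import Counter

# runs of word characters, or runs of non-word, non-space characters
# (assumed to match the pattern behind nltk's WordPunctTokenizer)
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def punctuation_profile(line):
    """Count all tokens and the punctuation tokens of a single line."""
    tokens = WORD_PUNCT.findall(line)
    punct = [tok for tok in tokens if re.fullmatch(r"[^\w\s]+", tok)]
    return {
        "n_tokens": len(tokens),
        "n_punct": len(punct),
        "punct_counts": Counter(punct),
    }


example = punctuation_profile("What do you think, Counsellor?")
print(example)
```

A per-line punctuation rate (`n_punct / n_tokens`) would be one simple way to compare captains on this dimension.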

In [None]:
def extract_part_of_speech_features(df_lines):
    """Function to apply a simple tokenizer (including punctuation) and part of speech tagging to 
    each line of a character. Returns the counts of parts of speech per line as added features 
    to the given dataframe.
    Arguments:
        df_lines = dataframe with lines to process"""
    
    tok = WordPunctTokenizer() # initialise a simple tokenizer that includes punctuation
    
    df_lines_pos = df_lines.copy()
    
    # iterate over the lines in the dataframe
    for index, line in zip(df_lines_pos.index, df_lines_pos["line"]):
        
        # create a list of part of speech items of the line
        pos_features_per_line = [b for a,b in pos_tag(tok.tokenize(line))]

        # iterate over part of speech items in one line
        for pos_feature in pos_features_per_line:
            
            # check if the feature of that part of speech type already exists
            if pos_feature in df_lines_pos.columns:
                if df_lines_pos.loc[index,pos_feature] > 0:  # in case the type already occurred in this line, add 1
                    df_lines_pos.loc[index,pos_feature] += 1
                else: # in case the type has not occurred in this line yet (cell is NaN), set the cell to 1
                    df_lines_pos.loc[index,pos_feature] = 1
            else: # create a new feature column and set it to 1 for this line
                df_lines_pos.loc[index,pos_feature] = 1
                
    return df_lines_pos

In [None]:
# add part of speech tagging to the dataframes
df_lines_kirk_picard_pos = extract_part_of_speech_features(df_lines_kirk_picard_word_count)
df_lines_all_captains_pos = extract_part_of_speech_features(df_lines_all_captains_word_count)

Getting part of speech tagging for the example line (for use in the presentation)

In [None]:
pos_example = pd.DataFrame([["What do you think, Counsellor?"]], columns=["line"])

In [None]:
extract_part_of_speech_features(pos_example)

<a name="perc_per_captain"></a>
#### Percentage of lines said by a captain in their series

Which captain talks the most in their series, relative to other characters? (Yes, it is Kirk.) This section is a short interlude that calculates the percentage of lines in each series said by its main captain. 
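
The calculation itself is a simple ratio: the captain's lines (including lines said over comm) divided by all lines in the series. A minimal sketch with a made-up dictionary (the data and the helper `share_of_lines` are hypothetical):

```python
# Toy dictionary in the same shape as the cleaned lines per character
# (hypothetical content, one list of lines per character key)
toy_series = {
    "KIRK": ["line 1", "line 2", "line 3"],
    "KIRK [OC]": ["line 4"],
    "SPOCK": ["line 5", "line 6", "line 7", "line 8"],
    "MCCOY": ["line 9", "line 10"],
}


def share_of_lines(lines_per_character, character_keys):
    """Percentage of all lines in a series said under the given character keys."""
    total = sum(len(lines) for lines in lines_per_character.values())
    own = sum(len(lines_per_character.get(key, [])) for key in character_keys)
    return round(own / total * 100, 1)


kirk_share = share_of_lines(toy_series, ["KIRK", "KIRK [OC]"])
print(kirk_share)
```

Note that the result only stays meaningful if each line is counted once in the total, so the captain's lists should not be merged in place inside the dictionary before counting.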

In [None]:
def count_total_lines_per_series(cleaned_lines_per_character_dict):
    """Counts total lines per series (that are not empty)
    Arguments: 
        cleaned_lines_per_character_dict = Dictionary of collected lines per character of a series."""
    line_count_per_char = [len(lines) 
                           for lines in cleaned_lines_per_character_dict.values() 
                           if len(lines) > 0]
    return sum(line_count_per_char)

In [None]:
# Counting total lines of all characters per series
total_line_count_TOS = count_total_lines_per_series(cleaned_lines_per_character_TOS)
total_line_count_TNG = count_total_lines_per_series(cleaned_lines_per_character_TNG)
total_line_count_DS9 = count_total_lines_per_series(cleaned_lines_per_character_DS9)
total_line_count_VOY = count_total_lines_per_series(cleaned_lines_per_character_VOY)
total_line_count_ENT = count_total_lines_per_series(cleaned_lines_per_character_ENT)

In [None]:
# Calculating percentage of captain's lines per series
perc_lines_kirk = round((len(lines_kirk_TOS)/ total_line_count_TOS)*100,1)
perc_lines_picard = round((len(lines_picard_TNG)/ total_line_count_TNG)*100,1)
perc_lines_sisko = round((len(lines_sisko)/ total_line_count_DS9)*100,1)
perc_lines_janeway = round((len(lines_janeway)/ total_line_count_VOY)*100,1)
perc_lines_archer = round((len(lines_archer)/ total_line_count_ENT)*100,1)


# Creating overview table
perc_lines_said_by_captains = pd.DataFrame([[len(lines_kirk_TOS),len(df_lines_kirk_5_words), perc_lines_kirk],
              [len(lines_picard_TNG),len(df_lines_picard_5_words), perc_lines_picard],
              [len(lines_sisko),len(df_lines_sisko_5_words), perc_lines_sisko],
              [len(lines_janeway),len(df_lines_janeway_5_words), perc_lines_janeway],
              [len(lines_archer),len(df_lines_archer_5_words), perc_lines_archer]],
    index=["Kirk", "Picard", "Sisko", "Janeway", "Archer"],
    columns=["Lines total", "Lines ≥ 5 words", "% of lines in series"])

In [None]:
perc_lines_said_by_captains

In [None]:
np.mean(perc_lines_said_by_captains["% of lines in series"])

In [None]:
# plotting percentage of lines said by its main captain
fig, ax = plt.subplots(figsize=(8,8))

# creating the bar chart object
bar_chart = ax.bar(perc_lines_said_by_captains.sort_values("% of lines in series",ascending=False).index, 
                   perc_lines_said_by_captains.sort_values("% of lines in series",ascending=False)["% of lines in series"], 
                   color=["#cc6698","#cc6698","#9a99ff","#9a99ff","#9a99ff"], edgecolor="black")

# adding numbers
for bar, perc in zip(ax.patches, perc_lines_said_by_captains.sort_values("% of lines in series",
                                                            ascending=False)["% of lines in series"]):
    if perc > np.mean(perc_lines_said_by_captains["% of lines in series"]): # number on top of the bar for > mean
        ax.text(bar.get_x()+bar.get_width()/2, bar.get_y()+bar.get_height()+0.8,round(perc,2), 
            color = 'black', ha = 'center', va = 'center', size=20)
    else: # number in the bar for < mean
        ax.text(bar.get_x()+bar.get_width()/2, bar.get_y()+bar.get_height()-1.2,round(perc,2), 
            color = 'black', ha = 'center', va = 'center', size=20)

# add title
ax.set_title("Percentage of lines in a series said by its main captain", title_params)

# add horizontal line at mean
ax.axhline(np.mean(perc_lines_said_by_captains["% of lines in series"]), color="black")

# configure axis
plt.yticks(size=20)
plt.xticks(size=20)
plt.ylim(0,35);

<a name="wordcount_no_stopwords"></a>
#### Count all words including common english words

This section determines the top 10 words said by each captain, including common English words.
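
The vectorizers in this notebook tokenize with the pattern `\s(\w+)\s`, i.e. a word must have a whitespace character on both sides. Because the match consumes both whitespace characters, the space after one word is no longer available as the leading space of the next word. The standard-library sketch below illustrates this on a made-up line; the lookaround variant is shown only for comparison and is not what the notebook uses.

```python
import re

# made-up example line, already lowercased as CountVectorizer would do
line = "set a course for earth, warp nine"

# pattern used in the notebook: the surrounding whitespace is part of the match
spaces_consumed = re.findall(r"\s(\w+)\s", line)

# comparison only: lookarounds require the whitespace without consuming it
lookaround = re.findall(r"(?<=\s)\w+(?=\s)", line)

print(spaces_consumed)
print(lookaround)
```

In both variants the first and last word of a line, and words directly followed by punctuation, are not counted, which should be kept in mind when reading the word counts below.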

In [None]:
def get_dataframe_cvec(lines, cvec):
    """Fit a vectorizer to a set of lines and return a dataframe with one row per line
    and one column per token, holding the token counts.
    Arguments:
        lines = list of lines said by one character
        cvec = CountVectorizer instance to use."""
    lines_sparse_matrix = cvec.fit_transform(lines)
    # get_feature_names_out replaces get_feature_names, which newer scikit-learn versions removed
    df = pd.DataFrame(lines_sparse_matrix.toarray(), columns=cvec.get_feature_names_out())
    
    return df

In [None]:
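# create basic count vectorizer without stop-word filtering (common words are kept)
cvec_no_stopwords = CountVectorizer(input = "content",    
                       encoding = "utf-8",   
                       decode_error = "strict",
                       strip_accents = None,    
                       lowercase = True,
                       # at least 1 word character in between two whitespace characters; the match consumes both
                       # whitespace characters, so line-initial, line-final and punctuation-adjacent words are skipped
                       token_pattern = r"\s(\w+)\s",
                       stop_words = None,  
                       ngram_range=(1,1))  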

In [None]:
# applying the count vectorizer to all indiv captains dataframes
df_simple_wordcount_incl_common_picard = get_dataframe_cvec(lines_picard,cvec_no_stopwords)
df_simple_wordcount_incl_common_kirk = get_dataframe_cvec(lines_kirk,cvec_no_stopwords)
df_simple_wordcount_incl_common_sisko = get_dataframe_cvec(lines_sisko,cvec_no_stopwords)
df_simple_wordcount_incl_common_janeway = get_dataframe_cvec(lines_janeway,cvec_no_stopwords)
df_simple_wordcount_incl_common_archer = get_dataframe_cvec(lines_archer,cvec_no_stopwords)

In [None]:
# creating tables of top 10 words per captain
word_count_picard_st = pd.DataFrame(df_simple_wordcount_incl_common_picard.sum(axis=0).sort_values(
    ascending=False).head(10), 
             columns=["Wordcount Picard"])

word_count_kirk_st = pd.DataFrame(df_simple_wordcount_incl_common_kirk.sum(axis=0).sort_values(
    ascending=False).head(10),
             columns=["Wordcount Kirk"])

word_count_sisko_st = pd.DataFrame(df_simple_wordcount_incl_common_sisko.sum(axis=0).sort_values(
    ascending=False).head(10),
             columns=["Wordcount Sisko"])

word_count_janeway_st = pd.DataFrame(df_simple_wordcount_incl_common_janeway.sum(axis=0).sort_values(
    ascending=False).head(10),
             columns=["Wordcount Janeway"])

word_count_archer_st = pd.DataFrame(df_simple_wordcount_incl_common_archer.sum(axis=0).sort_values(
    ascending=False).head(10),
             columns=["Wordcount Archer"])

<a name="analysing_freq_v_common"></a>
#### Analysing frequency of very common words

In [None]:
# look for words in top 10 all captains have in common
set.intersection(*(set(word_counts.index) for word_counts in [
    word_count_archer_st, word_count_janeway_st, word_count_kirk_st,
    word_count_picard_st, word_count_sisko_st]))

In [None]:
# create a dataframe including counts of top words by all captains
top_words_all_char = ['a', 'i', 'of', 'the', 'to', 'you']

word_counts_per_captain = {"Kirk": word_count_kirk_st, "Picard": word_count_picard_st,
                           "Sisko": word_count_sisko_st, "Janeway": word_count_janeway_st,
                           "Archer": word_count_archer_st}

# one row per captain, one column per word
common_words_allc = pd.DataFrame(
    [[word_counts.loc[word, f"Wordcount {captain}"] for word in top_words_all_char]
     for captain, word_counts in word_counts_per_captain.items()],
    index=list(word_counts_per_captain), columns=top_words_all_char)

In [None]:
# show total counts of very common words
common_words_allc

In [None]:
# normalize the counts of very common words by each captain's total number of lines
# (average occurrences per line, i.e. a rate rather than a true percentage)
common_words_perc = common_words_allc.div(perc_lines_said_by_captains["Lines total"], axis=0)

In [None]:
common_words_perc

In [None]:
# plot normalized frequencies of very common words as stacked barchart per captain
fig, ax = plt.subplots(edgecolor="black")

word_colors = ["#9a99ff", "#ffcc9a", "#cc6698", "#ff9900", "#89DCB1", "#869CD6"]

# stack one bar segment per word on top of the previous ones
bottom_added = pd.Series(0.0, index=common_words_perc.index)
for word, color in zip(common_words_perc.columns, word_colors):
    ax.bar(common_words_perc.index, common_words_perc[word], bottom=bottom_added, 
           label=word, color=color, edgecolor="black")
    bottom_added = bottom_added + common_words_perc[word]

# change sizes of axis labels
plt.xticks(size=20)
plt.yticks(size=20)

# add title
plt.title("Normalized Frequency of very common words", size=20)

# add legend
plt.legend(loc=[1.05,0], fontsize=15);

<a name="wordcount_with_stopwords"></a>
#### Count all words excluding common english words

This section determines the top 10 words said by each captain, excluding common English words.
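
The effect of excluding common words can be sketched with the standard library alone. The stop-word set and the toy lines below are hypothetical and much smaller than scikit-learn's built-in English list, which is what the notebook's vectorizer uses; tokenization is also simplified to whitespace splitting.

```python
from collections import Counter

# tiny illustrative stop-word set (assumption; not scikit-learn's list)
STOP_WORDS = frozenset({"the", "to", "a", "of", "you", "i", "it", "is"})


def top_words(lines, n=3, stop_words=frozenset()):
    """Return the n most frequent whitespace-separated words, skipping stop words."""
    counts = Counter(
        word
        for line in lines
        for word in line.lower().split()
        if word not in stop_words
    )
    return counts.most_common(n)


toy_lines = [
    "set a course to the nebula",
    "the nebula is a trap",
    "set phasers to stun",
]

with_common = top_words(toy_lines, n=10)
without_common = top_words(toy_lines, n=2, stop_words=STOP_WORDS)
print(with_common)
print(without_common)
```

With the common words removed, the ranking is dominated by content words, which is what makes the per-captain top 10 lists comparable.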

In [None]:
# Create a basic CountVectorizer instance that removes English stop words
cvec_with_stopwords = CountVectorizer(input = "content",    
                       encoding = "utf-8",   
                       decode_error = "strict",
                       strip_accents = None,    
                       lowercase = True,        
                       token_pattern = r"\s(\w{2,})\s",  # words of at least 2 characters between two whitespace characters
                       stop_words = "english",
                       ngram_range=(1,1))

In [None]:
# collecting words excluding stop words
df_simple_wordcount_excl_common_picard = get_dataframe_cvec(lines_picard, cvec_with_stopwords)
df_simple_wordcount_excl_common_kirk = get_dataframe_cvec(lines_kirk, cvec_with_stopwords)
df_simple_wordcount_excl_common_sisko = get_dataframe_cvec(lines_sisko, cvec_with_stopwords)
df_simple_wordcount_excl_common_janeway = get_dataframe_cvec(lines_janeway, cvec_with_stopwords)
df_simple_wordcount_excl_common_archer = get_dataframe_cvec(lines_archer, cvec_with_stopwords)

In [None]:
# getting dataframes of top 10 words
top10_words_picard = pd.DataFrame(df_simple_wordcount_excl_common_picard.sum(axis=0).sort_values(
    ascending=False).head(10), columns=["Wordcount Picard"])
top10_words_kirk = pd.DataFrame(df_simple_wordcount_excl_common_kirk.sum(axis=0).sort_values(
    ascending=False).head(10),columns=["Wordcount Kirk"])
top10_words_sisko = pd.DataFrame(df_simple_wordcount_excl_common_sisko.sum(axis=0).sort_values(
    ascending=False).head(10), columns=["Wordcount Sisko"])
top10_words_janeway = pd.DataFrame(df_simple_wordcount_excl_common_janeway.sum(axis=0).sort_values(
    ascending=False).head(10), columns=["Wordcount Janeway"])
top10_words_archer = pd.DataFrame(df_simple_wordcount_excl_common_archer.sum(axis=0).sort_values(
    ascending=False).head(10), columns=["Wordcount Archer"])

In [None]:
top10_words_picard

In [None]:
top10_words_kirk

In [None]:
top10_words_sisko

In [None]:
top10_words_janeway

In [None]:
top10_words_archer

In [None]:
# create wordcount dataframes for all series data
df_simple_wordcount_excl_common_TOS = get_dataframe_cvec(lines_TOS, cvec_with_stopwords)
df_simple_wordcount_excl_common_TNG = get_dataframe_cvec(lines_TNG, cvec_with_stopwords)
df_simple_wordcount_excl_common_DS9 = get_dataframe_cvec(lines_DS9, cvec_with_stopwords)
df_simple_wordcount_excl_common_VOY = get_dataframe_cvec(lines_VOY, cvec_with_stopwords)
df_simple_wordcount_excl_common_ENT = get_dataframe_cvec(lines_ENT, cvec_with_stopwords)

<a name="wordclouds"></a>
#### Creating Wordclouds

Using the word counts (excluding common English words), a word cloud can be created for each captain.

In [None]:
def make_wordcloud(data, color = "white", image=None):
    """Creates a wordcloud from a cvec dataframe. 
    
    Arguments:
        data = dataframe with word counts per line (as returned by get_dataframe_cvec)
        color = background color (default: white)
        image = optional image array used as mask (shape) and as source of the word colors"""
    counts_per_word = data.sum(axis=0)
    
    wordcloud_text = ""
    
    # iterating over words and their frequency
    for word, count in zip(counts_per_word.index, counts_per_word):
        multiplied_word = (word + " ") * count  # multiplying the word by its frequency
        wordcloud_text += multiplied_word # creating a string to be parsed by the WordCloud method

    wordcloud = WordCloud(
                      background_color=color,
                      width=2500,
                      height=2000,
                      max_words=1000,
                      stopwords={"la"},  # a set containing the single token "la"
                      collocations=False,
                      mask=image,  # using an image with a white background as a mask
                     ).generate(wordcloud_text)
    
    # determining word colors from the image, if one was given
    if image is not None:
        wordcloud = wordcloud.recolor(color_func=ImageColorGenerator(image))
    
    plt.figure(1,figsize=(13, 13))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()

In [None]:
picard_coloring = iio.imread("/Users/tjanif/Desktop/KirkvPicard_Material/pictures/picard.png")
make_wordcloud(df_simple_wordcount_excl_common_picard, image=picard_coloring)

In [None]:
kirk_coloring = iio.imread("/Users/tjanif/Desktop/KirkvPicard_Material/pictures/kirk2.jpeg")
make_wordcloud((df_simple_wordcount_excl_common_kirk), image=kirk_coloring)

In [None]:
sisko_coloring = iio.imread("/Users/tjanif/Desktop/KirkvPicard_Material/pictures/sisko.jpeg")
make_wordcloud((df_simple_wordcount_excl_common_sisko), image=sisko_coloring)

In [None]:
janeway_coloring = iio.imread("/Users/tjanif/Desktop/KirkvPicard_Material/pictures/janeway3.jpeg")
make_wordcloud((df_simple_wordcount_excl_common_janeway), image=janeway_coloring)

In [None]:
archer_coloring = iio.imread("/Users/tjanif/Desktop/KirkvPicard_Material/pictures/archer_bad_cropped.png")
make_wordcloud((df_simple_wordcount_excl_common_archer), image=archer_coloring)

## Basic Models

For this classification problem, a series of basic models was trained on nothing but the word counts of a given line, to get an overview of how different models perform on the data. Word counts are a natural starting point, since the occurrence of certain words, for example "Spock", will most likely be highly predictive of Kirk having said that line. 

Modelling was only done on lines with at least 5 words.
The basic models were run both on a binary classification (Kirk vs Picard) and on a multiclass problem (all captains); subsequent advanced models were only tested on the multiclass task.

**Section Overview:**
- [Preparing Data for Basic Modelling](#prep_base_data)
    - [Baseline Accuracy](#baseline)
    - [Train/Test Splits](#ttsplits)
- [Create Pipelines](#create_pipelines)
    - [Create Vectorizers](#create_vectorizers)
    - [Create Models](#create_models)
    - [Create Pipelines with Param-Grids](#create_pipelines_2)
- [Fitting Basic Models](#basic_models)
    - [Modelling Functions](#modelling_functions)
    - [Fit Basic Models (Vectorized Lines) for Kirk vs Picard](#fit_basic_models_kvp)
    - [Fit Basic Models (Vectorized Lines) for all captains](#fit_basic_models_capt)
    
    
- [Evaluating Basic Models](#eval_basic_models)
    - [Model evaluation functions](#eval_functions)
    - [Kirk vs Picard](#kirkvpicard_eval)
        - [Overview](#kirkvpicard_overview_eval)
        - [Logistic Regression -  Kirk vs Picard](#kirkvpicard_logreg)
        - [Decision Tree -  Kirk vs Picard](#kirkvpicard_dectree)
        - [Bernoulli Naive Bayes - Kirk vs Picard](#kirkvpicard_bayes)
        - [Random Forest - Kirk vs Picard](#krikvpicard_random_forest)
        - [Logistic Regression + tfidf - Kirk vs Picard](#krikvpicard_logreg_tfidf)
        - [Decision Tree + tfidf - Kirk vs Picard](#krikvpicard_dectree_tfidf)
        - [Bernoulli Naive Bayes + tfidf - Kirk vs Picard](#krikvpicard_bayes_tfidf)
    - [All Captains](#allc_eval)
        - [Overview](#allc_overview_eval)
        - [Logistic Regression -  All Captains](#allc_logreg)
        - [Decision Tree -   All Captains](#allc_dectree)
        - [Bernoulli Naive Bayes -  All Captains](#allc_bayes)
        - [Random Forest -  All Captains](#allc_random_forest)
        - [Logistic Regression + tfidf -  All Captains](#allc_logreg_tfidf)
        - [Decision Tree + tfidf -  All Captains](#allc_dectree_tfidf)
        - [Bernoulli Naive Bayes + tfidf -  All Captains](#allc_bayes_tfidf)

<a name="prep_base_data"></a>
### Preparing Data for Basic Modelling

<a name="baseline"></a>
#### Baseline Accuracy

The models will compete with the baseline accuracy, which is the proportion of the majority class.
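
In other words, a model that always predicts the most frequent captain would reach exactly this accuracy. A minimal sketch with made-up labels (the helper `baseline_accuracy` and the label counts are hypothetical):

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# made-up class distribution, not the notebook's actual counts
toy_labels = ["picard"] * 4 + ["kirk"] * 3 + ["sisko"] * 3

majority, accuracy = baseline_accuracy(toy_labels)
print(majority, accuracy)
```

With five roughly balanced classes the baseline is far lower than in the binary Kirk vs Picard task, so accuracies from the two tasks should not be compared directly.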

In [None]:
# Picard vs Kirk lines with at least 5 words
print("Picard vs Kirk, lines with at least 5 words")
print(df_lines_kirk_picard_5_words["character"].value_counts(normalize=True))
print("..................")
# All captains lines with at least 5 words
norm_line_count_allc = df_lines_all_captains_5_words["character"].value_counts(normalize=True)
line_count_allc = df_lines_all_captains_5_words["character"].value_counts(normalize=False)
print("All captains, lines with at least 5 words")
print(df_lines_all_captains_5_words["character"].value_counts(normalize=True))

In [None]:
# total number of lines with at least 5 words
sum(line_count_allc)

In [None]:
# Baseline plot
fig, ax = plt.subplots(figsize=(2.5,5))

# add title
ax.set_title("Class Balance", size=25)

# stacking order (bottom to top) and colors of the captains
captain_order = ["picard", "janeway", "kirk", "sisko", "archer"]
captain_colors = ["#9a99ff", "#ffcc9a", "#cc6698", "#156FA2", "#087F8C"]

# create stacked bar chart
offset_bottom = 0
for captain, color in zip(captain_order, captain_colors):
    ax.bar("a", norm_line_count_allc[captain], bottom=offset_bottom, color=color, 
           edgecolor="black", label=captain.capitalize())
    offset_bottom += norm_line_count_allc[captain]

# adding percentages, total counts of lines and captain names to the bar segments
# (values are looked up by captain, so they always match the segment they label)
for bar, captain in zip(ax.patches, captain_order):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_y()+bar.get_height()/2, 
            f"{norm_line_count_allc[captain]:.0%} ({line_count_allc[captain]})",
            color = 'black', ha = 'center', va = 'center', size=20)
    ax.text(bar.get_x()+bar.get_width()+0.1, bar.get_y()+bar.get_height()/2, captain.capitalize(),
            color = 'black', ha = 'left', va = 'center', size=20)

# remove box and xticks
ax.set_xticks([])
ax.axis("off");

<a name="ttsplits"></a>
#### Train/Test Splits for Basic Models

In [None]:
# function to create a train-test split of 80 to 20 %
def creating_train_test_splits(df_X_y, test_size=0.2):
    """Function to create a stratified train-test-split from a dataframe with the first column 
    containing the predictor (the line), the second column containing the target (the character).
    Arguments:
        df_X_y = dataframe with the data as specified above.
        test_size = relative size of the test set (default=0.2)"""
    X = df_X_y.iloc[:,0]
    y = df_X_y.iloc[:,1]

    X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=23, stratify=y)
    
    list_of_datasets = [X_train, y_train, X_test, y_test]
    
    # prints the shapes as quality control
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
    
    return list_of_datasets

In [None]:
tt_list_split_kirk_picard = creating_train_test_splits(df_lines_kirk_picard)

In [None]:
tt_list_split_kirk_picard_5_words = creating_train_test_splits(df_lines_kirk_picard_5_words)

In [None]:
tt_list_split_kirk_picard_word_count = creating_train_test_splits(df_lines_kirk_picard_word_count)

In [None]:
tt_list_split_all_capt = creating_train_test_splits(df_lines_all_captains)

In [None]:
tt_list_split_all_capt_5_words = creating_train_test_splits(df_lines_all_captains_5_words)

In [None]:
tt_list_split_all_capt_word_count = creating_train_test_splits(df_lines_all_captains_word_count)

<a name="create_pipelines"></a>
### Create Pipelines

<a name="create_vectorizers"></a>
#### Create Vectorizers

Vectorizers to be used to do basic modelling.

In [None]:
cvec = CountVectorizer(stop_words = "english",     
                       token_pattern=r"\s(\w{2,})\s", 
                       ngram_range =(1,1))
                       #max_features = 50000)

In [None]:
tvec = TfidfVectorizer(stop_words = "english",  
                       token_pattern=r"\s(\w{2,})\s",  
                       ngram_range=(1,1))
                       #max_features = 50000) 

<a name="create_models"></a>
#### Create Models

Models to be used in basic modelling.

In [None]:
logreg = LogisticRegression(max_iter=2000, solver="liblinear")

In [None]:
bernoulliNB =  naive_bayes.BernoulliNB()

In [None]:
multinomialNB =  naive_bayes.MultinomialNB()

In [None]:
decision_tree_classifier = DecisionTreeClassifier()

In [None]:
random_forest_classifier = RandomForestClassifier(n_estimators = 2000, 
                                                  criterion="gini", 
                                                  n_jobs=-2)

<a name="create_pipelines_2"></a>
#### Create Pipelines with Param-Grids

A small grid search was used to tune hyperparameters. Since these models were only preliminary, computationally expensive searches were not deemed necessary at this point.
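
To get a feeling for the cost of such a search, the number of candidates can be counted before fitting. The sketch below enumerates the logistic-regression grid used in this notebook with the standard library only; it assumes `GridSearchCV`'s default of 5 folds and one final refit.

```python
from itertools import product

param_grid = {
    "vect__max_df": [0.8, 0.9, 1.0],
    "logreg__penalty": ["l1", "l2"],
    # approximately the values of np.logspace(-4, 4, 10)
    "logreg__C": [10 ** (-4 + 8 * i / 9) for i in range(10)],
}

# every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5                                  # assumed default when cv is not given
total_fits = len(candidates) * n_folds + 1   # +1 for the refit on the full training set

print(len(candidates), total_fits)  # 60 301
```

The count grows multiplicatively with every added parameter, which is why the grids here are kept small.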

In [None]:
pipe_logreg = Pipeline([
    ('vect', cvec),
    ('logreg', logreg)
])

param_grid_logreg = {
                "vect__max_df" : [.8, .9, 1.0],
                "logreg__penalty" : ["l1", "l2"],
                "logreg__C": np.logspace(-4, 4, 10)}

gs_logreg = GridSearchCV(pipe_logreg, param_grid_logreg, n_jobs=-2)

In [None]:
pipe_dectree = Pipeline([
    ('vect', cvec),
    ('DecTree', decision_tree_classifier)
])

param_grid_dectree = {"vect__max_df" : [.8, .9, 1.0],
              "DecTree__max_depth" : np.linspace(20,80,4, dtype=int),
              "DecTree__min_samples_split": np.linspace(2, 10, 5, dtype=int),
              "DecTree__min_samples_leaf": np.linspace(2, 10, 5, dtype=int)}

gs_dectree = GridSearchCV(pipe_dectree, param_grid_dectree, n_jobs=-2)

In [None]:
pipe_bayes = Pipeline([
    ('vect', cvec),
    ('BernoulliNB', bernoulliNB)
])

param_grid_bayes = {"vect__max_df" : [1.0],
            "BernoulliNB__alpha" : [1.0]}

gs_bayes = GridSearchCV(pipe_bayes, param_grid_bayes, n_jobs=-2)

In [None]:
pipe_random_forest = Pipeline([
    ('vect', cvec),
    ('Forest', random_forest_classifier)
])

param_grid_forest = {
              "Forest__max_depth" : np.linspace(5,50,10, dtype=int)}  # start at 5: max_depth=0 is invalid and its fits would fail

gs_forest = GridSearchCV(pipe_random_forest, param_grid_forest, n_jobs=-2)

In [None]:
pipe_bayes = Pipeline([
    ('vect', cvec),
    ('MultinomialNB', multinomialNB)
])

param_grid_bayes = {"vect__max_df" : [.8, .9, 1.0],
            "MultinomialNB__alpha" : np.linspace(0.0,1.0,4)}

gs_bayes_multi = GridSearchCV(pipe_bayes, param_grid_bayes, n_jobs=-2)

In [None]:
pipe_logreg_tvec = Pipeline([
    ('vect', tvec),
    ('logreg', logreg)
])

param_grid_logreg = {
                "vect__max_df" : [.8, .9, 1.0],
                "logreg__penalty" : ["l1", "l2"],
                "logreg__C": np.logspace(-4, 4, 10)}

gs_logreg_tvec = GridSearchCV(pipe_logreg_tvec, param_grid_logreg, n_jobs=-2)

In [None]:
pipe_dectree = Pipeline([
    ('vect', tvec),
    ('DecTree', decision_tree_classifier)
])

param_grid_dectree = {"vect__max_df" : [.8, .9, 1.0],
              "DecTree__max_depth" : np.linspace(20,80,4, dtype=int),
              "DecTree__min_samples_split": np.linspace(2, 10, 5, dtype=int),
              "DecTree__min_samples_leaf": np.linspace(2, 10, 5, dtype=int)}

gs_dectree_tvec = GridSearchCV(pipe_dectree, param_grid_dectree, n_jobs=-2)

In [None]:
pipe_bayes = Pipeline([
    ('vect', tvec),
    ('BernoulliNB', bernoulliNB)
])

param_grid_bayes = {"vect__max_df" : [.8, .9, 1.0],
              "BernoulliNB__alpha" : np.linspace(0.0,1.0,4)}

gs_bayes_tvec = GridSearchCV(pipe_bayes, param_grid_bayes, n_jobs=-2)

In [None]:
pipe_bayes = Pipeline([
    ('vect', tvec),
    ('MultinomialNB', multinomialNB)
])

param_grid_bayes = {"vect__max_df" : [.8, .9, 1.0],
              "MultinomialNB__alpha" : np.linspace(0.0,1.0,4)}

gs_bayes_tvec_multi = GridSearchCV(pipe_bayes, param_grid_bayes, n_jobs=-2)

<a name="basic_models"></a>
### Fitting Basic Models

<a name="modelling_functions"></a>
#### Modelling Functions

In [None]:
def fit_pipeline(pipe, X_train, y_train):
    """Fits a pipeline.
    Arguments:
        pipe = pipeline to fit
        X_train = predictors training set
        y_train = target training set."""
    pipe.fit(X_train,y_train)

def get_scores(fitted_model, X_train, y_train, X_test, y_test, cv=5):
    """Gets Training/Mean Cross Val/Testing score for a fitted model and a provided set of train/test data.
    Arguments:
        fitted_model = model that has been fitted
        X_train, y_train, X_test, y_test = training predictors/target as well as testing predictors/target
        cv = number of cross-validation folds (default=5)"""
    
    cv_scores = cross_val_score(fitted_model, X_train, y_train, cv=cv)

    print("Training Score:", fitted_model.score(X_train, y_train))
    print("Mean Cross Val Score:", cv_scores.mean())
    print("Testing Score:", fitted_model.score(X_test, y_test))
    
    return (fitted_model.score(X_train, y_train), cv_scores.mean(), fitted_model.score(X_test, y_test))

<a name="fit_basic_models_kvp"></a>
#### Fit Basic Models (Vectorized Lines) for Kirk vs Picard
- using the above Vectorizers (CountVectorizer and Tf-idf)
- using the above models (LogisticRegression, DecisionTree, Bernoulli Naive Bayes, RandomForest)

In [None]:
fit_pipeline(gs_logreg, tt_list_split_kirk_picard_5_words[0], tt_list_split_kirk_picard_5_words[1])
logreg_kp = gs_logreg.best_estimator_
logreg_kp_scores = get_scores(logreg_kp, *tt_list_split_kirk_picard_5_words)

In [None]:
fit_pipeline(gs_dectree, tt_list_split_kirk_picard_5_words[0], tt_list_split_kirk_picard_5_words[1])
dectree_kp = gs_dectree.best_estimator_
dectree_kp_scores = get_scores(dectree_kp, *tt_list_split_kirk_picard_5_words)

In [None]:
fit_pipeline(gs_bayes, tt_list_split_kirk_picard_5_words[0], tt_list_split_kirk_picard_5_words[1])
bayes_kp = gs_bayes.best_estimator_
bayes_kp_scores = get_scores(bayes_kp, *tt_list_split_kirk_picard_5_words)

In [None]:
fit_pipeline(gs_forest, tt_list_split_kirk_picard_5_words[0], tt_list_split_kirk_picard_5_words[1])
forest_kp = gs_forest.best_estimator_
forest_kp_scores = get_scores(forest_kp, *tt_list_split_kirk_picard_5_words)

In [None]:
fit_pipeline(gs_logreg_tvec, tt_list_split_kirk_picard_5_words[0], tt_list_split_kirk_picard_5_words[1])
logreg_tvec_kp = gs_logreg_tvec.best_estimator_
logreg_tvec_kp_scores = get_scores(logreg_tvec_kp, *tt_list_split_kirk_picard_5_words)

In [None]:
fit_pipeline(gs_dectree_tvec, tt_list_split_kirk_picard_5_words[0], tt_list_split_kirk_picard_5_words[1])
dectree_tvec_kp = gs_dectree_tvec.best_estimator_
dectree_tvec_kp_scores = get_scores(dectree_tvec_kp, *tt_list_split_kirk_picard_5_words)

In [None]:
fit_pipeline(gs_bayes_tvec, tt_list_split_kirk_picard_5_words[0], tt_list_split_kirk_picard_5_words[1])
bayes_tvec_kp = gs_bayes_tvec.best_estimator_
bayes_tvec_kp_scores = get_scores(bayes_tvec_kp, *tt_list_split_kirk_picard_5_words)

<a name="fit_basic_models_capt"></a>
#### Fit Basic Models (Vectorized Lines) for all captains
- using the above Vectorizers (CountVectorizer and Tf-idf)
- using the above models (LogisticRegression, DecisionTree, Multinomial Naive Bayes, RandomForest)

In [None]:
fit_pipeline(gs_logreg, tt_list_split_all_capt_5_words[0], tt_list_split_all_capt_5_words[1])
logreg_allc = gs_logreg.best_estimator_
logreg_allc_scores = get_scores(logreg_allc, *tt_list_split_all_capt_5_words)

In [None]:
fit_pipeline(gs_dectree, tt_list_split_all_capt_5_words[0], tt_list_split_all_capt_5_words[1])
dectree_allc = gs_dectree.best_estimator_
dectree_allc_scores = get_scores(dectree_allc, *tt_list_split_all_capt_5_words)

In [None]:
fit_pipeline(gs_bayes_multi, tt_list_split_all_capt_5_words[0], tt_list_split_all_capt_5_words[1])
bayes_allc = gs_bayes_multi.best_estimator_
bayes_allc_scores = get_scores(bayes_allc, *tt_list_split_all_capt_5_words)

In [None]:
fit_pipeline(gs_forest, tt_list_split_all_capt_5_words[0], tt_list_split_all_capt_5_words[1])
forest_allc = gs_forest.best_estimator_
forest_allc_scores = get_scores(forest_allc, *tt_list_split_all_capt_5_words)

In [None]:
fit_pipeline(gs_logreg_tvec, tt_list_split_all_capt_5_words[0], tt_list_split_all_capt_5_words[1])
logreg_tvec_allc = gs_logreg_tvec.best_estimator_
logreg_tvec_allc_scores = get_scores(logreg_tvec_allc, *tt_list_split_all_capt_5_words)

In [None]:
fit_pipeline(gs_dectree_tvec, tt_list_split_all_capt_5_words[0], tt_list_split_all_capt_5_words[1])
dectree_tvec_allc = gs_dectree_tvec.best_estimator_
dectree_tvec_allc_scores = get_scores(dectree_tvec_allc, *tt_list_split_all_capt_5_words)

In [None]:
fit_pipeline(gs_bayes_tvec_multi, tt_list_split_all_capt_5_words[0], tt_list_split_all_capt_5_words[1])
bayes_tvec_allc = gs_bayes_tvec_multi.best_estimator_
bayes_tvec_allc_scores = get_scores(bayes_tvec_allc, *tt_list_split_all_capt_5_words)

<a name="eval_basic_models"></a>
## Evaluating Basic Models

In this section basic models will be evaluated with regards to their accuracy, ROC- and Precision-Recall Curves. 
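
As a reminder of what these metrics measure, the sketch below computes accuracy, precision, recall and the false-positive rate by hand for a binary case. The helper `binary_metrics` and the toy predictions are hypothetical and are not results from this project.

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Count the four confusion-matrix cells and derive the usual ratios."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    tn = sum(t != pos_label and p != pos_label for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),            # how many predicted positives are correct
        "recall": tp / (tp + fn),               # how many true positives are found (TPR)
        "false_positive_rate": fp / (fp + tn),  # x-axis of a ROC curve
    }

# hypothetical labels and predictions
y_true = ["picard", "picard", "picard", "kirk", "kirk", "kirk", "kirk", "picard"]
y_pred = ["picard", "picard", "kirk", "kirk", "picard", "kirk", "kirk", "picard"]

metrics = binary_metrics(y_true, y_pred, pos_label="picard")
print(metrics)
```

A ROC curve plots recall against the false-positive rate over all decision thresholds, while a precision-recall curve plots precision against recall.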

**Section Overview**
- [Evaluating Basic Models](#eval_basic_models)
    - [Model evaluation functions](#eval_functions)
    - [Kirk vs Picard](#kirkvpicard_eval)
        - [Overview](#kirkvpicard_overview_eval)
        - [Logistic Regression -  Kirk vs Picard](#kirkvpicard_logreg)
        - [Decision Tree -  Kirk vs Picard](#kirkvpicard_dectree)
        - [Bernoulli Naive Bayes - Kirk vs Picard](#kirkvpicard_bayes)
        - [Random Forest - Kirk vs Picard](#krikvpicard_random_forest)
        - [Logistic Regression + tfidf - Kirk vs Picard](#krikvpicard_logreg_tfidf)
        - [Decision Tree + tfidf - Kirk vs Picard](#krikvpicard_dectree_tfidf)
        - [Bernoulli Naive Bayes + tfidf - Kirk vs Picard](#krikvpicard_bayes_tfidf)
    - [All Captains](#allc_eval)
        - [Overview](#allc_overview_eval)
        - [Logistic Regression -  All Captains](#allc_logreg)
        - [Decision Tree -   All Captains](#allc_dectree)
        - [Multinomial Naive Bayes -  All Captains](#allc_bayes)
        - [Random Forest -  All Captains](#allc_random_forest)
        - [Logistic Regression + tfidf -  All Captains](#allc_logreg_tfidf)
        - [Decision Tree + tfidf -  All Captains](#allc_dectree_tfidf)
        - [Multinomial Naive Bayes + tfidf -  All Captains](#allc_bayes_tfidf)

<a name="eval_functions"></a>
#### Model evaluation functions

In [None]:
def get_logreg_coef_from_pipeline(pipeline, vectorizer_name, model_name, multiclass=False, 
                                  class_of_interest=None):
    """Gets coefficients from a model (that has a .coef_ attribute) within a fitted pipeline. 
    Returns a dataframe.
    Arguments:
        pipeline = fitted pipeline
        vectorizer_name = name assigned to the vectorizer in the pipeline
        model_name = name assigned to the model in the pipeline
        multiclass = whether or not the classification is multiclass (default=False)
        class_of_interest = in case of multiclass, for which class coefficients are wanted (default=None)
                            should be specified as the class name (string)."""
    array_classes = pipeline.named_steps[model_name].classes_
    class_of_interest_index = np.where(array_classes == class_of_interest) #get the index of the class of interest
    
    # in case of multiclass get coefficients from the class of interest
    if multiclass == True:
        coefs = pipeline.named_steps[model_name].coef_[class_of_interest_index]
    if multiclass == False:
        coefs = pipeline.named_steps[model_name].coef_
    
    # get the name of the coefficients
    vocab_unsorted = pipeline.named_steps[vectorizer_name].vocabulary_
    vocab_sorted_by_index = sorted(vocab_unsorted.keys())
    
    # create the dataframe
    df =  pd.DataFrame(coefs.reshape(-1,1),columns=['coef'], index= vocab_sorted_by_index)
    
    return df


def plot_coef_dataframe(df, lenght=20, positive=True, ax=None):
    """Plots either the coefficients with the highest, or the lowest value, given a dataframe of coefficients.
    Arguments:
        df = dataframe of coefficients
        lenght = number of coefficients to plot (default=20)
        positive = boolean value indicating whether to plot top or bottom coefficients (default=True)
        ax = subplot position (default=None)."""
    if positive == True:
        return sns.barplot(y=df.sort_values(by="coef", ascending=False).index[:lenght], 
                x=df.sort_values(by="coef", ascending=False).coef[:lenght],
                orient="h",
                ax=ax, edgecolor="black",
                color="#9A99FF")
    if positive == False:
        return sns.barplot(y=df.sort_values(by="coef", ascending=True).index[:lenght], 
                x=df.sort_values(by="coef", ascending=True).coef.apply(lambda x: -x)[:lenght],
                orient="h",
                ax=ax)     
       
    
def plot_graphs_logreg_binary(df_coef, model, X_test, y_test, lenght=20, pos_label="1", neg_label="0", 
                              coef_plot=True):
    """Plots the following graphs for a binary classification: Coefficient for positive label, coefficients
    for negative labels, confusion matrix, ROC-curve, Precision-Recall-Curve, barplot of baseline counts.
    Arguments:
        df_coef = dataframe of coefficients
        model = fitted model
        X_test = predictors test set
        y_test = target test set
        lenght = number of coefficients to plot (default=20)
        pos_label = label of the positive class (default="1")
        neg_label = label of the negative class (default="0")
        coef_plot = turn the plotting of coefficients on/off (default=True)"""
    
    if coef_plot == True:
    
        fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(20,20))

        ax[2][0].set_title(f"Coefficients predicting for {pos_label}")
        plot_coef_dataframe(df_coef, lenght, ax=ax[2][0])

        ax[2][1].set_title(f"Coefficients predicting for {neg_label}")
        plot_coef_dataframe(df_coef, lenght, positive=False, ax=ax[2][1])
        
    if coef_plot == False:
        fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15,10))

    plot_confusion_matrix(model, X_test, y_test, ax=ax[0][0], colorbar=False, cmap="Blues")
    plot_roc_curve(model, X_test, y_test, ax=ax[0][1])
    plot_precision_recall_curve(model, X_test, y_test, ax=ax[1][0])

    sns.barplot(x=y_test.value_counts().index, y=y_test.value_counts().values, ax=ax[1][1])
    
    fig.subplots_adjust(hspace=0.5)
    
    plt.show()
    
    
def plot_graphs_logreg_five_classes(pipeline, vectorizer_name, model_name, X_test, y_test, lenght=20, label=[]):
    """Plots the following graphs for a 5-class multiclass classification: Coefficients for each label, 
    confusion matrix, ROC-curve.
    Arguments:
        pipeline = fitted pipeline
        vectorizer_name = name of the vectorizer within the pipeline
        model_name = name of the model within the pipeline
        X_test = predictors test set
        y_test = target test set
        lenght = number of coefficients to plot (default=20)
        label = list of all labels (default=[])"""
    
    # getting coefficient dataframes
    df1 = get_logreg_coef_from_pipeline(pipeline, vectorizer_name, 
                                       model_name, multiclass=True, class_of_interest=label[0].lower())
    df2 = get_logreg_coef_from_pipeline(pipeline, vectorizer_name, 
                                       model_name, multiclass=True, class_of_interest=label[1].lower())
    df3 = get_logreg_coef_from_pipeline(pipeline, vectorizer_name, 
                                       model_name, multiclass=True, class_of_interest=label[2].lower())
    df4 = get_logreg_coef_from_pipeline(pipeline, vectorizer_name, 
                                       model_name, multiclass=True, class_of_interest=label[3].lower())
    df5 = get_logreg_coef_from_pipeline(pipeline, vectorizer_name, 
                                       model_name, multiclass=True, class_of_interest=label[4].lower())
    
    
    fig, ax = plt.subplots(nrows=4, ncols=2, figsize=(20,50))

    ax[0][0].set_title(f"Coefficients predicting for {label[0]}")
    plot_coef_dataframe(df1, lenght, ax=ax[0][0])

    ax[0][1].set_title(f"Coefficients predicting for {label[1]}")
    plot_coef_dataframe(df2, lenght, ax=ax[0][1])

    ax[1][0].set_title(f"Coefficients predicting for {label[2]}")
    plot_coef_dataframe(df3, lenght, ax=ax[1][0])
    
    ax[1][1].set_title(f"Coefficients predicting for {label[3]}")
    plot_coef_dataframe(df4, lenght, ax=ax[1][1])
    
    ax[2][0].set_title(f"Coefficients predicting for {label[4]}")
    plot_coef_dataframe(df5, lenght, ax=ax[2][0])
    
    ax[0][0].set_xlabel("Coefficient")
    ax[0][1].set_xlabel("Coefficient")
    ax[1][0].set_xlabel("Coefficient")
    ax[1][1].set_xlabel("Coefficient")
    ax[2][0].set_xlabel("Coefficient")
    
    strek_colors = ListedColormap(["#C6F6FA", "#7CEAF4", "#2FDEEE", "#0FAEBD", "#0B7A84"])

    plt.rcParams.update({'font.size': 25})
    
    plot_confusion_matrix(pipeline, X_test, y_test, ax=ax[2][1], colorbar=True, cmap=strek_colors,
                         display_labels=["Archer", "Janeway", "Kirk", "Picard", "Sisko"],
                         values_format="d")
    
    ax[2][1].tick_params(labelsize=25)
    ax[2][1].grid(False)
    
    skplt.metrics.plot_roc(y_test, pipeline.predict_proba(X_test), 
                       plot_micro=True, plot_macro=True, 
                       title_fontsize=20, text_fontsize=16, figsize=(8,6), ax=ax[3][0])
    
    ax[-1, -1].axis('off')
    
    ax[3][0].legend(loc=(1,0))
    fig.subplots_adjust(hspace=0.4)
    fig.subplots_adjust(wspace=0.5)
    
    sns.despine()
    
    plt.show()
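
The coefficient helper above labels the coefficients with `sorted(vocab_unsorted.keys())`. That only matches the column order if the column indices were assigned in alphabetical term order, which to my understanding is the case when the vectorizer builds its own vocabulary. A sketch of an order-independent alternative, using hypothetical `vocabulary_`-style mappings:

```python
# hypothetical vocabulary_ mapping: term -> column index (assigned alphabetically)
vocabulary = {"warp": 3, "engage": 1, "captain": 0, "shields": 2}

names_by_index = sorted(vocabulary, key=vocabulary.get)  # ordered by column index
names_alphabetical = sorted(vocabulary)                  # ordered by term

# hypothetical user-supplied vocabulary whose indices are not alphabetical
custom_vocabulary = {"warp": 0, "captain": 1}
custom_by_index = sorted(custom_vocabulary, key=custom_vocabulary.get)
custom_alphabetical = sorted(custom_vocabulary)

print(names_by_index == names_alphabetical)
print(custom_by_index == custom_alphabetical)
```

Sorting by index keeps names and coefficients aligned in both cases, so it may be the safer habit if a fixed vocabulary is ever passed to the vectorizer.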

In [None]:
def get_fea_importances_from_pipeline(pipeline, vectorizer_name, model_name, multiclass=False, 
                                      class_of_interest=None):
    """Gets feature importances from a fitted pipeline that includes a model with a .feature_importances_ attribute
    and returns a dataframe.
    Arguments:
        pipeline = fitted pipeline
        vectorizer_name = name assigned to the vectorizer in the pipeline
        model_name = name assigned to the model in the pipeline
        multiclass = whether or not the classification is multiclass (default=False)
        class_of_interest = in case of multiclass, for which class feature importances are wanted (default=None)
                            should be specified as the class name (string)."""
    
    array_classes = pipeline.named_steps[model_name].classes_
    class_of_interest_index = np.where(array_classes == class_of_interest)
    
    if multiclass == True:
        fea_importances = pipeline.named_steps[model_name].feature_importances_[class_of_interest_index]
    if multiclass == False:
        fea_importances = pipeline.named_steps[model_name].feature_importances_
    
    vocab_unsorted = pipeline.named_steps[vectorizer_name].vocabulary_
    vocab_sorted_by_index = sorted(vocab_unsorted.keys())
    
    df =  pd.DataFrame(fea_importances.reshape(-1,1),columns=['feature_importance'], index= vocab_sorted_by_index)
    
    return df


def plot_fea_importance_dataframe(df, lenght=20, positive=True, ax=None):
    """Plots either the feature importances with the highest, or the lowest value, given a dataframe of 
    feature importances.
    Arguments:
        df = dataframe of feature importances
        lenght = number of feature importances to plot (default = 20)
        positive = boolean value indicating whether to plot top or bottom feature importances (default=True)
        ax = subplot position (default=None)."""
    if positive == True:
        return sns.barplot(y=df.sort_values(by="feature_importance", ascending=False).index[:lenght], 
                x=df.sort_values(by="feature_importance", ascending=False).feature_importance[:lenght],
                orient="h",
                ax=ax, color="#CC6698")
    if positive == False:
        return sns.barplot(y=df.sort_values(by="feature_importance", ascending=True).index[:lenght], 
                x=df.sort_values(by="feature_importance", ascending=True).feature_importance.apply(lambda x: -x)[:lenght],
                orient="h",
                ax=ax) 

def plot_graphs_dectree_binary(df_fea, model, X_test, y_test, lenght=20, pos_label="1", neg_label="0"):
    """Plots the following graphs for a binary classification: Feature importances,
    confusion matrix, ROC-curve, Precision-Recall-Curve, barplot of baseline counts.
    Arguments:
        df_fea = dataframe of feature importances
        model = fitted model
        X_test = predictors test set
        y_test = target test set
        lenght = number of feature importances to plot (default=20)
        pos_label = label of the positive class (default="1")
        neg_label = label of the negative class (default="0")"""
    
    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15,15))

    ax[0][0].set_title(f"Feature importances")
    plot_fea_importance_dataframe(df_fea, lenght, ax=ax[0][0])

    plot_confusion_matrix(model, X_test, y_test, ax=ax[0][1], colorbar=False, cmap="Blues")
    plot_roc_curve(model, X_test, y_test, ax=ax[1][0])
    plot_precision_recall_curve(model, X_test, y_test, ax=ax[1][1])
    
    plt.show()
    
    
def plot_graphs_DecTree_five_classes(pipeline, vectorizer_name, model_name, X_test, y_test, lenght=20, label=[]):
    """Plots the following graphs for a 5-class multiclass classification: Feature importances, 
    confusion matrix, ROC-curve.
    Arguments:
        pipeline = fitted pipeline
        vectorizer_name = name of the vectorizer within the pipeline
        model_name = name of the model within the pipeline
        X_test = predictors test set
        y_test = target test set
        lenght = number of coefficients to plot (default=20)
        label = list of all labels (default=[])"""
    
    df1 = get_fea_importances_from_pipeline(pipeline, vectorizer_name, model_name)

    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15,10))

    ax[0][0].set_title(f"Feature importances")
    
    plot_fea_importance_dataframe(df1, lenght, ax=ax[0][0])
    
    strek_colors = ListedColormap(["#C6F6FA", "#7CEAF4", "#2FDEEE", "#0FAEBD", "#0B7A84"])

    plot_confusion_matrix(pipeline, X_test, y_test, ax=ax[0][1], colorbar=False, cmap=strek_colors, 
                          display_labels=["Archer", "Janeway", "Kirk", "Picard", "Sisko"])
    
    skplt.metrics.plot_roc(y_test, pipeline.predict_proba(X_test), 
                       plot_micro=True, plot_macro=True, 
                       title_fontsize=20, text_fontsize=16, figsize=(8,6), ax=ax[1][0])
    
    ax[-1, -1].axis('off')
    
    ax[1][0].legend(loc=(1,0))
    
    plt.show()
    
    
def plot_decision_tree_graph(pipeline, vectorizer_name, model_name):
    """Plots the graph of a decision tree.
    Arguments:
        pipeline = fitted pipeline
        vectorizer_name = name of the vectorizer within the pipeline
        model_name = name of the model within the pipeline."""
    dot_data = export_graphviz(pipeline.named_steps[model_name],
                filled=True,
                rounded=True,
                special_characters=False,
                feature_names= sorted(pipeline.named_steps[vectorizer_name].vocabulary_.keys())
                )

    graph = graphviz.Source(dot_data) 
    return graph

<a name=kirkvpicard_eval></a>
### Kirk vs Picard

<a name="kirkvpicard_overview_eval"></a>
#### Overview

In [None]:
df_kp_scores = pd.DataFrame([[*logreg_kp_scores], [*dectree_kp_scores], [*bayes_kp_scores], [*forest_kp_scores], 
               [*logreg_tvec_kp_scores], [*dectree_tvec_kp_scores], [*bayes_tvec_kp_scores]], 
               index=["LogReg + Cvec", "Decision Tree + Cvec", "Bernoulli naive bayes + Cvec", 
                      "Random Forest + Cvec", "LogReg + tfidf", "Decision Tree + tfidf",
                      "Bernoulli naive bayes + tfidf"],
                columns=["Training Score", "Mean Cross-Val Score", "Test Score"]).sort_values(by="Mean Cross-Val Score")

In [None]:
fig, ax = plt.subplots(figsize=(8,8))

bar_chart = ax.barh(df_kp_scores.index, df_kp_scores["Mean Cross-Val Score"], color="teal")

for bar, score in zip(ax.patches, df_kp_scores["Mean Cross-Val Score"]):
    ax.text(bar.get_x()+bar.get_width()+0.01, bar.get_y()+bar.get_height()/2,round(score,3), 
            color = 'black', ha = 'left', va = 'center', size=14)

ax.set_title("Mean CV Scores by model (Kirk vs. Picard)")
plt.xlim(0,0.75);

<a name=kirkvpicard_logreg></a>
#### Logistic Regression -  Kirk vs Picard

In [None]:
df_coef = get_logreg_coef_from_pipeline(logreg_kp, "vect", "logreg")
plot_graphs_logreg_binary(df_coef, logreg_kp, tt_list_split_kirk_picard_5_words[2], 
                   tt_list_split_kirk_picard_5_words[3], pos_label="picard", neg_label="kirk")

<a name=kirkvpicard_dectree></a>
#### Decision Tree -  Kirk vs Picard

In [None]:
df_fea = get_fea_importances_from_pipeline(dectree_kp, "vect", "DecTree")

plot_graphs_dectree_binary(df_fea, dectree_kp, tt_list_split_kirk_picard_5_words[2], 
                   tt_list_split_kirk_picard_5_words[3], pos_label="picard", neg_label="kirk")

In [None]:
plot_decision_tree_graph(dectree_kp, "vect", "DecTree")

<a name=kirkvpicard_bayes></a>
#### Bernoulli Naive Bayes - Kirk vs Picard

In [None]:
coefs = bayes_kp.named_steps["BernoulliNB"].coef_

vocab_unsorted = bayes_kp.named_steps["vect"].vocabulary_
vocab_sorted_by_index = sorted(vocab_unsorted.keys())

df =  pd.DataFrame(coefs.reshape(-1,1),columns=['coef'], index= vocab_sorted_by_index)

plt.figure(figsize=(5,5))

plt.barh(df.sort_values("coef", ascending=False).head(20).index, df.sort_values("coef", ascending=False).head(20).coef);

In [None]:
df_coef = get_logreg_coef_from_pipeline(bayes_kp, "vect", "BernoulliNB")
plot_graphs_logreg_binary(df_coef, bayes_kp, tt_list_split_kirk_picard_5_words[2], 
                   tt_list_split_kirk_picard_5_words[3], pos_label="picard", neg_label="kirk", coef_plot=False)

<a name="krikvpicard_random_forest"></a>
#### Random Forest - Kirk vs Picard

In [None]:
df_fea = get_fea_importances_from_pipeline(forest_kp, "vect", "Forest")

plot_graphs_dectree_binary(df_fea,forest_kp, tt_list_split_kirk_picard_5_words[2], 
                   tt_list_split_kirk_picard_5_words[3], pos_label="picard", neg_label="kirk")

<a name="krikvpicard_logreg_tfidf"></a>
#### Logistic Regression + tfidf - Kirk vs Picard

In [None]:
df_coef = get_logreg_coef_from_pipeline(logreg_tvec_kp, "vect", "logreg")
plot_graphs_logreg_binary(df_coef, logreg_tvec_kp, tt_list_split_kirk_picard_5_words[2], 
                   tt_list_split_kirk_picard_5_words[3], pos_label="picard", neg_label="kirk")

<a name="krikvpicard_dectree_tfidf"></a>
#### Decision Tree + tfidf - Kirk vs Picard

In [None]:
df_fea = get_fea_importances_from_pipeline(dectree_tvec_kp, "vect", "DecTree")

plot_graphs_dectree_binary(df_fea, dectree_tvec_kp, tt_list_split_kirk_picard_5_words[2], 
                   tt_list_split_kirk_picard_5_words[3], pos_label="picard", neg_label="kirk")

<a name="krikvpicard_bayes_tfidf"></a>
#### Bernoulli Naive Bayes + tfidf - Kirk vs Picard

In [None]:
coefs = bayes_tvec_kp.named_steps["BernoulliNB"].coef_

vocab_unsorted = bayes_tvec_kp.named_steps["vect"].vocabulary_
vocab_sorted_by_index = sorted(vocab_unsorted.keys())

df =  pd.DataFrame(coefs.reshape(-1,1),columns=['coef'], index= vocab_sorted_by_index)

plt.figure(figsize=(5,5))

plt.barh(df.sort_values("coef", ascending=False).head(20).index, df.sort_values("coef", ascending=False).head(20).coef);

In [None]:
df_coef = get_logreg_coef_from_pipeline(bayes_tvec_kp, "vect", "BernoulliNB")
plot_graphs_logreg_binary(df_coef, bayes_tvec_kp, tt_list_split_kirk_picard_5_words[2], 
                   tt_list_split_kirk_picard_5_words[3], pos_label="picard", neg_label="kirk", coef_plot=False)

<a name=allc_eval></a>
### All Captains

<a name="allc_overview_eval"></a>
#### Overview

In [None]:
# create overview dataframe all basic models
df_allc_scores = pd.DataFrame([[*logreg_allc_scores], [*dectree_allc_scores], [*bayes_allc_scores], 
                             [*forest_allc_scores], [*logreg_tvec_allc_scores], [*dectree_tvec_allc_scores], 
                             [*bayes_tvec_allc_scores]], 
                       index=["Logistic Regression + Cvec", "Decision Tree + Cvec", "Multinomial Naive Bayes + Cvec", 
                      "Random Forest + Cvec", "Logistic Regression + Tfidf", "Decision Tree + Tfidf",
                      "Multinomial Naive Bayes + Tfidf"],
                       columns=["Training Score", "Mean Cross-Val Score", "Test Score"]).sort_values("Mean Cross-Val Score")

In [None]:
# get mean test score all basic models
np.mean(df_allc_scores["Test Score"])

In [None]:
# Plot overview test accuracy all models
fig, ax = plt.subplots(figsize=(8,8))

# create bar chart
bar_chart = ax.barh(df_allc_scores.index, df_allc_scores["Test Score"], 
                    color=["#FFCC9A","#FFCC9A","#FFCC9A","#1671A2","#1671A2","#1671A2","#1671A2"], 
                    edgecolor="black")

# add test score to the bar charts
for bar, score in zip(ax.patches, df_allc_scores["Test Score"]):
    if score > np.mean(df_allc_scores["Test Score"]): # add score to the right if > mean
        ax.text(bar.get_x()+bar.get_width()+0.01, bar.get_y()+bar.get_height()/2,round(score,3), 
            color = 'black', ha = 'left', va = 'center', size=20)
    else: # add score within bar if < mean
        ax.text(bar.get_x()+bar.get_width()-0.05, bar.get_y()+bar.get_height()/2,round(score,3), 
            color = 'black', ha = 'left', va = 'center', size=20)        

# add title
ax.set_title("Test Accuracy Scores", size=25)

# change ticksize
plt.xticks(size=20)
plt.yticks(size=20)

# add axline at baseline accuracy
plt.axvline(0.25, c="black")
 
# add baseline text
ax.text(0.23,-1.5, "Baseline", fontproperties={"size":20, "style":"italic"})

# customize length x-axis
plt.xlim(0,0.45);
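
The baseline drawn in the chart above is hard-coded. A common way to derive such a baseline is the share of the most frequent class, i.e. the accuracy of always predicting that class. The sketch below shows this with invented label counts that happen to give 0.25; the real class counts are not shown in this section, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# hypothetical label distribution, not the project's real counts
y = ["picard"] * 25 + ["kirk"] * 20 + ["janeway"] * 20 + ["sisko"] * 20 + ["archer"] * 15

label, baseline = majority_baseline(y)
print(label, baseline)
```

In the notebook, the same value could be obtained from the target column, for example with `y_test.value_counts(normalize=True).max()`.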

<a name=allc_logreg></a>
#### Logistic Regression -  All Captains

In [None]:
plot_graphs_logreg_five_classes(logreg_allc, "vect", "logreg", tt_list_split_all_capt_5_words[2], 
                   tt_list_split_all_capt_5_words[3], label=["Archer", "Kirk", "Picard", "Sisko", "Janeway"])

<a name=allc_dectree></a>
#### Decision Tree -  All Captains

In [None]:
plot_graphs_DecTree_five_classes(dectree_allc, "vect", "DecTree", tt_list_split_all_capt_5_words[2], 
                   tt_list_split_all_capt_5_words[3], label=["archer", "kirk", "picard", "sisko", "janeway"])

In [None]:
plot_decision_tree_graph(dectree_allc, "vect", "DecTree")

<a name=allc_bayes></a>
#### Multinomial Naive Bayes - All Captains

In [None]:
plot_graphs_logreg_five_classes(bayes_allc, "vect", "MultinomialNB", tt_list_split_all_capt_5_words[2], 
                   tt_list_split_all_capt_5_words[3], label=["archer", "kirk", "picard", "sisko", "janeway"])

<a name="allc_random_forest"></a>
#### Random Forest - All Captains

In [None]:
plot_graphs_DecTree_five_classes(forest_allc, "vect", "Forest", tt_list_split_all_capt_5_words[2], 
                   tt_list_split_all_capt_5_words[3], label=["archer", "kirk", "picard", "sisko", "janeway"])

<a name="allc_logreg_tfidf"></a>
#### Logistic Regression + tfidf - All Captains

In [None]:
plot_graphs_logreg_five_classes(logreg_tvec_allc, "vect", "logreg", tt_list_split_all_capt_5_words[2], 
                   tt_list_split_all_capt_5_words[3], label=["archer", "kirk", "picard", "sisko", "janeway"])

<a name=allc_dectree_tfidf></a>
#### Decision Tree + tfidf -  All Captains

In [None]:
plot_graphs_DecTree_five_classes(dectree_tvec_allc, "vect", "DecTree", tt_list_split_all_capt_5_words[2], 
                   tt_list_split_all_capt_5_words[3], label=["archer", "kirk", "picard", "sisko", "janeway"])

In [None]:
plot_decision_tree_graph(dectree_tvec_allc, "vect", "DecTree")

<a name=allc_bayes_tfidf></a>
#### Multinomial Naive Bayes + tfidf - All Captains

In [None]:
plot_graphs_logreg_five_classes(bayes_tvec_allc, "vect", "MultinomialNB", tt_list_split_all_capt_5_words[2], 
                   tt_list_split_all_capt_5_words[3], label=["archer", "kirk", "picard", "sisko", "janeway"])

# Models with Additional Features

<a name="import_glove"></a>
#### Importing GloVe Embeddings

Using pre-trained embeddings from <a href="https://github.com/stanfordnlp/GloVe">GloVe</a> (Wikipedia crawl). I will use the 300-dimensional set for modelling and the 50-dimensional set for finding the average word of a character. The reason for using fewer dimensions when looking for the closest embedding is that the higher-dimensional set tends to return highly specific words (or combinations of letters and numbers).

In [None]:
embeddings_dict = {}

with open("/Users/tjanif/Desktop/KirkvPicard_Material/embeddings/glove.6B.300d.txt") as file:
    for line in file:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector   

In [None]:
few_embeddings_dict = {}

with open("/Users/tjanif/Desktop/KirkvPicard_Material/embeddings/glove.6B.50d.txt") as file:
    for line in file:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        few_embeddings_dict[word] = vector   

In [None]:
# each word vector has 300 dimensions
len(embeddings_dict["the"])

In [None]:
# each word vector has 50 dimensions
len(few_embeddings_dict["the"])

<a name="create_glove_functions"></a>
#### GloVe feature engineering functions

In [None]:
# Count vectorizer for word-embeddings
cvec_em = CountVectorizer(stop_words = "english",     
                       ngram_range =(1,1))

In [None]:
def convert_lines_to_average_embedding_vect(arr_lines):
    """Takes in lines of dialog and calculates the average word embedding vector for each line, stopwords
    are excluded.
    Arguments:
        arr_lines = array of lines."""
    
    # use CountVectorizer to create sparse matrix
    sparse_matrix = cvec_em.fit_transform(arr_lines)

    # dataframe of word counts, excluding common English words
    df_em = pd.DataFrame(sparse_matrix.toarray(),columns=cvec_em.get_feature_names())
    
    line_vectors = []
    
    # iterating over lines
    for line in df_em.index:
        base_vectors = np.zeros(300)  # creating an empty (all 0) 300-dimensional base vector
        number_of_vectors = 0
        
        # iterating over all words in the set of lines
        for word, entry in zip(df_em.columns, df_em.loc[line,:]):
            
            # in case the entry is not 0 (= the word occurs in the line)
            if entry != 0:
                try:
                    word_vectors = embeddings_dict[word]  # get the word vector
                    number_of_vectors += 1 
                    base_vectors += word_vectors  #add the new vector to the vector for the whole line
                except KeyError:  # the word has no GloVe embedding
                    pass
                
        if number_of_vectors > 0:
            base_vectors = base_vectors / number_of_vectors  # divide by number_of_vectors to get the average
        line_vectors.append(base_vectors)
    
    return line_vectors

<a name="find_avg_word"></a>
#### Find the average word

This is purely for fun. Using word embeddings, one can calculate the average vector of all words in a text. With the `find_closest_embeddings` function, the word closest to that average vector can be found.

Interestingly, the closest word to the average of each captain is the same: "supposed".
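
As a minimal sketch of the idea, assuming a tiny made-up embedding table (`toy_embeddings` and its values are invented for illustration and are not GloVe vectors), the average vector and the words closest to it could be computed like this:

```python
import math

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "ship": [1.0, 0.0],
    "captain": [0.0, 1.0],
    "crew": [0.5, 0.5],
    "warp": [1.0, 1.0],
}


def average_vector(words, embeddings):
    """Average the embedding vectors of all words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("none of the words has an embedding")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def closest_words(vector, embeddings):
    """Return all words sorted by Euclidean distance to the given vector."""
    return sorted(embeddings, key=lambda w: math.dist(embeddings[w], vector))


# words without an embedding ("unknownword") are simply skipped
avg = average_vector(["ship", "captain", "unknownword"], toy_embeddings)
ranking = closest_words(avg, toy_embeddings)
```

In this toy table the average of "ship" and "captain" coincides with the vector of "crew", so "crew" comes first in `ranking`.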

In [None]:
def find_the_average_word(arr_lines):
    """Function that averages all embedding vectors of a corpus of text
    Arguments:
        arr_lines = array of lines."""
    
    # use CountVectorizer to create sparse matrix
    sparse_matrix = cvec_em.fit_transform(arr_lines)

    df_em = pd.DataFrame(sparse_matrix.toarray(),columns=cvec_em.get_feature_names())
    

    base_vectors = np.zeros(50)
    number_of_vectors = 0
    
    for line in df_em.index:
        for word, entry in zip(df_em.columns, df_em.loc[line,:]):
            
            if entry != 0:
                try:
                    word_vectors = few_embeddings_dict[word]
                    number_of_vectors += 1
                    base_vectors += word_vectors
                except KeyError:  # the word has no GloVe embedding
                    pass
                

    character_vector = base_vectors / number_of_vectors
    print(number_of_vectors)
    
    return character_vector


def find_closest_embeddings(embedding):
    """Function to find the word that matches a given embedding vector most closely.
    Make sure the embedding dictionary used and the embedding given have the same dimensions!
    Argument:
        embedding = embedding vector"""
    return sorted(few_embeddings_dict.keys(), 
                  key=lambda word: spatial.distance.euclidean(few_embeddings_dict[word], embedding))

In [None]:
find_closest_embeddings(find_the_average_word(lines_kirk))[:20]

In [None]:
find_closest_embeddings(find_the_average_word(lines_picard))[:20]

In [None]:
find_closest_embeddings(find_the_average_word(lines_sisko))[:20]

In [None]:
find_closest_embeddings(find_the_average_word(lines_janeway))[:20]

In [None]:
find_closest_embeddings(find_the_average_word(lines_archer))[:20]

<a name="create_glove_features"></a>
### Create Features from GloVe Embeddings

Using the average embedding vector of each line, a new set of 300 features can be created.
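
A small sketch of what "average embedding vector per line" means, assuming an invented 3-dimensional embedding table (`toy_embeddings`) instead of the 300-dimensional GloVe set and a naive whitespace tokenizer in place of `CountVectorizer`. As in `convert_lines_to_average_embedding_vect`, each distinct word counts once per line, and lines without any known word keep an all-zero vector:

```python
# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "engage": [1.0, 0.0, 2.0],
    "warp": [3.0, 2.0, 0.0],
    "beam": [0.0, 4.0, 4.0],
}


def line_to_average_vector(line, embeddings, dim=3):
    """Average the vectors of the distinct known words in one line."""
    known = [w for w in set(line.lower().split()) if w in embeddings]
    if not known:
        return [0.0] * dim  # no known word: keep the all-zero vector
    return [sum(embeddings[w][i] for w in known) / len(known) for i in range(dim)]


lines = ["Engage warp warp", "Beam me up", "Hmm"]

# each row of `features` is one line, each column one embedding dimension
features = [line_to_average_vector(line, toy_embeddings) for line in lines]
```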

In [None]:
df_em_kp = convert_lines_to_average_embedding_vect(df_lines_kirk_picard_pos.line)

In [None]:
df_em_allc = convert_lines_to_average_embedding_vect(df_lines_all_captains_pos.line)

<a name="combine_adv_features"></a>
### Combine Additional Features

Putting the GloVe embeddings, the grammatical features and the number of words per line into the same dataframe.

In [None]:
df_kp_adv_fea = pd.concat([df_lines_kirk_picard_pos.loc[:,"character":], 
                           pd.DataFrame(df_em_kp, index=df_lines_kirk_picard_pos.index)], axis=1)

In [None]:
df_allc_adv_fea = pd.concat([df_lines_all_captains_pos.loc[:,"character":], 
                           pd.DataFrame(df_em_allc, index=df_lines_all_captains_pos.index)], axis=1)

In [None]:
df_allc_adv_fea.shape

In [None]:
df_kp_adv_fea.fillna(0, inplace=True)
df_allc_adv_fea.fillna(0, inplace=True)

<a name="tts_advanced"></a>
#### Train-Test Split for Dataframes with Additional Features

In [None]:
def creating_tts_adv_fea(X,y,test_size=0.2, random_state=23):
    """Function to make a train test split in order X_train, y_train, X_test, y_test"""
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size, random_state=random_state)
    return X_train, y_train, X_test, y_test

In [None]:
tt_list_split_adv_fea_kp = creating_tts_adv_fea(df_kp_adv_fea.loc[:,"num_words":], df_kp_adv_fea["character"])

In [None]:
tt_list_split_adv_fea_kp[0].shape, tt_list_split_adv_fea_kp[1].shape, tt_list_split_adv_fea_kp[2].shape, tt_list_split_adv_fea_kp[3].shape

In [None]:
tt_list_split_adv_fea_allc = creating_tts_adv_fea(df_allc_adv_fea.loc[:,"num_words":], 
                                                  df_allc_adv_fea["character"])

In [None]:
tt_list_split_adv_fea_allc[0].shape, tt_list_split_adv_fea_allc[1].shape, tt_list_split_adv_fea_allc[2].shape, tt_list_split_adv_fea_allc[3].shape, 

<a name="scaling_adv"></a>
#### Feature Scaling
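
The scalers in this section are fitted on the training data only and then applied to the test data. A pure-Python sketch of that idea for a single feature, with invented numbers (`StandardScaler` works column-wise on whole arrays and likewise uses the population standard deviation):

```python
import statistics

train = [2.0, 4.0, 6.0]   # invented training values of a single feature
test = [4.0, 8.0]         # invented test values of the same feature

# "fit": mean and population standard deviation come from the training data only
mean = statistics.fmean(train)
std = statistics.pstdev(train)

# "transform": the same training statistics are applied to both sets
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]
```

Because the test values are scaled with the training statistics, the scaled test set does not need to have mean 0 or standard deviation 1.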

In [None]:
# instantiate two scalers
scaler_kp = StandardScaler()
scaler_allc = StandardScaler()

# kirk vs picard
X_train_scaled_kp = pd.DataFrame(scaler_kp.fit_transform(tt_list_split_adv_fea_kp[0]),  # fit on train
                                 columns=tt_list_split_adv_fea_kp[0].columns)
X_test_scaled_kp = pd.DataFrame(scaler_kp.transform(tt_list_split_adv_fea_kp[2]), # transform test
                                columns=tt_list_split_adv_fea_kp[2].columns)

# all captains
X_train_scaled_allc = pd.DataFrame(scaler_allc.fit_transform(tt_list_split_adv_fea_allc[0]), 
                                 columns=tt_list_split_adv_fea_allc[0].columns)
X_test_scaled_allc = pd.DataFrame(scaler_allc.transform(tt_list_split_adv_fea_allc[2]), 
                                 columns=tt_list_split_adv_fea_allc[2].columns)

In [None]:
# create predictors - target lists in correct order
tt_list_split_adv_fea_kp = [X_train_scaled_kp, tt_list_split_adv_fea_kp[1],
                           X_test_scaled_kp, tt_list_split_adv_fea_kp[3]]

In [None]:
tt_list_split_adv_fea_allc = [X_train_scaled_allc, tt_list_split_adv_fea_allc[1],
                           X_test_scaled_allc, tt_list_split_adv_fea_allc[3]]

<a name="create_model_adv_fea"></a>
### Create a Model only using the Additional Features

In [None]:
# Logistic Regression
param_grid_logreg = {
                "penalty" : ["l1", "l2"],
                "C": np.logspace(-1, 4, 20)}

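# the default lbfgs solver does not support the l1 penalty, saga supports both l1 and l2
gs_logreg_em = GridSearchCV(LogisticRegression(solver="saga", max_iter=50000), param_grid_logreg, n_jobs=-2)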

In [None]:
# Logistic Regression, faster version (fewer iterations)
param_grid_logreg = {"solver":["saga"],
                "penalty" : ["elasticnet"],
                "l1_ratio" : np.linspace(0,1,10),
                "C": np.logspace(-2, 3, 20)}

gs_logreg_em_fast = GridSearchCV(LogisticRegression(max_iter=1000), param_grid_logreg, n_jobs=-2)

<a name="fit_model_only_adv_fea"></a>
### Fitting a Logistic Regression on only Additional Features

#### Models Kirk vs Picard

In [None]:
fit_pipeline(gs_logreg_em, tt_list_split_adv_fea_kp[0], tt_list_split_adv_fea_kp[1])
logreg_adv_fea_kp = gs_logreg_em.best_estimator_
logreg_adv_fea_kp_scores = get_scores(logreg_adv_fea_kp, *tt_list_split_adv_fea_kp)

#### Models All Captains

In [None]:
fit_pipeline(gs_logreg_em, tt_list_split_adv_fea_allc[0], tt_list_split_adv_fea_allc[1])
logreg_adv_fea_allc = gs_logreg_em.best_estimator_
logreg_adv_fea_allc_scores = get_scores(logreg_adv_fea_allc, *tt_list_split_adv_fea_allc)

<a name="eval_models_only_adv_fea"></a>
### Evaluating Logistic Regression with only Additional Features

In [None]:
df_coef = pd.DataFrame(logreg_adv_fea_kp.coef_.reshape(-1,1), index = tt_list_split_adv_fea_kp[0].columns,
                       columns=["coef"]).sort_values(by="coef",ascending=False)
plot_graphs_logreg_binary(df_coef, logreg_adv_fea_kp, tt_list_split_adv_fea_kp[2], 
                   tt_list_split_adv_fea_kp[3], pos_label="picard", neg_label="kirk")

In [None]:
# create plots for all captains

# get class labels
label = logreg_adv_fea_allc.classes_

# create coefficient dataframes
df_coef_1 = pd.DataFrame(logreg_adv_fea_allc.coef_[0].reshape(-1,1), index = tt_list_split_adv_fea_allc[0].columns,
                       columns=["coef"]).sort_values(by="coef",ascending=False)
df_coef_2 = pd.DataFrame(logreg_adv_fea_allc.coef_[1].reshape(-1,1), index = tt_list_split_adv_fea_allc[0].columns,
                       columns=["coef"]).sort_values(by="coef",ascending=False)
df_coef_3 = pd.DataFrame(logreg_adv_fea_allc.coef_[2].reshape(-1,1), index = tt_list_split_adv_fea_allc[0].columns,
                       columns=["coef"]).sort_values(by="coef",ascending=False)
df_coef_4 = pd.DataFrame(logreg_adv_fea_allc.coef_[3].reshape(-1,1), index = tt_list_split_adv_fea_allc[0].columns,
                       columns=["coef"]).sort_values(by="coef",ascending=False)
df_coef_5 = pd.DataFrame(logreg_adv_fea_allc.coef_[4].reshape(-1,1), index = tt_list_split_adv_fea_allc[0].columns,
                       columns=["coef"]).sort_values(by="coef",ascending=False)


fig, ax = plt.subplots(nrows=4, ncols=2, figsize=(20,30))

# plot coefficients for each captain
ax[0][0].set_title(f"Coefficients predicting for {label[0]}", size=20)
plot_coef_dataframe(df_coef_1, 20, ax=ax[0][0])

ax[0][1].set_title(f"Coefficients predicting for {label[1]}", size=20)
plot_coef_dataframe(df_coef_2, 20, ax=ax[0][1])

ax[1][0].set_title(f"Coefficients predicting for {label[2]}", size=20)
plot_coef_dataframe(df_coef_3, 20, ax=ax[1][0])

ax[1][1].set_title(f"Coefficients predicting for {label[3]}", size=20)
plot_coef_dataframe(df_coef_4, 20, ax=ax[1][1])

ax[2][0].set_title(f"Coefficients predicting for {label[4]}", size=20)
plot_coef_dataframe(df_coef_5, 20, ax=ax[2][0])

# confusion matrix
plot_confusion_matrix(logreg_adv_fea_allc, tt_list_split_adv_fea_allc[2], tt_list_split_adv_fea_allc[3], 
                      ax=ax[2][1], colorbar=False, cmap="Blues")

# ROC
skplt.metrics.plot_roc(tt_list_split_adv_fea_allc[3], logreg_adv_fea_allc.predict_proba(tt_list_split_adv_fea_allc[2]), 
                    plot_micro=True, plot_macro=True, 
                    title_fontsize=20, text_fontsize=16, figsize=(8,6), ax=ax[3][0])

# turn off last plot
ax[-1, -1].axis('off')

# change size xticks
plt.xticks(size=20)

# add legend
ax[3][0].legend(loc=(1,0));

<a name="combo_wordvec_adv_fea"></a>
## Combination of Wordvectors and Additional Features

In this section, both the features gained from count vectorization and the additional features from above (word embeddings, length of line, part-of-speech tagging) are used as predictors.

<a name="combo_kvp"></a>
#### Kirk vs Picard

Creating the combined dataframe

In [None]:
# combining the embedding features with the lines, the pos-tagging features and the number of words
df_combo_kp = pd.concat([df_lines_kirk_picard_pos, pd.DataFrame(df_em_kp, index=df_lines_kirk_picard_pos.index)],
                        axis=1)

df_combo_kp.fillna(0, inplace=True)

In [None]:
# defining the target
target_kp = df_combo_kp.pop("character")

In [None]:
# creating the train test split
ttl_kp_combo = creating_tts_adv_fea(df_combo_kp, target_kp)

In [None]:
ttl_kp_combo[0].shape, ttl_kp_combo[1].shape, ttl_kp_combo[2].shape, ttl_kp_combo[3].shape, 

Feature engineering

In [None]:
# Using the "line" column to create word count features after train-test-split
cvec_combo = CountVectorizer(stop_words = "english",  
                             token_pattern=r"\s(\w{2,})\s",  # raw string avoids invalid escape sequences
                             ngram_range =(1,1))


lines_sparse_matrix_train = cvec_combo.fit_transform(ttl_kp_combo[0].line)
lines_sparse_matrix_test = cvec_combo.transform(ttl_kp_combo[2].line)

df_lines_kp_train = pd.DataFrame(lines_sparse_matrix_train.toarray(), columns=cvec_combo.get_feature_names(),
                                 index= ttl_kp_combo[0].index)
df_lines_kp_test = pd.DataFrame(lines_sparse_matrix_test.toarray(), columns=cvec_combo.get_feature_names(), 
                                index= ttl_kp_combo[2].index)

In [None]:
# combine word counts with the other features, removing the line
df_combo_kp_train = pd.concat([df_lines_kp_train, pd.DataFrame(ttl_kp_combo[0].loc[:,"num_words":])], axis=1)
df_combo_kp_test = pd.concat([df_lines_kp_test, pd.DataFrame(ttl_kp_combo[2].loc[:,"num_words":])], axis=1)

Scaling

In [None]:
# scale the data
scaler_combo_kp = StandardScaler()

X_train_scaled_kp_combo = pd.DataFrame(scaler_combo_kp.fit_transform(df_combo_kp_train), 
                                 columns=df_combo_kp_train.columns)
X_test_scaled_kp_combo = pd.DataFrame(scaler_combo_kp.transform(df_combo_kp_test), 
                                columns=df_combo_kp_test.columns)

Fitting the model

In [None]:
# fitting a logistic regression grid search
fit_pipeline(gs_logreg_em, X_train_scaled_kp_combo, ttl_kp_combo[1])
logreg_combo_kp = gs_logreg_em.best_estimator_
logreg_combo_kp_scores = get_scores(logreg_combo_kp, X_train_scaled_kp_combo, 
                                    ttl_kp_combo[1], X_test_scaled_kp_combo, ttl_kp_combo[3])

<a name="combo_allc"></a>
#### All captains

Creating the combined dataframe

In [None]:
# combining features
df_combo_allc = pd.concat([df_lines_all_captains_pos, 
                           pd.DataFrame(df_em_allc, index=df_lines_all_captains_pos.index)], axis=1)

# fill missing values with 0
df_combo_allc.fillna(0, inplace=True)

In [None]:
# define target
target_allc = df_combo_allc.pop("character")

In [None]:
# create train-test split
ttl_allc_combo = creating_tts_adv_fea(df_combo_allc, target_allc)

Feature engineering

In [None]:
# Create word-count features
cvec_combo = CountVectorizer(stop_words = "english",  
                             token_pattern=r"\s(\w{2,})\s",  # raw string avoids invalid escape sequences
                             ngram_range =(1,1))


lines_sparse_matrix_train = cvec_combo.fit_transform(ttl_allc_combo[0].line)
lines_sparse_matrix_test = cvec_combo.transform(ttl_allc_combo[2].line)

df_lines_allc_train = pd.DataFrame(lines_sparse_matrix_train.toarray(), columns=cvec_combo.get_feature_names(),
                                 index= ttl_allc_combo[0].index)
df_lines_allc_test = pd.DataFrame(lines_sparse_matrix_test.toarray(), columns=cvec_combo.get_feature_names(), 
                                index= ttl_allc_combo[2].index)

In [None]:
# combining features
df_combo_allc_train = pd.concat([df_lines_allc_train, 
                                 pd.DataFrame(ttl_allc_combo[0].loc[:,"num_words":])], axis=1)
df_combo_allc_test = pd.concat([df_lines_allc_test, 
                                pd.DataFrame(ttl_allc_combo[2].loc[:,"num_words":])], axis=1)

Scaling the data

In [None]:
# scaling the data
scaler_combo_allc = StandardScaler()

X_train_scaled_allc_combo = pd.DataFrame(scaler_combo_allc.fit_transform(df_combo_allc_train), 
                                 columns=df_combo_allc_train.columns)
X_test_scaled_allc_combo = pd.DataFrame(scaler_combo_allc.transform(df_combo_allc_test), 
                                columns=df_combo_allc_test.columns)

Creating a PCA version (1000 components) for faster model testing

In [None]:
# create dataset with PCA components for faster model testing
pca = PCA(n_components=1000)

X_train_scaled_allc_combo_pca = pca.fit_transform(X_train_scaled_allc_combo)
X_test_scaled_allc_combo_pca = pca.transform(X_test_scaled_allc_combo)

Fitting the model

In [None]:
fit_pipeline(gs_logreg_em_fast, X_train_scaled_allc_combo_pca, ttl_allc_combo[1])

In [None]:
logreg_combo_allc = gs_logreg_em_fast.best_estimator_

In [None]:
logreg_combo_allc_scores = get_scores(logreg_combo_allc, X_train_scaled_allc_combo_pca,ttl_allc_combo[1], 
           X_test_scaled_allc_combo_pca, ttl_allc_combo[3])

<a name="xgb_main"></a>
## XGBoosted Random Forest

#### Single Tree for comparison

In [None]:
# create a pipeline to fit a decision tree
pipe_dectree = Pipeline([
    ('DecTree', DecisionTreeClassifier())
])

param_grid_dectree = {
              "DecTree__max_depth" : np.linspace(2,20,4, dtype=int),
              "DecTree__min_samples_split": np.linspace(2, 10, 2, dtype=int),
              "DecTree__min_samples_leaf": np.linspace(2, 10, 2, dtype=int)}

gs_dectree_fast = GridSearchCV(pipe_dectree, param_grid_dectree, n_jobs=-2, verbose=3,cv=2)

In [None]:
# fit decision tree
fit_pipeline(gs_dectree_fast, X_train_scaled_allc_combo_pca, ttl_allc_combo[1])

In [None]:
# get decision tree parameters
dectree_combo_allc = gs_dectree_fast.best_estimator_
gs_dectree_fast.best_params_

In [None]:
# print out scores simple decision tree
dectree_combo_allc_scores = get_scores(dectree_combo_allc, X_train_scaled_allc_combo_pca, ttl_allc_combo[1], 
           X_test_scaled_allc_combo_pca, ttl_allc_combo[3])

### Data preparation for XGBoosting
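
XGBoost's multi-class objective expects integer class labels from 0 to `num_class - 1`. As an alternative sketch to the chained conditionals used in this section, the same encoding could be written with a dictionary. `label_to_int` and `encode_labels` are hypothetical names and the toy label list is invented. Unlike the chained conditionals, which map every unlisted label to 4, this version raises a `KeyError` for an unknown label:

```python
# same encoding as in this section: alphabetical order of the captains
label_to_int = {"archer": 0, "janeway": 1, "kirk": 2, "picard": 3, "sisko": 4}


def encode_labels(labels, mapping):
    """Map string labels to integers; raises KeyError for unknown labels."""
    return [mapping[label] for label in labels]


toy_targets = ["kirk", "picard", "archer", "sisko", "janeway"]
encoded = encode_labels(toy_targets, label_to_int)
```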

In [None]:
# converting the target to integer class labels for the XGB algorithm:
train_Y = [0 if t=="archer" else 1 if t=="janeway" 
           else 2 if t=="kirk" else 3 if t=="picard" 
           else 4 for t in ttl_allc_combo[1]]

test_Y = [0 if t=="archer" else 1 if t=="janeway" 
           else 2 if t=="kirk" else 3 if t=="picard" 
           else 4 for t in ttl_allc_combo[3]]

In [None]:
#creating smaller subset of the targets for testing algorithms
train_Y_subset = train_Y[:2000]
test_Y_subset = test_Y[:2000]

# creating subset of the pca data (1000 features)
train_X_pca_subset = X_train_scaled_allc_combo_pca[:2000]
test_X_pca_subset = X_test_scaled_allc_combo_pca[:2000]

# create subset of the full data (ca. 9000 features)
train_X_subset = X_train_scaled_allc_combo[:2000]
test_X_subset = X_test_scaled_allc_combo[:2000]

In [None]:
# drop the word-count columns "133", "12" and "20": the XGB algorithm reads these string column
# names as identical to the embedding features 133, 12 and 20, which creates a problem
train_X_subset = train_X_subset.drop(columns=["133", "12", "20"])
test_X_subset = test_X_subset.drop(columns=["133", "12", "20"])

In [None]:
# drop relevant columns from the full dataframes
X_train_scaled_allc_combo.drop(columns=["133", "12", "20"], inplace=True)
X_test_scaled_allc_combo.drop(columns=["133", "12", "20"], inplace=True)

### XGB Models

#### XGB Model Nr 1 - Forest of 3

In [None]:
# define the number of rounds
num_round = 200

# create first XGBoost model 
param_xgboost = {"max_depth": 10, "eta":0.01, 
                 "objective":"multi:softmax", 
                 "verbosity":2, "num_class":5, 
                 "num_parallel_tree":3,
                 "nthread":-2}

# create train and test matrix from a subset of the data (2000 observations)
xg_train = xgb.DMatrix(train_X_subset, label=train_Y_subset)
xg_test = xgb.DMatrix(test_X_subset, label=test_Y_subset)

# define watchlist
watchlist = [(xg_train, 'train'), (xg_test, 'test')]

In [None]:
# train first XGB model
bst = xgb.train(param_xgboost, xg_train, num_round, watchlist)

In [None]:
# predict on the test subset with the first XGB model
pred = bst.predict(xg_test)

# calculate accuracy
sum(pred == test_Y_subset) / 2000    #division by 2000 because the model was run on a subset of 2000

#### XGB Model Nr 2 - added Learning Rate Scheduler, Forest of 20
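
The learning-rate schedule in this model decays linearly from the base rate to a final rate, with one value per boosting round. A pure-Python sketch of such a schedule (`linear_decay` is a hypothetical helper that mirrors what `np.linspace(start, stop, num).tolist()` produces, up to floating-point rounding):

```python
def linear_decay(start, stop, num_rounds):
    """Return num_rounds learning rates, evenly spaced from start to stop."""
    if num_rounds < 2:
        return [start] * num_rounds
    step = (stop - start) / (num_rounds - 1)
    return [start + i * step for i in range(num_rounds)]


# one learning rate per boosting round, from 0.1 down to 0.01
schedule = linear_decay(0.1, 0.01, 100)
```

The list needs exactly one entry per boosting round, which is why its length is tied to `num_round` in the cell below.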

In [None]:
# define the base learning rate and the round number
eta_base = 0.1
num_round = 100

# define the decay of the learning rate
eta_decay = np.linspace(eta_base, 0.01, num_round).tolist()

# create a dictionary to collect logloss over epochs
results_bb_1 = {} 



param_xgboost = {"max_depth": 3, "eta":eta_base, 
                 "objective":"multi:softmax", 
                 "verbosity":1, "num_class":5, 
                 "num_parallel_tree":20,
                 "nthread":-2}

# define train and test matrix (subset of 2000)
xg_train = xgb.DMatrix(train_X_subset, label=train_Y_subset)
xg_test = xgb.DMatrix(test_X_subset, label=test_Y_subset)

watchlist = [(xg_train, 'train'), (xg_test, 'test')]



# train the model
bst2 = xgb.train(param_xgboost, xg_train, num_round, 
                             watchlist, callbacks=[xgb.callback.LearningRateScheduler(eta_decay)],
                             evals_result=results_bb_1)

In [None]:
pred2 = bst2.predict(xg_test)

# calculate accuracy
sum(pred2 == test_Y_subset) / 2000   

In [None]:
# plot learning over epochs
plt.plot(results_bb_1["test"]["mlogloss"]);

#### XGB Model Nr 3 - no Forest, more rounds

In [None]:
# define the base learning rate and the round number
eta_base = 0.1
num_round = 400

# define the decay of the learning rate
eta_decay = np.linspace(eta_base, 0.01,num_round).tolist()

# create a dictionary to collect logloss over epochs
results_bb_1 = {} 



param_xgboost = {"max_depth": 3, "eta":eta_base, 
                 "objective":"multi:softmax", 
                 "verbosity":1, "num_class":5, 
                 "num_parallel_tree":1,
                 "nthread":-2}

# this time define train and test matrix with all the data (not a subset)
xg_train = xgb.DMatrix(X_train_scaled_allc_combo, label=train_Y)
xg_test = xgb.DMatrix(X_test_scaled_allc_combo, label=test_Y)

watchlist = [(xg_train, 'train'), (xg_test, 'test')]


                 

# train the model
bst3 = xgb.train(param_xgboost, xg_train, num_round, 
                             watchlist, callbacks=[xgb.callback.LearningRateScheduler(eta_decay)],
                            evals_result=results_bb_1)

In [None]:
pred3 = bst3.predict(xg_test)

# calculate accuracy
sum(pred3 == test_Y) / 7111     # divide by 7111, the size of the full test set

In [None]:
# plot model learning
plt.plot(results_bb_1["test"]["mlogloss"]);

#### XGB Model Nr 4 - Forest of 7, max depth of 3

In [None]:
# defining base learning rate and number of rounds
eta_base = 0.2
num_round = 500

# define the decay of the learning rate over time
eta_decay = np.linspace(eta_base, 0.02, num_round).tolist()

# catch the mlogloss over rounds
results_bb_2 = {} 

param_xgboost = {"max_depth": 3, "eta":eta_base, 
                 "objective":"multi:softmax", 
                 "verbosity":1, "num_class":5, 
                 "num_parallel_tree":7,
                 "nthread":-2}

xg_train = xgb.DMatrix(X_train_scaled_allc_combo, label=train_Y)
xg_test = xgb.DMatrix(X_test_scaled_allc_combo, label=test_Y)

watchlist = [(xg_train, 'train'), (xg_test, 'test')]


                           
# train the model                         
bst4 = xgb.train(param_xgboost, xg_train, num_round, 
                             watchlist, callbacks=[xgb.callback.LearningRateScheduler(eta_decay)],
                            evals_result=results_bb_2, early_stopping_rounds=10)

In [None]:
pred4 = bst4.predict(xg_test)

# calculate accuracy
sum(pred4 == test_Y) /7111   

In [None]:
# plot mlogloss for train and test over the boosting rounds
plt.plot(results_bb_2["train"]["mlogloss"], label="train")
plt.plot(results_bb_2["test"]["mlogloss"], label="test")

plt.legend(loc=[1,0], title="mlogloss");

In [None]:
def confusion_matrix_for_xgb(y_test, y_predicted):
    """A function to plot a confusion matrix for an XGBoosted model.
    Arguments:
        y_test: list of correct answers
        y_predicted: list of answers predicted by the model"""
    
    # create confusion matrix
    cm = confusion_matrix(y_test, y_predicted)
    
    plt.figure(figsize=(10,10))
    plt.clf()
    
    # draw the confusion matrix as an image
    plt.imshow(cm, interpolation='nearest', 
               cmap=ListedColormap(["#C6F6FA", "#7CEAF4", "#2FDEEE", "#0FAEBD", "#0B7A84"]))
    class_names = ['Archer','Janeway', 'Kirk', 'Picard', 'Sisko']
    
    # title, y and x label
    plt.title('Confusion Matrix', font="DIN Condensed", size=30)
    plt.ylabel('True label', font="DIN Condensed", size=20)
    plt.xlabel('Predicted label', font="DIN Condensed", size=20)
    
    # get x-ticks index
    tick_marks = np.arange(len(class_names))
    
    # rename x-ticks and y-ticks with the class names
    plt.xticks(tick_marks, class_names, rotation=45,  font="DIN Condensed", size=20)
    plt.yticks(tick_marks, class_names, font="DIN Condensed", size=20)
  
    
    # turn off gray grid
    plt.grid(False)
  
    # add numbers of counts into the grid
    for i in range(5):
        for j in range(5):
            plt.text(j-0.1,i+0.05, str(cm[i][j]), font="DIN Condensed", size=20)
      
    plt.show()

In [None]:
# plot confusion matrix
confusion_matrix_for_xgb(test_Y, pred4)

<a name="interaction_networks"></a>
# Interaction Networks

This section explores who talks after whom as a proxy for character interaction, since most of the time a character is preceded by the person they are talking to. This information is used to plot interaction networks of the 15 characters with the most lines in each series.
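
A minimal sketch of the counting idea, assuming an invented speaker sequence (`speakers`): consecutive speakers form an edge, and both directions of a pair are counted together. `Counter` and sorted tuples are used here for brevity; this is not the approach of the functions in this section.

```python
from collections import Counter

# invented sequence of speakers in one scene
speakers = ["KIRK", "SPOCK", "KIRK", "MCCOY", "KIRK", "SPOCK", "SPOCK"]

# pair each speaker with the next one
edges = list(zip(speakers, speakers[1:]))

# count undirected edges: (A, B) and (B, A) share one key, self-pairs are skipped
weights = Counter(tuple(sorted(pair)) for pair in edges if pair[0] != pair[1])
```

Sorting each pair gives both directions of an exchange the same key, so the weight of an edge is the total number of direct hand-overs between the two characters.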

In [None]:
def edges_one_series(df_series_raw):
    """Function to create a list of tuples of characters speaking after each other.
    Arguments:
        df_series_raw = dataframe of all scripts in a series."""
    edges_series = []
    
    # iterates through episodes
    for episode in range(0,len(df_series_raw)):
        if episode in df_series_raw.index:   # checks whether an episode with that index number exists for the series
            # uses find_names_and_lines() once per episode to separate names and lines
            names_and_lines = find_names_and_lines(df_series_raw.script[episode])
            
            # pairs each speaker with the next speaker
            for script in range(0,len(names_and_lines)-1):
                character_1 = names_and_lines[script][0].replace("[OC]","").strip(" ")
                character_2 = names_and_lines[script+1][0].replace("[OC]","").strip(" ")
                
                edges_series.append((character_1, character_2))
        
    return edges_series


def create_weight_dictionary_one_series(edges_series):
    """Function to create a dictionary of edges and their frequency in a series.
    Arguments:
        edges_series = list of tuples of connections in a series."""
    weight_dict = {}
    inverse_pairs = set()
    for pair in edges_series:
        inverse_pair = (pair[1], pair[0])
        if pair not in inverse_pairs:
            try:
                weight_dict[pair] += 1
            except KeyError:  # first occurrence of this pair
                weight_dict[pair] = 1
            inverse_pairs.add(inverse_pair)
        else:
            weight_dict[inverse_pair] += 1  
    return weight_dict


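def sort_dict(dictionary):
    """Takes in a dictionary and returns it sorted by value in descending order.
    Entries whose key has identical first and second elements (for edge tuples: a character
    speaking after themselves) are dropped."""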
    return {k:v for k,v in sorted(dictionary.items(), key= lambda x: -x[1]) if k[0] != k[1]}

In [None]:
def get_sorted_count_lines_per_character(df_series_raw):
    """Counts how many lines each character says in a series, returns a dictionary of character : count.
    Arguments:
        df_series_raw = dataframe of all scripts in a series."""
    dict_appearances_char_series = {}
    for episode in range(0,len(df_series_raw)):
        if episode in df_series_raw.index:
            # parse the script once per episode instead of once per line
            names_and_lines = find_names_and_lines(df_series_raw.script[episode])
            for script in range(0,len(names_and_lines)):
                character = names_and_lines[script][0].replace("[OC]","").strip(" ")
                try:
                    dict_appearances_char_series[character] += 1
                except KeyError:  # first line of this character
                    dict_appearances_char_series[character] = 1
                
    return sort_dict(dict_appearances_char_series)


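def filter_edge_dict_by_character_list(character_list, weighted_edge_dict):
    """Keeps only the edges whose two characters are both in the given list.
    Arguments:
        character_list = list of the characters of interest
        weighted_edge_dict = dictionary of edges (tuples of two characters) and their weights."""
    new_edge_dict = {}
    for entry, weight in weighted_edge_dict.items():
        if entry[0] in character_list and entry[1] in character_list:
            new_edge_dict[entry] = weight
    return new_edge_dict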
        

In [None]:
# get the characters of each series sorted by number of lines said
top_chars_TOS = get_sorted_count_lines_per_character(df_TOS_raw)
top_chars_TNG = get_sorted_count_lines_per_character(df_TNG_raw)
top_chars_DS9 = get_sorted_count_lines_per_character(df_DS9_raw)
top_chars_VOY = get_sorted_count_lines_per_character(df_VOY_raw)
top_chars_ENT = get_sorted_count_lines_per_character(df_ENT_raw)

In [None]:
# get top 15 characters with most lines said per series
top_15_chars_TOS = list(top_chars_TOS)[:15]
top_15_chars_TNG = list(top_chars_TNG)[:15]
top_15_chars_DS9 = list(top_chars_DS9)[:15]
top_15_chars_VOY = list(top_chars_VOY)[:15]
top_15_chars_ENT = list(top_chars_ENT)[:15]

In [None]:
# get the edges per series, i.e. all pairs of characters speaking directly after each other
edges_TOS = edges_one_series(df_TOS_raw)
edges_TNG = edges_one_series(df_TNG_raw)
edges_DS9 = edges_one_series(df_DS9_raw)
edges_VOY = edges_one_series(df_VOY_raw)
edges_ENT = edges_one_series(df_ENT_raw)

In [None]:
# get the weight of the edges defined as counts of character contacts
weight_dict_edges_TOS = create_weight_dictionary_one_series(edges_TOS)
weight_dict_edges_TNG = create_weight_dictionary_one_series(edges_TNG)
weight_dict_edges_DS9 = create_weight_dictionary_one_series(edges_DS9)
weight_dict_edges_VOY = create_weight_dictionary_one_series(edges_VOY)
weight_dict_edges_ENT = create_weight_dictionary_one_series(edges_ENT)

In [None]:
# sort the dictionaries by weight
weight_dict_edges_TOS_sort = sort_dict(weight_dict_edges_TOS)
weight_dict_edges_TNG_sort = sort_dict(weight_dict_edges_TNG)
weight_dict_edges_DS9_sort = sort_dict(weight_dict_edges_DS9)
weight_dict_edges_VOY_sort = sort_dict(weight_dict_edges_VOY)
weight_dict_edges_ENT_sort = sort_dict(weight_dict_edges_ENT)

In [None]:
# get top 15 characters with most lines with their respective weights and edges
top_15_weight_dict_TOS = filter_edge_dict_by_character_list(top_15_chars_TOS, weight_dict_edges_TOS_sort)
top_15_weight_dict_TNG = filter_edge_dict_by_character_list(top_15_chars_TNG, weight_dict_edges_TNG_sort)
top_15_weight_dict_DS9 = filter_edge_dict_by_character_list(top_15_chars_DS9, weight_dict_edges_DS9_sort)
top_15_weight_dict_VOY = filter_edge_dict_by_character_list(top_15_chars_VOY, weight_dict_edges_VOY_sort)
top_15_weight_dict_ENT = filter_edge_dict_by_character_list(top_15_chars_ENT, weight_dict_edges_ENT_sort)

In [None]:
# create name mappings with the first letter capitalized, the other letters lowercase
names_mapping_TOS = {name: name[0] + name[1:].lower() for name in top_15_chars_TOS}
names_mapping_TNG = {name: name[0] + name[1:].lower() for name in top_15_chars_TNG}
names_mapping_DS9 = {name: name[0] + name[1:].lower() for name in top_15_chars_DS9}
names_mapping_VOY = {name: name[0] + name[1:].lower() for name in top_15_chars_VOY}
names_mapping_ENT = {name: name[0] + name[1:].lower() for name in top_15_chars_ENT}

In [None]:
# create network plotting function
def plot_spiral_network(top_chars_list, dict_with_weights):
    """Plot a spiral network graph of the top 15 characters and the
    number of their interactions with each other.
    Arguments:
        top_chars_list: a dictionary mapping each character of interest (node) to its display label
        dict_with_weights: a dictionary mapping each edge between the top 15 characters to its weight"""
    
    # create graph
    G = nx.Graph()
    
    # add nodes (characters) and edges (their interactions)
    G.add_nodes_from(top_chars_list)
    G.add_edges_from(dict_with_weights.keys())

    # create lists of edges with different weights
    over_3000 = [k for k,v in dict_with_weights.items() if v >= 3000]
    over_1000 = [k for k,v in dict_with_weights.items() if 3000 > v >= 1000]
    over_500 = [k for k,v in dict_with_weights.items() if 1000 > v >= 500]
    over_100 = [k for k,v in dict_with_weights.items() if 500 > v >= 100]
    over_50 = [k for k,v in dict_with_weights.items() if 100 > v >= 50]
    over_10 = [k for k,v in dict_with_weights.items() if 50 > v >= 10]
    under_10 = [k for k,v in dict_with_weights.items() if v < 10]

    plt.figure(figsize=(20,20))

    pos = nx.drawing.spiral_layout(G)
    
    # plot nodes and labels
    nx.draw_networkx_nodes(G, pos=pos, node_color="lightgrey", node_size=10, label=50)
    nx.draw_networkx_labels(G, pos=pos, labels=top_chars_list)

    # add edges formatted conditionally on weights
    nx.draw_networkx_edges(G, pos=pos, edge_color='grey', edgelist=under_10, width=0.2)
    nx.draw_networkx_edges(G, pos=pos, edge_color='grey', edgelist=over_10, width=1, style="dashed")
    nx.draw_networkx_edges(G, pos=pos, edge_color='grey', edgelist=over_50, width=3, style="dashed")
    nx.draw_networkx_edges(G, pos=pos, edge_color='#44DEC0', edgelist=over_100, width=3)
    nx.draw_networkx_edges(G, pos=pos, edge_color='#BADE44', edgelist=over_500, width=6)
    nx.draw_networkx_edges(G, pos=pos, edge_color='#DEBE44', edgelist=over_1000, width=9)
    nx.draw_networkx_edges(G, pos=pos, edge_color='#E69119', edgelist=over_3000, width=12)

    plt.show()
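
The weight classes used in `plot_spiral_network` can also be checked in isolation. The following is a minimal sketch, not part of the original notebook: `bucket_edges`, `THRESHOLDS`, `BUCKET_NAMES` and the sample weights are hypothetical names and values introduced for illustration, and it assumes the top class starts at 3000, as its name suggests. It assigns every edge to exactly one class, so no edge would be drawn twice.

```python
# Hypothetical helper (not in the original notebook): sort edges into
# non-overlapping weight classes, mirroring the classes in plot_spiral_network.
from bisect import bisect_right

THRESHOLDS = [10, 50, 100, 500, 1000, 3000]
BUCKET_NAMES = ["under_10", "over_10", "over_50", "over_100",
                "over_500", "over_1000", "over_3000"]

def bucket_edges(dict_with_weights):
    """Return {bucket name: list of edges}, each edge in exactly one bucket."""
    buckets = {name: [] for name in BUCKET_NAMES}
    for edge, weight in dict_with_weights.items():
        # bisect_right counts how many thresholds are <= weight,
        # which is the index of the matching bucket
        buckets[BUCKET_NAMES[bisect_right(THRESHOLDS, weight)]].append(edge)
    return buckets

# made-up sample weights, for illustration only
sample = {("KIRK", "SPOCK"): 3200, ("KIRK", "MCCOY"): 2500, ("SPOCK", "UHURA"): 9}
buckets = bucket_edges(sample)
```

Because each weight maps to a single index, the total number of edges across all buckets always equals the number of entries in the input dictionary.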

In [None]:
plot_spiral_network(names_mapping_TOS, top_15_weight_dict_TOS)

In [None]:
plot_spiral_network(names_mapping_TNG, top_15_weight_dict_TNG)

In [None]:
plot_spiral_network(names_mapping_DS9, top_15_weight_dict_DS9)

In [None]:
plot_spiral_network(names_mapping_VOY, top_15_weight_dict_VOY)

In [None]:
plot_spiral_network(names_mapping_ENT, top_15_weight_dict_ENT)

<a name="exp_data_game"></a>
## Export data for the browser game

In [None]:
# convert class indices to class names
predictions_boosted = ["archer" if t==0 else "janeway" if t==1
           else "kirk" if t==2 else "picard" if t==3
           else "sisko" for t in pred4]

In [None]:
### data for the game from the XGB model
saving_destiny = "/Users/tjanif/GA/GuessTheCaptain/content/"

ttl_allc_combo[2].to_csv(saving_destiny + "_test_lines_all_captains.csv")     # exports the test set of lines
ttl_allc_combo[3].to_csv(saving_destiny + "_test_answers_all_captains.csv")   # exports the correct answers
pd.Series(predictions_boosted).to_csv(saving_destiny + "_model_answers.csv")  # exports the model's answers