# Superhero Creator Prediction

Code adapted from [Kaggle](https://www.kaggle.com/someadityamandal/superheroes-visualization-and-prediction).

Importing the necessary libraries.

NLTK (Natural Language Toolkit) is a suite of libraries and programs for symbolic and statistical natural language
processing (NLP) for English.
The `nltk.download('...')` commands only have to be executed once per system; later runs just report that the
packages are already up to date.

`tqdm` provides progress bars in Jupyter notebooks; `tqdm.pandas()` adds the `progress_apply` method to pandas objects.

In [1]:
# Code taken from https://www.kaggle.com/someadityamandal/superheroes-visualization-and-prediction
import gc
import re

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from tqdm.auto import tqdm
tqdm.pandas()

nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/filipschlembach/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/filipschlembach/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Loading the dataset.

In [2]:
data = pd.read_csv("../datasets/superheroes_nlp_dataset.csv")

data_text = data[['history_text', 'creator']]
# we only keep entries created by Marvel or DC, as there are too many other creators
# TODO: group all other creators under one label and see how the performance changes
# .copy() avoids pandas' SettingWithCopyWarning when the column is overwritten later
data_text = data_text.loc[data_text['creator'].isin(['Marvel Comics', 'DC Comics'])].copy()
data_text.head(1)

Unnamed: 0,history_text,creator
0,"Delroy Garrett, Jr. grew up to become a track ...",Marvel Comics


Defining helper functions for text preprocessing.

In [3]:
# English stopwords as a set for fast membership tests
stopwords_list = set(stopwords.words('english'))

puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*',
          '+', '\\', '•', '~', '@', '£',
          '·', '_', '{', '}', '©', '^', '®', '`', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', 'Â', '█',
          '½', 'à', '…',
          '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥',
          '▓', '—', '‹', '─',
          '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾',
          'Ã', '⋅', '‘', '∞',
          '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹',
          '≤', '‡', '√', ]


def clean_text(x):
    """
    Surrounds every punctuation mark listed in `puncts` with whitespace.
    :param x: text
    :return: text with whitespace before and after each punctuation mark
    """
    x = str(x)
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x


def clean_numbers(x):
    """
    Masks runs of two or more digits with # symbols (at most five).
    Single digits are left unchanged.
    :param x: text
    :return: text with digit runs replaced by # symbols
    """
    x = re.sub(r'[0-9]{5,}', '#####', x)
    x = re.sub(r'[0-9]{4}', '####', x)
    x = re.sub(r'[0-9]{3}', '###', x)
    x = re.sub(r'[0-9]{2}', '##', x)
    return x


def remove_stopwords(text):
    # drops every whitespace-separated token that is an English stopword
    return " ".join([word for word in str(text).split() if word not in stopwords_list])


def stem_text(text):
    """
    Lemmatizes the words in the given text. Lookups are cached in a dict, but the cache is
    rebuilt on every call, so it only speeds up words that repeat within the same text.
    :param text: text in its original form
    :return: text consisting of lemmatized words
    """
    lemma = WordNetLemmatizer()

    class FasterStemmer(object):
        def __init__(self):
            self.words = {}

        def stem(self, x):
            if x in self.words:
                return self.words[x]
            t = lemma.lemmatize(x)
            self.words[x] = t
            return t

    faster_stemmer = FasterStemmer()
    text = text.split()
    stemmed_words = [faster_stemmer.stem(word) for word in text]
    text = " ".join(stemmed_words)
    del faster_stemmer
    return text
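
The digit masking is easiest to understand on an example. The following sketch repeats `clean_numbers` so that it runs on its own; the sample sentence is made up for illustration and is not taken from the dataset.

```python
import re


def clean_numbers(x):
    # Same substitutions as in the cell above, longest digit runs first.
    x = re.sub(r'[0-9]{5,}', '#####', x)
    x = re.sub(r'[0-9]{4}', '####', x)
    x = re.sub(r'[0-9]{3}', '###', x)
    x = re.sub(r'[0-9]{2}', '##', x)
    return x


# Hypothetical sample text, not from the dataset.
sample = "born in 1985 , age 7 , id 1234567 , 42 allies and 123 foes"
masked = clean_numbers(sample)
print(masked)
# born in #### , age 7 , id ##### , ## allies and ### foes
```

Runs of five or more digits all collapse to `#####`, and the single digit `7` is left as it is because no pattern matches fewer than two digits.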


In [4]:
# apply text preprocessing

data_text['history_text'] = data_text['history_text'].progress_apply(lambda x: str(x).lower())
data_text['history_text'] = data_text['history_text'].progress_apply(clean_text)
data_text['history_text'] = data_text['history_text'].progress_apply(clean_numbers)
data_text['history_text'] = data_text['history_text'].progress_apply(remove_stopwords)
data_text['history_text'] = data_text['history_text'].progress_apply(stem_text)
print(data_text.head(1))

                                        history_text        creator
0  delroy garrett , jr . grew become track star c...  Marvel Comics


Encoding the target column `creator` as a binary label `y` and storing the preprocessed text as the input `X`.

In [5]:
y = data_text['creator']
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)
X = data_text['history_text']
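
To see which creator ends up as which number, here is a small sketch with a hand-written list of labels (the list is an assumption for illustration; the notebook uses the `creator` column). `LabelEncoder` sorts the classes, so with these two creators 'DC Comics' becomes 0 and 'Marvel Comics' becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for data_text['creator'].
creators = ['Marvel Comics', 'DC Comics', 'Marvel Comics']

encoder = LabelEncoder()
codes = encoder.fit_transform(creators)

print(list(encoder.classes_))  # ['DC Comics', 'Marvel Comics']
print(codes.tolist())          # [1, 0, 1]

# inverse_transform maps the numbers back to the creator names.
names = list(encoder.inverse_transform([0, 1]))
print(names)                   # ['DC Comics', 'Marvel Comics']
```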

- Splitting the dataset into training and test data.
- [ ] Bag of words mystery. How does this work?!
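
A possible answer to the bag-of-words question, sketched on a toy corpus (the three sentences are made up): `CountVectorizer` builds a vocabulary from the texts it is fitted on and then represents every text as a vector of word counts, ignoring word order. Words that were not seen during fitting are dropped when transforming new text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus, not from the dataset.
toy_corpus = [
    "batman fights joker",
    "joker fights batman batman",
    "spider man swings",
]

vectorizer = CountVectorizer(analyzer='word').fit(toy_corpus)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = vectorizer.transform(toy_corpus).toarray()

print(vocab)
# ['batman', 'fights', 'joker', 'man', 'spider', 'swings']
print(counts.tolist())
# [[1, 1, 1, 0, 0, 0], [2, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]

# 'meets' and 'superman' are not in the vocabulary and are ignored.
unseen = vectorizer.transform(["batman meets superman"]).toarray()
print(unseen.tolist())
# [[1, 0, 0, 0, 0, 0]]
```

The first two toy sentences contain the same words in a different order and differ only in the count for `batman`, which shows that the representation keeps word frequencies but not word positions.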

In [6]:
# 65/35 split of the dataset into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=1234)
# fitting the bag-of-words transformer on the preprocessed training texts only
bow_transformer = CountVectorizer(analyzer='word').fit(X_train)
# transforming the training texts into bag-of-words count vectors
text_bow_train = bow_transformer.transform(X_train)
# transforming the test texts with the vocabulary learned from the training data
text_bow_test = bow_transformer.transform(X_test)

- [ ] What kind of model is used here? How does it work?
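
As a starting point for this question: the model is a logistic regression, a linear classifier. It learns one weight per vocabulary word plus an intercept, sums the weighted word counts of a text, and passes the result through the sigmoid function to get the probability of class 1. The sketch below checks this on a tiny made-up count matrix (`X_toy` and `y_toy` are assumptions for illustration, not the notebook's data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical count matrix: 4 documents x 3 vocabulary words.
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 2, 1],
                  [0, 1, 0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score: weighted sum of the word counts plus the intercept.
z = X_toy @ clf.coef_[0] + clf.intercept_[0]
# The sigmoid squashes the score into a probability for class 1.
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = clf.predict_proba(X_toy)[:, 1]
print(np.allclose(p_manual, p_sklearn))  # True
```

A text is assigned to class 1 when its score `z` is positive, i.e. when the probability exceeds 0.5. The sign and size of each entry in `clf.coef_` therefore indicate which words push a text towards which class.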

In [7]:
# instantiating a logistic regression classifier with scikit-learn's default settings
model = LogisticRegression()
# training the model on the bag-of-words vectors of the training data
model = model.fit(text_bow_train, y_train)

print('\nScore on training data:')
print(model.score(text_bow_train, y_train))
print('\nScore on test data:')
print(model.score(text_bow_test, y_test))


Score on training data:
0.9796511627906976

Score on test data:
0.8733153638814016


- [ ] Concatenate the two text columns and measure the performance difference.
  Does more data help?
  How much more data do we actually get by concatenating the two columns?
- [ ] Use [Keras Tokenizer](https://www.youtube.com/watch?v=UFtXy0KRxVI) for BoW and try to replicate the result.
- [ ] Add TF-IDF part with explanation.
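
A first sketch for the TF-IDF item, assuming scikit-learn's `TfidfVectorizer` as a drop-in replacement for `CountVectorizer` (this is one possible choice, not something the original notebook does). TF-IDF scales each word count by how rare the word is across all documents, so words that appear everywhere carry less weight. The toy corpus is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus: 'hero' occurs in every document.
toy_corpus = ["hero saves city", "hero fights villain", "hero joins team"]

tfidf = TfidfVectorizer(analyzer='word')
weights = tfidf.fit_transform(toy_corpus).toarray()
col = tfidf.vocabulary_  # word -> column index

first_doc = weights[0]
# The ubiquitous word gets a lower weight than a word unique to this document.
print(first_doc[col['hero']] < first_doc[col['saves']])  # True
# With the default settings each row is scaled to unit (L2) length.
print(np.isclose(np.linalg.norm(first_doc), 1.0))        # True
```

In the notebook, this would amount to swapping `CountVectorizer(analyzer='word')` for `TfidfVectorizer(analyzer='word')` and comparing the test score; whether it improves the result would have to be measured.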
