# NLP for Heroes
Joseph Campbell, Professor of Literature at Sarah Lawrence College and author of The Hero with a Thousand Faces (THWOTF), is widely regarded as a leading theorist of story structure.  In THWOTF he lays out an archetypal story arc, the hero's journey, that many hero stories follow.  The idea of reducing a story to its basic components is nothing new, but hero stories in particular seem to follow his outline very closely.

This classification inspired me to try to model the storyline using a classic: J.K. Rowling's Harry Potter and the Sorcerer's Stone.  Rowling follows the archetype very closely, often breaking each step down by chapter.  That structure, together with the book's simple language (it is a children's book, after all), makes it a clear and natural choice as the labeled data for this experiment.

In [47]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading https://files.pythonhosted.org/packages/d8/f1/5a267addb30ab7eaa1beab2b9323073815da4551076554ecc890a3595ec9/fuzzywuzzy-0.17.0-py2.py3-none-any.whl
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.17.0


In [58]:
import pandas as pd
import os
from fuzzywuzzy import process
import numpy as np

In [13]:
path = os.getcwd()
path

'C:\\Users\\Mark\\Documents\\DataSci\\Module 5\\Heros_Journey'

In [16]:
!dir

 Volume in drive C is Windows-SSD
 Volume Serial Number is 42C8-223B

 Directory of C:\Users\Mark\Documents\DataSci\Module 5\Heros_Journey

11/12/2019  01:26 PM    <DIR>          .
11/12/2019  01:26 PM    <DIR>          ..
11/12/2019  01:00 PM    <DIR>          .ipynb_checkpoints
10/30/2019  04:23 PM           465,924 Harry_Potter-J.K.Rowling.txt
11/12/2019  12:57 PM    <DIR>          Input
11/12/2019  01:26 PM             3,474 NLP_for_heros.ipynb
               2 File(s)        469,398 bytes
               4 Dir(s)  33,689,976,832 bytes free


In [34]:
# The context manager closes the file even if reading fails
with open('Harry_Potter-J.K.Rowling.txt', 'r', encoding='utf8') as file:
    text = file.read()

In [35]:
type(text)

str

In [36]:
len(text)

440862

In [39]:
text[:2000]

"Harry Potter\n\nand the Sorcerer’s Stone\n\n\n\n\n\nby\n\nJ. K. Rowling\n\nIllustrations by Mary Grandpré\n\n\n\nArthur A. Levine Books\n\nAn Imprint of Scholastic Press.\n\n\n\n\n\nFor Jessica, who loves stories\n\nfor Anne, who loved them too;\n\nand for Di, who heard this one first.\n\n\n\n\n\nText copyright © 1997 by J.K. Rowling\n\nIllustrations by Mary GrandPré copyright © 1998 Warner Bros.\n\nAll rights reserved. Published by Scholastic Press, a division of Scholastic Inc.,\n\nPublishers since 1920\n\nSCHOLASTIC, SCHOLASTIC PRESS, and the LANTERN LOGO\n\nare trademarks and/or registered trademarks of Scholastic Inc.\n\n\n\nHARRY POTTER and all related characters and elements are trademarks of Warner Bros.\n\n\n\nNo part of this publication may be reproduced, or stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission of the publisher. For information regarding permissions,

# Cleaning Data
The final data should be broken into samples that make sense for a model to read.  Sentences are a good choice, but a sentence can also end with `!` or `?`, and a `.` does not always mean that the sentence has ended (abbreviations such as "Mr." contain one).  For now, splitting on line breaks is a little less granular, but there is no chance that it will break the text apart too much.

Later on, we will label this corpus by chapter, and by key words within certain chapters, to mark the phases of the hero's journey laid out by Campbell.
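To illustrate the trade-off described above, here is a minimal sketch comparing a deliberately naive sentence splitter with a plain line split. The `sample` string and both helper names (`naive_sentences`, `lines_only`) are made up for this illustration and are not part of the notebook's pipeline.

```python
import re

# Made-up sample text, for illustration only
sample = ("Mr. and Mrs. Dursley lived on Privet Drive. Were they normal? Yes!\n"
          "They did not like anything strange.")

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (deliberately naive)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def lines_only(text):
    """Split on line breaks and drop empty lines."""
    return [line for line in text.split('\n') if line]

sentences = naive_sentences(sample)
lines = lines_only(sample)

print(sentences[:2])  # the abbreviations are cut off as if they ended sentences
print(len(lines))     # one sample per line of the source text
```

The naive splitter treats "Mr." and "Mrs." as sentence endings, which is the over-splitting the line-based approach avoids, at the cost of coarser samples.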

In [40]:
manuscript = text.split('\n')
type(manuscript)
manuscript[:5]

['Harry Potter', '', 'and the Sorcerer’s Stone', '', '']

In [62]:
np.zeros(len(manuscript[:5]))

array([0., 0., 0., 0., 0.])

In [66]:
labeled_data = pd.DataFrame({'text': manuscript , 'label' : np.zeros(len(manuscript), dtype=int)})

In [67]:
labeled_data.tail()

Unnamed: 0,text,label
6677,,0
6678,,0
6679,,0
6680,,0
6681,,0


In [68]:
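# Turn empty lines into NaN, then drop them; assign the results back rather than modifying in place
labeled_data['text'] = labeled_data['text'].replace('', np.nan)
labeled_data = labeled_data.dropna()
labeled_data.tail()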

Unnamed: 0,text,label
6667,"“In a manner of speaking,” said Uncle Vernon. ...",0
6669,Harry hung back for a last word with Ron and H...,0
6671,"“See you over the summer, then.”",0
6673,"“Hope you have — er — a good holiday,” said He...",0
6675,"“Oh, I will,” said Harry, and they were surpri...",0


In [87]:
count = 1
# idx is the row's position in the filtered data, not its original index label
for idx, i in enumerate(labeled_data['text']):
    if i == f'Chapter {count}':
        # tag each of the chapters as events in the story arc
        print(f'Chapter {count} at index {idx}')
        count += 1

Chapter 1 at index 31
Chapter 2 at index 143
Chapter 3 at index 245
Chapter 4 at index 380
Chapter 5 at index 535
Chapter 6 at index 824
Chapter 7 at index 1113
Chapter 8 at index 1302
Chapter 9 at index 1396
Chapter 10 at index 1609
Chapter 11 at index 1775
Chapter 12 at index 1914
Chapter 13 at index 2122
Chapter 14 at index 2248
Chapter 15 at index 2389
Chapter 16 at index 2579
Chapter 17 at index 2901
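The chapter positions printed above can also be turned into (start, end) ranges, so that labels are assigned from a lookup rather than typed by hand. The following is a minimal sketch on a toy list; `chapter_ranges`, `toy_lines`, and `stage_for_chapter` are hypothetical names introduced here, and the chapter-to-label mapping only mirrors the scheme used later in this notebook (Chapter 2 = 1, Chapter 3 = 2).

```python
import re

def chapter_ranges(lines):
    """Map chapter number -> (start, end) positions in `lines`, end exclusive."""
    starts = []
    for pos, line in enumerate(lines):
        match = re.fullmatch(r'Chapter (\d+)', line.strip())
        if match:
            starts.append((int(match.group(1)), pos))
    ranges = {}
    for i, (number, start) in enumerate(starts):
        # A chapter ends where the next one begins; the last runs to the end
        end = starts[i + 1][1] if i + 1 < len(starts) else len(lines)
        ranges[number] = (start, end)
    return ranges

# Toy stand-in for the cleaned manuscript lines
toy_lines = ['Title', 'Chapter 1', 'a', 'b', 'Chapter 2', 'c', 'Chapter 3', 'd', 'e']
ranges = chapter_ranges(toy_lines)

# Assumed mapping of chapter -> stage label, for illustration only
stage_for_chapter = {2: 1, 3: 2}
labels = [0] * len(toy_lines)
for chapter, label in stage_for_chapter.items():
    start, end = ranges[chapter]
    labels[start:end] = [label] * (end - start)

print(ranges)
print(labels)
```

Deriving each end position from the next chapter's start avoids off-by-one gaps between neighbouring ranges.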


In [88]:
# Assign by position with .iloc; chained indexing such as df['label'][a:b] = x
# can write to a copy and triggers pandas' SettingWithCopyWarning.
label_col = labeled_data.columns.get_loc('label')

labeled_data.iloc[143:245, label_col] = 1  # Chapter 2 = Ordinary World
labeled_data.iloc[245:380, label_col] = 2  # Chapter 3 = Call to Adventure (most of the calling comes from an outside source)
labeled_data.iloc[380:535, label_col] = 3  # Chapter 4 = Refusal of the Call (Uncle Vernon does most of the refusing; Harry meets Hagrid)
labeled_data.iloc[535:824, label_col] = 4  # Chapter 5 = Meeting the Mentor (Harry has already met Hagrid, but they get more time to talk while shopping)
labeled_data.iloc[824:1113, label_col] = 5  # Chapter 6 = Crossing the Threshold (Platform 9 3/4)

# A fair amount of time is spent meeting allies, enemies, and challenges.  This part of the journey is the most vague.

# Chapter 12 = Approaching the Cave.  This is where the hero confronts themselves: Harry finds the
# Mirror of Erised (not until halfway through the chapter) and spends hours in the company of the
# parents he never knew.  As an orphan, this topic is very close to his heart.
labeled_data.iloc[1914:2122, label_col] = 6
labeled_data.iloc[2579:2901, label_col] = 7  # Chapter 16 = The Ordeal
# labeled_data.iloc[1914:2901, label_col] = 8  # End of Chapter 16 = Seize the Sword (Stone)
labeled_data.iloc[2901:, label_col] = 9  # Beginning of Chapter 17 = Rededication (facing Quirrell/Voldemort)
# labeled_data.iloc[2901:, label_col] = 10  # Harry wakes in hospital = Resurrection
# labeled_data.iloc[2901:, label_col] = 11  # End-of-year feast to the end of the book = Return with the Elixir


Unnamed: 0,label
count,3132.0
mean,0.032567
std,0.177529
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [46]:
manuscript[90]

'Chapter 1'

In [72]:
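def get_match(query, data, limit):
    # Pass `limit` by keyword: the third positional argument of process.extract
    # is `processor`, so a positional int raises "'int' object is not callable".
    return process.extract(query, data, limit=limit)

# Search the text column rather than the whole DataFrame
get_match('Chapter 1', labeled_data['text'], limit=3)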
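As a dependency-free way to experiment with the same idea of approximate matching against manuscript lines, here is a small sketch using the standard library's `difflib.get_close_matches`. This swaps in a different scorer from fuzzywuzzy, so the scores and ranking will not be identical; `toy_lines` and `closest_lines` are hypothetical names used only for this example.

```python
from difflib import get_close_matches

# Toy stand-in for the cleaned manuscript lines
toy_lines = ['Harry Potter', 'Chapter 1', 'The Boy Who Lived', 'Chapter 10', 'Chapter 2']

def closest_lines(query, lines, limit=3, cutoff=0.6):
    """Return up to `limit` lines whose similarity to `query` is at least `cutoff`."""
    return get_close_matches(query, lines, n=limit, cutoff=cutoff)

matches = closest_lines('Chapter 1', toy_lines)
print(matches)
```

Results are returned best match first, so an exact hit such as `'Chapter 1'` leads the list while near misses like `'Chapter 10'` follow it. That is worth keeping in mind when a fuzzy match is used to locate chapter headings.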

In [69]:
# Caution: iterating over a DataFrame yields its column names ('text', 'label'),
# not its rows, so `row` below is a column name and `found` is a whole column.
for row in labeled_data:
    for found, score, matchrow in process.extract(row, labeled_data, limit=10):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            print(Digi_Results)

60% partial match: "text" with "0                                            Harry Potter
2                                and the Sorcerer’s Stone
8                                                      by
10                                          J. K. Rowling
12                         Illustrations by Mary Grandpré
16                                 Arthur A. Levine Books
18                        An Imprint of Scholastic Press.
24                         For Jessica, who loves stories
26                          for Anne, who loved them too;
28                  and for Di, who heard this one first.
34                  Text copyright © 1997 by J.K. Rowling
36      Illustrations by Mary GrandPré copyright © 199...
38      All rights reserved. Published by Scholastic P...
40                                  Publishers since 1920
42      SCHOLASTIC, SCHOLASTIC PRESS, and the LANTERN ...
44      are trademarks and/or registered trademarks of...
48      HARRY POTTER and all related cha