# Exercise Fourteen: Project Design Starter

In this exercise, you'll be planning out a complex project. You'll draw in some code, but focus on commenting to describe your project structure. The sample document below will guide you through organizing and annotating your project design. The primary components you'll include are:

- **Dependencies:** What modules will your project need?
- **Collection:** Where is your data coming from?
- **Processing:** How will you format and process your data?
- **Analysis:** What techniques will you use to understand your data?
- **Visualization:** How will you visualize and explore your data?

Don't worry if you aren't exactly certain how you would implement everything - this should be a starting point for a larger research study, but it doesn't need to be a complete, functional workflow. Aim for a "good enough" starting point that you can reference and extend for future work. 

Note where you have something working, and where it's broken or in progress.

## Project Overview: NaNoGenMo

This sample project builds on our previous exercises inspired by National Novel Generation Month. It offers a framework for exporing text generation based upon children's literature, inspired by NaNoGenMo's call to think about different forms of procedural making. As such, it is guided by that project's rule: "Spend the month of November writing code that generates a novel of 50k+ words."


## Dependencies

Add the import code for every dependency of your project: for instance, if you are collecting data, you might import Tweepy or BeautifulSoup. If you're working with a file of folders, import os. Most projects will require Pandas, along with appropriate processing and visualization libraries. In the comments, explain briefly why you are including each library (as shown in the example below.) 

In [301]:
# Importing Tracery for generative grammars
import tracery
# Importing Markov and dependencies
import markovify
import random
# Importing OS for file sources
import os
# Importing nltk for word tokenization
import nltk
# Importing re to clean text
import re

## Collection

Describe your data collection scope and process briefly, and include an example of how you might collect your data drawing on our other projects. For example, if this workflow will collect Twitter data from a stream, you might revisit that demo, copy the stream, and adjust the hashtag.

In [269]:
# We'll need to extract our text from our novels, and we'll use regular expressions to start cleaning it of the Gutenberg header and footer 

up_to_word = "Contents"
after_word = "***"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
text = ""
import os
path = 'texts/'
with os.scandir(path) as entries:
    temporary = ""
    for entry in entries:
        print(entry.name)
        f = open(f'{path}\{entry.name}',encoding='utf-8-sig')
        temporary += f.read()
        temporary = (re.sub(rx_to_first, '', temporary, flags=re.DOTALL).strip())
        temporary = temporary[:temporary.index(after_word) + len(after_word)]
        temporary = temporary.replace("CHAPTER ","")
        text += temporary
print (text[0:500])


alice.txt
anne.txt
oz.txt
I.     Down the Rabbit-Hole
 II.    The Pool of Tears
 III.   A Caucus-Race and a Long Tale
 IV.    The Rabbit Sends in a Little Bill
 V.     Advice from a Caterpillar
 VI.    Pig and Pepper
 VII.   A Mad Tea-Party
 VIII.  The Queen’s Croquet-Ground
 IX.    The Mock Turtle’s Story
 X.     The Lobster Quadrille
 XI.    Who Stole the Tarts?
 XII.   Alice’s Evidence




I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do:


## Processing

After your data has been collected or imported, store it in a format that works for your purposes. This can vary: for Twitter analysis, it might be a Pandas dataframe, while for text, you might build a document term matrix.

In [270]:
# Now we'll pull proper nouns and verbs for our chapter headings and title
# To avoid words that are unlikely to flow, we'll remove any word with punctuation
def remove_punc(string):
    string = string.lower()
    punc = '''!()-[]{};:'’‘“-"\, <>—”./|?@#$%^&*_~'''
    for ele in string:  
        if ele in punc:  
            return ""
    return string
 
from nltk.tag import pos_tag
tagged_sent = pos_tag(text.split())
propernouns = [word for word,pos in tagged_sent if pos == 'NNP']
propernouns = [remove_punc(word) for word in propernouns]
propernouns = list(set(propernouns))
propernouns = [word for word in propernouns if len(word) >= 5]
print(propernouns[0:50])

verbs = [word for word,pos in tagged_sent if pos == 'VBZ']
verbs = [remove_punc(word) for word in verbs]
verbs = list(set(verbs))
verbs = [word for word in verbs if len(word) >= 5]
print(verbs[0:50])


['english', 'daniel', 'witch', 'pearl', 'laura', 'rosalia', 'sleeves', 'begin', 'wilson', 'teacher', 'hatter', 'luckily', 'delight', 'prissy', 'xxxvii', 'muriel', 'almost', 'quadrille', 'tragic', 'queens', 'multiplication', 'zealand', 'jealous', 'eaglet', 'spirit', 'professor', 'nobody', 'worrying', 'kalidahs', 'xxvii', 'canary', 'samuel', 'brook', 'allow', 'henceforth', 'organized', 'laurette', 'scotch', 'tillie', 'advice', 'duchess', 'caterpillar', 'bouncing', 'getting', 'balloon', 'project', 'royal', 'latitude', 'winkies', 'literature']
['takes', 'hurts', 'falls', 'dwells', 'accounts', 'dresses', 'trusts', 'dimly', 'leaves', 'starts', 'bridges', 'cares', 'calls', 'schoolboys', 'spoils', 'spells', 'stalks', 'cannot', 'happens', 'attends', 'hungry', 'shows', 'shadows', 'appears', 'seeks', 'brings', 'tricks', 'haunts', 'pronounces', 'queens', 'teases', 'declares', 'keeps', 'includes', 'feels', 'tarts', 'means', 'breathes', 'criticizes', 'signifies', 'proves', 'finds', 'drank', 'seems',

## Analysis

Think across all of the methods we've tried this semester - what combination would be most helpful for your goals? Include code sections for each method you think is important. In most cases, a combination will be most revealing: for instance, you might employ several different textual analysis frameworks on a set of documents. Use at least two distinctly different methods of analysis.

In [271]:
#We'll start with generating a chapter title function
from tracery.modifiers import base_english
rules = {
    'title' : ['#noun.capitalize# #verb.capitalize# The #noun.capitalize#','#verb.capitalize# and #verb.capitalize#',"The #noun.capitalize# and The #noun.capitalize#","#directions.capitalize# #noun.capitalize#"],
    'noun' : propernouns,
    'verb' : verbs,
    'directions' : ['under','above','below','beyond','around','over','through','near','far']
}
grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)
chapterTitle = grammar.flatten("#title#")
print(chapterTitle) 

Near Fancy


In [302]:
# This is an old prototype bot of mine - you'd want to write a new Tracery program
art_rules = { "origin": "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" width=\"1024\" height=\"512\">#background##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse##ellipse#</svg>",
"ellipse":"<ellipse cx=\"#cx#\" cy=\"#cy#\" rx=\"#rx#\" ry=\"50\" style=\"fill:#color#;stroke:#color#;stroke-width:2;opacity:#opacity#\" />",
"rx":["50","100","150","200","250","300","350"] ,
"color":["AliceBlue","Aqua","Aquamarine","Coral","Cyan","Crimson","DeepPink","DarkCyan","DeepSkyBlue","FireBrick"
"Indigo","SlateBlue","DarkMagenta","BlueViolet","DarkOrchid","Azure","BurlyWood","Bisque","DarkGreen","DarkSeaGreen","DarkRed","Fuschia","IndianRed","Maroon"],
"opacity":["0.2","0.3","0.4","0.5","0.6","0.7"],
"cx":["100","200","300","400","500","600","700","800","900"],
"cy":["100","200","300","400","500"],
"background":"<rect width=\"1024\" height=\"512\" style=\"fill:#backgroundcolor#\" />",
"backgroundcolor":["black","gray","darkgray"]}
art_grammar = tracery.Grammar(art_rules)
print(art_grammar.flatten("#origin#")) 

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1024" height="512"><rect width="1024" height="512" style="fill:darkgray" /><ellipse cx="100" cy="400" rx="250" ry="50" style="fill:Maroon;stroke:Coral;stroke-width:2;opacity:0.7" /><ellipse cx="700" cy="400" rx="100" ry="50" style="fill:BurlyWood;stroke:Cyan;stroke-width:2;opacity:0.2" /><ellipse cx="400" cy="300" rx="100" ry="50" style="fill:Coral;stroke:FireBrickIndigo;stroke-width:2;opacity:0.4" /><ellipse cx="100" cy="300" rx="100" ry="50" style="fill:BurlyWood;stroke:Bisque;stroke-width:2;opacity:0.6" /><ellipse cx="800" cy="300" rx="350" ry="50" style="fill:Aquamarine;stroke:BlueViolet;stroke-width:2;opacity:0.7" /><ellipse cx="400" cy="100" rx="250" ry="50" style="fill:AliceBlue;stroke:BlueViolet;stroke-width:2;opacity:0.4" /><ellipse cx="800" cy="200" rx="300" ry="50" style="fill:DarkGreen;stroke:DarkGreen;stroke-width:2;opacity:0.4" /><ellipse cx="200" cy="100" rx="150" ry="50" style="fil

In [303]:
# Next we'll build the chapter text function
def makeChapter(number):
    chapterText = "<h2>Chapter " + str(number) + ": "
    chapterText += grammar.flatten("#title#") + "</h2>"
    chapterText += "<p>"
    chapterText += "<center>" + art_grammar.flatten("#origin#") + "</center>"
    while (len(chapterText.split(" ")) < 5000):
      chapterText += "<p>"
      for i in range(random.randrange(3,9)):
        chapterText += text_model.make_sentence() + " "
      chapterText += "</p>"
    chapterText += "</p><p></p>"
    return chapterText

In [274]:
# Test Chapter function
text_model = markovify.Text(text)
print(makeChapter(1)[0:500])

<h2>Chapter 1: Around Phillips</h2><p><center><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1024" height="512"><rect width="1024" height="512" style="fill:black" /><ellipse cx="300" cy="200" rx="300" ry="50" style="fill:DeepSkyBlue;stroke:DarkOrchid;stroke-width:2;opacity:0.4" /><ellipse cx="500" cy="300" rx="350" ry="50" style="fill:Cyan;stroke:FireBrickIndigo;stroke-width:2;opacity:0.6" /><ellipse cx="300" cy="500" rx="200" ry="50" style="fill:BlueVi


## Visualization

Finally, think about the visualizations that would be most useful to sharing and exploring your data. Consider both static and dynamic approaches from the different libraries we've worked with this semester. Include at least two preliminary visualizations.

In [306]:
# build the novel html file
novelTitle = grammar.flatten("#title#")
with open("novel.html", "w", encoding="utf-8", errors="ignore") as f: 
    f.write("<html><head><title>" + novelTitle + "</title><style>h2 { font-family:Georgia,serif; color:#CC58DE; font-size:30px; line-height:1em; margin:0 0 0 60px; } h1 { color:#810474; font-size:85px; font-weight:normal; letter-spacing:-3.5px;line-height:1em; text-align:center;} p {font-family:'Helvetica Neue',Arial,sans-serif;font-size:15px;margin:30px 30px 30px 30px; letter-spacing:1px;} svg { max-width: 80% }</style></head><body>")
    f.write("<h1>" + novelTitle + "</h1>")
f.close()

with open ("novel.html", "a") as f:
    for chapter in range(1, 11):
        text = ""
        text = makeChapter(chapter)
        f.write(text)
    f.write("</body></html>")
f.close()