# Haiku

## Background

A traditional haiku consists of three lines of poetry with 5, 7, and 5 syllables respectively.

## Acquire

We found an archive of over 20,000 potential haiku poems, all based on a central theme.

We wrote a short Python script to download the collection, taking the host's bandwidth into consideration, and stored the pages in a local SQLite DB.

## Analyze

Using the Python package `syllapy`, we did some statistical evaluation of the poem database, looking for valid haiku.

## Results

We found that about 57% of the poems in the collection could be evaluated as valid, given the limitations of both the poets and `syllapy`. When we loosened the definition of a haiku to allow 4 to 6 syllables on the 1st and 3rd lines and 6 to 8 on the 2nd line, the match rate rose to about 90%.
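
As a minimal sketch of the two definitions above, the strict and loosened rules can be written as two predicates over a list of three syllable counts. The names `is_strict` and `is_loose` are ours, chosen for illustration; they are not part of `syllapy`.

```python
def is_strict(counts):
    """True when the three lines have exactly 5, 7 and 5 syllables."""
    return list(counts) == [5, 7, 5]


def is_loose(counts):
    """True when lines 1 and 3 have 4 to 6 syllables and line 2 has 6 to 8."""
    if len(counts) != 3:
        return False
    first, second, third = counts
    return 4 <= first <= 6 and 6 <= second <= 8 and 4 <= third <= 6


print(is_strict([5, 7, 5]), is_loose([4, 8, 6]), is_loose([3, 7, 5]))
```

Note that every strict haiku also passes the loose rule, so the loose total includes the strict one.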


In [131]:
import sqlite3
import requests  # To get the pages
import hashlib
import re

import pyphen
import syllapy
from collections import defaultdict, Counter

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

from statistics import mean
# from bs4 import BeautifulSoup # and to process them

con = sqlite3.connect("spam.db")
cur = con.cursor()


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/parallels/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Globals

Set up the global variables needed to crawl the source.

In [4]:
base_url = "http://web.mit.edu/jync/www/spam/"
extension = ".html"

## Scraper Table

Set up a table in the SQLite DB for the results of the scraping.

In [5]:
st = cur.execute("""
    CREATE TABLE IF NOT EXISTS "url_cache" (
        "hash" TEXT,
        "text" TEXT
    );
""")


## Poem Table

Set up a table in the SQLite DB for the extracted poems, or potential poems.

In [6]:
# pt = cur.execute("DROP TABLE poems;")

pt = cur.execute("""
    CREATE TABLE IF NOT EXISTS "poems" (
        "id" INTEGER UNIQUE,
        "line_1" TEXT,
        "line_2" TEXT,
        "line_3" TEXT,
        "author" TEXT
    );
""")
print(pt)

<sqlite3.Cursor object at 0xffffa00586c0>


# Scraper

Run the scraper, checking whether each page has already been scraped and only scraping it if needed.

In [7]:
for i in range ( 1001,19601,100 ) :
    url = base_url + str(i) + "-" + str(i + 99) + extension
    # print(url)
    
    ID = i
    hash = hashlib.md5(url.encode()).hexdigest()
    
    # https://stackoverflow.com/a/9756276
    # Use a bound parameter rather than string concatenation.
    res = cur.execute("SELECT EXISTS(SELECT 1 FROM url_cache WHERE hash = ?);", (hash,)).fetchone()[0]
    
    if res == 0 :
        r = requests.get(url)
        text = r.text
        cur.execute("INSERT INTO url_cache VALUES ( ?, ?)", ( hash, text ) )
        # print(hash)

    else :
        # print(hash)
        continue 
        
con.commit()
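
The loop above skips pages that are already cached. One way to also space out the real requests is sketched below. The `delay` parameter, the `fetch_cached` helper and the stand-in `fake_fetch` function are assumptions for illustration and do not appear in the notebook; in real use `fetch` could be something like `lambda u: requests.get(u).text`.

```python
import hashlib
import sqlite3
import time


def fetch_cached(con, url, fetch, delay=1.0):
    """Return the page text for url, calling fetch(url) only on a cache miss."""
    key = hashlib.md5(url.encode()).hexdigest()
    row = con.execute("SELECT text FROM url_cache WHERE hash = ?", (key,)).fetchone()
    if row is not None:
        return row[0]  # cache hit: no request, no pause
    text = fetch(url)
    con.execute("INSERT INTO url_cache VALUES (?, ?)", (key, text))
    con.commit()
    time.sleep(delay)  # pause only after a real request
    return text


# Demonstration with an in-memory database and a stand-in fetcher (no network).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IF NOT EXISTS url_cache (hash TEXT, text TEXT)")
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>page</html>"


first = fetch_cached(con, "http://example.com/1001-1100.html", fake_fetch, delay=0)
second = fetch_cached(con, "http://example.com/1001-1100.html", fake_fetch, delay=0)
print(len(calls))
```

The second call is served from the cache, so the stand-in fetcher runs only once.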

## Poem Extractor

Using a regular expression, find all candidate poems in the cached page content and store the matches in the poem table in the database.

In [8]:
for i in range ( 1001,19601,100 ) :
    url = base_url + str(i) + "-" + str(i + 99) + extension
    # print(url)
    
    ID = i
    hash = hashlib.md5(url.encode()).hexdigest()
    # print(hash)
    res = cur.execute("SELECT text FROM url_cache WHERE hash = ?", (hash,))
    text = res.fetchone()[0]
    
    # print(len(text))
    matches = re.findall(r"<SPAN .*\n.*\n.*\n.*\n.*\n.*\n.*<P>",text)
    
    for match in matches :
        # Found some matches
        # print( match )
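        found = re.findall(r"<SPAN CLASS='Number'>(.*)\.</SPAN><BR>\n.*\n(.*)<BR>\n(.*)<BR>\n(.*)<BR>\n.*\n.*>--(.*)</ADDRESS><P>", match)
        # Skip candidates that do not fit the full poem pattern instead of raising IndexError.
        groups = found[0] if found else ()
        
        if len( groups) == 5 :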
            # https://stackoverflow.com/a/4869782

            poem_values = [ int(groups[0]), 
                           re.sub('<[^>]*>', '', groups[1]), 
                           re.sub('<[^>]*>', '', groups[2]),
                           re.sub('<[^>]*>', '', groups[3]),
                           re.sub('<[^>]*>', '', groups[4]),]
            # print(poem_values)

            insert_results = cur.execute("INSERT OR REPLACE INTO poems VALUES ( ?, ?, ?, ?, ? )", poem_values )
            # print(insert_results)
        else :
            pass
            # print(len(groups))
    # break

con.commit()
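
To make the extraction step easier to follow, here is a small self-contained sketch that applies the same pattern to a single entry. The sample markup is a hypothetical reconstruction inferred from the regular expression, not copied from the source pages, and the poem number and author are made up. The helper name `extract_poems` is also ours.

```python
import re

# Same pattern as in the notebook, compiled once.
POEM_RE = re.compile(
    r"<SPAN CLASS='Number'>(.*)\.</SPAN><BR>\n.*\n"
    r"(.*)<BR>\n(.*)<BR>\n(.*)<BR>\n.*\n.*>--(.*)</ADDRESS><P>"
)
TAG_RE = re.compile(r"<[^>]*>")  # strips any leftover inline tags


def extract_poems(html):
    """Return a list of (id, line_1, line_2, line_3, author) tuples."""
    poems = []
    for m in POEM_RE.finditer(html):
        number, *fields = m.groups()
        poems.append((int(number), *[TAG_RE.sub("", f) for f in fields]))
    return poems


# Hypothetical page fragment shaped to fit the pattern above.
sample = (
    "<SPAN CLASS='Number'>1234.</SPAN><BR>\n"
    "<BLOCKQUOTE>\n"
    "A tallow candle<BR>\n"
    "burns, and the greasy smoke is<BR>\n"
    "the <I>essence</I> of SPAM.<BR>\n"
    "</BLOCKQUOTE>\n"
    "<ADDRESS>--Anonymous</ADDRESS><P>"
)

poems = extract_poems(sample)
print(poems)
```

Using `finditer` on the whole page avoids the two-pass match, and entries that do not fit the pattern are simply not returned.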

## Syllable Counter

Setting up the required tools to do the syllable counting.

In [10]:
dic = pyphen.Pyphen(lang='en')
# dic.inserted(text)

In [158]:
# count = syllapy.count('additional.')
# print(count)

In [161]:
syllable_counts = defaultdict(list)
invalid_words = list()
all_words = list()

Load all poems from the database and count the syllables in each line of each poem.

In [174]:
for poem in cur.execute("SELECT * FROM poems;"):
    first_line = 0
    second_line = 0
    third_line = 0

    for word in re.sub(r'[^\w\s]', '', poem[1]).split() :
        all_words.append(word.lower())
        first_line += syllapy.count(word)
        # record bad words
        if syllapy.count(word) == 0 :
            invalid_words.append(word)
    for word in re.sub(r'[^\w\s]', '', poem[2]).split() :
        second_line += syllapy.count(word)
        all_words.append(word.lower())
    for word in re.sub(r'[^\w\s]', '', poem[3]).split() :
        third_line += syllapy.count(word)
        all_words.append(word.lower())
    
#     print(first_line)
#     print(second_line)
#     print(third_line)
    syllable_counts[poem[0]] = [first_line, second_line, third_line]
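
The loop above repeats the same steps for each line and records zero-syllable words only for the first line. A possible refactoring, sketched here under assumptions, moves that logic into one helper that is applied to every line. `line_syllables` is a hypothetical name, and `naive_count` is a crude vowel-group stand-in so the sketch runs without `syllapy`; in the notebook one would pass `count=syllapy.count` instead.

```python
import re


def naive_count(word):
    """Stand-in counter: number of vowel groups, 0 when there are none."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def line_syllables(line, count=naive_count):
    """Return (total, zero_words) for one line after stripping punctuation."""
    words = re.sub(r"[^\w\s]", "", line).split()
    per_word = [(w, count(w)) for w in words]
    total = sum(n for _, n in per_word)
    zero_words = [w for w, n in per_word if n == 0]
    return total, zero_words


print(line_syllables("pink brick of pork teat"))
print(line_syllables("B-52 1st"))
```

Tokens made only of digits and consonants come back with a count of 0, which is how a line can end up with no syllables at all.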
    
    

In [106]:
# syllable_counts.items()

# Analysis

In [175]:
first_lines = []
second_lines = []
third_lines = []

valid = 0

semi_valid = 0


for poem in syllable_counts.items():
    first_lines.append(poem[1][0])
    second_lines.append(poem[1][1])
    third_lines.append(poem[1][2])
    # break
    # print(type(poem[1][0]))

    if poem[1][0] == 5 and \
    poem[1][1] == 7 and \
    poem[1][2] == 5 :
        valid = valid + 1
        
    if poem[1][0] >= 4 and poem[1][0] <= 6 and \
    poem[1][1] >= 6 and poem[1][1] <= 8 and \
    poem[1][2] >= 4 and poem[1][2] <= 6 :
        semi_valid = semi_valid + 1
    
print("Sums of all syllables")
print("First Line: "+str(sum(first_lines)))  
print("Second Line: "+str(sum(second_lines)))  
print("Third Line: "+str(sum(third_lines)))  
print()

print("Minimum number of syllables")
print("First Line: "+str(min(first_lines)))  
print("Second Line: "+str(min(second_lines)))  
print("Third Line: "+str(min(third_lines)))  
print()

print("Maximum number of syllables")
print("First Line: "+str(max(first_lines)))  
print("Second Line: "+str(max(second_lines)))  
print("Third Line: "+str(max(third_lines)))  
print()

print("Average number of syllables")
print("First Line: "+str(round(mean(first_lines),4)))  
print("Second Line: "+str(round(mean(second_lines),4)))  
print("Third Line: "+str(round(mean(third_lines),4)))  
print()

print("Valid Poems")
print( str(valid)+" out of 19601" )
print( str(100*round(valid/19601,6))+"% are valid" )
print()

print("Semi Valid Poems")
print( str(semi_valid)+" out of 19601" )
print( str(100*round(semi_valid/19601,6))+"% are semi valid" )
print()


Sums of all syllables
First Line: 93189
Second Line: 130978
Third Line: 93589

Minimum number of syllables
First Line: 0
Second Line: 0
Third Line: 0

Maximum number of syllables
First Line: 8
Second Line: 77
Third Line: 22

Average number of syllables
First Line: 5.0102
Second Line: 7.0418
Third Line: 5.0317

Valid Poems
11160 out of 19601
56.9359% are valid

Semi Valid Poems
17761 out of 19601
90.6127% are semi valid
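
The cell above divides by a hard-coded total. As a hedged alternative, the sketch below takes the denominator from the number of poems actually counted, so the percentages follow whatever the extractor stored. The `summarize` helper and the toy input are ours, for illustration only.

```python
def summarize(syllable_counts):
    """Count strict and loose matches and express them as percentages."""
    total = len(syllable_counts)
    valid = sum(1 for c in syllable_counts.values() if list(c) == [5, 7, 5])
    semi_valid = sum(
        1 for c in syllable_counts.values()
        if 4 <= c[0] <= 6 and 6 <= c[1] <= 8 and 4 <= c[2] <= 6
    )

    def pct(n):
        # Guard against an empty table.
        return round(100 * n / total, 2) if total else 0.0

    return {
        "total": total,
        "valid": valid,
        "semi_valid": semi_valid,
        "valid_pct": pct(valid),
        "semi_valid_pct": pct(semi_valid),
    }


# Toy input: poem id -> [line 1, line 2, line 3] syllable counts.
toy = {1: [5, 7, 5], 2: [4, 7, 6], 3: [0, 0, 0], 4: [5, 7, 5]}
summary = summarize(toy)
print(summary)
```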



## Naughty Words

Words that were not in the syllable dictionary and could therefore produce lines with 0 syllables.

In [114]:
print(invalid_words[:15])

['95', 'NC-17:', "'40s", '8th', "2's", '13:', '/', '13', '5', '1st', 'MST3K:', '1984:', 'B-52', '486:', '911!']


## Most Common Words

Using `Counter`, we pulled out the most common words. We then removed stop words from the word list and produced a second list of most common words.

In [176]:
# Pass the all_words list to an instance of the Counter class.
AllWordsCounter = Counter(all_words)
  
# most_common() produces k frequently encountered
# input values and their respective counts.
print("Most Common All Words")
print(AllWordsCounter.most_common(20))
print()

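# Build the stop word set once instead of reloading the list for every word.
stop_words = set(stopwords.words("english"))
filtered_words = [t for t in all_words if t not in stop_words]

# Pass the filtered_words list to an instance of the Counter class.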
FilteredWordsCounter = Counter(filtered_words)
  
# most_common() produces k frequently encountered
# input values and their respective counts.
print("Most Common Filtered Words")
print(FilteredWordsCounter.most_common(20))


Most Common All Words
[('spam', 15338), ('the', 8381), ('a', 4836), ('i', 4630), ('of', 4357), ('in', 3672), ('and', 3624), ('is', 3146), ('to', 3099), ('my', 3080), ('it', 2924), ('you', 2218), ('can', 2152), ('on', 1881), ('for', 1812), ('pink', 1654), ('with', 1579), ('me', 1365), ('eat', 1308), ('not', 1193)]

Most Common Filtered Words
[('spam', 15338), ('pink', 1654), ('eat', 1308), ('like', 1100), ('meat', 1070), ('one', 655), ('blue', 605), ('love', 586), ('pig', 578), ('spamku', 507), ('hormel', 503), ('good', 477), ('pork', 473), ('food', 473), ('dont', 463), ('oh', 463), ('new', 390), ('man', 387), ('life', 385), ('haiku', 377)]
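
For readers who want to try the filtering step without downloading the NLTK corpus, here is a small sketch on toy data. The tiny `STOP` set and the `top_words` helper are hypothetical; in the notebook the stop list comes from `stopwords.words("english")`.

```python
from collections import Counter

# Hypothetical mini stop list, standing in for the NLTK English list.
STOP = {"the", "a", "of", "is", "to"}


def top_words(words, stop=frozenset(), k=3):
    """Return the k most common words that are not in the stop set."""
    return Counter(w for w in words if w not in stop).most_common(k)


words = "the essence of spam is the pink spam".split()
unfiltered = top_words(words)
filtered = top_words(words, STOP)
print(unfiltered)
print(filtered)
```

Set membership is a constant-time lookup, which matters when the word list has tens of thousands of entries.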


# Sample of Poems

In [243]:
def print_poem(poem):
    print(poem[1])
    print(poem[2])
    print(poem[3])
    print()
    
def analyze_poem(poem):
    counts = [0,0,0,0]
    for i in range(1,4):
        results = ""
        line = poem[i]
        # print(line)
        words = line.split()
        for word in words :
            n = syllapy.count(word)  # count each word once
            results = results + word + "(" + str(n) + ") "
            counts[i] = counts[i] + n
        print(results + "=> " + str(counts[i]))
    print()
    if counts[1] == 5 and counts[2] == 7 and counts[3] == 5 :
        print("Valid Haiku")
    elif counts[1] >= 4 and counts[1] <= 6 and \
    counts[2] >= 6 and counts[2] <= 8 and \
    counts[3] >= 4 and counts[3] <= 6:
        print("Semi Valid Haiku")
    else:
        print("Invalid Haiku")
    print()

In [247]:
for poem in cur.execute("SELECT * FROM poems ORDER BY random() LIMIT 20;"):
    print_poem(poem)
    analyze_poem(poem)
    print("---------------------------------------------")
    print()

A resolution
to improve my "quality
of life": no more SPAM!

A(1) resolution(4) => 5
to(1) improve(2) my(1) "quality(3) => 7
of(1) life":(1) no(1) more(1) SPAM!(1) => 5

Valid Haiku

---------------------------------------------

disposable meat
unlikely recycled treat
pink brick of pork teat

disposable(4) meat(1) => 5
unlikely(3) recycled(3) treat(1) => 7
pink(1) brick(1) of(1) pork(1) teat(1) => 5

Valid Haiku

---------------------------------------------

A tallow candle
burns, and the greasy smoke is
the essence of SPAM.

A(1) tallow(2) candle(2) => 5
burns,(1) and(1) the(1) greasy(2) smoke(1) is(1) => 7
the(1) essence(2) of(1) SPAM.(1) => 5

Valid Haiku

---------------------------------------------

See, pigs are touchy.
That's why they don't rule the Earth.
Yay for SPAM! Eat it.

See,(1) pigs(1) are(1) touchy.(2) => 5
That's(1) why(1) they(1) don't(1) rule(1) the(1) Earth.(1) => 7
Yay(1) for(1) SPAM!(1) Eat(1) it.(1) => 5

Valid Haiku

-----------------------------------------

## Cleanup

Close database connection.

In [157]:
con.close()