## Goals
* Clean the data imported from the csv file

This code is adapted from https://github.com/adashofdata/nlp-in-python-tutorial (using my own examples)

## Input Data from csv, delete header row, and create a dictionary with titles as keys

In [1]:
import csv
file = open('syllabusFullText.csv') #created using getContent (from Wikipedia functions)
reader = csv.reader(file)
lists = list(reader)

del lists[0] #deletes the header row

data = {}
for i in range(len(lists)):
    data[lists[i][0]] = lists[i][1]

In [2]:
#Check data
data['William Wordsworth']

'William Wordsworth (7 April 1770 – 23 April 1850) was an English Romantic poet who, with Samuel Taylor Coleridge, helped to launch the Romantic Age in English literature with their joint publication Lyrical Ballads (1798).\nWordsworth\'s magnum opus is generally considered to be The Prelude, a semi-autobiographical poem of his early years that he revised and expanded a number of times. It was posthumously titled and published by his wife in the year of his death, before which it was generally known as "the poem to Coleridge".Wordsworth was Britain\'s poet laureate from 1843 until his death from pleurisy on 23 April 1850.\n\n\n== Early life ==\n\nThe second of five children born to John Wordsworth and Ann Cookson, William Wordsworth was born on 7 April 1770 in what is now named Wordsworth House in Cockermouth, Cumberland, part of the scenic region in northwestern England known as the Lake District. William\'s sister, the poet and diarist Dorothy Wordsworth, to whom he was close all his

## Clean the data and create a document term matrix

Code below is copied from https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb and then amended

In [3]:
# changing dictionary to pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame(data, index=[data.keys()]).transpose()
data_df.columns = ['Text']

data_df

Unnamed: 0,Text
William Wordsworth,"William Wordsworth (7 April 1770 – 23 April 1850) was an English Romantic poet who, with Samuel Taylor Coleridge, helped to launch the Romantic Ag..."
Samuel Taylor Coleridge,"Samuel Taylor Coleridge (; 21 October 1772 – 25 July 1834) was an English poet, literary critic, philosopher and theologian who, with his friend ..."
Anna Laetitia Barbauld,"Anna Laetitia Barbauld (, by herself possibly , as in French, née Aikin; 20 June 1743 – 9 March 1825) was a prominent English poet, essayist, lite..."
Mary Wollstonecraft,"Mary Wollstonecraft (, also UK: ; 27 April 1759 – 10 September 1797) was an English writer, philosopher, and advocate of women's rights. Until the..."
Percy Bysshe Shelley,"Percy Bysshe Shelley ( (listen) BISH; 4 August 1792 – 8 July 1822) was one of the major English Romantic poets, widely regarded as one of the fine..."
Christina Rossetti,"Christina Georgina Rossetti (5 December 1830 – 29 December 1894) was an English poet who wrote various romantic, devotional, and children's poems...."
"Alfred, Lord Tennyson","Alfred Tennyson, 1st Baron Tennyson (6 August 1809 – 6 October 1892) was a British poet. He was the Poet Laureate of Great Britain and Ireland du..."
Robert Browning,Robert Browning (7 May 1812 – 12 December 1889) was an English poet and playwright whose mastery of the dramatic monologue made him one of the for...
Elizabeth Barrett Browning,"Elizabeth Barrett Browning (née Moulton-Barrett, ; 6 March 1806 – 29 June 1861) was an English poet of the Victorian era, popular in Britain and t..."
Rudyard Kipling,"Joseph Rudyard Kipling ( RUD-yərd; 30 December 1865 – 18 January 1936) was an English journalist, short-story writer, poet, and novelist. He was b..."


In [6]:
#Check data to see what should be cleaned
data_df.iloc[1,0]

'Samuel Taylor Coleridge  (; 21 October 1772 – 25 July 1834) was an English poet, literary critic, philosopher and theologian who, with his friend William Wordsworth, was a founder of the Romantic Movement in England and a member of the Lake Poets.  He also shared volumes and collaborated with Charles Lamb, Robert Southey, and Charles Lloyd. He wrote the poems The Rime of the Ancient Mariner and Kubla Khan, as well as the major prose work Biographia Literaria. His critical work, especially on William Shakespeare, was highly influential, and he helped introduce German idealist philosophy to English-speaking culture. Coleridge coined many familiar words and phrases, including suspension of disbelief. He had a major influence on Ralph Waldo Emerson and American transcendentalism.\nThroughout his adult life Coleridge had crippling bouts of anxiety and depression; it has been speculated that he had bipolar disorder, which had not been defined during his lifetime. He was physically unhealthy

In [7]:
data_df.iloc[19,0]

'Zadie Adeline Smith FRSL (born 25 October 1975) is an English novelist, essayist, and short-story writer. Her debut novel, White Teeth (2000), immediately became a best-seller and won a number of awards. She has been a tenured professor in the Creative Writing faculty of New York University since September 2010.\n\n\n== Early life ==\nSmith was born in Willesden in the north-west London borough of Brent to a Jamaican mother, Yvonne Bailey, and an English father, Harvey Smith, who was 30 years his wife\'s senior. At the age of 14, she changed her name from Sadie to Zadie.Smith\'s mother grew up in Jamaica and emigrated to England in 1969. Smith\'s parents divorced when she was a teenager. She has a half-sister, a half-brother, and two younger brothers (one is the rapper and stand-up comedian Doc Brown, and the other is the rapper Luc Skyz). As a child, Smith was fond of tap dancing, and in her teenage years, she considered a career in musical theatre. While at university, Smith earned 

In [17]:
# First round of text cleaning: set lower case, remove the section headers, punctuation, and numbers
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('== See also ==(.*)', ' ', text) #removes everything after the "See Also" section header (this deletes references, links, etc.)
    text = re.sub('\=+.*?\=+', ' ', text) #Removes all the headers (everything between 2 or more ==)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text) #Removes punctuation
    text = re.sub('\w*\d\w*', '', text) #removes all the words with numbers
    return text

round1 = lambda x: clean_text_round1(x) #This creates a function using the above: so can write round1(x) 

data_clean = pd.DataFrame(data_df.Text.apply(round1))

In [18]:
# Check the data to see what can be removed in a second round
data_clean.iloc[2,0]

'anna laetitia barbauld    by herself possibly   as in french  née aikin   june  –  march   was a prominent english poet  essayist  literary critic  editor  and author of children s literature \na  woman of letters  who published in multiple genres  barbauld had a successful writing career at a time when women rarely wrote professionally  she was a noted teacher at the palgrave academy and an innovative writer of works for children  her primers provided a model for more than a century  her essays showed it was possible for a woman to be publicly engaged in politics  other women authors such as elizabeth benger emulated her  barbauld s literary career spanned numerous periods in british literary history  her work promoted the values of the enlightenment and of sensibility  while her poetry made a founding contribution to the development of british romanticism  barbauld was also a literary critic  her anthology of  century novels helped to establish the canon as it is known today \nbarba

In [19]:
data_clean.iloc[18,0]

'sir derek alton walcott  kcsl  obe  occ   january  –  march   was a saint lucian poet and playwright  he received the  nobel prize in literature  he was the university of alberta s first distinguished scholar in residence  where he taught undergraduate and graduate writing courses  he also served as professor of poetry at the university of essex from  to   his works include the homeric epic poem omeros     which many critics view  as walcott s major achievement   in addition to winning the nobel prize  walcott received many literary awards over the course of his career  including an obie award in  for his play dream on monkey mountain  a macarthur foundation  genius  award  a royal society of literature award  the queen s medal for poetry  the inaugural ocm bocas prize for caribbean literature  the  t  s  eliot prize for his book of poetry white egrets and the griffin trust for excellence in poetry lifetime recognition award in  \n\n\n \nwalcott was born and raised in castries  saint 

In [20]:
# Second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('\n', ' ', text)
    text = re.sub('\t', ' ', text)
    return text

round2 = lambda x: clean_text_round2(x)

data_clean = pd.DataFrame(data_clean.Text.apply(round2))

In [22]:
# Check the data again
data_clean.iloc[18,0]

'sir derek alton walcott  kcsl  obe  occ   january  –  march   was a saint lucian poet and playwright  he received the  nobel prize in literature  he was the university of alberta s first distinguished scholar in residence  where he taught undergraduate and graduate writing courses  he also served as professor of poetry at the university of essex from  to   his works include the homeric epic poem omeros     which many critics view  as walcott s major achievement   in addition to winning the nobel prize  walcott received many literary awards over the course of his career  including an obie award in  for his play dream on monkey mountain  a macarthur foundation  genius  award  a royal society of literature award  the queen s medal for poetry  the inaugural ocm bocas prize for caribbean literature  the  t  s  eliot prize for his book of poetry white egrets and the griffin trust for excellence in poetry lifetime recognition award in       walcott was born and raised in castries  saint luci

In [23]:
# Add full names (need for visualizations)

namesList = list(data.keys())
data_df['Page_Title'] = namesList
data_df

Unnamed: 0,Text,Page_Title
William Wordsworth,"William Wordsworth (7 April 1770 – 23 April 1850) was an English Romantic poet who, with Samuel Taylor Coleridge, helped to launch the Romantic Ag...",William Wordsworth
Samuel Taylor Coleridge,"Samuel Taylor Coleridge (; 21 October 1772 – 25 July 1834) was an English poet, literary critic, philosopher and theologian who, with his friend ...",Samuel Taylor Coleridge
Anna Laetitia Barbauld,"Anna Laetitia Barbauld (, by herself possibly , as in French, née Aikin; 20 June 1743 – 9 March 1825) was a prominent English poet, essayist, lite...",Anna Laetitia Barbauld
Mary Wollstonecraft,"Mary Wollstonecraft (, also UK: ; 27 April 1759 – 10 September 1797) was an English writer, philosopher, and advocate of women's rights. Until the...",Mary Wollstonecraft
Percy Bysshe Shelley,"Percy Bysshe Shelley ( (listen) BISH; 4 August 1792 – 8 July 1822) was one of the major English Romantic poets, widely regarded as one of the fine...",Percy Bysshe Shelley
Christina Rossetti,"Christina Georgina Rossetti (5 December 1830 – 29 December 1894) was an English poet who wrote various romantic, devotional, and children's poems....",Christina Rossetti
"Alfred, Lord Tennyson","Alfred Tennyson, 1st Baron Tennyson (6 August 1809 – 6 October 1892) was a British poet. He was the Poet Laureate of Great Britain and Ireland du...","Alfred, Lord Tennyson"
Robert Browning,Robert Browning (7 May 1812 – 12 December 1889) was an English poet and playwright whose mastery of the dramatic monologue made him one of the for...,Robert Browning
Elizabeth Barrett Browning,"Elizabeth Barrett Browning (née Moulton-Barrett, ; 6 March 1806 – 29 June 1861) was an English poet of the Victorian era, popular in Britain and t...",Elizabeth Barrett Browning
Rudyard Kipling,"Joseph Rudyard Kipling ( RUD-yərd; 30 December 1865 – 18 January 1936) was an English journalist, short-story writer, poet, and novelist. He was b...",Rudyard Kipling


In [24]:
# Pickle, to save the data to use again
import pickle

data_df.to_pickle("corpus.pkl") #This is just the corpus, not cleaned? Why not?

In [25]:
# Create a document term matrix, and pickle
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english', ngram_range=(1,2), min_df=3)
#NOTE -- play with ngram range, to clean relevant bigrams (e.g., Mickey Mouse)
data_cv = cv.fit_transform(data_clean.Text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,abandon,abandoned,abandonment,abbey,ability,able,abolition,abroad,academic,academy,...,york university,yorker,young,young man,young people,young poet,younger,youngest,youth,youthful
William Wordsworth,0,0,1,3,0,1,0,0,0,0,...,0,0,3,0,0,0,0,1,1,1
Samuel Taylor Coleridge,1,1,0,1,0,1,0,0,1,0,...,0,0,2,0,0,1,0,1,1,0
Anna Laetitia Barbauld,2,0,0,0,0,1,0,0,1,7,...,0,0,7,0,1,0,0,1,0,1
Mary Wollstonecraft,0,0,1,0,2,0,1,0,0,1,...,0,1,7,0,0,0,0,0,1,0
Percy Bysshe Shelley,0,3,0,1,0,0,0,1,0,1,...,1,0,6,0,1,0,7,1,2,1
Christina Rossetti,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
"Alfred, Lord Tennyson",0,0,0,2,0,0,0,0,0,0,...,0,0,1,1,0,0,1,0,0,0
Robert Browning,0,1,0,1,1,0,0,3,0,0,...,0,0,1,0,0,0,1,0,1,0
Elizabeth Barrett Browning,0,0,0,0,0,1,2,0,0,0,...,0,0,1,0,0,0,0,0,1,0
Rudyard Kipling,1,0,0,2,1,0,0,3,1,2,...,1,0,1,1,0,0,2,2,1,0


In [26]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")
data_clean.to_pickle('data_clean.pkl') #This is the cleaned data
pickle.dump(cv, open("cv.pkl", "wb")) #This is the CountVectorizer object