# Spell Checker

#### Ting-Wei Shen, tis50@pitt.edu, Feb 20, 2019

### Overview

I chose my data from the Project Gutenberg website because the copyrights on these texts have expired. My project goal is to design a spell checker, so I selected well-edited, high-quality texts as my training data.



In [1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
import pandas as pd
import numpy as np
import re

In [2]:
%pprint()

Pretty printing has been turned OFF


## Loading Data

#### Using nltk to help process the data

In [3]:
corpus_root = './data/'
mlkcor = PlaintextCorpusReader(corpus_root, '.*txt')

In [4]:
mtoks = [w.lower() for w in mlkcor.words()]
mtokfd = nltk.FreqDist(mtoks)

## Basic stats on my data

#### How big is the corpus?

In [5]:
len(mtoks)

3837367

#### Look at the files

In [6]:
mlkcor.fileids()

['Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt', 'Anna_Karenina_by_Leo_Tolstoy.txt', 'David_Copperfield_by_Charles_Dickens.txt', 'Don_Quixote_by_Miguel_de_Cervantes.txt', 'Dracula_by_Bram_Stoker.txt', 'Emma_by_Jane_Austen.txt', 'Frankenstein_by_Mary_Shelley.txt', 'Great_Expectations_by_Charles_Dickens.txt', 'Grimms_Fairy_Tales_by_The_Brothers_Grimm.txt', 'Metamorphosis_by_Franz_Kafka.txt', 'Oliver_Twist_by_Charles_Dickens.txt', 'Pride_and_Prejudice_by_Jane_Austen.txt', 'The_Adventures_of_Sherlock_Holmes_by_Arthur_Conan_Doyle.txt', 'The_Adventures_of_Tom_Sawyer_by_Mark_Twain.txt', 'The_Count_of_Monte_Cristo_by_Alexandre_Dumas.txt', 'The_Picture_of_Dorian_Gray_by_Oscar_Wilde.txt', 'The_Prince_by_Nicolo_Machiavelli.txt', 'The_Romance_of_Lust_by_Anonymous.txt', 'The_Yellow_Wallpaper_by_Charlotte_Perkins_Gilman.txt', 'Through_the_Looking_Glass_by_Lewis_Carroll.txt']

In [7]:
for f in mlkcor.fileids():
    print(f, 'has', len(mlkcor.words(f)), 'tokens.')

Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt has 37861 tokens.
Anna_Karenina_by_Leo_Tolstoy.txt has 433609 tokens.
David_Copperfield_by_Charles_Dickens.txt has 448118 tokens.
Don_Quixote_by_Miguel_de_Cervantes.txt has 497782 tokens.
Dracula_by_Bram_Stoker.txt has 196321 tokens.
Emma_by_Jane_Austen.txt has 196091 tokens.
Frankenstein_by_Mary_Shelley.txt has 89381 tokens.
Great_Expectations_by_Charles_Dickens.txt has 230279 tokens.
Grimms_Fairy_Tales_by_The_Brothers_Grimm.txt has 124555 tokens.
Metamorphosis_by_Franz_Kafka.txt has 29080 tokens.
Oliver_Twist_by_Charles_Dickens.txt has 204966 tokens.
Pride_and_Prejudice_by_Jane_Austen.txt has 147764 tokens.
The_Adventures_of_Sherlock_Holmes_by_Arthur_Conan_Doyle.txt has 129459 tokens.
The_Adventures_of_Tom_Sawyer_by_Mark_Twain.txt has 94521 tokens.
The_Count_of_Monte_Cristo_by_Alexandre_Dumas.txt has 573731 tokens.
The_Picture_of_Dorian_Gray_by_Oscar_Wilde.txt has 99457 tokens.
The_Prince_by_Nicolo_Machiavelli.txt has 59953 tokens.

In [8]:
DataSet = pd.DataFrame()

DataSet['filename'] = [f[:-4] for f in mlkcor.fileids()]
DataSet['tokens'] = [len(mlkcor.words(f)) for f in mlkcor.fileids()]


In [9]:
DataSet.head(5)

| | filename | tokens |
|---|---|---|
| 0 | Alices_Adventures_in_Wonderland_by_Lewis_Carroll | 37861 |
| 1 | Anna_Karenina_by_Leo_Tolstoy | 433609 |
| 2 | David_Copperfield_by_Charles_Dickens | 448118 |
| 3 | Don_Quixote_by_Miguel_de_Cervantes | 497782 |
| 4 | Dracula_by_Bram_Stoker | 196321 |


In [10]:
DataSet

| | filename | tokens |
|---|---|---|
| 0 | Alices_Adventures_in_Wonderland_by_Lewis_Carroll | 37861 |
| 1 | Anna_Karenina_by_Leo_Tolstoy | 433609 |
| 2 | David_Copperfield_by_Charles_Dickens | 448118 |
| 3 | Don_Quixote_by_Miguel_de_Cervantes | 497782 |
| 4 | Dracula_by_Bram_Stoker | 196321 |
| 5 | Emma_by_Jane_Austen | 196091 |
| 6 | Frankenstein_by_Mary_Shelley | 89381 |
| 7 | Great_Expectations_by_Charles_Dickens | 230279 |
| 8 | Grimms_Fairy_Tales_by_The_Brothers_Grimm | 124555 |
| 9 | Metamorphosis_by_Franz_Kafka | 29080 |


#### Cleaning the data

In [11]:
mlkcor.raw('Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt')[:1000]

'Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Alice’s Adventures in Wonderland\r\n\r\nAuthor: Lewis Carroll\r\n\r\nPosting Date: June 25, 2008 [EBook #11]\r\nRelease Date: March, 1994\r\nLast Updated: October 6, 2016\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nALICE’S ADVENTURES IN WONDERLAND\r\n\r\nLewis Carroll\r\n\r\nTHE MILLENNIUM FULCRUM EDITION 3.0\r\n\r\n\r\n\r\n\r\nCHAPTER I. Down the Rabbit-Hole\r\n\r\nAlice was beginning to get very tired of sitting by her sister on the\r\nbank, and of having nothing to do: once or twice she had

In [12]:
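def clean_text(txt):
    """Removes unnecessary characters from a text."""
    # '[\*\*\*]' is a character class that matches a single '*', and '.' does not
    # cross line breaks, so this deletes each line up to its last '*'.
    # In practice it drops the '*** START OF ... ***' marker lines.
    txt = re.sub(r'.*[\*\*\*]','', txt)
    # Turn line feeds into spaces so words split across lines stay separate
    txt = re.sub(r'\n', ' ', txt)
    # Drop carriage returns
    txt = re.sub(r'\r','', txt)
    txt = txt.strip()
    return txt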
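
To see what each step of `clean_text` does, here is a small check on a made-up string that imitates the header layout shown above. The `sample` text is invented for illustration and the function is repeated only so the snippet runs on its own.

```python
import re

def clean_text(txt):
    """Same steps as the clean_text cell above, repeated so this runs standalone."""
    txt = re.sub(r'.*[\*\*\*]', '', txt)   # delete each line up to its last '*'
    txt = re.sub(r'\n', ' ', txt)          # line feeds become spaces
    txt = re.sub(r'\r', '', txt)           # carriage returns are dropped
    return txt.strip()

# Invented sample that mimics the Gutenberg header layout
sample = (
    "Language: English\r\n\r\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***\r\n\r\n"
    "CHAPTER I.\r\nAlice was beginning"
)

cleaned = clean_text(sample)
print(repr(cleaned))
```

The marker line disappears, but the text before it is kept, and every blank line leaves an extra space behind.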

In [13]:
DataSet['text'] = [mlkcor.raw(f) for f in mlkcor.fileids()]

In [14]:
DataSet.head(5)

| | filename | tokens | text |
|---|---|---|---|
| 0 | Alices_Adventures_in_Wonderland_by_Lewis_Carroll | 37861 | Project Gutenberg’s Alice’s Adventures in Wond... |
| 1 | Anna_Karenina_by_Leo_Tolstoy | 433609 | \r\nThe Project Gutenberg EBook of Anna Kareni... |
| 2 | David_Copperfield_by_Charles_Dickens | 448118 | The Project Gutenberg EBook of David Copperfie... |
| 3 | Don_Quixote_by_Miguel_de_Cervantes | 497782 | The Project Gutenberg EBook of Don Quixote, by... |
| 4 | Dracula_by_Bram_Stoker | 196321 | The Project Gutenberg EBook of Dracula, by Bra... |


In [15]:
DataSet['text_cleaned'] = DataSet.text.apply(clean_text)

In [16]:
DataSet.head(5)

| | filename | tokens | text | text_cleaned |
|---|---|---|---|---|
| 0 | Alices_Adventures_in_Wonderland_by_Lewis_Carroll | 37861 | Project Gutenberg’s Alice’s Adventures in Wond... | Project Gutenberg’s Alice’s Adventures in Wond... |
| 1 | Anna_Karenina_by_Leo_Tolstoy | 433609 | \r\nThe Project Gutenberg EBook of Anna Kareni... | The Project Gutenberg EBook of Anna Karenina, ... |
| 2 | David_Copperfield_by_Charles_Dickens | 448118 | The Project Gutenberg EBook of David Copperfie... | The Project Gutenberg EBook of David Copperfie... |
| 3 | Don_Quixote_by_Miguel_de_Cervantes | 497782 | The Project Gutenberg EBook of Don Quixote, by... | The Project Gutenberg EBook of Don Quixote, by... |
| 4 | Dracula_by_Bram_Stoker | 196321 | The Project Gutenberg EBook of Dracula, by Bra... | The Project Gutenberg EBook of Dracula, by Bra... |


In [17]:
DataSet['text'][0][:1500]

'Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Alice’s Adventures in Wonderland\r\n\r\nAuthor: Lewis Carroll\r\n\r\nPosting Date: June 25, 2008 [EBook #11]\r\nRelease Date: March, 1994\r\nLast Updated: October 6, 2016\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nALICE’S ADVENTURES IN WONDERLAND\r\n\r\nLewis Carroll\r\n\r\nTHE MILLENNIUM FULCRUM EDITION 3.0\r\n\r\n\r\n\r\n\r\nCHAPTER I. Down the Rabbit-Hole\r\n\r\nAlice was beginning to get very tired of sitting by her sister on the\r\nbank, and of having nothing to do: once or twice she had

In [18]:
DataSet['text_cleaned'][0][:1000]

'Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll  This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org   Title: Alice’s Adventures in Wonderland  Author: Lewis Carroll  Posting Date: June 25, 2008 [EBook #11] Release Date: March, 1994 Last Updated: October 6, 2016  Language: English  Character set encoding: UTF-8             ALICE’S ADVENTURES IN WONDERLAND  Lewis Carroll  THE MILLENNIUM FULCRUM EDITION 3.0     CHAPTER I. Down the Rabbit-Hole  Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’  So she was considering in her

## Summary 

While cleaning up the data, I found that there are a lot of '\n' and '\r' characters in the text files. If I replace them with the ' ' (space) character, the text ends up with long runs of spaces. However, if I remove them entirely (replace them with the empty string ''), some words become joined together, because in the original files some words are separated only by '\r\n' rather than by a ' ' (space) character.
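
One possible way around this trade-off, not used in the notebook above, is to collapse every run of whitespace into a single space in one step. A minimal sketch, where `normalize_whitespace` is a hypothetical helper name and the sample string is invented:

```python
import re

def normalize_whitespace(txt):
    """Collapse each run of whitespace (spaces, tabs, line breaks) into one space."""
    # \s+ matches carriage returns and line feeds too, so words separated only
    # by a line break stay apart, and blank lines do not pile up as extra spaces.
    return re.sub(r'\s+', ' ', txt).strip()

# Invented sample with Windows-style line breaks and blank lines
sample = "sitting by her sister on the\r\nbank, and of having nothing to do:\r\n\r\n\r\nonce or twice"
result = normalize_whitespace(sample)
print(result)
```

This loses paragraph boundaries, which is probably acceptable for word-level training data but would matter if the paragraph structure were needed later.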

In addition, at the top of each file there is a short description of the book. However, this description does not follow a standard format across files, so it is hard to write a single function that removes it.
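
One approach that might work is to cut on the marker lines rather than on the description itself. The sketch below assumes that every file contains a `*** START OF THIS PROJECT GUTENBERG EBOOK ... ***` line like the one in the Alice sample, and that a matching `*** END OF ...` line closes the book. Only the START line in the Alice file is confirmed above; the END line, the `THE` variant, and the name `strip_gutenberg_boilerplate` are assumptions for illustration.

```python
import re

# Marker patterns; the END marker and the "THE" wording are assumptions.
START_RE = re.compile(r'\*\*\* ?START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*')
END_RE = re.compile(r'\*\*\* ?END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*')

def strip_gutenberg_boilerplate(raw):
    """Keep only the text between the START and END markers, when present."""
    start = START_RE.search(raw)
    begin = start.end() if start else 0            # no START marker: keep from the top
    end = END_RE.search(raw, begin)
    finish = end.start() if end else len(raw)      # no END marker: keep to the bottom
    return raw[begin:finish].strip()

# Invented sample that mimics the layout of the Alice file
raw = (
    "Title: Alice's Adventures in Wonderland\r\n\r\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***\r\n\r\n"
    "CHAPTER I. Down the Rabbit-Hole\r\n\r\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***\r\n"
    "License text"
)
body = strip_gutenberg_boilerplate(raw)
print(body)
```

Files without the markers are returned unchanged apart from trimming, so they would still need to be checked by hand. Front matter that comes after the START marker, such as the title page and edition line, is not removed by this sketch.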