## Cleaning the Dataset

In [1]:
import pandas as pd
import numpy as np
import re
import string

import pickle

In [2]:
# Show up to 400 characters per cell so long summaries are readable
pd.set_option('display.max_colwidth', 400)

### Import Pickled Dataframe

In [3]:
# Unpickle the dataframe
with open('../data/dfraw_novelMAS.pkl','rb') as fin:
    df = pickle.load(fin)

In [4]:
df.head()

Unnamed: 0,title,summary
0,A Wizard of Earthsea,"A Wizard of Earthsea is a fantasy novel written by American author Ursula K. Le Guin and first published by the small press Parnassus in 1968. It is regarded as a classic of children's literature, and of fantasy, within which it was widely influential. The story is set in the fictional archipelago of Earthsea and centers around a young mage named Ged, born in a village on the island of Gont. H..."
1,Carmilla,"Carmilla is an 1872 Gothic novella by Irish author Joseph Sheridan Le Fanu and one of the early works of vampire fiction, predating Bram Stoker's Dracula (1897) by 26 years. First published as a serial in The Dark Blue (1871–72), the story is narrated by a young woman preyed upon by a female vampire named Carmilla, later revealed to be Mircalla, Countess Karnstein (Carmilla is an anagram of Mi..."
2,Don_Quixote,"The Ingenious Gentleman Don Quixote of La Mancha (Modern Spanish: El ingenioso hidalgo (in Part 2, cavallero) Don Quijote de la Mancha, pronounced [el iŋxeˈnjoso iˈðalɣo ðoŋ kiˈxote ðe la ˈmantʃa] (listen)), or just Don Quixote (, US: , Spanish: [ðoŋ kiˈxote] (listen)), is a Spanish novel by Miguel de Cervantes. Published in two parts, in 1605 and 1615, Don Quixote is the most influential work..."
3,Erewhon,"Erewhon: or, Over the Range () is a novel by Samuel Butler which was first published anonymously in 1872. The title is also the name of a country, supposedly discovered by the protagonist. In the novel, it is not revealed where Erewhon is, but it is clear that it is a fictional country. Butler meant the title to be understood as the word ""nowhere"" backwards even though the letters ""h"" and ""w"" ..."
4,Farmer_Giles_of_Ham,"Farmer Giles of Ham is a comic Medieval fable written by J. R. R. Tolkien in 1937 and published in 1949. The story describes the encounters between Farmer Giles and a wily dragon named Chrysophylax, and how Giles manages to use these to rise from humble beginnings to rival the king of the land. It is cheerfully anachronistic and light-hearted, set in Britain in an imaginary period of the Dark ..."


In [5]:
print('original dimension: ',df.shape)

original dimension:  (2291, 2)


#### Pre-process entries in the `summary` and `title` columns
- [x] Remove `List_of_...` articles, identified by `title`
- [x] Remove the first 24 entries (unrelated topic)
- [x] Drop any duplicates
- [ ] Remove additional textual inconsistencies
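
The unchecked item could be approached along the following lines. This is a minimal sketch, assuming the inconsistencies of interest are the leftovers visible in the `df.head()` preview: `(listen)` pronunciation markers and parentheses left empty after scraping. The helper name `tidy_summary` and the three patterns are illustrative and not part of the original notebook.

```python
import re

import pandas as pd

# Hypothetical patterns, chosen from the residue visible in the preview
LISTEN_MARKER = re.compile(r'\s*\(listen\)')
EMPTY_PARENS = re.compile(r'\s*\([\s,;:]*\)')
MULTI_SPACE = re.compile(r'\s{2,}')


def tidy_summary(text: str) -> str:
    """Remove '(listen)' markers and empty parentheses, then collapse whitespace."""
    text = LISTEN_MARKER.sub('', text)
    text = EMPTY_PARENS.sub('', text)
    return MULTI_SPACE.sub(' ', text).strip()


# Toy inputs modeled on the Erewhon and Don Quixote rows
sample = pd.Series([
    'Erewhon: or, Over the Range () is a novel by Samuel Butler.',
    'Don Quixote (listen) is a Spanish novel.',
])
cleaned = sample.apply(tidy_summary)
print(cleaned.tolist())
```

On the real dataframe this would be applied as `df['summary'].apply(tidy_summary)`, after checking a sample of rows to confirm the patterns do not remove legitimate text.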

In [6]:
# Inspect articles whose title begins with 'List_of_' (Wikipedia list pages)

df[df['title'].str.startswith('List_of_')].head()

Unnamed: 0,title,summary
2125,List_of_accolades_received_by_Saving_Mr._Banks,"Saving Mr. Banks is a 2013 American drama film directed by John Lee Hancock, produced by Walt Disney Pictures, and starring Emma Thompson as P.L. Travers and Tom Hanks as Walt Disney. The following is list of accolades received by the film."


In [7]:
# Remove 'List_of_' entries, which are not book recommendations.
# .copy() avoids SettingWithCopyWarning on later column assignments.
df = df[~df['title'].str.startswith('List_of_')].copy()
df.shape

(2290, 2)
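
Substring matching and prefix matching can select different rows, which matters when a novel's own title contains the word "List". The sketch below compares the two on toy titles; `Schindler's_List` is a hypothetical entry used only to show the difference and is not known to be in this dataset.

```python
import pandas as pd

# Toy titles; "Schindler's_List" is a hypothetical entry for illustration
titles = pd.Series([
    'List_of_accolades_received_by_Saving_Mr._Banks',
    "Schindler's_List",
    'Carmilla',
])

# Matches 'List' anywhere in the title
contains_mask = titles.str.contains('List', regex=False)
# Matches only titles that begin with 'List_of_'
prefix_mask = titles.str.startswith('List_of_')

print(contains_mask.tolist())  # [True, True, False]
print(prefix_mask.tolist())    # [True, False, False]
```

Here both filters give the same result on the real data, since the shape drops by exactly one row, but the prefix match is the safer choice if more titles are added later.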

In [8]:
# Drop duplicated rows (exact matches across all columns)

print(df.shape)
df = df.drop_duplicates()
print(df.shape)

(2290, 2)
(2290, 2)
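
The unchanged shape shows that no row is an exact copy of another. By default `drop_duplicates` compares every column, so two rows with the same title but slightly different summaries would both survive. A minimal sketch on toy data (the rows below are invented for illustration) shows how the `subset` parameter changes that:

```python
import pandas as pd

# Toy frame: the same title appears twice with different summary text
toy = pd.DataFrame({
    'title': ['Carmilla', 'Carmilla', 'Erewhon'],
    'summary': [
        'An 1872 Gothic novella.',
        'A Gothic novella from 1872.',
        'A novel by Samuel Butler.',
    ],
})

exact = toy.drop_duplicates()                               # compares all columns
by_title = toy.drop_duplicates(subset='title', keep='first')  # compares title only

print(exact.shape)     # (3, 2)
print(by_title.shape)  # (2, 2)
```

Whether title-level deduplication is appropriate here depends on whether distinct novels can share a title, so `df.duplicated(subset='title').sum()` is worth checking first.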


#### Clean `'summary'` column

In [9]:
# Replace newlines with a space
df['summary'] = df['summary'].str.replace('\n', ' ', regex=False)

# Remove everything from the first "References" onward (references, external links, sources, etc.).
# Note: the match is case-insensitive, so it also truncates at a mid-sentence "references".
regex_pattern = re.compile('References(.*)', flags=re.IGNORECASE)
# regex=True is required for a compiled pattern in pandas >= 2.0, where regex defaults to False
df['summary'] = df['summary'].str.replace(regex_pattern, '', regex=True)
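
The effect of the `References(.*)` pattern is easiest to see on toy strings. The sketch below applies the same case-insensitive pattern and then, as an assumption rather than the notebook's method, a stricter case-sensitive variant (`strict`) that only cuts at a capitalized standalone "References". Both sample strings are invented for illustration.

```python
import re

import pandas as pd

toy = pd.Series([
    'Plot summary of the novel. References 1. Some source.',
    'The novel references older myths. It was published later.',
])

# Same pattern as in the cleaning cell: case-insensitive, cuts at the first match
pattern = re.compile('References(.*)', flags=re.IGNORECASE)
truncated = toy.str.replace(pattern, '', regex=True)
print(truncated.tolist())
# ['Plot summary of the novel. ', 'The novel ']

# Hypothetical stricter variant: case-sensitive, whole word, also drops leading whitespace
strict = re.compile(r'\s*\bReferences\b.*', flags=re.DOTALL)
stricter = toy.str.replace(strict, '', regex=True)
print(stricter.tolist())
# ['Plot summary of the novel.', 'The novel references older myths. It was published later.']
```

The stricter variant still truncates any sentence that uses a capitalized "References", so spot-checking the shortest summaries after cleaning remains a useful safeguard.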

#### Clean `'title'` column

In [10]:
# Replace underscores with spaces so titles read naturally
df['title'] = df['title'].str.replace('_', ' ', regex=False)

In [11]:
print('dimension after processing: ', df.shape)

dimension after processing:  (2290, 2)


#### Write files

In [12]:
# Also write the processed data as CSV, as a human-readable backup
df.to_csv('../data/df_processed_novel_non_series.csv',index=False)

In [13]:
# Pickle the processed file
with open('../data/fclean_novel_non_series.pkl','wb') as fout:
    pickle.dump(df, fout)
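
Before moving on, it can be worth confirming that both files read back as expected. The following is a minimal sketch of such a round-trip check on a toy frame written to a temporary directory; the file names `toy.pkl` and `toy.csv` are placeholders, not the notebook's paths.

```python
import os
import pickle
import tempfile

import pandas as pd

toy = pd.DataFrame({'title': ['Carmilla'], 'summary': ['An 1872 Gothic novella.']})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy.pkl')  # placeholder file name
    csv_path = os.path.join(tmp, 'toy.csv')  # placeholder file name

    # Write both formats, mirroring the two cells above
    with open(pkl_path, 'wb') as fout:
        pickle.dump(toy, fout)
    toy.to_csv(csv_path, index=False)

    # Read both back while the temporary directory still exists
    with open(pkl_path, 'rb') as fin:
        from_pickle = pickle.load(fin)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.equals(toy))        # pickle preserves the frame exactly
print(from_csv.shape == toy.shape)    # CSV keeps the same rows and columns
```

The pickle preserves dtypes and the index, while the CSV is portable but re-infers types on load, which is one reason to keep both.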

Next step:
- Topic modeling of these titles in `Step3_Modeling.ipynb`

---