# Data Cleaning and Preparation

Having imported a set of book summaries and genres and limited the list to the 20 most popular genres, we can now clean and prepare the data itself. This is the start of our Natural Language Processing work, before we build our machine learning engine.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
from string import punctuation
import nltk  # the NLTK corpora and models used below must be fetched once via nltk.download()
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer

from IPython.display import clear_output
from datetime import datetime

In [2]:
# the data
path_to_data = "../Data/top_20_genres_and_summaries.csv"
df = pd.read_csv(path_to_data, index_col=0)
df.head()

Unnamed: 0,plot_summary,book_genres
0,"Old Major, the old boar on the Manor Farm, ca...",Children's literature
1,"Old Major, the old boar on the Manor Farm, ca...",Speculative fiction
2,"Old Major, the old boar on the Manor Farm, ca...",Fiction
3,"Alex, a teenager living in near-future Englan...",Science Fiction
4,"Alex, a teenager living in near-future Englan...",Speculative fiction


## Step 1: Cleaning

For our cleaning, we need to reduce the impact of word variance as much as possible before we build our bags of words. To do this, we will lowercase everything, strip punctuation, remove stopwords, then tag and lemmatize the words.

First, we define a function for cleaning text. It takes an entire summary and breaks it into sentences. For each sentence, it lowercases the text, removes all punctuation, drops stopwords, tags the parts of speech, and lemmatizes the individual words. Finally, it combines the lemmatized words into a new string representing the cleaned summary.

In [3]:
# before cleaning
df.iloc[0].plot_summary[:250]

" Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napole"

In [4]:
en_stopwords = set(stopwords.words('english')) # a set makes membership tests fast
wnl = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # default pos

def clean_text(text):
    """Return a cleaned, lemmatized copy of `text` as one space-separated string."""
    results = []
    # split text into sentences
    sentences = sent_tokenize(text)
    # clean and lemmatize each sentence
    for sentence in sentences:
        words = sentence.lower() # shift to lowercase
        words = re.sub(f"[{re.escape(punctuation)}]", "", words) # remove punctuation
        words = word_tokenize(words) # split sentence into individual words
        words = [word for word in words if word not in en_stopwords] # remove stopwords
        parts = pos_tag(words) # tag each remaining word with its part of speech
        for word, part in parts:
            lemma = wnl.lemmatize(word, pos=get_wordnet_pos(part))
            results.append(lemma)
    return " ".join(results)
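
To make the fallback behaviour of `get_wordnet_pos` concrete, here is a minimal, NLTK-free sketch of the same first-letter mapping. It assumes WordNet's single-letter POS codes (`'a'`, `'v'`, `'n'`, `'r'`, the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`). The names `TREEBANK_TO_WORDNET` and `treebank_to_wordnet` are hypothetical and not part of the notebook.

```python
# Hypothetical stand-in for get_wordnet_pos that needs no NLTK data.
# Assumption: WordNet POS codes are the single letters 'a', 'v', 'n', 'r'.
TREEBANK_TO_WORDNET = {
    "J": "a",  # JJ, JJR, JJS -> adjective
    "V": "v",  # VB, VBD, VBG, ... -> verb
    "N": "n",  # NN, NNS, NNP, ... -> noun
    "R": "r",  # RB, RBR, RBS -> adverb
}

def treebank_to_wordnet(treebank_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code by its first letter."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], default)

# Show the mapping for a few common tags, including two with no counterpart.
for tag in ["JJ", "VBD", "NNS", "RB", "DT", "IN"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as determiners (`DT`) or prepositions (`IN`), fall through to the noun default, just as in the function above.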

In [5]:
# after cleaning
clean_text(df.iloc[0].plot_summary)[:250]

'old major old boar manor farm call animal farm meeting compare human parasite teach animal revolutionary song beast england major dy two young pig snowball napoleon assume command turn dream philosophy animal revolt drive drunken irresponsible mr jon'

Now let's build a new dataframe with our cleaned summaries. Based on previous experience with this operation, we will process it in chunks so that our poor computer doesn't have to remember everything at once.

In [6]:
clean_df = pd.DataFrame()
increment = 100 # how many rows to process at once
start = datetime.now()
for i in range(0, len(df), increment):
    # clamp the end index so the final, shorter chunk is reported correctly
    print("Processing {}-{} of {}.\nTotal Time Elapsed: {} seconds"
          .format(i, min(i+increment, len(df))-1, len(df), (datetime.now()-start).total_seconds()))
    next_set = df.iloc[i:i+increment]['plot_summary'].apply(clean_text)
    clean_df = pd.concat((clean_df, next_set))
    clear_output()
    print("Last processed: {}\n".format(clean_df.iloc[-1:]))
print("Complete")

Last processed:                                                        0
26540  makar devushkin varvara dobroselova second cou...

Complete
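
The loop above walks the frame in slices of `increment` rows. The sketch below illustrates that slicing arithmetic in plain Python, using a hypothetical helper `chunk_bounds` and toy sizes rather than the real dataset. It shows that the last chunk is simply shorter when the row count is not a multiple of the chunk size. `df.iloc[i:i+increment]` behaves the same way, since slicing past the end is allowed.

```python
def chunk_bounds(n_rows, increment):
    """Yield (start, stop) pairs that cover range(n_rows) in steps of `increment`."""
    for start in range(0, n_rows, increment):
        # Clamp the end so the final, shorter chunk is reported accurately.
        yield start, min(start + increment, n_rows)

# Toy sizes, not the real dataset: 250 rows in chunks of 100.
for start, stop in chunk_bounds(250, 100):
    print("rows {}-{}".format(start, stop - 1))
```

Each `stop` is exclusive, matching Python slice semantics, so consecutive chunks neither overlap nor skip rows.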


In [7]:
clean_df

Unnamed: 0,0
0,old major old boar manor farm call animal farm...
1,old major old boar manor farm call animal farm...
2,old major old boar manor farm call animal farm...
3,alex teenager live nearfuture england lead gan...
4,alex teenager live nearfuture england lead gan...
...,...
26536,series follow character nick stone exmilitary ...
26537,series follow character nick stone exmilitary ...
26538,reader first meet rapp covert operation iran d...
26539,reader first meet rapp covert operation iran d...


Now we need to add the genres back, and the data will be ready to go.

In [8]:
clean_df['genre'] = df['book_genres']
clean_df.rename(columns={0: 'summary'}, inplace=True)
clean_df

Unnamed: 0,summary,genre
0,old major old boar manor farm call animal farm...,Children's literature
1,old major old boar manor farm call animal farm...,Speculative fiction
2,old major old boar manor farm call animal farm...,Fiction
3,alex teenager live nearfuture england lead gan...,Science Fiction
4,alex teenager live nearfuture england lead gan...,Speculative fiction
...,...,...
26536,series follow character nick stone exmilitary ...,Fiction
26537,series follow character nick stone exmilitary ...,Suspense
26538,reader first meet rapp covert operation iran d...,Thriller
26539,reader first meet rapp covert operation iran d...,Fiction


And with that, the data is clean and ready for bagging. There are probably some errors in individual lemmas or elsewhere (for example, "dies" was reduced to "dy" in the sample above), but the information should be sufficient to give us a valid machine learning model.
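
One cleaning artifact is visible in the tables above: because punctuation is deleted rather than replaced, hyphenated compounds are fused into single tokens such as `nearfuture` and `exmilitary`. The sketch below contrasts the deletion used in `clean_text` with one possible alternative that substitutes a space. The helper names are hypothetical, and whether split or fused tokens serve the model better is an assumption that would need testing.

```python
import re
from string import punctuation

# Same character class that clean_text builds inline.
PUNCT_PATTERN = re.compile(f"[{re.escape(punctuation)}]")

def strip_punctuation(text):
    """Delete punctuation outright, as clean_text does."""
    return PUNCT_PATTERN.sub("", text)

def space_punctuation(text):
    """Alternative: replace punctuation with a space, then collapse whitespace."""
    return " ".join(PUNCT_PATTERN.sub(" ", text).split())

sample = "a teenager living in near-future england"
print(strip_punctuation(sample))  # hyphenated compound is fused
print(space_punctuation(sample))  # hyphenated compound is split in two
```

The trade-off is that the space-based variant also splits contractions and possessives at the apostrophe, leaving stray fragments that a later stopword or length filter would have to handle.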

We'll revisit the cleaning and prep in the last section, where we will build a combined pipeline of the whole process. For now, though, let's pass this data to the next step for bagging and TF-IDF.

In [9]:
path_to_save = "../Data/cleaned_summaries_and_genres.csv"
clean_df.to_csv(path_to_save)