# Exploratory Data Analysis
* **Data Set**: ymaricar/[cmu-book-summary-dataset](https://www.kaggle.com/datasets/ymaricar/cmu-book-summary-dataset)
* **Project**: book recommendation system.

## Environment Setup

In [1]:
# THE REGS
import pandas as pd
import numpy as np
import kagglehub
import nltk
import string
import os
import time
import re

# DATA VIZ
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

my_template = go.layout.Template(
    layout=dict(
        font=dict(family="Arial", size=14),
        title=dict(font=dict(size=20)),
        paper_bgcolor="LightSteelBlue",
        plot_bgcolor="white",
        margin=dict(l=70, r=70, t=80, b=70),
        #xaxis=dict(showgrid=True, zeroline=False),
        yaxis=dict(showgrid=True, zeroline=False),
    )
)

pio.templates["custom_default"] = my_template
pio.templates.default = "custom_default"

## Data Setup

In [2]:
# DATA DOWNLOAD
path = kagglehub.dataset_download("ymaricar/cmu-book-summary-dataset")
print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/cmu-book-summary-dataset


In [3]:
# DATA SETUP

# Reformat data file so it fits into a pandas dataframe
def text_to_csv_pandas(input_file, output_file, column_names, delimiter=None):
    """
    Reads a delimited text file into a Pandas DataFrame and saves it as a CSV file.

    Args:
        input_file (str): The path to the input text file.
        output_file (str): The path to the output CSV file.
        column_names (list[str]): Column names to assign; the input file has no header row.
        delimiter (str, optional): The delimiter used in the text file. Defaults to None,
            which splits each line on runs of whitespace.
    """
    # Fall back to whitespace splitting when no delimiter is given
    sep = delimiter if delimiter is not None else r'\s+'
    df = pd.read_csv(input_file, sep=sep, names=column_names, header=None)
    df.to_csv(output_file, index=False, header=True)

# Columns in the data set
columns = ['Wikipedia article ID', 
           'Freebase ID', 
           'Book title', 
           'Author', 
           'Publication date', 
           'Book genres', 
           'Plot summary']

# Build the input path from the kagglehub download location rather than hard-coding it
text_to_csv_pandas(os.path.join(path, 'booksummaries.txt'), 'data.csv',
                   column_names=columns, delimiter='\t')

# Read back the same file that was just written
data = pd.read_csv('data.csv')
data.head()

Unnamed: 0,Wikipedia article ID,Freebase ID,Book title,Author,Publication date,Book genres,Plot summary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...


## Data Exploration

**What do I want to know that would be helpful?**
1. Find the number of NAs and where.
2. The distribution of the number of authors per book.
3. The range of publication dates (and how many date formats appear).
4. The distribution of the number of genres per book.
5. The distribution of the number of words per plot summary.
6. The distribution of the number of sentences per plot summary.

*Take note of any cleaning that should be done.*

### 1) Explore NA values.

In [4]:
print("Number of NA values for each feature:\n", data.isna().sum())

Number of NA values for each feature:
 Wikipedia article ID       0
Freebase ID                0
Book title                 0
Author                  2382
Publication date        5610
Book genres             3718
Plot summary               0
dtype: int64


### 2) Explore authors column

In [5]:
# Word count per Author cell (number of spaces + 1); missing authors stay NaN.
# This counts words in the name, so it is only a rough check for multi-author entries.
author_word_count = data['Author'].str.count(' ') + 1
fig = px.histogram(author_word_count, text_auto=True, title="Author column word count")
fig.show()

### 3) Exploring publication dates
**General formats to watch out for**:
* `YYYY-MM-DD`
* `YYYY-MM`
* `YYYY`
* NA (missing)

In [6]:
# Preview what we're working with...
print(data['Publication date'].loc[0:5])

0    1945-08-17
1          1962
2          1947
3           NaN
4           NaN
5    1929-01-29
Name: Publication date, dtype: object


In [7]:
# Get a closer look at how many rows have which date information
no_dates = data[data['Publication date'].isna()].index
print("Number of books with no publication date: ", len(no_dates))

# Anchor each pattern at both ends so the three formats cannot overlap;
# na=False treats missing dates as non-matches.
regex = r'^\d{4}$'
year_only_dates = data[data['Publication date'].str.contains(regex, na=False)].index
print("Number of books with only the publication year: ", len(year_only_dates))

regex = r'^\d{4}-\d{2}$'
half_dates = data[data['Publication date'].str.contains(regex, na=False)].index
print("Number of books with just the year and month: ", len(half_dates))

regex = r'^\d{4}-\d{2}-\d{2}$'
full_dates = data[data['Publication date'].str.contains(regex, na=False)].index
print("Number of books with the full publication date: ", len(full_dates))

Number of books with no publication date:  5610
Number of books with only the publication year:  6799
Number of books with just the year and month:  1479
Number of books with the full publication date:  2671
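
The four counts above come from four separate filters. As a cross-check, the same split can be sketched as one function that assigns each value exactly one label, so nothing can be counted twice or missed. This is a minimal sketch using only the standard library; `classify_date` and the `other` label are hypothetical names, not part of the notebook.

```python
import re

# Hypothetical helper (not part of the original notebook).
# Order does not matter here because fullmatch requires the whole string to match.
_DATE_PATTERNS = {
    "full": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "year_month": re.compile(r"\d{4}-\d{2}"),
    "year": re.compile(r"\d{4}"),
}

def classify_date(value):
    """Return 'missing', 'full', 'year_month', 'year', or 'other'."""
    if not isinstance(value, str) or not value.strip():
        return "missing"  # NaN (a float) and empty strings land here
    for label, pattern in _DATE_PATTERNS.items():
        if pattern.fullmatch(value.strip()):
            return label
    return "other"  # any format the three patterns do not cover

samples = ["1945-08-17", "1986-02", "1962", float("nan"), "circa 1900"]
labels = [classify_date(s) for s in samples]
print(labels)  # ['full', 'year_month', 'year', 'missing', 'other']
```

Applied to the column (for example with `data['Publication date'].apply(classify_date).value_counts()`), a non-zero `other` count would flag formats not listed above.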


In [8]:
# New column for publication year
data['Publication year'] = 0

# Fill in column with year values
data.loc[full_dates, 'Publication year'] = data.loc[full_dates, 'Publication date'].str.split("-").str[0].astype('int')
data.loc[half_dates, 'Publication year'] = data.loc[half_dates, 'Publication date'].str.split("-").str[0].astype('int')
data.loc[year_only_dates, 'Publication year'] = data.loc[year_only_dates, 'Publication date'].str.split("-").str[0].astype('int')

# Preview
data.head(n=3)

Unnamed: 0,Wikipedia article ID,Freebase ID,Book title,Author,Publication date,Book genres,Plot summary,Publication year
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca...",1945
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan...",1962
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...,1947
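
The cell above uses `0` as a stand-in for a missing year, which is why later plots have to filter on `Publication year > 0`. A minimal alternative sketch, assuming the three date formats listed earlier are the only ones present, returns `None` for missing or unrecognized values instead. `extract_year` is a hypothetical helper, not part of the notebook.

```python
import re

def extract_year(value):
    """Return the leading four-digit year as an int, or None if absent."""
    if not isinstance(value, str):
        return None  # NaN is a float, so missing dates fall through here
    # Accept YYYY, YYYY-MM, or YYYY-MM-DD and capture only the year
    match = re.match(r"(\d{4})(?:-\d{2}){0,2}$", value.strip())
    return int(match.group(1)) if match else None

years = [extract_year(v) for v in ["1945-08-17", "1986-02", "1962", float("nan")]]
print(years)  # [1945, 1986, 1962, None]
```

With this approach missing years stay missing rather than becoming `0`, so summary statistics are not skewed by the sentinel.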


In [9]:
# Data set book publication year distribution
fig = px.violin(data[data['Publication year'] > 0], x="Publication year", points='all')
fig.show()

### 4) Exploring book genres 

In [10]:
# FIX GENRE COLUMN

# Example case of what we need to fix in the genres column
text = data['Book genres'].iloc[0]
print("What we need to fix in the genre column:\n", text)

# Test the regex
matches = re.findall(r'":\s*"([^"]+)"', text)
print("\nTesting out the regex expression to make sure it works properly:\n", matches)

# \\u00e0 is an escaped Unicode code point (à), so the escape sequences must be decoded as well.
# Note: unicode_escape assumes the rest of the text is plain ASCII.
decoded_genres = [bytes(word, "utf-8").decode("unicode_escape") for word in matches]
print("\nPrint after handling utf-8 symbols:\n", decoded_genres) 

What we need to fix in the genre column:
 {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children's literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"}

Testing out the regex expression to make sure it works properly:
 ['Roman \\u00e0 clef', 'Satire', "Children's literature", 'Speculative fiction', 'Fiction']

Print after handling utf-8 symbols:
 ['Roman à clef', 'Satire', "Children's literature", 'Speculative fiction', 'Fiction']
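
The sample cell printed above is a valid JSON object, so a possible alternative to the regex plus `unicode_escape` approach is to parse it with the standard `json` module, which decodes `\uXXXX` escapes on its own. The sketch below assumes every non-missing cell is valid JSON like the sample; `parse_genres` is a hypothetical helper, not part of the notebook.

```python
import json

def parse_genres(raw):
    """Return the genre names from one raw 'Book genres' cell."""
    if not isinstance(raw, str):
        return []  # missing cells are NaN (a float), so they yield no genres
    # Keys are Freebase IDs, values are genre names; dict order is preserved
    return list(json.loads(raw).values())

sample = '{"/m/016lj8": "Roman \\u00e0 clef", "/m/06nbt": "Satire"}'
print(parse_genres(sample))        # ['Roman à clef', 'Satire']
print(parse_genres(float("nan")))  # []

# Why the ASCII caveat matters: unicode_escape mis-decodes text that already
# contains a real non-ASCII character, so the result is no longer 'à'.
mangled = bytes("à", "utf-8").decode("unicode_escape")
print(mangled == "à")  # False
```

Parsing the JSON also avoids a regex edge case: `[^"]+` would stop early on a genre name containing an escaped quote.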


In [11]:
# Perform these two text preprocessing steps for entire column (Book genres)
data['Book genres'] = data['Book genres'].apply(lambda row: re.findall(r'":\s*"([^"]+)"', str(row)))
data['Book genres'] = data['Book genres'].apply(lambda row: [bytes(word, "utf-8").decode("unicode_escape") for word in row])

# Preview
data.head(n=3)

Unnamed: 0,Wikipedia article ID,Freebase ID,Book title,Author,Publication date,Book genres,Plot summary,Publication year
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...",1945
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...",1962
2,986,/m/0ldx,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,1947


In [12]:
# How many unique genres are there?
book_genres = data['Book genres'].tolist()
genres = list(set(sum(book_genres, [])))
print("Total number of genres in this data set: ", len(genres))

Total number of genres in this data set:  227


In [13]:
pd.DataFrame(list(sum(book_genres,[])), columns=['Genre'])

Unnamed: 0,Genre
0,Roman à clef
1,Satire
2,Children's literature
3,Speculative fiction
4,Fiction
...,...
29999,Thriller
30000,Fiction
30001,Autobiography
30002,Epistolary novel


In [14]:
genre_count = pd.DataFrame(list(sum(book_genres,[])), columns=['Genre']).value_counts()
genre_count.head()

Genre              
Fiction                4747
Speculative fiction    4314
Science Fiction        2870
Novel                  2463
Fantasy                2413
Name: count, dtype: int64
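
Flattening with `sum(book_genres, [])` builds a new list at every step, which gets slow as the number of rows grows. A minimal standard-library sketch of the same count uses `itertools.chain` and `collections.Counter`. The toy list below stands in for `data['Book genres'].tolist()` and is not real data.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for data['Book genres'].tolist(); one list of genres per book
toy_book_genres = [["Fiction", "Satire"], [], ["Fiction", "Novel"]]

# chain.from_iterable flattens lazily, without building intermediate lists
genre_counts = Counter(chain.from_iterable(toy_book_genres))

print(len(genre_counts))            # 3 unique genres
print(genre_counts.most_common(1))  # [('Fiction', 2)]
```

`len(genre_counts)` corresponds to the unique-genre total, and `most_common(n)` to the head of the frequency table.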

In [15]:
# Plot the frequency of the 30 most common genres
fig = px.bar(pd.DataFrame(genre_count).reset_index()[0:30], y='count', x='Genre', title="Book genre frequency")
fig.update_layout(
    xaxis=dict(
        tickangle=40  # or 90 for vertical
    ),
    margin=dict(b=120)
)

fig.show()

In [16]:
# get the frequency for number of genres per book
fig = px.histogram(data['Book genres'].apply(len), text_auto=True, title="Number of genres per book")
fig.update_layout(
    xaxis_title="Number of genres associated with book",
    yaxis_title="Number of books",
)
fig.show()

### 5) Number of terms per plot summary

In [17]:
# Token count per summary; word_tokenize needs NLTK's Punkt tokenizer data to be available
words_per_summary = data['Plot summary'].apply(lambda x: len(nltk.word_tokenize(x)))

fig = px.histogram(words_per_summary)
fig.update_layout(
    xaxis_title="Number of words in book's plot summary",
)
fig.show()

In [18]:
# get a closer look at books with fewer than 5 terms in summary
data.loc[words_per_summary[words_per_summary < 5].index]

Unnamed: 0,Wikipedia article ID,Freebase ID,Book title,Author,Publication date,Book genres,Plot summary,Publication year
2045,1078455,/m/0442g7,The Kennel Murder Case,S. S. Van Dine,,"[Mystery, Fiction, Suspense]",~Plot outline description,0
3879,2664992,/m/07wd73,Slavers,Chris Pramas,2000,[Role-playing game],==Publication histor,2000
5271,4118477,/m/0bk2zx,Golem in the Gears,Piers Anthony,1986-02,"[Science Fiction, Speculative fiction, Fantasy...",pl:Zakochany golem,1986
5595,4507859,/m/0c63m2,The Adventures of Super Diaper Baby,Dav Pilkey,2002,[Children's literature],=== Plot summary ===,2002
5693,4597024,/m/0cbswx,The Deathlord of Ixia,John Grant,1992,"[Gamebook, Speculative fiction, Children's lit...",==Receptio,1992
5879,4817875,/m/0cph_1,The Caverns of Kalte,Joe Dever,1984,"[Gamebook, Speculative fiction, Fantasy, Child...",==Receptio,1984
5972,4908574,/m/0ctp7r,The Eyes of Darkness,Dean Koontz,1981,"[Speculative fiction, Horror, Fiction, Romance...",==Character,1981
6335,5264015,/m/0dbjxj,Created By,Richard Christian Matheson,1993,"[Speculative fiction, Horror, Fiction]",~Plot outline description,1993
6622,5466643,/m/0dn44d,The Saint in Pursuit,Leslie Charteris,1970,[Mystery],To be added.,1970
6629,5471038,/m/0dnbrx,The Saint and the Hapsburg Necklace,Leslie Charteris,1976,"[Mystery, Suspense]",To be added.,1976
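
The rows above are leftover wiki markup or stubs rather than real summaries. A minimal sketch of a filter for them is shown below. It assumes plain whitespace splitting instead of `nltk.word_tokenize`, so its counts run slightly lower than the notebook's (punctuation is not split into separate tokens); `is_placeholder_summary` and `min_tokens` are hypothetical names.

```python
def is_placeholder_summary(text, min_tokens=5):
    """Flag summaries with fewer than min_tokens whitespace-separated tokens."""
    return len(text.split()) < min_tokens

# Examples taken from the rows listed above, plus one ordinary sentence
examples = [
    "To be added.",
    "=== Plot summary ===",
    "==Receptio",
    "Old Major, the old boar on the Manor Farm, calls the animals together.",
]
flags = [is_placeholder_summary(t) for t in examples]
print(flags)  # [True, True, True, False]
```

In the notebook, the equivalent filter would keep only rows where `words_per_summary >= 5`.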


### 6) Number of sentences per plot summary

In [19]:
from nltk.tokenize import sent_tokenize

sentences_per_summary = data['Plot summary'].apply(lambda x: len(sent_tokenize(x)))

fig = px.histogram(sentences_per_summary)
fig.update_layout(
    xaxis_title="Number of sentences in book's plot summary",
)
fig.show()

### 7) Average number of words per sentence

In [20]:
print(words_per_summary.sum()/sentences_per_summary.sum())

24.626807723849094
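
The figure above divides total words by total sentences across the whole data set, so long summaries weigh more than short ones. That is not the same as averaging each summary's own words-per-sentence ratio. A small worked example with made-up numbers shows the difference:

```python
# Made-up counts for two summaries (not taken from the data set)
words = [10, 100]
sentences = [1, 4]

# Ratio of totals, as computed in the notebook: 110 / 5
ratio_of_totals = sum(words) / sum(sentences)

# Mean of per-summary ratios: (10/1 + 100/4) / 2
mean_of_ratios = sum(w / s for w, s in zip(words, sentences)) / len(words)

print(ratio_of_totals)  # 22.0
print(mean_of_ratios)   # 17.5
```

Either is a reasonable summary statistic; the ratio of totals also avoids dividing by zero for any summary where the tokenizer finds no sentences.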


## Text Cleaning Notes
* Remove the ID columns (not needed; dropping them saves RAM).
* Decode Unicode escape sequences (e.g. `\u00e0`) in the genres column and store each book's genres as a list.
* Create a new column for publication year (could be useful for recommender filtering later).
* Drop rows that don't have a proper description in the 'Plot summary' column (e.g. fewer than 5 terms).