# Compiling Data
Ashley Feiler, aef56@pitt.edu

Continuing from Data Exploration

## Imports 

In [1]:
import pickle
import glob
import re
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  

## Sharable Data Samples
Because I can't share all of the data I'm using due to licensing, I plan on sharing samples. Since my computer could only handle loading so much data at a time, I used a separate Jupyter Notebook for each genre: I would open it, merge the necessary data, pickle a smaller sample file, and then close it to free memory. In this file, I will unpickle and combine all of those samples to share.

### Original Data

In [2]:
directory = '/Users/ashleyfeiler/Documents/data_science/Goodreads-Genre-Reviews-Analysis/data/'

share_files = glob.glob(directory + 'genre_share/*.pkl') #Get filepath of all pickled files
print(len(share_files)) #Confirm 8 files for 8 genres
share_files[0]

8


'/Users/ashleyfeiler/Documents/data_science/Goodreads-Genre-Reviews-Analysis/data/genre_share/fantasy_share.pkl'

In [3]:
share_df = pd.DataFrame() #Create empty DataFrame to append each genre's sample to

for pkl in share_files: #For each file path, load the file and add it to the shared DataFrame
    with open(pkl, 'rb') as f: #The with block closes the file automatically
        df = pickle.load(f)
    share_df = pd.concat([share_df, df])
    
share_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 0 to 4
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       40 non-null     object
 1   book_id       40 non-null     int64 
 2   review_id     40 non-null     object
 3   rating        40 non-null     int64 
 4   review_text   40 non-null     object
 5   date_added    40 non-null     object
 6   date_updated  40 non-null     object
 7   read_at       40 non-null     object
 8   started_at    40 non-null     object
 9   n_votes       40 non-null     int64 
 10  n_comments    40 non-null     int64 
dtypes: int64(4), object(7)
memory usage: 3.8+ KB


This confirms that all together, there are 40 review samples, just like there were supposed to be (5 from each of the 8 genres). To keep the sample as minimal as possible and stay within Fair Use guidelines, I will take a sample of only 5 of these 40 reviews to save as a CSV and share in my public repository.

(The code below that writes the CSV file has been commented out to prevent the CSV file from being overwritten every time this notebook is run)

In [4]:
#genre_samples = share_df.sample(5)
#genre_samples

In [5]:
#genre_samples.to_csv('data_samples/Genre_Samples.csv')
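
One alternative to commenting these cells out is to guard the write so the CSV is only created when it does not exist yet, and to fix the random seed so the same rows are drawn on every run. This is a minimal sketch, not what the notebook does: `save_sample_once` is a hypothetical helper, the toy DataFrame stands in for `share_df`, and the seed of 42 is arbitrary.

```python
import os
import tempfile

import pandas as pd


def save_sample_once(df, path, n=5, seed=42):
    """Write an n-row sample to `path` only if the file is not already there."""
    if os.path.exists(path):
        return False  # keep the existing file untouched
    df.sample(n, random_state=seed).to_csv(path)
    return True


# Toy stand-in for share_df (hypothetical values, not the real reviews)
toy_df = pd.DataFrame({'rating': range(40), 'n_votes': range(40)})

out_path = os.path.join(tempfile.mkdtemp(), 'Genre_Samples.csv')
first_write = save_sample_once(toy_df, out_path)   # file is created
second_write = save_sample_once(toy_df, out_path)  # file already exists, so it is skipped
print(first_write, second_write)
```

With a guard like this, the cell can stay active without the shared sample changing between runs.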

### Condensed Data
That first process was to show a sample of what the original UCSD data looked like, but I also want to show the final format of the data that I compiled and will be working with for my analysis. Below is the same process as above, but with the final DataFrames I created for each genre (ranging from roughly 2000 to 4300 reviews per genre).

In [6]:
genre_files = glob.glob(directory + 'genre_pkls/*.pkl')
print(len(genre_files))
genre_files[0]

8


'/Users/ashleyfeiler/Documents/data_science/Goodreads-Genre-Reviews-Analysis/data/genre_pkls/children_short.pkl'

In [7]:
total_df = pd.DataFrame()

for pkl in genre_files: #Same loading loop as above, for the final genre DataFrames
    with open(pkl, 'rb') as f:
        df = pickle.load(f)
    total_df = pd.concat([total_df, df])
    
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28274 entries, 0 to 4998
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Text           28274 non-null  object 
 1   Rating         28274 non-null  int64  
 2   Title          28274 non-null  object 
 3   Author         28274 non-null  object 
 4   Category       28274 non-null  object 
 5   Genres         28274 non-null  object 
 6   Language       28274 non-null  object 
 7   Pages          28274 non-null  object 
 8   Pub_Year       28274 non-null  object 
 9   Avg_Rating     28274 non-null  float64
 10  Ratings_Count  28274 non-null  int64  
 11  User_ID        28274 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.8+ MB


Combining the samples from all 8 genres resulted in a total of 28274 reviews, which is a pretty decent amount of data to work with! Further down I will get into some more exploration of the makeup of this final data set, but for now I want to save a small sample of this DataFrame to share.

In [8]:
#total_sample = total_df.sample(5)
#total_sample

In [9]:
#total_sample.to_csv('data_samples/FinalDF_Sample.csv')

## Data Makeup

At first I thought I might still need the user IDs, but given all the columns I plan on adding for linguistic features, I don't think those IDs will be necessary, so my first order of business is to remove that column.

In [10]:
total_df = total_df[['Text', 'Rating', 'Title', 'Author', 'Category', 'Genres', 'Language', 'Pages', 'Pub_Year', 'Avg_Rating', 'Ratings_Count']]

In [11]:
total_df.columns

Index(['Text', 'Rating', 'Title', 'Author', 'Category', 'Genres', 'Language',
       'Pages', 'Pub_Year', 'Avg_Rating', 'Ratings_Count'],
      dtype='object')

In [12]:
total_df = total_df.reset_index(drop=True)
#Some reviews have the same indexes because they came from separate DataFrames, so this resets the index.
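
To illustrate why the reset is needed: concatenating DataFrames keeps each one's original row labels, so the labels repeat. Here is a small sketch with two toy frames (hypothetical data). Passing `ignore_index=True` to `pd.concat` is an alternative that renumbers the rows during concatenation.

```python
import pandas as pd

# Two toy per-genre frames, each indexed from 0 (hypothetical data)
a = pd.DataFrame({'Text': ['x', 'y']})
b = pd.DataFrame({'Text': ['z', 'w']})

stacked = pd.concat([a, b])
print(stacked.index.tolist())      # labels repeat: [0, 1, 0, 1]

renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]

# Same result in one step
direct = pd.concat([a, b], ignore_index=True)
print(direct.index.tolist())       # [0, 1, 2, 3]
```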

In [13]:
total_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28274 entries, 0 to 28273
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Text           28274 non-null  object 
 1   Rating         28274 non-null  int64  
 2   Title          28274 non-null  object 
 3   Author         28274 non-null  object 
 4   Category       28274 non-null  object 
 5   Genres         28274 non-null  object 
 6   Language       28274 non-null  object 
 7   Pages          28274 non-null  object 
 8   Pub_Year       28274 non-null  object 
 9   Avg_Rating     28274 non-null  float64
 10  Ratings_Count  28274 non-null  int64  
dtypes: float64(1), int64(2), object(8)
memory usage: 2.4+ MB


In [14]:
total_df.shape

(28274, 11)

I am working with a DataFrame of 28274 reviews and 11 total columns, though this will expand as I add more linguistic features.

Now that that's done, let's take a look at some of the counts of different categories. What makeup of data am I finally working with?

In [15]:
total_df.Category.value_counts()

ya                        4334
fantasy_paranormal        4323
romance                   3918
mystery_thriller_crime    3789
comics_graphic            3505
history_bio               3362
children                  2858
poetry                    2185
Name: Category, dtype: int64

Clearly there is a pretty wide range in the number of reviews left from each genre after some of the data cleaning. Each genre started out with 5000 reviews, but some were eliminated because they were non-English or empty, which disproportionately affected different genres. This will definitely be something to keep in mind during analysis.
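
To put numbers on how unevenly the cleaning affected each genre, the counts above can be divided by the 5000 reviews each genre started with. A quick sketch using the counts copied from the output above:

```python
# Reviews remaining per genre, copied from the value_counts() output above
remaining = {
    'ya': 4334,
    'fantasy_paranormal': 4323,
    'romance': 3918,
    'mystery_thriller_crime': 3789,
    'comics_graphic': 3505,
    'history_bio': 3362,
    'children': 2858,
    'poetry': 2185,
}
STARTING_COUNT = 5000  # each genre started with 5000 reviews

# Share of each genre's original reviews that survived cleaning
retention = {genre: count / STARTING_COUNT for genre, count in remaining.items()}
for genre, share in retention.items():
    print(f'{genre:<24}{share:.1%}')
```

Young Adult keeps about 87% of its original reviews while Poetry keeps about 44%, so the genres are far from equally represented.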

In [16]:
total_df.Rating.value_counts()

5    9941
4    9593
3    5356
2    1807
0     894
1     683
Name: Rating, dtype: int64

5- and 4-star reviews are by far the most common, followed by 3-star reviews. 2-star reviews are much less frequent, and 0- and 1-star reviews even less. It makes sense that the higher ratings are more common as people are more likely to write a review about a book they like rather than a book they are indifferent about, but I'm a little surprised to see so few low ratings. In my experience, people tend to be pretty passionate about books they hate as well. If genre turns out to not be a significant factor changing linguistic features, it could be interesting to see if rating, which theoretically correlates to sentiment, has any effect on the language used in the review.

In [17]:
len(total_df.Title.unique())

17774

In [18]:
total_df.Title.value_counts()[:15]

Milk and Honey                                                                          113
Hamlet                                                                                   50
The Giver (The Giver, #1)                                                                50
The Hunger Games (The Hunger Games, #1)                                                  49
Cinder (The Lunar Chronicles, #1)                                                        49
The Girl on the Train                                                                    47
Brown Girl Dreaming                                                                      44
Wonder (Wonder #1)                                                                       43
Miss Peregrine’s Home for Peculiar Children (Miss Peregrine’s Peculiar Children, #1)     42
Divergent (Divergent, #1)                                                                40
Where the Sidewalk Ends                                                         

Out of 28274 reviews, there are 17774 unique book titles, meaning 10500 reviews are second or later reviews of a title already in the data (a suspiciously even number), but still the majority of books are only reviewed once. Milk and Honey, a very popular book of poetry, is the most reviewed book at 113 reviews, and I recognize a lot of the other most reviewed books as Young Adult and Fantasy novels. Those were the 2 genres with the most reviews that made the final cut, so it's not surprising there are more repeat reviews for these books.
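
The idea that most books are reviewed only once could be checked directly from the title counts. Here is a sketch on a toy Series (hypothetical titles). In the notebook the input would be `total_df.Title`.

```python
import pandas as pd

# Toy stand-in for total_df.Title (hypothetical titles)
titles = pd.Series(['A', 'A', 'A', 'B', 'C', 'D', 'D'])

counts = titles.value_counts()

n_reviews = len(titles)                      # total reviews
n_titles = counts.size                       # distinct titles
extra_reviews = n_reviews - n_titles         # reviews beyond the first for each title
share_reviewed_once = (counts == 1).mean()   # fraction of titles with exactly one review

print(n_reviews, n_titles, extra_reviews, share_reviewed_once)
```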

In [19]:
len(total_df.Author.unique())

9688

In [20]:
total_df.Author.value_counts()[:10]

Cassandra Clare     250
Brian K. Vaughan    157
Neil Gaiman         148
Marissa Meyer       137
Stephenie Meyer     130
Rupi Kaur           127
Sarah J. Maas       123
Stephen King        123
Rick Riordan        115
Suzanne Collins     106
Name: Author, dtype: int64

Out of 28274 reviews, there are only 9688 authors that are reviewed, which is a smaller number, but makes sense seeing as authors may have written many different books. 

In [21]:
total_df.Language.value_counts()

eng      22737
en-US     4268
en-GB     1026
en-CA      243
Name: Language, dtype: int64

In [22]:
total_df.describe()

Unnamed: 0,Rating,Avg_Rating,Ratings_Count
count,28274.0,28274.0,28274.0
mean,3.835396,3.990835,88025.85
std,1.22186,0.292023,350648.7
min,0.0,1.98,0.0
25%,3.0,3.81,536.0
50%,4.0,4.01,4224.0
75%,5.0,4.19,30524.5
max,5.0,5.0,4899965.0


In [23]:
total_df.groupby('Category').describe()

Unnamed: 0_level_0,Rating,Rating,Rating,Rating,Rating,Rating,Rating,Rating,Avg_Rating,Avg_Rating,Avg_Rating,Avg_Rating,Avg_Rating,Ratings_Count,Ratings_Count,Ratings_Count,Ratings_Count,Ratings_Count,Ratings_Count,Ratings_Count,Ratings_Count
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
children,2858.0,3.904829,1.203209,0.0,3.0,4.0,5.0,5.0,2858.0,4.037768,...,4.21,5.0,2858.0,93980.546186,275095.225792,1.0,330.0,3001.5,31387.0,1876252.0
comics_graphic,3505.0,3.811412,1.153754,0.0,3.0,4.0,5.0,5.0,3505.0,4.02168,...,4.24,4.83,3505.0,16528.807703,41517.096041,1.0,479.0,2705.0,12834.0,406669.0
fantasy_paranormal,4323.0,3.8161,1.246819,0.0,3.0,4.0,5.0,5.0,4323.0,4.014464,...,4.23,5.0,4323.0,108879.451076,375846.839796,1.0,838.5,7755.0,55039.0,4765497.0
history_bio,3362.0,3.851279,1.21574,0.0,3.0,4.0,5.0,5.0,3362.0,3.943968,...,4.14,5.0,3362.0,96545.556217,342835.065977,0.0,592.0,4165.0,30058.75,3255518.0
mystery_thriller_crime,3789.0,3.727105,1.178367,0.0,3.0,4.0,5.0,5.0,3789.0,3.88413,...,4.06,4.88,3789.0,59168.214568,210601.102517,1.0,522.0,3984.0,22034.0,2046499.0
poetry,2185.0,3.897941,1.276413,0.0,3.0,4.0,5.0,5.0,2185.0,4.096256,...,4.26,5.0,2185.0,44478.507551,151734.841123,0.0,148.0,1433.0,15270.0,1029527.0
romance,3918.0,3.943849,1.212399,0.0,3.0,4.0,5.0,5.0,3918.0,4.000403,...,4.2,4.91,3918.0,32528.685299,143318.96349,1.0,333.0,1878.5,10393.0,2078406.0
ya,4334.0,3.781034,1.272587,0.0,3.0,4.0,5.0,5.0,4334.0,3.979213,...,4.17,5.0,4334.0,211864.244347,652314.359248,1.0,2863.5,19151.0,106182.0,4899965.0


The mean ratings for each genre are pretty close together, but Romance has the highest average rating of 3.94 and Mystery/Thriller/Crime has the lowest average rating of 3.73. It's also interesting to compare the average ratings from these reviews to the Avg_Rating column statistics, which is the average rating of the book being reviewed. In general, the reviews in my sample rate a book slightly lower than its average rating from all reviews (an overall mean of 3.84 versus 3.99), which is just an interesting phenomenon. Finally, the ratings count shows the number of ratings each book had (again, not just in the UCSD data), so it appears that the Young Adult books represented in the UCSD corpus have by far the most ratings on Goodreads (a mean of 211864) and Comics/Graphic books have the fewest (16529). It could be interesting to look at genres with a lower ratings count but higher reviews count, which would suggest a more niche genre that appeals to a more specific type of reader.
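
The gap between the ratings in this sample and each book's overall Goodreads average could be measured per genre by subtracting the two columns before grouping. A sketch on toy rows (hypothetical values) that reuses the column names from `total_df`:

```python
import pandas as pd

# Toy rows with the same column names as total_df (hypothetical values)
toy = pd.DataFrame({
    'Category': ['romance', 'romance', 'poetry', 'poetry'],
    'Rating': [4, 5, 3, 4],
    'Avg_Rating': [4.0, 4.5, 4.0, 4.0],
})

# Negative values mean the sampled review rated the book below its overall average
toy['Rating_Diff'] = toy['Rating'] - toy['Avg_Rating']
diff_by_genre = toy.groupby('Category')['Rating_Diff'].mean()
print(diff_by_genre)
```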

# Analysis
Now that I've FINALLY got my final data set and a sense of its size and makeup, it's time to start analysis! Since I'm looking at overall linguistic differences between reviews for different genres, I want to include as many different linguistic features as I can think of. In this next section, I will be adding those features as additional columns to the DataFrame so I can then analyze their differences between genre categories.

In [24]:
%pprint

Pretty printing has been turned OFF


### Tokens

For each feature, I'm going to test first on a small subset of the reviews before applying the changes to the full DataFrame.

In [25]:
test_df = total_df.head()
test_df['Toks'] = test_df.Text.map(nltk.word_tokenize)
test_df['Toks']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Toks'] = test_df.Text.map(nltk.word_tokenize)


0                                                  [O]
1        [my, pick, for, the, caldecott, so, far, ...]
2    [This, time, Dan, and, Amy, go, to, the, Baham...
3    [Loved, the, excerpts, where, Julia, ,, the, m...
4    [I, liked, the, illustrations, ,, which, are, ...
Name: Toks, dtype: object
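
The warning above appears because `total_df.head()` returns a slice of `total_df`, and pandas cannot tell whether the new column is meant for the slice or for the original. Taking an explicit copy is one common way to avoid it. This is a minimal sketch on toy data (hypothetical text), with `str.split` standing in for `nltk.word_tokenize` so the example has no NLTK dependency:

```python
import pandas as pd

# Toy stand-in for total_df (hypothetical reviews)
toy_df = pd.DataFrame({'Text': ['Loved it', 'Not for me at all', 'Cute']})

# .copy() makes test_df an independent DataFrame, so adding a column is unambiguous
test_df = toy_df.head(2).copy()
test_df['Toks'] = test_df['Text'].map(str.split)  # whitespace split as a placeholder tokenizer

print(test_df['Toks'].tolist())
print('Toks' in toy_df.columns)  # False: the original frame is left unchanged
```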

In [26]:
total_df['Toks'] = total_df.Text.map(nltk.word_tokenize)
total_df.head()

Unnamed: 0,Text,Rating,Title,Author,Category,Genres,Language,Pages,Pub_Year,Avg_Rating,Ratings_Count,Toks
0,O,0,Xander's Panda Party,Linda Sue Park,children,"{'children': 143, 'fiction': 15, 'poetry': 9, ...",eng,40,2013,4.05,1163,[O]
1,my pick for the caldecott so far...,5,Xander's Panda Party,Linda Sue Park,children,"{'children': 143, 'fiction': 15, 'poetry': 9, ...",eng,40,2013,4.05,1163,"[my, pick, for, the, caldecott, so, far, ...]"
2,This time Dan and Amy go to the Bahamas and Ja...,4,"Storm Warning (The 39 Clues, #9)",Linda Sue Park,children,"{'mystery, thriller, crime': 188, 'young-adult...",eng,190,2010,3.98,39904,"[This, time, Dan, and, Amy, go, to, the, Baham..."
3,"Loved the excerpts where Julia, the main chara...",5,Project Mulberry,Linda Sue Park,children,"{'fiction': 122, 'children': 111, 'young-adult...",eng,240,2007,3.67,2929,"[Loved, the, excerpts, where, Julia, ,, the, m..."
4,"I liked the illustrations, which are are - wel...",4,A Moon of My Own,Jennifer Rustgi,children,"{'children': 13, 'young-adult': 2, 'non-fictio...",eng,32,2016,3.78,84,"[I, liked, the, illustrations, ,, which, are, ..."


I want to add a lowercased tokenized column which may help keep things consistent for future calculations, but I want to make that in addition to the first 'Toks' column because capitalization could be an interesting feature to look at as well.

Also, I am still trying to avoid flashing the head of the DataFrame too frequently to avoid sharing more data than I should.

In [27]:
test_df['Toks_Lower'] = test_df.Toks.map(lambda x: [word.lower() for word in x])
test_df['Toks_Lower']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Toks_Lower'] = test_df.Toks.map(lambda x: [word.lower() for word in x])


0                                                  [o]
1        [my, pick, for, the, caldecott, so, far, ...]
2    [this, time, dan, and, amy, go, to, the, baham...
3    [loved, the, excerpts, where, julia, ,, the, m...
4    [i, liked, the, illustrations, ,, which, are, ...
Name: Toks_Lower, dtype: object

In [28]:
total_df['Toks_Lower'] = total_df.Toks.map(lambda x: [word.lower() for word in x])

In [29]:
total_df.Toks_Lower.iloc[2]

['this', 'time', 'dan', 'and', 'amy', 'go', 'to', 'the', 'bahamas', 'and', 'jamaica', 'to', 'discover', 'the', 'truth', 'about', 'the', 'madrigals', '.', 'they', 'find', 'the', 'clue', 'to', 'the', 'next', 'country', 'which', 'may', 'yet', 'unify', 'the', 'cahills', 'once', 'again']

Looks good!

### Token Count

In [30]:
test_df['Tok_Count'] = test_df.Toks.map(len)
test_df['Tok_Count']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Tok_Count'] = test_df.Toks.map(len)


0      1
1      8
2     35
3     18
4    153
Name: Tok_Count, dtype: int64

In [31]:
total_df['Tok_Count'] = total_df.Toks.map(len)

### Word Length

In [32]:
#Excludes punctuation for this category (could bring down average word length inaccurately)
test_df['Avg_Word_Len'] = test_df.Toks.map(lambda x: np.mean([len(w) for w in x if w.isalnum()]))
test_df['Avg_Word_Len']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Avg_Word_Len'] = test_df.Toks.map(lambda x: np.mean([len(w) for w in x if w.isalnum()]))


0    1.000000
1    3.714286
2    4.176471
3    5.000000
4    4.484375
Name: Avg_Word_Len, dtype: float64

In [33]:
test_df.Toks.iloc[1] #Checking the math on my own - looks correct!

['my', 'pick', 'for', 'the', 'caldecott', 'so', 'far', '...']

In [34]:
total_df['Avg_Word_Len'] = total_df.Toks.map(lambda x: np.mean([len(w) for w in x if w.isalnum()]))

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
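
The two lines of output above appear to be the tail of NumPy's warning about taking the mean of an empty list. That happens when a review has no alphanumeric tokens at all, for example a review consisting only of `3.3`. Here is a sketch of one way to handle that case explicitly, using a hypothetical helper `avg_word_len` that returns NaN without triggering the warning:

```python
import math


def avg_word_len(tokens):
    """Mean length of alphanumeric tokens, or NaN if there are none."""
    lengths = [len(tok) for tok in tokens if tok.isalnum()]
    if not lengths:
        return math.nan  # nothing to average, e.g. tokens like '3.3' or '...'
    return sum(lengths) / len(lengths)


# Tokens from the second test review: 26 characters over 7 words
example = avg_word_len(['my', 'pick', 'for', 'the', 'caldecott', 'so', 'far', '...'])
print(round(example, 6))
print(avg_word_len(['3.3']))
```

The helper could be passed to `total_df.Toks.map` in place of the lambda. Either way, the affected rows end up as NaN in `Avg_Word_Len`.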


### Sentences

In [35]:
test_df['Sents'] = test_df.Text.map(nltk.sent_tokenize)
test_df['Sents']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Sents'] = test_df.Text.map(nltk.sent_tokenize)


0                                                  [O]
1                [my pick for the caldecott so far...]
2    [This time Dan and Amy go to the Bahamas and J...
3    [Loved the excerpts where Julia, the main char...
4    [I liked the illustrations, which are are - we...
Name: Sents, dtype: object

In [36]:
test_df.Sents.iloc[2]

['This time Dan and Amy go to the Bahamas and Jamaica to discover the truth about the Madrigals.', 'They find the clue to the next country which may yet unify the Cahills once again']

In [37]:
total_df['Sents'] = total_df.Text.map(nltk.sent_tokenize)

### Sentence Count

In [38]:
test_df['Sents_Count'] = test_df.Sents.map(len)
test_df['Sents_Count']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Sents_Count'] = test_df.Sents.map(len)


0    1
1    1
2    2
3    1
4    5
Name: Sents_Count, dtype: int64

In [39]:
total_df['Sents_Count'] = total_df.Sents.map(len)

### Sentence Length

In [40]:
#Had to tokenize first
test_df['Avg_Sent_Len'] = test_df.Sents.map(lambda x: np.mean([len(nltk.word_tokenize(s)) for s in x]))
test_df['Avg_Sent_Len']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Avg_Sent_Len'] = test_df.Sents.map(lambda x: np.mean([len(nltk.word_tokenize(s)) for s in x]))


0     1.0
1     8.0
2    17.5
3    18.0
4    30.6
Name: Avg_Sent_Len, dtype: float64

In [41]:
total_df['Avg_Sent_Len'] = total_df.Sents.map(lambda x: np.mean([len(nltk.word_tokenize(s)) for s in x]))

I've added a decent number of features, so let's check in with the `.describe()` method to see where the numbers stand.

In [42]:
total_df.describe()

Unnamed: 0,Rating,Avg_Rating,Ratings_Count,Tok_Count,Avg_Word_Len,Sents_Count,Avg_Sent_Len
count,28274.0,28274.0,28274.0,28274.0,28118.0,28274.0,28274.0
mean,3.835396,3.990835,88025.85,137.82468,4.326438,7.525005,16.558581
std,1.22186,0.292023,350648.7,201.674042,0.715024,10.178012,10.228095
min,0.0,1.98,0.0,1.0,1.0,1.0,1.0
25%,3.0,3.81,536.0,26.0,4.0,2.0,10.75
50%,4.0,4.01,4224.0,66.0,4.251969,4.0,15.714286
75%,5.0,4.19,30524.5,164.0,4.540541,9.0,20.88544
max,5.0,5.0,4899965.0,4159.0,16.0,210.0,388.0


The first thing that catches my eye is token count. The mean of about 138 tokens seems reasonable, but a standard deviation of almost 202 tokens?? The quartile breakdown shows that most reviews are actually quite short, with a median of 66 tokens, but it looks like an absolutely massive review at 4159 tokens is pulling up that average. I want to take a closer look at this, so I'm going to make a filter to locate reviews with especially large token counts.

In [43]:
large = (total_df.Tok_Count > 4000) 
total_df[large]

Unnamed: 0,Text,Rating,Title,Author,Category,Genres,Language,Pages,Pub_Year,Avg_Rating,Ratings_Count,Toks,Toks_Lower,Tok_Count,Avg_Word_Len,Sents,Sents_Count,Avg_Sent_Len
26450,"The Knight's Tale \n Very tragic, romantic sto...",4,The Canterbury Tales,Geoffrey Chaucer,poetry,"{'poetry': 1659, 'fiction': 613, 'history, his...",eng,627,1934,3.48,16,"[The, Knight, 's, Tale, Very, tragic, ,, roman...","[the, knight, 's, tale, very, tragic, ,, roman...",4159,4.239285,"[The Knight's Tale \n Very tragic, romantic st...",196,21.219388


So there's only one review above 4000 tokens. This review should be considered a book in and of itself!

In [44]:
large = (total_df.Tok_Count > 3000) 
print(total_df[large].shape)

large = (total_df.Tok_Count > 1000) 
print(total_df[large].shape)

large = (total_df.Tok_Count > 500) 
print(total_df[large].shape)

(4, 18)
(221, 18)
(1521, 18)


Considering there are over 28000 reviews total, with only 1521 being longer than 500 tokens, it's clear that long reviews are the outliers here. I don't think I should just exclude the long reviews, but it's definitely something to keep an eye on as it might affect some of my statistics. I was going to add type-token ratio (TTR) as a feature, but I'm going to skip that for now because TTR is sensitive to text length: I'd have to cut the longer texts down to a tiny sample to make them comparable to the shorter reviews.
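
For reference, the length problem with TTR is that longer texts repeat more words, so the raw ratio of unique tokens to total tokens falls as a review gets longer. One common workaround is to compute it over a fixed-size window of tokens. This is a rough sketch rather than a feature used in this notebook: `ttr_first_n` is a hypothetical helper and the default window of 50 tokens is arbitrary. Reviews shorter than the window return NaN rather than an incomparable value.

```python
import math


def ttr_first_n(tokens, n=50):
    """Type-token ratio over the first n lowercased tokens; NaN if the text is shorter than n."""
    if len(tokens) < n:
        return math.nan
    window = [tok.lower() for tok in tokens[:n]]
    return len(set(window)) / n


# Toy token lists (hypothetical), using a window of 4 for readability
ttr_long = ttr_first_n(['The', 'the', 'book', 'was', 'great'], n=4)  # 3 types / 4 tokens
ttr_short = ttr_first_n(['Cute'], n=4)                               # too short for the window
print(ttr_long, ttr_short)
```

Since a quarter of the reviews have 26 tokens or fewer, any window size is a trade-off between comparability and how many reviews get dropped.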

In [45]:
total_df.describe()

Unnamed: 0,Rating,Avg_Rating,Ratings_Count,Tok_Count,Avg_Word_Len,Sents_Count,Avg_Sent_Len
count,28274.0,28274.0,28274.0,28274.0,28118.0,28274.0,28274.0
mean,3.835396,3.990835,88025.85,137.82468,4.326438,7.525005,16.558581
std,1.22186,0.292023,350648.7,201.674042,0.715024,10.178012,10.228095
min,0.0,1.98,0.0,1.0,1.0,1.0,1.0
25%,3.0,3.81,536.0,26.0,4.0,2.0,10.75
50%,4.0,4.01,4224.0,66.0,4.251969,4.0,15.714286
75%,5.0,4.19,30524.5,164.0,4.540541,9.0,20.88544
max,5.0,5.0,4899965.0,4159.0,16.0,210.0,388.0


Going back to this chart, let's take a quick look at the other features. Average word length of 4.33 seems pretty reasonable and close to the median of 4.25. Average sentence count looks like it falls into a similar trap to token count with some really long reviews pulling those averages up. The median is 4 sentences, and I think from here on out I'll look at median more than mean to mitigate the effect of those outliers. Average sentence length has a median of 15.71 tokens, which actually is pretty close to the mean at 16.56 tokens. That being said, the min and max are curiously short and long respectively, so I might quickly look at those.

In [46]:
short = (total_df.Avg_Sent_Len == 1) 
total_df[short].head(3)

Unnamed: 0,Text,Rating,Title,Author,Category,Genres,Language,Pages,Pub_Year,Avg_Rating,Ratings_Count,Toks,Toks_Lower,Tok_Count,Avg_Word_Len,Sents,Sents_Count,Avg_Sent_Len
0,O,0,Xander's Panda Party,Linda Sue Park,children,"{'children': 143, 'fiction': 15, 'poetry': 9, ...",eng,40,2013,4.05,1163,[O],[o],1,1.0,[O],1,1.0
588,3.3,3,The Miserable Mill (A Series of Unfortunate Ev...,Lemony Snicket,children,"{'fiction': 1361, 'young-adult': 967, 'childre...",en-US,194,2000,3.83,103546,[3.3],[3.3],1,,[3.3],1,1.0
716,Cute,4,Snowmen All Year,Caralyn Buehner,children,"{'children': 106, 'fiction': 3, 'fantasy, para...",eng,32,2010,3.97,715,[Cute],[cute],1,4.0,[Cute],1,1.0


In [47]:
long = (total_df.Avg_Sent_Len == 388) 
total_df[long].Text #Find index of review
total_df.Text.loc[15166]

"first read in june 6th/2014 \n reread in april 11th/2016 \n I don't have words enough to say how much I love Ronan Lynch, he's my queer, angry, trashy, beautiful son and i love him so much \n RONAN LYNCH TAKING CARE OF SMALL ANIMALS \n RONAN LYNCH LOVING CHAINSAW \n RONAN LYNCH LOVING HIS LITTLE BROTHER AND HIS MOM SO MUCH \n RONAN LYNCH LAUGHING AND BEING HAPPY \n RONAN LYNCH LOVING HIS FRIENDS AND NOT FEELING SO ALONE AFTER ALL \n RONAN LYNCH ADMITTING TO HIMSELF HE'S GAY AND IN LOVE WITH ADAM PARRISH ISTG THIS FUCKING KID IS GOING TO KILL ME my fave *clutches chest* \n also: matthew lynch is smol and basically the most precious human being ever \n bluesey is THE BEST THING i'm still crying why do you make me suffer so much @ universe \n my poor kid adam parrish is so broken in this book and i just wanna hold him and love him forever, he needs love, warm blankets, cookies and happiness \n the women of 300 fox way are my squad goals and i love them dearly \n i still don't really care

Okay, I see what's happening here. This reviewer made a new line for every new sentence rather than using punctuation, meaning NLTK's sentence tokenizer didn't recognize them as sentence boundaries. I'll have to look into a way around this, because I'm sure this isn't the only review where this happened. In fact, let's check. 
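
One possible way around this is to treat every newline as a sentence boundary before handing each line to the sentence tokenizer. Here is a sketch under that assumption. `split_sentences` is a hypothetical helper, and the regex splitter is only a stand-in so the example runs without NLTK. In the notebook, `nltk.sent_tokenize` could be passed in as the `sent_tokenize` argument.

```python
import re


def split_sentences(text, sent_tokenize=None):
    """Split a review into sentences, treating each newline as a boundary."""
    if sent_tokenize is None:
        # Stand-in splitter for illustration: break after ., ! or ? followed by whitespace
        sent_tokenize = lambda chunk: re.split(r'(?<=[.!?])\s+', chunk)
    sentences = []
    for line in text.split('\n'):
        line = line.strip()
        if line:  # skip blank lines
            sentences.extend(s for s in sent_tokenize(line) if s)
    return sentences


# Toy review (hypothetical) that mixes newlines and punctuation
sents = split_sentences("first read \n reread. And again \n done")
print(sents)
```

A side effect worth keeping in mind is that reviews using newlines for lists or headings would also be split at those points, which would raise their sentence counts.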

In [48]:
long = (total_df.Avg_Sent_Len > 300) 
print(total_df[long].shape)

long = (total_df.Avg_Sent_Len > 200) 
print(total_df[long].shape)

total_df[long].Text

(1, 18)
(8, 18)


348      ** spoiler alert ** \n Ok so I just finished t...
3155     wHd mn lktb ldhy m n bd't fyh, lm stT` ltwqf H...
5951     hw rw'y@ amn bh hdh lshkhS khll Hyth mbyn lTbq...
9579     I bought this back in'89 just after the 89 Bur...
13773    ewlaamiiwrrnkrrmthiidiisakeruue`nge`aamaasraan...
15166    first read in june 6th/2014 \n reread in april...
17212    ** spoiler alert ** \n Setting: South Dakota: ...
25082    My Review: 5 Stars \n Okay so I recieved my co...
Name: Text, dtype: object

Yeah, this is something I unfortunately noticed when saving my data samples. It seems like some reviews are just nonsense text that wasn't filtered out because it technically is English and not empty. I'm going to have to find a way to get rid of those samples or I'm sure they will end up skewing the results, but that will be a problem for my next progress report.

**UPDATE FROM PROGRESS REPORT 3 - FILTERING OUT NONSENSE TEXT**

I found this Python library, nostril, that is able to differentiate between text that is real and text that is most likely nonsensical, so I tried it out to see if it could solve my problem.

In [49]:
from nostril import nonsense

nonsense_test = ["This is a real sentence.", "i luv 2 read bookz", "ghsuofdisogjifs"]

for sent in nonsense_test:
    print(nonsense(sent))

False
False
True


It seems to do a decent job, even with slang and misspellings. However, input shorter than 6 characters can't be categorized and throws an error instead, and I don't want the whole process to stop on a text that is too short. The function below catches those exceptions so I can label each review in the DataFrame as real, nonsense, or short.

In [50]:
def test_nonsense(text):
    try:
        #Call nonsense() once and label the result
        return "nonsense" if nonsense(text) else "real"
    except ValueError:
        #Raised when the text is too short to categorize
        return "short"

test_nonsense("this")

'short'

In [51]:
test_df = total_df.head()
test_df["Nonsense"] = test_df.Text.map(test_nonsense)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df["Nonsense"] = test_df.Text.map(test_nonsense)


In [52]:
test_df

Unnamed: 0,Text,Rating,Title,Author,Category,Genres,Language,Pages,Pub_Year,Avg_Rating,Ratings_Count,Toks,Toks_Lower,Tok_Count,Avg_Word_Len,Sents,Sents_Count,Avg_Sent_Len,Nonsense
0,O,0,Xander's Panda Party,Linda Sue Park,children,"{'children': 143, 'fiction': 15, 'poetry': 9, ...",eng,40,2013,4.05,1163,[O],[o],1,1.0,[O],1,1.0,short
1,my pick for the caldecott so far...,5,Xander's Panda Party,Linda Sue Park,children,"{'children': 143, 'fiction': 15, 'poetry': 9, ...",eng,40,2013,4.05,1163,"[my, pick, for, the, caldecott, so, far, ...]","[my, pick, for, the, caldecott, so, far, ...]",8,3.714286,[my pick for the caldecott so far...],1,8.0,real
2,This time Dan and Amy go to the Bahamas and Ja...,4,"Storm Warning (The 39 Clues, #9)",Linda Sue Park,children,"{'mystery, thriller, crime': 188, 'young-adult...",eng,190,2010,3.98,39904,"[This, time, Dan, and, Amy, go, to, the, Baham...","[this, time, dan, and, amy, go, to, the, baham...",35,4.176471,[This time Dan and Amy go to the Bahamas and J...,2,17.5,real
3,"Loved the excerpts where Julia, the main chara...",5,Project Mulberry,Linda Sue Park,children,"{'fiction': 122, 'children': 111, 'young-adult...",eng,240,2007,3.67,2929,"[Loved, the, excerpts, where, Julia, ,, the, m...","[loved, the, excerpts, where, julia, ,, the, m...",18,5.0,"[Loved the excerpts where Julia, the main char...",1,18.0,real
4,"I liked the illustrations, which are are - wel...",4,A Moon of My Own,Jennifer Rustgi,children,"{'children': 13, 'young-adult': 2, 'non-fictio...",eng,32,2016,3.78,84,"[I, liked, the, illustrations, ,, which, are, ...","[i, liked, the, illustrations, ,, which, are, ...",153,4.484375,"[I liked the illustrations, which are are - we...",5,30.6,real


Seems to work on a small subset, so let's try the whole thing!

In [53]:
total_df["Nonsense"] = total_df.Text.map(test_nonsense)
total_df.Nonsense.value_counts()

real        27219
nonsense      635
short         420
Name: Nonsense, dtype: int64

635 of the 28,274 reviews (about 2%) were tagged as nonsense, which isn't a lot. Let's take a look at some of these nonsense reviews just to make sure they're classified correctly.
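As a quick check on that percentage, here is a minimal sketch that rebuilds the counts by hand (copied from the output above, so it runs without the real data) and converts them to shares of the total.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"real": 27219, "nonsense": 635, "short": 420}, name="Nonsense")

# Convert counts to percentages of all reviews
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

On the real DataFrame, `total_df.Nonsense.value_counts(normalize=True)` should give the same proportions directly.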

In [54]:
nonsense_filter = (total_df.Nonsense == "nonsense")
total_df[nonsense_filter].head()

Unnamed: 0,Text,Rating,Title,Author,Category,Genres,Language,Pages,Pub_Year,Avg_Rating,Ratings_Count,Toks,Toks_Lower,Tok_Count,Avg_Word_Len,Sents,Sents_Count,Avg_Sent_Len,Nonsense
69,"For my Goodreads friends, yes, I read a childr...",3,The Twits,Roald Dahl,children,"{'children': 447, 'fiction': 173, 'young-adult...",eng,96.0,2004,3.94,83762,"[For, my, Goodreads, friends, ,, yes, ,, I, re...","[for, my, goodreads, friends, ,, yes, ,, i, re...",365,4.068182,"[For my Goodreads friends, yes, I read a child...",15,24.333333,nonsense
140,Everything I want a book to be. AMAAAAZZZING!!...,5,"Anne of Green Gables (Anne of Green Gables, #1)",L.M. Montgomery,children,"{'fiction': 5772, 'young-adult': 3267, 'childr...",eng,,2003,4.23,513174,"[Everything, I, want, a, book, to, be, ., AMAA...","[everything, i, want, a, book, to, be, ., amaa...",31,4.5,"[Everything I want a book to be., AMAAAAZZZING...",3,10.333333,nonsense
221,"It's no Miraculous Journey of Edward Tulane, b...",3,Raymie Nightingale,Kate DiCamillo,children,"{'fiction': 718, 'history, historical fiction,...",eng,272.0,2016,3.92,9146,"[It, 's, no, Miraculous, Journey, of, Edward, ...","[it, 's, no, miraculous, journey, of, edward, ...",481,4.2,"[It's no Miraculous Journey of Edward Tulane, ...",36,13.361111,nonsense
268,"sypvrv shl hklb bq, shnkhtp mmshpkhh qlypvrnyt...",4,The Call of the Wild,Jack London,children,"{'young-adult': 696, 'fiction': 905, 'history,...",eng,150.0,2012,3.83,4900,"[sypvrv, shl, hklb, bq, ,, shnkhtp, mmshpkhh, ...","[sypvrv, shl, hklb, bq, ,, shnkhtp, mmshpkhh, ...",123,4.715596,"[sypvrv shl hklb bq, shnkhtp mmshpkhh qlypvrny...",7,17.571429,nonsense
293,hdh lktb mn lklsykyt ldhy lTlm 'rdt bty`h w ns...,5,The Giving Tree,Shel Silverstein,children,"{'children': 13199, 'fiction': 2045, 'poetry':...",eng,64.0,1964,4.37,720582,"[hdh, lktb, mn, lklsykyt, ldhy, lTlm, 'rdt, bt...","[hdh, lktb, mn, lklsykyt, ldhy, ltlm, 'rdt, bt...",25,3.25,[hdh lktb mn lklsykyt ldhy lTlm 'rdt bty`h w n...,3,8.333333,nonsense


Unfortunately, some of these look like real reviews even though they were tagged as nonsense. I exported the full set of nonsense-tagged reviews to a CSV so I could skim the complete text of each one, and while the tagger did catch some genuine nonsense, many of the flagged reviews were real. Because removing everything tagged as nonsense would discard valid reviews, and the genuinely nonsensical ones make up a very small portion of the data, I'm leaving them all in. This should be noted as a limitation of the data.

In [55]:
#total_df[nonsense_filter].to_csv('/Users/ashleyfeiler/Documents/data_science/Goodreads-Genre-Reviews-Analysis/data/nonsense_reviews.csv')

Anyway, there's one more feature I still want to add - sentiment! I'm not too familiar with NLTK's sentiment analyzer, so I'm going to create some test cases first just to get a feel for how it works.

### Sentiment

In [56]:
#nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()

test_sents = ["I loved this book!!!", 
             "Boring.", 
             "This book was ok - not my favorite, but not the worst",
             "My heart after reading this book: :) <3",
             "Literally the worst book I've ever read"]

for sent in test_sents:
    print(sent)
    scores = sentiment.polarity_scores(sent) #Score once and reuse the result
    print(scores)
    compound = scores['compound']
    if compound > 0:
        print('positive')
    elif compound < 0:
        print('negative')
    else:
        print('neutral')
        
    if compound >= -1 and compound < -0.6:
        print('1\n')
    elif compound >= -0.6 and compound < -0.2:
        print('2\n')
    elif compound >= -0.2 and compound < 0.2:
        print('3\n')
    elif compound >= 0.2 and compound < 0.6:
        print('4\n')
    elif compound >= 0.6:
        print('5\n')

I loved this book!!!
{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.6981}
positive
5

Boring.
{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.3182}
negative
2

This book was ok - not my favorite, but not the worst
{'neg': 0.11, 'neu': 0.507, 'pos': 0.383, 'compound': 0.6487}
positive
5

My heart after reading this book: :) <3
{'neg': 0.0, 'neu': 0.504, 'pos': 0.496, 'compound': 0.7096}
positive
5

Literally the worst book I've ever read
{'neg': 0.406, 'neu': 0.594, 'pos': 0.0, 'compound': -0.6249}
negative
1



Mapping the compound score onto a 1-5 star scale doesn't work very well: the lukewarm "This book was ok" review lands at 5, for example, which is definitely more enthusiastic than the text. So for right now I'll just stick with the overall positive/negative tagging.
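For reference, the five-bin mapping tried above can be written more compactly with `np.digitize`. This is only a sketch: the compound scores are copied from the test output rather than recomputed, and `edges` and `stars` are names made up for illustration.

```python
import numpy as np

# Compound scores copied from the test output above
compounds = np.array([0.6981, -0.3182, 0.6487, 0.7096, -0.6249])

# Interior bin edges from the if/elif chain:
# [-1, -0.6), [-0.6, -0.2), [-0.2, 0.2), [0.2, 0.6), [0.6, 1]
edges = [-0.6, -0.2, 0.2, 0.6]

# np.digitize returns 0-4 for the five bins; add 1 to get a 1-5 "star" value
stars = np.digitize(compounds, edges) + 1
print(stars.tolist())
```

This reproduces the same bins as the loop, so it has the same weakness: the bin edges are arbitrary, and lukewarm reviews can still land in the top bin.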

In [57]:
def sent_analysis(sents):
    #Average the compound score over every sentence in a review
    scores = [sentiment.polarity_scores(sent)['compound'] for sent in sents]
    average = np.mean(scores)
    return average

test_df['Sentiment_Num'] = test_df.Sents.map(sent_analysis)
test_df.Sentiment_Num

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Sentiment_Num'] = test_df.Sents.map(sent_analysis)


0    0.00000
1    0.00000
2    0.15910
3    0.59940
4    0.44342
Name: Sentiment_Num, dtype: float64

In [58]:
total_df['Sentiment_Num'] = total_df.Sents.map(sent_analysis)

In [59]:
sentiment.polarity_scores(test_df.Text.iloc[1])

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [60]:
def tag_sentiment(score):
    if score > 0:
        tag = 'positive'
    elif score < 0:
        tag = 'negative'
    else:
        tag = 'neutral'
    
    return tag

test_df['Sentiment_Tag'] = test_df.Sentiment_Num.map(tag_sentiment)
test_df.Sentiment_Tag

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Sentiment_Tag'] = test_df.Sentiment_Num.map(tag_sentiment)


0     neutral
1     neutral
2    positive
3    positive
4    positive
Name: Sentiment_Tag, dtype: object

In [61]:
total_df['Sentiment_Tag'] = total_df.Sentiment_Num.map(tag_sentiment)
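The tagging above counts any nonzero score as positive or negative, so a review averaging 0.01 is tagged the same as one averaging 0.9. Below is a sketch of a variant with a neutral band. The function name `tag_sentiment_banded` and the `threshold` parameter are hypothetical. The ±0.05 default follows the cutoffs VADER's documentation suggests for single-text compound scores, and whether it suits a per-sentence average like `Sentiment_Num` is an assumption.

```python
def tag_sentiment_banded(score, threshold=0.05):
    """Tag a compound score, treating scores within +/- threshold as neutral."""
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'

# A few scores, including two taken from the test output above
for score in [0.0, 0.03, 0.1591, -0.02, -0.6249]:
    print(score, tag_sentiment_banded(score))
```

It could be mapped onto the column the same way as the original, e.g. `total_df.Sentiment_Num.map(tag_sentiment_banded)`.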

In [62]:
total_df.describe()

Unnamed: 0,Rating,Avg_Rating,Ratings_Count,Tok_Count,Avg_Word_Len,Sents_Count,Avg_Sent_Len,Sentiment_Num
count,28274.0,28274.0,28274.0,28274.0,28118.0,28274.0,28274.0,28274.0
mean,3.835396,3.990835,88025.85,137.82468,4.326438,7.525005,16.558581,0.234953
std,1.22186,0.292023,350648.7,201.674042,0.715024,10.178012,10.228095,0.272964
min,0.0,1.98,0.0,1.0,1.0,1.0,1.0,-0.9493
25%,3.0,3.81,536.0,26.0,4.0,2.0,10.75,0.038188
50%,4.0,4.01,4224.0,66.0,4.251969,4.0,15.714286,0.229155
75%,5.0,4.19,30524.5,164.0,4.540541,9.0,20.88544,0.417267
max,5.0,5.0,4899965.0,4159.0,16.0,210.0,388.0,0.9903


I'm extremely curious to know what the most negative and most positive reviews were, so let's look!

In [63]:
neg = (total_df.Sentiment_Num < -0.9) 
total_df[neg].shape

(3, 21)

In [64]:
print(total_df[neg].Text.iloc[0])
print(total_df[neg].Text.iloc[1])

This book was too encyclopedic for me, with description of murder after murder and little-to-no narrative structure other than the sequential murders.
Slow starting but once into its stride 19 Purchase Street is a hearty tale of money laundering, greed, revenge murder and a billion dollar robbery caper.


Ok, so the first one is pretty accurate, but with the second one, it looks like the sentiment analyzer saw all those words like "greed", "revenge", and "murder" and rated the review as negative, even though the review itself seems to overall like the book. Not sure there's much I can do about this, but it's something to keep in mind.

Let's look at the positive!

In [65]:
pos = (total_df.Sentiment_Num > 0.95) 
total_df[pos].shape

(32, 21)

In [66]:
print(total_df[pos].Text.iloc[3])
print(total_df[pos].Text.iloc[4])

I love frozen because... 
 Olaf = Cute, Fuuny, Sweet 
 Elsa = Fearless and Loving 
 Anna = Caring and Brave
Brilliant writing, fascinating history, but the novel bogs down a bit about 2/3 through....still a wonderfully good read if not a page-turner.


Once again, the first example seems pretty spot on, but as a human reader, I can tell the second review has a bit of hesitation and probably shouldn't score above 0.95 if 1.0 is the most positive a text can be. Still, for what I have to work with, I think this will be a good tool, especially since it accounts for slang terms, emoticons, and tone conveyed through things like all caps. It'll be really interesting to see how sentiment derived from the review text compares to the rating people gave the book.
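One possible way to set up that comparison is a cross-tabulation of star rating against sentiment tag. The sketch below uses a hypothetical toy DataFrame (the values are invented, not drawn from the real reviews) with the same column names as `total_df`.

```python
import pandas as pd

# Hypothetical toy data, not drawn from the real reviews
toy = pd.DataFrame({
    "Rating": [5, 4, 1, 2, 5, 3],
    "Sentiment_Tag": ["positive", "positive", "negative",
                      "positive", "neutral", "negative"],
})

# Rows are star ratings, columns are sentiment tags, cells are review counts
table = pd.crosstab(toy.Rating, toy.Sentiment_Tag)
print(table)
```

Cells far from the expected pattern (say, a 2-star review tagged positive) would be the ones worth reading by hand.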

### Part of Speech

Time for the last thing I want to try: POS tagging! I feel like the adjectives used in each genre's reviews could have a wide semantic range, so I'm going to use NLTK's POS tagger to pull them out. (I wanted to use spaCy but had some trouble installing it. If I'm able to get that figured out, I'll probably come back and use that instead.)

In [67]:
#!pip install spacy --user
#!python -m spacy download "en_core_web_sm" --user
#import spacy

In [68]:
from nltk import pos_tag
test_text = nltk.word_tokenize(total_df.Text.iloc[10])
test_text

['I', 'rather', 'enjoyed', 'the', 'book', 'more', 'than', 'the', 'movie', '.', 'A', 'lot', 'of', 'detail', 'was', 'lost', 'in', 'the', 'movie', '.', 'And', 'I', 'feel', 'some', 'important', 'insights', 'that', 'I', 'feel', 'developed', 'the', 'storyline', 'better', 'and', 'assisted', 'with', 'much', 'more', 'understanding', 'of', 'Jonas', "'", 'character', 'were', 'not', 'introduced', 'in', 'the', 'movie', '.', 'Both', 'are', 'good', ';', 'however', ',', 'I', 'recommend', 'the', 'book']

I'm really only interested in adjectives, so I'm going to make a function that can be used to map a list of all adjectives used in a review to a new column in the dataframe.

In [69]:
def find_adjs(toks):
    adj_list = []
    pos_list = nltk.pos_tag(toks)
    for (tok, pos) in pos_list:
        if pos.startswith('JJ'):
            adj_list.append((tok, pos))
    
    return adj_list

find_adjs(test_text)

[('more', 'JJR'), ('important', 'JJ'), ('understanding', 'JJ'), ('good', 'JJ')]

In [70]:
test_df['Adjs'] = test_df.Toks.map(find_adjs)
print(test_df.Adjs)
print(test_df.Adjs.iloc[4])

0                                                   []
1                                                   []
2                                         [(next, JJ)]
3                                         [(main, JJ)]
4    [(luminous, JJ), (more, JJR), (mundane, JJ), (...
Name: Adjs, dtype: object
[('luminous', 'JJ'), ('more', 'JJR'), ('mundane', 'JJ'), ('trite', 'JJ'), ('poetic', 'JJ'), ('back', 'JJ'), ('useful', 'JJ'), ('informative', 'JJ'), ('full', 'JJ'), ('unlikely', 'JJ')]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['Adjs'] = test_df.Toks.map(find_adjs)


This works on the small subset. Reviews with no adjectives just get an empty list, which is good to keep in mind if I end up wanting to filter those out.
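If that filtering does turn out to be needed, a boolean mask on the list length is one way to do it. This is a minimal sketch on a toy stand-in for the `Adjs` column, and `toy_df` and `has_adjs` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the Adjs column (values copied from the test output above)
toy_df = pd.DataFrame({
    "Adjs": [[], [], [("next", "JJ")], [("main", "JJ")]],
})

# Keep only reviews whose adjective list is non-empty
has_adjs = toy_df.Adjs.map(len) > 0
print(toy_df[has_adjs])
```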

In [71]:
total_df['Adjs'] = total_df.Toks.map(find_adjs)

In [72]:
total_df['Adjs_Count'] = total_df.Adjs.map(len)
total_df.Adjs_Count.describe()

count    28274.000000
mean         9.872073
std         14.076462
min          0.000000
25%          2.000000
50%          5.000000
75%         12.000000
max        290.000000
Name: Adjs_Count, dtype: float64

As with sentence length, there seem to be pretty significant high-count outliers, so when looking at descriptive statistics like these, I'm going to pay more attention to the median (50%) than the mean. 

Now that my dataframe has all the info I want, I'm going to pickle this file and complete the analysis in a separate Jupyter Notebook to keep things clean. Check out my Data Analysis notebook for more info!

In [75]:
#with open('data/analysis_df.pkl', 'wb') as f: #Context manager closes the file automatically
#    pickle.dump(total_df, f)
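The save step above could also be done with pandas' own pickling helpers, `DataFrame.to_pickle` and `pd.read_pickle`. The sketch below round-trips a toy DataFrame through a temporary directory so it can run anywhere. The toy data and the temporary path are assumptions for illustration, not the real `total_df` or project path.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for total_df
toy_df = pd.DataFrame({"Text": ["Boring.", "I loved this book!!!"],
                       "Rating": [2, 5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "analysis_df.pkl")
    toy_df.to_pickle(path)           # Write the DataFrame to disk
    restored = pd.read_pickle(path)  # Read it back in

# True if the round trip preserved the data
print(restored.equals(toy_df))
```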