# Data Analysis
Ashley Feiler, aef56@pitt.edu

## Imports 

In [1]:
import pickle
import glob
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns      

## Sharable Data Samples
Because I can't share all of the data I'm using due to licensing, I plan on sharing samples. Since my computer could only handle loading so much data at a time, I used a separate Jupyter Notebook for each genre: in each one I merged the necessary data, pickled a smaller sample file, and then closed the notebook to free memory. In this file, I unpickle and combine all of those samples so I can share them.

### Original Data

In [2]:
directory = '/Users/ashleyfeiler/Documents/data_science/Goodreads-Genre-Reviews-Analysis/data/'

share_files = glob.glob(directory + 'genre_share/*.pkl') #Get filepath of all pickled files
print(len(share_files)) #Confirm 8 files for 8 genres
share_files[0]

8


'/Users/ashleyfeiler/Documents/data_science/Goodreads-Genre-Reviews-Analysis/data/genre_share/fantasy_share.pkl'

In [3]:
frames = [] #Collect each genre's sample before combining

for pkl in share_files: #For each file path, load the pickled DataFrame
    with open(pkl, 'rb') as f: #File is closed automatically, even on error
        frames.append(pickle.load(f))

share_df = pd.concat(frames) #Combine all genre samples into one DataFrame
    
share_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 0 to 4
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       40 non-null     object
 1   book_id       40 non-null     int64 
 2   review_id     40 non-null     object
 3   rating        40 non-null     int64 
 4   review_text   40 non-null     object
 5   date_added    40 non-null     object
 6   date_updated  40 non-null     object
 7   read_at       40 non-null     object
 8   started_at    40 non-null     object
 9   n_votes       40 non-null     int64 
 10  n_comments    40 non-null     int64 
dtypes: int64(4), object(7)
memory usage: 3.8+ KB


This confirms that altogether there are 40 review samples, just as there were supposed to be (5 from each of the 8 genres). To keep the sample as minimal as possible and stay within Fair Use guidelines, I will take a sample of only 5 of these 40 reviews to save as a CSV and share in my public repository.

(The code below that writes the CSV file has been commented out to prevent the CSV file from being overwritten every time this notebook is run)

In [4]:
#genre_samples = share_df.sample(5)
#genre_samples

In [5]:
#genre_samples.to_csv('data_samples/Genre_Samples.csv')
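
An alternative to commenting these cells out is to fix the random seed and write the file only when it does not already exist. This is a minimal sketch, not what this notebook actually ran: the helper name `save_sample_once`, the seed value, and the toy DataFrame are all assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def save_sample_once(df, path, n=5, seed=42):
    """Write an n-row sample of df to path unless the file already exists.

    Returns True if the file was written, False if it was left untouched.
    (Hypothetical helper; the name and the seed are illustrative.)
    """
    if os.path.exists(path):
        return False
    df.sample(n, random_state=seed).to_csv(path)
    return True


# Toy stand-in for share_df (the real reviews cannot be shared here)
toy_df = pd.DataFrame({'rating': range(40), 'review_text': ['text'] * 40})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'Genre_Samples.csv')
    first = save_sample_once(toy_df, out_path)    # file is created
    second = save_sample_once(toy_df, out_path)   # existing file is kept
    saved = pd.read_csv(out_path, index_col=0)    # read back the 5 sampled rows
```

With a fixed seed, rerunning the notebook would draw the same 5 rows, and the existence check keeps an already shared CSV from being replaced.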

### Condensed Data
That first process was to show a sample of what the original UCSD data looked like, but I also want to show the final format of the data that I compiled and will be working with for my analysis. Below is the same process as above, but with the final DataFrames I created for each genre (ranging from roughly 2200 to 4300 reviews per genre).

In [6]:
genre_files = glob.glob(directory + 'genre_pkls/*.pkl')
print(len(genre_files))
genre_files[0]

8


'/Users/ashleyfeiler/Documents/data_science/Goodreads-Genre-Reviews-Analysis/data/genre_pkls/children_short.pkl'

In [7]:
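frames = [] #Collect each genre's final DataFrame before combining

for pkl in genre_files:
    with open(pkl, 'rb') as f: #File is closed automatically, even on error
        frames.append(pickle.load(f))

total_df = pd.concat(frames) #Combine all genres into one DataFrame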
    
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28274 entries, 0 to 4998
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Text           28274 non-null  object 
 1   Rating         28274 non-null  int64  
 2   Title          28274 non-null  object 
 3   Author         28274 non-null  object 
 4   Category       28274 non-null  object 
 5   Genres         28274 non-null  object 
 6   Language       28274 non-null  object 
 7   Pages          28274 non-null  object 
 8   Pub_Year       28274 non-null  object 
 9   Avg_Rating     28274 non-null  float64
 10  Ratings_Count  28274 non-null  int64  
 11  User_ID        28274 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.8+ MB


Combining the samples from all 8 genres resulted in a total of 28274 reviews, which is a pretty decent amount of data to work with! Further down I will explore the makeup of this final data set in more detail, but for now I want to save a small sample of this DataFrame to share.

In [8]:
#total_sample = total_df.sample(5)
#total_sample

In [9]:
#total_sample.to_csv('data_samples/FinalDF_Sample.csv')

## Data Makeup

At first I thought I might still need the user IDs, but given all the columns I plan on adding for linguistic features, I don't think those IDs will be necessary, so my first order of business is to remove that column.

In [10]:
total_df = total_df[['Text', 'Rating', 'Title', 'Author', 'Category', 'Genres', 'Language', 'Pages', 'Pub_Year', 'Avg_Rating', 'Ratings_Count']]

In [11]:
total_df.columns

Index(['Text', 'Rating', 'Title', 'Author', 'Category', 'Genres', 'Language',
       'Pages', 'Pub_Year', 'Avg_Rating', 'Ratings_Count'],
      dtype='object')

In [12]:
total_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28274 entries, 0 to 4998
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Text           28274 non-null  object 
 1   Rating         28274 non-null  int64  
 2   Title          28274 non-null  object 
 3   Author         28274 non-null  object 
 4   Category       28274 non-null  object 
 5   Genres         28274 non-null  object 
 6   Language       28274 non-null  object 
 7   Pages          28274 non-null  object 
 8   Pub_Year       28274 non-null  object 
 9   Avg_Rating     28274 non-null  float64
 10  Ratings_Count  28274 non-null  int64  
dtypes: float64(1), int64(2), object(8)
memory usage: 2.6+ MB


In [13]:
total_df.shape

(28274, 11)

I am working with a DataFrame of 28274 reviews and 11 total columns, though this will expand as I add more linguistic features.

Now that that's done, let's take a look at some of the counts of different categories. What makeup of data am I finally working with?

In [14]:
total_df.Category.value_counts()

ya                        4334
fantasy_paranormal        4323
romance                   3918
mystery_thriller_crime    3789
comics_graphic            3505
history_bio               3362
children                  2858
poetry                    2185
Name: Category, dtype: int64

Clearly there is a pretty wide range in the number of reviews left from each genre after some of the data cleaning. Each genre started out with 5000 reviews, but some were eliminated because they were non-English or empty, which disproportionately affected different genres. This will definitely be something to keep in mind during analysis.
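
To put a number on how unevenly the cleaning affected each genre, the counts above can be divided by the 5000 reviews each genre started with. This is a small sketch that retypes the counts from the output above rather than reading them from `total_df`; the names `kept`, `STARTING_REVIEWS`, and `retention` are illustrative.

```python
import pandas as pd

# Review counts per genre, copied from the value_counts() output above
kept = pd.Series({
    'ya': 4334,
    'fantasy_paranormal': 4323,
    'romance': 3918,
    'mystery_thriller_crime': 3789,
    'comics_graphic': 3505,
    'history_bio': 3362,
    'children': 2858,
    'poetry': 2185,
})

STARTING_REVIEWS = 5000  # each genre started out with 5000 reviews

# Percentage of the original 5000 reviews that survived cleaning
retention = (kept / STARTING_REVIEWS * 100).round(1)
print(retention)
```

By this measure YA keeps about 87% of its reviews while poetry keeps only about 44%, so any per-genre comparison rests on quite different sample sizes.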

In [15]:
total_df.Rating.value_counts()

5    9941
4    9593
3    5356
2    1807
0     894
1     683
Name: Rating, dtype: int64

5- and 4-star reviews are by far the most common, followed by 3-star reviews. 2-star reviews are much less frequent, and 1-star reviews even less so. There are also 894 reviews with a rating of 0, which I assume means the reviewer left no star rating. It makes sense that the higher ratings are more common, as people are more likely to write a review about a book they like than about a book they are indifferent to, but I'm a little surprised to see so few low ratings. In my experience, people tend to be pretty passionate about books they hate as well. If genre turns out not to be a significant factor in linguistic features, it could be interesting to see if rating, which theoretically correlates with sentiment, has any effect on the language used in the review.
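
The same counts can be turned into percentages, and they also show how much the reviews rated 0 affect the mean. This is a sketch that retypes the counts from the output above; treating 0 as "no star rating given" is an assumption, not something confirmed in this notebook.

```python
import pandas as pd

# Rating counts copied from the value_counts() output above
rating_counts = pd.Series({5: 9941, 4: 9593, 3: 5356, 2: 1807, 0: 894, 1: 683})

# Percentage of all reviews at each rating
shares = (rating_counts / rating_counts.sum() * 100).round(1)
print(shares)

# Star values aligned with the counts, used to compute weighted means
stars = pd.Series(rating_counts.index, index=rating_counts.index)

# Mean rating over every review, including the ones rated 0
mean_all = (stars * rating_counts).sum() / rating_counts.sum()

# Mean rating over reviews with 1 to 5 stars only (assumes 0 = unrated)
rated = rating_counts.drop(0)
mean_rated = (stars.drop(0) * rated).sum() / rated.sum()

print(round(mean_all, 2), round(mean_rated, 2))
```

Under that assumption, leaving out the 0 ratings raises the mean from about 3.84 to about 3.96, which is worth keeping in mind whenever mean ratings are compared.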

In [16]:
len(total_df.Title.unique())

17774

In [17]:
total_df.Title.value_counts()[:15]

Milk and Honey                                                                          113
Hamlet                                                                                   50
The Giver (The Giver, #1)                                                                50
The Hunger Games (The Hunger Games, #1)                                                  49
Cinder (The Lunar Chronicles, #1)                                                        49
The Girl on the Train                                                                    47
Brown Girl Dreaming                                                                      44
Wonder (Wonder #1)                                                                       43
Miss Peregrine’s Home for Peculiar Children (Miss Peregrine’s Peculiar Children, #1)     42
Divergent (Divergent, #1)                                                                40
Where the Sidewalk Ends                                                         

Out of 28274 reviews, there are 17774 unique book titles, meaning 10500 reviews (a suspiciously even number) are additional reviews of a title that already appears in the data set. Still, it looks like most books are only reviewed once. Milk and Honey, a very popular book of poetry, is the most reviewed book at 113 reviews, and I recognize a lot of the other most reviewed books as Young Adult and Fantasy novels. Those were the 2 genres with the most reviews that made the final cut, so it's not surprising there are more repeat reviews for these books.
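
The figure of 10500 is just 28274 - 17774, the number of reviews beyond the first one for each title. Whether most titles really are reviewed only once can be checked directly from the per-title counts. Here is a sketch on a toy Series standing in for `total_df.Title`; the titles are made up and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for total_df.Title (hypothetical titles)
titles = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D', 'E'])

per_title = titles.value_counts()         # number of reviews per title
n_titles = per_title.size                 # number of unique titles
extra_reviews = len(titles) - n_titles    # reviews beyond the first per title
once = int((per_title == 1).sum())        # titles reviewed exactly once
share_once = once / n_titles              # fraction of titles reviewed once

print(n_titles, extra_reviews, once, share_once)
```

Running the same steps on the real column would confirm or correct the impression that single-review titles are the majority.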

In [18]:
len(total_df.Author.unique())

9688

In [19]:
total_df.Author.value_counts()[:10]

Cassandra Clare     250
Brian K. Vaughan    157
Neil Gaiman         148
Marissa Meyer       137
Stephenie Meyer     130
Rupi Kaur           127
Sarah J. Maas       123
Stephen King        123
Rick Riordan        115
Suzanne Collins     106
Name: Author, dtype: int64

Out of 28274 reviews, there are only 9688 unique authors, considerably fewer than the 17774 unique titles, which makes sense seeing as many authors have written several different books.

In [20]:
total_df.Language.value_counts()

eng      22737
en-US     4268
en-GB     1026
en-CA      243
Name: Language, dtype: int64

In [21]:
total_df.describe()

| | Rating | Avg_Rating | Ratings_Count |
|---|---|---|---|
| count | 28274.0 | 28274.0 | 28274.0 |
| mean | 3.835396 | 3.990835 | 88025.85 |
| std | 1.22186 | 0.292023 | 350648.7 |
| min | 0.0 | 1.98 | 0.0 |
| 25% | 3.0 | 3.81 | 536.0 |
| 50% | 4.0 | 4.01 | 4224.0 |
| 75% | 5.0 | 4.19 | 30524.5 |
| max | 5.0 | 5.0 | 4899965.0 |


In [22]:
total_df.groupby('Category').describe()

Rating:

| Category | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| children | 2858.0 | 3.904829 | 1.203209 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| comics_graphic | 3505.0 | 3.811412 | 1.153754 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| fantasy_paranormal | 4323.0 | 3.8161 | 1.246819 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| history_bio | 3362.0 | 3.851279 | 1.21574 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| mystery_thriller_crime | 3789.0 | 3.727105 | 1.178367 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| poetry | 2185.0 | 3.897941 | 1.276413 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| romance | 3918.0 | 3.943849 | 1.212399 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| ya | 4334.0 | 3.781034 | 1.272587 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |

Avg_Rating (middle columns truncated in the original output):

| Category | count | mean | ... | 75% | max |
|---|---|---|---|---|---|
| children | 2858.0 | 4.037768 | ... | 4.21 | 5.0 |
| comics_graphic | 3505.0 | 4.02168 | ... | 4.24 | 4.83 |
| fantasy_paranormal | 4323.0 | 4.014464 | ... | 4.23 | 5.0 |
| history_bio | 3362.0 | 3.943968 | ... | 4.14 | 5.0 |
| mystery_thriller_crime | 3789.0 | 3.88413 | ... | 4.06 | 4.88 |
| poetry | 2185.0 | 4.096256 | ... | 4.26 | 5.0 |
| romance | 3918.0 | 4.000403 | ... | 4.2 | 4.91 |
| ya | 4334.0 | 3.979213 | ... | 4.17 | 5.0 |

Ratings_Count:

| Category | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| children | 2858.0 | 93980.546186 | 275095.225792 | 1.0 | 330.0 | 3001.5 | 31387.0 | 1876252.0 |
| comics_graphic | 3505.0 | 16528.807703 | 41517.096041 | 1.0 | 479.0 | 2705.0 | 12834.0 | 406669.0 |
| fantasy_paranormal | 4323.0 | 108879.451076 | 375846.839796 | 1.0 | 838.5 | 7755.0 | 55039.0 | 4765497.0 |
| history_bio | 3362.0 | 96545.556217 | 342835.065977 | 0.0 | 592.0 | 4165.0 | 30058.75 | 3255518.0 |
| mystery_thriller_crime | 3789.0 | 59168.214568 | 210601.102517 | 1.0 | 522.0 | 3984.0 | 22034.0 | 2046499.0 |
| poetry | 2185.0 | 44478.507551 | 151734.841123 | 0.0 | 148.0 | 1433.0 | 15270.0 | 1029527.0 |
| romance | 3918.0 | 32528.685299 | 143318.96349 | 1.0 | 333.0 | 1878.5 | 10393.0 | 2078406.0 |
| ya | 4334.0 | 211864.244347 | 652314.359248 | 1.0 | 2863.5 | 19151.0 | 106182.0 | 4899965.0 |


Most interesting here is the mean rating for each genre. They're pretty close together, but Romance has the highest average rating (3.94) and Mystery/Thriller/Crime has the lowest (3.73). It's also interesting to compare the average ratings from these reviews to the Avg_Rating column, which is the overall Goodreads average rating of the book being reviewed. In every genre, the reviews in my sample rate the book slightly lower than its average rating from all reviewers, though the ratings of 0 in my sample pull its means down. Across all Goodreads reviewers (not just this data set), books in the Poetry genre have the highest mean Avg_Rating (4.10) and Mystery/Thriller/Crime books the lowest (3.88). Finally, Ratings_Count shows the number of ratings each book has on Goodreads (again, not just in the UCSD data), so the Young Adult books represented in the UCSD corpus have by far the most ratings on average (211864) and Comics/Graphic books have the fewest (16529). It could be interesting to look at genres with a lower ratings count but a higher review count, which would suggest a more niche genre that appeals to a more specific type of reader.
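
Because the wide `describe()` output truncates its middle columns, it is easy to read a percentile where a mean was intended. Selecting just the means per genre avoids that. This is a sketch on a toy DataFrame with the same column names as `total_df`; the values and the `Gap` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for total_df (hypothetical values)
toy = pd.DataFrame({
    'Category': ['poetry', 'poetry', 'romance', 'romance'],
    'Rating': [5, 3, 4, 0],
    'Avg_Rating': [4.2, 4.0, 3.9, 4.1],
})

# Mean review rating and mean book-level average rating per genre
means = toy.groupby('Category')[['Rating', 'Avg_Rating']].mean()

# Negative gap: sample reviewers rate lower than Goodreads overall
means['Gap'] = means['Rating'] - means['Avg_Rating']
print(means)
```

On the real data, the `Gap` column would show at a glance which genres' sampled reviews sit furthest below their books' overall averages.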

# Analysis
Now that I've FINALLY got my final data set and a sense of its size and makeup, it's time to start analysis! Since I'm looking at overall linguistic differences between reviews for different genres, I want to include as many different linguistic features as I can think of. In this next section, I will be adding those features as additional columns to the DataFrame so I can then analyze their differences between genre categories.

### Tokens

### Token Count

### Word Length

### Sentence Count

### Sentence Length

### TTR (Modified)

### K-Bands

### Punctuation

Question marks, exclamation points, commas, semicolons
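
One possible way to add these counts as columns is sketched below on a toy DataFrame. The column names and the `PUNCT` mapping are assumptions for illustration, not the final feature names. Note that `Series.str.count` treats its pattern as a regular expression, so each mark is escaped first.

```python
import re

import pandas as pd

# Toy stand-in for total_df with only the Text column (made-up reviews)
toy = pd.DataFrame({'Text': ['Loved it! Really, truly loved it!',
                             'Why? No idea; it dragged.']})

# Hypothetical column names mapped to the punctuation mark they count
PUNCT = {
    'Question_Marks': '?',
    'Exclamation_Points': '!',
    'Commas': ',',
    'Semicolons': ';',
}

for column, mark in PUNCT.items():
    # re.escape makes '?' match a literal question mark, not a regex operator
    toy[column] = toy['Text'].str.count(re.escape(mark))

print(toy.drop(columns='Text'))
```

Raw counts like these favor longer reviews, so dividing by a length measure such as token count may be worth considering once that column exists.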