## Goodreads Book Data

### Problem Statement:
- Reading is my biggest hobby. I am interested in knowing which genres tend to have the highest ratings, whether male or female authors usually write specific types of books, what countries they are from, etc. One of my goals is to create a recommended book list for each genre: the must-reads based on ratings. Depending on the depth and time available for my analysis, I may use a second dataset that I already found to match book titles with descriptions.

### Data Source Location:
- The data was found on Kaggle at the following address:
    - https://www.kaggle.com/choobani/goodread-authors?select=final_dataset.csv
	- Secondary dataset that pulls book descriptions:
		- https://www.kaggle.com/meetnaren/goodreads-best-books

### Data Acquisition:
- Data was acquired from a Google search specifically looking for datasets that include author gender and country of origin. I tried to find one that also included a book description so I could scrape for keywords relating to genre, but I could not find a dataset including that along with gender and country. I settled for a dataset that contains gender, country, and genre, and I might use a second dataset to match book titles to descriptions. This data was also found on Kaggle.

### Description of Available Data:
- This data is 209517 rows x 20 columns. The columns are authorid, name, workcount, fan_count, gender, image_url, about, born, died, influence, average_rate, rating_count, review_count, website, twitter, genre, original_hometown, country, latitude, and longitude. Several of the columns have a significant number of null values, though not the columns of most interest to me. The first dataset is 115MB and the potential second is 57.7MB. I know that this is technically larger than the limit provided in the project description. Part of the size comes from columns that are not relevant to my analysis and can be dropped; the main one is the column containing links to images of the authors. Without that column the dataset is less than 100MB. I can remove it before uploading or after loading it into Jupyter, depending on which method you would prefer.
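
As a rough sketch of the "drop it while loading" option, `pd.read_csv` accepts a callable for `usecols`, so an unwanted column never has to be held in memory. The tiny in-memory CSV and the `skip_columns` name below are stand-ins for illustration, not the real file:

```python
import io
import pandas as pd

# Tiny stand-in for final_dataset.csv (the real file has 20 columns).
sample_csv = io.StringIO(
    "authorid,name,image_url,genre\n"
    "1,Author A,http://example.com/a.jpg,fiction\n"
    "2,Author B,http://example.com/b.jpg,fantasy\n"
)

# Columns to skip at load time (hypothetical choice; extend as needed).
skip_columns = {"image_url"}

# usecols accepts a callable that is evaluated against each column name.
small_df = pd.read_csv(sample_csv, usecols=lambda col: col not in skip_columns)
print(small_df.columns.tolist())
```

Skipping the column at load time only reduces memory use. To get a smaller file for uploading, the trimmed dataframe would still need to be written back out, for example with `small_df.to_csv(...)`.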

### Data Usage: 
- This dataset was created using scraped data from Goodreads and was last updated 10 months ago. It was uploaded to Kaggle for use by anyone who is interested.

### Documentation: 
- From what I can tell there are simple visualizations created from the data but no documented processes for EDA, visualization, or analysis. 

### Hypotheses/Questions:
- Men write more science fiction/fantasy than women.
	- What genre do women contribute to the most?
- Women have a higher "follower" count.
- What genres have the most submissions?
- Which genres have the highest ratings/number of reviews?
- Do men or women authors receive higher ratings?
- Can books be classified by select words in the blurbs on the back?

### Code Sources
- https://stackoverflow.com/questions/54135085/create-new-column-based-on-string
- https://davidhamann.de/2017/06/26/pandas-select-elements-by-string/
- https://stackoverflow.com/questions/20076195/what-is-the-most-efficient-way-of-counting-occurrences-in-pandas
- https://stackoverflow.com/questions/53997862/pandas-groupby-two-columns-and-plot
- https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn.displot
- https://stackoverflow.com/questions/52135315/set-axis-maximum-with-seaborn-distplot/52135483
- https://www.kaggle.com/tejainece/seaborn-barplot-and-pandas-value-counts
- https://stackoverflow.com/questions/41494942/pandas-dataframe-groupby-plot
- https://stackoverflow.com/questions/52132970/pandas-how-to-plot-the-pie-diagram-for-the-movie-counts-versus-genre-of-imdb-mo

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale=2.4)
pd.options.mode.chained_assignment = None

In [None]:
# load csv
df = pd.read_csv("final_dataset.csv")

In [None]:
# basic info on dataframe
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
# There are a lot of columns not needed for this analysis so they will be dropped for clarity
df1 = df.drop(columns=['authorid', 'image_url', 'died', 'influence', 'website', 'twitter', 'original_hometown', 'latitude', 'longitude'])

In [None]:
df1.head()

In [None]:
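# dropping null values in the genre category because this is mainly about genre
# .copy() makes df2 an independent dataframe so new columns can be added to it safely later
df2 = df1.dropna(subset=['genre']).copy()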

In [None]:
df2.head()

In [None]:
df2.info()

## Country Analysis

In [None]:
# drop nulls from country for analysis. Not having these does not decrease value of data 
df_country = df1.dropna(subset=['country'])

In [None]:
df_country.info()

In [None]:
df_country.head()

In [None]:
# count the occurrences of each country, create dataframe of top 20, rename columns
country_count  = df_country['country'].value_counts()
country_count = pd.DataFrame(country_count[:20]).reset_index()
country_count.columns = ['country', 'workcount']

In [None]:
country_count

In [None]:
# plot country and count
country_count.plot(kind='bar', title='Countries With the Most Published Works', x='country', y='workcount', figsize=(40,10)) 
plt.ylabel('Number of Entries')
plt.xlabel('Country (Top 20)')
plt.xticks(rotation=45)
plt.show()

In [None]:
# plots pie chart of the top 5 countries (including rows later dropped for genre null values)
# note: the percentages are shares within the top 5 only, not of the whole dataset
df1.country.value_counts().iloc[:5].plot(kind='pie', autopct='%1.0f%%', figsize=(10,10), title="Country Makeup of Published Works (Top 5)")
plt.show()

In [None]:
# create new df that sorts by work count
df3 = df2.sort_values('workcount', ascending=False)
df3.head()

In [None]:
# create new df sorting by review count
df4 = df2.sort_values('review_count', ascending=False)
df4.head()

In [None]:
# plot number of authors per top 20 countries by gender
plt.figure(figsize=(20,10))
sns.countplot(data=df2,x='country',hue='gender', alpha=0.8, order=df2.country.value_counts().iloc[:20].index)
plt.title('Author Gender and Country of Origin')
plt.ylabel('Number of Occurrences')
plt.xlabel('Country')
plt.xticks(rotation=90)
plt.show()

### Conclusions: 
- From this data, the United States, United Kingdom, Canada, Australia, and France have the most published works
- The percentages below come from the top 5 pie chart, so they are shares of the top five countries rather than of the whole dataset
- The United States makes up 60% of the published works
- The United Kingdom makes up 18% of the published works
- Canada makes up 5% of the published works
- France and Germany each make up 4% of the published works
- Of the top 20 countries, only Canada, Australia and Japan have more works by female authors
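
To make the difference between a share of the top slice and a share of all entries concrete, here is a small sketch on made-up data. The country labels and counts are invented for illustration, and the top 3 stand in for the notebook's top 5:

```python
import pandas as pd

# Toy stand-in for the country column (counts are made up for illustration).
countries = pd.Series(
    ["US"] * 10 + ["UK"] * 4 + ["CA"] * 2 + ["AU", "FR", "DE", "JP"] + [None] * 5
)

counts = countries.value_counts()      # nulls are excluded by default
top = counts.iloc[:3]                  # top 3 here; the notebook uses the top 5

share_within_top = top / top.sum()     # what a pie chart of only the top slice shows
share_of_all = top / counts.sum()      # share of every non-null entry

print(share_within_top.round(3))
print(share_of_all.round(3))
```

In this toy data the US is 62.5% of the top slice but 50% of all non-null entries, so which denominator is used changes the headline number.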

## Genre Analysis:

In [None]:
# creates list of most common genres starting with most common, maps the list to the genre column and splits 
# into new column containing one value with main genre
values = ['science fiction', 'fantasy', 'fiction', 'classics', 'chick lit', 'beer', 'horror', 'home', 
          'juvenile', 'feminism', 'engineering', 'origami', 'nursing', 'games', 'knitting', 'ethics', 
          'anatomy', 'journalism', 'illustration', 'plays', 'theatre', 'regency', 'plays', 'computer gaming',
          'construction', 'style guide', 'zoroastrianism', 'card games', 'web development', 
          'personal development', 'innovation', 'adaptation', 'stage plays', 'historical', 'psychology', 
          'manga', 'literature', 'cooking', 'disambiguation', 'philosophy', 'sex', 'culture', 'social studies', 
          'reference', 'cookbooks', 'technology', 'gender', 'graphic novels', 'programming', 'economy', 
          'music', 'sport', 'foreign policy', 'language', 'contemporary', 'dystopian', 'short stories', 
          'criticism', 'investing', 'geology', 'entertainment', 'play', 'pets', 'western', 'suspense', 
          'translation', 'animal', 'economics', 'health', 'architecture', 'archaeology', 'astrology', 
          'interior design', 'linguistics', 'folklore', 'political', 'nature', 'nonfiction', 'biographies', 
          'crime', 'spirituality', 'mystery', 'comics', 'art', 'young adult', 'history', 'comedy', 
          'paranormal', 'romance', 'children', 'poetry', 'business', 'non fiction', 'crafts', 'travel', 
          'computers', 'self help', 'science', 'religious', 'chronicles', 'ufology', 'photography', 
          'mathematics', 'hybrids', 'russia', 'occult', 'physics', 'terrorism', 'parenting', 'gardening', 
          'medicine', 'sociology', 'logic', 'psychotherapy', 'biology', 'zoology', 'theater', 'civil rights', 
          'current events', 'domestic abuse', 'international development', 'marketing', 'pop punk', 'law', 
          'survivalism', 'wine', 'prose', 'metaphysical', 'marx', 'etiquette', 'drama', 'sql', '3d', 'columns',
          'realism', 'movies', 'botony', 'horses', 'news', 'firearms', 'biblical geography', 'film', 'maps', 
          'symbolism', 'botany', 'land reform', 'land reform', 'military', 'relationships', 'international studies',
          'theonomy', 'cyberpunk', 'anthropology', 'existentialism', 'fashion', 'life', 'objectivism', 
          'hagiography', 'paracord', 'cross dressing', 'play', 'self defense', 'nutrition', 'chess', 'outdoors']

conditions = list(map(df2['genre'].str.contains, values))

df2['split_genres'] = np.select(conditions, values, 'other')

df2.head(10)
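
The ordering of the `values` list matters because `np.select` returns the choice belonging to the first condition that is true. A minimal sketch on a toy series (the genre strings and the `assign_main_genre` helper are made up for illustration) shows why 'science fiction' has to come before 'fiction':

```python
import numpy as np
import pandas as pd

# Toy genre strings standing in for the real 'genre' column.
genres = pd.Series(["science fiction,space", "historical fiction", "cookbooks", "poetry"])

def assign_main_genre(series, labels):
    """Map each string to the first label it contains, or 'other' (illustrative helper)."""
    conditions = [series.str.contains(label, regex=False).to_numpy(dtype=bool) for label in labels]
    return np.select(conditions, labels, default="other")

# Specific label first: 'science fiction' wins over the broader 'fiction'.
specific_first = assign_main_genre(genres, ["science fiction", "fiction", "poetry"])
# Broad label first: 'fiction' swallows the science fiction row.
broad_first = assign_main_genre(genres, ["fiction", "science fiction", "poetry"])

print(specific_first.tolist())
print(broad_first.tolist())
```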

In [None]:
# creates new dataframe with just the genres - used to develop most common genre list above 
#df_genres = pd.DataFrame(df2.split_genres.unique())
#df_genres.columns = ['genre_list']

In [None]:
#df_genres

In [None]:
# used to narrow down genres until they got super obscure
# df_genres[df_genres['split_genres'].str.match('other')]

In [None]:
# displays top 20 genres and the number of works in each 
df2.split_genres.value_counts().iloc[:20]

In [None]:
# plots top 20 genres by gender
plt.figure(figsize=(20,10))
sns.countplot(data=df2,x='split_genres',hue='gender', alpha=0.8, order=df2.split_genres.value_counts().iloc[:20].index)
plt.title('Author Gender and Genre')
plt.ylabel('Number of Occurrences')
plt.xlabel('Genre')
plt.xticks(rotation=90)
plt.show()

In [None]:
# creates new dataframe that counts the occurrences of each genre
# rename_axis/reset_index(name=...) yields the same column names on old and new pandas versions
genre_count = df2['split_genres'].value_counts().rename_axis('genre').reset_index(name='count')
genre_count.head(10)

In [None]:
# Plots genre counts
plt.figure(figsize=(10,6))
sns.countplot(data=df2,x='split_genres', alpha=0.8, order=df2.split_genres.value_counts().iloc[:20].index)
plt.title('Most Popular Genres')
plt.ylabel('Number of Titles')
plt.xlabel('Genre (Top 20)')
plt.xticks(rotation=90)
plt.show()

In [None]:
# creates new dataframe grouping genres with their average rating
# sort by highest rating 
genre_ratings = df2.groupby('split_genres', as_index=False)['average_rate'].mean()
genre_ratings = genre_ratings.sort_values('average_rate', ascending=False)

In [None]:
genre_ratings.head()

In [None]:
# plots the top 20 mean ratings for genres
genre_ratings.iloc[:20].plot(x = "split_genres", y = "average_rate", kind = "bar", figsize=(15,8), title="Mean Ratings for Genre (Top 20)")
plt.ylabel('Mean Average Rating')
plt.xlabel('Genre (Top 20)')
plt.show()

In [None]:
genre_ratings.iloc[:20]

In [None]:
# creates new dataframe that groups by genre and fan count, sorts by highest fan count
genre_fan_count = pd.DataFrame(df2.groupby('split_genres')['fan_count'].sum().reset_index().sort_values('fan_count', ascending=False))
genre_fan_count

In [None]:
# plot top 20 fan count for genre
genre_fan_count.iloc[:20].plot(x = "split_genres", y = "fan_count", kind = "bar", figsize=(15,8), title="Fan Count By Genre")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Fans')
plt.xlabel('Genre (Top 20)')
plt.show()

In [None]:
# creates new dataframe that groups by genre and review count, sorts by highest review count
genre_review_count = pd.DataFrame(df2.groupby('split_genres')['review_count'].sum().reset_index().sort_values('review_count', ascending=False))
genre_review_count

In [None]:
# plot top 20 review count for genre
genre_review_count.iloc[:20].plot(x = "split_genres", y = "review_count", kind = "bar", figsize=(15,10), title="Review Count By Genre")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Reviews')
plt.xlabel('Genre (Top 20)')
plt.show()

In [None]:
# shows top 20 review count
genre_review_count.iloc[:20]

In [None]:
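# parse the birth date strings into datetimes
df2['born'] = pd.to_datetime(df2['born'])

In [None]:
# extract the birth year into its own column
df2['year_born'] = df2['born'].dt.year

In [None]:
# keep only authors with a known birth year and store the year as an integer
born_df = df2.dropna(subset=['year_born']).copy()
born_df['year_born'] = born_df.year_born.astype(int)
born_df.head()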

In [None]:
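# count rows per genre and keep the 10 most common
# rename_axis/reset_index(name=...) gives stable column names ('genre', 'count') across pandas versions
born_df_genre = born_df['split_genres'].value_counts().iloc[:10].rename_axis('genre').reset_index(name='count')

In [None]:
born_df_genre['genre'].to_list()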

In [None]:
born_df = born_df[born_df['split_genres'].isin(['fiction','fantasy','biographies','history','mystery','spirituality',
 'poetry','children','philosophy','graphic novels'])]

In [None]:
born_df = born_df.sort_values('year_born').reset_index()
born_df.drop(['index'], inplace=True, axis=1)
born_df.head()

In [None]:
# plots top 10 genres by year born 
plt.figure(figsize=(20,15))
sns.countplot(data=born_df,x='year_born',hue='split_genres', alpha=0.8, order=df2.year_born.value_counts().iloc[:20].index.sort_values(), dodge=False)
plt.title('Count of Authors by Birth Year and Genre')
plt.ylabel('Number Per Birth Year')
plt.xlabel('Author Birth Year')
plt.xticks(rotation=45)
plt.show()

#### Conclusion
- Out of the top 20 genres in this dataset according to number of works, women tend to write more fantasy, mystery, contemporary, children's books, romance, sex, suspense and chick lit books. Science fiction did not make the top 20 according to work count.
- Men wrote more fiction, history, biographies, spiritual books, poetry, philosophy, graphic novels, horror, crime, psychology and art.
- The unknown section includes nonbinary authors as well as those that don't have that data filled in.
- Fiction and fantasy had by far the largest number of works, though neither falls into the top 20 by rating. This is most likely due to the larger number of entries and the likelihood of readers taking the time to rate them: with many entries the mean evens out, whereas a small genre can reach a high mean from only a few entries.
- This can be seen in the fan count: fans are people who consistently follow specific authors, keep up to date with them, and regularly review their work.
- Fiction has the highest number of works throughout the years sampled (top 20 based on number of works), followed mostly by fantasy and then mystery, with a couple of outliers.
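
One way to check the point about small genres dominating the rating ranking is to report the number of entries next to each mean and then apply a minimum-count filter. This is only a sketch on made-up numbers; the `min_authors` threshold is a hypothetical choice, not something taken from the dataset:

```python
import pandas as pd

# Toy data: one tiny genre with a very high rating, two larger genres.
toy = pd.DataFrame({
    "split_genres": ["fiction"] * 4 + ["pop punk"] + ["fantasy"] * 3,
    "average_rate": [3.8, 4.0, 3.6, 3.9, 4.86, 4.1, 3.9, 4.0],
})

# Mean rating and number of entries per genre, highest mean first.
summary = (
    toy.groupby("split_genres")["average_rate"]
       .agg(mean_rating="mean", n_authors="count")
       .sort_values("mean_rating", ascending=False)
)

min_authors = 3  # hypothetical threshold for a "reliable" mean
reliable = summary[summary["n_authors"] >= min_authors]

print(summary)
print(reliable)
```

With the filter applied, the single-entry genre drops out and the ranking reflects genres with enough entries to compare.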

## Author Analysis: 

In [None]:
# plot top 30 author work counts - starts after the first 3 because they are companies, not people
df3.iloc[3:33].plot(kind='bar', title='Top 30 Author Work Count', x='name', y='workcount', figsize=(40,10)) 
plt.ylabel('Number of Published Works')
plt.xlabel('Author (Top 30)')
plt.show()

In [None]:
# plot top 20 authors by number of works skipping first 3 because they are corporations
df2.sort_values('workcount', ascending=False).iloc[3:23].plot(kind='bar', title='Top 20 Authors by Work Count', x='name', y='workcount', figsize=(40,10))
plt.ylabel('Work Count')
plt.xlabel('Author (Top 20 Work Count)')
plt.show()

In [None]:
# plot average rating for an arbitrary sample of 100 authors (rows 1-100 in file order, not sorted by rating)
df2.iloc[1:101].plot(kind='bar', title='Random Sample of Author Ratings', x='name', y='average_rate', figsize=(40,10))
plt.ylabel('Rating')
plt.xlabel('Author (Random 100)')
plt.show()

In [None]:
# plot top 30 review counts
df4.iloc[:30].plot(kind='bar', title='Authors With Most Reviews', x='name', y='review_count', figsize=(40,10))
plt.ylabel('Number of Reviews')
plt.xlabel('Author (Top 30)')
plt.show()

In [None]:
# plot top 20 authors by number of reviews
df2.sort_values('review_count', ascending=False).iloc[:20].plot(kind='bar', title='Top 20 Authors by Review Count', x='name', y='review_count', figsize=(40,10))
plt.ylabel('Review Count')
plt.xlabel('Author (Top 20 Review Count)')
plt.show()

In [None]:
# plot top 20 authors by number of fans 
df2.sort_values('fan_count', ascending=False).iloc[:20].plot(kind='bar', title='Top 20 Authors by Fan Count', x='name', y='fan_count', figsize=(40,10))
plt.ylabel('Fan Count')
plt.xlabel('Author (Top 20 Fan Count)')
plt.show()

### Conclusions:
- William Shakespeare has the highest work count at over 5000. This value is most likely inflated due to translations into different languages being included.
- Stephen King has the highest number of fans
- J.K. Rowling has the highest number of reviews

## Gender Analysis: 

In [None]:
# create new dataframe sorting by average rate
df5 = df2.sort_values('average_rate', ascending=False)
df5.head()

In [None]:
# create new dataframe counting number of authors by gender
gender_count = pd.DataFrame(df2['gender'].value_counts())
gender_count

In [None]:
# plot gender count of authors (using all entries including those dropped for genre null values)
plt.figure(figsize=(20,10))
sns.countplot(data=df1,x='gender', alpha=0.8, order=df2.gender.value_counts().iloc[:3].index)
plt.title('Gender Makeup of Authors')
plt.ylabel('Number of Titles')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# plots pie chart containing percentage of total authors in each gender (including those dropped for genre null values)
df1.gender.value_counts().plot(kind='pie', autopct='%1.0f%%', figsize=(10,10), title="Author Gender Makeup of Published Works")
plt.show()

In [None]:
# plot gender count of authors (dropped genre nulls)
plt.figure(figsize=(20,10))
sns.countplot(data=df2,x='gender', alpha=0.8, order=df2.gender.value_counts().iloc[:3].index)
plt.title('Gender Makeup of Authors')
plt.ylabel('Number of Titles')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# plots pie chart containing percentage of total authors in each gender (dropped genre nulls)
df2.gender.value_counts().plot(kind='pie', autopct='%1.0f%%', figsize=(10,10), title="Author Gender Makeup of Published Works")
plt.show()

In [None]:
# create new dataframe containing only female authors
df_female1 = df1[(df1.gender.str.contains('female'))]
df_female2 = df2[(df2.gender.str.contains('female'))]

In [None]:
df_female1.info()

In [None]:
df_female1.head()

In [None]:
# create new dataframe containing only male authors (str.match anchors at the start, so 'female' is excluded)
df_male1 = df1[(df1.gender.str.match('male'))]
df_male2 = df2[(df2.gender.str.match('male'))]

In [None]:
df_male1.info()

In [None]:
df_male1.head()

In [None]:
# create new dataframe containing only unknown gender authors 
df_unknown1 = df1[(df1.gender.str.contains('unknown'))]
df_unknown2 = df2[(df2.gender.str.contains('unknown'))]

In [None]:
df_unknown1.info()

In [None]:
# sorts df_female by split_genre
df_female2 = df_female2.sort_values('split_genres')

In [None]:
df_female2.head()

In [None]:
# plot distribution of average rating by gender including genre nulls
sns.displot(data=df1, x="average_rate", hue='gender', kde=True, height=10, aspect=2)
plt.title("Average Rating Distribution by Gender (With Genre Nulls)")
plt.xlim(2.5,5)

In [None]:
# plot distribution of average rating by gender without genre nulls
sns.displot(data=df2, x="average_rate", hue='gender', kde=True, height=10, aspect=2)
plt.title("Average Rating Distribution by Gender Without Genre Nulls")
plt.xlim(2.5,5)

In [None]:
# creates new dataframe grouping by gender and counting reviews using df including genre nulls
review_count1 = pd.DataFrame(df1.groupby('gender')['review_count'].sum().reset_index())
review_count1

In [None]:
# creates new dataframe grouping by gender and counting reviews without genre nulls
review_count2 = pd.DataFrame(df2.groupby('gender')['review_count'].sum().reset_index())
review_count2

In [None]:
# plot review count by gender - including genre nulls
review_count1.plot(x = "gender", y = "review_count", kind = "bar", figsize=(15,10), title="Review Count by Gender")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Reviews')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# plot review count by gender without genre nulls
review_count2.plot(x = "gender", y = "review_count", kind = "bar", figsize=(15,10), title="Review Count by Gender")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Reviews')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# creates new dataframe grouping by gender and total fan count - with genre nulls
fan_count1 = pd.DataFrame(df1.groupby('gender')['fan_count'].sum().reset_index())
fan_count1

In [None]:
# creates new dataframe grouping by gender and total fan count - without genre nulls
fan_count2 = pd.DataFrame(df2.groupby('gender')['fan_count'].sum().reset_index())
fan_count2

In [None]:
# plot fan count by gender - with genre nulls
fan_count1.plot(x = "gender", y = "fan_count", kind = "bar", figsize=(15,10), title="Fan Count By Gender")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Fans')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# plot fan count by gender - without genre nulls
fan_count2.plot(x = "gender", y = "fan_count", kind = "bar", figsize=(15,10), title="Fan Count By Gender")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Fans')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# creates new dataframe grouped by gender and total work count - with genre nulls
work_count1 = pd.DataFrame(df1.groupby('gender')['workcount'].sum().reset_index())
work_count1

In [None]:
# creates new dataframe grouped by gender and total work count - without genre nulls
work_count2 = pd.DataFrame(df2.groupby('gender')['workcount'].sum().reset_index())
work_count2

In [None]:
# plot work count by gender - with genre nulls
work_count1.plot(x = "gender", y = "workcount", kind = "bar", figsize=(15,10), title="Work Count by Gender")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Published Works')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# plot work count by gender - without genre nulls
work_count2.plot(x = "gender", y = "workcount", kind = "bar", figsize=(15,10), title="Work Count by Gender")
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
plt.ylabel('Number of Published Works')
plt.xlabel('Gender')
plt.xticks(rotation=45)
plt.show()

In [None]:
# creates new dataframe that counts number of works per genre for females
# rename_axis/reset_index(name=...) gives stable column names ('genre', 'count') across pandas versions
female_genre_count = df_female2['split_genres'].value_counts().rename_axis('genre').reset_index(name='count')
female_genre_count.head(30)

In [None]:
female_genre_count[female_genre_count['genre'].str.contains("science fiction")]

In [None]:
# creates new dataframe that counts number of works per genre for males
# rename_axis/reset_index(name=...) gives stable column names ('genre', 'count') across pandas versions
male_genre_count = df_male2['split_genres'].value_counts().rename_axis('genre').reset_index(name='count')
male_genre_count.head(10)

In [None]:
male_genre_count[male_genre_count['genre'].str.contains("science fiction")]

In [None]:
# creates new dataframe that counts number of works per genre for unknown gender
# rename_axis/reset_index(name=...) gives stable column names ('genre', 'count') across pandas versions
unknown_genre_count = df_unknown2['split_genres'].value_counts().rename_axis('genre').reset_index(name='count')
unknown_genre_count.head(20)

In [None]:
unknown_genre_count[unknown_genre_count['genre'].str.contains("science fiction")]

#### Conclusion
- According to this data, women wrote double the number of science fiction entries compared to men. The result is interesting, though it carries less weight because the data is aggregated from user-contributed Goodreads profiles.
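
The three separate per-gender lookups above could also be read off a single table. The sketch below uses `pd.crosstab` on made-up rows; the column names match the notebook, but the data is invented for illustration:

```python
import pandas as pd

# Toy rows standing in for df2 (values are made up).
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male", "unknown"],
    "split_genres": ["science fiction", "fantasy", "science fiction",
                     "fiction", "fiction", "fiction"],
})

# Raw counts: one row per genre, one column per gender.
genre_by_gender = pd.crosstab(toy["split_genres"], toy["gender"])

# Share of each genre's entries contributed by each gender (rows sum to 1).
row_share = pd.crosstab(toy["split_genres"], toy["gender"], normalize="index")

print(genre_by_gender)
print(row_share.round(2))
```

Looking at shares alongside raw counts helps when the gender groups differ in size.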

In [None]:
# descriptive statistics for female average rating - with genre nulls
female_stats1 = df_female1['average_rate'].describe()
female_stats1

In [None]:
# descriptive statistics for female average rating - without genre nulls
female_stats2 = df_female2['average_rate'].describe()
female_stats2

In [None]:
# descriptive statistics for male average rating - with genre nulls
male_stats1 = df_male1['average_rate'].describe()
male_stats1

In [None]:
# descriptive statistics for male average rating - without genre nulls
male_stats2 = df_male2['average_rate'].describe()
male_stats2

In [None]:
# descriptive statistics for unknown gender average rating - with genre nulls
unknown_stats1 = df_unknown1['average_rate'].describe()
unknown_stats1

In [None]:
# descriptive statistics for unknown gender average rating - without genre nulls
unknown_stats2 = df_unknown2['average_rate'].describe()
unknown_stats2

### Conclusion: 
- According to this dataset, the highest number of works (58%) are written by authors of unknown gender. This category includes nonbinary authors as well as those missing gender entries in their profiles. Males wrote 22% and females wrote 20%.
- If we use the dataset that drops the null values in genre, which is the most beneficial for genre analysis, males wrote 42%, females wrote 41%, and unknown wrote 17%.
- Average ratings appear to be approximately normally distributed for all gender categories, with and without the genre null values
- Females have a higher review and fan count despite men having more published works
- Men have a higher work count (except when using the dataframe that did not drop genre null values)
- Men, women, and unknown all have a 3.8 average rating with genre nulls and 3.9 without
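
Since the review and work totals above are compared across groups of different sizes, a per-gender summary that also normalizes reviews by work count may be easier to read. This is a sketch on invented numbers, not output from the dataset:

```python
import pandas as pd

# Toy rows standing in for df2 (values are made up).
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "unknown"],
    "average_rate": [4.0, 3.8, 3.7, 3.9, 3.8],
    "review_count": [120, 80, 50, 30, 10],
    "workcount": [5, 3, 10, 6, 2],
})

# One row per gender with rating, review, and work totals side by side.
by_gender = toy.groupby("gender").agg(
    mean_rating=("average_rate", "mean"),
    median_rating=("average_rate", "median"),
    total_reviews=("review_count", "sum"),
    total_works=("workcount", "sum"),
    n_authors=("average_rate", "size"),
)

# Reviews per published work, so groups with more works are not favored.
by_gender["reviews_per_work"] = by_gender["total_reviews"] / by_gender["total_works"]
print(by_gender)
```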

# Final Conclusions:
## Hypotheses/Questions:
### Men write more science fiction/fantasy than women.
- False. I find this very interesting as it is one of my favorite genres and I assumed men to be the primary contributors. Here is an interesting article on it:
    - https://www.wired.com/2019/02/geeks-guide-history-women-sci-fi/
    
### What genre do women contribute to the most?
- Fantasy, mystery, contemporary, children's books, romance, sex, suspense and chick lit books

### Women have higher "follower" count
- True. I assumed this because, while men contributed more works, women tend to read more books by women and support them by leaving more reviews and keeping up with the authors by becoming a 'fan' on Goodreads.

### What genres have the most submissions?
- Top 20
    - fiction           26437
    - fantasy            8994
    - mystery            3171
    - history            2940
    - biographies        2939
    - spirituality       2526
    - contemporary       2484
    - children           2120
    - romance            1914
    - young adult        1743
    - sex                1555
    - suspense           1467
    - poetry             1245
    - philosophy         1228
    - graphic novels     1141
    - horror              915
    - crime               861
    - chick lit           812
    - psychology          694
    - art                 621
    
### Which genres have the highest ratings/number of reviews?
- Top 20 Average Ratings
    - pop punk                     4.860000
    - metaphysical                 4.720000
    - theonomy                     4.670000
    - objectivism                  4.345000
    - firearms                     4.325000
    - cross dressing               4.290000
    - survivalism                  4.260000
    - international development    4.250000
    - nursing                      4.235000
    - current events               4.230000
    - beer                         4.220000
    - land reform                  4.210000
    - news                         4.200000
    - style guide                  4.190000
    - self defense                 4.190000
    - horses                       4.180000
    - photography                  4.175833
    - domestic abuse               4.170000
    - zoroastrianism               4.145000
    - spirituality                 4.097937
- Top 20 Review Counts
    - fiction           36443815
    - fantasy           18965848
    - mystery            4988633
    - contemporary       3482044
    - young adult        3287134
    - children           2567303
    - romance            2323639
    - graphic novels     2005191
    - suspense           1863630
    - biographies        1729650
    - history            1653643
    - sex                1613801
    - spirituality       1208409
    - horror             1085838
    - crime               873758
    - chick lit           868076
    - philosophy          562000
    - comedy              439009
    - poetry              310632
    - psychology          283842
    
### Do men or women authors receive higher ratings?
- Men, women, and unknown all have an average rating of 3.8 with genre nulls and 3.9 without

### Can books be classified by select words in the blurbs on the back?
- Unknown. I think this would require work beyond what I currently know how to do or could teach myself. Attempting to find genre keywords in the descriptions might work, but not with this dataset, as it is author data and not book data.
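
If book-level descriptions were available (for example from the secondary dataset), a first attempt at the keyword idea could look like the sketch below. The keyword lists and the `guess_genre` function are hypothetical, chosen only to illustrate the approach. A real list would have to be curated from the blurbs themselves, and this has not been run on any actual Goodreads data:

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real list would need to be curated from the data.
GENRE_KEYWORDS = {
    "science fiction": {"spaceship", "alien", "galaxy", "robot"},
    "fantasy": {"dragon", "wizard", "kingdom", "magic"},
    "mystery": {"detective", "murder", "clue", "suspect"},
}

def guess_genre(blurb, keywords=GENRE_KEYWORDS):
    """Return the genre whose keywords appear most often in the blurb, or 'other'."""
    words = Counter(re.findall(r"[a-z']+", blurb.lower()))
    scores = {genre: sum(words[w] for w in vocab) for genre, vocab in keywords.items()}
    # On a tie, max() keeps the genre listed first in the dictionary.
    best_genre, best_score = max(scores.items(), key=lambda item: item[1])
    return best_genre if best_score > 0 else "other"

print(guess_genre("A young wizard must defend the kingdom from a dragon."))
print(guess_genre("A memoir about growing up on a farm."))
```

Simple keyword counting like this ignores context, so it would likely serve only as a baseline to compare against the genre labels already in the data.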

### Section Conclusions:
#### Country
- From this data, the United States, United Kingdom, Canada, Australia, and France have the most published works
- Of the top 20 countries, only Canada, Australia and Japan have more works by female authors
- The percentages below come from the top 5 pie chart, so they are shares of the top five countries rather than of the whole dataset
- The United States makes up 60% of the published works
- The United Kingdom makes up 18% of the published works
- Canada makes up 5% of the published works
- France and Germany each make up 4% of the published works

#### Genre
- Out of the top 20 genres in this dataset according to number of works, women tend to write more fantasy, mystery, contemporary, children's books, romance, sex, suspense and chick lit books. Science fiction did not make the top 20 according to work count.
- Men wrote more fiction, history, biographies, spiritual books, poetry, philosophy, graphic novels, horror, crime, psychology and art.
- The unknown section includes nonbinary authors as well as those that don't have that data filled in.
- Fiction and fantasy had by far the largest number of works, though neither falls into the top 20 by rating. This is most likely due to the larger number of entries and the likelihood of readers taking the time to rate them: with many entries the mean evens out, whereas a small genre can reach a high mean from only a few entries.
- This can be seen in the fan count: fans are people who consistently follow specific authors, keep up to date with them, and regularly review their work.
- Fiction has the highest number of works throughout the years sampled (top 20 based on number of works), followed mostly by fantasy and then mystery, with a couple of outliers.

#### Author
- William Shakespeare has the highest work count at over 5000. This value is most likely inflated due to translations into different languages being included.
- Stephen King has the highest number of fans
- J.K. Rowling has the highest number of reviews

#### Gender
- According to this dataset, the highest number of works (58%) are written by authors of unknown gender. This category includes nonbinary authors as well as those missing gender entries in their profiles. Males wrote 22% and females wrote 20%.
- If we use the dataset that drops the null values in genre, which is the most beneficial for genre analysis, males wrote 42%, females wrote 41%, and unknown wrote 17%.
- Average ratings appear to be approximately normally distributed for all gender categories, with and without the genre null values
- Females have a higher review and fan count despite men having more published works
- Men have a higher work count (except when using the dataframe that did not drop genre null values)
- Men, women, and unknown all have a 3.8 average rating with genre nulls and 3.9 without