## Hypothesis Testing

In [1]:
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import statsmodels.stats

In [2]:
data = pd.read_csv('C:\\Users\\juhic\\OneDrive\\Desktop\\master_dataset.csv')
drop_cols = ['Unnamed: 0','author_avg_rating','author_work_count','log2_author_work_count','page_count','genre']
data.drop(axis = 1, columns = drop_cols, inplace = True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34138 entries, 0 to 34137
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   book            34138 non-null  object
 1   author          34138 non-null  object
 2   rating_count    34138 non-null  int64 
 3   is_volume       34138 non-null  object
 4   author_sex      34138 non-null  object
 5   author_exp      34138 non-null  object
 6   book_size       34138 non-null  object
 7   genre_category  34138 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.1+ MB


#### Type Conversions

In [3]:
data = data.astype({'is_volume':'category',
                    'author_sex':'category',
                    'author_exp':'category',
                    'book_size':'category',
                    'genre_category':'category'})
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34138 entries, 0 to 34137
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   book            34138 non-null  object  
 1   author          34138 non-null  object  
 2   rating_count    34138 non-null  int64   
 3   is_volume       34138 non-null  category
 4   author_sex      34138 non-null  category
 5   author_exp      34138 non-null  category
 6   book_size       34138 non-null  category
 7   genre_category  34138 non-null  category
dtypes: category(5), int64(1), object(2)
memory usage: 967.6+ KB


In [4]:
data.head()

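|   | book | author | rating_count | is_volume | author_sex | author_exp | book_size | genre_category |
|---|------|--------|--------------|-----------|------------|------------|-----------|----------------|
| 0 | inner circle | kate brian | 7597 | yes | female | average | average | fiction |
| 1 | ambition | kate brian | 6719 | yes | female | average | average | fiction |
| 2 | revelation | kate brian | 7431 | yes | female | average | average | fiction |
| 3 | legacy | kate brian | 7010 | yes | female | average | average | fiction |
| 4 | vanished | kate brian | 3724 | yes | female | average | average | fiction |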


#### Hypothesis 1:

The publisher wants to know whether rating counts differ significantly between male and female authors. If they do, books from the group with the higher rating counts would be prioritized over the other.

<center> Null Hypothesis: Average rating counts of male and female authors are equal </center>
<center> $\mu_{male} = \mu_{female}$ </center>

Test:<br></br>
Independent two-tailed Z-test comparing the mean rating count of the two groups. As called below with its defaults, `ws.ztest` pools the two sample variances (`usevar='pooled'`) rather than treating them as unequal.


Outcome:<br></br>
p-value of the sample under the Null Hypothesis <br></br>
At the 5% level of significance, we reject the null hypothesis if the p-value is below 0.05.
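
To make the test statistic concrete, here is a minimal standard-library sketch of a two-sample Z-test with pooled variance. The function name `two_sample_ztest` and the toy samples are hypothetical, and the pooled formula is an assumption about what `ws.ztest` does by default, not a copy of its implementation.

```python
import math
from statistics import mean


def two_sample_ztest(x1, x2, value=0.0, alternative="two-sided"):
    """Pooled-variance two-sample Z-test (illustrative sketch only)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    # Sums of squared deviations around each group mean
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (m1 - m2 - value) / std_err
    # Upper-tail probability P(Z > z) of the standard normal
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))
    if alternative == "larger":
        p = upper_tail
    elif alternative == "smaller":
        p = 1.0 - upper_tail
    else:  # two-sided
        p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical toy samples, not taken from the dataset
z_stat, p_value = two_sample_ztest([2, 4, 6, 8], [1, 3, 5, 7])
print(round(z_stat, 4), round(p_value, 4))
```

A p-value below 0.05 from such a call would lead to rejecting the null hypothesis of equal means; otherwise we fail to reject it.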

In [5]:
d0 = data.groupby(by = 'author').agg({'rating_count':'mean'}).reset_index()

d1 = data[['author','author_sex']].drop_duplicates()
d11 = pd.merge(d0, d1, on = 'author', how = 'inner')

d11.head()

|   | author | rating_count | author_sex |
|---|--------|--------------|------------|
| 0 | 50 cent | 7329.0 | male |
| 1 | a. kirk | 5272.666667 | female |
| 2 | a. manette ansay | 14418.0 | female |
| 3 | a. meredith walters | 13256.5 | female |
| 4 | a. merritt | 864.0 | male |
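
The merge above gives one row per author only if each author carries a single `author_sex` label. The sketch below, on hypothetical toy data, shows one way to check that assumption before testing: list authors with conflicting labels, and let `pd.merge` enforce a one-to-one match through its `validate` argument.

```python
import pandas as pd

# Hypothetical toy data: author "a" appears with two different labels
books = pd.DataFrame({
    "author": ["a", "a", "b", "b"],
    "author_sex": ["female", "male", "male", "male"],
    "rating_count": [10, 30, 5, 15],
})

per_author = books.groupby("author", as_index=False)["rating_count"].mean()
labels = books[["author", "author_sex"]].drop_duplicates()

# Authors with more than one label would appear twice after the merge
conflicting = labels.loc[labels["author"].duplicated(keep=False), "author"].unique()
print(list(conflicting))

# validate="one_to_one" raises MergeError when a key repeats on either side
try:
    pd.merge(per_author, labels, on="author", how="inner", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(merge_ok)
```

If the conflicting list is empty on the real data, the per-author samples passed to the Z-test contain each author exactly once.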


In [6]:
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d11.loc[d11['author_sex'] == 'male','rating_count'],
         x2 = d11.loc[d11['author_sex'] == 'female', 'rating_count'],
         value = 0)

(0.24638958656882956, 0.805380666531034)

#### Hypothesis 2

The publisher wants to know whether new-comer authors (fewer than 16 published works) have a rating count similar to that of an average author (an average number of published works). This helps avoid the new-comer trap, where a publisher rejects a book solely because the author has no experience. <br></br>
The work count gives the number of works (books, articles, revisions, etc.) by an author. It can be classified into 3 bins, which appear in the `author_exp` column:
- New-comers (`newbie`): <16 works (or <2 std. dev. from the mean)
- Average (`average`): within 1 std. dev. of the mean (both sides)
- Legendary (`experienced`): more than 1 std. dev. above the mean
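
The notebook loads the bins ready-made in `author_exp`, so the exact rule used to build them is not shown. As a rough illustration only, the sketch below labels counts by their distance from the mean in standard deviations; the function `bin_by_std`, the symmetric one-standard-deviation cutoffs, and the toy counts are all assumptions.

```python
from statistics import mean, pstdev


def bin_by_std(counts, low="newbie", mid="average", high="experienced"):
    """Label each count by its position relative to mean +/- 1 std. dev."""
    mu = mean(counts)
    sigma = pstdev(counts)
    labels = []
    for c in counts:
        if c < mu - sigma:
            labels.append(low)
        elif c > mu + sigma:
            labels.append(high)
        else:
            labels.append(mid)
    return labels


# Hypothetical work counts, not taken from the dataset
work_counts = [1, 10, 10, 10, 10, 10, 19]
print(bin_by_std(work_counts))
```

For heavily right-skewed counts, `mu - sigma` can fall below zero and leave the lowest bin empty, which may be one reason to prefer a fixed cutoff such as the 16 works mentioned above.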

<center> Null Hypothesis: Average rating count of a new-comer author is equal to that of an average author. </center>

Test: <br></br>
Independent two-tailed Z-test comparing the mean rating count of New-comers and Average authors.

Outcome: <br></br>
p-value of the sample under the Null Hypothesis <br></br>
At the 5% level of significance, we reject the null hypothesis if the p-value is below 0.05.


In [7]:
d2 = data[['author','author_exp']].drop_duplicates()
d22 = pd.merge(d0, d2, on = 'author', how = 'inner')

d22.head()

|   | author | rating_count | author_exp |
|---|--------|--------------|------------|
| 0 | 50 cent | 7329.0 | average |
| 1 | a. kirk | 5272.666667 | newbie |
| 2 | a. manette ansay | 14418.0 | newbie |
| 3 | a. meredith walters | 13256.5 | average |
| 4 | a. merritt | 864.0 | average |


In [8]:
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d22.loc[d22['author_exp'] == 'newbie','rating_count'],
         x2 = d22.loc[d22['author_exp'] == 'average', 'rating_count'],
         value = 0)

(1.0160269164120113, 0.3096165691549334)

#### Grouping Hypothesis 1 & 2

It can also make sense to view the combined effect of author gender and author work experience (combining Hypotheses 1 & 2) on average rating count. We are not sure how this will play out, so any insight found here is worth sharing.

In [9]:
d3 = pd.merge(d11, d2, on = 'author', how = 'inner')
d3.head()

|   | author | rating_count | author_sex | author_exp |
|---|--------|--------------|------------|------------|
| 0 | 50 cent | 7329.0 | male | average |
| 1 | a. kirk | 5272.666667 | female | newbie |
| 2 | a. manette ansay | 14418.0 | female | newbie |
| 3 | a. meredith walters | 13256.5 | female | average |
| 4 | a. merritt | 864.0 | male | average |


In [10]:
# Male (Newbie v/s Average)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'newbie'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'average'),'rating_count'],
         value = 0)

(0.17405146915844733, 0.8618250129469954)

In [11]:
# Male (Newbie v/s Experienced)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'newbie'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'experienced'),'rating_count'],
         value = 0)

(-3.629932952819773, 0.00028349484825117516)

In [12]:
# Male (Average v/s Experienced)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'average'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'experienced'),'rating_count'],
         value = 0)

(-5.413302258270755, 6.187291596702804e-08)

In [13]:
# Female (Newbie v/s Average)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'newbie'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'average'),'rating_count'],
         value = 0)

(1.0663869925052734, 0.2862487392291887)

In [14]:
# Female (Newbie v/s Experienced)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'newbie'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'experienced'),'rating_count'],
         value = 0)

(-2.3597119117780965, 0.018289131843670016)

In [15]:
# Female (Average v/s Experienced)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'average'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'experienced'),'rating_count'],
         value = 0)

(-3.941984966492798, 8.081004281510416e-05)

In [16]:
# Newbie (Male v/s Female)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'newbie'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'newbie'),'rating_count'],
         value = 0)

(-0.5514378076087788, 0.5813335891758067)

In [17]:
# Average (Male v/s Female)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'average'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'average'),'rating_count'],
         value = 0)

(0.17127906652047378, 0.8640043397253947)

In [18]:
# Experienced (Male v/s Female)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = d3.loc[(d3['author_sex'] == 'male') & (d3['author_exp'] == 'experienced'),'rating_count'],
         x2 = d3.loc[(d3['author_sex'] == 'female') & (d3['author_exp'] == 'experienced'),'rating_count'],
         value = 0)

(-0.9477706728153442, 0.34324621356225427)
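
Nine pairwise tests were run in this grouping, each at the 5% level, so the chance of at least one false positive is higher than 5%. As one possible safeguard (not part of the original analysis), the sketch below applies a Holm step-down adjustment to the nine p-values reported above. The helper `holm_adjust` is a hypothetical name written here in plain Python.

```python
def holm_adjust(pvalues):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# p-values copied from the outputs of In [10] to In [18]
reported = {
    "male: newbie vs average": 0.8618250129469954,
    "male: newbie vs experienced": 0.00028349484825117516,
    "male: average vs experienced": 6.187291596702804e-08,
    "female: newbie vs average": 0.2862487392291887,
    "female: newbie vs experienced": 0.018289131843670016,
    "female: average vs experienced": 8.081004281510416e-05,
    "newbie: male vs female": 0.5813335891758067,
    "average: male vs female": 0.8640043397253947,
    "experienced: male vs female": 0.34324621356225427,
}

adjusted = holm_adjust(list(reported.values()))
significant = [name for name, p in zip(reported, adjusted) if p < 0.05]
print(significant)
```

Under this adjustment the three comparisons against experienced male and average female authors with the smallest raw p-values stay below 0.05, while the female newbie v/s experienced comparison (raw p of about 0.018) does not.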

#### Hypothesis 3

The publisher receives books ranging from 1 page to more than 10K pages. Bulky books take a long time to review, and if they do not sell more, that effort is wasted. The publisher wants to see whether bulkier books have a lower rating count on average than average-sized books. <br></br>
Books can be classified into 3 bins: <br></br>
- Light: <80 pages (or <2 std. dev. from the mean)
- Average: within 1 std. dev. of the mean page count (both sides)
- Bulky: more than 1 std. dev. above the mean page count

<center> Null Hypothesis: Average rating count of a bulky book is greater than or equal to that of an average size book. </center>

Test: <br></br>
Independent one-tailed Z-test comparing the mean rating count of Bulky and Average-sized books.

Outcome: <br></br>
p-value of the sample under the Null Hypothesis <br></br>
At the 5% level of significance, we reject the null hypothesis if the p-value is below 0.05.


In [26]:
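# Bulky v/s Average
# Note: alternative = 'larger' tests H1: mean(bulky) > mean(average).
# The null stated above (bulky >= average) pairs with alternative = 'smaller',
# whose p-value is 1 minus the value returned here.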
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[data['book_size'] == 'bulky','rating_count'],
         x2 = data.loc[data['book_size'] == 'average','rating_count'],
         value = 0, alternative = 'larger')

(6.046462467727214, 7.40303518575635e-10)

In [29]:
# Light v/s Average
# Note: alternative = 'larger' tests H1: mean(light) > mean(average)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[data['book_size'] == 'light','rating_count'],
         x2 = data.loc[data['book_size'] == 'average','rating_count'],
         value = 0, alternative = 'larger')

(-3.1090357422401165, 0.9990615049482333)
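
The direction of a one-tailed test is easy to mix up, so here is a small standard-library sketch of how the two one-sided p-values relate to a z statistic. It reuses the statistic reported for the bulky v/s average comparison; the helper `normal_upper_tail` is a hypothetical name.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# z statistic copied from the bulky v/s average output above
z_bulky_vs_average = 6.046462467727214

# alternative = 'larger': evidence that mean(bulky) > mean(average)
p_larger = normal_upper_tail(z_bulky_vs_average)
# alternative = 'smaller': evidence that mean(bulky) < mean(average)
p_smaller = 1.0 - p_larger

print(p_larger, p_smaller)
```

A large positive z therefore gives a tiny p-value under `'larger'` and a p-value close to 1 under `'smaller'`, so the sample gives no support for bulky books having a lower mean rating count than average-sized ones.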

#### Hypothesis 4

The genre of a book is an obvious factor that can have a large effect on the rating count. The most popular genres are Fiction and Non-Fiction (we club the rest into 'others'). Intuitively, most people prefer reading fiction over non-fiction and others, and the publisher wants to know whether this is indeed true. If the rating count of fiction turns out not to be greater than that of non-fiction (or others), then the publisher's bias toward fiction books is ill-founded.

<center> Null Hypothesis: Average rating count of Fiction books is greater than or equal to that of non-fiction & others. </center>

Test: <br></br>
Two pairwise one-tailed Z-tests comparing means: Fiction v/s Non-Fiction, and Fiction v/s Others.

Outcome: <br></br>
p-value of the sample under the Null Hypothesis in both tests <br></br>
At the 5% level of significance, we reject the null hypothesis if the p-value is below 0.05.


In [33]:
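# Fiction v/s Non-Fiction
# Note: alternative = 'larger' tests H1: mean(fiction) > mean(non-fiction).
# The null stated above (fiction >= non-fiction) pairs with alternative = 'smaller',
# whose p-value is 1 minus the value returned here.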
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[data['genre_category'] == 'fiction','rating_count'],
         x2 = data.loc[data['genre_category'] == 'non-fiction','rating_count'],
         value = 0, alternative = 'larger')

(5.138707626988176, 1.38317211447273e-07)

In [34]:
# Fiction v/s Others
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[data['genre_category'] == 'fiction','rating_count'],
         x2 = data.loc[data['genre_category'] == 'others','rating_count'],
         value = 0, alternative = 'larger')

(9.841139078088977, 3.742921234655322e-23)

In [35]:
# Non-Fiction v/s Others
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[data['genre_category'] == 'non-fiction','rating_count'],
         x2 = data.loc[data['genre_category'] == 'others','rating_count'],
         value = 0, alternative = 'larger')

(4.597470935849239, 2.138251304452644e-06)

#### Grouping Hypothesis 3 & 4

Bulkier books may have a different selling pattern across fiction and non-fiction genres. We therefore combine Hypotheses 3 & 4 and check the joint effect of book genre and size on average rating count.
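
Before testing the combined groups, it can help to look at how many books fall into each size and genre cell and what their mean rating count is, since small cells make a Z-test less reliable. The sketch below does this with a `groupby` on hypothetical toy data that reuses the notebook's column names.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    "book_size": ["bulky", "bulky", "average", "average", "bulky", "average"],
    "genre_category": ["fiction", "fiction", "fiction",
                       "non-fiction", "non-fiction", "non-fiction"],
    "rating_count": [100, 300, 50, 70, 40, 90],
})

# Number of books and mean rating count per (size, genre) cell
summary = (
    toy.groupby(["book_size", "genre_category"])["rating_count"]
       .agg(["count", "mean"])
)
print(summary)
```

On the real data the same call on `data` would show the cell sizes behind each of the pairwise comparisons that follow.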

In [38]:
# Bulky (Fiction v/s Non-fiction)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[(data['book_size'] == 'bulky') & (data['genre_category'] == 'fiction'),'rating_count'],
         x2 = data.loc[(data['book_size'] == 'bulky') & (data['genre_category'] == 'non-fiction'),'rating_count'],
         value = 0, alternative = 'larger')

(3.811571012432866, 6.904318644528193e-05)

In [47]:
# Fiction (Bulky v/s Average)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[(data['book_size'] == 'bulky') & (data['genre_category'] == 'fiction'),'rating_count'],
         x2 = data.loc[(data['book_size'] == 'average') & (data['genre_category'] == 'fiction'),'rating_count'],
         value = 0, alternative = 'larger')

(5.515402167995858, 1.7399163947011153e-08)

In [50]:
# Non-Fiction (Bulky v/s Average)
from statsmodels.stats import weightstats as ws
ws.ztest(x1 = data.loc[(data['book_size'] == 'bulky') & (data['genre_category'] == 'non-fiction'),'rating_count'],
         x2 = data.loc[(data['book_size'] == 'average') & (data['genre_category'] == 'non-fiction'),'rating_count'],
         value = 0, alternative = 'two-sided')

(-0.6092140276178453, 0.5423825835260467)