<a href="https://colab.research.google.com/github/Rishabhyadav888/Book-recommendation-system/blob/main/Book_recommendation_system_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Book Recommendation System



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

This project develops a book recommendation system that uses machine learning techniques to suggest books based on users' reading history, ratings, and other factors. The work covers exploratory data analysis (EDA) and data wrangling, data visualization, hypothesis testing, and model building.
During EDA and data wrangling, the data was cleaned and prepared for analysis. The visualizations explore books, ratings, users' ages, authors, users' countries, publishers, and publication years.

In the hypothesis testing process, statistical tests were performed on three hypotheses about users' behavior and preferences: that the average number of books read per user is more than 10, that the average age of users who read more than 50 books is 35, and that the 100 most-read books have an average rating of more than 8. The null hypothesis was rejected for the first two tests. For the third, there was not sufficient evidence to reject the null hypothesis, so the top 100 books read by most users do not necessarily have an average rating of more than 8.

In the data preprocessing step, a final dataset was created by selecting books with an average rating of 5 or more that were read by at least 50 users. Users who had read and rated at least 50 books were also selected, and a pivot table of users and books was created, with null values filled with zero.

Finally, the recommendation system was developed using collaborative filtering, which utilized cosine similarity and a nearest neighbor model. Both models performed well, and the same recommendations were given by both models. Overall, the book recommendation system developed through this project can help users discover new books that they will enjoy, saving them time and increasing their overall enjoyment of reading.
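
The pivot-table and cosine-similarity steps described above can be sketched on toy data. This is only an illustration under stated assumptions: the ratings, the `similar_books` helper, and the tiny table are made up, the popularity thresholds are omitted, and the similarity is computed with NumPy rather than with scikit-learn's `cosine_similarity`.

```python
import numpy as np
import pandas as pd

# Toy ratings; user ids, titles, and scores are made up for illustration.
toy = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 3, 3, 3],
    "book_title":  ["A", "B", "A", "B", "A", "C", "B"],
    "book_rating": [9, 8, 10, 7, 3, 9, 2],
})

# Books as rows, users as columns; unrated cells are filled with 0.
pivot = toy.pivot_table(index="book_title", columns="user_id",
                        values="book_rating").fillna(0)

# Cosine similarity: normalise each book vector, then take dot products.
vectors = pivot.to_numpy(dtype=float)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = pd.DataFrame(unit @ unit.T, index=pivot.index, columns=pivot.index)

def similar_books(title, k=2):
    """Return the k books most similar to `title`, excluding the book itself."""
    return sim[title].drop(title).sort_values(ascending=False).head(k).index.tolist()

print(similar_books("A", k=1))
```

Books "A" and "B" are rated alike by the same users, so "B" comes out as the nearest neighbour of "A" in this toy table.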


# **GitHub Link -**

https://github.com/Rishabhyadav888/Book-recommendation-system

# **Problem Statement**


Many book lovers often find it challenging to discover new books that suit their taste, and they rely on recommendations from friends or online reviews, which can be overwhelming and time-consuming. To address this issue, we aim to develop a book recommendation system that leverages machine learning techniques to suggest books that match users' preferences.

The primary objective of this project is to build a personalized book recommendation system that takes into account users' reading history, ratings, and other relevant factors such as author, publisher, and publication year. The system should provide accurate and diverse book recommendations that cater to users' specific interests and preferences.

The goal of this system is to provide personalized recommendations that will help readers discover new books they will enjoy and increase the likelihood of them continuing to read more books.

By building such a system, we can save time for book lovers and provide them with a more diverse range of books to choose from, ultimately increasing the overall enjoyment of reading.







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import math

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Statistical tests: norm for Z-tests, ttest_1samp for one-sample t-tests
from scipy import stats
from scipy.stats import norm, ttest_1samp

from wordcloud import WordCloud,STOPWORDS

from sklearn.metrics.pairwise import cosine_similarity

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

### Dataset Loading

In [None]:
# Connect to google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# path of the datasets
path='/content/drive/MyDrive/Capstone Project/Book recommendation system/'

In [None]:
# Load Dataset
books=pd.read_csv(path+'Books.csv')
ratings=pd.read_csv(path+'Ratings.csv')
users=pd.read_csv(path+'Users.csv')


### Dataset First View

In [None]:
# Book Dataset First Look
books.head()

In [None]:
# Users Dataset First Look
users.head()

In [None]:
# Ratings Dataset First Look
ratings.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(books.shape)
print(users.shape)
print(ratings.shape)


### Dataset Information

In [None]:
# Books dataset info
books.info()

In [None]:
# Users dataset info
users.info()

In [None]:
# Ratings dataset info
ratings.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(books.duplicated().sum())
print(users.duplicated().sum())
print(ratings.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
books_null_count=books.isnull().sum()
print(books_null_count)

In [None]:
# Visualizing the missing values in book dataset
plt.figure(figsize=(8,5))
books_null_count.plot(kind='bar')
plt.show()

In [None]:
# checking for null values in rating dataset
ratings.isnull().sum()

In [None]:
# checking for null values in users dataset
users_null_count=users.isnull().sum()
print(users_null_count)

In [None]:
# Visualizing the missing values in users dataset
users_null_count.plot(kind='bar')
plt.show()

### What did you know about your dataset?

* The data consists of three datasets: book details, user information, and the ratings users gave to books. The goal is to analyse them for insights.
* The books dataset has 6 null values in total, and the age column of the users dataset has almost 110,000 null values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(books.columns)
print(users.columns)
print(ratings.columns)


In [None]:
# Books dataset Describe
books.describe()

In [None]:
# Users dataset Describe
users.describe()

In [None]:
# Ratings dataset Describe
ratings.describe()

### Variables Description 

* **ISBN:** Unique book id
* **Book-Title:** Name of the book
* **Book-Author:** Name of the author
* **Year-Of-Publication:** Year in which the book was published
* **Publisher:** Name of the publisher
* **Image-URL-S:** Small image URL
* **Image-URL-M:** Medium image URL
* **Image-URL-L:** Large image URL
* **User-ID:** Unique user id
* **Location:** City, state, and country of the user
* **Age:** Age of the user
* **Book-Rating:** Rating given by the user to the book


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable of books dataset.
books.nunique()

In [None]:
# Check Unique Values for each variable of users dataset.
users.nunique()

In [None]:
# Check Unique Values for each variable of ratings dataset.
ratings.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Make the dataset analysis ready
# Renamed a few columns. 'User-ID' keeps its original name because later cells merge and group on it.
books.rename(columns={'Book-Title':'book_title','Book-Author':'book_author','Year-Of-Publication':'year_of_publication','Publisher':'publisher','Image-URL-M':'image_url_m'},inplace=True)
users.rename(columns={'Location':'location','Age':'age'},inplace=True)
ratings.rename(columns={'Book-Rating':'book_rating'},inplace=True)

In [None]:
# Deleted image url of small and large size image
books.drop(['Image-URL-S','Image-URL-L'],axis=1,inplace=True)

In [None]:
# unique value of year_of_publication
books.year_of_publication.unique()

While checking the unique values, I found two values in year_of_publication that are actually publisher names.
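
One way to locate such misplaced values without reading the full list of unique values is to try parsing the column as numbers. The sketch below uses a made-up three-row frame (`toy_books` is hypothetical, not the project data) and relies on `pd.to_numeric` with `errors="coerce"`.

```python
import pandas as pd

# Toy frame; the rows are made up, with one publisher name misplaced in the year column.
toy_books = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "year_of_publication": ["2001", "DK Publishing Inc", 1999],
})

# Values that cannot be parsed as numbers become NaN.
parsed_year = pd.to_numeric(toy_books["year_of_publication"], errors="coerce")

# Rows whose year failed to parse are the ones that need manual correction.
bad_rows = toy_books[parsed_year.isna()]
print(bad_rows)
```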

In [None]:
# wrong entries in the columns
books[(books['year_of_publication']=='DK Publishing Inc') | (books['year_of_publication']=='Gallimard')]

In [None]:
# corrected the wrong entries in the columns
books.loc[books.ISBN == '078946697X','publisher'] = 'DK Publishing Inc'
books.loc[books.ISBN == '078946697X','year_of_publication'] = '2000'
books.loc[books.ISBN == '078946697X','book_author'] = 'Michael Teitelbaum'
books.loc[books.ISBN == '078946697X','book_title'] = 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)'
books.loc[books.ISBN == '078946697X','image_url_m'] = 'http://images.amazon.com/images/P/078946697X.01.MZZZZZZZ.jpg'

books.loc[books.ISBN == '0789466953','image_url_m'] = 'http://images.amazon.com/images/P/0789466953.01.MZZZZZZZ.jpg'
books.loc[books.ISBN == '0789466953','publisher'] = 'DK Publishing Inc'
books.loc[books.ISBN == '0789466953','year_of_publication'] = '2000'
books.loc[books.ISBN == '0789466953','book_author'] = 'James Buckley'
books.loc[books.ISBN == '0789466953','book_title'] = 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)'


books.loc[books.ISBN == '2070426769','publisher'] = 'Gallimard'
books.loc[books.ISBN == '2070426769','year_of_publication'] = '2003'
books.loc[books.ISBN == '2070426769','book_author'] = 'Jean-Marie Gustave Le Clézio'
books.loc[books.ISBN == '2070426769','book_title'] = "Peuple du ciel, suivi de 'Les Bergers"
books.loc[books.ISBN == '2070426769','image_url_m'] = 'http://images.amazon.com/images/P/2070426769.01.MZZZZZZZ.jpg'

In [None]:
# function to extract country from location
def get_country(location):
  a=location.split(', ')
  country=a[-1]
 
  return country

In [None]:
# create new column country 
users['country']=users['location'].apply(get_country)

In [None]:
# Deleted rows where rating was zero
zero_dataset=ratings[ratings['book_rating']==0].index
ratings.drop(zero_dataset,inplace=True)

In [None]:
# Dropped rows with age <12 or >85 (rows with a missing age are kept)
users.drop(users[(users['age']>85) | (users['age']<12)].index,inplace=True)

In [None]:
# Merge the book, rating and users dataset
rating_with_book=books.merge(ratings,on='ISBN')
df=rating_with_book.merge(users,on='User-ID')

In [None]:
# book title with highest book rating
book_with_highest_rating=df.groupby('book_title')['book_rating'].sum().reset_index()
book_with_highest_rating=book_with_highest_rating.sort_values(by = 'book_rating',ascending=False)

In [None]:
# Most read books with their average rating

most_readed_books=df['book_title'].value_counts().reset_index().head(20)
most_readed_books.columns=['book_title','count']

book_with_avg_rating=df.groupby('book_title')['book_rating'].mean().sort_index(ascending=False).reset_index()

most_readed_books_with_rating=book_with_avg_rating.merge(most_readed_books,on='book_title')
most_readed_books_with_rating = most_readed_books_with_rating.sort_values(by = 'count',ascending=False)

In [None]:
# Author with highest book rating
author_with_highest_rating= df.groupby('book_author')['book_rating'].sum().reset_index()
author_with_highest_rating=author_with_highest_rating.sort_values(by = 'book_rating',ascending=False).head(10)

In [None]:
# Age groups with the highest total rating
age_with_most_rating=df.groupby('age')['book_rating'].sum().reset_index().sort_values(by='book_rating',ascending=False).head(20)

### What all manipulations have you done and insights you found?

* Renamed a few columns for better readability.
* Deleted the small and large image URLs, as they were of no use.
* Corrected the wrong entries in a few of the columns.
* Wrote a function to extract the country from the location and created a new column, country.
* Deleted the rows in the ratings dataset where the book rating given by the user was zero.
* Dropped users aged less than 12 or more than 85, as these were outliers in the dataset.
* Merged the books, ratings, and users datasets.
* Made a few new datasets for the analysis:
  * Book titles with the highest total rating
  * Most read books with their average rating
  * Authors with the highest total rating
  * Age groups with the highest total rating
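
Two of the steps above, the country extraction and the age filter, can be sketched on toy data. The `toy_users` frame is made up for illustration. The point worth noting is that comparisons against a missing age evaluate to False, so users without an age are kept by the filter.

```python
import numpy as np
import pandas as pd

# Toy users; ids, locations, and ages are made up for illustration.
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "location": ["nyc, new york, usa", "porto, v.n.gaia, portugal",
                 "berlin, berlin, germany", "london, england, united kingdom", "n/a"],
    "age": [5, 30, np.nan, 90, 44],
})

# The last comma-separated token is taken as the country.
toy_users["country"] = toy_users["location"].str.split(", ").str[-1]

# Comparisons with NaN are False, so rows with a missing age survive this filter.
out_of_range = (toy_users["age"] > 85) | (toy_users["age"] < 12)
kept = toy_users[~out_of_range]
print(kept[["user_id", "age", "country"]])
```

A location without any comma, such as "n/a" here, ends up as its own "country", which is worth checking before grouping users by country.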

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Author(Univariate analysis)

In [None]:
# Chart - 1 visualization code(Top 10 author with highest number of books written)
plt.figure(figsize=(10,6))
sns.countplot(data=books,y='book_author',order=books['book_author'].value_counts().index[0:10])
plt.title('Top 10 author with highest number of books written')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many times each category occurs in a categorical variable, drawn as a bar chart so that the categories can be compared at a glance.

Thus, I used it to show the number of books written by each of the top 10 authors.

##### 2. What is/are the insight(s) found from the chart?

The most prolific author has written more than 600 books, and the other authors in the top ten have around 300 or more each.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

In a book recommendation system, books by these top ten authors are likely to be recommended often, simply because they have written more books.

#### Publisher(Univariate analysis)

In [None]:
# Chart - 2 visualization code(Top 10 publishers with highest number of books)
plt.figure(figsize=(10,6))
sns.countplot(data=books,y='publisher',order=books['publisher'].value_counts().index[0:10])
plt.title('Top 10 publishers with highest number of books')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many times each category occurs in a categorical variable, drawn as a bar chart so that the categories can be compared at a glance.

Thus, I used it to show the number of books published by each of the top 10 publishers.

##### 2. What is/are the insight(s) found from the chart?

Harlequin is the top publisher with more than 7,000 books, and most of the other top publishers have published more than 2,000 books each.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

In a book recommendation system, books from these top ten publishers are likely to be recommended often, simply because they have published more books.

#### Year(Univariate analysis)

In [None]:
# Chart - 3 visualization code(Top 10 year of publishing)
plt.figure(figsize=(10,6))
sns.countplot(data=books,x='year_of_publication',order=books['year_of_publication'].value_counts().index[0:10])
plt.title('Top 10 year of publishing')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many times each category occurs in a categorical variable, drawn as a bar chart so that the categories can be compared at a glance.

Thus, I used it to show the number of books published in each of the top 10 years.

##### 2. What is/are the insight(s) found from the chart?

The highest number of books was published in 2002, and most of the books were published between 1994 and 2003.

#### Rating(Univariate analysis)

In [None]:
# Chart - 4 visualization code(Number of ratings)
plt.figure(figsize=(10,6))
sns.countplot(data=ratings,x='book_rating')
plt.title('Number of ratings given to the books')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many times each category occurs in a categorical variable, drawn as a bar chart so that the categories can be compared at a glance.

Thus, I used it to show how many times each rating value was given.

##### 2. What is/are the insight(s) found from the chart?

The number of 0 ratings was very high, but a 0 rating does not help a recommendation system suggest books to users, so the 0 ratings were removed before plotting the count of each rating value.

Most of the remaining ratings are 5 or higher.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

After removing the zero ratings, only ratings that carry a score remain, and these are the ones the recommendation system can use to recommend books to users.

#### Books(Univariate analysis)

In [None]:
# Chart - 5 visualization code(Top 20 most read books)
plt.figure(figsize=(10,6))
sns.countplot(data=df,y='book_title',order=df['book_title'].value_counts().index[0:20])
plt.title('Top 20 most read books')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many times each category occurs in a categorical variable, drawn as a bar chart so that the categories can be compared at a glance.

Thus, I used it to show the top 20 books read by the most users.

##### 2. What is/are the insight(s) found from the chart?

These are the 20 books with the largest user base: each was read and reviewed by more than 250 users.

The Lovely Bones is the most-read novel, with more than 700 users having read and reviewed it.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These top 20 books can be recommended to new users, as they are the books read by the most users.

####  Most frequent words in book title(Univariate analysis)

In [None]:
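# Chart - 6 visualization code( Most frequent words in book title)
stopwords=set(STOPWORDS)

# join every title into one string; str() on a Series would only give its truncated printout
text=' '.join(books['book_title'].astype(str))
wordcloud2 = WordCloud(random_state=10,stopwords=stopwords).generate(text)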
# Generate plot
plt.figure(figsize=(10,6))
plt.title(' Most frequent words in book title')
plt.imshow(wordcloud2)
plt.axis("off")
plt.show()

##### 1. Why did you pick the specific chart?

A word cloud is a visualization technique that represents the frequency of words in a text, where the size of each word reflects how often it occurs.

Thus, I used a word cloud to show what kinds of books are available in our dataset.
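
One detail matters when preparing the text for a word cloud from a pandas column: calling `str()` on a long Series returns its truncated printout (the first and last few rows), not all of its values. The sketch below uses made-up titles to show the difference between that preview and a string built by joining every value. Under pandas' default display settings it should print False and then True.

```python
import pandas as pd

# 100 made-up titles; only the middle one contains the word "UNIQUEWORD".
titles = pd.Series([f"Title {i}" for i in range(100)], name="book_title")
titles.iloc[50] = "UNIQUEWORD in the middle"

preview_text = str(titles)                 # truncated printout of the Series
full_text = " ".join(titles.astype(str))   # every title, joined with spaces

print("UNIQUEWORD" in preview_text)
print("UNIQUEWORD" in full_text)
```

Passing the joined string to `WordCloud.generate` lets every title contribute to the word frequencies.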

##### 2. What is/are the insight(s) found from the chart?

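The most prominent words were mythology, classic, pandemic, mummies, decision, world, and influenza, which suggests that many of the books are story-driven. This reading is tentative: a word cloud built from the string form of a pandas Series sees only the few titles shown in its truncated printout, so the cloud should be generated from all titles joined together before drawing conclusions about the whole catalogue.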

#### Age(Univariate analysis)

In [None]:
# Chart - 7 visualization code(Number of users in each age group)
plt.figure(figsize=(10,6))
sns.histplot(data=users,x='age')
plt.title('Number of users in each age group')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram summarizes discrete or continuous data measured on an interval scale and shows the main features of its distribution in a convenient form. It is useful for large datasets (more than 100 observations) and can help detect unusual observations (outliers) or gaps in the data.

Thus, I used a histogram to analyse how the age variable is distributed across the dataset and whether the distribution is symmetric.

##### 2. What is/are the insight(s) found from the chart?

Most of the users who read books are between 15 and 55 years old. Users younger than 12 or older than 85 were removed because they read very few books, and their reviews are unlikely to help recommendations when most of the user base is between 15 and 55.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Ratings from users aged 12 to 85 can help us recommend good books, since most of our user base is around 15 to 50 and their reviews are what a good recommendation system is built on.

#### Country vs users(Bivariate analysis)

In [None]:
# Chart - 8 visualization code
sns.countplot(data=users,y='country',order=users['country'].value_counts().index[0:10])
plt.title('Number of users per country')
plt.xlabel('Number of users')
plt.show()


##### 1. Why did you pick the specific chart?

A countplot shows how many times each category occurs in a categorical variable, drawn as a bar chart so that the categories can be compared at a glance.

Thus, I used it to show the number of users in each country.

##### 2. What is/are the insight(s) found from the chart?

These are the top ten countries with the highest user base.

Most of the users are from the USA, around 140,000, and every other country has about 20,000 users or fewer.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Since most of the users are from the USA, with around 140,000 users, the recommendations are likely to be driven mainly by the preferences of USA users.

#### Books vs rating(Bivariate analysis)

In [None]:
# Chart - 9 visualization code(Top 20 books with highest total rating)

plt.figure(figsize=(10,7))
sns.barplot(data=book_with_highest_rating.head(20),x='book_rating',y='book_title')
plt.title('Top 20 books with highest total rating')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows one numeric value per category as the length of a bar, which makes the categories easy to compare.

Thus, I used it to show the relationship between book title and total book rating.

##### 2. What is/are the insight(s) found from the chart?

These are the 20 books with the highest total rating, that is, the sum of all ratings given by users, each above 2,000. Because the total grows with the number of ratings, it reflects popularity as much as quality.

The Lovely Bones is the top novel, with a total rating of around 6,000.
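
The difference between a book's total rating and its average rating can be seen in a small sketch. The two books and their ratings are made up for illustration: a widely read book with moderate scores outranks a less-read book with better scores when ranked by the sum.

```python
import pandas as pd

# Made-up ratings: "Popular" has many moderate ratings, "Niche" has few high ones.
toy = pd.DataFrame({
    "book_title": ["Popular"] * 5 + ["Niche"] * 2,
    "book_rating": [6, 6, 6, 6, 6, 10, 9],
})

# Total, average, and number of ratings per book.
summary = toy.groupby("book_title")["book_rating"].agg(
    total="sum", average="mean", n_ratings="count")
print(summary.sort_values("total", ascending=False))
```

Reading the total alongside the count and the average helps separate "read by many" from "rated highly".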

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These top 20 books can be recommended to new users, as they have the highest total ratings.

#### Most read books with average rating(Bivariate analysis)

In [None]:
# Chart - 10 visualization code(Most read books with average rating)
plt.figure(figsize=(10,7))
sns.barplot(data=most_readed_books_with_rating,x='book_rating',y='book_title')
plt.title('Top 20 most read books with average rating ')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows one numeric value per category as the length of a bar, which makes the categories easy to compare.

Thus, I used it to show the relationship between book title and average book rating.

##### 2. What is/are the insight(s) found from the chart?

These are the 20 books read by the most users, and most of them have an average rating above 7.

Wild Animus is the second most-read book, but its average rating is only around 4.5.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These top 20 books can be recommended to new users, as they are read by the most users and most of them have an average rating above 7.

#### Age vs Rating(Bivariate analysis)

In [None]:
# Chart - 11 visualization code(Highest rating among different age group)
plt.figure(figsize=(10,7))
sns.barplot(data=age_with_most_rating,x='age',y='book_rating')
plt.title('Highest rating among different age group')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows one numeric value per category as the length of a bar, which makes the categories easy to compare.

Thus, I used it to show the relationship between age and total book rating.

##### 2. What is/are the insight(s) found from the chart?

The highest total ratings come from users aged 23 to 52.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Most of the books are read by users aged 15 to 50, and the highest total ratings come from users aged 23 to 52, so these ratings can be used to recommend good books to the user base aged 15 to 50.

#### Author v/s Rating(Bivariate analysis)

In [None]:
# Chart - 12 visualization code(Top 10 author with highest rating)

plt.figure(figsize=(10,7))
sns.barplot(data=author_with_highest_rating.head(10),x='book_rating',y='book_author')
plt.title('Top 10 author with highest rating')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows one numeric value per category as the length of a bar, which makes the categories easy to compare.

Thus, I used it to show the relationship between author and total book rating.

##### 2. What is/are the insight(s) found from the chart?

These are the top ten authors by total rating, each above 9,000, which suggests that users like the books of these authors.

The favourite author is Stephen King, with the highest total rating of around 35,000.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Books by these top 10 authors can be recommended to new users, as these authors' books have the highest total ratings.

####  Most frequent words in book title(Bivariate analysis)

In [None]:
# Chart - 13 visualization code( Most frequent words in book title of top 10 author)
top_author=author_with_highest_rating['book_author'].tolist()
stopwords=set(STOPWORDS)
for i in top_author:
  # titles of this author's books joined into one string
  # (str() on a Series would only give its truncated printout)
  text=' '.join(df.loc[(df['book_author']==i),'book_title'].astype(str))
  wordcloud2 = WordCloud(random_state=10,stopwords=stopwords).generate(text)
  # Generate plot
  plt.figure(figsize=(10,6))
  plt.title(f' Most frequent words in book title of author {i}')
  plt.imshow(wordcloud2)
  plt.axis("off")
  plt.show()

##### 1. Why did you pick the specific chart?

A word cloud is a visualization technique that represents the frequency of words in a text, where the size of each word reflects how often it occurs.

Thus, I used word clouds to show what kinds of books are written by the top 10 authors.

##### 2. What is/are the insight(s) found from the chart?

The most frequent words in each author's book titles indicate the topics that author writes about.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

The average number of books read by users is more than 10.

The average age of users who read more than 50 books is 35.

The top 100 books read by the most users have an average rating of more than 8.

In [None]:
# Creating Parameter Class 
class findz:
  def proportion(self,sample,hyp,size):
    return (sample - hyp)/math.sqrt(hyp*(1-hyp)/size)
  def mean(self,hyp,sample,size,std):
    return (sample - hyp)*math.sqrt(size)/std
  def varience(self,hyp,sample,size):
    return (size-1)*sample/hyp

# Sample variance (n-1 in the denominator)
variance = lambda x : np.var(x, ddof=1)
# CDF of the standard normal distribution
zcdf = lambda x: norm(0,1).cdf(x)
# Creating a function for getting P value
def p_value(z,tailed,t,hypothesis_number,df,col):
  if t!="true":
    # Z-test: convert the z statistic to a p-value for the requested tail
    z=zcdf(z)
    if tailed=='l':
      return z
    elif tailed == 'r':
      return 1-z
    elif tailed == 'd':
      if z>0.5:
        return 2*(1-z)
      else:
        return 2*z
    else:
      return np.nan
  else:
    # T-test: ttest_1samp returns a two-sided p-value by default,
    # so `tailed` is not applied on this branch
    t_stat,p=ttest_1samp(df[col],hypothesis_number)
    return p

# Conclusion about the P - Value
def conclusion(p):
  significance_level = 0.05
  if p>significance_level:
    return f"Failed to reject the Null Hypothesis for p = {p}."
  else:
    return f"Null Hypothesis rejected Successfully for p = {p}"

# Initializing the class
findz = findz()

### Hypothetical Statement - 1

The average number of books read by users is more than 10.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: N = 10

Alternate Hypothesis : N > 10

Test Type: Right Tailed Test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
hypo_1=ratings.groupby('User-ID')['ISBN'].count().reset_index()
# Getting the required parameter values for hypothesis testing
hypothesis_number = 10
sample_mean = hypo_1["ISBN"].mean()
size = len(hypo_1)
std=(variance(hypo_1["ISBN"]))**0.5

In [None]:
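# Getting Z value
z = findz.mean(hypothesis_number,sample_mean,size,std)
# Getting P - Value
# tailed='r' matches the right-tailed alternate hypothesis; note that on the
# t-test branch p_value returns the two-sided p-value from ttest_1samp
p = p_value(z=z,tailed='r',t="true",hypothesis_number=hypothesis_number,df=hypo_1,col="ISBN")
# Getting Conclusion
print(conclusion(p))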

##### Which statistical test have you done to obtain P-Value?

I have used T-Test as the statistical testing to obtain P-Value and found the result that Null hypothesis has been rejected 

##### Why did you choose the specific statistical test?

The distribution is positively skewed and the population standard deviation is unknown, so it has to be estimated from the sample. For that reason I used a one-sample t-test instead of a Z-test.
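
A right-tailed hypothesis such as "mean > 10" can also be tested directly with SciPy. The sketch below is only an illustration on synthetic, right-skewed data (not the ratings data), and it assumes SciPy 1.6 or newer, where `stats.ttest_1samp` accepts the `alternative` argument. Without that argument the function returns a two-sided p-value.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed "books per user" counts (invented, NOT the real data)
rng = np.random.default_rng(0)
books_per_user = rng.exponential(scale=12.0, size=2000)

# H0: mean = 10, H1: mean > 10 (right-tailed test)
one_sided = stats.ttest_1samp(books_per_user, popmean=10, alternative="greater")

# Default call: two-sided p-value
two_sided = stats.ttest_1samp(books_per_user, popmean=10)

print(f"t = {one_sided.statistic:.3f}")
print(f"one-sided p = {one_sided.pvalue:.3g}, two-sided p = {two_sided.pvalue:.3g}")
```

When the t statistic is positive, the right-tailed p-value is half the two-sided one, so a two-sided test is the more conservative of the two in that case.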

### Hypothetical Statement - 2

Average age of the users who read more than 50 books is 35.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: N = 35

Alternate Hypothesis : N != 35

Test Type: Two Tailed Test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

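hypo_2=hypo_1.merge(users,on='User-ID')
# Keep only users who have read more than 50 books, as stated in the hypothesis
hypo_2=hypo_2[hypo_2['ISBN']>50]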
hypo_2= hypo_2.drop(columns=['location', 'country'])
hypo_2=hypo_2.dropna()
# Getting the required parameter values for hypothesis testing
hypothesis_number =35 
sample_mean = hypo_2["age"].mean()
size = len(hypo_2)
std=(variance(hypo_2["age"]))**0.5

In [None]:
# Getting Z value
z = findz.mean(hypothesis_number,sample_mean,size,std)
# Getting P - Value
p = p_value(z=z,tailed='d',t="false",hypothesis_number=hypothesis_number,df=hypo_2,col="age")
# Getting Conclusion
print(conclusion(p))

##### Which statistical test have you done to obtain P-Value?

I used a two-tailed Z-test to obtain the p-value; the null hypothesis was rejected.

##### Why did you choose the specific statistical test?

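The hypothesis compares a single sample mean against a fixed value in both directions, so a two-tailed Z-test on the sample mean was used, with the sample standard deviation standing in for the unknown population standard deviation. That approximation is only reasonable when the sample is large.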
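
The following toy sketch shows the mechanics of this kind of test: restrict the data to users with more than 50 books, take the sample size from that filtered subset, and compute a two-tailed p-value. The DataFrame `toy` and its values are invented; only the column names mirror `hypo_2`. With four rows a Z-test would not be appropriate in practice, so treat this purely as a walk-through of the arithmetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Toy stand-in for the merged user table (values invented for illustration)
toy = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5, 6],
    "ISBN":    [120, 8, 75, 51, 30, 60],      # number of books rated per user
    "age":     [34.0, 22.0, 41.0, 30.0, 55.0, 39.0],
})

heavy = toy[toy["ISBN"] > 50]      # users with more than 50 books
n = len(heavy)                     # sample size of the filtered subset
mean_age = heavy["age"].mean()
std_age = heavy["age"].std(ddof=1) # sample standard deviation

# Two-tailed test of H0: mean age = 35
z = (mean_age - 35) * np.sqrt(n) / std_age
p_two_tailed = 2 * norm.sf(abs(z))
print(f"n = {n}, mean age = {mean_age}, z = {z:.3f}, p = {p_two_tailed:.3f}")
```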

### Hypothetical Statement - 3

The 100 most-read books have an average rating of more than 8.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: N = 8

Alternate Hypothesis : N > 8

Test Type: Right Tailed Test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
hypo_3=(ratings.groupby('ISBN')
        .agg({'User-ID':'count','book_rating':'mean'})
        .reset_index()
        .sort_values(by='User-ID',ascending=False)
        .head(100))
# Getting the required parameter values for hypothesis testing
hypothesis_number =8
sample_mean = hypo_3["book_rating"].mean()
size = len(hypo_3)
std=(variance(hypo_3["book_rating"]))**0.5

In [None]:
# Getting Z value
z = findz.mean(hypothesis_number,sample_mean,size,std)
# Getting P - Value
p = p_value(z=z,tailed='r',t="true",hypothesis_number=hypothesis_number,df=hypo_3,col="book_rating")
# Getting Conclusion
print(conclusion(p))

##### Which statistical test have you done to obtain P-Value?

I used a one-sample t-test to obtain the p-value; the test failed to reject the null hypothesis.

##### Why did you choose the specific statistical test?

The distribution is negatively skewed and the population standard deviation is unknown, so it has to be estimated from the sample. For that reason I used a one-sample t-test instead of a Z-test.
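
Skewness claims like the ones made for these distributions can be checked numerically rather than by eye. The sketch below uses `scipy.stats.skew` on two synthetic samples (not the notebook's data): a positive value indicates a long right tail, a negative value a long left tail.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Synthetic samples, invented for illustration only
right_skewed = rng.exponential(scale=5.0, size=5000)       # long right tail
left_skewed = 10 - rng.exponential(scale=1.0, size=5000)   # long left tail

skew_right = skew(right_skewed)
skew_left = skew(left_skewed)
print(f"right-skewed sample: {skew_right:.2f}")
print(f"left-skewed sample:  {skew_left:.2f}")
```

In the notebook, the same call could be applied to a column such as `hypo_3["book_rating"]` to confirm the direction of the skew.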

## 6. ***Data Pre-processing***

In [None]:
# Create a dataset with the number of ratings and the average rating per book title
number_rating=rating_with_book.groupby("book_title")['book_rating'].agg(num_rating='count',avg_rating='mean').reset_index()

In [None]:
# Keep books with at least 50 ratings and an average rating of 5 or more
best_book=number_rating[(number_rating['num_rating']>=50)& (number_rating['avg_rating']>=5)]

In [None]:
best_book.shape
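
To make the aggregation and filtering step concrete, here is a small sketch on invented data. The thresholds are scaled down to fit the toy frame (the notebook uses 50 ratings and an average of 5), and the names `toy`, `stats_per_book`, and `kept` are hypothetical.

```python
import pandas as pd

# Invented ratings: book A has 3 ratings, B has 2, C has 1
toy = pd.DataFrame({
    "book_title":  ["A", "A", "A", "B", "B", "C"],
    "book_rating": [8, 6, 7, 3, 4, 9],
})

# Named aggregation: one row per title with rating count and mean rating
stats_per_book = (
    toy.groupby("book_title")["book_rating"]
       .agg(num_rating="count", avg_rating="mean")
       .reset_index()
)

# Scaled-down thresholds: at least 2 ratings and an average of at least 5
kept = stats_per_book[
    (stats_per_book["num_rating"] >= 2) & (stats_per_book["avg_rating"] >= 5)
]
print(kept)
```

Book B is dropped for its low average and book C for having too few ratings, so only book A survives both conditions.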

In [None]:
# Keep only users who have given at least 50 book ratings
x=df.groupby('User-ID').count()['book_rating']>=50
y=x[x].index
df=df[df['User-ID'].isin(y)]
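
The same user filter can be written with `groupby(...).transform("count")`, which returns a per-row count aligned with the original frame. This is only an alternative sketch on invented data with a scaled-down threshold; `toy`, `min_ratings`, and `active` are hypothetical names.

```python
import pandas as pd

# Invented data: user 1 has 3 ratings, user 2 has 1, user 3 has 2
toy = pd.DataFrame({
    "User-ID":     [1, 1, 1, 2, 3, 3],
    "book_rating": [5, 7, 9, 8, 6, 10],
})

min_ratings = 2  # scaled down for the toy data; the notebook uses 50

# Number of ratings given by the user of each row
counts = toy.groupby("User-ID")["book_rating"].transform("count")

# Keep only rows belonging to sufficiently active users
active = toy[counts >= min_ratings]
print(sorted(active["User-ID"].unique().tolist()))
```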

In [None]:
# Final dataset after merging with the best books
final_df=df.merge(best_book,on='book_title').reset_index()

In [None]:
# Drop columns that are not needed for the recommendation model
final_df.drop( ['year_of_publication','publisher', 'location', 'age','country'],axis=1,inplace=True)

In [None]:
# final dataset
final_df.head()

In [None]:
# Create a pivot table with book titles as rows and user IDs as columns
pivot_t = final_df.pivot_table(index='book_title',columns='User-ID',values='book_rating')

In [None]:
# Filled NA with 0
pivot_t.fillna(0,inplace=True)

In [None]:
pivot_t
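
As a small illustration of the resulting book-by-user matrix, the sketch below builds a pivot table from invented ratings. It passes `fill_value=0` to `pivot_table`, which fills the missing user/book pairs in one step instead of calling `fillna` afterwards; the frame `toy` is hypothetical.

```python
import pandas as pd

# Invented ratings for three books and three users
toy = pd.DataFrame({
    "book_title":  ["A", "A", "B", "C"],
    "User-ID":     [1, 2, 1, 3],
    "book_rating": [8, 6, 9, 7],
})

# Rows are books, columns are users; unrated pairs become 0
pivot = toy.pivot_table(
    index="book_title", columns="User-ID", values="book_rating", fill_value=0
)
print(pivot)
```

Each row is then a rating vector for one book, which is the form a similarity measure such as cosine similarity operates on.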