## Instructions

Please upload the .csv file containing your answers to the questions below.

While creating the csv file, ensure that:

a. The csv file is in UTF-8 format.

b. For questions marked 'F', your answer is a float rounded to two decimal places, e.g. 7.05

c. For questions marked 'N', your answer is an integer with no decimal places, e.g. 1298

d. For questions marked 'S', your answer is a string enclosed in double quotes, e.g. "Blue Whale"

e. The dataset to use is given in square brackets for each question, e.g. [books.csv]

f. Answers in the csv file must follow the same order as the questions in the problem statement below.

g. For Q14 (TextBlob analysis), use the non-null dataset, i.e. remove the records for which 'original_title' is null.
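
The formatting rules above can be sketched as a small helper. This is only an illustration under assumptions: the names `format_answer` and `build_answers_csv`, the one-answer-per-row layout, and the sample values are hypothetical, since the instructions do not spell out the exact file layout.

```python
def format_answer(value):
    """Format one answer according to its type (F, N or S)."""
    if isinstance(value, float):
        # 'F' answers: always two decimal places, e.g. 7.05
        return f"{value:.2f}"
    if isinstance(value, int):
        # 'N' answers: plain integer, e.g. 1298
        return str(value)
    # 'S' answers: wrap in double quotes, doubling any embedded quote
    return '"' + str(value).replace('"', '""') + '"'


def build_answers_csv(answers):
    """Join the formatted answers, one per line (assumed layout)."""
    return "\n".join(format_answer(a) for a in answers) + "\n"


# Hypothetical sample answers, listed in question order
csv_text = build_answers_csv([7.05, 1298, "Blue Whale"])
print(csv_text)
```

To save the text as UTF-8, it can be written with `open(path, "w", encoding="utf-8")`. Values taken straight from pandas may need `int(...)` or `float(...)` first so that the type checks above recognise them.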

In [96]:
import pandas as pd

In [97]:
books_df = pd.read_csv("/content/books.csv")

print(books_df.columns , "\n"*2)
books_df.head(5)

Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
       'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
       'original_title', 'title', 'language_code', 'average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'image_url', 'small_image_url'],
      dtype='object') 




Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
1,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
2,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...
3,6,11870085,11870085,16827462,226,525478817,9780525000000.0,John Green,2012.0,The Fault in Our Stars,...,2346404,2478609,140739,47994,92723,327550,698471,1311871,https://images.gr-assets.com/books/1360206420m...,https://images.gr-assets.com/books/1360206420s...
4,7,5907,5907,1540236,969,618260307,9780618000000.0,J.R.R. Tolkien,1937.0,The Hobbit or There and Back Again,...,2071616,2196809,37653,46023,76784,288649,665635,1119718,https://images.gr-assets.com/books/1372847500m...,https://images.gr-assets.com/books/1372847500s...


### Q2
How many unique books are present in the dataset? Evaluate based on 'book_id'. [books.csv] (N)

In [98]:
len(books_df["book_id"].unique())

1309
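
The same count can also be obtained with `Series.nunique()`. The sketch below uses made-up rows (`toy_books` is not the real dataset) to show that both approaches agree when there are no missing values. Note that `nunique()` skips NaN by default, while `unique()` keeps it.

```python
import pandas as pd

# Made-up rows: book_id 2 appears twice
toy_books = pd.DataFrame({"book_id": [2, 3, 2, 5]})

# Length of the array of distinct values
via_unique = len(toy_books["book_id"].unique())

# Direct count of distinct values
via_nunique = toy_books["book_id"].nunique()

print(via_unique, via_nunique)
```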

In [99]:
book_tags_df = pd.read_csv("/content/book_tags.csv")

book_tags_df.head(5)

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716


### Q11
How many unique tags are there in the dataset [book_tags.csv]? (N)

In [100]:
len(book_tags_df["tag_id"].unique())

8789

In [101]:
ratings_df = pd.read_csv("/content/ratings.csv")

ratings_df.head(5)

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,33,4
2,4,18,5
3,4,27,5
4,4,2,5


### Q12
What is the mean rating of all the books in the dataset, based on 'average_rating' [books.csv]? (F)

#### Extra Info
`round(float_number, decimal_places)` rounds a float to the given number of decimal places.

In [102]:
round(books_df["average_rating"].mean(),2)

3.98
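
One detail worth keeping in mind for 'F' answers: `round` returns a number, so a trailing zero is not kept when the value is printed. A minimal sketch with made-up ratings (not taken from books.csv), using plain Python only:

```python
ratings = [4.44, 3.57, 4.08]  # made-up ratings
mean_rating = sum(ratings) / len(ratings)

# Numeric value rounded to two decimal places
rounded = round(mean_rating, 2)

# round() keeps a float, so 4.5 stays 4.5 rather than 4.50
short_value = round(4.5, 2)

# String formatting always shows two decimal places
as_text = f"{4.5:.2f}"

print(rounded, short_value, as_text)
```

If the answer file needs exactly two digits after the decimal point, formatting with `f"{value:.2f}"` is the safer choice.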

### Q3
How many unique users are present in the dataset [ratings.csv]? (N)


In [103]:
len(ratings_df['user_id'].unique())

53323

### Q6
Which book (title) has the highest total tag count given by users, i.e. the book with the most user records across all tags [book_tags.csv, books.csv]? (S)

#### Extra Info
- The grouped totals are sorted by "count" in **ascending** order, so the first row has the smallest count and the last row has the largest.

- `.tail(1)` returns the last row. Here it is called on a single column, so the result is a one-element `Series`.

- `.values` on a `Series` returns an array of its values. The array here has only one element, which we access with index 0.

In [112]:
# Sum the counts per book, then sort ascending so the largest total is last
tag_count_df = book_tags_df.groupby(['goodreads_book_id']).sum().sort_values("count").reset_index()
max_tag_goodreads = tag_count_df["goodreads_book_id"].tail(1).values[0]

# Look up the title of that book in books.csv
books_df[ books_df["goodreads_book_id"] == max_tag_goodreads ]["original_title"].values[0]

"Harry Potter and the Philosopher's Stone"
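
As an alternative to sorting and taking the last row, the book with the largest total can be picked with `idxmax`. This is a sketch on made-up data (`toy_tags` and `toy_books` are hypothetical stand-ins for the real files), not the method used in the cell above.

```python
import pandas as pd

toy_tags = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2, 2, 3],
    "tag_id": [10, 11, 10, 12, 10],
    "count": [5, 3, 20, 1, 4],
})
toy_books = pd.DataFrame({
    "goodreads_book_id": [1, 2, 3],
    "original_title": ["Book A", "Book B", "Book C"],
})

# Total tag count per book: 1 -> 8, 2 -> 21, 3 -> 4
totals = toy_tags.groupby("goodreads_book_id")["count"].sum()

# idxmax returns the index label (the book id) with the largest total
top_id = totals.idxmax()

# Look up the matching title
top_title = toy_books.loc[
    toy_books["goodreads_book_id"] == top_id, "original_title"
].iloc[0]

print(top_id, top_title)
```

Selecting the "count" column before summing also avoids adding up `tag_id` values, which carry no meaning as a sum.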

### Q10
Which book (goodreads_book_id) has the lowest total tag count given by users, i.e. the book with the fewest user records across all tags [book_tags.csv]? (N)

#### Extra Info
- `.head(1)` returns the first row. Here it is called on a single column, so the result is a one-element `Series`.

In [113]:
min_tag_goodreads = tag_count_df["goodreads_book_id"].head(1).values[0]
min_tag_goodreads

13545345
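
Taking the first row of a sorted table returns a single book even when several books share the same minimum total, and which of them comes first depends on the sort. The sketch below, on made-up totals, lists every book tied at the minimum. The tie-break shown (lowest id) is an assumption, since the question does not specify one.

```python
import pandas as pd

# Made-up totals per goodreads_book_id; ids 7 and 9 tie for the minimum
totals = pd.Series({5: 40, 7: 3, 9: 3, 11: 12}, name="count")

# Keep every id whose total equals the minimum
tied_ids = totals[totals == totals.min()].index.tolist()

# Hypothetical tie-break: pick the lowest id
smallest_id = min(tied_ids)

print(tied_ids, smallest_id)
```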

### Q1
How many books do not have an original title [books.csv]? (N)

#### Extra Info
- `.isna()` checks whether each value is NA and returns a boolean `Series` of True/False values.
- `.values` returns those values as a NumPy array of True/False.
- `.sum()` adds up the values in the array. Python treats True/False as 1/0, so the sum is the number of missing titles.



In [106]:
books_df["original_title"].isna().values.sum()

54
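
The counting trick described above can be checked on a tiny made-up Series (`toy_titles` is hypothetical). It also shows that calling `.sum()` directly on the boolean Series gives the same number, so `.values` is optional.

```python
import pandas as pd

# Made-up titles with two missing entries
toy_titles = pd.Series(["Twilight", None, "The Hobbit", None, "Emma"])

# Boolean Series: True where the title is missing
mask = toy_titles.isna()

# Summing True/False values counts the True entries
missing_via_values = mask.values.sum()
missing_direct = mask.sum()

print(missing_via_values, missing_direct)
```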

In [107]:
toread_df = pd.read_csv("/content/toread.csv")

toread_df.head(5)

Unnamed: 0,user_id,book_id
0,15,275
1,29,2304
2,124,5
3,94,1239
4,94,10


### Q7
Which book (goodreads_book_id) is marked as to-read by the most users [books.csv, toread.csv]? (N)

#### Extra Info
- `value_counts()` returns a pandas Series sorted by count in descending order. Its index holds the `book_id` values.
- `index[0]` grabs the first index label, i.e. the most frequent `book_id`.
- The values of that Series are the counts, so grabbing the first value
would give the count instead of the `book_id`.

In [108]:
most_toread_id = toread_df["book_id"].value_counts().index[0]

books_df[books_df["book_id"] == most_toread_id]["goodreads_book_id"].values[0]

19063
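
The difference between the index and the values of a `value_counts()` result is easy to see on a small made-up example (`toy_toread` and `toy_books` are hypothetical stand-ins for the real files):

```python
import pandas as pd

toy_toread = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "book_id": [10, 20, 10, 30, 10, 20],
})
toy_books = pd.DataFrame({
    "book_id": [10, 20, 30],
    "goodreads_book_id": [111, 222, 333],
})

# Counts per book_id, largest first: 10 -> 3, 20 -> 2, 30 -> 1
counts = toy_toread["book_id"].value_counts()

most_id = counts.index[0]     # the book_id itself
most_count = counts.iloc[0]   # how many users marked it

# Translate book_id into goodreads_book_id
goodreads_id = toy_books.loc[
    toy_books["book_id"] == most_id, "goodreads_book_id"
].iloc[0]

print(most_id, most_count, goodreads_id)
```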

### Q9
Which book (title) has the minimum (average_rating) [books.csv]? If more than one book has the same average rating, sort the books by ['title'] in alphabetical order and use the first book in the sorted list. (S)



In [109]:
min_rating = min(books_df["average_rating"])

sorted(books_df[ books_df["average_rating"] == min_rating]["original_title"])[0]

'The Almost Moon'
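
The tie-breaking rule in the question can also be expressed as a single two-key sort. This is a sketch on made-up rows (`toy_books` is hypothetical) that sorts on the 'title' column named in the question. It is an alternative formulation, not the code used in the cell above.

```python
import pandas as pd

toy_books = pd.DataFrame({
    "title": ["Zebra Tales", "Apple Days", "Middle Road"],
    "average_rating": [2.5, 2.5, 3.9],
})

# Sort by rating first, then alphabetically by title to break ties
lowest = toy_books.sort_values(["average_rating", "title"]).iloc[0]

lowest_title = lowest["title"]
print(lowest_title)
```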

### Q14
Predict sentiment using TextBlob. How many positive titles (original_title) are there [books.csv]? (cut-off >= 0) (N)

#### Extra Info
- `.dropna()` on a column removes the entries that are NaN, so only books with an 'original_title' remain.
- Titles with a polarity of exactly zero are counted as positive here. (Usually they are considered neutral.)

In [110]:
from textblob import TextBlob

In [111]:
count = 0
for title in books_df["original_title"].dropna():
  polarity = TextBlob(title).sentiment.polarity
  if polarity >= 0:
    count += 1

count

# The code below does the same job, but it's a one-liner 😊
# sum([1 for title in books_df["original_title"].dropna() if TextBlob(title).sentiment.polarity >= 0])

1167
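
The counting logic can be separated from the sentiment scorer, which makes the cut-off rule easy to test on its own. In the sketch below, `toy_polarity` and `TOY_LEXICON` are hypothetical stand-ins and do not reproduce TextBlob's scores. Only the `>= 0` cut-off and the `dropna()` step mirror the cell above.

```python
import pandas as pd

# Hypothetical stand-in lexicon, NOT TextBlob's lexicon
TOY_LEXICON = {"great": 0.8, "terrible": -1.0}


def toy_polarity(title):
    """Average the scores of known words; 0.0 if none are known."""
    scores = [TOY_LEXICON[w] for w in title.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def count_non_negative(titles, polarity_fn):
    """Count non-null titles whose polarity is >= 0."""
    return sum(1 for t in titles.dropna() if polarity_fn(t) >= 0)


# Made-up titles, one of them missing
toy_titles = pd.Series(["The Great Gatsby", None, "A Terrible Night", "Twilight"])
positive_count = count_non_negative(toy_titles, toy_polarity)
print(positive_count)
```

With TextBlob installed, passing `lambda t: TextBlob(t).sentiment.polarity` as `polarity_fn` would apply the same scorer as the notebook cell.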