# Mid Bootcamp Project: 
# Bestseller books on Amazon 2009-2019
### About the Dataset
Dataset of Amazon's Top 50 bestselling books from 2009 to 2019. It contains 550 entries, and each book has been categorized as fiction or non-fiction using Goodreads.

### General observations

- Data was scraped from the Amazon site. Primary source: [https://www.amazon.com/-/es/gp/bestsellers/2014/books/ref=zg_bsar_cal_ye?language=en_US](https://www.amazon.com/-/es/gp/bestsellers/2014/books/ref=zg_bsar_cal_ye?language=en_US)
- The dataset includes 550 entries, matching the 50 best-selling books across 11 years (2009 - 2019).
- We infer that each book's ranking within a year is based on its sales. Web scrapers read a page in order, so the position of each book in the scraped list for a year should reflect its rank. In this dataset, however, the books within each year are ordered alphabetically, not by rank → **create a new column called ‘Ranking’ for every year, using web scraping or manual entry.**
- Some books have the same user rating and number of reviews in different years → the same book was in the top 50 in more than one year.
- Ratings and review counts are identical across years → they reflect Amazon's values at the moment of scraping (scraping today would yield different data).
- The same book may have different prices across the years. This might be related to the book format (hardcover, paperback).
- Currency of the Price column is USD.

### Questions:

1. Is there consistency in the bestselling books during the decade?
2. Which book is the most durable?
3. Which author is the best seller?
4. Evolution of genres
5. Is price relevant?
6. Distribution of ratings, reviews and pricing
7. All time top rating (20)

In [1]:
# pip install pandas-dedupe

In [2]:
import pandas as pd
#import pandas_dedupe
import distance
from collections import Counter 

In [3]:
data1 = pd.read_csv("Data/bestsellers with categories.csv")
data1

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8,2016,Non Fiction
1,11/22/63: A Novel,Stephen King,4.6,2052,22,2011,Fiction
2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18979,15,2018,Non Fiction
3,1984 (Signet Classics),George Orwell,4.7,21424,6,2017,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12,2019,Non Fiction
...,...,...,...,...,...,...,...
545,Wrecking Ball (Diary of a Wimpy Kid Book 14),Jeff Kinney,4.9,9413,8,2019,Fiction
546,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2016,Non Fiction
547,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2017,Non Fiction
548,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2018,Non Fiction


In [4]:
data2 = pd.read_excel("Data/books_final_ranked.xlsx")
data2

Unnamed: 0,Index,Name,Author,User Rating,Reviews,Price,Year,Genre 1,Genre 2,Genre 3,Rank
0,1,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17 350,8,2016-01-01,Non Fiction,"Health, Fitness & Dieting",Diets & Weight Loss,47
1,2,11/22/63: A Novel,Stephen King,4.6,2 052,22,2011-01-01,Fiction,Literature & Fiction,Genre Fiction,19
2,3,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18 979,15,2018-01-01,Non Fiction,"Health, Fitness & Dieting",Psychology & Counseling,7
3,4,1984 (Signet Classics),George Orwell,4.7,21 424,6,2017-01-01,Fiction,Literature & Fiction,Genre Fiction,17
4,5,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7 665,12,2019-01-01,Non Fiction,Children's Books,Education & Reference,40
...,...,...,...,...,...,...,...,...,...,...,...
545,546,Wrecking Ball (Diary of a Wimpy Kid Book 14),Jeff Kinney,4.9,9 413,8,2019-01-01,Fiction,Children's Books,Growing Up & Facts of Life,7
546,547,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14 331,8,2016-01-01,Non Fiction,Self-Help,Happiness,34
547,548,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14 331,8,2017-01-01,Non Fiction,Self-Help,Happiness,13
548,549,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14 331,8,2018-01-01,Non Fiction,Self-Help,Happiness,18


# Data evaluation

In [5]:
print(data1.shape)

(550, 7)
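
The 550 rows are consistent with 50 books for each of 11 years, but the shape alone does not prove that every year is complete. A minimal sketch of a per-year check, using a small hypothetical frame in place of `data1` (three books per year instead of 50):

```python
import pandas as pd

# Toy stand-in for data1: two years with three books each
# (the real dataset is expected to have 50 per year).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "A", "D", "E"],
    "Year": [2009, 2009, 2009, 2010, 2010, 2010],
})

expected_per_year = 3  # would be 50 for the real dataset

# Count rows per year and keep any year that deviates from the expected size.
rows_per_year = toy.groupby("Year").size()
incomplete_years = rows_per_year[rows_per_year != expected_per_year]

print(rows_per_year)
print("Years with an unexpected number of rows:", list(incomplete_years.index))
```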


In [6]:
data1.columns

Index(['Name', 'Author', 'User Rating', 'Reviews', 'Price', 'Year', 'Genre'], dtype='object')

In [7]:
data1.describe()

Unnamed: 0,User Rating,Reviews,Price,Year
count,550.0,550.0,550.0,550.0
mean,4.618364,11953.281818,13.1,2014.0
std,0.22698,11731.132017,10.842262,3.165156
min,3.3,37.0,0.0,2009.0
25%,4.5,4058.0,7.0,2011.0
50%,4.7,8580.0,11.0,2014.0
75%,4.8,17253.25,16.0,2017.0
max,4.9,87841.0,105.0,2019.0


### Enriching the Dataset

We would like to enrich our dataset with additional columns from another dataset that was scraped from the Amazon web page and includes further information about genres, along with the yearly rank.

We have checked that the two datasets have the same number of rows. We will drop all columns except "Genre 2", "Genre 3" and "Rank", and then concatenate the two tables along axis 1.
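
Concatenating along axis 1 matches rows by index label only, so equal table sizes are necessary but not sufficient. As a hedged sketch, the identifying columns can be compared row by row before concatenating. The two frames below are hypothetical stand-ins for `data1` and `data2`, reusing the notebook's column names:

```python
import pandas as pd

# Toy stand-ins for data1 and data2 (hypothetical rows).
left = pd.DataFrame({
    "Name": ["Book A", "Book B"],
    "Author": ["Author X", "Author Y"],
    "Year": [2016, 2011],
})
right = pd.DataFrame({
    "Name": ["Book A", "Book B"],
    "Author": ["Author X", "Author Y"],
    "Year": pd.to_datetime(["2016-01-01", "2011-01-01"]),
    "Genre 2": ["Self-Help", "Literature & Fiction"],
    "Rank": [47, 19],
})

# pd.concat(axis=1) aligns rows by index label only, so check that the
# identifying columns agree row by row before trusting the result.
same_shape = len(left) == len(right)
same_names = left["Name"].equals(right["Name"])
same_authors = left["Author"].equals(right["Author"])
# The second table stores the year as a date, so compare on the year part.
same_years = left["Year"].equals(right["Year"].dt.year.astype(left["Year"].dtype))

rows_aligned = same_shape and same_names and same_authors and same_years
print("Safe to concatenate by position:", rows_aligned)
```

If any of these checks fails, merging on the identifying columns would be a safer choice than a positional concatenation.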

In [8]:
data3 = data2.copy()
data3

Unnamed: 0,Index,Name,Author,User Rating,Reviews,Price,Year,Genre 1,Genre 2,Genre 3,Rank
0,1,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17 350,8,2016-01-01,Non Fiction,"Health, Fitness & Dieting",Diets & Weight Loss,47
1,2,11/22/63: A Novel,Stephen King,4.6,2 052,22,2011-01-01,Fiction,Literature & Fiction,Genre Fiction,19
2,3,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18 979,15,2018-01-01,Non Fiction,"Health, Fitness & Dieting",Psychology & Counseling,7
3,4,1984 (Signet Classics),George Orwell,4.7,21 424,6,2017-01-01,Fiction,Literature & Fiction,Genre Fiction,17
4,5,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7 665,12,2019-01-01,Non Fiction,Children's Books,Education & Reference,40
...,...,...,...,...,...,...,...,...,...,...,...
545,546,Wrecking Ball (Diary of a Wimpy Kid Book 14),Jeff Kinney,4.9,9 413,8,2019-01-01,Fiction,Children's Books,Growing Up & Facts of Life,7
546,547,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14 331,8,2016-01-01,Non Fiction,Self-Help,Happiness,34
547,548,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14 331,8,2017-01-01,Non Fiction,Self-Help,Happiness,13
548,549,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14 331,8,2018-01-01,Non Fiction,Self-Help,Happiness,18


In [9]:
data3 = data3.drop(["Index","Name", "Author", "User Rating", "Reviews", "Price", "Year", "Genre 1"], axis=1)
data3.head()

Unnamed: 0,Genre 2,Genre 3,Rank
0,"Health, Fitness & Dieting",Diets & Weight Loss,47
1,Literature & Fiction,Genre Fiction,19
2,"Health, Fitness & Dieting",Psychology & Counseling,7
3,Literature & Fiction,Genre Fiction,17
4,Children's Books,Education & Reference,40


In [10]:
data= pd.concat([data1, data3], axis=1)

data.shape
display(data)
data.info()

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre,Genre 2,Genre 3,Rank
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8,2016,Non Fiction,"Health, Fitness & Dieting",Diets & Weight Loss,47
1,11/22/63: A Novel,Stephen King,4.6,2052,22,2011,Fiction,Literature & Fiction,Genre Fiction,19
2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18979,15,2018,Non Fiction,"Health, Fitness & Dieting",Psychology & Counseling,7
3,1984 (Signet Classics),George Orwell,4.7,21424,6,2017,Fiction,Literature & Fiction,Genre Fiction,17
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12,2019,Non Fiction,Children's Books,Education & Reference,40
...,...,...,...,...,...,...,...,...,...,...
545,Wrecking Ball (Diary of a Wimpy Kid Book 14),Jeff Kinney,4.9,9413,8,2019,Fiction,Children's Books,Growing Up & Facts of Life,7
546,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2016,Non Fiction,Self-Help,Happiness,34
547,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2017,Non Fiction,Self-Help,Happiness,13
548,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2018,Non Fiction,Self-Help,Happiness,18


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550 entries, 0 to 549
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         550 non-null    object 
 1   Author       550 non-null    object 
 2   User Rating  550 non-null    float64
 3   Reviews      550 non-null    int64  
 4   Price        550 non-null    int64  
 5   Year         550 non-null    int64  
 6   Genre        550 non-null    object 
 7   Genre 2      550 non-null    object 
 8   Genre 3      550 non-null    object 
 9   Rank         550 non-null    int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 43.1+ KB


### Normalization

- We will now normalize the column names, setting everything to lowercase.
- We will also replace blank spaces (" ") with underscores ("_").

In [11]:
def col_lowercase(x):
    cols = []
    for item in x:
        cols.append(item.lower())
    return cols

In [12]:
data.columns = col_lowercase(data.columns)
data.columns

Index(['name', 'author', 'user rating', 'reviews', 'price', 'year', 'genre',
       'genre 2', 'genre 3', 'rank'],
      dtype='object')

In [13]:
def no_spaces(x):
    cols = []
    for item in x:
        col = ""
        for letter in item:
            if letter == " ":
                letter = "_"
                col += letter
            else:
                col += letter
        cols.append(col)
    return cols

In [14]:
data.columns = no_spaces(data.columns)
data.columns

Index(['name', 'author', 'user_rating', 'reviews', 'price', 'year', 'genre',
       'genre_2', 'genre_3', 'rank'],
      dtype='object')
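
The two helper functions above can also be expressed with pandas string methods on the column index. A minimal sketch on a toy frame with the same style of headers:

```python
import pandas as pd

# Toy frame with the same style of headers as the raw data.
toy = pd.DataFrame(columns=["Name", "User Rating", "Genre 2"])

# Lowercase every header and replace spaces with underscores in one pass.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(toy.columns))
```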

We are going to rename the "name" column to "title".

In [15]:
data = data.rename(columns={'name':'title'})

### Data types
Now we are going to check types of our data and make changes if needed.


In [16]:
data.dtypes

title           object
author          object
user_rating    float64
reviews          int64
price            int64
year             int64
genre           object
genre_2         object
genre_3         object
rank             int64
dtype: object

Looking at the data types, we note that:

- "price" should be a float, so that it can hold decimals if needed. We are going to change it from int64 to float.
- "year" is an integer. It could arguably be a date, but for now we keep it as an integer.

The rest seems just fine.

In [17]:
data['price'] = data['price'].astype(float)
data.dtypes

title           object
author          object
user_rating    float64
reviews          int64
price          float64
year             int64
genre           object
genre_2         object
genre_3         object
rank             int64
dtype: object

In [18]:
# Turn year into datetime format; may not be necessary.

#data['year'] = pd.to_datetime(data['year'], format='%Y', errors= 'coerce')
#data['year'] = pd.to_datetime(data['year'], errors='coerce')
#data['year'] = data['year'].dt.year
#data

#dt_object = pd.to_datetime(data.year,format='%Y')
#df = data
#df['date'] = dt_object
#df = data.set_index("date")
#df
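
If a date type turns out to be useful later (for example, for time-series plots), the conversion tried in the commented lines above can be sketched as follows. The toy frame is an assumption, and the year is cast to a string first so that the `%Y` format is applied unambiguously:

```python
import pandas as pd

# Toy column of integer years, like the notebook's 'year' column.
toy = pd.DataFrame({"year": [2009, 2014, 2019]})

# Parse each year as a date (1 January of that year).
toy["year_dt"] = pd.to_datetime(toy["year"].astype(str), format="%Y")

# The integer year can always be recovered from the datetime column.
toy["year_back"] = toy["year_dt"].dt.year

print(toy)
```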


### We will now check for NaNs

In [19]:
data.isna().sum()

title          0
author         0
user_rating    0
reviews        0
price          0
year           0
genre          0
genre_2        0
genre_3        0
rank           0
dtype: int64

No NaNs apparently, which is nice!

### We will now try to look for duplicate rows.

In [20]:
data.duplicated().sum()

0

In [21]:
data.duplicated() 

0      False
1      False
2      False
3      False
4      False
       ...  
545    False
546    False
547    False
548    False
549    False
Length: 550, dtype: bool

In [22]:
doubl_data = data[data.duplicated()]
print(doubl_data)

Empty DataFrame
Columns: [title, author, user_rating, reviews, price, year, genre, genre_2, genre_3, rank]
Index: []


No duplicated rows, apparently!
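
`duplicated()` without arguments only flags rows that match in every column. A stricter check, sketched here on hypothetical rows, compares only the columns that identify an entry (title, author and year):

```python
import pandas as pd

# Toy rows: the same book listed twice in 2016 with different prices.
toy = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "author": ["Author X", "Author X", "Author Y"],
    "year": [2016, 2016, 2016],
    "price": [8.0, 9.0, 12.0],
})

# Full-row comparison finds nothing, because the prices differ.
full_row_dupes = toy.duplicated().sum()

# Comparing only the identifying columns reveals the repeated entry.
key_dupes = toy[toy.duplicated(subset=["title", "author", "year"], keep=False)]

print(full_row_dupes)
print(key_dupes)
```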

### We will now try to look for duplicates, both in titles and authors.

In [23]:
print("UNIQUE TITLES AND TIMES IT APPEARS")
display(data["title"].value_counts(dropna=True))
print("-----")
print("UNIQUE AUTHORS AND TIMES IT APPEARS")
display(data["author"].value_counts(dropna=True))


UNIQUE TITLES AND TIMES IT APPEARS


Publication Manual of the American Psychological Association, 6th Edition       10
StrengthsFinder 2.0                                                              9
Oh, the Places You'll Go!                                                        8
The Very Hungry Caterpillar                                                      7
The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change     7
                                                                                ..
Humans of New York : Stories                                                     1
Howard Stern Comes Again                                                         1
Homebody: A Guide to Creating Spaces You Never Want to Leave                     1
Have a Little Faith: A True Story                                                1
Night (Night)                                                                    1
Name: title, Length: 351, dtype: int64

-----
UNIQUE AUTHORS AND TIMES IT APPEARS


Jeff Kinney                           12
Gary Chapman                          11
Rick Riordan                          11
Suzanne Collins                       11
American Psychological Association    10
                                      ..
Keith Richards                         1
Chris Cleave                           1
Alice Schertle                         1
Celeste Ng                             1
Adam Gasiewski                         1
Name: author, Length: 248, dtype: int64

In [24]:
unique_title = data["title"].unique()
#print(sorted(unique_title))
unique_title_series = pd.Series(sorted(unique_title)) 
#display(unique_title_series)
unique_title_df = (unique_title_series).to_frame(name="title")
display(unique_title_df)


Unnamed: 0,title
0,10-Day Green Smoothie Cleanse
1,11/22/63: A Novel
2,12 Rules for Life: An Antidote to Chaos
3,1984 (Signet Classics)
4,"5,000 Awesome Facts (About Everything!) (Natio..."
...,...
346,Winter of the World: Book Two of the Century T...
347,Women Food and God: An Unexpected Path to Almo...
348,Wonder
349,Wrecking Ball (Diary of a Wimpy Kid Book 14)


In [25]:
unique_author = data["author"].unique()
#print(sorted(unique_author))
unique_author_series = pd.Series(sorted(unique_author)) 
#display(unique_title_author)
unique_author_df = (unique_author_series).to_frame(name="author")
display(unique_author_df)

Unnamed: 0,author
0,Abraham Verghese
1,Adam Gasiewski
2,Adam Mansbach
3,Adir Levy
4,Admiral William H. McRaven
...,...
243,Walter Isaacson
244,William Davis
245,William P. Young
246,Wizards RPG Team


We think that some titles or authors might be duplicated without being detected: **there might be typos that make it difficult to identify them as duplicates!** How do we solve that?

In [26]:
#How to check if names might have typos and be repeated?

#METHOD 1: We have tried DEDUPE but it takes FOREVER, so we will try another method


#dd_data = pandas_dedupe.dedupe_dataframe(
  #  data,                         # Dataframe to deduplicate
  #  field_properties=['title', 'author'], # List of fields to base deduplication on
  #  canonicalize=['title'],      # List of fields to canonicalize (optional)
  #  sample_size=0.2,            # Size of sample of records to be labelled
#)

In [27]:
# METHOD 2: Will try an approach using Levenshtein Distance

#dict_title = dict()
#dict_NAME = dict()

#def dist(str1, str2):
#    return distance.levenshtein(str1, str2)

#def find_title(titlelist, todict):
#    for titles in titlelist:
#        titlesorted = Counter(titles).most_common()
#        for title in titlesorted[1:]:
#            if dist(titlesorted[0][0], title[0]) < 3:
#                todict.update({title[0]: titlesorted[0][0]})

                
# ----                
                
#dfsurname = df1.groupby(['BIRTH', 'NAME']).SURNAME.apply(list).reset_index()
#find_name(dfsurname.SURNAME.tolist(), dict_SURNAME)

#dfname = df1.groupby(['BIRTH', 'SURNAME']).NAME.apply(list).reset_index()
#find_name(dfname.NAME.tolist(), dict_NAME)

#print(dict_SURNAME)
#print(dict_NAME)

#df2 = df1.replace({'NAME': dict_NAME, 'SURNAME': dict_SURNAME})
#print(df2)
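
The Levenshtein idea from Method 2 can be sketched without external packages. This is a simplified illustration rather than the commented code above: it uses a pure-Python edit distance instead of the `distance` library, compares every pair of unique values, and the author names are hypothetical examples. The threshold of 2 mirrors the `< 3` used above.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def near_duplicates(values, max_distance=2):
    """Return pairs of distinct values whose edit distance is at most max_distance."""
    unique_values = sorted(set(values))
    return [
        (x, y) for x, y in combinations(unique_values, 2)
        if levenshtein(x.lower(), y.lower()) <= max_distance
    ]


# Hypothetical author names, including one spelling variant.
authors = ["George R. R. Martin", "George R.R. Martin", "Dan Brown"]
print(near_duplicates(authors))
```

The pairwise comparison is quadratic in the number of unique values, which should be manageable for a few hundred titles or authors. Short names can produce false positives, so the resulting pairs are best reviewed by hand.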

In [28]:
# METHOD 3: maybe less sophisticated than Levenshtein, but effective either way.
# We will create a column named 'Author_id' holding the author names in lowercase, without blank spaces
# or dots. This will help to truly identify unique authors in case there are additional blank spaces in the names.

In [29]:
#data["author"].tolist()

In [30]:
'''
authors = data["author"].tolist()
new_authors = []

def author_lower(df):
    for author in authors:
        new_author = ""
        for character in author:
            if character == " ":
                character = ""
                new_author += character
            elif character == ".":
                character == ""
                new_author += character      
            else:
                new_col += letter
        new_authors.append(new_author.lower())
    authors = new_authors
    return authors
    
'''

'\nauthors = data["author"].tolist()\nnew_authors = []\n\ndef author_lower(df):\n    for author in authors:\n        new_author = ""\n        for character in author:\n            if character == " ":\n                character = ""\n                new_author += character\n            elif character == ".":\n                character == ""\n                new_author += character      \n            else:\n                new_col += letter\n        new_authors.append(new_author.lower())\n    authors = new_authors\n    return authors\n    \n'

In [31]:
'''
def author_lowercase(x):
    author = []
    for item in x:
        author.append(item.lower())
    return author
'''

'\ndef author_lowercase(x):\n    author = []\n    for item in x:\n        author.append(item.lower())\n    return author\n'

In [32]:
#data["author_lowercase"] = data["author"].apply(author_lowercase)
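
A minimal sketch of Method 3 with pandas string methods, assuming hypothetical author names and an `author_id` column name:

```python
import pandas as pd

# Hypothetical author names with inconsistent spacing and dots.
toy = pd.DataFrame({"author": ["J. K. Rowling", "J.K. Rowling", "Dan Brown"]})

# Build a comparison key: lowercase, with spaces and dots removed.
toy["author_id"] = (
    toy["author"]
    .str.lower()
    .str.replace(" ", "", regex=False)
    .str.replace(".", "", regex=False)
)

# Fewer unique keys than unique names means some names were spelling variants.
print(toy["author"].nunique(), toy["author_id"].nunique())
```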

Sort the dataset by year and rank using sort_values

In [33]:
data_year = data.sort_values(by=['year','rank'])
data_year

Unnamed: 0,title,author,user_rating,reviews,price,year,genre,genre_2,genre_3,rank
429,The Lost Symbol,Dan Brown,4.2,8747,19.0,2009,Fiction,"Mystery, Thriller & Suspense",Thrillers & Suspense,1
459,The Shack: Where Tragedy Confronts Eternity,William P. Young,4.6,19720,8.0,2009,Fiction,Children's Books,Literature & Fiction,2
216,Liberty and Tyranny: A Conservative Manifesto,Mark R. Levin,4.8,3828,15.0,2009,Non Fiction,History,Americas,3
38,"Breaking Dawn (The Twilight Saga, Book 4)",Stephenie Meyer,4.6,9769,13.0,2009,Fiction,Teen & Young Adult,Science Fiction & Fantasy,4
134,Going Rogue: An American Life,Sarah Palin,4.6,1636,6.0,2009,Non Fiction,History,Americas,5
...,...,...,...,...,...,...,...,...,...,...
263,P is for Potty! (Sesame Street) (Lift-the-Flap),Naomi Kleinberg,4.7,10820,5.0,2019,Non Fiction,Children's Books,Growing Up & Facts of Life,46
472,The Total Money Makeover: Classic Edition: A P...,Dave Ramsey,4.7,11550,10.0,2019,Non Fiction,Christian Books & Bibles,Christian Living,47
41,"Brown Bear, Brown Bear, What Do You See?",Bill Martin Jr.,4.9,14344,5.0,2019,Fiction,Children's Books,Early Learning,48
475,The Unofficial Harry Potter Cookbook: From Cau...,Dinah Bucholz,4.7,9030,10.0,2019,Non Fiction,Children's Books,"Arts, Music & Photography",49


We will now create an ID column to easily identify books.

In [34]:
#WAY 1
#create a 'ghost' first row, run groupby.ngroup, assign the id and then drop the first entry.

#data_year['id'] = data_year.groupby(['title']).ngroup()
#data_year

#WAY 2


In [35]:
#data_year['id'] = data_year['title'].rank(method="first", ascending = True)
#data_year['id'] = data_year['id'].astype(int)

data_year['book_id'] = data_year.groupby(['title','author'],sort = False).ngroup()+1

data_year

Unnamed: 0,title,author,user_rating,reviews,price,year,genre,genre_2,genre_3,rank,book_id
429,The Lost Symbol,Dan Brown,4.2,8747,19.0,2009,Fiction,"Mystery, Thriller & Suspense",Thrillers & Suspense,1,1
459,The Shack: Where Tragedy Confronts Eternity,William P. Young,4.6,19720,8.0,2009,Fiction,Children's Books,Literature & Fiction,2,2
216,Liberty and Tyranny: A Conservative Manifesto,Mark R. Levin,4.8,3828,15.0,2009,Non Fiction,History,Americas,3,3
38,"Breaking Dawn (The Twilight Saga, Book 4)",Stephenie Meyer,4.6,9769,13.0,2009,Fiction,Teen & Young Adult,Science Fiction & Fantasy,4,4
134,Going Rogue: An American Life,Sarah Palin,4.6,1636,6.0,2009,Non Fiction,History,Americas,5,5
...,...,...,...,...,...,...,...,...,...,...,...
263,P is for Potty! (Sesame Street) (Lift-the-Flap),Naomi Kleinberg,4.7,10820,5.0,2019,Non Fiction,Children's Books,Growing Up & Facts of Life,46,313
472,The Total Money Makeover: Classic Edition: A P...,Dave Ramsey,4.7,11550,10.0,2019,Non Fiction,Christian Books & Bibles,Christian Living,47,349
41,"Brown Bear, Brown Bear, What Do You See?",Bill Martin Jr.,4.9,14344,5.0,2019,Fiction,Children's Books,Early Learning,48,286
475,The Unofficial Harry Potter Cookbook: From Cau...,Dinah Bucholz,4.7,9030,10.0,2019,Non Fiction,Children's Books,"Arts, Music & Photography",49,350
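
To illustrate how the ids are assigned: `ngroup()` numbers each (title, author) group, and `sort=False` numbers the groups in order of first appearance rather than alphabetically. A small sketch on hypothetical rows:

```python
import pandas as pd

# Toy rows already sorted by year and rank, as data_year is in the notebook.
toy = pd.DataFrame({
    "title": ["Book B", "Book A", "Book A", "Book C"],
    "author": ["Author Y", "Author X", "Author X", "Author Z"],
    "year": [2009, 2009, 2010, 2010],
})

# sort=False numbers groups in order of first appearance, so ids follow the
# row order instead of the alphabetical order of the titles.
toy["book_id"] = toy.groupby(["title", "author"], sort=False).ngroup() + 1

print(toy)
```

Both rows of "Book A" receive the same id, which is what lets us follow one book across several years.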


Next, we rearrange the columns. The new order is: ID, title, author, year, rank within the year, genres, user rating, reviews and price.

In [36]:
data_year= data_year[['book_id', 'title', 'author', 'year', 'rank', 'genre', 'genre_2', 'genre_3','user_rating', 'reviews', 'price']]
data_year   
    

Unnamed: 0,book_id,title,author,year,rank,genre,genre_2,genre_3,user_rating,reviews,price
429,1,The Lost Symbol,Dan Brown,2009,1,Fiction,"Mystery, Thriller & Suspense",Thrillers & Suspense,4.2,8747,19.0
459,2,The Shack: Where Tragedy Confronts Eternity,William P. Young,2009,2,Fiction,Children's Books,Literature & Fiction,4.6,19720,8.0
216,3,Liberty and Tyranny: A Conservative Manifesto,Mark R. Levin,2009,3,Non Fiction,History,Americas,4.8,3828,15.0
38,4,"Breaking Dawn (The Twilight Saga, Book 4)",Stephenie Meyer,2009,4,Fiction,Teen & Young Adult,Science Fiction & Fantasy,4.6,9769,13.0
134,5,Going Rogue: An American Life,Sarah Palin,2009,5,Non Fiction,History,Americas,4.6,1636,6.0
...,...,...,...,...,...,...,...,...,...,...,...
263,313,P is for Potty! (Sesame Street) (Lift-the-Flap),Naomi Kleinberg,2019,46,Non Fiction,Children's Books,Growing Up & Facts of Life,4.7,10820,5.0
472,349,The Total Money Makeover: Classic Edition: A P...,Dave Ramsey,2019,47,Non Fiction,Christian Books & Bibles,Christian Living,4.7,11550,10.0
41,286,"Brown Bear, Brown Bear, What Do You See?",Bill Martin Jr.,2019,48,Fiction,Children's Books,Early Learning,4.9,14344,5.0
475,350,The Unofficial Harry Potter Cookbook: From Cau...,Dinah Bucholz,2019,49,Non Fiction,Children's Books,"Arts, Music & Photography",4.7,9030,10.0


In [37]:
data_year.to_csv("Bestseller_cleaned.csv")

Criteria for the top 5 of the decade:
- Presence in the top 50 bestsellers during the decade
- Relevance (the lower the rank number, the better)
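
One way to combine both criteria into a single score is to give each yearly appearance a weight of 51 minus the rank and sum the weights per book, which is the idea used in the cells that follow. A sketch on hypothetical rankings, where the `years_listed` column is an addition for illustration:

```python
import pandas as pd

# Toy rankings: 'Book A' appears in two years, the others in one.
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book A", "Book C"],
    "year": [2009, 2009, 2010, 2010],
    "rank": [10, 1, 5, 2],
})

list_size = 50  # length of each yearly list

# Rank 1 earns 50 points, rank 50 earns 1 point.
toy["rank_weight"] = list_size + 1 - toy["rank"]

# Summing the weights rewards both long presence and high positions.
scores = toy.groupby("title").agg(
    years_listed=("year", "nunique"),
    total_weight=("rank_weight", "sum"),
).sort_values("total_weight", ascending=False)

print(scores)
```

In this toy example "Book A" never reaches the top spot but still scores highest (41 + 46 = 87), because it stays on the list for two years.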

In [38]:
#top = data_year[(data_year["rank"] <= 5)]
#top


In [39]:
#data_groupid = data_year.copy()

In [74]:
#data_groupid = data_year.groupby("title").agg(total_rank =("book_id", ""))
#data_groupid

In [41]:
#inv_rank(data_year['rank'])
data_year['rank_weight'] = data_year['rank'].apply(lambda x: 51-x)
data_year

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_year['rank_weight'] = data_year['rank'].apply(lambda x: 51-x)


Unnamed: 0,book_id,title,author,year,rank,genre,genre_2,genre_3,user_rating,reviews,price,rank_weight
429,1,The Lost Symbol,Dan Brown,2009,1,Fiction,"Mystery, Thriller & Suspense",Thrillers & Suspense,4.2,8747,19.0,50
459,2,The Shack: Where Tragedy Confronts Eternity,William P. Young,2009,2,Fiction,Children's Books,Literature & Fiction,4.6,19720,8.0,49
216,3,Liberty and Tyranny: A Conservative Manifesto,Mark R. Levin,2009,3,Non Fiction,History,Americas,4.8,3828,15.0,48
38,4,"Breaking Dawn (The Twilight Saga, Book 4)",Stephenie Meyer,2009,4,Fiction,Teen & Young Adult,Science Fiction & Fantasy,4.6,9769,13.0,47
134,5,Going Rogue: An American Life,Sarah Palin,2009,5,Non Fiction,History,Americas,4.6,1636,6.0,46
...,...,...,...,...,...,...,...,...,...,...,...,...
263,313,P is for Potty! (Sesame Street) (Lift-the-Flap),Naomi Kleinberg,2019,46,Non Fiction,Children's Books,Growing Up & Facts of Life,4.7,10820,5.0,5
472,349,The Total Money Makeover: Classic Edition: A P...,Dave Ramsey,2019,47,Non Fiction,Christian Books & Bibles,Christian Living,4.7,11550,10.0,4
41,286,"Brown Bear, Brown Bear, What Do You See?",Bill Martin Jr.,2019,48,Fiction,Children's Books,Early Learning,4.9,14344,5.0,3
475,350,The Unofficial Harry Potter Cookbook: From Cau...,Dinah Bucholz,2019,49,Non Fiction,Children's Books,"Arts, Music & Photography",4.7,9030,10.0,2
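
The warning above appears because `data_year` was produced by selecting columns from another DataFrame, so pandas cannot tell whether the new column is meant for a copy. A hedged sketch of one way to avoid it, with a toy frame in place of `data_year`: take an explicit `.copy()` after selecting the columns. The subtraction can also be written without `apply`.

```python
import pandas as pd

# Toy frame standing in for data_year before the column reordering.
source = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "rank": [1, 50],
    "price": [8.0, 5.0],
})

# An explicit copy after selecting columns yields an independent DataFrame,
# so adding a column to it does not trigger the warning.
subset = source[["title", "rank"]].copy()

# Vectorized equivalent of .apply(lambda x: 51 - x).
subset["rank_weight"] = 51 - subset["rank"]

print(subset)
```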


In [72]:
top_performers = data_year.groupby(["book_id", "title","author"]).agg({'rank_weight': 'sum'})
top_performers

book_id,title,author,rank_weight
1,The Lost Symbol,Dan Brown,50
2,The Shack: Where Tragedy Confronts Eternity,William P. Young,73
3,Liberty and Tyranny: A Conservative Manifesto,Mark R. Levin,48
4,"Breaking Dawn (The Twilight Saga, Book 4)",Stephenie Meyer,47
5,Going Rogue: An American Life,Sarah Palin,46
...,...,...,...
347,Can't Hurt Me: Master Your Mind and Defy the Odds,David Goggins,7
348,The Guardians: A Novel,John Grisham,6
349,The Total Money Makeover: Classic Edition: A Proven Plan for Financial Fitness,Dave Ramsey,4
350,The Unofficial Harry Potter Cookbook: From Cauldron Cakes to Knickerbocker Glory--More Than 150 Magical Recipes for…,Dinah Bucholz,2


In [73]:
top_performers.nlargest(n=10, columns="rank_weight")

book_id,title,author,rank_weight
6,StrengthsFinder 2.0,Gallup,403
25,"Publication Manual of the American Psychological Association, 6th Edition",American Psychological Association,257
137,"Oh, the Places You'll Go!",Dr. Seuss,248
193,First 100 Words,Roger Priddy,196
224,The 5 Love Languages: The Secret to Love that Lasts,Gary Chapman,183
161,Laugh-Out-Loud Jokes for Kids,Rob Elliott,171
227,Giraffes Can't Dance,Giles Andreae,170
67,The Official SAT Study Guide,The College Board,168
60,"Unbroken: A World War II Story of Survival, Resilience, and Redemption",Laura Hillenbrand,165
103,Jesus Calling: Enjoying Peace in His Presence (with Scripture References),Sarah Young,162
