# [RQ3]
Let’s have a historical look at the dataset!

Write a function that takes as input a year and returns as output the following information:

1. The number of books published that year.
2. The total number of pages written that year.
3. The most prolific month of that year.
4. The longest book written that year.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

### Loading a chunk of 100 thousand rows to get an overview of the data in the dataset

In [9]:
booksDf_org = pd.read_json('lighter_books.json', lines=True, nrows=100000)
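
The cell above reads only the first 100000 lines through `nrows`. If the whole file had to be scanned without holding it in memory at once, `pd.read_json` also accepts `chunksize` together with `lines=True`. Below is a minimal sketch on toy in-memory data; the records are made up and only stand in for `lighter_books.json`.

```python
import io
import pandas as pd

# Toy JSON Lines text (hypothetical records, not taken from the real file)
raw = "\n".join('{"id": %d, "num_pages": %d}' % (i, 100 + i) for i in range(10))

total_rows = 0
total_pages = 0

# With lines=True and chunksize, read_json yields DataFrames of at most 4 rows
reader = pd.read_json(io.StringIO(raw), lines=True, chunksize=4)
for chunk in reader:
    total_rows += len(chunk)
    total_pages += int(chunk["num_pages"].sum())

print(total_rows, total_pages)
```

For the real file, the `io.StringIO(raw)` argument would be replaced by the file path.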

In [10]:
booksDf_org.head()

Unnamed: 0,id,title,authors,author_name,author_id,work_id,isbn,isbn13,asin,language,...,format,edition_information,image_url,publisher,num_pages,series_id,series_name,series_position,shelves,description
0,2,Harry Potter and the Order of the Phoenix (Har...,"[{'id': '1077326', 'name': 'J.K. Rowling', 'ro...",J.K. Rowling,1077326,2809203,0439358078,9780439358071.0,,eng,...,Paperback,US Edition,https://i.gr-assets.com/images/S/compressed.ph...,Scholastic Inc.,870,45175,Harry Potter,5,"[{'name': 'to-read', 'count': 324191}, {'name'...",There is a door at the end of a silent corrido...
1,3,Harry Potter and the Sorcerer's Stone (Harry P...,"[{'id': '1077326', 'name': 'J.K. Rowling', 'ro...",J.K. Rowling,1077326,4640799,,,,eng,...,Hardcover,Library Edition,https://i.gr-assets.com/images/S/compressed.ph...,Scholastic Inc,309,45175,Harry Potter,1,"[{'name': 'fantasy', 'count': 63540}, {'name':...",Harry Potter's life is miserable. His parents ...
2,4,Harry Potter and the Chamber of Secrets (Harry...,,J.K. Rowling,1077326,6231171,0439554896,9780439554893.0,,eng,...,Hardcover,,https://i.gr-assets.com/images/S/compressed.ph...,Scholastic,352,45175,Harry Potter,2,"[{'name': 'to-read', 'count': 282341}, {'name'...",The Dursleys were so mean and hideous that sum...
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,"[{'id': '1077326', 'name': 'J.K. Rowling', 'ro...",J.K. Rowling,1077326,2402163,043965548X,9780439655484.0,,eng,...,Mass Market Paperback,,https://i.gr-assets.com/images/S/compressed.ph...,Scholastic Inc.,435,45175,Harry Potter,3,"[{'name': 'to-read', 'count': 292815}, {'name'...","For twelve long years, the dread fortress of A..."
4,6,Harry Potter and the Goblet of Fire (Harry Pot...,"[{'id': '1077326', 'name': 'J.K. Rowling', 'ro...",J.K. Rowling,1077326,3046572,,,,eng,...,Paperback,First Scholastic Trade Paperback Edition,https://i.gr-assets.com/images/S/compressed.ph...,Scholastic,734,45175,Harry Potter,4,"[{'name': 'to-read', 'count': 287086}, {'name'...",Harry Potter is midway through his training as...


In [11]:
booksDf_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99984 entries, 0 to 99983
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         99984 non-null  int64  
 1   title                      99984 non-null  object 
 2   authors                    34259 non-null  object 
 3   author_name                99984 non-null  object 
 4   author_id                  99984 non-null  int64  
 5   work_id                    99984 non-null  int64  
 6   isbn                       99984 non-null  object 
 7   isbn13                     99984 non-null  object 
 8   asin                       99984 non-null  object 
 9   language                   99984 non-null  object 
 10  average_rating             99984 non-null  float64
 11  rating_dist                99984 non-null  object 
 12  ratings_count              99984 non-null  int64  
 13  text_reviews_count         99984 non-null  int

In [17]:
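# Keep only the columns needed for the following questions
columns = ['title', 'author_name', 'language', 'average_rating', 'publication_date', 'num_pages', 'original_publication_date', 'author_id']
# .copy() makes an independent frame, so later column assignments do not write into a slice of booksDf_org
booksDfcleaned = booksDf_org[columns].copy()
booksDf = booksDfcleaned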

In [13]:
booksDf.head()

Unnamed: 0,title,author_name,language,average_rating,publication_date,num_pages,original_publication_date
0,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,eng,4.5,2004-09,870,2003-06-21
1,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,eng,4.48,2003-11-01,309,1997-06-26
2,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,eng,4.43,2003-11-01,352,1998-07-02
3,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling,eng,4.57,2004-05-01,435,1999-07-08
4,Harry Potter and the Goblet of Fire (Harry Pot...,J.K. Rowling,eng,4.56,2002-09-28,734,2000-07-08


In [14]:
def historical_look(year):
    # Expects booksDf to contain only full YYYY-MM-DD original_publication_date values.
    # Copy the selection so the num_pages conversion below does not write into a slice of booksDf
    yearDf = booksDf[booksDf['original_publication_date'].str[:4].astype(int) == year].copy()

    # The number of books published that year
    totalBooks = yearDf.shape[0]

    # The total number of pages written that year
    # Replace empty strings with a default value (0) and then convert to integers
    yearDf['num_pages'] = yearDf['num_pages'].replace('', '0').astype(int)
    totalPages = yearDf['num_pages'].sum()

    # The most prolific month of that year (all months tied for the highest count)
    df_months = yearDf['original_publication_date'].str[5:7].astype(int)
    monthCount = df_months.value_counts()  # number of books per month
    prolificMonth = monthCount[monthCount == monthCount.max()].index.tolist()

    # The longest book written that year (first title with the maximum page count)
    maxPages = yearDf['num_pages'].max()
    longestBook = yearDf[yearDf['num_pages'] == maxPages]['title'].values[0]

    return totalBooks, totalPages, prolificMonth, longestBook

In [15]:
date_pattern = r'^\d{4}-\d{2}-\d{2}$'
# Keep only rows whose original_publication_date is a full YYYY-MM-DD date
# (empty, missing and partial dates are dropped); copy to get an independent frame
booksDf = booksDf[booksDf['original_publication_date'].str.match(date_pattern, na=False)].copy()
booksDf['publication_year'] = booksDf['original_publication_date'].str[:4].astype(int)
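
The filter above keeps only full `YYYY-MM-DD` dates, so a row whose date is just `YYYY-MM` or `YYYY` is dropped. A possible alternative, sketched here on made-up values, is to pull year, month and day out with a regular expression so that partial dates still contribute a year. The names `dates` and `parts` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up examples of the date shapes that can occur: full, year-month, year only, empty, missing
dates = pd.Series(["2003-06-21", "2004-09", "2000", "", None, "1517-10-31"])

# Named groups become columns; month and day are optional
pattern = r'^(?P<year>\d{4})(?:-(?P<month>\d{2}))?(?:-(?P<day>\d{2}))?$'
parts = dates.str.extract(pattern)

# Convert the captured strings to numbers; anything missing becomes NaN
parts = parts.apply(pd.to_numeric, errors="coerce")

print(parts)
```

String extraction is used instead of `pd.to_datetime` because the default nanosecond timestamps cannot represent years as early as 1517, which do occur in this dataset.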


In [16]:
booksDf

Unnamed: 0,title,author_name,language,average_rating,publication_date,num_pages,original_publication_date,publication_year
0,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,eng,4.50,2004-09,870,2003-06-21,2003
1,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,eng,4.48,2003-11-01,309,1997-06-26,1997
2,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,eng,4.43,2003-11-01,352,1998-07-02,1998
3,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling,eng,4.57,2004-05-01,435,1999-07-08,1999
4,Harry Potter and the Goblet of Fire (Harry Pot...,J.K. Rowling,eng,4.56,2002-09-28,734,2000-07-08,2000
...,...,...,...,...,...,...,...,...
99977,Awaken the Giant Within: How to Take Immediate...,Tony Robbins,eng,4.15,1992-11-01,544,1992-01-01,1992
99978,The Anger Control Workbook,Matthew McKay,eng,4.02,2000-11-08,208,2000-01-01,2000
99979,Total Workday Control Using Microsoft Outlook,Michael Linenberger,,4.04,2005-10-20,289,2005-10-20,2005
99980,Desata Tu Poder Ilimitado!,Anthony Robbins,,4.25,2001-04-01,450,1997-01-01,1997


Use this function to build your data frame: the primary key will be a year, and the required information will be the attributes within the row. Finally, show the head and the tail of this new data frame considering the first ten years registered and the last ten years.

In [17]:
# Dataframe with yearwise totalBooks, totalPages, prolificMonth, longestBook
columns = ['totalBooks', 'totalPages', 'prolificMonth', 'longestBook']
uniqueYears = booksDf.publication_year.unique().tolist()
yearwiseList = list(map(historical_look, uniqueYears))
yearwiseDf = pd.DataFrame(yearwiseList, columns=columns)
yearwiseDf['year'] = uniqueYears

In [18]:
print('Yearwise Dataframe head: ')
yearwiseDf.head(10)

Yearwise Dataframe head: 


Unnamed: 0,totalBooks,totalPages,prolificMonth,longestBook,year
0,2618,657498,[1],"Harry Potter Boxed Set, Books 1-5 (Harry Potte...",2003
1,1471,2597323,[1],Sholokhov's Tikhii Don: A Commentary In Two Vo...,1997
2,1664,400926,[1],"Harper American Literature, Single Volume Edition",1998
3,1860,430482,[1],Cecil Textbook of Medicine: Single Volume,1999
4,2062,515958,[1],The History of Middle Earth: Part Two,2000
5,3153,780350,[1],"Harry Potter Collection (Harry Potter, #1-6)",2005
6,254,64737,[1],The Norton Anthology of American Literature,1979
7,1324,308224,[1],The Quantum Theory of Fields 3 Volume Set,1996
8,710,167476,[1],Introduction to Algorithms,1989
9,1148,257359,[1],Ryrie Study Bible Expanded Edition New America...,1995


In [19]:
print('Yearwise Dataframe tail: ')
yearwiseDf.tail(10)

Yearwise Dataframe tail: 


Unnamed: 0,totalBooks,totalPages,prolificMonth,longestBook,year
229,1,124,[1],The History of the Reign of King Henry VII,1622
230,1,124,[9],The Gold Bug,1842
231,1,0,[1],Matthew Henry Concise Commentary on the Whole ...,1710
232,4,1935,[1],"The Journey to the West, Volume 1 (Journey to ...",1592
233,7,2373,[1],"Select Works of Edmund Burke, Volume 2: Reflec...",1790
234,1,272,[1],Selected Poems,1860
235,1,0,[8],The Vampire In Europe,2017
236,1,128,[1],David Walker's Appeal,1829
237,1,352,[5],Three Major Plays,1740
238,1,348,[10],Martin Luther's Ninety-Five Theses,1517
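
Since the year is meant to be the primary key, one option is to use it as the index and sort it, so that `head(10)` and `tail(10)` show the earliest and the latest years. A minimal sketch, assuming a frame shaped like `yearwiseDf`; the three rows are copied from the outputs above only for illustration.

```python
import pandas as pd

# Three (year, totalBooks) pairs taken from the outputs above
yearwise = pd.DataFrame({
    "totalBooks": [2618, 1471, 1],
    "year": [2003, 1997, 1517],
})

# Use the year as the key and order the rows chronologically
by_year = yearwise.set_index("year").sort_index()

print(by_year.head(10))
print(by_year.tail(10))
```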


### Using LLM

In [20]:
def historical_look_chatgpt(booksDf, year):
    # Convert year to a string for comparison
    year_str = str(year)

    # Convert the 'publication_date' column to strings
    booksDf['publication_date'] = booksDf['publication_date'].astype(str)

    # Filter the DataFrame to include only rows from the specified year
    year_books = booksDf[booksDf['publication_date'].str.startswith(year_str, na=False)]

    # Number of books published that year
    num_books = len(year_books)

    # Clean the 'num_pages' column - remove non-numeric characters
    year_books['num_pages'] = year_books['num_pages'].astype(str).str.replace(r'\D', '', regex=True)
    
    # Filter out rows with empty or invalid 'num_pages' values
    year_books = year_books[year_books['num_pages'].str.isnumeric()]
    year_books['num_pages'] = pd.to_numeric(year_books['num_pages'])

    # Check if there's data for the year
    if year_books.empty:
        return {
            "num_books_published": 0,
            "total_pages_written": 0,
            "most_prolific_month": None,
            "longest_book": None
        }

    # Total number of pages written that year
    total_pages = year_books['num_pages'].sum()

    # Find the most prolific month of that year, or None if there's no data
    prolific_month = year_books['publication_month'].value_counts().idxmax() if 'publication_month' in year_books else None

    # Find the longest book written that year, or None if there's no data
    longest_book = year_books.loc[year_books['num_pages'].idxmax()].to_dict() if not year_books.empty else None

    return {
        "num_books_published": num_books,
        "total_pages_written": total_pages,
        "most_prolific_month": prolific_month,
        "longest_book": longest_book['title'] if longest_book else None
    }


In [21]:
# Assuming booksDf is your DataFrame
year = 2004  # Replace with the desired year
result = historical_look_chatgpt(booksDf, year)
print(result)

{'num_books_published': 3287, 'total_pages_written': 890831, 'most_prolific_month': None, 'longest_book': 'Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5)'}
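
`most_prolific_month` is `None` because the function looks for a `publication_month` column, which is not among the columns selected for `booksDf`. A minimal sketch of how the month could be derived from `publication_date` instead; the toy dates are invented, and year-only dates are assumed not to count toward any month.

```python
import pandas as pd

# Toy rows; the dates are invented for illustration
year_books = pd.DataFrame({
    "publication_date": ["2004-09", "2004-09-01", "2004", "2004-05-01"],
})

# Characters 5-6 hold the month when the date has at least the form YYYY-MM
month = pd.to_numeric(year_books["publication_date"].str[5:7], errors="coerce")

# value_counts ignores NaN, so year-only dates do not vote
counts = month.value_counts()
most_prolific_month = int(counts.idxmax()) if not counts.empty else None

print(most_prolific_month)
```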


In [22]:
import pandas as pd

# Assuming you have the original booksDf DataFrame

# Define a list of years for which you want to compute the information
years_to_analyze = range(2003, 2023)  # Adjust the range as needed

# Initialize an empty list to store the data for each year
data_for_each_year = []

# Compute the information for each year and store it in the list
for year in years_to_analyze:
    result = historical_look_chatgpt(booksDf, year)
    data_for_each_year.append(result)

# Create a new DataFrame from the list
yearly_info_df = pd.DataFrame(data_for_each_year, index=years_to_analyze)

In [23]:
print("Head of the DataFrame (First Ten Years):\n")
yearly_info_df.head(10)

Head of the DataFrame (First Ten Years):



Unnamed: 0,num_books_published,total_pages_written,most_prolific_month,longest_book
2003,2796,783720,,The Letters of D. H. Lawrence 8 Volume Set
2004,3287,890831,,"Harry Potter Boxed Set, Books 1-5 (Harry Potte..."
2005,4175,1148390,,"Harry Potter Collection (Harry Potter, #1-6)"
2006,5123,1441142,,Collected Works of John Stuart Mill (8 Volumes)
2007,2266,639703,,The Norton Anthology of American Literature
2008,213,55854,,Hebrew-Greek Key Word Study Bible: New America...
2009,120,34081,,The Mystical City of God (4 Volume Set)
2010,110,32258,,Bone: The Complete Edition
2011,80,20015,,George Washington's Sacred Fire
2012,67,18912,,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...


In [24]:
print("\nTail of the DataFrame (Last Ten Years):\n")
yearly_info_df.tail(10)


Tail of the DataFrame (Last Ten Years):



Unnamed: 0,num_books_published,total_pages_written,most_prolific_month,longest_book
2013,39,10901,,"Chess: 5334 Problems, Combinations and Games"
2014,16,3688,,"Business Cycles: A Theoretical, Historical, an..."
2015,35,7473,,The Valley Of Decision
2016,20,7203,,It (Eso)
2017,18,4933,,New and Collected Poems: 1931-2001
2018,22,3161,,Witness
2019,11,1959,,The Blue Bistro
2020,7,3038,,"Methods of Soil Analysis, Part 2: Microbiologi..."
2021,0,0,,
2022,0,0,,


# [RQ4] 
Quirky questions about consistency. In most cases, we will not have a consistent dataset, and the one we are dealing with is no exception. So, let's enhance our analysis.

In [2]:
authorsDf = pd.read_json('lighter_authors.json', lines=True)

In [3]:
authorsDf.head()

Unnamed: 0,ratings_count,average_rating,text_reviews_count,work_ids,book_ids,works_count,id,name,gender,image_url,about,fans_count
0,2862064,4.19,62681,"[3078186, 135328, 1877624, 74123, 3078120, 104...","[386162, 13, 8695, 8694, 6091075, 365, 569429,...",106,4,Douglas Adams,male,https://images.gr-assets.com/authors/159137433...,"Douglas Noël Adams was an English author, comi...",19826
1,1417316,4.02,84176,"[613469, 2305997, 940892, 2611786, 7800569, 31...","[9791, 21, 28, 24, 7507825, 27, 10538, 25, 26,...",75,7,Bill Bryson,male,https://images.gr-assets.com/authors/157859752...,"William McGuire ""Bill"" Bryson, OBE, FRS was bo...",16144
2,56159,4.53,352,"[17150, 808427, 20487307, 90550, 25460625, 171...","[349254, 15222, 14833682, 15221, 18126815, 152...",14,10,Jude Fisher,female,https://images.gr-assets.com/authors/141145711...,"Jude Fisher is the pseudonym for <a href=""http...",60
3,3302,3.79,480,"[4417, 14300808, 14780, 3796968, 44703121, 103...","[40, 9416484, 12482, 3753106, 26889789, 104764...",45,12,James Hamilton-Paterson,male,https://images.gr-assets.com/authors/127051738...,James Hamilton-Paterson's work has been transl...,72
4,7979,3.6,772,"[13330815, 19109351, 42306244, 72694240, 26291...","[8466327, 15739968, 22756778, 51026133, 260451...",61,14,Mark Watson,male,https://images.gr-assets.com/authors/133175379...,Mark Andrew Watson (born 13 February 1980) is ...,179


You should make sure there are no eponymous authors (different authors who have precisely the same name) in the authors' dataset. Is that true?


In [4]:
epoNum = len(authorsDf.name) - len(authorsDf.name.unique())
print("Number of eponymous names: "+ str(epoNum))

Number of eponymous names: 37


In [6]:
# Names that occur more than once in the authors dataset
eponymousNames = authorsDf[authorsDf.duplicated(subset='name')]['name'].unique()
eNames = eponymousNames.tolist()
print(str(epoNum) + ' Eponymous author names:\n')
print(eNames)

37 Eponymous author names:

['Peter  Marshall', 'Hildegard von Bingen', 'George  Franklin', 'محمد نجيب', 'Peter King', 'Paul Graham', 'John  Mole', 'Chris Lynch', 'Caroline Miller', 'Paul      Davies', 'David Yates', 'James Kent', 'Jorge Molina', 'Joseph Fink', 'Julie  Campbell', 'Jackson Butch Guice', 'Q. Hayashida', 'Mike   Lee', 'Christopher Phillips', 'Robert W. Sullivan IV', 'Yordan Yovkov', 'Catherine   Jones', 'Martin    Shaw', 'David  Nelson', 'Peter      Marshall', 'Katherine Mercurio Gotthardt', 'M.K. Graff', '小野不由美', 'Boris Zakhoder', 'Peter Green', 'Peter    Green', 'William Messner-Loebs', 'Peter  Davies', 'Dimitar Dimov', 'James C.L. Carson', 'Cicerón', 'Erin  Bedford']
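
Some names in this list differ only in the amount of whitespace (for example the two spellings of Peter Marshall and of Peter Green). A possible follow-up check, sketched on toy data, is to collapse the whitespace and count how many distinct ids share each normalized name. The frame `authors` and the column `name_norm` are hypothetical.

```python
import pandas as pd

# Toy authors; the ids are invented
authors = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Peter  Marshall", "Peter Marshall", "Paul Graham", "Paul Graham", "Douglas Adams"],
})

# Collapse runs of whitespace into single spaces
authors["name_norm"] = authors["name"].str.split().str.join(" ")

# Number of distinct author ids behind each normalized name
ids_per_name = authors.groupby("name_norm")["id"].nunique()
shared = ids_per_name[ids_per_name > 1]

print(shared)
```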


Write a function that, given a list of author_id, outputs a dictionary where each author_id is a key, and the related value is a list with the names of all the books the author has written.


In [18]:
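def authorBooks(author_ids):
    """Return a dictionary {author_id: [book titles]} for every id in author_ids."""
    # Inner join: only authors with at least one book in booksDf appear in joinedDf
    joinedDf = authorsDf.merge(booksDf, left_on='id', right_on='author_id', how='inner')
    authorBookdict = {}
    for id_ in author_ids:
        # Authors without any matching book get an empty list
        authorBookdict[id_] = list(joinedDf[joinedDf['id'] == id_]['title'])
    return authorBookdict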
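
The books data appears to hold several rows with the same title for one author, so the lists returned by this function can contain repeated titles. A minimal sketch of a variant that keeps each title once and skips the merge, assuming only a books frame with `author_id` and `title` columns. The function name `author_books_unique` and the toy data are hypothetical.

```python
import pandas as pd

def author_books_unique(books, author_ids):
    """Map each requested author id to its distinct titles, in first-seen order."""
    result = {a: [] for a in author_ids}
    wanted = books[books["author_id"].isin(author_ids)]
    for a, title in zip(wanted["author_id"], wanted["title"]):
        if title not in result[a]:
            result[a].append(title)
    return result

# Toy books frame with one repeated title
books = pd.DataFrame({
    "author_id": [4, 4, 4, 7],
    "title": ["A", "B", "A", "C"],
})

out = author_books_unique(books, [4, 7, 10])
print(out)
```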

In [19]:
# checking
author_id = [4, 7, 10, 20]
authorBooks(author_id)

{4: ["The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)",
  "The Ultimate Hitchhiker's Guide: Five Complete Novels and One Story (Hitchhiker's Guide to the Galaxy, #1-5)",
  "The Ultimate Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1-5)",
  "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)",
  "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)",
  "The Hitchhiker's Guide to the Galaxy: Quandary Phase (Hitchhiker's Guide: Radio Play, #4)",
  "The Ultimate Hitchhiker's Guide (Hitchhiker's Guide to the Galaxy, #1-5)",
  "The Hitchhiker's Guide to the Galaxy: Quintessential Phase (Hitchhiker's Guide: Radio Play, #5)",
  'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)',
  "Dirk Gently's Holistic Detective Agency (Dirk Gently, #1)",
  'The Salmon of Doubt (Dirk Gently, #3)',
  "Mostly Harmless (Hitchhiker's Guide to the Galaxy, #5)",
  "Life, the Universe and Everything (Hitchhiker's

What is the longest book title among the books of the top 20 authors regarding their average rating? Is it the longest book title overall?


In [20]:
joinedDf = authorsDf.merge(booksDf, left_on='id', right_on='author_id', how='inner', suffixes=('_auth', '_book'))
joinedDf.head()

Unnamed: 0,ratings_count,average_rating_auth,text_reviews_count,work_ids,book_ids,works_count,id,name,gender,image_url,about,fans_count,title,author_name,language,average_rating_book,publication_date,num_pages,original_publication_date,author_id
0,2862064,4.19,62681,"[3078186, 135328, 1877624, 74123, 3078120, 104...","[386162, 13, 8695, 8694, 6091075, 365, 569429,...",106,4,Douglas Adams,male,https://images.gr-assets.com/authors/159137433...,"Douglas Noël Adams was an English author, comi...",19826,The Hitchhiker's Guide to the Galaxy (Hitchhik...,Douglas Adams,eng,4.22,2005,216,1979-10-12,4
1,2862064,4.19,62681,"[3078186, 135328, 1877624, 74123, 3078120, 104...","[386162, 13, 8695, 8694, 6091075, 365, 569429,...",106,4,Douglas Adams,male,https://images.gr-assets.com/authors/159137433...,"Douglas Noël Adams was an English author, comi...",19826,The Ultimate Hitchhiker's Guide: Five Complete...,Douglas Adams,eng,4.36,2005-11-01,815,1996-01-17,4
2,2862064,4.19,62681,"[3078186, 135328, 1877624, 74123, 3078120, 104...","[386162, 13, 8695, 8694, 6091075, 365, 569429,...",106,4,Douglas Adams,male,https://images.gr-assets.com/authors/159137433...,"Douglas Noël Adams was an English author, comi...",19826,The Ultimate Hitchhiker's Guide to the Galaxy ...,Douglas Adams,eng,4.36,2002-04-28,815,1996-01-17,4
3,2862064,4.19,62681,"[3078186, 135328, 1877624, 74123, 3078120, 104...","[386162, 13, 8695, 8694, 6091075, 365, 569429,...",106,4,Douglas Adams,male,https://images.gr-assets.com/authors/159137433...,"Douglas Noël Adams was an English author, comi...",19826,The Hitchhiker's Guide to the Galaxy (Hitchhik...,Douglas Adams,eng,4.22,2004-08-03,215,1979-10-12,4
4,2862064,4.19,62681,"[3078186, 135328, 1877624, 74123, 3078120, 104...","[386162, 13, 8695, 8694, 6091075, 365, 569429,...",106,4,Douglas Adams,male,https://images.gr-assets.com/authors/159137433...,"Douglas Noël Adams was an English author, comi...",19826,The Hitchhiker's Guide to the Galaxy (Hitchhik...,Douglas Adams,eng,4.22,2005-03-23,6,1979-10-12,4


The joined dataset has two average-rating columns, so I use the one that comes from the books dataset, i.e. average_rating_book.


In [21]:
joinedDfsort = joinedDf.sort_values(by = 'average_rating_book', ascending = False).reset_index()
joinedDfsort.head(20) # Top 20 rows by book rating, taken here as the top 20 authors

Unnamed: 0,index,ratings_count,average_rating_auth,text_reviews_count,work_ids,book_ids,works_count,id,name,gender,...,about,fans_count,title,author_name,language,average_rating_book,publication_date,num_pages,original_publication_date,author_id
0,27592,828836,4.03,41596,"[3312237, 2339177, 2067752, 1243291, 805783, 3...","[518848, 334643, 47624, 47613, 47619, 47620, 2...",185,8347,Garth Nix,male,...,"Garth Nix was born in 1963 in Melbourne, Austr...",12025,"Aussie Bites- Serena and the Sea Serpent, Plus...",Garth Nix,,5.0,2004-01-01,,,8347
1,6330,32706,3.75,2604,"[1881188, 2244425, 24082015, 24082022, 87329, ...","[7628, 777824, 8701589, 17017131, 10411554, 21...",211,1209,Ford Madox Ford,male,...,"Ford Madox Ford, born Ford Hermann Hueffer, wa...",261,The Correspondence of Ford Madox Ford and Stel...,Ford Madox Ford,,5.0,1994-01-01,479.0,1993-11,1209
2,94145,3525,4.19,215,"[7251305, 9158556, 51228373, 13379000, 7253622...","[7006735, 7363667, 30686748, 8512570, 7008737,...",488,4177969,Taschen,,...,'Taschen is an art book publisher founded in 1...,247,Kandinsky Poster Book,Taschen,,5.0,1995-06-01,6.0,1995-06-01,4177969
3,40178,368,4.19,25,"[42878808, 26516, 26512, 18538016, 1271664, 32...","[23326565, 25796, 25792, 13330379, 1282634, 31...",24,14463,Ringu Tulku,male,...,<b>Karma Tsultrim Gyurmé Trinlé</b> (Tibetan: ...,11,The Lazy Lama Looks At Refuge: Finding A Purpo...,Ringu Tulku,,5.0,,,2000,14463
4,86389,254,3.96,32,"[5180629, 1281455, 6466817, 2266090, 15049414,...","[5113854, 1292360, 6282828, 2260138, 10151065,...",56,620489,Patricia Lynch,female,...,Patricia Lynch (c. 1894–1972) was an Irish wri...,3,Elsewhere: The Adventures of Belemus,Patricia Lynch,,5.0,1997-06-19,128.0,1985,620489
5,67095,2426,3.89,254,"[84110, 1173005, 84116, 1063411, 584844, 84111...","[87141, 87144, 87147, 1076698, 598215, 87142, ...",179,50004,Hans Küng,male,...,"Hans Küng is a Swiss Catholic priest, controve...",97,Consensus in Theology? A Dialogue with Hans Ku...,Hans Küng,,5.0,1980-12-31,165.0,1980,50004
6,19376,22410,3.83,821,"[823316, 823503, 414512, 823295, 915674, 10801...","[352383, 106197, 425452, 837713, 930682, 7732,...",114,5228,Jean Anouilh,male,...,"Anouilh was born in Cérisole, a small village ...",126,"Anouilh Plays: 2: The Rehearsal, Becket, The O...",Jean Anouilh,eng,5.0,1997-12-04,288.0,1995,5228
7,81358,132,4.26,38,"[73787210, 70457459, 68231728, 84891502, 25068...","[52922506, 45870850, 43844777, 54401953, 17897...",32,96541,Robert Walton,male,...,I am a retired teacher with thirty-six years o...,6,The Dragon and the Lemon Tree,Robert Walton,eng,5.0,,,1989,96541
8,89490,239,4.34,45,"[43643396, 2281034, 23738470, 6890555, 2823338...","[24043727, 2275010, 17226378, 3649363, 2797552...",32,1078471,Dulce María Loynaz,female,...,"Dulce María Loynaz (December 10, 1902 - April ...",21,Homenaje a Dulce Maria Loynaz: Premio Cervante...,Dulce María Loynaz,,5.0,1993-04-01,415.0,1993-04,1078471
9,81236,35,3.89,6,"[1689107, 66626408, 6700313, 2400409, 5743819,...","[1692210, 42854767, 6508624, 2393397, 5572623,...",52,95993,Malcolm Whyte,male,...,"Malcolm Whyte is an author, editor, publisher,...",1,Prehistoric Mammals Action Set,Malcolm Whyte,,5.0,1988-08-13,,1988,95993


In [22]:
# top 20 authors df
top20auth = joinedDfsort[['title', 'name', 'average_rating_book']].head(20).copy()
top20auth['title_length'] = top20auth['title'].str.len()
top20auth = top20auth.sort_values(by = 'title_length', ascending = False).reset_index()
print('The longest book title is: '+ top20auth.title[0] +
      '\n of length: ' + str(top20auth.title_length[0]) +
      ' \n by author: ' + top20auth.name[0] +
      '\n with average rating: ' + str(top20auth.average_rating_book[0]))


The longest book title is: Homenaje a Dulce Maria Loynaz: Premio Cervantes 1993/Obra Literaria : Poesia Y Prosa : Estudios Y Comentarios (Coleccion Clasicos Cubanos)
 of length: 138 
 by author: Dulce María Loynaz
 with average rating: 5.0
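
The second half of the question (is it the longest title overall?) can be checked by comparing against the longest title in the full books frame. The sketch below also shows an alternative reading of "top 20 authors", ranking authors by their own average rating rather than ranking book rows. All data, `top_n = 2` and the variable names are made up for illustration.

```python
import pandas as pd

# Toy authors and books; ratings and titles are invented
authors = pd.DataFrame({"id": [1, 2, 3], "average_rating": [4.9, 3.0, 4.5]})
books = pd.DataFrame({
    "author_id": [1, 1, 2, 3],
    "title": ["Short", "A much longer title",
              "The longest title in this toy example", "Mid title"],
})

top_n = 2  # would be 20 on the real data

# Ids of the authors with the highest author-level rating
top_ids = authors.nlargest(top_n, "average_rating")["id"]
top_books = books[books["author_id"].isin(top_ids)]

# Longest title among the top authors' books, and longest title overall
longest_top = top_books.loc[top_books["title"].str.len().idxmax(), "title"]
longest_all = books.loc[books["title"].str.len().idxmax(), "title"]

print(longest_top)
print(longest_all)
print(longest_top == longest_all)
```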


What is the shortest overall book title in the dataset? If you find something strange, provide a comment on what happened and an alternative answer.


In [23]:
joinedDf['title_length'] = joinedDf['title'].str.len()
joinedDf = joinedDf.sort_values(by = 'title_length').reset_index()
print('The shortest book title is: '+ joinedDf.title[0] +
      '\n of length: ' + str(joinedDf.title_length[0]) +
      ' \n by author: ' + joinedDf.name[0] +
      '\n with average rating: ' + str(joinedDf.average_rating_book[0]))

The shortest book title is: a
 of length: 1 
 by author: Andy Warhol
 with average rating: 3.39
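
To judge whether a one-character title is a data problem or a real title, it can help to look at several of the shortest titles rather than only the first. A minimal sketch on invented titles, assuming that blank or whitespace-only titles should be ignored:

```python
import pandas as pd

# Invented titles, including blank and missing ones
titles = pd.Series(["a", "  ", "It", "", "Dune", None])

# Drop missing values, trim whitespace, and discard titles that end up empty
clean = titles.dropna().str.strip()
clean = clean[clean.str.len() > 0]

# The three shortest remaining titles, shortest first
shortest = clean.loc[clean.str.len().nsmallest(3).index]

print(shortest.tolist())
```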
