## Future Idea: Adding Curated Collections
This notebook explores the possibility of serving the scenario where our stakeholder, an independent bookstore, introduces a new curated collection of books to their inventory. This is both to figure out how to boost the sales of said collections (like a special occasion for exploring new topics and realms of thought) as well as using them as opportunities to improve recommendations for readers suffering from our current unbalanced and overly west-centric book collection and the recommendations which keep them stuck in an echo chamber of similar literature to what they already know. 

### About This Dataset

__Source__: [The Munich Meets African Literature Book Club's Dataset On Kaggle]('https://www.kaggle.com/datasets/sthabile/munich-meets-african-literature-book-club-list?select=Munich-Meets-African-Lit.csv')

__Context__
The Munich Meets African Literature (MMAL) is a book club founded by Nana Kesewaa Dankwa in Munich during April 2018. The book club was started with the intention of celebrating literary works written by authors from all over Africa. Meetings are held on a monthly basis and the books read and discussed at the monthly meetings are selected by popular vote.

__Content__
Each row in the table represents a monthly book club selection. The column entries show attributes for the book selection and the meeting during which it was discussed.

__Acknowledgements__
The source information for this dataset is stored on the Meetup.com servers and is accessible via their website. Many thanks go to the members of the MMAL who add to the health and longetivity of the Book Club by reading the books, attending the meetings, participating in the book discussions and donating funds to keep the group hosted on the Meetup website.

__Inspiration__
The book club selections at MMAL can bring forth fascinating insights about popular and widely known African literature. Questions such as which regions of Africa produce the most notable literary works based on this book club's selection. Which publishers support African writers the most. If you manage to mine datasets showing book selections from other book clubs, you could compare and contrast attributes across datasets to find common patterns in how books are selected for discussions.

In [99]:
# importing the main data
%run import_data

Continuing with existing version of data folder
Goodreads dataset loaded successfully as books_goodreads
Pandas dataframes (books_goodreads, books_big, book, users, ratings) loaded successfully
Columns in DataFrames 'users' and 'ratings' renamed
You can use the DataFrames 'books' or 'books_big' - they are exactly the same (big) dataset
Ready to go!


In [14]:
# importing the new collection
books_mmal = pd.read_csv("data/mmal.csv", sep=";", na_filter=True)

In [31]:
books_mmal.head()

Unnamed: 0,country,region,book_title,book_author,author_gender,year_of_publication,publisher,date_of_meeting,venue,time_(cet)
0,Zimbabwe,Southern,We Need New Names,NoViolet Bulawayo,F,21.05.2013,Reagan Arthur Books,18.05.2018,The Munich Readery,18h00
1,Kenya,Eastern,Decolonising the Mind,Ngugi wa Thiong'o,M,16.07.1986,Heinemann Educational Books,22.06.2018,The Munich Readery,18h00
2,Ghana,Western,Homecoming,Yaa Gyasi,F,15.06.2016,Alfred A. Knopf,27.07.2018,The Munich Readery,18h00
3,South Africa,Southern,Born a Crime,Trevor Noah,M,15.11.2016,Doubleday Canada,17.08.2018,Café Jasmin,18h00
4,Nigeria,Western,Binti,Nnedi Okarafor,F,22.09.2015,Tor.com,21.09.2018,The Munich Readery,18h00


In [16]:
books_mmal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Country          65 non-null     object
 1   Region           65 non-null     object
 2   Book-title       65 non-null     object
 3   Author           65 non-null     object
 4   Author Gender    65 non-null     object
 5   Date Published   65 non-null     object
 6   Publisher        65 non-null     object
 7   Date of Meeting  65 non-null     object
 8   Venue            65 non-null     object
 9   Time (CET)       65 non-null     object
dtypes: object(10)
memory usage: 5.2+ KB


In [26]:
books_mmal.columns = books_mmal.columns.str.lower().str.replace('-', '_')
books_mmal.columns = books_mmal.columns.str.lower().str.replace(' ', '_')

In [27]:
books_mmal = books_mmal.rename(columns={'author': 'book_author'})

In [29]:
books_mmal = books_mmal.rename(columns={'date_published': 'year_of_publication'})

In [33]:
books_mmal['year_of_publication'] = pd.to_datetime(books_mmal['year_of_publication'], format='%d.%m.%Y').dt.year

In [34]:
books_mmal.head()

Unnamed: 0,country,region,book_title,book_author,author_gender,year_of_publication,publisher,date_of_meeting,venue,time_(cet)
0,Zimbabwe,Southern,We Need New Names,NoViolet Bulawayo,F,2013,Reagan Arthur Books,18.05.2018,The Munich Readery,18h00
1,Kenya,Eastern,Decolonising the Mind,Ngugi wa Thiong'o,M,1986,Heinemann Educational Books,22.06.2018,The Munich Readery,18h00
2,Ghana,Western,Homecoming,Yaa Gyasi,F,2016,Alfred A. Knopf,27.07.2018,The Munich Readery,18h00
3,South Africa,Southern,Born a Crime,Trevor Noah,M,2016,Doubleday Canada,17.08.2018,Café Jasmin,18h00
4,Nigeria,Western,Binti,Nnedi Okarafor,F,2015,Tor.com,21.09.2018,The Munich Readery,18h00


In [35]:
books_mmal.region.unique()

array(['Southern', 'Eastern', 'Western', 'Northern', 'Island', 'Central',
       'Southern ', 'Westerm'], dtype=object)

In [100]:
books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l,genre
0,590085417,Heidi,Johanna Spyri,2021,Scholastic,http://images.amazon.com/images/P/0590085417.0...,http://images.amazon.com/images/P/0590085417.0...,http://images.amazon.com/images/P/0590085417.0...,"Johanna Spyri, Shirley Temple, Movie tie-in, C..."
1,068160204X,The Royals,Kitty Kelley,2020,Bausch & Lombard,http://images.amazon.com/images/P/068160204X.0...,http://images.amazon.com/images/P/068160204X.0...,http://images.amazon.com/images/P/068160204X.0...,
2,068107468X,Edgar Allen Poe Collected Poems,Edgar Allan Poe,2020,Bausch & Lombard,http://images.amazon.com/images/P/068107468X.0...,http://images.amazon.com/images/P/068107468X.0...,http://images.amazon.com/images/P/068107468X.0...,American Fantasy poetry
3,307124533,Owl's Amazing but True No. 2,Owl Magazine,2012,Golden Books,http://images.amazon.com/images/P/0307124533.0...,http://images.amazon.com/images/P/0307124533.0...,http://images.amazon.com/images/P/0307124533.0...,
4,380816792,A Rose in Winter,Kathleen E. Woodiwiss,2011,Harper Mass Market Paperbacks,http://images.amazon.com/images/P/0380816792.0...,http://images.amazon.com/images/P/0380816792.0...,http://images.amazon.com/images/P/0380816792.0...,"Fiction, Historical Fiction, Romance, Fiction,..."


In [44]:
# First, let's scrape the isbn so we can join the new data with our old data

In [37]:
import requests

# Define a function that scrapes the isbn from Google Books API
def get_isbn_from_google_books(title):
    # Google Books API URL
    url = 'https://www.googleapis.com/books/v1/volumes'
    
    # Parameters for the request
    params = {
        'q': title,
        'maxResults': 1
    }
    
    # Send the request
    response = requests.get(url, params=params)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        items = data.get('items', [])
        
        # Extract ISBN-10 if available
        for item in items:
            identifiers = item.get('volumeInfo', {}).get('industryIdentifiers', [])
            for identifier in identifiers:
                if identifier.get('type') == 'ISBN_10':
                    return identifier.get('identifier')
                    
    return None

# Apply the function to the book_title column
books_mmal['isbn'] = books_mmal['book_title'].apply(get_isbn_from_google_books)

In [38]:
books_mmal.head()

Unnamed: 0,country,region,book_title,book_author,author_gender,year_of_publication,publisher,date_of_meeting,venue,time_(cet),isbn
0,Zimbabwe,Southern,We Need New Names,NoViolet Bulawayo,F,2013,Reagan Arthur Books,18.05.2018,The Munich Readery,18h00,1448156238
1,Kenya,Eastern,Decolonising the Mind,Ngugi wa Thiong'o,M,1986,Heinemann Educational Books,22.06.2018,The Munich Readery,18h00,9966466843
2,Ghana,Western,Homecoming,Yaa Gyasi,F,2016,Alfred A. Knopf,27.07.2018,The Munich Readery,18h00,1498225187
3,South Africa,Southern,Born a Crime,Trevor Noah,M,2016,Doubleday Canada,17.08.2018,Café Jasmin,18h00,9044975757
4,Nigeria,Western,Binti,Nnedi Okarafor,F,2015,Tor.com,21.09.2018,The Munich Readery,18h00,765384469


In [39]:
books_mmal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   country              65 non-null     object
 1   region               65 non-null     object
 2   book_title           65 non-null     object
 3   book_author          65 non-null     object
 4   author_gender        65 non-null     object
 5   year_of_publication  65 non-null     int32 
 6   publisher            65 non-null     object
 7   date_of_meeting      65 non-null     object
 8   venue                65 non-null     object
 9   time_(cet)           65 non-null     object
 10  isbn                 56 non-null     object
dtypes: int32(1), object(10)
memory usage: 5.5+ KB


In [40]:
# Let's see if Open Library API has the 9 missing isbn's:

In [41]:
# Define a function to scrape isbn from Open Library API
def get_isbn_from_open_library(title):
    # Open Library API URL
    url = 'https://openlibrary.org/search.json'
    
    # Parameters for the request
    params = {
        'q': title,
        'limit': 1  # Limit to 1 result
    }
    
    # Send the request
    response = requests.get(url, params=params)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        docs = data.get('docs', [])
        
        # Extract ISBN-10 if available
        for doc in docs:
            isbn_10 = doc.get('isbn', [])
            if isbn_10:
                # Return the first ISBN-10 found
                return next((isbn for isbn in isbn_10 if len(isbn) == 10), None)
                    
    return None


In [42]:
# Apply the function only to the missing values of the book_title column
books_mmal['isbn'] = books_mmal['isbn'].fillna(
    books_mmal[books_mmal['isbn'].isna()]['book_title'].apply(get_isbn_from_open_library)
)

In [43]:
books_mmal.info() # Now we have all the isbn's for our new collection

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   country              65 non-null     object
 1   region               65 non-null     object
 2   book_title           65 non-null     object
 3   book_author          65 non-null     object
 4   author_gender        65 non-null     object
 5   year_of_publication  65 non-null     int32 
 6   publisher            65 non-null     object
 7   date_of_meeting      65 non-null     object
 8   venue                65 non-null     object
 9   time_(cet)           65 non-null     object
 10  isbn                 65 non-null     object
dtypes: int32(1), object(10)
memory usage: 5.5+ KB


In [None]:
# Columns to drop
columns_to_drop = ['country', 'region']

# Drop multiple columns in one line
df = df.drop(columns=columns_to_drop)

In [101]:
# Let's think about what we can make use of here:

# First, what do the values of 'genre' in our main dataset look like?
books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l,genre
0,590085417,Heidi,Johanna Spyri,2021,Scholastic,http://images.amazon.com/images/P/0590085417.0...,http://images.amazon.com/images/P/0590085417.0...,http://images.amazon.com/images/P/0590085417.0...,"Johanna Spyri, Shirley Temple, Movie tie-in, C..."
1,068160204X,The Royals,Kitty Kelley,2020,Bausch & Lombard,http://images.amazon.com/images/P/068160204X.0...,http://images.amazon.com/images/P/068160204X.0...,http://images.amazon.com/images/P/068160204X.0...,
2,068107468X,Edgar Allen Poe Collected Poems,Edgar Allan Poe,2020,Bausch & Lombard,http://images.amazon.com/images/P/068107468X.0...,http://images.amazon.com/images/P/068107468X.0...,http://images.amazon.com/images/P/068107468X.0...,American Fantasy poetry
3,307124533,Owl's Amazing but True No. 2,Owl Magazine,2012,Golden Books,http://images.amazon.com/images/P/0307124533.0...,http://images.amazon.com/images/P/0307124533.0...,http://images.amazon.com/images/P/0307124533.0...,
4,380816792,A Rose in Winter,Kathleen E. Woodiwiss,2011,Harper Mass Market Paperbacks,http://images.amazon.com/images/P/0380816792.0...,http://images.amazon.com/images/P/0380816792.0...,http://images.amazon.com/images/P/0380816792.0...,"Fiction, Historical Fiction, Romance, Fiction,..."


In [102]:
books['genre'] = books['genre'].str.lower()

In [103]:
books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l,genre
0,590085417,Heidi,Johanna Spyri,2021,Scholastic,http://images.amazon.com/images/P/0590085417.0...,http://images.amazon.com/images/P/0590085417.0...,http://images.amazon.com/images/P/0590085417.0...,"johanna spyri, shirley temple, movie tie-in, c..."
1,068160204X,The Royals,Kitty Kelley,2020,Bausch & Lombard,http://images.amazon.com/images/P/068160204X.0...,http://images.amazon.com/images/P/068160204X.0...,http://images.amazon.com/images/P/068160204X.0...,
2,068107468X,Edgar Allen Poe Collected Poems,Edgar Allan Poe,2020,Bausch & Lombard,http://images.amazon.com/images/P/068107468X.0...,http://images.amazon.com/images/P/068107468X.0...,http://images.amazon.com/images/P/068107468X.0...,american fantasy poetry
3,307124533,Owl's Amazing but True No. 2,Owl Magazine,2012,Golden Books,http://images.amazon.com/images/P/0307124533.0...,http://images.amazon.com/images/P/0307124533.0...,http://images.amazon.com/images/P/0307124533.0...,
4,380816792,A Rose in Winter,Kathleen E. Woodiwiss,2011,Harper Mass Market Paperbacks,http://images.amazon.com/images/P/0380816792.0...,http://images.amazon.com/images/P/0380816792.0...,http://images.amazon.com/images/P/0380816792.0...,"fiction, historical fiction, romance, fiction,..."


In [106]:
# Function to remove duplicates from a comma-separated string
def remove_duplicate_genres(genres_str):
    # Split the string into a list
    genres_list = genres_str.split(', ')
    # Turn everything into lower case
    # Remove duplicates and sort if desired
    unique_genres = sorted(set(genres_list))
    # Join the list back into a comma-separated string
    return ', '.join(unique_genres)

# Apply the function to the 'genres' column
books['genre'] = books['genre'].astype(str)
books['genre'] = books['genre'].apply(remove_duplicate_genres)


In [107]:
books.genre.describe()

count     226493
unique    129860
top          nan
freq       30774
Name: genre, dtype: object

In [112]:
# Count how many values in books_mmal['isbn'] are also in books['isbn']

# Create a boolean mask for matching ISBNs
mask = books_mmal['isbn'].isin(books['isbn'])

# Filter the books_mmal DataFrame to show only matching ISBNs
matching_books_mmal = books_mmal[mask]
matching_books_mmal

Unnamed: 0,country,region,book_title,book_author,author_gender,year_of_publication,publisher,date_of_meeting,venue,time_(cet),isbn
41,South Africa,Southern,Chaka,Thomas Mofolo,M,1983,Heinemann,26.02.2022,Online,17h00,1579548261


In [113]:
books[books['isbn'] == '1579548261']

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l,genre
7383,1579548261,Chaka! Through the Fire,Chaka Khan,2003,Rodale Books,http://images.amazon.com/images/P/1579548261.0...,http://images.amazon.com/images/P/1579548261.0...,http://images.amazon.com/images/P/1579548261.0...,"biography, singers"
