# Foreword
<a name="foreword"></a>

### On some preferable practices of problem solving when coding

&nbsp;
&nbsp;

As similarly occurred to me when solving the challenges for the course _fCC DA with Python_, the cornerstone to tackle and solve these problems was to commence by **understanding** the **test_module** and its classes and methods. In the case of this ML course,  since an IDE is being used to develop and test the all the code, it's about understanding the  **test_function**.

[Go to _Understanding the **test_function**_](#understand-the-test-function)

Another noteworthy habit that helped me organize and map what I was coding was to utilize constants and variables as much as conviniently possible. That also helped me tweak my code and re-plan my steps when facing deadends in my initial mapping or when encountering inescapable Exceptions.

Nonetheless, and prior to taking any small reverse engineering approach, I tried to stay up-close to some key principles which are a _sine qua non_ requisits in each and every project, whatever their size and scope can be. These are **understanding the problem statement** - to the extent as though _you created it_ - , **exploring** and **becoming oneself familiar with the provided datasets**.

&nbsp;
&nbsp;

Thanks for reading this and happy coding!

&nbsp;
&nbsp;

G.Blanch



In [None]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2024-01-08 03:38:03--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 104.26.3.33, 172.67.70.149, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip.5’


2024-01-08 03:38:04 (50.6 MB/s) - ‘book-crossings.zip.5’ saved [26085508/26085508]

Archive:  book-crossings.zip
replace BX-Book-Ratings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [None]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

In [None]:
# Avoid repeated and unnecessary importations of the dataset
# from fCC's Static Assets
books_df = df_books.copy()
ratings_df = df_ratings.copy()

In [None]:
ratings_df.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0.0
1,276726,0155061224,5.0
2,276727,0446520802,0.0
3,276729,052165615X,3.0
4,276729,0521795028,6.0


In [None]:
# Any values to impute in ratings_df?
ratings_df.isna()\
        .any()


user      False
isbn      False
rating    False
dtype: bool

In [None]:
books_df.head()

Unnamed: 0,isbn,title,author
0,195153448,Classical Mythology,Mark P. O. Morford
1,2005018,Clara Callan,Richard Bruce Wright
2,60973129,Decision in Normandy,Carlo D'Este
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
4,393045218,The Mummies of Urumchi,E. J. W. Barber


In [None]:
# Any values to impute in books_df?
books_df.isna()\
        .any()


isbn      False
title     False
author     True
dtype: bool

In [None]:
# Select the existing nans in it
nans = books_df.isna()\
               .any(axis = 1)

books_df[nans]

Unnamed: 0,isbn,title,author
187700,9627982032,The Credit Suisse Guide to Managing Your Perso...,


We don't even need to bother on imputing it for it'll be removed due to lack of ratings, this amount being:

In [None]:
# Compute the counting of this isbn code
ratings_df['isbn'].value_counts()\
                   ['9627982032']

1

In this regard, and in order to ensure statistical significance in our results, we are to wrangle the dataset `ratings_df`.

The procedure will be to determine which groups of observations will be excluded in our final dataframe. These groups are with respect to the categories `user` and `isbn`, and being more precise, the metrics to be parsed will be **rates per user** and **rates per book**. These will be limited to 200 and 100, respectively.

This is to code:

In [None]:
# Select all the ratings each user gave
rates_per_user = ratings_df['user'].value_counts()
rates_per_user

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
116180        1
116166        1
116154        1
116137        1
276723        1
Name: user, Length: 105283, dtype: int64

In [None]:
thres_users = 200

# Select the **users who have rated less times** than the assigned threshold
mask_user = rates_per_user[rates_per_user < thres_users]
mask_user

193458    199
240403    199
203017    199
79942     198
267061    198
         ... 
116180      1
116166      1
116154      1
116137      1
276723      1
Name: user, Length: 104378, dtype: int64

Analogously, for the category `isbn` we code the same:

In [None]:
# Select all the ratings to each books (N.b.:books ≈ isbn)
rates_per_book = ratings_df['isbn'].value_counts()
rates_per_book

0971880107     2502
0316666343     1295
0385504209      883
0060928336      732
0312195516      723
               ... 
1568656386        1
1568656408        1
1569551553        1
1570081808        1
05162443314       1
Name: isbn, Length: 340556, dtype: int64

In [None]:
thres_book = 100

# Select the **books rated less times** than the assigned threshold
mask_book = rates_per_book[rates_per_book < thres_book]
mask_book
#mask_book.plot()

0375500510     99
0671727583     99
0425174271     99
1576737330     99
0425172996     99
               ..
1568656386      1
1568656408      1
1569551553      1
1570081808      1
05162443314     1
Name: isbn, Length: 339825, dtype: int64

In [None]:
# Apply the masks to the entire dataset, respectively
masked_df_users = ratings_df['user'].isin(mask_user.index)
masked_df_books = ratings_df['isbn'].isin(mask_book.index)

masked_df_books.head()

0     True
1     True
2    False
3     True
4     True
Name: isbn, dtype: bool

I.e., the book whose index is 2 in the dataframe 'ratings_df' will be included in our neighbor algorithm search, for it has more than 100 reviews. (_It's False that has less than 100_)

Let us take a look:


In [None]:
# Fetch the isbn code -i.e., 'epsilon'- of this book whose index is 2
epsilon = ratings_df['isbn'].iloc[2]
epsilon

'0446520802'

In [None]:
# Compute its counting
ratings_df['isbn'].value_counts()\
                   [epsilon]

116

It checks.

Moving forward in our data wrangling :

In [None]:
# Combine both masked series to exclude
# non-statistical significant observationss
masked_ratings_df = ratings_df[~ masked_df_users\
                               & ~ masked_df_books]

masked_ratings_df

# N.b.: If we don't negate these dataframes while using the specified masks,
# we'll be getting a PerformanceWarning error which read:
# The following operation may generate 21007032576 cells in the resulting pandas object."

Unnamed: 0,user,isbn,rating
1456,277427,002542730X,10.0
1469,277427,0060930535,0.0
1471,277427,0060934417,0.0
1474,277427,0061009059,9.0
1484,277427,0140067477,0.0
...,...,...,...
1147304,275970,0804111359,0.0
1147436,275970,140003065X,0.0
1147439,275970,1400031346,0.0
1147440,275970,1400031354,0.0


Now we're looking to join this wrangled dataset of ratings with the one containing the titles of the books, `books_df`

In [None]:
# Create a dataframe with the category 'isbn' as the index
isbn_df = books_df.set_index('isbn',
                             verify_integrity = True)
isbn_df

Unnamed: 0_level_0,title,author
isbn,Unnamed: 1_level_1,Unnamed: 2_level_1
0195153448,Classical Mythology,Mark P. O. Morford
0002005018,Clara Callan,Richard Bruce Wright
0060973129,Decision in Normandy,Carlo D'Este
0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
0393045218,The Mummies of Urumchi,E. J. W. Barber
...,...,...
0440400988,There's a Bat in Bunk Five,Paula Danziger
0525447644,From One to One Hundred,Teri Sloat
006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker
0192126040,Republic (World's Classics),Plato


In [None]:
# Pivot the wrangled dataframe, passing of course the category 'isbn' as index,
# and replacing new-existing Nans (a vast quantity of it btw) due to the pivoting operation itself
# (these Nans being all unrated books by certain users)
sparse_matrix = masked_ratings_df.pivot_table(values = 'rating',
                                              index = 'isbn',
                                              columns = 'user',
                                              fill_value = 0
                                              )

sparse_matrix

user,254,2276,2766,2977,3363,4017,4385,6242,6251,6323,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
002542730X,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,10,0,0,0
0060008032,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0060096195,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
006016848X,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0060173289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1573227331,0,0,0,0,0,0,0,6,0,0,...,0,0,0,0,0,0,0,0,0,0
1573229326,0,0,0,0,0,0,0,6,0,0,...,0,0,0,0,0,0,0,0,0,0
1573229571,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1592400876,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


For the curious:


In [None]:
zeros = (sparse_matrix == 0).sum().sum()
zero_percentage = zeros / sparse_matrix.size * 1e2
zero_percentage.round(2)

98.05

That's why the name of the variable for the dataframe `sparse matrix`.

Next, we are to join both dataframes and so obtain the final training dataframe :

In [None]:
sparse_matrix.index = sparse_matrix.join(isbn_df)['title']
# N.b.: Don't run this line more than once
# If you do, rerun the cell for defining `sparse_matrix`

sparse_matrix.head()


user,254,2276,2766,2977,3363,4017,4385,6242,6251,6323,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Politically Correct Bedtime Stories: Modern Tales for Our Life and Times,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,10,0,0,0
Angels,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Boy Next Door,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Men Are from Mars, Women Are from Venus: A Practical Guide for Improving Communication and Getting What You Want in Your Relationships",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Divine Secrets of the Ya-Ya Sisterhood : A Novel,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Poisonwood Bible,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Daughter of Fortune : A Novel (Oprah's Book Club (Hardcover)),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Prodigal Summer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
I Know This Much Is True (Oprah's Book Club),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Stupid White Men ...and Other Sorry Excuses for the State of the Nation!,0,0,0,0,0,0,0,0,10,0,...,9,0,0,0,0,0,0,0,0,0


In [None]:
# Implement unsupervised nearest neighbors learning
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

NEIGHBORS = 6

neigh = NearestNeighbors(n_neighbors = NEIGHBORS,
                         algorithm = 'auto',
                         metric = 'cosine'
                         )

# Just for fitting purposes,
# parse the training dataframe into an array-like object
sparse_array_like = sparse_matrix.values

neigh.fit(sparse_array_like)


In [None]:
def get_recommends(book = ""):

  recommended_books = []

  # Pass the array-like object into the variable `X`.
  # This contains all the ratings of the assigned book
  X = [sparse_matrix.loc[book].values]

  dist, ind = neigh.kneighbors(X,
                               n_neighbors = NEIGHBORS,
                               return_distance = True)

  for i in range(NEIGHBORS - 2):
    # Stop when you're at two iterations to finalize,
    # since that data is not needed to be selected
    # ( see the variables `recommended_books/_dist` in the test function below)

    # Fetch the title of the books..
    titles = sparse_matrix.index[ind[0][-i-1]]

    # ..and its distances with respect to their K-neighbor search
    # (Rounding helps find patterns/clarity in results, esp. when debugging)
    distances = dist[0][-i-1].round(4)

    # N.b.: When passing the index for the sublists contained in 'titles' and 'distances',
    # adjusting their index was necessary in order to fetch the title "Where the Heart.."
    # out of the first position, Otherwise, the list would be out of range.

    recommended_books.append([titles,
                              distances])

  # Share the input argument `book` to simplify code parsing
  # and satisfy fCC test function (Fetch the string "Where the Heart.." in 1st pos.)
  return [book, recommended_books]


<a name="understand-the-test-function"></a>
### Understanding the **test_function**

      test_book_recommendation()

The variable `recommends` which is being tested, must be a nested list and it has to satisfy, at least but not limited to, the following:

- the first element must be the title(str) of the book being tested. (L7)
- the second element must be another nested list - of at least 4 dimensions - , and inside each sublist:
    - the first parameter must be a list of the titles(str), and these must be the same as in the list `recommended_books`. (L12)
    - the second parameter must be a list of the distance(int) of the k-nearest neighbors. The values of the list `recommended_books_dist` will be subtracted to the ones of each sublist, and for each index (these have no limit) the absolute value of this difference(int) must be equal or smaller than .05 . (L14)



In [None]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2):
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016], ['The Weight of Water', 0.7709], ['The Surgeon', 0.7699], ['I Know This Much Is True', 0.7677]]]
You passed the challenge! 🎉🎉🎉🎉🎉


[Go back to Foreword](#foreword)
