*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users. 

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the “closeness” of instances.

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with less than 200 ratings and books with less than 100 ratings.

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

# Imports

In [None]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2022-01-14 14:50:43--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip’


2022-01-14 14:50:44 (27.9 MB/s) - ‘book-crossings.zip’ saved [26085508/26085508]

Archive:  book-crossings.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


In [None]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

* To build a model for this, we need to align the two dataframes according to the ISBN numbers.

# Data Processing

In [None]:
df_books.head()

Unnamed: 0,isbn,title,author
0,195153448,Classical Mythology,Mark P. O. Morford
1,2005018,Clara Callan,Richard Bruce Wright
2,60973129,Decision in Normandy,Carlo D'Este
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
4,393045218,The Mummies of Urumchi,E. J. W. Barber


In [None]:
df_ratings.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0.0
1,276726,0155061224,5.0
2,276727,0446520802,0.0
3,276729,052165615X,3.0
4,276729,0521795028,6.0


In [None]:
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271379 entries, 0 to 271378
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   isbn    271379 non-null  object
 1   title   271379 non-null  object
 2   author  271378 non-null  object
dtypes: object(3)
memory usage: 6.2+ MB


In [None]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype  
---  ------  --------------    -----  
 0   user    1149780 non-null  int32  
 1   isbn    1149780 non-null  object 
 2   rating  1149780 non-null  float32
dtypes: float32(1), int32(1), object(1)
memory usage: 17.5+ MB


In [None]:
df = pd.merge(df_books, df_ratings, on = 'isbn')

In [None]:
df.head()

Unnamed: 0,isbn,title,author,user,rating
0,195153448,Classical Mythology,Mark P. O. Morford,2,0.0
1,2005018,Clara Callan,Richard Bruce Wright,8,5.0
2,2005018,Clara Callan,Richard Bruce Wright,11400,0.0
3,2005018,Clara Callan,Richard Bruce Wright,11676,8.0
4,2005018,Clara Callan,Richard Bruce Wright,41385,0.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031175 entries, 0 to 1031174
Data columns (total 5 columns):
 #   Column  Non-Null Count    Dtype  
---  ------  --------------    -----  
 0   isbn    1031175 non-null  object 
 1   title   1031175 non-null  object 
 2   author  1031174 non-null  object 
 3   user    1031175 non-null  int32  
 4   rating  1031175 non-null  float32
dtypes: float32(1), int32(1), object(3)
memory usage: 39.3+ MB


In [None]:
isnotnum = []
for i in range(len(df['isbn'])):
  if df['isbn'][i].isnumeric() == False:
    isnotnum.append(i)

In [None]:
df = df.drop(isnotnum, axis = 0)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 945774 entries, 0 to 1031174
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   isbn    945774 non-null  object 
 1   title   945774 non-null  object 
 2   author  945773 non-null  object 
 3   user    945774 non-null  int32  
 4   rating  945774 non-null  float32
dtypes: float32(1), int32(1), object(3)
memory usage: 36.1+ MB


In [None]:
df['isbn'] = df['isbn'].astype('int')

In [None]:
df = df[df['rating'] > 0]

In [None]:
df.describe()

Unnamed: 0,isbn,user,rating
count,352205.0,352205.0,352205.0
mean,850133200.0,136004.8032,7.62668
std,1280668000.0,80483.810172,1.837556
min,913154.0,8.0,1.0
25%,375708000.0,67547.0,7.0
50%,471579700.0,133786.0,8.0
75%,786866800.0,206202.0,9.0
max,10000000000.0,278854.0,10.0


In [None]:
df['user'].nunique()

64826

In [None]:
df['isbn'].nunique()

137715

In [None]:
book_count = pd.DataFrame(df["title"].value_counts())

In [None]:
rare_book = book_count[book_count["title"] <= 100].index

In [None]:
rare_book.nunique()

125247

In [None]:
common_book = df[~df["title"].isin(rare_book)]
common_book.head()

Unnamed: 0,isbn,title,author,user,rating
104,440234743,The Testament,John Grisham,3329,8.0
108,440234743,The Testament,John Grisham,7346,9.0
109,440234743,The Testament,John Grisham,7352,8.0
110,440234743,The Testament,John Grisham,9419,5.0
113,440234743,The Testament,John Grisham,11224,6.0


In [None]:
user_book_df = common_book.pivot_table(index=["user"], columns=["title"], values="rating")
user_book_df

title,1984,1st to Die: A Novel,2nd Chance,A Bend in the Road,"A Child Called \It\"": One Child's Courage to Survive""",A Heartbreaking Work of Staggering Genius,A Is for Alibi (Kinsey Millhone Mysteries (Paperback)),A Map of the World,A Prayer for Owen Meany,A Time to Kill,A Walk in the Woods: Rediscovering America on the Appalachian Trail (Official Guides to the Appalachian Trail),A Walk to Remember,ANGELA'S ASHES,About a Boy,Airframe,Along Came a Spider (Alex Cross Novels),American Gods,Angela's Ashes (MMP) : A Memoir,Angels &amp; Demons,Animal Farm,"Artemis Fowl (Artemis Fowl, Book 1)",Back Roads,Balzac and the Little Chinese Seamstress : A Novel,Bel Canto: A Novel,Black House,Black Notice,Brave New World,Bridget Jones's Diary,Bridget Jones: The Edge of Reason,Cold Mountain : A Novel,Confessions of a Shopaholic (Summer Display Opportunity),Cradle and All,Dance upon the Air (Three Sisters Island Trilogy),Digital Fortress : A Thriller,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Dolores Claiborne,Dreamcatcher,Empire Falls,Ender's Game (Ender Wiggins Saga (Paperback)),Fahrenheit 451,...,The Last Precinct,The Last Time They Met : A Novel,The Lovely Bones: A Novel,The Nanny Diaries: A Novel,The No. 1 Ladies' Detective Agency (Today Show Book Club #8),The Notebook,The Partner,The Pelican Brief,The Pilot's Wife : A Novel,The Poisonwood Bible,The Poisonwood Bible: A Novel,The Reader,The Red Tent (Bestselling Backlist),The Rescue,The Runaway Jury,The Secret Life of Bees,The Smoke Jumper,The Street Lawyer,The Summons,The Tao of Pooh,The Testament,"The Two Towers (The Lord of the Rings, Part 2)","The Vampire Lestat (Vampire Chronicles, Book II)",The Witching Hour (Lives of the Mayfair Witches),Timeline,To Kill a Mockingbird,"Tuesdays with Morrie: An Old Man, a Young Man, and Life's Greatest Lesson",Two for the Dough,Unnatural Exposure,Violets Are Blue,Watership Down,We Were the Mulvaneys,When the Wind Blows,Where the Heart Is (Oprah's Book Club (Paperback)),While I Was Gone,White Oleander : A Novel,White Oleander : A Novel (Oprah's Book Club),Wicked: The Life and Times of the Wicked Witch of the West,Wild Animus,"\O\"" Is for Outlaw"""
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
16,,,,,,,,,,,,,,,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,,,,,,,,,,,,,,
51,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,9.0,,,,,,,,,,,,,,,,,,,,,,
91,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
114,,,,,,,,,,,,,,,,,,,10.0,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
278800,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,9.0,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
278836,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
278843,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
278844,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


# Final Evaluation

In [None]:
# function to return recommended books - this will be tested
def get_recommends(book = ""):
  book_name = user_book_df[book]
  book_name.sort_values(ascending=False)
  rec_books = user_book_df.corrwith(book_name).sort_values(ascending = False).head()
  recommended_books = list(rec_books.index)
  return recommended_books

Use the cell below to test your function. The `test_book_recommendation()` function will inform you if you passed the challenge or need to keep trying.

In [None]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    #if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
     # test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

['The God of Small Things', "Where the Heart Is (Oprah's Book Club (Paperback))", 'Stupid White Men ...and Other Sorry Excuses for the State of the Nation!', 'Artemis Fowl (Artemis Fowl, Book 1)', 'Tell No One']
You haven't passed yet. Keep trying!


  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)
