## Pandas and Modin Notebook Demo

Book Recommendation Engine using KNN

[Building A Book Recommender System – The Basics, kNN and Matrix Factorization](https://datascienceplus.com/building-a-book-recommender-system-the-basics-knn-and-matrix-factorization)


## Install Libraries

In [1]:
# ! pip3 list | grep pandas
# ! pip freeze > requirements.txt

# ! pip install --upgrade pip
# ! pip install --upgrade pandas
# ! pip install ray   # this may need to be installed first
# ! pip install modin[ray]

In [2]:
import pandas as pd
print(pd.__version__)

1.4.0


In [3]:
#! apt-get install -y ssh screen nano vim git > /dev/null
#! pip install -U scikit-learn > /dev/null
#! pip install numpy==1.21 > /dev/null       # restart runtime

# ! pip list | grep pandas

print('Ready!')

Ready!


## Copy Files

In [4]:
import os

# GOOGLE_DRIVE="/datasets/google"
# # !ls /datasets/google

# ! cp $GOOGLE_DRIVE/bash/copy.sh /work

# # copy files to root
# ! cp $GOOGLE_DRIVE/bash/.bashrc /root
# ! cp $GOOGLE_DRIVE/bash/.bash_aliases /root
# ! cp $GOOGLE_DRIVE/bash/.bash_functions /root
# ! cp $GOOGLE_DRIVE/bash/.vimrc /root
# ! cp -r $GOOGLE_DRIVE/bash/.vim /root

# # create directory if not exist
# if not os.path.exists('data'):
#     os.makedirs('data')

# # copy required files
# #! cp $GOOGLE_DRIVE/python/*.py .
# #! cp -r $GOOGLE_DRIVE/data/BX-CSV-Dump.zip ./data

# #! unzip ./data/BX-CSV-Dump.zip

print('Ready!')

Ready!


## KNN

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings of 270,000 books by 90,000 users. Explicit ratings are on a scale of 1-10; a rating of 0 marks an implicit interaction.

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the closeness of instances.

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.
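
The return shape described above can be sketched on toy data. This is only an illustration under stated assumptions: `toy_pivot`, `toy_model`, and the book names are hypothetical, and the function takes the pivot and model as optional arguments so that it stays self-contained. It reverses the neighbor order because the expected output above lists the farthest of the five books first.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy title-by-user rating matrix (hypothetical data, not from Book-Crossings)
toy_pivot = pd.DataFrame(
    [[10, 8, 0, 0],
     [9, 7, 0, 1],
     [0, 0, 6, 9],
     [0, 1, 7, 8],
     [5, 5, 5, 5],
     [10, 0, 0, 0],
     [0, 0, 0, 10]],
    index=["Book A", "Book B", "Book C", "Book D", "Book E", "Book F", "Book G"],
    columns=[1, 2, 3, 4],
    dtype=float,
)

toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(csr_matrix(toy_pivot.values))


def get_recommends(title, pivot=toy_pivot, model=toy_model, n=5):
    """Return [title, [[neighbor_title, distance], ...]] for the n nearest books."""
    row = pivot.loc[[title]].to_numpy()  # shape (1, n_users)
    # Ask for one extra neighbor because the query book matches itself
    distances, indices = model.kneighbors(row, n_neighbors=n + 1)
    pairs = [
        [pivot.index[i], float(d)]
        for d, i in zip(distances[0], indices[0])
        if pivot.index[i] != title
    ][:n]
    # kneighbors returns the closest first; reverse so the farthest comes first
    return [title, pairs[::-1]]


result = get_recommends("Book A")
print(result)
```

With the real data, the same function would receive the title-by-user pivot and the fitted model built later in this notebook.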

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with fewer than 200 ratings and books with fewer than 100 ratings.

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

## Load Data

In [5]:
# ! mkdir books
# ! wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
# ! unzip BX-CSV-Dump.zip -d ./books

In [6]:
import os
import time
import ray

def ray_init():    
    os.environ["MODIN_ENGINE"] = "ray"
    os.environ['MODIN_MEMORY'] = '250000000' # 250MB
    os.environ["MODIN_NPARTITIONS"] = "2"
    os.environ["MODIN_PROGRESS_BAR"] = "true"

    # assert ray.is_initialized() == True

    return None

# ray_init()

print("Done!")

Done!


In [7]:
import csv
import numpy as np
import pandas as pd
# import modin.pandas as pd
import os
import re
import time
import matplotlib.pyplot as plt

# from pandas import option_context
from pprint import pprint

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def load_data():
    """
    Load the datasets
    """
    books_filename = './data/BX-Books.csv'
    ratings_filename = './data/BX-Book-Ratings.csv'
    users_filename = './data/BX-Users.csv'

    # import csv data into dataframes
    books_df = pd.read_csv(
        books_filename,
        encoding = "latin-1",
        sep=";",
        engine='python',
        header=0,
        names=['isbn', 'title', 'author', 'year', 'publisher', 'image_urls', 'image_urlm', 'image_urll'],
        usecols=['isbn', 'title', 'author', 'year', 'publisher'],
        dtype={'isbn': 'str', 'title': 'str', 'author': 'str', 'year': 'str', 'publisher': 'str'},
        quotechar='"',
        escapechar='\\',
        low_memory=True, 
        memory_map=True)
    
    ratings_df = pd.read_csv(
        ratings_filename,
        encoding = "latin-1",
        sep=";",
        header=0,
        names=['userid', 'isbn', 'rating'],
        usecols=['userid', 'isbn', 'rating'],
        dtype={'userid': 'int', 'isbn': 'str', 'rating': 'float'},
        # quotechar='"',
        low_memory=True, 
        memory_map=True)
    
    users_df = pd.read_csv(
        users_filename,
        encoding = "latin-1",
        sep=";",
        header=0,
        names=['userid', 'location', 'age'],
        usecols=['userid', 'location', 'age'],
        dtype={'userid': 'int', 'location': 'str', 'age': 'float'},
        # quotechar='"',
        low_memory=True, 
        memory_map=True)

    books_df.dropna(inplace=True)

    print(f"books: {books_df.shape}")
    print(f"ratings: {ratings_df.shape}")
    print(f"users: {users_df.shape}")

    return books_df, ratings_df, users_df

books_df, ratings_df, users_df = load_data()

books: (271376, 5)
ratings: (1149780, 3)
users: (278858, 3)


In [8]:
# # creating a filter for year column
# filter = books_df['year'].str.isnumeric() == False

# # print only filtered columns
# books_filtered = books_df.where(filter).dropna()
# print(books_filtered.shape)
# print(books_filtered)

# # fix invalid year
# books_df.iloc[6450]['year'] = 9999
# books_df.iloc[43665]['year'] = 9999
# print(books_df.iloc[6450])

books_df['year'] = books_df['year'].astype(int)
print(books_df.shape)
books_df.head()

(271376, 5)


Unnamed: 0,isbn,title,author,year,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company
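
The commented-out lines in the cell above patch invalid `year` values by row position. A possible alternative, sketched here on a hypothetical three-row sample, is to let `pd.to_numeric` flag non-numeric entries so they can be inspected or dropped before the `astype(int)` conversion.

```python
import pandas as pd

# Hypothetical sample: two valid years and one row with text in the year column
sample = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "year": ["2002", "1999", "DK Publishing Inc"],
})

# Non-numeric entries become NaN instead of raising an error
year_numeric = pd.to_numeric(sample["year"], errors="coerce")
bad_rows = sample[year_numeric.isna()]

# Keep only the rows with a usable year and store it as an integer
clean = sample[year_numeric.notna()].copy()
clean["year"] = year_numeric[year_numeric.notna()].astype(int)

print(bad_rows)
print(clean)
```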


## Summary Statistics

### Ratings data

The ratings data set provides a list of ratings that users have given to books. 

It includes 1,149,780 records and 3 fields: userID, ISBN, and bookRating.

The ratings are very unevenly distributed and the vast majority of ratings are 0.

In [9]:
print(ratings_df.shape)
print(list(ratings_df.columns))
ratings_df.head()

(1149780, 3)
['userid', 'isbn', 'rating']


Unnamed: 0,userid,isbn,rating
0,276725,034545104X,0.0
1,276726,0155061224,5.0
2,276727,0446520802,0.0
3,276729,052165615X,3.0
4,276729,0521795028,6.0


### Ratings Distribution

In [10]:
# plt.rc("font", size=15)
# ratings_df['rating'].value_counts(sort=False).plot(kind='bar')
# plt.title('Rating Distribution\n')
# plt.xlabel('Rating')
# plt.ylabel('Count')
# plt.savefig('ratings_dist.png', bbox_inches='tight')
# plt.show()
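
Since the plotting cell is commented out, the distribution can also be inspected numerically. The following is a minimal sketch on a hypothetical `toy_ratings` frame; it assumes, as in Book-Crossing, that a rating of 0 is an implicit interaction rather than a score.

```python
import pandas as pd

# Hypothetical ratings in the same layout as ratings_df
toy_ratings = pd.DataFrame({
    "userid": [1, 1, 2, 2, 3, 3, 3, 4],
    "isbn": ["a", "b", "a", "c", "a", "b", "c", "d"],
    "rating": [0.0, 8.0, 0.0, 0.0, 10.0, 0.0, 7.0, 0.0],
})

# Count of each rating value, ordered by rating
rating_counts = toy_ratings["rating"].value_counts().sort_index()

# Share of rows that are implicit (rating == 0)
zero_share = (toy_ratings["rating"] == 0).mean()

# Explicit ratings only, if the zeros should be excluded from an analysis
explicit = toy_ratings[toy_ratings["rating"] > 0]

print(rating_counts)
print(f"share of zero ratings: {zero_share:.3f}")
```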

### Books data

The books dataset provides book details. 

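The raw file has 8 fields (ISBN, title, author, year, publisher, and three image URLs). After loading the five columns we use and dropping rows with missing values, 271,376 records remain.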

In [11]:
print(books_df.shape)
print(list(books_df.columns))
books_df.head()

(271376, 5)
['isbn', 'title', 'author', 'year', 'publisher']


Unnamed: 0,isbn,title,author,year,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


### Users data

The users dataset provides the user demographic information. It includes 278,858 records and 3 fields: user id, location, and age.

In [12]:
print(users_df.shape)
print(list(users_df.columns))
users_df.head()

(278858, 3)
['userid', 'location', 'age']


Unnamed: 0,userid,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


## Exploratory Data Analysis

In [13]:
# Get summary statistics
# include string and categorical features
# print('books:')
# print(books_df.describe(include=['int', 'float', 'object', 'category']), "\n")
# print('\nratings:')
# print(ratings_df.describe(include=['int', 'float', 'object', 'category']), "\n")
# print('\nusers:')
# print(users_df.describe(include=['int', 'float', 'object', 'category']), "\n")

# check data types
print('\nbooks:')
print(books_df.info())
print('\nratings:')
print(ratings_df.info())
print('\nusers:')
print(users_df.info())


books:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 271376 entries, 0 to 271378
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   isbn       271376 non-null  object
 1   title      271376 non-null  object
 2   author     271376 non-null  object
 3   year       271376 non-null  int64 
 4   publisher  271376 non-null  object
dtypes: int64(1), object(4)
memory usage: 12.4+ MB
None

ratings:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype  
---  ------  --------------    -----  
 0   userid  1149780 non-null  int64  
 1   isbn    1149780 non-null  object 
 2   rating  1149780 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 26.3+ MB
None

users:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---

In [14]:
# number of missing values in each column
print('\nbooks:')
print(books_df.isna().sum())
print('\nratings:')
print(ratings_df.isna().sum())
print('\nusers:')
print(users_df.isna().sum())

# check for duplicate values?
print('\nduplicates:')
print(f"books: {books_df.duplicated().sum()}")
print(f"ratings: {ratings_df.duplicated().sum()}")
print(f"users: {users_df.duplicated().sum()}")

# check the distribution of categorical columns
# df["product_group"].value_counts()


books:
isbn         0
title        0
author       0
year         0
publisher    0
dtype: int64

ratings:
userid    0
isbn      0
rating    0
dtype: int64

users:
userid           0
location         0
age         110762
dtype: int64

duplicates:
books: 0
ratings: 0
users: 0


## Recommendations based on rating counts

In [15]:
rating_count = pd.DataFrame(ratings_df.groupby('isbn')['rating'].count())
rating_count.sort_values('rating', ascending=False).head()

isbn,rating
0971880107,2502
0316666343,1295
0385504209,883
0060928336,732
0312195516,723


The book with ISBN “0971880107” received the most ratings.

We can look up its title, along with the titles of the other books in the top 5.

In [16]:
most_rated_books = pd.DataFrame(['0971880107', '0316666343', '0385504209', '0060928336', '0312195516'], 
                                index=np.arange(5), 
                                columns = ['isbn'])
most_rated_books_summary = pd.merge(most_rated_books, books_df, on='isbn')
most_rated_books_summary

Unnamed: 0,isbn,title,author,year,publisher
0,0971880107,Wild Animus,Rich Shapero,2004,Too Far
1,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
2,0385504209,The Da Vinci Code,Dan Brown,2003,Doubleday
3,0060928336,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,1997,Perennial
4,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA


The book that received the most ratings in this data set is Rich Shapero’s “Wild Animus”.

The five most-rated books have one thing in common: they are all novels.

This suggests that novels are popular and tend to attract more ratings.

A recommender based only on rating counts would therefore suggest “Wild Animus” to someone who likes “The Lovely Bones: A Novel”.

## Recommendations based on correlations

We use Pearson's correlation coefficient (r) to measure the linear correlation between two variables, in our case, the ratings for two books.

First, we need to find out the average rating, and the number of ratings each book received.


In [17]:
average_rating = pd.DataFrame(ratings_df.groupby('isbn')['rating'].mean())
average_rating['rating_count'] = pd.DataFrame(ratings_df.groupby('isbn')['rating'].count())
average_rating.sort_values('rating_count', ascending=False).head()

isbn,rating,rating_count
0971880107,1.019584,2502
0316666343,4.468726,1295
0385504209,4.652322,883
0060928336,3.448087,732
0312195516,4.334716,723
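
The mean and count can also be computed in a single pass with named aggregation. This is a small sketch on a hypothetical `toy` frame, not a change to the cell above; the output column names are chosen to match those used in this notebook.

```python
import pandas as pd

# Hypothetical ratings with the same column names as ratings_df
toy = pd.DataFrame({
    "isbn": ["a", "a", "a", "b", "b", "c"],
    "rating": [0.0, 10.0, 5.0, 8.0, 6.0, 0.0],
})

# One groupby call that yields both the average rating and the rating count
summary = (
    toy.groupby("isbn")["rating"]
    .agg(rating="mean", rating_count="count")
    .sort_values("rating_count", ascending=False)
)

print(summary)
```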


In this data set, the book that received the most ratings has a very low average rating (about 1.02).

Recommendations based on rating counts alone would therefore be misleading, so we need a better system.



### Filter Data

To ensure statistical significance, users with fewer than 200 ratings and books with fewer than 100 ratings are excluded.

In [18]:
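ratings = ratings_df.copy()

# keep only users with at least 200 ratings
counts1 = ratings['userid'].value_counts()
ratings = ratings[ratings['userid'].isin(counts1[counts1 >= 200].index)]

# NOTE: this counts how often each rating value occurs, not how often each
# book is rated, so it does not filter out rarely rated books
counts = ratings['rating'].value_counts()
ratings = ratings[ratings['rating'].isin(counts[counts >= 100].index)]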
print(ratings.shape)
ratings.head(10)

(527556, 3)


Unnamed: 0,userid,isbn,rating
1456,277427,002542730X,10.0
1457,277427,0026217457,0.0
1458,277427,003008685X,8.0
1459,277427,0030615321,0.0
1460,277427,0060002050,0.0
1461,277427,0060006641,10.0
1462,277427,0060159685,0.0
1463,277427,0060177721,0.0
1464,277427,0060192704,0.0
1465,277427,0060542128,7.0
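
The cell above counts rating values in its second step, so rarely rated books stay in the data. One way to apply both thresholds from the text is sketched below; `filter_ratings` is a hypothetical helper, and computing both counts on the unfiltered frame is an assumption about the intended order of operations.

```python
import pandas as pd


def filter_ratings(df, min_user_ratings=200, min_book_ratings=100):
    """Keep rows whose user and book both meet the minimum rating counts."""
    user_counts = df["userid"].value_counts()
    book_counts = df["isbn"].value_counts()
    keep_users = df["userid"].isin(user_counts[user_counts >= min_user_ratings].index)
    keep_books = df["isbn"].isin(book_counts[book_counts >= min_book_ratings].index)
    return df[keep_users & keep_books]


# Toy data with small thresholds so the effect is visible
toy = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 3],
    "isbn": ["a", "b", "c", "a", "b", "a"],
    "rating": [5.0, 0.0, 8.0, 7.0, 0.0, 9.0],
})
filtered = filter_ratings(toy, min_user_ratings=2, min_book_ratings=2)
print(filtered)
```

Applying this to `ratings_df` would change the row counts shown in the rest of the notebook, which were produced by the original cell.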


### Ratings Matrix

We convert the ratings table to a 2D matrix. 

The matrix will be sparse because not every user rated every book.

In [None]:
# This code sample crashes notebook kernel using modin
# ratings_pivot = ratings.pivot(index='userid', columns='isbn').rating
# userid = ratings_pivot.index
# isbn = ratings_pivot.columns
# print(ratings_pivot.shape)
# ratings_pivot.head()
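
A wide dense pivot can be very large for this data. As a possible workaround, the sparse matrix can be built directly from category codes without materializing the dense frame. The sketch below uses a hypothetical `toy` frame; note that, unlike `pivot`, the `csr_matrix` constructor sums duplicate (user, book) pairs instead of raising an error.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings in the same layout as the filtered ratings table
toy = pd.DataFrame({
    "userid": [10, 10, 20, 30, 30],
    "isbn": ["a", "b", "a", "b", "c"],
    "rating": [5.0, 0.0, 8.0, 7.0, 9.0],
})

# Map each user and each book to a consecutive integer code
user_cat = toy["userid"].astype("category")
isbn_cat = toy["isbn"].astype("category")

# Rows are users, columns are books, values are ratings
sparse_ratings = csr_matrix(
    (
        toy["rating"].to_numpy(),
        (user_cat.cat.codes.to_numpy(), isbn_cat.cat.codes.to_numpy()),
    ),
    shape=(user_cat.cat.categories.size, isbn_cat.cat.categories.size),
)

# Labels for translating matrix positions back to ids
user_ids = user_cat.cat.categories
isbns = isbn_cat.cat.categories

print(sparse_ratings.shape)
```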

We can find out which books are correlated with the 2nd most rated book “The Lovely Bones: A Novel”.

In [20]:
# bones_ratings = ratings_pivot['0316666343']
# similar_to_bones = ratings_pivot.corrwith(bones_ratings)
# corr_bones = pd.DataFrame(similar_to_bones, columns=['pearsonR'])
# corr_bones.dropna(inplace=True)
# corr_summary = corr_bones.join(average_rating['rating_count'])
# corr_summary[corr_summary['rating_count']>=300].sort_values('pearsonR', ascending=False).head(10)
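
Because the correlation cell is commented out, here is a minimal sketch of the same idea on a toy user-by-book table. The frame `toy_pivot` and the column names `book_x`, `book_y`, and `book_z` are hypothetical; missing values stand for books a user did not rate.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-book ratings; NaN means the user did not rate the book
toy_pivot = pd.DataFrame(
    {
        "book_x": [8.0, 6.0, 9.0, np.nan, 4.0],
        "book_y": [7.0, 5.0, 10.0, 3.0, np.nan],
        "book_z": [2.0, 9.0, 1.0, 8.0, 7.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userid"),
)

# Correlate every book column with the target book's ratings
target = toy_pivot["book_x"]
similar_to_target = toy_pivot.corrwith(target)

# Collect the coefficients in a frame, most similar first
corr_toy = similar_to_target.rename("pearsonR").to_frame().dropna()
corr_toy = corr_toy.sort_values("pearsonR", ascending=False)

print(corr_toy)
```

Each coefficient is computed only over users who rated both books, so a high value based on very few shared ratings should be treated with caution.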

We obtained the book ISBNs but we need to find out the titles of the books to see whether they make sense.

In [21]:
# books_corr_to_bones = pd.DataFrame(['0312291639', '0316601950', '0446610038', '0446672211', '0385265700', '0345342968', '0060930535', '0375707972', '0684872153'], 
#                                   index=np.arange(9), columns=['isbn'])
# corr_books = pd.merge(books_corr_to_bones, books_df, on='isbn')
# corr_books

## Collaborative Filtering using k-Nearest Neighbors (kNN)

kNN is a machine learning algorithm that finds clusters of similar users based on common book ratings and makes predictions using the average rating of the top-k nearest neighbors. For example, we first arrange the ratings in a matrix with one row for each item (book) and one column for each user.

We then find the k items whose user engagement vectors are most similar to those of a given item. For example, the nearest neighbors of item id 5 might be [7, 4, 8, …].

Now, we can implement kNN into our book recommender system.

Starting from the original data set, we will be only looking at the popular books. 

In order to find out which books are popular, we combine books data with ratings data.

In [22]:
# Combine books data with ratings data
combine_book_rating = pd.merge(ratings, books_df, on='isbn')
columns = ['year', 'publisher', 'author']
combine_book_rating = combine_book_rating.drop(columns, axis=1)
combine_book_rating.head()

Unnamed: 0,userid,isbn,rating,title
0,277427,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...
1,3363,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...
2,11676,002542730X,6.0,Politically Correct Bedtime Stories: Modern Ta...
3,12538,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...
4,13552,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...


We then group by book titles and create a new column for total rating count.

In [23]:
combine_book_rating = combine_book_rating.dropna(axis=0, subset=['title'])

# NOTE: with modin.pandas this step raised
# TypeError: 'PandasOnRayDataframePartition' object is not iterable
book_rating_count = (combine_book_rating.
     groupby(by = ['title'])['rating'].
     count().
     reset_index().
     rename(columns = {'rating': 'total_rating_count'})
     [['title', 'total_rating_count']]
    )

book_rating_count.head()

Unnamed: 0,title,total_rating_count
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


We combine the rating data with the total rating count data.

This gives us exactly what we need to find out which books are popular and filter out lesser-known books.



In [24]:
rating_with_total_count = combine_book_rating.merge(book_rating_count, left_on='title', right_on='title', how='left')
rating_with_total_count.head()

Unnamed: 0,userid,isbn,rating,title,total_rating_count
0,277427,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...,82
1,3363,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...,82
2,11676,002542730X,6.0,Politically Correct Bedtime Stories: Modern Ta...,82
3,12538,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...,82
4,13552,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...,82


We can view the statistics of total rating count.

In [25]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(book_rating_count['total_rating_count'].describe())

count   160586.000
mean         3.044
std          7.428
min          1.000
25%          1.000
50%          1.000
75%          2.000
max        365.000
Name: total_rating_count, dtype: float64
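
The statistics above show that most titles have only one or two ratings. Filtering out lesser-known books could look like the sketch below; the cutoff `popularity_threshold` and the `toy_counts` frame are hypothetical and are not values used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical ratings joined with titles
toy_counts = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "userid": [1, 2, 3, 1, 2, 3],
    "rating": [8.0, 0.0, 7.0, 5.0, 9.0, 0.0],
})

# Attach the number of ratings per title to every row
toy_counts["total_rating_count"] = (
    toy_counts.groupby("title")["rating"].transform("count")
)

# Keep only titles that reach the (hypothetical) popularity cutoff
popularity_threshold = 2
popular = toy_counts[toy_counts["total_rating_count"] >= popularity_threshold]

print(popular)
```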


In [26]:
print(ratings.shape)
print(list(ratings.columns))
ratings.info()

(527556, 3)
['userid', 'isbn', 'rating']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 527556 entries, 1456 to 1147616
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   userid  527556 non-null  int64  
 1   isbn    527556 non-null  object 
 2   rating  527556 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 16.1+ MB


### Filter to users in US and Canada only

In order to improve computing speed, and not run into the “MemoryError” issue, we limit our user data to those in the US and Canada. 

Then, we combine user data with the rating data and total rating count data.

In [27]:
# Filter to users in US and Canada only
combined = rating_with_total_count.merge(users_df, left_on = 'userid', right_on = 'userid', how = 'left')
us_canada_user_rating = combined[combined['location'].str.contains("usa|canada")]
# us_canada_user_rating = us_canada_user_rating.drop(['author'], axis='columns')
us_canada_user_rating.head()

Unnamed: 0,userid,isbn,rating,title,total_rating_count,location,age
0,277427,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...,82,"gilbert, arizona, usa",48.0
1,3363,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...,82,"knoxville, tennessee, usa",29.0
3,12538,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...,82,"byron, minnesota, usa",18.0
4,13552,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...,82,"cordova, tennessee, usa",32.0
5,16795,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...,82,"mechanicsville, maryland, usa",47.0
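
The substring pattern `"usa|canada"` matches anywhere in the location string. A stricter variant, sketched below on hypothetical rows, anchors the match to the end of the string and passes `na=False` so that missing locations are treated as non-matches. It assumes the country is always the last comma-separated part, as in the rows shown above.

```python
import pandas as pd

# Hypothetical users; one location contains "usa" inside a city name
toy_users = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5],
    "location": [
        "gilbert, arizona, usa",
        "toronto, ontario, canada",
        "lausanne, vaud, switzerland",
        "porto, v.n.gaia, portugal",
        None,
    ],
})

# Pattern used in the cell above: matches the substring anywhere
loose = toy_users["location"].str.contains("usa|canada", na=False)

# Stricter pattern: the country must be the last comma-separated part
strict = toy_users["location"].str.contains(r",\s*(?:usa|canada)\s*$", na=False)

print(toy_users[loose]["userid"].tolist())
print(toy_users[strict]["userid"].tolist())
```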


## Implementing kNN

We convert our table to a 2D matrix and fill the missing values with zeros (since we will calculate distances between rating vectors).

Then, we transform the values (ratings) of the matrix dataframe into a scipy sparse matrix for more efficient calculations.

#### Finding the Nearest Neighbors

We use unsupervised algorithms with `sklearn.neighbors`. 

The algorithm we use to compute the nearest neighbors is `brute`, and we specify `metric='cosine'` so that the distance between two rating vectors is the cosine distance (one minus their cosine similarity).

Finally, we fit the model.

In [28]:
us_canada_user_rating = us_canada_user_rating.drop_duplicates(['userid', 'title'])
us_canada_user_rating_pivot = us_canada_user_rating.pivot(index = 'title', columns = 'userid', values = 'rating').fillna(0)
us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)

from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(us_canada_user_rating_matrix)

NearestNeighbors(algorithm='brute', metric='cosine')
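
To make the metric concrete, the following sketch checks on a tiny hypothetical matrix `X` that the distances returned by `kneighbors` with `metric='cosine'` equal one minus the cosine similarity computed by hand.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three hypothetical rating vectors (rows are books, columns are users)
X = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 4.0],
    [0.0, 9.0, 0.0],
])


def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


nn = NearestNeighbors(metric="cosine", algorithm="brute").fit(X)

# Query with the first row; it is its own nearest neighbor at distance ~0
distances, indices = nn.kneighbors(X[[0]], n_neighbors=3)
manual = [cosine_distance(X[0], X[i]) for i in indices[0]]

print(distances[0])
print(manual)
```

A smaller distance means the two books were rated in a more similar pattern by the same users, regardless of how many ratings each book has.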

## Test our model and make some recommendations

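In this step, we query the fitted model, which measures distance to determine the “closeness” of instances.

Because `NearestNeighbors` is unsupervised, it does not classify anything. It simply returns the nearest neighbors of the queried book along with their distances, and we present those neighbors as recommendations.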

In [29]:
# query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])
# distances, indices = model_knn.kneighbors(us_canada_user_rating_pivot.iloc[query_index, :].to_numpy().reshape(1, -1), n_neighbors = 6)

# for i in range(0, len(distances.flatten())):
#     if i == 0:
#         print('Recommendations for {0}:\n'.format(us_canada_user_rating_pivot.index[query_index]))
#     else:
#         print('{0}: {1}, with distance of {2}:'.format(i, us_canada_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

In [30]:
ur_pivot = us_canada_user_rating_pivot.copy()

x = "The Green Mile: Coffey's Hands (Green Mile Series)"
query = ur_pivot[ur_pivot.index.str.startswith(x)]
query_index = np.where(ur_pivot.index == x)
query_index = query_index[0][0]
print(f"query_index: {query_index}")

query_s = ur_pivot.iloc[query_index, :]
arr = query_s.to_numpy().reshape(1, -1)
print(arr.shape)

distances, indices = model_knn.kneighbors(arr, n_neighbors = 6)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(ur_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, ur_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

query_index: 109912
(1, 734)
Recommendations for The Green Mile: Coffey's Hands (Green Mile Series):

1: The Green Mile: Night Journey (Green Mile Series), with distance of 0.12578625288073897:
2: The Two Dead Girls (Green Mile Series), with distance of 0.24189525423576685:
3: The Green Mile: The Bad Death of Eduard Delacroix (Green Mile Series), with distance of 0.2461088756486024:
4: The Green Mile: The Mouse on the Mile (Green Mile Series), with distance of 0.27170038776068084:
5: The Green Mile: Coffey on the Mile (Green Mile Series), with distance of 0.4094509270584026:


In [31]:
# Find row with matching title
x = "The Green Mile: Coffey's Hands (Green Mile Series)"
query = ur_pivot[ur_pivot.index.str.startswith(x)]
row_index = np.where([ur_pivot.index == x])[1][0]
query_s = ur_pivot.iloc[row_index]
print(f"row_index: {row_index}")
print(f"series name: {ur_pivot.iloc[row_index].name}")

row_index: 109912
series name: The Green Mile: Coffey's Hands (Green Mile Series)
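
The row lookup above relies on `np.where`. When the pivot index holds unique titles, `Index.get_loc` offers a more direct route. The helper `find_row` and the frame `toy_pivot` below are hypothetical and only sketch the idea.

```python
import pandas as pd

# Hypothetical title-by-user pivot with unique titles in the index
toy_pivot = pd.DataFrame(
    [[0.0, 5.0], [8.0, 0.0], [3.0, 3.0]],
    index=pd.Index(["Book A", "Book B", "Book C"], name="title"),
    columns=[1, 2],
)


def find_row(pivot, title):
    """Return the integer row position of title, or None if it is absent."""
    if title not in pivot.index:
        return None
    return pivot.index.get_loc(title)


row_index = find_row(toy_pivot, "Book B")
missing = find_row(toy_pivot, "Not In The Data")

print(row_index, missing)
```

Returning `None` for an unknown title lets the caller decide how to report a missing book instead of failing with an `IndexError`.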
