# Collaborative Filtering based Recommendation System_Questions

## About Book Crossing Dataset
This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

## Objective
This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

Execute the cell below to load the datasets.

In [182]:
import pandas as pd

In [183]:
#Loading data
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


## Q1 Check the number of records (shape) and the features given in each dataset

In [184]:
print('Books:',books.shape, books.info())
print('-----------------------------------------')
print('Users:',users.shape, users.info())
print('-----------------------------------------')
print('Ratings:',ratings.shape, ratings.info())
print('-----------------------------------------')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB
Books: (271360, 8) None
-----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
Users: (278858, 3) None
-----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data col

## Q2. Exploring books dataset - 1

In [185]:
books.describe().transpose()

Unnamed: 0,count,unique,top,freq
ISBN,271360,271360,0763621552,1
bookTitle,271360,242135,Selected Poems,27
bookAuthor,271359,102023,Agatha Christie,632
yearOfPublication,271360,202,2002,13903
publisher,271358,16807,Harlequin,7535
imageUrlS,271360,271044,http://images.amazon.com/images/P/014027569X.0...,2
imageUrlM,271360,271044,http://images.amazon.com/images/P/067144171X.0...,2
imageUrlL,271357,271041,http://images.amazon.com/images/P/039475929X.0...,2


### Drop the last three columns containing image URLs, which are not required for analysis

In [186]:
print(books.shape)
# Note: index=1 also drops the row labelled 1, so one row is lost along with the three columns
books.drop(columns=['imageUrlS','imageUrlM','imageUrlL'],inplace=True,index=1)
books.info()
print(books.shape)

(271360, 8)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 271359 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271359 non-null object
bookTitle            271359 non-null object
bookAuthor           271358 non-null object
yearOfPublication    271359 non-null object
publisher            271357 non-null object
dtypes: object(5)
memory usage: 12.4+ MB
(271359, 5)


### check the unique values of yearOfPublication

In [187]:
books['yearOfPublication'].unique()

array([2002, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994, 2001,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

### Check the rows having 'DK Publishing Inc' as yearOfPublication and drop them
### Change the datatype of yearOfPublication to 'int'  -1

In [188]:
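# Count the rows where a publisher name leaked into yearOfPublication, then drop them
print(books[books['yearOfPublication'] =='DK Publishing Inc'].shape[0])
books=books[books['yearOfPublication'] != 'DK Publishing Inc'].copy()
# Coerce anything non-numeric to NaN, then fill with 0 so the column can be cast to int
books['yearOfPublication'] = pd.to_numeric(books['yearOfPublication'], errors='coerce')
books['yearOfPublication']=books['yearOfPublication'].fillna(0).astype(int)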
books.info()

2
<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null int32
publisher            271355 non-null object
dtypes: int32(1), object(4)
memory usage: 11.4+ MB
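
The year column mixes integers, numeric strings, and at least one stray publisher name. As a minimal sketch of what `errors='coerce'` does with such values, the example below runs the same conversion on a small made-up Series (the sample values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical sample mimicking the mixed int/str values seen in yearOfPublication
years = pd.Series([2002, "1999", "0", "DK Publishing Inc", 1806])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
parsed = pd.to_numeric(years, errors="coerce")

# NaN cannot be stored in an integer column, so fill it before casting
clean = parsed.fillna(0).astype(int)
print(clean.tolist())
```

Filling with 0 keeps the row but leaves 0 as a sentinel for "unknown year", which is worth remembering in any later analysis by year.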


### Check for null values and impute them

In [189]:
books.isnull().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
dtype: int64

In [190]:
for feature in books.columns: # Loop through all columns in the dataframe
    if books[feature].dtype == 'float64' or books[feature].dtype == 'int64': # Only apply for numeric columns
        books[feature] = books[feature].fillna(books[feature].median()) # Replace missing with median
    if books[feature].dtype == 'object': # Only apply for char columns
        books[feature] = books[feature].fillna(books[feature].mode()[0]) # Replace missing with mode
books.isnull().sum()

ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64

## Q3. Explore Users Dataset

### Age values below 5 and above 90 are implausible for book ratings, so replace them with the mean age and change the datatype to int - 1

In [191]:
users.describe()

Unnamed: 0,userID,Age
count,278858.0,168096.0
mean,139429.5,34.751434
std,80499.51502,14.428097
min,1.0,0.0
25%,69715.25,24.0
50%,139429.5,32.0
75%,209143.75,44.0
max,278858.0,244.0


In [192]:
users.isnull().sum()

userID           0
Location         0
Age         110762
dtype: int64

In [193]:
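# Caution: .loc[mask] with no column label assigns to every column of the matching rows,
# so userID and Location are overwritten too; users.loc[mask, "Age"] would limit the change to Age
users.loc[users["Age"] < 5] = round(users[(users["Age"] >= 5) & (users["Age"] <= 90)]["Age"].mean())
users.loc[users["Age"] > 90] = round(users[(users["Age"] >= 5) & (users["Age"] <= 90)]["Age"].mean())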
users.describe()

Unnamed: 0,userID,Age
count,278858.0,168096.0
mean,138774.48959,34.725996
std,80868.614403,13.53266
min,1.0,5.0
25%,68727.25,24.0
50%,138766.5,32.0
75%,208803.75,44.0
max,278858.0,90.0
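
The `userID` statistics in the output above differ from the earlier `describe()` (mean 139429.5 before, 138774.49 after). That is consistent with `.loc[mask] = value` writing the value into every column of the matching rows. A minimal sketch of a column-scoped replacement, on a hypothetical toy frame with the same column names:

```python
import pandas as pd

# Hypothetical toy frame with the same columns as `users`
toy = pd.DataFrame({
    "userID": [1, 2, 3, 4],
    "Location": ["a", "b", "c", "d"],
    "Age": [3.0, 30.0, 120.0, 40.0],
})

# Ages between 5 and 90 (inclusive) are treated as plausible
valid = toy["Age"].between(5, 90)
mean_age = round(toy.loc[valid, "Age"].mean())

fixed = toy.copy()
# Naming the column restricts the assignment to Age; missing ages are left for a later fillna
fixed.loc[~valid & fixed["Age"].notna(), "Age"] = mean_age
print(fixed)
```

With the column named explicitly, `userID` and `Location` stay untouched and only the out-of-range ages change.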


In [194]:
users['Age']=users['Age'].fillna(round(users['Age'].mean())).astype(int)
users.describe()

Unnamed: 0,userID,Age
count,278858.0,278858.0
mean,138774.48959,34.83483
std,80868.614403,10.507639
min,1.0,5.0
25%,68727.25,29.0
50%,138766.5,35.0
75%,208803.75,35.0
max,278858.0,90.0


## Q4. Explore ratings Dataset

In [195]:
ratings.describe()

Unnamed: 0,userID,bookRating
count,1149780.0,1149780.0
mean,140386.4,2.86695
std,80562.28,3.854184
min,2.0,0.0
25%,70345.0,0.0
50%,141010.0,0.0
75%,211028.0,7.0
max,278854.0,10.0


In [196]:
ratings.isnull().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

### The ratings dataset should contain only books that exist in our books dataset. - 1

In [244]:
print(len(books['ISBN'].unique()))
book_list=books['ISBN'].unique()
ratings_extract=ratings[ratings['ISBN'].isin(book_list)]
print(ratings.shape,ratings_extract.shape)

271357
(1149780, 3) (1031119, 3)


### The ratings dataset should contain only ratings from users that exist in the users dataset.

In [245]:
print(len(users['userID'].unique()))
users_list=users['userID'].unique()
ratings_user_extract=ratings_extract[ratings_extract['userID'].isin(users_list)]
print(ratings_extract.shape,ratings_user_extract.shape)

277546
(1031119, 3) (1026140, 3)


### Consider only explicit ratings (1-10) and drop the implicit 0s.

In [246]:
ratings_user_extract['bookRating'].value_counts()

0     644033
8      91362
10     70963
7      66100
9      60496
5      45153
6      31550
4       7576
3       5082
2       2360
1       1465
Name: bookRating, dtype: int64

In [252]:
ratings_user_extract=ratings_user_extract[ratings_user_extract['bookRating']!=0]
ratings_user_extract.shape

(382107, 3)

### Find out which rating has been given highest number of times

In [249]:
ratings_user_extract['bookRating'].value_counts()

8     91362
10    70963
7     66100
9     60496
5     45153
6     31550
4      7576
3      5082
2      2360
1      1465
Name: bookRating, dtype: int64

In [251]:
#Rating 8 has been given highest number of times
print(ratings_user_extract.shape)
len(ratings_user_extract[ratings_user_extract['bookRating']==0])

(382107, 3)


0

## **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [253]:
value_count=ratings_user_extract['userID'].value_counts(sort=True)
df = value_count.rename_axis('userID').reset_index(name='counts')
df=df[df['counts']>=100]
users_list=df['userID'].unique()
ratings_user_extract=ratings_user_extract[ratings_user_extract['userID'].isin(users_list)]
print(len(df['userID'].unique()),len(ratings_user_extract['userID'].unique()))

447 447


In [254]:
ratings_user_extract.shape

(102976, 3)

## Q5 Generating ratings matrix from explicit ratings table

#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [255]:
ratings_user_extract.isnull().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64
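
The note above describes a user-by-item matrix in which missing ratings are stored as 0, whereas the Surprise `Dataset` used in this notebook takes the long-format table directly. As a minimal sketch of the matrix form, assuming a small made-up table with the same column names, a pandas pivot could look like this:

```python
import pandas as pd

# Hypothetical long-format ratings with the same column names as ratings_user_extract
toy_ratings = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "bookRating": [8, 5, 10, 7],
})

# Rows become users, columns become books; unrated pairs come out as NaN
matrix = toy_ratings.pivot(index="userID", columns="ISBN", values="bookRating")

# Replace NaN with 0 to mark "no rating"
matrix = matrix.fillna(0)
print(matrix)
```

`pivot` raises an error if a user rated the same book twice; `pivot_table` with an aggregation function is the usual alternative in that case.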

In [256]:
from surprise import Dataset,Reader
reader = Reader(rating_scale=(1, 10))
print(reader.rating_scale)
data = Dataset.load_from_df(ratings_user_extract[['userID', 'ISBN', 'bookRating']], reader)
data

(1, 10)


<surprise.dataset.DatasetAutoFolds at 0x21f5fb38630>

## Q6. Generate the predicted ratings using SVD with the number of singular values set to 50

In [257]:
# Split data to train and test
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25,random_state=123)

In [283]:
from surprise import SVD
from surprise import accuracy
svd_model = SVD(n_factors=50,biased=False)
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x21f61a0d278>

In [284]:
test_pred = svd_model.test(testset)
accuracy.rmse(test_pred)

RMSE: 3.1265


3.12649147013945
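
Surprise's `SVD` learns its factor matrices by stochastic gradient descent on the observed ratings only, so it is not a literal linear-algebra SVD. For intuition about what keeping k singular values means, here is a minimal sketch of a classical truncated SVD with NumPy on a made-up matrix. The names `R` and `k` and all values are assumptions for illustration, and this is not the method used in the cell above.

```python
import numpy as np

# Hypothetical 4x3 user-by-item matrix; 0 marks a missing rating
R = np.array([
    [8.0, 0.0, 5.0],
    [7.0, 9.0, 0.0],
    [0.0, 8.0, 6.0],
    [9.0, 0.0, 4.0],
])

k = 2  # number of singular values to keep (the exercise uses 50)

# Decompose, then rebuild the matrix from the k largest singular values only
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# RMSE between the original matrix and its rank-k reconstruction
rmse = np.sqrt(np.mean((R - R_hat) ** 2))
print(R_hat.round(2))
print(rmse)
```

Note that this RMSE is measured against the same matrix that was decomposed, zeros included, so it is not comparable to the held-out RMSE reported above.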

## Take a particular user_id

Let's find the recommendations for the user with ID 2110.
Note: execute the cells below to load the variables.

In [314]:
userID = 2110

## Q7 Get the predicted ratings for userID 2110 and sort them in descending order

In [292]:
df = pd.DataFrame(test_pred,columns=['uid','iid','rui','est','details'])
df[df['uid']==userID].sort_values(by='est',ascending=False)

Unnamed: 0,uid,iid,rui,est,details
11448,2110,0345283554,10.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
8153,2110,0803298145,9.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
23204,2110,0373619472,7.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
23377,2110,037361490X,5.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
14311,2110,0373614845,6.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
13679,2110,0142302163,8.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
24278,2110,0373642911,7.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
9815,2110,059046678X,10.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
9101,2110,0439222303,10.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
9032,2110,0698119916,8.0,7.832142,"{'was_impossible': True, 'reason': 'User and i..."
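
Every row in the output above has `was_impossible` set to `True` and the same estimate, 7.832142. In Surprise, a prediction flagged this way was not computed by the model; the library substitutes a default estimate (by default the global mean of the training ratings). The sketch below shows one way to pull the flag out of `details` and keep only predictions the model could actually compute. The `preds` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like Surprise predictions loaded into a DataFrame
preds = pd.DataFrame({
    "uid": [2110, 2110, 2110],
    "iid": ["A", "B", "C"],
    "est": [7.83, 6.10, 7.83],
    "details": [
        {"was_impossible": True},
        {"was_impossible": False},
        {"was_impossible": True},
    ],
})

# Pull the flag out of each details dict into its own boolean column
preds["was_impossible"] = preds["details"].apply(lambda d: d["was_impossible"])

# Keep only predictions the model could actually compute
usable = preds[~preds["was_impossible"]]
print(usable[["uid", "iid", "est"]])
```

Sorting by `est` is only informative among the usable rows, since all fallback rows share one value.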


## Q8 Create a dataframe with name user_data containing userID 2110 explicitly interacted books

In [315]:
df = pd.DataFrame(test_pred,columns=['userID','ISBN','rui','est','details'])

In [316]:
user_data=df[df['userID']==userID].sort_values(by='est',ascending=False)

31

## Q9 Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [317]:
user_full_info=user_data.merge(books,how='left',on='ISBN')

In [318]:
user_full_info

Unnamed: 0,userID,ISBN,rui,est,details,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,0345283554,10.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",Han Solo at Stars' End,Daley,1979,Not Avail
1,2110,0803298145,9.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",When Worlds Collide (Bison Frontiers of Imagin...,Philip Wylie,1999,University of Nebraska Press
2,2110,0373619472,7.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",Freedom Watch (Stony Man #63),Don Pendleton,2003,Gold Eagle
3,2110,037361490X,5.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",Age of War (Super Bolan #90),Don Pendleton,2003,Gold Eagle
4,2110,0373614845,6.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",Mack Bolan: Dark Truth,Don Pendleton,2002,Gold Eagle
5,2110,0142302163,8.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",The Ghost Sitter,Peni R. Griffin,2002,Puffin Books
6,2110,0373642911,7.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",Blood Trade (The Executioner #291),Don Pendleton,2003,Gold Eagle
7,2110,059046678X,10.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",The Yearbook,Peter Lerangis,1994,Scholastic
8,2110,0439222303,10.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...","Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
9,2110,0698119916,8.0,7.832142,"{'was_impossible': True, 'reason': 'User and i...",Time Stops for No Mouse (Hermux Tantamoq Adven...,Michael Hoeye,2003,Penguin USA (Paper)


## Q10 Get top 10 recommendations for above given userID from the books not already rated by that user

In [321]:
testset_new = trainset.build_anti_testset()

In [322]:
predictions = svd_model.test(testset_new)

In [323]:
predictions_df = pd.DataFrame([[x.uid,x.est,x.iid] for x in predictions])

In [332]:
predictions_df.columns = ["userId","est_rating","ISBN"]
# Note: iloc[0:9] keeps nine rows; iloc[0:10] would return a full top 10
predictions_df[predictions_df['userId']==userID].sort_values(by = ["userId", "est_rating"],ascending=False).iloc[0:9,:].merge(books,how='left',on='ISBN')

Unnamed: 0,userId,est_rating,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,7.664795,394747232,Maus a Survivors Tale: My Father Bleeds History,Art Spiegelman,1986,Pantheon Books
1,2110,7.600071,375703764,House of Leaves,Mark Z. Danielewski,2000,Pantheon Books
2,2110,7.534536,671003755,She's Come Undone (Oprah's Book Club (Paperback)),Wally Lamb,1996,Washington Square Press
3,2110,7.173186,60929790,One Hundred Years of Solitude,Gabriel Garcia Marquez,1998,Perennial
4,2110,6.939132,385337639,Crow Lake (Today Show Book Club #7),Mary Lawson,2003,Delta
5,2110,6.851989,553295772,Extreme Measures,Michael Palmer,1992,Bantam Books
6,2110,6.802494,60976241,The Lone Ranger and Tonto Fistfight in Heaven,Sherman Alexie,1994,Perennial
7,2110,6.667222,316776963,Me Talk Pretty One Day,David Sedaris,2001,Back Bay Books
8,2110,6.567608,446310786,To Kill a Mockingbird,Harper Lee,1988,Little Brown &amp; Company
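
The output above lists nine books because `iloc[0:9]` excludes position 9. A minimal sketch of a top-N helper that returns exactly `n` rows, assuming a predictions table with the same column names; the helper `top_n` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical predictions table with the same column names as predictions_df
toy_preds = pd.DataFrame({
    "userId": [2110] * 12 + [99] * 3,
    "est_rating": [7.6, 5.1, 6.9, 7.2, 6.5, 4.8, 7.1, 6.0,
                   5.5, 6.7, 7.0, 6.2, 9.0, 8.0, 7.0],
    "ISBN": [f"book{i}" for i in range(15)],
})

def top_n(preds, user_id, n=10):
    """Return the n highest-estimated rows for one user, best first."""
    user_rows = preds[preds["userId"] == user_id]
    return user_rows.nlargest(n, "est_rating").reset_index(drop=True)

top10 = top_n(toy_preds, 2110)
print(top10)
```

The result can then be merged with `books` on `ISBN`, as in the cell above, to attach titles and authors.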


# Content Based Recommendation System - Optional ( Q11 - Q19 will not be graded)

## Q11 Read the Dataset `movies_metadata.csv`

In [402]:
movies_df=pd.read_csv('movies_metadata.csv')
movies_df.head()



Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [403]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null objec

## Q12 Create a new column with name 'description' combining `'overview' and 'tagline'` columns in the given dataset

In [404]:
# Note: + yields NaN when either overview or tagline is missing, and adds no space between them
movies_df['description']=movies_df['overview']+movies_df['tagline']
movies_df['description'].head()

0                                                  NaN
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: description, dtype: object
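
Row 0 above is `NaN` because string concatenation with `+` propagates missing values: a movie without a tagline ends up without a description. A minimal sketch of a concatenation that tolerates a missing field, on hypothetical rows (the texts are invented):

```python
import pandas as pd

# Hypothetical rows: the first has no tagline
toy_movies = pd.DataFrame({
    "overview": ["Toys come to life.", "A board game goes wrong."],
    "tagline": [None, "Roll the dice!"],
})

# fillna("") keeps rows where one field is missing; the space separates the two texts
toy_movies["description"] = (
    toy_movies["overview"].fillna("") + " " + toy_movies["tagline"].fillna("")
).str.strip()
print(toy_movies["description"].tolist())
```

Whether to keep such rows is a design choice: keeping them retains more movies, at the cost of shorter descriptions for some of them.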

## Q13  Drop the null values in the `description` column

In [405]:
movies_df=movies_df[~movies_df['description'].isnull()]
print(movies_df.isnull().sum())
print(movies_df.shape)

adult                        0
belongs_to_collection    17832
budget                       0
genres                       0
homepage                 15994
id                           0
imdb_id                      4
original_language            0
original_title               0
overview                     0
popularity                   0
poster_path                 15
production_companies         0
production_countries         0
release_date                14
revenue                      0
runtime                      0
spoken_languages             0
status                      14
tagline                      0
title                        0
video                        0
vote_average                 0
vote_count                   0
description                  0
dtype: int64
(20404, 25)


## Q14 Keep the first occurrence and drop duplicates of each title in column `title`

In [406]:
movies_df.drop_duplicates(subset='title',inplace=True)
movies_df.shape

(19437, 25)

## Q15   As the previous step may have dropped rows with a duplicate `title`, reset the index [make sure no new column is added to the dataframe when resetting the index]

In [407]:
# drop=True discards the old index instead of adding it as a new column
movies_df.reset_index(drop=True, inplace=True)
movies_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,description
0,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,When siblings Judy and Peter discover an encha...
1,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,A family wedding reignites the ancient feud be...
2,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"Cheated on, mistreated and stepped on, the wom..."
3,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,Just when George Banks has recovered from his ...
4,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,"Obsessive master thief, Neil McCauley leads a ..."


## Q16    Generate the tf-idf matrix using the column `description`. Consider n-grams up to 3-grams, with a minimum document frequency of 0.

Hint:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')

In [408]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
# fit_transform learns the vocabulary and builds the tf-idf matrix in one pass
description_matrix = tf.fit_transform(movies_df['description'])

In [409]:
description_matrix.shape

(19437, 1149506)

## Q17  Create cosine similarity matrix

In [410]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim_description = cosine_similarity(description_matrix)
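
Cosine similarity between two rows is their dot product divided by the product of their lengths. A minimal NumPy sketch of the same computation on a tiny made-up matrix follows; the array `X` is hypothetical, and the notebook itself uses scikit-learn's `cosine_similarity` on the sparse tf-idf matrix.

```python
import numpy as np

# Hypothetical tf-idf-like matrix: 3 documents, 4 terms
X = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])

# Scale each row to unit length; the dot product of unit vectors is the cosine
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms
cosine_sim = X_unit @ X_unit.T
print(cosine_sim.round(3))
```

Documents 0 and 1 point in the same direction and score 1, while document 2 shares no terms with them and scores 0. At the notebook's scale the dense result holds 19437 x 19437 values, roughly 3 GB if stored as float64, so it is worth checking available memory before running the cell.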

## Q18  Write a function with name `recommend` which takes `title` as argument and returns a list of 10 recommended title names in the output based on the above cosine similarities

Hint:

titles = df['title'] <br>
indices = pd.Series(df.index, index=df['title']) <br>

def recommend(title): <br>
    idx = indices[title] <br>
    sim_scores = list(enumerate(cosine_similarities[idx])) <br>
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True) <br>
    sim_scores = sim_scores[1:31] <br>
    movie_indices = [i[0] for i in sim_scores] <br>
    return titles.iloc[movie_indices] <br>

In [415]:
titles = movies_df['title'] 
indices = pd.Series(movies_df.index, index=movies_df['title']) 
def recommend(title): 
    idx = indices[title] 
    sim_scores = list(enumerate(cosine_sim_description[idx])) 
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True) 
    sim_scores = sim_scores[1:11] 
    movie_indices = [i[0] for i in sim_scores] 
    return titles.iloc[movie_indices] 

## Q19 Give the recommendations from the above function for the movies `The Godfather` and `The Dark Knight Rises`

In [416]:
recommend('The Godfather')

864      The Godfather: Part II
15874          Honor Thy Father
12008                The Family
12608                Blood Ties
3228                       Made
4000         Johnny Dangerously
26               Shanghai Triad
18496             Live by Night
5899                       Fury
18911    In Memory of My Father
Name: title, dtype: object

In [417]:
recommend('The Dark Knight Rises')

8087                        The Dark Knight
112                          Batman Forever
986                          Batman Returns
2330           Batman: Mask of the Phantasm
451                                  Batman
11899       Batman: Mystery of the Batwoman
9476             Batman: Under the Red Hood
6369     Batman Beyond: Return of the Joker
13487                     Batman vs Dracula
10503                      Batman: Year One
Name: title, dtype: object

# Popularity Based Recommendation System

### About Dataset

Anonymous Ratings on jokes.

1. Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").

2. One row per user

3. The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.

# Q20 Read the dataset(jokes.csv)

Take care about the header in read_csv() as there are no column names given in the dataset. 

In [354]:
jokes_df=pd.read_csv('jokes.csv')
jokes_df.head()

Unnamed: 0,NumJokes,Joke1,Joke2,Joke3,Joke4,Joke5,Joke6,Joke7,Joke8,Joke9,...,Joke91,Joke92,Joke93,Joke94,Joke95,Joke96,Joke97,Joke98,Joke99,Joke100
0,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0
3,48,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,99.0,...,99.0,99.0,99.0,0.53,99.0,99.0,99.0,99.0,99.0,99.0
4,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6


In [355]:
print(jokes_df.shape)
jokes_df.info()

(24983, 101)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24983 entries, 0 to 24982
Columns: 101 entries, NumJokes to Joke100
dtypes: float64(100), int64(1)
memory usage: 19.3 MB


# Q21 Create a dataframe named `ratings` with only the first 200 rows and all columns from index 1 onward (the first column is index 0)

In [361]:
#ratings=jokes_df.iloc[:,1:]
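# .copy() makes ratings independent of jokes_df, so the later in-place replace does not act on a slice
ratings=jokes_df.iloc[0:200,1:].copy()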
print(ratings.shape)
ratings.info()

(200, 100)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 100 columns):
Joke1      200 non-null float64
Joke2      200 non-null float64
Joke3      200 non-null float64
Joke4      200 non-null float64
Joke5      200 non-null float64
Joke6      200 non-null float64
Joke7      200 non-null float64
Joke8      200 non-null float64
Joke9      200 non-null float64
Joke10     200 non-null float64
Joke11     200 non-null float64
Joke12     200 non-null float64
Joke13     200 non-null float64
Joke14     200 non-null float64
Joke15     200 non-null float64
Joke16     200 non-null float64
Joke17     200 non-null float64
Joke18     200 non-null float64
Joke19     200 non-null float64
Joke20     200 non-null float64
Joke21     200 non-null float64
Joke22     200 non-null float64
Joke23     200 non-null float64
Joke24     200 non-null float64
Joke25     200 non-null float64
Joke26     200 non-null float64
Joke27     200 non-null float64
Joke28     200 non-

# Q22 Change the column indices from 0 to 99

In [362]:
#Column indices/names were already part of file

# Q23 In the dataset, the null ratings are given as 99.00, so replace all 99.00s with 0
Hint: You can use `ratings.replace(<the given value>, <new value you wanted to change with>)`

In [363]:
ratings.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Joke1,200.0,29.73505,45.142111,-9.71,-0.8525,4.220,99.0000,99.00
Joke2,200.0,18.90890,38.567707,-9.95,-2.5950,3.200,8.3125,99.00
Joke3,200.0,34.88765,47.374806,-9.71,-0.7275,5.195,99.0000,99.00
Joke4,200.0,45.33400,50.301999,-9.76,-1.6850,8.035,99.0000,99.00
Joke5,200.0,-0.03790,5.433866,-9.81,-4.2200,0.340,4.2325,9.22
Joke6,200.0,31.02005,44.822417,-9.85,0.6300,5.530,99.0000,99.00
Joke7,200.0,0.17150,5.627576,-9.95,-4.4800,0.850,4.6850,9.27
Joke8,200.0,-0.08530,4.861652,-9.85,-4.2700,0.680,3.4250,9.27
Joke9,200.0,41.59670,49.136842,-9.85,-0.4525,6.190,99.0000,99.00
Joke10,200.0,16.73470,36.337965,-9.76,-1.7500,2.720,8.1225,99.00


In [364]:
ratings.replace(to_replace=99,value=0,inplace=True)
ratings.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Joke1,200.0,0.53005,4.507771,-9.71,-0.8525,0.000,3.4125,9.27
Joke2,200.0,0.59390,4.915306,-9.95,-2.5950,0.000,4.5350,9.27
Joke3,200.0,0.23765,4.471239,-9.71,-0.7275,0.000,2.3300,9.27
Joke4,200.0,-0.70100,3.863859,-9.76,-1.6850,0.000,0.0000,8.83
Joke5,200.0,-0.03790,5.433866,-9.81,-4.2200,0.340,4.2325,9.22
Joke6,200.0,1.32005,4.393323,-9.85,0.0000,0.000,4.9375,9.27
Joke7,200.0,0.17150,5.627576,-9.95,-4.4800,0.850,4.6850,9.27
Joke8,200.0,-0.08530,4.861652,-9.85,-4.2700,0.680,3.4250,9.27
Joke9,200.0,0.01670,4.037797,-9.85,-0.4525,0.000,1.3350,9.03
Joke10,200.0,0.89470,5.006043,-9.76,-1.7500,0.680,4.3200,9.32


# Q24 Normalize the ratings using StandardScaler and save them in `ratings_diff` variable

In [365]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
ratings_diff=pd.DataFrame(scaler.fit_transform(ratings))
ratings_diff.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0,200.0,-1.110223e-17,1.002509,-2.277344,-0.307473,-0.117881,0.641045,1.943728
1,200.0,-9.658940e-17,1.002509,-2.150499,-0.650397,-0.121130,0.803814,1.769548
2,200.0,-4.107825e-17,1.002509,-2.230392,-0.216399,-0.053284,0.469132,2.025169
3,200.0,3.719247e-17,1.002509,-2.350430,-0.255307,0.181880,0.181880,2.472895
4,200.0,-8.881784e-18,1.002509,-1.802882,-0.771567,0.069720,0.787858,1.708016
5,200.0,-2.636780e-18,1.002509,-2.548886,-0.301221,-0.301221,0.825463,1.814094
6,200.0,-1.665335e-18,1.002509,-1.803067,-0.828629,0.120870,0.804045,1.620828
7,200.0,2.331468e-17,1.002509,-2.013555,-0.862917,0.157811,0.723850,1.929134
8,200.0,-4.329870e-17,1.002509,-2.449717,-0.116494,-0.004146,0.327309,2.237834
9,200.0,-2.553513e-17,1.002509,-2.133708,-0.529627,-0.042996,0.685950,1.687249


### Popularity based recommendation system

# Q25  Find the mean for each column  in `ratings_diff` i.e, for each joke
Consider all the mean ratings and find the jokes with highest mean value and display the top 10 joke IDs.

In [366]:
mean_df=ratings_diff.mean()
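# Note: iloc[0:9] keeps nine rows (iloc[0:10] would give ten); after standardization every mean is ~0
mean_df = mean_df.rename_axis('JokeID').reset_index(name='mean').sort_values(by='mean',ascending=False).iloc[0:9,:]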
mean_df

Unnamed: 0,JokeID,mean
98,98,1.840889e-16
81,81,1.24345e-16
97,97,1.198694e-16
20,20,1.065814e-16
94,94,1.054712e-16
92,92,8.770762000000001e-17
47,47,7.771561000000001e-17
73,73,7.743806000000001e-17
99,99,7.549517e-17
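
The means above are all on the order of 1e-16 because standardization sets every column mean to zero, so this ranking reflects floating-point rounding more than popularity. As an alternative sketch, and not what the exercise asks for, popularity could be ranked on the raw ratings with the "not rated" code 99 treated as missing. The frame `raw` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw ratings: 99 marks "not rated"
raw = pd.DataFrame({
    "Joke1": [5.0, 99.0, 7.0, 99.0],
    "Joke2": [-3.0, 2.0, 99.0, 1.0],
    "Joke3": [9.0, 8.0, 99.0, 99.0],
})

# Treat 99 as missing so unrated entries do not pull the mean toward 0
rated = raw.replace(99.0, np.nan)

# mean() skips NaN by default; sort to rank jokes by average rating, best first
popularity = rated.mean().sort_values(ascending=False)
top = popularity.head(10)
print(top)
```

A mean over very few ratings can be misleading, so in practice a minimum number of ratings per joke is often required before a joke is ranked.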
