# 👋 Welcome to the first INFOMPPM seminar 👋

During the seminar session, we'll begin by exploring how public values are intertwined with recommender systems. You'll need to consider the type of data required to develop a recommender system, along with the potential opportunities and risks these systems pose to different values. Next, we will delve into the basics of creating a recommender system in Python, covering:

1. Non-personalized recommendations (including ratings, seeding, confidence, and support)
2. Implicit ratings
3. Using Streamlit

The activities are designed to test your understanding of the readings, help you get your codebook operational, extract features from existing data, and engage with core concepts.


### Dataset
The dataset you will use for this assignment is the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/), mined by [Cai-Nicolas Ziegler](http://dbis.informatik.uni-freiburg.de/team/ziegler/cai).

The dataset:
> ... a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

Although the dataset may seem outdated, it serves as an excellent starting point and presents several challenges.

# 🔬 Data exploration and preparation
In this notebook, we'll examine the dataset and create a subset of it for further analysis. The dataset was relatively clean when downloaded, though we addressed some problematic delimiter issues for you. If you're interested in tackling these issues firsthand, the original dataset is available at the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

### 1. Loading the data
Load the three datasets and explore the data.


In [23]:
# code goes here
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', encoding='latin-1')
df_books = pd.read_csv('data/BX-Books.csv', low_memory=False, sep=';', encoding='latin-1')
df_users = pd.read_csv('data/BX-Users.csv', low_memory=False, sep=';', encoding='latin-1')

### 2. Cleaning the data
Ensure that all reviews are linked to a book. Investigate whether there are any reviews that lack a corresponding book or user. Verify the accuracy of author names and identify any anomalies, such as users who have submitted an unusually high number of reviews. Describe the process you followed to clean the data.

In [19]:
# code goes here
# check for missing values
print("\nRatings")
print(df_books_ratings.isnull().sum())
print("\nBooks")
print(df_books.isnull().sum())
print("\nUsers")
print(df_users.isnull().sum())


Ratings
User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

Books
ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

Users
User-ID          0
Location         0
Age         110762
dtype: int64


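Age is missing for 110,762 users. The other columns are almost complete: only a handful of books lack an author, publisher, or large image URL.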

In [35]:
# list all the reviews that are not linked to a book
rating_with_no_existing_book = df_books_ratings[~df_books_ratings['ISBN'].isin(df_books['ISBN'])]
print(df_books_ratings.head())
print(df_books_ratings.shape)
print(rating_with_no_existing_book.shape)
print(rating_with_no_existing_book.head())

   User-ID        ISBN  Book-Rating
0   276725  034545104X            0
1   276726  0155061224            5
2   276727  0446520802            0
3   276729  052165615X            3
4   276729  0521795028            6
(1149779, 3)
(118604, 3)
    User-ID        ISBN  Book-Rating
6    276736  3257224281            8
7    276737  0600570967            6
9    276745   342310538           10
25   276748  3442437407            0
26   276751  033390804X            0


About 10 percent of the ratings (118,604 of 1,149,779) refer to an ISBN that does not appear in `BX-Books.csv`.
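The same kind of check can be applied to users, and a per-user count helps spot unusually active reviewers. The sketch below is only an illustration on invented toy data: the `ratings` and `users` frames stand in for `df_books_ratings` and `df_users`, and the values are made up.

```python
import pandas as pd

# Toy stand-ins for df_books_ratings and df_users (invented values).
ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 1, 2, 3, 9],
    "ISBN": ["a", "b", "c", "d", "a", "b", "a"],
    "Book-Rating": [0, 5, 7, 8, 10, 0, 6],
})
users = pd.DataFrame({"User-ID": [1, 2, 3]})

# Ratings whose user does not appear in the users table.
orphan_user_ratings = ratings[~ratings["User-ID"].isin(users["User-ID"])]

# Number of ratings per user, most active user first.
ratings_per_user = ratings["User-ID"].value_counts()

print(orphan_user_ratings)
print(ratings_per_user.describe())
```

On the real data, the summary from `describe()` gives a first impression of how skewed user activity is. What counts as "unusually high" remains a judgment call.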

### 3. Subsetting the data
The publication that accompanies this dataset, [Improving Recommendation Lists Through Topic Diversification](http://www2.informatik.uni-freiburg.de/~cziegler/BX/WWW-2005-Preprint.pdf) by Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen, describes the process of subsetting the dataset (the condensation steps) as follows (p. 5):

> Hence, we discarded all books missing taxonomic descriptions, along with all ratings referring to them. Next, we also removed book titles with fewer than 20 overall mentions. Only community members with at least five ratings each were kept.

Investigate the significance of these parameters for the dataset as a whole. Additionally, decide whether to include implicit ratings (where Book-Rating equals 0) in your final dataset. Consider the potential consequences of this choice. Would you opt to exclude them prior to assessing other parameters, or would it be more appropriate to exclude them later?

Although the publication outlines the expected dimensions of the resulting dataset, it's acceptable if your findings vary at this stage.
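To get a feeling for how the two thresholds and the treatment of implicit ratings interact, a small experiment can help. The sketch below is one possible approach, not the procedure from the publication: the `condense` helper and its parameter names are hypothetical, the data is invented, and the thresholds are lowered to 2 so the effect is visible on six rows.

```python
import pandas as pd

def condense(ratings, min_book_ratings=20, min_user_ratings=5):
    """Keep books with enough ratings, then users with enough ratings."""
    book_counts = ratings.groupby("ISBN")["User-ID"].transform("count")
    kept = ratings[book_counts >= min_book_ratings]
    user_counts = kept.groupby("User-ID")["ISBN"].transform("count")
    return kept[user_counts >= min_user_ratings]

# Invented toy ratings; a Book-Rating of 0 marks an implicit rating.
ratings = pd.DataFrame({
    "User-ID": [1, 2, 3, 1, 2, 3],
    "ISBN": ["a", "a", "a", "b", "b", "c"],
    "Book-Rating": [0, 5, 7, 8, 0, 9],
})

# Option 1: condense on all ratings, drop implicit ratings afterwards.
condensed_all = condense(ratings, min_book_ratings=2, min_user_ratings=2)
explicit_last = condensed_all[condensed_all["Book-Rating"] > 0]

# Option 2: drop implicit ratings first, then condense.
explicit_only = ratings[ratings["Book-Rating"] > 0]
explicit_first = condense(explicit_only, min_book_ratings=2, min_user_ratings=2)

print(len(condensed_all), len(explicit_last), len(explicit_first))
```

On this toy data the two orders give different results, which is the point of the question above. Note also that a single pass can leave books below the book threshold once users have been removed, so you may want to check the counts again after filtering.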

In [None]:
# code goes here

### 4. Extra step
Examine the `BX-Books.csv` file specifically for the book _Robots and Empire_ by Isaac Asimov. Identify any issues you come across. Would you address these issues?

Given that this could pose a problem for our dataset, consider how you would resolve it. You may need to revisit step 2 if you decide to undertake this additional step.

In [None]:
# code goes here

### 5. Save the new dataset(s)
Save the dataset(s) as distinctly named CSV files for later use. Move the file(s) to the `data` directory.


In [None]:
# code goes here

# 📚 Recommendations based on most reviewed books
You will start by generating recommendations based on the most reviewed books. Although this approach is not personalized, it remains widely used and provides an opportunity to familiarize yourself with the Streamlit app located in the app directory.

### 1. Calculate the total reviews per book

In [None]:
# code goes here

### 2. Save the recommendations

Select the top 10 books based on the number of reviews. Save these recommendations in a file named `recommendations-most-reviewed.csv`. Then, update the `app/recommendations` directory by replacing the existing recommendations file with this new one. The current recommendations in the app require significant improvements. Ensure the file includes the following columns: `ISBN;count`.
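A minimal sketch of counting and exporting, assuming a ratings frame with an `ISBN` column. The toy data is invented and `top_n` is set to 2 only to keep the output short. Calling `to_csv` without a path returns the CSV text, which makes it easy to check the `ISBN;count` header before writing a file.

```python
import pandas as pd

# Invented toy ratings; in the notebook this would be your cleaned subset.
ratings = pd.DataFrame({
    "User-ID": [1, 2, 3, 1, 2, 3],
    "ISBN": ["a", "a", "a", "b", "b", "c"],
})

top_n = 2  # the exercise asks for 10

# Count the reviews per book and keep the most reviewed ones.
most_reviewed = (
    ratings.groupby("ISBN")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
    .head(top_n)
)

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = most_reviewed.to_csv(sep=";", index=False)
print(csv_text)
```

Passing a file name as the first argument of `to_csv` writes the same content to disk.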

In [None]:
# code goes here

### 3. Run the Streamlit app
This might be your first experience running a Streamlit app. We've supplied you with boilerplate code to view your recommendations through a functional interface. As you progress, you may want to adjust some buttons or include additional metadata. Therefore, it's beneficial to familiarize yourself with the [Streamlit documentation](https://docs.streamlit.io/library/api-reference). For aspiring data scientists, the ability to create quick proofs-of-concept is essential.

1. Install Streamlit (for example with `pip install streamlit`)
2. Go to the terminal, navigate to the `app` folder and type `streamlit run app.py`

# 📚 Recommendations based on average ratings
You will create your first recommendations using average ratings. This method highlights books with high reader ratings, combining popularity with quality. You'll calculate each book's average rating and choose the top-rated ones for your recommendations.

### 1. Calculate the average ratings
Calculate the average ratings and the number of reviews (count) for the books in your new dataset(s).
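One way to get both numbers in a single pass is a grouped aggregation. The sketch below uses invented toy data and assumes only explicit ratings are present, since implicit ratings of 0 would pull the mean down.

```python
import pandas as pd

# Invented toy ratings (explicit ratings only).
ratings = pd.DataFrame({
    "ISBN": ["a", "b", "b", "b"],
    "Book-Rating": [10, 8, 7, 9],
})

# Mean rating and number of ratings per book, best average first.
book_stats = (
    ratings.groupby("ISBN")["Book-Rating"]
    .agg(["mean", "count"])
    .reset_index()
    .sort_values("mean", ascending=False)
)

print(book_stats)
```

Keeping the `count` column next to `mean` makes it easier to judge how much evidence is behind each average.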

In [None]:
# code goes here

### 2. Save the recommendations
Choose the top 10 books based on average rating and save them as `recommendations-ratings-avg.csv`, replacing the existing file in the app directory. Ensure the file includes the columns: `ISBN;mean`. After you have saved it, you can refresh Streamlit to see the results.

In [None]:
# code goes here

### 3. Reflect on the recommendations
Examine the average rating and number of reviews for the top 10 books. Reflect on why solely using average ratings isn't the best method for recommendations.

# 📚 Recommendations based on weighted ratings
Considering the drawbacks of using average ratings, you will now develop recommendations based on the weighted average for each book. Refer to the article [Building a Recommendation System using weighted-average score](https://medium.com/@developeraritro/building-a-recommendation-system-using-weighted-hybrid-technique-75598b6be8ed) to understand and apply this concept.


### 1. Calculate the weighted average rating for each book
Determine the mean vote value (C) for the entire dataset.
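As a rough sketch, the snippet below assumes the IMDB-style weighted rating that is commonly used for this technique: `W = (v / (v + m)) * R + (m / (v + m)) * C`, where `R` is a book's mean rating, `v` its number of ratings, `C` the mean rating over the whole dataset, and `m` a minimum-votes threshold. Check the article for the exact variant it uses. The value of `m` and the toy numbers are assumptions for illustration.

```python
import pandas as pd

# Invented per-book statistics: mean rating (R) and number of ratings (v).
book_stats = pd.DataFrame({
    "ISBN": ["a", "b", "c"],
    "mean": [10.0, 8.0, 6.0],
    "count": [1, 30, 9],
})

m = 10  # hypothetical minimum-votes threshold; tune it for your data

# C: mean rating over all individual ratings in the dataset.
C = (book_stats["mean"] * book_stats["count"]).sum() / book_stats["count"].sum()

v = book_stats["count"]
R = book_stats["mean"]

# Books with few ratings are pulled towards C, books with many keep their own mean.
book_stats["weight"] = (v / (v + m)) * R + (m / (v + m)) * C

ranked = book_stats.sort_values("weight", ascending=False)
print(ranked)
```

In this toy example the book with a single perfect rating drops below the book with thirty ratings, which is the behaviour the weighting is meant to produce.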


In [None]:
# code goes here

### 2. Save the recommendations
Choose the top 10 books based on their weighted ratings and save these recommendations as `recommendations-ratings-weight.csv`. Then, update the app directory by replacing the existing file. Ensure the file includes the columns: `ISBN;weight`.

In [None]:
# code goes here

### 3. Compare recommendations based on average rating and weighted ratings
Review the interface to note any significant differences with this method.

# 📚 Recommendations based on Frequently Reviewed Together (frequency)
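Use the `permutations` function from `itertools` to create ordered pairs of books that are frequently reviewed together. Because the pairs are ordered, both (A, B) and (B, A) are generated, so every book can act as the seed of a recommendation.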

### 1. Quick introduction to permutations

In [None]:
from itertools import permutations

# items bought together
items = ['milk', 'bread', 'eggs']

# this code creates all ordered pairs of 2 items from the list above
list(permutations(items, 2))

### 2. Count the combinations of books reviewed together
Create combinations with `permutations` and count how often each combination occurs. This process might be time-consuming, depending on your initial data exploration.
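A minimal sketch of the counting step, using `collections.Counter` on invented data. The `reviews_per_user` dictionary is a hypothetical intermediate structure that maps each user to the books they reviewed. How you build it from your dataframe is up to you.

```python
from collections import Counter
from itertools import permutations

# Hypothetical mapping from user to the books that user reviewed.
reviews_per_user = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c"],
}

pair_counts = Counter()
for books in reviews_per_user.values():
    # set() guards against a user reviewing the same book twice.
    pair_counts.update(permutations(sorted(set(books)), 2))

print(pair_counts.most_common(3))
```

The number of pairs grows quadratically with the number of books per user, which is why very active users can make this step slow.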

In [None]:
# code goes here

### 3. Save the recommendations
Given the potential size of the output, limit the CSV file to include only the top 10 recommendations per book. Save this as `recommendations-seeded-freq.csv` and update the file in the app directory. Remember to enable the code block related to this step if it was previously commented out.
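Limiting the output to the top entries per book can be sketched as follows. The column names `source`, `target`, and `count` are placeholders chosen for this illustration, so check which columns the app expects for this file. The data is invented and `top_k` is set to 2 to keep the example small.

```python
import pandas as pd

# Invented pair counts; the column names are placeholders.
pairs = pd.DataFrame(
    [("a", "b", 5), ("a", "c", 2), ("a", "d", 9), ("b", "a", 5), ("b", "c", 1)],
    columns=["source", "target", "count"],
)

top_k = 2  # the exercise asks for 10

# Sort within each source book by count, then keep the first top_k rows per book.
top_per_book = (
    pairs.sort_values(["source", "count"], ascending=[True, False])
    .groupby("source")
    .head(top_k)
    .reset_index(drop=True)
)

print(top_per_book)
```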


# 📚 Recommendations based on Frequently Reviewed Together (association rules)
For the final segment of this assignment, refer to section 5.4 of the _Practical Recommender Systems_ book (pages 113-127). After reading, download the code provided by the book and focus on `association_rules_calculator.py` in the `builder` directory. Your task is to adapt this code for use in this notebook, translating its steps into a format suitable for our environment.

The steps found in the source code are:
1. Load the data
2. Generate transactions or, in our case, reviews
3. Calculate the support and confidence
4. Save the results

### 1. Load the data
Instead of using a database, load your `.csv` files into a dataframe. Select the data necessary for identifying which user reviewed which books.


In [None]:
# code goes here

### 2. Generating the reviews
In this context, transactions are the reviews. You need to compile a list of lists, where each inner list contains reviews that are related, similar to how shopping lists are grouped in the example: `[['eggs','milk','bread'], ['bacon', 'bread'], [...], [...]]`
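One possible way to build such a list of lists is to group the ratings by user, so that each inner list holds the books reviewed by one user. The sketch below uses invented toy data and assumes the `User-ID` and `ISBN` column names from the ratings file.

```python
import pandas as pd

# Invented toy ratings.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3, 3],
    "ISBN": ["a", "b", "b", "a", "b", "c"],
})

# One inner list of ISBNs per user.
transactions = ratings.groupby("User-ID")["ISBN"].apply(list).tolist()

print(transactions)
```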

In [None]:
# code goes here

### 3. Calculate the support and confidence
This requires some puzzling, but looking at the source code will give you a clear idea. You can reuse the subroutines in the source code and pass along the list containing the reviews that belong together. Play around with the _minimum support_ parameter: a value that is too strict will leave you with very few associations.
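To make the two measures concrete, here is a small self-contained sketch. It is not the code from the book: the `association_rules` function is a hypothetical helper that only considers pairs of items. It uses the usual definitions, where the support of a pair is the fraction of transactions that contain both items and the confidence of A → B is the pair count divided by the count of A.

```python
from collections import Counter
from itertools import combinations

def association_rules(transactions, min_support=0.5):
    """Return (source, target, support, confidence) tuples for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair occurs too rarely to be trusted
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

transactions = [
    ["eggs", "milk", "bread"],
    ["bacon", "bread"],
    ["milk", "bread"],
    ["eggs", "milk"],
]

rules = association_rules(transactions, min_support=0.5)
for rule in rules:
    print(rule)
```

With four transactions and a minimum support of 0.5, only pairs that occur at least twice survive. Lowering the threshold lets rarer pairs through, at the cost of noisier rules.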

In [None]:
# code goes here

### 4. Save the results
Create a dataframe from the results of step 3. To make it work with the current app, make sure the columns are `source;target;support;confidence`. Save the recommendations as `recommendations-seeded-associations.csv` and replace the file in the app directory.

In [None]:
# code goes here