# Import the necessary libraries
- Pandas: Used for data manipulation and analysis.
- Requests: Used for making HTTP requests and interacting with web APIs.
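- re: Provides support for regular expressions in Python (only needed by the optional title clean-up in the cover search).

In [1]:
import re  # only needed if the optional title clean-up is enabled

import pandas as pd
import requests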

# Load data
The cell below reads two CSV files, `ratings.csv` and `books.csv`, into pandas DataFrames.

1. `pd.read_csv('training/ratings.csv')`: reads `ratings.csv` from the `training/` directory (relative to the notebook's working directory) into a DataFrame named `ratings`.

2. `pd.read_csv('training/books.csv')`: likewise reads `books.csv` from the same directory into a DataFrame named `books`.

3. `pd.Series(books.title.values, index=books.index).to_dict()`: builds a Series from the "title" column of `books`, indexed by the DataFrame's own index, and converts it to a dictionary with `.to_dict()`. Assigned to `book_id_to_name`, this dictionary maps each book's row index to its title. This expression is not part of the cell below, so it has to be run separately if the lookup is needed.

After the cell runs, `ratings` holds the contents of `ratings.csv` and `books` holds the contents of `books.csv`.

In [2]:
ratings = pd.read_csv('training/ratings.csv')
books = pd.read_csv('training/books.csv')
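
The title lookup from step 3 is not created by the cell above. A minimal sketch of how it could be built is shown here on a small stand-in DataFrame; `toy_books` and its titles are made up for illustration and are not taken from `books.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the real `books` DataFrame
toy_books = pd.DataFrame({"title": ["Book A", "Book B", "Book C"]})

# Map each row index to its title, using the expression from step 3
book_id_to_name = pd.Series(toy_books.title.values, index=toy_books.index).to_dict()

print(book_id_to_name[1])
```

Because the keys are row indices, the dictionary only lines up with `book_id` values if the DataFrame's index is set accordingly.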

We print the first few records and a summary of the data for a quick examination.

In [3]:
print(ratings.head())
print(ratings.describe())

   book_id  user_id  rating
0        1      314       5
1        1      439       3
2        1      588       5
3        1     1169       4
4        1     1185       4
             book_id        user_id         rating
count  981756.000000  981756.000000  981756.000000
mean     4943.275636   25616.759933       3.856534
std      2873.207415   15228.338826       0.983941
min         1.000000       1.000000       1.000000
25%      2457.000000   12372.000000       3.000000
50%      4921.000000   25077.000000       4.000000
75%      7414.000000   38572.000000       5.000000
max     10000.000000   53424.000000       5.000000


# Analyzing User Ratings and Rating Count Thresholds

1. `user_ratings = ratings.groupby('user_id')['rating'].count()`: groups the `ratings` DataFrame by `user_id` and counts the ratings of each user. The result is a Series named `user_ratings`, indexed by user ID.

2. `user_rating_counts = ratings['user_id'].value_counts()`: counts how often each user ID occurs in the `user_id` column. The resulting Series, `user_rating_counts`, holds the same per-user counts as `user_ratings`, sorted by count instead of by user ID.

3. `users_with_ratings = user_rating_counts.groupby(user_ratings).count()`: groups the users by their number of ratings, so `users_with_ratings` gives the number of users for each rating count. It is not used in the rest of the cell.

4. `rating_thresholds = list(range(5, 100, 5))`: creates the list of rating count thresholds 5, 10, ..., 95 (100 is excluded).

5. The loop then counts the users in each band. For each threshold X it counts the users with fewer than X ratings and subtracts the users already counted under lower thresholds, so every user is counted once, in the first band whose threshold exceeds their rating count. The counts are appended to `count_per_threshold`.

6. `percent_per_threshold = [round((count / total_users) * 100) for count in count_per_threshold]`: converts each count into a percentage of the whole user base, rounded to a whole number. `total_users` is hardcoded to 53424.

7. Finally, a DataFrame named `df` is created with the columns 'fewer than X', 'count', and 'percent', holding the thresholds, the per-band user counts, and the percentages, respectively.

The resulting DataFrame is then displayed.

In [4]:
user_ratings = ratings.groupby('user_id')['rating'].count()
user_rating_counts = ratings['user_id'].value_counts()
# Count the number of users for each number of ratings
users_with_ratings = user_rating_counts.groupby(user_ratings).count()
# Create a list of rating count thresholds
rating_thresholds = list(range(5, 100, 5))

# Count the users with fewer than X ratings, excluding users already counted
# under a lower threshold
count_per_threshold = []
previous_count = 0  # running total of users counted so far
total_users = 53424  # Total number of users (hardcoded)
for threshold in rating_thresholds:
    count = user_ratings[user_ratings < threshold].count() - previous_count
    count_per_threshold.append(count)
    previous_count += count

# Calculate the percentage of the whole user base
percent_per_threshold = [round((count / total_users) * 100) for count in count_per_threshold]

# Create the DataFrame
df = pd.DataFrame({"fewer than X": rating_thresholds, "count": count_per_threshold, "percent": percent_per_threshold})

# Display the DataFrame
df


   fewer than X  count  percent
0             5  17714       33
1            10  11305       21
2            15   5859       11
3            20   3907        7
4            25   2759        5
5            30   2082        4
6            35   1671        3
7            40   1305        2
8            45   1020        2
9            50    875        2
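
The per-band counting can also be written without a running total. The sketch below is an alternative formulation, not the notebook's own code: it takes the cumulative "fewer than X" counts and differences them. The helper name `band_counts` and the toy rating counts are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def band_counts(user_ratings: pd.Series, thresholds: list) -> pd.DataFrame:
    """Count users per band [previous threshold, threshold)."""
    # Cumulative number of users with fewer than X ratings, per threshold
    cumulative = [int((user_ratings < t).sum()) for t in thresholds]
    # Differencing the cumulative counts gives the users in each band
    counts = np.diff([0] + cumulative)
    total_users = len(user_ratings)
    percent = np.round(counts / total_users * 100).astype(int)
    return pd.DataFrame(
        {"fewer than X": thresholds, "count": counts, "percent": percent}
    )


# Toy per-user rating counts (made up); the user with 30 ratings falls in no band
toy_user_ratings = pd.Series([1, 3, 4, 5, 9, 12, 30])
toy_bands = band_counts(toy_user_ratings, [5, 10, 15])
print(toy_bands)
```

Deriving the total from `len(user_ratings)` avoids the hardcoded user count, on the assumption that every user appears in the Series exactly once.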


### Filtering and Preparing User Data

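The following cell filters the ratings and builds a per-user table.

#### Filter Out Users with Few Ratings

Users with fewer than `filter_out` ratings (here 10) are excluded from the analysis. The remaining users are renumbered with consecutive IDs starting at 0, a `cold_start` column with the default value `False` is added to the ratings, and a `users` DataFrame is built that holds each user's rating count and a `new_data` flag, also `False`.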

In [6]:
filter_out = 10  # minimum number of ratings a user needs to be kept

# Drop every rating made by a user with fewer than `filter_out` ratings
low_activity_users = user_rating_counts[user_rating_counts < filter_out].index
filtered_ratings = ratings[~ratings['user_id'].isin(low_activity_users)].copy()

# Renumber the remaining users with consecutive IDs starting at 0
filtered_ratings['user_id'] = filtered_ratings.groupby('user_id').ngroup()

# Add the 'cold_start' column with default value False
filtered_ratings['cold_start'] = False

# Count the ratings of each remaining user
users = filtered_ratings.groupby('user_id').size().reset_index(name='rating_count')
users['new_data'] = False

ratings = filtered_ratings
users.head()

   user_id  rating_count  new_data
0        0            76     False
1        1            28     False
2        2            22     False
3        3            10     False
4        4            59     False
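
The filtering and renumbering step can be tried out on a few rows. The sketch below is a compact variant that uses `groupby(...).transform("size")` instead of `isin`; the helper name `filter_active_users` and the toy ratings are assumptions, not part of the notebook.

```python
import pandas as pd


def filter_active_users(ratings: pd.DataFrame, min_ratings: int) -> pd.DataFrame:
    """Keep users with at least `min_ratings` ratings and renumber them from 0."""
    # Number of ratings of the user behind each row
    per_user = ratings.groupby("user_id")["rating"].transform("size")
    kept = ratings[per_user >= min_ratings].copy()
    # ngroup() numbers the groups in sorted user_id order: 0, 1, 2, ...
    kept["user_id"] = kept.groupby("user_id").ngroup()
    return kept


# Toy ratings (made up): user 3 has a single rating and is dropped
toy_ratings = pd.DataFrame(
    {
        "book_id": [1, 2, 3, 1, 1, 2, 3],
        "user_id": [7, 7, 7, 3, 9, 9, 9],
        "rating": [5, 4, 3, 2, 5, 5, 1],
    }
)
toy_kept = filter_active_users(toy_ratings, min_ratings=3)
print(toy_kept)
```

After renumbering, the original user IDs are no longer recoverable from the ratings, so a mapping has to be kept separately if they are needed later.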


The filtered ratings and the new `users` DataFrame are saved to CSV files at the end of the notebook.

# Searching for Book Cover Images and Amazon Links

This code snippet retrieves book cover images and Amazon links for books in the `books` DataFrame.

1. It iterates over each row in the DataFrame and checks whether `amazon_link` starts with 'https://www.amazon.com/'.
2. If not, it performs a Google search for the book cover image using the book title.
3. The title clean-up (removing special characters, replacing whitespace with `+`, and converting to lowercase) is commented out in the current version, so the raw title is used.
4. The search term is constructed and a GET request is sent to the Google Custom Search API.
5. The response is parsed as JSON and the list of search result items is retrieved.
6. It checks each item for scraped data with an image link and, if one is found, assigns it to `image_url`.
7. For items that carry scraped data, it also checks whether the item's link is an Amazon link. If so, the link is assigned to `amazon_link` and the search over the items stops.
8. Finally, it updates the `image_url` and `amazon_link` columns in the `books` DataFrame.

This enhances the `books` DataFrame with cover images and Amazon links for books that did not have a valid Amazon link initially.
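
The title clean-up from step 3 can be sketched as a small helper. The name `normalize_title` is hypothetical, and the sketch uses `\s+` so that runs of whitespace collapse into a single `+`, which differs slightly from a per-character replacement.

```python
import re


def normalize_title(book_title: str) -> str:
    """Turn a book title into a lowercase, '+'-separated search string."""
    # Remove everything except word characters, whitespace, and hyphens
    cleaned = re.sub(r"[^\w\s-]", "", book_title)
    # Collapse each run of whitespace into a single '+'
    joined = re.sub(r"\s+", "+", cleaned.strip())
    return joined.lower()


print(normalize_title("Harry Potter & the Sorcerer's Stone"))
```

Whether this clean-up improves the search results is untested here; it mainly keeps characters such as `&` out of the query string.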

In [None]:
# Read the API key and search engine ID (strip any trailing newline)
with open('../api_key', 'rb') as key_file:
    api_key = key_file.read().decode().strip()
with open('../search_engine_id', 'rb') as key_file:
    search_engine_id = key_file.read().decode().strip()

search_url = "https://www.googleapis.com/customsearch/v1"

for index, row in books.iterrows():
    amazon_link = str(row['amazon_link'])

    # Only search for books that do not already have an Amazon link
    if not amazon_link.startswith('https://www.amazon.com/'):
        book_title = row['title']
        # Optional title clean-up, currently disabled:
        #search_title = re.sub(r'[^\w\s-]', '', book_title)  # remove special characters
        #search_title = re.sub(r'\s', '+', search_title)     # replace whitespace with '+'
        #search_title = search_title.lower()                 # convert to lowercase
        search_term = f"{book_title} book cover amazon"
        # Pass the query as params so that requests URL-encodes the title
        params = {"key": api_key, "cx": search_engine_id, "q": search_term}
        # Perform the search and parse the JSON response
        response = requests.get(search_url, params=params)
        search_results = response.json()
        items = search_results.get("items", [])  # Get the list of items from the search results

        image_url = None

        for item in items:
            pagemap = item.get("pagemap", {})  # Get the pagemap dictionary of the item
            scraped = pagemap.get("scraped", [])  # Get the list of scraped items

            if scraped:
                image_link = scraped[0].get("image_link")  # Get the image link from the scraped item
                if image_link:
                    image_url = image_link  # Found an image link, assign it to image_url

                link = item.get("link")  # Get the link from the item
                if link and link.startswith('https://www.amazon.com/'):
                    amazon_link = link  # Found an Amazon link, assign it to amazon_link
                    break
        # Write the results back to the books DataFrame
        books.at[index, 'image_url'] = image_url
        books.at[index, 'amazon_link'] = amazon_link
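
The item-parsing loop can be checked without calling the API by moving it into a function and feeding it a hand-written response. This is a sketch under assumptions: `pick_links` is a hypothetical helper, and the sample items are made up to mirror the `pagemap` / `scraped` / `image_link` structure the cell above expects.

```python
AMAZON_PREFIX = "https://www.amazon.com/"


def pick_links(items, amazon_link):
    """Return (image_url, amazon_link) chosen from a list of search result items."""
    image_url = None
    for item in items:
        scraped = item.get("pagemap", {}).get("scraped", [])
        if scraped:
            image_link = scraped[0].get("image_link")
            if image_link:
                image_url = image_link
            link = item.get("link")
            # Stop at the first Amazon link found on an item with scraped data
            if link and link.startswith(AMAZON_PREFIX):
                amazon_link = link
                break
    return image_url, amazon_link


# Made-up items shaped like the "items" list of a search response
sample_items = [
    {"link": "https://example.com/a",
     "pagemap": {"scraped": [{"image_link": "https://example.com/a.jpg"}]}},
    {"link": "https://www.amazon.com/dp/X",
     "pagemap": {"scraped": [{"image_link": "https://example.com/b.jpg"}]}},
    {"link": "https://www.amazon.com/dp/Y",
     "pagemap": {"scraped": [{"image_link": "https://example.com/c.jpg"}]}},
]
sample_result = pick_links(sample_items, amazon_link="nan")
print(sample_result)
```

Note that the image URL kept is the one from the last item visited before the loop stops, not necessarily the first image found.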


Save the edited DataFrames to their CSV files.

In [7]:
ratings.to_csv('ratings.csv',index=False)
books.to_csv('books.csv',index=False)
users.to_csv('users.csv',index=False)
