# First Steps

First, Google Drive is mounted to give access to all project data, and the original DataFrame is loaded.

In [None]:
# RUN THIS COMMAND ONLY IF YOU USE GOOGLE COLAB.
from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/AI_Projekt_24"

Mounted at /content/drive
/content/drive/.shortcut-targets-by-id/19FtEjCVrulf15txxPt0nskakA-HPrEHX/AI_Projekt_24


In [None]:
# Libraries needed for the first steps, up to the sentiment analysis
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("/content/drive/MyDrive/AI_Projekt_24/Data/Raw/Books_rating.csv")
print(df.shape)
df.head(10)

(3000000, 10)


Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text
0,1882931173,Its Only Art If Its Well Hung!,,AVCGYZL8FQQTD,"Jim of Oz ""jim-of-oz""",7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,,A30TK6U7DNS82R,Kevin Killian,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,,A3UH4UZ4RSVO82,John Granger,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,,A2MVUWT453QH61,"Roy E. Perry ""amateur philosopher""",7/7,4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,,A22X4XUPKF66MR,"D. H. Richards ""ninthwavestore""",3/3,4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...
5,826414346,Dr. Seuss: American Icon,,A2F6NONFUDB6UK,Malvin,2/2,4.0,1127174400,One of America's greatest creative talents,"""Dr. Seuss: American Icon"" by Philip Nel is a ..."
6,826414346,Dr. Seuss: American Icon,,A14OJS0VWMOSWO,Midwest Book Review,3/4,5.0,1100131200,A memorably excellent survey of Dr. Seuss' man...,Theodor Seuss Giesel was best known as 'Dr. Se...
7,826414346,Dr. Seuss: American Icon,,A2RSSXTDZDUSH4,J. Squire,0/0,5.0,1231200000,Academia At It's Best,When I recieved this book as a gift for Christ...
8,826414346,Dr. Seuss: American Icon,,A25MD5I2GUIW6W,"J. P. HIGBED ""big fellow""",0/0,5.0,1209859200,And to think that I read it on the tram!,Trams (or any public transport) are not usuall...
9,826414346,Dr. Seuss: American Icon,,A3VA4XFS5WNJO3,Donald Burnside,3/5,4.0,1076371200,Fascinating account of a genius at work,"As far as I am aware, this is the first book-l..."


### Preparing the DataFrame for the Sentiment Analysis
In this step, the DataFrame is reduced to speed up the sentiment analysis. Users with fewer than 50 reviews are removed first, followed by books with fewer than 5 reviews among the remaining rows.

In [None]:
# Filter the DataFrame to users with at least 50 reviews

# Count the number of unique user IDs before filtering
initial_unique_user_ids = df["User_id"].nunique()
print(f"Number of different user IDs before filtering: {initial_unique_user_ids}")

# Count the frequency of each user ID
user_id_counts = df["User_id"].value_counts()

# Determine the user IDs that appear at least 50 times
user_ids_to_keep = user_id_counts[user_id_counts >= 50].index

# Filter the DataFrame to keep only the rows with the user IDs that appear at least 50 times
df_filtered = df[df["User_id"].isin(user_ids_to_keep)]

# Count the number of unique user IDs after filtering
filtered_unique_user_ids = df_filtered["User_id"].nunique()
print(f"Number of different user IDs after filtering: {filtered_unique_user_ids}")

print(df_filtered.shape)

Number of different user IDs before filtering: 1008972
Number of different user IDs after filtering: 2743
(344861, 10)


In [None]:
# Filter the new DataFrame to book titles with at least 5 reviews

# Count the number of unique book titles before filtering:
initial_unique_titles = df_filtered["Title"].nunique()
print(f"Number of different book titles before filtering: {initial_unique_titles}")

# Count the frequency of each book title:
title_counts = df_filtered["Title"].value_counts()

# Determine the book titles that appear at least 5 times
titles_to_keep = title_counts[title_counts >= 5].index

# Filtering the DataFrame to keep only the rows with the book titles that appear at least 5 times
df_filtered_final = df_filtered[df_filtered["Title"].isin(titles_to_keep)]

# Count the number of unique book titles after filtering
filtered_final_unique_titles = df_filtered_final["Title"].nunique()
print(f"Number of different book titles after filtering: {filtered_final_unique_titles}")

print(df_filtered_final.shape)

#df_filtered_final.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Raw/book_review_filtered.csv', index=False)

Number of different book titles before filtering: 77094
Number of different book titles after filtering: 11967
(244731, 10)
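
The two filtering cells above follow the same pattern: count how often each value occurs, then keep only the rows whose value is frequent enough. As a minimal sketch, this pattern could be wrapped in one helper. The function name `keep_frequent` and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def keep_frequent(frame: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = frame[column].value_counts()
    frequent_values = counts[counts >= min_count].index
    return frame[frame[column].isin(frequent_values)]

# Toy data (made up): user "a" has three reviews, user "b" has one
toy = pd.DataFrame({
    "User_id": ["a", "a", "a", "b"],
    "Title": ["x", "x", "y", "x"],
})

# Same order as in the notebook: users first, then titles on the reduced frame
by_user = keep_frequent(toy, "User_id", min_count=3)
by_title = keep_frequent(by_user, "Title", min_count=2)
print(by_user.shape, by_title.shape)
```

Because the title filter runs on the frame that has already been filtered by user, the order of the two steps matters: after the second step, some of the remaining users may have fewer reviews left than the user threshold.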


# Sentiment Analysis
In this step, a sentiment analysis is carried out on the written reviews. The model used is `bert-base-multilingual-uncased-sentiment` from NLP Town, a fine-tuned BERT model provided on Hugging Face. This model (“Hugging Face BERT”) is openly available and was developed specifically for sentiment analysis. It assigns each review one of five labels, from 1 star (very negative) through 3 stars (neutral) to 5 stars (very positive), together with a confidence score. In this way, the written reviews can be used as input for a prediction model.
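
The model returns its result as a text label of the form `4 stars` (see the `sentiment` column in the result tables further down). A minimal sketch of how such a label could be mapped to the five categories named above; the helper `stars_to_category` is hypothetical and not used elsewhere in this notebook.

```python
def stars_to_category(label: str) -> str:
    """Map a label such as '4 stars' to one of five sentiment categories."""
    # Category names follow the description in the text; the mapping itself is an assumption
    names = {
        1: "very negative",
        2: "negative",
        3: "neutral",
        4: "positive",
        5: "very positive",
    }
    stars = int(label.split()[0])  # '4 stars' -> 4, '1 star' -> 1
    return names[stars]

print(stars_to_category("4 stars"))
```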

In [None]:
# Libraries needed for the sentiment analysis
import os  # needed to check whether a progress file already exists
import pandas as pd
import numpy as np

# tqdm and ipywidgets display the current progress of the analysis
!pip install tqdm ipywidgets
from tqdm.notebook import tqdm

# The executor lets the sentiment analysis process several reviews in parallel on the CPU
from concurrent.futures import ThreadPoolExecutor



### Installing Hugging Face **BERT**


In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Loading the filtered DataFrame and doing the Analysis
The sentiment analysis code is optimised for CPU. The reviews are processed in batches and the results are saved at regular intervals (the batch size and the save interval can be adjusted manually). If the run aborts during the analysis, it can simply be restarted and continues automatically from the last save point.

ATTENTION: The analysis takes approx. 30-40 hours. You can skip this step and use the edited dataset directly in the next step.
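
The save-and-resume idea described above can be sketched on a small scale. This is a simplified variant under stated assumptions, not the notebook's implementation: it appends each batch to the progress file instead of rewriting the whole file, and `toy_score` is a made-up stand-in for the BERT model so that the example runs in a fraction of a second.

```python
import os
import tempfile
import pandas as pd

def toy_score(text):
    # Hypothetical stand-in for the sentiment model
    return "5 stars" if "good" in text else "1 star"

def run_with_checkpoints(data, score_fn, progress_path, batch_size=2):
    """Process `data` in batches and append each finished batch to a progress file."""
    # Resume: the number of rows already saved tells us where to continue
    start = len(pd.read_csv(progress_path)) if os.path.exists(progress_path) else 0
    for begin in range(start, len(data), batch_size):
        batch = data.iloc[begin:begin + batch_size].copy()
        batch["sentiment"] = [score_fn(text) for text in batch["review/text"]]
        # Write the header only when the file is created
        batch.to_csv(progress_path, mode="a",
                     header=not os.path.exists(progress_path), index=False)
    return pd.read_csv(progress_path)

reviews = pd.DataFrame({"review/text": ["good book", "bad book", "good plot", "bad end", "good fun"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "progress.csv")
    # Simulate an interrupted run: only the first three rows are processed
    run_with_checkpoints(reviews.iloc[:3], toy_score, path)
    # Restart on the full data: processing continues at row 3
    result = run_with_checkpoints(reviews, toy_score, path)

print(len(result))
```

Appending keeps each save cheap even when the progress file is large; rewriting the full file, as the notebook does, is simpler and leaves a file that is complete after every save.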

In [None]:
df = pd.read_csv("/content/drive/MyDrive/AI_Projekt_24/Data/Raw/book_review_filtered.csv")
print(df.shape)
#df.head(10)

(244731, 10)


In [None]:
# Function for sentiment analysis
def analyze_sentiment(review):
    tokens = tokenizer.encode(review, truncation=True, max_length=512, return_tensors='pt')
    segment_text = tokenizer.decode(tokens[0], skip_special_tokens=True)
    result = pipe(segment_text)
    sentiment = result[0]['label']
    score = result[0]['score']
    return sentiment, score

# Define the progress file path
progress_file_path = '/content/drive/MyDrive/AI_Projekt_24/Data/Raw/sentiment_analysis_progress.csv'

# Check if a progress file already exists and load it if it does
if os.path.exists(progress_file_path):
    df_progress = pd.read_csv(progress_file_path)
    start_index = df_progress.shape[0]
else:
    df_progress = pd.DataFrame(columns=list(df.columns) + ['sentiment', 'score'])
    start_index = 0

# Select the rows from the DataFrame that haven't been processed yet
df_remaining = df.iloc[start_index:].copy()

# Define the chunk size and batch size
chunk_size = 1000  # Save every 1000 rows
batch_size = 200   # Process 200 reviews at a time


# Function to process chunks of the dataframe
def process_chunk(start_index, end_index):
    chunk = df_remaining.iloc[start_index:end_index].copy()
    reviews = chunk['review/text'].tolist()

    # Use ThreadPoolExecutor for parallel processing
    try:
        with ThreadPoolExecutor(max_workers=4) as executor:  # Limit to 4 workers to avoid overloading
            results = list(executor.map(analyze_sentiment, reviews))

        sentiments, scores = zip(*results)
        chunk.loc[:, 'sentiment'] = sentiments
        chunk.loc[:, 'score'] = scores
    except Exception as e:
        print(f"Error processing chunk {start_index}-{end_index}: {e}")

    return chunk

# Show the progress of the sentiment analysis and save progress regularly
# Positions are relative to df_remaining, so the loop starts at 0 even when resuming
for i in tqdm(range(0, len(df_remaining), batch_size)):
    end_index = min(i + batch_size, len(df_remaining))
    processed_chunk = process_chunk(i, end_index)

    # Concatenate the processed chunk to the progress DataFrame
    df_progress = pd.concat([df_progress, processed_chunk], ignore_index=True)

    # Save progress regularly
    if end_index % chunk_size == 0 or end_index == len(df_remaining):
        df_progress.to_csv(progress_file_path, index=False)
        print(f"Progress saved at row {start_index + end_index}")

# Save the final progress
df_progress.to_csv(progress_file_path, index=False)
print("Final progress saved.")


# Merging the Datasets and doing a final cleaning
Once the sentiment analysis has been completed, its results are cleaned and then merged with a second dataset containing data on the individual books.

In [None]:
#Necessary libraries and imports for this step
import pandas as pd
import numpy as np

### Cleaning the DataFrame of the Sentiment Analysis

In [None]:
# Loading the DataFrame with the sentiment analysis results
sa = pd.read_csv(r"/content/drive/MyDrive/AI_Projekt_24/Data/Raw/sentiment_analysis_progress_final.csv")


In [None]:
#Show the first five rows of the DataFrame
sa.head(5)

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text,sentiment,score
0,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A3IKBHODOTYYHM,"fra7299 ""fra7299""",0/0,4.0,1315008000.0,Those beastly curses!,More so than many of the Sherlock Holmes' stor...,4 stars,0.700174
1,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A1E2NA2F4RTQ9B,Debnance at Readerbuzz,0/0,5.0,1313021000.0,My First Sherlock Holmes,This is a book I never expected to like. I hav...,5 stars,0.703788
2,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A34BWLZ9HERHGM,M. D. Stern,3/4,5.0,1163549000.0,A Classic That Is Timeless,I had never read a Sherlock Holmes mystery bef...,4 stars,0.622445
3,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,AJQ1S39GZBKUG,"A. T. A. Oliveira ""A. T. A. Oliveira""",3/4,5.0,1112486000.0,Conan Doyle deceives us -- and we like it,"Published in the beginning of the XX Century, ...",4 stars,0.677933
4,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A19O80VYV3XFJ8,N. Hirsch,0/0,4.0,1228003000.0,Still Fresh After 100+ Years,Not being an avid mystery reader (outside of t...,4 stars,0.558193


In [None]:
#Dropping duplicates
print(sa.shape)
sa.drop_duplicates(inplace = True)
sa.shape

(213831, 12)


(199176, 12)

In [None]:
sa.head(5)

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text,sentiment,score
0,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A3IKBHODOTYYHM,"fra7299 ""fra7299""",0/0,4.0,1315008000.0,Those beastly curses!,More so than many of the Sherlock Holmes' stor...,4 stars,0.700174
1,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A1E2NA2F4RTQ9B,Debnance at Readerbuzz,0/0,5.0,1313021000.0,My First Sherlock Holmes,This is a book I never expected to like. I hav...,5 stars,0.703788
2,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A34BWLZ9HERHGM,M. D. Stern,3/4,5.0,1163549000.0,A Classic That Is Timeless,I had never read a Sherlock Holmes mystery bef...,4 stars,0.622445
3,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,AJQ1S39GZBKUG,"A. T. A. Oliveira ""A. T. A. Oliveira""",3/4,5.0,1112486000.0,Conan Doyle deceives us -- and we like it,"Published in the beginning of the XX Century, ...",4 stars,0.677933
4,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A19O80VYV3XFJ8,N. Hirsch,0/0,4.0,1228003000.0,Still Fresh After 100+ Years,Not being an avid mystery reader (outside of t...,4 stars,0.558193


### Comparing the Results of the Sentiment Analysis
For the vast majority of reviews, the sentiment predicted by the model is consistent with the star rating given by the user. Rows where the two contradict each other completely (a difference of four stars, e.g. a review score of 5 with a predicted sentiment of 1) are removed from the DataFrame so that they do not mislead the prediction model later on.
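
Before deleting rows, it can help to look at how large the disagreement between the model and the user rating is overall. The following sketch uses made-up numbers and mirrors the comparison carried out in the cells below; the column names match the notebook, everything else is illustrative.

```python
import pandas as pd

# Toy data (made up): user rating vs. predicted sentiment
toy = pd.DataFrame({
    "review/score": [5.0, 4.0, 1.0, 5.0, 3.0],
    "sentiment":    [5.0, 3.0, 5.0, 1.0, 3.0],
})

# Absolute difference between prediction and rating, in stars
difference = (toy["sentiment"] - toy["review/score"]).abs()

# How often each size of disagreement occurs
print(difference.value_counts().sort_index())

# Keep everything except the maximum disagreement of four stars
kept = toy[difference < 4]
print(len(kept))
```

The distribution shows whether a stricter threshold (for example, also dropping differences of three stars) would remove many more rows; which threshold is appropriate remains a judgment call.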

In [None]:
# Extract the number from the string in the 'sentiment' column (e.g. "4 stars" -> 4.0)
# \d+ matches whole numbers
sa['sentiment'] = sa['sentiment'].str.extract(r'(\d+)').astype(float)
sa.head(5)

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text,sentiment,score
0,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A3IKBHODOTYYHM,"fra7299 ""fra7299""",0/0,4.0,1315008000.0,Those beastly curses!,More so than many of the Sherlock Holmes' stor...,4.0,0.700174
1,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A1E2NA2F4RTQ9B,Debnance at Readerbuzz,0/0,5.0,1313021000.0,My First Sherlock Holmes,This is a book I never expected to like. I hav...,5.0,0.703788
2,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A34BWLZ9HERHGM,M. D. Stern,3/4,5.0,1163549000.0,A Classic That Is Timeless,I had never read a Sherlock Holmes mystery bef...,4.0,0.622445
3,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,AJQ1S39GZBKUG,"A. T. A. Oliveira ""A. T. A. Oliveira""",3/4,5.0,1112486000.0,Conan Doyle deceives us -- and we like it,"Published in the beginning of the XX Century, ...",4.0,0.677933
4,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A19O80VYV3XFJ8,N. Hirsch,0/0,4.0,1228003000.0,Still Fresh After 100+ Years,Not being an avid mystery reader (outside of t...,4.0,0.558193


In [None]:
# Drop rows where the sentiment analysis and the review score contradict each other
# (a difference of 4 stars, the maximum possible, e.g. review score 5.0 and sentiment 1.0)

# Shape before
print(sa.shape)

# Calculate the absolute difference
sa['difference'] = (sa['sentiment'] - sa['review/score']).abs()

# Show the rows with a difference of 4
filtered_sa1 = sa[sa['difference'] > 3.9]
print(filtered_sa1[['sentiment', 'review/score', 'difference']])

# Filter the DataFrame to keep rows where the difference is below 4
filtered_sa = sa[sa['difference'] <= 3.9].copy()

# Drop the 'difference' column as it's no longer needed
filtered_sa.drop(columns='difference', inplace=True)

# Display the filtered DataFrame
print(filtered_sa.shape)
print("\nFiltered DataFrame:")
filtered_sa.head(5)

# => 1007 rows removed

(199176, 12)
        sentiment  review/score  difference
279           1.0           5.0         4.0
685           1.0           5.0         4.0
793           1.0           5.0         4.0
2164          1.0           5.0         4.0
2479          5.0           1.0         4.0
...           ...           ...         ...
212192        1.0           5.0         4.0
212504        1.0           5.0         4.0
212907        1.0           5.0         4.0
213185        1.0           5.0         4.0
213187        1.0           5.0         4.0

[1007 rows x 3 columns]
(198169, 12)

Filtered DataFrame:


Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text,sentiment,score
0,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A3IKBHODOTYYHM,"fra7299 ""fra7299""",0/0,4.0,1315008000.0,Those beastly curses!,More so than many of the Sherlock Holmes' stor...,4.0,0.700174
1,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A1E2NA2F4RTQ9B,Debnance at Readerbuzz,0/0,5.0,1313021000.0,My First Sherlock Holmes,This is a book I never expected to like. I hav...,5.0,0.703788
2,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A34BWLZ9HERHGM,M. D. Stern,3/4,5.0,1163549000.0,A Classic That Is Timeless,I had never read a Sherlock Holmes mystery bef...,4.0,0.622445
3,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,AJQ1S39GZBKUG,"A. T. A. Oliveira ""A. T. A. Oliveira""",3/4,5.0,1112486000.0,Conan Doyle deceives us -- and we like it,"Published in the beginning of the XX Century, ...",4.0,0.677933
4,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A19O80VYV3XFJ8,N. Hirsch,0/0,4.0,1228003000.0,Still Fresh After 100+ Years,Not being an avid mystery reader (outside of t...,4.0,0.558193


In [None]:
# show cleaned sentiment_analysis_progress_final.csv
print(filtered_sa.shape)
filtered_sa.head(5)

(198169, 12)


Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text,sentiment,score
0,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A3IKBHODOTYYHM,"fra7299 ""fra7299""",0/0,4.0,1315008000.0,Those beastly curses!,More so than many of the Sherlock Holmes' stor...,4.0,0.700174
1,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A1E2NA2F4RTQ9B,Debnance at Readerbuzz,0/0,5.0,1313021000.0,My First Sherlock Holmes,This is a book I never expected to like. I hav...,5.0,0.703788
2,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A34BWLZ9HERHGM,M. D. Stern,3/4,5.0,1163549000.0,A Classic That Is Timeless,I had never read a Sherlock Holmes mystery bef...,4.0,0.622445
3,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,AJQ1S39GZBKUG,"A. T. A. Oliveira ""A. T. A. Oliveira""",3/4,5.0,1112486000.0,Conan Doyle deceives us -- and we like it,"Published in the beginning of the XX Century, ...",4.0,0.677933
4,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A19O80VYV3XFJ8,N. Hirsch,0/0,4.0,1228003000.0,Still Fresh After 100+ Years,Not being an avid mystery reader (outside of t...,4.0,0.558193


In [None]:
# save cleaned dataframe
#filtered_sa.to_csv(r"/content/drive/MyDrive/AI_Projekt_24/Data/Cleaned/sentiment_analysis_progress_final_clean.csv", index=False)


### Merging sentiment_analysis_progress_final_clean.csv and books_data.csv

In [None]:
# Load both DataFrames
bd = pd.read_csv(r"/content/drive/MyDrive/AI_Projekt_24/Data/Raw/books_data.csv")
br = pd.read_csv(r"/content/drive/MyDrive/AI_Projekt_24/Data/Cleaned/sentiment_analysis_progress_final_clean.csv")

In [None]:
#Check size and shape of both DataFrames
print("bd shape: ")
print(bd.shape)
print("br shape: ")
print(br.shape)

bd shape: 
(212404, 10)
br shape: 
(198169, 12)


In [None]:
bd.head(5)

Unnamed: 0,Title,description,authors,image,previewLink,publisher,publishedDate,infoLink,categories,ratingsCount
0,Its Only Art If Its Well Hung!,,['Julie Strain'],http://books.google.com/books/content?id=DykPA...,http://books.google.nl/books?id=DykPAAAACAAJ&d...,,1996,http://books.google.nl/books?id=DykPAAAACAAJ&d...,['Comics & Graphic Novels'],
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],http://books.google.com/books/content?id=IjvHQ...,http://books.google.nl/books?id=IjvHQsCn_pgC&p...,A&C Black,2005-01-01,http://books.google.nl/books?id=IjvHQsCn_pgC&d...,['Biography & Autobiography'],
2,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,['David R. Ray'],http://books.google.com/books/content?id=2tsDA...,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,,2000,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,['Religion'],
3,Whispers of the Wicked Saints,Julia Thomas finds her life spinning out of co...,['Veronica Haddon'],http://books.google.com/books/content?id=aRSIg...,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,iUniverse,2005-02,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,['Fiction'],
4,"Nation Dance: Religion, Identity and Cultural ...",,['Edward Long'],,http://books.google.nl/books?id=399SPgAACAAJ&d...,,2003-03-01,http://books.google.nl/books?id=399SPgAACAAJ&d...,,


In [None]:
br.head(5)

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text,sentiment,score
0,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A3IKBHODOTYYHM,"fra7299 ""fra7299""",0/0,4.0,1315008000.0,Those beastly curses!,More so than many of the Sherlock Holmes' stor...,4.0,0.700174
1,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A1E2NA2F4RTQ9B,Debnance at Readerbuzz,0/0,5.0,1313021000.0,My First Sherlock Holmes,This is a book I never expected to like. I hav...,5.0,0.703788
2,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A34BWLZ9HERHGM,M. D. Stern,3/4,5.0,1163549000.0,A Classic That Is Timeless,I had never read a Sherlock Holmes mystery bef...,4.0,0.622445
3,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,AJQ1S39GZBKUG,"A. T. A. Oliveira ""A. T. A. Oliveira""",3/4,5.0,1112486000.0,Conan Doyle deceives us -- and we like it,"Published in the beginning of the XX Century, ...",4.0,0.677933
4,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A19O80VYV3XFJ8,N. Hirsch,0/0,4.0,1228003000.0,Still Fresh After 100+ Years,Not being an avid mystery reader (outside of t...,4.0,0.558193


In [None]:
# Merge both DataFrames
books = pd.merge(br, bd, on='Title')
print(books.shape)

(198169, 21)
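
The merged DataFrame has the same number of rows as `br`, which suggests that every review found exactly one matching book. A minimal sketch of how this could be checked explicitly with the `validate` and `indicator` parameters of `pd.merge`; the toy frames are made up and only stand in for `br` and `bd`.

```python
import pandas as pd

# Toy stand-ins for the review table and the book table
reviews = pd.DataFrame({"Title": ["A", "A", "B", "C"], "sentiment": [4.0, 5.0, 2.0, 3.0]})
books_info = pd.DataFrame({"Title": ["A", "B"], "authors": ["['X']", "['Y']"]})

checked = pd.merge(
    reviews, books_info, on="Title",
    how="left",               # keep reviews even if no book matches
    validate="many_to_one",   # raise an error if a title occurs twice in books_info
    indicator=True,           # add a '_merge' column showing where each row came from
)

# Reviews without a matching book are marked 'left_only'
unmatched = int((checked["_merge"] == "left_only").sum())
print(checked.shape, unmatched)
```

With the default inner join used in the notebook, unmatched reviews would be dropped silently; the left join with an indicator makes them visible instead.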


In [None]:
books.head(5)

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text,...,score,description,authors,image,previewLink,publisher,publishedDate,infoLink,categories,ratingsCount
0,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A3IKBHODOTYYHM,"fra7299 ""fra7299""",0/0,4.0,1315008000.0,Those beastly curses!,More so than many of the Sherlock Holmes' stor...,...,0.700174,"Sherlock Holmes at his best, the inimitable sl...",['Arthur Conan Doyle'],,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,Library Reproduction Services,1997,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,['Fiction'],3.0
1,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A1E2NA2F4RTQ9B,Debnance at Readerbuzz,0/0,5.0,1313021000.0,My First Sherlock Holmes,This is a book I never expected to like. I hav...,...,0.703788,"Sherlock Holmes at his best, the inimitable sl...",['Arthur Conan Doyle'],,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,Library Reproduction Services,1997,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,['Fiction'],3.0
2,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A34BWLZ9HERHGM,M. D. Stern,3/4,5.0,1163549000.0,A Classic That Is Timeless,I had never read a Sherlock Holmes mystery bef...,...,0.622445,"Sherlock Holmes at his best, the inimitable sl...",['Arthur Conan Doyle'],,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,Library Reproduction Services,1997,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,['Fiction'],3.0
3,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,AJQ1S39GZBKUG,"A. T. A. Oliveira ""A. T. A. Oliveira""",3/4,5.0,1112486000.0,Conan Doyle deceives us -- and we like it,"Published in the beginning of the XX Century, ...",...,0.677933,"Sherlock Holmes at his best, the inimitable sl...",['Arthur Conan Doyle'],,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,Library Reproduction Services,1997,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,['Fiction'],3.0
4,1581180012,Hound of the Baskervilles (Lrs Large Print Her...,,A19O80VYV3XFJ8,N. Hirsch,0/0,4.0,1228003000.0,Still Fresh After 100+ Years,Not being an avid mystery reader (outside of t...,...,0.558193,"Sherlock Holmes at his best, the inimitable sl...",['Arthur Conan Doyle'],,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,Library Reproduction Services,1997,http://books.google.nl/books?id=L3SYPwAACAAJ&d...,['Fiction'],3.0


In [None]:
#save to csv
#books.to_csv(r"/content/drive/MyDrive/AI_Projekt_24/Data/Cleaned/merge_sentiment_books_data.csv", index=False)


# Splitting the DataFrame into a Training and Test Set

To give all models the same basis, a fixed training and test set is created and used for every model. In addition, one user is extracted from the data so that each model can be tested on this user as a potential new customer.

In [None]:
#Necessary libraries and imports for this step
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Loading the CSV file
df = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Cleaned/merge_sentiment_books_data.csv')
df.head(5)
df.shape

(198169, 21)

### Dropping unnecessary columns from the DataFrame

In [None]:
df = df.drop(['review/text', "review/summary", "Id", "profileName", "previewLink", "description", "infoLink", "image"], axis=1)
df.head(5)
df.shape

(198169, 13)

### Extracting a single user

In [None]:
# Checking the usability of a user by counting their reviews
user_id = 'A19O80VYV3XFJ8'
user_counts = df['User_id'].value_counts().get(user_id, 0)

print(f'The User {user_id} has made {user_counts} reviews')

The User A19O80VYV3XFJ8 has made 46 reviews


In [None]:
# Extracting a single user
user_to_extract = df[df['User_id'] == 'A19O80VYV3XFJ8']
df = df[df['User_id'] != 'A19O80VYV3XFJ8']
df.shape

(198123, 13)

In [None]:
#Save the single user
#user_to_extract.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/new_user.csv', index=False)

### Splitting into training and test set

In [None]:
#Normal split first (will be adjusted later)
train_data, test_data = train_test_split(df, test_size=0.25, random_state=42)
print(train_data.shape, test_data.shape)

# Ensure that the test set only contains users who are also present in the training set
train_users = set(train_data['User_id'])
test_data = test_data[test_data['User_id'].isin(train_users)]
print(train_data.shape, test_data.shape)

(148592, 13) (49531, 13)
(148592, 13) (49525, 13)
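
The second line of the output shows that six rows were removed from the test set: they belong to users whose reviews all ended up in the test set, so the model would never see them during training. A small sketch of this effect with a hand-picked split on made-up data (the notebook itself uses a random split):

```python
import pandas as pd

# Toy data (made up): u3 has only a single review
ratings = pd.DataFrame({
    "User_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "sentiment": [5.0, 4.0, 3.0, 2.0, 1.0, 4.0],
})

# Hand-picked split so the effect is visible: u3's only review lands in the test set
test_part = ratings.iloc[[2, 4, 5]]
train_part = ratings.drop(test_part.index)

# Keep only test rows whose user also appears in the training set
known_users = set(train_part["User_id"])
test_part = test_part[test_part["User_id"].isin(known_users)]
print(sorted(test_part["User_id"]))
```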


In [None]:
# Saving the training- and testset
#train_data.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_train_set.csv', index=False)
#test_data.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_test_set.csv', index=False)

### Repeating the process for the big DataFrame
The large dataset, for which no sentiment analysis was carried out, serves as a comparison, so a training and test set is created from it as well.

In [None]:
# Loading both raw CSV-Files
df_rating = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Raw/Books_rating.csv')
df_books = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Raw/books_data.csv')

#Merging both Dataframes
df = pd.merge(df_rating, df_books, on='Title')

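# Dropping all duplicates
# NOTE: the result is stored in df_no_duplicates but is not used below;
# the following steps and the shapes shown refer to df including duplicates
df_no_duplicates = df.drop_duplicates()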

#Limit the dataframe again to users who have made 50 or more reviews
#Calculate number of reviews per user
user_review_counts = df['User_id'].value_counts()

#  Filter users who have submitted at least 50 reviews
users_to_keep = user_review_counts[user_review_counts >= 50].index

# Filter DataFrame to keep only the users who fulfil the condition
df = df[df['User_id'].isin(users_to_keep)]

df.shape

(344861, 19)

In [None]:
# Dropping unnecessary data columns
df = df.drop(['review/text', "review/summary", "Id", "profileName", "previewLink", "description", "infoLink", "image"], axis=1)

#Normal split first (will be adjusted later)
train_data, test_data = train_test_split(df, test_size=0.25, random_state=42)
print(train_data.shape, test_data.shape)

# Ensure that the test set only contains users who are also present in the training set
train_users = set(train_data['User_id'])
test_data = test_data[test_data['User_id'].isin(train_users)]
print(train_data.shape, test_data.shape)

(258645, 11) (86216, 11)
(258645, 11) (86216, 11)


In [None]:
# Saving the new training and test set
#train_data.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/big_train_set.csv', index=False)
#test_data.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/big_test_set.csv', index=False)

# Training the Model
In this step, a model is trained to predict the ratings that individual users give to books. A regression model based on XGBoost (gradient-boosted regression trees) is used. Two models are trained: the target model uses the 'sentiment' value from the sentiment analysis as its target, while the second model uses the same parameters but takes the original "review/score" as its target.

### Installing XGBoost and all necessary libraries and imports

In [None]:
!pip install xgboost
from xgboost import XGBRegressor  # XGBoost for regression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import joblib  # To save the model



## First Model with 'sentiment' as target
In the first model, 'sentiment' is used as the prediction target and the model is trained accordingly.

### Loading the Test and Training Set of the DataFrame with the Sentiment Analysis

In [None]:
# Loading the Datasets
X_train = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_train_set.csv')
X_test = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_test_set.csv')

In [None]:
# Show all column names
print(X_train.columns)

Index(['Title', 'Price', 'User_id', 'review/helpfulness', 'review/score',
       'review/time', 'sentiment', 'score', 'authors', 'publisher',
       'publishedDate', 'categories', 'ratingsCount'],
      dtype='object')


### Preparing the two sets for the model

XGBoost can only work with numerical data (float and int). Therefore, categorical data, dates and fractions must be converted first.

In [None]:
#Transforming categorical columns into numerical labels

# The list of columns to be converted
categorical_columns = ['Title', 'User_id', 'authors', 'publisher', 'categories']

# LabelEncoder initialising
label_encoders = {}

#To ensure that the label encoder converts all data correctly, the two sets must first be combined again.
#Otherwise, the label encoder will only convert data in the test set that it already knows from the training set.

# Create a combined list of all categories
for column in categorical_columns:
    # Combine all unique values from train and test data
    all_categories = pd.concat([X_train[column], X_test[column]], axis=0).unique()

    # Initialise and train the LabelEncoder
    le = LabelEncoder()
    le.fit(all_categories)

    # Transform the training data
    X_train[column] = le.transform(X_train[column])

    # Saving the encoder for later use
    label_encoders[column] = le

    # Transform the test data
    X_test[column] = le.transform(X_test[column])

# Show the new dataframes
#print(X_train.head())
#print(X_test.head())
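
Why the encoder is fitted on both sets can be shown with a minimal sketch. The values `train_values` and `test_values` are hypothetical toy data, not taken from the project files.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values standing in for one categorical column
train_values = ["A", "B", "C"]
test_values = ["B", "D"]  # "D" never occurs in the training values

# Encoder fitted on the training values only
le_train_only = LabelEncoder()
le_train_only.fit(train_values)

try:
    le_train_only.transform(test_values)
    unseen_error = None
except ValueError as err:
    unseen_error = err  # raised because "D" is an unseen label

# Fitting on the union of both sets avoids the error
le_all = LabelEncoder()
le_all.fit(train_values + test_values)
encoded_test = le_all.transform(test_values)

print(unseen_error)
print(encoded_test)
```

Note that a label learned this way for a value seen only in the test set carries no information from training, so this approach mainly keeps the transformation from failing.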


In [None]:
#Transform the column "review/helpfulness"

#Function to transform the column
def convert_helpfulness_to_percentage(df, column_name):
    # Create two new columns for the numerator and the denominator
    df['helpfulness_numerator'] = df[column_name].str.split('/', expand=True)[0].astype(float)
    df['helpfulness_denominator'] = df[column_name].str.split('/', expand=True)[1].astype(float)

    # Calculate the share of helpful votes (a value between 0 and 1; 0/0 yields NaN)
    df[column_name] = df['helpfulness_numerator'] / df['helpfulness_denominator']

    # Remove the auxiliary columns when they are no longer needed
    df.drop(['helpfulness_numerator', 'helpfulness_denominator'], axis=1, inplace=True)

# Apply the conversion to both DataFrames
convert_helpfulness_to_percentage(X_train, 'review/helpfulness')
convert_helpfulness_to_percentage(X_test, 'review/helpfulness')

# Print the new Dataframes
#print(X_train.head())
#print(X_test.head())
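
The conversion can be tried on a few values in isolation. The following is a minimal sketch on a hypothetical sample, assuming that reviews without any votes are stored as `0/0`; it splits the column only once and shows that such rows become NaN before they are filled with 0.

```python
import pandas as pd

# Hypothetical sample in the same "helpful/total" format as review/helpfulness
sample = pd.DataFrame({"review/helpfulness": ["7/7", "10/11", "0/0"]})

# Split once and convert both parts to float
parts = sample["review/helpfulness"].str.split("/", expand=True).astype(float)

# Element-wise division; 0/0 produces NaN instead of raising an error
ratio = parts[0] / parts[1]

# Treat reviews without votes as 0, as the notebook does with fillna(0)
sample["review/helpfulness"] = ratio.fillna(0)
print(sample)
```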

In [None]:
# Replace NaN values in all columns of the DataFrame with 0
X_train.fillna(0, inplace=True)
X_test.fillna(0, inplace=True)

# Print the new Dataframes
#print(X_train.head())
#print(X_test.head())

In [None]:
#Convert the "publishedDate" column

#Function to transform the column
def process_published_date(df, column_name):
    # Convert the column into datetime objects, handling dates that are not fully specified
    df[column_name] = pd.to_datetime(df[column_name], errors='coerce', format='%Y-%m-%d')

    # For years without a month and day, fall back to the first day of the year.
    # Note: the column already holds the parsed datetimes here, so year-only entries remain NaT.
    df[column_name] = df[column_name].fillna(pd.to_datetime(df[column_name].astype(str) + '-01-01', format='%Y-%m-%d', errors='coerce'))

    # Extract the year and replace the column with the years
    df[column_name] = df[column_name].dt.year

# Apply the processing to both DataFrames
process_published_date(X_train, 'publishedDate')
process_published_date(X_test, 'publishedDate')

# Output of the DataFrames for checking
print("X_train after the clean-up:")
print(X_train.head())
print("X_test after the clean-up:")
print(X_test.head())



X_train after the clean-up:
   Title  Price  User_id  review/helpfulness  review/score   review/time  \
0   4459    0.0      898            0.000000           5.0  1.232928e+09   
1   4534    0.0      301            1.000000           5.0  9.667296e+08   
2   4561    0.0      910            0.947368           4.0  1.089331e+09   
3   6663    0.0     1286            0.916667           3.0  9.853056e+08   
4   7704    0.0      195            1.000000           3.0  9.451296e+08   

   sentiment     score  authors  publisher  publishedDate  categories  \
0        4.0  0.631499     4962       1335            NaN         662   
1        5.0  0.701677     2720        893         2016.0         376   
2        3.0  0.570874       21       1074         2014.0         145   
3        2.0  0.457438     4962       1335            NaN         662   
4        3.0  0.815212     4962       1335            NaN         662   

   ratingsCount  
0           0.0  
1           0.0  
2           2.0  
3   
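
The NaN values in `publishedDate` appear to come from entries that are not full `YYYY-MM-DD` dates: they are coerced to NaT in the first step, and the fallback then reads the column after it has been overwritten. A possible alternative, sketched here under the assumption that every usable entry starts with a four-digit year, extracts the year directly from the text. The helper name `extract_year` and the sample values are hypothetical.

```python
import pandas as pd

def extract_year(series):
    """Return the leading four-digit year of each entry, NaN if there is none."""
    years = series.astype(str).str.extract(r"^(\d{4})", expand=False)
    return pd.to_numeric(years, errors="coerce")

# Hypothetical values: full date, year and month, year only, missing, filled with 0
dates = pd.Series(["2016-05-01", "2000-01", "1995", None, 0])
result = extract_year(dates)
print(result.tolist())
```

With this variant, entries such as `1995` keep their year instead of becoming NaN, while missing values and the 0 written by `fillna(0)` stay NaN.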

In [None]:
#Checking if all columns have the right format:

# Checking the data types in X_test and X_train
print("Data types in X_test:")
print(X_test.dtypes)

print("Data types in X_train:")
print(X_train.dtypes)


Data types in X_test:
Title                   int64
Price                 float64
User_id                 int64
review/helpfulness    float64
review/score          float64
review/time           float64
sentiment             float64
score                 float64
authors                 int64
publisher               int64
publishedDate         float64
categories              int64
ratingsCount          float64
dtype: object
Data types in X_train:
Title                   int64
Price                 float64
User_id                 int64
review/helpfulness    float64
review/score          float64
review/time           float64
sentiment             float64
score                 float64
authors                 int64
publisher               int64
publishedDate         float64
categories              int64
ratingsCount          float64
dtype: object


In [None]:
#Save the optimized train and test set
#X_train.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_train_set_clean.csv', index=False)
#X_test.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_test_set_clean.csv', index=False)

###Determine the best hyperparameters for the model

In [None]:
# Name of the target column
target_column = 'sentiment'

# Extract target variable and features
y_train = X_train[target_column]  # Target variable for training
X_train = X_train.drop(columns=[target_column])  # Features for training

y_test = X_test[target_column]  # Target variable for testing
X_test = X_test.drop(columns=[target_column])  # Features for testing

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [5000, 7000, 100000],
    'learning_rate': [0.1, 0.09],
    'max_depth': [5, 8, 11],
    'subsample': [0.95],
    #'colsample_bytree': [0.8, 0.9, 1.0],
    #'colsample_bylevel': [0.8, 0.9, 1.0]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=XGBRegressor(random_state=42),
                           param_grid=param_grid,
                           scoring='neg_mean_squared_error',
                           cv=5,
                           n_jobs=-1)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Show the best parameters and the best MSE
print("Best hyperparameters:", grid_search.best_params_)
print("Best MSE:", -grid_search.best_score_)



#Small Data Set, Sentiment: MSE: 0.20955980139053237


KeyboardInterrupt: 
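
The `KeyboardInterrupt` suggests that the search was stopped by hand, presumably because of its runtime. A rough way to estimate the cost of a grid before starting it is to count the model fits it requires. The sketch below repeats the grid from the cell above and uses only the standard library.

```python
import math

# Same grid and number of folds as in the grid search cell
param_grid = {
    "n_estimators": [5000, 7000, 100000],
    "learning_rate": [0.1, 0.09],
    "max_depth": [5, 8, 11],
    "subsample": [0.95],
}
cv_folds = 5

# Every combination of values is one candidate; each candidate is fitted once per fold
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Each of these fits trains between 5,000 and 100,000 trees, so narrowing the grid or lowering `n_estimators` reduces the runtime considerably.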

###Training the XGBRegressor model with "sentiment" as target

In [None]:
#Load fresh test and train set
X_train_1 = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_train_set_clean.csv')
X_test_1 = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_test_set_clean.csv')

X_test_1.head(5)

Unnamed: 0,Title,Price,User_id,review/helpfulness,review/score,review/time,sentiment,score,authors,publisher,publishedDate,categories,ratingsCount
0,9338,0.0,412,0.833333,4.0,1161734000.0,3.0,0.418873,1367,1063,,105,1.0
1,8220,0.0,2150,1.0,5.0,1262304000.0,4.0,0.610745,4603,528,2021.0,376,0.0
2,4745,0.0,342,0.0,4.0,1352246000.0,4.0,0.601524,2135,700,,662,339.0
3,2906,0.0,1607,0.25,5.0,1063757000.0,5.0,0.664997,619,906,2000.0,376,1.0
4,4901,0.0,326,1.0,5.0,1190246000.0,4.0,0.517838,2135,285,1995.0,376,7.0


In [None]:
# Name of the target column
target_column = 'sentiment'  # Replace 'Target_Column_Name' with the actual column name

# Extract target variable and features
y_train_1 = X_train_1[target_column]  # Target variable for training
X_train_1 = X_train_1.drop(columns=[target_column])  # Features for training

y_test_1 = X_test_1[target_column]  # Target variable for testing
X_test_1 = X_test_1.drop(columns=[target_column])  # Features for testing

# Best hyperparameters
best_params = {
    'learning_rate': 0.09,
    'max_depth': 11,
    'n_estimators': 7000,
    'subsample': 0.95,
}

# Initialize the model with the best parameters
model_1 = XGBRegressor(**best_params, random_state=42)

# Train the model on the training data
model_1.fit(X_train_1, y_train_1)

# Optionally: Calculate MSE on the test data
from sklearn.metrics import mean_squared_error

predictions = model_1.predict(X_test_1)
mse_1 = mean_squared_error(y_test_1, predictions)
print(f"Test MSE: {mse_1:.4f}")

# Small Data Set, Sentiment: MSE: 0.1934


Test MSE: 0.1934


In [None]:
# Save the model
#model_filename = "/content/drive/MyDrive/AI_Projekt_24/Data/xgb_model_small_dataset_sentiment.joblib"
#joblib.dump(model_1, model_filename)
#print(f"Model was saved as {model_filename}")

In [None]:
#Load fresh test and train set
X_train_2 = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_train_set_clean.csv')
X_test_2 = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_test_set_clean.csv')

# Remove the columns from the sentiment analysis again
X_test_2.drop(['sentiment', 'score'], axis=1, inplace=True)
X_train_2.drop(['sentiment', 'score'], axis=1, inplace=True)

X_test_2.head(5)

Unnamed: 0,Title,Price,User_id,review/helpfulness,review/score,review/time,authors,publisher,publishedDate,categories,ratingsCount
0,9338,0.0,412,0.833333,4.0,1161734000.0,1367,1063,,105,1.0
1,8220,0.0,2150,1.0,5.0,1262304000.0,4603,528,2021.0,376,0.0
2,4745,0.0,342,0.0,4.0,1352246000.0,2135,700,,662,339.0
3,2906,0.0,1607,0.25,5.0,1063757000.0,619,906,2000.0,376,1.0
4,4901,0.0,326,1.0,5.0,1190246000.0,2135,285,1995.0,376,7.0


### Training the second model with "review/score" as target
This model uses the same hyperparameters as the first model, but takes "review/score" as the target and omits the two sentiment analysis columns ("sentiment" and "score"). Its MSE is then compared with the MSE of the first model.

In [None]:
# Name of the target column
target_column = 'review/score'  # Replace 'Target_Column_Name' with the actual column name

# Extract target variable and features
y_train_2 = X_train_2[target_column]  # Target variable for training
X_train_2 = X_train_2.drop(columns=[target_column])  # Features for training

y_test_2 = X_test_2[target_column]  # Target variable for testing
X_test_2 = X_test_2.drop(columns=[target_column])  # Features for testing

# Best hyperparameters
best_params = {
    'learning_rate': 0.09,
    'max_depth': 11,
    'n_estimators': 7000,
    'subsample': 0.95,
}

# Initialize the model with the best parameters
model_2 = XGBRegressor(**best_params, random_state=42)

# Train the model on the training data
model_2.fit(X_train_2, y_train_2)

# Optionally: Calculate MSE on the test data
from sklearn.metrics import mean_squared_error

predictions = model_2.predict(X_test_2)
mse_2 = mean_squared_error(y_test_2, predictions)
print(f"Test MSE: {mse_2:.4f}")

# Small Data Set, Review score: MSE: 0.3559


Test MSE: 0.3559


In [None]:
# Compare the MSE values and print the result
if mse_1 < mse_2:
    print(f"The MSE value of the model that predicts the sentiment is {mse_1:.4f} and is therefore better than the model that predicts the review score. Therefore, doing the sentimental analysis improved the model significantly.")
else:
    print(f"The MSE value of the model that predicts the sentiment is {mse_1:.4f} and is therefore worse than the model that predicts the review score. Therefore, doing the sentimental analysis did not improve the model significantly.")


The MSE value of the model that predicts the sentiment is 0.1934 and is therefore better than the model that predicts the review score. Therefore, doing the sentimental analysis improved the model significantly.
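
Because the two models predict different targets, their raw MSE values also depend on how widely each target is spread. One possible way to put them on a common scale, sketched here as an assumption and not as part of the original analysis, is to divide each MSE by the variance of its target, which equals 1 - R². The helper names and the toy values are hypothetical.

```python
from statistics import pvariance

def mse(y_true, y_pred):
    """Mean squared error of the predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def normalized_mse(y_true, y_pred):
    """MSE divided by the variance of the true values (equals 1 - R^2)."""
    return mse(y_true, y_pred) / pvariance(y_true)

# Hypothetical toy values, not taken from the project data
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.0, 2.5, 4.0, 5.5]

print(normalized_mse(y_true, y_pred))
```

A value close to 0 means the model explains most of the variation in its target, while a value close to 1 means it is barely better than always predicting the mean.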


#Building a book predictor
In the last step, the model with the lowest MSE is used to recommend books to users.

##Preparation of data for new user
All data is merged into one large data set, which serves as the final cleaned version.
One user (new user = A19O80VYV3XFJ8) who was previously omitted for testing purposes is now included.

In [None]:
import pandas as pd

In [None]:
# load data
small_test = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_test_set.csv')
small_train = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/small_train_set.csv')
# A user (#new user = A19O80VYV3XFJ8) who was previously omitted for testing purposes
new_user = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Train_and_testsets/new_user.csv')

In [None]:
# merge test and training set
merge_df = pd.concat([small_test, small_train])

In [None]:
# merge data set with new user
merged_df = pd.concat([merge_df, new_user])

In [None]:
# save data frame in CSV
# merged_df.to_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Cleaned/small_test_train_all_users.csv', index=False)

## Load XGBoost Model

In [None]:
# Load all necessary libraries and imports for this step
import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [None]:
# model path
model_path = '/content/drive/MyDrive/AI_Projekt_24/Data/xgb_model_small_dataset.joblib'

# load model
model = joblib.load(model_path)

In [None]:
# load data
df = pd.read_csv('/content/drive/MyDrive/AI_Projekt_24/Data/Cleaned/small_test_train_all_users.csv')

##Conversion of data for predictions

XGBoost can only work with data in float and int format. Therefore, categorical data, dates or fractions must first be converted. Everything that was done with the training data must also be done with the “new” data with the new user.

In [None]:
#Transforming categorical columns into numerical labels

# The list of columns to be converted
categorical_columns = ['Title', 'User_id', 'authors', 'publisher', 'categories']

# LabelEncoder initialising
label_encoders = {}

# Transform the whole data
for column in categorical_columns:
    le = LabelEncoder()

    # Adapt the LabelEncoder to the entire data and transform it
    df[column] = le.fit_transform(df[column])

    # Save the LabelEncoder in case further transformations are required later
    label_encoders[column] = le


In [None]:
# Define the directory path for saving LabelEncoders
save_dir = '/content/drive/MyDrive/AI_Projekt_24/Data/'

# Save all LabelEncoders to disk
for column, le in label_encoders.items():
    joblib.dump(le, f'{save_dir}label_encoder_{column}.joblib')
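
The saved encoders can be read back in a later session with `joblib.load`. The following minimal sketch uses a temporary directory and a toy encoder as hypothetical stand-ins for `save_dir` and the fitted `label_encoders`.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the fitted encoders and the save directory
label_encoders = {"Title": LabelEncoder().fit(["Book A", "Book B"])}
save_dir = tempfile.mkdtemp()

# Save each encoder under a file name derived from its column
for column, le in label_encoders.items():
    joblib.dump(le, os.path.join(save_dir, f"label_encoder_{column}.joblib"))

# Later session: reload the encoders by column name
loaded = {
    column: joblib.load(os.path.join(save_dir, f"label_encoder_{column}.joblib"))
    for column in ["Title"]
}

# The reloaded encoder maps codes back to the original values
print(loaded["Title"].inverse_transform([1]))
```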

In [None]:
#Transform the column "review/helpfulness"

#Function to transform the column
def convert_helpfulness_to_percentage(df, column_name):
    # Create two new columns for the numerator and the denominator
    df['helpfulness_numerator'] = df[column_name].str.split('/', expand=True)[0].astype(float)
    df['helpfulness_denominator'] = df[column_name].str.split('/', expand=True)[1].astype(float)

    # Calculate the share of helpful votes (a value between 0 and 1; 0/0 yields NaN)
    df[column_name] = df['helpfulness_numerator'] / df['helpfulness_denominator']

    # Remove the auxiliary columns when they are no longer needed
    df.drop(['helpfulness_numerator', 'helpfulness_denominator'], axis=1, inplace=True)

# Apply the conversion to the DataFrame
convert_helpfulness_to_percentage(df, 'review/helpfulness')


In [None]:
# Replace NaN values in all columns of the DataFrame with 0
df.fillna(0, inplace=True)

In [None]:
#Convert the "publishedDate" column

#Function to transform the column
def process_published_date(df, column_name):
    # Convert the column into datetime objects, handling dates that are not fully specified
    df[column_name] = pd.to_datetime(df[column_name], errors='coerce', format='%Y-%m-%d')

    # For years without a month and day, fall back to the first day of the year.
    # Note: the column already holds the parsed datetimes here, so year-only entries remain NaT.
    df[column_name] = df[column_name].fillna(pd.to_datetime(df[column_name].astype(str) + '-01-01', format='%Y-%m-%d', errors='coerce'))

    # Extract the year and replace the column with the years
    df[column_name] = df[column_name].dt.year

# Apply the processing to the DataFrame
process_published_date(df, 'publishedDate')

# Output of the DataFrame for checking
print("df after the clean-up:")
print(df.head())

df after the clean-up:
   Title  Price  User_id  review/helpfulness  review/score   review/time  \
0   9338    0.0      413            0.833333           4.0  1.161734e+09   
1   8220    0.0     2151            1.000000           5.0  1.262304e+09   
2   4745    0.0      343            0.000000           4.0  1.352246e+09   
3   2906    0.0     1608            0.250000           5.0  1.063757e+09   
4   4901    0.0      327            1.000000           5.0  1.190246e+09   

   sentiment     score  authors  publisher  publishedDate  categories  \
0        3.0  0.418873     1367       1063            NaN         105   
1        4.0  0.610745     4603        528         2021.0         376   
2        4.0  0.601524     2135        700            NaN         662   
3        5.0  0.664997      619        906         2000.0         376   
4        4.0  0.517838     2135        285         1995.0         376   

   ratingsCount  
0           1.0  
1           0.0  
2         339.0  
3        

## Book Prediction for new user

In [None]:
# Filter user data and prepare features
user_id = 'A19O80VYV3XFJ8'
encoded_user_id = label_encoders['User_id'].transform([user_id])[0]  # Get encoded value
#user_data = df[df['User_id'] == encoded_user_id]  # Use encoded ID for filtering
user_data = df[df['User_id'] == encoded_user_id].copy()  # Use encoded ID for filtering

# Extract the relevant features (these must be the same ones you used during training)
user_features = user_data[["Title", "Price", "User_id", "review/helpfulness", "review/score", "review/time", "score", "authors", "publisher", "publishedDate", "categories", "ratingsCount"]]

# Make predictions
# Here you make predictions with the loaded model
user_data['prediction'] = model.predict(user_features)

# Recommend books based on predictions
recommended_books = user_data.sort_values('prediction', ascending=False)

# Decode 'Title' and 'User_id' back to their original values
recommended_books['Title'] = label_encoders['Title'].inverse_transform(recommended_books['Title'])
recommended_books['User_id'] = label_encoders['User_id'].inverse_transform(recommended_books['User_id'])

# Show the top 3 recommended books
top_3_books = recommended_books.head(3)
print(f"Recommended books for {user_id}:")
print(top_3_books[['Title', 'prediction']])


Recommended books for A19O80VYV3XFJ8:
                                                    Title  prediction
198153  Seabiscuit: An American Legend (Trade Edition)...    4.798200
198120                                      Crucible, The    4.618678
198161                  Of Mice and Men Hb (New Windmill)    4.356989


In [None]:
# Test if these are unrated books by this user
# Choose User ID
user_id = 'A19O80VYV3XFJ8'

# titles to search
titles_to_search = [
    "Seabiscuit: An American Legend (Trade Edition)",
    "Crucible, The",
    "Of Mice and Men Hb (New Windmill)"
]

# filter for specific User_id
user_data = df[df['User_id'] == user_id]

# Check whether the titles appear in the filtered DataFrame
for title in titles_to_search:
    if title in user_data['Title'].values:
        print(f"'{title}' was rated by User '{user_id}' before.")
    else:
        print(f"'{title}' was not rated by User '{user_id}'.")

'Seabiscuit: An American Legend (Trade Edition)' was not rated by User 'A19O80VYV3XFJ8'.
'Crucible, The' was not rated by User 'A19O80VYV3XFJ8'.
'Of Mice and Men Hb (New Windmill)' was not rated by User 'A19O80VYV3XFJ8'.
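
At this point `df['User_id']` and `df['Title']` hold label-encoded integers, so filtering with the raw string ID selects no rows and every title is reported as not rated. A check that works on the encoded frame has to encode the ID first and decode the titles afterwards. The sketch below illustrates this on hypothetical toy data; `raw` and `df_encoded` are stand-ins for the project's DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the raw data before encoding
raw = pd.DataFrame({
    "User_id": ["U1", "U1", "U2"],
    "Title": ["Book A", "Book B", "Book A"],
})

# Encode the columns the same way as in the notebook
label_encoders = {}
df_encoded = raw.copy()
for column in ["User_id", "Title"]:
    le = LabelEncoder()
    df_encoded[column] = le.fit_transform(raw[column])
    label_encoders[column] = le

user_id = "U1"

# Comparing the encoded column with the raw string matches no rows
naive_rows = df_encoded[df_encoded["User_id"] == user_id]

# Encode the ID first, then decode the titles of the matching rows
encoded_id = label_encoders["User_id"].transform([user_id])[0]
user_rows = df_encoded[df_encoded["User_id"] == encoded_id]
rated_titles = set(label_encoders["Title"].inverse_transform(user_rows["Title"]))

print(len(naive_rows), rated_titles)
```

A title can then be tested with `title in rated_titles`.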


###Book Prediction for another user

In [None]:
## Test out another user

# Filter user data and prepare features
user_id = 'A27XUU2DXILHYZ'
encoded_user_id = label_encoders['User_id'].transform([user_id])[0]  # Get encoded value
#user_data = df[df['User_id'] == encoded_user_id]  # Use encoded ID for filtering
user_data = df[df['User_id'] == encoded_user_id].copy()  # Use encoded ID for filtering

# Extract the relevant features (these must be the same ones you used during training)
user_features = user_data[["Title", "Price", "User_id", "review/helpfulness", "review/score", "review/time", "score", "authors", "publisher", "publishedDate", "categories", "ratingsCount"]]

# Make predictions
# Here you make predictions with the loaded model
user_data['prediction'] = model.predict(user_features)

# Recommend books based on predictions
recommended_books = user_data.sort_values('prediction', ascending=False)

# Decode 'Title' and 'User_id' back to their original values
recommended_books['Title'] = label_encoders['Title'].inverse_transform(recommended_books['Title'])
recommended_books['User_id'] = label_encoders['User_id'].inverse_transform(recommended_books['User_id'])

# Show the top 3 recommended books
top_3_books = recommended_books.head(3)
print(f"Recommended books for {user_id}:")
print(top_3_books[['Title', 'prediction']])


Recommended books for A27XUU2DXILHYZ:
                        Title  prediction
97275        Millions of cats    4.998436
25543          Thimble summer    4.367814
1226   MIRACLES ON MAPLE HILL    4.076872


In [None]:
# Test if these are unrated books by this user
# Choose User ID
user_id = 'A27XUU2DXILHYZ'

# titles to search
titles_to_search = [
    "Millions of cats",
    "Thimble summer",
    "MIRACLES ON MAPLE HILL"
]

# filter for specific User_id
user_data = df[df['User_id'] == user_id]

# Check whether the titles appear in the filtered DataFrame
for title in titles_to_search:
    if title in user_data['Title'].values:
        print(f"'{title}' was rated by User '{user_id}' before.")
    else:
        print(f"'{title}' was not rated by User '{user_id}'.")

'Millions of cats' was not rated by User 'A27XUU2DXILHYZ'.
'Thimble summer' was not rated by User 'A27XUU2DXILHYZ'.
'MIRACLES ON MAPLE HILL' was not rated by User 'A27XUU2DXILHYZ'.
