# Machine Learning, MDS

## HSE, 2024-25

## Group project. Machine Learning

## General Information

__Date of issue:__ 05.06.2025

__Project defense:__ ~ 19.06.2025

## Execution Format

The group project is a creative, initiative-based assignment carried out by students in teams of 4 to 6 people. Keep in mind that the larger the team, the higher the requirements for both the team and its project. Teams may include students from different streams.

When evaluating a team's project, the committee must check and take into account the contribution of ***each participant*** to the team's overall work. It is therefore extremely important to choose teammates responsibly, distribute responsibilities within the team, and prevent "dead weight" in the team.

If necessary, the team may, by majority vote, exclude a member *before the project defense* by informing the instructor, provided that person did not contribute to the project and was, in effect, "dead weight." Keep this option in mind, as "dead weight" can lower the quality and grade of the entire project and, consequently, of the entire team!

Register your team composition in the table at the following link:

https://docs.google.com/spreadsheets/d/132inC1TqG8xfan65CpmJX_O3jJR5eiDugRWZ-hdMlbc/edit?usp=sharing


## Evaluation and Penalties



The coursework project is evaluated based on the team's oral defense (presentation) of their completed project before a committee. The committee is composed of course instructors, course assistants, and invited industry experts.

Within each defense, the committee assigns two types of grades.

The first grade is for the overall project. It is given by the committee based on the depth of the team's exploration of the subject area, the complexity of the tools used in the project, and other aspects related to the content of the project's terms of reference. When assigning this grade, the committee refers to the criteria detailed below in the "Project Evaluation Criteria" section.

The second grade is the individual grade for each team member. This grade is subsequently recorded in the grade book. The individual grade for a team member is determined by the committee based on how well the member understands the completed project, how they respond to questions addressed personally to them by the committee, how their role in the project is evident, and other aspects related to the person's participation in the team's activities. The individual grade cannot exceed the team's overall project grade (the first grade). However, if the participant confidently and correctly answers all questions addressed to them, is well-versed in the material and content of the completed project, and their role in the team's work is unquestionable, then the individual grade becomes equal to the overall project grade (i.e., in an ideal case, the second grade equals the first grade).

To avoid subjectivity and/or bias in the evaluation by certain committee members towards specific teams, as well as any other form of unequal conditions, teams are distributed among the committees completely at random using the NumPy library :). The random seed is the date on which the distribution is performed, for example, `np.random.seed(5062025)`.
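A minimal sketch of how such a seeded distribution could look. Only the seed call comes from the text above; the team names, the number of committees, and the use of `np.random.permutation` with `np.array_split` are assumptions made for illustration, not the instructors' actual procedure.

```python
import numpy as np

# Hypothetical inputs: team names and committee count are assumptions.
teams = [f"team_{i}" for i in range(1, 13)]
n_committees = 3


def distribute(teams, n_committees, seed):
    """Shuffle teams with a fixed seed and split them into near-equal groups."""
    np.random.seed(seed)
    shuffled = np.random.permutation(teams)
    parts = np.array_split(shuffled, n_committees)
    return {f"committee_{k + 1}": [str(t) for t in part] for k, part in enumerate(parts)}


# Seed taken from the example in the text (date of distribution).
committees = distribute(teams, n_committees, seed=5062025)
# Re-running with the same seed reproduces the same assignment.
committees_again = distribute(teams, n_committees, seed=5062025)

for name, members in committees.items():
    print(name, members)
```

Because the seed is fixed and public, anyone can re-run the shuffle and confirm that the assignment was not adjusted by hand.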

To maintain the fairness and impartiality of the random distribution, any transfers of teams between committees (committee changes) are not allowed. Student attendance at the project defense is mandatory; a student's contribution without their presence is not evaluated and automatically equals zero. Additionally, having the camera turned on throughout the entire time in the conference during the project defense is mandatory.

## Task description

In the modern business industry, the use of machine learning and artificial intelligence has gained critical importance. Companies striving to remain competitive are actively implementing these technologies for data analysis, process automation, and trend forecasting. This enables businesses to make more informed decisions and respond swiftly to market changes.

In this project, you will practice building predictive machine learning models using real data and learn how these models are subsequently applied. You will also explore how to discover hidden patterns and valuable insights that can help companies optimize their business processes in the future.


### Project Stages

Below are the suggested stages for project implementation:

- Selection of a company and/or platform;
- Formulation of a business problem that the company/platform might want to solve based on data using modern machine learning methods;
- Justification of the formulated business problem;
- Acquisition of necessary data through searching open sources/datasets, parsing/working with APIs;
- Formulation of the problem in terms of machine learning to address the business challenge;
- Selection and justification of a potential pool of methods and algorithms for solving the machine learning problem;
- Preliminary data processing;
- Application of machine learning methods to solve the given problem;
- Obtaining and aggregating results, final conclusions.

### Project requirements

As a result of completing the group project, your team should prepare a presentation and defend it before the committee.

The design of the presentation is entirely up to you, but remember that the result should be relevant for demonstration to a business client — it is most appropriate to perceive the committee evaluating your work in this way. For example, it is not recommended to include lines of code or overuse notebook screenshots in the presentation.

From a project execution standpoint, you need to assume three roles: data engineers, who devise the approach for obtaining and extracting data from the platform; data analysts, who translate the business problem into data-driven formulations; and, most importantly, data scientists, who solve the tasks using machine learning methods and extract valuable insights from the work done. Finally, of course, you need to present all this work in an understandable format to a potential client.

Naturally, the more complex and non-trivial your task is, and the deeper and more meaningful the insights derived from it, the higher the quality of your project will be, and the higher it will be rated!

### Useful Tips and Advice

- Charts and visualizations are great — everyone loves charts;
- Ensure that all charts on slides can be read even without your commentary;
- Avoid creating slides that are just text;
- Build your project with the goal of presenting valuable information for the business;
- Every good Data Scientist should be able to work in a team, so the distribution of your efforts in the project is up to you;
- You don't have to limit yourself to one company. You have the entire internet to find additional information that can enrich your data and test your hypotheses. Go for it!


## Project Evaluation Criteria

The grade for the group project as a whole (the first of the two grades) is given on a 10-point scale based on 3 criteria. Additionally, the committee may, at its discretion, award bonus points beyond these criteria; bonus points can only raise the grade, never lower it.

Each of the criteria will be listed and reviewed below.

### Criterion 1: Task Selection (up to 3 points)

This criterion evaluates how justified and meaningful the chosen business problem is; how thoroughly the transition from the business problem to a machine learning problem was made (how well the machine learning task setup aligns with the original problem from the company's perspective); and how accurate, comprehensive, and business-applicable the insights derived from the research are.

### Criterion 2: Data Extraction and Processing Tools (up to 2 points)

This criterion assesses the technical implementation of the data collection process, both manual and automated, the tools used for preprocessing, data preparation, and EDA. It will evaluate the range of different technical tools implemented in the project; the correctness, efficiency, and appropriateness of their application; as well as the complexity and novelty of the algorithms, approaches, and libraries applied to the data.

### Criterion 3: Machine Learning Tools (up to 5 points)

This criterion evaluates the application of machine learning methods in the project. It will consider the range of various technical tools implemented in the project; the correctness, efficiency, and appropriateness of their application; as well as the complexity and novelty of the algorithms, approaches, and libraries applied to the data. Additionally, a comparative analysis of the models used in the solution will be assessed.

*To receive **4 points** for this criterion, the project must correctly and justifiably apply tools/approaches/libraries not covered in the relevant seminars.*

*To receive **3 points**, you must perform a comparative analysis of the effectiveness of various machine learning models within your project on the same dataset. There must be at least 4 different models, and their nature must differ (for example, decision trees and random forests are algorithms of different natures, whereas CatBoost and LightGBM are not). Their use must be thoroughly described and justified. The comparative analysis should yield clear and significant conclusions about the effectiveness of each algorithm.*

*To receive **2 points**, the project must include the use of Transformer objects (such as scalers, dimensionality reduction algorithms, etc.) in addition to machine learning models.*
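The requirements above (a Transformer object combined with models, and a comparison of at least 4 models of different nature on the same data) could be sketched as follows. The synthetic dataset, the choice of these four classifiers, `StandardScaler`, and 5-fold accuracy are all assumptions for illustration; a real project would use its own data, metric, and justification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a project dataset (assumption, not real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Four models of different nature: linear, distance-based, kernel, tree ensemble.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "support_vector_machine": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # The scaler is a Transformer fitted inside each CV fold, which avoids leakage.
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Evaluating every model through the same pipeline and the same folds keeps the comparison fair, which is what makes conclusions about relative effectiveness defensible.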

#### Imports

In [17]:
import pandas as pd
import numpy as np
import re

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

#### Load dataset

In [39]:
df = pd.read_csv('books_1.Best_Books_Ever.csv', sep=',')

In [19]:
df

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,bookFormat,edition,pages,publisher,publishDate,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",Hardcover,First Edition,374,Scholastic Press,09/14/08,,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.50,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",Paperback,US Edition,870,Scholastic Inc.,09/28/04,06/21/03,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",Paperback,,324,Harper Perennial Modern Classics,05/23/06,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,9999999999999,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",Paperback,"Modern Library Classics, USA / CAN",279,Modern Library,10/10/00,01/28/13,[],2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,"['United Kingdom', 'Derbyshire, England (Unite...",https://i.gr-assets.com/images/S/compressed.ph...,1983116,20452,
4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.60,About three things I was absolutely positive.\...,English,9780316015844,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",Paperback,,501,"Little, Brown and Company",09/06/06,10/05/05,"['Georgia Peach Book Award (2007)', 'Buxtehude...",4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,"['Forks, Washington (United States)', 'Phoenix...",https://i.gr-assets.com/images/S/compressed.ph...,1459448,14874,2.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52473,11492014-fractured,Fractured,Fateful #2,Cheri Schmidt (Goodreads Author),4.00,The Fateful Trilogy continues with Fractured. ...,English,2940012616562,"['Vampires', 'Paranormal', 'Young Adult', 'Rom...",[],Nook,,0,Cheri Schmidt,May 28th 2011,,[],871,"['311', '310', '197', '42', '11']",94.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,
52474,11836711-anasazi,Anasazi,Sense of Truth #2,Emma Michaels,4.19,"'Anasazi', sequel to 'The Thirteenth Chime' by...",English,9999999999999,"['Mystery', 'Young Adult']",[],Paperback,First Edition,190,Bokheim Publishing,August 5th 2011,August 3rd 2011,[],37,"['16', '14', '5', '2', '0']",95.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,
52475,10815662-marked,Marked,Soul Guardians #1,Kim Richardson (Goodreads Author),3.70,--READERS FAVORITE AWARDS WINNER 2011--Sixteen...,English,9781461017097,"['Fantasy', 'Young Adult', 'Paranormal', 'Ange...",[],Paperback,,280,CreateSpace,March 18th 2011,March 15th 2011,"[""Readers' Favorite Book Award (2011)""]",6674,"['2109', '1868', '1660', '647', '390']",84.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,7.37
52476,11330278-wayward-son,Wayward Son,,"Tom Pollack (Goodreads Author), John Loftus (G...",3.85,A POWERFUL TREMOR UNEARTHS AN ANCIENT SECRETBu...,English,9781450755634,"['Fiction', 'Mystery', 'Historical Fiction', '...",[],Paperback,1st edition,507,Cascada Productions,September 1st 2011,April 5th 2011,[],238,"['77', '78', '59', '19', '5']",90.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,2.86


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  object 
 14  publishDate       51598 non-null  object 
 15  firstPublishDate  31152 non-null  object 
 16  awards            52478 non-null  object

In [21]:
print("Dataset size: ", df.shape)
print("Dataset columns: ", df.columns)

Dataset size:  (52478, 25)
Dataset columns:  Index(['bookId', 'title', 'series', 'author', 'rating', 'description',
       'language', 'isbn', 'genres', 'characters', 'bookFormat', 'edition',
       'pages', 'publisher', 'publishDate', 'firstPublishDate', 'awards',
       'numRatings', 'ratingsByStars', 'likedPercent', 'setting', 'coverImg',
       'bbeScore', 'bbeVotes', 'price'],
      dtype='object')


#### Columns
bookId : Book Identifier as in goodreads.com

title : Book title

series: Series Name

author: Book's Author

rating: Global Goodreads rating

description: Book's description

language: Book's language

isbn: Book's ISBN

genres: Book's genres

characters: Main characters

bookFormat: Type of binding

edition: Type of edition (e.g., Anniversary Edition)

pages: Number of pages

publisher: Publisher name

publishDate: publication date

firstPublishDate: Publication date of first edition

awards: List of awards

numRatings: Number of total ratings

ratingsByStars: Number of ratings by stars

likedPercent: Derived field, percent of ratings over 2 stars (as in Goodreads)

setting: Story setting

coverImg: URL to cover image

bbeScore: Score in Best Books Ever list

bbeVotes: Number of votes in Best Books Ever list

price: Book's price (extracted from Iberlibro)
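To make the `likedPercent` definition concrete, here is a small sketch that recomputes it from a `ratingsByStars` string. It assumes the list is ordered from 5 stars down to 1 star and that the result is rounded to a whole percent; both assumptions are consistent with the sample rows shown above but are not stated by the dataset description. The function name is hypothetical.

```python
import ast


def liked_percent(ratings_by_stars: str) -> float:
    """Share of ratings with more than 2 stars, as a whole percent.

    Assumes the list is ordered [5-star, 4-star, 3-star, 2-star, 1-star].
    """
    counts = [int(c) for c in ast.literal_eval(ratings_by_stars)]
    total = sum(counts)
    if total == 0:
        return float("nan")  # no ratings, so the share is undefined
    liked = sum(counts[:3])  # 5-, 4- and 3-star ratings
    return float(round(100 * liked / total))


# ratingsByStars of row 1 (Harry Potter and the Order of the Phoenix) from the output above
print(liked_percent("['1593642', '637516', '222366', '39573', '14526']"))  # 98.0
```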


Let's look at this dataframe carefully.

In [22]:
pd.set_option('display.max_columns', None)
df.head(20)

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,bookFormat,edition,pages,publisher,publishDate,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",Hardcover,First Edition,374,Scholastic Press,09/14/08,,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",Paperback,US Edition,870,Scholastic Inc.,09/28/04,06/21/03,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",Paperback,,324,Harper Perennial Modern Classics,05/23/06,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,9999999999999,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",Paperback,"Modern Library Classics, USA / CAN",279,Modern Library,10/10/00,01/28/13,[],2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,"['United Kingdom', 'Derbyshire, England (Unite...",https://i.gr-assets.com/images/S/compressed.ph...,1983116,20452,
4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.6,About three things I was absolutely positive.\...,English,9780316015844,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",Paperback,,501,"Little, Brown and Company",09/06/06,10/05/05,"['Georgia Peach Book Award (2007)', 'Buxtehude...",4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,"['Forks, Washington (United States)', 'Phoenix...",https://i.gr-assets.com/images/S/compressed.ph...,1459448,14874,2.1
5,19063.The_Book_Thief,The Book Thief,,Markus Zusak (Goodreads Author),4.37,Librarian's note: An alternate cover edition c...,English,9780375831003,"['Historical Fiction', 'Fiction', 'Young Adult...","['Liesel Meminger', 'Hans Hubermann', 'Rudy St...",Hardcover,First American Edition,552,Alfred A. Knopf,03/14/06,09/01/05,['National Jewish Book Award for Children’s an...,1834276,"['1048230', '524674', '186297', '48864', '26211']",96.0,"['Molching (Germany)', 'Germany']",https://i.gr-assets.com/images/S/compressed.ph...,1372809,14168,3.8
6,170448.Animal_Farm,Animal Farm,,"George Orwell, Russell Baker (Preface), C.M. W...",3.95,Librarian's note: There is an Alternate Cover ...,English,9780451526342,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...","['Snowball', 'Napoleon', 'Clover', 'Boxer', 'O...",Mass Market Paperback,,141,Signet Classics,04/28/96,08/17/45,"['Prometheus Hall of Fame Award (2011)', 'Retr...",2740713,"['986764', '958699', '545475', '165093', '84682']",91.0,"['England', 'United Kingdom']",https://i.gr-assets.com/images/S/compressed.ph...,1276599,13264,4.42
7,11127.The_Chronicles_of_Narnia,The Chronicles of Narnia,The Chronicles of Narnia (Publication Order) #1–7,"C.S. Lewis, Pauline Baynes (Illustrator)",4.26,"Journeys to the end of the world, fantastic cr...",English,9999999999999,"['Fantasy', 'Classics', 'Fiction', 'Young Adul...","['Polly', 'Aslan', 'Lucy Pevensie', 'Edmund Pe...",Paperback,Reissue Edition,767,HarperCollins,09/16/02,10/28/56,[],517740,"['254964', '167572', '74362', '15423', '5419']",96.0,"['London, England']",https://i.gr-assets.com/images/S/compressed.ph...,1238556,12949,
8,30.J_R_R_Tolkien_4_Book_Boxed_Set,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,The Lord of the Rings #0-3,J.R.R. Tolkien,4.6,"This four-volume, boxed set contains J.R.R. To...",English,9780345538376,"['Fantasy', 'Fiction', 'Classics', 'Adventure'...","['Frodo Baggins', 'Gandalf', 'Bilbo Baggins', ...",Mass Market Paperback,Hobbit Movie Tie-in Boxed set,1728,Ballantine Books,09/25/12,10/20/55,[],110146,"['78217', '22857', '6628', '1477', '967']",98.0,['Middle-earth'],https://i.gr-assets.com/images/S/compressed.ph...,1159802,12111,21.15
9,18405.Gone_with_the_Wind,Gone with the Wind,,Margaret Mitchell,4.3,"Scarlett O'Hara, the beautiful, spoiled daught...",English,9780446675536,"['Classics', 'Historical Fiction', 'Fiction', ...","[""Scarlett O'Hara"", 'Rhett Butler', 'Ashley Wi...",Mass Market Paperback,,1037,Warner Books,04/01/99,06/30/36,"['Pulitzer Prize for Novel (1937)', 'National ...",1074620,"['602138', '275517', '133535', '39008', '24422']",94.0,"['Atlanta, Georgia (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,1087732,11211,5.58


#### Fill NaN

In [23]:
df.isna().sum()

bookId                  0
title                   0
series              29008
author                  0
rating                  0
description          1338
language             3806
isbn                    0
genres                  0
characters              0
bookFormat           1473
edition             47523
pages                2347
publisher            3696
publishDate           880
firstPublishDate    21326
awards                  0
numRatings              0
ratingsByStars          0
likedPercent          622
setting                 0
coverImg              605
bbeScore                0
bbeVotes                0
price               14365
dtype: int64
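The counts above are easier to judge as shares of the 52478 rows. A short sketch that converts a few of the counts from the output into percentages (the counts and the row total are copied from the outputs above; the variable names are hypothetical):

```python
import pandas as pd

n_rows = 52478  # total rows, from df.shape above

# Missing-value counts copied from the df.isna().sum() output
missing_counts = pd.Series(
    {
        "series": 29008,
        "edition": 47523,
        "firstPublishDate": 21326,
        "price": 14365,
        "language": 3806,
    }
)

# Share of missing values per column, in percent
missing_share = (missing_counts / n_rows * 100).round(1).sort_values(ascending=False)
print(missing_share)
```

On the real dataframe the same figures can be obtained directly with `df.isna().mean() * 100`.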

The series column has quite a lot of missing values (55%), but we will not drop it: NaN here simply means that the book is a standalone volume with no sequels. We therefore fill the missing values with 0.

In [40]:
df['series'] = df['series'].fillna(0)
df['firstPublishDate'] = df['firstPublishDate'].fillna(0)


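def safe_float(price_str):
    """Convert a price string such as '5.09' to float; return NaN otherwise."""
    # Missing values arrive as float NaN, so only strings are passed to the regex
    if isinstance(price_str, str) and re.fullmatch(r"\d*\.\d+", price_str.strip()):
        return float(price_str)
    # Return NaN if the value is missing or does not match the pattern
    return np.nan

df['price'] = df['price'].apply(safe_float)

# Fill missing prices with the median of the non-zero prices
median_price = df.loc[df['price'] != 0, 'price'].median()
df['price'] = df['price'].fillna(median_price).round(2)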


The edition column has a very large share of missing values (91%), so there is no point in keeping it. The description, language, bookFormat, pages, publisher, publishDate, likedPercent, and coverImg columns also contain a small number of missing values, so rows with NaN in them can simply be dropped. We drop the coverImg column separately because it appears to be useless.

In [None]:

df = df.drop(columns=['edition', 'coverImg'])
df

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,bookFormat,pages,publisher,publishDate,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,bbeScore,bbeVotes,price
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",Hardcover,374,Scholastic Press,09/14/08,0,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",2993816,30516,5.09
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.50,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",Paperback,870,Scholastic Inc.,09/28/04,06/21/03,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,2632233,26923,7.38
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",Paperback,324,Harper Perennial Modern Classics,05/23/06,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",2269402,23328,0
3,1885.Pride_and_Prejudice,Pride and Prejudice,0,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,9999999999999,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",Paperback,279,Modern Library,10/10/00,01/28/13,[],2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,"['United Kingdom', 'Derbyshire, England (Unite...",1983116,20452,0
4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.60,About three things I was absolutely positive.\...,English,9780316015844,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",Paperback,501,"Little, Brown and Company",09/06/06,10/05/05,"['Georgia Peach Book Award (2007)', 'Buxtehude...",4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,"['Forks, Washington (United States)', 'Phoenix...",1459448,14874,2.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52473,11492014-fractured,Fractured,Fateful #2,Cheri Schmidt (Goodreads Author),4.00,The Fateful Trilogy continues with Fractured. ...,English,2940012616562,"['Vampires', 'Paranormal', 'Young Adult', 'Rom...",[],Nook,0,Cheri Schmidt,May 28th 2011,0,[],871,"['311', '310', '197', '42', '11']",94.0,[],0,1,0
52474,11836711-anasazi,Anasazi,Sense of Truth #2,Emma Michaels,4.19,"'Anasazi', sequel to 'The Thirteenth Chime' by...",English,9999999999999,"['Mystery', 'Young Adult']",[],Paperback,190,Bokheim Publishing,August 5th 2011,August 3rd 2011,[],37,"['16', '14', '5', '2', '0']",95.0,[],0,1,0
52475,10815662-marked,Marked,Soul Guardians #1,Kim Richardson (Goodreads Author),3.70,--READERS FAVORITE AWARDS WINNER 2011--Sixteen...,English,9781461017097,"['Fantasy', 'Young Adult', 'Paranormal', 'Ange...",[],Paperback,280,CreateSpace,March 18th 2011,March 15th 2011,"[""Readers' Favorite Book Award (2011)""]",6674,"['2109', '1868', '1660', '647', '390']",84.0,[],0,1,7.37
52476,11330278-wayward-son,Wayward Son,0,"Tom Pollack (Goodreads Author), John Loftus (G...",3.85,A POWERFUL TREMOR UNEARTHS AN ANCIENT SECRETBu...,English,9781450755634,"['Fiction', 'Mystery', 'Historical Fiction', '...",[],Paperback,507,Cascada Productions,September 1st 2011,April 5th 2011,[],238,"['77', '78', '59', '19', '5']",90.0,[],0,1,2.86


In [None]:
df_cleaned = df.dropna()
df_cleaned.isna().sum()

bookId              0
title               0
series              0
author              0
rating              0
description         0
language            0
isbn                0
genres              0
characters          0
bookFormat          0
pages               0
publisher           0
publishDate         0
firstPublishDate    0
awards              0
numRatings          0
ratingsByStars      0
likedPercent        0
setting             0
bbeScore            0
bbeVotes            0
price               0
dtype: int64
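Note that `dropna()` returns a new DataFrame and leaves the original untouched, so later cells only see the cleaned data if they work with `df_cleaned` rather than `df`. A toy illustration (the column values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical data)
toy = pd.DataFrame({"title": ["A", "B", "C"], "pages": [100, np.nan, 250]})

toy_cleaned = toy.dropna()  # new object; `toy` itself is not modified

rows_before = len(toy)
rows_after = len(toy_cleaned)
still_missing_in_original = int(toy["pages"].isna().sum())

print(rows_before, rows_after, still_missing_in_original)  # 3 2 1
```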


In [None]:
df[df['language'] != 'English'].shape

(9817, 23)

Most books are in English: only 9,817 rows are not marked as English. We can therefore remove the non-English books to keep the analysis cleaner.

In [None]:
df = df.drop(df[df['language'] != 'English'].index)
df[df['language'] != 'English'].shape

(0, 23)
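A minimal sketch of the same filtering step on toy data, assuming it is useful to check the language distribution first and to reset the index afterwards (the original cell keeps the old index). The `toy` frame and the variable names are illustrative only:

```python
import pandas as pd

# Toy frame shaped like the 'language' column (not the real dataset).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language": ["English", "French", "English", None],
})

# Share of each language, including missing values, to check that English dominates.
language_share = toy["language"].value_counts(normalize=True, dropna=False)

# Boolean-mask filter; rows with a missing language are removed as well,
# which matches the behaviour of the drop() call in the cell above.
english_only = toy[toy["language"] == "English"].reset_index(drop=True)
```

Resetting the index is optional; it only makes positional access simpler after rows have been removed.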

Some values in the 'title' column look odd:

In [None]:
df[df['bookId'] == '30.J_R_R_Tolkien_4_Book_Boxed_Set']['title']

8    J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
Name: title, dtype: object

The title includes the author's name and appears cut off (the trailing "..." may simply be pandas truncating the displayed value).

Also, many cells in the 'series' column are NaN. We assume such books are standalone titles rather than part of a series.

The 'awards' column contains many unique values; it can be replaced with the number of awards for each book.
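A minimal sketch of that conversion, assuming the 'awards' cells are stored as string representations of Python lists (as the sample rows suggest). `count_awards` is a hypothetical helper, not part of the original notebook:

```python
import ast

import pandas as pd


def count_awards(cell):
    """Return the number of awards encoded in one 'awards' cell."""
    # Already a real list (e.g. parsed earlier): count its items directly.
    if isinstance(cell, (list, tuple)):
        return len(cell)
    # Missing or other non-string values are treated as "no awards".
    if not isinstance(cell, str):
        return 0
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Unparseable text is also treated as "no awards".
        return 0
    return len(parsed) if isinstance(parsed, (list, tuple)) else 0


# Toy data shaped like the 'awards' column (not the real dataset).
toy = pd.DataFrame({
    "awards": ["[]", "['Pulitzer Prize for Fiction (1961)', 'Audie Award']", None]
})
toy["awards_count"] = toy["awards"].apply(count_awards)
```

Note that calling `len` directly on such a string would count characters rather than awards, which is why the cell is parsed first.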

Suspicious ISBN value: 9999999999999. It repeats across different books, so it is probably a placeholder for a missing ISBN.
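One possible way to handle this value, sketched on toy data under the assumption that 9999999999999 is indeed a placeholder. The names `PLACEHOLDER_ISBN` and `isbn_clean` are illustrative:

```python
import numpy as np
import pandas as pd

PLACEHOLDER_ISBN = "9999999999999"  # value observed in the sample rows

# Toy data shaped like the 'isbn' column (not the real dataset).
toy = pd.DataFrame({"isbn": ["9780439358071", "9999999999999", "9780316015844"]})

# Compare as strings so the check works whether the column was read as str or int.
is_placeholder = toy["isbn"].astype(str) == PLACEHOLDER_ISBN

# Fraction of rows that carry the placeholder.
placeholder_share = is_placeholder.mean()

# Mark placeholders as missing instead of keeping a fake identifier.
toy["isbn_clean"] = toy["isbn"].where(~is_placeholder, np.nan)
```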

Column 'coverImg' seems useless

#### Deduplication

In [None]:
# Use bookId as default id for books from original dataset
doubles = df[df.duplicated(subset=['bookId'], keep=False)].count()
doubles

bookId              58
title               58
series              58
author              58
rating              58
description         56
language            58
isbn                58
genres              58
characters          58
bookFormat          58
pages               54
publisher           54
publishDate         56
firstPublishDate    58
awards              58
numRatings          58
ratingsByStars      58
likedPercent        58
setting             58
bbeScore            58
bbeVotes            58
price               58
dtype: int64

There are few duplicates: only 58 rows share a bookId with another row. They appear to stem from different editions, firstPublishDate values, languages, and series. This is normal, so we can ignore duplicates in this dataset.
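To check which columns actually differ between rows with the same bookId, a sketch along these lines could be used. It runs on toy data, and the variable names are assumptions rather than part of the original notebook:

```python
import pandas as pd

# Toy frame with one duplicated bookId (not the real dataset).
toy = pd.DataFrame({
    "bookId": ["a", "a", "b", "c"],
    "title": ["T1", "T1", "T2", "T3"],
    "bookFormat": ["Paperback", "Hardcover", "Paperback", "Nook"],
})

# Keep every row whose bookId occurs more than once.
dupes = toy[toy.duplicated(subset=["bookId"], keep=False)]

# For each duplicated bookId, count distinct values per column;
# a count above 1 means the copies differ in that column.
distinct_per_column = dupes.groupby("bookId").nunique()
differs = (distinct_per_column > 1).any()
differing_columns = distinct_per_column.columns[differs].tolist()

# Number of duplicated bookId groups and of fully identical repeated rows.
n_groups = dupes["bookId"].nunique()
n_exact_copies = int(toy.duplicated(keep="first").sum())
```

If `n_exact_copies` were non-zero on the real data, those rows could be dropped safely, since they carry no extra information.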

#### Normalization and preparation before creating a model

In [None]:
df['pages'] = pd.to_numeric(df['pages'], errors='coerce')
df['price'] = pd.to_numeric(df['price'], errors='coerce')

In [None]:
numeric_df = df[['pages', 'rating', 'numRatings', 'likedPercent', 'bbeScore', 'bbeVotes', 'price']].copy()

numeric_df['awards_count'] = df['awards'].apply(len)

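# Normalize all numeric columns to the 0-1 range.
# Note: RobustScaler() was meant to make the scaling less sensitive to outliers,
# but robust_data is not used below. Min-max scaling it would give the same 0-1
# values anyway, because both scalers are linear per column.
robust_scaler = RobustScaler()
robust_data = robust_scaler.fit_transform(numeric_df)

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(numeric_df)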
normalized_df = pd.DataFrame(normalized_data, columns=numeric_df.columns, index=numeric_df.index)

# Add the categorical columns for context
context_cols = ['title', 'author', 'description', 'language', 'genres', 'bookFormat', 'publishDate']
final_df = pd.concat([df[context_cols], normalized_df], axis=1)

final_df.head(10)

Unnamed: 0,title,author,description,language,genres,bookFormat,publishDate,pages,rating,numRatings,likedPercent,bbeScore,bbeVotes,price,awards_count
0,The Hunger Games,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",Hardcover,09/14/08,0.02531,0.866,0.904704,0.96,1.0,1.0,0.005664,1.0
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling, Mary GrandPré (Illustrator)",There is a door at the end of a silent corrido...,English,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",Paperback,09/28/04,0.058875,0.9,0.355768,0.98,0.879223,0.882274,0.008212,0.22957
2,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,English,"['Classics', 'Fiction', 'Historical Fiction', ...",Paperback,05/23/06,0.021926,0.856,0.638589,0.95,0.75803,0.764482,0.0,0.076996
3,Pride and Prejudice,"Jane Austen, Anna Quindlen (Introduction)",Alternate cover edition of ISBN 9780679783268S...,English,"['Classics', 'Fiction', 'Romance', 'Historical...",Paperback,10/10/00,0.018881,0.852,0.425375,0.94,0.662404,0.670249,0.0,0.0
4,Twilight,Stephenie Meyer,About three things I was absolutely positive.\...,English,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",Paperback,09/06/06,0.033904,0.72,0.70434,0.78,0.487488,0.487484,0.002337,0.614077
5,The Book Thief,Markus Zusak (Goodreads Author),Librarian's note: An alternate cover edition c...,English,"['Historical Fiction', 'Fiction', 'Young Adult...",Hardcover,03/14/06,0.037355,0.874,0.260237,0.96,0.458548,0.464351,0.004229,0.46906
6,Animal Farm,"George Orwell, Russell Baker (Preface), C.M. W...",Librarian's note: There is an Alternate Cover ...,English,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...",Mass Market Paperback,04/28/96,0.009542,0.79,0.388838,0.91,0.426412,0.434731,0.004919,0.038734
7,The Chronicles of Narnia,"C.S. Lewis, Pauline Baynes (Illustrator)","Journeys to the end of the world, fantastic cr...",English,"['Fantasy', 'Classics', 'Fiction', 'Young Adul...",Paperback,09/16/02,0.051905,0.852,0.073454,0.96,0.413705,0.42441,0.0,0.0
8,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,J.R.R. Tolkien,"This four-volume, boxed set contains J.R.R. To...",English,"['Fantasy', 'Fiction', 'Classics', 'Adventure'...",Mass Market Paperback,09/25/12,0.116938,0.92,0.015627,0.98,0.387399,0.396953,0.023536,0.0
9,Gone with the Wind,Margaret Mitchell,"Scarlett O'Hara, the beautiful, spoiled daught...",English,"['Classics', 'Historical Fiction', 'Fiction', ...",Mass Market Paperback,04/01/99,0.070177,0.86,0.152461,0.94,0.363326,0.367464,0.006209,0.034483
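The interaction between the two scalers can be illustrated on toy data. The sketch below assumes scikit-learn is available and uses a made-up `numRatings` column with one extreme value. Percentile clipping is shown only as one possible alternative for limiting the effect of outliers, not as the method used in this project:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Toy numeric column: 1..99 plus one extreme outlier (not the real dataset).
toy = pd.DataFrame({"numRatings": [float(v) for v in range(1, 100)] + [1_000_000.0]})

# Min-max scaling alone.
minmax_only = MinMaxScaler().fit_transform(toy)

# Robust scaling followed by min-max scaling.
robust_then_minmax = MinMaxScaler().fit_transform(RobustScaler().fit_transform(toy))

# Both scalers are linear per column, so chaining them yields the same 0-1 values.
same_result = np.allclose(minmax_only, robust_then_minmax)

# Possible alternative: clip to the 5th-95th percentile range, then min-max scale.
lower, upper = toy.quantile(0.05), toy.quantile(0.95)
clipped = toy.clip(lower=lower, upper=upper, axis=1)
clipped_scaled = MinMaxScaler().fit_transform(clipped)

# Compare where the median lands on the 0-1 scale in both variants.
median_plain = float(np.median(minmax_only))
median_clipped = float(np.median(clipped_scaled))
```

With plain min-max scaling the single outlier squeezes almost all values close to 0, while clipping first keeps the bulk of the data spread over the 0-1 range.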


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42661 entries, 0 to 52477
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         42661 non-null  object 
 1   author        42661 non-null  object 
 2   description   42151 non-null  object 
 3   language      42661 non-null  object 
 4   genres        42661 non-null  object 
 5   bookFormat    42215 non-null  object 
 6   publishDate   42289 non-null  object 
 7   pages         41310 non-null  float64
 8   rating        42661 non-null  float64
 9   numRatings    42661 non-null  float64
 10  likedPercent  42296 non-null  float64
 11  bbeScore      42661 non-null  float64
 12  bbeVotes      42661 non-null  float64
 13  price         42653 non-null  float64
 14  awards_count  42661 non-null  float64
dtypes: float64(8), object(7)
memory usage: 5.2+ MB
