### 1. Import libraries ###

# NYT Bestsellers Analysis #

### Overview ###

This project is an end-to-end analysis and machine learning pipeline for the New York Times Fiction Bestsellers dataset. It includes:
- **Data Cleaning & Feature Engineering:**
Dates are converted to datetime and split into year, quarter, and month, and text-based features are added (title length, description word count). The `reviews` column, a string representation of a list containing one dictionary, is parsed to extract review links, and binary flags record which review types are present.
- **Interactive Dashboard:**
A multi-tab dashboard built with Dash and Plotly (using a dark theme) that includes:
    - ***Overview Tab:*** A slider for selecting a year (or "Total" for all years) that filters the data and updates three interactive graphs displaying:
        - Top 10 Publishers (by number of bestsellers)
        - Top 10 Authors (by number of bestsellers)
        - Top 10 Authors (by maximum weeks on the bestseller list)
    - ***Book Reviews Tab:*** A dropdown listing only those books that have at least one review (displayed as "Title - Author"). When a book is selected, the dashboard displays the book’s title, author, a clickable review link (if available), and the book cover image, which is dynamically fetched from the Open Library Covers API. The cover image is aligned to the right.
- **Machine Learning Pipeline:**
An ensemble model is built to predict the number of weeks a book stays on the bestseller list. Key points include:
    - Aggregated features are computed: the average weeks on the list for publishers and the maximum weeks on the list for authors.
    - The model uses additional features such as rank, previous week’s rank, and the review flags.
    - Synthetic data augmentation is applied to continuous features by adding Gaussian noise.
    - A VotingRegressor ensemble (combining RandomForestRegressor, GradientBoostingRegressor, and ExtraTreesRegressor) is trained and evaluated using Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

### Data Cleaning & Feature Engineering ###
- **Date Conversion:**
Convert `bestsellers_date` to a datetime object and extract year, quarter, and month.
- **Text Features:**
Compute the length of the book title and the word count of the description.
- **Review Parsing:**
The `reviews` column is parsed to convert its string representation into a Python list of dictionaries. Binary flags are then created for each review type (book review, first chapter, Sunday review, article chapter).
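
As a minimal sketch of the review-parsing step described above, the snippet below runs `ast.literal_eval` on a made-up `reviews` string shaped like the values in the dataset and derives the four flags. The sample string, the URL inside it, and the helper name `demo_review_flags` are illustrative assumptions, not part of the notebook.

```python
import ast

# Hypothetical sample value, shaped like one cell of the `reviews` column.
sample = (
    "[{'book_review_link': 'https://example.com/review', "
    "'first_chapter_link': '', "
    "'sunday_review_link': None, "
    "'article_chapter_link': ''}]"
)

LINK_KEYS = [
    "book_review_link",
    "first_chapter_link",
    "sunday_review_link",
    "article_chapter_link",
]

def demo_review_flags(raw):
    """Return {key: 0 or 1} for each review type; all zeros if parsing fails."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        parsed = []
    is_valid = isinstance(parsed, list) and parsed and isinstance(parsed[0], dict)
    first = parsed[0] if is_valid else {}
    # A flag is 1 only when the link is a non-empty string after stripping.
    return {key: int(bool(str(first.get(key) or "").strip())) for key in LINK_KEYS}

flags = demo_review_flags(sample)
print(flags)
```

Only `book_review_link` is flagged here, because empty strings and `None` both count as "no link".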

### Dashboard ###
- **Overview Tab:**
Uses a slider (with a minimum equal to the first available year and a maximum equal to last_year + 1, where the maximum is labeled "Total") to filter data by year. It displays three interactive graphs:
    - ***Publishers Graph:*** Top 10 publishers by the number of bestsellers.
    - ***Authors Graph:*** Top 10 authors by the number of bestsellers.
    - ***Weeks Graph:*** Top 10 authors by maximum weeks on the bestseller list.
- **Book Reviews Tab:**
The dropdown displays only books with at least one review. When a book is selected, the dashboard shows:
    - The title and author on the left.
    - A clickable review link.
    - The cover image on the right (fetched dynamically from the Open Library Covers API based on the book’s ISBN).
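
The cover lookup can be sketched as a small URL builder. This assumes the Open Library Covers URL pattern `/b/isbn/{isbn}-{size}.jpg` with sizes `S`, `M`, and `L`; the helper name `cover_url` is hypothetical, and the ISBN is the `isbn10` value shown for JAMES in the data preview.

```python
def cover_url(isbn, size="L"):
    """Build an Open Library Covers URL for an ISBN, or return None without one.

    `cover_url` is a hypothetical helper; `size` is one of "S", "M", or "L".
    """
    if not isbn:
        return None
    if size not in {"S", "M", "L"}:
        raise ValueError("size must be 'S', 'M', or 'L'")
    return f"https://covers.openlibrary.org/b/isbn/{isbn}-{size}.jpg"

url = cover_url("0385550367")
print(url)  # https://covers.openlibrary.org/b/isbn/0385550367-L.jpg
```

No request is made at this point. The browser fetches the image when the URL is used as the `src` of an `html.Img` component.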

### Machine Learning ###
- **Feature Aggregation:**
Compute aggregated features:
    - ***Publisher:*** Average weeks on the bestseller list.
    - ***Author:*** Maximum weeks on the bestseller list.
- **Data Augmentation:**
Synthetic data augmentation is applied to continuous features (e.g., rank, aggregated features) by adding Gaussian noise.
- **Ensemble Modeling:**
A VotingRegressor ensemble is built combining:
    - `RandomForestRegressor`
    - `GradientBoostingRegressor`
    - `ExtraTreesRegressor`

The model is evaluated using Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
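
To make the two metrics concrete, here is a small hand-rolled sketch using only the standard library. The numbers are made up for illustration and are not results from this dataset.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the target's own units (weeks)."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up values for illustration only.
weeks_true = [1, 4, 10, 30]
weeks_pred = [2, 4, 7, 20]

print(mse(weeks_true, weeks_pred))   # (1 + 0 + 9 + 100) / 4 = 27.5
print(rmse(weeks_true, weeks_pred))  # square root of 27.5
```

Because errors are squared, the single 10-week miss contributes 100 of the 110 total, so a few long-running titles can dominate both metrics.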

In [2]:
import pandas as pd
import ast
import plotly.express as px
import dash
from dash import dcc, html
from dash.dependencies import Input, Output

### 2. Load and clean dataset ###

In [4]:
df = pd.read_csv('NYT Fiction Bestsellers - Bestsellers.csv')
df

Unnamed: 0,list_name,title,author,publisher,bestsellers_date,published_date,rank,rank_last_week,weeks_on_list,desc,isbns,book_details,reviews
0,Combined Print and E-Book Fiction,JAMES,Percival Everett,Doubleday,2024-12-21,2025-01-05,1,1,9,A reimagining of “Adventures of Huckleberry Fi...,"[{'isbn10': '0385550367', 'isbn13': '978038555...","[{'title': 'JAMES', 'description': ""A reimagin...","[{'book_review_link': '', 'first_chapter_link'..."
1,Combined Print and E-Book Fiction,THE WOMEN,Kristin Hannah,St. Martin's,2024-12-21,2025-01-05,2,4,46,"In 1965, a nursing student follows her brother...","[{'isbn10': '1250178630', 'isbn13': '978125017...","[{'title': 'THE WOMEN', 'description': 'In 196...","[{'book_review_link': '', 'first_chapter_link'..."
2,Combined Print and E-Book Fiction,WICKED,Gregory Maguire,Morrow,2024-12-21,2025-01-05,3,3,6,A misunderstood girl named Elphaba is declared...,"[{'isbn10': '0061792942', 'isbn13': '978006179...","[{'title': 'WICKED', 'description': 'A misunde...",[{'book_review_link': 'https://www.nytimes.com...
3,Combined Print and E-Book Fiction,FOURTH WING,Rebecca Yarros,Red Tower,2024-12-21,2025-01-05,4,5,74,Violet Sorrengail is urged by the commanding g...,"[{'isbn10': '1649374046', 'isbn13': '978164937...","[{'title': 'FOURTH WING', 'description': 'Viol...","[{'book_review_link': '', 'first_chapter_link'..."
4,Combined Print and E-Book Fiction,THE HOUSEMAID,Freida McFadden,Grand Central,2024-12-21,2025-01-05,5,11,76,Troubles surface when a woman looking to make ...,"[{'isbn10': '1538742578', 'isbn13': '978153874...","[{'title': 'THE HOUSEMAID', 'description': 'Tr...","[{'book_review_link': '', 'first_chapter_link'..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12425,Combined Print and E-Book Fiction,WHAT THE NIGHT KNOWS,Dean Koontz,Random House,2011-01-30,2011-02-13,16,0,0,Someone is murdering entire families — recreat...,"[{'isbn10': '0553807722', 'isbn13': '978055380...","[{'title': 'WHAT THE NIGHT KNOWS', 'descriptio...","[{'book_review_link': '', 'first_chapter_link'..."
12426,Combined Print and E-Book Fiction,TRUE GRIT,Charles Portis,Penguin Group,2011-01-30,2011-02-13,17,0,0,A 14-year-old Arkansas girl hires a “one-eyed ...,"[{'isbn10': '159020459X', 'isbn13': '978159020...","[{'title': 'TRUE GRIT', 'description': 'A 14-y...","[{'book_review_link': '', 'first_chapter_link'..."
12427,Combined Print and E-Book Fiction,ARCHANGEL’S CONSORT,Nalini Singh,Penguin Group,2011-01-30,2011-02-13,18,0,0,The vampire hunter Elena Deveraux and her love...,"[{'isbn10': '0425240134', 'isbn13': '978042524...","[{'title': 'ARCHANGEL’S CONSORT', 'description...","[{'book_review_link': '', 'first_chapter_link'..."
12428,Combined Print and E-Book Fiction,CROSS FIRE,James Patterson,"Little, Brown",2011-01-30,2011-02-13,19,0,0,"Tracking the murderer of a relative, Alex Cros...","[{'isbn10': '0316018783', 'isbn13': '978031601...","[{'title': 'CROSS FIRE', 'description': 'Track...","[{'book_review_link': '', 'first_chapter_link'..."


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12430 entries, 0 to 12429
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   list_name         12430 non-null  object
 1   title             12430 non-null  object
 2   author            12430 non-null  object
 3   publisher         12430 non-null  object
 4   bestsellers_date  12430 non-null  object
 5   published_date    12430 non-null  object
 6   rank              12430 non-null  int64 
 7   rank_last_week    12430 non-null  int64 
 8   weeks_on_list     12430 non-null  int64 
 9   desc              12334 non-null  object
 10  isbns             12430 non-null  object
 11  book_details      12430 non-null  object
 12  reviews           12430 non-null  object
dtypes: int64(3), object(10)
memory usage: 1.2+ MB


In [8]:
# Convert the `bestsellers_date` column to datetime format. 
# Invalid dates will be set to NaT.
df['bestsellers_date'] = pd.to_datetime(df['bestsellers_date'], errors='coerce')
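
A short sketch of what `errors='coerce'` does, using made-up strings rather than the dataset. In the notebook itself, `df.info()` still reports 12430 non-null values for `bestsellers_date` after conversion, so no dates appear to have been coerced to `NaT`.

```python
import pandas as pd

# Made-up values: two valid dates and one malformed entry.
raw = pd.Series(["2024-12-21", "2011-01-30", "not a date"])
parsed = pd.to_datetime(raw, errors="coerce")

# Malformed entries become NaT instead of raising an error.
n_invalid = int(parsed.isna().sum())
print(parsed)
print(f"{n_invalid} value(s) could not be parsed")
```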

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12430 entries, 0 to 12429
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   list_name         12430 non-null  object        
 1   title             12430 non-null  object        
 2   author            12430 non-null  object        
 3   publisher         12430 non-null  object        
 4   bestsellers_date  12430 non-null  datetime64[ns]
 5   published_date    12430 non-null  object        
 6   rank              12430 non-null  int64         
 7   rank_last_week    12430 non-null  int64         
 8   weeks_on_list     12430 non-null  int64         
 9   desc              12334 non-null  object        
 10  isbns             12430 non-null  object        
 11  book_details      12430 non-null  object        
 12  reviews           12430 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(9)
memory usage: 1.2+ MB


### 3. Analyze dataset ###

Create new features to enrich the dataset.

In [12]:
# Extract a 'year_quarter' column from 'bestsellers_date' (e.g., "2024Q4").
df['year_quarter'] = df['bestsellers_date'].dt.to_period("Q").astype(str)

In [14]:
# Extract the year from 'bestsellers_date'
df['year'] = df['bestsellers_date'].dt.year

# Extract month name and month number for further analysis.
df['month'] = df['bestsellers_date'].dt.strftime("%B")
df['month_num'] = df['bestsellers_date'].dt.month

In [16]:
# Calculate the length of each book title (number of characters)
df['title_length'] = df['title'].apply(len)

In [18]:
# Calculate the word count for the book description.
df['desc_word_count'] = df['desc'].fillna('').apply(lambda x: len(x.split()))

In [20]:
# Parse the 'reviews' column to convert its string representation into Python objects.
def parse_reviews(review):
    try:
        if isinstance(review, str) and review.strip() != "":
            return ast.literal_eval(review)
        return []
    except Exception:
        return []
df["reviews_parsed"] = df["reviews"].apply(parse_reviews)

Create new binary features from the reviews_parsed column.

In [24]:
# Create new binary features from the reviews_parsed column.
# For each review type, check if the first dictionary in the list has a non-empty link.
def safe_strip(value):
    """Return a stripped string if value is not None; otherwise, return an empty string."""
    if value is None:
        return ""
    return str(value).strip()

def extract_review_flags(reviews):
    # Default flags: 0 if no link is found.
    has_book_review = 0
    has_first_chapter = 0
    has_sunday_review = 0
    has_article_chapter = 0
    if isinstance(reviews, list) and len(reviews) > 0 and isinstance(reviews[0], dict):
        review_dict = reviews[0]
        # Use safe_strip to handle potential None values.
        book_link = safe_strip(review_dict.get("book_review_link", ""))
        first_chapter_link = safe_strip(review_dict.get("first_chapter_link", ""))
        sunday_review_link = safe_strip(review_dict.get("sunday_review_link", ""))
        article_chapter_link = safe_strip(review_dict.get("article_chapter_link", ""))
        has_book_review = 1 if book_link != "" else 0
        has_first_chapter = 1 if first_chapter_link != "" else 0
        has_sunday_review = 1 if sunday_review_link != "" else 0
        has_article_chapter = 1 if article_chapter_link != "" else 0
    return pd.Series([has_book_review, has_first_chapter, has_sunday_review, has_article_chapter])

# Apply the function and create four new columns.
df[["has_book_review", "has_first_chapter", "has_sunday_review", "has_article_chapter"]] = df["reviews_parsed"].apply(extract_review_flags)

df.head()

Unnamed: 0,list_name,title,author,publisher,bestsellers_date,published_date,rank,rank_last_week,weeks_on_list,desc,...,year,month,month_num,title_length,desc_word_count,reviews_parsed,has_book_review,has_first_chapter,has_sunday_review,has_article_chapter
0,Combined Print and E-Book Fiction,JAMES,Percival Everett,Doubleday,2024-12-21,2025-01-05,1,1,9,A reimagining of “Adventures of Huckleberry Fi...,...,2024,December,12,5,23,"[{'book_review_link': '', 'first_chapter_link'...",0,0,0,0
1,Combined Print and E-Book Fiction,THE WOMEN,Kristin Hannah,St. Martin's,2024-12-21,2025-01-05,2,4,46,"In 1965, a nursing student follows her brother...",...,2024,December,12,9,20,"[{'book_review_link': '', 'first_chapter_link'...",0,0,0,0
2,Combined Print and E-Book Fiction,WICKED,Gregory Maguire,Morrow,2024-12-21,2025-01-05,3,3,6,A misunderstood girl named Elphaba is declared...,...,2024,December,12,6,17,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,0
3,Combined Print and E-Book Fiction,FOURTH WING,Rebecca Yarros,Red Tower,2024-12-21,2025-01-05,4,5,74,Violet Sorrengail is urged by the commanding g...,...,2024,December,12,11,22,"[{'book_review_link': '', 'first_chapter_link'...",0,0,0,0
4,Combined Print and E-Book Fiction,THE HOUSEMAID,Freida McFadden,Grand Central,2024-12-21,2025-01-05,5,11,76,Troubles surface when a woman looking to make ...,...,2024,December,12,13,20,"[{'book_review_link': '', 'first_chapter_link'...",0,0,0,0


### 4. Create a Dash app ###

#### Overview Tab ####

In this tab, a slider lets the user select a year. Positions cover each individual year plus "Total" (all years).

Based on the selection, three graphs are updated:
- Publishers: count of books per publisher.
- Authors: count of books per author.
- Top Weeks: maximum weeks on the bestseller list by author.
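
One thing to keep in mind when reading these counts: the dataset holds one row per weekly list entry, so a plain row count per publisher measures list appearances, not distinct books. The sketch below contrasts the two on made-up data; the toy DataFrame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: one title charting for three weeks, another for one week.
toy = pd.DataFrame({
    "publisher": ["A", "A", "A", "B"],
    "title": ["BOOK ONE", "BOOK ONE", "BOOK ONE", "BOOK TWO"],
})

# Weekly list entries per publisher (what `groupby(...).size()` counts).
entries = toy.groupby("publisher").size()

# Distinct titles per publisher.
distinct_titles = toy.groupby("publisher")["title"].nunique()

print(entries.to_dict())          # {'A': 3, 'B': 1}
print(distinct_titles.to_dict())  # {'A': 1, 'B': 1}
```

Either measure can be reasonable. Which one to plot depends on whether "number of bestsellers" should mean weeks of presence or number of titles.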

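#### Book Reviews Tab ####

A dropdown lists only the books that have at least one review link. Selecting a book shows its title, author, and cover image (fetched from the Open Library Covers API by ISBN).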

In [72]:
# The slider's minimum is the first available year and maximum is (last_year + 1) which represents "Total".
available_years = sorted(df["year"].dropna().unique().astype(int))
first_year = int(available_years[0])
last_year = int(available_years[-1])
slider_min = first_year
slider_max = last_year + 1  # now a Python int
slider_marks = {int(yr): str(yr) for yr in available_years}
slider_marks[slider_max] = "Total"

In [74]:
# Filter books that have at least one review (any of the four flags is 1)
df_reviews = df[(df['has_book_review']==1) | (df['has_first_chapter']==1) | 
                (df['has_sunday_review']==1) | (df['has_article_chapter']==1)]
df_reviews

Unnamed: 0,list_name,title,author,publisher,bestsellers_date,published_date,rank,rank_last_week,weeks_on_list,desc,...,year,month,month_num,title_length,desc_word_count,reviews_parsed,has_book_review,has_first_chapter,has_sunday_review,has_article_chapter
2,Combined Print and E-Book Fiction,WICKED,Gregory Maguire,Morrow,2024-12-21,2025-01-05,3,3,6,A misunderstood girl named Elphaba is declared...,...,2024,December,12,6,17,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,0
17,Combined Print and E-Book Fiction,WICKED,Gregory Maguire,Morrow,2024-12-14,2024-12-29,3,2,5,A misunderstood girl named Elphaba is declared...,...,2024,December,12,6,17,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,0
31,Combined Print and E-Book Fiction,WICKED,Gregory Maguire,Morrow,2024-12-07,2024-12-22,2,1,4,A misunderstood girl named Elphaba is declared...,...,2024,December,12,6,17,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,0
45,Combined Print and E-Book Fiction,WICKED,Gregory Maguire,Morrow,2024-11-30,2024-12-15,1,6,3,A misunderstood girl named Elphaba is declared...,...,2024,November,11,6,17,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,0
65,Combined Print and E-Book Fiction,WICKED,Gregory Maguire,Morrow,2024-11-23,2024-12-08,6,12,2,A misunderstood girl named Elphaba is declared...,...,2024,November,11,6,17,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12414,Combined Print and E-Book Fiction,WATER FOR ELEPHANTS,Sara Gruen,Algonquin,2011-01-30,2011-02-13,5,0,1,"After his parents die in a car accident, young...",...,2011,January,1,19,20,[{'book_review_link': 'https://www.nytimes.com...,1,0,1,0
12415,Combined Print and E-Book Fiction,THE CONFESSION,John Grisham,Knopf Doubleday,2011-01-30,2011-02-13,6,0,1,A criminal wants to save an innocent man on de...,...,2011,January,1,14,21,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,0
12416,Combined Print and E-Book Fiction,CUTTING FOR STONE,Abraham Verghese,Knopf Doubleday,2011-01-30,2011-02-13,7,0,1,"Twin brothers, conjoined at birth and then sep...",...,2011,January,1,17,16,"[{'book_review_link': '', 'first_chapter_link'...",0,1,1,0
12418,Combined Print and E-Book Fiction,THE HELP,Kathryn Stockett,Penguin Group,2011-01-30,2011-02-13,9,0,1,A young white woman and two black maids in 196...,...,2011,January,1,8,11,[{'book_review_link': 'https://www.nytimes.com...,1,0,0,1


In [76]:
# Create options for the dropdown: display "Title - Author" and use the dataframe index as the value.
book_options = [{'label': f"{row['title']} - {row['author']}", 'value': idx} for idx, row in df_reviews.iterrows()]
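
Because `df_reviews` has one row per weekly appearance, the same "Title - Author" label can show up many times in the dropdown. A possible variation, sketched here on a made-up stand-in for `df_reviews`, keeps one option per book by dropping duplicate title/author pairs first.

```python
import pandas as pd

# Made-up stand-in for `df_reviews`: the same book on two weekly lists.
toy_reviews = pd.DataFrame(
    {
        "title": ["WICKED", "WICKED", "THE HELP"],
        "author": ["Gregory Maguire", "Gregory Maguire", "Kathryn Stockett"],
    },
    index=[2, 17, 12418],
)

# Keep the first row per (title, author); its index becomes the option value.
unique_books = toy_reviews.drop_duplicates(subset=["title", "author"])
options = [
    {"label": f"{row['title']} - {row['author']}", "value": int(idx)}
    for idx, row in unique_books.iterrows()
]
print(options)
```

The retained index still points at a real row, so a lookup such as `df_reviews.loc[value]` would keep working with these values.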

In [78]:
# Define a helper function to extract an ISBN from the 'isbns' column.
def get_isbn(row):
    try:
        isbns = row['isbns']
        if isinstance(isbns, str) and isbns.strip() != '':
            isbn_list = ast.literal_eval(isbns)
            if isinstance(isbn_list, list) and len(isbn_list) > 0:
                # Prefer isbn13 if available, otherwise use isbn10.
                first = isbn_list[0]
                if 'isbn13' in first and first['isbn13']:
                    return first['isbn13']
                elif 'isbn10' in first and first['isbn10']:
                    return first['isbn10']
    except Exception:
        return None
    return None

#### Layout Definition ####

In [80]:
app = dash.Dash(__name__)
app.layout = html.Div([
    html.H1('NYT Bestsellers Analysis', style={"color": "white"}),
    dcc.Tabs(id='tabs', children=[
        # Overview Tab (first tab)
        dcc.Tab(
            label='Overview',
            children=[
                html.Div([
                    html.Label('Select Year:', style={'color': 'white', 'marginRight': '20px'}),
                    dcc.Slider(
                        id='year-slider',
                        min=slider_min,
                        max=slider_max,
                        step=1,
                        value=slider_max,  # default to "Total"
                        marks=slider_marks,
                        tooltip={'always_visible': True, 'placement': 'bottom'},
                    )
                ], style={'padding': '20px'}),
                dcc.Graph(id='pub-graph'),
                dcc.Graph(id='auth-graph'),
                dcc.Graph(id='weeks-graph')
            ]
        ),
        # Book Reviews Tab (second tab)
        dcc.Tab(
            label='Book Reviews',
            children=[
                html.Div([
                    html.Label('Select a Book', style={'color': 'white'}),
                    dcc.Dropdown(
                        id='book-dropdown',
                        options=book_options,
                        value=book_options[0]['value'] if book_options else None
                    )
                ], style={'width': '60%', 'padding': '20px'}),
                html.Div(id='book-review-output', style={'padding': '20px', "color": "white"})
            ]
        )
    ])
], style={'backgroundColor': '#2c2c2c', 'padding': '20px'})

#### Callbacks for Overview Tab ####

In [82]:
@app.callback(
    [Output('pub-graph', 'figure'),
     Output('auth-graph', 'figure'),
     Output('weeks-graph', 'figure')],
    [Input('year-slider', 'value')]
)
def update_overview(selected_value):
    # If slider value equals slider_max, use the full dataset ("Total")
    if selected_value == slider_max:
        dff = df.copy()
    else:
        dff = df[df['year'] == selected_value]
    
    # Publishers graph: top 10 publishers by count (descending).
    pub_data = dff.groupby('publisher').size().reset_index(name='count')
    pub_data = pub_data.sort_values('count', ascending=False).head(10)
    fig_pub = px.bar(pub_data,
                     x='publisher',
                     y='count',
                     labels={'publisher': 'Publisher', 'count': 'Number of Bestsellers'},
                     title='Top 10 Publishers',
                     color='count',
                     color_continuous_scale=px.colors.sequential.Viridis,
                     template='plotly_dark')
    fig_pub.update_layout(xaxis_tickangle=-45)
    
    # Authors graph: top 10 authors by count (descending).
    auth_data = dff.groupby('author').size().reset_index(name='count')
    auth_data = auth_data.sort_values('count', ascending=False).head(10)
    fig_auth = px.bar(auth_data,
                      x='author',
                      y='count',
                      labels={'author': 'Author', 'count': 'Number of Bestsellers'},
                      title='Top 10 Authors',
                      color='count',
                      color_continuous_scale=px.colors.sequential.Plasma,
                      template='plotly_dark')
    fig_auth.update_layout(xaxis_tickangle=-45)
    
    # Weeks graph: top 10 authors by maximum weeks on the list (descending).
    weeks_data = dff.groupby('author')['weeks_on_list'].max().reset_index(name='max_weeks')
    weeks_data = weeks_data.sort_values('max_weeks', ascending=False).head(10)
    fig_weeks = px.bar(weeks_data,
                       x='author',
                       y='max_weeks',
                       labels={'author': 'Author', 'max_weeks': 'Max Weeks on List'},
                       title='Top 10 Authors by Maximum Weeks',
                       color='max_weeks',
                       color_continuous_scale=px.colors.sequential.Inferno,
                       template='plotly_dark')
    fig_weeks.update_layout(xaxis_tickangle=-45)
    
    return fig_pub, fig_auth, fig_weeks

#### Callback for Book Reviews Tab ####

In [84]:
@app.callback(
    Output('book-review-output', 'children'),
    [Input('book-dropdown', 'value')]
)
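def update_book_review(selected_idx):
    if selected_idx is None:
        return 'No book selected.'
    try:
        row = df_reviews.loc[selected_idx]
    except KeyError:
        return 'Book not found.'
    title = row['title']
    author = row['author']

    # Pick the first non-empty review link from the parsed reviews, if any.
    reviews = row['reviews_parsed']
    has_dict = isinstance(reviews, list) and len(reviews) > 0 and isinstance(reviews[0], dict)
    review_dict = reviews[0] if has_dict else {}
    link_keys = ('book_review_link', 'sunday_review_link', 'first_chapter_link', 'article_chapter_link')
    links = [safe_strip(review_dict.get(key)) for key in link_keys]
    review_url = next((link for link in links if link), None)
    if review_url:
        link_tag = html.A('Read the review', href=review_url, target='_blank')
    else:
        link_tag = html.Div('No review link available.')

    # Fetch the cover from the Open Library Covers API by ISBN (over HTTPS).
    isbn = get_isbn(row)
    if isbn:
        img_url = f"https://covers.openlibrary.org/b/isbn/{isbn}-L.jpg"
        img_tag = html.Img(src=img_url, style={'maxWidth': '300px', 'marginTop': '20px'})
    else:
        img_tag = html.Div('No cover image available.')

    # Title, author, and link on the left; cover image on the right.
    return html.Div([
        html.Div([html.H3(f'Title: {title}'), html.H4(f'Author: {author}'), link_tag]),
        img_tag
    ], style={'display': 'flex', 'justifyContent': 'space-between', 'alignItems': 'flex-start'})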

### 5. Machine Learning ###

In [40]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, VotingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

Compute aggregated features:

In [48]:
# Calculate the average weeks_on_list for each publisher.
publisher_avg = df.groupby('publisher')['weeks_on_list'].mean().reset_index().rename(
    columns={'weeks_on_list': 'publisher_avg_weeks'}
)
publisher_avg

Unnamed: 0,publisher,publisher_avg_weeks
0,7th House,0.00000
1,Abbi Glines,1.30000
2,Ace,1.02439
3,Addison Moore,2.50000
4,Alcove,1.00000
...,...,...
396,Zebra/Kensington,1.00000
397,Zebra/Mulholland,1.00000
398,Zondervan,1.00000
399,ePublishing Works!,0.00000


In [50]:
# Calculate the maximum weeks_on_list for each author.
author_max = df.groupby('author')['weeks_on_list'].max().reset_index().rename(
    columns={'weeks_on_list': 'author_max_weeks'}
)
author_max

Unnamed: 0,author,author_max_weeks
0,A L Jackson,1
1,A S A Harrison,6
2,A.J. Finn,31
3,Aaron Allston,1
4,Ab Jimenez,17
...,...,...
943,Zadie Smith,0
944,Zakiya Dalila Harris,2
945,Zane,1
946,edited David Baldacci,0


In [52]:
# Merge these aggregated features back into the main DataFrame.
df_ml = df.merge(publisher_avg, on='publisher', how='left')
df_ml = df_ml.merge(author_max, on='author', how='left')
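
One caveat: `publisher_avg_weeks` and `author_max_weeks` are derived from the target column, so computing them over the full DataFrame before the train/test split lets each test row's own `weeks_on_list` leak into its features. That can make the reported error look better than it would be on unseen data. A possible alternative, sketched below on made-up data, computes the aggregate from training rows only and falls back to the overall training mean for unseen publishers. The toy values and the fallback choice are assumptions, not the notebook's method.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    "publisher": ["A", "A", "B", "B", "C"],
    "weeks_on_list": [2, 4, 10, 20, 7],
})
train = toy.iloc[:3].copy()  # rows 0-2
test = toy.iloc[3:].copy()   # rows 3-4

# Aggregate using training rows only.
pub_avg = (
    train.groupby("publisher")["weeks_on_list"].mean().rename("publisher_avg_weeks")
)
fallback = train["weeks_on_list"].mean()  # for publishers unseen in training

train = train.join(pub_avg, on="publisher")
test = test.join(pub_avg, on="publisher")
test["publisher_avg_weeks"] = test["publisher_avg_weeks"].fillna(fallback)

print(test)
```

Publisher B's test row gets 10.0 (from the single training row for B), not a value influenced by its own 20 weeks, and publisher C falls back to the training mean.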

Select features for the machine learning model:
- Rank features: `rank`, `rank_last_week`
- Categorical features: `publisher`, `author` (to be one-hot encoded)
- Aggregated features: `publisher_avg_weeks`, `author_max_weeks`
- New review-related features: `has_book_review`, `has_first_chapter`, `has_sunday_review`, `has_article_chapter`

In [54]:
features = [
    'rank',
    'rank_last_week',
    'publisher',
    'author',
    'publisher_avg_weeks',
    'author_max_weeks',
    'has_book_review',
    'has_first_chapter',
    'has_sunday_review',
    'has_article_chapter'
]

In [56]:
# Drop rows with missing values in the selected features or target.
df_ml = df_ml.dropna(subset=features + ['weeks_on_list'])

In [58]:
# Define X (features) and y (target variable).
X = df_ml[features]
y = df_ml['weeks_on_list']

In [60]:
# One-hot encode categorical features ('publisher' and 'author').
X = pd.get_dummies(
    X,
    columns=['publisher', 'author'],
    drop_first=True
)

In [62]:
# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [64]:
def augment_data(X, y, continuous_cols, factor=2, noise_level=0.1):
    """Append (factor - 1) noisy copies of each row; targets are copied unchanged."""
    augmented_X = X.copy()
    augmented_y = y.copy()
    # Per-column standard deviations set the scale of the Gaussian noise.
    stds = X[continuous_cols].std()
    synthetic_rows = []
    synthetic_targets = []
    for i, row in X.iterrows():
        for _ in range(factor - 1):
            new_row = row.copy()
            for col in continuous_cols:
                # Noise ~ N(0, (noise_level * std)^2), added to continuous columns only.
                noise = np.random.normal(0, noise_level * stds[col])
                new_row[col] = row[col] + noise
            synthetic_rows.append(new_row)
            synthetic_targets.append(y.loc[i])
    if synthetic_rows:
        synthetic_df = pd.DataFrame(synthetic_rows, columns=X.columns)
        augmented_X = pd.concat([augmented_X, synthetic_df], axis=0)
        augmented_y = pd.concat([augmented_y, pd.Series(synthetic_targets)], axis=0)
    return augmented_X, augmented_y

In [66]:
continuous_cols = [
    'rank',
    'rank_last_week',
    'publisher_avg_weeks',
    'author_max_weeks'
]
X_train_aug, y_train_aug = augment_data(X_train, y_train, continuous_cols, factor=2, noise_level=0.1)
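
Looping over rows with `iterrows` can be slow on a wide one-hot-encoded frame. The sketch below shows a vectorized variant of the same idea with a seeded random generator for repeatable noise. The function name `augment_vectorized`, the `seed` parameter, and the toy data are assumptions. Unlike `augment_data`, it resets the index of the result.

```python
import numpy as np
import pandas as pd

def augment_vectorized(X, y, continuous_cols, factor=2, noise_level=0.1, seed=0):
    """Append (factor - 1) noisy copies of X; targets are repeated unchanged."""
    rng = np.random.default_rng(seed)
    # Noise scale per continuous column: noise_level * column standard deviation.
    scale = noise_level * X[continuous_cols].std().to_numpy()
    X_parts, y_parts = [X], [y]
    for _ in range(factor - 1):
        noisy = X.copy()
        noise = rng.normal(0.0, 1.0, size=(len(X), len(continuous_cols))) * scale
        noisy[continuous_cols] = X[continuous_cols].to_numpy() + noise
        X_parts.append(noisy)
        y_parts.append(y)
    return pd.concat(X_parts, ignore_index=True), pd.concat(y_parts, ignore_index=True)

# Made-up data for illustration.
X_toy = pd.DataFrame({"rank": [1, 5, 9], "has_book_review": [1, 0, 0]})
y_toy = pd.Series([12, 3, 1])
X_aug, y_aug = augment_vectorized(X_toy, y_toy, ["rank"], factor=3)
print(X_aug.shape)     # (9, 2)
print(y_aug.tolist())  # [12, 3, 1, 12, 3, 1, 12, 3, 1]
```

Only the columns listed in `continuous_cols` receive noise. Binary flags and one-hot columns are copied as they are.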

In [68]:
reg1 = RandomForestRegressor(n_estimators=100, random_state=42)
reg2 = GradientBoostingRegressor(n_estimators=100, random_state=42)
reg3 = ExtraTreesRegressor(n_estimators=100, random_state=42)
ensemble = VotingRegressor([('rf', reg1), ('gb', reg2), ('et', reg3)])
ensemble.fit(X_train_aug, y_train_aug)

y_pred = ensemble.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("\nEnsemble Model Performance on Augmented Data:")
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)


Ensemble Model Performance on Augmented Data:
Mean Squared Error: 193.57217575065533
Root Mean Squared Error: 13.91302180515273
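
An RMSE is easier to judge next to a trivial baseline. The sketch below computes the RMSE obtained by always predicting the training mean. The helper name `mean_baseline_rmse` is hypothetical and the numbers are made up, so no baseline figure for this dataset is claimed here.

```python
import math

def mean_baseline_rmse(y_train, y_test):
    """RMSE obtained by predicting the training mean for every test row."""
    mean_pred = sum(y_train) / len(y_train)
    return math.sqrt(sum((t - mean_pred) ** 2 for t in y_test) / len(y_test))

# Made-up values for illustration only.
baseline = mean_baseline_rmse([1, 2, 3, 10], [2, 6])
print(baseline)  # 2.0
```

In the notebook, the same function could be called as `mean_baseline_rmse(y_train, y_test)` and compared with the ensemble's RMSE. If the ensemble is not clearly below the baseline, the features add little.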


In [86]:
# Run the Dash app. Recent Dash releases use `app.run`; older ones used `app.run_server`.
if __name__ == '__main__':
    app.run(debug=True)