# Project - What makes a good book?

Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

+ price
+ popularity (target variable)
+ review/summary
+ review/text
+ review/helpfulness
+ authors
+ categories

You'll need to build a model that predicts whether a book will be rated as popular or not.

They have high expectations of you, so have set a target of at least 70% accuracy! You are free to use as many features as you like, and will need to engineer new features to achieve this level of performance.

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

books = pd.read_csv("data/books.csv")

# Rename columns to snake_case identifiers
books.columns = ['title', 'price', 'review_helpfulness', 'review_summary', 'review_text',
       'description', 'authors', 'categories', 'popularity']

# Preprocessing

+ Categorical columns
    + Replace non-alphanumeric characters with underscores in the categorical text columns, collapsing repeated underscores

+ Numeric columns
    + Convert the review_helpfulness column (e.g., "3/5") into a numerical fraction.
    + Randomly impute missing values in review_helpfulness based on valid entries.
    + Scale the price column using `StandardScaler`.
    
+ Combine text columns into a single review column.
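
As a rough illustration of the numeric steps listed above, the sketch below parses a few made-up helpfulness strings and fills the gaps by sampling from the valid fractions. The sample values, the helper name `to_fraction`, and the seeded generator are assumptions for illustration; they are not taken from the dataset or the solution code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real column comes from data/books.csv
helpfulness = pd.Series(["3/5", "0/0", None, "7/7"])

def to_fraction(value):
    """Return helpful/total as a float, or NaN when it cannot be computed."""
    if pd.isna(value):
        return np.nan
    numerator, denominator = (float(part) for part in str(value).split("/"))
    return np.nan if denominator == 0 else numerator / denominator

fractions = helpfulness.apply(to_fraction)

# Seeded generator so the random imputation is reproducible (an assumption;
# the notebook code uses the unseeded np.random.choice)
rng = np.random.default_rng(42)
valid = fractions.dropna().to_numpy()
missing = fractions.isna()

imputed = fractions.copy()
imputed[missing] = rng.choice(valid, size=int(missing.sum()))
print(imputed.tolist())
```

Sampling from the observed values keeps the imputed column's distribution close to the original one, at the cost of adding some random noise to the affected rows.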


In [None]:
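# Categorical columns
def clean_text(text):
    if pd.isna(text):
        return ''  # Treat missing values as empty strings
    cleaned = re.sub(r'[^a-zA-Z0-9]', '_', str(text))  # Replace non-alphanumeric characters
    cleaned = re.sub(r'_+', '_', cleaned)  # Collapse consecutive underscores
    return cleaned.strip('_')  # Remove leading and trailing underscores

# Column-wise map avoids the deprecated DataFrame.applymap; write the result back to books
categorical_cols = ['title', 'authors', 'categories']
books[categorical_cols] = books[categorical_cols].apply(lambda col: col.map(clean_text))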

# Numeric columns
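def safe_fraction_conversion(x):
    if pd.isna(x):
        return np.nan  # Leave missing entries for the imputer
    x = str(x)
    if '/' in x:
        numerator, denominator = x.split('/')
        numerator = float(numerator)
        denominator = float(denominator)
        # A 0/0 entry has no defined fraction, so treat it as missing
        return np.nan if denominator == 0 else numerator / denominator
    return float(x)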

books['review_helpfulness'] = books['review_helpfulness'].apply(safe_fraction_conversion)
valid_values = books['review_helpfulness'].dropna()

def random_imputer(val):
    if np.isnan(val):
        return np.random.choice(valid_values)  # Randomly choose from valid values
    return val

books['review_helpfulness'] = books['review_helpfulness'].apply(random_imputer)
books['price'] = StandardScaler().fit_transform(books[['price']])

# Text columns
text_cols = ['review_summary', 'review_text', 'description']
# Join the three text fields into a single 'review' column on books,
# so the pipeline's TF-IDF step can find it
books['review'] = books[text_cols].fillna('').astype(str).agg(' '.join, axis=1)

# Define Pipelines and run

+ Pre-processing
    + Scale numeric columns
    + One-hot encode categorical columns 
    + Apply TF-IDF vectorization to the combined review text, extracting 2- and 3-gram features with a maximum of 100 features

+ Classifiers
    + Loop through a list of classifiers (RandomForestClassifier, LogisticRegression, and SVC).


In [None]:
numeric_cols = ['price', 'review_helpfulness']
categorical_cols = ['title', 'authors', 'categories']
text_col = 'review'

# Pre-processing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('tfidf', TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(2, 3)), text_col)
    ]
)

# Candidate classifiers
classifiers = [
    ('Random Forest', RandomForestClassifier(max_depth=5, random_state=42)),
    ('Logistic Regression', LogisticRegression(max_iter=1000, random_state=42)),
    ('SVM', SVC(random_state=42))
]

# Train the model and evaluate the accuracy for each classifier

+ Split the data into training and test sets.
+ Fit a Pipeline for each classifier, which applies the preprocessing steps and then fits the classifier on the training data.
+ Evaluate the accuracy of each classifier on the test set.
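
The steps above can be sketched on a small synthetic dataset. Everything in this example is an assumption made for illustration: the toy columns, the `Popular`/`Unpopular` labels, the stratified split, and the use of `clone` so that each pipeline gets its own unfitted preprocessor.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the books data (hypothetical values and labels)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "price": rng.normal(20, 5, size=200),
    "categories": rng.choice(["Fiction", "History", "Science"], size=200),
})
toy["popularity"] = np.where(toy["price"] > 20, "Popular", "Unpopular")

X = toy.drop(columns="popularity")
y = (toy["popularity"] == "Popular").astype(int)  # 1-D binary target

# Split once, outside the loop, so every model is scored on the same rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["categories"]),
])

results = {}
for name, clf in [
    ("Random Forest", RandomForestClassifier(max_depth=5, random_state=42)),
    ("Logistic Regression", LogisticRegression(max_iter=1000, random_state=42)),
]:
    # clone() returns an unfitted copy of the preprocessor for this pipeline
    model = Pipeline([("preprocessor", clone(preprocessor)), ("classifier", clf)])
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

print(results)
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training rows only, which avoids leaking test-set statistics into the model.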

In [None]:
# Split dataset into training and testing sets once, so every classifier sees the same rows
X = books.drop(columns='popularity')
y = pd.get_dummies(books['popularity'], drop_first=True).iloc[:, 0]  # 1-D binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for name, clf in classifiers:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', clf)
    ])

    # Train and evaluate each classifier
    print(f"Testing {name}...")
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy for {name}: {accuracy:.4f}")

# Evaluate feature importance for RandomForest (extra)

+ Fit a RandomForestClassifier using the pipeline.
+ Extract the feature importances after training.
+ Display the top 10 most important features from the model.
+ Why: Feature importance helps identify which aspects of the data (e.g., price, text n-grams, categories) have the most predictive power for the target variable.

In [None]:
rf_clf = RandomForestClassifier(max_depth=5, random_state=42)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', rf_clf)
])
pipeline.fit(X_train, y_train)

# Extract feature importances from Random Forest
rf_clf = pipeline.named_steps['classifier']
importances = rf_clf.feature_importances_

# Sort and display top 10 features
feature_names = np.concatenate([
    numeric_cols,
    preprocessor.named_transformers_['cat'].get_feature_names_out(),
    preprocessor.named_transformers_['tfidf'].get_feature_names_out()
])
top_10_features = pd.Series(importances, index=feature_names).nlargest(10)
print("Top 10 important features:")
print(top_10_features)

# Final Submission

In [None]:
#Final Submission
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

books = pd.read_csv("data/books.csv")
books.columns = ['title', 'price', 'review_helpfulness', 'review_summary', 'review_text',
       'description', 'authors', 'categories', 'popularity']

# Function to clean text: replace non-alphanumeric characters first, then collapse repeated underscores
def clean_text(text):
    return re.sub(r'_+', '_', re.sub(r'[^a-zA-Z0-9]', '_', text)).strip('_')

# Apply text cleaning on categorical columns
books['title'] = books['title'].apply(clean_text)
books['authors'] = books['authors'].apply(clean_text)
books['categories'] = books['categories'].apply(clean_text)

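# Convert "helpful/total" strings to a fraction; 0/0 and missing entries become NaN
def safe_fraction_conversion(x):
    if pd.isna(x):
        return np.nan
    numerator, _, denominator = str(x).partition('/')
    if not denominator:
        return float(numerator)
    return np.nan if float(denominator) == 0 else float(numerator) / float(denominator)

# Convert review_helpfulness to numeric, coercing errors to NaN
books['review_helpfulness'] = pd.to_numeric(books['review_helpfulness'].apply(safe_fraction_conversion), errors='coerce')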

# Impute missing values in review_helpfulness
valid_values = books['review_helpfulness'].dropna()
books['review_helpfulness'] = books['review_helpfulness'].apply(lambda val: np.random.choice(valid_values) if pd.isna(val) else val)

# Scale the price column
books['price'] = StandardScaler().fit_transform(books[['price']])

# Combine text columns and process text data
review_col_df = books[['review_summary', 'review_text', 'description']].fillna('')
review_col_df['review'] = review_col_df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
books['review'] = review_col_df['review']

# Define numeric, categorical, and text columns
numeric_cols = ['price', 'review_helpfulness']
categorical_cols = ['title', 'authors', 'categories']
text_col = 'review'

# Set up preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('tfidf', TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(2, 3)), text_col)
    ]
)

# Build the final pipeline with Logistic Regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Prepare input X and a 1-D binary target y
X = books.drop(columns='popularity')
y = pd.get_dummies(books['popularity'], drop_first=True).iloc[:, 0]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline and assess accuracy
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
model_accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Accuracy for Logistic Regression: {model_accuracy:.4f}")



# Suggested Solution

In [None]:

# Import some required packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  

# Read in the dataset
books = pd.read_csv("data/books.csv")

# Inspect the DataFrame
books.info()

# Visualize popularity frequencies
sns.countplot(data=books, x="popularity")
plt.show()

# Check frequencies
print(books["categories"].value_counts())

# Filter out rare categories to avoid overfitting
books = books.groupby("categories").filter(lambda x: len(x) > 100)

# One-hot encoding categories
categories = pd.get_dummies(books["categories"], drop_first=True)

# Bring categories into the DataFrame
books = pd.concat([books, categories], axis=1)

# Remove original column
books.drop(columns=["categories"], inplace=True)

# Get number of total reviews 
books["num_reviews"] = books["review/helpfulness"].str.split("/", expand=True)[1]

# Get number of helpful reviews 
books["num_helpful"] = books["review/helpfulness"].str.split("/", expand=True)[0]

# Convert to integer datatype
for col in ["num_reviews", "num_helpful"]:
    books[col] = books[col].astype(int)
    
# Add percentage of helpful reviews as a column to normalize the data
books["perc_helpful_reviews"] = books["num_helpful"] / books["num_reviews"]

# Fill null values (0/0 ratios); assign back instead of using chained inplace fillna
books["perc_helpful_reviews"] = books["perc_helpful_reviews"].fillna(0)

# Drop original column
books.drop(columns=["review/helpfulness"], inplace=True)

# Convert strings to lowercase
for col in ["review/summary", "review/text", "description"]:
    books[col] = books[col].str.lower()
    
# Create a list of positive words to measure positive text sentiment
positive_words = ["great", "excellent", "good", "interesting", "enjoy", "helpful", "useful", "like", "love", "beautiful", "fantastic", "perfect", "wonderful", "impressive", "amazing", "outstanding", "remarkable", "brilliant", "exceptional", "positive",
    "thrilling"]

# Instantiate a CountVectorizer
vectorizer = CountVectorizer(vocabulary=positive_words)

# Fit and transform review/text 
review_text = books["review/text"]
text_transformed = vectorizer.fit_transform(review_text.fillna(''))

# Fit and transform review/summary
review_summary = books["review/summary"]
summary_transformed = vectorizer.fit_transform(review_summary.fillna(''))

# Fit and transform description
description = books["description"]
description_transformed = vectorizer.fit_transform(description.fillna(''))

# Add positive counts into DataFrame to add measures of positive sentiment
# (the sparse row sums are converted to flat 1-D arrays before assignment)
books["positive_words_text"] = np.asarray(text_transformed.sum(axis=1)).ravel()
books["positive_words_summary"] = np.asarray(summary_transformed.sum(axis=1)).ravel()
books["positive_words_description"] = np.asarray(description_transformed.sum(axis=1)).ravel()

# Remove original columns
books.drop(columns=["review/text", "review/summary", "description"], inplace=True)

# Splitting into features and target values
X = books.drop(columns=["title", "authors", "popularity"]).values
y = books["popularity"].values.reshape(-1, 1)

# Splitting into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate and fit a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=120, max_depth=50, min_samples_split=5, random_state=42, class_weight="balanced")
clf.fit(X_train, y_train.ravel()) 

# Evaluate accuracy
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

model_accuracy = clf.score(X_test, y_test)

# Differences between my solution and the suggested solution

The main differences between your approach and the suggested solution lie in the preprocessing steps, feature engineering, and the choice of classification model. Here’s a breakdown of the key differences that could explain why your Logistic Regression pipeline reached slightly higher accuracy (0.7142 vs. 0.7090) than the Random Forest model in the suggested solution:

#### 1. Review Helpfulness Handling:

Your Approach:

You used a function (safe_fraction_conversion) to handle the review_helpfulness column, converted it to numeric, and applied imputation to handle missing values by randomly selecting valid values. This approach keeps the column as a continuous numeric feature.

Suggested Approach:

The suggested solution splits review/helpfulness into two integer columns, num_helpful and num_reviews, and keeps both as features. It then adds the ratio of the two as perc_helpful_reviews, fills undefined ratios (0/0) with 0, and drops the original column.

Impact: Both approaches end up with the same helpful-to-total ratio as a continuous feature. The differences are that the suggested solution also keeps the raw counts, so the model can tell 1/1 from 50/50, and that it fills undefined ratios with 0 where you sample a replacement at random from the valid values.
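
A short sketch of the suggested solution's helpfulness features, using made-up values, shows what is computed and where the 0/0 case ends up. The sample strings are assumptions; the column names follow the suggested solution.

```python
import pandas as pd

# Hypothetical "helpful/total" strings
raw = pd.Series(["3/5", "0/0", "7/7"])

# Split into integer counts, as the suggested solution does
parts = raw.str.split("/", expand=True).astype(int)
num_helpful, num_reviews = parts[0], parts[1]

# The ratio is undefined for 0/0, which pandas represents as NaN
ratio = num_helpful / num_reviews

# The suggested solution fills those undefined ratios with 0
perc_helpful_reviews = ratio.fillna(0)

print(pd.DataFrame({
    "num_helpful": num_helpful,
    "num_reviews": num_reviews,
    "perc_helpful_reviews": perc_helpful_reviews,
}))
```

The ratio itself matches what `safe_fraction_conversion` returns for valid entries; only the treatment of 0/0 and the extra count columns differ.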

#### 2. Text Preprocessing:

Your Approach:

You used a TfidfVectorizer with n-grams (bigrams and trigrams) to transform the text data into numerical features. This approach captures term frequency-inverse document frequency and n-gram context in the text data.

Suggested Approach:

The suggested solution uses a CountVectorizer with a predefined list of positive words, transforming the text into sentiment-related features. The counts of positive words are added as new features for review/text, review/summary, and description, after which the original text columns are dropped.

Impact: Your approach extracts a broader set of text features (TF-IDF over bigrams and trigrams, capped at 100 features), which may capture a wider range of text-based information. The suggested approach only counts words from a fixed positive vocabulary, which limits the scope of its text features.
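
To see what the positive-word counting produces, here is a minimal sketch with a shortened word list and three made-up reviews. Both are assumptions for illustration; the suggested solution uses a longer vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Shortened vocabulary and made-up reviews, for illustration only
positive_words = ["great", "good", "love"]
reviews = ["great plot, great characters", "did not love it", "boring"]

# With a fixed vocabulary, only these words are ever counted
vectorizer = CountVectorizer(vocabulary=positive_words)
counts = vectorizer.fit_transform(reviews)

# Sum across the vocabulary to get one positive-word count per review
positive_per_review = np.asarray(counts.sum(axis=1)).ravel()
print(positive_per_review.tolist())
```

The second review shows a limitation of plain word counts: "did not love it" still scores one positive word, because negation is ignored.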

#### 3. Categorical Data Handling:

Your Approach:

You applied OneHotEncoder to categorical columns such as title, authors, and categories as part of your preprocessing pipeline. This ensures that all unique categories are represented as binary vectors.

Suggested Approach:

The suggested solution filters out rare categories in the categories column (keeping only those with more than 100 samples) and then applies one-hot encoding. However, the title and authors columns are not encoded and are dropped in the final dataset.

Impact: Your approach retains more information by encoding all categorical variables, while the suggested solution drops some useful categorical columns (such as title and authors), which may reduce the model's ability to capture useful patterns from those fields.
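
The rare-category filter in the suggested solution can be sketched on a tiny made-up frame. The category names and the threshold of 2 are assumptions; the suggested solution uses a threshold of 100.

```python
import pandas as pd

# Made-up categories: Fiction x4, History x3, Poetry x1
toy = pd.DataFrame({"categories": ["Fiction"] * 4 + ["History"] * 3 + ["Poetry"]})

# Keep only rows whose category appears more than 2 times
kept = toy.groupby("categories").filter(lambda group: len(group) > 2)

# One-hot encode what is left, dropping the first level as the baseline
dummies = pd.get_dummies(kept["categories"], drop_first=True)

print(kept["categories"].value_counts())
print(dummies.columns.tolist())
```

Note that the filter removes rows, not just columns, so the suggested model is trained and evaluated on fewer books than a pipeline that keeps every category.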

#### 4. Numerical Feature Scaling:

Your Approach:

You scaled numerical features like price using StandardScaler to normalize them.

Suggested Approach:

The suggested solution applies no feature scaling, so numerical values such as price are used as-is.

Impact: Scaling gives Logistic Regression a consistent input range, which helps its solver converge and keeps the regularization penalty comparable across features. The Random Forest in the suggested solution splits on thresholds, so it is largely insensitive to feature scale and loses little by skipping this step.
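
The point about scale sensitivity can be checked with a small experiment. The sketch below, built on synthetic price and helpfulness values that are purely assumptions, fits the same Random Forest on raw and standardized features and compares the predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic, evenly spaced feature values (hypothetical price and helpfulness)
rng = np.random.default_rng(0)
price = rng.permutation(np.linspace(5, 50, 200))
helpfulness = rng.permutation(np.linspace(0, 1, 200))
X = np.column_stack([price, helpfulness])
y = (price / 50 + helpfulness > 1).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed; only the feature scale differs
raw_forest = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
scaled_forest = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_scaled, y)

same = np.array_equal(raw_forest.predict(X), scaled_forest.predict(X_scaled))
print("Identical predictions:", same)
```

Standardization preserves the ordering of values within each feature, so tree splits separate the same rows either way. A linear model such as Logistic Regression has no such guarantee, which is why scaling matters more there.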

#### 5. Model Choice:

Your Approach:

You used Logistic Regression as the classifier. Logistic Regression is a linear model, and it often works well with data that has been thoroughly preprocessed, especially with normalized numerical data and TF-IDF features.

Suggested Approach:

The suggested solution uses a Random Forest Classifier with hyperparameters such as n_estimators=120, max_depth=50, and min_samples_split=5. Random Forests are robust and tend to perform well on datasets that mix categorical and numerical features, but trees this deep can overfit the training data. Comparing the training and test scores that the suggested solution prints is a quick way to check for this.

Impact: Logistic Regression might have worked better due to your comprehensive text processing and feature scaling. The Random Forest, while powerful, could have been affected by overfitting or by the narrow, sentiment-based text features in the suggested solution.

#### 6. Imputation:

Your Approach:

You used imputation to fill missing values in review_helpfulness, ensuring no missing values remained before the modeling stage.

Suggested Approach:

The suggested solution handles missing data more simply: undefined helpfulness ratios are filled with 0, and missing text is replaced with an empty string before the positive words are counted.

Impact: Both approaches remove missing values before modeling, which matters because scikit-learn's Logistic Regression does not accept NaN inputs. They differ in the fill value: random sampling preserves the distribution of review_helpfulness, while filling with 0 treats books without helpfulness votes as having no helpful reviews.

#### 7. Feature Selection:

Your Approach:

You allowed the classifier to consider all available features, including scaled numeric values, one-hot encoded categorical data, and TF-IDF transformed text features.

Suggested Approach:

The suggested solution hand-crafts a small feature set: positive-word counts for each text field, the helpfulness counts and ratio, price, and one-hot encoded categories with rare ones removed. The title and authors columns are dropped.

Impact: Your approach generates a richer feature set, allowing the classifier to capture more detailed patterns in the data, while the suggested approach restricts its text features to sentiment-based counts, potentially missing out on other important signals.

#### Why Your Accuracy Is Higher:

+ Text Feature Richness: Your use of TF-IDF and n-grams likely captured more nuanced information from the text fields compared to the sentiment-focused CountVectorizer in the suggested approach.
+ Feature Scaling: Logistic Regression tends to perform better with properly scaled data, and your scaling of numeric values like price may have contributed to better accuracy.
+ Handling of Missing Values: Random imputation keeps the distribution of review_helpfulness intact instead of placing every undefined ratio at 0, which may have contributed to better model performance.

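While the suggested approach may work well for a Random Forest model, your richer preprocessing, especially for text, combined with a scaled linear model like Logistic Regression is a plausible explanation for the slightly higher accuracy. Treat the comparison with some caution, though: the gap is about half a percentage point on a single train-test split, and the suggested solution first filters out rare categories, so the two models are not scored on exactly the same rows.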
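
One way to judge whether a gap of this size is meaningful is to compare both models over several folds instead of one split. The sketch below uses synthetic data from `make_classification` as a stand-in for the preprocessed book features, so the numbers it prints say nothing about the real dataset; it only illustrates the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the preprocessed book features (an assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=120, random_state=42),
}

summary = {}
for name, model in models.items():
    # 5-fold cross-validation gives a mean accuracy and a spread per model
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the difference between the two means is smaller than the fold-to-fold spread, the ranking of the models should not be read as a reliable result.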