![bookstore](bookstore.jpg)


Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

* `price`
* `popularity` (target variable)
* `review/summary`
* `review/text`
* `review/helpfulness`
* `authors`
* `categories`

You'll need to build a model that predicts whether a book will be rated as popular or not.

Expectations are high: the bookstore has set a target of at least 70% accuracy. You are free to use as many features as you like, and you will need to engineer new features to reach this level of performance.
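
To put an accuracy target in context, it can help to know what a trivial model would score. The snippet below is a minimal sketch of a majority-class baseline: the accuracy obtained by always predicting the most frequent label. The function name `majority_baseline_accuracy` and the toy labels are illustrative assumptions, not part of the bookstore data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical labels for illustration only (not taken from books.csv)
toy_labels = ["Unpopular", "Unpopular", "Popular", "Unpopular", "Popular"]

baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline: {baseline:.2f}")  # prints 0.60 for these toy labels
```

On the real data, the same idea could be applied to the `popularity` column; a model is only useful if it beats this number by a clear margin.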

In [38]:
# Import some required packages
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
from sklearn.ensemble import RandomForestClassifier

# Ensure you have the necessary NLTK data files
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

# Read in the dataset
books = pd.read_csv("data/books.csv")

# Preview the first five rows
books.head()

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/repl/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/repl/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


| | title | price | review/helpfulness | review/summary | review/text | description | authors | categories | popularity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | We Band of Angels: The Untold Story of America... | 10.88 | 2/3 | A Great Book about women in WWII | I have alway been a fan of fiction books set i... | In the fall of 1941, the Philippines was a gar... | 'Elizabeth Norman' | 'History' | Unpopular |
| 1 | Prayer That Brings Revival: Interceding for Go... | 9.35 | 0/0 | Very helpful book for church prayer groups and... | Very helpful book to give you a better prayer ... | In Prayer That Brings Revival, best-selling au... | 'Yong-gi Cho' | 'Religion' | Unpopular |
| 2 | The Mystical Journey from Jesus to Christ | 24.95 | 17/19 | Universal Spiritual Awakening Guide With Some ... | The message of this book is to find yourself a... | THE MYSTICAL JOURNEY FROM JESUS TO CHRIST Disc... | 'Muata Ashby' | 'Body, Mind & Spirit' | Unpopular |
| 3 | Death Row | 7.99 | 0/1 | Ben Kincaid tries to stop an execution. | The hero of William Bernhardt's Ben Kincaid no... | Upon receiving his execution date, one of the ... | 'Lynden Harris' | 'Social Science' | Unpopular |
| 4 | Sound and Form in Modern Poetry: Second Editio... | 32.5 | 18/20 | good introduction to modern prosody | There's a lot in this book which the reader wi... | An updated and expanded version of a classic a... | 'Harvey Seymour Gross', 'Robert McDowell' | 'Poetry' | Unpopular |


In [39]:
# Inspect the data: print dtypes and non-null counts, then count missing values per column
# (info() prints its report and returns None, so it is called on its own rather than inside display())
books.info()
display(books.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15719 entries, 0 to 15718
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               15719 non-null  object 
 1   price               15719 non-null  float64
 2   review/helpfulness  15719 non-null  object 
 3   review/summary      15719 non-null  object 
 4   review/text         15719 non-null  object 
 5   description         15719 non-null  object 
 6   authors             15719 non-null  object 
 7   categories          15719 non-null  object 
 8   popularity          15719 non-null  object 
dtypes: float64(1), object(8)
memory usage: 1.1+ MB


title                 0
price                 0
review/helpfulness    0
review/summary        0
review/text           0
description           0
authors               0
categories            0
popularity            0
dtype: int64

In [40]:
# Engineer features for the dataset
# One-hot encode the authors and categories columns
# (note: oh_cols is not added to the feature matrix built later in this notebook)
oh_cols = pd.get_dummies(books[['authors','categories']])

# Encode the target variable as an integer: 'Popular' = 1, anything else = 0
books['popularity'] = (books['popularity'] == 'Popular').astype(int)

# Split review/helpfulness (e.g. "2/3") into two numeric columns:
# the value before the slash and the value after it
books[['review_score', 'helpfulness']] = books['review/helpfulness'].str.split('/', expand=True).astype(float)

# Drop columns that are no longer needed
books.drop(columns=['authors','categories','review/helpfulness'], inplace=True)
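
The two numbers in `review/helpfulness` can also be combined into a single ratio feature. The snippet below is a minimal sketch on toy data, assuming the value before the slash counts helpful votes and the value after it counts total votes (the notebook does not state this). The column names `helpful_votes`, `total_votes`, and `helpful_ratio` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows in the same "a/b" format as the review/helpfulness column
toy = pd.DataFrame({"review/helpfulness": ["2/3", "0/0", "17/19"]})

# Split into two float columns (hypothetical names, assumed meaning)
votes = toy["review/helpfulness"].str.split("/", expand=True).astype(float)
votes.columns = ["helpful_votes", "total_votes"]

# Replace a zero denominator with NaN so that 0/0 does not produce a
# misleading value, then treat "no votes" as a ratio of 0.0
safe_total = votes["total_votes"].replace(0, np.nan)
toy["helpful_ratio"] = (votes["helpful_votes"] / safe_total).fillna(0.0)

print(toy)
```

Filling the "no votes" case with 0.0 is one possible choice; keeping it as a separate indicator column would be another.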

In [41]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define a function to preprocess text: tokenize, lemmatize, and remove stopwords
def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Convert text to lowercase and tokenize
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(lemmatized_tokens)

In [34]:
# Preprocess text columns
text_columns = ['title', 'review/summary', 'review/text', 'description']

for col in text_columns:
    books[col] = books[col].apply(preprocess_text)

In [35]:
# Define feature columns
numerical_features = ['price', 'review_score', 'helpfulness']
text_features = ['title', 'review/summary', 'review/text', 'description']

# Initialize TF-IDF Vectorizer for text columns
tfidf = TfidfVectorizer(max_features=500)

# StandardScaler for numerical features
scaler = StandardScaler()

# Scale the numerical features
# (note: the scaler and the TF-IDF vectorizer are fitted on the full dataset, before the train/test split)
X_num = scaler.fit_transform(books[numerical_features])

# Apply TF-IDF to each text column (the vectorizer is refitted per column)
# and concatenate the resulting matrices column-wise
X_text = pd.DataFrame()
for col in text_features:
    X_tfidf = tfidf.fit_transform(books[col]).toarray()
    X_text = pd.concat([X_text, pd.DataFrame(X_tfidf)], axis=1)
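
The cell above fits the scaler and the vectorizer on every row before any split is made. One alternative, sketched here on made-up data, is to wrap the preprocessing and the classifier in a scikit-learn `Pipeline` with a `ColumnTransformer`, so that each transformer is fitted on the training rows only. The toy DataFrame, the step names, and the hyperparameters are assumptions for illustration, and this is not the approach that produced the notebook's reported result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the preprocessed books DataFrame
toy = pd.DataFrame({
    "price": [10.0, 25.0, 8.0, 30.0, 12.0, 22.0, 9.0, 28.0],
    "review/summary": ["great read", "dull plot", "great story", "dull and slow",
                       "loved it", "slow start", "great fun", "dull prose"],
    "review/text": ["could not put it down", "hard to finish", "wonderful characters",
                    "boring chapters", "a joy to read", "boring and long",
                    "wonderful pacing", "hard going"],
    "popularity": [1, 0, 1, 0, 1, 0, 1, 0],
})

X = toy.drop(columns="popularity")
y = toy["popularity"]

# One transformer per column: a list selects a 2D frame for the scaler,
# a plain string selects a 1D series of documents for each vectorizer
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("summary", TfidfVectorizer(max_features=500), "review/summary"),
    ("text", TfidfVectorizer(max_features=500), "review/text"),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# fit() learns scaling statistics and vocabularies from X_train only;
# predict() reuses them on X_test without refitting
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(preds)
```

Using a separate `TfidfVectorizer` per column also keeps each column's vocabulary available after fitting, which a single reused vectorizer does not.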

In [36]:
# Combine processed numerical and text features
X = pd.concat([pd.DataFrame(X_num), X_text], axis=1)

# Target variable
y = books['popularity']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the classifier
model_accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {model_accuracy:.2f}")

Model accuracy: 0.78
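
Accuracy alone does not show how the errors are distributed between the two classes. As a minimal sketch, the snippet below computes a confusion matrix with scikit-learn on made-up labels; the arrays `toy_true` and `toy_pred` are illustrative and are not the notebook's `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (1 = Popular, 0 = Unpopular)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
acc = accuracy_score(toy_true, toy_pred)

print(cm)                       # [[3 1]
                                #  [1 3]]
print(f"Accuracy: {acc:.2f}")   # Accuracy: 0.75
```

The same two calls could be applied to `y_test` and `y_pred` from the cell above to see whether the classifier's errors fall mostly on popular or unpopular books.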
