---

#### What does this notebook, `features.py`, do? – quick linguistic feature engineering

Takes the CSV from step 1 (`filtered_reviews.csv`) and enriches each row with lightweight text statistics, then writes `tokenized_reviews.csv`.

1. **Rating difference** – `rating_diff = user_rating - avg_rating` (did the reviewer buck consensus?).
2. **Quotation flag** – does the text contain a double-quote character (`"`)? Stored in the `quote` column as a boolean (`True`/`False`, i.e. 1/0).
3. **Tokenisation** –
   * Sentence split, word split, lower-case, keep only alphabetic tokens.
   * Derive `num_words`, `avg_sent_len` (words ÷ sentences), and `avg_word_len` (letters ÷ words).
4. **Part-of-speech fractions** – run `nltk.pos_tag` and compute the fraction of words that are verbs (`pct_verbs`), nouns (`pct_nouns`), and adjectives/adverbs (`pct_adj`).
5. **Sentiment** – VADER compound score averaged across the review's sentences.
6. **Stop-word removal + lemmatisation** – store the cleaned tokens (needed later for BOW/TF-IDF).
7. **Save** – the numeric columns and the token list both go to `tokenized_reviews.csv`.

*Micro-example* (after feature step)  

| num_words | sentiment | pct_verbs | quote |
|-----------|-----------|-----------|-------|
| 220 | -0.14 | 0.12 | 1 |
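
The arithmetic behind the rating, quotation, and length features can be sketched on a single toy review. This is a minimal illustration, not the notebook's code: the helper `toy_features` and its regex-based splitters are assumptions that stand in for NLTK's `sent_tokenize` and `word_tokenize`, so counts on real text may differ.

```python
import re

def toy_features(text, user_rating, avg_rating):
    """Compute a few per-review statistics on one string.

    Hypothetical helper: regex splitting stands in for NLTK tokenisation.
    """
    # naive sentence split on '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # runs of ASCII letters, lower-cased (approximates the isalpha() filter)
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    num_words = len(words)
    return {
        "rating_diff": user_rating - avg_rating,
        "quote": '"' in text,
        "num_words": num_words,
        "avg_sent_len": num_words / len(sentences) if sentences else 0.0,
        "avg_word_len": sum(len(w) for w in words) / num_words if num_words else 0.0,
    }

features = toy_features('A fun read. She said "wow" twice!', user_rating=5, avg_rating=4.0)
print(features)
```

The toy review has two sentences and seven alphabetic words, so `avg_sent_len` is 3.5, and the embedded `"wow"` sets the quotation flag.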

---

In [2]:
!pip install swifter


In [3]:
import os
import time
import pandas as pd
import swifter
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger') # Download the missing resource
nltk.download('averaged_perceptron_tagger_eng')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
pd.options.mode.chained_assignment = None

[nltk_data] Downloading package punkt_tab to /Users/vasu/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/vasu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/vasu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/vasu/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/vasu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/vasu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [4]:
# read review data
data = pd.read_csv("../data/filtered_reviews.csv")
data = data.drop(columns=['book_id','ratings_count','review_likes','like_share'])

start_time = time.time()

# difference between user rating and average book rating
data["rating_diff"] = data["user_rating"]-data["avg_rating"]
data = data.drop(columns=['avg_rating'])

# flag if review contains a quotation
data["quote"] = data["review_text"].str.contains("\"")

# tokenize for review length (num words), avg sentence length, avg word length
data["tokenized_sents"] = data["review_text"].swifter.apply(nltk.tokenize.sent_tokenize)
data["num_sentences"] = data["tokenized_sents"].swifter.apply(len)
# lower-case every token and keep only purely alphabetic ones (drops punctuation and numbers)
data["tokenized_words"] = data["review_text"].swifter.progress_bar(True).apply(
    lambda review: [word.lower() for word in nltk.tokenize.word_tokenize(review) if word.isalpha()]
)
data["num_words"] = data["tokenized_words"].swifter.apply(len)
data["avg_sent_len"] = data["num_words"] / data["num_sentences"]
# total letters per review = sum of token lengths
data["num_letters"] = data["tokenized_words"].swifter.apply(lambda review: sum(len(word) for word in review))
data["avg_word_len"] = data["num_letters"] / data["num_words"]
data = data.drop(columns=['review_text', 'num_sentences', 'num_letters'])

# part of speech tagging
data["pos_tags"] = data["tokenized_words"].swifter.apply(nltk.pos_tag) 

total_time = time.time() - start_time
print(f"\nFeature extraction completed in: {total_time:.2f} seconds")

Pandas Apply: 100%|████████████████████████████████| 1710198/1710198 [02:23<00:00, 11895.54it/s]
Pandas Apply: 100%|██████████████████████████████| 1710198/1710198 [00:00<00:00, 3119307.24it/s]
Pandas Apply: 100%|█████████████████████████████████| 1710198/1710198 [12:38<00:00, 2253.77it/s]
Pandas Apply: 100%|██████████████████████████████| 1710198/1710198 [00:00<00:00, 2746004.24it/s]
Pandas Apply: 100%|████████████████████████████████| 1710198/1710198 [00:28<00:00, 59283.42it/s]
Pandas Apply: 100%|████████████████████████████████| 1710198/1710198 [1:17:26<00:00, 368.04it/s]



Feature extraction completed in: 5593.71 seconds


In [6]:
data.head()

Unnamed: 0,user_id,user_reviews,user_rating,days_since_review,popular,rating_diff,quote,tokenized_sents,tokenized_words,num_words,avg_sent_len,avg_word_len,pos_tags
0,8842281e1d1347389f2ab93d60773d4d,218,5,96,1,0.99,True,"[This is a special book., It started slow for ...","[this, is, a, special, book, it, started, slow...",354,17.7,4.618644,"[(this, DT), (is, VBZ), (a, DT), (special, JJ)..."
1,8842281e1d1347389f2ab93d60773d4d,218,3,353,1,-1.1,True,"[A fun, fast paced science fiction thriller., ...","[a, fun, fast, paced, science, fiction, thrill...",458,13.085714,4.303493,"[(a, DT), (fun, NN), (fast, RB), (paced, VBD),..."
2,8842281e1d1347389f2ab93d60773d4d,218,4,406,1,-0.04,True,[A fascinating book about community and belong...,"[a, fascinating, book, about, community, and, ...",494,27.444444,4.757085,"[(a, DT), (fascinating, JJ), (book, NN), (abou..."
3,8842281e1d1347389f2ab93d60773d4d,218,5,534,1,0.57,True,"[I haven't read a ton of ""history of the world...","[i, have, read, a, ton, of, history, of, the, ...",684,19.542857,4.77193,"[(i, NNS), (have, VBP), (read, VBN), (a, DT), ..."
4,8842281e1d1347389f2ab93d60773d4d,218,4,669,1,-0.31,True,"[A beautiful story., It is rare to encounter a...","[a, beautiful, story, it, is, rare, to, encoun...",253,18.071429,3.897233,"[(a, DT), (beautiful, JJ), (story, NN), (it, P..."


In [7]:
# pos tags
def count_pos(pos_tags, pos):
    counts = 0
    for word, tag in pos_tags:
        if tag and tag[0] in pos:
            counts += 1
    return counts

data["verbs"] = data["pos_tags"].swifter.progress_bar(True).apply(count_pos, pos=["V"])
data["pct_verbs"] = data["verbs"] / data["num_words"]
data["nouns"] = data["pos_tags"].swifter.progress_bar(True).apply(count_pos, pos=["N"])
data["pct_nouns"] = data["nouns"] / data["num_words"]
data["adj"] = data["pos_tags"].swifter.progress_bar(True).apply(count_pos, pos=["J", "R"])
data["pct_adj"] = data["adj"] / data["num_words"]
data = data.drop(columns=['pos_tags', 'verbs', 'nouns', 'adj'])

Pandas Apply: 100%|████████████████████████████████| 1710198/1710198 [00:44<00:00, 38452.60it/s]
Pandas Apply: 100%|████████████████████████████████| 1710198/1710198 [00:25<00:00, 66540.89it/s]
Pandas Apply: 100%|████████████████████████████████| 1710198/1710198 [00:27<00:00, 62315.05it/s]
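
`count_pos` matches on the first letter of each Penn Treebank tag, so `"V"` covers `VB`, `VBD`, `VBZ` and so on. A small example shows the effect. The tag list below is written by hand for illustration and is not output from `nltk.pos_tag`. It also exposes one side effect: the prefix `"R"` matches the particle tag `RP` as well as the adverb tags `RB*`.

```python
def count_pos(pos_tags, pos):
    # same logic as the notebook cell: compare the first letter of each tag
    counts = 0
    for word, tag in pos_tags:
        if tag and tag[0] in pos:
            counts += 1
    return counts

# hand-written (word, tag) pairs, hypothetical example data
tagged = [
    ("she", "PRP"), ("quickly", "RB"), ("gave", "VBD"), ("up", "RP"),
    ("the", "DT"), ("long", "JJ"), ("book", "NN"),
]

num_words = len(tagged)
pct_verbs = count_pos(tagged, ["V"]) / num_words
pct_nouns = count_pos(tagged, ["N"]) / num_words
pct_adj = count_pos(tagged, ["J", "R"]) / num_words  # counts RB, RP and JJ

print(pct_verbs, pct_nouns, pct_adj)
```

If particles should not count toward `pct_adj`, one option is to match on the two-letter prefixes `"JJ"` and `"RB"` instead of single letters.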


In [10]:
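# sentiment analysis
# build the analyzer once; constructing it inside the function reloads the VADER lexicon for every review
sid = SentimentIntensityAnalyzer()

def review_sentiment(review_sents):
    # mean VADER compound score over the sentences of one review
    if not review_sents:
        return 0.0  # no sentences to score; treat as neutral
    comptot = 0
    for sentence in review_sents:
        scores = sid.polarity_scores(sentence)
        comptot += scores['compound']
    return comptot / len(review_sents)

# Apply sentiment analysis with a progress bar
# (swifter fell back to a plain pandas apply here, as the "Pandas Apply" label in the output shows)
data["sentiment"] = data["tokenized_sents"].swifter.progress_bar(True).apply(review_sentiment)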

# Drop the tokenized sentences column
data = data.drop(columns=['tokenized_sents'])

Pandas Apply: 100%|████████████████████████████████| 1710198/1710198 [1:44:16<00:00, 273.36it/s]
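
The per-review score is the mean of the per-sentence compound scores. The sketch below isolates that averaging step. `mean_compound`, `fake_compound`, and the hard-coded scores are placeholders invented for illustration (they are not VADER outputs), so the example runs without any NLTK data.

```python
def mean_compound(sentences, score_fn):
    """Average a per-sentence score over one review (0.0 if there are no sentences)."""
    if not sentences:
        return 0.0
    return sum(score_fn(s) for s in sentences) / len(sentences)

# placeholder scores standing in for sid.polarity_scores(s)["compound"]
FAKE_SCORES = {
    "Loved it.": 0.6,
    "The ending dragged.": -0.2,
    "It has 300 pages.": 0.0,
}

def fake_compound(sentence):
    return FAKE_SCORES[sentence]

review = ["Loved it.", "The ending dragged.", "It has 300 pages."]
avg = mean_compound(review, fake_compound)
print(round(avg, 4))
```

One consequence of averaging is that neutral sentences pull the score toward zero, so a review with many factual sentences can end up with a small absolute sentiment even when a few sentences are strongly worded.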


In [11]:
data.head()

Unnamed: 0,user_id,user_reviews,user_rating,days_since_review,popular,rating_diff,quote,tokenized_words,num_words,avg_sent_len,avg_word_len,pct_verbs,pct_nouns,pct_adj,sentiment
0,8842281e1d1347389f2ab93d60773d4d,218,5,96,1,0.99,True,"[this, is, a, special, book, it, started, slow...",354,17.7,4.618644,0.19209,0.225989,0.163842,0.12365
1,8842281e1d1347389f2ab93d60773d4d,218,3,353,1,-1.1,True,"[a, fun, fast, paced, science, fiction, thrill...",458,13.085714,4.303493,0.209607,0.235808,0.120087,0.096211
2,8842281e1d1347389f2ab93d60773d4d,218,4,406,1,-0.04,True,"[a, fascinating, book, about, community, and, ...",494,27.444444,4.757085,0.15587,0.275304,0.147773,0.047972
3,8842281e1d1347389f2ab93d60773d4d,218,5,534,1,0.57,True,"[i, have, read, a, ton, of, history, of, the, ...",684,19.542857,4.77193,0.19883,0.248538,0.141813,0.220063
4,8842281e1d1347389f2ab93d60773d4d,218,4,669,1,-0.31,True,"[a, beautiful, story, it, is, rare, to, encoun...",253,18.071429,3.897233,0.209486,0.205534,0.114625,0.0685


In [17]:
# further text processing
# remove stop words

start_time = time.time()

# a set gives constant-time membership tests; the raw NLTK list would be scanned for every token
stopwords = set(nltk.corpus.stopwords.words('english'))
data["tokenized_words"] = data["tokenized_words"].swifter.progress_bar(True).apply(
    lambda review: [word for word in review if word not in stopwords]
)

# lemmatization (lemmatize() defaults to pos="n", so verb forms such as "written" are left as-is)
wnl = nltk.stem.wordnet.WordNetLemmatizer()
data["tokenized_words"] = data["tokenized_words"].swifter.progress_bar(True).apply(
    lambda review: [wnl.lemmatize(word) for word in review]
)

# reorder columns
data = data[["popular", "user_reviews", "days_since_review", "user_rating", "rating_diff",
             "num_words", "avg_word_len", "avg_sent_len", "pct_verbs", "pct_nouns", "pct_adj",
             "quote", "sentiment", "tokenized_words"]]

total_time = time.time() - start_time
print(f"\nStop-word removal and lemmatisation completed in: {total_time:.2f} seconds")

Pandas Apply: 100%|█████████████████████████████████| 1710198/1710198 [05:50<00:00, 4874.39it/s]
Pandas Apply: 100%|█████████████████████████████████| 1710198/1710198 [04:24<00:00, 6461.14it/s]



Stop-word removal and lemmatisation completed in: 795.99 seconds


In [None]:
# save dataset
save_path = '../data/tokenized_reviews.csv'
data.to_csv(save_path, index=False)

### Read the saved data

In [5]:
data = pd.read_csv(save_path)

In [6]:
data.shape

(1710198, 14)

In [7]:
data.sample(5)

Unnamed: 0,popular,user_reviews,days_since_review,user_rating,rating_diff,num_words,avg_word_len,avg_sent_len,pct_verbs,pct_nouns,pct_adj,quote,sentiment,tokenized_words
1159061,1,206,654,0,-3.0,14,3.785714,4.666667,0.142857,0.357143,0.142857,False,0.0,"['wow', 'really', 'limit', 'crap', 'people', '..."
1194985,1,256,956,5,1.22,416,4.293269,52.0,0.204327,0.206731,0.161058,True,0.3104,"['top', 'six', 'reason', 'v', 'virgin', 'suck'..."
872076,1,557,814,4,-0.43,16,3.8125,16.0,0.0,0.125,0.4375,False,0.0243,"['quite', 'good', 'first', 'four', 'still', 's..."
1639315,0,34,1017,5,0.93,73,4.287671,12.166667,0.178082,0.232877,0.205479,False,0.32,"['brilliantly', 'written', 'book', 'well', 'cr..."
1380611,0,197,1117,3,-0.86,161,3.763975,32.2,0.204969,0.223602,0.161491,False,0.0614,"['favorite', 'book', 'ever', 'think', 'writing..."
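
Because CSV stores every cell as text, `tokenized_words` comes back as the string representation of a list (note the quoted `"['wow', ...]"` values above) rather than as a Python list. Below is a minimal sketch of converting such a cell back using only the standard library. The sample row is made up for illustration.

```python
import ast
import csv
import io

# a made-up two-column CSV that mimics how the token list is serialised
raw = 'num_words,tokenized_words\n3,"[\'wow\', \'really\', \'limit\']"\n'

rows = list(csv.DictReader(io.StringIO(raw)))
cell = rows[0]["tokenized_words"]
print(type(cell).__name__)  # the cell is a plain string after reading

# literal_eval parses Python literals (lists, strings, numbers) without executing code
tokens = ast.literal_eval(cell)
print(tokens)
```

With pandas, the same function could be supplied through the `converters` argument of `pd.read_csv`, for example `converters={"tokenized_words": ast.literal_eval}`, at the cost of a slower load on a file this size.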
