# 1. Feature Building 

This notebook outlines the feature-building steps. The Feature_Functions module contains the functions that extract the required features; they are imported here and applied to both the hedonic and the utilitarian dataset. Data preprocessing was already completed in the SentimentScore file, so the Sentiment and Subjectivity features are already present in the data.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Importing the feature engineering function from the Feature_Functions module

from Feature_Functions import (
    feature_building
)



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/paulahofmann/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# Importing Data
data_hedonic = pd.read_csv('/Users/paulahofmann/Documents/Coding/Online-Review/2 FeaturePreperation/Data_with_Features/Final Data/Old Data/Total_Features_Hedonic_Subj.csv')
data_utilitarian = pd.read_csv('/Users/paulahofmann/Documents/Coding/Online-Review/2 FeaturePreperation/Data_with_Features/Final Data/Old Data/Total_Features_Utilitarian_Subj.csv')

## 1.1 Building Features 

Features are built for each product category and product with the automatic feature_building function from the Feature_Functions module, which adds the 12 features needed for model training to the dataset.

These are the added features:
* Helpful Binary (Helpful_Binary):
Indicates whether a review received at least one helpful vote; if so, it is classified as a helpful review.
* Helpful Ratio (HR):
Calculates the ratio of helpful votes for each review relative to the total helpful votes across all reviews.
* POS Tag Counts:
Counts the number of adverbs, adjectives, and nouns in each review text.
* Word Count (WordC):
Calculates the total number of words in each review text.
* Sentence Count (SentC):
Counts the total number of sentences in each review text.
* Average Words per Sentence (SentL):
Calculates the average number of words per sentence in each review text.
* Title Length (TitleL):
Counts the number of characters in the title of each review. If the title is empty or consists only of special characters, it sets the length to 1.
* Flesch-Kincaid Readability Score:
Calculates the Flesch-Kincaid readability score for each review text.
* Review Extremity (RewExt):
Calculates the difference between the review rating and the average product rating.
* Elapsed Time (ElapDays):
Calculates the elapsed time (in days) since each review was posted.
* Image Check (Image):
Checks whether each review contains images and assigns a binary value (0 for no images, 1 for images).
* Verified Purchase (VerPur):
Checks whether the purchase was verified or not.
* Sentiment (Sentiment):
Classifies the overall sentiment of the review text.
* Subjectivity (Subjective):
Reflects the subjectivity of the review text.
* Adjective Ratio (AdjR):
Ratio of adjectives in the review text relative to the total word count.
* Noun Ratio (NounR):
Ratio of nouns in the review text relative to the total word count.
* Adverb Ratio (AdvR):
Ratio of adverbs in the review text relative to the total word count.
* Rating Count (RatingC):
Counts the number of ratings for each product.
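
To make a few of the definitions above concrete, here is a minimal sketch of how some of these features could be computed with pandas. It is not the code in Feature_Functions: the helper name `add_basic_features`, the `now` parameter, the regex-based sentence splitting, the signed (rather than absolute) rating difference, and the assumption that `images` is stored as a string such as `[]` are all illustrative choices.

```python
import re

import pandas as pd


def add_basic_features(df: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical helper: adds a handful of the listed features to a copy of df."""
    out = df.copy()

    # WordC: whitespace-separated tokens (assumption; the real function may tokenize differently).
    out["WordC"] = out["text"].str.split().str.len()

    # SentC: runs of '.', '!' or '?' end a sentence; at least 1 so SentL never divides by zero.
    out["SentC"] = out["text"].apply(
        lambda t: max(1, len([s for s in re.split(r"[.!?]+", t) if s.strip()]))
    )

    # SentL: average number of words per sentence.
    out["SentL"] = out["WordC"] / out["SentC"]

    # TitleL: character count; empty or special-character-only titles get length 1.
    out["TitleL"] = out["title_x"].apply(lambda t: len(t) if re.search(r"\w", t) else 1)

    # RewExt: review rating minus average product rating (signed difference is an assumption).
    out["RewExt"] = out["Rating"] - out["average_rating"]

    # ElapDays: whole days between the review timestamp and the reference date `now`.
    out["ElapDays"] = (now - pd.to_datetime(out["timestamp"])).dt.days

    # Image: 1 if the images field is anything other than an empty list literal.
    out["Image"] = (out["images"].astype(str).str.strip() != "[]").astype(int)

    # Helpful_Binary: 1 if the review has at least one helpful vote.
    out["Helpful_Binary"] = (out["helpful_vote"] >= 1).astype(int)

    # NounR: noun count relative to total word count (noun_count assumed to exist already).
    out["NounR"] = out["noun_count"] / out["WordC"]
    return out


# Small made-up sample, only to exercise the helper.
sample = pd.DataFrame({
    "title_x": ["Great buy", "!!!"],
    "text": ["Quality and price", "It broke. Twice! Not happy at all."],
    "Rating": [5.0, 1.0],
    "average_rating": [4.5, 4.0],
    "timestamp": ["2023-01-01 12:00:00", "2023-01-10 00:00:00"],
    "images": ["[]", "[{'url': 'x'}]"],
    "helpful_vote": [0, 3],
    "noun_count": [2, 0],
})
features = add_basic_features(sample, now=pd.Timestamp("2023-01-11"))
```

Passing the reference date explicitly keeps ElapDays reproducible; computing it against the current time would change the value on every run.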

In [4]:
# Dropping rows with NaN values in the text or title column (hedonic data)
data_hedonic = data_hedonic.dropna(subset=['text', 'title_x'])

# Dropping rows with NaN values in the text or title column (utilitarian data)
data_utilitarian = data_utilitarian.dropna(subset=['text', 'title_x'])
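
The cell above removes rows without reporting how many were affected. As a small sketch on made-up data (the `demo` frame and the `required` list are illustrative names, not part of the notebook), the missing values can be counted before dropping:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing title and one missing text.
demo = pd.DataFrame({
    "title_x": ["Great buy", None, "Not sanitary"],
    "text": ["Works well", "Fine", np.nan],
})

required = ["text", "title_x"]

# Number of missing values per required column, before any rows are removed.
nan_counts = demo[required].isna().sum()
print(nan_counts)

# Drop every row that is missing at least one of the required columns.
cleaned = demo.dropna(subset=required)
print(f"Dropped {len(demo) - len(cleaned)} of {len(demo)} rows")
```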

In [5]:
# Adding Features to Data Utilitarian
feature_building(data_utilitarian)

Unnamed: 0,Rating,title_x,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,Verified_Purchase,...,Image,VerPur,year,month,hour,day_of_week,is_weekend,NounR,AdjR,AdvR
0,5.0,Affordability,I use more experience rolls but this is great ...,[],B095CN96JS,B0C6TS1PGY,AG5OFHYJ3MMFRJVCWNJ7VQKRW7SA,2022-05-14 21:20:48,0,True,...,0,1,2022,5,21,5,1,0.307692,0.153846,0.000000
1,5.0,Great buy,I expected just to have some extra rolls on ha...,[],B095CN96JS,B0C6TS1PGY,AF6T7BPN3CDGPES43LTSZCFXZPAQ,2023-02-23 19:23:56,1,True,...,0,1,2023,2,19,3,0,0.111111,0.037037,0.148148
2,5.0,Good value - comparable to Angel Soft,My price line for finding deals on toilet pape...,[],B095CN96JS,B0C6TS1PGY,AFCKN7G26GYGSCJVJH7SEAZORFSA,2022-07-20 02:02:12,0,True,...,0,1,2022,7,2,2,0,0.216216,0.067568,0.027027
3,2.0,Not sanitary,Container was filthy and had huge gap exposing...,[],B095CN96JS,B0C6TS1PGY,AEDZBEEPJHOH4AFYLCYUICHJDVZA,2022-08-02 02:50:55,0,False,...,0,0,2022,8,2,1,0,0.166667,0.055556,0.055556
4,5.0,Strong and absorbent,Quality and price,[],B095CN96JS,B0C6TS1PGY,AECBOI4L6BAUSKD4W5X2VQ2O6ELQ,2022-09-30 03:18:11,0,True,...,0,1,2022,9,3,4,0,0.666667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18296,5.0,Really cool mouse,"Great mouse, got it on sale so I bought two. C...",[],B07GBZ4Q68,B0BVVTQ5JP,AEEKEPJF3AERGGKQVC2JEFV5NPDQ,2023-08-28 22:55:32,0,True,...,0,1,2023,8,22,0,0,0.166667,0.000000,0.000000
18297,3.0,the clicks are not durable,"It lasted less than a year, and I'm not a very...",[],B07GBZ4Q68,B0BVVTQ5JP,AETZILG54RWBRYELI3BCNJ2K2CBQ,2023-07-23 23:16:58,0,True,...,0,1,2023,7,23,6,1,0.153846,0.153846,0.076923
18298,5.0,Great mouse,"Premium feel, excellent sensor and the price's...",[],B07GBZ4Q68,B0BVVTQ5JP,AHMZVA2VAL5EJ4OTZ6GHMJ4JVSEQ,2023-07-25 17:01:48,0,True,...,0,1,2023,7,17,1,0,0.375000,0.250000,0.000000
18299,1.0,Double click issues after 3 months,It's now September 19th and my g502 hero is al...,[],B07GBZ4Q68,B0BVVTQ5JP,AFLS6KPPHQ57W2NIYBT2RSPT4IGQ,2019-09-20 01:45:04,1,True,...,0,1,2019,9,1,4,0,0.120000,0.146667,0.080000


In [None]:
# Adding Features to Data Hedonic 
feature_building(data_hedonic)

In [37]:
# Reordering the columns in the DataFrame

# Reordered column list
reordered_columns = [
    # Review Details
    'title_x', 'text', 'text_cleaned', 'text_cleaned1', 'images','Image', 'helpful_vote', 'total_helpful_votes', 
    'helpful_ratio', 'RewExt','Rating',
    
    # Product Information
    'asin', 'parent_asin', 'main_category', 'prod_title', 'average_rating', 'features', 'price', 
    'RatingC',  'Prod',"prod_type",
    
    # User Information
    'user_id', 'VerPur', 'verified_purchase',
    
    # Time Information
    'timestamp', 'year', 'is_weekend', 'ElapDays', 
    
    # Text Analysis
    'WordC', 'SentC', 'SentL', 'TitleL', 'noun_count', 'adj_count', 'adv_count', 
    'NounR', 'AdjR', 'AdvR','FRE', 'Sentiment', "Sentiment_Classification",'Subjective'
]

# Apply the new column order; reindex drops columns not in the list and fills listed-but-missing columns with NaN
data_hedonic = data_hedonic.reindex(columns=reordered_columns)
data_utilitarian = data_utilitarian.reindex(columns=reordered_columns)
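
Because `reindex` does not raise an error for unknown column names, a misspelled entry in the list would silently produce an all-NaN column. For instance, the truncated output above shows a column named `Verified_Purchase`, while the list contains `verified_purchase`; whether both exist in the data cannot be seen from that output. The following sketch, using a hypothetical helper `reorder_strict` and made-up data, shows the silent behaviour and one way to check the list first:

```python
import pandas as pd


def reorder_strict(df, columns):
    """Hypothetical helper: reorder columns, but fail loudly on unknown names."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")
    # Report which existing columns are left out of the new order.
    dropped = [c for c in df.columns if c not in columns]
    return df[columns], dropped


demo = pd.DataFrame({"text": ["a"], "Rating": [5.0], "month": [5]})

# reindex creates an all-NaN column for the unknown name and drops 'month' without warning.
silent = demo.reindex(columns=["Rating", "text", "verified_purchase"])

# The strict variant returns the reordered frame plus the list of dropped columns.
reordered, dropped = reorder_strict(demo, ["Rating", "text"])
```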


In [None]:
# Saving to CSV
#data_hedonic.to_csv('/Users/paulahofmann/Documents/Coding/Online-Review/FeaturePreperation/Data_with_Features/Final Data/Hedonic_Final.csv', index=False)
#data_utilitarian.to_csv('/Users/paulahofmann/Documents/Coding/Online-Review/FeaturePreperation/Data_with_Features/Final Data/Utilitarian_Final.csv', index=False)