In [1]:
# Importing feature engineering functions from the Feature_Functions module

from Feature_Functions import (
    calculate_helpful_ratio,
    count_pos_tags,
    word_count,
    sentence_count,
    average_words_per_sentence,
    title_length,
    calculate_flesch_reading_score,
    calculate_review_extremity,
    calculate_elapsed_time,
    image_check,
    extract_timestamp,
    verified_purchase,
    feature_building
)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/paulahofmann/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
import pandas as pd

# Importing data
data_hedonic = pd.read_csv('/Users/paulahofmann/Documents/Coding/Online-Review/FeaturePreperation/Data_with_Features/Final Data/Hedonic_Final.csv')
data_utilitarian = pd.read_csv('/Users/paulahofmann/Documents/Coding/Online-Review/FeaturePreperation/Data_with_Features/Final Data/Utilitarian_Final.csv')

## 1. Building Features 

Features are built for each product category and product with the `feature_building` function from the `Feature_Functions` module, which automatically adds the 12 features needed for model training to the DataFrame.

These are the added features:
* Helpful Ratio (HR):
Calculates the ratio of helpful votes for each review relative to the total helpful votes across all reviews.
* POS Tag Counts:
Counts the number of adverbs, adjectives, and nouns in each review text.
* Word Count:
Calculates the total number of words in each review text.
* Sentence Count:
Counts the total number of sentences in each review text.
* Average Words per Sentence:
Calculates the average number of words per sentence in each review text.
* Title Length (TL):
Counts the number of characters in the title of each review. If the title is empty or consists only of special characters, it sets the length to 1.
* Flesch-Kincaid Readability Score:
Calculates the Flesch-Kincaid readability score for each review text.
* Review Extremity:
Calculates the difference between the review rating and the average product rating.
* Elapsed Time:
Calculates the elapsed time (in days) since each review was posted.
* Image Check:
Checks whether each review contains images and assigns a binary value (0 for no images, 1 for images).
* Verified Purchase:
Checks whether the purchase was verified or not.
* Day of Week:
Extracts the day of the week on which each review was posted.
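
Several of these features reduce to short pandas expressions. The sketch below is an illustration only, not the code in `Feature_Functions`. It assumes the column names that appear elsewhere in this notebook (`helpful_vote`, `title_x`, `rating`, `average_rating`, `images`), that the helpful ratio is taken over the whole DataFrame, that "special characters" means non-alphanumeric characters, and that `images` is stored as the string `'[]'` when a review has no images. The helper name `sketch_basic_features` and the toy data are hypothetical.

```python
import pandas as pd


def sketch_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of four features; not the Feature_Functions code."""
    out = df.copy()

    # Helpful ratio: share of all helpful votes that this review received.
    total_votes = out["helpful_vote"].sum()
    out["helpful_ratio"] = out["helpful_vote"] / total_votes if total_votes else 0.0

    # Title length in characters; titles without any letter or digit count as 1.
    def title_len(title: str) -> int:
        return len(title) if any(ch.isalnum() for ch in title) else 1

    out["title_length"] = out["title_x"].astype(str).map(title_len)

    # Review extremity: review rating minus the product's average rating.
    out["review_ext"] = out["rating"] - out["average_rating"]

    # Image flag: 1 if the stored image list is non-empty, else 0.
    out["image"] = (out["images"].astype(str).str.strip() != "[]").astype(int)
    return out


# Toy data (made up for illustration).
demo = pd.DataFrame({
    "helpful_vote": [1, 3, 0],
    "title_x": ["Great buy", "!!!", "Ok"],
    "rating": [5.0, 2.0, 4.0],
    "average_rating": [4.5, 4.5, 4.5],
    "images": ["[]", "[{'small_image_url': 'x'}]", "[]"],
})
features = sketch_basic_features(demo)
```

The sketch works on a copy, so the input frame is left unchanged; whether `feature_building` modifies its argument in place is not shown in this notebook.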


In [3]:
# Dropping rows with NaN values in the text or title column
data_hedonic = data_hedonic.dropna(subset=['text', 'title_x'])

# Same for the utilitarian data
data_utilitarian = data_utilitarian.dropna(subset=['text', 'title_x'])
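
Because `dropna` removes rows silently, it can help to count the missing values first. A minimal sketch on a toy frame, using the `text` and `title_x` column names from the cell above; the helper name `drop_missing_reviews` is hypothetical.

```python
import pandas as pd


def drop_missing_reviews(df: pd.DataFrame, cols=("text", "title_x")) -> pd.DataFrame:
    """Report NaN counts for the given columns, then drop the affected rows."""
    cols = list(cols)
    print(df[cols].isna().sum().to_dict())  # NaN count per column
    return df.dropna(subset=cols)


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "text": ["Works well", None, "Too small"],
    "title_x": ["Good", "Bad", None],
})
cleaned = drop_missing_reviews(toy)
```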

In [4]:
# Parsing the timestamp column as datetime, then storing it as a string without fractional seconds
data_utilitarian['timestamp'] = pd.to_datetime(data_utilitarian['timestamp'], format='%Y-%m-%d %H:%M:%S.%f')
data_utilitarian['timestamp'] = data_utilitarian['timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S')
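
The `day_of_week` and `is_weekend` columns in the outputs below are consistent with pandas' Monday-is-0 convention: 2022-05-14, a Saturday, appears with `day_of_week` 5 and `is_weekend` 1. The following sketch shows how such columns could be derived under that assumption. The actual derivation lives in `Feature_Functions` and is not shown in this notebook.

```python
import pandas as pd

# Two sample timestamps in the format parsed above.
ts = pd.Series(["2022-05-14 21:20:48.000", "2023-02-23 19:23:56.000"])

# Parse with the same explicit format as in the cell above.
parsed = pd.to_datetime(ts, format="%Y-%m-%d %H:%M:%S.%f")

day_of_week = parsed.dt.dayofweek            # Monday = 0 ... Sunday = 6
is_weekend = (day_of_week >= 5).astype(int)  # 1 for Saturday or Sunday

# Formatting back to text drops the fractional seconds.
as_text = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")
```

Note that after `strftime` the column holds strings again, so any later `.dt` access would need another `pd.to_datetime` call.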

In [5]:
# Adding features to the utilitarian data
feature_building(data_utilitarian)

Unnamed: 0,rating,title_x,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,...,day_of_week,is_weekend,product,ver_purch,#nouns,#adj,#adv,subjective_score,neutral_score,prod_type
0,5.0,Affordability,I use more experience rolls but this is great ...,[],B095CN96JS,B0C6TS1PGY,AG5OFHYJ3MMFRJVCWNJ7VQKRW7SA,2022-05-14 21:20:48,0,True,...,5,1,Toilet Paper,1,0.307692,0.153846,0.000000,0.762744,0.237256,0
1,5.0,Great buy,I expected just to have some extra rolls on ha...,[],B095CN96JS,B0C6TS1PGY,AF6T7BPN3CDGPES43LTSZCFXZPAQ,2023-02-23 19:23:56,1,True,...,3,0,Toilet Paper,1,0.111111,0.037037,0.148148,0.969671,0.030329,0
2,5.0,Good value - comparable to Angel Soft,My price line for finding deals on toilet pape...,[],B095CN96JS,B0C6TS1PGY,AFCKN7G26GYGSCJVJH7SEAZORFSA,2022-07-20 02:02:12,0,True,...,2,0,Toilet Paper,1,0.216216,0.067568,0.027027,0.793879,0.206121,0
3,2.0,Not sanitary,Container was filthy and had huge gap exposing...,[],B095CN96JS,B0C6TS1PGY,AEDZBEEPJHOH4AFYLCYUICHJDVZA,2022-08-02 02:50:55,0,False,...,1,0,Toilet Paper,0,0.166667,0.055556,0.055556,0.797570,0.202430,0
4,5.0,Strong and absorbent,Quality and price,[],B095CN96JS,B0C6TS1PGY,AECBOI4L6BAUSKD4W5X2VQ2O6ELQ,2022-09-30 03:18:11,0,True,...,4,0,Toilet Paper,1,0.666667,0.000000,0.000000,0.210694,0.789306,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22086,5.0,Works wonders,Does exactly as it’s meant to,[],B07L61DNMR,B01EY96W72,AFCCDIPYZ5GJVTZKSHJJM26IFDLA,2023-07-27 04:57:33,0,True,...,3,0,Hair Brush,1,0.000000,0.000000,0.166667,0.581350,0.418650,0
22087,5.0,Resultado,"Me encanto, no daña! Súper espectacular",[],B00JJ7T2V8,B01EY96W72,AESHTS76OFZSD7BRF4T6J4NCLUUQ,2023-06-02 17:52:42,1,True,...,4,0,Hair Brush,1,0.166667,0.166667,0.000000,0.846064,0.153936,0
22088,5.0,Amazing!!!,"Amazing, works perfect for wet hair and don’t ...",[],B09TPRMPKT,B01EY96W72,AG2WVRMZTHQP2RWHOXPOBRNKYCPA,2023-03-19 20:34:39,0,True,...,6,1,Hair Brush,1,0.142857,0.214286,0.000000,0.978454,0.021546,0
22089,5.0,This brush has lifted a weight off my shoulders,No my toddler still did not like me brushing h...,[],B07VHP5Y6S,B01EY96W72,AH2ZV3NOTXBNF5SZUTWTOVJOMHUA,2023-07-06 01:33:15,1,True,...,3,0,Hair Brush,1,0.213115,0.032787,0.147541,0.989107,0.010893,0


In [6]:
# Adding features to the hedonic data
feature_building(data_hedonic)

Unnamed: 0,rating,title_x,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,...,day_of_week,is_weekend,product,ver_purch,#nouns,#adj,#adv,subjective_score,neutral_score,prod_type
0,5.0,Love this,My kids have so much fun with this game. Its a...,[],B01N1081RO,B08JHZHWZ3,AE6YLEEPJ47WLVVHEJ4CSBONBVBA,2021-01-13 16:07:30,1,True,...,2,0,Video Games,1,0.277778,0.111111,0.111111,0.956940,0.043060,1
1,5.0,The fun games that you remember.. now on the N...,These are 3 of the classic 3-D Mario games.. t...,[],B08G3MN6KP,B08JHZHWZ3,AHTBBASAHXHHOXKLSSZG2IPUDDFA,2020-11-15 00:23:29,0,True,...,6,1,Video Games,1,0.215686,0.137255,0.039216,0.917250,0.082750,1
2,5.0,So much fun!!,I remember being in 5th grade when Mario 64 ca...,[],B08G3MN6KP,B08JHZHWZ3,AE5YCFORXJBSGKBKXSUL53BGJW2A,2020-09-19 16:03:14,0,False,...,5,1,Video Games,0,0.068966,0.080460,0.103448,0.913347,0.086653,1
3,5.0,Wish this had more...,I remember the joy of playing the SNES All-Sta...,[],B08G3MN6KP,B08JHZHWZ3,AFKVE5HRFK3X27DPXME72PUR5NXQ,2021-01-03 15:18:48,0,True,...,6,1,Video Games,1,0.137615,0.073394,0.119266,0.974091,0.025909,1
4,5.0,It is the physical copy and it is a good price.,It's the real deal!!,[],B08G3MN6KP,B08JHZHWZ3,AGNXQ6O3UYR4DCO27DTY7XEAL4UQ,2020-09-21 17:01:37,0,True,...,0,0,Video Games,1,0.250000,0.250000,0.000000,0.952895,0.047105,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22206,5.0,Great price,Smells so good,[],B000P22TIY,B0BJMV1QTR,AFYSDN2YELEGJ45YZM4Z4PXZRIBQ,2019-02-19 00:11:36,0,True,...,1,0,Perfume,1,0.000000,0.333333,0.333333,0.964594,0.035406,1
22207,1.0,The box smelled good...,"Lid busted to pieces, cologne everywhere...hop...",[{'small_image_url': 'https://m.media-amazon.c...,B08F3W312Q,B0BJMV1QTR,AGECJEQLXR5BEGW3VXVQDID4NWSA,2023-06-27 20:30:56,0,True,...,1,0,Perfume,1,0.230769,0.076923,0.076923,0.873186,0.126814,1
22208,1.0,The box smelled good...,"Lid busted to pieces, cologne everywhere...hop...",[{'small_image_url': 'https://m.media-amazon.c...,B08F3W312Q,B0BJMV1QTR,AGECJEQLXR5BEGW3VXVQDID4NWSA,2023-06-27 20:30:56,0,True,...,1,0,Perfume,1,0.230769,0.076923,0.076923,0.873186,0.126814,1
22209,5.0,Great for the price,My boyfriend loves it,[],B000P22TIY,B0BJMV1QTR,AHIGXX6JORKVB77PHXVYZKDAHQ3A,2019-07-07 21:36:27,0,False,...,6,1,Perfume,0,0.250000,0.000000,0.000000,0.561413,0.438587,1


In [7]:
# Transforming adverb, adjective, and noun counts into ratios of the word count for better comparability
def calculate_ratios(data):
    data['#nouns'] = data['noun_count'] / data['word_count']
    data['#adj'] = data['adj_count'] / data['word_count']
    data['#adv'] = data['adv_count'] / data['word_count']
    return data

# Applying the function to the DataFrame
calculate_ratios(data_hedonic)
calculate_ratios(data_utilitarian)


Unnamed: 0,rating,title_x,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,...,day_of_week,is_weekend,product,ver_purch,#nouns,#adj,#adv,subjective_score,neutral_score,prod_type
0,5.0,Affordability,I use more experience rolls but this is great ...,[],B095CN96JS,B0C6TS1PGY,AG5OFHYJ3MMFRJVCWNJ7VQKRW7SA,2022-05-14 21:20:48,0,True,...,5,1,Toilet Paper,1,0.307692,0.153846,0.000000,0.762744,0.237256,0
1,5.0,Great buy,I expected just to have some extra rolls on ha...,[],B095CN96JS,B0C6TS1PGY,AF6T7BPN3CDGPES43LTSZCFXZPAQ,2023-02-23 19:23:56,1,True,...,3,0,Toilet Paper,1,0.111111,0.037037,0.148148,0.969671,0.030329,0
2,5.0,Good value - comparable to Angel Soft,My price line for finding deals on toilet pape...,[],B095CN96JS,B0C6TS1PGY,AFCKN7G26GYGSCJVJH7SEAZORFSA,2022-07-20 02:02:12,0,True,...,2,0,Toilet Paper,1,0.216216,0.067568,0.027027,0.793879,0.206121,0
3,2.0,Not sanitary,Container was filthy and had huge gap exposing...,[],B095CN96JS,B0C6TS1PGY,AEDZBEEPJHOH4AFYLCYUICHJDVZA,2022-08-02 02:50:55,0,False,...,1,0,Toilet Paper,0,0.166667,0.055556,0.055556,0.797570,0.202430,0
4,5.0,Strong and absorbent,Quality and price,[],B095CN96JS,B0C6TS1PGY,AECBOI4L6BAUSKD4W5X2VQ2O6ELQ,2022-09-30 03:18:11,0,True,...,4,0,Toilet Paper,1,0.666667,0.000000,0.000000,0.210694,0.789306,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22086,5.0,Works wonders,Does exactly as it’s meant to,[],B07L61DNMR,B01EY96W72,AFCCDIPYZ5GJVTZKSHJJM26IFDLA,2023-07-27 04:57:33,0,True,...,3,0,Hair Brush,1,0.000000,0.000000,0.166667,0.581350,0.418650,0
22087,5.0,Resultado,"Me encanto, no daña! Súper espectacular",[],B00JJ7T2V8,B01EY96W72,AESHTS76OFZSD7BRF4T6J4NCLUUQ,2023-06-02 17:52:42,1,True,...,4,0,Hair Brush,1,0.166667,0.166667,0.000000,0.846064,0.153936,0
22088,5.0,Amazing!!!,"Amazing, works perfect for wet hair and don’t ...",[],B09TPRMPKT,B01EY96W72,AG2WVRMZTHQP2RWHOXPOBRNKYCPA,2023-03-19 20:34:39,0,True,...,6,1,Hair Brush,1,0.142857,0.214286,0.000000,0.978454,0.021546,0
22089,5.0,This brush has lifted a weight off my shoulders,No my toddler still did not like me brushing h...,[],B07VHP5Y6S,B01EY96W72,AH2ZV3NOTXBNF5SZUTWTOVJOMHUA,2023-07-06 01:33:15,1,True,...,3,0,Hair Brush,1,0.213115,0.032787,0.147541,0.989107,0.010893,0
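
If a review has a `word_count` of 0, the division in `calculate_ratios` yields `NaN` or `inf`. Whether such rows exist in this data is not shown here. The following is a hedged variant that maps those cases to 0.0; the function name `calculate_ratios_safe` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def calculate_ratios_safe(data: pd.DataFrame) -> pd.DataFrame:
    """Like calculate_ratios, but returns 0.0 where word_count is 0."""
    words = data["word_count"].replace(0, np.nan)  # avoid division by zero
    pairs = [("noun_count", "#nouns"), ("adj_count", "#adj"), ("adv_count", "#adv")]
    for src, dst in pairs:
        data[dst] = (data[src] / words).fillna(0.0)
    return data


# Toy data (made up for illustration); the second row has no words.
toy = pd.DataFrame({
    "noun_count": [4, 0],
    "adj_count": [2, 0],
    "adv_count": [0, 0],
    "word_count": [8, 0],
})
calculate_ratios_safe(toy)
```

Like the original function, this variant modifies the DataFrame in place and also returns it.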


In [8]:
data_hedonic.to_csv('/Users/paulahofmann/Documents/Coding/Online-Review/FeaturePreperation/Data_with_Features/Final Data/Hedonic_Final.csv', index=False)
data_utilitarian.to_csv('/Users/paulahofmann/Documents/Coding/Online-Review/FeaturePreperation/Data_with_Features/Final Data/Utilitarian_Final.csv', index=False)

In [None]:
# Reordering the columns in the DataFrame

# Reordered column list
reordered_columns = ['rating', 'title_x', 'text', 'images', 'asin', 'parent_asin', 'user_id',
       'timestamp', 'helpful_vote', 'verified_purchase', 'text_cleaned',
       'text_cleaned1', 'sentiment', 'main_category', 'prod_title',
       'average_rating', 'rat_count', 'features', 'price', 'helpful_ratio',
       'noun_count', 'adj_count', 'adv_count', 'word_count', 'sent_count',
       'sent_length', 'title_length', 'FRE', 'review_ext', 'elap_days',
       'image', 'year', 'month', 'day', 'hour', 'product', 'ver_purch',
       '#nouns', '#adj', '#adv']

# Apply reindexing to DataFrame
data = data.reindex(columns=reordered_columns)
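
Note that `reindex(columns=...)` keeps only the listed columns: any column missing from the list is dropped, and any listed name that does not exist in the DataFrame is created and filled with `NaN`. A small sketch of a check that could be run before reindexing; the helper name `check_reorder` and the toy frame are hypothetical.

```python
import pandas as pd


def check_reorder(df: pd.DataFrame, columns: list) -> tuple:
    """Return (columns that would be dropped, columns that would be all-NaN)."""
    dropped = [c for c in df.columns if c not in columns]
    created = [c for c in columns if c not in df.columns]
    return dropped, created


# Toy data (made up for illustration).
toy = pd.DataFrame({"rating": [5.0], "text": ["ok"], "is_weekend": [0]})
wanted = ["text", "rating", "price"]

dropped, created = check_reorder(toy, wanted)
reordered = toy.reindex(columns=wanted)
```

Here `is_weekend` would be dropped and `price` would appear as an all-`NaN` column.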


In [None]:
print(data.columns)

In [None]:
## Summarizing all Features in a List

input_features = ['rating','rating_number','timestamp', 'sentiment', 'price', 'noun_count', 'adj_count', 'adv_count', 'word_count', 
                  'sentence_count', 'avg_words_per_sentence', 'title_length', 'F-K_score', 'review_extremity', 
                  'elapsed_time_days', 'image', 'year','month','day','hour']

output_feature = 'helpful_ratio'
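
Before training, it may be worth confirming that every name in `input_features` exists in the DataFrame, since several names in this list (for example `sentence_count`) differ from the column names in the reordering cell (`sent_count`). A minimal sketch of such a check; the helper name `split_features` and the toy frame are hypothetical.

```python
import pandas as pd


def split_features(df: pd.DataFrame, inputs: list, target: str):
    """Return (X, y), raising a KeyError that names any missing column."""
    missing = [c for c in inputs + [target] if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[inputs], df[target]


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "rating": [5.0, 2.0],
    "word_count": [13, 18],
    "helpful_ratio": [0.0, 0.1],
})
X, y = split_features(toy, ["rating", "word_count"], "helpful_ratio")
```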

## Input Features Summary and Description

### 1. rating
   - Description: The numerical rating given by the reviewer.
   - Type: Continuous

### 2. rating_number
   - Description: Number of ratings the product has received.
   - Type: Continuous

### 3. timestamp
   - Description: The timestamp of when the review was posted.
   - Type: Datetime

### 4. sentiment
   - Description: Sentiment score of the review text, indicating how positive or negative it is.
   - Type: Continuous

### 5. price
   - Description: Price of the product.
   - Type: Continuous

### 6. noun_count
   - Description: Count of nouns in the review text.
   - Type: Integer

### 7. adj_count
   - Description: Count of adjectives in the review text.
   - Type: Integer

### 8. adv_count
   - Description: Count of adverbs in the review text.
   - Type: Integer

### 9. word_count
   - Description: Total number of words in the review text.
   - Type: Integer

### 10. sentence_count
   - Description: Total number of sentences in the review text.
   - Type: Integer

### 11. avg_words_per_sentence
   - Description: Average number of words per sentence in the review text.
   - Type: Continuous

### 12. title_length
   - Description: Length of the review title in characters.
   - Type: Integer

### 13. F-K_score
   - Description: Flesch-Kincaid readability score of the review text.
   - Type: Continuous

### 14. review_extremity
   - Description: Difference between the review rating and the average product rating.
   - Type: Continuous

### 15. elapsed_time_days
   - Description: Elapsed time (in days) since the review was posted.
   - Type: Continuous

### 16. image
   - Description: Binary variable indicating whether the review contains images.
   - Type: Binary (0 or 1)

### 17. year
   - Description: Year component of the review timestamp.
   - Type: Integer

### 18. month
   - Description: Month component of the review timestamp.
   - Type: Integer

### 19. day
   - Description: Day component of the review timestamp.
   - Type: Integer

### 20. hour
   - Description: Hour component of the review timestamp.
   - Type: Integer
