<center>
    <h1><b>Data-Mining Techniques Assignment</b></h1>
</center>

This notebook is part of a university course assignment **'Data-Mining Techniques'**. This project involves e-commerce analysis using the **Amazon Product Dataset**. The project is devided into two parts:
1. Data Exploration & feature engineering, and
2. machine learning tasks including clustering, classification, recommendation system, and sentiment analysis.

The members of this assignment are shown in the following table

<div align="center">

| Ονοματεπώνυμο    | Αριθμός Μητρώου  |        email         |
| :-------------:  | :-------------:  |   :-------------:    |
| Ζήκας Αντώνιος   | 1115202100038    | sdi2100038@di.uoa.gr |
| Κώτσιλας Σταύρος | 1115201700292    | sdi1700292@di.uoa.gr |

</div>

# Part 1: Data Pre-processing
In the first part we will explore the datasets we are going to use and do some pre-processing and analysis on them. We chose to work with the following categories:
1. `All_Beauty`
2. `Digital_Music`
3. `Gift_Cards`
4. `Magazine_Subscriptions`
5. `Video_Games`

## Task 1: Data Exploration and Feature Engineering
### 1. Data Preperation
In this section we will extract our datafor the five categories above. We will download the JSON files and we will parse them in order to create the CSV files that we are going to use for the rest of the tasks.

#### Downloading the datasets
We are going to define a function that will download the datasets for us. Here we are going to use `streamming=True` so we don't download the entire dataset at once, but we will be able to access its contents. This is done for experimenting purposes.

In [307]:
from datasets import load_dataset

def download_datasets(categories, data_type="review"):
    ''' Downloads the specified type of datasets (review or meta) for the given categories. '''
    
    if data_type not in ["review", "meta"]:
        raise ValueError("Invalid data_type. Choose either 'review' or 'meta'.")
    
    # Loop through the categories and download the datasets
    # using the load_dataset function from the datasets library
    datasets = []
    for category in categories:
        print(f"Downloading {data_type} dataset for category: {category}")
        dataset = load_dataset(
            "McAuley-Lab/Amazon-Reviews-2023",
            f"raw_{data_type}_{category}",
            trust_remote_code=True,
            streaming=True
        )
        datasets.append(dataset)
    
    return datasets

Let's download the datasets for the five categories specified above. We will download both **reviews** and **meta** data for the categories.

In [308]:
# Define the categories to download (can be modified as needed)
categories = ["All_Beauty", "Digital_Music", "Gift_Cards", "Magazine_Subscriptions", "Video_Games"]

# Download the review and meta data for the specified categories
review_datasets = download_datasets(categories, data_type="review")
meta_datasets = download_datasets(categories, data_type="meta")

print("\nDatasets downloaded successfully.")

Downloading review dataset for category: All_Beauty
Downloading review dataset for category: Digital_Music
Downloading review dataset for category: Gift_Cards
Downloading review dataset for category: Magazine_Subscriptions
Downloading review dataset for category: Video_Games
Downloading meta dataset for category: All_Beauty
Downloading meta dataset for category: Digital_Music
Downloading meta dataset for category: Gift_Cards
Downloading meta dataset for category: Magazine_Subscriptions
Downloading meta dataset for category: Video_Games

Datasets downloaded successfully.


#### Creation of CSV files
Finally let's create the corresponding **CSV files** for the datasets and save them locally to use them later. We will also define a function that will handle this for us.

In [309]:
import pandas as pd
import os

def construct_csv_files(categories, datasets, max_records=100, output_dir="output"):
    ''' Constructs dictionaries for each category from the review or meta datasets and saves them as CSV files. '''
    os.makedirs(output_dir, exist_ok=True)  # Ensure the output directory exists
    
    categories_dictionaries = {}
    for category, dataset in zip(categories, datasets):
        csv_path = os.path.join(output_dir, f"{category}_data.csv")
        
        # Check if the CSV file already exists
        if os.path.exists(csv_path):
            print(f" - CSV file for category '{category}' already exists at: {csv_path}. Skipping creation.")
            continue
        
        for i, record in enumerate(dataset['full']):
            if i == 0:
                dictionary = {key: [] for key in record.keys()}
            for key in record.keys():
                dictionary[key].append(record[key])
            if i == max_records - 1:
                break
        
        # Save the dictionary as a CSV file
        df = pd.DataFrame(dictionary)
        df.to_csv(csv_path, index=False)
        print(f" - CSV file created for category '{category}' at: {csv_path} ({len(df)} records)")
        
        categories_dictionaries[category] = dictionary
    
    return categories_dictionaries

In [310]:
data_path = "../data" # Path to save the CSV files

print("\nConstructing CSV files for review datasets...")
review_dictionaries = construct_csv_files(categories, review_datasets, max_records=1000, output_dir=f"{data_path}/review")

print("\nConstructing CSV files for meta datasets...")
meta_dictionaries = construct_csv_files(categories, meta_datasets, max_records=1000, output_dir=f"{data_path}/meta")


Constructing CSV files for review datasets...
 - CSV file for category 'All_Beauty' already exists at: ../data/review/All_Beauty_data.csv. Skipping creation.
 - CSV file for category 'Digital_Music' already exists at: ../data/review/Digital_Music_data.csv. Skipping creation.
 - CSV file for category 'Gift_Cards' already exists at: ../data/review/Gift_Cards_data.csv. Skipping creation.
 - CSV file for category 'Magazine_Subscriptions' already exists at: ../data/review/Magazine_Subscriptions_data.csv. Skipping creation.
 - CSV file for category 'Video_Games' already exists at: ../data/review/Video_Games_data.csv. Skipping creation.

Constructing CSV files for meta datasets...
 - CSV file for category 'All_Beauty' already exists at: ../data/meta/All_Beauty_data.csv. Skipping creation.
 - CSV file for category 'Digital_Music' already exists at: ../data/meta/Digital_Music_data.csv. Skipping creation.
 - CSV file for category 'Gift_Cards' already exists at: ../data/meta/Gift_Cards_data.csv.

#### Loading the CSV files to Pandas Dataframes
Next we will pre-process and clean our data. We start by loading the loading the CSV files we just saved to **Pandas Dataframes**. This will let us work more conviniently. Let's create a function for that and load the dataframes.

In [311]:
def load_csv_files(categories, data_path="../data"):
    ''' Loads the CSV files for the specified categories into pandas dataframes. '''
    
    # Initialize a dictionary to hold the dataframes
    dataframes = {'review': {}, 'meta': {}}
    
    # Loop through the categories and load the CSV files into dataframes
    for mode in ['review', 'meta']:
        for category in categories:
            csv_path = os.path.join(data_path, mode, f"{category}_data.csv")
            if os.path.exists(csv_path):
                dataframes[mode][category] = pd.read_csv(csv_path)
                print(f" - Loaded {mode} data for category '{category}' from: {csv_path}")
            else:
                print(f" - CSV file for category '{category}' not found at: {csv_path}. Skipping loading.")
    
    return dataframes

In [312]:
dataframes = load_csv_files(categories, data_path=data_path)

 - Loaded review data for category 'All_Beauty' from: ../data/review/All_Beauty_data.csv
 - Loaded review data for category 'Digital_Music' from: ../data/review/Digital_Music_data.csv
 - Loaded review data for category 'Gift_Cards' from: ../data/review/Gift_Cards_data.csv
 - Loaded review data for category 'Magazine_Subscriptions' from: ../data/review/Magazine_Subscriptions_data.csv
 - Loaded review data for category 'Video_Games' from: ../data/review/Video_Games_data.csv
 - Loaded meta data for category 'All_Beauty' from: ../data/meta/All_Beauty_data.csv
 - Loaded meta data for category 'Digital_Music' from: ../data/meta/Digital_Music_data.csv
 - Loaded meta data for category 'Gift_Cards' from: ../data/meta/Gift_Cards_data.csv
 - Loaded meta data for category 'Magazine_Subscriptions' from: ../data/meta/Magazine_Subscriptions_data.csv
 - Loaded meta data for category 'Video_Games' from: ../data/meta/Video_Games_data.csv


#### Dataframes Visualization
Let's have a look at the **review** and **meta** data for the first category: **All_Beauty**

In [313]:
all_beauty_review_df = dataframes['review']['All_Beauty']
all_beauty_meta_df = dataframes['meta']['All_Beauty']

In [314]:
all_beauty_review_df.head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5.0,Such a lovely scent but not overpowering.,This spray is really nice. It smells really go...,[],B00YQ6X8EO,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588687728923,0,True
1,4.0,Works great but smells a little weird.,"This product does what I need it to do, I just...",[],B081TJ8YS3,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588615855070,1,True
2,5.0,Yes!,"Smells good, feels great!",[],B07PNNCSP9,B097R46CSY,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,1589665266052,2,True
3,1.0,Synthetic feeling,Felt synthetic,[],B09JS339BZ,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1643393630220,0,True
4,5.0,A+,Love it,[],B08BZ63GMJ,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1609322563534,0,True


In [315]:
all_beauty_meta_df.head()

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author
0,All Beauty,"Howard LC0008 Leather Conditioner, 8-Ounce (4-...",4.8,10,[],[],,"{'hi_res': [None, 'https://m.media-amazon.com/...","{'title': [], 'url': [], 'user_id': []}",Howard Products,[],"{""Package Dimensions"": ""7.1 x 5.5 x 3 inches; ...",B01CUPMQZE,,,
1,All Beauty,Yes to Tomatoes Detoxifying Charcoal Cleanser ...,4.5,3,[],[],,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",Yes To,[],"{""Item Form"": ""Powder"", ""Skin Type"": ""Acne Pro...",B076WQZGPM,,,
2,All Beauty,Eye Patch Black Adult with Tie Band (6 Per Pack),4.4,26,[],[],,"{'hi_res': [None, None], 'large': ['https://m....","{'title': [], 'url': [], 'user_id': []}",Levine Health Products,[],"{""Manufacturer"": ""Levine Health Products""}",B000B658RI,,,
3,All Beauty,"Tattoo Eyebrow Stickers, Waterproof Eyebrow, 4...",3.1,102,[],[],,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",Cherioll,[],"{""Brand"": ""Cherioll"", ""Item Form"": ""Powder"", ""...",B088FKY3VD,,,
4,All Beauty,Precision Plunger Bars for Cartridge Grips – 9...,4.3,7,"['Material: 304 Stainless Steel; Brass tip', '...",['The Precision Plunger Bars are designed to w...,,"{'hi_res': [None], 'large': ['https://m.media-...","{'title': [], 'url': [], 'user_id': []}",Precision,[],"{""UPC"": ""644287689178""}",B07NGFDN6G,,,


#### Data Cleaning
Now that we have our data loaded into pandas dataframe, we can start **pre-processing** them. The **Data Cleaning** that we are going to apply consists of three tasks:
1. Handle missing values,
2. Normalize prices, and
3. Pre-process text

First we will make sure our dataframes doesn't contain duplicates

In [316]:
# Drop duplicates in the all dataframes
for mode in ['review', 'meta']:
    print(f"\nDropping duplicates in {mode} data...")
    for category in categories:
        dataframes[mode][category].drop_duplicates(inplace=True)
        dataframes[mode][category].reset_index(drop=True, inplace=True)
        print(f" - Dropped duplicates in {mode} data for category '{category}'")


Dropping duplicates in review data...
 - Dropped duplicates in review data for category 'All_Beauty'
 - Dropped duplicates in review data for category 'Digital_Music'
 - Dropped duplicates in review data for category 'Gift_Cards'
 - Dropped duplicates in review data for category 'Magazine_Subscriptions'
 - Dropped duplicates in review data for category 'Video_Games'

Dropping duplicates in meta data...
 - Dropped duplicates in meta data for category 'All_Beauty'
 - Dropped duplicates in meta data for category 'Digital_Music'
 - Dropped duplicates in meta data for category 'Gift_Cards'
 - Dropped duplicates in meta data for category 'Magazine_Subscriptions'
 - Dropped duplicates in meta data for category 'Video_Games'


##### i) Handle Missing Values
We start the cleaning procedure by **handling missing values**. We will start by identifying which columns of the datasets have missing values.

In [317]:
# First search in the review dataframes
for category in categories:
    dataframe = dataframes['review'][category]
    missing_summary = dataframe.isnull().sum()
    print(f"Missing summary in {category} review data:\n{missing_summary[missing_summary > 0]}")

Missing summary in All_Beauty review data:
Series([], dtype: int64)
Missing summary in Digital_Music review data:
Series([], dtype: int64)
Missing summary in Gift_Cards review data:
Series([], dtype: int64)
Missing summary in Magazine_Subscriptions review data:
Series([], dtype: int64)
Missing summary in Video_Games review data:
Series([], dtype: int64)


In [318]:
# Next, search in the meta dataframes
for category in categories:
    dataframe = dataframes['meta'][category]
    missing_summary = dataframe.isnull().sum()
    print(f"Missing summary in {category} meta data:\n{missing_summary[missing_summary > 0]}")

Missing summary in All_Beauty meta data:
price               799
store                99
bought_together    1000
subtitle           1000
author             1000
dtype: int64
Missing summary in Digital_Music meta data:
price               390
store                61
bought_together    1000
subtitle           1000
author             1000
dtype: int64
Missing summary in Gift_Cards meta data:
main_category       117
price               641
store                15
bought_together    1000
subtitle           1000
author             1000
dtype: int64
Missing summary in Magazine_Subscriptions meta data:
price              1000
store                53
bought_together    1000
subtitle           1000
author             1000
dtype: int64
Missing summary in Video_Games meta data:
main_category         4
price               466
store                25
bought_together    1000
subtitle           1000
author             1000
dtype: int64


As we see the missing values we are dealing with are refering to the **meta** data of the datasets. The review datasets seems to not have any missing value. As for the meta datasets, it seems to have empty values at columns: **main_category, price, store, bought_together, subtitle, and author**. We are going to deal with these missing values as follows:
- `main_category` $\rightarrow$ fill with the most common value from the other rows
- `price` $\rightarrow$ fill with 0.0
- `store` $\rightarrow$ fill with ''
- `bought_together` $\rightarrow$ fill with empty list []
- `subtitle` $\rightarrow$ fill with ''
- `author` $\rightarrow$ fill with ''

Let's define a function that will handle missing values for a dataframe, using the above sceptic

In [319]:
def handle_missing_values(dataframe):
    ''' Handles missing values in a given dataframe. '''
    
    # Fill 'main_category' with the most common value of the column
    dataframe['main_category'] = dataframe['main_category'].fillna(dataframe['main_category'].mode()[0])

    # Fill 'price' with the 0.0 value
    dataframe['price'] = dataframe['price'].fillna(0.0)

    # Fill 'store' with empty string
    dataframe['store'] = dataframe['store'].fillna("")

    # Fill 'bought_together' with empty list
    dataframe['bought_together'] = dataframe['bought_together'].fillna("[]")

    # Fill 'subtitle' with empty string
    dataframe['subtitle'] = dataframe['subtitle'].fillna("")

    # Fill 'author' with empty string
    dataframe['author'] = dataframe['author'].fillna("")

    return dataframe

Let's apply the above function to all meta dataframes

In [320]:
# Handle missing values for all the meta dataframes
for dataframe in dataframes['meta'].values():
    dataframe = handle_missing_values(dataframe)
    dataframes['meta'][category] = dataframe

And finally let's see the results

In [321]:
# Next, search in the meta dataframes
for category in categories:
    dataframe = dataframes['meta'][category]
    missing_summary = dataframe.isnull().sum()
    print(f"Missing summary in {category} meta data:\n{missing_summary[missing_summary > 0]}")

Missing summary in All_Beauty meta data:
Series([], dtype: int64)
Missing summary in Digital_Music meta data:
Series([], dtype: int64)
Missing summary in Gift_Cards meta data:
Series([], dtype: int64)
Missing summary in Magazine_Subscriptions meta data:
Series([], dtype: int64)
Missing summary in Video_Games meta data:
Series([], dtype: int64)


As we see all the missing values have been removed with the previous logic, and we can continue with normalizing prices.

##### ii) Normalizing prices
Next we will normalize the prices of the meta dataframes. There are many ways of normalizing numeric values. Some of them are:
- Min-Max normalization (Scaling to 0 - 1)
- Standardization (Z-score Normalization)
- Log Normalization
- Currency Normalization (less common)

In this notebook we will use **Min-Max normalization (Scaling to 0 - 1)**. Let's define a function for that.

In [322]:
def normalize_prices(dataframe):
    ''' 
    Normalizes the prices in a given dataframe, using min-max normalization (scaling to [0, 1]),
    and creating a new column 'normalized_price' to store the normalized values'.
    '''
    min_price, max_price = dataframe['price'].min(), dataframe['price'].max()
    if min_price == max_price:
        dataframe['normalized_price'] = 0.0
    else:
        dataframe['normalized_price'] = (dataframe['price'] - min_price) / (max_price - min_price)
        
    return dataframe

Let's apply this function to the meta data.

In [323]:
for category in categories:
    dataframe = dataframes['meta'][category]
    dataframe = normalize_prices(dataframe)
    dataframes['meta'][category] = dataframe
    print(f" - Normalized prices for category '{category}'")

 - Normalized prices for category 'All_Beauty'
 - Normalized prices for category 'Digital_Music'
 - Normalized prices for category 'Gift_Cards'
 - Normalized prices for category 'Magazine_Subscriptions'
 - Normalized prices for category 'Video_Games'


Let's have a look at the meta data of the 'All Beauty' category and check what is the **maximum** and **minimum** price in each dataset related to their normalized values.

In [324]:
dataframes['meta']['All_Beauty'].head()

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author,normalized_price
0,All Beauty,"Howard LC0008 Leather Conditioner, 8-Ounce (4-...",4.8,10,[],[],0.0,"{'hi_res': [None, 'https://m.media-amazon.com/...","{'title': [], 'url': [], 'user_id': []}",Howard Products,[],"{""Package Dimensions"": ""7.1 x 5.5 x 3 inches; ...",B01CUPMQZE,[],,,0.0
1,All Beauty,Yes to Tomatoes Detoxifying Charcoal Cleanser ...,4.5,3,[],[],0.0,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",Yes To,[],"{""Item Form"": ""Powder"", ""Skin Type"": ""Acne Pro...",B076WQZGPM,[],,,0.0
2,All Beauty,Eye Patch Black Adult with Tie Band (6 Per Pack),4.4,26,[],[],0.0,"{'hi_res': [None, None], 'large': ['https://m....","{'title': [], 'url': [], 'user_id': []}",Levine Health Products,[],"{""Manufacturer"": ""Levine Health Products""}",B000B658RI,[],,,0.0
3,All Beauty,"Tattoo Eyebrow Stickers, Waterproof Eyebrow, 4...",3.1,102,[],[],0.0,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",Cherioll,[],"{""Brand"": ""Cherioll"", ""Item Form"": ""Powder"", ""...",B088FKY3VD,[],,,0.0
4,All Beauty,Precision Plunger Bars for Cartridge Grips – 9...,4.3,7,"['Material: 304 Stainless Steel; Brass tip', '...",['The Precision Plunger Bars are designed to w...,0.0,"{'hi_res': [None], 'large': ['https://m.media-...","{'title': [], 'url': [], 'user_id': []}",Precision,[],"{""UPC"": ""644287689178""}",B07NGFDN6G,[],,,0.0


In [325]:
for category in categories:
    print(f"{category}: Max price: {dataframes['meta'][category]['price'].max()}", end=", ")
    print(f"Min price: {dataframes['meta'][category]['price'].min()}", end=", ")
    print(f"Normalized max price: {dataframes['meta'][category]['normalized_price'].max()}", end=", ")
    print(f"Normalized min price: {dataframes['meta'][category]['normalized_price'].min()}")

All_Beauty: Max price: 179.95, Min price: 0.0, Normalized max price: 1.0, Normalized min price: 0.0
Digital_Music: Max price: 911.08, Min price: 0.0, Normalized max price: 1.0, Normalized min price: 0.0
Gift_Cards: Max price: 2000.0, Min price: 0.0, Normalized max price: 1.0, Normalized min price: 0.0
Magazine_Subscriptions: Max price: 0.0, Min price: 0.0, Normalized max price: 0.0, Normalized min price: 0.0
Video_Games: Max price: 0.0, Min price: 0.0, Normalized max price: 0.0, Normalized min price: 0.0


As we see the prices have been normalized successfully. We can continue with the text preprocessing.

##### iii) Pre-processing text
We will now begin the text pre-processing of the data. Fistly we will decide witch columns need to be processed. As we saw in the dataframes representations there are two columns in the review data that contain text. These are **`title`** and **`text`**, while in the meta data only the **`title`** column contain text. So we are going to apply text pre-processing as follows:
- For the review data $\rightarrow$ columns `title` and `text`
- For the meta data $\rightarrow$ column `title`

The next step is to determine what **pre-processing techniques** will be used. We have decided to apply the following rules on the pre-processing:
1. **Lowercase** text,
2. Remove **punctuation**,
3. **Stemming** and **Lemmatization** of words,
4. Remove **URLS** (https://...), **user mentions** (@user123) and **hashtags** (#hashtag_example), and
5. Remove **stop-words**

We will create a function that will apply the above rules to a single string variable. First we will make sure all the appropriate packages are installed in our system.

In [326]:
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/antonis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/antonis/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/antonis/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

And then we will define our function

In [327]:
import re

def preprocess_text(text):
    ''' 
    Preprocesses the text by removing special characters, converting to lowercase, 
    and removing stop-words. 
    '''
    
    # 1. Lowercase the text and remove punctuation
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)

    # 2. Apply stemming or lemmatization to words
    stemmer = nltk.stem.PorterStemmer()
    lemmatizer = nltk.stem.WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # 3. Remove URLs, user mentions, and hashtags
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'@\w+|#\w+', '', text)
    text = re.sub(r'#\S+', '', text)

    # 4. Remove stop-words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    return ' '.join(tokens)

Let's apply the cleaning function to the columns mentioned above

In [328]:
import time

print("Pre-processing started")
t0 = time.time()

print(" - Preprocessing review data...")
for dataframe in dataframes['review'].values():
    dataframe['cleaned_title'] = dataframe['title'].apply(preprocess_text)
    dataframe['cleaned_text'] = dataframe['text'].apply(preprocess_text)

print(" - Preprocessing meta data...")
for dataframe in dataframes['meta'].values():
    dataframe['cleaned_title'] = dataframe['title'].apply(preprocess_text)

print(f"\nPreprocessing completed in {time.time() - t0:.2f} seconds")

Pre-processing started
 - Preprocessing review data...
 - Preprocessing meta data...

Preprocessing completed in 2.57 seconds


Let's take a look at the pre-processed data.

In [329]:
dataframes['review']['All_Beauty'].head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,cleaned_title,cleaned_text
0,5.0,Such a lovely scent but not overpowering.,This spray is really nice. It smells really go...,[],B00YQ6X8EO,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588687728923,0,True,love scent overpow,thi spray realli nice smell realli good goe re...
1,4.0,Works great but smells a little weird.,"This product does what I need it to do, I just...",[],B081TJ8YS3,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588615855070,1,True,work great smell littl weird,thi product doe need wish wa odorless soft coc...
2,5.0,Yes!,"Smells good, feels great!",[],B07PNNCSP9,B097R46CSY,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,1589665266052,2,True,ye,smell good feel great
3,1.0,Synthetic feeling,Felt synthetic,[],B09JS339BZ,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1643393630220,0,True,synthet feel,felt synthet
4,5.0,A+,Love it,[],B08BZ63GMJ,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1609322563534,0,True,,love


In [330]:
dataframes['meta']['All_Beauty'].head()

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author,normalized_price,cleaned_title
0,All Beauty,"Howard LC0008 Leather Conditioner, 8-Ounce (4-...",4.8,10,[],[],0.0,"{'hi_res': [None, 'https://m.media-amazon.com/...","{'title': [], 'url': [], 'user_id': []}",Howard Products,[],"{""Package Dimensions"": ""7.1 x 5.5 x 3 inches; ...",B01CUPMQZE,[],,,0.0,howard lc0008 leather condition 8ounc 4pack
1,All Beauty,Yes to Tomatoes Detoxifying Charcoal Cleanser ...,4.5,3,[],[],0.0,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",Yes To,[],"{""Item Form"": ""Powder"", ""Skin Type"": ""Acne Pro...",B076WQZGPM,[],,,0.0,ye tomato detoxifi charcoal cleanser pack 2 ch...
2,All Beauty,Eye Patch Black Adult with Tie Band (6 Per Pack),4.4,26,[],[],0.0,"{'hi_res': [None, None], 'large': ['https://m....","{'title': [], 'url': [], 'user_id': []}",Levine Health Products,[],"{""Manufacturer"": ""Levine Health Products""}",B000B658RI,[],,,0.0,eye patch black adult tie band 6 per pack
3,All Beauty,"Tattoo Eyebrow Stickers, Waterproof Eyebrow, 4...",3.1,102,[],[],0.0,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",Cherioll,[],"{""Brand"": ""Cherioll"", ""Item Form"": ""Powder"", ""...",B088FKY3VD,[],,,0.0,tattoo eyebrow sticker waterproof eyebrow 4d i...
4,All Beauty,Precision Plunger Bars for Cartridge Grips – 9...,4.3,7,"['Material: 304 Stainless Steel; Brass tip', '...",['The Precision Plunger Bars are designed to w...,0.0,"{'hi_res': [None], 'large': ['https://m.media-...","{'title': [], 'url': [], 'user_id': []}",Precision,[],"{""UPC"": ""644287689178""}",B07NGFDN6G,[],,,0.0,precis plunger bar cartridg grip 93mm bag 10 p...


### 2. Ratings and Reviews
In this section we will visualise some data using maplotlib.

#### Distribution of Products Ratings
First we will visualise the **distribution of products rating** within each of the 5 categories. We will also determin if there are any categories with significantly higher or lower average ratings.