<center>
    <h1><b>Data-Mining Techniques Assignment</b></h1>
</center>

This notebook is part of a university course assignment for **'Data-Mining Techniques'**. The project involves e-commerce analysis using the **Amazon Product Dataset** and is divided into two parts:
1. Data exploration and feature engineering, and
2. Machine learning tasks, including clustering, classification, a recommendation system, and sentiment analysis.

The members of the team are listed in the following table:

<div align="center">

| Full Name        | Student ID       |        Email         |
| :-------------:  | :-------------:  |   :-------------:    |
| Ζήκας Αντώνιος   | 1115202100038    | sdi2100038@di.uoa.gr |
| Κώτσιλας Σταύρος | 1115201700292    | sdi1700292@di.uoa.gr |

</div>

# Part 1: Data Pre-processing
In the first part we will explore the datasets we are going to use and do some pre-processing and analysis on them. We chose to work with the following categories:
1. `All_Beauty`
2. `Digital_Music`
3. `Gift_Cards`
4. `Magazine_Subscriptions`
5. `Video_Games`

## Task 1: Data Exploration and Feature Engineering
### 1. Data Preparation
In this section we will extract the data for the five categories above. We will download the JSON files and parse them to create the CSV files used in the rest of the tasks.

#### Downloading the datasets
We define a function that downloads the datasets for us. We pass `streaming=True` so that the entire dataset is not downloaded at once; instead, records are fetched lazily as we iterate over them. This is convenient while experimenting, since we only need a small sample of each dataset.

In [1]:
from datasets import load_dataset

def download_datasets(categories, data_type="review"):
    ''' Downloads the specified type of datasets (review or meta) for the given categories. '''
    
    if data_type not in ["review", "meta"]:
        raise ValueError("Invalid data_type. Choose either 'review' or 'meta'.")
    
    # Loop through the categories and download the datasets
    # using the load_dataset function from the datasets library
    datasets = []
    for category in categories:
        print(f"Downloading {data_type} dataset for category: {category}")
        dataset = load_dataset(
            "McAuley-Lab/Amazon-Reviews-2023",
            f"raw_{data_type}_{category}",
            trust_remote_code=True,
            streaming=True
        )
        datasets.append(dataset)
    
    return datasets
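
Because the datasets are streamed, nothing is read until we iterate over them. A minimal sketch of how the first few records could be inspected without consuming the rest of the stream is shown below. `peek_records` is a hypothetical helper that is not part of the assignment code, and `fake_stream` is a stand-in generator with illustrative field names, used in place of `dataset['full']` so that the example runs offline.

```python
from itertools import islice

def peek_records(stream, n=3):
    """Return the first n records of any iterable without reading the rest."""
    return list(islice(stream, n))

def fake_stream():
    # Stand-in for dataset['full']: yields one record at a time, lazily.
    for i in range(1_000_000):
        yield {"rating": float(i % 5 + 1), "text": f"review {i}"}

# Only two records are ever generated, even though the stream is much longer
sample = peek_records(fake_stream(), n=2)
print(sample)
```

With a real streamed dataset, the same helper would presumably be called as `peek_records(dataset['full'])`, assuming the split is named `full` as in the rest of this notebook.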


Let's download the datasets for the five categories specified above. We will download both **reviews** and **meta** data for the categories.

In [2]:
# Define the categories to download (can be modified as needed)
categories = ["All_Beauty", "Digital_Music", "Gift_Cards", "Magazine_Subscriptions", "Video_Games"]

# Download the review and meta data for the specified categories
review_datasets = download_datasets(categories, data_type="review")
meta_datasets = download_datasets(categories, data_type="meta")

print("\nDatasets downloaded successfully.")

Downloading review dataset for category: All_Beauty
Downloading review dataset for category: Digital_Music
Downloading review dataset for category: Gift_Cards
Downloading review dataset for category: Magazine_Subscriptions
Downloading review dataset for category: Video_Games
Downloading meta dataset for category: All_Beauty
Downloading meta dataset for category: Digital_Music
Downloading meta dataset for category: Gift_Cards
Downloading meta dataset for category: Magazine_Subscriptions
Downloading meta dataset for category: Video_Games

Datasets downloaded successfully.


#### Creation of CSV files
Finally, let's create the corresponding **CSV files** for the datasets and save them locally for later use. We define a function that handles this for us.

In [3]:
import pandas as pd
import os

def construct_csv_files(categories, datasets, max_records=100, output_dir="output"):
    ''' Constructs dictionaries for each category from the review or meta datasets and saves them as CSV files. '''
    os.makedirs(output_dir, exist_ok=True)  # Ensure the output directory exists
    
    categories_dictionaries = {}
    for category, dataset in zip(categories, datasets):
        csv_path = os.path.join(output_dir, f"{category}_data.csv")
        
        # Check if the CSV file already exists
        if os.path.exists(csv_path):
            print(f" - CSV file for category '{category}' already exists at: {csv_path}. Skipping creation.")
            continue
        
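        # Collect up to max_records records into a column-oriented dictionary
        dictionary = {}
        num_records = 0
        for record in dataset['full']:
            # Stop before appending, so exactly max_records records are kept
            if num_records >= max_records:
                break
            if not dictionary:
                dictionary = {key: [] for key in record.keys()}
            for key in record.keys():
                dictionary[key].append(record[key])
            num_records += 1

        # Save the dictionary as a CSV file
        df = pd.DataFrame(dictionary)
        df.to_csv(csv_path, index=False)
        print(f" - CSV file created for category '{category}' at: {csv_path} ({num_records} records)")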
        
        categories_dictionaries[category] = dictionary
    
    return categories_dictionaries
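
Once the CSV files exist, it can be useful to confirm that each one has the expected number of rows and columns. The sketch below is one possible sanity check and is not part of the assignment code: `summarize_csv_files` is a hypothetical helper that uses only the standard `csv` module. The demo writes a small throwaway file with illustrative columns instead of reading the real `../data` directories.

```python
import csv
import os
import tempfile

def summarize_csv_files(output_dir):
    """Map each CSV file name in output_dir to (row count, column names)."""
    summary = {}
    for name in sorted(os.listdir(output_dir)):
        if not name.endswith(".csv"):
            continue
        path = os.path.join(output_dir, name)
        # newline="" lets the csv module handle line breaks inside quoted fields
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, [])
            n_rows = sum(1 for _ in reader)
        summary[name] = (n_rows, header)
    return summary

# Demo on a temporary directory with a tiny hand-written file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Demo_data.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["rating", "text"])
        writer.writerow([5.0, "Great\nproduct"])  # multi-line review text
        writer.writerow([1.0, "Broke quickly"])
    demo_summary = summarize_csv_files(tmp)

print(demo_summary)
```

A review whose text spans several lines still counts as a single row, because rows are counted by the CSV parser and not by raw line breaks.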

In [4]:
print("\nConstructing CSV files for review datasets...")
review_dictionaries = construct_csv_files(categories, review_datasets, max_records=1000, output_dir="../data/review")

print("\nConstructing CSV files for meta datasets...")
meta_dictionaries = construct_csv_files(categories, meta_datasets, max_records=1000, output_dir="../data/meta")


Constructing CSV files for review datasets...
 - CSV file created for category 'All_Beauty' at: ../data/review/All_Beauty_data.csv (1000 records)
 - CSV file created for category 'Digital_Music' at: ../data/review/Digital_Music_data.csv (1000 records)
 - CSV file created for category 'Gift_Cards' at: ../data/review/Gift_Cards_data.csv (1000 records)
 - CSV file created for category 'Magazine_Subscriptions' at: ../data/review/Magazine_Subscriptions_data.csv (1000 records)
 - CSV file created for category 'Video_Games' at: ../data/review/Video_Games_data.csv (1000 records)

Constructing CSV files for meta datasets...
 - CSV file created for category 'All_Beauty' at: ../data/meta/All_Beauty_data.csv (1000 records)
 - CSV file created for category 'Digital_Music' at: ../data/meta/Digital_Music_data.csv (1000 records)
 - CSV file created for category 'Gift_Cards' at: ../data/meta/Gift_Cards_data.csv (1000 records)
 - CSV file created for category 'Magazine_Subscriptions' at: ../data/meta/M