
### General Graph Structure

The dataset will be structured as follows:
- **`train_data`**, **`valid_data`**, and **`test_data`** share the same **edge_index**, which represents training edges in an **undirected** format (i.e., both `(u, v)` and `(v, u)` exist).
- **`edge_type`**:
  - `0` for **user-item** edges
  - `1` for **item-user** edges
- **`edge_attr`**: Represents edge features in `edge_index`.
- **`target_edge_index`**: The directed edges in each split.
  - For `train_data`, this is the directed version of `edge_index`.
  - `target_edge_type` should be **all `0`s**.
  - `target_edge_attr` contains features of `target_edge_index`.
- **Additional attributes**:
  - `num_nodes`, `num_users`, `num_items`: Should be set accordingly.
  - `num_relations = 2` (user-item and item-user relationships).
  - **Optional**: Include `x_user` and `x_item` as user and item features in `train_data`, `valid_data`, and `test_data`.
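
A minimal sketch of how the undirected `edge_index` and its `edge_type` could be assembled from directed user-item pairs. It uses plain Python lists instead of tensors, and it assumes that item node IDs are offset by `num_users` so that users and items share one ID space. The helper name `build_undirected_edges` is hypothetical and not part of the repository.

```python
def build_undirected_edges(user_item_pairs, num_users):
    """Build an undirected edge list from directed (user, item) pairs.

    Assumption: item i is stored as node num_users + i.
    """
    sources, targets, edge_type = [], [], []
    # Forward direction: user -> item, relation 0.
    for user, item in user_item_pairs:
        sources.append(user)
        targets.append(num_users + item)
        edge_type.append(0)
    # Reverse direction: item -> user, relation 1.
    for user, item in user_item_pairs:
        sources.append(num_users + item)
        targets.append(user)
        edge_type.append(1)
    return [sources, targets], edge_type


pairs = [(0, 0), (1, 2)]  # toy interactions: 2 users, 3 items
edge_index, edge_type = build_undirected_edges(pairs, num_users=2)
```

In this sketch the first half of `edge_index` corresponds to the directed edges that would serve as `target_edge_index` for `train_data`.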

---


## Adding Gowalla
The Gowalla dataset downloads all relevant files directly, so no manual steps are required.

## Adding Yelp Dataset
#### **1.1 Interaction Data**
The interaction data is downloaded automatically by the code via the raw file link, so no manual download is needed.

#### **1.2 Edge Features**
Download the edge feature files from RecBole:
[RecBole Yelp Dataset](https://recbole.io/dataset_list.html)

Files to download:
- `yelp_academic_dataset_business.json`
- `yelp_academic_dataset_review.json`
- `yelp_academic_dataset_user.json`

---

### **2. Move Files to the Correct Directory**

After downloading, place all files into the following directory:
```bash
/itet-stor/trachsele/net_scratch/tl4rec/model_outputs/data/yelp18/raw
```
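
Before running the Yelp preprocessing, it can help to confirm that the three feature files are actually in place. The following is a small sketch using only the standard library; the function name `missing_yelp_files` is hypothetical, and the directory is passed in rather than hard-coded.

```python
from pathlib import Path

# File names taken from the download list above.
YELP_FEATURE_FILES = (
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_review.json",
    "yelp_academic_dataset_user.json",
)


def missing_yelp_files(raw_dir):
    """Return the expected feature files that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in YELP_FEATURE_FILES if not (raw_dir / name).is_file()]
```

An empty list means all three files were found in the given directory.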


## How to Convert the ML-1M, BookX, Epinions, and LastFM Datasets to CSV Files

### **1. Download and Organize the Datasets**

We will use the following repository for dataset preparation:
[https://github.com/recsys-benchmark/DaisyRec-v2.0/tree/dev](https://github.com/recsys-benchmark/DaisyRec-v2.0/tree/dev)

#### **Steps to Download**
1. Clone the DaisyRec repository:
   ```bash
   git clone --branch dev https://github.com/recsys-benchmark/DaisyRec-v2.0.git
   ```
2. Navigate into the directory:
   ```bash
   cd DaisyRec-v2.0
   ```
3. Place your dataset files into the folder specified in the repository's documentation.

---

### **2. Modify `loader.py` to Include More Edge Features**

Edit `daisy/utils/loader.py` so that the LastFM, BookX, and Epinions loaders keep additional edge features.


#### **Modify the LastFM Loader**
```python
elif self.src == 'lastfm':
    df = pd.read_csv(f'{self.ds_path}user_artists.dat', sep='\t')
    df.rename(columns={'userID': self.uid_name, 'artistID': self.iid_name, 'weight': self.inter_name}, inplace=True)
    # Fake timestamp column
    df[self.tid_name] = 1
```

#### **Modify the BookX Loader**
```python
elif self.src == 'book-x':
    df = pd.read_csv(f'{self.ds_path}BX-Book-Ratings.csv', delimiter=";", encoding="latin1")
    df.rename(columns={'User-ID': self.uid_name, 'ISBN': self.iid_name, 'Book-Rating': self.inter_name}, inplace=True)
    # Fake timestamp column
    df[self.tid_name] = 1
```

#### **Modify the Epinions Loader**
```python
elif self.src == 'epinions':
    d = sio.loadmat(f'{self.ds_path}rating_with_timestamp.mat')
    prime = []
    for val in d['rating_with_timestamp']:
        user, item, category, rating, helpfulness, timestamp = val[0], val[1], val[2], val[3], val[4], val[5]
        prime.append([user, item, category, rating, helpfulness, timestamp])
    
    # Add category and helpfulness to DataFrame
    df = pd.DataFrame(prime, columns=[self.uid_name, self.iid_name, "category", self.inter_name, "helpfulness", self.tid_name])
    del prime
    gc.collect()
```

---

### **3. Modify `test.py` to Save Processed CSV Files**

Edit `test.py` so that it records the split of each row and saves the result:

Find the following lines:
```python
splitter = TestSplitter(config)
train_index, test_index = splitter.split(df)
```

Immediately after, add:
```python
df['split'] = 'none'
df.loc[train_index, 'split'] = 'train'
df.loc[test_index, 'split'] = 'test'
# Save the CSV file (change the filename to match the dataset being processed)
df.to_csv('epinions_raw_with_splits.csv', index=False)
```

This ensures the CSV files contain the necessary split information.

---

### **4. Update `basic.yaml` for Each Dataset**

Modify the file `daisy/assets/basic.yaml` based on the dataset you are using:

#### **General Changes**
```yaml
dataset: 'ml-1m'   # Change this to your dataset name
prepro: 10filter  # Removes all items with fewer than 10 interactions
```

#### **Splitting Methods**
- **For ML-1M and Epinions:**
  ```yaml
  val_method: 'tsbr'  # Time-aware split by ratio
  test_method: 'tsbr'
  ```
- **For BookX and LastFM:**
  ```yaml
  val_method: 'rsbr'  # Random split by ratio
  test_method: 'rsbr'
  ```

#### **Positive Threshold for Ratings**
- **For ML-1M:**
  ```yaml
  positive_threshold: 4  # Ratings < 4 are not treated as interactions
  ```
- **For Other Datasets:**
  ```yaml
  positive_threshold: ~  # No filtering applied
  ```
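
The per-dataset settings above can be collected in one place. This is only a summary sketch in Python, not part of DaisyRec: the function `basic_yaml_overrides` is hypothetical, and the dataset keys are assumed to match the names used in `loader.py` and `basic.yaml` (`ml-1m`, `epinions`, `book-x`, `lastfm`).

```python
# Settings that differ per dataset; None stands for the YAML null (~).
PER_DATASET = {
    "ml-1m": {"val_method": "tsbr", "test_method": "tsbr", "positive_threshold": 4},
    "epinions": {"val_method": "tsbr", "test_method": "tsbr", "positive_threshold": None},
    "book-x": {"val_method": "rsbr", "test_method": "rsbr", "positive_threshold": None},
    "lastfm": {"val_method": "rsbr", "test_method": "rsbr", "positive_threshold": None},
}


def basic_yaml_overrides(dataset):
    """Return the values to set in basic.yaml for one dataset."""
    settings = {"dataset": dataset, "prepro": "10filter"}
    settings.update(PER_DATASET[dataset])
    return settings
```

For example, `basic_yaml_overrides("lastfm")` yields the random split methods with no positive threshold.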


### **5. Run the Code and Save the CSV Files**

Once the changes are made, run the following to process and split datasets:
```bash
python test.py
```

After running the script, locate the generated CSV file in the folder specified by the repository's config:
```bash
/itet-stor/trachsele/net_scratch/tl4rec/model_outputs/data/{dataset_name}/raw
```

Now the datasets are in the required CSV format and ready to use! 🚀

## How to Prepare the Dataset from a CSV file
Use this procedure to add your own dataset or to add ML-1M, Epinions, BookX, and LastFM.

### 1. Dataset Requirements

Your dataset should be in a CSV file with the following columns:
- **user**: Integer representing the user ID (0 to `num_users - 1`).
- **item**: Integer representing the item ID (0 to `num_items - 1`).
- **split**: Specifies whether the interaction belongs to the `"train"` or `"test"` set.
- **feature1, feature2, ...**: Additional edge features (e.g., rating, timestamp, etc.).
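
A quick way to check a CSV file against these requirements before handing it to the dataset class is sketched below. It relies only on the standard library; the function name `validate_interactions` is hypothetical, and the check treats any `split` value other than `"train"` or `"test"` as a problem, which is an assumption based on the column description above.

```python
import csv
import io

REQUIRED_COLUMNS = ("user", "item", "split")
ALLOWED_SPLITS = {"train", "test"}


def validate_interactions(csv_file):
    """Return a list of problems found in an interactions CSV (empty if none)."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["split"] not in ALLOWED_SPLITS:
            problems.append(f"line {line_no}: unexpected split {row['split']!r}")
        for col in ("user", "item"):
            # Empty or missing cells are treated as invalid IDs.
            if not (row[col] or "").isdigit():
                problems.append(f"line {line_no}: {col} is not a non-negative integer")
    return problems


# Toy file with one extra edge feature column ("rating").
sample = io.StringIO("user,item,split,rating\n0,3,train,5\n1,2,test,4\n")
problems = validate_interactions(sample)
```

The function accepts any open text file, so a real file can be checked with `open(path, newline="")`.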


### 2. Using `BaseRecDataset`

If your dataset meets the above structure, you can use our `BaseRecDataset` class to process it. 
This class provides:
- **Automatic loading and processing** of the CSV file.
- **Splitting data into train/test/validation**.
- **Handling user and item features**.
- **Creating an undirected graph for training edges**.


Subclasses should override:
- `custom_preprocessing(self, df)`: To handle dataset-specific modifications.
- `get_meta_info(self)`: To define metadata for edge features.
- `train_split_ratio()` and `valid_split_ratio()`: If custom split ratios are needed.
- `raw_file_names`: If the raw file has a different name.
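
Since the CSV only distinguishes `"train"` and `"test"`, the validation set has to be carved out of the training rows. The sketch below shows one way a ratio-based holdout could work. It is an illustration under assumptions, not the actual logic of `BaseRecDataset`: the helper `split_train_valid`, its random shuffling, and the default ratio are all hypothetical.

```python
import random


def split_train_valid(train_rows, valid_ratio=0.1, seed=0):
    """Hold out a fraction of the training rows for validation."""
    rows = list(train_rows)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(rows)
    n_valid = int(len(rows) * valid_ratio)
    # The first n_valid shuffled rows become validation, the rest stay in train.
    return rows[n_valid:], rows[:n_valid]
```

A time-aware dataset would more likely hold out the most recent interactions instead of a random sample.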

---

### 3. Example: Using `Ml1m` Dataset

```python
class Ml1m(BaseRecDataset):
    dataset_name = "ml-1m"
    
    def custom_preprocessing(self, df):
        # Rename "items" to "item" if needed.
        df = super().custom_preprocessing(df)
        return df
        
    def get_meta_info(self):
        # Define metadata for the dataset
        meta_info = preprocess_data.get_meta_info()
        meta_info["numerical_cols"] = ["rating", "timestamp"]
        meta_info["drop_cols"] = ["user", "item"]
        return meta_info
```

Here, `get_meta_info()` specifies:
- **`numerical_cols`**: `rating` and `timestamp` are numerical features.
- **`drop_cols`**: `user` and `item` should not be processed as edge features.

---
### 4. Metadata Structure (`get_meta_info()`)

`meta_info` provides information about feature types:
```python
def get_meta_info():
    meta_info = {}
    meta_info["numerical_cols"] = []  # Numerical features
    meta_info["date_cols"] = []  # Expected in "%Y-%m-%d %H:%M:%S"
    meta_info["categorical_cols"] = []  # Low-cardinality categorical features
    meta_info["str_cols"] = []  # High-cardinality strings (e.g., names)
    meta_info["drop_cols"] = []  # Features to drop (e.g., user and item IDs)
    meta_info["ls_of_cat_string"] = []  # List of categorical strings per entry
    meta_info["dont_touch"] = []  # Features that should remain unchanged
    return meta_info
```
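
When filling in `meta_info` for a new dataset, it is easy to forget a column or list it twice. The sketch below reports both cases. It assumes that each CSV column is meant to appear in exactly one of the lists, which the source does not state explicitly, and the helper name `check_meta_info` is hypothetical.

```python
META_KEYS = (
    "numerical_cols", "date_cols", "categorical_cols", "str_cols",
    "drop_cols", "ls_of_cat_string", "dont_touch",
)


def check_meta_info(columns, meta_info):
    """Return (unassigned, duplicated) column names for a meta_info dict."""
    seen = {}
    for key in META_KEYS:
        for col in meta_info.get(key, []):
            seen.setdefault(col, []).append(key)
    # Columns that appear in no list at all.
    unassigned = [c for c in columns if c not in seen]
    # Columns that appear in more than one list.
    duplicated = sorted(c for c, keys in seen.items() if len(keys) > 1)
    return unassigned, duplicated


# Metadata from the Ml1m example above.
ml1m_meta = {"numerical_cols": ["rating", "timestamp"], "drop_cols": ["user", "item"]}
result = check_meta_info(["user", "item", "rating", "timestamp"], ml1m_meta)
```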

---


## Adding Amazon Datasets

### **1. Download the Amazon Datasets**

The Amazon datasets can be downloaded from the following link:
[Google Drive - Amazon Datasets](https://drive.google.com/drive/folders/1a_u52mIEUA-1WrwsNZZa-aoGJcMmVugs)

---

### **2. Organize the Files**

Once downloaded, move the dataset files into the appropriate folder in our repository.

For each dataset, rename the files as follows:

#### **Fashion Dataset**
| Original Filename | New Filename |
|-------------------|-------------|
| `Fashion.txt` | `amazon_fashion.txt` |
| `CXTDictSasRec_Fashion.dat` | `amazon_fashion_ctxt.dat` |
| `Fashion_imgs.dat` | `amazon_fashion_feat_cat.dat` |

#### **Other Datasets**
Apply the same renaming pattern to the other datasets.
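
The pattern from the Fashion table can be written as a small helper. This is a sketch that assumes every other Amazon dataset ships with the same three-file naming scheme as Fashion; the function name `amazon_rename_map` is hypothetical.

```python
def amazon_rename_map(name):
    """Map original Amazon filenames to the renamed ones for one dataset."""
    key = name.lower()
    return {
        f"{name}.txt": f"amazon_{key}.txt",
        f"CXTDictSasRec_{name}.dat": f"amazon_{key}_ctxt.dat",
        f"{name}_imgs.dat": f"amazon_{key}_feat_cat.dat",
    }


fashion_map = amazon_rename_map("Fashion")
```

Calling it with `"Fashion"` reproduces the three rows of the table above.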

### **3. Move to the Correct Folder**

Ensure that the renamed files are placed in the dataset directory as specified in the repository configuration.



Now the Amazon datasets are properly organized and ready for use! 🚀