# Custom Datasets

So far, we have only used the MNIST dataset, which is easily accessible through torchvision. What do we do when we have our own data which we want to use with PyTorch?

In this notebook, we will convert our own raw data into PyTorch datasets that can be processed by our PyTorch models.

## What should our dataset be able to do?
### `__getitem__`
Our dataset should be a collection of examples. We should be able to index it like `my_dataset[3]` to get the example at position 3. The `__getitem__` method defines how the dataset is indexed: it takes the example index as an argument and returns the example at that index.

`my_dataset[2]` is equivalent to `my_dataset.__getitem__(2)`

### `__len__`
The `__len__` method must return the number of examples in the dataset.

`len(my_dataset)` is equivalent to `my_dataset.__len__()`
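
Both methods can be tried out on a plain Python class before PyTorch is involved. The sketch below uses a made-up `SquaresDataset` (a hypothetical name, used only for illustration) whose example at position `i` is the pair `(i, i ** 2)`.

```python
class SquaresDataset:
    """Toy dataset (hypothetical): the example at index i is (i, i squared)."""

    def __init__(self, n_examples):
        self.n_examples = n_examples

    def __getitem__(self, idx):
        # Called whenever the dataset is indexed, e.g. my_dataset[3]
        if not 0 <= idx < self.n_examples:
            raise IndexError(idx)
        return idx, idx ** 2

    def __len__(self):
        # Called by len(my_dataset)
        return self.n_examples


my_dataset = SquaresDataset(5)
print(my_dataset[3])    # same call as my_dataset.__getitem__(3)
print(len(my_dataset))  # same call as my_dataset.__len__()
```

Raising `IndexError` for out-of-range indices is a design choice here: it lets a plain `for example in my_dataset` loop know when to stop.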

### It should also inherit from `torch.utils.data.Dataset`
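Inheriting from `Dataset` marks our class as a map-style dataset, the interface that other torch utilities such as the `DataLoader` expect. The base class does not implement the methods for us: if we forget to define `__getitem__`, indexing the dataset raises a `NotImplementedError`.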
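
As a rough sketch of the three requirements together, here is a toy dataset that inherits from `Dataset`. The class name `RandomPointsDataset` and its random data are assumptions made for illustration and are not part of this notebook's data.

```python
import torch
from torch.utils.data import Dataset


class RandomPointsDataset(Dataset):
    """Hypothetical toy dataset: random 2-feature points with a scalar label."""

    def __init__(self, n_examples=10):
        # Seeded generator so the toy data is the same on every run
        generator = torch.Generator().manual_seed(0)
        self.features = torch.rand(n_examples, 2, generator=generator)
        # The label is just the sum of the two features
        self.labels = self.features.sum(dim=1)

    def __getitem__(self, idx):
        # Return one (features, label) pair for the given position
        return self.features[idx], self.labels[idx]

    def __len__(self):
        return len(self.labels)


toy_dataset = RandomPointsDataset()
features, label = toy_dataset[3]
print(isinstance(toy_dataset, Dataset), len(toy_dataset), features.shape)
```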

## Dataset 1: The Auto MPG Dataset

This dataset contains 398 examples of cars with 7 numerical features and their corresponding miles per gallon (MPG) as a label.

In [3]:
import torch
from torch.utils.data import Dataset
import pandas as pd

The `utils` module in this folder contains a function which parses and cleans the data, then returns it as a pandas dataframe.

In [4]:
from utils import get_auto_mpg_data

auto_mpg_data = get_auto_mpg_data()
auto_mpg_data.head()

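|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 |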


Without copying the code below, your job is to convert the above dataframe into a PyTorch dataset.

In [6]:
from torch.utils.data import Dataset

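class AutoMPGDataset(Dataset):
    def __init__(self):
        self.features = get_auto_mpg_data()
        # pop removes the 'mpg' column from the features and returns it as the labels
        self.labels = self.features.pop('mpg')

    def __getitem__(self, idx):
        # Use positional indexing (.iloc) for both features and labels, so lookups
        # still work if the dataframe index is not 0..N-1 after cleaning
        features = torch.tensor(self.features.iloc[idx].to_numpy())
        label = torch.tensor(self.labels.iloc[idx])
        return features, label

    def __len__(self):
        # len is a built-in function, not a method on self
        return len(self.labels)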

dset = AutoMPGDataset()
dset.labels

dset[0]

(tensor([8.0000e+00, 3.0700e+02, 1.3000e+02, 3.5040e+03, 1.2000e+01, 7.0000e+01,
         1.0000e+00], dtype=torch.float64),
 tensor(18., dtype=torch.float64))

A common way that we might then use this dataset would be to create a torch `DataLoader` from it.

In [None]:
from torch.utils.data import DataLoader
my_dataloader = ...  # TODO: use the dataset to create a DataLoader
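
One possible shape for the exercise above, sketched on stand-in data so that it runs without the `utils` module. The random tensors, the `TensorDataset` wrapper, and the `batch_size=8` and `shuffle=True` settings are all assumptions chosen for illustration. With the dataset from this notebook, the equivalent call would be `DataLoader(dset, batch_size=8, shuffle=True)`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data with the same layout as the Auto MPG examples:
# 7 features per car and one label (random values, not the real data).
stand_in_features = torch.rand(20, 7)
stand_in_labels = torch.rand(20)
stand_in_dataset = TensorDataset(stand_in_features, stand_in_labels)

# batch_size and shuffle are illustrative choices, not required values
example_dataloader = DataLoader(stand_in_dataset, batch_size=8, shuffle=True)

# Each iteration yields one batch of stacked features and labels
for batch_features, batch_labels in example_dataloader:
    print(batch_features.shape, batch_labels.shape)
```

With 20 examples and a batch size of 8, the loader yields three batches, and the last one holds only the 4 remaining examples.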

## Notebook complete 

