# Data and EDA

We have downloaded our datasets, but we still need to separate the data we want from everything else they contain.

### Datasets

#### Fast Food Classification V2
```
- test
|   |
|   - folders of class
|              |
|              - images of class.
- train
|   |
|   - folders of class
|              |
|              - images of class.
- valid
|   |
|   - folders of class
|              |
|              - images of class.
```

#### Oxford Pets
```
- annotations
|   |
|   - information about the dataset.
|              
- images
|   |
|   - the images of the dataset (title has label in it).
|              
- valid
|   |
|   - folders of class
|              |
|              - images of class.
```

#### Stanford Dogs
```
- annotations
|   |
|   - Annotation
|              |
|              - folders of each class
|                        |
|                        - annotation information for each image in the classes.
|
- images
|   |
|   - Images
|        |
|        - folders of each class 'c'
|                   |
|                   - images of dogs belonging to class 'c'.
```

We are going to use PyTorch datasets, specifically torchvision's `ImageFolder`, so we eventually need to get our data into the format `root/class_folders/images`.
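
To make the target layout concrete, here is a minimal sketch that mimics how `ImageFolder` derives labels: class folders are sorted by name and given consecutive indices. `find_classes` below is a standard-library stand-in written for illustration, not torchvision's own code, and the folder names are made up.

```python
import os
import tempfile
from pathlib import Path


def find_classes(root: Path) -> dict:
    """Map each class folder under `root` to an index, in sorted name order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Build a throwaway directory in the root/class_folders/images layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for class_name in ("pug", "beagle", "donut"):
        (root / class_name).mkdir()
        (root / class_name / f"{class_name}_1.jpg").touch()

    class_to_idx = find_classes(root)
    print(class_to_idx)  # {'beagle': 0, 'donut': 1, 'pug': 2}
```

Because the label comes only from the folder name, any image placed in the wrong folder is silently mislabeled, which is why the sorting steps below matter.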

## Creating Our Dataset

In [1]:
from pathlib import Path
import os
import shutil

In [2]:
# Data directories.
root_data_dir = Path.cwd() / "original_data"
fast_food = root_data_dir / "Fast Food Classification V2"
oxford_pets = root_data_dir / "OxfordPets"
stanford_dogs = root_data_dir / "StanfordDogs"


#### Sorting Oxford Pet Dataset

Let's now sort out how to collect our samples from the Oxford Pets dataset. This dataset contains both cats and dogs. Luckily, the makers gave all the dog images names that start with a lowercase letter and all the cat images names that start with an uppercase letter.

We are going to scan the directory and, whenever a filename starts with a lowercase letter, create a folder for that dog breed and fill it with all of the images of that breed.

In [7]:
def collect_class_names(source_dir):
    # Dog images start with a lowercase letter; cat images start with uppercase.
    class_names = set()
    for dir_object in os.scandir(source_dir):
        if dir_object.name[0].islower():
            # "american_bulldog_12.jpg" -> "american_bulldog"
            name = dir_object.name.split('_')[:-1]
            class_names.add("_".join(name))
    # Return each class once, so it is only processed once in create_dirs.
    return sorted(class_names)

def create_dirs(class_names:list, source_folder:Path, destination_folder:Path):
    for class_name in class_names:
        # Create the class folder (and the destination root) if it is missing.
        class_dir = destination_folder / class_name
        class_dir.mkdir(parents=True, exist_ok=True)

        for img_path in os.listdir(source_folder):
            # Compare the name before the trailing image number, not a substring.
            if img_path.rsplit('_', 1)[0] == class_name:
                shutil.move(str(source_folder / img_path), str(class_dir))

source_dir = oxford_pets / "images"
destination_dir = Path("./data/")
class_names = collect_class_names(source_dir)
create_dirs(class_names, source_dir, destination_dir)
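
The filename convention described above can be checked in isolation before moving any files. The sketch below is only an illustration: `parse_pet_filename` is a hypothetical helper, and the filenames are examples in the dataset's `breed_number.jpg` style rather than a listing of real files.

```python
def parse_pet_filename(filename: str) -> tuple:
    """Split an Oxford Pets image name into (species, breed)."""
    stem = filename.rsplit(".", 1)[0]   # drop the file extension
    breed = stem.rsplit("_", 1)[0]      # drop the trailing image number
    # Lowercase first letter means dog, uppercase means cat.
    species = "dog" if breed[0].islower() else "cat"
    return species, breed


for name in ("american_bulldog_12.jpg", "Maine_Coon_7.jpg", "pug_101.jpg"):
    print(name, "->", parse_pet_filename(name))
```

Running a check like this over the image folder first is a cheap way to confirm that only dog breeds will be collected.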




Okay, so we have now got some of our dog data into class folders. For each Stanford Dogs class, we can now either add its images to an existing folder (if we already have data for that breed) or create a new folder and add them there.


#### Sorting Stanford Dogs Dataset

In [14]:
for class_folder in os.scandir(stanford_dogs / "images/Images"):

    # skip stray files such as .DS_Store; only class folders hold images
    if not class_folder.is_dir():
        continue

    # getting the class name: drop the ID before the first '-', lowercase the rest
    class_name = class_folder.name.lower().split('-')[1:]
    class_name = "_".join(class_name)

    # checking if the class dir already exists, if not, make one
    class_dir = Path('./data/') / class_name
    class_dir.mkdir(parents=True, exist_ok=True)

    for img in os.listdir(class_folder.path):
        img_path = os.path.join(class_folder.path, img)
        shutil.move(img_path, str(class_dir))
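
The merge-or-create decision depends entirely on how the Stanford folder names are normalized, so it can help to look at that step on its own. This is a sketch under assumptions: `stanford_class_name` is a hypothetical helper that repeats the string handling from the loop above, and the folder names and existing classes are illustrative inputs.

```python
def stanford_class_name(folder_name: str) -> str:
    """Turn a folder name like 'n02085620-Chihuahua' into 'chihuahua'."""
    parts = folder_name.lower().split("-")[1:]  # drop the leading ID
    return "_".join(parts)                      # remaining hyphens become underscores


# Illustrative inputs, not a listing of the real datasets.
existing_classes = {"chihuahua", "pug", "beagle"}
stanford_folders = [
    "n02085620-Chihuahua",
    "n02088094-Afghan_hound",
    "n02098105-soft-coated_wheaten_terrier",
]

for folder in stanford_folders:
    name = stanford_class_name(folder)
    action = "merge into existing folder" if name in existing_classes else "create new folder"
    print(f"{folder} -> {name}: {action}")
```

Note that matching is by exact name, so a breed spelled differently in the two datasets would end up in two separate folders rather than being merged.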


Okay, so we have all of our classes together: the different breeds of dogs and the photos of donuts. We have a lot of classes, a lot more than I originally expected, so let's do some filtering and remove some classes, then we will do some EDA.
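
One possible starting point for that filtering is to count the images in each class folder and keep only the classes above some threshold. This is only a sketch: `count_images`, `classes_to_keep`, and the `min_images` threshold are assumptions made for illustration, and the demo runs on a throwaway directory rather than the real `./data/` folder.

```python
import tempfile
from pathlib import Path


def count_images(root: Path) -> dict:
    """Return {class_name: number_of_files} for each class folder under root."""
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(root.iterdir())
        if class_dir.is_dir()
    }


def classes_to_keep(counts: dict, min_images: int) -> list:
    """Keep classes with at least `min_images` files (the threshold is an assumption)."""
    return [name for name, n in counts.items() if n >= min_images]


# Demo on a temporary directory with made-up classes and file counts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for class_name, n in {"pug": 3, "beagle": 1, "donut": 2}.items():
        (root / class_name).mkdir()
        for i in range(n):
            (root / class_name / f"{class_name}_{i}.jpg").touch()

    counts = count_images(root)
    kept = classes_to_keep(counts, min_images=2)
    print(counts)  # {'beagle': 1, 'donut': 2, 'pug': 3}
    print(kept)    # ['donut', 'pug']
```

The same per-class counts are also a natural first plot for the EDA, since they show how balanced the classes are.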