# Automated Smile Detection Part 1: Dataset Organization
This first notebook walks through downloading the dataset and creating the training, validation, and test sets used for modeling in later notebooks. The original dataset was downloaded from the [SMILEsmileD repository](https://github.com/hromi/SMILEsmileD). It consists of 3,690 images of smiling faces and 9,476 images of non-smiling faces.

# Imports

In [38]:
# Imports
import os
import re
import shutil

%load_ext blackcellmagic

The blackcellmagic extension is already loaded. To reload it, use:
  %reload_ext blackcellmagic


# Establish Paths for Splits  
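I follow the folder structure laid out in Chapter 5 of *Deep Learning with Python* by Francois Chollet [1]: one directory per split (train, validation, test), each containing a `positive` and a `negative` subdirectory.  
Folder Structure:  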
![](img/01_folder_structure.png)

In [67]:
# Original Data 
ORIGINAL_DATASET = "/users/jasonadam/github/SMILEsmileD/SMILEs"
NEGATIVES_PATH = os.path.join(ORIGINAL_DATASET, "negatives/negatives7")
POSITIVES_PATH = os.path.join(ORIGINAL_DATASET, "positives/positives7")

# New Paths for Splits
NEW_BASE_TRAIN = "data/train"
NEW_BASE_VALIDATE = "data/validation"
NEW_BASE_TEST = "data/test"

# Positive Samples (i.e. Smiles)
POSITIVE_TRAIN = os.path.join(NEW_BASE_TRAIN, "positive")
POSITIVE_VALIDATE = os.path.join(NEW_BASE_VALIDATE, "positive")
POSITIVE_TEST = os.path.join(NEW_BASE_TEST, "positive")

# Negative Samples (i.e. Not Smiling)
NEGATIVE_TRAIN = os.path.join(NEW_BASE_TRAIN, "negative")
NEGATIVE_VALIDATE = os.path.join(NEW_BASE_VALIDATE, "negative")
NEGATIVE_TEST = os.path.join(NEW_BASE_TEST, "negative")
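
The copy steps later in this notebook use `shutil.copyfile`, which does not create missing directories, so the split folders have to exist before any files are copied. The notebook does not show how they were created. One possible way is sketched below; `make_split_dirs` is a hypothetical helper (not part of the original notebook), and the demo runs in a temporary directory so it has no side effects. In this notebook the base directory would be `"data"`.

```python
import os
import tempfile


def make_split_dirs(
    base_dir,
    splits=("train", "validation", "test"),
    labels=("positive", "negative"),
):
    """Create base_dir/<split>/<label> for every combination and return the paths."""
    created = []
    for split in splits:
        for label in labels:
            path = os.path.join(base_dir, split, label)
            # exist_ok=True makes the call safe to re-run
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# Demo in a throwaway directory (assumption: in the notebook, base_dir would be "data")
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_split_dirs(os.path.join(tmp, "data"))
    all_exist = all(os.path.isdir(d) for d in dirs)
    print(len(dirs), all_exist)  # 6 True
```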

In [68]:
# Check Total Files
print("There are {} negative images.".format(len(os.listdir(NEGATIVES_PATH))))
print("There are {} positive images.".format(len(os.listdir(POSITIVES_PATH))))

There are 9476 negative images.
There are 3690 positive images.


# Split Into Train, Validation, Test  
## Collect Filenames & Filter Non-JPG Files

In [69]:
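# Escape the dot so the pattern matches a literal ".jpg" suffix,
# not any character followed by "jpg"
jpg_pattern = re.compile(r"\.jpg$")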

NEG_FILES = [
    os.path.join(NEGATIVES_PATH, i)
    for i in os.listdir(NEGATIVES_PATH)
    if jpg_pattern.search(i) is not None
]

POS_FILES = [
    os.path.join(POSITIVES_PATH, i)
    for i in os.listdir(POSITIVES_PATH)
    if jpg_pattern.search(i) is not None
]
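
Note that `os.listdir` returns entries in arbitrary order, so slicing `POS_FILES` and `NEG_FILES` by index gives a split that depends on whatever order the filesystem happens to report. If a random but repeatable split is wanted, one option (an assumption on my part, not something the original notebook does) is to sort the list and then shuffle it with a fixed seed. `shuffled_copy` and the seed value are hypothetical.

```python
import random


def shuffled_copy(paths, seed=42):
    """Return a new list with the paths in a reproducible random order."""
    # Sort first so the result does not depend on the input order
    ordered = sorted(paths)
    # A dedicated Random instance avoids touching the global random state
    random.Random(seed).shuffle(ordered)
    return ordered


example = ["c.jpg", "a.jpg", "b.jpg", "d.jpg"]
first = shuffled_copy(example)
second = shuffled_copy(list(reversed(example)))
print(first == second)  # True: same files and same seed give the same order
```

Applied here, `POS_FILES = shuffled_copy(POS_FILES)` before slicing would keep the split stable across runs and machines.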

## Split Positive Files

In [70]:
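# Split Positives
# Note: the destination directories must already exist;
# shutil.copyfile does not create them.

# Train: first 2,690 positives
for f in POS_FILES[0:2690]:
    dst_file = os.path.basename(f)
    shutil.copyfile(src=f, dst=os.path.join(POSITIVE_TRAIN, dst_file))


# Validation: next 500 positives
for f in POS_FILES[2690:3190]:
    dst_file = os.path.basename(f)
    shutil.copyfile(src=f, dst=os.path.join(POSITIVE_VALIDATE, dst_file))


# Test: final 500 positives
for f in POS_FILES[3190:3690]:
    dst_file = os.path.basename(f)
    shutil.copyfile(src=f, dst=os.path.join(POSITIVE_TEST, dst_file))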

## Split Negative Files

In [71]:
# Split Negatives
# Only the first 3,690 of the 9,476 negatives are used,
# matching the number of positives.

# Train: first 2,690 negatives
for f in NEG_FILES[0:2690]:
    dst_file = os.path.basename(f)
    shutil.copyfile(src=f, dst=os.path.join(NEGATIVE_TRAIN, dst_file))


# Validation: next 500 negatives
for f in NEG_FILES[2690:3190]:
    dst_file = os.path.basename(f)
    shutil.copyfile(src=f, dst=os.path.join(NEGATIVE_VALIDATE, dst_file))


# Test: next 500 negatives
for f in NEG_FILES[3190:3690]:
    dst_file = os.path.basename(f)
    shutil.copyfile(src=f, dst=os.path.join(NEGATIVE_TEST, dst_file))
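
The six loops above all follow the same pattern: take a slice of a file list and copy it into one destination folder. As a rough sketch, that pattern could be folded into a single helper. `copy_split` is a hypothetical name, and the demo uses ten empty placeholder files in a temporary directory rather than the real images.

```python
import os
import shutil
import tempfile


def copy_split(files, bounds, dest_dirs):
    """Copy files[start:stop] into the matching destination directory.

    bounds is a list of (start, stop) index pairs, one per directory.
    """
    for (start, stop), dest in zip(bounds, dest_dirs):
        os.makedirs(dest, exist_ok=True)  # copyfile needs the folder to exist
        for src in files[start:stop]:
            shutil.copyfile(src, os.path.join(dest, os.path.basename(src)))


# Demo: 10 empty placeholder files split 6 / 2 / 2
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "src")
    os.makedirs(src_dir)
    files = []
    for i in range(10):
        path = os.path.join(src_dir, "img_{:02d}.jpg".format(i))
        open(path, "w").close()
        files.append(path)

    dests = [os.path.join(tmp, name) for name in ("train", "validation", "test")]
    copy_split(files, [(0, 6), (6, 8), (8, 10)], dests)
    counts = [len(os.listdir(d)) for d in dests]
    print(counts)  # [6, 2, 2]
```

With the notebook's variables, the positive split would correspond to `copy_split(POS_FILES, [(0, 2690), (2690, 3190), (3190, 3690)], [POSITIVE_TRAIN, POSITIVE_VALIDATE, POSITIVE_TEST])`.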

## Validate Files Were Copied  
I limited the number of negative images to the number of positive images (3,690) so that the two classes are evenly balanced for training. The counts below confirm that each split folder contains the expected number of files.
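
The slice boundaries used in the copy loops follow directly from that choice: cap each class at the size of the smaller class, reserve 500 images each for validation and test, and use the remainder for training. The arithmetic can be sketched as follows; `balanced_bounds` is a hypothetical helper, not part of the original notebook.

```python
def balanced_bounds(n_positive, n_negative, n_validation, n_test):
    """Return (start, stop) slices for train/validation/test, capped at the smaller class."""
    n_per_class = min(n_positive, n_negative)
    n_train = n_per_class - n_validation - n_test
    if n_train <= 0:
        raise ValueError("validation and test sizes leave no training data")
    stops = [n_train, n_train + n_validation, n_per_class]
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


bounds = balanced_bounds(3690, 9476, 500, 500)
print(bounds)  # [(0, 2690), (2690, 3190), (3190, 3690)]
```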

In [77]:
print("-----POSITIVE-----")
print(len(os.listdir(POSITIVE_TRAIN)))
print(len(os.listdir(POSITIVE_VALIDATE)))
print(len(os.listdir(POSITIVE_TEST)))

print("-----NEGATIVE-----")
print(len(os.listdir(NEGATIVE_TRAIN)))
print(len(os.listdir(NEGATIVE_VALIDATE)))
print(len(os.listdir(NEGATIVE_TEST)))

-----POSITIVE-----
2690
500
500
-----NEGATIVE-----
2690
500
500
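
Matching counts show that the right number of files landed in each folder, but not that the splits are disjoint. A further check, sketched here under the assumption that filenames are unique within a class, is to confirm that no filename appears in more than one split. `split_overlap` is a hypothetical helper, and the demo uses placeholder files in a temporary directory.

```python
import os
import tempfile


def split_overlap(dir_a, dir_b):
    """Return the set of filenames present in both directories."""
    return set(os.listdir(dir_a)) & set(os.listdir(dir_b))


# Demo: "b.jpg" is deliberately placed in both folders
with tempfile.TemporaryDirectory() as tmp:
    train_dir = os.path.join(tmp, "train")
    test_dir = os.path.join(tmp, "test")
    os.makedirs(train_dir)
    os.makedirs(test_dir)
    for name in ("a.jpg", "b.jpg"):
        open(os.path.join(train_dir, name), "w").close()
    for name in ("b.jpg", "c.jpg"):
        open(os.path.join(test_dir, name), "w").close()

    overlap = split_overlap(train_dir, test_dir)
    print(sorted(overlap))  # ['b.jpg']
```

For this notebook, an empty result from `split_overlap(POSITIVE_TRAIN, POSITIVE_TEST)` (and likewise for the other pairs) would indicate that no image leaked between splits.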
