## Task 3: Class-balanced training and test data split
-----------------------------------------------------------------------------------
Author: Rajesh Siraskar
Created: 14-Dec-2018
- 15-Dec-2018: Define approach
- 16-Dec-2018: Develop code for class balancing, and creating the tuples


**Instructions:**
- Function that performs a class-balanced random train/test split
- Parameter: fraction specifying training split
- Use sklearn.model_selection methods
- Output two lists of (filename, class) tuples
- The proportion of positive and negative instances in each list should be approx. equal

**Approach:**
- Use 'glob' module to get all positive and negative instance image files
- Create two separate lists
- Class balance them based on the length of the smaller list
- Convert to tuples with the class label added to file name
- Shuffle
- Draw N samples from both - we now have a class-balanced list
- Split both lists 70:30 into training and test sets
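
The instructions ask for `sklearn.model_selection`, while the cells below perform the split by hand. As a hedged alternative, the sketch below passes `stratify` to `train_test_split` so that each split keeps the class proportions of its input. The function name `stratified_split` and its `seed` parameter are hypothetical and not part of this notebook.

```python
import random
from sklearn.model_selection import train_test_split

def stratified_split(positive_files, negative_files, training_proportion=0.7, seed=None):
    """Undersample the larger class, then split with class-stratified sampling."""
    rng = random.Random(seed)
    n = min(len(positive_files), len(negative_files))

    # Randomly undersample each class to the size of the smaller one
    positives = rng.sample(list(positive_files), n)
    negatives = rng.sample(list(negative_files), n)

    files = positives + negatives
    labels = ['positive'] * n + ['negative'] * n

    # stratify=labels keeps the positive/negative proportion in both splits
    train_files, test_files, train_labels, test_labels = train_test_split(
        files, labels,
        train_size=training_proportion,
        stratify=labels,
        random_state=seed,
    )

    # Same (class_label, file_name) tuple layout used in the cells below
    return list(zip(train_labels, train_files)), list(zip(test_labels, test_files))
```

It would be called with the two file lists produced by `glob`, for example `stratified_split(positive_instances, negative_instances, 0.7)`.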

In [1]:
# Import modules
import glob
import numpy as np
import random

In [2]:
## Top level parameters
path_positive = 'images/training_positive_instances/'
path_negative = 'images/training_negative_instances/'

In [3]:
# Gather the .png file names for each class
positive_instances = glob.glob(path_positive + '*.png')
negative_instances = glob.glob(path_negative + '*.png')

# Get number of instances
print('Positive instances: ', len(positive_instances))
print('Negative instances: ', len(negative_instances))

# Minimum instances for class balancing
min_instances = min(len(positive_instances), len(negative_instances))
print('Minimum instances for class balancing: ', min_instances)

Positive instances:  1824
Negative instances:  2393
Minimum instances for class balancing:  1824


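Limit each list to `min_instances` by drawing a random sample, so that the larger class is not simply truncated in file order

In [4]:
positive_instances = random.sample(positive_instances, min_instances)
negative_instances = random.sample(negative_instances, min_instances)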

# Get number of instances
print('Positive instances: ', len(positive_instances))
print('Negative instances: ', len(negative_instances))

Positive instances:  1824
Negative instances:  1824


Classes are now balanced. Randomly shuffle the instances

In [5]:
random.shuffle(positive_instances)
random.shuffle(negative_instances)
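
`random.shuffle` draws from the global random state, so each run of the notebook yields a different order and therefore a different split. If reproducibility matters, one option is a dedicated seeded generator. This is a minimal sketch; the helper name `shuffled_copy` and the default seed are illustrative assumptions.

```python
import random

def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    result = list(items)       # copy, so the caller's list is not modified
    rng.shuffle(result)
    return result
```

Unlike `random.shuffle`, this returns a new list instead of shuffling in place, so the result has to be assigned back, for example `positive_instances = shuffled_copy(positive_instances)`.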

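Create a list of labels for each class and zip it with the file names to form tuples. Note that the label comes first, which is the reverse of the (filename, class) order given in the instructions:
    
    (class_label, file_name)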

In [6]:
positive_labels = min_instances*['positive']
negative_labels = min_instances*['negative']

class_balanced_positive_instances = list(zip(positive_labels, positive_instances))
class_balanced_negative_instances = list(zip(negative_labels, negative_instances))

Test print 3 tuples from each class:

In [7]:
print('Positive instances:')
print(class_balanced_positive_instances[0:3])
print('\n\n Negative instances:')
print(class_balanced_negative_instances[0:3])

Positive instances:
[('positive', 'images/training_positive_instances\\PED_T400_00122_001.png'), ('positive', 'images/training_positive_instances\\PED_T400_00298_001_R.png'), ('positive', 'images/training_positive_instances\\PED_T210_00088_001_B.png')]


 Negative instances:
[('negative', 'images/training_negative_instances\\PBG_X_T210_00009__V01.png'), ('negative', 'images/training_negative_instances\\BKG_T400_00080.png'), ('negative', 'images/training_negative_instances\\PBG_X_T210_00208__V00.png')]
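
The sample output above mixes `/` and `\\` within one path, which suggests the notebook ran on Windows, where `glob` joins the matched file name with the native separator. If consistent paths are wanted, `pathlib` is one way to get them. The sketch below is an assumption about how that could look; `list_image_files` is a hypothetical helper, and sorting is added only to make the listing order deterministic before any shuffling.

```python
from pathlib import Path

def list_image_files(folder, pattern='*.png'):
    """Return sorted matching file paths, written with forward slashes."""
    return sorted(p.as_posix() for p in Path(folder).glob(pattern))
```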


In [8]:
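def SplitInstancesIntoTrainingAndTestSets(path_positive, path_negative, file_type='*.png',
                                          training_proportion=0.7):
    """Return (training_set, test_set) lists of (class_label, file_name) tuples.

    Both classes are cut to the size of the smaller one before splitting;
    training_proportion is the fraction of instances placed in the training set.
    """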

    # Gather all file names matching file_type
    positive_instances = glob.glob(path_positive + file_type)
    negative_instances = glob.glob(path_negative + file_type)

    # Get number of instances
    print('Positive instances: ', len(positive_instances))
    print('Negative instances: ', len(negative_instances))

    # Minimum instances for class balancing
    min_instances = min (len(positive_instances), len(negative_instances))
    print('Minimum instances for class balancing: ', min_instances)
    
    # Randomly shuffle first, so that truncating the larger class keeps a random subset
    random.shuffle(positive_instances)
    random.shuffle(negative_instances)

    positive_instances = positive_instances[0:min_instances]
    negative_instances = negative_instances[0:min_instances]

    # Check that classes are balanced
    if (len(positive_instances) == len(negative_instances)):
        print('CHECK: Classes balanced. Number of instances in both: ', len(positive_instances))
    
    positive_labels = min_instances*['positive']
    negative_labels = min_instances*['negative']

    class_balanced_positive_instances = list(zip(positive_labels, positive_instances))
    class_balanced_negative_instances = list(zip(negative_labels, negative_instances))
    
    # Build the complete data set by concatenating both lists
    data_set = class_balanced_positive_instances + class_balanced_negative_instances
    
    total_instances = len(data_set)
    print('\nSPLIT DATA SET\n- Total instances:', total_instances)
    
    # Shuffle the data
    random.shuffle(data_set)
    
    # Split the data set into training and test
    len_training = int(training_proportion * total_instances)
    training_set = data_set[0:len_training]
    test_set = data_set[len_training:total_instances]
    
    print('- Training instances:', len(training_set))
    print('- Test instances:', len(test_set))
    
    return training_set, test_set

In [9]:
train, test = SplitInstancesIntoTrainingAndTestSets(path_positive, path_negative)

# Test to see proportion of positives and negatives
# (counters are named n_pos/n_neg so they do not shadow the numpy alias np)
l_classes = [i[0] for i in train]
n_pos = l_classes.count('positive')
n_neg = l_classes.count('negative')

print('\n Training set: Ratio of positive/negative: {:.2f}'.format(n_pos/n_neg))

l_classes = [i[0] for i in test]
n_pos = l_classes.count('positive')
n_neg = l_classes.count('negative')

print('\n Test set: Ratio of positive/negative: {:.2f}'.format(n_pos/n_neg))

Positive instances:  1824
Negative instances:  2393
Minimum instances for class balancing:  1824
CHECK: Classes balanced. Number of instances in both:  1824

SPLIT DATA SET
- Total instances: 3648
- Training instances: 2553
- Test instances: 1095

 Training set: Ratio of positive/negative: 1.02

 Test set: Ratio of positive/negative: 0.94
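
The ratios above drift from 1.00 because the combined list is shuffled and then cut at a single index, which leaves the class counts in each part to chance. One way to get an equal split is to cut each class list separately and merge afterwards. The following is a sketch under that assumption, not the method used above; `split_per_class` and `class_counts` are hypothetical helper names.

```python
import random
from collections import Counter

def split_per_class(positive_tuples, negative_tuples, training_proportion=0.7, seed=None):
    """Split each class list at the same fraction, then merge and shuffle."""
    rng = random.Random(seed)
    training_set, test_set = [], []
    for class_tuples in (positive_tuples, negative_tuples):
        shuffled = list(class_tuples)
        rng.shuffle(shuffled)
        cut = int(training_proportion * len(shuffled))
        training_set.extend(shuffled[:cut])
        test_set.extend(shuffled[cut:])
    # Mix the classes so neither set is ordered by label
    rng.shuffle(training_set)
    rng.shuffle(test_set)
    return training_set, test_set

def class_counts(instances):
    """Count instances per class label for (class_label, file_name) tuples."""
    return Counter(label for label, _ in instances)
```

When both input lists have the same length, `class_counts` reports identical positive and negative counts for the training set and for the test set.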
