# Valeo: Image Classification: day/night/weather
## EDA pipeline of Night and Day

The goal here is to create an EDA pipeline for day/night classification.

## Loading the dataset

Load the images for a given weather type and data split (train, validation, or test).

In [2]:
import os
import numpy as np
from PIL import Image
import pandas as pd

In [3]:
def load_images(weather_type, data_type):
    """
    Load images from the weather_type/data_type folder
    :param weather_type: fog or night or rain or snow
    :type weather_type: String
    :param data_type: train or val or test or train_ref or val_ref or test_ref
    :type data_type: String
    :return: list of images and list of respective paths
    :rtype: Lists
    """
    data = []
    data_paths = []
    counter = 0
    path = '../input/acdc-dataset/dataset ACDC/rgb_anon/' + weather_type + '/' + data_type + '/'

    # For each GoPro directory, store every image and its path in data and data_paths respectively
    for directory_name in os.listdir(path):
        gopro_path = path + directory_name
        for image_name in os.listdir(gopro_path):
            image_path = gopro_path + "/" + image_name
            image = Image.open(image_path)
            data.append(image)
            data_paths.append(image_path)

            # Counter to see progression
            counter += 1
            if counter%100 == 0:
                print(str(counter) + " " + data_type + " images loaded")
    
    return data, data_paths
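
The later cells only use the returned paths and re-read each file with `cv2.imread`, so a path-only variant may be enough. The following is a minimal sketch under that assumption; the function name `list_image_paths`, the `root` parameter, and the extension filter are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def list_image_paths(root, weather_type, data_type):
    """Return sorted image paths under root/weather_type/data_type/<sequence>/."""
    split_dir = Path(root) / weather_type / data_type
    paths = []
    for sequence_dir in sorted(split_dir.iterdir()):
        if not sequence_dir.is_dir():
            continue  # skip stray files that sit next to the sequence folders
        for image_path in sorted(sequence_dir.iterdir()):
            # Assumed extension filter; adjust to the files actually present
            if image_path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
                paths.append(str(image_path))
    return paths
```

Sorting makes the order reproducible across runs, which `os.listdir` does not guarantee.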

Load the night subset into the variables `train_day`, `train_night`, `valid_day`, and `valid_night` (plus their path lists). The `*_ref` folders are used as the day images.

In [4]:
train_day, train_day_paths = load_images('night', 'train_ref')
train_night, train_night_paths = load_images('night', 'train')
valid_day, valid_day_paths = load_images('night', 'val_ref')
valid_night, valid_night_paths = load_images('night', 'val')

## EDA
Let's look at a sample day image and a sample night image.

In [5]:
import matplotlib.pyplot as plt
import cv2
%matplotlib inline

In [6]:
img_day = cv2.imread(str(train_day_paths[0]))
img_night = cv2.imread(str(train_night_paths[0]))

In [7]:
# Resizing image to height and width of 500
img_day = cv2.resize(img_day, (500,500))
img_night = cv2.resize(img_night, (500,500))

In [8]:
# Converting from BGR format to RGB format for visualization
day_rgb = cv2.cvtColor(img_day, cv2.COLOR_BGR2RGB)
night_rgb = cv2.cvtColor(img_night, cv2.COLOR_BGR2RGB)

In [9]:
# Visualizing images using matplotlib
fig, ax = plt.subplots(1,2,figsize=(10,15))
ax[0].imshow(day_rgb)
ax[0].set_title('Day')
ax[1].imshow(night_rgb)
ax[1].set_title('Night')

Images taken during the day are generally brighter than images taken at night. We can use this fact to build a simple baseline model.

For this, we need the average brightness of an image. The RGB representation is not ideal here, because brightness is spread across all three channels.

We can convert the image from the RGB colorspace to the Hue Saturation Value (HSV) colorspace.
The Value channel in HSV encodes the brightness of each pixel, so we use it to build a basic classifier.

In [10]:
# converting image to HSV colorspace
day_hsv = cv2.cvtColor(img_day, cv2.COLOR_BGR2HSV)
night_hsv = cv2.cvtColor(img_night, cv2.COLOR_BGR2HSV)
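
To make the link between Value and brightness concrete: for 8-bit images, the Value channel is the per-pixel maximum over the three colour channels. The sketch below reproduces that with NumPy alone; the helper name `value_channel` and the two sample pixels are made up for illustration.

```python
import numpy as np

def value_channel(image):
    """Return the HSV Value channel of an 8-bit RGB or BGR image.

    V is the per-pixel maximum over the colour channels,
    so the channel order does not matter.
    """
    return image.max(axis=-1)

# Two made-up pixels: a bright sky-like one and a dark one
pixels = np.array([[[200, 220, 255], [10, 5, 20]]], dtype=np.uint8)
v = value_channel(pixels)
print(v)         # [[255  20]]
print(v.mean())  # 137.5
```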

In [11]:
# Visualizing images using matplotlib
fig, ax = plt.subplots(1,2,figsize=(10,15))
ax[0].imshow(day_hsv)
ax[0].set_title('Day')
ax[1].imshow(night_hsv)
ax[1].set_title('Night')

This does not look very useful, since `imshow` renders the H, S, and V channels as if they were R, G, and B. Let's split the channels and visualize them separately.

In [12]:
# splitting channels of day and night hsv images
dh, ds, dv = cv2.split(day_hsv)
nh, ns, nv = cv2.split(night_hsv)

In [13]:
fig, ax = plt.subplots(2,3,figsize=(15,10))
# Top row: channels of the day image
ax[0][0].imshow(dh)
ax[0][0].set_title('Day: Hue')
ax[0][1].imshow(ds)
ax[0][1].set_title('Day: Saturation')
ax[0][2].imshow(dv)
ax[0][2].set_title('Day: Value')

# Bottom row: channels of the night image
ax[1][0].imshow(nh)
ax[1][0].set_title('Night: Hue')
ax[1][1].imshow(ns)
ax[1][1].set_title('Night: Saturation')
ax[1][2].imshow(nv)
ax[1][2].set_title('Night: Value')

It seems that the **Value** channel has higher pixel values where the image is bright.

## Baseline model (Average brightness)

Now compute the average brightness of the day and night images. We can then pick a brightness threshold to classify images.

In [14]:
def avg_brightness(paths):
    ''' Return the average brightness of each image in a dataset
    
    Args:
        paths (list of str): The paths of the images in the dataset
        
    Returns:
        avg_brightnesses (list of float): the average brightness of each image
    '''
    avg_brightnesses = []
    for curr_file in paths:
        img = cv2.imread(str(curr_file)) # reading img (BGR); returns None if unreadable
        if img is None:
            raise FileNotFoundError('Could not read image: ' + str(curr_file))
        img = cv2.resize(img, (500, 500)) # resizing image
        img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV) # converting to hsv
        avg_bright = np.mean(img[:, :, 2]) # mean of the Value channel of the HSV image
        avg_brightnesses.append(avg_bright) # appending to list
    return avg_brightnesses

In [15]:
# Average brightness (mean of the Value channel) of each training image
day_brightness = avg_brightness(train_day_paths)
night_brightness = avg_brightness(train_night_paths)

In [16]:
# calculating average brightness
day_avg_brightness = sum(day_brightness)/len(day_brightness)
night_avg_brightness = sum(night_brightness)/len(night_brightness)
day_avg_brightness, night_avg_brightness
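
One possible way to turn the two class means into a starting threshold is to take their midpoint. This is only a sketch and assumes the two brightness distributions are roughly symmetric around their means; it is not the rule this notebook uses. The helper name `midpoint_threshold` and the sample values are hypothetical.

```python
import numpy as np

def midpoint_threshold(day_values, night_values):
    """Return the midpoint between the mean day and mean night brightness."""
    return float((np.mean(day_values) + np.mean(night_values)) / 2.0)

# Made-up brightness values, for illustration only
print(midpoint_threshold([120.0, 140.0], [30.0, 50.0]))  # 85.0
```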

In [17]:
def ratio_over_thre(array, threshold):
    """Return the fraction of values in array that are strictly greater than threshold."""
    return np.mean(np.array(array) > threshold)

Visualize the distribution of brightness in the day and night images.

In [18]:
fig, ax = plt.subplots(1,2,figsize=(10,5))
ax[0].hist(day_brightness)
ax[0].set_title('Day')
ax[1].hist(night_brightness)
ax[1].set_title('Night')

Try a threshold of 90 on the average brightness and check how much of each distribution falls on the expected side: day images above it, night images below it.

In [19]:
threshold = 90.
ratio_over_90_day = ratio_over_thre(day_brightness, threshold)
ratio_under_90_night = 1 - ratio_over_thre(night_brightness, threshold)
print("The ratio over {:.2f} in day_brightness is: {:.1f}%; \nThe ratio under {:.2f} in night_brightness is: {:.1f}%"\
      .format(threshold, ratio_over_90_day*100, threshold, ratio_under_90_night*100))

In [20]:
fig, ax = plt.subplots(1,2,figsize=(10,5))
ax[0].hist(day_brightness)
ax[0].set_title('Day')
ax[0].axvline(90, color='red')
ax[1].hist(night_brightness)
ax[1].set_title('Night')
ax[1].axvline(90, color='red')
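
The decision rule implied by the threshold can be sketched as a small function that labels each image from its average brightness. The function name `classify_by_brightness` and the sample values are hypothetical; values exactly at the threshold are assumed to count as night.

```python
import numpy as np

def classify_by_brightness(brightness, threshold=90.0):
    """Label each average-brightness value 'day' if above threshold, else 'night'."""
    brightness = np.asarray(brightness, dtype=float)
    return np.where(brightness > threshold, "day", "night")

# Made-up brightness values, for illustration only
labels = classify_by_brightness([130.2, 95.0, 42.7, 90.0])
print(labels)  # ['day' 'day' 'night' 'night']
```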

## Validation

A simple function that takes the validation brightness values and a threshold, classifies the validation images, and returns the accuracy.

In [21]:
avg_brightness_val_day = avg_brightness(valid_day_paths)
avg_brightness_val_night = avg_brightness(valid_night_paths)
    
def validate(avg_b_val_d, avg_b_val_n, threshold=90.):
    """Return the accuracy of the brightness rule on the validation set."""
    total = len(avg_b_val_d) + len(avg_b_val_n) # total number of images in validation set
    
    # day images are correct when brighter than the threshold, night images otherwise
    corrects = np.sum(np.array(avg_b_val_d) > threshold) + np.sum(np.array(avg_b_val_n) <= threshold)
    accuracy = corrects/total # fraction of correctly classified images
    return accuracy

In [22]:
accuracy = validate(avg_brightness_val_day, avg_brightness_val_night, threshold=90.)
print('The accuracy with threshold 90.0 is: {:.1f}%'.format(accuracy*100.0))

Sweep thresholds from 70 to 100 in steps of 0.1 and record the validation accuracy for each.

In [23]:
valid_scores = []
thresholds = np.arange(70, 100, 0.1)
for thresh in thresholds:
    valid_scores.append(validate(avg_brightness_val_day, avg_brightness_val_night, threshold = thresh))

In [24]:
valid_scores_df = pd.DataFrame({
    'valid_score': pd.Series(valid_scores, index=thresholds)
})

In [25]:
val_scores = valid_scores_df.valid_score
max_val_score, max_val_score_id = val_scores.max(), val_scores.idxmax()
print('The max validation score is: {:.1f}%, when threshold is {:.1f}'.format(max_val_score*100, max_val_score_id))
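
The same sweep can also be written without a Python loop by using NumPy broadcasting. The sketch below assumes a night image counts as correct when its brightness is at or below the threshold; the function name `sweep_thresholds` and the sample values are hypothetical.

```python
import numpy as np

def sweep_thresholds(day_values, night_values, thresholds):
    """Return the accuracy of the brightness rule at every candidate threshold."""
    day = np.asarray(day_values, dtype=float)[:, None]      # shape (n_day, 1)
    night = np.asarray(night_values, dtype=float)[:, None]  # shape (n_night, 1)
    thresholds = np.asarray(thresholds, dtype=float)        # shape (n_thresholds,)
    # Broadcasting compares every image against every threshold at once
    correct = (day > thresholds).sum(axis=0) + (night <= thresholds).sum(axis=0)
    return correct / (len(day) + len(night))

# Made-up brightness values, for illustration only
candidate_thresholds = np.arange(70, 100, 10)  # 70, 80, 90
accuracies = sweep_thresholds([95.0, 110.0, 130.0], [40.0, 60.0, 85.0], candidate_thresholds)
best_threshold = candidate_thresholds[np.argmax(accuracies)]
print(best_threshold)  # 90
```

When several thresholds tie for the best accuracy, `np.argmax` returns the first of them.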

In [26]:
plt.figure(figsize=(8,4), tight_layout=True)
val_scores.plot() # the Series index (thresholds) is used as the x axis
plt.xlabel('Threshold')
plt.ylabel('Validation accuracy')
plt.title('Validation accuracy by threshold')
plt.show()
