# 1.EDA pipeline of Night and Day

The goal here is to create a EDA pipeline to do day/night classification.

## 1.1 Loading the dataset
---

Load the dataset from the weather type and data type (train \ test \ validation)

In [1]:
import os
import numpy as np
import pandas as pd

from PIL import Image
import seaborn as sns
import matplotlib.pyplot as plt
import cv2
from tqdm import tqdm

%matplotlib inline

In [2]:
def combine_into_df(data, path):
    ''' Combine the path and data into a dataframe
    
    Args:
        data (PIL.PngImagePlugin.PngImageFile): the pixel data of images
        path (string): the store path of each image
    
    Returns:
        pd.DataFrame: a DataFrame composed of image data and its path
    '''
    df = pd.DataFrame({'img': pd.Series(data), 'path': pd.Series(path)})
    return df


def load_images(weather_type, data_type):
    """
    Load images from the weather_type/data_type folder
    :param weather_type: fog or night or rain or snow
    :type weather_type: String
    :param data_type: train or val or test or train_ref or val_ref or test_ref
    :type data_type: String
    :return: dataframe, with columns = ['img', 'path']
    :rtype: pd.DataFrame
    """
    data = []
    data_paths = []
    counter = 0
    path = '../input/acdc-dataset/dataset ACDC/rgb_anon/' + weather_type + '/' + data_type + '/'

    # For each Gopro directory, for each image, store the image and its path in train and train_paths respectively
    for directory_name in os.listdir(path):
        gopro_path = path + directory_name
        for image_name in os.listdir(gopro_path):
            image_path = gopro_path + "/" + image_name
            image = Image.open(image_path)
            data.append(image)
            data_paths.append(image_path)

            # Counter to see progression
            counter += 1
            if counter%100 == 0:
                print(str(counter) + " " + data_type + " images loaded")
                
    df = combine_into_df(data, data_paths)
    
    return df

Load the dataset of night, respectively set as varaibles: `train_day`, `train_night`, `valid_day`, `valid_night` 

In [3]:
train_day_df = load_images('night', 'train_ref')
train_night_df = load_images('night', 'train')
val_day_df = load_images('night', 'val_ref')
val_night_df = load_images('night', 'val')
test_day_df = load_images('night', 'test_ref')
test_night_df = load_images('night', 'test')
len_train, len_val, len_test = len(train_day_df) + len(train_night_df), len(val_day_df) + len(val_night_df), len(test_day_df) + len(test_night_df) 

## 1.2 EDA
---
Let's see some sample day and night images

In [4]:
img_day = cv2.imread(train_day_df.path[0])
img_night = cv2.imread(train_night_df.path[0])

# Resizing image to height and width b 500
img_day = cv2.resize(img_day, (500,500))
img_night = cv2.resize(img_night, (500,500))

In [5]:
# Converting from BGR format to RGB format for visualization
day_rgb = cv2.cvtColor(img_day, cv2.COLOR_BGR2RGB)
night_rgb = cv2.cvtColor(img_night, cv2.COLOR_BGR2RGB)

In [6]:
# Visualizing images using matplotlib
fig, ax = plt.subplots(1,2,figsize=(10,15))
ax[0].imshow(day_rgb)
ax[0].set_title('Day')
ax[1].imshow(night_rgb)
ax[1].set_title('Night')

Images taken during day are generally brighter than images taken during night. We can use this fact to build a simple baseline model.

For this, we need to get the average brightness in an image. RGB image does not help much in this case.

We can convert the image from RGB colorspace to **Hue Saturation Value (HSV) colorspace**.
The Value in HSV indicates brightness at different positions. 

Therefore, utilize image from **HSV** colorspace, and split the 3 channels and visualize them, then build a basic classifier.

In [7]:
# converting image to HSV colorspace
day_hsv = cv2.cvtColor(img_day, cv2.COLOR_BGR2HSV)
night_hsv = cv2.cvtColor(img_night, cv2.COLOR_BGR2HSV)

# splitting channels of day and night hsv images
dh, ds, dv = cv2.split(day_hsv)
nh, ns, nv = cv2.split(night_hsv)

In [10]:
fig, ax = plt.subplots(2,3,figsize=(15,10))
ax[0][0].imshow(dh)
ax[0][0].set_title('Hue')
ax[0][1].imshow(ds)
ax[0][1].set_title('Saturation')
ax[0][2].imshow(dv)
ax[0][2].set_title('Value')

ax[1][0].imshow(nh)
ax[1][0].set_title('Hue')
ax[1][1].imshow(ns)
ax[1][1].set_title('Saturation')
ax[1][2].imshow(nv)
ax[1][2].set_title('Value')

It seems like that the **Value** channel has higher pixel values where image is bright.

# 2. Baseline model (Average brightness)

Now find average brightness of day and night images, and we can use this as threshold to classify images.

In [11]:
def add_avg_brightness(df):
    ''' Return average brightness of each image in a dataset
    
    Args:
        path (string): The path of the dataset
        
    Returns:
        list: the average brightness of each image
    '''
    avg_brightnesses = np.empty((len(df), 1))
    for index in tqdm(range(len(df))):
        img = cv2.imread(df.iloc[index].path) # reading img 
        img = cv2.resize(img, (500, 500)) # resizing image
        img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV) # converting to hsv
        avg_bright = np.mean(img[:, :, 2]) # calculating average value per pixel of Value channel from HSV image
        avg_brightnesses[index] = avg_bright # appending to array, numpy operation is faster than list operation
    df['avg_b'] = avg_brightnesses

## 2.1 Calculate accuracy and get the maximum
---
A simple function that takes in threshold as input, classifies images in train/validation/test set and returns the accuracy.

In [12]:
def get_accuracy(avg_b_day, avg_b_night, threshold=90.):
    ''' Calculate the accuracy of brightness of day and night
    
    Args:
        avg_b_day (list of float64): the average brightness of day (train set /val set /test set)
        avg_b_day (list of float64): the average brightness of night (train set /val set /test set)
        threshold (float64): the threshold brightness to classify day/night   
        
    Returns:
        float64: the accuracy of classification of day/night
    '''
    corrects = 0 # tracks running correct values
    total = len(avg_b_day) + len(avg_b_night) # total number of images in set
    
    corrects = sum(np.array(avg_b_day) > threshold) + sum(np.array(avg_b_night) < threshold)
    accuracy = corrects/total # calculating percentage of correctly classified images
    return accuracy

In [13]:
def find_max_accur(avg_b_day, avg_b_night, thresholds):
    ''' Traverse the thresolds to find the maximal accuracy and corresponding index(threshold), according to the avearge brightness
    
    Args:
        avg_b_day (list of float64): the average brightness of day (train set /val set /test set)
        avg_b_day (list of float64): the average brightness of night (train set /val set /test set)
        thresholds (list of float64): the brightness thresholds to classify day/night   
        
    Returns:
        pd.DataFrame: the table of (threshold, accuracy)
        float64: The maximal validation accuracy and corresponding threshold
    '''
    accuracy = []
    for thresh in thresholds:
        accuracy.append(get_accuracy(avg_b_day, avg_b_night, threshold = thresh))
    accuracy_df = pd.DataFrame({
        'accuracy': pd.Series(accuracy, index=thresholds)
    })  # combine threshold and accuracy into a dataframe
    
    # Find the maximum value and corresponding index(threshold) of accuracy
    max_accuracy, max_threshold = accuracy_df.accuracy.max(), accuracy_df.accuracy.idxmax()
    
    # Print out the max accuracy and best threshold
    ratio_over_day = np.sum(np.array(avg_b_day) > max_threshold) / len(avg_b_day)
    ratio_under_night = 1 - np.sum(np.array(avg_b_night) > max_threshold) / len(avg_b_night)
    print("The ratio over {:.2f} in day_brightness is: {:.1f}%; \nThe ratio under {:.2f} in night_brightness is: {:.1f}%"\
          .format(max_threshold, ratio_over_day*100, max_threshold, ratio_under_night*100))
    
    # Visualize the distribution with the best threshold
    fig, ax = plt.subplots(1,2,figsize=(10,5))
    ax[0].hist(avg_b_day)
    ax[0].set_title('Day')
    ax[0].axvline(max_threshold, color='red')
    ax[1].hist(avg_b_night)
    ax[1].set_title('Night')
    ax[1].axvline(max_threshold, color='red')
    
    return accuracy_df, max_accuracy, max_threshold

### 2.1.1 The maximal accuracy and best threshold of the **Train set**
First get the average brightness of each image in the train set, and calculate the maximal accracy and corresponding threshold. 

In [14]:
# lists to store avg. brightness from Value channel of each image
add_avg_brightness(train_day_df)
add_avg_brightness(train_night_df)

In [15]:
thresholds_70_90 = np.arange(70, 90, 0.1)
accuracy_train_df, max_train_accuracy, max_train_threshold \
    = find_max_accur(train_day_df['avg_b'], train_night_df['avg_b'], thresholds_70_90)
print('The maximal validation accuracy is: {:.1f}%, where threshold is {:.1f}'.format(max_train_accuracy*100, max_train_threshold))
    
plt.figure(figsize=(8,4), tight_layout=True)
colors = sns.color_palette('pastel')
accuracy_train_df.accuracy.plot(x=accuracy_train_df.index)
plt.vlines(max_train_threshold, ymin=accuracy_train_df.accuracy.min(), ymax= accuracy_train_df.accuracy.max(), color='red')
plt.xlabel('Threshold')
plt.ylabel('Validate Score')
plt.title('Score according to Threshold')
plt.show()

### 2.1.2 The maximal accuracy and best threshold of the **Validation set**

Calculate the average brightness of day and night with the Value Channel of each image, and then validate

In [16]:
add_avg_brightness(val_day_df)
add_avg_brightness(val_night_df)

Find the maximal accuracy and corresponding threshold

In [17]:
thresholds_70_90 = np.arange(70, 90, 0.1)
accuracy_val_df, max_val_accuracy, max_val_threshold \
    = find_max_accur(val_day_df['avg_b'], val_night_df['avg_b'], thresholds_70_90)
print('The maximal validation accuracy is: {:.1f}%, where threshold is {:.1f}'.format(max_val_accuracy*100, max_val_threshold))
    
plt.figure(figsize=(8,4), tight_layout=True)
colors = sns.color_palette('pastel')
accuracy_val_df.accuracy.plot(x=accuracy_val_df.index)
plt.vlines(max_val_threshold, ymin=accuracy_val_df.accuracy.min(), ymax= accuracy_val_df.accuracy.max(), color='red')
plt.xlabel('Threshold')
plt.ylabel('Validate Score')
plt.title('Score according to Threshold')
plt.show()

We received an accuracy of **92.5%** with the train set and an accuracy of **97.6%** with the validation set by using only one feature extraction, i.e. the average brightness of the image. (V Channel)

We can't use brightness alone, because there's large shadow in day image and artifical light in night image, which influence the brightness greatly.

We could work more on this to figure out better features extraction. 


---
---
---

## 2.2  Look into the Misclassified Images
---

After find the best threshold, we create **2 columns to store the predicted label and true label of each image**. <br/>
`pred_label = {'Day', 'Night'}`   and  `true_label = {'Day', 'Night'}`.<br/>
And then return all the misclassified images with their predicted label, and their true labels.

### 2.2.1 Add labels to justify if images are misclassified

In [18]:
train_day_df.loc[train_day_df['avg_b'] > max_train_threshold, 'pred_label'] = 'Day' 
train_day_df.loc[train_day_df['avg_b'] <= max_train_threshold, 'pred_label'] = 'Night' 
train_night_df.loc[train_night_df['avg_b'] > max_train_threshold, 'pred_label'] = 'Day' 
train_night_df.loc[train_night_df['avg_b'] <= max_train_threshold, 'pred_label'] = 'Night' 

train_day_df['true_label'] = 'Day'
train_night_df['true_label'] = 'Night'

In [19]:
val_day_df.loc[val_day_df['avg_b'] > max_val_threshold, 'pred_label'] = 'Day' 
val_day_df.loc[val_day_df['avg_b'] <= max_val_threshold, 'pred_label'] = 'Night' 
val_night_df.loc[val_night_df['avg_b'] > max_val_threshold, 'pred_label'] = 'Day' 
val_night_df.loc[val_night_df['avg_b'] <= max_val_threshold, 'pred_label'] = 'Night' 

val_day_df['true_label'] = 'Day'
val_night_df['true_label'] = 'Night'

If the `pred_label != true_label`, images are misclassified.

In [104]:
def get_misclassified_images(df):
    ''' Get the list of misclassified images
    
    Args:
        df(pd.Dataframe): the original dataframe of images
       
    Returns:
        pd.Series of PIL.PngImagePlugin.PngImageFile: the misclassified images
    '''
    misclassified_images = df[df['pred_label'] != df['true_label']] 
    return misclassified_images

### 2.2.2 Define the Visualization functions of misclassified images

- `visualize_mis_images`: Visualize the real images.
- `scatter_plot_mis_images`: show the misclassified images in a scatter plot.

In [173]:
import math
    
def visualize_mis_images(mis_images):
    ''' Visualize misclassified example(s), show the true label / brigtness / predicted label
    
    Args:
        mis_images (pd.DataFrame): the misclassified images
    '''
    len_images = len(mis_images) if len(mis_images)<=25 else 25
    num = math.ceil(math.sqrt(len_images))
    idxs = random.sample(range(0, len(mis_images)), len_images)
    
    fig = plt.figure(figsize=(num**2,num**2)) if num > 3 else  plt.figure(figsize=((num+1)**2,(num+1)**2))
    plt.title("Misclassified images (True Label - Brightness - Predicted Label)",  fontsize=24)
    for count, index in enumerate(idxs):
        ax = fig.add_subplot(num, num, count + 1, xticks=[], yticks=[])
        image = mis_images.iloc[index].img
        label_true = mis_images.iloc[index].true_label
        label_pred = mis_images.iloc[index].pred_label
        bright = mis_images.iloc[index]['avg_b'] if 'avg_b' in mis_images else  mis_images.iloc[index]['avg_V'] 
        ax.imshow(image)
        ax.set_title("{} {:0.0f} {}".format(label_true, bright, label_pred))

        if count==len_images-1:
            break
    
def scatter_plot_mis_images(MISCLASSIFIED_avg_hsv, x_name, y_name, label_name, mode, loc_legend='lower left'):
    ''' Visualize misclassified example(s) with scatter plot, and print out the accuracy of classification
    
    Args:
        MISCLASSIFIED_avg_hsv (pd.DataFrame): the misclassified images with HSV and RGB value
        x_name (String): the name of X_axis, also the field of dataframe, such as `avg_H`, `avg_R`
        y_name (String): the name of Y_axis, also the field of dataframe, such as `avg_H`, `avg_R`
        label_name (String): the label of images, such as `true_label`, `pred_label`
        mode (String): 3 choices -- train / validation / test mode
        loc_legend (String): the location of the legend of the plot
    ''' 
    ax = sns.scatterplot(x=x_name, y=y_name, hue=label_name, data=MISCLASSIFIED_avg_hsv, legend='full')
    ax.legend(loc=loc_legend)
    ax.set_title('[{:s} -- {:s}] of MISCLASSIFED Images ({:s})'.format(x_name, y_name, mode), fontsize=20)
    length = len_train if mode=='train' else len_val if mode=='val' else len_test
    accuracy = (1-len(MISCLASSIFIED_avg_hsv)/ length)*100
    print('The accuracy is: {:.2f} %'.format(accuracy))
    plt.show()

In [166]:
mis_train_day_df = get_misclassified_images(train_day_df)
mis_train_night_df = get_misclassified_images(train_night_df)
MISCLASSIFIED_train = pd.concat([mis_train_day_df, mis_train_night_df])

In [22]:
mis_val_day_df = get_misclassified_images(val_day_df)
mis_val_night_df = get_misclassified_images(val_night_df)
MISCLASSIFIED_val = pd.concat([mis_val_day_df, mis_val_night_df])

### 2.2.3 Visualize the misclassified images
Visualize some of the images we classified wrong (for example in the `mis_val_day_df` and `mis_val_night_df`) and note any qualities that make them difficult to classify. This will help us identify any weaknesses in our classification algorithm.

In [174]:
visualize_mis_images(MISCLASSIFIED_train)

In [176]:
visualize_mis_images(MISCLASSIFIED_val)

When visualizing the HSV channels we can detect very fast that all misqualified night images have a very low hue value which is caused by the artificial yellow lights. The value lays about in the region of $< 50$.

## 3. Increase accuracy by extracing more features

###  3.0  Normalization and Standardization
And I tried standardilize the brightness of validation set, the result is much worse, it's **52.8%**. <br/>And also normalization of brightness of validation set, the maximal accuracy is **75.00%**, which is also worse than the accuracy of unnormalized brightness (**97.6%**).  

### 3.1 Add HSV and RGB value with images

We want to analyse the HSV channels and RGB channels, so we create function `avg_hsv_rgb` in order to attach the average channel value with each image, with help of `pd.DataFrame`.  

In [107]:
def avg_hsv_rgb(df_):
    ''' Calculate the average HSV and RGB of each image in a dataframe
    
    Args:
        df (pd.DataFrame): The dataframe to add the average HSV and RGB value
        
    Returns:
        3 lists: average Hue, average Saturation, average Value
    '''
    avg_Hs, avg_Ss, avg_Vs = [], [], []
    avg_Rs, avg_Gs, avg_Bs = [], [], []
    df = df_.copy()  # use copy() to avoid replacing the original dataframe
    for index in tqdm(range(len(df))):
        img = cv2.imread(df.iloc[index].path) # reading img 
        img = cv2.resize(img, (500, 500)) # resizing image
        
        img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV) # converting to hsv
        avg_H, avg_S, avg_V = np.mean(img[:, :, 0]), np.mean(img[:, :, 1]), np.mean(img[:, :, 2]) # calculating average value per pixel of Value channel from HSV image
        avg_Hs.append(avg_H)
        avg_Ss.append(avg_S)
        avg_Vs.append(avg_V)
        
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # converting to hsv
        avg_R, avg_G, avg_B = np.mean(img[:, :, 0]), np.mean(img[:, :, 1]), np.mean(img[:, :, 2]) # calculating average value per pixel of Value channel from HSV image
        avg_Rs.append(avg_R)
        avg_Gs.append(avg_G)
        avg_Bs.append(avg_B)
    
    df['avg_H'], df['avg_S'], df['avg_V'] = avg_Hs, avg_Ss, avg_Vs   
    df['avg_R'], df['avg_G'], df['avg_B'] = avg_Rs, avg_Gs, avg_Bs  
    return df


def data_initialize(mode='val'):
    ''' Load the data set depending on the mode (train/valid/test), and attach average HSV and RGB
        Channel value with each image
        
    Args:
        mode (String): the type of dataset (train/ val/ test), default: val
    '''
    if mode == 'val':
        day_df, night_df = load_images('night', 'val_ref'), load_images('night', 'val')
    elif mode == 'train':
        day_df, night_df = load_images('night', 'train_ref'),  load_images('night', 'train')
    elif mode == 'test':
        day_df, night_df = load_images('night', 'test_ref'), load_images('night', 'test')

    # Calculate the average HSV and RGB value 
    day_df = avg_hsv_rgb(day_df)
    night_df = avg_hsv_rgb(night_df)
    
    return day_df, night_df

Calculate the average HSV and RGB value of day, night and misclassified images of train set / validation set. 

In [110]:
# Load data and calculate average HSV and RGB value
train_day_hsv_rgb, train_night_hsv_rgb = data_initialize(mode='train')
val_day_hsv_rgb, val_night_hsv_rgb = data_initialize(mode='val')
test_day_hsv_rgb, test_night_hsv_rgb = data_initialize(mode='test')

According to our analysis above 
>  all misqualified night images have a very low hue value which is caused by the artificial yellow lights, and the Hue value lays about in the region of  $<50$ .

we check the difference of the **Hue value** between Day Images and Night Images of train set 

In [115]:
def day_night_scatter(df_day, df_night, x_name, y_name, line_pos, loc_legend='upper left'):
    ''' Scatter plot of Day Images and Night Images from the same dataset in a 1*2 subplots
    
    Args:
        df_day/day_night (pd.DataFrame): the day/night images with HSV and RGB value
        x_name (String): the name of X_axis, also the field of dataframe, such as `avg_H`, `avg_R`
        y_name (String): the name of Y_axis, also the field of dataframe, such as `avg_H`, `avg_R`
        label_name (String): the label of images, such as `true_label`, `pred_label`
        loc_legend (String): the location of the legend of the plot
    '''
    fig = plt.figure(figsize=(12,4))
    ax1 = plt.subplot(1,2,1)
    ax1 = sns.scatterplot(x=x_name, y=y_name, data=df_day, legend='full')
    ax1.legend(loc=loc_legend)
    ax1.axhline(line_pos, color='red')
    ax2 = plt.subplot(1,2,2)
    ax2 = sns.scatterplot(x=x_name, y=y_name, data=df_night, legend='full')
    ax2.legend(loc=loc_legend)
    ax2.axhline(line_pos, color='red')

In [117]:
day_night_scatter(train_day_hsv_rgb, train_night_hsv_rgb, 'avg_V', 'avg_H', line_pos=30., loc_legend='lower left')

As mentioned above, in the training set, almost no Day Images have Hue channel values lower than 30, while nighttime images have.

The Hue value range of day images in train set is about $[30, 120]$, and the Hue value of night images in train set is about $[15, 140]$.

### 3.2 Improve by  Hue Channel filter

The following approach only looks at the upper part of the image and additionally takes the experience about **hue values** from above into account, so images with artificial bright lights will be flagged as night images, images with large shadow will be flagged as day images

In [137]:
def estimate_label_improve_H(day_df_, night_df_, is_test=False, **kwargs):
    ''' For train/valid, find the maximal accuracy and best threshold, 
        For test, use the specified threshold to calculate the accuracy
        For train/valid/test, add pred_label and true_label in DataFrame
        
    Args:
        day_df_/ night_df_ (pd.DataFrame): the dataframe of images
        is_test (Boolean): if True, test mode is on
    '''
    day_df, night_df = day_df_.copy(), night_df_.copy()
    
    if is_test == False:
        # First, find the best threshold to classify
        thresholds_70_90 = np.arange(70, 90, 0.1)
        accuracy_df, max_accuracy, max_threshold = find_max_accur(day_df.avg_V, night_df.avg_V, thresholds_70_90)
        print('The maximal accuracy is: {:.1f}%, where threshold is {:.1f}'.format(max_accuracy*100, max_threshold))
    else:
        max_threshold = kwargs['threshold']
        test_accuracy = get_accuracy(day_df.avg_V, night_df.avg_V, max_threshold)
        print('The accuracy of test set is: {:.1f}%, where threshold is {:.1f}'.format(test_accuracy*100, max_threshold))
    
    # Second, classify images with H Channel Improvement and add labels
    day_df.loc[(day_df['avg_V'] <= max_threshold) | \
               ((day_df['avg_V'] > max_threshold) & (day_df['avg_H'] <= 35.)), 'pred_label'] = 'Night' 
    day_df.fillna({'pred_label': 'Day'}, inplace=True)
    night_df.loc[(night_df['avg_V'] <= max_threshold) | \
                 ((night_df['avg_V'] > max_threshold) & (night_df['avg_H'] <= 35.)), 'pred_label'] = 'Night' 
    night_df.fillna({'pred_label': 'Day'}, inplace=True) 
    
    day_df['true_label'] = 'Day'
    night_df['true_label'] = 'Night'
    
    return day_df, night_df


### 3.2.1 H channel filter in the train set

In [127]:
# Calculate the maxmial accuracy
train_day_df_final, train_night_df_final = estimate_label_improve_H(train_day_hsv_rgb, train_night_hsv_rgb)

# Find the misclasified images
mis_train_day_df_final = get_misclassified_images(train_day_df_final)
mis_train_night_df_final = get_misclassified_images(train_night_df_final)
MISCLASSIFIED_train_final = pd.concat([mis_train_day_df_final, mis_train_night_df_final])

In [128]:
# Draw the scatter plot of misclassified images in train set
scatter_plot_mis_images(MISCLASSIFIED_train_avg_hsv, 'avg_b', 'avg_H', 'true_label', mode='train', loc_legend='lower left')
# Draw the scatter plot of misclassified images of train set 
scatter_plot_mis_images(MISCLASSIFIED_train_final, 'avg_V', 'avg_H', 'true_label', mode='train', loc_legend='upper left')

As we can see, by correcting the images with low **H channel value**, the misclassified night images reduced (especially the images with hue value < 30), and the accuracy increases from $92.5\%$ to $93.75\%$.

> Note: some Day Images are misclassified at the same time, which is classified correctly before, but the number of new misclassified images is less than the number of new correct classificied images, so the accuracy increases.

### 3.2.2 H channel filter in the Validation set

In [129]:
# Calculate the maxmial accuracy
val_day_df_final, val_night_df_final =  estimate_label_improve_H(val_day_hsv_rgb, val_night_hsv_rgb)

# Find the misclasified images
mis_val_day_df_final = get_misclassified_images(val_day_df_final)
mis_val_night_df_final = get_misclassified_images(val_night_df_final)
MISCLASSIFIED_val_final = pd.concat([mis_val_day_df_final, mis_val_night_df_final])

In [130]:
# Draw the scatter plot of misclassified images in vcalidation set
scatter_plot_mis_images(MISCLASSIFIED_val_avg_hsv, 'avg_b', 'avg_H', 'true_label', mode='val', loc_legend='lower left')
# Draw the scatter plot of misclassified images of validation set
scatter_plot_mis_images(MISCLASSIFIED_val_final, 'avg_V', 'avg_H', 'true_label', mode='val', loc_legend='upper left')

As shown above, in the validation set, the misclassified night images (yellow points in the upper figure) are corrected (removed in the below figure).  

And the accuracy of validation set increases, from $97.64\%$ to $98.58\%$.

In [37]:
visualize_mis_images(MISCLASSIFIED_val_final)

We can see that we main problem of the 3 misclassified day images is that their brightness is very low, due to the large shadow. So we need to analysis the shadow's influence in RGB channels and corresponding improvement methods.

### 3.2.3  H channel filter in the Test set

In [132]:
test_day_hsv_rgb, test_night_hsv_rgb = data_initialize(mode='test')

In [138]:
test_threshold = 81.0  # use the best threshold of the training set

test_day_df_final, test_night_df_final = estimate_label_improve_H(test_day_hsv_rgb, test_night_hsv_rgb, is_test=True, threshold=test_threshold)  
mis_test_day_df_final = get_misclassified_images(test_day_df_final)
mis_test_night_df_final = get_misclassified_images(test_night_df_final)
MISCLASSIFIED_test_final = pd.concat([mis_test_day_df_final, mis_test_night_df_final])

In [177]:
# Draw the scatter plot of misclassified images of test set
scatter_plot_mis_images(MISCLASSIFIED_test_final, 'avg_G', 'avg_H', 'true_label', mode='test', loc_legend='upper left')

visualize_mis_images(mis_test_night_df_final)

We get the original accuracy $89.6\%$ by using the best threshold of train set directly.
> Note: at first I made a mistake, I use the same way to find the maximal accuracy of test set as what I did in train set and validation set.<br/>
But we cannot get the accuracy of test set in this way, because **we can never train the test set !!**

And after add **H channel filter**, the accuracy of test set increases to $93.5\%$ (increased by $3.9\%$).

## 3.3 Further improvement by RGB Channel

It seems that we still can improve the accuracy by extracting more features.

We can notice that in validation set all the misclassified images  are 'Day Images', and in train set over 70% misclassified images are 'Day Images'.

And there's much large shadow in Day images, which leads to the low brightness. 

Now let's study the **influence of large shadow in RGB Channels of "Day Images"** and try to extract more useful features.

In [142]:
def scatter_rgb(df, x_name, label_name, mode, loc_legend, title, hlines=None, is_mis=False):
    ''' Scatter plot of images in (1 * 3) subplots
        X axis is a Channel Value selected from HSV and RGB,
        Y axis is R Channel, G Channel, B Channel respectively in each subplot
        
    Args:
        df (pd.DataFrame): the misclassified images with HSV and RGB value
        x_name (String): the name of X_axis, also the field of dataframe, such as `avg_H`, `avg_R`
        label_name (String): the label of images, such as `true_label`, `pred_label`
        mode (String): 3 choices -- train / validation / test mode
        loc_legend (String): the location of the legend of the plot
        title (String): the title of the figure (not subplot)
        hlines (list of float): the position of horizontal line in each subplot. Default None
        is_mis (Boolean): the condition of calculating the accuracy of classification
                if True, calculate the accuracy and print it in the terminal 
    '''
    f = plt.figure(figsize=(20,5))
    rgb = ['avg_R', 'avg_G', 'avg_B']
    f.suptitle('{:s} (Mode: {:s})'.format(title, mode), fontsize=20)
    for i in range(3):
        ax = f.add_subplot(1, 3 ,i+1)
        ax = sns.scatterplot(x=x_name, y=rgb[i], hue=label_name, data=df, legend='full')
        ax.legend(loc=loc_legend)
        if hlines != None: ax.axhline(hlines[i], color='red')
        ax.set_title('[{:s} -- {:s}]'.format(x_name, rgb[i]), fontsize=12)
    
    plt.show()
    
    if is_mis:
        length = len_train if mode=='train' else len_val if mode=='val' else len_test
        accuracy = (1-len(df)/ length)*100
        print('The accuracy is: {:.2f} %'.format(accuracy))

In [143]:
scatter_rgb(MISCLASSIFIED_train_final, 'avg_V',  'true_label', mode='train', loc_legend='upper left', title='MisClassified Images', is_mis=True)
hlines = [95, 130, 125]
scatter_rgb(train_day_df_final, 'avg_V',  'pred_label', mode='train', loc_legend='upper left', title='Day Images', hlines=hlines)
scatter_rgb(train_night_df_final, 'avg_V',  'pred_label', mode='train', loc_legend='upper left', title='Night Images', hlines=hlines)

We can notice that the G channel value of Day Images is much lower than that of Night Images.<br/> ( Day --- lower than 60, Night -- Higher than 60 ) 

In [145]:
def improve_RGB(val_day_df_, val_night_df_):
    val_day_df, val_night_df = val_day_df_.copy(), val_night_df_.copy()
    
    val_day_df.loc[(val_day_df['pred_label'] == 'Night') & (val_day_df['avg_R'] >= 95.), 'pred_label'] = 'Day' 
    val_night_df.loc[(val_night_df['pred_label'] == 'Night') & (val_night_df['avg_R'] >= 95.), 'pred_label'] = 'Day'
    
    val_day_df.loc[(val_day_df['pred_label'] == 'Day') & (val_day_df['avg_G'] >= 130.), 'pred_label'] = 'Night' 
    val_night_df.loc[(val_night_df['pred_label'] == 'Day') & (val_night_df['avg_G'] >= 130.), 'pred_label'] = 'Night'
    
    val_day_df.loc[(val_day_df['pred_label'] == 'Day') & (val_day_df['avg_G'] >= 130.), 'pred_label'] = 'Night' 
    val_night_df.loc[(val_night_df['pred_label'] == 'Day') & (val_night_df['avg_G'] >= 130.), 'pred_label'] = 'Night'
    
    return val_day_df, val_night_df 
    

### 3.3.1 RGB Channel Filter in the Train set

In [147]:
# Get the improved train set by filtering the G channel
train_day_df_final_G,  train_night_df_final_G = improve_RGB(train_day_df_final,  train_night_df_final)

# Find the misclassified images of train set
mis_train_day_df_final_G = get_misclassified_images(train_day_df_final_G)
mis_train_night_df_final_G = get_misclassified_images(train_night_df_final_G)
MISCLASSIFIED_train_final_G = pd.concat([mis_train_day_df_final_G, mis_train_night_df_final_G])

In [148]:
# Draw the scatter plot of misclassified images of train set
scatter_rgb(MISCLASSIFIED_train_final, 'avg_V',  'true_label', mode='train', loc_legend='upper left', title='MisClassified Images', is_mis=True)
scatter_rgb(MISCLASSIFIED_train_final_G, 'avg_V',  'true_label', mode='train', loc_legend='upper left', title='MisClassified Images After RGB', is_mis=True)

The accuracy increases to $95.12\%$ from $94.00\%$ by RGB Channel Filter.

### 3.3.2 RGB Channel Filter in the Validation set

In [149]:
# Get the improved validation set by filtering the G channel
val_day_df_final_G,  val_night_df_final_G = improve_RGB(val_day_df_final,  val_night_df_final)

# Find the misclassified images of validation set
mis_val_day_df_final_G = get_misclassified_images(val_day_df_final_G)
mis_val_night_df_final_G = get_misclassified_images(val_night_df_final_G)
MISCLASSIFIED_val_final_G = pd.concat([mis_val_day_df_final_G, mis_val_night_df_final_G])

In [150]:
# Draw the scatter plot of misclassified images of validation set
scatter_rgb(MISCLASSIFIED_val_final, 'avg_V',  'true_label', mode='val', loc_legend='upper left', title='MisClassified Images', is_mis=True)
scatter_rgb(MISCLASSIFIED_val_final_G, 'avg_V',  'true_label', mode='val', loc_legend='upper left', title='MisClassified Images After RGB', is_mis=True)

The accuracy of validation set doesn't change.

### 3.3.3 RGB Channel Filter in the Test set

In [151]:
scatter_rgb(MISCLASSIFIED_test_final, 'avg_V',  'true_label', mode='test', loc_legend='upper left', title='MisClassified Images', is_mis=True)
hlines = [95, 130, 125]
scatter_rgb(test_day_df_final, 'avg_V',  'pred_label', mode='test', loc_legend='upper left', title='Day Images', hlines=hlines)
scatter_rgb(test_night_df_final, 'avg_V',  'pred_label', mode='test', loc_legend='upper left', title='Night Images', hlines=hlines)

In [152]:
test_day_df_final_G, test_night_df_final_G = improve_RGB(test_day_df_final, test_night_df_final)
# Find the misclassified images
mis_test_day_df_final_G = get_misclassified_images(test_day_df_final_G)
mis_test_night_df_final_G = get_misclassified_images(test_night_df_final_G)
MISCLASSIFIED_test_final_G = pd.concat([mis_test_day_df_final_G, mis_test_night_df_final_G])

In [153]:
# Draw the scatter plot of misclassified images of test set
scatter_rgb(MISCLASSIFIED_test_final, 'avg_V',  'true_label', mode='test', loc_legend='upper left', title='MisClassified Images', is_mis=True)
scatter_rgb(MISCLASSIFIED_test_final_G, 'avg_V',  'true_label', mode='test', loc_legend='upper left', title='MisClassified Images After RGB', is_mis=True)

The accuracy of test set increases to $93.90\%$ from $93.50\%$ by RGB Channel Filter.

In [178]:
visualize_mis_images(MISCLASSIFIED_test_final_G)