# Data Exploration

In this file, we explore the image data from the Kaggle competition [State Farm Distracted Driver Detection](https://www.kaggle.com/c/state-farm-distracted-driver-detection/data).

The dataset contains three files:
- `imgs.zip`: zipped folder of all (train/test) images
- `sample_submission.csv`: a sample submission file in the correct format
- `driver_imgs_list.csv`: a list of training images, their subject (driver) id, and class id

We first unzip the file `imgs.zip` to obtain the raw image data.

In [1]:
import zipfile
import os
from tqdm import tqdm

with zipfile.ZipFile('imgs.zip', 'r') as f:
    if not os.path.exists('train') or not os.path.exists('test'):
        # Extract member by member so that tqdm can show progress.
        for member in tqdm(f.infolist()):
            f.extract(member)
    else:
        print('The dataset has been unzipped.')

The dataset has been unzipped.
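Before extracting, it can help to look at the archive layout without writing anything to disk. The following is a minimal sketch; the helper name `top_level_counts` is hypothetical, and it assumes the archive stores its members under top-level folders such as `train` and `test`, which is what the existence check above implies.

```python
import zipfile
from collections import Counter

def top_level_counts(zip_path):
    """Count archive members under each top-level folder without extracting."""
    counts = Counter()
    with zipfile.ZipFile(zip_path, 'r') as zf:
        for name in zf.namelist():
            # Directory entries end with '/', so skip them and count files only.
            if name.endswith('/'):
                continue
            counts[name.split('/')[0]] += 1
    return dict(counts)
```

Calling `top_level_counts('imgs.zip')` would return one count per top-level folder, which can then be compared with the image counts reported later in this file.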


We can use `driver_imgs_list.csv` to check that every training image has been unzipped.

In [2]:
import pandas as pd
import numpy as np

imgs_list = pd.read_csv('driver_imgs_list.csv')
paths_list = []
miss_files = []
# Training images are stored as train/<classname>/<img>.
for classname, img in zip(imgs_list['classname'], imgs_list['img']):
    path = 'train/' + classname + '/' + img
    paths_list.append(path)
    if not os.path.exists(path):
        miss_files.append(path)
if len(miss_files) == 0:
    print('No file is missing. There are {} training images.'.format(len(imgs_list)))
else:
    print('The following files are missing:\n')
    print(miss_files)

No file is missing. There are 22424 training images.
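The check above only confirms that every file listed in the CSV exists on disk. The reverse direction (files on disk that the CSV does not list) can be sketched as follows. The helper name `unlisted_files` is hypothetical and not part of the original analysis.

```python
import os

def unlisted_files(root, expected_paths):
    """Return files found under `root` that are not in `expected_paths`."""
    # Normalise both sides so 'train/c0/a.jpg' and 'train\\c0\\a.jpg' compare equal.
    expected = {os.path.normpath(p) for p in expected_paths}
    found = set()
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            found.add(os.path.normpath(os.path.join(dirpath, name)))
    return sorted(found - expected)
```

With the variables defined above, `unlisted_files('train', paths_list)` would return an empty list if the folder and the CSV agree exactly.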


## Basic statistics of the data

In [3]:
# function to get the list of paths in a directory
def get_paths(root):
    paths = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            paths.append(os.path.join(path, name))
    return paths

Number of images in the test set:

In [4]:
test_paths = get_paths('test')
print('There are {} images in the test set.'.format(len(test_paths)))

There are 79726 images in the test set.
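Note that `get_paths` collects every file under the folder, so the count above would also include any non-image file that happens to be there. A minimal sketch for breaking the count down by file extension is shown below; the helper name `count_by_extension` is hypothetical, and which extensions actually occur in the dataset is not stated in this file.

```python
import os
from collections import Counter

def count_by_extension(root):
    """Count the files under `root`, grouped by lower-cased file extension."""
    counts = Counter()
    for _path, _subdirs, files in os.walk(root):
        for name in files:
            # splitext returns (stem, extension); the extension keeps its dot.
            counts[os.path.splitext(name)[1].lower()] += 1
    return dict(counts)
```

If `count_by_extension('test')` returns a single key, every file in the test folder shares one extension and the total above is a pure image count.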


Number of drivers and their IDs:

In [5]:
subj = np.unique(imgs_list['subject'])
print('There are {} drivers.'.format(len(subj)))
print(subj)

There are 26 drivers.
['p002' 'p012' 'p014' 'p015' 'p016' 'p021' 'p022' 'p024' 'p026' 'p035'
 'p039' 'p041' 'p042' 'p045' 'p047' 'p049' 'p050' 'p051' 'p052' 'p056'
 'p061' 'p064' 'p066' 'p072' 'p075' 'p081']


Number of images for each driver:

In [6]:
subj_count = {}
for sub in subj:
    subj_count[sub] = len(imgs_list[imgs_list['subject']==sub])

print(subj_count)

{'p002': 725, 'p026': 1196, 'p022': 1233, 'p081': 823, 'p015': 875, 'p012': 823, 'p064': 820, 'p035': 848, 'p049': 1011, 'p072': 346, 'p056': 794, 'p024': 1226, 'p014': 876, 'p016': 1078, 'p042': 591, 'p050': 790, 'p021': 1237, 'p052': 740, 'p045': 724, 'p066': 1034, 'p061': 809, 'p041': 605, 'p075': 814, 'p051': 920, 'p039': 651, 'p047': 835}
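The same per-driver counts can be obtained without an explicit loop. The sketch below uses pandas `value_counts`; the helper name `images_per_group` is hypothetical, and the small `demo` frame is made-up data that only mirrors the column names of `driver_imgs_list.csv`.

```python
import pandas as pd

def images_per_group(imgs_list, column):
    """Return a dict mapping each value of `column` to its number of rows."""
    # value_counts groups and counts in one pass; sort_index orders the keys.
    return imgs_list[column].value_counts().sort_index().to_dict()

# Made-up rows for illustration only (not taken from the real CSV).
demo = pd.DataFrame({
    'subject': ['p002', 'p002', 'p012'],
    'classname': ['c0', 'c1', 'c0'],
    'img': ['img_1.jpg', 'img_2.jpg', 'img_3.jpg'],
})
print(images_per_group(demo, 'subject'))
```

On the real data, `images_per_group(imgs_list, 'subject')` should reproduce the dictionary printed above, with the keys in sorted order.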


Number of classes:

In [7]:
cls_name = np.unique(imgs_list['classname'])
print('There are {} classes.'.format(len(cls_name)))
print(cls_name)

There are 10 classes.
['c0' 'c1' 'c2' 'c3' 'c4' 'c5' 'c6' 'c7' 'c8' 'c9']


Number of images for each class:

In [8]:
cls_count = {}
for cls in cls_name:
    cls_count[cls] = len(imgs_list[imgs_list['classname']==cls])

print(cls_count)

{'c7': 2002, 'c1': 2267, 'c8': 1911, 'c9': 2129, 'c4': 2326, 'c5': 2312, 'c6': 2325, 'c0': 2489, 'c2': 2317, 'c3': 2346}
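To judge how balanced the classes are, the printed counts can be turned into shares of the total. The sketch below copies the numbers from the output above; the variable names `largest`, `smallest`, and `imbalance` are introduced here for illustration.

```python
# Counts copied from the printed output above.
cls_count = {'c0': 2489, 'c1': 2267, 'c2': 2317, 'c3': 2346, 'c4': 2326,
             'c5': 2312, 'c6': 2325, 'c7': 2002, 'c8': 1911, 'c9': 2129}

total = sum(cls_count.values())
largest = max(cls_count, key=cls_count.get)
smallest = min(cls_count, key=cls_count.get)
# Ratio between the most and least frequent class.
imbalance = cls_count[largest] / cls_count[smallest]

print('total images:', total)
for cls in sorted(cls_count):
    print('{}: {:.1%}'.format(cls, cls_count[cls] / total))
print('largest/smallest: {} / {} = {:.2f}'.format(largest, smallest, imbalance))
```

The total is 22424, matching the number of training images reported earlier. The largest class is `c0` and the smallest is `c8`, with a ratio of about 1.3, so the classes are fairly balanced.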


## Animation of the data

In [9]:
import cv2
import imageio
import matplotlib.pyplot as plt
import matplotlib.animation as animation
%matplotlib inline  

In [10]:
# Create the output folder once, before the loop.
if not os.path.exists('gifs'):
    os.makedirs('gifs')

for s in tqdm(subj):
    # Rows for this driver; each row is (subject, classname, img).
    ss = imgs_list[imgs_list['subject']==s]
    frames = []
    for row in ss.values:
        frame = cv2.imread('train/'+row[1]+'/'+row[2])
        # Downscale to keep the GIF small; OpenCV loads images as BGR, so convert to RGB.
        frame = cv2.resize(frame, (160, 120))
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame)
    imageio.mimsave('gifs/'+s+'_movie.gif', frames, duration=0.1)

100%|██████████| 26/26 [14:04<00:00, 29.10s/it]
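The loop above indexes each row by position (`[1]` for the class folder, `[2]` for the file name), which silently breaks if the column order of the CSV changes. A minimal sketch that builds the frame paths from named columns instead is shown below. The helper name `frame_paths_for_driver` is hypothetical, and the `demo` frame is made-up data with the same column names as `driver_imgs_list.csv`.

```python
import pandas as pd

def frame_paths_for_driver(imgs_list, subject, root='train'):
    """Return the image paths of one driver, in the order they appear in the list."""
    rows = imgs_list[imgs_list['subject'] == subject]
    # Use column names rather than positions to build root/<classname>/<img>.
    return [root + '/' + c + '/' + i
            for c, i in zip(rows['classname'], rows['img'])]

# Made-up rows for illustration only (not taken from the real CSV).
demo = pd.DataFrame({
    'subject': ['p002', 'p012', 'p002'],
    'classname': ['c0', 'c0', 'c1'],
    'img': ['img_1.jpg', 'img_2.jpg', 'img_3.jpg'],
})
print(frame_paths_for_driver(demo, 'p002'))
```

Each returned path can then be passed to `cv2.imread` exactly as in the loop above.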


In [45]:
s1 = imgs_list[imgs_list['subject']==subj[0]]
frames = []
for f in s1.values:
    frame = cv2.imread('train/'+f[1]+'/'+f[2])
    frame = cv2.resize(frame, (320, 240))
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frames.append(frame)

In [52]:
imageio.mimsave('movie_imgio.gif', frames, duration=0.05)