**Notices**

Copyright (c) 2019 Intel Corporation.

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


# Explore VMMRdb 

## Objective

Learn how to find a dataset and how to analyze it to gain more in-depth information before starting any preprocessing or training.

## Activities 
**In this section of the training you will**
- Fetch and visually inspect a dataset 

As you follow this notebook, complete the **Activity** sections to finish this workload.


## Find a Dataset

Artificial intelligence projects depend upon data. When beginning a project, data scientists look for existing data sets that are similar to or match the given problem. This saves time and money, and leverages the work of others, building upon the body of knowledge for all future projects. 

Typically you begin with a search engine query. For this project, we were looking for a data set with an unencumbered license.

This project starts with the [Vehicle Make and Model Recognition Dataset (VMMRdb)](http://vmmrdb.cecsresearch.org/), which is large in scale and diversity: it contains 9,170 classes consisting of 291,752 images, covering models manufactured between 1950 and 2016. The images were taken by different users, with different imaging devices, and from multiple view angles, giving a wide range of variation that reflects what could be encountered in real life. The cars are not well aligned, and some images contain irrelevant background. The data covers vehicles from 712 areas covering all 412 sub-domains corresponding to US metro areas. VMMRdb can be used as a baseline for training a robust model in several real-life scenarios for traffic surveillance.

## Fetch & Inspect Your Data

The code below walks the dataset and its subdirectories to build a simple frequency analysis. We'll use simple frequency distributions to familiarize ourselves with the dataset and its structure. Before running this cell, download VMMRdb and unzip it into a folder named **SubsetVMMR** in the notebook's working directory, which is the path the code below reads from.

Click the cell below and then click **Run**.

In [None]:
#Fetch and Inspect your data
import os
import glob
from collections import defaultdict

#Path to unzipped 9170 classes folder
cwd = os.getcwd()
d = cwd + '/SubsetVMMR'

folder_list = [os.path.join(d, o) for o in os.listdir(d) 
                    if os.path.isdir(os.path.join(d,o))]

#Create all files in a list
file_list = glob.glob("SubsetVMMR/*/*")

#Create list of class names
class_name_list = [o for o in os.listdir(d) if os.path.isdir(os.path.join(d,o))]

#Create a tree structure of the dataset: make -> model -> class -> image count
class_dict = defaultdict(dict)
for tmp in class_name_list:
    #Class folders are named make_model_year, so drop the year to get the model
    model = "_".join(tmp.split("_")[:-1])
    car_make = tmp.split("_")[0]
    #defaultdict creates the make entry on first access; setdefault does the same for the model
    class_dict[car_make].setdefault(model, {})[tmp] = len(os.listdir(os.path.join(d, tmp)))

### Part 1: Car Make Distribution

After building a tree-structured representation of the dataset by manufacturer, model, and year, we use interactive plots to compare classes. The distribution below shows the variation in car makes in VMMRdb. The number next to each car make is the number of model-and-year classes available for that make. For example, **chevrolet** has 1013 model-and-year classes in the database.
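
As a minimal sketch of the per-make count described above, the snippet below tallies classes by make with `collections.Counter`. The folder names in `sample_classes` are hypothetical and only follow the make_model_year pattern that the notebook's parsing code relies on; they are not taken from VMMRdb.

```python
from collections import Counter

# Hypothetical class folder names following the make_model_year pattern.
sample_classes = [
    "chevrolet_impala_2008",
    "chevrolet_impala_2009",
    "chevrolet_malibu_2010",
    "honda_civic_2005",
    "honda_civic_2006",
    "ford_f150_2012",
]

# The make is the text before the first underscore.
make_counts = Counter(name.split("_")[0] for name in sample_classes)

# most_common() yields (make, count) pairs sorted by count, largest first.
for make, count in make_counts.most_common():
    print(make, count)
```

Each count is a number of classes (model and year combinations), not a number of images.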

To control the graphs below, install ipywidgets and enable it in Jupyter Notebook if you have not already done so. To do this, uncomment the commands in the cell below and run it. If ipywidgets is already installed and enabled, skip this step and move on to the next cell.
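
If you are unsure whether ipywidgets is already installed, a quick check along the following lines may help. This is a sketch only: `is_installed` is a hypothetical helper name, and the check tells you whether the package can be imported, not whether the notebook extension is enabled.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("ipywidgets"):
    print("ipywidgets is available; the install step can likely be skipped.")
else:
    print("ipywidgets was not found; run the install commands.")
```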

Click the cell below and then click **Run**.

In [None]:
"""
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension
"""

In [None]:
#Create function to display interactive plotting
import collections
import operator
import pygal
from IPython.display import display, HTML
from ipywidgets import interact

base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

def galplot(chart):
    rendered_chart = chart.render(is_unicode=True)
    plot_html = base_html.format(rendered_chart=rendered_chart)
    display(HTML(plot_html))

#Check only the Manufacturer Distribution among 9170 classes
tmp_dict = {}
for element in class_name_list:
    tmp = element.split("_")[0]
    if tmp in tmp_dict:
        tmp_dict[tmp] += 1
    else:
        tmp_dict[tmp] = 1
#Sort the car makes
tmp_dict = collections.OrderedDict(sorted(tmp_dict.items(), key=lambda x: x[1], reverse=True))
        
def f(Car_Make_Number=10):
    #Plot the Car_Make_Number most frequent makes as a horizontal bar chart
    bar_chart = pygal.HorizontalBar()
    bar_chart.title = 'Car Make Distribution'
    for make in list(tmp_dict)[:Car_Make_Number]:
        bar_chart.add(make, tmp_dict[make])
    galplot(bar_chart)
interact(f, Car_Make_Number=(1, len(tmp_dict),1));

### Part 2: Car Model Distribution
Now that we know the car make distribution in the database, we can take a closer look at the car model distribution. By looking at the distribution of car models, we want to understand which models have the most images when all of their model years are combined.
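
The aggregation behind this view can be sketched as follows: for each model, the image counts of all its year classes are summed, and the models are then ranked by that total. The dictionary `make_tree` and its numbers are hypothetical stand-ins for one make's branch of the dataset tree.

```python
# Hypothetical branch of the tree for one make: model -> class -> image count.
make_tree = {
    "honda_civic": {"honda_civic_2005": 40, "honda_civic_2006": 55},
    "honda_accord": {"honda_accord_2003": 30, "honda_accord_2004": 20, "honda_accord_2005": 25},
    "honda_pilot": {"honda_pilot_2010": 12},
}

# Sum the per-year image counts to get one total per model.
images_per_model = {model: sum(years.values()) for model, years in make_tree.items()}

# Rank models by total image count, largest first.
ranked = sorted(images_per_model.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```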

**Activity**

In the cell below, set **Car_Brand** to the car manufacturer you are interested in and click **Run** to see the number of images for each model.

Example: Car_Brand = "honda"

In [None]:
Car_Brand = "chevrolet"
tmp = class_dict[Car_Brand]
tmp_dict = {}
for tmp2 in tmp:
    tmp_dict[tmp2] = sum(tmp[tmp2].values())
tmp_dict = collections.OrderedDict(sorted(tmp_dict.items(), key=lambda x: x[1], reverse=True))
def class_dist_plot(Car_Model_Number = 5):
    #Plot the Car_Model_Number models with the most images as a donut chart
    pie_chart = pygal.Pie(inner_radius=0.4)
    pie_chart.title = Car_Brand + " car models distribution"
    for model in list(tmp_dict)[:Car_Model_Number]:
        pie_chart.add(model, tmp_dict[model])
    galplot(pie_chart)
interact(class_dist_plot, Car_Model_Number=(1, len(tmp_dict),1))

### Part 3: Car Year Distribution
We now know how car makes and models vary in the database. In this activity, we look at how the images of a specific car model are distributed across model years.
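
A minimal sketch of the year filtering involved, assuming class names end in a four-digit year after the last underscore. The function name `filter_by_year` and the sample counts are hypothetical and used only for illustration.

```python
def filter_by_year(class_counts, start_year, end_year):
    """Keep classes whose trailing year lies in [start_year, end_year], sorted by year."""
    selected = []
    for name, count in class_counts.items():
        # The year is the text after the last underscore in the class name.
        year = int(name.rsplit("_", 1)[-1])
        if start_year <= year <= end_year:
            selected.append((year, count))
    return sorted(selected)

# Hypothetical image counts for one model.
sample = {
    "honda_civic_2004": 18,
    "honda_civic_2006": 55,
    "honda_civic_2005": 40,
    "honda_civic_2010": 7,
}
print(filter_by_year(sample, 2005, 2006))
```

If the start year is greater than the end year, the function returns an empty list, so no classes are selected.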

**Activity**

To inspect the data distribution by year, the cell below defines the class_dist_plot function. The cell depends on two variables: the car manufacturer, **Car_Brand**, and the car model, **Car_Model**.

Update these two variables and hit **Run**. Example: Car_Brand = "honda", Car_Model = "honda_pilot"

In [None]:
Car_Brand = "honda"
Car_Model = "honda_civic"
tmp = class_dict[Car_Brand][Car_Model]
#Sort based on the year and create a plot
od = collections.OrderedDict(sorted(tmp.items()))
def class_dist_plot(start_year, end_year):
    dot_chart = pygal.Dot(x_label_rotation=45)
    dot_chart.title = " ".join(Car_Model.split("_")) + " year distribution"
    dot_chart.x_labels = [tmp2.split("_")[-1] for tmp2 in od if int(tmp2.split("_")[-1])>=start_year and 
                         int(tmp2.split("_")[-1])<=end_year]
    dot_chart.add(Car_Model, [od[tmp2] for tmp2 in od if int(tmp2.split("_")[-1])>=start_year and 
                         int(tmp2.split("_")[-1])<=end_year])
    galplot(dot_chart)
start_tmp = int(list(od.keys())[0].split("_")[-1])
end_tmp = int(list(od.keys())[-1].split("_")[-1])
interact(class_dist_plot, start_year=(start_tmp, end_tmp,1), end_year=(start_tmp, end_tmp, 1))

### Part 4: Display Random Images
The activities so far aimed at a better understanding of the make, model, and year distribution. We can also look at some random images to further explore how the images vary in the database.
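
One way to pick the images for such a grid is to sample file paths without replacement, so no image appears twice. The sketch below assumes a hypothetical helper, `pick_grid_sample`, and made-up file paths; the optional `seed` is an addition for reproducibility and is not part of the notebook.

```python
import random

def pick_grid_sample(paths, grid_size, seed=None):
    """Pick grid_size * grid_size distinct paths, or all of them if fewer exist."""
    rng = random.Random(seed)
    wanted = min(grid_size * grid_size, len(paths))
    return rng.sample(paths, wanted)

# Hypothetical file paths laid out as folder/class/image.
paths = [
    "SubsetVMMR/honda_civic_%d/img_%d.jpg" % (year, i)
    for year in (2004, 2005, 2006)
    for i in range(4)
]
chosen = pick_grid_sample(paths, grid_size=2, seed=0)
print(len(chosen))
```

Capping the sample at `len(paths)` avoids the `ValueError` that `random.sample` raises when asked for more items than the list holds.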

**Activity**

In the cell below, change the **numOfImages** parameter of the display_images function to a number from 1 to 5. The function displays a numOfImages-by-numOfImages grid of random images. Then, hit **Run**.

Example: numOfImages = 5

In [None]:
import os
import random
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

def display_images(file_list, numOfImages = 5):
    #Sample numOfImages x numOfImages distinct images for the grid
    indices = random.sample(range(len(file_list)), numOfImages * numOfImages)
    #squeeze=False keeps axes a 2D array, so the loop also works when numOfImages is 1
    fig, axes = plt.subplots(nrows=numOfImages, ncols=numOfImages, figsize=(15,15), sharex=True, sharey=True, frameon=False, squeeze=False)
    for i,ax in enumerate(axes.flat):
        #Read the sampled picture and label it with its class (parent folder) name
        img_path = file_list[indices[i]]
        ax.imshow(mpimg.imread(img_path))
        ax.text(10,20,os.path.basename(os.path.dirname(img_path)), fontdict={"backgroundcolor": "black","color": "white" })
        ax.axis('off')
    plt.tight_layout(h_pad=0, w_pad=0)
display_images(file_list)

## Resources

TensorFlow* Optimizations on Modern Intel® Architecture, https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

Intel Optimized TensorFlow Wheel Now Available, https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available

Build and Install TensorFlow* on Intel® Architecture, https://software.intel.com/en-us/articles/build-and-install-tensorflow-on-intel-architecture

TensorFlow, https://www.tensorflow.org/

## Case Studies

Manufacturing Package Fault Detection Using Deep Learning, https://software.intel.com/en-us/articles/manufacturing-package-fault-detection-using-deep-learning

Automatic Defect Inspection Using Deep Learning for Solar Farm, https://software.intel.com/en-us/articles/automatic-defect-inspection-using-deep-learning-for-solar-farm

## Citations

A Large and Diverse Dataset for Improved Vehicle Make and Model Recognition, F. Tafazzoli, K. Nishiyama, and H. Frigui, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
