# Image filtering tutorial
In this notebook, we will see how to use Python to solve a real-life data science problem! We will look through a folder containing image files and filter it so that we are left with only high-resolution images, which could be a step in our data cleaning pipeline.

## Context
If we want to build a model that takes image data as input (e.g. an image classifier), we first need to obtain a dataset of images to work with. For instance, we might write a data mining script that scrapes the web for images of different categories.

Are we done at this point? Absolutely not. We might want to filter the data to retain only certain images. For instance, images blindly scraped from the web come in all shapes and sizes, and we might only want to keep images that are above a certain resolution to impose a lower bound on quality. This is exactly what we are going to do in this tutorial.

## Our goal
Given a directory (folder) containing image files, create a new directory that only contains the images which have a height and width greater than or equal to some `height_min` and `width_min`. For this tutorial, let's say that `height_min` and `width_min` are both `400`.
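
The condition in this goal can be written as a small helper. This is only a minimal sketch: the function name `meets_min_resolution` is made up for illustration and is not used elsewhere in this notebook.

```python
def meets_min_resolution(width, height, width_min=400, height_min=400):
    """Return True if both dimensions are at least the given minimums."""
    return width >= width_min and height >= height_min


print(meets_min_resolution(400, 400))   # Exactly at the threshold still counts
print(meets_min_resolution(1512, 399))  # The height is one pixel short
```

Note that the comparison is "greater than or equal to", so a `400x400` image passes.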

## What we'll learn about
- Control flow (loops and conditionals)
- Data structures (strings, lists, tuples, and dictionaries)
- Errors
- The `os` package for filesystem manipulations
- The `Pillow` package (imported as `PIL`) for reading image files
- How to think through intended program logic before writing code

## Requirements
Since we're working with image data, we need a package that will let us read
image files and their resolutions (height and width). For this, we will be using
the `Pillow` package. Open your command line, activate your virtual environment,
and install Pillow using the commands below:

Activate your virtual environment (you don't need to do this if you already see `(base)` at the
start of your console line).
```
source activate base
```

Install Pillow:
```
conda install -c anaconda pillow
```

### **Note**:
In this notebook, I repeat a lot of code in each cell. This is not necessary (and is usually undesirable), since variables you define in previous cells
continue to exist in subsequent ones. I am only doing it here for educational purposes so that we can see all the code at once.

# Before we begin
We want to look at the images in the directory `raw_images` and save the ones above a resolution of `400x400` to the directory `cleaned_images`.

Let's first think about how we would do this manually to get an intuition for how we could code it...
Try and do the task with just keyboard and mouse on your computer, paying careful attention to the steps you're taking.

## Pseudo-code
```
For each file in the raw_images directory:
    Open the image
    Look at its resolution
    If its height and width are both greater than or equal to 400:
        Save the file to the cleaned_images directory
```

# Time to code!

## Step 1: get a list of files in the `raw_images` directory

In [1]:
import os    # One of many very useful built-in Python packages

files = os.listdir('raw_images')    # Given a path to a directory, this function returns a list of all the filenames in it
print(files[:10])

['airplane.jpg', 'apple.jpg', 'arctriumph.jpg', 'backpack.jpg', 'bagel.jpg', 'banana.jpg', 'baseball.jpg', 'baseballbat.jpg', 'basketball.jpg', 'bed.jpg']
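
Keep in mind that `os.listdir` returns every entry in the directory, including subdirectories and files that are not images. If that is a concern, one option is to keep only the names with a known image extension. The sketch below is an illustration under my own assumptions: the helper name `keep_image_filenames` and the set of extensions are hypothetical and do not come from this notebook.

```python
import os

# Extensions treated as images here; this set is an assumption, adjust as needed
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}


def keep_image_filenames(filenames):
    """Return only the names whose extension looks like an image format."""
    kept = []
    for name in filenames:
        extension = os.path.splitext(name)[1].lower()   # e.g. 'Apple.JPG' -> '.jpg'
        if extension in IMAGE_EXTENSIONS:
            kept.append(name)
    return kept


example_names = ['airplane.jpg', 'Apple.JPG', 'metadata.txt', 'notes']
print(keep_image_filenames(example_names))
```

An extension check is only a guess based on the name. It does not confirm that the file really contains image data.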


## Step 2: walk through the images one by one and open them

But how on earth do we "open an image" in code?? When we know what we want to achieve in Python but don't readily know how to do it, a helpful thing to do is Google it!

Searching turns up a number of packages for working with images in Python. One of the most popular is Pillow, which is imported as `PIL`, and people online show you how to use it to get an image's size. With this package, we can do things like look at an image's resolution, display it on the screen, change its size, and modify its pixels. For this tutorial, we just need to open the image and look at its height and width.

In [None]:
from PIL import Image

example_image = Image.open('example_image.jpg')

# Get the image's resolution
width, height = example_image.size
print('Image width: ' + str(width))
print('Image height: ' + str(height))

# Display the image
example_image.show()

# Save a copy of the image (similar to using "Save As" with an open image on your computer)
example_image.save('example_image copy.jpg')

Image width: 1512
Image height: 2016


### Now that we see how to open an image and get its resolution, let's try to achieve step 2

In [5]:
files = os.listdir('raw_images')

for file in files:
    image = Image.open(file)

### Uh-oh, we get an error when we try to open the file. PIL claims that the file we're trying to open doesn't exist. What could be the cause?

Answer: It's looking in the current folder we're running Python from, and the actual files are in a subdirectory called `raw_images`. What it wants is the path `raw_images/[file name]`. We need to add that folder name to each of our file names before opening them. Let's try again:

In [6]:
files = os.listdir('raw_images')

for file in files:
    image = Image.open('raw_images/' + file)
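
Concatenating with `'/'` works here, but the `os` package also provides `os.path.join`, which inserts the right separator for the operating system we are running on. A small sketch, using the directory name and one of the file names from the listing above:

```python
import os

source_dir = 'raw_images'
filename = 'apple.jpg'

# os.path.join adds the separator for us ('/' on macOS and Linux, '\' on Windows)
file_path = os.path.join(source_dir, filename)
print(file_path)
```

The rest of this notebook keeps the simpler string concatenation so that each step stays easy to follow.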

## Step 3: Get the image's resolution and check if it is at least 400x400 pixels
Currently, we're just opening the file without doing anything. What we want to do is get its resolution and see if it meets
our minimum requirement of 400x400 pixels.

In [7]:
files = os.listdir('raw_images')

for file in files:
    image = Image.open('raw_images/' + file)

    width, height = image.size
    if height >= 400 and width >= 400:
        print('Image ' + file + ' is ABOVE the minimum resolution')
    else:
        print('Image ' + file + ' is BELOW the minimum resolution')

Image airplane.jpg is BELOW the minimum resolution
Image apple.jpg is ABOVE the minimum resolution
Image arctriumph.jpg is ABOVE the minimum resolution
Image backpack.jpg is ABOVE the minimum resolution
Image bagel.jpg is ABOVE the minimum resolution
Image banana.jpg is ABOVE the minimum resolution
Image baseball.jpg is ABOVE the minimum resolution
Image baseballbat.jpg is ABOVE the minimum resolution
Image basketball.jpg is ABOVE the minimum resolution
Image bed.jpg is ABOVE the minimum resolution
Image bowtiepasta.jpg is ABOVE the minimum resolution
Image breadslice.jpg is ABOVE the minimum resolution
Image broom.jpg is BELOW the minimum resolution
Image button.jpg is ABOVE the minimum resolution
Image car.jpg is BELOW the minimum resolution
Image cathedral.jpg is ABOVE the minimum resolution
Image chesspiece.jpg is ABOVE the minimum resolution
Image clock.jpg is ABOVE the minimum resolution
Image clown.jpg is BELOW the minimum resolution
Image coffeecup.jpg is ABOVE the minimum reso

## Step 4: Copy the image over
Right now, we're just printing whether images are above/below our threshold. What we really want to do is save images above the threshold into the `cleaned_images` directory and do nothing otherwise.

In [5]:
files = os.listdir('raw_images')

for file in files:
    image = Image.open('raw_images/' + file)

    width, height = image.size
    if height >= 400 and width >= 400:
        image.save('cleaned_images/' + file)
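
Two details are worth knowing about this step. `image.save` raises an error if the `cleaned_images` directory does not exist yet, and it re-encodes the image rather than copying the file byte for byte. The sketch below shows one possible alternative using only built-in packages: `os.makedirs` with `exist_ok=True` creates the directory if needed, and `shutil.copy2` copies the original file unchanged. The helper name `copy_to_directory` is hypothetical.

```python
import os
import shutil


def copy_to_directory(file_path, destination_dir):
    """Copy a file into destination_dir, creating the directory if needed."""
    os.makedirs(destination_dir, exist_ok=True)    # No error if it already exists
    destination_path = os.path.join(destination_dir, os.path.basename(file_path))
    shutil.copy2(file_path, destination_path)      # Copies the file's bytes as they are
    return destination_path
```

Inside the loop, a call like `copy_to_directory('raw_images/' + file, 'cleaned_images')` could take the place of `image.save(...)`.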

# And we're done! The `cleaned_images` directory now contains all of the images at or above our resolution threshold.

## What can we improve?
1. **Don't hard-code values**. Instead of writing constants all over the code (e.g. `400`, `raw_images`), assign them to variables with meaningful names. This will make your code more readable, and it will let you modify them in one place rather than having to find them all over your code.
2. **Use descriptive and accurate variable names**. Do we really have a list of `files`? More accurately, we have a list of `filenames`. Don't be afraid to use longer, more descriptive variable names.
3. **Use intermediate variable names**. Rather than bunch up multiple steps in one line, split them up into multiple lines and give meaningful intermediate variable names. For instance, `'raw_images/' + file` is a file path. On a separate line, we can make this explicit. This will make your code more readable (if a bit longer).
4. **Use comments to describe abstract logic**. Comments help make clear what you're trying to achieve, and help other programmers quickly understand what your code does at a high level. You don't need to comment every line, as your code should be informative on its own (e.g. by using good variable names), but you should try commenting blocks of code.

In [7]:
# Define program inputs
source_dir = 'raw_images'
destination_dir = 'cleaned_images'
min_width, min_height = 400, 400

filenames = os.listdir(source_dir)

for filename in filenames:
    # Open the image file
    file_path = source_dir + '/' + filename
    image = Image.open(file_path)

    # If the image resolution meets our minimum requirement, save it to our destination directory
    width, height = image.size
    if height >= min_height and width >= min_width:
        image.save(destination_dir + '/' + filename)

# Bonus: Get some statistics on image resolutions in our dataset

## Context
As a data scientist, it is essential to be an expert in the dataset you work with. You have a toolset of modeling and data preprocessing techniques
at your disposal, but which ones you use depends on both your goals and your dataset.

For instance, many models that can be applied to image data will require that your images all be at the same resolution. How should you go about
doing this if your dataset contains images at all sorts of resolutions? What resolution should you pick? What aspect ratio? Should you crop images
or stretch them? To answer such questions, a great first step is to know what the distribution of image resolutions actually is in your dataset.
This is a part of "data exploration", an essential component of data science.

## Our goal
We're going to modify the above code so that we also get some statistics about how resolutions are distributed in our "cleaned_images" dataset.
In particular, for each resolution (width-height pair) that exists in the dataset, we want to find out how many images have that resolution.
For example, if we had a dataset of images with the following (width, height) pairs:
```
[(450, 600), (800, 500), (450, 600)]
```
We would want our program to output statistics that look like the following:
```
(450, 600) -> 2
(800, 500) -> 1
```
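
For reference, Python's built-in `collections.Counter` produces this kind of tally directly. It is not the approach taken in the rest of this tutorial, which builds the counts by hand with a dictionary, but it is a quick way to check the expected result on the small example above. Tuples such as `(450, 600)` can be counted this way because they are hashable, which is also why they can serve as dictionary keys.

```python
from collections import Counter

resolutions = [(450, 600), (800, 500), (450, 600)]

# Counter maps each distinct item to the number of times it appears
resolution_counts = Counter(resolutions)

for resolution, count in resolution_counts.items():
    print('{} -> {}'.format(resolution, count))
```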

# Before we begin
Once again, let's first think about how we would do this manually to get an intuition for how we could code it...
Try and do the task with just keyboard and mouse on your computer (and a piece of paper this time), paying careful attention to the steps you're taking.

## Pseudo-code
```
Define a variable "resolution_counts" to keep track of how many images have each resolution

For each file in the raw_images directory:
    Open the image
    Look at its resolution
    If its height and width are both greater than or equal to 400:
        Save the file to the cleaned_images directory
        
        If its (width, height) pair is not already in our "resolution_counts":
            Add a (width, height) entry in "resolution_counts" with a value of 1
        Else:
            Increase the value of the (width, height) entry in "resolution_counts" by 1
```

In [44]:
# Define program inputs
source_dir = 'raw_images'
destination_dir = 'cleaned_images'
min_width, min_height = 400, 400

filenames = os.listdir(source_dir)

# Variable to keep track of the statistics of image resolutions in our dataset
resolution_counts = {}

for filename in filenames:
    # Open the image file
    file_path = source_dir + '/' + filename
    image = Image.open(file_path)

    # If the image resolution meets our minimum requirement, save it to our destination directory
    width, height = image.size
    if height >= min_height and width >= min_width:
        image.save(destination_dir + '/' + filename)

        # Also increment the count for this image's resolution
        resolution = (width, height)
        if resolution not in resolution_counts:
            resolution_counts[resolution] = 1
        else:
            resolution_counts[resolution] += 1    # Shorthand for "resolution_counts[resolution] = resolution_counts[resolution] + 1"
            
# Display the resolution statistics nicely
for resolution, count in resolution_counts.items():    # Iterate through the dictionary keys and values at the same time
    print('Resolution: {}, Count: {}'.format(resolution, count))

Skipping non-image file: metadata.txt
Resolution: (400, 800), Count: 3
Resolution: (700, 400), Count: 10
Resolution: (800, 400), Count: 5
Resolution: (500, 500), Count: 11
Resolution: (500, 700), Count: 3
Resolution: (400, 600), Count: 4
Resolution: (600, 500), Count: 9
Resolution: (400, 500), Count: 2
Resolution: (500, 900), Count: 1
Resolution: (600, 400), Count: 5
Resolution: (800, 600), Count: 1
Resolution: (500, 600), Count: 5
Resolution: (900, 900), Count: 1
Resolution: (600, 600), Count: 2
Resolution: (1600, 1500), Count: 1
Resolution: (700, 1200), Count: 1
Resolution: (1700, 1300), Count: 1
Resolution: (400, 700), Count: 7
Resolution: (500, 1700), Count: 1
Resolution: (2100, 1600), Count: 1
Resolution: (1800, 800), Count: 1
Resolution: (700, 500), Count: 2
Resolution: (700, 600), Count: 1
Resolution: (400, 400), Count: 1
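
The first line of the output above mentions skipping a non-image file, which the code in this cell does not do by itself: `Image.open` raises an error when it is given a file it cannot read as an image. One way to get that behavior is to wrap the call in `try`/`except`. The sketch below is my own assumption about how this could be done, not the code that produced the output above. The function name `count_resolutions` is hypothetical, and to stay short it only counts resolutions and does not save any files.

```python
import os
from PIL import Image


def count_resolutions(source_dir, min_width=400, min_height=400):
    """Count the (width, height) pairs of images in source_dir that meet the minimums."""
    resolution_counts = {}
    for filename in sorted(os.listdir(source_dir)):
        file_path = os.path.join(source_dir, filename)
        try:
            # The "with" block closes the file once we have read its size
            with Image.open(file_path) as image:
                width, height = image.size
        except OSError:
            # Pillow raises a subclass of OSError for files it cannot read as images
            print('Skipping non-image file: ' + filename)
            continue

        if width >= min_width and height >= min_height:
            resolution = (width, height)
            # .get returns 0 the first time we see a resolution
            resolution_counts[resolution] = resolution_counts.get(resolution, 0) + 1
    return resolution_counts
```

Calling `count_resolutions('raw_images')` would return a dictionary like the one printed above. To list the most common resolutions first, we could loop over `sorted(resolution_counts.items(), key=lambda item: item[1], reverse=True)`.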


# And we're done!
Ideally, rather than just print these statistics, we would visualize them graphically in some more informative way,
but we'll get there soon. If you're interested, try coming back to this code in Week 3 after we learn about plotting libraries and visualize
this data as a **2D histogram**.