### The Assignment ###
Take a [ZIP file](https://en.wikipedia.org/wiki/Zip_(file_format)) of images and process them using the [`zipfile` library built into Python](https://docs.python.org/3/library/zipfile.html), which you will need to learn how to use. A ZIP file bundles several files into a single compressed file, which saves space. The files in the ZIP file we provide are newspaper images (like the ones you saw in week 3). Your task is to write Python code that searches the images for occurrences of keywords and faces. For example, a search for "pizza" should return a contact sheet of all the faces located on each newspaper page that mentions "pizza". This will test your ability to learn a new [library](https://docs.python.org/3/library/zipfile.html), to use OpenCV to detect faces, to use Tesseract to do optical character recognition, and to use PIL to composite images into contact sheets.

Each newspaper page is saved as a single PNG image inside the archive [small_img.zip](./small_img.zip). The newspapers are in English and contain a variety of stories, advertisements, and images.
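
To make the `zipfile` workflow concrete, here is a minimal sketch of listing an archive's entries and reading each one without extracting anything to disk. It builds a tiny archive in memory so that it runs on its own; the entry names and placeholder bytes are assumptions for illustration only and are not the contents of `small_img.zip`.

```python
import io
from zipfile import ZipFile

# Build a tiny archive in memory so the sketch runs without any files on disk.
# The entry names and contents are placeholders, not real newspaper pages.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("page-0.png", b"placeholder bytes, not a real PNG")
    archive.writestr("page-1.png", b"more placeholder bytes")

# Read the archive back: infolist() describes each entry, and open()
# returns a file-like object for that entry's decompressed bytes.
buffer.seek(0)
pages = {}
with ZipFile(buffer) as archive:
    for entry in archive.infolist():
        with archive.open(entry) as file:
            pages[entry.filename] = file.read()

print(sorted(pages))
```

With a real archive, the file-like object returned by `archive.open(entry)` can be handed to `PIL.Image.open` instead of calling `read()`.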

Here's an example of the expected output. Using the [small_img.zip](./small_img.zip) file, a search for the string "Christopher" should produce the following image:
![Christopher Search](./small_project.png)

In [None]:
from zipfile import ZipFile
from PIL import Image
from IPython.display import display
import pytesseract
import cv2 as cv
import numpy as np

# load face detection classifier
face_cascade = cv.CascadeClassifier("haarcascade_frontalface_default.xml")

# dictionary to store images, faces and text
img_dic = {}

# store PIL images in the dictionary only once to be accessed later
with ZipFile("images.zip") as archive:
    for entry in archive.infolist():
        with archive.open(entry) as file:
            img = Image.open(file).convert("RGB")
            img_dic[entry.filename] = {"pil_img":img}

for img_name in img_dic:

    # cut and store faces in dictionary for each image
    img_cv = np.array(img_dic[img_name]["pil_img"]) 
    # arrays built from PIL images are in RGB order, not OpenCV's default BGR
    img_gray = cv.cvtColor(img_cv, cv.COLOR_RGB2GRAY)
    bbs = face_cascade.detectMultiScale(img_gray, 1.3, 5)
    img_dic[img_name]["faces"] = []
    for x,y,w,h in bbs:
        face = img_dic[img_name]["pil_img"].crop((x, y, x+w, y+h))
        # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
        face.thumbnail((100, 100), Image.LANCZOS)
        img_dic[img_name]["faces"].append(face)
        
    # store texts in dictionary for each image
    img_dic[img_name]["text"] = pytesseract.image_to_string(img_dic[img_name]["pil_img"])

# search "name" in each image's text
# if there is "name", display faces found in that image
def search(name):
    for img_name in img_dic:
        if name in img_dic[img_name]["text"]:
            if len(img_dic[img_name]["faces"]) > 0:
                print(f"Results found in file {img_name}")
                # ceiling division: five thumbnails per row, no empty trailing row
                h = (len(img_dic[img_name]["faces"]) + 4) // 5
                contact_sheet = Image.new("RGB", (500, 100 * h))
                x = 0
                y = 0
                for img in img_dic[img_name]["faces"]:
                    contact_sheet.paste(img, (x, y))
                    if x + 100 == contact_sheet.width:
                        x = 0
                        y += 100
                    else:
                        x += 100
                display(contact_sheet)
            else:
                print(f"Result found in file {img_name} \nBut there were no faces in that file\n\n")
    return
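
As a standalone illustration of the contact-sheet layout used in `search`, the sketch below computes the sheet size and the paste position of each thumbnail without touching any images. The helper name `contact_sheet_layout` and its parameters are hypothetical and not part of the assignment; it assumes five 100-pixel cells per row, as in the code above.

```python
def contact_sheet_layout(n_faces, columns=5, cell=100):
    """Return the sheet size and the top-left paste position of each thumbnail."""
    # Ceiling division gives the number of rows; keep at least one row.
    rows = max(1, -(-n_faces // columns))
    size = (columns * cell, rows * cell)
    # Fill left to right, then wrap to the next row.
    positions = [((i % columns) * cell, (i // columns) * cell)
                 for i in range(n_faces)]
    return size, positions


size, positions = contact_sheet_layout(7)
print(size)          # (500, 200)
print(positions[5])  # (0, 100)
```

Computing the row count with ceiling division means that exactly five faces fit on a single row rather than leaving an empty second row.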