# **Image Pre-Processing Notebook**
Use this notebook to convert all Minecraft image data into a dataframe containing selected image features and the coordinates of the sun or moon, if present.

The code in this notebook makes the following assumptions:
1. You have a folder named "CS189" somewhere in your Google Drive, and you have added a shortcut to this folder in "My Drive."
2. All images follow a specific file naming scheme (more details below) and are stored in zip files in the CS189 folder.
3. You are using RGB colored images.

In [None]:
#Run this cell to import useful tools for the rest of image processing
import numpy as np
import cv2
import pandas as pd
import os
import re
import seaborn as sns
import zipfile
from PIL import Image
!pip install --upgrade imutils
import matplotlib.pyplot as plt
import imutils
from google.colab.patches import cv2_imshow
%matplotlib inline

Requirement already up-to-date: imutils in /usr/local/lib/python3.6/dist-packages (0.5.3)


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Now that we've taken care of imports and linked the notebook to Google Drive, we can feed in the images. As a relic of debugging the image processing, we've kept this if statement, which can be used to switch between a folder of debug images and the actual raw data.

Note the "image_folder" variable which links to a specific path in Google Drive which contains either debug data or real data. 

In the non-debug use case, our "CS189" folder contains a "screenshot_data.zip" file, which when unzipped contains another folder called "screenshot_data". We assume that all images contained within have file names such as "y{yaw value}_p{pitch value}_t{tick value}.jpg". For instance, an image could have the file name "y120_p30_t0.jpg" corresponding to yaw 120, pitch 30, and 0 ticks. For negative values we have "y-90_p-60_t180.jpg" for example. The pitch, yaw, and tick are all very important features downstream.
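
As a minimal sketch of the naming scheme described above, the snippet below parses a file name into its yaw, pitch, and tick values. The helper name `parse_filename` and the constant `FILENAME_PATTERN` are hypothetical and used only for illustration; the pattern assumes the ".jpg" extension shown in the examples.

```python
import re

# Hypothetical pattern for names such as "y120_p30_t0.jpg" or "y-90_p-60_t180.jpg".
# Yaw and pitch may be negative; tick is assumed to be non-negative.
FILENAME_PATTERN = re.compile(r"y(-?\d+)_p(-?\d+)_t(\d+)\.jpg$")

def parse_filename(fname):
    """Return (yaw, pitch, tick) as ints, or None if the name does not match."""
    match = FILENAME_PATTERN.match(fname)
    if match is None:
        return None
    yaw, pitch, tick = (int(group) for group in match.groups())
    return yaw, pitch, tick

print(parse_filename("y120_p30_t0.jpg"))
print(parse_filename("y-90_p-60_t180.jpg"))
print(parse_filename("notes.txt"))
```

Returning `None` for a non-matching name lets a caller skip stray files in the folder instead of failing partway through a long run.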

We also print the number of images here as a sanity check.

In [None]:
# Get image folder and num images
debug = False
if debug:
  image_folder = "drive/My Drive/CS189/poojan is speed"
else:
  image_folder = "drive/My Drive/CS189/screenshot_data.zip"
  with zipfile.ZipFile(image_folder) as z:
    z.extractall()
  image_folder = "screenshot_data"
cs189_images = os.listdir(image_folder)
num_images = len(cs189_images)
print(num_images)

16200


The following for loop is the primary workhorse for this notebook. First we prepare a data frame that contains the filename, tick, yaw, pitch, and the (X,Y) coordinates of the center of either the sun or the moon if present in the image. 

This is accomplished by using OpenCV to convert each colored image to grayscale and then binarize it with a threshold: pixels with brightness at or below the threshold are set to 0, so only the moon or the sun should remain white. As you can see from the structure of the if statements inside, the threshold is set based on the time of day, with special cases for facing away from the sun at sunrise or at sunset.
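
The time-of-day rules can be read as a small function. The sketch below mirrors the tick and yaw boundaries and the cutoff values used in the loop in this notebook; the function name `choose_cutoff` is hypothetical, and the loop itself sets these values inline rather than calling a helper.

```python
def choose_cutoff(tick, yaw):
    """Return (time_of_day, cutoff) for a given tick and yaw.

    Hypothetical helper: the numeric values are copied from the
    experimentally determined thresholds in this notebook.
    """
    if tick > 22500:
        cutoff = 190
        # Facing away from the sun early in the sunrise window
        if yaw > 0 and tick < 23200:
            cutoff = 115
        return "sunrise", cutoff
    if 0 <= tick < 12500:
        return "day", 220
    if 12500 <= tick < 13500:
        cutoff = 170
        # Facing away from the sun late in the sunset window
        if yaw < 0 and tick > 12800:
            cutoff = 115
        return "sunset", cutoff
    return "night", 90

print(choose_cutoff(6000, 0))
print(choose_cutoff(23000, 90))
print(choose_cutoff(18000, 0))
```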

Contours are then found in each binary image, and the centroid of every sufficiently large contour is recorded. If an image contains two or more centroids (which should never happen, because the sun and the moon are never in frame together), the image and its file path are displayed and the image is left out of the dataframe.
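
To illustrate what the area filter and the centroid computation do, here is a pure-Python sketch that treats a contour as a polygon and applies the shoelace (Green's theorem) formulas. It is an illustration under that assumption, not the notebook's implementation, which calls `cv2.contourArea` and `cv2.moments`. The function names are hypothetical; the default minimum area of 2500 is taken from the loop in this notebook.

```python
def polygon_area_and_centroid(points):
    """Return (area, (cx, cy)) for a polygon given as ordered (x, y) vertices."""
    twice_signed_area = 0.0
    cx_sum = 0.0
    cy_sum = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_signed_area += cross
        cx_sum += (x0 + x1) * cross
        cy_sum += (y0 + y1) * cross
    if twice_signed_area == 0:
        return 0.0, None  # degenerate contour, no centroid
    # Centroid = sum / (6 * signed area) = sum / (3 * twice_signed_area)
    cx = cx_sum / (3.0 * twice_signed_area)
    cy = cy_sum / (3.0 * twice_signed_area)
    return abs(twice_signed_area) / 2.0, (cx, cy)

def large_contour_centers(contours, min_area=2500):
    """Keep only contours with area >= min_area and return their integer centers."""
    centers = []
    for contour in contours:
        area, centroid = polygon_area_and_centroid(contour)
        if centroid is None or area < min_area:
            continue
        centers.append([int(centroid[0]), int(centroid[1])])
    return centers

big_square = [(100, 100), (160, 100), (160, 160), (100, 160)]   # area 3600
small_square = [(10, 10), (20, 10), (20, 20), (10, 20)]         # area 100
print(large_contour_centers([big_square, small_square]))
```

Only the large square survives the filter, which is how small bright specks are kept from being mistaken for the sun or the moon.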

If the sun or moon is present, its (X,Y) coordinates are set to those found in the image. If the sun or moon is not present, both of its coordinates are set to 0. In the second, smaller for loop near the end, we append rows for images in which the view was directed at the ground, so we expect no celestial bodies and set all coordinates to 0 without processing the images. Finally, the data frame constructed throughout this loop is written to a CSV file and saved in the "CS189" folder defined earlier.
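
The rule for deciding whether a detected centroid belongs to the sun or the moon can be sketched as a single function. The name `assign_body` is hypothetical; the yaw conventions (negative yaw faces the sun at sunrise, positive yaw faces the sun at sunset) follow the if statements in the loop in this notebook.

```python
def assign_body(time_of_day, yaw, centroid):
    """Return (sun_x, sun_y, moon_x, moon_y); zeros mean the body is absent.

    Hypothetical helper. `centroid` is an (x, y) pair or None.
    """
    if centroid is None:
        return 0, 0, 0, 0
    x, y = centroid
    if time_of_day == "day":
        is_sun = True
    elif time_of_day == "night":
        is_sun = False
    elif time_of_day == "sunrise":
        is_sun = yaw < 0   # positive yaw faces the moon at sunrise
    else:
        is_sun = yaw > 0   # "sunset": positive yaw faces the sun
    return (x, y, 0, 0) if is_sun else (0, 0, x, y)

print(assign_body("day", 0, (320, 240)))
print(assign_body("sunrise", 90, (320, 240)))
print(assign_body("night", 0, None))
```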

In [None]:
#indices = np.random.randint(0, num_images, 100)
indices = range(num_images)
all_data = pd.DataFrame(columns = ["filename", "tick", "yaw", "pitch", "sun_X", 
                                   "sun_Y", "moon_X", "moon_Y"])
for i in indices:
  fname = cs189_images[i]
  image_path = os.path.join(image_folder, fname)

  #Use regex to determine yaw, pitch, and tick from the filename
  matches = re.search(r'y(-?\d+)_p(-?\d+)_t(\d+)', fname)
  if matches is None:
    continue  #skip files that do not follow the naming scheme
  yaw, pitch, tick = [int(g) for g in matches.groups()]

  #Load in the image using the file path
  image = cv2.imread(image_path)
  time_of_day = 'day'
  cutoff = 220

  #Use time of day and yaw to set the threshold used later to binarize each
  #image. These thresholds were determined experimentally by us. In general,
  #higher thresholds reduce noise but cause signal loss, so we set each
  #threshold as low as possible while still detecting centroids, without
  #letting in so much noise that we start detecting objects that aren't
  #the sun or the moon.
  if tick > 22500: 
    time_of_day = 'sunrise'
    cutoff = 190
    if yaw > 0 and tick < 23200:
      cutoff = 115
  elif tick >= 0 and tick < 12500:
    time_of_day = 'day'
  elif tick >= 12500 and tick < 13500:
    time_of_day = 'sunset'
    cutoff = 170
    if yaw < 0 and tick > 12800:
      cutoff = 115
  else:
    time_of_day = 'night'
    cutoff = 90

  #Convert the image to grayscale, then binarize it with the chosen cutoff
  gray_image = cv2.cvtColor(np.uint8(image), cv2.COLOR_BGR2GRAY)
  ret,thresh = cv2.threshold(gray_image,cutoff,255,cv2.THRESH_BINARY)

  #Find contours in the image
  cnts = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, 
                        cv2.CHAIN_APPROX_SIMPLE)
  cnts = imutils.grab_contours(cnts)

  stars = []
  image = gray_image
  counter = 0

  #Only consider image contours larger than a certain area (filters out noise)
  for c in cnts:
    area = cv2.contourArea(c)
    if area < 2500:
      continue
    counter += 1
    #Compute the center of the contour from its moments
    M = cv2.moments(c)
    cX = int(M["m10"] / M["m00"])
    cY = int(M["m01"] / M["m00"])
    stars.append([cX,cY])
    #Annotate the image so it can be inspected if several bodies are found
    cv2.drawContours(image, [c], -1, (0, 255, 0), 2)
    cv2.circle(image, (cX, cY), 7, (255, 255, 255), -1)
    cv2.putText(image, "center", (cX - 20, cY - 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

  #Set (X,Y) coordinates for the sun or the moon
  moon_x = 0
  moon_y = 0
  sun_x = 0
  sun_y = 0
  if time_of_day == 'day' and len(stars) > 0:
    sun_x = stars[0][0]
    sun_y = stars[0][1]
  elif time_of_day == 'night' and len(stars) > 0:
    moon_x = stars[0][0]
    moon_y = stars[0][1]
  elif time_of_day == 'sunrise' and len(stars) > 0: #pos yaw = moon at sunrise
    if yaw < 0:
      sun_x = stars[0][0]
      sun_y = stars[0][1]
    else:
      moon_x = stars[0][0]
      moon_y = stars[0][1]
  elif time_of_day == 'sunset' and len(stars) > 0: #pos yaw = sun at sunset
    if yaw > 0:
      sun_x = stars[0][0]
      sun_y = stars[0][1]
    else:
      moon_x = stars[0][0]
      moon_y = stars[0][1]

  #Record images with at most one detected body; otherwise display the image
  #and its path for manual inspection
  if counter < 2:
    all_data = all_data.append({"filename": fname,
                              "tick": tick, "yaw":yaw, "pitch":pitch, 
                              "sun_X":sun_x, "sun_Y": sun_y,
                              "moon_X":moon_x, "moon_Y": moon_y}, ignore_index=True)
  else:
    print(f"Full image path is {image_path}")
    cv2_imshow(image)


#Add rows for images that we know contain neither the sun nor the moon,
#since the view is directed at the ground in these images.
for yaw in range(-180, 170, 20):
  for pitch in range(45, 91, 15):
    for tick in range(0,24000, 240):
      image_path = "y"+str(yaw)+"_p"+str(pitch)+"_t"+str(tick)+".jpg"
      sun_x, sun_y, moon_x, moon_y = 0,0,0,0
      all_data = all_data.append({"filename": image_path,
                              "tick": tick, "yaw":yaw, "pitch":pitch, 
                              "sun_X":sun_x, "sun_Y": sun_y,
                              "moon_X":moon_x, "moon_Y": moon_y}, ignore_index=True)
      
#Finally, write all data to a CSV file
all_data.to_csv("drive/My Drive/CS189/data_df.csv")
