# Image Basics

## Vectorization 

Below is a minimal working procedure for turning an image into a low-dimensional vector using VGG19, a widely studied image-processing architecture.

In [None]:
from keras.applications.vgg19 import VGG19, preprocess_input
import keras.utils as image  # provides load_img and img_to_array

import numpy as np
from numpy.linalg import norm


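# include_top=False drops the fully connected classification layers, so the model can be used for feature extraction;
# pooling='max' applies global max pooling, so each image yields a single 512-dimensional vector
vgg_model = VGG19(include_top=False, weights='imagenet', pooling='max', input_shape=(224, 224, 3))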

Read in an image and preprocess it.

In [None]:
imgpath = "content/005BHtGojw1f30um7zp9yj30np0hsgnb.jpg"

# read the image and resize it to 224 x 224 pixels; target_size is (height, width)
img = image.load_img(imgpath, target_size = (224, 224))

x = image.img_to_array(img)

# add new axis to x; the model expects an input tensor of shape (batch_size, height, width, channels)
x = np.expand_dims(x, axis = 0)
print (x.shape)

x = preprocess_input(x)
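
For intuition, `preprocess_input` for VGG19 applies Keras's "caffe"-style preprocessing: it reorders the channels from RGB to BGR and subtracts the per-channel ImageNet mean, without rescaling. The NumPy sketch below mimics that behavior on a dummy array. The helper name `vgg_style_preprocess` is hypothetical and is only meant for illustration; in practice, keep using the library function.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by "caffe"-style preprocessing
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg_style_preprocess(batch):
    """Hypothetical re-implementation, for illustration only."""
    batch = np.asarray(batch, dtype=np.float32)
    bgr = batch[..., ::-1]           # reorder channels: RGB -> BGR
    return bgr - IMAGENET_MEAN_BGR   # zero-center each channel, no scaling

# Dummy batch with one uniformly gray 224 x 224 image
dummy = np.full((1, 224, 224, 3), 128.0, dtype=np.float32)
out = vgg_style_preprocess(dummy)
print(out.shape)
print(out[0, 0, 0])
```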

In [None]:
x

Turn the 224 x 224 x 3 input image into a 512-dimensional vector.

In [None]:
features = vgg_model.predict(x)
print (features.shape)
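
The 512 dimensions come from the last convolutional block of VGG19, which produces a 7 x 7 x 512 feature map for a 224 x 224 input; `pooling='max'` then keeps the largest activation in each of the 512 channels. The sketch below reproduces only that pooling step in NumPy, using random numbers as a stand-in for a real feature map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the final feature map of one image: (batch, height, width, channels)
feature_map = rng.random((1, 7, 7, 512))

# Global max pooling: take the maximum over the two spatial axes
pooled = feature_map.max(axis=(1, 2))
print(pooled.shape)  # (1, 512)
```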

In [None]:
flattened_features = features.flatten()
normalized_features = flattened_features / norm(flattened_features)
print (normalized_features.shape)

In [None]:
print (normalized_features)

## Calculate similarity

Here, we calculate the cosine similarity between the first image and each of the other images.


In [None]:
## first, collect all image files ending in .jpg


import glob
imgs = glob.glob("content/*.jpg") 
print (imgs)

In [None]:
## second, calculate pairwise similarity

import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2) / (||v1|| * ||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)
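
The same quantity can be computed without an explicit Python loop. The sketch below is a vectorized NumPy variant; the name `cosine_similarity_np` is hypothetical and is chosen so that it does not shadow the function above.

```python
import numpy as np

def cosine_similarity_np(v1, v2):
    """Cosine similarity: (v1 dot v2) / (||v1|| * ||v2||)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # parallel to a
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a

print(cosine_similarity_np(a, b))
print(cosine_similarity_np(a, c))
```

When both vectors are already scaled to unit length, as the feature vectors in this notebook are, the denominator equals 1 and the cosine similarity reduces to a plain dot product.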



In [None]:
## extract the image representation for the first image
result_tuple = []

img1 = imgs[0]
print (img1)

img = image.load_img(img1, target_size = (224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis = 0)
x = preprocess_input(x)
features = vgg_model.predict(x)
flattened_features = features.flatten()
normalized_features1 = flattened_features / norm(flattened_features)

In [None]:
## extract the image representation for each of the remaining images
## and calculate the cosine similarity between the first image and each of them

for img2 in imgs[1:]:

    if img1 == img2:
        continue # avoid comparing the same picture
    img = image.load_img(img2, target_size = (224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis = 0)
    x = preprocess_input(x)
    features = vgg_model.predict(x)
    flattened_features = features.flatten()
    normalized_features2 = flattened_features / norm(flattened_features)

    cos = cosine_similarity(normalized_features1, normalized_features2)
    print (img2, cos)

    result_tuple.append((img2, cos))

import pandas as pd
d = pd.DataFrame(result_tuple, columns=["image", "cosine_similarity"])
d.head()
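
The loop above compares the first image with every other image. If all pairs are needed, one option is to stack the unit-length feature vectors into a matrix and multiply it by its own transpose. The sketch below uses random numbers as a placeholder for real features; the names `F` and `sim` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for real features: 4 images x 512 dimensions
F = rng.random((4, 512))

# Scale every row to unit length
F = F / np.linalg.norm(F, axis=1, keepdims=True)

# Entry (i, j) is the cosine similarity between image i and image j
sim = F @ F.T
print(sim.shape)
print(np.round(np.diag(sim), 6))
```

The diagonal holds each image's similarity with itself, and the matrix is symmetric, so only the upper triangle carries distinct pairs.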




## Exercise
I took a sample of the dataset from here:
https://www.kaggle.com/crowdflower/twitter-user-gender-classification

There are two tasks you may explore:

- Try a clustering algorithm on the image dataset (`profile_pictures_training`) 
   - Can you find meaningful clusters?
- Try a supervised ML task:
   - Use the images in `profile_pictures_training` to train your model. The `training_metadata.csv` file contains a `gender` column, which is the outcome for training.
   - Then predict the gender for the images in `profile_pictures_test`. Verify your predictions against the gender in `test_metadata.csv`.
   - You can use any method, e.g., logistic regression, SVM, random forests, or any other things you know.
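
As a possible starting point, the sketch below shows the general shape of both tasks with scikit-learn, assuming that library is installed. The arrays are random placeholders rather than real features or labels, and choices such as `n_clusters=3` are arbitrary; replace them with the VGG19 vectors and the `gender` column from the metadata files.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder data: swap in real VGG19 features and real labels
X_train = rng.random((40, 512))
y_train = np.array([0, 1] * 20)   # stand-in for the encoded gender column
X_test = rng.random((10, 512))

# Task 1: cluster the training images (the number of clusters is an arbitrary choice)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_train)

# Task 2: fit a classifier on the training features and predict on the test features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

print(cluster_labels[:10])
print(predictions)
```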




If you are not familiar with coding in Python, feel free to save the data to a file, download it to your disk, and work in R.

For instance, the code below saves your data to a file named "test.csv", which you can then download to your local disk.

In [None]:
# write the feature vector as a single comma-separated row (no trailing comma)
with open("test.csv", 'w') as w:
    w.write(",".join(str(cell) for cell in normalized_features))
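
To export the features of several images at once, one row per image is usually easier to read into R. The sketch below uses Python's `csv` module; the file name `features.csv`, the image names, and the column names `f0`, `f1`, ... are hypothetical, and the random array stands in for real feature vectors.

```python
import csv
import numpy as np

rng = np.random.default_rng(0)

# Placeholder names and features: replace with real file paths and VGG19 vectors
names = ["a.jpg", "b.jpg", "c.jpg"]
feats = rng.random((3, 512))

with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Header row: image name followed by one column per feature dimension
    writer.writerow(["image"] + [f"f{i}" for i in range(feats.shape[1])])
    for name, row in zip(names, feats):
        writer.writerow([name] + row.tolist())
```

In R, the resulting file can then be loaded with `read.csv("features.csv")`.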

The cell below unzips the material after you upload it to Google Colab.

In [None]:
!unzip -uq /content/material.zip