# Project 3 | Group 10
## Section I: Logistic Classifier, GBM (boosted trees), Random Forest Method Using Resized Images Without Filtering

** Remember to run `detect_features.py` and `resize_images.py` to generate SFrame and image data necessary to execute notebook commands**

**key to model variables**:
   - model_1 : logistic classifier
   - model_2 : boosted tree classifier
   - model_3 : random forest method

**Step 1**: Import all necessary packages for performing analysis

In [24]:
import graphlab as gl
import numpy as np
from os import listdir
from os.path import isfile, join
import cv2

**Step 2:** Set the base image directory and the path to the csv file containing the raw image labels.  We will use these directories for importing info into the IPython Notebook

In [3]:
base_dir = '/Users/galen/Desktop/image_classification/'

sframe_dir = join(base_dir, 'data/sframe')
img_dir = join(base_dir, 'data')
feature_dir = join(base_dir, 'data/csv_features')
figure_dir = join(base_dir,'figs')
model_dir = join(base_dir,'output/trained_model')

label_file = join(base_dir, 'data/csv_labels/labels.csv')
sift_file = join(feature_dir,'sift_features.csv')
submission_file = join(base_dir, 'output/contest_submission.csv')

**Step 3:** We are given sift features in a csv.  We use a script called `generate_sframe.py` to generate an SFrame which we will use for our analysis.

In [4]:
sf = gl.load_sframe(join(sframe_dir,'sift_features'))

This non-commercial license of GraphLab Create for academic use is assigned to gsimmons17@gsb.columbia.edu and will expire on December 07, 2017.


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1490376349.log


We can see from the stucture of the SFrame that the given sift features and labels are loaded, as well as the images that have been resized with the `resize_images.py` script we wrote

In [5]:
sf.head()

id,sift_features,label,image
0,"[0.0, 0.0, 0.0, 0.0, 0.000635729986243, ...",chicken,Height: 256 Width: 256
1,"[0.0, 0.0, 0.0, 0.0, 0.000626179971732, ...",chicken,Height: 256 Width: 256
2,"[0.000846739974804, 0.0, 0.0, 0.0, ...",chicken,Height: 256 Width: 256
3,"[0.0, 0.0, 0.0, 0.0, 0.000565610011108, ...",chicken,Height: 256 Width: 256
4,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00277010002173, 0.0, ...",chicken,Height: 256 Width: 256
5,"[0.0, 0.0, 0.0, 0.0, 0.000596660014708, ...",chicken,Height: 256 Width: 256
6,"[0.000898469996173, 0.0, 0.0, 0.0, 0.0, ...",chicken,Height: 256 Width: 256
7,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",chicken,Height: 256 Width: 256
8,"[0.0, 0.0, 0.0, 0.00112869997974, 0.0, ...",chicken,Height: 256 Width: 256
9,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00160600000527, 0.0, ...",chicken,Height: 256 Width: 256


*side note:* We'll save a copy of the id and label columns for later use (we will construct other SFrame's using different image data)

In [6]:
general_sframe = sf.select_columns(['id','label'])

**Step 4:** In general, it is a good idea to randomly split data into a training and validation set.  We do this to make sure that our models are not overfitting. 

In [7]:
train, validation = sf.random_split(.5, seed = 0) # returns 50% training, 50% validation from sf

**Step 5:** We train the models and calculate accuracy on our validation sets.

In [8]:
model_1 = gl.logistic_classifier.create(sf, target='label', features=['sift_features'], validation_set=validation)
model_2 = gl.boosted_trees_classifier.create(sf, target='label', features=['sift_features'], validation_set=validation)
model_3 = gl.random_forest_classifier.create(sf, target='label', features=['sift_features'], validation_set=validation)

In [9]:
acc_1 = model_1.evaluate(validation)
acc_2 = model_2.evaluate(validation)
acc_3 = model_3.evaluate(validation)

*Below we print the confusion matrices for each model and the validation accuracy*

In [10]:
acc_1['confusion_matrix'], acc_2['confusion_matrix'], acc_3['confusion_matrix']

(Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |     dog      |       dog       |  505  |
 |   chicken    |     chicken     |  496  |
 |     dog      |     chicken     |   1   |
 |   chicken    |       dog       |   1   |
 +--------------+-----------------+-------+
 [4 rows x 3 columns], Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |   chicken    |       dog       |   5   |
 |     dog      |       dog       |  501  |
 |   chicken    |     chicken     |  492  |
 |     dog      |     chicken     |   5   |
 +--------------+-----------------+-------+
 [4 rows x 3 columns], Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +-

In [11]:
print("Model 1 Validation Accuracy: %.4f" % acc_1['accuracy'])
print("Model 2 Validation Accuracy: %.4f" % acc_2['accuracy'])
print("Model 3 Validation Accuracy: %.4f" % acc_3['accuracy'])

Model 1 Validation Accuracy: 0.9980
Model 2 Validation Accuracy: 0.9900
Model 3 Validation Accuracy: 0.8883


**From this information, we can conclude that model 1, the logistic classifier has the best results on the given sift features**

**Step 6:** We can now store our predictions from each model in columns in the original SFrame

In [19]:
sf['m1_label'] = model_1.predict(sf)
sf['m2_label'] = model_2.predict(sf)
sf['m3_label'] = model_3.predict(sf)

We also save our predictions in csv for the contest

In [20]:
contest_predictions = sf['m1_label'].apply(lambda x: 0 if x=='chicken' else 1)
gl.SFrame({'predictions': contest_predictions}).export_csv(submission_file)

It would be interesting to see an example of an image where the models disagree.  Let's find one.

In [21]:
disagree = sf[sf['m2_label']!=sf['m3_label']]

Let's have a look at some disagreements

In [22]:
disagree.head()

id,sift_features,label,image,m1_label,m2_label,m3_label
25,"[0.0, 0.0, 0.00256410008296, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
27,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.000759869988542, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
46,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0024037999101, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
59,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00466360012069, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
60,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
63,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00221730000339, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
66,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
70,"[0.0, 0.0, 0.00244499999098, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
100,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00733500020579, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog
108,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0053619001992, 0.0, ...",chicken,Height: 256 Width: 256,chicken,chicken,dog


We can see that for id==72, model 2 predicts chicken, but model 3 predicts dog.  Let's look at that image:

In [163]:
gl.canvas.set_target('ipynb')
disagree[disagree['id']==72]['image'].show()

For id==121, model 2 predicts dog, but model 3 predicts chicken.  Let's show that image too:

In [164]:
disagree[disagree['id']==121]['image'].show()

We can also look at examples of images that were dogs but were incorrectly predicted to be chicken

In [165]:
disagree.tail()

id,sift_features,label,image,m1_label,m2_label,m3_label
1900,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1913,"[0.0, 0.0, 0.0, 0.00112230004743, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1916,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00387350004166, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1917,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1926,"[0.000745150027797, 0.0, 0.0, 0.000745150027797, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1931,"[0.00173709995579, 0.0, 0.0, 0.000579030020162, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1933,"[0.0, 0.0, 0.00079616997391, 0.0, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1945,"[0.0, 0.0, 0.0, 0.0, 0.00107120000757, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1972,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00548439985141, 0.0, ...",dog,Height: 256 Width: 256,dog,dog,chicken
1995,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.00189389998559, 0.0, ...",dog,Height: 256 Width: 256,dog,dog,chicken


In [69]:
disagree[disagree['id']==1887]['image'].show()

## Section II: Using Filtered Images And Training Models with Limited Data

In the previous section, we were able to train an accurate classifier using 1000 training set images (50% of 2000).  In this section, we explore how we might extract image features from filtered images to train a model with high accuracy using more limited image data.  In the `resize_images.py` script, we generated two sets of transformed images that were generated using `opencv`.

We load the filtered images into SFrames that we will use for our analysis

In [11]:
edge_sf = gl.load_sframe(join(sframe_dir,'edge_sf'))
sobel_sf = gl.load_sframe(join(sframe_dir,'edge_sf'))

Let's inspect an example of how the images were transformed.  We look at `id==1887` which is the same image we looked at in the previous section:

In [13]:
gl.canvas.set_target('ipynb')
edge_sf[edge_sf['id']==1887]['image'].show(), sobel_sf[sobel_sf['id']==1887]['image'].show()

(None, None)

In [15]:
sample = edge_sf.sample(0.05, seed=0) # sample 5% of the data

To train our neural network, we'll perform the same random splitting method, but take only 20% of the dataset to train our neural network classifier.  This deep convolutional neural network model is based on the imagenet challenge.

In [138]:
train_sf, valid_sf = edge_sf.random_split(.2, seed=0) # this time we take only 20% of the dataset to train

First, we can make an example showing SIFT keypoints on a sample image

In [29]:
example_img = join(img_dir,'image_1234.jpg')
img = cv2.imread(example_img, 0)
gray = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)

sift = cv2.xfeatures2d.SIFT_create()
kp = sift.detect(gray, None)
img = cv2.drawKeypoints(gray,kp,img)
cv2.imwrite(join(figure_dir,'sift_keypoints.jpy'), img)

error: /Users/jenkins/miniconda/1/x64/conda-bld/conda_1486588158526/work/opencv-3.1.0/modules/features2d/src/draw.cpp:113: error: (-215) !outImage.empty() in function drawKeypoints


In [None]:
gl.Image(join(figure_dir,'sift_keypoints.jpg')).show()

In [26]:
example_img = join(img_dir,'image_1234.jpg')
img = cv2.imread(example_img, 0)

# transform images
ex_edge = cv2.Canny(img,100,200)
ex_sobel = cv2.Sobel(img, cv2.CV_8U, 0, 1, ksize=5)

# perform sift keypoint detection
sift = cv2.xfeatures2d.SIFT_create()
(kp_edge, des_edge) = sift.detectAndCompute(ex_edge, None)
(kp_sobel, des_sobel) = sift.detectAndCompute(ex_sobel, None)

img_edge = cv2.drawKeypoints(ex_edge, kp_edge, img)
img_sobel = cv2.drawKeypoints(ex_sobel, kp_sobel, img)

error: /Users/jenkins/miniconda/1/x64/conda-bld/conda_1486588158526/work/opencv-3.1.0/build/opencv_contrib/modules/xfeatures2d/src/sift.cpp:770: error: (-5) image is empty or has incorrect depth (!=CV_8U) in function detectAndCompute


In [52]:
edge_img = gl.image_analysis.load_images(join(img_dir,'img_edge'))
edge_img.rename({'image':'edge_img'})
sobel_img = gl.image_analysis.load_images(join(img_dir,'img_sobel'))
sobel_img.rename({'image':'sobel_img'})

path,sobel_img
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256
/Users/galen/Desktop/imag e_classification/data ...,Height: 256 Width: 256


# Section III: Supplemental Models/Feature Extraction Implemented in `sklearn` and `opencv`

## Hog Features

In [None]:
def compute_hog_features():
    import cv2
    import pandas as pd
    from os import os.listdir
    
    winSize = (256, 256)
    block_size= (64, 64)
    block_stride= (32, 32)
    cell_size= (32, 32)
    nbins = 9
    padding = (32,32)

    # Create hog feature extraction
    hog = cv2.HOGDescriptor(winSize, block_size, block_stride, cell_size, nbins)

    # Reading the images and calculating the features
    hog_dict = {}
    for pic in os.listdir(img_dir):
        if pic.split('.')[1] == 'jpg':
            pic_name = pic.split('.')[0]
            pic_n = pic_name.split('_')[1]
            img_read = cv2.imread(os.path.join(pic_path,pic))
            img_read = cv2.resize(img_read, (128,128)) # to get the same number of features
            hog_dict[pic_n] = hog.compute(img_read, padding)

    my_dictionary = {k: v.tolist() for k, v in hog_dict.items()}
    hog_feature = {k: [i[0] for i in v] for k, v in my_dictionary.items()}
    hog_feature = pd.DataFrame.from_dict(hog_feature, orient='index')
    
    # sort the dateframe
    hog_feature_sort=hog_feature.sort_values(by=[0])
    # Saving the contents into a file
    hog_feature.to_csv(join(feature_dir,'hog_features.csv'))

## Dimension Reduction PCA SVM

Import necessary models from `sklearn`

In [175]:
def compute_pca_features():
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.svm import LinearSVC
    from sklearn.grid_search import GridSearchCV
    from sklearn.decomposition import PCA
    
    sift_feature= pd.read_csv(sift_file)
    sift_t = sift_feature.T
    label = pd.read_csv(label_file)

    dat = pd.concat([label.reset_index(drop=True), sift_t.reset_index(drop=True)],axis=1)
    dat=dat.rename(columns = {'V1':'y'})

    # Split data into trainset and test set
    X_train, X_test, y_train, y_test = train_test_split(dat.drop('y',axis=1), dat['y'],
    random_state=0)

    # First, use linear SVM with grid search over a few choice of c to get a baseline
    grid = GridSearchCV(LinearSVC(), {'C': [1.0, 2.0, 4.0, 8.0]})
    grid.fit(X_train, y_train)
    print("accuracy on training set: %f" % grid.score(X_train, y_train))
    print("accuracy on test set: %f" % grid.score(X_test, y_test))


    # Fit principal component analysis model to the data
    pca = PCA()
    pca.fit(X_train)
    X_train_pca = pd.DataFrame(pca.transform(X_train))
    X_test_pca = pd.DataFrame(pca.transform(X_test))

    # Saving the contents into a file
    np.savetxt(join(feature_dir,"X_train_pca.csv"),X_train_pca,delimiter=",")
    np.savetxt(join(feature_dir,"X_test_pca.csv"),X_test_pca,delimiter=",")
    
    return(X_train, y_train, X_test, y_test)

In [176]:
def svc_sel(Xtrain, ytrain, Xtest, ytest, nfolds):
    from sklearn.grid_search import GridSearchCV

    Cs = [5,10, 50,100,150]
    gammas = [40,50,100,150]
    param_grid = {'C': Cs, 'gamma' : gammas}
    grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=nfolds)
    grid_search.fit(Xtrain, ytrain)
    grid_search.best_params_
    grid_search.score(Xtrain,ytrain)
    grid_search.score(Xtest,ytest)
    return {'Best parameter': grid_search.best_params_, 
            'Training accurate rate': grid_search.score(Xtrain,ytrain),
            'Test accurate rate': grid_search.score(Xtest,ytest),
            'Best model': grid_search}

In [177]:
from sklearn.svm import SVC
xtrain, ytrain, xtest, ytest = compute_pca_features()
result = svc_sel(xtrain, ytrain, xtest, ytest, 6)

accuracy on training set: 0.776667
accuracy on test set: 0.730000


KeyboardInterrupt: 

## Gradient Boosting Classifier: Baseline Model
### Implemented in `sklearn`

In [183]:
def compute_gbm():
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    import pandas as pd
    
    sift_feature= pd.read_csv(sift_file)
    sift_t = sift_feature.T
    label = pd.read_csv(label_file)

    dat = pd.concat([label.reset_index(drop=True), sift_t.reset_index(drop=True)],axis=1)
    dat=dat.rename(columns = {'V1':'y'})

    # Split data into trainset and test set
    trainX, testX, trainY, testY = train_test_split(dat.drop('y',axis=1), dat['y'],
    random_state=0)
    
    seed = 0
    model = GradientBoostingClassifier(n_estimators = 500,
                                       max_depth = 5,
                                       subsample = 0.5,
                                       max_features='log2',
                                       random_state=seed)
    model.fit(trainX, trainY)
    
    trainX = pd.DataFrame(model.transform(trainX))
    testX = pd.DataFrame(model.transform(testX))
    
    # Saving the contents into a file
    np.savetxt(join(feature_dir,"trainX_gbm.csv"),trainX,delimiter=",")
    np.savetxt(join(feature_dir,"testX_gbm.csv"),testX,delimiter=",")
    
    return(model, trainX, trainY, testX, testY)

In [184]:
model, trainX, trainY, testX, testY = compute_gbm()



In [182]:
model.get_params

<bound method GradientBoostingClassifier.get_params of GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=5,
              max_features='log2', max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=500, presort='auto', random_state=0,
              subsample=0.5, verbose=0, warm_start=False)>

## Surf Feature Extraction using OpenCV

In [None]:
def compute_surf_features(image_dir, feature_name):
    import scipy.cluster.vq
    import os
    from sklearn.preprocessing import StandardScaler
    from sklearn import preprocessing
    
    image_paths = []
    for files in os.listdir(image_dir):
        dir = os.path.join(image_dir, files)
        image_paths.append(dir)

    # Create surf feature(Speeded-up Robust Features) extraction and keypoint detector objects
    # Reading the image and calculating the features and corresponding descriptors
    des_list = []
    for  image_path in image_paths:
        img = cv2.imread(image_path)
        img = cv2.resize(img, (128,128))
        gray= cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
        surf = cv2.xfeatures2d.SURF_create()
        (kps, descs) = surf.detectAndCompute(gray, None)
        des_list.append((image_path, descs))  

    # Stack all the descriptors vertically in a numpy array
    descriptors = des_list[0][1]
    for image_path, descriptor in des_list[1:]:
        descriptors = np.vstack((descriptors, descriptor))
    # Perform k-means clustering
    k = 5000 # Number of clusters
    voc, variance = kmeans(descriptors, k, 1)  # Perform Kmeans with default values

    # Calculate the histogram of features
    im_features = np.zeros((len(image_paths), k), "float32") 
    for i in range(len(image_paths)):
        words, distance = vq(des_list[i][1],voc)
        for w in words:
            im_features[i][w] += 1


    # Perform Tf-Idf vectorization
    nbr_occurences = np.sum( (im_features > 0) * 1, axis = 0)
    # Calculating the number of occurrences
    idf = np.array(np.log((1.0*len(image_paths)+1) / (1.0*nbr_occurences + 1)), 'float32')
    # Giving weight to one that occurs more frequently

    # Scaling the words
    stdSlr = StandardScaler().fit(im_features)
    im_features = stdSlr.transform(im_features)

    # Saving the contents into a file
    np.savetxt(join(feature_dir,feature_name),im_features,delimiter=",")

Let's compute surf features on our edge images

In [92]:
compute_surf_features(join(img_dir,'img_edge'), 'surf_features_edge')

ValueError: all the input array dimensions except for the concatenation axis must match exactly