# Lab Two - Exploring Data

You are to perform preprocessing and exploratory analysis of a data set: exploring the statistical summaries of the features, visualizing the attributes, and addressing data quality. This report is worth 10% of the final grade. Please upload a report (<b>one per team</b>) with all code used, visualizations, and text in a rendered Jupyter notebook. For any visualizations that cannot be embedded in the notebook, please provide screenshots of the output.

<b>Dataset requirements</b>: Choose a dataset that is comprised of image data. The data should be directories of images. That is, the dataset should not yet be pre-processed. The following are required for the dataset:

<ol>
    <li>The data includes at least 1000 images</li>
    <li>The size of the images should be larger than 20x20 pixels</li>
    <li>The dataset should have a well defined prediction task (i.e., a label for each image)</li>
    <li>The dataset cannot be MNIST or Fashion MNIST</li>
</ol>

<i><b>A note on grading</b>: This lab is mostly about visualizing and understanding your dataset. The largest share of the points is from how you interpret the visuals that you make. Making the visuals is not enough to satisfy each of the rubrics below—you should appropriately explain what the implications of the visualizations are. In other words, expect about 20% of the available points for visuals that have no substantive discussion.</i>

## Business Understanding (2pts)

<ul>
    <li><b>[2 points]</b> Give an overview of the dataset. Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). What is the prediction task for your dataset and which third parties would be interested in the results? Why is this data important? Once you begin modeling, how well would your prediction algorithm need to perform to be considered useful to the identified third parties? Be specific and use your own words to describe the aspects of the data.</li>
</ul>

## Data Preparation (1pt)

<ul>
    <li><b>[.5 points]</b> Read in your images as numpy arrays. Resize and recolor images as necessary.</li>
    <li><b>[.4 points]</b> Linearize the images to create a table of 1-D image features (each row should be one image).</li>
    <li><b>[.1 points]</b> Visualize several images.</li>
</ul>

In [None]:
import pandas as pd
import numpy as np
#https://www.geeksforgeeks.org/how-to-convert-images-to-numpy-array/
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
#https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory
from pathlib import Path

directory_in_str = './Coronahack-Chest-XRay-Dataset/train/'

#array that will hold numpy arrays of images
imagesToDisplay = []
npImages = []
linNpImages = []

pathlist = list(Path(directory_in_str).rglob('*.jpeg'))
pathlist.sort()
for path in pathlist:
    # because path is object not string
    path_in_str = str(path)
    
    #https://auth0.com/blog/image-processing-in-python-with-pillow/
    #convert to grayscale and resize image
    image = Image.open(path_in_str).convert('LA')
    resizedImage = image.resize((100,100))
    imagesToDisplay.append(resizedImage)
    
    #convert to numpy array
    numpyImage = np.asarray(resizedImage)
    npImages.append(numpyImage)
    numpyImage = numpyImage[:,:,0]
    
    #linearize data
    linNumpyImage = numpyImage.flatten().reshape(1,10000)
    linNpImages.append(linNumpyImage)


In [None]:
summary = pd.read_csv('./Chest_xray_Corona_Metadata.csv')
classifications = summary.Label.to_numpy()

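# The Label column already holds the class names ('Pneumonia' / 'Normal'), so use it directly.
# (Assigning a string to a boolean-masked DataFrame would overwrite every column of the matching rows.)
# NOTE: this assumes the metadata rows are in the same order as the sorted image paths.
labels = summary.Label.to_numpy()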

def plot_gallery(images, titles, h, w, n_row=3, n_col=3):
    """Helper function to plot the first and last n_row * n_col images"""
    plt.figure(figsize=(n_col * n_col, 6 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    # top half of the grid: the first images (use the images argument, not the global list)
    for i in range(n_row * n_col):
        plt.subplot(n_row * 2, n_col, i + 1)
        plt.imshow(images[i], cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
    # bottom half of the grid: the last images, counted back from the end
    for j in range(n_row * n_col):
        plt.subplot(n_row * 2, n_col, n_row * n_col + j + 1)
        plt.imshow(images[-(j + 1)], cmap=plt.cm.gray)
        plt.title(titles[-(j + 1)], size=12)
        plt.xticks(())
        plt.yticks(())
        
plot_gallery(imagesToDisplay, labels, 100, 100)

## Data Reduction (6pts)

<ul>
    <li><b>[.5 points]</b> Perform linear dimensionality reduction of the images using principal components analysis. Visualize the explained variance of each component. Analyze how many dimensions are required to adequately represent your image data. Explain your analysis and conclusion.</li>
    <li><b>[.5 points]</b> Perform linear dimensionality reduction of your image data using randomized principal components analysis. Visualize the explained variance of each component. Analyze how many dimensions are required to adequately represent your image data. Explain your analysis and conclusion.</li>
    <li><b>[2 points]</b> Compare the representation using PCA and Randomized PCA. The method you choose to compare dimensionality methods should quantitatively explain which method is better at representing the images with fewer components. Do you prefer one method over another? Why?</li>
    <li><b>[1 point]</b> Perform feature extraction upon the images using any feature extraction technique (e.g., gabor filters, ordered gradients, DAISY, etc.).</li>
    <li><b>[2 points]</b> Does this feature extraction method show promise for your prediction task? Why? Use visualizations to analyze this question. For example, visualize the differences between statistics of extracted features in each target class. Another option: use a heat map of the pairwise differences (ordered by class) among all extracted features. Another option: build a nearest neighbor classifier to see actual classification performance.</li>
</ul>
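
The nearest neighbor option in the last item can be sketched as follows. This is only an illustration under assumptions: `knn_feature_score` is a hypothetical helper name, the feature table is synthetic stand-in data rather than features extracted from the X-ray images, and 5-fold cross-validated accuracy is one possible way to score it.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def knn_feature_score(features, labels, n_neighbors=1, folds=5):
    """Mean cross-validated accuracy of a k-NN classifier on a feature table.

    features: array of shape (n_images, n_features), one row per image.
    labels:   array of shape (n_images,), one class label per image.
    """
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(clf, features, labels, cv=folds)
    return scores.mean()


# Synthetic stand-in for extracted features (assumption: NOT the real data).
rng = np.random.default_rng(0)
demo_labels = np.repeat(['Normal', 'Pneumonia'], 50)
demo_features = rng.normal(size=(100, 20))
demo_features[50:] += 3.0  # shift one class so the two classes are separable

demo_accuracy = knn_feature_score(demo_features, demo_labels)

# Accuracy of always predicting the most common class, as a reference point.
majority_baseline = max(np.mean(demo_labels == c) for c in np.unique(demo_labels))

print("k-NN accuracy:", demo_accuracy, "majority baseline:", majority_baseline)
```

Comparing the score against the majority-class baseline matters when the classes are imbalanced, since a high raw accuracy can otherwise be misleading.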

<ul>
    <li><b>[.5 points]</b> Perform linear dimensionality reduction of the images using principal components analysis. Visualize the explained variance of each component. Analyze how many dimensions are required to adequately represent your image data. Explain your analysis and conclusion.</li>
</ul>

In [None]:
# adapted from Sebastian Raschka's example (your book!)
# also from his blog here: http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html

def plot_explained_variance(pca):
    import plotly
    # only the plotly objects actually used below are imported
    from plotly.graph_objs import Bar, Scatter, Layout
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })

In [None]:
# stack the (1, 10000) row vectors into one (n_images, 10000) feature table
X = np.vstack(linNpImages)

# specify height and width
h, w = (100,100)

In [None]:
from sklearn.decomposition import PCA

n_components = 300

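# request the exact solver explicitly: the default svd_solver='auto' may pick the
# randomized solver for data this large, which would blur the comparison below
pca = PCA(n_components=n_components, svd_solver='full')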
pca.fit(X.copy())
eigenlungs_pca = pca.components_.reshape((n_components, h, w))
plot_explained_variance(pca)

The explained variance graph shows that about 30 components capture 80% of the variance in the images, while reaching 90% takes about 117 components. Note that these figures describe the share of variance retained, not classification accuracy. We consider 80% an adequate representation: the next 10% of variance costs nearly four times as many components, while 30 components already reconstruct a fairly faithful image.
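
The component counts can also be read off numerically instead of from the plot. The sketch below is a minimal example: `components_for_variance` is a hypothetical helper name, and it is demonstrated on synthetic data rather than on the X-ray images, so it does not reproduce the numbers quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(fitted_pca, threshold):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches `threshold` (e.g. 0.8), or None if the fitted
    components never reach it."""
    cum_var = np.cumsum(fitted_pca.explained_variance_ratio_)
    if cum_var[-1] < threshold:
        return None
    # searchsorted returns the first index where cum_var >= threshold;
    # add 1 to turn a zero-based index into a component count
    return int(np.searchsorted(cum_var, threshold) + 1)


# Synthetic stand-in data (assumption): 200 samples, 50 features whose
# scales decay, so the first components carry most of the variance.
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(200, 50)) * np.linspace(10, 0.1, 50)
demo_pca = PCA().fit(demo_data)

n80 = components_for_variance(demo_pca, 0.80)
n90 = components_for_variance(demo_pca, 0.90)
print("components for 80%:", n80, "components for 90%:", n90)
```

In the notebook, the same helper could be called on the fitted `pca` object with thresholds of 0.8 and 0.9.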

In [None]:
eigenlung_titles = ["eigenlung %d" % i for i in range(eigenlungs_pca.shape[0])]
plot_gallery(eigenlungs_pca, eigenlung_titles, h, w)

In [None]:
def reconstruct_image(trans_obj,org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
    
idx_to_reconstruct = 1    
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X_idx.reshape(1, -1))

In [None]:
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA')
plt.grid(False)


<ul>
    <li><b>[.5 points]</b> Perform linear dimensionality reduction of your image data using randomized principal components analysis. Visualize the explained variance of each component. Analyze how many dimensions are required to adequately represent your image data. Explain your analysis and conclusion.</li>
</ul>

In [None]:
# let's do randomized PCA of the features and go from 10000 features (100x100 pixels) to 300 features

n_components = 300
rpca = PCA(n_components=n_components, svd_solver='randomized')
%time rpca.fit(X.copy())
eigenlungs = rpca.components_.reshape((n_components, h, w))
plot_explained_variance(rpca)

In [None]:
eigenlung_titles = ["eigenlung %d" % i for i in range(eigenlungs.shape[0])]
plot_gallery(eigenlungs, eigenlung_titles, h, w)

In [None]:
idx_to_reconstruct = 1    
X_idx = X[idx_to_reconstruct]
# reconstruct with the randomized model (rpca), not the earlier pca object
low_dimensional_representation, reconstructed_image = reconstruct_image(rpca,X_idx.reshape(1, -1))

plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA')
plt.grid(False)
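
One possible way to compare the two methods quantitatively is the mean squared reconstruction error at several component counts. The sketch below assumes that metric and uses synthetic data. `reconstruction_mse` and `compare_solvers` are hypothetical helper names, and the exact solver is requested explicitly with `svd_solver='full'` so that the two models really differ.

```python
import numpy as np
from sklearn.decomposition import PCA


def reconstruction_mse(model, data):
    """Mean squared error between data and its reconstruction from the
    low-dimensional representation of a fitted PCA model."""
    reconstructed = model.inverse_transform(model.transform(data))
    return float(np.mean((data - reconstructed) ** 2))


def compare_solvers(data, component_counts, random_state=0):
    """Return (k, full_mse, randomized_mse) for each component count k."""
    rows = []
    for k in component_counts:
        full = PCA(n_components=k, svd_solver='full').fit(data)
        rand = PCA(n_components=k, svd_solver='randomized',
                   random_state=random_state).fit(data)
        rows.append((k,
                     reconstruction_mse(full, data),
                     reconstruction_mse(rand, data)))
    return rows


# Synthetic stand-in data (assumption): low-rank signal plus a little noise.
rng = np.random.default_rng(0)
demo_data = (rng.normal(size=(120, 30)) @ rng.normal(size=(30, 400))
             + 0.1 * rng.normal(size=(120, 400)))

comparison = compare_solvers(demo_data, [5, 10, 20])
for k, full_mse, rand_mse in comparison:
    print(f"k={k}: full MSE={full_mse:.4f}, randomized MSE={rand_mse:.4f}")
```

On the training data, exact PCA gives the lowest possible error for a given number of components, so the interesting quantities are how close the randomized solver comes and how much faster it fits.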

## Exceptional Work (1pt)

<ul>
    <li>One idea (<b>required for 7000 level students</b>): Perform feature extraction upon the images using DAISY. Rather than matching the images using the total DAISY vector, you will instead use key point matching. You will need to investigate appropriate methods for key point matching using DAISY. NOTE: this often requires some type of brute force matching per pair of images, which can be computationally expensive.</li>
</ul>
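
A rough sketch of brute-force key point matching with DAISY is given below. It uses scikit-image's `daisy` and `match_descriptors`. The helper names `daisy_keypoints` and `match_fraction`, the parameter values, and the random stand-in images are all assumptions for illustration. Scoring a pair of images by the fraction of mutually matched descriptors is only one possible choice.

```python
import numpy as np
from skimage.feature import daisy, match_descriptors


def daisy_keypoints(image, step=10, radius=15):
    """Compute DAISY descriptors on a regular grid and return them as a
    (n_keypoints, descriptor_length) table, one row per grid location."""
    descs = daisy(image, step=step, radius=radius,
                  rings=2, histograms=8, orientations=8)
    return descs.reshape(-1, descs.shape[-1])


def match_fraction(desc_a, desc_b):
    """Fraction of descriptors in desc_a that have a mutual nearest
    neighbor in desc_b (brute-force matching with cross-checking)."""
    matches = match_descriptors(desc_a, desc_b, cross_check=True)
    return matches.shape[0] / desc_a.shape[0]


# Random stand-in images (assumption): 100x100 grayscale values in [0, 1].
rng = np.random.default_rng(0)
image_a = rng.random((100, 100))
image_b = rng.random((100, 100))

desc_a = daisy_keypoints(image_a)
desc_b = daisy_keypoints(image_b)

self_score = match_fraction(desc_a, desc_a)   # an image matched against itself
cross_score = match_fraction(desc_a, desc_b)  # two unrelated images
print("self:", self_score, "cross:", cross_score)
```

Because every pair of images requires its own descriptor-to-descriptor comparison, the cost grows quadratically with the number of images, so a subsample of the dataset may be needed in practice.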