# CCE AIML learners project on data compression for machine learning
### by Shristi Singh, Parinita Bora


#### The algorithms experimented with are SVD, PCA, and K-means, applied to high-resolution image data.
#### This notebook covers PCA for an image of a cosmic object captured by the James Webb Space Telescope.

## 1. Introduction
When a large amount of unlabeled data is available for training a machine learning model, unsupervised learning algorithms help in understanding the data.
Unsupervised learning can also help with dimensionality reduction.
Dimensionality reduction in turn helps with data visualization (e.g. the t-SNE method).
When the data is reduced, the complexity of the model can be reduced, and so can the training time.


## 2. Brief review - Principal Component Analysis (PCA)
https://en.wikipedia.org/wiki/Principal_component_analysis


Principal Component Analysis is a commonly used dimensionality reduction method.
The data is projected onto a lower-dimensional subspace spanned by orthogonal directions (the principal components).

<img align="right" src="https://upload.wikimedia.org/wikipedia/commons/f/f5/GaussianScatterPCA.svg" width="400">

In the figure, the observations form an ellipsoid-shaped cloud in feature space.
When the data is expressed in an orthogonal basis aligned with the axes of that ellipsoid, highly correlated features become redundant and can be dropped, so the data lies approximately in a subspace of smaller dimension.
This allows the space to be reduced using the new projection. The ellipsoid axes with maximal dispersion (variance) are the ones chosen.

 

# The mathematics behind PCA:

In order to decrease the dimensionality of our data from $n$ to $k$ with $k \leq n$, we sort our list of axes in order of decreasing dispersion and take the top-$k$ of them.

## Step 1. Calculate the covariance matrix of the data
By definition, the covariance of two features is: $$cov(X_i, X_j) = E[(X_i - \mu_i) (X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j,$$ where $\mu_i$ is the expected value of the $i$th feature.

- The covariance is symmetric.
- The covariance of a feature with itself is equal to its dispersion (variance).

Hence the covariance matrix is symmetric, with the dispersion of the corresponding features on the diagonal.
The off-diagonal values are the covariances of the corresponding pairs of features. In matrix terms, where $\mathbf{X}$ is the matrix of observations, the covariance matrix is as follows:

$$\Sigma = E[(\mathbf{X} - E[\mathbf{X}]) (\mathbf{X} - E[\mathbf{X}])^{T}]$$
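As a minimal sketch of Step 1, the covariance matrix can be estimated with NumPy. The toy data, the variable names (`X`, `X_centered`, `cov`) and the convention that rows are observations and columns are features are assumptions made for illustration only; they are not the image data used later in the notebook.

```python
import numpy as np

# Hypothetical toy data: 200 observations (rows) of 3 features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]  # make feature 1 correlated with feature 0

# Centre each feature, then form the sample covariance matrix.
X_centered = X - X.mean(axis=0)
n_samples = X.shape[0]
cov = X_centered.T @ X_centered / (n_samples - 1)

# Cross-check against NumPy's estimator (rowvar=False: columns are features).
assert np.allclose(cov, np.cov(X, rowvar=False))
# The matrix is symmetric and its diagonal holds the per-feature variances.
assert np.allclose(cov, cov.T)
assert np.allclose(np.diag(cov), X.var(axis=0, ddof=1))
```

Dividing by `n_samples - 1` gives the unbiased sample estimate; dividing by `n_samples` instead would only rescale the eigenvalues, not change the eigenvectors.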

## Step 2. Extract the eigenvectors and the eigenvalues of that matrix
Viewed as a linear operator, a matrix has eigenvalues and eigenvectors. The eigenvectors describe the directions in the space that are only stretched when the operator is applied: each eigenvector is scaled by its corresponding eigenvalue while its direction remains the same.

That is, a matrix $M$ with eigenvector $w_i$ and eigenvalue $\lambda_i$ satisfies the equation $M w_i = \lambda_i w_i$.
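The eigenvalue equation can be checked numerically. The sketch below uses a small hypothetical symmetric matrix `M` as a stand-in for a covariance matrix and `np.linalg.eigh`, which is intended for symmetric matrices; the matrix values are an assumption chosen only to keep the example readable.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns the eigenvalues in ascending order and the
# corresponding eigenvectors as the columns of W.
eigenvalues, W = np.linalg.eigh(M)

# Verify M w_i = lambda_i w_i for every eigenpair.
for i in range(len(eigenvalues)):
    w_i = W[:, i]
    assert np.allclose(M @ w_i, eigenvalues[i] * w_i)

# The eigenvectors are orthonormal, so W'W is the identity.
assert np.allclose(W.T @ W, np.eye(2))
```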

   
## Step 3. Select the number of desired dimensions and filter the eigenvectors to match it, sorting them by their associated eigenvalue

For centred data, the covariance matrix of $\mathbf{X}$ is proportional to the product $\mathbf{X}^{T} \mathbf{X}$. By the [Rayleigh quotient](https://en.wikipedia.org/wiki/Rayleigh_quotient), the maximum variance of the projected sample is attained along the eigenvector with the largest eigenvalue. The principal components therefore keep only the eigenvectors corresponding to the $k$ largest eigenvalues.
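One way to sketch Step 3 is to sort the eigenpairs by decreasing eigenvalue and keep the first $k$ columns. The toy data, the choice `k = 2` and the names `W_k` and `explained` are assumptions for illustration; the notebook itself delegates this step to scikit-learn.

```python
import numpy as np

# Hypothetical toy data with very different variances per feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# eigh returns ascending eigenvalues, so reorder to descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the k eigenvectors with the largest eigenvalues.
k = 2
W_k = eigenvectors[:, :k]

# Fraction of the total variance captured by the kept components.
explained = eigenvalues[:k].sum() / eigenvalues.sum()
print(f"kept {k} of {len(eigenvalues)} components, "
      f"explained variance fraction = {explained:.3f}")
```

The `explained` value plays the same role as the sum of `explained_variance_ratio_` printed in the scikit-learn loop.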

## Step 4. Multiply the original data by the feature vectors selected in the previous step

The data matrix $X$ is multiplied by the selected components to get the projection of the data onto the orthogonal basis formed by those components.
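The four steps can be combined into a short NumPy sketch that projects the data and then maps it back. The helper names `pca_project` and `pca_reconstruct` are hypothetical, and the random data is only a stand-in; the point is that keeping all components recovers the data exactly, while keeping fewer gives an approximation.

```python
import numpy as np

def pca_project(X, k):
    """Return (scores T, components W_k, mean) for the top-k principal components."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    top_k = np.argsort(eigenvalues)[::-1][:k]
    W_k = eigenvectors[:, top_k]
    return X_centered @ W_k, W_k, mean  # T = X W

def pca_reconstruct(T, W_k, mean):
    """Map the scores back to the original coordinates: X is approximately T W' + mean."""
    return T @ W_k.T + mean

# Hypothetical toy data: 50 observations of 4 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))

# Keeping 2 of 4 components gives a lossy approximation.
T, W_k, mean = pca_project(X, k=2)
X_hat = pca_reconstruct(T, W_k, mean)

# Keeping all 4 components reproduces the data up to rounding.
T_full, W_full, mean_full = pca_project(X, k=4)
X_exact = pca_reconstruct(T_full, W_full, mean_full)
assert np.allclose(X_exact, X)
```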

References
- http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
- https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca



In [None]:
# Upload the input image file (downloaded beforehand) to the Colab session
from google.colab import files 
file=files.upload()

# The input data file
The example data file, with dimensions 570 x 985 x 3, is an image of a cosmic object captured by the James Webb Space Telescope (publicly available on the NASA website).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
img=plt.imread("/content/sample_image.jpg")

#Read and plot the image
plt.axis("off")
plt.imshow(img)
plt.title("The input image "  )
plt.show()

#The image is a three-dimensional array (height x width x RGB channels).

#Reshape img into a matrix
#The array has 570 rows, each holding 985 pixels with 3 colour values.
#Flatten each row so the result is a 2-D matrix that PCA can work with: 2955 = 985 * 3

img_reshaped = np.reshape(img,(570,2955 ))
#print(img_reshaped.shape)

#Make the data centred at origin

img_mean=img_reshaped.mean(axis=0)
img_reshaped=img_reshaped-img_mean



In [None]:
img_mean.shape

In [None]:
img_reshaped.shape

In [None]:


#### Apply PCA, repeating for 10 to 100 components in steps of 10

from sklearn.decomposition import PCA

for components in range( 10,110, 10):
  #components=10
  pca=PCA(n_components=components)
  reduced=pca.fit(img_reshaped)

  #Note that T=XW, where W is the weight matrix with each column an eigenvector. Also W'W=I, because the eigenvectors are orthogonal and normalized to unit length.
  #To get back from the T coordinate system to the original coordinate system, compute X=TW'. If W holds all eigenvectors, X is recovered exactly; otherwise it is an approximation. Also remember to add the mean back to X to get the original values.
  #In scikit-learn, pca.components_ stores W' (one component per row).

  #Following represents the values in the new coordinate system T=XW
  img_transformed_coordinate=pca.transform(img_reshaped)
  print(img_transformed_coordinate.shape)

############  Explained variance 
  print(np.sum(pca.explained_variance_ratio_) )
  plt.grid()
  fname_ev =  "Explained_variance" + str(components)
  #plt.savefig(fname=fname_ev)
  plt.title("PCA explained variance for components = %d" % components)
  plt.plot(np.cumsum(pca.explained_variance_ratio_ * 100))
  plt.xlabel('Number of components')
  plt.ylabel('Cumulative explained variance (%)')
  plt.savefig(fname=fname_ev)
  plt.show()
#############  Reconstruction
  #To go back to the old coordinate system, either use the built-in command or do the operation manually: X=TW'
  #img_original_coordinate = pca.inverse_transform(img_transformed_coordinate)

  img_original_coordinate = img_transformed_coordinate.dot(pca.components_)

  #Shifting the mean to original values

  img_original_coordinate = img_original_coordinate+img_mean

  #Reshaping the matrix 

  img_final=np.reshape(img_original_coordinate,(570,985,3))
  img_final=img_final.astype('int')

  img_final[img_final<0]=0
  img_final[img_final>255]=255
  
  plt.title(" Reconstructed Image from components =%d" %components )
  plt.axis("off")
  plt.imshow(img_final)
  #plt.savefig("imagedata_new.jpg")
  #plt.show()
  fname_r =  "reconstructed" + str(components)
  plt.savefig(fname=fname_r, dpi=100)
  plt.show()
################ Reduction 
  reduced_s = []
  totsize = img_reshaped.shape[0]*img_reshaped.shape[1]
  for n_comp in range(img_reshaped.shape[1]):
      #Values in an SVD-style rank-n_comp factorisation: (rows x n_comp) + (n_comp x n_comp) + (n_comp x cols)
      redsize = img_reshaped.shape[0]*n_comp + n_comp*n_comp + n_comp*img_reshaped.shape[1]
      reduced_s.append((totsize - redsize)/totsize * 100)

  plt.title(" Percentage Reduction in Image Size for components =%d" %components )
  plt.grid()
  plt.ylabel("Percentage Reduction in Size")
  plt.xlabel(" Number of Principal Components")
  plt.plot(reduced_s)
  fname_pcr =  "Percentage_reduction" + str(components)
  plt.savefig(fname=fname_pcr, dpi=100)
  plt.show()
########Data Compression

# Data compression achieved
#Number of values required to store the original image

  original_number_of_values=570*985*3

#Number of values required to store the compressed image

  new_number_of_values=570*components+985*components+components
  space_saved_in_percentage= ((original_number_of_values-new_number_of_values)/original_number_of_values)*100

  print("The compression ratio for components = %d" % components)
  print("%f" % space_saved_in_percentage)



# Data compression

## Compression ratio calculated for components [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] (to be updated in the code)

| Components | Compression ratio (%) |
|------------|-----------------------|
| 10         | 99.076202             |
| 20         | 98.152403             |
| 30         | 97.228605             |
| 40         | 96.304806             |
| 50         | 95.381008             |
| 60         | 94.457209             |
| 70         | 93.533411             |
| 80         | 92.609612             |
| 90         | 91.685814             |
| 100        | 90.762015             |
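The table values can be reproduced with plain arithmetic from the formula used in the notebook's loop. The sketch below does that in `percent_saved`. It also shows, purely as an assumption and not as the notebook's method, an alternative count in `percent_saved_flattened` that treats the data as the flattened 570 x 2955 matrix and stores the scores, the component matrix and the mean vector. Both function names are hypothetical.

```python
# Dimensions taken from the notebook: a 570 x 985 x 3 image.
rows, cols, channels = 570, 985, 3
original_values = rows * cols * channels

def percent_saved(k):
    """Percentage of values saved, using the formula from the notebook's loop."""
    stored = rows * k + cols * k + k
    return (original_values - stored) / original_values * 100

def percent_saved_flattened(k):
    """Alternative count (an assumption): scores (rows x k), components
    (k x cols*channels) and the mean vector (cols*channels)."""
    width = cols * channels
    stored = rows * k + k * width + width
    return (original_values - stored) / original_values * 100

for k in range(10, 110, 10):
    print("%3d  %f  %f" % (k, percent_saved(k), percent_saved_flattened(k)))
```

The first printed column of percentages corresponds to the table; the second is lower because the flattened count stores more values per component.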

In [None]:
!ls -ltr


