# DATA690 MATH: 
## Project 2 Notebook: PCA
## Barker French, Shaimar Gonzales, Scott Hirabayashi

Class Notes-
- https://colab.research.google.com/drive/1IQ99X-o2nsIAs5oqkqo_-3NEfg9kTl-s?authuser=2

- https://docs.google.com/document/d/1B5p0LMuLAwhF21ZfKw5UpkKU2aLghvI4FGOLyMpzNhw/edit

## Project 2. Principal Component Analysis
### Assigned on Nov 1, 2021; Due on Nov 19, 2021
**Disclosure:** This is a graded project. Groups will be assigned. If you opt to work alone, please let me and your group members know. Please refer to the code of conduct.

- In this project, you are expected to replicate principal component analysis: an unsupervised machine learning method to reduce feature dimensions for efficient ML practice. 

- The only Python package that you are allowed to use is NumPy.

## Input/Output
- You will be given randomly generated datasets with many (k) features and N observations.

- You are expected to deliver output that has m features, with m < k. The value of m is either passed directly by the user of your function or computed from another user input, Variation Retainment. If both are given, Variation Retainment overrides m (see below for details).

### Functionalities
Define your own helper functions to... (10pt)
- Determine the dimensionality of the data (done)
- Check if your matrices are invertible (done)
- You can use this data set for testing (USING)

### Grading Rubric:
- Functionality to normalize your data (use minmax here) 10pt (have)
- Functionality to compute eigenvectors and eigenvalues: 10pt (have)
- Functionality to select the m principal components: 10pt (have)
- If user inputs m = m and no var_retain: set m = m (OK)
- If user inputs var_retain, compute m to match the variation retainment of the data (OK)
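The selection rule in the last two items can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's class: the name `select_num_components` and the toy eigenvalues are hypothetical, and `var_retain` is taken to be a fraction in (0, 1] that overrides `m` when given.

```python
import numpy as np

def select_num_components(eigvals, m=None, var_retain=None):
    """Hypothetical helper: choose how many principal components to keep."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    if var_retain is not None:
        cum_share = np.cumsum(eigvals) / eigvals.sum()
        # Smallest count whose cumulative share reaches var_retain.
        # The small tolerance guards against round-off near var_retain == 1.0.
        count = int(np.searchsorted(cum_share, var_retain - 1e-12)) + 1
        return min(count, len(eigvals))
    if m is None:
        raise ValueError("Provide either m or var_retain")
    return int(m)

eigvals = [4.0, 3.0, 2.0, 1.0]  # toy values; cumulative shares 0.4, 0.7, 0.9, 1.0
print(select_num_components(eigvals, m=2))                  # m used as given
print(select_num_components(eigvals, m=2, var_retain=0.8))  # var_retain overrides m
```

With these toy values, retaining 80% of the variance needs three components, because two components only reach 70%.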

### Functionality to project the data using the principal components 20pt
Your general pipeline: 10pt  
- Raw data -> Normalization -> PCA -> PCA projected data (OK)
- Given a y-column, use the linear regression pipeline you built in Project 1 (OK)
- Separately fit y using Raw data and PCA projected data 20pt (DONE)
- You need to compare the prediction losses to select a model 10pt (DONE)

### Bonus:
- Pack all your functions and pipeline into a Python object 10pt (DONE)
- Set up unit tests for your pipeline (DID NOT DO)
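Since the unit tests were not done, here is a rough sketch of what they might look like. It uses small stand-in functions (`minmax` and `pca_project` are hypothetical, written only for this example) rather than the notebook's classes, and plain `assert` statements rather than a test framework.

```python
import numpy as np

def minmax(x):
    """Stand-in min-max normalization (assumes no constant columns)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def pca_project(x, m):
    """Stand-in PCA: center, eigendecompose the covariance, keep the top m components."""
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / len(x)
    eigval, eigvec = np.linalg.eigh(cov)      # symmetric input, ascending eigenvalues
    order = np.argsort(eigval)[::-1][:m]      # indices of the m largest eigenvalues
    return centered @ eigvec[:, order], eigval[order]

def test_minmax_range():
    x = np.random.default_rng(0).normal(size=(50, 4))
    scaled = minmax(x)
    assert np.allclose(scaled.min(axis=0), 0.0)
    assert np.allclose(scaled.max(axis=0), 1.0)

def test_projection_shape_and_variance():
    x = minmax(np.random.default_rng(1).normal(size=(200, 6)))
    projected, eigval = pca_project(x, m=3)
    assert projected.shape == (200, 3)
    # The variance of each projected column equals its eigenvalue.
    assert np.allclose(projected.var(axis=0), eigval)
    # Projected columns are uncorrelated with each other.
    cov_proj = np.cov(projected, rowvar=False, bias=True)
    assert np.allclose(cov_proj, np.diag(eigval), atol=1e-10)

test_minmax_range()
test_projection_shape_and_variance()
print("all tests passed")
```

The same checks could be pointed at `PCA_redux` and `linear_reg` once those classes expose comparable return values.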


In [None]:
#Import NumPy (pandas is used only to load the test data)
import pandas as pd 
import numpy as np
import random
import time

In [None]:
#Select Data
site = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt'
df = pd.read_csv(site, header = None)

df.dtypes.value_counts()

float64    75
object     43
int64      29
dtype: int64

In [None]:
df.isna().sum().value_counts()

0    147
dtype: int64

In [None]:
#Test dataset-- Include only Int Variable Columns
sample_dataset = df[df.select_dtypes(include=['int64']).columns]
test_dataset = sample_dataset[sample_dataset.columns[0:10]]
test_target = df[127]

sample_dataset = sample_dataset.to_numpy()
test_dataset = test_dataset.to_numpy()
test_target = test_target.to_numpy()

In [None]:
test_dataset

array([[    1, 11980, 11980, ..., 13600,  5725, 27101],
       [    1, 23123, 23123, ..., 18137,     0, 20074],
       [    1, 29344, 29344, ..., 16644, 21606, 15528],
       ...,
       [   10, 32824, 32824, ..., 13630, 13197, 17313],
       [   10, 13547,     0, ...,  6437,  8271,   171],
       [   10, 28898, 28664, ...,  8163,  9874,  6827]])

In [None]:
test_target

array([0., 0., 0., ..., 0., 0., 0.])

In [None]:
############################### 
#PCA Function                 #
###############################
class PCA_redux(object):
    def __init__(self, arry_):
        #Set stored variable 'arry_'
        self.arry_ = arry_

        #Error flag (set first so it always exists)
        self.cannot_run = False

        #Normalize and add ones as norm_x
        norm_x = self.normalize_x()
        self.norm_x = np.hstack((np.ones(shape=(len(norm_x),1)), norm_x))

        #Get Principal Components
        self.get_PC()
        
#CHECK DIMENSIONS
    def dim(self) -> tuple:
        """Return the dimensions (rows, columns) of the numpy array"""
        return self.arry_.shape

#CHECK INVERTIBLITY
    def invertible(self) -> None:
        """Check X'X of the de-meaned, normed X for invertibility and print the result"""
        norm_arry_ = self.norm_x

        #DE-MEAN THE NORMED X
        norm_dmnd = norm_arry_ - norm_arry_.mean(axis = 0)
        gram = np.dot(norm_dmnd.T, norm_dmnd)

        #Print invertibility (rank test; comparing a float determinant to 0 is fragile)
        if np.linalg.matrix_rank(gram) < gram.shape[0]:
            print("det(X'*X) = 0 | Not Invertible")
        else:
            print(f"det(X'*X)= {np.linalg.det(gram)} | Invertible")

#NORMALIZE X                
    def normalize_x(self)-> np.array:
        """Normalize the array column-wise to [0, 1] (minmax)"""
        column_min, column_max = self.arry_.min(axis=0), self.arry_.max(axis=0)
        return (self.arry_ - column_min) / (column_max - column_min)

############################### 
#CHECK DATA & SELECT PCA NUM  #
###############################
    def initial_check(self) -> bool:
        """Run an initial check of the data and invertibility; print basic info"""
        print("*****"*10)
        print(f"Original Dataset Dimensions: {self.dim()}")

        #CHECK INVERTIBILITY
        self.invertible()
        print("*****"*10)

        #Print a message and return False if there is a data issue.
        if self.cannot_run:
            print("Data condition not ready for PCA.")
            return False

        #Return True if the data is OK.
        return True
############################### 
#Covar/Eigenvals&Vecs         #
###############################
    #COVARIANCE FUNCTION
    def covar(self, data_) -> np.array:
        """Covariance Function that takes data and returns covariance"""
        covar_ = np.dot((data_ - data_.mean(axis = 0)).T,
                      (data_ - data_.mean(axis = 0)))/len(data_)
        return covar_

    #GET PRINCIPAL COMPONENTS (EIGENVAL/EIGENVEC)
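    def get_PC(self) -> tuple:
        """Principal Components Function that returns eigenvalues and eigenvectors"""
        data_ = self.norm_x
        cov = self.covar(data_)

        #np.linalg.eig does not guarantee any ordering, so sort by descending eigenvalue
        eigval, eigvec = np.linalg.eig(cov)
        order = np.argsort(eigval)[::-1]
        self.eigval, self.eigvec = eigval[order], eigvec[:, order]
        self.eig_pairs = (self.eigval, self.eigvec)

        return self.eigval, self.eigvec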

    #RETURNS ARRAY OF FEATURE IMPORTANCE
    def ret_ordered_features(self) -> np.array:
        """Returns an array of ordered percents"""
        ret_percents = np.cumsum(self.eigval[np.argsort(self.eigval)[::-1]]/self.eigval.sum() * 100)
        return ret_percents

    #RETURNS THE INDEX OF A PERCENTAGE VALUE ENTERED BY THE USER
    def ret_percent_index(self, thresh = 100) -> int:
        """Returns the index of the first component whose cumulative % exceeds thresh"""
        vec_mags = self.ret_ordered_features()
        keep_cols = 0

        #RETURNS INDEX VALUE FOR WHEN THRESHOLD IS MET (0 if never exceeded)
        for indx_, elem in enumerate(vec_mags):
            if elem > thresh:
                keep_cols = indx_
                break

        return keep_cols
    
    #COMBINES THE TWO
    def ret_compare_index(self, k = 1, thresh = 100) -> int:
        """Full function for returning the number of eigenvectors"""
        pct_features = 0
        vec_mags = self.ret_ordered_features()

        #Index of the first cumulative % above the threshold (0 if never exceeded)
        for indx_, elem in enumerate(vec_mags):
            if elem > thresh:
                pct_features = indx_
                break

        #Keep the highest number of features to capture the variance
        self.keep_features = max(pct_features, k)
        return self.keep_features
############################### 
#Run PCA|Return Prjtd Ftrs    #
###############################
    #Create Projection Matrix
    def create_projection_matrix(self, n = 2) -> np.array:
        """Creates the projection matrix from the first n eigenvector columns"""
        projection_matrix = self.eigvec[:, :n]
        self.projection_matrix = projection_matrix
        return projection_matrix

    #Project Data
    def project_data(self) -> np.array:
        """Projects the data onto projection matrix for new data"""
        return np.dot(self.norm_x, self.projection_matrix)

    def run_pca(self, sel_features = 1, capture_pct = 100) -> np.array:
        """Full function run: selects the number of components, then projects the data"""
                
        #Create Projected Features
        sel_features = self.ret_compare_index(k = sel_features, thresh = capture_pct)
        self.create_projection_matrix(n = sel_features)
        pca_matrix = self.project_data()

        #Return pca_matrix
        self.pca_matrix = pca_matrix
        return pca_matrix
    
    def run_pca_report(self, sel_features = 1, capture_pct = 100) -> np.array:
        """Creates an in-depth report of the PCA data itself."""
        self.initial_check()    
        
        #Create Projected Features
        sel_features = self.ret_compare_index(k = sel_features, thresh = capture_pct)
        self.create_projection_matrix(n = sel_features)
        pca_matrix = self.project_data()
        
        #Print Statements for Outputs
        print(f"{sel_features} features selected. New Matrix= {pca_matrix.shape}.")
        print("*****"*10)

        #Return pca_matrix
        self.pca_matrix = pca_matrix
        return pca_matrix

###########################
# Linear Regression Class #
###########################
class linear_reg(object):
    #Initialize with input array, target, features
    def __init__(self, arry_, target_):
        self.arry_ = arry_ #Input Array
        self.target_ = target_ #Target Feature
        self.features_ = arry_.shape[1] #Number of Features/Variables
        
        #Flag if issue with running. 0 if can run, 1 if cannot run
        self.cannot_run = 0

    # a. Return data dimensions        
    def dim(self):
        """Return Data Dimensions"""
        print("Data Shape:",self.arry_.shape, self.target_.shape)
        return self.arry_.shape

    # b. Return Data Check
    def data_check(self):
        """Return Data Check"""
    # if the number of observations is less than the number of parameters (bad)
        if (self.arry_.shape[0] < self.features_): 
            self.cannot_run = 1
            print("Data = False: Not enough data for modeling with OLS.")

    # Number of observations is greater than the number of parameters (good)
        else: 
            self.cannot_run = 0
            print("Data = True: Enough data for modeling with OLS.")
           
#############################
# a. Min/Max Normalization  #
#############################
    def normalize_x(self) -> np.array:
        """Normalization function for X (column-wise min/max)"""
        x = self.arry_
        x_min = x.min(axis=0)
        x_max = x.max(axis=0)

        #Normalize every row of each column
        x_norm = (x - x_min) / (x_max - x_min)

        #Save x_norm
        self.x_norm = x_norm

        #Return x_norm
        return x_norm

######################################
# b. Min/Max Normalization of Matrix #
######################################
    #Compute Big X
    def big_x(self, normlze = True) -> np.array:
        """Create big x"""
        if normlze:
            x = self.normalize_x()
        else:
            x = self.arry_

        n = self.arry_.shape[0]
        ones = np.ones(shape = (n,1))
        X = np.hstack((ones, x))
        self.X = X

        return self.X

    #e. Check if Invertible
    def invertible(self):
        """Check for invertibility"""
        X = self.arry_
        det = np.linalg.det(np.dot(X.T,X))

        if det == 0:
            self.cannot_run = 1
            print("Invertible = False:","det(X'*X) = 0.")

        else:
            self.cannot_run = 0
            print("Invertible = True:","det(X'*X) =", det)
            

    #f. collinearity
    def collinearity(self):
        """Check for collinearity"""
        X = self.arry_
        k = self.features_

        r = np.linalg.matrix_rank(np.dot(X.T,X))

        #X'X is k x k, so full rank means rank k
        if (r == k):
            print("Collinearity = False:", "X'*X is a full rank matrix and is invertible.")

        else:
            print("Collinearity = True:", f"X'*X has rank {r} < {k}; some columns are linearly dependent.")

#############################
# d. Linear Regression      #
#############################
    def check_all(self) -> bool:
        """Run all three data checks; return True if OLS can run"""
        #Run the checks, remembering any failure (each check resets the flag)
        self.data_check()
        failed = (self.cannot_run == 1)
        self.invertible()
        failed = failed or (self.cannot_run == 1)
        self.collinearity()
        self.cannot_run = int(failed)

        #Print a message and return False if there is a data issue.
        if self.cannot_run == 1:
            print("Data condition not ready for OLS.")
            return False

        #Return True if the data is OK.
        return True

    def run_regression(self) -> list:
        """Run Regression Function"""
        #Stop function from returning a value if there is an issue.
        if self.cannot_run == 1:
            return []

        #Build the design matrix from the array as given (intercept column + features).
        #No normalization is applied here, so coefficients are on the scale of arry_.
        y = self.target_
        ones = np.ones(shape = (len(self.arry_),1))
        self.X = np.hstack((ones, self.arry_))

        #Compute Coefficients: beta = (X'X)^-1 X'y
        beta = np.dot(np.linalg.inv(np.dot(self.X.T, self.X)), np.dot(self.X.T, y))

        #Return Coefficients as a flat list [intercept, b1, ..., bk]
        return beta.reshape(1,-1).tolist()[0]
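The regression above forms the inverse of X'X explicitly, which can be numerically fragile when the columns are nearly dependent. As a hedged alternative, the sketch below solves the same least-squares problem with `np.linalg.lstsq`. The function name `ols_lstsq` and the toy data are assumptions for illustration only.

```python
import numpy as np

def ols_lstsq(x, y):
    """Hypothetical alternative to the explicit inverse: least squares via lstsq."""
    X = np.hstack((np.ones((len(x), 1)), x))      # intercept column + features
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||X @ beta - y||
    return beta

rng = np.random.default_rng(0)
x = rng.random((200, 3))
y = 2.0 + x @ np.array([1.0, -3.0, 0.5])          # noiseless toy target
print(np.round(ols_lstsq(x, y), 6))               # intercept first, then slopes
```

On well-conditioned data both approaches should agree to within round-off; the difference shows up mainly when X'X is close to singular.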

## References
https://dev.to/akaame/implementing-simple-pca-using-numpy-3k0a

https://stackoverflow.com/questions/36771525/python-pca-projection-into-lower-dimensional-space

## Testing Data
Check the docstrings, the principal components, and the intermediate outputs.

In [None]:
#Check Comment Documentation
print(PCA_redux.get_PC.__doc__)

Principal Components Function that returns eigenvalues and eigenvectors


In [None]:
#Test Dimensions
test = PCA_redux(arry_ = test_dataset)
test.dim()

(2215, 10)

In [None]:
test.norm_x

array([[1.00000000e+00, 0.00000000e+00, 2.70083291e-04, ...,
        6.41146521e-02, 1.19270833e-02, 2.55272453e-01],
       [1.00000000e+00, 0.00000000e+00, 1.79390006e-03, ...,
        8.55034886e-02, 0.00000000e+00, 1.89083031e-01],
       [1.00000000e+00, 0.00000000e+00, 2.64462823e-03, ...,
        7.84650198e-02, 4.50125000e-02, 1.46262893e-01],
       ...,
       [1.00000000e+00, 1.00000000e+00, 3.12052183e-03, ...,
        6.42560815e-02, 2.74937500e-02, 1.63076343e-01],
       [1.00000000e+00, 1.00000000e+00, 4.84372160e-04, ...,
        3.03460305e-02, 1.72312500e-02, 1.61070032e-03],
       [1.00000000e+00, 1.00000000e+00, 2.58363727e-03, ...,
        3.84829342e-02, 2.05708333e-02, 6.43055621e-02]])

In [None]:
test.get_PC()

(array([1.01923240e-01, 4.73485509e-02, 6.67911152e-03, 2.52754911e-03,
        1.51733888e-03, 1.35248999e-03, 9.05279546e-04, 2.36605991e-04,
        1.23892818e-04, 6.67798096e-07, 0.00000000e+00]),
 array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00,  1.00000000e+00],
        [ 9.99393540e-01, -3.29359848e-02,  6.59913991e-03,
         -5.82190175e-03,  4.90650132e-03, -3.93448853e-03,
          3.11541941e-03, -2.07275183e-04, -1.01357571e-03,
          1.99691781e-05,  0.00000000e+00],
        [-4.02152845e-03, -4.17783705e-03,  6.84163594e-03,
         -1.79229398e-01,  6.68352193e-01,  1.19746934e-01,
         -3.35081052e-03,  6.72084917e-02, -2.29581229e-02,
         -7.08312870e-01,  0.00000000e+00],
        [-4.02150622e-03, -2.54473420e-03,  8.21526755e-03,
         -1.78868640e-01,  6.71421210e-01,  1.19845974e-01,
      

In [None]:
#Test Invertibility
test.invertible()

det(X'*X) = 0 | Not Invertible
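A zero determinant is expected here. `norm_x` carries a leading column of ones, and de-meaning turns that column into all zeros, so X'X of the de-meaned matrix is singular regardless of the other features. This is consistent with the eigenvalue of 0 in the `get_PC()` output above. A small sketch on toy data (random numbers, not the crime dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((100, 3))                        # toy stand-in for normalized features
with_ones = np.hstack((np.ones((100, 1)), features))   # same layout as norm_x

def gram_of_demeaned(x):
    """X'X after subtracting each column's mean."""
    d = x - x.mean(axis=0)
    return d.T @ d

rank_with_ones = np.linalg.matrix_rank(gram_of_demeaned(with_ones))
rank_features = np.linalg.matrix_rank(gram_of_demeaned(features))

# The 4x4 matrix built with the ones column is rank deficient;
# the 3x3 matrix built from the features alone has full rank.
print(rank_with_ones, rank_features)
```

Running the check on the normalized features without the ones column would say more about the data itself.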


In [None]:
#Test Normalization
test.normalize_x()

array([[0.00000000e+00, 2.70083291e-04, 1.63603896e-03, ...,
        6.41146521e-02, 1.19270833e-02, 2.55272453e-01],
       [0.00000000e+00, 1.79390006e-03, 3.15777370e-03, ...,
        8.55034886e-02, 0.00000000e+00, 1.89083031e-01],
       [0.00000000e+00, 2.64462823e-03, 4.00733951e-03, ...,
        7.84650198e-02, 4.50125000e-02, 1.46262893e-01],
       ...,
       [1.00000000e+00, 3.12052183e-03, 4.48258288e-03, ...,
        6.42560815e-02, 2.74937500e-02, 1.63076343e-01],
       [1.00000000e+00, 4.84372160e-04, 0.00000000e+00, ...,
        3.03460305e-02, 1.72312500e-02, 1.61070032e-03],
       [1.00000000e+00, 2.58363727e-03, 3.91447586e-03, ...,
        3.84829342e-02, 2.05708333e-02, 6.43055621e-02]])
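Min-max scaling divides by each column's range, so a constant column would produce a division by zero. A minimal sketch of one way to guard against that, assuming constant columns should simply map to 0; `minmax_safe` is a hypothetical name, not a method of `PCA_redux`:

```python
import numpy as np

def minmax_safe(x):
    """Hypothetical min-max scaling that tolerates constant columns."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Replace zero ranges by 1 so constant columns map to 0 instead of NaN.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (x - col_min) / safe_range

demo = np.array([[1.0, 5.0],
                 [3.0, 5.0],
                 [2.0, 5.0]])   # second column is constant
print(minmax_safe(demo))
```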

In [None]:
#Return Eigenvectors
test.eigvec

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.00000000e+00],
       [ 9.99393540e-01, -3.29359848e-02,  6.59913991e-03,
        -5.82190175e-03,  4.90650132e-03, -3.93448853e-03,
         3.11541941e-03, -2.07275183e-04, -1.01357571e-03,
         1.99691781e-05,  0.00000000e+00],
       [-4.02152845e-03, -4.17783705e-03,  6.84163594e-03,
        -1.79229398e-01,  6.68352193e-01,  1.19746934e-01,
        -3.35081052e-03,  6.72084917e-02, -2.29581229e-02,
        -7.08312870e-01,  0.00000000e+00],
       [-4.02150622e-03, -2.54473420e-03,  8.21526755e-03,
        -1.78868640e-01,  6.71421210e-01,  1.19845974e-01,
        -3.64366184e-03,  6.13621977e-02, -2.65029885e-02,
         7.05879146e-01,  0.00000000e+00],
       [ 1.86306716e-02,  5.15374788e-01,  1.62950512e-01,
         5.41090419e-01,  1.49056980e-01,  1.78040754e-01,
  

In [None]:
#Create Projection Matrix with 8 components
test.create_projection_matrix(n = 8)

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00],
       [ 9.99393540e-01, -3.29359848e-02,  6.59913991e-03,
        -5.82190175e-03,  4.90650132e-03, -3.93448853e-03,
         3.11541941e-03, -2.07275183e-04],
       [-4.02152845e-03, -4.17783705e-03,  6.84163594e-03,
        -1.79229398e-01,  6.68352193e-01,  1.19746934e-01,
        -3.35081052e-03,  6.72084917e-02],
       [-4.02150622e-03, -2.54473420e-03,  8.21526755e-03,
        -1.78868640e-01,  6.71421210e-01,  1.19845974e-01,
        -3.64366184e-03,  6.13621977e-02],
       [ 1.86306716e-02,  5.15374788e-01,  1.62950512e-01,
         5.41090419e-01,  1.49056980e-01,  1.78040754e-01,
        -1.39672881e-02, -4.98546773e-01],
       [ 1.84084545e-02,  4.98382216e-01,  1.45404633e-01,
         2.83099007e-01,  1.80034738e-02,  1.51177196e-01,
        -3.05579754e-02,  5.47448164e-01],
       [ 1.31868886e-02,  4.843845

In [None]:
test.create_projection_matrix(n = 5).shape

(11, 5)

In [None]:
np.dot(test.norm_x,test.create_projection_matrix(n = 8))

array([[ 3.39826724e-02,  1.00334281e+00, -2.48729600e-02, ...,
         6.45045915e-02, -1.91698404e-02, -1.38792990e-02],
       [ 2.12772374e-02,  6.25526683e-01, -4.49313279e-02, ...,
        -2.15063033e-03, -2.30302447e-02,  1.24513886e-02],
       [ 1.54336487e-02,  4.62618038e-01, -4.18287224e-02, ...,
        -2.51992388e-02,  2.72714548e-02,  1.19741477e-02],
       ...,
       [ 1.01160463e+00,  3.23639838e-01, -7.77461285e-02, ...,
        -2.92809228e-02,  1.67438317e-02,  1.86790331e-02],
       [ 1.00417391e+00,  1.17773831e-01,  4.19952941e-02, ...,
        -1.12060938e-02,  1.40001888e-02, -9.66400618e-03],
       [ 1.00777655e+00,  2.22880128e-01, -3.71207459e-04, ...,
        -1.44157952e-02,  1.45620873e-02, -2.89188121e-03]])

In [None]:
test.ret_ordered_features()

array([ 62.67774285,  91.79475568,  95.90207833,  97.4563958 ,
        98.38948403,  99.22119836,  99.77790043,  99.92340139,
        99.99958934, 100.        , 100.        ])

In [None]:
test.ret_compare_index(k = 3, thresh = 99)

5
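`ret_compare_index` returns the 0-based index of the first cumulative percentage above the threshold. If the intent is to retain at least `thresh` percent of the variance, that index is one short of the number of components needed. The sketch below reuses the cumulative percentages printed by `ret_ordered_features()` above; the variable names are illustrative.

```python
import numpy as np

# Cumulative explained variance (%) as printed by ret_ordered_features() above.
cum_pct = np.array([62.67774285, 91.79475568, 95.90207833, 97.4563958,
                    98.38948403, 99.22119836, 99.77790043, 99.92340139,
                    99.99958934, 100.0, 100.0])

thresh = 99
first_index_reaching = int(np.argmax(cum_pct >= thresh))  # 0-based position
components_needed = first_index_reaching + 1              # count of components

print(first_index_reaching, components_needed)
print(cum_pct[components_needed - 2], cum_pct[components_needed - 1])
```

With these numbers, five components retain about 98.39% and the sixth is needed to pass 99%.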

In [None]:
test.run_pca(sel_features = 5, capture_pct = 80)

array([[ 3.39826724e-02,  1.00334281e+00, -2.48729600e-02, ...,
         2.15417069e-02,  6.45045915e-02, -1.91698404e-02],
       [ 2.12772374e-02,  6.25526683e-01, -4.49313279e-02, ...,
         2.58352512e-02, -2.15063033e-03, -2.30302447e-02],
       [ 1.54336487e-02,  4.62618038e-01, -4.18287224e-02, ...,
         1.99296504e-02, -2.51992388e-02,  2.72714548e-02],
       ...,
       [ 1.01160463e+00,  3.23639838e-01, -7.77461285e-02, ...,
         2.16015605e-02, -2.92809228e-02,  1.67438317e-02],
       [ 1.00417391e+00,  1.17773831e-01,  4.19952941e-02, ...,
         1.52062108e-02, -1.12060938e-02,  1.40001888e-02],
       [ 1.00777655e+00,  2.22880128e-01, -3.71207459e-04, ...,
         1.46059707e-02, -1.44157952e-02,  1.45620873e-02]])

In [None]:
test.run_pca_report(sel_features = 5, capture_pct = 80)

**************************************************
Original Dataset Dimensions: (2215, 10)
det(X'*X) = 0 | Not Invertible
**************************************************
5 features selected. New Matrix= (2215, 5).
**************************************************


array([[ 3.39826724e-02,  1.00334281e+00, -2.48729600e-02,
         9.02672919e-02,  2.15417069e-02],
       [ 2.12772374e-02,  6.25526683e-01, -4.49313279e-02,
         6.82143134e-02,  2.58352512e-02],
       [ 1.54336487e-02,  4.62618038e-01, -4.18287224e-02,
         2.94578287e-02,  1.99296504e-02],
       ...,
       [ 1.01160463e+00,  3.23639838e-01, -7.77461285e-02,
         1.10774108e-02,  2.16015605e-02],
       [ 1.00417391e+00,  1.17773831e-01,  4.19952941e-02,
         1.90994772e-02,  1.52062108e-02],
       [ 1.00777655e+00,  2.22880128e-01, -3.71207459e-04,
        -4.02539454e-03,  1.46059707e-02]])

## Comparing Functions
Fit both models and compare their losses to see which performs better.

In [None]:
#Run linear regression for the regular dataset
reg_test_obj = linear_reg(arry_ = test_dataset, target_ = test_target)
lin_reg = reg_test_obj.run_regression()
lin_reg

[1.7214490906949278,
 0.012209648076662555,
 -6.93306674019419e-05,
 7.086380040075895e-05,
 -5.546289411882961e-05,
 2.3360195429504007e-07,
 -0.0004182536975333329,
 0.000479304132855316,
 -4.029119613292698e-06,
 3.7994084500826437e-06,
 -3.334642159754129e-06]

In [None]:
#Run linear regression for PCA
test = PCA_redux(arry_ = test_dataset)
pca_arry = test.run_pca(sel_features = 10, capture_pct = 100)

pca_lin_reg = linear_reg(arry_ = pca_arry, target_ = test_target)
pca_lin_reg.run_regression()

[0.9830694327747489,
 -0.05532836197509078,
 -1.1189878149046923,
 0.489162714727733,
 -13.972557820565582,
 9.764449280107048,
 0.7131380225124877,
 2.272209039404468,
 -24.892760803319533,
 27.69931742205023,
 725.3744768461158]

In [None]:
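#PCA slopes without the intercept (y_pred_pca is assigned in the full comparison cell further down)
y_pred_pca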

[-0.05532836197515211,
 -1.1189878149046706,
 0.48916271472772427,
 -13.972557820566145,
 9.764449280106499]

In [None]:
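#Raw-data predictions (y_pred_reg and reg_constant are assigned in the full comparison cell further down)
np.dot(test_dataset, y_pred_reg) + reg_constant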

array([-0.45521719,  0.23505594,  0.88323892, ...,  1.33124311,
        1.00264547,  2.02040253])

## Full function for comparison

In [None]:
#Run linear regression for PCA
pca_start = PCA_redux(arry_ = test_dataset)
pca_arry = pca_start.run_pca(sel_features = 5, capture_pct = 50)

pca_lin_reg = linear_reg(arry_ = pca_arry, target_ = test_target)

#Fit each model once and split the coefficients into intercept and slopes
reg_coefs = reg_test_obj.run_regression()
reg_constant, y_pred_reg = reg_coefs[0], reg_coefs[1:]

pca_coefs = pca_lin_reg.run_regression()
pca_constant, y_pred_pca = pca_coefs[0], pca_coefs[1:]

#Loss used to compare the two models.
#NOTE: this squares the summed residuals and adds the intercept to the residual,
#so it is not the usual MSE, which would be np.mean((y - (X @ b + c)) ** 2).
lin_loss = round((1/len(test_dataset))* (np.sum((test_target - (np.dot(test_dataset, y_pred_reg)) + reg_constant)) ** 2),10)
pca_loss = round((1/len(test_dataset))* (np.sum((test_target- (np.dot(pca_arry, y_pred_pca)) + pca_constant)) ** 2),10)

#Returns whichever model performed better here.
if lin_loss < pca_loss:
    print(f"Linear ({lin_loss}) is better than PCA ({pca_loss})")
elif lin_loss == pca_loss:
    print(f"Performance of Linear ({lin_loss}) == PCA ({pca_loss})")
else:
    print(f"PCA ({pca_loss}) is better than Linear ({lin_loss})")

(pca_loss - lin_loss) / pca_loss

PCA (21722.0025691491) is better than Linear (26255.6085706306)


-0.208710315130908
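The loss in the cell above sums the residuals before squaring and adds the intercept to the residual instead of to the prediction. A conventional mean squared error squares each residual first and then averages. The sketch below shows that form on made-up numbers; the helper name `mse` is an assumption. Swapping it into the comparison would change the reported values, which are not recomputed here.

```python
import numpy as np

def mse(y_true, features, coefs, intercept):
    """Mean squared error of the linear prediction features @ coefs + intercept."""
    y_hat = np.dot(features, coefs) + intercept
    return float(np.mean((np.asarray(y_true) - y_hat) ** 2))

# Toy check (made-up numbers): predictions are 1, 3, 5, so the residuals are 0, 0, 1.
x_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 6.0])
print(mse(y_demo, x_demo, [2.0], 1.0))   # one third
```

In the notebook this would be called as, for example, `mse(test_target, test_dataset, y_pred_reg, reg_constant)` and `mse(test_target, pca_arry, y_pred_pca, pca_constant)`.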

In [None]:
(1/len(test_dataset)* (np.sum((test_target - (np.dot(test_dataset, y_pred_reg)) + reg_constant)) ** 2),10)

(26255.608570630644, 10)

In [None]:
pca_lin_reg.run_regression()[0]

1.5657883931837708

In [None]:
y_pred_pca = pca_lin_reg.run_regression()[1::]
constant = pca_lin_reg.run_regression()[0]

In [None]:
np.dot(test_dataset, reg_test_obj.run_regression()[1:])

array([-2.17666628, -1.48639315, -0.83821017, ..., -0.39020598,
       -0.71880362,  0.29895344])

In [None]:
np.dot(pca_arry, pca_lin_reg.run_regression()[1:]) + pca_lin_reg.run_regression()[0]

array([-0.62190916,  0.14181425,  0.80981032, ...,  1.165786  ,
        1.28059587,  1.45931236])

## Test Minmax Scaler and PCA
scikit-learn is used here only to cross-check the NumPy implementation above.

In [None]:
import sklearn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(test_dataset)
scaler.transform(test_dataset)

array([[0.00000000e+00, 2.70083291e-04, 1.63603896e-03, ...,
        6.41146521e-02, 1.19270833e-02, 2.55272453e-01],
       [0.00000000e+00, 1.79390006e-03, 3.15777370e-03, ...,
        8.55034886e-02, 0.00000000e+00, 1.89083031e-01],
       [0.00000000e+00, 2.64462823e-03, 4.00733951e-03, ...,
        7.84650198e-02, 4.50125000e-02, 1.46262893e-01],
       ...,
       [1.00000000e+00, 3.12052183e-03, 4.48258288e-03, ...,
        6.42560815e-02, 2.74937500e-02, 1.63076343e-01],
       [1.00000000e+00, 4.84372160e-04, 0.00000000e+00, ...,
        3.03460305e-02, 1.72312500e-02, 1.61070032e-03],
       [1.00000000e+00, 2.58363727e-03, 3.91447586e-03, ...,
        3.84829342e-02, 2.05708333e-02, 6.43055621e-02]])

In [None]:
import numpy as np
from sklearn.decomposition import PCA
X = test_dataset
pca = PCA(n_components=8)
pca.fit(X)

PCA(n_components=10)

In [None]:
from sklearn.preprocessing import minmax_scale
x_std = MinMaxScaler().fit_transform(test_dataset)
pca = PCA(n_components=8)
pca.fit_transform(x_std)

array([[-0.4793861 ,  0.59420054, -0.00963761, ..., -0.06990237,
        -0.03122167,  0.01767663],
       [-0.49209153,  0.21638442,  0.01042076, ..., -0.00324714,
        -0.03508208, -0.00865405],
       [-0.49793512,  0.05347578,  0.00731815, ...,  0.01980147,
         0.01521962, -0.00817681],
       ...,
       [ 0.49823586, -0.08550242,  0.04323556, ...,  0.02388315,
         0.004692  , -0.0148817 ],
       [ 0.49080514, -0.29136843, -0.07650586, ...,  0.00580832,
         0.00194836,  0.01346134],
       [ 0.49440778, -0.18626213, -0.03413936, ...,  0.00901802,
         0.00251026,  0.00668921]])
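These scores differ from the `run_pca` output by a constant offset per column, and some columns also have the opposite sign. The offset arises because scikit-learn's `PCA` subtracts the column means before projecting, while `project_data` projects the un-centered normalized data. The sign of each component is arbitrary. A minimal NumPy sketch on toy data illustrates the offset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((500, 4))                        # toy stand-in for the scaled features

centered = x - x.mean(axis=0)
cov = centered.T @ centered / len(x)
eigval, eigvec = np.linalg.eigh(cov)
w = eigvec[:, np.argsort(eigval)[::-1][:2]]     # top-2 principal directions

uncentered_scores = x @ w                       # projection without centering
centered_scores = centered @ w                  # projection after centering

offset = uncentered_scores - centered_scores    # the same row vector in every row
print(np.allclose(offset, x.mean(axis=0) @ w))  # True
```

Centering before projection in `project_data` should therefore bring the two outputs into agreement up to sign.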

In [None]:
#Min-max scaled data from scikit-learn, for comparison with test.normalize_x()
x_std

array([[0.00000000e+00, 2.70083291e-04, 1.63603896e-03, ...,
        6.41146521e-02, 1.19270833e-02, 2.55272453e-01],
       [0.00000000e+00, 1.79390006e-03, 3.15777370e-03, ...,
        8.55034886e-02, 0.00000000e+00, 1.89083031e-01],
       [0.00000000e+00, 2.64462823e-03, 4.00733951e-03, ...,
        7.84650198e-02, 4.50125000e-02, 1.46262893e-01],
       ...,
       [1.00000000e+00, 3.12052183e-03, 4.48258288e-03, ...,
        6.42560815e-02, 2.74937500e-02, 1.63076343e-01],
       [1.00000000e+00, 4.84372160e-04, 0.00000000e+00, ...,
        3.03460305e-02, 1.72312500e-02, 1.61070032e-03],
       [1.00000000e+00, 2.58363727e-03, 3.91447586e-03, ...,
        3.84829342e-02, 2.05708333e-02, 6.43055621e-02]])

In [None]:
test.normalize_x()

array([[0.00000000e+00, 2.70083291e-04, 1.63603896e-03, ...,
        6.41146521e-02, 1.19270833e-02, 2.55272453e-01],
       [0.00000000e+00, 1.79390006e-03, 3.15777370e-03, ...,
        8.55034886e-02, 0.00000000e+00, 1.89083031e-01],
       [0.00000000e+00, 2.64462823e-03, 4.00733951e-03, ...,
        7.84650198e-02, 4.50125000e-02, 1.46262893e-01],
       ...,
       [1.00000000e+00, 3.12052183e-03, 4.48258288e-03, ...,
        6.42560815e-02, 2.74937500e-02, 1.63076343e-01],
       [1.00000000e+00, 4.84372160e-04, 0.00000000e+00, ...,
        3.03460305e-02, 1.72312500e-02, 1.61070032e-03],
       [1.00000000e+00, 2.58363727e-03, 3.91447586e-03, ...,
        3.84829342e-02, 2.05708333e-02, 6.43055621e-02]])

In [None]:
x_std == test.normalize_x()

array([[ True, False,  True, ..., False, False,  True],
       [ True, False, False, ..., False,  True,  True],
       [ True, False,  True, ..., False, False,  True],
       ...,
       [ True,  True, False, ...,  True,  True,  True],
       [ True,  True,  True, ..., False,  True,  True],
       [ True,  True, False, ..., False,  True,  True]])
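The many `False` entries do not mean the two normalizations disagree. Exact `==` on floats is sensitive to the order of arithmetic operations, and the two implementations likely compute the same quantity in a different order. Comparing within a tolerance is the usual approach, as in this small sketch with made-up values:

```python
import numpy as np

a = np.array([0.1 + 0.2, 1.0 / 3.0])
b = np.array([0.3, 1.0 - 2.0 / 3.0])

print(a == b)             # exact elementwise comparison
print(np.allclose(a, b))  # comparison within a tolerance
```

In the notebook, the equivalent check would be `np.allclose(x_std, test.normalize_x())`.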

In [None]:
comp = pca.components_
comp

array([[ 9.99393540e-01, -4.02152845e-03, -4.02150622e-03,
         1.86306716e-02,  1.84084545e-02,  1.31868886e-02,
         1.08580356e-02, -3.70190752e-04, -2.26010688e-03,
         1.40427332e-02],
       [-3.29359848e-02, -4.17783705e-03, -2.54473420e-03,
         5.15374788e-01,  4.98382216e-01,  4.84384596e-01,
         4.35330198e-01,  1.03409079e-01,  3.21553498e-02,
         2.21420558e-01],
       [-6.59913991e-03, -6.84163594e-03, -8.21526755e-03,
        -1.62950512e-01, -1.45404633e-01, -5.83186781e-02,
        -6.00470196e-02, -4.15791124e-02, -5.93718158e-03,
         9.71275445e-01],
       [ 5.82190175e-03,  1.79229398e-01,  1.78868640e-01,
        -5.41090419e-01, -2.83099007e-01,  4.03937623e-01,
         6.01535162e-01, -1.76547832e-01, -2.95288208e-02,
        -7.66407627e-02],
       [ 4.90650132e-03,  6.68352193e-01,  6.71421210e-01,
         1.49056980e-01,  1.80034738e-02, -1.56608698e-01,
        -8.21280975e-02,  2.16922999e-01,  2.26369465e-02,
         3.