<h1>Deep Fake Recognition Project</h1>
<h2>Course: DSIP</h2>
<h2>Classification Model: Logistic Regression With Gradient Descent</h2>
<h3>Authors: Roman Rezvin, Niv Rave</h3>

<h2>Readme</h2>
The following project is our implementation of a Logistic Regression classifier trained with Gradient Descent.</br>
The dataset contains 1000 images sized 1024x1024, divided into 500 'Real' and 500 'Fake' images</br>
and labeled accordingly. The goal of the project is to classify new images as real or fake using</br>
the parameters learned during training.</br>
The model was built from scratch and the code contains the implementation of all the calculations used,</br>
without any external ML/AI libraries. We used OpenCV to process the dataset images, Pandas and NumPy for</br>
convenient storage and handling of data structures, seaborn and matplotlib.pyplot for the plots</br>
and os for operating-system-related actions.</br>
The notebook is divided into sections. Each section has a header that describes the following cells and their responsibility,</br>
and inline comments that describe the finer-grained responsibilities and implementation details.</br>
The first few cells contain different utilities for plotting and data handling.</br>
The following cells contain the Logistic Regression model's different functions and utilities, including</br>
algorithms and optimization functions.</br>
The last cells contain the concrete model's implementation and import/export functions that speed up and simplify</br>
handling the model.</br>
</br>
<b>Regarding the model itself and the ways to initialize the model, we suggest 4 options:</b></br>
1. Complete learning using a pre-defined learning rate and number of iterations (efficient values that we found to work best).</br>
2. Complete learning with an algorithm that finds the optimal learning rate and number of iterations.</br>
3. Initialization through import - importing our existing model, attached in the 'model.npy' file.</br>
4. Initialization of an empty model.</br>
</br>
Each initialization/creation method affects the time it takes to initialize the model, the amount of resources</br>
the project uses in the process and the accuracy.</br>
</br>
The data is tagged as 'Real'=0, 'Fake'=1 and located inside matching folders in a './data' folder in the root of the</br>
notebook (get_data_path() reads './data/train/Real' and './data/train/Fake'; see the matching functions' documentation for more information).</br>
</br>
The calculated coefficients are found at model.w_vector for the coefficients of every iteration (ordered by iteration)</br>
or at model.wm for the last iteration's coefficients, the ones the model converged to. Use the import/export</br>
functions to import/export the coefficients for later usage.</br>
</br>
To run a test on new data, place the data in a './data/FutureData' folder in the root of the notebook (see the matching</br> functions' documentation for more information).</br>
After placing the data, call the model.test() method to view the classification results.</br>
The results will be written to a 'FutureDataEstimatedLabels.csv' file in the root of the notebook (see the matching</br> functions' documentation for more information).</br>
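As a minimal sketch of how a coefficients vector turns into a 'Real'/'Fake' label, the snippet below applies the logistic sigmoid and a threshold to two feature rows. The helper name `classify`, the `LABELS` mapping, the weights, the feature rows and the 0.5 threshold are all assumptions made up for illustration; they are not the trained model's values.

```python
import numpy as np

# Label convention from this Readme: 'Real' = 0, 'Fake' = 1
LABELS = {0: 'Real', 1: 'Fake'}

def classify(features, weights, threshold=0.5):
    """Hypothetical helper: map feature rows to 'Real'/'Fake' labels."""
    # Logistic sigmoid of the linear score w . x for every row
    probabilities = 1.0 / (1.0 + np.exp(-(features @ weights)))
    # Rows whose probability reaches the threshold are tagged 'Fake' (1)
    binary = np.where(probabilities < threshold, 0, 1)
    return [LABELS[int(b)] for b in binary]

# Made-up weights and two made-up feature rows, for illustration only
weights = np.array([2.0, -1.0])
features = np.array([[3.0, 1.0],    # score  5 -> probability close to 1
                     [-2.0, 1.0]])  # score -5 -> probability close to 0
print(classify(features, weights))
```

In the notebook itself the threshold is not fixed at 0.5; it is tuned on the train data by get_optimal_threshold().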

<h2>Imports:</h2>

In [1]:
import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt
import cv2
import seaborn as sn
import os #import os library for os location/path related methods and operations
import warnings
import pandas as pd

<h2>Plot utilities</h2>

In [2]:
## This function plots a given confusion matrix using the seaborn.heatmap() function.
## Parameters:
## confusion_matrix - a confusion matrix laid out as returned by create_confusion_matrix():
##                    rows are predicted values, columns are labeled values, positives first
## color - a string, sets the tiles' colormap. Use 'Blues' or 'Reds' if unsure
## data_type - a string, the data type being represented by the given matrix, used for the title.
##             'Train','Test','Validation', etc.
def plot_confusion_matrix(confusion_matrix,color,data_type):
    ax = sn.heatmap(confusion_matrix, annot=True, cmap = color, cbar=False, fmt='g')
    ax.set_title(f'{data_type} Data Confusion Matrix')
    # heatmap draws the matrix columns along x and the rows along y
    ax.set_xlabel('Labeled Values')
    ax.set_ylabel('Predicted Values')
    ax.xaxis.set_ticklabels(['True','False'])
    ax.yaxis.set_ticklabels(['True','False'])
    plt.savefig(f'{data_type} Data Confusion Matrix.png')
    plt.show()
    
## Plots the ROC curve from an array whose columns 0 and 1 hold the false and true positive rates (see get_confuse_mat_array)
def plot_ROC(confuse_mat_array, data_type):
    plt.plot(confuse_mat_array[:,0], confuse_mat_array[:,1], ',')
    plt.title(f'{data_type} data ROC')
    plt.xlabel('False positive Rate')
    plt.ylabel('True positive Rate')
    plt.savefig(f'{data_type} Data ROC.png')
    plt.show()
    
## Plots azimuthal average of FT magnitudes of an image
def plot_power_spectrum(power_spectrum):
    radius_array = np.arange(0,len(np.squeeze(power_spectrum)),1)
    plt.plot(radius_array, power_spectrum, linewidth=2)
    plt.title('Power Spectrum of an Image')
    plt.xlabel('Spatial Frequency')
    plt.ylabel('Power spectrum')
    plt.show()

## This function optionally plots the cross-entropy against the learning rate for several iteration counts
## and returns the optimal learning rate and number of iterations
def check_learning_rate_array(learning_rate_arr, power, x_train, y_train, if_plot):
    with warnings.catch_warnings():
        # used to hide several overflow warnings
        warnings.filterwarnings('ignore')
        
        # "power" is a power of ten, which is the highest number of iterations
        w_arr_iterations = 10**power
        
        # stores cross-entropy values for the plots and for finding the optimal learning rate and iterations number;
        # row j corresponds to 10**(j+1) iterations, column i to learning_rate_arr[i]
        cross_entropy_mat = np.zeros((power, len(learning_rate_arr)))
        
        for i in range(len(learning_rate_arr)):
            # finds weights with the given learning rate and iterations number
            w_learn = get_coefficients(x_train, y_train, w_arr_iterations, learning_rate_arr[i])
            
            # stores the cross-entropy reached after 10, 100, ..., 10**power iterations
            for j in range(0, power):
                cross_entropy_mat[j][i] = cross_entropy(x_train, y_train, w_learn[10**(j+1) -1, :])
                
        if if_plot:
            # the loop plots one curve per iteration count and builds the legend
            legend_string = []
            for j in range(0, power):
                plt.semilogx(learning_rate_arr, cross_entropy_mat[j, :])
                legend_string.append(str(int(10**(j+1))) + " iterations")

            # plot adjustments
            plt.legend(legend_string)
            plt.title('Cross-Entropy vs Learning Rate')
            plt.xlabel('Learning Rate')
            plt.ylabel('Cross-Entropy')
            # save before show(); afterwards the figure may already be closed and the file would be blank
            plt.savefig('1e'+str(power)+'.png')
            plt.show()
        
        # finds the (row, column) index of the lowest cross-entropy value
        idx = np.unravel_index(np.argmin(cross_entropy_mat, axis=None), cross_entropy_mat.shape)
        
        # optimal iterations number (row j was computed after 10**(j+1) iterations)
        optimal_iterations_number = 10**(idx[0]+1)
        
        # optimal learning rate
        optimal_learning_rate = learning_rate_arr[idx[1]]
        
        return optimal_learning_rate, optimal_iterations_number
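
The selection step of check_learning_rate_array can be illustrated on its own. This is a small sketch with made-up cross-entropy values; it assumes, as the indexing `w_learn[10**(j+1) -1, :]` implies, that row j of the matrix belongs to 10**(j+1) iterations.

```python
import numpy as np

# Made-up cross-entropy values: row j holds the loss after 10**(j+1) iterations,
# column i holds the loss for learning_rates[i]
learning_rates = np.array([1e-3, 1e-2, 1e-1])
cross_entropy_mat = np.array([[9.0, 7.0, 8.0],
                              [6.0, 2.5, 4.0],
                              [5.0, 3.0, 3.5]])

# argmin searches the flattened matrix; unravel_index converts the flat
# position back into a (row, column) pair
row, col = np.unravel_index(np.argmin(cross_entropy_mat), cross_entropy_mat.shape)

best_iterations = 10 ** (row + 1)
best_learning_rate = learning_rates[col]
print(best_iterations, best_learning_rate)
```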

<h2>Dataset handling and processing</h2>

<h3>Classes:</h3>

In [3]:
## DataSet class
## A DataSet object is initialized with a path to a folder containing the dataset's content.
## It calls get_images(), which iterates through the folder and uses OpenCV's imread()
## to load each image file as grayscale, then converts the loaded images to a complex-valued
## NumPy array and stores it in the object's .images attribute.
class DataSet:
    def __init__(self, path):
        self.images = self.get_images(path)
        
    def get_images(self, path):
        images = []
        for file in os.listdir(path):
            filename = os.path.join(path, os.fsdecode(file))
            # load as a single-channel grayscale image (same as the flag value 0)
            images.append(cv2.imread(filename, cv2.IMREAD_GRAYSCALE))
        # complex dtype so transform_matrix() can store the DFT in place;
        # np.complex128 replaces the 'complex_' alias, which was removed in NumPy 2.0
        return np.array(images, dtype = np.complex128)

<h3>Functions:</h3>

In [4]:
## Return the .images property of any given DataSet object
def get_data(data_object):
    return data_object.images

## Given a DataSet object, the function takes its (already complex-valued) .images array,
## iterates through each image matrix in the array, replaces it in place with the result
## of transform_image() and returns the array
def transform_matrix(data):
    transformed_data_array = get_data(data)
    for i in np.ndindex(transformed_data_array.shape[0]):
        transformed_data_array[i] = transform_image(transformed_data_array[i])
    return transformed_data_array

## This function applies np.fft.fft2 and np.fft.fftshift to an image and adds 1e-8 to each calculated value
## to avoid math errors later on (such as log(0)). It returns the shifted DFT of the image.
def transform_image(image):
    transformed_image = np.fft.fft2(image)
    transformed_image = np.fft.fftshift(transformed_image)
    transformed_image = transformed_image + 1e-8
    return transformed_image

## Returns the magnitude of an image's DFT as 20*ln(|F|) (np.log is the natural logarithm), rounded and cast to uint8
def fft_magnitude_20log_e(image):
    return np.round(20*np.log(np.abs(image))).astype(np.uint8)

## Returns the normalized data: (data - mean)/std, computed over the whole given array (here, one image's azimuthal average).
## 1e-8 is added to the result to prevent calculation errors
def normalize_data(data):
    return ((data-data.mean())/data.std()) + 1e-8

## Returns the given type's dataset folder path (type = Real or Fake)
def get_data_path(type):
    return os.getcwd() + '/data/train/' + type
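
To see what the fft2 + fftshift pair in transform_image does, here is a small sketch on a made-up constant 8x8 array rather than a real dataset image. For a constant input, all of the energy sits in the zero-frequency (DC) term, which makes the effect of the shift easy to check.

```python
import numpy as np

# A constant 8x8 "image": all of its energy sits in the zero-frequency (DC) term
image = np.ones((8, 8))

spectrum = np.fft.fft2(image)        # DC term is at index [0, 0]
shifted = np.fft.fftshift(spectrum)  # DC term moves to the centre, index [4, 4]

magnitude = np.abs(shifted)
peak = np.unravel_index(np.argmax(magnitude), magnitude.shape)
print(peak, magnitude[4, 4])
```

With the DC term in the centre, distance from the centre corresponds to spatial frequency, which is what the azimuthal average relies on.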

<h2>Utilities:</h2>

In [5]:
## The main data utility function.
## The function gets the type of the data ('Real' or 'Fake'). We keep the data in a separate folder for each data type,
## .../Real and .../Fake for real and fake images; change the get_data_path function to match your own dataset location
## and the way you divide it. The function loads the image files from that folder by creating a DataSet object
## from the path returned by get_data_path for the desired type. Then the data is sent to transform_matrix to perform
## the Fourier transform related steps and the returned value is saved in the data variable (and exported to a .npy file).
## The function calculates the azimuthal average of each image by calling the get_azimuthal_average function with the
## fft_magnitude_20log_e returned value of each data matrix, normalizes it and stores the result in the azimuth_average array.
## The function returns the azimuth_average array for later usage.
def get_data_azimuthal_average(type):
    data = transform_matrix(DataSet(get_data_path(type)))

    if (type == 'Real'):
        export_real_data(data)
    elif (type == 'Fake'):
        export_fake_data(data)

    azimuth_average = np.zeros((data.shape[0], np.round(data.shape[1]*np.sqrt(2)/2).astype(int) - 2))
    for i in range(0, data.shape[0]):
        azimuth_average[i,:]  = normalize_data(get_azimuthal_average(fft_magnitude_20log_e(data[i,:,:])))
    data=0
    return azimuth_average

## Returns the azimuthal average for each radius.
## The function masks the matrix by growing circles to get only the
## values on the circle with the current radius for each radius in the matrix.
def get_azimuthal_average(magnitudes):
    x_center = y_center = int(magnitudes.shape[1]/2)
    rows = magnitudes.shape[0]
    cols= magnitudes.shape[1]
    [X, Y] = np.meshgrid(np.arange(cols) - x_center, np.arange(rows) - y_center)
    radius = np.sqrt(np.square(X) + np.square(Y))
    radius_array = np.arange(1, np.max(radius), 1)
    intensity = np.zeros(len(radius_array))
    index = 0
    bin_size = 1
    for i in radius_array:
        mask = (np.greater(radius, i - bin_size) & np.less(radius, i + bin_size))
        values = magnitudes[mask]
        intensity[index] = np.mean(values)
        index += 1
    return intensity[1:-1]

## This function divides the train dataset to train and validation dataset.
## It takes the validation data % of the entire train dataset as an argument.
## It returns the train and validation x and y sequences.
def divide_train_validation_data(validation_percent, real_azimuth_avg, fake_azimuth_avg):
    offset = int(real_azimuth_avg.shape[0]*(100-validation_percent)/100)
    x_train = np.concatenate((real_azimuth_avg[0:offset, :], fake_azimuth_avg[0:offset, :]))
    x_validation = np.concatenate((real_azimuth_avg[offset:, :], fake_azimuth_avg[offset:, :]))
    y_train = np.concatenate((np.zeros((1, int(x_train.shape[0]/2)), dtype = int), np.ones((1, int(x_train.shape[0]/2)), dtype = int)),axis=1)
    y_validation = np.concatenate((np.zeros((1, int(x_validation.shape[0]/2)), dtype = int), np.ones((1, int(x_validation.shape[0]/2)), dtype = int)),axis=1)
    return np.squeeze(x_train), np.squeeze(y_train), np.squeeze(x_validation), np.squeeze(y_validation)
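
For comparison with get_azimuthal_average, here is a sketch of a radial mean computed with np.bincount. It is a different technique from the one used in this notebook: it uses non-overlapping rings of integer radius instead of the overlapping masks above, so its numbers will not match exactly. The function name `radial_mean` and the constant test array are assumptions for illustration.

```python
import numpy as np

def radial_mean(magnitudes):
    """Mean of `magnitudes` over rings of integer radius around the centre."""
    rows, cols = magnitudes.shape
    y, x = np.indices((rows, cols))
    # Distance of every pixel from the centre, truncated to an integer ring index
    ring = np.sqrt((x - cols // 2) ** 2 + (y - rows // 2) ** 2).astype(int)
    # Sum of the values in each ring divided by the number of pixels in that ring
    sums = np.bincount(ring.ravel(), weights=magnitudes.ravel())
    counts = np.bincount(ring.ravel())
    return sums / counts

# A constant array must give the same mean on every ring
flat = np.full((16, 16), 3.0)
profile = radial_mean(flat)
print(profile.min(), profile.max())
```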



<h2>Logistic Regression Functions</h2>

In [6]:
## This is the main Logistic Regression model function - the Gradient Descent.
## Starting from the initial coefficients vector w0, it performs M-1 update steps (M is received as input) and updates
## the coefficients vector on each step using a learning rate 'a' (received as input) and the gradient, which is the
## matrix multiplication between (logistic sigmoid - y) and the X matrix. Both X and y are the used dataset's
## (xn,yn) data.
## The returned value is a matrix with M rows holding the coefficients vector of every step;
## the last row is the optimized coefficients vector.
def gradient_descent_vector(M, w0, a, X, y):
    wm = np.zeros((M, len(w0)))
    wm[0,:] = w0
    for i in range(1,M):
        wm[i,:] = wm[i-1,:] - a*((sigma(wm[i-1,:], X) - y) @ X)
    return np.squeeze(wm)

## The main coefficients (weight vector, w) utility function.
## It gets the x matrix, y vector, the number of iterations 'M' and the learning rate 'a', sets
## the initial weight vector (w0) to zeros and then calculates and returns the weight vectors
## of all iterations by calling the Gradient Descent algorithm implemented in the gradient_descent_vector function.
def get_coefficients(x, y, M, a):
    w0 = np.zeros(x.shape[1])
    wm = gradient_descent_vector(M, w0, a, x, y)
    return wm
    
## This function calculates and returns the initial coefficients vector values (w0)
## as the least-squares solution (X^T X)^-1 X^T t
def calculate_coeffs(X,t):
    # If X is a single column (1-D array), the result is a scalar
    if X.ndim == 1:
        return ((X.T @ X)**-1)*X.T @ t.T
    # Otherwise, returns a vector
    return np.squeeze(np.linalg.inv(X.T @ X) @ X.T @ t.T)

## This function calculates and returns the value of the logistic sigmoid of the given data and weights
def sigma(w,X):
    with warnings.catch_warnings():
        # used to hide several overflow warnings
        warnings.filterwarnings('ignore')
        return np.squeeze((1/(1+np.exp(-(w @ X.T)))))

## The binary prediction function, accepts a probability array of the data and a specific threshold.
## It returns an array that holds 1 wherever the probability is >= threshold and 0 otherwise
def binary_prediction(probability_array,threshold):
    return np.squeeze(np.where(probability_array.T < threshold, 0, 1))

## This function creates a confusion matrix.
## It accepts the x matrix and y vector of the data, the weight vector w and a specific threshold.
## Rows are predicted values and columns are labeled values. The matrix is returned as:
## TP=[0,0]
## FP=[0,1]
## TN=[1,1]
## FN=[1,0]
def create_confusion_matrix(x, y, w, threshold):
    prob_array = sigma(w,x)
    predicted = binary_prediction(prob_array,threshold)
    confusion_matrix = np.zeros((2,2), dtype=int)
    
    for idx in range(len(predicted)):
        i=predicted[idx].astype(int)
        j=y[idx].astype(int)
        confusion_matrix[i,j]+=1
    confusion_matrix = np.flip(confusion_matrix,0)
    confusion_matrix = np.flip(confusion_matrix,1)
    return confusion_matrix

## This function calculates the model accuracy by comparing the predicted values to the labeled values.
def accuracy(predicted, labled):
    return np.squeeze(np.mean(labled == predicted))

## This function calculates the model's cross entropy value for the given data, labels and weights
def cross_entropy(xn, yn, w):
    return np.squeeze(-np.sum(yn*np.log(sigma(w, xn) + 1e-8) + (1-yn)*np.log(1 - sigma(w, xn) + 1e-8)))

## This function calculates and returns the FP, TP rates and accuracy of each probability value in the
## probability array when used as the threshold. It is used to find the optimal threshold for the classification model.
def get_confuse_mat_array(prob_arr, x_train, y_train, wm):
    threshold_array = prob_arr
    confuse_mat_array=np.zeros((len(threshold_array),3))
    for i in range(len(threshold_array)):
        tmp_mat = create_confusion_matrix(x_train, y_train,
                                            wm, threshold_array[i])
        tp=tmp_mat[0,0]
        tn=tmp_mat[1,1]
        fn=tmp_mat[1,0]
        fp=tmp_mat[0,1]
        tp_rate = tp/(tp+fn)
        fp_rate = fp/(fp+tn)
        accuracy = (tp+tn)/(tp+tn+fn+fp)
        confuse_mat_array[i][0]=fp_rate
        confuse_mat_array[i][1]=tp_rate
        confuse_mat_array[i][2]=accuracy
    return confuse_mat_array

## This function returns the value in the probability array that gives the best accuracy when used as the threshold.
## confuse_mat_array holds, for each probability used as the threshold, the accuracy in column 2
## (see get_confuse_mat_array), so the optimal threshold is the probability at the row of the highest accuracy.
def get_optimal_threshold(confuse_mat_array, prob_arr):
    # np.argmax returns the first index of the maximum
    return prob_arr[np.argmax(confuse_mat_array[:, 2])]


## This function splits the data into k chunks (folds) and runs the algorithm k times to calculate the model's accuracy.
## In each iteration one of the k chunks is used as the validation data and the other k-1 chunks
## are used as the train data. The function returns an array with the accuracy of each iteration.
def k_fold_cross_validation(folds, real_azimuth_avg, fake_azimuth_avg, lr, it):    
    # shape + 1 for the label: the last column is zeros for real and ones for fake
    real = np.zeros((real_azimuth_avg.shape[0], real_azimuth_avg.shape[1] + 1))
    real[:, 0:-1] = real_azimuth_avg
    real[:, -1] = np.zeros(real_azimuth_avg.shape[0])
    
    fake = np.zeros((fake_azimuth_avg.shape[0], fake_azimuth_avg.shape[1] + 1))
    fake[:, 0:-1] = fake_azimuth_avg
    fake[:, -1] = np.ones(fake_azimuth_avg.shape[0])
    
    data = np.vstack((real, fake))
    np.random.shuffle(data)
    # array_split returns a list of arrays; keeping it a list also works when the
    # number of rows is not divisible by the number of folds
    folded_data = np.array_split(data, folds, axis = 0)
    
    n = len(folded_data)
    
    accuracy_arr = np.zeros(n)
      
    for i in range(n):
        # all folds except fold i form the train data
        train = np.vstack([folded_data[j] for j in range(n) if j != i])
        train_x = train[:, :-1]
        train_y = train[:, -1]
                
        validation_x = folded_data[i][:, :-1]
        validation_y = folded_data[i][:, -1]
                    
        weights = gradient_descent_vector(it, calculate_coeffs(train_x, train_y), lr, train_x, train_y)[-1]
        
        # the threshold is tuned on the train data only
        prob_arr = sigma(weights, train_x)
        confuse_mat_array = get_confuse_mat_array(prob_arr, train_x, train_y, weights)
        threshold = get_optimal_threshold(confuse_mat_array, prob_arr)
                
        # the validation fold is classified with that threshold
        prob_arr = sigma(weights, validation_x)
        predicted = binary_prediction(prob_arr, threshold)
        accuracy_arr[i] = accuracy(predicted, validation_y)
                
    return accuracy_arr
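
The update rule used by gradient_descent_vector, w <- w - a*((sigmoid(Xw) - y) @ X), can be tried on a tiny example. The sketch below is self-contained and uses made-up toy data, an arbitrary learning rate of 0.1 and 200 steps; the helper names `sigmoid` and `cross_entropy_loss` are assumptions that mirror, but are not, the notebook's sigma() and cross_entropy().

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(X, y, w):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))

# Toy data: a bias column plus one feature; negatives on the left, positives on the right
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])  # start from the zero vector
learning_rate = 0.1       # arbitrary value chosen for this toy example
losses = []
for _ in range(200):
    losses.append(cross_entropy_loss(X, y, w))
    # Gradient of the cross-entropy with respect to w: (sigmoid(Xw) - y) @ X
    w = w - learning_rate * ((sigmoid(X @ w) - y) @ X)

predicted = np.where(sigmoid(X @ w) < 0.5, 0, 1)
print(losses[0], losses[-1], predicted)
```

Recording the loss at every step is the same idea graph_cost_vs_steps uses to plot the cross-entropy against the gradient descent step.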

<h2>Model import/export and initialization functions</h2>

In [19]:
## Model initialize method
def init_model(train_flag, parameters_flag):        
    model = Model(train_flag, parameters_flag)
    return model

## Export the entire model
def export_model_data(model):
    with open('model.npy', 'wb') as export_file:
        # Export data
        np.save(export_file, model.train_data_set_x)
        np.save(export_file, model.train_data_set_y)
        np.save(export_file, model.validation_data_set_x)
        np.save(export_file, model.validation_data_set_y)
        # Export hyperparameters
        np.save(export_file, model.iterations)
        np.save(export_file, model.learning_rate)
        # Export model properties
        np.save(export_file, model.w_vector)
        np.save(export_file, model.wm)
        np.save(export_file, model.train_prob_array)
        np.save(export_file, model.validation_prob_array)
        np.save(export_file, model.threshold)
        np.save(export_file, model.train_confusion_matrix)
        np.save(export_file, model.validation_confusion_matrix)
        np.save(export_file, model.train_binary_prediction)
        np.save(export_file, model.validation_binary_prediction)
        np.save(export_file, model.train_accuracy)
        np.save(export_file, model.validation_accuracy)     

## Export the final coefficients vector - wm into .npy file
def export_coefficients(model):
    with open('coeffs.npy', 'wb') as export_file:
        np.save(export_file, model.wm)

## Export azimuthal average of fake data vector into .npy file
def export_fake_azimuthal_average(data):
    with open('fake_azimuthal_average.npy', 'wb') as export_file:
        np.save(export_file, data)

## Export azimuthal average of real data vector into .npy file       
def export_real_azimuthal_average(data):
    with open('real_azimuthal_average.npy', 'wb') as export_file:
        np.save(export_file, data)
        
## Export both fake and real azimuthal averages into .npy file
def export_azimuthal_average(real,fake):
    export_fake_azimuthal_average(fake)
    export_real_azimuthal_average(real)

## Export the fake raw data matrix into .npy file
def export_fake_data(data):
    with open('fake_data.npy', 'wb') as export_file:
        np.save(export_file, data)

## Export the real raw data matrix into .npy file
def export_real_data(data):
    with open('real_data.npy', 'wb') as export_file:
        np.save(export_file, data)  
        
## Export both fake and real raw data into .npy file
def export_data(real,fake):
    export_fake_data(fake)
    export_real_data(real)
    
## Export model, both real and fake azimuthal averages and both real and fake raw data 
def fast_export(model, real, fake):
    export_model_data(model)
    export_azimuthal_average(real,fake)
    export_data(real,fake)
    
## Import an entire model and return a Model object
def import_model_data(model):
    with open('model.npy', 'rb') as import_file:
        model.train_data_set_x = np.load(import_file)
        model.train_data_set_y = np.load(import_file)
        model.validation_data_set_x = np.load(import_file)
        model.validation_data_set_y = np.load(import_file)
        model.iterations = np.load(import_file)
        model.learning_rate = np.load(import_file)
        model.w_vector = np.load(import_file)
        model.wm = np.load(import_file)
        model.train_prob_array = np.load(import_file)
        model.validation_prob_array = np.load(import_file)
        model.threshold = np.load(import_file)
        model.train_confusion_matrix = np.load(import_file)
        model.validation_confusion_matrix = np.load(import_file)
        model.train_binary_prediction = np.load(import_file)
        model.validation_binary_prediction = np.load(import_file)
        model.train_accuracy = np.load(import_file)
        model.validation_accuracy = np.load(import_file)
    return model

## Import a coefficients vector - wm and return it
def import_coefficients():
    with open('coeffs.npy', 'rb') as import_file:
        wm = np.load(import_file)
    return wm

## Import azimuthal average of fake data vector - fake_azimuthal_average and return it
def import_fake_azimuthal_average():
    with open('fake_azimuthal_average.npy', 'rb') as import_file:
        fake = np.load(import_file)
    return fake

## Import azimuthal average of real data vector - real_azimuthal_average and return it
def import_real_azimuthal_average():
    with open('real_azimuthal_average.npy', 'rb') as import_file:
        real = np.load(import_file)
    return real

## Import and return both fake and real azimuthal averages
def import_azimuthal_average():
    return import_fake_azimuthal_average(), import_real_azimuthal_average()

## Import raw fake data matrix - fake data and return it
def import_fake_data():
    with open('fake_data.npy', 'rb') as import_file:
        fake = np.load(import_file)
    return fake 

## Import raw real data matrix - real data and return it     
def import_real_data():
    with open('real_data.npy', 'rb') as import_file:
        real = np.load(import_file)
    return real

## Import and return both fake and real raw data
def import_data():
    return import_fake_data(), import_real_data()

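## Import model, both real and fake azimuthal averages and both real and fake raw data.
## import_model_data() needs a Model object to fill, so the model is created through init_model(False, True),
## which imports 'model.npy' into a new Model object.
def fast_import():
    return init_model(False, True), import_azimuthal_average(), import_data()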
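
export_model_data and import_model_data rely on writing several arrays into one file handle and reading them back in the same order. A minimal sketch of that pattern follows; it uses an in-memory io.BytesIO buffer as a stand-in for the 'model.npy' file, and the two arrays are made-up values.

```python
import io
import numpy as np

weights = np.array([0.5, -1.25, 3.0])
threshold = np.array(0.42)

buffer = io.BytesIO()       # stands in for open('model.npy', 'wb')
np.save(buffer, weights)    # first record
np.save(buffer, threshold)  # second record, written right after the first

buffer.seek(0)              # stands in for reopening the file with 'rb'
loaded_weights = np.load(buffer)    # reads the first record
loaded_threshold = np.load(buffer)  # reads the second record
print(loaded_weights, loaded_threshold)
```

Because the records carry no names, the load order has to match the save order exactly; swapping two np.load lines would silently assign the wrong array to an attribute.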

<h2>Model Optimization And Hyperparameters Tuning</h2>

<h4>The accuracy/iterations/learning rates arrays are obtained from the calculated grid in the cell below the functions</h4>

In [20]:
## This function creates a grid containing the accuracy for each combination of the given iterations/learning rate/data split values.
## The commented lines are an alternative to the hard-coded (manually obtained) values in the un-commented lines.
## The function accepts a Model object as an argument; un-commenting the 'np.savetxt' line exports the grid to a .csv file
def grid_search(model):
    #learning_rate_array = np.logspace(-5, -1, num=20, base=10)
    #iterations_array = np.logspace(1, 5, num=20, base=10)
    learning_rate_array = np.array([1e-1, 3e-1, 7e-1, 1e-2, 3e-2, 7e-2, 1e-3, 3e-3 ,7e-3, 1e-4])
    iterations_array = np.array([1e+1, 1e+2, 2e+2, 3e+2, 4e+2, 6e+2, 8e+2, 1e+3, 1.11e+3, 1.5e+3, 1e+4])
    data_percent = np.array([10, 25, 30, 50], dtype = int)
    # one grid row per (data split, iterations, learning rate) combination
    grid_size = len(learning_rate_array)*len(iterations_array)*len(data_percent)
    grid = np.zeros((grid_size,5))
    counter = 0
    for dp in data_percent:
        for m in iterations_array:
            for lr in learning_rate_array:
                model.change_validation_train_division(dp)
                model.re_set_model(int(m),lr)
                grid[counter] = np.array([model.validation_accuracy, 1-model.validation_accuracy, m, lr, dp])
                counter+=1
    #np.savetxt("grid_search_full_float_expanded.csv", grid, delimiter = ",", fmt="%f", header = 'Accuracy, Error, Iterations, Learning Rate, Data Validation Percent', comments='')
    return grid

## This function calculates the model's cross entropy value of the given data, labels and weights
## x = data x matrix, y = labels, w = weight vector
def cross_entropy(x, y, w):
    return np.squeeze(-np.sum(y*np.log(sigma(w, x) + 1e-8) + (1-y)*np.log(1 - sigma(w, x) + 1e-8)))

## This function is used to create a graph describing the change of the Cross Entropy by the step/iteration number of the gradient descent
## x = data x matrix, y = labels, wm = matrix of weight vectors, one row per gradient descent step
def graph_cost_vs_steps(x, y, wm):
    lce = []
    for w in wm:
        lce.append(cross_entropy(x,y,w))
    plt.plot(range(len(lce)),lce)
    plt.title('Cross entropy by Gradient Descent Step')
    plt.xlabel('Gradient Descent Step')
    plt.ylabel('Cross Entropy')
    plt.savefig('Cross Entropy by Gradient Descent step.png')
    plt.show()
    
## This function calculates and creates a graph of the predictions/probabilities of given data (x).
## data_type = 'Train' or 'Validation', y_type = 'Probability' or 'Prediction'
def plot_predictions(x, y, data_type, y_type):
    parameters_array = {'Predictions': y,
                        'X': x.flatten()}
    params = pd.DataFrame(parameters_array)
    sn.scatterplot(x='X', y='Predictions', data=params, palette="deep")
    plt.title(f'{y_type} on {data_type} Data', y=1.015, fontsize=20)
    plt.xlabel(f'{data_type} data features (mean)')
    plt.ylabel(f'{y_type}');
    plt.savefig(f'{y_type} on {data_type} Data.png')
    
## This function calculates and creates a graph of the accuracy calculated value by iterations for each learning rate
## accuracy_array = calculated accuracy array, iterations_array = different iterations tested, learning_rates_array =
## learning rates array.
def plot_accuracy_iterations_lr_multiple(accuracy_array, iterations_array, learning_rates_array): 
    parameters_array = {'Iterations': iterations_array,
                        'Accuracy': accuracy_array,
                        'Learning Rate': learning_rates_array}
    params = pd.DataFrame(parameters_array)
    sn.relplot(x='Iterations', y='Accuracy', col='Learning Rate', col_wrap = 3, data=params, palette="deep")
    #plt.title("Accuracy vs Iterations per Learning Rate", fontsize=20)
    plt.xlabel("Iterations", labelpad=13)
    plt.ylabel("Accuracy", labelpad=13)
    plt.savefig('Accuracy for each learning rate.png')
    ax = plt.gca()

## This function calculates and creates a graph of the accuracy calculated value by iterations for all the tested learning rates
## accuracy_array = calculated accuracy array, iterations_array = different iterations tested, learning_rates_array =
## learning rates array.
def plot_accuracy_iterations_lr(accuracy_array, iterations_array, learning_rates_array): 
    parameters_array = {'Iterations': iterations_array,
                        'Accuracy': accuracy_array,
                        'Learning Rate': learning_rates_array}
    params = pd.DataFrame(parameters_array)
    plt.figure(figsize=(10, 8))
    sn.scatterplot(x='Iterations', y='Accuracy', hue='Learning Rate', data=params, palette="deep")
    plt.title("Accuracy vs Iterations per Learning Rate", y=1.015, fontsize=20)
    plt.xlabel("Iterations", labelpad=13)
    plt.ylabel("Accuracy", labelpad=13)
    plt.savefig('Accuracy_vs_Iterations.png')
    ax = plt.gca()
    
## This function calculates and creates a graph of the accuracy calculated value by learning rates for all the tested iterations
## accuracy_array = calculated accuracy array, iterations_array = different iterations tested, learning_rates_array =
## learning rates array.
def plot_accuracy_lr_iterations(accuracy_array, iterations_array, learning_rates_array): 
    parameters_array = {'Iterations': iterations_array,
                        'Accuracy': accuracy_array,
                        'Learning Rate': learning_rates_array}
    params = pd.DataFrame(parameters_array)
    plt.figure(figsize=(10, 8))
    sn.scatterplot(x='Learning Rate', y='Accuracy', hue='Iterations', data=params, palette="deep")
    plt.title("Accuracy vs Learning Rate per Iterations", y=1.015, fontsize=20)
    plt.xlabel("Learning Rate", labelpad=13)
    plt.ylabel("Accuracy", labelpad=13)
    plt.savefig('Accuracy_vs_Learning_Rate.png')
    ax = plt.gca()

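## This function plots, for each iteration, the mean over the coefficients of the scaled squared change between
## consecutive weight vectors (labeled MSE on the plot) and prints the iteration at which it is smallest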
def graph_error_vs_iterations(model):
    mmse = np.mean((model.w_vector[1:,:] - model.w_vector[0:-1, :])**2 * (model.w_vector.shape[0]-1)-1, axis = 1)
    plt.plot(range(len(mmse)),mmse)
    plt.title('Error vs Iterations')
    plt.xlabel('Iterations')
    plt.ylabel('MSE')
    plt.show()
    print(np.argmin(mmse))

In [9]:
## Grid creation
#grid = grid_search(model)
# Get the accuracy, iterations and learning rates array from the grid for several optimization and plot usage
#accuracy_array = grid[:,0]
#iterations_array = grid[:,2]
#learning_rates_array = grid[:,3]
# Convert the iteration and learning rates arrays (repeated many times) to a unique numpy array (set) for several
# optimization and plot usage 
#iterations_set = np.unique(iterations_array)
#learning_rates_set = np.unique(learning_rates_array)
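
Once a grid is available, the best hyperparameters can be read from the row with the highest accuracy. The sketch below uses three made-up rows in the column order that grid_search() writes (accuracy, error, iterations, learning rate, validation percent); the numbers are illustrative only and are not results of this project.

```python
import numpy as np

# Made-up grid rows in the column order used by grid_search():
# accuracy, error, iterations, learning rate, validation percent
grid = np.array([[0.81, 0.19, 100.0, 1e-2, 10.0],
                 [0.93, 0.07, 1000.0, 1e-3, 25.0],
                 [0.88, 0.12, 1000.0, 1e-2, 25.0]])

best = grid[np.argmax(grid[:, 0])]  # row with the highest accuracy
best_iterations = int(best[2])
best_learning_rate = best[3]
print(best_iterations, best_learning_rate)
```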

<h2>The Logistic Regression Model</h2>

In [10]:
## The Model class holds the data for the concrete/trained logistic regression model.
## Creating a Model object requires two boolean arguments. train_flag chooses between training
## the model from scratch (True) and not training it (False). When training, parameters_flag chooses
## between the complete optimal learning rate and iterations search (True, takes a long time) and
## using pre-defined values for both (False). When not training, parameters_flag chooses between
## importing the existing model (True) and creating an empty model (False).
## A Model object can be initialized by the init_model() function or by importing a pre-trained model
## using the different import_ functions implemented in the cell below. Each function fills the model
## with the desired properties, ranging from all properties to only data/coefficients/other variations.
## The Model has the test() method that performs predictions using the model's trained properties
## over the test dataset folder. Read the test() documentation for more information.
## A complete Model object holds all the information, parameters and data needed to perform predictions
## and classification for new data inputs.
class Model:
    def __init__(self, train_flag, parameters_flag):
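        if train_flag:
            # (True, any): train the model from scratch
            self.create_model(parameters_flag)
        elif parameters_flag:
            # (False, True): import the existing model. Rebinding `self` has no effect outside
            # __init__, so the imported properties are copied onto this instance instead.
            imported = import_model_data(self)
            self.__dict__.update(imported.__dict__)
        else:
            # (False, False): leave the model empty
            print("Empty model created")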
    
    ## The model creation/initialization method. Will be called by the init_model() function to activate
    ## the complete model training process. When finished running - the model will hold all the properties needed.
    def create_model(self, parameters_flag):
        
        # Create azimuthal average from the imported data
        real_azimuth_avg = get_data_azimuthal_average('Real')
        fake_azimuth_avg = get_data_azimuthal_average('Fake')
        #fake_azimuth_avg, real_azimuth_avg = import_azimuthal_average()
        #print("Real azimuth shape:")
        #print(real_azimuth_avg.shape)
        #print("Fake azimuth shape:")
        #print(fake_azimuth_avg.shape)
        export_azimuthal_average(real_azimuth_avg, fake_azimuth_avg)
        
        # Create and divide the data to train and validation
        self.train_data_set_x, self.train_data_set_y, self.validation_data_set_x, self.validation_data_set_y = divide_train_validation_data(30,real_azimuth_avg, fake_azimuth_avg)
        if parameters_flag:
            # Calculate the optimal learning rate and iterations number by comparing different options and
            # finding the minimal valid cross-entropy
            learning_rate_arr = np.logspace(-10, -1, num=30, base=10)
            # PAY ATTENTION! 5 is the highest recommended power for the function.
            # A successful run with power 6 on a Ryzen 5800X with 32 GB of DDR4 RAM
            # took about 2 hours. Power 7 requires almost 60 GB of RAM just to start the function.
            power = 5
            self.learning_rate, self.iterations = check_learning_rate_array(learning_rate_arr, power, self.train_data_set_x, self.train_data_set_y, True)
        else:
            self.learning_rate = 0.1
            self.iterations = 400
            
        # Perform the needed calculations, create the coefficients vector (the final iteration of the 
        # gradient descent algorithm), the probability arrays, the optimal threshold and with that value the confusion
        # matrices and the binary predictions matrix.
        
        # Set the w_vector - weight vector/coefficients by calling the get_coefficients() function with
        # the model's given data. Sets another property - wm as the final iterations' weight vector.
        self.w_vector=get_coefficients(self.train_data_set_x, self.train_data_set_y, self.iterations, self.learning_rate)
        self.wm=self.w_vector[-1]
        
        # Set the probability arrays for the train data and the validation data by calling the
        # sigma() function with the model's given data.
        self.train_prob_array=sigma(self.wm, self.train_data_set_x)
        self.validation_prob_array=sigma(self.wm, self.validation_data_set_x)
        
        # Calculates the optimal threshold by calling the get_optimal_threshold() function with
        # the model's given data.
        self.threshold = get_optimal_threshold(get_confuse_mat_array(self.train_prob_array, self.train_data_set_x, self.train_data_set_y, self.wm),self.train_prob_array)
        
        # Generate and set the train and validation data's confusion matrices by calling the 
        # create_confusion_matrix() with the model's given data.
        self.train_confusion_matrix = create_confusion_matrix(self.train_data_set_x, self.train_data_set_y, self.wm, self.threshold)
        self.validation_confusion_matrix = create_confusion_matrix(self.validation_data_set_x, self.validation_data_set_y, self.wm, self.threshold)
        
        # Create the train and validation data's binary prediction matrices by calling the binary_prediction()
        # function with the model's given data.
        self.train_binary_prediction = binary_prediction(self.train_prob_array, self.threshold)
        self.validation_binary_prediction = binary_prediction(self.validation_prob_array, self.threshold)
        
        # Calculate the train and validation data's accuracy by calling the accuracy() function with
        # the binary predictions made by the trained model's weight vector and the matching labels.
        self.validation_accuracy = accuracy(self.validation_binary_prediction, self.validation_data_set_y)
        self.train_accuracy = accuracy(self.train_binary_prediction, self.train_data_set_y)

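    ## Re-divide the imported azimuthal averages to train and validation data using a new percentage.
    ## This only re-divides the data. The coefficients and the other properties are not recalculated
    ## until re_set_model() is called.
    def change_validation_train_division(self, percent):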
        self.train_data_set_x, self.train_data_set_y, self.validation_data_set_x, self.validation_data_set_y = divide_train_validation_data(percent, import_real_azimuthal_average(), import_fake_azimuthal_average())

        
    ## Re-set the model and calculate the coefficients, probability arrays, threshold and other properties
    ## after manually updating the iterations number / learning rate value.
    ## Accepts the new iterations number and learning rate as arguments.
    def re_set_model(self, iterations, learning_rate):
        self.iterations = iterations
        self.learning_rate = learning_rate
        self.w_vector=get_coefficients(self.train_data_set_x, self.train_data_set_y, self.iterations, self.learning_rate)
        self.wm=self.w_vector[-1]
        self.train_prob_array=sigma(self.wm, self.train_data_set_x)
        self.validation_prob_array=sigma(self.wm, self.validation_data_set_x)
        self.threshold = get_optimal_threshold(get_confuse_mat_array(self.train_prob_array, self.train_data_set_x, self.train_data_set_y, self.wm),self.train_prob_array)
        self.train_confusion_matrix = create_confusion_matrix(self.train_data_set_x, self.train_data_set_y, self.wm, self.threshold)
        self.validation_confusion_matrix = create_confusion_matrix(self.validation_data_set_x, self.validation_data_set_y, self.wm, self.threshold)
        self.train_binary_prediction = binary_prediction(self.train_prob_array, self.threshold)
        self.validation_binary_prediction = binary_prediction(self.validation_prob_array, self.threshold)
        self.train_accuracy = accuracy(self.train_binary_prediction, self.train_data_set_y)
        self.validation_accuracy = accuracy(self.validation_binary_prediction, self.validation_data_set_y)

    ## Export the entire model after training and defining the needed parameters to enable quick usage in the future
    def export_model(self):
        export_model_data(self)
    
    ## Calculates and returns the accuracy obtained after each iteration of the gradient descent algorithm
    def get_accuracy_array(self):
        accuracy_array = []
        for i in range(self.w_vector.shape[0]):
            # Predict the validation labels with the weight vector of iteration i
            prob_array = sigma(self.w_vector[i], self.validation_data_set_x)
            prediction = binary_prediction(prob_array, self.threshold)
            accuracy_array.append(accuracy(prediction, self.validation_data_set_y))
        return accuracy_array
            
    ## Test function for the test (new, unseen) data.
    ## This method imports the data, calculates the azimuthal averages of each imported image by
    ## activating the fft algorithms and predicts whether each image is real (0) or fake (1)
    def test(self):
        # Import the test data from the 'data/FutureData' folder. Edit the path to match another folder if needed.
        test = []
        export = []
        path = 'data/FutureData'
        for file in os.listdir(path):
            filename = path + '/' + os.fsdecode(file)
            export.append(os.fsdecode(file))
            test.append(cv2.imread(filename,0))
        
        # Convert to a complex data type numpy array.
        test_arr = np.array(test, dtype=complex)
        
        # Initialize and calculate the azimuthal averages for each image in the test data.
        test_azimuthal_averages = np.zeros((test_arr.shape[0], 722))
        for i in range(test_arr.shape[0]):
            test_azimuthal_averages[i] = get_azimuthal_average(fft_magnitude_20log_e(transform_image(test_arr[i,:,:])))
        
        # Predict each image's label (Real = 0, Fake = 1) using the binary prediction function implemented in
        # this model using the model's coefficients vector and threshold.
        predict = binary_prediction(sigma(self.wm, test_azimuthal_averages), self.threshold)
        
        # Export each image's prediction by its file name to a '.csv' file
        zipped = np.array(list(zip(export,predict)))
        np.savetxt('FutureDataEstimatedLabels.csv' ,zipped, delimiter=',', header = 'File Name, Predicted', fmt = "%s,%s", comments='')
        print(predict)
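The flow of create_model() and test() (fit a weight history, take the last weight vector, turn probabilities into labels with a threshold, measure accuracy) can be sketched on toy data. The notebook's own sigma(), get_coefficients(), binary_prediction() and accuracy() are defined in other cells, so the functions below are differently named stand-ins. The sketch assumes standard batch gradient descent on the cross-entropy loss, zero-initialized weights and a fixed threshold of 0.5, whereas the model above selects its threshold with get_optimal_threshold().

```python
import numpy as np

def sigmoid(w, x):
    """Probability of the positive class (Fake = 1) for each row of x."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def gradient_descent_history(x, y, iterations, learning_rate):
    """Return an (iterations, n_features) array holding the weights after each step."""
    w = np.zeros(x.shape[1])
    history = np.zeros((iterations, x.shape[1]))
    for i in range(iterations):
        # Gradient of the mean cross-entropy loss with respect to w
        gradient = x.T @ (sigmoid(w, x) - y) / x.shape[0]
        w = w - learning_rate * gradient
        history[i] = w
    return history

def predict_labels(probabilities, threshold):
    """Label 1 where the probability reaches the threshold, else 0."""
    return (probabilities >= threshold).astype(int)

def accuracy_score(predictions, labels):
    """Fraction of predictions that match the labels."""
    return float(np.mean(predictions == labels))

# Toy data (invented): a bias column plus one feature, labeled 1 when the feature is positive
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

w_history = gradient_descent_history(x, y, iterations=400, learning_rate=0.1)
w_final = w_history[-1]                      # plays the role of model.wm
predictions = predict_labels(sigmoid(w_final, x), threshold=0.5)
train_accuracy = accuracy_score(predictions, y)
print(predictions, train_accuracy)
```

Keeping the whole weight history rather than only the final vector is what makes per-iteration diagnostics such as get_accuracy_array() possible.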

<h2>Main part</h2>

<h4>Each cell below offers a different way to run the model.</h4>
<h4>The exact results for each run type are noted in a comment at the beginning of each cell.</h4>

In [11]:
## Full initialization using the optimal hyperparameters we calculated.
## These parameters have the best accuracy-efficiency trade-off.
## create_model() sets learning rate = 0.1 and iterations = 400, and passes 30 as the
## split percentage to divide_train_validation_data().
model = init_model(True, False)
model.export_model()

In [12]:
## Full initialization using an algorithm to find the optimal hyperparameters.
## This process takes a long time and uses many resources. The result is more accurate but
## much less efficient.
#model = init_model(True, True)

In [15]:
## Initialize a Model object by importing our trained model from the attached 'model.npy' file.
## This is the quickest method to load the model.
model2 = init_model(False, True)

In [14]:
## Initialize an empty Model object.
## Use the model's methods to set or import the parameters.
#model = init_model(False, False)
#model = import_model_data(model)