### Brief Overview of Our Implementation of an Online Training Approach for Neural Networks

#### Issues our implementation hopes to address
Neural networks are well known to be a less than ideal tool for online machine learning problems. For the problems we are interested in, two issues stand out:

- Training time scales poorly as the dimensionality of the data increases: our network must take an $n$-dimensional input and produce an $n$-dimensional output, so the number of parameters grows with $n$.


- So-called "catastrophic forgetting" (also known as "catastrophic interference") is a commonly encountered issue when training neural networks online.

#### Brief description of how these issues are addressed in our implementation

The primary approaches used to reduce training time are:

- Optimize network size. For a given problem $Ax=b$, we know that the solution $x$ for a given $b$ can be obtained from the inverse operator $A^{-1}$ (assuming it exists) as $x=A^{-1}b$. A neural network that takes $b$ and returns a sufficiently "close" solution guess $x$ can therefore be effectively represented as an operator with $O(n^2)$ weights. (Note: we can think of matrix multiplication as a convolution with stride 1, kernel size $n$, and $n$ channels acting on an input of length $n$. In other words, we take the inner product of the length-$n$ input vector with $n$ unknown weights, $n$ times.) Experimentally, networks with far more or far fewer than order $n^2$ weights took longer to train, so a network that is neither "too big" nor "too small" has $O(n^2)$ weights. We therefore set up a network that more or less mimics the known "true solution" structure of matrix multiplication, with some neural network "depth". Fortuitously, this involves a large channel dimension, which is the dimension over which PyTorch parallelizes computations.


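        import torch

        class Net(torch.nn.Module):  # class name and signature are illustrative; the excerpt does not give them
            def __init__(self, D_in, H, D_out):
                super().__init__()
                device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
                # Conv1: 1 input channel -> H channels, kernel spans the whole input (H*D_in weights)
                self.Conv1 = torch.nn.Conv1d(1, int(H), D_in, stride=1, padding=0, dilation=1, groups=1,
                    bias=False, padding_mode='zeros').to(device)
                # Conv2: kernel-size-1 convolution mixing H channels into D_out outputs (D_out*H weights)
                self.Conv2 = torch.nn.Conv1d(int(H), D_out, 1, stride=1, padding=0, dilation=1, groups=1,
                    bias=False, padding_mode='zeros').to(device)
                self.relu = torch.nn.LeakyReLU().to(device)

            def forward(self, x):
                Current_batchsize = int(x.shape[0])  # N in pytorch docs
                device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
                x2 = x.unsqueeze(1)  # Add channel dimension (C) to input
                ConvOut1 = self.relu(self.Conv1(x2.to(device)))
                ConvOut2 = self.Conv2(ConvOut1)
                y_pred = ConvOut2.view(Current_batchsize, -1)  # flatten back to (N, D_out)
                return y_pred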




- Train the neural network online, so that we do not have to retrain on the entire training set every time new data is added.
    - We use an "online batch" stochastic gradient descent approach to train the network incrementally: each batch combines the newly gathered data with a random sample of the "old" data.


- "Filter" the data added to the training set so that it is not "too redundant" (i.e., we try to maximize the "value" of each data point, so that training time stays manageable while the network remains useful for speeding up computations).

    - Filter approaches currently implemented:
        - Keep a moving average of the run time of the GMRES computation up to a certain tolerance, and only add data when the current run time is larger than the moving average.
        - Add filtered data to the training set in small amounts, and then retrain.
        - General philosophy: only add data to the training set that the network currently fits poorly.
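The note under "Optimize network size" about viewing matrix multiplication as a convolution can be checked numerically. The following is a minimal sketch, not part of our implementation: the matrix `A`, the size `n`, and the batch of right-hand sides are made up for illustration. It copies $A^{-1}$ into the weights of a single `Conv1d` layer with one input channel, $n$ output channels, and kernel size $n$, and compares the output with $A^{-1}b$.

```python
import torch

torch.manual_seed(0)
n = 5  # illustrative problem size

# Hypothetical well-conditioned system matrix and its inverse.
A = torch.eye(n) + 0.1 * torch.randn(n, n)
A_inv = torch.linalg.inv(A)

# Conv1d(1, n, n): the weight tensor has shape (n, 1, n), i.e. n*n parameters,
# with one row of the matrix stored per output channel.
conv = torch.nn.Conv1d(1, n, n, bias=False)
with torch.no_grad():
    conv.weight.copy_(A_inv.unsqueeze(1))

b = torch.randn(3, n)                      # batch of 3 right-hand sides
x_conv = conv(b.unsqueeze(1)).squeeze(-1)  # (3, n, 1) -> (3, n)
x_true = b @ A_inv.T                       # A^{-1} b for each row of b

num_weights = sum(p.numel() for p in conv.parameters())
print(torch.allclose(x_conv, x_true, atol=1e-5), num_weights)
```

With a single layer the weight count is exactly $n^2$. The two-layer network shown earlier has `H*D_in + D_out*H` weights, which is $2n^2$ if one assumes `H = D_in = D_out = n`, so it stays at order $n^2$.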
    



### Excerpt from the PyTorch training loop

    def retrain_timed(self):

        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.model.to(device)
        self.loss_val = list()  # clear loss val history
        self.loss_val.append(10.0)  # sentinel so the while-condition holds on entry

        batch_size=64
        numEpochs=2000
        e1=1e-3
        epoch=0
        while self.loss_val[-1]> e1 and epoch<numEpochs:
            # online: reshuffle the stored ("old") data on every pass
            permutation = torch.randperm(self.x.size()[0])
            for t in range(0,self.x.size()[0],batch_size):

                # online: random batch of old data
                indices = permutation[t:t+batch_size]
                batch_x, batch_y = self.x[indices],self.y[indices]

                # online: append the new data to every batch
                batch_xMix=torch.cat((batch_x,self.xNew))
                batch_yMix=torch.cat((batch_y,self.yNew))

                # Forward pass: Compute predicted y by passing x to the model
                y_pred = self.model(batch_xMix.to(device))

                # Compute and record loss
                loss = self.criterion(y_pred, batch_yMix.to(device))
                self.loss_val.append(loss.item())

                # Zero gradients, perform a backward pass, and update the weights.
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                epoch=epoch+1  # incremented per mini-batch, so this counts update steps

                #print diagnostic data
                # print('loss:',loss.item())
                # print('epoch:',epoch)
        print('Final loss:',self.loss_val[-1])
        # Fold the new data into the stored training set
        self.x=torch.cat((self.x,self.xNew))
        self.y=torch.cat((self.y,self.yNew))
        # online: reset the buffers that hold the new data
        self.xNew = torch.empty(0, self.D_in)
        self.yNew = torch.empty(0, self.D_out)
        # numparams=sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        # print('parameters',numparams)
        self.is_trained = True

The underlying idea of the code above is that each training step draws a random batch of `batch_size` samples from the stored training set and updates the model with it. The loop ends once the most recent loss falls below `e1` or the step counter reaches `numEpochs`, checked after each pass over the stored data. At any given time, the model has already been trained on this stored data. The "online" twist is that the "new" data is appended to every batch. This way, the model is trained on "new" data it has never seen, while still "seeing" a spread-out sampling of "past" data. Note that in the present implementation, we add at most 3 new data points at a time.

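The lines of code that implement these ideas are the ones that build `permutation`, `indices`, `batch_xMix`, and `batch_yMix`, and the ones that reset `self.xNew` and `self.yNew`.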
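The batch-mixing step can be isolated into a small, self-contained sketch. The helper name `mixed_batches` and the toy tensor sizes below are assumptions made for illustration; only the shuffling and concatenation pattern is taken from the excerpt above.

```python
import torch

def mixed_batches(x_old, y_old, x_new, y_new, batch_size=64):
    """Yield shuffled batches of old data, each with all new samples appended."""
    permutation = torch.randperm(x_old.size(0))
    for t in range(0, x_old.size(0), batch_size):
        indices = permutation[t:t + batch_size]
        yield (torch.cat((x_old[indices], x_new)),
               torch.cat((y_old[indices], y_new)))

torch.manual_seed(0)
x_old, y_old = torch.randn(10, 4), torch.randn(10, 4)  # 10 stored samples
x_new, y_new = torch.randn(3, 4), torch.randn(3, 4)    # 3 new samples
batches = list(mixed_batches(x_old, y_old, x_new, y_new, batch_size=4))
print([bx.shape[0] for bx, _ in batches])
```

Each old sample appears in one batch per pass, while the new samples appear in every batch, so the new data receives proportionally more weight during retraining.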

#### Code excerpt emphasizing data filters

            res = target[-1]


            # Check if we are in first e tolerance loop
            if refine==False :
                IterErr = resid(A, target, b)
                IterTime=(toc-tic)
                IterErr10=IterErr[10]
                IterErrList.append(IterTime)  # despite its name, this list stores run times
                IterErrList10.append(IterErr10)
                if ProbCount<=Initial_set:
                    func.predictor.add_init(b, res)
                if ProbCount==Initial_set:
                    # (b, res) is added again here before the first training run
                    func.predictor.add_init(b, res)
                    timeLoop=func.predictor.retrain_timed()
                    print('Initial Training')


            # Compute moving averages used to filter data
            if ProbCount>Initial_set:
                IterTime_AVG=moving_average(np.asarray(IterErrList),ProbCount)
                IterErr10_AVG=moving_average(np.asarray(IterErrList10),ProbCount)
                print(IterErrList[-1],IterTime_AVG,IterErrList10[-1],IterErr10_AVG)


            # Filter for data to be added to training set
            # filter: keep only solves that ran longer than the moving average
            if (IterErrList[-1]>IterTime_AVG) and  refine==True and ProbCount>Initial_set   :
                blist.append(b)
                reslist.append(res)

                # filter: check orthogonality of 3 solutions that met training set criteria
                if   len(blist)==3 :
                    resMat=np.asarray(reslist)
                    resMat_square=resMat**2
                    row_sums = resMat_square.sum(axis=1,keepdims=True)
                    resMat= resMat/np.sqrt(row_sums)  # normalize each solution to unit length
                    InnerProd=np.dot(resMat,resMat.T)  # pairwise normalized inner products
                    print('InnerProd',InnerProd)
                    func.predictor.add(np.asarray(blist)[0], np.asarray(reslist)[0])
                    cutoff=0.8

                    # filter: picking out sufficiently orthogonal subset of 3 solutions gathered
                    if InnerProd[0,1]<cutoff and InnerProd[0,2]<cutoff :
                        if InnerProd[1,2]<cutoff :
                            func.predictor.add(np.asarray(blist)[1], np.asarray(reslist)[1])
                            func.predictor.add(np.asarray(blist)[2], np.asarray(reslist)[2])
                        elif InnerProd[1,2]>=cutoff:
                            func.predictor.add(np.asarray(blist)[1], np.asarray(reslist)[1])
                    elif InnerProd[0,1]<cutoff :
                        func.predictor.add(np.asarray(blist)[1], np.asarray(reslist)[1])
                    elif InnerProd[0,2]<cutoff :
                        func.predictor.add(np.asarray(blist)[2], np.asarray(reslist)[2])

                    if func.predictor.counter>=retrain_freq:
                        if func.debug:
                            print("retraining")
                            print(func.predictor.counter)
                        # retrain regardless of the debug flag, then clear the candidate buffers
                        timeLoop=func.predictor.retrain_timed()
                        trainTime=float(timeLoop[-1])
                        blist=[]
                        reslist=[]
            return target,IterErrList,IterTime_AVG,trainTime,forwardTime,blist,

Two main approaches are used to maximize the "value" of the data in our training set. First, we only add data for which the GMRES run time needed to reach a solution of accuracy $e_1$ was longer than the current running average. In the code we use a moving average with a window of 25 samples. This window can of course be tuned, but it seems to work well for the small experiments we have run so far. Second, we collect this data three solutions at a time and check their mutual orthogonality through normalized inner products. The first of the three is always added to the training set; the other two are added only if they are sufficiently orthogonal (normalized inner product below the cutoff of 0.8) to the solutions already selected from the set. Of course, we could also write code to check the orthogonality of a larger set of candidate solutions, or to check the "spread" of the solutions in some other way.

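The lines of code that implement these ideas are the run-time filter condition, the `len(blist)==3` check, and the block that picks out a sufficiently orthogonal subset of the three solutions.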
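Both filters can be sketched in a few lines, including the extension to a larger candidate set mentioned above. This is a hedged illustration, not our implementation: `RuntimeFilter` and `select_spread` are hypothetical names, the sketch compares each run time with the average of the previous solves only (an assumption), and the greedy selection assumes nonzero candidate vectors. As in the excerpt, the signed inner product is compared with the cutoff.

```python
from collections import deque
import numpy as np

class RuntimeFilter:
    """Keep a moving average of recent solve times and flag slow solves."""

    def __init__(self, window=25):
        self.times = deque(maxlen=window)

    def should_add(self, run_time):
        # Compare against the average of the previous solves, then record this one.
        if self.times:
            keep = run_time > sum(self.times) / len(self.times)
        else:
            keep = False  # no history yet, nothing to compare against
        self.times.append(run_time)
        return keep

def select_spread(solutions, cutoff=0.8):
    """Greedily keep candidates whose normalized inner product with
    every already-kept candidate is below the cutoff."""
    S = np.asarray(solutions, dtype=float)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)  # unit-length rows
    gram = S @ S.T                                    # pairwise inner products
    kept = [0]                                        # first candidate is always kept
    for i in range(1, len(S)):
        if all(gram[i, j] < cutoff for j in kept):
            kept.append(i)
    return kept

flt = RuntimeFilter(window=3)
decisions = [flt.should_add(t) for t in [1.0, 1.0, 4.0, 1.0, 3.0]]

candidates = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],  # nearly parallel to the first
    [0.0, 1.0, 0.0],  # orthogonal to the first
    [0.0, 0.9, 0.1],  # nearly parallel to the third
]
kept = select_spread(candidates)
print(decisions, kept)
```

For three candidates the greedy rule follows the same always-keep-the-first logic as the excerpt, and it applies unchanged to any number of candidates.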