## In this notebook, I have highlighted a few of the approaches I tried in order to learn the inverse of the reassigned spectrogram algorithm.

### Let's start with the reassigned spectrogram parameters

In [None]:
n_fft, nperseg = 80, 80 
window = "hamming"
center = False  # reassigned times are not aligned to the frames, so center=False is preferred
pad_mode = "wrap"  # listed for completeness; with center=False no padding is applied
win_length, hop_length = 80, 80//4
fill_nan = True  # return bin frequencies and frame times instead of NaN
clip = False  # the goal is to recreate the signal, so keep frequencies and times that fall outside the bounds

### A few general design decisions across architectures
+ MSE loss with mean reduction
+ AdamW with no weight decay and with AMSGrad
+ Manual scheduler that reduces the learning rate when the loss doesn't improve
+ Batch size of 1 with gradient accumulation over 128 steps
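
The notebook does not show the manual scheduler itself, so the following is only a minimal sketch of the idea, assuming a plateau rule: cut the learning rate by a fixed factor once the loss has failed to improve for a few consecutive checks. The class name `PlateauLRScheduler`, the `factor`, `patience` and `min_lr` parameters, and the `DummyOptimizer` stand-in are all hypothetical.

```python
class PlateauLRScheduler:
    """Hypothetical manual scheduler: shrink the LR when the loss plateaus."""

    def __init__(self, optimizer, factor=0.5, patience=3, min_lr=1e-7):
        self.optimizer = optimizer
        self.factor = factor        # multiplicative LR cut (assumed value)
        self.patience = patience    # non-improving checks tolerated (assumed value)
        self.min_lr = min_lr        # lower bound for the LR (assumed value)
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, loss):
        """Record a loss value; return True if the learning rate was reduced."""
        if loss < self.best:
            self.best = loss
            self.bad_checks = 0
            return False
        self.bad_checks += 1
        if self.bad_checks < self.patience:
            return False
        for group in self.optimizer.param_groups:
            group["lr"] = max(group["lr"] * self.factor, self.min_lr)
        self.bad_checks = 0
        return True


class DummyOptimizer:
    """Stand-in that exposes `param_groups` the way PyTorch optimizers do."""

    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]


optimizer = DummyOptimizer(lr=1e-3)
scheduler = PlateauLRScheduler(optimizer, factor=0.5, patience=2)
for loss in [1.0, 0.5, 0.6, 0.7, 0.4]:
    scheduler.step(loss)
final_lr = optimizer.param_groups[0]["lr"]
print(final_lr)  # halved once, after the two non-improving losses 0.6 and 0.7
```

Because the class only touches `optimizer.param_groups`, it should also work with a real optimizer such as `torch.optim.AdamW(model.parameters(), weight_decay=0.0, amsgrad=True)`. With a batch size of 1 and 128 accumulation steps, `step` would presumably be called on a loss averaged over one or more accumulation windows rather than on every single sample.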

### Approach 1: simple architecture

+ The easiest way to approach the problem is to feed all the information to the model and ask it to figure out the mapping on its own.
+ This is what we do in this approach: we feed the outputs of the reassigned spectrogram directly to our model.

In [None]:
import librosa
import numpy as np
import torch

# `signal` is the 1-D waveform, sampled at 16 kHz
freq, times, mags = librosa.reassigned_spectrogram(
    signal, sr=16000, n_fft=80, 
    window="hamming", center=True, 
    pad_mode="wrap", win_length=80, 
    hop_length=80//4, fill_nan=True, 
    reassign_times=True, clip=False
)

Zxx = np.vstack((freq, times, mags)) # (41*3, number_of_frames)

# reshaping for CNN architecture
Zxx = Zxx.reshape(1, 1, Zxx.shape[0], -1)
X = torch.Tensor(Zxx).to(device)

# passing it to our model
out = model(X)

In [None]:
# our model architecture
import torch
from torch import nn

class flatten(nn.Module):
    """Flattens the CNN output frame by frame into a 1-D signal (assumes a batch size of 1)."""
    def __init__(self):
        super().__init__()

    def forward(self, out):
        # (1, C, 1, T) -> (C, T) -> (T, C) -> (T * C,)
        return torch.swapaxes(out.squeeze(), 0, 1).flatten()

# 80 -> n_fft, 123 -> 41 * 3, 2 -> hop_length/window_overlap
model = nn.Sequential(
   nn.Conv2d(1, 80, kernel_size=(123, 2), stride=1, bias=True),
   nn.Dropout2d(p=0.3),
   nn.BatchNorm2d(80),
   nn.Conv2d(80, 20, kernel_size=(1, 1), stride=1, bias=True), # Acts as a linear layer
   flatten()
).to('cuda')

### Results
+ The best training MSE loss we get is 2e-2, which is not low enough to recreate the signal.
+ Note that the reassigned_spectrogram call above uses center=True. This is deliberate: it keeps the frames aligned with this simple CNN architecture. To use center=False, we would have to pad and manipulate the signal appropriately. Instead of doing this workaround, we move to the next approach.
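
The alignment mentioned above can be checked with a little shape arithmetic. The sketch below assumes the usual STFT framing rule (centering pads `n_fft // 2` samples on each side, and the frame count is `1 + (padded_length - n_fft) // hop_length`) and the Approach 1 model shape (a width-2, stride-1 convolution over frames, flattened with 20 output channels per column). The helper names and the 16000-sample example length are hypothetical.

```python
def n_frames(n_samples, n_fft=80, hop_length=20, center=True):
    """Number of STFT frames for a signal of `n_samples` samples."""
    if center:
        # centering pads n_fft // 2 samples on each side of the signal
        n_samples += 2 * (n_fft // 2)
    return 1 + (n_samples - n_fft) // hop_length


def cnn_output_samples(frames, kernel_width=2, out_channels=20):
    """Samples emitted by the Approach 1 model for a given number of frames."""
    columns = frames - kernel_width + 1   # stride-1 convolution over the frame axis
    return columns * out_channels         # flatten() emits 20 values per column


n = 16000  # one second at 16 kHz (example length, not from the notebook)
centered = cnn_output_samples(n_frames(n, center=True))
uncentered = cnn_output_samples(n_frames(n, center=False))
print(centered, uncentered)  # 16000 15920
```

Under these assumptions the centered variant produces exactly one output value per input sample, while the uncentered variant comes up 80 samples short, which is consistent with the padding workaround described above.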

### Approach 2: unwrapping the tfr output

+ In this approach, we focus on the following:
    + Increase the model's capacity to learn by decreasing the complexity of the data.
        + Instead of passing padded, redundant features in each time frame to the model, we remove the redundant values and provide less complex data.
    + Use center=False in the reassigned_spectrogram call.
    + Adapt the CNN architecture to the new, less complex data.

In [None]:
freq, times, mags = librosa.reassigned_spectrogram(
    signal, sr=16000, n_fft=80, 
    window="hamming", center=False, 
    pad_mode="wrap", win_length=80, 
    hop_length=80//4, fill_nan=True, 
    reassign_times=True, clip=False
)

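# unwraps (41, number_of_frames) -> 1-D array of length 20 * (number_of_frames - 1) + 41:
# the first n_fft//4 = 20 bins of every frame but the last, then all 41 bins of the last frame
# (np.append flattens its inputs in row-major order)
freq = np.append(freq[:n_fft//4, :-1], freq[:, -1])
times = np.append(times[:n_fft//4, :-1], times[:, -1])
mags = np.append(mags[:n_fft//4, :-1], mags[:, -1])

# stacks to get -> (3, unwrapped_length)
Zxx = np.vstack((freq, times, mags))
Zxx = Zxx.reshape(1, 1, Zxx.shape[0], -1)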
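
To see what the unwrapping step above produces, here is a toy sketch on a tiny array (5 bins by 4 frames, keeping 2 bins per frame instead of 20). The `unwrap` helper and the toy sizes are hypothetical; the slicing is the same as in the cell above. The second variant, which transposes before flattening, is not from the notebook. It is shown only for comparison, in case a frame-by-frame ordering is what the strided convolution is meant to see.

```python
import numpy as np


def unwrap(tfr, keep):
    """First `keep` bins of every frame but the last, then all bins of the last frame."""
    return np.append(tfr[:keep, :-1], tfr[:, -1])


# toy TFR: rows are frequency bins, columns are frames; values 0..19
toy = np.arange(20).reshape(5, 4)

flat = unwrap(toy, keep=2)
print(flat.tolist())         # [0, 1, 2, 4, 5, 6, 3, 7, 11, 15, 19]

# comparison variant (assumption, not the notebook's code): frame-by-frame order
flat_frames = np.append(toy[:2, :-1].T, toy[:, -1])
print(flat_frames.tolist())  # [0, 4, 1, 5, 2, 6, 3, 7, 11, 15, 19]
```

Both variants have length `keep * (frames - 1) + bins`, here 2 * 3 + 5 = 11. They differ only in ordering: `np.append` flattens row by row, so the first variant lists bin 0 of every frame before moving on to bin 1.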

In [None]:
# our model architecture

class flatten(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, out):
        return torch.swapaxes(out.squeeze(), 0, 1).flatten()
    
model = nn.Sequential(
   nn.Conv2d(1, 20, kernel_size=(3, 20), stride=20, bias=True),
   nn.Dropout2d(p=0.3),
   nn.BatchNorm2d(20),
   nn.Conv2d(20, 20, kernel_size=(1, 1), stride=1, bias=True),
   flatten()
).to('cuda')

### Results:
+ The best training MSE loss we get is 1e-3, which unfortunately is again not low enough.
+ Increasing the capacity of the architecture doesn't seem to help, which suggests to me that we are missing some information that is necessary to recreate the signal.

### Approach 3: hacky training approach
+ Based on the reassigned_spectrogram algorithm, I felt that what our model needs to learn is not a set of generalized features but one specific formula that is common to all the training data.
    + Therefore, instead of training on the full LibriSpeech dataset, I picked only a couple of signals/samples to train the model. The learning rate was kept very low so that the model doesn't overfit too easily, and the model architecture was slightly adapted to mitigate overfitting and increase model capacity.

In [None]:
# our model architecture

class flatten(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, out):
        return torch.swapaxes(out.squeeze(), 0, 1).flatten()
    
model = nn.Sequential(
   nn.Conv2d(1, 40, kernel_size=(3, 20), stride=20, bias=True),
   nn.Dropout2d(p=0.8),
   nn.BatchNorm2d(40),
   nn.Conv2d(40, 20, kernel_size=(1, 1), stride=1, bias=True),
   nn.Dropout2d(p=0.8),
   nn.BatchNorm2d(20),
   nn.Conv2d(20, 20, kernel_size=(1, 1), stride=1, bias=True),
   flatten()
).to('cuda')

### Results

+ The best training MSE we achieved without overfitting was 1e-4. The model was able to recreate signals when it was overfitted down to an MSE loss of 1e-5 or better. Unfortunately, we still weren't able to recreate the signal without overfitting.