Jaehoon9201/Python-data-preprocessing

Characteristics of FFT

  • Including [FFT_ex1.py], all examples perform the FFT over the entire interval. If you change the FFT window (unit: number of samples, or seconds), the frequency resolution changes in proportion to Fsamp/FFTwindow: a smaller window gives a coarser resolution, even though the range of frequency analysis does not change.

  • Reference

    • Lowest Detectable Frequency

      • F0 = 5*(SR/Window Size)
      • e.g., F0 = 5*(44100/1024) ≃ 215 Hz.
    • WS = 5*Fsamp/F0

      • F0 = 440 Hz --> WS = 501
      • F0 = 100 Hz --> WS = 2205
    • TR = Window Size/Fsamp

      • If the window size is big, the time resolution becomes coarser (fewer time slices).
    • FR = Fsamp/Window Size

      • The more bins, the more slices of frequency range we get.
  • Summary (Reference)

    • A greater frequency resolution results in a smaller time resolution; a quick numeric check follows.
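As a quick check of the relations above, here is a minimal sketch (the sampling rate and window size are the illustrative values from the example, not repository code):

sr = 44100          # sampling rate [Hz]
window_size = 1024  # FFT window size [samples]

F0 = 5 * sr / window_size   # lowest detectable frequency, ~215.3 Hz
TR = window_size / sr       # time resolution, ~23.2 ms per window
FR = sr / window_size       # frequency resolution, ~43.1 Hz per bin
print(F0, TR, FR)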

FFT_ex1.py

  • This example code is a reproduction of the code in the following reference: Reference

Data Shape!

Frequency Analysis of the Noise - by FFT

The second graph in the plot below shows the FFT analysis of the noise signal added to the original signal. The horizontal axis represents the frequency, and the vertical axis represents the magnitude of each frequency.

FFT_ex2.py

def FFT(sample_rate, duration, signal):
    ...
    return xf, yf, phase_ang

  • 😟 Why are the results different from those of ex1?
    • The [ex1] code obtains values on a power-spectrum scale (a comparison sketch follows the snippet below).

      fhat = np.fft.fft(f,n)                     #compute the FFT
      PSD = fhat * np.conj(fhat)/n               #power spectrum 
      freq = (1/(dt*n)) * np.arange(n)           #x-axis of frequencies 
      L = np.arange(1,np.floor(n/2),dtype='int') #only plot 1st half
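To see why the results differ, here is a minimal, self-contained comparison sketch (the toy signal is my assumption, not the repository's data): the ex1 power spectrum and an ex2/ex3-style amplitude spectrum peak at the same bins and differ only in vertical scale.

      import numpy as np

      dt = 0.001
      t = np.arange(0, 1, dt)
      f = np.sin(2*np.pi*50*t) + np.sin(2*np.pi*120*t)   # toy signal
      n = len(f)

      fhat = np.fft.fft(f, n)
      PSD = (fhat * np.conj(fhat)).real / n   # ex1 scale: power spectrum
      amp = 2 * np.abs(fhat) / n              # ex2/ex3-style amplitude scale
      # both spectra peak at the 50 Hz and 120 Hz bins; only the scaling differs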

FFT_ex3.py

def FFT(sample_rate, duration, signal):
    ...
    return xf, amplitude_Hz, phase_ang

  • 😟 Is the ex3 result the same as MATLAB's positiveFFT? The MATLAB code is shown below.
    • Ans: not perfectly the same as the result of ex3.

    • Reason: the MATLAB code below normalizes the data as 'X=fft(x)/N*2; % normalize the data', whereas the Python [ex3] code takes the magnitude values after the 'abs' function ('amplitude_Hz = 2*abs(Y)'). A Python translation sketch follows the MATLAB listing.
      
      % test.m
      clear;clc;
      dt = 0.001; 
      t= 0:dt:1;
      noise = 2.5 .*rand(length(t),1 );
      x= sin(2*pi*50*t) + sin(2*pi*120*t);
      
      positiveFFT(x',length(x'));
      
      Magnitude=(abs(ans))';
      Order=0:1:length(Magnitude)-1;
      bar(Order, Magnitude);
      
      grid on;
      xlabel('Harmonic Order','FontSize',15);
      ylabel('Mag','FontSize',15);
      set(gca,'FontSize',15);
      grid on;
      % positiveFFT.m
      function [X,freq]=positiveFFT(x,Fs)
         N=length(x); %get the number of points
         k=0:N-1;     %create a vector from 0 to N-1
         T=N/Fs;      %get the frequency interval
         freq=k/T;    %create the frequency range
         X=fft(x)/N*2; % normalize the data
         if X(1)~=X(N)
             X(1)=X(1)/2;  % do not double the DC bin
         end
         %only want the first half of the FFT, since it is redundant
         cutOff = ceil(N/2); 
      
         %take only the first half of the spectrum
         X = X(1:cutOff);
         freq = freq(1:cutOff);
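
For reference, here is a hedged Python translation of the positiveFFT normalization above (my sketch, not code from the repository):

import numpy as np

def positive_fft(x, fs):
    # Python sketch of the MATLAB positiveFFT above (illustrative)
    n = len(x)
    freq = np.arange(n) * fs / n    # frequency bins: k/T with T = n/fs
    X = np.fft.fft(x) / n * 2       # normalize as in the MATLAB code
    X[0] = X[0] / 2                 # the DC bin must not be doubled
    cutoff = int(np.ceil(n / 2))    # the second half is redundant
    return X[:cutoff], freq[:cutoff]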







Sampler-and-Generator.py

You can do [1] and [2] using this file.
[1] Sampling
Sampling from Github_set.csv
[2] Generator
You can generate extra columns like the ones below in the newly generated .csv file.

The columns to be stacked can be set in the part shown below; a concatenation sketch follows the code.

StackedNum = 4               # TobeStacked's copy Num
Samp_Num = 12 * StackedNum   # Sampling from [Github_gener_ex.csv] file.

TobeStacked_all = []

# 12
TobeStacked = np.array([[0, 0, 3, 0, 0, 3], [0, 0, 3, 0, 0, -3], [0, 0, -3, 0, 0, 3], [0, 0, -3, 0, 0, -3],
                        [0, 3, 0, 0, 3, 0], [0, 3, 0, 0, -3, 0], [0, -3, 0, 0, 3, 0], [0, -3, 0, 0, -3, 0],
                        [3, 0, 0, 3, 0, 0], [3, 0, 0, -3, 0, 0], [-3, 0, 0, 3, 0, 0], [-3, 0, 0, -3, 0, 0]])
TobeStacked = TobeStacked.reshape(-1, 6)
TobeStacked = pd.DataFrame(TobeStacked)
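
A minimal sketch of how the stacked copies could be concatenated (my assumption about the generator's behavior; the template rows are truncated for brevity):

import numpy as np
import pandas as pd

StackedNum = 4
TobeStacked = pd.DataFrame(np.array([[0, 0, 3, 0, 0, 3],
                                     [0, 0, 3, 0, 0, -3]]))   # truncated template
# stack StackedNum copies of the template on top of each other
TobeStacked_all = pd.concat([TobeStacked] * StackedNum, ignore_index=True)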







SVM/SVM_classifier.py

svm_classifier using scikit-learn

If you have more than 3 data groups (classes),
you should use the code below to ensure the visualization works.

z = clf.predict(xy)

If you have fewer than 3 data groups,
you can instead use the code below to visualize the margins as well (a full plotting sketch follows).

z = clf.decision_function(xy)
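
Here is a minimal, self-contained sketch of the binary case (the toy dataset and the grid names xx, yy, xy are illustrative, not from the repository code):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

X, y = datasets.make_blobs(n_samples=100, centers=2, random_state=0)
clf = svm.SVC(kernel='linear').fit(X, y)

# evaluation grid over the data range
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
xy = np.c_[xx.ravel(), yy.ravel()]

# binary case: decision_function gives signed distances,
# so the margins can be drawn at levels -1 and +1
z = clf.decision_function(xy).reshape(xx.shape)
plt.contour(xx, yy, z, levels=[-1, 0, 1], linestyles=['--', '-', '--'])
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()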







PCA_Reconstructing > pca_various_plt_test.py

image

The description below is based on this reference:

How to reverse PCA and reconstruct original variables from several principal components?

PCA computes eigenvectors of the covariance matrix ("principal axes") and sorts them by their eigenvalues (amount of explained variance). The centered data can then be projected onto these principal axes to yield principal components ("scores"). For the purposes of dimensionality reduction, one can keep only a subset of principal components and discard the rest. (See here for a layman's introduction to PCA.)

Let Xraw be the n×p data matrix with n rows (data points) and p columns (variables, or features). After subtracting the mean vector μ from each row, we get the centered data matrix X. Let V be the p×k matrix of some k eigenvectors that we want to use; these would most often be the k eigenvectors with the largest eigenvalues. Then the n×k matrix of PCA projections ("scores") will be simply given by Z=XV.

This is illustrated on the figure below: the first subplot shows some centered data (the same data that I use in my animations in the linked thread) and its projections on the first principal axis. The second subplot shows only the values of this projection; the dimensionality has been reduced from two to one:

image

In order to be able to reconstruct the original two variables from this one principal component, we can map it back to p dimensions with V⊤. Indeed, the values of each PC should be placed on the same vector as was used for projection; compare subplots 1 and 3. The result is then given by X^=ZV⊤=XVV⊤. I am displaying it on the third subplot above. To get the final reconstruction X^raw, we need to add the mean vector μ to that:

PCA reconstruction = PC scores ⋅ Eigenvectors⊤ + Mean. Note that one can go directly from the first subplot to the third one by multiplying X with the VV⊤ matrix; it is called a projection matrix. If all p eigenvectors are used, then VV⊤ is the identity matrix (no dimensionality reduction is performed, hence "reconstruction" is perfect). If only a subset of eigenvectors is used, it is not identity.

This works for an arbitrary point z in the PC space; it can be mapped to the original space via x^=zV⊤.
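
A minimal NumPy sketch of these formulas (the toy data is my assumption):

import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 5))     # n x p toy data

mu = X_raw.mean(axis=0)
X = X_raw - mu                        # centered data

# principal axes via SVD (right singular vectors = eigenvectors of cov)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
V = Vt[:k].T                          # p x k matrix of the top-k axes

Z = X @ V                             # n x k scores: Z = XV
X_hat_raw = Z @ V.T + mu              # reconstruction: Z V^T + mean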

Discarding (removing) leading PCs

Sometimes one wants to discard (to remove) one or few of the leading PCs and to keep the rest, instead of keeping the leading PCs and discarding the rest (as above). In this case all the formulas stay exactly the same, but V should consist of all principal axes except for the ones one wants to discard. In other words, V should always include all PCs that one wants to keep.

Caveat about PCA on correlation

When PCA is done on correlation matrix (and not on covariance matrix), the raw data Xraw is not only centered by subtracting μ but also scaled by dividing each column by its standard deviation σi. In this case, to reconstruct the original data, one needs to back-scale the columns of X^ with σi and only then to add back the mean vector μ.

Image processing example

This topic often comes up in the context of image processing. Consider Lenna -- one of the standard images in image processing literature (follow the links to find where it comes from). Below on the left, I display the grayscale variant of this 512×512 image.

image

We can treat this grayscale image as a 512×512 data matrix Xraw. I perform PCA on it and compute X^raw using the first 50 principal components. The result is displayed on the right.
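
A minimal sketch of this image experiment (scikit-learn's PCA is used for brevity, and a random array stands in for the grayscale image):

import numpy as np
from sklearn.decomposition import PCA

img = np.random.rand(512, 512)        # stand-in for the 512x512 grayscale image
pca = PCA(n_components=50).fit(img)   # rows are treated as data points
img_hat = pca.inverse_transform(pca.transform(img))   # keep the first 50 PCs only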








Data save to a text file

  • It creates a text file with the target data.

  • If you use the code below, you can split the data into separate lines.

        # file_1, file_2: open text-file handles; data1: the target data sequence
        for i in range(len(data1)):
            if i % 10 == 0 and i > 0:    # start a new line every 10 values
                file_1.write('\n')
                file_1.write(',')
                file_2.write('\n')
                file_2.write(',')
            elif i % 10 != 0 and i > 0:  # otherwise separate values with commas
                file_1.write(',')
                file_2.write(',')
    • All data are converted to strings before being written.

CompSens (Compressive Sensing)

Real time-domain plot

image

DCT domain plot

image
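
A minimal sketch of producing both plots (illustrative; it uses SciPy's dct, which is an assumption about the repository's implementation):

import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import dct

t = np.linspace(0, 1, 1000)
x = np.cos(2*np.pi*50*t) + np.cos(2*np.pi*120*t)   # toy signal

X_dct = dct(x, norm='ortho')   # many natural signals are sparse in the DCT domain

fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.plot(t, x)
ax1.set_title('time domain')
ax2.plot(X_dct)
ax2.set_title('DCT domain')
plt.show()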

Spectrogram_plot

If you want to place low-frequency results at the bottom, activate the code below (a complete sketch follows it).

spec = np.flip(spec, axis=0)
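
A minimal, self-contained sketch (the test signal and the SciPy call are my assumptions, not the repository code):

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

fs = 10000
t = np.arange(0, 1, 1/fs)
x = np.sin(2*np.pi*100*t) + np.sin(2*np.pi*2000*t)

f, tt, spec = signal.spectrogram(x, fs)
spec = np.flip(spec, axis=0)               # low frequencies now at the bottom
plt.imshow(10*np.log10(spec + 1e-12), aspect='auto')
plt.show()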

Results are shown for 100 Hz, 200 Hz, 2000 Hz, and 10000 Hz.

wavelet_trans_basic_ex (= Continuous Wavelet Transform (CWT))








Path Example








CWT_v2 - ex

Mexican Hat Wavelet

What to consider

It outputs a constant shape with a normalization factor, meaning it has an 'energy' unit.

For this reason, if you want to plot the outputs, you can plot them directly.

plt.imshow(cwtmatr, cmap='viridis', aspect='auto',
           vmax=abs(cwtmatr).max(), vmin=-abs(cwtmatr).max())

Form

$$ \psi(t) = \frac{2}{\sqrt{3}\,\sqrt[4]{\pi}} \left(1 - t^2\right) e^{-t^2/2} $$

Complex Morlet Wavelets

What to consider

This wavelet has an imaginary part.

In cmorB-C, B and C represent:

  • B → bandwidth

  • C → center frequency

Also, if you want to plot the outputs, you should take the imaginary part into account.

In this code, I handle it as below.

    power = (abs(cwtmatr)) ** 2
    ...
    wt_power_results = np.log2(power)

Form

$$ \psi(t) = \frac{1}{\sqrt{\pi B}}\, e^{-t^2/B}\, e^{j 2 \pi C t} $$

How to set 'B' and 'C'?

The figure below is from the reference site above. It says that reasonable results come from scales above 2. In other words, when C = 1 and scale = 1, the highest frequency being analyzed equals the sampling frequency, which violates the Nyquist theorem (i.e., it cannot give a reasonable result).

image

How to set 'scales'?

In this project, 'scales' is set in the following way.

scale_min = 2
#scale_max = 300
scale_max = 130
scales = np.arange(scale_min, scale_max)

If the scales contain large values, the wavelet is stretched and becomes sensitive to the lower frequencies in the signal.

As mentioned above, the starting scale is recommended to be above 2 to avoid the Nyquist limitation.
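
Putting the pieces together, here is a minimal CWT sketch with PyWavelets (the signal and the wavelet name cmor1.5-1.0 are illustrative choices):

import numpy as np
import pywt

fs = 1000
t = np.arange(0, 1, 1/fs)
x = np.sin(2*np.pi*50*t)                   # toy signal

scales = np.arange(2, 130)                 # start above 2 (Nyquist caveat)
cwtmatr, freqs = pywt.cwt(x, scales, 'cmor1.5-1.0', sampling_period=1/fs)
power = np.abs(cwtmatr) ** 2               # complex output, so use |.|^2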

Running Results

  • If you want to see the original figure, refer to the PPT in this project.
  • Analysis
    • If the scale range includes large values, the output can be segmented more finely.
    • If the bandwidth of cmor is set high, the output can be segmented more finely in the time domain (but becomes more sensitive).
    • Caution: the frequency axes differ between the graphs (because the plot functions are configured differently).

image

CWT_v2 - ex2

You can verify the relation between two pairs of quantities:

  • the relation between frequencies and scales
  • the relation between the center frequency and scales

As mentioned above, if you set C = 1, reasonable results come from scales above 2.

If you want to get reasonable results from scale 1, you are recommended to set C below 1 (e.g., C = 0.5).
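
This relation can be checked directly with pywt.scale2frequency (a sketch; the sampling period is an assumed value):

import numpy as np
import pywt

dt = 1/1000                                    # assumed sampling period
scales = np.arange(1, 131)
for name in ['cmor1.5-1.0', 'cmor1.5-0.5']:    # C = 1 vs C = 0.5
    freqs = pywt.scale2frequency(name, scales) / dt
    print(name, freqs[:3])   # with C = 0.5, scale 1 maps to fs/2 instead of fs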

image

DCT

# assuming X = np.fft.fft(x), N = len(x), n_oneside = N // 2
# normalize the amplitude
X_mag = abs(X[:n_oneside]) * 1/n_oneside
X_mag[0] = X_mag[0]/2

# same as above (general form), since n_oneside = N/2
# normalize the amplitude
X_mag = abs(X[:n_oneside]) * 2/N
X_mag[0] = X_mag[0]/2

Results

  • Original signal

import numpy as np

# sampling rate
sr = 60
# sampling interval
ts = 1.0/sr
t = np.arange(0, 1, ts)

x = 1                             # DC component (amplitude 1 at 0 Hz)
freq = 1.
x += 2*np.sin(2*np.pi*freq*t)
freq = 2
x += 3*np.sin(2*np.pi*freq*t)
freq = 3
x += 4*np.sin(2*np.pi*freq*t)
freq = 29
x += 5*np.sin(2*np.pi*freq*t)
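
Applying the one-sided normalization from above to this signal (a sketch; it reuses x and the numpy import from the snippet above):

X = np.fft.fft(x)
N = len(x)
n_oneside = N // 2
X_mag = np.abs(X[:n_oneside]) * 2 / N
X_mag[0] = X_mag[0] / 2
# X_mag recovers the amplitudes 1, 2, 3, 4, 5 at 0, 1, 2, 3 and 29 Hz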

image

image

image
