Some important Python libraries

In [None]:
!pip install numpy pandas scikit-learn matplotlib seaborn scipy

### Data Visualization
OS → Excel <br>
Python → Pandas

In [None]:
import pandas as pd

# Creating a DataFrame with house data
data = {
    'Bedrooms': [2, 3, 4, 3, 5],         # Feature 1
    'Bathrooms': [1, 2, 2, 1, 3],        # Feature 2
    'SquareFeet': [1500, 2000, 2500, 1800, 3000],  # Feature 3
    'Price': [200000, 250000, 300000, 220000, 350000]  # Target variable
}

# Create the DataFrame
house_data = pd.DataFrame(data)

# Display the DataFrame
display(house_data)


Loading Diabetes dataset (normalized) from sklearn

age: Age in years <br>
sex: Gender of the patient <br>
bmi: Body mass index <br>
bp: Average blood pressure <br>
s1: Total serum cholesterol (tc) <br>
s2: Low-density lipoproteins (ldl) <br>
s3: High-density lipoproteins (hdl) <br>
s4: Total cholesterol / HDL (tch) <br>
s5: Possibly log of serum triglycerides level (ltg) <br>
s6: Blood sugar level (glu) <br>

In [None]:
from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset
diabetes = load_diabetes()

# Create a DataFrame from the dataset
diabetes_data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Add the target variable (disease progression)
diabetes_data['DiseaseProgression'] = diabetes.target

# Display the first few rows of the dataset
diabetes_data.head()


In [None]:
diabetes_data.info()  # General information about columns

In [None]:
diabetes_data.describe()  # Summary statistics for numerical data

Simple visualization of your dataframe
* Bar Plot: Displays how house prices vary with the number of bedrooms. Each bar represents a house, and the height indicates the price.
* Scatter Plot: Shows if there is a positive or negative relationship between square footage and price. We expect larger houses to generally have higher prices.
* Histogram: Shows how house prices are distributed across the dataset.
* Correlation Heatmap: This shows the relationship between all features. A correlation closer to 1 means a strong positive relationship (e.g., SquareFeet and Price might be highly correlated).
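
As a numeric companion to the heatmap description above, the following sketch computes the correlation matrix for the toy house data from the first cell. The frame is rebuilt here so the snippet runs on its own; the names `house`, `corr`, and `price_corr` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same toy house data as in the earlier cell, rebuilt so the snippet is self-contained
house = pd.DataFrame({
    'Bedrooms': [2, 3, 4, 3, 5],
    'Bathrooms': [1, 2, 2, 1, 3],
    'SquareFeet': [1500, 2000, 2500, 1800, 3000],
    'Price': [200000, 250000, 300000, 220000, 350000],
})

# Pearson correlation between every pair of columns
corr = house.corr()

# Correlation of each feature with the target, strongest first
price_corr = corr['Price'].drop('Price').sort_values(ascending=False)
print(price_corr)
```

With only five rows these correlations are illustrative at best; on a real dataset the same two lines give a quick first look at which features move together with the target.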

In [None]:
import matplotlib.pyplot as plt

# Scatter plot for BMI vs Disease Progression
diabetes_data.plot(kind='scatter', x='bmi', y='DiseaseProgression', title='BMI vs Disease Progression', color='blue')

plt.ylabel('Disease Progression')
plt.show()


In [None]:
# Histogram of Disease Progression
diabetes_data['DiseaseProgression'].plot(kind='hist', bins=30, title='Distribution of Disease Progression', color='green')

plt.xlabel('Disease Progression')
plt.show()


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
corr_matrix = diabetes_data.corr()

# Set the figure size (e.g., 12x8 inches)
plt.figure(figsize=(12, 8))

# Visualize the correlation matrix using a heatmap
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")

plt.title("Feature Correlation Heatmap", fontsize=16)
plt.show()


In [None]:
# Distribution plot for Age
sns.histplot(diabetes_data['age'], kde=True, color='purple')

plt.title('Distribution of Age in the Dataset')
plt.xlabel('Age')
plt.show()


Here are the key reasons why **normalization of features** is important in machine learning:

### 1. **Improving Model Performance**
   - **Prevents Dominance by Features with Larger Scales**: In datasets where features have different ranges (e.g., one feature could range from 0 to 1, while another could range from 0 to 10,000), models like k-nearest neighbors (k-NN), support vector machines (SVM), and regularized linear regression might weigh the larger feature more heavily, even if it’s not more important. (Plain least-squares linear regression is an exception: its predictions do not change when a feature is rescaled, only its coefficients do.)
   - **Ensures Equal Contribution**: Normalizing scales down all features to a similar range, making sure each feature contributes equally to the learning process.

### 2. **Faster Convergence in Gradient Descent**
   - **Helps Optimization Algorithms**: For models that use gradient descent (like logistic regression, neural networks, etc.), normalization speeds up convergence by allowing the optimization algorithm to take more even steps in all directions. Features with large values can slow down or complicate the optimization process.
   - **Avoids Slow Learning in Some Directions**: If one feature has a much larger scale than others, gradient descent will oscillate inefficiently along the larger feature axis, slowing down learning.

### 3. **Improving Accuracy in Distance-Based Algorithms**
   - **Important for Distance Metrics**: Algorithms like k-NN, SVMs, and clustering techniques (e.g., K-means) rely on distance calculations between data points. Features with larger ranges can dominate the distance metric, skewing the results.
   - **Creates Balanced Influence**: Normalizing ensures that all features have equal influence on the distance computations, leading to better and more accurate results.
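
To make the distance argument concrete, here is a minimal sketch with three made-up samples (the values and the helper `euclidean` are assumptions for illustration). On the raw data, the large-scale first column decides which sample is closest to sample 0; after standardizing each column, the second feature gets a say and the nearest neighbor changes.

```python
import numpy as np

# Three hypothetical samples: column 0 is on a scale of tens of thousands,
# column 1 lies between 0 and 1
X = np.array([
    [50_000.0, 0.10],
    [50_500.0, 0.90],   # close to sample 0 in column 0, far in column 1
    [50_800.0, 0.12],   # farther in column 0, almost identical in column 1
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from sample 0: column 0 swamps column 1
d01_raw = euclidean(X[0], X[1])
d02_raw = euclidean(X[0], X[2])

# Standardize each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d01_std = euclidean(X_std[0], X_std[1])
d02_std = euclidean(X_std[0], X_std[2])

print(f"raw:          d(0,1)={d01_raw:.2f}  d(0,2)={d02_raw:.2f}")
print(f"standardized: d(0,1)={d01_std:.3f}  d(0,2)={d02_std:.3f}")
```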

### 4. **Required for Regularization**
   - **Avoids Feature-Scale Bias in Regularization**: When using models that apply regularization (e.g., Ridge or Lasso regression), normalization is crucial. The penalty acts on the magnitude of the coefficients, and a feature measured on a small scale needs a larger coefficient to have the same effect on the prediction. Without normalization, such features are penalized more heavily simply because of their units, leading to suboptimal models.
   
### 5. **Better Interpretation of Coefficients**
   - **Makes Coefficients Comparable**: In linear models, the learned coefficients represent the importance of each feature. Without normalization, interpreting these coefficients becomes tricky, as they are affected by the feature's scale. Normalization puts all features on the same scale, so the magnitude of the coefficients better reflects the feature importance.

### 6. **Prepares Data for Neural Networks**
   - **Helps Activation Functions**: Saturating activation functions such as sigmoid and tanh have near-zero gradients for inputs of large magnitude, and even with ReLU, badly scaled inputs can make training unstable. Normalization keeps the inputs to the network within a small, standardized range (typically around -1 to 1), improving training performance and stability.

### 7. **Avoids Numerical Issues**
   - **Avoids Large Number Handling**: If some features have extremely large values, they can cause loss of floating-point precision or even overflow during model training. Normalization avoids such problems by keeping the values within manageable ranges.

### 8. **Required for PCA and Other Dimensionality Reduction Methods**
   - **Improves Variance Interpretation**: Principal Component Analysis (PCA) and other dimensionality reduction techniques aim to capture the variance in data. Without normalization, features with larger scales would dominate the variance, leading to poor results. Normalization ensures that PCA treats all features equally and captures meaningful variance.
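
The effect on PCA can be sketched with NumPy alone. The example below uses two synthetic, correlated features on very different scales (the data and the helper `first_component` are assumptions for illustration) and compares the first principal direction from the covariance matrix (raw data) with the one from the correlation matrix (equivalent to PCA on standardized data).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two correlated hypothetical features on very different scales
a = rng.normal(0.0, 1000.0, n)               # values in the thousands
b = 0.001 * a + rng.normal(0.0, 0.5, n)      # correlated with a, but of order 1
X = np.column_stack([a, b])

def first_component(M):
    """Return the unit eigenvector of symmetric M with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues come back in ascending order
    return eigvecs[:, -1]

# PCA on raw data: eigen-decomposition of the covariance matrix
pc_raw = first_component(np.cov(X, rowvar=False))

# PCA on standardized data: eigen-decomposition of the correlation matrix
pc_std = first_component(np.corrcoef(X, rowvar=False))

print("raw loadings:         ", np.abs(pc_raw))
print("standardized loadings:", np.abs(pc_std))
```

On the raw data the first component points almost entirely along the large-scale feature; after standardization both features load equally on it.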

### 9. **Improves Training Stability**
   - **Avoids Instability Due to Feature Imbalance**: In some models, feature imbalance (due to differences in scale) can cause instability in model training, leading to divergent results or poor performance. Normalization stabilizes the learning process by aligning all feature scales.

### Conclusion:
Normalization helps balance feature contributions, speeds up model convergence, and ensures better performance for distance-based models and optimization algorithms. It is a crucial preprocessing step for many machine learning models to function efficiently and accurately.
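
As a minimal sketch of the two most common scaling schemes, the snippet below applies z-score standardization and min-max scaling to the house features from the first cell. The train/test split and all variable names are assumptions for illustration. The statistics are computed on the training rows only and then reused for the held-out row, which is the usual way to avoid leaking test information into preprocessing.

```python
import numpy as np

# Features: Bedrooms, Bathrooms, SquareFeet (rows taken from the house example)
X_train = np.array([
    [2, 1, 1500],
    [3, 2, 2000],
    [4, 2, 2500],
    [3, 1, 1800],
], dtype=float)
X_test = np.array([[5, 3, 3000]], dtype=float)

# Z-score standardization: subtract the mean, divide by the standard deviation
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std           # reuse the training statistics

# Min-max scaling: map the training range of each feature onto [0, 1]
lo = X_train.min(axis=0)
hi = X_train.max(axis=0)
X_train_mm = (X_train - lo) / (hi - lo)
X_test_mm = (X_test - lo) / (hi - lo)        # may fall outside [0, 1]

print(X_train_std.round(2))
print(X_test_mm.round(2))
```

Note that the held-out house is larger than anything in the training rows, so its min-max scaled values exceed 1. This is expected behavior rather than a bug.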

### Imputation

In [None]:
import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {
    'Age': [25, np.nan, 35, 28, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 52000, 48000, np.nan],
    'City': ['New York', 'San Francisco', 'Los Angeles', np.nan, 'Chicago', 'Miami']
}

df = pd.DataFrame(data)

# Display the DataFrame with missing values
display(df)


In [None]:
df.info()

In [None]:
# Dropping rows with missing values
df_dropped = df.dropna()

# Display the DataFrame after dropping missing values
display(df_dropped)


In [None]:
# Filling missing values in 'Age' column with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Display the DataFrame after filling missing Age values
display(df)


In [None]:
# Filling missing values in 'Salary' column with the median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Display the DataFrame after filling missing Salary values
display(df)


In [None]:
# Filling missing values in 'City' column with the mode (most frequent value)
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Display the DataFrame after filling missing City values
display(df)


### Conclusion on Handling Missing Values

1. **Dropping Rows**:
   - **When to Use**: Dropping rows is useful when the dataset is large, and only a small proportion of rows have missing values. In these cases, removing rows with missing data won’t significantly impact the quality of the dataset or the model’s performance. However, this method is not suitable when the missing data is significant or contains important information.
   - **Caution**: Be careful when using this approach, as it may introduce bias if the missing data has a pattern or is not randomly distributed.

2. **Filling with Mean/Median/Mode (Imputation)**:
   - **Mean Imputation**: Appropriate for numerical data with no significant skew. This method fills in missing values with the average of the non-missing data points. It’s simple and maintains the distribution of data fairly well when the distribution is close to normal.
     - **When to Use**: Use mean imputation when the feature’s data is normally distributed and there are few outliers.
   
   - **Median Imputation**: More suitable for **skewed data**, as it is less affected by outliers. This method replaces missing values with the median of the existing data. Median imputation is robust and is often preferred for features with highly skewed distributions (e.g., income, housing prices).
     - **When to Use**: Use median imputation when the data has outliers or a non-symmetric (skewed) distribution, as it provides a better representation of central tendency in such cases.

   - **Mode Imputation**: Common for **categorical features**. The mode, or the most frequent value in a column, is used to fill in missing data. This is particularly useful for categorical data where it makes sense to fill missing values with the most common category (e.g., filling missing values in a column with "Male" or "Female" based on the majority).
     - **When to Use**: Use mode imputation for categorical variables, where the most frequent value is a reasonable assumption for filling in missing data.

3. **Advanced Imputation (Scikit-learn's SimpleImputer)**:
   - **What it Does**: `SimpleImputer` from `scikit-learn` provides more flexibility and control over how missing data is handled. It can impute missing values using strategies like mean, median, mode (most frequent), or even a constant value.
   - **When to Use**: This method is useful for integrating imputation into a machine learning pipeline, where missing data is automatically handled during model training and testing.
     - **Benefits**: `SimpleImputer` can handle missing data across multiple columns and ensures consistency in imputation when transforming datasets for model training.

   - **Alternative Techniques**: More advanced techniques include **K-nearest neighbors imputation** or **regression-based imputation**, where missing values are predicted based on other features in the dataset. These techniques can sometimes offer better accuracy but come with increased complexity.
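
The `SimpleImputer` workflow described above can be sketched as follows. The example reuses the numeric columns of the earlier DataFrame; the second frame `df_new` is made up to show that the medians learned during `fit` are reused at transform time.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric columns of the example DataFrame used earlier
df_num = pd.DataFrame({
    'Age': [25, np.nan, 35, 28, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 52000, 48000, np.nan],
})

# Learn one median per column, ignoring missing entries
imputer = SimpleImputer(strategy='median')
imputer.fit(df_num)
print("Learned medians:", imputer.statistics_)

# Fill the gaps in the data the imputer was fitted on
df_filled = pd.DataFrame(imputer.transform(df_num), columns=df_num.columns)

# Hypothetical new data: gaps are filled with the medians learned above
df_new = pd.DataFrame({'Age': [np.nan, 30], 'Salary': [55000, np.nan]})
df_new_filled = pd.DataFrame(imputer.transform(df_new), columns=df_new.columns)
print(df_new_filled)
```

For categorical columns, `strategy='most_frequent'` plays the role of mode imputation, and `strategy='constant'` together with `fill_value` inserts a fixed placeholder.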

## Time-series

In [None]:
# Generate an artificial time series signal
# and a time column in the Zurich timezone

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pytz
from datetime import datetime, timedelta

# Generate an artificial time series signal with peaks and dips
np.random.seed(42)  # For reproducibility
time_points = 1000  # Number of time points
time_series = np.linspace(0, 50, time_points)  # Simulating a time axis

# Creating a synthetic signal with peaks and dips
signal = (
    np.sin(time_series) * np.sin(0.5 * time_series) * 10 +
    np.cos(2 * time_series) * 2 +
    np.random.normal(scale=0.5, size=time_points)
)

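# Create a time column in Zurich timezone (CET/CEST)
# localize() applies the correct UTC offset; passing a pytz zone via tzinfo= would use local mean time
zurich = pytz.timezone('Europe/Zurich')
start_time = zurich.localize(datetime(2024, 10, 3, 5, 0))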
time_column = [start_time + timedelta(minutes=15 * i) for i in range(time_points)]

# Create a DataFrame with the time and signal
df_time_series = pd.DataFrame({
    'Time': time_column,
    'Time_series': signal
})

# Plot the artificial signal
plt.figure(figsize=(12, 6))
plt.plot(df_time_series['Time'], df_time_series['Time_series'], label='Artificial Signal', color='blue')
plt.title('An Exemplary Time Series')
plt.xlabel('Time (Zurich)')
plt.ylabel('Time_series')
plt.grid(True)
plt.legend()
plt.show()

# Display the first few rows of the DataFrame
df_time_series.head()


In [None]:
# Create a new column 'Timestamp' by converting the 'Time' column to UNIX timestamps
df_time_series['Timestamp'] = df_time_series['Time'].apply(lambda x: x.timestamp())

# Display the first few rows of the DataFrame with the new 'Timestamp' column
df_time_series[['Timestamp', 'Time_series']].head()


In [None]:
# Define the start and end datetimes (localize() applies the correct Zurich UTC offset)
zurich = pytz.timezone('Europe/Zurich')
start_datetime = zurich.localize(datetime(2024, 10, 3, 6, 0))  # Example start datetime
end_datetime = zurich.localize(datetime(2024, 10, 3, 7, 0))    # Example end datetime

# Convert the datetimes to UNIX timestamps
start_timestamp = start_datetime.timestamp()
end_timestamp = end_datetime.timestamp()

# Select the chunk of the DataFrame between the two timestamps
signal_chunk = df_time_series[(df_time_series['Timestamp'] >= start_timestamp) &
                              (df_time_series['Timestamp'] <= end_timestamp)]

# Display the filtered DataFrame
signal_chunk[['Time', 'Time_series', 'Timestamp']]


# Sampling
The Nyquist–Shannon sampling theorem is a fundamental principle in the field of signal processing and information theory. It states that for a continuous-time signal to be properly reconstructed from its sampled values, the sampling rate must be greater than twice the highest frequency component present in the signal.

Mathematically, the theorem can be expressed as:

f<sub>s</sub> > 2 * f<sub>max</sub>

Where:

- f<sub>s</sub> is the sampling rate (samples per second)
- f<sub>max</sub> is the maximum frequency present in the signal
- The boundary value 2 * f<sub>max</sub> is known as the Nyquist rate.

If f<sub>s</sub> > 2 * f<sub>max</sub> is not met, a phenomenon known as aliasing occurs. Aliasing is the effect where high-frequency components in the original signal appear as lower frequencies in the sampled signal, leading to distortion and incorrect reconstruction of the original signal.
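
Aliasing can be checked numerically with a short sketch. Here a 50 Hz sine is sampled at 60 Hz, well below the required 100 Hz; the specific frequencies and variable names are chosen for illustration. The spectrum of the samples peaks at the alias frequency, which for a single tone can be predicted by folding the true frequency around the nearest multiple of the sampling rate.

```python
import numpy as np

f_signal = 50.0   # Hz, true frequency of the sine wave
fs = 60.0         # Hz, sampling rate (too low: 2 * 50 Hz = 100 Hz would be needed)
n_samples = 60    # one second of samples

# Sample the sine wave at the chosen rate
n = np.arange(n_samples)
x = np.sin(2 * np.pi * f_signal * n / fs)

# Locate the strongest frequency in the sampled signal
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(n_samples, d=1 / fs)
f_observed = freqs[np.argmax(spectrum)]

# Predicted alias: distance to the nearest integer multiple of fs
f_predicted = abs(f_signal - round(f_signal / fs) * fs)

print(f"Observed peak: {f_observed:.1f} Hz, predicted alias: {f_predicted:.1f} Hz")
```

From the samples alone, the 50 Hz tone is indistinguishable from a 10 Hz tone.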


The Nyquist–Shannon sampling theorem has important implications in various fields, such as digital signal processing, audio and video processing, communications, and data acquisition systems. It provides a theoretical foundation for determining the minimum sampling rate required to properly represent and reconstruct continuous-time signals in the digital domain.

* Question: Why are 44100Hz and 48000Hz sampling rates quite popular for audio signals?


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

# Create a signal with two different frequencies
fs = 10000  # Original sampling frequency
t = np.linspace(0, 1, fs, endpoint=False)  # Time vector
f1 = 5  # Frequency of the first sine wave
f2 = 50  # Frequency of the second sine wave
signal = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

# Define sampling rates around the Nyquist rate
nyquist_rate = 2 * f2  # 2 * highest frequency
sampling_rates = [nyquist_rate / 4, nyquist_rate / 2, nyquist_rate, 1.5 * nyquist_rate, 2 * nyquist_rate, 3 * nyquist_rate]

fig, ax = plt.subplots(figsize=(10, 6))
line, = ax.plot([], [], lw=2, label='Sampled Signal')
points, = ax.plot([], [], 'ro', label='Intersection Points')
ax.plot(t, signal, 'k--', lw=1, label='Original Signal')
ax.set_xlim(0, 1)
ax.set_ylim(-2, 2)
ax.set_title('Effect of Different Sampling Rates on Signal')
ax.set_xlabel('Time [s]')
ax.set_ylabel('Amplitude')

# Position legend outside the plot area
ax.legend(loc='upper left', bbox_to_anchor=(0.9, 1.15), borderaxespad=0.)

def init():
    line.set_data([], [])
    points.set_data([], [])
    return line, points

def animate(i):
    rate = sampling_rates[i]
    t_sampled = t[::int(fs/rate)]

    signal_sampled = np.sin(2 * np.pi * f1 * t_sampled) + np.sin(2 * np.pi * f2 * t_sampled)
    line.set_data(t_sampled, signal_sampled)

    # Find intersection points
    if rate <= nyquist_rate:
        t_intersections = np.intersect1d(t, t_sampled)
        signal_intersections = np.sin(2 * np.pi * f1 * t_intersections) + np.sin(2 * np.pi * f2 * t_intersections)
        points.set_data(t_intersections, signal_intersections)
    else:
        points.set_data([], [])

    ax.set_title(f'Sampling Rate: {rate} Hz')

    return line, points

ani = FuncAnimation(fig, animate, init_func=init, frames=len(sampling_rates), interval=1000, blit=True)

# Clear the current figure before displaying the animation
plt.close(fig)

# Display the animation
HTML(ani.to_jshtml())


### Missing data in time-series and resampling
Resampling your data onto a uniform time grid is usually a good idea, even if it does not contain missing chunks. It ensures that the samples of the time series are evenly spaced in time, which many pre-processing steps, such as digital filtering, assume.
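
A minimal sketch of this idea, assuming a signal whose timestamps are only approximately regular: the jittered timestamps, the 20 Hz target rate, and all names below are made up for illustration, and `np.interp` (linear interpolation) stands in for whichever interpolation method suits the data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical irregular timestamps: nominally 10 Hz, with timing jitter
t_nominal = np.arange(50) * 0.1
t_irregular = t_nominal + rng.uniform(-0.03, 0.03, t_nominal.size)
values = np.sin(2 * np.pi * 0.5 * t_irregular)   # a 0.5 Hz sine observed at those times

# Uniform target grid at 20 Hz, restricted to the observed time span
fs_new = 20.0
t_uniform = np.arange(t_irregular[0], t_irregular[-1], 1 / fs_new)

# Linear interpolation onto the uniform grid (timestamps must be increasing)
values_uniform = np.interp(t_uniform, t_irregular, values)

# The resampled signal now has a constant sampling interval
print("Original intervals: min %.3f s, max %.3f s" % (np.diff(t_irregular).min(), np.diff(t_irregular).max()))
print("Resampled interval: %.3f s" % np.diff(t_uniform).mean())
```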

In [None]:
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

# Generating example data (time and values)
time_original_1 = np.arange(1, 9, 1/20)  # Original time array (20 Hz)
values_original_1 = np.sin(0.5 * np.pi * time_original_1)  # Example values

time_original_2 = np.arange(20, 29, 1/20)  # Original time array (20 Hz)
values_original_2 = np.sin(2 * np.pi * time_original_2)  # Example values

# Combine the two sets of example data together
# Insert NaN to break the line between chunks
combined_time = np.concatenate((time_original_1, [np.nan], time_original_2))
combined_values = np.concatenate((values_original_1, [np.nan], values_original_2))

# Define new time array with a smaller time interval (100 Hz)
time_new = np.arange(0, 30, 1/100)  # New time array (100 Hz)

# Use interp1d on the valid samples only (NaN in the time array gives undefined results)
valid = ~np.isnan(combined_time)
interpolator = interp1d(combined_time[valid], combined_values[valid], kind='slinear', fill_value=np.nan, bounds_error=False)

# Interpolate the values onto the new time array (NaN outside the data range)
values_new = interpolator(time_new)

# Insert NaN where there is no original data between the two chunks
# This ensures the gap between the chunks appears in the resampled plot too
in_gap = (time_new > np.max(time_original_1)) & (time_new < np.min(time_original_2))
values_new[in_gap] = np.nan

# Plot the original and resampled data
plt.figure(figsize=(10, 6))

# Plot original signal (with NaN to break the connection between chunks)
plt.subplot(2, 1, 1)
plt.plot(combined_time, combined_values, label='Original Data (20 Hz)', marker='', linestyle='-', color='blue')
plt.title('Original Data (20 Hz)')
plt.xlabel('Time')
plt.ylabel('Values')
plt.grid(True)

# Plot resampled signal (with NaN handling for gaps)
plt.subplot(2, 1, 2)
plt.plot(time_new, values_new, label='Resampled Data (100 Hz)', marker='', linestyle='--', color='orange')
plt.title('Resampled Data (100 Hz)')
plt.xlabel('Time')
plt.ylabel('Values')
plt.grid(True)

plt.tight_layout()
plt.show()


### Filtering

In [None]:
'''Trying different low pass filters'''

import numpy as np
from scipy.signal import butter, filtfilt
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

# Generating example data (time and values)
time_original_1 = np.arange(1, 9, 1/20)  # Original time array (20 Hz)
values_original_1 = np.sin(0.5 * np.pi * time_original_1) + 4  # Example values

time_original_2 = np.arange(20, 29, 1/20)  # Original time array (20 Hz)
values_original_2 = np.sin(2 * np.pi * time_original_2) + 6 # Example values

# Combine the two sets of example data together
combined_time = np.concatenate((time_original_1, time_original_2))
combined_values = np.concatenate((values_original_1, values_original_2))

# Define new time array with a smaller time interval (100 Hz)
time_resampled = np.arange(0, 30, 1/100)  # New time array (100 Hz)

# Use interp1d to interpolate the data; points outside the data range are filled with 0
interpolator = interp1d(combined_time, combined_values, kind='slinear', fill_value=(0, 0), bounds_error=False)

# Interpolate/extrapolate the values to the new time array
values_resampled = interpolator(time_resampled) + 2 + np.random.normal(0, 0.5, len(time_resampled))


def butter_lowpass_filter(data, cutoff, fs, order):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    # Get the filter coefficients
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    y = filtfilt(b, a, data)
    return y

values_filtered = butter_lowpass_filter(values_resampled, 1.5, 100, 3)


# Plot the resampled signal before and after filtering
plt.figure(figsize=(10, 6))
plt.plot(time_resampled, values_resampled, label='Before filter', marker='', linestyle='-')
plt.plot(time_resampled, values_filtered, label='After filter', marker='', linestyle='--')
plt.xlabel('Time')
plt.ylabel('Values')
plt.title('Resampled and Filtered Data')
plt.legend()
plt.grid(True)
plt.show()


### Spectral Analysis and Visualization

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import windows, spectrogram

# Parameters
fs = 1000  # Sampling frequency in Hz
T = 1.0    # Duration in seconds
t = np.linspace(0, T, int(T * fs), endpoint=False)  # Time vector

# Create a signal with two different frequencies
f1 = 50   # Frequency of the first sine wave (Hz)
f2 = 120  # Frequency of the second sine wave (Hz)
signal = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# Parameters for windowing
window_length = 100  # Length of the window
window_type = 'hann'  # Type of the window (e.g., 'hann', 'hamming', etc.)
window = windows.get_window(window_type, window_length)
n_overlap = window_length // 2  # Number of overlapping samples

# Perform STFT
frequencies, times, Sxx = spectrogram(signal, fs, window=window, nperseg=window_length, noverlap=n_overlap, scaling='spectrum')

# Plot the original signal
plt.figure(figsize=(14, 8))

plt.subplot(3, 1, 1)
plt.plot(t, signal)
plt.title('Original Signal')
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.grid(True)

# Plot the spectrogram
plt.subplot(3, 1, 2)
plt.pcolormesh(times, frequencies, 10 * np.log10(Sxx), shading='gouraud')
plt.title('Spectrogram')
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.colorbar(label='Power (dB)')  # scaling='spectrum' yields a power spectrum, not a density
plt.grid(True)

# Illustration of windowing effect
# Select a specific time slice for illustration
time_slice = int(0.5 * fs) + 110  # Start index of the slice (sample 610, i.e. 0.61 seconds)
signal_slice = signal[time_slice:time_slice + window_length] * window

# Perform FFT on the windowed signal slice
N_slice = len(signal_slice)
fft_result_slice = np.fft.fft(signal_slice)
fft_freq_slice = np.fft.fftfreq(N_slice, 1/fs)

# Only take the positive frequencies and corresponding FFT results
positive_freqs_slice = fft_freq_slice[:N_slice//2]
positive_fft_result_slice = fft_result_slice[:N_slice//2]

# Magnitude of the FFT (normalized)
magnitude_slice = np.abs(positive_fft_result_slice) / N_slice

# Plot the FFT (magnitude spectrum) of the windowed signal slice
plt.subplot(3, 1, 3)
plt.stem(positive_freqs_slice, magnitude_slice, 'b', markerfmt=" ", basefmt="-b")
plt.title('Magnitude Spectrum of a Windowed Slice')
plt.xlabel('Frequency [Hz]')
plt.ylabel('Magnitude')
plt.grid(True)

plt.tight_layout()
plt.show()


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import windows, spectrogram

# Parameters
fs = 1000  # Sampling frequency in Hz
T = 1.0    # Duration in seconds
t = np.linspace(0, T, int(T * fs), endpoint=False)  # Time vector

# Create a signal with two different frequencies
f1 = 50   # Frequency of the first sine wave (Hz)
f2 = 120  # Frequency of the second sine wave (Hz)

signal = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# Add white noise
noise_amplitude = 0.3  # Set the amplitude of the noise
white_noise = np.random.normal(0, noise_amplitude, size=t.shape)
signal = signal + white_noise

# Parameters for windowing
window_length = 100  # Length of the window
window_type = 'hann'  # Type of the window (e.g., 'hann', 'hamming', etc.)
window = windows.get_window(window_type, window_length)
n_overlap = window_length // 2  # Number of overlapping samples

# Perform STFT
frequencies, times, Sxx = spectrogram(signal, fs, window=window, nperseg=window_length, noverlap=n_overlap, scaling='spectrum')

# Plot the original signal
plt.figure(figsize=(14, 8))

plt.subplot(3, 1, 1)
plt.plot(t, signal)
plt.title('Original Signal')
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.grid(True)

# Plot the spectrogram
plt.subplot(3, 1, 2)
plt.pcolormesh(times, frequencies, 10 * np.log10(Sxx), shading='gouraud')
plt.title('Spectrogram')
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.colorbar(label='Power (dB)')  # scaling='spectrum' yields a power spectrum, not a density
plt.grid(True)

# Illustration of windowing effect
# Select a specific time slice for illustration
time_slice = int(0.5 * fs) + 110  # Start index of the slice (sample 610, i.e. 0.61 seconds)
signal_slice = signal[time_slice:time_slice + window_length] * window

# Perform FFT on the windowed signal slice
N_slice = len(signal_slice)
fft_result_slice = np.fft.fft(signal_slice)
fft_freq_slice = np.fft.fftfreq(N_slice, 1/fs)

# Only take the positive frequencies and corresponding FFT results
positive_freqs_slice = fft_freq_slice[:N_slice//2]
positive_fft_result_slice = fft_result_slice[:N_slice//2]

# Magnitude of the FFT (normalized)
magnitude_slice = np.abs(positive_fft_result_slice) / N_slice

# Plot the FFT (magnitude spectrum) of the windowed signal slice
plt.subplot(3, 1, 3)
plt.stem(positive_freqs_slice, magnitude_slice, 'b', markerfmt=" ", basefmt="-b")
plt.title('Magnitude Spectrum of a Windowed Slice')
plt.xlabel('Frequency [Hz]')
plt.ylabel('Magnitude')
plt.grid(True)

plt.tight_layout()
plt.show()


### Feature Extraction

In [None]:
import numpy as np
from scipy import stats
from scipy.signal import periodogram

# Parameters
fs = 1000  # Sampling frequency in Hz
T = 1.0    # Duration in seconds
t = np.linspace(0, T, int(T * fs), endpoint=False)  # Time vector

# Create a signal with two different frequencies
f1 = 50   # Frequency of the first sine wave (Hz)
f2 = 120  # Frequency of the second sine wave (Hz)

signal = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# Add white noise
noise_amplitude = 0.3  # Set the amplitude of the noise
white_noise = np.random.normal(0, noise_amplitude, size=t.shape)
signal = signal + white_noise

# 1. Mean of the signal
mean_value = np.mean(signal)

# 2. Standard deviation of the signal
std_value = np.std(signal)

# 3. Maximum and minimum values of the signal
max_value = np.max(signal)
min_value = np.min(signal)

# 4. Root Mean Square (RMS)
rms_value = np.sqrt(np.mean(signal ** 2))

# 5. Skewness and kurtosis
skewness_value = stats.skew(signal)
kurtosis_value = stats.kurtosis(signal)

# 6. Spectral Analysis (Dominant Frequency)
frequencies, power_spectral_density = periodogram(signal, fs)
dominant_frequency = frequencies[np.argmax(power_spectral_density)]

# Display extracted features
print(f"Mean: {mean_value:.4f}")
print(f"Standard Deviation: {std_value:.4f}")
print(f"Maximum Value: {max_value:.4f}")
print(f"Minimum Value: {min_value:.4f}")
print(f"RMS: {rms_value:.4f}")
print(f"Skewness: {skewness_value:.4f}")
print(f"Kurtosis: {kurtosis_value:.4f}")
print(f"Dominant Frequency: {dominant_frequency:.4f} Hz")


### Windowing

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

# Generate a sample signal
np.random.seed(0)
x = np.linspace(0, 10, 1000)
y = np.sin(x) + 0.5 * np.random.normal(size=x.size)

# Set the window size
window_size = 200

# Create the figure and axis
fig, ax = plt.subplots()
line, = ax.plot(x, y, label='Signal')
window_line, = ax.plot([], [], 'r', lw=2, label='Sliding Window')

# Set the axis limits
ax.set_xlim(x.min(), x.max())
ax.set_ylim(y.min(), y.max())
ax.legend()

# Initialize the window line
def init():
    window_line.set_data([], [])
    return window_line,

# Update the window line
def update(frame):
    start = frame
    end = frame + window_size
    window_line.set_data(x[start:end], y[start:end])
    return window_line,

# Create the animation
ani = FuncAnimation(fig, update, frames=range(len(x) - window_size), init_func=init, blit=True, interval=50)

plt.close(fig)

# Display the animation
HTML(ani.to_jshtml())
