# FYP: Exploratory Data Analysis for Call Recordings
**FYP Goal:** Development of an AI/ML pipeline for pre-training a foundation model for sales

**EDA Goal:** Study the calls that led to an appointment being set in order to determine the preprocessing steps the recordings require

**Data Source:** Using the HubSpot data viewer, we extracted the dataset according to the specifications below. We then used a script in src/data_collector/getAllRecordings to extract all recordings listed in the `Recording URL` column. From these, we selected a sample of 28 calls that led to appointments as the initial training set for the NLP model.

#### Filters Employed to extract SA_Singapore_Calls.csv
- Countries: Singapore
- Object: Calls

#### Data Columns extracted from HubSpot (17)
1. ```Record ID```
2. ```Call Title```
3. ```Activity date```
4. ```Activity assigned to```
5. ```Call notes```
6. ```Associated Contact```
7. ```Associated Company```
8. ```Associated Deal```
9. ```Call outcome```
10. ```Recording URL```
11. ```To Number```
12. ```Call duration (HH:mm:ss)```
13. ```Voicemail Available```
14. ```Number of times contacted```
15. ```Associated Contact IDs```
16. ```Associated Company IDs```
17. ```Associated Deal IDs```

#### Notes
We will not be analysing the numerical entries. The column list above is only a preamble that provides context for the data source.

### 1. Import Libraries
First, we import the necessary Python libraries required for our analysis.

In [1]:
import librosa
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from pydub import AudioSegment

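We also change the working directory to the project root so that relative paths such as `data/SA_Singapore_Calls.csv` resolve correctly.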

In [2]:
# change working directory to the root of your project
# Adjust according to the current path in the output
os.chdir('../')

### 2. Load the dataset

In [3]:
calls_df = pd.read_csv('data/SA_Singapore_Calls.csv')
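
Since the EDA goal concerns calls that led to an appointment, the selection of such calls from the loaded table could look roughly like the sketch below. It runs on a small made-up DataFrame rather than the real export. The outcome label `"Appointment Set"`, the helper name `select_calls`, and the example URLs are assumptions for illustration; the actual label in the `Call outcome` column should be checked first (for example with `calls_df["Call outcome"].value_counts()`).

```python
import pandas as pd


def select_calls(df, outcome, outcome_col="Call outcome", url_col="Recording URL"):
    """Keep rows with the given outcome that also have a recording URL."""
    has_outcome = df[outcome_col] == outcome
    has_recording = df[url_col].notna()
    return df[has_outcome & has_recording].reset_index(drop=True)


# Toy data that mimics a few of the exported columns (values are made up)
toy = pd.DataFrame({
    "Record ID": [1, 2, 3, 4],
    "Call outcome": ["Appointment Set", "No answer", "Appointment Set", "Appointment Set"],
    "Recording URL": [
        "https://example.com/1.mp3",
        "https://example.com/2.mp3",
        None,
        "https://example.com/4.mp3",
    ],
    "Call duration (HH:mm:ss)": ["00:03:10", "00:00:05", "00:07:45", "00:12:00"],
})

# "Appointment Set" is an assumed label, not confirmed by the export
appointments = select_calls(toy, "Appointment Set")

# Convert the HH:mm:ss strings into seconds for later comparison with the audio files
duration_sec = pd.to_timedelta(appointments["Call duration (HH:mm:ss)"]).dt.total_seconds()
print(appointments["Record ID"].tolist(), duration_sec.tolist())
```

The same helper could be applied to `calls_df` once the real outcome label is known.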

### 3. Viewing the Dataset

In [4]:
calls_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8326 entries, 0 to 8325
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Record ID                 8326 non-null   int64  
 1   Call Title                8324 non-null   object 
 2   Activity date             8326 non-null   object 
 3   Activity assigned to      8243 non-null   object 
 4   Call notes                6195 non-null   object 
 5   Associated Contact        6785 non-null   object 
 6   Associated Company        7975 non-null   object 
 7   Associated Deal           2465 non-null   object 
 8   Call outcome              8326 non-null   object 
 9   Recording URL             8326 non-null   object 
 10  To Number                 8326 non-null   object 
 11  Call duration (HH:mm:ss)  8326 non-null   object 
 12  Voicemail Available       0 non-null      float64
 13  Associated Contact IDs    6785 non-null   object 
 14  Associat

### 2. Function to Extract Audio Properties

This function extracts the audio properties needed to inform preprocessing decisions. It uses `pydub` and `librosa` to retrieve details about each file, including duration, channels, sample rate, loudness, and frequency characteristics. These features help us assess the quality and consistency of the audio data and identify the preprocessing steps required.

#### Features Extracted

1. **Duration (sec)**: The total length of the audio file in seconds.
2. **Channels**: The number of audio channels (e.g., mono or stereo).
3. **Sample Rate**: The number of samples taken from the audio signal per second (in Hz).
4. **File Size (KB)**: The file’s size, providing an indirect indication of the audio quality and bitrate.
5. **RMS Energy**: The Root Mean Square energy of the signal, indicating the average loudness of the audio file.
6. **Zero Crossing Rate (ZCR)**: The rate at which the signal changes sign, providing insight into the audio’s frequency characteristics.
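7. **Log Mel Mean**: The mean of the log-scaled (dB) Mel spectrogram over all frequency bands and time frames. Because the dB values are computed relative to each file's own peak power, this summarises how energy is distributed within a file rather than its absolute loudness.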
8. **Log Mel Spectrogram**: A matrix representing the power of different frequency bands over time, converted to a logarithmic scale for better interpretability.
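
To make the RMS energy and ZCR definitions above concrete, here is a minimal NumPy sketch that computes both over an entire synthetic signal. This is a simplification: `librosa` computes these features per frame, and the function below averages the frame values. The helper names, the 8 kHz sample rate, and the tones are assumptions chosen for illustration.

```python
import numpy as np


def rms_energy(y):
    """Root mean square of the samples: sqrt(mean(y^2))."""
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean(y ** 2)))


def zero_crossing_rate(y):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(np.asarray(y, dtype=float))
    return float(np.mean(signs[1:] != signs[:-1]))


sr = 8000                      # hypothetical sample rate for the toy signals
t = np.arange(sr) / sr         # one second of sample times
low_tone = np.sin(2 * np.pi * 100 * t + 0.1)    # 100 Hz sine, small phase offset
high_tone = np.sin(2 * np.pi * 1000 * t + 0.1)  # 1000 Hz sine
quiet_tone = 0.1 * low_tone                     # same pitch, ten times quieter

for name, y in [("low", low_tone), ("high", high_tone), ("quiet", quiet_tone)]:
    print(f"{name:>5}: RMS={rms_energy(y):.4f}  ZCR={zero_crossing_rate(y):.4f}")
```

The quiet tone has a tenth of the RMS of the low tone but the same ZCR, while the high tone has the same RMS but roughly ten times the ZCR. This is why RMS is read as loudness and ZCR as a rough indicator of frequency content.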

#### Desired Decision-Making

Based on the features extracted by the function below, we want to be able to make the following decisions:

1. **Duration (sec) :** Standardize audio length by trimming longer files or padding shorter files. This ensures consistency across samples, which matters because we will feed these files into openSMILE for feature extraction and Whisper for transcription.
2. **Channels :** Convert all files to mono if they vary in channels (mono vs. stereo) to maintain uniformity, reduce data size, and simplify processing.
3. **Sample Rate :** Resample audio to a standard rate (e.g., 16 kHz or 44.1 kHz) if there is variation, ensuring compatibility and consistency across files.
4. **File Size (KB) :** Compress or downsample large files if needed, especially if they consume excessive storage or processing power.
5. **RMS Energy :** Normalize the loudness across files if RMS values vary significantly. This reduces variance in audio intensity and ensures consistent audio levels for analysis or training.
6. **Zero Crossing Rate (ZCR) :** Apply noise reduction or frequency filtering for files with unusually high ZCR, as this may indicate the presence of noise or high-frequency content that may not be useful.
7. **Log Mel Mean :** Normalize audio files if there is significant variation in mean energy across the frequency bands, helping to maintain consistency in audio features.
8. **Log Mel Spectrogram :** Use the Log Mel spectrogram as a feature representation for machine learning models. Additionally, visual inspection can reveal noise patterns or artifacts, which may guide additional preprocessing, like denoising.
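
Three of the decisions above (mono conversion, loudness normalization, and length standardization) can be sketched directly on sample arrays. The following is a minimal NumPy illustration, not the pipeline's actual preprocessing: the helper names, the `(n_samples, n_channels)` array layout, the target RMS of 0.1, and the toy noise signal are all assumptions. Resampling is left out because it is better delegated to an audio library.

```python
import numpy as np


def to_mono(samples):
    """Average the channels; expects shape (n_samples,) or (n_samples, n_channels)."""
    samples = np.asarray(samples, dtype=float)
    return samples if samples.ndim == 1 else samples.mean(axis=1)


def normalize_rms(y, target_rms=0.1, eps=1e-8):
    """Scale y so that its RMS equals target_rms; near-silent input is returned unchanged."""
    rms = np.sqrt(np.mean(y ** 2))
    if rms < eps:
        return y
    return y * (target_rms / rms)


def pad_or_trim(y, sr, target_sec):
    """Zero-pad at the end, or truncate, so that y lasts exactly target_sec seconds."""
    target_len = int(round(sr * target_sec))
    if len(y) >= target_len:
        return y[:target_len]
    return np.pad(y, (0, target_len - len(y)))


sr = 16000                                        # assumed target sample rate
rng = np.random.default_rng(0)
stereo = 0.5 * rng.standard_normal((2 * sr, 2))   # 2 s of toy stereo noise

mono = normalize_rms(to_mono(stereo))   # normalize before padding so silence does not skew the RMS
padded = pad_or_trim(mono, sr, 3.0)     # shorter clip: zero-padded to 3 s
trimmed = pad_or_trim(mono, sr, 1.0)    # longer clip: cut to 1 s
print(mono.shape, padded.shape, trimmed.shape)
```

Normalization is applied before padding on purpose: appended zeros would lower the measured RMS and cause the speech portion to be over-amplified. Scaling to a target RMS can also push peaks outside the valid sample range, so a real pipeline would likely need a clipping check.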

In [3]:
# Function to extract audio properties
def extract_audio_properties(file_path):
    try:
        # Load the audio file
        audio = AudioSegment.from_file(file_path)
        
        # Extract file properties
        duration_ms = len(audio)  # Duration in milliseconds
        duration_sec = duration_ms / 1000  # Convert to seconds
        channels = audio.channels
        sample_rate = audio.frame_rate
        file_size_kb = os.path.getsize(file_path) / 1024  # File size in KB
        
        # Use librosa for additional analysis
        y, sr = librosa.load(file_path, sr=None)
        rms = librosa.feature.rms(y=y).mean()  # Root Mean Square energy
        zcr = librosa.feature.zero_crossing_rate(y=y).mean()  # Zero Crossing Rate
        
        # Compute the Mel spectrogram
        mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512)
        
        # Convert to Log scale
        log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
        
        # Mean of the log-Mel values over all bands and time frames, as a one-number summary
        log_mel_mean = log_mel_spectrogram.mean()

        # Return extracted properties as a dictionary
        return {
            "File Name": os.path.basename(file_path),
            "Duration (sec)": duration_sec,
            "Channels": channels,
            "Sample Rate": sample_rate,
            "File Size (KB)": file_size_kb,
            "RMS Energy": rms,
            "Zero Crossing Rate": zcr,
            "Log Mel Mean": log_mel_mean,
            "Log Mel Spectrogram": log_mel_spectrogram
        }
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None

### 3. Function to Extract Features into a DataFrame
This function applies `extract_audio_properties` to every MP3 file in a directory and collates the results into a DataFrame.

In [4]:
# Function to process a directory of MP3 files and create a DataFrame
def analyze_audio_directory(directory):
    audio_data = []

    # Iterate over all MP3 files in the directory
    for file_name in sorted(os.listdir(directory)):  # sorted for a reproducible row order
        if file_name.lower().endswith(".mp3"):  # also match upper-case extensions
            file_path = os.path.join(directory, file_name)
            audio_properties = extract_audio_properties(file_path)
            if audio_properties:
                audio_data.append(audio_properties)

    # Create DataFrame from collected data
    return pd.DataFrame(audio_data)

### 4. Load Calls
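Load the calls we intend to study. `audio_dir` must point to the folder that contains the downloaded MP3 recordings; the path below is a placeholder, not the project's confirmed location.

In [5]:
# Placeholder path (assumption): replace with the folder that holds the recordings
audio_dir = 'data/recordings'
audio_df = analyze_audio_directory(audio_dir)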

### 5. View the data set

In [None]:
audio_df.head()

### 6. Data Analysis

#### Summary Statistics
We compute summary statistics for the numeric columns of the dataset.

In [None]:
audio_df.describe()

#### Outliers in Duration and Sample Rate

In [None]:
# Histogram for Duration
plt.hist(audio_df["Duration (sec)"], bins=20)
plt.xlabel("Duration (sec)")
plt.ylabel("Frequency")
plt.title("Distribution of Audio Duration")
plt.show()

# Histogram for Sample Rate
plt.hist(audio_df["Sample Rate"], bins=10)
plt.xlabel("Sample Rate (Hz)")
plt.ylabel("Frequency")
plt.title("Distribution of Sample Rates")
plt.show()
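
The histograms show outliers visually. To list the affected files explicitly, one common option is the 1.5 × IQR rule, sketched below on made-up durations. The helper name `iqr_outlier_mask`, the threshold `k=1.5`, and the example values are assumptions for illustration, not results from the recordings.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up call durations in seconds; the last one is deliberately extreme
durations = np.array([62.0, 75.0, 80.0, 91.0, 88.0, 70.0, 610.0])
mask = iqr_outlier_mask(durations)
print(durations[mask])
```

On the real data, the same mask could be used as `audio_df[iqr_outlier_mask(audio_df["Duration (sec)"])]` to inspect the flagged files by name.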

#### File Size Analysis

In [None]:
# Histogram for File Size
plt.hist(audio_df["File Size (KB)"], bins=20)
plt.xlabel("File Size (KB)")
plt.ylabel("Frequency")
plt.title("Distribution of File Sizes")
plt.show()

#### RMS Energy Analysis

In [None]:
# Histogram for RMS Energy
plt.hist(audio_df["RMS Energy"], bins=20)
plt.xlabel("RMS Energy")
plt.ylabel("Frequency")
plt.title("Distribution of RMS Energy (Loudness)")
plt.show()

#### Zero Crossing Rate (ZCR) Analysis

In [None]:
# Histogram for Zero Crossing Rate
plt.hist(audio_df["Zero Crossing Rate"], bins=20)
plt.xlabel("Zero Crossing Rate")
plt.ylabel("Frequency")
plt.title("Distribution of Zero Crossing Rate")
plt.show()

#### Log Mel Spectrogram Analysis

In [None]:
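# Plot example Log Mel Spectrograms
import librosa.display  # needed on older librosa versions before calling specshow

for i in range(min(3, len(audio_df))):  # Plot up to 3 examples
    file_name = audio_df["File Name"].iloc[i]
    log_mel_spectrogram = audio_df["Log Mel Spectrogram"].iloc[i]
    # Use the file's own sample rate: the audio was loaded with sr=None, so
    # a hard-coded 22050 Hz would mislabel the time and frequency axes
    sr = audio_df["Sample Rate"].iloc[i]
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(log_mel_spectrogram, sr=sr, hop_length=512, x_axis="time", y_axis="mel")
    plt.colorbar(format="%+2.0f dB")
    plt.title(f"Log Mel Spectrogram for {file_name}")
    plt.tight_layout()
    plt.show()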

#### Correlation Analysis

In [None]:
correlation_matrix = audio_df[["Duration (sec)", "Sample Rate", "File Size (KB)", "RMS Energy", "Zero Crossing Rate", "Log Mel Mean"]].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix of Audio Features")
plt.show()