# Earnings Call Project: Feature Engineering
<br>
CIS 831 Deep Learning – Term Project<br>
Kansas State University
<br><br>
James Chapman<br>
John Woods<br>
Nathan Diehl<br>
<br>

This notebook extracts features from the text and audio of the earnings calls. Each call comes pre-processed so that every sentence corresponds to one line of text and one MP3 audio file. The transcript text is embedded with the pre-trained GloVe 300-dimensional word vectors, and the audio files are processed with Praat through the Parselmouth Python library.
- Text (GloVe 300): [GloVe download](https://nlp.stanford.edu/projects/glove/)
- Audio (Praat): [Parselmouth](https://parselmouth.readthedocs.io/en/stable/)

The features produced by this notebook are saved in the "data/data_prep" directory as the following CSV files.
- glove_features
- MAEC_glove_features
- praat_features
- MAEC_praat_features

In [3]:
import sys
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install praat-parselmouth  # the PyPI package is "praat-parselmouth"; "parselmouth" is an unrelated package
    from google.colab import drive
    drive.mount('/content/gdrive')
    %cd gdrive/My Drive/831

In [4]:
import pandas as pd
import numpy as np
import parselmouth
import requests
import time
import json
import csv
import re
import os
from datetime import datetime
from tqdm import tqdm

In [5]:
MAEC_dir = 'data/MAEC/MAEC_Dataset' # https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction

############# too big for GitHub ########################
############# stored on local disk ######################
original_data_dir = r"D:\original_dataset" # https://github.com/GeminiLn/EarningsCall_Dataset 
MAEC_audio_dir = r"D:\MAEC_audio" 
# The audio link in the MAEC GitHub repository does not work.
# I emailed the authors, and they sent another link.
# The archive holds roughly half a million files but is only 19 GB.
# https://drive.google.com/file/d/1m1GRCHgKn9Vz9IFMC_SpCog6uP3-gFgY/view?usp=drive_link 

In [6]:
# Loop through the directory; each folder represents one earnings conference call and is named "CompanyName_Date".
filename_data = []
for filename in os.listdir(original_data_dir):
    company_name, date_str = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    filename_data.append([company_name, date])
filename_data = pd.DataFrame(filename_data, columns=["Company", "Date"])
company_ticker = pd.read_csv('data/data_prep/company_ticker.csv')
filename_data = filename_data.merge(company_ticker, on="Company", how="left")

# Loop through the directory; each folder represents one earnings conference call and is named "Date_Ticker".
MAEC_filename_data = []
for filename in os.listdir(MAEC_dir):
    date_str, ticker = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    MAEC_filename_data.append([ticker, date])
MAEC_filename_data = pd.DataFrame(MAEC_filename_data, columns=["Ticker", "Date"])
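
Both loops split each folder name on a single underscore. For the original dataset the split is taken from the right (`rsplit`), so an underscore inside a company name does not break the parse. A minimal sketch of the same parsing as two standalone helpers follows. The function names are hypothetical; the two sample folder names are built from values that appear elsewhere in this notebook, and the third is made up.

```python
from datetime import datetime


def parse_company_folder(name):
    """Split 'CompanyName_YYYYMMDD' into (company, 'YYYY-MM-DD')."""
    company, date_str = name.rsplit('_', 1)  # split on the last underscore only
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    return company, date


def parse_maec_folder(name):
    """Split 'YYYYMMDD_TICKER' into (ticker, 'YYYY-MM-DD')."""
    date_str, ticker = name.split('_', 1)  # the date comes first in MAEC
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    return ticker, date


print(parse_company_folder("Ventas Inc_20171027"))
print(parse_maec_folder("20150225_LMAT"))
print(parse_company_folder("A_B Corp_20170101"))  # made-up name with an inner underscore
```

`datetime.strptime` raises `ValueError` on a malformed date, which makes a misnamed folder fail loudly instead of producing a silent bad row.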

Here we loop through the original dataset (text and audio) and verify that each sentence has exactly one line of text and one audio file.

In [8]:
def each_row(row):
    Company = row['Company']
    Date = row['Date'].replace('-', '')
    call_dir = os.path.join(original_data_dir, f"{Company}_{Date}")

    # count the lines (sentences) in the transcript; 0 for an empty file
    with open(os.path.join(call_dir, "TextSequence.txt"), 'r', encoding='utf-8', errors='replace') as file:
        text_length = sum(1 for _ in file)

    # count the MP3 files in the CEO audio folder
    audio_length = 0
    audio_dir = os.path.join(call_dir, "CEO")
    if os.path.exists(audio_dir):
        for audio_file in os.listdir(audio_dir):
            if audio_file.lower().endswith('.mp3'):
                audio_length += 1

    return text_length, audio_length

filename_data[['text_length', 'audio_length']] = filename_data.apply(each_row, axis=1, result_type='expand')
print('should be zero-', len(filename_data[filename_data['text_length']!= filename_data['audio_length']] ))
filename_data.info(verbose=True)

should be zero- 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 572 entries, 0 to 571
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company       572 non-null    object
 1   Date          572 non-null    object
 2   Ticker        572 non-null    object
 3   text_length   572 non-null    int64 
 4   audio_length  572 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 22.5+ KB


In [9]:
def each_row(row):
    Ticker = row['Ticker']
    Date = row['Date'].replace('-', '')
    call_dir = os.path.join(MAEC_audio_dir, f"{Date}_{Ticker}")

    # count the lines (sentences) in the transcript; 0 for an empty file
    with open(os.path.join(call_dir, "text.txt"), 'r', encoding='utf-8', errors='replace') as file:
        text_length = sum(1 for _ in file)

    # count the MP3 files stored alongside the transcript
    audio_length = 0
    if os.path.exists(call_dir):
        for audio_file in os.listdir(call_dir):
            if audio_file.lower().endswith('.mp3'):
                audio_length += 1

    return text_length, audio_length

MAEC_filename_data[['text_length', 'audio_length']] = MAEC_filename_data.apply(each_row, axis=1, result_type='expand')
print('should be zero-', len(MAEC_filename_data[MAEC_filename_data['text_length']!= MAEC_filename_data['audio_length']] ))
MAEC_filename_data.info(verbose=True)

should be zero- 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3443 entries, 0 to 3442
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Ticker        3443 non-null   object
 1   Date          3443 non-null   object
 2   text_length   3443 non-null   int64 
 3   audio_length  3443 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 107.7+ KB
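
The two consistency checks above apply the same logic to different folder layouts. A minimal sketch of that logic as one reusable helper is shown here, exercised on a throwaway directory. The helper name and the file names are hypothetical and not part of either dataset.

```python
import os
import tempfile


def count_lines_and_mp3s(text_path, audio_dir):
    """Return (number of text lines, number of .mp3 files in audio_dir)."""
    with open(text_path, 'r', encoding='utf-8', errors='replace') as f:
        n_lines = sum(1 for _ in f)  # 0 for an empty file
    n_audio = 0
    if os.path.isdir(audio_dir):
        n_audio = sum(1 for name in os.listdir(audio_dir)
                      if name.lower().endswith('.mp3'))
    return n_lines, n_audio


# Tiny self-check on a temporary directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'text.txt')
    with open(text_path, 'w', encoding='utf-8') as f:
        f.write("first sentence\nsecond sentence\n")
    for name in ('a.mp3', 'b.MP3', 'notes.csv'):
        open(os.path.join(tmp, name), 'wb').close()
    result = count_lines_and_mp3s(text_path, tmp)

print(result)
```

A call passes the check when both counts are equal, which is what the "should be zero" line reports for the whole table.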


# GloVe 300 features from meeting transcript text files

The GloVe file can be found at https://nlp.stanford.edu/projects/glove/

In [11]:
glove_file_path = r"C:\Users\James\OneDrive\Kansas State University\CIS 831\Project\glove.6B\glove.6B.300d.txt" 
glove_model = {}
with open(glove_file_path, 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        word = parts[0]
        vector = np.array(parts[1:], dtype='float32')
        glove_model[word] = vector
print(f"Loaded {len(glove_model)} words from GloVe.")

contractions = {
    "i'll": "i will", "can't": "cannot", "won't": "will not",
    "i'm": "i am", "he's": "he is", "she's": "she is",
    "they're": "they are", "it's": "it is", "let's": "let us",
    "don't": "do not", "didn't": "did not", "doesn't": "does not",
    "you're": "you are", "we're": "we are", "they've": "they have",
    "there's": "there is", "who's": "who is", "what's": "what is",
    "i'd": "i would", "he'd": "he would", "she'd": "she would",
    "they'd": "they would", "we'd": "we would", "you'd": "you would",
    "i've": "i have", "we've": "we have", "you've": "you have",
    "shouldn't": "should not", "wouldn't": "would not", "couldn't": "could not",
    "isn't": "is not", "aren't": "are not", "wasn't": "was not",
    "weren't": "were not", "hasn't": "has not", "haven't": "have not",
    "hadn't": "had not", "mustn't": "must not", "mightn't": "might not",
    "needn't": "need not", "shan't": "shall not", "would've": "would have",
    "should've": "should have", "could've": "could have", "might've": "might have",
    "must've": "must have", "he'll": "he will",
    "she'll": "she will", "they'll": "they will", "we'll": "we will",
    "you'll": "you will", "that'll": "that will", "there'll": "there will",
    "it'll": "it will", "who'll": "who will", "how's": "how is",
    "why's": "why is", "where's": "where is", "when's": "when is",
    "ain't": "is not", "couldn't've": "could not have", "shouldn't've": "should not have",
    "wouldn't've": "would not have", "oughtn't": "ought not", "she'd've": "she would have",
    "he'd've": "he would have", "they'd've": "they would have", "we'd've": "we would have",
    "you'd've": "you would have", "i'd've": "i would have", "it'd've": "it would have",
    "y'all": "you all", "how'd": "how did", "what'd": "what did",
    "who'd": "who did", "where'd": "where did", "when'd": "when did",
    "why'd": "why did", "how'll": "how will", "what'll": "what will",
    "where'll": "where will", "when'll": "when will",
    "why'll": "why will"
}

# Returns a 300-dimensional sentence vector (the average of the word vectors) plus a flag that is True when no word was found in GloVe. Is there a better way?
def get_glove(sentence):
    # clean up each sentence
    sentence = sentence.lower()
    for contraction, full_form in contractions.items():
        sentence = sentence.replace(contraction, full_form)
    sentence = re.sub(r'[^a-z0-9\s]', ' ', sentence)

    words = sentence.split()
    word_embeddings = []
    for word in words:
        if word in glove_model:
            word_embeddings.append(glove_model[word])
    if not word_embeddings:
        print(sentence)
        return [np.zeros(300)], True  # Return a zero vector if no words match
    return [np.mean(word_embeddings, axis=0)], False

Loaded 400000 words from GloVe.
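
One detail of `get_glove` is worth noting: `str.replace` is applied once per dictionary entry, in insertion order, and matches substrings. As a result, "couldn't" is expanded before "couldn't've" is ever tried, and "it's" also matches inside a possessive such as "spirit's". A sketch of one possible alternative, not the method used in this notebook, is a single regular expression with word boundaries and the longest keys first. The table below is a small subset of the notebook's dictionary, and the helper name is hypothetical.

```python
import re

# A small subset of the notebook's table, enough to show the ordering issue.
contractions = {
    "couldn't": "could not",
    "couldn't've": "could not have",
    "it's": "it is",
    "we'll": "we will",
}

# Longest keys first, so "couldn't've" is tried before "couldn't".
_keys = sorted(contractions, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")


def expand_contractions(sentence):
    """Lower-case the sentence and expand whole-word contractions in one pass."""
    return _pattern.sub(lambda m: contractions[m.group(1)], sentence.lower())


print(expand_contractions("We'll see; it couldn't've gone better."))
print(expand_contractions("The spirit's high"))  # possessive is left untouched
```

Whether this changes the resulting embeddings in practice depends on how often such forms occur in the transcripts, which was not measured here.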


In [12]:
glove_features = pd.DataFrame()
bad_glove = []
for Company, Date in tqdm(filename_data[['Company', 'Date']].values):
    Date = Date.replace('-', '')
    text_path = f"D:/original_dataset/{Company}_{Date}/TextSequence.txt"
    with open(text_path, 'r', encoding='utf-8', errors='replace') as file:
        for i, line in enumerate(file, start=1):
            sentence_embedding, sentence_has_error = get_glove(line.strip())
            if sentence_has_error: # whenever a sentence is returned as all zeros
                bad_glove.append([Company,Date,i,line.strip()])
            features_df = pd.DataFrame(sentence_embedding, columns=[f'GloVe_{j}' for j in range(300)])
            features_df['Company'] = Company
            features_df['Date'] = Date
            features_df['Sentence_num'] = i
            glove_features = pd.concat([glove_features, features_df], ignore_index=True)

print(len(bad_glove))
# printed below are the sentences without word embeddings (all zeros)

 20%|█▉        | 114/572 [01:23<07:41,  1.01s/it]

sunit 


 30%|██▉       | 169/572 [03:07<13:12,  1.97s/it]

 


100%|██████████| 572/572 [34:04<00:00,  3.57s/it]

2
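
Calling `pd.concat` once per sentence copies the accumulated frame on every call, so the cost of the loop grows roughly quadratically with the number of sentences. That is consistent with the per-call time rising as the progress bar advances. A sketch of one alternative is to stream each row straight to CSV with the standard library. Here `embed` is a toy stand-in for `get_glove`, and the 3-dimensional vectors and sample sentences are made up.

```python
import csv
import io

DIM = 3  # toy dimensionality; the notebook uses 300


def embed(sentence):
    """Toy stand-in for get_glove: returns (vector, is_all_zero)."""
    words = sentence.split()
    if not words:
        return [0.0] * DIM, True
    return [float(len(words))] * DIM, False


def write_features(calls, out):
    """Write one CSV row per sentence; return the list of all-zero sentences."""
    writer = csv.writer(out)
    writer.writerow([f'GloVe_{j}' for j in range(DIM)] + ['Ticker', 'Date', 'Sentence_num'])
    bad = []
    for ticker, date, sentences in calls:
        for i, line in enumerate(sentences, start=1):
            vector, is_zero = embed(line.strip())
            if is_zero:
                bad.append([ticker, date, i, line.strip()])
            writer.writerow(vector + [ticker, date, i])
    return bad


calls = [("LMAT", "20150225", ["good morning everyone", "   ", "thank you"])]
buffer = io.StringIO()  # a real run would pass an open file instead
bad = write_features(calls, buffer)
print(buffer.getvalue())
print(bad)
```

If the full table is still wanted in memory, the same idea works in pandas: append one dict per sentence to a list and call `pd.DataFrame(rows)` once after the loop.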





In [14]:
glove_features.info(verbose=True)
### save ############################################
glove_features.to_csv('data/data_prep/glove_features.csv', index=False)
#####################################################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Data columns (total 303 columns):
 #    Column        Dtype  
---   ------        -----  
 0    GloVe_0       float64
 1    GloVe_1       float64
 2    GloVe_2       float64
 3    GloVe_3       float64
 4    GloVe_4       float64
 5    GloVe_5       float64
 6    GloVe_6       float64
 7    GloVe_7       float64
 8    GloVe_8       float64
 9    GloVe_9       float64
 10   GloVe_10      float64
 11   GloVe_11      float64
 12   GloVe_12      float64
 13   GloVe_13      float64
 14   GloVe_14      float64
 15   GloVe_15      float64
 16   GloVe_16      float64
 17   GloVe_17      float64
 18   GloVe_18      float64
 19   GloVe_19      float64
 20   GloVe_20      float64
 21   GloVe_21      float64
 22   GloVe_22      float64
 23   GloVe_23      float64
 24   GloVe_24      float64
 25   GloVe_25      float64
 26   GloVe_26      float64
 27   GloVe_27      float64
 28   GloVe_28      float64
 29   GloVe_29      fl

In [15]:
MAEC_glove_features = pd.DataFrame()
MAEC_bad_glove = []
for Ticker, Date in tqdm(MAEC_filename_data[['Ticker', 'Date']].values):
    Date = Date.replace('-', '')
    text_path = f"D:/MAEC_audio/{Date}_{Ticker}/text.txt"
    with open(text_path, 'r', encoding='utf-8', errors='replace') as file:
        for i, line in enumerate(file, start=1):
            sentence_embedding, sentence_has_error = get_glove(line.strip())
            if sentence_has_error:  # whenever a sentence is returned as all zeros
                MAEC_bad_glove.append([Ticker,Date,i,line.strip()])
            features_df = pd.DataFrame(sentence_embedding, columns=[f'GloVe_{j}' for j in range(300)])
            features_df['Ticker'] = Ticker
            features_df['Date'] = Date
            features_df['Sentence_num'] = i
            MAEC_glove_features = pd.concat([MAEC_glove_features, features_df], ignore_index=True)

print(len(MAEC_bad_glove))
# printed below are the sentences without word embeddings (all zeros):
# empty or whitespace-only lines and single out-of-vocabulary tokens

mininberg
shrotriya
qbfv
shrotriya
shrotriya
belanoff
tetrasun
shrotriya
shrotriya
idriver
cableone
shrotriya
thatshowilip
cableone
meenal
cableone
hyperscale
cableone
sunit
miamifoundation
psd2
chekanow
fenofibrate

100%|██████████| 3443/3443 [10:22:11<00:00, 10.84s/it]

104

In [16]:
MAEC_glove_features.info(verbose=True)
### save ############################################
MAEC_glove_features.to_csv('data/data_prep/MAEC_glove_features.csv', index=False)
#####################################################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Data columns (total 303 columns):
 #    Column        Dtype  
---   ------        -----  
 0    GloVe_0       float64
 1    GloVe_1       float64
 2    GloVe_2       float64
 3    GloVe_3       float64
 4    GloVe_4       float64
 5    GloVe_5       float64
 6    GloVe_6       float64
 7    GloVe_7       float64
 8    GloVe_8       float64
 9    GloVe_9       float64
 10   GloVe_10      float64
 11   GloVe_11      float64
 12   GloVe_12      float64
 13   GloVe_13      float64
 14   GloVe_14      float64
 15   GloVe_15      float64
 16   GloVe_16      float64
 17   GloVe_17      float64
 18   GloVe_18      float64
 19   GloVe_19      float64
 20   GloVe_20      float64
 21   GloVe_21      float64
 22   GloVe_22      float64
 23   GloVe_23      float64
 24   GloVe_24      float64
 25   GloVe_25      float64
 26   GloVe_26      float64
 27   GloVe_27      float64
 28   GloVe_28      float64
 29   GloVe_29      

# Praat features from MP3 audio files

Using Parselmouth, Praat's "Voice report" command produces the following string.

In [18]:
sound = parselmouth.Sound(r"D:\MAEC_audio\20150225_LMAT\LMAT_20150225_f000002100.mp3")
pitch = sound.to_pitch()
pulses = parselmouth.praat.call([sound, pitch], "To PointProcess (cc)")
voice_report_str = parselmouth.praat.call([sound, pitch, pulses], "Voice report", 0.0, 0.0, 75, 600, 1.3, 1.6, 0.03, 0.45)
print(voice_report_str) 

   From 0 to 0 seconds (duration: 0.370979 seconds)
Pitch:
   Median pitch: 120.438 Hz
   Mean pitch: 122.509 Hz
   Standard deviation: 4.582 Hz
   Minimum pitch: 119.282 Hz
   Maximum pitch: 132.486 Hz
Pulses:
   Number of pulses: 15
   Number of periods: 14
   Mean period: 8.145224E-3 seconds
   Standard deviation of period: 0.324092E-3 seconds
Voicing:
   Fraction of locally unvoiced frames: 61.765%   (21 / 34)
   Number of voice breaks: 0
   Degree of voice breaks: 0   (0 seconds / 0 seconds)
Jitter:
   Jitter (local): 1.718%
   Jitter (local, absolute): 139.948E-6 seconds
   Jitter (rap): 1.067%
   Jitter (ppq5): 0.811%
   Jitter (ddp): 3.201%
Shimmer:
   Shimmer (local): 17.053%
   Shimmer (local, dB): 1.576 dB
   Shimmer (apq3): 6.078%
   Shimmer (apq5): 11.513%
   Shimmer (apq11): 30.665%
   Shimmer (dda): 18.235%
Harmonicity of the voiced parts only:
   Mean autocorrelation: 0.675386
   Mean noise-to-harmonics ratio: 0.524708
   Mean harmonics-to-noise ratio: 3.356 dB
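
The report is plain text with one `label: value unit` entry per line, so each feature can be pulled out with a regular expression. The sketch below uses one number pattern that accepts an optional sign and exponent, so that a negative dB value or a value such as `8.145224E-3` parses the same way. The helper name is hypothetical, and the sample text is an abridged copy of the report above.

```python
import math
import re

# Signed decimal with optional exponent, e.g. 3.356, -1.2, 8.145224E-3
_NUMBER = r'[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?'


def read_value(report, label):
    """Return the number that follows 'label:' in a voice report, or NaN."""
    match = re.search(re.escape(label) + r':\s*(' + _NUMBER + r'|--undefined--)', report)
    if match is None or match.group(1) == '--undefined--':
        return math.nan
    return float(match.group(1))


sample = """Pitch:
   Mean pitch: 122.509 Hz
   Standard deviation: 4.582 Hz
Pulses:
   Mean period: 8.145224E-3 seconds
   Standard deviation of period: 0.324092E-3 seconds
Jitter:
   Jitter (local): 1.718%
Harmonicity of the voiced parts only:
   Mean harmonics-to-noise ratio: 3.356 dB
"""

for label in ('Mean pitch', 'Standard deviation', 'Mean period', 'Jitter (local)'):
    print(label, read_value(sample, label))
```

Returning NaN for a missing or undefined entry keeps every feature numeric, at the cost of no longer distinguishing "undefined" from "label not found".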



In [19]:
# Returns dictionary of 29 features
def get_praat(audio_dir,audio_file):
    sound = parselmouth.Sound(os.path.join(audio_dir,audio_file))
    pitch = sound.to_pitch()
    pulses = parselmouth.praat.call([sound, pitch], "To PointProcess (cc)")
    voice_report_str = parselmouth.praat.call([sound, pitch, pulses], "Voice report", 0.0, 0.0, 75, 600, 1.3, 1.6, 0.03, 0.45)
    
    # not in the voice report
    audio_length = sound.get_total_duration()
    try:
        intensity = sound.to_intensity()
        mean_intensity = intensity.values.mean()
        min_intensity = intensity.values.min()
        max_intensity = intensity.values.max()
    except Exception:  # intensity analysis failed; mark these features as undefined
        mean_intensity = '--undefined--'
        min_intensity = '--undefined--'
        max_intensity = '--undefined--'
        audio_length = '--undefined--'

    mean_pitch = re.search(r'Mean pitch: ([\d.]+|--undefined--)', voice_report_str).group(1)
    stdev_pitch = re.search(r'Standard deviation: ([\d.]+|--undefined--)', voice_report_str).group(1)
    min_pitch = re.search(r'Minimum pitch: ([\d.]+|--undefined--)', voice_report_str).group(1)
    max_pitch = re.search(r'Maximum pitch: ([\d.]+|--undefined--)', voice_report_str).group(1)
    num_pulses = re.search(r'Number of pulses: (\d+|--undefined--)', voice_report_str).group(1)
    num_periods = re.search(r'Number of periods: (\d+|--undefined--)', voice_report_str).group(1)
    mean_period = re.search(r'Mean period: ([\d.E+-]+|--undefined--)', voice_report_str).group(1)
    stdev_period = re.search(r'Standard deviation of period: ([\d.E+-]+|--undefined--)', voice_report_str).group(1)
    fraction_unvoiced = re.search(r'Fraction of locally unvoiced frames: ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    num_voice_breaks = re.search(r'Number of voice breaks: (\d+|--undefined--)', voice_report_str).group(1)
    degree_of_voice_breaks = re.search(r'Degree of voice breaks: ([\d.]+|--undefined--)', voice_report_str).group(1)
    jitter_local = re.search(r'Jitter \(local\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    jitter_local_abs = re.search(r'Jitter \(local, absolute\): ([\d.E+-]+|--undefined--)', voice_report_str).group(1)
    jitter_rap = re.search(r'Jitter \(rap\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    jitter_ppq5 = re.search(r'Jitter \(ppq5\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    jitter_ddp = re.search(r'Jitter \(ddp\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    shimmer_local = re.search(r'Shimmer \(local\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    shimmer_local_dB = re.search(r'Shimmer \(local, dB\): ([\d.]+|--undefined--) dB', voice_report_str).group(1)
    shimmer_apq3 = re.search(r'Shimmer \(apq3\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    shimmer_apq5 = re.search(r'Shimmer \(apq5\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    shimmer_apq11 = re.search(r'Shimmer \(apq11\): ([\d.]+|--undefined--)', voice_report_str).group(1)
    shimmer_dda = re.search(r'Shimmer \(dda\): ([\d.]+|--undefined--)%?', voice_report_str).group(1)
    autocorrelation = re.search(r'Mean autocorrelation: ([\d.]+|--undefined--)', voice_report_str).group(1)
    nhr = re.search(r'Mean noise-to-harmonics ratio: ([\d.]+|--undefined--)', voice_report_str).group(1)
    try:
        hnr = re.search(r'Mean harmonics-to-noise ratio: ([\d.]+|--undefined--) dB', voice_report_str).group(1)
    except AttributeError:  # no match, e.g. a negative value that [\d.]+ does not accept
        hnr = '--undefined--'

    features = {
        'Mean pitch': mean_pitch,
        'Standard deviation': stdev_pitch,
        'Minimum pitch': min_pitch,
        'Maximum pitch': max_pitch,
        'Number of pulses': num_pulses,
        'Number of periods': num_periods,
        'Mean period': mean_period,
        'Mean intensity': mean_intensity,
        'Minimum intensity': min_intensity,
        'Maximum intensity': max_intensity,
        'Standard deviation of period': stdev_period,
        'Fraction of unvoiced': fraction_unvoiced,  
        'Number of voice breaks': num_voice_breaks, 
        'Degree of voice breaks': degree_of_voice_breaks,  
        'Jitter local': jitter_local,
        'Jitter local absolute': jitter_local_abs,
        'Jitter rap': jitter_rap,
        'Jitter ppq5': jitter_ppq5,
        'Jitter ddp': jitter_ddp,
        'Shimmer local': shimmer_local,
        'Shimmer local dB': shimmer_local_dB,
        'Shimmer apq3': shimmer_apq3,
        'Shimmer apq5': shimmer_apq5,
        'Shimmer apq11': shimmer_apq11,
        'Shimmer dda': shimmer_dda,
        'Mean autocorrelation': autocorrelation,
        'Mean NHR': nhr,
        'Mean HNR': hnr,
        'Audio Length': audio_length
    }
    return features
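
`get_praat` returns the regex captures as strings and uses the sentinel `'--undefined--'` for missing values, which is why the saved feature columns are reported with dtype `object`. A minimal sketch of converting such a dictionary to floats before modelling follows. The helper name and the feature values are made up for illustration.

```python
import math


def to_float(value):
    """Convert a Praat feature value to float; '--undefined--' becomes NaN."""
    if value == '--undefined--':
        return math.nan
    return float(value)


# Made-up values in the same mixed form get_praat returns (strings and floats).
features = {
    'Mean pitch': '122.509',
    'Mean period': '8.145224E-3',
    'Mean intensity': 61.2,
    'Mean HNR': '--undefined--',
}
numeric = {name: to_float(value) for name, value in features.items()}
print(numeric)
```

On a whole DataFrame column, `pd.to_numeric(column, errors='coerce')` performs the same conversion, turning anything unparseable into NaN.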

In [20]:
# Collect whenever there is an error
bad_praat = [] # Company,Date,i,audio_file,voice_report_str,e
# set all features to undefined
undefined_features = dict.fromkeys([
    'Mean pitch', 'Standard deviation', 'Minimum pitch', 'Maximum pitch',
    'Number of pulses', 'Number of periods', 'Mean period',
    'Mean intensity', 'Minimum intensity', 'Maximum intensity',
    'Standard deviation of period', 'Fraction of unvoiced',
    'Number of voice breaks', 'Degree of voice breaks',
    'Jitter local', 'Jitter local absolute', 'Jitter rap', 'Jitter ppq5', 'Jitter ddp',
    'Shimmer local', 'Shimmer local dB', 'Shimmer apq3', 'Shimmer apq5',
    'Shimmer apq11', 'Shimmer dda',
    'Mean autocorrelation', 'Mean NHR', 'Mean HNR', 'Audio Length',
], '--undefined--')

praat_features = pd.DataFrame()
for Company,Date in tqdm(filename_data[['Company','Date']].values):
    Date = Date.replace('-', '') 
    audio_dir = f"D:/original_dataset/{Company}_{Date}/CEO"
    if os.path.exists(audio_dir):
        for i, audio_file in enumerate(os.listdir(audio_dir), start= 1):
            try:
                # skip files that are not MP3 audio
                if audio_file.lower().endswith('.mp3'):
                    features = get_praat(audio_dir,audio_file)
                    features_df = pd.DataFrame([features])
                    features_df['Company'] = Company
                    features_df['Date'] = Date
                    features_df['Sentence_num'] = i
                    features_df['audio_file'] = audio_file
                    praat_features = pd.concat([praat_features, features_df], ignore_index=True)
            except KeyboardInterrupt: break
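            except Exception as e: 
                # NOTE: voice_report_str is the global left over from the sample cell above,
                # not the report of the file that failed (get_praat keeps its own report local)
                bad_praat.append([Company,Date,i,audio_file,voice_report_str,e])
                # set all features to undefined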
                features_df = pd.DataFrame([undefined_features])
                features_df['Company'] = Company
                features_df['Date'] = Date
                features_df['Sentence_num'] = i
                features_df['audio_file'] = audio_file
                praat_features = pd.concat([praat_features, features_df], ignore_index=True)

praat_features.info(verbose=True)
### save ############################################
praat_features.to_csv('data/data_prep/praat_features.csv', index=False)
#####################################################

100%|██████████| 572/572 [1:26:55<00:00,  9.12s/it]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Mean pitch                    89722 non-null  object
 1   Standard deviation            89722 non-null  object
 2   Minimum pitch                 89722 non-null  object
 3   Maximum pitch                 89722 non-null  object
 4   Number of pulses              89722 non-null  object
 5   Number of periods             89722 non-null  object
 6   Mean period                   89722 non-null  object
 7   Mean intensity                89722 non-null  object
 8   Minimum intensity             89722 non-null  object
 9   Maximum intensity             89722 non-null  object
 10  Standard deviation of period  89722 non-null  object
 11  Fraction of unvoiced          89722 non-null  object
 12  Number of voice breaks        89722 non-null  object
 13  Degree of voice 
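
`os.listdir` returns entries in arbitrary order, and the loop above numbers every entry before the MP3 filter is applied. `Sentence_num` for the audio features therefore matches the transcript line number only if the listing happens to be in sentence order. A sketch of filtering first and then sorting with a natural key follows. It assumes, without having been checked against either dataset, that the numeric parts of the file names encode sentence order. The file names are hypothetical, loosely modeled on `Robert Probst_1_8.mp3`.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs numerically: 'x_2' sorts before 'x_10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]


def numbered_mp3s(filenames):
    """Return [(sentence_num, filename)] for the MP3 files, in natural order."""
    mp3s = sorted((f for f in filenames if f.lower().endswith('.mp3')), key=natural_key)
    return list(enumerate(mp3s, start=1))


# Hypothetical directory listing, deliberately out of order.
listing = ['speaker_1_10.mp3', 'text.txt', 'speaker_1_2.mp3', 'speaker_1_1.mp3']
print(numbered_mp3s(listing))
```

If the assumption about the file names does not hold, the ordering has to come from whatever index the dataset provides instead.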

In [21]:
print(len(bad_praat))
bad_praat

1


[['Ventas Inc',
  '20171027',
  113,
  'Robert Probst_1_8.mp3',
  '   From 0 to 0 seconds (duration: 0.370979 seconds)\nPitch:\n   Median pitch: 120.438 Hz\n   Mean pitch: 122.509 Hz\n   Standard deviation: 4.582 Hz\n   Minimum pitch: 119.282 Hz\n   Maximum pitch: 132.486 Hz\nPulses:\n   Number of pulses: 15\n   Number of periods: 14\n   Mean period: 8.145224E-3 seconds\n   Standard deviation of period: 0.324092E-3 seconds\nVoicing:\n   Fraction of locally unvoiced frames: 61.765%   (21 / 34)\n   Number of voice breaks: 0\n   Degree of voice breaks: 0   (0 seconds / 0 seconds)\nJitter:\n   Jitter (local): 1.718%\n   Jitter (local, absolute): 139.948E-6 seconds\n   Jitter (rap): 1.067%\n   Jitter (ppq5): 0.811%\n   Jitter (ddp): 3.201%\nShimmer:\n   Shimmer (local): 17.053%\n   Shimmer (local, dB): 1.576 dB\n   Shimmer (apq3): 6.078%\n   Shimmer (apq5): 11.513%\n   Shimmer (apq11): 30.665%\n   Shimmer (dda): 18.235%\nHarmonicity of the voiced parts only:\n   Mean autocorrelation: 0.6753

In [22]:
bad_MAEC_praat = []
MAEC_praat_features = pd.DataFrame()
for Ticker,Date in tqdm(MAEC_filename_data[['Ticker','Date']].values):
    Date = Date.replace('-', '') 
    audio_dir = f"D:/MAEC_audio/{Date}_{Ticker}"
    if os.path.exists(audio_dir):
        for i, audio_file in enumerate(os.listdir(audio_dir), start= 1):
            try:
                # skip files that are not MP3 audio
                if audio_file.lower().endswith('.mp3'):
                    features = get_praat(audio_dir,audio_file)
                    features_df = pd.DataFrame([features])
                    features_df['Ticker'] = Ticker
                    features_df['Date'] = Date
                    features_df['Sentence_num'] = i
                    features_df['audio_file'] = audio_file
                    MAEC_praat_features = pd.concat([MAEC_praat_features, features_df], ignore_index=True)
            except KeyboardInterrupt: break
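            except Exception as e: 
                # log to the MAEC list; voice_report_str is the global from the sample cell above.
                # The recorded run appended to bad_praat here, so the empty bad_MAEC_praat
                # shown below does not rule out MAEC failures.
                bad_MAEC_praat.append([Ticker,Date,i,audio_file,voice_report_str,e])
                # set all features to undefined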
                features_df = pd.DataFrame([undefined_features])
                features_df['Ticker'] = Ticker
                features_df['Date'] = Date
                features_df['Sentence_num'] = i
                features_df['audio_file'] = audio_file
                MAEC_praat_features = pd.concat([MAEC_praat_features, features_df], ignore_index=True)


MAEC_praat_features.info(verbose=True)
### save ############################################
MAEC_praat_features.to_csv('data/data_prep/MAEC_praat_features.csv', index=False)
#####################################################

100%|██████████| 3443/3443 [14:29:30<00:00, 15.15s/it]  


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Data columns (total 33 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Mean pitch                    394277 non-null  object
 1   Standard deviation            394277 non-null  object
 2   Minimum pitch                 394277 non-null  object
 3   Maximum pitch                 394277 non-null  object
 4   Number of pulses              394277 non-null  object
 5   Number of periods             394277 non-null  object
 6   Mean period                   394277 non-null  object
 7   Mean intensity                394277 non-null  object
 8   Minimum intensity             394277 non-null  object
 9   Maximum intensity             394277 non-null  object
 10  Standard deviation of period  394277 non-null  object
 11  Fraction of unvoiced          394277 non-null  object
 12  Number of voice breaks        394277 non-null  object
 13 

In [23]:
print(len(bad_MAEC_praat))
bad_MAEC_praat

0


[]

In [24]:
print(bad_MAEC_praat)

[]
