# A new database of healthy and pathological voices
# Create dataset

In this notebook, a dataset is created from the `-info.txt` files and saved to an Excel file.

Database: [VOICED (VOice ICar fEDerico II) database](https://physionet.org/physiobank/database/voiced/).

References:<br>
U. Cesari, G. De Pietro, E. Marciano, C. Niri, G. Sannino, and L. Verde. A new database of healthy and pathological voices. Computers & Electrical Engineering, vol. 68, pp. 310-321, May 2018.
<br><br>
Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220 [Circulation Electronic Pages; http://circ.ahajournals.org/cgi/content/full/101/23/e215]; 2000 (June 13).
<br><br>
Last modified 23 December 2018

In [1]:
import pandas as pd
import re

**Read record with all the filenames**

In [2]:
# Read record file
fileDownloadLocation = './Database/'
filename = 'RECORDS'

# Open the record file and read all lines; the file is closed automatically
with open(fileDownloadLocation + filename, "r") as fileHandler:
    listOfRecords = fileHandler.readlines()

**The features**

In [3]:
# Create header of features from the info file of the first record
filename = listOfRecords[0].strip()
# Open the info file and read all lines; the file is closed automatically
with open(fileDownloadLocation + filename + '-info.txt', "r") as fileHandler:
    listOfData = fileHandler.readlines()

In [4]:
# The first field of every line is the feature name
headerList = []
for infoLine in listOfData:
    fields = re.split('\t|,|\n', infoLine)
    headerList.append(fields[0])
# Append one extra empty column name at the end
headerList.append("")
# Remove ':' from the feature names
newheaderList = [element.replace(':', '') for element in headerList]
newheaderList

['ID',
 '',
 'Age',
 'Gender',
 'Diagnosis',
 'Occupation status',
 '',
 '',
 'Voice Handicap Index (VHI) Score',
 'Reflux Symptom Index (RSI) Score',
 '',
 '',
 'Smoker',
 'Number of cigarettes smoked per day',
 '',
 'Alcohol consumption',
 'Number of glasses containing alcoholic beverage drinked in a day',
 "Amount of water's litres drink every day",
 '',
 'Eating habits',
 'Carbonated beverages',
 'Amount of glasses drinked in a day',
 'Tomatoes',
 'Coffee',
 'Number of cups of coffee drinked in a day',
 'Chocolate',
 'Gramme of chocolate eaten in  a day',
 'Soft cheese',
 'Gramme of soft cheese eaten in a day',
 'Citrus fruits',
 'Number of citrus fruits eaten in a day',
 '']
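To make the splitting step concrete, here is a minimal sketch on a single made-up line. The sample line `"Age:\t32\n"` is an assumption about the `-info.txt` layout (a label, a tab, then a value) and is not copied from the database. The last statement shows that a blank line still yields two empty fields, which is presumably where the `''` entries in the header list come from.

```python
import re

# Hypothetical sample line; the real -info.txt layout is assumed to be
# "<label>:<TAB><value>".
sample_line = "Age:\t32\n"

# Split on tab, comma, or newline, with the same pattern as the cell above.
fields = re.split('\t|,|\n', sample_line)

label = fields[0].replace(':', '')  # feature name without the colon
value = fields[1]                   # feature value

# A blank line still produces two (empty) fields.
blank_fields = re.split('\t|,|\n', "\n")
```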

**Build dataframe**

In [5]:
data = []
for line in listOfRecords:
    record = line.strip()
    filename = record + '-info.txt'
    # Open the info file and read all lines; the file is closed automatically
    with open(fileDownloadLocation + filename, "r") as fileHandler:
        dataLine = fileHandler.readlines()

    # The second field of every line is the feature value
    dataString = []
    for infoLine in dataLine:
        fields = re.split('\t|,|\n', infoLine)
        dataString.append(fields[1])
    data.append(dataString)
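The per-file parsing above could also be wrapped in a small helper. The following is only a sketch under stated assumptions: `parse_info_file` is a hypothetical name, the demo file content is made up, and the guard for lines without any separator is an addition that the notebook itself does not have (there, `fields[1]` would raise an `IndexError` on such a line).

```python
import re
import tempfile
from pathlib import Path


def parse_info_file(path):
    """Return (labels, values) parsed from one -info.txt style file."""
    labels, values = [], []
    with open(path, "r") as handle:
        for line in handle:
            fields = re.split('\t|,|\n', line)
            labels.append(fields[0].replace(':', ''))
            # Guard (an assumption, not in the notebook): a line with no
            # tab, comma, or newline has only one field.
            values.append(fields[1] if len(fields) > 1 else '')
    return labels, values


# Demonstration on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo-info.txt"
    demo_path.write_text("ID:\tvoice000\n\nAge:\t40\nGender:")
    labels, values = parse_info_file(demo_path)
```

With such a helper, the header and the data rows would come from the same function, so the two loops could not drift apart.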

In [6]:
df = pd.DataFrame(data)
df.columns = newheaderList
# Drop all columns with name '' (the positional axis argument is no longer accepted by recent pandas)
df = df.drop(columns='')
df.head()

Unnamed: 0,ID,Age,Gender,Diagnosis,Occupation status,Voice Handicap Index (VHI) Score,Reflux Symptom Index (RSI) Score,Smoker,Number of cigarettes smoked per day,Alcohol consumption,...,Amount of glasses drinked in a day,Tomatoes,Coffee,Number of cups of coffee drinked in a day,Chocolate,Gramme of chocolate eaten in a day,Soft cheese,Gramme of soft cheese eaten in a day,Citrus fruits,Number of citrus fruits eaten in a day
0,voice001,32,m,hyperkinetic dysphonia,Researcher,15,5,no,NU,casual drinker,...,NU,sometimes,almost always,4,almost never,NU,sometimes,NU,sometimes,NU
1,voice002,55,m,healthy,Employee,17,12,casual smoker,2,habitual drinker,...,3,sometimes,sometimes,3,sometimes,NU,almost always,50 gr,almost always,2
2,voice003,34,m,hyperkinetic dysphonia (nodule),Researcher,42,26,no,NU,casual drinker,...,1,sometimes,almost always,NU,sometimes,20 gr,almost always,200 gr,almost never,NU
3,voice004,28,f,hypokinetic dysphonia,Researcher,20,9,casual smoker,NU,casual drinker,...,NU,sometimes,always,3,sometimes,NU,almost always,NU,sometimes,NU
4,voice005,54,f,hypokinetic dysphonia,Researcher,39,23,no,NU,casual drinker,...,NU,sometimes,never,NU,sometimes,150 gr,sometimes,200 gr,almost always,1


In [7]:
df.shape

(208, 24)

In [8]:
list(df.columns.values)

['ID',
 'Age',
 'Gender',
 'Diagnosis',
 'Occupation status',
 'Voice Handicap Index (VHI) Score',
 'Reflux Symptom Index (RSI) Score',
 'Smoker',
 'Number of cigarettes smoked per day',
 'Alcohol consumption',
 'Number of glasses containing alcoholic beverage drinked in a day',
 "Amount of water's litres drink every day",
 'Eating habits',
 'Carbonated beverages',
 'Amount of glasses drinked in a day',
 'Tomatoes',
 'Coffee',
 'Number of cups of coffee drinked in a day',
 'Chocolate',
 'Gramme of chocolate eaten in  a day',
 'Soft cheese',
 'Gramme of soft cheese eaten in a day',
 'Citrus fruits',
 'Number of citrus fruits eaten in a day']

**Save dataframe to Excel file**

In [9]:
# Write to Excel file
excelFile = './Datasets/dataset_InfoTxtFile.xlsx'
# The context manager saves and closes the workbook on exit
with pd.ExcelWriter(excelFile) as writer:
    df.to_excel(writer, sheet_name='Sheet1')
