# Connecting speakers, ideology and phonetics

We form our working phonetics-based dataset. 

In [1]:
# Read the raw lines produced by the gathering scripts; the file is closed automatically.
with open("gathered_data/wordssyllablesformants.txt",'r') as formants:
    list_formants=formants.readlines()

- **Data cleaning.** First, we reformat the data produced by the gathering scripts. 
We build a dictionary whose keys are tuples (speaker, word, vowel, position_in_the_word) and whose values hold the phonetics data, namely a list $[\text{count}, \text{frequency 1}, \text{frequency 2}, \text{duration}, \text{formant 1}, \text{formant 2}]$, where count is the number of occurrences accumulated so far for that key and formant 1 and formant 2 are themselves lists of 3 frequencies.
The script also populates a list of all the speakers.

In [2]:
def reformat(l,vowels_dict,names_list):
    l=l.replace('[','')
    l=l.replace(']','')
    l=l.replace("'",'')
    l=l.replace(' ','')
    l=l.split(',')
    word=l[0]
    speaker=l[1].lower()
    for i in range(3,len(l),13):
        position = ((i-3)//13)+1  # integer division, so the position stays an int (1, 2, ...)
        sublist=l[i:i+12]
        vowel=sublist[0]
        freq1=float(sublist[1])
        freq2=float(sublist[2])
        duration=float(sublist[3])
        formant1=[float(i) for i in sublist[4:7]]
        formant2=[float(i) for i in sublist[8:11]]
        if (speaker,word,vowel,position) in vowels_dict:
            v=vowels_dict[(speaker,word,vowel,position)]
            v[0]+=1
            v[1]+=freq1
            v[2]+=freq2
            v[3]+=duration
            for j in range(len(formant1)):  # j, so that the outer loop index i is not reused
                v[4][j]+=formant1[j]
                v[5][j]+=formant2[j]

        else:
            vowels_dict[(speaker,word,vowel,position)]=[1,freq1,freq2,duration,formant1,formant2]    
    names_list.append(speaker)
    
vowels_dict={}
names_list=[]
for line in list_formants:
    reformat(line,vowels_dict,names_list)

names_list=list(set(names_list)) #in order to remove duplicates
len(names_list)

804

- **Averaging the formants.** This is an important step. 
We average the phonetics data for each key of our dictionary, so that the later data analysis does not overestimate the role of certain "dominant" speakers in the ideology prediction. As a result, we still have a dictionary whose keys are (speaker, word, vowel, position_in_the_word), but each value now holds a single, averaged set of phonetics measurements.

In [3]:
for e in vowels_dict:
    v=vowels_dict[e]
    count=v[0]
    v[1]=v[1]/count
    v[2]=v[2]/count
    v[3]=v[3]/count
    formant1=v[4]
    for i in range(len(formant1)):
        v[4][i]=v[4][i]/count
        v[5][i]=v[5][i]/count
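
To make the accumulate-then-average scheme concrete, here is a minimal sketch on a single made-up entry. The key, the numbers and the helper name `average_entry` are hypothetical and only illustrate the layout `[count, freq1, freq2, duration, formant1, formant2]` used above.

```python
def average_entry(v):
    """Divide the accumulated sums in v by the count v[0], in place.

    v has the layout [count, freq1, freq2, duration, formant1, formant2],
    where formant1 and formant2 are lists of 3 summed frequencies.
    """
    count = v[0]
    for i in (1, 2, 3):
        v[i] = v[i] / count
    v[4] = [x / count for x in v[4]]
    v[5] = [x / count for x in v[5]]
    return v

# Hypothetical entry: two occurrences of the same key have been summed.
toy_key = ("somespeaker", "THE", "AH", 1)
toy = {toy_key: [2, 1200.0, 3000.0, 0.25,
                 [1000.0, 1200.0, 1400.0],
                 [2800.0, 3000.0, 3200.0]]}

for key in toy:
    average_entry(toy[key])

print(toy[toy_key])
```

Because every key is reduced to one averaged value, a speaker who says a word a hundred times contributes exactly as many data points for that word as a speaker who says it once.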

- **Getting the ideology for each speaker.**
We read the ideology of each speaker from ideodefined.txt. Not all the speakers appear in this file, because for some of them the donation records either were not found or do not allow us to conclude anything about their political orientation.
We form a dictionary whose keys are (first name, last name) tuples and whose values are the ideology. We find 433 different speakers with an ideology.

In [4]:
# File containing the speakers and their respective ideology.
with open("gathered_data/ideodefined.txt",'r') as ideology:
    list_ideology=ideology.readlines()
ideology_dict={}
names=[]
for e in list_ideology:
    l=e.rstrip('\n').split(',')  # strip the trailing newline, if there is one
    first_name=l[0]
    last_name=l[1]
    ideo=l[2]
    ideology_dict[(first_name,last_name)]=ideo
    names.append((first_name,last_name))

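We also create a dictionary to match the names, as they do not have the same format in the two files: the phonetics file uses a single concatenated string such as 'johnpaulstevens', while the ideology file uses (first name, last name) pairs.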

In [5]:
wordsyllablesformant_names=list(set(names_list))
ideology_names=list(set(names))
match={}
for m in ideology_names:
    # Iterate over a copy: removing from a list while looping over it skips elements.
    for n in wordsyllablesformant_names[:]:
        if (m[0] in n and m[1].replace("'",'') in n):  # drop the apostrophe, e.g. for Irish names
            wordsyllablesformant_names.remove(n)
            match[n]=m
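
Substring matching can produce false positives when a short first name happens to be contained in a longer one. As a point of comparison, here is a minimal sketch of a stricter variant. It assumes that the concatenated names have the form first name + optional middle names + last name (as in 'johnpaulstevens'); `strict_match` and the toy names are hypothetical and not part of the notebook.

```python
def strict_match(ideology_names, concatenated_names):
    """Map each concatenated name to the first (first, last) pair that
    it starts with and ends with, ignoring apostrophes in the last name."""
    match = {}
    for n in concatenated_names:
        for first, last in ideology_names:
            if n.startswith(first) and n.endswith(last.replace("'", "")):
                match[n] = (first, last)
                break  # keep the first pair that fits
    return match

# Toy data: 'joannsmithson' contains both 'ann' and 'smith' as substrings,
# but it is a different person and must not be matched.
toy_ideology_names = [("ann", "smith"), ("john", "stevens"), ("jane", "o'doe")]
toy_concatenated = ["joannsmithson", "johnpaulstevens", "annsmith", "janeodoe"]

toy_match = strict_match(toy_ideology_names, toy_concatenated)
print(toy_match)
```

The plain substring test would have matched 'joannsmithson' to ('ann', 'smith'); the prefix and suffix test leaves it unmatched, at the cost of missing names that are not stored in the assumed order.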

- **Adding ideology to the names in the phonetics dataset**

In [6]:
vowels_ideology={}
keys=[]
for e in vowels_dict:
    keys.append(e)
for k in keys:
    speaker=k[0]
    word=k[1]
    vowel=k[2]
    position=k[3]
    v=vowels_dict[k]
    try:
        sep_name=match[speaker]
        ideology_of_speaker=ideology_dict[sep_name]
        vowels_ideology[(speaker,ideology_of_speaker,word,vowel,position)]=v
    except KeyError:
        continue  # no ideology is known for this speaker, so the entry is skipped

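We do some more data formatting: the ideology score is binarized (0 if it is below 0.5, 1 otherwise) and each value keeps only formant1, formant2 and duration. The resulting dictionary is as follows.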

In [7]:
col=['formant1','formant2','duration']

In [20]:
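vowels_final={}
for e in vowels_ideology:
    (name,ideo,word,vowel,position) = e
    # Binarize the ideology score: 0 below 0.5, 1 otherwise.
    if float(ideo) < 0.5:
        ideo = 0
    else:
        ideo = 1
    v=vowels_ideology[e]
    # Keep only [formant1, formant2, duration]; the position is dropped from the key.
    vowels_final[(name,ideo,word,vowel)]=v[1:4]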

vowels_final

{('johnpaulstevens', 0, 'REMEDIES', 'EH'): [797.5454545454545,
  1474.6363636363637,
  0.049363636363636366],
 ('johnpaulstevens', 0, 'REMEDIES', 'AH'): [623.7272727272727,
  1571.090909090909,
  0.03936363636363637],
 ('johnpaulstevens', 0, 'REMEDIES', 'IY'): [489.6, 2266.1, 0.0992],
 ('johnpaulstevens', 0, 'FLORIDA', 'AO'): [700.8461538461538,
  903.6153846153846,
  0.09703846153846156],
 ('johnpaulstevens', 0, 'FLORIDA', 'AH'): [656.5769230769231,
  1648.1538461538462,
  0.04192307692307694],
 ('johnpaulstevens', 0, 'PATENT', 'AE'): [936.9767441860465,
  1869.2325581395348,
  0.1261860465116279],
 ('johnpaulstevens', 0, 'PATENT', 'AH'): [624.4418604651163,
  1720.6976744186047,
  0.031069767441860473],
 ('johnpaulstevens', 0, 'INFRINGEMENT', 'IH'): [628.8235294117648,
  1820.235294117647,
  0.04170588235294118],
 ('johnpaulstevens', 0, 'INFRINGEMENT', 'AH'): [628.125,
  1521.125,
  0.04025000000000001],
 ('johnpaulstevens', 0, 'ANSWER', 'AE'): [705.984,
  2186.848,
  0.1185199999999
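
The thresholding applied to the ideology scores can be isolated in a small helper. This is a minimal sketch: `binarize` is a hypothetical name, and the first four sample scores are values that appear in the ideology file shown in this notebook. The scores are stored as strings, so they are converted with `float` first, and a score of exactly 0.5 maps to 1.

```python
def binarize(score, threshold=0.5):
    """Map an ideology score (string or number) to the label 0 or 1."""
    return 0 if float(score) < threshold else 1

sample_scores = ['0.0', '0.007246376811594203', '0.6153846153846154', '1.0', '0.5']
labels = [binarize(s) for s in sample_scores]
print(labels)
```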

- **Most used words.** We list the words by decreasing frequency. Among the most frequent ones, we select words that do not seem to be ideologically relevant, and we arbitrarily pick the first 20 words that may have some ideological content.

In [10]:
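count={}
for e in vowels_final:
    word=e[2]  # keys of vowels_final are (name, ideology, word, vowel)
    try:
        count[word]+=1
    except KeyError:
        count[word]=1

import operator
# (word, count) pairs by increasing count, kept for later use.
sorted_count = sorted(count.items(), key=operator.itemgetter(1))
# The same pairs by decreasing count, displayed below.
sorted(count.items(), key=lambda x: -x[1])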

[('THE', 534),
 ('THAT', 395),
 ('PARTICULAR', 366),
 ('SITUATION', 334),
 ('DIFFERENT', 332),
 ('FEDERAL', 317),
 ('A', 315),
 ('AND', 310),
 ('NECESSARILY', 306),
 ('SPECIFICALLY', 306),
 ('DECISION', 297),
 ('ABSOLUTELY', 296),
 ('CONSTITUTIONAL', 295),
 ('BECAUSE', 294),
 ('EXAMPLE', 286),
 ('CIRCUMSTANCES', 285),
 ('IN', 282),
 ('TO', 281),
 ('PARTICULARLY', 276),
 ('CERTAINLY', 274),
 ('INDIVIDUAL', 270),
 ('IMPORTANT', 266),
 ('EXACTLY', 265),
 ('INTO', 260),
 ('DETERMINATION', 260),
 ('VERY', 256),
 ('ANY', 254),
 ('UNDER', 254),
 ('ARGUMENT', 253),
 ('JUSTICE', 252),
 ('IT', 249),
 ('APPROPRIATE', 247),
 ('ACTUALLY', 247),
 ('POSITION', 246),
 ('ABOUT', 242),
 ('OTHER', 238),
 ('SCALIA', 235),
 ('HYPOTHETICAL', 235),
 ('GOING', 233),
 ('ANOTHER', 233),
 ('ANALYSIS', 233),
 ('WHETHER', 230),
 ('BEFORE', 230),
 ('SPECIFIC', 230),
 ('ONLY', 225),
 ('MR', 224),
 ('QUESTION', 223),
 ('NECESSARY', 223),
 ('VIOLATION', 222),
 ('AUTHORITY', 219),
 ('IS', 219),
 ('OPPORTUNITY', 219),
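
The same tally can be written more compactly with `collections.Counter`. The sketch below is only an illustration on made-up keys; it assumes the key layout (name, ideology, word, vowel) displayed for `vowels_final` above, and the speaker names are hypothetical.

```python
from collections import Counter

# Hypothetical keys with the layout (name, ideology, word, vowel).
toy_keys = [
    ("speakera", 0, "THE", "AH"),
    ("speakera", 0, "THE", "IY"),
    ("speakerb", 1, "THE", "AH"),
    ("speakerb", 1, "FEDERAL", "EH"),
]

# Count how many (speaker, vowel) entries exist for each word.
word_counts = Counter(key[2] for key in toy_keys)

# most_common() returns (word, count) pairs by decreasing count.
top_words = word_counts.most_common()
print(top_words)
```

Note that this counts dictionary entries, that is, one per speaker and vowel, not raw occurrences of the word in the transcripts.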
 

In [11]:
non_charged=['SITUATION','BECAUSE','EXAMPLE','PARTICULAR','DIFFERENT','CERTAINLY',
             'IMPORTANT','INTO','ABSOLUTELY','ABOUT','UNDER','SPECIFICALLY','WHETHER',
            'OTHER','EXACTLY']
charged=['FEDERAL','SCALIA','OPPORTUNITY','STATUTORY','HONOR','MR','LEGISLATIVE',
         'CONSTITUTIONAL','KENNEDY','AUTHORITY','GINSBURG','LEGISLATURE','PEOPLE',
        'POLICY','CONGRESS','GOVERNMENT','AMENDMENT','CONSTITUTION','ECONOMIC',
         'CALIFORNIA']

#### We split the speakers into a training group and a test group

{('aaron', 'panner'): '0.0',
 ('alan', 'freedman'): '0.0',
 ('alan', 'gura'): '1.0',
 ('alan', 'morrison'): '0.0',
 ('alan', 'untereiner'): '0.0',
 ('alexander', 'reichert'): '0.0',
 ('allison', 'zieve'): '0.0',
 ('amanda', 'leiter'): '0.0',
 ('amy', 'howe'): '0.007246376811594203',
 ('amy', 'zapp'): '1.0',
 ('andrew', 'frey'): '0.0',
 ('andrew', 'pincus'): '0.0',
 ('andrew', 'rossman'): '0.0',
 ('anita', 'alvarez'): '1.0',
 ('ann', 'oconnell'): '0.0',
 ('anthony', 'yang'): '0.0',
 ('arthur', 'bryant'): '0.0',
 ('arthur', 'fergenson'): '1.0',
 ('barbara', 'mcdowell'): '0.0',
 ('barry', 'barnett'): '0.003323262839879154',
 ('barry', 'ostrager'): '0.0',
 ('beau', 'brindley'): '0.0',
 ('bert', 'deixler'): '0.0',
 ('bert', 'rein'): '0.6153846153846154',
 ('beth', 'brinkmann'): '0.0',
 ('bradley', 'phillips'): '0.0',
 ('brian', 'barov'): '0.0',
 ('brian', 'lauten'): '0.0',
 ('brian', 'shiffrin'): '0.0',
 ('brian', 'wolfman'): '0.0',
 ('bruce', 'ennis'): '1.0',
 ('bruce', 'smith'): '0.208333
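
The cell that performs the split is not shown here. As a minimal sketch of one possible approach, the speakers can be shuffled with a fixed seed and cut at a given fraction. The helper name `split_speakers`, the 20% test fraction and the seed are all assumptions, not the notebook's actual settings; the toy speakers are taken from the output above.

```python
import random

def split_speakers(speakers, test_fraction=0.2, seed=0):
    """Return (train, test) lists of speakers, split at random but reproducibly."""
    speakers = sorted(speakers)      # fixed starting order, independent of dict/set order
    rng = random.Random(seed)        # local generator, so the global random state is untouched
    rng.shuffle(speakers)
    n_test = int(round(len(speakers) * test_fraction))
    return speakers[n_test:], speakers[:n_test]

toy_speakers = [("aaron", "panner"), ("alan", "gura"), ("amy", "howe"),
                ("bert", "rein"), ("bruce", "ennis")]
train, test = split_speakers(toy_speakers)
print(len(train), len(test))
```

Splitting by speaker, rather than by individual vowel entry, keeps all the measurements of a given speaker on the same side of the split, so the test group only contains voices the model has never seen.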

**Finally, we generate the dataset**: one file per vowel in a word (charged or non-charged). Each file contains one line per entry, of the form
`ideology, phonetics_data`
where `phonetics_data` is
`'count','formant1','formant2','duration','F11','F12','F13','F21','F22','F23'`

In [14]:
for entry in vowels_final:
    # Keys of vowels_final are (name, ideology, word, vowel); the position was dropped above.
    name,ideology,word,vowel = entry
    if word in non_charged:
        folder = "gathered_data/non_charged/"
    elif word in charged:
        folder = "gathered_data/charged/"
    else:
        continue
    # Append one line "ideology,values" to the file of this (word, vowel) pair.
    with open(folder+str(word)+'_'+str(vowel)+'.txt', 'a') as thisword:
        thisword.write(str(ideology)+','+str(vowels_final[entry]).replace('[', '').replace(']','')+'\n')
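
To check the generated files, each line can be parsed back into a label and a list of floats. This is a minimal sketch, assuming that lines look like the ones written above (an integer label followed by comma-separated floats). `parse_line` is a hypothetical helper, and an in-memory `io.StringIO` with made-up numbers stands in for a real file.

```python
import io

def parse_line(line):
    """Split 'label,x1, x2, ...' into (int label, list of floats)."""
    fields = line.strip().split(',')
    # float() tolerates the space left after each comma by str(list).
    return int(fields[0]), [float(x) for x in fields[1:]]

# Stand-in for open("gathered_data/charged/SOMEWORD_SOMEVOWEL.txt").
sample_file = io.StringIO("0,797.5, 1474.5, 0.05\n1,650.0, 1800.0, 0.04\n")

rows = [parse_line(line) for line in sample_file if line.strip()]
for label, values in rows:
    print(label, values)
```

Since the files are opened in append mode, running the generation cell twice duplicates every line; deleting the output folders before a rerun avoids this.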