# ALIGN

This notebook provides an introduction to **ALIGN**, a tool for quantifying multi-level linguistic similarity between speakers. 

***

**Table of Contents**:

* [Getting Started](#Getting-Started)
    * [Prerequisites](#Prerequisites)
    * [Preparing input data](#Preparing-input-data)
    * [Filename conventions](#Filename-conventions)
    * [User-specified parameters](#User-specified-parameters)
    * [Highest-level functions](#Highest-level-functions)
* [Setup](#Setup)
    * [Download necessary libraries](#Download-necessary-libraries)
    * [Import libraries](#Import-libraries)
    * [Specify global ALIGN settings](#Specify-global-ALIGN-settings)
* [Phase 1: Generate "prepped" transcripts](#Phase-1:-Generate-"prepped"-transcripts)
    * [Initial clean-up](#Initial-clean-up)
    * [Prepare transcript text](#Prepare-transcript-text)
    * [RUN Phase 1](#RUN-Phase-1)
* [Phase 2: Generate alignment scores](#Phase-2:-Generate-alignment-scores)
    * [Calculate similarity scores](#Calculate-similarity-scores)
    * [Generate turn-level analysis of alignment scores](#Generate-turn-level-analysis-of-alignment-scores)
    * [Generate conversation-level analysis of alignment scores](#Generate-conversation-level-analysis-of-alignment-scores)
    * [Generate surrogate pairings](#Generate-surrogate-pairings)
    * [RUN Phase 2: Actual Partners](#RUN-Phase-2:-Actual-Partners)
    * [RUN Phase 2: Surrogate Partners](#RUN-Phase-2:-Surrogate-Partners)
* [Run everything!](#Run-everything!)
    * [Phase 1: Prep](#Phase-1:-Prep)
    * [Phase 2: Real](#Phase-2:-Real)
    * [Phase 2: Surrogate](#Phase-2:-Surrogate)
    * [Speed calculations](#Speed-calculations)
    * [Printouts!](#Printouts!)

***

# Getting Started

### Prerequisites

* Jupyter Notebook with Python 2.7.13 kernel
* Pandas:
    * 0.21.1
* Numpy:
    * 1.11.3
* Scipy:
    * 0.19.0
* NLTK:
    * 3.2.5
* Gensim:
    * 3.1.0

* Necessary packages also in `package_files/requirements.txt` 

### Preparing input data

* Each input text file needs to contain a single conversation organized in an `N x 2` matrix
    * Text file must be tab-delimited.
* Each row must correspond to a single conversational turn from a speaker.
    * Rows must be temporally ordered based on their occurrence in the conversation.
    * Rows must alternate between speakers.
* Speaker identifier and content for each turn are divided across two columns.
    * Column 1 must have the header `participant`.
        * Each cell specifies the speaker.
        * Each speaker must have a unique label (e.g., `P1` and `P2`, `0` and `1`).
    * Column 2 must have the header `content`.
        * Each cell corresponds to the transcribed utterance from the speaker.
        * Each cell must end with a newline character: `\n`
* See folder `examples > toy_data-original` in Github repository for an example

### Filename conventions

* Each conversation text file must be regularly formatted, including a prefix for dyad and a prefix for conversation prior to the identifier for each that are separated by a unique character. By default, ALIGN looks for patterns that follow this convention: `dyad1-condA.txt`
    * However, users may choose to include any label for dyad or condition so long as the two labels are distinct from one another and are not subsets of any possible dyad or condition labels. Users may also use any character as a separator so long as it does not occur anywhere else in the filename.
    * The chosen file format **must** be used when saving **all** files for this analysis.

### Highest-level functions

Given appropriately prepared transcript files, ALIGN can be run in 3 high-level functions:

`prepare_transcripts`

* Pre-process each standardized conversation, checking it conforms to the requirements.
* Each utterance is tokenized and lemmatized and has POS tags added.

`calculate_alignment`

* Generates turn-level and conversation-level alignment scores (lexical, conceptual, and syntactic) across a range of n-gram sequences

`calculate_baseline_alignment`

* Generates a surrogate corpus.
* Runs analysis (using identical specifications from `calculate_alignment`) on the surrogate corpus.

***

# Setup

## Download necessary libraries

Below, we install all required libraries for the project. We use `conda` to install *only within the current conda environment* to prevent interfering with any other environments you may have.

In [38]:
import sys

In [39]:
# !conda install --yes --prefix {sys.prefix} pandas

In [40]:
# !conda install --yes --prefix {sys.prefix} numpy

In [41]:
# !conda install --yes --prefix {sys.prefix} scipy

In [42]:
# !conda install --yes --prefix {sys.prefix} nltk

In [43]:
# !conda install --yes --prefix {sys.prefix} gensim

## Import libraries

### Standard libraries

In [44]:
import os,re,math,csv,string,random,logging,glob,itertools,operator, sys
from os import listdir 
from os.path import isfile, join 
from collections import Counter, defaultdict, OrderedDict
from itertools import chain, combinations

### Third-party libraries

For data analysis and data handling:

In [45]:
import pandas as pd
import numpy as np
import scipy
from scipy import spatial 

For natural language processing:

In [46]:
import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn 
from nltk.tag.stanford import StanfordPOSTagger
from nltk.util import ngrams

Download the NLTK default POS tagger:

In [47]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nduran/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

**Note:** With older version of NLTK (pre 3.1), the `maxent_treebank_pos_tagger` is also available. If desired, uncomment and run the following:

In [48]:
# nltk.download('maxent_treebank_pos_tagger')

**Note**: The `StanfordPOSTagger` will be
used in conjunction with local folder `stanford-postagger-full-2017-06-09/` and `.jar` file. The `StanfordPOSTagger` also uses the trained model: `english-left3words-distsim.tagger`. These files will be called below if analysis is being run with the Stanford tagger.

For building semantic space:

In [49]:
import gensim
from gensim.models import word2vec

Let's check our environment.

In [50]:
print("Pandas Version Info:\n{}".format(pd.__version__))
print("Numpy Version Info:\n{}".format(np.__version__))
print("Scipy Version Info:\n{}".format(scipy.__version__))
print("NLTK Version Info:\n{}".format(nltk.__version__))
print("Gensim Version Info:\n{}".format(gensim.__version__))
print("Python and Environment Info:\n{}".format(sys.version))

Pandas Version Info:
0.22.0
Numpy Version Info:
1.11.3
Scipy Version Info:
0.19.0
NLTK Version Info:
3.2.5
Gensim Version Info:
3.1.0
Python and Environment Info:
2.7.13 |Anaconda 2.3.0 (x86_64)| (default, Dec 20 2016, 23:05:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]


## Specify global ALIGN settings

For purposes of demonstrating ALIGN, the directory and folder pathnames correspond to data provided in the Github repository associated with this notebook. The default option is set to analyze conversations from a single English corpus from the CHILDES database (MacWhinney, 2000), specifically, Kuczaj’s Abe corpus (Kuczaj, 1976). Here, only the last 20 conversations are evaluated. Analysis is based on default settings unless otherwise indicated.

### Directories and folders

**`INPUT_PATH`**: Set working directory, in which all notebook and supporting files are located.

In [51]:
INPUT_PATH=os.getcwd()+'/'

**`TRANSCRIPTS`**: Set variable for folder name (as string) for relative location of folder containing the original transcript files.

In [52]:
TRANSCRIPTS = INPUT_PATH + 'examples/CHILDES/childes-original/'

**`STANFORD_POS_PATH`**: Path to Stanford POS tagger files.

In [53]:
STANFORD_POS_PATH = INPUT_PATH + 'package_files/stanford-postagger-full-2017-06-09/'

**`STANFORD_LANGUAGE`**: If using stanford tagger, set language model to be used for POS tagging

In [54]:
STANFORD_LANGUAGE = 'models/english-left3words-distsim.tagger'

**`PREPPED_TRANSCRIPTS`**: Set variable for folder name (as string) for relative location of folder into which prepared transcript files will be saved.

In [55]:
PREPPED_TRANSCRIPTS = INPUT_PATH + 'examples/CHILDES/childes-prepped/'

**`ANALYSIS_READY`**: Set variable for folder name (as string) for relative location of folder into which analysis-ready dataframe files will be saved.

In [56]:
ANALYSIS_READY = 'examples/CHILDES/childes-analysis/'

**`SURROGATE_TRANSCRIPTS`**: Set variable for folder name (as string) for relative location of folder into which all prepared surrogate transcript files will be saved.

In [57]:
SURROGATE_TRANSCRIPTS = 'examples/CHILDES/childes-surrogate/'

### Analysis settings

`MAXNGRAM`: Set maximum size for n-gram chunking.

* Default: 2

In [59]:
MAXNGRAM = 2

`MINWORDS`: Set minimum number of words for each turn.

* Default: 2

**Note**: The minimum number of words must be at least as long as maximum *n*-gram size (`MAXNGRAM` above).

In [60]:
MINWORDS = 2

`ADD_STANFORD_TAGS`: Choose POS tagger. 

* Default: `False`
    * Run NLTK default POS tagger (NLTK 3.1+): `averaged_perceptron_tagger`
* Option: `True`
    * Run both NLTK default POS tagger and Stanford POS tagger. Note: Adding the Stanford POS tagger will lead to an increase in processing time. 

In [61]:
ADD_STANFORD_TAGS = False

`DELAY`: Set max delay between partner's turns when generating alignment score.

* Currently, the only acceptable value is 1 (i.e., contiguous turns).

In [62]:
DELAY = 1

`USE_FILLER_LIST`: Choose method for removing speech fillers. 

* Default: `None`
    * Does not provide additional speech fillers to be removed.
* Option: list of strings
    * Provide a list of literal strings to be removed from the transcripts.

In [63]:
USE_FILLER_LIST = None

`IGNORE_DUPLICATES`: Choose whether to remove duplicate lexial bigrams when computing syntactic alignment

* Default: `True`
    * Removes duplicate lexical bigrams.
* Option `False`
    * Keeps duplicate lexical bigrams

In [64]:
IGNORE_DUPLICATES = True

`USE_PRETRAINED_VECTORS`: Choose whether to use high-dimensional semantic model pretrained vectors from GoogleNews or to build vectors based on transcripts (each utterance/row is equivalent to a single context). Note: if there are a small number of utterance/rows then the pretrained vectors should be used. 

* Default: `False`
    * Builds high-dimensional based on input transcript
* Option `True`
    * Uses pre-trained vectors from GoogleNews

In [65]:
USE_PRETRAINED_VECTORS = False

`ALL_SURROGATES`: Choose whether to generate surrogates from all possible pairings within a condition or only from a subset of all possible pairings. 

* Default: `True`
    * Generates all possible pairings
* Option `False`
    * Generates from a subset of all possible pairings

In [66]:
ALL_SURROGATES = True

`KEEP_ORIGINAL_TURN_ORDER`: For generating surrogate transcripts, choose whether to to retain the original ordering of each surrogate partner's data or create surrogates by shuffling all turns within each surrogate partner's data. 

* Default: `True`
    * Retains original
    ordering of conversational turns
* Option `False`
    * Shuffles ordering of conversational turns

In [67]:
KEEP_ORIGINAL_TURN_ORDER = True

### Additional settings

ALIGN contains a number of other settings that users may alter if desired. We outline each below and provide the default value for user information, but we preserve them in their defaults for the sake of this notebook. More information about each argument can also be found in the docstring for each function.

* `filler_regex_and_list`: remove common fillers through regex in addition to removing a user-specified list of fillers (default: `False`)
* `high_sd_cutoff`: remove any words that occur in the dataset over a certain number of SDs greater than the mean (default: `3`)
* `low_n_cutoff`: remove any words that occur in the dataset at or below a given raw number of times (default: `1`)
* `input_as_directory`: pass a directory of files (rather than a list of file names) to process data (default: `True`)
* `save_concatenated_dataframe`: save output of Phase 1 as a single dataframe (default: `True`)
* `dyad_label`: prefix before dyad identifier in transcript filenames (default: `dyad`)
* `condition_label`: prefix before dyad identifier in transcript filenames (default: `cond`)
* `id_separator`: unique character separator between dyad and condition in transcript filenames (default: `-`)

***

# Phase 1: Generate "prepped" transcripts

## Initial clean-up

### Clean up text

In [68]:
def InitialCleanup(dataframe,
                   minwords=2,
                   use_filler_list=None,
                   filler_regex_and_list=False):
    
    """
    Perform basic text cleaning to prepare dataframe
    for analysis. Remove non-letter/-space characters,
    empty turns, turns below a minimum length, and 
    fillers.
    
    By default, preserves turns 2 words or longer.
    If desired, this may be changed by updating the
    `minwords` argument.
    
    By default, remove common fillers through regex.
    If desired, remove other words by passing a list
    of literal strings to `use_filler_list` argument, 
    and if both regex and list of additional literal
    strings are to be used, update `filler_regex_and_list=True`.
    """
    
    # only allow strings, spaces, and newlines to pass
    WHITELIST = string.letters + '\'' + ' '
     
    # remove inadvertent empty turns 
    dataframe = dataframe[pd.notnull(dataframe['content'])]
    
    # internal function: remove fillers via regular expressions
    def applyRegExpression(textFiller):
        textClean = re.sub('^(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]+\s', ' ', textFiller) # at the start of a string
        textClean = re.sub('\s(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]+\s', ' ', textClean) # within a string 
        textClean = re.sub('\s(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]$', ' ', textClean) # end of a string 
        textClean = re.sub('^(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]$', ' ', textClean) # if entire turn string        
        return textClean
    
    # create a new column with only approved text before cleaning per user-specified settings
    dataframe['clean_content'] = dataframe['content'].apply(lambda utterance: ''.join([char for char in utterance if char in WHITELIST]).lower())
    print(dataframe.head(5))
    
    # DEFAULT: remove typical speech fillers via regular expressions (examples: "um, mm, oh, hm, uh, ha")
    if use_filler_list == None and filler_regex_and_list == False:                                
        dataframe['clean_content'] = dataframe['clean_content'].apply(applyRegExpression)
        
    # OPTION 1: remove speech fillers or other words specified by user in a list
    elif use_filler_list != None and filler_regex_and_list == False:
        dataframe['clean_content'] = dataframe['clean_content'].apply(lambda utterance: ' '.join([word for word in utterance.split(" ") if word not in use_filler_list]))
            
    # OPTION 2: remove speech fillers via regular expression and any additional words from user-specified list
    elif use_filler_list != None and filler_regex_and_list == True:
        dataframe['clean_content'] = dataframe['clean_content'].apply(applyRegExpression)
        dataframe['clean_content'] = dataframe['clean_content'].apply(lambda utterance: ' '.join([word for word in utterance.split(" ") if word not in use_filler_list]))
        cleantext = " ".join(cleantext)
    
    # OPTION 3: nothing is filtered
    else:
        dataframe['clean_content'] = dataframe['clean_content']     
                
    # drop the old "content" column and rename the clean "content" column
    dataframe = dataframe.drop(['content'],axis=1)
    dataframe = dataframe.rename(index=str,
                                 columns ={'clean_content': 'content'})
        
    # remove rows that are now blank or do not meet `minwords` requirement, then drop length column    
    dataframe['utteranceLen'] = dataframe['content'].apply(lambda x: word_tokenize(x)).str.len()
    dataframe = dataframe.drop(dataframe[dataframe.utteranceLen < int(minwords)].index).drop(['utteranceLen'],axis=1)
    dataframe = dataframe.reset_index(drop=True)
        
    # return the cleaned dataframe    
    return dataframe

### Merge adjacent turns by the same participant

In [69]:
def AdjacentMerge(dataframe):

    """
    Given a dataframe of conversation turns,
    merge adjacent turns by the same speaker.
    """    
    
    repeat=1
    while repeat==1:
        l1=len(dataframe) 
        DfMerge = []
        k = 0
        if len(dataframe) > 0:
            while k < len(dataframe)-1: 
                if dataframe['participant'].iloc[k] != dataframe['participant'].iloc[k+1]:
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])         
                    k = k + 1
                elif dataframe['participant'].iloc[k] == dataframe['participant'].iloc[k+1]:                    
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k] + " " + dataframe['content'].iloc[k+1]])           
                    k = k + 2   
            if k == len(dataframe)-1:
                DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])      
        
        dataframe=pd.DataFrame(DfMerge,columns=('participant','content'))
        if l1==len(dataframe): 
            repeat=0 
                
    return dataframe

## Prepare transcript text

### Tokenize and apply spell correction

In [70]:
def Tokenize(text,nwords):
    """
    Given list of text to be processed and a list 
    of known words, return a list of edited and 
    tokenized words.
    
    Spell-checking is implemented using a
    Bayesian spell-checking algorithm 
    (http://norvig.com/spell-correct.html)
    
    """
    
    # internal function: identify possible spelling errors for a given word
    def edits1(word): 
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
        replaces   = [a + c + b[1:] for a, b in splits for c in string.lowercase if b]
        inserts    = [a + c + b     for a, b in splits for c in string.lowercase]
        return set(deletes + transposes + replaces + inserts)

    # internal function: identify known edits
    def known_edits2(word,nwords):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

    # internal function: identify known words
    def known(words,nwords): return set(w for w in words if w in nwords)

    # internal function: correct spelling
    def correct(word,nwords):
        candidates = known([word],nwords) or known(edits1(word),nwords) or known_edits2(word,nwords) or [word]
        return max(candidates, key=nwords.get)

    # expand out based on a fixed list of common contractions 
    contract_dict = { "ain't": "is not",
        "aren't": "are not",
        "can't": "cannot",
        "can't've": "cannot have",
        "'cause": "because",
        "could've": "could have",
        "couldn't": "could not",
        "couldn't've": "could not have",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hadn't've": "had not have",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he had",
        "he'd've": "he would have",
        "he'll": "he will",
        "he'll've": "he will have",
        "he's": "he is",
        "how'd": "how did",
        "how'd'y": "how do you",
        "how'll": "how will",
        "how's": "how is",
        "i'd": "i would",
        "i'd've": "i would have",
        "i'll": "i will",
        "i'll've": "i will have",
        "i'm": "i am",
        "i've": "i have",
        "isn't": "is not",
        "it'd": "it would",
        "it'd've": "it would have",
        "it'll": "it will",
        "it'll've": "it will have",
        "it's": "it is",
        "let's": "let us",
        "ma'am": "madam",
        "mayn't": "may not",
        "might've": "might have",
        "mightn't": "might not",
        "mightn't've": "might not have",
        "must've": "must have",
        "mustn't": "must not",
        "mustn't've": "must not have",
        "needn't": "need not",
        "needn't've": "need not have",
        "o'clock": "of the clock",
        "oughtn't": "ought not",
        "oughtn't've": "ought not have",
        "shan't": "shall not",
        "sha'n't": "shall not",
        "shan't've": "shall not have",
        "she'd": "she would",
        "she'd've": "she would have",
        "she'll": "she will",
        "she'll've": "she will have",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "shouldn't've": "should not have",
        "so've": "so have",
        "so's": "so as",
        "that'd": "that had",
        "that'd've": "that would have",
        "that's": "that is",
        "there'd": "there would",
        "there'd've": "there would have",
        "there's": "there is",
        "they'd": "they would",
        "they'd've": "they would have",
        "they'll": "they will",
        "they'll've": "they will have",
        "they're": "they are",
        "they've": "they have",
        "to've": "to have",
        "wasn't": "was not",
        "we'd": "we would",
        "we'd've": "we would have",
        "we'll": "we will",
        "we'll've": "we will have",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what will",
        "what'll've": "what will have",
        "what're": "what are",
        "what's": "what is",
        "what've": "what have",
        "when's": "when is",
        "when've": "when have",
        "where'd": "where did",
        "where's": "where is",
        "where've": "where have",
        "who'll": "who will",
        "who'll've": "who will have",
        "who's": "who is",
        "who've": "who have",
        "why's": "why is",
        "why've": "why have",
        "will've": "will have",
        "won't": "will not",
        "won't've": "will not have",
        "would've": "would have",
        "wouldn't": "would not",
        "wouldn't've": "would not have",
        "y'all": "you all",
        "y'all'd": "you all would",
        "y'all'd've": "you all would have",
        "y'all're": "you all are",
        "y'all've": "you all have",
        "you'd": "you would",
        "you'd've": "you would have",
        "you'll": "you will",
        "you'll've": "you will have",
        "you're": "you are",
        "you've": "you have" }
    contractions_re = re.compile('(%s)' % '|'.join(contract_dict.keys()))      

    # internal function:    
    def expand_contractions(text, contractions_re=contractions_re):
        def replace(match):
            return contract_dict[match.group(0)]
        return contractions_re.sub(replace, text.lower())

    # process all words in the text
    cleantoken = []
    text = expand_contractions(text)
    token = word_tokenize(text)
    for word in token:        
        if "'" not in word:
            cleantoken.append(correct(word,nwords))
        else:
            cleantoken.append(word) 
    return cleantoken

### Lemmatize

In [71]:
def pos_to_wn(tag):
    """
    Convert NLTK default tagger output into a format that Wordnet
    can use in order to properly lemmatize the text.
    """
    
    # create some inner functions for simplicity
    def is_noun(tag):
        return tag in ['NN', 'NNS', 'NNP', 'NNPS']
    def is_verb(tag):
        return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    def is_adverb(tag):
        return tag in ['RB', 'RBR', 'RBS']
    def is_adjective(tag):
        return tag in ['JJ', 'JJR', 'JJS']
    
    # check each tag against possible categories
    if is_noun(tag):
        return wn.NOUN
    elif is_verb(tag):
        return wn.VERB
    elif is_adverb(tag):
        return wn.ADV
    elif is_adjective(tag):
        return wn.ADJ
    else:
        return wn.NOUN

In [72]:
def Lemmatize(tokenlist):
    lemmatizer = WordNetLemmatizer() 
    defaultPos = nltk.pos_tag(tokenlist) # get the POS tags from NLTK default tagger
    words_lemma = []
    for item in defaultPos:  
        words_lemma.append(lemmatizer.lemmatize(item[0],pos_to_wn(item[1]))) # need to convert POS tags to a format (NOUN, VERB, ADV, ADJ) that wordnet uses to lemmatize
    return words_lemma

### Part-of-speech tagging

In [73]:
def ApplyPOSTagging(df,
                    filename,
                    add_stanford_tags=False,
                    stanford_pos_path=None,
                    stanford_language_path=None):

    """
    Given a dataframe of conversation turns, return a new
    dataframe with part-of-speech tagging. Add filename
    (given as string) as a new column in returned dataframe.
    
    By default, return only tags from the NLTK default POS 
    tagger. Optionally, also return Stanford POS tagger 
    results by setting `add_stanford_tags=True`. 
    
    If Stanford POS tagging is desired, specify the
    location of the Stanford POS tagger with the 
    `stanford_pos_path` argument. Also note that the 
    default language model for the Stanford tagger is 
    English (english-left3words-distsim.tagger). To change 
    language model, specify the location with the
    `stanford_language_path` argument.
    
    """
    
    # if desired, import Stanford tagger
    if add_stanford_tags == True:
        if stanford_pos_path == None or stanford_language_path == None:
            raise ValueError('Error! Specify path to Stanford POS tagger and language model using the `stanford_pos_path` and `stanford_language_path` arguments')
        else:
            stanford_tagger = StanfordPOSTagger(stanford_pos_path + stanford_language_path,
                                                stanford_pos_path + 'stanford-postagger.jar')
    
    # add new columns to dataframe
    df['tagged_token'] = df['token'].apply(nltk.pos_tag)
    df['tagged_lemma'] = df['lemma'].apply(nltk.pos_tag)
    
    # if desired, also tag with Stanford tagger
    if add_stanford_tags == True:
        df['tagged_stan_token'] = df['token'].apply(stanford_tagger.tag)
        df['tagged_stan_lemma'] = df['lemma'].apply(stanford_tagger.tag)

    df['file'] = filename
        
    # return finished dataframe
    return df

## RUN Phase 1

In [74]:
def prepare_transcripts(input_files, 
              output_file_directory,
              training_dictionary,
              minwords=2,
              use_filler_list=None,
              filler_regex_and_list=False,
              add_stanford_tags=False,
              stanford_pos_path=None,
              stanford_language_path=None,
              input_as_directory=True,
              save_concatenated_dataframe=True):   

    """
    Given individual .txt files of conversations, 
    return a completely prepared dataframe of transcribed 
    conversations for later ALIGN analysis, including: text 
    cleaning, merging adjacent turns, spell-checking, 
    tokenization, lemmatization, and part-of-speech tagging. 
    The output serve as the input for later ALIGN
    analysis.
    
    By default, set a minimum number of words in a turn to
    2. If desired, this may be chaged by changing the
    `minwords` file.
    
    By default, remove common fillers through regex.
    If desired, remove other words by passing a list
    of literal strings to `use_filler_list` argument, 
    and if both regex and list of additional literal
    strings are to be used, update `filler_regex_and_list=True`.
    
    By default, return only the NLTK default 
    POS tagger values. Optionally, also return Stanford POS 
    tagger values with `add_stanford_tags=True`.
    
    If Stanford POS tagging is desired, specify the
    location of the Stanford POS tagger with the 
    `stanford_pos_path` argument.
    
    By default, accept `input_files` as a directory
    that includes `.txt` files of each individual 
    conversation. If desired, provide individual files
    as a list of literal paths to the `input_files`
    argument and set `input_as_directory=False`.
    
    By default, produce a single concatenated dataframe
    of all processed conversations in the output directory. 
    If desired, suppress concatenated dataframe with 
    `save_concatenated_dataframe=False`.
    """
    
    # create an internal function to train the model
    def train(features): 
        model = defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model
        
    # train our spell-checking model
    nwords = train(re.findall('[a-z]+',(file(training_dictionary).read().lower())))
    
    # grab the appropriate files
    if input_as_directory==False:
        file_list = glob.glob(input_files)
    else: 
        file_list = glob.glob(input_files+"*.txt")
    
    # cycle through all files 
    main = pd.DataFrame()
    for fileName in file_list:  
        
        # let us know which file we're processing
        dataframe = pd.read_csv(fileName, sep='\t',encoding='utf-8')
        print "Processing: "+fileName

        # clean up, merge, spellcheck, tokenize, lemmatize, and POS-tag
        dataframe = InitialCleanup(dataframe,
                                  minwords=minwords,
                                  use_filler_list=use_filler_list,
                                  filler_regex_and_list=filler_regex_and_list)
        dataframe = AdjacentMerge(dataframe)
        
        # tokenize and lemmatize 
        dataframe['token'] = dataframe['content'].apply(Tokenize,
                                     args=(nwords,))
        dataframe['lemma'] = dataframe['token'].apply(Lemmatize)

        # apply part-of-speech tagging
        dataframe = ApplyPOSTagging(dataframe,  
                                    filename = os.path.basename(fileName),
                                    add_stanford_tags=add_stanford_tags,
                                    stanford_pos_path=stanford_pos_path,
                                    stanford_language_path=stanford_language_path)
        
        # export the conversation's dataframe as a CSV
        dataframe.to_csv(output_file_directory + os.path.basename(fileName), 
                         encoding='utf-8',index=False,sep='\t')
        main = main.append(dataframe)

    # save the concatenated dataframe
    if save_concatenated_dataframe != False:
        main.to_csv(output_file_directory + '../' + "align_concatenated_dataframe.txt",
                    encoding='utf-8',index=False, sep='\t')
    
    # return the dataframe
    return main

***

# Phase 2: Generate alignment scores

## Calculate similarity scores

### General helper functions

In [37]:
def ngram_pos(sequence1,sequence2,ngramsize=2,
                   ignore_duplicates=True):
    """
    Remove mimicked lexical sequences from two interlocutors'
    sequences and return a dictionary of counts of ngrams
    of the desired size for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.    
    
    By default, ignore duplicate lexical n-grams when
    processing these sequences. If desired, this may
    be changed with `ignore_duplicates=False`.
    """     
        
    # remove duplicates and recreate sequences
    sequence1 = set(ngrams(sequence1,ngramsize))
    sequence2 = set(ngrams(sequence2,ngramsize))

    # if desired, remove duplicates from sequences
    if ignore_duplicates==True:
        new_sequence1 = [tuple([''.join(pair[1]) for pair in tup]) for tup in list(sequence1 - sequence2)]
        new_sequence2 = [tuple([''.join(pair[1]) for pair in tup]) for tup in list(sequence2 - sequence1)]
    else:
        new_sequence1 = [tuple([''.join(pair[1]) for pair in tup]) for tup in sequence1]
        new_sequence2 = [tuple([''.join(pair[1]) for pair in tup]) for tup in sequence2]
        
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)

In [38]:
def ngram_lexical(sequence1,sequence2,ngramsize=2):
    """
    Create ngrams of the desired size for each of two
    interlocutors' sequences and return a dictionary 
    of counts of ngrams for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.  
    """   
    
    # generate ngrams
    sequence1 = list(ngrams(sequence1,ngramsize))
    sequence2 = list(ngrams(sequence2,ngramsize)) 

    # join for counters
    new_sequence1 = [' '.join(pair) for pair in sequence1]
    new_sequence2 = [' '.join(pair) for pair in sequence2]
    
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)

In [39]:
def get_cosine(vec1, vec2): 
    """
    Derive cosine similarity metric, standard measure.
    Adapted from <https://stackoverflow.com/a/33129724>.
    """     
    
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator    

In [40]:
def build_composite_semantic_vector(lemma_seq,vocablist,highDimModel):
    """
    Function for producing vocablist and model is called in the main loop
    """
    
    ## filter out words in corpus that do not appear in vocablist (either too rare or too frequent)
    filter_lemma_seq = [word for word in lemma_seq if word in vocablist]    
    ## build composite vector
    getComposite = [0] * len(highDimModel[vocablist[1]])        
    for w1 in filter_lemma_seq:
        if w1 in highDimModel.vocab:
            semvector = highDimModel[w1]
            getComposite = getComposite + semvector
    return getComposite

###  Build semantic space

In [41]:
def BuildSemanticModel(semantic_model_input_file,   
                        pretrained_input_file,
                        use_pretrained_vectors=False,                     
                        high_sd_cutoff=3,
                        low_n_cutoff=1):
    
    """
    Given an input file produced by the ALIGN Phase 1 functions, 
    build a semantic model from all transcripts in all conversations
    in target corpus after removing high- and low-frequency words.
    High-frequency words are determined by a user-defined number of
    SDs over the mean (by default, `high_sd_cutoff=3`). Low-frequency
    words must appear over a specified number of raw occurrences 
    (by default, `low_n_cutoff=1`).
    
    Frequency cutoffs can be removed by `high_sd_cutoff=None` and/or
    `low_n_cutoff=0`.
    """
    
    # build vocabulary list from transcripts
    data1 = pd.read_csv(semantic_model_input_file, sep='\t',encoding='utf-8')
        
    # get frequency count of all included words        
    all_sentences = [re.sub('[^\w\s]+','',str(row)).split(' ') for row in list(data1['lemma'])]
    all_words = list([a for b in all_sentences for a in b])  
    frequency = defaultdict(int)
    for word in all_words:
        frequency[word] += 1

    # remove words that only occur more frequently than our cutoff (defined in occurrences)
    frequency = {word: freq for word, freq in frequency.iteritems() if freq > low_n_cutoff}
    
    # if desired, remove high-frequency words (over user-defined SDs above mean) 
    if high_sd_cutoff == None:
        contentWords = [word for word in frequency.keys()] 
    else:
        getOut = np.mean(frequency.values())+(np.std(frequency.values())*(high_sd_cutoff))
        contentWords = {word: freq for word, freq in frequency.iteritems() if freq < getOut}.keys()
    
    # decide whether to build semantic model from scratch or load in pretrained vectors
    if use_pretrained_vectors == False:
        keepSentences = [[word for word in row if word in contentWords] for row in all_sentences]
        semantic_model = word2vec.Word2Vec(all_sentences, min_count=low_n_cutoff)
    else:
        if pretrained_input_file == None:
            raise ValueError('Error! Specify path to pretrained vector file using the `pretrained_input_file` argument.')
        else:
            semantic_model = gensim.models.KeyedVectors.load_word2vec_format(pretrained_input_file, binary=True)    
        
    # return all the content words and the trained word vectors
    return contentWords, semantic_model.wv

### Calculate lexical and POS alignment scores for each n-gram length across two comparison vectors

In [42]:
def LexicalPOSAlignment(tok1,lem1,penn_tok1,penn_lem1,
                             tok2,lem2,penn_tok2,penn_lem2,
                             stan_tok1=None,stan_lem1=None,
                             stan_tok2=None,stan_lem2=None,
                             maxngram=2,
                             ignore_duplicates=True,
                             add_stanford_tags=False):
    
    """
    Derive lexical and part-of-speech alignment scores
    between interlocutors (suffix `1` and `2` in arguments
    passed to function). 
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider only bigram when calculating
    similarity. If desired, this window may be expanded 
    by changing the `maxngram` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # create empty dictionaries for syntactic similarity
    syntax_penn_tok = {}
    syntax_penn_lem = {}
    
    # if desired, generate Stanford-based scores
    if add_stanford_tags == True:
        syntax_stan_tok = {}
        syntax_stan_lem = {}
    
    # create empty dictionaries for lexical similarity
    lexical_tok = {}
    lexical_lem = {}
    
    # cycle through all desired ngram lengths
    for ngram in range(2,maxngram+1):
                
        # calculate similarity for lexical ngrams (tokens and lemmas)
        [vectorT1, vectorT2] = ngram_lexical(tok1,tok2,ngramsize=ngram)
        [vectorL1, vectorL2] = ngram_lexical(lem1,lem2,ngramsize=ngram)        
        lexical_tok['lexical_tok{0}'.format(ngram)] = get_cosine(vectorT1,vectorT2)
        lexical_lem['lexical_lem{0}'.format(ngram)] = get_cosine(vectorL1, vectorL2)
        
        # calculate similarity for Penn POS ngrams (tokens)
        [vector_penn_tok1, vector_penn_tok2] = ngram_pos(penn_tok1,penn_tok2,
                                                ngramsize=ngram,
                                                ignore_duplicates=ignore_duplicates) 
        syntax_penn_tok['syntax_penn_tok{0}'.format(ngram)] = get_cosine(vector_penn_tok1, 
                                                                                            vector_penn_tok2)
        # calculate similarity for Penn POS ngrams (lemmas)
        [vector_penn_lem1, vector_penn_lem2] = ngram_pos(penn_lem1,penn_lem2,
                                                              ngramsize=ngram,
                                                              ignore_duplicates=ignore_duplicates) 
        syntax_penn_lem['syntax_penn_lem{0}'.format(ngram)] = get_cosine(vector_penn_lem1, 
                                                                                            vector_penn_lem2) 

        # if desired, also calculate using Stanford POS
        if add_stanford_tags == True:         
          
            # calculate similarity for Stanford POS ngrams (tokens)
            [vector_stan_tok1, vector_stan_tok2] = ngram_pos(stan_tok1,stan_tok2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            syntax_stan_tok['syntax_stan_tok{0}'.format(ngram)] = get_cosine(vector_stan_tok1,
                                                                                                vector_stan_tok2)
                        
            # calculate similarity for Stanford POS ngrams (lemmas)
            [vector_stan_lem1, vector_stan_lem2] = ngram_pos(stan_lem1,stan_lem2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            syntax_stan_lem['syntax_stan_lem{0}'.format(ngram)] = get_cosine(vector_stan_lem1,
                                                                                                vector_stan_lem2)
        
    # return requested information
    if add_stanford_tags == True:
        dictionaries_list = [syntax_penn_tok, syntax_penn_lem,
                             syntax_stan_tok, syntax_stan_lem, 
                             lexical_tok, lexical_lem]      
    else:
        dictionaries_list = [syntax_penn_tok, syntax_penn_lem,
                             lexical_tok, lexical_lem]      
            
    return dictionaries_list

## Generate turn-level analysis of alignment scores

In [43]:
def conceptualAlignment(lem1, lem2, vocablist, highDimModel):
    
    """
    Calculate conceptual alignment scores from list of lemmas
    from between two interocutors (suffix `1` and `2` in arguments
    passed to function) using `word2vec`.
    """

    # aggregate composite high-dimensional vectors of all words in utterance
    W2Vec1 = build_composite_semantic_vector(lem1,vocablist,highDimModel)
    W2Vec2 = build_composite_semantic_vector(lem2,vocablist,highDimModel)

    # return cosine distance alignment score
    return 1 - spatial.distance.cosine(W2Vec1, W2Vec2) 

In [44]:
def returnMultilevelAlignment(cond_info,
                                   partnerA,tok1,lem1,penn_tok1,penn_lem1,
                                   partnerB,tok2,lem2,penn_tok2,penn_lem2,
                                   vocablist, highDimModel, 
                                   stan_tok1=None,stan_lem1=None,
                                   stan_tok2=None,stan_lem2=None,
                                   add_stanford_tags=False,
                                   maxngram=2, 
                                   ignore_duplicates=True):

    """
    Calculate lexical, syntactic, and conceptual alignment
    between a pair of turns by individual interlocutors 
    (suffix `1` and `2` in arguments passed to function), 
    including leading/following comparison directionality.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider only bigrams when calculating
    similarity. If desired, this window may be expanded 
    by changing the `maxngram` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # create empty dictionaries 
    partner_direction = {}
    condition_info = {}
    cosine_semanticL = {}
    
    # calculate lexical and syntactic alignment
    dictionaries_list = LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 maxngram=maxngram,
                                                 ignore_duplicates=ignore_duplicates,
                                                 add_stanford_tags=add_stanford_tags)
    
    # calculate conceptual alignment
    cosine_semanticL['cosine_semanticL'] = conceptualAlignment(lem1,lem2,vocablist,highDimModel)
    dictionaries_list.append(cosine_semanticL.copy())
    
    # determine directionality of leading/following comparison
    partner_direction['partner_direction'] = str(partnerA) + ">" + str(partnerB)
    dictionaries_list.append(partner_direction.copy())

    # add condition information
    condition_info['condition_info'] = cond_info    
    dictionaries_list.append(condition_info.copy())
    
    # return alignment scores
    return dictionaries_list

In [45]:
def TurnByTurnAnalysis(dataframe,
                            vocablist,
                            highDimModel, 
                            delay=1,
                            maxngram=2,
                            add_stanford_tags=False,
                            ignore_duplicates=True):    

    """
    Calculate lexical, syntactic, and conceptual alignment
    between interlocutors over an entire conversation.
    Automatically detect individual speakers by unique
    speaker codes.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # if we don't want the Stanford tagger data, set defaults
    if add_stanford_tags == False:
        stan_tok1=None
        stan_lem1=None
        stan_tok2=None
        stan_lem2=None
    
    # prepare the data to the appropriate type    
    dataframe['token'] = dataframe['token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))    
    dataframe['lemma'] = dataframe['lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_token'] = dataframe['tagged_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_token'] = dataframe['tagged_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
    dataframe['tagged_lemma'] = dataframe['tagged_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_lemma'] = dataframe['tagged_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        
    # if desired, prepare the Stanford tagger data
    if add_stanford_tags == True:           
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        
    # create lagged version of the dataframe
    df_original = dataframe.drop(dataframe.tail(delay).index,inplace=False)
    df_lagged = dataframe.shift(-delay).drop(dataframe.tail(delay).index,inplace=False)
        
    # cycle through each pair of turns
    aggregated_df = pd.DataFrame()
    for i in range(0,df_original.shape[0]):

        # identify the condition for this dataframe
        cond_info = dataframe['file'].unique()
        if len(cond_info)==1: 
            cond_info = str(cond_info[0])
        
        # break and flag error if we have more than 1 condition per dataframe
        else: 
            raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+cond_info)

        # grab all of first participant's data
        first_row = df_original.iloc[i]
        first_partner = first_row['participant']
        tok1=first_row['token']
        lem1=first_row['lemma']
        penn_tok1=first_row['tagged_token']
        penn_lem1=first_row['tagged_lemma']

        # grab all of lagged participant's data
        lagged_row = df_lagged.iloc[i]
        lagged_partner = lagged_row['participant']
        tok2=lagged_row['token']
        lem2=lagged_row['lemma']
        penn_tok2=lagged_row['tagged_token']
        penn_lem2=lagged_row['tagged_lemma']
                
        # if desired, grab the Stanford tagger data for both participants
        if add_stanford_tags == True:         
            stan_tok1=first_row['tagged_stan_token']
            stan_lem1=first_row['tagged_stan_lemma']
            stan_tok2=lagged_row['tagged_stan_token']
            stan_lem2=lagged_row['tagged_stan_lemma']
   
        # process multilevel alignment
        dictionaries_list=returnMultilevelAlignment(cond_info=cond_info,
                                                         partnerA=first_partner,
                                                         tok1=tok1,lem1=lem1,
                                                         penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                         partnerB=lagged_partner,
                                                         tok2=tok2,lem2=lem2,
                                                         penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                         vocablist=vocablist,
                                                         highDimModel=highDimModel,
                                                         stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                         stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                         maxngram = maxngram,
                                                         ignore_duplicates = ignore_duplicates,
                                                         add_stanford_tags = add_stanford_tags) 
                
        # sort columns so they are in order, append data to existing structures   
        next_df_line = pd.DataFrame.from_dict(OrderedDict(k for num, i in enumerate(d for d in dictionaries_list) for k in sorted(i.items())),
                               orient='index').transpose()
        aggregated_df = aggregated_df.append(next_df_line)
        
    # reformat turn information and add index
    aggregated_df = aggregated_df.reset_index(drop=True).reset_index().rename(columns={"index":"time"})

    # give us our finished dataframe
    return aggregated_df

Generate conversation-level analysis of alignment scores
-----------------------------------------------------

In [46]:
def ConvoByConvoAnalysis(dataframe,
                          maxngram=2,
                          ignore_duplicates=True,
                          add_stanford_tags=False):

    """
    Calculate analysis of multilevel similarity over
    a conversation between two interlocutors from a 
    transcript dataframe prepared by Phase 1
    of ALIGN. Automatically detect speakers by unique
    speaker codes.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # identify the condition for this dataframe
    cond_info = dataframe['file'].unique()
    if len(cond_info)==1: 
        cond_info = str(cond_info[0])
    
    # break and flag error if we have more than 1 condition per dataframe
    else: 
        raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+cond_info)
   
    # if we don't want the Stanford info, set defaults 
    if add_stanford_tags == False:
        stan_tok1 = None
        stan_lem1 = None
        stan_tok2 = None
        stan_lem2 = None

    # identify individual interlocutors
    df_A = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[0]]
    df_B = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[1]]
   
    # concatenate the token, lemma, and POS information for participant A
    tok1 = [word for turn in df_A['token'] for word in turn]
    lem1 = [word for turn in df_A['lemma'] for word in turn]
    penn_tok1 = [POS for turn in df_A['tagged_token'] for POS in turn]    
    penn_lem1 = [POS for turn in df_A['tagged_token'] for POS in turn] 
    if add_stanford_tags == True:        
        
        if type(df_A['tagged_stan_token'][0]) == list:
            stan_tok1 = [POS for turn in df_A['tagged_stan_token'] for POS in turn] 
            stan_lem1 = [POS for turn in df_A['tagged_stan_lemma'] for POS in turn]
            
        elif type(df_A['tagged_stan_token'][0]) == unicode:
            stan_tok1 = pd.Series(df_A['tagged_stan_token'].values).apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
            stan_tok1 = stan_tok1.apply(lambda x: zip(x[0::2],x[1::2]))
            stan_tok1 = [POS for turn in stan_tok1 for POS in turn] 
            stan_lem1 = pd.Series(df_A['tagged_stan_lemma'].values).apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
            stan_lem1 = stan_lem1.apply(lambda x: zip(x[0::2],x[1::2]))
            stan_lem1 = [POS for turn in stan_lem1 for POS in turn] 
                
    # concatenate the token, lemma, and POS information for participant B
    tok2 = [word for turn in df_B['token'] for word in turn]
    lem2 = [word for turn in df_B['lemma'] for word in turn]
    penn_tok2 = [POS for turn in df_B['tagged_token'] for POS in turn]    
    penn_lem2 = [POS for turn in df_B['tagged_token'] for POS in turn] 
    if add_stanford_tags == True:
        
        if type(df_A['tagged_stan_token'][0]) == list:
            stan_tok2 = [POS for turn in df_B['tagged_stan_token'] for POS in turn] 
            stan_lem2 = [POS for turn in df_B['tagged_stan_lemma'] for POS in turn]
            
        elif type(df_A['tagged_stan_token'][0]) == unicode:        
            stan_tok2 = pd.Series(df_B['tagged_stan_token'].values).apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
            stan_tok2 = stan_tok2.apply(lambda x: zip(x[0::2],x[1::2]))
            stan_tok2 = [POS for turn in stan_tok2 for POS in turn] 
            stan_lem2 = pd.Series(df_B['tagged_stan_lemma'].values).apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
            stan_lem2 = stan_lem2.apply(lambda x: zip(x[0::2],x[1::2]))
            stan_lem2 = [POS for turn in stan_lem2 for POS in turn]         
        
    # process multilevel alignment
    dictionaries_list = LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 maxngram=maxngram,
                                                 ignore_duplicates=ignore_duplicates,
                                                 add_stanford_tags=add_stanford_tags)
    
    # append data to existing structures
    dictionary_df = pd.DataFrame.from_dict(OrderedDict(k for num, i in enumerate(d for d in dictionaries_list) for k in sorted(i.items())),
                       orient='index').transpose()    
    dictionary_df['condition_info'] = cond_info
            
    # return the dataframe
    return dictionary_df

## Generate surrogate pairings

In [47]:
def GenerateSurrogate(original_conversation_list,
                           surrogate_file_directory,
                           all_surrogates=True,
                           keep_original_turn_order=True,
                           id_separator = '\-',
                           dyad_label='dyad',
                           condition_label='cond'):
    
    """
    Create transcripts for surrogate pairs of 
    participants (i.e., participants who did not 
    genuinely interact in the experiment), which
    will later be used to generate baseline levels 
    of alignment. Store surrogate files in a new
    folder each time the surrogate generation is run.
    
    Returns a list of all surrogate files created.

    By default, the separator between dyad ID and
    condition ID is a hyphen ('\-'). If desired,
    this may be changed in the `id_separator` 
    argument.

    By default, condition IDs will be identified as 
    any characters following `cond`. If desired,
    this may be changed with the `condition_label`
    argument.
    
    By default, dyad IDs will be identified as 
    any characters following `dyad`. If desired,
    this may be changed with the `dyad_label`
    argument.
    
    By default, generate surrogates from all possible 
    pairings. If desired, instead generate surrogates 
    only from a subset of all possible pairings
    with `all_surrogates=False`.
    
    By default, create surrogates by retaining the 
    original ordering of each surrogate partner's 
    data. If desired, create surrogates by shuffling 
    all turns within each surrogate partner's data
    with `keep_original_turn_order = False`.
    """
        
    # create a subfolder for the new set of surrogates
    import time
    new_surrogate_path = surrogate_file_directory + 'surrogate_run-' + str(time.time()) +'/'
    if not os.path.exists(new_surrogate_path):
        os.makedirs(new_surrogate_path)
        
    # grab condition types from each file name
    file_info = [re.sub('\.txt','',os.path.basename(file_name)) for file_name in original_conversation_list]
    condition_ids = list(set([re.findall('[^'+id_separator+']*'+condition_label+'.*',metadata)[0] for metadata in file_info]))    
    files_conditions = {}
    for unique_condition in condition_ids:
        next_condition_files = [add_file for add_file in original_conversation_list if unique_condition in add_file]
        files_conditions[unique_condition] = next_condition_files
        
    # cycle through conditions
    for condition in files_conditions.keys():
        
        # default: grab all possible pairs of conversations of this condition
        paired_surrogates = [pair for pair in combinations(files_conditions[condition],2)]

        # otherwise, if desired, randomly pull from all pairs to get target surrogate sample
        if all_surrogates == False:
            import math
            paired_surrogates = random.sample(paired_surrogates, 
                                              int(math.ceil(len(files_conditions[condition])/2)))

        # cycle through surrogate pairings
        for next_surrogate in paired_surrogates:

            # read in the files
            original_file1 = os.path.basename(next_surrogate[0])
            original_file2 = os.path.basename(next_surrogate[1])
            original_df1=pd.read_csv(next_surrogate[0], sep='\t',encoding='utf-8')
            original_df2=pd.read_csv(next_surrogate[1], sep='\t',encoding='utf-8')

            # get participants A and B from df1
            participantA_1_code = min(original_df1['participant'].unique())
            participantB_1_code = max(original_df1['participant'].unique())
            participantA_1 = original_df1[original_df1['participant'] == participantA_1_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_1 = original_df1[original_df1['participant'] == participantB_1_code].reset_index().rename(columns={'file': 'original_file'})

            # get participants A and B from df2
            participantA_2_code = min(original_df2['participant'].unique())
            participantB_2_code = max(original_df2['participant'].unique())
            participantA_2 = original_df2[original_df2['participant'] == participantA_2_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_2 = original_df2[original_df2['participant'] == participantB_2_code].reset_index().rename(columns={'file': 'original_file'})

            # identify truncation point for both surrogates (to have even number of turns)
            surrogateX_turns=min([participantA_1.shape[0],
                                  participantB_2.shape[0]])
            surrogateY_turns=min([participantA_2.shape[0], 
                                  participantB_1.shape[0]])

            # preserve original turn order for surrogate pairs
            if keep_original_turn_order == True:                
                surrogateX_A1 = participantA_1.truncate(after=surrogateX_turns-1,copy=False)
                surrogateX_B2 = participantB_2.truncate(after=surrogateX_turns-1,copy=False)
                surrogateX = pd.concat([surrogateX_A1,surrogateX_B2]).sort_index(kind="mergesort").reset_index(drop=True).rename(columns={'index': 'original_index'})

                surrogateY_A2 = participantA_2.truncate(after=surrogateY_turns-1,copy=False)
                surrogateY_B1 = participantB_1.truncate(after=surrogateY_turns-1,copy=False)
                surrogateY = pd.concat([surrogateY_A2,surrogateY_B1]).sort_index(kind="mergesort").reset_index(drop=True).rename(columns={'index': 'original_index'})

            # otherwise, if desired, just shuffle all turns within participants
            else:

                # shuffle for first surrogate pairing
                surrogateX_A1 = participantA_1.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX_B2 = participantB_2.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX = pd.concat([surrogateX_A1,surrogateX_B2]).sort_index(kind="mergesort").reset_index(drop=True).rename(columns={'index': 'original_index'})

                # and for second surrogate pairing
                surrogateY_A2 = participantA_2.truncate(after=surrogateY_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY_B1 = participantB_1.truncate(after=surrogateY_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY = pd.concat([surrogateY_A2,surrogateY_B1]).sort_index(kind="mergesort").reset_index(drop=True).rename(columns={'index': 'original_index'})

            # create filename for our surrogate file
            original_dyad1 = re.findall(dyad_label+'[^'+id_separator+']*',original_file1)[0]
            original_dyad2 = re.findall(dyad_label+'[^'+id_separator+']*',original_file2)[0]
            surrogateX['file'] = original_dyad1 + '-' + original_dyad2 + '-' + condition
            surrogateY['file'] = original_dyad2 + '-' + original_dyad1 + '-' + condition                
            nameX='SurrogatePair-'+original_dyad1+'A'+'-'+original_dyad2+'B'+'-'+condition+'.txt'
            nameY='SurrogatePair-'+original_dyad2+'A'+'-'+original_dyad1+'B'+'-'+condition+'.txt'

            # save to file
            surrogateX.to_csv(new_surrogate_path + nameX, encoding='utf-8',index=False,sep='\t')
            surrogateY.to_csv(new_surrogate_path + nameY, encoding='utf-8',index=False,sep='\t')
            
    # return list of all surrogate files
    return glob.glob(new_surrogate_path+"*.txt") 

## RUN Phase 2: Actual Partners

In [48]:
def calculate_alignment(input_files, 
                    output_file_directory,
                    semantic_model_input_file,
                    pretrained_input_file,
                    high_sd_cutoff=3,
                    low_n_cutoff=1,
                    delay=1,
                    maxngram=2,
                    use_pretrained_vectors=False,
                    ignore_duplicates=True,
                    add_stanford_tags=False,
                    input_as_directory=True):   
    
    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `prepare_transcripts` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which a word will be 
    removed if they occur 1 or fewer times. if
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    
    By default, accept `input_files` as a directory
    that includes `.txt` files of each individual 
    conversation. If desired, provide individual files
    as a list of literal paths to the `input_files`
    argument and set `input_as_directory=False`.
    """
    
    # grab the files in the list
    if input_as_directory == False:
        file_list = glob.glob(input_files)
    else:
        file_list = glob.glob(input_files+"*.txt")
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                       pretrained_input_file=pretrained_input_file,
                                                       use_pretrained_vectors=use_pretrained_vectors,
                                                       high_sd_cutoff=high_sd_cutoff,
                                                       low_n_cutoff=low_n_cutoff)  
    
    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
        
    # cycle through each prepared file
    for fileName in file_list:   
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print "Processing: "+fileName   

            # calculate turn-by-turn alignment scores
            xT2T=TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel,
                                         add_stanford_tags=add_stanford_tags,
                                         ignore_duplicates=ignore_duplicates)   
            AlignmentT2T=AlignmentT2T.append(xT2T)
            
            # calculate conversation-level alignment scores
            xC2C = ConvoByConvoAnalysis(dataframe=dataframe,
                                             maxngram = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             add_stanford_tags = add_stanford_tags)
            AlignmentC2C=AlignmentC2C.append(xC2C)
            
        # if it's invalid, let us know
        else:
            print "Invalid file: "+fileName   
            
    # update final dataframes
    FINAL_TURN = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN.to_csv(output_file_directory+"AlignmentT2T.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO.to_csv(output_file_directory+"AlignmentC2C.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN, FINAL_CONVO

## RUN Phase 2: Surrogate Partners

In [49]:
def calculate_baseline_alignment(input_files, 
                         surrogate_file_directory,
                         output_file_directory,
                         semantic_model_input_file,
                         pretrained_input_file,   
                         high_sd_cutoff=3,
                         low_n_cutoff=1,
                         id_separator = '\-',
                         condition_label='cond',
                         dyad_label='dyad',
                         all_surrogates=True,
                         keep_original_turn_order=True,
                         delay=1,
                         maxngram=2,
                         use_pretrained_vectors=False,   
                         ignore_duplicates=True,
                         add_stanford_tags=False,
                         input_as_directory=True):   
    
    
    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `prepare_transcripts` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics
    for surrogate baseline conversations.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which a word will be 
    removed if they occur 1 or fewer times. if
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    
    By default, the separator between dyad ID and
    condition ID in each file name is a hyphen ('\-'). 
    If desired, this may be changed with the 
    `id_separator` argument.

    By default, condition IDs in each file name
    will be identified as any characters following 
    `cond`. If desired, this may be changed with the 
    `condition_label` argument. 
    
    By default, dyad IDs in each file name
    will be identified as any characters following 
    `dyad`. If desired, this may be changed with the 
    `dyad_label` argument.
    
    By default, generate surrogates from all possible 
    pairings within a condition.  If desired, instead 
    generate surrogates only from a subset of all
    possible pairings within a condition with 
    `all_surrogates=False`.
    
    By default, accept `input_files` as a directory
    that includes `.txt` files of each individual 
    conversation. If desired, provide individual files
    as a list of literal paths to the `input_files`
    argument and set `input_as_directory=False`.
    """
    
    # grab the files in the input list
    if input_as_directory==False:
        file_list = glob.glob(input_files)
    else:
        file_list = glob.glob(input_files+"*.txt")
    
    # create a surrogate file list
    surrogate_file_list = GenerateSurrogate(original_conversation_list = file_list,
                                                   surrogate_file_directory = surrogate_file_directory,
                                                   all_surrogates = all_surrogates,
                                                   id_separator = id_separator,
                                                   condition_label = condition_label,
                                                   dyad_label = dyad_label,
                                                   keep_original_turn_order = keep_original_turn_order) 
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                       pretrained_input_file=pretrained_input_file,
                                                       use_pretrained_vectors=use_pretrained_vectors,
                                                       high_sd_cutoff=high_sd_cutoff,
                                                       low_n_cutoff=low_n_cutoff)  
    
    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
    
    # cycle through the files
    for fileName in surrogate_file_list:
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print "Processing: "+fileName   

            # calculate turn-by-turn alignment scores
            xT2T=TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel,
                                         add_stanford_tags = add_stanford_tags)
            AlignmentT2T=AlignmentT2T.append(xT2T)
                        
            # calculate conversation-level alignment scores
            xC2C = ConvoByConvoAnalysis(dataframe=dataframe,
                                             maxngram = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             add_stanford_tags = add_stanford_tags)
            AlignmentC2C=AlignmentC2C.append(xC2C)
        
        # if it's invalid, let us know
        else:
            print "Invalid file: "+fileName   
            
    # update final dataframes
    FINAL_TURN_SURROGATE = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO_SURROGATE = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN_SURROGATE.to_csv(output_file_directory+"AlignmentT2T_Surrogate.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO_SURROGATE.to_csv(output_file_directory+"AlignmentC2C_Surrogate.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN_SURROGATE, FINAL_CONVO_SURROGATE

***

# Run everything!

Now that we've walked through all of our functions, let's try out ALIGN on some of the CHILDES data. We'll be getting a sense of the length of time it takes to run ALIGN with each step and then take a peek at the resulting data.

## Phase 1: Prep

In [50]:
import time
start_phase1 = time.time()

In [67]:
model_store = prepare_transcripts(
          input_files=TRANSCRIPTS,
          minwords=2,
          add_stanford_tags=False,
          output_file_directory=PREPPED_TRANSCRIPTS,
          use_filler_list=None,
          filler_regex_and_list=False,          
          training_dictionary=INPUT_PATH+'package_files/gutenberg_tasa.txt',
          stanford_pos_path=STANFORD_POS_PATH,
          stanford_language_path=STANFORD_LANGUAGE,
          input_as_directory=True,
          save_concatenated_dataframe=True)

## Phase 2: Real

**Note**: For demonstration purposes, given the small number of transcripts in our example corpus, the example here uses pretrained vectors from Google News rather than building a new semantic space from the example corpus itself.

In [52]:
start_phase2real = time.time()

In [66]:
[turn_real,convo_real]= calculate_alignment(
                        input_files = INPUT_PATH+'examples/CHILDES/childes-prepped/',
                        add_stanford_tags=False,  
                        maxngram=2,   
                        use_pretrained_vectors=True,
                        semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt',
                        output_file_directory = INPUT_PATH+'examples/CHILDES/childes-analysis/',
                        pretrained_input_file=INPUT_PATH+'package_files/GoogleNews-vectors-negative300.bin',
                        ignore_duplicates=True,
                        delay=1,
                        high_sd_cutoff=3,
                        low_n_cutoff=1,
                        input_as_directory=True)

## Phase 2: Surrogate

**Note**: For demonstration purposes, we again use pre-trained vectors from Google News. We demonstrate other possible uses for labels by setting `dyad_label = time`, allowing us to compare alignment over time across the same speakers. We also demonstrate how to generate a subset of surrogate pairings rather than all possible pairings.

In [54]:
start_phase2surrogate = time.time()

In [65]:
[turn_surrogate,convo_surrogate] = calculate_baseline_alignment(
                                input_files = INPUT_PATH+'examples/CHILDES/childes-prepped/', 
                                add_stanford_tags=False,
                                maxngram=2,
                                use_pretrained_vectors=True,
                                all_surrogates=False,
                                keep_original_turn_order=True,
                                id_separator = '\-',
                                dyad_label='time',
                                condition_label='cond',
                                surrogate_file_directory= INPUT_PATH+'examples/CHILDES/childes-surrogate/',                                
                                output_file_directory=INPUT_PATH+'examples/CHILDES/childes-analysis/',
                                pretrained_input_file=INPUT_PATH+'package_files/GoogleNews-vectors-negative300.bin',
                                semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt',
                                ignore_duplicates=True,
                                delay=1,
                                high_sd_cutoff=3,
                                low_n_cutoff=1,
                                input_as_directory=True)

In [56]:
end=time.time()

## Speed calculations

As promised, let's take a look at how long it takes to run each section. Time is given in seconds.

Phase 1 time:

In [57]:
start_phase2real - start_phase1

33.50847101211548

Phase 2 real time:

In [58]:
start_phase2surrogate - start_phase2real

77.47824192047119

Phase 2 surrogate time:

In [59]:
end - start_phase2surrogate

77.40525507926941

All 3 phases:

In [60]:
end - start_phase1

188.39196801185608

## Printouts!

And that's it! Before we go, let's take a look at the output from the real data analyzed at the turn level for each conversation (`turn_real`) and at the conversation level for each dyad (`convo_real`). We'll then look at our surrogate data, analyzed both at the turn level (`turn_surrogate`) and at the conversation level (`convo_surrogate`). In our next step, we would then take these data and plug them into our statistical model of choice, but we'll stop here for the sake of our tutorial.

In [61]:
turn_real.head(10)

Unnamed: 0,time,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,cosine_semanticL,partner_direction,condition_info
0,0,0.0,0.0,0,0,0.285198,cgv>kid,time197-cond1.txt
1,1,0.0,0.0,0,0,0.37358,kid>cgv,time197-cond1.txt
2,2,0.154303,0.0,0,0,0.57782,cgv>kid,time197-cond1.txt
3,3,0.0,0.0,0,0,0.672067,kid>cgv,time197-cond1.txt
4,4,0.111111,0.09245,0,0,0.597504,cgv>kid,time197-cond1.txt
5,5,0.222222,0.27735,0,0,0.617649,kid>cgv,time197-cond1.txt
6,6,0.0,0.0,0,0,0.168668,cgv>kid,time197-cond1.txt
7,7,0.0,0.0,0,0,0.223091,kid>cgv,time197-cond1.txt
8,8,0.0,0.0,0,0,0.323836,cgv>kid,time197-cond1.txt
9,9,0.0,0.0,0,0,0.283156,kid>cgv,time197-cond1.txt


In [62]:
convo_real.head(10)

Unnamed: 0,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,condition_info
0,0.70911,0.70911,0.099848,0.186072,time197-cond1.txt
1,0.76849,0.76849,0.353514,0.43538,time202-cond1.txt
2,0.744802,0.744802,0.309924,0.356673,time191-cond1.txt
3,0.782399,0.782399,0.353604,0.401469,time209-cond1.txt
4,0.810753,0.810753,0.192589,0.305209,time210-cond1.txt
5,0.766315,0.766315,0.311128,0.365522,time204-cond1.txt
6,0.670246,0.670246,0.164155,0.228145,time196-cond1.txt
7,0.789571,0.789571,0.285261,0.317173,time203-cond1.txt
8,0.741248,0.741248,0.319008,0.383271,time208-cond1.txt
9,0.78544,0.78544,0.188816,0.229783,time205-cond1.txt


In [63]:
turn_surrogate.head(10)

Unnamed: 0,time,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,cosine_semanticL,partner_direction,condition_info
0,0,0.0,0.169031,0.0,0.0,0.3444,cgv>kid,time195-time204-cond1
1,1,0.210819,0.111803,0.0,0.0,0.628921,kid>cgv,time195-time204-cond1
2,2,0.105409,0.111803,0.0,0.0,0.619997,cgv>kid,time195-time204-cond1
3,3,0.0,0.0,0.13484,0.13484,0.686174,kid>cgv,time195-time204-cond1
4,4,0.0,0.0,0.0,0.0,0.559554,cgv>kid,time195-time204-cond1
5,5,0.0,0.0,0.0,0.0,0.309329,kid>cgv,time195-time204-cond1
6,6,0.0,0.0,0.0,0.0,0.35275,cgv>kid,time195-time204-cond1
7,7,0.0,0.0,0.0,0.0,0.535742,kid>cgv,time195-time204-cond1
8,8,0.0,0.0,0.353553,0.353553,0.30474,cgv>kid,time195-time204-cond1
9,9,0.0,0.0,0.0,0.0,0.456275,kid>cgv,time195-time204-cond1


In [64]:
convo_surrogate.head(10)

Unnamed: 0,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,condition_info
0,0.703404,0.703404,0.163967,0.251262,time195-time204-cond1
1,0.767801,0.767801,0.129419,0.157256,time191-time201-cond1
2,0.747546,0.747546,0.102264,0.157979,time210-time197-cond1
3,0.806707,0.806707,0.124169,0.176623,time210-time201-cond1
4,0.790831,0.790831,0.130467,0.211696,time204-time195-cond1
5,0.686797,0.686797,0.135402,0.202031,time194-time210-cond1
6,0.72722,0.72722,0.078701,0.103854,time197-time210-cond1
7,0.736246,0.736246,0.081103,0.156581,time195-time197-cond1
8,0.808982,0.808982,0.120128,0.199052,time201-time210-cond1
9,0.710756,0.710756,0.069476,0.119855,time197-time195-cond1
