# To-do list

**Things for Alex to do:**
* [ ] Handling requirements (after getting them)
* [ ] Dockerizing
* [ ] Jupyter app-ifying
* [ ] Getting Stanford tagger included automatically
* [ ] Clean up markdown text (when final notebooks are ready)
* [ ] See if I can implement w2v function (https://github.com/a-paxton/Gensim-LSI-Word-Similarities)
* [ ] Convert functions into library
* [ ] See whether `BuildSemanticSpace` can be sped up when pulling in the prebuilt GoogleNews vectors

**Things for Nick to do:**
* [x] Implement surrogate to match by conversation order AND conversation type
* [x] Make file names more intuitive
* [ ] Identify condition/dyad/number flexibly (using regex) - SKIPPED
* [x] Allow surrogate baseline to be created using a smaller subset (permutations) — 2-3x?
* [x] Do pip freeze or conda list -e > req.txt
* [**???**] Redo analysis with new baseline + consider doing sample-wise shuffled baseline - Have questions on how to proceed
* [ ] Go over manuscript again with new baseline + review comments/edits
* [x] Need to create a simple other_filler_list as a text file that can be modified by a user and imported to be used here - make note that we only catch 2-letter fillers at this point with the regular expression default 
* [x] Note that align_concatenated_dataframe.txt takes the place of forSemantic.txt. Make updates accordingly. 
* [ ] We could/should probably make `convobyconvo` an optional add-on from `turnbyturn`.
* [ ] Consider other POS taggers: https://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag
* [ ] Create semantic space using the TASA corpus to make available to users
* [ ] Still some issues with the lemmatizer: any POS tag that is not recognized defaults to NOUN. Consider other lemmatizer options that don't rely on WordNet.

***

# ALIGN

This notebook provides an introduction to **ALIGN**, a tool for quantifying multi-level linguistic similarity between speakers. 

***

**Table of Contents**:

* [Getting Started](#Getting-Started)
    * [Prerequisites](#Prerequisites)
    * [Preparing input data](#Preparing-input-data)
    * [Filename conventions](#Filename-conventions)
    * [User-specified parameters](#User-specified-parameters)
    * [Main calls](#Main-calls)
* [Setup](#Setup)
    * [Import libraries](#Import-libraries)
    * [User-specified settings](#User-specified-settings)
* [Phase 1: Generate "prepped" transcripts](#Phase-1:-Generate-"prepped"-transcripts)
* [Phase 2: Generate alignment scores](#Phase-2:-Generate-alignment-scores)

***

# Getting Started

### Prerequisites

* Jupyter Notebook with a Python 2.7.13 kernel
* Packages in `requirements.txt`

*See notes in "DISTRIBUTION ISSUES" Notebook for suggestions on how to package effectively and accommodate Python 3 users*

**To Nick**: Is the above reference still accurate? I don't see such a notebook now.

### Preparing input data

* Each input text file needs to contain a single conversation organized in an `N x 2` matrix
    * Text file must be tab-delimited.
* Each row must correspond to a single conversational turn from a speaker.
    * Rows must be temporally ordered based on their occurrence in the conversation.
    * Rows must alternate between speakers.
* Speaker identifier and content for each turn are divided across two columns.
    * Column 1 must have the header `participant`.
        * Each cell specifies the speaker.
        * Each speaker must have a unique label (e.g., `P1` and `P2`, `0` and `1`).
    * Column 2 must have the header `content`.
        * Each cell corresponds to the transcribed utterance from the speaker.
        * Each cell must end with a newline character: `\n`
* See folder `examples > toy_data-original` in Github repository for an example
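
As a rough illustration of the layout described above, the sketch below builds and re-reads a tiny tab-delimited conversation using plain string handling. The helper names (`parse_conversation`, `speakers_alternate`) and the utterances are hypothetical placeholders and are not part of ALIGN.

```python
# Hypothetical two-speaker conversation: one row per turn, speakers alternating.
turns = [
    ("P1", "so where do you want to start"),
    ("P2", "maybe with the first question"),
    ("P1", "okay that works for me"),
]

# Build the file contents: a header row, then one tab-delimited row per turn,
# each ending with a newline character.
header = ["participant", "content"]
lines = ["\t".join(header)] + ["\t".join(turn) for turn in turns]
text = "\n".join(lines) + "\n"

def parse_conversation(text):
    """Split tab-delimited text into a list of {header: value} dictionaries."""
    rows = [line.split("\t") for line in text.splitlines()]
    columns, body = rows[0], rows[1:]
    return [dict(zip(columns, row)) for row in body]

def speakers_alternate(rows):
    """Check that no two consecutive rows come from the same participant."""
    return all(a["participant"] != b["participant"]
               for a, b in zip(rows, rows[1:]))

rows = parse_conversation(text)
print(rows[0]["participant"] + " | " + rows[0]["content"])
print(speakers_alternate(rows))
```

Writing `text` to a `.txt` file would give an input in the expected shape, assuming the filename also follows the convention described in the next section.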

### Filename conventions

* Each conversation text file needs to be named in the format: `A_B.txt`
    * `A` corresponds to the dyad number for that conversation
    * `B` corresponds to a condition code for that conversation
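
A minimal sketch of how such a name could be split into its two parts, assuming the underscore-separated `A_B.txt` form described above. The function `parse_conversation_filename` and the example path are hypothetical; ALIGN's own parsing may differ.

```python
import os

def parse_conversation_filename(path):
    """Split a name such as '12_3.txt' into (dyad, condition) strings."""
    stem, extension = os.path.splitext(os.path.basename(path))
    if extension != ".txt":
        raise ValueError("Expected a .txt file, got: " + path)
    parts = stem.split("_")
    # Require exactly two non-empty parts: the dyad and the condition code.
    if len(parts) != 2 or not all(parts):
        raise ValueError("Expected a name of the form A_B.txt, got: " + path)
    dyad, condition = parts
    return dyad, condition

# Hypothetical file: dyad 12, condition 3.
dyad, condition = parse_conversation_filename("examples/toy_data-original/12_3.txt")
print("dyad=" + dyad + ", condition=" + condition)
```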

### Main calls

`PHASE1RUN`

* Converts each conversation into standardized format.
* Each utterance is tokenized and lemmatized and has POS tags added.

`PHASE2RUN_REAL`

* Generates turn-level and conversation-level alignment scores (lexical, conceptual, and syntactic) across a range of n-gram sequences

`PHASE2RUN_SURROGATE`

* Generates a surrogate corpus.
* Runs the same analysis as `PHASE2RUN_REAL` on the surrogate corpus.

### Checking requirements

In [2]:
import pandas
import numpy
import scipy
import nltk
import gensim 

print("Pandas Version Info:\n{}".format(pandas.__version__))
print("Numpy Version Info:\n{}".format(numpy.__version__))
print("Scipy Version Info:\n{}".format(scipy.__version__))
print("NLTK Version Info:\n{}".format(nltk.__version__))
print("Gensim Version Info:\n{}".format(gensim.__version__))

import sys
print("Python and Conda Environment Info:\n{}".format(sys.version))


Pandas Version Info:
0.21.1
Numpy Version Info:
1.11.3
Scipy Version Info:
0.19.0
NLTK Version Info:
3.2.5
Gensim Version Info:
3.1.0
Python and Conda Environment Info:
2.7.13 |Anaconda 2.3.0 (x86_64)| (default, Dec 20 2016, 23:05:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]


**To Alex** Feel free to delete the above if no longer needed for final notebook

***

# Setup

Here, we'll get ready to run ALIGN over our target dataset.

[To top](#ALIGN).

## Import libraries

### Standard libraries

In [3]:
import os,re,math,csv,string,random,logging,glob,itertools,operator
from os import listdir 
from os.path import isfile, join 
from collections import Counter, defaultdict, OrderedDict
from itertools import chain, combinations

### Third-party libraries

For data analysis and data handling:

In [4]:
import pandas as pd
import numpy as np
from scipy import spatial 

For natural language processing:

In [5]:
import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn 
from nltk.tag.stanford import StanfordPOSTagger
from nltk.util import ngrams

Download the NLTK default POS tagger:

In [6]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nduran/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

**Note:** With older versions of NLTK (pre-3.1), the `maxent_treebank_pos_tagger` is also available. If desired, uncomment and run the following:

In [7]:
# nltk.download('maxent_treebank_pos_tagger')

**Note**: The `StanfordPOSTagger` will be used in conjunction with the local folder `stanford-postagger-2017-06-09/` and its `.jar` file. The `StanfordPOSTagger` also uses the trained model `english-left3words-distsim.tagger`. These files will be called below if the analysis is run with the Stanford tagger.

For building semantic space:

In [8]:
from gensim.models import word2vec

## User-specified settings

### Directories and folders

`INPUT_PATH`: Set the working directory, in which the notebook and all supporting files are located.

Alternatively, if going with the `anaconda-project.yml` configuration file, this is the pathname for the unzipped project folder.

In [60]:
INPUT_PATH=os.getcwd()+'/'

`TRANSCRIPTS`: Set variable for folder name (as string) for relative location of folder containing the original transcript files.

In [54]:
TRANSCRIPTS = 'examples/toy_data-original/'

`STANFORD_POS_PATH`: Path to Stanford POS tagger files.

In [11]:
STANFORD_POS_PATH = INPUT_PATH + 'package_files/stanford-postagger-2017-06-09/'

`PREPPED_TRANSCRIPTS`: Set variable for folder name (as string) for relative location of folder into which prepared transcript files will be saved.

In [12]:
PREPPED_TRANSCRIPTS = 'examples/toy_data-prepped/'

`ANALYSIS_READY`: Set variable for folder name (as string) for relative location of folder into which analysis-ready dataframe files will be saved.

In [13]:
ANALYSIS_READY = 'examples/toy_data-analysis/'

`SURROGATE_TRANSCRIPTS`: Set variable for folder name (as string) for relative location of folder into which all prepared surrogate transcript files will be saved.

In [14]:
SURROGATE_TRANSCRIPTS = 'examples/toy_data-surrogate/'

### Analysis settings

`MAXNGRAM`: Set maximum size for n-gram chunking.

* Default: 2

In [15]:
MAXNGRAM = 2

`MINWORDS`: Set minimum number of words for each turn.

* Default: 2

**Note**: The minimum number of words must be at least as large as the maximum *n*-gram size (`MAXNGRAM` above).

In [16]:
MINWORDS = 2
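
The constraint between the two settings can be checked before running anything else. The sketch below is one way to do so; `check_ngram_settings` is a hypothetical helper and is not part of ALIGN.

```python
def check_ngram_settings(maxngram, minwords):
    """Raise an error if a kept turn could be shorter than the largest n-gram."""
    if maxngram < 1:
        raise ValueError("maxngram must be a positive integer")
    if minwords < maxngram:
        raise ValueError(
            "minwords (%d) must be at least maxngram (%d)" % (minwords, maxngram))
    return True

# The default values used in this notebook satisfy the constraint.
check_ngram_settings(maxngram=2, minwords=2)
```

The reasoning is that a turn with fewer words than `MAXNGRAM` cannot contain even one n-gram of the largest size.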

`ADD_STANFORD_TAGS`: Choose POS tagger. 

* Default: `False`
    * Run NLTK default POS tagger (NLTK 3.1+): `averaged_perceptron_tagger`
* Option: `True`
    * Run both NLTK default POS tagger and Stanford POS tagger. Note: Adding the Stanford POS tagger will lead to an increase in processing time. 

In [17]:
ADD_STANFORD_TAGS = False

`DELAY`: Set the maximum delay between partners' turns when generating alignment scores.

* Currently, the only acceptable value is 1 (i.e., contiguous turns).

In [18]:
DELAY = 1

`USE_FILLER_LIST`: Choose method for removing speech fillers. 

* Default: `None`
    * Does not provide additional speech fillers to be removed.
* Option: list of strings
    * Provide a list of literal strings to be removed from the transcripts.

In [19]:
USE_FILLER_LIST = None
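
To make the list option concrete, the sketch below shows the kind of whole-word, literal matching that a filler list implies. It mirrors the list comprehension used in `InitialCleanup` in Phase 1, but `remove_listed_fillers` and the example list are hypothetical.

```python
def remove_listed_fillers(text, filler_list):
    """Drop every space-separated word that appears in filler_list."""
    kept = [word for word in text.split(" ") if word not in filler_list]
    return " ".join(kept)

# Hypothetical user-supplied list of literal strings.
example_fillers = ["yeah", "like"]

cleaned = remove_listed_fillers("yeah i like think so", example_fillers)
print(cleaned)

# Matching is exact: "yeahh" is not removed by the entry "yeah".
print(remove_listed_fillers("yeahh i think so", example_fillers))
```

Because matching is literal, variants of a filler need their own entries in the list.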

`IGNORE_DUPLICATES`: Choose whether to remove duplicate lexical bigrams when computing syntactic alignment.

* Default: `True`
    * Removes duplicate lexical bigrams.
* Option `False`
    * Keeps duplicate lexical bigrams

In [20]:
IGNORE_DUPLICATES = True

`USE_PRETRAINED_VECTORS`: Choose whether to use pretrained high-dimensional semantic vectors from GoogleNews or to build vectors from the transcripts (treating each utterance/row as a single context). Note: If the transcripts contain only a small number of utterances/rows, the pretrained vectors should be used.

* Default: `False`
    * Builds a high-dimensional semantic space from the input transcripts
* Option `True`
    * Uses pretrained vectors from GoogleNews

In [21]:
USE_PRETRAINED_VECTORS = False

`ALL_SURROGATES`: Choose whether to generate surrogates from all possible pairings within a condition or only from a subset of all possible pairings. 

* Default: `True`
    * Generates all possible pairings
* Option `False`
    * Generates from a subset of all possible pairings

In [22]:
ALL_SURROGATES = True

`KEEP_ORIGINAL_TURN_ORDER`: For generating surrogate transcripts, choose whether to retain the original ordering of each surrogate partner's data or to create surrogates by shuffling all turns within each surrogate partner's data.

* Default: `True`
    * Retains original ordering of conversational turns
* Option `False`
    * Shuffles ordering of conversational turns

In [23]:
KEEP_ORIGINAL_TURN_ORDER = True
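
The two options can be pictured with a toy example. This only illustrates the difference between retaining and shuffling one partner's turns and is not ALIGN's surrogate-generation code; `order_turns` and the fixed seed are hypothetical.

```python
import random

def order_turns(turns, keep_original_turn_order=True, seed=None):
    """Return one partner's turns either as given or in shuffled order."""
    turns = list(turns)  # copy so the caller's list is left untouched
    if not keep_original_turn_order:
        random.Random(seed).shuffle(turns)
    return turns

partner_turns = ["turn 1", "turn 2", "turn 3", "turn 4"]

retained = order_turns(partner_turns, keep_original_turn_order=True)
shuffled = order_turns(partner_turns, keep_original_turn_order=False, seed=0)

print(retained)
print(shuffled)
```

Either way, the surrogate contains the same set of turns; only their order within the partner's data differs.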

**To Alex** There are also several other options users could specify, but they seem minor and I can't imagine that they would ever need to be changed. Plus, you did a good job in the function description of their definition and settings. The above are the more important ones that are mentioned in the paper (and we might not even want to include the above in the notebook but rather in the GitHub README file):

* `use_both=False`
* `high_sd_cutoff=3`
* `low_n_cutoff=1`
* `input_as_directory=True`
* `save_concatenated_dataframe=True`
* `id_separator = '\-'`
* `condition_label='cond'`
* `dyad_label='dyad'`

***

# Phase 1: Generate "prepped" transcripts

## Initial clean-up

* **[Clean up text](#Clean-up-text)** by removing:
    * numbers, punctuation, and any other characters that are not ASCII letters, apostrophes, or spaces
    * common speech fillers (e.g., "um", "huh") and their derivations
    * empty turns that may have inadvertently been included
    * user-specified short turns
        * removes turns with fewer than `MINWORDS` words (`MINWORDS` must be at least as large as the maximum n-gram size)
* **[Merge adjacent turns by the same participant](#Merge-adjacent-turns-by-the-same-participant)** into a single utterance row.

[To top](#ALIGN).

### Clean up text

**To Alex** Added an option to use both the regex default AND a list of additional words. I was finding that "yeah" appears a lot in the DA transcripts and wanted to manually remove it via the filler list. 

In [55]:
def InitialCleanup(dataframe,
                   minwords=2,
                   use_filler_list=None,
                   use_both=False):
    
    """
    Perform basic text cleaning to prepare dataframe
    for analysis. Remove non-letter/-space characters,
    empty turns, turns below a minimum length, and 
    fillers.
    
    By default, preserves turns 2 words or longer.
    If desired, this may be changed by updating the
    `minwords` argument.
    
    By default, remove common fillers through regex.
    If desired, remove other words by passing a list
    of literal strings to `use_filler_list` argument, 
    and if both regex and list of additional literal
    strings are to be used, update `use_both=True`.
    """
    
    # only allow letters, apostrophes, and spaces to pass
    WHITELIST = string.ascii_letters + '\'' + ' '
    clean = []
    utteranceLen = []
     
    # remove inadvertent empty turns 
    dataframe = dataframe[pd.notnull(dataframe['content'])]
    
    # internal function: remove fillers via regular expressions
    def applyRegExpression(textFiller):
        textClean = re.sub('^(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]+\s', ' ', textFiller) # at the start of a string
        textClean = re.sub('\s(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]+\s', ' ', textClean) # within a string 
        textClean = re.sub('\s(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]$', ' ', textClean) # end of a string 
        textClean = re.sub('^(?!mom|am|ham)[u*|h*|m*|o*|a*]+[m*|h*|u*|a*]$', ' ', textClean) # if entire turn string        
        return textClean
    
    for value in dataframe['content'].values:            
        cleantext = ''.join(c for c in value if c in WHITELIST).lower() 
        
        # DEFAULT: remove typical speech fillers via regular expressions (examples: "um, mm, oh, hm, uh, ha");
        # this step also runs whenever `use_both=True`
        if use_filler_list is None or use_both:
            cleantext = applyRegExpression(cleantext)
        # OPTION: remove speech fillers or other words specified by user in a list,
        # either instead of the regex (`use_both=False`) or in addition to it (`use_both=True`)
        if use_filler_list is not None:
            cleantext = [word for word in cleantext.split(" ") if word not in use_filler_list]
            cleantext = " ".join(cleantext)
        # append cleaned lines
        clean.append(cleantext)        
                
    # keep the first two columns (participant, content) and overwrite "content" with the cleaned text
    dataframe = dataframe.iloc[:, [0,1]].copy()
    dataframe['content'] = clean
        
    # remove rows that are now blank or do not meet `minwords` requirement, then drop length column    
    dataframe['utteranceLen'] = dataframe['content'].apply(lambda x: word_tokenize(x)).str.len()
    dataframe = dataframe.drop(dataframe[dataframe.utteranceLen < int(minwords)].index)
    dataframe = dataframe.iloc[:, [0,1]]
        
    # return the cleaned dataframe    
    return dataframe
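
The character filtering and minimum-length steps can be tried on a few strings without a dataframe. This is a simplified sketch: `strip_to_whitelist` and `long_enough` are hypothetical names, `str.split()` stands in for `word_tokenize`, and the regular-expression filler removal is left out.

```python
import string

# Same idea as WHITELIST above: letters, apostrophes, and spaces only.
ALLOWED = string.ascii_letters + "'" + " "

def strip_to_whitelist(text):
    """Drop every character outside ALLOWED, then lowercase the result."""
    return "".join(c for c in text if c in ALLOWED).lower()

def long_enough(text, minwords=2):
    """Keep a turn only if it has at least `minwords` whitespace-separated words."""
    return len(text.split()) >= minwords

raw_turns = ["Well, I don't know!", "OK.", "That's 2 good points"]
cleaned = [strip_to_whitelist(turn) for turn in raw_turns]
kept = [turn for turn in cleaned if long_enough(turn, minwords=2)]

print(cleaned)
print(kept)
```

Note that removing a character such as the digit in the third turn can leave two adjacent spaces behind; splitting on whitespace is unaffected by this.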

### Merge adjacent turns by the same participant

In [25]:
def AdjacentMerge(dataframe):

    """
    Given a dataframe of conversation turns,
    merge adjacent turns by the same speaker.
    """    
    
    repeat=1
    while repeat==1:
        l1=len(dataframe) 
        DfMerge = []
        k = 0
        if len(dataframe) > 0:
            while k < len(dataframe)-1: 
                if dataframe['participant'].iloc[k] != dataframe['participant'].iloc[k+1]:
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])         
                    k = k + 1
                elif dataframe['participant'].iloc[k] == dataframe['participant'].iloc[k+1]:                    
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k] + " " + dataframe['content'].iloc[k+1]])           
                    k = k + 2   
            if k == len(dataframe)-1:
                DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])      
        
        dataframe=pd.DataFrame(DfMerge,columns=('participant','content'))
        if l1==len(dataframe): 
            repeat=0 
                
    return dataframe
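
The effect of merging can be seen on a small list of turns. The sketch below swaps in `itertools.groupby` on plain tuples instead of the repeated pairwise pass over a dataframe used above, so it illustrates the intended result rather than the same implementation; `merge_adjacent` is a hypothetical name.

```python
from itertools import groupby

def merge_adjacent(turns):
    """Merge consecutive (participant, content) pairs by the same speaker."""
    merged = []
    # groupby yields one group per run of consecutive turns with the same speaker
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        merged.append((speaker, " ".join(content for _, content in group)))
    return merged

turns = [
    ("P1", "hello"),
    ("P1", "how are you"),
    ("P2", "fine"),
    ("P1", "good"),
]
merged = merge_adjacent(turns)
for speaker, content in merged:
    print(speaker + ": " + content)
```

After merging, speakers strictly alternate, which is the row structure the later alignment steps expect.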

## Prepare transcript text

* **[Check spelling](#Check-spelling)** via a Bayesian spell-checking algorithm (http://norvig.com/spell-correct.html).
* **[Tokenize and apply spell correction](#Tokenize-and-apply-spell-correction)** to the original transcript text.
* **[Lemmatize](#Lemmatize)** using WordNet-derived categories.
* [**Part-of-speech tagging**](#Part-of-speech-tagging) with user-defined tagger(s) on both lemmatized and non-lemmatized tokens.
    * Users may choose to use the NLTK default POS tagger (default) and/or the Stanford POS tagger (optional). The NLTK default tagger is more time-efficient.

### Tokenize and apply spell correction

In [26]:
def Tokenize(text,nwords):
    """
    Given a string of text to be processed and a 
    dictionary of known word counts, return a list 
    of spell-corrected, tokenized words.
    """
    
    # internal function: identify possible spelling errors for a given word
    def edits1(word): 
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
        replaces   = [a + c + b[1:] for a, b in splits for c in string.ascii_lowercase if b]
        inserts    = [a + c + b     for a, b in splits for c in string.ascii_lowercase]
        return set(deletes + transposes + replaces + inserts)

    # internal function: identify known edits
    def known_edits2(word,nwords):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

    # internal function: identify known words
    def known(words,nwords): return set(w for w in words if w in nwords)

    # internal function: correct spelling
    def correct(word,nwords):
        candidates = known([word],nwords) or known(edits1(word),nwords) or known_edits2(word,nwords) or [word]
        return max(candidates, key=nwords.get)

    # expand out based on a fixed list of common contractions 
    contract_dict = { "ain't": "is not",
        "aren't": "are not",
        "can't": "cannot",
        "can't've": "cannot have",
        "'cause": "because",
        "could've": "could have",
        "couldn't": "could not",
        "couldn't've": "could not have",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hadn't've": "had not have",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he had",
        "he'd've": "he would have",
        "he'll": "he will",
        "he'll've": "he will have",
        "he's": "he is",
        "how'd": "how did",
        "how'd'y": "how do you",
        "how'll": "how will",
        "how's": "how is",
        "i'd": "i would",
        "i'd've": "i would have",
        "i'll": "i will",
        "i'll've": "i will have",
        "i'm": "i am",
        "i've": "i have",
        "isn't": "is not",
        "it'd": "it would",
        "it'd've": "it would have",
        "it'll": "it will",
        "it'll've": "it will have",
        "it's": "it is",
        "let's": "let us",
        "ma'am": "madam",
        "mayn't": "may not",
        "might've": "might have",
        "mightn't": "might not",
        "mightn't've": "might not have",
        "must've": "must have",
        "mustn't": "must not",
        "mustn't've": "must not have",
        "needn't": "need not",
        "needn't've": "need not have",
        "o'clock": "of the clock",
        "oughtn't": "ought not",
        "oughtn't've": "ought not have",
        "shan't": "shall not",
        "sha'n't": "shall not",
        "shan't've": "shall not have",
        "she'd": "she would",
        "she'd've": "she would have",
        "she'll": "she will",
        "she'll've": "she will have",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "shouldn't've": "should not have",
        "so've": "so have",
        "so's": "so as",
        "that'd": "that had",
        "that'd've": "that would have",
        "that's": "that is",
        "there'd": "there would",
        "there'd've": "there would have",
        "there's": "there is",
        "they'd": "they would",
        "they'd've": "they would have",
        "they'll": "they will",
        "they'll've": "they will have",
        "they're": "they are",
        "they've": "they have",
        "to've": "to have",
        "wasn't": "was not",
        "we'd": "we would",
        "we'd've": "we would have",
        "we'll": "we will",
        "we'll've": "we will have",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what will",
        "what'll've": "what will have",
        "what're": "what are",
        "what's": "what is",
        "what've": "what have",
        "when's": "when is",
        "when've": "when have",
        "where'd": "where did",
        "where's": "where is",
        "where've": "where have",
        "who'll": "who will",
        "who'll've": "who will have",
        "who's": "who is",
        "who've": "who have",
        "why's": "why is",
        "why've": "why have",
        "will've": "will have",
        "won't": "will not",
        "won't've": "will not have",
        "would've": "would have",
        "wouldn't": "would not",
        "wouldn't've": "would not have",
        "y'all": "you all",
        "y'all'd": "you all would",
        "y'all'd've": "you all would have",
        "y'all're": "you all are",
        "y'all've": "you all have",
        "you'd": "you would",
        "you'd've": "you would have",
        "you'll": "you will",
        "you'll've": "you will have",
        "you're": "you are",
        "you've": "you have" }
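    # match longer contractions first so that, e.g., "can't've" is not split into "can't" + "'ve"
    contractions_re = re.compile('(%s)' % '|'.join(
        re.escape(key) for key in sorted(contract_dict, key=len, reverse=True)))

    # internal function: replace each contraction with its expanded form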
    def expand_contractions(text, contractions_re=contractions_re):
        def replace(match):
            return contract_dict[match.group(0)]
        return contractions_re.sub(replace, text.lower())

    # process all words in the text
    cleantoken = []
    text = expand_contractions(text)
    token = word_tokenize(text)
    for word in token:        
        if "'" not in word:
            cleantoken.append(correct(word,nwords))
        else:
            cleantoken.append(word) 
    return cleantoken
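
The spell-correction idea can be tried in isolation on a toy vocabulary. This sketch is a reduced version of the approach above: it only considers candidates one edit away (the two-edit step is omitted), and `toy_counts` is a made-up frequency table rather than a trained dictionary.

```python
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Keep a known word; otherwise pick the most frequent known one-edit match."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing plausible found, so leave the word unchanged
    return max(candidates, key=lambda w: counts[w])

# Hypothetical word frequencies standing in for the training dictionary.
toy_counts = {"spelling": 5, "spilling": 2, "the": 50}

print(correct("speling", toy_counts))
print(correct("the", toy_counts))
```

One consequence of this design is that a correctly spelled word missing from the dictionary may be changed to a nearby known word, so the choice of training text matters.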

### Lemmatize

In [27]:
def pos_to_wn(tag):
    """
    Convert NLTK default tagger output into a format that Wordnet
    can use in order to properly lemmatize the text.
    """
    
    # create some inner functions for simplicity
    def is_noun(tag):
        return tag in ['NN', 'NNS', 'NNP', 'NNPS']
    def is_verb(tag):
        return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    def is_adverb(tag):
        return tag in ['RB', 'RBR', 'RBS']
    def is_adjective(tag):
        return tag in ['JJ', 'JJR', 'JJS']
    
    # check each tag against possible categories
    if is_noun(tag):
        return wn.NOUN
    elif is_verb(tag):
        return wn.VERB
    elif is_adverb(tag):
        return wn.ADV
    elif is_adjective(tag):
        return wn.ADJ
    else:
        return wn.NOUN

In [28]:
def Lemmatize(tokenlist):
    lemmatizer = WordNetLemmatizer() 
    defaultPos = nltk.pos_tag(tokenlist) # get the POS tags from NLTK default tagger
    words_lemma = []
    for item in defaultPos:  
        words_lemma.append(lemmatizer.lemmatize(item[0],pos_to_wn(item[1]))) # need to convert POS tags to a format (NOUN, VERB, ADV, ADJ) that wordnet uses to lemmatize
    return words_lemma
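
The tag conversion can be illustrated without loading NLTK. The sketch below uses the single-letter values that WordNet uses for its part-of-speech constants and matches Penn Treebank tags by prefix; `penn_to_wordnet` is a hypothetical stand-in for `pos_to_wn`, not a replacement for it.

```python
# Single-letter WordNet POS values (what wn.NOUN, wn.VERB, wn.ADV, wn.ADJ hold).
NOUN, VERB, ADV, ADJ = "n", "v", "r", "a"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    if tag.startswith("NN"):
        return NOUN
    if tag.startswith("VB"):
        return VERB
    if tag.startswith("RB"):
        return ADV
    if tag.startswith("JJ"):
        return ADJ
    return NOUN  # fallback for every other tag, e.g. DT, IN, PRP

for tag in ["NNS", "VBD", "RBR", "JJ", "DT"]:
    print(tag + " -> " + penn_to_wordnet(tag))
```

The fallback is worth keeping in mind: determiners, prepositions, and pronouns are all lemmatized as if they were nouns.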

### Part-of-speech tagging

In [29]:
def ApplyPOSTagging(df,
                    filename,
                    add_stanford_tags=False,
                    stanford_pos_path=None):

    """
    Given a dataframe of conversation turns, return a new
    dataframe with part-of-speech tagging. Add filename
    (given as string) as a new column in returned dataframe.
    
    By default, return only tags from the NLTK default POS 
    tagger. Optionally, also return Stanford POS tagger 
    results by setting `add_stanford_tags=True`.
    
    If Stanford POS tagging is desired, specify the
    location of the Stanford POS tagger with the 
    `stanford_pos_path` argument.
    """
    
    # if desired, import Stanford tagger
    if add_stanford_tags == True:
        if stanford_pos_path == None:
            raise ValueError('Error! Specify path to Stanford POS tagger using the `stanford_pos_path` argument.')
        else:
            stanford_tagger = StanfordPOSTagger(stanford_pos_path + 'models/english-left3words-distsim.tagger',
                                                stanford_pos_path + 'stanford-postagger.jar')
    
    # add new columns to dataframe
    df['tagged_token'] = df['token'].apply(nltk.pos_tag)
    df['tagged_lemma'] = df['lemma'].apply(nltk.pos_tag)
    
    # if desired, also tag with Stanford tagger
    if add_stanford_tags == True:
        df['tagged_stan_token'] = df['token'].apply(stanford_tagger.tag)
        df['tagged_stan_lemma'] = df['lemma'].apply(stanford_tagger.tag)

    df['file'] = filename
        
    # return finished dataframe
    return df

## RUN Phase 1

* For each original transcript file, saves new file with columns for:
    * "Clean" text
    * Tokenized words
    * Tokenized lemmatized-words
    * NLTK default POS-tagging on tokenized words
    * NLTK default POS-tagging on lemmatized words
    * Stanford POS-tagging on tokenized words (only if `add_stanford_tags=True`)
    * Stanford POS-tagging on lemmatized words (only if `add_stanford_tags=True`)
* Also saves a single datasheet with all tokenized lemmatized utterances from all transcripts as individual rows
    * called `align_concatenated_dataframe.txt`
    * to be used in building semantic space for Phase 2

In [56]:
def PHASE1RUN(input_files, 
              output_file_directory,
              training_dictionary,
              minwords=2,
              use_filler_list=None,
              use_both=False,
              add_stanford_tags=False,
              stanford_pos_path=None,
              input_as_directory=True,
              save_concatenated_dataframe=True):   

    """
    Given individual .txt files of conversations, 
    return a completely prepared dataframe of transcribed 
    conversations for later ALIGN analysis, including: text 
    cleaning, merging adjacent turns, spell-checking, 
    tokenization, lemmatization, and part-of-speech tagging. 
    The output serves as the input for later ALIGN
    analysis.
    
    By default, set a minimum number of words in a turn to
    2. If desired, this may be changed by changing the
    `minwords` argument.
    
    By default, return only the NLTK default 
    POS tagger values. Optionally, also return Stanford POS 
    tagger values with `add_stanford_tags=True`.
    
    If Stanford POS tagging is desired, specify the
    location of the Stanford POS tagger with the 
    `stanford_pos_path` argument.
    
    By default, accept `input_files` as a directory
    that includes `.txt` files of each individual 
    conversation. If desired, provide individual files
    as a list of literal paths to the `input_files`
    argument and set `input_as_directory=False`.
    
    By default, produce a single concatenated dataframe
    of all processed conversations in the output directory. 
    If desired, suppress concatenated dataframe with 
    `save_concatenated_dataframe=False`.
    """
    
    # create an internal function to train the model
    def train(features): 
        model = defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model
        
    # train our spell-checking model (the file is closed once it has been read)
    with open(training_dictionary) as dictionary_file:
        nwords = train(re.findall('[a-z]+', dictionary_file.read().lower()))
    
    # grab the appropriate files
    if input_as_directory==False:
        file_list = input_files
    else: 
        file_list = glob.glob(input_files+"*.txt")
    
    # cycle through all files 
    main = pd.DataFrame()
    for fileName in file_list:  
        
        # let us know which file we're processing
        print("Processing: " + fileName)
        dataframe = pd.read_csv(fileName, sep='\t',encoding='utf-8')

        # clean up, merge, spellcheck, tokenize, lemmatize, and POS-tag
        dataframe = InitialCleanup(dataframe,
                                  minwords=minwords,
                                  use_filler_list=use_filler_list,
                                  use_both=use_both)
        dataframe = AdjacentMerge(dataframe)
        
        # tokenize and lemmatize 
        dataframe['token'] = dataframe['content'].apply(Tokenize,
                                     args=(nwords,))
        dataframe['lemma'] = dataframe['token'].apply(Lemmatize)

        # apply part-of-speech tagging
        dataframe = ApplyPOSTagging(dataframe,  
                                    filename = os.path.basename(fileName),
                                    add_stanford_tags=add_stanford_tags,
                                    stanford_pos_path=stanford_pos_path
                                    )
        
        # export the conversation's dataframe as a CSV
        dataframe.to_csv(output_file_directory + os.path.basename(fileName), 
                         encoding='utf-8',index=False,sep='\t')
        main = main.append(dataframe)

    # save the concatenated dataframe
    if save_concatenated_dataframe != False:
        main.to_csv(output_file_directory + '../' + "align_concatenated_dataframe.txt",
                    encoding='utf-8',index=False, sep='\t')
    
    # return the dataframe
    return main

**To Nick**: Line 58 shouldn't need `glob.glob`, since that's a search function. Line 58 should just read in literal paths to specific files if the user isn't inputting a directory. If that's not working as intended for you, please let me know! I'll do some more debugging.
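
The spell-checking model trained at the top of `PHASE1RUN` is a word-frequency table in which every word starts from a count of 1. The sketch below repeats that `train` step on a made-up sentence (`sample_text` is hypothetical) to show what the resulting table looks like.

```python
import re
from collections import defaultdict

def train(features):
    """Count words, starting every word (seen or unseen) from 1."""
    model = defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

sample_text = "The cat sat on the mat. The end."
words = re.findall("[a-z]+", sample_text.lower())
nwords = train(words)

# Membership tests do not add entries to a defaultdict ...
print("dog" in nwords)
# ... but indexing does: an unseen word gets the default count of 1.
print(nwords["the"], nwords["cat"], nwords["dog"])
print("dog" in nwords)
```

This is why the spell checker above uses `in` and `.get` on the table rather than indexing it: looking up an unknown word with square brackets would silently add that word to the vocabulary.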

# Phase 2: Generate alignment scores

* [**Create helper functions**](#Create-helper-functions) for processing turn- and conversation-level data.
* **[Build semantic space](#Build-semantic-space)** from the `align_concatenated_dataframe.txt` file generated in Phase 1 and return a `word2vec` semantic space and vocabulary list.

[To top.](#ALIGN)

### Create helper functions

In [31]:
def ngram_pos(sequence1,sequence2,ngramsize=2,
                   ignore_duplicates=True):
    """
    Remove mimicked lexical sequences from two interlocutors'
    sequences and return a dictionary of counts of ngrams
    of the desired size for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.    
    
    By default, ignore duplicate lexical n-grams when
    processing these sequences. If desired, this may
    be changed with `ignore_duplicates=False`.
    """     
    
    # build the set of unique (word, tag) n-grams for each sequence
    sequence1 = set(ngrams(sequence1,ngramsize))
    sequence2 = set(ngrams(sequence2,ngramsize))

    # if desired, drop lexical n-grams shared by both speakers; in either case, keep only the POS tags
    if ignore_duplicates==True:
        new_sequence1 = [tuple([''.join(pair[1]) for pair in tup]) for tup in list(sequence1 - sequence2)]
        new_sequence2 = [tuple([''.join(pair[1]) for pair in tup]) for tup in list(sequence2 - sequence1)]
    else:
        new_sequence1 = [tuple([''.join(pair[1]) for pair in tup]) for tup in sequence1]
        new_sequence2 = [tuple([''.join(pair[1]) for pair in tup]) for tup in sequence2]
        
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)
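
To make the duplicate-removal step concrete, here is a minimal sketch of the set difference that `ngram_pos` applies when `ignore_duplicates=True`. It assumes each sequence is a list of `(word, tag)` pairs. The helper `simple_ngrams` is hypothetical and only stands in for the `ngrams` call used above so that the sketch runs on its own; the toy turns are made up.

```python
from collections import Counter

def simple_ngrams(sequence, n):
    """Hypothetical stand-in for `ngrams`: return consecutive n-item tuples."""
    return list(zip(*[sequence[i:] for i in range(n)]))

# toy tagged turns as (word, POS tag) pairs (illustrative data only)
turn1 = [('the', 'DT'), ('red', 'JJ'), ('dog', 'NN'), ('ran', 'VBD')]
turn2 = [('the', 'DT'), ('red', 'JJ'), ('cat', 'NN'), ('ran', 'VBD')]

bigrams1 = set(simple_ngrams(turn1, 2))
bigrams2 = set(simple_ngrams(turn2, 2))

# drop bigrams whose words AND tags appear in both turns,
# then keep only the tags of the bigrams that remain
pos_only1 = Counter(tuple(tag for _, tag in gram) for gram in bigrams1 - bigrams2)
pos_only2 = Counter(tuple(tag for _, tag in gram) for gram in bigrams2 - bigrams1)
```

Here the shared bigram `the red` is discarded, so the `('DT', 'JJ')` pattern does not contribute to the score. The remaining tag patterns still match across speakers because `dog` and `cat` share a tag.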

In [32]:
def ngram_lexical(sequence1,sequence2,ngramsize=2):
    """
    Create ngrams of the desired size for each of two
    interlocutors' sequences and return a dictionary 
    of counts of ngrams for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.  
    """   
    
    # generate ngrams
    sequence1 = list(ngrams(sequence1,ngramsize))
    sequence2 = list(ngrams(sequence2,ngramsize)) 

    # join for counters
    new_sequence1 = [' '.join(pair) for pair in sequence1]
    new_sequence2 = [' '.join(pair) for pair in sequence2]
    
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)
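
As a rough illustration of what `ngram_lexical` returns, the sketch below counts joined bigrams for two toy token lists. Unlike `ngram_pos`, nothing is passed through `set`, so repeated n-grams keep their counts. `simple_ngrams` is again a hypothetical stand-in for the `ngrams` call used above, and the tokens are made up.

```python
from collections import Counter

def simple_ngrams(sequence, n):
    """Hypothetical stand-in for `ngrams`: return consecutive n-item tuples."""
    return list(zip(*[sequence[i:] for i in range(n)]))

# toy token sequences (illustrative data only)
tokens1 = ['no', 'no', 'no', 'way']
tokens2 = ['no', 'way', 'no', 'way']

# join each bigram into a single string key and count occurrences
counts1 = Counter(' '.join(gram) for gram in simple_ngrams(tokens1, 2))
counts2 = Counter(' '.join(gram) for gram in simple_ngrams(tokens2, 2))

# bigrams used by both speakers
shared = set(counts1) & set(counts2)
```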

In [33]:
def get_cosine(vec1, vec2): 
    """
    Derive cosine similarity metric, standard measure.
    Adapted from <https://stackoverflow.com/a/33129724>.
    """     
    
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator    
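
A small worked example may help in reading these scores. The sketch below restates the same cosine formula under a hypothetical name, `cosine_of_counts`, and applies it to two toy counters that share one of their two bigrams.

```python
import math
from collections import Counter

def cosine_of_counts(vec1, vec2):
    """Cosine similarity between two count dictionaries (sketch of the formula above)."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[key] * vec2[key] for key in shared)
    norm1 = math.sqrt(sum(value ** 2 for value in vec1.values()))
    norm2 = math.sqrt(sum(value ** 2 for value in vec2.values()))
    denominator = norm1 * norm2
    # an empty counter has zero length, so similarity is defined as 0.0
    return float(numerator) / denominator if denominator else 0.0

a = Counter({'i like': 1, 'like the': 1})
b = Counter({'i like': 1, 'the dog': 1})

# one shared bigram out of two each: 1 / (sqrt(2) * sqrt(2)), approximately 0.5
score = cosine_of_counts(a, b)
```

Identical counters score approximately 1.0, counters with no shared keys score 0.0, and an empty counter also returns 0.0 because of the zero-denominator guard.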

In [34]:
def build_composite_semantic_vector(lemma_seq,vocablist,highDimModel):
    """
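    Sum the `word2vec` vectors of all lemmas in an utterance
    into a single composite vector. Lemmas that are not in
    `vocablist` are skipped. The `vocablist` and `highDimModel`
    arguments are produced by `BuildSemanticModel`, which is
    called in the main loop.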
    """
    
    ## filter out words in corpus that do not appear in vocablist (either too rare or too frequent)
    filter_lemma_seq = [word for word in lemma_seq if word in vocablist]    
    ## build composite vector
    getComposite = [0] * len(highDimModel[vocablist[1]])        
    for w1 in filter_lemma_seq:
        if w1 in highDimModel.vocab:
            semvector = highDimModel[w1]
            getComposite = getComposite + semvector
        else:
            print w1
    return getComposite
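
The composite vector is an element-wise sum of the vectors of the lemmas that survive filtering. The following sketch shows the idea with plain Python lists and a made-up three-dimensional "model"; the names `toy_model`, `toy_vocab`, and `composite_vector` are hypothetical, and a real `word2vec` model would supply much longer vectors.

```python
# toy 3-dimensional "model": word -> vector (values are made up for illustration)
toy_model = {
    'dog': [1.0, 0.0, 2.0],
    'cat': [0.5, 1.0, 0.0],
    'run': [0.0, 3.0, 1.0],
}

# assume 'run' was removed from the vocabulary by the frequency cutoffs
toy_vocab = ['dog', 'cat']

def composite_vector(lemmas, vocab, model, dims=3):
    """Sum the vectors of all lemmas that are in both the vocabulary and the model."""
    total = [0.0] * dims
    for word in lemmas:
        if word in vocab and word in model:
            total = [t + v for t, v in zip(total, model[word])]
    return total

utterance = ['dog', 'run', 'cat', 'dog', 'fish']
vec = composite_vector(utterance, toy_vocab, toy_model)
```

In this toy case `dog` is counted twice, while `run` (filtered out of the vocabulary) and `fish` (absent from the model) contribute nothing. An utterance with no usable lemmas yields an all-zero vector.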

### Build semantic space

In [35]:
def BuildSemanticModel(semantic_model_input_file,   
                        pretrained_input_file,
                        use_pretrained_vectors=False,                     
                        high_sd_cutoff=3,
                        low_n_cutoff=1):
    
    """
    Given an input file produced by the ALIGN Phase 1 functions, 
    build a semantic model from all transcripts in all conversations
    in target corpus after removing high- and low-frequency words.
    High-frequency words are determined by a user-defined number of
    SDs over the mean (by default, `high_sd_cutoff=3`). Low-frequency
    words are removed unless they appear more than a specified number
    of times (by default, `low_n_cutoff=1`).
    
    Frequency cutoffs can be removed by `high_sd_cutoff=None` and/or
    `low_n_cutoff=0`.
    """
    
    # build vocabulary list from transcripts
    data1 = pd.read_csv(semantic_model_input_file, sep='\t',encoding='utf-8')
        
    # get frequency count of all included words        
    all_sentences = [re.sub('[^\w\s]+','',str(row)).split(' ') for row in list(data1['lemma'])]
    all_words = list([a for b in all_sentences for a in b])  
    frequency = defaultdict(int)
    for word in all_words:
        frequency[word] += 1

    # keep only words that occur more frequently than our cutoff (defined in occurrences)
    frequency = {word: freq for word, freq in frequency.iteritems() if freq > low_n_cutoff}
    
    # if desired, remove high-frequency words (over user-defined SDs above mean) 
    if high_sd_cutoff == None:
        contentWords = [word for word in frequency.keys()] 
    else:
        getOut = np.mean(frequency.values())+(np.std(frequency.values())*(high_sd_cutoff))
        contentWords = {word: freq for word, freq in frequency.iteritems() if freq < getOut}.keys()
    
    # decide whether to build semantic model from scratch or load in pretrained vectors
    if use_pretrained_vectors == False:
        # train only on the sentences that remain after frequency filtering
        keepSentences = [[word for word in row if word in contentWords] for row in all_sentences]
        semantic_model = word2vec.Word2Vec(keepSentences, min_count=low_n_cutoff)
    else:
        if pretrained_input_file == None:
            raise ValueError('Error! Specify path to pretrained vector file using the `pretrained_input_file` argument.')
        else:
            semantic_model = gensim.models.KeyedVectors.load_word2vec_format(pretrained_input_file, binary=True)    
        
    # return all the content words and the trained word vectors
    return contentWords, semantic_model.wv
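
The two frequency cutoffs can be sketched in isolation as follows. This is a simplified illustration under assumptions, not the function above: `filter_by_frequency` is a hypothetical helper, the word list is made up, and the standard deviation is computed by hand as a population SD, which is what `np.std` returns by default.

```python
import math
from collections import Counter

def filter_by_frequency(words, high_sd_cutoff=3, low_n_cutoff=1):
    """Return the sorted words that survive the low- and high-frequency cutoffs."""
    frequency = Counter(words)
    # low cutoff: keep words that occur MORE than `low_n_cutoff` times
    kept = {w: f for w, f in frequency.items() if f > low_n_cutoff}
    if high_sd_cutoff is None or not kept:
        return sorted(kept)
    # high cutoff: drop words at or above mean + (cutoff * population SD)
    counts = [float(f) for f in kept.values()]
    mean = sum(counts) / len(counts)
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
    ceiling = mean + sd * high_sd_cutoff
    return sorted(w for w, f in kept.items() if f < ceiling)

toy_words = ['the'] * 50 + ['dog'] * 3 + ['cat'] * 2 + ['bird'] * 2 + ['fish']

content_words = filter_by_frequency(toy_words, high_sd_cutoff=1)
everything = filter_by_frequency(toy_words, high_sd_cutoff=None, low_n_cutoff=0)
```

The example uses `high_sd_cutoff=1` because, with only four word types left after the low cutoff, no count can sit 3 SDs above the mean. In a realistic corpus with many word types, the default of 3 would trim only the most frequent words.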

### Calculate lexical and POS alignment scores for each n-gram length across two comparison vectors

In [36]:
def LexicalPOSAlignment(tok1,lem1,penn_tok1,penn_lem1,
                             tok2,lem2,penn_tok2,penn_lem2,
                             stan_tok1=None,stan_lem1=None,
                             stan_tok2=None,stan_lem2=None,
                             maxngram=2,
                             ignore_duplicates=True,
                             add_stanford_tags=False):
    
    """
    Derive lexical and part-of-speech alignment scores
    between interlocutors (suffix `1` and `2` in arguments
    passed to function). 
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider only bigrams when calculating
    similarity. If desired, this window may be expanded 
    by changing the `maxngram` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # create empty dictionaries for syntactic similarity
    syntax_penn_tok = {}
    syntax_penn_lex = {}
    
    # if desired, generate Stanford-based scores
    if add_stanford_tags == True:
        syntax_stan_tok = {}
        syntax_stan_lem = {}
    
    # create empty dictionaries for lexical similarity
    lexical_tok = {}
    lexical_lem = {}
    
    # cycle through all desired ngram lengths
    for ngram in range(2,maxngram+1):
                
        # calculate similarity for lexical ngrams (tokens and lemmas)
        [vectorT1, vectorT2] = ngram_lexical(tok1,tok2,ngramsize=ngram)
        [vectorL1, vectorL2] = ngram_lexical(lem1,lem2,ngramsize=ngram)        
        lexical_tok['lexical_tok{0}'.format(ngram)] = get_cosine(vectorT1,vectorT2)
        lexical_lem['lexical_lem{0}'.format(ngram)] = get_cosine(vectorL1, vectorL2)
        
        # calculate similarity for Penn POS ngrams (tokens)
        [vector_penn_tok1, vector_penn_tok2] = ngram_pos(penn_tok1,penn_tok2,
                                                ngramsize=ngram,
                                                ignore_duplicates=ignore_duplicates) 
        syntax_penn_tok['syntax_penn_tok{0}'.format(ngram)] = get_cosine(vector_penn_tok1, 
                                                                                            vector_penn_tok2)
        # calculate similarity for Penn POS ngrams (lemmas)
        [vector_penn_lem1, vector_penn_lem2] = ngram_pos(penn_lem1,penn_lem2,
                                                              ngramsize=ngram,
                                                              ignore_duplicates=ignore_duplicates) 
        syntax_penn_lex['syntax_penn_lex{0}'.format(ngram)] = get_cosine(vector_penn_lem1, 
                                                                                            vector_penn_lem2) 

        # if desired, also calculate using Stanford POS
        if add_stanford_tags == True:         
          
            # calculate similarity for Stanford POS ngrams (tokens)
            [vector_stan_tok1, vector_stan_tok2] = ngram_pos(stan_tok1,stan_tok2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            syntax_stan_tok['syntax_stan_tok{0}'.format(ngram)] = get_cosine(vector_stan_tok1,
                                                                                                vector_stan_tok2)
                        
            # calculate similarity for Stanford POS ngrams (lemmas)
            [vector_stan_lem1, vector_stan_lem2] = ngram_pos(stan_lem1,stan_lem2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            syntax_stan_lem['syntax_stan_lem{0}'.format(ngram)] = get_cosine(vector_stan_lem1,
                                                                                                vector_stan_lem2)
        
    # return requested information
    if add_stanford_tags == True:
        dictionaries_list = [syntax_penn_tok, syntax_penn_lex,
                             syntax_stan_tok, syntax_stan_lem, 
                             lexical_tok, lexical_lem]      
    else:
        dictionaries_list = [syntax_penn_tok, syntax_penn_lex,
                             lexical_tok, lexical_lem]      
            
    return dictionaries_list
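
Each dictionary returned by `LexicalPOSAlignment` holds one key per n-gram length from 2 up to `maxngram`. The sketch below lists the key names that the function builds for a given setting; `score_names` is a hypothetical helper written only to show the naming pattern, and it computes no scores.

```python
def score_names(maxngram=2, add_stanford_tags=False):
    """List the score keys produced for each n-gram length from 2 to `maxngram`."""
    prefixes = ['syntax_penn_tok', 'syntax_penn_lex']
    if add_stanford_tags:
        prefixes += ['syntax_stan_tok', 'syntax_stan_lem']
    prefixes += ['lexical_tok', 'lexical_lem']
    return ['{0}{1}'.format(prefix, n)
            for prefix in prefixes
            for n in range(2, maxngram + 1)]

default_names = score_names()
trigram_names = score_names(maxngram=3)
stanford_names = score_names(add_stanford_tags=True)
```

With the defaults this gives four scores per comparison. Raising `maxngram` to 3 doubles that, since every measure is then reported for both bigrams and trigrams.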

## Generate turn-level analysis of alignment scores

In [37]:
def conceptualAlignment(lem1, lem2, vocablist, highDimModel):
    
    """
    Calculate conceptual alignment scores from lists of lemmas
    from two interlocutors (suffix `1` and `2` in arguments
    passed to function) using `word2vec`.
    """

    # aggregate composite high-dimensional vectors of all words in utterance
    W2Vec1 = build_composite_semantic_vector(lem1,vocablist,highDimModel)
    W2Vec2 = build_composite_semantic_vector(lem2,vocablist,highDimModel)

    # return cosine similarity alignment score (1 - cosine distance)
    return 1 - spatial.distance.cosine(W2Vec1, W2Vec2) 

In [38]:
def returnMultilevelAlignment(cond_info,
                                   partnerA,tok1,lem1,penn_tok1,penn_lem1,
                                   partnerB,tok2,lem2,penn_tok2,penn_lem2,
                                   vocablist, highDimModel, 
                                   stan_tok1=None,stan_lem1=None,
                                   stan_tok2=None,stan_lem2=None,
                                   add_stanford_tags=False,
                                   maxngram=2, 
                                   ignore_duplicates=True):

    """
    Calculate lexical, syntactic, and conceptual alignment
    between a pair of turns by individual interlocutors 
    (suffix `1` and `2` in arguments passed to function), 
    including leading/following comparison directionality.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider only bigrams when calculating
    similarity. If desired, this window may be expanded 
    by changing the `maxngram` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # create empty dictionaries 
    partner_direction = {}
    condition_info = {}
    cosine_semanticL = {}
    
    # calculate lexical and syntactic alignment
    dictionaries_list = LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 maxngram=maxngram,
                                                 ignore_duplicates=ignore_duplicates,
                                                 add_stanford_tags=add_stanford_tags)
    
    # calculate conceptual alignment
    cosine_semanticL['cosine_semanticL'] = conceptualAlignment(lem1,lem2,vocablist,highDimModel)
    dictionaries_list.append(cosine_semanticL.copy())
    
    # determine directionality of leading/following comparison
    partner_direction['partner_direction'] = str(int(partnerA)) + ">" + str(int(partnerB))
    dictionaries_list.append(partner_direction.copy())

    # add condition information
    condition_info['condition_info'] = cond_info    
    dictionaries_list.append(condition_info.copy())
    
    # return alignment scores
    return dictionaries_list

In [39]:
def TurnByTurnAnalysis(dataframe,
                            vocablist,
                            highDimModel, 
                            delay=1,
                            maxngram=2,
                            add_stanford_tags=False,
                            ignore_duplicates=True):    

    """
    Calculate lexical, syntactic, and conceptual alignment
    between interlocutors over an entire conversation.
    Automatically detect individual speakers by unique
    speaker codes.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # if we don't want the Stanford tagger data, set defaults
    if add_stanford_tags == False:
        stan_tok1=None
        stan_lem1=None
        stan_tok2=None
        stan_lem2=None
    
    # prepare the data to the appropriate type    
    dataframe['token'] = dataframe['token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))    
    dataframe['lemma'] = dataframe['lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_token'] = dataframe['tagged_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_token'] = dataframe['tagged_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
    dataframe['tagged_lemma'] = dataframe['tagged_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_lemma'] = dataframe['tagged_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        
    # if desired, prepare the Stanford tagger data
    if add_stanford_tags == True:           
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        
    # create lagged version of the dataframe
    df_original = dataframe.drop(dataframe.tail(delay).index,inplace=False)
    df_lagged = dataframe.shift(-delay).drop(dataframe.tail(delay).index,inplace=False)
        
    # cycle through each pair of turns
    aggregated_df = pd.DataFrame()
    for i in range(0,df_original.shape[0]):

        # identify the condition for this dataframe
        cond_info = dataframe['file'].unique()
        if len(cond_info)==1: 
            cond_info = str(cond_info[0])
        
        # break and flag error if we have more than 1 condition per dataframe
        else: 
            raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+str(cond_info))

        # grab all of first participant's data
        first_row = df_original.iloc[i]
        first_partner = first_row['participant']
        tok1=first_row['token']
        lem1=first_row['lemma']
        penn_tok1=first_row['tagged_token']
        penn_lem1=first_row['tagged_lemma']

        # grab all of lagged participant's data
        lagged_row = df_lagged.iloc[i]
        lagged_partner = lagged_row['participant']
        tok2=lagged_row['token']
        lem2=lagged_row['lemma']
        penn_tok2=lagged_row['tagged_token']
        penn_lem2=lagged_row['tagged_lemma']
                
        # if desired, grab the Stanford tagger data for both participants
        if add_stanford_tags == True:         
            stan_tok1=first_row['tagged_stan_token']
            stan_lem1=first_row['tagged_stan_lemma']
            stan_tok2=lagged_row['tagged_stan_token']
            stan_lem2=lagged_row['tagged_stan_lemma']
   
        # process multilevel alignment
        dictionaries_list=returnMultilevelAlignment(cond_info=cond_info,
                                                         partnerA=first_partner,
                                                         tok1=tok1,lem1=lem1,
                                                         penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                         partnerB=lagged_partner,
                                                         tok2=tok2,lem2=lem2,
                                                         penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                         vocablist=vocablist,
                                                         highDimModel=highDimModel,
                                                         stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                         stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                         maxngram = maxngram,
                                                         ignore_duplicates = ignore_duplicates,
                                                         add_stanford_tags = add_stanford_tags) 
                
        # sort columns so they are in order, append data to existing structures   
        next_df_line = pd.DataFrame.from_dict(OrderedDict(k for num, i in enumerate(d for d in dictionaries_list) for k in sorted(i.items())),
                               orient='index').transpose()
        aggregated_df = aggregated_df.append(next_df_line)
        
    # reformat turn information and add index
    aggregated_df = aggregated_df.reset_index(drop=True).reset_index().rename(columns={"index":"time"})

    # give us our finished dataframe
    return aggregated_df
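
Two preparation steps in `TurnByTurnAnalysis` are easier to follow on toy data: rebuilding `(word, tag)` pairs from a stored string, and pairing each turn with the turn `delay` positions later. The sketch below uses plain lists instead of a dataframe. The helper names are hypothetical, and it assumes the tagged columns were saved as stringified lists of pairs, which may not match every Phase 1 output exactly.

```python
import re

def parse_tagged(text):
    """Strip punctuation, then rebuild (word, tag) pairs from alternating items."""
    items = re.sub(r'[^\w\s]+', '', text).split(' ')
    return list(zip(items[0::2], items[1::2]))

def pair_turns(turns, delay=1):
    """Pair each turn with the turn `delay` positions later (assumes delay >= 1)."""
    return list(zip(turns[:-delay], turns[delay:]))

# assumed storage format: a stringified list of (word, tag) pairs
stored = "[('the', 'DT'), ('dog', 'NN')]"
pairs = parse_tagged(stored)

# toy conversation with alternating speakers A and B
turns = ['A1', 'B1', 'A2', 'B2']
adjacent = pair_turns(turns, delay=1)
skipping = pair_turns(turns, delay=2)
```

With `delay=1`, each turn is compared with the partner's next turn. With `delay=2`, each turn is compared with the turn two positions later, which in a strictly alternating conversation is the same speaker's next turn.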

Generate conversation-level analysis of alignment scores
-----------------------------------------------------

In [40]:
def ConvoByConvoAnalysis(dataframe,
                          maxngram=2,
                          ignore_duplicates=True,
                          add_stanford_tags=False):

    """
    Calculate analysis of multilevel similarity over
    a conversation between two interlocutors from a 
    transcript dataframe prepared by Phase 1
    of ALIGN. Automatically detect speakers by unique
    speaker codes.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # identify the condition for this dataframe
    cond_info = dataframe['file'].unique()
    if len(cond_info)==1: 
        cond_info = str(cond_info[0])
    
    # break and flag error if we have more than 1 condition per dataframe
    else: 
        raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+str(cond_info))
   
    # if we don't want the Stanford info, set defaults 
    if add_stanford_tags == False:
        stan_tok1 = None
        stan_lem1 = None
        stan_tok2 = None
        stan_lem2 = None

    # identify individual interlocutors
    df_A = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[0]]
    df_B = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[1]]
   
    # concatenate the token, lemma, and POS information for participant A
    tok1 = [word for turn in df_A['token'] for word in turn]
    lem1 = [word for turn in df_A['lemma'] for word in turn]
    penn_tok1 = [POS for turn in df_A['tagged_token'] for POS in turn]    
    penn_lem1 = [POS for turn in df_A['tagged_lemma'] for POS in turn] 
    if add_stanford_tags == True:
        stan_tok1 = [POS for turn in df_A['tagged_stan_token'] for POS in turn]    
        stan_lem1 = [POS for turn in df_A['tagged_stan_lemma'] for POS in turn] 

    # concatenate the token, lemma, and POS information for participant B
    tok2 = [word for turn in df_B['token'] for word in turn]
    lem2 = [word for turn in df_B['lemma'] for word in turn]
    penn_tok2 = [POS for turn in df_B['tagged_token'] for POS in turn]    
    penn_lem2 = [POS for turn in df_B['tagged_lemma'] for POS in turn] 
    if add_stanford_tags == True:
        stan_tok2 = [POS for turn in df_B['tagged_stan_token'] for POS in turn]    
        stan_lem2 = [POS for turn in df_B['tagged_stan_lemma'] for POS in turn] 
    
    # process multilevel alignment
    dictionaries_list = LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 maxngram=maxngram,
                                                 ignore_duplicates=ignore_duplicates,
                                                 add_stanford_tags=add_stanford_tags)
    
    # append data to existing structures
    dictionary_df = pd.DataFrame.from_dict(OrderedDict(k for num, i in enumerate(d for d in dictionaries_list) for k in sorted(i.items())),
                       orient='index').transpose()    
    dictionary_df['condition_info'] = cond_info
            
    # return the dataframe
    return dictionary_df

## RUN Phase 2: Actual Partners

* For each prepped transcript file, calculates turn-level and conversation-level alignment scores
* Saves the turn-level and conversation-level output as separate files (`AlignmentT2T.txt` and `AlignmentC2C.txt`) to be used in statistical analysis

In [41]:
def PHASE2RUN_REAL(input_files, 
                    output_file_directory,
                    semantic_model_input_file,
                    pretrained_input_file,
                    high_sd_cutoff=3,
                    low_n_cutoff=1,
                    delay=1,
                    maxngram=2,
                    use_pretrained_vectors=False,
                    ignore_duplicates=True,
                    add_stanford_tags=False,
                    input_as_directory=True):   
    
    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `PHASE1RUN` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which a word will be 
    removed if it occurs 1 or fewer times. If
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    
    By default, accept `input_files` as a directory
    that includes `.txt` files of each individual 
    conversation. If desired, provide individual files
    as a list of literal paths to the `input_files`
    argument and set `input_as_directory=False`.
    """
    
    # grab the files in the list
    if input_as_directory == False:
        # `input_files` is already a list of literal paths, so no search is needed
        file_list = input_files
    else:
        file_list = glob.glob(input_files+"*.txt")
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                       pretrained_input_file=pretrained_input_file,
                                                       use_pretrained_vectors=use_pretrained_vectors,
                                                       high_sd_cutoff=high_sd_cutoff,
                                                       low_n_cutoff=low_n_cutoff)  
    
    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
        
    # cycle through each prepared file
    for fileName in file_list:   
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print "Processing: "+fileName   

            # calculate turn-by-turn alignment scores
            xT2T=TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel,
                                         add_stanford_tags=add_stanford_tags,
                                         ignore_duplicates=ignore_duplicates)   
            AlignmentT2T=AlignmentT2T.append(xT2T)
            
            # calculate conversation-level alignment scores
            xC2C = ConvoByConvoAnalysis(dataframe=dataframe,
                                             maxngram = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             add_stanford_tags = add_stanford_tags)
            AlignmentC2C=AlignmentC2C.append(xC2C)
            
        # if it's invalid, let us know
        else:
            print "Invalid file: "+fileName   
            
    # update final dataframes
    FINAL_TURN = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN.to_csv(output_file_directory+"AlignmentT2T.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO.to_csv(output_file_directory+"AlignmentC2C.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN, FINAL_CONVO
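
The docstring above describes two ways of passing `input_files`: as a directory or as a list of literal paths. The following sketch shows one way that choice could be resolved into a file list. `resolve_input_files` is a hypothetical helper, and the file names are invented and written to a temporary folder only so that the example runs.

```python
import glob
import os
import tempfile

def resolve_input_files(input_files, input_as_directory=True):
    """Return a list of transcript paths from a directory or a list of paths."""
    if input_as_directory:
        # search the directory for .txt files only
        return sorted(glob.glob(os.path.join(input_files, '*.txt')))
    # a list of literal paths is used as given, with no searching
    return list(input_files)

# build a throwaway directory with two transcripts and one unrelated file
demo_dir = tempfile.mkdtemp()
for name in ('dyad1-cond1.txt', 'dyad2-cond1.txt', 'notes.md'):
    open(os.path.join(demo_dir, name), 'w').close()

from_directory = resolve_input_files(demo_dir)
from_list = resolve_input_files([os.path.join(demo_dir, 'dyad1-cond1.txt')],
                                input_as_directory=False)
```

Using `os.path.join` rather than string concatenation means the directory path works with or without a trailing slash.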

Generate surrogate pairings
-------------------------
* Collects all possible pairs of participants across the dyads in each condition and creates surrogate pairings by combining their conversational turns, preserving turn order. Output saved as new separate conversational transcripts. 
* Main Function:
    * GenerateSurrogate 
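
The pairing step described above can be sketched on file names alone. The example assumes names of the form `dyadX-condY.txt`, in line with the default `dyad_label`, `condition_label`, and hyphen separator. The file list and the `condition_of` helper are hypothetical, and the sketch covers only how candidate pairs are chosen, not how turns are combined.

```python
import math
import random
from itertools import combinations

# hypothetical transcript names: dyad ID, hyphen, condition ID
files = ['dyad1-cond1.txt', 'dyad2-cond1.txt', 'dyad3-cond1.txt',
         'dyad4-cond2.txt', 'dyad5-cond2.txt']

def condition_of(file_name):
    """Pull the condition ID out of a name like 'dyad1-cond1.txt'."""
    return file_name.replace('.txt', '').split('-')[1]

# group conversations by condition
by_condition = {}
for file_name in files:
    by_condition.setdefault(condition_of(file_name), []).append(file_name)

# every possible pairing of conversations within each condition
all_pairs = {cond: list(combinations(members, 2))
             for cond, members in by_condition.items()}

# optional smaller baseline: sample half the number of conversations, rounded up
n_subset = int(math.ceil(len(by_condition['cond1']) / 2.0))
subset = random.sample(all_pairs['cond1'], n_subset)
```

Three conversations in a condition give three possible pairings, and the subset option keeps two of them. Conversations are never paired across conditions.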

In [42]:
def GenerateSurrogate(original_conversation_list,
                           surrogate_file_directory,
                           all_surrogates=True,
                           keep_original_turn_order=True,
                           id_separator = '\-',
                           dyad_label='dyad',
                           condition_label='cond'):
    
    """
    Create transcripts for surrogate pairs of 
    participants (i.e., participants who did not 
    genuinely interact in the experiment), which
    will later be used to generate baseline levels 
    of alignment. Store surrogate files in a new
    folder each time the surrogate generation is run.
    
    Returns a list of all surrogate files created.

    By default, the separator between dyad ID and
    condition ID is a hyphen ('\-'). If desired,
    this may be changed in the `id_separator` 
    argument.

    By default, condition IDs will be identified as 
    any characters following `cond`. If desired,
    this may be changed with the `condition_label`
    argument.
    
    By default, dyad IDs will be identified as 
    any characters following `dyad`. If desired,
    this may be changed with the `dyad_label`
    argument.
    
    By default, generate surrogates from all possible 
    pairings. If desired, instead generate surrogates 
    only from a subset of all possible pairings
    with `all_surrogates=False`.
    
    By default, create surrogates by retaining the 
    original ordering of each surrogate partner's 
    data. If desired, create surrogates by shuffling 
    all turns within each surrogate partner's data
    with `keep_original_turn_order = False`.
    """
        
    # create a subfolder for the new set of surrogates
    import time
    new_surrogate_path = surrogate_file_directory + 'surrogate_run-' + str(time.time()) +'/'
    if not os.path.exists(new_surrogate_path):
        os.makedirs(new_surrogate_path)
        
    # grab condition types from each file name
    file_info = [re.sub('\.txt','',os.path.basename(file_name)) for file_name in original_conversation_list]
    condition_ids = list(set([re.findall('[^'+id_separator+']*'+condition_label+'.*',metadata)[0] for metadata in file_info]))
    files_conditions = {}
    for unique_condition in condition_ids:
        next_condition_files = [add_file for add_file in original_conversation_list if unique_condition in add_file]
        files_conditions[unique_condition] = next_condition_files
    
    # cycle through conditions
    for condition in files_conditions.keys():
        
        # grab all possible pairs of conversations of this condition
        paired_surrogates = [pair for pair in combinations(files_conditions[condition],2)]
        
        # if requested, randomly sample a subset of the pairs instead of using all of them
        if all_surrogates == False:
            import math
            # divide by 2.0 so that the ceiling is taken on a float, not an integer quotient
            paired_surrogates = random.sample(paired_surrogates, 
                                              int(math.ceil(len(files_conditions[condition])/2.0)))
            
        # cycle through surrogate pairings
        for next_surrogate in paired_surrogates:
            
            # read in the files
            original_file1 = os.path.basename(next_surrogate[0])
            original_file2 = os.path.basename(next_surrogate[1])
            original_df1=pd.read_csv(next_surrogate[0], sep='\t',encoding='utf-8')
            original_df2=pd.read_csv(next_surrogate[1], sep='\t',encoding='utf-8')
            
            # get participants A and B from df1
            participantA_1_code = min(original_df1['participant'].unique())
            participantB_1_code = max(original_df1['participant'].unique())
            participantA_1 = original_df1[original_df1['participant'] == participantA_1_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_1 = original_df1[original_df1['participant'] == participantB_1_code].reset_index().rename(columns={'file': 'original_file'})
            
            # get participants A and B from df2
            participantA_2_code = min(original_df2['participant'].unique())
            participantB_2_code = max(original_df2['participant'].unique())
            participantA_2 = original_df2[original_df2['participant'] == participantA_2_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_2 = original_df2[original_df2['participant'] == participantB_2_code].reset_index().rename(columns={'file': 'original_file'})
            
            # identify truncation point for each surrogate (so both speakers contribute an equal number of turns)
            surrogateX_turns=min([participantA_1.shape[0],
                                  participantB_2.shape[0]])
            surrogateY_turns=min([participantA_2.shape[0],
                                  participantB_1.shape[0]])
            
            # if desired, preserve original turn order for surrogate pairs
            if keep_original_turn_order == True:                
                surrogateX = participantA_1.truncate(after=surrogateX_turns-1,copy=False).append(
                                participantB_2.truncate(after=surrogateX_turns-1,copy=False)).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})
                surrogateY = participantA_2.truncate(after=surrogateY_turns-1,copy=False).append(
                                participantB_1.truncate(after=surrogateY_turns-1,copy=False)).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})
            
            # otherwise, just shuffle all turns within participants
            else:
                
                # shuffle for first surrogate pairing
                surrogateX_A1 = participantA_1.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX_B2 = participantB_2.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX = surrogateX_A1.append(surrogateX_B2).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})
                
                # and for second surrogate pairing
                surrogateY_A2 = participantA_2.truncate(after=surrogateY_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY_B1 = participantB_1.truncate(after=surrogateY_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY = surrogateY_A2.append(surrogateY_B1).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})

            # create filename for our surrogate file
            original_dyad1 = re.findall(dyad_label+'[^'+id_separator+']*',original_file1)[0]
            original_dyad2 = re.findall(dyad_label+'[^'+id_separator+']*',original_file2)[0]
            surrogateX['file'] = condition + '-' + original_dyad1 + '-' + original_dyad2
            surrogateY['file'] = condition + '-' + original_dyad1 + '-' + original_dyad2
            nameX='SurrogatePair-'+original_dyad1+'A'+'-'+original_dyad2+'B'+'-'+condition+'.txt'
            nameY='SurrogatePair-'+original_dyad2+'A'+'-'+original_dyad1+'B'+'-'+condition+'.txt'
            
            # save to file
            surrogateX.to_csv(new_surrogate_path + nameX, encoding='utf-8',index=False,sep='\t')
            surrogateY.to_csv(new_surrogate_path + nameY, encoding='utf-8',index=False,sep='\t')
            
    # return list of all surrogate files
    return glob.glob(new_surrogate_path+"*.txt")

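To make the pairing logic easier to follow, here is a minimal sketch of how one surrogate conversation is assembled from two real ones. It uses plain lists rather than data frames, `build_surrogate` is a hypothetical helper that is not part of ALIGN, and the turn labels are made up for illustration. It also assumes that speaker A opens each exchange.

```python
def build_surrogate(turns_a, turns_b):
    """Interleave speaker A's turns from one conversation with speaker B's
    turns from another, truncating both to the shorter length."""
    # truncation point: both speakers contribute the same number of turns
    n_turns = min(len(turns_a), len(turns_b))
    surrogate = []
    for turn_a, turn_b in zip(turns_a[:n_turns], turns_b[:n_turns]):
        surrogate.append(('A', turn_a))
        surrogate.append(('B', turn_b))
    return surrogate


# made-up turns: speaker A from one dyad, speaker B from a different dyad
dyad1_a = ['a1', 'a2', 'a3']
dyad2_b = ['b1', 'b2']

surrogate_x = build_surrogate(dyad1_a, dyad2_b)
print(surrogate_x)
# [('A', 'a1'), ('B', 'b1'), ('A', 'a2'), ('B', 'b2')]
```

The second surrogate of the pair would be built the same way from the remaining two speakers (A from the second dyad, B from the first), with its own truncation point.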

RUN Phase 2: Surrogate Partners
-------------------------------
* Runs the function that generates new surrogate conversation transcripts (saved as separate files)
* For each surrogate transcript file, computes turn-level and conversation-level alignment scores
* Saves the output into two datasheets (turn-level and conversation-level) to be used in statistical analysis
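As a rough illustration of how many surrogates to expect, the sketch below pairs the conversations of a single condition, either exhaustively or as a random subset. `pair_conversations` and its `seed` argument are hypothetical and not part of ALIGN; the file names are the toy transcripts used in this notebook, and the subset size (half the number of files, rounded up) is an assumption about the intended sampling rule.

```python
from itertools import combinations
import math
import random


def pair_conversations(files, all_surrogates=True, seed=None):
    """Return pairs of same-condition conversations to turn into surrogates."""
    pairs = list(combinations(files, 2))
    if not all_surrogates:
        # assumption: keep ceil(n_files / 2) randomly chosen pairs
        n_keep = int(math.ceil(len(files) / 2.0))
        pairs = random.Random(seed).sample(pairs, n_keep)
    return pairs


condition_1_files = ['dyad_10-condition_1.txt', 'dyad_12-condition_1.txt',
                     'dyad_13-condition_1.txt', 'dyad_15-condition_1.txt']

all_pairs = pair_conversations(condition_1_files)
subset = pair_conversations(condition_1_files, all_surrogates=False, seed=1)
```

With four conversations in a condition, this yields 6 pairs when all pairings are used and 2 pairs when sampling. Each pair produces two surrogate files, one for each way of crossing the speakers.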

In [None]:
def PHASE2RUN_SURROGATE(input_files, 
                         surrogate_file_directory,
                         output_file_directory,
                         semantic_model_input_file,
                         pretrained_input_file,   
                         high_sd_cutoff=3,
                         low_n_cutoff=1,
                         id_separator = '\-',
                         condition_label='cond',
                         dyad_label='dyad',
                         all_surrogates=True,
                         keep_original_turn_order=True,
                         delay=1,
                         maxngram=2,
                         use_pretrained_vectors=False,   
                         ignore_duplicates=True,
                         add_stanford_tags=False,
                         input_as_directory=True):   
    
    
    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `PHASE1RUN` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics
    for surrogate baseline conversations.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which a word will be 
    removed if it occurs 1 or fewer times. If
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tags=True`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., do not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    
    By default, the separator between dyad ID and
    condition ID in each file name is a hyphen ('\-'). 
    If desired, this may be changed with the 
    `id_separator` argument.

    By default, condition IDs in each file name
    will be identified as any characters following 
    `cond`. If desired, this may be changed with the 
    `condition_label` argument.
    
    By default, dyad IDs in each file name
    will be identified as any characters following 
    `dyad`. If desired, this may be changed with the 
    `dyad_label` argument.
    
    By default, generate surrogates from all possible 
    pairings within a condition.  If desired, instead 
    generate surrogates only from a subset of all
    possible pairings within a condition with 
    `all_surrogates=False`.
    
    By default, accept `input_files` as a directory
    that includes `.txt` files of each individual 
    conversation. If desired, provide individual files
    as a list of literal paths to the `input_files`
    argument and set `input_as_directory=False`.
    """
    
    # grab the files in the input list
    if input_as_directory==False:
        file_list = glob.glob(input_files)
    else:
        file_list = glob.glob(input_files+"*.txt")
    
    # create a surrogate file list
    surrogate_file_list = GenerateSurrogate(original_conversation_list = file_list,
                                                   surrogate_file_directory = surrogate_file_directory,
                                                   all_surrogates = all_surrogates,
                                                   id_separator = id_separator,
                                                   condition_label = condition_label,
                                                   dyad_label = dyad_label,
                                                   keep_original_turn_order = keep_original_turn_order) 
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                       pretrained_input_file=pretrained_input_file,
                                                       use_pretrained_vectors=use_pretrained_vectors,
                                                       high_sd_cutoff=high_sd_cutoff,
                                                       low_n_cutoff=low_n_cutoff)  
    
    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
    
    # cycle through the files
    for fileName in surrogate_file_list:
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print("Processing: "+fileName)

            # calculate turn-by-turn alignment scores
            xT2T=TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel)
            AlignmentT2T=AlignmentT2T.append(xT2T)
            
            # calculate conversation-level alignment scores
            xC2C = ConvoByConvoAnalysis(dataframe=dataframe,
                                             maxngram = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             add_stanford_tags = add_stanford_tags)
            AlignmentC2C=AlignmentC2C.append(xC2C)
        
        # if it's invalid, let us know
        else:
            print("Invalid file: "+fileName)
            
    # update final dataframes
    FINAL_TURN_SURROGATE = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO_SURROGATE = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN_SURROGATE.to_csv(output_file_directory+"AlignmentT2T_Surrogate.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO_SURROGATE.to_csv(output_file_directory+"AlignmentC2C_Surrogate.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN_SURROGATE, FINAL_CONVO_SURROGATE
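
The three file-name arguments (`id_separator`, `condition_label`, and `dyad_label`) are easiest to understand with an example. The sketch below applies the same regular-expression patterns used in the surrogate generation function above to one of the toy file names. `parse_file_name` is a hypothetical helper for illustration only, not part of ALIGN.

```python
import re


def parse_file_name(file_name, id_separator=r'\-',
                    condition_label='cond', dyad_label='dyad'):
    """Extract the dyad ID and condition ID from a transcript file name."""
    # drop the extension so it does not end up in the condition ID
    stem = re.sub(r'\.txt$', '', file_name)
    # condition ID: the label plus everything after it
    condition = re.findall('[^' + id_separator + ']*' + condition_label + '.*', stem)[0]
    # dyad ID: the label plus everything up to the separator
    dyad = re.findall(dyad_label + '[^' + id_separator + ']*', stem)[0]
    return dyad, condition


print(parse_file_name('dyad_10-condition_1.txt'))
# ('dyad_10', 'condition_1')
```

If a file name matches neither label, `re.findall` returns an empty list and the `[0]` lookup raises an `IndexError`, so it is worth checking file names against the labels before starting a long run.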

# Run everything!

## Phase 1: Prep

In [41]:
import time
start_phase1 = time.time()

**To Nick**: By the way, I did some hunting around to help suppress the NLTK warnings, and it looks like a lot of folks have this issue. There are a few ways we could handle this, but I think it might be useful for us to preserve the warning. What do you think?

**To Alex**: Agreed. I'm fine with leaving it. Reminder to see about eventually updating. 

In [64]:
# specified_filler_list = ['ignore','these','words']
specified_filler_list = ['yeah']
#loaded_filler_list = list(pd.read_csv(INPUT_PATH+'package_files/fillers.txt',squeeze=True))

model_store = PHASE1RUN(
          input_files=INPUT_PATH+TRANSCRIPTS,
#           input_files=INPUT_PATH+'TRANSCRIPTS/',
          input_as_directory=True,
          save_concatenated_dataframe=True,
          output_file_directory=INPUT_PATH+PREPPED_TRANSCRIPTS,
          training_dictionary=INPUT_PATH+'package_files/tasa.txt',
          minwords=3,
#           use_filler_list=None,
          use_filler_list=specified_filler_list,
          use_both=True,
          add_stanford_tags=False,
          stanford_pos_path=STANFORD_POS_PATH)
model_store 

Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_10-condition_1.txt
Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_10-condition_2.txt
Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_12-condition_1.txt
Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_12-condition_2.txt
Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_15-condition_2.txt
Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_15-condition_1.txt
Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_13-condition_1.txt
Processing: /Users/nduran/Desktop/GitProjects/align-linguistic-alignment/examples/toy_data-original/dyad_13-condition_2.txt


Unnamed: 0,participant,content,token,lemma,tagged_token,tagged_lemma,file
0,1,what're we doing,"[what, are, we, doing]","[what, be, we, do]","[(what, WDT), (are, VBP), (we, PRP), (doing, V...","[(what, WP), (be, VB), (we, PRP), (do, VBP)]",dyad_10-condition_1.txt
1,2,we just want to make sure that we're capturin...,"[we, just, want, to, make, sure, that, we, are...","[we, just, want, to, make, sure, that, we, be,...","[(we, PRP), (just, RB), (want, VBP), (to, TO),...","[(we, PRP), (just, RB), (want, VBP), (to, TO),...",dyad_10-condition_1.txt
2,1,we don't really care about what we're saying r...,"[we, do, not, really, care, about, what, we, a...","[we, do, not, really, care, about, what, we, b...","[(we, PRP), (do, VBP), (not, RB), (really, RB)...","[(we, PRP), (do, VBP), (not, RB), (really, RB)...",dyad_10-condition_1.txt
3,2,let's test some more,"[let, us, test, some, more]","[let, u, test, some, more]","[(let, VB), (us, PRP), (test, VB), (some, DT),...","[(let, VB), (u, JJ), (test, VB), (some, DT), (...",dyad_10-condition_1.txt
4,1,more testing but we need more variety,"[more, testing, but, we, need, more, variety]","[more, testing, but, we, need, more, variety]","[(more, RBR), (testing, JJ), (but, CC), (we, P...","[(more, RBR), (testing, JJ), (but, CC), (we, P...",dyad_10-condition_1.txt
5,2,and more testing this is just some sample text,"[and, more, testing, this, is, just, some, sam...","[and, more, test, this, be, just, some, sample...","[(and, CC), (more, RBR), (testing, VBG), (this...","[(and, CC), (more, JJR), (test, NN), (this, DT...",dyad_10-condition_1.txt
0,2,let's test some more and more testing,"[let, us, test, some, more, and, more, testing]","[let, u, test, some, more, and, more, testing]","[(let, VB), (us, PRP), (test, VB), (some, DT),...","[(let, VB), (u, JJ), (test, VB), (some, DT), (...",dyad_10-condition_2.txt
1,1,more testing but we need more variety,"[more, testing, but, we, need, more, variety]","[more, testing, but, we, need, more, variety]","[(more, RBR), (testing, JJ), (but, CC), (we, P...","[(more, RBR), (testing, JJ), (but, CC), (we, P...",dyad_10-condition_2.txt
2,2,this is just some sample text,"[this, is, just, some, sample, text]","[this, be, just, some, sample, text]","[(this, DT), (is, VBZ), (just, RB), (some, DT)...","[(this, DT), (be, VB), (just, RB), (some, DT),...",dyad_10-condition_2.txt
3,1,what're we doing,"[what, are, we, doing]","[what, be, we, do]","[(what, WDT), (are, VBP), (we, PRP), (doing, V...","[(what, WP), (be, VB), (we, PRP), (do, VBP)]",dyad_10-condition_2.txt


## Phase 2: Real

In [None]:
start_phase2real = time.time()

In [43]:
[turn_real,convo_real]= PHASE2RUN_REAL(
#                         input_files = INPUT_PATH+PREPPED_TRANSCRIPTS,
                        input_files = INPUT_PATH+'PREPPED_TRANSCRIPTS_STANFORD/',
                        input_as_directory=True,
                        output_file_directory = INPUT_PATH+ANALYSIS_READY,
                        delay=1,
                        maxngram=3,
                        ignore_duplicates=True,
                        add_stanford_tags=True,       
                        use_pretrained_vectors=True,
                        pretrained_input_file=INPUT_PATH+'package_files/GoogleNews-vectors-negative300.bin',
                        semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt', 
                        high_sd_cutoff=3,
                        low_n_cutoff=1)
turn_real,convo_real 

(Empty DataFrame
 Columns: []
 Index: [], Empty DataFrame
 Columns: []
 Index: [])

## Phase 2: Surrogate

In [None]:
start_phase2surrogate = time.time()

In [None]:
[turn_surrogate,convo_surrogate] = PHASE2RUN_SURROGATE(
                             input_files = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                             surrogate_file_directory= INPUT_PATH+SURROGATE_TRANSCRIPTS,
                             input_as_directory=True,
                             output_file_directory=INPUT_PATH+ANALYSIS_READY,
                             semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt',
                             pretrained_input_file=INPUT_PATH+'package_files/GoogleNews-vectors-negative300.bin',
                             high_sd_cutoff=3,
                             low_n_cutoff=1,
                             id_separator = '\-',
                             condition_label='cond',
                             dyad_label='dyad',
                             all_surrogates=True,
                             keep_original_turn_order=True,
                             delay=1,
                             maxngram=3,
                             use_pretrained_vectors=False,
                             ignore_duplicates=True,
                             add_stanford_tags=False)
turn_surrogate,convo_surrogate

In [None]:
end=time.time()

## Speed calculations

Phase 1 time:

In [None]:
start_phase2real - start_phase1

Phase 2 real time:

In [None]:
start_phase2surrogate - start_phase2real

Phase 2 surrogate time:

In [None]:
end - start_phase2surrogate

All 3 phases:

In [None]:
end - start_phase1
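
The four subtractions above can also be collected into one helper. This is only a minimal sketch: `phase_durations` is a hypothetical function, and the timestamps are made-up values standing in for the `time.time()` readings recorded before each phase.

```python
def phase_durations(checkpoints):
    """Return seconds elapsed between consecutive (label, timestamp) checkpoints."""
    durations = {}
    # each phase lasts from its own checkpoint to the next one
    for (label, start), (_, stop) in zip(checkpoints, checkpoints[1:]):
        durations[label] = stop - start
    durations['total'] = checkpoints[-1][1] - checkpoints[0][1]
    return durations


# made-up timestamps (in seconds), in the order the phases were started
example_checkpoints = [('phase1', 0.0),
                       ('phase2_real', 12.5),
                       ('phase2_surrogate', 40.0),
                       ('end', 100.0)]

timings = phase_durations(example_checkpoints)
print(timings)
```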

## Printouts!

In [None]:
turn_real.head(10)

In [None]:
convo_real.head(10)

In [None]:
turn_surrogate.head(10)

In [None]:
convo_surrogate.head(10)