# To-do list

**Things for Alex to do:**
* [ ] Handling requirements (after getting them)
* [ ] Dockerizing
* [ ] Jupyter app-ifying
* [ ] Getting Stanford tagger included automatically
* [ ] Clean up markdown text (when final notebooks are ready)
* [ ] See if I can implement w2v function (https://github.com/a-paxton/Gensim-LSI-Word-Similarities)
* [ ] Convert functions into library

**Things for Nick to do:**
* [x] Implement surrogate to match by conversation order AND conversation type
* [x] Make file names more intuitive
* [ ] Identify condition/dyad/number flexibly (using regex) - SKIPPED
* [x] Allow surrogate baseline to be created using a smaller subset (permutations) — 2-3x?
* [**???**] Do pip freeze or conda list -e > req.txt
* [**???**] Redo analysis with new baseline + consider doing sample-wise shuffled baseline - Have questions on how to proceed
* [ ] Go over manuscript again with new baseline + review comments/edits

**Note to Nick**: After going through the notebook, it looks like a *lot* of this could be offloaded to a Python package with good documentation. The benefit there is that the "guts" of the notebook can be really streamlined and converted into more of an example file, allowing people to see *what* all of this does rather than *how* it does it. (There will be folks, for instance, who don't know code well enough to want to get into these underlying issues.)

***

# ALIGN

This notebook provides an introduction to **ALIGN**, a tool for quantifying multi-level linguistic similarity between speakers. 

***

**Table of Contents**:

* [Getting Started](#Getting-Started)
    * [Prerequisites](#Prerequisites)
    * [Preparing input data](#Preparing-input-data)
    * [Filename conventions](#Filename-conventions)
    * [User-specified parameters](#User-specified-parameters)
    * [Main calls](#Main-calls)
* [Setup](#Setup)
    * [Import libraries](#Import-libraries)
    * [User-specified settings](#User-specified-settings)
* [Phase 1: Generate "prepped" transcripts](#Phase-1:-Generate-"prepped"-transcripts)
    * [](#)
* [Phase 2: Generate alignment scores](#Phase-2:-Generate-alignment-scores)

***

# Getting Started

### Prerequisites

* Jupyter Notebook with Python 2.7.1.3 kernel
* Packages in `requirements.txt`

*See notes in "DISTRIBUTION ISSUES" Notebook for suggestions on how to package effectively and accomodate Python 3 users*

### Preparing input data

* Each input text file needs to contain a single conversation organized in an `N x 2` matrix
    * Text file must be tab-delimited.
* Each row must correspond to a single conversational turn from a speaker.
    * Rows must be temporally ordered based on their occurrence in the conversation.
    * Rows must alternate between speakers.
* Speaker identifier and content for each turn are divided across two columns.
    * Column 1 must have the header `participant`.
        * Each cell specifies the speaker.
        * Each speaker must have a unique label (e.g., `P1` and `P2`, `0` and `1`).
    * Column 2 must have the header `content`.
        * Each cell corresponds to the transcribed utterance from the speaker.
        * Each cell must end with a newline character: `\n`
* See `filename.txt` in Github repository for an example

### Filename conventions

* Each conversation text file needs to be named in the format: `A_B.txt`
    * `A` corresponds to the dyad number for that conversation
    * `B` corresponding to a condition code for that conversation 

### User-specified parameters

* Define input path
* Define input folder where original transcripts are located
* Define folder to save prepped transcripts 
* Define folder to save surrogate transcripts 
* Decide the minimum number of words for each turn
    * Default: 1
* Decide maximum size for n-gram chunking
    * Default: 4
* Decide on whether to run the Stanford tagger (time-consuming) along with Penn tagger
    * Default: Penn tagger only
* Decide on max delay between partner's turns to generate alignment score
    * Currently only option is for contiguous turns
    * Will be updated in a future version

**Note to Nick**: The default in surrogate set sounds like it's missing some parts of the sentence. Can you double-check for sense?

**ND** I don't think we need this option after all. It does not take long to run the full surrogate set and if users want fewer for analysis, they can remove accordingly when they get to that stage

### Main calls

`PHASE1RUN`

* Converts each conversation into standardized format.
* Each utterance is tokenized and lemmatized and has POS tags added.

`PHASE2RUN_REAL`

* Generates turn-level and conversation-level alignment scores (lexical, conceptual, and syntactic) across a range of n-gram sequences

`PHASE2RUN_SURROGATE`

* Generates a surrogage corpus.
* Runs identical analysis as PHASE2RUN_REAL on the surrogage corpus.

***

# Setup

Here, we'll get ready to run ALIGN over our target dataset.

*NOTE: Can include these in the Anaconda ".yml" file so automatically imported and can be removed from the main notebook here*

[To top](#ALIGN).

## Import libraries

### Standard libraries

In [1]:
import os,re,math,csv,string,random,logging,glob,itertools,operator
from os import listdir 
from os.path import isfile, join 
from collections import Counter, defaultdict 
from itertools import chain, combinations

### Third-party libraries

For data analysis and data handling:

In [2]:
import pandas as pd
import numpy as np
from scipy import spatial 

For natural language processing:

In [3]:
import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn 
from nltk.tag.stanford import StanfordPOSTagger
from nltk.util import ngrams

Don't forget to grab the POS tagger we'll be using:

In [4]:
nltk.download('maxent_treebank_pos_tagger')

[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /Users/alexandra/nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!


True

**Note**: The `StanfordPOSTagger` will be
used in conjunction with local folder `stanford-postagger-2015-04-20` and `.jar` file. Both files will be called below, after user-specified folders.

For building semantic space:

In [5]:
from gensim.models import word2vec 

## User-specified settings

### Directories and folders

Set working directory, in which all notebook and supporting files are located.

Or "Pathname for the unzipped project folder" if going with `anaconda-project.yml` configuration file

In [6]:
INPUT_PATH=os.getcwd()+'/'

Set variable for folder name (as string) for relative location of folder containing the original transcript files.

In [7]:
TRANSCRIPTS = 'toy_data-original/'

Set variable for folder name (as string) for relative location of folder into which prepared transcript files will be saved.

In [8]:
PREPPED_TRANSCRIPTS = 'toy_data-prepped/'

Set variable for folder name (as string) for relative location of folder into which analysis-ready dataframe files will be saved.

In [9]:
ANALYSIS_READY = 'toy_data-analysis/'

Set variable for folder name (as string) for relative location of folder into which all prepared surrogate transcript files will be saved.

In [10]:
SURROGATE_TRANSCRIPTS = 'toy_data-surrogate/'

### Analysis settings

Set minimum number of words for each turn. (Default: 1)

In [11]:
MINWORDS = 1

Set maximum size for n-gram chunking. (Default: 4)

In [12]:
MAXNGRAM = 4

Decide whether to run the Stanford POS tagger along with Penn POS tagger. Adding the Stanford POS tagger to the Penn tagger will lead to an increase in processing time. (Default: Penn only).

**Note to Nick**: What do the values here (below) mean? I'm interpreting it as `1` means to run Stanford and `0` would mean not to run Stanford. If so, we should include this.

In [13]:
STANFORD = 1

Decide whether to extract a surrogate set that includes all possible dyad pairs within condition. (Default: extracts a smaller the is 2x the size as the original dataset)

**Note to Nick**: As I mentioned in the earlier `ALLSURROGATE` description, this seems like there's some words missing above. Also, what does this value mean?

In [14]:
ALLSURROGATE = 0

Set max delay between partner's turns when generating alignment score.

Currently, the only acceptable value is 1 (i.e., contiguous turns).

In [15]:
DELAY = 1

In [16]:
# PHASE1RUN()
# PHASE2RUN_REAL()
# PHASE2RUN_SURROGATE()

**Note to Nick**: What does the code chunk above do?

***

# Phase 1: Generate "prepped" transcripts

## Initial clean-up

* **[Clean up text](#Clean-up-text)** by removing:
    * numbers, punctuation, and other non-ASCII alphabet characters
    * common speech fillers (e.g., "um", "huh") and their derivations
    * empty turns that may have inadvertently been included
    * user-specified short turns
        * Default: removes 1-word turns
* **[Merge adjacent turns by the same participant](#Merge-adjacent-turns-by-the-same-participant)** into a single utterance row.

[To top](#ALIGN).

### Clean up text

In [17]:
# def InitialCleanup(dataframe,minwords):
#     WHITELIST = string.letters + '\'' + ' '
#     clean = []
#     utteranceLen = []
#     for value in dataframe['content'].values:
#         cleantext = ''.join(c for c in value if c in WHITELIST).lower() 
        
#         ## OPTIONAL: remove typical speech fillers, examples: "um, mm, oh, hm, uhm, uh" ; additional regular expressions can be added as needed for specific needs
#         removefiller = re.sub('^[uhmo]+[mh]+\s', ' ', cleantext) ## at the start of a string
#         removefiller = re.sub('\s[uhmo]+[mh]+\s', ' ', removefiller) ## within a string                    
#         clean.append(removefiller)        
        
#         utteranceLen.append(len(re.findall(r'\w+', removefiller)))
    
#     ## drop the old "content" column and add the clean "content" column
#     dataframe = dataframe.iloc[:, [0,1]]
#     dataframe['content'] = clean
    
#     ## remove rows that are blank or do not meet "minwords" requirement, then drop length column
#     dataframe['utteranceLen'] = utteranceLen 
#     dataframe = dataframe.loc[(dataframe['utteranceLen'] != 0) & (dataframe['utteranceLen'] > int(minwords))]
#     dataframe = dataframe.iloc[:, [0,1]]
#     return dataframe

**Note to Nick**: The form above doesn't catch fillers that appear at the end of lines, since it requires there to be space before and after fillers -- but `\n` aren't included in the `WHITELIST` variable.

... Alternatively, we could add a line before line 6 to swap all newlines for spaces. Otherwise, we might not catch fillers at the very end of the line (since the regex requires us to find a whitespace before and after the filler).

**Note to Nick**: It looks like we only catch 2-letter fillers. Something like "uhm" or "huh" wouldn't be caught, so we'll need to update it to catch 3-letter fillers without getting non-filler words. It might be worth our while to create a dictionary of filler words rather than a regex expression.

That would probably lead to a performance boost for this, too, since we could do a list comprehension instead.

In [18]:
def alex_InitialCleanup(dataframe,
                        minwords=1,
                        remove_regex_fillers=True,
                        remove_other_list=None):

    """
    Perform basic text cleaning to prepare dataframe
    for analysis. Remove non-letter/-space characters,
    empty turns, turns below a minimum length (default:
    1 word), and fillers.
    
    By default, remove 2-letter fillers through regex.
    If desired, skip regex filtering of fillers with
    `remove_regex_fillers=FALSE`.
    
    If desired, remove other words (e.g., fillers) 
    passed as a list to `remove_other_list` argument.
    """
    
    # only allow strings, spaces, and newlines to pass
    WHITELIST = string.letters + '\'' + ' '
    clean = []
    utteranceLen = []
    for value in dataframe['content'].values:
        cleantext = ''.join(c for c in value if c in WHITELIST).lower() 
        
        ## OPTIONAL: remove typical speech fillers, examples: "um, mm, oh, hm, uh"
        if remove_regex_fillers==True:
            cleantext = re.sub('(^|\s)[uhmo]+[mh]+[\s|$]', ' ', cleantext)
        
        # optional: remove other words specified in list
        if remove_other_list!=None:
            cleantext = [word for word in cleantext if word not in other_filler_list]
        
        # append cleaned lines
        clean.append(cleantext)        
        utteranceLen.append(len(re.findall(r'\w+', cleantext)))
    
    ## drop the old "content" column and add the clean "content" column
    dataframe = dataframe.iloc[:, [0,1]]
    dataframe['content'] = clean
    
    ## remove rows that are blank or do not meet "minwords" requirement, then drop length column
    dataframe['utteranceLen'] = utteranceLen 
    dataframe = dataframe.loc[(dataframe['utteranceLen'] != 0) & (dataframe['utteranceLen'] > int(minwords))]
    dataframe = dataframe.iloc[:, [0,1]]
    
    # return the cleaned dataframe
    return dataframe

### Merge adjacent turns by the same participant

In [19]:
def AdjacentMerge(dataframe):
    repeat=1
    while repeat==1:
        l1=len(dataframe) 
        DfMerge = []
        k = 0
        if len(dataframe) > 0:
            while k < len(dataframe)-1: 
                if dataframe['participant'].iloc[k] != dataframe['participant'].iloc[k+1]:
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])         
                    k = k + 1
                elif dataframe['participant'].iloc[k] == dataframe['participant'].iloc[k+1]:                    
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k] + " " + dataframe['content'].iloc[k+1]])           
                    k = k + 2   
            if k == len(dataframe)-1:
                DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])      
        
        dataframe=pd.DataFrame(DfMerge,columns=('participant','content'))
        if l1==len(dataframe): 
            repeat=0    
    return dataframe

## Prepare transcript text

* **[Check spelling](#Check-spelling)** via a Bayesian spell-checking algorithm (http://norvig.com/spell-correct.html).
* **[Tokenize and apply spell correction](#Tokenize-and-apply-spell-correction)** to the original transcript text.
* **[Lemmatize](#Lemmatize)** using WordNet-derived categories.
* [**Part-of-speech tagging**](#Part-of-speech-tagging) with user-defined tagger(s) on both lemmatized and non-lemmatized tokens.
    * Users may choose to use the Penn Treebank POS tagger (default) and/or the Stanford POS tagger (optional). The Penn tagger is more time-efficient.

### Check spelling

In [20]:
# def words(text): return re.findall('[a-z]+', text.lower()) 

**Note to Nick**: This is only used in the master function, so I suggest incorporating it into there. See `alex_PHASE1RUN` for a possible implementation.

In [21]:
# def train(features): 
#     model = defaultdict(lambda: 1)
#     for f in features:
#         model[f] += 1
#     return model

**Note to Nick**: Same as previous comment.

In [22]:
# def edits1(word): 
#     alphabet = 'abcdefghijklmnopqrstuvwxyz'
#     splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
#     deletes    = [a + b[1:] for a, b in splits if b]
#     transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
#     replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
#     inserts    = [a + c + b     for a, b in splits for c in alphabet]
#     return set(deletes + transposes + replaces + inserts)

**Note to Nick**: I'd recommend (for Pythonic purposes) using `string.lowercase` instead of a new string. See below:

In [23]:
# def edits1(word): 
#     splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
#     deletes    = [a + b[1:] for a, b in splits if b]
#     transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
#     replaces   = [a + c + b[1:] for a, b in splits for c in string.lowercase if b]
#     inserts    = [a + c + b     for a, b in splits for c in string.lowercase]
#     return set(deletes + transposes + replaces + inserts)

In [24]:
# def known_edits2(word,nwords):
#     return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

In [25]:
# def known(words,nwords): return set(w for w in words if w in nwords)

In [26]:
# def correct(word,nwords):
#     candidates = known([word],nwords) or known(edits1(word),nwords) or known_edits2(word,nwords) or [word]
#     return max(candidates, key=nwords.get)

**Note to Nick**: For documentation purposes, I'd recommend wrapping all of the functions related to `Tokenize` into the `Tokenize` function. I've proposed a modified version below as `alex_Tokenize`.

### Tokenize and apply spell correction

In [27]:
# def Tokenize(text,nwords):
    
#     cleantoken = []
#     token = word_tokenize(text)
    
#     for word in token:
#         cleantoken.append(correct(word,nwords))
#     return cleantoken

In [28]:
def alex_Tokenize(text,nwords):
    """
    Given list of text to be processed and a list 
    of known words, return a list of edited and 
    tokenized words.
    """
    
    # internal function: identify possible spelling errors for a given word
    def edits1(word): 
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
        replaces   = [a + c + b[1:] for a, b in splits for c in string.lowercase if b]
        inserts    = [a + c + b     for a, b in splits for c in string.lowercase]
        return set(deletes + transposes + replaces + inserts)

    # internal function: identify known edits
    def known_edits2(word,nwords):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

    # internal function: identify known words
    def known(words,nwords): return set(w for w in words if w in nwords)

    # internal function: correct spelling
    def correct(word,nwords):
        candidates = known([word],nwords) or known(edits1(word),nwords) or known_edits2(word,nwords) or [word]
        return max(candidates, key=nwords.get)
    
    # process all words in the text
    cleantoken = []
    token = word_tokenize(text)    
    for word in token:
        cleantoken.append(correct(word,nwords))

    # return list of tokenized words
    return cleantoken

**Note to Nick**: I'm not sure that defining the functions here is helpful to readers trying to figure out what's happening. I think it might lead to more transparent (and possibly shorter) code if we simply move the functions into the loop as plain lines of code.

### Lemmatize

In [29]:
def penn_to_wn(tag):
    """
    Convert Penn tagger output into a format that Wordnet
    can use in order to properly lemmatize the text.
    """
    
    # create some inner functions for simplicity
    def is_noun(tag):
        return tag in ['NN', 'NNS', 'NNP', 'NNPS']
    def is_verb(tag):
        return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    def is_adverb(tag):
        return tag in ['RB', 'RBR', 'RBS']
    def is_adjective(tag):
        return tag in ['JJ', 'JJR', 'JJS']
    
    # check each tag against all possible categories
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return wn.NOUN

**Note to Nick**: Is line 26 (above) needed, or is that an error? It seems as though it's functioning as an implicit `else`. Is that intentional?

In [30]:
def Lemmatize(tokenlist):
    lemmatizer = WordNetLemmatizer() 
    pennPos = nltk.pos_tag(tokenlist) # get the POS tags from Penn Treebank tagset that will be used
    words_lemma = []
    for item in pennPos: # need to convert these POS tags to a format that wordnet uses to properly lemmatize
        words_lemma.append(lemmatizer.lemmatize(item[0],
                                                penn_to_wn(item[1])))
    return words_lemma

In [31]:
def alex_ApplyTokenLemmatize(df,nwords):
    df['token'] = ""
    df['lemma'] = ""
    for i in range(0,len(df)):
        df['token'].iloc[i]=alex_Tokenize(df['content'].iloc[i],nwords)
        df['lemma'].iloc[i]=Lemmatize(df['token'].iloc[i])  
    return df

### Part-of-speech tagging

In [32]:
# def ImportStanfordTagger():
#     """
#     Import the Stanford POS Tagger included in the package.
#     """
#     stantagger = StanfordPOSTagger(INPUT_PATH + 'stanford-postagger-2015-04-20/models/english-bidirectional-distsim.tagger',
#                                    INPUT_PATH + 'stanford-postagger-2015-04-20/stanford-postagger.jar')
#     return stantagger

**Note to Nick**: `ImportStanfordTagger` is a very targeted function that's only used in (as far as I can see) one other function. I think we should remove it and instead incorporate it into the POS tagger as an optional argument (see below).

In [33]:
def alex_ApplyPOSTagging(df,filename,add_stanford_tags=0):

    """
    Apply part-of-speech tagging to a dataframe of conversation turns 
    (df). Pass filename as a string to create create a new df variable. 
    By default, return only tags from the Penn POS tagger. Optionally,
    also return Stanford POS tagger results by setting  
    `add_stanford_tags=1`.
    """
    
    # create new columns in our dataframe
    df['tagged_penn_token'] = ""
    df['tagged_penn_lemma'] = ""
    df['file'] = ""
    if add_stanford_tags == 1:
        df['tagged_stan_token'] = ""
        df['tagged_stan_lemma'] = ""
        
    # if desired, import Stanford tagger
    if add_stanford_tags == 1:
        stanford_tagger = StanfordPOSTagger(INPUT_PATH + 'stanford-postagger-2015-04-20/models/english-bidirectional-distsim.tagger',
                                            INPUT_PATH + 'stanford-postagger-2015-04-20/stanford-postagger.jar')
    
    # cycle through each line in the dataframe
    for i in range(0,len(df)):
        df['file'].iloc[i]=filename

        # by default, tag with Penn POS tagger
        pos_penn_token=nltk.pos_tag(df['token'].iloc[i])
        df['tagged_penn_token'].iloc[i]=pos_penn_token 
        pos_penn_lemma=nltk.pos_tag(df['lemma'].iloc[i])
        df['tagged_penn_lemma'].iloc[i]=pos_penn_lemma 

        # if desired, also tag with Stanford tagger
        if add_stanford_tags == 1:
            pos_stan_token=stanford_tagger.tag(df['token'].iloc[i])
            df['tagged_stan_token'].iloc[i]=pos_stan_token    
            pos_stan_lemma=stanford_tagger.tag(df['lemma'].iloc[i])
            df['tagged_stan_lemma'].iloc[i]=pos_stan_lemma  

    # return finished dataframe
    return df

**Note to Nick**: I updated the above with some new variable names, combining the guts of `ImportStanfordTagger` into the function, and more comments. It also no longer returns the Stanford columns if the user opts out of using it.

## RUN Phase 1

* For each original transcript file, saves new file with columns for:
    * "Clean" text
    * Tokenized words
    * Tokenized lemmatized-words
    * Penn POS-tagging on tokenized words
    * Penn POS-tagging on lemmatized-words
    * Stanford POS-tagging on tokenized words
    * Stanford POS-tagging on lemmatized-words
* Also saves a single datasheet with all tokenized lemmatized utterances from all transcripts as individual rows
    * called forSemantic.txt
    * to be used in building semantic space for Phase 2

In [34]:
# def PHASE1RUN():   

#     # import the POS taggers and spell checking
#     st = ImportStanfordTagger()
#     nwords = train(words(file(INPUT_PATH + 'big.txt').read()))
    
#     # grab filenames
#     filesList = [ f for f in listdir(INPUT_PATH + TRANSCRIPTS) if isfile(join(INPUT_PATH + TRANSCRIPTS,f)) ]
# #     filesList=filesList[19:21] ##// run all files   
#     filesList=filesList[1:]
    
#     # cycle through all files 
#     main = []
#     for fileName in filesList:      
#         print fileName
#         dataframe = pd.read_csv(INPUT_PATH + TRANSCRIPTS + fileName, sep='\t',encoding='utf-8')
#         dataframe = InitialCleanup(dataframe,MINWORDS)
#         dataframe = AdjacentMerge(dataframe)
#         dataframe = ApplyTokenLemmatize(dataframe,nwords)
#         dataframe = ApplyPOSTagging(dataframe, fileName, st)
        
#         # export the dataframe as a CSV
#         dataframe.to_csv(INPUT_PATH + PREPPED_TRANSCRIPTS + fileName, encoding='utf-8',index=False,sep='\t')
#         main.append(dataframe)
#         dataframe=None

#     # return the finished dataframe
#     main = pd.concat(main, axis=0)
#     return main
# #     main.to_csv(INPUT_PATH  + "forSemantic.txt",encoding='utf-8',index=False,sep='\t') 

**Note to Nick**: I've got a few cells below where I tried to pull back the function to examine its guts. I made a few notes along the way in markdown cells!

In [35]:
#     # import the POS taggers and spell checking
#     st = ImportStanfordTagger()
#     nwords = train(words(file(INPUT_PATH + 'big.txt').read()))
    
#     # grab filenames
#     filesList = [ f for f in listdir(INPUT_PATH + TRANSCRIPTS) if isfile(join(INPUT_PATH + TRANSCRIPTS,f)) ]
# #     filesList=filesList[19:21] ##// run all files   
#     filesList=filesList[1:]
    
#     # cycle through all files 
#     main = []

In [36]:
# fileName=filesList[0]

In [37]:
#         print fileName        

In [38]:
#         dataframe = pd.read_csv(INPUT_PATH + TRANSCRIPTS + fileName, sep='\t',encoding='utf-8')

In [39]:
# dataframe.head(10)

In [40]:
#         dataframe = InitialCleanup(dataframe,MINWORDS)

In [41]:
# dataframe.head(10)

**Note to Nick**: Yeah, it looks like the current regex isn't capturing end-of-turn fillers. We need to update the code to address that (see suggestions above).

In [42]:
#         dataframe = AdjacentMerge(dataframe)

In [43]:
# dataframe.head(10)

In [44]:
#         dataframe = ApplyTokenLemmatize(dataframe,nwords)

In [45]:
# dataframe.head(10)

**Note to Nick**: In inspecting the output of the tagger, it looks like it's accidentally tagging the possessive `'s` as the verb contraction. We should think about what we're doing there.

In [46]:
#         dataframe = ApplyPOSTagging(dataframe, fileName, st)

In [47]:
# dataframe.head(10)

In [48]:
# PHASE1RUN()

**Note to Nick**: I may have just missed it, but I thought that the Stanford tagger was option (cf. line 4). Is there an option or something that I'm missing somewhere?

**Note to Nick**: Are lines 8 and 10 both necessary? Moreover, line 10 starts from the beginning of an R index (i.e., 1) rather than the Pythonic index start (i.e., 0). It seems like we should just remove line 10... or specify an option in the fuction to start from a certain file index number (again, keeping in mind the Pythonic numbering).

**Note to Nick**: Is line 9 legacy code?

**Note to Nick**: Since we rename `dataframe` by loading in a new file each time, I don't think we need line 25.

**Note to Nick**: From the text box above this, it looks like we should be uncommenting line 30? Alternatively, we could assign it to a variable.

In [49]:
def alex_PHASE1RUN(input_file_directory, 
                   output_file_directory,
                   training_dictionary=INPUT_PATH+'big.txt',
                   add_stanford_tagger=0):   

    """
    Given a directory of individual .txt files, 
    return a completely prepared dataframe of transcribed 
    conversations for later ALIGN analysis, including: text 
    cleaning, merging adjacent turns, spell-checking, 
    tokenization, lemmatization, and part-of-speech tagging. 
    By default, return only the Penn Treebank 
    POS tagger values; optionally, also return Stanford POS tagger
    values with `add_standford_tagger=1`.
    """
    
    # create an internal function to train the model
    def train(features): 
        model = defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model
        
    # train our spell-checking model
    nwords = train(re.findall('[a-z]+',(file(training_dictionary).read().lower())))
    
    # cycle through all files 
    import glob
    file_list = glob.glob(input_file_directory+"*.txt")
    main = []
    for fileName in file_list:      
        
        # let us know which file we're processing
        dataframe = pd.read_csv(fileName, sep='\t',encoding='utf-8')
        print "Processing: "+fileName

        # clean up, merge, spellcheck, tokenize, lemmatize, and POS-tag
        dataframe = alex_InitialCleanup(dataframe,MINWORDS)
        dataframe = AdjacentMerge(dataframe)
        dataframe = alex_ApplyTokenLemmatize(dataframe,nwords)
        dataframe = alex_ApplyPOSTagging(dataframe, 
                                    os.path.basename(fileName), 
                                    add_stanford_tagger)
        
        # export the conversation's dataframe as a CSV
        dataframe.to_csv(output_file_directory + os.path.basename(fileName), 
                         encoding='utf-8',index=False,sep='\t')
        main.append(dataframe)

    # save the concatenated dataframe
    main = pd.concat(main, axis=0)
    main.to_csv(output_file_directory +'../' + "align_concatenated_dataframe.txt",encoding='utf-8',
                index=False, sep='\t')
    
    # return the dataframe
    return main

**Note to Nick**: Here's how it would run in our context:

In [50]:
alex_PHASE1RUN(input_file_directory=INPUT_PATH+TRANSCRIPTS,
                      output_file_directory=INPUT_PATH+PREPPED_TRANSCRIPTS,
                      training_dictionary=INPUT_PATH+'big.txt')

Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_10-condition_1.txt


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_10-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_12-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_12-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_15-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_15-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_13-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_13-condition_2.txt


Unnamed: 0,participant,content,token,lemma,tagged_penn_token,tagged_penn_lemma,file
0,1,what're we doing uh,"[what, are, we, doing, up]","[what, be, we, do, up]","[(what, WP), (are, VBP), (we, PRP), (doing, VB...","[(what, WP), (be, VB), (we, PRP), (do, VBP), (...",dyad_10-condition_1.txt
1,2,we just want to make sure that we're capturin...,"[we, just, want, to, make, sure, that, we, are...","[we, just, want, to, make, sure, that, we, be,...","[(we, PRP), (just, RB), (want, VBP), (to, TO),...","[(we, PRP), (just, RB), (want, VBP), (to, TO),...",dyad_10-condition_1.txt
2,1,we don't really care about what we're saying r...,"[we, do, not, really, care, about, what, we, a...","[we, do, not, really, care, about, what, we, b...","[(we, PRP), (do, VBP), (not, RB), (really, RB)...","[(we, PRP), (do, VBP), (not, RB), (really, RB)...",dyad_10-condition_1.txt
3,2,let's test some more,"[let, is, test, some, more]","[let, be, test, some, more]","[(let, NN), (is, VBZ), (test, NN), (some, DT),...","[(let, NN), (be, VB), (test, NN), (some, DT), ...",dyad_10-condition_1.txt
4,1,more testing but we need more variety,"[more, testing, but, we, need, more, variety]","[more, testing, but, we, need, more, variety]","[(more, JJR), (testing, NN), (but, CC), (we, P...","[(more, JJR), (testing, NN), (but, CC), (we, P...",dyad_10-condition_1.txt
5,2,and more testing this is just some sample text,"[and, more, testing, this, is, just, some, sam...","[and, more, testing, this, be, just, some, sam...","[(and, CC), (more, JJR), (testing, NN), (this,...","[(and, CC), (more, JJR), (testing, NN), (this,...",dyad_10-condition_1.txt
0,2,let's test some more,"[let, is, test, some, more]","[let, be, test, some, more]","[(let, NN), (is, VBZ), (test, NN), (some, DT),...","[(let, NN), (be, VB), (test, NN), (some, DT), ...",dyad_10-condition_2.txt
1,1,more testing,"[more, testing]","[more, testing]","[(more, JJR), (testing, NN)]","[(more, JJR), (testing, NN)]",dyad_10-condition_2.txt
2,2,and more testing,"[and, more, testing]","[and, more, testing]","[(and, CC), (more, JJR), (testing, NN)]","[(and, CC), (more, JJR), (testing, NN)]",dyad_10-condition_2.txt
3,1,more testing but we need more variety,"[more, testing, but, we, need, more, variety]","[more, testing, but, we, need, more, variety]","[(more, JJR), (testing, NN), (but, CC), (we, P...","[(more, JJR), (testing, NN), (but, CC), (we, P...",dyad_10-condition_2.txt


***

# Phase 2: Generate alignment scores

* [**Create helper functions**](#Create-helper-functions) for processing turn- and conversation-level data.
* **[Build semantic space](#Build-semantic-space)** from the `forSemantic.txt` generated in Phase 1 and return a `word2vec` semantic space and vocabulary list.

[To top.](#ALIGN)

### Create helper functions

**Note to Nick**: This whole section could be easily collapsed into just the package.

In [51]:
# def convert_list_to_string(in_str):
#     """
#     Add something here?
#     """     
#     result = []
#     tokens=in_str.split(",")
#     for t in tokens:
#         s = unicode(t.replace("[","").replace("]", "").replace(" ",""))
#         result.append(s)
#     return result

**Note to Nick**: `convert_list_to_string` was acting a bit strangely for me (and I think was related to the broken loop issue I had, noted below), so I implemented similar functionality in `BuildSemantic` function.

In [52]:
# def convert_tup(in_str):
#     """
#     Add something here?
#     """      
#     result = []
#     current_tuple = []
#     tokens=in_str.split(",")
#     for t in tokens:
#         s = unicode(t.replace("(","").replace(")", "").replace(" ",""))
#         s = unicode(s.replace("[","").replace("]", "").replace(" ",""))
#         current_tuple.append(s)
#         if ")" in t:
#             result.append(tuple(current_tuple))
#             current_tuple = []
#     return (result)

**Note to Nick**: The `convert_tup` function can also be mimicked using existing functions -- I've gone ahead and swapped that out through the code below.

In [53]:
# def gather_items(seq):
#     """
#     Add something here?
#     """         
#     new_seq = []
#     for tup in seq:
#         getPos = [item[1] for item in tup]
#         new_seq.append(' '.join(item for item in getPos))
#     return new_seq


In [54]:
# def gather_items_lexical(seq):
#     """
#     Add something here?
#     """     
#     new_seq = []
#     for tup in seq:
#         getPos = [item for item in tup]
#         new_seq.append(' '.join(item for item in getPos))
#     return new_seq

**Note to Nick**: `gather_items` functions can be implemented in a list comprehension inline.

In [55]:
# def alex_ngram_remove_dups_pos(sequence1, sequence2, ngramsize=2):
#     """
#     Remove mimicked lexical sequences from two interlocutors'
#     sequences and return a dictionary of counts of ngrams
#     of the desired size for each sequence.
    
#     By default, consider bigrams. If desired, this may be 
#     changed by setting `ngramsize` to the appropriate 
#     value.
    
#     Note that the returned counter dictionaries do not retain 
#     original ordering of the n-grams.
#     """     
    
#     # remove duplicates and recreate sequences
#     sequence1 = set(ngrams(sequence1,ngramsize))
#     sequence2 = set(ngrams(sequence2,ngramsize))
#     noDup1 = list(sequence1 - sequence2)
#     noDup2 = list(sequence2 - sequence1)
#     new_sequence1 = [tuple([' '.join(pair) for pair in tup]) for tup in noDup1]
#     new_sequence2 = [tuple([' '.join(pair) for pair in tup]) for tup in noDup2]
    
#     # return counters
#     return Counter(new_sequence1), Counter(new_sequence2)

**Note to Nick**: I modified the de-duplication to be passed as an argument to the `ngram_pos` function -- see below.

In [56]:
def alex_ngram_pos(sequence1,sequence2,
                   ignore_duplicates=True,
                   ngramsize=2):
    """
    Remove mimicked lexical sequences from two interlocutors'
    sequences and return a dictionary of counts of ngrams
    of the desired size for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.
    
    By default, ignore duplicate lexical n-grams when
    processing these sequences. If desired, this may
    be changed with `ignore_duplicates=False`.
    """     

    # remove duplicates and recreate sequences
    sequence1 = set(ngrams(sequence1,ngramsize))
    sequence2 = set(ngrams(sequence2,ngramsize))
 
    # if desired, remove duplicates from sequences
    if ignore_duplicates==True:
        new_sequence1 = [tuple([' '.join(pair) for pair in tup]) for tup in list(sequence1 - sequence2)]
        new_sequence2 = [tuple([' '.join(pair) for pair in tup]) for tup in list(sequence2 - sequence1)]
    else:
        new_sequence1 = [tuple([' '.join(pair) for pair in tup]) for tup in sequence1]
        new_sequence2 = [tuple([' '.join(pair) for pair in tup]) for tup in sequence2]
        
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)

In [57]:
def alex_ngram_lexical(sequence1,sequence2,ngramsize=2):
    """
    Create ngrams of the desired size for each of two
    interlocutors' sequences and return a dictionary 
    of counts of ngrams for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.
    """   
    
    # generate ngrams
    sequence1 = list(ngrams(sequence1,ngramsize))
    sequence2 = list(ngrams(sequence2,ngramsize)) 

    # join for counters
    new_sequence1 = [' '.join(pair) for pair in sequence1]
    new_sequence2 = [' '.join(pair) for pair in sequence2]
    
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)

**Note to Nick**: I updated the above with the list comprehension. According to processing of the code, the `ngram_lexical` doesn't deal with `(word,POS)` tuples but a list of words.

**Note to Nick**: This throws some errors when one sequence is shorter than the maximum n-gram size. We need to figure out how we'll deal with this. For example, will we just produce NAs from it?

In [58]:
def alex_get_cosine(vec1, vec2): 
    """
    Derive cosine similarity metric, standard measure.
    Adapted from <https://stackoverflow.com/a/33129724>.
    """     
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

**Note to Nick**: I tried implementing some of the standard functions, but it looks like the `get_cosine` is best for text-based cosines. I found this in a StackOverflow answer and linked to it in the description.

In [59]:
# def create_columns_list(dictionaries_list):
#     """
#     Takes in list of the following dictionaries:
#         cosine_syntaxPN, cosine_syntaxPY, cosine_syntaxPYL, 
#         cosine_syntaxPNL, cosine_syntaxSN, cosine_syntaxSY, cosine_syntaxSYL, 
#         cosine_syntaxSNL, cosine_lexical, cosine_lexicalL,
#         cosine_semanticL, partner_direction, condition_info
        
#     Gathers all the keys from these dictionaries in a list
#     """                 
#     columns_list = [col for i, dic in enumerate(d for d in dictionaries_list) for col in sorted(dic)] 
#     return columns_list


**Note to Nick**: `create_columns_list` (above) is a function that only includes a single, clean line of code -- I think we can just stick it into the relevant functions as-is. (I've gone ahead and implemented that in all of the functions below.) Same for `merge_dict_list` (below).

In [60]:
# def merge_dict_list(dict_list):
#     """
#     Add something here?
#     """     
#     return dict(j for i in dict_list for j in i.items())    

In [61]:
# def add_index(df,index):
#     """
#     Add something here?
#     """    
#     i1=np.arange(len(df))
#     d1 = pd.DataFrame(index, index = i1)
#     df=df.join(d1)
#     df=df.rename(columns = {0:'time'})
#     return df    

**Note to Nick**: It seems like we could swap `add_index` (above) for a native pandas call, like `df.reset_index(inplace=True)` or `df['index1'] = df.index`. I updated these calls with native implementation.

In [62]:
def build_composite_semantic_vector(lemma_seq,vocablist,highDimModel):
    """
    Function for producing vocablist and model is called in the main loop
    """
    getComposite = [0] * len(highDimModel[vocablist[1]])    
    for w1 in lemma_seq:
        if w1 in vocablist:
            semvector = highDimModel[w1]
            getComposite = getComposite + semvector    
    return getComposite

# what we want to do here is find the union of the vocablist within the HDM and then sum over all of the columns.
# should be faster/easier than 

### Build semantic space

In [63]:
# def BuildSemanticModel():
#     """
#     Build a semantic model from all transcripts from all conversations in target corpus.
#     """
    
#     ##// The output of the lemmatizer is saved in text files and when they are loaded again 
#     # lists are read as strings and need to be converted to lists again
#     data1=pd.read_csv(INPUT_PATH + "forSemantic.txt", sep='\t',encoding='utf-8')

#     ## get individual words in each utterance
#     data1 = data1.lemma.values
#     getSentences = [[convert_list_to_string(word) for word in data.split()] for data in data1]

#     ## Generate a dictionary and frequency count of all included words
#     frequency = defaultdict(int)
#     for text in getSentences:
#         for token in text:
#             frequency[token] += 1
#     ## Remove words that only occur once
#     frequency = {k: v for k, v in frequency.iteritems() if v > 1}

#     ## get list of ALL unique words and corresponding frequency counts
#     uniqueWords = frequency.keys() 
#     uniqueValues = frequency.values() 

#     ## find very frequent words based on 3SDs above mean, and generate a final list of ALL unique words and of unique content words (not the highest frequency). 
#     ## the cutoff to be considered high-frequency
#     getOut = np.mean(uniqueValues)+(np.std(uniqueValues)*3) 
#     ## final list of CONTENT words to be used when building composite vectors
#     contentWords = [words2 for words2 in frequency.keys() if frequency[words2] < getOut] 

#     ## Finally, build actual semantic space
#     logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
#     semmodel = word2vec.Word2Vec(getSentences, min_count=1) ##// one line, simply builds model and word vectors can be easily accessed and combined

#     return contentWords, semmodel

**Note to Nick:** I'm not sure why, but the loop in lines 16-18 broke for me. I've updated below based on that, although I'm really not sure why it broke for me. I also noticed that the high-frequency words aren't stripped from the word2vec model building, since the `getSentences` values aren't stripped of high-frequency words.

In [64]:
def alex_BuildSemanticModel(semantic_model_input_file,
                            high_sd_cutoff=3,
                            low_n_cutoff=1):
    """
    Given an input file produced by the ALIGN Phase 1 functions, 
    build a semantic model from all transcripts in all conversations
    in target corpus after removing high- and low-frequency words.
    High-frequency words are determined by a user-defined number of
    SDs over the mean (by default, `high_sd_cutoff=3`). Low-frequency
    words must appear over a specified number of raw occurrences 
    (by default, `low_n_cutoff=1`).
    
    Frequency cutoffs can be removed by `high_sd_cutoff=None` and/or
    `low_n_cutoff=0`.
    """
    
    # read in the file
    data1 = pd.read_csv(semantic_model_input_file, sep='\t',encoding='utf-8')
    
    # get frequency count of all included words
    all_words = filter(str.isalpha,[word.strip() for word in str(data1['lemma']).split(',')])
    frequency = defaultdict(int)
    for word in all_words:
        frequency[word] += 1
        
    # remove words that only occur more frequently than our cutoff (defined in occurrences)
    frequency = {word: freq for word, freq in frequency.iteritems() if freq > low_n_cutoff}

    # if desired, remove high-frequency words (over user-defined SDs above mean) 
    if high_sd_cutoff == None:
        contentWords = [word for word in frequency.keys()] 
    else:
        getOut = np.mean(frequency.values())+(np.std(frequency.values())*(high_sd_cutoff))
        contentWords = {word: freq for word, freq in frequency.iteritems() if freq < getOut}.keys()
    
    # identify the sentences in the file, stripping out words we won't keep
    getSentences = [re.sub('[^\w\s]+','',str(row)).split(' ') for row in list(data1['lemma'])]
    keepSentences = [[word for word in row if word in contentWords] for row in getSentences]
    
    # build actual semantic space
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    semantic_model = word2vec.Word2Vec(keepSentences, min_count=low_n_cutoff)

    # return all the content words and the word2vec model space
    return contentWords, semantic_model

**Note to Nick**: Here's an example of how the function above would run, given our parameters:

In [65]:
alex_BuildSemanticModel(semantic_model_input_file=INPUT_PATH + "align_concatenated_dataframe.txt")

2017-11-27 17:51:57,414 : INFO : collecting all words and their counts
2017-11-27 17:51:57,415 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-11-27 17:51:57,416 : INFO : collected 22 word types from a corpus of 316 raw words and 54 sentences
2017-11-27 17:51:57,417 : INFO : Loading a fresh vocabulary
2017-11-27 17:51:57,419 : INFO : min_count=1 retains 22 unique words (100% of original 22, drops 0)
2017-11-27 17:51:57,420 : INFO : min_count=1 leaves 316 word corpus (100% of original 316, drops 0)
2017-11-27 17:51:57,421 : INFO : deleting the raw counts dictionary of 22 items
2017-11-27 17:51:57,424 : INFO : sample=0.001 downsamples 22 most-common words
2017-11-27 17:51:57,426 : INFO : downsampling leaves estimated 51 word corpus (16.2% of prior 316)
2017-11-27 17:51:57,428 : INFO : estimated required memory for 22 words and 100 dimensions: 28600 bytes
2017-11-27 17:51:57,430 : INFO : resetting layer weights
2017-11-27 17:51:57,431 : INFO : training mode

(['just',
  'do',
  'testing',
  'some',
  'sample',
  'want',
  'need',
  'really',
  'what',
  'make',
  'to',
  'test',
  'more',
  'be',
  'we',
  'sure',
  'that',
  'but',
  'not',
  'care',
  'about',
  'this'],
 <gensim.models.word2vec.Word2Vec at 0x116490d50>)

### Calculate lexical and POS alignment scores for each n-gram length across two comparison vectors

In [66]:
def alex_LexicalPOSAlignment(tok1,lem1,penn_tok1,penn_lem1,
                             tok2,lem2,penn_tok2,penn_lem2,
                             stan_tok1=None,stan_lem1=None,
                             stan_tok2=None,stan_lem2=None,
                             ngramsLength=2,
                             ignore_duplicates=True,
                             add_stanford_tagger = 0):
    
    """
    Derive lexical and part-of-speech alignment scores
    between interlocutors (suffix `1` and `2` in arguments
    passed to function). 
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tagger=1` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider only bigrams when calculating
    similarity. If desired, this window may be expanded 
    by changing the `ngramsLength` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # create empty dictionaries for syntactic similarity
    cosine_syntax_penn_tok = {}
    cosine_syntax_penn_lex = {}
    
    # if desired, generate Stanford-based scores
    if add_stanford_tagger == 1:
        cosine_syntax_stan_tok = {}
        cosine_syntax_stan_lem = {}
    
    # create empty dictionaries for lexical similarity
    cosine_lexical_tok = {}
    cosine_lexical_lem = {}
    
    # cycle through all desired ngram lengths
    for ngram in range(2,ngramsLength+1):
         
        # calculate similarity for lexical ngrams (tokens and lemmas)
        [vectorT1, vectorT2] = alex_ngram_lexical(tok1,tok2,ngramsize=ngram)
        [vectorL1, vectorL2] = alex_ngram_lexical(lem1,lem2,ngramsize=ngram)
        cosine_lexical_tok['cosine_lexical_tok{0}'.format(ngram)] = alex_get_cosine(vectorT1,vectorT2)
        cosine_lexical_lem['cosine_lexical_lem{0}'.format(ngram)] = alex_get_cosine(vectorL1, vectorL2)

        # calculate similarity for Penn POS ngrams (tokens)
        [vector_penn_tok1, vector_penn_tok2] = alex_ngram_pos(penn_tok1,penn_tok2,
                                                ngramsize=ngram,
                                                ignore_duplicates=ignore_duplicates) 
        cosine_syntax_penn_tok['cosine_syntax_penn_tok{0}'.format(ngram)] = alex_get_cosine(vector_penn_tok1, 
                                                                                            vector_penn_tok2)
        
        # calculate similarity for Penn POS ngrams (lemmas)
        [vector_penn_lem1, vector_penn_lem2] = alex_ngram_pos(penn_lem1,penn_lem2,
                                                              ngramsize=ngram,
                                                              ignore_duplicates=ignore_duplicates) 
        cosine_syntax_penn_lex['cosine_syntax_penn_lex{0}'.format(ngram)] = alex_get_cosine(vector_penn_lem1, 
                                                                                            vector_penn_lem2) 

        # if desired, also calculate using Stanford POS
        if add_stanford_tagger == 1:         
          
            # calculate similarity for Stanford POS ngrams (tokens)
            [vector_stan_tok1, vector_stan_tok2] = alex_ngram_pos(stan_tok1,stan_tok2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            cosine_syntax_stan_tok['cosine_syntax_stan_tok{0}'.format(ngram)] = alex_get_cosine(vector_stan_tok1,
                                                                                                vector_stan_tok2)

            # calculate similarity for Stanford POS ngrams (lemmas)
            [vector_stan_lem1, vector_stan_lem2] = alex_ngram_pos(stan_lem1,stan_lem2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            cosine_syntax_stan_lem['cosine_syntax_stan_lem{0}'.format(ngram)] = alex_get_cosine(vector_stan_lem1,
                                                                                                vector_stan_lem2)
        
    # return requested information
    if add_stanford_tagger==1:
        dictionaries_list = [cosine_syntax_penn_tok, cosine_syntax_penn_lex,
                             cosine_syntax_stan_tok, cosine_syntax_stan_lem, 
                             cosine_lexical_tok, cosine_lexical_lem]      
    else:
        dictionaries_list = [cosine_syntax_penn_tok, cosine_syntax_penn_lex,
                             cosine_lexical_tok, cosine_lexical_lem]      
    return dictionaries_list

**Note to Nick**: I updated the code above to implement some of those "if desired" options we talked about in the paper. I also made it so that the code doesn't pass along Stanford taggers if that's not desired.

**Note to Nick**: It looks like the code was generating some output that included duplicates and some that didn't. I unified this here to only generate output according to the `ignore_duplicates` value.

## Generate turn-level analysis of alignment scores

Functions:

* `CombineSemanticLexPos`
    * Gets conceptual alignment scores across two comparison vectors
    * Compines with output from `LexicalPOSAlignment`
    * Adds condition info and direction of alignment between conversational partners


**Note to Nick**: I think we should slightly tweak the `CombineSemanticLexPos` function name. I'm suggesting `returnMultilevelAlignment` below. (I've added `alex_` as a prefix just to indicate that I've changed it.)

**Note to Nick**: I think we should spin out the two parts of the `CombineSemanticLexPos` function, since folks might want to use the useful conceptual alignment function outside of the containing function. To that end, I've added it below as `alex_calculateConceptualAlignment`. 

* `TurnByTurnAnalysis`
    * Builds final dataframe with all combined turn-by-turn alignment scores

In [67]:
def alex_conceptualAlignment(lem1, lem2, vocablist, highDimModel):
    
    """
    Calculate conceptual alignment scores from list of lemmas
    from between two interocutors (suffix `1` and `2` in arguments
    passed to function) using `word2vec`.
    """

    # aggregate composite high-dimensional vectors of all words in utterance
    W2Vec1 = build_composite_semantic_vector(lem1,vocablist,highDimModel)
    W2Vec2 = build_composite_semantic_vector(lem2,vocablist,highDimModel)

    # return cosine distance alignment score
    return 1 - spatial.distance.cosine(W2Vec1, W2Vec2)  

In [68]:
def alex_returnMultilevelAlignment(cond_info,
                                   partnerA, text1,tok1,lem1,penn_tok1,penn_lem1,
                                   partnerB,text2,tok2,lem2,penn_tok2,penn_lem2,
                                   vocablist, highDimModel, 
                                   stan_tok1=None,stan_lem1=None,
                                   stan_tok2=None,stan_lem2=None,
                                   add_stanford_tagger=0,
                                   ngramsLength=2, 
                                   ignore_duplicates=True):
   
    """
    Calculate lexical, syntactic, and conceptual alignment
    between a pair of turns by individual interlocutors 
    (suffix `1` and `2` in arguments passed to function), 
    including leading/following comparison directionality.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tagger=1` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider only bigrams when calculating
    similarity. If desired, this window may be expanded 
    by changing the `ngramsLength` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # create empty dictionaries 
    partner_direction = {}
    condition_info = {}
    cosine_semanticL = {}
    
    # calculate lexical and syntactic alignment
    dictionaries_list = alex_LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 ngramsLength=ngramsLength,
                                                 ignore_duplicates=ignore_duplicates,
                                                 add_stanford_tagger=add_stanford_tagger)
    
    # calculate conceptual alignment
    cosine_semanticL['cosine_semanticL'] = alex_conceptualAlignment(lem1,lem2,vocablist,highDimModel)
    dictionaries_list.append(cosine_semanticL.copy())
    
    # determine directionality of leading/following comparison
    partner_direction['partner_direction'] = str(partnerA) + ">" + str(partnerB)
    dictionaries_list.append(partner_direction.copy())

    # add condition information
    condition_info['condition_info'] = cond_info    
    dictionaries_list.append(condition_info.copy())

    # return alignment scores
    return dictionaries_list

**Note to Nick**: In the `TurnByTurnAnalysis` function, you set `maxngram`'s default to 4, but the other `ngramsLength` defaults are set to 2. Is there a reason why the defaults are different? For the sake of the package, I'd recommend we use a unified default value, even if we decide to increase that ngram size for our application of it.

FYI, I haven't changed the default value below, since I don't know whether you'd prefer to set the default to 2 or 4 throughout the function. I don't have strong feelings either way. 

However, we should probably put a flag in somewhere that the `maxngram` value only plays nicely with the rest of the code when it's equal to the minimum turn length. That is, our code doesn't work well when a turn is unable to provide at least 1 complete n-gram of the specified length.

In [69]:
def alex_TurnByTurnAnalysis(dataframe,
                            vocablist,
                            highDimModel, 
                            delay=1,
                            maxngram = 4,
                            add_stanford_tagger=0,
                            ignore_duplicates=True):    

    """
    Calculate lexical, syntactic, and conceptual alignment
    between interlocutors over an entire conversation.
    Automatically detect individual speakers by unique
    speaker codes.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 4. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tagger=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # if we don't want the Stanford tagger data, set defaults
    if add_stanford_tagger == 0:
        stan_tok1=None
        stan_lem1=None
        stan_tok2=None
        stan_lem2=None
    
    # prepare the data to the appropriate type
    dataframe['token'] = dataframe['token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['lemma'] = dataframe['lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_penn_token'] = dataframe['tagged_penn_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_penn_token'] = dataframe['tagged_penn_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
    dataframe['tagged_penn_lemma'] = dataframe['tagged_penn_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_penn_lemma'] = dataframe['tagged_penn_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
    
    # if desired, prepare the Stanford tagger data
    if add_stanford_tagger == 1:           
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086

    # create lagged version of the dataframe
    df_original = dataframe.drop(dataframe.tail(delay).index,inplace=False)
    df_lagged = dataframe.shift(-delay).drop(dataframe.tail(delay).index,inplace=False)
    
    # cycle through each pair of turns
    aggregated_df = pd.DataFrame()
    for i in range(0,df_original.shape[0]):

        # identify the condition for this dataframe
        cond_info = dataframe['file'].unique()
        if len(cond_info)==1: 
            cond_info = str(cond_info[0])
        
        # break and flag error if we have more than 1 condition per dataframe
        else: 
            raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+cond_info)

        # grab all of first participant's data
        first_row = df_original.iloc[i]
        first_partner = first_row['participant']
        text1=first_row['content']
        tok1=first_row['token']
        lem1=first_row['lemma']
        penn_tok1=first_row['tagged_penn_token']
        penn_lem1=first_row['tagged_penn_lemma']

        # grab all of lagged participant's data
        lagged_row = df_lagged.iloc[i]
        lagged_partner = lagged_row['participant']
        text2=lagged_row['content']
        tok2=lagged_row['token']
        lem2=lagged_row['lemma']
        penn_tok2=lagged_row['tagged_penn_token']
        penn_lem2=lagged_row['tagged_penn_lemma']
        
        # if desired, grab the Stanford tagger data for both participants
        if add_stanford_tagger == 1:           
            stan_tok1=first_row['tagged_stan_token']
            stan_lem1=first_row['tagged_stan_lemma']
            stan_tok2=lagged_row['tagged_stan_token']
            stan_lem2=lagged_row['tagged_stan_lemma']
                
        # process multilevel alignment
        dictionaries_list=alex_returnMultilevelAlignment(cond_info=cond_info,
                                                         partnerA=first_partner,
                                                         text1=text1,
                                                         tok1=tok1,lem1=lem1,
                                                         penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                         partnerB=lagged_partner,
                                                         text2=text2,
                                                         tok2=tok2,lem2=lem2,
                                                         penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                         vocablist=vocablist,
                                                         highDimModel=highDimModel,
                                                         stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                         stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                         ngramsLength = maxngram,
                                                         ignore_duplicates = ignore_duplicates,
                                                         add_stanford_tagger = add_stanford_tagger) 
        
        # append data to existing structures
        next_df_line = pd.DataFrame.from_dict(dict(j for i in dictionaries_list for j in i.items()),
                               orient='index').transpose()
        aggregated_df = aggregated_df.append(next_df_line)
            
    # reformat turn information and add index
    aggregated_df = aggregated_df.reset_index(drop=True).reset_index().rename(columns={"index":"time"})

    # give us our finished dataframe
    return aggregated_df

**Note to Nick**: I removed the column aggregation and merge dict list functions above.

**Note to Nick**: It doesn't look like we're doing anything with the `text` variables (above). Are you wanting to keep it just so that you have a record of the text later?  If that's all, I think it might be useful for us to do something more efficient than slicing and converting the text at each line every time -- for example, by keeping a separate file with the raw text as a dataframe and then joining it at the end with the dataframe or by cutting spinning out the raw text from the dataframe and then re-combining it later.

Generate conversation-level analysis of alignment scores
-----------------------------------------------------
* Main Functions
    * ConvoByConvoAnalysis
        * Combines each lexical and syntactic utternace turn for each participant into a single vector 
    * CombineConvoDict
        * Runs conversation-level vector through "LexicalPOSAlignment" function to get alignment scores
        * Adds condition info
        * Builds final dataframe with all combined turn-by-turn alignment scores

In [70]:
# def getEachParticipant(df):
#     """
#     Gets the unique participant identification codes
#     """
#     possPcodes = df['participant'].values
#     possPcodes2 = np.unique(possPcodes)
#     dA=df[df['participant']==possPcodes2[0]]
#     dB=df[df['participant']==possPcodes2[1]]
#     return dA, dB

**Note to Nick**: I don't really think `getEachParticipant` is neeeded, since we can use native functions for it.

In [71]:
# def alex_ConvoByConvoAnalysis(dataframe,
#                               add_stanford_tagger=0):


#     """
#     Prepare data for conversation-level analysis of 
#     similarity, given a dataframe prepared by Phase 1
#     of ALIGN.
    
#     By default, only consider Penn POS tagger
#     information. If desired, also consider Stanford
#     POS tagger with `add_stanford_tagger=1`.
#     """
    
# #     # prepare the data to the appropriate type
# #     dataframe['token'] = dataframe['token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
# #     dataframe['lemma'] = dataframe['lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
# #     dataframe['tagged_penn_token'] = dataframe['tagged_penn_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
# #     dataframe['tagged_penn_token'] = dataframe['tagged_penn_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
# #     dataframe['tagged_penn_lemma'] = dataframe['tagged_penn_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
# #     dataframe['tagged_penn_lemma'] = dataframe['tagged_penn_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
    
# #     # if desired, prepare the Stanford tagger data
# #     if add_stanford_tagger == 1:           
# #         dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
# #         dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
# #         dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
# #         dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086

#     # identify the condition for this dataframe
#     cond_info = dataframe['file'].unique()
#     if len(cond_info)==1: 
#         cond_info = str(cond_info[0])

#     # break and flag error if we have more than 1 condition per dataframe
#     else: 
#         raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+cond_info)
    
#     # concatenate the token, lemma, and POS information for this conversation
#     tok_convo = [word for turn in dataframe['token'] for word in turn]
#     lem_convo = [word for turn in dataframe['lemma'] for word in turn]
#     penn_tok_convo = [POS for turn in dataframe['tagged_penn_token'] for POS in turn]    
#     penn_lem_convo = [POS for turn in dataframe['tagged_penn_token'] for POS in turn] 

#     # if desired, also add Stanford tagger info
#     if add_stanford_tagger == 1:
#         stan_tok_convo = [POS for turn in dataframe['tagged_stan_token'] for POS in turn]    
#         stan_lem_convo = [POS for turn in dataframe['tagged_stan_lemma'] for POS in turn] 
                
#     # return desired data
#     if add_stanford_tagger==1:
#         return cond_info, tok_convo, lem_convo, penn_tok_convo, penn_lem_convo, stan_tok_convo, stan_lem_convo
#     else: 
#         return cond_info, tok_convo, lem_convo, penn_tok_convo, penn_lem_convo

In [72]:
def alex_ConvoByConvoAnalysis(dataframe,
                          ngramsLength = 2,
                          ignore_duplicates=True,
                          add_stanford_tagger = 0):

    """
    Calculate analysis of multilevel similarity over
    a conversation between two interlocutors from a 
    transcript dataframe prepared by Phase 1
    of ALIGN. Automatically detect speakers by unique
    speaker codes.
    
    By default, include maximum n-gram comparison of 2. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tagger=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # identify the condition for this dataframe
    cond_info = dataframe['file'].unique()
    if len(cond_info)==1: 
        cond_info = str(cond_info[0])
    
    # break and flag error if we have more than 1 condition per dataframe
    else: 
        raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+cond_info)
   
    # if we don't want the Stanford info, set defaults 
    if add_stanford_tagger==0:
        stan_tok1 = None
        stan_lem1 = None
        stan_tok2 = None
        stan_lem2 = None

    # identify individual interlocutors
    df_A = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[0]]
    df_B = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[1]]
   
    # concatenate the token, lemma, and POS information for participant A
    tok1 = [word for turn in df_A['token'] for word in turn]
    lem1 = [word for turn in df_A['lemma'] for word in turn]
    penn_tok1 = [POS for turn in df_A['tagged_penn_token'] for POS in turn]    
    penn_lem1 = [POS for turn in df_A['tagged_penn_token'] for POS in turn] 
    if add_stanford_tagger == 1:
        stan_tok1 = [POS for turn in df_A['tagged_stan_token'] for POS in turn]    
        stan_lem21 = [POS for turn in df_A['tagged_stan_lemma'] for POS in turn] 

    # concatenate the token, lemma, and POS information for participant B
    tok2 = [word for turn in df_B['token'] for word in turn]
    lem2 = [word for turn in df_B['lemma'] for word in turn]
    penn_tok2 = [POS for turn in df_B['tagged_penn_token'] for POS in turn]    
    penn_lem2 = [POS for turn in df_B['tagged_penn_token'] for POS in turn] 
    if add_stanford_tagger == 1:
        stan_tok2 = [POS for turn in df_B['tagged_stan_token'] for POS in turn]    
        stan_lem2 = [POS for turn in df_B['tagged_stan_lemma'] for POS in turn] 
        
    # process multilevel alignment
    dictionaries_list = alex_LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 ngramsLength=ngramsLength,
                                                 ignore_duplicates=ignore_duplicates,
                                                 add_stanford_tagger=add_stanford_tagger)
    
    # append data to existing structures
    dictionary_df = pd.DataFrame.from_dict(dict(j for i in dictionaries_list for j in i.items()),
                           orient='index').transpose()
    dictionary_df['condition_info'] = cond_info
            
    # return the dataframe
    return dictionary_df

**Note to Nick**: I think it makes more sense to collapse the `ConvoByConvoAnalysis` and `CombineConvoDict` functions. See above.

**Note to self**: We could/should probably make `convobyconvo` an optional add-on from `turnbyturn`.

## RUN Phase 2: Actual Partners

* For each prepped transcript file, runs turn-level and conversational-level alignment scores
* Saves output into single datasheet to be used in statistical analysis

In [73]:
def alex_PHASE2RUN_REAL(input_file_directory, 
                        output_file_directory,
                        semantic_model_input_file,
                        high_sd_cutoff=3,
                        low_n_cutoff=1,
                        delay=1,
                        maxngram=4,
                        ignore_duplicates=True,
                        add_stanford_tagger=0):   

    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `PHASE1RUN` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which a word will be 
    removed if they occur 1 or fewer times. if
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 4. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tagger=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # grab the files in the list
    file_list = glob.glob(input_file_directory+"*.txt")
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = alex_BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                        high_sd_cutoff=high_sd_cutoff,
                                                        low_n_cutoff=low_n_cutoff)

    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
    
    # cycle through each prepared file
    for fileName in file_list:
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print "Processing: "+fileName   

            # calculate turn-by-turn alignment scores
            xT2T=alex_TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel)
            AlignmentT2T=AlignmentT2T.append(xT2T)
            
            # calculate conversation-level alignment scores
            xC2C = alex_ConvoByConvoAnalysis(dataframe=dataframe,
                                             ngramsLength = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             add_stanford_tagger = add_stanford_tagger)
            AlignmentC2C=AlignmentC2C.append(xC2C)
        
        # if it's invalid, let us know
        else:
            print "Invalid file: "+fileName   
            
    # update final dataframes
    FINAL_TURN = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN.to_csv(output_file_directory+"AlignmentT2T.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO.to_csv(output_file_directory+"AlignmentC2C.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN, FINAL_CONVO

In [74]:
alex_PHASE2RUN_REAL(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                        output_file_directory = INPUT_PATH+ANALYSIS_READY,
                        semantic_model_input_file = INPUT_PATH+'align_concatenated_dataframe.txt',
                        high_sd_cutoff=3,
                        low_n_cutoff=1,
                        delay=1,
                        maxngram=4,
                        ignore_duplicates=True,
                        add_stanford_tagger=0)

2017-11-27 17:51:57,799 : INFO : collecting all words and their counts
2017-11-27 17:51:57,800 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-11-27 17:51:57,802 : INFO : collected 22 word types from a corpus of 316 raw words and 54 sentences
2017-11-27 17:51:57,803 : INFO : Loading a fresh vocabulary
2017-11-27 17:51:57,805 : INFO : min_count=1 retains 22 unique words (100% of original 22, drops 0)
2017-11-27 17:51:57,807 : INFO : min_count=1 leaves 316 word corpus (100% of original 316, drops 0)
2017-11-27 17:51:57,808 : INFO : deleting the raw counts dictionary of 22 items
2017-11-27 17:51:57,809 : INFO : sample=0.001 downsamples 22 most-common words
2017-11-27 17:51:57,810 : INFO : downsampling leaves estimated 51 word corpus (16.2% of prior 316)
2017-11-27 17:51:57,812 : INFO : estimated required memory for 22 words and 100 dimensions: 28600 bytes
2017-11-27 17:51:57,813 : INFO : resetting layer weights
2017-11-27 17:51:57,815 : INFO : training mode

Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_10-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_10-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_12-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_12-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_15-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_15-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_13-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/workin

(    time cosine_lexical_tok2 cosine_lexical_tok3           condition_info  \
 0      0                   0                   0  dyad_10-condition_1.txt   
 1      1          0.08703883                   0  dyad_10-condition_1.txt   
 2      2                   0                   0  dyad_10-condition_1.txt   
 3      3                   0                   0  dyad_10-condition_1.txt   
 4      4           0.1443376                   0  dyad_10-condition_1.txt   
 5      0                   0                   0  dyad_10-condition_2.txt   
 6      1           0.7071068                   0  dyad_10-condition_2.txt   
 7      2           0.2886751                   0  dyad_10-condition_2.txt   
 8      3                   0                   0  dyad_10-condition_2.txt   
 9      4                   0                   0  dyad_10-condition_2.txt   
 10     5                   0                   0  dyad_10-condition_2.txt   
 11     6          0.08703883                   0  dyad_10-condi

Generate surrogate pairings
-------------------------
* Collects all possible pairs of participants across the dyads in each condition and creates surrogate pairings by combining their conversational turns, preserving turn order. Output saved as new separate conversational transcripts. 
* Main Function:
    * GenerateSurrogate 

In [75]:
# def alex_DyadsEachCondition(filesList,
#                             id_separator = '\-',
#                             condition_label='cond'):
#     """
#     Gets the files in each unique condition category 
#     as specified by user in file name. Information must 
#     be embedded in file names as: `dyad_X-condition_Y.txt`.
    
#     By default, the separator between dyad ID and
#     condition ID is a hyphen (\-). If desired,
#     this may be changed in the `id_separator` 
#     argument.
    
#     By default, condition IDs will be identified as 
#     any characters following `cond`. If desired,
#     this may be changed with the `condition_label`
#     argument.
    
#     """
    
#     # grab the file name from the base path
#     file_info = [re.sub('\.txt','',os.path.basename(file_name)) for file_name in filesList]

#     # separate conditions from dyads
#     condition_ids = list(set([re.findall('[^'+id_separator+']*'+condition_label+'.*',metadata)[0] for metadata in file_info]))
        
#     # find all of the files in each condition
#     files_conditions = {}
#     for unique_condition in condition_ids:
#         next_condition_files = [add_file for add_file in filesList if unique_condition in add_file]
#         files_conditions[unique_condition] = next_condition_files
        
#     # return our dictionary of files and their conditions
#     return files_conditions

**Note to Nick**: I don't think we need this as a separate function -- I've incorporated it into the surrogate function.

In [76]:
# def SmallerSet(file_cond,index):
#     """
#     Option that can be run to extract smaller random set of all possible pairs within conditions
#     """
#     results = list(combinations(file_cond[index],2))
#     inEachCondition = len(file_cond[index])
#     return random.sample(results, inEachCondition) 

In [77]:
# def UniqueCodes(df):
#     """
#     Gets the unique participant identification codes
#     """
#     possPcodes = df['participant'].values
#     possPcodes2 = np.unique(possPcodes)
#     return possPcodes2

In [78]:
# def GenerateSurrogate(flist):
    
#     if ALLSURROGATE == 0:
    
    
#         ##// should clear out anything in there to start fresh- call some sort of os call through python to remove. can this be done?
# #         https://stackoverflow.com/questions/6996603/how-to-delete-a-file-or-folder
    
    
#     file_cond = DyadsEachCondition(flist)
#     for i in range(len(file_cond)):

#         ##// DEFAULT, get a smaller random set of possible pairs within conditions (same number as original dataset)
#         PairConvs=SmallerSet(file_cond, i)
        
#         ##// OPTIONAL: all possible dyad pairs within condition
#         if ALLSURROGATE == 1:
#             PairConvs=combinations(file_cond[i],2)
            
#         for d in PairConvs:
#             df1=pd.read_csv(INPUT_PATH + PREPPED_TRANSCRIPTS + d[0], sep='\t',encoding='utf-8')
#             df2=pd.read_csv(INPUT_PATH + PREPPED_TRANSCRIPTS + d[1], sep='\t',encoding='utf-8')

#             columns = list(df1)
#             columns = columns[0:]

#             partnerA = UniqueCodes(df1)[0]
#             partnerB = UniqueCodes(df1)[1]

#             ### identify how many turn A and B have
#             dA1=df1[df1['participant']==partnerA].reset_index()
#             TurnA1=len(dA1)
#             dA2=df2[df2['participant']==partnerA].reset_index()
#             TurnA2=len(dA2)
#             dB1=df1[df1['participant']==partnerB].reset_index()
#             TurnB1=len(dB1)
#             dB2=df2[df2['participant']==partnerB].reset_index()
#             TurnB2=len(dB2)
#             Turn1=min([TurnA1,TurnB2])
#             Turn2=min([TurnA2,TurnB1])        

#             index1 = np.arange(Turn1)
#             index2 = np.arange(Turn2)       

#             SurrInt1 = pd.DataFrame(columns=columns, index = range(Turn1*2))
#             SurrInt2 = pd.DataFrame(columns=columns, index = range(Turn2*2))
            
#             n=0
#             for t in range(0,Turn1):
#                 SurrInt1.iloc[n]=dA1.iloc[t]
#                 n+=1
#                 SurrInt1.iloc[n]=dB2.iloc[t]
#                 n+=1
#             SurrInt1 = SurrInt1.dropna(thresh=2)

#             n=0   
#             for t in range(0,Turn2):
#                 SurrInt2.iloc[n]=dA2.iloc[t]
#                 n+=1
#                 SurrInt2.iloc[n]=dB1.iloc[t]
#                 n+=1
#             SurrInt2 = SurrInt2.dropna(thresh=2)        

#             SurrInt1['file'] = d[0] + '-' + d[1]
#             SurrInt2['file'] = d[0] + '-' + d[1]
            
#             name1=u'SurrogatePair_'+unicode(d[0])+u'A'+'_'+unicode(d[1])+u'B.txt'
#             name2=u'SurrogatePair_'+unicode(d[1])+u'A'+'_'+unicode(d[0])+u'B.txt'   
            
#             SurrInt1.to_csv(INPUT_PATH + SURROGATE_TRANSCRIPTS + name1, encoding='utf-8',index=False,sep='\t')
#             SurrInt2.to_csv(INPUT_PATH + SURROGATE_TRANSCRIPTS + name2, encoding='utf-8',index=False,sep='\t')
#             SurrInt1=None
#             SurrInt2=None  

In [79]:
def alex_GenerateSurrogate(original_conversation_list,
                           surrogate_file_directory,
                           all_surrogates = False,
                           id_separator = '\-',
                           dyad_label='dyad',
                           condition_label='cond',
                           keep_original_turn_order = False):
    
    """
    Create transcripts for surrogate pairs of 
    participants (i.e., participants who did not 
    genuinely interact in the experiment), which
    will later be used to generate baseline levels 
    of alignment. Store surrogate files in a new
    folder each time the surrogate generation is run.
    
    Returns a list of all surrogate files created.

    By default, the separator between dyad ID and
    condition ID is a hyphen ('\-'). If desired,
    this may be changed in the `id_separator` 
    argument.

    By default, condition IDs will be identified as 
    any characters following `cond`. If desired,
    this may be changed with the `condition_label`
    argument.
    
    By default, dyad IDs will be identified as 
    any characters following `dyad`. If desired,
    this may be changed with the `dyad_label`
    argument.
    
    By default, generate surrogates only from a subset
    of all possible pairings. If desired, instead 
    generate surrogates from all possible pairings
    with `all_surrogates=True`.
    
    By default, create surrogates by shuffling all
    turns within each surrogate partner's data. If 
    desired, retain the original ordering of each
    surrogate partner's data with 
    `keep_original_turn_order = True`.
    """
        
    # create a subfolder for the new set of surrogates
    import time
    new_surrogate_path = surrogate_file_directory + 'surrogate_run-' + str(time.time()) +'/'
    if not os.path.exists(new_surrogate_path):
        os.makedirs(new_surrogate_path)
        
    # grab condition types from each file name
    file_info = [re.sub('\.txt','',os.path.basename(file_name)) for file_name in original_conversation_list]
    condition_ids = list(set([re.findall('[^'+id_separator+']*'+condition_label+'.*',metadata)[0] for metadata in file_info]))
    files_conditions = {}
    for unique_condition in condition_ids:
        next_condition_files = [add_file for add_file in original_conversation_list if unique_condition in add_file]
        files_conditions[unique_condition] = next_condition_files
    
    # cycle through conditions
    for condition in files_conditions.keys():
        
        # grab all possible pairs of conversations of this condition
        paired_surrogates = [pair for pair in combinations(files_conditions[condition],2)]
        
        # default: randomly pull from all pairs to get target surrogate sample
        if all_surrogates == False:
            import math
            paired_surrogates = random.sample(paired_surrogates, 
                                              int(math.ceil(len(files_conditions[condition])/2)))
            
        # cycle through surrogate pairings
        for next_surrogate in paired_surrogates:
            
            # read in the files
            original_file1 = os.path.basename(next_surrogate[0])
            original_file2 = os.path.basename(next_surrogate[1])
            original_df1=pd.read_csv(next_surrogate[0], sep='\t',encoding='utf-8')
            original_df2=pd.read_csv(next_surrogate[1], sep='\t',encoding='utf-8')
            
            # get participants A and B from df1
            participantA_1_code = min(original_df1['participant'].unique())
            participantB_1_code = max(original_df1['participant'].unique())
            participantA_1 = original_df1[original_df1['participant'] == participantA_1_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_1 = original_df1[original_df1['participant'] == participantB_1_code].reset_index().rename(columns={'file': 'original_file'})
            
            # get participants A and B from df2
            participantA_2_code = min(original_df2['participant'].unique())
            participantB_2_code = max(original_df2['participant'].unique())
            participantA_2 = original_df2[original_df2['participant'] == participantA_2_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_2 = original_df2[original_df2['participant'] == participantB_2_code].reset_index().rename(columns={'file': 'original_file'})
            
            # identify truncation point for both surrogates (to have even number of turns)
            surrogateX_turns=min([participantA_1.shape[0],
                                  participantB_2.shape[0]])
            surrogateY_turns=min([participantA_2.shape[0],
                                  participantB_1.shape[0]])
            
            # if desired, preserve original turn order for surrogate pairs
            if keep_original_turn_order == True:
                surrogateX = participantA_1.truncate(after=surrogateX_turns-1,copy=False).append(
                                participantB_2.truncate(after=surrogateX_turns-1,copy=False)).sort(
                                ['index']).reset_index(drop=True).rename(columns={'index': 'original_index'})
                surrogateY = participantA_2.truncate(after=surrogateX_turns-1,copy=False).append(
                                participantB_1.truncate(after=surrogateX_turns-1,copy=False)).sort(
                                ['index']).reset_index(drop=True).rename(columns={'index': 'original_index'})
            
            # otherwise, just shuffle all turns within participants
            else:
                
                # shuffle for first surrogate pairing
                surrogateX_A1 = participantA_1.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX_B2 = participantB_2.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX = surrogateX_A1.append(surrogateX_B2).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})
                
                # and for second surrogate pairing
                surrogateY_A2 = participantA_2.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY_B1 = participantB_1.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY = surrogateY_A2.append(surrogateY_B1).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})

            # create filename for our surrogate file
            original_dyad1 = re.findall(dyad_label+'[^'+id_separator+']*',original_file1)[0]
            original_dyad2 = re.findall(dyad_label+'[^'+id_separator+']*',original_file2)[0]
            surrogateX['file'] = condition + '-' + original_dyad1 + '-' + original_dyad2
            surrogateY['file'] = condition + '-' + original_dyad1 + '-' + original_dyad2
            nameX='SurrogatePair-'+original_dyad1+'A'+'-'+original_dyad2+'B'+'-'+condition+'.txt'
            nameY='SurrogatePair-'+original_dyad2+'A'+'-'+original_dyad1+'B'+'-'+condition+'.txt'
            
            # save to file
            surrogateX.to_csv(new_surrogate_path + nameX, encoding='utf-8',index=False,sep='\t')
            surrogateY.to_csv(new_surrogate_path + nameY, encoding='utf-8',index=False,sep='\t')
            
    # return list of all surrogate files
    return glob.glob(new_surrogate_path+"*.txt")

**Note to Nick**: I'm not sure what the `if` statement in line 3 is trying to do, since it seems like clearing out the surrogates could be handled whether or not people wanted to use all possible surrogates or just a subset.

**Note to Nick**: With the surrogates, we're shuffling all of the turns within each participant, right?

**Note to Nick**: This was an interesting choice to me. I'd thought we were independently sampling participants across all dyads for surrogates, but what we're actually doing is randomly pairing dyads and then virtually swapping the partners within it. Perhaps I missed it when reading the paper earlier, but it might be good for us to beef up this explanation.

RUN Phase 2: Surrogate Partners
-------------------------------
* Runs function to generate new surrogate transcript conversations (separate files)
* For each surrogate transcript file, runs turn-level and conversational-level alignment scores
* Saves output into single datasheet to be used in statistical analysis

In [80]:
def alex_PHASE2RUN_SURROGATE(input_file_directory, 
                             surrogate_file_directory,
                             output_file_directory,
                             semantic_model_input_file,
                             high_sd_cutoff=3,
                             low_n_cutoff=1,
                             id_separator = '\-',
                             condition_label='cond',
                             dyad_label='dyad',
                             all_surrogates=False,
                             keep_original_turn_order = False,
                             delay=1,
                             maxngram=4,
                             ignore_duplicates=True,
                             add_stanford_tagger=0):   
    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `PHASE1RUN` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics
    for surrogate baseline conversations.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which a word will be 
    removed if they occur 1 or fewer times. if
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 4. If
    desired, this may be changed by passing the appropriate
    value to the the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `add_stanford_tagger=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    
    By default, the separator between dyad ID and
    condition ID in each file name is a hyphen ('\-'). 
    If desired, this may be changed with the 
    `id_separator` argument.

    By default, condition IDs in each file name
    will be identified as any characters following 
    `cond`. If desired, this may be changed with the 
    `condition_label` argument.
    
    By default, dyad IDs in each file name
    will be identified as any characters following 
    `dyad`. If desired, this may be changed with the 
    `dyad_label` argument.
    
    By default, generate surrogates only from a subset
    of all possible pairings. If desired, instead 
    generate surrogates from all possible pairings
    with `all_surrogates=True`
    """
    
    # grab the files in the input list
    file_list = glob.glob(input_file_directory+"*.txt")
    surrogate_file_list = alex_GenerateSurrogate(original_conversation_list = file_list,
                                                   surrogate_file_directory = surrogate_file_directory,
                                                   all_surrogates = all_surrogates,
                                                   id_separator = id_separator,
                                                   condition_label = condition_label,
                                                   dyad_label = dyad_label,
                                                   keep_original_turn_order = keep_original_turn_order) 
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = alex_BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                        high_sd_cutoff=high_sd_cutoff,
                                                        low_n_cutoff=low_n_cutoff)
    
    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
    
    # cycle through the files
    for fileName in surrogate_file_list:
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print "Processing: "+fileName   

            # calculate turn-by-turn alignment scores
            xT2T=alex_TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel)
            AlignmentT2T=AlignmentT2T.append(xT2T)
            
            # calculate conversation-level alignment scores
            xC2C = alex_ConvoByConvoAnalysis(dataframe=dataframe,
                                             ngramsLength = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             add_stanford_tagger = add_stanford_tagger)
            AlignmentC2C=AlignmentC2C.append(xC2C)
        
        # if it's invalid, let us know
        else:
            print "Invalid file: "+fileName   
            
    # update final dataframes
    FINAL_TURN_SURROGATE = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO_SURROGATE = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN_SURROGATE.to_csv(output_file_directory+"AlignmentT2T_Surrogate.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO_SURROGATE.to_csv(output_file_directory+"AlignmentC2C_Surrogate.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN_SURROGATE, FINAL_CONVO_SURROGATE

In [81]:
[turn_surrogate,convo_surrogate] = alex_PHASE2RUN_SURROGATE(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                             surrogate_file_directory= INPUT_PATH+SURROGATE_TRANSCRIPTS,
                             output_file_directory= INPUT_PATH+ANALYSIS_READY,
                             semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt',
                             high_sd_cutoff=3,
                             low_n_cutoff=1,
                             id_separator = '\-',
                             condition_label='cond',
                             dyad_label='dyad',
                             all_surrogates=False,
                             keep_original_turn_order = False,
                             delay=1,
                             maxngram=4,
                             ignore_duplicates=True,
                             add_stanford_tagger=0)

2017-11-27 17:51:58,645 : INFO : collecting all words and their counts
2017-11-27 17:51:58,647 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-11-27 17:51:58,648 : INFO : collected 22 word types from a corpus of 316 raw words and 54 sentences
2017-11-27 17:51:58,649 : INFO : Loading a fresh vocabulary
2017-11-27 17:51:58,650 : INFO : min_count=1 retains 22 unique words (100% of original 22, drops 0)
2017-11-27 17:51:58,652 : INFO : min_count=1 leaves 316 word corpus (100% of original 316, drops 0)
2017-11-27 17:51:58,654 : INFO : deleting the raw counts dictionary of 22 items
2017-11-27 17:51:58,655 : INFO : sample=0.001 downsamples 22 most-common words
2017-11-27 17:51:58,657 : INFO : downsampling leaves estimated 51 word corpus (16.2% of prior 316)
2017-11-27 17:51:58,658 : INFO : estimated required memory for 22 words and 100 dimensions: 28600 bytes
2017-11-27 17:51:58,660 : INFO : resetting layer weights
2017-11-27 17:51:58,661 : INFO : training mode

Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511833918.44/SurrogatePair-dyad_13A-dyad_10B-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511833918.44/SurrogatePair-dyad_10A-dyad_12B-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511833918.44/SurrogatePair-dyad_15A-dyad_12B-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511833918.44/SurrogatePair-dyad_10A-dyad_13B-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511833918.44/SurrogatePair-dyad_12A-dyad_15B-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/wo

# Run everything!

## Phase 1: Prep

In [90]:
import time
start_phase1 = time.time()

In [91]:
model_store = alex_PHASE1RUN(input_file_directory=INPUT_PATH+TRANSCRIPTS,
                      output_file_directory=INPUT_PATH+PREPPED_TRANSCRIPTS,
                      training_dictionary=INPUT_PATH+'big.txt')

Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_10-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_10-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_12-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_12-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_15-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_15-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-original/dyad_13-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment

## Phase 2: Real

In [92]:
start_phase2real = time.time()

In [93]:
[turn_real,convo_real]=alex_PHASE2RUN_REAL(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                        output_file_directory = INPUT_PATH+ANALYSIS_READY,
                        semantic_model_input_file = INPUT_PATH+'align_concatenated_dataframe.txt',
                        high_sd_cutoff=3,
                        low_n_cutoff=1,
                        delay=1,
                        maxngram=4,
                        ignore_duplicates=True,
                        add_stanford_tagger=0)

2017-11-27 17:57:37,564 : INFO : collecting all words and their counts
2017-11-27 17:57:37,566 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-11-27 17:57:37,567 : INFO : collected 22 word types from a corpus of 316 raw words and 54 sentences
2017-11-27 17:57:37,568 : INFO : Loading a fresh vocabulary
2017-11-27 17:57:37,570 : INFO : min_count=1 retains 22 unique words (100% of original 22, drops 0)
2017-11-27 17:57:37,572 : INFO : min_count=1 leaves 316 word corpus (100% of original 316, drops 0)
2017-11-27 17:57:37,573 : INFO : deleting the raw counts dictionary of 22 items
2017-11-27 17:57:37,575 : INFO : sample=0.001 downsamples 22 most-common words
2017-11-27 17:57:37,577 : INFO : downsampling leaves estimated 51 word corpus (16.2% of prior 316)
2017-11-27 17:57:37,579 : INFO : estimated required memory for 22 words and 100 dimensions: 28600 bytes
2017-11-27 17:57:37,581 : INFO : resetting layer weights
2017-11-27 17:57:37,583 : INFO : training mode

Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_10-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_10-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_12-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_12-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_15-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_15-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-prepped/dyad_13-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/workin

## Phase 2: Surrogate

In [94]:
start_phase2surrogate = time.time()

In [95]:
[turn_surrogate,convo_surrogate] = alex_PHASE2RUN_SURROGATE(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                             surrogate_file_directory= INPUT_PATH+SURROGATE_TRANSCRIPTS,
                             output_file_directory= INPUT_PATH+ANALYSIS_READY,
                             semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt',
                             high_sd_cutoff=3,
                             low_n_cutoff=1,
                             id_separator = '\-',
                             condition_label='cond',
                             dyad_label='dyad',
                             all_surrogates=False,
                             keep_original_turn_order = False,
                             delay=1,
                             maxngram=4,
                             ignore_duplicates=True,
                             add_stanford_tagger=0)

2017-11-27 17:57:38,102 : INFO : collecting all words and their counts
2017-11-27 17:57:38,103 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-11-27 17:57:38,105 : INFO : collected 22 word types from a corpus of 316 raw words and 54 sentences
2017-11-27 17:57:38,106 : INFO : Loading a fresh vocabulary
2017-11-27 17:57:38,107 : INFO : min_count=1 retains 22 unique words (100% of original 22, drops 0)
2017-11-27 17:57:38,108 : INFO : min_count=1 leaves 316 word corpus (100% of original 316, drops 0)
2017-11-27 17:57:38,110 : INFO : deleting the raw counts dictionary of 22 items
2017-11-27 17:57:38,111 : INFO : sample=0.001 downsamples 22 most-common words
2017-11-27 17:57:38,112 : INFO : downsampling leaves estimated 51 word corpus (16.2% of prior 316)
2017-11-27 17:57:38,114 : INFO : estimated required memory for 22 words and 100 dimensions: 28600 bytes
2017-11-27 17:57:38,115 : INFO : resetting layer weights
2017-11-27 17:57:38,116 : INFO : training mode

Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511834257.9/SurrogatePair-dyad_10A-dyad_12B-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511834257.9/SurrogatePair-dyad_15A-dyad_12B-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511834257.9/SurrogatePair-dyad_15A-dyad_12B-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511834257.9/SurrogatePair-dyad_12A-dyad_15B-condition_2.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working/ALIGN_NOTEBOOK_CLEAN/toy_data-surrogate/surrogate_run-1511834257.9/SurrogatePair-dyad_12A-dyad_15B-condition_1.txt
Processing: /Users/alexandra/Dropbox/DA/DA_linguisticAlignment/working

In [96]:
end=time.time()

## Speed calculations

Phase 1 time:

In [97]:
start_phase2real - start_phase1

30.144092082977295

Phase 2 real time:

In [98]:
start_phase2surrogate - start_phase2real

0.33298707008361816

Phase 2 surrogate time:

In [99]:
end - start_phase2surrogate

0.48375582695007324

All 3 phases:

In [102]:
end - start_phase1

30.960834980010986

## Printouts!

In [84]:
turn_real.head(10)

Unnamed: 0,time,cosine_lexical_tok2,cosine_lexical_tok3,condition_info,cosine_lexical_lem4,cosine_lexical_tok4,cosine_lexical_lem3,cosine_syntax_penn_tok4,cosine_syntax_penn_tok3,cosine_syntax_penn_tok2,cosine_syntax_penn_lex2,cosine_syntax_penn_lex3,cosine_syntax_penn_lex4,cosine_semanticL,partner_direction,cosine_lexical_lem2
0,0,0.0,0,dyad_10-condition_1.txt,0,0,0,0,0,0,0,0,0,0.571742,1>2.0,0.0
1,1,0.08703883,0,dyad_10-condition_1.txt,0,0,0,0,0,0,0,0,0,0.6656333,2>1.0,0.08703883
2,2,0.0,0,dyad_10-condition_1.txt,0,0,0,0,0,0,0,0,0,0.3349915,1>2.0,0.0
3,3,0.0,0,dyad_10-condition_1.txt,0,0,0,0,0,0,0,0,0,0.40693,2>1.0,0.0
4,4,0.1443376,0,dyad_10-condition_1.txt,0,0,0,0,0,0,0,0,0,0.4127169,1>2.0,0.1443376
5,0,0.0,0,dyad_10-condition_2.txt,0,0,0,0,0,0,0,0,0,0.3651285,2>1.0,0.0
6,1,0.7071068,0,dyad_10-condition_2.txt,0,0,0,0,0,0,0,0,0,1.0,1>2.0,0.7071068
7,2,0.2886751,0,dyad_10-condition_2.txt,0,0,0,0,0,0,0,0,0,0.658746,2>1.0,0.2886751
8,3,0.0,0,dyad_10-condition_2.txt,0,0,0,0,0,0,0,0,0,0.1233059,1>2.0,0.0
9,4,0.0,0,dyad_10-condition_2.txt,0,0,0,0,0,0,0,0,0,0.3596222,2>1.0,0.0


In [85]:
convo_real.head(10)

Unnamed: 0,cosine_lexical_tok2,cosine_lexical_tok3,cosine_lexical_lem4,cosine_lexical_tok4,cosine_lexical_lem3,cosine_syntax_penn_tok4,cosine_syntax_penn_tok3,cosine_syntax_penn_tok2,cosine_syntax_penn_lex2,cosine_syntax_penn_lex3,cosine_syntax_penn_lex4,cosine_lexical_lem2,condition_info
0,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,dyad_10-condition_1.txt
1,0.113228,0,0,0,0,0,0,0,0,0,0,0.109254,dyad_10-condition_2.txt
2,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,dyad_12-condition_1.txt
3,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,dyad_12-condition_2.txt
4,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,dyad_15-condition_2.txt
5,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,dyad_15-condition_1.txt
6,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,dyad_13-condition_1.txt
7,0.113228,0,0,0,0,0,0,0,0,0,0,0.109254,dyad_13-condition_2.txt


In [87]:
turn_surrogate.head(10)

Unnamed: 0,time,cosine_lexical_tok2,cosine_lexical_tok3,condition_info,cosine_lexical_lem4,cosine_lexical_tok4,cosine_lexical_lem3,cosine_syntax_penn_tok4,cosine_syntax_penn_tok3,cosine_syntax_penn_tok2,cosine_syntax_penn_lex2,cosine_syntax_penn_lex3,cosine_syntax_penn_lex4,cosine_semanticL,partner_direction,cosine_lexical_lem2
0,0,0.0,0,condition_2-dyad_15-dyad_13,0,0,0,0,0,0,0,0,0,0.1233059,1>2.0,0.0
1,1,0.0,0,condition_2-dyad_15-dyad_13,0,0,0,0,0,0,0,0,0,0.3596222,2>1.0,0.0
2,2,0.0,0,condition_2-dyad_15-dyad_13,0,0,0,0,0,0,0,0,0,-0.03504958,1>2.0,0.0
3,3,0.0,0,condition_2-dyad_15-dyad_13,0,0,0,0,0,0,0,0,0,-0.005377397,2>1.0,0.0
4,4,0.0,0,condition_2-dyad_15-dyad_13,0,0,0,0,0,0,0,0,0,0.3349915,1>2.0,0.0
5,0,0.0,0,condition_1-dyad_12-dyad_13,0,0,0,0,0,0,0,0,0,0.3349915,1>2.0,0.0
6,1,0.0,0,condition_1-dyad_12-dyad_13,0,0,0,0,0,0,0,0,0,0.3692705,2>1.0,0.0
7,2,0.0,0,condition_1-dyad_12-dyad_13,0,0,0,0,0,0,0,0,0,0.2776606,1>2.0,0.0
8,3,0.1443376,0,condition_1-dyad_12-dyad_13,0,0,0,0,0,0,0,0,0,0.4127169,2>1.0,0.1443376
9,4,0.0,0,condition_1-dyad_12-dyad_13,0,0,0,0,0,0,0,0,0,0.353934,1>2.0,0.0


In [88]:
convo_surrogate.head(10)

Unnamed: 0,cosine_lexical_tok2,cosine_lexical_tok3,cosine_lexical_lem4,cosine_lexical_tok4,cosine_lexical_lem3,cosine_syntax_penn_tok4,cosine_syntax_penn_tok3,cosine_syntax_penn_tok2,cosine_syntax_penn_lex2,cosine_syntax_penn_lex3,cosine_syntax_penn_lex4,cosine_lexical_lem2,condition_info
0,0.057831,0,0,0,0,0,0,0,0,0,0,0.05547,condition_2-dyad_15-dyad_13
1,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,condition_1-dyad_12-dyad_13
2,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,condition_2-dyad_12-dyad_15
3,0.101274,0,0,0,0,0,0,0,0,0,0,0.101274,condition_2-dyad_15-dyad_13
4,0.057831,0,0,0,0,0,0,0,0,0,0,0.05547,condition_1-dyad_12-dyad_13
5,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,condition_2-dyad_12-dyad_15
6,0.057831,0,0,0,0,0,0,0,0,0,0,0.05547,condition_1-dyad_10-dyad_15
7,0.081786,0,0,0,0,0,0,0,0,0,0,0.078446,condition_1-dyad_10-dyad_15
