# ALIGN Example Notebook: CHILDES Data

This notebook provides an introduction to **ALIGN**, a tool for quantifying multi-level linguistic similarity between speakers. To do so, we use the data from the Kuczaj Corpus (https://childes.talkbank.org/access/Eng-NA/Kuczaj.html).

This method was introduced in "ALIGN: Analyzing Linguistic Interactions with Generalizable techNiques" (Duran, Paxton, & Fusaroli, *submitted*).

***

**Table of Contents**:

* [Getting Started](#Getting-Started)
    * [Prerequisites](#Prerequisites)
    * [Preparing input data](#Preparing-input-data)
    * [Filename conventions](#Filename-conventions)
    * [User-specified parameters](#User-specified-parameters)
    * [Highest-level functions](#Highest-level-functions)
* [Setup](#Setup)
    * [Import libraries](#Import-libraries)
    * [Specify global ALIGN settings](#Specify-global-ALIGN-settings)
* [Run everything!](#Run-everything!)
    * [Phase 1: Prep](#Phase-1:-Prep)
    * [Phase 2: Real](#Phase-2:-Real)
    * [Phase 2: Surrogate](#Phase-2:-Surrogate)
    * [Speed calculations](#Speed-calculations)
    * [Printouts!](#Printouts!)

***

# Getting Started

### Preparing input data

* Each input text file needs to contain a single conversation organized in an `N x 2` matrix
    * Text file must be tab-delimited.
* Each row must correspond to a single conversational turn from a speaker.
    * Rows must be temporally ordered based on their occurrence in the conversation.
    * Rows must alternate between speakers.
* Speaker identifier and content for each turn are divided across two columns.
    * Column 1 must have the header `participant`.
        * Each cell specifies the speaker.
        * Each speaker must have a unique label (e.g., `P1` and `P2`, `0` and `1`).
    * Column 2 must have the header `content`.
        * Each cell corresponds to the transcribed utterance from the speaker.
        * Each cell must end with a newline character: `\n`
* See the folder `examples > toy_data-original` in the GitHub repository for an example.
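
As a minimal sketch of the format described above, the following writes and re-reads a small two-speaker transcript using only Python's standard library. The utterances are invented for illustration, and the final checks are a hypothetical sanity test, not part of ALIGN itself.

```python
import csv
import os
import tempfile

# Invented example turns: speakers alternate and rows are in temporal order.
turns = [
    ("P1", "do you want to read a book"),
    ("P2", "I want the big book"),
    ("P1", "okay we can read the big book"),
    ("P2", "the book about the dog"),
]

path = os.path.join(tempfile.mkdtemp(), "dyad1-condA.txt")

# Write a tab-delimited N x 2 file with the required column headers.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(["participant", "content"])
    writer.writerows(turns)

# Read the file back and check the basic requirements listed above.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

header, body = rows[0], rows[1:]
speakers = [row[0] for row in body]
alternates = all(a != b for a, b in zip(speakers, speakers[1:]))
print(header, len(body), alternates)
```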

### Filename conventions

* Each conversation text file must follow a regular naming pattern: a dyad prefix followed by the dyad identifier, then a separator character, then a condition prefix followed by the condition identifier. By default, ALIGN looks for filenames that follow this convention: `dyad1-condA.txt`
    * However, users may choose any label for dyad or condition, so long as the two labels are distinct from one another and neither is a subset of any possible dyad or condition label. Users may also use any character as a separator, so long as it does not occur anywhere else in the filename.
    * The chosen file format **must** be used when saving **all** files for this analysis.
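
To make the convention concrete, here is a small sketch that splits a filename into its dyad and condition identifiers. The helper `parse_filename` is hypothetical and written only for this illustration (it is not an ALIGN function), and it assumes a single-character separator and a `.txt` extension.

```python
import re

def parse_filename(filename, dyad_label="dyad", condition_label="cond", id_separator="-"):
    """Split a transcript filename into (dyad identifier, condition identifier)."""
    sep = re.escape(id_separator)
    pattern = (
        rf"^{re.escape(dyad_label)}(?P<dyad>[^{sep}]+)"
        rf"{sep}"
        rf"{re.escape(condition_label)}(?P<condition>.+)\.txt$"
    )
    match = re.match(pattern, filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the expected convention")
    return match.group("dyad"), match.group("condition")

# Default labels and separator.
print(parse_filename("dyad1-condA.txt"))

# Custom dyad label, as in the CHILDES filenames used later in this notebook.
print(parse_filename("time197-cond1.txt", dyad_label="time"))
```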

### Highest-level functions

Given appropriately prepared transcript files, ALIGN can be run in 3 high-level functions:

`prepare_transcripts`

* Pre-process each standardized conversation, checking it conforms to the requirements.
* Each utterance is tokenized and lemmatized and has POS tags added.

`calculate_alignment`

* Generates turn-level and conversation-level alignment scores (lexical, conceptual, and syntactic) across a range of n-gram sequences
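
This notebook does not spell out how the scores are computed. As a rough intuition only, the sketch below assumes that alignment between two adjacent turns can be measured as the cosine similarity between their n-gram count vectors. The function names and example turns are invented, and ALIGN's actual implementation may differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine(counts_a, counts_b):
    """Cosine similarity between two n-gram count dictionaries."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[g] * counts_b[g] for g in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two invented adjacent turns that share two of their four bigrams.
turn_a = "i want the big book".split()
turn_b = "you want the big dog".split()
score = cosine(Counter(ngrams(turn_a, 2)), Counter(ngrams(turn_b, 2)))
print(score)  # 0.5
```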

`calculate_baseline_alignment`

* Generates a surrogate corpus.
* Runs analysis (using identical specifications from `calculate_alignment`) on the surrogate corpus.

***

# Setup

## Import libraries

In [None]:
import os  # used below to build paths from the working directory

import align

## Specify global ALIGN settings

For purposes of demonstrating ALIGN, the directory and folder pathnames correspond to data provided in the GitHub repository associated with this notebook. The default option is set to analyze conversations from a single English corpus from the CHILDES database (MacWhinney, 2000), specifically Kuczaj's Abe corpus (Kuczaj, 1976). Here, only the last 20 conversations are evaluated. Analysis is based on default settings unless otherwise indicated.

### Directories and folders

**`INPUT_PATH`**: Set working directory, in which all notebook and supporting files are located.

In [51]:
INPUT_PATH = os.getcwd() + '/'

**`TRANSCRIPTS`**: Set variable for folder name (as string) for relative location of folder containing the original transcript files.

In [52]:
TRANSCRIPTS = INPUT_PATH + 'examples/CHILDES/childes-original/'

**`STANFORD_POS_PATH`**: Path to Stanford POS tagger files.

In [53]:
STANFORD_POS_PATH = INPUT_PATH + 'package_files/stanford-postagger-full-2017-06-09/'

**`STANFORD_LANGUAGE`**: If using the Stanford tagger, set the language model to be used for POS tagging.

In [54]:
STANFORD_LANGUAGE = 'models/english-left3words-distsim.tagger'

**`PREPPED_TRANSCRIPTS`**: Set variable for folder name (as string) for relative location of folder into which prepared transcript files will be saved.

In [55]:
PREPPED_TRANSCRIPTS = INPUT_PATH + 'examples/CHILDES/childes-prepped/'

**`ANALYSIS_READY`**: Set variable for folder name (as string) for relative location of folder into which analysis-ready dataframe files will be saved.

In [56]:
ANALYSIS_READY = INPUT_PATH + 'examples/CHILDES/childes-analysis/'

**`SURROGATE_TRANSCRIPTS`**: Set variable for folder name (as string) for relative location of folder into which all prepared surrogate transcript files will be saved.

In [57]:
SURROGATE_TRANSCRIPTS = INPUT_PATH + 'examples/CHILDES/childes-surrogate/'

### Analysis settings

`MAXNGRAM`: Set maximum size for n-gram chunking.

* Default: 2

In [59]:
MAXNGRAM = 2

`MINWORDS`: Set minimum number of words for each turn.

* Default: 2

**Note**: The minimum number of words must be at least as large as the maximum *n*-gram size (`MAXNGRAM` above).

In [60]:
MINWORDS = 2
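
The reason for this constraint is that a turn with fewer words than the n-gram size contains no n-grams of that size at all. The following sketch illustrates this; `count_ngrams` and `check_settings` are hypothetical helpers written for this example, not ALIGN functions.

```python
def count_ngrams(tokens, n):
    """Number of contiguous n-grams available in a list of tokens."""
    return max(len(tokens) - n + 1, 0)

def check_settings(minwords, maxngram):
    """Raise an error if turns could be too short for the largest n-gram."""
    if minwords < maxngram:
        raise ValueError(
            f"MINWORDS ({minwords}) must be at least MAXNGRAM ({maxngram})"
        )

MAXNGRAM = 2
MINWORDS = 2
check_settings(MINWORDS, MAXNGRAM)  # passes silently for the defaults

one_word_turn = "yes".split()
two_word_turn = "yes please".split()
print(count_ngrams(one_word_turn, MAXNGRAM))  # 0
print(count_ngrams(two_word_turn, MAXNGRAM))  # 1
```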

`ADD_STANFORD_TAGS`: Choose POS tagger. 

* Default: `False`
    * Run NLTK default POS tagger (NLTK 3.1+): `averaged_perceptron_tagger`
* Option: `True`
    * Run both NLTK default POS tagger and Stanford POS tagger. Note: Adding the Stanford POS tagger will lead to an increase in processing time. 

In [61]:
ADD_STANFORD_TAGS = False

`DELAY`: Set max delay between partner's turns when generating alignment score.

* Currently, the only acceptable value is 1 (i.e., contiguous turns).

In [62]:
DELAY = 1

`USE_FILLER_LIST`: Choose method for removing speech fillers. 

* Default: `None`
    * Does not provide additional speech fillers to be removed.
* Option: list of strings
    * Provide a list of literal strings to be removed from the transcripts.

In [63]:
USE_FILLER_LIST = None
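
To illustrate what passing a list of literal strings might accomplish, here is a minimal sketch of filler removal. The function `remove_fillers` is hypothetical, and it assumes whole-word, case-insensitive matching; how ALIGN itself matches fillers is not stated in this notebook.

```python
import re

def remove_fillers(utterance, filler_list):
    """Remove each literal filler (matched as a whole word) from an utterance."""
    if not filler_list:
        return utterance
    # Escape each filler so it is treated as a literal string, not a regex.
    alternatives = "|".join(re.escape(filler) for filler in filler_list)
    pattern = r"\b(?:" + alternatives + r")\b"
    cleaned = re.sub(pattern, " ", utterance, flags=re.IGNORECASE)
    # Collapse any whitespace left behind by the removals.
    return " ".join(cleaned.split())

print(remove_fillers("um I think uh we should go", ["um", "uh"]))
# I think we should go
```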

`IGNORE_DUPLICATES`: Choose whether to remove duplicate lexical bigrams when computing syntactic alignment.

* Default: `True`
    * Removes duplicate lexical bigrams.
* Option: `False`
    * Keeps duplicate lexical bigrams.

In [64]:
IGNORE_DUPLICATES = True

`USE_PRETRAINED_VECTORS`: Choose whether to use pretrained high-dimensional semantic vectors from GoogleNews or to build vectors from the transcripts themselves (each utterance/row is treated as a single context). Note: if there are only a small number of utterances/rows, the pretrained vectors should be used.

* Default: `False`
    * Builds a high-dimensional semantic space from the input transcripts.
* Option: `True`
    * Uses pretrained vectors from GoogleNews.

In [65]:
USE_PRETRAINED_VECTORS = False

`ALL_SURROGATES`: Choose whether to generate surrogates from all possible pairings within a condition or only from a subset of all possible pairings.

* Default: `True`
    * Generates all possible pairings.
* Option: `False`
    * Generates from a subset of all possible pairings.

In [66]:
ALL_SURROGATES = True
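
The difference between the two options can be sketched with the standard library. This example assumes that a surrogate pairing is an ordered pair of two different conversations within a condition, and that a subset is drawn at random. Both are assumptions made for illustration; ALIGN's own pairing and sampling rules are not described in this notebook.

```python
import itertools
import random

# Hypothetical conversation identifiers within a single condition.
conversation_ids = ["time195", "time197", "time204", "time210"]

# All possible pairings: every ordered pair of distinct conversations.
all_pairs = list(itertools.permutations(conversation_ids, 2))

# A subset of pairings: here, a seeded random sample with as many
# pairings as there are conversations (an arbitrary choice).
rng = random.Random(0)
subset = rng.sample(all_pairs, k=len(conversation_ids))

print(len(all_pairs), len(subset))  # 12 4
```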

`KEEP_ORIGINAL_TURN_ORDER`: For generating surrogate transcripts, choose whether to retain the original ordering of each surrogate partner's data or to create surrogates by shuffling all turns within each surrogate partner's data.

* Default: `True`
    * Retains original ordering of conversational turns.
* Option: `False`
    * Shuffles ordering of conversational turns.

In [67]:
KEEP_ORIGINAL_TURN_ORDER = True
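
As a sketch of what this setting controls, the example below builds a surrogate conversation by interleaving one speaker's turns from one conversation with the other speaker's turns from a different conversation. The function `build_surrogate`, the truncation to the shorter turn list, and the example turns are all assumptions for illustration, not ALIGN's implementation.

```python
import random

def build_surrogate(turns_a, turns_b, keep_original_turn_order=True, seed=None):
    """Interleave speaker A's turns from one conversation with speaker B's from another."""
    turns_a, turns_b = list(turns_a), list(turns_b)
    if not keep_original_turn_order:
        # Shuffle each partner's turns independently.
        rng = random.Random(seed)
        rng.shuffle(turns_a)
        rng.shuffle(turns_b)
    surrogate = []
    # zip() stops at the shorter list, so extra turns are dropped.
    for a, b in zip(turns_a, turns_b):
        surrogate.append(("A", a))
        surrogate.append(("B", b))
    return surrogate

caregiver_turns = ["what is that", "is it a dog", "good job"]
child_turns = ["a truck", "big truck", "more juice"]

ordered = build_surrogate(caregiver_turns, child_turns)
shuffled = build_surrogate(caregiver_turns, child_turns,
                           keep_original_turn_order=False, seed=1)
print(ordered[:2])
```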

### Additional settings

ALIGN contains a number of other settings that users may alter if desired. We outline each below and provide the default value for user information, but we preserve them in their defaults for the sake of this notebook. More information about each argument can also be found in the docstring for each function.

* `filler_regex_and_list`: remove common fillers through regex in addition to removing a user-specified list of fillers (default: `False`)
* `high_sd_cutoff`: remove any words that occur in the dataset over a certain number of SDs greater than the mean (default: `3`)
* `low_n_cutoff`: remove any words that occur in the dataset at or below a given raw number of times (default: `1`)
* `input_as_directory`: pass a directory of files (rather than a list of file names) to process data (default: `True`)
* `save_concatenated_dataframe`: save output of Phase 1 as a single dataframe (default: `True`)
* `dyad_label`: prefix before dyad identifier in transcript filenames (default: `dyad`)
* `condition_label`: prefix before condition identifier in transcript filenames (default: `cond`)
* `id_separator`: unique character separator between dyad and condition in transcript filenames (default: `-`)
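
The two frequency cutoffs above (`high_sd_cutoff` and `low_n_cutoff`) can be illustrated with a short sketch. The helper `filter_vocabulary` is hypothetical, and the use of the population standard deviation over word-type counts is an assumption; the ALIGN docstrings are the authoritative description.

```python
from collections import Counter
from statistics import mean, pstdev

def filter_vocabulary(tokens, high_sd_cutoff=3, low_n_cutoff=1):
    """Keep words that are neither extremely frequent nor too rare."""
    counts = Counter(tokens)
    freqs = list(counts.values())
    # Words occurring more often than mean + cutoff * SD are dropped.
    upper = mean(freqs) + high_sd_cutoff * pstdev(freqs)
    # Words occurring at or below low_n_cutoff times are dropped.
    return {word for word, n in counts.items() if low_n_cutoff < n <= upper}

# Invented corpus: one very frequent word, twenty ordinary words, one rare word.
tokens = ["the"] * 100 + ["rare"]
for i in range(20):
    tokens += [f"word{i}"] * 2

kept = filter_vocabulary(tokens)
print("the" in kept, "rare" in kept, len(kept))  # False False 20
```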

# Run everything!

Now that we've walked through all of our functions, let's try out ALIGN on some of the CHILDES data. We'll be getting a sense of the length of time it takes to run ALIGN with each step and then take a peek at the resulting data.

## Phase 1: Prep

In [50]:
import time
start_phase1 = time.time()

In [67]:
# Functions are accessed through the imported `align` package.
model_store = align.prepare_transcripts(
    input_files=TRANSCRIPTS,
    minwords=2,
    add_stanford_tags=False,
    output_file_directory=PREPPED_TRANSCRIPTS,
    use_filler_list=None,
    filler_regex_and_list=False,
    training_dictionary=INPUT_PATH + 'package_files/gutenberg.txt',
    stanford_pos_path=STANFORD_POS_PATH,
    stanford_language_path=STANFORD_LANGUAGE,
    input_as_directory=True,
    save_concatenated_dataframe=True)

## Phase 2: Real

**Note**: For demonstration purposes, given the small number of transcripts in our example corpus, the example here uses pretrained vectors from Google News rather than building a new semantic space from the example corpus itself.

In [52]:
start_phase2real = time.time()

In [66]:
# Read the prepared transcripts from Phase 1 and save the
# analysis-ready output to its own folder.
[turn_real, convo_real] = align.calculate_alignment(
    input_files=PREPPED_TRANSCRIPTS,
    add_stanford_tags=False,
    maxngram=2,
    use_pretrained_vectors=True,
    semantic_model_input_file=TRANSCRIPTS + '../' + 'align_concatenated_dataframe.txt',
    output_file_directory=ANALYSIS_READY,
    pretrained_input_file=INPUT_PATH + 'package_files/GoogleNews-vectors-negative300.bin',
    ignore_duplicates=True,
    delay=1,
    high_sd_cutoff=3,
    low_n_cutoff=1,
    input_as_directory=True)

## Phase 2: Surrogate

**Note**: For demonstration purposes, we again use pretrained vectors from Google News. We demonstrate other possible uses for labels by setting `dyad_label='time'`, allowing us to compare alignment over time across the same speakers. We also demonstrate how to generate a subset of surrogate pairings rather than all possible pairings.

In [54]:
start_phase2surrogate = time.time()

In [65]:
[turn_surrogate, convo_surrogate] = align.calculate_baseline_alignment(
    input_files=INPUT_PATH + 'examples/CHILDES/childes-prepped/',
    add_stanford_tags=False,
    maxngram=2,
    use_pretrained_vectors=True,
    all_surrogates=False,
    keep_original_turn_order=True,
    id_separator=r'\-',  # raw string avoids an invalid escape sequence warning
    dyad_label='time',
    condition_label='cond',
    surrogate_file_directory=INPUT_PATH + 'examples/CHILDES/childes-surrogate/',
    output_file_directory=INPUT_PATH + 'examples/CHILDES/childes-analysis/',
    pretrained_input_file=INPUT_PATH + 'package_files/GoogleNews-vectors-negative300.bin',
    semantic_model_input_file=INPUT_PATH + 'align_concatenated_dataframe.txt',
    ignore_duplicates=True,
    delay=1,
    high_sd_cutoff=3,
    low_n_cutoff=1,
    input_as_directory=True)

In [56]:
end = time.time()

## Speed calculations

As promised, let's take a look at how long it takes to run each section. Time is given in seconds.

Phase 1 time:

In [57]:
start_phase2real - start_phase1

33.50847101211548

Phase 2 real time:

In [58]:
start_phase2surrogate - start_phase2real

77.47824192047119

Phase 2 surrogate time:

In [59]:
end - start_phase2surrogate

77.40525507926941

All 3 phases:

In [60]:
end - start_phase1

188.39196801185608
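
The manual timestamp subtraction above can also be wrapped in a small reusable helper. The sketch below uses a context manager and `time.perf_counter()`; the `timed` helper and the `timings` dictionary are hypothetical names, and `time.sleep` stands in for an actual ALIGN call.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Record the elapsed wall-clock time (in seconds) of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

with timed("phase1"):
    time.sleep(0.01)  # stand-in for a call such as prepare_transcripts(...)

with timed("phase2_real"):
    time.sleep(0.01)  # stand-in for a call such as calculate_alignment(...)

total = sum(timings.values())
print(sorted(timings))
```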

## Printouts!

And that's it! Before we go, let's take a look at the output from the real data analyzed at the turn level for each conversation (`turn_real`) and at the conversation level for each dyad (`convo_real`). We'll then look at our surrogate data, analyzed both at the turn level (`turn_surrogate`) and at the conversation level (`convo_surrogate`). In our next step, we would then take these data and plug them into our statistical model of choice, but we'll stop here for the sake of our tutorial.

In [61]:
turn_real.head(10)

| Unnamed: 0 | time | syntax_penn_tok2 | syntax_penn_lem2 | lexical_tok2 | lexical_lem2 | cosine_semanticL | partner_direction | condition_info |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0.285198 | cgv>kid | time197-cond1.txt |
| 1 | 1 | 0.0 | 0.0 | 0 | 0 | 0.37358 | kid>cgv | time197-cond1.txt |
| 2 | 2 | 0.154303 | 0.0 | 0 | 0 | 0.57782 | cgv>kid | time197-cond1.txt |
| 3 | 3 | 0.0 | 0.0 | 0 | 0 | 0.672067 | kid>cgv | time197-cond1.txt |
| 4 | 4 | 0.111111 | 0.09245 | 0 | 0 | 0.597504 | cgv>kid | time197-cond1.txt |
| 5 | 5 | 0.222222 | 0.27735 | 0 | 0 | 0.617649 | kid>cgv | time197-cond1.txt |
| 6 | 6 | 0.0 | 0.0 | 0 | 0 | 0.168668 | cgv>kid | time197-cond1.txt |
| 7 | 7 | 0.0 | 0.0 | 0 | 0 | 0.223091 | kid>cgv | time197-cond1.txt |
| 8 | 8 | 0.0 | 0.0 | 0 | 0 | 0.323836 | cgv>kid | time197-cond1.txt |
| 9 | 9 | 0.0 | 0.0 | 0 | 0 | 0.283156 | kid>cgv | time197-cond1.txt |


In [62]:
convo_real.head(10)

| Unnamed: 0 | syntax_penn_tok2 | syntax_penn_lem2 | lexical_tok2 | lexical_lem2 | condition_info |
|---|---|---|---|---|---|
| 0 | 0.70911 | 0.70911 | 0.099848 | 0.186072 | time197-cond1.txt |
| 1 | 0.76849 | 0.76849 | 0.353514 | 0.43538 | time202-cond1.txt |
| 2 | 0.744802 | 0.744802 | 0.309924 | 0.356673 | time191-cond1.txt |
| 3 | 0.782399 | 0.782399 | 0.353604 | 0.401469 | time209-cond1.txt |
| 4 | 0.810753 | 0.810753 | 0.192589 | 0.305209 | time210-cond1.txt |
| 5 | 0.766315 | 0.766315 | 0.311128 | 0.365522 | time204-cond1.txt |
| 6 | 0.670246 | 0.670246 | 0.164155 | 0.228145 | time196-cond1.txt |
| 7 | 0.789571 | 0.789571 | 0.285261 | 0.317173 | time203-cond1.txt |
| 8 | 0.741248 | 0.741248 | 0.319008 | 0.383271 | time208-cond1.txt |
| 9 | 0.78544 | 0.78544 | 0.188816 | 0.229783 | time205-cond1.txt |


In [63]:
turn_surrogate.head(10)

| Unnamed: 0 | time | syntax_penn_tok2 | syntax_penn_lem2 | lexical_tok2 | lexical_lem2 | cosine_semanticL | partner_direction | condition_info |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.169031 | 0.0 | 0.0 | 0.3444 | cgv>kid | time195-time204-cond1 |
| 1 | 1 | 0.210819 | 0.111803 | 0.0 | 0.0 | 0.628921 | kid>cgv | time195-time204-cond1 |
| 2 | 2 | 0.105409 | 0.111803 | 0.0 | 0.0 | 0.619997 | cgv>kid | time195-time204-cond1 |
| 3 | 3 | 0.0 | 0.0 | 0.13484 | 0.13484 | 0.686174 | kid>cgv | time195-time204-cond1 |
| 4 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.559554 | cgv>kid | time195-time204-cond1 |
| 5 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.309329 | kid>cgv | time195-time204-cond1 |
| 6 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.35275 | cgv>kid | time195-time204-cond1 |
| 7 | 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.535742 | kid>cgv | time195-time204-cond1 |
| 8 | 8 | 0.0 | 0.0 | 0.353553 | 0.353553 | 0.30474 | cgv>kid | time195-time204-cond1 |
| 9 | 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.456275 | kid>cgv | time195-time204-cond1 |


In [64]:
convo_surrogate.head(10)

| Unnamed: 0 | syntax_penn_tok2 | syntax_penn_lem2 | lexical_tok2 | lexical_lem2 | condition_info |
|---|---|---|---|---|---|
| 0 | 0.703404 | 0.703404 | 0.163967 | 0.251262 | time195-time204-cond1 |
| 1 | 0.767801 | 0.767801 | 0.129419 | 0.157256 | time191-time201-cond1 |
| 2 | 0.747546 | 0.747546 | 0.102264 | 0.157979 | time210-time197-cond1 |
| 3 | 0.806707 | 0.806707 | 0.124169 | 0.176623 | time210-time201-cond1 |
| 4 | 0.790831 | 0.790831 | 0.130467 | 0.211696 | time204-time195-cond1 |
| 5 | 0.686797 | 0.686797 | 0.135402 | 0.202031 | time194-time210-cond1 |
| 6 | 0.72722 | 0.72722 | 0.078701 | 0.103854 | time197-time210-cond1 |
| 7 | 0.736246 | 0.736246 | 0.081103 | 0.156581 | time195-time197-cond1 |
| 8 | 0.808982 | 0.808982 | 0.120128 | 0.199052 | time201-time210-cond1 |
| 9 | 0.710756 | 0.710756 | 0.069476 | 0.119855 | time197-time195-cond1 |
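
As a quick, informal first look before any statistical modeling, one might compare the real and surrogate scores directly. The sketch below hard-codes the ten `lexical_lem2` values shown in the `convo_real` and `convo_surrogate` previews above. It covers only those preview rows and is a descriptive comparison, not a significance test.

```python
from statistics import mean

# lexical_lem2 values copied from the first ten rows of convo_real.
real_lexical_lem2 = [
    0.186072, 0.43538, 0.356673, 0.401469, 0.305209,
    0.365522, 0.228145, 0.317173, 0.383271, 0.229783,
]

# lexical_lem2 values copied from the first ten rows of convo_surrogate.
surrogate_lexical_lem2 = [
    0.251262, 0.157256, 0.157979, 0.176623, 0.211696,
    0.202031, 0.103854, 0.156581, 0.199052, 0.119855,
]

real_mean = mean(real_lexical_lem2)
surrogate_mean = mean(surrogate_lexical_lem2)
print(round(real_mean, 4), round(surrogate_mean, 4))
```

With the full dataframes in memory, the same comparison could be made on the complete columns rather than on these preview rows.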
