# ALIGN Tutorial Notebook: CHILDES

This notebook provides an introduction to **ALIGN**, 
a tool for quantifying multi-level linguistic similarity 
between speakers, using parent-child transcript data from 
the Kuczaj Corpus 
(https://childes.talkbank.org/access/Eng-NA/Kuczaj.html).

This method was introduced in "ALIGN: Analyzing Linguistic Interactions with Generalizable techNiques" (Duran, Paxton, & Fusaroli, *submitted*).

## Tutorial Overview

While many studies of interpersonal linguistic alignment
compare alignment between different dyads across different
conditions (i.e., typically a within- or between-dyads
design in which each dyad contributes only one or two 
conversations), there may also be interest in understanding
longer-scale temporal dynamics *within* a given dyad.
This tutorial provides an example of how ALIGN may be used
to just that end: analyzing how a single dyad's multilevel
alignment changes across different conversations held
at different points over a longer range of time.

To do so, this tutorial walks users through an analysis of
conversations from a single English corpus from the CHILDES 
database (MacWhinney, 2000)---specifically, Kuczaj’s Abe 
corpus (Kuczaj, 1976), used under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License (see GitHub
repository or `data/CHILDES` directory for license). 
We analyze the last 20 conversations in the corpus in order
to explore how ALIGN can be used to track multi-level
linguistic alignment between a parent and child over time,
which may be of interest to developmental language
researchers. Specifically, we explore how alignment between a parent
and a child changes over a brief span of developmental
trajectory.

Data for this tutorial are shipped with the `align`
package on PyPI (https://pypi.python.org/pypi/align) and GitHub
(https://github.com/nickduran/align-linguistic-alignment/).

***

## Table of Contents

* [Getting Started](#Getting-Started)
    * [Prerequisites](#Prerequisites)
    * [Preparing input data](#Preparing-input-data)
    * [Filename conventions](#Filename-conventions)
    * [Highest-level functions](#Highest-level-functions)
* [Setup](#Setup)
    * [Import libraries](#Import-libraries)
    * [Specify ALIGN path settings](#Specify-ALIGN-path-settings)
* [Phase 1: Prepare transcripts](#Phase-1:-Prepare-transcripts)
    * [Preparation settings](#Preparation-settings)
    * [Run preparation phase](#Run-preparation-phase)
* [Phase 2: Calculate alignment](#Phase-2:-Calculate-alignment)
    * [For real data: Alignment calculation settings](#For-real-data:-Alignment-calculation-settings)
    * [For real data: Run alignment calculation](#For-real-data:-Run-alignment-calculation)
    * [For surrogate data: Alignment calculation settings](#For-surrogate-data:-Alignment-calculation-settings)
    * [For surrogate data: Run alignment calculation](#For-surrogate-data:-Run-alignment-calculation)
* [ALIGN output overview](#ALIGN-output-overview)
    * [Speed calculations](#Speed-calculations)
    * [Printouts!](#Printouts!)

***

# Getting Started

### Preparing input data

* Each input text file needs to contain a single conversation organized in an `N x 2` matrix
    * Text file must be tab-delimited.
* Each row must correspond to a single conversational turn from a speaker.
    * Rows must be temporally ordered based on their occurrence in the conversation.
    * Rows must alternate between speakers.
* Speaker identifier and content for each turn are divided across two columns.
    * Column 1 must have the header `participant`.
        * Each cell specifies the speaker.
        * Each speaker must have a unique label (e.g., `P1` and `P2`, `0` and `1`).
    * Column 2 must have the header `content`.
        * Each cell corresponds to the transcribed utterance from the speaker.
        * Each cell must end with a newline character: `\n`
* See the folder `examples > toy_data-original` in the GitHub repository for an example.
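
The requirements above can be checked before handing files to ALIGN. The following is a minimal sketch, not part of the `align` package: `write_transcript` and `check_transcript` are hypothetical helpers, the utterances are made up, and the code assumes Python 3.

```python
import csv
import os
import tempfile


def write_transcript(path, turns):
    """Write (speaker, utterance) pairs as a tab-delimited file with the
    `participant` and `content` headers, one turn per line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["participant", "content"])
        writer.writerows(turns)


def check_transcript(path):
    """Return a list of problems; an empty list means the checks passed."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    if not rows or rows[0] != ["participant", "content"]:
        return ["header must be: participant<TAB>content"]
    problems = []
    previous = None
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append("line %d: expected exactly 2 columns" % number)
            continue
        if row[0] == previous:
            problems.append("line %d: speaker did not alternate" % number)
        previous = row[0]
    return problems


# Made-up toy conversation, used only to exercise the helpers.
example_turns = [("cgv", "what are you building"),
                 ("kid", "a big tower"),
                 ("cgv", "how tall will it be")]
example_path = os.path.join(tempfile.mkdtemp(), "time1-cond1.txt")
write_transcript(example_path, example_turns)
print(check_transcript(example_path))
```

The checks cover only the header, the column count, and speaker alternation; ALIGN's own preparation phase may enforce more than this.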

### Filename conventions

* Each conversation text file must follow a regular naming pattern: a dyad prefix followed by the dyad identifier, then a separator character, then a condition prefix followed by the condition identifier. By default, ALIGN looks for patterns that follow this convention: `dyad1-condA.txt`
    * However, users may choose to include any label for dyad or condition so long as the two labels are distinct from one another and are not subsets of any possible dyad or condition labels. Users may also use any character as a separator so long as it does not occur anywhere else in the filename.
    * The chosen file format **must** be used when saving **all** files for this analysis.
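
To make the convention concrete, here is a small sketch of how such filenames could be split into their dyad and condition identifiers. `parse_filename` is a hypothetical helper written for this illustration and is not how ALIGN itself parses filenames; it takes the separator as a literal character and assumes a `.txt` extension.

```python
import re


def parse_filename(filename, dyad_label="dyad", condition_label="cond",
                   id_separator="-"):
    """Return (dyad_id, condition_id), or None if the name does not match."""
    pattern = (r"^" + re.escape(dyad_label) + r"(?P<dyad_id>.+?)"
               + re.escape(id_separator)
               + re.escape(condition_label) + r"(?P<condition_id>.+?)\.txt$")
    match = re.match(pattern, filename)
    if match is None:
        return None
    return match.group("dyad_id"), match.group("condition_id")


print(parse_filename("dyad1-condA.txt"))
print(parse_filename("time197-cond1.txt", dyad_label="time"))
print(parse_filename("notes.txt"))
```

A filename that returns `None` here does not follow the chosen pattern and should be renamed before running the analysis.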

### Highest-level functions

Given appropriately prepared transcript files, ALIGN can be run with three high-level functions:

**`prepare_transcripts`**: Pre-processes each standardized 
conversation, checking that it conforms to the requirements. 
Each utterance is tokenized, lemmatized, and tagged
for part of speech (POS).

**`calculate_alignment`**: Generates turn-level and 
conversation-level alignment scores (lexical, 
conceptual, and syntactic) across a range of 
*n*-gram sequences.

**`calculate_baseline_alignment`**: Generates a surrogate corpus
and runs the alignment analysis (using identical specifications 
from `calculate_alignment`) on it to produce a baseline.
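
To give a feel for what an *n*-gram alignment score between two turns can look like, the sketch below compares the bigram counts of two tokenized turns with cosine similarity. It illustrates the general idea only and is not ALIGN's implementation; the function names and the example turns are invented for this sketch.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngram_alignment(turn_a, turn_b, n=2):
    """Score how much two turns overlap in their n-gram sequences."""
    return cosine(Counter(ngrams(turn_a, n)), Counter(ngrams(turn_b, n)))


turn_a = ["i", "want", "the", "red", "ball"]
turn_b = ["you", "want", "the", "blue", "ball"]
print(ngram_alignment(turn_a, turn_b, n=2))  # one shared bigram of four each
```

Running the same function over POS tags instead of word tokens would give a syntactic analogue of the score.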

***

# Setup

## Import libraries

Import packages we'll need to run ALIGN.

In [1]:
import align, os
import pandas as pd

Import `time` so that we can get a sense of how
long the ALIGN pipeline takes.

In [None]:
import time

Import `warnings` to flag us if required files aren't provided.

In [None]:
import warnings

Load in the `rpy2.ipython` Jupyter notebook extension so
that we can run R analysis in notebook with a Python kernel.

In [6]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


## Specify ALIGN path settings

ALIGN will need to know where the raw transcripts are stored, where to store the processed data, and where to read in any additional files needed for optional ALIGN parameters.

### Required directories

For the sake of this tutorial, specify a base path that will serve as our jumping-off point for our saved data. All of the CHILDES data and shipped data will be called from the package directory.

**`BASE_PATH`**: Containing directory for this tutorial.

In [7]:
BASE_PATH = os.getcwd()

**`CHILDES_EXAMPLE`**: Subdirectories for output and other
files for this tutorial. (We'll create a default directory
if one doesn't already exist.)

In [None]:
CHILDES_EXAMPLE = os.path.join(BASE_PATH,
                              'examples/CHILDES/')

In [None]:
if not os.path.exists(CHILDES_EXAMPLE):
    os.makedirs(CHILDES_EXAMPLE)

**`TRANSCRIPTS`**: Path to raw transcript files. Automatically provided by `align`.

In [8]:
TRANSCRIPTS = align.datasets.CHILDES_directory

**`PREPPED_TRANSCRIPTS`**: Set variable for folder name 
(as string) for relative location of folder into which 
prepared transcript files will be saved. (We'll create
a default directory if one doesn't already exist.)

In [55]:
PREPPED_TRANSCRIPTS = os.path.join(CHILDES_EXAMPLE,
                                   'childes-prepped/')

In [None]:
if not os.path.exists(PREPPED_TRANSCRIPTS):
    os.makedirs(PREPPED_TRANSCRIPTS)

**`ANALYSIS_READY`**: Set variable for folder name 
(as string) for relative location of folder into 
which analysis-ready dataframe files will be saved.
(We'll create a default directory if one doesn't
already exist.)

In [56]:
ANALYSIS_READY = os.path.join(CHILDES_EXAMPLE,
                              'childes-analysis/')

In [None]:
if not os.path.exists(ANALYSIS_READY):
    os.makedirs(ANALYSIS_READY)

**`SURROGATE_TRANSCRIPTS`**: Set variable for folder name 
(as string) for relative location of folder into which all
prepared surrogate transcript files will be saved. (We'll
create a default directory if one doesn't already exist.)

In [57]:
SURROGATE_TRANSCRIPTS = os.path.join(CHILDES_EXAMPLE,
                                     'childes-surrogate/')

In [None]:
if not os.path.exists(SURROGATE_TRANSCRIPTS):
    os.makedirs(SURROGATE_TRANSCRIPTS)

### Paths for optional parameters

**`OPTIONAL_PATHS`**: If using Stanford POS tagger or
pretrained vectors, the path to these files. If these
files are provided in other locations, be sure to
change the file paths for them. (We'll create a default
directory if one doesn't already exist.)

In [None]:
OPTIONAL_PATHS = os.path.join(CHILDES_EXAMPLE,
                             'optional_directories/')

In [None]:
if not os.path.exists(OPTIONAL_PATHS):
    os.makedirs(OPTIONAL_PATHS)

#### Stanford POS Tagger

The Stanford POS tagger **will not be used** by 
default in this example. However, you may use it
by uncommenting and providing the requested file 
paths in the cells in this section and then changing 
the relevant parameters in the ALIGN calls below.

If desired, we could use the Stanford part-of-speech 
tagger along with the Penn part-of-speech tagger
(which is always used in ALIGN). To do so, the files
will need to be downloaded separately: 
https://nlp.stanford.edu/software/tagger.shtml#Download

**`STANFORD_POS_PATH`**: If using Stanford POS tagger
with the Penn POS tagger, path to Stanford directory.

In [53]:
# STANFORD_POS_PATH = os.path.join(OPTIONAL_PATHS,
#                                  'stanford-postagger-full-2017-06-09/')

In [None]:
# if not os.path.exists(STANFORD_POS_PATH):
#     warnings.warn('Stanford POS directory not found at the specified '
#                   'location. Please update the file path or comment '
#                   'out the `STANFORD_POS_PATH` information.')

**`STANFORD_LANGUAGE`**: If using Stanford tagger,
set language model to be used for POS tagging.

In [54]:
# STANFORD_LANGUAGE = os.path.join(OPTIONAL_PATHS,
#                                  'english-left3words-distsim.tagger')

In [None]:
# if not os.path.exists(STANFORD_LANGUAGE):
#     warnings.warn('Stanford tagger language not found at the specified '
#                   'location. Please update the file path or comment '
#                   'out the `STANFORD_LANGUAGE` information.')

#### Google News pretrained vectors

The Google News pretrained vectors **will be used**
by default in this example. The file is available for
download here: https://code.google.com/archive/p/word2vec/

If desired, researchers may choose to read in pretrained
`word2vec` vectors rather than creating a semantic space
from the corpus provided. This may be especially useful 
for small corpora (i.e., fewer than 30k unique words),
although the choice of semantic space corpus should be
made with careful consideration about the nature of the
linguistic context (for further discussion, see Duran, 
Paxton, & Fusaroli, *submitted*).
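
As a rough intuition for how word vectors can be turned into a similarity score between turns, the sketch below sums the vectors of the words in each turn and compares the sums with cosine similarity. This is an assumption-laden illustration rather than ALIGN's actual procedure: the three-dimensional vectors are made up and stand in for the 300-dimensional pretrained ones.

```python
import math

# Made-up 3-dimensional vectors standing in for real pretrained vectors.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "tower": [0.0, 0.2, 0.9],
    "blocks": [0.1, 0.1, 0.8],
}


def turn_vector(tokens, vectors):
    """Sum the vectors of all known tokens; unknown tokens are skipped."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for token in tokens:
        for i, value in enumerate(vectors.get(token, [])):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


similar = cosine(turn_vector(["the", "dog"], toy_vectors),
                 turn_vector(["a", "puppy"], toy_vectors))
different = cosine(turn_vector(["the", "dog"], toy_vectors),
                   turn_vector(["tower", "blocks"], toy_vectors))
print(similar > different)
```

Words missing from the vector file contribute nothing to the turn vector, which is one reason the coverage of the chosen semantic space matters for small or specialized corpora.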

**`PRETRAINED_INPUT_FILE`**: If using pretrained vectors, path
to pretrained vector files. You may choose to download the file
directly to this path or to change the path to a different one.

In [None]:
PRETRAINED_INPUT_FILE = os.path.join(OPTIONAL_PATHS,
                            'GoogleNews-vectors-negative300.bin')

In [None]:
if not os.path.exists(PRETRAINED_INPUT_FILE):
    warnings.warn('Google News vectors not found at the specified '
                  'location. Please update the file path or comment '
                  'out the `PRETRAINED_INPUT_FILE` information.')

***

# Phase 1: Prepare transcripts

In Phase 1, we take our raw transcripts and get them ready
for later ALIGN analysis.

## Preparation settings

There are a number of parameters that we can set for the
`prepare_transcripts()` function:

In [2]:
print(align.prepare_transcripts.__doc__)


    Prepare transcripts for similarity analysis.

    Given individual .txt files of conversations,
    return a completely prepared dataframe of transcribed
    conversations for later ALIGN analysis, including: text
    cleaning, merging adjacent turns, spell-checking,
    tokenization, lemmatization, and part-of-speech tagging.
    The output serve as the input for later ALIGN
    analysis.

    input_files : str (directory name) or list of str (file names)
        Cleaned files to be analyzed. Behavior governed by `input_as_directory`
        parameter as well.

    output_file_directory : str
        Name of directory where output for individual conversations will be
        saved.

    training_dictionary : str, optional (default: None)
        Specify whether to train the spell-checking dictionary using a
        provided file name (str) or the default Project
        Gutenberg corpus [http://www.gutenberg.org] (None).

    minwords : int, optional (2)
        Specify the minim

For the sake of this demonstration, we'll keep everything as
defaults. Among other parameters, this means that:
* any turns shorter than 2 words will be removed from the corpus
 (`minwords=2`),
* we'll be using regex to strip out any filler words
 (e.g., "uh," "um," "huh"; `use_filler_list=None`),
* we'll be using the Project Gutenberg corpus to create our 
 spell-checker algorithm (`training_dictionary=None`),
* we'll rely only on the Penn POS tagger 
 (`add_stanford_tags=False`), and
* our data will be saved both as individual conversation files
 and as a master dataframe of all conversation outputs
 (`save_concatenated_dataframe=True`).
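
The first two defaults can be pictured with a short sketch. This is a simplified stand-in, not the code that `prepare_transcripts()` runs: the filler pattern, the `clean_turns` helper, and the example turns are all assumptions made for illustration.

```python
import re

# Hypothetical filler pattern covering "uh", "um", and "huh" (plus a
# trailing comma or period); ALIGN's own default list may differ.
FILLER_PATTERN = re.compile(r"\b(?:uh+|um+|huh)\b[,.]?\s*", flags=re.IGNORECASE)


def clean_turns(turns, minwords=2):
    """Strip filler words, then drop turns shorter than `minwords` words."""
    cleaned = []
    for speaker, text in turns:
        text = FILLER_PATTERN.sub("", text).strip()
        if len(text.split()) >= minwords:
            cleaned.append((speaker, text))
    return cleaned


raw_turns = [("cgv", "um what is that"),
             ("kid", "uh huh"),
             ("cgv", "tell me more")]
print(clean_turns(raw_turns))
```

In this toy case the child's turn disappears entirely, leaving two adjacent caregiver turns; the docstring above lists merging adjacent turns among the preparation steps, which this sketch does not attempt.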

## Run preparation phase

First, we prepare our transcripts by reading in individual `.txt`
files for each conversation, cleaning up undesired text and turns,
spell-checking, tokenizing, lemmatizing, and adding POS tags.

In [None]:
start_phase1 = time.time()

In [67]:
model_store = align.prepare_transcripts(
                        input_files=TRANSCRIPTS,
                        output_file_directory=PREPPED_TRANSCRIPTS,
                        minwords=2,
                        use_filler_list=None,
                        training_dictionary=None,
                        add_stanford_tags=False,
                        save_concatenated_dataframe=True)

In [None]:
end_phase1 = time.time()

***

# Phase 2: Calculate alignment

## For real data: Alignment calculation settings

There are a number of parameters that we can set for the
`calculate_alignment()` function:

In [3]:
print(align.calculate_alignment.__doc__)


    Calculate lexical, syntactic, and conceptual alignment between speakers.

    Given a directory of individual .txt files and the
    vocabulary list that have been generated by the `prepare_transcripts`
    preparation stage, return multi-level alignment
    scores with turn-by-turn and conversation-level metrics.

    Parameters
    ----------

    input_files : str (directory name) or list of str (file names)
        Cleaned files to be analyzed. Behavior governed by `input_as_directory`
        parameter as well.

    output_file_directory : str
        Name of directory where output for individual conversations will be
        saved.

    semantic_model_input_file : str
        Name of file to be used for creating the semantic model. A compatible
        file will be saved as an output of `prepare_transcripts()`.

    pretrained_input_file : str or None
        If using a pretrained vector to create the semantic model, use
        name of model here. If not, use None. Behavior

For the sake of this tutorial, we'll keep everything as
defaults. Among other parameters, this means that we'll:
* use only unigrams and bigrams for our *n*-grams
 (`maxngram=2`),
* use pretrained vectors instead of creating our own
 semantic space, since our tutorial corpus is quite
 small (`use_pretrained_vectors=True` and
 `pretrained_input_file=PRETRAINED_INPUT_FILE`),
* ignore exact lexical duplicates when calculating
 syntactic alignment (`ignore_duplicates=True`),
* rely only on the Penn POS tagger 
 (`add_stanford_tags=False`), and
* implement high- and low-frequency cutoffs to clean
 our transcript data (`high_sd_cutoff=3` and 
 `low_n_cutoff=1`).
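
The frequency cutoffs in the last bullet can be sketched as follows. The interpretation here is an assumption on our part: words whose raw count is at or below `low_n_cutoff` are dropped, as are words whose count lies more than `high_sd_cutoff` standard deviations above the mean count. `filter_vocabulary` is a hypothetical helper and the token list is made up.

```python
from collections import Counter
import statistics


def filter_vocabulary(tokens, high_sd_cutoff=3, low_n_cutoff=1):
    """Keep words that are neither too rare nor unusually frequent."""
    counts = Counter(tokens)
    frequencies = list(counts.values())
    mean = statistics.mean(frequencies)
    sd = statistics.pstdev(frequencies)
    upper = mean + high_sd_cutoff * sd
    return {word for word, count in counts.items()
            if low_n_cutoff < count <= upper}


# Made-up token list: one very frequent word, two mid-frequency, one rare.
tokens = ["the"] * 50 + ["ball"] * 3 + ["tower"] * 2 + ["giraffe"]
print(sorted(filter_vocabulary(tokens)))
print(sorted(filter_vocabulary(tokens, high_sd_cutoff=1)))
```

With only four word types, no count can sit 3 standard deviations above the mean, so the default upper cutoff removes nothing in this toy example; the stricter `high_sd_cutoff=1` call shows the high-frequency filter taking effect.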

Whenever we calculate a baseline level of alignment,
we need to include the same parameter values for any
parameters that are present in both `calculate_alignment()`
(this step) and `calculate_baseline_alignment()`
(next step). As a result, we'll specify these here:

In [None]:
# set standards to be used for real and surrogate
INPUT_FILES = PREPPED_TRANSCRIPTS
MAXNGRAM = 2
USE_PRETRAINED_VECTORS = True
SEMANTIC_MODEL_INPUT_FILE = os.path.join(CHILDES_EXAMPLE,
                                         'align_concatenated_dataframe.txt')
PRETRAINED_FILE_DIRECTORY = PRETRAINED_INPUT_FILE
ADD_STANFORD_TAGS = False
IGNORE_DUPLICATES = True
HIGH_SD_CUTOFF = 3
LOW_N_CUTOFF = 1

## For real data: Run alignment calculation

In [52]:
start_phase2real = time.time()

In [66]:
# call the function through the `align` namespace imported above
turn_real, convo_real = align.calculate_alignment(
    input_files=INPUT_FILES,
    maxngram=MAXNGRAM,
    use_pretrained_vectors=USE_PRETRAINED_VECTORS,
    pretrained_input_file=PRETRAINED_INPUT_FILE,
    semantic_model_input_file=SEMANTIC_MODEL_INPUT_FILE,
    output_file_directory=ANALYSIS_READY,
    add_stanford_tags=ADD_STANFORD_TAGS,
    ignore_duplicates=IGNORE_DUPLICATES,
    high_sd_cutoff=HIGH_SD_CUTOFF,
    low_n_cutoff=LOW_N_CUTOFF)

In [None]:
end_phase2real = time.time()

## For surrogate data: Alignment calculation settings

For the surrogate or baseline data, we have many of the same
parameters for `calculate_baseline_alignment()` as we do for
`calculate_alignment()`:

In [4]:
print(align.calculate_baseline_alignment.__doc__)


    Calculate baselines for lexical, syntactic, and conceptual
    alignment between speakers.

    Given a directory of individual .txt files and the
    vocab list that have been generated by the `prepare_transcripts`
    preparation stage, return multi-level alignment
    scores with turn-by-turn and conversation-level metrics
    for surrogate baseline conversations.

    Parameters
    ----------

    input_files : str (directory name) or list of str (file names)
        Cleaned files to be analyzed. Behavior governed by `input_as_directory`
        parameter as well.

    surrogate_file_directory : str
        Name of directory where raw surrogate data will be saved.

    output_file_directory : str
        Name of directory where output for individual surrogate
        conversations will be saved.

    semantic_model_input_file : str
        Name of file to be used for creating the semantic model. A compatible
        file will be saved as an output of `prepare_transcripts()`.


As mentioned above, when calculating the baseline, it is **vital** 
to include the *same* parameter values for any parameters that 
are included in both `calculate_alignment()` and 
`calculate_baseline_alignment()`. As a result, we re-use those
values here.

We demonstrate other possible uses for labels by setting 
`dyad_label='time'`, allowing us to compare alignment over 
time across the same speakers. We also demonstrate how to 
generate a subset of surrogate pairings rather than all 
possible pairings.

In addition to the parameters that we're re-using from
the `calculate_alignment()` values (see above), we'll 
keep most parameters at their defaults by:
* preserving the turn order when creating surrogate
 pairs (`keep_original_turn_order=True`),
* specifying condition with `cond` prefix 
 (`condition_label='cond'`), and
* using a hyphen to separate the condition and
 dyad identifiers (`id_separator='\-'`).
 
However, we will also change some of these defaults,
including:
* generating only a subset of surrogate data equal
 to the size of the real data (`all_surrogates=False`)
 and
* specifying that we'll be shuffling the baseline data
 by time instead of by dyad (`dyad_label='time'`).
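
To illustrate what a surrogate pairing is, the sketch below pairs one speaker's turns from one conversation with the other speaker's turns from a different conversation, keeping each speaker's original turn order. It is a conceptual sketch under our own assumptions, not the procedure `calculate_baseline_alignment()` follows; both helper functions, the speaker labels used as defaults, and the toy turns are hypothetical.

```python
import random
from itertools import permutations


def surrogate_pairings(conversation_ids, all_surrogates=False, seed=0):
    """List ordered pairs of distinct conversations. If `all_surrogates`
    is False, sample as many pairs as there are real conversations."""
    pairings = list(permutations(conversation_ids, 2))
    if all_surrogates:
        return pairings
    rng = random.Random(seed)
    return rng.sample(pairings, min(len(conversation_ids), len(pairings)))


def build_surrogate(turns_a, turns_b, speaker_a="cgv", speaker_b="kid"):
    """Interleave speaker_a's turns from one conversation with speaker_b's
    turns from another, preserving each speaker's original order."""
    a_turns = [text for speaker, text in turns_a if speaker == speaker_a]
    b_turns = [text for speaker, text in turns_b if speaker == speaker_b]
    surrogate = []
    for text_a, text_b in zip(a_turns, b_turns):
        surrogate.append((speaker_a, text_a))
        surrogate.append((speaker_b, text_b))
    return surrogate


ids = ["time191", "time195", "time197"]
print(surrogate_pairings(ids))
print(len(surrogate_pairings(ids, all_surrogates=True)))
```

Because `zip` stops at the shorter list, the surrogate conversation here is truncated to the length of the shorter contribution; how ALIGN handles unequal lengths is not covered by this sketch.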

## For surrogate data: Run alignment calculation

In [54]:
start_phase2surrogate = time.time()

In [65]:
# call the function through the `align` namespace imported above
turn_surrogate, convo_surrogate = align.calculate_baseline_alignment(
    input_files=INPUT_FILES,
    maxngram=MAXNGRAM,
    use_pretrained_vectors=USE_PRETRAINED_VECTORS,
    pretrained_input_file=PRETRAINED_INPUT_FILE,
    semantic_model_input_file=SEMANTIC_MODEL_INPUT_FILE,
    output_file_directory=ANALYSIS_READY,
    add_stanford_tags=ADD_STANFORD_TAGS,
    ignore_duplicates=IGNORE_DUPLICATES,
    high_sd_cutoff=HIGH_SD_CUTOFF,
    low_n_cutoff=LOW_N_CUTOFF,
    surrogate_file_directory=SURROGATE_TRANSCRIPTS,
    all_surrogates=False,
    keep_original_turn_order=True,
    id_separator=r'\-',  # raw string: same value as '\-', without an invalid-escape warning
    dyad_label='time',
    condition_label='cond')

In [None]:
end_phase2surrogate = time.time()

***

# ALIGN output overview

## Speed calculations

As promised, let's take a look at how long it takes to run each section. Time is given in seconds.

**Phase 1:**

In [57]:
end_phase1 - start_phase1

33.50847101211548

**Phase 2, real data:**

In [58]:
end_phase2real - start_phase2real

77.47824192047119

**Phase 2, surrogate data:**

In [59]:
end_phase2surrogate - start_phase2surrogate

77.40525507926941

**All phases:**

In [60]:
end_phase2surrogate - start_phase1

188.39196801185608
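
As an alternative to paired start and end cells, timing can be wrapped in a small context manager. This is an optional convenience sketch assuming Python 3; `timed` and the `timings` dictionary are names introduced here and are not part of ALIGN.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration (in seconds) of the enclosed block
    in `results[label]`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    sum(range(100000))  # stand-in for a call such as a pipeline phase
print(sorted(timings))
```

Each phase could then be run inside its own `with timed(...)` block, leaving all durations in one dictionary.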

## Printouts!

And that's it! Before we go, let's take a look at the output from the real data analyzed at the turn level for each conversation (`turn_real`) and at the conversation level for each dyad (`convo_real`). We'll then look at our surrogate data, analyzed both at the turn level (`turn_surrogate`) and at the conversation level (`convo_surrogate`). In our next step, we would then take these data and plug them into our statistical model of choice, but we'll stop here for the sake of our tutorial.

In [61]:
turn_real.head(10)

Unnamed: 0,time,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,cosine_semanticL,partner_direction,condition_info
0,0,0.0,0.0,0,0,0.285198,cgv>kid,time197-cond1.txt
1,1,0.0,0.0,0,0,0.37358,kid>cgv,time197-cond1.txt
2,2,0.154303,0.0,0,0,0.57782,cgv>kid,time197-cond1.txt
3,3,0.0,0.0,0,0,0.672067,kid>cgv,time197-cond1.txt
4,4,0.111111,0.09245,0,0,0.597504,cgv>kid,time197-cond1.txt
5,5,0.222222,0.27735,0,0,0.617649,kid>cgv,time197-cond1.txt
6,6,0.0,0.0,0,0,0.168668,cgv>kid,time197-cond1.txt
7,7,0.0,0.0,0,0,0.223091,kid>cgv,time197-cond1.txt
8,8,0.0,0.0,0,0,0.323836,cgv>kid,time197-cond1.txt
9,9,0.0,0.0,0,0,0.283156,kid>cgv,time197-cond1.txt


In [62]:
convo_real.head(10)

Unnamed: 0,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,condition_info
0,0.70911,0.70911,0.099848,0.186072,time197-cond1.txt
1,0.76849,0.76849,0.353514,0.43538,time202-cond1.txt
2,0.744802,0.744802,0.309924,0.356673,time191-cond1.txt
3,0.782399,0.782399,0.353604,0.401469,time209-cond1.txt
4,0.810753,0.810753,0.192589,0.305209,time210-cond1.txt
5,0.766315,0.766315,0.311128,0.365522,time204-cond1.txt
6,0.670246,0.670246,0.164155,0.228145,time196-cond1.txt
7,0.789571,0.789571,0.285261,0.317173,time203-cond1.txt
8,0.741248,0.741248,0.319008,0.383271,time208-cond1.txt
9,0.78544,0.78544,0.188816,0.229783,time205-cond1.txt


In [63]:
turn_surrogate.head(10)

Unnamed: 0,time,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,cosine_semanticL,partner_direction,condition_info
0,0,0.0,0.169031,0.0,0.0,0.3444,cgv>kid,time195-time204-cond1
1,1,0.210819,0.111803,0.0,0.0,0.628921,kid>cgv,time195-time204-cond1
2,2,0.105409,0.111803,0.0,0.0,0.619997,cgv>kid,time195-time204-cond1
3,3,0.0,0.0,0.13484,0.13484,0.686174,kid>cgv,time195-time204-cond1
4,4,0.0,0.0,0.0,0.0,0.559554,cgv>kid,time195-time204-cond1
5,5,0.0,0.0,0.0,0.0,0.309329,kid>cgv,time195-time204-cond1
6,6,0.0,0.0,0.0,0.0,0.35275,cgv>kid,time195-time204-cond1
7,7,0.0,0.0,0.0,0.0,0.535742,kid>cgv,time195-time204-cond1
8,8,0.0,0.0,0.353553,0.353553,0.30474,cgv>kid,time195-time204-cond1
9,9,0.0,0.0,0.0,0.0,0.456275,kid>cgv,time195-time204-cond1


In [64]:
convo_surrogate.head(10)

Unnamed: 0,syntax_penn_tok2,syntax_penn_lem2,lexical_tok2,lexical_lem2,condition_info
0,0.703404,0.703404,0.163967,0.251262,time195-time204-cond1
1,0.767801,0.767801,0.129419,0.157256,time191-time201-cond1
2,0.747546,0.747546,0.102264,0.157979,time210-time197-cond1
3,0.806707,0.806707,0.124169,0.176623,time210-time201-cond1
4,0.790831,0.790831,0.130467,0.211696,time204-time195-cond1
5,0.686797,0.686797,0.135402,0.202031,time194-time210-cond1
6,0.72722,0.72722,0.078701,0.103854,time197-time210-cond1
7,0.736246,0.736246,0.081103,0.156581,time195-time197-cond1
8,0.808982,0.808982,0.120128,0.199052,time201-time210-cond1
9,0.710756,0.710756,0.069476,0.119855,time197-time195-cond1
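
Before moving to a statistical model, a quick descriptive look at these tables can be useful. The sketch below re-enters the `lexical_lem2` values from the two conversation-level printouts above, recovers the session number from `condition_info`, and compares the means. It covers only the ten rows displayed by `head(10)` and is a descriptive illustration, not an analysis of the full output.

```python
import pandas as pd

# Values copied from the `convo_real.head(10)` printout above.
real = pd.DataFrame({
    "condition_info": ["time197-cond1.txt", "time202-cond1.txt",
                       "time191-cond1.txt", "time209-cond1.txt",
                       "time210-cond1.txt", "time204-cond1.txt",
                       "time196-cond1.txt", "time203-cond1.txt",
                       "time208-cond1.txt", "time205-cond1.txt"],
    "lexical_lem2": [0.186072, 0.43538, 0.356673, 0.401469, 0.305209,
                     0.365522, 0.228145, 0.317173, 0.383271, 0.229783],
})

# Values copied from the `convo_surrogate.head(10)` printout above.
surrogate = pd.DataFrame({
    "lexical_lem2": [0.251262, 0.157256, 0.157979, 0.176623, 0.211696,
                     0.202031, 0.103854, 0.156581, 0.199052, 0.119855],
})

# Recover the session number from the filename so rows can be ordered in time.
real["time"] = (real["condition_info"]
                .str.extract(r"time(\d+)", expand=False)
                .astype(int))
real = real.sort_values("time").reset_index(drop=True)

print(real[["time", "lexical_lem2"]])
print("real mean:", round(real["lexical_lem2"].mean(), 4))
print("surrogate mean:", round(surrogate["lexical_lem2"].mean(), 4))
```

In these ten rows, lemmatized lexical alignment is higher on average for the real conversations than for the surrogate pairings; whether that holds across the full data set is a question for the statistical model.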
