In [4]:
import pyclan  # first, import the library

# ignore this, it's just for pretty printing
import pprint
pp = pprint.PrettyPrinter()

# objects

These are the objects that pyclan exposes. They represent progressively smaller subdivisions of a CLAN file:
- **ClanFile**
 - this represents the whole CLAN file (.cha)
- **BlockGroup**
 - this is a collection of ClanBlocks
- **ClanBlock**
 - this is a single conversation block
  - delimited by:
   - @Bg Conversation XYZ (to begin)
   - @Eg Conversation XYZ (to end)
- **LineRange**
 - this is a collection of single ClanLines
- **ClanLine**
 - this is a single line within the CLAN file. Lines are delimited by "\n"
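
To make the delimiter rule for conversation blocks concrete, here is a minimal sketch in plain Python. It is not pyclan's implementation: the function name group_conversation_blocks, the regular expressions, and the sample lines are all made up for illustration, and the exact delimiter spelling in a real .cha file may differ.

```python
import re

# Hypothetical delimiter patterns (assumption: "@Bg:<tab>Conversation N").
BEGIN = re.compile(r"^@Bg:?\s+Conversation\s+(\d+)")
END = re.compile(r"^@Eg:?\s+Conversation\s+(\d+)")


def group_conversation_blocks(lines):
    """Return {block_number: [raw lines inside that block]}."""
    blocks = {}
    current = None  # number of the block we are currently inside, if any
    for line in lines:
        begin = BEGIN.match(line)
        end = END.match(line)
        if begin:
            current = int(begin.group(1))
            blocks[current] = []
        elif end and current == int(end.group(1)):
            current = None
        elif current is not None:
            blocks[current].append(line)
    return blocks


# Made-up lines, only to exercise the function.
sample = [
    "@Bg:\tConversation 1\n",
    "*FAN:\t0 . 100_200\n",
    "*MAN:\t0 . 200_350\n",
    "@Eg:\tConversation 1\n",
    "*SIL:\t0 . 350_900\n",
    "@Bg:\tConversation 2\n",
    "*OLN:\t0 . 900_950\n",
    "@Eg:\tConversation 2\n",
]
blocks = group_conversation_blocks(sample)
print(sorted(blocks))  # the block numbers that were found
print(len(blocks[1]))  # number of lines inside block 1
```

Lines that fall between an @Eg and the next @Bg (the SIL line here) belong to no conversation block in this sketch.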

# loading a CLAN file

First you need to construct a **ClanFile** object by loading a .cha file into it. Just supply a path to the file:

In [2]:
clan_file = pyclan.ClanFile("sample_data/31_14_coderSD_final.cha")

# class ClanFile

A ClanFile object exposes a number of attributes, plus methods you can call to filter the file or get information about it.


Let's print some of the basic variables that are part of every ClanFile object:


In [3]:
clan_file.num_full_blocks

714

In [4]:
clan_file.clan_path

'sample_data/31_14_coderSD_final.cha'

## ClanFile.line_map

Each ClanFile has a line_map member variable. This is a list of ClanLines. The line_map list is *the* fundamental internal representation of a CLAN file. You can loop through the line_map and print the content within each ClanLine:


In [5]:
# we're just looking at lines 50-60 to save space, 
# but you can loop through the entire line map if 
# you want everything in a CLAN file

for line in clan_file.line_map[50:60]:
    print line.line

%xdb:	average_dB="-34.55" peak_dB="-19.83"

*OLN:	0 . 14510_16190

%xdb:	average_dB="-30.45" peak_dB="-20.70"

*NOF:	0 . 16190_17000

%xdb:	average_dB="-40.45" peak_dB="-32.04"

*OLN:	0 . 17000_18300

%xdb:	average_dB="-27.94" peak_dB="-17.41"

*NOF:	0 . 18300_19100

%xdb:	average_dB="-42.45" peak_dB="-27.65"

*SIL:	0 . 19100_20180



You have access to all the information in every ClanLine object in the line_map. In this next example, we loop through, check if a line is a tiered ClanLine, and if so, print just the tier:

In [6]:
# again, just looking at lines 50-60 to save space

for line in clan_file.line_map[50:60]:
    if line.is_tier_line:
        print line.tier

OLN
NOF
OLN
NOF
SIL


# class ClanLine

Here are all the variables that belong to a ClanLine object:


In [7]:
# just pick an arbitrary ClanLine (index 149 in the line_map,
# i.e. the 150th line) and print all its member variables:

random_clanline = clan_file.line_map[149]
pp.pprint(random_clanline.__dict__)

{'content': '0 . ',
 'conv_block_num': 0,
 'index': 149,
 'is_clan_comment': False,
 'is_conv_block_delimiter': False,
 'is_end_header': False,
 'is_header': False,
 'is_multi_parent': False,
 'is_paus_block_delimiter': False,
 'is_tier_line': True,
 'is_tier_without_timestamp': False,
 'is_user_comment': False,
 'line': '*NOF:\t0 . \x1570240_74140\x15\n',
 'multi_line_parent': None,
 'tier': 'NOF',
 'time_offset': 74140,
 'time_onset': 70240,
 'total_time': 3900,
 'within_conv_block': False,
 'within_paus_block': False,
 'xdb_average': 0,
 'xdb_line': False,
 'xdb_peak': 0}
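
Several of these fields can be read straight off the raw line. The sketch below is a guess at how that could be done, not pyclan's parser: parse_tier_line and its regular expression are hypothetical, and they assume the timestamp is wrapped in \x15 characters as in the 'line' value above.

```python
import re

# Assumed layout: "*TIER:<tab>content \x15onset_offset\x15"
TIER_LINE = re.compile(
    r"^\*(?P<tier>[A-Z0-9]+):\t(?P<content>.*?)\x15(?P<onset>\d+)_(?P<offset>\d+)\x15"
)


def parse_tier_line(raw):
    """Return a dict of fields for a timestamped tier line, or None otherwise."""
    match = TIER_LINE.match(raw)
    if match is None:
        return None
    onset = int(match.group("onset"))
    offset = int(match.group("offset"))
    return {
        "tier": match.group("tier"),
        "content": match.group("content"),
        "time_onset": onset,
        "time_offset": offset,
        "total_time": offset - onset,  # duration in milliseconds
    }


fields = parse_tier_line("*NOF:\t0 . \x1570240_74140\x15\n")
print(fields)
```

For this line the sketch yields the same tier, content, time_onset, time_offset and total_time values as the dictionary printed above.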


# filters for ClanFile

A set of filters is available on all the classes, and they behave roughly identically across different objects. In other words, a filter like get_user_comments() will return the same kind of result whether it's called on a ClanFile, ClanBlock, or LineRange.

Here are some examples of filters and their results:

## ClanFile.get_tiers(*tiers)

get_tiers() will return a **LineRange** filled with all the lines in a ClanFile that have one of the specified tiers. For example, let's get all the lines that are "FAN" or "MAN" tiered:


In [8]:
# fan_or_man will be a LineRange object with just 
# the tiered lines that are "FAN" or "MAN"

fan_or_man = clan_file.get_tiers("FAN", "MAN")

# a LineRange object has a "total_time" member. This
# is the cumulative time in milliseconds of all the 
# ClanLines 
print fan_or_man.total_time

2134717


Or how about "FAN", "MAN", "FAF" and "OLN"?


In [9]:
fan_man_faf_oln = clan_file.get_tiers("FAN", "MAN", "FAF", "OLN")

print fan_man_faf_oln.total_time

5175587


## ClanFile.get_conv_block(block_num)

get_conv_block() returns a **ClanBlock** object given the block's integer index. For example:


In [10]:
block_42 = clan_file.get_conv_block(42)

In [11]:
block_42.total_time # in milliseconds

13290

In [12]:
block_42.onset # in milliseconds, relative to start of CLAN file

1179960

In [13]:
block_42.offset # in milliseconds, relative to start of CLAN file

1193250

Remember when we mentioned that filters are available across objects? Here's an example. Instead of calling get_tiers() on a ClanFile object, let's call it on this ClanBlock object we've pulled out:


In [14]:
fan_man_in_block42 = block_42.get_tiers("FAN", "MAN")

In [15]:
fan_man_in_block42.total_time

5780

In [16]:
for line in fan_man_in_block42.line_map:
    print "the tier: " + line.tier
    print "the timestamp: " + line.timestamp()
    print "the raw content of the line:   " + line.line
    

the tier: FAN
the timestamp: 1179960_1181080
the raw content of the line:   *FAN:	&=w4_74 . 1179960_1181080

the tier: MAN
the timestamp: 1181890_1183030
the raw content of the line:   *MAN:	&=w4_78 . 1181890_1183030

the tier: MAN
the timestamp: 1188790_1189790
the raw content of the line:   *MAN:	&=w0_90 . 1188790_1189790

the tier: FAN
the timestamp: 1190730_1193250
the raw content of the line:   *FAN:	&=w11_78 . 1190730_1193250



fan_man_in_block42 is a LineRange object representing just the FAN and MAN tiered lines in block 42 of this CLAN file.
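
The numbers above suggest that a LineRange's total_time is the sum of the individual line durations. The following sketch checks that with the four timestamps printed above. It uses plain tuples rather than pyclan objects, and filter_tiers and total_time are hypothetical helpers, not pyclan functions.

```python
# (tier, onset_ms, offset_ms) for the four lines printed above
block_42_lines = [
    ("FAN", 1179960, 1181080),
    ("MAN", 1181890, 1183030),
    ("MAN", 1188790, 1189790),
    ("FAN", 1190730, 1193250),
]


def filter_tiers(lines, *tiers):
    """Keep only the lines whose tier is one of the requested tiers."""
    wanted = set(tiers)
    return [line for line in lines if line[0] in wanted]


def total_time(lines):
    """Sum of (offset - onset) over the given lines, in milliseconds."""
    return sum(offset - onset for _tier, onset, offset in lines)


fan_man = filter_tiers(block_42_lines, "FAN", "MAN")
print(total_time(fan_man))                       # 5780, same as the output above
print(total_time(filter_tiers(fan_man, "FAN")))  # the FAN lines on their own
```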

## ClanFile.get_conv_blocks(begin=1, end=None, select=None)

Instead of picking out a single conversation block, you can select several at a time. There are two ways to call this function. Option 1 is to give it "begin" and "end" markers, which returns all the blocks between begin and end. Option 2 is to supply a list of specific block indices via "select", which returns just those blocks (in ascending order; the list itself doesn't have to be ordered).

Example of Option 1:


In [17]:
blocks_3_to_50 = clan_file.get_conv_blocks(begin=3, end=50)

print blocks_3_to_50.total_time

356880


Example of Option 2:

In [18]:
blocks_7_12_56_and_158 = clan_file.get_conv_blocks(select=[7, 12, 56, 158])

print blocks_7_12_56_and_158.total_time

32330
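
The two calling styles can be summarised with a small sketch of the index selection alone. This is an illustration, not pyclan's code: select_block_indices is a hypothetical helper, the 200-block file is invented, and treating "end" as inclusive is an assumption.

```python
def select_block_indices(available, begin=1, end=None, select=None):
    """Return the block indices to keep, in ascending order.

    Assumption: begin and end are both inclusive.
    """
    available = sorted(available)
    if not available:
        return []
    if select is not None:
        # Option 2: explicit list, returned in ascending order
        wanted = set(select)
        return [i for i in available if i in wanted]
    # Option 1: contiguous range; end=None means "up to the last block"
    last = available[-1] if end is None else end
    return [i for i in available if begin <= i <= last]


all_blocks = list(range(1, 201))  # pretend the file has blocks 1..200
print(select_block_indices(all_blocks, begin=3, end=6))
print(select_block_indices(all_blocks, select=[158, 7, 56, 12]))
```

Passing an unordered select list still yields the indices in ascending order, which matches the behaviour described for Option 2.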


The resulting object of a get_conv_blocks() function call is a **BlockGroup**.

A BlockGroup is a collection of **ClanBlock** objects, laid out into a single line_map. So you can loop through a BlockGroup's line_map just like in a ClanFile, ClanBlock, or LineRange:


In [23]:
# just looping through the first 20 lines to save space...
for line in blocks_7_12_56_and_158.line_map[0:20]:
    if line.is_tier_line:
        print "the tier:     " + line.tier
        print "timestamp:    " + line.timestamp()
        print "raw content:  " + line.line

the tier:     MAN
timestamp:    299340_300350
raw content:  *MAN:	&=w2_90 . 299340_300350

the tier:     CXN
timestamp:    579360_580520
raw content:  *CXN:	0 . 579360_580520

the tier:     FAN
timestamp:    1486680_1487840
raw content:  *FAN:	&=w6_54 . 1486680_1487840

the tier:     OLN
timestamp:    1487840_1488640
raw content:  *OLN:	0 . 1487840_1488640

the tier:     FAN
timestamp:    1488640_1490680
raw content:  *FAN:	&=w8_17 . 1488640_1490680

the tier:     OLN
timestamp:    1490680_1491660
raw content:  *OLN:	0 . 1490680_1491660



Each block within a **BlockGroup** is also represented as a distinct **ClanBlock** in the BlockGroup.blocks member variable (so not just a list of **ClanLines** in the line_map variable). So in our current example, we should have 4 ClanBlock objects in the blocks_7_12_56_and_158 BlockGroup variable:


In [21]:
blocks_7_12_56_and_158.blocks

[<pyclan.elements.ClanBlock at 0x105342e90>,
 <pyclan.elements.ClanBlock at 0x105361450>,
 <pyclan.elements.ClanBlock at 0x1053614d0>,
 <pyclan.elements.ClanBlock at 0x105361510>]

Looks like we do. Let's loop through them and print some info about each one:

In [30]:
for block in blocks_7_12_56_and_158.blocks:
    print "block index:   {}".format(block.index)
    print "onset:         {}".format(block.onset)
    print "offset:        {}".format(block.offset)
    print "total time:    {}".format(block.total_time)
    print
    

block index:   7
onset:         299340
offset:        300350
total time:    1010

block index:   12
onset:         579360
offset:        580520
total time:    1160

block index:   56
onset:         1486680
offset:        1500240
total time:    13560

block index:   158
onset:         4004000
offset:        4020600
total time:    16600
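
In this output, each block's total time equals its offset minus its onset, and the four totals add up to the 32330 printed for the BlockGroup earlier. The short check below reproduces that arithmetic from the printed values. It is an observation about this sample, not a statement of how pyclan computes these numbers internally.

```python
# (index, onset_ms, offset_ms) copied from the output above
blocks = [
    (7, 299340, 300350),
    (12, 579360, 580520),
    (56, 1486680, 1500240),
    (158, 4004000, 4020600),
]

# duration of each block, keyed by block index
durations = {index: offset - onset for index, onset, offset in blocks}
group_total = sum(durations.values())

print(durations)
print(group_total)
```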



# More Filters

Here are some more useful filters that are available:


## get_user_comments()

Returns a list of comment strings. In the sample output below, both %com and %xcom lines are included:

In [36]:
user_comments = clan_file.get_user_comments()

pp.pprint(user_comments)

[%com:	this is a user comment
,
 %xcom:	subregion 1 of 5  (ranked 1 of 5)  starts at 2100000 -- previous timestamp adjusted: was 2100420
,
 %com:	do not know what SIS is saying here
,
 %com:	FAT is listing off book titles of a stack of books
,
 %com:	refers to a piece of clothing, not the fabric
,
 %com:	MOT misspeaks, means mothership, corrects herself
,
 %xcom:	subregion 1 of 5  (ranked 1 of 5)  ends at 5700000 -- previous
,
 %xcom:	subregion 2 of 5  (ranked 2 of 5)  starts at 8400000 -- previous timestamp adjusted: was 8402910
,
 %com:	refers to physical DVDs they're looking at
,
 %com:	refers to physical DVDs they're looking at
,
 %com:	refers to physical DVDs they're looking at
,
 %com:	refers to physical DVDs they're looking at
,
 %com:	the movie, holding the DVD case
,
 %xcom:	the movie, looking at the DVD case
,
 %com:	super muffled while CHI has a coat on
,
 %com:	can't understand what MOT and SIS are saying here
,
 %com:	begin car ride
,
 %com:	MOT takes the vest off here to 

In [33]:
comments_in_block = block_42.get_user_comments()

# no comments in this block
print comments_in_block

[]
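
Conceptually, this filter keeps the lines that start with a comment marker. The sketch below shows that idea on raw strings. It is not pyclan's implementation: user_comments and COMMENT_PREFIXES are hypothetical names, and treating exactly %com and %xcom as the comment markers is an assumption based on the output above.

```python
# Assumed comment markers, taken from the sample output above
COMMENT_PREFIXES = ("%com:", "%xcom:")


def user_comments(raw_lines):
    """Return the raw lines that start with one of the comment prefixes."""
    return [line for line in raw_lines if line.startswith(COMMENT_PREFIXES)]


# Lines taken from outputs shown earlier in this notebook
sample = [
    "*NOF:\t0 . 16190_17000\n",
    "%com:\tthis is a user comment\n",
    '%xdb:\taverage_dB="-40.45" peak_dB="-32.04"\n',
    "%com:\tbegin car ride\n",
]
comments = user_comments(sample)
print(comments)
```

Tier lines and %xdb lines are skipped, so a block without comments gives an empty list, as block 42 did above.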


## get_within_time(begin=0, end=None)

This returns a LineRange with all the lines within the specified time range. If "begin" is left out, it starts from the very beginning and runs until "end". If "end" is left out, it starts from "begin" and runs until the very end.

In [53]:
lines_within_10000_and_20000 = clan_file.get_within_time(begin=10000, end=20000)

print "\nthe lines within 10000ms and 20000ms: \n\n{}\n\n".format(lines_within_10000_and_20000.line_map)
 

print "total time of LineRange: {}".format(lines_within_10000_and_20000.total_time)



the lines within 10000ms and 20000ms: 

[*NOF:	0 . 11890_14510
, %xdb:	average_dB="-34.55" peak_dB="-19.83"
, *OLN:	0 . 14510_16190
, %xdb:	average_dB="-30.45" peak_dB="-20.70"
, *NOF:	0 . 16190_17000
, %xdb:	average_dB="-40.45" peak_dB="-32.04"
, *OLN:	0 . 17000_18300
, %xdb:	average_dB="-27.94" peak_dB="-17.41"
, *NOF:	0 . 18300_19100
, %xdb:	average_dB="-42.45" peak_dB="-27.65"
, *SIL:	0 . 19100_20180
, %xdb:	average_dB="-46.96" peak_dB="-31.70"
, @Eg:	Pause 1
, @Bg:	Conversation 1
]


total time of LineRange: 8290
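
Note that the last tier line in this output ends at 20180, past the requested end of 20000. One selection rule consistent with that is to keep every line whose onset falls inside the range. The sketch below tries that rule on the timestamps shown above, plus one later line from earlier in this notebook. The rule is a guess and within_time is a hypothetical helper; pyclan's actual criterion may differ, and the sketch ignores the %xdb and @Bg/@Eg lines.

```python
def within_time(lines, begin=0, end=None):
    """Keep lines whose onset lies in [begin, end]; end=None means no upper bound."""
    return [
        (tier, onset, offset)
        for tier, onset, offset in lines
        if onset >= begin and (end is None or onset <= end)
    ]


# (tier, onset_ms, offset_ms) taken from outputs in this notebook
lines = [
    ("NOF", 11890, 14510),
    ("OLN", 14510, 16190),
    ("NOF", 16190, 17000),
    ("OLN", 17000, 18300),
    ("NOF", 18300, 19100),
    ("SIL", 19100, 20180),
    ("NOF", 70240, 74140),
]

selected = within_time(lines, begin=10000, end=20000)
selected_total = sum(offset - onset for _tier, onset, offset in selected)
print(len(selected))
print(selected_total)  # 8290, same as the total_time printed above
```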


# ClanFile.new_file_from_blocks(path, blocks=[], rewrite_timestamps=False, begin=1, end=None)

This function lets you create a brand new .cha file containing only the blocks you specify. You can either specify a range with begin= and end=, or select specific blocks by passing a list of indices as the blocks=[] argument. For example:

In [54]:
clan_file.new_file_from_blocks("new_CLAN_file_blocks10-45.cha", begin=10, end=45)

or

In [55]:
clan_file.new_file_from_blocks("new_CLAN_file_blocks_3_7_13_29_147.cha", blocks=[3, 7, 13, 29, 147])

The rewrite_timestamps argument hasn't been implemented yet, but once it is, you'll be able to have all the timestamps in the new .cha file start at 0 and be contiguous with each other.
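
To illustrate what "start at 0 and be contiguous" could mean, here is one possible interpretation as a standalone sketch. Since the feature does not exist yet, everything here is an assumption: rebase_blocks is a hypothetical helper, and it simply shifts each block so that it begins where the previous one ended.

```python
def rebase_blocks(blocks):
    """Shift blocks so the first starts at 0 and each next one starts where the last ended.

    blocks: list of blocks, each a time-ordered list of (onset_ms, offset_ms) pairs.
    """
    rebased = []
    cursor = 0  # where the next block should start on the new timeline
    for block in blocks:
        if not block:
            rebased.append([])
            continue
        shift = cursor - block[0][0]
        new_block = [(onset + shift, offset + shift) for onset, offset in block]
        rebased.append(new_block)
        cursor = new_block[-1][1]
    return rebased


# onset/offset of blocks 7 and 12 from the earlier output, one interval per block
rebased = rebase_blocks([[(299340, 300350)], [(579360, 580520)]])
print(rebased)
```

Durations are preserved (1010 ms and 1160 ms here); only the gaps between blocks are removed.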

# More examples coming soon....