**Intra-sentential code switch frequency**

We analyze the conversation transcripts in these files to measure how often a child switches between English and Cantonese within individual sentences (utterances).

In [None]:
# CHILDES data used in this analysis,
# specifically the Eng/TimMixed folder
# NOTE: folder_path needs to be edited before you run this yourself
!wget https://git.talkbank.org/childes/data/Biling/YipMatthews.zip

--2024-12-07 07:12:46--  https://git.talkbank.org/childes/data/Biling/YipMatthews.zip
Resolving git.talkbank.org (git.talkbank.org)... 128.2.24.88
Connecting to git.talkbank.org (git.talkbank.org)|128.2.24.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 319 [text/html]
Saving to: ‘YipMatthews.zip.1’


2024-12-07 07:12:46 (125 MB/s) - ‘YipMatthews.zip.1’ saved [319/319]
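
The log above reports a 319-byte `text/html` response rather than an archive-sized download, so it may be worth confirming that the saved file is a real zip before unpacking it. The following is a minimal sketch using the standard library's `zipfile.is_zipfile`. The helper name `looks_like_zip` and the throwaway file names are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile
import zipfile

def looks_like_zip(path):
    """Return True if the file at `path` has a valid zip signature."""
    return zipfile.is_zipfile(path)

# Demonstration on two throwaway files (hypothetical names):
# one holds HTML text, the other is a genuine archive.
with tempfile.TemporaryDirectory() as tmp:
    html_path = os.path.join(tmp, "page.zip")
    with open(html_path, "w") as f:
        f.write("<html>not an archive</html>")

    zip_path = os.path.join(tmp, "real.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("sample.cha", "@Begin\n@End\n")

    html_is_zip = looks_like_zip(html_path)
    real_is_zip = looks_like_zip(zip_path)

print(html_is_zip, real_is_zip)
```

In the notebook itself, the same check could be applied to the downloaded file, for example `looks_like_zip('YipMatthews.zip')`.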



In [None]:
# Install the 'pylangacq' package,
# which reads .cha files (the format of this data)
!pip install pylangacq



In [None]:
import re
import os
from pylangacq import read_chat

def identify_language(word):
  if re.match(r'^[a-zA-Z]+$', word):
    return 'eng'
  elif any(not char.isascii() for char in word):
    return 'yue'
  else:
    # other: punctuation, numbers, etc
    return 'other'

In [None]:
def analyze_utterance(utterance):
    # Drop everything up to the first colon (the speaker tag, if present)
    utterance = utterance.split(':', 1)[-1].strip()
    # Strip a leading [- yue] language marker before inspecting the words
    if utterance.startswith('[- yue]'):
        utterance = utterance[7:].strip()
    # Utterances pre-tagged [- eng] are counted as English without further checks
    elif utterance.startswith('[- eng]'):
        return 'eng'

    words = utterance.split()
    languages = [identify_language(word) for word in words]
    if 'eng' in languages and 'yue' in languages:
        return 'intra-sentential'
    elif 'eng' in languages:
        return 'eng'
    elif 'yue' in languages:
        return 'yue'
    else:
        return 'other'

In [None]:
def analyze_intra_sentential_code_switching(file_path):
    corpus = read_chat(file_path)
    utterances = corpus.utterances()

    # One counter per category returned by analyze_utterance
    language_counts = {'intra-sentential': 0, 'eng': 0, 'yue': 0, 'other': 0}

    total_utterances = 0

    for utterance in utterances:
        # Only the child's utterances are counted
        if utterance.participant == 'CHI':
            result = analyze_utterance(utterance.tiers[utterance.participant])

            # analyze_utterance returns a plain label here, but a later cell
            # redefines it to return a tuple; accept either form
            language = result[0] if isinstance(result, tuple) else result

            # Map 'mixed' to 'intra-sentential' for consistency
            if language == 'mixed':
                language = 'intra-sentential'

            language_counts[language] += 1
            total_utterances += 1

    return {
        'total_utterances': total_utterances,
        'intra_sentential_count': language_counts['intra-sentential'],
        'intra_sentential_frequency': language_counts['intra-sentential'] / total_utterances if total_utterances > 0 else 0,
        'language_distribution': {
            lang: (count / total_utterances if total_utterances > 0 else 0)
            for lang, count in language_counts.items()
        }
    }

def analyze_folder(folder_path):
    results = []
    cha_files = [f for f in os.listdir(folder_path) if f.endswith('.cha')]
    cha_files.sort()

    for i, file_name in enumerate(cha_files):
        # Only every other file (even positions in the sorted list) is analyzed
        if i % 2 == 0:
            file_path = os.path.join(folder_path, file_name)
            file_results = analyze_intra_sentential_code_switching(file_path)
            results.append((file_name, file_results))

    return results

folder_path = '/content/drive/MyDrive/TimMixed'
folder_results = analyze_folder(folder_path)

for file_name, result in folder_results:
    print(f"Results for {file_name}:")
    print(f"Total utterances: {result['total_utterances']}")
    print(f"Intra-sentential count: {result['intra_sentential_count']} ({result['intra_sentential_frequency']:.2%})")
    print(f"Code-Switching Frequency: {result['intra_sentential_frequency']:.2%}")
    print("Language distribution:")
    for lang, freq in result['language_distribution'].items():
        print(f"  {lang}: {freq:.2%}")
    print()

Results for 010520.cha:
Total utterances: 123
Intra-sentential count: 0 (0.00%)
Code-Switching Frequency: 0.00%
Language distribution:
  intra-sentential: 0.00%
  eng: 41.46%
  yue: 31.71%
  other: 26.83%

Results for 010624.cha:
Total utterances: 132
Intra-sentential count: 0 (0.00%)
Code-Switching Frequency: 0.00%
Language distribution:
  intra-sentential: 0.00%
  eng: 54.55%
  yue: 35.61%
  other: 9.85%

Results for 010723.cha:
Total utterances: 114
Intra-sentential count: 0 (0.00%)
Code-Switching Frequency: 0.00%
Language distribution:
  intra-sentential: 0.00%
  eng: 65.79%
  yue: 24.56%
  other: 9.65%

Results for 010826.cha:
Total utterances: 273
Intra-sentential count: 1 (0.37%)
Code-Switching Frequency: 0.37%
Language distribution:
  intra-sentential: 0.37%
  eng: 43.22%
  yue: 48.72%
  other: 7.69%

Results for 011002.cha:
Total utterances: 204
Intra-sentential count: 3 (1.47%)
Code-Switching Frequency: 1.47%
Language distribution:
  intra-sentential: 1.47%
  eng: 42.65%
  yu

**Results**:

total_utterances: total number of utterances analyzed in the file

intra_sentential_count: number of utterances that contain intra-sentential code-switching (a mix of English and Cantonese words)

intra_sentential_frequency: frequency of intra-sentential code-switching

language_distribution:
Shows the proportion of utterances in each category:

'intra-sentential': same as intra_sentential_frequency

'eng': Proportion of utterances that are purely in English

'yue': Proportion of utterances that are purely in Cantonese

'other': Proportion of utterances in which no word is recognized as English or Cantonese (for example, only punctuation or numbers)
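
To make the four categories concrete, here is a small self-contained sketch that applies the same word-level rule to a few made-up utterances. The sample strings are illustrative only and are not taken from the corpus; the helper name `classify` is an assumption that mirrors the logic of `analyze_utterance` without the speaker-tag handling.

```python
import re

def identify_language(word):
    # Same rule as in the notebook: letters only -> English,
    # any non-ASCII character -> Cantonese, everything else -> other
    if re.match(r'^[a-zA-Z]+$', word):
        return 'eng'
    if any(not ch.isascii() for ch in word):
        return 'yue'
    return 'other'

def classify(utterance):
    # Hypothetical helper: label an utterance from its word-level labels
    langs = {identify_language(w) for w in utterance.split()}
    if 'eng' in langs and 'yue' in langs:
        return 'intra-sentential'
    if 'eng' in langs:
        return 'eng'
    if 'yue' in langs:
        return 'yue'
    return 'other'

# Made-up examples, one per category
samples = ["我 要 apple .", "more juice .", "唔 要 .", "0 ."]
labels = [classify(s) for s in samples]

# Edge cases of the word-level rule (hypothetical tokens)
edge_cases = {tok: identify_language(tok) for tok in ["xxx", "ngo5"]}

print(labels)
print(edge_cases)
```

Two properties of the rule are worth keeping in mind when reading the percentages: any all-letter token is treated as English, which would include a CHAT placeholder such as `xxx` if it appears in the tier text, and an ASCII token containing digits falls into 'other'.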


**Grammatical Structure of Code Switching**

Here, we examine the parts of speech of the English words used in mixed-language sentences. We also compute average sentence lengths, the frequency of mixed-language sentences, and the distribution of utterances across language categories.

In [None]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download the NLTK data packages needed for
# tokenization and part-of-speech tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

def identify_language(word):
  if re.match(r'^[a-zA-Z]+$', word):
    return 'eng'
  elif any(not char.isascii() for char in word):
    return 'yue'
  else:
    # other: punctuation, numbers, etc
    return 'other'

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [None]:
def analyze_pos(word):
  # Tag the word on its own (no sentence context) and map its Penn Treebank tag to a coarse class
  pos = pos_tag([word])[0][1]
  if pos.startswith('VB'):
    return 'verb'
  elif pos.startswith('NN'):
    return 'noun'
  elif pos.startswith('JJ'):
    return 'adjective'
  elif pos.startswith('RB'):
    return 'adverb'
  else:
    return 'other'

In [None]:
def analyze_utterance(utterance):
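    # Drop everything up to the first colon (the speaker tag, if present)
    utterance = utterance.split(':', 1)[-1].strip()
    # Strip a leading [- yue] language marker before inspecting the words
    if utterance.startswith('[- yue]'):
        utterance = utterance[7:].strip()
    # NOTE: utterances pre-tagged [- eng] return early with zero word counts,
    # so they add nothing to the sentence-length totals
    elif utterance.startswith('[- eng]'):
        return 'eng', [], 0, 0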

    words = utterance.split()
    languages = [identify_language(word) for word in words]
    eng_words = [word for word, lang in zip(words, languages) if lang == 'eng']
    eng_pos = [analyze_pos(word) for word in eng_words]

    if 'eng' in languages and 'yue' in languages:
        return 'mixed', eng_pos, len(words), len(eng_words)
    elif 'eng' in languages:
        return 'eng', eng_pos, len(words), len(eng_words)
    elif 'yue' in languages:
        return 'yue', eng_pos, len(words), len(eng_words)
    else:
        return 'other', eng_pos, len(words), len(eng_words)

In [None]:
def analyze_file(file_path):
    corpus = read_chat(file_path)
    utterances = corpus.utterances()

    language_counts = {'mixed': 0, 'eng': 0, 'yue': 0, 'other': 0}
    pos_counts = {'noun': 0, 'verb': 0, 'adjective': 0, 'adverb': 0, 'other': 0}
    total_words = 0
    total_eng_words = 0
    total_sentences = 0
    mixed_sentences = 0
    mixed_words = 0

    for utterance in utterances:
        if utterance.participant == 'CHI':
            language, eng_pos, sent_len, eng_len = analyze_utterance(utterance.tiers[utterance.participant])
            language_counts[language] += 1
            total_sentences += 1
            total_words += sent_len
            total_eng_words += eng_len
            if language == 'mixed':
                mixed_sentences += 1
                mixed_words += sent_len
                for pos in eng_pos:
                    pos_counts[pos] += 1

    # calculate the averages
    avg_sent_len = total_words / total_sentences if total_sentences > 0 else 0
    avg_mixed_sent_len = mixed_words / mixed_sentences if mixed_sentences > 0 else 0
    avg_eng_words = total_eng_words / total_sentences if total_sentences > 0 else 0


    # Total English words tagged in mixed utterances (denominator for the POS shares)
    pos_total = sum(pos_counts.values())

    return {
        'file_name': os.path.basename(file_path),
        'total_sentences': total_sentences,
        'mixed_sentence_count': language_counts['mixed'],
        'mixed_sentence_frequency': language_counts['mixed'] / total_sentences if total_sentences > 0 else 0,
        'avg_sentence_length': avg_sent_len,
        'avg_mixed_sentence_length': avg_mixed_sent_len,
        'avg_english_words_per_sentence': avg_eng_words,
        # Empty when the file has no mixed utterances
        'pos_distribution': {pos: count / pos_total for pos, count in pos_counts.items()} if pos_total > 0 else {},
        'language_distribution': {lang: (count / total_sentences if total_sentences > 0 else 0) for lang, count in language_counts.items()}
    }

In [None]:
def analyze_folder(folder_path):
    results = []
    cha_files = [f for f in os.listdir(folder_path) if f.endswith('.cha')]
    cha_files.sort()

    for i, file_name in enumerate(cha_files):
        # Only every other file (even positions in the sorted list) is analyzed
        if i % 2 == 0:
            file_path = os.path.join(folder_path, file_name)
            file_results = analyze_file(file_path)
            results.append((file_name, file_results))

    return results

folder_path = '/content/drive/MyDrive/TimMixed'
folder_results = analyze_folder(folder_path)

for file_name, result in folder_results:
    print(f"Results for {file_name}:")
    print(f"Total sentences: {result['total_sentences']}")
    print(f"Mixed sentences: {result['mixed_sentence_count']} ({result['mixed_sentence_frequency']:.2%})")
    print(f"Average sentence length: {result['avg_sentence_length']:.2f}")
    print(f"Average mixed sentence length: {result['avg_mixed_sentence_length']:.2f}")
    print(f"Average English words per sentence: {result['avg_english_words_per_sentence']:.2f}")
    print("POS distribution of English words in mixed sentences:")
    for pos, freq in result['pos_distribution'].items():
        print(f"  {pos}: {freq:.2%}")
    print("Language distribution:")
    for lang, freq in result['language_distribution'].items():
        print(f"  {lang}: {freq:.2%}")
    print()

Results for 010520.cha:
Total sentences: 123
Mixed sentences: 0 (0.00%)
Average sentence length: 2.33
Average mixed sentence length: 0.00
Average English words per sentence: 0.44
POS distribution of English words in mixed sentences:
Language distribution:
  mixed: 0.00%
  eng: 41.46%
  yue: 31.71%
  other: 26.83%

Results for 010624.cha:
Total sentences: 132
Mixed sentences: 0 (0.00%)
Average sentence length: 2.36
Average mixed sentence length: 0.00
Average English words per sentence: 0.61
POS distribution of English words in mixed sentences:
Language distribution:
  mixed: 0.00%
  eng: 54.55%
  yue: 35.61%
  other: 9.85%

Results for 010723.cha:
Total sentences: 114
Mixed sentences: 0 (0.00%)
Average sentence length: 2.21
Average mixed sentence length: 0.00
Average English words per sentence: 0.69
POS distribution of English words in mixed sentences:
Language distribution:
  mixed: 0.00%
  eng: 65.79%
  yue: 24.56%
  other: 9.65%

Results for 010826.cha:
Total sentences: 273
Mixed sen

**Results**:

Total sentences: Total number of utterances analyzed in the file.

Mixed sentences: Number and percentage of utterances that contain both English and Cantonese

Average sentence length: The mean number of words per utterance across all utterances.

Average mixed sentence length: The mean number of words in utterances that contain both English and Cantonese.

Average English words per sentence: The mean number of English words per utterance, taken across all utterances.

POS distribution of English words in mixed sentences:
This shows the distribution of parts of speech for English words in mixed utterances.

noun: Percentage of English words that are nouns.

verb: Percentage of English words that are verbs.

adjective: Percentage of English words that are adjectives.

adverb: Percentage of English words that are adverbs.

other: Percentage of English words that are other parts of speech or unclassified.

Language distribution:
This shows the proportion of utterances in each language category.

mixed: Percentage of utterances containing both English and Cantonese.

eng: Percentage of utterances that are purely in English.

yue: Percentage of utterances that are purely in Cantonese.

other: Percentage of utterances that don't fit into the above categories (might include other languages, non-linguistic vocalizations, etc.).
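
As a worked illustration of how the POS percentages are formed, the sketch below maps a handful of Penn Treebank tags to the coarse classes used in `analyze_pos` and normalizes the counts into shares. The tag list is hypothetical and stands in for the tags of English words found in mixed utterances; the helper name `coarse_pos` is an assumption, and NLTK is deliberately left out so the example runs on its own.

```python
from collections import Counter

def coarse_pos(tag):
    # Same prefix rule as analyze_pos, applied to an already-known tag
    for prefix, label in (('VB', 'verb'), ('NN', 'noun'),
                          ('JJ', 'adjective'), ('RB', 'adverb')):
        if tag.startswith(prefix):
            return label
    return 'other'

# Hypothetical tags for five English words in mixed utterances
tags = ['NN', 'NNS', 'VBZ', 'JJ', 'DT']

counts = Counter(coarse_pos(t) for t in tags)
total = sum(counts.values())

# Each share is that class's count divided by all tagged English words
dist = {label: n / total for label, n in counts.items()}

for label, share in dist.items():
    print(f"  {label}: {share:.2%}")
```

Because the shares are taken over English words in mixed utterances only, a file with no mixed utterances has nothing to normalize, which is why the POS section is empty in the first few outputs above.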

**Comparing Files**

Here, we compare the analysis results across multiple files. The comparison sorts the files by name and reports the change in each key metric from one file to the next. The metrics are the code-switching frequency, the average sentence length, the average mixed sentence length, and the average number of English words per sentence.

In [None]:
def compare_files(results):
    # Sort results by file name
    sorted_results = sorted(results, key=lambda x: x[0])

    print("File Comparison:")
    for i, (file_name, result) in enumerate(sorted_results):
        print(f"\nFile: {file_name}")
        print(f"Code-switching frequency: {result['mixed_sentence_frequency']:.2%}")
        print(f"Average sentence length: {result['avg_sentence_length']:.2f}")
        print(f"Average mixed sentence length: {result['avg_mixed_sentence_length']:.2f}")
        print(f"Average English words per sentence: {result['avg_english_words_per_sentence']:.2f}")

        if i > 0:
            prev_result = sorted_results[i-1][1]
            print("\nChanges from previous file:")
            print(f"Code-switching frequency change: {result['mixed_sentence_frequency'] - prev_result['mixed_sentence_frequency']:.2%}")
            print(f"Average sentence length change: {result['avg_sentence_length'] - prev_result['avg_sentence_length']:.2f}")
            print(f"Average mixed sentence length change: {result['avg_mixed_sentence_length'] - prev_result['avg_mixed_sentence_length']:.2f}")
            print(f"Average English words per sentence change: {result['avg_english_words_per_sentence'] - prev_result['avg_english_words_per_sentence']:.2f}")

folder_path = '/content/drive/MyDrive/TimMixed'
folder_results = analyze_folder(folder_path)

compare_files(folder_results)

File Comparison:

File: 010520.cha
Code-switching frequency: 3.88%
Average sentence length: 3.46
Average mixed sentence length: 6.58
Average English words per sentence: 1.07

File: 010624.cha
Code-switching frequency: 8.82%
Average sentence length: 3.61
Average mixed sentence length: 5.94
Average English words per sentence: 0.51

Changes from previous file:
Code-switching frequency change: 4.94%
Average sentence length change: 0.15
Average mixed sentence length change: -0.64
Average English words per sentence change: -0.57

File: 010723.cha
Code-switching frequency: 5.71%
Average sentence length: 3.33
Average mixed sentence length: 6.08
Average English words per sentence: 0.84

Changes from previous file:
Code-switching frequency change: -3.11%
Average sentence length change: -0.28
Average mixed sentence length change: 0.14
Average English words per sentence change: 0.33

File: 010826.cha
Code-switching frequency: 3.01%
Average sentence length: 3.34
Average mixed sentence length: 5.64


**Results**:

Code-switching frequency change: Difference, in percentage points, in how often the child mixes English and Cantonese within sentences, relative to the previous file

Average sentence length change: Change in the average number of words per sentence, a rough indicator of overall sentence complexity

Average mixed sentence length change: Shows how the length of sentences containing both languages has changed.

Average English words per sentence change: Represents the shift in the average number of English words used in each sentence.
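
The "change" values are plain differences between consecutive files, so a frequency change is expressed in percentage points rather than as a relative change. A minimal sketch, using the rounded code-switching frequencies printed for the first three files in the comparison output above; the helper name `pairwise_changes` is an assumption for illustration.

```python
def pairwise_changes(values):
    """Return the difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Rounded code-switching frequencies from the comparison output
# (010520.cha, 010624.cha, 010723.cha)
freqs = [0.0388, 0.0882, 0.0571]

changes = pairwise_changes(freqs)

for change in changes:
    # Multiply by 100 to express the difference in percentage points
    print(f"{change * 100:+.2f} percentage points")
```

The same helper applies unchanged to the sentence-length metrics, where the differences are in words per sentence.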