# Text Feature Extraction Pipeline

## Overview
This Jupyter notebook extracts linguistic features from multilingual video transcripts. It uses spaCy and textdescriptives, with a dedicated spaCy model for each of 24 languages and a blank-pipeline fallback for every other detected language.

### Key Features
- Multi-language support (24 languages)
- Comprehensive text metrics extraction:
  - Readability scores
  - Linguistic complexity measures
  - Text quality metrics
  - POS tag distributions
  - Dependency parsing features
- Fallback handling for unsupported languages
- Progress tracking and resumable processing

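### Prerequisites
The notebook imports spaCy, pandas, textdescriptives, webvtt, and tqdm, plus `datetime` and `pathlib` from the standard library.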

In [None]:
from datetime import datetime
from pathlib import Path
import spacy
import pandas as pd
import textdescriptives
import webvtt
from tqdm import tqdm



### Language Models
The pipeline loads a medium (`md`) spaCy model for each of the following 24 languages:
- Western and Southern European: ca, nl, en, fr, de, el, it, pt, es
- East Asian: zh, ja, ko
- Nordic: da, fi, nb, sv
- Central and Eastern European: hr, lt, mk, pl, ro, ru, sl, uk

In [None]:
!python -m spacy download ca_core_news_md
!python -m spacy download zh_core_web_md
!python -m spacy download hr_core_news_md
!python -m spacy download da_core_news_md
!python -m spacy download nl_core_news_md
!python -m spacy download en_core_web_md
!python -m spacy download fi_core_news_md
!python -m spacy download fr_core_news_md
!python -m spacy download de_core_news_md
!python -m spacy download el_core_news_md
!python -m spacy download it_core_news_md
!python -m spacy download ja_core_news_md
!python -m spacy download ko_core_news_md
!python -m spacy download lt_core_news_md
!python -m spacy download mk_core_news_md
!python -m spacy download nb_core_news_md
!python -m spacy download pl_core_news_md
!python -m spacy download pt_core_news_md
!python -m spacy download ro_core_news_md
!python -m spacy download ru_core_news_md
!python -m spacy download sl_core_news_md
!python -m spacy download es_core_news_md
!python -m spacy download sv_core_news_md
!python -m spacy download uk_core_news_md
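
Since spaCy models are installed as ordinary Python packages, a quick check can confirm that the downloads above succeeded before the long extraction run starts. The following is a minimal sketch using only the standard library; `REQUIRED_MODELS` and `missing_models` are hypothetical names, and the list shown is only a subset of the 24 models.

```python
import importlib.util

# Subset of the model packages downloaded above; extend to all 24 as needed.
REQUIRED_MODELS = ["en_core_web_md", "de_core_news_md", "zh_core_web_md"]


def missing_models(model_names):
    """Return the model packages that cannot be found in this environment."""
    return [
        name for name in model_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_models(REQUIRED_MODELS)
if missing:
    print("Missing spaCy models:", ", ".join(missing))
else:
    print("All listed spaCy models are installed.")
```

Running this right after the download cell surfaces a failed download immediately, rather than partway through the transcript loop.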



### Input/Output
- Input: 
  - `Video_Transcriptions.csv`: Video transcripts
  - `Detected_Language_Confident.csv`: Language detection results
- Output: 
  - `processed_text_features.csv`: Extracted text features

### Process Flow
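1. Load the transcriptions and language detection data, then merge them on `Video ID`
2. Look up the spaCy model for each transcript's detected language
3. Extract text features with textdescriptives
4. Fall back to a blank pipeline (descriptive statistics, readability, quality) when no model exists for the language
5. Append each result to the output CSV so an interrupted run can resume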
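
The last step depends on two things: skipping IDs that are already in the output file, and writing the CSV header exactly once. The sketch below illustrates that pattern with the standard library only. The helpers `load_processed_ids` and `append_row`, the toy two-column schema, and the temporary file are all assumptions made for illustration and are not part of the notebook. In this sketch the header is written only when the file is missing or empty.

```python
import csv
import os
import tempfile

FIELDS = ["video_id", "n_tokens"]  # toy schema, not the notebook's 74 columns


def load_processed_ids(path):
    """Return the video IDs already present in the output file, if any."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return set()
    with open(path, newline="", encoding="utf-8") as fh:
        return {row["video_id"] for row in csv.DictReader(fh)}


def append_row(path, row):
    """Append one row, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerow(row)


# Simulate an interrupted run followed by a resumed one.
out_path = os.path.join(tempfile.mkdtemp(), "features.csv")
videos = [("a1", 10), ("b2", 20), ("c3", 30)]

for video_id, n_tokens in videos[:2]:  # first run stops after two rows
    append_row(out_path, {"video_id": video_id, "n_tokens": n_tokens})

done = load_processed_ids(out_path)  # resumed run: read what already exists
for video_id, n_tokens in videos:
    if video_id in done:
        continue  # already processed in the earlier run
    append_row(out_path, {"video_id": video_id, "n_tokens": n_tokens})

final_ids = load_processed_ids(out_path)
```

Checking the file on disk, rather than an in-memory set, keeps the header decision correct even if the output file exists but contains no data rows.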

In [4]:
# load transcriptions
Video_Transcriptions = pd.read_csv("Video_Transcriptions.csv")
# and detected language
dlc = pd.read_csv("Detected_Language_Confident.csv")

df = Video_Transcriptions.merge(dlc, on='Video ID', how='left')
df.head()

Unnamed: 0,Video ID,transcript,language,confidence
0,0-pwca91OCM,These pork and mango spring rolls are one of ...,en,0.999961
1,2K0AHZh6OrM,"What's up y'all, Forrest here. To start off t...",en,0.999187
2,-ED-vjRCKaE,"What's up guys, I'm RandomFrankP back with an...",en,0.998429
3,LsNg-KrFxCA,Let me go to Big Head Joe's for you. They hav...,en,0.998661
4,1S0lygj3w84,Would you be willing to trade the outfit you ...,en,0.986416


In [None]:
lang2model = {
    "ca": "ca_core_news_md",
    "zh": "zh_core_web_md",
    "hr": "hr_core_news_md",
    "da": "da_core_news_md",
    "nl": "nl_core_news_md",
    "en": "en_core_web_md",
    "fi": "fi_core_news_md",
    "fr": "fr_core_news_md",
    "de": "de_core_news_md",
    "el": "el_core_news_md",
    "it": "it_core_news_md",
    "ja": "ja_core_news_md",
    "ko": "ko_core_news_md",
    "lt": "lt_core_news_md",
    "mk": "mk_core_news_md",
    "nb": "nb_core_news_md",
    "pl": "pl_core_news_md",
    "pt": "pt_core_news_md",
    "ro": "ro_core_news_md",
    "ru": "ru_core_news_md",
    "sl": "sl_core_news_md",
    "es": "es_core_news_md",
    "sv": "sv_core_news_md",
    "uk": "uk_core_news_md"
}

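# Fallback for languages without a model: a blank English pipeline (English
# tokenizer rules, no trained components) with rule-based sentence splitting
# and the textdescriptives components that do not need a trained model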
fallback_model = spacy.blank("en")
fallback_model.add_pipe("sentencizer")
fallback_model.add_pipe("textdescriptives/descriptive_stats")
fallback_model.add_pipe("textdescriptives/readability")
fallback_model.add_pipe("textdescriptives/quality")

expected_columns = [
    "text", "passed_quality_check", "n_stop_words", "alpha_ratio", "mean_word_length",
    "doc_length", "symbol_to_word_ratio_#", "proportion_ellipsis", "proportion_bullet_points", 
    "contains_lorem ipsum", "duplicate_line_chr_fraction", "duplicate_paragraph_chr_fraction", 
    "duplicate_ngram_chr_fraction_5", "duplicate_ngram_chr_fraction_6", "duplicate_ngram_chr_fraction_7", 
    "duplicate_ngram_chr_fraction_8", "duplicate_ngram_chr_fraction_9", "duplicate_ngram_chr_fraction_10", 
    "top_ngram_chr_fraction_2", "top_ngram_chr_fraction_3", "top_ngram_chr_fraction_4", 
    "oov_ratio", "pos_prop_ADJ", "pos_prop_ADP", "pos_prop_ADV", "pos_prop_AUX", 
    "pos_prop_CCONJ", "pos_prop_DET", "pos_prop_INTJ", "pos_prop_NOUN", "pos_prop_NUM", 
    "pos_prop_PART", "pos_prop_PRON", "pos_prop_PROPN", "pos_prop_PUNCT", "pos_prop_SCONJ", 
    "pos_prop_SYM", "pos_prop_VERB", "pos_prop_X", "dependency_distance_mean", "dependency_distance_std", 
    "prop_adjacent_dependency_relation_mean", "prop_adjacent_dependency_relation_std", 
    "entropy", "perplexity", "per_word_perplexity", "first_order_coherence", "second_order_coherence", 
    "flesch_reading_ease", "flesch_kincaid_grade", "smog", "gunning_fog", "automated_readability_index", 
    "coleman_liau_index", "lix", "rix", "token_length_mean", "token_length_median", "token_length_std", 
    "sentence_length_mean", "sentence_length_median", "sentence_length_std", "syllables_per_token_mean", 
    "syllables_per_token_median", "syllables_per_token_std", "n_tokens", "n_unique_tokens", 
    "proportion_unique_tokens", "n_characters", "n_sentences", "video_id", "language", "lang_confidence", 
    "fallback_used"
]


# Try to load previously processed data, if it exists
try:
    processed_df = pd.read_csv("processed_text_features.csv")
    processed_video_ids = set(processed_df["video_id"].unique())
    print(f"Resuming from {len(processed_video_ids)} previously processed videos.")
except FileNotFoundError:
    processed_video_ids = set()
    print("Starting fresh, no previously processed videos found.")

def process_transcript(text, lang, video_id, confidence):
    """Extract text metrics based on the detected language and include Video ID."""
    if lang in lang2model:
        spacy_model = lang2model[lang]
        metrics = textdescriptives.extract_metrics(text=text, spacy_model=spacy_model)
        is_fallback = False
    else:
        doc = fallback_model(text)
        metrics = textdescriptives.extract_df(doc)
        is_fallback = True
    
    # Add metadata to metrics
    metrics["video_id"] = video_id
    metrics["language"] = lang
    metrics["lang_confidence"] = confidence
    metrics["fallback_used"] = is_fallback
    return metrics 

# Iterate over rows in the dataframe
for _, row in tqdm(df.iterrows(), total=len(df)):
    video_id = row['Video ID']
    
    # Skip if video has already been processed
    if video_id in processed_video_ids:
        continue
    
    # Process transcript 
    metrics = process_transcript(row['transcript'], row['language'], video_id, row['confidence'])
    metrics_df = metrics.reindex(columns=expected_columns)  # Ensure all columns are present
    
    # Append to the CSV; write the header only when the file does not exist yet
    metrics_df.to_csv(
        "processed_text_features.csv",
        mode="a",
        header=not Path("processed_text_features.csv").exists(),
        index=False,
    )

    # Add the processed video ID to the set
    processed_video_ids.add(video_id)

Resuming from 189 previously processed videos.

  similarities.append(sent.similarity(sents[i + order]))
  2%|▏         | 195/10956 [00:46<1:20:21,  2.23it/s]

⚠ Could not load lexeme probability table for language da. This will
result in NaN values for perplexity and entropy.

  2%|▏         | 213/10956 [02:50<20:42:24,  6.94s/it]

⚠ Could not load lexeme probability table for language it. This will
result in NaN values for perplexity and entropy.

  2%|▏         | 220/10956 [03:40<23:16:48,  7.81s/it]

⚠ Could not load lexeme probability table for language ru. This will
result in NaN values for perplexity and entropy.

  2%|▏         | 227/10956 [04:23<22:06:33,  7.42s/it]
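
The progress output above shows throughput falling from 2.23 it/s to roughly 7 s/it once non-English transcripts appear. One plausible contributor, which is an assumption and not something measured here, is that passing a model name to `textdescriptives.extract_metrics` loads the spaCy model again on every call. The sketch below caches one loaded pipeline per model name. The names `load_spacy_pipeline`, `make_cached_loader`, and `get_pipeline` are hypothetical, and the sketch assumes the `textdescriptives/all` component covers the same metrics as the notebook's `extract_metrics` call.

```python
from functools import lru_cache


def load_spacy_pipeline(model_name):
    """Load a spaCy model and attach the textdescriptives components."""
    # Imported lazily so the caching logic can be exercised without spaCy.
    import spacy
    import textdescriptives  # noqa: F401  (registers "textdescriptives/*")

    nlp = spacy.load(model_name)
    nlp.add_pipe("textdescriptives/all")
    return nlp


def make_cached_loader(loader):
    """Wrap a loader so each model name is loaded at most once."""

    @lru_cache(maxsize=None)
    def cached(model_name):
        return loader(model_name)

    return cached


get_pipeline = make_cached_loader(load_spacy_pipeline)

# Possible use inside process_transcript (requires the models to be installed):
#     doc = get_pipeline(lang2model[lang])(text)
#     metrics = textdescriptives.extract_df(doc)
```

Keeping every model resident trades memory for speed; if memory is tight, a bounded `maxsize` in `lru_cache` would evict the least recently used pipelines.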

In [None]:
df = pd.read_csv("processed_text_features.csv")

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10956 entries, 0 to 10955
Data columns (total 74 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   text                                    10956 non-null  object 
 1   passed_quality_check                    10956 non-null  bool   
 2   n_stop_words                            10956 non-null  float64
 3   alpha_ratio                             10956 non-null  float64
 4   mean_word_length                        10956 non-null  float64
 5   doc_length                              10956 non-null  float64
 6   symbol_to_word_ratio_#                  10956 non-null  float64
 7   proportion_ellipsis                     10956 non-null  float64
 8   proportion_bullet_points                10956 non-null  float64
 9   contains_lorem ipsum                    10956 non-null  float64
 10  duplicate_line_chr_fraction             10956 non-null  fl

In [None]:
df[df['fallback_used']==1]['language'].unique()

array(['ar', 'vi', 'cs', 'hi', 'hu', 'th', 'id', 'ta', 'cy', 'nn', 'tr',
       'et', 'tl', 'la', 'no', 'he', 'sk', 'ms'], dtype=object)
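
The fallback list includes `no`, the ISO 639-1 code for Norwegian as a macrolanguage, even though a Bokmål model (`nb_core_news_md`) is among the downloads. One possible refinement is to normalize language codes before deciding on the fallback. The sketch below is not part of the notebook: `LANG_ALIASES` and `select_model` are hypothetical, and mapping `no` to `nb` is an assumption that should be checked against the actual transcripts. It also treats a missing language, which the left merge can produce as NaN, as a fallback case.

```python
import math

# Subset of the notebook's lang2model mapping, for illustration only.
LANG2MODEL = {"en": "en_core_web_md", "nb": "nb_core_news_md"}

# Hypothetical aliases: detector codes mapped onto codes that have a model.
LANG_ALIASES = {"no": "nb"}


def select_model(lang):
    """Return (model_name, fallback_used) for a detected language code."""
    # A left merge leaves the language as NaN when no detection row matched.
    if lang is None or (isinstance(lang, float) and math.isnan(lang)):
        return None, True
    code = LANG_ALIASES.get(lang, lang)
    if code in LANG2MODEL:
        return LANG2MODEL[code], False
    return None, True


print(select_model("no"))  # ('nb_core_news_md', False)
print(select_model("th"))  # (None, True)
```

Returning the fallback flag alongside the model name keeps the `fallback_used` column consistent with whichever pipeline was actually chosen.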