# Language Distribution Analysis

This notebook provides sample requests that can be made with the *LanguageDistributionChart* class from the module '*scripts/language_distribution.py*'.

*LanguageDistributionChart* contains the following methods:
- *count_graph*: Returns the number of documents grouped by language.
- *language_percentage_distribution*: Returns the distribution of French and English words within French, English, and bilingual documents. The analysis assumes that no document is 100% unilingual, and it made it possible to define a 'bilingualism threshold' of 30% for a document to be considered 'bilingual'. These findings are expected to be fundamental to the development of the Topic Analysis pipeline.
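
To make the 30% 'bilingualism threshold' concrete, here is a minimal sketch of how a document could be labelled from its word-level percentages. It is not taken from '*scripts/language_distribution.py*': the function name `classify_document`, the constant `BILINGUAL_THRESHOLD`, and the rule that the threshold applies to the share of the minority language (inclusive) are all assumptions made for illustration.

```python
# Hypothetical helper, not part of scripts/language_distribution.py.
# Assumption: a document is 'Bilingual' when its less-used language
# (English or French) accounts for at least 30% of its words.
BILINGUAL_THRESHOLD = 30.0  # percent, taken from the description above


def classify_document(
    english_pct: float,
    french_pct: float,
    threshold: float = BILINGUAL_THRESHOLD,
) -> str:
    """Label a document as 'English', 'French' or 'Bilingual'."""
    minority_share = min(english_pct, french_pct)
    if minority_share >= threshold:
        return "Bilingual"
    # Below the threshold, the dominant language wins (ties fall to French).
    return "English" if english_pct > french_pct else "French"


if __name__ == "__main__":
    print(classify_document(45.0, 53.0))
    print(classify_document(3.9, 94.5))
```

Under these assumptions, a document with 45% English and 53% French words is labelled 'Bilingual', while one with 3.9% English words stays 'French'. Whether the real class uses an inclusive comparison, or measures the threshold differently, should be checked against the module itself.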

To execute this notebook, please start by running the initialization script below. Then, you can run and modify the other code cells according to your needs. 

In [1]:
# Import required libraries
from pathlib import Path
from datetime import datetime
from loguru import logger

# Define constants
LOGS_DIR = Path('logs')
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
LOGS_FILE = LOGS_DIR / f"language_distribution_{timestamp}.log"
PDF_LIST = Path('data/pdf_list.csv')
DB_PATH = Path('data/database.db')

logger.add(
    LOGS_FILE,
    rotation="1 day",
    retention="7 days",
    level="DEBUG",
    format="{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}",
)

1

In [2]:
from scripts.language_distribution import LanguageDistributionChart

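# Connect to the database and run the full analysis
chart = LanguageDistributionChart(DB_PATH)
overall_dist, detailed_analysis = chart.analyze_all()
# Note: Jupyter renders only the last expression of a cell, so the bare
# `overall_dist` line below produces no output here.
overall_dist
detailed_analysis.describe()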

2024-10-05 16:00:40.555 | INFO     | scripts.database:__init__:25 - Database connection successful.
2024-10-05 16:00:44.170 | INFO     | scripts.language_distribution:__init__:89 - LanguageDistributionChart initialized successfully
2024-10-05 16:00:44.171 | INFO     | scripts.language_distribution:analyze_all:262 - Starting comprehensive language distribution analysis
2024-10-05 16:00:44.172 | INFO     | scripts.language_distribution:count_graph:92 - Generating graph for All categories
2024-10-05 16:00:44.666 | INFO     | scripts.language_distribution:count_graph:134 - Graph generated and saved for All categories
2024-10-05 16:00:44.668 | INFO     | scripts.language_distribution:count_graph:


Summary of 'Other' Category (Error Margin):
Average 'Other' percentage: 2.12%
Maximum 'Other' percentage: 6.84%
Number of documents with 'Other' content: 98

Sample 'Other' content:
Document Overall Average (2.12%): No samples


2024-10-05 16:03:10.223 | INFO     | scripts.language_distribution:visualize_language_distribution:259 - Language distribution visualizations completed
2024-10-05 16:03:10.224 | INFO     | scripts.language_distribution:analyze_all:281 - Comprehensive language distribution analysis completed


| Statistic | English (%) | French (%) | Other (%) | Code Switches |
|---|---|---|---|---|
| count | 110.0 | 110.0 | 110.0 | 110.0 |
| mean | 14.859862 | 83.016082 | 2.124056 | 12.46789 |
| std | 31.54154 | 31.506091 | 1.838689 | 30.604182 |
| min | 0.0 | 0.236991 | 0.0 | 0.0 |
| 25% | 0.0 | 93.552032 | 0.754351 | 0.0 |
| 50% | 0.326895 | 96.752689 | 1.568118 | 2.0 |
| 75% | 3.947248 | 98.485113 | 3.136436 | 14.0 |
| max | 99.466194 | 100.0 | 6.841815 | 278.0 |
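
The 'Code Switches' column is not defined in this notebook. As a rough illustration only, the sketch below counts a switch each time two consecutive classified words belong to different languages. The function name `count_code_switches`, the labels `"en"`, `"fr"` and `"other"`, and the choice to skip 'other' words are assumptions, and the actual implementation in '*scripts/language_distribution.py*' may count switches differently.

```python
from typing import Optional, Sequence


def count_code_switches(word_languages: Sequence[str]) -> int:
    """Count language changes between consecutive classified words.

    Hypothetical helper: `word_languages` holds one label per word
    ("en", "fr" or "other"). Words labelled "other" are skipped so that
    an unrecognised token does not create a spurious switch.
    """
    switches = 0
    previous: Optional[str] = None
    for language in word_languages:
        if language == "other":
            continue
        if previous is not None and language != previous:
            switches += 1
        previous = language
    return switches


if __name__ == "__main__":
    labels = ["fr", "fr", "en", "other", "en", "fr"]
    print(count_code_switches(labels))
```

With this definition, the example sequence contains two switches (French to English, then English back to French), and the 'other' token in between is ignored.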
