# Language Distribution Analysis

This notebook provides sample calls to the *LanguageDistributionChart* class from the module *scripts/language_distribution.py*.

*LanguageDistributionChart* contains the following methods:
- *count_graph*: Returns the number of documents grouped by language.
- *language_percentage_distribution*: Returns the distribution of French and English words within French, English, and bilingual documents. The analysis assumes that none of these documents is 100% unilingual, which makes it possible to define a 'bilingualism threshold' of 30% for a document to be considered bilingual. These findings will be fundamental to the development of the Topic Analysis pipeline.
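
To make the 30% bilingualism threshold concrete, here is a minimal sketch of how a document could be labelled from its word percentages. The function name `classify_document` and the rule that the less-used language must reach the threshold are assumptions made for illustration; the actual logic in *scripts/language_distribution.py* may differ.

```python
def classify_document(english_pct: float, french_pct: float,
                      threshold: float = 30.0) -> str:
    """Label a document from its English and French word percentages.

    Assumption (hypothetical rule, not taken from the module): a document
    is 'Bilingual' when its less-used language reaches the threshold;
    otherwise it takes the label of its dominant language.
    """
    if min(english_pct, french_pct) >= threshold:
        return "Bilingual"
    return "English" if english_pct > french_pct else "French"


# Illustrative (English %, French %) pairs, not taken from the dataset.
samples = [(3.9, 94.0), (45.0, 53.0), (99.5, 0.2)]
labels = [classify_document(en, fr) for en, fr in samples]
print(labels)  # ['French', 'Bilingual', 'English']
```

Under this rule, a mostly French document containing a few English quotations stays 'French', while a document split roughly evenly between the two languages is labelled 'Bilingual'.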

To execute this notebook, start by running the initialization cell below. You can then run and modify the other code cells as needed.

In [1]:
# Import required libraries
from pathlib import Path
from datetime import datetime
from loguru import logger

# Define constants
LOGS_DIR = Path('logs')
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
LOGS_FILE = LOGS_DIR / f"language_distribution_{timestamp}.log"
PDF_LIST = Path('data/pdf_list.csv')
DB_PATH = Path('data/database.db')

logger.add(
    LOGS_FILE,
    rotation="1 day",
    retention="7 days",
    format="{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}",
)

1

In [2]:
from scripts.language_distribution import LanguageDistributionChart

chart = LanguageDistributionChart(DB_PATH)
overall_dist, detailed_analysis = chart.analyze_all()
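overall_dist  # not rendered: by default Jupyter shows only the last expression of a cell
detailed_analysis.describe()  # summary statistics of the per-document analysis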

2024-10-02 17:07:01.884 | INFO     | scripts.database:__init__:25 - Database connection successful.
2024-10-02 17:07:05.318 | INFO     | scripts.language_distribution:__init__:95 - LanguageDistributionChart initialized successfully
2024-10-02 17:07:05.319 | INFO     | scripts.language_distribution:analyze_all:267 - Starting comprehensive language distribution analysis
2024-10-02 17:07:05.319 | INFO     | scripts.language_distribution:count_graph:98 - Generating graph for All categories
2024-10-02 17:07:05.481 | INFO     | scripts.language_distribution:count_graph:140 - Graph generated and saved for All categories
2024-10-02 17:07:05.483 | INFO     | scripts.language_distribution:count_graph:


Other Lang 1 samples:
it: emanantd unevolontecitoyenneenvertududroitd initia; uneprogrammationplussignificatived artistesissusde; autochtonie.4.3comprehensiondel
ca: malgre mes references et mes competences difficile
ca: elle porte le foulard integral.
ca: j ai demenage a montreal il y a 10 ans.; 18 statistiques canada  no 89-657-x2019002.; 20 statistiques canada  no 11-631-x.
it: ..  ..9 conclusion ...............................; - rapprochement interculturel.; https:  ici.radio-canada.ca nouvelle
it: cela n a rien d anodin.; l annee suivante  le festival presence autochtone 
it: gaz metropolitain inc.
it: stabilite  securite financiere inaccessible pour m
ca: view?fbclid iwar2xoilfy7vex7i5x5dgs6f_zkpi6jf0idx ; statistique canada.  ; url : https:  www12.statcan.gc.ca census-  recense
so: canlii 8506  qc cm .; canlii 8506  qc cm   ;
it: - periode de la colonisation; nord 1078994 montreal; le profilage racial  
ca: d abord definir les termes 5 2.; - sentiment d injustice
ca: [ORG] 26 

2024-10-02 17:09:10.563 | INFO     | scripts.language_distribution:language_percentage_distribution:228 - Graph generated and saved for All languages
2024-10-02 17:09:10.566 | INFO     | scripts.language_distribution:visualize_language_distribution:233 - Visualizing language distribution
2024-10-02 17:09:10.966 | INFO     | scripts.language_distribution:visualize_language_distribution:264 - Language distribution visualizations completed
2024-10-02 17:09:10.966 | INFO     | scripts.language_distribution:analyze_all:283 - Comprehensive language distribution analysis completed


| Statistic | English (%) | French (%) | Other (%) | Code Switches |
|---|---|---|---|---|
| count | 110.0 | 110.0 | 110.0 | 110.0 |
| mean | 14.859862 | 83.016082 | 2.124056 | 47.807339 |
| std | 31.54154 | 31.506091 | 1.838689 | 86.072775 |
| min | 0.0 | 0.236991 | 0.0 | 0.0 |
| 25% | 0.0 | 93.552032 | 0.754351 | 12.5 |
| 50% | 0.326895 | 96.752689 | 1.568118 | 31.5 |
| 75% | 3.947248 | 98.485113 | 3.136436 | 51.75 |
| max | 99.466194 | 100.0 | 6.841815 | 788.0 |
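
The summary above includes a 'Code Switches' column, but the notebook does not say how it is computed. One plausible reading, sketched below purely as an assumption, is that a code switch is counted each time the detected language changes between consecutive text segments. The helper `count_code_switches` and the sample labels are hypothetical and are not taken from *scripts/language_distribution.py*.

```python
from typing import Sequence


def count_code_switches(labels: Sequence[str]) -> int:
    """Count language changes between consecutive segments.

    Assumption: each segment (for example, a sentence) already carries a
    detected language code such as 'fr' or 'en'.
    """
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


# Hypothetical per-segment language labels for one document.
segment_languages = ["fr", "fr", "en", "fr", "fr", "en", "en"]
print(count_code_switches(segment_languages))  # 3
```

With this definition, a long document that alternates between languages many times would produce a high count, which is one way a maximum far above the median could arise.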
