# Topic Analysis

This notebook shows sample requests that can be made with the *topic_analysis* function from the module *scripts/topic_analysis_main.py*.

*topic_analysis* calls the following classes:
- *Database* ('*scripts/database.py*'): For database operations
    - *fetch_single*: Fetches a single document for topic analysis
    - *fetch_all*: Fetches all documents for topic analysis
- *Process* ('*scripts/topic_analysis/text_processing.py*'): For text processing
    - *single_doc*: Processes text from a single document
    - *docs_parallel*: Processes text from several documents using parallel processing
- *Analysis* ('*scripts/topic_analysis/analysis.py*'): For analysis of processed documents
    - *analyze_docs*: Analyzes documents by topic
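
The call flow listed above can be sketched as follows. This is only an illustration under stated assumptions: the three classes are hypothetical in-memory stand-ins that borrow the names from the list, and their constructors, arguments, and return values are not taken from the project's code. `run_pipeline` is likewise a hypothetical name, and word counting stands in for the actual topic model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Database:
    """Hypothetical stand-in: holds documents in a dict instead of a real database."""

    def __init__(self, docs):
        self.docs = docs  # {document_id: text}

    def fetch_single(self, document_id):
        text = self.docs.get(document_id)
        return [] if text is None else [(document_id, text)]

    def fetch_all(self):
        return list(self.docs.items())


class Process:
    """Hypothetical stand-in: lowercases text and splits it into tokens."""

    def single_doc(self, text):
        return text.lower().split()

    def docs_parallel(self, texts):
        # Threads keep the sketch simple; the real module may use processes.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self.single_doc, texts))


class Analysis:
    """Hypothetical stand-in: word counts take the place of a topic model."""

    def analyze_docs(self, tokenized_docs, num_words=3):
        counts = Counter(tok for doc in tokenized_docs for tok in doc)
        return [word for word, _ in counts.most_common(num_words)]


def run_pipeline(db, mode="all", document_id=None):
    """Fetch, process, then analyze, in the order of the list above."""
    rows = db.fetch_single(document_id) if mode == "single" else db.fetch_all()
    if not rows:
        return None  # nothing to analyze
    texts = [text for _, text in rows]
    processor = Process()
    if mode == "single":
        tokens = [processor.single_doc(texts[0])]
    else:
        tokens = processor.docs_parallel(texts)
    return Analysis().analyze_docs(tokens)


db = Database({1: "racisme systemique racisme", 2: "plainte racisme plainte"})
single_result = run_pipeline(db, mode="single", document_id=1)
all_result = run_pipeline(db, mode="all")
missing_result = run_pipeline(db, mode="single", document_id=99)
```

Returning `None` when nothing is fetched mirrors the `is not None` checks used in the cells of this notebook, although the real function returns a pair of results rather than a single list.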

To execute this notebook, start by running the initialization cell below. You can then run and modify the other code cells as needed.

In [1]:
# Import required libraries
from pathlib import Path
from datetime import datetime
from loguru import logger

from scripts.topic_analysis_main import topic_analysis

# Define constants
LOGS_DIR = Path('logs')
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
LOGS_FILE = LOGS_DIR / f"topic_analysis_{timestamp}.log"
PDF_LIST = Path('data/pdf_list.csv')
DB_PATH = Path('data/database.db')

logger.add(
    LOGS_FILE,
    rotation="1 day",
    retention="7 days",
    level="DEBUG",
    format="{time:YYYY-MM-DD at HH:mm:ss} | {level} | {message}",
)


1

## Analysis of a single document

In [3]:
topic_df, words_df = topic_analysis(lang="fr", mode='single', document_id=10, num_topics=20)
if topic_df is not None and words_df is not None:
    print(topic_df)
    print(words_df)
else:
    print("No documents found in the database.")

2024-10-09 09:03:44.937 | INFO     | scripts.database:__init__:25 - Database connection successful.
2024-10-09 09:03:44.939 | INFO     | scripts.database:__init__:25 - Database connection successful.
2024-10-09 09:03:44.940 | INFO     | scripts.topic_analysis.tools:__init__:26 - Initializing tools for fr...
2024-10-09 09:03:45.975 | INFO     | scripts.topic_analysis.tools:__init__:47 - French model loaded successfully.
2024-10-09 09:03:45.976 | INFO     | scripts.topic_analysis.tools:load_additional_stopwords:84 - Loading additional stopwords from c:\Users\nicol\Documents\UdeM\Maîtrise\Données\labrri_ocpm_systemic_racism\scripts\topic_analysis\stopwords.txt...
2024-10-09 09:03:45.978 | INFO     | scripts.topi

   Topic_Number                                        Topic_Label  Coherence_Score
0             1     Topic 1: Plainte harcelement (systemes, prima)         0.752646
1             2  Topic 2: Entraine effet (employes ville, plain...         0.421317
2             3   Topic 3: Consequences (neutre, montreal rapport)         0.973777
3             4            Topic 4: Ville (droits, racial racisme)         0.787927
4             5  Topic 5: Discrimination systemique (prima faci...         0.551930
5             6  Topic 6: Egard (metropolitain inc, montreal re...         0.537463
6             7        Topic 7: Meme (ainsi reglement, systemique)         0.990537
7             8        Topic 8: Personne droits (entraine, office)         0.592758
    Topic_Number                 Word
0              1  plainte harcelement
1              1             systemes
2              1          application
3              1      montre
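
To make the coherence scores easier to read, the topics can be ranked from most to least coherent. This is a minimal sketch in plain Python: the scores are copied by hand from the printed output above rather than read from `topic_df`, and the `THRESHOLD` cut-off is an arbitrary illustrative value, not one used by the project.

```python
# Coherence scores copied from the printed output above (topic number -> score).
coherence = {
    1: 0.752646,
    2: 0.421317,
    3: 0.973777,
    4: 0.787927,
    5: 0.551930,
    6: 0.537463,
    7: 0.990537,
    8: 0.592758,
}

# Sort topic numbers from highest to lowest coherence.
ranked = sorted(coherence, key=coherence.get, reverse=True)

# Arbitrary illustrative cut-off; not a threshold used by the project.
THRESHOLD = 0.6
above_threshold = [topic for topic in ranked if coherence[topic] >= THRESHOLD]

for topic in ranked:
    print(f"Topic {topic}: {coherence[topic]:.3f}")
```

With these values, topics 7 and 3 rank highest and topic 2 ranks lowest.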

## Analysis of all documents

In [None]:
topic_df, words_df = topic_analysis(lang="fr", mode='all', num_topics=10)
if topic_df is not None and words_df is not None:
    print(topic_df)
    print(words_df)
else:
    print("No documents found in the database.")

## Topics by language

In [None]:
topic_analysis(lang="fr", mode='all')

In [None]:
topic_analysis(lang="en", mode='all')

In [None]:
topic_analysis(lang="bilingual", mode='all')
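
The three calls above differ only in the `lang` argument, so they can also be driven from a loop. The sketch below assumes nothing about the project beyond the keyword arguments shown in this notebook: `run_for_languages` is a hypothetical helper, and `fake_analysis` is a hypothetical placeholder that lets the example run without the project code. In the notebook, the imported analysis function would be passed in its place.

```python
def run_for_languages(analysis_fn, languages=("fr", "en", "bilingual"), **kwargs):
    """Call analysis_fn once per language and collect the results by language."""
    results = {}
    for lang in languages:
        results[lang] = analysis_fn(lang=lang, mode="all", **kwargs)
    return results


def fake_analysis(lang, mode, num_topics=10):
    """Hypothetical placeholder so the sketch runs without the project code."""
    return f"{num_topics} topics requested for lang={lang}, mode={mode}"


results = run_for_languages(fake_analysis, num_topics=5)
for lang, summary in results.items():
    print(lang, "->", summary)
```

Collecting the results in a dictionary keyed by language keeps the outputs available for comparison after the loop finishes.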