# Word Frequency Analysis

This notebook shows sample calls to the *WordFrequencyChart* class from the module *scripts/word_frequency.py*.

*WordFrequencyChart* contains the following methods:
- *top_20_words_category*: Returns the top 20 most common words by organization category.
- *top_20_words_lang*: Returns the top 20 most common words by document language.
- *frequency_certain_words*: Returns word frequency from a user-defined list of words.

To execute this notebook, please start by running the initialization script below. Then, you can run and modify the other code cells according to your needs. 

In [1]:
# Import required libraries
from pathlib import Path
from datetime import datetime
import logging

# Define constants
LOGS_DIR = Path('logs')
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
LOGS_FILE = LOGS_DIR / f"word_frequency_{timestamp}.log"
DB_PATH = Path('data/database.db')

# Create the log directory if needed; basicConfig fails when it is missing
LOGS_DIR.mkdir(parents=True, exist_ok=True)

# Set up logging
logging.basicConfig(
    filename=LOGS_FILE,
    level=logging.INFO,
    format='%(asctime)s:%(levelname)s:%(message)s'
)


## Most Frequent Words

### By category

In [2]:
from scripts.word_frequency import WordFrequencyChart

wfc = WordFrequencyChart(DB_PATH)
wfc.top_n_words('Organismes communautaires et à but non-lucratif', n=20, ngram=3)

2024-09-30 18:24:38.545 | INFO     | scripts.database:__init__:24 - Database connection successful.
2024-09-30 18:24:38.547 | INFO     | scripts.word_frequency:top_n_words:55 - Analyzing top 20 3-grams for category: Organismes communautaires et à but non-lucratif
2024-09-30 18:24:39.127 | INFO     | scripts.word_frequency:top_n_words:93 - Analysis complete for category: Organismes communautaires et à but non-lucratif


| Index | 3-gram | Frequency |
|---|---|---|
| 0 | maryse alcindor copresidente | 110 |
| 1 | seance soiree steno | 72 |
| 2 | steno cindy lavertu | 70 |
| 3 | racisme discrimination systemiques | 56 |
| 4 | productions feux sacres | 47 |
| 5 | soiree steno cindy | 44 |
| 6 | personnes limitations fonctionnelles | 38 |
| 7 | ligue noirs org | 34 |
| 8 | dephy montreal mem | 32 |
| 9 | montreal mem racisme | 32 |
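
Counts like the ones above can be approximated with the standard library alone. The sketch below is an illustration, not the code in *scripts/word_frequency.py*: the helper names `normalize` and `top_ngrams`, the stopword set, and the two sample sentences are assumptions made for this example. The accent stripping mirrors the unaccented forms seen in the output (for example *systemiques*), but how the module itself preprocesses text is not shown in this notebook.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase, strip accents, and split into alphanumeric tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks left behind by the decomposition (é -> e)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", unaccented)


def top_ngrams(documents, n=2, top=10, stopwords=frozenset()):
    """Count n-grams over all documents and return the `top` most common."""
    counts = Counter()
    for doc in documents:
        tokens = [t for t in normalize(doc) if t not in stopwords]
        # Zipping n shifted views of the token list yields consecutive n-grams
        shifted = (tokens[i:] for i in range(n))
        counts.update(" ".join(gram) for gram in zip(*shifted))
    return counts.most_common(top)


# Hypothetical sample sentences and stopwords, for illustration only
docs = [
    "Racisme et discrimination systémiques à Montréal",
    "Le racisme et la discrimination systémiques",
]
stop = frozenset({"et", "a", "le", "la"})

print(top_ngrams(docs, n=2, top=3, stopwords=stop))
# [('racisme discrimination', 2), ('discrimination systemiques', 2), ('systemiques montreal', 1)]
```

Stopwords are removed before the n-grams are formed, so words that were separated by a stopword in the text end up adjacent. That is one plausible reading of entries such as *racisme discrimination* in the output, though the module may handle this differently.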


### By language

In [4]:
from scripts.word_frequency import WordFrequencyChart

wfc = WordFrequencyChart(DB_PATH)
wfc.top_n_words('fr', n=20, ngram=2, lang='fr')

2024-09-30 18:25:06.095 | INFO     | scripts.database:__init__:24 - Database connection successful.
2024-09-30 18:25:06.097 | INFO     | scripts.word_frequency:__del__:193 - WordFrequencyChart object is being destroyed
2024-09-30 18:25:06.098 | INFO     | scripts.database:__del__:167 - Database connection closed.
2024-09-30 18:25:06.098 | INFO     | scripts.word_frequency:top_n_words:55 - Analyzing top 20 2-grams for language: fr
2024-09-30 18:25:08.007 | INFO     | scripts.word_frequency:top_n_words:93 - Analysis complete for language: fr


| Index | 2-gram | Frequency |
|---|---|---|
| 0 | ville montreal | 1128 |
| 1 | profilage racial | 638 |
| 2 | racisme discrimination | 481 |
| 3 | consultation publique | 404 |
| 4 | discrimination systemiques | 341 |
| 5 | emond copresidente | 297 |
| 6 | alcindor copresidente | 292 |
| 7 | maryse alcindor | 291 |
| 8 | droits personne | 260 |
| 9 | racisme systemique | 255 |


## Comparisons

In [5]:
from scripts.word_frequency import WordFrequencyChart

wfc = WordFrequencyChart(DB_PATH)
wfc.compare_languages(n=20, ngram=2)

2024-09-30 18:26:10.104 | INFO     | scripts.database:__init__:24 - Database connection successful.
2024-09-30 18:26:10.105 | INFO     | scripts.word_frequency:__del__:193 - WordFrequencyChart object is being destroyed
2024-09-30 18:26:10.106 | INFO     | scripts.database:__del__:167 - Database connection closed.
2024-09-30 18:26:10.107 | INFO     | scripts.word_frequency:compare_languages:124 - Comparing top 20 2-grams across languages
2024-09-30 18:26:10.107 | INFO     | scripts.word_frequency:top_n_words:55 - Analyzing top 20 2-grams for language: fr
2024-09-30 18:26:11.981 | INFO     | scripts.word_frequency:top_n_words:93 - Analysis complete for language: fr
2024-09-30

In [6]:
from scripts.word_frequency import WordFrequencyChart

wfc = WordFrequencyChart(DB_PATH)
wfc.compare_categories(['Organismes communautaires et à but non-lucratif', 'Chercheurs et experts'], n=20, ngram=2)

2024-09-30 18:27:02.388 | INFO     | scripts.database:__init__:24 - Database connection successful.
2024-09-30 18:27:02.389 | INFO     | scripts.word_frequency:__del__:193 - WordFrequencyChart object is being destroyed
2024-09-30 18:27:02.390 | INFO     | scripts.database:__del__:167 - Database connection closed.
2024-09-30 18:27:02.391 | INFO     | scripts.word_frequency:compare_categories:97 - Comparing top 20 2-grams across categories: ['Organismes communautaires et à but non-lucratif', 'Chercheurs et experts']
2024-09-30 18:27:02.391 | INFO     | scripts.word_frequency:top_n_words:55 - Analyzing top 20 2-grams for category: Organismes communautaires et à but non-lucratif
2024-09-30 18:27:02.939 | INFO     |
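
The log shows that the comparison runs a top-n analysis per category, but the resulting output is not visible here. As a rough sketch of one way such per-group counts could be lined up, the snippet below merges the top terms of each group into side-by-side rows. The function name `compare_top`, the group labels, and the counts are all invented for illustration and do not come from the database or from *scripts/word_frequency.py*.

```python
from collections import Counter


def compare_top(counts_by_group, top=5):
    """Return rows of (term, count per group) for the union of each group's top terms."""
    groups = list(counts_by_group)
    terms = []
    for group in groups:
        for term, _ in counts_by_group[group].most_common(top):
            if term not in terms:  # keep first-seen order, no duplicates
                terms.append(term)
    # A Counter returns 0 for terms a group never used
    return [(term, *(counts_by_group[g][term] for g in groups)) for term in terms]


# Invented counts, for illustration only
counts = {
    "group_a": Counter({"profilage racial": 3, "ville montreal": 2}),
    "group_b": Counter({"ville montreal": 4, "racisme systemique": 1}),
}

for row in compare_top(counts, top=2):
    print(row)
# ('profilage racial', 3, 0)
# ('ville montreal', 2, 4)
# ('racisme systemique', 0, 1)
```

Taking the union of the top terms, rather than the intersection, keeps terms that are frequent in one group and absent in the other, which is often the most informative contrast.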

## TF-IDF Analysis

In [7]:
from scripts.word_frequency import WordFrequencyChart

wfc = WordFrequencyChart(DB_PATH)
wfc.tfidf_analysis('Organismes communautaires et à but non-lucratif')

2024-09-30 18:27:29.562 | INFO     | scripts.database:__init__:24 - Database connection successful.
2024-09-30 18:27:29.563 | INFO     | scripts.word_frequency:__del__:193 - WordFrequencyChart object is being destroyed
2024-09-30 18:27:29.565 | INFO     | scripts.database:__del__:167 - Database connection closed.
2024-09-30 18:27:29.566 | INFO     | scripts.word_frequency:tfidf_analysis:151 - Performing TF-IDF analysis for category: Organismes communautaires et à but non-lucratif
2024-09-30 18:27:29.756 | INFO     | scripts.word_frequency:tfidf_analysis:189 - TF-IDF analysis complete for category: Organismes communautaires et à but non-lucratif


| Index | Word | TF-IDF Score |
|---|---|---|
| 0 | ca | 2.731688 |
| 1 | montreal | 2.510689 |
| 2 | personnes | 2.223517 |
| 3 | copresidente | 2.113263 |
| 4 | ville | 2.021335 |
| 5 | org | 1.846373 |
| 6 | discrimination | 1.771375 |
| 7 | seance | 1.704174 |
| 8 | etre | 1.681445 |
| 9 | femmes | 1.67047 |
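
For readers unfamiliar with TF-IDF, the sketch below shows one common way to turn tokenized documents into a per-word score. It is an assumption-laden illustration, not the method used by `tfidf_analysis`: the function name `tfidf_scores`, the smoothed IDF formula, the choice to sum scores across documents, and the toy documents are all picked for this example. The scores in the table above may come from a different weighting or normalization.

```python
import math
from collections import Counter


def tfidf_scores(tokenized_docs):
    """Return (term, summed TF-IDF) pairs, highest score first."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    totals = Counter()
    for doc in tokenized_docs:
        if not doc:
            continue  # an empty document contributes nothing
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            # Smoothed IDF: stays positive even for terms found in every document
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            totals[term] += tf * idf
    return totals.most_common()


# Toy tokenized documents, for illustration only
docs = [
    ["ville", "montreal", "racisme"],
    ["ville", "montreal"],
    ["ville", "femmes"],
]

for term, score in tfidf_scores(docs):
    print(f"{term}: {score:.4f}")
```

Because the scores are summed over documents, a word that appears in many documents can still rank first despite its low IDF. That matches the general pattern in the table, where very common words such as *montreal* and *ville* sit near the top.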


In [8]:
from scripts.word_frequency import WordFrequencyChart

wfc = WordFrequencyChart(DB_PATH)
wfc.tfidf_analysis('fr', lang='fr')

2024-09-30 18:27:32.545 | INFO     | scripts.database:__init__:24 - Database connection successful.
2024-09-30 18:27:32.547 | INFO     | scripts.word_frequency:__del__:193 - WordFrequencyChart object is being destroyed
2024-09-30 18:27:32.548 | INFO     | scripts.database:__del__:167 - Database connection closed.
2024-09-30 18:27:32.548 | INFO     | scripts.word_frequency:tfidf_analysis:151 - Performing TF-IDF analysis for language: fr
2024-09-30 18:27:32.951 | INFO     | scripts.word_frequency:tfidf_analysis:189 - TF-IDF analysis complete for language: fr


| Index | Word | TF-IDF Score |
|---|---|---|
| 0 | ca | 6.892542 |
| 1 | montreal | 6.840371 |
| 2 | org | 6.0985 |
| 3 | ville | 5.639455 |
| 4 | personnes | 5.403555 |
| 5 | copresidente | 4.928763 |
| 6 | discrimination | 4.31773 |
| 7 | etre | 4.246087 |
| 8 | racisme | 4.06518 |
| 9 | 2019 | 3.852594 |
