# NLP workflow (from Natural Language Processing Fundamentals)
* Data collection
* Data preprocessing
* Feature extraction
* Model development
* Model assessment
* Model deployment

## Data collection
Because CanLII blocks web scraping with captchas and because high-volume web scraping violates CanLII's ToS, this program will have to rely on manually downloaded HTML pages for now. The ToS suggest that individuals may be able to secure mass-downloading rights, so I will look into this as the program develops.

The HTML files listed are copies of all reported criminal (and some quasi-criminal) decisions on CanLII from 2023 as of 2023-01-31. I selected cases based solely on the style of cause, including all cases that followed the style *R v Defendant* or *Defendant v R*. As this is an NLP project, I only selected English decisions, thereby limiting the number of reported (quasi-)criminal cases from Quebec.

### Compiled decisions
Decisions listed in this dictionary have been processed and have a TXT equivalent saved to file.

In [1]:
completed = {"scc": ["./canlii_crim_corpus/html/2023/ca/scc/2023scc2.html",
                     "./canlii_crim_corpus/html/2023/ca/scc/2023scc3.html",],
             "bcca": ["./canlii_crim_corpus/html/2023/bc/ca/2023bcca2.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca3.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca4.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca6.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca8.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca13.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca19.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca29.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca33.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca37.html",
                      "./canlii_crim_corpus/html/2023/bc/ca/2023bcca50.html",],
             "bcsc": ["./canlii_crim_corpus/html/2023/bc/sc/2023bcsc50.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc147.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc72.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc85.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc92.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc96.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc106.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc134.html",
                      "./canlii_crim_corpus/html/2023/bc/sc/2023bcsc141.html",],
             "bcpc": ["./canlii_crim_corpus/html/2023/bc/pc/2023bcpc3.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc14.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc4.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc5.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc6.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc7.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc11.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc12.html",
                      "./canlii_crim_corpus/html/2023/bc/pc/2023bcpc13.html",],
             "abca": ["./canlii_crim_corpus/html/2023/ab/ca/2023abca2.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca26.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca3.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca5.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca7.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca10.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca11.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca18.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca20.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca23.html",
                      "./canlii_crim_corpus/html/2023/ab/ca/2023abca29.html",],
             "abkb": ["./canlii_crim_corpus/html/2023/ab/kb/2023abkb45.html",
                      "./canlii_crim_corpus/html/2023/ab/kb/2023abkb13.html",
                      "./canlii_crim_corpus/html/2023/ab/kb/2023abkb26.html",
                      "./canlii_crim_corpus/html/2023/ab/kb/2023abkb9.html",
                      "./canlii_crim_corpus/html/2023/ab/kb/2023abkb39.html",],
             "abpc": ["./canlii_crim_corpus/html/2023/ab/pc/2023abpc17.html",
                      "./canlii_crim_corpus/html/2023/ab/pc/2023abpc3.html",
                      "./canlii_crim_corpus/html/2023/ab/pc/2023abpc9.html",
                      "./canlii_crim_corpus/html/2023/ab/pc/2023abpc6.html",
                      "./canlii_crim_corpus/html/2023/ab/pc/2023abpc1.html",
                      "./canlii_crim_corpus/html/2023/ab/pc/2023abpc16.html",
                      "./canlii_crim_corpus/html/2023/ab/pc/2023abpc8.html",
                      "./canlii_crim_corpus/html/2023/ab/pc/2023abpc7.html",],
             "skca": ["./canlii_crim_corpus/html/2023/sk/ca/2023skca15.html",
                      "./canlii_crim_corpus/html/2023/sk/ca/2023skca1.html",
                      "./canlii_crim_corpus/html/2023/sk/ca/2023skca2.html",
                      "./canlii_crim_corpus/html/2023/sk/ca/2023skca12.html",
                      "./canlii_crim_corpus/html/2023/sk/ca/2023skca7.html",
                      "./canlii_crim_corpus/html/2023/sk/ca/2023skca6.html",],
             "skkb": ["./canlii_crim_corpus/html/2023/sk/kb/2023skkb1.html",
                      "./canlii_crim_corpus/html/2023/sk/kb/2023skkb8.html",],
             "skpc": ["./canlii_crim_corpus/html/2023/sk/pc/2023skpc6.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc10.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc2.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc9.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc5.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc14.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc1.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc3.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc4.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc12.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc8.html",
                      "./canlii_crim_corpus/html/2023/sk/pc/2023skpc7.html",],
             "mbca": ["./canlii_crim_corpus/html/2023/mb/ca/2023mbca5.html",
                      "./canlii_crim_corpus/html/2023/mb/ca/2023mbca8.html",
                      "./canlii_crim_corpus/html/2023/mb/ca/2023mbca2.html",
                      "./canlii_crim_corpus/html/2023/mb/ca/2023mbca4.html",
                      "./canlii_crim_corpus/html/2023/mb/ca/2023mbca6.html",
                      "./canlii_crim_corpus/html/2023/mb/ca/2023mbca1.html",],
             "mbkb": ["./canlii_crim_corpus/html/2023/mb/kb/2023mbkb7.html",
                      "./canlii_crim_corpus/html/2023/mb/kb/2023mbkb2.html",
                      "./canlii_crim_corpus/html/2023/mb/kb/2023mbkb10.html",
                      "./canlii_crim_corpus/html/2023/mb/kb/2023mbkb12.html",
                      "./canlii_crim_corpus/html/2023/mb/kb/2023mbkb6.html",
                      "./canlii_crim_corpus/html/2023/mb/kb/2023mbkb1.html",],
             "mbpc": [],
             "onca": ["./canlii_crim_corpus/html/2023/on/ca/2023onca19.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca33.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca40.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca23.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca2.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca24.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca45.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca6.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca10.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca38.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca13.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca48.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca5.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca35.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca8.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca31.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca32.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca47.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca53.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca7.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca20.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca12.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca3.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca36.html",
                      "./canlii_crim_corpus/html/2023/on/ca/2023onca4.html",],
             "onsc": ["./canlii_crim_corpus/html/2023/on/sc/2023onsc538.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc414.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc124.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc496.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc103.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc286.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc547.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc254.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc347.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc549.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc62.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc396.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc64.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc283.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc452.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc640.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc220.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc14.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc97.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc268.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc662.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc568.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc621.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc146.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc296.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc555.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc190.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc200.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc416.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc166.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc567.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc400.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc300.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc406.html",
                      "./canlii_crim_corpus/html/2023/on/sc/2023onsc516.html",],
             "oncj": ["./canlii_crim_corpus/html/2023/on/cj/2023oncj18.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj24.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj10.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj9.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj20.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj25.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj16.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj12.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj45.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj43.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj17.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj28.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj40.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj6.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj14.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj36.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj15.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj11.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj29.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj5.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj31.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj27.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj21.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj4.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj41.html",
                      "./canlii_crim_corpus/html/2023/on/cj/2023oncj22.html",],
             "qcca": ["./canlii_crim_corpus/html/2023/qc/ca/2023qcca34.html",
                      "./canlii_crim_corpus/html/2023/qc/ca/2023qcca13.html",
                      "./canlii_crim_corpus/html/2023/qc/ca/2023qcca89.html",
                      "./canlii_crim_corpus/html/2023/qc/ca/2023qcca57.html",],
             "qccs": [],
             "qccq": ["./canlii_crim_corpus/html/2023/qc/cq/2023qccq86.html",
                      "./canlii_crim_corpus/html/2023/qc/cq/2023qccq15.html",],
             "nbca": ["./canlii_crim_corpus/html/2023/nb/ca/2023nbca6.html",],
             "nbkb": [],
             "nbpc": ["./canlii_crim_corpus/html/2023/nb/pc/2023nbpc1.html",],
             "nsca": ["./canlii_crim_corpus/html/2023/ns/ca/2023nsca3.html",
                      "./canlii_crim_corpus/html/2023/ns/ca/2023nsca2.html",
                      "./canlii_crim_corpus/html/2023/ns/ca/2023nsca1.html",],
             "nssc": ["./canlii_crim_corpus/html/2023/ns/sc/2023nssc25.html",
                      "./canlii_crim_corpus/html/2023/ns/sc/2023nssc9.html",
                      "./canlii_crim_corpus/html/2023/ns/sc/2023nssc3.html",
                      "./canlii_crim_corpus/html/2023/ns/sc/2023nssc4.html",
                      "./canlii_crim_corpus/html/2023/ns/sc/2023nssc2.html",
                      "./canlii_crim_corpus/html/2023/ns/sc/2023nssc28.html",],
             "nspc": [],
             "peca": [],
             "pesc": ["./canlii_crim_corpus/html/2023/pe/sc/2023pesc4.html",],
             "nlca": [],
             "nlsc": ["./canlii_crim_corpus/html/2023/nl/sc/2023nlsc6.html",],
             "nlpc": ["./canlii_crim_corpus/html/2023/nl/pc/2023canlii605.html", 
                      "./canlii_crim_corpus/html/2023/nl/pc/2023canlii460.html",
                      "./canlii_crim_corpus/html/2023/nl/pc/2023canlii466.html",
                      "./canlii_crim_corpus/html/2023/nl/pc/2023canlii2060.html",
                      "./canlii_crim_corpus/html/2023/nl/pc/2023canlii2521.html",
                      "./canlii_crim_corpus/html/2023/nl/pc/2023canlii3051.html",],
             "ykca": [],
             "yksc": [],
             "yktc": ["./canlii_crim_corpus/html/2023/yk/tc/2023yktc1.html",],
             "ntca": ["./canlii_crim_corpus/html/2023/nt/ca/2023ntca1.html",],
             "ntsc": [],
             "nttc": [],
             "nuca": [],
             "nucj": []
            }

scc_2022 = ["./canlii_scc_corpus/html/2022/2022scc1.html",
            "./canlii_scc_corpus/html/2022/2022scc2.html",
            "./canlii_scc_corpus/html/2022/2022scc3.html",
            "./canlii_scc_corpus/html/2022/2022scc4.html",
            "./canlii_scc_corpus/html/2022/2022scc5.html",
            "./canlii_scc_corpus/html/2022/2022scc6.html",
            "./canlii_scc_corpus/html/2022/2022scc7.html",
            "./canlii_scc_corpus/html/2022/2022scc8.html",
            "./canlii_scc_corpus/html/2022/2022scc9.html",
            "./canlii_scc_corpus/html/2022/2022scc10.html",
            "./canlii_scc_corpus/html/2022/2022scc11.html",
            "./canlii_scc_corpus/html/2022/2022scc12.html",
            "./canlii_scc_corpus/html/2022/2022scc13.html",
            "./canlii_scc_corpus/html/2022/2022scc14.html",
            "./canlii_scc_corpus/html/2022/2022scc15.html",
            "./canlii_scc_corpus/html/2022/2022scc16.html",
            "./canlii_scc_corpus/html/2022/2022scc17.html",
            "./canlii_scc_corpus/html/2022/2022scc18.html",
            "./canlii_scc_corpus/html/2022/2022scc19.html",
            "./canlii_scc_corpus/html/2022/2022scc20.html",
            "./canlii_scc_corpus/html/2022/2022scc21.html",
            "./canlii_scc_corpus/html/2022/2022scc22.html",
            "./canlii_scc_corpus/html/2022/2022scc23.html",
            "./canlii_scc_corpus/html/2022/2022scc24.html",
            "./canlii_scc_corpus/html/2022/2022scc25.html",
            "./canlii_scc_corpus/html/2022/2022scc26.html",
            "./canlii_scc_corpus/html/2022/2022scc27.html",
            "./canlii_scc_corpus/html/2022/2022scc28.html",
            "./canlii_scc_corpus/html/2022/2022scc29.html",
            "./canlii_scc_corpus/html/2022/2022scc30.html",
            "./canlii_scc_corpus/html/2022/2022scc31.html",
            "./canlii_scc_corpus/html/2022/2022scc32.html",
            "./canlii_scc_corpus/html/2022/2022scc33.html",
            "./canlii_scc_corpus/html/2022/2022scc34.html",
            "./canlii_scc_corpus/html/2022/2022scc35.html",
            "./canlii_scc_corpus/html/2022/2022scc36.html",
            "./canlii_scc_corpus/html/2022/2022scc37.html",
            "./canlii_scc_corpus/html/2022/2022scc38.html",
            "./canlii_scc_corpus/html/2022/2022scc39.html",
            "./canlii_scc_corpus/html/2022/2022scc40.html",
            "./canlii_scc_corpus/html/2022/2022scc41.html",
            "./canlii_scc_corpus/html/2022/2022scc42.html",
            "./canlii_scc_corpus/html/2022/2022scc43.html",
            "./canlii_scc_corpus/html/2022/2022scc44.html",
            "./canlii_scc_corpus/html/2022/2022scc45.html",
            "./canlii_scc_corpus/html/2022/2022scc46.html",
            "./canlii_scc_corpus/html/2022/2022scc47.html",
            "./canlii_scc_corpus/html/2022/2022scc48.html",
            "./canlii_scc_corpus/html/2022/2022scc49.html",
            "./canlii_scc_corpus/html/2022/2022scc50.html",
            "./canlii_scc_corpus/html/2022/2022scc51.html",
            "./canlii_scc_corpus/html/2022/2022scc52.html",
            "./canlii_scc_corpus/html/2022/2022scc53.html",
            "./canlii_scc_corpus/html/2022/2022scc54.html",
           ]
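
Hand-maintained lists like these get harder to keep accurate as the corpus grows. As a minimal sketch, and assuming the directory layout visible in the paths above (`<year>/<jurisdiction>/<court>/<file>.html`), a dictionary with the same shape could be rebuilt from what is on disk. The helper name `build_court_index` is hypothetical and not part of this notebook, and the result lists every downloaded file, so it does not by itself record which files have been processed.

```python
from collections import defaultdict
from pathlib import Path


def build_court_index(root: str) -> dict:
    """Group the HTML files under `root` by court, keyed like the dictionary above."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*.html")):
        court = path.parent.name                # e.g. "ca", "sc", "scc"
        jurisdiction = path.parent.parent.name  # e.g. "bc", "on", "ca"
        # Assumption: federal files sit under "ca/scc" and are keyed as "scc";
        # every other key is jurisdiction + court, e.g. "bc" + "ca" -> "bcca"
        key = court if jurisdiction == "ca" else jurisdiction + court
        index[key].append(path.as_posix())
    return dict(index)
```

Keying on the directory names rather than the file name keeps the Newfoundland & Labrador provincial court files under `nlpc`, even though their file names use the `canlii` form. Note that `Path` drops a leading `./`, while `html_path_to_txt` later in this notebook relies on that prefix, so the paths would need the prefix restored before being passed on.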

### Defective files
Holding cell for files that have been downloaded but cannot be processed using the normal compilation process, and for decisions that are "published" on CanLII but don't have reasons available yet.

In [100]:
# Defects
## The defect files listed here each raised NoneType errors when run through the decision_paragraphs function
defects_nonetype = ["./canlii_crim_corpus/html/2023/ca/scc/2023scc1.html",
                    "./canlii_crim_corpus/html/2023/bc/ca/2023bcca16.html",
                    "./canlii_crim_corpus/html/2023/bc/ca/2023bcca38.html",
                    "./canlii_crim_corpus/html/2023/ab/pc/2023abpc22.html",
                    "./canlii_crim_corpus/html/2023/on/sc/2023onsc462.html",
                    "./canlii_crim_corpus/html/2023/on/sc/2023onsc519.html",]

## Paragraphs in PECA decisions don't appear to follow the "paragWrapper" div convention, and thus aren't caught up by the compiler
defects_peca = ["./canlii_crim_corpus/html/2023/pe/ca/2023peca1.html",
                "./canlii_crim_corpus/html/2023/pe/ca/2023peca2.html",]

# Reasons currently unavailable
## Decisions published without reasons are listed here for future follow-up
reasons_unavailable = ["./canlii_crim_corpus/html/2023/mb/pc/2023mbpc1.html"]

assorted_defects = []
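
Because the compiler depends on the "paragWrapper" div convention, a quick pre-check can flag files such as the PECA decisions before they reach it. The sketch below is one possible approach: it uses the standard library's `html.parser` rather than BeautifulSoup, and the names `ParagWrapperCounter` and `count_parag_wrappers` are hypothetical.

```python
from html.parser import HTMLParser


class ParagWrapperCounter(HTMLParser):
    """Counts <div> start tags whose class list includes "paragWrapper"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "paragWrapper" in classes:
            self.count += 1


def count_parag_wrappers(html: str) -> int:
    """Return the number of paragWrapper divs found in an HTML string."""
    counter = ParagWrapperCounter()
    counter.feed(html)
    return counter.count
```

A file that returns a count of zero could be routed to one of the defect lists instead of being passed to the compiler.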

### Unprocessed decisions
Staging ground for decisions that have been downloaded but still need to be processed

In [125]:
# Unprocessed decisions

## Federal
scc_list = []

## British Columbia
bcca_list = []
bcsc_list = []
bcpc_list = []
       
    
## Alberta
abca_list = []
abkb_list = []
abpc_list = []


## Saskatchewan
skca_list = []
skkb_list = []
skpc_list = []


## Manitoba
mbca_list = []
mbkb_list = []
mbpc_list = []


## Ontario
onca_list = []
onsc_list = []
oncj_list = []


## Quebec
qcca_list = []
qccq_list = []
qccs_list = []


## New Brunswick
nbca_list = []
nbkb_list = []
nbpc_list = []


## Newfoundland & Labrador
nlsc_list = []
nlpc_list = []
nlca_list = []


## Prince Edward Island
peca_list = []
pesc_list = []
pepc_list = []


## Nova Scotia
nsca_list = []
nssc_list = []
nspc_list = []


## Yukon
ykca_list = []
yksc_list = []
yktc_list = []


## Northwest Territories
ntca_list = []
ntsc_list = []
nttc_list = []


## Nunavut
nuca_list = []
nucj_list = []


## Aggregate variable
unprocessed = [scc_list,
               bcca_list, bcsc_list, bcpc_list,
               abca_list, abkb_list, abpc_list,
               skca_list, skkb_list, skpc_list,
               mbca_list, mbkb_list, mbpc_list,
               onca_list, onsc_list, oncj_list,
               qcca_list, qccs_list, qccq_list,
               nbca_list, nbkb_list, nbpc_list,
               nsca_list, nssc_list, nspc_list,
               peca_list, pesc_list, pepc_list,
               nlca_list, nlsc_list, nlpc_list,
               ykca_list, yksc_list, yktc_list,
               ntca_list, ntsc_list, nttc_list,
               nuca_list, nucj_list
              ]
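
Since `unprocessed` is a list of per-court lists, anything that needs a single flat list of paths has to flatten it first. A minimal sketch using `itertools.chain`; the function name and the sample paths are hypothetical stand-ins, not files from the corpus.

```python
from itertools import chain


def flatten_unprocessed(court_lists: list) -> list:
    """Flatten a list of per-court lists into one list of file paths."""
    return list(chain.from_iterable(court_lists))


# Hypothetical sample data standing in for the per-court lists above
sample = [[], ["./example/2023bcca2.html"],
          ["./example/2023onca5.html", "./example/2023onca6.html"]]
print(flatten_unprocessed(sample))
```

Empty per-court lists contribute nothing, so the aggregate can keep every court listed even when nothing is staged for it.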

### Tree structure
The following code snippet shows the HTML files that will be used to build the first test mini-corpus in a tree format.

In [262]:
import os

def list_directory_tree(directory):
    print(directory)
    for path, dirs, files in os.walk(directory):
        level = path.replace(directory, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(path)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

#list_directory_tree("./canlii_crim_corpus/html/2023/")

## Data preprocessing
These functions remove extraneous HTML and save the clean text to file. Where available, the preprocessing functions split the decision into the decision's numbered paragraphs. Where the decision doesn't come with pre-formatted paragraph numbers, the functions should infer them from the document's structure. For some older decisions, it may be possible to infer pagination, though this functionality may not be necessary or useful.
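
One way to approach the paragraph-number step is sketched below. It assumes the paragraphs have already been split into strings, that existing numbers appear as a leading `[n]` marker, and that unnumbered paragraphs can simply continue the count. Real decisions with headings or block quotes would need more care, and `number_paragraphs` is a hypothetical helper rather than one of the notebook's functions.

```python
import re

# A leading paragraph marker such as "[12] "
PARA_NUM = re.compile(r"^\[(\d+)\]\s*")


def number_paragraphs(paragraphs: list) -> list:
    """Return (number, text) pairs, reusing [n] markers where they exist."""
    numbered = []
    counter = 0
    for text in paragraphs:
        match = PARA_NUM.match(text)
        if match:
            # Trust the number printed in the decision and drop the marker
            counter = int(match.group(1))
            text = text[match.end():]
        else:
            # Assumption: an unnumbered paragraph continues the sequence
            counter += 1
        numbered.append((counter, text))
    return numbered
```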

### NLP language base
The case-brief program uses spaCy for NLP. After some cursory testing, the medium English model (en_core_web_md) appears best suited to this program's use case. Specifically, it appears to detect LAW entities much more accurately than either the small or large models do.

In [2]:
import spacy
nlp = spacy.load("en_core_web_md")
nlp.max_length=10000000

### HTML to TXT
The HTML to TXT functions extract text from raw HTML files. 

In [3]:
import re
from bs4 import BeautifulSoup

# Reuses the nlp pipeline loaded in the previous cell; reloading it here would reset nlp.max_length

# Reads an HTML file and returns a BeautifulSoup object
def read_html_file(filename: str)->BeautifulSoup:
    '''
    Reads an HTML file and returns a BeautifulSoup object.
    '''
    with open(filename, 'r', encoding="utf-8") as file:
        soup: BeautifulSoup = BeautifulSoup(file, 'html.parser')
    return soup


def create_title(filepath: str)-> str:
    """Create a title for the text file from the html file name"""
    path_list = filepath.split("/")
    title_list = path_list[-1].split(".")
    title = title_list[0]
    
    # The first group of numbers is the year
    year = re.findall(r"\d+", title)[0]
    # The second group of numbers is the file number
    file_number = re.findall(r"\d+", title)[1]
    # The group of letters is the jurisdiction and court
    jurisdiction = re.findall(r"[a-z]+", title)[0]
    
    if jurisdiction == "canlii":
        jurisdiction = "CanLII"
        title = f"{year} {jurisdiction} {file_number}"
    else:
        title = f"{year} {jurisdiction.upper()} {file_number}"
 
    return title


def decision_paragraphs(filename: str)->tuple:
    '''
    Extracts the decision paragraphs. The decision text
    is contained in the <div class="paragWrapper"> tags. This function collects
    these tags, along with any sibling elements between the first and last of
    them, and returns a (paragraphs, footnotes) tuple.
    '''
    
    decision = read_html_file(filename)
    
    # Find the first and last instances of the "paragWrapper" div.
    # Indexing with [-1] raises an IndexError when the file has no such div.
    first_div = decision.find("div", class_="paragWrapper")
    last_div = decision.find_all("div", class_="paragWrapper")[-1]

    paragraphs = []

    # Iterate over all siblings between the first and last instances of the "paragWrapper" div.
    # If last_div is not a sibling of first_div, find_next_sibling() eventually
    # returns None and the following call fails with a NoneType error.
    sibling = first_div
    paragraphs.append(first_div)
    while sibling != last_div:
        sibling = sibling.find_next_sibling()
        paragraphs.append(sibling)
        
    # Collects footnote references where applicable (an empty list otherwise)
    footnotes = decision_footnotes(decision)
        
    return paragraphs, footnotes


def decision_footnotes(decision: BeautifulSoup)->list:
    '''
    Generates a list of footnote references in decisions containing them.
    Returns an empty list when the decision has none.
    '''
    footnotes = []
    # html.parser lower-cases tag names, so search for "span" rather than "SPAN"
    footnote = decision.find("span", class_="MsoFootnoteReference")
    while footnote:
        footnotes.append(footnote)
        footnote = footnote.find_next_sibling("span", class_="MsoFootnoteReference")
    
    return footnotes


def clean_text(paragraph: str, remove_para_nums: bool=False)->"spacy.tokens.Doc":
    '''
    Returns the text as a spaCy Doc. When remove_para_nums is True and the
    paragraph opens with a bracketed number, paragraph numbers enclosed in
    square brackets are removed first, as these can sometimes confuse the
    sentence detectors. Removing superfluous periods after paragraph and
    section pinpoints is not implemented yet.
    '''
    if remove_para_nums and re.match(r"\[\d", paragraph):
        paragraph = re.sub(r"\[\d+\]\s", "", paragraph)
    return nlp(paragraph)


def compile_decision_text(filename: str)->list:
    '''
    The aggregate function that runs the others. Returns a list whose first
    item is the source file path, followed by the cleaned paragraphs and any
    footnotes.
    '''
    
    # Parse the file once and unpack both results
    paragraphs, footnotes = decision_paragraphs(filename)
    clean_decision = [filename]
    
    for paragraph in paragraphs:
        doc = clean_text(paragraph.text)
        # Skip paragraphs that contain no tokens
        if len(doc) > 0:
            clean_decision.append(doc)

    for footnote in footnotes:
        clean_decision.append(clean_text(footnote.text, False))
    
    return clean_decision


In [5]:
for case in scc_2022:
    compile_decision_text(case)

IndexError: list index out of range
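
A loop like the one above stops at the first file that fails. As a minimal sketch, the batch could instead record each failure and carry on, which also produces candidates for the defect lists. The name `compile_many` is hypothetical, and the choice to catch only `IndexError` and `AttributeError` is an assumption based on the errors reported in this notebook.

```python
from typing import Callable


def compile_many(paths: list, compile_fn: Callable) -> tuple:
    """Run compile_fn on each path, collecting results and failures separately."""
    compiled = {}
    failures = {}
    for path in paths:
        try:
            compiled[path] = compile_fn(path)
        except (IndexError, AttributeError) as error:
            # Keep the error type and message so the file can be triaged later
            failures[path] = f"{type(error).__name__}: {error}"
    return compiled, failures
```

With the notebook's own names, this would be called as `compiled, failures = compile_many(scc_2022, compile_decision_text)`.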

### TXT to file
The next set of functions sorts and saves the newly created text files.

In [4]:
def html_path_to_txt(filename: str) -> str:
    file_path_list = filename.split("/")
    del file_path_list[2]
    file_path_list.insert(2,"txt")
    save_path = "/".join(file_path_list)
    
    save_path_corrected = save_path.split(".")
    del save_path_corrected[-1]
    save_path_corrected.append("txt")
    save_path_corrected = ".".join(save_path_corrected)
    
    return save_path_corrected

def export_to_file(clean_decision: list, filename: str) -> None:
    '''
    Saves a copy of the cleaned decision text to file.
    '''
    save_path_corrected = html_path_to_txt(filename)    
    with open(save_path_corrected, "w") as f:
        for paragraph in clean_decision:
            if isinstance(paragraph, str):
                # The first item is the source file path, stored as a plain string
                f.write(paragraph + "\n")
            elif paragraph:
                # Remaining items are spaCy Docs; empty or missing ones are skipped
                f.write(paragraph.text + "\n")
    
    print(f"Wrote {save_path_corrected}")

def export_all(jurisdiction_list: list) -> None:
    '''
    Takes file paths from the jurisdiction list (a list of per-court lists)
    and exports them to text en masse.
    '''
    for court_list in jurisdiction_list:
        for decision in court_list:
            decision_text = compile_decision_text(decision)
            export_to_file(decision_text, decision)
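
The path conversion can also be written so that it does not depend on the `html` directory sitting at a fixed position in the path. The sketch below uses `pathlib` for this; `html_to_txt_path` is a hypothetical name, and it assumes the path has exactly one directory component named `html` that should become `txt`.

```python
from pathlib import PurePosixPath


def html_to_txt_path(html_path: str) -> str:
    """Map an HTML corpus path to the matching TXT path."""
    parts = list(PurePosixPath(html_path).parts)
    # Swap the first "html" directory component for "txt";
    # list.index raises ValueError if the path has no such component
    parts[parts.index("html")] = "txt"
    return PurePosixPath(*parts).with_suffix(".txt").as_posix()


print(html_to_txt_path("./canlii_crim_corpus/html/2023/bc/ca/2023bcca2.html"))
```

Unlike `html_path_to_txt`, this version drops the leading `./`, which refers to the same location but produces a differently shaped string.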
            

### Text post-processing

#### Correcting fixed-width paragraphs
Some of the HTML files converted to text were written with line breaks inside paragraph tags. When extracted, these line breaks make their way into the text files. These functions correct the affected files so that they can be more easily loaded into text classifier and NER functions later on. The correct_split_paragraphs function solves this problem by iterating through these files, removing intra-paragraph line breaks, and separating each newly merged paragraph from the next with a single line break.

HTML files susceptible to this problem are known to occur in the following jurisdiction lists:



##### British Columbia
* BCCA
* BCSC
* BCPC

##### Saskatchewan
* SKCA
* SKKB
* SKPC

##### Manitoba
* MBCA
* MBKB

##### Ontario
* ONSC
* ONCJ

##### Quebec
* QCCA
* QCCQ

##### New Brunswick
* NBCA
* NBPC

##### Prince Edward Island
* PESC

##### Newfoundland & Labrador
* NLSC
* NLPC

##### Yukon
* YKTC


In [5]:
split_line_jurisdictions = ["bcca", "bcsc", "bcpc",
                            "skca", "skkb", "skpc",
                            "mbca", "mbkb",
                            "onsc", "oncj",
                            "qcca", "qccq",
                            "nbca", "nbpc",
                            "pesc",
                            "nlsc", "nlpc",
                            "yktc",]

def create_split_lines():
    '''
    This function recreates the split line decisions.
    '''
    for jurisdiction in split_line_jurisdictions:
        if jurisdiction in completed:
            file_paths = completed[jurisdiction]
            for file_path in file_paths:
                clean_decision = compile_decision_text(file_path)
                # export_to_file already reports the TXT path it wrote
                export_to_file(clean_decision, file_path)

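def correct_split_paragraphs():
    '''
    Rewrites the saved TXT files for the split-line jurisdictions. A line
    break that sits directly between two non-whitespace characters becomes a
    space, and runs of two or more line breaks collapse into one.
    '''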
    for jurisdiction in split_line_jurisdictions:
        if jurisdiction in completed:
            file_paths = completed[jurisdiction]
            for file_path in file_paths:
                text_file_path = html_path_to_txt(file_path)
                
                with open(text_file_path, 'r') as f:
                    text = f.read()
                
                text = re.sub(r'(\S)\n(\S)', r'\1 \2', text)
                text = re.sub(r'\n{2,}', '\n', text)
                
                with open(text_file_path, 'w') as f:
                    f.write(text)
                    print(f"Wrote {text_file_path}")


In [204]:
#create_split_lines()
#correct_split_paragraphs()
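
The effect of the two substitution passes can be checked on a small string before touching any files. The sketch below is illustrative only: `merge_split_paragraphs` and the sample text are hypothetical, and the first pattern is written with lookarounds so that adjacent matches cannot overlap (on ordinary text it gives the same result as a capture-group version). It also shows why the cleanup should run only once per file: after the first pass paragraphs are separated by a single newline, so a second pass merges them.

```python
import re

def merge_split_paragraphs(text: str) -> str:
    """Hypothetical helper applying the two cleanup passes to a string."""
    # Pass 1: a lone newline between two non-space characters is a break
    # inside a paragraph, so replace it with a space.
    text = re.sub(r'(?<=\S)\n(?=\S)', ' ', text)
    # Pass 2: runs of two or more newlines mark paragraph boundaries;
    # collapse each run to a single newline.
    text = re.sub(r'\n{2,}', '\n', text)
    return text

# Invented sample: two paragraphs, each broken across two lines
sample = (
    "[1] The appellant was\nconvicted of assault.\n\n"
    "[2] He appeals\nhis conviction."
)

once = merge_split_paragraphs(sample)
twice = merge_split_paragraphs(once)

print(repr(once))   # two paragraphs, one per line
print(repr(twice))  # a second pass merges the paragraphs into one
```

Working on strings first makes it easy to test the patterns on a problem paragraph without rewriting the text files.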

### Corpus construction
Once the data is cleaned up, sorted out, and saved to file, it needs to be added to the common corpus.

#### File amalgamation
Reads the contents of the processed text files into a single corpus file. The amalgamation is broken up into five regional corpora (federal, west, north, central, and east), as the full version was unwieldy on the equipment I tried compiling it on.

In [29]:
# Federal
directory_federal = ["./canlii_crim_corpus/txt/2023/ca/scc"]

corpus_federal = "./canlii_crim_corpus/canlii_crim_corpus_2023_01_fed.txt"

# Western Canada
directories_west = ["./canlii_crim_corpus/txt/2023/bc/ca",
                    "./canlii_crim_corpus/txt/2023/bc/sc",
                    "./canlii_crim_corpus/txt/2023/bc/pc",
                    "./canlii_crim_corpus/txt/2023/ab/ca",
                    "./canlii_crim_corpus/txt/2023/ab/kb",
                    "./canlii_crim_corpus/txt/2023/ab/pc",
                    "./canlii_crim_corpus/txt/2023/mb/ca",
                    "./canlii_crim_corpus/txt/2023/mb/kb",
                    "./canlii_crim_corpus/txt/2023/mb/pc",]

corpus_west = "./canlii_crim_corpus/canlii_crim_corpus_2023_01_west.txt"

# Northern Canada
directories_north = ["./canlii_crim_corpus/txt/2023/yk/ca",
                     "./canlii_crim_corpus/txt/2023/yk/sc",
                     "./canlii_crim_corpus/txt/2023/yk/tc",
                     "./canlii_crim_corpus/txt/2023/nt/ca",
                     "./canlii_crim_corpus/txt/2023/nt/sc",
                     "./canlii_crim_corpus/txt/2023/nt/tc",
                     "./canlii_crim_corpus/txt/2023/nu/ca",
                     "./canlii_crim_corpus/txt/2023/nu/cj",]

corpus_north = "./canlii_crim_corpus/canlii_crim_corpus_2023_01_north.txt"

# Central Canada
directories_central = ["./canlii_crim_corpus/txt/2023/on/ca",
                       "./canlii_crim_corpus/txt/2023/on/sc",
                       "./canlii_crim_corpus/txt/2023/on/cj",]

corpus_central = "./canlii_crim_corpus/canlii_crim_corpus_2023_01_central.txt"

# Eastern Canada
directories_east = ["./canlii_crim_corpus/txt/2023/qc/ca",
                    "./canlii_crim_corpus/txt/2023/qc/cs",
                    "./canlii_crim_corpus/txt/2023/qc/cq",
                    "./canlii_crim_corpus/txt/2023/nb/ca",
                    "./canlii_crim_corpus/txt/2023/nb/kb",
                    "./canlii_crim_corpus/txt/2023/nb/pc",
                    "./canlii_crim_corpus/txt/2023/ns/ca",
                    "./canlii_crim_corpus/txt/2023/ns/sc",
                    "./canlii_crim_corpus/txt/2023/ns/pc",
                    "./canlii_crim_corpus/txt/2023/pe/ca",
                    "./canlii_crim_corpus/txt/2023/pe/sc",
                    "./canlii_crim_corpus/txt/2023/pe/pc",
                    "./canlii_crim_corpus/txt/2023/nl/ca",
                    "./canlii_crim_corpus/txt/2023/nl/sc",
                    "./canlii_crim_corpus/txt/2023/nl/pc",]

corpus_east = "./canlii_crim_corpus/canlii_crim_corpus_2023_01_east.txt"

# Full corpus
directories = ["./canlii_crim_corpus/txt/2023/bc/ca",
               "./canlii_crim_corpus/txt/2023/bc/sc",
               "./canlii_crim_corpus/txt/2023/bc/pc",
               "./canlii_crim_corpus/txt/2023/ab/ca",
               "./canlii_crim_corpus/txt/2023/ab/kb",
               "./canlii_crim_corpus/txt/2023/ab/pc",
               "./canlii_crim_corpus/txt/2023/mb/ca",
               "./canlii_crim_corpus/txt/2023/mb/kb",
               "./canlii_crim_corpus/txt/2023/mb/pc",
               "./canlii_crim_corpus/txt/2023/on/ca",
               "./canlii_crim_corpus/txt/2023/on/sc",
               "./canlii_crim_corpus/txt/2023/on/cj",
               "./canlii_crim_corpus/txt/2023/qc/ca",
               "./canlii_crim_corpus/txt/2023/qc/cs",
               "./canlii_crim_corpus/txt/2023/qc/cq",
               "./canlii_crim_corpus/txt/2023/nb/ca",
               "./canlii_crim_corpus/txt/2023/nb/kb",
               "./canlii_crim_corpus/txt/2023/nb/pc",
               "./canlii_crim_corpus/txt/2023/ns/ca",
               "./canlii_crim_corpus/txt/2023/ns/sc",
               "./canlii_crim_corpus/txt/2023/ns/pc",
               "./canlii_crim_corpus/txt/2023/pe/ca",
               "./canlii_crim_corpus/txt/2023/pe/sc",
               "./canlii_crim_corpus/txt/2023/pe/pc",
               "./canlii_crim_corpus/txt/2023/nl/ca",
               "./canlii_crim_corpus/txt/2023/nl/sc",
               "./canlii_crim_corpus/txt/2023/nl/pc",
               "./canlii_crim_corpus/txt/2023/yk/ca",
               "./canlii_crim_corpus/txt/2023/yk/sc",
               "./canlii_crim_corpus/txt/2023/yk/tc",
               "./canlii_crim_corpus/txt/2023/nt/ca",
               "./canlii_crim_corpus/txt/2023/nt/sc",
               "./canlii_crim_corpus/txt/2023/nt/tc",
               "./canlii_crim_corpus/txt/2023/nu/ca",
               "./canlii_crim_corpus/txt/2023/nu/cj",
              ]

corpus_file_full = "./canlii_crim_corpus/canlii_crim_corpus_2023_01_full.txt"


In [3]:
import os

def generate_corpus(directories: list, corpus_file: str):
    '''
    Amalgamates the processed text files into a single corpus. Individual
    decisions can still be distinguished by their file path in the corpus file
    proper.
    '''
    # Opening in "w" mode truncates any existing corpus file, so no explicit
    # removal is needed. Each decision is written straight to disk rather
    # than accumulated in one large string.
    with open(corpus_file, "w") as outfile:
        for directory in directories:
            # sort for a reproducible order; os.listdir order is arbitrary
            for filename in sorted(os.listdir(directory)):
                # only text files belong in the corpus
                if filename.endswith(".txt"):
                    with open(os.path.join(directory, filename), "r") as file:
                        outfile.write(file.read())

    print(f"Wrote the corpus to {corpus_file}")


In [30]:
generate_corpus(directory_federal, corpus_federal)

Wrote the corpus to ./canlii_crim_corpus/canlii_crim_corpus_2023_01_fed.txt
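
The generate_corpus docstring says that decisions stay distinguishable by file path. If the processed text files did not already carry that path themselves, one option would be to write a marker line ahead of each decision. The sketch below is a hypothetical variant, not the function used above: the name `generate_corpus_with_headers` and the `### FILE:` marker format are assumptions made for illustration, and the demonstration runs on throwaway files in a temporary directory.

```python
import os
import tempfile

def generate_corpus_with_headers(directories, corpus_file):
    """Hypothetical variant: prefix each decision with its source path."""
    count = 0
    with open(corpus_file, "w") as outfile:
        for directory in directories:
            for filename in sorted(os.listdir(directory)):
                if not filename.endswith(".txt"):
                    continue  # skip anything that is not a text file
                path = os.path.join(directory, filename)
                with open(path, "r") as infile:
                    # assumed marker format; any unambiguous line would do
                    outfile.write(f"### FILE: {path}\n")
                    outfile.write(infile.read().rstrip("\n") + "\n")
                count += 1
    return count

# Demonstration on invented files
with tempfile.TemporaryDirectory() as tmp:
    source_dir = os.path.join(tmp, "txt")
    os.mkdir(source_dir)
    samples = [("2023scc2.txt", "First decision."),
               ("2023scc3.txt", "Second decision."),
               ("notes.md", "ignored")]
    for name, body in samples:
        with open(os.path.join(source_dir, name), "w") as f:
            f.write(body)

    # keep the corpus file outside the directory being read
    corpus_path = os.path.join(tmp, "corpus_demo.txt")
    n_written = generate_corpus_with_headers([source_dir], corpus_path)
    with open(corpus_path, "r") as f:
        corpus_text = f.read()

print(n_written)
print(corpus_text)
```

A marker line like this also makes it possible to split the corpus back into individual decisions later.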


#### Tokenization
Once the corpora are properly assembled into text files, the next step is to tokenize them so that they can be examined using spaCy.

In [None]:
with open(corpus_federal, "r") as f:
    text = f.read()
    doc = nlp(text)

In [None]:
with open("./canlii_crim_corpus/canlii_crim_corpus_2023_01_west_n.bin", "wb") as file:
    file.write(doc.to_bytes())
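
Loading a whole regional corpus into a single `Doc` can be heavy on memory. A possible alternative, sketched below, is to stream the corpus one paragraph at a time. This assumes each paragraph sits on its own line, which is only guaranteed for files that have been through the split-paragraph cleanup. `iter_paragraphs` is a hypothetical helper, and the demonstration uses an invented two-paragraph file.

```python
import os
import tempfile

def iter_paragraphs(corpus_path):
    """Yield non-empty lines (assumed to be whole paragraphs) one at a time."""
    with open(corpus_path, "r") as f:
        for line in f:
            paragraph = line.strip()
            if paragraph:  # skip blank separator lines
                yield paragraph

# Demonstration on an invented corpus file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "corpus_demo.txt")
    with open(demo_path, "w") as f:
        f.write("[1] First paragraph.\n\n[2] Second paragraph.\n")
    paragraphs = list(iter_paragraphs(demo_path))

print(paragraphs)
```

With a loaded pipeline, the generator could be passed to `nlp.pipe(...)`, which accepts an iterable of texts and yields one `Doc` per paragraph.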

## Feature extraction

### Citations

Citations are an important part of a reported case. They signify *stare decisis*, and in **most** cases imply that the words or axioms cited are true and authoritative. Identifying citations inside a reported case is a key step toward understanding its internal logic. And because citations are lexicographically distinct from all other tokens in a reported decision, they make for a good subject for early experiments with named entity recognition (NER).


#### ner_citations_v1

Version 1 uses three labels to identify citations as **case citations**, **statute citations**, or **other citations** (books, journal articles, factums, etc.). It treats full citations and citation fragments as fundamentally the same entities.

##### Labels

* CASE_CITATION
* STATUTE_CITATION
* OTHER_CITATION

##### Corpus

* [canlii_crim_corpus_2023_01_west_n.txt](canlii_crim_corpus/canlii_crim_corpus_2023_01_west_n.txt)
* Same as [canlii_crim_corpus_2023_01_west.txt](canlii_crim_corpus/canlii_crim_corpus_2023_01_west.txt), but with some manual formatting

##### Sample size & dimensions

* Size: 1468 lines/paragraphs
* Cases used:
  * *R v Campbell*, 2023 BCCA 19
  * *R v Kehoe*, 2023 BCCA 2
  * *R v Zsombor*, 2023 BCCA 37
  * *R v Kooner*, 2023 BCCA 8
  * *R v Mohsenipour*, 2023 BCCA 6
  * *R v Wilkinson*, 2023 BCCA 3
  * *R v PRJ*, 2023 BCCA 13
  * *R v Armstrong*, 2023 BCCA 50
  * *R v Clayton*, 2023 BCCA 33
  * *R v Prins*, 2023 BCCA 4
  * *R v Richardson*, 2023 BCCA 29 
  
##### Efficacy

The model claims accuracy scores up to 82% after training on just under 1,500 paragraphs consisting of about 82,350 words. Field accuracy leaves much to be desired, though, as the model has a great deal of difficulty correctly identifying citation fragments (eg, "at para 34", "p. 708", etc).

The sample output is from a case I argued a few years back. 


In [2]:
import spacy
from spacy import displacy

In [None]:
with open("./data/2019mb1b150.txt") as file:
    sample_decision = file.read()

In [16]:
# Sample v1 output

nlp = spacy.load("./models/ner_citations_v1/model-last/")
doc = nlp(sample_decision)
displacy.render(doc, style='ent')

## Model development

#### ner_citations_v2

V2 uses ten labels to describe the same three general citation classes as V1. It treats full citations and citation fragments as different entities.

##### Labels

* Cases
    * CASE_CITATION
    * CASE_PINPOINT
    * CASE_SHORTFORM
    * CASE_HISTORY
* Statutes
    * STATUTE_CITATION
    * STATUTE_PINPOINT
    * STATUTE_SHORTFORM
* Other
    * OTHER_CITATION
    * OTHER_PINPOINT
    * OTHER_SHORTFORM
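
To make the difference between the two label schemes concrete, the sketch below represents annotations as character-offset spans, similar in shape to the `(start, end, label)` tuples used for spaCy entity annotations. The helper names `make_span` and `to_v1_label` are hypothetical, and the sample sentence is invented (the pinpoint is not a real reference). Collapsing every V2 label onto its class's `*_CITATION` label reproduces the V1 view, in which full citations and fragments are the same entity.

```python
V2_LABELS = {
    "CASE_CITATION", "CASE_PINPOINT", "CASE_SHORTFORM", "CASE_HISTORY",
    "STATUTE_CITATION", "STATUTE_PINPOINT", "STATUTE_SHORTFORM",
    "OTHER_CITATION", "OTHER_PINPOINT", "OTHER_SHORTFORM",
}

def make_span(text, fragment, label):
    """Locate fragment in text and return a (start, end, label) tuple."""
    if label not in V2_LABELS:
        raise ValueError(f"Unknown label: {label}")
    start = text.index(fragment)  # raises ValueError if the fragment is absent
    return (start, start + len(fragment), label)

def to_v1_label(v2_label):
    """Collapse a V2 label onto the V1 label for the same citation class."""
    citation_class = v2_label.split("_")[0]  # CASE, STATUTE, or OTHER
    return f"{citation_class}_CITATION"

# Invented sentence for illustration only
text = "See R v Kehoe, 2023 BCCA 2 at para 34."

v2_spans = [
    make_span(text, "R v Kehoe, 2023 BCCA 2", "CASE_CITATION"),
    make_span(text, "at para 34", "CASE_PINPOINT"),
]
v1_spans = [(start, end, to_v1_label(label)) for start, end, label in v2_spans]

for start, end, label in v2_spans:
    print(label, repr(text[start:end]))
```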
    
##### Efficacy

The model began claiming accuracy scores up to 83% after training on just a few hundred paragraphs, and was more accurate on random unlabelled samples than the more fully-trained V1.

In [None]:
# Sample v2 output

nlp = spacy.load("./models/ner_citations_v2/model-last/")
doc = nlp(sample_decision)
displacy.render(doc, style='ent')

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

### Text classification

#### textcat_firac_v1

The methodology for text classification should comply with the following guidelines:

##### **General**
* Classify sentences, not whole paragraphs
* The fewer labels, the better
  * Try to stick with just one label per text block to the extent possible
  * Anything more than two labels needs an explanation

##### **Facts**
* Any fact or factual finding related to an appeal issue
  * May include procedural history where prior procedural events are a direct appellate issue (eg, ineffective assistance of counsel)

##### *History*
* (Procedural) History is a factual subset
* This category covers the proceedings at trial and any subsequent appeals
* Separating this category from **Facts** right off the bat will make for more economical factual summaries, as procedural history generally isn't relevant to an appeal issue

##### **Issue**
* The questions that the decision answers
* These roughly correspond to the arguments one would define in argdown
* Issues can be tricky to spot, as they are often phrased in their argumentative form
  * Eg, The argument "the defendant argues his detention was unlawful" corresponds to the issue "Was the defendant's detention unlawful?"

##### **Rule**
* Axioms, legal tests, statutory provisions and definitions
* A rule should either be a major premise or readily capable of being translated into one

##### **Analysis**
- Parties' submissions at the appeal
- Facts and decisions in cited cases
- Court's opinion on the facts, history, and submissions

##### **Conclusion**
* Argument conclusions and issue resolutions
* Dispositions

In [66]:
import spacy
from spacy import displacy

nlp = spacy.load("./models/texcat_firac_v1/model-last/")
nlp.add_pipe("sentencizer")

paragraph = "  In the final analysis, I find it unnecessary to deal with the issue of extraterritoriality to dispose of this appeal. This is so because the CFNIS did not violate the Charter. Working within the constraints of its authority in Virginia, the CFNIS sought the cooperation of local authorities to obtain and execute a warrant under Virginia law. The warrant which issued authorized the search, seizure, and analysis of Cpl. McGregor’s electronic devices expressly. The evidence of sexual assault was discovered inadvertently by the investigators in the process of triaging the devices at the scene of the search; its incriminating nature was immediately apparent. Although the warrant did not contemplate such evidence, the digital files in issue fell squarely within the purview of the plain view doctrine. Furthermore, the CFNIS obtained Canadian warrants before conducting an in‑depth analysis of these devices. It is difficult to see how the CFNIS investigators could have acted differently to attain their legitimate investigative objectives. I conclude that they did not infringe Cpl. McGregor’s rights under s. 8 of the Charter."
doc = nlp(paragraph)

# Classify each sentence on its own and print its highest-scoring label
sentence_list = []
for sentence in doc.sents:
    sentence_list.append(sentence)
    categories = nlp(sentence.text).cats
    max_key = max(categories, key=categories.get)
    print(sentence, max_key)
    


  In the final analysis, I find it unnecessary to deal with the issue of extraterritoriality to dispose of this appeal. CONCLUSION
This is so because the CFNIS did not violate the Charter. ANALYSIS
Working within the constraints of its authority in Virginia, the CFNIS sought the cooperation of local authorities to obtain and execute a warrant under Virginia law. CONCLUSION
The warrant which issued authorized the search, seizure, and analysis of Cpl. FACTS
McGregor’s electronic devices expressly. ANALYSIS
The evidence of sexual assault was discovered inadvertently by the investigators in the process of triaging the devices at the scene of the search; its incriminating nature was immediately apparent. ANALYSIS
Although the warrant did not contemplate such evidence, the digital files in issue fell squarely within the purview of the plain view doctrine. ANALYSIS
Furthermore, the CFNIS obtained Canadian warrants before conducting an in‑depth analysis of these devices. ANALYSIS
It is difficu
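
The output above shows the sentencizer ending a sentence after "Cpl.", which sends two fragments to the classifier instead of one sentence. One possible workaround is to pre-split the paragraph with a rule that refuses to break after known abbreviations. The sketch below is plain Python rather than a spaCy component; the `ABBREVIATIONS` set and `split_sentences` helper are hypothetical, and the rule will miss a sentence that genuinely ends in one of the listed abbreviations.

```python
import re

# Assumed list; a real one would need to be built from the corpus
ABBREVIATIONS = {"Cpl.", "para.", "paras.", "p.", "pp.", "s.", "v.", "R."}

def split_sentences(text):
    """Split on sentence-final punctuation unless it ends an abbreviation."""
    sentences = []
    start = 0
    # candidate boundaries: ., ? or ! (plus closing quotes/brackets), then space
    for match in re.finditer(r'[.?!]["”)]*\s+', text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # not a real boundary, keep scanning
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = (
    "The warrant authorized the search of Cpl. McGregor's devices expressly. "
    "I conclude that they did not infringe Cpl. McGregor's rights under s. 8 "
    "of the Charter."
)

sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Each string returned this way could then be passed to the classifier on its own, in place of the spans from `doc.sents`.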

#### textcat_firac_v2

The methodology for text classification should comply with the following guidelines:

##### **General**
* Classify sentences, not whole paragraphs
* The fewer labels, the better
  * Try to stick with just one label per text block to the extent possible
  * Anything more than two labels needs an explanation

##### **Facts**
* Events occurring prior to a court's involvement are facts
* Events occurring after a court gets involved are generally not facts

##### *History*
* (Procedural) History is a factual subset
* This category covers the proceedings at trial and any subsequent appeals

##### **Issue**
* The questions that the decision answers
* These roughly correspond to the arguments one would define in argdown
* Issues can be tricky to spot, as they are often phrased in their argumentative form
  * Eg, The argument "the defendant argues his detention was unlawful" corresponds to the issue "Was the defendant's detention unlawful?"
  * The courts will sometimes answer an issue prior to properly raising it
    * I label these issue/conclusion

##### **Rule**
* Axioms, legal tests, statutory provisions and definitions
* A rule should either be a major premise or readily capable of being translated into one

##### **Analysis**
- Parties' submissions at the appeal
- Facts and decisions in cited cases
- Court's opinion on the facts, history, and submissions

##### **Conclusion**
* Argument conclusions and issue resolutions
* Dispositions

In [69]:
import spacy
from spacy import displacy

nlp = spacy.load("./models/texcat_firac_v2/model-last/")
nlp.add_pipe("sentencizer")

paragraph = "Section 8 of the Charter guarantees “the right to be secure against unreasonable search or seizure”. A search is reasonable within the meaning of s. 8 “if it is authorized by law, if the law itself is reasonable and if the manner in which the search was carried out is reasonable” (R. v. Collins, 1987 CanLII 84 (SCC), [1987] 1 S.C.R. 265, at p. 278; see also R. v. Caslake, 1998 CanLII 838 (SCC), [1998] 1 S.C.R. 51, at para. 10; R. v. Nolet, 2010 SCC 24, [2010] 1 S.C.R. 851, at para. 21; R. v. Vu, 2013 SCC 60, [2013] 3 S.C.R. 657, at paras. 21‑23; Wakeling v. United States of America, 2014 SCC 72, [2014] 3 S.C.R. 549, at para. 41; R. v. Fearon, 2014 SCC 77, [2014] 3 S.C.R. 621, at para. 12; Goodwin v. British Columbia (Superintendent of Motor Vehicles), 2015 SCC 46, [2015] 3 S.C.R. 250, at para. 48; R. v. Saeed, 2016 SCC 24, [2016] 1 S.C.R. 518, at para. 36; R. v. Tim, 2022 SCC 12, at para. 46). This Court has established a presumption that “a search requires prior authorization, usually in the form of a warrant, from a neutral arbiter” (R. v. M. (M.R.), 1998 CanLII 770 (SCC), [1998] 3 S.C.R. 393, at para. 44, referring to Hunter v. Southam Inc., 1984 CanLII 33 (SCC), [1984] 2 S.C.R. 145, at pp. 160‑62; see also Vu, at para. 22; R. v. Grant, 1993 CanLII 68 (SCC), [1993] 3 S.C.R. 223, at pp. 238‑39)."
doc = nlp(paragraph)

# Classify each sentence on its own and print its highest-scoring label
sentence_list = []
for sentence in doc.sents:
    sentence_list.append(sentence)
    categories = nlp(sentence.text).cats
    max_key = max(categories, key=categories.get)
    print(sentence, max_key)
    


Section 8 of the Charter guarantees “the right to be secure against unreasonable search or seizure”. ANALYSIS
A search is reasonable within the meaning of s. 8 “if it is authorized by law, if the law itself is reasonable and if the manner in which the search was carried out is reasonable” (R. v. Collins, 1987 CanLII 84 (SCC), [1987] 1 S.C.R. 265, at p. 278; see also R. v. Caslake, 1998 CanLII 838 (SCC), [1998] 1 S.C.R. 51, at para. ANALYSIS
10; R. v. Nolet, 2010 SCC 24, [2010] 1 S.C.R. 851, at para. RULE
21; R. v. Vu, 2013 SCC 60, [2013] 3 S.C.R. 657, at paras. RULE
21‑23; Wakeling v. United States of America, 2014 SCC 72, [2014] 3 S.C.R. 549, at para. RULE
41; R. v. Fearon, 2014 SCC 77, [2014] 3 S.C.R. 621, at para. RULE
12; Goodwin v. British Columbia (Superintendent of Motor Vehicles), 2015 SCC 46, [2015] 3 S.C.R. 250, at para. RULE
48; R. v. Saeed, 2016 SCC 24, [2016] 1 S.C.R. 518, at para. PROCEDURE
36; R. v. Tim, 2022 SCC 12, at para. RULE
46). CONCLUSION
This Court has establish
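
The guidelines allow a second label in limited cases, such as issue/conclusion, but the loop above only ever reports the single highest score. The sketch below shows one way to report up to two labels from a scores dictionary shaped like `doc.cats`. The `top_labels` helper, the 0.5 threshold, and the scores are all invented for illustration. A threshold like this only makes sense if the model scores labels independently (spaCy's `textcat_multilabel` component); with the exclusive-class `textcat` component the scores sum to one.

```python
def top_labels(cats, threshold=0.5, limit=2):
    """Return up to `limit` labels scoring at least `threshold`.

    Falls back to the single best label when nothing clears the threshold.
    """
    ranked = sorted(cats.items(), key=lambda item: item[1], reverse=True)
    chosen = [label for label, score in ranked[:limit] if score >= threshold]
    return chosen or [ranked[0][0]]

# Invented scores for a sentence that both raises and answers an issue
mixed = {"FACTS": 0.05, "PROCEDURE": 0.02, "ISSUE": 0.71,
         "RULE": 0.08, "ANALYSIS": 0.10, "CONCLUSION": 0.64}

# Invented scores where no label is confident
weak = {"FACTS": 0.20, "PROCEDURE": 0.05, "ISSUE": 0.10,
        "RULE": 0.15, "ANALYSIS": 0.40, "CONCLUSION": 0.10}

mixed_labels = top_labels(mixed)
weak_labels = top_labels(weak)

print(mixed_labels)
print(weak_labels)
```

Capping the result at two labels keeps the output in line with the guideline that anything more than two labels needs an explanation.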

## Model assessment

## Model deployment