**Recursively counting words in markdown within a folder**

This week a question popped into my mind: "How many words have I written in this blog?" As usual, we'll use Python to solve the problem. It's also an opportunity to get more familiar with `pathlib`, the standard library's object-oriented way to handle filesystem paths in Python programs.

In [1]:
import re
import pathlib
from typing import List

A great way to start a project is to check whether it has already been done. [@gandreadis](https://github.com/gandreadis) on GitHub has already tackled this problem for a single file: [https://github.com/gandreadis/markdown-word-count](https://github.com/gandreadis/markdown-word-count).

This means all we need to do is:
- Recursively loop through a folder and find all the markdown files
- Call [@gandreadis](https://github.com/gandreadis)'s function to get the word count for each file
- Sum the word counts

First off, let's bring [@gandreadis](https://github.com/gandreadis)'s function into our program. It's been modified slightly to take a file path, read the file contents into a variable, clean up the contents, and return the number of words, i.e. the length of the list produced by splitting the cleaned text on whitespace.

For example, if the sentence (string) is 'How many words have I written in this blog?', calling `.split()` returns a list of the words, split on whitespace. Counting the elements of that list gives us the number of words.

In [6]:
testSentence = 'How many words have I written in this blog?'
print(testSentence.split())

['How', 'many', 'words', 'have', 'I', 'written', 'in', 'this', 'blog?']
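
One detail worth noting: `split()` with no arguments treats any run of whitespace (spaces, tabs, newlines) as a single separator and never produces empty strings, whereas `split(' ')` splits on every individual space only. A small sketch of the difference, using a made-up sentence:

```python
# Made-up sample containing a double space, a newline and a tab
messy = 'How many  words\nhave I\twritten?'

# No argument: any run of whitespace counts as one separator
by_whitespace = messy.split()

# Explicit single space: the double space yields an empty string,
# and the newline and tab are not treated as separators
by_space = messy.split(' ')

print(by_whitespace)
print(by_space)
```

This is why the word-count function can call `split()` directly after the cleanup, without first collapsing repeated spaces.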


In [5]:
# Source: https://github.com/gandreadis/markdown-word-count
def count_words_in_markdown(filePath: pathlib.Path) -> int:
    with open(filePath, 'r', encoding='utf8') as f:
        text = f.read()

    # Comments
    text = re.sub(r'<!--(.*?)-->', '', text, flags=re.MULTILINE)
    # Tabs to spaces
    text = text.replace('\t', '    ')
    # More than 1 space to 4 spaces
    text = re.sub(r'[ ]{2,}', '    ', text)
    # Footnotes
    text = re.sub(r'^\[[^]]*\][^(].*', '', text, flags=re.MULTILINE)
    # Indented blocks of code
    text = re.sub(r'^( {4,}[^-*]).*', '', text, flags=re.MULTILINE)
    # Replace newlines with spaces for uniform handling
    text = text.replace('\n', ' ')
    # Custom header IDs
    text = re.sub(r'{#.*}', '', text)
    # Remove images
    text = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', text)
    # Remove HTML tags
    text = re.sub(r'</?[^>]*>', '', text)
    # Remove special characters
    text = re.sub(r'[#*`~\-–^=<>+|/:]', '', text)
    # Remove footnote references
    text = re.sub(r'\[[0-9]*\]', '', text)
    # Remove enumerations
    text = re.sub(r'[0-9#]*\.', '', text)

    return len(text.split())
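
To get a feel for what the cleanup does, here is a minimal sketch that applies three of the substitutions from the function (comments, images, special characters) to an in-memory string rather than a file. The sample text is made up for illustration, and the remaining substitutions are left out to keep it short.

```python
import re

# Made-up markdown snippet: a heading, an HTML comment and an image
sample = '# Title\n\nHello <!-- hidden note --> world ![alt text](img.png)\n'

# Strip HTML comments
text = re.sub(r'<!--(.*?)-->', '', sample)
# Strip images: ![alt](url)
text = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', text)
# Strip markdown punctuation such as '#', '*' and '`'
text = re.sub(r'[#*`~\-–^=<>+|/:]', '', text)

words = text.split()
print(words)       # the words that survive the cleanup
print(len(words))  # the word count for this snippet
```

Only the visible prose survives, so the comment text and the image's alt text do not inflate the count.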

In [3]:
# Top directory to search through
topFolder: pathlib.Path = pathlib.Path.cwd().parent.parent.parent.parent.joinpath('content')

allMarkdown: List[pathlib.Path] = []

# Iterate through all files using pathlib
for singleFile in topFolder.glob('**/*'):
    if singleFile.suffix == '.md':
        allMarkdown.append(singleFile)

print(len(allMarkdown))

57
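
As an alternative to filtering on `suffix`, `pathlib` can do the matching itself: `Path.rglob('*.md')` behaves like `glob('**/*.md')`. The sketch below builds a throwaway folder tree (the folder and file names are made up) so it can run anywhere, independent of the blog's `content` directory.

```python
import pathlib
import tempfile

# Build a temporary folder tree with two markdown files and one text file
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / 'posts' / 'nested').mkdir(parents=True)
    (root / 'posts' / 'first.md').write_text('one two', encoding='utf8')
    (root / 'posts' / 'nested' / 'second.md').write_text('three', encoding='utf8')
    (root / 'posts' / 'notes.txt').write_text('ignored', encoding='utf8')

    # Recursively collect every path ending in .md, relative to the root
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.md'))

print(found)
```

Both approaches find the same files; `rglob` just moves the suffix check into the pattern.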


In [4]:
totalWordCount: int = 0
for singleFile in allMarkdown:
    totalWordCount += count_words_in_markdown(singleFile)

print(totalWordCount)

31623
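
The summing loop can also be written as a single `sum()` over a generator expression. A minimal sketch follows; `total_words` and `naive_count` are hypothetical names introduced here, and `naive_count` is a stand-in that skips the markdown cleanup so the example stays self-contained.

```python
import pathlib
import tempfile
from typing import Callable, Iterable


def total_words(files: Iterable[pathlib.Path],
                counter: Callable[[pathlib.Path], int]) -> int:
    # Hypothetical helper: add up the per-file counts in one expression
    return sum(counter(f) for f in files)


def naive_count(path: pathlib.Path) -> int:
    # Stand-in counter for this sketch only: no markdown cleanup
    return len(path.read_text(encoding='utf8').split())


# Throwaway files so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / 'a.md').write_text('one two three', encoding='utf8')
    (root / 'b.md').write_text('four five', encoding='utf8')
    total = total_words(root.rglob('*.md'), naive_count)

print(total)
```

With the code from this post, the equivalent call would be `total_words(allMarkdown, count_words_in_markdown)`.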
