# Playing with libs for string operations

Python is fun, yet complex! Libraries are one way to make some problems easier.\
Standard-library modules such as `re`, `collections` and `functools` are pillars of clean, optimized code.\
Do not rewrite code: find a library that does what you want, then extend it with more functionality as needed.

In [1]:
from string import whitespace, punctuation, digits
from timeit import timeit
from collections import Counter, defaultdict
import functools
import re

# str.join is usually faster and more memory friendly than repeated string concatenation (especially in loops)
non_chars = "".join([whitespace, punctuation, digits])
text = " ".join([
    "Python's fun, yet complex ! Using libraries is a way to ease your",
    "life around some problems. Built-ins 'fonctionalités' of python like re,",
    "collections or functools are pillars for a clean, optimized code.\\\n",
    "Do not rewrite code, enhance it!",
    "Now, Do It Yourself ;)"
])


def classic_get_words(_string: str) -> list[str]:
    """
    A common hand-written way to extract the words of a text; it is poorly optimized.
    On top of being slow, it does not allow proper counting without further operations.
    Small changes tend to make it harder to debug and can lead to unwanted results.
    """
    # strip removes leading and trailing whitespace; split() then splits on any whitespace, newlines included
    words = _string.strip().split()
    for word_index in range(len(words)):
        # If the word is alphabetic, we do not need further operations
        if not words[word_index].isalpha():
            alpha_word = ""
            for letter in words[word_index]:
                if letter not in non_chars or letter in ('\'', '-', '_'):
                    # "or letter in (...)" keeps 's and linked-_-words
                    # The test that succeeds most often comes first: `or` short-circuits,
                    # so the tuple lookup only runs for characters found in non_chars.
                    # PS: with the two tests swapped, the process would be slower!
                    alpha_word += letter

            words[word_index] = alpha_word
    # returns only words that are not empty.
    return [word for word in words if word]

def regex_get_words(_string: str) -> list[str]:
    """
    On top of being quite clear and easy to adapt to specific cases,
    a regex is also pretty fast at gathering anything you want from a text.
    For str patterns in Python 3, \\w matches any Unicode word character (letters, digits and '_');
    it is only equivalent to [a-zA-Z0-9_] when the re.ASCII flag is set.
    We only want words, so we use [a-zA-ZÀ-ÿ'_-]+
    Where:
    - [a-zA-Z] is the same as string.ascii_letters
    - [À-ÿ] covers Latin-1 accented letters (the range also contains × and ÷)
    - ['_-] allows us to capture 's and linked-_-words
    - + means 'match as many of these characters in a row as possible';
      the first character outside the set ends the current word

    We could also split on the negated class [^a-zA-ZÀ-ÿ'_-]+ with re.split (and drop empty strings)
    """
    words = re.findall(r"[a-zA-ZÀ-ÿ'_-]+", _string)
    return words

if classic_get_words(text) != regex_get_words(text):
    print("Ouch, functions should return the same")
print()


# To get a reliable timing for such small functions, we have to run them many times
nb_runs = 100000
print(f"first, let's compare classic approach to regex findall  on {nb_runs} runs:")
print("classic :",
      f"{timeit(functools.partial(classic_get_words, text), number=nb_runs):.3f}s")
print("regex :",
      f"{timeit(functools.partial(regex_get_words, text), number=nb_runs):.3f}s")


first, let's compare classic approach to regex findall  on 100000 runs:
classic : 2.007s
regex : 1.219s


As we can see, __regex__ is about 1.6 times as fast as the 'classic' method here (other optimized solutions exist).

With that settled, the next tests use only the regex method, since it is faster.
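
One of those other options, sketched here under assumptions: compile the pattern once with `re.compile` and reuse the compiled object. The names `WORD_PATTERN` and `compiled_get_words` are illustrative and not part of this notebook. The `re` module already caches recently used patterns, so any gain is likely small and should be measured rather than assumed.

```python
import re

# Hypothetical module-level constant: same character class as regex_get_words.
WORD_PATTERN = re.compile(r"[a-zA-ZÀ-ÿ'_-]+")


def compiled_get_words(_string: str) -> list[str]:
    """Return the words of _string using a pattern compiled once, up front."""
    return WORD_PATTERN.findall(_string)


sample = "Python's fun, yet complex ! Built-ins 'fonctionalités' ;)"
print(compiled_get_words(sample))
```

It can be timed with the same `timeit(functools.partial(...))` call as the two functions above.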

### Next step:
We only want to compare the counting methods, so the regex extraction is run once beforehand and is not timed.

In [2]:
def counter_count_words(words):
    counter = Counter(words)
    return counter


def defaultdict_count_words(words):
    my_counter = defaultdict(int)
    for word in words:
        # note: .get() never triggers the default factory, so this mirrors dict_count_words
        my_counter[word] = my_counter.get(word, 0) + 1
    return my_counter


def dict_count_words(words):
    my_counter = {}
    for word in words:
        my_counter[word] = my_counter.get(word, 0) + 1
    return my_counter


# not timed
words = regex_get_words(text)
print()
print(f"then, if we want to count words occurrences in a text, using Counter or dicts  on {nb_runs} runs:")
print("default_dict :", 
      f"{timeit(functools.partial(defaultdict_count_words, words), number=nb_runs):.3f}s")
print("dict :", 
      f"{timeit(functools.partial(dict_count_words, words) , number=nb_runs):.3f}s")
print("Counter :", 
      f"{timeit(functools.partial(counter_count_words, words), number=nb_runs):.3f}s")


then, if we want to count words occurrences in a text, using Counter or dicts  on 100000 runs:
default_dict : 0.714s
dict : 0.666s
Counter : 0.336s


In these runs, regex and Counter are the fastest of the approaches tested for counting words, and they also take less code.
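
A side note on the `defaultdict` variant timed above: it calls `.get(word, 0)`, which never uses the default factory, so it behaves like the plain `dict` version. A minimal sketch of the more usual `defaultdict` idiom follows. The function name `defaultdict_increment_count` is hypothetical, and this version is not part of the timings shown above.

```python
from collections import defaultdict


def defaultdict_increment_count(words):
    """Count words by letting defaultdict(int) supply the initial 0."""
    my_counter = defaultdict(int)
    for word in words:
        # A missing key is created with int() == 0 before the increment.
        my_counter[word] += 1
    return my_counter


print(dict(defaultdict_increment_count(["a", "b", "a"])))
```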

Let's see what else Counter offers, on top of being fast:
https://docs.python.org/3/library/collections.html#collections.Counter

In [3]:
counter = Counter(words)

print("elements of counter :\n", counter.elements())
print("\ntop 4 used words :\n", counter.most_common(4))

print("\nIf you want to compare two texts and showcase similarities or differences :")
second_words = regex_get_words("a string that have some words to substract from the other text.")
second_counter = counter_count_words(second_words)
counter.subtract(second_words)
print("difference of two texts :\n", counter)

print("\nIf you only want to take elements that are on a positive count :\n", +counter)

print("total words :\n", counter.total())


elements of counter :
 <itertools.chain object at 0x0000028240825700>

top 4 used words :
 [('a', 2), ('code', 2), ('Do', 2), ("Python's", 1)]

If you want to compare two texts and showcase similarities or differences :
difference of two texts :
 Counter({'code': 2, 'Do': 2, "Python's": 1, 'fun': 1, 'yet': 1, 'complex': 1, 'Using': 1, 'libraries': 1, 'is': 1, 'a': 1, 'way': 1, 'ease': 1, 'your': 1, 'life': 1, 'around': 1, 'problems': 1, 'Built-ins': 1, "'fonctionalités'": 1, 'of': 1, 'python': 1, 'like': 1, 're': 1, 'collections': 1, 'or': 1, 'functools': 1, 'are': 1, 'pillars': 1, 'for': 1, 'clean': 1, 'optimized': 1, 'not': 1, 'rewrite': 1, 'enhance': 1, 'it': 1, 'Now': 1, 'It': 1, 'Yourself': 1, 'to': 0, 'some': 0, 'string': -1, 'that': -1, 'have': -1, 'words': -1, 'substract': -1, 'from': -1, 'the': -1, 'other': -1, 'text': -1})

If you only want to take elements that are on a positive count :
 Counter({'code': 2, 'Do': 2, "Python's": 1, 'fun': 1, 'yet': 1, 'complex': 1, 'Using': 1

AttributeError: 'Counter' object has no attribute 'total'

According to the documentation, `Counter.total()` exists, but it was only added in Python 3.10, and this notebook runs on an older version.

Also, while we are at it, remember `<itertools.chain object at 0x00....>`? We could make `print(counter.elements())` show the elements directly, without having to iterate through them with `[elem for elem in counter.elements()]`.
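
Before subclassing anything, both gaps can be covered with built-ins alone. A small sketch, using a made-up word list (`sample_counter`) rather than the notebook's `words`:

```python
from collections import Counter

sample_counter = Counter(["code", "Do", "code", "it"])

# Does not depend on Counter.total(): add up the counts by hand.
total = sum(sample_counter.values())

# Materialize the iterator returned by elements() so it can be printed.
elements = list(sample_counter.elements())

print(total)
print(elements)
```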

Now, if we were to create it ourselves, we could do something like __this__:

In [4]:
class MyCounter(Counter):
    """Counter with a total() method and a list-returning elements().

    No __init__ override is needed: Counter's own initializer is inherited.
    """

    def total(self):
        """Add up each word's count."""
        return sum(self.values())

    def elements(self):
        """Return the elements as a list instead of an itertools.chain object."""
        return list(super().elements())


counter = MyCounter(words)
print(counter.total())
print(counter.elements())

42
["Python's", 'fun', 'yet', 'complex', 'Using', 'libraries', 'is', 'a', 'a', 'way', 'to', 'ease', 'your', 'life', 'around', 'some', 'problems', 'Built-ins', "'fonctionalités'", 'of', 'python', 'like', 're', 'collections', 'or', 'functools', 'are', 'pillars', 'for', 'clean', 'optimized', 'code', 'code', 'Do', 'Do', 'not', 'rewrite', 'enhance', 'it', 'Now', 'It', 'Yourself']


### We made it!
You now have a MyCounter class that has a `.total()` method and whose `.elements()` returns a list.

What else could we do with it?
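
One possible answer, offered as a sketch rather than a recommendation: add helpers for relative frequencies and for the words two texts share. `TextCounter`, `frequencies` and `shared_with` are hypothetical names, not part of `collections`.

```python
from collections import Counter


class TextCounter(Counter):
    """Hypothetical Counter subclass with two text-oriented helpers."""

    def frequencies(self):
        """Map each word to its share of the total count."""
        total = sum(self.values())
        if total == 0:
            return {}
        return {word: count / total for word, count in self.items()}

    def shared_with(self, other):
        """Words present in both counters, keeping the smaller count."""
        # Counter's & operator is a multiset intersection: min(count_a, count_b).
        return self & other


first = TextCounter(["do", "not", "rewrite", "code", "code"])
second = TextCounter(["enhance", "code", "do", "do"])

print(first.frequencies()["code"])
print(first.shared_with(second))
```

The `&` operator returns a plain `Counter`, so the result of `shared_with` does not carry the extra helpers.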
