### Scenarios:
1. Single file -> Single word cloud
2. Multiple files -> Single word cloud
3. Multiple files -> Multiple word clouds + Combined word cloud
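
As a rough sketch of how these scenarios map onto the notebook's settings, the helper below lists which word clouds a run would produce. `planned_word_clouds` is a hypothetical name used only for illustration. It assumes the behaviour of the notebook's main loop, where `INDIVIDUAL_WORD_CLOUDS` controls the per-file clouds and a combined cloud is always drawn.

```python
from typing import List


def planned_word_clouds(file_names: List[str], individual: bool) -> List[str]:
    """Return the titles of the word clouds a run would produce.

    Hypothetical helper: mirrors the main loop, which draws one cloud per
    file when `individual` is True and always draws a combined cloud.
    """
    titles = list(file_names) if individual else []
    titles.append("Combined Word Cloud")
    return titles


# Scenario 1: a single file, one cloud
print(planned_word_clouds(["report.pdf"], individual=False))
# Scenario 2: several files merged into one cloud
print(planned_word_clouds(["a.pdf", "b.docx"], individual=False))
# Scenario 3: one cloud per file plus the combined cloud
print(planned_word_clouds(["a.pdf", "b.docx"], individual=True))
```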

### Limitations:
* Text in images cannot be read. Here is a [workaround](https://www.thewindowsclub.com/extract-text-from-an-image-in-word) to extract text from images.
* Scanned PDFs need OCR. (A PDF was probably scanned if you cannot select its text with your mouse in a normal PDF viewer.)
    * State-of-the-art OCR methods are still not perfect, so some words will be misread
    * OCR takes longer to run than plain text extraction
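
The mouse-selection check can be approximated in code: a scanned page has no text layer, so a text extractor returns little or nothing for it. The sketch below is a heuristic and not part of the original notebook. `looks_scanned` and its `min_chars_per_page` threshold are hypothetical, and the page texts are assumed to come from something like `page.extract_text()` in pdfplumber.

```python
from typing import List, Optional


def looks_scanned(page_texts: List[Optional[str]], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF was scanned from its per-page extracted text.

    Hypothetical heuristic: treat the PDF as scanned when the average number
    of non-whitespace characters per page falls below `min_chars_per_page`.
    """
    if not page_texts:
        return True  # nothing extractable at all
    total = sum(len("".join((text or "").split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


# A digital PDF yields real text; a scanned one yields None or empty strings
print(looks_scanned(["Quarterly report for the finance team", "Revenue grew in every region"]))  # False
print(looks_scanned([None, "", "  "]))  # True
```

Fed with the text of each page of an opened PDF, a check like this could let the notebook choose OCR per file instead of relying on the `_scanned.pdf` naming convention. The threshold would need tuning on real documents.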

---

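Install the necessary libraries (`wordcloud` and `matplotlib` are assumed to be available in the runtime already):  
* pdfplumber
* NLTK
* PyMuPDF
* pytesseract, plus the Tesseract OCR engine it wraps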

In [62]:
!pip install pdfplumber

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [63]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [64]:
!pip install pymupdf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [65]:
!pip install pytesseract

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [66]:
!sudo apt update
!sudo apt install -y tesseract-ocr
!sudo apt install -y libtesseract-dev

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease

Import libraries

In [67]:
import re
import os
import zipfile

import fitz
import matplotlib.pyplot as plt
import nltk
import pdfplumber
import pytesseract

from collections import Counter
from nltk.corpus import stopwords
from os import listdir
from os.path import isfile, join
from wordcloud import WordCloud

Download NLTK stopwords

In [68]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Set parameters

In [69]:
DATA_PATH = "./data"
INDIVIDUAL_WORD_CLOUDS = False  # Set to True to also draw one word cloud per file
CONTAINS_SCANNED_PDFS = False  # Set to True if your data contains scanned PDFs but you cannot say which ones
SCANNED_PDFS_TAGGED = False  # Only set to True if ALL scanned PDFs are named so that the file name ends with _scanned.pdf
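
The two scanned-PDF flags interact, with `SCANNED_PDFS_TAGGED` taking precedence over `CONTAINS_SCANNED_PDFS`. The sketch below spells out that decision as a small function so the combinations can be checked. `needs_ocr` is a hypothetical helper, not part of the notebook, and it takes the flags as arguments rather than reading the globals.

```python
def needs_ocr(file_path: str, scanned_pdfs_tagged: bool, contains_scanned_pdfs: bool) -> bool:
    """Decide whether a PDF should go through OCR (hypothetical helper)."""
    if scanned_pdfs_tagged:
        # Trust the naming convention: only *_scanned.pdf files are OCR'd
        return file_path.endswith("_scanned.pdf")
    # Otherwise OCR everything if scanned files may be present
    return contains_scanned_pdfs


# Print the decision for a tagged and an untagged file under each flag combination
for tagged, contains in [(True, False), (True, True), (False, True), (False, False)]:
    print(tagged, contains,
          needs_ocr("./data/a_scanned.pdf", tagged, contains),
          needs_ocr("./data/b.pdf", tagged, contains))
```

With both flags left at `False`, every PDF is treated as digital, which is the fastest path but yields no text for scanned files.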

Main functions

In [70]:
def ocr_pdf(file_path: str) -> str:
    """Render each page to an image and read its text with Tesseract OCR."""
    text = ""
    output = "outfile.png"
    doc = fitz.open(file_path)
    for page in doc:
        pix = page.get_pixmap()
        pix.save(output)
        text += pytesseract.image_to_string(output).lower() + " "
        os.remove(output)
    return text


def extract_pdf_text(file_path: str) -> str:
    """Read the text layer of a digital (non-scanned) PDF."""
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # Guard against pages with no text layer, for which
            # extract_text() may return None
            text += (page.extract_text() or "").lower() + " "
    return text


def read_file(file_path: str) -> str:
    file_string = ""
    try:
        # PDF
        if file_path.endswith('.pdf'):
            if SCANNED_PDFS_TAGGED:  # You know which PDFs were scanned
                scanned = file_path.endswith("_scanned.pdf")
            else:  # You don't know which ones, so treat all or none as scanned
                scanned = CONTAINS_SCANNED_PDFS
            file_string = ocr_pdf(file_path) if scanned else extract_pdf_text(file_path)
        # Word document (.docx is a zip archive; a legacy .doc is not and
        # ends up in the except branch below)
        elif file_path.endswith(('.doc', '.docx')):
            with zipfile.ZipFile(file_path) as docx:
                xml = docx.read('word/document.xml').decode('utf-8')
            xml = xml.replace('</w:p>', ' ')  # Keep a space between paragraphs
            file_string = re.sub(r'<[^>]+>', '', xml).lower()
        # Plain text
        else:
            with open(file_path, 'r') as f:
                file_string = f.read().lower()
    except Exception as e:
        print(f"Error: {e}")
    return file_string
  

def create_word_cloud(text: str, title: str) -> None:
    # Remove punctuation (anything that is not a word character or whitespace)
    re_pattern = re.compile(r'[^\w\s]', re.UNICODE)
    text = re_pattern.sub('', text)

    # Remove unnecessary words (stop words) like "the", "and", etc.
    words_to_count = text.split()  # Split text into a list of words
    stop_word_set = set(stopwords.words('english'))
    words_to_count = [word for word in words_to_count if word not in stop_word_set]

    # Count the words using Python's Counter
    word_cloud_dict = Counter(words_to_count)

    # WordCloud cannot draw an empty cloud. No words usually means a scanned
    # or empty document was read without the right parameters.
    if not word_cloud_dict:
        print(f"No words found for '{title}'. Things to check...")
        print("1. Make sure the data folder contains at least one file")
        print("2. If you know which PDFs are scanned, rename them to end with _scanned.pdf, and set SCANNED_PDFS_TAGGED to True")
        print("3. If you do not know which PDFs are scanned, set CONTAINS_SCANNED_PDFS to True")
        return

    # Create the word cloud from the counted words
    wordcloud = WordCloud(max_font_size=40, background_color="white").generate_from_frequencies(word_cloud_dict)

    # Display and save the generated image
    plt.figure(figsize=(16, 10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(title)
    plt.savefig(f"./output/{title}.png")
    plt.show()
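
To see what `create_word_cloud` counts before any drawing happens, the same steps can be run on a short string. This is a minimal sketch: `count_words` is a hypothetical helper, and the tiny hard-coded stop-word set stands in for NLTK's English list so the example runs without downloads.

```python
import re
from collections import Counter

# Stand-in for set(stopwords.words('english')); an assumption for illustration
STOP_WORDS = {"the", "and", "a", "is", "on", "don't"}


def count_words(text: str, stop_words=STOP_WORDS) -> Counter:
    """Strip punctuation, drop stop words, and count what remains."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return Counter(word for word in text.split() if word not in stop_words)


counts = count_words("The cat sat on the mat, and the cat purred. Don't wake the cat!")
print(counts["cat"])     # 3
print("the" in counts)   # False
print(counts["dont"])    # 1
```

The last line shows a side effect of the ordering: punctuation is stripped before stop words are removed, so a contraction such as "don't" becomes "dont" and slips past the stop-word filter. Filtering stop words first, or stripping punctuation from the stop-word list as well, would likely avoid this.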

In [78]:
combined_file_text = ""
os.makedirs('output', exist_ok=True)  # create_word_cloud saves its images here
for file_name in listdir(DATA_PATH):
    file_path = join(DATA_PATH, file_name)
    if not isfile(file_path):
        continue  # Skip sub-directories
    print(file_path)
    file_text = read_file(file_path)
    if INDIVIDUAL_WORD_CLOUDS:
        create_word_cloud(file_text, file_name)
    # Separate files with a space so boundary words do not merge
    combined_file_text += file_text + " "
print("Combined Word Cloud")
create_word_cloud(combined_file_text, "Combined Word Cloud")

./data/Kwadwo Agyapon-Ntra_Resume.pdf
Combined Word Cloud


AttributeError: ignored

In [76]:
!echo "hello there and hello again and again" | wordcloud_cli --imagefile wordcloud.png

In [None]:
!sudo apt-get update -y
!sudo apt-get install -y python3.8

In [77]:
!pip install Pillow==8.3.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Pillow==8.3.1
  Downloading Pillow-8.3.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (3.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 34.9 MB/s eta 0:00:00
Installing collected packages: Pillow
  Attempting uninstall: Pillow
    Found existing installation: Pillow 9.1.1
    Uninstalling Pillow-9.1.1:
      Successfully uninstalled Pillow-9.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.12.0+cu113 requires pillow!=8.3.*,>=5.3.0, but you have pillow 8.3.1 which is incompatible.
pdfplumber 0.7.1 requires Pillow>=9.1, but you have pillow 8.3.1 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompati