<a href="https://colab.research.google.com/github/Ayuathm/Job_market_Analysis_AI_SSD/blob/main/Job_Text_Extraction_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📄 Job Attachment Text Extraction (PDF + DOCX with OCR)

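This notebook extracts text from job advert attachments (PDF and DOCX files) stored in a Google Drive folder. PDFs with no embedded text, such as scans, are processed with OCR. The extracted text is saved to a CSV file for further analysis.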

In [None]:
# ✅ Step 1: Install required packages
!apt-get install poppler-utils tesseract-ocr -y
!pip install pdf2image pytesseract python-docx docx2txt PyMuPDF

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 186 kB of archives.
After this operation, 697 kB of additional disk space will be used.
Ign:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.8
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.8 [186 kB]
Fetched 186 kB in 1min 49s (1,710 B/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 126109 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.8_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.8) ...
Setting up poppler-utils (22.02.0-2ubuntu0.8) ...
Processing triggers for man-db (2.10.2-1) ...
Collecting pdf2i
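
Both `pytesseract` and `pdf2image` are wrappers around command-line tools, so a failed system install only surfaces later as a runtime error during extraction. A minimal sketch of a pre-flight check, assuming the binaries are named `tesseract` and `pdftoppm` (the usual names on Ubuntu); `missing_binaries` is a hypothetical helper, not part of either library:

```python
import shutil

# Binaries the Python wrappers shell out to: pytesseract runs `tesseract`,
# and pdf2image runs poppler's `pdftoppm` (shipped in poppler-utils).
# These names are assumptions based on a standard Ubuntu install.
REQUIRED_BINARIES = ["tesseract", "pdftoppm"]

def missing_binaries(names=REQUIRED_BINARIES):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

missing = missing_binaries()
if missing:
    print("Missing system binaries:", ", ".join(missing))
else:
    print("All required system binaries found.")
```

Running this right after the install cell makes a broken environment visible before any files are processed.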

In [None]:
# ✅ Step 2: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
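
Once the drive is mounted, it can help to confirm that the attachments folder exists and to see which file types it contains before running the extraction. A small sketch using only the standard library; `count_by_extension` is a hypothetical helper name:

```python
import os
from collections import Counter

def count_by_extension(folder):
    """Count the files in `folder` by lowercase extension.

    Raises FileNotFoundError if the folder does not exist, which usually
    means the Drive mount failed or the path is misspelled.
    """
    if not os.path.isdir(folder):
        raise FileNotFoundError(f"Folder not found: {folder}")
    return Counter(
        os.path.splitext(name)[1].lower()
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )

# Example call (path taken from this notebook; adjust to your own folder):
# print(count_by_extension("/content/drive/MyDrive/Job_test"))
```

Extensions other than `.pdf` and `.docx` in the result are files that the extraction step will skip.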


In [None]:
# ✅ Step 3: Define utility functions for extraction
import os
import fitz  # PyMuPDF
import pytesseract
import docx2txt
import pandas as pd
from pdf2image import convert_from_path

# 👇 UPDATE your folder path
attachments_folder = "/content/drive/MyDrive/Job_test"

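def extract_text_from_pdf(file_path):
    """Return the text of a PDF, using OCR only if no embedded text is found."""
    try:
        # First pass: read the embedded text layer with PyMuPDF.
        text = ""
        with fitz.open(file_path) as doc:
            for page in doc:
                page_text = page.get_text()
                if page_text.strip():
                    text += page_text
        if text.strip():
            return text
        # No text layer at all (likely a scanned PDF): render every page
        # to an image with poppler and run Tesseract on each one.
        images = convert_from_path(file_path)
        ocr_text = ""
        for img in images:
            ocr_text += pytesseract.image_to_string(img)
        return ocr_text
    except Exception as e:
        # The error is returned as text so one bad file does not stop the batch.
        return f"ERROR: {e}"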

def extract_text_from_docx(file_path):
    """Return the text of a DOCX file, or an "ERROR: ..." string on failure."""
    try:
        return docx2txt.process(file_path)
    except Exception as e:
        return f"ERROR: {e}"
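
`extract_text_from_pdf` falls back to OCR only when the whole document has no embedded text, so a PDF that mixes digital and scanned pages would lose the scanned ones. One possible refinement is to decide page by page. The sketch below shows only that decision logic: `combine_page_texts` and its parameters are hypothetical names, and the OCR step is passed in as a callable so the logic can be run without Tesseract installed.

```python
def combine_page_texts(native_texts, ocr_page, min_chars=1):
    """Merge per-page text, calling OCR only for pages with too little embedded text.

    native_texts: list of strings, one per page (e.g. from page.get_text()).
    ocr_page: callable taking a zero-based page index and returning OCR text.
    min_chars: pages with fewer non-whitespace characters than this use OCR.
    """
    parts = []
    for index, native in enumerate(native_texts):
        non_space = len("".join(native.split()))
        if non_space >= min_chars:
            parts.append(native)           # the text layer is usable as-is
        else:
            parts.append(ocr_page(index))  # blank page text: fall back to OCR
    return "\n".join(parts)

# Illustrative run with a stand-in OCR function (made-up page contents).
fake_ocr = lambda i: f"[ocr page {i}]"
print(combine_page_texts(["Job title: Nurse", "   \n", "Apply by email"], fake_ocr))
```

In this notebook, `ocr_page` could wrap `pytesseract.image_to_string` applied to a single page rendered with `convert_from_path(file_path, first_page=i + 1, last_page=i + 1)`, which avoids rendering pages that already have a text layer.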

In [None]:
# ✅ Step 4: Run extraction for all files
all_files = [f for f in os.listdir(attachments_folder) if f.lower().endswith(('.pdf', '.docx'))]
extracted_data = []

for file in all_files:
    full_path = os.path.join(attachments_folder, file)
    if file.lower().endswith('.pdf'):
        text = extract_text_from_pdf(full_path)
    elif file.lower().endswith('.docx'):
        text = extract_text_from_docx(full_path)
    else:
        text = "Unsupported"
    extracted_data.append({"filename": file, "text": text})
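
Because the extraction functions return `"ERROR: ..."` strings, failures end up in the same `text` column as real content and have to be filtered out later. An alternative is to record the outcome in separate fields. This is a sketch under the assumption that the extractors raise exceptions instead of returning error strings; `extract_with_status` and the `status` and `error` fields are hypothetical names.

```python
import os

def extract_with_status(path, extractors):
    """Run the extractor matching the file extension and record the outcome.

    extractors: dict mapping a lowercase extension (e.g. '.pdf') to a callable
    that takes a file path and returns text, raising an exception on failure.
    Returns a dict with 'filename', 'text', 'status' and 'error' keys, where
    status is one of 'ok', 'empty', 'error' or 'unsupported'.
    """
    name = os.path.basename(path)
    ext = os.path.splitext(name)[1].lower()
    extractor = extractors.get(ext)
    if extractor is None:
        return {"filename": name, "text": "", "status": "unsupported", "error": ""}
    try:
        text = extractor(path)
    except Exception as exc:
        # Keep the batch running; store the message apart from the text.
        return {"filename": name, "text": "", "status": "error", "error": str(exc)}
    status = "ok" if text and text.strip() else "empty"
    return {"filename": name, "text": text or "", "status": status, "error": ""}
```

With records shaped like this, the resulting DataFrame can be filtered on `status` to list failed or empty files without searching the text for an `ERROR:` prefix.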

In [None]:
# ✅ Step 5: Save extracted data to CSV
df = pd.DataFrame(extracted_data)
output_path = os.path.join(attachments_folder, "extracted_job_texts.csv")
df.to_csv(output_path, index=False)
print(f"✅ Extracted text saved to: {output_path}")
df.head()

✅ Extracted text saved to: /content/drive/MyDrive/Job_test/extracted_job_texts.csv


index,filename,text
0,9r1Fa5YxSuQ14r1upV08RZpoGeA.pdf,\n \n\n@ W658\nZOAG Ge Dercas\nSouth Sudan ...
1,tqQRLgO01TK5fOvpeGs90zg929i.pdf,\n\npy\nZOA-Dorcas South Sudan is an internat...
2,iOUNarYEhq2pxR41H6B68xuPMQQ.pdf,6 May 2024\n\n \n\nSave the Children\n\nJob ...
3,gNVtYv182piJQ2mtIhT1Gwjz0Iw.pdf,\n\n \n\n \n\n- a\n\n@\n\nsondage sos\n\n \n\...
4,mh3Yd6Ektz5oNYCOfHbppwfCWb5.pdf,CamScanner\nCamScanner\nCamScanner\nCamScanner...
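
Some rows in this preview are mostly OCR noise, for example stray symbols or a repeated `CamScanner` watermark. A rough way to flag such rows for manual review is sketched below. The function names and thresholds are illustrative assumptions, not tuned values, and the heuristics assume mostly alphabetic text.

```python
def alpha_ratio(text):
    """Share of non-whitespace characters that are letters (0.0 for empty text)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)

def distinct_line_ratio(text):
    """Share of non-empty lines that are distinct (0.0 for empty text)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return len(set(lines)) / len(lines)

def needs_review(text, min_chars=200, min_alpha=0.6, min_distinct=0.5):
    """Flag text that is very short, mostly symbols, or mostly repeated lines."""
    stripped = text.strip()
    return (
        len(stripped) < min_chars
        or alpha_ratio(stripped) < min_alpha
        or distinct_line_ratio(stripped) < min_distinct
    )

# A page that is nothing but a repeated watermark line is flagged.
print(needs_review("CamScanner\n" * 50))
```

Applied to the DataFrame, this could look like `df["needs_review"] = df["text"].fillna("").map(needs_review)`. The flagged rows are candidates for re-scanning or for OCR with different settings.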
