# How to convert unstructured data to structured data

**Author**: Erika Russi

In this tutorial, you will use IBM’s open source [Docling](https://github.com/docling-project/docling) with Python to convert [unstructured data](https://www.ibm.com/think/topics/unstructured-data) contained in a group of scanned files into a [structured](https://www.ibm.com/think/topics/structured-vs-unstructured-data) format. 

## Structured vs. unstructured data
Structured data is information organized into fixed fields within a record or file. It resides in [SQL](https://www.ibm.com/think/topics/structured-query-language) databases, JSON from [APIs](https://www.ibm.com/think/topics/api), XML and CSV files and Excel spreadsheets. Because its layout is known in advance, structured data is ready for efficient [data processing](https://www.ibm.com/think/topics/data-processing), analysis and management.

By contrast, unstructured data is information that does not conform to a predefined [data model](https://www.ibm.com/think/topics/data-modeling) or schema. It lacks an organized, tabular form and is typically text-heavy. Examples include emails, social media posts and customer reviews as well as non-text formats such as audio recordings, video files and images. 
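
To make the distinction concrete, here is a minimal sketch that turns one free-text sentence into a record with named fields. The sample sentence, the field names and the regular expression are illustrative assumptions and are not part of this tutorial's dataset.

```python
import re
from typing import Optional

# Hypothetical free-text note: the facts are present, but not in named fields.
unstructured = "Invoice 1042 was issued to Acme Corp on 2024-03-15 for $1,250.00."

# Illustrative pattern that captures each fact as a named group.
pattern = re.compile(
    r"Invoice (?P<invoice_id>\d+) was issued to (?P<customer>.+?) "
    r"on (?P<date>\d{4}-\d{2}-\d{2}) for \$(?P<amount>[\d,]+\.\d{2})"
)


def to_record(text: str) -> Optional[dict]:
    """Return a structured record, or None if the text does not match."""
    match = pattern.search(text)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields from strings to numbers.
    record["invoice_id"] = int(record["invoice_id"])
    record["amount"] = float(record["amount"].replace(",", ""))
    return record


print(to_record(unstructured))
```

A hand-written pattern like this only works when the wording is predictable, which is why scanned documents usually call for OCR and layout-aware tooling instead.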

Unstructured data makes up the vast majority (90%) of enterprise information, growing faster than any other type of data.<sup>[1]</sup>
Certain industries, such as healthcare or logistics and supply chain, can hold large volumes of scanned documents saved as images and awaiting processing. Although rich in information, unstructured data is challenging for conventional [databases](https://www.ibm.com/think/topics/database) and [data analysis](https://www.ibm.com/think/topics/big-data-analytics) tools to process directly.

## The importance of conversion
The conversion from unstructured to structured data is vital because structured information is readily interpretable by machines and [algorithms](https://www.ibm.com/think/topics/machine-learning-algorithms). It enables:
- Analysis automation: Running [real-time](https://www.ibm.com/think/topics/real-time-data) queries, generating reports and performing statistical analysis.
- [Business intelligence](https://www.ibm.com/think/topics/business-intelligence): Extracting valuable insights for [decision-making](https://www.ibm.com/think/topics/data-driven-decision-making).
- [Machine learning (ML)](https://www.ibm.com/think/topics/machine-learning) model readiness: Providing clean, organized inputs for ML models to learn from.
- AI-powered solutions: Enabling advanced analytics powered by [AI models](https://www.ibm.com/think/topics/ai-model) or [retrieval augmented generation (RAG)](https://www.ibm.com/think/topics/retrieval-augmented-generation) applications by using [generative AI](https://www.ibm.com/think/topics/generative-ai).

## The conversion process
The goal of the unstructured to structured data [conversion process](https://www.ibm.com/think/topics/unstructured-data-processing) is to transform raw, unstructured inputs into structured or semi-structured outputs that analytics and AI systems can consume.

After collecting the unstructured text or data sources, the data must be processed. This stage transforms raw data into usable data through a series of functions that help ensure every [dataset](https://www.ibm.com/think/topics/dataset) maintains accuracy and structure throughout the process. Some techniques include:
- [Optical character recognition (OCR)](https://www.ibm.com/think/topics/optical-character-recognition)—converts scanned documents or images into machine-readable text
- [Natural language processing (NLP)](https://www.ibm.com/think/topics/natural-language-processing)—pre-processes text and can be used for keyword or [feature extraction](https://www.ibm.com/think/topics/feature-extraction)
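
As a rough illustration of the NLP step, the sketch below counts the most frequent meaningful words in a piece of text. It is a simplified, standard-library stand-in for keyword extraction; the stop-word list, the function name `extract_keywords` and the sample text are assumptions made for this example.

```python
import re
from collections import Counter

# Small illustrative stop-word list; real pipelines use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "was"}


def extract_keywords(text: str, top_n: int = 3) -> list:
    """Return the top_n most frequent tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical text, standing in for what an OCR step might return.
ocr_text = (
    "Shipment delayed. The shipment was held in customs. "
    "Customs released the shipment."
)
print(extract_keywords(ocr_text, top_n=2))
```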


## Prerequisites 

To run this tutorial, you need to have [Python installed](https://www.python.org/downloads/). The tutorial runs reliably with Python 3.13.

## Steps

### Step 1. Set up your environment
You can run the code in this tutorial in two ways: use IBM® watsonx.ai® to follow along step by step, or clone our [GitHub repository](https://github.com/IBM/ibmdotcom-tutorials) to run the full Jupyter Notebook.

#### Option 1: Use watsonx.ai
Follow these steps to set up an IBM account and use a Jupyter Notebook.

1. You need an [IBM Cloud® account](https://cloud.ibm.com/registration) to create a [watsonx.ai](https://www.ibm.com/products/watsonx-ai) project.
2. Create a [watsonx.ai](http://watsonx.ai) [project](https://www.ibm.com/docs/en/watsonx/saas?topic=projects-creating-project) by using your IBM Cloud account.
3. Create a [Jupyter Notebook](https://www.ibm.com/docs/en/watsonx/saas?topic=editor-creating-managing-notebooks).

    This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your [watsonx.ai](http://watsonx.ai) project as an asset. 

#### Option 2: Run the tutorial locally
1. Several Python versions can work for this tutorial. At the time of publishing, we recommend [downloading](https://www.python.org/downloads/) Python 3.13, the latest version.
2. In your preferred IDE, clone the GitHub repository by using `https://github.com/IBM/ibmdotcom-tutorials.git` as the HTTPS URL. For detailed steps on how to clone a repository, refer to the [GitHub documentation](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository). 

    This tutorial can be found inside the docs/tutorials/docling directory.

3. Inside a terminal, create a virtual environment to avoid Python dependency issues.

    ```sh
    python3.13 -m venv myvenv
    source myvenv/bin/activate
    ```

4. Then, navigate to this tutorial's directory.

    ```sh
    cd docs/tutorials/docling
    ```

### Step 2. Install and import relevant libraries 
We need a few libraries for this tutorial. If they are not already installed, a quick pip installation takes care of it. The `-q` flag reduces pip's output so that the notebook stays readable.

The two libraries used here are `docling` and `pandas`. We will use the OCR support in [open source](https://www.ibm.com/think/topics/open-source) [Docling](https://github.com/docling-project/docling) to parse JPG files; similar OCR tools are available from OpenAI and AWS. Pandas will be used to display the data extracted from the scanned images as structured dataframes.

We will use scanned images from this [Kaggle Scanned Document: Table Dataset](https://www.kaggle.com/datasets/jayasooryantm/table-extraction?select=10.1.1.1.2019_2.jpg).

```python
! pip install -q docling pandas
```
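
After the installation finishes, a quick check can confirm which versions were picked up. This is a small sketch that uses the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("docling", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```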

### Step 3. Convert scanned images

In this example, from a set of source JPG files, we use Docling to convert the documents from scanned images into text and tables. 

We first define `sources`, a list containing the URLs of two JPG images. To convert them, we create a `PdfPipelineOptions` object with `do_table_structure`, `do_ocr` and `generate_picture_images` all set to `True`. Because the inputs are images, we register these options under `InputFormat.IMAGE` (and under `InputFormat.PDF`, so that the same settings apply to PDF inputs). Once the format options are established in `format_options`, we can use `DocumentConverter` to convert the sources and save the results in a dictionary, `conversions`, keyed by source.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import (
    DocumentConverter,
    ImageFormatOption,
    PdfFormatOption,
)

# Signed Kaggle download URLs for two scanned pages. Signed URLs expire
# (note the X-Goog-Expires parameter), so swap in fresh links or local file
# paths if the download fails.
sources = ["https://storage.googleapis.com/kagglesdsdata/datasets/6696558/10791317/10.1.1.1.2019_2.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20260216%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260216T162351Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=452bf02297d455b8d09229a59e758b3bf268c45e67c97b9a237b36b2dc9d365bee8d0d83e21fa55df78ec9224e8a64edfc7f25312dc3bfc6a4026a62a79ebb8a72a85856e139b18753d5f60c795bb452c9176b7fd360d400f7af5399e22c615284001af5d68df20c8b25975b86452856391a2bd37ba0354453011dcba6757f6617f7d90d82d660df1777c618d7877b835fd848cfda7cbfae7e14f3566542d9108d8e3f53e8956779a1a6d1afedf42d680411bf280f9827a7410631d708e94c1d03d83f6b40226ccb0bcfa4e9a76fa62aa2c4ea5d8869566f0a4fe9dd9dd0a47a9d46714459173ce0182fcd4e9658c2b2e1a19c5ab27b454bab24e53ffba23cc2",
           "https://storage.googleapis.com/kagglesdsdata/datasets/6696558/10791317/10.1.1.1.2019_3.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20260212%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260212T215329Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=afb95356cb9ab3c186c37380b3b5d01a80a68a25cf5793a0877dad8e7948e85ed0b31e9352c347f7ef6810490391998dd05ea4abe15993ff2041eafda3831ba7eb9e583e2e200dd330664f0210e389e4c3a8cedae0ba09ac48d4bb3b7e95b0b9c07abb21fa9d9149c6f936bad6a3e7c29de4d96bbe2da8962168c166a482d89a9257d21a19ccebf5c24e4248d0c87aafffd29039a95559a2bea28d30e89e08afce27d894755809b1e19813e7fec56d20bc7922af62152f3d35d180bf8bbb2a2ef36ae4d71cc443b3ca6a2b93ea40d452a81b795f2a098aaa9969798d31b125c43c50590a0b1a59158e96da682499e5d9e96e4103cad68c9185c522921f73c60f"
 ]

# Turn on OCR and table-structure recognition for the scanned pages.
pdf_pipeline_options = PdfPipelineOptions(
    do_table_structure=True,
    do_ocr=True,
    generate_picture_images=True,
)

# JPG inputs are handled as InputFormat.IMAGE; the PDF entry applies the same
# options to PDF inputs.
format_options = {
    InputFormat.IMAGE: ImageFormatOption(pipeline_options=pdf_pipeline_options),
    InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
}

converter = DocumentConverter(format_options=format_options)

# Map each source to its converted Docling document.
conversions = {source: converter.convert(source=source).document for source in sources}
```
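
The `conversions` dictionary is keyed by the full signed URL, which is unwieldy to print or log. One possible way to derive a readable label is sketched below using only the standard library; the helper name `short_name` and the example URL are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def short_name(source: str) -> str:
    """Return the file name at the end of a URL or POSIX-style path."""
    # urlparse separates the path from the query string, so the signature
    # parameters are dropped before the final path segment is taken.
    return PurePosixPath(unquote(urlparse(source).path)).name


# Hypothetical URL with the same shape as the signed links in `sources`.
example = "https://example.com/datasets/10.1.1.1.2019_2.jpg?X-Goog-Expires=345600"
print(short_name(example))
```

With a helper like this, a dictionary such as `{short_name(s): doc for s, doc in conversions.items()}` would give shorter keys, assuming the file names are unique.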

### Step 4. Review structured text output

Next, we can see some of the text output from the conversions and how it has been organized and structured. Note how we now have text labels associated with the text blocks extracted.

If we were setting up a RAG pipeline here, we would chunk the text next before vectorizing and storing in a vector database. 

```python
conversions[sources[0]].texts
```
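
To give a sense of what the chunking step for RAG could look like, here is a naive sketch that greedily packs consecutive text blocks into chunks with a size limit. It is a plain-Python illustration and not Docling's own chunking utilities; the function name `chunk_texts`, the character limit and the sample blocks are assumptions.

```python
def chunk_texts(blocks: list, max_chars: int = 200) -> list:
    """Greedily pack consecutive text blocks into chunks of at most max_chars.

    A single block longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n{block}" if current else block
        if current and len(candidate) > max_chars:
            # The next block does not fit, so close the current chunk.
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical blocks, standing in for the text of each extracted item.
blocks = ["Table 1: Results", "Accuracy improved over the baseline.", "See Section 4."]
for chunk in chunk_texts(blocks, max_chars=60):
    print(repr(chunk))
```

In the tutorial's setting, the input list could be built from the extracted items, for example `[item.text for item in conversions[sources[0]].texts]`, assuming each item exposes a `text` attribute.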

### Step 5. Export tables to dataframes

Finally, we can export the extracted tables to dataframes for easier visualization. Now that the tables have been extracted to a tabular format, they are, by definition, structured and easier for AI applications to consume. We can also save the [extracted data](https://www.ibm.com/think/topics/information-extraction) in other structured formats for storage.

If we were to continue with a RAG application, we would convert the table data to markdown format for passing into a [large language model (LLM)](https://www.ibm.com/think/topics/large-language-models).

```python
import pandas as pd
from IPython.display import display

for source in sources:
    document = conversions[source]
    for table_ix, table in enumerate(document.tables):
        # Each extracted table becomes a pandas DataFrame.
        table_df: pd.DataFrame = table.export_to_dataframe(doc=document)
        print(f"## Source {source}, table {table_ix}")
        display(table_df)
```
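
As one way to save the extracted data in other structured formats, the sketch below serializes table rows to CSV and JSON with the standard library. The helper name `records_to_csv` and the sample rows are hypothetical; in this tutorial, the rows could come from `table_df.to_dict(orient="records")`.

```python
import csv
import io
import json


def records_to_csv(records: list) -> str:
    """Serialize a list of row dictionaries to CSV text with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical rows, standing in for one extracted table.
records = [
    {"item": "Widget", "quantity": "4"},
    {"item": "Gadget", "quantity": "10"},
]
print(records_to_csv(records))
print(json.dumps(records, indent=2))
```

Pandas can also write a dataframe directly, for example with `table_df.to_csv("table.csv", index=False)`, where the file name is a placeholder.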
        

## Conclusion

In this tutorial, you converted unstructured data held in scanned documents into an AI-ready structured output. 

Although we only converted a few documents, the concepts explored in this simplified use case serve as the foundation for creating an [automated](https://www.ibm.com/think/topics/data-automation) [extract, transform, load (ETL)](https://www.ibm.com/think/topics/etl) [workflow](https://www.ibm.com/think/topics/workflow) for enterprise data. For larger unstructured datasets, [validation](https://www.ibm.com/think/topics/data-validation) is an important step to perform after conversion to ensure high data quality and accuracy.
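
A validation pass can start very simply. The sketch below flags rows with missing or blank required fields; the function name `validate_rows`, the field names and the sample rows are assumptions for illustration, and real checks would depend on the documents being processed.

```python
def validate_rows(rows: list, required: list) -> list:
    """Return human-readable problems found in the extracted rows."""
    problems = []
    for index, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            # Treat absent keys and whitespace-only cells as missing.
            if value is None or str(value).strip() == "":
                problems.append(f"row {index}: missing '{field}'")
    return problems


# Hypothetical rows; the second has a blank cell, as OCR output sometimes does.
rows = [
    {"item": "Widget", "quantity": "4"},
    {"item": "Gadget", "quantity": " "},
]
print(validate_rows(rows, required=["item", "quantity"]))
```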

_<a name="Footnote 1">1</a>: “[Untapped value: What every executive needs to know about unstructured data](https://www.box.com/resources/unstructured-data-paper),” IDC, Aug 2023._

