# Extract all text from a Word Document into Markdown Format

Below is an example of using AnyParser to accurately extract the layout and text from a sample Word Document into markdown format.

### 1. Load the libraries

To install the packages, uncomment the commands below.

In [None]:
# !pip3 install python-dotenv
# !pip3 install mammoth
# !pip3 install IPython

Next, to use AnyParser, either install the public package or install the SDK locally.

In [2]:
# Option 1: install public package
# !pip3 install --upgrade any-parser

# Option 2: if you have the SDK repository installed locally, add it to the system path
# import sys
# sys.path.append(".")
# sys.path.append("..")
# sys.path.append("../..")

After performing Option 1 or 2 above, import the libraries.

In [7]:
import os
import mammoth
from dotenv import load_dotenv
from IPython.display import display, HTML, Markdown
from any_parser import AnyParser

### 2. Set up your AnyParser API key

To set up your `CAMBIO_API_KEY`:

1. create a `.env` file in your root folder;
2. add the following line to your `.env` file:
    ```
    CAMBIO_API_KEY=17b************************
    ```

Then run the line below to load your API key.

In [8]:
load_dotenv(override=True)
example_apikey = os.getenv("CAMBIO_API_KEY")
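
If the `.env` file is missing or the variable name is misspelled, `os.getenv` returns `None` without raising, and the problem only surfaces later when the parser is called. A minimal sketch of an early check is shown below; `require_env` is a hypothetical helper written for this example and is not part of AnyParser or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check your .env file and rerun load_dotenv()."
        )
    return value


# Possible usage in this notebook, after load_dotenv(override=True) has run:
# example_apikey = require_env("CAMBIO_API_KEY")
```

Failing here keeps the error message close to its cause instead of deferring it to the first API call.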

### 3. Load the test sample data

Now let's load a sample file to test AnyParser's capabilities. In addition to images and PDFs, AnyParser can process Word documents such as the one used here.

Let's visualize the sample Word file first!

In [9]:
example_local_file = "./sample_data/test_odf.docx"

# Open the .docx file and convert it to HTML
with open(example_local_file, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html_content = result.value

# Preview Word document
display(HTML(html_content))
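
The preview cell assumes the sample file exists at a path relative to the notebook's working directory. As a rough sketch, the check below validates the path before it is opened; `check_docx_path` is a hypothetical helper for illustration, not something provided by mammoth or AnyParser.

```python
from pathlib import Path


def check_docx_path(path: str) -> Path:
    """Return `path` as a Path if it points to an existing .docx file, else raise."""
    p = Path(path)
    if p.suffix.lower() != ".docx":
        raise ValueError(f"Expected a .docx file, got: {p.name}")
    if not p.is_file():
        raise FileNotFoundError(f"No such file: {p}")
    return p


# Possible usage in this notebook:
# example_local_file = str(check_docx_path("./sample_data/test_odf.docx"))
```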

### 4. Run AnyParser and Visualize the Markdown Output

We will run AnyParser on our sample data and then display the output as Markdown. Extraction may take 1-20 seconds per page. Note that this example uses the Synchronous API. To see how AnyParser can be used asynchronously, see the [Asynchronous API notebook](./async_pdf_to_markdown.ipynb).

In [10]:
ap = AnyParser(example_apikey)

# extract returns a tuple containing the markdown as a string and total time
markdown_string, total_time = ap.extract(example_local_file)

display(Markdown(markdown_string))
print(total_time)

## Test document

## Here is an example chart:

| Investor Metrics | FY23 Q1 | FY23 Q2 | FY23 Q3 | FY23 Q4 | FY24 Q1 |
|------------------|---------|---------|---------|---------|---------|
| Office Commercial products and cloud services revenue growth (y/y) | 7% / 13% | 7% / 14% | 13% / 17% | 12% / 14% | 15% / 14% |
| Office Consumer products and cloud services revenue growth (y/y) | 7% / 11% | (2)% / 3% | 1% / 4% | 3% / 6% | 3% / 4% |
| Office 365 Commercial seat growth (y/y) | 14% | 12% | 11% | 11% | 10% |
| Microsoft 365 Consumer subscribers (in millions) | 65.1 | 67.7 | 70.8 | 74.9 | 76.7 |
| Dynamics products and cloud services revenue growth (y/y) | 15% / 22% | 13% / 20% | 17% / 21% | 19% / 21% | 22% / 21% |
| LinkedIn revenue growth (y/y) | 17% / 21% | 10% / 14% | 8% / 11% | 6% / 8% | 8% |

Growth rates include non-GAAP CC growth (GAP % / CC %)

Done.

Time Elapsed: 5.39 seconds
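
To keep the result beyond the notebook session, the markdown string returned by `ap.extract` can be written to disk. The sketch below uses only the standard library; `save_markdown` and the output path in the usage comment are hypothetical names chosen for this example.

```python
from pathlib import Path


def save_markdown(markdown_string: str, output_path: str) -> Path:
    """Write the extracted markdown to `output_path` as UTF-8 and return the path."""
    path = Path(output_path)
    # Create the parent folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown_string, encoding="utf-8")
    return path


# Possible usage in this notebook (the output location is an assumption):
# save_markdown(markdown_string, "./output/test_odf.md")
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the document contains non-ASCII characters.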


## End of the notebook

Check out more [case studies](https://www.cambioml.com/blog) from CambioML!

<a href="https://www.cambioml.com/" title="Title">
    <img src="./sample_data/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>