![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_PowerPoint_Reader_Demo.ipynb)

# Introducing PowerPoint reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().ppt()` method in Spark NLP, which parses PowerPoint content from local and distributed file systems into a Spark DataFrame.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading PowerPoint files was introduced in Spark NLP 5.5.2. Please make sure you have upgraded to the latest Spark NLP release.

- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy with our setup script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local files example, we will download a sample PowerPoint file from the Spark NLP GitHub repo:

In [7]:
!mkdir power-point-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/ppt/fake-power-point.pptx -P power-point-files

--2024-12-24 15:23:15--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/fake-power-point.pptx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38412 (38K) [application/octet-stream]
Saving to: ‘power-point-files/fake-power-point.pptx’


2024-12-24 15:23:15 (3.31 MB/s) - ‘power-point-files/fake-power-point.pptx’ saved [38412/38412]



## Parsing PowerPoint slides from Local Files
Use the `ppt()` method to parse PowerPoint content from local directories.

In [8]:
import sparknlp
ppt_df = sparknlp.read().ppt("./power-point-files")

ppt_df.show()

+--------------------+--------------------+--------------------+
|                path|             content|                 ppt|
+--------------------+--------------------+--------------------+
|file:/content/pow...|[50 4B 03 04 14 0...|[{Title, Adding a...|
+--------------------+--------------------+--------------------+



In [9]:
ppt_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- ppt: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
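The schema above shows that each row's `ppt` column is an array of structs with `elementType`, `content`, and `metadata` fields. As a minimal sketch of how those elements might be inspected once collected to the driver, the helpers below count elements per type and pull out the text of one type. The function names and the sample values are hypothetical and only mirror the schema; they are not part of Spark NLP.

```python
from collections import Counter


def count_element_types(elements):
    """Count how many parsed elements exist for each elementType."""
    return Counter(el["elementType"] for el in elements)


def text_by_type(elements, element_type):
    """Return the content strings of all elements with the given elementType."""
    return [el["content"] for el in elements if el["elementType"] == element_type]


# Hypothetical sample data shaped like the `ppt` column (not real parser output).
# "Title" appears in the output above; "OtherType" is a made-up placeholder.
sample_elements = [
    {"elementType": "Title", "content": "Example title", "metadata": {}},
    {"elementType": "OtherType", "content": "Example body text", "metadata": {}},
    {"elementType": "Title", "content": "Second title", "metadata": {}},
]

print(count_element_types(sample_elements))
print(text_by_type(sample_elements, "Title"))

# With the DataFrame from this notebook, the elements of the first file could be
# fetched like this (Spark Row objects also support el["elementType"] lookups):
# elements = ppt_df.select("ppt").first()["ppt"]
```

For larger inputs it is usually better to stay in Spark instead of collecting, for example with `ppt_df.selectExpr("explode(ppt) AS element").select("element.elementType", "element.content")`.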



You can also read from distributed file systems such as:
- Databricks: `dbfs://`
- HDFS: `hdfs://`
- Microsoft Fabric OneLake: `abfss://`
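
As a rough sketch of how the same `ppt()` call might be pointed at one of these locations, the snippet below classifies a path by its URI scheme and wraps the reader call from earlier in the notebook. The helper names, the host, and the directory in the example path are hypothetical placeholders; the only Spark NLP call used is `sparknlp.read().ppt(path)`, as shown above.

```python
from urllib.parse import urlparse

# URI schemes listed in this notebook for distributed storage.
DISTRIBUTED_SCHEMES = {"dbfs", "hdfs", "abfss"}


def describe_source(path):
    """Return 'distributed' if the path uses one of the listed schemes, else 'local'."""
    scheme = urlparse(path).scheme
    return "distributed" if scheme in DISTRIBUTED_SCHEMES else "local"


def read_ppt(path):
    """Read PowerPoint files from a local or distributed path into a DataFrame."""
    import sparknlp  # imported lazily so the helper above works without Spark NLP

    return sparknlp.read().ppt(path)


print(describe_source("./power-point-files"))
print(describe_source("hdfs://namenode:8020/data/slides"))  # placeholder host and path

# Example call, assuming the cluster can reach this (hypothetical) location:
# ppt_df = read_ppt("hdfs://namenode:8020/data/slides")
```

The resulting DataFrame is expected to have the same `path`, `content`, and `ppt` columns as in the local example.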