How to translate ppt files?
Read my Medium article to discover how the library was built!
Purpose
Free online translators for PowerPoint files have two main issues:
- The translation APIs are often robust neither to short half-sentences (very common in PowerPoint files) nor to long texts.
- The structure of PowerPoint presentations is very complex (lots of unordered shapes), and after modification a nicely laid-out presentation often ends up with misplaced shapes.
This project aims to solve both problems by automating the translation of *.pptx files, keeping the same rendering as the original and producing well-translated sentences and expressions.
This repo contains materials to:
- Translate texts using Selenium on the DeepL translation website.
- Extract and modify PowerPoint texts from different objects with the powerful python-pptx library.
4 scripts are available in the src folder:
- default_selenium.py: the defaultSelenium class contains the basics needed to connect to the Selenium API and launch a website.
- deepL_selenium.py: seleniumDeepL inherits from the previous class and contains all the interactions specific to the DeepL website.
- ppt_interaction.py: contains functions to inspect a presentation, from the presentation to its slides, to their shapes, to their text_frame properties.
- ppt_translation.py: uses functions from ppt_interaction.py together with seleniumDeepL to accomplish the final task: translating files.
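The presentation, slides, shapes, text_frame traversal described for ppt_interaction.py can be sketched as follows. This is a minimal illustration, not the code from the repo: the helper names iter_text_runs and collect_corpus are hypothetical, and the demo uses stand-in objects that only mimic the python-pptx attributes involved (slides, shapes, has_text_frame, text_frame.paragraphs, runs, text), so it runs without a .pptx file. Grouped shapes and tables are not handled here.

```python
from types import SimpleNamespace


def iter_text_runs(presentation):
    """Yield (slide_index, shape_name, run) for every text run in the deck."""
    for slide_index, slide in enumerate(presentation.slides):
        for shape in slide.shapes:
            # Pictures, connectors and similar shapes carry no text frame.
            if not getattr(shape, "has_text_frame", False):
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    yield slide_index, shape.name, run


def collect_corpus(presentation):
    """Return the non-blank run texts, in traversal order."""
    return [run.text for _, _, run in iter_text_runs(presentation) if run.text.strip()]


def _shape(name, *paragraph_runs):
    """Build a stand-in shape (hypothetical helper, for the demo only)."""
    paragraphs = [
        SimpleNamespace(runs=[SimpleNamespace(text=text) for text in runs])
        for runs in paragraph_runs
    ]
    return SimpleNamespace(
        name=name,
        has_text_frame=True,
        text_frame=SimpleNamespace(paragraphs=paragraphs),
    )


picture = SimpleNamespace(name="Picture 1", has_text_frame=False)
demo_deck = SimpleNamespace(slides=[
    SimpleNamespace(shapes=[_shape("Title", ["Bonjour"]), picture]),
    SimpleNamespace(shapes=[_shape("Body", ["Premier point", " "], ["Second point"])]),
])

demo_corpus = collect_corpus(demo_deck)
print(demo_corpus)
```

With python-pptx installed, the same functions should accept a real deck loaded with Presentation("some_file.pptx") from the pptx package, since they only rely on the attributes listed above.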
Running the translator
The translation object uses a corpus concept. Text must be given as a list of strings (each string is one sentence; the maximum number of characters in a single sentence is 4900 due to the limits of DeepL's web page). A translation example is provided.
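Before handing a corpus to the translator, it can help to check it against the 4900-character limit. The sketch below is only an illustration of that check: check_corpus is a hypothetical helper, not a function of this library.

```python
MAX_CHARS = 4900  # per-sentence limit quoted above for DeepL's web page


def check_corpus(corpus, max_chars=MAX_CHARS):
    """Separate the sentences that fit the limit from those that are too long."""
    if isinstance(corpus, str):
        corpus = [corpus]  # accept a single sentence as well as a list
    accepted, too_long = [], []
    for sentence in corpus:
        (accepted if len(sentence) <= max_chars else too_long).append(sentence)
    return accepted, too_long


corpus = ["Bonjour tout le monde", "Plan de la présentation", "x" * 5000]
accepted, too_long = check_corpus(corpus)
print(len(accepted), "sentences fit,", len(too_long), "is too long")
```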
There are 5 steps to run the translation on a corpus.
- Clone the repo
git clone https://github.com/ThibaudLamothe/translate-pptx.git
- Download the Selenium chromedriver to the project's root. Google Chrome also needs to be installed.
- Go to the src folder
cd src/
- Install the necessary libraries
pip install -r requirements.txt
- Run the deepL_selenium.py file
python deepL_selenium.py
The output looks like this:
Translator's features
Initiating the translator launches the Selenium driver, which needs a driver executable to run correctly. Its location has to be specified with the driver_path argument. The loglevel can also be set (error/warning/information/debug), depending on how much information you want to track. See the previous picture.
deepL = seleniumDeepL(driver_path='../chromedriver', loglevel='debug')
Running that command opens an empty browser page. We can now start the translation process.
Functions available
The seleniumDeepL class contains multiple methods, but only 4 are useful for the translation process. The other ones are internal processing steps.
deepL.run_translation( see next part for parameters )
This is the main function. It takes the corpus, transforms it to better suit the DeepL website, performs the translation and stores the results in a dictionary.
deepL.get_translated_corpus()
It returns the dictionary of translated sentences. Keys are the original sentences or groups of words; values are their translations.
deepL.save_translations(json_path as str)
The translations can be stored as a JSON file using this function. It only needs one argument: the path to the JSON file, as a string.
deepL.load_translations(json_path as str)
During the translation process, a sentence that has already been translated is not translated a second time. Translations from a previous run can be reloaded with this function. It takes the path to a JSON file as a string.
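The save/reload behaviour above amounts to a JSON-backed dictionary that is consulted before translating. The sketch below illustrates that idea with standalone functions; load_cache, save_cache and pending_sentences are hypothetical names, and the actual methods of seleniumDeepL may be implemented differently.

```python
import json
import tempfile
from pathlib import Path


def load_cache(json_path):
    """Return the stored {original: translation} dict, or {} if the file is missing."""
    path = Path(json_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_cache(translations, json_path):
    """Write the translations as readable UTF-8 JSON."""
    Path(json_path).write_text(
        json.dumps(translations, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )


def pending_sentences(corpus, translations):
    """Keep only the sentences with no stored translation, without duplicates."""
    seen = set(translations)
    pending = []
    for sentence in corpus:
        if sentence not in seen:
            seen.add(sentence)
            pending.append(sentence)
    return pending


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "translations.json"
    save_cache({"Bonjour": "Hello"}, cache_file)
    cache = load_cache(cache_file)

todo = pending_sentences(["Bonjour", "Merci", "Merci"], cache)
print(todo)
```

Only the sentences returned by pending_sentences would need to go through the browser, which is what keeps repeated runs short.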
Running the translation
So far we've seen the 4 useful functions of seleniumDeepL. deepL.run_translation() is the most important one. We'll now see how to use it and set its parameters correctly.
- corpus (as str or list, default: 'Hello, World!')
The corpus is the text to be translated. It can be a string or a list of strings. Since translating a single sentence does not necessarily need automation, the list option is the more interesting one.
- destination_language (as str, default: 'en')
self.available_languages = ['fr', 'en', 'de', 'es', 'pt', 'it', 'nl', 'pl', 'ru', 'ja', 'zh']
- joiner (as str, default: '\n____\n')
- quit_web (as boolean, default: True)
- time_to_translate (as integer, default: 10)
- time_batch_rest (as integer, default: 2)
- raise_error (as boolean, default: False)
- load_at (as string, default: None)
- store_at (as string, default: None)
- load_and_store_at (as string, default: None)
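The joiner parameter suggests that several sentences are sent to the page as one block of text and separated again afterwards. The sketch below shows one way such batching could work under the 4900-character limit. It is an assumption about the approach, not the library's code: make_batches and split_batch are hypothetical names, and a single sentence longer than the limit still ends up alone in an oversized batch.

```python
JOINER = "\n____\n"  # default value of the joiner parameter listed above
MAX_CHARS = 4900     # page limit quoted earlier in this README


def make_batches(sentences, joiner=JOINER, max_chars=MAX_CHARS):
    """Greedily pack sentences into joined strings of at most max_chars characters."""
    batches, current, current_len = [], [], 0
    for sentence in sentences:
        # The joiner is only needed between two sentences of the same batch.
        extra = len(sentence) + (len(joiner) if current else 0)
        if current and current_len + extra > max_chars:
            batches.append(joiner.join(current))
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        batches.append(joiner.join(current))
    return batches


def split_batch(translated_text, joiner=JOINER):
    """Recover the individual sentences from a translated batch."""
    return translated_text.split(joiner)


sentences = ["Bonjour", "Merci", "Au revoir"]
batches = make_batches(sentences, max_chars=20)  # tiny limit to force two batches
restored = [s for batch in batches for s in split_batch(batch)]
print(len(batches), restored)
```

The round trip only works if the translation leaves the joiner untouched, which is presumably why an unusual separator made of underscores is the default.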
PPT Insertion
- Replacing text without modifying its look
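With python-pptx, character formatting lives on the runs of a paragraph, and assigning to paragraph.text replaces those runs. A common way to keep the look is therefore to write the new text into an existing run. The sketch below illustrates that idea; replace_paragraph_text is a hypothetical helper, not necessarily what ppt_translation.py does, and the demo uses stand-in objects exposing only the runs, text and font attributes. Note that a paragraph with mixed formatting ends up entirely in the style of its first run.

```python
from types import SimpleNamespace


def replace_paragraph_text(paragraph, new_text):
    """Write new_text into the first run and blank the other runs.

    Returns False when the paragraph has no run to write into.
    """
    runs = list(paragraph.runs)
    if not runs:
        return False
    runs[0].text = new_text  # the run object, and so its font, is kept
    for run in runs[1:]:
        run.text = ""
    return True


# Stand-in paragraph with two runs (demo only, no .pptx file needed).
title_font = SimpleNamespace(bold=True, size=32)
paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text="Bonjour ", font=title_font),
    SimpleNamespace(text="le monde", font=SimpleNamespace(bold=False, size=32)),
])

replaced = replace_paragraph_text(paragraph, "Hello world")
print([run.text for run in paragraph.runs])
```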
Good to know
NB: the project was developed on macOS, with Selenium driving Google Chrome.
Resources
- Changing the text but keeping the Font in python-pptx
- Module-wide variables in Python (1/2)
- Module-wide variables in Python (2/2)
- Selenium French Documentation
- Chromedriver
- CSS Selectors (recommended in the Selenium documentation)
TODO
- Deal with bigger texts. Idea: split long sentences on \n's and reassemble them after translation. Do it when the corpus is received and delivered, so that no modification is needed in the batch_corpus creation?
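The idea in this TODO could look like the sketch below: split on newlines, group consecutive lines into chunks that fit the limit, and join the chunks back with newlines afterwards. This is only one possible reading of the idea, with hypothetical function names; a single line longer than the limit is still returned as one oversized chunk.

```python
def split_on_newlines(text, max_chars=4900):
    """Split text into chunks of whole lines, each at most max_chars characters."""
    chunks, current, current_len = [], [], 0
    for line in text.split("\n"):
        # One extra character for the newline that joins lines inside a chunk.
        extra = len(line) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
            extra = len(line)
        current.append(line)
        current_len += extra
    chunks.append("\n".join(current))
    return chunks


def rejoin(chunks):
    """Reassemble the chunks (translated or not) in their original order."""
    return "\n".join(chunks)


text = "alpha\nbeta\ngamma delta\nend"
chunks = split_on_newlines(text, max_chars=10)  # tiny limit for the demo
print(chunks)
print(rejoin(chunks) == text)
```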