# In this tutorial you will generate a translated .txt file from a PDF file in Python.

In [1]:
import cv2
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
from googletrans import Translator

# Shared translator instance used throughout the notebook
translator = Translator()

# Convert the PDF to images

In [2]:
# Path of the PDF
PDF_file = "gans.pdf"

# Render every page of the PDF as an image (the second argument is the DPI)
pages = convert_from_path(PDF_file, 500)

# Counter used to number the image saved for each page
image_counter = 1

# Iterate through all the pages stored above
for page in pages:

    # Build the filename for this page:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_" + str(image_counter) + ".jpg"

    # Save the image of the page to disk
    page.save(filename, 'JPEG')

    # Increment the counter for the next filename
    image_counter = image_counter + 1
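
The loop above can also be wrapped in a small helper. The sketch below is only one possible arrangement: `save_pages`, `out_dir` and `prefix` are hypothetical names that do not appear in the notebook, and the function assumes each page offers a PIL-style `save(path, format)` method, as the pages saved above do.

```python
from pathlib import Path


def save_pages(pages, out_dir=".", prefix="page_"):
    """Save each page as <prefix><n>.jpg inside out_dir.

    `pages` is any iterable of objects with a PIL-style
    save(path, format) method (an assumption of this sketch).
    Returns the list of paths that were written, in page order.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    saved = []
    # enumerate(..., start=1) replaces the manual image_counter
    for number, page in enumerate(pages, start=1):
        target = out_path / f"{prefix}{number}.jpg"
        page.save(str(target), "JPEG")
        saved.append(target)
    return saved
```

Called as `saved = save_pages(pages)`, it writes the same `page_n.jpg` files into the current folder, and `len(saved)` gives the number of pages.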
  


### Check your working folder: it should now contain one .jpg image per page of the PDF

# Let's start using [googletrans](https://py-googletrans.readthedocs.io/en/latest/) in Python
## Example using the translator on some text

In [3]:
text = 'Hello, how are you? I has been a long way back home'
# Detect the language of a Korean sample sentence
print(translator.detect('이 문장은 한글로 쓰여졌습니다.'))
# Translate the English text into Spanish
spa_trans = translator.translate(text, src='en', dest='es')
print(spa_trans.text)

Detected(lang=ko, confidence=1.0)
¿Hola como estas? I ha sido un largo camino de vuelta a casa


## Let's use Tesseract on an image and see the results

In [4]:
# Run OCR on the image of the second page
text = pytesseract.image_to_string(cv2.imread('page_2.jpg'))
# Rejoin words that were hyphenated across a line break
text = text.replace('-\n', '')
print(text)

This framework can yield specific training algorithms for many kinds of model and optimization
algorithm. In this article, we explore the special case when the generative model generates samples
by passing random noise through a multilayer perceptron, and the discriminative model is also a
multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train
both models using only the highly successful backpropagation and dropout algorithms [17] and
sample from the generative model using only forward propagation. No approximate inference or
Markov chains are necessary.

2 Related work

An alternative to directed graphical models with latent variables are undirected graphical models
with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann
machines (DBMs) [26] and their numerous variants. The interactions within such models are
represented as the product of unnormalized potential functions, normalized by a global summatio
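
The `text.replace('-\n', '')` call in this cell removes every hyphen that ends a line, including hyphens that follow digits or capitals. A slightly stricter variant is sketched below. `join_hyphenated_words` is a hypothetical helper, not part of pytesseract, and it assumes English text in which a broken word has lowercase ASCII letters on both sides of the break.

```python
import re

# A hyphen directly after a lowercase letter, at the end of a line,
# with a lowercase letter starting the next line.
_BROKEN_WORD = re.compile(r"(?<=[a-z])-\n(?=[a-z])")


def join_hyphenated_words(text):
    """Rejoin words that were split across a line break with a hyphen."""
    return _BROKEN_WORD.sub("", text)


if __name__ == "__main__":
    sample = "back-\npropagation and 2014-\n2015"
    # The broken word is rejoined; the numeric range keeps its hyphen.
    print(repr(join_hyphenated_words(sample)))
```

Like the original replacement, this still merges genuine compounds that happen to break at their hyphen (for example "well-" followed by "known"), so it is a heuristic rather than a complete fix.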

## Translating the previously extracted text into Spanish

In [9]:
# Identifying the language
from langdetect import detect_langs
detect_langs(text)

[en:0.9999955035469207]
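
The detection result can be used to choose the `src` argument instead of hard-coding it. The sketch below assumes each candidate exposes `.lang` and `.prob` attributes, as the printed `en:0.99...` entry suggests. `pick_source_language`, the `Candidate` stand-in and the `0.9` threshold are all hypothetical choices made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for one detection result, used only for the demo.
Candidate = namedtuple("Candidate", "lang prob")


def pick_source_language(candidates, min_prob=0.9, default="auto"):
    """Return the most probable language code, or `default` if unsure.

    `candidates` is a list of objects with .lang and .prob attributes
    (an assumption of this sketch).
    """
    if not candidates:
        return default
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= min_prob else default


if __name__ == "__main__":
    print(pick_source_language([Candidate("en", 0.99), Candidate("fr", 0.01)]))
```

With the result above, `pick_source_language(detect_langs(text))` would be expected to return `'en'`; falling back to `'auto'` leaves the decision to the translator when the detection is not confident.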

In [7]:
# Once the language is identified, translate the text
spanish_translation = translator.translate(text, src='en', dest='es')
print(spanish_translation.text)

Este marco puede producir algoritmos específicos de formación para muchos tipos de modelo y optimización
algoritmo. En este artículo se analiza el caso especial cuando el modelo generativo genera muestras
haciendo pasar el ruido aleatorio a través de un perceptrón de múltiples capas, y el modelo discriminativo es también una
perceptrón multicapa. Nos referimos a este caso especial como redes de confrontación. En este caso, podemos entrenar
ambos modelos usando sólo el backpropagation gran éxito y algoritmos de deserción [17] y
muestra desde el modelo generativo utilizando sólo propagación hacia adelante. No inferencia aproximada o
Las cadenas de Markov son necesarios.

2. Trabajo relacionado

Una alternativa a modelos gráficos dirigidos con variables latentes están no dirigidos modelos gráficos
con variables latentes, tales como máquinas de Boltzmann restringidos (RBM) [27, 16], en el fondo Boltzmann
máquinas (DBMS) [26] y sus numerosas variantes. Las interacciones dentro de tales mode

# Iterate over all images and generate the text file

In [10]:
'''
Part #2 - Recognizing text from the images using OCR
'''

# Total number of pages saved in the previous step
filelimit = image_counter - 1

# Open the output file in write mode ('w'). The file stays open for the
# whole loop, so the text of every page ends up in the same file, and the
# with statement closes it automatically at the end.
with open('out_translated_text.txt', 'w', encoding='utf-8') as f:

    # Iterate from 1 to the total number of pages
    for i in range(1, filelimit + 1):

        # Image to recognize text from: page_1.jpg, page_2.jpg, ..., page_n.jpg
        filename = "page_" + str(i) + ".jpg"

        # Recognize the text in the image as a string using pytesseract
        text = pytesseract.image_to_string(Image.open(filename))

        # In many PDFs, a word that does not fit at the end of a line is
        # split with a hyphen and continued on the next line,
        # e.g. "GeeksF-" followed by "orGeeks". Replacing every '-\n'
        # with '' rejoins those words.
        text = text.replace('-\n', '')

        # Translate the page and write the result to the file
        spa_trans = translator.translate(text, src='en', dest='es')
        f.write(spa_trans.text)
        print('Successfully translated: ', filename)


Successfully translated:  page_1.jpg
Successfully translated:  page_2.jpg
Successfully translated:  page_3.jpg
Successfully translated:  page_4.jpg
Successfully translated:  page_5.jpg
Successfully translated:  page_6.jpg
Successfully translated:  page_7.jpg
Successfully translated:  page_8.jpg
Successfully translated:  page_9.jpg
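
The loop above sends the whole text of a page in one translation request. For very dense pages it may be safer to translate in smaller pieces. The sketch below splits text at paragraph breaks. `split_into_chunks`, `translate_long_text` and the `max_chars=4000` default are hypothetical and chosen only for illustration, not taken from the googletrans documentation.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into pieces of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together when they
    fit; a paragraph longer than max_chars is cut at fixed offsets.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Fallback: hard-split a paragraph that is too long on its own
        pieces = [paragraph[i:i + max_chars]
                  for i in range(0, len(paragraph), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(text, translate_one, max_chars=4000):
    """Translate text chunk by chunk with the callable translate_one."""
    translated = (translate_one(chunk)
                  for chunk in split_into_chunks(text, max_chars))
    return "\n\n".join(translated)
```

Inside the loop this could be used as `translate_long_text(text, lambda chunk: translator.translate(chunk, src='en', dest='es').text)`. Note that a hard-split paragraph is rejoined with a blank line, so the output layout may differ slightly from the input.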


## View the generated output text file

In [12]:
# Read the file back; the with statement closes it after printing
with open("out_translated_text.txt", "r", encoding='utf-8') as file:
    print(file.read())

1406.266lv1 [stat ML] 10 de junio 2014

mi
mi

ar X1Vv

 

Redes generativas adversarias

Ian J. Goodfellow, Jean Pouget-Abadie; Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair! Aaron Courville, Yoshua Bengio *
Departamento d’Informatique et de recherche opérationnelle

Universidad de Montreal
Montreal, QC H3C 3J7

Resumen

Se propone un nuevo marco para la estimación de los modelos generativos a través de un proceso contradictorio, en el que al mismo tiempo entrenamos dos modelos: un modelo generativo G
que captura la distribución de datos, y un modelo discriminativo D que las estimaciones
la probabilidad de que una muestra de vino de los datos de entrenamiento en lugar de G. El procedimiento de entrenamiento para G' es maximizar la probabilidad de cometer un error D. Esta
marco corresponde a un Minimax juego de dos jugadores. En el espacio de arbitraria
funciones G y D, existe una solución única, con G recuperación de los datos de entrenamiento
distribución y D igual a 5 en t