# Part 4 - Text Preprocessing
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import os, sys, subprocess
sys.path.insert(0, '../utils')  # make custom modules importable
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *
from utils_data import conf_pyspark, load_dataset

# initiate PySpark
sc, spark = conf_pyspark()

sc

[('spark.driver.port', '39034'),
 ('spark.app.name', 'local'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.driver.host', '192.168.1.207'),
 ('spark.driver.memory', '62g'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1547711831420'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.cores', '6'),
 ('spark.driver.maxResultSize', '4g')]


**After wrangling and exploring the dataset, we need to preprocess it appropriately for training topic models. There are several questions to consider:**

* Was the text extracted well from the images? If the text quality is poor, we cannot expect good topic modeling results.
* Are there methods to improve the quality of the text? Better text should produce more accurate topic models.
* There are many NLP preprocessing steps; which should we perform? Well-preprocessed text reduces the size of the matrix to compute and can help when training topic models.

## 1 Load Data

Load the clean dataset:

In [2]:
df = load_dataset('dataset', spark)

nrow_raw = df.count()
print('Shape of dataframe: ({}, {})'.format(nrow_raw, len(df.columns)))
df.sample(False, 0.00001).limit(10).show()

Shape of dataframe: (16131646, 7)
+-------+--------------------+-----------------+----------+-----+--------------------+--------------------+
|     id|           publisher|           region|      date|  ads|               title|             content|
+-------+--------------------+-----------------+----------+-----+--------------------+--------------------+
|1893237| Bay Of Plenty Times|    Bay of Plenty|1878-09-17|false|      THE ST. LEGER.|THE ST. LEGER.Fou...|
|1996852|Hawera & Normanby...|         Taranaki|1881-07-23|false|Mr. Alfred Buckla...|b"Mr. Alfred Buck...|
|2231036|     Lyttelton Times|       Canterbury|1859-11-05|false|            MARRIED.|MARRIED.Nov. 1, a...|
|2347532|       Bush Advocate|      Hawke's Bay|1899-11-11|false|       COUNTY ROLLS.|COUNTY ROLLS.(To ...|
|2448530|Hawera & Normanby...|         Taranaki|1884-08-23|false| OAMARU HARBOR LOAN.|OAMARU HARBOR LOA...|
|2583150|   Northern Advocate|        Northland|1887-10-15|false|    KAMO TOWN BOARD.|KAMO TOWN BOARD.

## 2 OCR Quality

**The most important factor for topic modeling is the quality of the corpus. Let us check the quality of the text in the dataset to see whether there is any room for improvement.**

**We select the first article of the Lyttelton Times as the test case; its image is shown below. The original image is [here](https://paperspast.natlib.govt.nz/imageserver/newspapers/P29pZD1MVDE4NTEwODA5LjIuMi4xJmNvbG9yPTMyJmV4dD1naWYmYXJlYT0x)**

![img](img.jpg)

**Now we use [tesseract](https://github.com/tesseract-ocr/tesseract) to run OCR on the image.**

>Tesseract was developed as a proprietary software by Hewlett Packard Labs. In 2005, it was open sourced by HP in collaboration with the University of Nevada, Las Vegas. Since 2006 it has been actively developed by Google and many open source contributors.

In [8]:
! tesseract img.gif ocr -l eng --oem 1 --psm 3 # save OCR result to ocr.txt

Tesseract Open Source OCR Engine v4.0.0-153-g238cb2 with Leptonica
Estimating resolution as 366


In [10]:
path = r'ocr.txt'
with open(path) as f:
    content_ocr = ' '.join([x.strip() for x in f.readlines()])

print(content_ocr)

HE price of Advertisements in this Paper is, threepence a line for the first insertion,  and a penny a line for every subsequent one. All communications to the Editor are re- quested to be addressed to the Office of the LyrrerroN Trams, Section 2, Norwich Quay, Lyttelton, where the Paper may be obtained. Advertisements must be left at this Office before Thursday evening, for insertion of the same week, and must be paid for at the time of  insertion.    NOTICE,  IS EXCELLENCY SIR GEORGE GREY having declared that he will raise no objection to the erection of Canterbury into a separate Province, if the power be left in his hands, and if the settlers in Canterbury desire it, we, the undersigned, Magistrates of the dis- trict, think it right that opportunities should be given of ascertaining the wishes of the people on this important subject. For this purpose Public Meetings will be held at the Mitre Hotel, Lyttelton, on Wednesday, Aug. 13, at Two o'clock, p.M.; and at the Golden Fleece Hot

**Comparing the text above with the same article in the dataset (shown below), we see that both texts are of high quality, with only slight differences. The dataset is therefore probably about the best we can get from the raw images; if the topic quality is not as good as expected, we will not look to the dataset's OCR quality as the cause.**

In [11]:
content_df = df.where(df.id == 1911291).select('content').collect()[0]['content']
print(content_df)

fTIHE price of Advertisements in this Paper ■A- is, threepence a line for the first insertion, and a penny a line for every subsequent one. All communications to the Editor are requested to be addressed to the Office of the Lyttelton Times, Section 2, Norwich Quay, Lyttelton, where the Paper may be obtained. Advertisements must be left at this Office before Thursday evening, for insertion of the same week, and must be paid for at the time of insertion. NOTICE. ~ HIS EXCELLENCY SIR GEORGE GREY having declared that he will raise no objection to the erection of Canterbury into a separate Province, if the power be left in his hands, and if the settlers in Canterbury desire it, we, the undersigned, Magistrates of the district, think it right that opportunities should be given of ascertaining the wishes of the people on this important subject. For this purpose Public Meetings will be held at the Mitre Hotel, Lyttelton, on Wednesday, Aug. 13, at Two o'clock, p.m.; and at the Golden Fleece Hot
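
**To put a rough number on "only slight differences", one option is a character-level similarity ratio. The sketch below is an illustration only, not part of the original pipeline: it uses `difflib.SequenceMatcher` from the standard library on two short excerpts copied from the outputs above. The helper names `normalize` and `similarity` are our own, and the ratio is a heuristic rather than a true OCR error rate.**

```python
import difflib
import re


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return re.sub(r'\s+', ' ', text.lower()).strip()


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b),
                                      autojunk=False)
    return matcher.ratio()


# Short excerpts copied from the Tesseract output and the dataset text above.
tesseract_excerpt = ("addressed to the Office of the LyrrerroN Trams, "
                     "Section 2, Norwich Quay")
dataset_excerpt = ("addressed to the Office of the Lyttelton Times, "
                   "Section 2, Norwich Quay")

score = similarity(tesseract_excerpt, dataset_excerpt)
print('similarity: {:.3f}'.format(score))
```

**A ratio close to 1 means the two transcriptions mostly agree; it says nothing about which of them is closer to the printed page.**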

**Ideally we would measure the OCR quality (error rate) quantitatively, but that takes time and is not the focus of this project. Related work is quoted below:**
>The first method is used by Simon Tanner, Trevor Muñoz, and Pich Hemy Ros in their evaluation of the OCR quality of the British Library’s 19th Century Online Newspaper Archive. Working with a sample of 1% of the 2 million pages digitized by the British Library, the team calculated the highest rates of OCR accuracy achieved by comparing the generated XML text to a “double re-keyed” ground-truth version. Having a verified ground-truth document enabled the team to provide accurate results about the quality of generated texts (which were generally disappointing, particularly for proper nouns and “significant” or content words). Their approach, however, is labor and time intensive – more than can be taken on by most individual scholars with limited financial resources. (Tanner, Simon, Trevor Muñoz, and Pich Hemy Ros. “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned From Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive.” D-Lib Magazine 15, no. 7/8 (2009): §5. doi: http://dx.doi.org/10.1045/july2009-munoz)    
The second approach, which the Mapping Texts team used for their analysis of the OCR accuracy of the Texas newspapers in Chronicling America, is to compare the generated text to an authoritative wordlist and compute the number of words outside the approved set. This approach is easier to implement, as it takes much less time to compile a list of relevant words than to re-key even a 1% sample of the text. However, the results are less accurate. The method is blind to places where the OCR engine produced a word that, while in the wordlist, does not match the text on the page or where spelling variations that occur on the page are flagged as OCR errors because they are not included in the word list. (Torget, Andrew J., Mihalcea, Rada, Christensen, Jon, and McGhee, Geoff. “Mapping Texts: Combining Text-Mining and Geo-Visualization to Unlock the Research Potential of Historical Newspapers.” Mapping Texts (2011): Accessed March 29, 2017. http://mappingtexts.org/whitepaper/MappingTexts_WhitePaper.pdf.)
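
**The second (wordlist) approach quoted above could be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the Mapping Texts implementation: the function name `oov_rate` and the tiny `wordlist` are hypothetical, and a real run would need a wordlist suited to 19th-century New Zealand English.**

```python
import re


def oov_rate(text, wordlist):
    """Return the fraction of alphabetic tokens not found in `wordlist`."""
    tokens = re.findall(r'[a-z]+', text.lower())
    if not tokens:
        return 0.0
    unknown = [t for t in tokens if t not in wordlist]
    return len(unknown) / len(tokens)


# Hypothetical, deliberately tiny wordlist for illustration only.
wordlist = {'the', 'price', 'of', 'advertisements', 'in', 'this', 'paper'}

# Opening words of the dataset text shown above.
sample = 'fTIHE price of Advertisements in this Paper'
rate = oov_rate(sample, wordlist)
print('out-of-wordlist rate: {:.3f}'.format(rate))
```

**As the quoted passage warns, this measure misses OCR errors that happen to form valid words and penalises legitimate period spellings that are absent from the wordlist.**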

## 3 Spelling Correction

**Although the overall OCR quality is high, the text still contains many errors. If we could correct at least some of them, the text quality would improve, and so should the quality of the topics. Let's try a spelling correction tool.**

**We use [SymSpell](https://github.com/wolfgarbe/SymSpell) for spelling correction.**

> SymSpell is an algorithm (implementations are available in many programming languages) to be used by developers to add fast approximate string search or spelling correction to their products, rather than being a consumer-ready turnkey product itself.

In [13]:
from symspellpy.symspellpy import SymSpell, Verbosity
!wget https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell/frequency_dictionary_en_82_765.txt \
      -O ../temp/frequency_dictionary_en_82_765.txt

--2019-01-17 21:38:53--  https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell/frequency_dictionary_en_82_765.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.164.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.164.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1331889 (1.3M) [text/plain]
Saving to: ‘../temp/frequency_dictionary_en_82_765.txt’


2019-01-17 21:38:54 (8.01 MB/s) - ‘../temp/frequency_dictionary_en_82_765.txt’ saved [1331889/1331889]



In [17]:
%%time

def correct_spelling(input_term, 
                     initial_capacity=83000, 
                     max_edit_distance_dictionary=2, 
                     max_edit_distance_lookup=2, 
                     prefix_length=7):
    '''Correct the spelling of `input_term` with SymSpell and return the
    top suggestion (lowercased, punctuation removed) as a string.
    '''
    # create the SymSpell object; max_edit_distance_dictionary is the
    # maximum edit distance used for dictionary precalculation
    sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary,
                         prefix_length)
    # load dictionary
    dictionary_path = os.path.join(os.path.dirname("../temp/"),
                                   "frequency_dictionary_en_82_765.txt")
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return -1
    
    # lookup suggestions for multi-word input strings (supports compound
    # splitting & merging)
    # max edit distance per lookup (per single word, not per whole input string)
    #max_edit_distance_lookup = 2
    suggestions = sym_spell.lookup_compound(input_term,
                                            max_edit_distance_lookup)
    # display suggestion term, edit distance, and term frequency
    #for suggestion in suggestions:
    #    print("{}, {}, {}".format(suggestion.term, suggestion.count,
    #                              suggestion.distance))

    return suggestions[0].term # risk: we suppose len(suggestions) == 1


content_cs = correct_spelling(content_df)
print('Spelling Correction Result:\n', content_cs, '\n')

Spelling Correction Result:
 ﻿the price of advertisements in this paper a is threepence a line for ﻿the first insertion and a penny a line for every subsequent one all communications to ﻿the editor are requested to be addressed to ﻿the office of ﻿the let elton times section a norwich quay let elton where ﻿the paper may be obtained advertisements must be left at this office before thursday evening for insertion of ﻿the same week and must be paid for at ﻿the time of insertion notice his excellency sir george grey having declared that he will raise no objection to ﻿the erection of canterbury into a separate province if ﻿the power be left in his hands and if ﻿the settlers in canterbury desire it we ﻿the undersigned magistrates of ﻿the district think it right that opportunities should be given of ascertaining ﻿the wishes of ﻿the people on this important subject for this purpose public meetings will be held at ﻿the mitre hotel let elton on wednesday aug of at two clock pm and at ﻿the golden 

**We compare the original and corrected text sentence by sentence below. The spelling correction tool fixed a few unimportant words but mangled most proper nouns, such as names and locations (for example, "Lyttelton" became "let elton"), and replaced numbers with short words. Spelling correction may improve the fluency of sentences, but it does not help extract representative words for topic modeling, and it is slow to compute; these drawbacks cannot be resolved by tuning parameters. Furthermore, the dictionary used for spell checking is modern English, so it may flag words that were correct 100 years ago as errors, and we did not have time to find dictionaries for the English of that era. Thus, we will not use spelling correction tools in the following steps.**

```
fTIHE price of Advertisements in this Paper ■A- is, threepence a line for the first insertion, and a penny a line for every subsequent one.
the   price of advertisements in this paper  a  is  threepence a line for the first insertion  and a penny a line for every subsequent one

All communications to the Editor are requested to be addressed to the Office of the Lyttelton Times, Section 2, Norwich Quay, Lyttelton, where the Paper may be obtained.
all communications to the editor are requested to be addressed to the office of the let elton times  section a  norwich quay  let elton  where the paper may be obtained

Advertisements must be left at this Office before Thursday evening, for insertion of the same week, and must be paid for at the time of insertion.
advertisements must be left at this office before thursday evening  for insertion of the same week  and must be paid for at the time of insertion

NOTICE. ~ HIS EXCELLENCY SIR GEORGE GREY having declared that he will raise no objection to the erection of Canterbury into a separate Province, 
notice    his excellency sir george grey having declared that he will raise no objection to the erection of canterbury into a separate province

if the power be left in his hands, and if the settlers in Canterbury desire it, we, the undersigned, Magistrates of the district, 
if the power be left in his hands  and if the settlers in canterbury desire it  we  the undersigned  magistrates of the district 

think it right that opportunities should be given of ascertaining the wishes of the people on this important subject. 
think it right that opportunities should be given of ascertaining the wishes of the people on this important subject 

For this purpose Public Meetings will be held at the Mitre Hotel, Lyttelton, on Wednesday, Aug. 13, 
for this purpose public meetings will be held at the mitre hotel  let elton  on wednesday  aug  of 

at Two o'clock, p.m.; and at the Golden Fleece Hotel, Christchurch, on Thursday, Aug. 14, at 12 o'clock, noon.
at two clock    pm    and at the golden fleece hotel  christchurch  on thursday  aug  of  at of   clock  noon

J. R. Godley, R. M. W. G. Beittan, J.P. H. G. Gouland, J.P. H. Phillips, J. P. H. J. Tancrbd, J.P. J. C. "W. Russell, J.P. Wm. Deans, J.P. R. Rhodes, J .P. E. J. Wakisfield, J. P. Lyttelton, July 28,1851.
a  a  go ley  a  a  a  a  be than  a a  a  a  go land  a a  a  phillips  a  a  a  a  tancred  a a  a  a   a  russell  a a  we  deans  a a  a  rhodes  a  a  a  a  wiki field  a a  let elton  july of of of

A PUBLIC DINNER WILL be held at the MITRE HOTEL, Lyttelton, on Wednesday, the 13th inst., to commemorate the holding of the First Public Meeting in the Canterbury Settlement.
a public dinner will be held at the mitre hotel  let elton  on wednesday  the with inst   to commemorate the holding of the first public meeting in the canterbury settlement

The Chair will be taken by J. R. Godley, Esq., at half-past six o'clock precisely.
the chair will be taken by a  a  go ley  esq   at half past six   clock precisely

Tickets, price 12s. 6d. each, may be obtained at the Mitre Hotel, and from the following Gentlemen, who have consented to act as Stewards on the occasion.
tickets  price is   cd  each  may be obtained at the mitre hotel  and from the following gentlemen  who have consented to act as stewards on the occasion

PUBLIC NOTICE. NOTICE IS HEREBY GIVEN, that all parties squatting on any of the town reserves, or unappropriated town sections of Christchurch or Lyttelton, will be required to remove from off the land, on or before the first of September next.
public notice  notice is hereby given  that all parties squatting on any of the town reserves  or unappropriated town sections of christchurch or let elton  will be required to remove from off the land  on or before the first of september next

Parties will be allowed to remove the materials of any buildings which they may have erected on such land, or should ..£hey prefer it, the Association will take them from them at a valuation. By order of the Agent of the Canterbury Association. W. G. Bhittan.
parties will be allowed to remove the materials of any buildings which they may have erected on such land  or should    hey prefer it  the association will take them from them at a valuation  by order of the agent of the canterbury association  a  a  bit tan 

£5 REWARD. TX^HEREAS many of the boundary " * marks and pegs of the Town and Country sections have been wantonly removed or destroyed, 
a  reward  to her as many of the boundary     marks and pegs of the town and country sections have been wantonly removed or destroyed 

This is to give notice that the above reward will be given to any person who will give such information at the Land-Office, Christchurch, 
this is to give notice that the above reward will be given to any person who will give such information at the land office  christchurch 

as will lead to the conviction of parties removing, obliterating, or destroying Trigonometrical Stations, Boundary Pegs of Sections, 
as will lead to the conviction of parties removing  obliterating  or destroying trigon metrical stations  boundary pegs of sections 

or any other marks connected with the surveys. By order of the Agent of the Canterbury Association, W . G . Biuttan. Land-Office, Christchnrch, July 17,1851.
or any other marks connected with the surveys  by order of the agent of the canterbury association  a   a   button   land office  christchurch  july of of of 
```

**There are other methods for correcting spelling, but this is not our focus, so improving the quality of the corpus in this way is left as future work.**
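
**One possible direction for that future work, sketched here under stated assumptions, is to correct only lowercase alphabetic tokens and leave capitalised words and numbers untouched, so that names such as "Lyttelton" survive. The sketch uses `difflib.get_close_matches` from the standard library as a stand-in for SymSpell, with a tiny hypothetical vocabulary `VOCAB`; the names `correct_token` and `correct_text` and the misspelled sample sentence are ours. We have not evaluated this on the dataset.**

```python
import difflib
import re

# Hypothetical, tiny vocabulary for illustration only.
VOCAB = ['the', 'price', 'advertisements', 'paper', 'insertion', 'office']


def correct_token(token, vocab=VOCAB, cutoff=0.75):
    """Correct a single lowercase word; return anything else unchanged."""
    # Capitalised words (likely proper nouns), numbers and punctuation
    # are left as they are.
    if not token.isalpha() or not token.islower():
        return token
    if token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct word runs while preserving punctuation and spacing."""
    parts = re.findall(r'[A-Za-z]+|[^A-Za-z]+', text)
    return ''.join(correct_token(p) for p in parts)


sample = ('the prlce of advertisments at the Office of the '
          'Lyttelton Times, Aug. 13')
corrected = correct_text(sample)
print(corrected)
```

**The trade-off is that OCR errors inside capitalised words (for example "Beittan" for "Brittan") are left uncorrected, so this only addresses part of the problem.**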

## 4 NLP Preprocess

**General NLP preprocessing for text includes tokenization, stop-word removal, bigram or n-gram detection, lemmatization, and stemming. MALLET integrates tokenization, stop-word removal, and bigrams, and lemmatization and stemming normally do not help topic modeling perceptibly. We therefore do not implement these steps separately; we rely on MALLET to preprocess the text when importing the data.**
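
**For readers unfamiliar with these steps, the sketch below illustrates tokenization, stop-word removal, and bigram formation in plain Python. It is not how MALLET implements them and is not used in our pipeline; the stop-word list and function names are illustrative assumptions.**

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {'the', 'of', 'in', 'a', 'and', 'is', 'for', 'this',
              'to', 'be', 'at'}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r'[a-z]+', text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return ['_'.join(pair) for pair in zip(tokens, tokens[1:])]


text = 'Public Meetings will be held at the Mitre Hotel, Lyttelton'
tokens = remove_stop_words(tokenize(text))
print(tokens)
print(bigrams(tokens))
```

**Here the bigrams are formed after stop-word removal, which is one design choice among several; forming them before removal would keep pairs such as "be_held".**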

---