<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/master/aut_428_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# aut-428 testing

## Test data extraction

- Apache Spark 2.4.5
- `aut-0.50.0`
- `--master local[25] --executor-memory 100g --driver-memory 100g --conf spark.driver.maxResultSize=0 --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'" --packages "io.archivesunleashed:aut:0.50.0"`
- script:
    ```scala
    import io.archivesunleashed._
    import io.archivesunleashed.df._
    import io.archivesunleashed.matchbox._

    // DataFrame: page content with HTTP headers and HTML removed, written as Parquet.
    RecordLoader.loadArchives("/data/banq-datathon/EnvironnementQc/warcs", sc)
      .webpages()
      .select(RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))
      .write.parquet("/data/banq-datathon/EnvironnementQc/derivatives/aut-428/webpages.parquet")

    // DataFrame: the same content column, written as CSV.
    RecordLoader.loadArchives("/data/banq-datathon/EnvironnementQc/warcs", sc)
      .webpages()
      .select(RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))
      .write.csv("/data/banq-datathon/EnvironnementQc/derivatives/aut-428/webpages.csv")

    // RDD: (crawl date, domain, URL, plain-text content) tuples, written as text.
    RecordLoader.loadArchives("/data/banq-datathon/EnvironnementQc/warcs", sc)
      .keepValidPages()
      .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTMLRDD(RemoveHTTPHeaderRDD(r.getContentString))))
      .saveAsTextFile("/data/banq-datathon/EnvironnementQc/derivatives/aut-428/rdd-text/")
    ```
- [data](http://cloud.archivesunleashed.org/aut-428.tar.gz)





In [1]:
!curl -# -L "http://cloud.archivesunleashed.org/aut-428.tar.gz" > aut-428.tar.gz

######################################################################## 100.0%


In [2]:
!tar -xzvf aut-428.tar.gz

aut-428/
aut-428/aut-428-df.csv
aut-428/aut-428-rdd.txt
aut-428/webpages.parquet/
aut-428/webpages.parquet/part-00215-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet
aut-428/webpages.parquet/.part-01022-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet.crc
aut-428/webpages.parquet/.part-00134-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet.crc
aut-428/webpages.parquet/part-00471-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet
aut-428/webpages.parquet/.part-01449-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet.crc
aut-428/webpages.parquet/part-00347-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet
aut-428/webpages.parquet/part-01145-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet
aut-428/webpages.parquet/part-00814-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet
aut-428/webpages.parquet/part-00144-86ac75e8-a368-4ced-951b-7f8e79083f45-c000.snappy.parquet
aut-428/webpages.parquet/part-00179-86ac75e8-a368-4ced-951b-7f8e79

In [0]:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import matplotlib.pyplot as plt

In [5]:
parquet = pq.read_table('aut-428/webpages.parquet')
webpages_parquet = parquet.to_pandas()
webpages_parquet

Unnamed: 0,content
0,Communiqu� Le ministre | Le ministère | Air et...
1,Communiqu� Le ministre | Le ministère | Air et...
2,Communiqu� Le ministre | Le ministère | Air et...
3,Communiqu� Le ministre | Le ministère | Air et...
4,Communiqu� Le ministre | Le ministère | Air et...
...,...
404575,Communiqu� Communiqu� de presse Tourn�e sur le...
404576,Communiqu� Communiqu� de presse Centre-du-Qu�b...
404577,Communiqu� Communiqu� de presse Qu�bec injecte...
404578,Communiqu� Communiqu� de presse Rencontre posi...
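
The rows above mix correctly decoded accents ("ministère") with U+FFFD replacement characters ("Communiqu�"). One plausible explanation, which this notebook does not confirm, is that some page bytes were in a single-byte encoding such as ISO-8859-1 but were decoded as UTF-8. The sketch below reproduces that effect on a made-up string and measures how often U+FFFD appears. `simulate_misdecoding` and `replacement_ratio` are hypothetical helpers, not part of `aut`.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, rendered as "�"

def simulate_misdecoding(text, actual_encoding="iso-8859-1"):
    """Encode text in a single-byte encoding, then decode the bytes as UTF-8.

    Bytes that are not valid UTF-8 become U+FFFD (assumed cause, for illustration only).
    """
    return text.encode(actual_encoding).decode("utf-8", errors="replace")

def replacement_ratio(texts):
    """Return the fraction of strings that contain at least one U+FFFD."""
    texts = list(texts)
    if not texts:
        return 0.0
    damaged = sum(1 for t in texts if REPLACEMENT in t)
    return damaged / len(texts)

# Made-up sample: one misdecoded string, one clean string.
garbled = simulate_misdecoding("Communiqué de presse")
print(garbled)
print(replacement_ratio([garbled, "Le ministère"]))
```

On the DataFrame loaded above, something like `replacement_ratio(webpages_parquet["content"].dropna())` would give the share of affected pages, assuming the column holds Python strings.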


In [12]:
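# `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines='warn'.
webpages_csv = pd.read_csv('aut-428/aut-428-df.csv', error_bad_lines=False, header=None)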
webpages_csv

b'Skipping line 58: expected 1 fields, saw 23\nSkipping line 195: expected 1 fields, saw 73\nSkipping line 263: expected 1 fields, saw 84\nSkipping line 267: expected 1 fields, saw 84\nSkipping line 592: expected 1 fields, saw 23\nSkipping line 679: expected 1 fields, saw 91\nSkipping line 713: expected 1 fields, saw 53\nSkipping line 718: expected 1 fields, saw 6\nSkipping line 720: expected 1 fields, saw 3\nSkipping line 725: expected 1 fields, saw 83\nSkipping line 1106: expected 1 fields, saw 23\nSkipping line 1114: expected 1 fields, saw 23\nSkipping line 1125: expected 1 fields, saw 2\nSkipping line 1911: expected 1 fields, saw 18\nSkipping line 2304: expected 1 fields, saw 54\nSkipping line 2309: expected 1 fields, saw 21\nSkipping line 2317: expected 1 fields, saw 17\nSkipping line 2499: expected 1 fields, saw 107\nSkipping line 2691: expected 1 fields, saw 6\nSkipping line 2861: expected 1 fields, saw 11\nSkipping line 2898: expected 1 fields, saw 23\nSkipping line 3345: expec

Unnamed: 0,0
0,Communiqu� Le ministre | Le ministère | Air et...
1,Communiqu� Le ministre | Le ministère | Air et...
2,Communiqu� Le ministre | Le ministère | Air et...
3,Communiqu� Le ministre | Le ministère | Air et...
4,Communiqu� Le ministre | Le ministère | Air et...
...,...
399150,Communiqu� Communiqu� de presse Tourn�e sur le...
399151,Communiqu� Communiqu� de presse Centre-du-Qu�b...
399152,Communiqu� Communiqu� de presse Qu�bec injecte...
399153,Communiqu� Communiqu� de presse Rencontre posi...
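
The CSV index ends at 399153, while the Parquet index ends at 404578, so the CSV read yields 5,425 fewer rows. The "expected 1 fields, saw N" messages show that pandas found more delimiters on those lines than on the first line and skipped them. As a minimal sketch, the standard-library `csv` module can tally field counts per record so that affected rows can be inspected rather than dropped. `field_count_report` is a hypothetical helper, the sample data is made up, and the `csv` module's parsing rules may not match those of pandas exactly.

```python
import csv
import io
from collections import Counter

def field_count_report(lines, expected=1):
    """Return (Counter of field counts, record numbers whose count != expected)."""
    counts = Counter()
    unexpected = []
    for record_number, row in enumerate(csv.reader(lines), start=1):
        counts[len(row)] += 1
        if len(row) != expected:
            unexpected.append(record_number)
    return counts, unexpected

# Made-up sample: a plain record, a properly quoted record, and an unquoted one with commas.
sample = io.StringIO('first page\n"quoted, with comma"\nunquoted, with, commas\n')
counts, unexpected = field_count_report(sample)
print(counts)
print(unexpected)
```

To run it on the real file, pass a handle opened with `open('aut-428/aut-428-df.csv', newline='')` in place of `sample`.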


In [14]:
with open('aut-428/aut-428-rdd.txt') as content:
  head = [next(content) for x in range(10)]
print(head)

["(20121218,www.mddep.gouv.qc.ca,http://www.mddep.gouv.qc.ca/infuseur/communique.asp?no=1303,Communiqu� Le ministre\xa0| Le ministère\xa0| Air et changements climatiques\xa0| Biodiversité\xa0|\xa0 Développement\xa0durable\xa0| Eau\xa0| Évaluations environnementales Faune\xa0| Matières résiduelles\xa0| Milieu agricole\xa0| Milieu industriel\xa0| Parcs\xa0| Pesticides\xa0| Regards sur l'environnement\xa0| Terrains contaminés Plan du site Communiqu� de presse \xa0 Version\xa0\xa0\xa0 imprimable Les mesures n�cessaires ont �t� prises pour confiner le paraxyl�ne Qu�bec, le 16 avril 2008 � Le minist�re du D�veloppement durable, de l'Environnement et des Parcs tient � apporter certaines pr�cisions � la suite du reportage du 15 avril 2008 pr�sent� par Radio-Canada concernant le d�versement de paraxyl�ne dans le Port de Montr�al. Contrairement � ce que laissent croire les informations transmises, il est n�cessaire de rappeler que depuis le d�versement jusqu'� aujourd'hui, les mesures n�cessaire
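
Each line printed above is the string form of a Scala tuple: `(crawl date,domain,URL,content)`. The sketch below splits such a line back into named fields. It assumes the first three fields contain no commas (a URL with a comma would be split incorrectly) and that each record sits on a single line, neither of which the source guarantees. `PageRecord` and `parse_rdd_line` are hypothetical names, and the sample content is made up.

```python
from typing import NamedTuple

class PageRecord(NamedTuple):
    crawl_date: str
    domain: str
    url: str
    content: str

def parse_rdd_line(line):
    """Parse one '(date,domain,url,content)' line into a PageRecord."""
    line = line.rstrip("\n")
    if not (line.startswith("(") and line.endswith(")")):
        raise ValueError("line is not wrapped in parentheses")
    # Split on the first three commas only, so commas inside the content are kept.
    crawl_date, domain, url, content = line[1:-1].split(",", 3)
    return PageRecord(crawl_date, domain, url, content)

# Sample line: real date, domain, and URL from the output above; content is made up.
record = parse_rdd_line(
    "(20121218,www.mddep.gouv.qc.ca,"
    "http://www.mddep.gouv.qc.ca/infuseur/communique.asp?no=1303,"
    "Le ministre, le ministère)\n"
)
print(record.crawl_date, record.domain)
print(record.content)
```

Limiting the split to three commas keeps the content field intact, since free text is the field most likely to contain commas.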