# Exploring the data
# Part 1 : Extracting the Data

In this notebook we briefly go over the work done to discover the overall shape of our dataset, and how we will clean it and extract what is relevant to us.

We start by creating a Spark SQL context.

In [1]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

We begin by working with only one month of data, both to understand it and to keep the computations manageable on our local system.

In [2]:
text = sqlContext.read.format('com.databricks.spark.xml').options(rowTag="entity").load('02.xml')

We first look at the schema of the PySpark DataFrame to understand how it was loaded into our system.

In [5]:
text.printSchema()

root
 |-- full_text: string (nullable = true)
 |-- links: struct (nullable = true)
 |    |-- continuation_from: string (nullable = true)
 |    |-- continuation_to: string (nullable = true)
 |    |-- first_id: string (nullable = true)
 |    |-- last_id: string (nullable = true)
 |    |-- next_id: string (nullable = true)
 |    |-- prev_id: string (nullable = true)
 |    |-- source: string (nullable = true)
 |-- meta: struct (nullable = true)
 |    |-- box: string (nullable = true)
 |    |-- entity_type: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- issue_date: string (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- page_no: long (nullable = true)
 |    |-- publication: string (nullable = true)
 |    |-- snp: string (nullable = true)
 |    |-- suspicious_chars_count: long (nullable = true)
 |    |-- total_chars_count: long (nullable = true)
 |    |-- updated_char_count: long (nullable = true)
 | 

And we look at the first row of our dataset.

In [7]:
text.show(1)

+--------------------+--------------------+--------------------+
|           full_text|               links|                meta|
+--------------------+--------------------+--------------------+
|7 Suisse Nouvelle...|[null,null,Ar0010...|[10 402 1262 495,...|
+--------------------+--------------------+--------------------+
only showing top 1 row



We can also take a look at the first row as a PySpark Row type.

In [11]:
text.first()

Row(full_text="7 Suisse Nouvelles révélations dans l'affaire de corruption qui secoue le DMF 25 Sports Le mois de janvier s'est révélé agité pour certains entraîneurs de football 26 Culture Services Rencontre avec Kathryn Bigelow, 28 Décès, 29 Carnet, Mots croisés la seule femme spécialisée 30 Cinémas, 31 Télévision, dans le cinéma d'action hollywoodien 32 Météo, « 24 heures »", links=Row(continuation_from=None, continuation_to=None, first_id='Ar00100', last_id='Ad00109', next_id='Ar00101', prev_id=None, source='103_JDG_1996_02_01_0001.PDF'), meta=Row(box='10 402 1262 495', entity_type='Article', id='Ar00100', issue_date='01/02/1996', language='French', name='Untitled Article', page_no=1, publication='JDG', snp='Ar00100S.png', suspicious_chars_count=0, total_chars_count=307, updated_char_count=307, updated_word_count=52, word_count=64))

And we can look at some of the fields of the DataFrame one by one.

In [15]:
text.first().full_text

"7 Suisse Nouvelles révélations dans l'affaire de corruption qui secoue le DMF 25 Sports Le mois de janvier s'est révélé agité pour certains entraîneurs de football 26 Culture Services Rencontre avec Kathryn Bigelow, 28 Décès, 29 Carnet, Mots croisés la seule femme spécialisée 30 Cinémas, 31 Télévision, dans le cinéma d'action hollywoodien 32 Météo, « 24 heures »"

In [18]:
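# Fetch the first row once and reuse it, instead of triggering three Spark jobs.
first_row = text.first()
print('text.meta.box :', first_row.meta.box)
print('text.meta.snp :', first_row.meta.snp)
print('text.links.source :', first_row.links.source)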

text.meta.box : 10 402 1262 495
text.meta.snp : Ar00100S.png
text.links.source : 103_JDG_1996_02_01_0001.PDF


We find that each article has a text and several other fields.

As a first assumption, most of these fields will not really help us, so we keep only the following:

 - full_text
 - meta.issue_date, as we want to know which day the article was published on
 - meta.suspicious_chars_count, as we need to know how many characters the OCR reader flagged as suspicious

In [31]:
textClean = text.select('full_text','meta.issue_date','meta.suspicious_chars_count')
textClean.show(3)

+--------------------+----------+----------------------+
|           full_text|issue_date|suspicious_chars_count|
+--------------------+----------+----------------------+
|7 Suisse Nouvelle...|01/02/1996|                     0|
|ÉDITORIAL Le mili...|01/02/1996|                     0|
|Bérets bleus : la...|01/02/1996|                     0|
+--------------------+----------+----------------------+
only showing top 3 rows
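
The `issue_date` values shown above are plain strings. A minimal sketch of how one of them could be turned into a proper date, assuming the format is day/month/year (which the source file name `103_JDG_1996_02_01_0001.PDF` suggests); the helper name `parse_issue_date` is hypothetical and not part of the notebook:

```python
from datetime import datetime


def parse_issue_date(raw):
    """Parse a 'dd/mm/yyyy' string into a datetime.date.

    Assumption: the day comes first, so '01/02/1996' is 1 February 1996.
    """
    return datetime.strptime(raw, "%d/%m/%Y").date()


issue = parse_issue_date("01/02/1996")
print(issue.isoformat())
```

Working with real dates rather than strings makes it easier to group or sort articles by day later on.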



We can also check whether this month contains any suspicious characters.

In [35]:
textClean.select('suspicious_chars_count').distinct().show()

+----------------------+
|suspicious_chars_count|
+----------------------+
|                     0|
+----------------------+



And we see that for February 1996 there are no suspicious characters, which was expected as the source is surely electronic rather than a paper version!

Now that we have taken a look at the data, we begin to apply functions to it.

# Part 2 : Transforming the Data

We defined a small pipeline to transform each article:
1. Separate each text into words.
2. Put each word in lower case and remove basic punctuation (. , ' etc.).
3. Remove common words that are not useful to our analysis (le, la, de, te, etc.).
4. Count how many times each resulting word occurs, and how many words there are in total (needed for word frequency).
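
The four steps above can be sketched in plain Python on a single string. This is only an illustration under stated assumptions, not the notebook's actual implementation: the `STOPWORDS` set is a tiny hypothetical list, the function names are made up for this example, and punctuation is dropped by keeping only runs of word characters.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list; a real one would be much longer.
STOPWORDS = {"le", "la", "les", "de", "des", "du", "te", "et", "un", "une"}


def tokenize(text):
    """Steps 1 and 2: lowercase the text and split it into word tokens.

    Keeping only runs of word characters also drops punctuation such as . , '
    """
    return re.findall(r"\w+", text.lower())


def word_counts(text, stopwords=STOPWORDS):
    """Steps 3 and 4: remove stopwords, then count each word and the total."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    return Counter(tokens), len(tokens)


counts, total = word_counts("Le mois de janvier, le mois de la neige.")
# Relative frequency of each remaining word in the text.
frequencies = {word: n / total for word, n in counts.items()}
print(counts.most_common(), total)
```

The total is computed after stopword removal here; counting it before removal instead would be an equally valid choice, depending on how the word frequency is meant to be normalised.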