# Exploring the data
# Part 1 : Extracting the Data

In this notebook we briefly go over the work done to discover the overall shape of our dataset, and how we clean it and extract what is relevant to us.

We start by creating an SQL context.

In [None]:
from pyspark.sql import SQLContext
# `sc` is the SparkContext, assumed to be already provided by the PySpark notebook
sqlContext = SQLContext(sc)

We begin by working with the data of a single month, both to understand it and so that the computations stay manageable on our local system.

In [None]:
# Requires the spark-xml package; each <entity> element becomes one row
text = sqlContext.read.format('com.databricks.spark.xml').options(rowTag="entity").load('02.xml')

We first look at the schema of the PySpark DataFrame to understand how the data was loaded into our system.

In [None]:
text.printSchema()

And we look at the first row of our dataset.

In [None]:
text.show(1)

We can also take a look at the first row as a PySpark Row type.

In [None]:
text.first()

And we can look at some of the fields of the DataFrame one by one.

In [None]:
text.first().full_text

In [None]:
first_row = text.first()  # fetch the first row once and reuse it
print('text.meta.box :', first_row.meta.box)
print('text.meta.snp :', first_row.meta.snp)
print('text.links.source :', first_row.links.source)

We find that each article has a text and several other parameters.

We make the initial assumption that most of these parameters will not be of real help to us, so we keep only the following fields:

 - full_text
 - meta.issue_date, as we want to know which day the article was published on
 - meta.suspicious_chars_count, which we need to know the number of suspicious characters reported by the OCR reader

In [None]:
textClean = text.select('full_text','meta.issue_date','meta.suspicious_chars_count')
textClean.show(3)

We can also check whether this month contains any suspicious characters.

In [None]:
textClean.select('suspicious_chars_count').distinct().show()

We see that for January 1999 there are no suspicious characters, which was expected, as the source is most likely electronic rather than a scanned paper version.

Now that we have taken a look at the data, we begin applying functions to it.

# Part 2 : Transforming the Data

We defined a small pipeline to transform each article:
1. Split each text into words.
2. Convert each word to lower case and remove basic punctuation (. , ' etc.).
3. Remove common words that are not useful to our analysis (le, la, de, te, etc.).
4. Count the number of occurrences of each remaining word, and the total number of words (needed for word frequency).
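
Before running this on Spark, the steps above can be sketched in plain Python on a single string. This is only an illustration under a few assumptions: step 1 is taken to mean splitting the text into words, the total is counted after the common words are removed, and `COMMON_WORDS` and `process_article` are hypothetical names with a deliberately tiny word list, not the ones used in the actual analysis.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of common French words;
# the real list used for the analysis is not shown here.
COMMON_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "te"}


def process_article(text, common_words=COMMON_WORDS):
    """Return (word_counts, total_words) for one article."""
    # Steps 1 and 2: lower-case the text and split it into words;
    # anything that is not a letter, digit or underscore acts as a separator,
    # which also drops punctuation such as . , and '
    words = re.findall(r"\w+", text.lower())
    # Step 3: drop the common words.
    kept = [w for w in words if w not in common_words]
    # Step 4: count each remaining word and the total number of words
    # (assumption: the total is taken after filtering).
    return Counter(kept), len(kept)


counts, total = process_article("Le chat de la voisine regarde le chat.")
# Relative frequency of each remaining word within the article.
frequencies = {word: n / total for word, n in counts.items()}
```

On this toy sentence, `chat` appears twice among the four remaining words, so its frequency is 0.5. The same function could later be applied to each row's `full_text`, for example through an RDD `map`, but that wiring is left out of this sketch.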

Let's take a look at the processing steps for one article.

In [None]:
article1 = te