# Exploratory Data Analysis
---
In this notebook I will investigate the data, looking for possible problems, class balance, and interesting correlations.

### Table of contents
- [Setup](#setup)

## Setup <a name='setup'></a>

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
import pyspark as ps
import pyspark.sql.functions as psf

%matplotlib inline
%config InlineBackend.figure_format='retina'

plt.rcParams['figure.figsize'] = (10, 5)
sns.set(font_scale=1.2)
sns.set_style('whitegrid')

In [2]:
spark = ps.sql.SparkSession.builder\
                            .master('local[4]')\
                            .appName('eda')\
                            .getOrCreate()

## Class balance

Since we are framing our problem as sentiment classification, understanding the distribution of the classes is fundamental to the model's success.

Before starting any analysis, I will load the data and create a temporary view.

In [3]:
amazon = spark.read.parquet('../data/amazon.parquet')

In [4]:
amazon.createOrReplaceTempView('amazon')

One of the main advantages of using PySpark is its flexibility in terms of querying. PySpark's data frame abstraction allows us to alternate freely between declarative and imperative querying paradigms.

In [29]:
query = spark.sql('''
    SELECT asin, overall, summary, positiveVotes
    FROM amazon
    LIMIT 100
''').show(100)

+----------+-------+--------------------+-------------+
|      asin|overall|             summary|positiveVotes|
+----------+-------+--------------------+-------------+
|0006353282|      5|        An easy read|            0|
|0006353282|      4|Interesting pictu...|            0|
|0006353282|      4|An engaging memoi...|            4|
|0006353282|      5|An Autobiography,...|            0|
|0006353282|      4|"Instead  I went ...|            3|
|0006353282|      4|Personally and cu...|            0|
|0006353282|      4|There are really ...|           26|
|0006353282|      5|enjoyable read!wh...|           43|
|0006353282|      3|An unexpected lif...|            0|
|0006353282|      5|Great life story ...|            4|
|0006353282|      5|    My Favorite Book|           15|
|0006353282|      5|What a lady! That...|            1|
|0006353282|      5|Welcome to the wo...|            1|
|0006353282|      5|    A Very Good Read|            0|
|0006353282|      5|wonderful look at...|       