In [None]:
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

t0 = time.time()

# 1. Installing `pyspark`

In [None]:
pip install pyspark

#### Import libraries

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
print('Installation takes %s seconds'%(time.time() - t0))

#### Build a Spark session

In [None]:
spark = SparkSession.builder \
                    .master("local") \
                    .appName("Word Count") \
                    .config("spark.some.config.option", "some-value") \
                    .getOrCreate()
spark

In [None]:
sc = spark.sparkContext
sqlContext = SQLContext(sc)

#### Loading the CSV data into `spark` and previewing it with `pandas`

In [None]:
movement_df = spark.read.format("csv").option("header", "true").load(r'../input/big-data-vers-1/movement.csv')
movement_df.limit(5).toPandas()  # fetch only five rows instead of converting the whole table

In [None]:
visiting_df = spark.read.format("csv").option("header", "true").load(r'../input/big-data-vers-1/visiting.csv')
visiting_df.limit(5).toPandas()  # fetch only five rows instead of converting the whole table

#### Register the loaded DataFrames as `SQL` tables

In [None]:
dataframes = [movement_df, visiting_df]
table_names = ["movement", "visiting"]
for k, (df, table_name) in enumerate(zip(dataframes, table_names), start=1):
    t0 = time.time()
    # A session-scoped temp view makes the DataFrame queryable by name in spark.sql()
    df.createOrReplaceTempView(table_name)
    print('Attach table %s (%s) takes %s seconds'%(k, table_name, time.time() - t0))

# 2. Cleaning data. `table: visiting`
When using data, most people agree that your insights and analysis are only as good as the data you are using. Essentially, garbage data in is garbage analysis out. Data cleaning, also referred to as data cleansing and data scrubbing, is one of the most important steps for your organization if you want to create a culture around quality data decision-making.


#### Definition.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.

## 2.1. Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find `“N/A”` and `“Not Applicable”` both appear, but they should be analyzed as the same category.
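
As a minimal sketch of this kind of fix, the snippet below folds several spellings of one category into a single label. The `status` column and the `synonyms` table are hypothetical and only serve as an illustration; they are not part of the `visiting` data.

```python
import pandas as pd

# Hypothetical column with several spellings of the same categories.
status = pd.Series(["N/A", "Not Applicable", "n/a", "Open", "open ", "Closed"])

# Assumed synonym table: each key is folded into the canonical label on the right.
synonyms = {"n/a": "not applicable"}

# Normalize whitespace and case first, then collapse the known synonyms.
cleaned = status.str.strip().str.lower().replace(synonyms)

print(cleaned.value_counts())
```

Normalizing case and whitespace before applying the synonym table keeps the table short, since it only has to list genuinely different wordings.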

**Step 1. Viewing the structure of each column**

In [None]:
visiting = visiting_df.toPandas()
visiting.info()

In this dataset, every feature is stored as an `object` (text) type, because the CSV file was loaded without a schema and Spark therefore read each column as a string. Some of them, such as
- `utc_timestamp`
- `local_timestamp`
- `minimum_dwell`
- etc.

should be stored as `numeric` types. The next cell applies the transformation `object --> numeric`.

In [None]:
numeric_columns = ['utc_timestamp', 'local_timestamp', 'minimum_dwell']
for col in numeric_columns:
    # Timestamps are whole numbers; the dwell time may be fractional
    dtype = 'float64' if col == 'minimum_dwell' else 'int64'
    visiting[col] = visiting[col].astype(dtype)
visiting.info()
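
The conversion above raises an error as soon as one value cannot be parsed. A more tolerant variant, sketched here on a hypothetical column that contains one malformed entry, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable strings into `NaN` so they can be inspected afterwards.

```python
import pandas as pd

# Hypothetical string column containing one malformed entry.
raw = pd.DataFrame({"minimum_dwell": ["5.0", "12.5", "unknown", "30"]})

# errors="coerce" turns unparseable strings into NaN instead of raising.
raw["minimum_dwell"] = pd.to_numeric(raw["minimum_dwell"], errors="coerce")

# Count the values that failed to parse so they can be reviewed explicitly.
n_failed = int(raw["minimum_dwell"].isna().sum())
print(raw.dtypes)
print("unparseable values:", n_failed)
```

Whether silently coercing to `NaN` is acceptable depends on the dataset; counting the failures, as done here, at least makes the loss visible.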

**Step 2. Using `countplot` / `barplot`.**

In this step, check for typos or inconsistent capitalization. This is mostly a concern for categorical features, and you can look at your bar plots to check.

In [None]:
query = spark.sql("""SELECT COUNT(*) AS count, brands 
                     FROM visiting
                     GROUP BY brands
                     ORDER BY count DESC
                     LIMIT 10
                 """)
query = query.toPandas()
# Horizontal bars: the numeric count goes on x and the category on y
sns.barplot(x = 'count', y = 'brands', orient = 'h', data = query)
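
A bar plot only shows the most frequent labels, so rare misspellings can go unnoticed. As a complementary sketch, the helper below (the name `find_label_variants` and the sample labels are hypothetical) lists labels that become identical once whitespace and capitalization are ignored. In this notebook the input could be something like `visiting['brands'].dropna().unique()`, assuming that column holds plain strings.

```python
from collections import defaultdict

def find_label_variants(labels):
    """Group labels that become identical after trimming and case-folding."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.strip().casefold()].add(label)
    # Keep only the keys reached by more than one distinct spelling.
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}

# Hypothetical brand labels used only for illustration.
sample_brands = ["Starbucks", "starbucks", "Walmart", "Walmart ", "Target"]
variants = find_label_variants(sample_brands)
print(variants)
```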

## 2.2. Remove duplicate or irrelevant observations
### 2.2.1. Duplicate observations
Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations. Duplicate observations will happen most often during data collection. There are opportunities to create duplicate data when you:
- combine data sets from multiple places,
- scrape data, or
- receive data from clients or multiple departments.

De-duplication is one of the largest areas to be considered in this process.

In `pandas`, an observation corresponds to a row of the `DataFrame`, so you can use the method `DataFrame.drop_duplicates()` to remove duplicated rows.

In [None]:
print('Data-dimension; before drop_duplicates:', visiting.shape)
visiting.drop_duplicates(inplace = True)
print('Data-dimension; after drop_duplicates:', visiting.shape)

$\qquad \Rightarrow$ Hence, there are no duplicated rows in this dataset.
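
When duplicates do exist, it can help to look at them before dropping anything. The sketch below uses a small hypothetical table (the column names and values are made up) to show `duplicated(keep=False)` for inspection, followed by a non-inplace `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical visits table; rows 0 and 2 are exact duplicates.
visits = pd.DataFrame({
    "device_id": ["a1", "b2", "a1", "c3"],
    "location": ["store_1", "store_2", "store_1", "store_1"],
    "minimum_dwell": [5.0, 12.0, 5.0, 7.5],
})

# duplicated(keep=False) flags every member of a duplicate group for inspection.
duplicate_rows = visits[visits.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates() returns a new frame and keeps the first occurrence by default.
deduplicated = visits.drop_duplicates().reset_index(drop=True)
print(len(visits), "->", len(deduplicated))
```

Returning a new frame instead of using `inplace = True` keeps the original table available for comparison.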

### 2.2.2. Irrelevant observations
`Irrelevant observations` are when you notice observations that do not fit into the **specific problem** you are trying to analyze. 

For example, 
- If you want to analyze data regarding `millennial customers`, but your dataset includes `older generations`, you might remove those `irrelevant observations`. This can make analysis more efficient and minimize distraction from your primary target—as well as creating a more manageable and more performant dataset.

- If you were building a model for `Single-Family` homes only, you wouldn't want observations for `Apartments` in there.

Checking for `irrelevant observations` before `engineering features` can save you many headaches down the road.
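
As a minimal sketch of the `Single-Family` example, irrelevant observations can be removed with a boolean mask. The `listings` table, its columns, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical housing listings; only single-family homes are in scope.
listings = pd.DataFrame({
    "property_type": ["Single-Family", "Apartment", "Single-Family", "Condo"],
    "price": [350_000, 180_000, 420_000, 210_000],
})

# Boolean mask selecting the relevant observations.
in_scope = listings["property_type"] == "Single-Family"

# .copy() avoids chained-assignment warnings if the subset is modified later.
single_family = listings[in_scope].copy()
print(f"kept {len(single_family)} of {len(listings)} rows")
```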

Now, let us come back to our dataset: `visiting`

$\qquad$ **Step 1. Apply the `count plot`**

In [None]:
visiting.columns

In [None]:
visiting.plot()

So

## 2.3.

# 3. Cleaning data. `table: movement`