# **Generate Learning Data** - We begin by reading our data set from a CSV file into a Spark dataframe and use signals to build feature vectors
**Get Data** - We read the data from a CSV file in which each tuple is identified by its index, together with a list of denial constraints
**Prepare Data** - We parse each denial constraint and build an SQL query from it
**Analyze Data** - For each constraint we build a dataframe that contains the tuples that satisfy that denial constraint
    

> The ***pyspark*** library is used for all the data analysis excluding a small piece of the data presentation section. The ***numpy*** library will only be needed for set functionality. Importing the libraries is the first step.

In [2]:
# Import all libraries needed for the tutorial

# General syntax to import specific functions in a library: 
##from (library) import (specific library function)
 
import dataEngine as de  # the module we are testing
import pyspark as ps  # this is how I usually import spark
import numpy as np  # this is how I usually import numpy
import sys  # for some system calls

In [3]:
print('Python version ' + sys.version)
print('Spark version ' + ps.__version__)
print('numpy version ' + np.__version__)

Python version 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Spark version 2.2.0
numpy version 1.12.1


# Initialize Spark Session  

The Spark session for this tutorial runs locally and its application name is "Data Featurizing".

In [4]:
# Create (or reuse) a local Spark session
spark = ps.sql.SparkSession.builder.master("local").appName("Data Featurizing").getOrCreate()

# Reading data 

In this section we use the Spark session to read the CSV file 10.csv, which contains 10 weather records with the header columns index, city, date, tempType (the measurement type: TMAX, TMIN or PRCP) and temp (the temperature). Because no schema is inferred, every column is read as a string.

In [5]:
df=spark.read.csv("10.csv",header=True)
df.show()

+-----+-----------+--------+--------+----+
|index|       city|    date|tempType|temp|
+-----+-----------+--------+--------+----+
|    1|ITE00100554|18000101|    TMAX| -75|
|    2|ITE00100554|18000101|    TMIN|-148|
|    3|GM000010962|18000101|    PRCP|   0|
|    4|EZE00100082|18000101|    TMAX| -86|
|    5|EZE00100082|18000101|    TMAX|-135|
|    6|ITE00100554|18000102|    TMAX| -60|
|    7|ITE00100554|18000102|    TMIN|-125|
|    8|GM000010962|18000102|    PRCP|   0|
|    9|EZE00100082|18000102|    TMAX| -44|
|   10|EZE00100082|18000102|    TMIN|-135|
+-----+-----------+--------+--------+----+



## DenialConstraint Class 
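This module parses denial constraints and builds queries for finding the tuples that satisfy a specific constraint. All denial constraints are assumed to have the structure shown below: the tuple variables come first, followed by the predicates, all joined with `&`. In the generated SQL, `EQ` becomes `=` and `IQ` becomes `<>`.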

In [6]:
dcCode=['t1&t2&EQ(t1.city,t2.city)&EQ(t1.temp,t2.temp)&IQ(t1.tempType,t2.tempType)']
noisy_cells=[('5','city'),('1','temp')]

Here we test the method $dc2SqlCondition$ by creating a DenialConstraint object from $dcCode$.

In [7]:
dcObj=de.DenialConstraint(dcCode)
set_of_preds,operations=dcObj.dc2SqlCondition()
print ("List of lists that contain predicates for each DC :\n")
for i in range(0,len(set_of_preds)):
    print ("sql pred of " + dcCode[i] + " is : ")
    print (set_of_preds[i])
    print ("and its operations:")
    print (operations[i])
    print("\n")

List of lists that contain predicates for each DC :

sql pred of t1&t2&EQ(t1.city,t2.city)&EQ(t1.temp,t2.temp)&IQ(t1.tempType,t2.tempType) is : 
['table1.city=table2.city', 'table1.temp=table2.temp', 'table1.tempType<>table2.tempType']
and its operations:
['EQ', 'EQ', 'IQ']




As you can see, a denial constraint with n predicates produces a list of length n, with the predicates kept in their original order. The method make_condition simply returns the list at position conditionInd.
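
The translation shown above can be sketched in plain Python. This is not the dataEngine implementation: the function name `dc_to_sql_predicates`, the regular expression, and the assumptions that the tuple variables come first and that only `EQ` and `IQ` occur are illustrative choices based on the example in this notebook.

```python
import re

# Operator codes seen in this notebook; any others would have to be added.
OPERATORS = {"EQ": "=", "IQ": "<>"}

# Matches one predicate such as EQ(t1.city,t2.city).
PREDICATE = re.compile(r"^(\w+)\((\w+)\.(\w+),(\w+)\.(\w+)\)$")


def dc_to_sql_predicates(dc):
    """Split one denial constraint string into SQL predicates (hypothetical helper)."""
    parts = dc.split("&")
    # Leading parts without parentheses name the tuple variables (t1, t2, ...).
    tuple_vars = [p for p in parts if "(" not in p]
    aliases = dict((name, "table%d" % (i + 1)) for i, name in enumerate(tuple_vars))
    predicates, operations = [], []
    for part in parts[len(tuple_vars):]:
        match = PREDICATE.match(part)
        if match is None:
            raise ValueError("cannot parse predicate: %r" % part)
        op, left_var, left_attr, right_var, right_attr = match.groups()
        predicates.append("%s.%s%s%s.%s" % (
            aliases[left_var], left_attr, OPERATORS[op],
            aliases[right_var], right_attr))
        operations.append(op)
    return predicates, operations


preds, ops = dc_to_sql_predicates(
    "t1&t2&EQ(t1.city,t2.city)&EQ(t1.temp,t2.temp)&IQ(t1.tempType,t2.tempType)")
print(preds)
print(ops)
```

The predicates come back in the same order as they appear in the constraint, which is what lets a position index pick one of them out.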


### Finding satisfied tuples using Spark

In this section we find the pairs of record indices that satisfy the specified constraint. For this we have noViolation_tuple, which builds a dataframe in which each row holds the indices of two tuples that together satisfy the denial constraint.

In [8]:
noVio=dcObj.noViolation_tuple(df, 0, spark)
cnt=noVio.count()
noVio.show(cnt)

+-------+-------+
|indexT1|indexT2|
+-------+-------+
|      1|      1|
|      1|      2|
|      1|      3|
|      1|      4|
|      1|      5|
|      1|      6|
|      1|      7|
|      1|      8|
|      1|      9|
|      1|     10|
|      2|      1|
|      2|      2|
|      2|      3|
|      2|      4|
|      2|      5|
|      2|      6|
|      2|      7|
|      2|      8|
|      2|      9|
|      2|     10|
|      3|      1|
|      3|      2|
|      3|      3|
|      3|      4|
|      3|      5|
|      3|      6|
|      3|      7|
|      3|      8|
|      3|      9|
|      3|     10|
|      4|      1|
|      4|      2|
|      4|      3|
|      4|      4|
|      4|      5|
|      4|      6|
|      4|      7|
|      4|      8|
|      4|      9|
|      4|     10|
|      5|      1|
|      5|      2|
|      5|      3|
|      5|      4|
|      5|      5|
|      5|      6|
|      5|      7|
|      5|      8|
|      5|      9|
|      6|      1|
|      6|      2|
|      6|      3|
|      6| 

As you can see in the dataframe, the rows with index 5 and 10 do not satisfy the denial constraint: they have the same city (EZE00100082) and the same temp (-135) but different tempType values (TMAX and TMIN), so the maximum temperature on one day equals the minimum temperature on another day. This is an error, and accordingly neither (5,10) nor (10,5) appears in the tuple dataframe.
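
The reasoning about rows 5 and 10 can be checked by hand with a few lines of plain Python. This is only a stand-in for the Spark computation: the `ROWS` list is typed in from the dataframe shown in this notebook, and the helper name `violates` is hypothetical.

```python
# Rows of 10.csv as displayed in this notebook: (index, city, date, tempType, temp).
ROWS = [
    ("1", "ITE00100554", "18000101", "TMAX", "-75"),
    ("2", "ITE00100554", "18000101", "TMIN", "-148"),
    ("3", "GM000010962", "18000101", "PRCP", "0"),
    ("4", "EZE00100082", "18000101", "TMAX", "-86"),
    ("5", "EZE00100082", "18000101", "TMAX", "-135"),
    ("6", "ITE00100554", "18000102", "TMAX", "-60"),
    ("7", "ITE00100554", "18000102", "TMIN", "-125"),
    ("8", "GM000010962", "18000102", "PRCP", "0"),
    ("9", "EZE00100082", "18000102", "TMAX", "-44"),
    ("10", "EZE00100082", "18000102", "TMIN", "-135"),
]


def violates(t1, t2):
    """True when the pair matches every predicate of the denial constraint:
    same city, same temp, different tempType."""
    return t1[1] == t2[1] and t1[4] == t2[4] and t1[3] != t2[3]


# Every ordered pair of rows, including a row paired with itself.
violating_pairs = [(a[0], b[0]) for a in ROWS for b in ROWS if violates(a, b)]
satisfying_pairs = [(a[0], b[0]) for a in ROWS for b in ROWS if not violates(a, b)]

print(violating_pairs)
print(len(satisfying_pairs))
```

Rows 3 and 8 also share a city and a temp, but their tempType is the same (PRCP), so that pair is not a violation.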

In [9]:
df.show()

+-----+-----------+--------+--------+----+
|index|       city|    date|tempType|temp|
+-----+-----------+--------+--------+----+
|    1|ITE00100554|18000101|    TMAX| -75|
|    2|ITE00100554|18000101|    TMIN|-148|
|    3|GM000010962|18000101|    PRCP|   0|
|    4|EZE00100082|18000101|    TMAX| -86|
|    5|EZE00100082|18000101|    TMAX|-135|
|    6|ITE00100554|18000102|    TMAX| -60|
|    7|ITE00100554|18000102|    TMIN|-125|
|    8|GM000010962|18000102|    PRCP|   0|
|    9|EZE00100082|18000102|    TMAX| -44|
|   10|EZE00100082|18000102|    TMIN|-135|
+-----+-----------+--------+--------+----+



### Finding not satisfied tuples using Spark

In this section we find the pairs of record indices that do not satisfy the specified constraint. For this we have violation_tuple, which builds a dataframe in which each row holds the indices of two tuples that together violate the specified denial constraint.

In [10]:
vio=dcObj.violation_tuple(df, 0, spark)
cnt=vio.count()
vio.show(cnt)

+-------+-------+
|indexT1|indexT2|
+-------+-------+
|      5|     10|
|     10|      5|
+-------+-------+

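
A self-join over the generated predicates can be tried without Spark by using Python's built-in sqlite3 module. This is a sketch under assumptions, not the query dataEngine actually issues: the table name `weather`, the exact SELECT shape, and the ORDER BY clause are made up for illustration, and the rows are typed in from the dataframe shown in this notebook.

```python
import sqlite3

# Rows of 10.csv as displayed in this notebook: (index, city, date, tempType, temp).
ROWS = [
    ("1", "ITE00100554", "18000101", "TMAX", "-75"),
    ("2", "ITE00100554", "18000101", "TMIN", "-148"),
    ("3", "GM000010962", "18000101", "PRCP", "0"),
    ("4", "EZE00100082", "18000101", "TMAX", "-86"),
    ("5", "EZE00100082", "18000101", "TMAX", "-135"),
    ("6", "ITE00100554", "18000102", "TMAX", "-60"),
    ("7", "ITE00100554", "18000102", "TMIN", "-125"),
    ("8", "GM000010962", "18000102", "PRCP", "0"),
    ("9", "EZE00100082", "18000102", "TMAX", "-44"),
    ("10", "EZE00100082", "18000102", "TMIN", "-135"),
]

# Predicates exactly as printed by dc2SqlCondition earlier in the notebook.
predicates = [
    "table1.city=table2.city",
    "table1.temp=table2.temp",
    "table1.tempType<>table2.tempType",
]

conn = sqlite3.connect(":memory:")
# "index" is a reserved word in SQLite, so it has to be quoted.
conn.execute(
    'CREATE TABLE weather ("index" TEXT, city TEXT, date TEXT, tempType TEXT, temp TEXT)')
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?)", ROWS)

# Join the table with itself and keep the pairs that match every predicate.
query = (
    'SELECT table1."index" AS indexT1, table2."index" AS indexT2 '
    "FROM weather AS table1, weather AS table2 "
    "WHERE " + " AND ".join(predicates) + " "
    'ORDER BY CAST(table1."index" AS INTEGER)'
)
violations = conn.execute(query).fetchall()
conn.close()

print(violations)
```

Wrapping the joined condition in `NOT (...)` would give the satisfying pairs instead.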


## DomainPruning Class
In this section we check the domain pruning part of the code. We have 2 dirty cells in this example, and for each of them we first calculate the feasible values.

In [11]:
dp=de.DomainPruning(df,noisy_cells)
dVal=dp.candidate_values()
for i in range(0,len(dVal)):
    print("The domain for attribute ",noisy_cells[i][1]," is ",dVal[i])

('The domain for attribute ', 'city', ' is ', [u'EZE00100082'])
('The domain for attribute ', 'temp', ' is ', [u'-75'])


The next method builds a dictionary of lists that gives the domain for each attribute in the database.

In [12]:
dp.allowable_doamin_value()

{'city': [u'EZE00100082'],
 'date': [u'18000101', u'18000102'],
 'temp': [u'-75'],
 'tempType': [u'PRCP', u'TMAX', u'TMIN']}

There are some other methods, such as allowable_rows, which returns a dataframe of the cells whose rows contain no noisy cell and whose values are in the domain after pruning. This will help when we want to choose training data.

We will use the other functions when we build the featurized table.

## QuantativeStatisticsFeaturize Class

This class builds feature vectors from statistical signals for the cells in the dataset.
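
Judging only by the feature column names this class produces (`freq_city`, `freq_2_city_date`, `freq_3_city_date_temp`, ...), the statistical signals appear to be occurrence counts of single values and of combinations of two or three values. The sketch below computes such counts in plain Python. It is an assumption about the idea, not the dataEngine code: the helper `frequency_tables` is hypothetical, and how the library maps these counts onto individual cells, or whether it normalizes them, is not shown in this notebook.

```python
from collections import Counter
from itertools import combinations

COLUMNS = ["index", "city", "date", "tempType", "temp"]
RAW = [
    ("1", "ITE00100554", "18000101", "TMAX", "-75"),
    ("2", "ITE00100554", "18000101", "TMIN", "-148"),
    ("3", "GM000010962", "18000101", "PRCP", "0"),
    ("4", "EZE00100082", "18000101", "TMAX", "-86"),
    ("5", "EZE00100082", "18000101", "TMAX", "-135"),
    ("6", "ITE00100554", "18000102", "TMAX", "-60"),
    ("7", "ITE00100554", "18000102", "TMIN", "-125"),
    ("8", "GM000010962", "18000102", "PRCP", "0"),
    ("9", "EZE00100082", "18000102", "TMAX", "-44"),
    ("10", "EZE00100082", "18000102", "TMIN", "-135"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW]
ATTRS = COLUMNS[1:]  # every column except the index


def frequency_tables(rows, attrs, max_size=3):
    """Count how often each value combination occurs, for 1..max_size attributes."""
    tables = {}
    for size in range(1, max_size + 1):
        for combo in combinations(attrs, size):
            names = "_".join(combo)
            column = "freq_" + names if size == 1 else "freq_%d_%s" % (size, names)
            tables[column] = Counter(tuple(row[a] for a in combo) for row in rows)
    return tables


tables = frequency_tables(ROWS, ATTRS)
print(sorted(tables))
print(tables["freq_city"][("EZE00100082",)])
print(tables["freq_2_city_date"][("ITE00100554", "18000101")])
```

With four attributes this gives 4 single, 6 pairwise and 4 three-way tables, 14 in total.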

In [13]:
qsObj=de.QuantativeStatisticsFeaturize(df)
qsFeatureVectors=qsObj.featurize(spark)
print(qsFeatureVectors.columns)
qsFeatureVectors.show()



['cell', 'freq_2_city_date', 'freq_2_city_temp', 'freq_2_city_tempType', 'freq_2_date_temp', 'freq_2_date_tempType', 'freq_2_tempType_temp', 'freq_3_city_date_temp', 'freq_3_city_date_tempType', 'freq_3_city_tempType_temp', 'freq_3_date_tempType_temp', 'freq_city', 'freq_date', 'freq_temp', 'freq_tempType']
+------------+----------------+----------------+--------------------+----------------+--------------------+--------------------+---------------------+-------------------------+-------------------------+-------------------------+---------+---------+---------+-------------+
|        cell|freq_2_city_date|freq_2_city_temp|freq_2_city_tempType|freq_2_date_temp|freq_2_date_tempType|freq_2_tempType_temp|freq_3_city_date_temp|freq_3_city_date_tempType|freq_3_city_tempType_temp|freq_3_date_tempType_temp|freq_city|freq_date|freq_temp|freq_tempType|
+------------+----------------+----------------+--------------------+----------------+--------------------+--------------------+-----------------

## DenialConstraintFeaturize Class

This class builds feature vectors from denial constraint signals for the cells in the dataset.

In [14]:
dcfObj=de.DenialConstraintFeaturize(df,dcCode)

To featurize with denial constraints we need the list of all satisfying tuples. This list comes from the DenialConstraint class; since we already have its object, we just call all_dc_nonViolation.

In [15]:
noviolationList=dcObj.all_dc_nonViolation(df,spark)
dcfFeatureVector=dcfObj.featurize(noviolationList,spark)
dcfFeatureVector.show()

+------------+---+---+---+---+---+---+---+---+---+---+
|        cell|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|
+------------+---+---+---+---+---+---+---+---+---+---+
|    [1,city]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [1,date]|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|[1,tempType]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [1,temp]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [2,city]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [2,date]|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|[2,tempType]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [2,temp]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [3,city]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [3,date]|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|[3,tempType]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [3,temp]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [4,city]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [4,date]|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|[4,tempType]|  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
|    [4,te

## HolocleanData Class

### Join all signal feature vectors
This class aggregates the signals and splits the feature vectors into noisy cells and clean cells. Here we create an object and aggregate our tables.

In [16]:
hd=de.HolocleanData(df)
joint_table =hd.signal_features_aggregator(dcfFeatureVector,qsFeatureVectors)
joint_table.columns

['cell',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 'freq_2_city_date',
 'freq_2_city_temp',
 'freq_2_city_tempType',
 'freq_2_date_temp',
 'freq_2_date_tempType',
 'freq_2_tempType_temp',
 'freq_3_city_date_temp',
 'freq_3_city_date_tempType',
 'freq_3_city_tempType_temp',
 'freq_3_date_tempType_temp',
 'freq_city',
 'freq_date',
 'freq_temp',
 'freq_tempType']
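
The resulting column list suggests that the two feature tables are joined on the shared `cell` column. A minimal plain-Python sketch of such a join follows. The function `join_on_cell`, the choice of an inner join, and the toy feature values are assumptions made for illustration; the notebook does not show how signal_features_aggregator is implemented.

```python
def join_on_cell(left, right):
    """Inner-join two lists of feature rows (dicts) on their 'cell' key."""
    right_by_cell = dict((row["cell"], row) for row in right)
    joined = []
    for row in left:
        match = right_by_cell.get(row["cell"])
        if match is None:
            continue  # no partner row: dropped, as in an inner join
        merged = dict(row)
        for key, value in match.items():
            if key != "cell":
                merged[key] = value
        joined.append(merged)
    return joined


# Toy rows with made-up feature values, only to show the shape of the result.
dc_rows = [
    {"cell": ("1", "city"), "1": 1, "2": 1},
    {"cell": ("1", "date"), "1": 0, "2": 0},
]
freq_rows = [
    {"cell": ("1", "city"), "freq_city": 4},
    {"cell": ("1", "date"), "freq_date": 5},
]

joint = join_on_cell(dc_rows, freq_rows)
print(joint[0])
```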

### Split clean cells and noisy cells 
The next step is to separate the feature vectors of noisy cells and clean cells.

In [17]:
noisy_cells_fv,clean_cells_fv=hd.split_noisy_nonNoisy(joint_table,noisy_cells,spark)
print("Whole data",joint_table.count())
print("Noisy cells",noisy_cells_fv.count())
print("Clean cells",clean_cells_fv.count())

('Whole data', 40)
('Noisy cells', 2)
('Clean cells', 38)
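
The counts above follow from simple arithmetic: 10 rows times 4 attributes gives 40 cells, of which 2 are listed in noisy_cells. A plain-Python sketch of the same split, using hypothetical variable names and operating on cell identifiers only rather than on feature vectors:

```python
ATTRS = ["city", "date", "tempType", "temp"]
INDEXES = [str(i) for i in range(1, 11)]  # the ten row indices, as strings
noisy_cells = [("5", "city"), ("1", "temp")]

# One cell per (row index, attribute) pair.
cells = [(index, attr) for index in INDEXES for attr in ATTRS]

noisy = set(noisy_cells)
noisy_part = [cell for cell in cells if cell in noisy]
clean_part = [cell for cell in cells if cell not in noisy]

print(len(cells), len(noisy_part), len(clean_part))
```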


### Select cells for training data
In our learning policy we do not use cells from rows that contain a noisy cell. We also need to choose clean cells whose values are in the pruned domain, so we may need to prune some of the clean cells.

In [18]:
selected_cell=dp.allowable_rows(clean_cells_fv,spark)
print ("Number of clean cells",clean_cells_fv.count())
print ("Number of selected cells",selected_cell.count())

('Number of clean cells', 38)
('Number of selected cells', 19)
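
The count of 19 can be reproduced by applying the two rules directly: skip every row that contains a noisy cell, then keep only the cells whose value lies in the pruned domain. The sketch below does this in plain Python. The helper `select_training_cells` is hypothetical, and the `DOMAIN` dictionary is copied from the allowable_doamin_value output earlier in the notebook.

```python
COLUMNS = ["index", "city", "date", "tempType", "temp"]
RAW = [
    ("1", "ITE00100554", "18000101", "TMAX", "-75"),
    ("2", "ITE00100554", "18000101", "TMIN", "-148"),
    ("3", "GM000010962", "18000101", "PRCP", "0"),
    ("4", "EZE00100082", "18000101", "TMAX", "-86"),
    ("5", "EZE00100082", "18000101", "TMAX", "-135"),
    ("6", "ITE00100554", "18000102", "TMAX", "-60"),
    ("7", "ITE00100554", "18000102", "TMIN", "-125"),
    ("8", "GM000010962", "18000102", "PRCP", "0"),
    ("9", "EZE00100082", "18000102", "TMAX", "-44"),
    ("10", "EZE00100082", "18000102", "TMIN", "-135"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW]
ATTRS = COLUMNS[1:]
NOISY_CELLS = [("5", "city"), ("1", "temp")]
# Pruned domain as printed by allowable_doamin_value in this notebook.
DOMAIN = {
    "city": ["EZE00100082"],
    "date": ["18000101", "18000102"],
    "temp": ["-75"],
    "tempType": ["PRCP", "TMAX", "TMIN"],
}


def select_training_cells(rows, attrs, noisy_cells, domain):
    """Keep cells from rows without noisy cells whose value is in the pruned domain."""
    noisy_rows = set(index for index, _ in noisy_cells)
    selected = []
    for row in rows:
        if row["index"] in noisy_rows:
            continue  # the whole row is excluded
        for attr in attrs:
            if row[attr] in domain[attr]:
                selected.append((row["index"], attr))
    return selected


selected = select_training_cells(ROWS, ATTRS, NOISY_CELLS, DOMAIN)
print(len(selected))
```

Of the 8 remaining rows, all date and tempType cells survive (8 each), 3 city cells match the pruned city domain, and no temp cell does, which gives 19.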


### Calculate labels for training data

In the last part of data preparation we need to calculate the labels of the selected training data, using another function of the HolocleanData class.

In [19]:
label_selected_cell=hd.make_trainingdata_label(selected_cell,spark)
print(label_selected_cell.count())
label_selected_cell.columns


19


['city_withValue_ITE00100554',
 'city_withValue_GM000010962',
 'city_withValue_EZE00100082',
 'date_withValue_18000101',
 'date_withValue_18000102',
 'tempType_withValue_TMAX',
 'tempType_withValue_TMIN',
 'tempType_withValue_PRCP',
 'temp_withValue_-75',
 'temp_withValue_-148',
 'temp_withValue_0',
 'temp_withValue_-86',
 'temp_withValue_-135',
 'temp_withValue_-60',
 'temp_withValue_-125',
 'temp_withValue_-44']
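
The label columns follow the pattern `<attribute>_withValue_<value>`, one per distinct value of each attribute. The sketch below rebuilds that column list and a one-hot row for a single cell. It is an assumption that the labels are one-hot indicators of this kind: the notebook shows only the column names, not the label values, and the helpers `label_columns` and `one_hot` are hypothetical.

```python
COLUMNS = ["index", "city", "date", "tempType", "temp"]
RAW = [
    ("1", "ITE00100554", "18000101", "TMAX", "-75"),
    ("2", "ITE00100554", "18000101", "TMIN", "-148"),
    ("3", "GM000010962", "18000101", "PRCP", "0"),
    ("4", "EZE00100082", "18000101", "TMAX", "-86"),
    ("5", "EZE00100082", "18000101", "TMAX", "-135"),
    ("6", "ITE00100554", "18000102", "TMAX", "-60"),
    ("7", "ITE00100554", "18000102", "TMIN", "-125"),
    ("8", "GM000010962", "18000102", "PRCP", "0"),
    ("9", "EZE00100082", "18000102", "TMAX", "-44"),
    ("10", "EZE00100082", "18000102", "TMIN", "-135"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW]
ATTRS = COLUMNS[1:]


def label_columns(rows, attrs):
    """One column per (attribute, value), values in order of first appearance."""
    columns = []
    for attr in attrs:
        seen = []
        for row in rows:
            if row[attr] not in seen:
                seen.append(row[attr])
        columns.extend("%s_withValue_%s" % (attr, value) for value in seen)
    return columns


def one_hot(attr, value, columns):
    """1 in the column that names this attribute/value pair, 0 everywhere else."""
    target = "%s_withValue_%s" % (attr, value)
    return [1 if column == target else 0 for column in columns]


columns = label_columns(ROWS, ATTRS)
label = one_hot("city", "EZE00100082", columns)
print(columns)
print(label)
```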

### Separate index column from data
We need to separate the data index from the feature vectors, using another function from HolocleanData.

In [21]:
index,featureVectors=hd.index_X(selected_cell,spark)
index.show()
print('Before separation',selected_cell.columns)
print("After separation",featureVectors.columns)

+-------------+
|           as|
+-------------+
| [4,tempType]|
|     [2,date]|
|     [3,date]|
|     [8,date]|
| [8,tempType]|
|     [9,date]|
|     [6,date]|
|    [10,city]|
|    [10,date]|
| [3,tempType]|
| [9,tempType]|
| [6,tempType]|
|[10,tempType]|
|     [7,date]|
| [7,tempType]|
| [2,tempType]|
|     [4,city]|
|     [9,city]|
|     [4,date]|
+-------------+

('Before separation', ['cell', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'freq_2_city_date', 'freq_2_city_temp', 'freq_2_city_tempType', 'freq_2_date_temp', 'freq_2_date_tempType', 'freq_2_tempType_temp', 'freq_3_city_date_temp', 'freq_3_city_date_tempType', 'freq_3_city_tempType_temp', 'freq_3_date_tempType_temp', 'freq_city', 'freq_date', 'freq_temp', 'freq_tempType'])
('After separation', ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'freq_2_city_date', 'freq_2_city_temp', 'freq_2_city_tempType', 'freq_2_date_temp', 'freq_2_date_tempType', 'freq_2_tempType_temp', 'freq_3_city_date_temp', 'freq_3_city_da

## Go to learning!
In the last part we have our x and y, so we can feed them to TensorFlow.