# Big data mining and Applications  Course
## Individual household electric power consumption Data Set

### Data Set Information:
Available at: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption 

[Individual household electric power consumption dataset] from UCI Machine Learning Repository About 2 million instances, 20MB (compressed) in size.

This archive contains 2075259 measurements gathered between December 2006 and November 2010 (47 months).
Notes:

1. (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3.

2. The dataset contains some missing values in the measurements (nearly 1,25% of the rows). All calendar timestamps are present in the dataset but for some timestamps, the measurement values are missing: a missing value is represented by the absence of value between two consecutive semi-colon attribute separators. For instance, the dataset shows missing values on April 28, 2007.

### Attribute Information:
1. date: Date in format dd/mm/yyyy
2. time: time in format hh:mm:ss
3. global_active_power: household global minute-averaged active power (in kilowatt)
4. global_reactive_power: household global minute-averaged reactive power (in kilowatt)
5. voltage: minute-averaged voltage (in volt)
6. global_intensity: household global minute-averaged current intensity (in ampere)
7. sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy).
    - It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
8. sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). 
    - It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
9. sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy).
    - It corresponds to an electric water-heater and an air-conditioner.

In [5]:
import os
import sys

spark_path = 'C:/spark-2.4.3-bin-hadoop2.7'
os.environ['SPARK_HOME']= spark_path
os.environ['HADOOP_HOME']=spark_path
sys.path.append(spark_path+'/bin')
sys.path.append(spark_path+'/python')
sys.path.append(spark_path+'/python/pyspark')
sys.path.append(spark_path+'/python/lib')
sys.path.append(spark_path+'/python/lib/pyspark.zip')
sys.path.append(spark_path+'/python/lib/py4j-0.10.7-src.zip')

from pyspark import SparkContext
from pyspark import SparkConf

In [6]:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.functions import rand, randn

In [7]:
conf = SparkConf().setMaster("local[6]").setAppName("Household Power Consumption")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext=SQLContext(sc)

### Check Apache Spark environtment

In [8]:
sc

In [9]:
sqlContext

<pyspark.sql.context.SQLContext at 0x2141e0bd9e8>

### Load the dataset

In [9]:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true', sep=';').load('household_power_consumption.txt')
df.take(5)

[Row(Date='16/12/2006', Time='17:24:00', Global_active_power='4.216', Global_reactive_power='0.418', Voltage='234.840', Global_intensity='18.400', Sub_metering_1='0.000', Sub_metering_2='1.000', Sub_metering_3=17.0),
 Row(Date='16/12/2006', Time='17:25:00', Global_active_power='5.360', Global_reactive_power='0.436', Voltage='233.630', Global_intensity='23.000', Sub_metering_1='0.000', Sub_metering_2='1.000', Sub_metering_3=16.0),
 Row(Date='16/12/2006', Time='17:26:00', Global_active_power='5.374', Global_reactive_power='0.498', Voltage='233.290', Global_intensity='23.000', Sub_metering_1='0.000', Sub_metering_2='2.000', Sub_metering_3=17.0),
 Row(Date='16/12/2006', Time='17:27:00', Global_active_power='5.388', Global_reactive_power='0.502', Voltage='233.740', Global_intensity='23.000', Sub_metering_1='0.000', Sub_metering_2='1.000', Sub_metering_3=17.0),
 Row(Date='16/12/2006', Time='17:28:00', Global_active_power='3.666', Global_reactive_power='0.528', Voltage='235.680', Global_inten

### Take 5 sample from dataset

In [10]:
df.take(5)

[Row(Date='16/12/2006', Time='17:24:00', Global_active_power='4.216', Global_reactive_power='0.418', Voltage='234.840', Global_intensity='18.400', Sub_metering_1='0.000', Sub_metering_2='1.000', Sub_metering_3=17.0),
 Row(Date='16/12/2006', Time='17:25:00', Global_active_power='5.360', Global_reactive_power='0.436', Voltage='233.630', Global_intensity='23.000', Sub_metering_1='0.000', Sub_metering_2='1.000', Sub_metering_3=16.0),
 Row(Date='16/12/2006', Time='17:26:00', Global_active_power='5.374', Global_reactive_power='0.498', Voltage='233.290', Global_intensity='23.000', Sub_metering_1='0.000', Sub_metering_2='2.000', Sub_metering_3=17.0),
 Row(Date='16/12/2006', Time='17:27:00', Global_active_power='5.388', Global_reactive_power='0.502', Voltage='233.740', Global_intensity='23.000', Sub_metering_1='0.000', Sub_metering_2='1.000', Sub_metering_3=17.0),
 Row(Date='16/12/2006', Time='17:28:00', Global_active_power='3.666', Global_reactive_power='0.528', Voltage='235.680', Global_inten

In [11]:
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Global_active_power: string (nullable = true)
 |-- Global_reactive_power: string (nullable = true)
 |-- Voltage: string (nullable = true)
 |-- Global_intensity: string (nullable = true)
 |-- Sub_metering_1: string (nullable = true)
 |-- Sub_metering_2: string (nullable = true)
 |-- Sub_metering_3: double (nullable = true)



In [12]:
df2 = df

### check the types of each columns

In [13]:
df2.dtypes

[('Date', 'string'),
 ('Time', 'string'),
 ('Global_active_power', 'string'),
 ('Global_reactive_power', 'string'),
 ('Voltage', 'string'),
 ('Global_intensity', 'string'),
 ('Sub_metering_1', 'string'),
 ('Sub_metering_2', 'string'),
 ('Sub_metering_3', 'double')]

In [14]:
from pyspark.sql.types import IntegerType, DateType, DoubleType

### Import pyspark.sql for change data type

In [15]:
#Change String types format to DoubleType "Global_active_power","Global_reactive_power","Voltage","Global_intensity"
df3 = df2.withColumn("Global_active_power", df2["Global_active_power"].cast(DoubleType()))
df4 = df3.withColumn("Global_reactive_power", df3["Global_reactive_power"].cast(DoubleType()))
df5 = df4.withColumn("Voltage", df4["Voltage"].cast(DoubleType()))
df6 = df5.withColumn("Global_intensity", df5["Global_intensity"].cast(DoubleType()))

### Check datafreame types after changed 

In [16]:
df6.dtypes

[('Date', 'string'),
 ('Time', 'string'),
 ('Global_active_power', 'double'),
 ('Global_reactive_power', 'double'),
 ('Voltage', 'double'),
 ('Global_intensity', 'double'),
 ('Sub_metering_1', 'string'),
 ('Sub_metering_2', 'string'),
 ('Sub_metering_3', 'double')]

### Check schema dataframe

In [17]:
df6.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Global_active_power: double (nullable = true)
 |-- Global_reactive_power: double (nullable = true)
 |-- Voltage: double (nullable = true)
 |-- Global_intensity: double (nullable = true)
 |-- Sub_metering_1: string (nullable = true)
 |-- Sub_metering_2: string (nullable = true)
 |-- Sub_metering_3: double (nullable = true)



### Change missing values 

In [18]:
df6.fillna(0)

DataFrame[Date: string, Time: string, Global_active_power: double, Global_reactive_power: double, Voltage: double, Global_intensity: double, Sub_metering_1: string, Sub_metering_2: string, Sub_metering_3: double]

### Show the Dataset

In [19]:
df6.show(20)

+----------+--------+-------------------+---------------------+-------+----------------+--------------+--------------+--------------+
|      Date|    Time|Global_active_power|Global_reactive_power|Voltage|Global_intensity|Sub_metering_1|Sub_metering_2|Sub_metering_3|
+----------+--------+-------------------+---------------------+-------+----------------+--------------+--------------+--------------+
|16/12/2006|17:24:00|              4.216|                0.418| 234.84|            18.4|         0.000|         1.000|          17.0|
|16/12/2006|17:25:00|               5.36|                0.436| 233.63|            23.0|         0.000|         1.000|          16.0|
|16/12/2006|17:26:00|              5.374|                0.498| 233.29|            23.0|         0.000|         2.000|          17.0|
|16/12/2006|17:27:00|              5.388|                0.502| 233.74|            23.0|         0.000|         1.000|          17.0|
|16/12/2006|17:28:00|              3.666|                0.528

### Using library of pyspark to anaysis minimum, maximum, mean and standar deviation

In [20]:
from pyspark.sql.functions import min,max,mean,stddev,count

### Output the minimum, maximum and count of the columns 'Global_active_power'

In [21]:
print ("Show output the minimum, maximum and count of the columns 'Global_active_power'")
df6.select([min("Global_active_power").alias("Minimum Global Active Power"),
           max("Global_active_power").alias("Maximum Global Active Power"),
           count("Global_active_power").alias("Count Global Active Power")]).show()

Show output the minimum, maximum and count of the columns 'Global_active_power'
+---------------------------+---------------------------+-------------------------+
|Minimum Global Active Power|Maximum Global Active Power|Count Global Active Power|
+---------------------------+---------------------------+-------------------------+
|                      0.076|                     11.122|                  2049280|
+---------------------------+---------------------------+-------------------------+



In [22]:
print ("Count each value of Global Active Power")
df6.groupBy("Global_active_power").count().show(n = 30)

Count each value of Global Active Power
+-------------------+-----+
|Global_active_power|count|
+-------------------+-----+
|               3.26|  163|
|              2.784|  273|
|              3.456|  147|
|              1.462| 2310|
|               0.66|  802|
|              2.712|  325|
|              4.132|   86|
|              0.204| 4846|
|              1.882|  667|
|              7.818|    3|
|              0.526| 1431|
|              0.194| 2748|
|              2.952|  248|
|              2.132|  441|
|               2.86|  238|
|              4.318|   82|
|              4.592|   33|
|              4.752|   42|
|              10.65|    1|
|              5.114|   31|
|              0.134| 1394|
|              2.206|  412|
|              3.174|  176|
|              0.262| 6188|
|               1.82|  690|
|              2.802|  252|
|              1.246|  938|
|              1.642| 1053|
|              0.772|  609|
|              2.808|  260|
+-------------------+-----+
only sho

### Output the minimum, maximum and count of the columns 'Global_reactive_power'

In [23]:
print ("Show output the minimum, maximum and count of the columns 'Global_reactive_power'")
df6.select([min("Global_reactive_power").alias("Minimum Global Reactive Power"),
           max("Global_reactive_power").alias("Maximum Global Reactive Power"),
           count("Global_reactive_power").alias("Count Global Reactive Power")]).show()

Show output the minimum, maximum and count of the columns 'Global_reactive_power'
+-----------------------------+-----------------------------+---------------------------+
|Minimum Global Reactive Power|Maximum Global Reactive Power|Count Global Reactive Power|
+-----------------------------+-----------------------------+---------------------------+
|                          0.0|                         1.39|                    2049280|
+-----------------------------+-----------------------------+---------------------------+



In [24]:
print ("Count each value of Global Reactive Power")
df6.groupBy("Global_reactive_power").count().show(n = 30)

Count each value of Global Reactive Power
+---------------------+------+
|Global_reactive_power| count|
+---------------------+------+
|                0.134| 10037|
|                0.194|  9231|
|                0.204| 10551|
|                 0.66|    50|
|                0.526|   253|
|                0.262|  4421|
|                0.772|    22|
|                 0.07| 18826|
|                0.716|    37|
|                0.744|    23|
|                 0.84|    13|
|                0.886|     3|
|                0.302|  2639|
|                0.396|  1101|
|                0.394|  1103|
|                0.572|   166|
|                0.894|     7|
|                  0.0|481561|
|                0.234|  8937|
|                0.494|   363|
|                0.524|   281|
|                0.878|     7|
|                 0.87|    11|
|                0.058| 18127|
|                0.516|   281|
|                0.978|     3|
|                0.714|    39|
|                0.576|   16

### Output the minimum, maximum and count of the columns 'Voltage'

In [25]:
print ("Show output the minimum, maximum and count of the columns 'Voltage'")
df6.select([min("Voltage").alias("Minimum Voltage"),
           max("Voltage").alias("Maximum Voltage"),
          count("Voltage").alias("Count Voltage")]).show()

Show output the minimum, maximum and count of the columns 'Voltage'
+---------------+---------------+-------------+
|Minimum Voltage|Maximum Voltage|Count Voltage|
+---------------+---------------+-------------+
|          223.2|         254.15|      2049280|
+---------------+---------------+-------------+



In [26]:
print ("Count each value of Voltage")
df6.groupBy("Voltage").count().show(n = 30)

Count each value of Voltage
+-------+-----+
|Voltage|count|
+-------+-----+
| 239.49| 2483|
| 242.52| 2732|
|  234.9|  399|
| 239.34| 2349|
|  244.8|  810|
| 244.43| 1333|
| 238.14| 1556|
| 233.28|  188|
| 231.13|   66|
|  232.9|  210|
| 229.78|   34|
| 227.05|    9|
|  229.6|   23|
| 234.29|  396|
| 243.16| 1635|
| 245.13|  932|
|  239.1| 2223|
| 242.99| 2374|
| 237.03|  804|
| 234.22|  387|
| 248.73|  138|
| 229.89|   38|
| 229.87|   37|
| 249.51|   79|
| 226.45|    7|
| 224.58|    1|
| 232.99|  230|
| 236.75|  979|
| 247.26|  336|
|  243.6| 1943|
+-------+-----+
only showing top 30 rows



### Output the minimum, maximum and count of the columns 'Global_intensity'

In [27]:
print ("Show output the minimum, maximum and count the columns of 'Global_intensity'")
df6.select([min("Global_intensity").alias("Minimum Global Intensity "),
          max("Global_intensity").alias("Maximum Global Intensity"),
          count("Global_intensity").alias("Count Global Intensity")]).show()

Show output the minimum, maximum and count the columns of 'Global_intensity'
+-------------------------+------------------------+----------------------+
|Minimum Global Intensity |Maximum Global Intensity|Count Global Intensity|
+-------------------------+------------------------+----------------------+
|                      0.2|                    48.4|               2049280|
+-------------------------+------------------------+----------------------+



In [28]:
print ("Count each value of Global Intesity")
df6.groupBy("Global_intensity").count().show(n =30)

Count each value of Global Intesity
+----------------+------+
|Global_intensity| count|
+----------------+------+
|            13.4|  3915|
|            15.4|  4306|
|             2.4| 39360|
|            10.2|  9571|
|             8.0| 14094|
|            26.4|   247|
|            46.4|     2|
|            11.4|  7014|
|             5.4| 53373|
|            16.6|  2668|
|            23.8|   503|
|             7.0| 22536|
|            37.4|     7|
|            36.2|    12|
|            35.6|    15|
|            11.6|  6441|
|            26.6|   242|
|            25.2|   434|
|            31.6|    62|
|             0.2| 11081|
|             6.6| 29201|
|            28.8|    96|
|            21.4|   835|
|            26.8|   244|
|             1.4|164720|
|            29.0|   107|
|            20.2|  1023|
|            15.8|  3531|
|             7.4| 18272|
|            42.0|     8|
+----------------+------+
only showing top 30 rows



### Output the mean and standar deviation 'Global_active_power'

In [29]:
print ("Show the output mean and standar deviation of 'Global_active_power'")
df6.select([mean("Global_active_power").alias("Mean Global Active Power"),
           stddev("Global_active_power").alias("Standar Deviation Global Active Power")]).show()

Show the output mean and standar deviation of 'Global_active_power'
+------------------------+-------------------------------------+
|Mean Global Active Power|Standar Deviation Global Active Power|
+------------------------+-------------------------------------+
|       1.091615036500528|                   1.0572941610939885|
+------------------------+-------------------------------------+



### Output the mean and standar deviation 'Global_reactive_power'

In [30]:
print ("Show the output mean and standar deviation of  'Global_reactive_power'")
df6.select([mean("Global_reactive_power").alias("Mean Global Reactive Power"),
           stddev("Global_reactive_power").alias("Standar Global Reactive power")]).show()

Show the output mean and standar deviation of  'Global_reactive_power'
+--------------------------+-----------------------------+
|Mean Global Reactive Power|Standar Global Reactive power|
+--------------------------+-----------------------------+
|       0.12371447630387077|          0.11272197955071597|
+--------------------------+-----------------------------+



### Output the mean and standar deviation 'Voltage'

In [31]:
print ("Show the output mean and standar deviation of 'Voltage'")
df6.select([mean("Voltage").alias("Mean Voltage"),
           stddev("Voltage").alias("Standar Deviation Voltage")]).show()

Show the output mean and standar deviation of 'Voltage'
+------------------+-------------------------+
|      Mean Voltage|Standar Deviation Voltage|
+------------------+-------------------------+
|240.83985797450788|        3.239986679009753|
+------------------+-------------------------+



### Output the mean and standar deviation 'Global_intensity'

In [32]:
print ("Show the output mean and standar deviation of 'Global_intensity'")
df6.select([mean("Global_intensity").alias("Mean Global Intensity"),
           stddev("Global_intensity").alias("Standar Deviation Global Intensity")]).show()

Show the output mean and standar deviation of 'Global_intensity'
+---------------------+----------------------------------+
|Mean Global Intensity|Standar Deviation Global Intensity|
+---------------------+----------------------------------+
|    4.627759310587014|                 4.444396259786142|
+---------------------+----------------------------------+



### Show Min, Max, Mean,Count From seleted Columns

In [33]:
df6.describe(["Global_active_power","Global_reactive_power","Voltage","Global_intensity"]).show()

+-------+-------------------+---------------------+------------------+-----------------+
|summary|Global_active_power|Global_reactive_power|           Voltage| Global_intensity|
+-------+-------------------+---------------------+------------------+-----------------+
|  count|            2049280|              2049280|           2049280|          2049280|
|   mean|  1.091615036500528|  0.12371447630387077|240.83985797450788|4.627759310587014|
| stddev| 1.0572941610939885|  0.11272197955071597| 3.239986679009753|4.444396259786142|
|    min|              0.076|                  0.0|             223.2|              0.2|
|    max|             11.122|                 1.39|            254.15|             48.4|
+-------+-------------------+---------------------+------------------+-----------------+



### Show the Summary data with arrange Date, Time

In [34]:
from pyspark.sql import functions as F
df6.groupBy("Date","Time").agg(F.sum("Global_active_power").alias("Global Active Power"),
                       F.sum("Global_reactive_power").alias("Global Reactive Power"),
                       F.sum("Voltage").alias("Voltage"),
                       F.sum("Global_intensity").alias("Global Intensity")).show(n = 30)

+----------+--------+-------------------+---------------------+-------+----------------+
|      Date|    Time|Global Active Power|Global Reactive Power|Voltage|Global Intensity|
+----------+--------+-------------------+---------------------+-------+----------------+
|16/12/2006|19:59:00|              3.214|                0.078| 232.66|            13.8|
|16/12/2006|20:27:00|              3.258|                0.076| 234.57|            13.8|
|16/12/2006|21:40:00|               2.36|                0.064| 236.89|            10.8|
|17/12/2006|03:19:00|              0.424|                  0.0| 244.88|             3.2|
|17/12/2006|04:39:00|              3.412|                0.052| 242.49|            14.0|
|17/12/2006|09:44:00|              0.338|                0.076| 241.21|             1.4|
|17/12/2006|13:07:00|              1.712|                0.344| 236.85|             7.2|
|17/12/2006|14:37:00|              2.118|                 0.25| 242.89|             8.6|
|17/12/2006|14:53:00|

### (3) Perform min-max normalization on the columns to generate normalized output

The min-max technique is based on the  mmin and amx values of the attribute as follows.
Normalize values will be between 0 and 1

Vn = (v - vmin) / (vmax - vmin) where

vn = normalized value
<br>v = original value
<br>vmin = minimum value
<br>vmax = maximum value

In [35]:
from pyspark.sql.functions import col

### Min-Max Normalization Global Active Power

In [36]:
print ("Min-Max Normalization Global Active Power")
(df6.select(min("Global_active_power").alias("MIN_Global_active_power"),
            max("Global_active_power").alias("MAX_Global_active_power")).crossJoin(df6).withColumn("Min-Max_Normalization",(col("Global_active_power") - col("MIN_Global_active_power")) / (col("MAX_Global_active_power") - col("MIN_Global_active_power")))).select("Global_active_power",
            "MIN_Global_active_power",
            "MAX_Global_active_power",
            "Min-Max_Normalization").show()

Min-Max Normalization Global Active Power
+-------------------+-----------------------+-----------------------+---------------------+
|Global_active_power|MIN_Global_active_power|MAX_Global_active_power|Min-Max_Normalization|
+-------------------+-----------------------+-----------------------+---------------------+
|              4.216|                  0.076|                 11.122|   0.3747963063552418|
|               5.36|                  0.076|                 11.122|   0.4783632084012313|
|              5.374|                  0.076|                 11.122|   0.4796306355241717|
|              5.388|                  0.076|                 11.122|  0.48089806264711216|
|              3.666|                  0.076|                 11.122|   0.3250045265254391|
|               3.52|                  0.076|                 11.122|    0.311787072243346|
|              3.702|                  0.076|                 11.122|   0.3282636248415716|
|                3.7|                 

In [37]:
sc

### Min-Max Normalization Global Reactive Power

In [38]:
print ("Min-Max Normalization Global Reactive Power")
(df6.select(min("Global_reactive_power").alias("MIN_Global_Reactive_power"),
            max("Global_reactive_power").alias("MAX_Global_Reactive_power")).crossJoin(df6).withColumn("Min-Max_Normalization",(col("Global_reactive_power") - col("MIN_Global_Reactive_power")) / (col("MAX_Global_Reactive_power") - col("MIN_Global_Reactive_power")))).select("Global_Reactive_power",
            "MIN_Global_Reactive_power",
            "MAX_Global_Reactive_power",
            "Min-Max_Normalization").show()

Min-Max Normalization Global Reactive Power
+---------------------+-------------------------+-------------------------+---------------------+
|Global_Reactive_power|MIN_Global_Reactive_power|MAX_Global_Reactive_power|Min-Max_Normalization|
+---------------------+-------------------------+-------------------------+---------------------+
|                0.418|                      0.0|                     1.39|  0.30071942446043165|
|                0.436|                      0.0|                     1.39|  0.31366906474820144|
|                0.498|                      0.0|                     1.39|  0.35827338129496406|
|                0.502|                      0.0|                     1.39|   0.3611510791366907|
|                0.528|                      0.0|                     1.39|   0.3798561151079137|
|                0.522|                      0.0|                     1.39|  0.37553956834532376|
|                 0.52|                      0.0|                     1.39

### Min-Max Normalization Voltage

In [39]:
print ("Min-Max Normalization Voltage")
(df6.select(min("Voltage").alias("MIN_Voltage"),
            max("Voltage").alias("MAX_Voltage")).crossJoin(df6).withColumn("Min-Max_Normalization",(col("Voltage") - col("MIN_Voltage")) / (col("MAX_Voltage") - col("MIN_Voltage")))).select("Voltage",
            "MIN_Voltage",
            "MAX_Voltage",
            "Min-Max_Normalization").show()

Min-Max Normalization Voltage
+-------+-----------+-----------+---------------------+
|Voltage|MIN_Voltage|MAX_Voltage|Min-Max_Normalization|
+-------+-----------+-----------+---------------------+
| 234.84|      223.2|     254.15|    0.376090468497577|
| 233.63|      223.2|     254.15|  0.33699515347334413|
| 233.29|      223.2|     254.15|  0.32600969305331173|
| 233.74|      223.2|     254.15|   0.3405492730210021|
| 235.68|      223.2|     254.15|   0.4032310177705981|
| 235.02|      223.2|     254.15|   0.3819063004846531|
| 235.09|      223.2|     254.15|   0.3841680129240713|
| 235.22|      223.2|     254.15|  0.38836833602584825|
| 233.99|      223.2|     254.15|   0.3486268174474964|
| 233.86|      223.2|     254.15|  0.34442649434571954|
| 232.86|      223.2|     254.15|  0.31211631663974215|
| 232.78|      223.2|     254.15|  0.30953150242326355|
| 232.99|      223.2|     254.15|   0.3163166397415191|
| 232.91|      223.2|     254.15|   0.3137318255250405|
| 235.24|      223

### Min-Max Normalization Global Intensity

In [40]:
print ("Min-Max Normalization Global Intesity")
(df6.select(min("Global_intensity").alias("MIN_Global_intensity"),
            max("Global_intensity").alias("MAX_Global_intensity")).crossJoin(df6).withColumn("Min-Max_Normalization",(col("Global_reactive_power") - col("MIN_Global_Intensity")) / (col("MAX_Global_Intensity") - col("MIN_Global_Intensity")))).select("Global_intensity",
            "MIN_Global_Intensity",
            "MAX_Global_Intensity",
            "Min-Max_Normalization").show()

Min-Max Normalization Global Intesity
+----------------+--------------------+--------------------+---------------------+
|Global_intensity|MIN_Global_Intensity|MAX_Global_Intensity|Min-Max_Normalization|
+----------------+--------------------+--------------------+---------------------+
|            18.4|                 0.2|                48.4| 0.004522821576763486|
|            23.0|                 0.2|                48.4| 0.004896265560165976|
|            23.0|                 0.2|                48.4| 0.006182572614107884|
|            23.0|                 0.2|                48.4| 0.006265560165975104|
|            15.8|                 0.2|                48.4| 0.006804979253112034|
|            15.0|                 0.2|                48.4| 0.006680497925311204|
|            15.8|                 0.2|                48.4| 0.006639004149377594|
|            15.8|                 0.2|                48.4| 0.006639004149377594|
|            15.8|                 0.2|          