# Wine

### Introduction:

This exercise is an adaptation of the UCI Wine dataset.
Its only purpose is to practice deleting data; the solutions here use PySpark rather than pandas.

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=4c4fdf1257dc3a546edab658e14d3304a6f7d6edf93c76ed41e4719b9c89c218
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType
# Import only col by name; other functions are reached through the F alias,
# which avoids shadowing Python built-ins such as sum, min and max.
from pyspark.sql.functions import col

In [3]:
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. Import the dataset from this [address](https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data).

In [4]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

--2024-04-15 21:30:09--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘wine.data’

wine.data               [ <=>                ]  10.53K  --.-KB/s    in 0s

2024-04-15 21:30:10 (94.8 MB/s) - ‘wine.data’ saved [10782]



### Step 3. Assign it to a variable called wine

In [44]:
wine = spark.read.csv('wine.data', sep=',', inferSchema=True)

In [45]:
wine.show(5)

+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|_c0|  _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8| _c9|_c10|_c11|_c12|_c13|
+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
|  1|14.23|1.71|2.43|15.6|127| 2.8|3.06|0.28|2.29|5.64|1.04|3.92|1065|
|  1| 13.2|1.78|2.14|11.2|100|2.65|2.76|0.26|1.28|4.38|1.05| 3.4|1050|
|  1|13.16|2.36|2.67|18.6|101| 2.8|3.24| 0.3|2.81|5.68|1.03|3.17|1185|
|  1|14.37|1.95| 2.5|16.8|113|3.85|3.49|0.24|2.18| 7.8|0.86|3.45|1480|
|  1|13.24|2.59|2.87|21.0|118| 2.8|2.69|0.39|1.82|4.32|1.04|2.93| 735|
+---+-----+----+----+----+---+----+----+----+----+----+----+----+----+
only showing top 5 rows



### Step 4. Delete the first, fourth, seventh, ninth, eleventh, thirteenth and fourteenth columns

In [46]:
to_del_col = [c for idx, c in enumerate(wine.columns) if idx in [0,3,6,8,10,12,13]]

In [47]:
to_del_col

['_c0', '_c3', '_c6', '_c8', '_c10', '_c12', '_c13']

In [48]:
wine = wine.drop(*to_del_col)

In [49]:
wine.show(5)

+-----+----+----+---+----+----+----+
|  _c1| _c2| _c4|_c5| _c7| _c9|_c11|
+-----+----+----+---+----+----+----+
|14.23|1.71|15.6|127|3.06|2.29|1.04|
| 13.2|1.78|11.2|100|2.76|1.28|1.05|
|13.16|2.36|18.6|101|3.24|2.81|1.03|
|14.37|1.95|16.8|113|3.49|2.18|0.86|
|13.24|2.59|21.0|118|2.69|1.82|1.04|
+-----+----+----+---+----+----+----+
only showing top 5 rows
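
The step names columns by 1-based position, while the list comprehension above works with 0-based indices. A minimal sketch of that translation follows, in plain Python so it runs without Spark. The helper name `columns_at_positions` and the stand-in column list are assumptions made for illustration, not part of the original notebook.

```python
# Hypothetical helper: map 1-based column positions to column names.
def columns_at_positions(columns, positions_1_based):
    # Convert "first, fourth, ..." into 0-based indices once, as a set.
    wanted = {p - 1 for p in positions_1_based}
    return [name for idx, name in enumerate(columns) if idx in wanted]

# Stand-in for wine.columns before the drop: 14 auto-named columns.
all_columns = [f"_c{i}" for i in range(14)]

to_drop = columns_at_positions(all_columns, [1, 4, 7, 9, 11, 13, 14])
print(to_drop)
```

The resulting list can be passed to `drop` in the same way as `to_del_col`.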



### Step 5. Assign the columns as below:

The attributes are (donated by Riccardo Leardi, riclea '@' anchem.unige.it):  
1) alcohol  
2) malic_acid  
3) alcalinity_of_ash  
4) magnesium  
5) flavanoids  
6) proanthocyanins  
7) hue

In [50]:
new_col_names=['alcohol', 'malic_acid', 'alcalinity_of_ash', 'magnesium', 'flavanoids', 'proanthocyanins', 'hue']
old_col_names = wine.columns

In [51]:
for idx, name in enumerate(new_col_names):
  wine = wine.withColumnRenamed(old_col_names[idx], name)

In [52]:
wine.show(5)

+-------+----------+-----------------+---------+----------+---------------+----+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins| hue|
+-------+----------+-----------------+---------+----------+---------------+----+
|  14.23|      1.71|             15.6|      127|      3.06|           2.29|1.04|
|   13.2|      1.78|             11.2|      100|      2.76|           1.28|1.05|
|  13.16|      2.36|             18.6|      101|      3.24|           2.81|1.03|
|  14.37|      1.95|             16.8|      113|      3.49|           2.18|0.86|
|  13.24|      2.59|             21.0|      118|      2.69|           1.82|1.04|
+-------+----------+-----------------+---------+----------+---------------+----+
only showing top 5 rows



### Step 6. Set the values of the first 3 rows from alcohol as NaN

In [53]:
from pyspark.sql import Window

In [54]:
wine = wine.withColumn('idx', F.monotonically_increasing_id())

In [55]:
wine.show(5)

+-------+----------+-----------------+---------+----------+---------------+----+---+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins| hue|idx|
+-------+----------+-----------------+---------+----------+---------------+----+---+
|  14.23|      1.71|             15.6|      127|      3.06|           2.29|1.04|  0|
|   13.2|      1.78|             11.2|      100|      2.76|           1.28|1.05|  1|
|  13.16|      2.36|             18.6|      101|      3.24|           2.81|1.03|  2|
|  14.37|      1.95|             16.8|      113|      3.49|           2.18|0.86|  3|
|  13.24|      2.59|             21.0|      118|      2.69|           1.82|1.04|  4|
+-------+----------+-----------------+---------+----------+---------------+----+---+
only showing top 5 rows



In [56]:
wine = wine.withColumn('alcohol', F.when(wine.idx.isin([0,1,2]), None).otherwise(wine.alcohol))

In [57]:
wine.show(5)

+-------+----------+-----------------+---------+----------+---------------+----+---+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins| hue|idx|
+-------+----------+-----------------+---------+----------+---------------+----+---+
|   NULL|      1.71|             15.6|      127|      3.06|           2.29|1.04|  0|
|   NULL|      1.78|             11.2|      100|      2.76|           1.28|1.05|  1|
|   NULL|      2.36|             18.6|      101|      3.24|           2.81|1.03|  2|
|  14.37|      1.95|             16.8|      113|      3.49|           2.18|0.86|  3|
|  13.24|      2.59|             21.0|      118|      2.69|           1.82|1.04|  4|
+-------+----------+-----------------+---------+----------+---------------+----+---+
only showing top 5 rows



### Step 7. Now set the value of the rows 3 and 4 of magnesium as NaN

In [58]:
wine = wine.withColumn('magnesium', F.when(wine.idx.isin([2,3]), None).otherwise(wine.magnesium))

In [59]:
wine.show(5)

+-------+----------+-----------------+---------+----------+---------------+----+---+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins| hue|idx|
+-------+----------+-----------------+---------+----------+---------------+----+---+
|   NULL|      1.71|             15.6|      127|      3.06|           2.29|1.04|  0|
|   NULL|      1.78|             11.2|      100|      2.76|           1.28|1.05|  1|
|   NULL|      2.36|             18.6|     NULL|      3.24|           2.81|1.03|  2|
|  14.37|      1.95|             16.8|     NULL|      3.49|           2.18|0.86|  3|
|  13.24|      2.59|             21.0|      118|      2.69|           1.82|1.04|  4|
+-------+----------+-----------------+---------+----------+---------------+----+---+
only showing top 5 rows



### Step 8. Fill the value of NaN with the number 10 in alcohol and 100 in magnesium

In [64]:
wine = wine.withColumn('magnesium', F.when(wine.magnesium.isNull(), 100).otherwise(wine.magnesium)).\
  withColumn('alcohol', F.when(wine.alcohol.isNull(), 10).otherwise(wine.alcohol))

In [65]:
wine.show(5)

+-------+----------+-----------------+---------+----------+---------------+----+---+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins| hue|idx|
+-------+----------+-----------------+---------+----------+---------------+----+---+
|   10.0|      1.71|             15.6|      127|      3.06|           2.29|1.04|  0|
|   10.0|      1.78|             11.2|      100|      2.76|           1.28|1.05|  1|
|   10.0|      2.36|             18.6|      100|      3.24|           2.81|1.03|  2|
|  14.37|      1.95|             16.8|      100|      3.49|           2.18|0.86|  3|
|  13.24|      2.59|             21.0|      118|      2.69|           1.82|1.04|  4|
+-------+----------+-----------------+---------+----------+---------------+----+---+
only showing top 5 rows



### Step 9. Count the number of missing values

In [66]:
wine.select([F.count(F.when(F.isnan(c) | col(c).isNull(), c)).alias(c) for c in wine.columns]).show()

+-------+----------+-----------------+---------+----------+---------------+---+---+
|alcohol|malic_acid|alcalinity_of_ash|magnesium|flavanoids|proanthocyanins|hue|idx|
+-------+----------+-----------------+---------+----------+---------------+---+---+
|      0|         0|                0|        0|         0|              0|  0|  0|
+-------+----------+-----------------+---------+----------+---------------+---+---+



### Step 10.  Create an array of 10 random numbers up until 10
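
A minimal sketch for this step, assuming "up until 10" means integers from 0 to 9 so that they can serve as row indices. The seed value is an arbitrary choice to make the run repeatable.

```python
import random

# Seeded generator so the same indices come out on every run.
rng = random.Random(42)

# Ten integers drawn from 0..9; duplicates are possible.
random_idx = [rng.randrange(10) for _ in range(10)]
print(random_idx)
```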

### Step 11. Use the random numbers you generated as an index and assign a NaN value to each of those cells.

### Step 12.  How many missing values do we have?

### Step 13. Delete the rows that contain missing values

### Step 14. Print only the non-null values in alcohol

### Step 15.  Reset the index, so it starts with 0 again

### BONUS: Create your own question and answer it.