For a dataset of your choice, process the data and build a predictive model using pySpark.

(If you have trouble choosing a dataset, use winequality-red.csv.)

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor

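Import the required tools: the Spark session, the feature assembler, the regression evaluator, and the random forest regressor.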

In [2]:
session = SparkSession.builder.appName("WineQuality").getOrCreate()

Create the Spark session.


In [6]:
data = session.read.format("csv").option("header", "true")\
.option("inferSchema", "true").option("delimiter", ";").load("winequality-red.csv")

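Load the data. The file uses ";" as the delimiter, the header row supplies the column names, and inferSchema lets Spark detect the column types.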

In [7]:
data.show(10)

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|
|         11.2|            0.28|       0.56|           1.9|    0.075|               17.0|           

Preview the data.

In [8]:
data.printSchema()

root
 |-- fixed acidity: double (nullable = true)
 |-- volatile acidity: double (nullable = true)
 |-- citric acid: double (nullable = true)
 |-- residual sugar: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- free sulfur dioxide: double (nullable = true)
 |-- total sulfur dioxide: double (nullable = true)
 |-- density: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- quality: integer (nullable = true)



Inspect the schema: every feature column is a double, and the target column, quality, is an integer.

In [9]:
assembler = VectorAssembler(inputCols=data.columns[:-1], outputCol="features")

In [10]:
new_data = assembler.transform(data)

In [11]:
new_data.show(5)

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+--------------------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|            features|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+--------------------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|[7.4,0.7,0.0,1.9,...|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|[7.8,0.88,0.0,2.6...|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|    

Build a single vector column, features, from every column except the last one (quality), for use in training.
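To make the assembling step concrete, here is a minimal pure-Python sketch of the idea: the values of the chosen input columns are collected, in order, into one feature vector per row. The helper assemble and the sample row are hypothetical and only illustrate the concept; they are not part of the pySpark API and do not reproduce how VectorAssembler is implemented.

```python
# Conceptual illustration only: `assemble` is a hypothetical helper,
# not a pySpark function.
def assemble(row, input_cols):
    """Collect the values of input_cols, in order, into one feature list."""
    return [float(row[col]) for col in input_cols]


# A shortened, made-up row with three feature columns and the label last.
row = {
    "fixed acidity": 7.4,
    "volatile acidity": 0.7,
    "citric acid": 0.0,
    "quality": 5,
}

# Same idea as data.columns[:-1]: every column except the last one (the label).
feature_cols = list(row)[:-1]
features = assemble(row, feature_cols)
print(features)  # [7.4, 0.7, 0.0]
```

Slicing off the last column works here only because the label happens to be the final column; for a dataset with a different layout, the feature columns would have to be listed explicitly.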

In [15]:
data = new_data.select("features", "quality")

Keep only the columns needed for training: the feature vector and the label.

In [16]:
data.show(5)

+--------------------+-------+
|            features|quality|
+--------------------+-------+
|[7.4,0.7,0.0,1.9,...|      5|
|[7.8,0.88,0.0,2.6...|      5|
|[7.8,0.76,0.04,2....|      5|
|[11.2,0.28,0.56,1...|      6|
|[7.4,0.7,0.0,1.9,...|      5|
+--------------------+-------+
only showing top 5 rows



In [17]:
train_data, test_data = data.randomSplit([0.75, 0.25], seed=42)

Split the data into training (75%) and test (25%) sets; the fixed seed makes the split reproducible.
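A weighted random split of this kind assigns rows to the splits at random, so the resulting sizes are close to 75/25 but usually not exact. The pure-Python sketch below illustrates that behaviour under simplifying assumptions: random_split is a hypothetical helper, the 1000 integer "rows" are made up, and Spark's own sampling differs in its details.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to the corresponding weight (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of the normalized weights, e.g. [0.75, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point rounding leftovers.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in for DataFrame rows
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))  # roughly 750 and 250, not exactly
```

Every row lands in exactly one split, and repeating the call with the same seed gives the same partition, which is the reason for fixing the seed in the notebook cell above.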

In [18]:
model = RandomForestRegressor(featuresCol="features", labelCol="quality")
model = model.fit(train_data)

Train a random forest regressor on the training set.

In [20]:
y_pred = model.transform(test_data)

Get predictions for the test set.

In [25]:
evaluator = RegressionEvaluator(labelCol="quality", metricName="mae")
mae = evaluator.evaluate(y_pred)

Evaluate the model on the test set using the mean absolute error (MAE); the value is printed in the next cell.


In [26]:
print(f"Mean absolute error: {mae}")

Mean absolute error: 0.4946027831357235
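For reference, the mean absolute error is the average of |label - prediction| over all rows, so a value near 0.49 means the predictions are off by about half a quality point on average. The sketch below computes the metric by hand on a few made-up values; mean_absolute_error is a hypothetical helper and the numbers are not taken from the model above.

```python
def mean_absolute_error(labels, predictions):
    """Average absolute difference between labels and predictions."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


# Made-up labels and predictions, for illustration only.
labels = [5, 5, 6, 5]
predictions = [5.5, 5.0, 5.0, 6.0]

example_mae = mean_absolute_error(labels, predictions)
print(example_mae)  # (0.5 + 0.0 + 1.0 + 1.0) / 4 = 0.625
```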
