# Generalized Linear Regression

Families (distribution of the response variable):
- 1. Gaussian: continuous response
- 2. Binomial: binary response
- 3. Poisson: discrete counts
- 4. Gamma: continuous, strictly positive response
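
For reference, all four families share the same general form. This is the standard textbook formulation of a generalized linear model, not something specific to this notebook; the symbols $g$ (link function), $\beta_0$ (intercept) and $\boldsymbol{\beta}$ (coefficients) are notation introduced here for illustration:

$$
g\big(\mathbb{E}[y \mid \mathbf{x}]\big) = \beta_0 + \boldsymbol{\beta}^\top \mathbf{x}
$$

With the Gaussian family and the identity link $g(\mu) = \mu$, this reduces to ordinary linear regression.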

# Hyperparameters

- link: the link function: identity, log, inverse, logit, probit, cloglog, or sqrt (the valid choices depend on the family)
- maxIter: maximum number of iterations (default=25)
- regParam: L2 regularization strength (default=0)
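
Not every link can be combined with every family. The sketch below records the combinations listed in the Spark ML documentation for `GeneralizedLinearRegression`, with the default link first in each list. The names `SUPPORTED_LINKS` and `check_link` are hypothetical helpers introduced only for this illustration; they are not part of PySpark.

```python
# Link functions accepted per family, as listed in the Spark ML documentation
# for GeneralizedLinearRegression (first entry = default link for that family).
SUPPORTED_LINKS = {
    "gaussian": ["identity", "log", "inverse"],
    "binomial": ["logit", "probit", "cloglog"],
    "poisson": ["log", "identity", "sqrt"],
    "gamma": ["inverse", "identity", "log"],
}


def check_link(family, link):
    """Return True if `link` is listed for `family` (hypothetical helper)."""
    return link in SUPPORTED_LINKS.get(family, [])


print(check_link("gaussian", "identity"))  # True
print(check_link("gaussian", "logit"))     # False
```

A check like this can be run before building the estimator, so an invalid combination is caught before `fit` is called.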

In [1]:
import findspark
findspark.init()  # locate the Spark installation before importing pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("generalized").getOrCreate()

In [2]:
carros = spark.read.csv("Carros.csv", header=True, inferSchema=True, sep=";")
carros.show(5)

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110|
|    187|        8|        360|            315| 344| 1702|        0|          0|      3|          2|175|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
only showing top 5 rows



In [3]:
from pyspark.ml.feature import RFormula
# RFormula pattern: label ~ feature1 + feature2 + ...
Rformula = RFormula(formula="HP ~ Consumo + Cilindros + Cilindradas", featuresCol="independente", labelCol="dependente")
carrosrf = Rformula.fit(carros).transform(carros)
carrosrf.select("independente","dependente").show(5,truncate=False)

+-----------------+----------+
|independente     |dependente|
+-----------------+----------+
|[21.0,6.0,160.0] |110.0     |
|[21.0,6.0,160.0] |110.0     |
|[228.0,4.0,108.0]|93.0      |
|[214.0,6.0,258.0]|110.0     |
|[187.0,8.0,360.0]|175.0     |
+-----------------+----------+
only showing top 5 rows
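
To make the transformation above concrete, here is a minimal pure-Python sketch of what the formula does for purely numeric columns: it splits `label ~ a + b + c` into a label name and a list of feature names, then collects the matching values. The function `apply_formula` is a hypothetical helper written for this illustration. The real `RFormula` transformer also handles categorical columns and other formula operators, which this sketch ignores.

```python
def apply_formula(formula, row):
    """Split 'label ~ a + b + c' and pull the matching values from a dict row."""
    label_name, rhs = formula.split("~")
    feature_names = [name.strip() for name in rhs.split("+")]
    features = [float(row[name]) for name in feature_names]
    label = float(row[label_name.strip()])
    return features, label


# First row of the table printed by carros.show(5)
first_row = {"Consumo": 21, "Cilindros": 6, "Cilindradas": 160, "HP": 110}
features, label = apply_formula("HP ~ Consumo + Cilindros + Cilindradas", first_row)
print(features, label)  # [21.0, 6.0, 160.0] 110.0
```

The result matches the first row of the `independente` and `dependente` columns shown above.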



In [4]:
# Create train (70%) and test (30%) sets: randomSplit returns 2 DataFrames
CarrosTreino, CarrosTeste = carrosrf.randomSplit([0.7,0.3])
print("treino: "+ str(CarrosTreino.count()), "teste: " + str(CarrosTeste.count()))

treino: 22 teste: 10
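
The split sizes are only approximately proportional to the weights, because rows are assigned at random rather than by exact count. The sketch below illustrates the idea in plain Python. It is a conceptual model, not Spark's actual implementation, and `random_split` is a hypothetical helper. In PySpark itself, `randomSplit` accepts an optional `seed` argument that makes the split repeatable.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(32))  # stand-in for the 32 rows of the dataset
train, test = random_split(rows, 0.7, seed=1)
print(len(train), len(test))  # sizes vary with the seed, but always sum to 32
```

Re-running with the same seed returns the same partition, which is what makes an evaluation metric comparable between runs.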


In [5]:
from pyspark.ml.regression import GeneralizedLinearRegression
geral = GeneralizedLinearRegression(family="gaussian",featuresCol="independente", 
                                    labelCol="dependente", link="identity", maxIter = 1000, 
                                    regParam=0.08)
modelo = geral.fit(CarrosTreino)

In [6]:
previsao = modelo.transform(CarrosTeste)
previsao.select("independente","dependente","prediction").show()

+------------------+----------+------------------+
|      independente|dependente|        prediction|
+------------------+----------+------------------+
| [147.0,8.0,440.0]|     230.0|210.84287383486287|
|[173.0,8.0,2758.0]|     180.0|186.31865262537025|
| [181.0,6.0,225.0]|     105.0|145.93398773654377|
| [192.0,8.0,400.0]|     175.0|206.70332169830618|
| [214.0,4.0,121.0]|     109.0| 80.07611932427696|
| [214.0,6.0,258.0]|     110.0|142.30902707310548|
| [228.0,4.0,108.0]|      93.0| 78.79351079055644|
|  [273.0,4.0,79.0]|      66.0| 74.54996646047684|
| [304.0,4.0,951.0]|     113.0| 63.19402449608255|
| [339.0,4.0,711.0]|      65.0| 61.94917408505525|
+------------------+----------+------------------+



In [7]:
from pyspark.ml.evaluation import RegressionEvaluator
avaliar = RegressionEvaluator(labelCol="dependente", predictionCol="prediction", metricName="rmse")  # label column, prediction column, and metric
rmse = avaliar.evaluate(previsao)
print(rmse)

27.80817490087484
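
As a sanity check, the RMSE can be recomputed by hand from the ten rows printed in In [6]: take the square root of the mean squared difference between `dependente` and `prediction`. The sketch below uses only the values shown in that table; the variable names are chosen for this illustration.

```python
import math

# (dependente, prediction) pairs copied from the output of In [6]
pairs = [
    (230.0, 210.84287383486287),
    (180.0, 186.31865262537025),
    (105.0, 145.93398773654377),
    (175.0, 206.70332169830618),
    (109.0, 80.07611932427696),
    (110.0, 142.30902707310548),
    (93.0, 78.79351079055644),
    (66.0, 74.54996646047684),
    (113.0, 63.19402449608255),
    (65.0, 61.94917408505525),
]

squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
rmse = math.sqrt(sum(squared_errors) / len(pairs))
print(round(rmse, 3))  # 27.808
```

This agrees with the value returned by `RegressionEvaluator`, so on the test set the predicted HP is off by roughly 28 units in the root-mean-square sense.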
