<a href="https://colab.research.google.com/github/Geofgabriel/Spark-con-Python/blob/main/mlib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Getting started with MLlib and PySpark.

In [1]:
!pip install pyspark



In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('1st mlib project').getOrCreate()

In [3]:
ruta = "/content/drive/MyDrive/Colab Notebooks/FreeCodeCamp/test1.csv"
t = spark.read.csv(ruta,header=True,inferSchema=True)

In [4]:
t.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [5]:
t.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [6]:
t.columns

['Name', 'age', 'Experience', 'Salary']

In [7]:
# Note that Spark works a little differently. Next, we group the
# features we want to use to predict the salary.

# For this we use VectorAssembler.

In [8]:
from pyspark.ml.feature import VectorAssembler
featuresassembler = VectorAssembler(inputCols=["age","Experience"],
                                    outputCol="independent features")

In [9]:
output = featuresassembler.transform(t)
output.show()

+---------+---+----------+------+--------------------+
|     Name|age|Experience|Salary|independent features|
+---------+---+----------+------+--------------------+
|    Krish| 31|        10| 30000|         [31.0,10.0]|
|Sudhanshu| 30|         8| 25000|          [30.0,8.0]|
|    Sunny| 29|         4| 20000|          [29.0,4.0]|
|     Paul| 24|         3| 20000|          [24.0,3.0]|
|   Harsha| 21|         1| 15000|          [21.0,1.0]|
|  Shubham| 23|         2| 18000|          [23.0,2.0]|
+---------+---+----------+------+--------------------+
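
Conceptually, `VectorAssembler` packs the selected input columns of each row into a single numeric vector. The plain-Python sketch below mimics that on the first rows of the table above; the `assemble` helper and the dict-based rows are illustrative assumptions, not part of the PySpark API.

```python
# Plain-Python illustration of what VectorAssembler does per row.
# `assemble` is a hypothetical helper, not a PySpark function.

rows = [
    {"Name": "Krish", "age": 31, "Experience": 10, "Salary": 30000},
    {"Name": "Sudhanshu", "age": 30, "Experience": 8, "Salary": 25000},
    {"Name": "Sunny", "age": 29, "Experience": 4, "Salary": 20000},
]

def assemble(row, input_cols, output_col):
    """Return a copy of `row` with the input columns packed into one float vector."""
    out = dict(row)
    out[output_col] = [float(row[c]) for c in input_cols]
    return out

assembled = [assemble(r, ["age", "Experience"], "independent features") for r in rows]
for r in assembled:
    print(r["Name"], r["independent features"])
```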



In [10]:
finalized_data = output.select("independent features","Salary")

In [11]:
finalized_data.show()

+--------------------+------+
|independent features|Salary|
+--------------------+------+
|         [31.0,10.0]| 30000|
|          [30.0,8.0]| 25000|
|          [29.0,4.0]| 20000|
|          [24.0,3.0]| 20000|
|          [21.0,1.0]| 15000|
|          [23.0,2.0]| 18000|
+--------------------+------+



In [18]:
from pyspark.ml.regression import LinearRegression
# split into training and test sets

train_d, test_d = finalized_data.randomSplit([0.75, 0.25])  # weights: 75% and 25%
train_d.show()
#test_d.show()

+--------------------+------+
|independent features|Salary|
+--------------------+------+
|          [21.0,1.0]| 15000|
|          [24.0,3.0]| 20000|
|          [29.0,4.0]| 20000|
+--------------------+------+
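
The split above put three of the six rows in the training set rather than 75% of them. `randomSplit` assigns rows at random according to the weights, so the resulting sizes are only approximate, and on a tiny dataset they can be far from the nominal proportions. The sketch below is a simplified plain-Python model of that behavior (an assumption for illustration, not Spark's actual implementation); `random_split` and the seed values are hypothetical.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row independently to train with probability `train_weight`."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

# (age, Experience, Salary) for the six rows of the dataset
rows = [(31, 10, 30000), (30, 8, 25000), (29, 4, 20000),
        (24, 3, 20000), (21, 1, 15000), (23, 2, 18000)]

# Try several seeds: the training-set size can differ from seed to seed.
for seed in range(5):
    train, test = random_split(rows, 0.75, seed)
    print(f"seed={seed}: {len(train)} train rows, {len(test)} test rows")
```

In PySpark, `randomSplit` also accepts an optional `seed` argument, which makes the split reproducible across runs.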



In [19]:
regressor = LinearRegression(featuresCol='independent features',labelCol='Salary')
regressor = regressor.fit(train_d)

In [20]:
# we can inspect the coefficients:
regressor.coefficients

DenseVector([-714.2857, 3571.4286])

In [21]:
regressor.intercept

26428.57142857082
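
With three training rows and three unknowns (two coefficients and an intercept), an unregularized linear fit can pass exactly through the training points. As a cross-check, the sketch below solves that 3x3 system with Cramer's rule, using the training rows shown above. It assumes the regressor was fit without regularization (the `LinearRegression` default); `det3` and `replace_col` are illustrative helpers.

```python
from fractions import Fraction

# Training rows from train_d: (age, Experience, Salary)
train = [(21, 1, 15000), (24, 3, 20000), (29, 4, 20000)]

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def replace_col(matrix, col, values):
    """Copy of `matrix` with column `col` replaced by `values`."""
    return [[values[r] if c == col else matrix[r][c] for c in range(3)]
            for r in range(3)]

# Each row gives one equation: intercept*1 + w_age*age + w_exp*Experience = Salary
A = [[Fraction(1), Fraction(age), Fraction(exp)] for age, exp, _ in train]
y = [Fraction(salary) for _, _, salary in train]

# Cramer's rule: unknown k = det(A with column k replaced by y) / det(A)
det_A = det3(A)
intercept, w_age, w_exp = (det3(replace_col(A, k, y)) / det_A for k in range(3))

print(float(w_age), float(w_exp))
print(float(intercept))
```

The exact solution is -5000/7 and 25000/7 for the coefficients and 185000/7 for the intercept, which agrees with the values printed above. The model therefore reproduces its three training points exactly, which says little about how it behaves on unseen rows.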

In [22]:
# now we evaluate the model on the test set
pred_res = regressor.evaluate(test_d)

In [23]:
# and we can look at the predictions
pred_res.predictions.show()

+--------------------+------+-----------------+
|independent features|Salary|       prediction|
+--------------------+------+-----------------+
|          [23.0,2.0]| 18000|17142.85714285713|
|          [30.0,8.0]| 25000| 33571.4285714283|
|         [31.0,10.0]| 30000| 39999.9999999996|
+--------------------+------+-----------------+
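
Each prediction is the intercept plus the dot product of the coefficients with the feature vector. The sketch below reproduces the `prediction` column by hand, using the coefficient and intercept values printed earlier (the coefficients are rounded as displayed, so the results match only to a couple of decimals); `predict` is an illustrative helper.

```python
# Values copied from regressor.coefficients and regressor.intercept above.
coefficients = [-714.2857, 3571.4286]
intercept = 26428.57142857082

def predict(features):
    """Linear model: intercept + sum(coefficient * feature)."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Feature vectors [age, Experience] of the three test rows
test_features = [[23.0, 2.0], [30.0, 8.0], [31.0, 10.0]]
predictions = [predict(f) for f in test_features]
for f, p in zip(test_features, predictions):
    print(f, round(p, 2))
```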



As we can see, the predictions are quite poor, which is not surprising for a model trained on only three rows. Let's compute the errors:

In [24]:
pred_res.meanAbsoluteError, pred_res.meanSquaredError

(6476.190476190258, 58068027.21088016)
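
These two metrics can be reproduced directly from the predictions table: the mean absolute error averages |Salary - prediction| and the mean squared error averages the squared differences. A minimal sketch in plain Python, with the values copied from the output above; the root mean squared error is included only as an extra, since it is expressed in the same units as the salary.

```python
import math

# (Salary, prediction) pairs copied from pred_res.predictions above.
pairs = [
    (18000, 17142.85714285713),
    (25000, 33571.4285714283),
    (30000, 39999.9999999996),
]

errors = [label - pred for label, pred in pairs]
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                             # same units as Salary

print(mae, mse, rmse)
```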

Note: this is a basic example of linear regression with MLlib on a simple case.