## PySpark with Python: Introduction to PySpark MLlib

In [1]:
from pyspark.sql import SparkSession

In [2]:
ss = SparkSession.builder.appName("MLlib").getOrCreate()

In [6]:
df_spark = ss.read.csv('test2.csv',header=True,inferSchema=True)

Here we are going to create a model that predicts salary based on age and experience. This is a simple regression problem.

In [7]:
df_spark.show()

+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
|   Aman| 22|        10|100000|
|Anshita| 25|         8| 90000|
|   John| 27|         4| 80000|
|   zara| 23|         4| 70000|
|    Sam| 21|         3| 50000|
| Justin| 22|         6| 80000|
+-------+---+----------+------+



In [8]:
df_spark.columns

['Name', 'Age', 'Experience', 'Salary']

In [11]:
from pyspark.ml.feature import VectorAssembler
# VectorAssembler groups the independent (input) columns into a single vector column
featureAssembler = VectorAssembler(inputCols=['Age', 'Experience'], outputCol='Independent feature')

In [12]:
output = featureAssembler.transform(df_spark)

In [13]:
output.show()

+-------+---+----------+------+-------------------+
|   Name|Age|Experience|Salary|Independent feature|
+-------+---+----------+------+-------------------+
|   Aman| 22|        10|100000|        [22.0,10.0]|
|Anshita| 25|         8| 90000|         [25.0,8.0]|
|   John| 27|         4| 80000|         [27.0,4.0]|
|   zara| 23|         4| 70000|         [23.0,4.0]|
|    Sam| 21|         3| 50000|         [21.0,3.0]|
| Justin| 22|         6| 80000|         [22.0,6.0]|
+-------+---+----------+------+-------------------+
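
To make the new column easier to read, here is a minimal plain-Python sketch of what the assembler does to each row: it takes the values of the input columns, in order, and packs them into one numeric vector. This is an illustration only, not Spark code; the `assemble` helper and the two sample rows (copied from the table above) are assumptions made for the example.

```python
# Plain-Python illustration of the per-row effect of VectorAssembler.
# The helper name `assemble` is hypothetical; it is not part of PySpark.
rows = [
    {"Name": "Aman", "Age": 22, "Experience": 10, "Salary": 100000},
    {"Name": "Sam", "Age": 21, "Experience": 3, "Salary": 50000},
]
input_cols = ["Age", "Experience"]


def assemble(row, cols):
    """Return the selected column values as a list of floats, in column order."""
    return [float(row[c]) for c in cols]


assembled = [assemble(r, input_cols) for r in rows]
print(assembled)  # [[22.0, 10.0], [21.0, 3.0]]
```

The order of `inputCols` matters: the first entry of each vector is Age and the second is Experience, which is also the order of the fitted coefficients later on.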



In [14]:
# 'Independent feature' is the input (features) column and 'Salary' is the output (label) column
finalized_data = output.select("Independent feature","Salary")
finalized_data.show()

+-------------------+------+
|Independent feature|Salary|
+-------------------+------+
|        [22.0,10.0]|100000|
|         [25.0,8.0]| 90000|
|         [27.0,4.0]| 80000|
|         [23.0,4.0]| 70000|
|         [21.0,3.0]| 50000|
|         [22.0,6.0]| 80000|
+-------------------+------+



### Train-test split

In [15]:
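from pyspark.ml.regression import LinearRegression

# Randomly split the rows, roughly 75% for training and 25% for testing.
# The split sizes are approximate; pass a seed to randomSplit for a repeatable split.
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])
reg = LinearRegression(featuresCol='Independent feature', labelCol='Salary')
# fit() returns a fitted model, so reg now refers to the model rather than the estimator
reg = reg.fit(train_data)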

In [16]:
reg.coefficients

DenseVector([2953.2164, 6008.7719])

In [17]:
reg.intercept

-26359.649122808278

In [18]:
res = reg.evaluate(test_data)

In [21]:
res.predictions.show()

+-------------------+------+-----------------+
|Independent feature|Salary|       prediction|
+-------------------+------+-----------------+
|         [23.0,4.0]| 70000|65599.41520467834|
+-------------------+------+-----------------+
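
The prediction above can be checked by hand, since a linear regression model predicts the intercept plus the dot product of the coefficients and the feature vector. The sketch below is plain Python, not Spark code. It uses the rounded coefficients printed by `reg.coefficients`, so the result should agree with Spark's value only to about two decimal places; the `predict` helper is a hypothetical name used for illustration.

```python
# Reproduce the model's prediction for the test row [23.0, 4.0] by hand.
coefficients = [2953.2164, 6008.7719]  # rounded values as printed by reg.coefficients
intercept = -26359.649122808278        # value printed by reg.intercept
features = [23.0, 4.0]                 # [Age, Experience] of the test row


def predict(features, coefficients, intercept):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


prediction = predict(features, coefficients, intercept)
print(round(prediction, 2))  # close to the 65599.415... shown in the table
```

Read this way, the fitted model says that each extra year of age adds roughly 2,953 to the predicted salary and each extra year of experience adds roughly 6,009, for this particular training split.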



In [22]:
res.meanAbsoluteError

4400.584795321658
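
The mean absolute error is the average of the absolute differences between the actual and predicted values. Because the test set here contains a single row, it reduces to one difference. The following plain-Python sketch recomputes it from the values shown in the predictions table; the `mean_absolute_error` helper is a hypothetical name for illustration, not a PySpark function.

```python
# Recompute the mean absolute error from the single test row.
labels = [70000.0]                 # actual Salary in the test set
predictions = [65599.41520467834]  # prediction shown by res.predictions


def mean_absolute_error(labels, predictions):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


mae = mean_absolute_error(labels, predictions)
print(mae)  # about 4400.58, matching res.meanAbsoluteError
```

With only one test row, this number says little about how well the model generalizes; a larger dataset would give a more meaningful estimate.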