### Examples of PySpark ML

In [2]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=fa5e18c85e788ca49769692c1d76ab86c8829d64b905da9328f827df67772a1d
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


In [3]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('Missing').getOrCreate()

In [4]:
# Read the dataset

training = spark.read.csv('test1.csv',header=True,inferSchema=True)

In [5]:
training.show()



+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [6]:
training.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [7]:
training.columns

['Name', 'age', 'Experience', 'Salary']

The age and Experience columns are combined into one new vector column, which serves as the independent feature for the model.

The VectorAssembler in PySpark is a feature transformer that combines multiple columns into a single vector column, which is often necessary for machine learning algorithms in Spark.

Breakdown:
- inputCols=["age", "Experience"]:

  The list of columns to combine into a single feature vector. Here, the age and Experience columns are combined.
- outputCol="Independent Features":

  The name of the output column that will contain the feature vector. The combined values of age and Experience are stored in this new column.

In [8]:
from pyspark.ml.feature import VectorAssembler
featureassembler=VectorAssembler(inputCols=["age","Experience"],outputCol="Independent Features")

Explanation:
- featureassembler.transform(training):

  This call applies the previously defined VectorAssembler to the training DataFrame. The transform() method combines the specified inputCols (here, ["age", "Experience"]) into a new vector column ("Independent Features").
- output:

  The DataFrame returned by the transformation is stored in the variable output. It has the same columns as the training DataFrame, plus an additional column ("Independent Features") that contains the feature vectors.

In [9]:
output=featureassembler.transform(training)

In [None]:
output.show()

+---------+---+----------+------+--------------------+
|     Name|age|Experience|Salary|Independent Features|
+---------+---+----------+------+--------------------+
|    Krish| 31|        10| 30000|         [31.0,10.0]|
|Sudhanshu| 30|         8| 25000|          [30.0,8.0]|
|    Sunny| 29|         4| 20000|          [29.0,4.0]|
|     Paul| 24|         3| 20000|          [24.0,3.0]|
|   Harsha| 21|         1| 15000|          [21.0,1.0]|
|  Shubham| 23|         2| 18000|          [23.0,2.0]|
+---------+---+----------+------+--------------------+
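
The "Independent Features" column above can be reproduced by hand to see what the assembler does to each row. The following is a plain-Python sketch, not Spark code: the rows are copied from the training.show() output, and the helper name assemble is hypothetical.

```python
# Plain-Python illustration of the row-wise effect of VectorAssembler.
# This does not use Spark; rows are copied from the training.show() output.
rows = [
    ("Krish", 31, 10, 30000),
    ("Sudhanshu", 30, 8, 25000),
    ("Sunny", 29, 4, 20000),
    ("Paul", 24, 3, 20000),
    ("Harsha", 21, 1, 15000),
    ("Shubham", 23, 2, 18000),
]

def assemble(row):
    """Hypothetical helper: append [age, Experience] as floats to the row."""
    name, age, experience, salary = row
    return (name, age, experience, salary, [float(age), float(experience)])

assembled = [assemble(r) for r in rows]
for r in assembled:
    print(r)
```

Each integer input is converted to a float, which is why the Spark output shows [31.0,10.0] rather than [31,10].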



In [None]:
output.columns

['Name', 'age', 'Experience', 'Salary', 'Independent Features']

In [10]:
finalized_data=output.select("Independent Features","Salary")

In [11]:
finalized_data.show()

+--------------------+------+
|Independent Features|Salary|
+--------------------+------+
|         [31.0,10.0]| 30000|
|          [30.0,8.0]| 25000|
|          [29.0,4.0]| 20000|
|          [24.0,3.0]| 20000|
|          [21.0,1.0]| 15000|
|          [23.0,2.0]| 18000|
+--------------------+------+



- featuresCol='Independent Features': Specifies the column containing the feature vectors (created by VectorAssembler).
- labelCol='Salary': Specifies the column containing the target variable (label) that the model will predict.
- fit(train_data): Trains the linear regression model on the train_data DataFrame and returns the fitted model.
- The result of fit() is assigned back to regressor, so after this step regressor holds the fitted model, ready to make predictions.

In [14]:
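from pyspark.ml.regression import LinearRegression
# Train/test split (75% / 25%); no seed is set, so the split varies between runs
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])
regressor = LinearRegression(featuresCol='Independent Features', labelCol='Salary')
# fit() returns the fitted model, which replaces the estimator in this variable
regressor = regressor.fit(train_data)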

In [15]:
### Coefficients
regressor.coefficients

DenseVector([1500.0, -0.0])

In [16]:
### Intercept
regressor.intercept

-16500.000000036445

In [17]:
### Evaluate on the test set (the summary holds predictions and error metrics)
pred_results=regressor.evaluate(test_data)

In [18]:
pred_results.predictions.show()

+--------------------+------+------------------+
|Independent Features|Salary|        prediction|
+--------------------+------+------------------+
|          [24.0,3.0]| 20000|19500.000000000764|
|          [29.0,4.0]| 20000|27000.000000007793|
|          [30.0,8.0]| 25000|28500.000000001943|
+--------------------+------+------------------+
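
The predictions above follow from the fitted coefficients and intercept: each one is the intercept plus the dot product of the coefficients with the feature vector. A minimal plain-Python check is sketched below, using the values printed in the earlier cells. The helper name predict is hypothetical, and because the printed coefficients are rounded, the results agree with Spark's only up to small rounding differences.

```python
# Values copied from the notebook output (the coefficients are shown rounded).
coefficients = [1500.0, -0.0]        # regressor.coefficients
intercept = -16500.000000036445      # regressor.intercept

# Feature vectors of the three test rows: [age, Experience]
test_features = [[24.0, 3.0], [29.0, 4.0], [30.0, 8.0]]

def predict(features):
    """Hypothetical helper: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

manual_predictions = [predict(f) for f in test_features]
for features, value in zip(test_features, manual_predictions):
    print(features, round(value, 2))
```

The results are approximately 19500, 27000, and 28500, matching the prediction column. Since the Experience coefficient is zero in this fit, the predictions depend only on age.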



In [19]:
pred_results.meanAbsoluteError,pred_results.meanSquaredError

(3666.666666669657, 20500000.000040647)
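
Both metrics can be verified by hand from the prediction table. The sketch below recomputes them in plain Python from the three test rows shown earlier. The variable names are illustrative, and the square root of the mean squared error is included only as a convenience for reading the error in salary units.

```python
# Labels and predictions copied from pred_results.predictions.show()
labels = [20000, 20000, 25000]
predictions = [19500.000000000764, 27000.000000007793, 28500.000000001943]

# Per-row error: prediction minus actual Salary
errors = [p - y for p, y in zip(predictions, labels)]

# Mean absolute error: average of |error|
mae = sum(abs(e) for e in errors) / len(errors)

# Mean squared error: average of error squared
mse = sum(e * e for e in errors) / len(errors)

# Root mean squared error, in the same units as Salary
rmse = mse ** 0.5

print(mae, mse, rmse)
```

The absolute errors are about 500, 7000, and 3500, which average to roughly 3666.67, in line with the meanAbsoluteError reported above. With only three test rows, these metrics are a rough indication at best.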