### Examples of PySpark ML

In [0]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('Missing').getOrCreate()

In [0]:
## Read the dataset
training = spark.read.csv('/FileStore/shared_uploads/astartupcto@gmail.com/test1.csv',header=True,inferSchema=True)

In [0]:
training.show()



+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|   Sapan| 31|        10| 30000|
|Priyanka| 30|         8| 25000|
|Gurpreet| 29|         4| 20000|
|   Payal| 24|         3| 20000|
|   Priya| 21|         1| 15000|
|   Aayan| 23|         2| 18000|
+--------+---+----------+------+



In [0]:
training.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [0]:
training.columns

Out[18]: ['Name', 'Age', 'Experience', 'Salary']

In [0]:
# VectorAssembler combines multiple feature columns into a single vector column.
# PySpark's ML estimators expect their input features in one vector column,
# so this is a common preprocessing step in machine learning pipelines.

from pyspark.ml.feature import VectorAssembler
featureassembler=VectorAssembler(inputCols=["Age","Experience"],outputCol="Independent Features")

In [0]:
# Apply the assembler to the training DataFrame. The result keeps the original columns
# and appends the "Independent Features" vector column built from Age and Experience.
output=featureassembler.transform(training)

In [0]:
output.show()

+--------+---+----------+------+--------------------+
|    Name|Age|Experience|Salary|Independent Features|
+--------+---+----------+------+--------------------+
|   Sapan| 31|        10| 30000|         [31.0,10.0]|
|Priyanka| 30|         8| 25000|          [30.0,8.0]|
|Gurpreet| 29|         4| 20000|          [29.0,4.0]|
|   Payal| 24|         3| 20000|          [24.0,3.0]|
|   Priya| 21|         1| 15000|          [21.0,1.0]|
|   Aayan| 23|         2| 18000|          [23.0,2.0]|
+--------+---+----------+------+--------------------+



In [0]:
output.columns

Out[22]: ['Name', 'Age', 'Experience', 'Salary', 'Independent Features']

In [0]:
finalized_data=output.select("Independent Features","Salary")

In [0]:
finalized_data.show()

+--------------------+------+
|Independent Features|Salary|
+--------------------+------+
|         [31.0,10.0]| 30000|
|          [30.0,8.0]| 25000|
|          [29.0,4.0]| 20000|
|          [24.0,3.0]| 20000|
|          [21.0,1.0]| 15000|
|          [23.0,2.0]| 18000|
+--------------------+------+



In [0]:
# LinearRegression predicts a continuous target variable from one or more input features.
from pyspark.ml.regression import LinearRegression
## Train-test split
# finalized_data is the DataFrame after the transformations above
# (the "Independent Features" vector column plus the Salary label).
# randomSplit([0.75, 0.25]) assigns each row at random: roughly 75% of the rows go to
# train_data and roughly 25% to test_data. The proportions are approximate, and without
# a seed the split changes from run to run.
train_data,test_data=finalized_data.randomSplit([0.75,0.25])
# Create and train the linear regression model
regressor=LinearRegression(featuresCol='Independent Features', labelCol='Salary')
regressor=regressor.fit(train_data)

In [0]:
### Coefficients
regressor.coefficients

Out[26]: DenseVector([109.3058, 1199.4092])

In [0]:
### Intercept
regressor.intercept

Out[27]: 12187.59231905408

In [0]:
### Evaluate on the test data (the returned summary also holds the predictions)
pred_results=regressor.evaluate(test_data)

In [0]:
pred_results.predictions.show()

+--------------------+------+------------------+
|Independent Features|Salary|        prediction|
+--------------------+------+------------------+
|          [24.0,3.0]| 20000|18409.158050221544|
|         [31.0,10.0]| 30000|27570.162481536125|
+--------------------+------+------------------+



In [0]:
pred_results.meanAbsoluteError,pred_results.meanSquaredError

Out[30]: (2010.3397341211657, 4217444.237654801)

In [0]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Step 1: Create a Spark Session
spark = SparkSession.builder.appName('LinearRegressionExample').getOrCreate()

# Step 2: Define the schema explicitly
# Without a schema (and without inferSchema=True), Spark reads every CSV column as a string
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Experience", IntegerType(), True),
    StructField("Salary", IntegerType(), True)
])

# Step 3: Load Data from File Path
file_path = '/FileStore/shared_uploads/astartupcto@gmail.com/test1.csv'
df = spark.read.csv(file_path, schema=schema, header=True)

# Step 4: Clean and Transform Data
# Drop rows with null values
df = df.dropna()

# Define the VectorAssembler
featureassembler = VectorAssembler(inputCols=["Age", "Experience"], outputCol="Independent Features")

# Apply the VectorAssembler to the DataFrame
finalized_data = featureassembler.transform(df)

# Step 5: Train-Test Split
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])

# Step 6: Create and Train the Linear Regression Model
regressor = LinearRegression(featuresCol='Independent Features', labelCol='Salary')
regressor = regressor.fit(train_data)

# Show the training summary
training_summary = regressor.summary
print("Coefficients: " + str(regressor.coefficients))
print("Intercept: " + str(regressor.intercept))
print("RMSE: %f" % training_summary.rootMeanSquaredError)
print("R2: %f" % training_summary.r2)

# Step 7: Make Predictions on the Test Data
predictions = regressor.transform(test_data)
predictions.select("Age", "Experience", "Salary", "prediction").show()


Coefficients: [-102.5299600532517,1688.6817576564458]
Intercept: 16470.039946737463
RMSE: 666.648912
R2: 0.982531
+---+----------+------+------------------+
|Age|Experience|Salary|        prediction|
+---+----------+------+------------------+
| 30|         8| 25000|26903.595206391477|
+---+----------+------+------------------+



Let's dive into the training summary:

### Coefficients

```plaintext
Coefficients: [-102.5299600532517, 1688.6817576564458]
```

- **Coefficients** are the weights assigned to the features in your linear regression model. In this case, you have two features: `Age` and `Experience`.

  - The first coefficient `-102.5299600532517` is associated with the `Age` feature.
  - The second coefficient `1688.6817576564458` is associated with the `Experience` feature.

  These coefficients indicate how much the target variable (`Salary`) is expected to change with a one-unit change in the corresponding feature, holding all other features constant.

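  - A coefficient of `-102.5299600532517` for `Age` means that, for each additional year of age, the predicted salary decreases by approximately 102.53 units, assuming `Experience` remains constant.
  - A coefficient of `1688.6817576564458` for `Experience` means that, for each additional year of experience, the predicted salary increases by approximately 1688.68 units, assuming `Age` remains constant.

  With a dataset this small, these values are unstable: the first run in this notebook, trained on a different random split, produced a positive `Age` coefficient (`109.3058`). The negative sign here should not be read as evidence that age lowers salary.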
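To make the role of the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below is plain Python, not PySpark, and uses the coefficient and intercept values printed above; the function name `predict_salary` is a hypothetical helper introduced only for illustration. Applied to the test row from that run (Age 30, Experience 8), it should agree with the value in the `prediction` column up to floating-point rounding.

```python
# Fitted parameters copied from the printed output of the run above.
intercept = 16470.039946737463
coef_age = -102.5299600532517
coef_experience = 1688.6817576564458


def predict_salary(age, experience):
    """Evaluate the fitted linear equation for a single row (hypothetical helper)."""
    return intercept + coef_age * age + coef_experience * experience


# The single test row from that run: Age 30, Experience 8.
prediction = predict_salary(30, 8)
print(prediction)
```

This is the same arithmetic that `regressor.transform(test_data)` performs for each row: a weighted sum of the feature vector plus the intercept.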

### Intercept

```plaintext
Intercept: 16470.039946737463
```

- The **Intercept** is the expected value of the target variable (`Salary`) when all the features (`Age` and `Experience`) are zero. It represents the baseline level of the target variable without any influence from the features.

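  In this case, an intercept of `16470.039946737463` means that if both `Age` and `Experience` were zero, the model would predict a salary of approximately 16,470.04 units. No row in the data is anywhere near age zero, so this is an extrapolation: the intercept is best read as the offset of the fitted equation rather than as a realistic salary.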

### RMSE (Root Mean Squared Error)

```plaintext
RMSE: 666.648912
```

- **RMSE** is the square root of the average squared difference between predicted and actual values. It is expressed in the same units as the target, and because the errors are squared before averaging, large misses count more heavily than small ones.

  An RMSE of `666.648912` means the model's predictions typically deviate from the actual salaries by about 666.65 units. This figure comes from `regressor.summary`, so it is measured on the training rows; the single test row in the output above was off by about 1,903.60 units (26,903.60 predicted versus 25,000 actual). Lower RMSE values indicate better model performance.
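The RMSE definition can be checked by hand in plain Python. The sketch below rests on an assumption: the notebook does not print the training split, but since the only test row in that run was Age 30 / Experience 8, the remaining five rows of the dataset are taken to be the training rows. The name `assumed_train_rows` is hypothetical. Under that assumption the result should come out very close to the reported training RMSE.

```python
import math

# Fitted parameters copied from the printed output above.
intercept = 16470.039946737463
coef_age = -102.5299600532517
coef_experience = 1688.6817576564458

# ASSUMPTION: the training split was every row except the test row (30, 8, 25000).
# Columns: (Age, Experience, Salary).
assumed_train_rows = [
    (31, 10, 30000),
    (29, 4, 20000),
    (24, 3, 20000),
    (21, 1, 15000),
    (23, 2, 18000),
]

# Squared difference between actual and predicted salary for each row.
squared_errors = []
for age, experience, salary in assumed_train_rows:
    predicted = intercept + coef_age * age + coef_experience * experience
    squared_errors.append((salary - predicted) ** 2)

# RMSE: square root of the mean squared error.
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

# Absolute error on the one held-out row printed in the output above.
test_error = abs(25000 - 26903.595206391477)

print(f"training RMSE:  {rmse:.6f}")
print(f"test-row error: {test_error:.2f}")
```

Comparing the two printed numbers shows why a training-set RMSE alone is an optimistic estimate of how the model does on unseen rows.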

### R² (R-Squared)

```plaintext
R2: 0.982531
```

- **R² (R-Squared)** is the proportion of the variance in the target variable that is explained by the features in the model. For a least-squares fit with an intercept, measured on its own training data, it lies between 0 and 1; on held-out data it can be negative.

  An R² of `0.982531` means that approximately 98.25% of the variance in the training salaries is explained by the `Age` and `Experience` features. Higher R² values indicate a better fit to the data, although a value this high on only a handful of training rows says little about how well the model generalizes.
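R² can be reproduced the same way, as one minus the ratio of the residual sum of squares to the total sum of squares. As a sketch in plain Python, and again assuming (the notebook does not state it) that the five rows other than the Age 30 / Experience 8 test row formed the training split, the value should land close to the reported `0.982531`. The name `assumed_train_rows` is hypothetical.

```python
# Fitted parameters copied from the printed output above.
intercept = 16470.039946737463
coef_age = -102.5299600532517
coef_experience = 1688.6817576564458

# ASSUMPTION: the training split was every row except the test row (30, 8, 25000).
# Columns: (Age, Experience, Salary).
assumed_train_rows = [
    (31, 10, 30000),
    (29, 4, 20000),
    (24, 3, 20000),
    (21, 1, 15000),
    (23, 2, 18000),
]

salaries = [salary for _, _, salary in assumed_train_rows]
mean_salary = sum(salaries) / len(salaries)

# Residual sum of squares: variation the model leaves unexplained.
ss_res = sum(
    (salary - (intercept + coef_age * age + coef_experience * experience)) ** 2
    for age, experience, salary in assumed_train_rows
)

# Total sum of squares: variation around the mean salary.
ss_tot = sum((salary - mean_salary) ** 2 for salary in salaries)

r2 = 1 - ss_res / ss_tot
print(f"R2: {r2:.6f}")
```

A model that always predicted `mean_salary` would have `ss_res == ss_tot` and therefore an R² of 0, which is the baseline the metric compares against.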

### Summary

To summarize:

- **Coefficients** tell us how much the target variable is expected to change with a one-unit change in the corresponding feature.
- **Intercept** represents the expected value of the target variable when all features are zero.
- **RMSE** indicates the typical size of the model's prediction errors, in the units of the target, with large errors weighted more heavily.
- **R²** shows how well the features explain the variance in the target variable.

These metrics help us understand the performance and characteristics of the linear regression model.