In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IntroToPySpark").getOrCreate()

data = [("John", "Doe", 29), ("Jane", "Smith", 34), ("Sam", "Brown", 23)]
columns = ["first_name", "last_name", "age"]

df = spark.createDataFrame(data, columns)
df.show()

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 29|
|      Jane|    Smith| 34|
|       Sam|    Brown| 23|
+----------+---------+---+

In [2]:
# Selecting specific columns
df.select("first_name", "age").show()

# Filtering data
df.filter(df["age"] > 25).show()

# Grouping and Aggregating
df.groupBy("age").count().show()

+----------+---+
|first_name|age|
+----------+---+
|      John| 29|
|      Jane| 34|
|       Sam| 23|
+----------+---+

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 29|
|      Jane|    Smith| 34|
+----------+---------+---+

+---+-----+
|age|count|
+---+-----+
| 29|    1|
| 34|    1|
| 23|    1|
+---+-----+



In [3]:
df.createOrReplaceTempView("people")
result = spark.sql("SELECT * FROM people WHERE age > 25")
result.show()

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 29|
|      Jane|    Smith| 34|
+----------+---------+---+



The next cell creates an RDD, transforms it, and collects the result. This section walks through that code and explains what an RDD is.

1. **Understanding the Code**:
   - `rdd = spark.sparkContext.parallelize([1,2,3,4,5])`: Creates an RDD (Resilient Distributed Dataset) named `rdd` from the Python list `[1,2,3,4,5]`. Spark splits the list into partitions that can be processed in parallel, on different nodes when running on a cluster. Each element (1, 2, 3, 4, 5) becomes a separate item in the RDD.
   - `squared_rdd = rdd.map(lambda x: x*x)`: Applies a transformation to `rdd`. `map` takes a function, here a lambda that squares its argument, and returns a new RDD, `squared_rdd`, whose elements are the squares of the original numbers (1, 4, 9, 16, 25). Transformations are lazy, so nothing is computed at this point.
   - `print(squared_rdd.collect())`: `collect()` is an action. It triggers the computation, retrieves all elements of `squared_rdd` from the cluster, and returns them to the driver as a regular Python list, which `print` then outputs.

2. **What is an RDD in PySpark?**:
   - RDD stands for Resilient Distributed Dataset. It is Spark's fundamental data structure: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
   - **Resilient**: RDDs are fault-tolerant. Spark tracks how each RDD was derived (its lineage), so partitions lost to a node failure can be recomputed automatically.
   - **Distributed**: Data in RDDs is distributed across multiple nodes in a cluster, allowing for parallel processing.
   - **Dataset**: The data itself, a partitioned collection of values.

RDDs are Spark's low-level abstraction, and the DataFrame API used above is built on top of them. On large datasets in a distributed environment they support transformations (like `map`, `filter`), which define a new RDD, and actions (like `collect`, `count`), which return a result to the driver program.

In [4]:
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
squared_rdd = rdd.map(lambda x: x*x)
print(squared_rdd.collect())

[1, 4, 9, 16, 25]


The linear regression cell further below shows how to perform linear regression, a fundamental machine learning task, with the PySpark ML (Machine Learning) library. Here is a high-level overview of what each part of that code does:

1. **Import Libraries**:
   - `LinearRegression`: The linear regression model from PySpark's machine learning library.
   - `Vectors`: A factory for local (non-distributed) vectors, such as the dense vectors created with `Vectors.dense`.
   - `VectorAssembler`: A transformer that combines multiple columns into a single vector column, often used to prepare data for machine learning models. This example imports it but does not use it, because the features are already built as vectors.

2. **Prepare Sample Data**:
   - The `data` list contains tuples, each representing a data point. Each tuple has two elements: a feature vector (using `Vectors.dense`) and a label.
   - `Vectors.dense([0.0])`, `Vectors.dense([1.0])`, and `Vectors.dense([2.0])` are feature vectors. In this simple case, each vector contains only one feature.
   - The corresponding labels are `1.0`, `2.0`, and `3.0`.

3. **Create DataFrame**:
   - The data is converted into a DataFrame `df`, with columns named "features" and "label". This is a standard format for ML tasks in PySpark, where features are usually presented in vector form.

4. **Set Up Linear Regression Model**:
   - An instance of `LinearRegression` is created with three parameters: `maxIter=10` caps the number of optimization iterations, `regParam=0.3` sets the regularization strength, and `elasticNetParam=0.8` sets the mix of L1 and L2 regularization (0 is pure L2, 1 is pure L1).

5. **Train the Model**:
   - The model is trained (fitted) on the provided DataFrame `df`. The `fit` method applies the linear regression algorithm to learn the relationship between the features and the label.

6. **Output Model Parameters**:
   - After training, the model's coefficients (weights assigned to the features) and the intercept (the point where the estimated regression line crosses the y-axis) are printed out. These parameters define the fitted line in linear regression.

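In summary, this code is a basic example of linear regression in PySpark: it trains a model on a single feature and a label, then prints the parameters of the learned line. The sample points lie exactly on `y = x + 1`, yet the printed coefficient (about 0.66) and intercept (about 1.34) differ from 1 and 1 because the regularization penalty shrinks the coefficient.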

---

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. Here are the key points:

1. **Basic Idea**: The main goal of linear regression is to find a linear relationship between the independent variable(s) (also known as predictors or features) and the dependent variable (also known as the response or outcome). 

2. **Linear Equation**: This relationship is represented as a linear equation, typically in the form `y = mx + c` for simple linear regression with one independent variable, where:
   - `y` is the dependent variable.
   - `x` is the independent variable.
   - `m` is the slope of the line (shows how much `y` changes for a unit change in `x`).
   - `c` is the y-intercept (value of `y` when `x` is 0).

3. **Multiple Variables**: In cases with more than one independent variable, the equation becomes `y = b0 + b1*x1 + b2*x2 + ... + bn*xn`, where `b0` is the intercept and `b1`, `b2`, ..., `bn` are coefficients for each independent variable `x1`, `x2`, ..., `xn`.

4. **Fitting the Model**: "Fitting" a linear regression model means finding the coefficient values that best approximate the actual relationship between the variables. This is usually done by minimizing the sum of squared differences between the observed values and the values predicted by the model, a method called least squares.

5. **Use Cases**: Linear regression is used in fields such as economics, biology, and engineering for predicting a quantitative response, understanding relationships between variables, and forecasting trends.

In simple terms, linear regression is like drawing a straight line through data points in a way that the line represents the best estimate of how those points relate to each other.
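
The fitting step can be sketched in plain Python for the single-variable case. This uses the textbook closed-form least-squares solution, not Spark's solver, and the function names `fit_simple_ols` and `predict` are made up for this example. The data are the same three points used in the PySpark cell.

```python
def fit_simple_ols(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # The least-squares line passes through the point of means.
    c = y_mean - m * x_mean
    return m, c


def predict(intercept, coefficients, features):
    """Evaluate y = b0 + b1*x1 + ... + bn*xn for one data point."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]
m, c = fit_simple_ols(xs, ys)
print(m, c)                    # 1.0 1.0
print(predict(c, [m], [3.0]))  # 4.0
```

With no regularization, the fitted line is exactly `y = x + 1`, which is the baseline to compare a regularized fit against.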

---

Note: the code below requires `numpy`. Install it with `pip3 install numpy`.

---

In [6]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# Sample data
data = [(Vectors.dense([0.0]), 1.0),
        (Vectors.dense([1.0]), 2.0),
        (Vectors.dense([2.0]), 3.0)]

df = spark.createDataFrame(data, ["features", "label"])

# Linear Regression model
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(df)

# Print the coefficients
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Coefficients: [0.657728271247261]
Intercept: 1.3422717287527388
