### **Chapter 2: DataFrame**
Welcome back! In Chapter 1: SparkSession, we learned that the SparkSession is our main entry point into the powerful world of Apache Spark. It's like opening the front door to a large library. Once inside, where do you find the books (your data)? How is the information organized?

This is where the DataFrame comes in.

#### **What Problem Does DataFrame Solve?**
You have vast amounts of data, perhaps millions or billions of rows, spread across many computers. This data often has a structure, like columns with names (e.g., "user_id", "product_name", "purchase_amount"). You need a way to:

- Organize this structured data.
- Perform common data manipulation tasks (like selecting specific columns, filtering rows, joining with other data) efficiently on this distributed data.
- Do this in a way that Spark can optimize for performance across the cluster.

Simply storing data in basic Python lists or arrays wouldn't work well for huge datasets and wouldn't leverage Spark's distributed processing power effectively.

### **DataFrame: Your Structured Data Table in Spark**
Think of a Spark DataFrame as a super-powered spreadsheet or a table in a database, but designed to handle immense amounts of data that are too big for one computer's memory and are processed across many machines.
It organizes data into named columns. This structure makes it intuitive to work with, much like you would interact with a table in SQL or a data frame in the pandas library (though Spark DataFrames are fundamentally built for distributed processing).

### **Basic Operations with DataFrames**
DataFrames provide many methods for common data manipulation tasks. Let's look at a couple of simple ones.

In [3]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col, mean, count, when, isnull, isnan
from pyspark.sql.types import IntegerType, FloatType


In [7]:
# Read in the dataset

spark = SparkSession.builder.appName("AutomotiveDataAnalysis").getOrCreate()

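# header=True takes column names from the first line;
# inferSchema=True makes Spark scan the data to guess each column's type.
auto_df = spark.read.csv("/content/data.csv", header=True, sep=",", inferSchema=True)
auto_df.show()  # prints the first 20 rows by default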

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|        3|             NULL|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88.6| 16

### Show the Schema
Print the schema to see each column's name and data type.


In [8]:
auto_df.printSchema()

root
 |-- symboling: integer (nullable = true)
 |-- normalized-losses: integer (nullable = true)
 |-- make: string (nullable = true)
 |-- fuel-type: string (nullable = true)
 |-- aspiration: string (nullable = true)
 |-- num-of-doors: string (nullable = true)
 |-- body-style: string (nullable = true)
 |-- drive-wheels: string (nullable = true)
 |-- engine-location: string (nullable = true)
 |-- wheel-base: double (nullable = true)
 |-- length: double (nullable = true)
 |-- width: double (nullable = true)
 |-- height: double (nullable = true)
 |-- curb-weight: integer (nullable = true)
 |-- engine-type: string (nullable = true)
 |-- num-of-cylinders: string (nullable = true)
 |-- engine-size: integer (nullable = true)
 |-- fuel-system: string (nullable = true)
 |-- bore: double (nullable = true)
 |-- stroke: double (nullable = true)
 |-- compression-ratio: double (nullable = true)
 |-- horsepower: integer (nullable = true)
 |-- peak-rpm: integer (nullable = true)
 |-- city-mpg: integer 

### Describe

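`describe()` reports the count, mean, standard deviation, minimum, and maximum of each column. For string columns such as `make`, the mean and standard deviation are null.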


In [9]:
auto_df.describe().show()

+-------+------------------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+------------------+------------------+-----------------+------------------+------------------+-----------+----------------+------------------+-----------+------------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+
|summary|         symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|        wheel-base|            length|            width|            height|       curb-weight|engine-type|num-of-cylinders|       engine-size|fuel-system|              bore|            stroke| compression-ratio|        horsepower|         peak-rpm|         city-mpg|      highway-mpg|             price|
+-------+------------------+-----------------+-----------+---------+----------+------------+-----------+------------+---------

In [13]:
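# Count missing values (NULL or NaN) in each column.
# when() without otherwise() yields NULL for non-matching rows, and count() skips NULLs.
auto_df.select([count(when(isnull(c) | isnan(c), c)).alias(c) for c in auto_df.columns]).show()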

+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|symboling|normalized-losses|make|fuel-type|aspiration|num-of-doors|body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|price|
+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|        0|               41|   0|        0|         0|           2|         0|           0|              0|         0|     0|    0|     0|          0|   