## Introduction to PySpark
- PySpark: The Python API for Apache Spark, an open-source distributed computing system.
- Apache Spark: A fast, general-purpose cluster-computing framework.

## Key Features
- Distributed Computing: Allows parallel data processing across a cluster of machines.
- Ease of Use: Offers high-level APIs in Java, Scala, Python, and R.
- Fast Processing: Utilizes in-memory computing to speed up data processing.
- Scalability: Capable of scaling up from a single server to thousands of machines.

## Components of PySpark
- Spark Core: The underlying execution engine that handles memory management and fault recovery.
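- Spark SQL: The module for structured data; it provides DataFrames and lets SQL queries run on Spark.
- Spark Streaming: Processes real-time data streams. The original DStream API is now legacy; Structured Streaming, built on Spark SQL, is the recommended approach.
- MLlib: A library for machine learning that runs on Spark.
- GraphX: A library for graph processing. It is available in Scala and Java only and has no Python API.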

### Command to install PySpark
    pip install pyspark

In [1]:
import pyspark
import pandas as pd

In [9]:
df = pd.read_csv('test1.csv')
df

      Name  Age
0     Aman   22
1  Anshita   25
2     John   27


#### Starting a PySpark session

In [3]:
from pyspark.sql import SparkSession

In [4]:
sp = SparkSession.builder.appName('Practice').getOrCreate()

In [6]:
sp

In [10]:
df_pyspark = sp.read.csv('test1.csv')

In [11]:
df_pyspark

DataFrame[_c0: string, _c1: string]

In [12]:
df_pyspark.show()

+-------+---+
|    _c0|_c1|
+-------+---+
|   Name|Age|
|   Aman| 22|
|Anshita| 25|
|   John| 27|
+-------+---+



In [16]:
df_pyspark = sp.read.option('header', 'true').csv('test1.csv')  # treat the first row as the header

In [17]:
df_pyspark.show()

+-------+---+
|   Name|Age|
+-------+---+
|   Aman| 22|
|Anshita| 25|
|   John| 27|
+-------+---+



In [19]:
df_pyspark.head(3)

[Row(Name='Aman', Age='22'),
 Row(Name='Anshita', Age='25'),
 Row(Name='John', Age='27')]

In [22]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)

