# PySpark

- PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.


In [None]:
import pyspark

# SparkSession:


- A SparkSession is the entry point to Apache Spark's functionality in a Spark application. It was introduced in Spark 2.0 as a unified replacement for the earlier SQLContext and HiveContext.

- Unified Entry Point: `A SparkSession is a single entry point to interact with Spark`, making it easier to work with different Spark components and APIs, such as Spark SQL, Structured Streaming, and Spark MLlib.

- Configuration: `You can configure various aspects of Spark through a SparkSession, including cluster settings, memory allocation, and application-specific settings`. This configuration is applied to all Spark functionality within your application.

- DataFrame and SQL Operations: `The SparkSession allows you to work with DataFrames and perform SQL operations on structured data`. You can create DataFrames from various data sources (e.g., CSV, Parquet, JSON) and perform complex transformations and queries using Spark SQL.

- Support for Multiple Languages: Spark is compatible with several programming languages, including Python, Scala, Java, and R. `The SparkSession provides a consistent interface across these languages.`

- Resource Management: Through its underlying SparkContext, the SparkSession connects to the cluster manager, which allocates resources, `including memory and CPU, and distributes computation across a cluster of machines`.



# WARN

- In PySpark, WARN is the log level used for warning messages. A WARN message in your application's log output means that the application has encountered a non-fatal issue or condition that might need attention or further investigation, but that has not caused the application to fail.


In [2]:
# The initial step is to create a Spark session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pratice").getOrCreate()

23/10/06 17:58:20 WARN Utils: Your hostname, BTCCHL0016 resolves to a loopback address: 127.0.1.1; using 192.168.0.100 instead (on interface wlp44s0)
23/10/06 17:58:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/10/06 17:58:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

In [4]:
### Read a dataset

# df_spark = spark.read.csv("Data.csv", header=True)

### Read the file using options
# Without inferSchema, every column is read with the default data type, string

df_spark = spark.read.option('header', 'true').csv("/home/kurinchiban/Desktop/Pyspark/Pyspark/data_source/data.csv", inferSchema=True)

df_spark.show()

### Check the schema
df_spark.printSchema()  

+-----------+---+----------+
|       Name|Age|Experience|
+-----------+---+----------+
|Kurinchiban| 23|         2|
|    Kishore| 30|        10|
|      Kavin| 18|         0|
+-----------+---+----------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [5]:
# df_spark.head(2)

### Selecting the columns 

# df_spark.columns

# df_spark.select("Name").show()

df_spark.select(["Name","Age"]).show()


+-----------+---+
|       Name|Age|
+-----------+---+
|Kurinchiban| 23|
|    Kishore| 30|
|      Kavin| 18|
+-----------+---+



In [34]:
df_spark.describe().show()

+-------+-----------+------------------+-----------------+
|summary|       Name|               Age|       Experience|
+-------+-----------+------------------+-----------------+
|  count|          3|                 3|                3|
|   mean|       null|23.666666666666668|              4.0|
| stddev|       null| 6.027713773341707|5.291502622129181|
|    min|      Kavin|                18|                0|
|    max|Kurinchiban|                30|               10|
+-------+-----------+------------------+-----------------+



In [40]:
## Adding columns to a PySpark DataFrame

df_spark = df_spark.withColumn("Experience after 3 year",df_spark['Experience']+3)

df_spark.show()

+-----------+---+----------+-----------------------+
|       Name|Age|Experience|Experience after 3 year|
+-----------+---+----------+-----------------------+
|Kurinchiban| 23|         2|                      5|
|    Kishore| 30|        10|                     13|
|      Kavin| 18|         0|                      3|
+-----------+---+----------+-----------------------+



In [41]:
### Drop the column

df_spark = df_spark.drop("Experience after 3 year")

df_spark.show()


+-----------+---+----------+
|       Name|Age|Experience|
+-----------+---+----------+
|Kurinchiban| 23|         2|
|    Kishore| 30|        10|
|      Kavin| 18|         0|
+-----------+---+----------+



In [46]:
# Rename the columns 

df_spark = df_spark.withColumnRenamed('Name',"New_Name")


In [47]:
df_spark.show()

+-----------+---+----------+
|   New_Name|Age|Experience|
+-----------+---+----------+
|Kurinchiban| 23|         2|
|    Kishore| 30|        10|
|      Kavin| 18|         0|
+-----------+---+----------+

