# DataFrames

A DataFrame is a distributed collection of rows organized under named columns. In simple terms, it resembles a table in a relational database or an Excel sheet with column headers. It also shares some characteristics with the RDD:

* Immutable: once created, a DataFrame or RDD cannot be changed. Applying a transformation produces a new DataFrame or RDD instead of modifying the original.
* Lazy evaluation: a task is not executed until an action is performed.
* Distributed: the data in both RDDs and DataFrames is partitioned across the nodes of a cluster.
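
The first two properties can be illustrated without Spark. The following is a plain-Python analogy built on generators, not PySpark code: the generator expressions stand in for transformations and `list()` stands in for an action. The names `rows`, `log`, `traced`, `doubled` and `large` are made up for this sketch.

```python
# Plain-Python analogy for lazy, immutable pipelines (not Spark code).
rows = (200.0, 120.0, 340.0)      # a tuple: cannot be modified in place
log = []                          # records when elements are actually read

def traced(values):
    """Yield each value, recording the moment it is really processed."""
    for v in values:
        log.append(v)
        yield v

# "Transformations": each step describes new data and leaves `rows` untouched.
doubled = (v * 2 for v in traced(rows))
large = (v for v in doubled if v > 300)

before = len(log)
print(before)         # 0 -> nothing has been computed yet

# "Action": asking for concrete results finally runs the whole pipeline.
result = list(large)
print(result)         # [400.0, 680.0]
print(log)            # [200.0, 120.0, 340.0]
print(rows)           # (200.0, 120.0, 340.0) -> the source is unchanged
```

In Spark the same split applies: methods such as `select` and `limit` only describe work, while methods such as `show` and `count` trigger it.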


### Advantages of the DataFrame
* DataFrames are designed for processing large collections of structured or semi-structured data.
* Observations in a Spark DataFrame are organised under named columns, which lets Apache Spark understand the schema of the DataFrame and optimize the execution plan of queries on it.
* DataFrames in Apache Spark can handle petabytes of data.
* DataFrames support a wide range of data formats and sources.
* The DataFrame API is available in several languages: Python, R, Scala, and Java.


# Importing Libraries

In [1]:
import pyspark
from pyspark import SparkContext  # not used below; SparkSession is the entry point

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Basics').getOrCreate()

In [4]:
df = spark.read.csv('sales_info.csv',inferSchema=True,header=True)

The `show()` method prints the rows of a DataFrame (the first 20 by default).

In [5]:
df.show()

+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
|   GOOG|    Sam|200.0|
|   GOOG|Charlie|120.0|
|   GOOG|  Frank|340.0|
|   MSFT|   Tina|600.0|
|   MSFT|    Amy|124.0|
|   MSFT|Vanessa|243.0|
|     FB|   Carl|870.0|
|     FB|  Sarah|350.0|
|   APPL|   John|250.0|
|   APPL|  Linda|130.0|
|   APPL|   Mike|750.0|
|   APPL|  Chris|350.0|
+-------+-------+-----+



Use `printSchema()` to show the schema of the data. Note how closely this follows the SQL model: each column has a name and a type. The schema also shows that every column accepts null values, because its `nullable` property is `true`.

In [6]:
df.printSchema()

root
 |-- Company: string (nullable = true)
 |-- Person: string (nullable = true)
 |-- Sales: double (nullable = true)
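
To get a feel for what `inferSchema=True` has to do, here is a rough plain-Python sketch. It is not Spark's actual algorithm: it reads a few CSV lines (copied from the table above) and picks, for each column, the narrowest of three types that parses every value. The helpers `infer_type` and `infer_schema` are hypothetical names introduced for this illustration.

```python
import csv
import io

# A few rows copied from the sales table above, as CSV text.
CSV_TEXT = """Company,Person,Sales
GOOG,Sam,200.0
MSFT,Tina,600.0
FB,Carl,870.0
"""

def infer_type(values):
    """Return the narrowest type name that can parse every value."""
    for name, parser in (("int", int), ("double", float)):
        try:
            for v in values:
                parser(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(text):
    """Map each header name to an inferred type (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = list(zip(*reader))          # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}

schema = infer_schema(CSV_TEXT)
for name, type_name in schema.items():
    print(f" |-- {name}: {type_name}")
```

The sketch has to look at every value before settling on a type, which hints at why schema inference in Spark costs an extra pass over the data.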



The `columns` attribute returns the column names as a Python list.

In [7]:
df.columns

['Company', 'Person', 'Sales']

### Spark DataFrames have separate Column and Row types

In [8]:
type(df['Company'])

pyspark.sql.column.Column

In [9]:
type(df.head(2)[0])

pyspark.sql.types.Row

### The `select` method to select particular columns

The `select` method returns a new DataFrame containing only the requested columns. Because it is a transformation, we still have to call the `show` method to actually output the data.

In [10]:
df.select('Company').show()

+-------+
|Company|
+-------+
|   GOOG|
|   GOOG|
|   GOOG|
|   MSFT|
|   MSFT|
|   MSFT|
|     FB|
|     FB|
|   APPL|
|   APPL|
|   APPL|
|   APPL|
+-------+



In [11]:
df.select(['Company','Person']).show()

+-------+-------+
|Company| Person|
+-------+-------+
|   GOOG|    Sam|
|   GOOG|Charlie|
|   GOOG|  Frank|
|   MSFT|   Tina|
|   MSFT|    Amy|
|   MSFT|Vanessa|
|     FB|   Carl|
|     FB|  Sarah|
|   APPL|   John|
|   APPL|  Linda|
|   APPL|   Mike|
|   APPL|  Chris|
+-------+-------+



### The `limit` method to take the first few rows (without any collection)

Applying `limit()` to a DataFrame returns a new DataFrame. It is a transformation and does not collect any data. Similar methods such as `take` and `head` are actions: they return the rows themselves, as an `Array` of `Row` objects in the Scala API and as a Python list of `Row` objects in PySpark.

In [12]:
df.limit(5).show()

+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
|   GOOG|    Sam|200.0|
|   GOOG|Charlie|120.0|
|   GOOG|  Frank|340.0|
|   MSFT|   Tina|600.0|
|   MSFT|    Amy|124.0|
+-------+-------+-----+



### The `head()` and the `asDict()` methods

In [13]:
df.head(2)

[Row(Company='GOOG', Person='Sam', Sales=200.0),
 Row(Company='GOOG', Person='Charlie', Sales=120.0)]

In [14]:
dicti=df.head(2)[0].asDict()
dicti

{'Company': 'GOOG', 'Person': 'Sam', 'Sales': 200.0}
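
A PySpark `Row` is a tuple with named fields, so the standard library's `namedtuple` makes a convenient stand-in for experimenting without a Spark session. The sketch below is an analogy only: `SalesRow` is a hypothetical type, and `_asdict()` is the `namedtuple` counterpart of `Row.asDict()`.

```python
from collections import namedtuple

# Hypothetical stand-in for the first Row returned by df.head(2).
SalesRow = namedtuple("SalesRow", ["Company", "Person", "Sales"])
row = SalesRow(Company="GOOG", Person="Sam", Sales=200.0)

print(row.Person)        # access by field name -> Sam
print(row[2])            # access by position -> 200.0

# dict(...) keeps the result a plain dict on every Python 3 version.
as_dict = dict(row._asdict())
print(as_dict)           # {'Company': 'GOOG', 'Person': 'Sam', 'Sales': 200.0}
```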


### The `count` method - number of rows

In [15]:
df.count()

12

### The `describe` and `summary` methods

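Similar to Pandas, the `describe` method produces a statistical summary. In PySpark, however, `describe()` returns a Spark DataFrame, so evaluating it on its own in a notebook cell shows only the column names and types, not the statistics. As with any other DataFrame, call `show()` to display the result. The `summary` method works the same way and additionally reports the 25%, 50%, and 75% percentiles.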

In [16]:
df.describe().show()

+-------+-------+-------+------------------+
|summary|Company| Person|             Sales|
+-------+-------+-------+------------------+
|  count|     12|     12|                12|
|   mean|   null|   null| 360.5833333333333|
| stddev|   null|   null|250.08742410799007|
|    min|   APPL|  Chris|             120.0|
|    max|   MSFT|Vanessa|             870.0|
+-------+-------+-------+------------------+
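
As a sanity check on the table above, the `Sales` statistics can be reproduced with Python's `statistics` module. This is a cross-check outside Spark, with the twelve values copied by hand from the earlier `df.show()` output. It also shows that the reported `stddev` matches the sample standard deviation (dividing by n - 1) rather than the population one.

```python
import statistics

# Sales column copied from the df.show() output above.
sales = [200.0, 120.0, 340.0, 600.0, 124.0, 243.0,
         870.0, 350.0, 250.0, 130.0, 750.0, 350.0]

mean = statistics.mean(sales)
sample_std = statistics.stdev(sales)        # divides by n - 1
population_std = statistics.pstdev(sales)   # divides by n

print(len(sales), min(sales), max(sales))   # 12 120.0 870.0
print(round(mean, 4))                       # 360.5833
print(round(sample_std, 4))                 # 250.0874
print(population_std < sample_std)          # True
```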



In [17]:
df.summary().show()

+-------+-------+-------+------------------+
|summary|Company| Person|             Sales|
+-------+-------+-------+------------------+
|  count|     12|     12|                12|
|   mean|   null|   null| 360.5833333333333|
| stddev|   null|   null|250.08742410799007|
|    min|   APPL|  Chris|             120.0|
|    25%|   null|   null|             130.0|
|    50%|   null|   null|             250.0|
|    75%|   null|   null|             350.0|
|    max|   MSFT|Vanessa|             870.0|
+-------+-------+-------+------------------+
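
The three percentile rows can be checked by hand as well. In this particular output, each one equals the value at rank ceil(p * n) of the sorted `Sales` column. That is an observation about these twelve values, not a description of Spark's percentile algorithm, which the Spark documentation describes as approximate. The helper `value_at_rank` is a hypothetical name introduced for this sketch.

```python
import math

# Sales column copied from the df.show() output above.
sales = [200.0, 120.0, 340.0, 600.0, 124.0, 243.0,
         870.0, 350.0, 250.0, 130.0, 750.0, 350.0]

def value_at_rank(values, p):
    """Return the element at 1-based rank ceil(p * n) of the sorted values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

quartiles = {f"{int(p * 100)}%": value_at_rank(sales, p)
             for p in (0.25, 0.5, 0.75)}
print(quartiles)   # {'25%': 130.0, '50%': 250.0, '75%': 350.0}
```

Note that the 50% row (250.0) is an actual value from the column, whereas an interpolated median of these twelve numbers, as returned by `statistics.median`, would be 295.0.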



### Rows with distinct values - `distinct` method

In [18]:
df.select('Company').distinct().show()

+-------+
|Company|
+-------+
|   APPL|
|   GOOG|
|     FB|
|   MSFT|
+-------+
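
The distinct values, and the number of rows behind each of them, can also be worked out on a plain Python list. This is a cross-check outside Spark, using the `Company` values copied from the first `df.show()` output: a `set` gives the distinct values and `collections.Counter` gives the number of rows per value.

```python
from collections import Counter

# Company column copied from the df.show() output above.
companies = ["GOOG"] * 3 + ["MSFT"] * 3 + ["FB"] * 2 + ["APPL"] * 4

distinct_companies = set(companies)
print(sorted(distinct_companies))   # ['APPL', 'FB', 'GOOG', 'MSFT']
print(len(distinct_companies))      # 4

rows_per_company = Counter(companies)
print(rows_per_company["APPL"])     # 4
```

Like a `set`, `distinct()` makes no promise about ordering, which is why the Spark output above does not list the companies in the order they first appear. In PySpark, chaining `.count()` after `distinct()` returns the number of distinct rows.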



In [19]:
spark.stop()