A DataFrame is a collection of Row objects. Each Row object represents one record and holds a value for every column.
There are three scenarios in which you work with Row objects directly:
1. Manually creating Rows and a DataFrame
2. Collecting DataFrame rows to the driver
3. Working with an individual row to apply a transformation

###### Let's simulate scenarios 1 and 2

In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

from lib.logger import Log4j

spark = SparkSession \
            .builder \
            .master("local[3]") \
            .appName("Row Demo") \
            .getOrCreate()

logger = Log4j(spark)
logger.info("Starting HelloSparkSQL")

In [2]:
from pyspark.sql.functions import *

def to_date_df(df, fmt, fld):
    return df.withColumn(fld, to_date(fld, fmt))

In [4]:
from pyspark.sql.types import *

In [5]:
my_schema = StructType([StructField("ID", StringType()),StructField("EventDate", StringType())])

In [6]:
from pyspark.sql import Row
my_rows = [Row("123", "04/05/2020"), Row("124", "4/5/2020"), Row("125", "04/5/2020"), Row("126", "4/05/2020")]

In [7]:
my_rdd = spark.sparkContext.parallelize(my_rows, 2)

In [8]:
my_df = spark.createDataFrame(my_rdd, my_schema)

In [9]:
my_df.printSchema()

root
 |-- ID: string (nullable = true)
 |-- EventDate: string (nullable = true)



In [10]:
my_df.show()

+---+----------+
| ID| EventDate|
+---+----------+
|123|04/05/2020|
|124|  4/5/2020|
|125| 04/5/2020|
|126| 4/05/2020|
+---+----------+



In [11]:
new_df = to_date_df(my_df, "M/d/y", "EventDate")

In [12]:
new_df.printSchema()

root
 |-- ID: string (nullable = true)
 |-- EventDate: date (nullable = true)



In [13]:
new_df.show()

+---+----------+
| ID| EventDate|
+---+----------+
|123|2020-04-05|
|124|2020-04-05|
|125|2020-04-05|
|126|2020-04-05|
+---+----------+
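
All four input strings end up as the same date because the pattern tolerates both one-digit and two-digit months and days. A similar check can be done in plain Python. Note that `%m/%d/%Y` is `datetime.strptime` syntax, used here only as an analogy; it is not the pattern language that Spark's `to_date` expects.

```python
from datetime import datetime

# The same four EventDate strings used in my_rows.
raw_dates = ["04/05/2020", "4/5/2020", "04/5/2020", "4/05/2020"]

# strptime accepts months and days with or without a leading zero.
parsed = [datetime.strptime(value, "%m/%d/%Y").date() for value in raw_dates]

for raw, day in zip(raw_dates, parsed):
    print(raw, "->", day.isoformat())
```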



In [14]:
spark.stop()

###### Work with an individual row
Spark DataFrames offer a number of transformation functions:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

You use these methods when your data has a schema.
What if you do not have a proper schema?
Then you need an extra step: create a columnar structure first, and then apply the transformations.
In such cases you start by working with the raw rows and then transform them into a columnar structure.


In [15]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

from lib.logger import Log4j

spark = SparkSession \
            .builder \
            .master("local[3]") \
            .appName("Row Demo") \
            .getOrCreate()

logger = Log4j(spark)
logger.info("Starting HelloSparkSQL")

In [16]:
file_df = spark.read.text("data/apache_logs.txt")
file_df.printSchema()

root
 |-- value: string (nullable = true)



As you can see, the DataFrame has a single string column named value and no other columns.

We need a way to extract well-defined fields.
Can we use a regular expression here?
Let's look at the data. This is an Apache log file. Each entry contains the IP, client, user, datetime, cmd, request, protocol, status, bytes, referrer, and user agent. We can extract all of these fields with a regular expression.

In [17]:
log_reg = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "(\S+)" "([^"]*)'

Create a DataFrame with only four of these fields.

In [18]:
logs_df = file_df.select(regexp_extract('value', log_reg, 1).alias('ip'),
                             regexp_extract('value', log_reg, 4).alias('date'),
                             regexp_extract('value', log_reg, 6).alias('request'),
                             regexp_extract('value', log_reg, 10).alias('referrer'))

Use the select method with regexp_extract, which takes three arguments: 1. the source string or field name, 2. the regular expression, 3. the number of the capture group to extract.
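
To see what the third argument selects, the same pattern can be tried outside Spark with Python's built-in `re` module. This is only a sketch: the sample log line and the helper `extract_group` are made up for illustration, and the helper mimics `regexp_extract` by returning an empty string when the pattern does not match.

```python
import re

# Same pattern as log_reg in the notebook.
LOG_REG = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "(\S+)" "([^"]*)'

# Hypothetical log line in Apache combined format (not taken from data/apache_logs.txt).
SAMPLE_LINE = (
    '10.0.0.1 - - [17/May/2015:10:05:03 +0000] '
    '"GET /index.html HTTP/1.1" 200 1024 '
    '"http://example.com/blog/post" "Mozilla/5.0"'
)

def extract_group(line, pattern, idx):
    """Return capture group idx, or '' when the pattern does not match."""
    match = re.search(pattern, line)
    return match.group(idx) if match else ""

# Groups 1, 4, 6 and 10 are the ones selected in the notebook.
ip = extract_group(SAMPLE_LINE, LOG_REG, 1)
date = extract_group(SAMPLE_LINE, LOG_REG, 4)
request = extract_group(SAMPLE_LINE, LOG_REG, 6)
referrer = extract_group(SAMPLE_LINE, LOG_REG, 10)
print(ip, date, request, referrer, sep=" | ")
```

Counting the opening parentheses from left to right gives the group number, which is why the referrer is group 10.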

In [19]:
logs_df.printSchema()

root
 |-- ip: string (nullable = true)
 |-- date: string (nullable = true)
 |-- request: string (nullable = true)
 |-- referrer: string (nullable = true)



In [20]:
logs_df \
    .groupBy("referrer") \
    .count() \
    .show(100, truncate=False)

Sample referrer values from the output (the table is truncated here):

http://www.semicomplete.com/blog/tags/wifi
http://www.semicomplete.com/projects/keynav/
http://www.semicomplete.com/presentations/logstash-blah/images/office-space-printer-beat-down-gif.gif

Look at these referrers: they all come from the same site.

Let's keep only the home URL of each website.

In [21]:
logs_df \
    .where("trim(referrer) != '-' ") \
    .withColumn("referrer", substring_index("referrer", "/", 3)) \
    .groupBy("referrer") \
    .count() \
    .show(100, truncate=False)

+--------------------------------------+-----+
|referrer                              |count|
+--------------------------------------+-----+
|http://ijavascript.cn                 |1    |
|http://www.google.co.tz               |1    |
|http://www.google.ca                  |6    |
|https://www.google.hr                 |2    |
|https://www.google.ch                 |1    |
|http://www.google.ru                  |6    |
|http://www.raspberrypi-spanish.es     |1    |
|http://semicomplete.com               |2001 |
|http://manpages.ubuntu.com            |2    |
|http://kufli.blogspot.fr              |1    |
|http://www.bing.com                   |6    |
|http://rungie.com                     |1    |
|http://www.google.co.th               |2    |
|https://www.google.cz                 |5    |
|http://danceuniverse.ru               |3    |
|http://www.google.co.uk               |14   |
|http://www.google.rs                  |1    |
|http://kufli.blogspot.in              |1    |
|http://t.co 

However, you might be wondering about this column transformation. We haven't learned it yet. Let's cover column-level transformations first and then come back to it.
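
As a rough preview of what `substring_index("referrer", "/", 3)` does: it keeps everything before the third `/`, which for an absolute URL is the scheme plus the host. The sketch below imitates that behaviour in plain Python, without Spark. `substring_index_py` is a hypothetical helper that only covers positive counts, and the URLs are the three referrers shown earlier.

```python
from collections import Counter

def substring_index_py(text, delim, count):
    """Return the part of text before the count-th delim (positive count only)."""
    return delim.join(text.split(delim)[:count])

referrers = [
    "http://www.semicomplete.com/blog/tags/wifi",
    "http://www.semicomplete.com/projects/keynav/",
    "http://www.semicomplete.com/presentations/logstash-blah/images/office-space-printer-beat-down-gif.gif",
]

# Reduce every referrer to its home URL, then count occurrences per home URL.
home_urls = [substring_index_py(url, "/", 3) for url in referrers]
counts = Counter(home_urls)
print(counts)
```

Because the three pages collapse into one home URL, grouping on the shortened value counts them together, which is what the groupBy on the rewritten referrer column relies on.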

In [22]:
spark.stop()