## Reading Different Types of Data, Partitions, Parallelization
##### The DataFrame API is preferred over the RDD API, as it is usually much faster. The Dataset API is not available in Python.

#### 1. DataFrames are immutable; every transformation creates a new DataFrame.

#### 2. Spark datasets are represented as a list of entries.
       This list is broken into partitions, which can be stored on different machines. 
       Each partition holds a unique subset of the entries in the list. 
       Spark calls these datasets "Resilient Distributed Datasets" (RDDs).
#### 3. At a low level, everything is implemented as RDDs.

#### 4. DataFrames are ultimately represented as RDDs, with additional metadata.

#### 5. When you create a DataFrame, its data is parallelized, i.e. distributed across partitions.

#### 6. Spark DataFrame schemas are defined as a collection of typed columns. The entire schema is stored as a StructType and individual columns are stored as StructFields.
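
Point 2 can be pictured with a few lines of plain Python. This is only a mental model, not Spark's implementation: the function name `split_into_partitions` is hypothetical, and contiguous slicing is an assumption chosen for simplicity.

```python
def split_into_partitions(entries, num_partitions):
    """Split a list into contiguous, non-overlapping chunks (a toy model of partitions)."""
    n = len(entries)
    return [
        entries[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


entries = list(range(10))
partitions = split_into_partitions(entries, 3)
print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Every entry lands in exactly one partition, so concatenating them restores the list.
restored = [x for part in partitions for x in part]
print(restored == entries)  # True
```

In real Spark the chunks live on executors rather than in one Python list, but the "unique subset per partition" property is the same idea.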

## Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
#### Common file formats in the Hadoop ecosystem
1. CSV Files
2. Text Files
3. JSON Records
4. Avro Files
5. Sequence Files
6. RC Files
7. ORC Files
8. Parquet Files
9. XML files

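In [None]:
### There are 3 different ways to create DataFrames in PySpark
    1. Read the data directly into a Spark DataFrame with spark.read
    2. Create an RDD and pass it to createDataFrame
    3. Create a pandas DataFrame and pass it to createDataFrame

Differences between 1, 2 and 3
Number of partitions observed in this notebook: 1 in method 1, 2 in method 2, 8 in method 3
(these counts depend on file size, available cores and Spark configuration)
Method 1: Raw Data => Spark DataFrame
Method 2: Raw Data => RDD => Spark DataFrame
Method 3: Raw Data => pandas DataFrame => Spark DataFrame
Note: the "Method N" labels in the cells below follow the order in which each section tries the approaches, not this numbering.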

In [2]:
from pyspark.sql import SparkSession
import pandas as pd

In [3]:
spark = SparkSession.builder.appName("fileformats").getOrCreate()

### 1. CSV files

In [65]:
# Method 1: read the file with pandas first, then convert it to a Spark DataFrame
pandas_df = pd.read_csv("Data/Employee_Statistics.csv")

In [66]:
pandas_df.dtypes

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
target                      int64
dtype: object

In [67]:
from pyspark.sql.types import *

    DataType
            ArrayType
            MapType
            NullType
            StructField
            StructType
    AtomicType(DataType)
        BinaryType
        BooleanType
        DateType
        StringType
        TimestampType
    FractionalType(NumericType)
        DecimalType
        DoubleType
        FloatType
    IntegralType(NumericType)
        ByteType
        IntegerType
        LongType
        ShortType

In [69]:
## We define a schema for the pandas DataFrame explicitly, because Spark's
## schema inference can fail on it: pandas stores missing values as NaN (a float),
## so an object column can end up mixing strings and floats.

## df = spark.createDataFrame(pandas_df)

## Example of the error I got when running the line above without a schema:

##TypeError: field company_size: 
##        Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
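
The error above can be reproduced in miniature without Spark. The sketch below uses made-up values, not the real file, and the helper `nan_to_none` is hypothetical; it only shows why a column with missing values ends up holding two Python types, and one possible way to normalise it.

```python
import math

# A made-up column as pandas would hold it: strings plus NaN for missing values.
company_size = ["50-99", float("nan"), "<10", float("nan")]

# Two different Python types in one column is what schema inference trips over.
print(sorted({type(v).__name__ for v in company_size}))  # ['float', 'str']


def nan_to_none(values):
    """Replace float NaN with None so that only one real type (plus nulls) remains."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in values]


cleaned = nan_to_none(company_size)
print(cleaned)  # ['50-99', None, '<10', None]
```

With None in place of NaN the remaining values share a single type; supplying an explicit schema is the other way around the problem.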


In [8]:
schema = StructType([
    StructField("enrollee_id", IntegerType(), False),
    StructField("city", StringType(), True),
    StructField("city_development_index", FloatType(), True),
    StructField("gender", StringType(), True),
    StructField("relevent_experience", StringType(), True),
    StructField("enrolled_university", StringType(), True),
    StructField("education_level", StringType(), True),
    StructField("major_discipline", StringType(), True),
    StructField("experience", StringType(), True),
    StructField("company_size", StringType(), True),
    StructField("company_type", StringType(), True),
    StructField("last_new_job", StringType(), True),
    StructField("training_hours", IntegerType(), True),
    StructField("target", IntegerType(), True),
])

In [9]:
df1= spark.createDataFrame(pandas_df, schema=schema)

In [10]:
# Method 2: the data is loaded directly as a Spark DataFrame (not via an RDD or a pandas DataFrame)
df2 = spark.read.format("csv").options(header="true").load("Data/Employee_Statistics.csv")

In [11]:
## Let us compare Method 2 and Method 1
## In Method 2, every column is loaded as a string, because no schema was given and inferSchema was not enabled
## Method 1 gives us more control over the data types of the columns, through the explicit schema

In [12]:
df2.printSchema()

root
 |-- enrollee_id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- city_development_index: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- relevent_experience: string (nullable = true)
 |-- enrolled_university: string (nullable = true)
 |-- education_level: string (nullable = true)
 |-- major_discipline: string (nullable = true)
 |-- experience: string (nullable = true)
 |-- company_size: string (nullable = true)
 |-- company_type: string (nullable = true)
 |-- last_new_job: string (nullable = true)
 |-- training_hours: string (nullable = true)
 |-- target: string (nullable = true)
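
If all-string columns are not what you want, the CSV reader can be asked for types. A minimal sketch, assuming an existing SparkSession is passed in; the function name `read_csv_typed` is hypothetical, while `header`, `inferSchema` and `schema` are options of `spark.read.csv`.

```python
def read_csv_typed(spark, path, schema=None):
    """Read a CSV file that has a header row.

    Uses the explicit schema when one is given; otherwise asks Spark to infer
    column types, which costs an extra pass over the data.
    """
    if schema is not None:
        return spark.read.csv(path, header=True, schema=schema)
    return spark.read.csv(path, header=True, inferSchema=True)
```

For example, `read_csv_typed(spark, "Data/Employee_Statistics.csv").printSchema()` should then report numeric types for columns such as `training_hours` instead of strings.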



In [13]:
df1.rdd.getNumPartitions()

8

In [14]:
df2.rdd.getNumPartitions()

1

In [15]:
## On this machine, Method 1 (via pandas) produced 8 partitions, while Method 2 (spark.read) put the whole file into a single partition

In [18]:
df1.take(1)

[Row(enrollee_id=8949, city='city_103, city_103', city_development_index=0.9200000166893005, gender='Male', relevent_experience='Has relevent experience', enrolled_university='no_enrollment', education_level='Graduate', major_discipline='STEM', experience='>20', company_size='NaN', company_type='NaN', last_new_job='1', training_hours=36, target=1)]

In [101]:
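### Method 3: read the file as an RDD of text lines, then convert it
RDD_csv = spark.sparkContext.textFile("Data/Employee_Statistics.csv")

In [104]:
## Caution: RDD_csv holds raw text lines, not parsed fields. createDataFrame is lazy here,
## so this line runs, but an action such as df1.take(1) should fail against the 14-field schema.
df1= spark.createDataFrame(RDD_csv, schema=schema)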

In [105]:
df1.rdd.getNumPartitions()

2
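
Since `textFile` yields raw lines, each line has to be turned into a tuple of typed values before a 14-field schema can apply. A minimal sketch, assuming the column order of the schema defined earlier, no quoted fields that span several lines, and empty fields meaning missing values; `parse_line` and `lines_to_dataframe` are hypothetical helper names.

```python
import csv

# Column positions in the employee file (assumed from the schema's field order).
INT_COLS = {0, 12, 13}   # enrollee_id, training_hours, target
FLOAT_COLS = {2}         # city_development_index


def parse_line(line):
    """Turn one CSV text line into a tuple of typed values (None for empty fields)."""
    fields = next(csv.reader([line]))
    row = []
    for i, raw in enumerate(fields):
        if raw == "":
            row.append(None)
        elif i in INT_COLS:
            row.append(int(raw))
        elif i in FLOAT_COLS:
            row.append(float(raw))
        else:
            row.append(raw)
    return tuple(row)


def lines_to_dataframe(spark, lines_rdd, schema):
    """Drop the header line, parse the rest, and apply the schema."""
    header = lines_rdd.first()
    rows = lines_rdd.filter(lambda line: line != header).map(parse_line)
    return spark.createDataFrame(rows, schema=schema)
```

A call such as `lines_to_dataframe(spark, RDD_csv, schema)` would keep the partitioning of the underlying RDD, because `filter` and `map` do not shuffle data.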

## 2. Text File
##### This is similar to the CSV case. Let us see if we can spot any differences.


In [None]:
## createDataFrame accepts three kinds of input: a list, a pandas DataFrame or an RDD
## So we have three options when reading data from an external source:
## a plain text file (such as a book page) can be read as a pandas DataFrame, as an RDD, or directly as a Spark DataFrame
## Use case: NLP problems

In [None]:
## From the createDataFrame docstring, the data argument can be an RDD of Row/tuple/list/dict, or a list, ...

In [39]:
RDD_list = spark.sparkContext.textFile("Data/bookpage.txt")

In [43]:
## Let us check what RDD_list looks like
RDD_list.take(10)

['Fine for running, but does that idea hold for any pursuit?',
 'Kriegel continues: “The same is true elsewhere: Trying easy',
 'will help you in any area of your life. Conventional Wisdom',
 'tells us we have to give no less than 110 percent to keep',
 'ahead. Yet conversely, I have found that giving 90 percent is',
 'usually more effective.”',
 'For freewriting, too, Kriegel’s “easy” notion hits the nail',
 'on its relaxed head.',
 'Rather than approach your writing with your teeth gritted, demanding instant, virtuoso solutions from yourself,',
 'loosen up and ease into your best 90 percent effort. Here’s']

In [48]:
## Let us convert this RDD into a DataFrame

## df3 = spark.createDataFrame(RDD_list)

## When I run the line above, it throws an error, because each element is a bare string with no column structure:

## TypeError: Can not infer schema for type: <class 'str'>


In [78]:
# Passing a single data type wraps each string into a one-column row named "value"
df3 = spark.createDataFrame(RDD_list, StringType())

In [79]:
df3.take(1)

[Row(value='Fine for running, but does that idea hold for any pursuit?')]

In [80]:
df3.rdd.getNumPartitions()

2

In [None]:
## Method 2: now let us go through a pandas DataFrame

In [83]:
pandas_df = pd.read_table("Data/bookpage.txt", header=None, names=['PlainTextField'])

In [84]:
pandas_df.head(10)

Unnamed: 0,PlainTextField
0,"Fine for running, but does that idea hold for ..."
1,Kriegel continues: “The same is true elsewhere...
2,will help you in any area of your life. Conven...
3,tells us we have to give no less than 110 perc...
4,"ahead. Yet conversely, I have found that givin..."
5,usually more effective.”
6,"For freewriting, too, Kriegel’s “easy” notion ..."
7,on its relaxed head.
8,Rather than approach your writing with your te...
9,loosen up and ease into your best 90 percent e...


In [85]:
df4 = spark.createDataFrame(pandas_df)

In [86]:
df4.take(1)

[Row(PlainTextField='Fine for running, but does that idea hold for any pursuit?')]

In [87]:
df4.rdd.getNumPartitions()

8

In [96]:
# Method 3: load the data directly into a Spark DataFrame
Spark_Df = spark.read.text("Data/bookpage.txt")

In [97]:
Spark_Df.rdd.getNumPartitions()

1

## 3. JSON records.

In [None]:
# Method 1: create a Spark DataFrame by reading directly from the JSON file

In [109]:
df5 = spark.read.json("Data/sparkify_log_small.json")

In [110]:
df5.rdd.getNumPartitions()

2

In [116]:
df5.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [121]:
# Using Pandas Dataframe
pandas_df = pd.read_json("Data/sparkify_log_small.json", lines=True)
pandas_df.dtypes

ts                 int64
userId            object
sessionId          int64
page              object
auth              object
method            object
status             int64
level             object
itemInSession      int64
location          object
userAgent         object
lastName          object
firstName         object
registration     float64
gender            object
artist            object
song              object
length           float64
dtype: object

In [123]:
# When I run the following line I get a TypeError: columns with missing values mix strings and NaN (a float),
# so we would have to describe a schema that maps each pandas column to a Spark type.
#df6 = spark.createDataFrame(pandas_df)

#TypeError: field artist: 
        #Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

In [None]:
# To save the effort of writing a schema, I will create the Spark DataFrame directly from the raw data.
# But I have to make sure I still get the parallelism that Spark gave when converting pandas DataFrames.

In [None]:
df5 = spark.read.json("Data/sparkify_log_small.json")

In [None]:
## A DataFrame is already a distributed data structure, so we do not need to parallelize it ourselves.
## sc.parallelize is neither required nor applicable to it (source: stackoverflow); the partition count can still be changed with repartition or coalesce.
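
To get more (or fewer) partitions from a DataFrame that was read directly, the partition count can be adjusted after loading. A minimal sketch, assuming `df` is any Spark DataFrame; `set_partitions` is a hypothetical helper, while `repartition`, `coalesce` and `rdd.getNumPartitions` are existing DataFrame/RDD methods.

```python
def set_partitions(df, target):
    """Return a DataFrame with `target` partitions.

    repartition performs a full shuffle and can increase the count;
    coalesce merges existing partitions and avoids a full shuffle,
    so it is used when the count only needs to shrink.
    """
    current = df.rdd.getNumPartitions()
    if target > current:
        return df.repartition(target)
    if target < current:
        return df.coalesce(target)
    return df
```

For instance, `set_partitions(df5, 8).rdd.getNumPartitions()` should report 8, matching what the pandas route gave on this machine.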

In [None]:
## Need to understand partitions in more detail:
## For RDD reads such as textFile, Spark uses Hadoop's InputFormat under the hood, so partitions follow the input splits (source: stackoverflow)
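
For DataFrame file sources the split size is, to my understanding, governed by `spark.sql.files.maxPartitionBytes` (default 128 MB) and `spark.sql.files.openCostInBytes` (default 4 MB) rather than by input blocks alone. The sketch below approximates that rule for a single splittable file; the exact formula is an assumption that may differ between Spark versions, and `estimate_splits` is a hypothetical helper, not a Spark API.

```python
import math

MB = 1024 * 1024


def estimate_splits(file_bytes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Approximate how many chunks a single splittable file is cut into.

    Assumed rule: split size = min(maxPartitionBytes, max(openCost, bytes per core)).
    The real partition count can be lower, since small chunks may be packed together.
    """
    bytes_per_core = (file_bytes + open_cost_bytes) // default_parallelism
    split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return math.ceil(file_bytes / split_bytes)


print(estimate_splits(1 * MB, default_parallelism=8))  # 1
print(estimate_splits(5 * MB, default_parallelism=8))  # 2
```

Under these assumptions a file smaller than 4 MB stays in one partition, which may explain the single partition seen for the CSV and text reads, while a slightly larger file is cut in two.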