In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master('local[1]') \
                    .appName('test') \
                    .getOrCreate()

23/12/16 05:52:28 WARN Utils: Your hostname, codespaces-d00206 resolves to a loopback address: 127.0.0.1; using 172.16.5.4 instead (on interface eth0)
23/12/16 05:52:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/16 05:52:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# The Structure of the Data Sources API

## Read API Structure

The core structure for reading data is as follows:

DataFrameReader.format(...).option("key", "value").schema(...).load()

We will use this format to read from all of our data sources.
- **format** is optional because by default Spark will use the Parquet format.
- **option** allows you to set key-value configurations to parameterize how you will read data.
- **schema** is optional if the data source provides a schema or if you intend to use schema inference.

Naturally, there are some required options for each format, which we will discuss when we look at each format.

## Basics of Reading Data

The foundation for reading data in Spark is the DataFrameReader. We access this through the SparkSession via the read attribute:

In [5]:
spark.read

<pyspark.sql.readwriter.DataFrameReader at 0x7efd1960f820>

After we have a DataFrame reader, we specify several values:
- The format
- The schema
- The read mode
- A series of options

The **format**, **options**, and **schema** each return a DataFrameReader that can undergo further transformations and **are all optional**, except for one option.

Each data source has a specific set of options that determine how the data is read into Spark (we cover these options shortly). At a minimum, **you must supply the DataFrameReader with a path from which to read**.

Here’s an example of the overall layout:


In [6]:
spark.read.format("csv")\
    .option("mode", "FAILFAST")\
    .option("inferSchema", "true")\
    .option("path", "path/to/file(s)")\
    .schema(someSchema)\
    .load()

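NameError: name 'someSchema' is not defined

Run as written, the cell fails because someSchema and the path are placeholders; substitute a real StructType and a real file path.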

There are a variety of ways in which you can set options; for example, you can build a map and pass in your configurations. For now, we’ll stick to the simple and explicit way that you just saw.
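
As a sketch of the map-based approach: PySpark's DataFrameReader has an options method that takes keyword arguments, so a plain dict can be unpacked into it. The helper name read_csv_with_options and the option values are illustrative assumptions, and spark is assumed to be an existing SparkSession.

```python
# Options collected in a dict instead of chained .option(...) calls.
csv_options = {
    "mode": "FAILFAST",
    "inferSchema": "true",
    "header": "true",
}


def read_csv_with_options(spark, path, options):
    """Hypothetical helper: read the CSV at `path` using a dict of options."""
    # DataFrameReader.options(**kwargs) sets several options in one call.
    return spark.read.format("csv").options(**options).load(path)
```

Keeping the options in one dict makes it easy to reuse the same configuration across several reads.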

### Read modes

Reading data from an external source naturally entails encountering malformed data, especially when working with only semi-structured data sources. Read modes specify what will happen when Spark does come across malformed records.

| Read mode | Description |
|-----------|-------------|
| permissive | Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record |
| dropMalformed | Drops the row that contains malformed records |
| failFast | Fails immediately upon encountering malformed records |

The default is permissive.
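
To make the three behaviours concrete, here is a plain-Python imitation of the read modes. It is not Spark's implementation: the function parse_rows is hypothetical, the sample text is made up, and a row counts as malformed simply when its field count is wrong, which is a simplification.

```python
import csv


def parse_rows(text, n_fields, mode="permissive"):
    """Imitate the three read modes on CSV text with a fixed field count."""
    rows = []
    for raw in text.splitlines():
        fields = next(csv.reader([raw]), [])
        if len(fields) == n_fields:
            rows.append(fields + [None])             # last slot: _corrupt_record
        elif mode == "permissive":
            rows.append([None] * n_fields + [raw])   # nulls, raw text preserved
        elif mode == "dropMalformed":
            continue                                 # silently skip the row
        elif mode == "failFast":
            raise ValueError(f"malformed record: {raw!r}")
        else:
            raise ValueError(f"unknown mode: {mode}")
    return rows


sample = "US,Romania,1\nbroken line\nUS,Ireland,264"
print(parse_rows(sample, 3, "permissive"))
print(parse_rows(sample, 3, "dropMalformed"))
```

With permissive the bad row survives as nulls plus its raw text, with dropMalformed it disappears, and with failFast the call raises on the first bad row.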

## Write API Structure

The core structure for writing data is as follows:

DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(...).save()

**format** is optional because by default, Spark will use the Parquet format. **option**, again, allows us to configure how to write out our given data. **partitionBy**, **bucketBy**, and **sortBy** work only for file-based data sources; you can use them to control the specific layout of files at the destination.

## Basics of Writing Data

The foundation for writing data is quite similar to that of reading data. Instead of the DataFrameReader, we have the DataFrameWriter. Because we always need to write out some given data source, we access the DataFrameWriter on a per-DataFrame basis via the write attribute:
```Python
dataFrame.write
```

After we have a DataFrameWriter, we specify three values: the format, a series of options, and the save mode. At a minimum, you must supply a path. We will cover the potential for options, which vary from data source to data source, shortly.

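```Python
# Backslashes continue the chained call across lines.
# The save mode is set with .mode(...), not with an option.
dataFrame.write.format('csv') \
         .mode('overwrite') \
         .option('dateFormat', 'yyyy-MM-dd') \
         .option('path', 'path/to/file(s)') \
         .save()
```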

### Save Modes

Save modes specify what will happen if Spark finds data at the specified location.

| Save mode | Description |
|-----------|-------------|
| append | Appends the output files to the list of files that already exist at that location |
| overwrite | Will completely overwrite any data that already exists there |
| errorIfExists | Throws an error and fails the write if data or files already exist at the specified location |
| ignore | If data or files exist at the location, do nothing with the current DataFrame |


The **default** is errorIfExists. This means that if Spark finds data at the location to which you’re writing, it will fail the write immediately.
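
The save-mode table can be read as a small decision function. The sketch below is a plain-Python imitation, not Spark code; the function name resolve_save_mode and the returned action strings are made up for illustration.

```python
def resolve_save_mode(mode, target_exists):
    """Return the action a writer takes for a save mode (imitation only)."""
    mode = mode.lower()
    if mode not in {"append", "overwrite", "errorifexists", "ignore"}:
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        return "write"                    # nothing there: every mode just writes
    if mode == "append":
        return "add files"
    if mode == "overwrite":
        return "replace existing data"
    if mode == "ignore":
        return "do nothing"
    # errorIfExists: data is present, so the write fails.
    raise FileExistsError("data already exists at the target location")


for m in ("append", "overwrite", "ignore"):
    print(m, "->", resolve_save_mode(m, target_exists=True))
```

The modes differ only when data is already present; on an empty location all four behave the same.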

# CSV Files

CSV files, while seeming well structured, are actually one of the trickiest file formats you will encounter because not many assumptions can be made in production scenarios about what they contain or how they are structured. For this reason, the CSV reader has a large number of options. These options give you the ability to work around issues like certain characters needing to be escaped—for example, commas inside of columns when the file is also comma-delimited or null values labeled in an unconventional way.
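
A short plain-Python illustration of the comma problem, using the standard csv module rather than Spark. The sample line is made up for the example.

```python
import csv

# One record whose middle field contains a comma inside quotes.
line = 'United States,"Washington, D.C.",42'

naive = line.split(",")              # ignores quoting
parsed = next(csv.reader([line]))    # honours the quotes

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four pieces while the quote-aware parser yields the intended three fields, which is the kind of ambiguity the CSV reader options exist to resolve.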


## CSV Options

| Read/write | key | Potential values | Default | Description |
|------------|-----|------------------|---------|-------------|
| Both | sep |Any single string character|,|The single character that is used as separator for each field and value.|
| Both | header | true, false | false | A Boolean flag that declares whether the first line in the file(s) contains the names of the columns. |
| Read | escape | Any string character | \ | The character Spark should use to escape other characters in the file. |
| ... | ... | ... | ... | ... |

## Reading CSV Files

To read a CSV file, like any other format, we must first create a DataFrameReader for that specific format. Here, we specify the format to be CSV:

In [3]:
spark.read.format('csv')

<pyspark.sql.readwriter.DataFrameReader at 0x7fd14551a4d0>

After this, we can optionally specify a schema and a read mode. Let’s set a couple of options, some that we saw at the beginning of the book and others that we haven’t seen yet:

```Python
spark.read.format('csv') \
          .option('header', 'true') \
          .option('mode', 'FAILFAST') \
          .option('inferSchema', 'true') \
          .load('some/path/to/file.csv')
```

As mentioned, we can use the mode to specify how much tolerance we have for malformed data. For example, we can use these modes and the schema that we created in Chapter 5 to ensure that our file(s) conform to the data that we expected:

In [3]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
    StructField('DEST_COUNTRY_NAME', StringType(), True),
    StructField('ORIGIN_COUNTRY_NAME', StringType(), True),
    StructField('count', LongType(), False, metadata={'hello':'world'}),
])

spark.read.format('csv') \
    .option('header', 'true') \
    .option('mode', 'FAILFAST') \
    .schema(myManualSchema) \
    .load('../data/flight-data/csv/2010-summary.csv') \
    .show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



Things get tricky when we don’t expect our data to be in a certain format, but it comes in that way anyhow. For example, let’s take our current schema and change all column types to LongType. This does not match the actual schema, but Spark has no problem with us doing this. The problem manifests only when Spark actually reads the data: as soon as we execute a job, it fails because the data does not conform to the specified schema:

In [None]:
myManualSchema = StructType([
    StructField('DEST_COUNTRY_NAME', LongType(), True),
    StructField('ORIGIN_COUNTRY_NAME', LongType(), True),
    StructField('count', LongType(), False, metadata={'hello':'world'}),
])

spark.read.format('csv') \
    .option('header', 'true') \
    .option('mode', 'FAILFAST') \
    .schema(myManualSchema) \
    .load('../data/flight-data/csv/2010-summary.csv') \
    .show(5)

In general, Spark reports data problems like this one only at job execution time rather than at DataFrame definition time. This is due to lazy evaluation. (A missing input path is a common exception: recent Spark versions usually check that the path exists as soon as load is called.)
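
As a loose analogy in plain Python (not Spark), a generator shows the same split between defining a pipeline and running it. The function read_as_longs and the sample values are assumptions made for this sketch.

```python
def read_as_longs(lines):
    """Generator: nothing is parsed until a consumer asks for values."""
    for line in lines:
        yield int(line)   # raises ValueError on non-numeric input


# "Definition time": no error yet, even though the data is bad.
pipeline = read_as_longs(["1", "264", "Romania"])

# "Execution time": the failure appears only once rows are pulled.
try:
    result = list(pipeline)
except ValueError as err:
    result = f"failed at execution: {err}"
print(result)
```

Building the pipeline succeeds; the type mismatch shows up only when the rows are consumed, much like calling show on the DataFrame.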

## Writing CSV Files

Just as with reading data, there are a variety of options (listed in Table 9-3) for writing data when we write CSV files. This is a subset of the reading options because many do not apply when writing data (like maxColumns and inferSchema). Here’s an example:

In [5]:
csvFile = spark.read.format('csv') \
                .option('header', 'true') \
                .option('mode', 'FAILFAST') \
                .schema(myManualSchema) \
                .load('../data/flight-data/csv/2010-summary.csv')

For instance, we can take our CSV file and write it out as a TSV file quite easily:


In [None]:
csvFile.write.format('csv').mode('overwrite').option('sep', '\t').save('../tmp/my-tsv-file.tsv')

When you list the destination directory, you can see that my-tsv-file.tsv is actually a folder with numerous files within it. This reflects the number of partitions in our DataFrame at the time we write it out. If we were to repartition our data before then, we would end up with a different number of files. We discuss this trade-off at the end of this chapter.
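
A minimal sketch of controlling the file count by repartitioning first. The helper name write_tsv is hypothetical; df is assumed to be an existing DataFrame such as csvFile.

```python
def write_tsv(df, path, num_files):
    """Hypothetical helper: write `df` as tab-separated text.

    Repartitioning first means the output folder holds at most
    `num_files` data files (one per non-empty partition).
    """
    (df.repartition(num_files)
       .write.format("csv")
       .mode("overwrite")
       .option("sep", "\t")
       .save(path))
```

For example, write_tsv(csvFile, '../tmp/my-tsv-file.tsv', 5) would ask for five partitions before writing.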

# JSON Files

In Spark, when we refer to JSON files, we refer to **line-delimited** JSON files. This contrasts with files that have a large JSON object or array per file.

The line-delimited versus multiline trade-off is controlled by a single option: multiLine. When you set this option to true, you can read an entire file as one JSON object and Spark will go through the work of parsing that into a DataFrame. Line-delimited JSON is actually a much more stable format because it allows you to append to a file with a new record (rather than having to read in an entire file and then write it out), which is why we recommend that you use it.
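
The difference between the two layouts can be seen with Python's standard json module, independent of Spark. The records below are small made-up samples.

```python
import json

# Line-delimited: one complete JSON object per line.
json_lines = (
    '{"DEST_COUNTRY_NAME": "United States", "count": 1}\n'
    '{"DEST_COUNTRY_NAME": "Egypt", "count": 24}\n'
)
records = [json.loads(line) for line in json_lines.splitlines()]

# Appending a record is a pure append: earlier lines are untouched.
json_lines += json.dumps({"DEST_COUNTRY_NAME": "Malta", "count": 2}) + "\n"
records_after = [json.loads(line) for line in json_lines.splitlines()]

# Multiline: the whole file is one JSON value and is parsed as a unit.
multiline = json.dumps(records_after, indent=2)
whole_file = json.loads(multiline)

print(len(records), len(records_after), whole_file == records_after)
```

Each line of the first layout parses on its own, whereas the multiline document only makes sense once the entire text has been read.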

## JSON Options

| Read/write | key | Potential values | Default |
|------------|-----|------------------|---------|
| Both | compression or codec | None, uncompressed, bzip2, deflate, gzip, lz4, or snappy | None |
| Both | dateFormat | Any string or character that conforms to Java’s SimpleDateFormat. | yyyy-MM-dd |
| ... | ... | ... | ... |

## Reading JSON Files

In [None]:
spark.read.format('json') \
          .option('mode', 'FAILFAST') \
          .option('inferSchema', 'true') \
          .load('../data/flight-data/json/2010-summary.json').show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



## Writing JSON Files

Writing JSON files is just as simple as reading them, and, as you might expect, the data source does not matter. Therefore, we can reuse the CSV DataFrame that we created earlier to be the source for our JSON file. This, too, follows the rules that we specified before: one file per partition will be written out, and the entire DataFrame will be written out as a folder. It will also have one JSON object per line:

In [None]:
csvFile.write.format('json').mode('overwrite').save('../tmp/my-json-file.json')

# Parquet Files

Parquet is an open source column-oriented data store that provides a variety of storage optimizations, especially for analytics workloads. It provides columnar compression, which saves storage space and allows for reading individual columns instead of entire files. It is a file format that works exceptionally well with Apache Spark and is in fact the default file format. We recommend writing data out to Parquet for long-term storage because reading from a Parquet file will always be more efficient than JSON or CSV. Another advantage of Parquet is that it supports complex types. This means that if your column is an array (which would fail with a CSV file, for example), map, or struct, you’ll still be able to read and write that file without issue.

## Reading Parquet Files

Parquet has very few options because it enforces its own schema when storing data. Thus, all you need to set is the format and you are good to go. We can set the schema if we have strict requirements for what our DataFrame should look like. Oftentimes this is not necessary because we can use schema on read, which is similar to the inferSchema with CSV files. However, with Parquet files, this method is more powerful because the schema is built into the file itself (so no inference needed).

Here is a simple example of reading from Parquet:

In [None]:
spark.read.format('parquet').load('../data/flight-data/parquet/2010-summary.parquet').show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



## Parquet options

Even though there are only two options, you can still encounter problems if you’re working with incompatible Parquet files. Be careful when you write out Parquet files with different versions of Spark (especially older ones) because this can cause significant headaches.

| Read/write | key | Potential values | Default | Description |
|------------|-----|------------------|---------|-------------|
| Write | compression or codec |None, uncompressed, bzip2, deflate, gzip, lz4, or snappy|None|Declares what compression codec Spark should use to read or write the file.|
| Read | mergeSchema | true, false | Value of the configuration spark.sql.parquet.mergeSchema | You can incrementally add columns to newly written Parquet files in the same table/folder. Use this option to enable or disable this feature. |

## Writing Parquet Files

In [None]:
csvFile.write.format('parquet').mode('overwrite').save('../tmp/my-parquet-file.parquet')

# ORC Files

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. ORC actually has no options for reading in data because Spark understands the file format quite well. An often-asked question is: What is the difference between ORC and Parquet? For the most part, they’re quite similar; the fundamental difference is that Parquet is further optimized for use with Spark, whereas ORC is further optimized for Hive.

## Reading ORC Files

In [None]:
spark.read.format("orc").load("../data/flight-data/orc/2010-summary.orc").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



## Writing ORC Files

In [None]:
csvFile.write.format("orc").mode("overwrite").save("../tmp/my-orc-file.orc")

# SQL Databases

To read and write from these databases, you need to do two things: include the Java Database Connectivity (JDBC) driver for your particular database on the Spark classpath, and provide the proper JAR for the driver itself. For example, to be able to read and write from PostgreSQL, you might run something like this:
```bash
./bin/spark-shell \
--driver-class-path postgresql-9.4.1207.jar \
--jars postgresql-9.4.1207.jar
```

To avoid the distraction of setting up a database for the purposes of this book, we provide a reference sample that runs on SQLite.

## SQL Database Options

Just as with our other sources, there are a number of options that are available when reading from and writing to SQL databases. Only some of these are relevant for our current example, but the following table lists options that you can set when working with JDBC databases.

|Property Name|Meaning|
|-|-|
|url|The JDBC URL to which to connect. The source-specific connection properties can be specified in the URL; for example, jdbc:postgresql://localhost/test?user=fred&password=secret.|
|dbtable|The JDBC table to read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.|
|...|...|

## Reading from SQL Databases

When it comes to reading, SQL databases are no different from the other data sources that we looked at earlier. As with those sources, we specify the format and options, and then load in the data:

In [None]:
driver = 'org.sqlite.JDBC'
path = '../data/flight-data/jdbc/my-sqlite.db'
url = 'jdbc:sqlite:' + path
tablename = 'flight_info'

In [None]:
dbDataFrame = spark.read.format("jdbc").option("url", url).option("dbtable", tablename).option("driver",  driver).load()

SQLite has rather simple configurations (no users, for example). Other databases, like PostgreSQL, require more configuration parameters. Let’s perform the same read that we just performed, except using PostgreSQL this time:

In [None]:
pgDF = spark.read.format("jdbc")\
            .option("driver", "org.postgresql.Driver")\
            .option("url", "jdbc:postgresql://database_server")\
            .option("dbtable", "schema.tablename")\
            .option("user", "username").option("password", "my-secret-password").load()

As we create this DataFrame, it is no different from any other: you can query it, transform it, and join it without issue. You’ll also notice that there is already a schema, as well. That’s because Spark gathers this information from the table itself and maps the types to Spark data types. Let’s get only the distinct locations to verify that we can query it as expected:

In [None]:
dbDataFrame.select("DEST_COUNTRY_NAME").distinct().show(5)

## Query Pushdown

Spark makes a best-effort attempt to filter data in the database itself before creating the DataFrame. For example, in the previous sample query, we can see from the query plan that it selects only the relevant column name from the table:

In [None]:
dbDataFrame.select("DEST_COUNTRY_NAME").distinct().explain()

```bash
== Physical Plan ==
  *HashAggregate(keys=[DEST_COUNTRY_NAME#8108], functions=[])
  +- Exchange hashpartitioning(DEST_COUNTRY_NAME#8108, 200)
     +- *HashAggregate(keys=[DEST_COUNTRY_NAME#8108], functions=[])
        +- *Scan JDBCRelation(flight_info) [numPartitions=1] ...
```

Spark can actually do better than this on certain queries. For example, if we specify a filter on our DataFrame, Spark will push that filter down into the database. We can see this in the explain plan under PushedFilters.

In [None]:
dbDataFrame.filter("DEST_COUNTRY_NAME in ('Anguilla', 'Sweden')").explain()

```bash
== Physical Plan ==
  *Scan JDBCRel... PushedFilters: [*In(DEST_COUNTRY_NAME, [Anguilla,Sweden])],
  ...
```

Spark can’t translate all of its own functions into the functions available in the SQL database in which you’re working. Therefore, sometimes you’re going to want to pass an entire query to the database and have the results returned as a DataFrame:

In [None]:
pushdownQuery = """(SELECT DISTINCT(DEST_COUNTRY_NAME) FROM flight_info) AS flight_info"""
dbDataFrame = spark.read.format("jdbc")\
                        .option("url", url).option("dbtable", pushdownQuery).option("driver",  driver)\
                        .load()

Now when you query this table, you’ll actually be querying the results of that query. We can see this in the explain plan. Spark doesn’t even know about the actual schema of the table, just the one that results from our previous query:

In [None]:
dbDataFrame.explain()

```bash
== Physical Plan ==
  *Scan JDBCRelation(
  (SELECT DISTINCT(DEST_COUNTRY_NAME)
    FROM flight_info) as flight_info
  ) [numPartitions=1] [DEST_COUNTRY_NAME#788] ReadSchema: ...
```

### Reading from databases in parallel

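Spark has an underlying algorithm that can read multiple files into one partition, or conversely, read multiple partitions out of one file, depending on the file size and the “splitability” of the file type and compression. With SQL databases you configure this more manually; for example, the numPartitions option caps how many partitions Spark uses when reading or writing in parallel: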

In [None]:
dbDataFrame = spark.read.format("jdbc")\
    .option("url", url).option("dbtable", tablename).option("driver",  driver)\
    .option("numPartitions", 10).load()

In this case, this will still remain as one partition because there is not too much data. However, this configuration can help you ensure that you do not overwhelm the database when reading and writing data:

In [None]:

dbDataFrame.select("DEST_COUNTRY_NAME").distinct().show()

You can explicitly push predicates down into SQL databases through the connection itself. This optimization allows you to control the physical location of certain data in certain partitions by specifying predicates. We do that by specifying a list of predicates when we create the data source:

In [None]:
props = {"driver":"org.sqlite.JDBC"}
predicates = [
    "DEST_COUNTRY_NAME = 'Sweden' OR ORIGIN_COUNTRY_NAME = 'Sweden'",
    "DEST_COUNTRY_NAME = 'Anguilla' OR ORIGIN_COUNTRY_NAME = 'Anguilla'"]
spark.read.jdbc(url, tablename, predicates=predicates, properties=props).show()
spark.read.jdbc(url,tablename,predicates=predicates,properties=props).rdd.getNumPartitions() # 2


If you specify predicates that are not disjoint, you can end up with lots of duplicate rows.
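
The duplication is easy to reproduce outside Spark with the standard sqlite3 module. The sketch below runs one query per predicate and concatenates the results, which imitates one partition per predicate; the three rows are made-up sample data, and read_by_predicates is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE flight_info '
    '(DEST_COUNTRY_NAME TEXT, ORIGIN_COUNTRY_NAME TEXT, "count" INTEGER)'
)
conn.executemany(
    "INSERT INTO flight_info VALUES (?, ?, ?)",
    [
        ("Sweden", "United States", 65),
        ("United States", "Sweden", 73),
        ("Sweden", "Anguilla", 1),   # matches BOTH predicates below
    ],
)


def read_by_predicates(predicates):
    """One query per predicate, results concatenated."""
    rows = []
    for predicate in predicates:
        query = f"SELECT * FROM flight_info WHERE {predicate}"
        rows += conn.execute(query).fetchall()
    return rows


overlapping = [
    "DEST_COUNTRY_NAME = 'Sweden' OR ORIGIN_COUNTRY_NAME = 'Sweden'",
    "DEST_COUNTRY_NAME = 'Anguilla' OR ORIGIN_COUNTRY_NAME = 'Anguilla'",
]
rows = read_by_predicates(overlapping)
print(len(rows), len(set(rows)))
```

The row that satisfies both predicates is returned twice, so the total row count exceeds the number of distinct rows.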

### Partitioning based on a sliding window

Let’s take a look to see how we can partition based on predicates. In this example, we’ll partition based on our numerical count column. Here, we specify a minimum and a maximum for both the first partition and last partition. Anything outside of these bounds will be in the first partition or final partition. Then, we set the number of partitions we would like total (this is the level of parallelism). Spark then queries our database in parallel and returns numPartitions partitions. We simply modify the upper and lower bounds in order to place certain values in certain partitions. No filtering is taking place like we saw in the previous example:

In [None]:
colName = 'count'
lowerBound = 0
upperBound = 348113
numPartitions = 10

spark.read.jdbc(url, tablename, column=colName, properties=props, lowerBound=lowerBound, upperBound=upperBound, numPartitions=numPartitions).count() # 255
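
To see why no rows are filtered out, it helps to look at the kind of WHERE clause each partition receives. The plain-Python sketch below approximates that logic with an integer stride; it is an assumption about the general shape of the clauses, and the exact stride arithmetic may differ between Spark versions. The function name jdbc_partition_clauses is hypothetical.

```python
def jdbc_partition_clauses(column, lower, upper, num_partitions):
    """Approximate the WHERE clause generated for each partition."""
    if num_partitions <= 1:
        return []   # a single partition needs no WHERE clause
    stride = upper // num_partitions - lower // num_partitions
    clauses = []
    current = lower
    for i in range(num_partitions):
        lo = f"{column} >= {current}" if i != 0 else None
        current += stride
        hi = f"{column} < {current}" if i != num_partitions - 1 else None
        if hi is None:
            clauses.append(lo)                            # last: open above
        elif lo is None:
            clauses.append(f"{hi} OR {column} IS NULL")   # first: open below
        else:
            clauses.append(f"{lo} AND {hi}")
    return clauses


for clause in jdbc_partition_clauses("count", 0, 348113, 10):
    print(clause)
```

Because the first clause has no lower limit and the last has no upper limit, values outside the bounds still land in a partition instead of being dropped.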

## Writing to SQL Databases


Writing out to SQL databases is just as easy as before. You simply specify the URI and write out the data according to the specified write mode that you want:

In [None]:
newPath = "jdbc:sqlite://tmp/my-sqlite.db"
csvFile.write.jdbc(newPath, tablename, mode="overwrite", properties=props)

spark.read.jdbc(newPath, tablename, properties=props).count() # 255

Of course, we can append to this new table just as easily:

In [None]:
csvFile.write.jdbc(newPath, tablename, mode="append", properties=props)

spark.read.jdbc(newPath, tablename, properties=props).count() # 510 after one append on top of the 255 rows above; 765 if the append ran twice

# Text Files

Spark also allows you to read in plain-text files. Each line in the file becomes a record in the DataFrame. It is then up to you to transform it accordingly. As an example of how you would do this, suppose that you need to parse some Apache log files to some more structured format, or perhaps you want to parse some plain text for natural-language processing. Text files make a great argument for the Dataset API due to its ability to take advantage of the flexibility of native types.
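
As a sketch of the log-parsing case, the regular expression below splits one line in Common Log Format into fields using plain Python. The pattern and the helper parse_log_line are illustrative assumptions; in Spark, a similar pattern could be applied to the value column with regexp_extract.

```python
import re

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)$'
)


def parse_log_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, timestamp, method, path, status, size = match.groups()
    return {
        "host": host,
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "status": int(status),
        "size": size,
    }


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
print(parse_log_line(sample))
```

Lines that do not match return None, which leaves the caller free to drop them or keep them for inspection.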

## Reading Text Files

In [10]:
spark.read.text('../data/flight-data/csv/2010-summary.csv').selectExpr("split(value, ',') as rows").show()

+--------------------+
|                rows|
+--------------------+
|[DEST_COUNTRY_NAM...|
|[United States, R...|
|[United States, I...|
|[United States, I...|
|[Egypt, United St...|
|[Equatorial Guine...|
|[United States, S...|
|[United States, G...|
|[Costa Rica, Unit...|
|[Senegal, United ...|
|[United States, M...|
|[Guyana, United S...|
|[United States, S...|
|[Malta, United St...|
|[Bolivia, United ...|
|[Anguilla, United...|
|[Turks and Caicos...|
|[United States, A...|
|[Saint Vincent an...|
|[Italy, United St...|
+--------------------+
only showing top 20 rows



## Writing Text Files

When you write a text file, you need to be sure to have only one string column; otherwise, the write will fail:

In [15]:
csvFile.select('DEST_COUNTRY_NAME').write.text('../tmp/simple-text-file.txt')

If you perform some partitioning when performing your write, you can write more columns. However, those columns will manifest as directories in the folder to which you’re writing, instead of columns on every single file:

In [16]:
csvFile.limit(10).select("DEST_COUNTRY_NAME", "count").write.partitionBy('count').text('../tmp/five-csv-files2py.csv')

# Advanced I/O Concepts

We saw previously that we can control the parallelism of files that we write by controlling the partitions prior to writing. We can also control specific data layout by controlling two things: bucketing and partitioning (discussed momentarily).

## Splittable File Types and Compression

Certain file formats are fundamentally “splittable.” This can improve speed because it makes it possible for Spark to avoid reading an entire file, and access only the parts of the file necessary to satisfy your query. Additionally if you’re using something like Hadoop Distributed File System (HDFS), splitting a file can provide further optimization if that file spans multiple blocks. In conjunction with this is a need to manage compression. Not all compression schemes are splittable. How you store your data is of immense consequence when it comes to making your Spark jobs run smoothly. We recommend Parquet with gzip compression.

## Reading Data in Parallel

Multiple executors cannot necessarily read from the same file at the same time, but they can read different files at the same time. In general, this means that when you read from a folder with multiple files in it, each one of those files will become a partition in your DataFrame and be read in by available executors in parallel (with the remaining queueing up behind the others).

## Writing Data in Parallel

The number of files or data written is dependent on the number of partitions the DataFrame has at the time you write out the data. By default, one file is written per partition of the data. This means that although we specify a “file,” it’s actually a number of files within a folder, with the name of the specified file, with one file per each partition that is written.

### Partitioning

Partitioning is a tool that allows you to control what data is stored (and where) as you write it. When you write a file to a partitioned directory (or table), you basically encode a column as a folder. What this allows you to do is skip lots of data when you go to read it in later, allowing you to read in only the data relevant to your problem instead of having to scan the complete dataset. Partitioning is supported for all file-based data sources:


In [7]:
csvFile.limit(10).write.mode('overwrite').partitionBy("DEST_COUNTRY_NAME").save("../tmp/partitioned-files.parquet")

This is probably the lowest-hanging optimization that you can use when you have a table that readers frequently filter by before manipulating. For instance, date is particularly common for a partition because, downstream, often we want to look at only the previous week’s data (instead of scanning the entire list of records). This can provide massive speedups for readers.
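
The folder encoding can be sketched in plain Python. Partitioned output uses one directory per value, named column=value; the helpers partition_path and prune below are hypothetical and only imitate how a reader can skip directories without opening any files.

```python
def partition_path(base, column, value):
    """Directory that holds the rows where `column` equals `value`."""
    return f"{base}/{column}={value}"


def prune(directories, column, wanted):
    """Keep only directories whose encoded value is in `wanted`."""
    prefix = f"{column}="
    kept = []
    for directory in directories:
        name = directory.rsplit("/", 1)[-1]   # last path component
        if name.startswith(prefix) and name[len(prefix):] in wanted:
            kept.append(directory)
    return kept


base = "../tmp/partitioned-files.parquet"
dirs = [
    partition_path(base, "DEST_COUNTRY_NAME", country)
    for country in ("Egypt", "Senegal", "United States")
]
print(prune(dirs, "DEST_COUNTRY_NAME", {"Egypt"}))
```

A filter on the partition column narrows the read to the matching directories, which is where the speedup for readers comes from.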

### Bucketing


Bucketing is another file organization approach with which you can control the data that is specifically written to each file. This can help avoid shuffles later when you go to read the data because data with the same bucket ID will all be grouped together into one physical partition. This means that the data is prepartitioned according to how you expect to use that data later on, meaning you can avoid expensive shuffles when joining or aggregating.

Rather than partitioning on a specific column (which might write out a ton of directories), it’s probably worthwhile to explore bucketing the data instead. This will create a certain number of files and organize our data into those “buckets”. Bucketing is supported only for tables written with saveAsTable, which is why the following cell does not call save:

In [6]:
numberBuckets = 10
columnToBucketBy = 'count'

csvFile.write.format('parquet').mode('overwrite').bucketBy(numberBuckets, columnToBucketBy).saveAsTable("bucketedFiles")

                                                                                

## Writing Complex Types

Spark has a variety of different internal types. Although Spark can work with all of these types, not every single type works well with every data file format. For instance, CSV files do not support complex types, whereas Parquet and ORC do.

## Managing File Size

When you’re writing lots of small files, there’s a significant metadata overhead that you incur managing all of those files. Spark especially does not do well with small files, although many file systems (like HDFS) don’t handle lots of small files well, either.

You can use the **maxRecordsPerFile** option and specify a number of your choosing. This allows you to better control file sizes by controlling the number of records that are written to each file. For example, if you set an option for a writer as df.write.option("maxRecordsPerFile", 5000), Spark will ensure that files will contain at most 5,000 records.
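
A quick back-of-the-envelope sketch of what the cap means for file counts, in plain Python. The partition sizes are made-up numbers, every partition is assumed to be non-empty, and files_per_partition is a hypothetical helper rather than a Spark API.

```python
import math


def files_per_partition(partition_sizes, max_records_per_file):
    """Number of files each non-empty partition is split into under the cap."""
    return [math.ceil(n / max_records_per_file) for n in partition_sizes]


# Hypothetical record counts for three partitions.
sizes = [12000, 5000, 300]
print(files_per_partition(sizes, 5000))
```

The cap only ever splits large partitions into more files; it does not merge small partitions, so it limits the maximum file size rather than the minimum.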