A good majority of the data you work with when starting out with PySpark is saved in `csv` format. Getting it all under your fingers, however, is a bit trickier than you might expect if you, like me, find yourself coming from `pandas`.

## Prelims

In [1]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc)

The dataset is recycled from [the Academy Award blogpost I did earlier this year](https://napsterinblue.github.io/blog/2018/01/12/best-picture-consolation-prizes/).

In [2]:
fpath = '../data/movieData.csv'

### Load the Data

Spoiler alert: figuring out the proper function call to load a csv is going to take a few revisions. Let's append arguments and values as we go.

In [3]:
movieArgs = dict()

`movieArgs` unpacks to nothing at the moment. Let's see what a vanilla `read.csv` call gets us.
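
If the `**` syntax is unfamiliar, here is a minimal sketch of why an empty dict changes nothing about the call. `show_kwargs` is a hypothetical stand-in, not part of PySpark; it only reports the keyword arguments it receives.

```python
def show_kwargs(path, **kwargs):
    """Hypothetical stand-in for a reader call: returns the keyword arguments it got."""
    return kwargs

args = dict()
empty_call = show_kwargs('some.csv', **args)    # same as show_kwargs('some.csv')

args['header'] = True
header_call = show_kwargs('some.csv', **args)   # same as show_kwargs('some.csv', header=True)

print(empty_call)
print(header_call)
```

Each key added to the dict becomes one more keyword argument on the next call, which is the pattern the rest of this post leans on.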

In [4]:
movies = spark.read.csv(fpath, **movieArgs)
movies.show(5)

+----+-----------+----------------+--------+-------------+--------+-----------+----+----------+------------------+----+------+
| _c0|        _c1|             _c2|     _c3|          _c4|     _c5|        _c6| _c7|       _c8|               _c9|_c10|  _c11|
+----+-----------+----------------+--------+-------------+--------+-----------+----+----------+------------------+----+------+
|Rank|WeeklyGross|PctChangeWkGross|Theaters|DeltaTheaters|  AvgRev|GrossToDate|Week|  Thursday|              name|year|Winner|
|17.0|     967378|            null|    14.0|         null| 69098.0|     967378|   1|1990-11-18|dances with wolves|1990|  True|
| 9.0|    3871641|           300.0|    14.0|         null|276546.0|    4839019|   2|1990-11-25|dances with wolves|1990|  True|
| 3.0|   12547813|           224.0|  1048.0|       1034.0| 11973.0|   17386832|   3|1990-12-02|dances with wolves|1990|  True|
| 4.0|    9246632|           -26.3|  1053.0|          5.0|  8781.0|   26633464|   4|1990-12-09|dances with wolv

Okay. So Spark expected *not* to see a header in the file. That's alright. We'll just tell it to look for one.

In [5]:
movieArgs['header'] = True

movies = spark.read.csv(fpath, **movieArgs)
movies.show(5)

+----+-----------+----------------+--------+-------------+--------+-----------+----+----------+------------------+----+------+
|Rank|WeeklyGross|PctChangeWkGross|Theaters|DeltaTheaters|  AvgRev|GrossToDate|Week|  Thursday|              name|year|Winner|
+----+-----------+----------------+--------+-------------+--------+-----------+----+----------+------------------+----+------+
|17.0|     967378|            null|    14.0|         null| 69098.0|     967378|   1|1990-11-18|dances with wolves|1990|  True|
| 9.0|    3871641|           300.0|    14.0|         null|276546.0|    4839019|   2|1990-11-25|dances with wolves|1990|  True|
| 3.0|   12547813|           224.0|  1048.0|       1034.0| 11973.0|   17386832|   3|1990-12-02|dances with wolves|1990|  True|
| 4.0|    9246632|           -26.3|  1053.0|          5.0|  8781.0|   26633464|   4|1990-12-09|dances with wolves|1990|  True|
| 4.0|    7272350|           -21.4|  1051.0|         -2.0|  6919.0|   33905814|   5|1990-12-16|dances with wolv

That looks better.

`pandas` might struggle with the `Thursday` column. Did PySpark?

In [6]:
movies.dtypes

[('Rank', 'string'),
 ('WeeklyGross', 'string'),
 ('PctChangeWkGross', 'string'),
 ('Theaters', 'string'),
 ('DeltaTheaters', 'string'),
 ('AvgRev', 'string'),
 ('GrossToDate', 'string'),
 ('Week', 'string'),
 ('Thursday', 'string'),
 ('name', 'string'),
 ('year', 'string'),
 ('Winner', 'string')]

Well, I suppose technically it didn't. You have to make an effort to struggle, yeah?

Let's tell it to take a crack at figuring out the schema.

In [7]:
movieArgs['inferSchema'] = True

movies = spark.read.csv(fpath, **movieArgs)
movies.dtypes

[('Rank', 'double'),
 ('WeeklyGross', 'int'),
 ('PctChangeWkGross', 'double'),
 ('Theaters', 'double'),
 ('DeltaTheaters', 'double'),
 ('AvgRev', 'double'),
 ('GrossToDate', 'int'),
 ('Week', 'int'),
 ('Thursday', 'timestamp'),
 ('name', 'string'),
 ('year', 'int'),
 ('Winner', 'boolean')]
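
Schema inference costs Spark an extra pass over the file, so once the inferred types look right, one option is to write them down and pass them in directly. The sketch below is only an illustration of that idea, not something this post depends on: `ddl_from_dtypes` is a hypothetical helper that turns `(name, type)` pairs like the ones above into a DDL-style string, which recent Spark versions should accept through the `schema` argument of `spark.read.csv`.

```python
def ddl_from_dtypes(dtypes):
    """Build a DDL-style schema string from (column name, type name) pairs."""
    # Backticks quote each column name so it is read as an identifier
    return ', '.join('`{}` {}'.format(name, dtype) for name, dtype in dtypes)

# A few of the inferred pairs from above, used as sample input
inferred = [('Rank', 'double'), ('WeeklyGross', 'int'), ('Thursday', 'timestamp')]
schema = ddl_from_dtypes(inferred)
print(schema)

# With a live session, the string could stand in for inferSchema (assumption, untested here):
# movies = spark.read.csv(fpath, header=True, schema=schema)
```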

Cool. This is really coming along. Let's pull our data down to do some local analysis in `pandas`.

In [8]:
df = movies.toPandas()

TypeError: Cannot convert tz-naive Timestamp, use tz_localize to localize

## Fear Not

That's certainly not an inviting error message. [I wrestled with it myself less than an hour ago](https://napsterinblue.github.io/notes/spark/sparksql/topandas_datetime_error/).

Here's a usable-enough workaround I found to finish this out.

In [9]:
movies = movies.withColumn('Thursday', movies['Thursday'].cast('string'))

In [10]:
import pandas as pd

df = movies.toPandas()
df['Thursday'] = pd.to_datetime(df['Thursday'])
df.head()

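|   | Rank | WeeklyGross | PctChangeWkGross | Theaters | DeltaTheaters | AvgRev | GrossToDate | Week | Thursday | name | year | Winner |
|---|------|-------------|------------------|----------|---------------|--------|-------------|------|----------|------|------|--------|
| 0 | 17.0 | 967378 | NaN | 14.0 | NaN | 69098.0 | 967378 | 1 | 1990-11-18 | dances with wolves | 1990 | True |
| 1 | 9.0 | 3871641 | 300.0 | 14.0 | NaN | 276546.0 | 4839019 | 2 | 1990-11-25 | dances with wolves | 1990 | True |
| 2 | 3.0 | 12547813 | 224.0 | 1048.0 | 1034.0 | 11973.0 | 17386832 | 3 | 1990-12-02 | dances with wolves | 1990 | True |
| 3 | 4.0 | 9246632 | -26.3 | 1053.0 | 5.0 | 8781.0 | 26633464 | 4 | 1990-12-09 | dances with wolves | 1990 | True |
| 4 | 4.0 | 7272350 | -21.4 | 1051.0 | -2.0 | 6919.0 | 33905814 | 5 | 1990-12-16 | dances with wolves | 1990 | True |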
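
If a file has more than one timestamp column, the cast-to-string step can be generalized. What follows is a rough sketch, assuming the only Spark calls needed are the `dtypes`, `withColumn`, and `cast` ones already used above. `timestamp_columns` and `stringify_timestamps` are hypothetical helper names, not PySpark functions.

```python
def timestamp_columns(dtypes):
    """Return the names of columns whose dtype string is 'timestamp'."""
    return [name for name, dtype in dtypes if dtype == 'timestamp']

def stringify_timestamps(sdf):
    """Cast every timestamp column of a Spark DataFrame to string."""
    for name in timestamp_columns(sdf.dtypes):
        sdf = sdf.withColumn(name, sdf[name].cast('string'))
    return sdf

# Checking the column picker against a few of the inferred dtypes from earlier
sample_dtypes = [('Week', 'int'), ('Thursday', 'timestamp'), ('name', 'string')]
print(timestamp_columns(sample_dtypes))
```

After something like `stringify_timestamps(movies).toPandas()`, each of those columns would still need its own `pd.to_datetime` call on the pandas side.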
