# HW6 - Transforming vote tallies

This homework gives you some insight into the transformations that underlie the voting dashboards you've undoubtedly seen all over the news.  Here we are going to use 2020 Presidential election data.

The dataset for this week contains voting outcomes for different batches of votes in the battleground states.  The data consist of the following:
* The state of the votes
* The time the voting results were reported
* The number of votes in the batch (new_votes)
* The number of those new votes that were for Joe Biden (votes_biden)

The key metric of interest is how each candidate, Biden and Trump, is doing over time in terms of the percentage of votes they're getting with each batch.  Percentages are easier to reason about than raw numbers for a couple of reasons.  One is that the number of votes varies state-to-state and batch-to-batch, so raw numbers alone don't tell you very much.  The other is that people are often thinking about the percentage a candidate needs to take the lead or win the state.  If each batch is hitting at or above that percentage, then the candidate has a good chance of winning the state.
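
As a small illustration of the percentage idea, here is a sketch in plain Python (not pyspark). The helper names `batch_share` and `share_needed_to_tie` are made up for this example, the tie formula assumes a strict two-candidate race, and the deficit and remaining-vote numbers are hypothetical. The batch itself is one of the Arizona rows that appears in the data preview further down.

```python
def batch_share(votes_for, new_votes):
    """Percentage of a batch that went to one candidate (None for an empty batch)."""
    if new_votes == 0:
        return None
    return 100.0 * votes_for / new_votes


def share_needed_to_tie(deficit, remaining):
    """Percentage of the remaining votes a trailing candidate needs to draw level.

    ASSUMPTION: only two candidates receive votes, so s*R - (1 - s)*R = deficit.
    """
    return 100.0 * (deficit + remaining) / (2 * remaining)


# One Arizona batch from the dataset: 6397 new votes, 3127 of them for Biden.
biden_pct = batch_share(3127, 6397)
trump_pct = batch_share(6397 - 3127, 6397)
print(biden_pct, trump_pct)

# Hypothetical scenario: trailing by 10,000 votes with 100,000 votes still to be reported.
needed = share_needed_to_tie(10_000, 100_000)
print(needed)
```

If a candidate's batches keep coming in at or above `needed`, the gap closes; below it, the gap cannot close before the votes run out.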

The goal of this homework is to generate the percentages of votes in each state that are going to Trump and Biden **on an hourly reporting basis**. Each line is a batch of votes that was reported at a given time.  Sometimes there are multiple batches in an hour.  We want to report at a coarser resolution, which is why we group by day and hour.

Given that you have only the variables listed above, you're going to need to do a few things to make this happen:
* Import your data and apply a timestamp
* Make a column for the number of votes received by Trump
* Aggregate your data to get the number of votes for each candidate on a daily and hourly basis
* Create columns of the percentages of votes each candidate has received in the aggregated data

**NOTE 1** - This is by every measure 'small' data, but it needs to be coded in pyspark.

**NOTE 2** - These are all votes that were reported after the initial waves of in-person and early reported mail-in votes.  So if you explore you might see that the total votes in this dataset don't seem to match up with what you would find elsewhere.  That's only because those early votes aren't added in.  These are just batch totals of later reported batches.



**Submission Instruction:**

* Make a copy and replace blank in the title with your name
* Run all the cells
* Download the notebook (.ipynb)
* Submit on Gradescope



In [1]:
!pip install pyspark



In [6]:
!apt-get install openjdk-11-jdk -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  ca-certificates-java fonts-dejavu-core fonts-dejavu-extra java-common
  libatk-wrapper-java libatk-wrapper-java-jni libpcsclite1 libxt-dev libxtst6
  libxxf86dga1 openjdk-11-jdk-headless openjdk-11-jre openjdk-11-jre-headless
  x11-utils
Suggested packages:
  default-jre pcscd libxt-doc openjdk-11-demo openjdk-11-source visualvm
  libnss-mdns fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei
  | fonts-wqy-zenhei fonts-indic mesa-utils
The following NEW packages will be installed:
  ca-certificates-java fonts-dejavu-core fonts-dejavu-extra java-common
  libatk-wrapper-java libatk-wrapper-java-jni libpcsclite1 libxt-dev libxtst6
  libxxf86dga1 openjdk-11-jdk openjdk-11-jdk-headless openjdk-11-jre
  openjdk-11-jre-headless x11-utils
0 upgraded, 15 newly installed, 0 to remove and 41 not upgraded.
Need to get 122 MB of archives.


## Importing your data - 3 points

The URL to import your data is

http://131.193.32.85:9000/mybucket/votes_2020_hw5.txt

You need to import the data with the appropriate options applied.  You will need to apply a timestampFormat option as well, but it's best to import the data first, look at the format, and then create your timestamp string.  Note that the timestamps include fractional seconds. <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html" target="_blank">Remember you can see a full list of datetime formatting patterns here</a>.

Also, although it's a text file you can import it as a CSV and specify a delimiter as an option.  

Call the imported dataframe 'votes'
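
The "look first, then write the format string" workflow can be sketched in plain Python with the standard `csv` and `datetime` modules, standing in for the Spark reader. The sample row below is made up (only its column layout matches the homework file), and the millisecond-style fraction is an assumption, since the full timestamp string is not visible in the truncated previews. In the notebook itself, `votes.show(truncate=False)` would display the full strings.

```python
import csv
import io
from datetime import datetime

# Made-up sample with the same columns as the homework file.
# ASSUMPTION: the fractional part has three digits; the real file may differ.
sample = io.StringIO(
    "state,timestamp,new_votes,votes_biden\n"
    "Alaska (EV: 3),2020-11-09 19:14:05.123,18939,7779.0\n"
)

# Treat the text file as delimited data, just as the CSV reader option does.
rows = list(csv.DictReader(sample, delimiter=","))

# Step 1: look at the raw string before committing to a format.
raw = rows[0]["timestamp"]
print(raw)

# Step 2: write a format that matches it; %f consumes the fractional seconds.
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
print(ts.day, ts.hour, ts.microsecond)
```

Python's `%f` plays the role that the fraction letter `S` plays in Spark's datetime patterns, so under the same assumption the Spark option would look something like `'yyyy-MM-dd HH:mm:ss.SSS'`.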

In [8]:
# Make a filepath
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark import SparkFiles

spark = SparkSession \
    .builder \
    .appName("intro_pyspark") \
    .getOrCreate()

url = 'http://131.193.32.85:9000/mybucket/votes_2020_hw5.txt'
spark.sparkContext.addFile(url)

fp = 'file://'+SparkFiles.get('votes_2020_hw5.txt')

In [9]:
# import as votes (no schema or timestampFormat is applied here, so every column is read as a string)
votes = spark.read.option('header', True).option('delimiter', ',').csv(fp)

In [10]:

# Check data
votes.show()

+----------------+--------------------+---------+-----------+
|           state|           timestamp|new_votes|votes_biden|
+----------------+--------------------+---------+-----------+
|  Alaska (EV: 3)|2020-11-09 19:14:...|    18939|     7779.0|
|  Alaska (EV: 3)|2020-11-04 18:40:...|    39556|    11448.0|
|  Alaska (EV: 3)|2020-11-04 13:28:...|        0|        0.0|
|Arizona (EV: 11)|2020-11-10 02:18:...|     6397|     3127.0|
|Arizona (EV: 11)|2020-11-10 00:51:...|      600|      232.0|
|Arizona (EV: 11)|2020-11-10 00:02:...|      591|       84.0|
|Arizona (EV: 11)|2020-11-09 22:23:...|     3437|     1049.0|
|Arizona (EV: 11)|2020-11-09 21:15:...|      787|      189.0|
|Arizona (EV: 11)|2020-11-09 18:40:...|      424|      250.0|
|Arizona (EV: 11)|2020-11-09 18:30:...|      145|      114.0|
|Arizona (EV: 11)|2020-11-09 18:16:...|      124|       57.0|
|Arizona (EV: 11)|2020-11-09 00:49:...|      327|      180.0|
|Arizona (EV: 11)|2020-11-08 23:35:...|    16739|     6748.0|
|Arizona

## Create number of votes for trump - 3 points

Just like it sounds.  Create a column in votes called 'votes_trump' that has the number of votes he received in the batch.

You'll need to import all the functions from pyspark.sql.functions to do this and later steps.

In [11]:
# make votes_trump
votes = votes.withColumn('votes_trump', col('new_votes') - col('votes_biden'))

In [12]:
# Check
votes.show()

+----------------+--------------------+---------+-----------+-----------+
|           state|           timestamp|new_votes|votes_biden|votes_trump|
+----------------+--------------------+---------+-----------+-----------+
|  Alaska (EV: 3)|2020-11-09 19:14:...|    18939|     7779.0|    11160.0|
|  Alaska (EV: 3)|2020-11-04 18:40:...|    39556|    11448.0|    28108.0|
|  Alaska (EV: 3)|2020-11-04 13:28:...|        0|        0.0|        0.0|
|Arizona (EV: 11)|2020-11-10 02:18:...|     6397|     3127.0|     3270.0|
|Arizona (EV: 11)|2020-11-10 00:51:...|      600|      232.0|      368.0|
|Arizona (EV: 11)|2020-11-10 00:02:...|      591|       84.0|      507.0|
|Arizona (EV: 11)|2020-11-09 22:23:...|     3437|     1049.0|     2388.0|
|Arizona (EV: 11)|2020-11-09 21:15:...|      787|      189.0|      598.0|
|Arizona (EV: 11)|2020-11-09 18:40:...|      424|      250.0|      174.0|
|Arizona (EV: 11)|2020-11-09 18:30:...|      145|      114.0|       31.0|
|Arizona (EV: 11)|2020-11-09 18:16:...

## Aggregate data - 6 points

Time to group your data. You're going to need to group by day, hour, and then state to get the totals.  I showed you how to group by a single column that's not a datetime in the previous lesson.  To group by more than one, you just add them in, separated by commas.  The only issue is that you need to extract the day and hour from the timestamp.  To do this for the day of the month, your grouping variable would be `dayofmonth('timestamp')`.  I'll let you figure out how to do it for the hour.  Grouping by state should be easy.

For the aggregation you need to calculate the sums that will allow you to get the percentage of votes that went to Biden vs Trump in a given hour.

Call the resulting dataframe votes_hourly
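
The grouping logic can be sketched in plain Python to make the arithmetic visible before writing it in pyspark. The batches below are made up, and the dictionary layout is just one convenient choice for the illustration; the key `(day, hour, state)` mirrors the three grouping columns.

```python
from collections import defaultdict
from datetime import datetime

# Made-up batches: (state, timestamp, new_votes, votes_biden).
batches = [
    ("Georgia (EV: 16)", datetime(2020, 11, 5, 15, 5), 1000, 700),
    ("Georgia (EV: 16)", datetime(2020, 11, 5, 15, 40), 200, 100),
    ("Georgia (EV: 16)", datetime(2020, 11, 5, 16, 10), 500, 300),
]

# Sum the counts for every (day, hour, state) combination.
totals = defaultdict(lambda: {"new": 0, "biden": 0, "trump": 0})
for state, ts, new_votes, votes_biden in batches:
    key = (ts.day, ts.hour, state)
    totals[key]["new"] += new_votes
    totals[key]["biden"] += votes_biden
    totals[key]["trump"] += new_votes - votes_biden

for key in sorted(totals):
    print(key, totals[key])
```

The two batches reported during hour 15 collapse into one row, which is exactly the coarser resolution the hourly report needs. Summing counts (rather than percentages) keeps the later percentage calculation correct.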

In [14]:
# make votes_hourly
votes_hourly = votes.groupBy(dayofmonth('timestamp').alias('day'), hour('timestamp').alias('hour'), 'state').agg(sum('new_votes').alias('sum_new_votes'), sum('votes_biden').alias('sum_votes_biden'), sum('votes_trump').alias('sum_votes_trump'))
votes_hourly.show()

+---+----+--------------------+-------------+---------------+---------------+
|day|hour|               state|sum_new_votes|sum_votes_biden|sum_votes_trump|
+---+----+--------------------+-------------+---------------+---------------+
|  6|   6|Pennsylvania (EV:...|       5878.0|         5142.0|          736.0|
|  4|  20|Pennsylvania (EV:...|     154565.0|       126886.0|        27679.0|
|  5|  15|    Georgia (EV: 16)|       1203.0|          824.0|          379.0|
|  6|   0|    Georgia (EV: 16)|      20187.0|        13123.0|         7064.0|
|  6|   1|    Arizona (EV: 11)|      32786.0|        15803.0|        16983.0|
|  5|  23|    Arizona (EV: 11)|      12227.0|         4729.0|         7498.0|
|  4|  14|    Georgia (EV: 16)|         78.0|           67.0|           11.0|
|  7|  23|North Carolina (E...|        199.0|          113.0|           86.0|
|  4|  16|Pennsylvania (EV:...|      94133.0|        68492.0|        25641.0|
|  8|  23|    Arizona (EV: 11)|      23601.0|        10562.0|   

## Calculating percentage of votes per hour - 6 points

Now go and make your columns of the percentage of votes received each hour by each candidate. Call these 'percent_biden' and 'percent_trump'

In [15]:
votes_hourly = votes_hourly.withColumn('percent_biden', col('sum_votes_biden') / col('sum_new_votes') * 100)
votes_hourly = votes_hourly.withColumn('percent_trump', col('sum_votes_trump') / col('sum_new_votes') * 100)
votes_hourly.show()

+---+----+--------------------+-------------+---------------+---------------+------------------+------------------+
|day|hour|               state|sum_new_votes|sum_votes_biden|sum_votes_trump|     percent_biden|     percent_trump|
+---+----+--------------------+-------------+---------------+---------------+------------------+------------------+
|  6|   6|Pennsylvania (EV:...|       5878.0|         5142.0|          736.0| 87.47873426335488|12.521265736645118|
|  4|  20|Pennsylvania (EV:...|     154565.0|       126886.0|        27679.0| 82.09232361789537| 17.90767638210462|
|  5|  15|    Georgia (EV: 16)|       1203.0|          824.0|          379.0|  68.4954280964256|31.504571903574398|
|  6|   0|    Georgia (EV: 16)|      20187.0|        13123.0|         7064.0| 65.00718284044187| 34.99281715955813|
|  6|   1|    Arizona (EV: 11)|      32786.0|        15803.0|        16983.0| 48.20045141218813| 51.79954858781187|
|  5|  23|    Arizona (EV: 11)|      12227.0|         4729.0|         74

## Rounding values - 3 points

Those percent_biden and percent_trump columns have too many digits.  pyspark provides a `round()` function that you can import using `from pyspark.sql.functions import round`.  Import that function and then apply it to those columns so that they have 3 digits after the decimal point.  It's fine to overwrite them with the rounded values.
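
To see what three decimal places looks like on values like these, here is a plain-Python sketch using the built-in `round`. This is an illustration only: pyspark's `round` is a separate column function, and the two can differ on exact ties. The two inputs are unrounded percentages from the hourly table above.

```python
# Two unrounded percent_biden values taken from the hourly output.
pennsylvania = 87.47873426335488
arizona = 48.20045141218813

rounded_pa = round(pennsylvania, 3)
rounded_az = round(arizona, 3)
print(rounded_pa, rounded_az)

# Rounding returns a number, so trailing zeros are not kept;
# showing a fixed number of digits is a formatting step instead.
formatted_az = f"{rounded_az:.3f}"
print(formatted_az)
```

This is why a rounded column can legitimately show a shorter value for some rows: the number is still rounded to three places, it just has no non-zero digits left to display.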

In [16]:
from pyspark.sql.functions import round
votes_hourly = votes_hourly.withColumn('percent_biden', round(col('percent_biden'), 3))
votes_hourly = votes_hourly.withColumn('percent_trump', round(col('percent_trump'), 3))
votes_hourly.show()

+---+----+--------------------+-------------+---------------+---------------+-------------+-------------+
|day|hour|               state|sum_new_votes|sum_votes_biden|sum_votes_trump|percent_biden|percent_trump|
+---+----+--------------------+-------------+---------------+---------------+-------------+-------------+
|  6|   6|Pennsylvania (EV:...|       5878.0|         5142.0|          736.0|       87.479|       12.521|
|  4|  20|Pennsylvania (EV:...|     154565.0|       126886.0|        27679.0|       82.092|       17.908|
|  5|  15|    Georgia (EV: 16)|       1203.0|          824.0|          379.0|       68.495|       31.505|
|  6|   0|    Georgia (EV: 16)|      20187.0|        13123.0|         7064.0|       65.007|       34.993|
|  6|   1|    Arizona (EV: 11)|      32786.0|        15803.0|        16983.0|         48.2|         51.8|
|  5|  23|    Arizona (EV: 11)|      12227.0|         4729.0|         7498.0|       38.677|       61.323|
|  4|  14|    Georgia (EV: 16)|         78.0| 

## SQL query - 9 points

Now let's leverage those SQL powers to summarize our data a bit more within Georgia, which was a critical state for the 2020 election.  I want you to write two queries.  
1. The first should just get all the data from Georgia. You can use "LIKE '%Georgia%'" to match the Georgia state name in your query (you may see negative values because the data has not been cleaned).
2. The second should get the daily percent of votes going to Trump for each state. Keep in mind that the percent of votes will need to be recalculated (i.e., averaging the hourly percentages won't work).

Remember you need to register your dataframe as a table to access it!
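
The shape of both queries can be sketched with Python's built-in `sqlite3` module standing in for `spark.sql`. The rows are made up, the column `naive_avg` and the alias `daily_percent_trump` are names invented for this example, and the table is a trimmed version of the hourly table. In the notebook the table would be referenced through its registered view name instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votes_hourly_table (day INTEGER, hour INTEGER, state TEXT, "
    "sum_new_votes REAL, sum_votes_trump REAL, percent_trump REAL)"
)
# Made-up hourly rows, not taken from the homework data.
conn.executemany(
    "INSERT INTO votes_hourly_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        (5, 15, "Georgia (EV: 16)", 1000.0, 100.0, 10.0),
        (5, 16, "Georgia (EV: 16)", 100.0, 90.0, 90.0),
        (5, 15, "Arizona (EV: 11)", 400.0, 200.0, 50.0),
    ],
)

# Query 1: every row for Georgia, matched by substring.
georgia = conn.execute(
    "SELECT * FROM votes_hourly_table WHERE state LIKE '%Georgia%'"
).fetchall()

# Query 2: daily Trump percentage per state, recomputed from summed counts.
# naive_avg is included only to show how far off averaging percentages can be.
daily = conn.execute(
    """
    SELECT day, state,
           ROUND(100.0 * SUM(sum_votes_trump) / SUM(sum_new_votes), 3) AS daily_percent_trump,
           ROUND(AVG(percent_trump), 3) AS naive_avg
    FROM votes_hourly_table
    GROUP BY day, state
    ORDER BY state
    """
).fetchall()

for row in daily:
    print(row)
```

In the made-up Georgia rows, a large batch at 10% and a small batch at 90% average to 50%, while the vote-weighted daily figure is far lower. That gap is the reason the percentage has to be recalculated from the sums.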

In [17]:
# register the dataframe as a global temp view (replacing any earlier copy) so SQL can query it
votes_hourly.createOrReplaceGlobalTempView('votes_hourly_table')


In [18]:
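# query 1: all hourly rows for one state (this cell filters Arizona by exact match; for Georgia, use WHERE state LIKE '%Georgia%')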
qs = ''' SELECT * FROM global_temp.votes_hourly_table WHERE state = 'Arizona (EV: 11)' '''
spark.sql(qs).show()

+---+----+----------------+-------------+---------------+---------------+-------------+-------------+
|day|hour|           state|sum_new_votes|sum_votes_biden|sum_votes_trump|percent_biden|percent_trump|
+---+----+----------------+-------------+---------------+---------------+-------------+-------------+
|  6|   1|Arizona (EV: 11)|      32786.0|        15803.0|        16983.0|         48.2|         51.8|
|  5|  23|Arizona (EV: 11)|      12227.0|         4729.0|         7498.0|       38.677|       61.323|
|  8|  23|Arizona (EV: 11)|      23601.0|        10562.0|        13039.0|       44.752|       55.248|
|  6|  19|Arizona (EV: 11)|       5043.0|         1077.0|         3966.0|       21.356|       78.644|
|  6|  20|Arizona (EV: 11)|       6666.0|         2730.0|         3936.0|       40.954|       59.046|
|  7|   0|Arizona (EV: 11)|       2976.0|          647.0|         2329.0|       21.741|       78.259|
|  4|  13|Arizona (EV: 11)|          0.0|            0.0|            0.0|         

In [19]:
# daily totals per state, with the Trump percentage recomputed from the summed counts
votes_daily = votes.groupBy(dayofmonth('timestamp').alias('day'), 'state').agg(sum('new_votes').alias('sum_new_votes'), sum('votes_trump').alias('sum_votes_trump'))
votes_daily = votes_daily.withColumn('percent_trump', col('sum_votes_trump') / col('sum_new_votes') * 100)
votes_daily = votes_daily.withColumn('percent_trump', round(col('percent_trump'), 3))
# createOrReplace so that re-running the cell does not fail on an existing view
votes_daily.createOrReplaceGlobalTempView('votes_daily_table')


In [20]:
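# query 2: averages the daily percent_trump values per state (an unweighted mean across days, not a vote-weighted percentage)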
qs = ''' SELECT state, AVG(percent_trump) AS avg_daily_votes_trump FROM global_temp.votes_daily_table GROUP BY state'''
spark.sql(qs).show()

+--------------------+---------------------+
|               state|avg_daily_votes_trump|
+--------------------+---------------------+
|    Arizona (EV: 11)|    59.22071428571429|
|      Nevada (EV: 6)|              39.8425|
|North Carolina (E...|    60.41266666666667|
|    Georgia (EV: 16)|   28.487285714285715|
|      Alaska (EV: 3)|              64.9925|
|Pennsylvania (EV:...|   28.707142857142856|
+--------------------+---------------------+

