## Election Results
An election is held in a city, and every resident may vote for one or more candidates or abstain. Each voter has exactly 1 vote, so a vote cast for multiple candidates is split equally among them. For example, a voter who picks 2 candidates gives each of them 0.5 of a vote. Voters who abstained appear in the dataset with a blank (NULL) candidate.


Find out who received the most votes and won the election. Output the name of the winning candidate, or all of the tied names if there is a tie.
To avoid floating-point errors, round the number of votes each candidate received to 3 decimal places.

Table: `voting_results`
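
Before turning to Spark, the splitting rule can be checked by hand on a tiny example. The following is a minimal plain-Python sketch; the function name `split_votes` and the ballots are hypothetical and are not taken from `voting_results`.

```python
from collections import defaultdict


def split_votes(ballots):
    """Return each candidate's vote total, rounded to 3 decimal places.

    `ballots` maps a voter to the list of candidates they picked;
    an empty list means the voter abstained.
    """
    totals = defaultdict(float)
    for picks in ballots.values():
        if not picks:           # abstention: nothing to distribute
            continue
        share = 1 / len(picks)  # one vote, split equally
        for candidate in picks:
            totals[candidate] += share
    return {name: round(total, 3) for name, total in totals.items()}


# Hypothetical ballots (assumption for illustration only)
ballots = {
    "voter_a": ["X", "Y"],
    "voter_b": ["X"],
    "voter_c": [],
    "voter_d": ["X", "Y", "Z"],
}
totals = split_votes(ballots)
print(totals)  # {'X': 1.833, 'Y': 0.833, 'Z': 0.333}
```

Here `X` collects 0.5 + 1 + 1/3 of a vote, which rounds to 1.833.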

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import pandas as pd

import os
import sys

In [2]:
os.environ['JAVA_HOME'] = "C:/Program Files/Java/jdk-11"

spark = SparkSession.builder.appName('Election Results').getOrCreate()

In [5]:
df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('../Data/voting_results.csv')
    
df.show()

+--------+---------+
|   voter|candidate|
+--------+---------+
|   Kathy|     NULL|
| Charles|     Ryan|
| Charles|Christine|
| Charles|    Kathy|
|Benjamin|Christine|
| Anthony|     Paul|
| Anthony|  Anthony|
|  Edward|     Ryan|
|  Edward|     Paul|
|  Edward|    Kathy|
|   Terry|     NULL|
|   Nancy|     Ryan|
|   Nancy|   Nicole|
|   Nancy|     Paul|
|   Nancy|Christine|
|   Nancy|    Kathy|
|  Evelyn|  Anthony|
|  Evelyn|Christine|
|  Evelyn|     Paul|
|  Evelyn|   Nicole|
+--------+---------+
only showing top 20 rows



In [57]:
# Drop abstentions: rows where the voter chose no candidate.
result = df.filter(F.col('candidate').isNotNull())

result.show()

+--------+---------+
|   voter|candidate|
+--------+---------+
| Charles|     Ryan|
| Charles|Christine|
| Charles|    Kathy|
|Benjamin|Christine|
| Anthony|     Paul|
| Anthony|  Anthony|
|  Edward|     Ryan|
|  Edward|     Paul|
|  Edward|    Kathy|
|   Nancy|     Ryan|
|   Nancy|   Nicole|
|   Nancy|     Paul|
|   Nancy|Christine|
|   Nancy|    Kathy|
|  Evelyn|  Anthony|
|  Evelyn|Christine|
|  Evelyn|     Paul|
|  Evelyn|   Nicole|
| Shirley|     Ryan|
| Shirley|   Nicole|
+--------+---------+
only showing top 20 rows



In [58]:
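# Each voter's single vote is split equally across the candidates they chose.
result1 = result.groupBy('voter').agg(
    (F.lit(1) / F.count('candidate')).alias('rate')
)

# Round only for display; the stored shares stay unrounded for the later sum.
result1.select('voter', F.round('rate', 2).alias('rate')).show()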

+---------+----+
|    voter|rate|
+---------+----+
| Benjamin| 1.0|
|  Matthew| 0.5|
|    Helen|0.25|
|   Evelyn|0.25|
|   Nicole| 1.0|
|   Edward|0.33|
|   Martha| 0.5|
|  Charles|0.33|
|     Alan| 1.0|
|    Bobby| 0.5|
|   Andrew| 0.5|
|  Anthony| 0.2|
|    Kevin|0.33|
|    Kathy|0.25|
|    Nancy| 0.2|
|    Marie| 1.0|
|  Shirley|0.25|
|Christine| 0.5|
|     Ryan| 1.0|
+---------+----+
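
The per-voter share feeds a sum, so the point at which rounding happens matters. A small plain-Python sketch with hypothetical numbers (not from `voting_results`) compares rounding each share to 2 decimals before summing with rounding only the final total:

```python
# Hypothetical case: three voters each split their vote across the same
# three candidates, so every candidate should end up with exactly 1 vote.
n_voters = 3
n_picks = 3

exact_share = 1 / n_picks              # 0.333...
rounded_share = round(exact_share, 2)  # 0.33

# Round once, at the end, to 3 decimal places as the task suggests.
total_round_late = round(n_voters * exact_share, 3)
# Round every share first, then sum.
total_round_early = round(n_voters * rounded_share, 3)

print(total_round_late)   # 1.0
print(total_round_early)  # 0.99
```

Rounding early loses 0.01 of a vote per candidate in this toy case, which could be enough to create or break a tie in a close election.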



In [71]:
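# Attach each voter's share to every ballot row, then total the shares per candidate.
vote_totals = (
    result.alias('t1')
    .join(result1.alias('t2'), F.col('t1.voter') == F.col('t2.voter'), how='inner')
    .groupBy('candidate')
    .agg(F.round(F.sum('rate'), 3).alias('total_vote_rate'))
)

# Rank by total and keep every candidate in first place, so ties are all returned.
final_result = (
    vote_totals
    .withColumn('rnk', F.dense_rank().over(Window.orderBy(F.col('total_vote_rate').desc())))
    .filter(F.col('rnk') == 1)
    .select('candidate')
)

final_result.show()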

+---------+
|candidate|
+---------+
|Christine|
+---------+
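
The task asks for every top candidate when totals tie. The selection rule, keep all candidates whose total equals the maximum, can be sketched in plain Python. The helper name `winners` and the totals are hypothetical; in PySpark a window ranking such as `F.dense_rank()` can express the same idea.

```python
def winners(totals):
    """Return every candidate whose total equals the maximum, sorted by name."""
    if not totals:
        return []
    top = max(totals.values())
    return sorted(name for name, votes in totals.items() if votes == top)


# Hypothetical rounded totals (assumption for illustration only)
tied = winners({"X": 2.5, "Y": 2.5, "Z": 1.333})
single = winners({"X": 3.0, "Y": 2.5})
print(tied)    # ['X', 'Y']
print(single)  # ['X']
```

Comparing totals with `==` is only safe here because they are assumed to be rounded to 3 decimal places first.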

