# Project in Data Intensive Computing
Authors: Alex Hermansson and Elin Samuelsson

## Blabla Political Parties

In [100]:
import sys
!{sys.executable} -m pip install pyspark


## SparkSession

In this cell, we initialize the SparkSession and define some useful variables, such as the paths to the JSON files containing the votes (and other metadata).

In [101]:
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("DataIntensive project").getOrCreate()

top_dir = "../data"
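# os.listdir returns entries in arbitrary order; sort them so that the
# question numbering is the same on every run.
paths = sorted(os.path.join(top_dir, path)
               for path in os.listdir(top_dir)
               if path.endswith(".json"))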

## Data Cleaning

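In the cell below, we clean the JSON files. The cleaning drops the first three and last three lines of each file and rewrites the remaining records so that each one sits on its own line, which is the layout spark.read.json reads by default.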

In [None]:
def clean_json(path_to_file):
    with open(path_to_file, "r") as f:
        lines = f.readlines()

    # Drop the first three and last three lines, which wrap the records.
    lines_str = ''.join(lines[3:-3])

    # Re-split the text so that each record ends up on its own line.
    records = lines_str.split("},")
    text = ''.join([record + "}" for record in records[:-1]])
    text = text.replace('\n', '')
    text = text.replace('}', '}\n')
    text = text.replace('  ', '')
    text = text.replace(',', ', ')

    ## Move this into a map function?
    # Replace only whole quoted values, so that names containing "Ja"
    # are left intact.
    text = text.replace('"Ja"', '"1"')
    text = text.replace('"Nej"', '"0"')
    text = text.replace('"Avstår"', '"-1"')
    text = text.replace('"Frånvarande"', '"-2"')

    with open(path_to_file, "w") as f:
        f.write(text)
            
def clean_data():
    for path in paths:
        clean_json(path)

## Only run the following function if the data needs cleaning. If the files are already clean, running it again corrupts them.
# clean_data()
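
The warning above comes from rewriting each file in place with string operations. A minimal sketch of a more forgiving alternative is given below. It rests on assumptions the notebook does not confirm: that each raw file is a valid UTF-8 JSON document, and that the records sit in a list nested under some wrapper keys. The helper `to_json_lines` and its `record_keys` parameter are hypothetical names introduced for illustration only.

```python
import json


def to_json_lines(src_path, dst_path, record_keys):
    """Copy the records nested under record_keys in src_path to dst_path,
    writing one JSON object per line. Returns the number of records."""
    # Assumption: the raw file is a single, valid JSON document in UTF-8.
    with open(src_path, "r", encoding="utf-8") as f:
        node = json.load(f)

    # Walk down the (assumed) wrapper keys to reach the list of records.
    for key in record_keys:
        node = node[key]

    with open(dst_path, "w", encoding="utf-8") as f:
        for record in node:
            # ensure_ascii=False keeps characters such as å, ä and ö readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(node)
```

Because the output goes to a separate file and the source is never modified, running the conversion twice gives the same result instead of damaging already cleaned data.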

## Static Information

Here we store the information that does not change between voting rounds, such as names and political parties. We also rename the Swedish column names to English ones and replace the year of birth with the member's age in 2018.

In [88]:
df_ = spark.read.json(paths[0])  # the entries in paths already include top_dir
df_info = df_.select(df_["namn"].alias("name"), 
                     df_["parti"].alias("party"),
                     df_["kon"].alias("sex"),
                     df_["valkrets"].alias("constituency"),
                     (2018 - df_["fodd"]).cast("int").alias("age")  # cast first, then name the column
                    )
df_info.show()

+--------------------+-----+------+--------------------+---+
|                name|party|   sex|        constituency|age|
+--------------------+-----+------+--------------------+---+
|      Andreas Norlén|    M|   man|   Östergötlands län| 45|
|     Ulrika Carlsson|    C|kvinna|Västra Götalands ...| 53|
| Margareta Cederfelt|    M|kvinna|   Stockholms kommun| 59|
|   Christina Östberg|   SD|kvinna|          Kalmar län| 50|
|   Cecilia Magnusson|    M|kvinna|    Göteborgs kommun| 56|
|     Penilla Gunther|   KD|kvinna|Västra Götalands ...| 54|
|      Jonas Eriksson|   MP|   man|          Örebro län| 51|
|          Per Åsling|    C|   man|       Jämtlands län| 61|
|      Peter Jeppsson|    S|   man|        Blekinge län| 50|
|         Lawen Redar|    S|kvinna|   Stockholms kommun| 29|
|      1n R Andersson|    M|   man|          Kalmar län| 48|
|    Robert Stenkvist|   SD|   man|Västra Götalands ...| 60|
|      Johan Forssell|    M|   man|   Stockholms kommun| 39|
|Annika Hirvonen Falk|  

## Creating Features

Below, we merge the votes from the different rounds into our final DataFrame. It holds all the "static" information for each parliament member, together with their votes.
We have chosen to encode the votes as follows:
- "Yes" as 1,
- "No" as 0,
- "Abstain" as -1,
- "Absent" as -2.
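
The encoding above can be written as a lookup table, in the spirit of the commented-out `vote_to_int` dictionary in the next cell. The following is a minimal sketch in plain Python. The helper name `encode_vote` is hypothetical, and raising an error on unknown labels is an assumption rather than something the notebook does.

```python
# Swedish vote labels mapped to the integer codes listed above.
VOTE_TO_INT = {"Ja": 1, "Nej": 0, "Avstår": -1, "Frånvarande": -2}


def encode_vote(vote):
    """Return the integer code for a Swedish vote label."""
    try:
        return VOTE_TO_INT[vote]
    except KeyError:
        # Assumption: an unexpected label should fail loudly instead of
        # being silently kept as text.
        raise ValueError("unknown vote label: %r" % (vote,))


print([encode_vote(v) for v in ["Ja", "Nej", "Avstår", "Frånvarande"]])
# prints [1, 0, -1, -2]
```

Encoding per value like this only touches the vote field, so other text that happens to contain a vote label is unaffected.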

In [96]:
# vote_to_int = {"Ja": 1, "Nej": 0, "Avstår": -1, "Frånvarande": -2}
# .rdd.map(lambda vote: vote_to_int[vote])
df = df_info
for question_number, path in enumerate(paths[:10], 1):
    column_name = "q%s" % question_number
    df_i = spark.read.json(path)  # path already includes top_dir
    df_vote = df_i.select(df_i["namn"].alias("name"), df_i["rost"].alias(column_name))
    df = df.join(df_vote, "name")
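
Since `df.join(df_vote, "name")` performs an inner join by default, a member who is missing from any one round drops out of the final DataFrame. The plain-Python sketch below models that behaviour on small dictionaries. It is a simplified illustration, not the Spark implementation: `merge_votes` is a hypothetical helper, and the model assumes that names are unique.

```python
def merge_votes(info_rows, vote_rounds):
    """Model of the repeated inner join on "name".

    info_rows:   list of dicts, each with a "name" key (static information).
    vote_rounds: list of {name: vote} dicts, one per voting round.
    """
    merged = []
    for row in info_rows:
        name = row["name"]
        # Inner-join semantics: keep a member only if every round has a vote.
        if all(name in votes for votes in vote_rounds):
            out = dict(row)
            for i, votes in enumerate(vote_rounds, 1):
                out["q%s" % i] = votes[name]
            merged.append(out)
    return merged


info = [{"name": "A", "party": "M"}, {"name": "B", "party": "S"}]
rounds = [{"A": 1, "B": 0}, {"A": -1}]  # "B" is missing from round 2
print(merge_votes(info, rounds))
# prints [{'name': 'A', 'party': 'M', 'q1': 1, 'q2': -1}]
```

If members with missing rounds should be kept, a left join with an explicit fill value would be one option to consider.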

In [97]:
df.show()

+--------------------+-----+------+--------------------+---+---+---+---+---+---+---+---+---+---+---+
|                name|party|   sex|        constituency|age| q1| q2| q3| q4| q5| q6| q7| q8| q9|q10|
+--------------------+-----+------+--------------------+---+---+---+---+---+---+---+---+---+---+---+
|      Andreas Norlén|    M|   man|   Östergötlands län| 45| -1|  0|  1|  1|  0|  1|  0|  1| -2|  1|
|     Ulrika Carlsson|    C|kvinna|Västra Götalands ...| 53| -1|  0| -1|  1|  1|  1| -2|  0|  1|  1|
| Margareta Cederfelt|    M|kvinna|   Stockholms kommun| 59| -1|  0| -2|  1| -2|  1|  0|  1|  1| -2|
|   Christina Östberg|   SD|kvinna|          Kalmar län| 50|  0|  1|  0|  0|  1|  1|  1|  1|  1|  0|
|   Cecilia Magnusson|    M|kvinna|    Göteborgs kommun| 56| -1|  0|  1|  1|  0|  1|  0|  1|  1|  1|
|     Penilla Gunther|   KD|kvinna|Västra Götalands ...| 54| -1|  0| -2| -2|  1|  0|  0|  1| -2|  1|
|      Jonas Eriksson|   MP|   man|          Örebro län| 51|  1|  1|  1| -2| -2|  1|  1|  1