## Student assignment

The CBS dataset consists of student enrolments and graduations in the period 2015-2020.
The original question A concerned the period 2008-2015. Because that data is no longer available, the period 2015-2020 is used instead.

Assignment:
- a) Compute the probability that a student drops out twice from a technical programme in the period 2015-2020
- b) Compute a belief distribution for the probability that a student is both a graduate and a dropout
- c) Compute the probability that a Business Studies student subsequently enrols in a programme in the social domain, within 2 years afterwards.



In [1]:
import pyspark
import findspark
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import Row
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

In [2]:
# Initialise findspark so that the local Spark installation can be found
findspark.init()

# Create spark session
spark = SparkSession.builder.master("local").appName("Linear Regression Model").config("spark.executor.memory", "1gb").getOrCreate()

In [3]:
# Load spark context entry
sc = spark.sparkContext

## Preprocessing
In this step we combine the intake data with the data on graduates over the same period.
Two dataframes are loaded and joined.
The result is a dataset that shows, per province, municipality, programme and year, how much intake and outflow there is.

In [4]:
# Loading both text files
data_intake = sc.textFile("04-inschrijvingen-hbo-2020.csv")
data_outgoing = sc.textFile("05-gediplomeerden-hbo-2020.csv")

In [5]:
# Drop the header line and split every remaining line on the delimiter ';'
# This turns each RDD from lines of text into lists of column values
header_in = data_intake.first()
header_out = data_outgoing.first()

data_intake = data_intake.filter(lambda line: line != header_in).map(lambda line: line.lstrip(";").split(";"))
data_outgoing = data_outgoing.filter(lambda line: line != header_out).map(lambda line: line.lstrip(";").split(";"))
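To make the parsing step concrete, here is a small plain-Python sketch of the same header filter and split, run on a few made-up lines (the sample values and column names are hypothetical, not taken from the CBS files). It also shows one thing worth checking against the real data: `lstrip(";")` removes every leading semicolon, so a line whose first field is empty ends up with fewer fields and shifted indices.

```python
# Hypothetical sample lines; the real CBS files have more columns.
lines = [
    "PROVINCIE;GEMEENTENUMMER;OPLEIDING",   # header
    "Groningen;0014;B HBO-ICT",
    ";0034;B Ocean Technology",             # first field is empty
]

header = lines[0]

# Same logic as the RDD pipeline: drop the header, strip leading ';', split on ';'.
parsed = [line.lstrip(";").split(";") for line in lines if line != header]

# For comparison: splitting without lstrip keeps the empty first field.
unstripped = [line.split(";") for line in lines if line != header]

for row, raw in zip(parsed, unstripped):
    print(len(row), row, "| without lstrip:", len(raw), raw)
```

If the real files contain rows with an empty first field, the positional indices used in `Row(...)` would point at different columns for those rows. Whether that happens depends on the data.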

In [6]:
# For both RDDs we build a Spark DataFrame, which allows further processing
df_intake = data_intake.map(lambda row: Row(municipality_nr=row[1],
                                            chrono_compartment=row[7],
                                            ed_code_actual=row[9],
                                            ed_name_actual=row[10],
                                            ed_form=row[11],
                                            gender=row[12],
                                            in_2016=row[13],
                                            in_2017=row[14],
                                            in_2018=row[15],
                                            in_2019=row[16],
                                            in_2020=row[17]
                                            )).toDF()

df_outgoing = data_outgoing.map(lambda row: Row(municipality_nr=row[1],
                                            chrono_compartment=row[6],
                                            ed_code_actual=row[8],
                                            ed_name_actual=row[9],
                                            ed_form=row[10],
                                            gender=row[12],
                                            out_2015=row[13],
                                            out_2016=row[14],
                                            out_2017=row[15],
                                            out_2018=row[16],
                                            out_2019=row[17]
                                            )).toDF()

In [7]:
df = df_intake.join(df_outgoing, ["municipality_nr", "chrono_compartment" ,"ed_code_actual", "ed_name_actual", "ed_form", "gender"], how="inner")
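The inner join keeps only key combinations that occur in both dataframes. A minimal plain-Python sketch of that behaviour follows. The records and the shortened three-part key are hypothetical and only illustrate the semantics; they are not the actual CBS rows.

```python
# Hypothetical records keyed by (municipality_nr, ed_code_actual, gender).
intake = {
    ("0034", "30020", "man"): {"in_2016": 239},
    ("0034", "30020", "vrouw"): {"in_2016": 31},
}
outgoing = {
    ("0034", "30020", "man"): {"out_2016": 24},
    ("0080", "49302", "man"): {"out_2016": 3},
}

# Inner join: keep keys present on both sides and merge their columns.
joined = {
    key: {**intake[key], **outgoing[key]}
    for key in intake.keys() & outgoing.keys()
}

print(joined)
```

Rows that exist in only one file, such as a programme with intake but no graduates yet, are dropped by `how="inner"`. Passing `how="outer"` to the PySpark join would keep them, with nulls for the missing side.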

In [8]:
# Cast the count columns from string to integer so they can be summed
# Note: out_2015 is not cast here, so it remains a string column
df = df.withColumn("in_2016", df["in_2016"].cast(IntegerType())) \
    .withColumn("in_2017", df["in_2017"].cast(IntegerType())) \
    .withColumn("in_2018", df["in_2018"].cast(IntegerType())) \
    .withColumn("in_2019", df["in_2019"].cast(IntegerType())) \
    .withColumn("in_2020", df["in_2020"].cast(IntegerType())) \
    .withColumn("out_2016", df["out_2016"].cast(IntegerType())) \
    .withColumn("out_2017", df["out_2017"].cast(IntegerType())) \
    .withColumn("out_2018", df["out_2018"].cast(IntegerType())) \
    .withColumn("out_2019", df["out_2019"].cast(IntegerType()))

print(df.count())
df.show(50)

4035
+---------------+--------------------+--------------+--------------------+------------------+------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+
|municipality_nr|  chrono_compartment|ed_code_actual|      ed_name_actual|           ed_form|gender|in_2016|in_2017|in_2018|in_2019|in_2020|out_2015|out_2016|out_2017|out_2018|out_2019|
+---------------+--------------------+--------------+--------------------+------------------+------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+
|           0034|           onderwijs|         44103| M Educational Needs|deeltijd onderwijs|   man|      0|      4|      8|      7|      2|       0|       0|       0|       2|       5|
|           0034|            techniek|         30020|           B HBO-ICT| voltijd onderwijs|   man|    239|    281|    344|    427|    477|      23|      24|      25|      30|      40|
|           0080|  sectoroverstijgend|         49302|M Design Dri

In [12]:
# Now that the individual records from both sheets are combined, we can aggregate away the differences in municipality_nr, ed_form and gender
# We group the rows where chrono_compartment, ed_code_actual and ed_name_actual are equal, and sum the numeric in_* and out_* fields
df_sum = df.groupBy([df.chrono_compartment,
                     df.ed_code_actual,
                     df.ed_name_actual]).sum().drop(df.ed_code_actual).drop(df.ed_form)
df_sum.show()

+--------------------+--------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+
|  chrono_compartment|      ed_name_actual|sum(in_2016)|sum(in_2017)|sum(in_2018)|sum(in_2019)|sum(in_2020)|sum(out_2016)|sum(out_2017)|sum(out_2018)|sum(out_2019)|
+--------------------+--------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+
|     gezondheidszorg|B Oefentherapie C...|         247|         206|         213|         172|         168|           39|           55|           38|           26|
|           onderwijs|B Opleiding tot l...|          57|          58|          59|          71|          74|           12|           11|           11|           16|
|     gezondheidszorg|M Sport- en Bewee...|           0|           0|          51|          54|          57|            0|            0|           10|            9|
|     taal

### a) Compute the probability that a student drops out twice from a technical programme in the period 2015-2020

In [15]:
df_tech = df_sum.filter(df_sum.chrono_compartment == "techniek")
df_tech.show()

+------------------+--------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+
|chrono_compartment|      ed_name_actual|sum(in_2016)|sum(in_2017)|sum(in_2018)|sum(in_2019)|sum(in_2020)|sum(out_2016)|sum(out_2017)|sum(out_2018)|sum(out_2019)|
+------------------+--------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+
|          techniek|  B Ocean Technology|         102|         105|          90|          93|          94|           14|           18|           13|           19|
|          techniek|B Maritieme Techniek|         357|         370|         369|         391|         413|           40|           44|           47|           41|
|          techniek|    M Serious Gaming|           0|           9|          24|          31|          41|            0|            0|            5|            7|
|          techniek|Ad

In [18]:
df_tech = df_tech.drop(df_tech.ed_name_actual)
df_tech = df_tech.groupBy(df_tech.chrono_compartment).sum()
df_tech.show()

+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+------------------+
|chrono_compartment|sum(sum(in_2016))|sum(sum(in_2017))|sum(sum(in_2018))|sum(sum(in_2019))|sum(sum(in_2020))|sum(sum(out_2016))|sum(sum(out_2017))|sum(sum(out_2018))|sum(sum(out_2019))|
+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+------------------+
|          techniek|            89648|            93574|            96228|            97312|           101427|             12573|             13365|             13959|             13963|
+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+------------------+



In [None]:
# We now see all the incoming and outgoing students for each year in the data set for technical studies
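As a possible next step, the yearly totals from the `techniek` row can be turned into a graduate-to-enrolment ratio per year. The sketch below is plain Python and copies the numbers from that row. Pairing `in_YYYY` with `out_YYYY` of the same year is an assumption, and so is any interpretation of the ratio: the table counts enrolments and diplomas per year and does not follow individual students, so the ratio is not a dropout probability by itself.

```python
# Totals copied from the techniek row of the summed table.
enrolled = {2016: 89648, 2017: 93574, 2018: 96228, 2019: 97312}
graduated = {2016: 12573, 2017: 13365, 2018: 13959, 2019: 13963}

# Assumption: compare enrolments and diplomas of the same calendar year.
ratio = {year: graduated[year] / enrolled[year] for year in enrolled}

for year, value in ratio.items():
    print(year, round(value, 4))
```

Answering question a) would additionally need data that links the successive enrolments of one student, which these aggregated counts do not appear to provide.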

