# Dutch primary school analytics
 
## Overview

In the Netherlands, children start primary school the day after they turn 4. Before that, when the child is 2.5 or 3 years old, parents have to choose a school, ideally the best one for their child. This is quite a problem for non-native Dutch parents like me: I have no idea what is important, what I should think about, or what I should look at in this process. For example, there are many types of school, such as 'Openbaar', 'Rooms-Katholiek', 'Protestants-Christelijk', etc. For someone who has never encountered this system before, it's not clear which one to choose and which one would suit my child. On the other hand, I want to make a decision based on objective data, such as final test scores, how many pupils from the school were recommended for each 'level' of further education, or the 'quality' of pupils' knowledge.

As a result of all these thoughts I realized that I have a problem: I don't understand whether I need to choose a type of school alongside the other characteristics, or whether the type can be explained by those characteristics. If it can, I can just choose the characteristics I want my child's school to have, and they will define the best type of school. In other words, is it possible to build a classifier that identifies whether a school is Openbaar, Rooms-Katholiek or of another denomination, based on the other characteristics we have?
I found and pooled into one dataset the most important information I want to know about my child's future school. So now I want to know: do I need to choose a denomination (and spend time investigating what each denomination means in Dutch culture, how they compare, and so on)? Or can I just choose the other characteristics (which look clearer to me) and let them define the denomination? I will build a few classifiers based on different ML algorithms, try to choose the best of them, and see whether the best classifier works well enough to predict the denomination of a primary school.

## Data preparation

Let's start by reading the dataset I constructed, stored in the file 'Score.csv':

In [None]:
import pandas as pd 

# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
data_score = pd.read_csv("Score.csv", on_bad_lines='skip', sep=',', encoding="ISO-8859-1")
#data_score = data_score[data_score['DATUM'] == 2018]
print(data_score.shape)
print(data_score.columns)
data_score.head() 

There are a lot of variables here that describe the school's address in different ways, so we need to choose one of them. I think we can keep the 'GEMEENTENUMMER' column as the variable that describes location and remove the rest of the address columns, as well as 'BEVOEGD_GEZAG_NUMMER'. The cell below also drops 'SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING' and 'ZITPERC':

In [None]:
columns = ['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING', 'PLAATSNAAM', 'GEMEENTENAAM', 
           'PROVINCIE', 'BEVOEGD_GEZAG_NUMMER', 'ZITPERC']
print(data_score.shape)
data_score.drop(columns=columns, inplace=True)  # the positional axis argument was removed in pandas 2.0
print(data_score.shape)
print(data_score.columns)
data_score.head()

The resulting dataset has 3 categorical columns. Before we start modelling, we need to convert all categorical variables to numerical ones. Let's use ***LabelEncoder*** this time:

In [None]:
from sklearn.preprocessing import LabelEncoder

data_score['SOORT_PO'] = LabelEncoder().fit_transform(data_score['SOORT_PO'])

data_score['DENOMINATIE_VESTIGING'] = LabelEncoder().fit_transform(data_score['DENOMINATIE_VESTIGING'])

data_score['EXAMEN'] = LabelEncoder().fit_transform(data_score['EXAMEN'])
data_score.head()
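
One thing to keep in mind: `LabelEncoder().fit_transform(...)` as used above throws the fitted encoder away, so the integer codes cannot be mapped back to denomination names later. A minimal sketch of keeping the encoder around, on a made-up toy frame (the rows and the `denom_encoder` and `DENOMINATIE_CODE` names are illustrative, not taken from 'Score.csv'):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "DENOMINATIE_VESTIGING": ["Openbaar", "Rooms-Katholiek", "Openbaar", "Protestants-Christelijk"]
})

# Keep a reference to the fitted encoder instead of discarding it.
denom_encoder = LabelEncoder()
toy["DENOMINATIE_CODE"] = denom_encoder.fit_transform(toy["DENOMINATIE_VESTIGING"])

# classes_ is sorted; the position of a name in it is its integer code.
code_to_name = dict(enumerate(denom_encoder.classes_))
print(code_to_name)

# inverse_transform maps codes (e.g. model predictions) back to names.
decoded = denom_encoder.inverse_transform(toy["DENOMINATIE_CODE"])
print(list(decoded))
```

With the encoder kept, predictions from any of the classifiers below could be translated back into readable denomination names.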

## Building classification models

Let's try to build classification models using a few different ML algorithms: Random Forest, KNN, SVM and Naive Bayes (the XGBoost import is left commented out for now).

First of all, we need to split the data into training and test subsets:

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
#from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

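# Target: the (label-encoded) denomination of each school
pred = data_score['DENOMINATIE_VESTIGING']

data_score.drop(columns='DENOMINATIE_VESTIGING', inplace=True)

random_state = 4004

# Scale features to [0, 1]; keep a DataFrame so column names and .head() still work
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data_score), columns=data_score.columns, index=data_score.index)

# Split the scaled features into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(data, pred, train_size=0.7, test_size=0.3, random_state=random_state)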
X_train.head()
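
Note that the scaler above is fitted on the full dataset before the split, so the test rows influence the scaling. A possible alternative, sketched here under the assumption that synthetic data from `make_classification` can stand in for the school data, is to wrap the scaler and the model in a pipeline and to use `cross_val_score` (already imported above) for a more stable estimate:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

random_state = 4004

# Synthetic stand-in for (data_score, pred): 300 rows, 8 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=random_state)

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=random_state)

# The pipeline fits the scaler on the training rows only, then the classifier.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 5-fold cross-validation on the training part; scaling is refitted in each fold.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Test accuracy: %.2f" % test_accuracy)
print("CV accuracy: %.2f +/- %.2f" % (cv_scores.mean(), cv_scores.std()))
```

The same pattern works with any of the classifiers used in this notebook in place of `KNeighborsClassifier`.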

### Random forest

In [None]:
rfc = RandomForestClassifier(n_estimators=100, random_state = random_state)
rfc.fit(X_train, y_train)
pred_test = rfc.predict(X_test)

print("Random Forest's accuracy is: %3.2f" % (100 * rfc.score(X_test, y_test)))
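
An accuracy figure on its own is hard to judge: if one denomination dominates the data, always predicting it may already score high. The sketch below compares a Random Forest with a `DummyClassifier` baseline and prints the forest's feature importances. It runs on synthetic, deliberately unbalanced data, so the numbers say nothing about the real schools:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

random_state = 4004

# Synthetic, deliberately unbalanced stand-in for the school data (3 classes).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=random_state)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=random_state)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
rfc = RandomForestClassifier(n_estimators=100, random_state=random_state)
rfc.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
rfc_acc = rfc.score(X_test, y_test)
print("Baseline accuracy: %.2f" % (100 * baseline_acc))
print("Random Forest accuracy: %.2f" % (100 * rfc_acc))

# Impurity-based importances sum to 1; larger means the feature was used more.
for idx, importance in enumerate(rfc.feature_importances_):
    print("feature %d: %.3f" % (idx, importance))
```

On the real data, the importances would hint at which school characteristics carry the most information about the denomination.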

### K Nearest Neighbors 

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
pred_test = knn.predict(X_test)

print("KNN's accuracy is: %3.2f" % (100 * knn.score(X_test, y_test)))

### Support Vector Machine

In [None]:
svc = SVC()
svc.fit(X_train, y_train)
pred_test = svc.predict(X_test)
print("SVC's accuracy is: %3.2f" % (100 * svc.score(X_test, y_test)))
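
Accuracy hides which denominations get confused with each other. The metrics imported earlier (`confusion_matrix`, `precision_score`, `recall_score`, `f1_score`) can show that. A small sketch on hand-made labels for three hypothetical denomination codes (in the notebook, `y_test` and `pred_test` would take the place of `y_true` and `y_pred`):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hand-made labels for three hypothetical denomination codes (0, 1, 2).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# average="macro" weights every class equally, however rare it is.
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print("precision: %.2f, recall: %.2f, F1: %.2f" % (precision, recall, f1))
```

Macro averaging is one reasonable choice when rare denominations matter as much as common ones; `average="weighted"` would follow the class frequencies instead.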

### Naive Bayes classifier 

In [None]:
from sklearn.preprocessing import MaxAbsScaler

scaler_gnb = MaxAbsScaler()
data = scaler_gnb.fit_transform(data_score)

X_train_gnb, X_test_gnb, y_train_gnb, y_test_gnb = train_test_split(data, pred, train_size=0.7, test_size=0.3, random_state=random_state)

In [None]:
gnb = GaussianNB()
gnb.fit(X_train_gnb, y_train_gnb)
pred_test = gnb.predict(X_test_gnb) 

In [None]:
from sklearn.metrics import accuracy_score

# Score against the split that the Naive Bayes model was trained and tested on
accuracy = 100 * accuracy_score(y_test_gnb, pred_test)
print("Gaussian Naive Bayes accuracy is: %3.2f" % (accuracy))

Now I would like to look at the values of the different variables: how common are they? It may be that a single value occurs in most rows, in which case we can remove that column as well (because it gives us almost no information):

In [None]:
n_rows = len(data_score)
one_value_columns = []
threshold = 95
for column in data_score:
    common_value_count = data_score[column].value_counts().iloc[0]
    perc = (100 * common_value_count)/n_rows
    if(perc > threshold):
        one_value_columns.append(column)
        print("{}: {:.1f}%".format(column, perc))
        print(data_score[column].value_counts())
one_value_columns

In [None]:
print(data_score.shape)
data_score.drop(columns=one_value_columns, inplace=True)
print(data_score.shape)
print(data_score.columns)
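
The same check can be written more compactly with `value_counts(normalize=True)`. This is only a sketch: `dominant_value_share` is a hypothetical helper, the toy frame is made up, and unlike the loop above it ignores missing values when computing the share:

```python
import pandas as pd

def dominant_value_share(df):
    """Return, per column, the share (in %) of its most common value."""
    # value_counts sorts by frequency, so the first entry is the most common value.
    return df.apply(lambda col: 100 * col.value_counts(normalize=True).iloc[0])

# Toy frame: 'flag' is 0 in 19 of 20 rows, 'code' is different in every row.
toy = pd.DataFrame({"flag": [0] * 19 + [1], "code": list(range(20))})

shares = dominant_value_share(toy)
threshold = 90
near_constant = shares[shares > threshold].index.tolist()
print(shares)
print(near_constant)
```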