In [108]:
import matplotlib.pyplot as plot
import numpy as np
import pandas as pd
import re
from scipy import stats

In [109]:
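# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement.
matrix = pd.read_csv("csv/data-original.csv", quotechar='"', skipinitialspace=True).to_numpy()
# Keep only the column at index 13, which holds the start-year answers.
matrix = matrix[:, 13]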
print(matrix)

[nan '2012' '2013' 'September 2013' '2013' 'September 2013' '01-09-2012'
 '2012' '2012' '2013' '2012' 'Two thousend twelve' 'Tilburg' '2013' '2013'
 '01/09/2012' '2013' '2012' '2013' "Feb '13" '2013' '2013' '2012' '2013'
 '2013' '2012' '3 years ago' '2013' '2013' 'February 2013' '2012' '2013'
 '2012' '2012' '2012' 'September 2012' '2013' 'Long Time agoo' '2012'
 'feb 2012' 'Long Time agoo' '2012' '2013' 'Long Time agoo' '2012' '2013'
 '2012' 'none' '2012' '2013' '2013' '2 years ago' 'sdfasv' nan nan nan
 '1111' '2012' '2012' '1321' '2012' 'Trump' 'Fall, 2012' '2013' '2014'
 '2013' '2013' '2013' '6 years ago' 'September 2013' '2012-2013'
 'February of 2014 ' '2012' '4 years ago' '2013' '2013' 'Que duemiladodici'
 '20' '2012-2013' '2013' '2013' 'September 2013' '2012' '2013' '2011'
 '2014' '2014' '2013' '2012' '2014' 'In 2013' '2014' '2014'
 'September 2013' '2013' '2013' nan '2012' '2012' '2014' '2014' '2014'
 'September 2014' '2012' '2012' '2014' '2014' 'September 2014' '09/2014'
 '201

## Preparation

To prepare the data for filtering, I remove all `nan` values before running the regular expression on the data.

In [110]:
matrix = matrix[~pd.isnull(matrix)]
print(matrix)

['2012' '2013' 'September 2013' '2013' 'September 2013' '01-09-2012' '2012'
 '2012' '2013' '2012' 'Two thousend twelve' 'Tilburg' '2013' '2013'
 '01/09/2012' '2013' '2012' '2013' "Feb '13" '2013' '2013' '2012' '2013'
 '2013' '2012' '3 years ago' '2013' '2013' 'February 2013' '2012' '2013'
 '2012' '2012' '2012' 'September 2012' '2013' 'Long Time agoo' '2012'
 'feb 2012' 'Long Time agoo' '2012' '2013' 'Long Time agoo' '2012' '2013'
 '2012' 'none' '2012' '2013' '2013' '2 years ago' 'sdfasv' '1111' '2012'
 '2012' '1321' '2012' 'Trump' 'Fall, 2012' '2013' '2014' '2013' '2013'
 '2013' '6 years ago' 'September 2013' '2012-2013' 'February of 2014 '
 '2012' '4 years ago' '2013' '2013' 'Que duemiladodici' '20' '2012-2013'
 '2013' '2013' 'September 2013' '2012' '2013' '2011' '2014' '2014' '2013'
 '2012' '2014' 'In 2013' '2014' '2014' 'September 2013' '2013' '2013'
 '2012' '2012' '2014' '2014' '2014' 'September 2014' '2012' '2012' '2014'
 '2014' 'September 2014' '09/2014' '2013' 'August 2016' 'Sep

## Filtering

Because most respondents entered a plain year, I only keep values that start with `20`. I am confident that nobody started their education more than 17 years ago and still has not finished it, so earlier years can be ignored. This should also filter out irrelevant entries, such as hyperlinks and beer jokes.

In [111]:
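# Match a four-digit year starting with 20 at the beginning of the string.
year_regex = re.compile(r"^(20\d{2})")
filtered = np.vectorize(lambda x: year_regex.match(x))(matrix)
# Drop entries without a match, then keep only the matched text.
filtered = filtered[filtered != np.array(None)]
filtered = np.vectorize(lambda x: x.group())(filtered)
print(filtered)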

['2012' '2013' '2013' '2012' '2012' '2013' '2012' '2013' '2013' '2013'
 '2012' '2013' '2013' '2013' '2012' '2013' '2013' '2012' '2013' '2013'
 '2012' '2013' '2012' '2012' '2012' '2013' '2012' '2012' '2013' '2012'
 '2013' '2012' '2012' '2013' '2013' '2012' '2012' '2012' '2013' '2014'
 '2013' '2013' '2013' '2012-2013' '2012' '2013' '2013' '2012-2013' '2013'
 '2013' '2012' '2013' '2011' '2014' '2014' '2013' '2012' '2014' '2014'
 '2014' '2013' '2013' '2012' '2012' '2014' '2014' '2014' '2012' '2012'
 '2014' '2014' '2013' '2012' '2015' '2015' '2014' '2015' '2014' '2014'
 '2014' '2015' '2013' '2012' '2014' '2015' '2014' '2015' '2014' '2014'
 '2015' '2014' '2014' '2012' '2014' '2017' '2015' '2014' '2015' '2014'
 '2014' '2014' '2014 volgens mij' '2014' '2013' '2014' '2014' '2014' '2014'
 '2014' '2013' '2014' '2014' '2014' '2014' '2014' '2013']


The number of records that were filtered out is:

In [112]:
print(matrix.size - filtered.size)

69


It is worth noting that the months do not need to be extracted: the educational minor and the educational major each start only twice per school year, and a semester never ends in the same year in which the first one started.

Entries like "three years ago" are not worth resolving, since they are a very small minority of the records and would not change much in the big picture. I also had to change the regular expression to capture only the leading four-digit year, because a school year such as `2012-2013` should count as a single entry and cannot be converted to an integer as it stands.
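As a minimal sketch of this behaviour, the snippet below applies the same `^(20\d{2})` pattern to a handful of values copied from the column above. The helper name `leading_year` is hypothetical and not part of the notebook.

```python
import re

# Same pattern as in the filtering cell: a four-digit year starting with 20,
# anchored at the beginning of the string.
year_regex = re.compile(r"^(20\d{2})")


def leading_year(text):
    """Return the leading 20xx year of `text`, or None if there is none."""
    match = year_regex.match(text)
    return match.group(1) if match else None


# Values taken from the raw column printed earlier.
samples = ["2012", "2012-2013", "2014 volgens mij", "September 2013", "1321", "20"]
years = [leading_year(s) for s in samples]
print(years)
# ['2012', '2012', '2014', None, None, None]
```

Because the pattern is anchored at the start of the string, an answer such as `September 2013` is dropped rather than reduced to `2013`.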

## Cleaning

Cleaning the data is fairly simple. I only need to convert each string to a number, after which I can sort the data and count the occurrences of each year.

In [113]:
cleaned = np.vectorize(lambda x: int(x), otypes=[np.uint16])(filtered)
print(cleaned)

ValueError: invalid literal for int() with base 10: '2012-2013'
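This is the error `int()` raises for a string that still contains a school-year range. The sketch below shows the conversion and counting once only the leading year is kept. It uses plain Python rather than NumPy to stay self-contained, and it assumes every value already starts with a four-digit year (as is the case after filtering). The sample list and the names `raw` and `counts` are illustrative, not taken from the notebook.

```python
from collections import Counter

# Illustrative values; "2012-2013" is the kind of entry that triggers the ValueError.
raw = ["2012", "2013", "2012-2013", "2014", "2013"]

# Assumption: each value starts with a four-digit year, so the first four
# characters can be converted directly.
years = [int(value[:4]) for value in raw]

# Count occurrences per year, sorted by year.
counts = sorted(Counter(years).items())
print(counts)
# [(2012, 2), (2013, 2), (2014, 1)]
```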

## Conclusions

This column was much easier to filter because the data was already in a fairly clear format. I also learned that the big picture is much easier to see when you do not worry about the smaller edge cases. As seen in the last part of the assignment, there were more edge cases where people added a unit to the end of their numbers, and those answers could well be important.

### Possible biases

* The excluded data includes free-text answers that describe how many years ago a person started their education, such as `3 years ago`.
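To get a rough feel for the size of this bias, one could count how many answers have the form "N years ago". The following is only a sketch on a few values copied from the raw column; the pattern `relative_regex` is a hypothetical choice and would miss spelled-out variants such as "three years ago".

```python
import re

# Values copied from the unfiltered column shown earlier.
samples = ["3 years ago", "2013", "2 years ago", "Long Time agoo",
           "6 years ago", "4 years ago", "2012"]

# Hypothetical pattern for answers of the form "<N> years ago".
relative_regex = re.compile(r"^(\d+)\s+years?\s+ago$", re.IGNORECASE)

# Keep only the answers that express a relative date.
relative = [s for s in samples if relative_regex.match(s)]
print(len(relative), relative)
# 4 ['3 years ago', '2 years ago', '6 years ago', '4 years ago']
```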

In [None]:
x, y = np.unique(cleaned, return_counts=True)

plot.bar(x, y)
plot.show()

The graph should indicate why I did not include the edge cases. Including them would have done little to change the conclusion that most students started their educational program between 2012 and 2015.