In [1]:
# PREPARE THE WORKSPACE

#%matplotlib notebook
import numpy as np
import pandas as pd

#import matplotlib.pyplot as plt
#from sklearn.model_selection import train_test_split
# I think I only need np and pd at this point. We will see.

In [15]:
# READ THE DATA
# the data is synthetic; I generated it in order to have data to work with
# it contains artworks created by the "Beautify Berlin" community, including the type of artwork, the district it can be found in and further attributes
# "Column1" contained the index in the *.txt file and was dropped since it would have been redundant

artworks = pd.read_table("beautified_boxes.txt").drop("Column1", axis=1)
print(artworks)

In [17]:
# PROCESS THE DATA

# some artworks have no user rating, thus the column "userRating" contains NaN in these cases
# during labeling, a missing user rating was a disadvantage in the process of approval
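# thus, fill the missing values in "userRating" with the numerical value 0
# (restricted to that column so that missing values elsewhere are not silently set to 0)

artworks_processed = artworks.fillna({"userRating": 0})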

# currently, the categories are not numerical and not ordered
# transform the categories into ordered numerical values by replacing them
# least likely to be approved gets 1
# most likely to be approved gets highest value (=number of options)
# remaining ones are ranked accordingly

# look at distinct values
print(artworks["district"].value_counts())
print("\n")

# for column "type"
artworks_processed["type"] = artworks_processed["type"].map({
    "painting": 5,
    "graffiti": 4,
    "poster": 3,
    "stencil": 2,
    "text": 1
})

# implement for remaining columns here

# have a look at current state of data frame
print(artworks_processed)

Treptow-Kopenick              79
Neukolln                      72
Friedrichshain-Kreuzberg      65
Steglitz-Zehlendorf           64
Spandau                       64
Charlottenburg-Wilmersdorf    63
Mitte                         61
Lichtenberg                   60
Reinickendorf                 60
Marzahn-Hellersdorf           59
Pankow                        54
Tempelhof-Schoneberg          49
Name: district, dtype: int64


     Artwork-Id  type                  district  environment  countArtists  \
0      32048698     5          Treptow-Kopenick  main street             2   
1      39800694     5          Treptow-Kopenick         park             3   
2      80318972     4          Treptow-Kopenick  public spot             3   
3      74478002     4          Treptow-Kopenick  main street             4   
4      71449602     5       Steglitz-Zehlendorf  side street             3   
..          ...   ...                       ...          ...           ...   
745    68311402     5      

In [None]:
# cleaning and preparing the data
"""
Things still to do:

-fill NaN in column "userRating" with 0
-replace all of the categories with numerical values. 
    least likely to be approved gets 1
    most likely to be approved gets highest value (=number of options)
    rank remaining categories accordingly
-make sure all of the columns are numerical as opposed to character

"""
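
The to-do items above could be sketched roughly as follows. This is a minimal sketch on a tiny made-up frame, not the real data: all rows are invented, the helper `rank_by_approval` is hypothetical, and ranking the categories by their observed approval rate is an assumption (the notes do not say how the ranks are determined, and for "type" they were assigned by hand).

```python
import numpy as np
import pandas as pd

# Tiny invented stand-in for the real data (all values are hypothetical).
toy = pd.DataFrame({
    "type": ["painting", "text", "painting", "stencil", "text", "stencil"],
    "environment": ["park", "park", "main street", "main street", "park", "main street"],
    "userRating": [4.0, np.nan, 5.0, 3.0, np.nan, 2.0],
    "approved": [1, 0, 1, 1, 0, 0],
})


def rank_by_approval(df, column, label="approved"):
    """Map each category to a rank: 1 = lowest approval rate, n = highest."""
    rates = df.groupby(column)[label].mean()
    ranks = rates.rank(method="first").astype(int)
    return df[column].map(ranks)


# Fill missing user ratings with 0, leaving other columns untouched.
processed = toy.fillna({"userRating": 0})

# Replace the string categories with their ranks.
for column in ["type", "environment"]:
    processed[column] = rank_by_approval(processed, column)

# Every column should now be numeric; this list should be empty.
non_numeric = processed.select_dtypes(exclude="number").columns.tolist()
```

Deriving the ranks from the label on the full data set feeds label information into the features, so computing them on the training split only would probably be the cleaner choice.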

In [None]:
# split the data into training and test sets
# remember that only 650 of the 1000 artworks are labeled so far. Either label the remaining ones or shorten the set
# the training set is for training the algorithm
# the test set is for evaluating the algorithm
# both must be strictly separated

# features are type, district, environment, countArtists, experience, replaced, content, user rating
# label is "approved"
# select features and label e.g. like this 
# features = data[["type", "district", "..."]]
# label = data["approved"]
# or as in the example: features is called X and label is called y
# They did it like that:
# X = fruits[["mass", "width", "height"]]
# y = fruits["fruit_label"]
# They then split the data like this: 
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) 
# note that random_state sets the seed; they need it because they work for a large audience that wants to reproduce the result.
# I may not need this parameter, depending on whether I want a new split each time or not.
# However, to assess the model reliably, I NEED to look at multiple splits anyway!
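
The splitting notes above could look roughly like this in code. It is a sketch under assumptions: the frame `data` is an invented numeric stand-in for the prepared artworks, only three of the feature columns are used, and the seeds 0 to 2 are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented numeric stand-in for the prepared data (hypothetical values).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "type": rng.integers(1, 6, size=40),
    "countArtists": rng.integers(1, 5, size=40),
    "userRating": rng.integers(0, 6, size=40),
    "approved": rng.integers(0, 2, size=40),
})

features = data[["type", "countArtists", "userRating"]]
label = data["approved"]

# One split per seed; the same seed always reproduces the same split.
splits = {}
for seed in range(3):
    X_train, X_test, y_train, y_test = train_test_split(
        features, label, random_state=seed
    )
    splits[seed] = (X_train, X_test, y_train, y_test)
```

With the default `test_size`, a quarter of the rows end up in the test set. Keeping several seeded splits makes it possible to compare scores across splits while still being able to reproduce each one.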

In [None]:
# visualisations
# to see the range of values, outliers and so on (less relevant in my case, because the values are categorical)
# try to see if a potential algorithm is likely to be able to classify the data -> well-defined clusters should be visible

# good plots to visualise the data: 
# 1) feature-pair plot
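# their code, adapted to current pandas/matplotlib (pd.scatter_matrix moved to pd.plotting.scatter_matrix):
# requires X_train and y_train from the train/test split
import matplotlib.pyplot as plt
cmap = plt.get_cmap("gnuplot")
scatter = pd.plotting.scatter_matrix(X_train, c=y_train, marker="o", s=40, hist_kwds={"bins": 15}, figsize=(12, 12), cmap=cmap)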
# furthermore: 
# just look at, for example, how many of the artworks are paintings and how they are distributed over the districts.
# the data is synthetic, so on its own this is basically meaningless.
# instead, e.g., compare the observed frequencies to the probabilities that I set when generating the data.
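
The comparison mentioned in the last note could be sketched like this. Both the `types` column and the `set_probabilities` values are invented for illustration; the real generation probabilities are not stated in this notebook.

```python
import pandas as pd

# Hypothetical "type" column and generation probabilities (both invented).
types = pd.Series(["painting"] * 5 + ["graffiti"] * 3 + ["text"] * 2, name="type")
set_probabilities = pd.Series({"painting": 0.4, "graffiti": 0.4, "text": 0.2})

# Relative frequency of each category next to the probability it was generated with.
comparison = pd.DataFrame({
    "observed": types.value_counts(normalize=True),
    "expected": set_probabilities,
})
comparison["difference"] = comparison["observed"] - comparison["expected"]
```

Large differences would hint at a mistake in the generation step rather than at anything meaningful in the data.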

In [None]:
# MACHINE LEARNING ALGORITHM LEARNING, OPTIONS, DECISION MAKING AND BRAINSTORM
# THIS BRANCH AND FILE IS FOR DATA PREPARATION, WRITE THE ACTUAL ML PART IN ITS OWN BRANCH AND FILE
# THIS CELL IS JUST TO TAKE NOTES AND SO ON
"""
ML algorithm options
-k-nearest neighbors: instance or memory based learning
Would likely work for my data set
Likely pretty easy: k=1 (one nearest neighbor)
It needs: a distance metric, k (the number of nearest neighbors), a weighting function on the neighbor points (not all neighbors need to have the same influence), and a method for aggregating the classes of the neighbor points (how to combine the influences and decide)
Distance: scikit-learn uses the Euclidean distance by default
Neighbors: e.g. 5 (odd -> no tie between two classes -> no weighting necessary)
Weighting: not necessary when k is odd and there are only two classes, since the majority vote cannot tie
Aggregation: Majority vote
Use-case/code in the video: about knn
-decision tree would also be very nice
BUT scikit-learn can only handle numeric features and, beyond that, it interprets them as continuous numeric variables.
Some of my features are unordered categorical strings. Neither decision trees in scikit-learn nor simply substituting numbers for the categories solves this, because the categories have no natural order.
-> use kNN; it will be more practical, since we had several hours of teaching about it and only 19 min about decision trees.

Assess the performance of the algorithm later on:
e.g. compute accuracy of the classifier -> score method
Also (if I actually use kNN) try different values for k and then assess the accuracy, because overfitting IS an issue!
Also assess using different splits of the data set into training and test set.
-> Cross validation -> There is a function for this (cross_val_score()), so I do not need to hardcode it.
Cross validation takes time, but my data set is pretty small, so just do it, since it has benefits. 
Code for assessing can be found in the second notebook of the ML part.
A good test score (with good = close to 1) will be more important than a good train score for this.
"""
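
The assessment plan in the notes above (try several values of k, evaluate each with cross-validation) might look roughly like this. It is a sketch on synthetic stand-in data from `make_classification`, not on the artworks; the values of k and the 5 folds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the artworks): 200 samples, 8 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Mean cross-validated accuracy for a few candidate values of k.
mean_accuracy = {}
for k in [1, 5, 15]:
    # Uniform weights give a plain majority vote; the default metric is Euclidean.
    knn = KNeighborsClassifier(n_neighbors=k, weights="uniform")
    scores = cross_val_score(knn, X, y, cv=5)
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
```

Because each score comes from held-out folds, comparing `mean_accuracy` across k gives a first impression of where overfitting (very small k) starts to hurt.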