In [21]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# k-NN analysis of wine data
Our first goal is to classify the quality of white wines from the physicochemical measurements listed below.

Input variables (based on physicochemical tests):
<br><br>
1 - fixed acidity
<br>
2 - volatile acidity
<br>
3 - citric acid
<br>
4 - residual sugar
<br>
5 - chlorides
<br>
6 - free sulfur dioxide
<br>
7 - total sulfur dioxide
<br>
8 - density
<br>
9 - pH
<br>
10 - sulphates
<br>
11 - alcohol
<br><br>
Output variable (based on sensory data):
<br><br>
12 - quality (score between 0 and 10)

More datasets can be found at: https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=&sort=instDown&view=table

In [2]:
data = pd.read_csv("winequality-white.csv",delimiter=";")

In [3]:
X = data.drop("quality", axis=1).to_numpy()
y = data["quality"].to_numpy()

In [30]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

KNeighborsClassifier(n_neighbors=3)

In [31]:
print(neigh.predict([[8,0.7,0.,2,0.076,11.,48.,0.9978,3.51,0.56,12]]))

[6]
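To make the prediction step above more concrete, here is a minimal from-scratch sketch of what a k-NN classifier computes for a single query point. It assumes Euclidean distance and an unweighted majority vote, which match scikit-learn's defaults. The function name `knn_predict` and the toy data are made up for illustration, and tie-breaking may differ from scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict one label by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote; among equally frequent labels the first one seen wins (arbitrary choice)
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up, not taken from the wine dataset)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([5, 5, 5, 7, 7, 7])

pred_low = knn_predict(X_toy, y_toy, [0.1, 0.1])
pred_high = knn_predict(X_toy, y_toy, [5.0, 5.1])
print(pred_low, pred_high)
```

Every query is compared against every training point, so the cost of one prediction grows linearly with the size of the training set.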


# k-NN analysis of BTC data

Features:
<br><br>
address: String. Bitcoin address.
<br>
year: Integer. Year.
<br>
day: Integer. Day of the year. 1 is the first day, 365 is the last day.
<br>
length: Integer.
<br>
weight: Float.
<br>
count: Integer.
<br>
looped: Integer.
<br>
neighbors: Integer.
<br>
income: Integer. Satoshi amount (1 bitcoin = 100 million satoshis).
<br><br>
Output variable: 
<br><br>
label: Categorical string. Name of the ransomware family (e.g., Cryptxxx, cryptolocker, etc.) or white (i.e., not known to be ransomware).

In [2]:
data = pd.read_csv("BTC_data/BitcoinHeistData.csv",delimiter=",")

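k-NN needs numeric features, so let's hash all address strings to integers. Note that Python salts string hashes per interpreter session unless `PYTHONHASHSEED` is set, so the resulting values change between runs.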

In [3]:
data["address_hash"] = data["address"].apply(hash)
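If the hashed column needs to be identical across notebook sessions, one option is a digest-based hash instead of the built-in `hash`. The sketch below is an assumption about how this could be done, not part of the original analysis. The helper name `stable_hash` and the example strings are hypothetical.

```python
import hashlib


def stable_hash(address: str) -> int:
    """Map a string to a non-negative 63-bit integer that is the same in every session."""
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    # Keep the first 8 bytes and clear the top bit so the value fits a signed 64-bit integer
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF


# Hypothetical address strings, for illustration only
example = [stable_hash(a) for a in ["address_a", "address_b", "address_a"]]
print(example[0] == example[2], example[0] == example[1])
```

With such a helper, the cell above could use `data["address"].apply(stable_hash)` in place of `apply(hash)`.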

In [4]:
X = data.drop(["label","address"], axis=1).to_numpy()
y = data["label"].to_numpy()

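We should also rescale the dataset. k-NN is based on distances, and the features have very different numerical scales, so without rescaling the features with the largest values (such as income in satoshis) would dominate the distance.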

In [5]:
scaler = StandardScaler()
scaler.fit(X)

StandardScaler()

In [6]:
X = scaler.transform(X)
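For reference, the standardization applied by `StandardScaler` amounts to subtracting each column's mean and dividing by its standard deviation. The following minimal NumPy sketch reproduces this on a made-up toy matrix. It ignores the zero-variance columns that `StandardScaler` handles specially.

```python
import numpy as np

# Toy matrix with two features on very different scales (made-up values)
X_toy = np.array([[1.0, 100000.0],
                  [2.0, 300000.0],
                  [3.0, 200000.0],
                  [4.0, 400000.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X_toy - mean) / std

# After scaling, every column has mean 0 and standard deviation 1 (up to rounding)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```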

In [None]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

This takes a very long time...

Exercise: Can you come up with an efficient k-NN implementation in PySpark?
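As a hint rather than a solution, note that exact k-NN decomposes over partitions of the data: the k nearest neighbors overall are always among the k nearest neighbors of each partition. The sketch below illustrates this in plain Python, not PySpark. All function names, labels and toy points are made up. In PySpark, the two steps might correspond to something like `RDD.mapPartitions` followed by a `reduce`, but that mapping is an assumption left for the exercise.

```python
import heapq
import math
from collections import Counter


def local_top_k(partition, query, k):
    """'Map' step: the k nearest (distance, label) pairs within one partition."""
    scored = ((math.dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)


def merge_top_k(a, b, k):
    """'Reduce' step: keep the k nearest pairs out of two candidate lists."""
    return heapq.nsmallest(k, a + b)


def partitioned_knn(partitions, query, k=3):
    """Classify one query by merging per-partition candidates, then voting."""
    candidates = []
    for part in partitions:
        candidates = merge_top_k(candidates, local_top_k(part, query, k), k)
    return Counter(label for _, label in candidates).most_common(1)[0][0]


# Toy partitions of (features, label) pairs; values and labels are made up
partitions = [
    [((0.0, 0.0), "white"), ((0.2, 0.1), "white")],
    [((5.0, 5.0), "ransom"), ((0.1, 0.3), "white"), ((5.2, 4.9), "ransom")],
    [((4.8, 5.1), "ransom")],
]

prediction_a = partitioned_knn(partitions, (0.1, 0.1))
prediction_b = partitioned_knn(partitions, (5.0, 5.0))
print(prediction_a, prediction_b)
```

Each partition only ever passes k candidates on, so the amount of data that has to be combined stays small regardless of the partition size.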