# Machine learning project
Author: Manuele Nolli, student BSc Computer Science SUPSI 

Date: 05.05.2023

Mail: manuele.nolli@student.supsi.ch

# Introduction
This document is an analysis of the dataset "Crypto Coven" from __[Kaggle.com](https://www.kaggle.com/datasets/harrywang/crypto-coven)__. 

## Goal
The goal of this project is to analyze the dataset and to create a model that can predict the price of a token based on the features of the token. 

In addition, the project will divide the tokens into different categories based on the price of the token and create a model that can predict the category of a token based on the features of the token.

## Dataset description
The dataset contains 9761 rows and 45 columns. The columns are:
- **id**: unique identifier of the transaction
- **num_sales**: number of sales of the token
- **name**: name of the token
- **description**: description of the token
- **external_link**: link to the external website
- **permalink**: link to the token on OpenSea
- **token_metadata**: link to the metadata of the token
- **token_id**: unique identifier of the token
- **owner.user.username**: username of the owner
- **owner.address**: address of the owner
- **last_sale.total_price**: price of the last sale
- **last_sale.payment_token.usd_price**: price of the last sale in USD
- **last_sale.transaction.timestamp**: timestamp of the last sale
- **Wonder**: Wonder of the token
- **Skin Tone**: Skin Tone of the token
- **Rising Sign**: Rising Sign of the token
- **Eyebrows**: Eyebrows of the token
- **Wisdom**: Wisdom of the token
- **Body Shape**: Body Shape of the token
- **Moon Sign**: Moon Sign of the token
- **Will**: Will of the token
- **Hair Color**: Hair Color of the token
- **Wit**: Wit of the token
- **Wiles**: Wiles of the token
- **Necklace**: Necklace of the token
- **Sun Sign**: Sun Sign of the token
- **Eye Style**: Eye Style of the token
- **Eye Color**: Eye Color of the token
- **Mouth**: Mouth of the token
- **Hat**: Hat of the token
- **Archetype of Power**: Archetype of Power of the token
- **Woe**: Woe of the token
- **Hair (Front)**: Hair (Front) of the token
- **Top**: Top of the token
- **Hair (Back)**: Hair (Back) of the token
- **Background**: Background of the token
- **Face Markings**: Face Markings of the token
- **Facewear**: Facewear of the token
- **Hair Topper**: Hair Topper of the token
- **Back Item**: Back Item of the token
- **Earrings**: Earrings of the token
- **Forehead Jewelry**: Forehead Jewelry of the token
- **Hair (Middle)**: Hair (Middle) of the token
- **Mask**: Mask of the token
- **Outerwear**: Outerwear of the token


In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
# read data
df = pd.read_csv('data/witches.csv')


# 1. Preprocessing

## 1.1 Initial state

Show information about the features present in the dataset.

In [None]:
# print dataset info function
def printDatasetInfo(df):
    print(f"Total columns: {len(df.columns)}")
    print("Columns names:", end=" ")
    for col in df:
            print(col, end=", ")
    print()
    
    #columns types
    print(f"Columns type:")
    #creating temp array
    columnData = []
    dfIndexType = []
    
    for col in df.columns:
        temp = []
        dfIndexType.append(col)
        temp.append(df[col].apply(type).unique())
        temp.append(df[col].isnull().sum())
        temp.append(round((df[col].isnull().sum() / len(df[col])) * 100, 2))
        temp.append(df[col].nunique())
        columnData.append(temp)
    
    #create new Dataframe
    dfColumnsType = pd.DataFrame(columnData, columns=['Types','NaN Count', 'NaN %','Unique Values'])
    dfColumnsType.index = dfIndexType
    #print columns type
    display(dfColumnsType)

# print dataset info
printDatasetInfo(df)

#df size
print(f"Dataframe rows: {len(df)}")

#df sample
print("Dataset samples:")
df.sample(1)

## 1.2 Converting features in Present / Not present

The table above shows that some witch characteristics are present for only a small number of tokens. For example, the "Mask" feature has 95% null values. For simplicity, features with more than 50% null values are reduced to a binary "present" / "not present" flag instead of keeping their specific value.

Approximately 50% of the tokens have a missing last-sale price. This means that these tokens have never been sold. 
The dataset is therefore divided into two datasets: one with the tokens that have never been sold and one with the tokens that have been sold at least once.

The data preprocessing continues by dropping the columns that are not useful for the analysis and by creating other, simpler columns.

In [None]:
#Remove not useful columns: id, external link, token metadata, owner username and address, last_sale.transaction.timestamp, and the zodiac sign columns (Rising Sign, Moon Sign, Sun Sign)
df.drop(['id', 'external_link', 'token_metadata', 'owner.user.username', 'owner.address', 'last_sale.transaction.timestamp', 'Rising Sign', 'Moon Sign', 'Sun Sign'], axis=1, inplace=True)

# create featureList with all features that have more than 50% of null values and another list with all features that have less than 0.5% of null values but not 0
featureList50 = []
featureList01 = []
for col in df.columns:
    if df[col].isnull().sum() > len(df[col]) * 0.5:
        featureList50.append(col)
    elif df[col].isnull().sum() < len(df[col]) * 0.005 and df[col].isnull().sum() != 0:
        featureList01.append(col)

# create has_'Feature' column based on null values in 'feature' column
for feature in featureList50:
    df['has_' + feature] = df[feature].notnull()

# drop features present in featureList
df.drop(featureList50, axis=1, inplace=True)

# drop null values in featureList01
df.dropna(subset=featureList01, inplace=True)

In [None]:
# Separate the data into two groups: rows with and without a missing 'last_sale.total_price'
df_missingPrice = df[df['last_sale.total_price'].isnull()]
# .copy() so that the column assignments below do not act on a view of the original frame
df = df[df['last_sale.total_price'].notnull()].copy()

# Convert 'last_sale.total_price' from wei to ETH (1 ETH = 10**18 wei)
df['last_sale.total_price'] = df['last_sale.total_price'] / 10 ** 18
# Create a new column with the USD price: the ETH amount of the sale times the USD price of the payment token
df['price_USD'] = df['last_sale.total_price'] * df['last_sale.payment_token.usd_price']

#drop not useful columns
df.drop(['last_sale.payment_token.usd_price', 'last_sale.total_price'], axis=1, inplace=True)

# print shape of the two groups
print(f"Shape of the group with missing price values: {df_missingPrice.shape}")
print(f"Shape of the group without missing price values: {df.shape}")
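As a sanity check on the unit conversion, the sketch below works through one made-up sale. It assumes that `last_sale.total_price` is denominated in wei, as the division by `10 ** 18` implies; the sale amount and the ETH/USD rate are invented for illustration, and `wei_to_usd` is a hypothetical helper, not part of the notebook.

```python
from decimal import Decimal

WEI_PER_ETH = Decimal(10) ** 18  # 1 ETH = 10^18 wei

def wei_to_usd(total_price_wei, usd_per_eth):
    """Convert a raw amount in wei to USD, going through ETH."""
    eth = Decimal(total_price_wei) / WEI_PER_ETH
    return eth * Decimal(str(usd_per_eth))

# Hypothetical sale: 0.25 ETH at an assumed rate of 2000 USD per ETH
price_usd = wei_to_usd(250_000_000_000_000_000, 2000)
print(price_usd)
```

`Decimal` is used here only to keep the arithmetic exact in the example; the notebook itself works with floating-point columns.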

## 1.3 Adapt numeric features

The dataset contains some numeric features with a score from 0 to 9. For simplicity, these features will be renamed to score_'Feature'.

In [None]:
# Rename score features: Wiles, Will, Wisdom, Wit, Woe, Wonder

scoreFeature = ['Wiles', 'Will', 'Wisdom', 'Wit', 'Woe', 'Wonder']

for i in range(len(scoreFeature)):
    #Rename
    df.rename(columns={f'{scoreFeature[i]}': f'score_{scoreFeature[i]}'}, inplace=True)

# Create a copy of the dataframe for the hashing section
df_hashing = df.copy()

# Copy dataset for Classification
df_class = df.copy()

## 1.4 Reduce cardinality of categorical features

Perfect! The dataset is almost ready for the analysis. The last step is to reduce the cardinality of the categorical features. The goal is to reduce the cardinality to a maximum of 3 values. The best solution will be to reduce the cardinality to 2 values: "Rare" and "Not Rare". In addition, the features will be renamed to rarity_'Feature'.

In [None]:
# Reduce cardinality: Skin Tone, Eyebrows, Hair Color, Eye Style, Eye Color, Mouth, Archetype of Power, Hair (Front), Top, Hair (Back), Background

categoricalFeature = ['Skin Tone', 'Eyebrows', 'Hair Color', 'Eye Style', 'Eye Color', 'Mouth', 'Archetype of Power', 'Hair (Front)', 'Top', 'Hair (Back)', 'Background']

# Reduce methodology: a value is replaced by 'Rare' if its count is below 75% of the average count per distinct value
# (sumCount / numUnique * 0.75); otherwise it is replaced by 'Not Rare'. Note that this threshold is not the statistical third quartile.
# In addition, if a value is null it will be replaced by 'Not Rare'
for col in categoricalFeature:
    # Replace null values by 'Not Rare'
    df[col] = df[col].fillna('Not Rare')

    # Count every value once instead of recomputing the counts for each row
    counts = df[col].value_counts()
    threshold = counts.sum() / df[col].nunique() * 0.75

    # Replace values by 'Rare' or 'Not Rare'
    df[col] = df[col].map(lambda x: 'Rare' if counts[x] < threshold else 'Not Rare')

    # Rename the column as 'rarity_' + column name
    df.rename(columns={col: f'rarity_{col}'}, inplace=True)

# If there are features that have only one value, they will be removed
for col in df.columns:
    if df[col].nunique() == 1:
        df.drop(col, axis=1, inplace=True)
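
The bucketing rule of this section can be tried on a toy column. The sketch below is written with the standard library only and uses made-up hair colors; `rarity_labels` and its `factor` parameter are hypothetical names, chosen to mirror the threshold of 75% of the average count per distinct value.

```python
from collections import Counter

def rarity_labels(values, factor=0.75):
    """Label a value 'Rare' if its count is below factor * (average count per distinct value)."""
    counts = Counter(values)
    threshold = len(values) / len(counts) * factor
    return ['Rare' if counts[v] < threshold else 'Not Rare' for v in values]

# Made-up column: 'red' x6, 'black' x5, 'silver' x1
toy = ['red'] * 6 + ['black'] * 5 + ['silver']
labels = rarity_labels(toy)
print(Counter(labels))
```

With 12 rows and 3 distinct values the average count is 4, so the threshold is 3 and only the single 'silver' row is labelled 'Rare'.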

## 1.5 Final state

In [None]:
# print dataset info
print("Final dataset info:")
printDatasetInfo(df)

As the table above shows, the dataset is now cleaned and ready for the analysis.

# 2. Data visualization
This section shows some graphs that help to understand the dataset. 

A recap of the previous section:
* features with more than 50% of null values are reduced to "present" / "not present" instead of a specific value. Those features are called has_'Feature'
* some features hold a score from 0 to 9. Those features are called score_'Feature'
* the remaining features (with discrete values) are called rarity_'Feature' and their values are "Rare" or "Not Rare", depending on whether the value's count falls below 75% of the average count per distinct value

## 2.1 Price distribution
The graph below shows the price distribution in USD. The majority of the tokens have a low price, and the distribution is not normal: it is right-skewed.

In [None]:
# Price distribution
fig = sns.displot(df['price_USD'], kde=True).set(title='Price distribution')
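
A right skew can also be checked numerically: the mean sits above the median and the moment-based skewness is positive. The sketch below uses invented prices, not values from the dataset, and `sample_skewness` is a hypothetical helper written with the standard library.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (positive means a longer right tail)."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up prices: most are low, one is very high
prices = [100, 120, 150, 200, 5000]
skew = sample_skewness(prices)
print(mean(prices), median(prices), round(skew, 2))
```

On the real data, `df['price_USD'].skew()` gives a related, bias-corrected estimate.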

## 2.2 Percentage distribution of the features presence

In [None]:
# Percentage of has_'Feature' columns plot 
# create temp array
temp = []
for col in df.columns:
    if col.startswith('has_'):
        # The mean of a boolean column is the share of True values
        trueShare = df[col].mean()
        temp.append([col, 1 - trueShare, trueShare])

# create new dataframe
dfHasFeature = pd.DataFrame(temp, columns=['Name','False [%]', 'True [%]'])
# Order item 
dfHasFeature.sort_values(by=['True [%]'], inplace=True)

# Plot (the returned Axes is called ax so that it does not shadow the usual matplotlib alias plt)
ax = dfHasFeature.plot(kind='barh', stacked=True, figsize=(20,10), title='Percentage of has_\'Feature\' columns', x='Name', width=0.8)
ax.set_xlabel('Percentage')
ax.set_ylabel('Feature')

# Add more x ticks
ax.set_xticks(np.arange(0, 1.05, 0.05))


## 2.3 Percentage distribution of the features rarity
The plot below serves as a sanity check on the rarity calculation: for the majority of the features, only a small percentage of the tokens carry a "Rare" value.

In [None]:
# Percentage of rarity_'Feature' columns plot 
# create temp array
temp = []
for col in df.columns:
    if col.startswith('rarity_'):
        # Share of 'Rare' values, looked up by label rather than by position in value_counts()
        rareShare = (df[col] == 'Rare').mean()
        temp.append([col, rareShare, 1 - rareShare])

# create new dataframe
dfRarity = pd.DataFrame(temp, columns=['Name','Rare [%]', 'Not Rare [%]'])
# Order item 
dfRarity.sort_values(by=['Rare [%]'], inplace=True)

# Plot (the returned Axes is called ax so that it does not shadow the usual matplotlib alias plt)
ax = dfRarity.plot(kind='barh', stacked=True, figsize=(20,10), title='Percentage of rarity_\'Feature\' columns', x='Name', width=0.8)
ax.set_xlabel('Percentage')
ax.set_ylabel('Feature')

# Add more x ticks
ax.set_xticks(np.arange(0, 1.05, 0.05))


## 2.4 Score distribution

In [None]:
import matplotlib.pyplot as plt
# Distribution of 'score_' features

# create temp array
temp = []
for col in df.columns:
    if col.startswith('score_'):
        temp.append([col,df[col].mean(), df[col].median(), df[col].mode()[0], df[col].std()])

# Create subplots: two columns and as many rows as needed
width = 2
height = int(np.ceil(len(temp) / width))
fig, axs = plt.subplots(height, width, figsize=(10*width, 10*height), squeeze=False)

for i in range(len(temp)):
    # Row-major position of the i-th plot in the grid
    ax = axs[i // width, i % width]
    # Histogram
    max_n = df[temp[i][0]].max()
    min_n = df[temp[i][0]].min()
    bins = 10
    step = (max_n - min_n) / bins
    arr_div = np.arange(min_n + step / 2, max_n + step / 2, step=step)
    arr_div_r = np.round(arr_div, 0).astype(int)
    sns.histplot(data = df, x=temp[i][0], kde=True, bins=bins, ax=ax).set(title=f'{temp[i][0]} distribution')
    ax.set_xticks(arr_div)
    ax.set_xticklabels(arr_div_r)

plt.show()

## 2.5 Correlation matrix
The heatmap below shows no strong linear correlation between the features. A low linear correlation does not by itself prove that the features are independent, but it indicates that no feature is a simple linear proxy for another.

In [None]:
# Correlation matrix
corrMatrix = df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(corrMatrix, annot=True, ax=ax)
plt.show()
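
When reading a correlation matrix, it helps to remember that the Pearson coefficient only measures linear association. The sketch below, built on made-up numbers with a hypothetical `pearson` helper, shows two variables that are fully dependent and still have zero correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y is completely determined by x
r = pearson(xs, ys)
print(r)
```

Because the relationship is symmetric around zero, the positive and negative contributions cancel, so a near-zero cell in the heatmap should be read as "no linear relationship" rather than "no relationship".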

# 3. Price prediction model
The goal of this section is to create a model that can predict the price of a token based on the features of the token.
It will be done following these steps:
1. Split the dataset into a train set (80%) and a test set (20%); validation is handled by 5-fold cross validation on the train set inside the grid search
2. LASSO regression with all the features. Features whose LASSO coefficient is exactly 0 are removed from the model
3. RIDGE regression with the features selected in the previous step
4. Random Forest regression with the same selected features

**Note**: The features are not rescaled. Most of them are binary (0/1), while the score features (0 to 9) and `num_sales` are on different scales, which can influence the LASSO and RIDGE penalties.
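
The selection-then-refit idea behind the LASSO and RIDGE steps can be sketched on synthetic data. Everything in the block is an assumption made for illustration: the data are random, only the first two columns influence the target, and the `alpha` values are arbitrary rather than tuned by grid search as in the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Synthetic target: only columns 0 and 1 matter, plus a little noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 1: LASSO shrinks the coefficients of uninformative columns to exactly 0
lasso = Lasso(alpha=0.1).fit(X_tr, y_tr)
selected = np.flatnonzero(lasso.coef_)  # indices of columns with a non-zero coefficient

# Step 2: refit a RIDGE model on the surviving columns only
ridge = Ridge(alpha=1.0).fit(X_tr[:, selected], y_tr)
r2 = ridge.score(X_te[:, selected], y_te)
print(selected, round(r2, 3))
```

Selecting on non-zero coefficients is a different criterion from a p-value test: LASSO does not produce p-values, it simply drops the columns whose coefficient the penalty pushes to zero.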

## 3.1 Conversion of the features in numeric values and dummy variables

In [None]:
# Convert binary columns to 0 and 1

# Create a list of columns to be converted
feature = []

for col in df.columns:
    if col.startswith('has_') or col.startswith('rarity_'):
        feature.append(col)

# if present, replace 'True' with 1 and 'False' with 0. If rare, replace 'Rare' with 1 and 'Not Rare' with 0
for col in feature:
    if col.startswith('has_'):
        df[col] = df[col].apply(lambda x: 1 if x == True else 0)
    else:
        df[col] = df[col].apply(lambda x: 1 if x == 'Rare' else 0)

# It remains to convert the Body Shape feature with dummy variables
df = pd.get_dummies(df, columns=['Body Shape'], prefix = ['dummies_BodyShape'])

# Add to the list of features the dummy variables and the score features
for col in df.columns:
    if col.startswith('dummies_') or col.startswith('score_'):
        feature.append(col)

feature.append("num_sales")
# Print dataset info
printDatasetInfo(df)
df.sample(1)

## 3.2 Split the dataset
Split the dataset into a train set (80%) and a test set (20%). The validation folds are created later by the 5-fold cross validation of GridSearchCV.

In [None]:
from sklearn.model_selection import train_test_split
# Split dataset in train, validation and test set

y = df['price_USD']
X = df.drop(['price_USD'], axis=1)
# Drop all columns not in the feature list
X = X[feature]

# Split dataset in train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# *Note*: The validation set is split in the following code section with the GridSearchCV
# Split train set in train and validation set
#X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2


## 3.3 LASSO regression

In [None]:
# Ignore warnings due to the GridSearchCV
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

In [None]:
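# Test the model
from sklearn.metrics import  r2_score

def printResults(model, X_train, y_train):
    # Refit the model on the given training data, then score it on the held-out test set.
    # Note: X_test and y_test are read from the notebook's global scope.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    print(f'R-squared: {r2}')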

In [None]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

# Lasso regression
# Create the parameter grid based on the results of random search
param_grid = {
    'alpha': 10**np.linspace(10,-2,100)*0.5 ,
    'max_iter': [1000, 10000, 100000],
    'tol': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# Create a base model
lasso = Lasso()

# Instantiate the grid search model
crossValidation = 5
lasso_regressor = GridSearchCV(estimator = lasso, param_grid = param_grid, cv = crossValidation, scoring='neg_mean_squared_error') 

# Fit the grid search to the data
lasso_regressor.fit(X_train, y_train)

# Print best parameters
print(f'Best params {lasso_regressor.best_params_}')
print(f'Best score {-lasso_regressor.best_score_}')

# Create a list of features with a coefficient different from 0
feature_lasso = []
for i in range(len(lasso_regressor.best_estimator_.coef_)):
    if lasso_regressor.best_estimator_.coef_[i] != 0:
        feature_lasso.append(feature[i])

# Print the number of features
print(f'Number of features: {len(feature_lasso)} that are important for the prediction')

# Print the features
print(f'Best features: {feature_lasso}')

# Print the score on the test set
printResults(lasso_regressor.best_estimator_, X_train, y_train)

## 3.4 RIDGE regression

In [None]:
%%time
from sklearn.linear_model import Ridge

# Ridge regression
# Create the parameter grid based on the results of random search
param_grid = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'max_iter': [1000, 10000, 100000],
    'tol': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# Create new train and test set with only the important features
X_train = X_train[feature_lasso]
X_test = X_test[feature_lasso]

# Create a base model
ridge = Ridge()

# Instantiate the grid search model
crossValidation = 5
ridge_regressor = GridSearchCV(estimator = ridge, param_grid = param_grid, cv = crossValidation, scoring='neg_mean_squared_error')

# Fit the grid search to the data
ridge_regressor.fit(X_train, y_train)

# Print best parameters
print(f'Best params {ridge_regressor.best_params_}')
print(f'Best score {-ridge_regressor.best_score_}')

# Print the score on the test set
printResults(ridge_regressor.best_estimator_, X_train, y_train)

## 3.5 Random Forest regression

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Random forest regression
# Create the parameter grid based on the results of random search
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 10, 50, 100],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create new train and test set with only the important features
X_train = X_train[feature_lasso]
X_test = X_test[feature_lasso]

# Create a base model
rf = RandomForestRegressor()

# Instantiate the grid search model
crossValidation = 5
rf_regressor = GridSearchCV(estimator = rf, param_grid = param_grid, cv = crossValidation, scoring='neg_mean_squared_error')

# Fit the grid search to the data
rf_regressor.fit(X_train, y_train)

# Print best parameters
print(f'Best params {rf_regressor.best_params_}')
print(f'Best score {-rf_regressor.best_score_}')

# Print the score on the test set
printResults(rf_regressor.best_estimator_, X_train, y_train)

## 3.6 Hashing
As the models above show, the scores are very poor, close to 0. One reason may be that the features are categorical rather than numerical. It is also possible that the preprocessing reduced the features too aggressively, so that the model no longer has enough information to predict the price of the token.

In order to improve the score, it is possible to use the hashing trick, a method that converts categorical features into a fixed number of numerical features by hashing their values.
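
To make the idea concrete, here is a minimal sketch of the hashing trick with the standard library only. The attribute values are made up, `hash_encode` is a hypothetical helper, and the choice of md5 and of `column=value` tokens is an assumption for illustration.

```python
import hashlib

def hash_encode(row, n_components=8):
    """Map a dict of categorical values to a fixed-length count vector."""
    vec = [0] * n_components
    for column, value in row.items():
        token = f"{column}={value}".encode("utf-8")
        # md5 serves only as a stable hash; Python's built-in hash() is salted per process
        bucket = int(hashlib.md5(token).hexdigest(), 16) % n_components
        vec[bucket] += 1
    return vec

# Made-up witch attributes
witch = {"Hair Color": "silver", "Background": "sea", "Top": "cloak"}
encoded = hash_encode(witch)
print(encoded)
```

The output length depends only on `n_components`, not on the number of distinct categories, at the cost of possible collisions when two values land in the same bucket. The `HashingEncoder` from `category_encoders` used in the following cells applies the same principle, although its exact column layout may differ from this sketch.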

In [None]:
# Delete rows with missing values for Eyebrows and Hair Color (A max of 176 rows are deleted)
df_hashing = df_hashing.dropna(subset=['Eyebrows', 'Hair Color'])

# Convert remaining NaN values to 'Unknown' (Hair Color has none left after the dropna above)
df_hashing['Hair Color'] = df_hashing['Hair Color'].fillna('Unknown')
df_hashing['Hair (Back)'] = df_hashing['Hair (Back)'].fillna('Unknown')

# Drop unnecessary columns: name, description, permalink, token_id (num_sales is kept)
df_hashing = df_hashing.drop([ 'name', 'description', 'permalink', 'token_id'], axis=1)

# Convert the boolean has_* columns of df_hashing to 0 and 1
for col in df_hashing.columns:
    if col.startswith('has_'):
        df_hashing[col] = df_hashing[col].astype(int)

In [None]:
from sklearn.model_selection import train_test_split

# Hashing
from category_encoders import HashingEncoder

hashingFeatures = []

# All features that do not contain "has_", "score_" or "price_USD" in the name
# (the loop variable is called col so that it does not overwrite the feature list used earlier)
for col in df_hashing.columns:
    if 'has_' not in col and 'score_' not in col and 'price_USD' not in col:
        hashingFeatures.append(col)

y = df_hashing['price_USD']
X = df_hashing.drop(['price_USD'], axis=1)

In [None]:
# We create a helper function to get the R-squared score for each encoding size:
def get_score(model, X, y, X_val, y_val):
    model.fit(X, y)
    # For regressors, score() returns the R-squared on the validation data
    r2 = model.score(X_val, y_val)
    return r2

In [None]:
%%time
# Split dataset into train and validation subsets:
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size=0.2, random_state=1)

# Lasso regression

# Iterate over different n_components:
for n_components in [10, 100, 500, 1000, 5000, 10000]:
    
    hashing_enc = HashingEncoder(cols=hashingFeatures, n_components=n_components).fit(X_train, y_train)
    
    X_train_hashing = hashing_enc.transform(X_train.reset_index(drop=True))
    X_val_hashing = hashing_enc.transform(X_val.reset_index(drop=True))

    # Create a base model
    lasso = Lasso()

    # print the scores for each encoding method
    score = get_score(lasso, X_train_hashing, y_train, X_val_hashing, y_val)
    print("Lasso regression score with n_components = {}: {}".format(n_components, score))

## 3.7 Conclusion
Unfortunately, all the models have a very low score. Even with the hashing trick, the score does not improve. 
A possible reason is that NFTs are art, so the price may not be related to the features of the token but rather to the artist and the popularity of the artist. 

# 4. Classification model
In this section the price will be divided into 3 classes:
* Low price: price < 1000
* Medium price: 1000 <= price < 5000
* High price: price >= 5000

The goal of this section is to create a model that can predict the price class of a token based on the features of the token.
It will be done using UMAP to show a possible clustering of the tokens and then using different classification models.
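
The binning itself can be sketched in a few lines. The cut-offs 1000 and 5000 follow the data processing cell further down (the `apply` on `price_USD`) and should be treated as parameters; `price_class`, `CUTS` and `LABELS` are hypothetical names.

```python
from bisect import bisect_right

CUTS = [1000, 5000]  # USD thresholds, taken from the processing cell (assumption: adjust as needed)
LABELS = ["LOW", "MEDIUM", "HIGH"]

def price_class(price_usd):
    """Return the price class; a price equal to a cut-off falls into the upper class."""
    return LABELS[bisect_right(CUTS, price_usd)]

examples = [250.0, 1000.0, 4999.99, 5000.0]
classes = [price_class(p) for p in examples]
print(classes)
```

On a DataFrame column, `pd.cut` with `right=False` would express the same half-open intervals.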

In [None]:
# ROC Curve function that will be used to plot the ROC curve for model
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

def plot_roc_curve(y_test, y_pred):
    # Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    n_classes = 3

    y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
    y_pred_bin = label_binarize(y_pred, classes=[0, 1, 2])

    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_pred_bin[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Plot of a ROC curve for a specific class
    for i in range(n_classes):
        plt.figure()
        plt.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f)' % roc_auc[i])
        plt.plot([0, 1], [0, 1], 'k--')
        plt.xlim([-0.05, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f'ROC Curve for class {i}')
        plt.legend(loc="lower right")
        plt.show()
        

In [None]:
# function to calculate and plot confusion matrix, it will be used for each model
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrix(y_test, y_pred, classes, title=None):
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
    disp = disp.plot(include_values=True, cmap=plt.cm.Blues, ax=None, xticks_rotation='horizontal')
    if title is not None:
        disp.ax_.set_title(title)
    # Calculate TP, FP, FN, and TN for each class
    for i in range(len(cm)):
        tp = cm[i,i]
        fp = np.sum(cm[:,i]) - tp
        fn = np.sum(cm[i,:]) - tp
        tn = np.sum(cm) - tp - fp - fn
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        
        print(f"Class {i} -- TP: {tp}, FP: {fp}, FN: {fn}, TN: {tn}, Sensitivity: {sensitivity}, Specificity: {specificity}")
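
The per-class counts used above can be checked by hand on a small, made-up confusion matrix (rows are true classes, columns are predicted classes). `per_class_rates` is a hypothetical helper that repeats the same arithmetic without NumPy.

```python
def per_class_rates(cm):
    """Return a (sensitivity, specificity) pair per class from a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    rates = []
    for i in range(len(cm)):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                  # true class i, predicted as something else
        fp = sum(row[i] for row in cm) - tp   # predicted class i, actually something else
        tn = total - tp - fp - fn
        rates.append((tp / (tp + fn), tn / (tn + fp)))
    return rates

# Made-up 3-class confusion matrix
toy_cm = [[8, 2, 0],
          [1, 6, 3],
          [0, 2, 8]]
rates = per_class_rates(toy_cm)
print(rates)
```

For class 0 this gives TP = 8, FN = 2, FP = 1 and TN = 19, so the sensitivity is 8/10 and the specificity is 19/20.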


In [None]:
# Data processing for classification

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing

# Drop unnecessary columns: name, description, permalink (token_id is dropped later, after the label encoding)
df_class = df_class.drop(['name', 'description', 'permalink'], axis=1)

# Create a new column with price (LOW, MEDIUM, HIGH)
df_class['price_USD'] = df_class['price_USD'].astype(float)
df_class['price_USD'] = df_class['price_USD'].apply(lambda x: 'LOW' if x < 1000 else 'MEDIUM' if x < 5000 else 'HIGH')

# print the unique values of the column and their counts
print("Counts for price_USD:")
print(df_class['price_USD'].value_counts())

# Fill missing values of Hair (Back), Hair Color and Eyebrows with 'Unknown' using SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it into a single column
df_class['Hair (Back)'] = imputer.fit_transform(df_class[['Hair (Back)']]).ravel()
df_class['Hair Color'] = imputer.fit_transform(df_class[['Hair Color']]).ravel()
df_class['Eyebrows'] = imputer.fit_transform(df_class[['Eyebrows']]).ravel()

# Create a label encoder object
label_encoder = preprocessing.LabelEncoder()

#Convert all df_class columns to numeric
for col in df_class.columns:
    df_class[col] = label_encoder.fit_transform(df_class[col])

# separate target from predictors
y = df_class['price_USD']
X = df_class.drop(['price_USD'], axis=1)
X = X.drop(['token_id'], axis=1)

# Split dataset in train and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)


## 4.1 UMAP
UMAP is a dimensionality reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

In [None]:
# UMAP 
import umap 

reducer = umap.UMAP()

embedding = reducer.fit_transform(X_train)
embedding.shape

In [None]:
# Encode an image as a base64 data URL so that it can be shown in the plot
from io import BytesIO
from PIL import Image
import base64

def embeddable_image(data): # data is the path to the image
    img = Image.open(data)
    img.thumbnail((64,64), Image.LANCZOS)
    buffer = BytesIO()
    img.save(buffer, format='png')
    for_encoding = buffer.getvalue()
    return 'data:image/png;base64,' + base64.b64encode(for_encoding).decode()

In [None]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper

output_notebook()

In [None]:
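# Show images in UMAP from ./data/images/$(token_id).png
# token_id in df_class was label-encoded above, so the original ids are read from df (same index)
df_class['image'] = df.loc[df_class.index, 'token_id'].apply(lambda x: f"./data/images/{x}.png")

# drop token_id = 0 (compared on the original ids, not on the encoded ones)
df_class = df_class[df.loc[df_class.index, 'token_id'] != 0]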

def draw_umap(n_neighbors=15, min_dist=0.1, metric='euclidean', title='', data=df_class):

    fit = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=2,
        metric=metric,
        random_state=42
    )

    data2 = data.drop(['image'], axis=1)
    u = fit.fit_transform(data2)
    # Take the hover metadata from df, restricted to the rows of data so that all columns have the same length
    meta = df.loc[data.index]
    source = ColumnDataSource(data=dict(
        x=u[:,0],
        y=u[:,1],
        image=[embeddable_image(d) for d in data.image],
        token_id=meta["token_id"],
        name=meta["name"],
        description=meta["description"],
        permalink=meta["permalink"],
        price=meta["price_USD"]
    ))

    plot_figure = figure(
        title=title,
        width=600,
        height=600,
        tools=('pan, wheel_zoom, reset')
    )

    plot_figure.add_tools(HoverTool(tooltips="""
    <div>
        <div>
            <img src='@image' style='float: left; margin: 5px 5px 5px 5px'/>
        </div>
    </div>
    """))

    plot_figure.circle(
        'x',
        'y',
        source=source,
        line_alpha=0.6,
        fill_alpha=0.6,
        size=4
    )

    show(plot_figure)

In [None]:
for n in (2, 5, 7, 10): # n_neighbors # 2, 5, 10, 20, 50, 100, 200
    for d in ( 0.01, 0.1, 0.5): # min_dist 0.1, 0.25, 0.5, 0.75, 0.99
        for m in ('euclidean', 'cosine', 'manhattan', 'correlation'):
            draw_umap(n_neighbors=n, min_dist=d, metric=m, title='n_neighbors = {}, min_dist = {}, metric = {}'.format(n, d, m))

As the graphs above show, the tokens are sometimes clustered in groups, but the groups do not follow the price. This suggests that the features share some structure among themselves, but that this structure is not related to the price.

## 4.2 KNN
The first model is k-nearest neighbors (KNN), a classification model that assigns a new data point to the majority class among its k nearest data points. The k is a parameter that can be tuned.
The model is fit once for each value of k from 1 to 39, and the error rate on the test set is compared across these values.
Unfortunately, the score is very low. This means that KNN is not able to classify the tokens based on the features.
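
The voting rule can be illustrated without scikit-learn. The sketch below uses made-up two-dimensional points and a hypothetical `knn_predict` helper with plain Euclidean distance; it is meant to show the mechanism, not to replace `KNeighborsClassifier`.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance to the query
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training points forming two well-separated groups
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["LOW", "LOW", "LOW", "HIGH", "HIGH", "HIGH"]

pred_near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
pred_far = knn_predict(train_X, train_y, (5.5, 5.5))
print(pred_near_origin, pred_far)
```

A small k follows local noise closely, while a large k smooths the decision towards the majority class, which is why the error is compared over a range of k values.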

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
error = []

# Split dataset in train and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

# Calculating error for K values from 1 to 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()

In [None]:
# Plot confusion matrix for KNN with n_neighbors = 5
# (compare with the error plot above to choose the preferred K value)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

plot_confusion_matrix(y_test, y_pred, knn.classes_, title='Confusion matrix KNN n = 5')

# print accuracy
print('Accuracy of KNN classifier on training set: {:.2f}' .format(knn.score(X_train, y_train)))

# print roc curve
plot_roc_curve(y_test, y_pred)

## 4.3 Random Forest

Another model that will be used is the Random Forest. The Random Forest is an ensemble model that uses multiple decision trees to classify the data. The model will be trained with grid search cross validation.

In [None]:
# try again with Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Grid Search
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200, 300, 1000], 'max_features': [ 'sqrt', 'log2']}
grid = GridSearchCV(RandomForestClassifier(), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)

grid_predictions = grid.predict(X_test)

#Print best parameters
print("Best params:")
print(grid.best_params_)
print("Best estimator:")
print(grid.best_estimator_)

# plot confusion matrix
plot_confusion_matrix(y_test, grid_predictions, grid.classes_, title='Confusion matrix Random Forest')

# print roc curve
plot_roc_curve(y_test, grid_predictions)

## 4.4 Decision Tree

The final model that will be used is the Decision Tree. The Decision Tree is a model that uses a tree structure to classify the data. The model will be trained with grid search cross validation.

In [None]:
# try again with Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Grid Search
from sklearn.model_selection import GridSearchCV

param_grid = {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random']}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)

grid_predictions = grid.predict(X_test)

#Print best parameters
print("Best params:")
print(grid.best_params_)
print("Best estimator:")
print(grid.best_estimator_)

# plot confusion matrix
plot_confusion_matrix(y_test, grid_predictions, grid.classes_, title='Confusion matrix Decision Tree')

# print roc curve
plot_roc_curve(y_test, grid_predictions)

# 5. Conclusion
Unfortunately, all the models have a very low score. Even with the hashing trick, the score does not improve.

A possible reason is that NFTs are art, so the price may not be related to the features of the token but rather to the artist and the popularity of the artist.

# 6. Future work

In the future, it would be interesting to analyze the price of the NFTs based on the artist and the popularity of the artist. In addition, it would be interesting to analyze the price of the NFTs based on the category of the NFTs.