### 0 - MODULES AND CONSTANTS
All the modules, libraries, and constants used in this notebook.

In [169]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from pydot import graph_from_dot_data
from IPython.display import Image


DATA_INPUT_FILE = 'wine.data'
CLASSES = ['class']
FEATURE_NAMES = ['Alcohol', 'Malicacid', 'Ash', 'Alcalinity_of_ash', 'Magnesium', 'Total_phenols', 'Flavanoids', 'Nonflavanoid_phenols',
'Proanthocyanins', 'Color_intensity', 'Hue', '0D280_0D315_of_diluted_wines', 'Proline']
DATA_PERCENTAGE = {'TRAIN':0.8, 'TEST':0.2}

### 1 - DATASET LOADING

Load the dataset from sklearn, as described in Subsec. Then, based on your X and y, answer
the following questions:
- How many records are available?
- Are there missing values?
- How many elements does each class contain?

In [170]:
def loadDataBase(file:str, cols:list[str])->pd.DataFrame:
    return pd.read_csv(filepath_or_buffer=file, delimiter=',', header=None, names=cols)

def AnsQuestionPart1(df:pd.DataFrame)->None:
    print('How many records are available ? \t', df.shape[0])
    # isna().sum() counts the NaN cells per column; df[df.isna()].count() always reports 0
    print('Are there missing values ? \n', df.isna().sum())
    print('How many elements does each class contain? \t', df.loc[:, 'class'].value_counts().sort_index())
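
The task text asks for the dataset to be loaded from sklearn, while the cell above reads a local `wine.data` file. As a minimal sketch of the sklearn route, assuming `sklearn.datasets.load_wine` provides the same Wine data (its labels are encoded 0 to 2 rather than 1 to 3), the three questions could be answered like this:

```python
from sklearn.datasets import load_wine

# Assumption: load_wine() stands in for the local 'wine.data' file.
# as_frame=True returns X as a DataFrame and y as a Series.
X, y = load_wine(return_X_y=True, as_frame=True)

n_records = X.shape[0]                        # number of rows
n_missing = int(X.isna().sum().sum())         # total NaN cells over all features
class_counts = y.value_counts().sort_index()  # samples per class label

print('Records:', n_records)
print('Missing values:', n_missing)
print(class_counts)
```

The variable names `n_records`, `n_missing`, and `class_counts` are illustrative and do not appear elsewhere in the notebook.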

### 2 - CLASSIFIER BUILDING AND TRAINING
Create a DecisionTreeClassifier object with the default configuration (i.e. without passing any
parameters to the constructor). <br>Train the classifier using your X and y.

In [171]:
def createAndTrainTree(df:pd.DataFrame, features:list[str], classes:list[str])->DecisionTreeClassifier:
    tree = DecisionTreeClassifier()
    tree.fit(X=df.loc[:, features], y=df.loc[:, classes])
    
    return tree

### 3 - TREE PLOTTING
Now that you have created a tree, you can visualize it. Sklearn offers two functions to visualize decision trees. 
<ul>
<li>The first one, plot_tree(), plots the tree in a matplotlib-based, interactive window.</li>
<li>An alternative way is using export_graphviz(). This function exports the tree as a DOT file. DOT
is a language for describing graphs (and, as a consequence, trees). From the DOT code, you can
generate the visual representation either with specific Python libraries or with any
online tool (such as Webgraphviz). </li>
</ul>
We recommend using the latter approach, where you paste the string returned by export_graphviz (which is the DOT file) directly into Webgraphviz.<br> If, instead, you would rather run it locally, you can install pydot (Python package) and graphviz (a graph
visualization software). <br>
After you have successfully plotted a tree, take a closer look at the result and draw some conclusions. 
<br> In particular, what information is contained in each node? Take a closer look at the leaf
nodes. <br>Based on what you know about overfitting, what can you learn from these nodes?



In [172]:
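def printTree(tree:DecisionTreeClassifier, features:list[str], printGood:bool)->None:
    if printGood:
        # A bare Image(...) inside a function is never rendered, so display() is called explicitly
        from IPython.display import display
        # Export the tree as DOT source, render it to PNG with pydot, and show it
        dot_code = export_graphviz(tree, feature_names=features)
        display(Image(graph_from_dot_data(dot_code)[0].create_png()))
    else:
        # Fall back to the matplotlib-based plot
        plot_tree(tree)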
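
The questions about node contents and leaf nodes can also be checked numerically. The following is a sketch only, assuming `load_wine` as the data source and a fixed `random_state` (neither is part of the original cells). It obtains the DOT string to paste into Webgraphviz and reads the leaf statistics from the fitted tree's `tree_` attribute:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Assumptions: load_wine() as data source, random_state=0 for repeatability.
X, y = load_wine(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_code = export_graphviz(clf, out_file=None)
print(dot_code[:200])

# Leaves are the nodes without children; sklearn marks them with child index -1.
is_leaf = clf.tree_.children_left == -1
leaf_impurity = clf.tree_.impurity[is_leaf]
leaf_samples = clf.tree_.n_node_samples[is_leaf]

print('number of leaves:', int(is_leaf.sum()))
print('largest leaf impurity:', float(leaf_impurity.max()))
print('smallest leaf size:', int(leaf_samples.min()))
```

When the tree is grown without any depth or size limit, every leaf should be pure (Gini impurity of 0), and some leaves may hold only a handful of samples. Leaves like that suggest the tree has memorized the training data.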


### 4 - PREDICTION FOR OVERFITTING

Given the dataset X, you can get the predictions of the classifier (one for each entry in X) by calling
the predict() method of DecisionTreeClassifier. <br>
Then, use the accuracy_score() function (which you can import from sklearn.metrics) to compute the accuracy between two lists of values (y_true,
the list of “correct” labels, and y_pred, the list of predictions made by the classifier). 
<br> Since you already have both these lists (y for the ground truth, and the result of the predict() method for the
prediction), you can already compute the accuracy of your classifier. 
<br> What result do you get? 
<br> Does this result seem particularly high/low? 
<br> Why do you think that is?


In [173]:
def testOverFittingValues(tree:DecisionTreeClassifier, df:pd.DataFrame, classes:list[str], features:list[str], method: object) -> float:
    return method(df.loc[:, classes], tree.predict(df.loc[:, features])) *100
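
To make the metric concrete, here is a small worked example on hypothetical labels (not taken from the Wine data). It compares `accuracy_score` with `balanced_accuracy_score`, which the notebook imports and which can be passed as `method` in the same way:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels, chosen only to make the arithmetic easy to follow.
y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 3]

# accuracy = correct predictions / all predictions = 3 / 4
acc = accuracy_score(y_true, y_pred)

# balanced accuracy = mean of the per-class recall = (1/2 + 1/1 + 1/1) / 3
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(acc, bal_acc)
```

The two metrics differ when classes are not equally well predicted, because balanced accuracy weights each class equally regardless of its size.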

### 5 - PARTITIONED DATA SET TESTING AND ACCURACY

Now, we can split our dataset into a training set and a test set. <br>
We will use the training set to train a model, and the test set to assess its performance. <br>
Sklearn offers the train_test_split() function to split any number of arrays (all having the same length on the first dimension) into two
sets. <br> You can use an 80/20 train/test split. If used correctly, you will get 4 arrays: X_train, X_test, y_train, y_test.
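
A minimal sketch of the split described above, assuming `load_wine` as the data source. The `random_state` and `stratify` arguments are additions made here for repeatability and balanced class proportions; the exercise text does not require them:

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_wine() stands in for the notebook's own X and y.
X, y = load_wine(return_X_y=True)

# Assumptions: random_state fixes the shuffle, stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train on the training portion only.
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Compare accuracy on data the model has seen with data it has not.
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print('train/test sizes:', len(X_train), len(X_test))
print('train accuracy:', train_acc)
print('test accuracy:', test_acc)
```

A gap between the training accuracy and the test accuracy is the usual sign of overfitting, which is why the test score is the more honest estimate of performance.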

### MAIN FUNCTION
This is the main function of the program. It coordinates code execution and does the following:
<ol>
<li>Loads the database into a data structure</li>
<li>Creates and trains the classifier</li>
<li>Prints the tree</li>
<li>Evaluates accuracy on the training data to check for overfitting</li>
<li>Tests the model with a partitioned data set</li>
</ol>

In [174]:
def main()->None:
    df = loadDataBase(DATA_INPUT_FILE, CLASSES+FEATURE_NAMES) # 1
    # AnsQuestionPart1(df) # 1
    
    tree = createAndTrainTree(df, FEATURE_NAMES, classes=CLASSES) # 2
    
    # printTree(tree, FEATURE_NAMES, False) # 3
    
    print("'DUMB' Accuracy of the model ", testOverFittingValues(tree, df, CLASSES, FEATURE_NAMES, accuracy_score))
    
    
    
main()

'DUMB' Accuracy of the model  100.0
