# Final Assignment Machine Learning

## Introduction

Two cases are briefly presented below.

The first case has a small dataset and is fairly easy; the second is of intermediate difficulty.

For this final assignment you should work out both cases. Each case can be treated as a typical classification problem. The data for both cases are available on the UCI website, and both datasets are labeled.

For each case the following should be done:
+ Formulate the question you are trying to answer.
+ Clearly describe the problem that you want to solve.
+ State which features and labels you start with, and motivate your choices (e.g. based on literature).
+ Describe the dataset.
+ Find out which features are the most important. Should you add or remove features?
+ Show how far you can get with K-means clustering.
+ Apply different classification algorithms, vary the values of the most important parameters, experiment with the number of features, and keep records of the algorithms' scores.
+ Motivate your choices and, of course, support your research journey with appealing and informative graphs and diagrams.
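
One possible way to keep records of scores while varying algorithms and parameters is sketched below. It is only an illustration: the choice of classifiers, the parameter values, the 5-fold cross-validation, and the use of `load_wine` (the copy of the UCI Wine data bundled with scikit-learn) are all assumptions, not requirements of the assignment.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: use the bundled copy of the Wine data to keep the sketch offline
X, y = load_wine(return_X_y=True)

# Hypothetical candidates: two algorithms, two parameter values each
candidates = {
    "knn_k=3": KNeighborsClassifier(n_neighbors=3),
    "knn_k=7": KNeighborsClassifier(n_neighbors=7),
    "tree_depth=2": DecisionTreeClassifier(max_depth=2, random_state=0),
    "tree_depth=5": DecisionTreeClassifier(max_depth=5, random_state=0),
}

records = []
for name, model in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5)  # accuracy per fold
    records.append(
        {"model": name, "mean_accuracy": scores.mean(), "std": scores.std()}
    )

# One row per experiment, best mean accuracy first
results = pd.DataFrame(records).sort_values("mean_accuracy", ascending=False)
print(results)
```

Collecting every run in one table like this makes it easy to plot or compare the experiments later on.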

## Case 1 - Wine

**Data Set Information**

The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. 

See: [UCI Wine](http://archive.ics.uci.edu/ml/datasets/Wine)

### Preparing the dataset

In [5]:
import pandas as pd

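data = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
    delimiter=',', encoding="utf-8", header=None,
    # The first column of wine.data is the class label (cultivar 1-3);
    # the remaining 13 columns are the chemical features
    names=['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
           'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
           'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline'],
    # Keep the class label as the index so the columns hold only the features
    index_col='Class',
)
# Fill any missing values by linear interpolation
data_cleaned = data.interpolate()

data_cleaned.head()

| Class | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |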


In [30]:
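from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Group the samples into 3 clusters (ideally one per cultivar)
est = KMeans(3)
est.fit(data_cleaned)

# Cluster index assigned to each sample
y_est = est.predict(data_cleaned)

# Note: labels_ holds the cluster assignments found during fit, not the true
# cultivar classes, so the matrix below compares the clustering with itself
# and is diagonal by construction
labels = est.labels_

plt.imshow(confusion_matrix(y_est, labels), cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.grid(False)
plt.ylabel('cluster from predict')
plt.xlabel('cluster from labels_')
confusion_matrix(y_est, labels)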

array([[62,  0,  0],
       [ 0, 47,  0],
       [ 0,  0, 69]])
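
To judge how well K-means recovers the cultivars, the cluster assignments need to be compared with the true class labels rather than with `est.labels_`, which are the same assignments. The following is a minimal sketch of such a comparison. It assumes the bundled `load_wine` copy of the data, standardized features, and a fixed `random_state`; none of these choices come from the notebook above. The adjusted Rand index is used as one possible agreement score because it does not depend on how the clusters happen to be numbered.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the UCI Wine data; classes are coded 0, 1, 2
X, y_true = load_wine(return_X_y=True)

# Assumption: standardize first so no single large-valued feature dominates
# the Euclidean distances used by K-means
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_cluster = kmeans.fit_predict(X_scaled)

# Rows are true cultivars, columns are cluster ids. Cluster ids are arbitrary,
# so a good clustering shows one dominant cell per row, not necessarily
# on the diagonal.
cm = confusion_matrix(y_true, y_cluster)
ari = adjusted_rand_score(y_true, y_cluster)

print(cm)
print(f"Adjusted Rand index: {ari:.3f}")
```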

In [18]:
from sklearn.preprocessing import StandardScaler

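# Standardize the features: fit the scaler on the training set only,
# then apply the same transformation to both sets.
# X_train and X_test are assumed to come from an earlier train/test split
# that is not shown here.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)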
X_train[:5,:]

array([[-1.4398785 , -0.78560713],
       [ 0.62763934, -0.49047417],
       [ 0.43073288, -1.24566733],
       [ 0.88607907,  2.84279219],
       [ 0.77531919,  2.23516551]])
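
The cell above relies on `X_train` and `X_test`, which are not defined in this notebook excerpt. A sketch of how such a split might be produced is given below. The two selected features, the 25% test size, the stratification, and the `random_state` are all assumptions made for illustration, so the numbers will not match the output shown above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the UCI Wine data
X, y = load_wine(return_X_y=True)

# Assumption: keep only the first two features (alcohol and malic acid)
X = X[:, :2]

# Hold out 25% of the samples, keeping the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training data only, so that no information
# from the test set leaks into the preprocessing
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training columns now have mean 0 and standard deviation 1;
# the test columns are only approximately standardized
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```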

## Case 2 - Heart Disease

**Data Set Information**

A number of attributes are listed that possibly influence heart disease. The presence of heart disease in the patient is encoded as an integer from 0 (no presence) to 4.

The names and social security numbers of the patients were recently removed from the database, and replaced with dummy values. 

One file, the one containing the Cleveland database, has been "processed"; use this one.

See: [UCI Heart Disease](http://archive.ics.uci.edu/ml/datasets/Heart+Disease)
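
As a starting point, loading the processed Cleveland file could look roughly like the sketch below. Several things here are assumptions to verify against the UCI documentation: that the file is comma-separated without a header row, that missing values are written as `?`, and the 14 column names. The helper `load_cleveland` and the binary `has_disease` column are hypothetical additions, and the three sample rows are made up purely to illustrate the format.

```python
import io

import pandas as pd

# Assumption: attribute names as listed in the UCI documentation
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def load_cleveland(source):
    """Read the processed Cleveland data from a path, URL, or file-like object."""
    # Assumption: no header row, and '?' marks a missing value
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    # Hypothetical simplification: collapse the 0-4 label into absent/present
    df["has_disease"] = (df["num"] > 0).astype(int)
    return df


# Made-up rows in the assumed file format, used instead of a download
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,?,7.0,0\n"
)
heart = load_cleveland(sample)

# Count missing values per column before deciding how to handle them
print(heart.isna().sum())
```

Whether to keep the five-level label or the binary one is a modelling choice that should be motivated in the report.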

### Good luck