# Classification

In this notebook we test whether a decision tree or an SVM does a better job of classifying a wine as red or white from the wine's features.

## Importing DataSet

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
wine_data = pd.read_csv("wineData.csv")
wine_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
type                    6497 non-null object
fixed acidity           6487 non-null float64
volatile acidity        6489 non-null float64
citric acid             6494 non-null float64
residual sugar          6495 non-null float64
chlorides               6495 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6488 non-null float64
sulphates               6493 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 659.9+ KB


## Cleaning the Dataset

In [2]:
wineDf = wine_data.copy()

# Columns that wine_data.info() reports as having missing values
columns_with_missing = ["fixed acidity", "volatile acidity", "citric acid",
                        "residual sugar", "chlorides", "pH", "sulphates"]

# Fill each missing value with the mean of its own column. Assigning the result
# back avoids calling fillna(inplace=True) on a column selection, which newer
# pandas versions may not propagate to the DataFrame.
wineDf[columns_with_missing] = wineDf[columns_with_missing].fillna(
    wineDf[columns_with_missing].mean()
)

wineDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
type                    6497 non-null object
fixed acidity           6497 non-null float64
volatile acidity        6497 non-null float64
citric acid             6497 non-null float64
residual sugar          6497 non-null float64
chlorides               6497 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 659.9+ KB


### Graphs of Features

In [38]:
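# masked_where hides the entries where the condition is True,
# so each array keeps only the rows of its own wine type
white_alcohol = np.ma.masked_where(wineDf["type"] != "white", wineDf["alcohol"])
red_alcohol = np.ma.masked_where(wineDf["type"] != "red", wineDf["alcohol"])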

plt.scatter(white_alcohol, wineDf['volatile acidity'], color="blue")
plt.scatter(red_alcohol, wineDf["volatile acidity"], color="red")
plt.xlabel("Alcohol")
plt.ylabel("Volatile Acidity")
plt.legend(["White", "Red"])
plt.show()

In [39]:
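# masked_where hides the entries where the condition is True,
# so each array keeps only the rows of its own wine type
white_sugar = np.ma.masked_where(wineDf["type"] != "white", wineDf["residual sugar"])
red_sugar = np.ma.masked_where(wineDf["type"] != "red", wineDf["residual sugar"])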

plt.scatter(white_sugar, wineDf['volatile acidity'], color="blue")
plt.scatter(red_sugar, wineDf["volatile acidity"], color="red")
plt.xlabel("Residual Sugar")
plt.ylabel("Volatile Acidity")
plt.legend(["White", "Red"])
plt.show()

## Splitting the Dataset

In [3]:
from sklearn.model_selection import train_test_split

wine_train_set, wine_test_set = train_test_split(wineDf, test_size=0.2, random_state=123)
print(len(wine_train_set), len(wine_test_set))
print(wine_train_set.head())
print(wine_test_set.head())

5197 1300
       type  fixed acidity   ...     alcohol  quality
6452    red            6.6   ...        11.0        6
5110    red           11.6   ...        10.2        6
2792  white            6.8   ...         9.4        5
1879  white            7.2   ...         9.2        6
2742  white            8.0   ...         9.5        6

[5 rows x 13 columns]
       type  fixed acidity   ...     alcohol  quality
1321  white            7.3   ...        13.2        6
2767  white            7.9   ...         9.5        6
5069    red            8.0   ...         9.2        6
5780    red            8.4   ...        12.0        6
547   white            7.7   ...        11.8        6

[5 rows x 13 columns]


## Decision Tree

### Notes: 
The target is the type of wine, to see whether the type can be classified from the wine's chemical properties. The initial X features are quality and alcohol because those seem to be important.

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X = wine_train_set[["alcohol", "quality"]]
y = wine_train_set['type']

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [5]:
y_predicted = tree_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

[[  93 1205]
 [  58 3841]]
Accuracy is  0.7569751779873004
Precision is  0.7249061994745626
Sensitivity is  0.7569751779873004
F1 is  0.6763696358184811


### Notes: 
With the initial X features the tree reaches about 76% accuracy and a weighted F1 score of about 68% on the training set. I am going to try different X features to get better results.
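The weighted scores printed above can be reproduced by hand from the confusion matrix, which makes them easier to interpret. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, in sorted label order, so red comes before white). The helper name `weighted_scores` is made up for illustration.

```python
def weighted_scores(matrix):
    """Return (accuracy, precision, recall, f1) using support-weighted averaging.

    matrix[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))

    precision = recall = f1 = 0.0
    for i in range(n_classes):
        support = sum(matrix[i])                                  # true samples of class i
        predicted = sum(matrix[r][i] for r in range(n_classes))   # predictions of class i
        p = matrix[i][i] / predicted if predicted else 0.0
        r = matrix[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                                  # class share of all samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return correct / total, precision, recall, f1


# Training-set confusion matrix printed for the alcohol + quality tree
train_matrix = [[93, 1205],
                [58, 3841]]
accuracy, precision, recall, f1 = weighted_scores(train_matrix)
print(accuracy, precision, recall, f1)

# Share of white wines in the training set, i.e. the accuracy of always guessing "white"
majority_share = sum(train_matrix[1]) / sum(sum(row) for row in train_matrix)
print(majority_share)
```

Read this way, the first model's accuracy is barely above the share of white wines in the training set (3899 of 5197, about 75%), and only 93 of the 1298 red wines are recognised, which is why the weighted F1 score is so much lower than the accuracy.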

In [6]:
X = wine_train_set[["alcohol", "residual sugar", "volatile acidity"]]
y = wine_train_set['type']

tree_classifier2 = DecisionTreeClassifier()
tree_classifier2.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [7]:
y_predicted = tree_classifier2.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

[[1298    0]
 [  13 3886]]
Accuracy is  0.9974985568597268
Precision is  0.9975233614065029
Sensitivity is  0.9974985568597268
F1 is  0.99750270034275


### Notes: 
After extensive testing, the best training-set result was an accuracy and F1 score of over 99%, using alcohol, residual sugar, and volatile acidity. That makes sense, since those are the main contributing factors to the wines.

## Testing Decision Tree

In [8]:
X = wine_test_set[["alcohol", "residual sugar", "volatile acidity"]]
y = wine_test_set['type']

In [9]:
y_predicted = tree_classifier2.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

[[246  55]
 [ 65 934]]
Accuracy is  0.9076923076923077
Precision is  0.9088722422031914
Sensitivity is  0.9076923076923077
F1 is  0.9082142933012857


### Notes: 
On the test set the accuracy and F1 score are about 91%, down from over 99% on the training set. That is still pretty good, but the gap shows that the decision tree is overfitting.
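One common way to reduce overfitting in a decision tree is to limit how deep it may grow and to compare settings with cross-validation on the training data. The sketch below is only an illustration of that pattern: it uses synthetic data from `make_classification` as a stand-in for the wine features, and the `max_depth` value of 4 is an arbitrary assumption, not a tuned choice for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the wine dataset): 3 numeric features, 2 classes,
# with 10% of the labels flipped at random to imitate noise
X_demo, y_demo = make_classification(n_samples=1000, n_features=3, n_informative=3,
                                     n_redundant=0, flip_y=0.1, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123)

full_tree = DecisionTreeClassifier(random_state=123)                  # no depth limit
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=123)  # assumed limit

for name, model in [("unlimited depth", full_tree), ("max_depth=4", shallow_tree)]:
    # 5-fold cross-validated accuracy, estimated from the training data only
    cv_accuracy = cross_val_score(model, X_train, y_train, cv=5).mean()
    model.fit(X_train, y_train)
    print(name,
          "| cv:", round(cv_accuracy, 3),
          "| train:", round(model.score(X_train, y_train), 3),
          "| test:", round(model.score(X_test, y_test), 3))
```

Applied to the wine data, the same loop would use `wine_train_set` and `wine_test_set` with the three chosen features. A cross-validated score well below the training score would flag overfitting before the test set is touched.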

## SVM

In [10]:
from sklearn.svm import SVC

X = wine_train_set[["alcohol", 'residual sugar', "volatile acidity"]]
y = wine_train_set['type']

svm_classifier = SVC(kernel="rbf", gamma = "auto")
svm_classifier.fit(X,y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [11]:
y_predicted = svm_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

[[ 974  324]
 [ 101 3798]]
Accuracy is  0.9182220511833751
Precision is  0.9175633550841831
Sensitivity is  0.9182220511833751
F1 is  0.9155163519780091


### Notes: 
On the training set the decision tree did better than the SVM at classifying the wine: the decision tree scored over 99% accuracy and the SVM about 91.8%. Below, further testing is done to see whether the SVM score can be improved.

In [12]:
X = wine_train_set[["alcohol", 'residual sugar', "chlorides", "volatile acidity"]]
y = wine_train_set['type']

svm_classifier2 = SVC(kernel="rbf", gamma = "auto")
svm_classifier2.fit(X,y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [13]:
y_predicted = svm_classifier2.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

[[ 994  304]
 [  86 3813]]
Accuracy is  0.924956705791803
Precision is  0.9247138539283412
Sensitivity is  0.924956705791803
F1 is  0.922537382531821


### Notes: 
The best training-set accuracy that could be found was about 92.5%, using alcohol, residual sugar, volatile acidity, and chlorides.
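RBF-kernel SVMs are sensitive to the scale of their input features, so standardizing the columns before fitting is another option worth trying alongside different feature sets. The sketch below shows the pattern with a scikit-learn pipeline. The data is synthetic (one column is deliberately multiplied by 100 to mimic mismatched scales), so it says nothing about how much, if at all, scaling would help on the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the wine dataset): 4 numeric features, 2 classes
X_demo, y_demo = make_classification(n_samples=600, n_features=4, n_informative=4,
                                     n_redundant=0, random_state=123)
X_demo[:, 0] *= 100.0  # put one feature on a much larger scale than the others

# Same SVC settings as in this notebook, without and with standardization
plain_svm = SVC(kernel="rbf", gamma="auto")
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))

# The pipeline refits the scaler inside each fold, so no fold sees the others' statistics
plain_cv = cross_val_score(plain_svm, X_demo, y_demo, cv=5).mean()
scaled_cv = cross_val_score(scaled_svm, X_demo, y_demo, cv=5).mean()
print("without scaling:", round(plain_cv, 3))
print("with scaling:   ", round(scaled_cv, 3))
```

Because the scaler lives inside the pipeline, calling `fit` on the training set and `predict` on the test set applies the training-set means and standard deviations to both.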

## Testing SVM

In [14]:
X = wine_test_set[["alcohol", 'residual sugar', "chlorides", "volatile acidity"]]
y = wine_test_set['type']

In [15]:
y_predicted = svm_classifier2.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)
print ("Accuracy is ", accuracy_score(y, y_predicted))
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

[[234  67]
 [ 18 981]]
Accuracy is  0.9346153846153846
Precision is  0.9343327950675278
Sensitivity is  0.9346153846153846
F1 is  0.9325011689750455


### Notes:
The accuracy and F1 score increased by about 1 percentage point on the test set, so the SVM does not seem to be overfitting.

## Overall Notes

On the training set the decision tree did a better job at classifying the type of wine, with an accuracy and F1 score of over 99%. It used the features alcohol, residual sugar, and volatile acidity, which makes sense because in the correlation heat maps those had the largest effects on the wine. On the test set, however, its accuracy and F1 score dropped to about 91%, so it is overfitting.

The SVM did a little worse on the training set but still well, with an accuracy and F1 score of about 92% using the features alcohol, chlorides, volatile acidity, and residual sugar. It does not appear to be overfitting: on the test set it scored about 93%, slightly better than on the training set and better than the decision tree.
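As a cross-check, the training and test accuracies of both final models can be recomputed from the confusion matrices printed in this notebook. The snippet below does only that arithmetic. The dictionary layout and the `accuracy` helper are illustrative choices, not part of the original analysis.

```python
def accuracy(matrix):
    """Fraction of samples that lie on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


# Confusion matrices copied from the outputs of cells 7, 9, 13 and 15
confusion_matrices = {
    "decision tree": {"train": [[1298, 0], [13, 3886]],
                      "test": [[246, 55], [65, 934]]},
    "SVM": {"train": [[994, 304], [86, 3813]],
            "test": [[234, 67], [18, 981]]},
}

for model_name, splits in confusion_matrices.items():
    train_acc = accuracy(splits["train"])
    test_acc = accuracy(splits["test"])
    # A large positive gap (train minus test) is the usual sign of overfitting
    print(f"{model_name}: train {train_acc:.3f}, test {test_acc:.3f}, "
          f"gap {train_acc - test_acc:+.3f}")
```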