<a href="https://colab.research.google.com/github/JSJeong-me/Machine_Learning/blob/main/ML/6_wine_0818_mutate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook is an exercise in the [Intro to Deep Learning](https://www.kaggle.com/learn/intro-to-deep-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/ryanholbrook/a-single-neuron).**

---


In [None]:
!hostname

# Introduction #

In the tutorial we learned about the building blocks of neural networks: *linear units*. We saw that a model of just one linear unit will fit a linear function to a dataset (equivalent to linear regression). In this exercise, you'll build a linear model and get some practice working with models in Keras.
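
As a rough illustration of that equivalence, the sketch below builds a single linear unit by hand in NumPy and fits it with ordinary least squares. The data is synthetic, and the names `linear_unit`, `true_w`, and `true_b` are illustrative only; none of them come from the course material.

```python
import numpy as np

def linear_unit(X, w, b):
    """Output of one linear unit: a weighted sum of the inputs plus a bias."""
    return X @ w + b

# Synthetic, noise-free data generated from a known linear rule (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 0.7
y = linear_unit(X, true_w, true_b)

# Ordinary least squares: append a column of ones so the bias is fitted too.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = solution[:-1], solution[-1]

print(w_hat, b_hat)
```

Because the data has no noise, the least-squares fit should recover the weights and bias that generated it, which is the same function a single trained linear unit would converge toward.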

Before you get started, run the code cell below to set everything up.

The *Red Wine Quality* dataset consists of physicochemical measurements from about 1600 Portuguese red wines. Also included is a quality rating for each wine from blind taste-tests.

First, run the next cell to display the first few rows of this dataset.

In [None]:
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('winequality-red.csv')
display(red_wine.head())  # show the first few rows

In [None]:
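# Total acidity: fixed plus volatile acidity
red_wine['total_acidity'] = red_wine['fixed acidity'] + red_wine['volatile acidity']
# Total acidity including citric acid
red_wine['total_acidity_citric'] = red_wine['total_acidity'] + red_wine['citric acid']
# Bound SO2: total minus free sulfur dioxide
red_wine['bound_sulphur_dioxide'] = red_wine['total sulfur dioxide'] - red_wine['free sulfur dioxide']
# Original minus final specific gravity, derived from the alcohol percentage
red_wine['org_minus_fail_SG'] = (red_wine['alcohol'] * 7.36)/1000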

In [82]:
# len(red_wine.columns)

16

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = red_wine.corr(method='pearson')
sns.heatmap(corr, xticklabels = corr.columns.values,
           yticklabels=corr.columns.values)

In [None]:
red_wine['quality'].hist()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
from tensorflow.keras.utils import to_categorical

In [86]:
# X.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'total_acidity', 'total_acidity_citric',
       'bound_sulphur_dioxide', 'org_minus_fail_SG'],
      dtype='object')

In [87]:
from sklearn.model_selection import train_test_split

# X = red_wine.iloc[:,0:11]
X = red_wine.drop(['quality'], axis=1)

y = red_wine['quality']

In [88]:
X.shape

(1599, 15)

In [None]:
y.shape

In [None]:
type(y)

In [None]:
y.min()

In [89]:
le = LabelEncoder()

In [90]:
y = le.fit_transform(y)

In [None]:
type(y)

In [None]:
y.min()

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                  y,
                                                  test_size=0.2,
                                                  random_state = 42)

In [None]:
y_test.shape

In [92]:
# One-hot encode the label-encoded quality classes (6 classes)
y_train_cat = to_categorical(y_train, 6)
y_test_cat = to_categorical(y_test, 6)

In [None]:
y_test_cat.shape

In [None]:
# Scale to [0, 1]
# max_ = df_train.max(axis=0)
# min_ = df_train.min(axis=0)
# df_train = (df_train - min_) / (max_ - min_)
# df_valid = (df_valid - min_) / (max_ - min_)

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [93]:
scaler = MinMaxScaler(feature_range=(0, 1))

In [94]:
X_train = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train)

In [95]:
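# Reuse the scaler fitted on the training data; refitting on the test set
# would scale the two sets differently and leak test-set statistics.
X_test = scaler.transform(X_test)
X_test = pd.DataFrame(X_test)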

In [None]:
type(X_test)

In [None]:
red_wine.describe()

In [97]:
from tensorflow import keras
from tensorflow.keras import layers, callbacks

model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=[15]),
    layers.Dropout(rate=0.5),
    layers.BatchNormalization(),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.5),
    layers.Dense(32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.5),
    layers.Dense(6, activation='softmax'),
])

In [None]:
# early_stopping = callbacks.EarlyStopping(
#     min_delta=0.001, # minimum amount of change to count as an improvement
#     patience=20, # how many epochs to wait before stopping
#     restore_best_weights=True,
# )

In [None]:
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy', metrics=['accuracy']
)

In [None]:
history = model.fit(
    X_train, y_train_cat,
    validation_data = (X_test,y_test_cat),
    batch_size=128,
    # callbacks=[early_stopping],
    epochs=150,
)

# 2) Define a linear model

Now define a linear model appropriate for this task. Pay attention to how many inputs and outputs the model should have.
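
To make the input and output counts concrete, here is a small sketch that counts the parameters of a fully connected layer. It assumes the 15 feature columns reported by `X.shape` earlier in the notebook and the 6 classes passed to `to_categorical`; the helper `dense_param_count` is hypothetical and is not part of Keras.

```python
def dense_param_count(n_inputs, n_units):
    """Parameters in a fully connected layer:
    one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 15  # X.shape is (1599, 15) once the engineered columns are included

# A single linear unit that predicts one number (for example, a quality score).
linear_params = dense_param_count(n_features, 1)

# A linear layer with one output per encoded quality class (six classes here).
six_way_params = dense_param_count(n_features, 6)

print(linear_params, six_way_params)
```

These counts should correspond to the "Param #" column that `model.summary()` reports for a `Dense` layer with the same number of inputs and units.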

In [None]:
model.summary()

In [None]:
import pandas as pd

# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
history_df.loc[0:100,['loss','val_loss']].plot()

In [None]:
plt.plot(history.history['accuracy'], label='Accuracy training data')
plt.plot(history.history['val_accuracy'], label='Accuracy validation data')
plt.legend()
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.show()

In [None]:
# loss_value, accuracy_value = model.evaluate(X_test, y_test_cat, verbose=0)

# print(loss_value)

In [None]:
y_pred = model.predict(X_test)

In [None]:
type(y_pred)

In [None]:
y_pred[-50,2]

In [None]:
y_test[-50]

https://www.tensorflow.org/datasets/catalog/wine_quality

# Creating new features #

Deriving new features from existing ones is worth trying, to check whether they improve accuracy or correlate more strongly with the dependent variable. After a little research, I came up with the following new features:

- `total_acidity`: the sum of fixed and volatile acidity. [fixed acidity + volatile acidity]
- `total_acidity_citric`: I learned that citric acid is a type of titratable (total) acid, so I considered adding it to total acidity. Instead, let's create a separate feature for it. [fixed acidity + volatile acidity + citric acid]
- `bound_sulphur_dioxide`: total sulphur dioxide is the sum of bound (fixed) SO2 and free sulphur dioxide. [total sulphur dioxide - free sulphur dioxide]
- `org_minus_final_SG` (stored as `org_minus_fail_SG` in the code): the difference between original and final specific gravity, obtained by rearranging the relation below for the gravity difference.

$$\%\,\text{alcohol (by volume)} = \frac{\text{Original Specific Gravity} - \text{Final Specific Gravity}}{7.36} \times 1000$$
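
A quick worked example of the specific-gravity relation, assuming the constant 7.36 from the formula above. The function names and the sample alcohol value are illustrative; the notebook itself computes the feature inline as `alcohol * 7.36 / 1000`.

```python
import math

def gravity_drop(alcohol_pct):
    """Original minus final specific gravity implied by an alcohol percentage."""
    return alcohol_pct * 7.36 / 1000

def alcohol_from_gravity_drop(drop):
    """Inverse relation: alcohol percentage by volume from the gravity drop."""
    return drop / 7.36 * 1000

abv = 9.4  # illustrative alcohol percentage by volume
drop = gravity_drop(abv)
recovered = alcohol_from_gravity_drop(drop)

print(round(drop, 6))
print(math.isclose(recovered, abv))
```

Since the new column is just the alcohol column multiplied by a constant, it is perfectly correlated with `alcohol`, which is worth keeping in mind when reading the correlation heatmap.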
 
Fun fact: Coke has about the same level of sugar, at 108 g/L, as some of the sweetest dessert wines!

In [None]:
red_wine.columns

In [None]:
# dataset_additional = data.copy()

# dataset_additional['total_acidity'] = dataset_additional['fixed acidity'] + dataset_additional['volatile acidity']
# dataset_additional['total_acidity_citric'] = dataset_additional['total_acidity'] + dataset_additional['citric acid']
# dataset_additional['bound_sulphur_dioxide'] = dataset_additional['total sulfur dioxide'] - dataset_additional['free sulfur dioxide']
# dataset_additional['org_minus_fail_SG'] = (dataset_additional['alcohol'] * 7.36)/1000