In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Wine-quality-classifier 🍷🍾

Follow this notebook to the end and you'll be able to predict whether a wine is of good or bad quality.

**In this notebook we'll classify the quality of wine as Good or Bad.
Although the target variable in the data is a score between 0 and 10, we can turn it into two classes by labelling every observation as follows:**
        
* quality >= 7  as  good
* quality < 7 as bad

First, let's import the data and take a look at it.

In [None]:
import pandas as pd

df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

Let's check if there are any null values

In [None]:
df.isna().sum()

Wow, our data has no null values, so there's no need to handle missing data. 😃😃

Now let's check the information about the features we have.

In [None]:
df.info()

So we have 1599 rows in total, each with 12 columns: 11 input features plus the quality target.
Let's explore the data.

In [None]:
df['quality'].value_counts()

So we can see that only 217 (199+18) of the 1599 records have quality >= 7, i.e. are of good quality. The two classes are therefore quite imbalanced.
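
To put a number on that imbalance, here is a small sketch that rebuilds the two classes from the counts quoted above (217 good wines out of 1599 rows) and asks pandas for their shares. The `labels` Series is only a stand-in built for illustration; on the real data you would call `value_counts(normalize=True)` on the target column itself.

```python
import pandas as pd

# Stand-in labels rebuilt from the counts quoted above:
# 199 + 18 = 217 good wines, and 1599 - 217 = 1382 bad ones.
labels = pd.Series(['good'] * 217 + ['bad'] * 1382)

# normalize=True returns fractions instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```

Good wines make up roughly one record in seven, which is worth keeping in mind when reading accuracy scores later on.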

Now let's convert our target variable (quality) into the labels good and bad.

In [None]:
# Use consistent lower-case labels for both classes
df['quality'] = ['good' if x >= 7 else 'bad' for x in df['quality']]
df.head()

As you can see, our quality variable now holds the values good and bad.
Our data needs just one more thing to be ready for modelling.

We have to bring all feature values onto a comparable scale using **Standardization** (zero mean and unit variance for each feature). Distance-based models such as KNN and SVM are sensitive to features with large ranges, so this helps them predict well.

To do this we follow these steps:
* Split the data into x (input features) and y (target variable)
* Then use StandardScaler to bring the values of 'x' onto one scale.
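
As a rough illustration of what standardization does, the sketch below applies the formula (value minus column mean, divided by column standard deviation) by hand and compares it with `StandardScaler`. The numbers in `X_toy` are made up for the example and are not taken from the wine data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on very different scales
X_toy = np.array([[7.4, 34.0],
                  [7.8, 67.0],
                  [11.2, 60.0],
                  [6.9, 18.0]])

# Manual standardization: subtract the column mean and divide by the
# column standard deviation (population std, ddof=0, as StandardScaler uses)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation done by scikit-learn
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates a distance calculation just because of its units.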

In [None]:
# Split data
x = df.drop('quality', axis=1)
y = df['quality']

x.head()

In [None]:
y.head()

In [None]:
# Import standard scaler
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

# apply scaler
x = ss.fit_transform(x)

x

Yeah! Our data is ready for modelling, and we can then predict whether a wine is of good or bad quality. 😀

### Modelling

We'll use the following models and then evaluate them to find which one works best:

* KNN
* SVM
* Random Forest
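
Since all three models are evaluated with the same cross-validation setup, the comparison can also be written as a single loop. The sketch below is one possible way to do that; it runs on synthetic data from `make_classification` (an assumption made so the block is self-contained), and the names `X_demo`, `y_demo`, `models` and `results` are illustrative only. In this notebook you would pass the scaled `x` and the labels `y` instead.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data with an imbalanced binary target (not the wine data)
X_demo, y_demo = make_classification(n_samples=300, n_features=11,
                                     weights=[0.85, 0.15], random_state=1)

# The three candidate models, keyed by a display name
models = {
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Random Forest': RandomForestClassifier(random_state=1),
}

# Fewer splits and repeats than the 10x3 used in this notebook,
# only to keep the demo quick
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, scoring='accuracy', cv=cv)
    results[name] = (mean(scores), std(scores))
    print('%s accuracy: %.3f (%.3f)' % (name, results[name][0], results[name][1]))
```

Using one shared `cv` object means every model is scored on exactly the same folds, which makes the accuracies directly comparable.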

**KNN**

In [None]:
#import modules
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# define and configure the model
model = KNeighborsClassifier()

# evaluate the model 
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report model performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
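
One caveat about the evaluation above: the scaler was fitted on all rows before cross-validation, so the held-out folds have already influenced the scaling statistics. A possible variation, sketched here under the assumption of synthetic stand-in data (`X_raw`, `y_demo` are illustrative names), is to put the scaler and the model into a scikit-learn pipeline so that the scaler is refitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in data (not the wine data)
X_raw, y_demo = make_classification(n_samples=300, n_features=11,
                                    weights=[0.85, 0.15], random_state=1)

# Scaler and classifier chained together: during cross-validation the
# scaler is fitted on the training folds only, then applied to the test fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
pipe_scores = cross_val_score(pipe, X_raw, y_demo, scoring='accuracy', cv=cv)

print('Accuracy: %.3f (%.3f)' % (pipe_scores.mean(), pipe_scores.std()))
```

With the real data, the unscaled feature table would go into `cross_val_score` in place of `X_raw`. The difference in score is often small, but the estimate is cleaner.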


**SVM model**

In [None]:
#import modules
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

# define and configure the model
model = SVC()

# evaluate the model 
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report model performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


**Random forest model**

In [None]:
#import modules
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# define and configure the model
# fix the seed so the forest gives the same result on every run
model = RandomForestClassifier(random_state=1)

# evaluate the model 
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report model performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


**So we can see that the accuracy of Random forest is the highest.**

Hence we can classify whether a wine is good or bad in quality using Random forest.

**Accuracy is about 90%**
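
To judge that figure, it helps to know what a model scores if it always predicts the majority class. The sketch below estimates this baseline with scikit-learn's `DummyClassifier`, using labels rebuilt from the counts reported in this notebook (217 good wines out of 1599). The names `y_counts` and `X_placeholder` are illustrative; the dummy model ignores the features, so a column of zeros stands in for them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Labels rebuilt from the counts reported in this notebook:
# 217 good wines out of 1599 rows, so 1382 bad ones
y_counts = np.array(['good'] * 217 + ['bad'] * 1382)

# DummyClassifier ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((len(y_counts), 1))

# Always predict the most frequent class ('bad')
baseline = DummyClassifier(strategy='most_frequent')
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
baseline_scores = cross_val_score(baseline, X_placeholder, y_counts,
                                  scoring='accuracy', cv=cv)

print('Baseline accuracy: %.3f' % baseline_scores.mean())
```

The baseline comes out close to 1382/1599, so a model's accuracy is best read as an improvement over that number. A metric such as `scoring='balanced_accuracy'` or `scoring='f1_macro'` in `cross_val_score` may give a clearer picture of how well the good wines are recognised.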
