<h2>🍷Wine Dataset🍇</h2>


<h3>0. Objective 🎯</h3>
<ul>
  <li>Build a classifier that predicts whether a wine is red (0) or white (1) from its physicochemical measurements and quality score.</li>
</ul> 

<h3>1. Importing Libraries 📚</h3>

In [128]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

<h3>2. Reading Dataset 👀</h3>

In [129]:
raw_data = pd.read_csv('wine_dataset.csv')
raw_data.head()

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,style
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


<h3>3. Checking Missing Values 🚫</h3>

In [130]:
# Count missing values in each column
raw_data.isnull().sum()

fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
style                   0
dtype: int64

In [131]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  style                 6497 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB


<h3>4. Data Exploration 📊</h3>

In [132]:
raw_data.describe(include='all')

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,style
count,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497
unique,,,,,,,,,,,,,2
top,,,,,,,,,,,,,white
freq,,,,,,,,,,,,,4898
mean,7.215307,0.339666,0.318633,5.443235,0.056034,30.525319,115.744574,0.994697,3.218501,0.531268,10.491801,5.818378,
std,1.296434,0.164636,0.145318,4.757804,0.035034,17.7494,56.521855,0.002999,0.160787,0.148806,1.192712,0.873255,
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0,
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5,5.0,
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0,
75%,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3,6.0,
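
The `freq` row for `style` shows that 4898 of the 6497 wines are white, so the two classes are not balanced. As a rough reference point for any accuracy reported later, the arithmetic below works out what a trivial model that always predicts white would score. It is only a sketch, and the counts are copied by hand from the summary table rather than read from `raw_data`.

```python
# Counts copied from the describe() output (count and freq rows for `style`).
n_total = 6497
n_white = 4898
n_red = n_total - n_white

# Accuracy of a trivial model that always predicts the majority class (white).
baseline_accuracy = n_white / n_total

print(f"red: {n_red}, white: {n_white}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

A useful classifier should score clearly above this baseline of roughly 0.75.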


<h3>5. Data Preprocessing 🛠</h3>

In [133]:
raw_data['style'] = raw_data['style'].map({'red':0, 'white':1}) # red = 0, white = 1
raw_data.sample(5)

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,style
1936,5.8,0.27,0.27,12.3,0.045,55.0,170.0,0.9972,3.28,0.42,9.3,6,1
672,9.8,1.24,0.34,2.0,0.079,32.0,151.0,0.998,3.15,0.53,9.5,5,0
1910,5.0,0.55,0.14,8.3,0.032,35.0,164.0,0.9918,3.53,0.51,12.5,8,1
2955,7.3,0.22,0.41,15.4,0.05,55.0,191.0,1.0,3.32,0.59,8.9,6,1
6072,7.1,0.36,0.2,1.6,0.271,24.0,140.0,0.99356,3.11,0.63,9.8,5,1
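
`Series.map` with a dictionary turns any value that is missing from the dictionary into `NaN` without raising an error, so it can be worth checking the encoded column before modelling. The sketch below shows the behaviour on a small made-up Series. The names `style_demo`, `mapping`, and the `'rose'` label are hypothetical and do not come from the wine dataset.

```python
import pandas as pd

# Hypothetical toy column; 'rose' stands in for an unexpected label.
style_demo = pd.Series(['red', 'white', 'rose', 'white'])
mapping = {'red': 0, 'white': 1}

encoded = style_demo.map(mapping)

# Labels that are not keys of the mapping come back as NaN.
unmapped = style_demo[encoded.isna()].unique()

print(f"Unmapped rows: {encoded.isna().sum()}")
print(f"Unmapped labels: {list(unmapped)}")
```

On the real data, the equivalent check would be `raw_data['style'].isna().sum()` right after the mapping step.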


<h3>6. ExtraTreesClassifier</h3>

In [134]:
# Separate the target from the predictors
y = raw_data['style']  # Target: 0 = red, 1 = white
X = raw_data.drop(['style'], axis=1)  # Predictors: all remaining columns

In [135]:
# Hold out 20% of the rows as a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
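
Because white wines outnumber red ones, a purely random split can leave the train and test sets with slightly different class ratios. One common option, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below illustrates it on synthetic data; `X_demo` and `y_demo` are made-up arrays, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 250 samples of class 0 and 750 of class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 250 + [1] * 750)

# stratify=y_demo keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Share of class 1 in train: {y_tr.mean():.3f}")
print(f"Share of class 1 in test:  {y_te.mean():.3f}")
```

For the wine data this would correspond to passing `stratify=y`, which would change the split and therefore the scores reported in the cells that follow.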

In [136]:
# Fit an extra-trees ensemble with default hyperparameters
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X_train, y_train)
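
`ExtraTreesClassifier` picks its split thresholds at random, so a model created without a `random_state` can give slightly different results on each run. The sketch below, on synthetic data from `make_classification`, shows that fixing `random_state` makes two fits agree. The names `X_demo`, `y_demo`, `model_a`, and `model_b`, and the choice of 50 trees, are illustrative assumptions rather than settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic binary classification problem, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Two models built with the same seed should learn the same trees.
model_a = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
model_b = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

same_importances = np.allclose(model_a.feature_importances_, model_b.feature_importances_)
same_predictions = np.array_equal(model_a.predict(X_demo), model_b.predict(X_demo))

print(f"Same feature importances: {same_importances}")
print(f"Same predictions: {same_predictions}")
```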

In [137]:
# Predict
y_pred = model.predict(X_test)

# Evaluate
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print(f'Accuracy Score: {score}')

Accuracy Score: 0.9946153846153846
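
Accuracy is a single number and does not show which class the errors fall on, which matters when one class is about three times larger than the other. A confusion matrix and a per-class report are one way to break it down. The sketch below uses small hand-written label arrays (`y_true_demo` and `y_pred_demo`, both hypothetical) so that it runs on its own.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: 0 = red, 1 = white.
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_true_demo, y_pred_demo, target_names=['red', 'white']))
```

In this notebook the same two calls could be applied to `y_test` and `y_pred`.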


In [138]:
# Actual labels for ten test rows, selected by position
y_test.iloc[400:410]

4586    1
5557    1
2908    1
5872    1
91      0
318     0
1477    0
1010    0
3353    1
4182    1
Name: style, dtype: int64

In [139]:
# Predicted labels for the same ten test rows
pred = model.predict(X_test.iloc[400:410])
pred

array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1], dtype=int64)
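
Rather than comparing the two outputs by eye, the labels can be compared element by element. The sketch below copies the ten actual and ten predicted values from the outputs above into arrays by hand; the names `actual`, `predicted`, and `mismatches` are illustrative.

```python
import numpy as np

# Values copied from the y_test slice and the pred array shown above.
actual = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])
predicted = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])

# Positions where the prediction differs from the actual label.
mismatches = np.flatnonzero(actual != predicted)

print(f"{len(mismatches)} mismatches out of {len(actual)} rows")
```

For these ten rows every prediction matches the actual label.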