# Red Wine Quality Prediction
Problem Statement:
The dataset is related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

This dataset can be viewed as a classification task. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.
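
As a rough illustration of the feature selection idea mentioned above, the sketch below scores features with a univariate ANOVA F-test and keeps the strongest one. The data is synthetic and the column names (`noise_a`, `informative`, `noise_b`) are hypothetical; this is one possible method under those assumptions, not the one the notebook commits to.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption: stands in for the wine features).
rng = np.random.default_rng(1)
n = 400
labels = rng.integers(0, 2, n)

informative = labels + rng.normal(0, 0.5, n)  # depends on the class label
noise_a = rng.normal(0, 1, n)                 # unrelated to the label
noise_b = rng.normal(0, 1, n)                 # unrelated to the label

features = np.column_stack([noise_a, informative, noise_b])
names = ["noise_a", "informative", "noise_b"]

# Rank features by ANOVA F-score and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1).fit(features, labels)
kept = [name for name, keep in zip(names, selector.get_support()) if keep]
print("Selected features:", kept)
```

On the real wine table, `k` would be a tuning choice, and the selector should be fitted on training data only.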

Attribute Information

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

An interesting option is to set an arbitrary cutoff on the dependent variable (wine quality), e.g. classifying wines scored 7 or higher as 'good/1' and the remainder as 'not good/0'.
This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value.
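
A minimal sketch of that cutoff-and-tune workflow follows. It assumes a synthetic stand-in table (the `toy` frame and its hypothetical quality formula are not the real data) and a small, arbitrary parameter grid; only the 7-or-higher cutoff comes from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine table (assumption: not the real data).
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "volatile acidity": rng.normal(0.5, 0.15, n),
})
# Hypothetical quality score loosely driven by the two features.
raw = (5.5 + 0.8 * (toy["alcohol"] - 10.5)
       - 3.0 * (toy["volatile acidity"] - 0.5)
       + rng.normal(0, 0.7, n))
toy["quality"] = raw.round().clip(0, 10).astype(int)

# Binarize: 7 or higher is 'good' (1), everything else 'not good' (0).
toy["good"] = (toy["quality"] >= 7).astype(int)

X_toy = toy[["alcohol", "volatile acidity"]]
y_toy = toy["good"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42)

# Tune two decision tree hyperparameters, scoring each candidate by ROC AUC.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X_tr, y_tr)

# AUC needs scores for the positive class, not hard 0/1 predictions.
test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print("Best params:", search.best_params_)
print("Test AUC:", round(test_auc, 3))
```

Scoring by AUC rather than accuracy is the safer choice here because the 'good' class is a minority after the cutoff.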

You need to build a classification model. 

Inspiration

Use machine learning to determine which physicochemical properties make a wine 'good'!



In [None]:
import pandas as pd


In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns


In [None]:
import warnings

In [None]:
%matplotlib inline

In [None]:
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv(r'C:\Users\win 7\Desktop\Datascience\Wine.csv')

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
# fill the missing values in every numeric column with that column's mean
for col in df.columns:
    if col != 'type':
        df[col] = df[col].fillna(df[col].mean())

In [None]:
df.isnull().sum()

In [None]:
# create box plots: one panel per numeric column (6 columns x 2 rows)
fig, ax = plt.subplots(ncols=6, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col in df.columns:
    if col != 'type':
        sns.boxplot(y=col, data=df, ax=ax[index])
        index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

In [None]:
#log transformation
df['free sulfur dioxide']=np.log(1+ df['free sulfur dioxide'])
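
The effect of this log transformation can be checked on synthetic data. The sketch below assumes a right-skewed lognormal sample as a stand-in for `free sulfur dioxide` and uses a hand-written `skewness` helper (hypothetical, not from the notebook); `np.log1p(x)` computes the same value as `np.log(1 + x)`.

```python
import numpy as np

# Right-skewed synthetic sample (assumption: stands in for the real column).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=5000)

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# log1p(x) equals log(1 + x) and is defined at x = 0.
transformed = np.log1p(skewed)

print("Skewness before:", round(skewness(skewed), 2))
print("Skewness after: ", round(skewness(transformed), 2))
```

A skewness closer to zero after the transform indicates a more symmetric distribution, which is the reason for applying it before modelling.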

In [None]:
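# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(df['free sulfur dioxide'], kde=True)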

In [None]:
sns.countplot(x='type', data=df)

In [None]:
sns.countplot(x='quality', data=df)

# Correlation Matrix

In [None]:
# correlate numeric columns only; the 'type' column holds strings
corr = df.corr(numeric_only=True)
plt.figure(figsize=(20, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Input Split

In [None]:
X=df.drop(columns=['type','quality'])
Y=df['quality']


# Class Imbalance

In [None]:
Y.value_counts()

In [None]:
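from imblearn.over_sampling import SMOTE
# k_neighbors must be smaller than the number of samples in the rarest class
oversample = SMOTE(k_neighbors=4)
# transform the dataset so that every quality class has the same count
X, Y = oversample.fit_resample(X, Y)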

In [None]:
Y.value_counts()
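
One caveat: resampling the full dataset before splitting lets resampled points derived from test rows influence training, which can inflate the reported scores. The sketch below shows the alternative order (split first, then balance only the training part). It uses plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE, on synthetic data; all names in it are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: 9:1 class ratio, not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = np.array([0] * 270 + [1] * 30)

# Split first so the test set keeps the original class distribution.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

# Oversample the minority class inside the training split only.
minority = y_tr == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_majority, random_state=42)

X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

print("Balanced train counts:", np.bincount(y_bal))
print("Untouched test counts:", np.bincount(y_te))
```

With SMOTE itself, the same order applies: call `fit_resample` on the training split and leave the test split as it is.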

In [None]:
# classify function: hold-out accuracy plus 5-fold cross-validation
from sklearn.model_selection import cross_val_score, train_test_split
def classify(model, X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
    # train the model
    model.fit(X_train, Y_train)
    print("Accuracy:", model.score(X_test, Y_test) * 100)

    # cross-validation
    score = cross_val_score(model, X, Y, cv=5)
    print("CV Score:", np.mean(score) * 100)

In [None]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
classify (model,X,Y)

In [None]:
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
classify (model,X,Y)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier 
model=ExtraTreesClassifier()
classify (model,X,Y)

In [None]:
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier()
classify (model,X,Y)

In [None]:
import lightgbm
model=lightgbm.LGBMClassifier()
classify(model,X,Y)

# Abalone Case Study
Problem Statement:
The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

Attribute Information

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict. 

| Name | Data Type | Measurement Unit | Description |
|---|---|---|---|
| Sex | nominal | -- | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | -- | +1.5 gives the age in years |

You have to predict the number of rings of each abalone, which in turn gives the age of that abalone.
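
The rings-to-age rule from the attribute list (age in years = rings + 1.5) can be applied directly to predictions or to the raw column. A small sketch with made-up ring counts (the `sample` frame is hypothetical):

```python
import pandas as pd

# Hypothetical ring counts, standing in for the 'Rings' column or model output.
sample = pd.DataFrame({"Rings": [7, 9, 15]})

# Age in years is the ring count plus 1.5.
sample["Age"] = sample["Rings"] + 1.5
print(sample)
```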

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# reading the data

data = pd.read_csv(r'C:\Users\win 7\Desktop\Datascience\Abalone.csv')


In [None]:
# getting the shape
data.shape


In [None]:
# looking at the head of the data
data.head()

In [None]:
# describe the data
data.describe()

In [None]:
# information of the data

data.info()

In [None]:
# checking if there is any NULL data
data.isnull().sum()

In [None]:
# pairplot
sns.pairplot(data)

In [None]:
# checking the columns of the data

data.columns

In [None]:
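# heatmap of pairwise correlations between the numeric columns
# (assumption: the correlation matrix, not the raw rows, is the intended input)
sns.heatmap(data[['Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight', 'Rings']].corr(), annot=True)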

In [None]:
# checking the values of sex
data['Sex'].value_counts()

In [None]:
# rings vs length, split by sex (x and y are passed by keyword, as recent seaborn requires)
plt.rcParams['figure.figsize'] = (18, 8)
sns.boxplot(x='Rings', y='Length', hue='Sex', data=data, palette='pastel')
plt.title('Rings vs length and sex', fontsize = 20)

In [None]:
# rings vs diameter and sex

plt.rcParams['figure.figsize'] = (20, 8)
sns.violinplot(x='Rings', y='Diameter', hue='Sex', data=data, palette='Set1')
plt.title('Rings vs diameter and sex', fontsize = 20)

In [None]:
# rings vs height and sex

plt.rcParams['figure.figsize'] = (18, 8)
sns.boxenplot(x='Rings', y='Height', hue='Sex', data=data, palette='Set2')
plt.title('Rings vs height and sex', fontsize = 20)

In [None]:
# ring vs weight
plt.rcParams['figure.figsize'] = (18, 10)
sns.swarmplot(x='Rings', y='Whole weight', data=data)
plt.title('Rings vs weight')

In [None]:
# ring vs shucked weight
plt.rcParams['figure.figsize'] = (18, 10)
sns.swarmplot(x='Rings', y='Shucked weight', data=data, palette='dark')
plt.title('Rings vs shucked weight')

In [None]:
# ring vs viscera weight
plt.rcParams['figure.figsize'] = (18, 10)
sns.stripplot(x='Rings', y='Viscera weight', data=data)
plt.title('Rings vs Viscera Weight')

In [None]:
# ring vs shell weight
plt.rcParams['figure.figsize'] = (18, 10)
sns.regplot(x='Rings', y='Shell weight', data=data)
plt.title('Rings vs Shell weight')

In [None]:
from math import pi

# Set data (numeric columns only: 'Sex' holds strings and cannot be drawn on a radial axis)
df = pd.DataFrame({
'group': range(len(data)),
'Length': data['Length'],
'Diameter': data['Diameter'],
'Whole weight':  data['Whole weight'],
'Viscera weight': data['Viscera weight'],
'Shell weight': data['Shell weight']
})
 
# number of variables
categories=list(df)[1:]
N = len(categories)
 
# We are going to plot the first line of the data frame.
# But we need to repeat the first value to close the circular graph:
values = df.loc[0].drop('group').values.flatten().tolist()
values += values[:1]
values
 
# Angle of each axis in the plot (the full circle divided by the number of variables)
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]
 
# Initialise the spider plot
ax = plt.subplot(111, polar=True)
 
# Draw one axis per variable and add the labels
plt.xticks(angles[:-1], categories, color='grey', size=8)
 
# Draw ylabels
ax.set_rlabel_position(0)
plt.yticks([10,20,30], ["10","20","30"], color="grey", size=7)
plt.ylim(0,40)

# Plot data
ax.plot(angles, values, linewidth=1, linestyle='solid')
plt.title('Radar Chart for Determining Importance of Features', fontsize = 20)
# Fill area
ax.fill(angles, values, 'red', alpha=0.1)

In [None]:
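# one-hot encode the categorical 'Sex' column (M, F, I)
data = pd.get_dummies(data)

# alternative, disabled here: label-encode 'Sex' into integer codes
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# data['Sex'] = le.fit_transform(data['Sex'])
# data['Sex'].value_counts()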

In [None]:
data.head()

In [None]:
# splitting the dependent and independent variables

y = data['Rings']
data = data.drop(['Rings'], axis = 1)
x = data

# getting the shapes
print("Shape of x:", x.shape)
print("Shape of y:", y.shape)

In [None]:
# train test split

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

# getting the shapes
print("Shape of x_train :", x_train.shape)
print("Shape of x_test :", x_test.shape)
print("Shape of y_train :", y_train.shape)
print("Shape of y_test :", y_test.shape)

In [None]:
# MODELLING
# RANDOM FOREST REGRESSOR

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Rings is a numeric target scored with RMSE and R2, so a regressor is used
model = RandomForestRegressor()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# evaluation
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE :", rmse)

# r2 score
r2 = r2_score(y_test, y_pred)
print("R2 Score :", r2)
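
For reference, both metrics can be reproduced by hand from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The four values below are made up for illustration and are not model output.

```python
import numpy as np

# Hypothetical true and predicted ring counts.
y_true = np.array([9.0, 10.0, 7.0, 12.0])
y_hat = np.array([8.0, 10.0, 9.0, 11.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("RMSE:", rmse_manual)
print("R2  :", r2_manual)
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of `y_true`.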

In [None]:
!pip install eli5


In [None]:
# let's check the importance of each attribute


# for permutation importance
import eli5 
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model, random_state = 0).fit(x_test, y_test)
eli5.show_weights(perm, feature_names = x_test.columns.tolist())