# Prediction of the quality of the Portuguese "Vinho Verde" wine

### Can we determine the quality of a wine from its chemical composition?

In the reference study, two datasets were created from wine samples; the red samples are used here.
The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent). Several data mining methods were applied to model these datasets under a regression approach.
The support vector machine model achieved the best results. Several metrics were computed, including MAD (mean absolute deviation) and a confusion matrix for a fixed error tolerance (T). The relative importances of the input variables, as measured by a sensitivity analysis procedure, were also plotted.

## Import of the libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

# setting plot style for all the plots
plt.style.use('fivethirtyeight')

## Import of the datafile

This data file comes from the UCI website: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

In [2]:
df=pd.read_csv('MyAuto_ge_Cars_Data.csv', sep=',')

In [78]:
df.head(1)

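| Unnamed: 0 | ID | Price | Manufacturer | Year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Typegearbox | Drive wheels | Doors | Wheel | Color | Airbags | Turbo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45568273 | 11447 | HONDA | 2014 | Hatchback | 0 | Petrol | 1.5 | 80000 | 4 | Manual | Front | 4/5 | Left wheel | Grey | 4 | 0 |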


In [24]:
df=df.convert_dtypes()

In [4]:
df=df.rename(columns={'Price ($)':'Price','Gear box type':'Typegearbox','Prod. year':'Year'})

In [5]:
df=df.drop(df[df['Price']=='Price negotiable'].index)

In [21]:
df.Price=df.Price.str.replace(',','')

In [76]:
df1=pd.DataFrame(df['Manufacturer'].value_counts())
Top20=list(df1.head(20).index)

In [None]:
# TODO: remove the rows whose manufacturer is not in the top 20
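
The step of removing manufacturers outside the top 20 is not implemented in the notebook. A minimal sketch of one way to do it, using `value_counts` and `isin` on a small made-up frame (the `toy` data and the `top_n` name are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Toy data standing in for the real listings (values are made up).
toy = pd.DataFrame({
    "Manufacturer": ["HONDA", "HONDA", "TOYOTA", "TOYOTA", "TOYOTA", "BMW", "LADA"],
    "Price": [11447, 9000, 15000, 12000, 8000, 20000, 1500],
})

top_n = 2  # the notebook targets 20; 2 keeps the toy example readable
top = toy["Manufacturer"].value_counts().head(top_n).index

# Keep only the rows whose manufacturer is among the most frequent ones.
filtered = toy[toy["Manufacturer"].isin(top)]
print(filtered["Manufacturer"].unique())
```

On the real data the same pattern would presumably read `df = df[df['Manufacturer'].isin(Top20)]`.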

In [26]:
df=df.drop('Levy ($)',axis=1)

In [34]:
df=df.drop('Model',axis=1)

In [19]:
df.Mileage=df.Mileage.str.replace(' km','')

In [22]:
df.Mileage=df.Mileage.str.replace(',','')
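
After these replacements the `Price` and `Mileage` columns still hold strings (the `df.info()` output further down lists both as `string`), so they would need a numeric conversion before any modeling. A minimal sketch with `pd.to_numeric`, on made-up values whose raw format (`"80,000 km"`) is an assumption inferred from the cleaning steps:

```python
import pandas as pd

# Made-up rows in the assumed raw format.
toy = pd.DataFrame({
    "Price": ["11,447", "9,000"],
    "Mileage": ["80,000 km", "125,500 km"],
})

# Strip thousands separators and units, then convert to numbers.
toy["Price"] = pd.to_numeric(toy["Price"].str.replace(",", "", regex=False))
toy["Mileage"] = pd.to_numeric(
    toy["Mileage"]
    .str.replace(" km", "", regex=False)
    .str.replace(",", "", regex=False)
)
print(toy.dtypes)
```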

In [35]:
df['Leather interior']=np.where(df['Leather interior']=='No',0,1)

In [47]:
# Assign the result back instead of calling fillna in place on a column selection
df['Engine volume']=df['Engine volume'].fillna('0')

In [48]:
Turbo=df['Engine volume']
df['Turbo']=np.where(Turbo.str.contains('Turbo'),1,0)
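
Once the `Turbo` flag is extracted, `Engine volume` is still a string column. A possible follow-up, sketched on made-up values, pulls out the leading number with a regular expression; the `"2.0 Turbo"` format is an assumption based on the `str.contains('Turbo')` check:

```python
import pandas as pd

# Made-up values in the assumed raw format.
toy = pd.DataFrame({"Engine volume": ["1.5", "2.0 Turbo", None]})

vol = toy["Engine volume"].fillna("0")
# 1 when the text mentions a turbo, 0 otherwise.
toy["Turbo"] = vol.str.contains("Turbo").astype(int)
# Keep only the numeric part of the engine volume.
toy["Engine volume"] = vol.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
print(toy)
```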

In [58]:
# Assign the result back instead of calling fillna in place on a column selection
df['Cylinders']=df['Cylinders'].fillna(0)

In [10]:
df=df.drop('Interior color',axis=1)

In [11]:
df=df.drop('VIN',axis=1)

In [None]:
# Inspect the rows where 'Drive wheels' is missing
df[df['Drive wheels'].isna()].head(30)

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46043 entries, 0 to 80576
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ID                46043 non-null  Int64 
 1   Price             46043 non-null  string
 2   Manufacturer      46043 non-null  string
 3   Year              46043 non-null  Int64 
 4   Category          46043 non-null  string
 5   Leather interior  46043 non-null  int32 
 6   Fuel type         46011 non-null  string
 7   Engine volume     46043 non-null  string
 8   Mileage           46043 non-null  string
 9   Cylinders         45925 non-null  Int64 
 10  Typegearbox       46043 non-null  string
 11  Drive wheels      45856 non-null  string
 12  Doors             45167 non-null  string
 13  Wheel             46043 non-null  string
 14  Color             45529 non-null  string
 15  Airbags           46043 non-null  Int64 
 16  Turbo             46043 non-null  int32 
dtypes: Int64(4), int32(2), string(11)

In [None]:
df.describe().round(decimals=3)

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='quality', data=df)
plt.title('Number of wines per quality score')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(y='total sulfur dioxide', x='quality', data=df)
plt.title('Total sulfur dioxide by quality score')
plt.show()

In [None]:
df.loc[df['total sulfur dioxide']>200]

In [None]:
df.loc[df['free sulfur dioxide']>60]

We decided to drop the two outliers present in the total sulfur dioxide column.

In [None]:
df=df.drop([1079,1081])
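
Hard-coded row labels only work for one particular load of the file. A sketch of the same drop expressed through the threshold used in the inspection cell (`> 200`), on made-up values; the `threshold` and `outliers` names are illustrative:

```python
import pandas as pd

# Made-up measurements; two rows exceed the threshold.
toy = pd.DataFrame({
    "total sulfur dioxide": [34.0, 67.0, 250.0, 54.0, 260.0],
    "quality": [5, 5, 7, 6, 7],
})

threshold = 200
# Index labels of the rows above the threshold.
outliers = toy.index[toy["total sulfur dioxide"] > threshold]
cleaned = toy.drop(outliers)
print(list(outliers), len(cleaned))
```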

## Modeling

In [None]:
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

In [None]:
df.corr().quality.sort_values()

In [None]:
y=df.quality
X=df[['volatile acidity','total sulfur dioxide','density','chlorides','free sulfur dioxide','pH','residual sugar','fixed acidity','citric acid','sulphates','alcohol']]
model=sm.OLS(y,add_constant(X)).fit()
model.summary()

In [None]:
X=df[['volatile acidity','total sulfur dioxide','chlorides','free sulfur dioxide','pH','sulphates','alcohol']]
model=sm.OLS(y,add_constant(X)).fit()
model.summary()

## Assumptions

### Multicollinearity

In [None]:
import seaborn as sns
sns.heatmap(X.corr(),cmap='coolwarm')

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF

In [None]:
vifs=pd.Series([VIF(X.values,i) for i in range(X.shape[1])],index=X.columns)

In [None]:
vifs

In [None]:
X=X.drop('pH',axis=1)

In [None]:
vifs=pd.Series([VIF(X.values,i) for i in range(X.shape[1])],index=X.columns)
vifs

In [None]:
X=X.drop('alcohol',axis=1)
vifs=pd.Series([VIF(X.values,i) for i in range(X.shape[1])],index=X.columns)
vifs

In [None]:
sm.OLS(y,add_constant(X)).fit().summary()

In [None]:
X1=X.drop("free sulfur dioxide", axis=1)
sm.OLS(y,add_constant(X1)).fit().summary()

### Autocorrelation: when rows are correlated

The Durbin-Watson statistic is 1.7353. This is close to 2, the value expected when residuals are uncorrelated, so the assumption is satisfied: no autocorrelation was found.
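
For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation. The sketch below computes it with NumPy on synthetic residuals (the `durbin_watson_stat` helper and the data are illustrative). On the fitted model, `model.summary()` reports the same statistic, and `statsmodels.stats.stattools.durbin_watson(model.resid)` should return it directly.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Illustrative helper: Durbin-Watson statistic of a residual series."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
# Independent residuals: the statistic should land near 2.
independent = rng.normal(size=2000)
# Strongly positively autocorrelated residuals (a random walk): near 0.
trending = np.cumsum(rng.normal(size=2000))

dw_independent = durbin_watson_stat(independent)
dw_trending = durbin_watson_stat(trending)
print(round(dw_independent, 3), round(dw_trending, 3))
```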

### Linearity