# Title: Wine Quality Prediction
# Definition:
Predicting the quality of wine based on its physico-chemical properties.
# Problem definition
# Problem type: Regression

Problem statement: Develop a model that can predict the quality of wine based on its physico-chemical properties, such as pH, acidity, and sugar conten
#  Data
Sourcing:Wine quality dataset from UCI Machine Learning Repository or other sources
Defining different parameters pH, acidity, sugar content, and other relevant physico-chemical properties
Talking to experts: Consulting with wine experts or oenologists to understand the importance of each parameter in determining wine quali
#  Evaluation
Evaluation metric: Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE)
# Features
Features: pH, acidity, sugar content, and other relevant physico-chemical properties
Data dictionary: Creating a data dictionary to document each feature, its data type, and its importance in determining wine qualit
# Preparing the tools
Importing necessary libraries:
Pandas for data analysis
NumPy for numerical operations
Matplotlib/Seaborn for plotting and data visualization
Scikit-learn for machine learning modeling and evalua
# Load data

Loading the wine quality dataset into a Pandas datafra
# Data exploration (EDA)

What question(s) are you trying to solve?: Predicting wine quality based on physico-chemical properties
What kind of data do you have and how do you treat different types?: Continuous and categorical data, handling missing values and outliers
What is missing from the data and how do you deal with it?: Handling missing values using imputation or interpolation
How can you compare different columns to each other, compare them to the target variable, and correlation between independent variables?: Correlation analysis and feature selection
How can you add, change, or remove features to get more out of your data?: Feature engineering and sel
# Modeling

Features and labels: Selecting relevant features and defining the target variable (wine quality)
Training and test split: Splitting the data into training and testing sets
Model choices: Selecting a suitable regression algorithm (e.g., Linear Regression, Decision Trees, Random Forest)
Model comparison: Comparing the performance of different models
Hyperparameter tuning and cross-validation: Tuning hyperparameters using cross-vali
# Evaluating your model

Evaluation metrics: MAE, RMSE, and other relevant metrics for regression problems
Model evaluation: Evaluating the performance of the model using the chosen metr
# Feature importance

Feature importance: Analyzing the importance of each feature in determining wine quality using techniques such as permutation importance or SHAP valu
# Experimentation

Did you meet the evaluation metric?: Evaluating the performance of the model against the chosen metric
If not, what's next?: Discussing with the team to explore options such as collecting more data, trying a better model, or improving the current mo
# Save the model

Saving the model: Saving the trained model for future use and deployment
Model deployment: Deploying the model in a suitable environment, such as a web application or API, to predict wine quality for new, unseen data.deles
seen data.icsdationectionmetionytyt.rformance.quality.



In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df=pd.read_csv("wine_train.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3673 entries, 0 to 3672
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed.acidity         3673 non-null   float64
 1   volatile.acidity      3673 non-null   float64
 2   citric.acid           3673 non-null   float64
 3   residual.sugar        3673 non-null   float64
 4   chlorides             3673 non-null   float64
 5   free.sulfur.dioxide   3673 non-null   float64
 6   total.sulfur.dioxide  3673 non-null   float64
 7   density               3673 non-null   float64
 8   pH                    3673 non-null   float64
 9   sulphates             3673 non-null   float64
 10  alcohol               3673 non-null   float64
 11  quality               3673 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 344.5 KB


In [4]:
df.describe()

Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
count,3673.0,3673.0,3673.0,3673.0,3673.0,3673.0,3673.0,3673.0,3673.0,3673.0,3673.0,3673.0
mean,6.854724,0.277953,0.335108,6.413613,0.04571,35.3685,138.626736,0.994026,3.189834,0.489115,10.525709,5.89382
std,0.848973,0.099913,0.123156,5.054275,0.021747,16.974746,42.641142,0.003016,0.152739,0.112433,1.243642,0.892749
min,3.8,0.08,0.0,0.6,0.012,3.0,9.0,0.98711,2.72,0.25,8.0,3.0
25%,6.3,0.21,0.27,1.7,0.036,24.0,108.0,0.9917,3.08,0.41,9.4,5.0
50%,6.8,0.26,0.31,5.25,0.043,34.0,134.0,0.9938,3.18,0.47,10.4,6.0
75%,7.3,0.32,0.39,9.9,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0
max,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.81,1.01,14.2,9.0


In [5]:
training_data=df.iloc[:3000]

In [6]:
training_features_data=training_data.drop('quality', axis='columns')
training_labels_data=training_data['quality'].copy()

In [7]:
testing_data=df.iloc[3000:]

In [8]:
testing_features_data=testing_data.drop('quality', axis='columns')
testing_labels_data=testing_data['quality'].copy()

In [9]:
training_data

Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
0,9.0,0.245,0.38,5.90,0.045,52.0,159.0,0.99500,2.93,0.35,10.2,6
1,8.2,0.420,0.29,4.10,0.030,31.0,100.0,0.99110,3.00,0.32,12.8,7
2,6.4,0.220,0.32,7.20,0.028,15.0,83.0,0.99300,3.13,0.55,10.9,8
3,5.0,0.350,0.25,7.80,0.031,24.0,116.0,0.99241,3.39,0.40,11.3,6
4,7.4,0.300,0.30,5.20,0.053,45.0,163.0,0.99410,3.12,0.45,10.3,6
...,...,...,...,...,...,...,...,...,...,...,...,...
2995,8.7,0.310,0.73,14.35,0.044,27.0,191.0,1.00013,2.96,0.88,8.7,5
2996,5.2,0.285,0.29,5.15,0.035,64.0,138.0,0.98950,3.19,0.34,12.4,8
2997,7.7,0.390,0.28,4.90,0.035,36.0,109.0,0.99180,3.19,0.58,12.2,7
2998,6.2,0.250,0.28,8.50,0.035,28.0,108.0,0.99486,3.40,0.42,10.4,6


In [10]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [11]:
my_pipeline=Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())
])
x_train=my_pipeline.fit_transform(training_features_data)

x_test=my_pipeline.fit_transform(testing_features_data)

In [12]:
correlation=df.corr()
correlation['quality'].sort_values(ascending=False)

quality                 1.000000
alcohol                 0.436661
pH                      0.090143
sulphates               0.035366
free.sulfur.dioxide     0.007264
citric.acid            -0.008268
residual.sugar         -0.089946
fixed.acidity          -0.112055
total.sulfur.dioxide   -0.179859
volatile.acidity       -0.189077
chlorides              -0.211490
density                -0.306094
Name: quality, dtype: float64

In [13]:
from sklearn.linear_model import LinearRegression

In [14]:
model=LinearRegression()
model.fit(x_train, training_labels_data)

In [15]:
x=model.predict(x_test)

In [16]:
from sklearn.model_selection import cross_val_score
scores=cross_val_score(model, training_features_data, training_labels_data, scoring="neg_mean_squared_error", cv=10)
rmse_scores=np.sqrt(-scores)

In [17]:
len(list(rmse_scores))

10

In [18]:
list(rmse_scores)

[0.7579297565879939,
 0.7119038105882645,
 0.8018856981189969,
 0.8281104442195854,
 0.7617899846275362,
 0.7396527459836634,
 0.7589729083542439,
 0.8273574939318062,
 0.7881182904079181,
 0.6965972580497264]

In [19]:
def print_scores_of_validation(scores):
    print("Scores:", scores)
    print("mean:", scores.mean())
    print("standard deviation:", scores.std())

In [20]:
print_scores_of_validation(rmse_scores)

Scores: [0.75792976 0.71190381 0.8018857  0.82811044 0.76178998 0.73965275
 0.75897291 0.82735749 0.78811829 0.69659726]
mean: 0.7672318390869735
standard deviation: 0.04237195574420888


In [21]:
from sklearn.metrics import mean_squared_error
lin_mse=mean_squared_error(testing_labels_data, x)
lin_rmse=np.sqrt(lin_mse)

In [22]:
lin_mse, lin_rmse

(0.5288955065000522, 0.7272520240604713)

In [23]:
from sklearn.tree import DecisionTreeRegressor
model1=DecisionTreeRegressor()
model1.fit(x_train, training_labels_data)

In [24]:
x1=model1.predict(x_test)

In [25]:
scores1=cross_val_score(model1, training_features_data, training_labels_data, scoring="neg_mean_squared_error", cv=10)
rmse_scores1=np.sqrt(-scores1)

In [26]:
print_scores_of_validation(rmse_scores1)

Scores: [0.89628864 0.90737717 0.94692485 0.91104336 0.92376043 0.86216781
 0.82056891 0.91104336 0.93273791 0.96263527]
mean: 0.9074547706479084
standard deviation: 0.03906199211960267


In [27]:
from sklearn.metrics import mean_squared_error
lin_mse1=mean_squared_error(testing_labels_data, x1)
lin_rmse1=np.sqrt(lin_mse1)

In [28]:
lin_mse1, lin_rmse1

(0.9806835066864784, 0.9902946564969834)

In [29]:
from sklearn.ensemble import RandomForestRegressor
model2=RandomForestRegressor()
model2.fit(x_train, training_labels_data)

In [30]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 1200, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [2,150]
# Minimum number of samples required to split a node
min_samples_split = [2, 5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [31]:
# Create the param grid
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(param_grid)

{'n_estimators': [10, 142, 274, 406, 538, 671, 803, 935, 1067, 1200], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [2, 150], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2], 'bootstrap': [True, False]}


In [32]:
from sklearn.model_selection import RandomizedSearchCV
rf_RandomGrid = RandomizedSearchCV(estimator = model2, param_distributions = param_grid, cv = 100, verbose=2, n_jobs = 4)

In [33]:
rf_RandomGrid.fit(x_train, training_labels_data)

Fitting 100 folds for each of 10 candidates, totalling 1000 fits


In [34]:
rf_RandomGrid.best_params_

{'n_estimators': 406,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 150,
 'bootstrap': False}

In [35]:
print (f'Train Accuracy - : {rf_RandomGrid.score(x_train,training_labels_data):.3f}')
print (f'Test Accuracy - : {rf_RandomGrid.score(x_test,testing_labels_data):.3f}')

Train Accuracy - : 0.974
Test Accuracy - : 0.459


In [36]:
x_=model2.predict(x_test)
x2=np.round(x_)

In [37]:
scores2=cross_val_score(model2, training_features_data, training_labels_data, scoring="neg_mean_squared_error", cv=10)
rmse_scores2=np.sqrt(-scores2)

In [38]:
print_scores_of_validation(rmse_scores2)

Scores: [0.65247503 0.61357423 0.7079169  0.66981092 0.66367161 0.64710071
 0.62798301 0.71036657 0.64227305 0.60807017]
mean: 0.6543242199146562
standard deviation: 0.03321869804407307


In [39]:
lin_mse2=mean_squared_error(testing_labels_data, x2)
lin_rmse2=np.sqrt(lin_mse2)

In [40]:
lin_mse2, lin_rmse2

(0.4606240713224368, 0.6786929138590124)

In [41]:
testing_data.to_csv("Testing and predicted data.csv")

In [42]:
df2=pd.read_csv("Testing and predicted data.csv")

In [43]:
df2["Linear_regression"]=x
df2["Decision_Tree"]=x1
df2["Random_Forest"]=x2

In [44]:
df2.to_csv("Testing and predicted data.csv")

In [45]:
def predict_quality(parameters): #enter space seperated 11 integer values as parameters
        
        x=parameters.split()
        arr=np.array(x)

        z=np.reshape(arr, (1,-1))
        y=my_pipeline.fit_transform(z)
        output=model2.predict(y)
        if output<6:
            print("Bad")
        elif output==6:
            print("Average")
        else:
            print("Good")
    

In [46]:
#parameters=input('enter parameters')
#predict_quality(parameters)

In [47]:
df2

Unnamed: 0.1,Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality,Linear_regression,Decision_Tree,Random_Forest
0,3000,6.9,0.21,0.49,1.4,0.041,15.0,164.0,0.99270,3.25,0.63,11.0,5,5.948606,6.0,6.0
1,3001,7.7,0.38,0.23,10.8,0.030,28.0,95.0,0.99164,2.93,0.41,13.6,6,6.953448,8.0,7.0
2,3002,5.6,0.15,0.31,5.3,0.038,8.0,79.0,0.99230,3.30,0.39,10.5,6,6.114203,8.0,6.0
3,3003,6.8,0.31,0.32,7.6,0.052,35.0,143.0,0.99590,3.14,0.38,9.0,5,5.204739,5.0,5.0
4,3004,7.2,0.16,0.49,1.3,0.037,27.0,104.0,0.99240,3.23,0.57,10.6,6,6.034801,6.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
668,3668,7.1,0.34,0.32,2.0,0.051,29.0,130.0,0.99354,3.30,0.50,10.4,6,5.532191,6.0,6.0
669,3669,6.4,0.29,0.57,1.0,0.060,15.0,120.0,0.99240,3.06,0.41,9.5,5,5.166344,5.0,5.0
670,3670,8.0,0.26,0.28,8.2,0.038,72.0,202.0,0.99566,3.12,0.56,10.0,6,5.950430,5.0,6.0
671,3671,6.7,0.15,0.29,5.0,0.058,28.0,105.0,0.99460,3.52,0.44,10.2,7,6.023159,5.0,6.0


In [48]:
truth_quality=df2['quality']

In [49]:
truth_label=[]
for values in truth_quality:
    if values>6:
        truth_label.append('good')
    elif values<6:
        truth_label.append('bad')
    else:
        truth_label.append('average')

In [50]:
df2['truth_label']=truth_label

In [51]:
df2

Unnamed: 0.1,Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality,Linear_regression,Decision_Tree,Random_Forest,truth_label
0,3000,6.9,0.21,0.49,1.4,0.041,15.0,164.0,0.99270,3.25,0.63,11.0,5,5.948606,6.0,6.0,bad
1,3001,7.7,0.38,0.23,10.8,0.030,28.0,95.0,0.99164,2.93,0.41,13.6,6,6.953448,8.0,7.0,average
2,3002,5.6,0.15,0.31,5.3,0.038,8.0,79.0,0.99230,3.30,0.39,10.5,6,6.114203,8.0,6.0,average
3,3003,6.8,0.31,0.32,7.6,0.052,35.0,143.0,0.99590,3.14,0.38,9.0,5,5.204739,5.0,5.0,bad
4,3004,7.2,0.16,0.49,1.3,0.037,27.0,104.0,0.99240,3.23,0.57,10.6,6,6.034801,6.0,6.0,average
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
668,3668,7.1,0.34,0.32,2.0,0.051,29.0,130.0,0.99354,3.30,0.50,10.4,6,5.532191,6.0,6.0,average
669,3669,6.4,0.29,0.57,1.0,0.060,15.0,120.0,0.99240,3.06,0.41,9.5,5,5.166344,5.0,5.0,bad
670,3670,8.0,0.26,0.28,8.2,0.038,72.0,202.0,0.99566,3.12,0.56,10.0,6,5.950430,5.0,6.0,average
671,3671,6.7,0.15,0.29,5.0,0.058,28.0,105.0,0.99460,3.52,0.44,10.2,7,6.023159,5.0,6.0,good


In [52]:
rf_label=[]
Random_Forest=df2['Random_Forest']
for entries in Random_Forest:
    if entries>6:
        rf_label.append('good')
    elif entries<6:
        rf_label.append('bad')
    else:
        rf_label.append('average')

In [53]:
df2['rf_label']=rf_label
df2.to_csv('Testing and predicted data.csv')

In [54]:
i=0
for a,b in zip(truth_label,rf_label):
    if a==b:
        i=i+1
i

454

In [94]:
import timeit


def sort_list():
    my_list = [4, 2, 7, 1, 3, 5, 8, 6]
    my_list.sort()


accuracy = timeit.timeit('sort_list()', globals=globals(), number=500)

# Save the accuracy to a variable or a file
print("Accuracy:", accuracy)


Accuracy: 0.00017740001203492284
