# Executive Summary:

This notebook classifies wine type and predicts wine rating using machine learning algorithms. The notebook is deliberately basic and simple :)

In [1]:
## Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [None]:
wine_df = pd.read_csv('/home/docode/project/EDA on Collected Data/new_wine.csv', index_col = 0) 

# One Hot Encoding through pandas get dummies
categorical_features_to_encode = ['wine region', 'wine country', 'grape information'] 

wine_df = pd.get_dummies(wine_df, columns = categorical_features_to_encode, prefix = categorical_features_to_encode)

# Some wines have wine year as 'N.V.', need to convert them to NaN
wine_df['wine year'] = wine_df['wine year'].apply(lambda x: np.nan if x == 'N.V.' else x)
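
The `apply` call above replaces 'N.V.' with NaN, but the column can still hold strings if the CSV stored the years as text. A minimal sketch of coercing the column to numeric in one step, shown on an invented toy frame rather than the real data. Note that `errors='coerce'` turns every non-numeric entry into NaN, not only 'N.V.', so it assumes no other text values need special handling.

```python
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration)
toy_df = pd.DataFrame({'wine year': ['2015', 'N.V.', '2018', '2012']})

# errors='coerce' maps every non-numeric entry (including 'N.V.') to NaN
toy_df['wine year'] = pd.to_numeric(toy_df['wine year'], errors='coerce')

print(toy_df['wine year'].tolist())
```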


In [3]:
# Defining the X and Y variables
X = wine_df.iloc[:, 2:].drop(columns = ['wine type', 'wine description'])
X = X.fillna(0.0)
Y = wine_df['wine type']
Y = Y.fillna(0.0)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.25, random_state=0)

## Logistic Regression Classifier

In [4]:
lr_model = LogisticRegression(random_state=0)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_cm = confusion_matrix(y_test, lr_pred)
lr_ac = accuracy_score(y_test, lr_pred)
print('LogisticRegression accuracy: {} %'.format(lr_ac * 100))

LogisticRegression accuracy: 49.207505920932775 %


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
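
The warning above suggests raising `max_iter` or scaling the features. Below is a minimal sketch of doing both, run on synthetic data from `make_classification` rather than the wine dataset. The `*_demo` and `Xd_*` names and the value `max_iter=1000` are illustrative assumptions, and whether this improves the 49 % accuracy on the real data is not tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)
# Exaggerate one feature's scale to mimic an unscaled column
X_demo[:, 0] *= 1000.0

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# The scaler is fitted on the training split only, then applied to the test split
scaled_lr = make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=1000, random_state=0))
scaled_lr.fit(Xd_train, yd_train)

demo_accuracy = scaled_lr.score(Xd_test, yd_test)
print('Scaled LogisticRegression accuracy: {:.1f} %'.format(demo_accuracy * 100))
```

Wrapping the scaler and the model in one pipeline keeps the test data out of the scaling statistics.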


## Decision Tree Classifier

In [5]:
dtree_model = DecisionTreeClassifier(criterion='entropy', random_state=0)
dtree_model.fit(X_train, y_train)
dtree_pred = dtree_model.predict(X_test)
dtree_cm = confusion_matrix(y_test, dtree_pred)
# accuracy_score expects (y_true, y_pred)
dtree_ac = accuracy_score(y_test, dtree_pred)
print('Decision Tree Classifier accuracy: {} %'.format(dtree_ac * 100))

Decision Tree Classifier accuracy: 93.35033703771178 %


## Random Forest Model

In [6]:
rdf_model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
rdf_model.fit(X_train, y_train)
rdf_pred = rdf_model.predict(X_test)
rdf_cm = confusion_matrix(y_test, rdf_pred)
# accuracy_score expects (y_true, y_pred)
rdf_ac = accuracy_score(y_test, rdf_pred)
print('Random Forest accuracy: {} %'.format(rdf_ac * 100))

Random Forest accuracy: 94.69848788486063 %


## Evaluating Results using Cross Validation


In [7]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kfold = KFold(n_splits = 5, shuffle = True) # Setting shuffle as True considerably increases Cross Validation Score

print('Cross - validation scores:', cross_val_score(rdf_model, X, Y, cv = kfold))
print('Mean score of all folds: {} %'.format (cross_val_score(rdf_model, X, Y, cv = kfold).mean() * 100))

Cross - validation scores: [0.95104736 0.94579822 0.95103621 0.94124345 0.95445229]
Mean score of all folds: 94.94442494770101 %


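The Random Forest's accuracy on the held-out test set (94.70 %) is very close to its mean cross-validation accuracy (94.94 %).

Therefore, we can conclude that the model classifies the type of wine accurately and that the result does not depend on one particular train/test split.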
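
Note that the cell above calls `cross_val_score` twice with an unseeded, shuffled `KFold`, so the second call uses different folds. That is why the printed mean (94.94 %) differs slightly from the average of the five printed scores (about 94.87 %). A minimal sketch of computing the scores once with a seeded `KFold`, using synthetic data and hypothetical `demo_*` names in place of the wine dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the wine features (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
# random_state makes the shuffled folds reproducible between runs
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Compute the scores once and reuse them for both printouts
cv_scores = cross_val_score(demo_model, X_demo, y_demo, cv=demo_kfold)
print('Cross-validation scores:', cv_scores)
print('Mean score of all folds: {:.2f} %'.format(cv_scores.mean() * 100))
```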

# Wine Rating Prediction
Next, we try to predict the wine rating using regression models.

In [8]:
# Required libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor



In [9]:
wine_df = pd.read_csv('/home/docode/project/EDA on Collected Data/new_wine.csv', index_col = 0)

# One Hot Encoding through pandas get dummies
categorical_features_to_encode = ['wine type','wine region', 'wine country', 'grape information'] 

wine_df = pd.get_dummies(wine_df, columns = categorical_features_to_encode, prefix = categorical_features_to_encode)



In [10]:
# Defining X and Y variables
X = wine_df.iloc[:, 2:].drop(columns = ['wine rating', 'wine description'])
X = X.fillna(0.0)
Y = wine_df['wine rating']
Y = Y.fillna(0.0)

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.25, random_state=0)

## Decision Tree Regressor

In [11]:
dtree_model = DecisionTreeRegressor()
dtree_model.fit(X_train, y_train)
dtree_pred = dtree_model.predict(X_test)

print('R2 score:', r2_score(y_test, dtree_pred))

# RMSE via np.sqrt, since the squared=False argument is deprecated in newer scikit-learn versions
print('RMSE:', np.sqrt(mean_squared_error(y_test, dtree_pred)))

print('MAE:', mean_absolute_error(y_test, dtree_pred))

R2 score: 0.5400526398714944
RMSE: 0.23928295142676836
MAE: 0.17691747130624885


## Random Forest Regressor

In [12]:
rfg_model = RandomForestRegressor(n_estimators=200, random_state=42)
rfg_model.fit(X_train, y_train)
rfg_pred = rfg_model.predict(X_test)

print('R2 score:', r2_score(y_test, rfg_pred))

# RMSE via np.sqrt, since the squared=False argument is deprecated in newer scikit-learn versions
print('RMSE:', np.sqrt(mean_squared_error(y_test, rfg_pred)))

print('MAE:', mean_absolute_error(y_test, rfg_pred))


R2 score: 0.7353305448367043
RMSE: 0.1815137809365554
MAE: 0.13688018049085174


## KNeighbors Regressor

In [13]:
knn_model = KNeighborsRegressor(n_neighbors=10)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

print('R2 score:', r2_score(y_test, knn_pred))

# RMSE via np.sqrt, since the squared=False argument is deprecated in newer scikit-learn versions
print('RMSE:', np.sqrt(mean_squared_error(y_test, knn_pred)))

print('MAE:', mean_absolute_error(y_test, knn_pred))


R2 score: 0.6896991507081625
RMSE: 0.1965391953545685
MAE: 0.1512078702860266
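
The three regressors above are all scored with the same R2, RMSE, and MAE printouts. A minimal sketch of collecting these in one place is shown here; `regression_report` is a hypothetical helper, not part of scikit-learn, and the ratings are made up purely to show the call.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R2, RMSE and MAE for one set of predictions (hypothetical helper)."""
    return {
        'R2': r2_score(y_true, y_pred),
        'RMSE': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'MAE': mean_absolute_error(y_true, y_pred),
    }


# Made-up ratings, only to show the call
demo_true = [4.0, 3.5, 4.2, 3.8]
demo_pred = [3.9, 3.6, 4.0, 3.8]

demo_report = regression_report(demo_true, demo_pred)
for name, value in demo_report.items():
    print('{}: {:.4f}'.format(name, value))
```

Calling the helper once per model, for example `regression_report(y_test, rfg_pred)`, would give directly comparable numbers for each regressor.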


## Conclusion:
The best wine rating (quality) predictions and the best wine type classifications both come from the Random Forest algorithm, which had the best score in both the classification and the regression problem. One-hot encoding was used to improve the results: without encoding the dataframe, the model's accuracy was 6-8% lower.
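
As a reminder of what the one-hot encoding step does, here is a minimal sketch on an invented three-row frame. The column names mirror the notebook's, but the values are made up for illustration.

```python
import pandas as pd

# Invented toy data; only the column names follow the notebook
toy_df = pd.DataFrame({
    'wine country': ['France', 'Italy', 'France'],
    'wine rating': [4.1, 3.9, 4.3],
})

# Each distinct country becomes its own indicator column
encoded_df = pd.get_dummies(toy_df, columns=['wine country'], prefix=['wine country'])

print(encoded_df.columns.tolist())
```

Tree-based models can then split on each indicator column separately, instead of treating the category as a single text field.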

# End of the Notebook and Project:
This marks the end of the Notebook and Project!

Overall I have split the project into 6 different notebooks:
1) Wine Data Collection from 
2) Exploratory Data Analysis on Collected Data
3) Constructing User Rating Dataset for Recommendation System
4) Building a Recommendation System
5) Description Based Recommendation System
6) Wine Classification and Wine Quality Prediction

