<a href="https://colab.research.google.com/github/DariaEng2704/Final-Project/blob/main/Wine_Testing_Final_Project_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Quality of Wine - She Codes Final Project***



by Daria Engel


This project builds supervised predictive models that estimate the quality of a wine from parameters such as its country of origin, price, and variety. 
The models predict a wine's quality and therefore help us decide whether it is worth buying. 


# **Import the modules**

In [None]:
# Standard library
import itertools
from datetime import datetime

# Data handling and statistics
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from scipy.stats import zscore

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
%matplotlib inline

# scikit-learn
# (plot_confusion_matrix is not imported: it was removed in scikit-learn 1.2 and is not used below)
from sklearn import metrics, preprocessing
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             mean_squared_error, r2_score)
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Keras and XGBoost
import tensorflow.keras
import xgboost as xgb
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Activation, Dense
from tensorflow.keras.models import Sequential

from termcolor import colored as cl

try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except Exception:
    print("Note: not using Google CoLab")
    COLAB = False



# **Import the data**


We will load in a dataset from the **[Kaggle website](https://www.kaggle.com/)**.
Specifically, we are going to use the **[Wine Reviews Dataset](https://www.kaggle.com/zynicide/wine-reviews?select=winemag-data-130k-v2.csv)**.

In [None]:
from google.colab import files
uploaded = files.upload()

# read_csv already returns a DataFrame
df = pd.read_csv('winemag-data-130k-v2.csv')




# Preview of the first 5 rows of the data:

df.head()

In [None]:
# We will remove the first column, which contains the numbering of the rows.

df = df.drop(df.columns[0], axis=1) 
df.head()

The columns included in the data are:

+ **country** - the country the wine comes from.
+ **description** - the description of the wine by the taster.
+ **designation** - the vineyard within the winery that provided the grapes for the wine.
+ **points** - the score the taster gave the wine on a scale of 1-100 (the current dataset covers only the range 80-100, because the website that provided the dataset publishes only those scores).
+ **price** - the cost of a bottle of wine.
+ **province** - the province or state the wine comes from.
+ **region_1** - the wine-growing area in a province or state.
+ **region_2** - specific regions specified within a wine-growing area.
+ **taster_name** - the name of the taster of the wine. 
+ **taster_twitter_handle** - the Twitter username of the taster.
+ **title** - the title of the wine.
+ **variety** - the variety of the wine.
+ **winery** - the winery the wine was made by. 


# **Identifying Missing Data**

In this section, we will focus on identifying the missing data in the dataset. 
First, we will check the data type of each column:

In [None]:
df.dtypes

After checking that the inferred data types match the actual data types, we will drop the duplicate samples:


In [None]:
print(f"Data size prior to duplicate removal: {df.shape[0]} rows and {df.shape[1]} columns.")
df = df.drop_duplicates()
print(f"Data size after duplicate removal: {df.shape[0]} rows and {df.shape[1]} columns.")

Now, we will build a summary table with:
 + Features with missing data
 + The percentage of missing data for each feature
 + The number of unique values each feature has

In [None]:
nulls = df.isnull().sum()
percentage_nulls = 100 * nulls / len(df)
data_types = df.dtypes
unique_values = df.nunique()
missing_values_table = pd.concat([nulls, percentage_nulls,unique_values, data_types], axis=1)
missing_values_table = missing_values_table.rename(columns = {0 : 'Missing Values', 1 : 'Percentage', 2: 'Unique Values', 3 : 'Data Types'})
missing_values_table
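
The same summary is rebuilt several times in this notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `missing_summary` and the toy data are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count, missing percentage, unique count and dtype."""
    return pd.DataFrame({
        'Missing Values': frame.isnull().sum(),
        'Percentage': 100 * frame.isnull().mean(),
        'Unique Values': frame.nunique(),   # NaN is not counted as a value
        'Data Types': frame.dtypes,
    })

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'country': ['US', 'France', None, 'US'],
    'price': [10.0, None, None, 25.0],
})
summary = missing_summary(toy)
print(summary)
```

With such a helper, each later summary cell would reduce to a single call on the current DataFrame.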

# **Missing Data Imputation**

#### First, we will start by dropping **observations**:
  + The features **"country"**, **"province"** and **"variety"** have a negligible number of missing values. Therefore, we will remove the samples in which any of them is missing:

In [None]:
df_no_small_missing = df[(df['country'].notnull())  & (df['variety'].notnull()) & (df['province'].notnull())]
df_no_small_missing.isnull().sum()

#### Secondly, we will drop some **features**:

+ The **"designation"** variable will not be used, since 30% of its values are missing and the remaining 70% contain many unique values (roughly every second value is unique), which makes it of little use. 
+ The **"region_2"** variable has a very high percentage of missing data (61%), so we will drop it. This variable only contains regions in the USA and is irrelevant to all other countries (see below).

In [None]:
df_no_small_missing['region_2'].value_counts().plot(kind='barh')

+ For now, we will also drop the **"description"** variable, since it requires an NLP model. 
+ The variables **"title"** and **"taster_twitter_handle"** will be dropped. 
  + "title" is specific to each wine and almost always unique, so it is not informative for the models. 
  + "taster_twitter_handle" is covered by the "taster_name" variable. 

In [None]:
# Drop all of the columns mentioned above

df_relevant_features = df_no_small_missing.drop(['description','designation','region_2','title','taster_twitter_handle'], axis=1) 
df_relevant_features.head()

In [None]:
nulls = df_relevant_features.isnull().sum()
percentage_nulls = 100 * nulls / len(df_relevant_features)
data_types = df_relevant_features.dtypes
unique_values = df_relevant_features.nunique()
missing_values_table = pd.concat([nulls, percentage_nulls,unique_values, data_types], axis=1)
missing_values_table = missing_values_table.rename(columns = {0 : 'Missing Values', 1 : 'Percentage', 2: 'Unique Values', 3 : 'Data Types'})
missing_values_table

## **Missing Data Imputation - "price" variable** 

We will start by imputing the missing data in the "price" variable (numeric). Each missing value will be replaced with the median price of its province, on the assumption that prices within a province are homogeneous enough to give relatively accurate results.


In [None]:
df_relevant_features['price'] = df_relevant_features.groupby('province')['price'].transform(lambda x: x.fillna(x.median()))
print(df_relevant_features.isnull().sum())

We are left with 3 missing values for the "price" variable. As shown below, each of the three belongs to a province that has a single sample, and that sample's price is missing. 


In [None]:
print (df_relevant_features.loc[pd.isna(df_relevant_features['price'])]['province'].value_counts())


In [None]:
df_relevant_features.loc[pd.isna(df_relevant_features['price'])]

Therefore, we will fill in those 3 values with the general median of all the data. 

In [None]:
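# Assign the result back instead of using inplace=True on a column selection, which newer pandas versions warn about
df_relevant_features['price'] = df_relevant_features['price'].fillna(df_relevant_features['price'].median())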
print(df_relevant_features.isnull().sum())
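
The two-step imputation above can be illustrated on a toy table. This is a minimal sketch with invented provinces and prices. It shows why a province whose only price is missing survives the first step and needs the overall-median fallback.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'province': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [10.0, np.nan, 30.0, 50.0, np.nan, np.nan],
})

# Step 1: fill each gap with the median of its own province.
by_province = toy.groupby('province')['price'].transform(lambda s: s.fillna(s.median()))

# Province C has no observed price, so its median is NaN and the gap survives.
still_missing = int(by_province.isna().sum())

# Step 2: fall back to the overall median for whatever is left.
toy['price_filled'] = by_province.fillna(by_province.median())
print(toy)
```

As in the notebook, the fallback median is computed after the per-province step, so it already includes the values imputed in step 1.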


## **Missing Data Imputation - "taster_name" variable**

Now we will impute the missing data in the "taster_name" variable (string). The missing values will be replaced with the value "other", since we don't know whether the names are missing due to a lack of documentation or because only the major tasters are listed. 

In [None]:
df_relevant_features['taster_name'] = df_relevant_features['taster_name'].fillna('other')
print(df_relevant_features.isnull().sum())

## **Missing Data Imputation - "region_1" variable**


Only the "region_1" variable still has missing data. 
First, we will check which countries are covered by the "region_1" variable:

In [None]:
df_relevant_features.loc[df_relevant_features['region_1'].notna(), 'country'].value_counts()

As we can see, the "region_1" feature contains data for only 7 of the 43 countries that appear in the dataset. This means that it is impossible to fill in the missing data for the other 36 countries. 
Therefore, the missing values will be replaced with "other". 

In [None]:
df_relevant_features['region_1'] = df_relevant_features['region_1'].fillna('other')
print(df_relevant_features.isnull().sum())

To summarize, after handling the missing data and dropping the irrelevant columns, these are the columns we will use for the models. 

In [None]:
nulls_final = df_relevant_features.isnull().sum()
percentage_nulls_final = 100 * nulls_final / len(df_relevant_features)
data_types_final = df_relevant_features.dtypes
unique_values_final = df_relevant_features.nunique()
missing_values_table_final = pd.concat([nulls_final, percentage_nulls_final,unique_values_final, data_types_final], axis=1)
missing_values_table_final = missing_values_table_final.rename(columns = {0 : 'Missing Values', 1 : 'Percentage', 2: 'Unique Values', 3 : 'Data Types'})
missing_values_table_final

# **Data Distribution**

Before we start working on the models for wine quality prediction, we will check the distribution of both the features and the label. If needed, we will transform the data to obtain a better model.  

### **"country" Data Distribution**

In [None]:
df_relevant_features['country'].value_counts()

As we see, most of the data comes from the USA, France, and Italy. 

### **"points" Data Distribution**

In [None]:
df_relevant_features['points'].hist()
plt.suptitle('Histogram of points', fontsize='xx-large')
plt.xlabel('Points', fontsize='large')
plt.ylabel('Frequency', fontsize='large')
print ()

In [None]:
print ("The mean of the points is : {}".format(df_relevant_features['points'].mean()))
print ("The standard deviation of the points is : {}".format(df_relevant_features['points'].std()))

The scores are approximately normally distributed, with a mean of about 88.5 and a standard deviation of about 3. 
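
A histogram gives only a visual impression of normality. As a rough numeric check, one could also look at the skewness and the excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below uses synthetic scores as a stand-in for the `points` column, so the printed numbers say nothing about the wine data itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for 'points': rounded draws from N(88.5, 3), clipped to 80-100.
scores = pd.Series(np.clip(np.round(rng.normal(88.5, 3.0, size=10_000)), 80, 100))

skewness = scores.skew()          # 0 for a symmetric distribution
excess_kurtosis = scores.kurt()   # 0 for a normal distribution (Fisher definition)
print(f"skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

On the real data the same two calls would be applied to `df_relevant_features['points']`.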

### **"price" Data Distribution**

In [None]:
# A histogram of the "price" variable. 

df_relevant_features['price'].hist()
plt.suptitle('Histogram of price', fontsize='xx-large')
plt.xlabel('Price in USD', fontsize='large')
plt.ylabel('Frequency', fontsize='large')
print ()

It seems that the outliers prevent us from seeing the distribution of the bulk of the data, whose prices are much lower than the range shown in this histogram. 

To get a better view, we will first create a boxplot, which shows the possible outliers more clearly. 

In [None]:
import seaborn as sns
sns.boxplot(y = df_relevant_features['price'])
plt.suptitle('Boxplot of price', fontsize='xx-large')
print ()

In the boxplot, we see that most of the prices are lower than ~100. Therefore, we will create a histogram that gives a better view of the price distribution. To do that, we will group all values higher than 100 into one bin.

In [None]:
def plot_histogram():
    bins = np.arange(0, 110, 5)
    fig, ax = plt.subplots(figsize=(9, 5))
    # Clip prices to the left edge of the last bin so that everything >= 100 lands in it
    clipped = np.clip(df_relevant_features['price'], bins[0], bins[-2])
    ax.hist(clipped, density=False, bins=bins)
    # One tick per left bin edge, so ticks and labels have the same length; the last label is '100+'
    ticks = bins[:-1]
    xlabels = ticks.astype(str)
    xlabels[-1] += '+'
    ax.set_xlim([0, 105])
    ax.set_xticks(ticks)
    ax.set_xticklabels(xlabels)
    plt.suptitle('Histogram of price', fontsize='xx-large')
    plt.xlabel('Price in USD', fontsize='large')
    plt.ylabel('Frequency', fontsize='large')
    plt.show()
plot_histogram()

As we can see, most of the wines cost between 10-30 USD.
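
The overflow bin used in the histogram can be checked numerically without plotting. The sketch below applies the same clipping idea to a handful of made-up prices and counts how many end up in the last bin.

```python
import numpy as np

prices = np.array([4, 12, 18, 25, 40, 99, 100, 250, 3300], dtype=float)  # made-up values

edges = np.arange(0, 110, 5)                     # bin edges 0, 5, ..., 105
clipped = np.clip(prices, edges[0], edges[-2])   # everything above 100 becomes 100
counts, _ = np.histogram(clipped, bins=edges)

overflow_count = int(counts[-1])                 # last bin [100, 105] holds every price >= 100
print(dict(zip(edges[:-1].tolist(), counts.tolist())))
```

Clipping to the left edge of the last bin, rather than to its right edge, keeps the extreme values inside the plotted range while leaving all other bins unchanged.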

Since our data has a right-skewed distribution, we will apply a **log transformation**. We expect the transformed data to be approximately normally distributed.

Since the lowest price is 4 and most of the values are below 100, we will present a histogram of the natural log of the price over the range 1.25 to 5, which covers ln(4) ≈ 1.39 to ln(100) ≈ 4.61. All the values higher than 4.75 are grouped into the last bin. 

In [None]:
def plot_histogram_1():
    bins = np.arange(1.25, 5.25, 0.25)
    fig, ax = plt.subplots(figsize=(9, 5))
    # Clip log-prices to the left edge of the last bin so that everything >= 4.75 lands in it
    clipped = np.clip(np.log(df_relevant_features['price']), bins[0], bins[-2])
    ax.hist(clipped, density=False, bins=bins)
    # One tick per left bin edge, so ticks and labels have the same length; the last label is '4.75+'
    ticks = bins[:-1]
    xlabels = ticks.astype(str)
    xlabels[-1] += '+'
    ax.set_xlim([1.25, 5])
    ax.set_xticks(ticks)
    ax.set_xticklabels(xlabels)
    plt.suptitle('Histogram of log(price)', fontsize='xx-large')
    plt.xlabel('log(price)', fontsize='large')
    plt.ylabel('Frequency', fontsize='large')
    plt.show()
plot_histogram_1()

The transformed "price" variable indeed has a distribution that is much closer to normal than the original distribution. Hence, we will use log(price) as a feature. 

In [None]:
df_relevant_features['price_log'] = np.log(df_relevant_features['price'])
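
The effect of the log transformation on a right-skewed variable can also be expressed as a number. The sketch below compares the skewness before and after taking the log. It uses synthetic log-normal "prices" (an assumption made purely for illustration), not the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" drawn from a log-normal distribution.
prices = pd.Series(rng.lognormal(mean=3.2, sigma=0.6, size=5_000))

skew_raw = prices.skew()          # clearly positive for right-skewed data
skew_log = np.log(prices).skew()  # close to 0 if the log is roughly normal
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Running the same comparison on `price` and `price_log` would show how much of the skew the transformation removes in the real data.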

### **"taster_name" Data Distribution**

In [None]:
df_relevant_features['taster_name'].value_counts().plot(kind='barh')
plt.suptitle('Bar chart of tasters', fontsize='xx-large')
plt.xlabel('Frequency', fontsize='large')
plt.ylabel('Tasters', fontsize='large')
print ()

We will skip the other features, since each has a large number of categories and a summary would not be as informative as those presented above. 

In [None]:

data_types_final = df_relevant_features.dtypes
unique_values_final = df_relevant_features.nunique()
missing_values_table_final = pd.concat([unique_values_final, data_types_final], axis=1)
missing_values_table_final = missing_values_table_final.rename(columns = {0 : 'Unique Values', 1 : 'Data Types'})
missing_values_table_final

In our future models, we will use one-hot encoding for the categorical variables. This encoding greatly increases the number of features, which slows the running time and uses a large amount of memory. 
To limit this, categories with a low number of samples (<100 in our case) will be replaced by the value "other".  

In [None]:
cols = ['province','region_1','variety', 'winery']
for col in cols:
    val = df_relevant_features[col].value_counts()
    y = val[val < 100].index
    df_relevant_features[col] = df_relevant_features[col].replace({x:'other' for x in y})
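
The effect of this replacement is easiest to see on a small example. The helper below is a hypothetical sketch (the name `collapse_rare` and the toy values are not from the notebook) that applies the same rule with a threshold of 2 instead of 100.

```python
import pandas as pd

def collapse_rare(values: pd.Series, min_count: int, other_label: str = 'other') -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep the original value unless it belongs to a rare category.
    return values.where(~values.isin(rare), other_label)

toy = pd.Series(['Pinot Noir'] * 3 + ['Merlot'] * 2 + ['Furmint'])
collapsed = collapse_rare(toy, min_count=2)
print(collapsed.value_counts())
```

Only the single "Furmint" row falls below the threshold here, so the three original categories shrink to "Pinot Noir", "Merlot" and "other".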

In [None]:
data_types_final = df_relevant_features.dtypes
unique_values_final = df_relevant_features.nunique()
missing_values_table_final = pd.concat([unique_values_final, data_types_final], axis=1)
missing_values_table_final = missing_values_table_final.rename(columns = {0 : 'Unique Values', 1 : 'Data Types'})
missing_values_table_final

After replacing the rare values with "other", we will encode the categorical features as one-hot features. 

In [None]:
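# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool),
# so that .values later yields a numeric array rather than an object array
df_relevant_features_new = pd.get_dummies(df_relevant_features, columns = ['country', 'province', 'region_1','taster_name','variety','winery'], dtype=float)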
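
To make the encoding concrete, the sketch below runs `pd.get_dummies` on a three-row toy frame (invented values). Each distinct category becomes its own 0/1 column, which is why the number of features grows with the number of categories.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'country': ['US', 'France', 'US'],
    'price_log': [2.3, 3.1, 2.9],
})

# Non-encoded columns come first, followed by one column per category.
encoded = pd.get_dummies(toy, columns=['country'], dtype=float)
print(encoded)
```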

Since we transformed the 'price' variable to its log, we will drop the original 'price' column. 

In [None]:
df_relevant_features_new.drop('price', axis=1, inplace=True)
df_relevant_features_new.head()


# **Regression Models**

### **Linear Regression Model**

As a starting point, we will create a basic linear regression model that includes only the two numeric variables in the dataset: price and points. We will check the proportion of the variance in the points that is explained by the price. 

In [None]:
x1 = df_relevant_features_new['price_log'] # independent variable
y = df_relevant_features_new['points'] # dependent variable
plt.scatter(x1, y)
plt.xlabel('log (price)', fontsize = 10)
plt.ylabel('points', fontsize = 10)
plt.show()

x = sm.add_constant(x1)

lr_model = sm.OLS(y, x).fit() # Ordinary Least Squares 
lr_model.summary()

As we see in the graph and in the summary table, there is some linear relationship between the log of the price and the score. 
The R^2 value is 0.352, meaning that 35.2% of the variance in the points is explained by the log of the price. 
We also see that both the log(price) coefficient and the intercept are statistically significant, meaning that log(price) is a significant variable when predicting the points. 
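
For readers who want to see where the R^2 figure comes from, the sketch below computes it by hand as 1 minus the ratio of residual to total variation. The data are synthetic, and the coefficients used to generate them are illustrative assumptions, not the fitted values of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.5, 5.0, size=500)                 # stand-in for log(price)
y = 79.0 + 2.8 * x + rng.normal(0, 2.5, size=500)   # stand-in for points (illustrative coefficients)

slope, intercept = np.polyfit(x, y, deg=1)          # least-squares line
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)                   # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)                # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```

With a single predictor and an intercept, this value equals the squared Pearson correlation between `x` and `y`.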

In [None]:
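plt.scatter(x1, y)
# Use the fitted coefficients (about 79.008 and 2.8480 in the summary above) instead of hard-coding them
yhat = lr_model.params['const'] + lr_model.params['price_log'] * x1
plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.xlabel('log (price)', fontsize = 10)
plt.ylabel('points', fontsize = 10)
plt.legend()
plt.show()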

After creating a basic linear regression model, we will move on to machine learning regression models and check how adding the categorical features (as one-hot features) and using more advanced models improves the predictions. 

Later, we will create classification models and try to predict the quality of the wine with a categorical label that has two categories: "Poor quality" and "Good quality". 

## **Keras Regression Model**

We will start with a Keras neural network regression model. 

In [None]:
x_columns_reg = df_relevant_features_new.columns.drop('points')
x_reg = df_relevant_features_new[x_columns_reg].values
y_reg = df_relevant_features_new['points'].values

# Create train/test
x_train_reg, x_test_reg, y_train_reg, y_test_reg = train_test_split(    
    x_reg, y_reg, test_size=0.2, random_state=42)


In [None]:
# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x_reg.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
                        patience=5, verbose=1, mode='auto', 
                        restore_best_weights=True)
model.fit(x_train_reg,y_train_reg,validation_data=(x_test_reg,y_test_reg),
          callbacks=[monitor],verbose=2,epochs=1000)
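
The cell above passes the test split as `validation_data`, so early stopping picks the weights using the same rows that are later used for scoring. One possible variation, sketched here on random placeholder arrays rather than the wine features, is to hold out a separate validation set from the training portion only. The variable names below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_all = rng.normal(size=(1000, 5))   # placeholder features
y_all = rng.normal(size=1000)        # placeholder target

# First split off the test set, as in the notebook.
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, test_size=0.2, random_state=42)

# Then carve a validation set out of the training portion only.
x_fit, x_val, y_fit, y_val = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42)

print(x_fit.shape, x_val.shape, x_test.shape)
```

Under this variation the model would be fitted on `x_fit`, early stopping would monitor `(x_val, y_val)`, and the test split would be used only for the final scores.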

In [None]:
# Predict
y_pred_keras = model.predict(x_test_reg)

# Measure MSE and RMSE error (arguments in scikit-learn's conventional order: y_true, y_pred).  

score1 = metrics.mean_squared_error(y_test_reg, y_pred_keras)
print("Final score (MSE): {}".format(score1))
score2 = np.sqrt(score1)
print("Final score (RMSE): {}".format(score2))
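
As a small worked example of these two metrics, the sketch below computes MSE and RMSE by hand on four made-up scores, and compares the RMSE with a baseline that always predicts the mean score. The baseline RMSE equals the (population) standard deviation of the target, which gives a natural yardstick for any model's RMSE.

```python
import numpy as np

y_true = np.array([85.0, 88.0, 90.0, 93.0])   # made-up scores
y_pred = np.array([86.0, 87.0, 92.0, 91.0])   # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)         # mean of squared errors
rmse = np.sqrt(mse)                           # back in the units of the score

# Baseline: always predict the mean score.
baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))
print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, baseline RMSE={baseline_rmse:.2f}")
```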

In [None]:
# Calculating R2 value

y_pred_train_keras = model.predict(x_train_reg)

score_train_keras = r2_score(y_train_reg, y_pred_train_keras)
score_test_keras = r2_score(y_test_reg, y_pred_keras)
print("R^2 for train data: {}".format(score_train_keras))
print("R^2 for test data: {}".format(score_test_keras))


In [None]:
# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.plot(t['y'].tolist(), label='expected')
    plt.ylabel('output')
    plt.legend()
    plt.show()
chart_regression(y_pred_keras.flatten(), y_test_reg)

As we see, the RMSE value is 2.28 and the R^2 value for the test data is 0.45. 

We will build an additional regression model and compare the scores.  

## **XGBoost Regression Model**

In [None]:
regressor = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=2,
    gamma=0,
    max_depth=3
)

In [None]:
regressor.fit(x_train_reg, y_train_reg)


In [None]:
y_pred_XG = regressor.predict(x_test_reg)

# Measure MSE and RMSE errors (arguments in scikit-learn's conventional order: y_true, y_pred)  

score1_XG = metrics.mean_squared_error(y_test_reg, y_pred_XG)
print("Final score (MSE): {}".format(score1_XG))
score2_XG = np.sqrt(score1_XG)
print("Final score (RMSE): {}".format(score2_XG))

In [None]:
# Calculating R2 value

y_pred_train_XG = regressor.predict(x_train_reg)
score_train_XG = r2_score(y_train_reg,y_pred_train_XG)
score_test_XG = r2_score(y_test_reg,y_pred_XG)
print("R^2 for train data: {}".format(score_train_XG))
print("R^2 for test data: {}".format(score_test_XG))

In [None]:
# Regression chart.
chart_regression(y_pred_XG.flatten(), y_test_reg)

The XGBoost regression model gives slightly lower scores than the Keras regression model. Neither set of scores is good, and neither model predicts the quality of the wine well. 
In addition, the regression chart shows that the predictions of the XGBoost model have less variance than those of the Keras model and show almost no trend.

# **Binary Classification Models**

After predicting the scores of the wines using models with a numeric label, we will encode the label as a two-category label (poor/good quality wine). The models shown below are therefore binary classification models.  

In [None]:
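# 1 = poor (< 89 points), 2 = good (>= 89 points), matching the 'Poor=1' / 'Good=2' labels used in the plots below
df_relevant_features_new['points_cat'] = np.where(df_relevant_features_new["points"]>=89, 2, 1) 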
df_relevant_features_new['points_cat'].value_counts()
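
Before training a classifier it is worth knowing how balanced the two classes are, because the share of the larger class is the accuracy of a model that always predicts it. The sketch below applies the same cut-off of 89 points to a few made-up scores. The name `is_good` is hypothetical.

```python
import pandas as pd

points = pd.Series([82, 86, 88, 89, 90, 93, 95, 87])   # made-up scores
is_good = points >= 89                                  # same cut-off as in the cell above

share_good = is_good.mean()        # fraction of wines at or above the cut-off
counts = is_good.value_counts()    # number of wines on each side of the cut-off
print(counts)
print(f"share of good wines: {share_good:.2f}")
```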


In [None]:
df_relevant_features_new.drop('points', axis=1, inplace=True)


In [None]:
x_columns_class = df_relevant_features_new.columns.drop('points_cat')
x_class = df_relevant_features_new[x_columns_class].values
y_class = df_relevant_features_new['points_cat'].values

# Create train/test
x_train_class, x_test_class, y_train_class, y_test_class = train_test_split(    
    x_class, y_class, test_size=0.2, random_state=42)

## **Naive Bayes Model**

In [None]:
#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(x_train_class, y_train_class)

#Predict the response for test dataset
y_pred_NB = gnb.predict(x_test_class)

In [None]:
# Confusion matrix 

def pretty_print_conf_matrix(y_true, y_pred, 
                             classes,
                             normalize=False,
                             title='Confusion matrix',
                             cmap=plt.cm.Pastel2):
    
    cm = confusion_matrix(y_true, y_pred)

    # Configure Confusion Matrix Plot Aesthetics (no text yet) 
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=14)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=80)
    plt.yticks(tick_marks, classes)
    plt.ylabel('True label', fontsize=12)
    plt.xlabel('Predicted label', fontsize=12)
    # Calculate normalized values (so all cells sum to 1) if desired
    if normalize:
        cm = np.round(cm.astype('float') / cm.sum(),2) #(axis=1)[:, np.newaxis]

    # Place Numbers as Text on Confusion Matrix Plot
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black",
                 fontsize=12)


    # Add Precision, Recall, F-1 Score as Captions Below Plot
    rpt = classification_report(y_true, y_pred)
    rpt = rpt.replace('avg / total', '      avg')
    rpt = rpt.replace('support', 'N Obs')

    plt.annotate(rpt, 
                 xy = (0,0), 
                 xytext = (-20, -200), 
                 xycoords='axes fraction', textcoords='offset points',
                 fontsize=12, ha='left')    

    # Plot
    plt.tight_layout()


# Plot Confusion Matrix
plt.style.use('classic')
plt.figure(figsize=(2,2))
pretty_print_conf_matrix(y_test_class, y_pred_NB, 
                         classes= ['Poor=1', 'Good=2'],
                         normalize=True, 
                         title='Confusion Matrix')

The confusion matrix shows that the accuracy is 0.64, which means the model classified only 64% of the data correctly. 
To try to improve the model, we will run the same model again, this time with the price feature converted to a categorical feature.
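
As a side note on how these figures are read off a confusion matrix: accuracy is the sum of the diagonal divided by the total, while recall and precision divide the diagonal by the row sums and the column sums respectively. The counts in the sketch below are made up for illustration and are not the model's actual results.

```python
import numpy as np

# Made-up counts; rows are true classes, columns are predicted classes.
cm = np.array([[30, 20],
               [16, 34]])

accuracy = np.trace(cm) / cm.sum()                   # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)      # correct / all true members of each class
precision_per_class = np.diag(cm) / cm.sum(axis=0)   # correct / all predictions of each class
print(accuracy, recall_per_class, precision_per_class)
```

When the matrix is normalized so that all cells sum to 1, as in the plot above, the accuracy is simply the sum of the two diagonal cells.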

In [None]:
bins = [0, 2.75, 3, 3.5, 4, np.inf]
names = ['1', '2', '3', '4', '5']

df_relevant_features_new['price_log_cat'] = pd.cut(df_relevant_features_new['price_log'], bins, labels=names)
df_relevant_features_new['price_log_cat'].value_counts()
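
The binning rule used by `pd.cut` can be checked on a few made-up log-prices. By default the intervals are closed on the right, so a value that sits exactly on an edge, such as 2.75, falls into the lower bin.

```python
import numpy as np
import pandas as pd

price_log = pd.Series([1.4, 2.75, 2.9, 3.2, 3.9, 5.5])   # made-up log-prices
bins = [0, 2.75, 3, 3.5, 4, np.inf]
labels = ['1', '2', '3', '4', '5']

# Intervals are right-closed by default: (0, 2.75], (2.75, 3], (3, 3.5], (3.5, 4], (4, inf)
binned = pd.cut(price_log, bins, labels=labels)
print(binned.tolist())
```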

In [None]:
x_columns_class_all_cat = df_relevant_features_new.columns.drop('points_cat').drop('price_log')
x_class_all_cat = df_relevant_features_new[x_columns_class_all_cat].values
y_class_all_cat = df_relevant_features_new['points_cat'].values

# Create train/test

x_train_class_all_cat, x_test_class_all_cat, y_train_class_all_cat, y_test_class_all_cat = train_test_split(    
    x_class_all_cat, y_class_all_cat, test_size=0.2, random_state=42)


In [None]:
#Create a Gaussian Classifier
gnb_cat = GaussianNB()

#Train the model using the training sets
gnb_cat.fit(x_train_class_all_cat, y_train_class_all_cat)

#Predict the response for test dataset
y_pred_NB_all_cat = gnb_cat.predict(x_test_class_all_cat)

In [None]:
# Confusion matrix 

plt.style.use('classic')
plt.figure(figsize=(2,2))
pretty_print_conf_matrix(y_test_class_all_cat, y_pred_NB_all_cat, 
                         classes= ['Poor=1', 'Good=2'],
                         normalize=True, 
                         title='Confusion Matrix')

As we see, once we changed the price feature to be categorical, the accuracy did not change.  

## **XGBoost Classification Model**


After seeing an XGBoost regression model, we will build an XGBoost classification model. 

In [None]:
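# Recent XGBoost versions expect class labels 0..n-1, so shift the labels {1, 2} to {0, 1} for training
xgbc = xgb.XGBClassifier(n_estimators=100)
xgbc.fit(x_train_class, y_train_class - 1)


In [None]:
#Predict the response for test dataset and shift the labels back to {1, 2}
y_pred_XGBclass = xgbc.predict(x_test_class) + 1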

In [None]:
# Confusion matrix 

plt.style.use('classic')
plt.figure(figsize=(2,2))
pretty_print_conf_matrix(y_test_class, y_pred_XGBclass, 
                         classes= ['Poor=1', 'Good=2'],
                         normalize=True, 
                         title='Confusion Matrix')
