# **Table of Contents**
## 1. Loading the Dataset

## 2. Pre-processing the Dataset

## 3. Feature Engineering and Data cleaning

## 4. Model Building

## 5. Model Performance Evaluation

## 6. Custom prediction

## 7. Prediction using test.csv


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Importing Libraries**

In [2]:
import pandas as pd
import numpy as np
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

**Loading dataset from Drive**

In [3]:
data=pd.read_csv("/content/drive/MyDrive/train.csv")

**Data Analysis**

In [4]:
# Display the number of rows and columns
print("Shape of the dataset:", data.shape)

Shape of the dataset: (82657, 12)


In [5]:
data.head(3)

Unnamed: 0,user_name,country,review_title,review_description,designation,points,price,province,region_1,region_2,winery,variety
0,,Australia,Andrew Peace 2007 Peace Family Vineyard Chardo...,"Classic Chardonnay aromas of apple, pear and h...",Peace Family Vineyard,83,10.0,Australia Other,South Eastern Australia,,Andrew Peace,Chardonnay
1,@wawinereport,US,North by Northwest 2014 Red (Columbia Valley (...,This wine is near equal parts Syrah and Merlot...,,89,15.0,Washington,Columbia Valley (WA),Columbia Valley,North by Northwest,Red Blend
2,,Italy,Renato Ratti 2007 Conca (Barolo),Barolo Conca opens with inky dark concentratio...,Conca,94,80.0,Piedmont,Barolo,,Renato Ratti,Nebbiolo


In [6]:
# Columns
data.columns

Index(['user_name', 'country', 'review_title', 'review_description',
       'designation', 'points', 'price', 'province', 'region_1', 'region_2',
       'winery', 'variety'],
      dtype='object')

In [7]:
# View data types of each column
print("\nData types:")
print(data.dtypes)


Data types:
user_name              object
country                object
review_title           object
review_description     object
designation            object
points                  int64
price                 float64
province               object
region_1               object
region_2               object
winery                 object
variety                object
dtype: object


In [8]:
# View Null values
data.isnull().sum()

user_name             19393
country                  35
review_title              0
review_description        0
designation           23647
points                    0
price                  5569
province                 35
region_1              12754
region_2              46708
winery                    0
variety                   0
dtype: int64

**Handling missing values:**

* The columns user_name, region_1, and region_2 contain a substantial number of missing values and are not highly relevant for classification, so we drop them.

* 'country' and 'province' have very few missing values (35 each), but they are not used as features in this notebook, so these columns are dropped as well.

* 'designation' is missing in 23,647 rows; we drop those rows and keep the column.

* We fill the missing values in the 'price' column with the column mean.
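The three kinds of operation used in the next cell (dropping a column, dropping rows, filling with the mean) can be illustrated on a few toy records. This is a minimal sketch in plain Python, not the notebook's pandas code; the records, the `missing_fraction` helper, and the 50% `DROP_THRESHOLD` are assumptions made up for illustration.

```python
from statistics import mean

# Toy records standing in for rows of train.csv; None marks a missing value.
rows = [
    {"region_2": None, "designation": "Conca", "price": 80.0},
    {"region_2": None, "designation": None, "price": 35.0},
    {"region_2": "Columbia Valley", "designation": "Reserve", "price": 15.0},
    {"region_2": None, "designation": "Estate", "price": None},
]

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

# 1. Drop columns that are mostly empty (hypothetical 50% threshold).
DROP_THRESHOLD = 0.5
dropped = [c for c in rows[0] if missing_fraction(rows, c) > DROP_THRESHOLD]
rows = [{k: v for k, v in r.items() if k not in dropped} for r in rows]

# 2. Drop rows that lack a designation.
rows = [r for r in rows if r["designation"] is not None]

# 3. Fill the remaining missing prices with the mean of the observed prices.
price_mean = mean(r["price"] for r in rows if r["price"] is not None)
for r in rows:
    if r["price"] is None:
        r["price"] = price_mean

print(dropped, rows)
```

Note that the order matters: the mean in step 3 is computed only from the rows that survive step 2.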

In [9]:
# Drop columns that are mostly empty or not used as features
data.drop(columns=["user_name", "region_1", "region_2", "country", "province"], inplace=True)

# Drop rows with missing values in the "variety" or "designation" column
data.dropna(subset=["variety", "designation"], inplace=True)

# Fill missing values in "price" column with the mean value
# (assigning back avoids the chained inplace fillna, which newer pandas warns about)
data["price"] = data["price"].fillna(data["price"].mean())


In [10]:
data.isnull().sum()

review_title          0
review_description    0
designation           0
points                0
price                 0
winery                0
variety               0
dtype: int64

In [11]:
# Check value count for each class
data['variety'].value_counts()

Pinot Noir                    7924
Chardonnay                    6271
Red Blend                     6000
Cabernet Sauvignon            4635
Riesling                      3511
Bordeaux-style Red Blend      2913
Syrah                         2454
Sauvignon Blanc               2312
Rosé                          1981
Portuguese Red                1836
Nebbiolo                      1791
Sparkling Blend               1705
Zinfandel                     1667
White Blend                   1663
Malbec                        1607
Merlot                        1290
Sangiovese                    1235
Tempranillo                   1171
Champagne Blend               1131
Rhône-style Red Blend          964
Grüner Veltliner               935
Portuguese White               835
Cabernet Franc                 683
Pinot Gris                     644
Gamay                          610
Gewürztraminer                 586
Bordeaux-style White Blend     336
Pinot Grigio                   320
Name: variety, dtype: int64

**Insight:**

**Class Imbalance**: The class distribution shows a considerable disparity in the number of samples for each wine variety. Some varieties like "Pinot Noir," "Chardonnay," and "Red Blend" have a large number of occurrences, while others like "Bordeaux-style White Blend," "Pinot Grigio," and "Gewürztraminer" have significantly fewer occurrences.

**Impact on Classification:** The model might become biased towards predicting the majority classes, leading to lower precision, recall, and F1 scores for the minority classes.

To address this issue and improve the classification model's performance, it is essential to balance the dataset using appropriate balancing techniques. Some common techniques include:

**Oversampling:** Generating synthetic samples for minority classes to increase their representation in the dataset. This can be done using methods like SMOTE (Synthetic Minority Over-sampling Technique).

**Undersampling:** Randomly removing samples from the majority classes to reduce their dominance in the dataset. This can help balance the class distribution.

**Class Weighting:** Assigning higher weights to the minority classes during model training, so that errors on those classes cost more while the model learns.

**Ensemble Methods:** Ensemble techniques such as Random Forest, or boosting algorithms such as AdaBoost, can be less sensitive to class imbalance than a single model, although per-class metrics still need to be checked.
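Class weighting is not used later in this notebook, so here is a small sketch of how such weights can be derived. It uses the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. The function name `balanced_class_weights` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy label list: one dominant class and one rare class.
labels = ["Pinot Noir"] * 6 + ["Chardonnay"] * 3 + ["Pinot Grigio"] * 1
weights = balanced_class_weights(labels)

for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls}: {w:.2f}")
```

The rarest class receives the largest weight. A dictionary like this, keyed by the encoded labels, could be passed as the `class_weight` argument of estimators that accept it, such as `SVC` or `RandomForestClassifier`.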

Of these techniques, I used undersampling and an ensemble method (Random Forest). Let's start with undersampling.

In [12]:
df1 = data.groupby('variety').apply(lambda x: x.sample(500, replace=True) if len(x) > 500 else x)
#df1=data

**Undersampling:**
 Here we group the data by the "variety" column and apply a lambda function to each group. For groups with more than 500 rows, the function samples 500 rows; smaller groups are kept whole. Because the sampling uses replace=True, the same row can be drawn more than once. The result is a nearly balanced dataset: every class has 500 rows except the two smallest ones.
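For comparison, here is a sketch of the same per-class capping done without replacement, so that no row can appear twice. This is an alternative written in plain Python, not the code the notebook ran; `undersample`, its parameters, and the toy rows are assumptions for illustration.

```python
import random
from collections import defaultdict

def undersample(rows, label_key, cap, seed=0):
    """Keep at most `cap` rows per class, sampling without replacement."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    sampled = []
    for members in groups.values():
        if len(members) > cap:
            # random.sample never returns the same element twice
            members = rng.sample(members, cap)
        sampled.extend(members)
    return sampled

# Toy data: 8 rows of a majority class, 3 rows of a minority class.
rows = ([{"id": i, "variety": "Pinot Noir"} for i in range(8)]
        + [{"id": 100 + i, "variety": "Pinot Grigio"} for i in range(3)])

balanced = undersample(rows, "variety", cap=5)
print(len(balanced))
```

In pandas, the corresponding change would be to call `x.sample(500)` without `replace=True`. Avoiding duplicates matters for evaluation, because a duplicated row can otherwise end up in both the training and the test split.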

 **Note**
 If you want to use the ensemble method on the imbalanced data instead, comment out the line
  df1 = data.groupby('variety').apply(lambda x: x.sample(500, replace=True) if len(x) > 500 else x)

  and uncomment df1=data, so the rest of the code runs unchanged.

In [13]:
df1['variety'].value_counts()

Bordeaux-style Red Blend      500
Portuguese Red                500
White Blend                   500
Tempranillo                   500
Syrah                         500
Sparkling Blend               500
Sauvignon Blanc               500
Sangiovese                    500
Rosé                          500
Riesling                      500
Rhône-style Red Blend         500
Red Blend                     500
Portuguese White              500
Pinot Noir                    500
Pinot Gris                    500
Nebbiolo                      500
Merlot                        500
Malbec                        500
Grüner Veltliner              500
Gewürztraminer                500
Gamay                         500
Chardonnay                    500
Champagne Blend               500
Cabernet Sauvignon            500
Cabernet Franc                500
Zinfandel                     500
Bordeaux-style White Blend    336
Pinot Grigio                  320
Name: variety, dtype: int64

In [14]:
df1.shape,data.shape

((13656, 7), (59010, 7))

**Creating new dataframe:**

For review classification, we need text data. Our dataset has two text columns, 'review_title' and 'review_description', which hold most of the information. To avoid losing information by selecting just one of them, we concatenate them (in df2) and create a new dataframe (final_df).
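One detail worth checking when concatenating two text columns is the separator: joining with a plain `+` fuses the last word of the title with the first word of the description. Below is a minimal sketch of a separator-aware join; `combine_text` and the example strings are made up for illustration.

```python
def combine_text(title, description, sep=" "):
    """Join title and description with a separator, skipping empty parts."""
    parts = [p.strip() for p in (title, description) if p and p.strip()]
    return sep.join(parts)

# Made-up example strings, not rows from the dataset.
title = "Example Winery 2011 Red (Red Mountain)"
description = "This wine shows dark fruit."

fused = title + description          # "...(Red Mountain)This wine..."
joined = combine_text(title, description)

print(fused)
print(joined)
```

With pandas, a similar effect should be achievable with `df1['review_title'].str.cat(df1['review_description'], sep=' ')`.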

In [15]:
df2=df1['review_title']+ df1['review_description']

In [16]:
final_df=pd.DataFrame({'text':df2,'variety':df1['variety']})

In [17]:
final_df.head()

variety (index),row (index),text,variety
Bordeaux-style Red Blend,11190,Château Peyrabon 2010 Barrel sample (Haut-Méd...,Bordeaux-style Red Blend
Bordeaux-style Red Blend,77250,Cooper-Garrod 2012 Test Pilot F6F Hellcat Red ...,Bordeaux-style Red Blend
Bordeaux-style Red Blend,26870,Cadence 2011 Tapteil Vineyard Red (Red Mountai...,Bordeaux-style Red Blend
Bordeaux-style Red Blend,31698,Château Bouscaut 2009 Barrel sample (Pessac-L...,Bordeaux-style Red Blend
Bordeaux-style Red Blend,4295,Babich 2011 The Patriarch Premium Red (Hawke's...,Bordeaux-style Red Blend


**Data Cleaning and Preprocessing:**

In [18]:
# Convert text to lowercase
final_df['text'] = final_df['text'].str.lower()

In [19]:
# Remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

final_df['text'] =final_df['text'].apply(remove_punctuation)

In [20]:
# Remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

final_df['text'] = final_df['text'].apply(remove_numbers)

In [21]:
# Tokenize the text (split into words)
final_df['text'] = final_df['text'].apply(lambda x: x.split())

In [22]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
final_df['text'] = final_df['text'].apply(lambda x: [word for word in x if word not in stop_words])

In [24]:
# Lemmatization
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
final_df['text'] = final_df['text'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [25]:
# Convert the tokenized text back to sentences
final_df['text'] = final_df['text'].apply(lambda x: ' '.join(x))

In [26]:
final_df['text'][0]

'château peyrabon barrel sample hautmédoc big extracted wine hard tannin firm tough'

**Convert the target column 'variety' to numeric using LabelEncoder:**

In [27]:
label_encoder = LabelEncoder()
final_df['variety_numeric'] = label_encoder.fit_transform(final_df['variety'])


In [28]:
final_df.head()

variety (index),row (index),text,variety,variety_numeric
Bordeaux-style Red Blend,11190,château peyrabon barrel sample hautmédoc big e...,Bordeaux-style Red Blend,0
Bordeaux-style Red Blend,77250,coopergarrod test pilot ff hellcat red santa c...,Bordeaux-style Red Blend,0
Bordeaux-style Red Blend,26870,cadence tapteil vineyard red red mountainthis ...,Bordeaux-style Red Blend,0
Bordeaux-style Red Blend,31698,château bouscaut barrel sample pessacléognan b...,Bordeaux-style Red Blend,0
Bordeaux-style Red Blend,4295,babich patriarch premium red hawkes bayin mark...,Bordeaux-style Red Blend,0


In [29]:
final_df['variety_numeric'].unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])
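To make the mapping between variety names and the integers above concrete, here is a simplified stand-in for the encoding step. It is a sketch, not scikit-learn's implementation: `SimpleLabelEncoder` is a hypothetical class that mimics the behaviour of sorting the distinct labels and numbering them from 0.

```python
class SimpleLabelEncoder:
    """Toy label encoder: sorted unique labels are numbered 0..n-1."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {cls: i for i, cls in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[label] for label in labels]

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]

encoder = SimpleLabelEncoder().fit(["Syrah", "Gamay", "Merlot", "Gamay"])
codes = encoder.transform(["Syrah", "Gamay", "Merlot", "Gamay"])
names = encoder.inverse_transform([1, 2])

print(encoder.classes_, codes, names)
```

The inverse mapping is what turns a model's numeric predictions back into readable variety names.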

**Create the feature matrix using CountVectorizer:**

In [31]:
count_vectorizer = CountVectorizer(max_features=5000)

X = count_vectorizer.fit_transform(final_df['text'])

Y=final_df['variety_numeric']


In [32]:
X.shape

(13656, 5000)
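The shape above means one row per review and one column per vocabulary term. The idea behind a count matrix with a capped vocabulary can be sketched in plain Python. This is a simplification: it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer, and the alphabetical tie-breaking among equally frequent terms is an assumption. `fit_vocabulary` and `transform` are illustrative names.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the `max_features` most frequent terms across the corpus."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Most frequent first; ties broken alphabetically (assumption).
    top = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    # Columns are assigned in alphabetical order of the kept terms.
    return {term: col for col, term in enumerate(sorted(top))}

def transform(docs, vocab):
    """Turn each document into a row of term counts."""
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:  # terms outside the vocabulary are ignored
                row[vocab[tok]] += 1
        matrix.append(row)
    return matrix

docs = ["cherry vanilla cherry oak", "apple pear oak", "cherry tannin"]
vocab = fit_vocabulary(docs, max_features=3)
X_toy = transform(docs, vocab)

print(vocab)
print(X_toy)
```

Terms that fall outside the kept vocabulary, here "vanilla", "pear", and "tannin", contribute nothing to the matrix, which is also what happens to unseen words at prediction time.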

**Train test split**

In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.2, random_state=18)


**Model Training and Evaluation:**

**1. Naive Bayes Classifier**

In [34]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Make predictions on the test set
nb_predictions = nb_model.predict(X_test)

# Evaluate the model
nb_accuracy = accuracy_score(y_test, nb_predictions)
print("Naive Bayes Accuracy:", nb_accuracy)

# Classification report
print(" Naive Bayes Classifier Classification Report:")
nb_classification_report=classification_report(y_test, nb_predictions, target_names=label_encoder.classes_)
print(nb_classification_report)


Naive Bayes Accuracy: 0.8543191800878477
 Naive Bayes Classifier Classification Report:
                            precision    recall  f1-score   support

  Bordeaux-style Red Blend       0.70      0.70      0.70       100
Bordeaux-style White Blend       0.79      0.80      0.79        60
            Cabernet Franc       0.93      0.72      0.81       107
        Cabernet Sauvignon       0.80      0.87      0.84        93
           Champagne Blend       0.85      0.93      0.88        95
                Chardonnay       0.91      0.76      0.83       106
                     Gamay       0.66      0.99      0.79       105
            Gewürztraminer       0.97      0.92      0.94        96
          Grüner Veltliner       0.95      0.98      0.96        94
                    Malbec       0.85      0.78      0.81       108
                    Merlot       0.83      0.73      0.78        94
                  Nebbiolo       0.90      0.96      0.93        99
              Pinot Grigio 

**2. Support Vector Machine (SVM) Classifier:**

In [35]:
from sklearn.svm import SVC

# Train the SVM model
svm_model = SVC()
svm_model.fit(X_train, y_train)

# Make predictions on the test set
svm_predictions = svm_model.predict(X_test)

# Evaluate the model
svm_accuracy = accuracy_score(y_test, svm_predictions)
print("SVM Accuracy:", svm_accuracy)

# Classification report
print("SVM Classification Report:")
svm_classification_report=classification_report(y_test, svm_predictions, target_names=label_encoder.classes_)
print(svm_classification_report)


SVM Accuracy: 0.9348462664714495
SVM Classification Report:
                            precision    recall  f1-score   support

  Bordeaux-style Red Blend       0.83      0.84      0.84       100
Bordeaux-style White Blend       0.83      0.73      0.78        60
            Cabernet Franc       0.98      0.90      0.94       107
        Cabernet Sauvignon       0.97      0.98      0.97        93
           Champagne Blend       0.98      0.94      0.96        95
                Chardonnay       0.92      0.92      0.92       106
                     Gamay       0.82      0.98      0.90       105
            Gewürztraminer       1.00      1.00      1.00        96
          Grüner Veltliner       1.00      1.00      1.00        94
                    Malbec       1.00      0.97      0.99       108
                    Merlot       1.00      0.93      0.96        94
                  Nebbiolo       0.95      0.96      0.95        99
              Pinot Grigio       1.00      0.99      0.

**3. Random Forest Classifier (with balanced dataset)**

In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,classification_report

# Train the Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Make predictions on the test set
rf_predictions = rf_model.predict(X_test)

# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Accuracy:", rf_accuracy)

# Classification report
print("Random Forest Classifier Classification Report:")
rf_classification_report=classification_report(y_test, rf_predictions, target_names=label_encoder.classes_)
print(rf_classification_report)

Random Forest Accuracy: 0.9187408491947291
Random Forest Classifier Classification Report:
                            precision    recall  f1-score   support

  Bordeaux-style Red Blend       0.84      0.76      0.80       100
Bordeaux-style White Blend       0.79      0.77      0.78        60
            Cabernet Franc       0.93      0.93      0.93       107
        Cabernet Sauvignon       0.95      0.99      0.97        93
           Champagne Blend       0.95      0.94      0.94        95
                Chardonnay       0.88      0.88      0.88       106
                     Gamay       0.78      0.99      0.87       105
            Gewürztraminer       1.00      1.00      1.00        96
          Grüner Veltliner       1.00      1.00      1.00        94
                    Malbec       0.98      1.00      0.99       108
                    Merlot       0.99      0.89      0.94        94
                  Nebbiolo       0.90      0.95      0.92        99
              Pinot Grig

**Random Forest Classifier with imbalanced data:**

Ensemble methods such as Random Forest are often suggested for imbalanced data, so I also trained Random Forest on the imbalanced dataset (df1 = data, without undersampling). The results are below.

In [34]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score,classification_report

# # Train the Random Forest model
# rf_model = RandomForestClassifier()
# rf_model.fit(X_train, y_train)

# # Make predictions on the test set
# rf_predictions = rf_model.predict(X_test)

# # Evaluate the model
# from sklearn.metrics import accuracy_score,classification_report
# rf_accuracy = accuracy_score(y_test, rf_predictions)
# print("Random Forest Accuracy:", rf_accuracy)

# # Classification report
# print("Random Forest Classifier Classification Report:")
# rf_classification_report=classification_report(y_test, rf_predictions, target_names=label_encoder.classes_)
# print(rf_classification_report)

Random Forest Accuracy: 0.905693950177936
Random Forest Classifier Classification Report:
                            precision    recall  f1-score   support

  Bordeaux-style Red Blend       0.79      0.71      0.75       555
Bordeaux-style White Blend       0.94      0.25      0.40        59
            Cabernet Franc       0.96      0.56      0.70       126
        Cabernet Sauvignon       0.90      0.98      0.94       940
           Champagne Blend       0.95      0.83      0.89       202
                Chardonnay       0.88      0.99      0.93      1270
                     Gamay       0.98      0.60      0.75       131
            Gewürztraminer       1.00      0.90      0.94       115
          Grüner Veltliner       1.00      1.00      1.00       180
                    Malbec       0.97      0.97      0.97       302
                    Merlot       0.98      0.78      0.87       260
                  Nebbiolo       0.92      0.90      0.91       377
              Pinot Grigi

**Performance of Random Forest with imbalanced data:**

* Accuracy: 0.91
* Precision(avg): 0.94
* Recall(avg): 0.91
* F1 score(avg): 0.90

The overall accuracy, precision, recall, and F1 score look good, but the averages hide weak spots in the minority classes.

In the report above, recall is only 0.25 for Bordeaux-style White Blend, 0.56 for Cabernet Franc, and 0.60 for Gamay, even though their precision is 0.94 or higher. The model rarely mislabels other wines as these varieties, but it misses many of their actual samples, which is the majority-class bias described earlier.
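How much an averaged score reveals depends on the kind of average. The sketch below computes per-class recall on a toy example and compares the macro average (every class counts equally) with the support-weighted average. The labels "A" and "B" and the helper `recall_per_class` are assumptions for illustration; the notebook does not state which average its summary figures use.

```python
from collections import Counter

def recall_per_class(y_true, y_pred):
    """Return ({class: recall}, {class: support})."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {cls: hits[cls] / n for cls, n in support.items()}
    return recalls, support

# Toy case: majority class "A" is always right, minority "B" half the time.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["B", "A"]

recalls, support = recall_per_class(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)
weighted_recall = sum(recalls[c] * support[c] for c in recalls) / len(y_true)

print(recalls, macro_recall, weighted_recall)
```

The weighted average stays close to the majority-class score, while the macro average drops as soon as one small class is handled poorly, so on imbalanced data the macro figure is usually the more informative one.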

**Model Comparison:**

In [37]:
# Model names and accuracies
models = ['Naive Bayes', 'SVM', 'Random Forest']
accuracies = [0.85, 0.93,0.91 ]

# Extracted average metrics from the classification reports
nb_avg_metrics = {
    'Precision (avg)': 0.86,
    'Recall (avg)': 0.85,
    'F1-score (avg)': 0.85
}

svm_avg_metrics = {
    'Precision (avg)': 0.94,
    'Recall (avg)': 0.93,
    'F1-score (avg)': 0.93
}

rf_avg_metrics = {
    'Precision (avg)': 0.92,
    'Recall (avg)': 0.92,
    'F1-score (avg)': 0.92
}

# Create a DataFrame for model comparison
# (named comparison_data so the training DataFrame `data` is not overwritten)
comparison_data = {
    'Model': models,
    'Accuracy': accuracies,
    'Precision (avg)': [nb_avg_metrics['Precision (avg)'], svm_avg_metrics['Precision (avg)'], rf_avg_metrics['Precision (avg)']],
    'Recall (avg)': [nb_avg_metrics['Recall (avg)'], svm_avg_metrics['Recall (avg)'], rf_avg_metrics['Recall (avg)']],
    'F1-score (avg)': [nb_avg_metrics['F1-score (avg)'], svm_avg_metrics['F1-score (avg)'], rf_avg_metrics['F1-score (avg)']]
}

model_comparison_df = pd.DataFrame(comparison_data)
print(model_comparison_df)


           Model  Accuracy  Precision (avg)  Recall (avg)  F1-score (avg)
0    Naive Bayes      0.85             0.86          0.85            0.85
1            SVM      0.93             0.94          0.93            0.93
2  Random Forest      0.91             0.92          0.92            0.92


**Custom prediction**

In [38]:
new_sample_text = "This wine has a rich and fruity taste with hints of cherry and vanilla. Cabernet Franc is becoming my favourite variety."

# Convert the new sample text to a feature vector using the same CountVectorizer used in training
# Make sure to preprocess the text in the same way as you did during training (lowercasing, removing stopwords, etc.)
new_sample_features = count_vectorizer.transform([new_sample_text])

# Make predictions using the trained Naive Bayes model
prediction = nb_model.predict(new_sample_features)

# Get the corresponding class label (if needed)
class_label = label_encoder.inverse_transform(prediction)

# Print the prediction
print("Predicted Class:", class_label[0])

Predicted Class: Cabernet Franc
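The comment in the cell above asks for the same preprocessing as during training, but the sample text is passed to the vectorizer as-is. A compact helper along the following lines could be applied first. This is a sketch under assumptions: `preprocess` is a hypothetical function, `STOP_WORDS` is a tiny stand-in for the NLTK English stopword list, and lemmatization is left out so the example runs without downloads.

```python
import re
import string

# Tiny stand-in for stopwords.words('english'); not the real NLTK list.
STOP_WORDS = {"this", "has", "a", "and", "with", "of", "is", "my"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

sample = "This wine has a rich and fruity taste with hints of cherry and vanilla."
cleaned = preprocess(sample)

print(cleaned)
```

In the notebook, the cleaned string would then go through `count_vectorizer.transform([cleaned])`, with the `lemmatizer` from the training cells applied to the tokens as well.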


**Predictions using test.csv**

In [46]:
test_data=pd.read_csv("/content/drive/MyDrive/test.csv")

In [47]:
test_data.shape

(20665, 11)

In [48]:
test_data.isnull().sum()

user_name              4738
country                   4
review_title              0
review_description        0
designation            5989
points                    0
price                  1394
province                  4
region_1               3314
region_2              11751
winery                    0
dtype: int64

**Test data preprocessing and cleaning**

In [49]:
# Drop columns that are mostly empty or not used as features
test_data.drop(columns=["user_name", "region_1", "region_2", "country", "province", "designation"], inplace=True)

# Fill missing values in "price" column with the mean value
test_data["price"] = test_data["price"].fillna(test_data["price"].mean())

In [50]:
test_data.isnull().sum()

review_title          0
review_description    0
points                0
price                 0
winery                0
dtype: int64

In [51]:
test_df1=test_data['review_title']+test_data['review_description']

In [55]:
final_test=pd.DataFrame({'text':test_df1})

In [56]:
final_test.head()

Unnamed: 0,text
0,Boedecker Cellars 2011 Athena Pinot Noir (Will...
1,Mendoza Vineyards 2012 Gran Reserva by Richard...
2,Prime 2013 Chardonnay (Coombsville)Slightly so...
3,Bodega Cuarto Dominio 2012 Chento Vineyard Sel...
4,SassodiSole 2012 Brunello di MontalcinoEarthy...


In [57]:
# Convert text to lowercase
final_test['text'] = final_test['text'].str.lower()

In [58]:
# Remove punctuation
final_test['text'] =final_test['text'].apply(remove_punctuation)

In [59]:
final_test['text'] = final_test['text'].apply(remove_numbers)

In [60]:
# Tokenize the text (split into words)
final_test['text'] = final_test['text'].apply(lambda x: x.split())

In [61]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
final_test['text'] = final_test['text'].apply(lambda x: [word for word in x if word not in stop_words])

In [62]:
final_test['text'] = final_test['text'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [63]:
# Convert the tokenized text back to sentences
final_test['text'] = final_test['text'].apply(lambda x: ' '.join(x))

In [64]:
X_test_csv = count_vectorizer.transform(final_test['text'])

In [68]:
# Make predictions using the trained Random Forest classifier
# (X_test_csv avoids overwriting the X_test held out from train.csv)
predictions = rf_model.predict(X_test_csv)


In [69]:
predictions

array([14,  9,  5, ...,  3, 25,  3])

In [71]:
predictions.shape

(20665,)

In [70]:
# Add the predictions to the test_data DataFrame as a new column 'predicted_variety',
# mapping the numeric class codes back to variety names
test_data['predicted_variety'] = label_encoder.inverse_transform(predictions)

# Save the predictions to a new CSV file
test_data.to_csv('predictions.csv', index=False)

In [72]:
from google.colab import files

files.download('predictions.csv')


**Thank you!**