# The Wine Land

## Data Description

user_name - The user name of the reviewer.

country - The country that the wine is from.

review_title - The title of the wine review, which often contains the vintage.

review_description - A verbose review of the wine.

designation - The vineyard within the winery where the grapes that made the wine are from.

points - The rating given by the reviewer, on a scale from 0 to 100.

price - The cost of a bottle of the wine.

province - The province or state that the wine is from.

region_1 - The wine-growing area in a province or state (e.g., Napa).

region_2 - A more specific region within a wine-growing area (e.g., Rutherford inside the Napa Valley). This value is sometimes blank.

winery - The winery that made the wine.

variety - The type of grapes used to make the wine. This is the dependent variable for task 2 of the assignment.


___

The notebook is divided into 3 steps:

1. Data Cleaning

2. Model Training

3. Model Testing

In [4]:
# Import the required libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

In [5]:
df= pd.read_csv(r"C:\Users\Irish Mehta\Desktop\Knight FinTech\Knight ML Assignment\Data\train.csv")


In [6]:
df = df[['review_description','designation','country','review_title','variety','price','winery','points']]
df.head()

Unnamed: 0,review_description,designation,country,review_title,variety,price,winery,points
0,"Classic Chardonnay aromas of apple, pear and h...",Peace Family Vineyard,Australia,Andrew Peace 2007 Peace Family Vineyard Chardo...,Chardonnay,10.0,Andrew Peace,83
1,This wine is near equal parts Syrah and Merlot...,,US,North by Northwest 2014 Red (Columbia Valley (...,Red Blend,15.0,North by Northwest,89
2,Barolo Conca opens with inky dark concentratio...,Conca,Italy,Renato Ratti 2007 Conca (Barolo),Nebbiolo,80.0,Renato Ratti,94
3,It's impressive what a small addition of Sauvi...,L'Abbaye,France,Domaine l'Ancienne Cure 2010 L'Abbaye White (B...,Bordeaux-style White Blend,22.0,Domaine l'Ancienne Cure,87
4,"This ripe, sweet wine is rich and full of drie...",Le Cèdre Vintage,France,Château du Cèdre 2012 Le Cèdre Vintage Malbec ...,Malbec,33.0,Château du Cèdre,88


In [7]:
# Neither the reviews nor their titles contain duplicates or NaN values, so no rows are removed
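The check behind that statement is not shown in the notebook. A minimal sketch of how it could be done follows, using a small hypothetical frame (`toy`) in place of the real training data; the column names match the dataset, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data (rows are made up)
toy = pd.DataFrame({
    "review_description": ["Ripe and fruity.", "Crisp and dry.", "Ripe and fruity."],
    "review_title": ["Winery A 2012 Merlot", "Winery B 2015 Riesling", None],
})

text_cols = ["review_description", "review_title"]

# Number of missing values in each text column
missing_counts = toy[text_cols].isna().sum()

# Number of rows whose review text repeats an earlier row
duplicate_reviews = int(toy.duplicated(subset=["review_description"]).sum())

print(missing_counts)
print("duplicate reviews:", duplicate_reviews)
```

On the real data, the same two expressions applied to `df` would both be expected to report zero if the statement above holds.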

In [8]:
# Convert the text columns into numeric feature vectors

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from scipy.sparse import hstack

# Create the transform. HashingVectorizer is stateless: it hashes each token
# into one of n_features columns and does not build a vocabulary.
vectorizer = HashingVectorizer(n_features=500, stop_words='english')
# Vectorize the review text and the review title separately
review = vectorizer.fit_transform(df['review_description'])
review2 = vectorizer.fit_transform(df['review_title'])

print(review.shape,review2.shape)


(82657, 500) (82657, 500)
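Because `HashingVectorizer` keeps no fitted state, two instances created with the same settings should map the same text to the same columns. The sketch below illustrates this on two invented sentences and also shows how `hstack` concatenates the two 500-column blocks; the sample strings are assumptions, not rows from the dataset.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import HashingVectorizer

# Invented sample text, for illustration only
docs = [
    "Classic Chardonnay aromas of apple and pear",
    "Earthy aromas of leather and dark plum",
]
titles = ["Winery A 2007 Chardonnay", "Winery B 2012 Nebbiolo"]

# Two independent vectorizers with identical settings
vec_a = HashingVectorizer(n_features=500, stop_words="english")
vec_b = HashingVectorizer(n_features=500, stop_words="english")

# No fit is needed: transform depends only on the settings and the text
features_a = vec_a.transform(docs)
features_b = vec_b.transform(docs)
same_features = np.array_equal(features_a.toarray(), features_b.toarray())

# Stack description and title features side by side
combined = hstack((features_a, vec_a.transform(titles)))

print(same_features, combined.shape)
```

This is why re-creating the vectorizer for the test set later in the notebook still yields features compatible with the trained model, as long as `n_features` and the other settings are unchanged.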


In [9]:
from sklearn.model_selection import train_test_split

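# Combine the description and title features into one sparse matrix
fin = hstack((review, review2))
y = df['variety']

# Hold out 40% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(fin, y, train_size=0.6, test_size=0.4, random_state=3)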

In [10]:
import timeit
from sklearn.svm import SVC

# Time training, prediction, and scoring together
start = timeit.default_timer()

# Train the linear SVM classifier
svm_model_linear = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)

# Model accuracy on X_test
accuracy = svm_model_linear.score(X_test, y_test)
print(accuracy)

stop = timeit.default_timer()

print('Time: ', stop - start)

0.9622841242476484
Time:  612.894408327


In [11]:
accuracy

0.9622841242476484

In [12]:
from sklearn.metrics import classification_report 

print(classification_report(y_test, svm_predictions))

                            precision    recall  f1-score   support

  Bordeaux-style Red Blend       0.88      0.90      0.89      2182
Bordeaux-style White Blend       0.89      0.86      0.88       338
            Cabernet Franc       0.98      0.98      0.98       438
        Cabernet Sauvignon       0.99      1.00      1.00      2979
           Champagne Blend       0.97      0.97      0.97       477
                Chardonnay       0.99      0.99      0.99      3740
                     Gamay       0.96      0.93      0.95       318
            Gewürztraminer       0.99      0.99      0.99       346
          Grüner Veltliner       1.00      1.00      1.00       450
                    Malbec       0.98      0.99      0.99       831
                    Merlot       0.98      0.97      0.97      1003
                  Nebbiolo       0.99      0.99      0.99       917
              Pinot Grigio       1.00      0.98      0.99       332
                Pinot Gris       0.99      0.99

In [13]:
df2= pd.read_csv(r"C:\Users\Irish Mehta\Desktop\Knight FinTech\Knight ML Assignment\Data\test.csv")
df2.head()

Unnamed: 0,user_name,country,review_title,review_description,designation,points,price,province,region_1,region_2,winery
0,@paulgwine,US,Boedecker Cellars 2011 Athena Pinot Noir (Will...,Nicely differentiated from the companion Stewa...,Athena,88,35.0,Oregon,Willamette Valley,Willamette Valley,Boedecker Cellars
1,@wineschach,Argentina,Mendoza Vineyards 2012 Gran Reserva by Richard...,"Charred, smoky, herbal aromas of blackberry tr...",Gran Reserva by Richard Bonvin,90,60.0,Mendoza Province,Mendoza,,Mendoza Vineyards
2,@vboone,US,Prime 2013 Chardonnay (Coombsville),"Slightly sour and funky in earth, this is a re...",,87,38.0,California,Coombsville,Napa,Prime
3,@wineschach,Argentina,Bodega Cuarto Dominio 2012 Chento Vineyard Sel...,"This concentrated, midnight-black Malbec deliv...",Chento Vineyard Selection,91,20.0,Mendoza Province,Mendoza,,Bodega Cuarto Dominio
4,@kerinokeefe,Italy,SassodiSole 2012 Brunello di Montalcino,"Earthy aromas suggesting grilled porcini, leat...",,90,49.0,Tuscany,Brunello di Montalcino,,SassodiSole


In [14]:

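# Create the transform with the same settings used for training, so that
# tokens hash to the same columns the model was trained on
vectorizer = HashingVectorizer(n_features=500, stop_words='english')
# Vectorize the review text and the review title of the test set
review = vectorizer.fit_transform(df2['review_description'])
review2 = vectorizer.fit_transform(df2['review_title'])

# Stack in the same order as the training features: description, then title
fin = hstack((review, review2))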

In [15]:
# Predict the variety for each test review and store it as a new column
svm_predictions = svm_model_linear.predict(fin)
df2['variety'] = svm_predictions
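The training and test features have to be built in exactly the same way, which the notebook does by repeating the vectorizer code. One alternative, sketched here and not the approach used above, is to bundle the two vectorizers and the classifier in a scikit-learn `Pipeline` with a `ColumnTransformer`. The tiny `train` and `unseen` frames are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented rows, for illustration only
train = pd.DataFrame({
    "review_description": [
        "Crisp apple and pear with bright citrus",
        "Buttery oak, apple and vanilla",
        "Dark cherry, leather and firm tannins",
        "Blackberry, tobacco and grippy tannins",
    ],
    "review_title": [
        "Winery A 2012 Chardonnay",
        "Winery B 2014 Chardonnay",
        "Winery C 2010 Malbec",
        "Winery D 2011 Malbec",
    ],
    "variety": ["Chardonnay", "Chardonnay", "Malbec", "Malbec"],
})
unseen = pd.DataFrame({
    "review_description": ["Fresh pear and citrus", "Leather and dark cherry"],
    "review_title": ["Winery E 2015 White", "Winery F 2013 Red"],
})

# One hashing vectorizer per text column; outputs are concatenated column-wise
features = ColumnTransformer([
    ("description", HashingVectorizer(n_features=500, stop_words="english"), "review_description"),
    ("title", HashingVectorizer(n_features=500, stop_words="english"), "review_title"),
])

# Feature extraction and classifier travel together as one estimator
model = Pipeline([("features", features), ("svm", SVC(kernel="linear", C=1))])
model.fit(train[["review_description", "review_title"]], train["variety"])

predictions = model.predict(unseen)
print(predictions)
```

With this layout, `model.predict` applies the same feature construction to any new frame, so the column order and vectorizer settings cannot drift between training and testing.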

In [16]:
df2.groupby('variety').count()

Unnamed: 0_level_0,user_name,country,review_title,review_description,designation,points,price,province,region_1,region_2,winery
variety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Bordeaux-style Red Blend,1288,1415,1415,1415,737,1415,1100,1415,1383,347,1415
Bordeaux-style White Blend,223,225,225,225,83,225,147,225,224,1,225
Cabernet Franc,175,265,265,265,161,265,256,265,246,166,265
Cabernet Sauvignon,1236,1923,1923,1923,1171,1923,1904,1923,1696,1446,1923
Champagne Blend,229,251,251,251,250,251,226,251,249,13,251
Chardonnay,1702,2348,2348,2348,1570,2348,2221,2348,2142,1315,2348
Gamay,206,206,206,206,145,206,167,206,204,0,206
Gewürztraminer,129,173,173,173,124,173,166,173,163,56,173
Grüner Veltliner,281,290,290,290,250,290,249,290,23,23,290
Malbec,531,550,550,550,425,550,533,550,514,90,550
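The numbers differ from column to column because `groupby(...).count()` counts non-null values per column, not rows. A small sketch on an invented frame shows the difference and one way to get a single row count per predicted variety with `value_counts()`.

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "variety": ["Malbec", "Malbec", "Gamay"],
    "region_2": ["Napa", None, None],
})

# count() skips missing values, so region_2 is lower than the row count
non_null_counts = toy.groupby("variety").count()

# value_counts() returns the number of rows per variety
rows_per_variety = toy["variety"].value_counts()

print(non_null_counts)
print(rows_per_variety)
```

In the table above, columns with no missing values, such as `review_title`, therefore give the number of test reviews assigned to each variety.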


In [17]:
df2.to_csv('submissions.csv')