# Data Cleaning

Data cleaning is a necessary and major step of any data science project. Since we have access to a well-prepared dataset from Kaggle, the "Data Gathering" step takes little effort: all we need to do is download it from https://www.kaggle.com/zynicide/wine-reviews and store it in our project folder.

That has already been done.

However, data cleaning involves more than obtaining a well-organized table. With numerical data we need to take care of null values, duplicates, outliers and so forth. With text data we need to pre-process the text to make it useful for the later steps of the project.

##### Common data cleaning steps on all text:

* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (e.g. newline characters, `\n`)
* Tokenize text
* Remove stop words

##### More data cleaning steps after tokenization:

* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
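
The common steps listed above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the cleaning pipeline used later in this project; the `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical names introduced here for the example.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real project would use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "are", "this", "from", "in", "with"}


def clean_text(text):
    """Lower-case, strip newlines, punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    # Replace newline characters with spaces
    text = text.replace("\n", " ")
    # Delete punctuation characters (note: "south-facing" becomes "southfacing")
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove tokens that contain digits, e.g. "100" or "20"
    text = re.sub(r"\w*\d\w*", "", text)
    # Tokenize on whitespace
    tokens = text.split()
    # Drop stop words
    return [token for token in tokens if token not in STOP_WORDS]


sample = "This tremendous 100% varietal wine\nhails from Oakville."
print(clean_text(sample))
# ['tremendous', 'varietal', 'wine', 'hails', 'oakville']
```

Deleting punctuation outright is the simplest choice; replacing it with spaces instead would keep hyphenated words as two separate tokens.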


## Dropping unneeded columns

Since our project aims to predict the wine grape variety from the review description (and only from that), there's no need to explore the dataset for other useful features. That could be the subject of another project. We want to focus on the review text and process it to find patterns that we can model and then use to predict the final category.

**First, let's check what kind of dataset we are dealing with and then drop unnecessary columns.**

In [2]:
# Let's prepare the working environment and load data
import pandas as pd

FILE_PATH = './data/winemag-data_first150k.csv'
raw_data = pd.read_csv(FILE_PATH, index_col=0)

raw_data.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


Well, that's a lot of useful data. Yet most of it is not needed this time.


We are going to use only 2 columns for our task. The predictor variable will be "description" and the target variable will be "variety".
We want to simulate a situation where only a text field and a target value field are available.

**Let's drop every column except "description" and "variety" and run a quick check of the dataset**.

In [3]:
data = raw_data[['description', 'variety']]
data.head()

Unnamed: 0,description,variety
0,This tremendous 100% varietal wine hails from ...,Cabernet Sauvignon
1,"Ripe aromas of fig, blackberry and cassis are ...",Tinta de Toro
2,Mac Watson honors the memory of a wine once ma...,Sauvignon Blanc
3,"This spent 20 months in 30% new French oak, an...",Pinot Noir
4,"This is the top wine from La Bégude, named aft...",Provence red blend
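
As an alternative to selecting the two columns after loading, `pd.read_csv` can be told to parse only the columns we need through its `usecols` parameter. The sketch below uses a made-up in-memory CSV (the `toy_csv` contents are purely illustrative) so that it runs without the Kaggle file.

```python
import io

import pandas as pd

# Toy CSV standing in for the Kaggle file; the values are made up for illustration
toy_csv = io.StringIO(
    ",country,description,points,variety\n"
    "0,US,Ripe and oaky,96,Cabernet Sauvignon\n"
    "1,Spain,Fig and cassis aromas,95,Tinta de Toro\n"
)

# usecols restricts parsing to the listed columns
toy_data = pd.read_csv(toy_csv, usecols=["description", "variety"])
print(toy_data.columns.tolist())
```

Loading fewer columns saves memory on large files, at the cost of not being able to glance at the other fields first.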


In [4]:
# let's check the data shape
data.shape

(150930, 2)

In [6]:
# check for duplicated descriptions (keep=False flags every copy)
desc_duplicates = data[data.duplicated(subset=['description'], keep=False)]

# sort by description so that duplicated values appear next to each other
desc_duplicates = desc_duplicates.sort_values(by='description')

desc_duplicates.head(30)

Unnamed: 0,description,variety
147725,$11. Opens with a highly perfumed bouquet of l...,Chardonnay
62345,$11. Opens with a highly perfumed bouquet of l...,Chardonnay
74993,). Very good wine from a winery increasingly k...,Cabernet Sauvignon
18803,). Very good wine from a winery increasingly k...,Cabernet Sauvignon
26530,". Christoph Neumeister's top wine, this is a c...",Sauvignon Blanc
84730,". Christoph Neumeister's top wine, this is a c...",Sauvignon Blanc
53110,". Christoph Neumeister's top wine, this is a c...",Sauvignon Blanc
107351,. From a small south-facing parcel next to the...,Chenin Blanc
65231,. From a small south-facing parcel next to the...,Chenin Blanc
43074,. Lemon zest and exotic spices enliven the nos...,Riesling


In [7]:
# duplicates table shape
desc_duplicates.shape

(92393, 2)

**There's a significant number of duplicated values in the description field. Let's drop duplicates and leave only distinct values.**

In [8]:
# dropping duplicates
data_no_duplicates = data.drop_duplicates(subset=['description'], keep='last')
data_no_duplicates

Unnamed: 0,description,variety
0,This tremendous 100% varietal wine hails from ...,Cabernet Sauvignon
1,"Ripe aromas of fig, blackberry and cassis are ...",Tinta de Toro
2,Mac Watson honors the memory of a wine once ma...,Sauvignon Blanc
3,"This spent 20 months in 30% new French oak, an...",Pinot Noir
4,"This is the top wine from La Bégude, named aft...",Provence red blend
...,...,...
150925,Many people feel Fiano represents southern Ita...,White Blend
150926,"Offers an intriguing nose with ginger, lime an...",Champagne Blend
150927,This classic example comes from a cru vineyard...,White Blend
150928,"A perfect salmon shade, with scents of peaches...",Champagne Blend


In [19]:
# Let's have a quick look at the numbers for duplicates

print(f'Original dataset number of values: {data.shape[0]}')
print(f'NO. of DROPPED DUPLICATED DESC. VALUES: {data.shape[0] - data_no_duplicates.shape[0]}')
print('-------------------------------------')
print(f'Distinct no. of review descriptions: {data_no_duplicates.shape[0]}')

Original dataset number of values: 150930
NO. of DROPPED DUPLICATED DESC. VALUES: 53109
-------------------------------------
Distinct no. of review descriptions: 97821
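
The duplicates table has 92393 rows, yet only 53109 rows were dropped. The two numbers measure different things: `duplicated(..., keep=False)` flags every member of a duplicated group, while `drop_duplicates` retains one row per group. A small sketch on made-up data (the `toy` frame is an assumption for illustration) shows the difference:

```python
import pandas as pd

# Toy frame: "a" appears three times, "b" twice, "c" once
toy = pd.DataFrame({
    "description": ["a", "a", "a", "b", "b", "c"],
    "variety": ["X", "X", "X", "Y", "Y", "Z"],
})

# keep=False marks every member of a duplicated group
flagged = toy.duplicated(subset=["description"], keep=False).sum()

# drop_duplicates keeps one row per group, so it removes fewer rows than were flagged
deduped = toy.drop_duplicates(subset=["description"], keep="last")
dropped = len(toy) - len(deduped)

print(flagged, dropped, len(deduped))
# 5 3 3
```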


In [17]:
# Let's check whether there are any missing values in the "description" or "variety" columns

print('shape', data_no_duplicates.shape)
print('----------------\n')
print('null values:')
print(data_no_duplicates.isna().sum())

shape (97821, 2)
----------------

null values:
description    0
variety        0
dtype: int64


**Looking good. There are no missing values in either the "description" or the "variety" column.**
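
Note that `isna()` only counts true missing values. With text data it can also be worth checking for empty or whitespace-only strings, which `isna()` does not report. Whether this dataset contains any is not shown here; the sketch below runs on a made-up `toy` frame.

```python
import pandas as pd

# Toy frame with one real missing value and two blank descriptions
toy = pd.DataFrame({
    "description": ["Ripe cherry and oak", "", "   ", None],
    "variety": ["Merlot", "Syrah", "Riesling", "Malbec"],
})

# isna() only catches the real missing value
n_null = toy["description"].isna().sum()

# Empty or whitespace-only strings need a separate check
n_blank = toy["description"].str.strip().eq("").sum()

print(n_null, n_blank)
# 1 2
```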

**Now, let's find the 15 most frequent grape varieties, which will be the target categories we focus on.**

In [27]:
# finding top 15 varieties
top_fifteen_varieties = data_no_duplicates['variety'].value_counts().head(15).to_frame()
top_fifteen_varieties

Unnamed: 0,variety
Pinot Noir,9283
Chardonnay,9159
Cabernet Sauvignon,8267
Red Blend,6484
Bordeaux-style Red Blend,5170
Sauvignon Blanc,4034
Syrah,3662
Riesling,3582
Merlot,3176
Zinfandel,2408


In [28]:
# Let's leave only top 15 varieties in the data frame
top_fifteen_varieties = top_fifteen_varieties.index.to_list()

data_top_15 = data_no_duplicates.loc[data_no_duplicates['variety'].isin(top_fifteen_varieties)]
data_top_15 = data_top_15.reset_index(drop=True)
data_top_15

Unnamed: 0,description,variety
0,This tremendous 100% varietal wine hails from ...,Cabernet Sauvignon
1,Mac Watson honors the memory of a wine once ma...,Sauvignon Blanc
2,"This spent 20 months in 30% new French oak, an...",Pinot Noir
3,This re-named vineyard was formerly bottled as...,Pinot Noir
4,The producer sources from two blocks of the vi...,Pinot Noir
...,...,...
64728,"This needs a good bit of breathing time, then ...",Pinot Noir
64729,The nose is dominated by the attractive scents...,Pinot Noir
64730,"Decades ago, Beringer’s then-winemaker Myron N...",White Blend
64731,Many people feel Fiano represents southern Ita...,White Blend
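
The counts above range from 9283 reviews for Pinot Noir down to 2408 for Zinfandel, so the remaining classes are not balanced. A quick way to inspect the relative class shares after filtering is `value_counts(normalize=True)`. The sketch below wraps the filtering step in a hypothetical `keep_top_n` helper and runs it on made-up data.

```python
import pandas as pd


def keep_top_n(df, column, n):
    """Return rows whose `column` value is among the n most frequent, with a fresh index."""
    top_values = df[column].value_counts().head(n).index
    return df[df[column].isin(top_values)].reset_index(drop=True)


# Toy frame: 3 x Pinot Noir, 2 x Chardonnay, 1 x Syrah, 1 x Merlot
toy = pd.DataFrame({
    "description": list("abcdefg"),
    "variety": ["Pinot Noir"] * 3 + ["Chardonnay"] * 2 + ["Syrah", "Merlot"],
})

toy_top_2 = keep_top_n(toy, "variety", 2)

# Relative share of each remaining class
print(toy_top_2["variety"].value_counts(normalize=True))
```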
