# Data Cleaning

Data cleaning is a necessary and major step of any data science project. Since we have access to a well-prepared dataset from Kaggle, we don't need to put any significant effort into the "Data Gathering" step. All we need to do is download it from https://www.kaggle.com/zynicide/wine-reviews and store it in our project folder.

That has already been done.

However, the data cleaning step involves more than having a well-organized table. When dealing with numerical data, we need to take care of null values, duplicates, outliers and so forth. With text data, we need to pre-process the text to make it useful for the further steps of the project.

##### Common data cleaning steps on all text:

* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words

##### More data cleaning steps after tokenization:

* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
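
The basic steps listed above can be sketched with the Python standard library alone. This is a minimal illustration and not the pipeline used later in the project: the function names, the tiny `STOP_WORDS` set and the sample sentence are all made up for this example, and a real project would more likely rely on a fuller stop-word list from a library such as NLTK or spaCy.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list used only for this sketch.
STOP_WORDS = {"a", "an", "and", "are", "from", "in", "is", "of", "the", "this"}


def clean_text(text):
    """Lowercase the text and strip newlines, ASCII punctuation and digits."""
    text = text.lower()
    text = text.replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    return text


def tokenize(text):
    """Split on whitespace; extra spaces left by the cleaning are ignored."""
    return text.split()


def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


# Invented sample sentence, not taken from the dataset.
sample = "This wine spent 20 months in oak,\nshowing ripe fig and blackberry."
tokens = remove_stop_words(tokenize(clean_text(sample)))
print(tokens)
print(bigrams(tokens)[:3])
```

Note that deleting punctuation outright also joins hyphenated words together, and `string.punctuation` only covers ASCII characters, so both choices may need revisiting for real review text.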


## Dropping unneeded columns

Since the goal of our project is to predict the wine grape variety based on a review description (and only on that), there's no need to explore the dataset looking for other nice and useful features. That could be the subject of another project. We want to focus on the reviews' text and process it to find patterns that we can model and then use to predict the final category.

**First, let's check what kind of dataset we are dealing with and then drop unnecessary columns.**

In [2]:
# Let's prepare the working environment and load data
import pandas as pd

FILE_PATH = './data/winemag-data_first150k.csv'
raw_data = pd.read_csv(FILE_PATH, index_col=0)

raw_data.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


Well, the dataset provides a lot of useful data. Yet, most of it is not needed this time.


We are going to use only 2 columns to perform our task. Our predictor variable will be "description" and our target variable will be "variety".
We want to simulate the situation where only a text field and a target value field are available.

**Let's drop all the columns except "description" and "variety".**

In [3]:
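# Keep only the predictor and target columns; .copy() gives an independent
# frame, which avoids SettingWithCopyWarning if it is modified later
data = raw_data[['description', 'variety']].copy()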
data.head()

Unnamed: 0,description,variety
0,This tremendous 100% varietal wine hails from ...,Cabernet Sauvignon
1,"Ripe aromas of fig, blackberry and cassis are ...",Tinta de Toro
2,Mac Watson honors the memory of a wine once ma...,Sauvignon Blanc
3,"This spent 20 months in 30% new French oak, an...",Pinot Noir
4,"This is the top wine from La Bégude, named aft...",Provence red blend
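
With only two columns left, it may be worth checking them for the null values and duplicates mentioned at the start of this section. The sketch below runs on a small made-up frame (the rows are invented for illustration and are not taken from the dataset), and the variable names `toy` and `cleaned` are hypothetical. Whether dropping such rows is the right choice for the real data is a judgment call that this example does not settle.

```python
import pandas as pd

# Invented rows that mimic the two-column layout of `data`.
toy = pd.DataFrame({
    "description": ["Ripe and fruity.", "Ripe and fruity.", None, "Crisp and dry."],
    "variety": ["Pinot Noir", "Pinot Noir", "Riesling", None],
})

# Count missing values per column and fully duplicated rows.
null_counts = toy.isnull().sum()
duplicate_count = toy.duplicated().sum()
print(null_counts)
print(duplicate_count)

# One possible clean-up: drop rows missing either field, then drop exact duplicates.
cleaned = (
    toy.dropna(subset=["description", "variety"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(cleaned)
```

The same calls could be applied to `data` directly, for example `data.isnull().sum()` and `data.duplicated().sum()`.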
