# Wine Quality Survey
<hr>

### About the Dataset 
<hr>
Wine Quality is a public dataset available for research in the University of California, Irvine (UCI) Machine Learning Repository. It was created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) in 2009. The repository holds two datasets, one of red and one of white wine samples. Each data point records the results of objective tests (e.g. pH values) as inputs, and the output is a quality score based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Source: <a href="https://archive.ics.uci.edu/ml/datasets/wine+quality">https://archive.ics.uci.edu/ml/datasets/wine+quality</a>
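
As a small illustration of how a quality score is derived from several expert evaluations, the sketch below takes the median of three grades. The grades are hypothetical values chosen for the example, not taken from the dataset.

```python
from statistics import median

# Hypothetical grades from three experts for one wine (illustrative only)
expert_grades = [5, 6, 7]

# The recorded quality is the median of the individual grades
quality = median(expert_grades)
print(quality)
```

The median is less sensitive than the mean to a single expert who grades unusually high or low.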

#### Attribute Information

Input variables (based on physicochemical tests):

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable (based on sensory data):

12. quality (score between 0 and 10)


In [1]:
# import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data Cleaning

In [2]:
# import both the red and white wine data sets

data_red = pd.read_csv("winequality-red.csv", sep=";")
data_white = pd.read_csv("winequality-white.csv", sep=";")

In [3]:
data_red.head(5)

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur-dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
data_white.head(5)

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


* Since the two data sets share the same structure, we can join them to ease our job. To keep track of where each sample came from, we add 
    an extra column, color, which specifies the kind of wine: red / white.
    
* The total sulfur dioxide column is named differently in the two data sets. Therefore we will rename the red wine column from __total_sulfur-dioxide__ to __total_sulfur_dioxide__.
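
A mismatch in column names like the one described above can also be found programmatically instead of by eye. The following is a minimal sketch using two toy frames (`toy_red` and `toy_white` are hypothetical stand-ins with placeholder values, not the real data) and plain set differences on the column labels.

```python
import pandas as pd

# Toy frames standing in for the red and white data sets (placeholder values)
toy_red = pd.DataFrame({"total_sulfur-dioxide": [34.0], "pH": [3.51]})
toy_white = pd.DataFrame({"total_sulfur_dioxide": [170.0], "pH": [3.0]})

# Column labels that appear in one frame but not in the other
only_in_red = set(toy_red.columns) - set(toy_white.columns)
only_in_white = set(toy_white.columns) - set(toy_red.columns)

print(only_in_red)
print(only_in_white)
```

If both sets are empty, the two frames have identical column labels and can be stacked without producing extra, partly empty columns.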

In [5]:
# rename column 

data_red.rename(columns={'total_sulfur-dioxide':'total_sulfur_dioxide'}, inplace=True)

In [7]:
# create a new column "color" with wine type : red / white

data_red['color'] = np.repeat('red',data_red.shape[0])
data_white['color'] = np.repeat('white',data_white.shape[0])

In [8]:
# combine both data sets

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data_wine = pd.concat([data_red, data_white])

In [11]:
# shape of new data
data_wine.shape

(6497, 13)

* After combining the data sets we have our new data set, which contains 6497 rows and 13 columns: the 12 original variables plus the new color column.
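
One way to sanity-check a combination like this is to confirm that the row count equals the sum of the two inputs and that each color label has the expected number of rows. The sketch below does this on two small toy frames (`toy_red`, `toy_white` and their values are hypothetical, used only to keep the example self-contained).

```python
import pandas as pd

# Toy frames standing in for the red and white data sets (placeholder values)
toy_red = pd.DataFrame({"pH": [3.51, 3.20], "quality": [5, 5]})
toy_white = pd.DataFrame({"pH": [3.00, 3.30, 3.26], "quality": [6, 6, 6]})

# Label each frame with its wine type, then stack them
toy_red["color"] = "red"
toy_white["color"] = "white"
toy_wine = pd.concat([toy_red, toy_white])

# The combined frame should have one row per input row
assert len(toy_wine) == len(toy_red) + len(toy_white)

# Number of rows per wine type
counts = toy_wine["color"].value_counts()
print(counts)
```

Note that stacking this way keeps each frame's original row labels, so index values repeat in the result; passing `ignore_index=True` to `pd.concat` would produce a fresh 0..n-1 index instead.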