# Building a Machine Learning Model to Predict Wine Quality - 1

# Exploring the Wine Quality Data

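### Wine Quality Data Overview

- Observations: 6,497 (red wine: 1,599; white wine: 4,898)

- Input variables: 12 (eleven physicochemical properties of the wine: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol; plus the wine type, red or white)

- Output variable: 1 (wine quality score, from 0 for the lowest quality to 10 for the highest; the scores that actually occur in this data run from 3 to 9)

- Source: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality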

In [2]:
import pandas as pd

In [16]:
URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"

In [17]:
redwine = pd.read_csv(URL + "winequality-red.csv", sep=";", header=0)

In [18]:
redwine["type"] = "red"
redwine.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |


In [19]:
whitewine = pd.read_csv(URL + "winequality-white.csv", sep=";", header=0)

In [20]:
whitewine["type"] = "white"
whitewine.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 | white |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 | white |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 | white |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | white |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | white |


In [21]:
# Combine the red and white wine DataFrames by stacking their rows
wine = pd.concat([redwine, whitewine])
wine.shape

(6497, 13)
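Stacking two frames this way keeps each frame's own row labels, so the combined index contains repeated values. The following is a minimal sketch of that behavior, using small made-up frames (the names `red`, `white` and all values are illustrative, not taken from the wine files) and assuming the standard `ignore_index` option of `pd.concat`:

```python
import pandas as pd

# Toy stand-ins for the red and white frames (values are illustrative only).
red = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
red["type"] = "red"
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})
white["type"] = "white"

# Default behavior: each frame keeps its own row labels, so 0 and 1 repeat.
stacked = pd.concat([red, white])

# ignore_index=True relabels the rows 0..n-1 instead.
relabeled = pd.concat([red, white], ignore_index=True)

print(stacked.index.tolist())    # [0, 1, 0, 1, 2]
print(relabeled.index.tolist())  # [0, 1, 2, 3, 4]
print(relabeled.shape)           # (5, 3)
```

Whether to relabel is a design choice: repeated labels are harmless for the summaries in this section, but label-based lookups such as `.loc[0]` would return one row per source frame.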

In [23]:
# Replace spaces in the column names with underscores (_)
wine.columns = wine.columns.str.replace(" ", "_")
wine.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |


In [24]:
# View summary (descriptive) statistics for the continuous variables
wine.describe()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 |


- The `describe` method of a DataFrame prints summary statistics for its numeric variables

- A statistic is a value measured on a variable over a sample
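The `type` column is missing from the summary because `describe` covers only numeric columns by default. The sketch below shows this on a tiny made-up frame (the name `toy` and its values are illustrative), and also checks that the `std` row is the sample standard deviation, which divides by n - 1:

```python
import pandas as pd

# Toy data, made up for illustration; not rows from the wine files.
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "type": ["red", "red", "white", "white", "white"],
})

summary = toy.describe()

# Only the numeric column appears; the text column "type" is skipped by default.
print(summary.columns.tolist())

# The "std" row is the sample standard deviation (ddof=1, i.e. divide by n - 1).
sample_std = toy["quality"].std(ddof=1)
print(summary.loc["std", "quality"], sample_std)
```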


In [27]:
wine.quality.describe()

count    6497.000000
mean        5.818378
std         0.873255
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         9.000000
Name: quality, dtype: float64

- The `describe` method of a Series prints summary statistics for a single numeric variable

- Individual statistics can be read off separately
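One way to read off a single statistic is to index the result of `describe` by its row label, since the result is itself a Series. A small sketch under that assumption, using made-up values (the Series `quality` here is a toy, not the wine column):

```python
import pandas as pd

# Toy values, made up for illustration.
quality = pd.Series([5, 5, 6, 6, 7], name="quality")

stats = quality.describe()

# The result is a Series whose labels are the statistic names.
print(stats.index.tolist())

# Pick one statistic by label...
mean_from_describe = stats["mean"]
median_from_describe = stats["50%"]

# ...or compute it directly with the matching Series method.
print(mean_from_describe, quality.mean())
print(median_from_describe, quality.median())
print(quality.quantile(0.75))
```

Calling the dedicated method (`mean`, `median`, `quantile`) avoids computing the whole summary when only one number is needed.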

In [28]:
sorted(wine.quality.unique())

[3, 4, 5, 6, 7, 8, 9]

- Uses the `unique` method of a Series to list the distinct quality scores
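The call is wrapped in `sorted` because `unique` returns values in order of first appearance, not in sorted order. A minimal sketch with made-up scores (the Series `scores` is hypothetical):

```python
import pandas as pd

# Toy scores, made up for illustration.
scores = pd.Series([5, 6, 5, 7, 3, 6])

# unique() keeps the order in which values first appear.
first_seen = scores.unique().tolist()

# Sorting afterwards gives ascending order; tolist() yields plain Python ints.
ascending = sorted(first_seen)

# nunique() returns just the number of distinct values.
n_distinct = scores.nunique()

print(first_seen)   # [5, 6, 7, 3]
print(ascending)    # [3, 5, 6, 7]
print(n_distinct)   # 4
```

Converting with `tolist()` is optional; without it, the list elements are NumPy integers, which newer NumPy versions may print with their type name.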

In [29]:
wine.quality.value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

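- Check the unique values of the quality variable in ascending order, then count how often each one occurs (`value_counts` lists the scores from most to least frequent)

- Shows the overall frequency distribution across scores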
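To read the distribution in score order, or as proportions, the frequency table can be reordered and normalized. The sketch below rebuilds the table from the counts printed in the `value_counts` output (the variable names are illustrative) and checks that it reproduces the observation count and the mean quality reported by `describe`:

```python
import pandas as pd

# Counts copied from the value_counts() output shown earlier in this section.
counts = pd.Series(
    {6: 2836, 5: 2138, 7: 1079, 4: 216, 8: 193, 3: 30, 9: 5},
    name="quality",
)

# Order by score instead of by frequency.
by_score = counts.sort_index()

# Share of wines at each score, rounded for display.
share = (by_score / by_score.sum()).round(3)

# The frequency table alone is enough to recover the mean quality.
total = int(by_score.sum())
weighted_mean = (by_score.index.to_numpy() * by_score.to_numpy()).sum() / total

print(by_score.index.tolist())
print(total)
print(share)
print(round(weighted_mean, 6))
```

On the full DataFrame, the equivalent calls would be `wine.quality.value_counts().sort_index()` and `wine.quality.value_counts(normalize=True)`.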

<br/>

### Reference

https://kbig.kr/portal//kbig/datacube/onl_edu_class/python?bltnNo=11583395976711