# Types of descriptive statistics
* 2024/05/09
* 2024/05/16

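## Frequency analysis
* Categorical data, single variable

## Descriptive analysis
* Continuous data, single variable

## Cross-tabulation analysis
* Categorical data, multiple variables
* Contingency table

## Group analysis
* Split the data into groups and analyze each group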
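As a rough sketch, the four analyses above map onto pandas calls as follows. The tiny `toy` table and its column values are made up for illustration and are not the used-car data analyzed later.

```python
import pandas as pd

# Hypothetical toy data (not the notebook's dataset)
toy = pd.DataFrame({
    "fuel":  ["gas", "gas", "diesel", "diesel", "gas"],
    "maker": ["A", "B", "A", "A", "B"],
    "km":    [10.0, 20.0, 30.0, 50.0, 40.0],
})

# Frequency analysis: one categorical variable
freq = toy["fuel"].value_counts()

# Descriptive analysis: one continuous variable
desc = toy["km"].describe()

# Cross-tabulation: two categorical variables -> contingency table
cross = pd.crosstab(toy["maker"], toy["fuel"])

# Group analysis: summarize a continuous variable within each group
by_group = toy.groupby("fuel")["km"].mean()

print(freq, desc, cross, by_group, sep="\n\n")
```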

# Exploratory data analysis
* The process of exploring the data (EDA)
* Graphical tools are commonly used for EDA

# Code - categorical

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

In [3]:
# The file is encoded in cp949, so the encoding must be given explicitly
car = pd.read_csv("../data/usedcars.csv", encoding='cp949')
car.head()

Unnamed: 0.1,Unnamed: 0,title,year,fuel,km,price,maker
0,1,현대 제네시스 BH330 럭셔리 프라임팩,08/09(09년형),가솔린,260000.0,690,현대
1,2,제네시스 더 올 뉴 G80 3.5 T-GDi AWD,20/06(21년형),가솔린,10000.0,렌트,제네시스
2,3,기아 K7 프리미어 3.0 GDI 시그니처,19/07(20년형),가솔린,20000.0,3350,기아
3,4,기아 더 뉴 K7 3.0 GDI 프레스티지,15/01,가솔린,90000.0,1990,기아
4,5,현대 갤로퍼2 숏바디 이노베이션 밴 인터쿨러 엑시드,02/10,디젤,160000.0,550,현대
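The `encoding='cp949'` argument matters because `pd.read_csv` assumes UTF-8 unless told otherwise, and Korean text saved as cp949 is not valid UTF-8. A minimal sketch of this, using a throwaway file with made-up contents rather than `usedcars.csv`:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row CSV, encoded as cp949 bytes
raw = "maker,fuel\n현대,가솔린\n기아,디젤\n".encode("cp949")

# The same bytes cannot be decoded as UTF-8
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Writing the bytes to disk and reading them back with the matching encoding
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_cp949.csv")
    with open(path, "wb") as f:
        f.write(raw)
    sample = pd.read_csv(path, encoding="cp949")

print(decodes_as_utf8)
print(sample)
```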


## Inspecting the data

In [4]:
car.tail()

Unnamed: 0.1,Unnamed: 0,title,year,fuel,km,price,maker
65,66,기아 레이 1.0 터보 프레스티지,14/08(15년형),가솔린,60000.0,1050,기아
66,67,르노삼성 SM7 뉴 아트 프레스티지,09/10(10년형),가솔린,150000.0,590,르노삼성
67,68,현대 뉴 카운티 장축 SUP 29인승,18/03,디젤,50000.0,5580,현대
68,69,기아 뉴 오피러스 GH270 고급형,07/03,가솔린,100000.0,320,기아
69,70,기아 더 프레스티지 K7 3.0 LPi 프레스티지,12/06,LPG,160000.0,870,기아


In [5]:
car.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  70 non-null     int64  
 1   title       70 non-null     object 
 2   year        70 non-null     object 
 3   fuel        70 non-null     object 
 4   km          70 non-null     float64
 5   price       64 non-null     object 
 6   maker       70 non-null     object 
dtypes: float64(1), int64(1), object(5)
memory usage: 4.0+ KB


In [6]:
car.shape

(70, 7)

## Checking for missing values

In [7]:
car.isnull()

Unnamed: 0.1,Unnamed: 0,title,year,fuel,km,price,maker
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
65,False,False,False,False,False,False,False
66,False,False,False,False,False,False,False
67,False,False,False,False,False,False,False
68,False,False,False,False,False,False,False


In [8]:
# Count missing values per variable
car.isnull().sum()

Unnamed: 0    0
title         0
year          0
fuel          0
km            0
price         6
maker         0
dtype: int64

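## Handling missing values

|Function|Description|
|:--|:---:|
|`df.isnull().sum()`|Count missing values per column|
|`df.dropna()`|Drop every row that contains a missing value|
|`df.fillna(method='ffill')` (or `df.ffill()`)|Replace a missing value with the value from the row directly above|
|`df.fillna(method='bfill')` (or `df.bfill()`)|Replace a missing value with the value from the row directly below|
|`df.fillna(df.mean())`|Replace missing values with the mean of each column|
|`df.dropna(subset=['b', 'c'])`|Drop only the rows with a missing value in column `b` or `c`|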
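The calls in the table can be tried on a small frame. The sketch below uses a made-up `df` with illustrative column names `a`, `b`, `c`. It uses `ffill()` and `bfill()`, which do the same job as `fillna(method=...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps (column names are illustrative)
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [100.0, 200.0, 300.0, 400.0],
})

n_missing = df.isnull().sum()         # missing count per column
dropped = df.dropna()                 # keep only fully observed rows
forward = df.ffill()                  # copy the previous row's value down
backward = df.bfill()                 # copy the next row's value up
mean_filled = df.fillna(df.mean())    # per-column mean replaces each gap
dropped_b = df.dropna(subset=["b"])   # only gaps in column 'b' drop a row

print(n_missing)
print(mean_filled)
```

All columns here are numeric. On a frame that also holds text columns, such as `car`, `df.mean(numeric_only=True)` is likely the safer way to compute the fill values.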

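## Checking categories
1. `df['col'].unique()` : check the levels of a categorical variable (`unique()` is a Series method, so select a column first)
    * After checking, also consider reducing the number of levels!
2. Frequency table
    * `pd.crosstab` : frequency distribution table and cross-tabulation (contingency table)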
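One way to reduce the number of levels is to lump rare categories into a single level. The sketch below is only one possible approach. The data, the threshold of 2, and the label `"Other"` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical maker column with a few rare levels
maker = pd.Series(["A", "A", "A", "B", "B", "C", "D"], name="maker")

counts = maker.value_counts()

# Assumption for this sketch: a level seen fewer than 2 times is "rare"
rare_levels = counts[counts < 2].index

# Keep frequent levels, replace rare ones with a single "Other" level
maker_reduced = maker.where(~maker.isin(rare_levels), "Other")

print(maker.unique())
print(maker_reduced.value_counts())
```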

In [9]:
car['maker'].unique()

array(['현대', '제네시스', '기아', '쉐보레', '오딧', '아리아워크스루밴', 'GM대우', '쎄미시스코',
       '르노삼성', '케이씨'], dtype=object)

In [10]:
car_tab = car['maker'].value_counts()
car_tab

현대          27
기아          21
제네시스         9
GM대우         4
쉐보레          3
르노삼성         2
오딧           1
아리아워크스루밴     1
쎄미시스코        1
케이씨          1
Name: maker, dtype: int64

In [11]:
car_tab = pd.crosstab(index = car['maker'], columns = "count")
print(car_tab)

col_0     count
maker          
GM대우          4
기아           21
르노삼성          2
쉐보레           3
쎄미시스코         1
아리아워크스루밴      1
오딧            1
제네시스          9
케이씨           1
현대           27


In [12]:
count_table = pd.crosstab(car['maker'], car['fuel'])
count_table

fuel,LPG,가솔린,디젤,전기
maker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GM대우,0,1,3,0
기아,2,11,8,0
르노삼성,0,1,1,0
쉐보레,0,2,1,0
쎄미시스코,0,0,0,1
아리아워크스루밴,0,0,1,0
오딧,0,0,1,0
제네시스,0,9,0,0
케이씨,0,0,1,0
현대,0,18,9,0


In [13]:
perc_table = pd.crosstab(car['maker'], car['fuel'], margins=True, margins_name='All')
perc_table

fuel,LPG,가솔린,디젤,전기,All
maker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GM대우,0,1,3,0,4
기아,2,11,8,0,21
르노삼성,0,1,1,0,2
쉐보레,0,2,1,0,3
쎄미시스코,0,0,0,1,1
아리아워크스루밴,0,0,1,0,1
오딧,0,0,1,0,1
제네시스,0,9,0,0,9
케이씨,0,0,1,0,1
현대,0,18,9,0,27
All,2,42,25,1,70


In [14]:
# Normalize: each cell becomes a proportion of the grand total
perc_table = pd.crosstab(car['maker'], car['fuel'], normalize='all')
perc_table

fuel,LPG,가솔린,디젤,전기
maker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GM대우,0.0,0.014286,0.042857,0.0
기아,0.028571,0.157143,0.114286,0.0
르노삼성,0.0,0.014286,0.014286,0.0
쉐보레,0.0,0.028571,0.014286,0.0
쎄미시스코,0.0,0.0,0.0,0.014286
아리아워크스루밴,0.0,0.0,0.014286,0.0
오딧,0.0,0.0,0.014286,0.0
제네시스,0.0,0.128571,0.0,0.0
케이씨,0.0,0.0,0.014286,0.0
현대,0.0,0.257143,0.128571,0.0


In [15]:
# Each row sums to 1
perc_table = pd.crosstab(car['maker'], car['fuel'], normalize='index')
perc_table

fuel,LPG,가솔린,디젤,전기
maker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GM대우,0.0,0.25,0.75,0.0
기아,0.095238,0.52381,0.380952,0.0
르노삼성,0.0,0.5,0.5,0.0
쉐보레,0.0,0.666667,0.333333,0.0
쎄미시스코,0.0,0.0,0.0,1.0
아리아워크스루밴,0.0,0.0,1.0,0.0
오딧,0.0,0.0,1.0,0.0
제네시스,0.0,1.0,0.0,0.0
케이씨,0.0,0.0,1.0,0.0
현대,0.0,0.666667,0.333333,0.0
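The `normalize` option of `pd.crosstab` divides the counts by a total: the row total for `'index'`, the column total for `'columns'`, and the grand total for `'all'`. A minimal sketch on made-up data that reproduces the row-wise version by hand:

```python
import pandas as pd

# Hypothetical toy data (not the used-car table)
toy = pd.DataFrame({
    "maker": ["A", "A", "A", "B", "B"],
    "fuel":  ["gas", "gas", "diesel", "diesel", "diesel"],
})

counts = pd.crosstab(toy["maker"], toy["fuel"])

# Row proportions by hand: divide each row by its row total
by_hand = counts.div(counts.sum(axis=1), axis=0)

# The same idea through the normalize option
by_row = pd.crosstab(toy["maker"], toy["fuel"], normalize="index")
by_col = pd.crosstab(toy["maker"], toy["fuel"], normalize="columns")
by_all = pd.crosstab(toy["maker"], toy["fuel"], normalize="all")

print(by_row.sum(axis=1))        # every row sums to 1
print(by_col.sum(axis=0))        # every column sums to 1
print(by_all.to_numpy().sum())   # the whole table sums to 1
```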


In [16]:
# Each column sums to 1
perc_table = pd.crosstab(car['maker'], car['fuel'], normalize='columns')
perc_table

fuel,LPG,가솔린,디젤,전기
maker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GM대우,0.0,0.02381,0.12,0.0
기아,1.0,0.261905,0.32,0.0
르노삼성,0.0,0.02381,0.04,0.0
쉐보레,0.0,0.047619,0.04,0.0
쎄미시스코,0.0,0.0,0.0,1.0
아리아워크스루밴,0.0,0.0,0.04,0.0
오딧,0.0,0.0,0.04,0.0
제네시스,0.0,0.214286,0.0,0.0
케이씨,0.0,0.0,0.04,0.0
현대,0.0,0.428571,0.36,0.0


In [17]:
perc_table = pd.crosstab(car['maker'], car['fuel'], normalize='columns', margins=True)
perc_table

fuel,LPG,가솔린,디젤,전기,All
maker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GM대우,0.0,0.02381,0.12,0.0,0.057143
기아,1.0,0.261905,0.32,0.0,0.3
르노삼성,0.0,0.02381,0.04,0.0,0.028571
쉐보레,0.0,0.047619,0.04,0.0,0.042857
쎄미시스코,0.0,0.0,0.0,1.0,0.014286
아리아워크스루밴,0.0,0.0,0.04,0.0,0.014286
오딧,0.0,0.0,0.04,0.0,0.014286
제네시스,0.0,0.214286,0.0,0.0,0.128571
케이씨,0.0,0.0,0.04,0.0,0.014286
현대,0.0,0.428571,0.36,0.0,0.385714


In [18]:
# Convert to percentages
perc_table = pd.crosstab(car['maker'], car['fuel'], normalize='index') * 100
perc_table

fuel,LPG,가솔린,디젤,전기
maker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GM대우,0.0,25.0,75.0,0.0
기아,9.52381,52.380952,38.095238,0.0
르노삼성,0.0,50.0,50.0,0.0
쉐보레,0.0,66.666667,33.333333,0.0
쎄미시스코,0.0,0.0,0.0,100.0
아리아워크스루밴,0.0,0.0,100.0,0.0
오딧,0.0,0.0,100.0,0.0
제네시스,0.0,100.0,0.0,0.0
케이씨,0.0,0.0,100.0,0.0
현대,0.0,66.666667,33.333333,0.0


In [19]:
perc_table['LPG']

maker
GM대우        0.00000
기아          9.52381
르노삼성        0.00000
쉐보레         0.00000
쎄미시스코       0.00000
아리아워크스루밴    0.00000
오딧          0.00000
제네시스        0.00000
케이씨         0.00000
현대          0.00000
Name: LPG, dtype: float64

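# Code - numeric

* Summary measures (1), central tendency: mean, median, mode
    * Weakness of the mean: it is sensitive to outliers.
    * In that case, use the median instead.
    * Median: obtained as a quantile (the 0.5 quantile).
    * Mode: used mostly for qualitative (categorical) data.
* Summary measures (2), dispersion: variance, range, coefficient of variation
    * Variance and standard deviation: depend on the unit and scale of the data
    * Range => interquartile range: IQR = Q3 - Q1
    * Coefficient of variation (standard deviation / mean) => unitless.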
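The points in this list can be checked on a few numbers. The values below are made up, with one deliberate outlier. The sample standard deviation (`ddof=1`) is an assumption of this sketch, not something the notes specify.

```python
import numpy as np

# Hypothetical mileage values; the last one is a deliberate outlier
km = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])

mean = km.mean()                  # pulled far up by the outlier
median = np.quantile(km, 0.5)     # the median is the 0.5 quantile

q1, q3 = np.quantile(km, [0.25, 0.75])
iqr = q3 - q1                     # interquartile range

std = km.std(ddof=1)              # same unit as the data
cv = std / mean                   # coefficient of variation, unitless

# Rescaling the unit (km -> m) changes std but leaves cv unchanged
km_in_m = km * 1000
cv_m = km_in_m.std(ddof=1) / km_in_m.mean()

print(mean, median, iqr)
print(cv, cv_m)
```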

In [27]:
from scipy import stats

## Descriptive analysis (1)

In [20]:
np.average(car['km'])

86267.71428571429

In [21]:
np.median(car['km'])

80000.0

In [23]:
np.max(car['km'])

290000.0

In [24]:
np.min(car['km'])

13.0

In [28]:
stats.mode(car['km'])

ModeResult(mode=array([10000.]), count=array([6]))

In [29]:
stats.trim_mean(car['km'], 0.1)

79642.85714285714
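`stats.trim_mean(x, 0.1)` drops the lowest and highest 10% of the observations before averaging, which makes it less sensitive to outliers than the plain mean. The helper below is a by-hand sketch of that idea in NumPy. `trimmed_mean` is a hypothetical name and the data are made up.

```python
import numpy as np

def trimmed_mean(values, proportion):
    """Mean after dropping int(proportion * n) values from each end."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(x))      # number of values cut per side
    return x[k:len(x) - k].mean()

# Hypothetical data with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

plain = np.mean(data)                 # dominated by the value 1000
trimmed = trimmed_mean(data, 0.1)     # drops 1 and 1000, averages 2..9

print(plain, trimmed)
```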

In [30]:
# Avoid the name `range`, which would shadow Python's built-in range()
km_range = np.max(car['km']) - np.min(car['km'])
km_range

289987.0

In [34]:
np.var(car['km'])

4708315353.032656

In [33]:
np.std(car['km'])

68617.16514861755
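Note that `np.var` and `np.std` divide by n by default (`ddof=0`), while pandas `.var()` and `.std()` divide by n - 1 (`ddof=1`). This is why `np.std` gives 68617.17 here while `describe()` reports a `std` of 69112.60 for the same column. A small sketch of the difference on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values, not the notebook's km column
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
values = x.to_numpy()

pop_var = np.var(values)             # ddof=0: divide by n
sample_var = np.var(values, ddof=1)  # ddof=1: divide by n - 1
pandas_var = x.var()                 # pandas defaults to ddof=1

n = len(values)
print(pop_var, sample_var, pandas_var)
# The two conventions differ only by the factor n / (n - 1)
print(np.isclose(sample_var, pop_var * n / (n - 1)))
```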

In [35]:
q1 = car['km'].quantile(0.25)
q3 = car['km'].quantile(0.75)
IQR = q3 - q1
print(f"IQR = {IQR}")

IQR = 97500.0


## Descriptive analysis (2)
* All at once: `DataFrame.describe()` / `Series.describe()` in pandas
* Describing by group: combine `groupby()` with `describe()`.
* Using scipy.stats: `stats.describe()`

In [37]:
car['km'].describe()

count        70.000000
mean      86267.714286
std       69112.602378
min          13.000000
25%       30000.000000
50%       80000.000000
75%      127500.000000
max      290000.000000
Name: km, dtype: float64

In [39]:
car.groupby('fuel')['km'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
fuel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
LPG,2.0,135000.0,35355.339059,110000.0,122500.0,135000.0,147500.0,160000.0
가솔린,42.0,85669.690476,69049.21983,22.0,22500.0,80000.0,120000.0,280000.0
디젤,25.0,86824.0,70899.043717,600.0,30000.0,70000.0,150000.0,290000.0
전기,1.0,13.0,,13.0,13.0,13.0,13.0,13.0
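`describe()` returns a fixed set of statistics. When a per-group IQR or coefficient of variation is wanted, `groupby()` can be combined with `agg()` and small helper functions. This is a sketch on made-up data, and the helpers `iqr` and `cv` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data (not the used-car table)
toy = pd.DataFrame({
    "fuel": ["gas", "gas", "gas", "diesel", "diesel", "diesel"],
    "km":   [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

def iqr(s):
    """Interquartile range Q3 - Q1 of a Series."""
    return s.quantile(0.75) - s.quantile(0.25)

def cv(s):
    """Coefficient of variation: sample std divided by the mean."""
    return s.std() / s.mean()

# One row per fuel type, one column per statistic
summary = toy.groupby("fuel")["km"].agg(["mean", "median", iqr, cv])
print(summary)
```

In this toy table the two groups differ by a factor of ten in scale, so their means and IQRs differ, but the coefficient of variation is the same for both.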


In [52]:
stats.describe(car['km'])

DescribeResult(nobs=70, minmax=(13.0, 290000.0), mean=86267.71428571429, variance=4776551807.424431, skewness=0.8357837246716393, kurtosis=0.4606352248440335)