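### [Statistical Methods]
- Provided to analyze the state of the data column by column
    * Mean / median / standard deviation / minimum / maximum / mode
    * Relationship between columns: correlation coefficient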

[1] Module Loading and Data Preparation <hr>

In [1]:
## [1] Load modules
import pandas as pd

In [2]:
## [2] Prepare the data
DATA_FILE = '../Data/auto_mpg.csv'

mpgDF = pd.read_csv(DATA_FILE)

In [3]:
## [3] Check the loaded data
## Check the summary information
mpgDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


In [4]:
## Check whether the actual data matches the dtypes shown in the summary

mpgDF.head(3)

## -> horsepower: object ==> convert to int64 ==> then to category? [still undecided]
## -> cylinders : int64  ==> convert to category
## -> origin    : int64  ==> convert to category

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
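
The `horsepower` values look numeric in `head()`, yet `info()` reports the column as `object`, which usually means at least one entry could not be parsed as a number. A minimal sketch of one way to convert such a column follows. The toy Series and its `"?"` placeholder are assumptions for illustration, not values confirmed from this file.

```python
import pandas as pd

# Toy stand-in for the horsepower column; the "?" entry is a hypothetical
# placeholder for whatever non-numeric value forced the object dtype.
hp = pd.Series(["130", "165", "?", "150"], name="horsepower")

# errors="coerce" converts every unparseable entry to NaN instead of raising
hp_num = pd.to_numeric(hp, errors="coerce")

print(hp_num.dtype)         # NaN forces a floating-point dtype
print(hp_num.isna().sum())  # how many entries could not be parsed
print(hp_num.mean())        # NaN values are skipped by default
```

Inspecting the rows where `hp_num.isna()` is true before dropping or imputing them is one reasonable next step.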


In [5]:
## Check per-column distribution and basic statistics [default] numeric columns only
mpgDF.describe()

Unnamed: 0,mpg,cylinders,displacement,weight,acceleration,model year,origin
count,398.0,398.0,398.0,398.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,2970.424623,15.56809,76.01005,1.572864
std,7.815984,1.701004,104.269838,846.841774,2.757689,3.697627,0.802055
min,9.0,3.0,68.0,1613.0,8.0,70.0,1.0
25%,17.5,4.0,104.25,2223.75,13.825,73.0,1.0
50%,23.0,4.0,148.5,2803.5,15.5,76.0,1.0
75%,29.0,8.0,262.0,3608.0,17.175,79.0,2.0
max,46.6,8.0,455.0,5140.0,24.8,82.0,3.0


In [6]:
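## Categorical data
## - e.g. blood type (A, B, O, AB), sex (male, female), religion (Protestant, Catholic, Buddhist, cult, other), star rating
##                   (1, 2, 3, 4)         (1, 2)
## - Even when the values are stored as numbers, they do not carry numeric meaning!
## Data must not be computed on in a way that contradicts its meaning ==> assign the dtype based on meaning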
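
As a small sketch of the point above, the toy column below mimics `origin`. Its values are made up for illustration. Stored as integers it can be averaged even though the result means nothing, while the `category` dtype is summarized by frequency and rejects the mean.

```python
import pandas as pd

# Toy stand-in for the origin column (codes for categories, not quantities)
origin = pd.Series([1, 1, 2, 3, 1], name="origin")

# As int64, pandas averages the codes without complaint
print(origin.mean())

# As category, the column is summarized by counts instead
origin_cat = origin.astype("category")
desc = origin_cat.describe()      # count, unique, top, freq
print(desc)
print(origin_cat.value_counts())

# Arithmetic reductions are refused for a categorical column
try:
    origin_cat.mean()
except TypeError as exc:
    print(f"mean() rejected: {exc}")
```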

In [7]:
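## Check per-column distribution and basic statistics for numeric + text columns together [parameter include='all']
## -> Numeric data: count, mean, standard deviation, min, max, quartiles
## -> Text data   : count, number of unique values, mode (top), frequency of the mode (freq)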
mpgDF.describe(include='all')

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
count,398.0,398.0,398.0,398.0,398.0,398.0,398.0,398.0,398
unique,,,,94.0,,,,,305
top,,,,150.0,,,,,ford pinto
freq,,,,22.0,,,,,6
mean,23.514573,5.454774,193.425879,,2970.424623,15.56809,76.01005,1.572864,
std,7.815984,1.701004,104.269838,,846.841774,2.757689,3.697627,0.802055,
min,9.0,3.0,68.0,,1613.0,8.0,70.0,1.0,
25%,17.5,4.0,104.25,,2223.75,13.825,73.0,1.0,
50%,23.0,4.0,148.5,,2803.5,15.5,76.0,1.0,
75%,29.0,8.0,262.0,,3608.0,17.175,79.0,2.0,
max,46.6,8.0,455.0,,5140.0,24.8,82.0,3.0,
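
To make the split between numeric and text rows concrete, here is a minimal sketch on a toy frame. The car names are taken from the data above, but the `mpg` values are made up. It also shows that the `std` row is the standard deviation, whose square equals the variance from `var()`.

```python
import pandas as pd

# Toy frame: one numeric column and one text column
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0],
    "car name": ["ford pinto", "buick skylark 320", "ford pinto", "plymouth satellite"],
})

summary = df.describe(include="all")

# Numeric column: mean/std/quartile rows are filled, unique/top/freq are NaN
print(summary.loc[["mean", "std"], "mpg"])

# Text column: unique/top/freq rows are filled, mean/std are NaN
print(summary.loc[["unique", "top", "freq"], "car name"])

# describe() reports the standard deviation, not the variance
print(summary.loc["std", "mpg"] ** 2, df["mpg"].var())
```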


[3] Using the Statistical Methods <hr>

In [22]:
## -------------------------------------------------
## => Compute and print statistics per column
## -------------------------------------------------
## -> Per-column computation
for col in mpgDF.columns:
    print(f'[{col:12}] {mpgDF[col].dtype} --- {mpgDF[col].dtype != object}')

    ## Skip object (text) columns; a single Series needs no numeric_only argument
    if mpgDF[col].dtype != object:
        print(f'mean  : {mpgDF[col].mean()}')
        print(f'min   : {mpgDF[col].min()}')
        print(f'max   : {mpgDF[col].max()}')
        print(f'median: {mpgDF[col].median()}')
        print(f'mode  : {mpgDF[col].mode()[0]}')      ## <= mode() returns a Series; take its first value
        print(f'sum: {mpgDF[col].sum()}\n')
        

[mpg         ] float64 --- True
mean  : 23.514572864321607
min   : 9.0
max   : 46.6
median: 23.0
mode  : 13.0
sum: 9358.8

[cylinders   ] int64 --- True
mean  : 5.454773869346734
min   : 3
max   : 8
median: 4.0
mode  : 4
sum: 2171

[displacement] float64 --- True
mean  : 193.42587939698493
min   : 68.0
max   : 455.0
median: 148.5
mode  : 97.0
sum: 76983.5

[horsepower  ] object --- False
[weight      ] int64 --- True
mean  : 2970.424623115578
min   : 1613
max   : 5140
median: 2803.5
mode  : 1985
sum: 1182229

[acceleration] float64 --- True
mean  : 15.568090452261307
min   : 8.0
max   : 24.8
median: 15.5
mode  : 14.5
sum: 6196.1

[model year  ] int64 --- True
mean  : 76.01005025125629
min   : 70
max   : 82
median: 76.0
mode  : 73
sum: 30252

[origin      ] int64 --- True
mean  : 1.5728643216080402
min   : 1
max   : 3
median: 1.0
mode  : 1
sum: 626

[car name    ] object --- False
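
The same per-column numbers can also be collected into one table instead of being printed in a loop. The following is a sketch of that alternative, not the approach used in this notebook. It runs on a toy frame built from the three rows shown by `head(3)` and assumes `select_dtypes` plus `agg` are acceptable here.

```python
import pandas as pd

# Toy frame built from the first three rows shown earlier
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0],
    "weight": [3504, 3693, 3436],
    "car name": ["chevrolet chevelle malibu", "buick skylark 320", "plymouth satellite"],
})

# Keep only numeric columns, then apply several reductions at once
num_df = df.select_dtypes(include="number")
stats = num_df.agg(["mean", "min", "max", "median", "sum"])

# mode() can return several rows per column; keep the first (smallest) mode
stats.loc["mode"] = num_df.mode().iloc[0]

print(stats)
```

Each row of `stats` is one statistic and each column is one numeric column, so a single value can be read with `stats.loc['median', 'weight']`.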


In [25]:
## -------------------------------------------------
## => Compute and print statistics for the whole DataFrame
## -------------------------------------------------
## Call each bound method in turn: mean, std, median, sum
for func in [mpgDF.mean, mpgDF.std, mpgDF.median, mpgDF.sum]:
    print('\n----------')
    display(func(numeric_only=True))


----------


mpg               23.514573
cylinders          5.454774
displacement     193.425879
weight          2970.424623
acceleration      15.568090
model year        76.010050
origin             1.572864
dtype: float64


----------


mpg               7.815984
cylinders         1.701004
displacement    104.269838
weight          846.841774
acceleration      2.757689
model year        3.697627
origin            0.802055
dtype: float64


----------


mpg               23.0
cylinders          4.0
displacement     148.5
weight          2803.5
acceleration      15.5
model year        76.0
origin             1.0
dtype: float64


----------


mpg                9358.8
cylinders          2171.0
displacement      76983.5
weight          1182229.0
acceleration       6196.1
model year        30252.0
origin              626.0
dtype: float64

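[4] Relationships Between Columns <hr>
- Correlation coefficient: corr() method (Pearson by default)
    * Range of values: -1 to 1
    * Types: negative correlation, positive correlation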
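
As a rough sketch of what `corr()` computes by default, the Pearson coefficient can be written out by hand and compared with the pandas result. The first three value pairs below come from the rows shown by `head(3)`. The last two are made up, and `pearson` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy data: heavier cars paired with lower fuel economy
weight = pd.Series([3504, 3693, 3436, 2372, 2130], name="weight")
mpg = pd.Series([18.0, 15.0, 18.0, 24.0, 27.0], name="mpg")

def pearson(x: pd.Series, y: pd.Series) -> float:
    """Covariance of x and y divided by the product of their spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r_manual = pearson(weight, mpg)
r_pandas = weight.corr(mpg)   # method='pearson' is the default

print(r_manual, r_pandas)
```

A value close to -1 means one column tends to fall as the other rises, a value close to 1 means they rise together, and a value near 0 means little linear relationship.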

In [28]:
## All numeric columns
mpgDF.corr(numeric_only=True)

Unnamed: 0,mpg,cylinders,displacement,weight,acceleration,model year,origin
mpg,1.0,-0.775396,-0.804203,-0.831741,0.420289,0.579267,0.56345
cylinders,-0.775396,1.0,0.950721,0.896017,-0.505419,-0.348746,-0.562543
displacement,-0.804203,0.950721,1.0,0.932824,-0.543684,-0.370164,-0.609409
weight,-0.831741,0.896017,0.932824,1.0,-0.417457,-0.306564,-0.581024
acceleration,0.420289,-0.505419,-0.543684,-0.417457,1.0,0.288137,0.205873
model year,0.579267,-0.348746,-0.370164,-0.306564,0.288137,1.0,0.180662
origin,0.56345,-0.562543,-0.609409,-0.581024,0.205873,0.180662,1.0


In [31]:
## Correlation of the fuel-economy column (mpg) with the other columns
# mpgDF.corr(numeric_only=True)['mpg']

mpg_corrSR = mpgDF.corr(numeric_only=True).loc['cylinders':, 'mpg']

mpg_corrSR.sort_values()

weight         -0.831741
displacement   -0.804203
cylinders      -0.775396
acceleration    0.420289
origin          0.563450
model year      0.579267
Name: mpg, dtype: float64

In [None]:
## ==> Columns most strongly related to mpg: weight, displacement, cylinders [negative correlation]
##                                           model year, origin [positive correlation]
##    Taken together, weight, displacement, and cylinders have the strongest relationship with fuel economy!
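
Because the strength of a relationship depends on the magnitude of the coefficient rather than its sign, ranking by absolute value makes the conclusion above easier to read off. This is a minimal sketch using the coefficients copied from the `mpg_corrSR` output; the variable names are illustrative.

```python
import pandas as pd

# Correlation of each column with mpg, copied from the output above
mpg_corr = pd.Series(
    {
        "weight": -0.831741,
        "displacement": -0.804203,
        "cylinders": -0.775396,
        "acceleration": 0.420289,
        "origin": 0.563450,
        "model year": 0.579267,
    },
    name="mpg",
)

# Sort by magnitude while keeping the original signs in the result
ranked = mpg_corr.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```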