# Introduction

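- Topic: Analysis of price increases by agricultural and marine product, and visualization of auction prices
- Goal: Carry out a data analysis project and build a portfolio through visualization
- Period: 22/07/25 ~ 22/08/05
- Team member: 이진규
- Data: Seoul Open Data Plaza (서울열린데이터광장): agricultural and marine product auction information (http://data.seoul.go.kr/dataList/OA-2662/S/1/datasetView.do#)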

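## Column info

Column | Meaning |
---|:---:|
`prd` | Product name, preprocessed into categories |
`scale` | Sales unit (kg) |
`price` | Sale price (₩) |
`reg_date` | Registration date |
`new_class` | Product grade; 1 is the highest |
`price_kg` | Price per kg |
`state` | Province level; overseas origins are marked as imported |
`city` | City level; for overseas origins, the country name |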
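
To make the relationship between the columns concrete, the sketch below derives a per-kg price from `price` and `scale` on a toy frame. It assumes `price_kg` is simply `price / scale`, which the column table suggests but does not state outright; the rows are made up for illustration.

```python
import pandas as pd

# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "prd": ["apple", "pear", "cabbage"],
    "scale": [10.0, 5.0, 2.0],      # sales unit in kg
    "price": [27000, 15000, 3000],  # sale price in KRW
})

# Assumption: price per kg is the sale price divided by the sales unit.
toy["price_kg"] = toy["price"] / toy["scale"]
print(toy)
```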

## Initialize

In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
pd.options.display.float_format = '{:.5f}'.format

In [82]:
df2101 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202101_proceed.csv")
df2102 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202102_proceed.csv")
df2103 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202103_proceed.csv")
df2104 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202104_proceed.csv")
df2105 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202105_proceed.csv")
df2106 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202106_proceed.csv")
df2107 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202107_proceed.csv")
df2108 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202108_proceed.csv")
df2109 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202109_proceed.csv")
df2110 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202110_proceed.csv")
df2111 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202111_proceed.csv")
df2112 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202112_proceed.csv")
df2201 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202201_proceed.csv")
df2202 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202202_proceed.csv")
df2203 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202203_proceed.csv")
df2204 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202204_proceed.csv")
df2205 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202205_proceed.csv")
df2206 = pd.read_csv("/Users/luci031/Downloads/auction/data_proceed/202206_proceed.csv")
df_lst = [df2101,df2102,df2103,df2104,df2105,df2106,df2107,df2108,df2109,df2110,df2111,df2112,df2201,df2202,df2203,df2204,df2205,df2206]
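
The eighteen monthly files follow a single naming pattern, so the loading step could also be written as a loop. The following is a minimal sketch, not the notebook's own code: `DATA_DIR`, `month_keys`, and `load_months` are hypothetical names, and the directory should be replaced with wherever the CSVs actually live.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location; point this at the folder holding the monthly CSVs.
DATA_DIR = Path("data_proceed")


def month_keys(start, end):
    """Return YYYYMM strings for every month from start to end, inclusive."""
    return [p.strftime("%Y%m") for p in pd.period_range(start, end, freq="M")]


def load_months(data_dir, start, end):
    """Read one `<YYYYMM>_proceed.csv` per month into a list of DataFrames."""
    return [
        pd.read_csv(Path(data_dir) / f"{key}_proceed.csv")
        for key in month_keys(start, end)
    ]


keys = month_keys("2021-01", "2022-06")
print(len(keys), keys[0], keys[-1])
```

Calling `load_months(DATA_DIR, "2021-01", "2022-06")` would then return a list equivalent to `df_lst`, provided the files exist under that directory.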

## Pre-processing

In [83]:
# Data preprocessing: reset the index and drop the leftover 'Unnamed: 0' index column
for df in df_lst:
    df.reset_index(inplace=True,drop=True)
    df.drop(columns=['Unnamed: 0'],inplace=True)
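
An `Unnamed: 0` column usually appears when a CSV was saved together with its index. As an alternative to dropping it afterwards, the index column can be consumed at read time with `index_col=0`. The sketch below shows the difference on an in-memory CSV; the sample text is made up.

```python
import io

import pandas as pd

# A CSV saved with the default index, which yields an "Unnamed: 0" column on read.
csv_text = ",prd,price\n0,apple,1000\n1,pear,2000\n"

plain = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```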

In [84]:
# Combine all monthly data into one DataFrame with a fresh 0..n-1 index
df = pd.concat(df_lst, ignore_index=True)

In [85]:
# The source data was already preprocessed, so there are no null values
df.isnull().sum()

prd          0
scale        0
price        0
eco          0
reg_date     0
new_class    0
price_kg     0
state        0
city         0
dtype: int64

In [86]:
# Found outliers where scale is 0; drop them
df.drop(df[df['scale']==0].index,inplace=True)
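
Rows with `scale == 0` cannot carry a meaningful per-kg price, since dividing by zero yields `inf` in pandas. The same cleanup can be expressed as a boolean mask, which avoids the index lookup. This is a sketch on toy data, assuming the goal is only to discard those rows.

```python
import pandas as pd

# Toy rows; the middle one has a zero sales unit.
toy = pd.DataFrame({"scale": [10.0, 0.0, 5.0], "price": [27000, 8000, 15000]})

# Keep only rows with a non-zero sales unit.
cleaned = toy[toy["scale"] != 0]
print(len(toy) - len(cleaned))  # number of rows removed
```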

In [87]:
# Convert reg_date to datetime format
df['reg_date'] = pd.to_datetime(df['reg_date'])
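
When the date format is known, passing it explicitly makes parsing stricter, and `errors="coerce"` turns malformed entries into `NaT` so they can be counted. The sketch below assumes a `YYYYMMDD` string format purely for illustration; the actual format of `reg_date` is not shown in the source.

```python
import pandas as pd

# Toy values; the real reg_date format is an assumption here.
raw = pd.Series(["20210104", "20210105", "not-a-date"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")
print(parsed.isna().sum())
```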

# EDA

## Basic Info

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11141586 entries, 0 to 11141588
Data columns (total 9 columns):
 #   Column     Dtype         
---  ------     -----         
 0   prd        object        
 1   scale      float64       
 2   price      int64         
 3   eco        object        
 4   reg_date   datetime64[ns]
 5   new_class  int64         
 6   price_kg   float64       
 7   state      object        
 8   city       object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 850.0+ MB


In [89]:
df.describe()

| | scale | price | new_class | price_kg |
|---|---|---|---|---|
| count | 11141586.0 | 11141586.0 | 11141586.0 | 11141586.0 |
| mean | 8.45508 | 23027.53212 | 1.56489 | 4375.88803 |
| std | 120.899 | 108435.25502 | 1.67264 | 25900.33197 |
| min | 0.02 | 500.0 | 1.0 | 0.4 |
| 25% | 3.0 | 8000.0 | 1.0 | 1600.0 |
| 50% | 5.0 | 15000.0 | 1.0 | 2857.14 |
| 75% | 10.0 | 27000.0 | 1.0 | 5222.22 |
| max | 18000.0 | 99999999.0 | 9.0 | 49999999.5 |
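
The summary shows a long right tail: the maximum `price_kg` (49,999,999.5) is several orders of magnitude above the 75th percentile (5,222.22). One common way to flag such rows for inspection is the 1.5 * IQR rule. The sketch below applies it to toy values; it is an illustration, not a step taken in the original notebook.

```python
import pandas as pd

# Toy per-kg prices with one extreme value.
s = pd.Series([1600.0, 2857.0, 3000.0, 5222.0, 4000.0, 49_999_999.5])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # upper fence of the 1.5 * IQR rule

# Values above the upper fence are candidates for manual review.
outliers = s[s > upper]
print(outliers)
```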


In [92]:
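# Explore correlations between the numeric columns
# scale and price are moderately correlated (about 0.47), though less strongly than expected
# Selecting numeric columns first keeps this working on pandas versions that raise on string columns
df.select_dtypes('number').corr()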

| | scale | price | new_class | price_kg |
|---|---|---|---|---|
| scale | 1.0 | 0.46682 | -0.00314 | -0.00446 |
| price | 0.46682 | 1.0 | -0.01486 | 0.6727 |
| new_class | -0.00314 | -0.01486 | 1.0 | -0.01181 |
| price_kg | -0.00446 | 0.6727 | -0.01181 | 1.0 |
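
Pearson correlation is sensitive to extreme values, which this dataset appears to contain. A rank-based Spearman correlation can serve as a cross-check. The sketch below compares the two on toy data with one very large `scale` value; the numbers are made up and say nothing about the real dataset.

```python
import pandas as pd

# Toy data: price rises monotonically with scale, but one scale value is extreme.
toy = pd.DataFrame({
    "scale": [1.0, 2.0, 3.0, 4.0, 1000.0],
    "price": [1000, 2100, 2900, 4200, 5000],
})

pearson = toy.corr(method="pearson").loc["scale", "price"]    # linear association
spearman = toy.corr(method="spearman").loc["scale", "price"]  # rank-based association
print(round(pearson, 3), round(spearman, 3))
```

Because the relationship is perfectly monotonic, Spearman equals 1 here, while the single extreme point pulls Pearson below that.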


In [91]:
# On average, there are about 24,650 transactions per registration date
df.groupby('reg_date').size().mean()

24649.526548672566
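
The average above is the mean of the per-date row counts. The sketch below reproduces that calculation on a toy frame and also shows a monthly aggregation, which may be useful for the same kind of volume check. The rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "reg_date": pd.to_datetime([
        "2021-01-04", "2021-01-04", "2021-01-05",
        "2021-02-01", "2021-02-01", "2021-02-01",
    ]),
    "prd": ["apple", "pear", "apple", "pear", "apple", "cabbage"],
})

# Rows per registration date, then the mean across dates.
daily_counts = toy.groupby("reg_date").size()
print(daily_counts.mean())

# Rows per calendar month.
monthly_counts = toy.groupby(toy["reg_date"].dt.to_period("M")).size()
print(monthly_counts)
```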

## Exploration by column

### prd

In [98]:
print('product info')
print(f"Number of products: {df['prd'].nunique()}")

product info
Number of products: 207
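
After counting distinct products, a natural next look is which products account for the most records. The sketch below uses `value_counts` on a toy series; the product names are placeholders.

```python
import pandas as pd

prd = pd.Series(["apple", "pear", "apple", "cabbage", "apple", "pear"], name="prd")

# Number of distinct products.
print(prd.nunique())

# Most frequent products first; head(2) keeps the top two.
top = prd.value_counts().head(2)
print(top)
```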
