<a href="https://colab.research.google.com/github/GeulHae/GeulHae/blob/dev_dataAnalysis/Go_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Go_EDA: Exploratory Data Analysis

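## Descriptive statistics: statistical techniques that summarize and describe collected data using representative values, distributions, and similar measures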

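### "'Exploratory data analysis (EDA)' is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there." - John Tukey (quoted in the book Doing Data Science)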

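Observing the distribution and values of the data from several angles helps us better understand the phenomenon the data represents. Examining the data against different criteria also reveals patterns that were missed at the problem-definition stage, and these can be used to revise existing hypotheses or add new ones.   

Because the observations and knowledge gained here are reused later for statistical inference and for building predictive models, EDA is one of the most important stages of data analysis.   
The goals of EDA are to suggest hypotheses about the causes of observed phenomena, to guide the choice of appropriate statistical tools and techniques, to assess the assumptions on which the statistical analysis will rest, and to provide a basis for further data collection.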

(1) Formulate questions and problems about the data  
(2) Look for answers to them by visualizing, transforming, and modeling the data  
(3) Use what was learned along the way to refine the questions and formulate new ones

<br>

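### 1. Looking over the whole dataset

Check the number of records, the list of attributes, NaN values, and the data type of each attribute, and inspect the head and tail of the data to confirm that no errors or omissions were introduced during processing.  
Also check that the values of each attribute fall within the expected range and distribution.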
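
The checks listed above map onto a handful of pandas calls. The following is a minimal sketch on a made-up table; the name `df_demo` and all of its values are assumptions for illustration and do not come from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an image-metadata table; every value here is made up.
df_demo = pd.DataFrame({
    "img": ["a_0", "a_1", "b_0", "b_1", "c_0"],
    "label": ["pet", "pet", "toy", "toy", None],   # one missing label
    "width": [1920, 1920, 1280, np.nan, 640],      # one missing width
    "height": [1080, 1080, 720, 720, 480],
})

print(df_demo.shape)      # number of records and attributes
print(df_demo.dtypes)     # data type of each attribute
print(df_demo.head(2))    # first records
print(df_demo.tail(2))    # last records

missing = df_demo.isna().sum()   # NaN count per attribute
print(missing)

summary = df_demo.describe()     # min, max, quartiles of numeric attributes
print(summary)
```

Comparing the `min` and `max` rows of `summary` with the range you expect is a quick way to spot values that were corrupted during processing.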

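### 2. Outlier analysis

First, observe individual records to get a sense of the overall trend and of anything unusual.   
With a large dataset, looking at only one part can miss outliers that appear elsewhere, so examine samples taken from the beginning, from the end, and at random.  
Outliers may not show up in small samples.   

Second, use appropriate summary statistics.  
Use the mean, median, and mode to locate the center of the data, and the range, variance, and similar measures to describe its spread.  
Choose statistics with the characteristics of the data in mind; for example, the mean is sensitive to extreme values while the median is not, so the two can differ substantially.  

Third, use visualization.  
Visualization helps decide which summary statistics suit each attribute of the data.  
Common plots include the histogram, scatter plot, box plot, and time-series chart.  
Outliers can also be found with the K-means clustering technique from machine learning, or with statistical-based, deviation-based, and distance-based detection methods.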
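
As a small illustration of the second step, the sketch below compares the mean and the median on a sample containing one extreme value and then flags outliers with the common 1.5 × IQR rule (the rule a box plot uses for its whiskers). The file sizes are made up, and the IQR rule is only one of several possible choices, not a method prescribed by this notebook.

```python
import numpy as np

# Made-up file sizes in KB; the last value is an injected outlier.
sizes = np.array([210.0, 225.0, 230.0, 240.0, 250.0, 255.0, 260.0, 5000.0])

mean_size = sizes.mean()          # pulled upward by the extreme value
median_size = np.median(sizes)    # barely affected by it
print(mean_size, median_size)

# 1.5 * IQR rule: values far outside the middle 50% are flagged.
q1, q3 = np.percentile(sizes, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sizes[(sizes < lower) | (sizes > upper)]
print(outliers)
```

A large gap between the mean and the median, as here, is itself a hint that the attribute is skewed or contains outliers.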


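### 3. Analyzing relationships between attributes

Analyzing the relationships between attributes identifies combinations of attributes that are meaningfully correlated.  
The method of analysis depends on the types of the attributes involved (see the variable-type images).  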


<img src = "https://drive.google.com/uc?id=1KjOycyiHVPDAFHfsViCD4G4F1fQXOKgg">

<img src = "https://drive.google.com/uc?id=1U3CqtqJzWRwUdqB-2ROCSZli5lJf_2il">

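- Discrete variables: for a pair of discrete variables, the association between the two attributes is expressed with a correlation coefficient.   
Visualize it with a heatmap or scatter plot.
- Discrete variable vs. categorical variable: summary statistics can be computed and compared per category; visualize with a box plot, PCA plot, and so on.
- Categorical variables: for a pair of categorical variables, the count and distribution of records for each pair of attribute values can be examined; visualize with a pie chart, mosaic plot, and so on.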
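
The three cases above can be sketched with pandas as follows. The table `demo` and its values are assumptions made up for illustration; the column names mirror the ones used for the image metadata later in this notebook.

```python
import pandas as pd

# Made-up metadata rows for illustration only.
demo = pd.DataFrame({
    "label": ["pet", "pet", "toy", "toy", "basket", "basket"],
    "cmap": ["RGB", "RGB", "RGB", "L", "RGB", "RGB"],
    "width": [1920, 1280, 1920, 640, 1280, 640],
    "height": [1080, 720, 1080, 480, 720, 480],
})

# Numeric vs. numeric: Pearson correlation matrix (input for a heatmap).
corr = demo[["width", "height"]].corr()
print(corr)

# Numeric vs. categorical: summary statistics per category (input for a box plot).
width_by_label = demo.groupby("label")["width"].agg(["mean", "min", "max"])
print(width_by_label)

# Categorical vs. categorical: contingency table of value pairs (input for a mosaic plot).
label_by_cmap = pd.crosstab(demo["label"], demo["cmap"])
print(label_by_cmap)
```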





In [2]:
import os
import os.path
import cv2
import shutil 
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
# warnings.filterwarnings(action='default')


## Data Set: household waste images
### Training [T원천]
- 페트병_페트병_페트병.zip 20GB
- 페트병_일회용음료수잔_일회용음료수잔.zip 2GB v
- 플라스틱류_밀폐용기_밀폐용기.zip 13GB 
- 플라스틱류_장난감_장난감.zip 2GB
- 플라스틱류_욕실용품_욕실용품.zip 2GB
- 플라스틱류_대용량플라스틱통_대용량플라스틱통.zip 1GB
- 플라스틱류_바구니_바구니.zip 928MB
- 플라스틱류_기타_기타.zip 673MB
- 페트병_기타_기타.zip 31MB
- Training_라벨링데이터.zip 624MB


### Validation [V원천]
- 페트병_페트병_페트병.zip 3GB
- 페트병_일회용음료수잔_일회용음료수잔.zip 239MB v
- 플라스틱류_밀폐용기_밀폐용기.zip 2GB 
- 플라스틱류_장난감_장난감.zip 260MB
- 플라스틱류_욕실용품_욕실용품.zip 186MB
- 플라스틱류_대용량플라스틱통_대용량플라스틱통.zip 146MB
- 플라스틱류_바구니_바구니.zip 115MB
- 플라스틱류_기타_기타.zip MB
- 페트병_기타_기타.zip 8MB
- Validation_라벨링데이터.zip 78MB


In [14]:
# Google Drive mount

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Check the current directory
print(os.getcwd())

# # Create the dataset_plastics folder in the current directory
# os.mkdir('dataset_plastics')

/content


## Extracting the DataSet .zip archives

In [None]:
# # %cd <directory to extract into>
# # !unzip -qq "<path to the archive>"
# # /content/DataSet_plastics

# %cd /content/drive/MyDrive/Colab Notebooks/DataSet/DataSet_plastics

# !unzip -qq "/content/drive/MyDrive/Colab Notebooks/DataSet/DataSet_plastics.zip"

In [15]:
# Extract the dataset into Google Drive: dataset_plastics > training > <folder name>


# /content/drive/MyDrive/DataSet/Training/[T원천]페트병_일회용음료수잔_일회용음료수잔.zip

!unzip /content/drive/MyDrive/DataSet/Training/[T원천]페트병_일회용음료수잔_일회용음료수잔.zip -d /content/drive/"My Drive"/dataset_plastics/training/[T원천]페트병_일회용음료수잔_일회용음료수잔/


Archive:  /content/drive/MyDrive/DataSet/Training/[T원천]페트병_일회용음료수잔_일회용음료수잔.zip
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C012_1110/
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C012_1110/23_X001_C012_1110_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C012_1110/23_X001_C012_1110_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C012_1110/23_X001_C012_1110_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C012_1110/23_X001_C012_1110_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_일회용음료수잔_일회요

In [16]:
# /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_일회용음료수잔_일회용음료수잔.zip

!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_일회용음료수잔_일회용음료수잔.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]페트병_일회용음료수잔_일회용음료수잔/

Archive:  /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_일회용음료수잔_일회용음료수잔.zip
   creating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C015_1026/
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C015_1026/23_X001_C015_1026_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C015_1026/23_X001_C015_1026_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C015_1026/23_X001_C015_1026_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_일회용음료수잔_일회용음료수잔/23_X001_C015_1026/23_X001_C015_1026_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_일회용음ᄅ

In [17]:
# /content/drive/MyDrive/DataSet/Training/[T원천]페트병_기타_기타.zip
!unzip /content/drive/MyDrive/DataSet/Training/[T원천]페트병_기타_기타.zip -d /content/drive/"My Drive"/dataset_plastics/training/[T원천]페트병_기타_기타/

Archive:  /content/drive/MyDrive/DataSet/Training/[T원천]페트병_기타_기타.zip
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_기타_기타/23_X004_C509_0511/
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_기타_기타/23_X004_C509_0511/23_X004_C509_0511_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_기타_기타/23_X004_C509_0511/23_X004_C509_0511_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_기타_기타/23_X004_C509_0511/23_X004_C509_0511_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_기타_기타/23_X004_C509_0511/23_X004_C509_0511_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_기타_기타/23_X004_C509_0511/23_X004_C509_0511_4.jpg  
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]페트병_기타_기타/23_X011_C509_0511/
  inflating: /content/drive/My 

In [18]:
# /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_기타_기타.zip
!unzip /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_기타_기타.zip -d /content/drive/"My Drive"/dataset_plastics/training/[T원천]플라스틱류_기타_기타/

Archive:  /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_기타_기타.zip
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_기타_기타/24_X001_C016_1008/
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_기타_기타/24_X001_C016_1008/24_X001_C016_1008_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_기타_기타/24_X001_C016_1008/24_X001_C016_1008_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_기타_기타/24_X001_C016_1008/24_X001_C016_1008_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_기타_기타/24_X001_C016_1008/24_X001_C016_1008_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_기타_기타/24_X001_C016_1008/24_X001_C016_1008_4.jpg  
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_기타_기타/24_X001_C0

In [19]:
# /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통.zip
!unzip /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통.zip -d /content/drive/"My Drive"/dataset_plastics/training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통/

Archive:  /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통.zip
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통/24_X001_C014_1207/
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통/24_X001_C014_1207/24_X001_C014_1207_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통/24_X001_C014_1207/24_X001_C014_1207_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통/24_X001_C014_1207/24_X001_C014_1207_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_대용량플라스틱통_대용량플라스틱통/24_X001_C014_1207/24_X001_C014_1207_3.jpg  
  inflating: /content/drive/My Drive/dat

In [20]:
# /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_바구니_바구니.zip
!unzip /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_바구니_바구니.zip -d /content/drive/"My Drive"/dataset_plastics/training/[T원천]플라스틱류_바구니_바구니/

Archive:  /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_바구니_바구니.zip
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_바구니_바구니/24_X001_C015_0929/
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_바구니_바구니/24_X001_C015_0929/24_X001_C015_0929_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_바구니_바구니/24_X001_C015_0929/24_X001_C015_0929_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_바구니_바구니/24_X001_C015_0929/24_X001_C015_0929_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_바구니_바구니/24_X001_C015_0929/24_X001_C015_0929_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_바구니_바구니/24_X001_C015_0929/24_X001_C015_0929_4.jpg  
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라

In [21]:
# /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_욕실용품_욕실용품.zip
!unzip /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_욕실용품_욕실용품.zip -d /content/drive/"My Drive"/dataset_plastics/training/[T원천]플라스틱류_욕실용품_욕실용품/

Archive:  /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_욕실용품_욕실용품.zip
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_욕실용품_욕실용품/24_X001_C014_1112/
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_욕실용품_욕실용품/24_X001_C014_1112/24_X001_C014_1112_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_욕실용품_욕실용품/24_X001_C014_1112/24_X001_C014_1112_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_욕실용품_욕실용품/24_X001_C014_1112/24_X001_C014_1112_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_욕실용품_욕실용품/24_X001_C014_1112/24_X001_C014_1112_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_욕실용품_욕실용품/24_X001_C014_1112/24_X001_C014_1112_4

In [22]:
# /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_장남감_장남감.zip
!unzip /content/drive/MyDrive/DataSet/Training/[T원천]플라스틱류_장남감_장남감.zip -d /content/drive/"My Drive"/dataset_plastics/training/[T원천]플라스틱류_장남감_장남감/

Streaming output truncated to the last 5000 lines.
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_장남감_장남감/24_X050_C705_1210/24_X050_C705_1210_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_장남감_장남감/24_X050_C705_1210/24_X050_C705_1210_4.jpg  
   creating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_장남감_장남감/24_X050_C999_0328/
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_장남감_장남감/24_X050_C999_0328/24_X050_C999_0328_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_장남감_장남감/24_X050_C999_0328/24_X050_C999_0328_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스틱류_장남감_장남감/24_X050_C999_0328/24_X050_C999_0328_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/training/[T원천]플라스티

In [None]:
# /content/drive/MyDrive/DataSet/Training/Training_라벨링데이터.zip
!unzip /content/drive/MyDrive/DataSet/Training/Training_라벨링데이터.zip -d /content/drive/"My Drive"/dataset_plastics/training/Training_라벨링데이터/

Streaming output truncated to the last 5000 lines.
  inflating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C015_1104/22_X004_C015_1104_1.Json  
  inflating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C015_1104/22_X004_C015_1104_2.Json  
  inflating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C015_1104/22_X004_C015_1104_3.Json  
  inflating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C015_1104/22_X004_C015_1104_4.Json  
   creating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C016_1027/
  inflating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C016_1027/22_X004_C016_1027_0.Json  
  inflating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C016_1027/22_X004_C016_1027_1.Json  
  inflating: /content/drive/My Drive/dataset_plastics/training/캔류/음료수캔/22_X004_C016_1027/22_X004_C016_1027_2.Json  
  inflating: /content/drive/My Drive/dataset_pla

## dataset_plastics > Validation

In [23]:
# /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_기타_기타.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_기타_기타.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]페트병_기타_기타/

Archive:  /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_기타_기타.zip
   creating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_기타_기타/23_X108_C509_0326/
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_기타_기타/23_X108_C509_0326/23_X108_C509_0326_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_기타_기타/23_X108_C509_0326/23_X108_C509_0326_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_기타_기타/23_X108_C509_0326/23_X108_C509_0326_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_기타_기타/23_X108_C509_0326/23_X108_C509_0326_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_기타_기타/23_X108_C509_0326/23_X108_C509_0326_4.jpg  
   creating: /content/drive/My Drive/dataset_plastics/validation/[V원천]페트병_기타_기타/23_X116_C499_0326/
  inflating: /c

In [24]:
# /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_기타_기타.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_기타_기타.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/

Archive:  /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_기타_기타.zip
   creating: /content/drive/My Drive/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/24_X001_C509_0106/
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/24_X001_C509_0106/24_X001_C509_0106_0.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/24_X001_C509_0106/24_X001_C509_0106_1.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/24_X001_C509_0106/24_X001_C509_0106_2.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/24_X001_C509_0106/24_X001_C509_0106_3.jpg  
  inflating: /content/drive/My Drive/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/24_X001_C509_0106/24_X001_C509_0106_4.jpg  
   creating: /content/drive/My Drive/dataset_plastics/validation/[V원천]플라스틱류_기타_기타/24_X002_C803_0219/
  inflating: /content/drive/My 

In [None]:
# /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_대용량플라스틱통_대용량플라스틱통.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_대용량플라스틱통_대용량플라스틱통.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]플라스틱류_대용량플라스틱통_대용량플라스틱통/

In [None]:
# /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_바구니_바구니.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_바구니_바구니.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]플라스틱류_바구니_바구니/

In [None]:
# /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_욕실용품_욕실용품.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_욕실용품_욕실용품.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]플라스틱류_욕실용품_욕실용품/

In [None]:
# /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_장남감_장남감.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_장남감_장남감.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]플라스틱류_장남감_장남감/

In [None]:
# /content/drive/MyDrive/DataSet/Validation/[V라벨링]라벨링데이터.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V라벨링]라벨링데이터.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V라벨링]라벨링데이터/

In [None]:
# Not included in the training set
# /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_밀폐용기_밀폐용기.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]플라스틱류_밀폐용기_밀폐용기.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]플라스틱류_밀폐용기_밀폐용기/

In [None]:
# Not included in the training set
# /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_페트병_페트병.zip
!unzip /content/drive/MyDrive/DataSet/Validation/[V원천]페트병_페트병_페트병.zip -d /content/drive/"My Drive"/dataset_plastics/validation/[V원천]페트병_페트병_페트병/
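
The extraction cells above repeat the same command once per archive. As an alternative, the loop below is a minimal sketch that uses Python's standard `zipfile` module to extract every archive in a folder into a subfolder named after it. The helper name `extract_all` is hypothetical, and the demo runs on a temporary directory with a made-up archive rather than on the real dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_all(archive_dir, target_dir):
    """Extract every .zip in archive_dir into target_dir/<archive name without .zip>/."""
    archive_dir, target_dir = Path(archive_dir), Path(target_dir)
    extracted = {}
    for archive in sorted(archive_dir.glob("*.zip")):
        out_dir = target_dir / archive.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
            # Count files only; directory entries end with "/".
            extracted[archive.stem] = sum(not n.endswith("/") for n in zf.namelist())
    return extracted


# Self-contained demo on a temporary directory with a made-up archive.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "zips", Path(tmp) / "out"
    src.mkdir()
    with zipfile.ZipFile(src / "demo_a.zip", "w") as zf:
        zf.writestr("folder/img_0.jpg", b"fake")
        zf.writestr("folder/img_1.jpg", b"fake")
    result = extract_all(src, dst)
    demo_file_exists = (dst / "demo_a" / "folder" / "img_0.jpg").exists()

print(result)
```

In this notebook the call would look like `extract_all('/content/drive/MyDrive/DataSet/Training', '/content/drive/MyDrive/dataset_plastics/training')`, assuming the Drive layout shown in the cells above.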

## Image file format

In [None]:
# # Check the file count to confirm the upload succeeded

# filepaths = list(glob('content/image/*.jpg'))

# len(filepaths)

In [None]:
# List the files in each folder with os.listdir and count them with len

# print(len(os.listdir('/content/trashs_filtered/train/plastics')))
# print(len(os.listdir('/content/trashs_filtered/train/papers')))

# print(len(os.listdir('/content/trashs_filtered/validation/plastics')))
# print(len(os.listdir('/content/trashs_filtered/validation/papers')))
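
Because the extracted images sit in nested subfolders, `os.listdir` alone counts folders rather than images. The sketch below walks each top-level folder recursively and counts image files. The helper name `count_files_by_folder` is hypothetical, and the demo uses a temporary directory with empty placeholder files.

```python
import os
import tempfile


def count_files_by_folder(root, extensions=(".jpg",)):
    """Return {top-level folder name: number of files with a matching extension}."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue
        total = 0
        for _, _, files in os.walk(folder):
            total += sum(f.lower().endswith(extensions) for f in files)
        counts[name] = total
    return counts


# Self-contained demo: two folders, one of them nested, with empty files.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "plastics", "23_X001"))
    os.makedirs(os.path.join(tmp, "papers"))
    for parts in [("plastics", "23_X001", "a_0.jpg"),
                  ("plastics", "23_X001", "a_1.jpg"),
                  ("papers", "b_0.jpg"),
                  ("papers", "notes.txt")]:
        open(os.path.join(tmp, *parts), "w").close()
    demo_counts = count_files_by_folder(tmp)

print(demo_counts)
```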

In [None]:
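# # Check the current directory
# print(os.getcwd())

# # Build the path of the new folder that files will be moved into (join the paths)
# root_dir = os.path.join(os.getcwd(), 'board_papers')
# print(root_dir)
# # Create the board_papers folder in the current directory
# os.mkdir(root_dir)

# # Change the current directory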
# os.chdir(root_dir)
# print(os.getcwd())

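## 1. Classifying and analyzing explicit information
- Image name (img)
- Label (label)
- File extension (ftype)
- Color map (cmap)
- Number of channels (channel)
- Image file size in KB (fsize)
- Last-modified time of the image (fmtime)
- Image width (width)
- Image height (height)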

In [None]:
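import time
from PIL import Image

# ASSUMPTION: base_dir and data_list are not defined anywhere in this notebook.
# Here base_dir is taken to be the extracted training folder, and data_list maps
# each label folder to the image paths found beneath it (relative to that folder).
base_dir = '/content/drive/My Drive/dataset_plastics/training'
data_list = {}
for label in sorted(os.listdir(base_dir)):
    label_dir = os.path.join(base_dir, label)
    if os.path.isdir(label_dir):
        data_list[label] = [os.path.relpath(os.path.join(root, name), label_dir)
                            for root, _, names in os.walk(label_dir)
                            for name in names if name.lower().endswith('.jpg')]

numdata = sum(len(paths) for paths in data_list.values())
cnt = 0
data = []

# Build a list holding the metadata of each image
for label in data_list.keys():
    for img_dir in data_list[label]:
        img_path = os.path.join(base_dir, label, img_dir)
        # load image data
        img = Image.open(img_path)
        # PIL.Image to np.array; the array shape is (height, width[, channels])
        imgarr = np.array(img)
        if imgarr.ndim == 2:
            h, w = imgarr.shape
            c = 1
        else:
            h, w, c = imgarr.shape
        fsize = os.path.getsize(img_path) / 1024.0  # file size in KB
        fmtime = time.ctime(os.path.getmtime(img_path))
        name, ext = os.path.splitext(os.path.basename(img_dir))
        data.append([name, label, ext[1:], img.mode, fsize, fmtime, w, h, c])
        cnt += 1
        print(f"{cnt}/{numdata}")

df = pd.DataFrame(data, columns=['img', 'label', 'ftype', 'cmap', 'fsize',
                                 'fmtime', 'width', 'height', 'channel'])
df = df.astype({'img': 'string', 'label': 'string', 'ftype': 'string', 'cmap': 'string',
                'fsize': 'float', 'fmtime': 'datetime64[ns]', 'width': 'int',
                'height': 'int', 'channel': 'int'})
df.head()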

In [None]:
df.info()

## Visualizing the number of samples per class
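
No cell is given for this step, so the following is only a sketch of one way to do it, assuming a table with a `label` column like the `df` built earlier. The table `demo` and its labels are made up so that the example runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up labels standing in for the `label` column of the metadata table.
demo = pd.DataFrame({"label": ["pet", "pet", "pet", "toy", "toy", "basket"]})

# Number of images per class, largest first.
class_counts = demo["label"].value_counts()
print(class_counts)

# Horizontal bar chart; sorting ascending puts the largest class at the top.
ax = class_counts.sort_values().plot.barh()
ax.set_xlabel("number of images")
ax.set_ylabel("label")
plt.tight_layout()
plt.show()
```

With the real table, replacing `demo` with `df` should be enough. A strongly uneven chart indicates class imbalance that may need to be handled before training.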

## Visualizing the distribution of image file formats, color maps, and channel counts

In [None]:
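# Print the counts explicitly; only the last expression of a cell is displayed automatically
print(df.groupby('ftype')['cmap'].value_counts())

sns.countplot(y='ftype', hue='cmap', data=df)
plt.xlim(0, 1200)  # set the axis limit after plotting so it applies to the finished plot
plt.show()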

## Visualizing image dimensions: width and height

In [None]:
sns.scatterplot(x='width',y='height',data=df)
plt.show()
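
In a scatter plot, images with identical dimensions are drawn on top of each other, so the plot shows which sizes occur but not how often. As a complement, the sketch below counts each distinct (width, height) pair and derives the aspect ratio. The table `demo` and its values are made up for illustration.

```python
import pandas as pd

# Made-up image dimensions standing in for the width/height columns.
demo = pd.DataFrame({
    "width": [1920, 1920, 1280, 1280, 1280, 640],
    "height": [1080, 1080, 720, 720, 720, 480],
})

# How many images share each exact resolution, most common first.
resolution_counts = (
    demo.groupby(["width", "height"]).size().sort_values(ascending=False)
)
print(resolution_counts)

# Aspect ratio per image; distinct values reveal mixed image formats.
demo["aspect_ratio"] = demo["width"] / demo["height"]
print(demo["aspect_ratio"].round(3).value_counts())
```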