# 10-class test (Google Cloud Storage, GCS)

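## As in the image 정현 uploaded, let's first classify only the images of the Top 10 classes by data count!
1. Issue: the gap between the data counts of the rank-1 and rank-10 classes is also too large. We need to find out whether this is a problem.
   Test with MNIST.
2. If the issue turns out to be a problem, what is the fix? Balance the number of images per class.
3. For now, try transfer learning with a simple model (VGG, Inception, etc.).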

### Analysis results

#### 1. Issue: the gap between the data counts of the rank-1 and rank-10 classes is also too large. We need to find out whether this is a problem.

* Dataset used: MNIST
* Train size: 60,000
* Test size: 10,000
* Number of labels: 10

|Weighted sampling|Per-class sampling probability|Test set accuracy (after 10 epochs)|
|--------|---------------|--------|
|No||9830/10000 (98%)|
|Yes|Set so that every class is sampled with equal probability|9839/10000 (98%)|
|Yes|[0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]|9354/10000 (94%)|
|Yes|[0.99, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001]|6528/10000 (65%)|
|Yes|[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]|980/10000 (10%)|
|Yes|[0.24707459, 0.2461469 , 0.1149304 , 0.09066322, 0.0651395, 0.05471404, 0.04666915, 0.04526535, 0.0452359 , 0.04416096]|9767/10000 (98%)|

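* Confirmed that differing per-label data ratios become a problem: test accuracy drops from 98% to 94%, 65%, and 10% as the sampling gets more skewed.
* The last row reflects the data ratios of the Top 10 labels of the Kaggle problem as-is.
* That result looks fine at first glance, but accuracy on the low-ratio labels is expected to drop, so balancing is needed as the simple remedy.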
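
The overall test accuracy cannot show whether the low-ratio labels are the ones being misclassified. A minimal sketch of a per-class accuracy check follows; the function name `per_class_accuracy` and the toy labels are hypothetical and are not results from these runs.

```python
from collections import Counter

def per_class_accuracy(targets, predictions, num_classes):
    """Return the accuracy of each class (None when the class never occurs)."""
    total = Counter(targets)
    correct = Counter(t for t, p in zip(targets, predictions) if t == p)
    return [correct[c] / total[c] if total[c] else None
            for c in range(num_classes)]

# Toy labels for illustration only (hypothetical, not taken from the experiments)
targets     = [0, 0, 0, 0, 1, 1, 2, 2]
predictions = [0, 0, 0, 0, 1, 0, 0, 2]

acc = per_class_accuracy(targets, predictions, 3)
print(acc)  # [1.0, 0.5, 0.5], while the overall accuracy is 6/8 = 0.75
```

In the notebook, the same idea could be applied by collecting `pred` and `target` inside `test()` and passing them to such a function.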

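#### 2. If the issue turns out to be a problem, what is the fix? Balance the number of images per class.
* Confirmed to be a problem -> either balance the counts or analyze techniques for handling imbalanced datasets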
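
One commonly used technique for imbalanced data, besides resampling, is to weight the loss by inverse class frequency. The sketch below only computes such weights from the Kaggle Top 10 counts listed later in this notebook; the helper name `inverse_frequency_weights` is hypothetical, and this approach was not tested in these experiments.

```python
# Image counts of the Kaggle Top 10 labels, as listed later in this notebook
counts = [50337, 50148, 23415, 18471, 13271, 11147, 9508, 9222, 9216, 8997]

def inverse_frequency_weights(counts):
    """Weight class c by N / (K * n_c); a perfectly balanced set gets weight 1."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * n) for n in counts]

class_weights = inverse_frequency_weights(counts)
for rank, (n, w) in enumerate(zip(counts, class_weights), start=1):
    print("rank {:2d}: count={:6d} weight={:.3f}".format(rank, n, w))
```

These values could, as an assumption about how one might wire it in, be converted to a tensor and passed through the `weight` argument of `F.nll_loss`, so that mistakes on rare labels cost more than mistakes on frequent ones.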

#### 3. For now, try transfer learning with a simple model (VGG, Inception, etc.).
* Transfer learning still needs to be tested
* In the long run, other techniques need to be analyzed, as 의령 suggested


## 1. Issue: the gap between the data counts of the rank-1 and rank-10 classes is also too large. We need to find out whether this is a problem.

### Model setup

In [49]:
!pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl



In [50]:
!pip3 install torchvision

Collecting pillow>=4.1.1 (from torchvision)
  Downloading Pillow-5.0.0-cp36-cp36m-manylinux1_x86_64.whl (5.9MB)
    100% |████████████████████████████████| 5.9MB 230kB/s
Installing collected packages: pillow
  Found existing installation: Pillow 4.0.0
    Uninstalling Pillow-4.0.0:
      Successfully uninstalled Pillow-4.0.0
Successfully installed pillow-5.0.0


In [51]:
!pip install Pillow==4.0.0
!pip install PIL
!pip install image

Collecting Pillow==4.0.0
  Downloading Pillow-4.0.0-cp36-cp36m-manylinux1_x86_64.whl (5.6MB)
    100% |████████████████████████████████| 5.6MB 236kB/s
Installing collected packages: Pillow
  Found existing installation: Pillow 5.0.0
    Uninstalling Pillow-5.0.0:
      Successfully uninstalled Pillow-5.0.0
Successfully installed Pillow-4.0.0
Collecting PIL
  Could not find a version that satisfies the requirement PIL (from versions: )
No matching distribution found for PIL


In [0]:
# load library
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable

class argument():
  def __init__(self):
    pass

args = argument()
args.batch_size = 64
args.test_batch_size = 1000
args.epochs = 10
args.lr = 0.01
args.momentum = 0.5
args.no_cuda = False
args.seed = 1
args.log_interval = 10

# Training settings: use CUDA only when it is available and not disabled
args.cuda = not args.no_cuda and torch.cuda.is_available()

torch.manual_seed(args.seed)
if args.cuda:
    torch.cuda.manual_seed(args.seed)

# load dataset
kwargs = {'num_workers': 4, 'pin_memory': True} if args.cuda else {}
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=args.test_batch_size, shuffle=True, **kwargs)

In [0]:
# MNIST CNN model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
if args.cuda:
    model.cuda()

optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.data[0]))

def test():
    model.eval()
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output = model(data)
        test_loss += F.nll_loss(output, target, size_average=False).data[0] # sum up batch loss
        pred = output.data.max(1, keepdim=True)[1] # get the index of the max log-probability
        correct += pred.eq(target.data.view_as(pred)).long().cpu().sum()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

### MNIST test (default settings)

In [17]:
for epoch in range(1, args.epochs + 1):
    train(epoch)
    test()

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!

Test set: Average loss: 0.1974, Accuracy: 9426/10000 (94%)




Test set: Average loss: 0.1262, Accuracy: 9612/10000 (96%)




Test set: Average loss: 0.0950, Accuracy: 9695/10000 (97%)




Test set: Average loss: 0.0861, Accuracy: 9729/10000 (97%)




Test set: Average loss: 0.0716, Accuracy: 9779/10000 (98%)




Test set: Average loss: 0.0657, Accuracy: 9791/10000 (98%)




Test set: Average loss: 0.0612, Accuracy: 9807/10000 (98%)




Test set: Average loss: 0.0581, Accuracy: 9811/10000 (98%)




Test set: Average loss: 0.0552, Accuracy: 9840/10000 (98%)




Test set: Average loss: 0.0567, Accuracy: 9830/10000 (98%)



### MNIST test (balanced train dataset)

In [0]:
def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N / float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight

# For unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(train_loader.dataset, 10)
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

train_loader = torch.utils.data.DataLoader(train_loader.dataset, batch_size=args.batch_size, shuffle = False,
                                                             sampler = sampler, **kwargs)
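
The weights handed to `WeightedRandomSampler` are per-sample, so the probability of drawing a given class is proportional to (class count) x (per-sample weight). The sketch below checks that the `N / count` rule used in the cell above yields equal class probabilities. The helper name is hypothetical, and the class counts are the commonly reported MNIST training counts, which this notebook does not print.

```python
def class_sampling_probabilities(class_counts, weight_per_class):
    """Probability that one draw lands in each class when every sample of
    class c carries the per-sample weight weight_per_class[c]."""
    mass = [n * w for n, w in zip(class_counts, weight_per_class)]
    total = sum(mass)
    return [m / total for m in mass]

# Commonly reported MNIST training-set counts for digits 0..9
# (assumption: not printed anywhere in this notebook)
mnist_train_counts = [5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]

n_total = float(sum(mnist_train_counts))
# Same rule as weight_per_class in make_weights_for_balanced_classes
balanced_weights = [n_total / n for n in mnist_train_counts]

probs = class_sampling_probabilities(mnist_train_counts, balanced_weights)
print([round(p, 3) for p in probs])  # every class comes out at 0.1
```

Because MNIST is already close to balanced, this sampler changes the class mix only slightly, which is consistent with the nearly identical accuracy of the default and balanced runs.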

In [60]:
for epoch in range(1, args.epochs + 1):
    train(epoch)
    test()


Test set: Average loss: 0.2053, Accuracy: 9396/10000 (94%)




Test set: Average loss: 0.1255, Accuracy: 9589/10000 (96%)




Test set: Average loss: 0.1019, Accuracy: 9685/10000 (97%)




Test set: Average loss: 0.0886, Accuracy: 9717/10000 (97%)




Test set: Average loss: 0.0768, Accuracy: 9752/10000 (98%)




Test set: Average loss: 0.0680, Accuracy: 9797/10000 (98%)




Test set: Average loss: 0.0635, Accuracy: 9813/10000 (98%)




Test set: Average loss: 0.0588, Accuracy: 9819/10000 (98%)




Test set: Average loss: 0.0543, Accuracy: 9829/10000 (98%)




Test set: Average loss: 0.0543, Accuracy: 9839/10000 (98%)



### MNIST test (imbalanced train dataset), per-class sampling probability: [0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]

In [0]:
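def make_weights_for_balanced_classes(images, nclasses):
    # Name kept from the balanced cell; here the per-sample weights deliberately skew sampling toward class 0
    weight_per_class = [0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]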
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight

# For unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(train_loader.dataset, 10)
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

train_loader = torch.utils.data.DataLoader(train_loader.dataset, batch_size=args.batch_size, shuffle = False,
                                                             sampler = sampler, **kwargs)

In [100]:
for epoch in range(1, args.epochs + 1):
    train(epoch)
    test()


Test set: Average loss: 1.3606, Accuracy: 6676/10000 (67%)




Test set: Average loss: 0.7211, Accuracy: 8115/10000 (81%)






Test set: Average loss: 0.5832, Accuracy: 8476/10000 (85%)




Test set: Average loss: 0.4492, Accuracy: 8792/10000 (88%)




Test set: Average loss: 0.4016, Accuracy: 8889/10000 (89%)




Test set: Average loss: 0.3461, Accuracy: 9092/10000 (91%)




Test set: Average loss: 0.3003, Accuracy: 9248/10000 (92%)


Test set: Average loss: 0.2711, Accuracy: 9287/10000 (93%)




Test set: Average loss: 0.2579, Accuracy: 9323/10000 (93%)




Test set: Average loss: 0.2524, Accuracy: 9354/10000 (94%)



### MNIST test (imbalanced train dataset), per-class sampling probability: [0.99, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001]

In [0]:
def make_weights_for_balanced_classes(images, nclasses):
    weight_per_class = [0.99, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001]
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight

# For unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(train_loader.dataset, 10)
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

train_loader = torch.utils.data.DataLoader(train_loader.dataset, batch_size=args.batch_size, shuffle = False,
                                                             sampler = sampler, **kwargs)
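
To get a feel for how little minority data the model sees with the 0.99 / 0.001 weights in the cell above, the sketch below estimates the expected number of draws per class in one epoch of 60,000 samples. The helper name is hypothetical, and the class counts are the commonly reported MNIST training counts, which this notebook does not print.

```python
def expected_draws_per_class(class_counts, weight_per_class, num_draws):
    """Expected number of draws per class when sampling with replacement
    using one per-sample weight per class."""
    mass = [n * w for n, w in zip(class_counts, weight_per_class)]
    total = sum(mass)
    return [num_draws * m / total for m in mass]

# Commonly reported MNIST training-set counts for digits 0..9
# (assumption: not printed anywhere in this notebook)
mnist_train_counts = [5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]
skewed_weights = [0.99] + [0.001] * 9

draws = expected_draws_per_class(
    mnist_train_counts, skewed_weights, sum(mnist_train_counts))
for digit, d in enumerate(draws):
    print("digit {}: about {:.0f} draws per epoch".format(digit, d))
```

Under these assumptions digit 0 accounts for more than 59,000 of the 60,000 draws, and each other digit is drawn only a few dozen times per epoch.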

In [103]:
for epoch in range(1, args.epochs + 1):
    train(epoch)
    test()


Test set: Average loss: 3.4205, Accuracy: 1115/10000 (11%)






Test set: Average loss: 2.8094, Accuracy: 2326/10000 (23%)




Test set: Average loss: 2.5433, Accuracy: 2868/10000 (29%)




Test set: Average loss: 2.3452, Accuracy: 3618/10000 (36%)




Test set: Average loss: 2.1482, Accuracy: 4358/10000 (44%)




Test set: Average loss: 1.8333, Accuracy: 4662/10000 (47%)






Test set: Average loss: 1.8095, Accuracy: 5313/10000 (53%)




Test set: Average loss: 1.6408, Accuracy: 5689/10000 (57%)




Test set: Average loss: 1.4963, Accuracy: 6200/10000 (62%)




Test set: Average loss: 1.3802, Accuracy: 6528/10000 (65%)



### MNIST test (imbalanced train dataset), per-class sampling probability: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [0]:
def make_weights_for_balanced_classes(images, nclasses):
    weight_per_class = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight

# For unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(train_loader.dataset, 10)
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

train_loader = torch.utils.data.DataLoader(train_loader.dataset, batch_size=args.batch_size, shuffle = False,
                                                             sampler = sampler, **kwargs)

In [89]:
for epoch in range(1, args.epochs + 1):
    train(epoch)
    test()


Test set: Average loss: 41.0281, Accuracy: 980/10000 (10%)






Test set: Average loss: 42.3182, Accuracy: 980/10000 (10%)




Test set: Average loss: 43.5711, Accuracy: 980/10000 (10%)




Test set: Average loss: 44.3967, Accuracy: 980/10000 (10%)




Test set: Average loss: 45.2906, Accuracy: 980/10000 (10%)






Test set: Average loss: 46.2153, Accuracy: 980/10000 (10%)




Test set: Average loss: 46.9251, Accuracy: 980/10000 (10%)




Test set: Average loss: 47.5256, Accuracy: 980/10000 (10%)




Test set: Average loss: 48.2159, Accuracy: 980/10000 (10%)




Test set: Average loss: 48.9310, Accuracy: 980/10000 (10%)
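
The accuracy stays fixed at 980/10000 while the loss keeps growing, which is what one would expect from a model that only ever sees digit 0 and therefore predicts 0 for every input. A small worked check follows; the helper name is hypothetical, and the class counts are the commonly reported MNIST test counts, which this notebook does not print.

```python
# Commonly reported MNIST test-set counts for digits 0..9
# (assumption: not printed anywhere in this notebook)
mnist_test_counts = [980, 1135, 1032, 1010, 982, 892, 958, 1028, 974, 1009]

def constant_predictor_accuracy(class_counts, predicted_class):
    """Return (correct, total) for a classifier that always outputs predicted_class."""
    return class_counts[predicted_class], sum(class_counts)

correct, total = constant_predictor_accuracy(mnist_test_counts, 0)
print("{}/{} ({:.0f}%)".format(correct, total, 100.0 * correct / total))
# 980/10000 (10%)
```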



### MNIST test (imbalanced train dataset), per-class sampling probability (same as the Kaggle Top 10 labels):
[0.24707459, 0.2461469 , 0.1149304 , 0.09066322, 0.0651395, 0.05471404, 0.04666915, 0.04526535, 0.0452359 , 0.04416096]

In [119]:
import numpy as np

# Image counts of the Kaggle Top 10 labels, normalized to ratios
a = [50337,50148,23415,18471,13271,11147,9508,9222,9216,8997]
na = np.array(a)
na/na.sum()

array([0.24707459, 0.2461469 , 0.1149304 , 0.09066322, 0.0651395 ,
       0.05471404, 0.04666915, 0.04526535, 0.0452359 , 0.04416096])

In [0]:
def make_weights_for_balanced_classes(images, nclasses):
    weight_per_class = [0.24707459, 0.2461469 , 0.1149304 , 0.09066322, 0.0651395, 0.05471404, 0.04666915, 0.04526535, 0.0452359 , 0.04416096]
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight

# For unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(train_loader.dataset, 10)
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

train_loader = torch.utils.data.DataLoader(train_loader.dataset, batch_size=args.batch_size, shuffle = False,
                                                             sampler = sampler, **kwargs)

In [122]:
for epoch in range(1, args.epochs + 1):
    train(epoch)
    test()


Test set: Average loss: 0.3184, Accuracy: 9092/10000 (91%)




Test set: Average loss: 0.1730, Accuracy: 9517/10000 (95%)






Test set: Average loss: 0.1413, Accuracy: 9587/10000 (96%)




Test set: Average loss: 0.1192, Accuracy: 9647/10000 (96%)




Test set: Average loss: 0.1069, Accuracy: 9689/10000 (97%)




Test set: Average loss: 0.0945, Accuracy: 9717/10000 (97%)






Test set: Average loss: 0.0949, Accuracy: 9695/10000 (97%)




Test set: Average loss: 0.0898, Accuracy: 9728/10000 (97%)




Test set: Average loss: 0.0867, Accuracy: 9731/10000 (97%)




Test set: Average loss: 0.0822, Accuracy: 9767/10000 (98%)

