# Basic Logistic Regression

This notebook implements a basic logistic regresion on our RESISC45 image dataset. To simplify implementation,
we used the ScikitLearn `LogisticRegression` class.

<a href="https://colab.research.google.com/github/cs449s23/project-6-layers-deep/blob/main/basicLogistic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

To start, we import everything needed, and download our dataset.

In [3]:
%%capture
import os
from pathlib import Path

import torch
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder

import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

import gdown

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import numpy as np

RESISC45_DIR = Path(".") / "NWPU-RESISC45"
if not RESISC45_DIR.exists():
    gdown.download(
        id="1nd0R9iljzkWd7Hhfyp2tH55KxAsKrzYj",
        output="NWPU-RESISC45.rar",
        quiet=False,
    )
    !unrar x NWPU-RESISC45.rar

ModuleNotFoundError: No module named 'torch'

Now, we create helper functions for training.

Currently, the only one needed is a data splitter to get our train and test data.

In [1]:
# uses sklearn train_test_split to randomly return the desired train test split
def split_data(data, train_size=0.8):
    """Split the data into test and train sets.

    Uses sklearn's `train_test_split`.

    Returns (train_indices, test_indicies).
    """
    target_array = data.targets

    train_indices, test_indices = train_test_split(
        range(len(target_array)),
        train_size=train_size,
        random_state=69,
        stratify=target_array,
    )

    return train_indices, test_indices

## Training

Now, we move on to training our model.

### Hyperparameters and Data Wrangling

The first step is setting up our hyperparameters and reading our data in.
As part of our data pipeline, we resize all images, convert them to tensors,
and normalize them. The normalization coefficients were taken from ImageNet.
We then split the data into train and test sets, and set up `DataLoader`s to
automatically read in, batch, and sample from our data.

We then make sure the data is in the proper format for sklearn, which involves converting it to a Numpy array and
reshaping it to a 2D array (from the $256 \times 256 \times 3$ image we read in).

In [2]:
image_size = (256, 256)
batch_size = 32

data_transforms = transforms.Compose(
    [
        transforms.Resize(image_size),
        transforms.ToTensor(),
        transforms.Normalize(
            [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
        ),  # normalization
    ]
)

data = ImageFolder(RESISC45_DIR, transform=data_transforms)

# get indices of split
train_indices, test_indices = split_data(data)

# Create a sampler out of the indices
train_sampler = torch.utils.data.sampler.SubsetRandomSampler(train_indices)
test_sampler = torch.utils.data.sampler.SubsetRandomSampler(test_indices)

# give sampler to dataloader
# batch size is the whole train/test set so we can transform it to numpy
train_loader = DataLoader(data, batch_size=len(train_indices), sampler=train_sampler)
test_loader = DataLoader(data, batch_size=len(test_indices), sampler=test_sampler)

NameError: name 'transforms' is not defined

In [None]:
# convert tensors to numpy
Xtrain = next(iter(train_loader))[0].numpy()
ytrain = next(iter(train_loader))[1].numpy()

In [None]:
# convert tensors to numpy
Xtest = next(iter(test_loader))[0].numpy()
ytest = next(iter(test_loader))[1].numpy()

In [None]:
# reshape the data to be 2d array as sklearn logistic does not play well with 3d
reshape_Xtrain = Xtrain.reshape(Xtrain.shape[0], -1)
reshape_Xtest = Xtest.reshape(Xtest.shape[0], -1)

### Training

Thankfully, sklearn handles a lot of the implementation of the LogisticRegression for us. We
simply train and then view the results.

In [None]:
# fit model
log_model = LogisticRegression().fit(reshape_Xtrain, ytrain)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
# 80-20 split results
print(f"Train Accuracy: {accuracy_score(ytrain, log_model.predict(reshape_Xtrain))}")
print(f"Test Accuracy: {accuracy_score(ytest, log_model.predict(reshape_Xtest))}")

Train Accuracy: 0.8502777777777778
Test Accuracy: 0.017142857142857144


## Expirementation

The below is the same as above, but we use a 20/80 test/train split (rather than
the conventional 80/20) to view overfitting.

In [7]:
# everything below this cell is same as above with the change of using 20% train
data = ImageFolder(RESISC45_DIR, transform=data_transforms)

# get indices of split
train_indices, test_indices = split_data(data, train_size=0.2)

# Create a sampler out of the indices
train_sampler = torch.utils.data.sampler.SubsetRandomSampler(train_indices)
test_sampler = torch.utils.data.sampler.SubsetRandomSampler(test_indices)

# give sampler to dataloader
train_loader = DataLoader(data, batch_size=len(train_indices), sampler=train_sampler)
test_loader = DataLoader(data, batch_size=len(test_indices), sampler=test_sampler)

In [8]:
# convert tensors to numpy
Xtrain = next(iter(train_loader))[0].numpy()
ytrain = next(iter(train_loader))[1].numpy()
Xtest = next(iter(test_loader))[0].numpy()
ytest = next(iter(test_loader))[1].numpy()

# 2d array
reshape_Xtrain = Xtrain.reshape(Xtrain.shape[0], -1)
reshape_Xtest = Xtest.reshape(Xtest.shape[0], -1)

log_model = LogisticRegression().fit(reshape_Xtrain, ytrain)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [9]:
# 20-80 split
print(f"Train Accuracy: {accuracy_score(ytrain, log_model.predict(reshape_Xtrain))}")
print(f"Test Accuracy: {accuracy_score(ytest, log_model.predict(reshape_Xtest))}")

Train Accuracy: 0.9998412698412699
Test Accuracy: 0.02158730158730159
