In [1]:
from sklearn.datasets import make_classification
import torch

# Parameters of make_classification:

*  **n_samples=10**: Creates a dataset containing 10 data points (samples).
*  **n_features=2**: Each data point will be described by 2 features (attributes or characteristics).
*  **n_informative=2**:  Specifies that 2 out of the 2 features are relevant for classification, meaning all features contribute useful information to distinguish between classes.
*  **n_redundant=0**:  Indicates there are no redundant features; all features provide unique information.
*  **n_classes=2**: Sets up a binary classification problem, aiming to categorize data into 2 distinct classes.
*  **random_state=42**: Ensures consistent dataset generation for reproducibility. Using a specific number (like 42) guarantees the same data is produced each time the code is executed.

In [2]:
X, y = make_classification(
    n_samples=10,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=2,
    random_state=42
)

In [3]:
X

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [4]:
X.shape

(10, 2)

In [5]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [6]:
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

In [7]:
from torch.utils.data import Dataset, DataLoader

### **`__init__` Method: Initialization**

The `__init__` method is a special method in Python classes that is called when an object of the class is created. It is used to initialize the object's attributes.

In this case, it takes the data (`X`) and the corresponding labels or targets (`y`) as input and stores them as attributes of the `CustomDataset` object.

*   `self.X = X`: This line stores the input data `X` in an attribute named `X` within the `CustomDataset` object.
*   `self.y = y`: This line stores the labels `y` in an attribute named `y` within the `CustomDataset` object.

### **`__len__` Method: Getting the Dataset Size**

The `__len__` method is another special method that allows you to get the length or size of the dataset using the built-in `len()` function.

It simply returns the length of the data (`X`) stored in the dataset, which represents the total number of data points.

### **`__getitem__` Method: Accessing Data by Index**

The `__getitem__` method is used to retrieve a specific data point from the dataset using its index (`idx`).

It takes an index `idx` as input and returns the corresponding data point (`self.X[idx]`) and its label (`self.y[idx]`).

This method allows you to access data from the dataset using indexing, like `dataset[0]` to get the first data point.

In [8]:
class CustomDataset(Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels

  def __len__(self):
    return len(self.features)

  def __getitem__(self, idx):
    # can apply different kinds of transformation here
    return self.features[idx], self.labels[idx]

In [9]:
dataset = CustomDataset(X, y)

In [10]:
len(dataset)

10

In [11]:
dataset[8]

(tensor([1.8997, 0.8344]), tensor(1))

num_workers in PyTorch DataLoader

The `num_workers` argument controls how many subprocesses are used for data loading. 0 means the main process handles it. Higher values create worker processes for parallel loading, potentially speeding up training, especially with large datasets or complex transformations.

Choosing the Right Value:

- Fast data loading (e.g., images in RAM): `num_workers = 0` might be enough.
- Slow data loading: Experiment with increasing `num_workers`, starting with the number of CPU cores and adjusting based on performance.
- Colab environments: Start with a smaller value and increase if needed due to resource limitations.

Example:

dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

This uses 4 worker processes for parallel data loading.

Considerations:

- Data Transformations: Multiple workers can greatly benefit datasets with complex transformations in their `__getitem__` method.
- Memory: Increasing `num_workers` increases memory usage. Ensure sufficient RAM.
- I/O Bottleneck: If data is on a slow disk, increasing `num_workers` might not help much if the disk becomes the bottleneck.

In [12]:
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

**What is collate_fn?**

- It's a function you give to the DataLoader in PyTorch.
- It helps the DataLoader organize your data into batches for your model.

**Why use it?**

- **Different Lengths:** If your data items have varying lengths (like text sentences), collate_fn can be used to pad them to the same length before batching.
- **Preprocessing:** You can use collate_fn to apply any special preprocessing steps to your data before creating batches.
- **Different Data Types:** If you have data of different types (like images and text captions), collate_fn helps you structure the batches correctly.

**How to use it:**

1. **Define:** Create a function that takes a list of data samples and combines them into a batch.
2. **Pass to DataLoader:** When you create your DataLoader, provide your function as the `collate_fn` argument.

**Example:**

python def my_collate_fn(batch):

Process the data items in 'batch' to create a batch

return batch

dataloader = DataLoader(dataset, batch_size=32, collate_fn=my_collate_fn)
**Important Points:**

- If you don't specify a collate_fn, the DataLoader uses a default one.
- You'll usually need a custom collate_fn for more complex data handling scenarios.

In [13]:
for batch_features, batch_labels in dataloader:
  print(batch_features)
  print(batch_labels)
  print('-'*30)

tensor([[-2.8954,  1.9769],
        [ 1.8997,  0.8344]])
tensor([0, 1])
------------------------------
tensor([[-0.7206, -0.9606],
        [-0.9382, -0.5430]])
tensor([0, 1])
------------------------------
tensor([[-1.9629, -0.9923],
        [-1.1402, -0.8388]])
tensor([0, 0])
------------------------------
tensor([[ 1.0683, -0.9701],
        [ 1.7774,  1.5116]])
tensor([1, 1])
------------------------------
tensor([[-0.5872, -1.9717],
        [ 1.7273, -1.1858]])
tensor([0, 1])
------------------------------


In [14]:
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [15]:
df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [16]:
df.shape

(569, 33)

In [17]:
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

In [18]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [19]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('diagnosis', axis=1), df['diagnosis'], test_size=0.2, random_state=42)

In [20]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [21]:
X_train

array([[-1.44075296, -0.43531947, -1.36208497, ...,  0.9320124 ,
         2.09724217,  1.88645014],
       [ 1.97409619,  1.73302577,  2.09167167, ...,  2.6989469 ,
         1.89116053,  2.49783848],
       [-1.39998202, -1.24962228, -1.34520926, ..., -0.97023893,
         0.59760192,  0.0578942 ],
       ...,
       [ 0.04880192, -0.55500086, -0.06512547, ..., -1.23903365,
        -0.70863864, -1.27145475],
       [-0.03896885,  0.10207345, -0.03137406, ...,  1.05001236,
         0.43432185,  1.21336207],
       [-0.54860557,  0.31327591, -0.60350155, ..., -0.61102866,
        -0.3345212 , -0.84628745]])

In [22]:
y_train

Unnamed: 0,diagnosis
68,B
181,M
63,B
248,B
60,B
...,...
71,B
106,B
270,B
435,M


In [23]:
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

In [24]:
y_train

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,

In [25]:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

In [26]:
X_train_tensor.shape

torch.Size([455, 30])

In [27]:
X_train_tensor.dtype

torch.float32

In [28]:
from torch.utils.data import TensorDataset, DataLoader

class CustomDataset(Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels

  def __len__(self):
    return len(self.features)

  def __getitem__(self, idx):
    return self.features[idx], self.labels[idx]

In [29]:
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)

In [30]:
train_dataset[10]

(tensor([-0.4976,  0.6137, -0.4981, -0.5310, -0.5769, -0.1749, -0.3622, -0.2849,
          0.4335,  0.1782, -0.3684,  0.5531, -0.3167, -0.4052,  0.0403, -0.0380,
         -0.1804,  0.1648, -0.1217,  0.2308, -0.5004,  0.8194, -0.4692, -0.5331,
         -0.0491, -0.0416, -0.1491,  0.0968,  0.1062,  0.4904]),
 tensor(0.))

In [31]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=True)

In [32]:
import torch.nn as nn

class MySimpleNN(nn.Module):
  def __init__(self, num_features):
    super(MySimpleNN, self).__init__()

    self.layer1 = nn.Linear(num_features, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):
    out = self.layer1(features)
    out = self.sigmoid(out)
    return out

In [33]:
learning_rate = 0.1
epochs = 25

In [34]:
loss_function = nn.BCELoss()

In [36]:
model = MySimpleNN(X_train_tensor.shape[1])

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(epochs):
  for batch_features, batch_labels in train_loader:
    y_pred = model(X_train_tensor)

    loss = loss_function(y_pred, y_train_tensor.view(-1, 1))

    optimizer.zero_grad()

    loss.backward()

    optimizer.step()

  print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Epoch 1, Loss: 0.21448461711406708
Epoch 2, Loss: 0.1609165072441101
Epoch 3, Loss: 0.13801798224449158
Epoch 4, Loss: 0.12474001944065094
Epoch 5, Loss: 0.11583109945058823
Epoch 6, Loss: 0.10932152718305588
Epoch 7, Loss: 0.10429397225379944
Epoch 8, Loss: 0.10025743395090103
Epoch 9, Loss: 0.09692250192165375
Epoch 10, Loss: 0.09410586208105087
Epoch 11, Loss: 0.09168500453233719
Epoch 12, Loss: 0.08957449346780777
Epoch 13, Loss: 0.08771263062953949
Epoch 14, Loss: 0.08605359494686127
Epoch 15, Loss: 0.08456258475780487
Epoch 16, Loss: 0.08321255445480347
Epoch 17, Loss: 0.08198220282793045
Epoch 18, Loss: 0.08085446804761887
Epoch 19, Loss: 0.07981551438570023
Epoch 20, Loss: 0.07885392010211945
Epoch 21, Loss: 0.0779602900147438
Epoch 22, Loss: 0.07712671160697937
Epoch 23, Loss: 0.07634653896093369
Epoch 24, Loss: 0.0756140723824501
Epoch 25, Loss: 0.07492446154356003


In [37]:
model.eval()
accuracy_list = []

with torch.no_grad():
  for batch_features, batch_labels in test_loader:
    y_pred = model(batch_features)
    y_pred = (y_pred > 0.8).float()

    accuracy = (y_pred == batch_labels.view(-1, 1)).float().mean()
    accuracy_list.append(accuracy.item())

overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Overall Accuracy: {overall_accuracy * 100:.2f}%')

Overall Accuracy: 95.05%
