## 1 

Import the libraries we need.

- We know our file is a CSV, so let's import `pandas`.
- I will also import a few more libraries I expect to use. If it is not yet clear why you would need them, skip them for now and add them later once the need becomes clear.


In [None]:
import pandas as pd
import numpy as np
import torch 

Import data and look at it.

In [None]:
df = pd.read_csv('breast_cancer_dataset_raw_manipulated.csv')

In [None]:
df.head()

In [None]:
df.tail()

We can't see all the columns, so let's list the name of each one.

In [None]:
df.columns

## 2

Let's note some observations.

- `Unnamed: 0` seems to be a duplicate of the index column.
- From the `README.txt`, we know the `id` column is not going to help us with classification.

Let's keep looking around and check the `diagnosis` column.

In [None]:
df['diagnosis'].unique()

Let's get the count of each "class".

In [None]:
counts = []
for diagnosis_type in df['diagnosis'].unique():
    count = (df['diagnosis'] == diagnosis_type).sum()
    counts.append(count.item())
    print(f"{diagnosis_type} has {count} occurrences")

print(f"\nTotal count is: {sum(counts)}")
print(f"df is of size: {len(df)}")

Interesting. The total is smaller than the size of `df`, so let's look into the `nan` and figure out why we are not counting it.

In [None]:
for diagnosis_type in df['diagnosis'].unique():
    print(f"{diagnosis_type} is of type: {type(diagnosis_type)}")

In [None]:
type(np.nan)

Let's see if the base case checks out.

In [None]:
np.nan == np.nan


This makes sense now.

We don't catch `np.nan` in the for loop above because `np.nan != np.nan`, so `df['diagnosis'] == np.nan` is always `False`.

`np.nan != np.nan` because NaN (Not a Number) is defined to be unequal to everything, including itself.

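The same behavior can be reproduced on a toy Series. This is a minimal sketch: the values below are invented for illustration and are not taken from the dataset. It shows that an equality comparison never matches NaN, while `isna()` and `value_counts(dropna=False)` do account for it.

```python
import numpy as np
import pandas as pd

# Toy labels, invented for illustration only.
labels = pd.Series(["benign", "malignant", np.nan, "benign", np.nan])

# NaN compares unequal to everything, including itself.
nan_equals_itself = np.nan == np.nan

# An equality mask therefore never selects the missing entries ...
matches_by_equality = int((labels == np.nan).sum())

# ... while isna() finds them.
matches_by_isna = int(labels.isna().sum())

# value_counts can report NaN as its own bucket when dropna=False.
counts_with_nan = labels.value_counts(dropna=False)

print(nan_equals_itself, matches_by_equality, matches_by_isna)
print(counts_with_nan)
```

Using `value_counts(dropna=False)` would likely replace the manual counting loop in one call, with the missing labels included in the total.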

In [None]:
print(f"We are missing: {len(df) - sum(counts)} values. ")

In [None]:
df['diagnosis'].isna().sum()


When we make our classification dataset, we can't have rows without a class, so we must drop these rows.

In [None]:
df = df[~df['diagnosis'].isna()]
len(df), df['diagnosis'].isna().sum()

## 3

Since `Unnamed: 0` is treated as another feature instead of the index that it is, let's go ahead and drop it.

In [None]:
df.columns

In [None]:
df = df.drop(columns=['Unnamed: 0'])
df

## 4

Let's check for duplicate rows.

In [None]:
df.duplicated().sum()

This is where you ask yourself a question that is specific to your dataset:
- *"Does having duplicates make sense in my context?"*

For us, duplicates don't make sense, so we will drop them.

In [None]:
df = df.drop_duplicates()

In [None]:
len(df)

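As a small sketch of what counting and dropping duplicates does, here is a toy frame (invented for illustration, not from the dataset) in which one row repeats. `duplicated()` flags every repeat after the first occurrence, and `drop_duplicates()` keeps that first occurrence.

```python
import pandas as pd

# Toy frame, invented for illustration; rows 0 and 2 are exact duplicates.
toy = pd.DataFrame(
    {"id": [1, 2, 1, 3], "radius": [10.5, 12.0, 10.5, 9.8]}
)

# Counts the repeats only, not the first occurrence of each row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```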
## 5

Let's check for missing data:
- Count the missing values in each column.

- Count the rows that contain at least one missing value.


**5.1** Count the missing values in each column.

In [None]:
print("column name : # missing values\n")
for col in df.columns:
    print(f"{col} : {df[col].isnull().sum()}")
    # print("-"* 50)

**5.2** Count the rows that contain at least one missing value.

In [None]:
(df.isnull().sum(axis=1) > 0).sum()

It looks like 54 rows contain at least one NaN.

There are many ways to deal with missing data:
-  Fill with the mean/median: replacing NaN values with the mean or median of the same column
-  Forward fill (`ffill`): replacing NaN values with the previous non-NaN value in the same column
-  Backward fill (`bfill`): replacing NaN values with the next non-NaN value in the same column
-  Dropping the rows

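The options listed above can be compared side by side on a toy column. This is only a sketch: the numbers are invented for illustration and the variable names are placeholders, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy column with gaps, invented for illustration only.
toy = pd.DataFrame({"radius": [1.0, np.nan, 3.0, np.nan, 8.0]})

# Mean/median fill: the statistics are computed over the non-NaN values 1, 3, 8.
mean_filled = toy["radius"].fillna(toy["radius"].mean())
median_filled = toy["radius"].fillna(toy["radius"].median())

# Forward fill carries the previous non-NaN value forward.
forward_filled = toy["radius"].ffill()

# Backward fill pulls the next non-NaN value backward.
backward_filled = toy["radius"].bfill()

# Dropping removes every row that contains a NaN.
dropped = toy.dropna(axis=0)

print(mean_filled.tolist())
print(median_filled.tolist())
print(forward_filled.tolist())
print(backward_filled.tolist())
print(len(dropped))
```

Forward and backward fill assume that neighboring rows are related (for example, a time series), which is one reason they may be a poor fit for a table of independent samples.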

We will opt for dropping the rows.

In [None]:
df = df.dropna(axis=0)
df

In [None]:
len(df)

## 6

Lets check the data types.

In [None]:
df.dtypes

Those look good.

## 7
Great, now that our data is clean, let's make `X` and `y`.

1. We don't need `id` in `X`.
2. We need `y` to be 0/1, not "benign"/"malignant".

**7.1**

Make the `X` matrix:

- Drop `[id, diagnosis]`

- Turn into numpy array

In [None]:
X = df.drop(columns=['id', 'diagnosis']).to_numpy()
X.shape

**7.2** 

Make the `y` vector:
- Use just column `diagnosis`

- Convert to be 0 or 1

- Turn into numpy array

In [None]:
y = df['diagnosis'].apply(lambda x: 0 if x == 'benign' else 1).to_numpy()
# y = np.where(df['diagnosis'] == 'benign', 0, 1)  # np.where already returns a NumPy array
y.shape

In [None]:
print(f"Shape of X: {X.shape}  and shape of y: {y.shape}")

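One thing to keep in mind with the `lambda` above is that it maps every value other than `'benign'` to 1, so a misspelled label would silently become class 1. A possible alternative, sketched here on toy labels (invented for illustration; `label_map` is a hypothetical name), is an explicit mapping that turns unknown labels into NaN so they can be detected.

```python
import pandas as pd

# Toy labels, invented for illustration; the label strings follow the text above.
diagnosis = pd.Series(["benign", "malignant", "benign", "malignant"])

# Hypothetical explicit mapping from label string to class index.
label_map = {"benign": 0, "malignant": 1}
encoded = diagnosis.map(label_map)

# Labels that are missing from label_map become NaN, so a typo is caught here.
assert encoded.notna().all(), "found a label that is not in label_map"

y_toy = encoded.astype("int64").to_numpy()
print(y_toy)
```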
**7.3** 

Split them up 80/20 for train/test.

In [None]:
# Shuffle the data first; one shared permutation keeps X and y aligned
np.random.seed(42)  # for reproducibility
perm_idxs = np.random.permutation(len(X))
X = X[perm_idxs]
y = y[perm_idxs]

In [None]:
# Split them: the first 80% for training, the rest for testing
train_size = int(len(X) * 0.8 )
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]

In [None]:
print(f"Shape X train: {X_train.shape}  Shape y train: {y_train.shape}")
print(f"Shape X test: {X_test.shape}  Shape y test: {y_test.shape}")

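The shuffle and split steps above can also be wrapped in one function. The sketch below is an illustration rather than the notebook's method: `shuffle_split` is a hypothetical helper, and it uses a local `np.random.default_rng` generator instead of the global `np.random.seed`, so the shuffled order will differ from the cells above. The toy data is built so that each label equals the first feature of its row, which makes it easy to check that `X` and `y` stay aligned.

```python
import numpy as np


def shuffle_split(X, y, train_fraction=0.8, seed=42):
    """Shuffle X and y with one shared permutation, then split them."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    X, y = X[perm], y[perm]
    train_size = int(len(X) * train_fraction)
    return X[:train_size], X[train_size:], y[:train_size], y[train_size:]


# Toy data, invented for illustration: the label equals the row's first feature.
X_toy = np.arange(20).reshape(10, 2)
y_toy = X_toy[:, 0].copy()

X_tr, X_te, y_tr, y_te = shuffle_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)
```

If scikit-learn is available, its `train_test_split` function covers the same ground and can additionally stratify by class.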

**7.4**

We now save our processed data so it can be loaded later.

In [None]:
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)

np.save('X_test.npy', X_test)
np.save('y_test.npy', y_test)

Alternatively, we can save them as `.pt` files.

In [None]:
# torch.save(X_train, 'X_train.pt')
# torch.save(y_train, 'y_train.pt')

# torch.save(X_test, 'X_test.pt')
# torch.save(y_test, 'y_test.pt')
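
To confirm that saved arrays come back unchanged, a quick round trip with `np.save` and `np.load` can help. This is a minimal sketch on a toy array written to a temporary directory; the array and the file name `X_toy.npy` are placeholders, not files from this notebook.

```python
import os
import tempfile

import numpy as np

# Toy array standing in for X_train; invented for illustration.
X_toy = np.arange(6, dtype=np.float64).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_toy.npy")  # hypothetical file name
    np.save(path, X_toy)
    X_loaded = np.load(path)

# Values, shape, and dtype should all survive the round trip.
round_trip_ok = bool(
    np.array_equal(X_toy, X_loaded) and X_toy.dtype == X_loaded.dtype
)
print(round_trip_ok)
```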