# 1. Install sklearn

- See the installation guide at https://scikit-learn.org/stable/install.html. Usually `pip install scikit-learn` is all you need.

- All sub-packages and modules are listed under "Section Navigation" on the left-hand side of the API reference: https://scikit-learn.org/stable/api/index.html
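A quick way to confirm the install worked is to import the package and print its version. This is only a minimal sketch; the variable name `installed_version` is chosen here for illustration.

```python
import sklearn

# The import fails with ModuleNotFoundError if scikit-learn is not installed
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```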

# 2. Load in relevant sklearn modules

You can do this in one of three ways:

- a) load the top-level package using `import sklearn`. Be careful: sklearn is a large package, and depending on the scikit-learn version, `import sklearn` on its own may not load sub-packages such as `sklearn.datasets`, so they may need to be imported explicitly

- b) import just a sub-package, such as `from sklearn import metrics`

- c) import an individual function or class, such as `from sklearn.metrics import confusion_matrix`

More information on the pros and cons of each can be found here: https://discuss.python.org/t/what-is-the-purpose-of-importing-a-package-alone/18433/3
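The three import styles can be compared side by side. The sketch below uses `confusion_matrix` as the example function and checks that every style reaches the same object; the name `same_function` is introduced here for illustration.

```python
# Option a): import the top-level package, plus the sub-package you need
import sklearn
import sklearn.metrics

# Option b): import a sub-package by name
from sklearn import metrics

# Option c): import one function directly
from sklearn.metrics import confusion_matrix

# All three spellings refer to the same function object
same_function = (
    sklearn.metrics.confusion_matrix is metrics.confusion_matrix
    and metrics.confusion_matrix is confusion_matrix
)
print(same_function)
```

The difference is mainly in how much you type at each call site and how clearly the code shows where a name comes from.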

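Because this notebook is exploratory, we will use option a) and load the whole package.

In [2]:
import sklearn
import sklearn.datasets  # imported explicitly so sklearn.datasets works in any scikit-learn version
import sklearn.model_selection  # same for sklearn.model_selection, used in section 4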

# 3. Load data and clean to be in expected format for algorithm

For the purposes of this example, we will use the breast cancer data. It ships with scikit-learn in a sub-package called `sklearn.datasets`, which offers multiple datasets to choose from. The loader for ours is called `load_breast_cancer`.

For binary classification, you usually need your data split into a feature matrix (X or x) and a target vector (y). For example, `LinearSVC` requires the data in this format:

>X: {array-like, sparse matrix} of shape (n_samples, n_features)
>    Training vector, where n_samples is the number of samples and n_features is the number of features.
>
>
>y: array-like of shape (n_samples,)
>    Target vector relative to X.

Source: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC.fit

In practice, a pandas DataFrame (for X) with a pandas Series (for y), or a pair of NumPy arrays, is usually fine.
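To make the expected shapes concrete, here is a small sketch with made-up toy data (the values and the names `X_toy`, `y_toy`, `feature_a` and `feature_b` are hypothetical). It shows the same data as NumPy arrays and as pandas objects.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 samples with 2 features each
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])

print(X_toy.shape)  # (n_samples, n_features)
print(y_toy.shape)  # (n_samples,)

# The same data as pandas objects keeps the same shapes
X_frame = pd.DataFrame(X_toy, columns=["feature_a", "feature_b"])
y_series = pd.Series(y_toy, name="target")
print(X_frame.shape, y_series.shape)
```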

In [4]:
# return_X_y=True returns a (features, target) tuple; as_frame=True makes them pandas objects
data = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=True)

# the first item (index 0) is the feature data: 30 numeric measurements per sample
x = data[0]

# the second item (index 1) is the target: the cancer classification
y = data[1]

# normally for binary classification, the positive case (1) means that something exists, e.g. has cancer,
# and the negative case (0) means that it does not exist, e.g. does not have cancer.
# in this dataset it is reversed: malignant (cancerous) is 0 and benign (non-cancerous) is 1.
# this is unintuitive and confusing to work with, so we swap the labels.
# after this line, malignant is 1 and benign is 0.
y = y.replace({0: 1, 1: 0})
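The label order can be checked directly from the dataset rather than taken on trust. The sketch below loads the data as a `Bunch` (without `return_X_y`) so that `target_names` is available, then compares class counts before and after the swap. The variable names are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)

# target_names is ordered by label: position 0 is label 0, position 1 is label 1
print(list(bunch.target_names))

original = bunch.target
flipped = original.replace({0: 1, 1: 0})

# Count how many samples carry each label before and after the swap
original_counts = original.value_counts().to_dict()
flipped_counts = flipped.value_counts().to_dict()
print("original:", original_counts)
print("flipped: ", flipped_counts)
```

After the swap, the count that belonged to label 0 (malignant) should appear under label 1.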

# 4. Train/test split

In [None]:
chosen_random_state = 4627

# data split
# train_test_split returns four parts: the features (x) split into train and test,
# then the targets (y) split into train and test.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x,  # the features data
    y,  # the target data
    test_size=0.2,  # proportion held out for testing: the train set gets 80% of the data and the test set 20%
    shuffle=True,  # shuffle the data before splitting it
    stratify=None,  # None means no stratification; pass y to preserve the percentage of samples for each class
    random_state=chosen_random_state,  # the seed, so the shuffle is reproducible
)
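To see what `stratify` changes, the sketch below splits a made-up imbalanced dataset twice, once with `stratify=None` (the default) and once with `stratify=y_demo`. The data and all the `*_demo`, `*_plain` and `*_strat` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# Without stratification the class balance in the test set depends on the shuffle
_, _, y_train_plain, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# With stratify=y_demo both sets keep the original 80/20 class balance
_, _, y_train_strat, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print("share of class 1 in test, unstratified:", y_test_plain.mean())
print("share of class 1 in test, stratified:  ", y_test_strat.mean())
```

Stratifying matters most when one class is rare, since an unlucky shuffle could otherwise leave the test set with very few positive cases.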

# 5. Choose your algorithm


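In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# train model (may emit a ConvergenceWarning because the features are not scaled)
linear_svc_model = LinearSVC(random_state=2541).fit(x_train, y_train)

# evaluate model using test data

# print the split settings from section 4 so the chart can be matched to them
print(f"""
        test_size: 0.2
        shuffle: True
        stratify: None
        random_state: {chosen_random_state}
        """)

# this is a chart of the confusion matrix
ConfusionMatrixDisplay.from_estimator(linear_svc_model, x_test, y_test)
plt.show()

# x_train, x_test, y_train, y_test and linear_svc_model stay available for later cells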
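The chart is easier to read once you know how the four cells are laid out. As a minimal sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not model output), `confusion_matrix` returns the same counts as numbers, with rows for the true class and columns for the predicted class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for six samples
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels, ravel() unpacks the cells in this order
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With the labels swapped so that 1 means malignant, a false negative (FN) is a cancerous case predicted as non-cancerous, which is usually the costliest error for this kind of data.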