# Basic ML in Python

This toy notebook shows what loading data and then training and evaluating a simple machine learning model looks like in Python, using pandas and scikit-learn.

In [8]:
# !pip install pandas
import pandas as pd

# !pip install scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [6]:
df = pd.read_csv("./data/cancer.csv")

In [7]:
df

| Unnamed: 0.1 | Unnamed: 0 | mean_perimeter | mean_radius | mean_smoothness | Class | smoothness_to_perimeter | Predictions |
|---|---|---|---|---|---|---|---|
| 0 | 455 | 86.34 | 13.380 | 0.09245 | 1 | 0.001071 | 1 |
| 1 | 456 | 74.87 | 11.630 | 0.09357 | 1 | 0.001250 | 1 |
| 2 | 457 | 84.10 | 13.210 | 0.08791 | 1 | 0.001045 | 1 |
| 3 | 458 | 82.61 | 13.000 | 0.08369 | 1 | 0.001013 | 1 |
| 4 | 459 | 61.68 | 9.755 | 0.07984 | 1 | 0.001294 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 109 | 564 | 142.00 | 21.560 | 0.11100 | 2 | 0.000782 | 2 |
| 110 | 565 | 131.20 | 20.130 | 0.09780 | 2 | 0.000745 | 2 |
| 111 | 566 | 108.30 | 16.600 | 0.08455 | 2 | 0.000781 | 2 |
| 112 | 567 | 140.10 | 20.600 | 0.11780 | 2 | 0.000841 | 2 |
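The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above are most likely old row indices that were written to the CSV and read back as ordinary columns. The following is a minimal sketch of how that happens and one way to avoid it. The tiny `original` frame and the in-memory CSV are hypothetical stand-ins, not the notebook's actual file.

```python
import io

import pandas as pd

# Hypothetical two-row frame whose index carries the original row numbers.
original = pd.DataFrame(
    {"mean_radius": [13.38, 11.63], "Class": [1, 1]},
    index=[455, 456],
)

# to_csv() writes the index as a first column with an empty header.
csv_text = original.to_csv()

# Reading it back without index_col turns that column into "Unnamed: 0".
reread = pd.read_csv(io.StringIO(csv_text))
print(list(reread.columns))

# Passing index_col=0 restores the first column as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

Alternatively, writing the file with `to_csv(index=False)` avoids storing the index in the first place.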


We select the feature columns and the target column, then split the data into a training set and a test set.

In [17]:
feature_columns = ["mean_perimeter", "mean_radius", "mean_smoothness", "smoothness_to_perimeter"]
x = df[feature_columns]
y = df["Class"]

# hold out 20% of the rows for testing; the split is random, so results vary between runs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
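Because the split above is random, rerunning the cell gives different training and test sets. As a hedged sketch, the example below shows how `random_state` makes the split repeatable and how `stratify` keeps the class ratio the same in both parts. The `demo` frame is synthetic and purely illustrative, and the value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 100 rows, 70 of class 1 and 30 of class 2.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "mean_radius": rng.normal(14.0, 3.0, size=100),
        "Class": np.repeat([1, 2], [70, 30]),
    }
)

x_demo = demo[["mean_radius"]]
y_demo = demo["Class"]

# random_state fixes the shuffle; stratify preserves the 70/30 class ratio.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(x_tr), len(x_te))
print(y_te.value_counts().to_dict())
```

Stratifying is mainly useful on small or imbalanced datasets, where a purely random split can leave very few samples of one class in the test set.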

...and build and train our model.

In [33]:
# create a new, but still untrained model
model = LogisticRegression()

# fit the model
model.fit(x_train, y_train)


We can predict labels for the test set

In [39]:
model.predict(x_test)

array([1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1])

...or the class probabilities.

In [40]:
# one row per test sample; column i holds the probability of label model.classes_[i]
model.predict_proba(x_test)

array([[9.86029491e-01, 1.39705090e-02],
       [9.43074000e-01, 5.69259999e-02],
       [8.46651533e-01, 1.53348467e-01],
       [1.60209374e-02, 9.83979063e-01],
       [4.86787492e-01, 5.13212508e-01],
       [9.97404077e-01, 2.59592346e-03],
       [9.88330891e-01, 1.16691088e-02],
       [7.29503910e-01, 2.70496090e-01],
       [9.90863689e-01, 9.13631093e-03],
       [8.52815334e-02, 9.14718467e-01],
       [9.95015413e-01, 4.98458663e-03],
       [3.48847266e-01, 6.51152734e-01],
       [9.49644346e-01, 5.03556545e-02],
       [9.98375552e-01, 1.62444780e-03],
       [9.99043965e-01, 9.56034825e-04],
       [9.90644158e-01, 9.35584160e-03],
       [9.99059265e-01, 9.40734717e-04],
       [9.86075953e-01, 1.39240466e-02],
       [9.80988870e-01, 1.90111303e-02],
       [9.50612954e-01, 4.93870464e-02],
       [9.01245151e-01, 9.87548487e-02],
       [9.93673219e-01, 6.32678088e-03],
       [9.84604418e-01, 1.53955819e-02]])
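To make the link between the probability columns and the predicted labels concrete, here is a small sketch on hypothetical one-feature data (the arrays `X_demo` and `y_demo` are invented for illustration). It checks that picking the most probable column and looking it up in `classes_` gives the same labels as `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data with labels 1 and 2, as in the notebook.
X_demo = np.array([[9.0], [10.0], [11.0], [12.0], [18.0], [19.0], [20.0], [21.0]])
y_demo = np.array([1, 1, 1, 1, 2, 2, 2, 2])

clf = LogisticRegression().fit(X_demo, y_demo)

# predict_proba returns an array of shape (n_samples, n_classes).
proba = clf.predict_proba(X_demo)

# classes_ gives the label that each probability column refers to.
print(clf.classes_)

# Taking the most probable column and mapping it through classes_
# reproduces the labels returned by predict().
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels_from_proba, clf.predict(X_demo)))
```

Each row of the probability array sums to 1, so in the two-class case the second column is simply one minus the first.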

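We can also evaluate the model automatically. For a classifier, `score` returns the mean accuracy on the given data, so 1.0 means that every test sample was classified correctly.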

In [36]:
model.score(x_test, y_test)

1.0
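As a sketch of what this accuracy value means, the example below computes it by hand and with `accuracy_score`, and adds a confusion matrix to show which labels get mixed up. The `y_true` and `y_pred` arrays are hypothetical and deliberately contain two mistakes. They are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (not taken from the notebook).
y_true = np.array([1, 1, 1, 2, 2, 1, 2, 1])
y_pred = np.array([1, 1, 2, 2, 2, 1, 1, 1])

# Accuracy is the fraction of predictions that match the true labels.
manual = (y_true == y_pred).mean()
acc = accuracy_score(y_true, y_pred)
print(manual, acc)

# Rows are true labels, columns are predicted labels, both in the order [1, 2].
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
print(cm)
```

On a small test set, a confusion matrix is often more informative than a single accuracy number because it shows which class the errors come from.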