# Numpy and Pandas: Essential data science packages

## Why NumPy and Pandas over regular Python arrays?
In Python, a vector can be represented in many ways, the simplest being a regular Python list of numbers. Since machine learning requires a lot of numerical computation, it is much better to use NumPy’s ndarray, which provides convenient and optimized implementations of essential mathematical operations on vectors.

## Numpy
NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open-source Python module that provides fast mathematical computation on arrays and matrices. Since arrays and matrices are an essential part of the machine learning ecosystem, NumPy, together with modules like Scikit-learn, Pandas, Matplotlib and TensorFlow, forms the core of the Python machine learning ecosystem.


[Source](https://cloudxlab.com/blog/numpy-pandas-introduction/#:~:text=Similar%20to%20NumPy%2C%20Pandas%20is,2d%20table%20object%20called%20Dataframe.)

Remember when we talked about multiplying python lists with a scalar?


In [None]:
!pip install numpy

In [1]:
import numpy as np

With Python:

In [2]:
a_list = [1, 3, 4, 10, -42.3]
a_list * 5

[1, 3, 4, 10, -42.3,
 1, 3, 4, 10, -42.3,
 1, 3, 4, 10, -42.3,
 1, 3, 4, 10, -42.3,
 1, 3, 4, 10, -42.3]

With Numpy:

In [3]:
a_np_array = np.array([1, 3, 4, 10, -42.3])
a_np_array = a_np_array * 5
a_np_array

array([   5. ,   15. ,   20. ,   50. , -211.5])

In [4]:
a_np_array = a_np_array + 1
a_np_array

array([   6. ,   16. ,   21. ,   51. , -210.5])

This is super important for algorithms like those of machine learning, as these rely heavily on vectors and matrices.

### Speed: Simple Python vs. Numpy

In [5]:
a_list

[1, 3, 4, 10, -42.3]

In [6]:
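%%timeit
# write into a new list: overwriting a_list would multiply its values again on every timing run
scaled_list = [0] * len(a_list)
for i in range(len(a_list)):
    scaled_list[i] = a_list[i] * 5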

In [None]:
%%timeit
a_np_array * 5

#### => Numpy is ~150 times faster!
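
A five-element list is too small to show the difference reliably, because the fixed overhead of each call dominates. The following is a minimal sketch of the same comparison on one million values, using the standard-library `timeit` module so that it also runs outside a notebook. The names `big_list` and `big_array` are made up for this example, and the speed-up you measure depends on your machine and on the array size.

```python
import timeit

import numpy as np

# One million values instead of five, so per-call overhead no longer dominates
n = 1_000_000
big_list = list(range(n))
big_array = np.arange(n)

# Each statement builds a new result and leaves its input untouched,
# so repeated timing runs always measure the same work
list_seconds = timeit.timeit(lambda: [x * 5 for x in big_list], number=5)
array_seconds = timeit.timeit(lambda: big_array * 5, number=5)

print(f"list comprehension: {list_seconds:.4f} s")
print(f"numpy array:        {array_seconds:.4f} s")
print(f"speed-up:           {list_seconds / array_seconds:.0f}x")
```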

### Matrices

In [7]:
A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])
print(A)
print() # for a blank line
print(B)

[[ 2  4]
 [ 5 -6]]

[[ 9 -3]
 [ 3  6]]


In [8]:
C = A + B      # element wise addition
print(C)

[[11  1]
 [ 8  0]]


#### Hadamard product
If we just multiply two matrices in NumPy using "*", then we multiply them element by element. This is also called the Hadamard product. While you might not have come across the term in your studies, the concept is very intuitive.

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/4eb9bb54b2820fb3583901ec05bc4b474b6d90bc">

In [9]:
A*B

array([[ 18, -12],
       [ 15, -36]])

#### Dot Product
If we want the matrix product instead (for two vectors this is the dot product, also called the inner product), we can use the NumPy function `np.dot`.

In [None]:
np.dot(A,B)
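
As a small sketch of what this call computes: for two-dimensional arrays `np.dot` is the matrix product, and the `@` operator gives the same result. The one-dimensional vectors at the end are made up for this example to show the inner-product case.

```python
import numpy as np

A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])

product = np.dot(A, B)   # matrix product for 2-D arrays
same = A @ B             # the @ operator computes the same thing

print(product.tolist())  # [[30, 18], [27, -51]]

# for 1-D vectors np.dot is the inner product: 1*3 + 2*4 = 11
inner = np.dot(np.array([1, 2]), np.array([3, 4]))
print(inner)
```

The first entry, for example, is 2\*9 + 4\*3 = 30, which differs from the element-wise result 18 above.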

## Pandas
Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Unlike the NumPy library, which provides objects for multi-dimensional arrays, Pandas provides an in-memory 2D table object called a DataFrame. It is like a spreadsheet with column names and row labels.

The most important things you will do with it:
* Load .csv data into Python and save data to a local file
* Get single or multiple value ranges from the tables
* Manipulate multiple rows or columns
* Split tables apart, or join and merge them together
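
To give a first impression of these four tasks, here is a minimal sketch on two tiny tables. All data and names (`patients`, `ages`) are hypothetical and only serve as an illustration; the CSV is read from a string so that the example does not need a file on disk.

```python
import io

import pandas as pd

# Hypothetical mini table in CSV form, made up for illustration
csv_text = "patient,radius,label\na,18.0,2\nb,8.0,1\nc,20.5,2\n"
patients = pd.read_csv(io.StringIO(csv_text))   # load CSV data (from a string instead of a file)

radii = patients.loc[0:1, "radius"]             # get a range of values from one column
patients["radius"] = patients["radius"] * 10    # manipulate a whole column at once

# a second hypothetical table that shares the "patient" column
ages = pd.DataFrame({"patient": ["a", "b", "c"], "age": [54, 61, 47]})
merged = patients.merge(ages, on="patient")     # merge the two tables on the shared column

print(merged)
```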

## Analysing the Wisconsin Breast Cancer dataset
Let's do some small-scale data science and play around with a small dataset.

### General Dataset information
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

The dataset (and more information) can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

**Attribute Information:**

* Class: 1 = Benign; 2 = Malignant

**Ten real-valued features are computed for each cell nucleus:**

1. radius (mean of distances from center to points on the perimeter)
1. texture (standard deviation of gray-scale values)
1. perimeter
1. area
1. smoothness (local variation in radius lengths)
1. compactness (perimeter^2 / area - 1.0)
1. concavity (severity of concave portions of the contour)
1. concave points (number of concave portions of the contour)
1. symmetry
1. fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.

All feature values are recoded with four significant digits.

**Class distribution:** 357 benign, 212 malignant

In [11]:
import pandas as pd

### Reading in a dataset from a local data file

In [16]:
?pd.read_csv

In [12]:
df = pd.read_csv("./data/wdbc_csv_dirty.csv") 
df

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,Class
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,2
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,2
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,2
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,2
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,2
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,2
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,2
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,2


Depending on how your file was exported, you might have issues loading the data like this. "International" CSV files use commas (,) as separators, but as Germans like using commas to denote decimals, "German" CSVs use semicolons (;) as separators. Pandas assumes a comma unless told otherwise, but you can always specify which separator to use by adding the `sep` parameter as follows.

You can find this, along with more detail and further parameters, in the pandas documentation.

In [None]:
df = pd.read_csv("./data/wdbc_csv_dirty.csv", sep=",") 
df
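
For the "German" case described above, a minimal sketch could look like this. The two data rows are made up and read from a string rather than a file; `sep` sets the column separator and `decimal` tells pandas which character marks the decimals.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with decimal commas
german_csv = "mean_radius;mean_area\n17,99;1001,0\n20,57;1326,0\n"

# sep sets the column separator, decimal sets the decimal mark
df_german = pd.read_csv(io.StringIO(german_csv), sep=";", decimal=",")

print(df_german)
print(df_german.dtypes)  # both columns should be parsed as floats
```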

That's a lot of data. We can use `.head()` and `.tail()` to see only the first or last rows. This is nicer if we just want a peek at the data and don't want to clutter our screen.

In [17]:
df.head()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,Class
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,2
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,2
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,2
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,2


In [18]:
df.tail()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,Class
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,2
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,2
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,2
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,2
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,1


Let's get some general information about the contents of the dataframe. How many rows and columns are there? We can use the `.shape` attribute for this (note: no parentheses, as it is not a method). It holds a tuple with the number of rows and columns. This is normally one of the first things I look at when I open a new dataset.

In [21]:
df.shape # dim(df)

(569, 31)

In [22]:
num_rows, num_cols = df.shape
print(f"There are {num_rows} rows and {num_cols} columns")

There are 569 rows and 31 columns


Another command that I like a lot is `.info()`. It gives you some general information on the data type of each column and how many non-null values it contains. We know that we have 569 rows, so if a column has e.g. 565 non-null values, then we already know that there are 4 missing values in that column.

In [None]:
df.info()
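
If you would rather see the number of missing values directly instead of subtracting the non-null count from the row count, `.isna().sum()` does the arithmetic for you. The sketch below uses a made-up toy frame (`toy`) in place of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries
toy = pd.DataFrame({
    "mean_radius": [17.99, 20.57, np.nan, 11.42],
    "mean_smoothness": [0.1184, np.nan, 0.1096, 0.1425],
    "Class": [2, 2, 2, 1],
})

# isna() marks every missing cell with True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```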

## Averages and sums
With pandas we can also quickly get statistics on the columns of the dataframe through built-in functions

In [23]:
df.sum()

mean_radius                  8038.429000
mean_texture                10917.120000
mean_perimeter              52226.780000
mean_area                  372111.900000
mean_smoothness                54.701700
mean_compactness               59.370020
mean_concavity                 50.108161
mean_concave_points            27.749564
mean_symmetry                 102.907200
mean_fractal_dimension         35.592410
radius_error                  229.405900
texture_error                 687.703200
perimeter_error              1626.202700
area_error                  22839.468000
smoothness_error                4.000546
compactness_error              14.393671
concavity_error                18.099125
concave_points_error            6.683229
symmetry_error                 11.631338
fractal_dimension_error         2.151207
worst_radius                 9225.354000
worst_texture               14589.020000
worst_perimeter             60932.630000
worst_area                 500353.000000
worst_smoothness

In [None]:
df.mean()

## Selecting values
Subscripts work for individual or multiple columns

In [24]:
df["mean_perimeter"]

0      122.80
1      132.90
2      130.00
3       77.58
4      135.10
        ...  
564    142.00
565    131.20
566    108.30
567    140.10
568     47.92
Name: mean_perimeter, Length: 569, dtype: float64

In [25]:
columns = ["mean_perimeter","mean_radius","Class"]
df[columns]

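|  | mean_perimeter | mean_radius | Class |
|---|---|---|---|
| 0 | 122.80 | 17.99 | 2 |
| 1 | 132.90 | 20.57 | 2 |
| 2 | 130.00 | 19.69 | 2 |
| 3 | 77.58 | 11.42 | 2 |
| 4 | 135.10 | 20.29 | 2 |
| ... | ... | ... | ... |
| 564 | 142.00 | 21.56 | 2 |
| 565 | 131.20 | 20.13 | 2 |
| 566 | 108.30 | 16.60 | 2 |
| 567 | 140.10 | 20.60 | 2 |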


But this doesn't work as well when we want to specify rows *and* columns. Instead we can use the `iloc` and `loc` indexers (note the square brackets: they are not called like functions).

With `iloc` we select the cell(s) by integer position (remember that these also start at 0!).

With `loc` we use the labels of a pandas dataframe (e.g. the column names).

In [26]:
df.iloc[1,3]

1326.0

In [27]:
df.loc[1,"mean_area"]

1326.0

We can also just select rows or columns by entering a colon (":") for either of the index values. This basically means "choose all"

In [None]:
df.loc[:,"mean_area"]

We can also select ranges of values

In [None]:
df.loc[1:10,"mean_perimeter"]

In [None]:
df.loc[1:10,["mean_perimeter","mean_area", "Class"]]
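
One detail that is easy to miss with ranges: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position, just like Python lists. A small sketch on a made-up frame (`toy`) with the default integer row labels:

```python
import pandas as pd

# Hypothetical toy frame with the default row labels 0..4
toy = pd.DataFrame({"mean_area": [1001.0, 1326.0, 1203.0, 386.1, 1297.0]})

by_label = toy.loc[1:3, "mean_area"]    # labels 1, 2 and 3: the end label is included
by_position = toy.iloc[1:3, 0]          # positions 1 and 2: the end position is excluded

print(len(by_label), len(by_position))  # 3 2
```

So `df.loc[1:10, ...]` above returns ten rows (labels 1 to 10), whereas `df.iloc[1:10, ...]` would return nine.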

Depending on whether we pass single labels, ranges or lists to `iloc` and `loc`, we get back different types:

* Single label for both row and column -> single value
* Single label for one of them and a range, list or colon for the other -> pandas series (like a one-dimensional dataframe)
* Ranges, lists or colons for both -> another dataframe

In [None]:
# returns a float
print(df.loc[1,"mean_area"])
print(type(df.loc[1,"mean_area"]))

In [None]:
# returns a series
print(df.loc[:,"mean_area"])
print(type(df.loc[:,"mean_area"]))

In [None]:
# returns a series
print(df.loc[1:3,"mean_area"])
print(type(df.loc[1:3,"mean_area"]))

In [None]:
# returns a new dataframe
print(df.loc[1:3,["mean_area"]])
print(type(df.loc[1:3,["mean_area"]]))

In [None]:
# returns a new dataframe
print(df.loc[:,["mean_area","Class"]])
print(type(df.loc[:,["mean_area","Class"]]))

## Creating subsets
We can use this, for example, to create a smaller set of our original dataset, e.g. because we don't need all columns or because we just want to experiment with some of the rows first.

Let's take a much smaller subset of the data to show off some other things we can do with pandas.

In [28]:
columns_we_want = ["mean_perimeter","mean_radius","mean_smoothness","Class"]
df_toy = df.loc[:20, columns_we_want].copy()  # explicit copy, so later edits to df_toy don't trigger a SettingWithCopyWarning
df_toy

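|  | mean_perimeter | mean_radius | mean_smoothness | Class |
|---|---|---|---|---|
| 0 | 122.8 | 17.99 | 0.1184 | 2 |
| 1 | 132.9 | 20.57 | 0.08474 | 2 |
| 2 | 130.0 | 19.69 | 0.1096 | 2 |
| 3 | 77.58 | 11.42 | 0.1425 | 2 |
| 4 | 135.1 | 20.29 | 0.1003 | 2 |
| 5 | 82.57 | 12.45 | 0.1278 | 2 |
| 6 | 119.6 | 18.25 | 0.09463 | 2 |
| 7 | 90.2 | 13.71 | 0.1189 | 2 |
| 8 | 87.5 | 13.0 | NaN | 2 |
| 9 | 83.97 | 12.46 | 0.1186 | 2 |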


## Manipulating values
Just like we can select values we can also manipulate these values or use logical conditions to 'mask' the dataframe.

Let's start with manipulation.

Instead of just selecting the data, we can also overwrite it:

In [None]:
df_toy.iloc[:10,0] = 99.99 # sets the first 10 rows of the first column to 99.99
df_toy

In [None]:
df_toy.iloc[1:3,:] = "placeholder text" # sets the second and third row to a text value

We can also check for logical conditions and thereby search for cells where certain conditions are fulfilled. Let's regenerate our subset first to undo our manipulations.

In [None]:
df_toy = df.loc[:20, columns_we_want].copy()

Logical checks can be done just like you might expect, using the comparison operators "==", "!=", "<=" etc.

When we check for a condition, we get a dataframe (or series) full of boolean values. Let's check which values of the column `mean_perimeter` are greater than or equal to 120.

In [None]:
df_toy["mean_perimeter"] >= 120

We can also do this for rows

In [None]:
df_toy.iloc[0,:] < 3

And even for the entire dataframe

In [None]:
df_toy >= 120

We can also apply these boolean values to select only certain rows or columns of the dataset where the values fulfill a condition. 

Say for example, we want to better understand the characteristics and data values of all benign tumors (class label = 1). We could generate a subset by checking for this condition

In [None]:
mask = df_toy["Class"] == 1
mask

In [None]:
df_toy[mask]

We can even skip a step here and insert the logic check in the selection command

In [None]:
df_toy[df_toy["Class"] == 1]
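
Masks can also be combined. The following sketch, on a made-up frame (`toy`), shows the general pattern: each condition goes in its own parentheses, `&` means "and" and `|` means "or" (Python's plain `and` / `or` keywords do not work on a whole series).

```python
import pandas as pd

# Hypothetical toy frame, values made up for illustration
toy = pd.DataFrame({
    "mean_perimeter": [122.8, 132.9, 77.58, 47.92],
    "Class": [2, 2, 2, 1],
})

# each condition needs its own parentheses; & means "and", | means "or"
large_and_class_2 = toy[(toy["mean_perimeter"] >= 120) & (toy["Class"] == 2)]
small_or_class_1 = toy[(toy["mean_perimeter"] < 50) | (toy["Class"] == 1)]

print(large_and_class_2)
print(small_or_class_1)
```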

## Minitask 1
We can also use masking to manipulate specific rows. Say, for example, that we know there has been a measurement error in the laboratory: the scales weren't calibrated correctly and measured a radius that was 2 units too high. This happened while we were working on all the cancer samples that were previously categorised as malign.

* Step 1: Create a mask for all rows where the cancer class is "malign"
* Step 2: print out this subset 
* Step 3: Combine the mask with a selection of the radius column (just print it out for now)
* Step 4: Now adjust these values by reducing them by 2 units and print out the entire dataset again

In [None]:
# your code

## Minitask 2
One of the doctors in the hospital you work at approached you with the following questions:

1. What is the average value for the mean_perimeter?
1. Is the mean perimeter larger for the malign or benign cancer types?
1. How many of the benign cancers have a mean_smoothness that is smaller than the average value of the *malign* cancer types?
    
Use the entire dataset, not just the toy dataset for this.

In [None]:
# your code

## Creating new columns 
By "selecting" columns that don't exist yet we can also *create* new columns. This is especially useful if we want to generate new features out of existing data in the dataset.

When predicting the cost of apartments, the features "apartment size" and "average room size" might be powerful predictors, but by dividing the total size by the average room size we could create a new feature called "# of rooms", which might have strong predictive power for estimating the apartment price.

Often you can also use this to add comments, timestamps or class labels.
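
The apartment example can be sketched in a few lines. The table and its column names (`apartment_size`, `average_room_size`, `number_of_rooms`) are hypothetical and not part of the cancer dataset.

```python
import pandas as pd

# Hypothetical apartment data, made up for illustration
apartments = pd.DataFrame({
    "apartment_size": [60.0, 90.0, 120.0],      # total size in square metres
    "average_room_size": [20.0, 30.0, 24.0],    # square metres per room
})

# assigning to a column name that does not exist yet creates the column
apartments["number_of_rooms"] = apartments["apartment_size"] / apartments["average_room_size"]

print(apartments)
```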

In [None]:
from datetime import datetime as dt

In [None]:
df_toy["Timestamp"] = dt.now()
df_toy

# Creating a small machine learning model
## Create a new column for predictions
We hypothesise that the mean perimeter, radius and smoothness are the most important features of the dataset, so we are going to choose those columns as a subset.

We also believe that the ratio of smoothness to perimeter could be a very important predictor, so we are going to create that metric as a new feature in a new column.

In [29]:
df_subset = df.loc[:,["mean_perimeter", "mean_radius", "mean_smoothness", "Class"]]

df_subset["smoothness_to_perimeter"] = df["mean_smoothness"] / df["mean_perimeter"]

df_subset.head()

|  | mean_perimeter | mean_radius | mean_smoothness | Class | smoothness_to_perimeter |
|---|---|---|---|---|---|
| 0 | 122.8 | 17.99 | 0.1184 | 2 | 0.000964 |
| 1 | 132.9 | 20.57 | 0.08474 | 2 | 0.000638 |
| 2 | 130.0 | 19.69 | 0.1096 | 2 | 0.000843 |
| 3 | 77.58 | 11.42 | 0.1425 | 2 | 0.001837 |
| 4 | 135.1 | 20.29 | 0.1003 | 2 | 0.000742 |


We save the names of the columns that we want to use as features and the name of the column that holds the label we want to predict.

In [30]:
features = ["mean_perimeter", "mean_radius", "mean_smoothness","smoothness_to_perimeter"]
label = "Class"  # a plain string selects a series, which is the shape scikit-learn expects for labels

## Splitting into Training and testing set
Let's start by creating a training and a test set.

Are these terms unfamiliar to you? That's alright, you can read up on them [here](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data)


Basically, the training set is a subset of the rows that we will train our model on. The test set is the data that we will test our model on. If we tested the model on the same data that we trained it on, then we couldn't be sure that we had created a general model that would also be good at predicting new data. It could after all just be that we created a model that is good at recognising the patterns in the *small* dataset, but didn't actually learn something about the actual problem.

We will determine the row where we want to split the dataset (assuming the order is already random) and then create two subsets. Generally you take a ratio of 80% of the data for training and 20% for testing, so that's how we choose the index.

In [34]:
row_to_split_at = round(0.8 * len(df_subset), 0)

# we need an integer as an index for splitting
row_to_split_at = int(row_to_split_at)

# split into training and testing data
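# .copy() gives us independent tables, so adding columns later doesn't trigger a SettingWithCopyWarning
df_train = df_subset.iloc[:row_to_split_at, :].copy()
df_test = df_subset.iloc[row_to_split_at:, :].copy()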
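
Splitting by position only gives representative subsets if the row order really is random. If you are not sure about that (for example because the rows might be sorted by class), one option is to shuffle first. This is a sketch on a made-up frame (`toy`), not a step the original analysis performs; `random_state` is fixed only to make the shuffle reproducible.

```python
import pandas as pd

# Hypothetical toy frame that is sorted by class
toy = pd.DataFrame({"feature": range(10), "Class": [2] * 5 + [1] * 5})

# sample(frac=1) returns all rows in random order; reset_index renumbers them from 0
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

split = int(0.8 * len(shuffled))
toy_train = shuffled.iloc[:split]
toy_test = shuffled.iloc[split:]

print(len(toy_train), len(toy_test))  # 8 2
```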

## Create a logistic regression
A logistic regression is like a normal least squares regression, except that the output isn't a continuous number but a probability score. In our case it lies between 0 (most likely no malign cancer) and 1 (nearly certain to be a malign cancer).
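
To make the "probability score" concrete: a logistic regression computes a weighted sum of the features, just like a linear regression, and then squashes it into the range between 0 and 1 with the sigmoid function. The sketch below shows only that idea; the weights, bias and input values are made up, and a library such as scikit-learn does considerably more internally (fitting the weights, regularisation).

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias and feature values, made up for illustration
weights = np.array([0.5, -1.0])
bias = 0.25
x = np.array([2.0, 1.0])

z = np.dot(weights, x) + bias   # linear part: 0.5*2 - 1.0*1 + 0.25 = 0.25
probability = sigmoid(z)        # squashed into (0, 1)

print(z, probability)
```

A weighted sum of exactly 0 maps to a probability of 0.5, large positive sums approach 1 and large negative sums approach 0.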

Googling for "Python logistic regression package" we quickly come across scikit-learn, which is the most famous Python machine learning library. The first [Google result](LogisticRegression) is the page for the Logistic Regression implementation.

In [31]:
# you might have to run this command; the package is published on PyPI as scikit-learn (it is imported as sklearn)
!pip install scikit-learn


In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

The way this library works is that we first create an object and then call a method on it with our features and our labels. Luckily we already took note of the column names, so this is quite easy.

We follow the instructions as in the documentation, but...


In [35]:
clf = LogisticRegression()
clf.fit(df_train[features], df_train[label])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Hmm, the error reads "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')."

Wait a minute, didn't we say that there were missing values when we looked at `df.info()`? It seems that this might be the issue. We learnt how we can filter out values, but we don't know yet how to check for "NaN" or infinity values.

Luckily, we know by now that Google is quite good at helping us find these solutions, so we just enter ["How to check for nan or infinity python pandas"](https://www.google.com/search?q=how+to+check+for+nan+or+infinity+python+pandas&oq=how+to+check+for+nan+or+infinity+python+pandas&aqs=chrome..69i57j33.6384j0j1&sourceid=chrome&ie=UTF-8) and sure enough we find a [Stack Overflow question](https://stackoverflow.com/questions/17477979/dropping-infinite-values-from-dataframes-in-pandas) that solves our issue quite well. Let's give that a go!

*Note: Alternatively we could consider imputing these values, but we are not going to worry about that for now.*

In [36]:
# copied from Stack Overflow, replacing the name of the dataframe and
# the columns specified in the "subset" parameter
df_train = df_train.replace([np.inf, -np.inf], np.nan).dropna(subset=features, how="all")


In [37]:
clf = LogisticRegression()
clf.fit(df_train[features], df_train[label])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

That still doesn't quite work, so we google a bit and try to understand what exactly the functions `replace` and `dropna` do, as they are new to us. Reading the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) for `.dropna()` we can see the following:

```
how{‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

‘any’ : If any NA values are present, drop that row or column.

‘all’ : If all values are NA, drop that row or column.
```

"all" requires every value in the selected columns to be NA before a row is dropped, but we want rows with even a single NaN value to be deleted. So we have to change "all" to "any".




In [None]:
df_train = df_train.replace([np.inf, -np.inf], np.nan).dropna(subset=features, how="any")
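
The difference between the two settings is easiest to see on a tiny example. The frame below is made up: row 0 is complete, row 1 has one missing value and row 2 has only missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 complete, row 1 partly missing, row 2 fully missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

kept_all = toy.dropna(subset=["a", "b"], how="all")  # drops only row 2 (every value missing)
kept_any = toy.dropna(subset=["a", "b"], how="any")  # drops rows 1 and 2 (at least one value missing)

print(kept_all.index.tolist())  # [0, 1]
print(kept_any.index.tolist())  # [0]
```

With `how="all"`, row 1 survives and still carries a NaN, which is exactly why the model refused to train above.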

In [None]:
feature_columns = ["mean_perimeter", "mean_radius", "mean_smoothness", "smoothness_to_perimeter"]
x = df_train[feature_columns]
y = df_train["Class"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

It works!

In [None]:
clf = LogisticRegression()
clf.fit(x_train, y_train)

We are getting a warning, but let's ignore that for now. If we come across any problems later on, then we should revisit the warning to see if it might be the cause. So we'll just make a mental note of it and continue.

Following the instructions in the scikit-learn documentation, we can now use `.predict()` to generate new predictions. Let's try that for our test values, just to check that the output looks right.

In [None]:
clf.predict(x_test)

That worked as well! We will have to compare those predictions with the actual labels, but at least the format looks right for now.

In [None]:
clf.score(x_test, y_test)

## Minitask 3:
Instead of the labels we need the probabilities. Looking at the documentation, what do we have to do to get those?

In [None]:
# Space for your code

## Assessing the quality of the model
To assess how well our model works we want to calculate `number of correct predictions` / `total number of predictions`. We could calculate this manually, but reading through the sklearn documentation we see that there is a method built into the logistic regression object that takes care of this for us.

```
score(X, y, sample_weight=None)[source]
Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
```


In [None]:
clf.score(df_test[features], df_test[label])

## Minitask 4: 
Calculate the accuracy score without using the inbuilt function by comparing the Class labels and the predictions

We get a testing accuracy of 93.9%. That's 43.9 percentage points better than a coin flip, so it seems that the model definitely learnt something. Keep in mind, though, that with 357 benign and 212 malignant samples, always predicting the majority class would already be right about 62.7% of the time on the full dataset. For this use case there would definitely be room for improvement and optimisation, but let's stick with this for now.

Let's add our predictions to the test dataframe and save it as a .csv to discuss with the doctors from the oncology department.

In [None]:
df_test["Predictions"] = clf.predict(df_test[features])

We can save CSV files locally using the `.to_csv()` method. Have a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) to see which parameters we have to pass to the method.

In [None]:
path = "./data/output_cancer_predictions.csv"
df_test.to_csv(path)
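
One parameter worth knowing is `index`. By default `to_csv` also writes the row labels as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. The sketch below uses a made-up frame and an in-memory buffer in place of a file path to show how `index=False` avoids that.

```python
import io

import pandas as pd

# Hypothetical toy frame, made up for illustration
toy = pd.DataFrame({"mean_radius": [17.99, 20.57], "Predictions": [2, 2]})

buffer = io.StringIO()            # stands in for a file path
toy.to_csv(buffer, index=False)   # index=False leaves out the row labels
buffer.seek(0)                    # rewind the buffer before reading it back

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```

Whether you want the row labels in the file depends on whether they carry meaning; here they are just running numbers.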