## Logistic Regression
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n$$  
* $p$ is the probability of the positive class.
* $\beta_0, \dots, \beta_n$ are the regression coefficients ($\beta_0$ is the intercept).
* $x_1, \dots, x_n$ are the feature variables.
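
Solving the equation above for $p$ gives the sigmoid function, $p = 1 / (1 + e^{-(\beta_0 + \beta_1x_1 + \dots + \beta_nx_n)})$. The sketch below illustrates that relationship with NumPy. The helper name `predict_probability` and the coefficient and feature values are hypothetical, chosen only for illustration; they are not fitted to the dataset.

```python
import numpy as np


def predict_probability(x, beta_0, beta):
    """Map the linear predictor (log-odds) to a probability with the sigmoid."""
    log_odds = beta_0 + np.dot(x, beta)
    return 1.0 / (1.0 + np.exp(-log_odds))


# Hypothetical intercept, coefficients and feature values (illustration only)
beta_0 = -1.0
beta = np.array([0.5, 0.25])
x = np.array([2.0, 4.0])

p = predict_probability(x, beta_0, beta)

# Applying the logit to p recovers the linear predictor
log_odds_back = np.log(p / (1 - p))

print(f"p = {p:.4f}, log-odds = {log_odds_back:.4f}")
```

Here the linear predictor is $-1 + 0.5 \cdot 2 + 0.25 \cdot 4 = 1$, so the logit of the returned probability is 1 again.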

## Main Task
> Predict whether a patient has breast cancer, i.e. classify each patient as *benign, non-cancerous (2)* **or** *malignant, cancerous (4)*.

### Data Understanding  

**1.0. What is the domain area of the dataset?**  
The dataset *breast-cancer-wisconsin.csv* contains records of patients who either have breast cancer or do not. It was collected by Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA.

**2.0. Which data format?**  
The dataset is in *CSV* format.

**2.1. Do the files have headers or another file describing the data?**  
The file has a header row: each column has a name that describes the data it contains.

**2.2. Are the data values separated by commas, semicolon, or tabs?**  
The data values are separated by commas.  
Example: 
*id,clumpthickness,uniformcellsize,uniformcellshape,margadhesion,epithelial,barenuclei,blandchromatin,normalnucleoli,mitoses,benormal*
*1000025,5,1,1,1,2,1,3,1,1,2*
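
As a quick check of the delimiter, the two example lines above can be parsed directly with pandas. This is a minimal sketch: the in-memory string `sample` and the name `sample_df` are introduced here for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Header line and first data row, copied from the example above
sample = (
    "id,clumpthickness,uniformcellsize,uniformcellshape,margadhesion,"
    "epithelial,barenuclei,blandchromatin,normalnucleoli,mitoses,benormal\n"
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
)

# sep="," is the pandas default; it is spelled out to make the delimiter explicit
sample_df = pd.read_csv(io.StringIO(sample), sep=",")

print(sample_df.shape)
print(list(sample_df.columns))
```

If the delimiter were wrong (for example `sep=";"`), the whole header would be read as a single column name, which makes this an easy sanity check.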

**3.0. How many features and how many observations does the dataset have?**  
The dataset has:  
* 11 features (columns)
* 699 observations (rows)

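**4.0. Does it contain numerical features? How many?**  
Yes. All 11 columns hold numeric values, although `barenuclei` is absent from the `describe()` summary, which suggests pandas did not parse that column as numeric.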

**5.0. Does it contain categorical features? How many?**  
Yes, 1: the target class `benormal` is numerically encoded but categorical, taking only the values *2* and *4*.
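
The answers to questions 4.0 and 5.0 can be checked programmatically. The sketch below uses a small stand-in frame named `toy`, built from three columns of the first five rows of the dataset; the `diagnosis` column and the label strings are assumptions added for illustration, based on the 2 = benign, 4 = malignant coding described in the main task.

```python
import pandas as pd

# Stand-in for the real dataset: three columns from the first five rows
toy = pd.DataFrame({
    "clumpthickness": [5, 5, 3, 6, 4],
    "uniformcellsize": [1, 4, 1, 8, 1],
    "benormal": [2, 2, 2, 2, 2],
})

# Columns that pandas stores with a numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# The target is stored as integers but is really a two-level category
toy["diagnosis"] = (
    toy["benormal"].map({2: "benign", 4: "malignant"}).astype("category")
)
print(toy["diagnosis"].value_counts())
```

On the full dataset, the same `select_dtypes` call shows which columns pandas actually parsed as numbers, which may differ from how the values look in the raw file.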

In [1]:
# Importing Necessary Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
dataset = pd.read_csv("../Datasets/breast-cancer-wisconsin.csv")

In [3]:
RANDOM_STATE = 42

### Basic Exploratory Data Analysis

In [4]:
dataset.head()

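| | id | clumpthickness | uniformcellsize | uniformcellshape | margadhesion | epithelial | barenuclei | blandchromatin | normalnucleoli | mitoses | benormal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |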


In [5]:
dataset.describe()

| | id | clumpthickness | uniformcellsize | uniformcellshape | margadhesion | epithelial | blandchromatin | normalnucleoli | mitoses | benormal |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


In [6]:
print(f"Number of features in the dataset is {dataset.shape[1]} and the number of observations/rows in the dataset is {dataset.shape[0]}")

Number of features in the dataset is 11 and the number of observations/rows in the dataset is 699


### Handling Missing Values

In [7]:
# Count missing (NaN) values per column
dataset.isnull().sum()

id                  0
clumpthickness      0
uniformcellsize     0
uniformcellshape    0
margadhesion        0
epithelial          0
barenuclei          0
blandchromatin      0
normalnucleoli      0
mitoses             0
benormal            0
dtype: int64

In [8]:
# isna() is an alias of isnull(), so this repeats the check above
dataset.isna().sum()

id                  0
clumpthickness      0
uniformcellsize     0
uniformcellshape    0
margadhesion        0
epithelial          0
barenuclei          0
blandchromatin      0
normalnucleoli      0
mitoses             0
benormal            0
dtype: int64
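
Both checks report zero missing values, but `isnull()` only detects real `NaN`/`None` entries. The `describe()` summary earlier omits `barenuclei`, which suggests that column was not parsed as numeric and may contain a text placeholder. The sketch below shows one way to surface such entries; the stand-in frame `toy` and the `"?"` placeholder are assumptions for illustration, not values confirmed from this dataset.

```python
import pandas as pd

# Hypothetical stand-in: a numeric column stored as text with a "?" placeholder
toy = pd.DataFrame({"barenuclei": ["1", "10", "?", "4"]})

# isnull() sees no missing values, because "?" is an ordinary string
reported_missing = int(toy["barenuclei"].isnull().sum())

# Coercing to numeric turns anything unparseable into NaN
as_numbers = pd.to_numeric(toy["barenuclei"], errors="coerce")
n_hidden = int(as_numbers.isna().sum())

print(f"isnull() reports {reported_missing}, coercion reveals {n_hidden}")
```

If a placeholder is found in the real data, passing it to `pd.read_csv` through the `na_values` parameter is one option for having it read as `NaN` from the start.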