<a href="https://colab.research.google.com/github/ThistleAna/BreastCancerCase/blob/main/Logistic_Regression_Breast_Cancer_Case_Study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression

Logistic regression is a statistical method that predicts the probability of a categorical outcome from prior observations in a dataset. The model estimates the value of a dependent variable (here, whether a tumor is benign or malignant) from its relationship with one or more independent variables.

In data science, logistic regression is a machine learning algorithm used for classification problems and predictive analysis.
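To make the idea concrete, here is a minimal sketch of how a logistic regression model turns a weighted sum of features into a probability and then into a class label. The weights, bias, and sample values are hypothetical numbers chosen only for illustration, and the helper name `predict_proba_one` is not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(features, weights, bias):
    """Probability of the positive class for a single sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical parameters and sample, for illustration only
weights = [0.8, -0.4]
bias = -1.0
sample = [2.0, 1.0]

p = predict_proba_one(sample, weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual decision threshold
print(round(p, 4), label)
```

Training the model amounts to finding the weights and bias that make these probabilities agree with the observed labels as closely as possible.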

Real-world applications of logistic regression include:

- Bankruptcy predictions
- Credit scoring
- Consumer behavior
- Customer retention
- Spam detection

## Importing the libraries

In [None]:
# Import the pandas library
import pandas as pd

## Importing the dataset

In [None]:
# Upload the dataset to Colab before running this cell
# The 'class' column is 2 for a benign tumor and 4 for a malignant tumor
dataset = pd.read_csv('breast_cancer.csv')

# X is the set of independent variables: every column except the first (index 0, not needed) and the last ('class')
# y is the dependent variable 'class'
X = dataset.iloc[:, 1:-1].values  # all rows, columns from index 1 up to but not including the last
y = dataset.iloc[:, -1]  # all rows, last column only

## Splitting the dataset into the Training set and Test set

In [None]:
# train_test_split from scikit-learn splits each set into training and test subsets
from sklearn.model_selection import train_test_split
# Create four subsets from X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# The test set holds 20% of all 683 entries
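As a quick check of the split sizes, the arithmetic can be sketched in plain Python. This assumes the test count is rounded up and the remainder goes to training, which is consistent with the 137 test predictions counted in the confusion matrix later in this notebook.

```python
import math

n_samples = 683   # total entries in the dataset
test_size = 0.2   # same fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
```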

## Training the Logistic Regression model on the Training set

In [None]:
from sklearn.linear_model import LogisticRegression
# Create a LogisticRegression instance called classifier
classifier = LogisticRegression(random_state = 0)
# Fit the model to the training set with the fit() method
classifier.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Predicting the Test set results

In [None]:
# Predict whether each tumor in the test set is benign (2) or malignant (4)
# using the predict() method
y_pred = classifier.predict(X_test)

## Making the Confusion Matrix

In [None]:
# confusion_matrix from sklearn.metrics counts correct and incorrect predictions
from sklearn.metrics import confusion_matrix
# Store the result in a new variable
# Arguments: ground truth (y_test) first, then predictions (y_pred)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[84  3]
 [ 3 47]]


84 correct predictions that the tumor is benign
47 correct predictions that the tumor is malignant


3 incorrect predictions that the tumor is benign (it was malignant)
3 incorrect predictions that the tumor is malignant (it was benign)

84 + 47 = 131 correct predictions out of 137 in the test set

>> our model performs very well on the test set



In [None]:
# Accuracy = correct predictions / all predictions
(84+47)/(84+47+3+3)

0.9562043795620438

The accuracy on the test set is 95.62 %.
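Accuracy alone does not show which kind of error the model makes, and for a cancer screen a missed malignant tumor matters more than a false alarm. The sketch below derives precision, recall, and specificity from the four counts in the confusion matrix above. It assumes malignant (class 4) is treated as the positive class, and that rows are actual classes and columns are predicted classes, as in the reading given above.

```python
# Counts read from the confusion matrix (rows: actual, columns: predicted)
tn, fp = 84, 3   # actual benign: predicted benign, predicted malignant
fn, tp = 3, 47   # actual malignant: predicted benign, predicted malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)     # share of malignant predictions that are correct
recall = tp / (tp + fn)        # share of malignant tumors that are detected
specificity = tn / (tn + fp)   # share of benign tumors that are correctly cleared

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f}")
```

With these counts, 3 of the 50 malignant tumors in the test set are missed, which is the figure to watch if the model were ever tuned further.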

## Computing the accuracy with k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
# cross_val_score() returns one accuracy score per fold, 10 in total
# Arguments: the classifier, the training data (X_train, y_train), and cv = 10 folds
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy:  {:.2f} %".format(accuracies.mean()*100))
# {:.2f} formats the value with 2 decimal places

# Print the standard deviation
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))


Accuracy:  96.70 %
Standard Deviation: 1.97 %


A very good result!
Logistic regression performs well on the breast cancer dataset.
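To illustrate what 10-fold cross-validation does with the training data, here is a simplified sketch that partitions sample indices into k folds, each used once for validation while the other folds are used for training. It is plain, unshuffled k-fold written for illustration. For classifiers, scikit-learn uses stratified folds by default, so the real folds also preserve the class proportions. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

# 546 training samples: 683 entries minus the 137 held out for testing
folds = list(k_fold_indices(546, 10))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)
```

Each of the 10 scores in `accuracies` comes from a model trained on roughly 90% of the training set and evaluated on the remaining fold, which is why the mean is a steadier estimate than a single train/test split.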