<div class='alert alert-info'>
    Background information
    </div>

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable.
In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
In other words, the logistic regression model predicts P(Y=1) as a function of X.
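
To make the last statement concrete: logistic regression passes a linear combination of the features through the logistic (sigmoid) function, which maps any real number to a probability between 0 and 1. The sketch below is only illustrative. The helper name `predict_probability` and the coefficient values are made up for the example and are not taken from the model fitted later in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, coefficients):
    """Return P(Y=1 | x) for one observation under a logistic regression model."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return sigmoid(z)

# Hypothetical intercept and coefficients, chosen only for illustration
p = predict_probability(x=[1.0, 2.0], intercept=-1.0, coefficients=[0.5, 0.25])
print(round(p, 3))
```

A linear score of 0 corresponds to a probability of 0.5, which is why 0.5 is the usual default threshold for assigning a class.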
    


<div class="alert alert-info"> 
    Objective
    </div>

### Predict Breast Cancer


Predict the class of breast cancer (malignant or ‘bad’ versus benign or ‘good’) from the features of images taken from breast samples. Nine cytological attributes of the cell nuclei have been calculated for each sample.

<div class='alert alert-info'>
    Content
    </div>
    
#### 1. Data Preprocessing part 1
    - Inspecting Data frame
    - Handling duplicated rows
    - Handling the object values
    
#### 2. Data Preprocessing part 2

#### 3. Correlation and p-value analysis
    - Hypothesis Testing

#### 4. Splitting the dataset into train and test sets

#### 5. Fitting a logistic regression model to the train set  

#### 6. Making predictions and evaluating performance
    - Confusion matrix 
    - Classification report
    - Receiver Operating Characteristic (ROC)
    
#### 7. Overdispersion

#### 8. References

<div class='alert alert-info'>
    Let's dive into it!
    </div>

## 1. Data Preprocessing part 1

 
Data preprocessing is a technique that transforms raw data into an understandable format.


In [36]:
#Import pandas
import pandas as pd

# Load dataset
df = pd.read_csv('data/cancer.data', header = None)

# Inspect first five rows of data
df.head()

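| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |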


In [37]:
# rename columns
df.rename(columns = {0 :'ID', 1 :'Clump Thickness', 2: 'Uniformity of Cell Size',
                        3:'Uniformity of Cell Shape', 4:'Marginal Adhesion', 5: 'Single Epithelial Cell Size',
                       6:'Bare Nuclei', 7:'Bland Chromatin', 8:'Normal Nucleoli',
                       9:'Mitoses', 10:'Class'}, inplace = True)

# Converting Class entries to binary: 2 (benign) -> 1, 4 (malignant) -> 0
df['Class'] = df['Class'].replace([2,4], [1,0])

df.head()

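| | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 1 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 1 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 1 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 1 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 1 |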


   
### Inspecting Data frame


In [None]:
from pandas_profiling import ProfileReport
#----------------------------------------------------------------------------------------------------------

profile = ProfileReport(df, title='Pandas Profiling Report', html={'style':{'full_width':True}})
profile.to_file(output_file="dataset_report.html")


### Profile Report Overview

##### Dataset statistics

- Number of variables: 11
- Number of observations: 699
- Missing cells: 0
- Missing cells (%): 0.0%
- Total size in memory: 98.8 KiB
- Average record size in memory: 144.7 B
- Data types: int (10) and object (1)

##### Warnings

- Dataset has 8 (1.1%) duplicate rows
- Uniformity of Cell Shape is highly correlated with Uniformity of Cell Size

### Handling duplicated rows


In [None]:
#visualizing duplicated rows

duplicates = df[df.duplicated()]
duplicates

In [None]:
df[df['ID'] == 1198641]

- The ID 1198641 appears three times, but only two of those rows are duplicates
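
The distinction between a repeated ID and a duplicated row can be reproduced on a toy frame. This is a minimal sketch with made-up values (the `toy` DataFrame is not part of the dataset): `duplicated()` only flags rows whose every column matches an earlier row, so a record that shares the ID but differs in another column is kept.

```python
import pandas as pd

# Made-up records: three rows share ID 1, but only the first two are identical
toy = pd.DataFrame({
    'ID': [1, 1, 1, 2],
    'Clump Thickness': [5, 5, 3, 4],
    'Class': [1, 1, 1, 0],
})

# duplicated() marks the second and later copies of a fully identical row
n_flagged = int(toy.duplicated().sum())

# keep=False marks every member of a duplicated group, which helps inspection
n_in_groups = int(toy.duplicated(keep=False).sum())

deduplicated = toy.drop_duplicates()
print(n_flagged, n_in_groups, len(deduplicated))
```

Passing `keep=False` is handy when you want to see both copies side by side before deciding which one to drop.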

In [None]:
#dropping duplicates from the dataframe df

df = df.drop_duplicates()

#checking if the duplicated rows were dropped
len_duplicates = len(df[df.duplicated()])

print('There are a total of {0} duplicated rows in the dataframe df'.format(len_duplicates))

### Handling the object values

In [None]:
df['Bare Nuclei'].value_counts()

- The value '?' occurs 16 times in the Bare Nuclei column

In [None]:
#Import numpy
import numpy as np
#------------------------------

#replace the '?' placeholder with NaN
df = df.replace('?', np.nan)

#display changes
df['Bare Nuclei'].unique()

In [None]:
# Iterate over each column of df
for col in df.columns:
    # Check if the column is of object type
    if df[col].dtypes == 'object':
        # Impute missing values in this column with its most frequent value
        df[col] = df[col].fillna(df[col].value_counts().index[0])

In [None]:
#Inspecting for null value
df.isnull().sum()

In [None]:
# statistics summary

df.describe()

In [None]:
df.dtypes

- The dataset information above shows that the dataset still has an object-typed column, even though the missing values were imputed, because the values in that column are stored as strings

## 2. Data Preprocessing part 2

- Convert the non-numeric data into numeric.
- Split the data into train and test sets.
- Scale the feature values to a uniform range.



In [None]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder 

# Instantiate LabelEncoder
le =  LabelEncoder()

# Iterate over the columns and check their dtypes
for col in df:
    # Check whether the dtype is object
    if df[col].dtypes =='object':
        # Use LabelEncoder to do the numeric transformation
        df[col]=le.fit_transform(df[col])

In [None]:
df.dtypes

- At last, the dataset contains only integer data types
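
One caveat with this step is worth checking: `LabelEncoder` numbers the distinct values in sorted order, and for strings that order is lexicographic, so the string '10' is placed right after '1'. The sketch below uses a made-up series (not the real column) to show the effect, and shows `pd.to_numeric` as one possible alternative that preserves the original numeric values. Whether the difference matters for this notebook's results is not something the sketch establishes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up string values resembling an object-typed column of digits
values = pd.Series(['1', '10', '2', '5'])

# LabelEncoder assigns codes by the sorted (lexicographic) order of the strings
encoded = LabelEncoder().fit_transform(values)
print(list(encoded))   # codes, not the original numbers

# pd.to_numeric parses the strings and keeps their numeric meaning
numeric = pd.to_numeric(values)
print(list(numeric))
```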

## 3. Correlation and p-value analysis


In [None]:
import seaborn as sns
#--------------------------

corr = df.corr()
sns.heatmap(corr)

In [None]:
'''
compare the correlation between features and remove one of two features 
that have a correlation higher than 0.9, to eliminate multicollinearity
'''

columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = df.columns[columns]
df = df[selected_columns]
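
The same filter can be written as a small function and checked on toy data. This is a sketch and not a replacement for the cell above: the function name `drop_highly_correlated`, the use of the absolute correlation, the rule that an already dropped column does not eliminate others, and the toy columns are all assumptions made for the example.

```python
import pandas as pd

def drop_highly_correlated(frame, threshold=0.9):
    """Drop the later column of every pair whose |correlation| >= threshold."""
    corr = frame.corr().abs()
    keep = [True] * corr.shape[0]
    for i in range(corr.shape[0]):
        if not keep[i]:
            continue  # a column that was already dropped does not knock out others
        for j in range(i + 1, corr.shape[0]):
            if corr.iloc[i, j] >= threshold:
                keep[j] = False
    return frame.loc[:, keep]

# Toy data: 'b' is an exact multiple of 'a', while 'c' is only loosely related
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],
                    'c': [2, 1, 4, 3, 5]})
print(list(drop_highly_correlated(toy).columns))
```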

### Hypothesis Testing

Null hypothesis: The selected columns do not have a significant effect on breast cancer

Alternative hypothesis: At least one of the selected columns affects breast cancer significantly

In [None]:
#Selecting columns/features based on p-value
import statsmodels.api as sm
#-------------------------------------------------------
# Keep only the feature names (drop 'ID' at the front and 'Class' at the end)
selected_columns = selected_columns[1:-1].values

def elimination(x, Y, sl, columns):
    # Backward elimination: refit and drop the feature with the largest p-value above sl
    numVars = x.shape[1]
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
                    columns = np.delete(columns, j)
                    # Stop after one removal; the p-values are stale until the next refit
                    break

    return x, columns
SL = 0.05
# Features are the columns between 'ID' and 'Class'; the target is 'Class' (the last column)
df_modeled, selected_columns = elimination(df.iloc[:,1:-1].values, df.iloc[:,-1].values, SL, selected_columns)

In [None]:
print('factors that significantly predict malignant cancer are :\n{}'.format(selected_columns))

## 4. Splitting the dataset into train and test sets


In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Segregate features and labels into separate variables
X,y = df[['Clump Thickness','Uniformity of Cell Size','Bland Chromatin']] , df['Class']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size=0.30,
                                random_state=42)

In [None]:
# Import MinMaxScaler
from sklearn.preprocessing import  MinMaxScaler

# Instantiate MinMaxScaler, fit it on X_train, and reuse the fitted scaler on X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

## 5. Fitting a logistic regression model to the train set

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)
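
One way to keep the scaling and the model together is scikit-learn's `Pipeline`, which learns the scaler's minimum and maximum from the training data only and reuses them when predicting. The sketch below runs on synthetic data generated with NumPy, so the variable names and numbers are illustrative and unrelated to the breast cancer results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: three integer features in 1..10 and a label driven by their sum
rng = np.random.default_rng(42)
X_demo = rng.integers(1, 11, size=(200, 3))
y_demo = (X_demo.sum(axis=1) > 16).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.30, random_state=42)

# The scaler is fitted inside the pipeline, on the training split only
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(round(test_accuracy, 3))
```

Because the scaler lives inside the pipeline, calling `model.score` or `model.predict` on new data applies the training-set scaling automatically.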

## 6. Making predictions and evaluating performance

### Confusion matrix 

In [None]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred       = logreg.predict(rescaledX_test)
y_pred_train = logreg.predict(rescaledX_train)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier on test: ", logreg.score(rescaledX_test, y_test))

print("Accuracy of logistic regression classifier on train: ", logreg.score(rescaledX_train, y_train))

# Print the confusion matrix of the logreg model (rows: true class, columns: predicted class)
print('\nConfusion Matrix on testing data set')
confusion_matrix(y_test, y_pred)

#### The confusion matrix shows:

- 54 + 144 correct predictions
- 2 + 8 incorrect predictions
- an accuracy of 95.19% on the test dataset
- an accuracy of 94.40% on the train dataset
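
The test accuracy quoted above follows directly from the counts of correct and incorrect predictions, as this short check shows.

```python
# Counts read from the confusion matrix on the test set
correct = 54 + 144
incorrect = 2 + 8

accuracy = correct / (correct + incorrect)
print('{:.2%}'.format(accuracy))  # 95.19%
```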

### Classification report

In [None]:
from sklearn.metrics import classification_report
#-------------------------------------------


print('\nTest dataset-------------------------------------------')
print(classification_report(y_test, y_pred))
print('\nTrain dataset------------------------------------------')
print(classification_report(y_train, y_pred_train))

### Receiver Operating Characteristic (ROC)

In [None]:
# Import ROC
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score
#----------------------------------------------

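# Use the predicted probability of class 1 instead of hard 0/1 labels,
# so that the curve is traced over many thresholds
y_score = logreg.predict_proba(rescaledX_test)[:, 1]
logreg_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, threshold_log = roc_curve(y_test, y_score)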

In [None]:
#Import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
#-----------------------------------------------

plt.plot(fpr, tpr, color='green', label='ROC')
plt.plot([0, 1], [0, 1], color='purple', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

## 7. Overdispersion


In statistics, overdispersion is the presence of greater variability in a data set than would be expected based on a given statistical model.
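
As a rough illustration of the idea, consider count data and a Poisson model (an assumption of this example, not the logistic model fitted above). A Poisson model expects the variance to be about equal to the mean, so a variance-to-mean ratio well above 1 points to overdispersion. The counts below are made up.

```python
import statistics

# Made-up count data with a few large values
counts = [0, 1, 0, 2, 10, 0, 7, 1, 0, 9]

mean = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance

# Under a Poisson model this ratio should be close to 1
dispersion_ratio = variance / mean
print(round(dispersion_ratio, 2))
```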

## 8. References


- https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
- https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5769953/