# Predicting Breast Cancer Diagnosis Using Multiple Linear Regression

README:
    
    This project can be run in PyCharm, but for better visual presentation it is best run in Jupyter. The only requirement is the dataset, which must be in the same folder as the .ipynb file so that it can be imported without problems.
    
    The dataset can be downloaded from Kaggle at the following link:
    https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

The first step is to import the packages I need. I do this in the cell below, where I also load the dataset, which is saved in the same folder as this notebook. I then split the data into dependent and independent variables.
The dependent variable is the diagnosis, located at column index 1 of the dataset. The feature columns from index 2 onward, excluding the empty trailing column, are the independent variables.

I also create df, a pandas DataFrame holding the same data with explicitly named columns, which I use later to select independent variables by name.
I would have preferred to take the dependent variable from it as well, but its values are non-numeric ('M' and 'B'). Therefore, I slice it straight from the dataset, which allows me to convert it into zeros and ones below.

In [260]:
# Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Loading the dataset (data.csv must be in the same folder as this notebook)
dataset = pd.read_csv('data.csv')

# Same data with explicitly listed columns, used later to select features by name.
# The names must match the CSV header exactly ('id' and 'diagnosis' are lowercase).
df = pd.DataFrame(dataset, columns=[
    'id', 'diagnosis',
    'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
    'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
    'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
    'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se',
    'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
    'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'])

# Independent variables: the 30 feature columns (skips id, diagnosis and the empty last column)
x = dataset.iloc[:, 2:-1]
# Dependent variable: the diagnosis ('M' or 'B') as a 1-D array
y = dataset.iloc[:, 1].values

dataset.head()


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


Having separated the dataset into x and y values, the first thing I need to do is convert the y values from M and B to 1 and 0.
To do this I use the LabelEncoder class from sklearn. This transforms each M into a one and each B into a zero, and the result is assigned to
a new y variable, which now carries numerical values that can be used in regression.
In the same cell I also split the data into a training set and a test set so that I can test my regression model once I have trained it.
From this cell I emerge with x_train, y_train, x_test and y_test, which I derive from x and y using the train_test_split function from sklearn.model_selection.
I also use another sklearn class, StandardScaler, to standardize the features for better results. The scaler is fitted on the training set only and then applied to the test set.
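
As a quick check on the encoding described above, the mapping from labels to integers can be reproduced with NumPy alone. This is only a sketch on hypothetical toy labels, not the notebook's data; it relies on the fact that LabelEncoder assigns integers to the classes in sorted order, so 'B' becomes 0 and 'M' becomes 1.

```python
import numpy as np

# Hypothetical toy labels standing in for the diagnosis column.
labels = np.array(["M", "B", "B", "M", "B"])

# np.unique returns the sorted distinct labels; with return_inverse=True it also
# returns, for every element, the index of its label in that sorted array.
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)   # the sorted class names
print(encoded)   # one integer code per original label
```

Because 'B' sorts before 'M', a predicted value near 1 corresponds to a malignant diagnosis and a value near 0 to a benign one.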

In [262]:

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

#splitting the data into training and testing set

from sklearn.model_selection import train_test_split

x_train,x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

#normalizing data

from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
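
To make the scaling step concrete, here is a minimal NumPy sketch of what fitting on the training set and transforming the test set amounts to. The helper names `fit_standardizer` and `apply_standardizer` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def fit_standardizer(train):
    """Return the per-column mean and standard deviation of the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
    return mean, std

def apply_standardizer(data, mean, std):
    """Scale data using statistics that were computed on the training set."""
    return (data - mean) / std

# Hypothetical toy data: three training rows and one test row, two features each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The test row is scaled with the training mean and standard deviation, so no information from the test set leaks into the preprocessing.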


In the cell below I calculate the Variance Inflation Factor (VIF) for each of the independent variables to check for multicollinearity.
I use statsmodels to do this, with a for loop that iterates over all the features in x, since the function computes the VIF of one feature at a time.
The majority of the features have very high VIFs, which is a sign that they largely explain the same thing.

In [204]:
from statsmodels.stats.outliers_influence import variance_inflation_factor



vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif["features"] = x.columns
print(vif)
#vif.round(1)

      VIF Factor                 features
0   63306.172036              radius_mean
1     251.047108             texture_mean
2   58123.586079           perimeter_mean
3    1287.262339                area_mean
4     393.398166          smoothness_mean
5     200.980354         compactness_mean
6     157.855046           concavity_mean
7     154.241268      concave points_mean
8     184.426558            symmetry_mean
9     629.679874   fractal_dimension_mean
10    236.665738                radius_se
11     24.675367               texture_se
12    211.396334             perimeter_se
13     72.466468                  area_se
14     26.170243            smoothness_se
15     44.919651           compactness_se
16     33.244099             concavity_se
17     53.698656        concave points_se
18     37.176452              symmetry_se
19     27.532631     fractal_dimension_se
20   9674.742602             radius_worst
21    343.004387            texture_worst
22   4487.781270          perimete
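
For intuition, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it with NumPy on hypothetical synthetic data. Note one assumption that differs from the cell above: the helper `vif_with_intercept` (an illustrative name) adds an intercept to each auxiliary regression, whereas the statsmodels call above is given x without a constant column, so the numbers it produces would not necessarily match the table.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF for each column of X, using an auxiliary regression with an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Hypothetical data: column c is almost a copy of column a, column b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
demo_vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(demo_vifs)
```

The two nearly identical columns get very large VIFs while the unrelated column stays close to 1, which is the same pattern that pairs such as radius_mean and perimeter_mean show in the table above.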

The next step is to run multiple linear regression. One way to do this would be to write all the regression calculations
in Python myself. Another is to import the LinearRegression class from sklearn. While the former would make it easier to
control the results and how they look in the output, it would take longer and is somewhat redundant, as sklearn already has a class
that does the same thing. Therefore, for the sake of not reinventing the wheel, I opted for sklearn again.

In [205]:
#training multiple linear regression model
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(x_train,y_train)

#prediction on the dataset

y_pred= regr.predict(x_test)
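
For comparison, the "do the calculations by hand" route mentioned above can be sketched in a few lines of NumPy. This is not the notebook's method, only an illustration of ordinary least squares with an intercept; the function names `fit_ols` and `predict_ols` and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

def predict_ols(X, intercept, coefs):
    """Predicted values for the rows of X."""
    return intercept + X @ coefs

# Hypothetical toy data generated exactly from y = 1 + 2*x1 - 0.5*x2.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]

intercept, coefs = fit_ols(X_demo, y_demo)
print(intercept, coefs)
print(predict_ols(X_demo, intercept, coefs))
```

Because the toy target is an exact linear function of the features, the fit recovers the generating coefficients up to floating-point error.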

Next I run a logistic regression model on the same dataset. Although I set out to run a multiple linear regression model, I
thought it would be a good idea to try another model and see how it performs. The biggest and most important step in any data
analysis is preprocessing: making sure the data is free of missing values, is normalized if needed, and has any values recoded
that need recoding. This is where Python is a major asset and where I intend to strengthen my skills.
Once that has been done, as I have above, fitting a given model does not take long, and
one could easily run multiple models to find the one most suitable for the dataset and the problem at hand. However, since the
aim of my project is to show the steps taken to do this in Python, I focus on multiple linear regression and include
logistic regression only to underline the point that different models can be run as long as the dataset meets the specific
assumptions of the model you intend to run. The model score for logistic regression is quite high at 96.49%, which is somewhat expected given that
the model has been given 30 features, even though we already know from the VIFs that, if further analysis were to be done, most of the variables would need
to be removed, as they add complexity without adding enough value.

In [263]:
# Scikit-learn logistic regression
from sklearn.linear_model import LogisticRegression

scikit_log_reg = LogisticRegression()
scikit_log_reg.fit(x_train, y_train)

# score() returns the mean accuracy on the given test data
scikit_score = scikit_log_reg.score(x_test, y_test)
print('Scikit score: ', scikit_score)

Scikit score:  0.9649122807017544
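
The score printed above is a mean accuracy, that is, the fraction of test rows whose predicted label matches the true one. The value 0.9649... equals 110/114, so it is consistent with 4 of the 114 test rows being misclassified. The sketch below illustrates the calculation in plain Python; the `accuracy` helper and the demo labels are hypothetical and are not the notebook's actual predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels: 114 test rows, 4 of which are predicted incorrectly.
y_true_demo = [1] * 40 + [0] * 74
y_pred_demo = list(y_true_demo)
for i in (0, 1, 50, 60):
    y_pred_demo[i] = 1 - y_pred_demo[i]  # flip four predictions

demo_score = accuracy(y_true_demo, y_pred_demo)
print(demo_score)
```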




Below I use statsmodels to get a full summary of my regression model, so that I can assess how useful
the model is, which variables can be taken seriously and which can be removed to maximize model value while minimizing complexity.


In [264]:
# Model performance
import statsmodels.api as sm

# Note: x is passed as-is, so the fitted model has no intercept term
# (sm.add_constant(x) would add one).
regr_OLS = sm.OLS(endog=y, exog=x).fit()
mod_sum = regr_OLS.summary()
print(mod_sum)


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.853
Model:                            OLS   Adj. R-squared:                  0.844
Method:                 Least Squares   F-statistic:                     103.9
Date:                Thu, 16 May 2019   Prob (F-statistic):          1.78e-202
Time:                        23:56:27   Log-Likelihood:                 18.089
No. Observations:                 569   AIC:                             23.82
Df Residuals:                     539   BIC:                             154.1
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
radius_mean               

The results above show all the features and their p-values, among other things. Most of them have high p-values and will therefore be removed
below when I run the model again, this time with only those features with p-values less than 0.05, as is customary in statistical analysis.
The other important statistics shown above are the adjusted R-squared and the Durbin-Watson statistic. Adjusted R-squared is very high, which tends to happen when the number
of variables is high, so we will not read too much into it. The Durbin-Watson statistic is 1.837, which is close to 2 and means there is very little autocorrelation in the residuals.
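
The Durbin-Watson statistic mentioned above can be computed directly from a vector of residuals as the sum of squared successive differences divided by the sum of squared residuals. A minimal sketch, assuming the residuals are given in observation order; the function name `durbin_watson` and the toy residuals are illustrative and not taken from the notebook.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residuals illustrating the two extremes of the statistic.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strong negative autocorrelation
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # strong positive autocorrelation
print(dw_alternating, dw_constant)
```

The statistic ranges from 0 to 4. Values near 2 indicate little first-order autocorrelation, values near 0 indicate positive autocorrelation, and values near 4 indicate negative autocorrelation.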


In [266]:
# Keep only the features whose p-values were below 0.05 in the first model
x = df[['fractal_dimension_mean','radius_mean','area_mean','smoothness_se','radius_worst','area_worst','fractal_dimension_worst']]
# Dependent variable as a 1-D array
y = dataset.iloc[:, 1].values

# Encoding the diagnosis: B -> 0, M -> 1
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Standardizing the features (fit on the training set only)
from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

# Variance Inflation Factor for each remaining feature
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif["features"] = x.columns
print(vif)

# Training the multiple linear regression model
from sklearn.linear_model import LinearRegression

regr1 = LinearRegression()
regr1.fit(x_train, y_train)

# Prediction on the test set
y_pred = regr1.predict(x_test)

# Model performance (as before, x is passed without an added constant column)
import statsmodels.api as sm

regr1_OLS = sm.OLS(endog=y, exog=x).fit()
mod_sum = regr1_OLS.summary()
print(mod_sum)
print(y_pred)


    VIF Factor                 features
0   207.805209   fractal_dimension_mean
1  2998.126668              radius_mean
2   579.701192                area_mean
3     9.507260            smoothness_se
4  2781.321159             radius_worst
5   529.860170               area_worst
6    74.256058  fractal_dimension_worst
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.811
Model:                            OLS   Adj. R-squared:                  0.808
Method:                 Least Squares   F-statistic:                     343.9
Date:                Fri, 17 May 2019   Prob (F-statistic):          1.79e-198
Time:                        00:01:30   Log-Likelihood:                -52.948
No. Observations:                 569   AIC:                             119.9
Df Residuals:                     562   BIC:                             150.3
Df Model:                           7           



The results above include only those features whose p-values indicated statistical significance in the first model.
As such, these results can be taken more seriously and the model can now be used to predict diagnosis. The adjusted R-squared went down to 0.808,
which is not a big fall given that we are now using 7 features as opposed to the 30 we began with. The Durbin-Watson statistic also
increased and moved closer to 2, suggesting that reducing the number of features left even less autocorrelation in the residuals.
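
For reference, the textbook definition of adjusted R-squared for a model with an intercept, n observations and p predictors is shown below. It is stated here as general background; the fits above pass x without an added constant column, so the values statsmodels reports may be computed with slightly different degrees of freedom.

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

The factor (n - 1)/(n - p - 1) grows with p, so dropping predictors that contribute little to R-squared costs very little in adjusted R-squared.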

Below I print the predictions from the model alongside the test labels. The results are encouraging. Most of the observations that are ones in y_test are above 0.5 in
y_pred, in many cases very close to 1, and most of those that are zeros in the actual dataset are close to
zero in the predicted values, which means the model does a good job of predicting diagnosis.
Using a for loop, I convert values greater than 0.5 to 'M' and all other values to 'B', to recover the diagnosis as it appears in the original dataset. For comparison, after the loop I print the corresponding values from the test data, where 0 represents 'B' and 1 represents 'M', so that the predicted labels can be checked against the original values.

In [283]:
#print('Predicted values of Diagnosis: ', y_pred)
#print('The actual results: ', y_test)

# LabelEncoder mapped B -> 0 and M -> 1, so predictions above 0.5 are labelled 'M'
for num in y_pred:
    if num > 0.50:
        print('M')
    else:
        print('B')

print(y_test)


M
B
B
B
B
B
B
B
B
B
M
B
B
B
B
M
B
M
M
M
M
M
B
B
M
B
B
M
B
M
B
M
B
M
B
M
B
M
B
M
B
B
M
B
M
M
B
B
B
M
M
B
M
B
B
B
B
B
B
M
M
M
B
B
M
B
M
M
M
B
B
M
B
B
M
B
B
B
B
B
M
M
M
B
M
B
B
B
M
M
B
M
B
M
B
B
M
B
B
B
B
B
B
B
M
B
M
B
B
M
B
M
M
B
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1
 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0
 1 1 0]
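
Rather than comparing the two printouts by eye, the thresholded predictions can be tallied against the true labels. The sketch below is one possible way to do this with NumPy; the helper `confusion_counts` and the toy inputs are hypothetical, and the 0.5 threshold simply mirrors the one used in the loop above.

```python
import numpy as np

def confusion_counts(y_true, scores, threshold=0.5):
    """Count true/false positives and negatives, treating 1 (malignant) as positive."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(scores) > threshold).astype(int)
    return {
        "tp": int(np.sum((y_hat == 1) & (y_true == 1))),
        "fp": int(np.sum((y_hat == 1) & (y_true == 0))),
        "fn": int(np.sum((y_hat == 0) & (y_true == 1))),
        "tn": int(np.sum((y_hat == 0) & (y_true == 0))),
    }

# Hypothetical true labels and linear-regression outputs (which can fall outside [0, 1]).
y_true_toy = [1, 0, 0, 1, 1, 0]
scores_toy = [0.9, 0.1, 0.7, 0.4, 0.6, -0.2]

counts = confusion_counts(y_true_toy, scores_toy)
toy_accuracy = (counts["tp"] + counts["tn"]) / len(y_true_toy)
print(counts, toy_accuracy)
```

In the notebook this could be called as `confusion_counts(y_test, y_pred)`. False negatives (malignant cases predicted as benign) are usually the count to watch most closely in a diagnostic setting.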
