## DESCRIPTION

John Cancer Hospital (JCH) is a leading cancer hospital in the USA that specializes in treating breast cancer. Over the last few years, JCH has collected breast cancer data from patients who came for screening or treatment. However, this data has around 30 attributes, which makes it difficult to analyze and to interpret the results. As an ML expert, you have to reduce the number of attributes (dimensionality reduction) so that the results are meaningful and accurate.

Objective: Reduce the number of attributes/features in data to make the analysis of the results comprehensible to doctors.

### Actions to Perform:

1- Use pandas to read data as a dataframe.

2- Check the data. There should be no missing values.

3- Convert the diagnosis column to 1/0 and store in a new column target.

4- Store the encoded column in dataframe and drop the diagnosis column for simplicity.

5- Scale the data so that each feature has unit variance.

6- Transform this data to its first 2 principal components.

7- Plot the two dimensions.

8- Print the explained variance.

9- Try the same with 3 principal components.

10- Check the accuracy with 2 and with 3 principal components.


# Data Set: Cancer Data Set
Features are computed from a digitized image of a fine-needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server: `ftp ftp.cs.wisc.edu`, then `cd math-prog/cpo-dataset/machine-learn/WDBC/`

It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number

2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
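
The field layout described above can be reproduced in a few lines. This is a minimal sketch, assuming the column naming seen in the CSV loaded below (`radius_mean`, `radius_se`, `radius_worst`, and so on); `base_features`, `stats` and `columns` are illustrative names, not part of the dataset.

```python
# The ten base measurements, in the order listed above
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# The three statistics computed per image
stats = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis; the 30 features follow,
# grouped by statistic (all means, then all standard errors, then all worsts).
columns = ["id", "diagnosis"] + [
    f"{feat}_{stat}" for stat in stats for feat in base_features
]

# The description numbers fields from 1, so subtract 1 to index the list.
print(columns[3 - 1], columns[13 - 1], columns[23 - 1])
# prints: radius_mean radius_se radius_worst
```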

In [152]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [153]:
df=pd.read_csv('breast-cancer-data.csv')

In [154]:
df.sample(5)

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 66 | 859464 | B | 9.465 | 21.01 | 60.11 | 269.4 | 0.1044 | 0.07773 | 0.02172 | 0.01504 | ... | 10.41 | 31.56 | 67.03 | 330.7 | 0.1548 | 0.1664 | 0.09412 | 0.06517 | 0.2878 | 0.09211 |
| 482 | 912519 | B | 13.47 | 14.06 | 87.32 | 546.3 | 0.1071 | 0.1155 | 0.05786 | 0.05266 | ... | 14.83 | 18.32 | 94.94 | 660.2 | 0.1393 | 0.2499 | 0.1848 | 0.1335 | 0.3227 | 0.09326 |
| 323 | 895100 | M | 20.34 | 21.51 | 135.9 | 1264.0 | 0.117 | 0.1875 | 0.2565 | 0.1504 | ... | 25.3 | 31.86 | 171.1 | 1938.0 | 0.1592 | 0.4492 | 0.5344 | 0.2685 | 0.5558 | 0.1024 |
| 494 | 914102 | B | 13.16 | 20.54 | 84.06 | 538.7 | 0.07335 | 0.05275 | 0.018 | 0.01256 | ... | 14.5 | 28.46 | 95.29 | 648.3 | 0.1118 | 0.1646 | 0.07698 | 0.04195 | 0.2687 | 0.07429 |
| 518 | 917062 | B | 12.88 | 18.22 | 84.45 | 493.1 | 0.1218 | 0.1661 | 0.04825 | 0.05303 | ... | 15.05 | 24.37 | 99.31 | 674.7 | 0.1456 | 0.2961 | 0.1246 | 0.1096 | 0.2582 | 0.08893 |


In [155]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5
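
`df.info()` reports non-null counts column by column, but with this many columns a direct count of missing values is easier to read. Below is a minimal sketch on a small made-up frame; the values in `demo` are illustrative and include one deliberately missing entry, which the real dataset should not have according to its description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
demo = pd.DataFrame({
    "radius_mean": [9.465, 13.47, np.nan],
    "texture_mean": [21.01, 14.06, 21.51],
})

# Number of missing values per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Total number of missing values across the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

On the actual dataframe the same check would be `df.isna().sum().sum()`, which is expected to be 0 here.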

In [156]:
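# Encode the diagnosis as a number: malignant (M) -> 1, benign (B) -> 0
df.loc[df['diagnosis'] == 'M', 'n_diagnosis'] = 1
df.loc[df['diagnosis'] == 'B', 'n_diagnosis'] = 0
# The text column is no longer needed once it has been encoded
df.drop('diagnosis',axis=1,inplace=True)
# The column was created as float; cast it to int
df['n_diagnosis']= df['n_diagnosis'].astype('int')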
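
The same encoding can be written as a single mapping. This is a sketch on a hypothetical toy frame (`demo` is not the real data), using the column name `target` from the task list rather than `n_diagnosis`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
demo = pd.DataFrame({"diagnosis": ["B", "B", "M", "B", "M"]})

# Map malignant (M) to 1 and benign (B) to 0 in one step
demo["target"] = demo["diagnosis"].map({"M": 1, "B": 0})

# Any label outside the mapping would become NaN, so check before casting
assert demo["target"].notna().all()
demo["target"] = demo["target"].astype(int)

# Class counts; on the full dataset these can be compared with the
# class distribution given in the dataset description
print(demo["target"].value_counts())
```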

In [157]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

In [158]:
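# Separate the label from the features. Note that the id column stays in df,
# so it is scaled and passed to PCA together with the 30 measurements.
target=df['n_diagnosis']
df.drop('n_diagnosis',axis=1,inplace=True)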

# Performing PCA

In [159]:
# First, scale the data so that each feature has zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)

StandardScaler()

In [160]:
scaled_data = scaler.transform(df)
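
A quick way to confirm what `StandardScaler` does is to check the column means and standard deviations after the transform. The sketch below uses randomly generated data (`raw` is hypothetical, not the cancer data) with features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 3 features on very different scales
rng = np.random.default_rng(0)
raw = rng.normal(loc=[10.0, 500.0, 0.1], scale=[2.0, 150.0, 0.02], size=(100, 3))

scaled = StandardScaler().fit_transform(raw)

# After scaling, every column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Without this step, features with large numeric ranges such as area would dominate the principal components simply because of their units.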

In [161]:
#then perform PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=4)

pca.fit(scaled_data)

x_pca = pca.transform(scaled_data)

scaled_data.shape

(569, 31)

In [162]:
x_pca.shape

(569, 4)

The number of features drops from 31 (the 30 measurements plus the id column, which was not dropped) to 4 principal components.
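
Plotting the first two components shows how well they separate the two classes. The sketch below assumes scikit-learn's bundled copy of the Wisconsin diagnostic dataset (`load_breast_cancer`) in place of the CSV; it has no id column, and its labels are coded 0 = malignant, 1 = benign, the reverse of the encoding used in this notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset: 569 rows, 30 features, no id column
X, y = load_breast_cancer(return_X_y=True)

# Standardize, then project onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", alpha=0.7)
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
# Legend entries follow the label values: 0 = malignant, 1 = benign
ax.legend(points.legend_elements()[0], ["malignant", "benign"])
```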

In [163]:
pca.explained_variance_ratio_

array([0.42864701, 0.18376792, 0.09146436, 0.06391475])
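
Summing these ratios shows how much of the total variance is retained when only the first few components are kept. A small sketch using the values printed above (`ratios` and `cumulative` are illustrative names):

```python
import numpy as np

# Explained variance ratios printed above
ratios = np.array([0.42864701, 0.18376792, 0.09146436, 0.06391475])

# Running total: share of variance kept by the first k components
cumulative = np.cumsum(ratios)
for k, share in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {share:.1%}")
```

With these numbers, the first two components keep about 61% of the variance, the first three about 70%, and all four about 77%.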

# Train data

In [164]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

In [165]:
X_train,X_test,y_train,y_test= train_test_split(df, target,test_size=0.2)

In [166]:
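# Scale with the scaler fitted earlier (on the full dataset)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Project onto the 4 principal components fitted earlier
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)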

In [167]:
lreg= LogisticRegression()

In [168]:
lreg.fit(X_train,y_train)

LogisticRegression()

In [169]:
# Accuracy on the held-out test set
lreg.score(X_test, y_test)

0.956140350877193

In [170]:
lreg.score(X_train, y_train)

0.967032967032967
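
The run above uses 4 components, and the scaler and PCA were fitted on the full dataset before the split, so the test rows influenced the transform. The sketch below is one possible way to compare 2 and 3 components while fitting every step on the training split only. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`, no id column, labels coded 0 = malignant, 1 = benign) instead of the CSV, and a fixed `random_state` chosen here purely for repeatability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fixed seed and stratified split so the comparison is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scores = {}
for n in (2, 3):
    # Scaler and PCA are fitted on the training split only
    model = make_pipeline(
        StandardScaler(), PCA(n_components=n), LogisticRegression()
    )
    model.fit(X_train, y_train)
    scores[n] = model.score(X_test, y_test)
    print(f"{n} components: test accuracy = {scores[n]:.3f}")
```

Because the notebook's own split has no fixed seed, its accuracy figures will vary from run to run and need not match what this sketch prints.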