**Predicting Cancer**

In [3]:
#imports
import numpy as np
import pandas as pd

_**Loading the cancer dataset**_

_Dataset details:
https://www.kaggle.com/datasets/erdemtaha/cancer-data_

In [4]:
df = pd.read_csv(r"C:\Users\sarayu sree\Downloads\CancerData\Cancer_Data.csv")
df.head

<bound method NDFrame.head of            id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0      842302         M        17.99         10.38          122.80     1001.0   
1      842517         M        20.57         17.77          132.90     1326.0   
2    84300903         M        19.69         21.25          130.00     1203.0   
3    84348301         M        11.42         20.38           77.58      386.1   
4    84358402         M        20.29         14.34          135.10     1297.0   
..        ...       ...          ...           ...             ...        ...   
564    926424         M        21.56         22.39          142.00     1479.0   
565    926682         M        20.13         28.25          131.20     1261.0   
566    926954         M        16.60         28.08          108.30      858.1   
567    927241         M        20.60         29.33          140.10     1265.0   
568     92751         B         7.76         24.54           47.92      181.0  

_**Data Processing**_

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

_Here **"Unnamed: 32", "id"** features are not useful to predict the cancer in patients. That's the reason we have removed these features from the dataset._

In [6]:
df.drop(columns = ['id', 'Unnamed: 32'], axis = 1,inplace = True) 

_The below code shows the unique values in the feature **"diagnosis"** which is the class label._

In [7]:
df['diagnosis'].unique()

array(['M', 'B'], dtype=object)

_As machine cannot take and predict a catagorical value, it needs to be converted into a number. In this dataset, we have only one catagorical feature that is diagnosis feature (M or B). For that reason, we use LabelEncoder to assign values in the place of catagorical values._ 

In [8]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['diagnosis'] = label_encoder.fit_transform(df['diagnosis']) 
df['diagnosis'].unique()

array([1, 0])

In [9]:
df['diagnosis']

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int32

_It assigned 1 for M(Malignant) and 0 for B(Benign)_

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int32  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

_Here, we are separating the dependent feature(class label: diagnosis) and independent features(features: radius_mean, etc)._

In [11]:
x = df.drop(columns = ['diagnosis'])
y = df['diagnosis']

In [12]:
x

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [13]:
y

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int32

_Normalizing the features can help the machine learning model to take less time to train the model, as different range values can consume more time for training. That's why we used **Z-score normalization** to normalize the independent features which is x._

In [23]:
x = (x - np.mean(x)) / np.std(x)
pd.DataFrame(x)

  return std(axis=axis, dtype=dtype, out=out, ddof=ddof, **kwargs)


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,3160.334775,3160.816312,3175.312066,3175.474123,-1223.250708,2005.143570,2399.534842,1580.189555,922.039956,-5589.691332,...,3165.242470,3165.546765,3176.458199,3176.243628,466.878840,2783.325944,2879.296409,2234.459381,2178.990664,-250.332277
1,3161.067531,3162.536014,3175.728087,3176.398457,-1225.646137,2001.372984,2396.858122,1578.205224,919.823833,-5592.815731,...,3165.161707,3166.536855,3175.689725,3176.132879,465.195542,2780.278835,2877.040133,2233.250389,2175.996152,-251.988101
2,3160.817599,3163.345834,3175.608635,3176.048632,-1223.876964,2002.912981,2398.245447,1579.694310,920.762126,-5592.345086,...,3164.867650,3166.882084,3175.502074,3175.698675,466.098561,2781.792211,2878.041856,2234.118305,2177.392297,-252.067900
3,3158.468801,3163.143379,3173.449445,3173.725285,-1221.535621,2005.262965,2398.797865,1579.108787,922.689824,-5587.036159,...,3163.074316,3167.040042,3173.904659,3173.692369,468.965429,2784.602677,2879.176471,2234.339090,2182.286083,-247.334281
4,3160.988007,3161.737830,3175.818705,3176.315978,-1224.538802,2002.399396,2398.252980,1579.085572,919.812881,-5592.509529,...,3164.654355,3165.439288,3175.493138,3175.463114,465.791710,2780.395885,2877.800061,2232.892564,2175.371689,-252.666391
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,3161.348705,3163.611119,3176.102918,3176.833605,-1223.777332,2002.079116,2398.829253,1579.978044,919.509852,-5592.878106,...,3165.256965,3167.023758,3175.907161,3176.257691,465.949519,2780.435961,2877.851395,2233.792456,2174.879884,-252.978382
565,3160.942565,3164.974781,3175.658064,3176.213590,-1224.716716,2001.842223,2397.575011,1578.920749,919.604777,-5593.005690,...,3164.892500,3168.953457,3175.576538,3175.737349,464.879924,2780.314459,2877.423456,2232.897132,2175.708188,-253.243270
566,3159.939995,3164.935221,3174.714808,3175.067701,-1225.659658,2001.821376,2396.928556,1577.762857,919.013324,-5592.842665,...,3163.917141,3168.280912,3174.733599,3174.670296,464.761567,2781.060015,2877.513649,2232.577373,2175.135493,-252.587701
567,3161.076052,3165.226104,3176.024656,3176.224967,-1223.293407,2005.132199,2400.178912,1580.315945,921.959635,-5590.903383,...,3165.317019,3169.143984,3176.458199,3175.895561,467.001581,2784.614127,2880.384487,2234.453290,2178.159125,-250.049656


_Here, we have used **train_test_split** for splitting the training set and testing set._

In [24]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 5)

_**Building Model**_

_Used **Regularized Logistic Regression model** to make the prediction. I have selected this simple model because the class label contains 2 unique values(M and B)._

In [16]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty = 'l2', C = 0.1)
model.fit(x_train, y_train)

_Finding accuracy on the test set._

In [17]:
accuracy = model.score(x_test, y_test)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9824561403508771


_Example input_

In [18]:
if model.predict([[14.32, 19.48, 90.2, 650.1, 0.08758, 0.07973, 0.05611, 0.03728, 0.1833, 0.065, 0.487, 1.121, 3.392, 50.71, 0.00538, 0.02072, 0.0267, 0.01131, 0.02031, 0.002676, 17.85, 23.68, 117.2, 980.5, 0.1244, 0.2204, 0.2661, 0.1056, 0.3321, 0.091]]) == 1:
    print("Malignant")
else:
    print("Benign")

Malignant




_**Deploying the model using streamlit**_

In [19]:
import pickle

In [20]:
pickle.dump(model, open("deploy.pkl", "wb"))