# Breast Cancer Tumors: A Story Told By Data

**Breast cancer** is a type of cancer caused by cancerous cells that form in the breast. It occurs mostly in women, and the cancerous cells form tumors that can be seen on x-rays.

In this article, we'll use the Python **pandas** library to prepare a dataset for further analysis. The dataset contains several properties of tumors, both cancerous and non-cancerous, recorded as observed from an x-ray.

Let's start by importing the pandas package for data wrangling.

In [22]:
import pandas as pd

## Reading The Dataset

Next, we'll read the CSV file that contains our dataset. You can get the dataset from [here](https://www.kaggle.com/baravindkumar/breast-cancer-cancer-non-cancer-classification).

In [5]:
# read the dataset from the csv file
cancer_df = pd.read_csv("breast_cancer.csv")
# view the data
cancer_df

,Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,outcome
0,0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


Taking a look at the dataset, the first column, **"Unnamed: 0"**, simply duplicates the **index column** to its left. It is not needed, so we'll drop it.

In [32]:
# drop() returns a new DataFrame; cancer_df itself is unchanged unless the result is assigned back
cancer_df.drop(labels=["Unnamed: 0"], axis=1)

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,outcome
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0
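
Because `drop` returns a new DataFrame, the column only stays removed if the result is kept. The sketch below shows two ways to do that, assuming the unwanted column comes from an index that was saved into the CSV. The inline `csv_text` is a hypothetical stand-in for `breast_cancer.csv`, with a few values copied from the rows above.

```python
import io

import pandas as pd

# Hypothetical stand-in for breast_cancer.csv: the empty first header cell
# is what pandas names "Unnamed: 0" when reading the file.
csv_text = """,mean radius,mean texture,outcome
0,17.99,10.38,0
1,20.57,17.77,0
2,19.69,21.25,0
"""

# Option 1: read normally, then assign the result of drop back to the variable.
df = pd.read_csv(io.StringIO(csv_text))
df = df.drop(columns=["Unnamed: 0"])

# Option 2: tell read_csv that the first column is the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.columns.tolist())
print(df_indexed.columns.tolist())
```

Either option leaves only the feature columns and `outcome`; which one to use is mostly a matter of taste.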


### More Info on Data
To get a clearer idea of what the dataset looks like:

In [24]:
# More info on the dataset
cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               569 non-null    int64  
 1   mean radius              569 non-null    float64
 2   mean texture             569 non-null    float64
 3   mean perimeter           569 non-null    float64
 4   mean area                569 non-null    float64
 5   mean smoothness          569 non-null    float64
 6   mean compactness         569 non-null    float64
 7   mean concavity           569 non-null    float64
 8   mean concave points      569 non-null    float64
 9   mean symmetry            569 non-null    float64
 10  mean fractal dimension   569 non-null    float64
 11  radius error             569 non-null    float64
 12  texture error            569 non-null    float64
 13  perimeter error          569 non-null    float64
 14  area error               5

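Looking through the dataset, the column labelled **"outcome"** holds the two classes, 0 and 1, that indicate whether a tumor is cancerous or not. This article reads 1 as cancerous and 0 as non-cancerous. The column names match scikit-learn's copy of the Wisconsin breast cancer data, where 0 means malignant and 1 means benign, so it is worth confirming the encoding on the dataset page before drawing conclusions.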
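
A quick way to see how many rows fall into each class is `value_counts()`. The sketch below uses a small hypothetical frame, `toy_df`, in place of `cancer_df`; only the `outcome` column matters here.

```python
import pandas as pd

# Hypothetical toy frame standing in for cancer_df.
toy_df = pd.DataFrame({"outcome": [0, 0, 1, 1, 1]})

# Count the number of rows for each distinct value of "outcome".
class_counts = toy_df["outcome"].value_counts()
print(class_counts)
```

Running the same call on the real `cancer_df["outcome"]` column should give the class sizes directly, without filtering first.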


## Removing Null Values
Let's attempt to clean the data by checking if the data contains any null values.

In [11]:
# check the dataframe for any null values
cancer_df.isnull().values.any()

False

Now that it's confirmed that the dataset doesn't contain any null values, we can move on to filtering the data.
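
When a dataset does contain nulls, a per-column count is often more useful than a single `True`/`False`. Here is a minimal sketch on a hypothetical frame, `toy_df`, with one missing value inserted on purpose:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in "mean radius".
toy_df = pd.DataFrame({
    "mean radius": [17.99, np.nan, 19.69],
    "mean texture": [10.38, 17.77, 21.25],
})

# Single yes/no answer, as used above.
has_nulls = toy_df.isnull().values.any()

# Number of missing values in each column.
nulls_per_column = toy_df.isnull().sum()

print(has_nulls)
print(nulls_per_column)
```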

## Filtering The Data

- Our first filtering step is to select the rows where the tumor turned out to be cancerous, that is, where **"outcome"** is 1.


In [17]:
# Get the rows where the tumor was caused by cancerous cells
cancerous_df = cancer_df.loc[cancer_df["outcome"] == 1]
cancerous_df

,Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,outcome
19,19,13.540,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.047810,0.1885,...,19.26,99.70,711.2,0.14400,0.17730,0.23900,0.12880,0.2977,0.07259,1
20,20,13.080,15.71,85.63,520.0,0.10750,0.12700,0.04568,0.031100,0.1967,...,20.49,96.09,630.5,0.13120,0.27760,0.18900,0.07283,0.3184,0.08183,1
21,21,9.504,12.44,60.34,273.9,0.10240,0.06492,0.02956,0.020760,0.1815,...,15.66,65.13,314.9,0.13240,0.11480,0.08867,0.06227,0.2450,0.07773,1
37,37,13.030,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.029230,0.1467,...,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169,1
46,46,8.196,16.84,51.71,201.9,0.08600,0.05943,0.01588,0.005917,0.1769,...,21.96,57.26,242.2,0.12970,0.13570,0.06880,0.02564,0.3105,0.07409,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
558,558,14.590,22.68,96.39,657.1,0.08473,0.13300,0.10290,0.037360,0.1454,...,27.27,105.90,733.5,0.10260,0.31710,0.36620,0.11050,0.2258,0.08004,1
559,559,11.510,23.93,74.52,403.5,0.09261,0.10210,0.11120,0.041050,0.1388,...,37.16,82.28,474.2,0.12980,0.25170,0.36300,0.09653,0.2112,0.08732,1
560,560,14.050,27.15,91.38,600.4,0.09929,0.11260,0.04462,0.043040,0.1537,...,33.17,100.20,706.7,0.12410,0.22640,0.13260,0.10480,0.2250,0.08321,1
561,561,11.200,29.37,70.67,386.0,0.07449,0.03558,0.00000,0.000000,0.1060,...,38.30,75.19,439.6,0.09267,0.05494,0.00000,0.00000,0.1566,0.05905,1
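
The filter works through a boolean mask: the comparison produces one `True`/`False` value per row, and `.loc` keeps the rows marked `True`. The sketch below illustrates this on a hypothetical four-row frame, `toy_df`, built from a few values in the tables above.

```python
import pandas as pd

# Hypothetical toy frame standing in for cancer_df.
toy_df = pd.DataFrame({
    "mean radius": [17.99, 13.54, 20.57, 13.08],
    "outcome": [0, 1, 0, 1],
})

# The comparison yields a boolean Series aligned with the rows.
mask = toy_df["outcome"] == 1

# .loc keeps only the rows where the mask is True.
selected = toy_df.loc[mask]

print(mask.tolist())
print(selected)
```

The selected rows keep their original index labels, which is why the filtered output above starts at 19 rather than 0.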


- Next, let's get a statistical description of the filtered data (the cases where the tumor turned out to be cancerous).

In [20]:
# Describe the data for the cases where the tumor was cancerous
cancerous_df.describe()

,Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,outcome
count,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,...,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0
mean,319.89916,12.146524,17.914762,78.075406,462.790196,0.092478,0.080085,0.046058,0.025717,0.174186,...,23.51507,87.005938,558.89944,0.124959,0.182673,0.166238,0.074444,0.270246,0.079442,1.0
std,153.157795,1.780512,3.995125,11.807438,134.287118,0.013446,0.03375,0.043442,0.015909,0.024807,...,5.493955,13.527091,163.601424,0.020013,0.09218,0.140368,0.035797,0.041745,0.013804,0.0
min,19.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1566,0.05521,1.0
25%,185.0,11.08,15.15,70.87,378.2,0.08306,0.05562,0.02031,0.01502,0.158,...,19.58,78.27,447.1,0.1104,0.112,0.07708,0.05104,0.2406,0.07009,1.0
50%,332.0,12.2,17.39,78.18,458.4,0.09076,0.07529,0.03709,0.02344,0.1714,...,22.82,86.92,547.4,0.1254,0.1698,0.1412,0.07431,0.2687,0.07712,1.0
75%,453.0,13.37,19.76,86.1,551.1,0.1007,0.09755,0.05999,0.03251,0.189,...,26.51,96.59,670.0,0.1376,0.2302,0.2216,0.09749,0.2983,0.08541,1.0
max,568.0,17.85,33.81,114.6,992.1,0.1634,0.2239,0.4108,0.08534,0.2743,...,41.78,127.1,1210.0,0.2006,0.5849,1.252,0.175,0.4228,0.1486,1.0


> The description above gives useful information at a glance: there were **357** cases where the tumor was cancerous, and the average **"mean radius"** for those tumors is about **12.15**.

- Repeating the same process for the cases where the value in the **"outcome"** column is 0, that is, where the tumor was non-cancerous.

In [19]:
# Get the rows where the tumor was non-cancerous
non_cancerous_df = cancer_df.loc[cancer_df["outcome"] == 0]
non_cancerous_df

,Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,outcome
0,0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.60,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.11890,0
1,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.80,1956.0,0.1238,0.1866,0.2416,0.1860,0.2750,0.08902,0
2,2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.50,1709.0,0.1444,0.4245,0.4504,0.2430,0.3613,0.08758,0
3,3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,26.50,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.17300,0
4,4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.20,1575.0,0.1374,0.2050,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
563,563,20.92,25.09,143.00,1347.0,0.10990,0.22360,0.31740,0.14740,0.2149,...,29.41,179.10,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,0
564,564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.10,2027.0,0.1410,0.2113,0.4107,0.2216,0.2060,0.07115,0
565,565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.00,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,0
566,566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.70,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.07820,0


- Describing the data for when the tumor wasn't cancerous

In [21]:
# Describe the data for the cases where the tumor was non-cancerous
non_cancerous_df.describe()

,Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,outcome
count,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,...,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0
mean,223.54717,17.46283,21.604906,115.365377,978.376415,0.102898,0.145188,0.160775,0.08799,0.192909,...,29.318208,141.37033,1422.286321,0.144845,0.374824,0.450606,0.182237,0.323468,0.09153,0.0
std,165.308422,3.203971,3.77947,21.854653,367.937978,0.012608,0.053987,0.075019,0.034374,0.027638,...,5.434804,29.457055,597.967743,0.02187,0.170372,0.181507,0.046308,0.074685,0.021553,0.0
min,0.0,10.95,10.38,71.9,361.6,0.07371,0.04605,0.02398,0.02031,0.1308,...,16.67,85.1,508.1,0.08822,0.05131,0.02398,0.02899,0.1565,0.05504,0.0
25%,74.5,15.075,19.3275,98.745,705.3,0.09401,0.1096,0.109525,0.06462,0.17405,...,25.7825,119.325,970.3,0.130475,0.244475,0.326425,0.15275,0.2765,0.076302,0.0
50%,202.5,17.325,21.46,114.2,932.0,0.1022,0.13235,0.15135,0.08628,0.1899,...,28.945,138.0,1303.0,0.14345,0.35635,0.4049,0.182,0.3103,0.0876,0.0
75%,351.25,19.59,23.765,129.925,1203.75,0.110925,0.1724,0.20305,0.103175,0.20985,...,32.69,159.8,1712.75,0.155975,0.44785,0.556175,0.210675,0.359225,0.102625,0.0
max,567.0,28.11,39.28,188.5,2501.0,0.1447,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.17,0.291,0.6638,0.2075,0.0


> From the above description, we can see that there were **212 cases** where the tumor was non-cancerous, and the average **"mean radius"** of those tumors is about **17.46**.
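
The two filter-then-describe passes can also be done in one step with `groupby`, which puts the figures for both classes side by side. The sketch below is one possible way to do it, not the approach used in this article; `toy_df` is a hypothetical four-row stand-in for `cancer_df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for cancer_df.
toy_df = pd.DataFrame({
    "mean radius": [17.99, 20.57, 13.54, 13.08],
    "outcome": [0, 0, 1, 1],
})

# One row per outcome class, with the case count and the average mean radius.
summary = toy_df.groupby("outcome")["mean radius"].agg(["count", "mean"])
print(summary)
```

On the real data, `cancer_df.groupby("outcome").describe()` would give the full set of statistics for every column, split by class.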

Find the code used in this article [here]()