Label Encoding: converting categorical labels into numerical form

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [None]:
cancer_data = pd.read_csv('/content/drive/MyDrive/DataSet/cancer_data.csv')
cancer_data.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


Here's a breakdown of the columns in the `cancer_data` DataFrame and the overall meaning of the dataset:

*   **id**: A unique identifier for each record.
*   **diagnosis**: The diagnosis of the mass ('M' for malignant, 'B' for benign). This is likely the target variable you would want to predict.
*   **radius\_mean**: The mean of distances from the center to points on the perimeter.
*   **texture\_mean**: The mean of the standard deviation of gray-scale values.
*   **perimeter\_mean**: The mean perimeter of the cell nuclei.
*   **area\_mean**: The mean area of the cell nuclei.
*   **smoothness\_mean**: The mean of local variation in radius lengths.
*   **compactness\_mean**: The mean of `perimeter^2 / area - 1.0`.
*   **concavity\_mean**: The mean of the severity of concave portions of the contour.
*   **concave points\_mean**: The mean number of concave portions of the contour.
*   **symmetry\_mean**: The mean symmetry of the cell nuclei.
*   **fractal\_dimension\_mean**: The mean of "coastline approximation" - 1.

The dataset contains these features for cell nuclei, and each measurement also appears with an `_se` suffix (standard error) and a `_worst` suffix (the mean of the three largest values). It is a common dataset for binary classification tasks, specifically predicting whether a tumor is malignant or benign from the measured characteristics of its cell nuclei.

In [None]:
cancer_data.shape

(569, 33)

In [None]:
# Count how many samples carry each label
cancer_data['diagnosis'].value_counts()

diagnosis,count
B,357
M,212


In [None]:
label_encoder = LabelEncoder()

In [None]:
labels = label_encoder.fit_transform(cancer_data['diagnosis'])

In [None]:
# Append the encoded labels to the DataFrame as a new 'target' column
cancer_data['target'] = labels

In [None]:
cancer_data.tail(10)

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32,target
559,925291,B,11.51,23.93,74.52,403.5,0.09261,0.1021,0.1112,0.04105,...,82.28,474.2,0.1298,0.2517,0.363,0.09653,0.2112,0.08732,,0
560,925292,B,14.05,27.15,91.38,600.4,0.09929,0.1126,0.04462,0.04304,...,100.2,706.7,0.1241,0.2264,0.1326,0.1048,0.225,0.08321,,0
561,925311,B,11.2,29.37,70.67,386.0,0.07449,0.03558,0.0,0.0,...,75.19,439.6,0.09267,0.05494,0.0,0.0,0.1566,0.05905,,0
562,925622,M,15.22,30.62,103.4,716.9,0.1048,0.2087,0.255,0.09429,...,128.7,915.0,0.1417,0.7917,1.17,0.2356,0.4089,0.1409,,1
563,926125,M,20.92,25.09,143.0,1347.0,0.1099,0.2236,0.3174,0.1474,...,179.1,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,,1
564,926424,M,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,...,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,,1
565,926682,M,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,...,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,,1
566,926954,M,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,...,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,,1
567,927241,M,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,...,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,,1
568,92751,B,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,...,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,,0


`LabelEncoder` assigns integer codes in sorted order of the class names, so 'B' comes before 'M':

0 --> Benign

1 --> Malignant
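
Rather than reading the mapping off the table, it can be checked on the encoder itself. The following is a minimal sketch on a toy list standing in for the `diagnosis` column (the list and the name `encoder` are illustrative, not part of the notebook above). It uses the fitted encoder's `classes_` attribute and `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'diagnosis' column (illustrative only).
diagnosis = ["M", "B", "B", "M", "B"]

encoder = LabelEncoder()
codes = encoder.fit_transform(diagnosis)

# classes_ holds the sorted unique labels; a label's position is its code.
print(encoder.classes_)  # ['B' 'M']
print(codes)             # [1 0 0 1 0]

# inverse_transform maps integer codes back to the original labels.
print(encoder.inverse_transform([0, 1]))  # ['B' 'M']
```

On the real data, `label_encoder.classes_` inspected right after fitting on `cancer_data['diagnosis']` should show the same ordering.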

In [None]:
cancer_data['target'].value_counts()

target,count
0,357
1,212
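
Matching counts suggest the encoding is consistent, but a cross-tabulation of the original and encoded columns makes the check explicit. This is a sketch on a small stand-in frame (`df` and `check` are hypothetical names), assuming the same `diagnosis` and `target` column names as above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in frame; the real cancer_data has 569 rows.
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})
df["target"] = LabelEncoder().fit_transform(df["diagnosis"])

# Rows are original labels, columns are encoded values.
# Each row should have exactly one non-zero cell.
check = pd.crosstab(df["diagnosis"], df["target"])
print(check)
```

If every original label falls into a single encoded column, the mapping is one-to-one.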


In [None]:
iris_data = pd.read_csv('/content/drive/MyDrive/DataSet/iris_data.csv')
iris_data.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
iris_data['Species'].value_counts()

Species,count
Iris-setosa,50
Iris-versicolor,50
Iris-virginica,50


In [None]:
# Refitting the same encoder on 'Species' replaces the classes it learned from 'diagnosis'
iris_labels = label_encoder.fit_transform(iris_data['Species'])
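
Calling `fit_transform` again refits the encoder, so a single `LabelEncoder` only remembers the last column it saw. If the codes need to be decoded later, one option is to keep a separate encoder per column. The sketch below uses toy lists and hypothetical names (`diagnosis_encoder`, `species_encoder`) to show the difference:

```python
from sklearn.preprocessing import LabelEncoder

diagnosis = ["M", "B", "M"]
species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]

# Reusing one encoder: the second fit overwrites the first.
encoder = LabelEncoder()
encoder.fit_transform(diagnosis)
encoder.fit_transform(species)
print(encoder.classes_)  # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

# One encoder per column keeps both mappings available.
diagnosis_encoder = LabelEncoder().fit(diagnosis)
species_encoder = LabelEncoder().fit(species)
print(diagnosis_encoder.inverse_transform([1]))  # ['M']
print(species_encoder.inverse_transform([2]))    # ['Iris-virginica']
```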

In [None]:
# Store the codes in a new lowercase 'species' column, next to the original 'Species'
iris_data['species'] = iris_labels

In [None]:
iris_data.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,species
0,1,5.1,3.5,1.4,0.2,Iris-setosa,0
1,2,4.9,3.0,1.4,0.2,Iris-setosa,0
2,3,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5,5.0,3.6,1.4,0.2,Iris-setosa,0


In [None]:
iris_data['species'].value_counts()

species,count
0,50
1,50
2,50
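
With three classes, the counts alone do not say which integer belongs to which species. A small lookup table can be built from the fitted encoder. This is a sketch on a toy list, and the `mapping` dictionary is an illustrative helper rather than something the notebook defines:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Species' column (illustrative only).
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
encoder = LabelEncoder().fit(species)

# A class's index in classes_ is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
# {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
```

The codes follow the alphabetical order of the class names and carry no other meaning, so code 2 is not "greater" than code 0 in any botanical sense.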
