<a href="https://colab.research.google.com/github/niconomist98/DataAnalyticsUQ/blob/main/Notebooks/03-BSC%20Data%20exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Python for data handling*

## Problem definition
The dataset contains features extracted from a microscope photograph of a well-defined cellular area in a biopsy of a breast mass.
The features are computed from a digitized image of a fine needle aspirate (FNA) of the mass.
The data describe 10 characteristics of each sample that capture the size, shape, and texture of the cell nuclei present in the image.
The mean, the standard error, and the extreme ("worst") values are computed for each of the 10 characteristics of every image, giving a total of 30 features per sample.
The dataset, obtained by Dr. Wolberg, is known as the Wisconsin Breast Cancer Data and has been used to study and correctly classify cases of malignant tumors.
The goal is to use machine learning to determine whether the analyzed mass is benign or malignant. To that end, we will experiment with different classification models to reach results that can help detect cancer cases.


For more information about the procedure, see: https://pages.cs.wisc.edu/~olvi/uwmp/mpml.html
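
The "10 characteristics times 3 statistics" structure described above can be sketched as follows. This is only an illustration: the base names are assumed from the column names that appear later in this notebook (including the space in `concave points`), not taken from an official schema.

```python
# The ten base measurements (names assumed from the columns used later
# in this notebook, e.g. "concave points" is written with a space).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three statistics per measurement: mean, standard error, and "worst".
statistics = ["mean", "se", "worst"]

# Combine every statistic with every measurement.
feature_names = [f"{feature}_{stat}" for stat in statistics for feature in base_features]

print(len(feature_names))  # 30
```

Comparing a list like this with `df.columns` is one quick way to spot unexpected extra columns in the raw file.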


| Feature name          | Type   | Missing % | Description and values                                              |
|-----------------------|--------|-----------|----------------------------------------------------------------------|
| diagnosis (Target)    | Object | 4.61%     | Diagnosis: B (Benign) M (Malignant) for each case                   |
| erty                  | int64  | 0%        | ...                                                                  |
| iuytr                 | Object | 4.18%     | ...                                                                  |
| id                    | int64  | 3.09%     | ID number: unique identifier of the observation                     |
| index                 | int64  | 0%        | ...                                                                  |
| radius_mean           | Object | 3.60%     | Mean of distances from center to points on the perimeter             |
| texture_mean          | Object | 3.83%     | Standard deviation of gray-scale values                             |
| perimeter_mean        | Object | 3.85%     | Mean size of the core tumor                                          |
| area_mean             | Object | 4.46%     | ...                                                                  |
| smoothness_mean       | Object | 4.49%     | Mean of local variation in radius lengths                           |
| compactness_mean      | Object | 3.93%     | Mean of perimeter^2 / area - 1.0                                    |
| concavity_mean        | Object | 4.36%     | Mean of severity of concave portions of the contour                  |
| concave points_mean   | Object | 3.70%     | Mean for number of concave portions of the contour                   |
| symmetry_mean         | Object | 4.18%     | ...                                                                  |
| fractal_dimension_mean| Object | 3.37%     | Mean for “coastline approximation” - 1                               |
| radius_se             | Object | 3.30%     | Standard error for the mean of distances from center to points on the perimeter |
| texture_se            | Object | 3.37%     | Standard error for standard deviation of gray-scale values           |
| perimeter_se          | Object | 3.52%     | ...                                                                  |
| area_se               | Object | 4.13%     | ...                                                                  |
| smoothness_se         | Object | 3.04%     | Standard error for local variation in radius lengths                |
| compactness_se        | Object | 3.57%     | Standard error for perimeter^2 / area - 1.0                          |
| concavity_se          | Object | 4.08%     | Standard error for severity of concave portions of the contour       |
| concave points_se     | Object | 3.6%      | Standard error for number of concave portions of the contour          |
| symmetry_se           | Object | 3.22%     | ...                                                                  |
| fractal_dimension_se  | Object | 3.42%     | Standard error for “coastline approximation” - 1                     |
| radius_worst          | Object | 3.17%     | “Worst” or largest mean value for mean of distances from center to points on the perimeter |
| texture_worst         | Object | 3.29%     | “Worst” or largest mean value for standard deviation of gray-scale values |
| perimeter_worst       | Object | 3.23%     | ...                                                                  |
| area_worst            | Object | 3.29%     | ...                                                                  |
| smoothness_worst      | Object | 3.57%     | “Worst” or largest mean value for local variation in radius lengths  |
| compactness_worst     | Object | 3.09%     | “Worst” or largest mean value for perimeter^2 / area - 1.0           |
| concavity_worst       | Object | 3.32%     | “Worst” or largest mean value for severity of concave portions of the contour |
| concave points_worst  | Object | 2.96%     | ...                                                                  |
| symmetry_worst        | Object | 3.17%     | “Worst” or largest mean value for symmetry                           |
| fractal_dimension_worst| Object | 3.12%     | “Worst” or largest mean value for “coastline approximation” - 1      |



**Checking the data types and values, we can see that the columns do not have the data types they should have: dtype Object must be converted to a numeric data type before the analysis can be performed. Each column also has a few missing values. We'll fix these errors.**
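
As a minimal sketch of the conversion mentioned above, `pd.to_numeric` with `errors="coerce"` turns an object column into floats and marks anything unparseable as missing. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column mixing numeric strings with junk tokens (illustrative values).
raw = pd.Series(["17.99", "20.57", "?", "not a number", None], name="radius_mean")

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 3
```

Note that a numeric-looking error code such as `-88888765432345.0` parses as a valid number, so codes of that kind still have to be replaced explicitly.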

# **Installing libraries in the environment**

In [1]:
!pip install fastparquet

Collecting fastparquet
  Downloading fastparquet-2024.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Downloading fastparquet-2024.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 15.8 MB/s eta 0:00:00
Installing collected packages: fastparquet
Successfully installed fastparquet-2024.11.0


# Import libraries

In [2]:
import pandas as pd
import numpy as np
import fastparquet
import pyarrow

# Load datasets

In [7]:
url_breast_cancer_dataset = 'https://raw.githubusercontent.com/niconomist98/DataAnalyticsUQ/main/Datos/breast_cancer_dataset/BreastCancerDS.csv'
df=pd.read_csv(url_breast_cancer_dataset,index_col=0)

# General description of the dataset

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19710 entries, 0 to 19709
Data columns (total 35 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   index                    19710 non-null  int64 
 1   perimeter_se             19015 non-null  object
 2   radius_worst             19085 non-null  object
 3   concave points_mean      18980 non-null  object
 4   smoothness_mean          18825 non-null  object
 5   area_mean                18830 non-null  object
 6   concavity_se             18905 non-null  object
 7   texture_mean             18955 non-null  object
 8   concavity_worst          19055 non-null  object
 9   smoothness_se            19110 non-null  object
 10  concave points_se        19000 non-null  object
 11  area_worst               19060 non-null  object
 12  compactness_mean         18935 non-null  object
 13  radius_mean              19000 non-null  object
 14  area_se                  18895 non-null  ob

## General data cleaning and quality

In [9]:
df.isnull().sum()  # number of nulls per column

Unnamed: 0,0
index,0
perimeter_se,695
radius_worst,625
concave points_mean,730
smoothness_mean,885
area_mean,880
concavity_se,805
texture_mean,755
concavity_worst,655
smoothness_se,600


In [10]:
df[['index','id']]  # show the contents of these columns together with their row index


Unnamed: 0,index,id
0,223,999765432456788
1,279,8911834
2,307,89346
3,571,857343.0
4,576,85759902
...,...,...
19705,350,899187
19706,442,90944601
19707,279,8911834
19708,501,91504


In [11]:
print("index null count : "+str(df["index"].isnull().sum()), ", id null count: "+ str(df["id"].isnull().sum()))

index null count : 0 , id null count: 610


In [12]:
df.duplicated().value_counts()  # check for fully duplicated rows


Unnamed: 0,count
True,15773
False,3937


In [13]:
df.duplicated(subset=['index', 'id']).value_counts()


Unnamed: 0,count
True,17872
False,1838
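
The two counts above differ because `duplicated()` compares whole rows, while `duplicated(subset=[...])` compares only the key columns. A small sketch with made-up rows shows the difference; in it, two rows share the same `index` and `id` but differ in one corrupted cell.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical; rows 2 and 3 share keys only.
toy = pd.DataFrame({
    "index": [1, 1, 2, 2],
    "id": [10, 10, 20, 20],
    "radius_mean": ["14.2", "14.2", "11.0", "?"],
})

# Full-row duplicates: only row 1 repeats row 0 exactly.
full_dupes = toy.duplicated().sum()

# Key-only duplicates: rows 1 and 3 repeat an earlier (index, id) pair.
key_dupes = toy.duplicated(subset=["index", "id"]).sum()

print(full_dupes, key_dupes)  # 1 2
```

This is consistent with the subset count being at least as large as the full-row count.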


In [14]:
df['diagnosis'].value_counts()

Unnamed: 0_level_0,count
diagnosis,Unnamed: 1_level_1
B,10245
M,5935
999765432456788,655
-88888765432345,655
?,655
rxctf378968 7656463sdfg,655


In [15]:
df[df['diagnosis']=='B'][['index','perimeter_se','diagnosis']].head(10)

Unnamed: 0,index,perimeter_se,diagnosis
1,279,1.83,B
2,307,1144.0,B
4,576,2183.0,B
5,350,2225.0,B
7,120,1103.0,B
10,401,,B
11,189,1687.0,B
12,52,1.52,B
14,241,757.0,B
19,187,1742.0,B


In [16]:
# Show the label counts after filtering the dataset by conditions
df[(df['diagnosis']=='B')|(df['diagnosis']=='M')]['diagnosis'].value_counts()


Unnamed: 0_level_0,count
diagnosis,Unnamed: 1_level_1
B,10245
M,5935


# Exploring the filtered dataset without modifying df

In [17]:
df[(df['diagnosis']=='-88888765432345')|(df['diagnosis']=='999765432456788')]['diagnosis'].value_counts()

Unnamed: 0_level_0,count
diagnosis,Unnamed: 1_level_1
999765432456788,655
-88888765432345,655


In [18]:
df[(df['diagnosis']=='?')|(df['diagnosis']=='rxctf378968 7656463sdfg')]['diagnosis'].value_counts()

Unnamed: 0_level_0,count
diagnosis,Unnamed: 1_level_1
rxctf378968 7656463sdfg,655


# Saving the filtered dataset back into df

In [19]:
# Keep only the rows labeled B or M; copy so later in-place edits act on an independent frame
df = df[df['diagnosis'].isin(['B', 'M'])].copy()

In [20]:
df['diagnosis'].value_counts()  # only the correctly labeled records remain

Unnamed: 0_level_0,count
diagnosis,Unnamed: 1_level_1
B,10245
M,5935
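
The label filter can also be written with `isin`, which scales better than chaining `|` conditions when there are several valid labels. The following is a small sketch on made-up rows; `valid_labels` is an illustrative name.

```python
import pandas as pd

# Toy labels: valid classes mixed with junk tokens like those seen above.
toy = pd.DataFrame({"diagnosis": ["B", "M", "?", "999765432456788", "B"]})

valid_labels = ["B", "M"]

# isin builds the boolean mask; copy() makes the result independent of toy.
clean = toy[toy["diagnosis"].isin(valid_labels)].copy()

print(clean["diagnosis"].value_counts().to_dict())  # {'B': 2, 'M': 1}
```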


# Drop columns

In [21]:
df['erty'].value_counts()  # every value is identical, so erty is dropped


Unnamed: 0_level_0,count
erty,Unnamed: 1_level_1
908765434567,16180


In [22]:
df.drop('erty',axis=1,inplace=True)
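
Constant columns like `erty` can also be found programmatically instead of by inspection. This is a sketch on a made-up frame, assuming that "constant" means at most one distinct value when missing values are counted too.

```python
import pandas as pd

# Toy frame: 'erty' holds a single repeated value, 'area_mean' varies.
toy = pd.DataFrame({
    "erty": [908765434567, 908765434567, 908765434567],
    "area_mean": [587.4, 246.3, 409.0],
})

# A column with at most one distinct value carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]

toy = toy.drop(columns=constant_cols)

print(constant_cols)      # ['erty']
print(list(toy.columns))  # ['area_mean']
```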

In [23]:
df[['iuytr','symmetry_mean']]  # both columns carry the same information, so one is dropped


Unnamed: 0,iuytr,symmetry_mean
1,211.0,211.0
2,-88888765432345.0,-88888765432345.0
4,?,?
5,0.1671,0.1671
7,0.1667,0.1667
...,...,...
19703,216.0,216.0
19705,0.1671,0.1671
19706,0.1405,0.1405
19708,0.2275,0.2275


In [24]:
df.drop('iuytr',axis=1,inplace=True)
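
Before dropping a column as redundant, it can help to confirm that the two columns really match on every row, not just on the rows displayed. A minimal sketch with made-up values uses `Series.equals`, which treats missing values in the same position as equal.

```python
import pandas as pd

# Toy frame where 'iuytr' is an exact copy of 'symmetry_mean'.
toy = pd.DataFrame({
    "iuytr": ["211.0", "?", "0.1671"],
    "symmetry_mean": ["211.0", "?", "0.1671"],
    "area_mean": ["587.4", "246.3", "409.0"],
})

# True only if every value (and the row index) matches.
same = toy["iuytr"].equals(toy["symmetry_mean"])

# Drop the redundant copy only after the check succeeds.
if same:
    toy = toy.drop(columns="iuytr")

print(same)               # True
print(list(toy.columns))  # ['symmetry_mean', 'area_mean']
```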

In [25]:
df

Unnamed: 0,index,perimeter_se,radius_worst,concave points_mean,smoothness_mean,area_mean,concavity_se,texture_mean,concavity_worst,smoothness_se,...,id,symmetry_mean,symmetry_worst,diagnosis,fractal_dimension_se,perimeter_mean,compactness_worst,symmetry_se,compactness_se,radius_se
1,279,1.83,999765432456788.0,0.03711,0.09516,587.4,0.01457,15.18,0.1456,0.004235,...,8911834,211.0,0.2955,B,0.001593,999765432456788.0,0.1724,0.01528,0.01541,999765432456788.0
2,307,1144.0,9699.0,0.003472,-88888765432345.0,246.3,0.003681,14.4,0.01472,0.007389,...,89346,-88888765432345.0,0.2991,B,0.002153,56.36,0.05232,0.02701,0.004883,0.1746
4,576,2183.0,?,0.02278,0.09524,409.0,0.01349,18.75,?,0.008328,...,85759902,?,0.3306,B,0.002386,73.34,?,0.03218,0.008722,0.3249
5,350,2225.0,13.28,0.01162,0.07561,421.0,0.005949,17.07,0.03046,0.006583,...,899187,0.1671,0.2731,B,0.002668,73.7,0.06476,0.02216,0.006991,0.3534
7,120,1103.0,12.82,0.02623,0.09373,403.3,0.01514,10.82,0.2102,?,...,865137,0.1667,0.3016,B,0.002206,73.34,239.0,0.01344,?,?
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19703,602,2844.0,?,0.05613,0.1008,809.8,0.02219,21.54,0.2992,0.004877,...,86730502,216.0,?,M,?,106.2,0.3055,0.01535,0.01952,0.4332
19705,350,2225.0,rxctf378968 7656463sdfg,0.01162,0.07561,421.0,0.005949,17.07,0.03046,rxctf378968 7656463sdfg,...,899187,0.1671,0.2731,B,rxctf378968 7656463sdfg,73.7,0.06476,0.02216,0.006991,0.3534
19706,442,2235.0,15.27,0.009937,-88888765432345.0,585.9,0.007741,15.79,0.03517,-88888765432345.0,...,90944601,0.1405,0.1859,B,0.002564,-88888765432345.0,0.1071,-88888765432345.0,0.01156,0.3563
19708,501,2974.0,16.01,0.06759,0.1162,595.9,0.03476,24.49,0.3381,0.00968,...,91504,0.2275,0.3651,M,0.006995,92.33,0.3966,0.02434,0.03856,0.4751


# Missing values

In [26]:
missing = df.isnull().sum()
missing[missing>0]*100/len(df)

Unnamed: 0,0
perimeter_se,3.275649
radius_worst,3.182942
concave points_mean,3.244747
smoothness_mean,4.295426
area_mean,4.140915
concavity_se,3.831891
texture_mean,3.522868
concavity_worst,3.213844
smoothness_se,2.6267
concave points_se,3.430161
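
The same percentage can be obtained in one step: the mean of a boolean mask is the fraction of `True` values. This is a sketch on made-up rows.

```python
import pandas as pd

# Toy frame with one missing value out of four in 'perimeter_se'.
toy = pd.DataFrame({
    "index": [279, 307, 576, 350],
    "perimeter_se": [1.83, None, 2183.0, 2225.0],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
missing_pct = toy.isna().mean() * 100

# Keep only the columns that actually have missing values.
missing_pct = missing_pct[missing_pct > 0]

print(missing_pct.to_dict())  # {'perimeter_se': 25.0}
```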


## Whitespace

In [27]:
df['index'] = df['index'].astype('string')
# Columns to clean; a descriptive name avoids shadowing the built-in `list`.
text_columns = ['index', 'perimeter_se', 'radius_worst', 'concave points_mean',
       'smoothness_mean', 'area_mean', 'concavity_se', 'texture_mean',
       'concavity_worst', 'smoothness_se', 'concave points_se', 'area_worst',
       'compactness_mean', 'radius_mean', 'area_se', 'concave points_worst'
       , 'fractal_dimension_worst', 'perimeter_worst', 'texture_se',
       'fractal_dimension_mean', 'texture_worst', 'smoothness_worst',
       'concavity_mean', 'id', 'symmetry_mean', 'symmetry_worst', 'diagnosis',
       'fractal_dimension_se', 'perimeter_mean', 'compactness_worst',
       'symmetry_se', 'compactness_se', 'radius_se']
# Strip leading and trailing whitespace from every text column.
for col in text_columns:
    df[col] = df[col].str.strip()

In [28]:
for i in df.columns:
    print(df[i].value_counts().head(6))
    print('\n'+'------------------------')

index
637    30
20     30
556    30
521    30
547    30
266    30
Name: count, dtype: Int64

------------------------
perimeter_se
rxctf378968 7656463sdfg    545
-88888765432345.0          525
?                          475
999765432456788.0          465
1778.0                      85
5801.0                      80
Name: count, dtype: int64

------------------------
radius_worst
-88888765432345.0          520
999765432456788.0          515
rxctf378968 7656463sdfg    510
?                          490
13.34                      115
14.34                      115
Name: count, dtype: int64

------------------------
concave points_mean
-88888765432345.0          535
rxctf378968 7656463sdfg    535
999765432456788.0          500
?                          485
0.0                        300
0.05564                     80
Name: count, dtype: int64

------------------------
smoothness_mean
rxctf378968 7656463sdfg    555
999765432456788.0          505
?                          500
-888887654323

# Building a list of the error values and using it to replace them with NaN

In [29]:
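# The same junk tokens appear in every column; treat them as missing values.
# Variants without a trailing ".0" (as seen in 'id') are not matched here;
# 'id' is dropped further down.
error_values = ['rxctf378968 7656463sdfg', '-88888765432345.0', '999765432456788.0', '?']
df = df.replace(error_values, np.nan)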
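
As an alternative, the error tokens can be declared when the file is read, through the `na_values` parameter of `pd.read_csv`. This is a sketch on a small in-memory CSV with made-up rows, not on the real file. It assumes the tokens appear exactly as listed, without surrounding whitespace.

```python
import io

import pandas as pd

# Tiny in-memory CSV that mimics the junk tokens found in the dataset.
csv_text = (
    "radius_se,texture_se\n"
    "0.1746,?\n"
    "999765432456788.0,1.2\n"
    "rxctf378968 7656463sdfg,-88888765432345.0\n"
)

error_values = ["rxctf378968 7656463sdfg", "-88888765432345.0", "999765432456788.0", "?"]

# Tokens listed in na_values are parsed as NaN, so the columns load as floats.
toy = pd.read_csv(io.StringIO(csv_text), na_values=error_values)

print(toy.isna().sum().to_dict())
print(toy.dtypes.astype(str).to_dict())
```

With this approach the columns arrive as numeric, which removes the need for a separate replacement pass when the tokens are known in advance.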


In [30]:
df

Unnamed: 0,index,perimeter_se,radius_worst,concave points_mean,smoothness_mean,area_mean,concavity_se,texture_mean,concavity_worst,smoothness_se,...,id,symmetry_mean,symmetry_worst,diagnosis,fractal_dimension_se,perimeter_mean,compactness_worst,symmetry_se,compactness_se,radius_se
1,279,1.83,,0.03711,0.09516,587.4,0.01457,15.18,0.1456,0.004235,...,8911834,211.0,0.2955,B,0.001593,,0.1724,0.01528,0.01541,
2,307,1144.0,9699.0,0.003472,,246.3,0.003681,14.4,0.01472,0.007389,...,89346,,0.2991,B,0.002153,56.36,0.05232,0.02701,0.004883,0.1746
4,576,2183.0,,0.02278,0.09524,409.0,0.01349,18.75,,0.008328,...,85759902,,0.3306,B,0.002386,73.34,,0.03218,0.008722,0.3249
5,350,2225.0,13.28,0.01162,0.07561,421.0,0.005949,17.07,0.03046,0.006583,...,899187,0.1671,0.2731,B,0.002668,73.7,0.06476,0.02216,0.006991,0.3534
7,120,1103.0,12.82,0.02623,0.09373,403.3,0.01514,10.82,0.2102,,...,865137,0.1667,0.3016,B,0.002206,73.34,239.0,0.01344,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19703,602,2844.0,,0.05613,0.1008,809.8,0.02219,21.54,0.2992,0.004877,...,86730502,216.0,,M,,106.2,0.3055,0.01535,0.01952,0.4332
19705,350,2225.0,,0.01162,0.07561,421.0,0.005949,17.07,0.03046,,...,899187,0.1671,0.2731,B,,73.7,0.06476,0.02216,0.006991,0.3534
19706,442,2235.0,15.27,0.009937,,585.9,0.007741,15.79,0.03517,,...,90944601,0.1405,0.1859,B,0.002564,,0.1071,,0.01156,0.3563
19708,501,2974.0,16.01,0.06759,0.1162,595.9,0.03476,24.49,0.3381,0.00968,...,91504,0.2275,0.3651,M,0.006995,92.33,0.3966,0.02434,0.03856,0.4751


# Encoding

In [31]:
from sklearn import preprocessing
label_encoding = preprocessing.LabelEncoder()
df['diagnosis'] = label_encoding.fit_transform(df['diagnosis'])

## 0 is benign and 1 is malignant
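
`LabelEncoder` assigns codes following the sorted order of the classes, which is why `B` becomes 0 and `M` becomes 1. An explicit mapping makes that choice visible in the code. The sketch below uses pandas only; the `mapping` dictionary is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Toy labels (illustrative values).
labels = pd.Series(["B", "M", "B", "M", "M"], name="diagnosis")

# Explicit codes: 0 = benign, 1 = malignant.
mapping = {"B": 0, "M": 1}

# map() replaces each label by its code; unknown labels would become NaN.
encoded = labels.map(mapping)

print(encoded.tolist())  # [0, 1, 0, 1, 1]
```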


## Changing column data types

In [32]:
df=df.astype('float')

In [33]:
df['diagnosis']=df['diagnosis'].astype('int').astype('category')
df['index']=df['index'].astype('int')




In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16180 entries, 1 to 19709
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   index                    16180 non-null  int64   
 1   perimeter_se             13640 non-null  float64 
 2   radius_worst             13630 non-null  float64 
 3   concave points_mean      13600 non-null  float64 
 4   smoothness_mean          13450 non-null  float64 
 5   area_mean                13425 non-null  float64 
 6   concavity_se             13475 non-null  float64 
 7   texture_mean             13565 non-null  float64 
 8   concavity_worst          13540 non-null  float64 
 9   smoothness_se            13610 non-null  float64 
 10  concave points_se        13600 non-null  float64 
 11  area_worst               13530 non-null  float64 
 12  compactness_mean         13480 non-null  float64 
 13  radius_mean              13495 non-null  float64 
 14  area_se    

## Describing the data

In [35]:
df.describe()

Unnamed: 0,index,perimeter_se,radius_worst,concave points_mean,smoothness_mean,area_mean,concavity_se,texture_mean,concavity_worst,smoothness_se,...,concavity_mean,id,symmetry_mean,symmetry_worst,fractal_dimension_se,perimeter_mean,compactness_worst,symmetry_se,compactness_se,radius_se
count,16180.0,13640.0,13630.0,13600.0,13450.0,13425.0,13475.0,13565.0,13540.0,13610.0,...,13415.0,14720.0,13485.0,13600.0,13555.0,13450.0,13725.0,13580.0,13570.0,13595.0
mean,328.05068,2536.409364,323.328936,2.424616,3.835104,656.435978,1.147903,19.416823,24.441271,0.007034,...,7.370354,30758980000000.0,16.305977,31.056838,0.010388,92.300212,25.303201,0.219976,0.145002,81.108863
std,189.47624,1738.237381,1680.786074,16.207594,19.928257,349.18159,17.733254,4.385227,106.06891,0.00313,...,34.793982,182461200000000.0,52.51652,92.196651,0.199383,24.473417,95.437704,2.066789,1.474177,292.899986
min,0.0,0.7714,7.93,0.0,0.05263,143.5,0.0,9.71,0.0,0.001713,...,0.0,-88888770000000.0,0.1167,0.1565,0.000895,43.79,0.02729,0.007882,0.002252,0.1115
25%,165.75,1482.0,13.29,0.02027,0.08641,420.5,0.01514,16.33,0.1211,0.005033,...,0.02995,865468.0,0.1634,0.2523,0.002217,75.49,0.1508,0.01502,0.0134,0.2351
50%,326.0,2143.0,15.3,0.0337,0.09592,556.7,0.02626,18.9,0.2571,0.006307,...,0.06387,907915.0,0.1813,0.2884,0.003053,86.91,0.2297,0.018525,0.02048,0.3438
75%,494.0,3168.0,19.92,0.07752,0.1061,782.7,0.04256,21.87,0.426725,0.008109,...,0.1457,8912944.0,0.2027,0.3313,0.004463,104.3,0.3856,0.02324,0.03247,0.5907
max,656.0,9807.0,9981.0,162.0,123.0,2501.0,396.0,39.28,1252.0,0.03113,...,313.0,999765400000000.0,304.0,544.0,6.0,188.5,1058.0,31.0,27.0,2873.0



## Saving to Parquet


In [37]:
import os

# Create the output directory first; writing into a non-existent directory raises an error.
os.makedirs("../Datasets/preprocessed", exist_ok=True)
df.to_parquet("../Datasets/preprocessed/BreastCancer.parquet", index=False)

In [None]:

df=pd.read_parquet("../Datasets/preprocessed/BreastCancer.parquet", engine='pyarrow')
df.info()

# Drop unnecessary columns


In [None]:
df = df.drop(columns=['id', 'index'])  # identifiers are not features


## Removing duplicates


In [None]:
df=df.drop_duplicates(keep='first', ignore_index=False)
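
Because the identifier columns are dropped first, duplicates are judged on the remaining feature and target columns only. A small sketch with made-up rows illustrates the effect of that order.

```python
import pandas as pd

# Toy frame: rows 0 and 1 have different ids but identical measurements.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "index": [10, 11, 12],
    "radius_mean": [14.2, 14.2, 11.0],
    "diagnosis": [0, 0, 1],
})

# With the identifiers present, no row is a full duplicate.
dupes_with_ids = int(toy.duplicated().sum())

# After dropping them, the repeated measurement collapses to one row.
features = toy.drop(columns=["id", "index"])
deduplicated = features.drop_duplicates(keep="first")

print(dupes_with_ids, len(deduplicated))  # 0 2
```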


In [None]:
df.info()

**Saving a dataset without duplicates but still including NaNs, so that it can be used in other possible experiments (BreastCancerdeduplicated.parquet).**

In [None]:
df.to_parquet("../Datasets/preprocessed/BreastCancerdeduplicated.parquet", index=False)


## Experimental dataset #1

In [None]:
df.dropna(inplace=True)

In [None]:
df['diagnosis']=df['diagnosis'].astype('category')

In [None]:
df.info()


After cleaning the data, we obtain a dataset with no missing values or errors. The resulting dataset contains 569 examples (rows), 30 features, and 1 target (diagnosis). We will use this dataset (Breastclean1) for the first experiment.

In [None]:
df.to_parquet("../Datasets/preprocessed/Breastclean1.parquet", index = False)

## Partial results
**Initial raw dataset (dataset name: BreastCancerDS.csv)**
- 19710 rows, 35 columns, memory usage: 5.4+ MB


**After removing duplicates, NaNs included (dataset name: BreastCancerdeduplicated.parquet):**
- 3159 rows, 31 columns, memory usage: 789.8 KB



**Clean dataset without duplicates or NaNs (dataset name: Breastclean1.parquet):**
- 569 rows, 31 columns, memory usage: 138.5 KB
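
Figures like the ones above can be collected with a small helper instead of being copied from `df.info()`. This is a sketch: `summarize` is a hypothetical helper, and its memory figure uses shallow accounting, so it may differ slightly from what `df.info()` reports.

```python
import pandas as pd


def summarize(name, frame):
    """Return a one-line summary of shape and shallow memory usage."""
    n_rows, n_cols = frame.shape
    # Sum of per-column (and index) byte counts, converted to KB.
    mem_kb = frame.memory_usage(deep=False).sum() / 1024
    return f"{name}: {n_rows} rows, {n_cols} columns, memory usage: {mem_kb:.1f} KB"


# Toy frame standing in for one of the saved datasets.
toy = pd.DataFrame({"radius_mean": [14.2, 11.0, 20.6], "diagnosis": [0, 0, 1]})
summary = summarize("toy", toy)
print(summary)
```

Calling the helper after each cleaning stage would give a consistent set of numbers for comparing the stages.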


