- 4. Data manipulation with pandas
    
    In this exercise you will work with the breast cancer dataset, which contains several features of benign and malignant tumors. Along the way you will practice how to:
    
    1. **Combine columns** using `concat()`.
    2. **Group values** by category using `groupby()`.
    3. **Filter data** by condition using `where()`.
    
    ### Instructions:
    
    1. **Load the dataset:** Use the following code to load the `breast_cancer` dataset from `scikit-learn` and convert it into a pandas `DataFrame`.

In [71]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the 'target' column (0 = malignant, 1 = benign)
df["target"] = data.target

# Display the DataFrame to see its structure (pandas abbreviates long frames to the first and last rows)
df

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


Exercise 1 - Combining columns: Create a new DataFrame that contains only two feature columns (for example, 'mean radius' and 'mean texture'). Create another DataFrame that contains the 'target' column, which indicates whether the tumor is malignant or benign. Combine the two DataFrames and display the first rows of the result. Take the opportunity to try the other parameters covered in class. Also try merge(): add a 'mean radius' column to the DataFrame that holds 'target', and use that column as the key in both DataFrames.
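
The difference between the two `concat()` axes and a key-based `merge()` can be sketched on a small hand-made example. The three-row frames and the names `features`, `labels`, and `labels_keyed` below are illustrative assumptions, not part of the exercise data; this is only a minimal sketch of the calls involved, not the expected solution.

```python
import pandas as pd

# Toy frames standing in for the exercise columns (illustrative values only).
features = pd.DataFrame({
    "mean radius": [17.99, 20.57, 11.42],
    "mean texture": [10.38, 17.77, 20.38],
})
labels = pd.DataFrame({"target": [0, 0, 1]})

# axis=1 places the frames side by side, aligning rows on the index.
side_by_side = pd.concat([features, labels], axis=1)

# axis=0 stacks the frames vertically; columns missing from one frame become NaN.
stacked = pd.concat([features, labels], axis=0)

# Add a shared key column to the label frame, then merge on that column
# instead of on the index.
labels_keyed = labels.assign(**{"mean radius": features["mean radius"]})
merged = pd.merge(features, labels_keyed, on="mean radius", how="inner")

print(side_by_side.shape)       # (3, 3)
print(stacked.shape)            # (6, 3)
print(merged.columns.tolist())  # ['mean radius', 'mean texture', 'target']
```

Because the key values in this toy example are unique, the merge returns one row per input row and a single 'mean radius' column. If a key column contains repeated values, an inner merge pairs every matching row on the left with every matching row on the right, so the result can have more rows than either input.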

In [None]:
from tabulate import tabulate

# Build two DataFrames from columns of the original
df_means = df[["mean radius", "mean texture"]].copy()
df_target = df[["target"]].copy()

# Concatenate both DataFrames along each axis to see the difference:
# axis=1 places them side by side (new columns), axis=0 stacks them (new rows)
df_concat_columns = pd.concat([df_means, df_target], axis=1)
df_concat_rows = pd.concat([df_means, df_target], axis=0)

# print(df_concat_columns)
# print(df_concat_rows)

# Add a column shared by both DataFrames ('mean radius') so that merge() has a common column
df_target["mean radius"] = df["mean radius"]

# This call merges on the index, not on 'mean radius'. Because both frames contain
# 'mean radius', pandas keeps both copies and adds the suffixes _x and _y.
# Passing on="mean radius" instead would use that column as the merge key.
df_merge = pd.merge(df_means, df_target, how="inner", left_index=True, right_index=True)

print(tabulate(df_merge, headers="keys", tablefmt="psql"))

+-----+-----------------+----------------+----------+-----------------+
|     |   mean radius_x |   mean texture |   target |   mean radius_y |
|-----+-----------------+----------------+----------+-----------------|
|   0 |          17.99  |          10.38 |        0 |          17.99  |
|   1 |          20.57  |          17.77 |        0 |          20.57  |
|   2 |          19.69  |          21.25 |        0 |          19.69  |
|   3 |          11.42  |          20.38 |        0 |          11.42  |
|   4 |          20.29  |          14.34 |        0 |          20.29  |
|   5 |          12.45  |          15.7  |        0 |          12.45  |
|   6 |          18.25  |          19.98 |        0 |          18.25  |
|   7 |          13.71  |          20.83 |        0 |          13.71  |
|   8 |          13     |          21.82 |        0 |          13     |
|   9 |          12.46  |          24.04 |        0 |          12.46  |
|  10 |          16.02  |          23.24 |        0 |          1