## Exercise 1: Analysis of the Iris Dataset

**Objective**: Perform a thorough analysis of the Iris dataset using the techniques covered so far.

### Instructions:
1. **Frequency Measures**:
   - Compute the frequency of each species in the Iris dataset.
   - Compute the percentage frequencies.

2. **Measures of Central Tendency**:
   - Compute the mean, median, and mode of the variables `sepal length (cm)` and `petal length (cm)`.

3. **Measures of Dispersion**:
   - Compute the standard deviation, variance, range, and interquartile range (IQR) of the variables `sepal width (cm)` and `petal width (cm)`.

4. **Percentiles and Quartiles**:
   - Compute the 25th, 50th (median), and 75th percentiles of `sepal length (cm)`.

5. **Statistical Summary**:
   - Produce a statistical summary of the whole DataFrame.
   - Produce a separate statistical summary for each species, covering all four variables in the DataFrame.

6. **Covariance and Correlation**:
   - Compute the covariance between `sepal length (cm)` and `petal length (cm)`.
   - Compute the correlation between `sepal length (cm)` and `petal length (cm)`.

In [None]:
%pip install pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.




In [None]:
import pandas as pd

from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

In [None]:
df

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [None]:
species_count = df['species'].value_counts()

species_count

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

In [None]:
species_porcent = df['species'].value_counts(normalize=True)

species_porcent

species
setosa        0.333333
versicolor    0.333333
virginica     0.333333
Name: proportion, dtype: float64
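The exercise asks for percentage frequencies, whereas `value_counts(normalize=True)` returns proportions between 0 and 1. A minimal sketch of the conversion, assuming a small hypothetical Series in place of `df['species']`:

```python
import pandas as pd

# Hypothetical stand-in for df['species']; with the notebook's data, use that column.
species = pd.Series(["setosa", "setosa", "versicolor", "virginica"])

# normalize=True yields proportions in [0, 1]; multiplying by 100 gives percentages.
species_pct = species.value_counts(normalize=True).mul(100).round(2)
print(species_pct)
```

Applied to the Iris data, each species would come out at roughly 33.33, matching the proportions shown above.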

In [None]:
print(df["sepal length (cm)"].mean())

print(df["sepal length (cm)"].median())

print(df["sepal length (cm)"].mode())

print(df["petal length (cm)"].mean())

print(df["petal length (cm)"].median())

print(df["petal length (cm)"].mode())

5.843333333333334
5.8
0    5.0
Name: sepal length (cm), dtype: float64
3.7580000000000005
4.35
0    1.4
1    1.5
Name: petal length (cm), dtype: float64
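The output above shows that `mode()` can return more than one value: `petal length (cm)` has two modes, 1.4 and 1.5. The following sketch wraps the three measures in a helper so the same call works for any column. The function name `central_tendency` and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

def central_tendency(values: pd.Series) -> dict:
    """Mean, median, and every mode of a numeric Series."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        # mode() returns a Series because several values can tie for most frequent.
        "modes": values.mode().tolist(),
    }

# Hypothetical sample; with the notebook's data, pass df["petal length (cm)"] instead.
sample = pd.Series([1.4, 1.4, 1.5, 1.5, 4.7])
result = central_tendency(sample)
print(result)
```

Keeping the modes as a list avoids silently dropping ties, which would happen if only the first mode were reported.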


In [None]:
print(df["petal width (cm)"].std())

print(df["petal width (cm)"].var())

min_petal_width = df["petal width (cm)"].min()

max_petal_width = df["petal width (cm)"].max()

print(max_petal_width - min_petal_width)

Q1_petal_width = df['petal width (cm)'].quantile(0.25)
Q3_petal_width = df['petal width (cm)'].quantile(0.75)

print(Q3_petal_width - Q1_petal_width)

0.7622376689603465
0.5810062639821029
2.4
1.5
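Item 3 also lists `sepal width (cm)`, which the cell above does not cover. One way to avoid repeating the same four calculations is a small helper, sketched below. The name `dispersion` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def dispersion(values: pd.Series) -> dict:
    """Sample standard deviation, sample variance, range, and IQR of a numeric Series."""
    q1, q3 = values.quantile([0.25, 0.75])
    return {
        "std": values.std(),   # pandas uses ddof=1 (sample estimate) by default
        "var": values.var(),
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
    }

# Hypothetical sample; with the notebook's data, pass df["sepal width (cm)"] instead.
sample = pd.Series([2.0, 3.0, 3.0, 4.0, 8.0])
stats = dispersion(sample)
print(stats)
```

With the notebook's DataFrame, calling `dispersion(df["sepal width (cm)"])` and `dispersion(df["petal width (cm)"])` would cover both variables named in the instructions.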


In [None]:
cuartiles = df["sepal length (cm)"].quantile([0.25, 0.5, 0.75])
print("Quartiles: ")
print(cuartiles)

Quartiles: 
0.25    5.1
0.50    5.8
0.75    6.4
Name: sepal length (cm), dtype: float64


In [None]:
print(df.describe())

print(df['species'].value_counts())

       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)  
count        150.000000  
mean           1.199333  
std            0.762238  
min            0.100000  
25%            0.300000  
50%            1.300000  
75%            1.800000  
max            2.500000  
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64


In [None]:
df.groupby('species', observed=False)['sepal length (cm)'].describe()

species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8
versicolor,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0
virginica,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9


In [None]:
df.groupby('species', observed=False)['petal length (cm)'].describe()

species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,1.462,0.173664,1.0,1.4,1.5,1.575,1.9
versicolor,50.0,4.26,0.469911,3.0,4.0,4.35,4.6,5.1
virginica,50.0,5.552,0.551895,4.5,5.1,5.55,5.875,6.9


In [None]:
df.groupby('species', observed=False)['sepal width (cm)'].describe()

species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,3.428,0.379064,2.3,3.2,3.4,3.675,4.4
versicolor,50.0,2.77,0.313798,2.0,2.525,2.8,3.0,3.4
virginica,50.0,2.974,0.322497,2.2,2.8,3.0,3.175,3.8


In [None]:
df.groupby('species', observed=False)['petal width (cm)'].describe()

species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,0.246,0.105386,0.1,0.2,0.2,0.3,0.6
versicolor,50.0,1.326,0.197753,1.0,1.2,1.3,1.5,1.8
virginica,50.0,2.026,0.27465,1.4,1.8,2.0,2.3,2.5
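The four per-variable summaries above can also be produced in a single call, since `describe()` on a grouped DataFrame summarizes every numeric column at once. A minimal sketch, assuming a small hypothetical frame with the same column layout as `df`:

```python
import pandas as pd

# Hypothetical miniature of the Iris DataFrame (two species, two variables).
sample = pd.DataFrame({
    "species": pd.Categorical(["setosa", "setosa", "virginica", "virginica"]),
    "sepal length (cm)": [5.0, 5.2, 6.5, 6.9],
    "petal length (cm)": [1.4, 1.6, 5.5, 5.9],
})

# The result has a two-level column index: (variable, statistic).
per_species = sample.groupby("species", observed=False).describe()

# Selecting the first level recovers the summary for a single variable.
print(per_species["sepal length (cm)"])
```

The trade-off is a wider table: with four variables and eight statistics each, the combined result has 32 columns, so selecting one variable at a time is often easier to read.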


In [None]:
print(df['sepal length (cm)'].cov(df['petal length (cm)']))

print(df['sepal length (cm)'].corr(df['petal length (cm)']))

1.2743154362416105
0.8717537758865829
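The two results are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. Using the values reported in this notebook, 1.2743 / (0.828066 × 1.765298) ≈ 0.8718, which agrees with the output of `corr`. The sketch below checks the same identity on hypothetical data.

```python
import pandas as pd

# Hypothetical samples; with the notebook's data, use the two length columns of df.
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

# cov() and std() both use ddof=1 by default, so the normalizations cancel out.
manual_corr = x.cov(y) / (x.std() * y.std())
print(manual_corr, x.corr(y))
```

Because correlation is unit-free and bounded between -1 and 1, it is usually easier to interpret than the covariance, whose magnitude depends on the units of both variables.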


## Exercise 2: Analysis of the Titanic Dataset

**Objective**: Apply exploratory analysis techniques to investigate the characteristics of the Titanic passengers.

**Note**: Load the dataset as shown below. The seaborn version uses lowercase column names (`pclass`, `sex`, `survived`, `age`, `fare`), so the capitalized names in the instructions refer to those columns.

import seaborn as sns

import pandas as pd

import numpy as np

*# Load the Titanic dataset from seaborn*

titanic = sns.load_dataset('titanic')

### Instructions:
1. **Frequency Measures**:
   - Compute the frequency of passengers by class (`Pclass`) and by sex (`Sex`).
   - Compute the frequency of survival (`Survived`) as well.
   - Compute the percentage frequencies.

2. **Measures of Central Tendency**:
   - Compute the mean, median, and mode of the ages (`Age`).
   - Compute the mean fare (`Fare`) for first-class passengers only (`Pclass == 1`).

3. **Measures of Dispersion**:
   - Compute the standard deviation, variance, range, and interquartile range (IQR) of the fares (`Fare`).
   - Compute the interquartile range (IQR) of the ages (`Age`) for survivors (`Survived == 1`).

4. **Percentiles and Quartiles**:
   - Instead of computing only specific percentiles, compute every percentile (0 to 100) of the fares (`Fare`).

5. **Statistical Summary**:
   - Produce a statistical summary of the whole DataFrame.
   - Produce separate statistical summaries for survivors (`Survived == 1`) and non-survivors (`Survived == 0`): first by class (`Pclass`), then by sex (`Sex`).

6. **Covariance and Correlation**:
   - Compute the covariance between age (`Age`) and fare (`Fare`).
   - Compute the correlation matrix for the numeric variables in the dataset.

7. **Contingency Tables**:
   - Compute a contingency table between `Pclass` and `Survived`.

In [1]:
%pip install seaborn pandas numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import seaborn as sns

import pandas as pd

import numpy as np

# Load the Titanic dataset from seaborn

titanic = sns.load_dataset('titanic')

titanic

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [3]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [3]:
# pclass is categorical (passenger class), so convert it from integer to category.
titanic["pclass"] = titanic["pclass"].astype("category")

# survived is categorical (survived or not), so convert it from integer to category.
titanic["survived"] = titanic["survived"].astype("category")


In [7]:
print(titanic["pclass"].value_counts())

print(titanic["sex"].value_counts())

print(titanic["survived"].value_counts())

print(titanic["pclass"].value_counts(normalize=True))

print(titanic["sex"].value_counts(normalize=True))

print(titanic["survived"].value_counts(normalize=True))

pclass
3    491
1    216
2    184
Name: count, dtype: int64
sex
male      577
female    314
Name: count, dtype: int64
survived
0    549
1    342
Name: count, dtype: int64
pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64
sex
male      0.647587
female    0.352413
Name: proportion, dtype: float64
survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64
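Counts and percentages are easier to compare side by side. The sketch below builds a two-column frequency table per variable. The helper name `frequency_table` and the sample data are hypothetical.

```python
import pandas as pd

def frequency_table(values: pd.Series) -> pd.DataFrame:
    """Absolute counts and percentage frequencies of each category."""
    counts = values.value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

# Hypothetical sample; with the notebook's data, iterate over titanic's columns instead.
sample = pd.DataFrame({
    "sex": ["male", "male", "female", "male"],
    "pclass": [3, 1, 3, 3],
})

for column in ["sex", "pclass"]:
    print(frequency_table(sample[column]))
```

On the real dataset the same loop could run over `["pclass", "sex", "survived"]`, replacing the six separate `print` calls.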


In [None]:
print(titanic['age'].mean())

print(titanic['age'].median())

print(titanic['age'].mode())

print(titanic[titanic['pclass'] == 1]['fare'].mean())

29.69911764705882
28.0
0    24.0
Name: age, dtype: float64
84.1546875
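The `info()` output shows only 714 non-null values for `age` out of 891 rows, so the statistics above are computed on the passengers whose age is known: pandas skips missing values by default. The sketch below illustrates this on hypothetical data and also shows how the mean fare could be obtained for every class at once.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing age.
sample = pd.DataFrame({
    "pclass": [1, 1, 2, 3],
    "age": [40.0, np.nan, 30.0, 20.0],
    "fare": [80.0, 90.0, 15.0, 8.0],
})

missing_ages = sample["age"].isna().sum()   # how many ages are unknown
mean_age = sample["age"].mean()             # NaN is skipped (skipna=True by default)
mean_fare_by_class = sample.groupby("pclass")["fare"].mean()

print(missing_ages, mean_age)
print(mean_fare_by_class)
```

Reporting the number of missing values next to the mean makes it clear how much of the data the statistic actually describes.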


In [None]:
print(titanic['fare'].std())

print(titanic['fare'].var())

print(titanic['fare'].max() - titanic['fare'].min())

print(titanic['fare'].quantile(0.75) - titanic['fare'].quantile(0.25))

survivor_ages = titanic.loc[titanic['survived'] == 1, 'age']  # ages of survivors only
print(survivor_ages.quantile(0.75) - survivor_ages.quantile(0.25))

49.693428597180905
2469.436845743117
512.3292
23.0896
17.0


In [None]:
pd.set_option('display.max_rows', None)

In [None]:
titanic['fare'].quantile([i/100 for i in range(101)])

,fare
0.0,0.0
0.01,0.0
0.02,6.3975
0.03,6.975
0.04,7.05252
0.05,7.225
0.06,7.225
0.07,7.2292
0.08,7.25
0.09,7.25
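The resulting Series is indexed by probabilities (0.00 to 1.00). As a possible variation, the index can be relabeled with integer percentile numbers, which makes individual lookups easier. The values below are hypothetical stand-ins for `titanic['fare']`.

```python
import numpy as np
import pandas as pd

# Hypothetical fares; with the notebook's data, use titanic['fare'] instead.
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 30.0, 71.28])

probabilities = np.arange(101) / 100        # 0.00, 0.01, ..., 1.00
percentiles = fares.quantile(probabilities)
percentiles.index = range(101)              # label rows by percentile number

# Look up selected percentiles by their integer label.
print(percentiles.loc[[0, 25, 50, 75, 100]])
```

The 0th and 100th percentiles coincide with the minimum and maximum, which is a quick sanity check on the result.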


In [None]:
pd.reset_option('display.max_rows')

In [10]:
titanic.describe()

,age,sibsp,parch,fare
count,714.0,891.0,891.0,891.0
mean,29.699118,0.523008,0.381594,32.204208
std,14.526497,1.102743,0.806057,49.693429
min,0.42,0.0,0.0,0.0
25%,20.125,0.0,0.0,7.9104
50%,28.0,0.0,0.0,14.4542
75%,38.0,1.0,0.0,31.0
max,80.0,8.0,6.0,512.3292


In [11]:
titanic.describe(include=['object', 'category', 'bool'])

,survived,pclass,sex,embarked,class,who,adult_male,deck,embark_town,alive,alone
count,891,891,891,889,891,891,891,203,889,891,891
unique,2,3,2,3,3,3,2,7,3,2,2
top,0,3,male,S,Third,man,True,C,Southampton,no,True
freq,549,491,577,644,491,537,537,59,644,549,537


In [12]:
titanic[titanic["survived"] == 1]["pclass"].describe()

count     342
unique      3
top         1
freq      136
Name: pclass, dtype: int64

In [8]:
titanic[titanic["survived"] == 0]["pclass"].describe()

count     549
unique      3
top         3
freq      372
Name: pclass, dtype: int64

In [13]:
titanic[titanic["survived"] == 1]["sex"].describe()

count        342
unique         2
top       female
freq         233
Name: sex, dtype: object

In [14]:
titanic[titanic["survived"] == 0]["sex"].describe()

count      549
unique       2
top       male
freq       468
Name: sex, dtype: object
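For a categorical column, `describe()` reports only the most frequent category and its count. If the full breakdown per outcome is of interest, grouping and counting is one option, sketched here on hypothetical data:

```python
import pandas as pd

# Hypothetical sample; with the notebook's data, use the titanic DataFrame instead.
sample = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1],
    "sex": ["male", "male", "female", "female", "male"],
})

# One count per (survived, sex) combination that occurs in the data.
breakdown = sample.groupby("survived")["sex"].value_counts()
print(breakdown)
```

In the notebook `survived` has already been converted to a category, so the `groupby` call there may need `observed=False`, as in Exercise 1.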

In [None]:
print(titanic['age'].cov(titanic['fare']))

73.84902981461926


In [19]:
titanic_variables_numericas = titanic.select_dtypes(include=[np.number])

titanic_variables_numericas.corr()

,age,sibsp,parch,fare
age,1.0,-0.308247,-0.189119,0.096067
sibsp,-0.308247,1.0,0.414838,0.159651
parch,-0.189119,0.414838,1.0,0.216225
fare,0.096067,0.159651,0.216225,1.0


In [4]:
pd.crosstab(titanic['pclass'], titanic['survived'])

pclass,survived=0,survived=1
1,80,136
2,97,87
3,372,119
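Raw counts are hard to compare because the classes differ in size. Dividing each row by its total gives survival rates per class: from the table above, 136/216 ≈ 63% in first class, 87/184 ≈ 47% in second, and 119/491 ≈ 24% in third. `pd.crosstab` can do this directly with `normalize="index"`, as this sketch on hypothetical data shows:

```python
import pandas as pd

# Hypothetical sample; with the notebook's data, pass titanic['pclass'] and titanic['survived'].
sample = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "survived": [1, 1, 0, 0, 0, 1],
})

# normalize="index" divides every row by its own total, so each row sums to 1.
rates = pd.crosstab(sample["pclass"], sample["survived"], normalize="index")
print(rates)
```

Using `normalize="columns"` instead would answer a different question: what share of survivors (or non-survivors) came from each class.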
