### 1: Introduction

This report analyzes the abalone dataset.
The goal is to assess how influential points, outliers, multicollinearity, and transformation techniques affect the regression model.
Metrics such as R^2 and MSE are computed and the results are interpreted.
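
As a minimal sketch of the two metrics named above, the snippet below computes MSE and R^2 with NumPy. The helper names `mse` and `r_squared` and the toy arrays are illustrative assumptions, not part of the analysis itself.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

# Toy values (hypothetical), chosen only to make the arithmetic easy to follow
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

mse_value = mse(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print("MSE =", mse_value)
print("R^2 =", r2_value)
```

A lower MSE means smaller residuals on average, while R^2 reports the share of the target's variance that the predictions explain.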


### 2: Initial Data Analysis

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import norm, uniform, skewnorm
from ucimlrepo import fetch_ucirepo
from sklearn.preprocessing import MinMaxScaler, StandardScaler

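# Fetch the Abalone dataset from the UCI repository (id=1)
abalone = fetch_ucirepo(id=1)

# Prepare the data: drop the categorical 'Sex' column so only numeric predictors remain
X = abalone.data.features.drop('Sex', axis=1).reset_index(drop=True)
# Build new objects instead of resetting indices in place on the fetched data
y = abalone.data.targets.reset_index(drop=True)
df = pd.concat([y, X], axis=1)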

### 3: Influential Point Analysis

In [2]:
# Fit an OLS model with an intercept term
X_fit = sm.add_constant(X)
model = sm.OLS(y, X_fit)
fitted_model = model.fit()

# Fitted coefficients and R^2
print(fitted_model.params)
print('\nR^2 =', fitted_model.rsquared)

# Influence diagnostics: leverage (hat-matrix diagonal) and Cook's distance
influence = fitted_model.get_influence()
H_diag = influence.hat_matrix_diag
cooks_dist = influence.cooks_distance[0]  # element [1] holds the p-values

const              2.985154
Length            -1.571897
Diameter          13.360916
Height            11.826072
Whole_weight       9.247414
Shucked_weight   -20.213913
Viscera_weight    -9.829675
Shell_weight       8.576242
dtype: float64

R^2 = 0.5276299399919837
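
The leverage and Cook's distance values extracted above can be reproduced by hand, which helps when deciding which points to flag. The sketch below does this with NumPy on toy data. The function name `influence_measures`, the toy arrays, and the cutoffs (`2p/n` for leverage, `4/n` for Cook's distance) are assumptions: they are common rules of thumb, not thresholds taken from this analysis.

```python
import numpy as np

def influence_measures(X, y):
    """Return (leverage, cooks_distance) for an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    if X.ndim == 1:
        X = X[:, None]
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    p = A.shape[1]                              # number of fitted parameters
    # Hat matrix H = A (A^T A)^{-1} A^T; its diagonal is the leverage
    H = A @ np.linalg.solve(A.T @ A, A.T)
    h = np.diag(H)
    resid = y - H @ y                           # ordinary residuals
    s2 = resid @ resid / (n - p)                # residual variance estimate
    cooks = (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2
    return h, cooks

# Toy data (hypothetical): the last point sits far from the others in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 30.0])

h, cooks = influence_measures(x, y)
n, p = len(x), 2

high_leverage = np.flatnonzero(h > 2 * p / n)   # rule-of-thumb cutoff (assumption)
influential = np.flatnonzero(cooks > 4 / n)     # rule-of-thumb cutoff (assumption)
print("High leverage:", high_leverage)
print("Influential:", influential)
```

With the notebook's objects, the same cutoffs could be applied to `H_diag` and `cooks_dist` directly, using `p = X_fit.shape[1]` and `n = len(X_fit)`.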


### 4: Outliers

In [4]:
# Outlier identification
# Detection using the Z-score rule (mean +/- 3 standard deviations)
up_lim = X.mean() + 3 * X.std()
dw_lim = X.mean() - 3 * X.std()

print("Upper limit:\n", up_lim)
print("\nLower limit:\n", dw_lim)

# Detection using percentiles (quartiles and IQR)
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
iqr = Q3 - Q1

print("Q1:", Q1)
print("\nQ3:", Q3)
print("\nIQR:", iqr)

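# A value is an outlier if it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
outliers_iqr = (X < Q1 - 1.5 * iqr) | (X > Q3 + 1.5 * iqr)
# Count rows (samples) with at least one outlying feature
print("Number of outlier samples:", outliers_iqr.any(axis=1).sum())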

Upper limit:
 Length            0.884271
Diameter          0.705601
Height            0.264998
Whole_weight      2.299909
Shucked_weight    1.025256
Viscera_weight    0.509436
Shell_weight      0.656439
dtype: float64

Lower limit:
 Length            0.163713
Diameter          0.110162
Height            0.014035
Whole_weight     -0.642425
Shucked_weight   -0.306521
Viscera_weight   -0.148249
Shell_weight     -0.178777
dtype: float64
Q1: Length            0.4500
Diameter          0.3500
Height            0.1150
Whole_weight      0.4415
Shucked_weight    0.1860
Viscera_weight    0.0935
Shell_weight      0.1300
Name: 0.25, dtype: float64

Q3: Length            0.615
Diameter          0.480
Height            0.165
Whole_weight      1.153
Shucked_weight    0.502
Viscera_weight    0.253
Shell_weight      0.329
Name: 0.75, dtype: float64

IQR: Length            0.1650
Diameter          0.1300
Height            0.0500
Whole_weight      0.7115
Shucked_weight    0.3160
Viscera_weight    0.1595
S
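
To make the IQR rule concrete, here is a minimal NumPy sketch on toy data. It uses the usual Tukey fences, `Q1 - k*IQR` and `Q3 + k*IQR` with `k = 1.5`, computed per column. The function name `iqr_outlier_mask` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def iqr_outlier_mask(data, k=1.5):
    """Boolean mask, True where a value lies outside the per-column Tukey fences."""
    data = np.asarray(data, dtype=float)
    q1 = np.quantile(data, 0.25, axis=0)
    q3 = np.quantile(data, 0.75, axis=0)
    iqr = q3 - q1
    lower = q1 - k * iqr    # lower fence: note the minus sign
    upper = q3 + k * iqr    # upper fence
    return (data < lower) | (data > upper)

# Toy matrix (hypothetical): one extreme value in each column, on different rows
data = np.array([
    [1.0, -50.0],
    [2.0, 11.0],
    [3.0, 12.0],
    [4.0, 13.0],
    [5.0, 14.0],
    [6.0, 15.0],
    [7.0, 16.0],
    [100.0, 17.0],
])

mask = iqr_outlier_mask(data)
outlier_rows = np.flatnonzero(mask.any(axis=1))   # rows with at least one outlier
print("Outlier cells:", int(mask.sum()))
print("Outlier rows:", outlier_rows)
```

Counting rows with `any(axis=1)` gives the number of affected samples, whereas summing the whole mask counts individual outlying values, so the two numbers can differ when a sample is extreme in several features.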

### 5: Transformations for Handling Outliers

In [9]:
# Min-max scaling to the range [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(X)

print("Max Values:", scaler.data_max_)

print("\nTransformation step:")
mima = scaler.transform(X)
print(mima)
print(scaler.transform([[2, 2, 2, 2, 2, 2, 2]]))

# Z-score standardization
scaler = StandardScaler()
scaler.fit(X)

print('Means:', scaler.mean_)

print('Transformation step:')
zSc = scaler.transform(X)
print(zSc, '\n')
print(scaler.transform([[2, 2, 2, 2, 2, 2, 2]]))

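# Winsorization
# Per-feature quartiles (axis=0); without axis, NumPy pools all columns into a single quantile
Q1 = np.quantile(X.to_numpy(), 0.25, axis=0)
Q3 = np.quantile(X.to_numpy(), 0.75, axis=0)

iqr = Q3 - Q1

print("Q1:", Q1)
print("Q3:", Q3)

from scipy.stats.mstats import winsorize

# Replace the lowest and highest 32% of each column (axis=0) with the nearest retained value
win = winsorize(X.to_numpy(), limits=[0.32, 0.32], axis=0)
print(win)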

Max Values: [0.815  0.65   1.13   2.8255 1.488  0.76   1.005 ]

Transformation step:
[[ 0.02702703  0.04201681 -0.83185841 ... -0.69939475 -0.73535221
  -0.70403587]
 [-0.25675676 -0.29411765 -0.84070796 ... -0.86751849 -0.87360105
  -0.86347783]
 [ 0.22972973  0.22689076 -0.76106195 ... -0.65635508 -0.62870309
  -0.58445441]
 ...
 [ 0.41891892  0.41176471 -0.63716814 ... -0.29455279 -0.24423963
  -0.38913802]
 [ 0.48648649  0.44537815 -0.73451327 ... -0.28715535 -0.31402238
  -0.41305431]
 [ 0.71621622  0.68067227 -0.65486726 ...  0.27034297 -0.00987492
  -0.01644245]]
[[4.2027027  5.53781513 2.53982301 0.41526474 1.68863484 4.26530612
  2.98305929]]
Means: [0.5239921  0.40788125 0.1395164  0.82874216 0.35936749 0.18059361
 0.23883086]
Transformation step:
[[-0.57455813 -0.43214879 -1.06442415 ... -0.60768536 -0.72621157
  -0.63821689]
 [-1.44898585 -1.439929   -1.18397831 ... -1.17090984 -1.20522124
  -1.21298732]
 [ 0.05003309  0.12213032 -0.10799087 ... -0.4634999  -0.35668983
  -0
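
The three transformations used in this section can be written out in a few lines of NumPy, which makes their effect on an extreme value easier to see. This is a sketch under stated assumptions: the function names and the toy column are hypothetical, and `clip_to_quantiles` clips to interpolated quantiles, which approximates but is not identical to `scipy.stats.mstats.winsorize`.

```python
import numpy as np

def min_max_scale(data, low=-1.0, high=1.0):
    """Map each column linearly so its minimum becomes `low` and its maximum `high`."""
    data = np.asarray(data, dtype=float)
    d_min = data.min(axis=0)
    d_max = data.max(axis=0)
    return low + (data - d_min) * (high - low) / (d_max - d_min)

def z_score(data):
    """Center each column at 0 with unit population standard deviation (ddof=0)."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

def clip_to_quantiles(data, lower=0.05, upper=0.95):
    """Winsorization-style clipping of each column to its lower/upper quantiles."""
    data = np.asarray(data, dtype=float)
    lo = np.quantile(data, lower, axis=0)
    hi = np.quantile(data, upper, axis=0)
    return np.clip(data, lo, hi)

# Toy column (hypothetical) with one extreme value
col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = min_max_scale(col)
standardized = z_score(col)
clipped = clip_to_quantiles(col)
print(scaled.ravel())
print(standardized.ravel())
print(clipped.ravel())
```

Min-max scaling and standardization are linear, so an extreme value stays extreme relative to the rest; only the clipping step actually pulls it toward the bulk of the data. Values outside the fitted range also map outside `[-1, 1]`, which is what the `[[2, 2, 2, 2, 2, 2, 2]]` row in the output above shows.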

