The dataset is available here. You have two main tasks. The first is to use the abalone dataset to predict the age of an abalone from physical measurements using KNN. This is a regression problem; you may want to search for "KNN regression scikit-learn". The second is to use the abalone dataset to predict sex from the remaining features. This should be the easier task, since we have already covered KNN classification in Python.



# Variables

| Name | Data type | Unit | Description |
|---|---|---|---|
| Sex | nominal | -- | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | Perpendicular to length |
| Height | continuous | mm | With meat in shell |
| Whole weight | continuous | grams | Whole abalone |
| Shucked weight | continuous | grams | Weight of meat |
| Viscera weight | continuous | grams | Gut weight (after bleeding) |
| Shell weight | continuous | grams | After being dried |
| Rings | integer | -- | +1.5 gives the age in years |

In [1]:
import pandas as pd
import numpy as np
# For scaling the data
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [2]:
df = pd.read_csv('archive (1).zip', sep=',')

In [3]:
df.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [4]:
df['Age'] = df['Rings']+1.5

In [5]:
df = df.drop_duplicates()

In [6]:
df.shape

(4177, 10)

In [7]:
df.dtypes

Sex                object
Length            float64
Diameter          float64
Height            float64
Whole weight      float64
Shucked weight    float64
Viscera weight    float64
Shell weight      float64
Rings               int64
Age               float64
dtype: object

Predicting abalone age from physical measurements using KNN

In [8]:
# numeric_only=True skips the non-numeric Sex column (needed in pandas >= 2.0)
df.corr(numeric_only=True).sort_values(by='Age', ascending=False)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Age
Rings,0.55672,0.57466,0.557467,0.54039,0.420884,0.503819,0.627574,1.0,1.0
Age,0.55672,0.57466,0.557467,0.54039,0.420884,0.503819,0.627574,1.0,1.0
Shell weight,0.897706,0.90533,0.817338,0.955355,0.882617,0.907656,1.0,0.627574,0.627574
Diameter,0.986812,1.0,0.833684,0.925452,0.893162,0.899724,0.90533,0.57466,0.57466
Height,0.827554,0.833684,1.0,0.819221,0.774972,0.798319,0.817338,0.557467,0.557467
Length,1.0,0.986812,0.827554,0.925261,0.897914,0.903018,0.897706,0.55672,0.55672
Whole weight,0.925261,0.925452,0.819221,1.0,0.969405,0.966375,0.955355,0.54039,0.54039
Viscera weight,0.903018,0.899724,0.798319,0.966375,0.931961,1.0,0.907656,0.503819,0.503819
Shucked weight,0.897914,0.893162,0.774972,0.969405,1.0,0.931961,0.882617,0.420884,0.420884


In [9]:
df.Age

0       16.5
1        8.5
2       10.5
3       11.5
4        8.5
        ... 
4172    12.5
4173    11.5
4174    10.5
4175    11.5
4176    13.5
Name: Age, Length: 4177, dtype: float64

In [10]:
df.columns

Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight', 'Rings', 'Age'],
      dtype='object')

In [11]:
X = df.loc[:,['Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight']]

In [12]:
y = df.loc[:, 'Age'].values

In [13]:
scaler = StandardScaler()
# Fit on X (the feature matrix)
scaler.fit(X)
# Transform X
X = scaler.transform(X)

In [14]:
from sklearn.neighbors import KNeighborsRegressor

In [15]:
knn = KNeighborsRegressor(n_neighbors=5,weights='distance')
knn.fit(X, y)
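# Predicting on the rows used for fitting: with weights='distance' each row is
# its own zero-distance neighbor, so its own target is returned
predictions = knn.predict(X)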

In [16]:
predictions

array([16.5,  8.5, 10.5, ..., 10.5, 11.5, 13.5])

In [17]:
score = knn.score(X, y)
score

1.0
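The perfect score above is what inverse-distance weighting produces when a model is scored on the rows it was fitted on, so it says little about predictive quality. The sketch below illustrates this on synthetic data (an assumption; it is not the abalone data, and the variable names `X_fit`, `X_new` are made up for the illustration).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the abalone features (assumption: not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

X_fit, y_fit = X[:200], y[:200]   # rows the model is fitted on
X_new, y_new = X[200:], y[200:]   # rows the model has never seen

knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_fit, y_fit)

# Every fitted row is its own nearest neighbor at distance 0, so
# inverse-distance weighting hands back that row's own target.
train_r2 = knn.score(X_fit, y_fit)
new_r2 = knn.score(X_new, y_new)
print(f"R^2 on fitted rows: {train_r2:.3f}")
print(f"R^2 on unseen rows: {new_r2:.3f}")
```

The gap between the two numbers is the reason a held-out split is needed before trusting a KNN score.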

## KNN Regressor (without train-test split)

In [18]:
X = df.loc[:,['Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight']]
y = df.loc[:, 'Age'].values

In [19]:
scaler = StandardScaler()
# Fit on X (the feature matrix)
scaler.fit(X)
# Transform X
X = scaler.transform(X)

In [20]:
from sklearn.neighbors import KNeighborsRegressor

In [21]:
knn = KNeighborsRegressor(n_neighbors=3,weights='uniform')
knn.fit(X, y)
predictions = knn.predict(X)

In [22]:
score = knn.score(X, y)
score

0.7200173769577054

## KNN Regressor (with train-test split)

In [23]:
X = df.loc[:,['Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight']]
y = df.loc[:, 'Age'].values

In [24]:
from sklearn.model_selection import train_test_split
X_train , X_test, y_train, y_test = train_test_split(X,y,test_size=0.40)

In [25]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [26]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=3,weights='uniform')
knn.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=3)

In [27]:
# Note: this is R^2 on the training split; knn.score(X_test, y_test) gives the held-out score
score = knn.score(X_train, y_train)
score

0.7263940586030272
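The score above is measured on the training split, so the test split created earlier is never used. A minimal sketch of reporting both numbers side by side follows. It uses synthetic data and a `make_pipeline` wrapper as assumptions; the pipeline is one convenient way to keep the scaler fitted on the training rows only, not necessarily how the notebook has to do it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the abalone measurements).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0
)

# The pipeline fits the scaler on X_train only, then applies it to X_test.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
```

With a small `n_neighbors`, the training figure is typically the more optimistic of the two, which is why the test figure is the one to report.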

# KNN Classifier (without train-test split)

In [28]:
X = df.iloc[:,1:-1].values
y = df.iloc[:,0].values

scaler = StandardScaler()
# Fit on X (the feature matrix, a NumPy array here)
scaler.fit(X)
# Transform X
X = scaler.transform(X)

In [29]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3,weights='uniform')
classifier.fit(X,y)

KNeighborsClassifier(n_neighbors=3)

In [30]:
y_pred = classifier.predict(X)

In [31]:
score = classifier.score(X, y)
score

0.7368925065836724

# KNN Classifier (with train-test split)

In [32]:
X = df.iloc[:,1:-1].values
y = df.iloc[:,0].values

In [33]:
from sklearn.model_selection import train_test_split
X_train , X_test, y_train, y_test = train_test_split(X,y,test_size=0.40)

In [34]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [35]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3,weights='uniform')
classifier.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=3)

In [36]:
score = classifier.score(X_test, y_test)
score

0.5158587672052664

In [37]:
score = classifier.score(X_train, y_train)
score

0.7390263367916999

In [38]:
y_pred = classifier.predict(X_test)

In [39]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           F       0.42      0.44      0.43       529
           I       0.71      0.65      0.68       556
           M       0.44      0.46      0.45       586

    accuracy                           0.52      1671
   macro avg       0.52      0.52      0.52      1671
weighted avg       0.52      0.52      0.52      1671

[[233  56 240]
 [100 362  94]
 [225  94 267]]
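The figures in the classification report can be recovered directly from the confusion matrix, which helps when reading where the classifier goes wrong. The sketch below copies the matrix from the output above and assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes, in the order F, I, M).

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class, columns: predicted class, order F, I, M.
labels = ["F", "I", "M"]
cm = np.array([[233, 56, 240],
               [100, 362, 94],
               [225, 94, 267]])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)      # divide by row totals (true counts)
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted counts)
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The off-diagonal entries show that most errors are F and M being confused with each other (240 and 225 cases), while infants are separated more cleanly.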


Once you have completed both tasks, answer the following questions.

Could you have used both linear regression and KNN regression to solve the regression problem?

Yes, since both algorithms are suited to regression problems.

Could you have used linear regression for the classification problem?

No, because in the classification problem the target is a nominal variable.

## Linear Regression (without train-test split)

In [40]:
from sklearn.linear_model import LinearRegression


In [41]:
# numeric_only=True skips the non-numeric Sex column (needed in pandas >= 2.0)
df.corr(numeric_only=True).sort_values(by='Age', ascending=False)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Age
Rings,0.55672,0.57466,0.557467,0.54039,0.420884,0.503819,0.627574,1.0,1.0
Age,0.55672,0.57466,0.557467,0.54039,0.420884,0.503819,0.627574,1.0,1.0
Shell weight,0.897706,0.90533,0.817338,0.955355,0.882617,0.907656,1.0,0.627574,0.627574
Diameter,0.986812,1.0,0.833684,0.925452,0.893162,0.899724,0.90533,0.57466,0.57466
Height,0.827554,0.833684,1.0,0.819221,0.774972,0.798319,0.817338,0.557467,0.557467
Length,1.0,0.986812,0.827554,0.925261,0.897914,0.903018,0.897706,0.55672,0.55672
Whole weight,0.925261,0.925452,0.819221,1.0,0.969405,0.966375,0.955355,0.54039,0.54039
Viscera weight,0.903018,0.899724,0.798319,0.966375,0.931961,1.0,0.907656,0.503819,0.503819
Shucked weight,0.897914,0.893162,0.774972,0.969405,1.0,0.931961,0.882617,0.420884,0.420884


In [42]:
X = df.iloc[:,1:-2]
y = df[['Age']].values

In [43]:
reg = LinearRegression(fit_intercept=True)

In [44]:
reg.fit(X,y)

LinearRegression()

In [45]:
score = reg.score(X, y)
print(score)


0.5276299399919839


## Linear Regression (with train-test split)

In [46]:
from sklearn.linear_model import LinearRegression


In [47]:
X = df.iloc[:,1:-2]
y = df[['Age']].values

In [48]:
# if this line looks odd, review tuple unpacking
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size =0.3 , random_state=3)

In [49]:
# Standardize the data
scaler = StandardScaler()
scaler.fit(X_train) # always fit on the training split only
X_train = scaler.transform(X_train) # transform both train and test
X_test = scaler.transform(X_test)

In [50]:
reg = LinearRegression()

In [51]:
reg.fit(X_train, y_train)  # always fit on the training split

LinearRegression()

In [52]:
# Predict on the training data to see how well the model fits it (checking for bias)
predictions = reg.predict(X_train)

# Compute R^2 on the training data (score returns R^2 for regressors, not classification accuracy)
score = reg.score(X_train, y_train)
print(score)


0.5439255168679182
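Because the KNN and linear regression runs above use different random splits and report training scores, the two numbers are not directly comparable. One way to put both models on the same footing is k-fold cross-validation over identical folds. The sketch below assumes synthetic data, 5 folds, and pipeline wrappers; none of these choices come from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the abalone measurements).
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400)

# Same folds for both models, so the comparison is like for like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3)),
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
}

cv_r2 = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}
for name, value in cv_r2.items():
    print(f"{name}: mean held-out R^2 = {value:.3f}")
```

Which model comes out ahead depends on the data; the point of the sketch is only that both are scored on rows they were not fitted on.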


Questions to think about and answer

Once you have completed the task, answer the following questions:

Which of KNN or linear regression seemed the better model when you did not use the train-test split?

The KNN regressor was the better model, with a score of 0.72 versus 0.53 for linear regression.

Which of KNN or linear regression seemed the better model when you used the train-test split?

The KNN regressor, with about 0.73 versus about 0.54 for linear regression. Note that both figures were computed on the training portion, not on the test portion.

Was there any advantage to linear regression in terms of the amount of code you had to write?

No, it was almost the same.

Is there any way to show someone which of the two models was more effective?

Yes, through the score: the model with the higher R² predicts better.

Do you think you could have tuned KNN to improve the model's effectiveness?

I could have changed hyperparameters, for example the number of neighbors.
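The last answer can be made concrete with a cross-validated grid search over the number of neighbors and the weighting scheme. This is a minimal sketch on synthetic data; the grid values, the step names `scale` and `knn`, and the 5-fold setting are illustrative assumptions, not tuned choices for the abalone data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the abalone measurements).
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

# Scaling lives inside the pipeline so each fold is scaled on its own training rows.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 9, 15, 25],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print("best parameters:", search.best_params_)
print(f"best mean held-out R^2: {search.best_score_:.3f}")
```

Because `best_score_` is averaged over held-out folds, it does not suffer from the trivially perfect training score that `weights='distance'` gives when scored in-sample.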