# EDA (Exploratory Data Analysis) of the dataset

In this notebook, explore the Abalone dataset by showing relevant visualizations that help you understand the problem you are modelling.

Please make sure to write down your conclusions in the final notebook and to remove these instructions.

# Imports

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR

pd.set_option('display.max_columns', 500)


# Data

In [None]:
data = pd.read_csv("../data/abalone.csv")
data

# EDA

From the problem statement and the feature description, let's first compute the target variable of the problem, `age`, and add it to the dataset: age = Rings + 1.5.

In [None]:
data['age'] = data['Rings']+1.5
data.drop('Rings', axis = 1, inplace = True)

In [None]:
data

In [None]:
print('This dataset has {} observations with {} columns.'.format(data.shape[0], data.shape[1]))

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.describe()

Key insights: <br> <br>
There are no missing values in the dataset. <br>
All features are numerical except 'Sex'. <br>
The features are not normally distributed, but most are close to normal. <br>
Height is the only feature with a minimum of 0. <br>
Each feature has a different scale range.
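
The points above can be checked directly instead of being read off the `describe()` table. The sketch below is only illustrative: the helper name `summarize_checks` and the toy frame (with made-up values) are assumptions and not part of the original notebook.

```python
import pandas as pd


def summarize_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Per numeric column: missing count, zero count, minimum and maximum."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "missing": numeric.isna().sum(),
        "zeros": (numeric == 0).sum(),
        "min": numeric.min(),
        "max": numeric.max(),
    })


# Toy stand-in for the Abalone frame (values are made up for illustration).
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.350, 0.530, 0.440],
    "Height": [0.095, 0.000, 0.135, 0.125],
    "age": [16.5, 8.5, 10.5, 11.5],
})

checks = summarize_checks(toy)
print(checks)
```

In the notebook itself this would be called as `summarize_checks(data)`; comparing the `min` and `max` columns across rows gives a quick view of how different the feature scales are.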

In [None]:
data.hist(figsize=(20,10), grid=False, layout=(2, 4), bins = 30)


In [None]:
numerical_features = data.select_dtypes(include=[np.number]).columns
categorical_features = data.select_dtypes(include=[object]).columns

In [None]:
skew_values = skew(data[numerical_features], nan_policy = 'omit')
dummy = pd.concat([pd.DataFrame(list(numerical_features), columns=['Features']), 
        pd.DataFrame(list(skew_values), columns=['Skewness degree'])], axis = 1)
dummy.sort_values(by = 'Skewness degree' , ascending = False)

For normally distributed data, the skewness should be about 0. For unimodal continuous distributions, a skewness value > 0 means that there is more weight in the right tail of the distribution. The function `skewtest` from `scipy.stats` can be used to determine whether the skewness value is close enough to 0, statistically speaking. <br>
Height has the highest skewness, followed by age and Shucked weight (this can be cross-checked against the histogram plots).
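
As a rough illustration of how `skewtest` behaves, the sketch below runs it on two synthetic samples, one symmetric and one with a long right tail. The samples and the seed are assumptions chosen for the example and are not drawn from the Abalone data.

```python
import numpy as np
from scipy.stats import skew, skewtest

rng = np.random.default_rng(0)
symmetric = rng.normal(size=500)          # skewness expected to be near 0
right_skewed = rng.exponential(size=500)  # long right tail, positive skewness

for name, sample in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    result = skewtest(sample)  # null hypothesis: skewness matches a normal distribution
    print(f"{name}: skew={skew(sample):.3f}, z={result.statistic:.2f}, p={result.pvalue:.4f}")
```

A small p-value suggests the skewness differs from that of a normal distribution. With large samples even mild skewness tends to produce small p-values, so the test is best read together with the skewness value itself.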

In [None]:
sns.countplot(x = 'Sex', data = data, palette="Set3")

In [None]:
data.groupby('Sex')[['Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight','Viscera weight', 'Shell weight', 'age']].mean().sort_values('age')

## Bivariate Analysis
Bivariate analysis is a vital part of the data analysis process because it gives a clear picture of how each feature behaves in the presence of the other features.
It also helps identify significant features and detect multicollinearity and inter-dependency between features, and can thus provide insight into noise patterns hidden in the data.

In [None]:
sns.pairplot(data[numerical_features])

Key insights: <br> <br>
Length is linearly correlated with Diameter and non-linearly related to Height, Whole weight, Shucked weight, Viscera weight, and Shell weight.
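
One way to put a number on "linear" versus "non-linear" is to compare the Pearson correlation (linear association) with the Spearman correlation (monotonic association), and to see whether a log transform straightens the relationship. The sketch below uses synthetic data under the assumption that weight grows roughly with the cube of length; that relationship and all values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
length = rng.uniform(0.1, 0.8, size=300)
# Assumption for the toy data: weight ~ length**3 with small multiplicative noise.
weight = 2.0 * length**3 * rng.lognormal(mean=0.0, sigma=0.05, size=300)
toy = pd.DataFrame({"Length": length, "Whole weight": weight})

# Pearson on raw values measures linear association only.
pearson_raw = toy["Length"].corr(toy["Whole weight"])
# Spearman is Pearson computed on ranks, so it captures any monotonic relationship.
spearman_raw = toy["Length"].rank().corr(toy["Whole weight"].rank())
# A power law becomes a straight line after taking logs on both axes.
pearson_log = np.log(toy["Length"]).corr(np.log(toy["Whole weight"]))

print(f"Pearson (raw):     {pearson_raw:.3f}")
print(f"Spearman (raw):    {spearman_raw:.3f}")
print(f"Pearson (log-log): {pearson_log:.3f}")
```

A Spearman value clearly above the Pearson value, or a Pearson value that rises after the log transform, points to a monotonic but curved relationship like the one visible in the pair plot.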

In [None]:
plt.figure(figsize=(20,7))
sns.heatmap(data[numerical_features].corr(), annot=True)

Whole weight is almost linearly correlated with all other features except age. <br>
Height is the least linearly correlated with the remaining features. <br>
Age is most linearly correlated with Shell weight, followed by Diameter and Length. <br>
Age is least correlated with Shucked weight. <br>
Such high correlation coefficients among features can result in multicollinearity.
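
Multicollinearity can be quantified with the variance inflation factor (VIF), which for each feature equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below is a minimal illustration on synthetic data; the helper name `variance_inflation_factors` and the toy columns are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(df: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF")


rng = np.random.default_rng(0)
length = rng.normal(size=500)
toy = pd.DataFrame({
    "Length": length,
    "Diameter": length + rng.normal(scale=0.1, size=500),  # nearly a copy of Length
    "Height": rng.normal(size=500),                        # independent of the others
})

vif = variance_inflation_factors(toy)
print(vif.sort_values(ascending=False))
```

In the notebook this could be applied to the predictors only, for example `variance_inflation_factors(data[numerical_features].drop(columns=['age']))`. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.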

# Outliers

In [None]:
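# One-hot encode 'Sex' and drop the target; dtype=int keeps the dummies as 0/1 integers
data = pd.get_dummies(data.drop(columns=['age']), dtype=int)
dummy_data = data.copy()  # copy of the encoded features
data.boxplot(rot=90, figsize=(20, 5))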

- Numerical Features (Length, Diameter, Height, Weights): Most of the numerical features, such as Length, Diameter, and the different weight measures (Whole weight, Shucked weight, Viscera weight, Shell weight), exhibit some degree of outliers. These outliers are visible as individual points that extend beyond the whiskers of the boxplot. Notably, the feature Whole weight has a particularly wide range and a significant number of outliers, indicating variability in this measure.

- Height: The Height feature appears to have a concentration of outliers below the lower whisker, suggesting there are many samples with unusually small heights. This might warrant further investigation to determine if these values are accurate or the result of data entry errors.

- Categorical Features (Sex_F, Sex_I, Sex_M): The categorical features (sex columns) appear as flat boxplots because they are encoded as binary variables, hence they don’t exhibit typical boxplot variability. These features don’t provide insight through boxplots, but can be explored through other statistical or visualization techniques.

General Observation: Most of the weight-related features have considerable variability with multiple outliers, which suggests that outlier treatment (like removal or transformation) might be necessary before applying machine learning models. Features such as Length and Diameter are relatively more stable, though they still show some outliers.
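
The points drawn beyond the whiskers follow the usual 1.5 × IQR rule, so they can also be counted per feature. The sketch below assumes that rule; the helper name `iqr_outlier_counts` and the toy values are illustrative and not taken from the dataset.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values below Q1 - k*IQR and above Q3 + k*IQR for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    below = (numeric < q1 - k * iqr).sum()
    above = (numeric > q3 + k * iqr).sum()
    return pd.DataFrame({"below": below, "above": above})


# Toy stand-in: one suspiciously small height and one very heavy specimen.
toy = pd.DataFrame({
    "Height": [0.0, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16],
    "Whole weight": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 2.8],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

In the notebook this would be called on the measurement columns only, since the 1.5 × IQR rule is not meaningful for the 0/1 `Sex_*` columns. The `below` and `above` counts make it easier to decide, feature by feature, whether removal or a transformation is the better treatment.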