# Penguin Dataset

The dataset consists of 7 columns and 344 rows.

The 7 columns are:

1. species: penguin species (Chinstrap, Adélie, or Gentoo)
2. culmen_length_mm: culmen length in mm
3. culmen_depth_mm: culmen depth in mm
4. flipper_length_mm: flipper length in mm
5. body_mass_g: body mass in grams
6. island: name of the island (Dream, Torgersen, or Biscoe)
7. sex: sex of the penguin (male or female)

We will perform exploratory data analysis on this dataset.

<b> Importing Libraries </b>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<b> Reading and understanding the dataset </b>

In [None]:
df = pd.read_csv('../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

<b> Null Values </b>

In [None]:
df.isnull().sum()

As the results above show, a few values are missing. I will replace the missing values in each numeric column with that column's mean.

In [None]:
df["culmen_length_mm"] = df["culmen_length_mm"].fillna(value = df["culmen_length_mm"].mean())
df["culmen_depth_mm"] = df["culmen_depth_mm"].fillna(value = df["culmen_depth_mm"].mean())
df["flipper_length_mm"] = df["flipper_length_mm"].fillna(value = df["flipper_length_mm"].mean())
df["body_mass_g"] = df["body_mass_g"].fillna(value = df["body_mass_g"].mean())
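The four assignments above repeat one pattern, so they could also be written as a loop. Here is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the penguin data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two of the numeric penguin columns.
toy = pd.DataFrame({
    "culmen_length_mm": [39.0, np.nan, 41.0],
    "body_mass_g": [3700.0, 3800.0, np.nan],
})

# Fill the missing values of each numeric column with that column's own mean.
for col in ["culmen_length_mm", "body_mass_g"]:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

Mean imputation leaves each column's mean unchanged but slightly reduces its variance, which matters little when only a few rows are affected.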


In [None]:
# Fill missing sex values with 'MALE' (a simple, somewhat arbitrary default).
df['sex'] = df['sex'].fillna('MALE')

In [None]:
# Row 336 is overwritten by hand; presumably it holds an invalid entry rather than a null.
df.loc[336,'sex'] = 'MALE'
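Filling every missing `sex` value with a fixed label and patching one row by index is a quick fix. A slightly more general sketch, assuming the column may contain a placeholder such as `'.'` (the toy series below is hypothetical), is to list the distinct values first, treat anything unexpected as missing, and fill with the most frequent valid value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one null and one invalid placeholder.
sex = pd.Series(["MALE", "FEMALE", np.nan, ".", "FEMALE"])

# Inspect every distinct value, including nulls, to spot invalid entries.
print(sex.value_counts(dropna=False))

# Keep only the expected labels; everything else becomes NaN.
valid = {"MALE", "FEMALE"}
cleaned = sex.where(sex.isin(valid))

# Fill the gaps with the most frequent valid label (the mode).
most_common = cleaned.mode()[0]
cleaned = cleaned.fillna(most_common)

print(cleaned.tolist())
```

Using the mode is only one possible choice; dropping the affected rows is another.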

Now we check whether any null values remain.

In [None]:
df.isna().sum()

In [None]:
df.tail()

<b> Duplicated Data </b>

In [None]:
duplicated = df.duplicated()
print(duplicated.sum())
df[duplicated]
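The cell above only counts and displays duplicated rows. If any were found, they could be removed with `drop_duplicates`. A minimal sketch on a hypothetical toy frame (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical toy data in which the second row repeats the first.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3700.0, 3700.0, 5000.0],
})

# duplicated() flags every repeat of an earlier row.
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```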

<font size="3"> <b> Bar Charts </b> </font>

In [None]:
df['species'].value_counts().plot(kind='bar')

In [None]:
df['species'].value_counts()

In [None]:
df['island'].value_counts().plot(kind='bar')

In [None]:
df['island'].value_counts()

In [None]:
df['sex'].value_counts().plot(kind='bar')

In [None]:
df['sex'].value_counts()
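Raw counts can be harder to compare than shares. As a small sketch, `value_counts(normalize=True)` returns proportions instead of counts (the toy series below is hypothetical):

```python
import pandas as pd

# Hypothetical toy column of species labels.
species = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap"])

# normalize=True divides each count by the number of non-null values.
shares = species.value_counts(normalize=True)

print(shares)
```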

<font size="3"> <b> Histogram </b> </font>

In [None]:
# sns.distplot is deprecated in recent seaborn releases; histplot draws the same histograms.
sns.histplot(x=df['culmen_depth_mm'], label="Culmen-Depth")
sns.histplot(x=df['culmen_length_mm'], label="Culmen-Length")
sns.histplot(x=df['flipper_length_mm'], label="Flipper-Length")

# The three measurements are overlaid on one axis; they are not split by species.
plt.title("Histogram of Penguin Features")

plt.legend()

<font size="3"> <b> Pair Plot </b> </font>

In [None]:
sns.pairplot(df,hue = "species")

<font size="3"> <b> Categorical Plot </b> </font>

In [None]:
sns.catplot(x="sex", y="body_mass_g", hue="species", kind="bar", data=df)

In [None]:
sns.catplot(x="sex", y="culmen_length_mm", hue="species", kind="bar", data=df)

In [None]:
sns.catplot(x="sex", y="culmen_depth_mm", hue="species", kind="bar", data=df)

In [None]:
sns.catplot(x="sex", y="flipper_length_mm", hue="species", kind="bar", data=df)

<font size="3"> <b> Correlation between variables </b> </font>

In [None]:
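# numeric_only=True skips the text columns (species, island, sex); pandas 2.x raises an error without it.
df.corr(numeric_only=True)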

<font size="3"> <b> Heat Map </b> </font>

In [None]:
# Spearman rank correlation over the numeric columns only.
corr = df.corr(method = 'spearman', numeric_only=True)
plt.figure(figsize=(14,7))
plt.title("Spearman correlation of the numeric features")

sns.heatmap(corr, annot=True)
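The table above uses the pandas default (Pearson), while the heat map uses Spearman. The sketch below, on hypothetical toy data, shows how the two can differ: Pearson measures linear association, whereas Spearman works on ranks and so reaches 1 for any strictly increasing relationship. Selecting the numeric columns first with `select_dtypes` avoids errors from text columns.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically but not linearly with x.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "label": ["a", "b", "a", "b", "a"],  # text column, excluded below
})
toy["y"] = toy["x"] ** 3

# Restrict to numeric columns before computing correlations.
numeric = toy.select_dtypes(include="number")

pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```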