In [1]:
import pandas as pd
import numpy as np



As a first step, we install and import the packages we need.

### N1.

In [2]:
df = pd.read_csv('data/penguins_size.csv')

pandas provides a different reader function for each file format (read_csv, read_excel, read_json, read_html, and so on).
In this case we use read_csv to read the dataset and load it into a pandas DataFrame.
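
To make the read step concrete without needing the data file, here is a minimal sketch that reads CSV text from an in-memory buffer. The sample rows are made up for illustration, and passing na_values (a real read_csv parameter) to treat "." as missing is an assumption about how one might want to load such data, not something the notebook does.

```python
import io
import pandas as pd

# Hypothetical sample with a layout similar to the penguins file (not the real data).
csv_text = """species,island,body_mass_g,sex
Adelie,Torgersen,3750,MALE
Adelie,Torgersen,,
Gentoo,Biscoe,4875,.
"""

# na_values adds "." to the strings that pandas treats as missing while parsing.
sample = pd.read_csv(io.StringIO(csv_text), na_values=["."])

print(sample)
print(sample.dtypes)
```

Empty fields become NaN by default; the extra na_values entry only affects the "." field.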

### N2.

In [3]:
shape = df.shape

print(f'Our dataset shape is {shape}, that means we have {shape[0]} rows and {shape[1]} columns.')

Our dataset shape is (344, 7), that means we have 344 rows and 7 columns.


The shape attribute gives the dimensions of our dataset (the number of rows and columns).

### N3.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   culmen_length_mm   342 non-null    float64
 3   culmen_depth_mm    342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                334 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


To learn more about our dataset and see a summary of it, we can use the info method. As you can see, we have three object (string) columns and four float columns.
Our dataset takes up at least 18.9 KB of memory. Two columns (species and island) have no null values, but all the others do.
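
The + in "memory usage: 18.9+ KB" marks the figure as a lower bound: by default pandas does not measure the strings stored in object columns. The sketch below uses a small made-up frame (an assumption, not the penguins data) to compare the default estimate with memory_usage(deep=True), which inspects the strings themselves.

```python
import pandas as pd

# Made-up frame with one object (string) column and one float column.
toy = pd.DataFrame({
    "species": pd.Series(["Adelie", "Gentoo", "Chinstrap"], dtype=object),
    "body_mass_g": [3750.0, 5000.0, 3700.0],
})

shallow = toy.memory_usage(deep=False).sum()  # counts only the object pointers
deep = toy.memory_usage(deep=True).sum()      # also counts the string contents

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On the real DataFrame, df.info(memory_usage='deep') reports the deep figure directly.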

In [5]:
df.head(10)

,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,FEMALE
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,MALE
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


In [6]:
df.tail(10)

,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
334,Gentoo,Biscoe,46.2,14.1,217.0,4375.0,FEMALE
335,Gentoo,Biscoe,55.1,16.0,230.0,5850.0,MALE
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,.
337,Gentoo,Biscoe,48.8,16.2,222.0,6000.0,MALE
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,FEMALE
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE
343,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,MALE


The head and tail methods show a few rows from the top and bottom of the dataset, which helps us get more familiar with it.

### N4.

In [7]:
unique_species = df['species'].unique()

print("Unique species in our dataset are", unique_species, "while there are 17 penguin species on the planet.")


Unique species in our dataset are ['Adelie' 'Chinstrap' 'Gentoo'] while there are 17 penguin species on the planet.


To see which species appear in the dataset, we use the unique method. It shows that each of our 344 penguins is either Adelie, Chinstrap, or Gentoo.
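
As a small sketch of the related calls, unique returns the distinct values, nunique returns how many there are, and value_counts returns how many rows each value has. The Series below is made up for illustration.

```python
import pandas as pd

# Made-up species column, not the real dataset.
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo", "Adelie"])

print(species.unique())        # distinct values, in order of first appearance
print(species.nunique())       # number of distinct values
print(species.value_counts())  # number of rows per value
```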

### N5.

In [8]:
sum_nan_values_per_columns = df.isnull().sum()

print("Numbers of NaN values per columns:")
print("----------------------------------")
print(sum_nan_values_per_columns)

Numbers of NaN values per columns:
----------------------------------
species               0
island                0
culmen_length_mm      2
culmen_depth_mm       2
flipper_length_mm     2
body_mass_g           2
sex                  10
dtype: int64


This result shows that every column except species and island has some null values.

In [9]:
df[df.isnull().any(axis=1)]

,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,,,,,
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,
47,Adelie,Dream,37.5,18.9,179.0,2975.0,
246,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,
286,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,
324,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,
339,Gentoo,Biscoe,,,,,


In the table above, you can see all the rows that have null values.
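
In that table, two rows are missing every measurement while the rest are missing only sex. A minimal sketch of how to separate the two cases, on a made-up frame with assumed column names taken from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up rows: one fully empty, one missing only sex, one complete.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "flipper_length_mm": [np.nan, 193.0, 216.0],
    "body_mass_g": [np.nan, 3475.0, 4100.0],
    "sex": [np.nan, np.nan, "MALE"],
})

# Number of missing values in each row.
missing_per_row = toy.isnull().sum(axis=1)
print(missing_per_row)

# Rows where sex is missing but the measurements are present.
only_sex_missing = toy[toy["sex"].isnull() & toy["body_mass_g"].notnull()]
print(only_sex_missing)
```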

### N6.

In [10]:
culmen_depth_by_island = df.groupby('island')['culmen_depth_mm'].mean()

print("Mean culmen depth by island:")
print("----------------------------")
print(culmen_depth_by_island)



Mean culmen depth by island:
----------------------------
island
Biscoe       15.874850
Dream        18.344355
Torgersen    18.429412
Name: culmen_depth_mm, dtype: float64


The average culmen depth of penguins on Torgersen is higher than on the other islands, although only slightly higher than on Dream.

In [11]:
culmen_depth_by_island = df.groupby('island')['culmen_depth_mm'].max()

print("Maximum culmen depth by island:")
print("-------------------------------")
print(culmen_depth_by_island)


Maximum culmen depth by island:
-------------------------------
island
Biscoe       21.1
Dream        21.2
Torgersen    21.5
Name: culmen_depth_mm, dtype: float64


The penguin with the largest culmen depth lives on Torgersen!

In [12]:
culmen_depth_by_island = df.groupby('island')['culmen_depth_mm'].min()

print("Minimum culmen depth by island:")
print("-------------------------------")
print(culmen_depth_by_island)


Minimum culmen depth by island:
-------------------------------
island
Biscoe       13.1
Dream        15.5
Torgersen    15.9
Name: culmen_depth_mm, dtype: float64


The penguin with the smallest culmen depth lives on Biscoe!
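
The grouped max and min tell us the extreme values per island. If we want the actual rows behind those extremes, one option is idxmax and idxmin, sketched here on a made-up frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Biscoe"],
    "culmen_depth_mm": [13.1, 21.2, 21.5, 15.0],
})

# idxmax / idxmin return the index label of the largest / smallest value.
deepest = toy.loc[toy["culmen_depth_mm"].idxmax()]
shallowest = toy.loc[toy["culmen_depth_mm"].idxmin()]

print(deepest["island"], deepest["culmen_depth_mm"])
print(shallowest["island"], shallowest["culmen_depth_mm"])
```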

### N7.

In [13]:
body_mass_by_sex = df.groupby('sex')['body_mass_g'].mean()

print("Mean body mass by sex:")
print("----------------------")
print(body_mass_by_sex)

Mean body mass by sex:
----------------------
sex
.         4875.000000
FEMALE    3862.272727
MALE      4545.684524
Name: body_mass_g, dtype: float64


In [14]:
df.loc[~df['sex'].isin(['MALE', 'FEMALE']), 'sex'] = np.nan

As you can see, the sex column contains one invalid value (.). To clean it, I looked carefully at the table and noticed that the rows tend to alternate between male and female, which would make this row a female. Since that reasoning is not strong enough, I decided to replace the value with NaN instead. (The column's dtype is object; if it had been float or int, I would have filled it with the value from the row above or below, or with the mean.)
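
An equivalent way to express this cleaning step is Series.where, which keeps values where the condition holds and puts NaN everywhere else. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up sex column with one invalid entry and one missing entry.
sex = pd.Series(["MALE", "FEMALE", ".", np.nan, "FEMALE"])

# Keep only the two valid labels; anything else becomes NaN.
cleaned = sex.where(sex.isin(["MALE", "FEMALE"]))
print(cleaned)
```

Unlike the in-place .loc assignment, this returns a new Series and leaves the original untouched.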

In [15]:
body_mass_by_sex = df.groupby('sex')['body_mass_g'].mean()

print("Mean body mass by sex:")
print("----------------------")
print(body_mass_by_sex)

Mean body mass by sex:
----------------------
sex
FEMALE    3862.272727
MALE      4545.684524
Name: body_mass_g, dtype: float64


On average, male penguins are heavier than females!

In [16]:
body_mass_by_sex = df.groupby('sex')['body_mass_g'].max()

print("Maximum body mass by sex:")
print("-------------------------")
print(body_mass_by_sex)

Maximum body mass by sex:
-------------------------
sex
FEMALE    5200.0
MALE      6300.0
Name: body_mass_g, dtype: float64


In [17]:
body_mass_by_sex = df.groupby('sex')['body_mass_g'].min()

print("Minimum body mass by sex:")
print("-------------------------")
print(body_mass_by_sex)

Minimum body mass by sex:
-------------------------
sex
FEMALE    2700.0
MALE      3250.0
Name: body_mass_g, dtype: float64


As might be expected, and as the results make clear, the heaviest penguin is male and the lightest one is female!

### N8. 

In [18]:
df.groupby(['island', 'species']).size().unstack(fill_value=0)

island,Adelie,Chinstrap,Gentoo
Biscoe,44,0,124
Dream,56,68,0
Torgersen,52,0,0


This table shows that only Adelie is present on all three islands, while Chinstrap and Gentoo are found only on Dream and Biscoe, respectively!
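
The same kind of count table can also be built with pd.crosstab, which fills absent combinations with 0 by default. The sketch below uses made-up rows and also derives which species occur on every island:

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
})

# Rows are islands, columns are species, cells are counts.
counts = pd.crosstab(toy["island"], toy["species"])
print(counts)

# A species is on every island if its column has no zero counts.
on_all_islands = counts.columns[(counts > 0).all(axis=0)].tolist()
print(on_all_islands)
```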

### N9. 

In [19]:
count_of_each_sex_by_species = df.groupby('species')['sex'].value_counts().unstack()
share_female_by_species = ((count_of_each_sex_by_species['FEMALE']) / 
                           (count_of_each_sex_by_species['MALE'] + 
                            count_of_each_sex_by_species['FEMALE']))

print(share_female_by_species)

species
Adelie       0.500000
Chinstrap    0.500000
Gentoo       0.487395
dtype: float64


Among penguins whose sex is known, 50 percent of Adelie and Chinstrap penguins are female, while slightly less than half of Gentoo penguins are female (about 49%).
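
The division above can also be left to pandas: value_counts(normalize=True) returns shares instead of counts and, by default, ignores missing values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up rows; one Adelie has an unknown sex.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo", "Gentoo"],
    "sex": ["FEMALE", "MALE", np.nan, "FEMALE", "MALE", "MALE", "MALE"],
})

# Share of each sex within each species, computed over rows with a known sex.
shares = toy.groupby("species")["sex"].value_counts(normalize=True).unstack()
print(shares["FEMALE"])
```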

### N10. 

In [20]:
flipper_by_species = df.groupby('species')['flipper_length_mm'].agg(['min', 'max', 'mean'])

print("Flipper length comparisons between species:")
print("-------------------------------------------")
print(flipper_by_species)


Flipper length comparisons between species:
-------------------------------------------
             min    max        mean
species                            
Adelie     172.0  210.0  189.953642
Chinstrap  178.0  212.0  195.823529
Gentoo     203.0  231.0  217.186992


The shortest and longest flippers belong to Adelie and Gentoo penguins, at 172 mm and 231 mm respectively. On average, Gentoo penguins have the longest flippers, followed by Chinstrap and then Adelie.

# 3 + 2 more analyses!

### N11.

In [21]:
body_mass_by_island = df.groupby('island')['body_mass_g'].mean()

print(body_mass_by_island)


island
Biscoe       4716.017964
Dream        3712.903226
Torgersen    3706.372549
Name: body_mass_g, dtype: float64


We can see that penguins living on Biscoe are clearly heavier, on average, than those on the other two islands!

### N12.

In [22]:
correlation_culmen_and_flipper = df['culmen_length_mm'].corr(df['flipper_length_mm'])

print(f"Correlation rate is: {correlation_culmen_and_flipper}")


Correlation rate is: 0.6561813407464281


The correlation coefficient is about 0.66, which indicates a moderate positive linear correlation: when culmen length increases, flipper length tends to increase as well.
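
By default, corr computes the Pearson coefficient: the covariance of the two columns divided by the product of their standard deviations. The sketch below checks that formula by hand against Series.corr; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements, not the real columns.
x = pd.Series([39.1, 39.5, 40.3, 46.2, 50.4])
y = pd.Series([181.0, 186.0, 195.0, 217.0, 222.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the mean.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # method="pearson" is the default
print(r_manual, r_pandas)
```

The result always lies between -1 and 1, and values closer to either end mean a stronger linear relationship.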

### N13.

In [23]:
correlations = df[['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']].corr().unstack()

print(correlations)


culmen_length_mm   culmen_length_mm     1.000000
                   culmen_depth_mm     -0.235053
                   flipper_length_mm    0.656181
                   body_mass_g          0.595110
culmen_depth_mm    culmen_length_mm    -0.235053
                   culmen_depth_mm      1.000000
                   flipper_length_mm   -0.583851
                   body_mass_g         -0.471916
flipper_length_mm  culmen_length_mm     0.656181
                   culmen_depth_mm     -0.583851
                   flipper_length_mm    1.000000
                   body_mass_g          0.871202
body_mass_g        culmen_length_mm     0.595110
                   culmen_depth_mm     -0.471916
                   flipper_length_mm    0.871202
                   body_mass_g          1.000000
dtype: float64


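This table gives us a lot of useful information about the correlations between the numeric columns. For example, flipper length and body mass have the strongest correlation (0.87), while culmen depth is negatively correlated with the other three measurements.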
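
The unstacked output lists every pair twice and includes each column's correlation with itself. One possible way to keep each pair once and rank the pairs by strength is sketched below; the frame and its column names a, b, c are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Boolean mask for the upper triangle; k=1 leaves out the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep one entry per pair, then sort by absolute correlation.
pairs = corr.where(mask).stack().dropna()
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```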

### N14.

In [24]:
body_mass_by_species = df.groupby('species')['body_mass_g'].mean()

print(body_mass_by_species)


species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64


It is surprisingly clear that Gentoo penguins are heavier than the others!

### N15.

In [25]:
df.describe()

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0


This final table summarizes the numerical columns and gives us a better overall view of our dataset.
For example, the median body mass is 4050 g, so at least half of the penguins weigh more than 4 kg! That is about the same weight as a gaming laptop with its charger!
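
The 50% row of describe is the median, and that is what supports the claim about 4 kg. The exact share can also be computed directly, as in this sketch on a made-up Series (the values are illustrative, not the full column):

```python
import pandas as pd

# Made-up body masses in grams, with one missing value.
mass_g = pd.Series([2700.0, 3550.0, 4050.0, 4750.0, 6300.0, None])

# median skips missing values by default.
print(mass_g.median())

# The mean of a boolean Series is the share of True values;
# dropna keeps missing rows out of the denominator.
share_above_4kg = (mass_g.dropna() > 4000).mean()
print(share_above_4kg)
```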

## Thanks for everything!