In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Exploring Dataset Columns

For this analysis and report, we focus on the thorax & wing traits dataset. We begin by exploring the dataset columns. Most columns are wing and thorax measurements, so we should expect them to be numerical. Some columns (temperature, vial, replicate) describe the experimental setup of the study, i.e. the conditions under which the measurements were taken. These may or may not be relevant predictors.

We see that there are two species of Drosophila, taken from five population locations. A relevant objective could be to predict the species of a fly from its wing and thorax measurements (and perhaps the experimental predictors). This could give some insight into the trait differences between the two species, which may be useful for further research into the flies.

In [2]:
df = pd.read_csv("../data/83_Loeschcke_et_al_2000_Thorax_&_wing_traits_lab pops.csv")
df

Unnamed: 0,Species,Population,Latitude,Longitude,Year_start,Year_end,Temperature,Vial,Replicate,Sex,Thorax_length,l2,l3p,l3d,lpd,l3,w1,w2,w3,wing_loading
0,D._aldrichi,Binjour,-25.52,151.45,1994,1994,20,1,1,female,1.238,2.017,0.659,1.711,2.370,2.370,1.032,1.441,1.192,1.914
1,D._aldrichi,Binjour,-25.52,151.45,1994,1994,20,1,1,male,1.113,1.811,0.609,1.539,2.148,2.146,0.938,1.299,1.066,1.928
2,D._aldrichi,Binjour,-25.52,151.45,1994,1994,20,1,2,female,1.215,1.985,0.648,1.671,2.319,2.319,0.991,1.396,1.142,1.908
3,D._aldrichi,Binjour,-25.52,151.45,1994,1994,20,1,2,male,1.123,1.713,0.596,1.495,2.091,2.088,0.958,1.286,1.062,1.860
4,D._aldrichi,Binjour,-25.52,151.45,1994,1994,20,2,1,female,1.218,1.938,0.641,1.658,2.298,2.298,1.010,1.418,1.148,1.886
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1726,D._buzzatii,Wahruna,-25.20,151.17,1994,1994,30,10,1,male,1.033,1.568,0.528,1.309,1.837,1.837,0.783,1.107,0.920,1.778
1727,D._buzzatii,Wahruna,-25.20,151.17,1994,1994,30,10,2,female,1.138,1.719,0.558,1.442,1.999,1.999,0.867,1.223,0.992,1.757
1728,D._buzzatii,Wahruna,-25.20,151.17,1994,1994,30,10,2,male,1.019,1.515,0.501,1.305,1.807,1.805,0.821,1.135,0.911,1.771
1729,D._buzzatii,Wahruna,-25.20,151.17,1994,1994,30,10,3,female,1.118,1.620,0.544,1.404,1.947,1.947,0.863,1.187,0.976,1.741


In [3]:
df.columns

Index(['Species', 'Population', 'Latitude', 'Longitude', 'Year_start',
       'Year_end', 'Temperature', 'Vial', 'Replicate', 'Sex', 'Thorax_length',
       'l2', 'l3p', 'l3d', 'lpd', 'l3', 'w1', 'w2', 'w3', 'wing_loading'],
      dtype='object')

In [4]:
df["Species"].value_counts(sort=False)

Species
D._aldrichi    840
D._buzzatii    891
Name: count, dtype: int64

In [5]:
df["Population"].value_counts(sort=False)

Population
Binjour          341
Gogango_Creek    354
Grandchester     349
Oxford_Downs     341
Wahruna          346
Name: count, dtype: int64

In [6]:
df["Year_start"].unique(), df["Year_end"].unique()

(array([1994]), array([1994]))

In [7]:
df["Temperature"].value_counts(sort=False)

Temperature
20    578
25    581
30    572
Name: count, dtype: int64

In [8]:
df["Vial"].value_counts(sort=False)

Vial
1     170
2     172
3     169
4     172
5     178
6     176
7     173
8     177
9     172
10    172
Name: count, dtype: int64

In [9]:
df["Replicate"].value_counts(sort=False)

Replicate
1    600
2    586
3    545
Name: count, dtype: int64

In [10]:
df["Sex"].value_counts(sort=False)

Sex
female    859
male      872
Name: count, dtype: int64

In [11]:
df[["Thorax_length", "l2", "l3p", "l3d", "lpd", "l3", "w1", "w2", "w3", "wing_loading"]].describe()

Unnamed: 0,l2,l3p,l3d,lpd,l3,w1,w2,w3
count,1731.0,1731.0,1731.0,1731.0,1731.0,1731.0,1731.0,1731.0
mean,1.723935,0.585854,1.455826,2.041169,2.040291,0.914038,1.252196,1.038279
std,0.165536,0.05361,0.128044,0.178219,0.178354,0.074163,0.106781,0.089665
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.607,0.547,1.37,1.9205,1.919,0.864,1.176,0.976
50%,1.722,0.585,1.457,2.04,2.039,0.912,1.251,1.037
75%,1.84,0.624,1.54,2.1595,2.1585,0.963,1.3255,1.1
max,2.095,0.742,1.742,2.419,2.418,1.084,1.514,1.282


# Data Cleaning and Preparation

Now that we're familiar with the columns, we'll remove any that may be unnecessary for species classification.

In [12]:
df.columns

Index(['Species', 'Population', 'Latitude', 'Longitude', 'Year_start',
       'Year_end', 'Temperature', 'Vial', 'Replicate', 'Sex', 'Thorax_length',
       'l2', 'l3p', 'l3d', 'lpd', 'l3', 'w1', 'w2', 'w3', 'wing_loading'],
      dtype='object')

- There is only one `Year_start` and `Year_end` value, so we can drop those columns
- Since we are already using population as a predictor, we'll drop the latitude and longitude columns: each population corresponds to a single location, so the coordinates add no information beyond the population
- The main predictors are thorax length, wing loading, l2, l3p, ..., w3, and sex. Species is the target we want to predict
- Predictors that *may* be useful are vial, replicate, and temperature. These relate to the experimental setup of the study. We can assess if these are important for classification a bit later, e.g. with feature selection
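The redundancy of latitude and longitude can be checked rather than assumed: if every population maps to exactly one coordinate pair, the coordinates carry nothing beyond the population label. A minimal sketch on a toy frame (the rows are illustrative and only borrow the column names and two coordinate pairs from the table above):

```python
import pandas as pd

# Toy rows reusing the notebook's column names; values are illustrative only.
toy = pd.DataFrame({
    "Population": ["Binjour", "Binjour", "Wahruna", "Wahruna"],
    "Latitude": [-25.52, -25.52, -25.20, -25.20],
    "Longitude": [151.45, 151.45, 151.17, 151.17],
})

# Count the distinct coordinate values within each population.
coords_per_population = toy.groupby("Population")[["Latitude", "Longitude"]].nunique()

# True when every population has exactly one latitude and one longitude.
is_redundant = bool((coords_per_population == 1).all().all())

print(coords_per_population)
print("coordinates redundant with population:", is_redundant)
```

Running the same `groupby` on the full `df` before dropping the columns would confirm whether the assumption holds for the real data.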

In [13]:
df.drop(columns=["Latitude", "Longitude", "Year_start", "Year_end"], inplace=True)
df.shape

(1731, 16)

Next we can inspect the dtypes of each column to see if they're as expected.

In [14]:
df.dtypes

Species           object
Population        object
Temperature        int64
Vial               int64
Replicate          int64
Sex               object
Thorax_length     object
l2               float64
l3p              float64
l3d              float64
lpd              float64
l3               float64
w1               float64
w2               float64
w3               float64
wing_loading      object
dtype: object

- `Thorax_length` and `wing_loading` should not be `object`s. This could be due to some invalid data rows, which we could either remove or replace with e.g. the mean value. It turns out that only one row has an invalid thorax length and wing loading, so we will just remove that row.
- `Sex`, `Species`, and `Population` are categorical. We will want to make these numeric so that our models can use them.
- Replicate and vial are also categorical: their numbers are arbitrary identifiers. We should avoid mapping categories to arbitrary integers, as this implies an ordinal relationship that does not exist. For example, encoding "Male" as 0 and "Female" as 1 could be interpreted by a model as "Female" > "Male", and using vial numbers as-is would let a model compute quantities like the "average vial number"; neither makes sense. One thing we can do is use one-hot encoding, which creates a binary indicator for each category and so avoids inferring incorrect ordinal relationships.
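To make the contrast concrete, here is a small sketch with made-up vial labels. It shows the meaningless arithmetic that raw identifiers allow, and the indicator columns that `pd.get_dummies` produces instead:

```python
import pandas as pd

# Toy vial identifiers; the numbers are labels, not quantities.
toy = pd.DataFrame({"Vial": [1, 2, 10, 2]})

# Used as-is, the labels invite meaningless arithmetic such as an "average vial".
average_vial = toy["Vial"].mean()

# One-hot encoding gives each vial its own indicator column instead.
vial_indicators = pd.get_dummies(toy["Vial"], prefix="Vial")

print("average vial number:", average_vial)
print(vial_indicators)
```

Each row has exactly one active indicator, so no ordering between vials is implied.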

In [15]:
for feature in ("Sex", "Population", "Replicate", "Vial"):
    encoded = pd.get_dummies(df[feature], prefix=feature)
    df = pd.concat([df, encoded], axis=1)

In [16]:
# Coerce each column to numeric; values that cannot be parsed become NaN and are counted.
print("invalid thorax_length values:", pd.to_numeric(df["Thorax_length"], errors="coerce").isna().sum())
print("invalid wing_loading values :", pd.to_numeric(df["wing_loading"], errors="coerce").isna().sum())

invalid thorax_length values: 1
invalid wing_loading values : 1
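The two counts alone do not show whether both columns fail on the same row, or what the offending value looks like. A sketch of how one might inspect this, using a toy frame in which the string `"bad"` is a hypothetical stand-in for whatever the invalid entry actually is:

```python
import pandas as pd

# Toy data; "bad" is a hypothetical placeholder for an unparseable entry.
toy = pd.DataFrame({
    "Thorax_length": ["1.238", "bad", "1.215"],
    "wing_loading": ["1.914", "bad", "1.908"],
})

# Coerce each column to numbers; entries that cannot be parsed become NaN.
thorax_bad = pd.to_numeric(toy["Thorax_length"], errors="coerce").isna()
loading_bad = pd.to_numeric(toy["wing_loading"], errors="coerce").isna()

# Show the raw values of any row that fails in either column.
offending_rows = toy.loc[thorax_bad | loading_bad, ["Thorax_length", "wing_loading"]]

# Check whether the two columns fail on exactly the same rows.
same_rows = thorax_bad.equals(loading_bad)

print(offending_rows)
print("both columns fail on the same rows:", same_rows)
```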


In [17]:
df["Thorax_length"] = pd.to_numeric(df["Thorax_length"], errors="coerce")
df["wing_loading"] = pd.to_numeric(df["wing_loading"], errors="coerce")
mask = df["Thorax_length"].isna() | df["wing_loading"].isna()
df = df[~mask]
df.shape

(1730, 36)

The data is now cleaned up. The original `Species`, `Sex`, and `Population` columns have been kept around, as we'll want to convert encoded values back to their corresponding categories to interpret our results later.
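As an illustration of mapping numeric values back to categories, the sketch below encodes species names as integers with `pd.factorize` and then decodes a list of integer codes. The use of `pd.factorize` and the `predicted_codes` values are assumptions made for this example, not something this notebook has done so far:

```python
import pandas as pd

# Toy species column using the two names from the dataset.
species = pd.Series(["D._aldrichi", "D._buzzatii", "D._aldrichi"], name="Species")

# factorize assigns integer codes in order of first appearance
# and returns the labels needed to decode them.
codes, labels = pd.factorize(species)

# Hypothetical model output expressed as integer codes.
predicted_codes = [1, 0, 0]

# Map the codes back to species names via the stored labels.
predicted_species = labels.take(predicted_codes).tolist()

print(codes.tolist())
print(predicted_species)
```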

In [18]:
df.dtypes

Species                      object
Population                   object
Temperature                   int64
Vial                          int64
Replicate                     int64
Sex                          object
Thorax_length               float64
l2                          float64
l3p                         float64
l3d                         float64
lpd                         float64
l3                          float64
w1                          float64
w2                          float64
w3                          float64
wing_loading                float64
Sex_female                     bool
Sex_male                       bool
Population_Binjour             bool
Population_Gogango_Creek       bool
Population_Grandchester        bool
Population_Oxford_Downs        bool
Population_Wahruna             bool
Replicate_1                    bool
Replicate_2                    bool
Replicate_3                    bool
Vial_1                         bool
Vial_2                      

Now we will split the data into a train split (70%) and a test split (30%), and then save the cleaned and split data to two new CSV files. We do this so that there is no chance of accidentally using the test data during our analysis.
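The property that matters for avoiding leakage is that no row appears in both splits. The sketch below illustrates that check with a pandas-only 70/30 split on a toy frame. It uses `DataFrame.sample` rather than the `train_test_split` call in this notebook, and the `row_id` column is made up for the example:

```python
import pandas as pd

# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({"row_id": range(10)})

# Sample 70% of the rows for training; the remainder becomes the test split.
train_toy = toy.sample(frac=0.7, random_state=42)
test_toy = toy.drop(train_toy.index)

# No index label should appear in both splits, and no row should be lost.
overlap = train_toy.index.intersection(test_toy.index)

print(len(train_toy), len(test_toy), len(overlap))
```

The same `index.intersection` check can be applied to the `train` and `test` frames produced by `train_test_split`, since they keep the original index labels.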

In [19]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, train_size=0.7, random_state=42)
print(train.shape, test.shape)


(1211, 36) (519, 36)


In [20]:
train.to_csv("../data/thorax_and_wing_train.csv", index=False)
test.to_csv("../data/thorax_and_wing_test.csv", index=False)