## Loading Training Data & Basic Information

In [None]:
# If pandas is not installed, install it (%pip runs pip inside the notebook kernel)
%pip install pandas

In [12]:
import pandas as pd

# load the training data from the csv file
train = pd.read_csv("dataset-train-vf.csv")

# Shape of the training data
print("Training data shape: ", train.shape)

# Display feature names
print("Feature names: ", train.columns.tolist())

# Display the first few rows of the training data
train.head()

Training data shape:  (4480, 13)
Feature names:  ['ID', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'y']


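|   | ID | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 62330 | NaN | 0.748 | 4.845455 | 30405 | 18.066667 | 2.807634 | 663180 | NaN | C1 | square |
| 1 | 2 | NaN | 4370 | NaN | 0.858 | 1.072727 | 2445 | 1.266667 | 0.712986 | 49420 | NaN | C2 | square |
| 2 | 3 | 0.000729 | 1449 | 196.3 | 0.841 | 0.172727 | 795 | 0.42 | 0.112528 | 16240 | NaN | C2 | square |
| 3 | 4 | 0.043499 | 24702 | 349.7 | 0.594 | 5.254545 | 9570 | 7.16 | 2.417831 | 239680 | 0.430355 | C3 | circle |
| 4 | 5 | 0.000972 | 1104 | 162.5 | 0.792 | 0.109091 | 570 | 0.32 | 0.06693 | 12040 | NaN | C3 | square |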


### Important Notes

- The training dataset has 13 columns: 1 ID column + 11 feature columns + 1 label column. The ID column will most likely be dropped since it is not a useful feature.
- Features f1-f10 are numerical; f11 is categorical. We might need one-hot encoding to deal with f11.
- Some columns contain NaN values. We need to count the NaNs per column to decide whether to drop the feature or impute it with some value (mean/median/mode).
- The label column takes two values, square and circle, so this is a binary classification problem, as stated in the guidelines.
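
The first, second, and fourth notes could translate into a preprocessing step along the following lines. This is only a minimal sketch: the toy frame copies a few columns from the five rows displayed above, `prepare_features` is a hypothetical helper name, and treating circle as the positive class (1) is an assumption, not something fixed by the guidelines.

```python
import numpy as np
import pandas as pd

# Toy frame with a subset of columns from the five rows shown by train.head().
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "f1": [np.nan, np.nan, 0.000729, 0.043499, 0.000972],
    "f4": [0.748, 0.858, 0.841, 0.594, 0.792],
    "f11": ["C1", "C2", "C2", "C3", "C3"],
    "y": ["square", "square", "square", "circle", "square"],
})

def prepare_features(df):
    """Hypothetical helper: drop ID, one-hot encode f11, build a 0/1 label."""
    # ID is only a row identifier, and y is the target, so neither is a feature.
    X = df.drop(columns=["ID", "y"])
    # One indicator column per category of f11 (f11_C1, f11_C2, ...).
    X = pd.get_dummies(X, columns=["f11"], prefix="f11")
    # Assumption: circle is encoded as 1, square as 0.
    y = (df["y"] == "circle").astype(int)
    return X, y

X, y = prepare_features(sample)
print(X.columns.tolist())
print(y.tolist())
```

The NaN values in f1 are left untouched here; how to handle them depends on the per-column missing counts.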

## Detailed Information

Next, we look more closely at the training dataset, starting with the class distribution of the circle and square labels, followed by the missing values count per column.

In [19]:
# Class distribution
print("Class distribution (absolute):")
print(train["y"].value_counts())

print("\n")

print("Class distribution (relative):")
print(train["y"].value_counts(normalize=True))

print("\n")

# Missing values count
missing_values = train.isnull().sum()
print("Missing values count:")
print(missing_values)


Class distribution (absolute):
y
square    4181
circle     299
Name: count, dtype: int64


Class distribution (relative):
y
square    0.933259
circle    0.066741
Name: proportion, dtype: float64


Missing values count:
ID        0
f1     1838
f2        0
f3     1384
f4        0
f5        0
f6        0
f7        0
f8        0
f9        0
f10    3912
f11       0
y         0
dtype: int64
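
To make the drop-or-impute decision easier, the missing counts above can be expressed as a share of the 4480 rows. The sketch below reuses the printed counts directly. The 50% cut-off (`DROP_THRESHOLD`) is an illustrative assumption, not a rule taken from the guidelines.

```python
import pandas as pd

# Row count and missing counts copied from the outputs above.
n_rows = 4480
missing_counts = pd.Series({"f1": 1838, "f3": 1384, "f10": 3912})

# Share of missing values per column, in percent.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)

# Assumption: columns with more than half their values missing are drop candidates,
# the rest are candidates for imputation (mean/median/mode).
DROP_THRESHOLD = 50.0
to_drop = missing_pct[missing_pct > DROP_THRESHOLD].index.tolist()
to_impute = missing_pct[missing_pct <= DROP_THRESHOLD].index.tolist()
print("Drop candidates:", to_drop)
print("Impute candidates:", to_impute)
```

On the full DataFrame, `train.isnull().mean() * 100` gives the same percentages without hard-coding the counts.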
