In [2]:
import pandas as pd
import seaborn as sns

# Investigation of Data Variables:
The first task in an EDA after loading a dataset is to find out what each variable in it represents. The classification of data variables was discussed back [here](https://github.com/Hassan-Farid/365-Days-of-ML/blob/main/Kinds%20of%20Data%20Variables.ipynb). Today, we will apply that investigation process to a real dataset and get a grip on it, since it is the step that tells us what type of preprocessing needs to be performed.

## Titanic Dataset:
The Titanic dataset is one of the most widely used datasets for machine learning tasks. We will perform a structural investigation of its variables and see how to tell the different kinds of variables apart. I am going to use the seaborn package to load the Titanic dataset and pandas to view it and perform data operations on it. So let's get it loaded:

In [6]:
#Loading the titanic dataset
titanic_df = sns.load_dataset('titanic')

#Viewing the first 5 records of the dataset
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Okay, so now that the dataset has been loaded, we can see that it contains many variables. To start our investigation, the first step is to generate information about these variables, i.e. the data type with which each of them is defined (since I am using Python with pandas, the data types will be specific to them).

In [7]:
#Viewing information about the variables 
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


We can see that the result contains 2 boolean, 2 category, 2 float, 4 integer and 5 object variables. For our first step, we have to classify each of these as either numeric or non-numeric. Since floats and integers represent numbers, they are numeric in nature, whereas booleans and objects are non-numeric. A category, however, depends on the kind of values stored in it and cannot be classified one way or the other without looking at the data. So let's first discover what kind of data the two category variables hold:

### Class Variable:

In [17]:
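#Viewing the type of contents of a variable on the basis of its unique values (check on numeric)
def check_numeric(var):
    #Use the column's non-missing unique values, not the characters of the column name
    values = titanic_df[var].dropna().unique()
    try:
        for x in values:
            float(x)  #Raises ValueError (or TypeError) for a non-numeric value
        print("{} is a numeric variable".format(var))
    except (ValueError, TypeError):
        print("{} is a non-numeric variable".format(var))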

In [18]:
#Applying the method for class variable to figure out its type
check_numeric('class')

class is a non-numeric variable


### Deck Variable:

In [20]:
#Applying the method for deck variable to figure out its type
check_numeric('deck')

deck is a non-numeric variable
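
The same question (are the levels of a category column numbers or labels?) can also be answered without a loop by looking at the dtype of the category levels. The snippet below is only a sketch: the two series are hypothetical stand-ins, not the real Titanic columns, and `category_is_numeric` is a helper name made up for this example.

```python
import pandas as pd

# Hypothetical stand-ins for category columns (not the real data)
ticket_class = pd.Series(["Third", "First", "Third", "Second"], dtype="category")
rating = pd.Series([1, 3, 2, 3], dtype="category")  # made-up numeric category


def category_is_numeric(series):
    """Return True if the levels of a category column are numbers."""
    # .cat.categories is the Index of distinct levels; its dtype tells us
    # whether the category stores numbers or labels
    return pd.api.types.is_numeric_dtype(series.cat.categories)


print(list(ticket_class.cat.categories))
print(category_is_numeric(ticket_class))
print(category_is_numeric(rating))
```

A category built from strings reports non-numeric levels, which matches the result obtained for class and deck above.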


Thus, the provided variables can be classified into numeric and non-numeric variables as follows:

|Numeric||Non-Numeric|
|-|-|-|
|survived, pclass, age, sibsp, parch, fare||sex, embarked, who, adult_male, embark_town, alive, alone, class, deck|
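
For larger datasets, the numeric/non-numeric split can be read off the dtypes in one step. The sketch below uses a small hand-made frame (`sample_df`, with illustrative values only) in place of the real data, and relies on pandas' `select_dtypes`, where `"number"` matches integer and float columns but not bool, object or category.

```python
import pandas as pd

# Small hand-made frame mimicking a few Titanic columns (illustrative values only)
sample_df = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [22.0, 38.0, None],
    "sex": ["male", "female", "female"],
    "adult_male": [True, False, False],
    "class": pd.Categorical(["Third", "First", "Third"]),
})

# "number" selects integer and float dtypes only
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Note that this treats every category column as non-numeric, so a category holding numbers would still need the manual check described earlier.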

Now, to find out the sub-types of the numeric and non-numeric variables, we have to investigate the records stored in each variable further.

## Numeric Variable Investigation:
Let's start with the numeric variables in the dataset. Numeric variables can be classified as either discrete or continuous, which roughly corresponds to being stored as integers or floats. We already know from `info()` that 4 are integers and 2 are floating points, but the storage type alone is not something you should base this decision on.

Why, you ask? Although integer columns are certainly discrete in nature, we cannot conclude that every floating-point column is continuous, since integer values may have been stored in floating-point format, e.g. 5 stored as 5.0. In pandas this commonly happens when an integer column contains missing values, because NaN forces the column to float64. So let's take a look at the floating-point variables in the dataset and check whether they actually hold integers.

### Age Variable:

In [21]:
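#Checking whether all the non-missing values in a variable are integers (discrete value)
def check_integer(var):
    #Drop NaN first: NaN.is_integer() is False and would hide an integer-valued column
    res = all(x.is_integer() for x in titanic_df[var].dropna())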
    if res:
        print("{} is a discrete variable".format(var))
    else:
        print("{} is a continuous variable".format(var))

In [22]:
#Applying method for age variable to check if its integer or not
check_integer('age')

age is a continuous variable


### Fare Variable:

In [23]:
#Checking whether fare is integer or float
check_integer('fare')

fare is a continuous variable


Thus, both age and fare are genuinely floating-point, i.e. continuous, variables. Hence, for the classification of the numeric variables, we have:

|Discrete||Continuous|
|-|-|-|
|survived, pclass, sibsp, parch||age, fare|
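
The integer check can also be written in vectorized form. This is a minimal sketch under the assumption that missing values should simply be ignored; `is_integer_valued` and the two sample series are made up for illustration.

```python
import pandas as pd


def is_integer_valued(series):
    """Return True if every non-missing value has no fractional part."""
    values = series.dropna()
    # A value is integer-valued when its remainder modulo 1 is zero
    return bool((values % 1 == 0).all())


# Integers that ended up stored as floats because of a missing value
stored_as_float = pd.Series([1.0, 0.0, None, 3.0])
# Genuinely fractional values
true_float = pd.Series([22.0, 0.42, 28.5])

print(stored_as_float.dtype)
print(is_integer_valued(stored_as_float))
print(is_integer_valued(true_float))
```

The first series has dtype float64 but is still discrete, which is exactly the case the dtype alone would misclassify.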

Now, let's move on to further classification of the discrete and continuous variables.

## Discrete Variable Investigation:
Let's now look at the different discrete variables we have and check whether they are qualitative or quantitative in nature. Figuring this out needs human interpretation of the feature, so it is not an automated process. In both cases the values are stored as integers; it is the meaning of the column that decides which of the two it is.

The criterion is what the variable represents. If the variable's meaning corresponds to measured or counted values, e.g. size, number of items, sales, etc., then we put it in quantitative. Otherwise, when the integers are merely codes or labels, we put it in qualitative.

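Here, sibsp and parch count the siblings/spouses and parents/children travelling with a passenger, whereas survived and pclass are integer codes for an outcome and a ticket class. Based on this, we get the following classification:

|Quantitative||Qualitative|
|-|-|-|
|sibsp, parch||survived, pclass|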
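
Since the dtype cannot tell counts from codes, one option is to record the human judgment explicitly and let it drive how each column is summarized. The following is only a sketch: `sample_df` holds illustrative values, and `variable_kind` and `summarize` are names invented here. A mean is meaningful for counts such as sibsp and parch, whereas for codes such as survived and pclass only frequencies are.

```python
import pandas as pd

# Illustrative values only, not the real data
sample_df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 2],
    "sibsp": [1, 1, 0, 0],
    "parch": [0, 0, 2, 1],
})

# Human judgment, recorded explicitly: the dtype cannot tell these apart
variable_kind = {
    "survived": "qualitative",
    "pclass": "qualitative",
    "sibsp": "quantitative",
    "parch": "quantitative",
}


def summarize(df, kinds):
    """Use the mean for counted values and frequencies for coded values."""
    summary = {}
    for column, kind in kinds.items():
        if kind == "quantitative":
            summary[column] = df[column].mean()
        else:
            summary[column] = df[column].value_counts().to_dict()
    return summary


summary = summarize(sample_df, variable_kind)
print(summary)
```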

Now, let's move on to further classifying the qualitative variables:

## Qualitative Variable Investigation:
Let's now look at the different qualitative variables and classify them as dichotomous or multichotomous. This is decided by the number of unique values present in each variable: if a variable has exactly two unique values, it is dichotomous, otherwise it is multichotomous.

### Survived Variable: 

In [30]:
#Check if the number of unique values is 2 or not
def check_dichotomous(var):
    res = titanic_df[var].nunique()
    if res == 2:
        print("{} is a dichotomous variable".format(var))
    else:
        print("{} is a multichotomous variable".format(var))

In [31]:
#Check if survival is dichotomous
check_dichotomous('survived')

survived is a dichotomous variable


### Pclass Variable:

In [32]:
#Check if pclass is dichotomous
check_dichotomous('pclass')

pclass is a multichotomous variable


Thus, for the classification of the given qualitative variables, we have:

|Dichotomous||Multichotomous|
|-|-|-|
|survived||pclass|
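
When there are many qualitative columns, the unique-value count can be computed for all of them at once. A minimal sketch on illustrative values (`sample_df` is a stand-in for the real frame):

```python
import pandas as pd

# Illustrative values only
sample_df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 2],
})

# nunique() counts distinct non-missing values per column
unique_counts = sample_df.nunique()
labels = unique_counts.map(
    lambda n: "dichotomous" if n == 2 else "multichotomous"
)

print(labels)
```

As with the function above, a column with only one distinct value would also land in the multichotomous bucket, so such constant columns are worth checking separately.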

## Multichotomous Variable Investigation:
Multichotomous variables can be classified as nominal or ordinal. Just like qualitative/quantitative, this relies on an individual's observation and cannot be decided by some computation. If the values of a multichotomous variable cannot be ranked and have no natural order, we put it as a nominal variable, otherwise we put it as an ordinal variable.

Now, for the pclass variable, if we reference the Titanic dataset, it represents the ticket class of the passenger. Ticket class reflects a difference in socio-economic status, i.e. first class carried the wealthiest passengers whereas third class (steerage) carried mostly emigrants. So we can say that pclass is an ordinal variable, as first class ranks above second class, and second class above third.

Thus, we classify pclass as an ordinal variable. In our case there is no example of a numeric multichotomous nominal variable (though a dichotomous variable is usually treated as nominal, so survived can be seen as one from that perspective).
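
In pandas, an ordinal variable can be represented as an ordered categorical, which makes the ranking explicit. The sketch below assumes the ranking third < second < first and uses illustrative values. Note that the rank order is the reverse of the numeric order of the codes, which is one more reason not to treat pclass as a quantity.

```python
import pandas as pd

# Illustrative pclass codes
pclass = pd.Series([3, 1, 3, 2])

# Assumed ranking, lowest to highest: third < second < first
ordered_pclass = pd.Series(
    pd.Categorical(pclass, categories=[3, 2, 1], ordered=True)
)

# The "maximum" is the highest-ranked class, not the largest number
print(ordered_pclass.max())
# Comparisons follow the declared ranking: which passengers rank above second class?
print((ordered_pclass > 2).tolist())
```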

## Continuous Variable Investigation:
For continuous variables, we cannot identify whether they are interval-scaled or ratio-scaled with the help of computation; we need an individual's observation to classify them. For this task, we check whether the variable has a true (absolute) zero, i.e. a zero that means the complete absence of the quantity. If it does, we classify it as ratio-scaled, otherwise as interval-scaled.

For the age variable, an age of 0 means that no time has passed since birth, regardless of which creature we are referring to or which unit (years, months, days) is used. Thus, age is a ratio-scaled variable.

For the fare variable, a fare of 0 means that nothing was paid, regardless of which person, vehicle or currency is involved. Thus, fare is a ratio-scaled variable.

In our case, there is no interval-scaled variable present. If there were a variable whose 0 in one unit does not correspond to 0 in other units of measurement (temperature is the usual example, since 0 °C is not 0 °F), we would classify it as interval-scaled.
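
A quick way to see the difference is to check whether ratios survive a change of unit. The arithmetic below is a small illustration: the temperatures and fares are made-up numbers, and `rate` is a hypothetical exchange rate.

```python
def celsius_to_fahrenheit(celsius):
    """Standard Celsius to Fahrenheit conversion."""
    return celsius * 9 / 5 + 32


# Interval scale: "twice as hot" in Celsius is not twice as hot in Fahrenheit
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)  # 68 / 50

# Ratio scale: a fare twice as large stays twice as large in any currency
rate = 1.25  # hypothetical exchange rate
fare_ratio_original = 20 / 10
fare_ratio_converted = (20 * rate) / (10 * rate)

print(ratio_c, ratio_f)
print(fare_ratio_original, fare_ratio_converted)
```

The temperature ratio changes from 2.0 to 1.36 because the zero point shifts with the unit, while the fare ratio stays at 2.0 because zero means "nothing paid" in every currency.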

## Non-Numeric Variable Investigation:
Let's take a look at the non-numeric variables now. One way to classify them is to check the number of unique values in each variable: if it is greater than the expected number of classes, we consider the variable non-categorical, otherwise categorical. Usually, though, this decision is made by an individual's observation rather than by some computation. Let's take a look at how we would classify them:

* For the sex variable, we know that it is a categorical variable since it provides a choice between male and female. Since there are only two possible choices, it is also dichotomous in nature.

* For the embarked variable, we know that it is a categorical variable as it provides a choice between three ports of embarkation. There are more than two choices, and the port of embarkation has no ranking, e.g. someone who embarks from Southampton doesn't have a higher rank than someone who embarks from Cherbourg and vice versa. Therefore, embarked is a nominal variable.

* For the who variable, we know that it is a categorical variable as it provides a choice between three values, i.e. man, woman and child. There are more than two choices, and the status of the person has no ordering, e.g. a man isn't superior to a woman and vice versa. Therefore, who is a nominal variable. (It is true that the drill is usually women and children first, but we can't say for sure that this was always the case.)

* For the adult_male variable, we know that it is a categorical variable as it provides a choice between true and false. Since there are only two possible choices, it is also dichotomous in nature.

* For the embark_town variable, we have the same condition as that of embarked, thus it is also a nominal variable.

* For the alive variable, we know that it is a categorical variable as it provides a choice between yes and no. Since there are only two possible choices, it is also dichotomous in nature.

* For the alone variable, we know that it is a categorical variable as it provides a choice between true and false. Since there are only two possible choices, it is also dichotomous in nature.

* For the class variable, we know that it is a categorical variable as it provides a choice between three different classes. There are more than two categories involved, and class is based on ranking, i.e. a first-class passenger is expected to be wealthier than a third-class passenger. Thus, it is an ordinal variable.

* For the deck variable, we know that it is a categorical variable as it provides a choice between the seven decks (A to G) of the ship. There are more than two categories involved, and one deck does not have superiority over another, so deck is a nominal variable.
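
The unique-value heuristic mentioned at the start of this section could be sketched as follows. Everything here is an assumption made for illustration: the threshold `MAX_EXPECTED_CLASSES`, the helper `looks_categorical`, and the sample frame, including its `name` column, which is not part of the seaborn dataset.

```python
import pandas as pd

# Illustrative values only; "name" is a hypothetical free-text column
sample_df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "name": ["Doe, Mr. A", "Doe, Mrs. B", "Roe, Miss. C", "Poe, Mr. D", "Low, Mr. E"],
})

# Assumed threshold: the largest number of classes we expect a category to have
MAX_EXPECTED_CLASSES = 3


def looks_categorical(series, max_classes=MAX_EXPECTED_CLASSES):
    """Flag a column as categorical when it has few distinct values."""
    return series.nunique() <= max_classes


result = {col: looks_categorical(sample_df[col]) for col in sample_df.columns}
print(result)
```

The threshold is arbitrary, which is why the result should only be treated as a hint to be confirmed by looking at the variable's meaning.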

In our case, there are no non-categorical non-numeric variables. Even if there were any, they would be decomposed into useful variables in the preprocessing phase. An example would be the complete name of each passenger, from which we could extract the surnames to generate family groupings, etc.
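
To illustrate the surname idea, here is a minimal sketch. The seaborn dataset has no name column, so the names below are invented and the "Surname, Title. Given names" layout is an assumption.

```python
import pandas as pd

# Invented names in an assumed "Surname, Title. Given names" layout
names = pd.Series([
    "Smith, Mr. John",
    "Smith, Mrs. Jane",
    "Doe, Miss. Anna",
])

# Everything before the first comma is taken as the surname
surnames = names.str.split(",").str[0].str.strip()

# Number of passengers sharing each surname, aligned with the original rows
family_size = surnames.map(surnames.value_counts())

print(surnames.tolist())
print(family_size.tolist())
```

The free-text name column is non-categorical, but the derived surname and family size are variables that can be classified with the steps above.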

## Classification Table:
Thus, after classifying all the variables in the dataset, we get the following classification table:

|Variable Name||Variable Kind|
|-|-|-|
|survived||Dichotomous Discrete Variable|
|pclass||Ordinal Discrete Variable|
|sex||Dichotomous Non-Numeric Variable|
|age||Ratio-scaled Continuous Variable|
|sibsp||Quantitative Discrete Variable|
|parch||Quantitative Discrete Variable|
|fare||Ratio-scaled Continuous Variable|
|embarked||Nominal Non-Numeric Variable|
|class||Ordinal Non-Numeric Variable|
|who||Nominal Non-Numeric Variable|
|adult_male||Dichotomous Non-Numeric Variable|
|deck||Nominal Non-Numeric Variable|
|embark_town||Nominal Non-Numeric Variable|
|alive||Dichotomous Non-Numeric Variable|
|alone||Dichotomous Non-Numeric Variable|

This completes our case study on the structural investigation of a dataset. If we follow the same approach, i.e. checking each variable step by step and matching it to its case, we can split any dataset into these different classes regardless of the number of features. This classification is an essential step in EDA as it allows you to determine what each variable represents and how it should be dealt with in the preprocessing phase.
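
The automatable parts of this walkthrough can be combined into a rough first pass, leaving the judgment calls (qualitative vs quantitative, nominal vs ordinal, interval vs ratio) to be filled in by hand. This is only a sketch under those assumptions: `first_pass` and `sample_df` are names and values made up for illustration.

```python
import pandas as pd


def first_pass(df):
    """Label each column using only the checks that can be computed."""
    rows = []
    for column in df.columns:
        series = df[column]
        values = series.dropna()
        # bool counts as numeric for pandas, so exclude it explicitly
        numeric = (pd.api.types.is_numeric_dtype(series)
                   and not pd.api.types.is_bool_dtype(series))
        if numeric:
            kind = "discrete" if (values % 1 == 0).all() else "continuous"
        else:
            kind = "dichotomous" if values.nunique() == 2 else "multichotomous"
        rows.append({"variable": column, "numeric": numeric, "first_pass": kind})
    return pd.DataFrame(rows).set_index("variable")


# Illustrative values only
sample_df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, 0.42, None, 35.0],
    "sex": ["male", "female", "female", "male"],
    "adult_male": [True, False, False, True],
    "class": pd.Categorical(["Third", "First", "Second", "Third"]),
})

report = first_pass(sample_df)
print(report)
```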

That's it for today! We will start with the Quantity Investigation Phase from tomorrow!