# ML Lab 03

## Prit Kanadiya
## 211070010

<b>Aim</b>: To implement Candidate Elimination Algorithm on Titanic Dataset.

<b>Theory</b>: 

A concept is a well-defined collection of objects. For example, the concept “a bird” encompasses all the animals that are birds and includes no animal that isn’t a bird.

Each concept has a definition that fully describes all the concept’s members and applies to no objects that belong to other concepts. Therefore, we can say that a concept is a boolean function defined over a set of all possible objects and that it returns true only if a given object is a member of the concept. Otherwise, it returns false.

In concept learning, we have a dataset of objects labeled as either positive or negative. The positive ones are members of the target concept, and the negative ones aren’t. Our goal is to formulate a proper concept function using the data, and the Candidate Elimination Algorithm (CEA) is a technique for doing precisely so.

We call the hypothesis space H the set of all candidate hypotheses that the chosen representation can express. 

There may be multiple hypotheses that fully capture the positive objects in our data. But, we’re interested in those also consistent with the negative ones. All the functions consistent with positive and negative objects (which means they classify them correctly) constitute the version space of the target concept. The Candidate Elimination Algorithm (CEA) relies on a partial ordering of hypotheses to find the version space.

Imagine we want to learn to identify birds from a group of animals. We can start by looking at different birds and try to understand the features which make them different from other animals. We might look at a sparrow and derive that if the animal has wings, feathers, brown colour, a short tail etc. then it is a bird. However by looking at other birds, we might realize that colour has nothing to do with a bird, and we can come to generalized idea that if it has feathers or wings, it is a bird.

Here, bird becomes a Concept, which we want to learn. Concept Learning was the idea a boolean function defined over a set of all possible animal whicht returns true only if a given object is a member of a bird. The problem of inducing general functions from specific training examples is central to concept learning.

Formally, in the context of machine learning, we want to learn the function ```f(x)``` or hypothesis ```h```  within a hypothesis space ```H```, which can provide the most accurate approximation of the concept.

Algorithm:

<img src="https://www.baeldung.com/wp-content/uploads/sites/4/2021/09/flowchart-cea.jpg" alt="flowchart">

In [2]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Reading the dataset using read_csv
titanic = pd.read_csv("../assets/data/titanic.csv")

In [4]:
# Printing first 5 instances of the dataset
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
# Printing last 5 instances of the dataset
titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [6]:
# Chceking for null values or unclean instances among the dataset
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [7]:
# Since we observe null values, we fill first clean our dataset. This is very important step since it allows us to perform analytics on dataset after cleaning it.

# For all null values in age column, we replace them with the mean
age = titanic['Age'].fillna(titanic['Age'].mean())
titanic["Age"] = age

# For all null values in cabin and embarked column, we replace them with the mode
cabin = titanic['Cabin'].fillna(titanic['Cabin'].mode()[0])
titanic["Cabin"] = cabin
embarked = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])
titanic["Embarked"] = embarked


In [8]:
# Confirming that dataset has no Null values
titanic.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [9]:
# We remove all the fields which will not very relevant to the concept
titanic.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Fare'], inplace=True)

In [10]:
# Our concept it the people who survived the titanic disaster
# Our initial hypotheses is the most specific one
# Concept Learning in Machine Learning can be thought of as a boolean-valued function defined over a large set of training data. 

# We drop Survived column because that is our concept
Survived = titanic["Survived"]
titanic = titanic.drop(["Survived"], axis = 1)

In [11]:
titanic.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch
count,891.0,891.0,891.0,891.0
mean,2.308642,29.699118,0.523008,0.381594
std,0.836071,13.002015,1.102743,0.806057
min,1.0,0.42,0.0,0.0
25%,2.0,22.0,0.0,0.0
50%,3.0,29.699118,0.0,0.0
75%,3.0,35.0,1.0,0.0
max,3.0,80.0,8.0,6.0


In [12]:
# Fields such as SibSp and Parch represent number of siblings / spouses aboard, and number of parents / children aboard, respectively. These can be represented as a boolean variable as isSibSp and isParch.

titanic["isSibSp"] = titanic["SibSp"].apply(lambda x: 0 if x == 0 else 1)
titanic["isParch"] = titanic["Parch"].apply(lambda x: 0 if x == 0 else 1) 
titanic.drop(["SibSp", "Parch"], axis=1, inplace=True)

In [13]:
#Age can be divided into 3 groups, 0-20, 20-40, 40 and above. 
titanic["Age"] = titanic["Age"].apply(lambda x: 0 if x <= 20 else 1 if 20 < x <= 40 else 2)

In [15]:
uv_list = []

for col in titanic.columns:
  unique_values = titanic[col].unique()
  uv_list.append(unique_values)
  print(f"Unique values in column '{col}': {unique_values}")

Unique values in column 'Pclass': [3 1 2]
Unique values in column 'Sex': ['male' 'female']
Unique values in column 'Age': [1 2 0]
Unique values in column 'Embarked': ['S' 'C' 'Q']
Unique values in column 'isSibSp': [1 0]
Unique values in column 'isParch': [0 1]


In [21]:
# Initializing our hypothesis list G and S. It is important to note that G and S are list of lists
S = [[None] * (titanic.shape[1])]
G = [["?"] * (titanic.shape[1])]
print("The most specific hypotesis is: ", S)
print("The most general hypothesis is: ", G)

The most specific hypotesis is:  [[None, None, None, None, None, None]]
The most general hypothesis is:  [['?', '?', '?', '?', '?', '?']]


In [24]:
def test_hypothesis(hypothesis, test_point, label):
    # For negative example
    if label == 0:
        if None in hypothesis:
            # This checks if there is "None" present in hypothesis
            return True
        
        for i in range(len(hypothesis)):
            if hypothesis[i] == "?":
                continue
            elif hypothesis[i] != test_point[i]:
                # For any mismatch, since the label is negative, return True
                return True
            
        # Here, it clears the hypothesis, but it is negative, so return False
        return False

    else:
        if None in hypothesis:
            # This checks if there is "None" present in hypothesis
            return False
        
        for i in range(len(hypothesis)):
            if hypothesis[i] == "?":
                continue
            elif hypothesis[i] != test_point[i]:
                # For any mismatch, since the label is positive, return False
                return False

        # Here, it clears the hypothesis, and it is positive, so return True   
        return True

In [30]:
# Iterating through the dataset and modifying hypothesis accordingly.
# Candidate elimination algorithm goes through both positives and negatives.
# For positive instances, we go from specific to general hypothesis.
# For negative instances, we go from general to specific hypothesis.

for i, row in titanic.iterrows():

    print(i)

    # Positive example
    if (Survived[i] == 1):
        # Remove from G any hypothesis inconsistent with the instance
        g_list = list(G)
        for g in g_list:
            if not test_hypothesis(g, row, 1):
                G.remove(g)

        # Replace each s from S that is inconsistent with the instance with its minimal specialization consistent with the instance and less general from a hypothesis in G
        temp = []
        for s in S:
            if not test_hypothesis(s, row, 1):
                temp.append(s)

        for t in temp:
            S.remove(t)
        
        for t in temp:
            for j in range(len(t)):
                if t[j] == "?":
                    continue
                else:
                    t_copy = t[:j] + ["?"] + t[j+1:]
                    if test_hypothesis(t_copy, row, 1) and t_copy not in S and t_copy not in temp:
                        S.append(t_copy)
                        temp.append(t_copy)

        # Remove from S any hypothesis more general than other hypothesis in S
        specificity = 0
        for s in S:
            temp = 0
            for idx in range(0, len(s)):
                if s[idx] != "?":
                    temp += 1
            if (temp > specificity):
                specificity = temp
                
        s_list = list(S)
        for s in s_list:
            temp = 0
            for idx in range(0, len(s)):
                if s[idx] != "?":
                    temp += 1
            if (temp < specificity):
                # This implies that s is more general or less specific than other hypothesis in S
                S.remove(s)

    # Negative example
    else:
        # Remove from S any hypothesis inconsistent with the instance
        s_list = list(S)
        for s in s_list:
            if not test_hypothesis(s, row, 0):
                S.remove(s)

        # Replace each g from G that is inconsistent with the instance with its minimal specialization consistent with the instance and more general from a hypothesis in s
        temp = []
        for g in G:
            if not test_hypothesis(g, row, 0):
                temp.append(g)

        for t in temp:
            G.remove(t)
        
        # Add hypothesis to G referring to temp
        for t in temp:
            for j in range(len(t)):
                if t[j] != "?":
                    continue
                else:
                    for uv in uv_list[j]:
                        t_copy = t[:j] + [uv] + t[j+1:]
                        if test_hypothesis(t_copy, row, 0) and t_copy not in G and t_copy not in temp:
                            G.append(t_copy)
                            temp.append(t_copy)

        for t in temp:
            for j in range(len(t)):
                if t[j] == "?":
                    for s in S:
                        if s[j] != "?" and s[j] != row[j]:
                            new_hypothesis = t[:j] + [row[j]] + t[j+1:]
                            if new_hypothesis not in G:
                                G.append(new_hypothesis)

        # Remove from G any hypothesis less general than other hypothesis in G
        specificity = len(g)
        for g in G:
            temp = 0
            for idx in range(0, len(g)):
                if g[idx] != "?":
                    temp += 1
            if (temp < specificity):
                specificity = temp
        
        g_list = list(G)
        for g in g_list:
            temp = 0
            for idx in range(0, len(g)):
                if s[idx] != "?":
                    temp += 1
            if (temp > specificity):
                # This implies that g is more specific or less general than other hypothesis in G
                G.remove(g)

        print("G: ", G)
        print("S: ", S)


0
Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "C:\Users\pritp\AppData\Roaming\Python\Python311\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "C:\Users\pritp\AppData\Local\Temp\ipykernel_16364\4278751945.py", line -1, in <module>
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\pritp\AppData\Roaming\Python\Python311\site-packages\IPython\core\interactiveshell.py", line 2105, in showtraceback
    stb = self.InteractiveTB.structured_traceback(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pritp\AppData\Roaming\Python\Python311\site-packages\IPython\core\ultratb.py", line 1396, in structured_traceback
    return FormattedTB.structured_traceback(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pritp\AppData\Roaming\Python\Python311\site-packages\IPython\core\ultratb.py", line 1287, in

In [20]:
# The hypothesis collapses due to no specific hypothesis
print("The most specific hypotesis are: ", S)
print("The most general hypothesis are: ", G)

The most specific hypotesis are:  []
The most general hypothesis are:  []


<b>Conclusion</b>: Hence understood and implemented Candidate Elimination Algorithm using titanic dataset.