# <div align="center">CP322-A Mini-Project 1: Machine Learning</div>
## <div align="center">Group 6</div>
### <div align="center">due on 15-Oct-2023 at 11:30 PM</div>

Imports:

In [None]:
import numpy as np

## Task 1: Acquire, preprocess, and analyze the data

1. Load the datasets into NumPy objects (i.e., arrays or matrices) in Python. Remember to convert the wine dataset
to a binary task, as discussed above.
2. Clean the data. Are there any missing or malformed features? Are there other data oddities that need to be
dealt with? You should remove any examples with missing or malformed features and note this in your
report. For categorical variables, you can use a one-hot encoding.
3. Compute basic statistics on the data to understand it better. E.g., what are the distributions of the positive vs.
negative classes? What are the distributions of some of the numerical features? What are the correlations between
the features? What do the scatter plots of pairwise features look like for some subset of features?
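
The cleaning and encoding in step 2 can be sketched as follows. This is a minimal illustration, assuming each example is a list of strings and that missing values are marked with `?` (as in the Adult and Mushroom files). The helper names `drop_missing` and `one_hot` are hypothetical and not part of the notebook.

```python
import numpy as np

def drop_missing(rows, missing="?"):
    """Keep only rows in which no field equals the missing-value marker."""
    return [row for row in rows if all(val.strip() != missing for val in row)]

def one_hot(column):
    """One-hot encode a list of category strings; returns (matrix, categories)."""
    categories = sorted(set(column))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(column), len(categories)))
    for r, value in enumerate(column):
        encoded[r, index[value]] = 1.0
    return encoded, categories

# Toy rows (made up for illustration): the second row has a missing field
rows = [["x", "s"], ["b", "?"], ["f", "y"], ["x", "y"]]
clean = drop_missing(rows)
encoded, categories = one_hot([row[0] for row in clean])
```

Unlike integer codes, a one-hot encoding does not impose an artificial ordering on the categories, which matters for distance-based models such as KNN.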

In [None]:
def readFile(filename):
    data = []
    labels = []
    # Compare against the bare file name so that paths such as "data/adult.data" also match
    name = filename.replace("\\", "/").split("/")[-1]
    with open(filename, "r") as file:
        for line in file:
            line = line.strip()
            # Skip empty lines and ARFF header/comment lines (starting with "@" or "%"), if present
            if line and not line.startswith(("@", "%")):
                row = line.split(",")
                if name == "adult.data":
                    # Keep only the numerical columns and convert them to float
                    age = float(row[0])
                    fnlwgt = float(row[2])
                    education_num = float(row[4])
                    capital_gain = float(row[10])
                    capital_loss = float(row[11])
                    hours_per_week = float(row[12])
                    # Combine the numerical features
                    numerical_features = [age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week]
                    # Append the numerical features
                    data.append(numerical_features)
                    label = row[-1]
                    # Map the labels to binary values: ' <=50K' to 0 and ' >50K' to 1
                    labels.append(0 if label == ' <=50K' else 1)
                elif name == "Rice_Cammeo_Osmancik.arff.txt":
                    data.append([float(val) for val in row[:-1]])
                    label = row[-1]
                    # Cammeo is the negative class (0), Osmancik the positive class (1)
                    labels.append(0 if label == 'Cammeo' else 1)
                elif name == "agaricus-lepiota.data":
                    label = 0 if row[0] == 'e' else 1
                    labels.append(label)

                    # Define mappings for categorical values
                    cap_shape_mapping = {'b': 0, 'c': 1, 'x': 2, 'f': 3, 'k': 4, 's': 5}
                    cap_surface_mapping = {'f': 0, 'g': 1, 'y': 2, 's': 3}
                    cap_color_mapping = {'n': 0, 'b': 1, 'c': 2, 'g': 3, 'r': 4, 'p': 5, 'u': 6, 'e': 7, 'w': 8, 'y': 9}
                    bruises_mapping = {'t': 0, 'f': 1}
                    odor_mapping = {'a': 0, 'l': 1, 'c': 2, 'y': 3, 'f': 4, 'm': 5, 'n': 6, 'p': 7, 's': 8}
                    gill_attachment_mapping = {'a': 0, 'd': 1, 'f': 2, 'n': 3}
                    gill_spacing_mapping = {'c': 0, 'w': 1, 'd': 2}
                    gill_size_mapping = {'b': 0, 'n': 1}
                    gill_color_mapping = {'k': 0, 'n': 1, 'b': 2, 'h': 3, 'g': 4, 'r': 5, 'o': 6, 'p': 7, 'u': 8, 'e': 9, 'w': 10, 'y': 11}
                    stalk_shape_mapping = {'e': 0, 't': 1}
                    stalk_root_mapping = {'b': 0, 'c': 1, 'u': 2, 'e': 3, 'z': 4, 'r': 5, '?': 6}
                    stalk_surface_above_ring_mapping = {'f': 0, 'y': 1, 'k': 2, 's': 3}
                    stalk_surface_below_ring_mapping = {'f': 0, 'y': 1, 'k': 2, 's': 3}
                    stalk_color_above_ring_mapping = {'n': 0, 'b': 1, 'c': 2, 'g': 3, 'o': 4, 'p': 5, 'e': 6, 'w': 7, 'y': 8}
                    stalk_color_below_ring_mapping = {'n': 0, 'b': 1, 'c': 2, 'g': 3, 'o': 4, 'p': 5, 'e': 6, 'w': 7, 'y': 8}
                    veil_type_mapping = {'p': 0, 'u': 1}
                    veil_color_mapping = {'n': 0, 'o': 1, 'w': 2, 'y': 3}
                    ring_number_mapping = {'n': 0, 'o': 1, 't': 2}
                    ring_type_mapping = {'c': 0, 'e': 1, 'f': 2, 'l': 3, 'n': 4, 'p': 5, 's': 6, 'z': 7}
                    spore_print_color_mapping = {'k': 0, 'n': 1, 'b': 2, 'h': 3, 'r': 4, 'o': 5, 'u': 6, 'w': 7, 'y': 8}
                    population_mapping = {'a': 0, 'c': 1, 'n': 2, 's': 3, 'v': 4, 'y': 5}
                    habitat_mapping = {'g': 0, 'l': 1, 'm': 2, 'p': 3, 'u': 4, 'w': 5, 'd': 6}

                    # Encode the categorical features as integer codes using the mappings
                    encoded_features = [
                        cap_shape_mapping[row[1]],
                        cap_surface_mapping[row[2]],
                        cap_color_mapping[row[3]],
                        bruises_mapping[row[4]],
                        odor_mapping[row[5]],
                        gill_attachment_mapping[row[6]],
                        gill_spacing_mapping[row[7]],
                        gill_size_mapping[row[8]],
                        gill_color_mapping[row[9]],
                        stalk_shape_mapping[row[10]],
                        stalk_root_mapping[row[11]],
                        stalk_surface_above_ring_mapping[row[12]],
                        stalk_surface_below_ring_mapping[row[13]],
                        stalk_color_above_ring_mapping[row[14]],
                        stalk_color_below_ring_mapping[row[15]],
                        veil_type_mapping[row[16]],
                        veil_color_mapping[row[17]],
                        ring_number_mapping[row[18]],
                        ring_type_mapping[row[19]],
                        spore_print_color_mapping[row[20]],
                        population_mapping[row[21]],
                        habitat_mapping[row[22]]
                    ]
                    data.append(encoded_features) 
                        
                else:
                    data.append([float(val) for val in row[:-1]])
                    label = row[-1]
                    labels.append(0 if label == 'b' else 1)


    return data, labels


### Dataset 1 (Ionosphere): 

In [None]:
filename = "data/ionosphere.data"

data,labels = readFile(filename)

# Count the number of positive class instances
positive_count = sum(1 for label in labels if label == 1)

# Count the number of negative class instances
negative_count = sum(1 for label in labels if label == 0)

#what are the distributions of the positive vs. negative classes?
print("Distribution of classes:")
print("Positive (g):", positive_count)
print("Negative (b):", negative_count)

print("\nData:")
print(data,labels)

#what are the distributions of some of the numerical features?

### Dataset 2 (Adult Data Set):

In [None]:
filename = "data/adult.data"

data, labels = readFile(filename)

# Count the number of positive class instances
positive_count = sum(1 for label in labels if label == 1)

# Count the number of negative class instances
negative_count = sum(1 for label in labels if label == 0)

print("Distribution of classes:")
print("Positive (>50K):", positive_count)
print("Negative (<=50K):", negative_count)

print("\nData:")
print(data, labels)

#what are the distributions of some of the numerical features?

### Dataset 3 (Rice):

In [None]:
filename = "Rice_Cammeo_Osmancik.arff.txt"

data, labels = readFile(filename)

# Count the number of positive class instances
positive_count = sum(1 for label in labels if label == 1)

# Count the number of negative class instances
negative_count = sum(1 for label in labels if label == 0)

print("Distribution of classes:")
print("Positive (Osmancik):", positive_count)
print("Negative (Cammeo):", negative_count)

print("\nData:")
print(data, labels)

#what are the distributions of some of the numerical features?

### Dataset 4 (Mushroom):

In [None]:
filename = "agaricus-lepiota.data"
data, labels = readFile(filename)

# Count the number of positive class instances
positive_count = sum(1 for label in labels if label == 1)

# Count the number of negative class instances
negative_count = sum(1 for label in labels if label == 0)

print("Distribution of classes:")
print("Positive (Poisonous):", positive_count)
print("Negative (Edible):", negative_count)

print("\nData:")
print(data, labels)

#what are the distributions of some of the numerical features?
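
The questions left open in the cells above (feature distributions and correlations) could be approached with a few NumPy calls. The sketch below assumes `data` and `labels` have the list-of-lists and list-of-ints shapes returned by `readFile`. The function name `summarize` and the toy values are hypothetical.

```python
import numpy as np

def summarize(data, labels):
    """Return class counts, per-feature mean/std, and the feature correlation matrix."""
    X = np.asarray(data, dtype=float)
    y = np.asarray(labels)
    counts = {int(c): int(n) for c, n in zip(*np.unique(y, return_counts=True))}
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    # rowvar=False treats each column as one variable (feature)
    corr = np.corrcoef(X, rowvar=False)
    return counts, means, stds, corr

# Toy values, made up for illustration only
X_toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.5], [4.0, 8.0]]
y_toy = [0, 1, 1, 0]
counts, means, stds, corr = summarize(X_toy, y_toy)
```

Note that a constant feature column has zero variance, so its row and column in the correlation matrix come out as NaN. Such columns are worth checking for before interpreting the result.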

## Task 2: Implement the models

#### 1. Implement logistic regression, and use (full batch) gradient descent for optimization.
#### 2. Implement k-Nearest Neighbor (KNN), and find the best K.

Implement both models as Python classes. You should use the constructor for the class to initialize the model
parameters as attributes, as well as to define other important properties of the model.

- Each of your models' classes should have (at least) two functions:
  - Define a fit function, which takes the training data (i.e., x and y), as well as other hyperparameters (e.g., the learning rate and/or number of gradient descent iterations), as input. This function should train your model by modifying the model parameters.
  - Define a predict function, which takes a set of input points (i.e., x) as input and outputs predictions (i.e., ŷ) for these points. Note that for logistic regression you need to convert probabilities to binary 0-1 predictions by thresholding the output at 0.5!
- In addition to the model classes, you should also define a function evaluate_acc to evaluate the model accuracy. This function should take the true labels (i.e., y) and the predicted labels (i.e., ŷ) as input, and it should output the accuracy score.
- Lastly, you should implement a script to run k-fold cross-validation.
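
As a rough illustration of the last two requirements, the sketch below shows a standalone `evaluate_acc` and one way to produce the train/test index sets for k-fold cross-validation. The generator name `k_fold_indices` and its `seed` parameter are assumptions made for this example, and it expects `k >= 2`.

```python
import numpy as np

def evaluate_acc(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def k_fold_indices(num_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is used for testing exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)   # shuffle so a sorted file does not bias the folds
    folds = np.array_split(order, k)       # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
```

Working with index arrays rather than slices keeps the features and labels aligned: the same `train_idx` can index both `np.array(data)` and `np.array(labels)`.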

### Logistic Regression:
Melissa
Grant
Yvonne

In [None]:
class LogisticRegression:
    def __init__(self):
        self.learning_rate = 0.01
        self.num_iterations = 1000
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        # Sigmoid function to convert values to probabilities between 0 and 1
        return 1 / (1 + np.exp(-z)) #sigmoid(z) = 1 / ( 1 + e( - z ) )

    def fit(self, data, labels): #training the logistic regression model
        num_samples, num_features = data.shape
        self.weights = np.zeros(num_features)
        self.bias = 0
        converge = 0.0001  # cost-change threshold; also keeps log() away from 0
        converged = False
        prev_cost = 1
        count = 0
        # Full-batch gradient descent, stopped early once the cost stops changing
        while not converged and count < self.num_iterations:
            #Hypothesis Function
            linear_model = np.dot(data, self.weights) + self.bias
            predictions = self.sigmoid(linear_model)

            # Cross-entropy cost of the current parameters
            cost = -1/num_samples * (np.dot(1 - labels, np.log(1 - predictions + converge)) + np.dot(labels, np.log(predictions + converge)))

            # Compute gradients
            #∂J/∂w = (1/m) * Σ[(h(x) - y) * x] , ∂J/∂b = (1/m) * Σ(h(x) - y)
            dw = (1/num_samples) * np.dot((predictions - labels), data)
            db = (1/num_samples) * np.sum(predictions - labels)

            # Update the parameters in the opposite direction of the gradient
            #w := w - α * ∂J/∂w  ,  b := b - α * ∂J/∂b
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

            # Stop when the cost changed by no more than the threshold
            if abs(prev_cost - cost) <= converge:
                converged = True
            prev_cost = cost
            count += 1
        
        
            
    def predict(self, data):
        #Hypothesis Function
        linear_model = np.dot(data, self.weights) + self.bias
        predictions = self.sigmoid(linear_model)
        # Threshold the probabilities at 0.5 to get binary 0-1 predictions
        return (predictions >= 0.5).astype(int)

    def evaluate_acc(self, label_true, label_pred):
        correct = np.sum(label_true == label_pred)
        total = len(label_true)
        return correct / total

filename = "adult.data"

data, labels = readFile(filename)

# Combine features and labels
data_with_labels = list(zip(data, labels))

# Split data into training and testing sets (70% training, 30% testing)
split_ratio = 0.7
split_index = int(len(data_with_labels) * split_ratio)

train_data, train_labels = zip(*data_with_labels[:split_index])
test_data, test_labels = zip(*data_with_labels[split_index:])

model = LogisticRegression()

# Fit the model to the training data
model.fit(np.array(train_data), np.array(train_labels))

# Make predictions on the test data
labels_pred = model.predict(np.array(test_data))

# Evaluate the model's accuracy
accuracy = model.evaluate_acc(np.array(test_labels), labels_pred)
print(f"Accuracy: {accuracy:.2f}")



def k_fold (data, labels, k):
    
    accuracies = []
    index_length = len(data)//k
    counter = 0
    model = LogisticRegression()
    
    for i in range(k):
        start = counter
        # The last fold also takes the samples left over by the integer division
        end = len(data) if i == k-1 else counter + index_length

        data_testing_set = data[start:end]
        label_testing_set = labels[start:end]
        # Train on every sample outside the test fold (no sample is skipped)
        data_training_set = list(data[:start]) + list(data[end:])
        label_training_set = list(labels[:start]) + list(labels[end:])
        counter += index_length

        model.fit(np.array(data_training_set),np.array(label_training_set))
        labels_pred = model.predict(np.array(data_testing_set))
        accuracy = model.evaluate_acc(np.array(label_testing_set),np.array(labels_pred))
        accuracies.append(accuracy)
    
    return accuracies


kfold = k_fold(data,labels, k = 4)
print("kfold",kfold)

### K-Nearest Neighbor (KNN):
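
A minimal sketch of what a KNN class with the required `fit` and `predict` functions might look like, assuming binary 0/1 labels, Euclidean distance, and an unweighted majority vote. It is an illustration under those assumptions, not the group's implementation; the toy points are made up.

```python
import numpy as np

class KNN:
    def __init__(self, k=5):
        self.k = k
        self.train_data = None
        self.train_labels = None

    def fit(self, data, labels):
        # KNN is a lazy learner: "training" only stores the data
        self.train_data = np.asarray(data, dtype=float)
        self.train_labels = np.asarray(labels, dtype=int)

    def predict(self, data):
        data = np.asarray(data, dtype=float)
        predictions = []
        for x in data:
            # Euclidean distance from x to every training point
            distances = np.sqrt(np.sum((self.train_data - x) ** 2, axis=1))
            nearest = np.argsort(distances)[:self.k]
            # Count the votes for class 0 and class 1 among the k nearest neighbours
            votes = np.bincount(self.train_labels[nearest], minlength=2)
            predictions.append(int(np.argmax(votes)))  # ties go to class 0
        return np.array(predictions)

# Toy points, made up for illustration only
train_X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
train_y = [0, 0, 1, 1, 1]
knn = KNN(k=3)
knn.fit(train_X, train_y)
preds = knn.predict([[0.05, 0.1], [1.0, 0.9]])
```

Because the vote depends on raw distances, features on very different scales (for example `fnlwgt` versus `age` in the Adult data) would likely need standardizing first. An odd `k` avoids ties in a binary task.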

## Task 3: Run Experiments

The goal of this project is to have you explore linear classification and compare different features and models. Use
5-fold cross-validation to estimate performance in all of the experiments. Evaluate the performance using accuracy.
You are welcome to perform any experiments and analyses you see fit (e.g., to compare different features), but at a
minimum, you must complete the following experiments in the order stated below:

#### 1. Compare the accuracy of k-nearest neighbor and logistic regression on the four datasets.

#### 2. Test different k values for the k-nearest neighbor to find the best k-value by showing the accuracy plot. 

#### 3. Test different learning rates for gradient descent applied to logistic regression. Use a threshold for change in the value of the cost function as termination criteria and plot the accuracy on the train/validation set as a function of iterations of gradient descent.
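
One possible shape for this experiment is sketched below: full-batch gradient descent that stops when the cost changes by less than a threshold and records train/validation accuracy at every iteration. The function names, the `tol` and `eps` values, and the synthetic data are all assumptions made for illustration; the real experiment would use the four datasets.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def accuracy(X, y, w, b):
    return float(np.mean((sigmoid(X @ w + b) >= 0.5).astype(int) == y))

def train_with_history(X_tr, y_tr, X_val, y_val, lr, max_iters=1000, tol=1e-4):
    """Full-batch gradient descent; stops when the cost changes by less than tol."""
    w, b = np.zeros(X_tr.shape[1]), 0.0
    eps = 1e-12                      # keeps log() finite
    prev_cost = np.inf
    history = []                     # (train accuracy, validation accuracy) per iteration
    for _ in range(max_iters):
        p = sigmoid(X_tr @ w + b)
        cost = -np.mean(y_tr * np.log(p + eps) + (1 - y_tr) * np.log(1 - p + eps))
        w -= lr * (X_tr.T @ (p - y_tr)) / len(y_tr)
        b -= lr * np.mean(p - y_tr)
        history.append((accuracy(X_tr, y_tr, w, b), accuracy(X_val, y_val, w, b)))
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
    return history

# Synthetic, linearly separable toy data (made up for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

histories = {lr: train_with_history(X_tr, y_tr, X_val, y_val, lr)
             for lr in (0.001, 0.01, 0.1, 1.0)}
```

Each entry of `histories` is a list of accuracy pairs indexed by iteration, so it can be passed directly to a plotting library to draw one train curve and one validation curve per learning rate.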

#### 4. Compare the accuracy of the two models as a function of the size of the dataset (by controlling the training size)
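
A sketch of how the training-size comparison could be organised, assuming any model object that exposes `fit` and `predict`. The helper `accuracy_vs_train_size` and the `MajorityClass` stand-in model are hypothetical and exist only to keep the example self-contained; the toy arrays are made up.

```python
import numpy as np

class MajorityClass:
    """Stand-in model (hypothetical): always predicts the most common training label."""
    def fit(self, data, labels):
        self.label = int(np.bincount(np.asarray(labels, dtype=int)).argmax())

    def predict(self, data):
        return np.full(len(data), self.label)

def accuracy_vs_train_size(make_model, X_train, y_train, X_test, y_test, fractions):
    """Train on growing prefixes of the training set and score on a fixed test set."""
    results = []
    for frac in fractions:
        n = max(1, int(len(X_train) * frac))   # number of training samples to use
        model = make_model()                   # fresh model for every training size
        model.fit(X_train[:n], y_train[:n])
        acc = float(np.mean(np.asarray(model.predict(X_test)) == y_test))
        results.append((n, acc))
    return results

# Toy arrays, made up for illustration only
X_train = np.arange(20, dtype=float).reshape(10, 2)
y_train = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1, 1])
X_test = np.zeros((4, 2))
y_test = np.array([1, 1, 0, 1])
results = accuracy_vs_train_size(MajorityClass, X_train, y_train, X_test, y_test,
                                 [0.1, 0.5, 1.0])
```

Keeping the test set fixed while only the training prefix grows means the accuracies are comparable across sizes. The `LogisticRegression` class defined earlier could be passed as `make_model` in place of the stand-in.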

Note: The above experiments are the minimum requirements that you must complete; however, this project is open-ended. For example, you might investigate different stopping criteria for gradient descent in logistic regression and develop an automated approach to select a good subset of features. You do not need to do all of these things, but you should demonstrate creativity, rigor, and an understanding of the course material in how you run your chosen experiments and how you report on them in your write-up.