This notebook walks through cleaning the data according to the given instructions and then developing a KNN model to predict a person's gender.




### 1. Data cleaning

We will clean the data according to the following instructions:
1. If there are missing values, impute them and explain your choices.
2. Ensure that all heights are in inches.
3. Ensure that age is in months.
4. Ensure that the weight is in pounds.
5. Ensure that the salary is in USD.
6. Ensure that the prior work experience is in months.
7. If the data is not correct, rectify it and explain your choices. Why do you think that a given record or input feature value is not correct?
8. Transpose the dataset so that each record is laid out horizontally and has the following format: Height, Weight, Age(Months), Experience(Months), Salary(USD), Gender(M/F)
9. Create only one column for gender with values M and F to indicate male or female.
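
The unit requirements above can be checked value by value. For instance, the height row in the data preview contains a value of 175, which looks like centimetres rather than inches. The following is a minimal sketch of such a check; the helper names and the 100-inch threshold are assumptions made for illustration and are not part of the assignment.

```python
# Hypothetical helpers for the unit checks; the names and the threshold are assumptions.
CM_PER_INCH = 2.54

def height_to_inches(value, cm_threshold=100.0):
    """Return the height in inches, treating values above the threshold as centimetres."""
    return value / CM_PER_INCH if value > cm_threshold else value

def years_to_months(years):
    """Convert an age or experience given in years to months."""
    return years * 12

print(round(height_to_inches(175), 1))  # 175 cm is about 68.9 in
print(height_to_inches(70))             # already in inches, returned unchanged
print(years_to_months(25))              # 300
```

A fixed threshold is a blunt rule, so any value it flags is worth inspecting by hand before it is overwritten.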


In [273]:
# Import required libraries
import pandas as pd # library used for analysis, cleaning, and manipulation of data
import numpy as np # library for working with numerical data
from fancyimpute import IterativeImputer # fancyimpute is a Python library offering several imputation algorithms
from sklearn.model_selection import train_test_split # splits a dataset into training and test sets
from sklearn.preprocessing import MinMaxScaler # rescales each feature to a given range
from sklearn.neighbors import KNeighborsClassifier # classifier implementing the k-nearest neighbors vote
from sklearn.metrics import accuracy_score,confusion_matrix # compute accuracy and confusion matrix



- The original dataset is an Excel file, so let's read it with the read_excel method provided by pandas.

In [274]:
#read excel dataset and then convert to csv
data_xlxs=pd.read_excel('Data1P_05_105_06_106.xlsx')
data_xlxs.to_csv('data.csv')

- Now we have a CSV file (the most conventional data format used with pandas).

In [275]:
#read the csv file
df=pd.read_csv('data.csv')
df.head(10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Select any number and color the entire column,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,...,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77,Unnamed: 78,Unnamed: 79,Unnamed: 80
0,0,,What is your height in inches?,,,64.0,70.0,70,69,61.0,...,73,70,175,69,70,64.0,70.0,67.2,72.0,69.0
1,1,,What is your weight in pounds?,,,124.0,211.6,200,124,134.5,...,165,198,128,160,160,121.25,158.73,125.66,160.937,156.0
2,2,,What is your age in # of months?,,,300.5,279.0,303,285,336.0,...,282,280,310,301,288,269.0,312.0,281.0,276.0,286.0
3,3,,How many months of paid experience did you hav...,,,0.0,0.0,24,18,60.0,...,0,24,36,0,16,3.0,0.0,8.0,0.0,13.0
4,4,,What is your gender,Male (1/0),,,1.0,1,1,0.0,...,1,1,1,1,1,0.0,1.0,0.0,1.0,
5,5,,,Female (1/0),,1.0,0.0,0,0,1.0,...,0,0,0,0,0,1.0,0.0,1.0,0.0,
6,6,,How many months of unpaid experience did you h...,,,6.0,0.0,0,0,0.0,...,0,0,3,0,0,0.0,0.0,0.0,0.0,


Notes:
- We can observe that the records are laid out as columns and the features as rows.
- The shape of the data is 7 by 82.

1. Let's start by transposing the dataset so that each record is laid out horizontally.

In [276]:
transposed_df=df.T.reset_index()
transposed_df.head(10)

Unnamed: 0,index,0,1,2,3,4,5,6
0,Unnamed: 0.1,0,1,2,3,4,5,6
1,Unnamed: 0,,,,,,,
2,Select any number and color the entire column,What is your height in inches?,What is your weight in pounds?,What is your age in # of months?,How many months of paid experience did you hav...,What is your gender,,How many months of unpaid experience did you h...
3,Unnamed: 2,,,,,Male (1/0),Female (1/0),
4,Unnamed: 3,,,,,,,
5,Unnamed: 4,64.0,124.0,300.5,0.0,,1.0,6.0
6,Unnamed: 5,70.0,211.6,279.0,0.0,1.0,0.0,0.0
7,Unnamed: 6,70,200,303,24,1,0,0
8,Unnamed: 7,69,124,285,18,1,0,0
9,Unnamed: 8,61.0,134.5,336.0,60.0,0.0,1.0,0.0


Note:
- After transposing, we can see that there are unnecessary rows and columns.

- Let's delete the Unnamed: 0.1 and Unnamed: 0 rows since they hold no relevant data values.
- The "Select any number and color the entire column" column will also be deleted, as its values carry no meaning.

In [277]:
#delete unnecessary rows and columns
transposed_df.drop([0,1],axis=0,inplace=True) # drop the first and second rows

transposed_df.reset_index(drop=True,inplace=True) # we reset the index to start again from index 0
transposed_df.columns = transposed_df.iloc[0] # define the columns of your dataframe

transposed_df.reset_index(drop=True,inplace=True)
transposed_df.drop([0],axis=0,inplace=True)

df1=transposed_df.drop(columns="Select any number and color the entire column",axis=1) #drop the duplicate column
df1.head(10)



Unnamed: 0,What is your height in inches?,What is your weight in pounds?,What is your age in # of months?,How many months of paid experience did you have before you started your graduate degree at UTA?,What is your gender,NaN,How many months of unpaid experience did you have before you started your graduate degree at UTA?
1,,,,,Male (1/0),Female (1/0),
2,,,,,,,
3,64.0,124.0,300.5,0.0,,1.0,6.0
4,70.0,211.6,279.0,0.0,1.0,0.0,0.0
5,70.0,200.0,303.0,24.0,1,0,0.0
6,69.0,124.0,285.0,18.0,1,0,0.0
7,61.0,134.5,336.0,60.0,0.0,1.0,0.0
8,63.5,123.6,261.0,3.0,0.0,1.0,0.0
9,65.76,158.7,310.0,12.0,1.0,0.0,0.0
10,61.2,141.0,284.0,16.0,0.0,1.0,0.0


2. Rename the columns to have the following format Height, Weight, Age(Months), Experience(Months), Salary(USD), Gender(M/F) 
    
    Note:
   - There is no salary column in the original dataset.
   - Instead, there is a column named 'How many months of unpaid experience did you have before you started your graduate degree at UTA?', which was not given a naming convention, so we renamed it to Unpaid experience.

In [278]:
# create a dictionary with the column names
# (np.nan matches the unnamed column label; the np.NaN alias was removed in NumPy 2.0)
col_dict={"What is your height in inches?":"Height","What is your weight in pounds?":"Weight","What is your age in # of months?":"Age(Months)",
 "How many months of paid experience did you have before you started your graduate degree at UTA?"
:"Experience(Months)","What is your gender":"Male(1/0)",np.nan:"Female (1/0)","How many months of unpaid experience did you have before you started your graduate degree at UTA?":"Unpaid experience"}

df1.rename(columns=col_dict,inplace=True) # the rename method lets us rename the columns

# remove row index 1 and 2 because they are just NAN values and reset the index
df1.drop([1,2],axis=0,inplace=True)
df1.reset_index(drop=True, inplace=True)


df1.head()

Unnamed: 0,Height,Weight,Age(Months),Experience(Months),Male(1/0),Female (1/0),Unpaid experience
0,64.0,124.0,300.5,0.0,,1.0,6.0
1,70.0,211.6,279.0,0.0,1.0,0.0,0.0
2,70.0,200.0,303.0,24.0,1.0,0.0,0.0
3,69.0,124.0,285.0,18.0,1.0,0.0,0.0
4,61.0,134.5,336.0,60.0,0.0,1.0,0.0


3. Create only one column for gender with values M&F to indicate male or female.

In [279]:
# let's use a lambda function to populate the new gender column from the male and female columns
df1['Gender(M/F)'] = df1.apply(lambda row: 'Male' if row['Male(1/0)'] == 1 else ('Female' if row['Female (1/0)'] == 1 else None), axis=1)

# drop the male and female columns
df1.drop(columns=['Male(1/0)','Female (1/0)'],axis=1,inplace=True)
df1.head(100)

Unnamed: 0,Height,Weight,Age(Months),Experience(Months),Unpaid experience,Gender(M/F)
0,64.0,124.0,300.5,0.0,6.0,Female
1,70.0,211.6,279.0,0.0,0.0,Male
2,70,200,303,24,0,Male
3,69,124,285,18,0,Male
4,61.0,134.5,336.0,60.0,0.0,Female
...,...,...,...,...,...,...
72,64.0,121.25,269.0,3.0,0.0,Female
73,70.0,158.73,312.0,0.0,0.0,Male
74,67.2,125.66,281.0,8.0,0.0,Female
75,72.0,160.937,276.0,0.0,0.0,Male


Note:
Now we have a dataframe that is transposed with a shape of 77 by 6.

4. Let's check if all attributes have values. If there are missing values, impute them and explain your choices.

In [280]:
# percentage of missing values in each column
for col in df1.columns:
    print(f"Sum of missing values in {col} column: {df1[col].isnull().sum()/df1.shape[0]*100}")


Sum of missing values in Height column: 0.0
Sum of missing values in Weight column: 0.0
Sum of missing values in Age(Months) column: 0.0
Sum of missing values in Experience(Months) column: 0.0
Sum of missing values in Unpaid experience column: 18.181818181818183
Sum of missing values in Gender(M/F) column: 5.194805194805195


Note:
 -  Only two columns have missing data, and we are going to impute them.
 -  Since the Unpaid experience and Gender columns have a substantial percentage of missing data relative to just 77 rows of data, we can't simply drop them.

We will use Multiple Imputation by Chained Equations (**MICE**), an imputation technique in which the missing values are predicted from the other features in the dataset.

Reasons for using MICE:

 a. It accounts for the relationships between variables, which can lead to more accurate imputations.

 b. We needed a method that takes the other variables into consideration when predicting the missing values.
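
To make the idea of predicting a missing value from the other features concrete, here is a simplified single-pass illustration. It is not the algorithm that IterativeImputer runs (which cycles over every column with missing values for several rounds); it fits one linear regression on the complete rows and fills one gap. The toy values are hypothetical.

```python
import numpy as np

# Toy data (hypothetical values): columns are height (in) and weight (lb).
data = np.array([
    [60.0, 110.0],
    [65.0, 135.0],
    [70.0, 160.0],
    [72.0, np.nan],   # weight is missing
])

missing = np.isnan(data[:, 1])

# Fit weight = a * height + b on the complete rows only.
A = np.column_stack([data[~missing, 0], np.ones((~missing).sum())])
coef, *_ = np.linalg.lstsq(A, data[~missing, 1], rcond=None)

# Predict the missing weight from the observed height.
data[missing, 1] = coef[0] * data[missing, 0] + coef[1]
print(data[missing, 1])
```

Because the toy weights follow the line weight = 5 * height - 190 exactly, the filled value for a height of 72 comes out at 170 (up to floating-point rounding).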

In [281]:
# the imputer needs numerical data, so let's convert the gender strings to numbers
df1['Gender(M/F)'] = df1['Gender(M/F)'].map({'Male': 1, 'Female': 2})

In [282]:
df1.head()

Unnamed: 0,Height,Weight,Age(Months),Experience(Months),Unpaid experience,Gender(M/F)
0,64.0,124.0,300.5,0.0,6.0,2.0
1,70.0,211.6,279.0,0.0,0.0,1.0
2,70.0,200.0,303.0,24.0,0.0,1.0
3,69.0,124.0,285.0,18.0,0.0,1.0
4,61.0,134.5,336.0,60.0,0.0,2.0


In [283]:
#implement MICE
imputer = IterativeImputer(max_iter=8, random_state=0) # create an imputer object

imputed_df=imputer.fit_transform(df1)

# fit_transform gives us an array so lets convert it to a dataframe
df2 = pd.DataFrame(imputed_df, columns=df1.columns)


# round the Gender and Unpaid experience columns to integers, since the imputed values may be fractional
df2['Gender(M/F)'] = df2['Gender(M/F)'].round().astype(int)
df2['Unpaid experience']=df2['Unpaid experience'].round().astype(int)

# convert gender back to male and female
df2['Gender(M/F)'] = df2['Gender(M/F)'].map({1:'Male', 2:'Female'})

In [284]:
df2.isnull().sum()

0
Height                0
Weight                0
Age(Months)           0
Experience(Months)    0
Unpaid experience     0
Gender(M/F)           0
dtype: int64

Note:
- Now there are no missing values.

5. Save the dataset in a file called ClassData.csv and use only Height, Weight, Age, and Gender variables



In [285]:
df2.drop(columns=["Unpaid experience","Experience(Months)"],inplace=True)
df2.columns

Index(['Height', 'Weight', 'Age(Months)', 'Gender(M/F)'], dtype='object', name=0)

In [286]:
# save the new dataset
df2.to_csv("ClassData.csv", index=False) # index=False keeps only the Height, Weight, Age, and Gender columns in the file

In [287]:
df2.head()

Unnamed: 0,Height,Weight,Age(Months),Gender(M/F)
0,64.0,124.0,300.5,Female
1,70.0,211.6,279.0,Male
2,70.0,200.0,303.0,Male
3,69.0,124.0,285.0,Male
4,61.0,134.5,336.0,Female


### 2. Implementing KNN Model

We will implement KNN in 2 scenarios:

 a. Using libraries.

 b. Implementing with no libraries other than numpy and math.

In [288]:
# define the target variable and the input features
x=df2[['Height', 'Weight', 'Age(Months)']]
y=df2['Gender(M/F)']

We will use the Min-Max scaler to scale the features.

Min-Max scaling transforms the features by scaling their values to a specific range, typically [0, 1].
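
As a small worked illustration of that transformation, each value x is mapped to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The heights below are hypothetical.

```python
import numpy as np

# Hypothetical heights in inches.
heights = np.array([60.0, 65.0, 70.0, 75.0])

# Min-Max scaling to the range [0, 1].
scaled = (heights - heights.min()) / (heights.max() - heights.min())
print(scaled)
```

Scaling matters for KNN because weight and age span much larger numeric ranges than height and would otherwise dominate the distance.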

In [289]:
scaler = MinMaxScaler(feature_range=(0, 1))
x_scaled=scaler.fit_transform(x)
type(x_scaled)



numpy.ndarray

In [290]:
print(x_scaled.shape)
print(x.shape)

(77, 3)
(77, 3)


In [291]:
# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)


In [292]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(61, 3)
(61,)
(16, 3)
(16,)


Q2. Let's create a function that will predict the gender of a person from a set of input parameters,
namely height, weight, and age, using KNN.
 The program should ask for three parameters:
 1. Similarity measurements and accept: Cartesian, Manhattan and Minkowski (accept C, M, K)
 2. Order and accept: 1 – 5
 3.  Value of K and accept: 1 – 5 

In [293]:
# validation function to check that the input number is between a and b (here 1 and 5)
def input_validation(user_promt, a, b):
    while True:
        try:
            number = int(input(user_promt))
            if a <= number <= b:
                return number
            else:
                print(f"Please enter a number between {a} and {b}.")
        except ValueError:
            print("Invalid input. Please enter a valid integer.")

In [294]:
# function that accepts the parameters and fits a KNN classifier to the given data
def gender_predict_knn(similarity_measure, order, k, x, y):
    if similarity_measure == 'C':
        metric = 'euclidean'
    elif similarity_measure == 'M':
        metric = 'manhattan'
    elif similarity_measure == 'K' :
        metric = 'minkowski'
    else:
        raise ValueError("Invalid similarity measure. Please enter letters C, M, or K.")
    
    # p is the power parameter for the Minkowski metric
    knn = KNeighborsClassifier(n_neighbors=k, p=order, metric=metric)
    knn.fit(x, y) # fit on the data passed in rather than on the global X_train and y_train
    return knn



a) Predict the gender (Male/Female) using the inputted similarity measurements,
order, and the value of K.

In [295]:
# prompt user to enter similarity measurement, order, and K value input and then make predictions
    
similarity_measure = input("Please enter the similarity measurement (C for Cartesian, M for Manhattan, K for Minkowski): ")
order = input_validation("please enter the order value between 1 and 5: ", 1, 5)
value_k = input_validation("please enter the k value between 1 and 5: ", 1, 5)




# call gender_predict_knn and pass in the 5 parameters
knn_model = gender_predict_knn(similarity_measure, order, value_k, X_train, y_train)

# Predict the gender from one of the x_test values
predicted_gender = knn_model.predict([X_test[6]])

# print out the predicted gender
print(f"The predicted gender with similarity measure {similarity_measure}, order value of {order} and k value {value_k} is:", predicted_gender[0])



The predicted gender with similarity measure K, order value of 1 and k value 1 is: Male


b) Let's show the following results for Cartesian distance, order = 2, and K = 3 in terms of:

1. Accuracy
2. Confusion matrix

In [296]:

# call gender_predict_knn with (Cartesian Distance, order = 2, K = 3) and fit the model
knn_model_1 = gender_predict_knn('C', 2, 3, X_train, y_train)

# get the predicted gender for all X_test instances
predicted_gender_2 = knn_model_1.predict(X_test)
# calculate the accuracy score
accuracy_value = accuracy_score(y_test, predicted_gender_2)
# confusion matrix (rows are true labels, columns are predicted labels)
conf_matrix = confusion_matrix(y_test, predicted_gender_2)

# print the accuracy and confusion matrix
print("Accuracy using this parameter (Cartesian Distance, order = 2, K = 3) is  :", accuracy_value)
print("Confusion matrix using this parameter (Cartesian Distance, order = 2, K = 3) is  :", conf_matrix)
# predicted_gender is the single prediction from the previous cell; predicted_gender_2 holds the predictions for all of X_test
print("Predicted gender based on these parameters (Cartesian Distance, order = 2, K = 3) is  :", predicted_gender)

   

Accuracy using this parameter (Cartesian Distance, order = 2, K = 3) is  : 0.625
Confusion matrix using this parameter (Cartesian Distance, order = 2, K = 3) is  : [[2 6]
 [0 8]]
Predicted gender based on these parameters (Cartesian Distance, order = 2, K = 3) is  : ['Male']
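
The accuracy and the confusion matrix above are two views of the same predictions: the diagonal of the matrix counts the correct predictions, so accuracy is the diagonal sum divided by the total. A short check of that arithmetic, using the matrix printed above:

```python
# Confusion matrix printed above: rows are true labels, columns are predicted labels.
conf_matrix = [[2, 6],
               [0, 8]]

correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))  # diagonal entries
total = sum(sum(row) for row in conf_matrix)                       # all test records

accuracy = correct / total
print(correct, total, accuracy)  # 10 16 0.625
```

The off-diagonal 6 in the first row shows where the errors come from: six records of the first class were assigned to the second class.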



c) Prediction results using different values of K and distance measures for the 20% test data and print out
1. Accuracy_score
2. Predicted gender Male or Female
   

In [310]:
# Prediction results using different values of K and distance measures for the 20% test data
k_values = [1, 2, 3, 4, 5]
distances = ['C', 'M', 'K']

# Loop through the K values and different similarity measures
for k in k_values:
    for i in distances:
        knn_model_2 = gender_predict_knn(i, 5, k, X_train, y_train)
        y_pred_1 = knn_model_2.predict(X_test)
        accuracy_val_2 = accuracy_score(y_test, y_pred_1)
        
        predicted_gender = np.where(y_pred_1 == 'Male', "male", "female")  # the labels are still the strings 'Male'/'Female' at this point
        
        print(f"Confusion Matrix for K_value={k}, distance measure={i}:")
        print(confusion_matrix(y_test, y_pred_1))
        
        print(f"Accuracy_score for K_value={k}, distance measure={i}: {accuracy_val_2}")
        
        print(f"Predicted Gender for all values of X_test:")
        print(predicted_gender)
        
        print("\n")


Confusion Matrix for K_value=1, distance measure=C:
[[5 3]
 [0 8]]
Accuracy_score for K_value=1, distance measure=C: 0.8125
Predicted Gender for all values of X_test:
['female' 'male' 'female' 'male' 'male' 'male' 'female' 'male' 'male'
 'male' 'male' 'male' 'female' 'male' 'male' 'female']


Confusion Matrix for K_value=1, distance measure=M:
[[3 5]
 [1 7]]
Accuracy_score for K_value=1, distance measure=M: 0.625
Predicted Gender for all values of X_test:
['female' 'female' 'female' 'male' 'male' 'male' 'male' 'male' 'male'
 'male' 'male' 'male' 'male' 'male' 'male' 'female']


Confusion Matrix for K_value=1, distance measure=K:
[[5 3]
 [0 8]]
Accuracy_score for K_value=1, distance measure=K: 0.8125
Predicted Gender for all values of X_test:
['female' 'male' 'female' 'male' 'male' 'male' 'female' 'male' 'male'
 'male' 'male' 'male' 'female' 'male' 'male' 'female']


Confusion Matrix for K_value=2, distance measure=C:
[[6 2]
 [2 6]]
Accuracy_score for K_value=2, distance measure=C: 0.75

Q3 (a) Your program should ask for three parameters:

 a. Similarity measurements and accept: Cartesian, Manhattan and Minkowski (accept C, M, K)

 b. Order and accept: 1 – 5

 c. Value of K: 1 – 5 

NOTE: You are not required to do an intensive input variable validation


We are going to create three different functions to calculate the respective distance, depending on the similarity measure the user enters.

Cartesian distance:

- The Cartesian (Euclidean) distance is the straight-line distance between two points.
- The function takes two parameters representing the points and returns the square root of the sum of the squared differences between their coordinates. Mathematically, its formula is
 ![Alt text](image-1.png)
 
- We use NumPy to calculate it.

manhattan_distance

- The distance between two points is the sum of the absolute differences of their Cartesian coordinates.
- Its formula is the Minkowski distance with p=1.
- ![Alt text](image-2.png)


minkowski_distance

- This is the distance measured between two points in N-dimensional space.

![Alt text](image-3.png)



In [298]:

# Function to compute Cartesian distance
def cartesian_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Function to compute Manhattan distance
def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

# Function to compute Minkowski distance of a given order
def minkowski_distance(a, b, order):
    return np.sum(np.abs(a - b) ** order) ** (1 / order)
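
A quick numeric check, written as a self-contained sketch with hypothetical points, shows how the three measures relate: the Minkowski distance of order 1 is the Manhattan distance, and order 2 gives the Cartesian (Euclidean) distance.

```python
import numpy as np

def minkowski(a, b, order):
    # Same formula as minkowski_distance, repeated here so the sketch runs on its own.
    return np.sum(np.abs(a - b) ** order) ** (1 / order)

# Hypothetical points whose coordinates differ by 3, 4, and 0.
a = np.array([0.0, 0.0, 0.0])
b = np.array([3.0, 4.0, 0.0])

print(minkowski(a, b, 1))  # 7.0, the Manhattan distance (3 + 4 + 0)
print(minkowski(a, b, 2))  # 5.0, the Cartesian distance sqrt(9 + 16)
```

This is why the order parameter only changes the result when the Minkowski measure (K) is selected.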

1. The calculate_distance function dispatches to the respective distance function depending on the similarity measure entered by the user.

In [299]:
# create a function that calls a specific distance function depending on the input value C, K or M
def calculate_distance(similarity_measure, order, a, b):
    if similarity_measure == 'C':
        return cartesian_distance(a, b)
    elif similarity_measure == 'M':
        return manhattan_distance(a, b)
    elif similarity_measure == 'K':
        return minkowski_distance(a, b, order)
    else:
        raise ValueError("Invalid similarity measure please enter C, K or M")

In [300]:
# function that gets the most frequent label from a list or array of labels
def majority_vote(labels):
    unique_labels, counts = np.unique(labels, return_counts=True)
    return unique_labels[np.argmax(counts)]
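
One behaviour of this vote is worth knowing when K is even: np.unique returns the labels in sorted order and np.argmax returns the first maximum, so a tie goes to the smaller label. With the encoding used in this notebook (Female = 1, Male = 2), a tie therefore resolves to Female. The sketch below repeats the function so that it runs on its own.

```python
import numpy as np

def majority_vote(labels):
    # Same logic as above: most frequent label, ties broken by the smallest label.
    unique_labels, counts = np.unique(labels, return_counts=True)
    return unique_labels[np.argmax(counts)]

print(majority_vote([2, 2, 1]))  # 2: two of the three neighbours vote for label 2
print(majority_vote([2, 1]))     # 1: a tie, so the smaller label wins
```

Choosing an odd K avoids ties altogether in a two-class problem.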

In [301]:
# encode the Male and Female strings as numbers
label_mapping = {'Female': 1, 'Male': 2}
y = np.array([label_mapping[label] for label in y])

2. Create a function that takes the input parameters, calculates the distances using the similarity measure chosen by the user, and predicts the majority label among the k nearest neighbors.

In [302]:
# Function to predict gender using KNN without using libraries except NumPy and Math
def gender_predict_knn_2(similarity_measure, order, k, x_train, y_train, x_test):
    train_data, _ = x_train.shape # number of rows in the training data; the column count is not needed, so it is assigned to _
    test_data, _ = x_test.shape # number of rows in the test data
    y_pred_2 = np.empty(test_data, dtype=int)
    
    #loop through the test data and train data and calculate the distance for the current x_test[i] and x_train[j]
    for i in range(test_data):
        distances = []
        for j in range(train_data):
            dist = calculate_distance(similarity_measure, order, x_test[i], x_train[j])
            #print(f"i={i}, j={j}, distance={dist}")
            distances.append((dist, y_train[j]))
        
        # Sort by distance and keep the k nearest neighbors
        distances.sort(key=lambda x: x[0])
        k_neighbors = distances[:k]
        
        # Get the labels of the k neighbors
        neighbor_labels = [neighbor[1] for neighbor in k_neighbors]
        
        # Predict the gender for the current test record (labels are already encoded as 1 = Female, 2 = Male)
        y_pred_2[i] = majority_vote(neighbor_labels)
    
    return y_pred_2
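
The nested loops above compute one distance at a time. As an alternative sketch (not the approach this notebook uses), NumPy broadcasting can produce the whole test-by-train distance matrix at once. The function name and the toy data are hypothetical, and numeric labels are assumed.

```python
import numpy as np

def knn_predict_vectorized(x_train, y_train, x_test, k, order=2):
    """Hypothetical vectorized KNN using the Minkowski distance of the given order."""
    # Pairwise absolute differences, shape (n_test, n_train, n_features).
    diffs = np.abs(x_test[:, None, :] - x_train[None, :, :])
    # Minkowski distances, shape (n_test, n_train).
    dists = np.sum(diffs ** order, axis=2) ** (1.0 / order)
    # Indices of the k nearest training points for each test point.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    preds = []
    for labels in y_train[nearest]:
        values, counts = np.unique(labels, return_counts=True)
        preds.append(values[np.argmax(counts)])  # majority vote, ties go to the smaller label
    return np.array(preds)

# Toy data: two well-separated clusters labelled 1 and 2.
x_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([1, 1, 1, 2, 2, 2])
x_test = np.array([[0.2, 0.2], [5.5, 5.5]])

preds = knn_predict_vectorized(x_train, y_train, x_test, k=3)
print(preds)
```

With order=1 this corresponds to the Manhattan distance and with order=2 to the Cartesian distance, so a single function covers all three similarity measures.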

3. Here, just like in Q2, we want the user to input the 3 parameters, and then we get the accuracy score on the test data for those parameters.

In [303]:
# Split the data into 80% training and 20% testing
# re-split so that y_train and y_test carry the numeric labels; the same random_state selects the same rows as before
X_train, X_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)

# prompt the user to enter the required variables
similarity_measure = input("Please enter similarity measurement (C for Cartesian, M for Manhattan, K for Minkowski): ")
order = int(input("Please enter the order value between 1 and 5: "))
k_value = int(input("Please enter the k value between 1 and 5: "))

# pass the required parameters to the function and predict the labels of the test data
y_pred = gender_predict_knn_2(similarity_measure, order, k_value, X_train, y_train, X_test)


# create a function to calculate the accuracy score of the predicted values against y_test
def calculate_accuracy(similarity_measure, order, k_value, x_train, y_train, x_test, y_test):
    y_pred = gender_predict_knn_2(similarity_measure, order, k_value, x_train, y_train, x_test)
    accuracy = np.mean(y_pred == y_test) # fraction of predictions that match the true labels
    return accuracy


accuracy = calculate_accuracy(similarity_measure, order, k_value, X_train, y_train, X_test, y_test)
print(f"Similarity Measure: {similarity_measure}, Order: {order}, K-Value: {k_value}, Accuracy: {accuracy}")

Similarity Measure: K, Order: 2, K-Value: 2, Accuracy: 0.75


4. Prediction results using different values of K and distance measures for the 20% test data
- In this code we have a list of k values and distance measures.
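- We will loop over every combination and print the confusion matrix, the accuracy score, and the predicted gender for each test record.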

In [312]:
# Function to calculate confusion matrix
def calculate_confusion_matrix(true_y, y_pred):
    unique_labels = np.unique(np.concatenate((true_y, y_pred)))
    num_labels = len(unique_labels)
    conf_mat = np.zeros((num_labels, num_labels), dtype=int)
    
    for i in range(len(true_y)):
        true_label = np.where(unique_labels == true_y[i])[0][0]
        pred_label = np.where(unique_labels == y_pred[i])[0][0]
        conf_mat[true_label][pred_label] += 1
    
    return conf_mat

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)


# List of configuration options
k_values = [1, 2, 3, 4, 5]
distance_measures = ['C', 'M', 'K']
pred_results = {}

# Iterate through different values of K and distance measures
for k in k_values:
    for i in distance_measures:
        y_pred = gender_predict_knn_2(i, 5, k, X_train, y_train,X_test) # the order is fixed at 5 and only affects the Minkowski measure
        accuracy = calculate_accuracy(i, 5, k, X_train, y_train, X_test, y_test)
        conf_matrix = calculate_confusion_matrix(y_test, y_pred)
        predicted_genders = np.where(y_pred == 2, "male", "female")  # label 2 represents male and 1 represents female
        
        #display the accuracy score, confusion matrix and predicted gender for all k and i values
        print(f"Confusion Matrix for K_value={k}, distance measure={i}:")
        print(conf_matrix)
        
        print(f"Accuracy_score for K_value={k}, distance measure={i}: {accuracy}")
        
        print(f"Predicted Gender for all values of X_test:")
        print(predicted_genders)
        
        print("\n")
        


Confusion Matrix for K_value=1, distance measure=C:
[[5 3]
 [0 8]]
Accuracy_score for K_value=1, distance measure=C: 0.8125
Predicted Gender for all values of X_test:
['female' 'male' 'female' 'male' 'male' 'male' 'female' 'male' 'male'
 'male' 'male' 'male' 'female' 'male' 'male' 'female']


Confusion Matrix for K_value=1, distance measure=M:
[[3 5]
 [1 7]]
Accuracy_score for K_value=1, distance measure=M: 0.625
Predicted Gender for all values of X_test:
['female' 'female' 'female' 'male' 'male' 'male' 'male' 'male' 'male'
 'male' 'male' 'male' 'male' 'male' 'male' 'female']


Confusion Matrix for K_value=1, distance measure=K:
[[5 3]
 [0 8]]
Accuracy_score for K_value=1, distance measure=K: 0.8125
Predicted Gender for all values of X_test:
['female' 'male' 'female' 'male' 'male' 'male' 'female' 'male' 'male'
 'male' 'male' 'male' 'female' 'male' 'male' 'female']


Confusion Matrix for K_value=2, distance measure=C:
[[6 2]
 [2 6]]
Accuracy_score for K_value=2, distance measure=C: 0.75

Result after accepting 1 input record (ie, predict M or F after receiving height, weight, and age)

Data to be used for prediction for 1 record: Height 65 inches, Weight: 150 lbs , Age: 300 months.

In [305]:
# The model was trained on Min-Max-scaled features, so scale the record with the scaler fitted earlier
features_to_predict = scaler.transform(pd.DataFrame([[65, 150, 300]], columns=x.columns))

# Let's specify the values for similarity_measure, order, and k_value
similarity_measure = 'C'  
order = 2                
k_value = 3              

# Use the gender_predict_knn_2 function to make a prediction for the record
predicted_gender = gender_predict_knn_2(similarity_measure, order, k_value, X_train, y_train, features_to_predict.reshape(1, -1))

# Decode the predicted_gender value so that we output strings
if predicted_gender == 1:
    predicted_label = 'Female'
elif predicted_gender == 2:
    predicted_label = 'Male'

# display the predicted gender
print(f"Predicted Gender: {predicted_label}")


Predicted Gender: Male
