Assignment Question:

Some ML algorithms

KNN

SVM

Decision Trees

Random forest

Research the algorithms above and write a brief review of each.
Implement any of them or other algorithms that you may like as an end-to-end project. That means you should present a project that involves data cleaning, EDA, data preprocessing, Model building, and model evaluation. Remember to use proper documentation (most of the exercises you submitted last week were not properly documented).
This is a group project, please spend Monday and Tuesday working on this project. Let me know if you have any challenges.
PS: I suggest you check out sample projects on Kaggle.

**SOLUTIONS**

 * K-Nearest Neighbours (KNN) is a simple yet effective machine learning algorithm used for classification and regression. It predicts the label or value of a new data point from the majority class or the average value of its k nearest neighbours in the feature space. The choice of k is crucial: small values tend to overfit, while large values tend to underfit. KNN does almost no work at training time, but the cost of each prediction grows with the size of the dataset, and its accuracy depends on the distance metric and on how the features are scaled and distributed.


   
 * Support Vector Machine (SVM) is a machine learning method for classification and regression. It finds the boundary (hyperplane) that separates the classes while maximizing the margin, that is, the distance between the boundary and the closest data points. With kernel functions it can model nonlinear as well as linear relationships, and it works well on small to medium-sized datasets. However, training can be computationally intensive on large datasets, and performance depends on careful selection of parameters such as the kernel and the regularization strength.

 * Decision Tree is a machine learning model that makes predictions by repeatedly splitting the data on feature values, forming a hierarchical structure of branches and leaves. It is straightforward and interpretable, and suits both classification and regression. However, an unconstrained tree tends to overfit, so it usually needs tuning (for example, limiting its depth or pruning it) to generalize well.

 * Random Forest is an ensemble method that combines the predictions of many decision trees. Instead of relying on a single tree, it builds a "forest" of trees, each trained on a bootstrap sample of the data and restricted to a random subset of features when choosing each split. This randomness makes the trees differ from one another, which reduces overfitting and improves generalization. At prediction time, the forest combines the outputs of all the trees, typically by majority vote for classification or by averaging for regression. The ensemble is usually more accurate and more robust than any individual tree. Random Forest is widely used because it is versatile, scales to large datasets, and copes with high-dimensional features. It also provides feature importance scores, which help with feature selection and with understanding the patterns in the data.
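
To make the review above concrete, here is a minimal sketch that fits all four algorithms with scikit-learn and compares them by 5-fold cross-validated accuracy. The data is synthetic (`make_classification`), and the hyperparameters and the names `X_demo`, `y_demo`, and `candidates` are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# KNN and SVM are distance/margin based, so they are paired with a scaler.
candidates = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each algorithm.
cv_scores = {}
for name, clf in candidates.items():
    cv_scores[name] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    print(f"{name}: mean accuracy = {cv_scores[name]:.3f}")
```

The ranking on synthetic data says little about real datasets; the point is only that the four models share the same fit/predict interface and can be compared under the same protocol.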




In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error  # regression metric used below


In [3]:
data=pd.read_csv('C:\\Users\\tekno asya\\PycharmProjects\\WTFModelDeploymentGroup5\\Oral Cancer Consolidated-latest-numeric1 1.csv')
data.head()

Unnamed: 0,Donor ID,Project Code,Primary Site,Gender,Age at Diagnosis,Survival Time (days),SSM,CNSM,STSM,SGV,METH-A,METH-S,EXP-A,EXP-S,PEXP,miRNA-S,JCN,Mutations,Mutated Genes,Tumor Stage at Diagnosis
0,DO27935,ORCA-IN,Head and neck,1,55,788,True,False,False,False,False,False,False,False,False,False,False,98,145,20
1,DO50673,ORCA-IN,Head and neck,1,45,1206,True,False,False,False,False,False,False,False,False,False,False,168,265,20
2,DO50723,ORCA-IN,Head and neck,0,65,354,True,False,False,False,False,False,False,False,False,False,False,69,122,20
3,DO27917,ORCA-IN,Head and neck,1,40,224,True,False,False,False,False,False,False,False,False,False,False,31,44,20
4,DO50725,ORCA-IN,Head and neck,1,26,285,True,False,False,False,False,False,False,False,False,False,False,192,278,20


We drop the columns that contain only True/False flags (SSM through JCN) because they are not useful for our model. Note that the positional slice below also removes the Mutations column.


In [None]:
data=data.drop(data.columns[6:18],axis=1)
data.head()
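
Dropping columns by position is fragile if the column order ever changes. As a hedged alternative, the flag columns could be selected by dtype instead. The sketch below uses a tiny made-up frame (`demo`) whose column names mirror ours; it assumes the flags are read in as `bool`, which is what pandas normally does for columns holding only True/False.

```python
import pandas as pd

# Tiny stand-in frame; column names mirror the notebook's, values are made up.
demo = pd.DataFrame({
    "Gender": [1, 0],
    "Age at Diagnosis": [55, 65],
    "SSM": [True, True],
    "CNSM": [False, False],
    "Mutated Genes": [145, 122],
})

# Keep every column whose dtype is not bool.
demo_no_flags = demo.select_dtypes(exclude="bool")
print(list(demo_no_flags.columns))
```

Any non-Boolean column that should also be removed would still need an explicit `drop` by name.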

In [None]:
data.info()

We also drop the first three columns (Donor ID, Project Code and Primary Site), which hold text rather than numeric values.

In [None]:
data=data.drop(data.columns[:3],axis=1)
data.head()

Next we try to convert the Survival Time (days) column from object to integer type

In [None]:
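# Assign back to the column, not to `data`; this raises a ValueError because of a non-numeric entry
data['Survival Time (days)']=data['Survival Time (days)'].astype(int)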

The conversion raises an error, so we inspect the values in the column.

In [None]:
data['Survival Time (days)'].unique()

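There is a '?' value preventing us from converting our column to integer type, so we need to clean our data. As a first step, we substitute 0 as a temporary placeholder so that the column can be converted to integers; the placeholder is replaced with the mean survival time afterwards.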

In [None]:
data['Survival Time (days)']=data['Survival Time (days)'].replace('?',0).astype(int)

In [None]:
data.info()

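Now we replace the placeholder 0 with the mean survival time. Because the mean is computed while the placeholders are still in the column, it is slightly lower than the mean of the known values, and any genuine 0 in the column would be overwritten as well.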

In [None]:
data_mean=data['Survival Time (days)'].mean().round().astype(int)
data_mean

In [None]:
data['Survival Time (days)']=data['Survival Time (days)'].replace(0,data_mean)

In [None]:
data.head()
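
The same cleaning step can be done without a placeholder value. The sketch below is an alternative, not what was run above: it converts non-numeric entries to NaN with `pd.to_numeric` and fills them with the mean of the known values. The series `raw` holds made-up numbers for illustration.

```python
import pandas as pd

# Made-up values; "?" marks a missing entry, as in our column.
raw = pd.Series(["788", "?", "1206"], name="Survival Time (days)")

# errors="coerce" turns anything that is not a number into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# mean() skips NaN by default, so the fill value is not pulled down by placeholders.
fill_value = int(round(numeric.mean()))
cleaned = numeric.fillna(fill_value).astype(int)
print(cleaned.tolist())  # [788, 997, 1206]
```

With this approach a genuine survival time of 0 would be left untouched, since only NaN entries are filled.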

**EXPLORATORY DATA ANALYSIS**

Checking the gender distribution

In [None]:
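# Gender is coded 0/1 (Male = 0 in the encoding used later), so the column sum counts the women
num_women=data['Gender'].sum()
num_men=data.shape[0]-num_women
total=[num_women,num_men]
labels=['Women','Men']
plt.pie(total,labels=labels,autopct='%1.1f%%')
plt.title("Gender distribution in our data")
plt.legend()
plt.show()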

Plotting Age against Survival Time

In [None]:
plt.scatter(data['Age at Diagnosis'],data['Survival Time (days)'])
plt.ylabel('Survival Time')
plt.xlabel("Age")
plt.title("Survival Time vs Age")

In [None]:
# Pass the frame via `data=` and refer to columns by name
sns.boxplot(data=data,x='Tumor Stage at Diagnosis',y='Age at Diagnosis')


From the box plot, we can see that most patients in our data were over 30 years old at diagnosis.

In [None]:

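# value_counts() orders by frequency, so sort by stage and take the labels from the index
stage_counts=data['Tumor Stage at Diagnosis'].value_counts().sort_index()
plt.pie(stage_counts,explode=[0.1]*len(stage_counts), autopct='%1.1f%%',shadow=True, labels=stage_counts.index.astype(str),colors=['b','g','c','r'])
plt.title("Tumor Stage Distribution in our data")
plt.show()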

In [None]:
plt.hist(data['Age at Diagnosis'])
plt.title("Age distribution in our data")
plt.xlabel("Age at Diagnosis")

In [None]:
sns.pairplot(data,x_vars=['Gender','Age at Diagnosis','Mutated Genes','Tumor Stage at Diagnosis'],y_vars=['Survival Time (days)'],kind='reg')

In [None]:
data.corr()

In [None]:
X=data.drop('Survival Time (days)',axis=1).values #Independent variables
y=data['Survival Time (days)'].values #Dependent variable
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=4)
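
KNN relies on distances, so a feature with a large range (such as a gene count in the hundreds) can dominate a 0/1 feature such as gender. One common remedy, which we did not apply above, is to standardize the features inside a pipeline. The sketch below assumes synthetic data with similarly mismatched scales; `X_demo`, `y_demo`, and `scaled_knn` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only):
# a 0/1 flag, an age-like column, and a count-like column.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(0, 2, size=200),
    rng.integers(20, 80, size=200),
    rng.integers(10, 500, size=200),
])
y_demo = rng.normal(loc=700, scale=200, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# The scaler is fitted on the training split only, then reused on the test split.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)
print(scaled_knn.predict(X_te[:3]))
```

Putting the scaler in the pipeline keeps test-set statistics out of the training step, and the whole pipeline can be saved as a single object.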

In [None]:
# Try k = 2..19 and record the test-set MAE for each value
k_val_min = 2
test_MAE_array = []
k_array = []
MAE = float('inf')  # lowest MAE seen so far
for count, k in enumerate(range(2, 20), start=1):
    print(f"Iteration number {count}.......")
    model = KNeighborsRegressor(n_neighbors=k).fit(train_x, train_y)
    y_predict = model.predict(test_x)
    test_MAE = mean_absolute_error(test_y, y_predict)
    print(f"The current MAE is {test_MAE}")
    # Remember the k with the lowest MAE so far
    if test_MAE < MAE:
        MAE = test_MAE
        k_val_min = k
    test_MAE_array.append(test_MAE)
    k_array.append(k)

    print(f"The current best k value is {k_val_min} and the lowest MAE is {MAE}")

# Plot the test MAE against k
plt.plot(k_array, test_MAE_array,'r')
plt.xlabel("k")
plt.ylabel("Test MAE")
plt.show()

In our run, k = 9 gave the lowest MAE on the test set.
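
The loop above chooses k by looking at the test set, so the MAE it reports for the chosen k is likely optimistic. A common alternative is to choose k by cross-validation on the training data only and keep the test set for the final check. The sketch below shows this with `GridSearchCV` on synthetic data; the data and the names `search`, `best_k`, and `cv_mae` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# Search the same range of k as the loop, scored by (negated) MAE over 5 folds.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(2, 20))},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_tr, y_tr)  # the test split is never seen here

best_k = search.best_params_["n_neighbors"]
cv_mae = -search.best_score_
print(f"best k = {best_k}, cross-validated MAE = {cv_mae:.3f}")
```

After the search, `search.best_estimator_` is already refitted on the full training split and can be evaluated once on the held-out test data.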

In [None]:
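# Refit the final model with the selected k and evaluate it on the test set
model=KNeighborsRegressor(n_neighbors=9)
model.fit(train_x,train_y)
predictions=model.predict(test_x)
print(f"Test MAE: {mean_absolute_error(test_y,predictions)}")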
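
An MAE value is hard to judge on its own. One way to put it in context is to compare the model against a baseline that always predicts the training mean. The sketch below does this on synthetic data with scikit-learn's `DummyRegressor`; the data and variable names are illustrative assumptions, and no figures here come from our dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 500 + 100 * X_demo[:, 1] + rng.normal(scale=30, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

knn = KNeighborsRegressor(n_neighbors=9).fit(X_tr, y_tr)
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)  # always predicts the training mean

knn_mae = mean_absolute_error(y_te, knn.predict(X_te))
baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
knn_r2 = r2_score(y_te, knn.predict(X_te))

print(f"KNN MAE:      {knn_mae:.1f}")
print(f"Baseline MAE: {baseline_mae:.1f}")
print(f"KNN R^2:      {knn_r2:.3f}")
```

If the model's MAE is not clearly below the baseline's, the features carry little usable signal for the target.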


In [None]:
import pickle
import os
model_data={'model': model}

# File path where you want to save the model
file_path = 'C:\\Users\\tekno asya\\PycharmProjects\\WTFModelDeploymentGroup5\\model.pkl'

# Ensure the directory exists, create it if it doesn't
directory = os.path.dirname(file_path)
os.makedirs(directory, exist_ok=True)

# Save the model to the file
try:
    with open(file_path, 'wb') as file:
        pickle.dump(model_data, file)
    print(f"Data saved successfully to {file_path}")
except Exception as e:
    print("Error occurred while saving data:", e)


In [None]:
with open(file_path,'rb') as file:
    model_data=pickle.load(file)
model=model_data['model']
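
A quick sanity check after loading is to confirm that the restored model predicts exactly what the original did. The sketch below shows such a round trip using a temporary directory and a small synthetic model, so that it does not touch the project path; `fitted`, `restored`, and `same_predictions` are illustrative names.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic model standing in for the trained one.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = rng.normal(size=50)
fitted = KNeighborsRegressor(n_neighbors=9).fit(X_demo, y_demo)

# Save and reload through a temporary file, using the same dict layout as above.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "model.pkl")
    with open(tmp_path, "wb") as fh:
        pickle.dump({"model": fitted}, fh)
    with open(tmp_path, "rb") as fh:
        restored = pickle.load(fh)["model"]

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.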

In [None]:
data['Mutated Genes'].min()

In [None]:
val=np.array([1,55,145,20]).reshape(1,-1)
print(model.predict(val)[0])

In [None]:
gender='Male'
age=12
genemutations=234
tumorstage=40

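gender_encoded = 0 if gender == 'Male' else 1  # Encode gender as a numerical value

# Create the feature vector in the training column order:
# Gender, Age at Diagnosis, Mutated Genes, Tumor Stage at Diagnosis
X = np.array([gender_encoded, age, genemutations, tumorstage],dtype=int).reshape(1, -1)
X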

In [None]:
model.predict(X)[0]
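
For deployment, the encoding step could be wrapped in a small helper so that every caller builds the feature vector the same way. The sketch below is one possible shape for it. `build_features` and `GENDER_CODES` are hypothetical names, and the feature order (Gender, Age at Diagnosis, Mutated Genes, Tumor Stage at Diagnosis) is assumed from the cells above.

```python
import numpy as np

# Assumption: same encoding as in the cell above (Male = 0, otherwise 1).
GENDER_CODES = {"Male": 0, "Female": 1}


def build_features(gender, age, mutated_genes, tumor_stage):
    """Return a (1, 4) integer array in the column order the model was trained on."""
    if gender not in GENDER_CODES:
        raise ValueError(f"Unknown gender: {gender!r}")
    row = [GENDER_CODES[gender], age, mutated_genes, tumor_stage]
    return np.array(row, dtype=int).reshape(1, -1)


features = build_features("Male", 12, 234, 40)
print(features.shape)
```

The result can be passed directly to `model.predict(features)`. Unlike the inline expression, the helper rejects an unexpected gender string instead of silently encoding it as 1.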