# CST4050 - Individual Assessment (block 1)

**Student**:

- Name: Thanima
- Surname:Firoz
- Student number:M00849665
- Campus:Dubai
- Group number:3

## 1. Data Processing

Data processing is a critical step in machine learning because it improves the quality of the data. Raw data is cleaned and organised into a format suitable for building and training models. The dataset provided is a COVID dataset in CSV format with 126 rows and 19476 columns. It contains the categorical attributes Sex and Severity, the numeric attribute Age, and genome sequence features that are continuous variables. There are no null values in the dataset. Severity is the target (dependent) variable; its categories NonICU and ICU will be encoded as the numeric values 0 and 1, respectively, because machine learning algorithms work with numeric values. The genome features are the independent variables, and their values span different ranges, which can reduce the efficiency of the model. Standardization will therefore be applied to bring the features onto a common scale, so that each feature has a mean of 0 and a standard deviation of 1. Because the number of predictors is very large, Principal Component Analysis (PCA) will be performed to reduce the dimensionality of the data while retaining as much information as possible.
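
The integer assigned to each Severity category depends on how the encoder orders the labels. The following is a minimal sketch, assuming the label strings are spelled `ICU` and `NonICU` (the exact spelling in the CSV is an assumption here). It shows that scikit-learn's `LabelEncoder` assigns codes in sorted label order, so it is worth checking `classes_` before interpreting 0 and 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings; the real spelling in the CSV may differ.
severity = ["NonICU", "ICU", "ICU", "NonICU"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(severity)

# classes_ lists the labels in the order of their integer codes.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(encoded))
```

If a specific encoding is intended (for example NonICU as 0 and ICU as 1), an explicit dictionary passed to pandas `Series.map` makes the choice visible instead of relying on the sort order.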


                         Libraries and Packages

List of all packages used in the notebook 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder 
import seaborn as sns 
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', None)



                          Downloading dataset

Reading Dataset

Reading the dataset from the given CSV file.

In [None]:
filename = r'C:\Users\matebook x\Desktop\Data Science\ML\cw\covid_data.csv'
genomic_df1 =pd.read_csv(filename)
genomic_df1

                            Data Pre-Processing

Converting the data into usable format. 

The following modifications were made to the data to get the most out of it:

- Removing the Sample column, as it contributes nothing to the prediction.
- Checking for null values, if any.
- Applying feature engineering: feature encoding and feature binning.

Feature Encoding: most ML algorithms cannot handle categorical variables, so feature encoding is required.

- Sex column: one-hot encoding is applied to convert the categorical variable into numeric form. Each category gets its own column, so three columns are created (male, female and unknown); for each row the column matching its category is set to 1 and the rest are set to 0.
- Severity column: label encoding is applied to convert the categorical variable into numeric form by assigning a numeric value to each category.

Feature Binning: this technique converts a continuous variable into a categorical one.

- Age column: unsupervised binning with fixed-width bins (`pd.cut` with 10-year bin edges) is applied to transform the numeric ages into categorical bins without taking the target class label into account; one-hot encoding is then applied to the bins.
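
To make the difference between the two unsupervised binning strategies concrete, here is a small sketch on made-up ages (the values are illustrative and not taken from the dataset). `pd.cut` uses fixed bin edges, whereas `pd.qcut` chooses edges from quantiles so that the bins hold similar numbers of samples.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([22, 25, 31, 38, 45, 52, 58, 63, 71, 88])

# Equal-width binning: fixed edges, so bin counts can differ a lot.
equal_width = pd.cut(ages, bins=[20, 40, 60, 80, 100])

# Equal-frequency binning: edges are quantiles, so bin counts are similar.
equal_freq = pd.qcut(ages, q=4)

width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

With a small sample, fixed-width bins can end up nearly empty, which produces one-hot columns that carry little information; quantile-based bins avoid that at the cost of less intuitive bin edges.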


In [None]:
# checking the datatypes.
print(genomic_df1.dtypes)

# checking columns having string/object datatype.
genomic_df1.select_dtypes(object)

In [None]:
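# Feature binning of the Age column into fixed-width 10-year bins

agerange=genomic_df1['Age']
bins = [20, 30, 40, 50, 60, 70, 80, 90]
# NOTE: '70-79, 80-89' is a single string, so the 8 edges give 7 bins with 7 labels.
# As a result the (70, 80] bin is labelled '70-79, 80-89' and the (80, 90] bin is
# labelled '90-99'. Later cells refer to these column names, so they are kept as-is.
labels = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79, 80-89','90-99']
agerange1 = pd.cut(agerange, bins, labels = labels,include_lowest = True)
# Replace the numeric Age column with the binned version
genomic_df1.drop('Age',inplace=True,axis=1)
df = pd.concat((genomic_df1, agerange1), axis=1)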



In [None]:
# Label Encoding Severity column.

cols=['Severity']
df[cols]=df[cols].apply(LabelEncoder().fit_transform)

# Deleting Sample column
drop_columns=['Sample']
df.drop(drop_columns,inplace=True,axis=1)

# Separating the Severity label column from the features

without_Severity_column = df.drop('Severity', axis = 1)       
Severity_column = df['Severity']

# Selecting the columns for one-hot encoding.

colums_to_convert = ['Age','Sex']   
colums_to_convert

In [None]:
# Performing one-hot encoding on the Age bins and the Sex column
without_Severity_column = pd.get_dummies(without_Severity_column, columns = colums_to_convert)    

without_Severity_column.head()

In [None]:
# Dropping the one-hot column for the unknown sex value (rows are kept)

drop_columns=['Sex_unknown']
without_Severity_column.drop(drop_columns,inplace=True,axis=1)

In [None]:
#Adding the Severity column again at the last position.

final_data = pd.concat([without_Severity_column, Severity_column], axis = 1)
final_data.head()

In [None]:
final_data.shape

In [None]:
final_data.describe()

In [None]:
# checking for any nullvalues in the dataframe.
final_data.isna().sum()

There are no null values in the dataframe.

                             Data Analysis
 
Visualising the pre-processed data to get an intuition about its different characteristics.

In [None]:
ICU_admission_distribution = final_data['Severity'].value_counts()
print("Total Patients after pre processing: ", sum(ICU_admission_distribution))
print("Distribution of ICU admissions")
print("Patients who were not admitted to ICU: ",ICU_admission_distribution[0])
print("Patients who were admitted to ICU: ",ICU_admission_distribution[1])
labels= ['Admitted to ICU', 'Not Admitted to ICU']
colors=['tomato', 'deepskyblue']
sizes= [ICU_admission_distribution[1], ICU_admission_distribution[0]]
plt.pie(sizes,labels=labels, colors=colors, startangle=90, autopct='%1.1f%%')
plt.title("ICU Distribution of data")
plt.axis('equal')
plt.show()

In [None]:

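# The 'Age_70-79, 80-89' column flags only the patients that pd.cut placed in the
# (70, 80] bin, so the split shown here is "inside that bin" versus "outside it".
Age_distribution = final_data['Age_70-79, 80-89'].value_counts()
print("Age Distribution")
print("Patients outside the 70-80 age bin: ",Age_distribution[0])
print("Patients in the 70-80 age bin: ",Age_distribution[1])
labels= ['Outside 70-80', 'Aged 70-80']
colors=['lightgreen', 'violet']
sizes= [Age_distribution[0], Age_distribution[1]]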
plt.pie(sizes,labels=labels, colors=colors, startangle=90, autopct='%1.1f%%')

plt.axis('equal')
plt.title("Age Distribution of data")

plt.show()

In [None]:
# Sex_male is 1 for male patients and 0 otherwise (female or unknown sex)
Gender = final_data['Sex_male'].value_counts()
print("Total Patients after pre processing: ", sum(Gender))
print("Gender of Patients")
print("Female Patients: ",Gender[0])
print("Male Patients: ",Gender[1])
labels= ['Male Patients', 'Female Patients']
colors=['lightgreen', 'violet']
# sizes must follow the same order as labels: male first, then female
sizes= [Gender[1], Gender[0]]
plt.pie(sizes,labels=labels, colors=colors, startangle=90, autopct='%1.1f%%')

plt.axis('equal')
plt.title("Gender of Patients")

plt.show()

                                    Dimensionality Reduction

Dimensionality reduction transforms the data from a high number of features into a lower number of features. In the given data, the number of observations is small compared to the number of features. This can increase the variance of the model, which could cause overfitting. Moreover, such a large number of features can make the problem computationally intractable, with training taking too long to finish. Here I am using Principal Component Analysis (PCA) for dimensionality reduction. A significant benefit of PCA is that it deals with multicollinearity between the variables in the dataset: performing PCA on the raw data produces linear combinations of the predictors that are uncorrelated.
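
The claim that principal components are uncorrelated can be checked on a toy example. This is a sketch on synthetic data (three strongly collinear features built from one latent variable, not the genomic data); the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Three strongly collinear synthetic features derived from one latent variable.
X_demo = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(3)])

components = PCA(n_components=3).fit_transform(X_demo)

# Correlation matrices before and after the PCA transform.
corr_before = np.corrcoef(X_demo, rowvar=False)
corr_after = np.corrcoef(components, rowvar=False)
print(np.round(corr_before, 2))
print(np.round(corr_after, 2))
```

The off-diagonal entries of the second matrix are numerically zero, while the original features are almost perfectly correlated with one another.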



In [None]:
# Separating the genomic feature columns for dimensionality reduction
Severity=final_data.iloc[:, -1]
# The last 10 columns are the 7 age-bin columns, the 2 sex columns and Severity
df1=final_data.iloc[:, :-10]
df1.head()

In [None]:
# Standardizing the features
x = StandardScaler().fit_transform(df1)
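
As a quick illustration of what standardization does, the sketch below applies `StandardScaler` to two synthetic features on very different scales (the numbers are made up) and checks the resulting mean and standard deviation of each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two synthetic features on very different scales.
raw = np.column_stack([rng.normal(5, 2, 50), rng.normal(1000, 300, 50)])

scaled = StandardScaler().fit_transform(raw)

# After scaling, every column has mean ~0 and standard deviation ~1.
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```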

In [None]:
n_comp = 60
# PCA
print('\nRunning PCA ...')
pca = PCA(n_components=n_comp, svd_solver='full', random_state=1001)
X_pca = pca.fit_transform(x)

print('Explained variance: %.4f' % pca.explained_variance_ratio_.sum())

print('Individual variance contributions:')
for j in range(n_comp):
    print(pca.explained_variance_ratio_[j])

In [None]:
plt.plot(np.arange(pca.n_components_) + 1, pca.explained_variance_ratio_, 'ro-', linewidth=2)
plt.show()


Running Principal Component Analysis shows that the genomic information can be captured with 60 components. Hence, 60 new features in an entirely new feature space are chosen as the final set of genomic features, discarding the remaining, least informative directions. The chosen 60 components have a combined explained variance of 85.6%; the graph above shows the individual contribution of each component.
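
One way to choose the number of components from a variance target, rather than fixing it in advance, is to look at the cumulative explained variance. The sketch below does this on synthetic data from `make_classification`; the 0.85 target and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for a wide feature matrix.
X_demo, _ = make_classification(n_samples=120, n_features=40,
                                n_informative=10, random_state=0)

pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.85  # assumed variance target
# Index of the first component at which the cumulative ratio reaches the target.
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, round(float(cumulative[n_needed - 1]), 4))
```

Plotting `cumulative` against the component index gives a curve from which the same cut-off can be read visually.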

In [None]:
def groupComponents(x, y, classLabels, x_name='PC1', y_name='PC2'):   
    classDict = {}
    classes = np.unique(classLabels)
    for label in classes:
        idx = np.where(classLabels == label)
        classDict[label] = (x[idx], y[idx])
    for lab in classes:
        x, y = classDict[lab]
        plt.scatter(x, y, label=lab,alpha=0.4)
    plt.legend(fontsize=12)
    plt.xlabel(f'Projection on {x_name}', fontsize=12)
    plt.ylabel(f'Projection on {y_name}', fontsize=12)
    
    plt.show()

In [None]:
#Data points projection along PCs
projected1 = np.matmul(x, pca.components_[0])
projected2 = np.matmul(x, pca.components_[1])
projected3 = np.matmul(x, pca.components_[2])
projected4 = np.matmul(x, pca.components_[3])
projected5 = np.matmul(x, pca.components_[4])

plt.scatter(projected1, projected2)
plt.show()

In [None]:
groupComponents(projected1, projected3, Severity,x_name='PC1', y_name='PC3')

In [None]:
expl_df = pd.DataFrame(pca.components_.T[:, :60], columns=[f'PC{x}' for x in range(1,60+1)], index=df1.columns)
expl_df

In [None]:
# combining pca with age and sex column

df_pca = pd.concat([pd.DataFrame(data = X_pca),
                    final_data[['Age_20-29',	'Age_30-39',	'Age_40-49',	'Age_50-59',	'Age_60-69',	
                    'Age_70-79, 80-89',	'Age_90-99',	'Sex_female',	'Sex_male',	'Severity']]],
                    axis = 1)

df_pca.head()

In [None]:
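# Splitting the data into X & Y
# The PCA columns have integer names and the other columns have string names;
# casting all names to strings avoids mixed-type feature names in scikit-learn.
df_pca.columns = df_pca.columns.astype(str)
X = df_pca.drop(columns='Severity')
# Separating out the target as a 1-D Series
Y = df_pca['Severity']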

## 2. Training and tuning



The given dataset contains 19476 columns, of which 19472 are genome sequence features corresponding to a sample from a person of a certain age and sex. These predictors are used to classify the Severity of each sample as either a Non-ICU or an ICU case. Since the task is to classify Severity, a supervised classification algorithm needs to be applied. The data, consisting of Sex, Age and the genome features reduced to 60 components using PCA, will be used to train and test the model. 70% of this data will be the training data, and the remaining 30% will be used as the unseen test data (`test_size=0.30` in the split below). The model can later be validated using k-fold cross-validation, which estimates its accuracy on different folds of the data. Hyperparameter tuning will be used to find the best parameters for training the model, since the behaviour of a machine learning model can be controlled through its hyperparameters. GridSearchCV is one of the most common hyperparameter tuning techniques: we provide a list of values for each parameter, a model is trained for every combination of those values (the grid), and the combination that gives the best cross-validated accuracy is chosen as the hyperparameters for the final model.
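
The workflow described above (hold out a test set, grid-search on the training part, then score once on the held-out part) can be sketched as follows. The data is synthetic, and the grid values, the stratified split and `cv=5` are illustrative assumptions rather than the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in with the same number of rows as the dataset.
X_demo, y_demo = make_classification(n_samples=126, n_features=20, random_state=0)

# Stratifying keeps the class ratio similar in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=1)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)  # cross-validation happens inside the training data only

print(search.best_params_, round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_te, y_te), 3))
```

Keeping the test set out of the search means the final accuracy is measured on samples that played no part in choosing the hyperparameters.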

                 Importing various Libraries

In [None]:

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn import svm
from sklearn import tree
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt 
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

In [None]:
# shape of datasets

print(X.shape)
print(Y.shape)

Separating train & test data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.30, random_state=1)

                      Performing Logistic Regression,Linear SVM,RBF SVM,Decision Tree and Random Forest.

In [None]:


names = ["Linear SVM", "RBF SVM", 
         "Decision Tree", "Random Forest","Logistic Regression"]
classifiers = [
    SVC(kernel="linear", C=100),
    SVC(gamma=2, C=10),
    
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegressionCV()
    
]
scores = []

for clf in classifiers:
    model =  clf.fit(X_train, Y_train)
    y_pred = clf.predict(X_test)
    score = accuracy_score(Y_test, y_pred)
    scores.append(score)
    
for i in range(len(scores)):
    print(names[i] + " : " + str(scores[i]))



Accuracy Results as follows.

Linear SVM : 0.8157894736842105

RBF SVM : 0.5263157894736842

Gaussian Process : 0.5263157894736842

Decision Tree : 0.6842105263157895

Random Forest : 0.6578947368421053

Logistic Regression : 0.7368421052631579

Among these, the highest accuracy is achieved by the Linear SVM (linear kernel).

In [None]:
# train the model using training data
Model_linear=svm.SVC(kernel='linear',C=1)
Model_linear.fit(X_train,Y_train)
#test the model using testing data
y_pred_linear=Model_linear.predict(X_test)

Printing  the results as accuracy, recall, precision and confusion matrix

In [None]:


accuracy_linear=Model_linear.score(X_test,Y_test)
print("Accuracy SVM (Linear Kernel):",metrics.accuracy_score(Y_test,y_pred_linear))
print("Recall SVM (linear Kernel):",recall_score(Y_test,y_pred_linear))
print("Precision SVM (linear Kernel):",precision_score(Y_test,y_pred_linear))
print("confusion matrix SVM (linear Kernel):\n",confusion_matrix(Y_test,y_pred_linear))


Results as follows

Accuracy SVM (Linear Kernel): 0.8157894736842105

Recall SVM (linear Kernel): 0.9444444444444444

Precision SVM (linear Kernel): 0.7391304347826086

confusion matrix SVM (linear Kernel):

 [[14  6]

 [ 1 17]]
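
The reported figures can be re-derived by hand from the confusion matrix above. The sketch below uses scikit-learn's layout (rows are actual classes, columns are predicted classes) and treats class 1 as the positive class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class.
cm_reported = np.array([[14, 6],
                        [1, 17]])
tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tp + tn) / cm_reported.sum()   # 31 / 38
recall = tp / (tp + fn)                    # 17 / 18
precision = tp / (tp + fp)                 # 17 / 23
specificity = tn / (tn + fp)               # 14 / 20

print(accuracy, recall, precision, specificity)
```

The accuracy, recall and precision computed this way agree with the values printed by `accuracy_score`, `recall_score` and `precision_score` above.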

               Tuning the parameters of Linear SVM model using GridSearchCV()            

In [None]:
# we will use the optimizer on the SVM model

Model_SVM=svm.SVC()

# define the parameters for the SVM that we want to optimise

hyperparameter_space=[{'C':[0.1,1,10,100],'gamma':[1,0.1,0.01,0.001],'kernel':['rbf','sigmoid']},{'C':[0.1,1,10,100],'kernel':['linear']}]

# create our optimiser using the set of parameters

optimizer=GridSearchCV(Model_SVM,param_grid=hyperparameter_space,scoring="accuracy",cv=2,return_train_score=True)

# train the model using the optimizer (train with tuning to find the best set of parameters)

optimizer.fit(X_train,Y_train)

# print the results

print("Optimal hyperparameter combination:",optimizer.best_params_)
print()
# best_score_ is the mean accuracy on the held-out validation folds, not on the training folds
print("Mean cross-validated accuracy score:",optimizer.best_score_)



The tuning results of the SVM model are as follows

Optimal hyperparameter combination: {'C': 0.1, 'kernel': 'linear'}

Mean cross-validated accuracy score: 0.8295454545454546


## 3. Model validation

Model validation is the process of evaluating a trained model on a test data set. This gives an estimate of how well the trained model generalises to unseen data.

Here the model is retrained with the best hyperparameter combination found by the grid search and then evaluated on the test set.

In [None]:
# use the best parameter set to train the model

optimizer.best_estimator_.fit(X_train,Y_train)

# use the trained model to test
y_pred=optimizer.best_estimator_.predict(X_test)


# print the test results
print("Test accuracy:",metrics.accuracy_score(Y_test,y_pred))
print(confusion_matrix(Y_test,y_pred))


Finally, we can conclude that our selected model (Linear SVM) has an accuracy of 81.58%.

To improve the model accuracy, several parameters need to be tuned. The three major parameters are:
1. Kernel: The main function of the kernel is to take a low-dimensional input space and transform it into a higher-dimensional space. It is mostly useful in non-linear separation problems.
2. C (Regularisation): C is the penalty parameter for the misclassification or error term. It tells the SVM optimisation how much error is bearable, and so controls the trade-off between a wide-margin decision boundary and the misclassification term. When C is high, the model tries to classify all training points correctly, which increases the chance of overfitting.
3. Gamma: It defines how far the influence of a single training point reaches when computing the line of separation. When gamma is higher, only nearby points have high influence; a low gamma means that far-away points are also considered when determining the decision boundary.
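
The effect of gamma described in the list above can be observed directly. This is a sketch on synthetic data, with gamma values chosen only for illustration: a very large gamma makes the RBF model memorise the training set, which typically shows up as a gap between training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
for gamma in (0.001, 0.1, 50.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each gamma.
    results[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"gamma={gamma}: train={results[gamma][0]:.2f}, test={results[gamma][1]:.2f}")
```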

## 4. Model interpretation



                                          Confusion matrix
                                          
The confusion matrix is presented for the test data using the model with the optimal parameters, where the rows correspond to the actual severity classes and the columns correspond to the predicted severity classes. There is only 1 false negative, but there are 6 false positives, so 7 of the 38 test samples are misclassified.

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, y_pred)
print(cm)

                              Visualizing Confusion Matrix using Heatmap

Visualizing  the results of the model in the form of a confusion matrix using matplotlib and seaborn.    

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

**COMMENTS:** 
* 17 patients were predicted that they **will** be admitted to ICU, the Prediction was CORRECT (True-Positive)
* 14 patients were predicted that they **will NOT** be admitted to ICU, the Prediction was CORRECT (True-Negative)
* 6 patients were predicted that they **will** be admitted to ICU but the Prediction was WRONG (False-Positive)
* 1 patient was predicted that they **will NOT** be admitted to ICU but the Prediction was WRONG (False-Negative)

These counts follow scikit-learn's layout (rows are actual classes, columns are predicted classes), taking class 1 (ICU) as the positive class. They agree with the recall of 17/18 and the precision of 17/23 reported earlier.

In [None]:
def assess(Y_test,y_pred):
  # Unpack the binary confusion matrix (rows: actual, columns: predicted)
  tn, fp, fn, tp = confusion_matrix(Y_test, y_pred).ravel()
  accuracy=(tp+tn)/(tp+fp+fn+tn)
  specificity = tn/(tn+fp)   # true negative rate
  sensitivity=tp/(tp+fn)     # true positive rate (same quantity as recall)
  print("Accuracy:",accuracy*100)
  print("Sensitivity:",sensitivity*100)
  print("Specificity:",specificity*100)
  print("recall: ", metrics.recall_score(Y_test, y_pred))
  print("f1: ", metrics.f1_score(Y_test, y_pred))
  print("ROC_AUC_Score:",roc_auc_score(Y_test, y_pred)*100)
assess(Y_test,y_pred)

In [None]:

print(classification_report(Y_test,y_pred))

                            Applying Kfold cross validation

In [None]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)
# 10-fold cross-validation; np.ravel passes the target as a 1-D array
scores = cross_val_score(clf, X, np.ravel(Y), cv=10)
print(scores)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
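
When scaling and PCA are fitted on the full dataset before cross-validation, every fold has already influenced the transformed features. A common way to avoid this is to wrap the steps in a pipeline so that they are re-fitted inside each training fold. The sketch below shows the idea on synthetic data; the feature count, fold setup and variable names are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic wide dataset: many more features than samples.
X_demo, y_demo = make_classification(n_samples=126, n_features=500,
                                     n_informative=20, random_state=0)

# Scaler and PCA are fitted only on the training part of each fold.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=60),
                         SVC(kernel="linear", C=1))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print("%0.2f accuracy with a standard deviation of %0.2f"
      % (cv_scores.mean(), cv_scores.std()))
```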

## 5. Predictions



The tuned model was used to make predictions on X_test, and the predictions were stored in an object named y_pred.

In [None]:
y_pred=optimizer.best_estimator_.predict(X_test)
y_pred 

y_pred contains the predicted class label (0 or 1) for each test sample, not a probability.
One means the patient will be admitted to ICU and zero means the patient will not be admitted to ICU.
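
For reference, the sketch below shows on synthetic data what an `SVC` returns: `predict` gives class labels, while `decision_function` gives signed distances to the separating hyperplane, whose sign determines the label in the binary case. The data and variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
clf = SVC(kernel="linear", C=0.1).fit(X_demo, y_demo)

labels = clf.predict(X_demo[:5])             # class labels (0 or 1)
margins = clf.decision_function(X_demo[:5])  # signed distances, not probabilities

# In the binary case, a positive margin corresponds to class 1.
labels_from_margins = (margins > 0).astype(int)
print(labels, margins.round(3))
```

The margins are useful for ranking patients by how confidently the model assigns them to a class, but they would need a separate calibration step before being read as probabilities.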

In [None]:
test = X_test.copy()  # work on a copy so X_test itself is not modified
test['Severity'] = Y_test
test['pred'] = y_pred
test.head()

## 6. Discussion



The main purpose of the model is to predict the severity of COVID patients based on genomic features.
The model learns the unknown pattern from the input observations and makes predictions on unseen data. The data is split in a 70:30 ratio into training and testing data. The training data is used to train the model. Once the model is trained, it is used to predict the Severity of the unseen test data. A confusion matrix heatmap was plotted to check the predictive model's performance, and the accuracy score and precision were checked. The accuracy score obtained is 81.58%, which suggests that the model fits the data reasonably well. The confusion matrix shows the number of observations correctly classified. The model can therefore predict the Severity of patients admitted with COVID with reasonable accuracy. The model is safe to proceed with because of the low number of False Negatives. However, higher precision may actually be desirable. As we are not diagnosing illness but determining whether a person is likely to need an ICU bed, False Positives may lead to beds being occupied unnecessarily. If beds are extremely limited (which they are increasingly becoming), this would not be ideal.

Support Vector Machines (SVM) are supervised machine learning models used for pattern recognition, classification and regression analysis.

The advantages of using an SVM model are as follows.

1)In classification tasks, SVM is often favoured over other methods because it provides a global solution for data classification.
2)SVM model is able to find non-linear separations between classes.
3)SVM model is very effective in high dimensional spaces.
4)SVM model is relatively memory efficient as it uses a subset of training points in the decision function (called support vectors).
5)SVM model utilizes a regularisation parameter and can tune the bias-variance tradeoff.
6)SVM model is effective in cases where the number of dimensions is greater than the number of samples.

The limitations of the SVM model are as follows.

1)SVM model will not perform well when the data is noisy, as the target classes will overlap.
2)If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and the regularization term is crucial.
3)SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

## 7. Complexity



The complexity of an algorithm/model can be expressed using Big O notation, which gives an upper bound on an algorithm's resource use. Two aspects are considered: 1) time complexity, which deals with how long the algorithm takes to execute, and 2) space complexity, which deals with how much memory the algorithm uses.

The complexity of the SVM model is as follows.

Training time complexity = O(n²), where n is the number of training samples.

Note: if n is large, avoid using SVM.


Run-time complexity = O(k*d)

k = number of support vectors, d = dimensionality of the data
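
The O(k*d) prediction cost can be seen by recomputing the decision function by hand: each prediction needs one kernel evaluation (a length-d dot product for the linear kernel) against each of the k support vectors. The sketch below does this on synthetic data using the fitted attributes `support_vectors_`, `dual_coef_` and `intercept_` of a binary `SVC`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
clf = SVC(kernel="linear", C=1).fit(X_demo, y_demo)

k, d = clf.support_vectors_.shape  # k support vectors, d features

# Linear kernel between every support vector and every sample: shape (k, n).
kernel_values = clf.support_vectors_ @ X_demo.T
# Weighted sum over the k support vectors plus the intercept.
manual = (clf.dual_coef_ @ kernel_values).ravel() + clf.intercept_[0]

print(k, d, np.allclose(manual, clf.decision_function(X_demo), atol=1e-6))
```

Because only the support vectors enter the sum, a model with fewer support vectors is both smaller to store and faster at prediction time.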

Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating support vectors from the rest of the training data. The QP solver used by the libsvm-based implementation scales between $O(n_{features} \times n_{samples}^2)$ and $O(n_{features} \times n_{samples}^3)$, depending on how efficiently the libsvm cache is used in practice (dataset dependent). If the data is very sparse, $n_{features}$ should be replaced by the average number of non-zero features in a sample vector. For the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.

It is always better to reduce the dimension of the data to decrease the computation complexities.

SVM algorithms are not scale invariant and hence the data should be scaled.
