# Logistic Regression 

In [1]:
import numpy as np 
import pandas as pd 

from functools import reduce 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix


`logistic_regression` function performs logistic regression on a given dataset. The function takes two arguments:  data , which is the dataset, and  result_group , which specifies the target variable to be predicted (1 for "MCQ160L" or 2 for "MCQ220").

In [2]:
def logistic_regression(data, result_group):

    if result_group == 1 :
        # Split the data into training and test sets
        X_train, X_test, y_train, y_test = train_test_split(data.drop(["MCQ220", "MCQ160L"], axis=1), data["MCQ160L"], test_size=0.2)

    elif result_group == 2:
        # Split the data into training and test sets
        X_train, X_test, y_train, y_test = train_test_split(data.drop(["MCQ220", "MCQ160L"], axis=1), data["MCQ220"], test_size=0.2)


    # Train the logistic regression model on the training data
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Evaluate the logistic regression model on the test data
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("\n Accuracy:", accuracy)
    
    # Generate the confusion matrix
    confusion_matrix_logisticR= confusion_matrix(y_test, y_pred)

    print("\n Confusion Matrix:")
    print(confusion_matrix_logisticR)
            


## Results of First Aproach

In [6]:


path = "/home/asma-rashidian/Documents/DrRahmani_projects/project1/data/result/integrated_data_p2.csv"
data = pd.read_csv(path)
## PCA 
print("-" * 100)
print(" The logistic regression for Liver disease in first approch base on PCA dimension reduction method is:")
logistic_regression(data, 1)
print("*" * 50)
print(" The logistic regression for Cancer in first approch base on PCA dimension reduction method is:")
logistic_regression(data, 2)
print("-" * 100)

## Isomap
path = "/home/asma-rashidian/Documents/DrRahmani_projects/project1/data/result/integrated_isomap.csv"
data = pd.read_csv(path)
print("-" * 100)
print(" The logistic regression for Liver disease in first approch base on Isomap dimension reduction method is:")
logistic_regression(data, 1)
print("*" * 50)
print(" The logistic regression for Cancer in first approch base on Isomap dimension reduction method is:")
logistic_regression(data, 2)
print("-" * 100)



----------------------------------------------------------------------------------------------------
 The logistic regression for Liver disease in first approch base on PCA dimension reduction method is:

 Accuracy: 0.8820638820638821

 Confusion Matrix:
[[ 728    0  110    0]
 [   3    0   51    0]
 [  74    0 1067    0]
 [   0    0    2    0]]
**************************************************
 The logistic regression for Cancer in first approch base on PCA dimension reduction method is:

 Accuracy: 0.8324324324324325

 Confusion Matrix:
[[803   0 125]
 [  8   0  97]
 [111   0 891]]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
 The logistic regression for Liver disease in first approch base on Isomap dimension reduction method is:

 Accuracy: 0.8162162162162162

 Confusion Matrix:
[[ 608    0  253    0]
 [   4    0   42    0]
 [ 

## Second Approach 

In [7]:


path = "/home/asma-rashidian/Documents/DrRahmani_projects/project1/data/result/pca_transformation_outlier_removal.csv"
data = pd.read_csv(path)
## PCA 
print("-" * 100)
print(" The logistic regression for Liver disease in first approch base on PCA dimension reduction method is:")
logistic_regression(data, 1)
print("*" * 50)
print(" The logistic regression for Cancer in first approch base on PCA dimension reduction method is:")
logistic_regression(data, 2)
print("-" * 100)

## Isomap
path = "/home/asma-rashidian/Documents/DrRahmani_projects/project1/data/result/isomap_transformation_outlier_removal.csv"
data = pd.read_csv(path)
print("-" * 100)
print(" The logistic regression for Liver disease in first approch base on Isomap dimension reduction method is:")
logistic_regression(data, 1)
print("*" * 50)
print(" The logistic regression for Cancer in first approch base on Isomap dimension reduction method is:")
logistic_regression(data, 2)
print("-" * 100)

----------------------------------------------------------------------------------------------------
 The logistic regression for Liver disease in first approch base on PCA dimension reduction method is:

 Accuracy: 0.88992628992629

 Confusion Matrix:
[[ 775    0   71    0]
 [   3    0   36    0]
 [ 113    0 1036    0]
 [   0    0    1    0]]
**************************************************
 The logistic regression for Cancer in first approch base on PCA dimension reduction method is:

 Accuracy: 0.8501228501228502

 Confusion Matrix:
[[807   0  76]
 [  3   0 103]
 [121   2 923]]
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
 The logistic regression for Liver disease in first approch base on Isomap dimension reduction method is:

 Accuracy: 0.7906633906633906

 Confusion Matrix:
[[ 605    0  307    0]
 [   0    0   45    0]
 [  7

# Analysing the Results 

## First approach

For the liver disease classification task using PCA dimension reduction method:
- The accuracy of the logistic regression model is approximately 0.890, indicating that the model predicted the correct outcome for around 89.0% of the test samples.
- The confusion matrix shows the distribution of predicted classes compared to the actual classes. It reveals that the model made correct predictions for 775 samples of class 1, 0 samples of class 2, 1036 samples of class 3, and 0 samples of class 4.

For the cancer classification task using PCA dimension reduction method:
- The accuracy of the logistic regression model is approximately 0.850, indicating that the model predicted the correct outcome for around 85.0% of the test samples.
- The confusion matrix shows the distribution of predicted classes compared to the actual classes. It reveals that the model made correct predictions for 807 samples of class 1, 0 samples of class 2, and 923 samples of class 3.

For the liver disease classification task using Isomap dimension reduction method:
- The accuracy of the logistic regression model is approximately 0.791, indicating that the model predicted the correct outcome for around 79.1% of the test samples.
- The confusion matrix shows the distribution of predicted classes compared to the actual classes. It reveals that the model made correct predictions for 605 samples of class 1, 0 samples of class 2, 1004 samples of class 3, and 0 samples of class 4.

For the cancer classification task using Isomap dimension reduction method:
- The accuracy of the logistic regression model is approximately 0.788, indicating that the model predicted the correct outcome for around 78.8% of the test samples.
- The confusion matrix shows the distribution of predicted classes compared to the actual classes. It reveals that the model made correct predictions for 620 samples of class 1, 0 samples of class 2, and 984 samples of class 3.

These results suggest that the logistic regression models performed reasonably well in predicting the outcomes for both the liver disease and cancer classification tasks. However, further analysis and evaluation may be required to assess the overall performance and generalizability of the models.

## Second aproach 

For the liver disease classification task:
- The logistic regression model with PCA dimension reduction achieved an accuracy of approximately 0.890.
- The confusion matrix shows that the model correctly predicted 775 samples of class 1, 0 samples of class 2, 1036 samples of class 3, and 0 samples of class 4.
- The model achieved a relatively high accuracy and showed good performance in predicting class 1 and class 3.

For the cancer classification task:
- The logistic regression model with PCA dimension reduction achieved an accuracy of approximately 0.850.
- The confusion matrix shows that the model correctly predicted 807 samples of class 1, 0 samples of class 2, and 923 samples of class 3.
- The model achieved a relatively high accuracy and showed good performance in predicting class 1 and class 3.

For the liver disease classification task with Isomap dimension reduction:
- The logistic regression model achieved an accuracy of approximately 0.791.
- The confusion matrix shows that the model correctly predicted 605 samples of class 1, 0 samples of class 2, 1004 samples of class 3, and 0 samples of class 4.
- The model achieved a moderate accuracy and showed good performance in predicting class 1 and class 3.

For the cancer classification task with Isomap dimension reduction:
- The logistic regression model achieved an accuracy of approximately 0.788.
- The confusion matrix shows that the model correctly predicted 620 samples of class 1, 0 samples of class 2, and 984 samples of class 3.
- The model achieved a moderate accuracy and showed good performance in predicting class 1 and class 3.

Overall, the logistic regression models with PCA dimension reduction performed slightly better in terms of accuracy compared to the models with Isomap dimension reduction for both the liver disease and cancer classification tasks. However, further analysis and evaluation may be required to assess the models' performance comprehensively.

## Feature selection and Dimendion reduction Benfits VS Negative effects!

In this project, attribute selection and dimensionality reduction can have several benefits: 
 
1. Improved Model Performance: By selecting relevant attributes and reducing the dimensionality of the data, we can focus on the most important features that contribute to the target variable. This can lead to improved model performance, as the model can better capture the underlying patterns and relationships in the data. 
 
2. Reduced Overfitting: Dimensionality reduction techniques help in reducing the complexity of the model by eliminating irrelevant or redundant features. This can prevent overfitting, where the model becomes too specific to the training data and fails to generalize well to unseen data. 
 
3. Faster Training and Inference: With fewer attributes and reduced dimensions, the model requires less computational resources and time for training and making predictions. This can be especially beneficial when dealing with large datasets or real-time applications. 
 
However, there can be potential negative effects of attribute selection and dimensionality reduction: 
 
1. Loss of Information: Removing certain attributes or reducing dimensions can result in the loss of some information from the original dataset. This can lead to a loss of important patterns or relationships that could be useful for the model. 
 
2. Increased Complexity: Some dimensionality reduction techniques, such as non-linear methods, can introduce additional complexity to the model. This may make it harder to interpret the results and understand the underlying relationships in the data. 

## Comparison between first and second approach 


Base on the results the first approach "Considering the outliers" had a little better performance Although they both had good results! 