## Workshop Week 6

## Logistic Regression
The Breast Cancer data from [the UCI repository](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) contains records of observed tumor cases. Each case has a number of observations and a categorisation in the `class` column: 2 for benign (good), 4 for malignant (bad). Your task is to build a logistic regression model to classify these cases.

The data is provided as a CSV file. In a small number of cases no value is available; these are indicated in the data with `?`. I have used the `na_values` keyword for `read_csv` to have these interpreted as `NaN` (Not a Number). Your first task is to decide what to do with these rows. You could just drop them, or you could [impute the missing values from the other data](http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values).

You then need to follow the procedure outlined in the lecture for generating a train/test set, building and evaluating a model. Your goal is to build the best model possible over this data.   Your first step should be to build a logistic regression model using all of the features that are available.
  

In [25]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_selection import RFE

In [26]:
bcancer = pd.read_csv("C:/Users/BEYOND/Downloads/breast-cancer-wisconsin.csv", na_values="?")
bcancer.head()

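| | sample_code_number | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |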


In [28]:
# Examine the data: check number of rows and number of columns
num_rows, num_columns = bcancer.shape

print("Number of rows:", num_rows)
print("Number of columns:", num_columns)

Number of rows: 699
Number of columns: 11


In [30]:
# Look at the statistical summary of the dataframe
summary = bcancer.describe()

print(summary)

       sample_code_number  clump_thickness  uniformity_cell_size  \
count        6.990000e+02       699.000000            699.000000   
mean         1.071704e+06         4.417740              3.134478   
std          6.170957e+05         2.815741              3.051459   
min          6.163400e+04         1.000000              1.000000   
25%          8.706885e+05         2.000000              1.000000   
50%          1.171710e+06         4.000000              1.000000   
75%          1.238298e+06         6.000000              5.000000   
max          1.345435e+07        10.000000             10.000000   

       uniformity_cell_shape  marginal_adhesion  single_epithelial_cell_size  \
count             699.000000         699.000000                   699.000000   
mean                3.207439           2.806867                     3.216023   
std                 2.971913           2.855379                     2.214300   
min                 1.000000           1.000000                    

In [31]:
# Check how many classes there are in the "class" column and how often each occurs
class_counts = bcancer['class'].value_counts()

print("Number of classes in the 'class' column:")
print(class_counts)

Number of classes in the 'class' column:
2    458
4    241
Name: class, dtype: int64


In [32]:
# Check the number of samples for each class and comment on whether the dataset is balanced
class_counts = bcancer['class'].value_counts()

print("Number of samples for each class:")
print(class_counts)

# Treat the dataset as balanced only if every class has exactly the same count
is_balanced = all(count == class_counts.iloc[0] for count in class_counts)
if is_balanced:
    print("The dataset is balanced.")
else:
    print("The dataset is imbalanced.")

Number of samples for each class:
2    458
4    241
Name: class, dtype: int64
The dataset is imbalanced.
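
Exact equality of class counts is a strict test, and a proportion is often more informative. As a small sketch using the counts printed above (458 benign, 241 malignant), the share of each class also gives the accuracy a model would reach by always predicting the majority class. The variable names are illustrative only.

```python
# Class counts taken from the value_counts() output above
class_counts = {2: 458, 4: 241}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most common class
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = proportions[majority_label]

for label, share in proportions.items():
    print(f"class {label}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should be compared against this baseline rather than against zero.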


In [33]:
# Deal with the NaN values in the data

# Option 1: drop rows that contain NaN values
cleaned_bcancer_option1 = bcancer.dropna()

# Option 2: drop columns that contain NaN values
cleaned_bcancer_option2 = bcancer.dropna(axis=1)

# Option 3: impute NaN values with the mean of each column
cleaned_bcancer_option3 = bcancer.fillna(bcancer.mean())

# Option 4: replace NaN values with a fixed value, such as 0
cleaned_bcancer_option4 = bcancer.fillna(0)

# Print the dimensions of the cleaned dataframes to compare
print("Original bcancer data shape:", bcancer.shape)
print("After removing rows with NaN values shape:", cleaned_bcancer_option1.shape)
print("After removing columns with NaN values shape:", cleaned_bcancer_option2.shape)
print("After imputing NaN values with mean shape:", cleaned_bcancer_option3.shape)
print("After imputing NaN values with 0 shape:", cleaned_bcancer_option4.shape)

Original bcancer data shape: (699, 11)
After removing rows with NaN values shape: (683, 11)
After removing columns with NaN values shape: (699, 10)
After imputing NaN values with mean shape: (699, 11)
After imputing NaN values with 0 shape: (699, 11)
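
Before choosing between dropping and imputing, it can help to see which columns actually contain gaps. The following is a minimal sketch on a hypothetical five-row DataFrame (`toy` is not the workshop data); it counts missing values per column and then fills them with the column median using `SimpleImputer`, as an alternative to the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature of the workshop data: one column has gaps
toy = pd.DataFrame({
    "clump_thickness": [5, 3, 6, 4, 8],
    "bare_nuclei": [1.0, np.nan, 4.0, np.nan, 10.0],
    "class": [2, 2, 2, 2, 4],
})

# Count missing values per column to see where the gaps are
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Median imputation is less sensitive to extreme values than the mean
features = toy.drop(columns=["class"])
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
print(imputed)
```

Filling with 0 is usually a poor fit for features like these, whose observed values start at 1.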


In [34]:
# Split your data into training (80%) and testing (20%) sets, using random_state=142

# Split the data into features (X) and target variable (y).
# Note: X is taken from the original DataFrame, so it still contains the NaN values
# and the sample_code_number identifier column.
X = bcancer.drop(columns=['class'])
y = bcancer['class']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=142)

# Print the shapes of the resulting datasets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (559, 10)
Shape of X_test: (140, 10)
Shape of y_train: (559,)
Shape of y_test: (140,)
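
Because the classes are not equally represented, a plain random split can give train and test sets with slightly different class ratios. One option, sketched here on hypothetical toy labels rather than the workshop data, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the same imbalance as the workshop data
y_toy = np.array([2] * 65 + [4] * 35)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=142, stratify=y_toy
)

# Both splits keep the 65/35 class ratio
print("train:", {label: int((y_tr == label).sum()) for label in (2, 4)})
print("test: ", {label: int((y_te == label).sum()) for label in (2, 4)})
```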


In [37]:
# Build your Logistic Regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

# Impute NaN values with the mean of each column.
# Note: the imputer is fitted on all rows before the split,
# so the test rows also contribute to the imputed means.
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Splitting the imputed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=142)

# Creating a Logistic Regression model object
logreg_model = LogisticRegression()

# Fitting the model to the training data
logreg_model.fit(X_train, y_train)

# Making predictions on the testing data
y_pred = logreg_model.predict(X_test)

# Evaluating the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression model:", accuracy)



Accuracy of Logistic Regression model: 0.6928571428571428


In [38]:
# Make predictions on the test set
y_pred = logreg_model.predict(X_test)

# Display the predicted values
print("Predicted values on the test set:")
print(y_pred)


Predicted values on the test set:
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


### Evaluation

To evaluate a classification model we want to look at how many cases were correctly classified and how many were in error. In this case we have two outcomes: benign and malignant. scikit-learn has some useful tools: the `accuracy_score` function gives a score from 0 to 1 for the proportion of cases classified correctly, and the [confusion_matrix](http://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) function shows how many cases of each class were classified correctly and what errors were made. Use these to summarise the performance of your model (these functions have already been imported above).

In [39]:
# Evaluate the performance of your trained model
from sklearn.metrics import accuracy_score, confusion_matrix

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the Logistic Regression model:", accuracy)

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)


Accuracy of the Logistic Regression model: 0.6928571428571428

Confusion Matrix:
[[97  0]
 [43  0]]
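
The confusion matrix can be read directly: in scikit-learn, rows are the true classes and columns are the predicted classes, both in sorted label order (here 2, then 4). As a small sketch, the numbers reported above can be turned into the usual rates by hand. Treating malignant (4) as the "positive" class is a naming choice made for this example.

```python
# Confusion matrix reported above; rows are true classes, columns are predictions,
# both in sorted label order (2 = benign, 4 = malignant)
cm = [[97, 0],
      [43, 0]]

tn, fp = cm[0]  # true benign: predicted benign, predicted malignant
fn, tp = cm[1]  # true malignant: predicted benign, predicted malignant
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)   # share of malignant cases detected (recall)
specificity = tn / (tn + fp)   # share of benign cases correctly cleared

print(f"accuracy:    {accuracy:.3f}")
print(f"sensitivity: {sensitivity:.3f}")
print(f"specificity: {specificity:.3f}")
print(f"false negatives: {fn}, false positives: {fp}")
```

The second column of the matrix is all zeros, which means the model never predicted malignant for any test case.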


**This is the checkpoint mark for this week's workshop. You need to report the `Accuracy Score` on the test set and show the `confusion matrix`. You also need to provide an analysis based on the results you got.**

### Feature Selection

Since you have many features available, one part of building the best model will be to select which features to use as input to the classifier. Your initial model used all of the features, but it is possible that a better model can be built by leaving some of them out. Test this by building a few models with subsets of the features. How do your models perform?

This process can be automated. The [sklearn RFE class](http://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination) implements __Recursive Feature Elimination__, which removes the weakest features one by one, refitting the model each time, until a target number of features remains. Use RFE to select features for models with 3, 4 and 5 features. Can you build a model that is as good as or better than your initial model?

In [41]:
# Initialize Logistic Regression model
logreg_model = LogisticRegression()

# Convert the feature arrays to DataFrames with column names (once, outside the loop)
X_train_df = pd.DataFrame(X_train, columns=X.columns)
X_test_df = pd.DataFrame(X_test, columns=X.columns)

# Number of features to select
num_features = [3, 4, 5]

for n in num_features:
    # Fit RFE to keep the n highest-ranked features
    rfe = RFE(estimator=logreg_model, n_features_to_select=n)
    rfe.fit(X_train_df, y_train)

    # Get selected features
    selected_features = X_train_df.columns[rfe.support_]

    # Train Logistic Regression model with selected features
    logreg_model.fit(X_train_df[selected_features], y_train)

    # Make predictions on test set
    y_pred = logreg_model.predict(X_test_df[selected_features])

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Print results
    print(f"Number of features selected: {n}")
    print(f"Selected features: {selected_features}")
    print(f"Accuracy: {accuracy}\n")


Number of features selected: 3
Selected features: Index(['sample_code_number', 'uniformity_cell_size', 'bare_nuclei'], dtype='object')
Accuracy: 0.6928571428571428

Number of features selected: 4
Selected features: Index(['sample_code_number', 'uniformity_cell_size', 'uniformity_cell_shape',
       'bare_nuclei'],
      dtype='object')
Accuracy: 0.6928571428571428

Number of features selected: 5
Selected features: Index(['sample_code_number', 'uniformity_cell_size', 'uniformity_cell_shape',
       'bare_nuclei', 'normal_nucleoli'],
      dtype='object')
Accuracy: 0.6928571428571428
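
All three selections above include `sample_code_number`, which is an identifier rather than a measurement. RFE with a linear model ranks features by the size of their coefficients, so features on very different scales are hard to compare. The sketch below runs RFE on standardised features to show the mechanics. It uses hypothetical synthetic data from `make_classification`, and the feature names `f0` to `f8` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 9 features, of which the first 3 carry signal
X_syn, y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"f{i}" for i in range(X_syn.shape[1])]

# Standardise so coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(X_syn)

selected_by_size = {}
for n in [3, 4, 5]:
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=n)
    rfe.fit(X_scaled, y_syn)
    # support_ marks the kept features; ranking_ is 1 for kept features
    # and grows for features that were eliminated earlier
    selected_by_size[n] = [
        name for name, keep in zip(feature_names, rfe.support_) if keep
    ]
    print(n, selected_by_size[n], rfe.ranking_.tolist())
```

In a full experiment the scaler would be fitted on the training split only; it is fitted on all rows here just to keep the sketch short.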



## Conclusion

Write a brief conclusion to your experiment.  You might comment on the proportion of __false positive__ and __false negative__ classifications your model makes.  How useful would this model be in a clinical diagnostic setting? 

In this experiment, we built a Logistic Regression model for breast cancer classification using the Wisconsin Breast Cancer dataset. The initial model, trained on all available features, reached an accuracy of 0.693 on the test set (97 of 140 cases correct).

Next, we employed Recursive Feature Elimination (RFE) to select subsets of features with 3, 4, and 5 features, respectively, and trained Logistic Regression models using these subsets. We then evaluated the performance of each model in terms of accuracy.

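All three reduced models scored 0.693 on the test set, the same as the initial model. However, the confusion matrix shows that the initial model labelled all 140 test cases as benign, and 0.693 is exactly the proportion of benign cases in the test set (97 of 140). Matching that score therefore does not show that a smaller feature set is as effective; it only shows that none of the models did better than always predicting benign. Every RFE selection also included `sample_code_number`, which is an identifier rather than a measurement, so the selections themselves should be treated with caution.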

In a clinical diagnostic setting, accuracy has to be read together with the false positives and false negatives. A false positive is a benign case that the model predicts as malignant, potentially causing unnecessary anxiety and medical interventions. A false negative is a malignant case that the model predicts as benign, potentially delaying necessary treatment. On the test set the initial model produced no false positives and 43 false negatives: every malignant case was missed.

False negatives are particularly concerning in clinical practice, as they may lead to delayed treatment and worse patient outcomes. Accuracy alone is therefore not enough: the false positive and false negative rates must be evaluated as well. With all 43 malignant test cases classified as benign, the current model would not be useful for diagnosis. Further tuning and validation of the model, possibly incorporating additional clinical factors and expert knowledge, would be necessary to improve its utility and reliability in clinical practice.
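
One form of tuning that targets false negatives is lowering the probability threshold at which a case is called malignant. This was not part of the experiment above. The following is only a sketch on hypothetical synthetic data, in which class 1 stands in for "malignant", showing how `predict_proba` can be thresholded by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data; class 1 plays the role of "malignant"
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=142, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob_malignant = clf.predict_proba(X_te)[:, 1]  # column 1 matches clf.classes_[1]

recalls = {}
for threshold in (0.5, 0.3):
    y_hat = (prob_malignant >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    false_negatives = int(((y_te == 1) & (y_hat == 0)).sum())
    print(f"threshold {threshold}: recall {recalls[threshold]:.3f}, "
          f"false negatives {false_negatives}")
```

A lower threshold can only keep or reduce the number of false negatives, but it typically increases false positives, so the choice depends on the relative cost of the two errors.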