<a href="https://colab.research.google.com/github/LolaLS/My_Junior_Venture/blob/main/Junior_Venture_BCWisconsinDataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **JUNIOR VENTURE PROJECT**

Project Notes:

*   Since the RSNA Kaggle dataset was too large, I switched to a different dataset that was uploaded by the University of California, Irvine (UCI).
*   This new dataset is much smaller and includes features that have already been extracted from digitized images of fine needle aspirates (FNA) of breast masses, rather than the images themselves.

Citations:


*   Specific data sources and help are cited in the code comments below.
*   Coding help was sourced from Kaggle Learn.
*   Code explanations were sourced from ChatGPT.
*   I also used online documentation for specific guidance while working with libraries.

Updates:


*   17-18/12/2023: So far, I have loaded in all of the data, performed various visualizations, and created a couple of very basic models (Decision Tree Regressor and Random Forests Regressor).
*   21/12/2023: Changed the n_estimators parameter for the random forest to optimize the model based on trial and error. Checked for missing values and categorical variables. Used a label encoder to convert categorical variables. Tried implementing a pipeline, but it wasn't working.
*   22/12/2023: Implemented a working pipeline using the same random forests model as before.
*   23/12/2023: Tried to implement cross validation. The code ran, but I must still interpret the results to determine accuracy because the cross validation score uses a different metric from the one I have been using thus far (MAE).
*   26/12/2023: Finished implementing cross validation and received a 98.4% accuracy score with a random forests model and cv of 9. Also received help from ChatGPT in identifying that I was using a regressor instead of a classifier for binary classification (which is why the accuracy score metric was not working with my model).

Next Steps:


*   Implement grid search.
*   Use different metrics of accuracy to analyze performance and results, such as a confusion matrix (check false positives/false negatives...).
*   Use more advanced models and compare outcomes.
*   Potentially look at other users' models or my own and see which specific parts of the dataset may not be well diagnosed. Attempt to create a model that specifically targets these flaws.
*   Try to obtain image data and create new models.



In [1]:
! pip install -q kaggle

# SETUP HELP: https://www.youtube.com/watch?v=98xlJvuLMtI

In [None]:
from google.colab import files
files.upload()

In [None]:
! mkdir -p ~/.kaggle

In [4]:
! cp kaggle.json ~/.kaggle/

In [5]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets download -d uciml/breast-cancer-wisconsin-data --force

# ORIGINAL DATA SOURCE: https://data.world/health/breast-cancer-wisconsin
#                       https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
# KAGGLE DATA SOURCE: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data (original upload by UCI Machine Learning)

In [None]:
! unzip breast-cancer-wisconsin-data

In [42]:
# All of the necessary imports.

import pandas as pd
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np

In [9]:
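import os
# Making sure the unzipped data file is present before trying to load it.
if not os.path.exists("data.csv"):
    raise FileNotFoundError("data.csv was not found. Run the Kaggle download and unzip cells above first.")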

In [10]:
train_file_path = 'data.csv'
bc_data = pd.read_csv(train_file_path)

In [None]:
bc_data.head(5) # Visualizing the first 5 rows of the dataset.

In [None]:
bc_data.columns # Printing the names of all of the columns of the dataset.

In [None]:
# Checking for missing values. There are none except 'Unnamed: 32'.

missing_columns = [col for col in bc_data.columns if bc_data[col].isnull().any()] # Note that the .isnull() function checks if values are missing while the .any() function checks whether there are any values that satisfy the condition.
print(missing_columns)
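
The missing-value check above can be followed by a cleanup step. This is a minimal sketch on a toy dataframe (the `toy` data and the `cleaned` name are made up for illustration), assuming the flagged column is completely empty, as 'Unnamed: 32' appears to be:

```python
import numpy as np
import pandas as pd

# Toy stand-in for bc_data: two real columns plus one entirely empty column.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 13.54],
    "diagnosis": ["M", "M", "B"],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Count the missing values in each column and show only the affected columns.
missing_counts = toy.isnull().sum()
print(missing_counts[missing_counts > 0])

# Drop every column in which all values are missing.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Using `how="all"` only removes columns that are empty from top to bottom, so a column with just a few gaps would be kept.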

In [None]:
# Checking for categorical variables. The only one is 'diagnosis', which makes sense because this uses 'M' and 'B' to identify malignant and benign cases. (Help from Chat GPT to break code down).

finding_cat_col = (bc_data.dtypes == 'object') # Finding the columns that have categorical values (M and B for diagnosis). In these cases, the data type of the column is 'object'. This is known as a boolean mask and will return true or false values for each column, depending on the type of the entries.
categorical_columns = list(finding_cat_col[finding_cat_col].index) # The finding_cat_col[finding_cat_col] line gives the columns that result in the previous statement being true.
print(categorical_columns) # Prints the names of the columns selected by the mask above.
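
To make the boolean mask easier to follow, here is a small sketch on a made-up dataframe (`toy` is hypothetical, and the 'diagnosis' column is explicitly given the object dtype). It also shows `select_dtypes`, which should give the same list in one step:

```python
import pandas as pd

# Toy stand-in for bc_data with one text column stored as dtype 'object'.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": pd.Series(["M", "B", "B"], dtype=object),
    "radius_mean": [17.99, 13.54, 12.45],
})

# Step 1: one True/False entry per column (True where the dtype is 'object').
is_object = (toy.dtypes == "object")
print(is_object)

# Step 2: index the mask with itself to keep only the True entries,
# then read off the column names from the index.
categorical_columns = list(is_object[is_object].index)
print(categorical_columns)

# Alternative: select_dtypes filters the columns by dtype directly.
same_columns = toy.select_dtypes(include="object").columns.tolist()
print(same_columns)
```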

In [None]:
# Creating an encoder to transform categorical values to numbers.

label_encoder = LabelEncoder() # Define the encoder.
label_columns = pd.DataFrame(label_encoder.fit_transform(bc_data['diagnosis']), columns = ['diagnosis']) # Transform the categorical column to numerical values using the encoder. LabelEncoder expects a 1-D input, so a single column is passed in. Make sure that the result is a dataframe with a column name of 'diagnosis' so that it can be concatenated with the other dataframe.

missing_y_bc_data = bc_data.drop(categorical_columns, axis = 1) # Drop the original categorical columns from the dataset. When axis = 1, we are dropping columns. When axis = 0, we are dropping rows.
encoded_y_bc_data = pd.concat([missing_y_bc_data, label_columns], axis=1) # Concatenate the old data minus the categorical columns with the encoded columns. With axis = 1, the dataframes are joined side by side (as columns). Used ChatGPT to understand the axis.

encoded_y_bc_data # This will be a dataframe.
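
It can help to check which number the encoder assigned to each letter. The sketch below uses a short made-up series (`diagnosis` and `mapping` are illustrative names); LabelEncoder sorts the classes, so 'B' should become 0 and 'M' should become 1:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A short made-up diagnosis column.
diagnosis = pd.Series(["M", "B", "B", "M"], name="diagnosis")

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the numbers back into the original letters.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```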

In [None]:
# This is another way to convert the categorical diagnosis values to numbers. Don't need this because I used the label encoder above.

"""

# Models require that the y-values be floats (not strings).
# This code changes each of the letters to corresponding numbers.

for i in range(len(y)):
  if y[i] == 'M':
    y[i] = 1.0
  elif y[i] == 'B':
    y[i] = 0.0

"""

In [None]:
# Assigning the diagnosis values to y.

original_y = bc_data.diagnosis
print("Original Diagnosis Type:", original_y.dtype) # The type used to be an object: M or B.

y = encoded_y_bc_data.diagnosis
print("Encoded Diagnosis Type:", y.dtype) # The type is now an integer: 1 (M) or 0 (B).

In [None]:
# Counting the number of malignant and benign cases in the original dataset (categorical diagnosis values).

M_counter = 0
B_counter = 0
for i in range(len(original_y)):
  if original_y[i] == 'M':
    M_counter = M_counter + 1
  elif original_y[i] == 'B':
    B_counter = B_counter + 1

print(M_counter)
print(B_counter)

In [None]:
# Counting the number of malignant and benign cases in the new dataset (integer diagnosis values). Comparing these values to the values found above to make sure there is consistency.

M_counter = 0
B_counter = 0
for i in range(len(y)):
  if y[i] == 1:
    M_counter = M_counter + 1
  elif y[i] == 0:
    B_counter = B_counter + 1

print(M_counter)
print(B_counter)
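
The two counting loops above could also be written with pandas' `value_counts`. This is only a sketch on made-up data (the `original` and `encoded` series are hypothetical stand-ins for `original_y` and `y`):

```python
import pandas as pd

# Made-up stand-ins for the original and encoded diagnosis columns.
original = pd.Series(["M", "B", "B", "M", "B"])
encoded = original.map({"B": 0, "M": 1})

# value_counts returns the number of rows for each distinct value.
original_counts = original.value_counts()
encoded_counts = encoded.value_counts()
print(original_counts)
print(encoded_counts)

# The encoding is consistent if 'M' matches 1 and 'B' matches 0.
consistent = (original_counts.loc["M"] == encoded_counts.loc[1]) and (
    original_counts.loc["B"] == encoded_counts.loc[0]
)
print("Counts match:", consistent)
```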

In [None]:
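y = pd.DataFrame(y) # Convert y from a Series into a one-column dataframe. Note that scikit-learn models expect a 1-D target, so fitting with a one-column dataframe may trigger a DataConversionWarning.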

y

In [21]:
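# Selecting the basic features that will be used for model training. Note that 'id' is an identifier rather than a measurement, so it is unlikely to carry diagnostic information.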

bc_features = ['id', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']

In [22]:
# Assigning these features to x.

x = bc_data[bc_features]

In [None]:
# Performing the train-test split (80% of the data is for training while the remaining 20% is for validation).
# Checking to make sure the sizes of the splits are correct.

train_x, test_x, train_y, test_y = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state = 0) # Specifying the random_state makes sure that we get the same split each time.
print('Train Shape:', train_x.shape, train_y.shape)
print('Test Shape:', test_x.shape, test_y.shape)
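
One option that was not used above is a stratified split, which keeps the share of malignant and benign cases the same in the training and test sets. The sketch below uses synthetic data (all `toy_` names are made up) rather than the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 30 of them positive (1).
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({"feature": rng.normal(size=100)})
toy_y = pd.Series([1] * 30 + [0] * 70, name="diagnosis")

# stratify=toy_y keeps the share of each class the same in both splits.
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

print("Train Shape:", toy_train_x.shape, toy_train_y.shape)
print("Test Shape:", toy_test_x.shape, toy_test_y.shape)
print("Positive share in train:", toy_train_y.mean())
print("Positive share in test:", toy_test_y.mean())
```

With a small test set, a purely random split can end up with noticeably more or fewer malignant cases than the full dataset, which makes the scores harder to compare.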

In [None]:
# Data visualizations to develop understanding of how these features may affect the diagnosis.

sns.scatterplot(bc_data, x='radius_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='texture_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='perimeter_mean', y='diagnosis', hue = None)

In [None]:
sns.scatterplot(bc_data, x='area_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='smoothness_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='compactness_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='concavity_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='concave points_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='symmetry_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='fractal_dimension_mean', y='diagnosis')

In [None]:
# Very basic decision tree model set up.
bc_DT_model = DecisionTreeClassifier(random_state = 1) # Again, setting the random_state to a constant ensures the same results each time.

# Fitting the model to the training data.
bc_DT_model.fit(train_x, train_y)

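# Validating the model to observe accuracy. The MAE ends up being about 9.6%. Because the labels are 0 and 1 and the classifier predicts 0 or 1, the MAE is the fraction of misclassified cases (1 minus the accuracy score).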
DT_predictions = bc_DT_model.predict(test_x)
print("MAE: ", mean_absolute_error(test_y, DT_predictions))
print("Accuracy Score: ", accuracy_score(test_y, DT_predictions))
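
The link between the two printed metrics can be checked on a tiny made-up example. With 0/1 labels and 0/1 predictions, every wrong prediction adds an absolute error of exactly 1, so the MAE and the accuracy should add up to 1 (the arrays below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Ten made-up labels and predictions; two of the predictions are wrong.
true_labels = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

mae = mean_absolute_error(true_labels, predicted)
acc = accuracy_score(true_labels, predicted)

print("MAE:", mae)
print("Accuracy Score:", acc)
print("MAE + Accuracy:", mae + acc)
```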

In [None]:
# Slightly more advanced model. Performs slightly better with an MAE of about 8.1% (when n_estimators is tuned by trial and error to avoid under/over-fitting).
# At first, I was struggling to use the accuracy_score metric as the model was outputting decimals. ChatGPT helped me realize I was accidentally using a regressor and not a classifier for binary prediction!

bc_RF_model = RandomForestClassifier(n_estimators = 30, random_state = 1) # n_estimators represents the number of decision trees in the forest.
bc_RF_model.fit(train_x, train_y)
RF_predictions = bc_RF_model.predict(test_x)
print("MAE: ", mean_absolute_error(test_y, RF_predictions))
print("Accuracy Score: ", accuracy_score(test_y, RF_predictions))
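
Accuracy alone does not show whether the mistakes are false positives or false negatives, which matters for a diagnosis task. A confusion matrix separates the two. This is a sketch on made-up labels, not on the model's real predictions:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels and predictions (1 = malignant, 0 = benign).
true_labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

# With labels=[0, 1], the matrix is [[TN, FP], [FN, TP]], so ravel() gives TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(true_labels, predicted, labels=[0, 1]).ravel()
print("True negatives:", tn)
print("False positives:", fp)
print("False negatives:", fn)
print("True positives:", tp)

# Recall is the share of malignant cases that were caught: TP / (TP + FN).
recall = recall_score(true_labels, predicted)
print("Recall:", recall)
```

A false negative here is a malignant case predicted as benign, which is usually the more costly kind of error.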

In [None]:
# Implementing a pipeline instead. ChatGPT helped me debug errors here (I ended up adding the remainder parameter and transforming the categorical y-values separately). The accuracy is the same as the previous random forest because it is the exact same model.

num_features = train_x.columns

preprocessor = ColumnTransformer(transformers=[('num_scale', StandardScaler(), num_features)], remainder = 'passthrough') # The remainder parameter ensures that any columns that are not included in the transformers are passed through without transformation or error. The standard scaler ensures that the values are on the same scale and can be passed through models.
bc_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', bc_RF_model)])

bc_pipeline.fit(train_x, train_y)
pipeline_RF_predictions = bc_pipeline.predict(test_x)
mean_absolute_error(test_y, pipeline_RF_predictions)
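
To see what the preprocessor does on its own, the sketch below runs a ColumnTransformer on a tiny made-up dataframe (the `toy` data and the 'flag' column are hypothetical). Only 'radius_mean' is listed in the transformer, so 'flag' goes through the remainder untouched:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataframe: one column to scale and one to leave alone.
toy = pd.DataFrame({"radius_mean": [10.0, 20.0, 30.0], "flag": [0, 1, 0]})

toy_preprocessor = ColumnTransformer(
    transformers=[("num_scale", StandardScaler(), ["radius_mean"])],
    remainder="passthrough",  # Columns not listed above are kept unchanged.
)

transformed = np.asarray(toy_preprocessor.fit_transform(toy))
print(transformed)

# The scaled column should have a mean of 0 and a standard deviation of 1.
print("Scaled mean:", transformed[:, 0].mean())
print("Scaled std:", transformed[:, 0].std())
```

The transformed columns come first in the output, followed by the passed-through columns.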

In [None]:
# Since this is a small dataset, we can use cross validation to get a more reliable estimate of performance. Cross validation splits the dataset into cv groups (folds), trains the model on all but one fold, tests it on the held-out fold, and repeats until each fold has been used for testing once. This way every data point is used for both training and testing, rather than the model simply being trained on 80% and tested on 20%.

cv_2_scores = cross_val_score(bc_pipeline, x, y, cv = 2)
cv_3_scores = cross_val_score(bc_pipeline, x, y, cv = 3)
cv_4_scores = cross_val_score(bc_pipeline, x, y, cv = 4)
cv_5_scores = cross_val_score(bc_pipeline, x, y, cv = 5)
cv_6_scores = cross_val_score(bc_pipeline, x, y, cv = 6)
cv_7_scores = cross_val_score(bc_pipeline, x, y, cv = 7)
cv_8_scores = cross_val_score(bc_pipeline, x, y, cv = 8)
cv_9_scores = cross_val_score(bc_pipeline, x, y, cv = 9)
cv_10_scores = cross_val_score(bc_pipeline, x, y, cv = 10)
cv_11_scores = cross_val_score(bc_pipeline, x, y, cv = 11)
cv_12_scores = cross_val_score(bc_pipeline, x, y, cv = 12)

In [None]:
# These are the accuracies for each of the cross validation folds. Note that this is a different metric of performance compared to MAE. I tried cross validation with a number of values for cv (representing the number of folds the data is split into) and received different results each time, but the highest accuracy in a single fold occurred when cv=9: 98.4%.

print("When cv=2: ", cv_2_scores)
print("When cv=3: ", cv_3_scores)
print("When cv=4: ", cv_4_scores)
print("When cv=5: ", cv_5_scores)
print("When cv=6: ", cv_6_scores)
print("When cv=7: ", cv_7_scores)
print("When cv=8: ", cv_8_scores)
print("When cv=9: ", cv_9_scores)
print("When cv=10: ", cv_10_scores)
print("When cv=11: ", cv_11_scores)
print("When cv=12: ", cv_12_scores)
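
The repeated calls above could also be written as a loop that reports the mean accuracy across folds, which is usually a steadier summary than the best single fold. This is only a sketch under some assumptions: it uses scikit-learn's bundled `load_breast_cancer` data (with all 30 features and its own label encoding) instead of `data.csv`, a plain random forest instead of `bc_pipeline`, and shuffled stratified folds:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Bundled copy of the Wisconsin diagnostic data (assumption: used here for convenience).
features, target = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=30, random_state=1)

mean_scores = {}
for k in (3, 5, 10):
    # Shuffled, stratified folds keep the class balance similar in every fold.
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, features, target, cv=folds, scoring="accuracy")
    mean_scores[k] = scores.mean()
    print(f"When cv={k}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Storing the results in a dictionary keyed by cv avoids one variable per setting and makes it easy to add more values to try.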