<a href="https://colab.research.google.com/github/LolaLS/My_Junior_Venture/blob/main/Junior_Venture_BCWisconsinDataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **JUNIOR VENTURE PROJECT**

Project Notes:

*   Since the RSNA Kaggle dataset was too large, I switched to a different dataset that is available through the University of California, Irvine (UCI) Machine Learning Repository.
*   This new dataset is much smaller and contains features that have already been extracted from digitized images of fine needle aspirates (FNA) of breast masses, rather than the images themselves.

Citations:


*   Specific data sources and setup help are cited in the code comments below.
*   Coding help was sourced from Kaggle Learn.
*   I also used online documentation for specific guidance while working with libraries.

Updates:


*   17-18/12/2023: So far, I have loaded all of the data, created several visualizations, and built a couple of very basic models (Decision Tree Regressor and Random Forest Regressor).

Next Steps:


*   Use more advanced models and compare outcomes.
*   Cross validation.
*   Potentially look at other users' models or my own to see which parts of the dataset are not well diagnosed, then attempt to create a model that specifically targets these flaws.
*   Try to obtain image data and create new models.



In [None]:
! pip install -q kaggle

# SETUP HELP: https://www.youtube.com/watch?v=98xlJvuLMtI

In [None]:
from google.colab import files
files.upload()

In [None]:
! mkdir -p ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets download -d uciml/breast-cancer-wisconsin-data

# ORIGINAL DATA SOURCE: https://data.world/health/breast-cancer-wisconsin
#                       https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
# KAGGLE DATA SOURCE: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data (original upload by UCI Machine Learning)

In [None]:
! unzip breast-cancer-wisconsin-data

In [111]:
import pandas as pd
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [None]:
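import os

# Stop early with a clear message if the download and unzip cells did not produce data.csv.
if not os.path.exists("data.csv"):
    raise FileNotFoundError("data.csv not found; run the Kaggle download and unzip cells first.")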

In [None]:
train_file_path = 'data.csv'
bc_data = pd.read_csv(train_file_path)
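
A quick check worth doing right after loading: the Kaggle copy of this CSV may contain an empty trailing column (it tends to show up with a name like `Unnamed: 32`). This is an assumption about the file, not something verified in this notebook. A minimal sketch of finding and dropping all-NaN columns, using a tiny made-up CSV in place of `data.csv` (`sample` and `cleaned` are hypothetical names):

```python
import io

import pandas as pd

# Tiny stand-in for data.csv; the trailing comma on each line mimics an
# empty last column. The values are made up for illustration only.
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,\n"
    "1,M,17.5,\n"
    "2,B,12.1,\n"
)
sample = pd.read_csv(sample_csv)

# Columns where every value is missing carry no information for a model.
empty_cols = [col for col in sample.columns if sample[col].isna().all()]

# Drop them before selecting features.
cleaned = sample.dropna(axis="columns", how="all")

print(empty_cols)
print(list(cleaned.columns))
```

The same two lines (`isna().all()` and `dropna(axis="columns", how="all")`) could be applied to `bc_data` if the check turns up an empty column.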

In [None]:
bc_data.head(5) # Visualizing the first 5 rows of the dataset.

In [None]:
bc_data.columns # Printing the names of all of the columns of the dataset.

In [None]:
y = bc_data.diagnosis

In [None]:
# Used this code to check how many y-values were labeled M and B (malignant and benign).
# Compared these values to how many y-values were labeled 1.0 and 0.0 after changing the strings to floats.

M_counter = (y == 'M').sum()  # Number of malignant rows.
B_counter = (y == 'B').sum()  # Number of benign rows.

print(M_counter)
print(B_counter)

In [None]:
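# Models require that the y-values be floats (not strings).
# This code maps each letter to its corresponding number (M -> 1.0, B -> 0.0).
# Using map builds a new float Series instead of assigning element by element,
# which avoids pandas chained-assignment warnings and leaves bc_data unchanged.

y = y.map({'M': 1.0, 'B': 0.0})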

In [None]:
# Comparing these values to the values found above.

M_counter = (y == 1.0).sum()  # Number of rows now labeled 1.0 (malignant).
B_counter = (y == 0.0).sum()  # Number of rows now labeled 0.0 (benign).

print(M_counter)
print(B_counter)

In [None]:
# Selecting the basic features that will be used for model training.
# Note: 'id' is an identifier rather than a measurement, so it is a candidate to drop in later experiments.

bc_features = ['id', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']

In [None]:
# Assigning these features to x.

x = bc_data[bc_features]

In [None]:
# Performing the train-test split (80% of the data is for training while the remaining 20% is for validation).
# Checking to make sure the sizes of the resulting sets are correct.

train_x, test_x, train_y, test_y = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state = 0) # Specifying the random_state makes sure that we get the same split each time.
print ('Train Shape:', train_x.shape, train_y.shape)
print ('Test Shape:', test_x.shape, test_y.shape)
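
The split above is purely random, so the share of malignant cases can differ a little between the training and test sets. A minimal sketch of a stratified split, which `train_test_split` supports through its `stratify` parameter. The data here are made up (`demo_x` and `demo_y` are hypothetical names), not rows from `data.csv`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 100 rows, 40 labeled 1.0 and 60 labeled 0.0.
demo_x = pd.DataFrame({"feature": range(100)})
demo_y = pd.Series([1.0] * 40 + [0.0] * 60)

# stratify=demo_y asks for the same class proportions in both subsets.
tr_x, te_x, tr_y, te_y = train_test_split(
    demo_x, demo_y,
    train_size=0.8, test_size=0.2,
    random_state=0, stratify=demo_y,
)

# Share of rows labeled 1.0 in each subset.
print(tr_y.mean(), te_y.mean())
```

In this notebook the equivalent change would be adding `stratify = y` to the existing call, which may shift the reported error values slightly.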

In [None]:
# Data visualizations to develop understanding of how these features may affect the diagnosis.

sns.scatterplot(bc_data, x='radius_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='texture_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='perimeter_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='area_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='smoothness_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='compactness_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='concavity_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='concave points_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='symmetry_mean', y='diagnosis')

In [None]:
sns.scatterplot(bc_data, x='fractal_dimension_mean', y='diagnosis')
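
A numeric complement to the scatterplots: comparing the average of each feature within each diagnosis group gives a rough sense of which features separate the two classes. A minimal sketch on made-up rows (`demo` is a hypothetical stand-in for `bc_data`, and the values are not from the real dataset):

```python
import pandas as pd

# Made-up rows standing in for bc_data; values are illustrative only.
demo = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B", "B"],
    "radius_mean": [18.0, 20.0, 12.0, 11.0, 13.0],
    "texture_mean": [22.0, 24.0, 17.0, 18.0, 19.0],
})

# Mean of each feature within each diagnosis group (one row per group).
group_means = demo.groupby("diagnosis")[["radius_mean", "texture_mean"]].mean()

print(group_means)
```

A large gap between the two rows for a feature suggests that feature may be useful to a model, although it says nothing about overlap between the groups, which the plots show better.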

In [None]:
# Very basic decision tree model set up.
bc_DT_model = DecisionTreeRegressor(random_state = 1) # Again, setting the random_state to a constant ensures the same results each time.

# Fitting the model to the training data.
bc_DT_model.fit(train_x, train_y)

# Evaluating the model on the held-out test data. The mean absolute error (MAE) comes out to about 0.096 (9.6%); lower is better.
DT_predictions = bc_DT_model.predict(test_x)
mean_absolute_error(test_y, DT_predictions)
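
Because the labels are 0.0 and 1.0, the MAE has a direct reading: if every prediction is itself exactly 0.0 or 1.0, each wrong prediction contributes an absolute error of 1, so the MAE equals the fraction of misclassified rows. A regressor that outputs fractional values would not have this equivalence. The MAE also does not say which kind of mistake was made, so the sketch below counts missed malignant cases separately. It is pure Python on made-up labels, not output from the model above:

```python
from collections import Counter

# Made-up labels and predictions (1.0 = malignant, 0.0 = benign), for illustration only.
true_labels = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
predictions = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
n = len(true_labels)

# Mean absolute error: the average of |true - predicted|.
mae = sum(abs(t - p) for t, p in zip(true_labels, predictions)) / n

# Error rate: the share of rows where the prediction differs from the label.
error_rate = sum(t != p for t, p in zip(true_labels, predictions)) / n

# Count each (true, predicted) pair to see which kind of error occurred.
pair_counts = Counter(zip(true_labels, predictions))
false_negatives = pair_counts[(1.0, 0.0)]  # malignant, predicted benign
false_positives = pair_counts[(0.0, 1.0)]  # benign, predicted malignant

print(mae, error_rate)                   # 0.2 0.2
print(false_negatives, false_positives)  # 1 1
```

For a diagnostic task, false negatives are usually the more costly error, which is one reason to look beyond a single error number.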

In [None]:
# Slightly more involved model: a random forest. It performs slightly better, with an MAE of about 0.083 (8.3%).

bc_RF_model = RandomForestRegressor(random_state = 1)
bc_RF_model.fit(train_x, train_y)
RF_predictions = bc_RF_model.predict(test_x)
mean_absolute_error(test_y, RF_predictions)
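
One way the cross-validation and model-comparison items from Next Steps might look. This sketch makes two assumptions that differ from the cells above. First, it uses classifiers (`DecisionTreeClassifier`, `RandomForestClassifier`) instead of regressors, since the target is a two-class label. Second, it loads scikit-learn's bundled copy of the Wisconsin Diagnostic dataset through `load_breast_cancer` so that it runs without `data.csv`. That bundled copy has 30 features rather than the 11 columns selected here, and it encodes malignant as 0 and benign as 1, the reverse of the encoding used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the dataset (assumption: used here instead of data.csv).
features, labels = load_breast_cancer(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
}

# 5-fold cross-validation: each model is trained and scored on 5 different
# train/validation splits, which gives a steadier estimate than one split.
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(
        model, features, labels, cv=5, scoring="accuracy"
    )
    print(name, round(cv_scores[name].mean(), 3))
```

To run this on the notebook's own data, `features` and `labels` could be swapped for `x` and `y`, provided `y` has been converted to numeric labels first.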