# Yash Kumar

## Research Questions

1. **Which characteristic(s) most closely relate(s) to determining if a breast lump is cancerous or non-cancerous? How much weight does it carry in determining the diagnosis relative to the other features?**

- I will analyze the correlation between each feature and the cancer diagnosis. To do this, I will create correlation heatmaps using Pearson correlation coefficients and then select the features with the highest correlations for further analysis. Finally, I will create scatter plots, box plots, violin plots, and density plots using the 'seaborn' and 'matplotlib' libraries to better understand the relationship between the selected features and the diagnosis. This analysis will identify the characteristics most strongly correlated with a patient's diagnosis and can indicate which features may serve as predictive markers that support an accurate diagnosis.
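
As a rough illustration of the ranking step, the sketch below encodes the diagnosis as 0/1 and sorts features by the absolute value of their Pearson correlation with it. The data is synthetic and the column values are made up for the example; only the column naming follows the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (hypothetical values, not real measurements).
rng = np.random.default_rng(0)
n = 200
diagnosis = rng.choice(['M', 'B'], size=n)
is_malignant = (diagnosis == 'M').astype(int)

toy = pd.DataFrame({
    'diagnosis': diagnosis,
    # Strongly tied to the label by construction.
    'radius_mean': 12 + 5 * is_malignant + rng.normal(0, 1, n),
    # Weakly tied to the label.
    'texture_mean': 18 + 1 * is_malignant + rng.normal(0, 4, n),
    # Unrelated to the label.
    'noise': rng.normal(0, 1, n),
})

# Encode the label numerically so that Pearson correlation is defined.
encoded = toy.assign(diagnosis=toy['diagnosis'].map({'B': 0, 'M': 1}))

# Correlation of every feature with the diagnosis, ranked by absolute strength.
ranking = (encoded.corr(numeric_only=True)['diagnosis']
           .drop('diagnosis')
           .sort_values(key=np.abs, ascending=False))
print(ranking)
```

A high correlation here only says that a feature separates the two classes well on its own; it does not account for redundancy between features.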

***

2. **How do the distributions of features vary across different sizes of tumours? More specifically, how does the importance of features vary with the size of the tumour?**

- The dataset can be divided into three groups based on tumour size to explore how the distributions of features vary across different sizes of tumours. Then, visualizations such as histograms, density plots, and box plots can be used to compare the feature distributions across these groups. By examining these visualizations, one can identify whether certain features are more important for predicting malignancy or benignity in tumours of a particular size and whether the relationship between features and diagnosis varies across tumour sizes.
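
One possible way to form the three groups is a tertile split. The sketch below assumes that 'area_mean' is used as the size measure and that equal-sized groups are wanted; neither choice is fixed by the plan above, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'area_mean': rng.uniform(150, 2500, size=90),
    'concavity_mean': rng.uniform(0, 0.4, size=90),
})

# Tertile split: three groups with (roughly) equal numbers of samples.
toy['size_group'] = pd.qcut(toy['area_mean'], q=3,
                            labels=['small', 'medium', 'large'])

# Per-group summary of one feature, as a numerical companion to the plots.
summary = toy.groupby('size_group', observed=True)['concavity_mean'].describe()
print(summary[['count', 'mean', 'std']])
```

Fixed cut-offs with 'pd.cut' would be an alternative if clinically meaningful size thresholds are preferred over equal group sizes.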

***

3. **How well can one predict whether a breast mass is cancerous or non-cancerous using the K-Nearest Neighbours or Support Vector Classifier models? (I am currently only familiar with KNN and SVC.)**

- I have finished working on the third question and built a web application using Streamlit.

- What have I done? 
    - To prepare the data, I used 'LabelEncoder()' to convert the categorical diagnosis labels into numerical values and pipelines to streamline the ML process. In addition, I used 'StandardScaler()' to put all features on the same scale. I then trained SVC and KNN models to predict whether a mass was cancerous and used 'accuracy_score()' to measure each model's accuracy. To obtain a more reliable estimate of the accuracy, I used 'cross_val_score().' Eventually, I settled on the Support Vector Classifier model, tuned its hyperparameters with 'GridSearchCV(),' and achieved a 'cross_val_score' of 97.9%.
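
The workflow described above could look roughly like the following sketch. It runs on synthetic data from 'make_classification', and the parameter grid is an assumption chosen for illustration, so it will not reproduce the 97.9% score; the actual code lives in 'WebAppCode.'

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and the 'M'/'B' labels.
X, y_num = make_classification(n_samples=300, n_features=10,
                               n_informative=5, random_state=0)
y_text = np.where(y_num == 1, 'M', 'B')

# LabelEncoder sorts the classes, so 'B' -> 0 and 'M' -> 1.
y = LabelEncoder().fit_transform(y_text)

# Scaling inside the pipeline is refit on each training fold,
# so no information leaks from the validation fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid; make_pipeline names the SVC step 'svc'.
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01, 0.1]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

scores = cross_val_score(search.best_estimator_, X, y, cv=5)
print(search.best_params_, round(scores.mean(), 3))
```

Scoring the tuned model on the same data used for the grid search gives a slightly optimistic estimate; a held-out test set or nested cross-validation would be stricter.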

**Web App Link: [Streamlit App](https://kyash03-breastcancerwebapp-main-ckdp68.streamlit.app)**

I have the code for the development of the ML model and the app itself in a folder called 'WebAppCode.'

In [248]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [249]:
# sns.set_style("darkgrid")

In [250]:
df = pd.read_csv('../data/raw/data.csv')

In [251]:
# df.head(10)

**The above output indicates that the values will need to be scaled. Scaling them will make it easier to compare variables and improve visualization.**

In [252]:
# df.info()

**The above cell's result shows that none of the columns have a missing value.**

In [253]:
# df.shape

In [254]:
# df.columns

**We do not need the 'id' and 'Unnamed: 32' columns.**

In [255]:
# df.drop(['id', 'Unnamed: 32'], axis=1).describe(
#     include=np.number).apply(lambda x: x.apply(lambda y: format(y, 'f')))

**The above code removes the 'id' and 'Unnamed: 32' columns from the data frame and then applies the format() function to the output of the 'describe' method so that the results are displayed as fixed-point decimals rather than in scientific notation.**

In [256]:
# df.drop(['id', 'Unnamed: 32'], axis=1).describe(exclude=np.number)

**The following cell cleans the data, i.e., drops the unnecessary columns.**

In [257]:
# df_cleaned = df.copy().drop(['id', 'Unnamed: 32'], axis=1)

**The above code creates a copy of the original frame to maintain an unmodified version of the dataset and then drops the unnecessary columns.**

In [258]:
# df_cleaned

In [259]:
# df_output = pd.DataFrame()

# for column_name in df_cleaned.columns[1:]:
#     user_df = df_cleaned.groupby(
#         by='diagnosis').describe().round(2).loc[:, column_name]
#     user_df['feature_name'] = [column_name] * 2
#     df_output = pd.concat([df_output, user_df])

In [260]:
# df_output.iloc[:, 1:][[
#     'feature_name', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'
# ]]

**The above output displays the statistics for each feature after being grouped by 'diagnosis.'**
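
The same per-diagnosis table can also be built without the explicit loop. The sketch below is one alternative on a tiny made-up frame, using 'pd.concat' with a dictionary so that the feature name becomes the outer index level.

```python
import pandas as pd

# Tiny made-up frame with the same layout: label first, features after.
toy = pd.DataFrame({
    'diagnosis': ['M', 'M', 'B', 'B', 'B'],
    'radius_mean': [18.0, 20.0, 11.0, 12.0, 13.0],
    'texture_mean': [21.0, 23.0, 15.0, 17.0, 19.0],
})

grouped = toy.groupby('diagnosis')

# One describe() table per feature, stacked under a 'feature_name' index level.
stats = pd.concat(
    {column: grouped[column].describe() for column in toy.columns[1:]},
    names=['feature_name', 'diagnosis'],
).round(2)

print(stats[['mean', 'std', 'min', '50%', 'max']])
```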

In [261]:
# df_cleaned['diagnosis'].value_counts()

In [262]:
# plt.figure(dpi=250)
# fig_1 = sns.countplot(data=df_cleaned, x='diagnosis')
# plt.xlabel('Diagnosis')
# plt.ylabel('Count')
# plt.title('Diagnosis Counts')
# fig_1.bar_label(fig_1.containers[0], label_type='center', size=18)

**The above graph shows that the dataset is imbalanced: it contains considerably more benign samples than malignant ones.**

In [263]:
# df_cleaned.isnull().sum()

In [264]:
# corr = df_cleaned.corr(numeric_only=True)

In [265]:
# plt.figure(dpi=250, figsize=(20, 20))
# fig_2 = sns.heatmap(corr,
#                     annot=True,
#                     mask=np.triu(np.ones_like(corr)),
#                     fmt=".2f")
# plt.title('Heatmap of Correlations between Columns', size=20)

In [266]:
# plt.figure(dpi=250, figsize=(20, 20))

# corr[abs(corr) < 0.75] = 0  # zero out correlations weaker than 0.75 in magnitude
# fig_2 = sns.heatmap(corr,
#                     annot=True,
#                     mask=np.triu(np.ones_like(corr)),
#                     fmt=".2f")

# plt.title('Heatmap of Correlations between Columns', size=20)

**The heatmap indicates that multiple variables have a high correlation, i.e., a correlation value greater than or equal to 0.75. Other than the apparent high correlation between the 'perimeter' and 'area' variables, what stands out is the strong relationship between the 'compactness,' 'concavity,' and 'concave points' variables.**

**I will explore this relationship using regression plots later on.**
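
To make the 0.75 cut-off concrete, the sketch below applies it to a small hand-written correlation matrix (the values and the labels 'a' to 'd' are invented). The threshold is applied to the absolute value so that strong negative correlations are kept as well.

```python
import pandas as pd

# Hand-written symmetric correlation matrix (invented values).
corr = pd.DataFrame(
    [[1.00, 0.90, -0.80, 0.10],
     [0.90, 1.00, -0.20, 0.30],
     [-0.80, -0.20, 1.00, 0.05],
     [0.10, 0.30, 0.05, 1.00]],
    index=list('abcd'), columns=list('abcd'))

# Keep entries whose magnitude is at least 0.75; set the rest to 0.
strong = corr.where(corr.abs() >= 0.75, 0)

# List each strong pair once by walking only below the diagonal.
pairs = [(row, col, float(corr.loc[row, col]))
         for i, row in enumerate(corr.index)
         for col in corr.columns[:i]
         if abs(corr.loc[row, col]) >= 0.75]

print(pairs)  # [('b', 'a', 0.9), ('c', 'a', -0.8)]
```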

In [267]:
# [x for x in df_cleaned.columns if '_mean' in x]

**The histograms below indicate that most of the variables are right-skewed. The only variable that closely resembles a normal distribution is 'symmetry_mean.'**

In [268]:
# plt.figure(dpi=250, figsize=(16, 56))
# for i, column_name in enumerate(df_cleaned.columns[1:]):
#     fig = plt.subplot(10, 3, i + 1)
#     fig.set_title('Distribution of ' + column_name)
#     sns.histplot(data=df_cleaned[column_name], bins=50, color='grey', kde=True)
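
The visual impression of right skew can be backed up with a number. The sketch below uses 'DataFrame.skew()' on synthetic columns (a lognormal and a normal sample, chosen only for illustration); values well above 0 indicate a long right tail, and values near 0 indicate a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one right-skewed, one approximately normal.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    'right_skewed': rng.lognormal(mean=0.0, sigma=0.8, size=2000),
    'roughly_normal': rng.normal(loc=0.18, scale=0.03, size=2000),
})

# Sample skewness of each column.
skewness = toy.skew()
print(skewness.round(2))
```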

**I will standardize the data so that features measured on very different scales can be compared directly and shown on the same plots.**

In [269]:
# from sklearn.preprocessing import StandardScaler

In [270]:
# standard_scaler = StandardScaler()

In [271]:
# standard_scaler.fit(df_cleaned.iloc[:, 1:])

In [272]:
# standardized_data = pd.DataFrame(standard_scaler.transform(
#     df_cleaned.iloc[:, 1:]),
#                                  columns=df_cleaned.columns[1:])

In [273]:
# standardized_data
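
For reference, standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below does this by hand on made-up numbers, using the population standard deviation (ddof=0), which is the convention 'StandardScaler' follows.

```python
import numpy as np
import pandas as pd

# Made-up values on two very different scales.
toy = pd.DataFrame({
    'area_mean': [400.0, 600.0, 1000.0, 1400.0],
    'smoothness_mean': [0.08, 0.09, 0.10, 0.13],
})

# z = (x - mean) / std, column by column, with ddof=0.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(3))
```

After this step every column has mean 0 and standard deviation 1, which is why the box plots and violin plots that follow can share one axis.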

In [274]:
# plt.figure(dpi=250, figsize=(14, 8))
# feature_names = ['perimeter', 'area']
# for i, feature_name in enumerate(feature_names):
#     plt.subplot(1, 2, i + 1)
#     plt.title('Boxplot of ' + str.capitalize(feature_name[0]) +
#               feature_name[1:],
#               size=20)
#     plt.xlabel('Feature')
#     plt.ylabel('Standardized Value')
#     sns.boxplot(data=standardized_data[
#         [x for x in standardized_data.columns if feature_name in x]])

**Observations:**

1. **The box plots indicate that the 'mean' values are more spread out than their counterparts, i.e., 'se' and 'worst.' The 'mean' values seem to have a large variation.**
2. **The 'perimeter' and 'area' values are quite spread out, indicating that they're both most likely not centred around a particular value; instead, they take on a large range of values.**

In [275]:
# plt.figure(dpi=250, figsize=(14, 8))
# feature_names = ['perimeter', 'area']
# for i, feature_name in enumerate(feature_names):
#     plt.subplot(1, 2, i + 1)
#     plt.title('Boxplot of ' + str.capitalize(feature_name[0]) +
#               feature_name[1:] + ' (Without Outliers)',
#               size=18)
#     plt.xlabel('Feature')
#     plt.ylabel('Standardized Value')
#     sns.boxplot(data=standardized_data[[
#         x for x in standardized_data.columns if feature_name in x
#     ]],
#                 showfliers=False)

**Setting 'showfliers' to 'False' hides the outlier markers in the box plots. The underlying data is unchanged, but the boxes and whiskers are drawn at a scale that makes them easier to compare.**
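
The points that 'showfliers' hides are those beyond the whiskers, which by default extend 1.5 times the interquartile range past the quartiles. The sketch below reproduces that rule on a short made-up array.

```python
import numpy as np

# Made-up sample with one extreme value.
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 20.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Default whisker limits: 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the limits are drawn as fliers.
fliers = values[(values < lower) | (values > upper)]
print(fliers)  # [20.]
```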

In [276]:
# plt.figure(dpi=250, figsize=(20, 6))
# fig_3 = sns.violinplot(data=standardized_data.iloc[:, :10].assign(
#     diagnosis=df_cleaned['diagnosis']).melt(
#         id_vars='diagnosis', value_name='standardized value'),
#                        x='variable',
#                        y='standardized value',
#                        hue='diagnosis')
# plt.title('Violin Plots of Mean Values', size=20)

In [277]:
# plt.figure(dpi=250, figsize=(20, 6))
# fig_4 = sns.violinplot(data=standardized_data.iloc[:, 10:20].assign(
#     diagnosis=df_cleaned['diagnosis']).melt(
#         id_vars='diagnosis', value_name='standardized value'),
#                        x='variable',
#                        y='standardized value',
#                        hue='diagnosis')
# plt.title('Violin Plots of Standard Error Values', size=20)

In [278]:
# plt.figure(dpi=250, figsize=(20, 6))
# fig_5 = sns.violinplot(data=standardized_data.iloc[:, 20:].assign(
#     diagnosis=df_cleaned['diagnosis']).melt(
#         id_vars='diagnosis', value_name='standardized value'),
#                        x='variable',
#                        y='standardized value',
#                        hue='diagnosis')
# plt.title('Violin Plots of Worst Values', size=20)

**As indicated previously, the values of 'perimeter' and 'area' vary greatly. However, it is important to note that this is only true for malignant diagnoses. Most benign cases are centred around a particular value.**

**This pattern seems to hold regardless of the feature in question; for all features, the malignant diagnoses seem to take on a larger range of values than the benign diagnoses.**
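
A per-diagnosis spread measure is one way to check this observation numerically. The sketch below uses synthetic standardized values in which the malignant group is deliberately given the wider distribution, so it only demonstrates the computation, not the result for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic standardized values; the 'M' group is wider by construction.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    'diagnosis': ['B'] * 300 + ['M'] * 200,
    'area_mean': np.concatenate([rng.normal(-0.5, 0.4, 300),
                                 rng.normal(0.8, 1.2, 200)]),
})

# Standard deviation and interquartile range per diagnosis.
spread = toy.groupby('diagnosis')['area_mean'].agg(
    std='std',
    iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
)
print(spread.round(2))
```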

In [279]:
# column_pairs = [('concavity_mean', 'concave points_mean', 'mean'),
#                 ('concavity_se', 'concave points_se', 'se'),
#                 ('concavity_worst', 'concave points_worst', 'worst')]

# plt.figure(figsize=(20, 6), dpi=250)

# for i, (x_col, y_col, title_suffix) in enumerate(column_pairs):
#     plt.subplot(1, 3, i + 1)
#     fig = sns.regplot(x=x_col, y=y_col, data=standardized_data)
#     fig.set_title(
#         f'Regression Plot of "{x_col}" vs "{y_col}" ({title_suffix})')

**Observations:**

1. **Compared to the 'mean' and 'worst' values, the 'se' values seem centred around the same value, except for a couple of outliers.**
2. **'concave points' and 'concavity' seem highly correlated, regardless of the statistic used. The heatmap previously made confirms this relationship.**
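
Because the data is standardized, the slope of a least-squares line through two features equals their Pearson correlation coefficient, which is why the regression plots and the heatmap tell the same story. The sketch below checks this on synthetic data (the variable names mirror the features, but the values are generated).

```python
import numpy as np

# Synthetic pair of correlated features.
rng = np.random.default_rng(4)
concavity = rng.gamma(shape=2.0, scale=0.05, size=500)
concave_points = 0.5 * concavity + rng.normal(0, 0.01, size=500)

def standardize(x):
    """Return x rescaled to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

zx, zy = standardize(concavity), standardize(concave_points)

# Least-squares line through the standardized points.
slope, intercept = np.polyfit(zx, zy, deg=1)

# Pearson correlation of the original values.
r = np.corrcoef(concavity, concave_points)[0, 1]

print(round(slope, 3), round(r, 3))
```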

## Method Chaining

In [289]:
import project_functions3

In [292]:
final_df = project_functions3.load_and_process('../data/raw/data.csv')

In [293]:
final_df

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,1,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,1,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,1,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,1,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,1,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,1,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400
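
The 'project_functions3' module is not shown in this notebook. As a rough idea of what a method-chained loader can look like, the sketch below defines a hypothetical 'load_and_process_sketch' function; the specific steps (dropping 'id' and 'Unnamed: 32', replacing spaces in column names, mapping 'M' to 1 and 'B' to 0) are assumptions inferred from the output above, not the project's actual code.

```python
import io
import pandas as pd

def load_and_process_sketch(path_or_buffer):
    """Hypothetical method-chained loader; not the project's actual function."""
    return (
        pd.read_csv(path_or_buffer)
        # Drop identifier and empty columns if they are present.
        .drop(columns=['id', 'Unnamed: 32'], errors='ignore')
        # Replace spaces in column names, e.g. 'concave points_mean'.
        .rename(columns=lambda name: name.replace(' ', '_'))
        # Encode the label: malignant -> 1, benign -> 0 (assumed mapping).
        .assign(diagnosis=lambda d: d['diagnosis'].map({'M': 1, 'B': 0}))
        .reset_index(drop=True)
    )

# Two made-up rows in the raw file's layout, read from memory instead of disk.
raw_csv = io.StringIO(
    "id,diagnosis,radius_mean,concave points_mean,Unnamed: 32\n"
    "1,M,17.99,0.1471,\n"
    "2,B,13.54,0.0478,\n"
)
sketch_df = load_and_process_sketch(raw_csv)
print(sketch_df)
```

Each step returns a new data frame, so the whole cleaning sequence reads top to bottom and leaves no intermediate variables behind.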
