# Summary of the Data Set

This section gives a quick summary of the data: a sample of rows, several graphs, and a table of summary statistics.
First, we import the data so that we can work with it and display its contents correctly.

In [None]:
import pandas as pd
dataCancer = pd.read_csv('cancer_preprossing.csv')
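
A quick sanity check right after loading can confirm that the table has the expected size and show where values are missing. The sketch below runs on a small in-memory stand-in rather than the real file; the feature names `radius_mean` and `texture_mean` and all values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the CSV file (column names and values are assumed).
csv_text = """id,diagnosis,radius_mean,texture_mean
1,M,17.99,10.38
2,B,13.54,14.36
3,B,12.45,
"""
toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape             # size of the loaded table
missing_per_column = toy.isna().sum()  # number of missing cells in each column
column_types = toy.dtypes              # inferred type of each column

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
print(column_types)
```

The same three calls (`shape`, `isna().sum()`, `dtypes`) could be applied to `dataCancer` directly.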

## Sample of Data

First, we display the first 5 rows as a sample of the preprocessed data using the following code:

In [None]:
from IPython.display import display, HTML  # Import the HTML class so we can display the table
# Print a sample made of the first 5 rows
print('The first 5 rows of the preprocessed data:')
data_sample = dataCancer.head(5)
# Generate HTML code from the sample
sample_table = data_sample.to_html(index=False)
# Display the HTML table in the notebook
display(HTML(sample_table))

## Data Visualization

Displaying the data graphically helps with analyzing and reporting it. We used the following charts.

A pie chart shows the percentage of each diagnosis in our data, which turns out to be 62.7% benign and 37.3% malignant.

In [None]:
# Graphs and tables that show variable distribution and missing values
import matplotlib.pyplot as plt

# Calculate the relative frequency (in percent) of each diagnosis
diagnosis_frequency = dataCancer['diagnosis'].value_counts(normalize=True) * 100

# Plot a pie chart
diagnosis_frequency.plot.pie(autopct='%1.1f%%', figsize=(5, 5), startangle=90)

# Set the title and hide the default y-axis label before displaying the plot
plt.title('Diagnosis frequency')
plt.ylabel('')
plt.show()
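
To make the percentage calculation behind the pie chart concrete, here is a minimal sketch on toy labels (not the real data). With `normalize=True`, `value_counts` returns each class's share of the total, and multiplying by 100 turns the shares into percentages.

```python
import pandas as pd

# Toy labels, not the real data: 5 benign ('B') and 3 malignant ('M').
labels = pd.Series(['B', 'B', 'M', 'B', 'M', 'B', 'M', 'B'], name='diagnosis')

# Share of each class, expressed as a percentage of all rows
shares = labels.value_counts(normalize=True) * 100
rounded = shares.round(1)

print(rounded)  # B: 62.5, M: 37.5
```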

We also use box plots to show the distribution of each attribute and to compare its values across the different diagnoses.

In [None]:
# Install seaborn to avoid "ModuleNotFoundError: No module named 'seaborn'"
%pip install seaborn
import seaborn as sns


target_column = 'diagnosis'

# Use every column except the target and 'id' as features (assumed to be numeric)
feature_columns = dataCancer.columns[(dataCancer.columns != target_column) & (dataCancer.columns != 'id')]


# Determine the number of rows and columns for the subplot layout
num_features = len(feature_columns)
num_rows = (num_features - 1) // 4 + 1
num_cols = min(4, num_features)

# Create box plots for each feature grouped by 'diagnosis'
plt.figure(figsize=(16, 4 * num_rows))

for i, feature in enumerate(feature_columns):
    plt.subplot(num_rows, num_cols, i + 1)
    sns.boxplot(x=target_column, y=feature, data=dataCancer)
    plt.title(f'Box Plot: {feature} vs. {target_column}')
    plt.xlabel(target_column)
    plt.ylabel(feature)

plt.tight_layout()
plt.show()
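
The quantities that each box plot draws can also be computed directly. The sketch below is an illustration on toy values, with a hypothetical helper `box_stats`; it derives the quartiles, the interquartile range (IQR), and the 1.5 × IQR fences that matplotlib and seaborn use by default to flag outliers.

```python
import pandas as pd

# Toy data, not the real data set: one numeric feature for two diagnosis groups.
toy = pd.DataFrame({
    'diagnosis': ['B'] * 5 + ['M'] * 5,
    'feature': [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 6.0, 8.0, 10.0, 30.0],
})

def box_stats(values):
    """Return the numbers a box plot is built from (hypothetical helper)."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points beyond the fences are drawn as individual outlier markers
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return pd.Series({
        'q1': q1, 'median': median, 'q3': q3, 'iqr': iqr,
        'lower_fence': lower_fence, 'upper_fence': upper_fence,
        'n_outliers': len(outliers),
    })

# One row of box-plot statistics per diagnosis group
stats_by_group = pd.DataFrame({
    label: box_stats(group)
    for label, group in toy.groupby('diagnosis')['feature']
}).T

print(stats_by_group)
```

In this toy example the value 30.0 lies above the upper fence of the 'M' group (16.0), so a box plot would draw it as a separate point.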

Most of our data is numeric, and one of the best ways to visualize it is a matrix of scatter plots (pair plot), which gives a clear view of the relationships between the attributes.

In [None]:
# Exclude the id attribute (assumes it is the first column)
selected_feature = dataCancer.iloc[:, 1:]

# Create a matrix of scatter plots
sns.pairplot(selected_feature, hue='diagnosis', palette = 'Set2')


plt.suptitle('Pair Plot of Breast Cancer Dataset', y=1.02)
plt.show()
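
A pair plot shows pairwise relationships visually; a correlation matrix is one possible numeric companion to it. The sketch below is only an illustration on toy columns (`feature_a`, `feature_b`, `feature_c` are hypothetical names), using the Pearson correlation that `DataFrame.corr` computes by default.

```python
import pandas as pd

# Toy columns, not the real features.
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feature_b': [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * feature_a
    'feature_c': [5.0, 4.0, 3.0, 2.0, 1.0],   # decreases as feature_a increases
})

# Pearson correlation between every pair of columns
correlations = toy.corr()

print(correlations.round(2))
```

Values near 1 or -1 correspond to scatter panels in which the points fall close to a straight line.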


## Summary Statistics

It is important to look at summary statistics to capture the characteristics of our data set, such as measures of central tendency and the variance. The code below calculates these measures for each feature and displays them in a table.

In [None]:
import statistics as st

summary_data = []

for columnName in feature_columns:
    columnData = dataCancer[columnName]
    # Midrange: the average of the largest and smallest values
    midrange = (max(columnData) + min(columnData)) / 2
    summary_data.append({
        'Feature': columnName,
        'Mean': st.mean(columnData),
        'Median': st.median(columnData),
        'Mode': st.mode(columnData),
        'Midrange': midrange,
        'Variance': st.variance(columnData)  # sample variance (divides by n - 1)
    })

# Create the DataFrame after the loop
statistical_summaries = pd.DataFrame(summary_data)

# Generate HTML code from the DataFrame
statistical_summaries_table = statistical_summaries.to_html(index=False)
# Display the HTML table in the notebook
display(HTML(statistical_summaries_table))
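
To show what each measure means, here is a small worked example on toy values (not taken from the data set), using the same `statistics` functions. It also contrasts the sample variance from `st.variance` with the population variance from `st.pvariance`, since the two divide by different denominators.

```python
import statistics as st

# Toy values, chosen so the results are easy to verify by hand.
values = [2.0, 4.0, 4.0, 5.0, 10.0]

summary = {
    'Mean': st.mean(values),                       # (2 + 4 + 4 + 5 + 10) / 5 = 5.0
    'Median': st.median(values),                   # middle value of the sorted list = 4.0
    'Mode': st.mode(values),                       # most frequent value = 4.0
    'Midrange': (max(values) + min(values)) / 2,   # (10 + 2) / 2 = 6.0
    'Variance': st.variance(values),               # squared deviations sum to 36; 36 / (5 - 1) = 9.0
    'Population variance': st.pvariance(values),   # 36 / 5 = 7.2
}

for name, value in summary.items():
    print(f'{name}: {value}')
```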