#  NB2: Exploratory Data Analysis

## Objectives

* Exploratory data analysis (EDA)
* Descriptive statistics
* Visualization

## Inputs

* inputs/datasets/data_clean_id/data.csv

## Outputs

* outputs/datasets/cleaned/binary_data.csv
* outputs/nb2/diagnosis.jpeg
* outputs/nb2/hist_mean.jpeg, outputs/nb2/hist_se.jpeg, outputs/nb2/hist_worst.jpeg
* outputs/nb2/density_mean.jpeg, outputs/nb2/density_se.jpeg, outputs/nb2/density_worst.jpeg
* outputs/nb2/boxplot_mean.jpeg, outputs/nb2/boxplot_se.jpeg, outputs/nb2/boxplot_worst.jpeg
* outputs/nb2/correlation.jpeg, outputs/nb2/scatter.jpeg

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

Now that we have a good intuitive sense of the data, the next step is to take a closer look at the attributes and data values. In this section, I get familiar with the data, which will provide useful knowledge for data pre-processing.
## 2.1 Objectives of Data Exploration
Exploratory data analysis (EDA) is a very important step that takes place after acquiring the data and feature engineering, and it should be done before any modeling. A data scientist needs to understand the nature of the data without making assumptions. The results of data exploration are extremely useful for grasping the structure of the data, the distribution of the values, the presence of extreme values, and the interrelationships within the data set.
> **The purpose of EDA is:**
* to use summary statistics and visualizations to better understand the data,
* to find clues about the tendencies of the data and its quality, and to formulate the assumptions and hypotheses of our analysis.

For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.

The next step is to explore the data. Two approaches are used to examine it:

1. ***Descriptive statistics*** is the process of condensing key characteristics of the data set into simple numeric metrics. Some of the common metrics used are mean, standard deviation, and correlation. 
	
2. ***Visualization*** is the process of projecting the data, or parts of it, into Cartesian space or into abstract images. In the data mining process, data exploration is leveraged in many different steps including preprocessing, modeling, and interpretation of results. 
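
As a minimal sketch of the descriptive-statistics approach, the cell below condenses a small table into the three metrics named above (mean, standard deviation, correlation). The `toy` frame and its values are made up for illustration and are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical measurements, invented for this example only.
toy = pd.DataFrame(
    {
        "radius": [10.0, 12.0, 14.0, 16.0, 18.0],
        "area": [310.0, 450.0, 610.0, 800.0, 1010.0],
    }
)

mean_radius = toy["radius"].mean()  # central tendency
std_radius = toy["radius"].std()  # sample standard deviation (ddof=1)
corr_radius_area = toy["radius"].corr(toy["area"])  # Pearson correlation

print(mean_radius, std_radius, corr_radius_area)
```

Each number summarizes one property of the data: where the values sit, how spread out they are, and how strongly two columns move together.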


# 2.2 Descriptive statistics
Summary statistics are measurements meant to describe data. In the field of descriptive statistics, there are many [summary measurements](http://www.saedsayad.com/numerical_variables.htm).

In [None]:
import matplotlib.pyplot as plt

# Load libraries for data processing
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
from scipy.stats import norm
import seaborn as sns  # visualization


plt.rcParams["figure.figsize"] = (15, 8)
plt.rcParams["axes.titlesize"] = "large"

In [None]:
data = pd.read_csv("outputs/datasets/cleaned/data.csv", index_col=False)
data.drop("Unnamed: 0", axis=1, inplace=True)
data.head(3)

In [None]:
# basic descriptive statistics
data.describe()

In [None]:
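# Encode the diagnosis labels as integers (B -> 0, M -> 1).
# Assign the result back instead of calling replace(inplace=True) on the column.
data["diagnosis"] = data["diagnosis"].replace({"B": 0, "M": 1})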

# save data with binary encoding on diagnosis
data.to_csv("outputs/datasets/cleaned/binary_data.csv")

In [None]:
data.skew()

>The skew results show a positive (right) or negative (left) skew. Values closer to zero indicate less skew.
From the graphs, we can see that **radius_mean**, **perimeter_mean**, **area_mean**, **concavity_mean** and **concave_points_mean** are useful in predicting cancer type, because malignant and benign cases form distinct groups in these features. **area_worst** and **perimeter_worst** also appear quite useful.
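
To make the sign convention of `skew()` concrete, here is a small sketch on made-up series (the values are assumptions chosen for illustration, not dataset values). A long tail to the right gives a positive skew, its mirror image gives a negative skew, and a symmetric sample gives a value of zero.

```python
import pandas as pd

# Invented samples, used only to illustrate the sign of the skewness.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])  # long tail to the right
left_skewed = pd.Series([-20, -3, -2, -2, -1, -1])  # mirror image of the above

skews = {
    "symmetric": symmetric.skew(),
    "right_skewed": right_skewed.skew(),
    "left_skewed": left_skewed.skew(),
}
print(skews)
```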

In [None]:
data.diagnosis.unique()

M = 1, B = 0

In [None]:
# Group by diagnosis and count the observations in each class.
diag_gr = data.groupby("diagnosis")
pd.DataFrame(diag_gr.size(), columns=["# of observations"])

Check the binary encoding from NB1 to confirm the conversion of the categorical diagnosis data into numeric values, where
* Malignant = 1 (indicates presence of cancer cells)
* Benign = 0 (indicates absence)
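
A minimal sketch of such an encoding check, using a hypothetical five-row frame instead of the real data: map the labels to integers, confirm that no label was left unmapped, and count the rows per class.

```python
import pandas as pd

# Hypothetical labels, for illustration only.
toy = pd.DataFrame({"diagnosis": ["B", "M", "B", "B", "M"]})

# map() turns any label missing from the dictionary into NaN,
# which makes unexpected values easy to detect.
toy["diagnosis"] = toy["diagnosis"].map({"B": 0, "M": 1})
unmapped = int(toy["diagnosis"].isna().sum())

counts = toy["diagnosis"].value_counts().sort_index()
print(unmapped)
print(counts)
```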

##### **Observation**
> *357 observations indicate the absence of cancer cells and 212 show the presence of cancer cells*

Let's confirm this by plotting the frequency of each diagnosis.

# 2.3 Univariate Data Visualizations

One of the main goals of visualizing the data here is to observe which features are most helpful in predicting malignant or benign cancer. The other is to see general trends that may aid us in model selection and hyperparameter selection.

We apply three techniques to understand each attribute of the dataset independently:
* Histograms.
* Density Plots.
* Box and Whisker Plots.

In [None]:
# Let's get the frequency of each cancer diagnosis
sns.set_style("white")
sns.set_context({"figure.figsize": (10, 8)})
sns.countplot(x=data["diagnosis"])
plt.savefig("outputs/nb2/diagnosis.jpeg")
plt.show()

## 2.3.1 Visualise distribution of data via histograms
Histograms are commonly used to visualize numerical variables. A histogram is similar to a bar graph after the values of the variable are grouped (binned) into a finite number of intervals (bins).

Histograms group data into bins and provide you a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. It can also help you see possible outliers.
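
The binning step described above can be sketched with `numpy.histogram`, which returns the bin edges and the count per bin without drawing anything. The values below are made up for illustration.

```python
import numpy as np

# Invented values; the single large value sits far from the rest.
values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0])

# Split the range [1, 9] into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=4)

print(edges)  # bin boundaries: 1, 3, 5, 7, 9
print(counts)  # observations per bin: 4, 3, 0, 1
```

The empty third bin followed by a single count in the last bin is the numeric signature of a possible outlier, which is the same pattern one looks for visually in the plots.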

### Separate columns into smaller dataframes to perform visualization

In [None]:
# Break up columns into groups according to their suffix designation
# (_mean, _se, and _worst) to plot each group separately.
data_diag = data.loc[:, ["diagnosis"]]

# Select each group by its column-name suffix.
data_mean = data.filter(regex="_mean$")
data_se = data.filter(regex="_se$")
data_worst = data.filter(regex="_worst$")

print(data_mean.columns)
print(data_se.columns)
print(data_worst.columns)

### Histogram for the "_mean" suffix designation

In [None]:
# Plot histograms of _mean variables
hist_mean = data_mean.hist(
    bins=10,
    figsize=(15, 10),
    grid=False,
)
plt.savefig("outputs/nb2/hist_mean.jpeg")
# To plot an individual histogram, use for example:
# data_worst["radius_worst"].hist(bins=100)

### Histogram for the "_se" suffix designation

In [None]:
# Plot histograms of _se variables
hist_se = data_se.hist(
    bins=10,
    figsize=(15, 10),
    grid=False,
)
plt.savefig("outputs/nb2/hist_se.jpeg")

### Histogram for the "_worst" suffix designation

In [None]:
# Plot histograms of _worst variables
hist_worst = data_worst.hist(
    bins=10,
    figsize=(15, 15),
    grid=False,
)
plt.savefig("outputs/nb2/hist_worst.jpeg")

### Observation

>We can see that the attributes **concavity** and **concave_points** may have an exponential distribution. We can also see that the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.


## 2.3.2 Visualize distribution of data via density plots

### Density plots for the "_mean" suffix designation

In [None]:
# Density plots
axes = data_mean.plot(
    kind="density",
    subplots=True,
    layout=(4, 3),
    sharex=False,
)
# All subplots share one figure; reach it through the first Axes and save it.
np.ravel(axes)[0].figure.savefig("outputs/nb2/density_mean.jpeg")

### Density plots for the "_se" suffix designation

In [None]:
# Density plots
axes = data_se.plot(
    kind="density",
    subplots=True,
    layout=(4, 3),
    sharex=False,
    sharey=False,
    fontsize=12,
    figsize=(15, 10),
)
# All subplots share one figure; reach it through the first Axes and save it.
np.ravel(axes)[0].figure.savefig("outputs/nb2/density_se.jpeg")

### Density plots for the "_worst" suffix designation

In [None]:
# Density plots
axes = data_worst.plot(
    kind="kde",
    subplots=True,
    layout=(4, 3),
    sharex=False,
    sharey=False,
    fontsize=5,
    figsize=(15, 10),
)
# All subplots share one figure; reach it through the first Axes and save it.
np.ravel(axes)[0].figure.savefig("outputs/nb2/density_worst.jpeg")

### Observation
>We can see that the attributes perimeter, radius, area, concavity and compactness may have an exponential distribution. We can also see that the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.
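
The curves in density plots like the ones above are kernel density estimates. As a rough sketch of what such a curve represents, the cell below fits `scipy.stats.gaussian_kde` to synthetic Gaussian data (an assumption made for illustration, not the notebook's data) and checks two properties of the estimate: the area under the curve is close to 1, and the peak sits near the centre of the sample.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic data

kde = gaussian_kde(sample)  # default bandwidth (Scott's rule)
grid = np.linspace(-4, 4, 201)
density = kde(grid)  # estimated density at each grid point

# Trapezoidal rule: a density should integrate to roughly 1 over its support.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
peak_location = grid[np.argmax(density)]

print(area, peak_location)
```

For a skewed attribute the peak would move towards one side and a long tail would appear, which is what the observation above reads off the plots.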

## 2.3.3 Visualise distribution of data via box plots

### Box plots for the "_mean" suffix designation

In [None]:
# Box and whisker plots
axes = data_mean.plot(
    kind="box", subplots=True, layout=(4, 4), sharex=False, sharey=False, fontsize=12
)
# All subplots share one figure; reach it through the first Axes and save it.
np.ravel(axes)[0].figure.savefig("outputs/nb2/boxplot_mean.jpeg")

### Box plots for the "_se" suffix designation

In [None]:
# Box and whisker plots
axes = data_se.plot(
    kind="box", subplots=True, layout=(4, 4), sharex=False, sharey=False, fontsize=12
)
# All subplots share one figure; reach it through the first Axes and save it.
np.ravel(axes)[0].figure.savefig("outputs/nb2/boxplot_se.jpeg")

### Box plots for the "_worst" suffix designation

In [None]:
# Box and whisker plots
axes = data_worst.plot(
    kind="box", subplots=True, layout=(4, 4), sharex=False, sharey=False, fontsize=12
)
# All subplots share one figure; reach it through the first Axes and save it.
np.ravel(axes)[0].figure.savefig("outputs/nb2/boxplot_worst.jpeg")

### Observation
>We can see that the attributes perimeter, radius, area, concavity and compactness may have an exponential distribution. We can also see that the texture, smoothness and symmetry attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.

# 2.4 Multivariate Data Visualizations
* Scatter plots
* Correlation matrix

### Correlation matrix

In [None]:
import matplotlib.pyplot as plt

# plot correlation matrix
data = pd.read_csv("outputs/datasets/cleaned/data.csv", index_col=False)
data.drop("Unnamed: 0", axis=1, inplace=True)

plt.figure(figsize=(25, 12))
corr = data.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="YlGnBu")
plt.savefig("outputs/nb2/correlation.jpeg")

### Observation:
We can see that strong positive relationships (r between 0.75 and 1) exist among the mean-value parameters.
* The mean area of the tissue nucleus has a strong positive correlation with the mean values of radius and perimeter.
* Some parameters are moderately positively correlated (r between 0.5 and 0.75), for example concavity and area, or concavity and perimeter.
* Likewise, we see some strong negative correlation of fractal_dimension with the radius, texture and perimeter mean values.
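
Reading strongly correlated pairs off a large heatmap is error-prone, so a short sketch of how they could be listed programmatically may help. The `toy` frame, its column names and the 0.75 threshold below are illustrative assumptions that mirror the ranges used in the observation; they are not dataset values.

```python
import numpy as np
import pandas as pd

# Invented columns: perimeter tracks radius closely, texture does not.
toy = pd.DataFrame(
    {
        "radius": [1.0, 2.0, 3.0, 4.0, 5.0],
        "perimeter": [2.1, 3.9, 6.2, 8.0, 9.9],
        "texture": [3.0, 1.0, 4.0, 2.0, 3.5],
    }
)

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# NaN entries, if any survive stack(), compare as False and drop out here.
strong = pairs[pairs.abs() >= 0.75]
print(strong)
```

Using the absolute value keeps strong negative correlations in the result as well as strong positive ones.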
    

In [None]:
plt.style.use("fivethirtyeight")
sns.set_style("white")

data = pd.read_csv("inputs/datasets/raw/data.csv", index_col=False)
# Use columns 1-6 of the raw file; "diagnosis" must be among them to serve as hue.
g = sns.PairGrid(data[data.columns[1:7]], hue="diagnosis")
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter, s=3)
plt.savefig("outputs/nb2/scatter.jpeg")

### Summary

* Mean values of cell radius, perimeter, area, compactness, concavity
    and concave points can be used in classification of the cancer. Larger
    values of these parameters tend to show a correlation with malignant
    tumors.
* Mean values of texture, smoothness, symmetry or fractal dimension
    do not show a particular preference of one diagnosis over the other.

* None of the histograms shows noticeable large outliers that warrant further cleanup.
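
The outlier remark can be cross-checked numerically. The sketch below applies the common 1.5 × IQR rule, which is also the default whisker length of matplotlib box plots, to a made-up series. The helper name `iqr_outliers` and the toy values are hypothetical, and the rule is one convention among several rather than the method used in this notebook.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Invented values with one obvious extreme point.
toy = pd.Series([10, 11, 12, 12, 13, 14, 15, 40])
flagged = iqr_outliers(toy)
print(flagged)
```

Applied column by column, for example with `data_mean.apply(lambda col: len(iqr_outliers(col)))`, this gives a count of flagged points per feature to compare against the visual impression.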