# Workshop 6 - Pandas and Data Visualisation 

### Pandas 

Pandas is a popular open-source data analysis and manipulation library for the Python programming language. It provides data structures such as `Series` and `DataFrame` for working with structured (tabular) data. Pandas is widely used for tasks such as data cleaning, data transformation, data analysis, and data visualization.

Pandas provides a wide range of functionalities, including:

* Reading and writing data from and to various file formats, such as CSV, Excel, SQL databases, and more.
* Data cleaning and preprocessing, including handling missing data and removing duplicates.
* Data transformation, such as merging, reshaping, and aggregating datasets.
* Data analysis, including statistical analysis and time series analysis.
* Data visualization using libraries like Matplotlib and Seaborn.

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

### Pandas dataframe

A `DataFrame` is a two-dimensional data structure that holds data like a two-dimensional array or a table, with labelled rows and columns.
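
As a minimal sketch of this idea, a small `DataFrame` can be built from a dictionary in which each key becomes a column. The column names and values below are made up for illustration and are not taken from the workshop dataset.

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 11.4, 12.5],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.columns)  # column labels
print(toy.index)    # row labels, 0, 1, 2 by default
```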

### Read CSV file 

Please go to the GitHub page to download the data. 

The Breast Cancer Dataset is from UC Irvine Machine Learning Repository. Features are computed from a digitised image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. 

In [None]:
data = pd.read_csv("data.csv")

type(data)

### Viewing data

Take a look at the first few rows of the dataset:

In [None]:
data.head()

Similarly, you can use `tail` to see the last few rows:

In [None]:
data.tail()

Or you can display the whole `DataFrame` (long outputs are truncated):

In [None]:
data

Get all the column names of the dataset:

In [None]:
data.columns

Get the row labels (the index) of the dataset:

In [None]:
data.index

Get the data type of each column:

In [None]:
data.dtypes

Get a statistical summary of the numeric columns in your dataset:

In [None]:
data.describe()

Count the values in each column. Missing values are not counted:

In [None]:
data.count()

Transposing your data:

In [None]:
data.T

Sort your dataset by values:

In [None]:
data.sort_values(by="id")
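
As a small sketch on made-up data, `sort_values` also accepts `ascending=False` for descending order. It returns a new `DataFrame` and leaves the original untouched, so the result has to be assigned to a variable if you want to keep it.

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({"id": [3, 1, 2], "radius_mean": [12.5, 17.9, 11.4]})

# Sort from the largest radius_mean to the smallest.
sorted_desc = toy.sort_values(by="radius_mean", ascending=False)

print(sorted_desc)
print(toy)  # the original row order is unchanged
```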

### Data selection `[]`

For a `DataFrame`, passing a single label selects a column.

In [None]:
data["id"]

Passing a slice `:` selects matching rows:

In [None]:
data[0:3]

Selecting multiple columns with all rows:

In [None]:
data.loc[:, ["id", "diagnosis"]]

Selecting by integer position with `iloc`:

In [None]:
data.iloc[0:3, 0:3]

Selecting non-contiguous rows and columns by position:

In [None]:
data.iloc[[0,2,4], [1,3,5]]

Getting a single value by its row and column position:

In [None]:
data.iloc[6, 6]
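
The difference between `loc` (labels) and `iloc` (integer positions) is easier to see when the row labels are not the default integers. The sketch below uses a toy `DataFrame` with made-up string labels. Note that a label slice with `loc` includes its end point, while a position slice with `iloc` excludes it.

```python
import pandas as pd

# Toy data with string row labels, invented so labels and positions differ.
toy = pd.DataFrame(
    {"radius_mean": [17.9, 11.4, 12.5], "diagnosis": ["M", "B", "B"]},
    index=["a", "b", "c"],
)

by_label = toy.loc["a":"b", "radius_mean"]  # label slice: "b" is included
by_position = toy.iloc[0:2, 0]              # position slice: position 2 is excluded
single_value = toy.iloc[1, 0]               # row position 1, column position 0

print(by_label)
print(by_position)
print(single_value)
```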

### Selecting with conditions

Select rows where the `diagnosis` column is equal to `"M"`:

In [None]:
data[data["diagnosis"] == "M"].head()

Select rows where `radius_mean` is greater than 20:

In [None]:
data[data["radius_mean"] > 20].head()
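
Conditions can also be combined. The following is a minimal sketch on made-up data: each condition goes in its own parentheses, with `&` for "and" and `|` for "or".

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "B"],
    "radius_mean": [21.0, 12.3, 15.2, 20.5],
})

# Rows that satisfy both conditions.
malignant_and_large = toy[(toy["diagnosis"] == "M") & (toy["radius_mean"] > 20)]

# Rows that satisfy at least one condition.
malignant_or_large = toy[(toy["diagnosis"] == "M") | (toy["radius_mean"] > 20)]

print(malignant_and_large)
print(malignant_or_large)
```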

### Missing data

To check if your data has missing values:

In [None]:
data.isna()
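
`isna()` returns a table of `True`/`False` values, which is hard to read for a large dataset. One common way to summarise it, sketched here on made-up data, is to count the missing values per column and to list the columns that are entirely empty.

```python
import numpy as np
import pandas as pd

# Toy data: the column names and values are invented for illustration only.
toy = pd.DataFrame({
    "radius_mean": [17.9, np.nan, 12.5],
    "empty_col": [np.nan, np.nan, np.nan],
})

# True counts as 1, so summing gives the number of missing values per column.
missing_per_column = toy.isna().sum()

# Columns in which every value is missing.
all_missing = list(toy.columns[toy.isna().all()])

print(missing_per_column)
print(all_missing)
```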

Drop the columns we do not need for the analysis:

In [None]:
# axis=1 drops columns; axis=0 would drop rows
data_dropped = data.drop(["Unnamed: 32", "id"], axis=1)

data_dropped.head()

### Feature selection and visualisation

Take a look at the number of samples for each diagnosis:

In [None]:
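ax = sns.countplot(data=data, x="diagnosis")

# Look up the counts by label instead of relying on the order of value_counts()
counts = data["diagnosis"].value_counts()
print("Number of Malignant:", counts["M"])
print("Number of Benign:", counts["B"])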

### Violin plot 

Before making the violin plot, we need to normalise the data so that features measured on different scales can be compared on the same axis.

In [None]:
data_dropped_norm = (data_dropped - data_dropped.mean()) / (data_dropped.std())
data_dropped_norm.head()

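__Exercise: the cell above raises an error because the "diagnosis" column is not numeric. Solve the error and store the result in `data_norm` (hint: delete the column "diagnosis", and add it back later using `pd.concat`).__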

Unpivot the dataframe from wide to long format, so we can use it for plotting. Please see this [page](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) to learn more about `pandas.melt`.

In [None]:
data_norm_melt = pd.melt(data_norm,
                         id_vars="diagnosis",
                         var_name="features",
                         value_name="value")
data_norm_melt
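
To see what `pd.melt` does, it can help to try it on a tiny table first. The sketch below uses made-up values: every feature column is stacked into a pair of `features` and `value` columns, and `diagnosis` is repeated alongside each value.

```python
import pandas as pd

# Toy wide-format data: one row per sample, one column per feature.
# The values are invented for illustration only.
wide = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [1.2, -0.8],
    "texture_mean": [0.5, -0.3],
})

# Long format: one row per (sample, feature) pair.
long = pd.melt(wide, id_vars="diagnosis", var_name="features", value_name="value")

print(long)
```

Two samples with two features each give four rows in the long table.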

Make the violin plot for all 30 features:

In [None]:
plt.figure(figsize=(12,8), dpi=300)
sns.violinplot(data=data_norm_melt, x="features", y="value", hue="diagnosis", split=True, inner="quart")
plt.xticks(rotation=90)

### Boxplot

As an alternative to the violin plot, a box plot can be used. Box plots are also useful for spotting outliers.

In [None]:
plt.figure(figsize=(12,8), dpi=300)

# flierprops = dict(markerfacecolor='black', markersize=5, linestyle='none', marker='o')

sns.boxplot(data=data_norm_melt, x="features", y="value", hue="diagnosis", linewidth=1.0)
plt.xticks(rotation=90)

### Correlations

Although it is not always appropriate, when two features are strongly correlated with each other we can often drop one of them, because it adds little new information. Let's check the correlation between __compactness_worst__ and __concavity_worst__.

In [None]:
sns.jointplot(data=data, x="compactness_worst", y="concavity_worst", kind='reg')
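
The plot can be complemented with a number. As a minimal sketch on made-up values (not the real measurements), `Series.corr` returns the Pearson correlation coefficient between two columns by default, ranging from -1 to 1.

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "compactness_worst": [0.1, 0.2, 0.3, 0.4],
    "concavity_worst": [0.2, 0.5, 0.5, 0.9],
})

# Pearson correlation between the two columns.
r = toy["compactness_worst"].corr(toy["concavity_worst"])

print(round(r, 2))
```

A value close to 1 means the two features rise together almost linearly.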

### Heat map 

A heat map allows us to view the correlations between all features at once.

In [None]:
plt.figure(figsize=(18,18), dpi=300)
# numeric_only=True skips the non-numeric "diagnosis" column
sns.heatmap(data_dropped.corr(numeric_only=True), annot=True, linewidth=0.5, fmt='.1f', cmap="crest")

### Drop features

We can drop some highly correlated features based on the heat map.

__radius_mean__ is highly correlated with __perimeter_mean__, __area_mean__, __concave points_mean__, __radius_worst__, __perimeter_worst__, and __area_worst__, so we keep __radius_mean__ and drop the others.

In [None]:
drop_list = ["perimeter_mean", "area_mean", "concave points_mean", "radius_worst", "perimeter_worst", "area_worst"]
data_feature_dropped = data_dropped.drop(drop_list, axis=1)
data_feature_dropped

Then, we can plot the heat map again:

In [None]:
plt.figure(figsize=(18,18), dpi=300)
# numeric_only=True skips the non-numeric "diagnosis" column
sns.heatmap(data_feature_dropped.corr(numeric_only=True), annot=True, linewidth=0.5, fmt='.1f', cmap="crest")
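
Reading a large heat map by eye is error-prone. One possible alternative, sketched here on made-up data, is to list the feature pairs whose absolute correlation exceeds a threshold. The threshold of 0.9 and the column names `a`, `b`, `c` are arbitrary choices for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy data: b is nearly 2 * a, while c is only weakly related to either.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

threshold = 0.9  # arbitrary cut-off for "highly correlated"

# Absolute correlations, so strong negative correlations are caught too.
corr = toy.corr().abs()

# Check each unordered pair of columns once.
high_pairs = [
    (x, y, corr.loc[x, y])
    for x, y in combinations(corr.columns, 2)
    if corr.loc[x, y] > threshold
]

print(high_pairs)
```

Only the pair (`a`, `b`) passes the threshold here, so one of those two columns would be a candidate to drop.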

__Exercise: more features are correlated with each other. Please drop them.__

# References 

* Kaggle - [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/)
* Kaggle DATAI - [Feature Selection and Data Visualisation](https://www.kaggle.com/code/kanncaa1/feature-selection-and-data-visualization)
* pandas - [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
* seaborn - [documentation](https://seaborn.pydata.org/tutorial.html)
* pandas - [documentation](https://pandas.pydata.org/docs/) 
* [ChatGPT](https://chat.openai.com/)