# Explore Statistics by Data Visualization

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import PowerTransformer
from scipy.stats import skewnorm
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt
%matplotlib inline

We will use a small dataset that contains the Physics, Biology and Maths marks of a classroom of students.

Read the comma-separated values (CSV) file that contains the marks. We assign the "Names" column as our index.

In [None]:
df = pd.read_csv("https://archive.org/download/ml-fundamentals-data/machine-learning-fundamentals-data/grades.csv",index_col=0)

Show the first 5 rows of data.

In [None]:
df.head()

Show all the data entries.

In [None]:
df

Show the first 5 rows of the ```Biology``` data column.

In [None]:
df["Biology"].head()

Describe the dataset.
- count: Number of valid (non-null) values. Null values are ignored
- mean: Mean value of data
- std: Standard deviation of data
- min: Minimum value of data
- 25%, 50%, 75%: Percentiles of data. These can be changed by passing ```percentiles``` as a list. Default value is [.25, .5, .75]
- max: Maximum value of data

In [None]:
df.describe()
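
The ```percentiles``` parameter mentioned above can be tried on a tiny example. This is a minimal sketch that assumes a made-up `marks` frame (it is not the grades data) and asks for the 10th, 50th and 90th percentiles instead of the defaults.

```python
import pandas as pd

# Hypothetical marks, used only for illustration (not the grades.csv data).
marks = pd.DataFrame({
    "Physics": [40, 55, 60, 70, 85],
    "Maths": [35, 50, 65, 80, 95],
})

# Request the 10th, 50th and 90th percentiles instead of the defaults.
summary = marks.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The percentile rows in the result are named after the requested values, so they can be read back with `summary.loc["10%"]`.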

Show information about your data frame, e.g. columns and data types.

In [None]:
df.info()

Show available columns of data.

In [None]:
df.columns.values

Plot a **bar chart** of the grades data.

In [None]:
df.plot(kind="bar")

Plot a **box plot** of the grades data.

### Boxplot parameters
<img src="https://matplotlib.org/3.2.2/_images/boxplot_explanation.png" width="500"/>

[Image Source: Matplotlib](https://matplotlib.org/3.2.2/faq/howto_faq.html)

### Understanding boxplot
- A box plot is a way to display the distribution of data.
- The interquartile range (IQR) spans the middle 50% of the data, from the first quartile (Q1) to the third quartile (Q3). We can use it to observe the spread of the data. In other words, the data is concentrated in the IQR.
- Simple interpretation:
    - For Biology, most students score between 59 and 79.
    - Also, someone scored below the expected minimum. Note that the expected minimum is only a box plot indicator, not the minimum score in the data. It is calculated using the formula above. A score below it is called a lower outlier, which here means that someone did much worse than the rest of the class. A score above the expected maximum is called an upper outlier.
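
The expected minimum and maximum can also be computed by hand. The sketch below assumes the 1.5 × IQR rule that matplotlib uses by default for the whiskers, and a made-up `scores` array rather than the real Biology marks.

```python
import numpy as np

# Hypothetical scores, used only for illustration.
scores = np.array([20, 55, 59, 62, 66, 70, 74, 79, 82, 85])

# First and third quartiles, and the interquartile range between them.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule (matplotlib's default).
expected_min = q1 - 1.5 * iqr
expected_max = q3 + 1.5 * iqr

# Any score outside the limits is drawn as an outlier point.
outliers = scores[(scores < expected_min) | (scores > expected_max)]
print(q1, q3, iqr, expected_min, expected_max, outliers)
```

In this made-up sample only the score of 20 falls below the expected minimum, so it would be drawn as a lower outlier.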

Show the box plot.

In [None]:
df.boxplot()

### Histogram parameters
- x-axis: observed value
- y-axis: frequency of occurrences

### Understanding histogram
- A histogram shows how frequently values fall within each range (bin).
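
To make the idea of binned frequencies concrete, here is a small sketch that counts values per bin with `np.histogram`. The `marks` array and the bin edges are made up for illustration.

```python
import numpy as np

# Hypothetical marks, used only for illustration.
marks = np.array([35, 42, 48, 51, 55, 58, 63, 67, 72, 88])

# Count how many marks fall into each bin: [30, 50), [50, 70), [70, 90].
counts, bin_edges = np.histogram(marks, bins=[30, 50, 70, 90])

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

These counts are the bar heights a histogram plot would draw for the same bins.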

Plot the **histograms** of the grades data.

In [None]:
df.hist()

Plot only the histogram of "Physics" column.

In [None]:
df["Physics"].hist()

We can plot a distribution plot by using the **seaborn** module.

In [None]:
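# distplot is deprecated in newer seaborn releases; sns.histplot(df["Physics"], kde=True) is a close replacement
sns.distplot(df["Physics"])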

We can check how "normally distributed" a distribution is by checking the skewness of the distribution.
- A skewness value of 0 indicates a symmetrical distribution of values.
- A negative skewness value indicates an asymmetric distribution whose tail is longer on the left-hand side (left skewed).
- A positive skewness value indicates an asymmetric distribution whose tail is longer on the right-hand side (right skewed).
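
The three cases can be checked on small samples. This is a minimal sketch with made-up series; `left_skewed` is simply the mirror image of `right_skewed`.

```python
import pandas as pd

# Hypothetical samples, used only for illustration.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])   # long tail towards large values
left_skewed = pd.Series([1, 8, 9, 9, 10, 10])   # long tail towards small values

print("symmetric:   ", symmetric.skew())
print("right skewed:", right_skewed.skew())
print("left skewed: ", left_skewed.skew())
```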

Check the skewness of all columns or only 1 column.

In [None]:
df.skew()

In [None]:
df["Physics"].skew()

# Data Transformation 1 - Skewness

In many Machine Learning modeling scenarios, **normality** of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a **Gaussian distribution** as possible in order to stabilize variance and **minimize skewness**.

In [None]:
# Box-Cox only works on strictly positive values
transformer = PowerTransformer(method='box-cox', standardize=False)
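
As a rough sketch of what the Box-Cox method computes: for a fitted parameter λ, each value x is mapped to (x^λ - 1) / λ, or to log(x) when λ is 0. The example below uses made-up positive values (not the marks) and compares a manual calculation against `PowerTransformer`, reading the fitted λ from its `lambdas_` attribute.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical, strictly positive and right-skewed values (illustration only).
x = np.array([[1.0], [2.0], [3.0], [5.0], [8.0], [13.0], [21.0]])

pt = PowerTransformer(method="box-cox", standardize=False)
x_trans = pt.fit_transform(x)

# The lambda chosen by the transformer for the single column.
lam = pt.lambdas_[0]

# Apply the Box-Cox formula by hand with the same lambda.
if abs(lam) > 1e-12:
    manual = (x ** lam - 1) / lam
else:
    manual = np.log(x)

print("fitted lambda:", lam)
print("matches manual formula:", np.allclose(x_trans, manual))
```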

In [None]:
df["Physics"].shape

Currently, the data looks like this:
```example_data = [1, 2, 3, 4, 5, 6]```
The example_data has 6 elements arranged as a single 1D array (also known as a vector).

In [None]:
# View df["Physics"] as a list

print(df["Physics"].tolist())

The transformer accepts only a 2D array, which is not compatible with our 1D data. We will first change the shape of the data with reshape().

In [None]:
data_2d = df["Physics"].values.reshape(-1,1)
data_2d.shape

Each value in the 1D array becomes its own row containing a single element, and the rows are stacked together to produce an array of shape (26, 1).
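
The same reshape can be seen on the six-element example from earlier. This is a small sketch with a made-up `vector`; the `-1` asks NumPy to infer the number of rows.

```python
import numpy as np

# The six-element example from the text, as a 1D array.
vector = np.array([1, 2, 3, 4, 5, 6])

# -1 lets NumPy infer the row count; 1 fixes a single column.
column = vector.reshape(-1, 1)

print(vector.shape)     # shape of the 1D array
print(column.shape)     # shape of the 2D column array
print(column.tolist())  # each value now sits in its own row
```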

In [None]:
# View the reshaped data as a list

print(data_2d.tolist())

Transform the data and check the values.

In [None]:
data_trans = transformer.fit_transform(data_2d)

In [None]:
df_new = pd.DataFrame(data_trans,index=df.index)
df_new.head()

Rename the column to ```Physics``` to match the original. This step is not required, but if you have more than one column, renaming is good practice.

In [None]:
df_new.rename(columns={0: "Physics"}, inplace=True)
df_new.head()

Plot the histogram of the transformed "Physics" marks and visualize the distribution.

In [None]:
df_new.hist()

In [None]:
splot = sns.distplot(df_new)

The plotted distribution looks like a **normal distribution**, but with this dataset the effect of the transformation is not very obvious.

We will compare the **skewness** to assert that the data transformation has made the distribution more "normally distributed".

In [None]:
print("Skewness before: {}".format(df["Physics"].skew()))
print("Skewness after: {}".format(df_new.skew().squeeze())) # squeeze() converts the one-element Series to a scalar

Now, let us try with our own generated distribution. We will generate a distribution that is **greatly skewed**.

In [None]:
# skewnorm.rvs draws random samples from a skew-normal distribution (shape parameter 5 gives a right skew)

rand_vars = skewnorm.rvs(5, size=10000)

In [None]:
fig, ax = plt.subplots(1, 1)
ax.hist(rand_vars)
plt.show()

In [None]:
df = pd.DataFrame(rand_vars)

In [None]:
sns.distplot(df)

As you can see, we have 10000 random samples, and their distribution is non-normal and greatly skewed.

Let us check the skewness of our generated data.

In [None]:
df.skew().squeeze()

Our dataset is positively skewed. This time, we will use a **log transform** to transform our data and observe the results.

In [None]:
transformer = FunctionTransformer(np.log1p, validate=True)
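
For reference, `np.log1p(x)` computes log(1 + x), which keeps zero at zero and compresses large values much more than small ones. The sketch below uses a few made-up values and also shows `np.expm1` as the inverse.

```python
import numpy as np

# Hypothetical non-negative values, used only for illustration.
values = np.array([0.0, 1.0, 9.0, 99.0])

# log1p(x) is log(1 + x); large values are pulled in far more than small ones.
logged = np.log1p(values)

# expm1 undoes log1p, so the original values can be recovered.
recovered = np.expm1(logged)

print(logged)
print(recovered)
```

Because the right-hand tail is compressed the most, this kind of transform tends to reduce positive skew.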

In [None]:
data_trans = transformer.fit_transform(df)

In [None]:
df_new = pd.DataFrame(data_trans)

In [None]:
fig, ax = plt.subplots(1, 1)
ax.hist(data_trans)
plt.show()

In [None]:
print("Skewness before: {}".format(df.skew().squeeze()))
print("Skewness after: {}".format(df_new.skew().squeeze()))

As you can see, the skewness is greatly reduced and the distribution resembles a **normal distribution**.

In [None]:
sns.distplot(df_new)

# Data Transformation 2 - One-hot encoding

Another frequently encountered case is data whose values are not numbers. The example below shows a string dataset. Sklearn algorithms cannot process string data directly, so the data needs to be represented as numbers. One method is to perform one-hot encoding.

In [None]:
x = ["Jack", "Jill", "Mary", "Jack", "Jill", "Jill"]
df = pd.DataFrame(x, columns=["Name"])

get_dummies will create one new column for each unique value in the specified column.

In [None]:
df_one_hot = pd.get_dummies(df)

In [None]:
df_one_hot

As seen above, every unique value is converted to a column, and for each row the column that matches its value is marked as 1 (shown as True in recent pandas versions).
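
If 0/1 integers are wanted regardless of the pandas version, the `dtype` parameter of `get_dummies` can be set explicitly. This is a minimal sketch that rebuilds the same names in a separate `names` frame.

```python
import pandas as pd

# Same names as in the example above, rebuilt here so the block is self-contained.
names = pd.DataFrame({"Name": ["Jack", "Jill", "Mary", "Jack", "Jill", "Jill"]})

# dtype=int forces 0/1 integers instead of booleans.
encoded = pd.get_dummies(names, dtype=int)
print(encoded)

# Every row has exactly one column set to 1.
print(encoded.sum(axis=1).tolist())
```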

# Data Transformation 3 - Ordinal Variables

This technique transforms the data by assigning labels according to ranges or data groups. An ordinal variable is similar to a categorical variable, except that there is a sense of order to the labelled data.

For example, in clothing, rather than dealing with continuous measurements such as chest size or shirt length, consumers use labels such as small, medium and large to define the size. The order is from small to large.

The example below applies this to household income ranges in Malaysia. The defined labels are based on 2019 income thresholds. [Source](https://ringgitplus.com/en/blog/personal-finance-news/dosm-survey-higher-income-thresholds-for-b40-m40-t20-households-in-2019.html#:~:text=According%20to%20the%20report%2C%20the,960%20and%20above%20for%20T20.)

In [None]:
# Generate 10 random integers in the range 1000 to 20000 representing household incomes
min_limit = 1000
max_limit = 20000
a = np.random.randint(low=min_limit, high=max_limit, size=10) 
df = pd.DataFrame(a, columns=["data"])
df

In [None]:
bins = [min_limit, 4850, 10959, max_limit]
labels = ["B40", "M40", "T20"]

df["encoded"] = pd.cut(df["data"], bins=bins, labels=labels, include_lowest=True)
df
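
To see how values on the bin edges are treated, and how the order of the labels is kept, here is a small sketch using the same bins with hand-picked incomes. The `income` series is made up; `cat.codes` is used to show the integer rank behind each label.

```python
import pandas as pd

# Hand-picked incomes on and around the bin edges (illustration only).
income = pd.Series([1000, 4850, 4851, 10959, 10960, 20000])

bins = [1000, 4850, 10959, 20000]
labels = ["B40", "M40", "T20"]

# Bins are right-inclusive by default; include_lowest=True also keeps the value 1000.
groups = pd.cut(income, bins=bins, labels=labels, include_lowest=True)

# The result is an ordered categorical, so each label has an integer rank.
codes = groups.cat.codes

print(pd.DataFrame({"income": income, "group": groups, "code": codes}))
```

Because the bins are right-inclusive, an income exactly equal to an upper edge (for example 4850) is placed in the lower group.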

Now all the data are encoded according to the bin ranges. You can then one-hot encode them.

In [None]:
df_one_hot = pd.get_dummies(df["encoded"])
df = pd.concat([df["data"], df_one_hot], axis=1)
df

Now each row of the dataframe is represented by one-hot columns for its income label.

# Data Transformation 4 - Handle Missing data
In real-world datasets, there will be missing data, which is most often represented as NaN.

Not all missing data can be handled. One example is string data such as an address or a product name. There is no way to approximate such a value from the data itself.

Adding artificial data may or may not work well, as it can introduce noise, especially if the missing data is the label. Thus, one handling method is to remove the row. Handling missing data is largely a matter of experimentation to see which approach works best.

In this example, we are going to look at the sklearn function called SimpleImputer.

Consider the data below.

In [None]:
# Generate random data with NaN
a = np.random.randn(6)
idx = np.random.randint(len(a)) 
a[idx] = np.nan

In [None]:
df = pd.DataFrame(a, columns=["My Data"])
df

In [None]:
imp = SimpleImputer(strategy="mean")
df_fixed = imp.fit_transform(df)

In [None]:
df_fixed

### String data

In [None]:
a = ["jack", "jill", "john", "ali", np.nan, "jack"]
df = pd.DataFrame(a, columns=["name"])

In [None]:
imp = SimpleImputer(strategy="constant", fill_value="abu")
df_fixed = imp.fit_transform(df)
print(df_fixed)

SimpleImputer provides four built-in strategies to impute missing data:
- mean: replace with the mean of each column
- median: replace with the median of each column
- most_frequent: replace with the most frequent value of each column. Can be used for strings or numbers
- constant: replace with a specified fill_value. Can be used for strings or numbers

It is important to understand that handling missing data is only an approximation. As you can see with SimpleImputer, the strategies (except constant) are based on statistics that depend on the overall column values. Imputing values from "noisy" data will add more noise to the data.
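
The difference between the strategies shows up clearly on a column with one extreme value. This is a minimal sketch with a made-up column `X`; the fill value of 0 for the constant strategy is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value and one extreme value (10.0).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

results = {}
for strategy in ["mean", "median", "most_frequent"]:
    imp = SimpleImputer(strategy=strategy)
    filled = imp.fit_transform(X)
    results[strategy] = float(filled[3, 0])  # the value that replaced NaN

# The constant strategy needs an explicit fill_value (0 is an arbitrary choice).
imp = SimpleImputer(strategy="constant", fill_value=0)
results["constant"] = float(imp.fit_transform(X)[3, 0])

for strategy, value in results.items():
    print(f"{strategy}: {value}")
```

The mean is pulled upwards by the extreme value, while the median and the most frequent value are not, which is one reason to try more than one strategy.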