# Explore Statistics by Data Visualization

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

We will use a small dataset that contains the Physics, Biology and Maths marks of a class of students.

Read the comma-separated values (CSV) file that contains the marks. We assign the "Names" column (the first column) as our index.

In [None]:
df = pd.read_csv("https://archive.org/download/ml-fundamentals-data/machine-learning-fundamentals-data/grades.csv",index_col=0)

Show the first 5 rows of data.

In [None]:
df.head()

Show all the data entries.

In [None]:
df

Show only the data column that you want.

In [None]:
df["Biology"].head()

Describe the dataset with summary statistics such as the count, mean, standard deviation, minimum, quartiles and maximum.

In [None]:
df.describe()

Show information about your data frame, e.g. columns and data types.

In [None]:
df.info()

Show available columns of data.

In [None]:
df.columns.values

Plot a **bar chart** of the grades data.

In [None]:
df.plot(kind="bar")

Plot a **box plot** of the grades data.

In [None]:
df.boxplot()

Plot the **histograms** of the grades data.

In [None]:
df.hist()

Plot only the histogram of "Physics" column.

In [None]:
df["Physics"].hist()

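We can plot the distribution (a histogram with a kernel density estimate) by using the **seaborn** module.

In [None]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar figure
sns.histplot(df["Physics"], kde=True)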

We can get a rough indication of how "normally distributed" a distribution is by checking its skewness. A normal distribution is symmetric, so its skewness is 0, although a skewness of 0 on its own does not prove normality.
- A skewness value close to 0 indicates a roughly symmetrical distribution of values.
- A negative skewness value indicates an asymmetric distribution whose tail is longer on the left-hand side (left skewed).
- A positive skewness value indicates an asymmetric distribution whose tail is longer on the right-hand side (right skewed).

Check the skewness of all columns or of a single column.

In [None]:
df.skew()

In [None]:
df["Physics"].skew()
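To make the number returned by `skew()` less of a black box, the sketch below recomputes it by hand. It assumes the sample-size adjusted (Fisher-Pearson) estimator, and it uses a small hypothetical series of marks rather than the grades file, so the values are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical marks, used only for illustration (not the grades.csv data).
marks = pd.Series([35, 48, 52, 55, 58, 60, 61, 63, 90, 95], dtype=float)

n = len(marks)
deviations = marks - marks.mean()
m2 = (deviations ** 2).mean()  # second central moment (biased variance)
m3 = (deviations ** 3).mean()  # third central moment

g1 = m3 / m2 ** 1.5                        # biased (population) skewness
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)   # sample-size adjusted skewness

print("manual:", G1)
print("pandas:", marks.skew())
```

The two printed values should agree up to floating-point rounding. The sign of the third central moment `m3` is what decides whether the result is reported as left or right skewed.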

# Data Transformation

In many Machine Learning modeling scenarios, **normality** of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a **Gaussian distribution** as possible in order to stabilize variance and **minimize skewness**.
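For reference, the Box-Cox transform used in the cells that follow is usually written as below, where $x$ is a strictly positive input value and $\lambda$ is the parameter that the transformer estimates from the data. The notation $x^{(\lambda)}$ is a common textbook convention and is an assumption here, not something taken from this notebook.

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
$$

Because of the logarithm and the fractional powers, Box-Cox only applies to strictly positive data. scikit-learn's `PowerTransformer` also offers `method='yeo-johnson'`, which accepts zero and negative values.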

In [None]:
from sklearn.preprocessing import PowerTransformer

In [None]:
transformer = PowerTransformer(method='box-cox', standardize=False)

In [None]:
df["Physics"].shape

The transformer expects a 2D array of shape (n_samples, n_features), so we first reshape the data with reshape().

In [None]:
data_2d=df["Physics"].values.reshape(-1,1)
data_2d.shape
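The reshape step can be seen in isolation in the short sketch below. It uses a hypothetical five-row frame rather than the grades file, and it also shows double-bracket column selection as an alternative way to get a 2D result.

```python
import numpy as np
import pandas as pd

# Hypothetical marks, used only for illustration.
frame = pd.DataFrame({"Physics": [45, 60, 72, 88, 93]})

as_1d = frame["Physics"].values          # shape (5,): a single axis
as_2d = as_1d.reshape(-1, 1)             # shape (5, 1): 5 samples, 1 feature
also_2d = frame[["Physics"]].to_numpy()  # double brackets keep the column 2D

print(as_1d.shape, as_2d.shape, also_2d.shape)
```

The `-1` tells NumPy to infer that dimension from the array length, so the same call works regardless of how many students are in the file.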

Transform the data and check the values.

In [None]:
data_trans=transformer.fit_transform(data_2d)

In [None]:
df_new=pd.DataFrame(data_trans,index=df.index)
df_new.head()
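To check what the fitted transformer actually did, the sketch below reads the estimated parameter from `lambdas_` and re-applies the Box-Cox formula by hand. It runs on synthetic log-normal data, which is an assumption made so the example is self-contained; the same check should work on the Physics column as long as all marks are strictly positive.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic, strictly positive, right-skewed data (illustration only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500).reshape(-1, 1)

pt = PowerTransformer(method="box-cox", standardize=False)
x_trans = pt.fit_transform(x)

# One lambda is estimated per feature (column).
lam = pt.lambdas_[0]

# Box-Cox: (x**lam - 1) / lam for lam != 0, and log(x) for lam == 0.
if abs(lam) > 1e-12:
    manual = (x ** lam - 1.0) / lam
else:
    manual = np.log(x)

print("estimated lambda:", lam)
print("matches manual formula:", np.allclose(manual, x_trans))
```

With `standardize=False` the output is the raw Box-Cox value. With the default `standardize=True`, the result is additionally rescaled to zero mean and unit variance, so the manual comparison above would no longer match directly.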

Plot the histogram of the transformed "Physics" marks and visualize the distribution.

In [None]:
df_new.hist()

In [None]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar figure
sns.histplot(df_new[0], kde=True)

The plotted distribution looks like a **normal distribution**, but with this dataset the effect of the transformation is not very obvious.

We will compare the **skewness** before and after to confirm that the data transformation has made the distribution more "normally distributed".

In [None]:
print("Skewness before: {}".format(df["Physics"].skew()))
print("Skewness after: {}".format(df_new.skew().squeeze()))

Now, let us try with our own generated distribution. We will generate a distribution that is **greatly skewed**.

In [None]:
from scipy.stats import skewnorm

In [None]:
rand_vars = skewnorm.rvs(5, size=10000)

In [None]:
fig, ax = plt.subplots(1, 1)
ax.hist(rand_vars)
plt.show()

In [None]:
df=pd.DataFrame(rand_vars)

In [None]:
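# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar figure
sns.histplot(df[0], kde=True)

As you can see, we have 10000 random samples, and their distribution is non-normal and greatly skewed.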

Let us check the skewness of our generated data.

In [None]:
df.skew().squeeze()

Our dataset is positively skewed.

This time, we will use a **log transform** to transform our data and observe the results.
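As a sanity check on the measured value, the sketch below compares a sample skewness with the theoretical skewness of the skew-normal distribution for the same shape parameter. The fixed `random_state` is an assumption added for reproducibility and is not part of the original cells.

```python
import numpy as np
from scipy.stats import skew, skewnorm

a = 5  # shape parameter, the same value passed to skewnorm.rvs above

# Theoretical skewness of the distribution (roughly 0.85 for a = 5).
theoretical = float(skewnorm.stats(a, moments="s"))

# Sample skewness from 10000 draws, with a fixed seed for reproducibility.
samples = skewnorm.rvs(a, size=10000, random_state=0)
sample_skew = skew(samples, bias=False)  # same adjusted estimator pandas uses

print("theoretical:", theoretical)
print("sample:     ", sample_skew)
```

The sample value fluctuates from run to run when no seed is set, but with 10000 draws it should stay close to the theoretical figure.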

In [None]:
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)

In [None]:
data_trans=transformer.transform(df)

In [None]:
df_new=pd.DataFrame(data_trans)
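`np.log1p(x)` computes ln(1 + x), so it is defined for x > -1 and maps 0 to 0. The sketch below, on a small hypothetical array, pairs it with `np.expm1` as the inverse so that transformed values can be mapped back to the original scale. Passing `inverse_func` is an addition for illustration and is not part of the cells above.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical non-negative values, used only for illustration.
x = np.array([[0.0], [0.5], [1.0], [4.0], [9.0]])

# log1p compresses large values; expm1 undoes it exactly.
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)

x_log = log_tf.fit_transform(x)
x_back = log_tf.inverse_transform(x_log)

print(x_log.ravel())
print("round trip ok:", np.allclose(x, x_back))
```

Skew-normal samples can be negative, so it is worth confirming that the minimum of the generated data is above -1 before applying `log1p`; values at or below -1 would produce `-inf` or `nan`.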

In [None]:
fig, ax = plt.subplots(1, 1)
ax.hist(data_trans)
plt.show()

In [None]:
print("Skewness before: {}".format(df.skew().squeeze()))
print("Skewness after: {}".format(df_new.skew().squeeze()))

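As you can see, the skewness is greatly reduced and the distribution more closely resembles a **normal distribution**.

In [None]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar figure
sns.histplot(df_new[0], kde=True)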