# Describing Datasets: demo

# Part 1: Basic Statistics

The sample input dataset is taken from Chapter 2 of *Machine Learning for Hackers* by Drew Conway and John Myles White.

Each sample contains three columns.
* Height in inches
* Weight in pounds
* IsMale: 1 corresponds to a male person, and 0 corresponds to a female person.

We want to compute several basic statistics for this dataset.

In [None]:
data_file = "height_weight_gender.csv"

In [None]:
import pandas as pd

data = pd.read_csv(data_file)
print(data.columns)
print(data.dtypes)

In [None]:
data.describe()

## Subset of a dataset

In [None]:
height = data["Height"]
height.head()

In [None]:
# Only women:
women = data[data["IsMale"] == 0]
w_height = women["Height"]
w_height.head()

In [None]:
# Only men:
men = data[data["IsMale"] == 1]
m_height = men["Height"]
m_height.head()

## Mean and Median

To find the mean of a single column:

In [None]:
print("MEAN of the height:")
print(height.mean())

In [None]:
print("MEAN of the men's height:")
print(m_height.mean())

In [None]:
print("MEAN of the women's height:")
print(w_height.mean())
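
The per-group means can also be obtained in a single step with `groupby`. The following is a minimal sketch on a tiny made-up table: the six rows are invented for illustration and are not taken from the dataset; only the column names `Height` and `IsMale` match the notebook.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "Height": [70.0, 68.0, 72.0, 64.0, 62.0, 66.0],
    "IsMale": [1, 1, 1, 0, 0, 0],
})

# Mean height per group: the index of the result holds the IsMale values
group_means = toy.groupby("IsMale")["Height"].mean()
print(group_means)

# The same numbers via boolean masks, as in the cells above
men_mean = toy[toy["IsMale"] == 1]["Height"].mean()
women_mean = toy[toy["IsMale"] == 0]["Height"].mean()
```

Boolean masks are convenient when the subset is reused later, while `groupby` scales better when there are many categories.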

To find a median:

In [None]:
print("MEDIAN of the height:")
print(height.median())

The mean and the median are very close to each other in this dataset.

## Difference between Mean and Median

To see the difference, let's add several new observations to the dataset, the heights of some of the tallest known men:

* 272 cm (107.087 in)
* 270 cm (106.299 in)
* 269 cm (105.906 in)
* 265 cm (104.331 in)
* 264 cm (103.937 in)

In [None]:
# create a new dataset with outliers: the five heights listed above, in inches
tallest = [107.087, 106.299, 105.906, 104.331, 103.937]

# append the new values at the end; ignore_index avoids overwriting
# an existing row label when the index of m_height is not 0..N-1
m_height_out = pd.concat(
    [m_height, pd.Series(tallest, name="Height")], ignore_index=True
)

# check that they were added
m_height_out.tail()

In [None]:
print("MEAN of the height in the original dataset:")
print(m_height.mean())
print("MEAN of the height in the dataset with outliers:")
print(m_height_out.mean())

In [None]:
print("MEDIAN of the height in the original dataset:")
print(m_height.median())
print("MEDIAN of the height in the dataset with outliers:")
print(m_height_out.median())
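
The effect is easier to see on a very small sample. The sketch below uses five invented heights (not taken from the dataset) and adds a single extreme value; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical heights in inches, chosen so the arithmetic is easy to follow
small = pd.Series([64.0, 66.0, 68.0, 70.0, 72.0])
with_outlier = pd.concat([small, pd.Series([107.0])], ignore_index=True)

mean_shift = with_outlier.mean() - small.mean()        # 74.5 - 68.0
median_shift = with_outlier.median() - small.median()  # 69.0 - 68.0
print(mean_shift, median_shift)
```

With thousands of rows the shift in the mean is much smaller, but the pattern is the same: the mean is pulled toward the extreme values, while the median hardly moves.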

## Variance and standard deviation

To find the variance and the standard deviation:

In [None]:
# Returns unbiased variance over requested axis.
# Normalized by N-1 by default: ddof = 1
height.var(ddof=1)

If this were the entire population rather than a sample, we would set ddof=0:

In [None]:
# Population variance: ddof = 0
height.var(ddof=0)

In [None]:
# Returns sample standard deviation over requested axis.
# Normalized by N-1 by default: ddof = 1
height.std(ddof=1)
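
To make the role of `ddof` concrete, here is a small sketch on eight invented values (not from the dataset). It checks that the two variances differ only by the factor (N-1)/N and that the standard deviation is the square root of the variance computed with the same `ddof`.

```python
import math
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

sample_var = x.var(ddof=1)       # 32 / (n - 1)
population_var = x.var(ddof=0)   # 32 / n

# Rescaling the sample variance by (n - 1) / n gives the population variance
rescaled = sample_var * (n - 1) / n
print(sample_var, population_var, rescaled)

# Standard deviation is the square root of the variance with the same ddof
print(x.std(ddof=1), math.sqrt(sample_var))
```

For a large N the two variances are nearly identical, so the choice matters mostly for small samples.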

## Mode

The mode of a set of values is the value that appears most often. A dataset can have more than one mode.

Because height and weight are continuous, almost every measured value is unique, so the mode is not very informative here:

In [None]:
data["Height"].mode()

The output lists every value: each one occurs only once, so every one of them is a mode. To get a more meaningful mode, we first convert the heights to whole inches:

In [None]:
# convert float to int (astype drops the fractional part, it does not round)
int_height = data["Height"].astype('int32')
int_height.dtype

In [None]:
# check that this did not harm the original dataset
data["Height"].dtype

In [None]:
# now compute basic statistics
print("MEAN of the int_height:")
print(int_height.mean())

print("MEDIAN of the int_height:")
print(int_height.median())

print("MODE of the int_height:")
print(int_height.mode())
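
Note that `astype('int32')` truncates instead of rounding, which can change which value ends up as the mode. The sketch below compares the two on six invented heights (not from the dataset); it also shows that `mode()` returns every tied value.

```python
import pandas as pd

# Hypothetical heights in inches (no value ends in .5, so rounding is unambiguous)
h = pd.Series([64.9, 65.2, 65.8, 66.1, 66.4, 67.6])

truncated = h.astype("int32")          # 64, 65, 65, 66, 66, 67
rounded = h.round().astype("int32")    # 65, 65, 66, 66, 66, 68

print(truncated.mode().tolist())  # two values tie, so both are returned
print(rounded.mode().tolist())    # a single most frequent value
```

Which conversion is appropriate depends on how the bins should be interpreted: truncation groups 65.0 to 65.99 together, rounding groups 64.5 to 65.5.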

# Part 2: Visualizing Data

## Bar charts

We use these charts to compare frequencies or counts: for example, the count of data points in different categories. They are often used for displaying nominal (categorical) data: data that has no inherent order or structure, such as hair color or preferred drink.

In our dataset we have two categories: Male and Female. 

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# ^-- this "magic" tells all future matplotlib plots to be drawn inside notebook and not in a separate window.

How many observations are there in each category? We plot the frequency count for each.

In [None]:
# create data for the plot
plotdata = pd.DataFrame({
    "totals":[m_height.count(), w_height.count()]},
    index=["Male", "Female"])
plotdata.plot(kind='bar', figsize=(5, 3))

plt.title("Total frequencies")
plt.ylabel("Count")
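
The same frequency counts can be obtained directly from the category column with `value_counts`. This is a minimal sketch on an invented five-row column (not the real dataset); the labels "Male" and "Female" are attached with `rename`.

```python
import pandas as pd

# Hypothetical toy column, not the real dataset
toy = pd.DataFrame({"IsMale": [1, 0, 1, 1, 0]})

# Count the rows for each value of IsMale, then give the values readable labels
counts = toy["IsMale"].value_counts().rename(index={1: "Male", 0: "Female"})
print(counts)

# counts.plot(kind="bar") would draw the same kind of bar chart
```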

What is the mean height and mean weight of men and women in the dataset?

We will also compute the mean body mass index (BMI):

$\text{BMI} = \dfrac{\text{weight (lb)}}{\text{height (in)}^2} \times 703$

In [None]:
# create data for the plot
males = data[data["IsMale"] == 1]
females = data[data["IsMale"] == 0]

male_height_mean = males["Height"].mean()
male_weight_mean = males["Weight"].mean()

fem_height_mean = females["Height"].mean()
fem_weight_mean = females["Weight"].mean()

# mean BMI: compute BMI for each person first, then average;
# plugging the mean weight and mean height into the formula
# gives a slightly different number
male_BMI_mean = (males["Weight"] / males["Height"]**2 * 703).mean()
fem_BMI_mean = (females["Weight"] / females["Height"]**2 * 703).mean()

plotdata = pd.DataFrame({
    "Male": [male_height_mean, male_weight_mean, male_BMI_mean],
    "Female": [fem_height_mean, fem_weight_mean, fem_BMI_mean]},
    index=["Height", "Weight", "BMI"])
plotdata.plot(kind='bar', figsize=(6, 4))

plt.title("Mean values for weight, height and BMI")
plt.ylabel("Mean values")
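
As a quick check of the BMI formula, the sketch below applies it to two invented people (not from the dataset). The helper `bmi_imperial` is a hypothetical name introduced only for this illustration. The sketch also shows that averaging per-person BMI values generally gives a slightly different number than applying the formula to the mean weight and mean height.

```python
import pandas as pd

def bmi_imperial(weight_lb, height_in):
    """BMI from weight in pounds and height in inches (hypothetical helper)."""
    return weight_lb / height_in**2 * 703

# Two hypothetical people
people = pd.DataFrame({"Weight": [150.0, 200.0], "Height": [65.0, 70.0]})

# BMI for each person, then the average of those values
per_person = bmi_imperial(people["Weight"], people["Height"])
mean_of_bmis = per_person.mean()

# Formula applied to the average weight and average height instead
bmi_of_means = bmi_imperial(people["Weight"].mean(), people["Height"].mean())

print(per_person.round(2).tolist(), round(mean_of_bmis, 2), round(bmi_of_means, 2))
```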

## Histogram
A histogram divides the values of a numerical variable into “bins” and counts the number of observations that fall into each bin. It is commonly used to get a quick, intuitive sense of the distribution of values within a variable.

In [None]:
# plot one column of data
data.hist(column="Height", bins=80, figsize=(7, 3))

plt.title("Distribution of height")
plt.xlabel("Height, in")
plt.ylabel("Frequency")

This distribution is **unimodal**.

In [None]:
# plot one column of data
data.hist(column="Weight", bins=80, figsize=(7, 3))

plt.title("Distribution of weight")
plt.xlabel("Weight, lbs")
plt.ylabel("Frequency")

This distribution is **bimodal**.
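
To make the binning step concrete, the sketch below counts six invented heights (not from the dataset) into two bins with `numpy.histogram`, which performs the same kind of equal-width binning without drawing anything.

```python
import numpy as np

# Hypothetical heights in inches
values = np.array([60, 61, 62, 65, 66, 70])

# Two equal-width bins spanning the range of the data
counts, edges = np.histogram(values, bins=2)
print(counts.tolist(), edges.tolist())
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. The number of bins is a choice: too few bins can hide a second peak, too many make the plot noisy.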

## Percentiles

In [None]:
percentiles = [0.25, 0.5, 0.75]
data['Height'].quantile(percentiles)

We now use the quartiles to identify and remove outliers by placing fences around the data.

The interquartile range (IQR) measures the spread of the middle 50% of the data. It is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the dataset. Any data point that lies more than 1.5 times the IQR below the 25th percentile or above the 75th percentile can be considered an outlier.

To identify outliers using the IQR, we use the `quantile()` method in pandas to calculate the 25th and 75th percentiles, compute the IQR from them, and then select the rows that fall outside the fences.

In [None]:
# calculate IQR for column Height
Q1 = data['Height'].quantile(0.25)
Q3 = data['Height'].quantile(0.75)
IQR = Q3 - Q1

# identify outliers
threshold = 1.5
outliers = data[(data['Height'] < Q1 - threshold * IQR) | (data['Height'] > Q3 + threshold * IQR)]
outliers.count()

Remove outliers from the dataset.

In [None]:
# drop rows containing outliers
data_clean = data.drop(outliers.index)
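
The fence arithmetic can be followed by hand on a small sample. The sketch below uses seven invented heights (not from the dataset) and relies on the default linear interpolation of `quantile`.

```python
import pandas as pd

# Hypothetical heights in inches with one obviously extreme value
s = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 95.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 63.0 and 69.0
iqr = q3 - q1                                   # 6.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 54.0 and 78.0

# Split the values into those inside the fences (inclusive) and those outside
inside = s[s.between(lower, upper)]
outside = s[~s.between(lower, upper)]
print(lower, upper, outside.tolist())
```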

## Box Plot

In [None]:
# let's define functions to find and remove outliers
def get_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # identify outliers
    threshold = 1.5
    outliers = df[(df[col] < Q1 - threshold * IQR) | (df[col] > Q3 + threshold * IQR)]
    
    return outliers.index

def remove_outliers(df, index):
    return df.drop(index)

In [None]:
col = 'Height'
outliers = get_outliers(data, col)
df_clean = remove_outliers(data, outliers)
df_clean.boxplot(column=col, vert=False)

In [None]:
# now we will compare height of male and female by removing their outliers
col = 'Height'
fem_outliers = get_outliers(data[data['IsMale']==0], col)
male_outliers =  get_outliers(data[data['IsMale']==1], col)

df_clean = remove_outliers(data, fem_outliers)
df_clean = remove_outliers(df_clean, male_outliers)

In [None]:
df_clean.boxplot(column=col, by='IsMale', vert=False)

In [None]:
# and now we will do the same for the weight
col = 'Weight'
fem_outliers = get_outliers(data[data['IsMale']==0], col)
male_outliers =  get_outliers(data[data['IsMale']==1], col)

df_clean = remove_outliers(data, fem_outliers)
df_clean = remove_outliers(df_clean, male_outliers)

df_clean.boxplot(column=col, by='IsMale', vert=False)
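
The per-group filtering above can be generalized to any grouping column. The following is a sketch under stated assumptions: `drop_group_outliers` is a hypothetical helper, not part of the notebook, and the fourteen rows are invented for illustration.

```python
import pandas as pd

def drop_group_outliers(df, col, by, threshold=1.5):
    """Drop rows whose `col` value lies outside the IQR fences of its own group.

    Hypothetical helper, not part of the notebook.
    """
    grouped = df.groupby(by)[col]
    # Per-group quartiles, mapped back onto the rows of df
    q1 = df[by].map(grouped.quantile(0.25))
    q3 = df[by].map(grouped.quantile(0.75))
    iqr = q3 - q1
    keep = df[col].between(q1 - threshold * iqr, q3 + threshold * iqr)
    return df[keep]

# Hypothetical toy data: one extreme value in each group
toy = pd.DataFrame({
    "IsMale": [0] * 7 + [1] * 7,
    "Height": [58, 60, 62, 63, 64, 66, 90, 50, 66, 68, 69, 70, 71, 72],
})
clean = drop_group_outliers(toy, "Height", "IsMale")
print(len(toy), len(clean))
```

Computing the fences per group matters because a value that is ordinary for one group can be extreme for another.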

Copyright &copy; 2024 Marina Barsky. All rights reserved.