# Section Three - Statistical Analysis of Feature Data

In this section, I perform extensive statistical analysis on the 16 features extracted in Section Two from the 140 handwritten images. This analysis is divided into four parts:
1. Descriptive Statistics and Visualisations (3.1)
2. Group-Specific Summary Statistics and Discriminative Visualisations (3.2)
3. Hypothesis Testing of Group Differences (3.3)
4. Correlation and Linear Association Analysis among features (3.4)

## Library Installation
The following commands install the libraries required for Section Three.

In [None]:
%pip install pandas
%pip install numpy 
%pip install matplotlib 
%pip install seaborn 
%pip install scipy 
%pip install scikit-learn
%pip install IPython

## Import Libraries and Load Data

In [None]:
import pandas
import numpy
import matplotlib.pyplot as pyplot
import seaborn
from scipy import stats
from IPython.display import display

data = pandas.read_csv("40394874_features.csv", delimiter=',')
print("Head of Feature Data")
print(data.head())

numeric_cols = ['nr_pix', 'rows_with_1', 'cols_with_1', 'rows_with_3p', 'cols_with_3p', 'aspect_ratio', 'neigh_1', 'no_neigh_above', 'no_neigh_below', 'no_neigh_left', 'no_neigh_right', 'no_neigh_horiz', 'no_neigh_vert', 'connected_areas', 'eyes', 'custom']

data[numeric_cols] = data[numeric_cols].apply(pandas.to_numeric)
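
`pandas.to_numeric` raises a `ValueError` on the first value it cannot parse, so a malformed CSV cell would stop the notebook at this point. As a minimal sketch, assuming it is acceptable to turn unparseable cells into `NaN` and inspect them afterwards, the conversion could be checked as follows. The frame `toy` and its values are made up for illustration and are not taken from `40394874_features.csv`.

```python
import pandas

# Hypothetical stand-in for two feature columns that were read as text.
toy = pandas.DataFrame({
    "nr_pix": ["12", "15", "oops"],
    "eyes": ["0", "1", "2"],
})

# errors="coerce" replaces unparseable values with NaN instead of raising.
converted = toy.apply(pandas.to_numeric, errors="coerce")

# Count, per column, the cells that failed to convert.
bad_cells = converted.isna().sum()
print(bad_cells)
```

A non-zero count points to the column that needs cleaning before any statistics are computed on it.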

## Descriptive Statistics and Visualisations

In this section, I calculated summary statistics for each of the 16 numeric features and generated histograms for the first six features (`nr_pix` through `aspect_ratio`).

In [None]:
print("Overall Summary Statistics:")
display(data[numeric_cols].describe())

features_for_histograms = ['nr_pix', 'rows_with_1', 'cols_with_1', 'rows_with_3p', 'cols_with_3p', 'aspect_ratio']

seaborn.set_theme()

for feature in features_for_histograms:
    pyplot.figure(figsize=(8, 12))
    seaborn.histplot(data[feature], bins=25, kde=True)
    pyplot.title(f"Histogram of {feature}")
    pyplot.xlabel(feature)
    pyplot.ylabel("Frequency")
    pyplot.show()
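
The "Frequency" axis of each histogram is a count of observations per bin. The sketch below shows the underlying binning with `numpy.histogram`, assuming `seaborn.histplot` treats an integer `bins` argument the same way (equal-width bins spanning the observed range). The array `values` is made up and only mimics the 140-row size of the dataset.

```python
import numpy

# Made-up feature values standing in for one column of the 140-row table.
values = numpy.arange(140) % 37

# Split the observed range into 25 equal-width bins and count values per bin.
counts, edges = numpy.histogram(values, bins=25)

print(counts.sum())  # every value lands in exactly one bin
print(len(edges))    # 25 bins are delimited by 26 edges
```

With only 140 observations spread over 25 bins, individual bars can be based on very few images, which is worth keeping in mind when reading the shapes.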

## Group-Specific Summary Statistics and Discriminative Visualisations
I divided the dataset into two groups, letters (symbols a-j) and non-letters (all other symbols), and computed summary statistics (mean, standard deviation, and median) for all 16 features within each group. I then created violin plots for the most significant features: `aspect_ratio`, `connected_areas`, and `custom`.

In [None]:
letters = list("abcdefghij")
data['group'] = data['label'].apply(lambda x: 'letter' if x in letters else 'non-letter')
print("Group Counts:")
print(data['group'].value_counts())

group_stats = data.groupby('group')[numeric_cols].agg(['mean', 'std', 'median'])
print("Descriptive Statistics by Group:")
for feature in numeric_cols:
    print(f"{feature}\n\n")
    display(group_stats[feature])

for feature in ['aspect_ratio', 'connected_areas', 'custom']:
    pyplot.figure(figsize=(8, 6))
    seaborn.violinplot(x='group', y=feature, data=data)
    pyplot.title(f"Violin plot of {feature} by Group")
    pyplot.xlabel("Group")
    pyplot.ylabel(feature)
    pyplot.tight_layout()
    pyplot.show()
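
The grouped table built above has two-level columns, which is why `group_stats[feature]` returns a small per-feature table. The following sketch reproduces that structure on a made-up four-row frame; the labels `"one"` and `"two"` are hypothetical non-letter symbols, not values confirmed from the dataset.

```python
import pandas

# Made-up miniature of the feature table: one feature, four symbols.
toy = pandas.DataFrame({
    "label": ["a", "b", "one", "two"],
    "nr_pix": [10, 14, 20, 30],
})
letters = list("abcdefghij")
toy["group"] = toy["label"].apply(lambda x: "letter" if x in letters else "non-letter")

# agg with a list of functions yields two-level columns: (feature, statistic).
toy_stats = toy.groupby("group")[["nr_pix"]].agg(["mean", "std", "median"])

# Selecting the top level gives one row per group and one column per statistic.
nr_pix_stats = toy_stats["nr_pix"]
print(nr_pix_stats)
```

Because `letters` is a list, the membership test matches whole labels only, so a multi-character label such as `"one"` falls into the non-letter group.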

## Hypothesis Testing of Group Differences
In this section, I performed statistical tests to determine whether the differences in feature values between letters and non-letters are statistically significant. For each of the 16 features, I conducted an independent two-sample t-test that does not assume equal variances (Welch's t-test), comparing the two groups.

In [None]:
print("Hypothesis Testing (t-tests) for all features:")

hypothesis_results = {}

for feature in numeric_cols:
    group_letters = data[data['group'] == 'letter'][feature]
    group_nonletters = data[data['group'] == 'non-letter'][feature]
    t_stat, p_val = stats.ttest_ind(group_letters, group_nonletters, equal_var=False)
    # Store raw floats so very small p-values are not rounded to zero later on
    hypothesis_results[feature] = {'t-statistic': t_stat, 'p-value': p_val}
    print(f"{feature}: t-statistic = {t_stat:.5f}, p-value = {p_val:.5g}")

print('\n\n')

hypothesis_data = pandas.DataFrame(hypothesis_results).T
hypothesis_data = hypothesis_data.reset_index().rename(columns={'index': 'Feature'})
display(hypothesis_data)

features = list(hypothesis_results.keys())
p_values = [float(hypothesis_results[feat]['p-value']) for feat in features]

pyplot.figure(figsize=(6, 16))
pyplot.bar(features, p_values)
pyplot.axhline(0.05, linestyle='--', label="p = 0.05")
pyplot.xticks(rotation=90, ha='right')
pyplot.xlabel("Features")
pyplot.ylabel("p-value")
pyplot.title("p-values from t-tests for each feature")
pyplot.legend()
pyplot.tight_layout()
pyplot.show()

for feature in ['no_neigh_horiz', 'eyes', 'rows_with_1']:
    pyplot.figure(figsize=(8, 6))
    seaborn.violinplot(x='group', y=feature, data=data)
    pyplot.title(f"Violin plot of {feature} by Group")
    pyplot.xlabel("Group")
    pyplot.ylabel(feature)
    pyplot.tight_layout()
    pyplot.show()
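
Calling `stats.ttest_ind` with `equal_var=False` corresponds to Welch's t-test. As a rough check of what that computes, the sketch below evaluates the statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with SciPy. The two sample arrays and the helper `welch_t` are made up for illustration and are not part of the original analysis.

```python
import numpy
from scipy import stats

# Made-up samples standing in for one feature in each group.
letters_sample = numpy.array([4.0, 5.0, 6.0, 7.0, 8.0])
nonletters_sample = numpy.array([1.0, 2.0, 2.0, 3.0, 9.0, 10.0])

def welch_t(x, y):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Variance of each sample mean, from the sample variance (ddof=1).
    var_mean_x = x.var(ddof=1) / len(x)
    var_mean_y = y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / numpy.sqrt(var_mean_x + var_mean_y)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (var_mean_x + var_mean_y) ** 2 / (
        var_mean_x ** 2 / (len(x) - 1) + var_mean_y ** 2 / (len(y) - 1)
    )
    return t, dof

t_manual, dof = welch_t(letters_sample, nonletters_sample)
# Two-sided p-value from the survival function of the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(letters_sample, nonletters_sample, equal_var=False)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The manual and SciPy values should agree to floating-point precision, which confirms that the test does not pool the two group variances.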

## Correlation and Linear Association Analysis among features
In this section, I investigated the linear relationships among the 16 features extracted from the handwritten images. Using the Python Data Analysis Library (pandas), I computed a matrix of Pearson's correlation coefficients, which quantifies the degree to which each pair of features varies together, and generated scatter plots for the pairs with strong correlation (|r| > 0.7).

In [None]:
pearson_correlation_matrix = data[numeric_cols].corr()

pyplot.figure(figsize=(12, 10))
seaborn.heatmap(pearson_correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
pyplot.title("Heatmap of Pearson Correlation Coefficients")
pyplot.tight_layout()
pyplot.show()

high_correlation_pairs = []
for i, feat1 in enumerate(numeric_cols):
    for feat2 in numeric_cols[i+1:]:
        correlation_value = pearson_correlation_matrix.loc[feat1, feat2]
        if abs(correlation_value) > 0.7:
            high_correlation_pairs.append((feat1, feat2, correlation_value))

print("Highly Correlated Feature Pairs (|r| > 0.7):")
for pair in high_correlation_pairs:
    print(f"{pair[0]} and {pair[1]}: correlation = {pair[2]:.2f}")

print("\n")

for pair in high_correlation_pairs:
    feat1, feat2, _ = pair
    pyplot.figure(figsize=(8, 6))
    seaborn.scatterplot(x=feat1, y=feat2, hue='group', data=data)
    pyplot.title(f"Scatter Plot: {feat1} vs. {feat2}")
    pyplot.xlabel(feat1)
    pyplot.ylabel(feat2)
    pyplot.tight_layout()
    pyplot.show()
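
Each entry of the correlation matrix is a Pearson coefficient, the covariance of two features divided by the product of their standard deviations. A minimal sketch, using a made-up five-row frame and a hypothetical helper `pearson_r`, compares a hand computation with `DataFrame.corr()`:

```python
import numpy
import pandas

# Made-up values for two features; not taken from the real feature file.
toy = pandas.DataFrame({
    "nr_pix": [10.0, 14.0, 20.0, 30.0, 33.0],
    "rows_with_1": [1.0, 2.0, 2.0, 4.0, 5.0],
})

def pearson_r(x, y):
    """Pearson correlation from deviations about the mean."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / numpy.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson_r(toy["nr_pix"], toy["rows_with_1"])
# DataFrame.corr() uses the Pearson method by default.
r_pandas = toy.corr().loc["nr_pix", "rows_with_1"]
print(r_manual, r_pandas)
```

Pearson's coefficient only measures linear association, so a pair that falls below the 0.7 threshold can still be related in a non-linear way.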