# BIOM/SYSC5405 – Pattern Classification and Experiment Design - Assignment 1


For this assignment, I will be using the following packages:


1.   NumPy
2.   pandas
3.   Plotly
4.   Matplotlib
5.   Seaborn
6.   SciPy
7.   tabulate



# Libraries and Dependencies


In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
import scipy.stats as stats
from scipy.stats import pearsonr

# An overall overview of the data

**The next two cells are for illustration purposes.**


1.   Loading the data into a pandas DataFrame
2.   Initial data visualization to understand its structure and contents



In [None]:
data = pd.read_csv('assigData1.csv')

# Subplot titles, listed in the same order as the columns plotted below
titles = ("Apple Diameter", "Apple Weight", "Orange Diameter", "Orange Weight", "Grape Diameter", "Grape Weight")
columns = ['D_apple', 'W_apple', 'D_orange', 'W_orange', 'D_grape', 'W_grape']

fig = make_subplots(rows=2, cols=3, subplot_titles=titles)

# Fill the 2x3 grid row by row so that each histogram sits under its matching title
for i, column in enumerate(columns):
    fig.add_trace(
        go.Histogram(x=data[column], name=column),
        row=i // 3 + 1, col=i % 3 + 1
    )

fig.show()

# Question 1 - a


### i) Assuming the class-conditional distributions follow multivariate normal distributions with unknown mean and covariance matrix for each class, estimate the three means and the three covariance matrices.

#### **Solution For Q1 - a - i**

**Mean vector $(\mu_i)$**


$
\mu_i = \begin{bmatrix} \mu_{d_i} \\ \mu_{w_i} \end{bmatrix} = \frac{1}{n_i} \sum_{j=1}^{n_i} \begin{bmatrix} d_{i,j} \\ w_{i,j} \end{bmatrix}
$

**Covariance matrix $(\Sigma_i)$**

$
\Sigma_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left( \begin{bmatrix} d_{i,j} \\ w_{i,j} \end{bmatrix} - \mu_i \right) \left( \begin{bmatrix} d_{i,j} \\ w_{i,j} \end{bmatrix} - \mu_i \right)^T
$

**where:**
*   $n_i$ is the number of observations for fruit $i$
*   $d_{i,j}$ and $w_{i,j}$ are the diameter and weight of the $j$-th observation for fruit $i$


---

<br></br>
*Instead of implementing the equations above with explicit for-loops, we can use the built-in pandas or NumPy functions to compute the mean and covariance. The pandas `DataFrame.cov()` method divides by $n_i - 1$ by default, which matches the formula above.*
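To make the link between the equations and the built-in functions concrete, here is a minimal sketch that evaluates both formulas directly and compares the results with pandas. The data are synthetic and the column names `D_demo` and `W_demo` are placeholders, not the assignment data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the assignment data): 200 diameter/weight pairs
rng = np.random.default_rng(0)
samples = pd.DataFrame({
    "D_demo": rng.normal(50.0, 10.0, size=200),
    "W_demo": rng.normal(5.0, 1.5, size=200),
})

X = samples.to_numpy()   # shape (n, 2), one observation per row
n = X.shape[0]

# Mean vector: average of the observation vectors
mu = X.sum(axis=0) / n

# Covariance matrix: sum of outer products of the centred observations, divided by n - 1
centred = X - mu
sigma = centred.T @ centred / (n - 1)

# Both should agree with the pandas built-ins up to floating-point error
print(np.allclose(mu, samples.mean().to_numpy()))
print(np.allclose(sigma, samples.cov().to_numpy()))
```

The matrix product `centred.T @ centred` is just a vectorized way of summing the outer products in the covariance formula.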

In [None]:
def pretty_print(covar, mean, fruit_type):
  # Label each table with the fruit it describes
  print(f"{fruit_type} Mean Vector:")
  print(tabulate(mean.reset_index(), headers=["Feature", "Mean"], tablefmt="pretty"))

  print(f"\n{fruit_type} Covariance Matrix:")
  print(tabulate(covar, headers=covar.columns.to_list(), showindex=covar.columns.to_list(), tablefmt="pretty"))

  print("\n#---------------------------------------------------------#\n")

apple_cols = ['D_apple', 'W_apple']
orange_cols = ['D_orange', 'W_orange']
grape_cols = ['D_grape', 'W_grape']

# Mean vectors and covariance matrices as NumPy arrays for the later questions
apple_mean = data[apple_cols].mean().to_numpy()
orange_mean = data[orange_cols].mean().to_numpy()
grape_mean = data[grape_cols].mean().to_numpy()

apple_cov = data[apple_cols].cov().to_numpy()
orange_cov = data[orange_cols].cov().to_numpy()
grape_cov = data[grape_cols].cov().to_numpy()



pretty_print(data[apple_cols].cov(), data[apple_cols].mean(), "Apple")
pretty_print(data[orange_cols].cov(), data[orange_cols].mean(), "Orange")
pretty_print(data[grape_cols].cov(), data[grape_cols].mean(), "Grape")


Apple Mean Vector:
+---+---------+------------------+
|   | Feature |       Mean       |
+---+---------+------------------+
| 0 | D_apple | 53.0789582921689 |
| 1 | W_apple | 4.9216130832437  |
+---+---------+------------------+

Apple Covariance Matrix:
+---------+--------------------+--------------------+
|         |      D_apple       |      W_apple       |
+---------+--------------------+--------------------+
| D_apple | 172.9455017242476  | 1.6133534466988508 |
| W_apple | 1.6133534466988508 | 2.9256289979343975 |
+---------+--------------------+--------------------+

#---------------------------------------------------------#

Orange Mean Vector:
+---+----------+-------------------+
|   | Feature  |       Mean        |
+---+----------+-------------------+
| 0 | D_orange | 79.6807244180421  |
| 1 | W_orange | 5.963264321698779 |
+---+----------+-------------------+

Orange Covariance Matrix:
+----------+--------------------+--------------------+
|          |      D_orange      |    



---





---





---



### ii) For each estimated covariance matrix, compute the determinant and the trace.

#### **Solution For Q1 - a - ii**




---


*   *Determinant*

For a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$:

$
\det(A) = ad - bc
$



---


*    *Trace*

For an $n \times n$ matrix $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$:

$
\text{tr}(A) = \sum_{i=1}^n a_{ii}
$
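As a quick sanity check of the two formulas, the sketch below applies them by hand to the apple covariance matrix reported in part i and compares the results with NumPy. The variable name `apple_cov_demo` is only used for this illustration.

```python
import numpy as np

# Apple covariance matrix, copied from the output of part i
apple_cov_demo = np.array([
    [172.9455017242476, 1.6133534466988508],
    [1.6133534466988508, 2.9256289979343975],
])

# Unpack the four entries a, b, c, d of the 2x2 matrix
(a, b), (c, d) = apple_cov_demo

det_by_hand = a * d - b * c   # det(A) = ad - bc
trace_by_hand = a + d         # tr(A) = sum of the diagonal entries

print(det_by_hand, np.linalg.det(apple_cov_demo))
print(trace_by_hand, np.trace(apple_cov_demo))
```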



---


However, we can simply use NumPy's built-in functions, `np.linalg.det` and `np.trace`. See below:

In [None]:
def compute_determinant_and_trace(cov_matrix, fruit_type):
    determinant = np.linalg.det(cov_matrix)
    trace = np.trace(cov_matrix)
    print(f"{fruit_type} Determinant:", determinant)
    print(f"{fruit_type} Trace:", trace)

compute_determinant_and_trace(apple_cov, "Apple")
compute_determinant_and_trace(orange_cov, "Orange")
compute_determinant_and_trace(grape_cov, "Grape")

Apple Determinant: 503.37146556279697
Apple Trace: 175.87113072218202
Orange Determinant: 14.771560120742006
Orange Trace: 20.198321183965867
Grape Determinant: 19.73945780801918
Grape Trace: 39.14443348556851




---





---





---



### iii) For each estimated covariance matrix, compute the eigenvectors and eigenvalues.

#### **Solution For Q1 - a - iii**



---



*   *Eigenvalue*


The eigenvalues $\lambda$ of a square matrix $A$ are the solutions of the characteristic equation:

$$
\det(A - \lambda I) = 0
$$

where $I$ is the identity matrix with the same dimension as $A$.


---

*   *Eigenvector*

For a matrix $A$ and an eigenvalue $\lambda$, an eigenvector $\mathbf{v}$ is any non-zero vector that satisfies:

$$
(A - \lambda I) \mathbf{v} = 0
$$
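The two definitions can be checked numerically. The sketch below uses `np.linalg.eigh`, NumPy's routine for symmetric matrices (a different routine from the `np.linalg.eig` call used in this notebook), on the apple covariance matrix from part i, and confirms that each pair satisfies $(A - \lambda I)\mathbf{v} = 0$. The name `apple_cov_demo` is only used for this illustration.

```python
import numpy as np

# Apple covariance matrix, copied from the output of part i
apple_cov_demo = np.array([
    [172.9455017242476, 1.6133534466988508],
    [1.6133534466988508, 2.9256289979343975],
])

# eigh assumes a symmetric matrix and returns the eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(apple_cov_demo)

# Eigenvectors are stored as columns, so iterate over the transpose
residual_norms = []
for lam, v in zip(eigenvalues, eigenvectors.T):
    residual = (apple_cov_demo - lam * np.eye(2)) @ v
    residual_norms.append(np.linalg.norm(residual))
    print(lam, residual_norms[-1])

# The eigenvalues also tie back to part ii: their sum is the trace, their product the determinant
print(np.isclose(eigenvalues.sum(), np.trace(apple_cov_demo)))
print(np.isclose(eigenvalues.prod(), np.linalg.det(apple_cov_demo)))
```

Because `eigh` sorts the eigenvalues in ascending order, they appear in the opposite order to the output printed in this section, and eigenvectors may differ by a sign.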



---

Again, we don't need to implement these from scratch and can simply use NumPy's built-in functions:

In [None]:
def compute_eigenvalues_and_vectors(cov_matrix, fruit_type):
    eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
    print(fruit_type)
    print("Eigenvalues:", eigenvalues)
    print("Eigenvectors:\n", tabulate(eigenvectors, tablefmt="pretty"))
    print("\n#---------------------------------------------------------#\n")

    return eigenvalues, eigenvectors




Apple_eival, Apple_eivec = compute_eigenvalues_and_vectors(apple_cov, "Apple")
Orange_eival, Orange_eivec = compute_eigenvalues_and_vectors(orange_cov, "Orange")
Grape_eival, Grape_eivec = compute_eigenvalues_and_vectors(grape_cov, "Grape")

Apple
Eigenvalues: [172.96080979   2.91032093]
Eigenvectors:
 +----------------------+-----------------------+
|  0.9999549886385958  | -0.009487923734195155 |
| 0.009487923734195155 |  0.9999549886385958   |
+----------------------+-----------------------+

#---------------------------------------------------------#

Orange
Eigenvalues: [19.43840492  0.75991627]
Eigenvectors:
 +--------------------+---------------------+
| 0.9946880969943884 | -0.1029348808600963 |
| 0.1029348808600963 | 0.9946880969943884  |
+--------------------+---------------------+

#---------------------------------------------------------#

Grape
Eigenvalues: [38.63349189  0.51094159]
Eigenvectors:
 +-----------------------+----------------------+
|  0.9999776410267227   | 0.006687110484440749 |
| -0.006687110484440749 |  0.9999776410267227  |
+-----------------------+----------------------+

#---------------------------------------------------------#





---





---





---



### iv) For each estimated covariance matrix, determine whether it is symmetric and positive semidefinite.

#### **Solution For Q1 - a - iv**



---


1.   **Check Symmetry:** A matrix is symmetric if it is equal to its transpose. For a covariance matrix, this should always be true so we expect all to be true.

2.   **Check Positive Semidefiniteness:** A matrix is positive semidefinite if all its eigenvalues are non-negative.



---


Given these definitions and the eigenvalues from part iii, we already know the answer, but let's confirm it with a short piece of code.

We can use the following NumPy calls:

```
np.array_equal(matrix, matrix.T)
np.all(eigenvalues >= 0)
```



In [None]:
def is_symmetric(matrix):

    return np.array_equal(matrix, matrix.T)

def is_positive_semidefinite(eigenvalues):

    return np.all(eigenvalues >= 0)

In [None]:
symmetric = is_symmetric(apple_cov)
positive_semidefinite = is_positive_semidefinite(Apple_eival)

print("Is Apple Covariance Matrix symmetric?", symmetric)
print("Is Apple Covariance Matrix positive semidefinite?", positive_semidefinite)

Is Apple Covariance Matrix symmetric? True
Is Apple Covariance Matrix positive semidefinite? True


In [None]:
symmetric = is_symmetric(orange_cov)
positive_semidefinite = is_positive_semidefinite(Orange_eival)

print("Is Orange Covariance Matrix symmetric?", symmetric)
print("Is Orange Covariance Matrix positive semidefinite?", positive_semidefinite)

Is Orange Covariance Matrix symmetric? True
Is Orange Covariance Matrix positive semidefinite? True


In [None]:
symmetric = is_symmetric(grape_cov)
positive_semidefinite = is_positive_semidefinite(Grape_eival)

print("Is Grape Covariance Matrix symmetric?", symmetric)
print("Is Grape Covariance Matrix positive semidefinite?", positive_semidefinite)

Is Grape Covariance Matrix symmetric? True
Is Grape Covariance Matrix positive semidefinite? True
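The checks above use exact comparisons, which work here but can be brittle when a matrix carries floating-point round-off. As a hedged alternative, the sketch below shows tolerance-based versions. The function names, the `atol` default, and the two demo matrices are illustrative choices, not part of the assignment.

```python
import numpy as np

def is_symmetric_tol(matrix, atol=1e-10):
    # Compare with the transpose up to a small absolute tolerance
    return bool(np.allclose(matrix, matrix.T, atol=atol))

def is_positive_semidefinite_tol(matrix, atol=1e-10):
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    eigenvalues = np.linalg.eigvalsh(matrix)
    # Allow eigenvalues that are negative only by round-off
    return bool(np.all(eigenvalues >= -atol))

# Rank-deficient example: eigenvalues are 0 and 5, so it is positive semidefinite
singular_demo = np.array([[1.0, 2.0], [2.0, 4.0]])

# Symmetric but indefinite example: eigenvalues are 3 and -1
indefinite_demo = np.array([[1.0, 2.0], [2.0, 1.0]])

print(is_symmetric_tol(singular_demo), is_positive_semidefinite_tol(singular_demo))
print(is_symmetric_tol(indefinite_demo), is_positive_semidefinite_tol(indefinite_demo))
```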




---





---





---



# Question 1 - b


Create a scatter plot showing weight vs. diameter for all three classes, colouring the data according to fruit
class. Label your axes and add a legend. By examining the scatter plot, for each fruit class, do the data appear to
follow a bivariate normal distribution?

## **Solution For Q1 - b**



---


In the following cells, I have provided both 2D and 3D scatter plots. In the scatter plot, the **orange** and **grape** point clouds have the roughly oval (elliptical) shape expected of a bivariate normal distribution, whereas the **apple** data does not appear to follow one.

Additionally, I have included **Q-Q plots**, which further confirm that the apple diameter deviates from a normal distribution. A Quantile-Quantile (Q-Q) plot is a graphical tool that compares the quantiles of the data with the quantiles of a reference distribution (here, the normal distribution).

1.   If the data follow a normal distribution, the points lie on a straight diagonal line.

2.   If the points curve away from that line, the data deviate from the normal distribution. In our case, the apple diameter produces a curved line.
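To illustrate how a Q-Q plot separates the two cases, here is a minimal sketch using `scipy.stats.probplot` on synthetic samples (not the assignment data). Besides the plot coordinates, `probplot` returns the correlation coefficient `r` of the fitted line, which gives a rough numeric summary of how straight the Q-Q points are.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples (NOT the assignment data)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50.0, scale=10.0, size=500)
skewed_sample = rng.exponential(scale=10.0, size=500)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# r close to 1 means the Q-Q points lie almost on a straight line
print("normal sample r:", r_normal)
print("skewed sample r:", r_skewed)
```

A lower `r` for the skewed sample corresponds to the curved pattern described above. It is a summary of the plot, not a formal normality test.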




---


Reference -> [3D Scatter Plot with Colorscaling and Marker Styling Plotly](https://plotly.com/python/3d-scatter-plots/#:~:text=.show()-,3D%20Scatter%20Plot%20with%20Colorscaling%20and%20Marker%20Styling,-import%20plotly.graph_objects)


The code is straightforward, utilizing Plotly's built-in functions to plot the data. For each fruit, the relevant data is passed into a layout, and once all the data is set up, we can easily combine and display the plots together.

2D plot

In [None]:
data_apple = data[['D_apple', 'W_apple']].copy()
data_apple['Fruit'] = 'Apple'
data_apple.rename(columns={'D_apple': 'Diameter', 'W_apple': 'Weight'}, inplace=True)

data_orange = data[['D_orange', 'W_orange']].copy()
data_orange['Fruit'] = 'Orange'
data_orange.rename(columns={'D_orange': 'Diameter', 'W_orange': 'Weight'}, inplace=True)

data_grape = data[['D_grape', 'W_grape']].copy()
data_grape['Fruit'] = 'Grape'
data_grape.rename(columns={'D_grape': 'Diameter', 'W_grape': 'Weight'}, inplace=True)

data_combined = pd.concat([data_apple, data_orange, data_grape])



trace_apple = go.Scatter(
    x=data_apple['Diameter'],
    y=data_apple['Weight'],
    mode='markers',
    name='Apple',
    marker=dict(color='red')
)

trace_orange = go.Scatter(
    x=data_orange['Diameter'],
    y=data_orange['Weight'],
    mode='markers',
    name='Orange',
    marker=dict(color='orange')
)

trace_grape = go.Scatter(
    x=data_grape['Diameter'],
    y=data_grape['Weight'],
    mode='markers',
    name='Grape',
    marker=dict(color='purple')
)

layout = go.Layout(
    title='Weight vs. Diameter for Different Fruits - 2D',
    xaxis=dict(title='Diameter'),
    yaxis=dict(title='Weight'),
    legend_title='Fruit Class'
)

fig = go.Figure(data=[trace_apple, trace_orange, trace_grape], layout=layout)

fig.show()

3D plot

In [None]:
trace_apple = go.Scatter3d(
    x=data_apple['Diameter'],
    y=data_apple['Weight'],
    z=[1] * len(data_apple),
    mode='markers',
    name='Apple',
    marker=dict(color='red', size=5)
)

trace_orange = go.Scatter3d(
    x=data_orange['Diameter'],
    y=data_orange['Weight'],
    z=[2] * len(data_orange),
    mode='markers',
    name='Orange',
    marker=dict(color='orange', size=5)
)

trace_grape = go.Scatter3d(
    x=data_grape['Diameter'],
    y=data_grape['Weight'],
    z=[3] * len(data_grape),
    mode='markers',
    name='Grape',
    marker=dict(color='purple', size=5)
)

layout = go.Layout(
    title='3D Scatter Plot of Weight vs. Diameter for Different Fruits',
    scene=dict(
        xaxis_title='Diameter',
        yaxis_title='Weight',
        zaxis_title='Fruit Class'
    ),
    legend_title='Fruit Class'
)

fig = go.Figure(data=[trace_apple, trace_orange, trace_grape], layout=layout)

fig.show()


Q-Q plot

In [None]:
def qqplot(data):
    (quantiles, values), (slope, intercept, r) = stats.probplot(data, dist="norm")
    return quantiles, values, slope, intercept

for fruit in ['Apple', 'Orange', 'Grape']:
    df_fruit = data_combined[data_combined['Fruit'] == fruit]

    quantiles_diameter, values_diameter, slope_d, intercept_d = qqplot(df_fruit['Diameter'])
    quantiles_weight, values_weight, slope_w, intercept_w = qqplot(df_fruit['Weight'])

    fig = make_subplots(rows=1, cols=2, subplot_titles=(f"{fruit} Diameter Q-Q Plot", f"{fruit} Weight Q-Q Plot"))

    fig.add_trace(go.Scatter(x=quantiles_diameter, y=values_diameter, mode='markers', name='Diameter Data'), row=1, col=1)
    fig.add_trace(go.Scatter(x=quantiles_diameter, y=slope_d * quantiles_diameter + intercept_d, mode='lines', name='Ideal Line', line=dict(color='red')), row=1, col=1)

    fig.add_trace(go.Scatter(x=quantiles_weight, y=values_weight, mode='markers', name='Weight Data'), row=1, col=2)
    fig.add_trace(go.Scatter(x=quantiles_weight, y=slope_w * quantiles_weight + intercept_w, mode='lines', name='Ideal Line', line=dict(color='red')), row=1, col=2)

    fig.update_layout(height=300, width=700, title_text=f"Q-Q Plots for {fruit}")
    fig.update_xaxes(title_text="Theoretical Quantiles")
    fig.update_yaxes(title_text="Sample Quantiles")

    fig.show()




---





---





---



# Question 1 - c


Plot the histograms for each feature showing the distribution of each feature over each class. For each feature,
you should plot all three potentially overlapping histograms representing the three fruit types on a single axis.

### i) Use transparency and a different colour and/or line style for each class and make sure you can see all the data (i.e., that bars are not completely occluding each other in your figure).

#### **Solution For Q1 - C - i**



Again, I'm using Plotly, which has a built-in histogram function: https://plotly.com/python/histograms/

All I had to do was set `barmode` to `overlay` so that the bars are drawn on top of each other, as requested in the question. I also used two different `bingroup` values so that the weight and diameter histograms are binned separately.

In [None]:
fig = go.Figure()

# histograms for Diameter
for fruit, color in zip(['Apple', 'Orange', 'Grape'], ['red', 'orange', 'purple']):
    df_fruit = data_combined[data_combined['Fruit'] == fruit]
    fig.add_trace(go.Histogram(
        x=df_fruit['Diameter'],
        name=f'Diameter - {fruit}',
        marker_color=color,
        opacity=0.6,
        bingroup=2
    ))

# histograms for Weight
for fruit, color in zip(['Apple', 'Orange', 'Grape'], ['blue', 'green', 'yellow']):
    df_fruit = data_combined[data_combined['Fruit'] == fruit]
    fig.add_trace(go.Histogram(
        x=df_fruit['Weight'],
        name=f'Weight - {fruit}',
        marker_color=color,
        opacity=0.6,
        bingroup=1
    ))

fig.update_layout(
    title='Histograms of Diameter and Weight for Different Fruits',
    # All six histograms share one x-axis: weights fall on the left, diameters on the right
    xaxis_title='Diameter / Weight',
    yaxis_title='Count',
    barmode='overlay',
    legend_title='Fruit Class'
)

fig.show()



---





---





---



### ii) If you wish to separate oranges from the other two classes, which feature would you prefer and why? What if you wanted to separate grapes from the other two classes? (150 words)

#### **Solution For Q1  - C - ii**

By visualizing the data with histograms, we can observe both weight and diameter distributions. From these plots, it's clear that distinguishing oranges is much easier using the diameter histogram (on the right). However, for grapes, the diameter histogram provides little useful insight. Instead, the weight histogram (left) is much more effective for separating grapes.

In conclusion, based on the plot:

*   To separate oranges (right side of the plot), diameter is the more practical feature.
*   To separate grapes (left side of the plot), weight is the more effective feature.
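The visual judgement above could also be backed by a number. As one hypothetical way to do this, the sketch below measures how much two class histograms overlap when they share the same bin edges: a value near 0 means the feature separates the classes well, and a value near 1 means it does not. The helper `histogram_overlap` and the synthetic samples are assumptions made for illustration, not part of the assignment.

```python
import numpy as np

def histogram_overlap(a, b, bins=30):
    """Shared area of two normalised histograms: 0 = disjoint, 1 = identical."""
    # Use common bin edges so the two histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    counts_a, _ = np.histogram(a, bins=edges)
    counts_b, _ = np.histogram(b, bins=edges)
    p_a = counts_a / counts_a.sum()
    p_b = counts_b / counts_b.sum()
    return float(np.minimum(p_a, p_b).sum())

# Synthetic feature values for three hypothetical classes (NOT the assignment data)
rng = np.random.default_rng(0)
class_a = rng.normal(80.0, 4.0, size=300)
class_b = rng.normal(50.0, 4.0, size=300)   # far from class_a
class_c = rng.normal(78.0, 4.0, size=300)   # close to class_a

overlap_far = histogram_overlap(class_a, class_b)
overlap_near = histogram_overlap(class_a, class_c)
print("well separated:", overlap_far)
print("poorly separated:", overlap_near)
```

Applied to the real data, the same function could be evaluated once per feature and per pair of fruit classes to compare diameter against weight.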





---





---





---



# Question 1 - d


Provide a plot visualizing apple weight vs. diameter. Add a line of best fit and report the Pearson Correlation
Coefficient. Do the data look correlated? How does your observation compare to the computed Pearson
Correlation Coefficient and to the estimated covariance matrix from part a above?

#### **Solution For Q1 - d**


1.  First, prepare the apple data.

2.  Compute the Pearson Correlation Coefficient using the SciPy library.

3.  Fit a line to the data using NumPy’s linear regression.

**Polyfit**: This function fits a polynomial to the data using the least squares method. By passing 1 as the degree, we specify a linear fit. It returns the coefficients for the line.

```
coefficients = np.polyfit(data_apple['Diameter'], data_apple['Weight'], 1)
```

4.  With the coefficients obtained, create a 1D polynomial object and plot it.

5.  The remaining code involves creating the scatter plot and adding the line from step 4 to it.
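Steps 3 and 4 can be sketched in isolation as follows. The data here are synthetic (a known line plus noise), so the recovered slope can be compared against the value used to generate them. The sketch also shows that the least-squares slope equals the covariance divided by the variance of the x variable.

```python
import numpy as np

# Synthetic data (NOT the assignment data): y = 0.1 * x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(40.0, 70.0, size=100)
y = 0.1 * x + 2.0 + rng.normal(0.0, 0.5, size=100)

# Degree-1 least-squares fit; coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, 1)

# poly1d wraps the coefficients in a callable polynomial
line = np.poly1d([slope, intercept])
print(slope, intercept, line(50.0))

# The fitted slope equals cov(x, y) / var(x)
slope_from_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(slope_from_cov)
```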




---

References:


*   https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
*   https://numpy.org/doc/stable/reference/generated/numpy.poly1d.html



In [None]:
data_apple = data[['D_apple', 'W_apple']].rename(columns={'D_apple': 'Diameter', 'W_apple': 'Weight'})

# 2
corr_coefficient, _ = pearsonr(data_apple['Diameter'], data_apple['Weight'])
print(f'Pearson Correlation Coefficient: {corr_coefficient}')

# 3
coefficients = np.polyfit(data_apple['Diameter'], data_apple['Weight'], 1)
# 4
poly = np.poly1d(coefficients)
fit_line = poly(data_apple['Diameter'])

fig = go.Figure()

# 5
fig.add_trace(go.Scatter(
    x=data_apple['Diameter'],
    y=data_apple['Weight'],
    mode='markers',
    name='Apple Data',
    marker=dict(color='blue', size=8)
))


fig.add_trace(go.Scatter(
    x=data_apple['Diameter'],
    y=fit_line,
    mode='lines',
    name='Best Fit Line',
    line=dict(color='red', width=2)
))

fig.update_layout(
    title='Apple Weight vs. Diameter with Line of Best Fit',
    xaxis_title='Diameter',
    yaxis_title='Weight'
)

fig.show()

Pearson Correlation Coefficient: 0.07172412519669366


My Observation:


**Apple Covariance Matrix:**

|         | D_apple           | W_apple           |
|---------|-------------------|-------------------|
| D_apple | 172.9455017242476 | 1.6133534466988508 |
| W_apple | 1.6133534466988508 | 2.9256289979343975 |



Pearson Correlation Coefficient: 0.07172412519669366

1.   **Scatter plot:** The points are spread widely along the diameter axis and the line of best fit is nearly horizontal, so the plot shows no clear correlation.
2.   **Pearson Correlation Coefficient:** The Pearson Correlation Coefficient of 0.072 is very close to 0, indicating a very weak positive linear relationship between weight and diameter.
3.   **Covariance matrix:** The covariance between weight and diameter is 1.613, which is small relative to the variances (172.95 and 2.93). Since the Pearson coefficient is the covariance divided by the product of the two standard deviations, such a small covariance is consistent with a coefficient near 0 and a very weak linear relationship.
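The link between the covariance matrix and the Pearson coefficient can be verified directly from the numbers reported above. This is a minimal sketch; the name `apple_cov_demo` is only used for this illustration.

```python
import numpy as np

# Apple covariance matrix, copied from the table above
apple_cov_demo = np.array([
    [172.9455017242476, 1.6133534466988508],
    [1.6133534466988508, 2.9256289979343975],
])

var_d = apple_cov_demo[0, 0]    # variance of diameter
var_w = apple_cov_demo[1, 1]    # variance of weight
cov_dw = apple_cov_demo[0, 1]   # covariance of diameter and weight

# Pearson coefficient: covariance normalised by the two standard deviations
r = cov_dw / np.sqrt(var_d * var_w)

# Least-squares slope of weight on diameter: covariance over the diameter variance
slope = cov_dw / var_d

print("r from covariance matrix:", r)
print("slope of best-fit line:", slope)
```

The value of `r` should agree with the coefficient reported by `pearsonr`, and the small slope matches the nearly horizontal line of best fit.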






