# Data Analysis Project

Name: **Hamed Aarab** | Student Number: **9925003**

_This project is also available on my [GitHub](https://github.com/Hawmex/aut_data_analysis_project)._


First, we import the libraries we need.


In [1]:
import numpy as np
import pandas as pd


## Question 1


Let's define our dataset.


In [2]:
dataset = np.array([-5.0, 23.0, 17.6, 7.23, 1.11])

dataset


array([-5.  , 23.  , 17.6 ,  7.23,  1.11])

Now, according to the min-max feature scaling formula, we can normalize a dataset within `[0, 1]`. Then we can scale our values to any range by multiplying them by the range's length and adding the its lower bound to them.

Here's the function that does so.


In [3]:
def normalize_with_range(values: np.ndarray, new_range: tuple) -> np.ndarray:
    min, max = values.min(), values.max()

    return (values - min) / (max - min) * \
        (new_range[1] - new_range[0]) + new_range[0]


### Part A


We use the aforementioned function to get the normalized values within the range `[0, 1]`.


In [4]:
normalize_with_range(dataset, (0, 1))


array([0.        , 1.        , 0.80714286, 0.43678571, 0.21821429])

### Part B


We use the aforementioned function to get the normalized values within the range `[-1, 1]`.


In [5]:
normalize_with_range(dataset, (-1, 1))


array([-1.        ,  1.        ,  0.61428571, -0.12642857, -0.56357143])

### Part C


Like Parts A and B, we can write a function and use the z-score normalization formula in it.

Then we can get our z-score normalized values.


In [6]:
def normalize_with_z_score(values: np.ndarray) -> np.ndarray:
    return (values - values.mean()) / values.std()


normalize_with_z_score(dataset)


array([-1.33779582,  1.37893488,  0.85499396, -0.15116666, -0.74496637])

## Question 2


Let's define our vector.


In [7]:
patients_child_count = np.array([3, 1, 0, 2, 7, 3, 6, 4, 2, 0, 0, 10, 15, 6])


First, we need to get our z-score normalized values.

Then, we return any value whose absolute z-score normalized value is higher than a limit (defaults to 3).


In [8]:
def find_outliers(values: np.ndarray, z_score_limit: float = 3) -> np.ndarray:
    normalized_values = normalize_with_z_score(values)
    return np.array([values[index] for index, normalized_value in np.ndenumerate(normalized_values) if abs(normalized_value) > z_score_limit])


Outliers with an absolute z-score normalized value higher than 3.


In [9]:
find_outliers(patients_child_count)


array([], dtype=float64)

Outliers with an absolute z-score normalized value higher than 2.


In [10]:
find_outliers(patients_child_count, z_score_limit=2)


array([15])

## Question 3


First, we read our data file and see how our dataframe looks.


In [11]:
dataframe = pd.read_csv('iris.data', header=None)

dataframe


Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


We can get some details about it.


In [12]:
dataframe.describe()


Unnamed: 0,0,1,2,3
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


The cell below, implements the following algorithm:

1. Randomly select `k` data points as the clusters' means.
2. Iterate until the means stop changing.
   1. Assign each data point to its nearest mean.
   2. Calculate the new mean of each cluster.


In [13]:
def find_clusters_with_k_means(dataframe: pd.DataFrame, k: int) -> tuple[float, np.ndarray]:
    labels = np.array([None] * dataframe.shape[0])

    # Step 1
    previous_means = dataframe.sample(k).reset_index(drop=True)
    current_means = dataframe.sample(k).reset_index(drop=True)

    # Step 2
    while not current_means.equals(previous_means):
        # Step 2.1
        for row_index, row in dataframe.iterrows():
            distances = np.array([None] * k)

            for mean_index, mean in current_means.iterrows():
                distances[mean_index] = np.sqrt(np.sum(row - mean) ** 2)

            labels[row_index] = distances.argmin()

        next_means = pd.DataFrame(
            np.array([[None] * dataframe.shape[1]] * k)).reset_index(drop=True)

        # Step 2.2
        for column_index, column in enumerate(dataframe.columns):
            for mean_index, mean in current_means.iterrows():
                next_means.at[mean_index,
                              column_index] = dataframe[column][labels == mean_index].mean()

        previous_means = current_means
        current_means = next_means

    wcss = 0

    for row_index, row in dataframe.iterrows():
        wcss += sum((row - current_means.iloc[labels[row_index]]) ** 2)

    return wcss, labels


Now, we run our function with `k=3` and get the results.


In [14]:
wcss, labels = find_clusters_with_k_means(dataframe.iloc[:, 0:4], k=3)

print(f'Within-cluster some of squares: {wcss}\n')
print(f'Labels:\n{labels}')


Within-cluster some of squares: 87.83033071183704

Labels:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 2 0 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 0 2 0 0 0 0 2 0 0 0 0
 0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 2 0 0 0 2 0 0 0 2 0 0 0 2 0
 0 2]


Also, we can add a new column that holds each data point's cluster name.


In [15]:
dataframe['Cluster'] = labels

dataframe


Unnamed: 0,0,1,2,3,4,Cluster
0,5.1,3.5,1.4,0.2,Iris-setosa,1
1,4.9,3.0,1.4,0.2,Iris-setosa,1
2,4.7,3.2,1.3,0.2,Iris-setosa,1
3,4.6,3.1,1.5,0.2,Iris-setosa,1
4,5.0,3.6,1.4,0.2,Iris-setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,0
146,6.3,2.5,5.0,1.9,Iris-virginica,2
147,6.5,3.0,5.2,2.0,Iris-virginica,0
148,6.2,3.4,5.4,2.3,Iris-virginica,0
