# Lab2-Data Pre-processing and Visualization


## Section1: Importing and Saving data


### Numpy

NumPy is a library for the Python programming language that provides support for large, multi-dimensional arrays and matrices, along with a variety of high-level mathematical functions to operate on these arrays. It's widely used for numerical computations and serves as the foundation for many other scientific and data analysis libraries in Python.
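Since this section is about importing and saving data, here is a minimal sketch of writing a NumPy array to disk and reading it back. The file names and the temporary directory are illustrative assumptions, not part of the lab material; `np.save`/`np.load` use NumPy's binary `.npy` format, while `np.savetxt`/`np.loadtxt` use plain text.

```python
import os
import tempfile

import numpy as np

arr = np.array([[1.5, 2.0, 3.25], [4.0, 5.5, 6.0]])

# Work in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "example.npy")  # hypothetical file name
    csv_path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name

    # Binary format: preserves dtype and shape
    np.save(npy_path, arr)
    loaded_npy = np.load(npy_path)

    # Text format: human-readable, one row per line
    np.savetxt(csv_path, arr, delimiter=",")
    loaded_csv = np.loadtxt(csv_path, delimiter=",")

print("Loaded from .npy:\n", loaded_npy)
print("Loaded from .csv:\n", loaded_csv)
```

The binary format is usually the safer choice for intermediate results, while the text format is convenient when the file must be opened in another tool.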

#### 1.1.1. Creating Arrays:


In [4]:
import numpy as np

# Put this at the top of your code to make NumPy values easier to read:
# suppress=True avoids scientific notation (e.g. 0.0e0) where it is not needed,
# and precision sets the number of decimal digits printed
np.set_printoptions(precision=3, suppress=True)

# From a list
a = np.array([1, 2, 3])
print("Array from list:\n", a)

# All zeros
b = np.zeros((2, 2))
print("All zeros:\n", b)

# All ones
c = np.ones((3, 3))
print("All ones:\n", c)

# Identity matrix
d = np.eye(4)
print("Identity matrix:\n", d)

# Range with step
e = np.arange(0, 10, 2)
print("Array from range with step:\n", e)

# Random values
f = np.random.rand(2, 2)
print("Random array:\n", f)

# Random integers
g = np.random.randint(1, 10, (3, 3))
print("Random integer array:\n", g)

# Evenly spaced values
h = np.linspace(0, 15, 6)
print("Evenly spaced array:\n", h)

Array from list:
 [1 2 3]
All zeros:
 [[0. 0.]
 [0. 0.]]
All ones:
 [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Identity matrix:
 [[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
Array from range with step:
 [0 2 4 6 8]
Random array:
 [[0.703 0.374]
 [0.09  0.059]]
Random integer array:
 [[2 6 1]
 [6 3 4]
 [9 4 5]]
Evenly spaced array:
 [ 0.  3.  6.  9. 12. 15.]
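The random arrays above change on every run. When reproducible results are needed, one option is a seeded generator created with `np.random.default_rng`. The sketch below assumes an arbitrary seed of 42, which is not from the lab itself.

```python
import numpy as np

# The seed value 42 is an arbitrary choice for illustration
rng = np.random.default_rng(42)
random_floats = rng.random((2, 2))  # uniform values in [0, 1)
random_ints = rng.integers(1, 10, size=(3, 3))  # integers in [1, 10)

# Re-creating the generator with the same seed reproduces the same values
rng_again = np.random.default_rng(42)
same_floats = rng_again.random((2, 2))

print("Same values with the same seed:", np.array_equal(random_floats, same_floats))
```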


#### 1.1.2. Creating 2D Arrays:


In [5]:
import numpy as np

# Create a 2x3 array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D array:\n")
print(array_2d)

2D array:

[[1 2 3]
 [4 5 6]]
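A 2D array carries its layout in a few attributes. The short sketch below inspects them for the same 2x3 array; it is an illustrative addition rather than part of the original cell.

```python
import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print("Shape:", array_2d.shape)  # (rows, columns) -> (2, 3)
print("Number of dimensions:", array_2d.ndim)  # 2
print("Total number of elements:", array_2d.size)  # 6

# Transposing swaps rows and columns
transposed = array_2d.T
print("Transposed shape:", transposed.shape)  # (3, 2)
```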


#### 1.1.3. Array Operations:


In [6]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise addition
c = a + b
print(f"Element-wise addition:\n{c}")
# Output: [5 7 9]

# Element-wise subtraction
d = a - b
print(f"Element-wise subtraction:\n{d}")
# Output: [-3 -3 -3]

# Element-wise multiplication
e = a * b
print(f"Element-wise multiplication:\n{e}")
# Output: [4 10 18]

# Element-wise division
f = a / b
print(f"Element-wise division:\n{f}")
# Output: [0.25 0.4 0.5]

# Element-wise exponentiation
g = a**2
print(f"Element-wise exponentiation:\n{g}")
# Output: [1 4 9]

Element-wise addition:
[5 7 9]
Element-wise subtraction:
[-3 -3 -3]
Element-wise multiplication:
[ 4 10 18]
Element-wise division:
[0.25 0.4  0.5 ]
Element-wise exponentiation:
[1 4 9]
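Note that `*` multiplies element by element; it is not a dot product or matrix product. As a small sketch of the difference, the `@` operator performs the dot/matrix product. The matrix `m` is a made-up example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

elementwise = a * b  # [ 4 10 18]
dot_product = a @ b  # 1*4 + 2*5 + 3*6 = 32
print("Element-wise product:", elementwise)
print("Dot product:", dot_product)

# The same distinction applies to 2D arrays
m = np.array([[1, 2], [3, 4]])
elementwise_2d = m * m  # squares each entry
matrix_product = m @ m  # rows times columns
print("Element-wise:\n", elementwise_2d)
print("Matrix product:\n", matrix_product)
```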


#### 1.1.4. Indexing and Slicing:


In [20]:
import numpy as np

# 1D array
arr_1d = np.array([0, 1, 2, 3, 4, 5])
# Indexing in 1D array
print(arr_1d[2])
# Output: 2

# Slicing in 1D array
print(arr_1d[1:4])
# Output: [1 2 3]

# 2D array
arr_2d = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
# Indexing in 2D array
print(arr_2d[1, 2])
# Output: 5

# Slicing the first two columns of row 1 in 2D array
print(arr_2d[1, :2])
# Output: [3 4]

# Selecting column 1 for the first two rows in 2D array
print(arr_2d[0:2, 1])
# Output: [1 4]

# Combining indexing and slicing in 2D array
print(arr_2d[0:2, 0:2])
# Output: [[0 1][3 4]]

# Reshaping a 1D array to a 2D array
a = np.array([1, 2, 3, 4, 5, 6])
reshaped_a = a.reshape(3, 2)
print("Reshaped array:\n", reshaped_a)
# Output: [[1 2] [3 4] [5 6]]

# Reshaping a 2D array to a 1D array
flattened_a = reshaped_a.reshape(-1)
print("Flattened array:", flattened_a)
# Output: [1 2 3 4 5 6]

# Reshaping with unknown dimension
unknown_dim_a = a.reshape(2, -1)
# -1 is automatically inferred to be 3
print("Reshaped with unknown dimension:\n", unknown_dim_a)
# Output: [[1 2 3] [4 5 6]]

2
[1 2 3]
5
[3 4]
[1 4]
[[0 1]
 [3 4]]
Reshaped array:
 [[1 2]
 [3 4]
 [5 6]]
Flattened array: [1 2 3 4 5 6]
Reshaped with unknown dimension:
 [[1 2 3]
 [4 5 6]]
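One detail worth knowing about slicing: a basic slice is a view onto the original data, not a copy, so writing through the slice changes the original array. The sketch below illustrates this with made-up values and shows `.copy()` as a way to get independent data.

```python
import numpy as np

# A slice is a view: modifying it modifies the original array
arr = np.arange(6)  # [0 1 2 3 4 5]
view = arr[1:4]
view[0] = 100
print("Original after editing the view:", arr)  # [  0 100   2   3   4   5]

# An explicit copy is independent of the original
arr2 = np.arange(6)
independent = arr2[1:4].copy()
independent[0] = 100
print("Original after editing the copy:", arr2)  # [0 1 2 3 4 5]

print("View shares memory:", np.shares_memory(arr, view))
print("Copy shares memory:", np.shares_memory(arr2, independent))
```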


#### 1.1.5. Universal Functions (ufuncs)


In [27]:
import numpy as np
import math as m

# Define example arrays
arr1 = np.array([1, 4, 9, 16])
arr = np.array([1, 2, 3, 4])

# Square root
sqrt_arr = np.sqrt(arr1)
print("Square root:", sqrt_arr)

# Rounding to the nearest integer
rounded_arr = np.round([1.3, 2.7, 4.1])
print("Rounded:", rounded_arr)

# Absolute value
abs_arr = np.abs([-1, -2, -3])
print("Absolute value:", abs_arr)

# Trigonometric functions
sin_arr = np.sin(arr1)
print("Sine values:", sin_arr)
print(f"{m.sin(1):.2f}")
cos_arr = np.cos(arr1)
print("Cosine values:", cos_arr)

# Mean
mean_value = np.mean(arr)
print("Mean value:", mean_value)

# Standard deviation
std_deviation = np.std(arr)
print("Standard deviation:", std_deviation)

# Variance (computed on arr1, unlike the other statistics in this cell)
variance_val = np.var(arr1)
print("Variance:", variance_val)

# Sum of all elements
total_sum = np.sum(arr)
print("Total sum:", total_sum)

# Minimum and Maximum values
min_value = np.min(arr)
print("Minimum value:", min_value)
max_value = np.max(arr)
print("Maximum value:", max_value)

Square root: [1. 2. 3. 4.]
Rounded: [1. 3. 4.]
Absolute value: [1 2 3]
Sine values: [ 0.841 -0.757  0.412 -0.288]
0.84
Cosine values: [ 0.54  -0.654 -0.911 -0.958]
Mean value: 2.5
Standard deviation: 1.118033988749895
Variance: 32.25
Total sum: 10
Minimum value: 1
Maximum value: 4
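The aggregation functions above reduce a whole array to one number. On 2D arrays they also accept an `axis` argument to reduce along rows or columns. A brief sketch with a made-up 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

total = np.sum(m)  # all elements: 21
column_sums = np.sum(m, axis=0)  # collapse rows: [5 7 9]
row_sums = np.sum(m, axis=1)  # collapse columns: [ 6 15]
row_means = np.mean(m, axis=1)  # mean of each row: [2. 5.]
column_max = np.max(m, axis=0)  # max of each column: [4 5 6]

print("Total:", total)
print("Column sums:", column_sums)
print("Row sums:", row_sums)
print("Row means:", row_means)
print("Column max:", column_max)
```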


#### Extra Statistical Operations

##### P-value & t-statistic

1. NumPy does not compute p-values directly, because they belong to statistical hypothesis testing rather than array computation.
2. Use the `scipy.stats` module instead: it builds on NumPy and provides functions for statistical analysis, including p-value computation.


In [75]:
import numpy as np
from scipy.stats import ttest_1samp

# Example data
data = np.array([2, 3, 4, 5, 6, 2, 4, 5, 6, 7])
# Sample mean
mean_sample_val = np.mean(data)
print(f"Sample mean value: {mean_sample_val}")

# Standard deviation (population form: np.std uses ddof=0 by default)
std_val = np.std(data)
print(f"The standard deviation is {std_val}")

# Hypothesized population mean H0
hypothesized_mean = 4.6

# Perform one-sample t-test
t_statistic, p_value = ttest_1samp(data, hypothesized_mean)

# The T-statistic
#  *   measures how many standard errors the sample mean is from the hypothesized population mean
#  *   positive T-statistic -> sample_mean > hypothesized_mean
#  *   negative T-statistic -> sample_mean < hypothesized_mean
#
# The P-value
#  *   assuming that the null hypothesis is true:
#       *   is the probability of obtaining a T-statistic at least as extreme as the one observed in the sample
#  *   small P-value (typically less than the chosen significance level, e.g., 0.05)
#       *   -> the observed sample mean is unlikely to have occurred by random chance alone
#  *   if P-value < significance level -> may reject the null hypothesis in favor of the alternative hypothesis
print("T-statistic:", t_statistic)
print(
    f"The sample mean is {'greater' if t_statistic > 0 else 'less'} than the hypothesized mean"
)
print(
    f"The sample mean is {abs(t_statistic)} standard errors {'above' if t_statistic > 0 else 'below'} the hypothesized mean."
)
print("P-value:", p_value)
print(
    f"Because the P-value is {'>' if p_value > 0.05 else '<='} the significance level (0.05),\n{'we can reject' if p_value <= 0.05 else 'there is not enough evidence to reject'} H0"
)

Sample mean value: 4.4
The standard deviation is 1.624807680927192
T-statistic: -0.36927447293799687
The sample mean is less than the hypothesized mean
The sample mean is 0.36927447293799687 standard errors below the hypothesized mean.
P-value: 0.72046166395703
Because the P-value is > the significance level (0.05),
there is not enough evidence to reject H0
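To see where the t-statistic comes from, it can be reproduced by hand with NumPy alone, using the one-sample formula t = (sample mean - hypothesized mean) / (s / sqrt(n)). This is a sketch for illustration: the variable names are my own, and `s` here is the sample standard deviation (`ddof=1`), which differs from the default `np.std`.

```python
import numpy as np

data = np.array([2, 3, 4, 5, 6, 2, 4, 5, 6, 7])
hypothesized_mean = 4.6

n = len(data)
sample_mean = np.mean(data)
# ddof=1 gives the sample standard deviation (divides by n - 1)
sample_std = np.std(data, ddof=1)
# Standard error of the mean
standard_error = sample_std / np.sqrt(n)

t_manual = (sample_mean - hypothesized_mean) / standard_error
print("Manual t-statistic:", t_manual)
```

The result should agree with the value returned by `ttest_1samp` up to floating-point rounding. The p-value then follows from the t distribution with n - 1 degrees of freedom.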


##### Confidence Intervals

To compute a confidence interval in Python, you can use the `scipy.stats` module. Specifically, you can use the `t.interval` function for a one-sample t-confidence interval.

Parameters used:

- confidence level (e.g., 0.95)
- degrees of freedom (e.g., n - 1 for a one-sample t-test)
- sample_mean (computed from the data)
- sample_std (computed from the data)

<img src="./images/CI_fromulas.png" alt="CI" style="width:300px;"/>

The resulting confidence_interval will be a tuple representing the lower and upper bounds of the interval.


In [80]:
import numpy as np
from scipy.stats import t

# Example data
data = np.array([2, 3, 4, 5, 6, 2, 4, 5, 6, 7])

# Confidence level (e.g., 95%)
confidence_level = 0.95

# Degrees of freedom
degrees_of_freedom = len(data) - 1

# Mean and standard deviation of the sample
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # ddof=1 for sample standard deviation

# Compute the confidence interval
confidence_interval = t.interval(
    confidence_level,
    degrees_of_freedom,
    sample_mean,
    scale=sample_std / np.sqrt(len(data)),
)

print("Confidence Interval:", confidence_interval)

Confidence Interval: (3.1748098888379914, 5.625190111162009)
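The same interval can be rebuilt step by step as mean ± t_critical × standard error, which makes the role of each parameter explicit. The sketch below assumes a two-sided interval and uses `t.ppf` for the critical value; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import t

data = np.array([2, 3, 4, 5, 6, 2, 4, 5, 6, 7])
confidence_level = 0.95

n = len(data)
sample_mean = np.mean(data)
standard_error = np.std(data, ddof=1) / np.sqrt(n)

# Two-sided critical value: the (1 + confidence_level) / 2 quantile
# of the t distribution with n - 1 degrees of freedom
t_critical = t.ppf((1 + confidence_level) / 2, df=n - 1)
margin_of_error = t_critical * standard_error

manual_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print("Manual confidence interval:", manual_interval)
```

Up to floating-point rounding, this should match the tuple returned by `t.interval`.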


#### 1.1.6. Masking and Boolean Indexing:


In [79]:
import numpy as np

# Create an array
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Create a mask
mask = arr > 5

# Use boolean indexing to create a filtered array
filtered_arr = arr[mask]
print("Filtered array:", filtered_arr)

# Combine masking and boolean indexing in one line
filtered_arr_one_line = arr[arr > 5]
print("Filtered array in one line:", filtered_arr_one_line)

Filtered array: [6 7 8 9]
Filtered array in one line: [6 7 8 9]
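Boolean masks can also be combined. A short sketch, using the same 0 to 9 array: `&`, `|` and `~` act element-wise (each comparison needs its own parentheses), and `np.where` picks values based on a condition.

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# AND: values strictly between 2 and 7
between = arr[(arr > 2) & (arr < 7)]  # [3 4 5 6]

# OR: values at either end
outside = arr[(arr < 2) | (arr > 7)]  # [0 1 8 9]

# NOT: invert a mask
odd_values = arr[~(arr % 2 == 0)]  # [1 3 5 7 9]

# np.where: replace values above 5 with 5, keep the rest
capped = np.where(arr > 5, 5, arr)  # [0 1 2 3 4 5 5 5 5 5]

print(between, outside, odd_values, capped, sep="\n")
```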


#### 1.1.7. Broadcasting and Reshaping:


In [77]:
import numpy as np

# Broadcasting with a scalar
a = np.array([1, 2, 3])
result = a * 2
print("Broadcasting with a scalar:", result)
# Output: [2 4 6]

# Broadcasting with different shaped arrays
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
c = np.array([1, 2, 3])
result = b + c
print("Broadcasting with different shaped arrays:", result)
# Output:
# [[ 2 4 6]
# [ 5 7 9]
# [ 8 10 12]]

# Broadcasting a column vector across a 2D array
d = np.array([[1], [2], [3]])
result = b + d
print("Broadcasting a column vector across a 2D array:", result)
# Output:
# [[ 2 3 4]
# [ 6 7 8]
# [10 11 12]]

# Reshaping a 1D array to a 2D array
a = np.array([1, 2, 3, 4, 5, 6])
reshaped_a = a.reshape(2, 3)
print("Reshaped array:\n", reshaped_a)
# Output:
# [[1 2 3]
# [4 5 6]]

# Reshaping a 2D array to a 1D array
flattened_a = reshaped_a.reshape(-1)
print("Flattened array:", flattened_a)
# Output: [1 2 3 4 5 6]

# Reshaping with unknown dimension
unknown_dim_a = a.reshape(2, -1)
# -1 is automatically inferred to be 3
print("Reshaped with unknown dimension:\n", unknown_dim_a)
# Output:
# [[1 2 3]
# [4 5 6]]

Broadcasting with a scalar: [2 4 6]
Broadcasting with different shaped arrays: [[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]
Broadcasting a column vector across a 2D array: [[ 2  3  4]
 [ 6  7  8]
 [10 11 12]]
Reshaped array:
 [[1 2 3]
 [4 5 6]]
Flattened array: [1 2 3 4 5 6]
Reshaped with unknown dimension:
 [[1 2 3]
 [4 5 6]]
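Broadcasting works only when the shapes are compatible: compared from the last dimension backwards, each pair of sizes must be equal or one of them must be 1. The sketch below, with made-up arrays, shows a compatible case built with `np.newaxis` and an incompatible one that raises `ValueError`.

```python
import numpy as np

row = np.array([1, 2, 3])  # shape (3,)
col = row[:, np.newaxis]  # shape (3, 1)

# (3, 1) and (3,) broadcast to (3, 3): a multiplication table
table = col * row
print("Broadcast result:\n", table)
# [[1 2 3]
#  [2 4 6]
#  [3 6 9]]

# (3, 3) and (2,) are incompatible: the last dimensions are 3 and 2
raised_error = False
try:
    np.ones((3, 3)) + np.array([1, 2])
except ValueError as err:
    raised_error = True
    print("Incompatible shapes:", err)
```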
