# NumPy Basics and Usage
NumPy is a powerful library for numerical computing in Python. It provides support for arrays, matrices, and many mathematical functions. This notebook will guide you through the basics of NumPy and show you how to use it effectively.

## Installing NumPy
To install NumPy, use the following command:
```
pip install numpy
```
You can also install it via conda if you're using Anaconda.

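##### NOTE:

- In a Jupyter notebook (including one launched from Anaconda), a leading "!" tells the cell to run the rest of the line as a shell command, so put "!" in front of `pip` when installing a package from a cell.

    - This works in a notebook cell
      ```
          !pip install numpy
                 or
          !pip3 install numpy
      ```
    - This may not work in a notebook cell, depending on the Jupyter/IPython version, because the line can be treated as Python code rather than a shell command
      ```
          pip install numpy
                 or
          pip3 install numpy
      ```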

In [None]:
!pip install numpy

## Creating Arrays
NumPy arrays are the core of the library. You can create arrays in several ways, using functions like `np.array`, `np.zeros`, `np.ones`, `np.empty`, `np.arange`, and `np.linspace`.

- NumPy arrays are fundamental data structures in NumPy, and they form the foundation for most of the library's operations. Arrays are like lists in Python, but they provide much more efficient storage and computational capabilities, especially when dealing with large amounts of numerical data. Arrays can have multiple dimensions, allowing for the representation of matrices and higher-dimensional tensors.

- There are several ways to create NumPy arrays:

    - `np.array:` Converts a Python list or tuple into an array. This is the most straightforward way to create an array from existing data.
    - `np.zeros:` Creates an array filled with zeros. You specify the shape of the array, and all elements are initialized to zero.
    - `np.ones:` Similar to `np.zeros`, but initializes the array with ones instead of zeros.
    - `np.empty:` Creates an array without initializing its values, meaning the array will contain whatever values are already in memory at that location.
    - `np.arange:` Generates an array containing a range of values, similar to Python's `range` function, but returns an array instead of a `range` object. As with `range`, the stop value is excluded.
    - `np.linspace:` Creates an array with a specified number of evenly spaced values between a start and end value.

- Understanding these methods is crucial as they allow you to initialize arrays in different scenarios, making NumPy flexible for various types of numerical computations.

In [2]:
import numpy as np

# Creating an array from a list
array_from_list = np.array([1, 2, 3, 4, 5])
print("Array from list:", array_from_list)

# Creating an array of zeros
# The argument (3, 3) specifies the shape of the array. Here, it creates a 3x3 array filled with zeros.
zeros_array = np.zeros((3, 3))
print("\nArray of zeros:\n", zeros_array)

# Creating an array of ones
ones_array = np.ones((2, 4))
print("\nArray of ones:\n", ones_array)

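# Creating an empty array
# The values are uninitialized, so the numbers printed below are arbitrary and will differ between runs.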
empty_array = np.empty((2, 2))
print("\nEmpty array:\n", empty_array)

# Creating an array with a range of values
# The arguments (0, 10, 2) specify the start (0), stop (10), and step size (2). Here, it creates an array with values [0, 2, 4, 6, 8].
range_array = np.arange(0, 10, 2)
print("\nArray with range:", range_array)

# Creating an array with evenly spaced values
# The arguments (0, 10, 5) specify the start (0), stop (10), and number of values (5). Here, it creates an array with 5 values evenly spaced between 0 and 10.
linspace_array = np.linspace(0, 10, 5)
print("\nLinspace array:", linspace_array)

Array from list: [1 2 3 4 5]

Array of zeros:
 [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

Array of ones:
 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]]

Empty array:
 [[            nan 4.94065646e-324]
 [8.43352038e-312 8.43359475e-312]]

Array with range: [0 2 4 6 8]

Linspace array: [ 0.   2.5  5.   7.5 10. ]


## Basic Array Operations

- NumPy arrays allow you to perform basic operations like addition, subtraction, multiplication, and division element-wise, meaning that these operations are applied independently to each element of the array. This capability is one of the key features that makes NumPy so powerful for numerical computing, as it allows you to apply mathematical operations across entire datasets without needing to write explicit loops.

- Additionally, NumPy supports operations such as array broadcasting, which automatically expands the dimensions of arrays so that they are compatible for arithmetic operations, even if their shapes differ. This feature is especially useful for efficiently performing operations between arrays of different shapes and sizes.

- Understanding how to perform these operations on arrays is essential for any data manipulation or scientific computing task, as it forms the basis for more complex operations and algorithms.

In [10]:
# Element-wise addition
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
sum_array = array1 + array2
print("Sum of arrays:", sum_array)

# Element-wise multiplication
product_array = array1 * array2
print("\nProduct of arrays:", product_array)

# Dot product
dot_product = np.dot(array1, array2)
print("\nDot product of arrays:", dot_product)

Sum of arrays: [5 7 9]

Product of arrays: [ 4 10 18]

Dot product of arrays: 32


## Array Indexing

You can access elements of an array using square brackets, similar to Python lists.

- Array indexing in NumPy is similar to list indexing in Python, but it is more powerful and flexible, especially when working with multi-dimensional arrays. Indexing allows you to access and modify individual elements, rows, columns, or subarrays within a NumPy array.

- NumPy arrays can be indexed in several ways:

    - **Basic Indexing:** This is the simplest form, where you use an integer or tuple of integers to access elements. For example, `array[0]` accesses the first element, and `array[1, 2]` accesses the element in the second row, third column in a 2D array.
    - **Slicing:** Similar to lists, you can slice arrays to access a range of elements. For instance, `array[1:3]` returns the second and third elements, while `array[:, 1:3]` returns all rows but only the second and third columns.
    - **Advanced Indexing:** NumPy supports more complex forms of indexing, such as boolean indexing, where you use a boolean array to select elements, and fancy indexing, where you use arrays of indices to access multiple elements.

- Mastering array indexing is crucial for effectively manipulating and analyzing data stored in NumPy arrays, as it allows you to access and modify specific portions of your data efficiently.

In [3]:
array = np.array([10, 20, 30, 40, 50])

# Indexing
print("First element:", array[0])
print("\nLast element:", array[-1])

First element: 10

Last element: 50
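
The cell above only indexes a 1D array. As a minimal sketch of the other indexing forms described in this section, the following uses a small 2D array (`grid` is a hypothetical name introduced here for illustration, not part of the notebook's data):

```python
import numpy as np

# Hypothetical 3x3 array used only for this illustration
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Basic indexing: second row, third column (indices are zero-based)
element = grid[1, 2]
print("grid[1, 2]:", element)

# Slicing: all rows, second and third columns
columns = grid[:, 1:3]
print("\ngrid[:, 1:3]:\n", columns)

# Boolean indexing: a boolean mask selects the elements greater than 5
mask = grid > 5
large_values = grid[mask]
print("\ngrid[grid > 5]:", large_values)

# Fancy indexing: a list of indices selects the first and last rows
picked_rows = grid[[0, 2]]
print("\ngrid[[0, 2]]:\n", picked_rows)
```

Boolean and fancy indexing return new arrays (copies), whereas basic indexing with slices returns a view on the original data.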


## Array Slicing

Slicing in NumPy works similarly to Python lists. You can slice arrays to create subarrays.

- Slicing in NumPy allows you to create subarrays from an existing array. This operation is extremely efficient because slices are views on the original array, meaning that no data is copied, and modifying a slice will affect the original array. This feature is particularly useful when working with large datasets, as it conserves memory and speeds up operations.

- In NumPy, slices are specified using the `:` operator, which defines a start and end point along a given axis. For example, `array[1:4]` will give you a subarray containing the second to fourth elements. You can also use slicing with step sizes (e.g., `array[::2]`), allowing you to skip elements as needed.

- You can slice along multiple axes simultaneously. For example, in a 2D array, `array[1:3, 0:2]` will select a subarray from rows 2 to 3 and columns 1 to 2.

- Understanding slicing is essential for efficiently working with subsets of data, particularly in scenarios where you need to analyze or modify specific portions of an array.

In [4]:
# Slicing
print("\nFirst three elements:", array[:3])
print("\nEvery other element:", array[::2])

# Multi-dimensional array slicing
multi_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nSub-array:\n", multi_array[0:2, 1:3])


First three elements: [10 20 30]

Every other element: [10 30 50]

Sub-array:
 [[2 3]
 [5 6]]
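
The statement that slices are views can be checked directly. The sketch below uses a fresh array (`data` is an illustrative name) so that it does not modify the notebook's `array`; it shows that writing through a slice changes the original, while a slice followed by `.copy()` does not:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

# A basic slice is a view: it shares memory with `data`
view = data[1:4]
view[0] = 99
print("data after editing the view:", data)

# .copy() produces an independent array
independent = data[1:4].copy()
independent[0] = -1
print("data after editing the copy:", data)

# np.shares_memory reports whether two arrays overlap in memory
print("\nview shares memory with data:", np.shares_memory(data, view))
print("copy shares memory with data:", np.shares_memory(data, independent))
```

If you need to modify a subarray without touching the source data, taking an explicit copy is the safer choice.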


## Array Manipulation

- Array manipulation is a crucial aspect of working with NumPy, allowing you to reshape, flatten, and concatenate arrays according to your data processing needs. These operations are often necessary when preparing data for analysis, visualizations, or machine learning models.

    - Reshaping: This involves changing the structure of an array while maintaining the original data. For instance, converting a 1D array to a 2D array or vice versa. Reshaping is useful when you need to align data into a specific format for matrix operations or for feeding into machine learning algorithms.
    - Flattening: This operation converts a multi-dimensional array into a 1D array. It's particularly useful when you need to perform operations that require data in a single dimension, such as certain types of sorting or applying linear algebra operations.
    - Concatenation: This involves combining multiple arrays into one. Whether you are adding rows to a dataset, merging columns, or appending arrays, concatenation is essential for handling data that comes in parts or needs to be assembled before analysis.

In [12]:
# Reshaping an array
reshaped_array = np.arange(6).reshape((2, 3))
print("Reshaped array:\n", reshaped_array)

# Flattening an array
flattened_array = reshaped_array.flatten()
print("\nFlattened array:", flattened_array)

# Concatenating arrays
concatenated_array = np.concatenate((array1, array2))
print("\nConcatenated array:", concatenated_array)


Reshaped array:
 [[0 1 2]
 [3 4 5]]

Flattened array: [0 1 2 3 4 5]

Concatenated array: [1 2 3 4 5 6]
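
The cell above concatenates 1D arrays only. As a rough sketch of adding rows versus merging columns, the example below uses the `axis` argument of `np.concatenate` on small 2D arrays (all array names here are illustrative):

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])       # shape (2, 2)
bottom = np.array([[5, 6]])            # shape (1, 2)
extra_column = np.array([[10], [20]])  # shape (2, 1)

# axis=0 appends rows: (2, 2) and (1, 2) give (3, 2)
rows_added = np.concatenate((top, bottom), axis=0)
print("Rows added:\n", rows_added)

# axis=1 appends columns: (2, 2) and (2, 1) give (2, 3)
columns_added = np.concatenate((top, extra_column), axis=1)
print("\nColumns added:\n", columns_added)

# -1 in reshape asks NumPy to infer that dimension from the total size
inferred = np.arange(6).reshape(-1, 2)
print("\nInferred shape:", inferred.shape)

# flatten() returns a copy, so editing it leaves the source untouched
flat = top.flatten()
flat[0] = 100
print("\ntop[0, 0] is still:", top[0, 0])
```

The arrays being joined must agree in every dimension except the one named by `axis`.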


## Mathematical Functions

NumPy includes a variety of mathematical functions, such as `np.sum`, `np.mean`, `np.sqrt`, and more.

- NumPy provides a wide range of mathematical functions that allow you to perform calculations on arrays efficiently. Some, such as `np.sqrt`, are applied element-wise, while others, such as `np.sum` and `np.mean`, reduce an array to a single value (or to one value per axis). Both kinds run in optimized compiled code, which typically makes them much faster than equivalent Python loops.

    - Sum and Mean: These functions are used to calculate the sum and average of array elements, respectively. They are fundamental in statistical analysis, helping you understand the total and average values in your dataset.
    - Square Root: This function computes the square root of each element in an array. It is commonly used in mathematical operations that require normalization or in algorithms like the Euclidean distance calculation.
    - Min and Max: These functions help identify the smallest and largest values in an array. They are useful for understanding the range of your data or for finding extrema in optimization problems.
    - Rounding: Rounding functions in NumPy allow you to control the precision of your data by rounding numbers to a specified number of decimal places. This is particularly useful when preparing data for presentation or when precision is critical.

In [14]:
# Sum of elements
print("Sum of elements:", np.sum(array))

# Mean of elements
print("\nMean of elements:", np.mean(array))

# Square root of each element
print("\nSquare root of elements:", np.sqrt(array))

# Min and Max of elements
print("\nMinimum value:", np.min(array))
print("\nMaximum value:", np.max(array))

# Rounding
rounded_array = np.round(np.array([1.23456, 2.34567, 3.45678]), decimals=2)
print("\nRounded array:", rounded_array)

Sum of elements: 150

Mean of elements: 30.0

Square root of elements: [3.16227766 4.47213595 5.47722558 6.32455532 7.07106781]

Minimum value: 10

Maximum value: 50

Rounded array: [1.23 2.35 3.46]


## NumPy Random Number Generation

- Random number generation is a core component of many computational tasks, such as simulations, statistical sampling, and machine learning. NumPy provides a suite of functions to generate random numbers from various distributions.

    - Uniform Distribution: Generates random numbers where each number within the specified range has an equal probability of being selected. This is useful in simulations where each outcome should have an equal chance of occurring.
    - Normal Distribution: Generates random numbers based on a Gaussian distribution, characterized by its mean and standard deviation. This is essential in scenarios where data follows a natural distribution, such as in finance or natural sciences.
    - Random Integers: Generates random integers within a specified range. This is useful in tasks like creating random samples or initializing random states in algorithms.
    - Setting Seeds: Setting a random seed ensures that the random numbers generated can be reproduced. This is critical in experiments where reproducibility is required, such as in research or model validation.

In [15]:
# Generate random numbers from uniform distribution
uniform_random = np.random.rand(5)
print("Random numbers from uniform distribution:")
print(uniform_random)

# Generate random numbers from standard normal distribution
normal_random = np.random.randn(5)
print("\nRandom numbers from standard normal distribution:")
print(normal_random)

# Generate random numbers from normal distribution with mean=0 and std=1
custom_normal = np.random.normal(loc=0, scale=1, size=5)
print("\nRandom numbers from custom normal distribution:")
print(custom_normal)

# Generate random integers from 1 to 9 (`high` is exclusive)
random_integers = np.random.randint(low=1, high=10, size=5)
print("\nRandom integers:")
print(random_integers)

# Set random seed for reproducibility
np.random.seed(42)
print("\nRandom numbers after setting seed:")
print(np.random.rand(3))

Random numbers from uniform distribution:
[0.44210703 0.33440118 0.39457232 0.52994059 0.16136736]

Random numbers from standard normal distribution:
[ 1.32864083  0.31318447 -0.60650339  0.4559042  -0.45909031]

Random numbers from custom normal distribution:
[-0.69460037 -1.15436267 -1.75182881 -0.38992371  0.15805349]

Random integers:
[4 3 7 2 6]

Random numbers after setting seed:
[0.37454012 0.95071431 0.73199394]
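
The cell above uses the legacy `np.random.*` functions with a global seed. NumPy also offers a `Generator` API, created with `np.random.default_rng`, which keeps the random state in an object instead of globally. The following is a sketch of the same four ideas with that API; the seed value 42 is an arbitrary choice for illustration:

```python
import numpy as np

# A Generator seeded for reproducibility (seed value chosen arbitrarily)
rng = np.random.default_rng(seed=42)

# Uniform samples in [0, 1)
uniform_samples = rng.random(3)

# Normal samples with a given mean (loc) and standard deviation (scale)
normal_samples = rng.normal(loc=0.0, scale=1.0, size=3)

# Integers from 1 to 9 (high is exclusive by default)
integer_samples = rng.integers(low=1, high=10, size=3)

# A second Generator with the same seed reproduces the same sequence
rng_again = np.random.default_rng(seed=42)
repeated = rng_again.random(3)
print("Same uniform samples with same seed:",
      np.array_equal(uniform_samples, repeated))
```

Because each `Generator` carries its own state, seeding one does not affect random numbers drawn elsewhere in the program.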


## Broadcasting

Broadcasting allows NumPy to perform element-wise operations on arrays of different shapes.


- Broadcasting is a powerful mechanism in NumPy that allows operations on arrays of different shapes. It simplifies array operations by automatically expanding smaller arrays to match the dimensions of larger ones, allowing for element-wise operations without the need to explicitly reshape arrays.

    - This feature is particularly useful in operations involving scalar values or arrays of different dimensions. For example, adding a 1D array to a 2D array where the 1D array is "broadcasted" across the 2D array so that the operation can be performed element-wise.
    - Broadcasting follows specific rules to align the dimensions, ensuring that operations are both efficient and intuitive. Understanding these rules allows you to leverage broadcasting to write concise and efficient code.

In [22]:
# Broadcasting example
array3 = np.array([1, 2, 3])
array4 = np.array([[4], [5], [6]])
broadcasted_sum = array3 + array4
print("Broadcasted sum:\n", broadcasted_sum)

Broadcasted sum:
 [[5 6 7]
 [6 7 8]
 [7 8 9]]
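
The rules mentioned above can be summarized as: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1. The sketch below checks a few shape combinations, including one that fails (array names are illustrative):

```python
import numpy as np

row = np.array([1, 2, 3])            # shape (3,)
column = np.array([[4], [5], [6]])   # shape (3, 1)
matrix = np.ones((2, 3))             # shape (2, 3)

# (3,) is treated as (1, 3); combined with (3, 1) the result is (3, 3)
shape_a = (row + column).shape
print("row + column:", shape_a)

# (2, 3) with (3,): the trailing dimensions match, result is (2, 3)
shape_b = (matrix + row).shape
print("matrix + row:", shape_b)

# A scalar broadcasts against any shape
shape_c = (matrix * 10).shape
print("matrix * 10:", shape_c)

# (2, 3) with (2,): trailing dimensions 3 and 2 differ and neither is 1
try:
    matrix + np.array([1, 2])
    incompatible_failed = False
except ValueError as error:
    incompatible_failed = True
    print("\nBroadcast error:", error)
```

When shapes do not line up, adding an axis explicitly (for example with `reshape` or `np.newaxis`) is usually the way to make the intent clear.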


## Matrix Multiplication

Matrix multiplication is a binary operation that produces a matrix from two matrices. NumPy provides the `np.dot()` function for this purpose.

- Matrix multiplication is a fundamental operation in linear algebra and is widely used in various fields, including computer graphics, data analysis, and machine learning. For 2D arrays, `np.dot()` computes the matrix product, giving the same result as `np.matmul()` and the `@` operator.

    - Matrix multiplication is different from element-wise multiplication, as it involves the dot product of rows and columns from the input matrices. This operation is crucial in transformations, solving systems of equations, and in algorithms like those used in neural networks.
    - Understanding how to correctly apply matrix multiplication is essential for working with mathematical models and for implementing algorithms that rely on linear algebra.

In [24]:
# Multiplying two 2-dimensional arrays (matrices)
# Here we define a 2x3 matrix and a 3x2 matrix
matrix1 = np.array([[1, 2, 3], [4, 5, 6]])
matrix2 = np.array([[7, 8], [9, 10], [11, 12]])

# Performing matrix multiplication
result = np.dot(matrix1, matrix2)
print("Matrix 1:\n", matrix1)
print("\nMatrix 2:\n", matrix2)
print("\nResult of matrix multiplication:\n", result)

Matrix 1:
 [[1 2 3]
 [4 5 6]]

Matrix 2:
 [[ 7  8]
 [ 9 10]
 [11 12]]

Result of matrix multiplication:
 [[ 58  64]
 [139 154]]
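
As a short sketch of the difference between matrix and element-wise multiplication, the example below compares `np.dot`, `np.matmul`, and the `@` operator on 2D arrays, then contrasts `*` with `@` on a square matrix (the names `left`, `right`, and `square` are illustrative):

```python
import numpy as np

left = np.array([[1, 2, 3], [4, 5, 6]])        # 2x3
right = np.array([[7, 8], [9, 10], [11, 12]])  # 3x2

# For 2D arrays these three spellings compute the same matrix product
via_dot = np.dot(left, right)
via_matmul = np.matmul(left, right)
via_operator = left @ right
print("All three agree:",
      np.array_equal(via_dot, via_matmul) and np.array_equal(via_dot, via_operator))

# Element-wise (*) versus matrix (@) multiplication on a square matrix
square = np.array([[1, 2], [3, 4]])
elementwise = square * square      # each entry squared
matrix_product = square @ square   # rows times columns
print("\nElement-wise:\n", elementwise)
print("\nMatrix product:\n", matrix_product)
```

The `@` operator tends to read most naturally in longer expressions such as `A @ B @ C`.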


# Advanced NumPy Functions

NumPy extends beyond basic operations to offer advanced functions, particularly in the domain of linear algebra. These functions are essential for solving complex mathematical problems and performing operations like matrix decomposition, eigenvalue computation, and solving systems of linear equations.

## Linear Algebra Operations

- **Matrix Multiplication:**

    - Matrix multiplication involves taking two matrices and producing a third matrix: each entry of the result is obtained by multiplying the elements of a row of the first matrix by the corresponding elements of a column of the second and summing the products. Unlike element-wise multiplication, matrix multiplication considers rows and columns of the matrices. This operation is fundamental in many areas, including transformations in computer graphics, where it's used to rotate, scale, and translate shapes.
    - In mathematical terms, if $A$ is an $m×n$ matrix and $B$ is an $n×p$ matrix, the resulting matrix $C$ will be of size $m×p$. The element $c_{ij}$ of matrix $C$ is computed as:
        $$
        c_{ij} = \sum_{k=1}^{n} a_{ik} \times b_{kj}
        $$

    - Matrix multiplication is not commutative, meaning $A×B$ is not necessarily the same as $B×A$. This property is important in various applications, including solving linear equations and performing matrix decompositions.

- **Matrix Transposition:**

    - The transpose of a matrix is obtained by flipping the matrix over its diagonal, effectively turning its rows into columns and vice versa. If matrix $A$ is of size $m×n$, its transpose $A^T$ will be of size $n×m$.
    - Transposition is useful in many scenarios, such as simplifying the equations involving dot products, or when transforming coordinate systems in physics and engineering.

- **Finding Determinants:**

    - The determinant is a scalar value that is computed from the elements of a square matrix. It provides important information about the matrix, such as whether it is invertible (a non-zero determinant indicates that the matrix is invertible).
    - Determinants are used in various applications, including solving systems of linear equations, analyzing the stability of systems in control theory, and in calculating areas and volumes in geometry.

- **Solving Linear Systems:**

    - Linear systems are equations of the form $Ax=b$, where $A$ is a matrix, $x$ is a vector of unknowns, and $b$ is a vector of constants. Solving a linear system involves finding the vector $x$ that satisfies this equation.
    - In many practical applications, such as in economics, engineering, and physics, systems of linear equations arise naturally, and efficient methods like Gaussian elimination, LU decomposition, or using matrix inverses are employed to solve them.
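
The operations listed above can be sketched in a few lines. The example below spells out the summation formula for $c_{ij}$ with explicit loops, compares it with `@`, and then illustrates non-commutativity, transposition, the determinant, and solving $Ax=b$. The matrices are small made-up values chosen so the results are easy to check by hand:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

# c_ij = sum over k of a_ik * b_kj, written out explicitly
m, n = A.shape
p = B.shape[1]
C = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]
print("Loop result matches A @ B:", np.allclose(C, A @ B))

# Matrix multiplication is not commutative in general
AB = A @ B
BA = B @ A
print("A @ B equals B @ A:", np.array_equal(AB, BA))

# Transposing a 2x3 matrix gives a 3x2 matrix
rect = np.arange(6).reshape(2, 3)
print("Transpose shape:", rect.T.shape)

# A non-zero determinant means A is invertible, so Ax = b has one solution
det_A = np.linalg.det(A)
b = np.array([5.0, 11.0])
x = np.linalg.solve(A, b)
print("Determinant:", det_A)
print("Solution x:", x)
```

`np.linalg.solve` is generally preferable to computing the inverse and multiplying, since it avoids forming the inverse explicitly.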


## Singular Value Decomposition (SVD)

- Singular Value Decomposition (SVD) is a method of decomposing a matrix into three other matrices: $U$, $Σ$, and $V^T$, where:

    - $U$ is an orthogonal matrix whose columns are the left singular vectors.

    - $Σ$ is a diagonal matrix containing the singular values (which are the square roots of the eigenvalues of $A^T A$).

    - $V^T$ is an orthogonal matrix whose rows are the right singular vectors.

- Mathematically, if $A$ is an $m×n$ matrix, SVD expresses it as:

    $A=UΣV^T$
 
- Applications of SVD:

    - Signal Processing: SVD is used to filter out noise from signals, where the smaller singular values (associated with noise) can be discarded to reconstruct a cleaner signal.
    - Statistics: SVD is fundamental in Principal Component Analysis (PCA), a technique used to reduce the dimensionality of data while preserving as much variance as possible.
    - Machine Learning: SVD is employed in techniques like Latent Semantic Analysis (LSA) for text processing, where it helps in identifying patterns and relationships within data.

- Advantages of SVD:

    - SVD provides a stable and robust method for solving systems of linear equations, especially in cases where the matrix is close to singular or poorly conditioned.
    - It also enables the computation of the pseudo-inverse, which is essential in scenarios where the system does not have a unique solution.
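
To make the decomposition concrete, here is a minimal sketch that rebuilds a matrix from its SVD factors, checks the relationship between singular values and the eigenvalues of $A^T A$, and forms a rank-1 approximation by discarding the smaller singular value. The matrix `M` is an arbitrary example, and `full_matrices=False` is used so the factor shapes multiply together directly:

```python
import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])  # arbitrary 3x2 example

# NumPy returns the singular values as a 1D array, sorted in descending order
U, s, Vt = np.linalg.svd(M, full_matrices=False)
Sigma = np.diag(s)

# Rebuild M from the three factors
reconstructed = U @ Sigma @ Vt
print("Reconstruction matches M:", np.allclose(M, reconstructed))

# Singular values are the square roots of the eigenvalues of M^T M
eigenvalues = np.linalg.eigvalsh(M.T @ M)   # ascending order
sqrt_eigs = np.sqrt(eigenvalues[::-1])      # reverse to descending
print("Singular values match sqrt(eigenvalues):", np.allclose(s, sqrt_eigs))

# Rank-1 approximation: keep only the largest singular value
rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
approx_error = np.linalg.norm(M - rank1)    # Frobenius norm of the residual
print("Rank-1 approximation error:", approx_error)
```

For this two-singular-value case, the approximation error equals the discarded singular value, which is the idea behind dropping small singular values to remove noise.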


## Eigenvalues and Eigenvectors

- Eigenvalues and Eigenvectors:
    - For a given square matrix $A$, an eigenvector is a non-zero vector $v$ that only changes by a scalar factor when the matrix is applied to it. The scalar factor is known as the eigenvalue $λ$.
    - The equation that defines eigenvalues and eigenvectors is:

      $Av=λv$
    - Here, $A$ is the matrix, $v$ is the eigenvector, and $λ$ is the eigenvalue.

- Importance in Applications:
    - Stability Analysis: In systems of differential equations, eigenvalues determine the stability of the system. For example, in mechanical systems, eigenvalues are related to natural frequencies of vibration.
    - Quantum Mechanics: In quantum physics, eigenvalues correspond to observable quantities like energy levels, and eigenvectors represent the states of the system.
    - Principal Component Analysis (PCA): Eigenvectors represent the directions of maximum variance in data, and eigenvalues indicate the magnitude of this variance. PCA is widely used in data reduction and pattern recognition.

- Computational Considerations:

    - Finding eigenvalues and eigenvectors is computationally intensive, especially for large matrices. Efficient algorithms like the QR algorithm and power iteration are used to compute them.
    - Understanding the properties of eigenvalues and eigenvectors is crucial for tasks like diagonalizing matrices, which simplifies many mathematical operations.
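
The defining equation $Av=λv$ can be verified numerically. The sketch below uses a small symmetric matrix (an arbitrary choice whose eigenvalues are 1 and 3), checks each eigenpair returned by `np.linalg.eig`, and then rebuilds the matrix from its diagonalization:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # arbitrary symmetric example

eigenvalues, eigenvectors = np.linalg.eig(S)

# Column i of `eigenvectors` pairs with eigenvalues[i]
pair_checks = []
for index in range(len(eigenvalues)):
    v = eigenvectors[:, index]
    lam = eigenvalues[index]
    pair_checks.append(np.allclose(S @ v, lam * v))
print("S v = lambda v holds for each pair:", pair_checks)

# Diagonalization: S = V diag(lambda) V^{-1}
rebuilt = eigenvectors @ np.diag(eigenvalues) @ np.linalg.inv(eigenvectors)
print("Rebuilt from eigen-decomposition:", np.allclose(rebuilt, S))
```

Note that the eigenvectors are stored as columns, not rows, and that the order of the eigenvalues is not guaranteed to be sorted.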

## Pseudo-inverse and Verification

- Pseudo-inverse (Moore-Penrose Inverse):
    - The pseudo-inverse is a generalization of the inverse matrix that can be applied to any matrix, whether square or not. It provides a way to solve linear systems that do not have a unique solution.
    - For a matrix $A$, the pseudo-inverse $A^+$ is defined such that:

      $AA^+A=A$,
      
      $A^+AA^+=A^+$,
      
      $(AA^+)^T=AA^+$,
      
      $(A^+A)^T=A^+A$

- Applications:

    - Regression Models: In linear regression, the pseudo-inverse is used to find the least-squares solution when the system of equations is over-determined (more equations than variables).
    - Optimization Problems: In optimization, the pseudo-inverse minimizes the norm of the residuals in an over-determined system (more constraints than variables) and selects the minimum-norm solution in an under-determined system (fewer constraints than variables).
    - Machine Learning: The pseudo-inverse is used in algorithms like ridge regression and in situations where the matrix involved is not of full rank.

- Verification:

    - Verifying the properties of the pseudo-inverse ensures that the solution obtained is correct and stable. This involves checking that the relationships defined above hold true, which guarantees that the pseudo-inverse has been computed accurately.
    - In practice, small numerical errors can affect the stability and accuracy of the pseudo-inverse, so understanding and verifying these properties is crucial, especially in applications involving large-scale data or ill-conditioned matrices.
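
As a sketch of the verification step, the example below checks all four Moore-Penrose conditions with `np.allclose` on a non-square matrix, and then confirms that the pseudo-inverse gives the same least-squares solution as `np.linalg.lstsq`. The matrix `tall` and the right-hand side `rhs` are made-up values for illustration:

```python
import numpy as np

tall = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])       # 3x2, so it has no ordinary inverse
tall_pinv = np.linalg.pinv(tall)    # 2x3

# The four Moore-Penrose conditions, checked up to floating-point tolerance
conditions = {
    "A A+ A = A": np.allclose(tall @ tall_pinv @ tall, tall),
    "A+ A A+ = A+": np.allclose(tall_pinv @ tall @ tall_pinv, tall_pinv),
    "(A A+)^T = A A+": np.allclose((tall @ tall_pinv).T, tall @ tall_pinv),
    "(A+ A)^T = A+ A": np.allclose((tall_pinv @ tall).T, tall_pinv @ tall),
}
for name, holds in conditions.items():
    print(f"{name}: {holds}")

# Over-determined system: the pseudo-inverse yields the least-squares solution
rhs = np.array([1.0, 2.0, 2.0])
x_pinv = tall_pinv @ rhs
x_lstsq, *_ = np.linalg.lstsq(tall, rhs, rcond=None)
print("\nMatches np.linalg.lstsq:", np.allclose(x_pinv, x_lstsq))
```

Using `np.allclose` rather than exact equality accounts for the small numerical errors mentioned above.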

In [6]:
import numpy as np

# Create matrices
A = np.array([[1, 2], [3, 4], [5, 6]])
B = np.array([[1, 2, 3], [4, 5, 6]])

print("Matrix A:")
print(A)
print("\nMatrix B:")
print(B)

# Matrix multiplication
C = np.dot(A, B)
print("\nMatrix multiplication A * B:")
print(C)

# Transpose
print("\nTranspose of A:")
print(A.T)

# Determinant (for square matrix)
D = np.array([[1, 2], [3, 4]])
det_D = np.linalg.det(D)
print(f"\nDeterminant of D:\n{D}\nis {det_D}")

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(D)
print("\nEigenvalues of D:")
print(eigenvalues)
print("\nEigenvectors of D:")
print(eigenvectors)

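# Solving linear equations: Ax = b
# Note: A is redefined here as a square 2x2 matrix; the SVD and pseudo-inverse below use this new A.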
A = np.array([[1, 1], [1.5, 4.0]])
b = np.array([2200, 5050])
x = np.linalg.solve(A, b)
print("\nSolution to Ax = b:")
print(f"x = {x}")

# Singular Value Decomposition (SVD)
U, s, VT = np.linalg.svd(A)
print("\nSingular Value Decomposition of A:")
print("U:")
print(U)
print("Singular values:")
print(s)
print("V^T:")
print(VT)

# Compute the pseudo-inverse
A_pinv = np.linalg.pinv(A)
print("\nPseudo-inverse of A:")
print(A_pinv)

# Verify pseudo-inverse properties
print("\nA * A+ * A:")
print(np.round(A.dot(A_pinv).dot(A), decimals=3))
print("\nA+ * A * A+:")
print(np.round(A_pinv.dot(A).dot(A_pinv), decimals=3))

Matrix A:
[[1 2]
 [3 4]
 [5 6]]

Matrix B:
[[1 2 3]
 [4 5 6]]

Matrix multiplication A * B:
[[ 9 12 15]
 [19 26 33]
 [29 40 51]]

Transpose of A:
[[1 3 5]
 [2 4 6]]

Determinant of D:
[[1 2]
 [3 4]]
is -2.0000000000000004

Eigenvalues of D:
[-0.37228132  5.37228132]

Eigenvectors of D:
[[-0.82456484 -0.41597356]
 [ 0.56576746 -0.90937671]]

Solution to Ax = b:
x = [1500.  700.]

Singular Value Decomposition of A:
U:
[[-0.29316423 -0.9560621 ]
 [-0.9560621   0.29316423]]
Singular values:
[4.46503132 0.55990649]
V^T:
[[-0.38684104 -0.92214641]
 [-0.92214641  0.38684104]]

Pseudo-inverse of A:
[[ 1.6 -0.4]
 [-0.6  0.4]]

A * A+ * A:
[[1.  1. ]
 [1.5 4. ]]

A+ * A * A+:
[[ 1.6 -0.4]
 [-0.6  0.4]]


## Matrix Operations in Housing Market Analysis

- In this section, matrix operations are applied to a practical example involving housing market analysis. NumPy's capabilities in handling and analyzing matrices make it an ideal tool for such tasks.

    - Average Price Calculation: Using NumPy's statistical functions, we can quickly compute average prices, helping us summarize data.
    - Price Range Identification: Finding the maximum and minimum prices gives insight into the price spread within the market.
    - Price per Square Foot: This is a critical metric in real estate, providing a normalized measure of value across different property sizes.
    - Correlation Analysis: By computing correlations between features and prices, we can uncover relationships and dependencies in the data, which are crucial for predictive modeling and decision-making.

In [17]:
# Sample housing data
prices = np.array([200000, 250000, 300000, 350000, 400000])
features = np.array([
    [1500, 3, 10],  # sqft, bedrooms, age
    [1800, 4, 5],
    [2000, 3, 15],
    [2200, 4, 8],
    [2500, 5, 3]
])

# Calculate average price
avg_price = np.mean(prices)
print(f"Average house price: ${avg_price:.2f}")

# Find the most expensive and least expensive houses
max_price = np.max(prices)
min_price = np.min(prices)
print(f"Price range: ${min_price} to ${max_price}")

# Calculate price per square foot
sqft = features[:, 0]
price_per_sqft = prices / sqft
print("Price per square foot:")
print(np.round(price_per_sqft, 2))

# Correlation between features and prices
correlation = np.corrcoef(features.T, prices)
print("\nCorrelation matrix:")
print(np.round(correlation, 2))

# Simple linear regression (price vs sqft)
X = np.column_stack((np.ones_like(sqft), sqft))
coefficients = np.dot(np.linalg.pinv(X), prices)
intercept, slope = coefficients

print(f"\nLinear regression: Price = {intercept:.2f} + {slope:.2f} * sqft")

Average house price: $300000.00
Price range: $200000 to $400000
Price per square foot:
[133.33 138.89 150.   159.09 160.  ]

Correlation matrix:
[[ 1.    0.78 -0.41  1.  ]
 [ 0.78  1.   -0.89  0.76]
 [-0.41 -0.89  1.   -0.37]
 [ 1.    0.76 -0.37  1.  ]]

Linear regression: Price = -113793.10 + 206.90 * sqft


## Integration with Pandas

- NumPy seamlessly integrates with Pandas, a powerful library for data manipulation and analysis. This integration allows you to leverage NumPy's computational efficiency while working within the flexible and user-friendly framework of Pandas.

    - DataFrame Creation and Manipulation: You can create DataFrames, perform operations on columns using NumPy functions, and efficiently analyze data.
    - Correlation Matrix Calculation: Using NumPy, you can compute correlation matrices, which are essential for understanding relationships between variables in your dataset. This integration is particularly useful in data analysis workflows, where you need to move between raw numerical computation and higher-level data manipulation.

In [2]:
import numpy as np
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'A': np.random.rand(5),
    'B': np.random.rand(5),
    'C': np.random.rand(5)
})

print("Original DataFrame:")
print(df)

# Apply NumPy operations
df['D'] = np.sqrt(df['A']**2 + df['B']**2)
df['E'] = np.where(df['C'] > 0.5, 1, 0)

print("\nDataFrame after NumPy operations:")
print(df)

# Use NumPy to compute correlation matrix
corr_matrix = np.corrcoef(df.values.T)
print("\nCorrelation matrix:")
print(corr_matrix)

Original DataFrame:
          A         B         C
0  0.665086  0.844628  0.614545
1  0.847074  0.634044  0.766978
2  0.560136  0.146385  0.463104
3  0.369633  0.429412  0.917781
4  0.274928  0.040316  0.673262

DataFrame after NumPy operations:
          A         B         C         D  E
0  0.665086  0.844628  0.614545  1.075051  1
1  0.847074  0.634044  0.766978  1.058086  1
2  0.560136  0.146385  0.463104  0.578948  0
3  0.369633  0.429412  0.917781  0.566589  1
4  0.274928  0.040316  0.673262  0.277868  1

Correlation matrix:
[[ 1.          0.6952949  -0.15888543  0.91181505 -0.04094267]
 [ 0.6952949   1.          0.24284109  0.92712008  0.45671354]
 [-0.15888543  0.24284109  1.          0.01520174  0.73788667]
 [ 0.91181505  0.92712008  0.01520174  1.          0.21384175]
 [-0.04094267  0.45671354  0.73788667  0.21384175  1.        ]]


## NumPy in Machine Learning (with scikit-learn)

- NumPy plays a crucial role in machine learning, providing the numerical backbone for data preprocessing, model training, and evaluation. This section demonstrates how NumPy integrates with scikit-learn, a popular machine learning library, to perform tasks such as linear regression.

    - Data Preparation: NumPy is used to handle raw numerical data, which is then fed into machine learning models. Tasks such as splitting data into training and test sets, standardizing features, and creating model input matrices rely heavily on NumPy's array manipulation capabilities.
    - Model Training and Evaluation: Scikit-learn uses NumPy arrays to train models and evaluate their performance, allowing for efficient computation of metrics like mean squared error and R-squared.
    - Coefficient Analysis: After training a model, NumPy is used to analyze the coefficients, providing insights into how different features influence the target variable. This is critical for model interpretation and improving model accuracy.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate sample data
X = np.random.rand(100, 1)
y = 2 * X + 1 + np.random.randn(100, 1) * 0.1

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Use NumPy to calculate R-squared
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r_squared = 1 - (ss_res / ss_tot)
print(f"R-squared: {r_squared}")
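
As a rough sketch of the coefficient analysis mentioned earlier, the fitted slope and intercept can be read straight off the model and compared with the values used to generate the data. The example below rebuilds similar data on its own (the seed is an arbitrary assumption) and also checks the manual R-squared calculation against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)  # arbitrary seed for this sketch

# Same data-generating process as the cell above: y = 2x + 1 plus small noise
X = rng.random((100, 1))
y = 2 * X + 1 + rng.normal(0, 0.1, (100, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# The fitted parameters are NumPy arrays; flatten them to get scalars
slope = np.ravel(model.coef_)[0]
intercept = np.ravel(model.intercept_)[0]
print(f"Slope: {slope:.3f} (true value 2)")
print(f"Intercept: {intercept:.3f} (true value 1)")

# Manual R-squared, as in the cell above, compared with scikit-learn's result
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r_squared = 1 - ss_res / ss_tot
print(np.isclose(r_squared, r2_score(y_test, y_pred)))
```

The recovered slope and intercept should land close to 2 and 1, with the small gap coming from the added noise.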

## Applied Example: House Price Analysis

- Context and Objective:

    - In this example, we're simulating a simplified housing market analysis using NumPy. The goal is to understand how different features of a house, such as size, number of bedrooms, and age, influence its price. This analysis is essential in real estate for pricing properties accurately and predicting market trends.

- Data Preparation:

    - We begin by generating synthetic data to simulate a housing dataset of 1,000 houses. This data includes the house size (in square feet), number of bedrooms, age of the house (in years), and the corresponding price. The data is stored in a pandas DataFrame, and the features and target are extracted from it as NumPy arrays for modeling.

- Feature Selection and Matrix Operations:

    - The features (size, bedrooms, age) are extracted into a matrix $X$, and the target variable (price) is stored in vector $y$. This setup is typical in machine learning, where $X$ represents the input features, and $y$ represents the output we aim to predict.
    - Using NumPy, we calculate summary statistics such as the average house price. Further metrics, such as the price range or the price per square foot, can be computed the same way and give a quick view of the distribution and variability of the housing prices.

- Correlation Analysis:

    - We compute the correlation matrix to identify the relationships between the features and the house price. A correlation close to 1 or -1 between a feature and the price indicates a strong linear association, although correlation alone does not establish that the feature causes the price to change. For example, a high positive correlation between house size and price means that larger houses tend to be more expensive.
    - This step is crucial in feature selection, as it helps in identifying which features are most predictive of the target variable (price) and should be included in the model.

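- Linear Regression:

    - We fit a linear regression model that predicts the price from all three features (size, bedrooms, and age) using scikit-learn's `LinearRegression`. Linear regression models the relationship between the dependent variable (price) and one or more independent variables (e.g., size).
    - In the simplest case, with size as the only independent variable, the regression equation is of the form:
    
      $\text{Price} = \text{Intercept} + \text{Slope} \times \text{Size}$
      
    - With several features, each feature gets its own coefficient. The coefficients are the ordinary least-squares solution, which can be written using the pseudo-inverse of the feature matrix $X$ (extended with a column of ones for the intercept); `LinearRegression` computes the same least-squares fit. The intercept represents the baseline price when all features are zero, and each slope represents the change in price per unit increase in that feature with the others held fixed (for size, the increase in price for each additional square foot).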

- Model Evaluation:

    - After training the model, we evaluate its performance on the held-out test set using Mean Squared Error (MSE) and R-squared ($R^2$). MSE measures the average squared difference between the observed and predicted prices, so a lower MSE indicates better performance. Because MSE is expressed in squared units (here, dollars squared), its square root (RMSE) is often easier to interpret.
    - $R^2$ represents the proportion of the variance in the dependent variable that is predictable from the independent variables. An $R^2$ value closer to 1 indicates that the model explains a large portion of the variance in the price.

- Coefficient Analysis:

    - We analyze the coefficients of the linear regression model to understand the impact of each feature on the house price. For instance, a positive coefficient for size indicates that larger houses are more expensive, while a negative coefficient for age suggests that older houses tend to be cheaper.
    - This analysis helps in interpreting the model and understanding the underlying factors that drive housing prices.

- NumPy for Additional Analysis:

    - Beyond regression, NumPy is used for additional analysis, such as calculating the average house size, number of bedrooms, and age. These statistics provide further context to the dataset, helping to paint a complete picture of the housing market.
    - We also compute the correlation matrix for all features and the price. Its rows and columns follow the DataFrame's column order: size, bedrooms, age, price.

- Conclusion:

    - This applied example demonstrates the power of NumPy in real-world data analysis and machine learning. By leveraging NumPy's array manipulation and mathematical capabilities, we can efficiently analyze complex datasets, build predictive models, and gain insights that are critical for decision-making in industries like real estate.
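
The pseudo-inverse approach described in the regression bullet above can be sketched with NumPy alone. This is a minimal illustration, not the method used in the cell below (which relies on scikit-learn): it regenerates the synthetic data with the same seed and call order, fits price against size only, and skips the train/test split for brevity. In matrix form the estimate is $\hat{\beta} = A^{+} y$, where $A$ is the design matrix with a leading column of ones.

```python
import numpy as np

# Recreate the synthetic data using the same seed and call order as the notebook cell
np.random.seed(42)
n_samples = 1000
size = np.random.normal(1500, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
price = 100000 + 100 * size + 20000 * bedrooms - 1000 * age + np.random.normal(0, 50000, n_samples)

# Design matrix: a column of ones (intercept) next to the size column
A = np.column_stack([np.ones(n_samples), size])

# Least-squares coefficients via the Moore-Penrose pseudo-inverse
intercept, slope = np.linalg.pinv(A) @ price

# Evaluate the fit with NumPy: MSE and R-squared on the full dataset
predicted = intercept + slope * size
mse = np.mean((price - predicted) ** 2)
r_squared = 1 - np.sum((price - predicted) ** 2) / np.sum((price - price.mean()) ** 2)

print(f"Intercept: {intercept:.2f}")
print(f"Slope: {slope:.2f} per sq ft")
print(f"MSE: {mse:.2f}")
print(f"R-squared: {r_squared:.4f}")
```

For a single feature with an intercept, this R-squared equals the squared correlation between size and price. In numerical work, `np.linalg.lstsq` is usually preferred over forming the pseudo-inverse explicitly, since it solves the same least-squares problem directly.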

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create sample data
np.random.seed(42)
n_samples = 1000

size = np.random.normal(1500, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)

# Create price with some randomness
price = 100000 + 100 * size + 20000 * bedrooms - 1000 * age + np.random.normal(0, 50000, n_samples)

# Create a DataFrame
df = pd.DataFrame({
    'size': size,
    'bedrooms': bedrooms,
    'age': age,
    'price': price
})

# Prepare data for model
X = df[['size', 'bedrooms', 'age']].values
y = df['price'].values

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Analyze coefficients
coef_df = pd.DataFrame({
    'Feature': ['size', 'bedrooms', 'age'],
    'Coefficient': model.coef_
})

print("\nModel Coefficients:")
print(coef_df)

# Use NumPy for additional analysis
print("\nAverage house details:")
print(f"Size: {np.mean(df['size']):.2f} sq ft")
print(f"Bedrooms: {np.mean(df['bedrooms']):.2f}")
print(f"Age: {np.mean(df['age']):.2f} years")
print(f"Price: ${np.mean(df['price']):.2f}")

print("\nCorrelation matrix:")
print(np.corrcoef(df.values.T))

Mean Squared Error: 2235961170.3665066
R-squared: 0.6253712850449186

Model Coefficients:
    Feature   Coefficient
0      size    100.724556
1  bedrooms  20288.956557
2       age   -930.449615

Average house details:
Size: 1509.67 sq ft
Bedrooms: 3.04
Age: 24.83 years
Price: $285734.84

Correlation matrix:
[[ 1.         -0.02808147  0.02846315  0.64274548]
 [-0.02808147  1.         -0.0633853   0.37499572]
 [ 0.02846315 -0.0633853   1.         -0.17589202]
 [ 0.64274548  0.37499572 -0.17589202  1.        ]]
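
The correlation matrix above is unlabeled, so it helps to pair each value with its variable name. The sketch below regenerates the synthetic data with the same seed and call order (an assumption that it reproduces the cell's data), prints each feature's correlation with price, and adds the price range and price per square foot mentioned earlier. The positive-size mask is an addition of this sketch, because sizes drawn from a normal distribution can be zero or negative.

```python
import numpy as np

# Recreate the synthetic data using the same seed and call order as the cell above
np.random.seed(42)
n_samples = 1000
size = np.random.normal(1500, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
price = 100000 + 100 * size + 20000 * bedrooms - 1000 * age + np.random.normal(0, 50000, n_samples)

# Each row passed to np.corrcoef is one variable
names = ["size", "bedrooms", "age", "price"]
corr = np.corrcoef(np.vstack([size, bedrooms, age, price]))

# The last row holds every variable's correlation with price
for name, r in zip(names[:-1], corr[-1, :-1]):
    print(f"corr({name}, price) = {r:.3f}")

# Price range across the dataset
print(f"Price range: ${price.min():,.0f} to ${price.max():,.0f}")

# Price per square foot, computed only where the simulated size is positive
valid = size > 0
price_per_sqft = price[valid] / size[valid]
print(f"Median price per sq ft: ${np.median(price_per_sqft):.2f}")
```

The median is used for price per square foot because a few very small simulated sizes would otherwise dominate the mean.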


## Conclusion

This notebook has covered the basics of NumPy, from array creation and array operations to its use alongside scikit-learn for linear regression. With these tools, you can start using NumPy for numerical computing tasks in Python.

## Extra Resources

Each resource title below links to its source.

- [NumPy Tutorial](https://youtu.be/GB9ByFAIAH4?si=F63UUEK9E9u7IWVR)
    - Complete Python NumPy Tutorial (Creating Arrays, Indexing, Math, Statistics, Reshaping)
    - YouTube video (58 minutes)

&nbsp;

- [NumPy Tutorial](https://realpython.com/numpy-tutorial/)
    - Complete beginner course for NumPy
          
&nbsp; 
     
- [Learn Data Sci](https://www.learndatasci.com/tutorials/applied-introduction-to-numpy-python-tutorial/)
    - NumPy Tutorial: An Applied Introduction for Beginners