# NumPy

NumPy is a Python library for numerical computing. It is designed for scientific computation and is used extensively in data analysis because it handles large, multi-dimensional arrays and matrices efficiently. Its speed makes it the basis for many other Python data science libraries; the Pandas library, for example, is built on top of NumPy. It also gives us a structured way to store data, which makes the data easier to organize, access, and manipulate.

- A few use cases for NumPy that are especially relevant to data analysis:
  - Array operations
  - Linear algebra
  - Statistical functions
  - Random number generation

In [11]:
# Create a list of 10,000,000 salaries ranging from 50k to 150k

import random

#[random.randint(50000, 150000) for _ in range(1_000_000)]

salary_list = [random.randint(50000, 150000) for _ in range(10_000_000)]

In [12]:
import statistics

In [13]:
%%timeit

statistics.median(salary_list)

5.27 s ± 388 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
import numpy as np


In [15]:
%%timeit

np.median(salary_list)

848 ms ± 204 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
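Part of the time measured above is spent converting the Python list into an array, which `np.median` has to do on every call when it is handed a list. The sketch below assumes a smaller list so that it runs quickly, and the name `salary_array` is introduced here for illustration. It converts the list once and checks that both libraries agree on the result.

```python
import random
import statistics

import numpy as np

# Smaller list than in the cells above so the sketch runs quickly
random.seed(0)
salary_list = [random.randint(50000, 150000) for _ in range(100_000)]

# Convert once; later NumPy calls then skip the list-to-array conversion
salary_array = np.asarray(salary_list)

median_from_list = statistics.median(salary_list)
median_from_array = np.median(salary_array)

print(median_from_list, median_from_array)
```

Timing `np.median(salary_array)` with `%%timeit` instead of `np.median(salary_list)` is one way to see how much of the cost comes from the conversion on a given machine.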


In [18]:
my_array = np.array([1,2,3,4])

In [19]:
my_array.mean()

2.5

In [27]:
# Job Titles
job_titles = np.array(['Data Analyst', 'Data Scientist', 'Data Engineer', 'Machine Learning Engineer', 'AI Engineer'])

#Base Salaries
base_salaries = np.array([60000, 80000, 75000, 90000, np.nan])

#Bonus Rates
bonus_rates = np.array([0.05, 0.1, 0.08, 0.12, np.nan])

In [28]:
#Performing computations between arrays

total_salaries = base_salaries * (1 + bonus_rates)

total_salaries

array([ 63000.,  88000.,  81000., 100800.,     nan])

In [29]:
np.mean(total_salaries)

nan

In [30]:
#To calculate the mean despite the presence of a nan value, use the nanmean function

np.nanmean(total_salaries)

83200.0

In [23]:
type(None)

NoneType

np.nan is an alternative to None for marking missing values. Unlike None, np.nan is a float, so an array that contains it keeps a numeric dtype and still supports arithmetic.

In [26]:
type(np.nan)

float
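To make the difference concrete, here is a minimal sketch comparing an array built with `None` to one built with `np.nan`. The array names `with_none` and `with_nan` are made up for this illustration.

```python
import numpy as np

# None forces NumPy to fall back to a generic object array
with_none = np.array([60000, 80000, None])

# np.nan is a float, so the array stays numeric
with_nan = np.array([60000, 80000, np.nan])

print(with_none.dtype)  # object
print(with_nan.dtype)   # float64

# NaN-aware functions work on the float array
print(np.nanmean(with_nan))  # 70000.0
```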

# Extra Exercises

In [31]:
import numpy as np

In [33]:
years_of_experience = np.array([1,2,3,4,5])

# Basic Operations

In [34]:
#Addition

years_of_experience_plus_one = years_of_experience + 1
years_of_experience_plus_one

array([2, 3, 4, 5, 6])

In [35]:
#Subtraction
years_of_experience_minus_one = years_of_experience - 1
years_of_experience_minus_one

array([0, 1, 2, 3, 4])

In [36]:
#Division
years_of_experience_half = years_of_experience / 2
years_of_experience_half

array([0.5, 1. , 1.5, 2. , 2.5])

In [37]:
#Multiplication
years_of_experience_double = years_of_experience * 2
years_of_experience_double

array([ 2,  4,  6,  8, 10])

In [38]:
#Slicing
# Example: Selecting the experience requirement for the second and third job listings.
second_and_third_jobs_experience = years_of_experience[1:3]
second_and_third_jobs_experience

array([2, 3])

In [39]:
#Boolean Indexing
# Example: Selecting only those job listings that require more than 2 years of experience.
jobs_with_more_than_two_years_exp = years_of_experience[years_of_experience > 2]
jobs_with_more_than_two_years_exp

array([3, 4, 5])
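Boolean indexing works in two steps: the comparison produces a boolean array (a mask), and the mask then selects the matching elements. The sketch below shows the mask on its own and one way to combine two conditions. The names `mask` and `mid_range` are introduced here for illustration.

```python
import numpy as np

years_of_experience = np.array([1, 2, 3, 4, 5])

# The comparison itself yields a boolean array (the mask)
mask = years_of_experience > 2
print(mask.tolist())  # [False, False, True, True, True]

# Combine conditions with & (and) or | (or); each condition needs parentheses
mid_range = years_of_experience[(years_of_experience >= 2) & (years_of_experience <= 4)]
print(mid_range.tolist())  # [2, 3, 4]
```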

# Math Operations
- Aggregate Functions
  - sum: sum
  - prod: product
  - cumsum: cumulative sum
  - cumprod: cumulative product
- Mathematical Operations
  - sqrt: square root
  - exp: exponential (e raised to the power of each element)
  - log: natural logarithm
  - sin: sine
  - cos: cosine
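A quick sketch of a few of the functions listed above, applied element-wise to a small array. The array `values` is a made-up example.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# Element-wise square root
roots = np.sqrt(values)
print(roots.tolist())  # [1.0, 2.0, 3.0]

# np.log is the natural logarithm, so it undoes np.exp
round_trip = np.log(np.exp(values))
print(np.allclose(round_trip, values))  # True

# Aggregates reduce the whole array to one number
print(np.sum(values), np.prod(values))  # 14.0 36.0

# Trigonometric functions take radians
print(np.sin(0.0), np.cos(0.0))  # 0.0 1.0
```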

# Examples

First, let's create a list of 10 yearly salaries for a Senior Data Analyst job. We use random.randint from the random library to get random integers between 100000 and 150000, inside a list comprehension that repeats the draw 10 times.

In [42]:
import numpy as np
import random

In [45]:
# Scratch run of the same comprehension; note that range(11) yields 11 values, not 10
[random.randint(100000, 150000) for i in range(11)]

[100439,
 145434,
 126766,
 135929,
 133269,
 140105,
 138617,
 103673,
 124373,
 107578,
 142608]

In [46]:
salary = [random.randint(100000, 150000) for num in range(10)]
salary

[102596,
 130494,
 109523,
 147869,
 105111,
 124534,
 111153,
 124282,
 118215,
 126784]

In [48]:
#Convert it to a numpy array
salary_array = np.array(salary)
salary_array

array([102596, 130494, 109523, 147869, 105111, 124534, 111153, 124282,
       118215, 126784])

In [49]:
#Calculate the sum of the elements in the salary_array
total_sum_salaries = np.sum(salary_array)
total_sum_salaries

1200561

In [None]:
#Calculate the product of the elements in the salary_array
# This is a conceptual example: a product of salaries is rarely meaningful, and the result is far too large for the array's fixed-width integer type, so it overflows
product_salaries = np.prod(salary_array)
product_salaries
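The product of ten six-digit salaries is roughly 10^50, far beyond what a 64-bit integer can hold, so the integer result of `np.prod` here is not trustworthy. Two possible workarounds are sketched below: Python's arbitrary-precision integers for an exact value, or a float dtype for an approximate one. The names `exact_product` and `approx_product` are introduced for this illustration.

```python
import math

import numpy as np

salary_array = np.array([102596, 130494, 109523, 147869, 105111,
                         124534, 111153, 124282, 118215, 126784])

# Python integers have arbitrary precision, so this product is exact
exact_product = math.prod(int(s) for s in salary_array)

# Floating point keeps the magnitude but not every digit
approx_product = np.prod(salary_array, dtype=np.float64)

# The exact product is far larger than the biggest int64 value
print(exact_product > np.iinfo(np.int64).max)  # True
print(math.isclose(exact_product, approx_product, rel_tol=1e-9))  # True
```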

# Cumsum (Cumulative Sum)
- Calculates the cumulative sum of the elements of salary_array: each element in the output array is the sum of the current element and all preceding elements of the original array
- For the salary_array
  - First element of cumsum is 102596
  - Second element is 233090 (102596 + 130494)
  - Third element is 342613 (102596 + 130494 + 109523)
  - And so on...

In [50]:
#Cumsum (Cumulative Sum)
#Calculates the cumulative sum of the elements of salary_array.
#Each element in the output array is the sum of the current element and all preceding elements of the original array

cumulative_sum_salaries = np.cumsum(salary_array)
cumulative_sum_salaries

array([ 102596,  233090,  342613,  490482,  595593,  720127,  831280,
        955562, 1073777, 1200561])
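Two properties of the cumulative sum are worth checking by hand: the last running total equals the overall sum, and differencing the running totals gives back the original values. A minimal sketch, where `cumulative` and `recovered` are names chosen for this illustration:

```python
import numpy as np

salary_array = np.array([102596, 130494, 109523, 147869, 105111,
                         124534, 111153, 124282, 118215, 126784])

cumulative = np.cumsum(salary_array)

# The last running total is the overall sum
print(cumulative[-1] == np.sum(salary_array))  # True

# Differencing the running totals recovers the original values
recovered = np.diff(cumulative, prepend=0)
print(np.array_equal(recovered, salary_array))  # True
```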

# Cumprod (Cumulative Product)
Calculates the cumulative product of the elements of salary_array: each element in the output array is the product of the current element and all preceding elements of the original array.

For the salary_array
  - First element of cumprod is 102596
  - Second element is 13388162424 (102596 × 130494)
  - Third element is 1466311713163752 (102596 × 130494 × 109523)
  - And so on. From the fourth element onward the true product exceeds the int64 range, so the stored values overflow (wrap around, sometimes to negative numbers) and are not meaningful.


In [51]:
cumulative_prod_salaries = np.cumprod(salary_array)
cumulative_prod_salaries

array([              102596,          13388162424,     1466311713163752,
       -4538882170703774904,  1698133505649510264,  1883931549811491152,
       -2795168554153926576,   -53851260026685920, -1899998624880725280,
        6605201695160653824])
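The negative numbers in the output above are a sign of integer overflow rather than real products. The sketch below, which is only one way to check this, computes the exact running products with Python integers and finds the first one that no longer fits in int64. The names `exact_cumprod`, `first_overflow`, and `float_cumprod` are introduced for this illustration.

```python
from itertools import accumulate
from operator import mul

import numpy as np

salaries = [102596, 130494, 109523, 147869, 105111,
            124534, 111153, 124282, 118215, 126784]

# Exact running products using Python's arbitrary-precision integers
exact_cumprod = list(accumulate(salaries, mul))

# Index of the first running product that no longer fits in int64
int64_max = np.iinfo(np.int64).max
first_overflow = next(i for i, p in enumerate(exact_cumprod) if p > int64_max)
print(first_overflow)  # 3, i.e. the fourth element

# A float cumulative product avoids wraparound at the cost of exact digits
float_cumprod = np.cumprod(np.array(salaries, dtype=np.float64))
print(float_cumprod[:3])
```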

# Statistics Operations

Most of these are also available in Pandas, since Pandas is built on top of NumPy.

- mean
- median
- var: variance
- std: standard deviation
- min
- max

In [52]:
#Mean
#Calculate the average salary in the salary_array

average_salary = np.mean(salary_array)
average_salary

120056.1

In [53]:
#Median
#Calculate the median salary in the salary_array
median_salary = np.median(salary_array)
median_salary

121248.5

In [54]:
#Variance
#Calculate the variance of the salaries in the salary_array
salary_variance = np.var(salary_array, ddof=1) #ddof = 1 for sample variance
salary_variance

187499306.76666668

In [55]:
# Standard deviation of the salaries in the salary_array
salary_std_dev = np.std(salary_array, ddof=1)  # ddof=1 for sample standard deviation
salary_std_dev

13693.03862430347
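The `ddof` argument controls the divisor: NumPy divides the sum of squared deviations by `n - ddof`, so the default `ddof=0` gives the population variance and `ddof=1` gives the sample variance. The sketch below recomputes both by hand. The names `squared_deviations`, `population_var`, and `sample_var` are introduced for this illustration.

```python
import numpy as np

salary_array = np.array([102596, 130494, 109523, 147869, 105111,
                         124534, 111153, 124282, 118215, 126784])

n = salary_array.size
squared_deviations = (salary_array - salary_array.mean()) ** 2

# Divide by n for the population variance (ddof=0, the default)
population_var = squared_deviations.sum() / n

# Divide by n - 1 for the sample variance (ddof=1)
sample_var = squared_deviations.sum() / (n - 1)

print(np.isclose(population_var, np.var(salary_array)))              # True
print(np.isclose(sample_var, np.var(salary_array, ddof=1)))          # True
print(np.isclose(np.sqrt(sample_var), np.std(salary_array, ddof=1))) # True
```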

In [56]:
# Minimum salary in the salary_array
min_salary = np.min(salary_array)
min_salary

102596

In [57]:
# Maximum salary in the salary_array
max_salary = np.max(salary_array)
max_salary

147869

# NaN
- Generate NaN values using np.nan
- The np.nan value is used in NumPy (and, by extension, Pandas) to represent missing or undefined data
- It is helpful because it:
  - Marks missing data explicitly
  - Keeps computations running: an operation involving np.nan returns np.nan instead of raising an error
  - Makes it possible to filter out or fill in missing data with methods we'll use often in the pandas library, such as dropna(), fillna(), isna(), and notna()

In [58]:
# Insert missing values
# To insert missing values into an array intentionally, perhaps to indicate that data is expected but not yet available, use np.nan.

salary_with_nan = np.array([123124, np.nan, 145000, 128000, 110000, 149999, np.nan, 135000, 115000, 140000], dtype=float)
salary_with_nan

array([123124.,     nan, 145000., 128000., 110000., 149999.,     nan,
       135000., 115000., 140000.])

In [59]:
# Replace values with NaN
# If you want to replace existing values with np.nan, for example, if certain values are considered invalid or outliers:

salary_with_nan[salary_with_nan<130000] = np.nan
salary_with_nan

array([    nan,     nan, 145000.,     nan,     nan, 149999.,     nan,
       135000.,     nan, 140000.])
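Once an array contains NaN values, the usual follow-up is to find, count, or skip them. Because NaN never compares equal to anything (including itself), `np.isnan` is the tool for locating it. A minimal sketch, where `missing` and `valid_salaries` are names chosen for this illustration:

```python
import numpy as np

salary_with_nan = np.array([np.nan, np.nan, 145000, np.nan, np.nan,
                            149999, np.nan, 135000, np.nan, 140000])

# NaN is not equal to itself, so == cannot be used to find it
print(np.nan == np.nan)  # False

# np.isnan builds a boolean mask marking the missing entries
missing = np.isnan(salary_with_nan)
print(missing.sum())  # 6

# Keep only the valid entries
valid_salaries = salary_with_nan[~missing]
print(valid_salaries.tolist())  # [145000.0, 149999.0, 135000.0, 140000.0]

# The NaN-aware mean matches the mean of the valid entries
print(np.nanmean(salary_with_nan))  # 142499.75
```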

# Where
- np.where checks each element of an array against a condition and assigns one value where the condition is True and another where it is False
- Syntax: np.where(condition, x, y)
- It's commonly used to conditionally replace array elements

Example
- We're going to replace all values in salary_array that are less than 120,000 with 120,000 (to apply a minimum salary threshold).
- We'll use this syntax for it: np.where(condition, x, y).
  - Where the condition is True the result takes the value from x; otherwise it takes the value from y.

In [60]:
#Replace values using np.where
salary_array = np.where(salary_array < 120000, 120000, salary_array)
salary_array

array([120000, 130494, 120000, 147869, 120000, 124534, 120000, 124282,
       120000, 126784])
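Two related points, sketched under the assumption that we start again from the original salaries: for a simple lower bound, `np.maximum` gives the same result as the three-argument `np.where`, and calling `np.where` with only a condition returns the indices where the condition holds. The names `floored` and `below_threshold_idx` are introduced for this illustration.

```python
import numpy as np

salary_array = np.array([102596, 130494, 109523, 147869, 105111,
                         124534, 111153, 124282, 118215, 126784])

# Three-argument form: pick 120000 where the condition holds, else keep the salary
floored = np.where(salary_array < 120000, 120000, salary_array)

# For a simple lower bound, np.maximum gives the same result
print(np.array_equal(floored, np.maximum(salary_array, 120000)))  # True

# With only a condition, np.where returns a tuple of index arrays
below_threshold_idx = np.where(salary_array < 120000)[0]
print(below_threshold_idx.tolist())  # [0, 2, 4, 6, 8]
```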

# Random Sampling

- Generate random numbers or samples
- np.random.normal draws random samples from a normal (Gaussian) distribution
  - Arguments
    - loc: the mean (μ) of the normal distribution
    - scale: the standard deviation (σ) of the normal distribution, representing the dispersion from the mean
    - size: the number of random samples to draw, which in the example below is set to match the number of job postings
  - Syntax: np.random.normal(loc=0.0, scale=1.0, size=None)
- A few other random sampling functions
  - np.random.rand
  - np.random.randn
  - np.random.randint
  - np.random.random
  - np.random.uniform
  - np.random.binomial
  - np.random.poisson

Example
- Let's add some random noise to the salary_array to simulate salary variations. We can generate random values from a normal distribution with a mean of 0 and a standard deviation of 5000, then add these values to the salaries.
- Why? This can be used to simulate salary data for job postings if actual salary data isn't available, for instance, in modeling or simulation scenarios.

In [61]:
#Generate numbers based on normal distribution
noise = np.random.normal(0, 5000, salary_array.size)

#Add these numbers to the salary array
salary_array_with_noise = salary_array + noise
salary_array_with_noise

array([123807.60242597, 126197.00689288, 119882.42470323, 155421.65936605,
       120633.21789   , 123453.71977225, 117459.44012756, 131469.58467103,
       115497.73761989, 130452.9887922 ])
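The noise above changes on every run. If repeatable results matter, one option (not used in the cells above) is NumPy's `Generator` interface with a fixed seed. The seed value 42 and the names `rng`, `rng_again`, and `noise_again` are arbitrary choices for this sketch.

```python
import numpy as np

salary_array = np.array([120000, 130494, 120000, 147869, 120000,
                         124534, 120000, 124282, 120000, 126784])

# A seeded Generator makes the random draws repeatable
rng = np.random.default_rng(seed=42)
noise = rng.normal(loc=0, scale=5000, size=salary_array.size)
salary_array_with_noise = salary_array + noise

# Re-creating the generator with the same seed reproduces the same noise
rng_again = np.random.default_rng(seed=42)
noise_again = rng_again.normal(loc=0, scale=5000, size=salary_array.size)
print(np.array_equal(noise, noise_again))  # True
```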