# Introduction to NumPy for Data Analytics

## Introduction
In this notebook, we will explore the basics of NumPy, a fundamental library in Python for numerical computations.
NumPy is crucial for efficient data manipulation and analysis, especially when dealing with large datasets.

NumPy arrays are optimized for numerical computation: they are generally faster and more memory-efficient than Python lists.
They let you perform calculations on all elements at once, work with arrays of different shapes, and handle data in multiple dimensions, such as tables or grids.
Python lists do not support element-wise arithmetic, and unlike lists, NumPy arrays store elements of a single data type.
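
As a rough illustration of the difference between lists and arrays, the sketch below multiplies a list and an array by 2 and then checks the data type of an array built from mixed numbers. The `prices` values are made up for this example.

```python
import numpy as np

# Hypothetical sample values, used only for illustration.
prices = [1.5, 2.0, 3.25]

# A list has no element-wise arithmetic: `prices * 2` repeats the list.
repeated = prices * 2

# An ndarray applies the multiplication to every element at once.
doubled = np.array(prices) * 2

# Mixing integers and floats produces one common data type for the whole array.
mixed = np.array([1, 2, 3.5])

print(repeated)     # the list twice in a row, six items
print(doubled)      # each price multiplied by 2
print(mixed.dtype)  # float64
```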

In [1]:
import numpy as np

## Understanding ndarrays
The core data structure in NumPy is the ndarray (n-dimensional array), which provides fast and efficient operations on large data.
Let's start by creating some ndarrays.

### Creating a 1D ndarray
A 1D array (vector) can be created using the np.array() function.

In [2]:
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array (ndarray):")
print(arr1)

1D Array (ndarray):
[1 2 3 4 5]


### Creating a 2D ndarray
A 2D array (matrix) can be created by passing a list of lists to the np.array() function.

In [3]:
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("2D Array (ndarray):")
print(arr2)

2D Array (ndarray):
[[1 2 3]
 [4 5 6]
 [7 8 9]]
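
Once an array exists, a few attributes describe its structure. The short sketch below uses a hypothetical 2 x 3 array named `matrix` to show the most common ones.

```python
import numpy as np

# Hypothetical 2 x 3 array, used only for illustration.
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.ndim)   # 2  (number of dimensions)
print(matrix.shape)  # (2, 3)  (rows, columns)
print(matrix.size)   # 6  (total number of elements)
print(matrix.dtype)  # an integer type; the exact width depends on the platform
```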


## Understanding Vectorisation
Vectorisation is a powerful feature in NumPy that allows you to perform operations on entire arrays without the need for explicit loops.
This leads to more concise and faster code.

### Vectorised Operations
You can perform arithmetic operations on entire arrays.

In [4]:
print("Adding 10 to each element of arr1:")
print(arr1 + 10)

print("Multiplying each element of arr1 by 2:")
print(arr1 * 2)

Adding 10 to each element of arr1:
[11 12 13 14 15]
Multiplying each element of arr1 by 2:
[ 2  4  6  8 10]
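
To make the contrast with explicit loops concrete, here is a minimal sketch that computes the same result both ways and then multiplies two arrays element by element. The array name `values` is chosen for this example only.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# Explicit loop: build the result one element at a time.
looped = np.array([v + 10 for v in values])

# Vectorised form: one expression covers the whole array.
vectorised = values + 10

# Two arrays of the same shape are combined element by element.
squares = values * values

print(np.array_equal(looped, vectorised))  # True
print(squares)                             # [ 1  4  9 16 25]
```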


## Selecting and Slicing Rows and Columns
Indexing and slicing allow you to select specific elements, rows, or columns from an ndarray.

### Selecting Specific Elements
You can access elements by their indices (remember that indexing starts at 0).

In [5]:
print("First element of arr1:", arr1[0])
print("Element at position (1,2) in arr2:", arr2[1, 2])

First element of arr1: 1
Element at position (1,2) in arr2: 6


### Slicing Rows and Columns
Slicing allows you to select a range of elements.

In [6]:
print("First two elements of arr1:", arr1[:2])
print("First row of arr2:", arr2[0, :])
print("First two columns of arr2:")
print(arr2[:, :2])

First two elements of arr1: [1 2]
First row of arr2: [1 2 3]
First two columns of arr2:
[[1 2]
 [4 5]
 [7 8]]
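
A few more slicing patterns may be useful. The sketch below uses a separate array called `grid` (a name chosen for this example) and also shows that a basic slice is a view onto the original data, so `.copy()` is needed when you want an independent array.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

last_row = grid[-1, :]          # negative indices count from the end: [7 8 9]
middle_column = grid[:, 1]      # a single column: [2 5 8]
top_right_block = grid[:2, 1:]  # rows 0-1, columns 1-2: [[2 3] [5 6]]
every_other_row = grid[::2, :]  # a step of 2 selects rows 0 and 2

# A basic slice is a view: writing through it changes the original array.
first_row = grid[0, :]
first_row[0] = 99
print(grid[0, 0])  # 99

# .copy() gives an independent array that can be changed safely.
independent = grid[:, 1].copy()
independent[0] = -1
print(grid[0, 1])  # 2
```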


## Assigning Values to ndarrays
You can also assign values to specific elements or slices of an ndarray.
This is useful for updating the data within an array.

### Assigning a Single Value
Assign a single value to a specific element in the array.

In [7]:
arr1[0] = 10
print("Updated arr1 after assigning 10 to the first element:")
print(arr1)

Updated arr1 after assigning 10 to the first element:
[10  2  3  4  5]


### Assigning Values to Slices
You can assign a single value to every element of a slice, or assign a different value to each element from a list or another ndarray of matching shape.

In [8]:
arr2[:, 0] = 100
print("Updated arr2 after assigning 100 to the first column:")
print(arr2)

Updated arr2 after assigning 100 to the first column:
[[100   2   3]
 [100   5   6]
 [100   8   9]]


### Assign different values to a slice

In [9]:
arr2[0, :] = [101, 102, 103]
print("Updated arr2 after assigning different values to the first row:")
print(arr2)

Updated arr2 after assigning different values to the first row:
[[101 102 103]
 [100   5   6]
 [100   8   9]]
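
The values assigned to a slice can also come from another ndarray, as long as the shapes fit. The sketch below works on a fresh array named `table` (hypothetical, so `arr2` is left alone) and shows what happens when the lengths do not match.

```python
import numpy as np

table = np.zeros((3, 3), dtype=int)

# A single value is repeated across the whole slice.
table[:, 0] = 7

# An ndarray with one value per selected element fills the slice in order.
table[1, :] = np.array([10, 20, 30])

# A sequence of the wrong length cannot be fitted into the slice.
try:
    table[2, :] = [1, 2]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(table)
print(type(mismatch_error).__name__)  # ValueError
```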


## Vector Operations on 1D Arrays
NumPy provides easy-to-use functions to calculate basic statistics on arrays.

### Operations on 1D Arrays
Let's calculate the minimum, maximum, mean, and sum of elements in a 1D array.

In [10]:
print("arr1:", arr1)
print("Min of arr1:", np.min(arr1))
print("Max of arr1:", np.max(arr1))
print("Mean of arr1:", np.mean(arr1))
print("Sum of arr1:", np.sum(arr1))

arr1: [10  2  3  4  5]
Min of arr1: 2
Max of arr1: 10
Mean of arr1: 4.8
Sum of arr1: 24
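
The same statistics are also available as array methods, and NumPy offers the median and standard deviation as well. A small sketch, using a standalone array `values` that holds the same numbers as `arr1` above:

```python
import numpy as np

values = np.array([10, 2, 3, 4, 5])

# Method form of the functions used above.
print(values.min(), values.max())  # 2 10
print(values.mean())               # 4.8

# Median and standard deviation.
print(np.median(values))           # 4.0
print(np.std(values))              # population standard deviation (ddof=0)
print(np.std(values, ddof=1))      # sample standard deviation
```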


## Vector Operations on 2D Arrays
Similarly, you can perform these operations on 2D arrays.

### Operations on 2D Arrays
You can calculate the min, max, mean, and sum across the entire array or along a specific axis: axis=0 gives one result per column, and axis=1 gives one result per row.

In [11]:
print("arr2:", arr2)
print("Min of arr2:", np.min(arr2))
print("Max of arr2:", np.max(arr2))
print("Mean of arr2:", np.mean(arr2))
print("Sum of arr2:", np.sum(arr2))

arr2: [[101 102 103]
 [100   5   6]
 [100   8   9]]
Min of arr2: 5
Max of arr2: 103
Mean of arr2: 59.333333333333336
Sum of arr2: 534


### Sum across rows (axis=1)

In [12]:
print("Sum across rows of arr2:")
print(np.sum(arr2, axis=1))

Sum across rows of arr2:
[306 111 117]


### Sum across columns (axis=0)

In [13]:
print("Sum across columns of arr2:")
print(np.sum(arr2, axis=0))

Sum across columns of arr2:
[301 115 118]
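
The axis argument is easier to follow on a non-square array, where the result shapes differ. A minimal sketch with a hypothetical 2 x 3 array named `scores`:

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])  # shape (2, 3)

# axis=0 collapses the rows, leaving one value per column.
column_totals = scores.sum(axis=0)
print(column_totals)  # [5 7 9]

# axis=1 collapses the columns, leaving one value per row.
row_totals = scores.sum(axis=1)
print(row_totals)     # [ 6 15]

# The same argument works for the other statistics.
row_means = scores.mean(axis=1)
print(row_means)      # [2. 5.]
```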


## Reading CSV Files with NumPy
NumPy can also be used to read data from CSV files, which is essential for data analysis.

### Reading a CSV File
Let's read a CSV file into a NumPy array. We will use the `nyc_taxis.csv` file and skip its header row.

In [15]:
csv_data = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
print("CSV Data as ndarray:")
print(csv_data[:5])

CSV Data as ndarray:
[[2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
  2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 1.000e+00
  1.629e+01 1.520e+03 4.500e+01 1.300e+00 0.000e+00 8.000e+00 5.430e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  1.270e+01 1.462e+03 3.650e+01 1.300e+00 0.000e+00 0.000e+00 3.780e+01
  2.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  8.700e+00 1.210e+03 2.600e+01 1.300e+00 0.000e+00 5.460e+00 3.276e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  5.560e+00 7.590e+02 1.750e+01 1.300e+00 0.000e+00 0.000e+00 1.880e+01
  2.000e+00]]


In [16]:
# Define a custom formatter function
def custom_formatter(x):
    if x.is_integer():
        return f'{int(x)}'  # Convert to an integer and format without decimals
    else:
        return f'{x:.2f}'  # Format with two decimal places

# Set the print options using the custom formatter
np.set_printoptions(suppress=True, formatter={'float_kind': custom_formatter})
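
If you would rather not change the print settings for the whole notebook, a similar formatter can be applied temporarily with the `np.printoptions` context manager. The sketch below assumes a helper named `compact_float` and a few sample values, both made up for this example.

```python
import numpy as np

def compact_float(x):
    """Show whole numbers without decimals, everything else with two."""
    return f"{int(x)}" if x.is_integer() else f"{x:.2f}"

# Hypothetical sample values.
sample = np.array([2016.0, 0.8, 16.29, 52.0])

# The settings apply only inside the `with` block.
with np.printoptions(suppress=True, formatter={"float_kind": compact_float}):
    formatted = str(sample)

# Outside the block, the print settings in effect beforehand apply again.
unformatted = str(sample)

print(formatted)
print(unformatted)
```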

In [17]:
csv_data = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
print("CSV Data as ndarray:")
print(csv_data[:5])

CSV Data as ndarray:
[[2016 1 1 5 0 2 4 21 2037 52 0.80 5.54 11.65 69.99 1]
 [2016 1 1 5 0 2 1 16.29 1520 45 1.30 0 8 54.30 1]
 [2016 1 1 5 0 2 6 12.70 1462 36.50 1.30 0 0 37.80 2]
 [2016 1 1 5 0 2 6 8.70 1210 26 1.30 0 5.46 32.76 1]
 [2016 1 1 5 0 2 6 5.56 759 17.50 1.30 0 0 18.80 2]]
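
To see how `np.genfromtxt` treats headers, text, and missing entries without needing a file on disk, the sketch below reads a tiny in-memory CSV. The column names and values are hypothetical.

```python
import io
import numpy as np

# Hypothetical three-line CSV standing in for a real file.
csv_text = io.StringIO(
    "trip_id,distance,payment\n"
    "1,2.5,cash\n"
    "2,,card\n"
)

# skip_header=1 drops the column names; every field is parsed as a float.
data = np.genfromtxt(csv_text, delimiter=",", skip_header=1)

print(data.shape)      # (2, 3)
print(data.dtype)      # float64
print(np.isnan(data))  # text and empty fields are read as nan
```

Because the default data type is float, any column that holds text ends up as `nan`, which is worth checking for before computing statistics.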


## Boolean Arrays and Indexing
Boolean arrays are arrays where each value is either True or False, and they are useful for filtering data.

### Creating a Boolean Array
Let's create a boolean array by applying a condition to an ndarray.

In [18]:
boolean_arr = csv_data[:, 7] > 10  # Check if trip distance is greater than 10 units
print("Boolean Array (trip distance > 10):")
print(boolean_arr[:10])

Boolean Array (trip distance > 10):
[ True  True  True False False  True False False  True  True]


### Boolean Indexing
You can use a boolean array to filter elements in an ndarray.

In [19]:
print("Filtered rows where trip distance > 10:")
print(csv_data[boolean_arr, :][:10])

Filtered rows where trip distance > 10:
[[2016 1 1 5 0 2 4 21 2037 52 0.80 5.54 11.65 69.99 1]
 [2016 1 1 5 0 2 1 16.29 1520 45 1.30 0 8 54.30 1]
 [2016 1 1 5 0 2 6 12.70 1462 36.50 1.30 0 0 37.80 2]
 [2016 1 1 5 0 4 2 21.45 2004 52 0.80 0 52.80 105.60 1]
 [2016 1 1 5 0 2 5 36.30 2562 109.50 0.80 11.08 10 131.38 1]
 [2016 1 1 5 0 6 2 12.46 1351 36 1.30 0 0 37.30 2]
 [2016 1 1 5 1 4 2 16.60 1467 52 0.80 5.54 0 58.34 2]
 [2016 1 1 5 1 2 4 18.06 1588 52 0.80 5.54 11.67 70.01 1]
 [2016 1 1 5 1 4 2 16.30 1484 52 0.80 5.54 0 58.34 2]
 [2016 1 1 5 1 4 3 12.09 1265 33.50 1.30 5.54 10.08 50.42 1]]
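
Boolean arrays can also be combined, negated, and counted. The sketch below uses a small made-up array `trips` whose two columns are assumed to hold a distance and a fare.

```python
import numpy as np

# Hypothetical data: column 0 is a distance, column 1 is a fare.
trips = np.array([
    [21.0, 52.0],
    [1.2, 6.5],
    [12.7, 36.5],
    [0.8, 5.0],
])

long_trip = trips[:, 0] > 10

# & (and) and | (or) combine conditions element by element;
# each comparison needs its own parentheses.
long_and_cheap = trips[(trips[:, 0] > 10) & (trips[:, 1] < 40)]

# True counts as 1, so summing a boolean array counts the matches.
n_long = long_trip.sum()

# ~ inverts a boolean array; here it selects the fares of the shorter trips.
short_fares = trips[~long_trip, 1]

print(long_and_cheap)  # [[12.7 36.5]]
print(n_long)          # 2
print(short_fares)     # [6.5 5. ]
```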



# NumPy Challenges with NYC Taxis Dataset

Now that you've learned the basics of NumPy, let's apply your knowledge to the `nyc_taxis.csv` dataset. Try to complete the following tasks:

- **Load the Dataset:** Reload the `nyc_taxis.csv` file into a NumPy ndarray to discard any changes made in earlier cells.

## Challenge 1: Exploring the Data
- **Inspect the Data:** Print out the first 10 rows of the dataset to get an overview.
- **Shape and Structure:** Display the shape of the ndarray to understand its dimensions (i.e., number of rows and columns).

## Challenge 2: Selecting and Slicing Data
- **Select Columns:** Extract the `trip_distance`, `fare_amount`, and `total_amount` columns and create a new ndarray from them.
- **Slice Rows and Columns:** Select the first 100 rows and the first 5 columns of the dataset.

## Challenge 3: Vectorised Operations
- **Calculate Total Fare Per Mile:** Using vectorised operations, calculate the fare per mile for each trip and create a new ndarray with these values.
  - *Hint:* Divide the `total_amount` column by the `trip_distance` column.
- **Increase Fare Amounts:** Add a flat $2.50 surcharge to all fare amounts and store the result in a new ndarray.

## Challenge 4: Basic Statistics
- **Calculate Averages:** Compute the mean, median, and standard deviation for the `trip_distance`, `fare_amount`, and `total_amount` columns.
- **Identify Extremes:** Find the maximum and minimum values for `trip_distance` and `total_amount`.

## Challenge 5: Boolean Indexing
- **Filter High Fares:** Create a Boolean array to filter trips where the total fare amount is greater than $100. Print the first 5 rows of the filtered data.
- **Short Trips:** Identify trips where the distance was less than 2 miles. How many such trips are there?

## Challenge 6: Assigning Values
- **Apply Discounts:** For trips with a `fare_amount` greater than $50, apply a 10% discount. Create a new ndarray with the discounted fares.
- **Mark High Tips:** Add a new column to the dataset that marks trips with a `tip_amount` greater than 20% of the `fare_amount` as 1 (high tip) and others as 0 (low tip).

## Challenge 7: Working with Boolean Arrays
- **Count June Rides:** Determine how many rides took place in June.
- **Calculate Average Trip Length:** Calculate the average `trip_length` for all rides that took place in June.
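
Challenge 6 needs two techniques that were not shown above: changing only the elements that match a condition, and adding a column. The sketch below is one possible approach on toy data, not a solution for the taxi dataset. The array `toy` and its column meanings (fare, tip) are assumptions made for this example.

```python
import numpy as np

# Toy data: column 0 is assumed to be a fare, column 1 a tip.
toy = np.array([
    [60.0, 5.0],
    [20.0, 6.0],
    [80.0, 20.0],
])

# Work on a copy so the original column stays unchanged,
# then update only the elements selected by the boolean mask.
discounted = toy[:, 0].copy()
discounted[discounted > 50] *= 0.9

# Turn a boolean condition into a 0/1 column and append it.
high_tip = (toy[:, 1] > 0.2 * toy[:, 0]).astype(float)
with_flag = np.column_stack([toy, high_tip])

print(discounted)
print(with_flag.shape)  # (3, 3)
```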

In [None]:
#1

In [None]:
#2

In [None]:
#3

In [None]:
#4

In [None]:
#5

In [None]:
#6

In [None]:
#7