## Introduction to NumPy

- Applying NumPy fundamentals to the NYC airport taxi dataset.
- The dataset includes the following columns:

- pickup_month:  The month of the trip (January is 1, December is 12).
- pickup_day:  The day of the month of the trip.
- pickup_location_code:  The airport or borough where the trip started.
- dropoff_location_code: The airport or borough where the trip finished.
- trip_distance: The distance of the trip in miles.
- trip_length: The length of the trip in seconds.
- fare_amount: The base fare of the trip, in dollars.
- total_amount: The total amount charged to the passenger, including all fees, tolls and tips. 

In [114]:
import numpy as np 
import csv

In [115]:
# Open the dataset with the csv module; it is converted to a NumPy ndarray in a later cell.

with open("nyc_taxis.csv") as opened_file:
    file_as_list = list(csv.reader(opened_file))
header = file_as_list[0]
data = file_as_list[1:]
print(header)


['pickup_year', 'pickup_month', 'pickup_day', 'pickup_dayofweek', 'pickup_time', 'pickup_location_code', 'dropoff_location_code', 'trip_distance', 'trip_length', 'fare_amount', 'fees_amount', 'tolls_amount', 'tip_amount', 'total_amount', 'payment_type']
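
The cells that follow refer to columns by number (for example 9 for fare_amount). As a small sketch, the header printed above can be turned into a name-to-index lookup so the numbers are easier to check. The dictionary name `col` is a hypothetical helper, not part of the original notebook.

```python
# Column names copied from the header printed above.
header = ['pickup_year', 'pickup_month', 'pickup_day', 'pickup_dayofweek',
          'pickup_time', 'pickup_location_code', 'dropoff_location_code',
          'trip_distance', 'trip_length', 'fare_amount', 'fees_amount',
          'tolls_amount', 'tip_amount', 'total_amount', 'payment_type']

# Map each column name to its position (hypothetical helper dict).
col = {name: i for i, name in enumerate(header)}

print(col["trip_distance"], col["trip_length"], col["fare_amount"], col["total_amount"])
```

With this lookup, an expression such as taxi[:, col["fare_amount"]] selects the same column as taxi[:, 9].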


In [116]:
# Note: csv.reader returns every value as a string,
# so each value must be converted to a number before building the array.

converted_data = []

for row in data:
    converted_row = []
    for i in row:
        i = float(i)
        converted_row.append(i)
    converted_data.append(converted_row)
    
# Convert the nested list into a 2D NumPy array using np.array()
taxi = np.array(converted_data)
# Show the first four rows
print(taxi[0:4])

[[2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
  2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 1.000e+00
  1.629e+01 1.520e+03 4.500e+01 1.300e+00 0.000e+00 8.000e+00 5.430e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  1.270e+01 1.462e+03 3.650e+01 1.300e+00 0.000e+00 0.000e+00 3.780e+01
  2.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  8.700e+00 1.210e+03 2.600e+01 1.300e+00 0.000e+00 5.460e+00 3.276e+01
  1.000e+00]]
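
The nested loop above converts one value at a time. As an alternative sketch, NumPy can do the conversion in one step by building a string array and calling astype(float). The rows below are made-up stand-ins for what csv.reader returns, not the real file.

```python
import numpy as np

# Two made-up rows of strings, standing in for the output of csv.reader.
rows = [["2016", "1", "21.0", "52"],
        ["2016", "2", "16.29", "45"]]

# Build a string array first, then convert every element to float at once.
arr = np.array(rows).astype(float)

print(arr.dtype, arr.shape)
```

This only works when every field parses as a number, which is the same assumption the loop makes.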


In [117]:
# show some information about the dataset
print("the type of the data now is: ",type(taxi))

# print the number of rows and columns, then the element type
print("the shape of the matrix is: ", taxi.shape)
print("the type of the array elements is : ", taxi.dtype)

the type of the data now is:  <class 'numpy.ndarray'>
the shape of the matrix is:  (89560, 15)
the type of the array elements is :  float64


In [118]:
# show a 2D slice: the first three rows of columns 3 to 5
print(taxi[0:3 , 3:6])

[[5. 0. 2.]
 [5. 0. 2.]
 [5. 0. 2.]]


In [119]:
# show the day of the month (column 2) for rows 50, 2000 and 2500
print("some days in the month : ", taxi[[50,2000,2500], 2])

some days in the month :  [1. 5. 6.]


In [120]:
# to see the pickup year of every trip, select column 0:
print("the pick up year is:" , taxi[:, 0])

the pick up year is: [2016. 2016. 2016. ... 2016. 2016. 2016.]


In [121]:
# to find the minimum fare amount:
print("the minimum fare amount is:", taxi[:,9].min(), "(a negative fare is invalid, so the data needs cleaning)")

the minimum fare amount is: -52.0 (a negative fare is invalid, so the data needs cleaning)


In [122]:
# .max() without an axis returns the single largest value across rows 2, 3 and 5
# (here that is the pickup year, 2016):
print("the max value across rows [2,3,5] is: ", taxi[[2,3,5]].max())

the max value across rows [2,3,5] is:  2016.0
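
Calling max() with no arguments collapses the whole selection to one number. A short sketch on a made-up array shows how the axis argument changes that: axis=1 gives one maximum per row and axis=0 gives one per column.

```python
import numpy as np

# Made-up values: three rows, three columns.
sample = np.array([[2016.0, 1.0, 37.8],
                   [2016.0, 1.0, 32.76],
                   [2016.0, 2.0, 18.8]])

overall_max = sample.max()      # single largest value in the array
row_max = sample.max(axis=1)    # one maximum per row
col_max = sample.max(axis=0)    # one maximum per column

print(overall_max)
print(row_max)
print(col_max)
```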


In [123]:
#what is the length of the taxi dataset ? 
print("the length of the dataset is {} rows:".format(len(taxi)))

the length of the dataset is 89560 rows:


In [124]:
# Element-wise multiplication of two columns.
# Column 9 is fare_amount and column 8 is trip_length (see the header above).
fare_amount = taxi[:, 9]
trip_length = taxi[:, 8]

product = fare_amount * trip_length
print(product, "length of the product", len(product))

[105924.   68400.   53363.  ... 146744.   37363.5  82128. ] length of the product 89560


## Calculate the miles per hour for each trip

In [125]:
# trip_length (column 8) is in seconds; divide by 3600 to get hours
length_by_hour = taxi[:,8] / 3600
miles = taxi[:,7]

# speed in miles per hour
speed = miles / length_by_hour
print(speed[:5])

[37.11340206 38.58157895 31.27222982 25.88429752 26.3715415 ]


In [126]:
# max speed out of all trips
print("the max speed for a trip is: ",speed.max())

the max speed for a trip is:  82800.0
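
Extreme maxima like this can appear when the recorded trip duration is very short or zero. The following is a minimal sketch, on made-up values, of one way to guard the division and flag implausible speeds. The 100 mph cut-off is an arbitrary assumption, not a value taken from the dataset.

```python
import numpy as np

# Made-up distances (miles) and durations (seconds); the last two are suspicious.
miles = np.array([21.0, 16.29, 23.0, 5.0])
seconds = np.array([2037.0, 1520.0, 1.0, 0.0])

hours = seconds / 3600

# Divide only where the duration is positive; other entries stay NaN.
speed = np.full_like(miles, np.nan)
np.divide(miles, hours, out=speed, where=hours > 0)

MAX_PLAUSIBLE_MPH = 100  # arbitrary threshold, chosen for illustration only

# NaN compares as False, so zero-duration trips are excluded as well.
with np.errstate(invalid="ignore"):
    plausible = speed < MAX_PLAUSIBLE_MPH

print(speed[plausible])
```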


- A speed of 82,800 mph is impossible: it is far faster than any aircraft, so this value points to bad data.

- total_amount (column 13) should equal the sum of columns 9 to 12: fare_amount, fees_amount, tolls_amount and tip_amount.

In [127]:
total_amount = taxi[:, 13]
sum_amounts = taxi[:, 9:13].sum(axis = 1)

print(total_amount[:5])
print(sum_amounts[:5])

[69.99 54.3  37.8  32.76 18.8 ]
[69.99 54.3  37.8  32.76 18.8 ]
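
Printing the first five values only spot-checks the claim. A sketch of how the comparison could be made for every row uses np.isclose, which tolerates floating-point rounding. The array below is a small stand-in with one deliberately inconsistent row, not the real data.

```python
import numpy as np

# Columns: fare, fees, tolls, tip, total.
# The last row is deliberately inconsistent (made up for illustration).
amounts = np.array([[52.0, 0.8, 5.54, 11.65, 69.99],
                    [45.0, 1.3, 0.0, 8.0, 54.3],
                    [36.5, 1.3, 0.0, 0.0, 40.0]])

component_sum = amounts[:, :4].sum(axis=1)

# True where the components add up to the total, within rounding tolerance.
matches = np.isclose(component_sum, amounts[:, 4])

print(matches)
print(np.flatnonzero(~matches))  # row indices that do not add up
```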


In [128]:
np.arange(5)

array([0, 1, 2, 3, 4])

### Boolean indexing with boolean arrays

- Load the file directly with NumPy

In [129]:
taxi = np.genfromtxt("nyc_taxis.csv", delimiter=",", skip_header=1)
print(taxi[0])

[2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
 2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
 1.000e+00]
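
To see what np.genfromtxt does without needing the CSV file, here is a small sketch that reads from an in-memory string instead. The three-column text is made up. It shows that skip_header=1 drops the header line and that an empty field is read as nan.

```python
import io
import numpy as np

# Made-up CSV text with a header line and one empty field in the second data row.
csv_text = "pickup_month,trip_distance,fare_amount\n1,21.0,52.0\n2,,45.0\n"

parsed = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(parsed.shape)
print(np.isnan(parsed).sum())  # number of missing values
```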


- Getting started with boolean indexing

In [130]:
arr = np.array([1,2,3,5,20,35,15])
print(arr > 10)
print(arr[arr > 10])

[False False False False  True  True  True]
[20 35 15]
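
Boolean arrays can also be combined before indexing. This sketch reuses the same small array and shows the element-wise operators & (and), | (or) and ~ (not). Each comparison needs its own parentheses because these operators bind more tightly than > and <.

```python
import numpy as np

arr = np.array([1, 2, 3, 5, 20, 35, 15])

between = (arr > 10) & (arr < 30)   # both conditions must hold
outside = (arr < 3) | (arr > 30)    # either condition may hold
not_big = ~(arr > 10)               # inverts the mask

print(arr[between])
print(arr[outside])
print(arr[not_big])
```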


In [131]:
# select the January trips (pickup_month == 1)
pickup_month = taxi[:,1]
cond = pickup_month == 1
jan = pickup_month[cond]
print(jan)

[1. 1. 1. ... 1. 1. 1.]


In [132]:
february = pickup_month[taxi[:,1] == 2]
print(february)

[2. 2. 2. ... 2. 2. 2.]
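
Indexing the month column with its own mask only returns the month values. A mask can also select whole rows, which is usually more useful. The sketch below uses a made-up three-column array (month, distance, fare) to count the February trips and average their fares.

```python
import numpy as np

# Made-up rows; columns: pickup_month, trip_distance, fare_amount.
trips = np.array([[1.0, 21.0, 52.0],
                  [1.0, 16.29, 45.0],
                  [2.0, 12.7, 36.5],
                  [2.0, 8.7, 26.0],
                  [2.0, 3.0, 10.0]])

feb_mask = trips[:, 0] == 2

feb_rows = trips[feb_mask]              # every column of the matching rows
feb_count = feb_mask.sum()              # True counts as 1
feb_mean_fare = feb_rows[:, 2].mean()   # average fare in February

print(feb_rows.shape, feb_count, feb_mean_fare)
```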


- Find the largest tips (above 50 dollars) in the tip_amount column

In [133]:
tips = taxi[:,12]
bool_1 = taxi[:,12] > 50
top = tips[bool_1]
print(top)

[ 52.8   60.    59.34  80.    70.    60.    55.    65.    80.    62.
 100.    58.    62.    75.7   60.    70.  ]


In [134]:
# fares above 100 dollars
print("top fares are \n ", taxi[:,9][taxi[:,9]>100])

top fares are 
  [109.5 122.  117.  112.  114.5 102.  105.5 110.5 110.5 101.5 128.  134.5
 102.5 113.  102.5 116.  110.5 119.5 106.5 110.  105.5 113.  115.5 106.5
 120.  121.  100.5 130.  123.  110.5 106.5 400.  116.5 125.  157.5 103.
 120.5 104.  101.  101.  101.  115.5 129.5 112.  106.  114.5 107.  119.
 126.  115.  123.  111.  110.  102.5 110.  126.  112.5 101.5 220.  108.
 150.  101.  134.  101.  113.  180.5 104.  129.  107.5 120.  112.  117.5
 114.5 116.  117.5 116.5 113.5]


In [135]:
# total amounts above 200 dollars
total_amount = taxi[:,13]
cond = total_amount > 200
print(total_amount[cond])

[400.3  453.34 220.3  834.84 286.84]


### Assigning values using boolean indexing

In [136]:
# change the maximum total amount (834.84) to 850 dollars
taxi[:,13][taxi[:,13] == 834.84] = 850
print(taxi[:,13].max())

850.0
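
Matching a float with == works here because 834.84 was read from the file unchanged, but exact comparison can fail for values produced by arithmetic. A hedged sketch of the safer pattern uses np.isclose to build the mask. The array is made up for illustration.

```python
import numpy as np

# Made-up totals; the last one is the result of floating-point arithmetic.
totals = np.array([69.99, 834.84, 0.1 + 0.2])

exact = totals == 0.3            # exact comparison misses 0.1 + 0.2
close = np.isclose(totals, 0.3)  # tolerant comparison finds it

# Assign through a tolerant mask instead of an exact one.
totals[np.isclose(totals, 834.84)] = 850

print(exact, close, totals.max())
```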


In [137]:
x = np.zeros((5,5))
x[0,1] = 10
x[:2, 3] = 5
x[:,2] = [1,5,20,15,2]
x[4,0] = 50


In [138]:
x[x==50] = 48
x

array([[ 0., 10.,  1.,  5.,  0.],
       [ 0.,  0.,  5.,  5.,  0.],
       [ 0.,  0., 20.,  0.,  0.],
       [ 0.,  0., 15.,  0.,  0.],
       [48.,  0.,  2.,  0.,  0.]])

In [139]:
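# work on a copy so the original array is left unchanged
taxi_copy = taxi.copy()
total_amount = taxi_copy[:,13]
# replace negative totals with 0, then list the zero entries
total_amount[total_amount <0] = 0
total_amount[total_amount == 0]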

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])
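
The call to copy() matters because a column slice such as taxi_copy[:,13] is a view: assigning through it changes the array it came from. Selecting with a boolean mask, by contrast, returns a new array. A minimal sketch on a made-up array:

```python
import numpy as np

grid = np.array([[1.0, -2.0],
                 [3.0, -4.0]])

col_view = grid[:, 1]          # basic slicing returns a view
col_view[col_view < 0] = 0     # this writes through to grid

picked = grid[grid[:, 0] > 2]  # boolean indexing returns a copy
picked[:] = 99                 # grid is not affected by this

print(grid)
print(np.shares_memory(grid, col_view), np.shares_memory(grid, picked))
```

This is why total_amount[total_amount <0] = 0 above updates taxi_copy itself, and why it would have altered taxi if no copy had been made.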

In [140]:
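# set every total above 200 to 500 in the copy
bool_1 = taxi_copy[:,13] > 200
taxi_copy[bool_1 , 13] = 500
# check the rows whose total in the original array exceeds 400
taxi_copy[:,13][taxi[:, 13]>400]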

array([500., 500., 500.])

- Create new columns using boolean indexing

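
One possible way to create a new column with boolean indexing is sketched below. The array is made up, and the flag simply marks rows whose pickup_location_code equals 2. What code 2 stands for is not assumed here.

```python
import numpy as np

# Made-up rows; columns: pickup_location_code, fare_amount.
trips = np.array([[2.0, 52.0],
                  [3.0, 45.0],
                  [2.0, 36.5]])

# Start with a column of zeros, then set it to 1 where the mask is True.
flag = np.zeros(trips.shape[0])
flag[trips[:, 0] == 2] = 1

# Append the new column to the right of the existing ones.
trips_flagged = np.column_stack([trips, flag])

print(trips_flagged.shape)
print(trips_flagged[:, 2])
```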
