# NumPy Introduction

This notebook corresponds to mission 13 of [Dataquest](https://www.dataquest.io).

**Summary:**
* ndarrays: creation and methods
* Vector operations
* The axis parameter
* Note: timeit


---

In [1]:
import csv
import numpy as np

In [6]:
# Importing the CSV file (the with block closes the file automatically)
with open('taxi_nyc_copied.csv') as csv_file:
    taxi_dataset_unformated = list(csv.reader(csv_file))
print("Header elements:\n", taxi_dataset_unformated[0])
taxi_dataset_unformated = taxi_dataset_unformated[1:]  # drop the header row

# Unlike Python lists, every value in an ndarray must be of the same type.
# ndarray = n-dimensional array
# Converting every element of the data to float
taxi_dataset_float = []
for row in taxi_dataset_unformated:
    converted_row = []
    for element in row:
        converted_row.append(float(element))
    taxi_dataset_float.append(converted_row)

# Passing the float data to a NumPy array
taxi = np.array(taxi_dataset_float)

['pickup_year', 'pickup_month', 'pickup_day', 'pickup_dayofweek', 'pickup_time', 'pickup_location_code', 'dropoff_location_code', 'trip_distance', 'trip_length', 'fare_amount', 'fees_amount', 'tolls_amount', 'tip_amount', 'total_amount', 'payment_type']
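To illustrate the "same type" rule mentioned in the cell above, here is a minimal sketch with made-up values (not taken from the taxi file). It shows how NumPy picks one common dtype when the inputs are mixed, which is why the notebook converts everything to float first.

```python
import numpy as np

# Mixed ints and floats are upcast to a single float dtype.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)    # float64

# Mixing numbers and text forces every element to become a string.
mixed_text = np.array([1, "two", 3.0])
print(mixed_text.dtype.kind)  # U (unicode string)
```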


**Shape attribute**: See how many rows and columns there are in our array

In [21]:
# shape holds one number per dimension; two numbers mean our ndarray is two-dimensional.
print(taxi.shape)
# The value returned is a tuple. Tuples are very similar to Python lists but are
# immutable (can't be modified). They are defined and displayed using parentheses ()
# rather than brackets [].
print(type(taxi.shape))

(89560, 15)
<class 'tuple'>

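As a small illustration of how shape relates to the number of dimensions, the sketch below uses two hypothetical arrays. The `ndim` attribute is not used elsewhere in this notebook; it is simply the length of the shape tuple.

```python
import numpy as np

one_d = np.array([1.0, 2.0, 3.0])
two_d = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(one_d.shape, one_d.ndim)  # (3,) 1
print(two_d.shape, two_d.ndim)  # (2, 3) 2

# Tuple unpacking gives the row and column counts directly.
n_rows, n_cols = two_d.shape
print(n_rows, n_cols)           # 2 3
```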

----

**ndarray advantages when selecting elements**

In [35]:
print(taxi[15][8]," ",type(taxi[15][8])) #List-style indexing also works
print(taxi[15,8]," ",type(taxi[15,8])) #ndarray shorthand: row 15, column 8
print()
print((taxi[:,2])) #The element at index 2 of every row (a whole column)
print()
print(taxi[[0,1,3,4],0]) #Only the element at index 0 of the rows listed in the brackets

1265.0   <class 'numpy.float64'>
1265.0   <class 'numpy.float64'>

[ 1.  1.  1. ... 30. 30. 30.]

[2016. 2016. 2016. 2016.]


<img src="selection_columns.svg" alt="Drawing" style="width:500px"/>

<img src="selection_1darray.svg" alt="Drawing" style="width:500px"/>


<img src="selection_2darray.svg" alt="Drawing" style="width:500px"/>   

Examples:
* every row for the columns at indexes 1, 4, and 7. -> taxi[:,[1,4,7]]

* columns at indexes 5 to 8 inclusive for the row at index 99. -> taxi[99,5:9]

* rows at indexes 100 to 200 inclusive for the column at index 14. -> taxi[100:201,14]
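The same three selection patterns can be tried on a small array where the results are easy to check by eye. This is a sketch with a hypothetical `demo` array standing in for `taxi`; note that a slice's end index is exclusive, which is why "inclusive" ranges add one to the stop value.

```python
import numpy as np

# Hypothetical stand-in for the taxi array: 6 rows, 5 columns, values 0..29.
demo = np.arange(30).reshape(6, 5)

cols_1_and_3 = demo[:, [1, 3]]   # every row, columns at indexes 1 and 3
row_2_slice = demo[2, 1:4]       # row 2, columns 1 to 3 inclusive
col_4_rows = demo[1:4, 4]        # rows 1 to 3 inclusive, column 4

print(cols_1_and_3.shape)  # (6, 2)
print(row_2_slice)         # [11 12 13]
print(col_4_rows)          # [ 9 14 19]
```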

**Vector Math**

In [40]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
trip_mph = trip_distance_miles / trip_length_hours

print(trip_mph)

#Another way to perform this operation is
# >>> trip_mph_2 = np.divide(trip_distance_miles,trip_length_hours)
#They are basically the same, and it's better to choose the first one because it's easier to read

[37.11340206 38.58157895 31.27222982 ... 22.29907867 42.41551247
 36.90473407]
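To make the element-by-element behaviour concrete, the sketch below repeats the mph calculation on three hypothetical trips and compares the vectorized result with an explicit Python loop and with `np.divide`.

```python
import numpy as np

# Hypothetical sample values, not taken from the taxi dataset.
distance_miles = np.array([10.0, 30.0, 45.0])
duration_seconds = np.array([1800.0, 3600.0, 5400.0])

# Vectorized: one expression operates on every element.
mph_vectorized = distance_miles / (duration_seconds / 3600)

# Equivalent explicit loop over the individual values.
mph_loop = [d / (s / 3600) for d, s in zip(distance_miles, duration_seconds)]

# Function form of the same division.
mph_divide = np.divide(distance_miles, duration_seconds / 3600)

print(mph_vectorized)  # [20. 30. 30.]
```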


**Basic Methods:**
* ndarray.min() to calculate the minimum value
* ndarray.max() to calculate the maximum value
* ndarray.mean() to calculate the mean (average) value
* ndarray.sum() to calculate the sum of the values

In [43]:
mph_min = trip_mph.min()
print(mph_min)
mph_max = trip_mph.max()
print(mph_max)
mph_mean = trip_mph.mean()
print(mph_mean)

0.0
82800.0
32.24258580925573
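The cell above does not call `sum()`, so here is a short sketch of all four methods on a hypothetical array whose results can be checked by hand.

```python
import numpy as np

speeds = np.array([12.0, 30.0, 18.0, 40.0])  # hypothetical values

print(speeds.min())   # 12.0
print(speeds.max())   # 40.0
print(speeds.mean())  # 25.0
print(speeds.sum())   # 100.0
```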


---

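**AXIS**: This parameter tells a method which direction to calculate along
* axis = 0: VERTICAL (down the rows, giving one result per column)
* axis = 1: HORIZONTAL (across the columns, giving one result per row)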

<img src="axis_param.svg" alt="Drawing" style="width:500px"/>

In [56]:
twodlist = np.array([[1,2],[3,4]])

print("AXIS 0:")
print("|1  2|\n|⬇  ⬇|\n|3  4|")
print( twodlist.sum(axis=0))
print("\nAXIS 1:")
print("|1 ➡ 2|\n|3 ➡ 4|")
print(twodlist.sum(axis=1))

AXIS 0:
|1  2|
|⬇  ⬇|
|3  4|
[4 6]

AXIS 1:
|1 ➡ 2|
|3 ➡ 4|
[3 7]
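With a non-square array the effect of the axis is easier to see in the result shapes. A minimal sketch, using a hypothetical 2x3 `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])  # shape (2, 3)

col_max = grid.max(axis=0)  # collapses the rows: one value per column
row_max = grid.max(axis=1)  # collapses the columns: one value per row

print(col_max, col_max.shape)  # [4 5 6] (3,)
print(row_max, row_max.shape)  # [3 6] (2,)
```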


**Concatenate function**: add rows and columns to an ndarray <br>
[documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.concatenate.html)

In [67]:
#First, let's learn about the transpose (.T), which swaps rows and columns
x =  np.array([[1, 2],[3,4]])
print(x)
print()
print(x.T)

[[1 2]
 [3 4]]

[[1 3]
 [2 4]]


In [68]:
#Now how to concatenate
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
print(np.concatenate((a, b), axis=0))
print()
print(np.concatenate((a, b.T), axis=1))


[[1 2]
 [3 4]
 [5 6]]

[[1 2 5]
 [3 4 6]]
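The transpose was needed above because `np.concatenate` requires the arrays to agree on every dimension except the one being joined. The sketch below reuses the same `a` and `b` to show what happens without it.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])  # shape (2, 2)
b = np.array([[5, 6]])          # shape (1, 2)

# axis=0 works: both arrays have 2 columns.
stacked = np.concatenate((a, b), axis=0)

# axis=1 fails without the transpose: a has 2 rows but b has only 1.
try:
    np.concatenate((a, b), axis=1)
    failed = False
except ValueError:
    failed = True

print(stacked.shape)  # (3, 2)
print(failed)         # True
```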


---

**Inserting a new axis**: 
np.expand_dims() <br> [documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.expand_dims.html)

In [77]:
#Example
ones = np.array([[ 1 , 1 , 1],[ 1 , 1 , 1]])
zeros = np.array([0,0,0])

#Assume you want to add zeros to ones as a new row.

#The code below gives us an error because the numbers of dimensions don't match:
#>>>combined = np.concatenate([ones,zeros],axis=0)
#ERROR: all the input arrays must have same number of dimensions

print("zeros shape:", zeros.shape)
print("ones shape:", ones.shape)

#Since we're using axis=0, our shapes have to match across all dimensions except the first,
#but zeros doesn't have a second dimension.

#We pass axis=0 to expand_dims because we want to convert our 1D array into a 2D array representing a row:
zeros_2d = np.expand_dims(zeros, axis=0)
print("\nNow we get the second dimension of zeros: ",zeros_2d.shape)

#So we can use concatenate to join them together
combined = np.concatenate([ones,zeros_2d],axis=0)
print("\nCombined: \n", combined)

zeros shape: (3,)
ones shape: (2, 3)

Now we get the second dimension of zeros:  (1, 3)

Combined: 
 [[1 1 1]
 [1 1 1]
 [0 0 0]]
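As an aside, `np.expand_dims` is not the only way to add the missing dimension. The sketch below shows two alternatives that this notebook does not otherwise use, `np.newaxis` indexing and `reshape`; all three should give the same (1, 3) row.

```python
import numpy as np

zeros = np.array([0, 0, 0])

via_expand = np.expand_dims(zeros, axis=0)
via_newaxis = zeros[np.newaxis, :]   # alternative: index with np.newaxis
via_reshape = zeros.reshape(1, -1)   # alternative: -1 lets NumPy infer the length

print(via_expand.shape, via_newaxis.shape, via_reshape.shape)  # (1, 3) (1, 3) (1, 3)
```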


In [90]:
#We can add the mph data to our taxi table.
#We must use axis=1, since we need to add the mph value at the end of every row, as a new column.
#But our trip_mph ndarray has only 1 dimension
print("Mph shape: ", trip_mph.shape)
print("Taxi shape: ", taxi.shape)

trip_mph_2d = np.expand_dims(trip_mph, axis=1)
print("\n2 dimensions mph: ",trip_mph_2d.shape)

#Concatenating them together
taxi = np.concatenate((taxi, trip_mph_2d), axis = 1)

print(taxi.shape)


Mph shape:  (89560,)
Taxi shape:  (89560, 15)

2 dimensions mph:  (89560, 1)
(89560, 16)


---

**Sort the Array**: np.argsort() returns the indices that would sort an array
[documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.argsort.html#numpy.argsort)

In [119]:
#EXAMPLE 1D array (argsort returns the sorted order as indices, not the sorted values)
names = np.array(['Breno', 'Alicia', 'Clara'])
sorted_names = np.argsort(names)
print(sorted_names)

print()
for ind in sorted_names:
    print(names[ind])

[1 0 2]

Alicia
Breno
Clara


In [117]:
#EXAMPLE in 2D arrays
int_square = np.array([[5 ,2, 8, 3, 4]
,[2, 8, 6, 2, 5]
,[1, 6, 2, 7, 7]
,[0, 7, 7, 4, 5]
,[5, 7, 1, 1, 2]])

#Separating the last column
last_column = int_square[:,4]
print(last_column)

#Getting the indices that would sort the last column
sorted_order = np.argsort(last_column)
print(sorted_order)

#Sorting the last column values
last_column_sorted = last_column[sorted_order]
print(last_column_sorted)

#Sorting the rows of the entire array by the sorted order of the last column
print()
int_square_sorted = int_square[sorted_order]
print(int_square_sorted)

[4 5 7 5 2]
[4 0 1 3 2]
[2 4 5 5 7]

[[5 7 1 1 2]
 [5 2 8 3 4]
 [2 8 6 2 5]
 [0 7 7 4 5]
 [1 6 2 7 7]]
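`np.argsort` sorts in ascending order. One common way to get a descending order, sketched here on a small hypothetical array, is to reverse the index array with a `[::-1]` slice before using it to select rows.

```python
import numpy as np

scores = np.array([[5, 4],
                   [2, 7],
                   [1, 2]])  # hypothetical data

ascending = np.argsort(scores[:, 1])  # indices that sort column 1
descending = ascending[::-1]          # same indices, reversed

print(ascending)           # [2 0 1]
print(scores[descending])  # rows ordered so column 1 reads 7, 4, 2
```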


In [120]:
#Let's use this technique to sort our taxi array, which now includes the mph column, by speed
taxi_sort_index = np.argsort(taxi[:,15]) #The column at index 15 is the recently added mph column

#We pass the sorted order to select the rows in that order
taxi_sort_mph_column = taxi[taxi_sort_index]
print(taxi_sort_mph_column)

[[2.016e+03 6.000e+00 2.800e+01 ... 7.000e+01 1.000e+00 0.000e+00]
 [2.016e+03 3.000e+00 3.000e+00 ... 6.230e+01 1.000e+00 0.000e+00]
 [2.016e+03 4.000e+00 6.000e+00 ... 3.300e+00 4.000e+00 0.000e+00]
 ...
 [2.016e+03 3.000e+00 2.800e+01 ... 4.300e+00 2.000e+00 3.204e+04]
 [2.016e+03 2.000e+00 1.300e+01 ... 3.300e+00 2.000e+00 7.056e+04]
 [2.016e+03 1.000e+00 2.200e+01 ... 3.300e+00 2.000e+00 8.280e+04]]


# Notes:

**TIME IT**: [documentation](https://docs.python.org/3/library/timeit.html)

In [41]:
import timeit

timeit.timeit('"-".join(str(n) for n in range(100))', number=10000)

0.2849951999996847
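`timeit` can also be used to compare a plain Python loop with the vectorized NumPy approach used throughout this notebook. The sketch below uses hypothetical helper functions and passes them to `timeit.timeit` as callables; the timings depend on the machine, so no expected values are shown.

```python
import timeit
import numpy as np

values = list(range(10_000))
array = np.arange(10_000)

def double_with_loop():
    # Plain Python: one multiplication per list element.
    return [v * 2 for v in values]

def double_with_numpy():
    # Vectorized: a single expression over the whole array.
    return array * 2

loop_seconds = timeit.timeit(double_with_loop, number=100)
numpy_seconds = timeit.timeit(double_with_numpy, number=100)

print(f"loop:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```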

In [84]:
#Quick check of how the axis argument of expand_dims changes the resulting shape
teste = np.array([1,1,1])
teste2 = np.array([1,1,1])
print(teste.shape)

print("\nAxis=0: ", np.expand_dims(teste,0).shape)
print("\nAxis=1: ", np.expand_dims(teste,1).shape)


(3,)

Axis=0:  (1, 3)

Axis=1:  (3, 1)
