# Numpy continuation

This notebook corresponds to mission 14 of [dataquest](https://www.dataquest.io)

**Summary:**
* Importing files with Numpy
* Working with Boolean vectors and good practices for them
* Some shortcuts for replacing array elements

----

In [1]:
import numpy as np

### Importing Files with Numpy
**numpy.genfromtxt()**
[documentation](https://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt)<br>
There is a parameter called 'skip_header', which has a default value of 0. Passing 1 makes genfromtxt() skip the first row of the file. The example below leaves it at the default, so we can see what happens to string values when they are passed to genfromtxt().

In [2]:
#The most used parameter is 'delimiter', which determines how the elements of each line are separated
taxi = np.genfromtxt('../13 Numpy - Introduction/taxi_nyc_copied.csv', delimiter=',')
print(taxi[0])
#Dropping the first row (the header), which was read as NaN
taxi = taxi[1:]

[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]


NaN = Not a Number, which is expected since the header elements are probably strings that describe the columns and cannot be converted to floats.<br>
It's similar to [None](https://docs.python.org/3.4/library/constants.html#None) in that it marks a missing value, but NaN is itself a float.
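
As a small illustration of the two points above, the sketch below reads a made-up three-line CSV from memory instead of the taxi file (the column names and values are hypothetical). It compares the default behaviour with `skip_header=1` and shows one way NaN differs from ordinary values.

```python
import io
import numpy as np

# Hypothetical CSV text standing in for the real taxi file.
csv_text = "pickup_year,pickup_month,fare_amount\n2016,1,12.5\n2016,2,30.0\n"

# Default skip_header=0: the header strings cannot be parsed as floats, so they become NaN.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# skip_header=1: the first line is ignored, so only the numeric rows are read.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(np.isnan(with_header[0]).all())  # True
print(without_header.shape)            # (2, 3)

# NaN never compares equal to itself, so use np.isnan() to test for it.
print(np.nan == np.nan)                # False
```

Slicing with `taxi[1:]`, as in the cell above, and passing `skip_header=1` should both leave only the numeric rows.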

In [3]:
print(taxi.dtype)

float64


### Boolean arrays

**Introduction concepts:**

In [4]:
#Boolean type in Python
a=True
print(type(a))

#Vector math
print(np.array([1,2,3,4])+10)

#Boolean Vector
print(np.array([1,5,8])<6)

<class 'bool'>
[11 12 13 14]
[ True  True False]


Some examples:

In [5]:
a = np.array([1, 2, 3, 4, 5])
a_bool = a<3
print(a_bool)

b = np.array(["blue", "blue", "red", "blue"])
b_bool= b == "blue"
print(b_bool)

c = np.array([80.0, 103.4, 96.9, 200.3])
c_bool = c>100
print(c_bool)

[ True  True False False False]
[ True  True False  True]
[False  True False  True]


**In practice:**

The second column in the taxi dataset is pickup_month.<br>
Let's use a boolean array to calculate how many rides were given in January.

In [6]:
#Separating the pickup months from the rest of the data
pickup_month = taxi[:,1]

#Creating a boolean array that is True where the month is January
january_bool = pickup_month == 1

#Using the boolean array to select only the items from pickup_month that have a value of 1
january_rides = pickup_month[january_bool]

#The .shape attribute tells us how many items are in january_rides, which is the number of taxi rides in our data set from the month of January
print(january_rides.shape)

#Returning it as a number
print()
print(type(january_rides.shape)) #.shape on its own returns a tuple
print(type(january_rides.shape[0])) #Indexing the tuple with [0] gives the count as an int

january = january_rides.shape[0]
print(january)

(13481,)

<class 'tuple'>
<class 'int'>
13481
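
A possible alternative to indexing and then reading `.shape[0]` is to count the `True` values in the boolean array directly. The sketch below uses a short made-up `pickup_month` array rather than the real column, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical sample of pickup months (the real column comes from taxi[:,1]).
pickup_month = np.array([1.0, 2.0, 1.0, 3.0, 1.0])
january_bool = pickup_month == 1

# Same approach as the cell above: select, then read the length from .shape.
count_via_shape = pickup_month[january_bool].shape[0]

# True counts as 1 and False as 0, so summing the boolean array counts the matches.
count_via_sum = int(january_bool.sum())

# np.count_nonzero() counts the True entries as well.
count_via_nonzero = int(np.count_nonzero(january_bool))

print(count_via_shape, count_via_sum, count_via_nonzero)  # 3 3 3
```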


**Understanding boolean vectors in different dimensions:**

<img src="bool_dims.svg" style="width:500px" />

Example:<br>
Let's suppose we want columns 5 to 13 for the rows where the value at column index 12 (tip_amount, going by the column list below) is greater than 50.<br>
5:14 -> pickup_location_code, dropoff_location_code, trip_distance, trip_length, fare_amount, fees_amount, tolls_amount, tip_amount, and total_amount.

In [7]:
trip_amount = taxi[:,12]
trip_bool = trip_amount>50
top_trips = taxi[trip_bool, 5:14]
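
To make the shapes involved easier to see, here is the same kind of selection on a small made-up array. The boolean array has one entry per row, so it can be used on the row axis while a slice picks the columns.

```python
import numpy as np

# Hypothetical 4x3 array standing in for the taxi data.
data = np.array([[1, 10, 100],
                 [2, 60, 200],
                 [3, 70, 300],
                 [4, 20, 400]])

# One boolean per row: True where the middle column is greater than 50.
row_mask = data[:, 1] > 50
print(row_mask.shape)  # (4,)

# Boolean array on the rows, slice on the columns.
selected = data[row_mask, 1:3]
print(selected)
# [[ 60 200]
#  [ 70 300]]
```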


**Using index to modify more than one element at a time:**

In [8]:
#Let's create a copy of taxi, so we don't modify our original data
modified_taxi = taxi.copy()

#The first column (index 0) contains year values as four-digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the modified_taxi ndarray.
modified_taxi[:,0]=16
print(modified_taxi[190][0], modified_taxi[18][0])

#Now the most interesting one:
#Let's suppose that in rows 1800 and 1801 the trip_distance (index=7) is wrong,
#but the rest of each row holds important data.
#A simple way to deal with it is to replace those two values with the mean of the entire column,
#so the rows can be kept
modified_taxi[1800:1802, 7] = modified_taxi[:,7].mean()


16.0 16.0
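
The cell above takes the mean of the whole column, which still includes the two values assumed to be wrong. One possible variation, sketched here on made-up distances, is to compute the mean from the remaining rows only and then assign it. The `bad` mask and the sample values are assumptions for illustration.

```python
import numpy as np

# Hypothetical distances; the values at indexes 2 and 3 are assumed to be wrong.
distances = np.array([2.0, 4.0, 100.0, 200.0, 6.0])

# Boolean mask marking the rows to replace.
bad = np.zeros(distances.shape, dtype=bool)
bad[2:4] = True

# Mean of the rows that are not marked as bad: (2 + 4 + 6) / 3
good_mean = distances[~bad].mean()

# Assigning a single value to several positions at once.
distances[bad] = good_mean
print(distances)  # [2. 4. 4. 4. 6.]
```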


### Shortcut

**1st Example**

In [9]:
#Primary way
a2 = np.array([1, 2, 3, 4, 5])
a2_bool = a2 > 2
a2[a2_bool] = 9
print(a2)

#Shortcut
a = np.array([1, 2, 3, 4, 5])
a[a > 2] = 9
print(a)

[1 2 9 9 9]
[1 2 9 9 9]


<img src="bool_assignment_1.svg" style="width:500px" />

**2nd Example:**

In [10]:
b = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

b[b > 4] = 99
print(b)

[[ 1  2  3]
 [ 4 99 99]
 [99 99 99]]


<img src="bool_assignment_2.svg" style="width:500px" />

**3rd Example:**

In [11]:
c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

c[c[:,1] > 2, 1] = 99
print(c)

[[ 1  2  3]
 [ 4 99  6]
 [ 7 99  9]]


<img src="bool_assignment_3.svg" style="width:500px" />

The Pattern:
* Using an intermediate variable:<br>
bool_mask = array[:, column_for_comparison] == value_for_comparison<br>
array[bool_mask, column_for_assignment] = new_value
<br><br>
* All in one line:<br>
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
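
A minimal runnable instance of the pattern, on a made-up array: the comparison is done on column 0 and the assignment on column 1. The array contents and the value 99 are only for illustration.

```python
import numpy as np

array = np.array([[1, 0, 10],
                  [2, 0, 20],
                  [1, 0, 30]])

# Using an intermediate variable: compare on column 0, assign on column 1.
mask = array[:, 0] == 1
array[mask, 1] = 99
print(array)
# [[ 1 99 10]
#  [ 2  0 20]
#  [ 1 99 30]]

# All in one line, on a fresh copy of the same data.
other = np.array([[1, 0, 10],
                  [2, 0, 20],
                  [1, 0, 30]])
other[other[:, 0] == 1, 1] = 99
print(np.array_equal(array, other))  # True
```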

### Back to work with our dataset

**1st** We will try to find out which airport is the most popular destination in our data set.<br>
We will use "dropoff_location_code", which is the column with index 6.<br>
In this column the airports are represented by numbers, and this study will only work with these airports:
* 2 - JFK Airport.
* 3 - LaGuardia Airport.
* 5 - Newark Airport.

In [12]:
#Let's count JFK first
taxi_jfk_bool = taxi[:,6]==2
taxi_jfk = taxi[taxi_jfk_bool]
jfk_cnt = len(taxi_jfk)
#Condensed version of the code above >>>print(len(taxi[taxi[:,6] == 2]))

#Now LaGuardia
taxi_laGuardia_bool = taxi[:,6]==3
taxi_laGuardia = taxi[taxi_laGuardia_bool]
laGuardia_cnt = len(taxi_laGuardia)

#And now Newark
taxi_newark_bool = taxi[:,6]==5
taxi_newark = taxi[taxi_newark_bool]
newark_cnt = len(taxi_newark)


print("JFK rides: ", jfk_cnt)
print("LaGuardia rides: ", laGuardia_cnt)
print("Newark rides: ", newark_cnt)

JFK rides:  11832
LaGuardia rides:  16602
Newark rides:  63
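
Since the three counts follow the same steps, they could also be written as a loop over the airport codes. The sketch below assumes the code-to-name mapping from the list above and uses a short made-up `dropoff_codes` array in place of `taxi[:,6]`.

```python
import numpy as np

# Hypothetical sample of dropoff_location_code values (the real column is taxi[:,6]).
dropoff_codes = np.array([2.0, 3.0, 3.0, 5.0, 1.0, 2.0, 3.0])

# Airport codes as listed above.
airports = {2: "JFK", 3: "LaGuardia", 5: "Newark"}

# Summing a boolean array counts its True values, one count per airport.
counts = {name: int((dropoff_codes == code).sum()) for code, name in airports.items()}
print(counts)  # {'JFK': 2, 'LaGuardia': 3, 'Newark': 1}

# The most popular destination among the three.
print(max(counts, key=counts.get))  # LaGuardia
```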


**2nd** (Data Cleaning) Use boolean indexing to remove any rows that have an average speed for the trip greater than 100 mph (160 kph), then compute the mean of the columns below for the remaining rows.<br>
* trip_distance, at column index _7_
* trip_length, at column index _8_
* total_amount, at column index _13_

In [13]:
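#trip_length is in seconds, so dividing it by 3600 gives hours and the speed comes out in mph
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
#Keeping only the rows with an average speed below 100 mph
cleaned_taxi = taxi[trip_mph < 100]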

mean_distance = cleaned_taxi[:,7].mean()
mean_length = cleaned_taxi[:,8].mean()
mean_total_amount = cleaned_taxi[:,13].mean()
mean_mph = trip_mph[trip_mph < 100].mean()

print("Mean Distance: ",mean_distance)
print("Mean Length (minutes): ",mean_length/60)
print("Mean total_amount: ", mean_total_amount)
print("Mean speed: ", mean_mph)

Mean Distance:  12.666396599932893
Mean Length (minutes):  37.325060955150434
Mean total_amount:  48.98131853260262
Mean speed:  23.353238774840836
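
If any row had a trip_length of 0, the division above would produce `inf` or NaN for that row. Both fail the `< 100` test, so such rows would be dropped anyway, but a more explicit variant is sketched below on made-up values. The sample arrays and the `valid` mask are assumptions for illustration.

```python
import numpy as np

# Hypothetical sample: distances in miles and trip lengths in seconds.
trip_distance = np.array([10.0, 5.0, 30.0, 2.5])
trip_length = np.array([1800.0, 0.0, 450.0, 900.0])

# Only divide where the trip length is positive; the other rows stay NaN.
valid = trip_length > 0
trip_mph = np.full(trip_distance.shape, np.nan)
trip_mph[valid] = trip_distance[valid] / (trip_length[valid] / 3600)

# Keep rows that have a valid speed below 100 mph.
keep = valid & (trip_mph < 100)
print(trip_mph[keep])  # [20. 10.]
```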
