## Rajesh's DS & AI Learning

# 1. Reading CSV files with NumPy

* We learned that NumPy makes it quick and easy to select data, and includes a number of functions and methods that make it easy to calculate statistics across the different axes (or dimensions).

* However, what if we also wanted to find out how many trips were taken in each month? Or which airport is the busiest? For this, we will learn a new technique: Boolean Indexing.

* Use the `numpy.genfromtxt()` function to read a CSV file into a NumPy ndarray.

* `np.genfromtxt(filename, delimiter=None)`
* filename: A positional argument, usually a string representing the path to the text file to be read.
* delimiter: A named argument, specifying the string used to separate each value.

In [1]:
import os

# Define the name of your CSV file
csv_filename = "nyc_taxis.csv"

# Get the current working directory (os.getcwd() returns where Python was launched, not the script's location)
current_directory = os.getcwd()

# Move back to the grandparent directory (two levels up)
project_directory = os.path.dirname(os.path.dirname(current_directory))

# Navigate to the "DataSets" folder
datasets_directory = os.path.join(project_directory, "DataSets")

# Construct the full path to your CSV file
csv_path = os.path.join(datasets_directory, csv_filename)

# Check if the file exists
if os.path.exists(csv_path):
    print("CSV file found at:", csv_path)
else:
    print("CSV file not found at:", csv_path)
    
# Read the CSV file into a 2D ndarray
import numpy as np
taxi = np.genfromtxt(csv_path, delimiter=',')

In [None]:
taxi_shape=taxi.shape
print(taxi_shape)

# 2. Reading CSV files with NumPy Continued

* By default, `numpy.genfromtxt()` converts every value it reads to `float64`, because its `dtype` parameter defaults to `float`. It only tries to infer a type from the values when `dtype=None` is passed.

In [None]:
taxi.dtype

In [None]:
print(taxi[:4])

NaN is an acronym for Not a Number - it literally means that the value cannot be stored as a number. It is similar to (and often referred to as a) null value, like Python's None constant.

* NaN is most commonly seen when a value is missing, but in this case, we have NaN values because the first line from our CSV file contains the names of each column. NumPy is unable to convert string values like pickup_year into the float64 data type.

* For now, we need to remove this header row from our ndarray. 
* Alternatively, we can pass an additional parameter, skip_header, to the numpy.genfromtxt() function. The skip_header parameter accepts an integer, the number of rows from the start of the file to skip.
* Note that because this integer should be the number of rows and not the index, skipping the first row would require a value of 1, not 0.
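
The effect of `skip_header` can be checked on a small in-memory file. This is a minimal sketch, assuming a made-up two-column CSV (`csv_text` is hypothetical and only borrows column names from the taxi data); `io.StringIO` stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical three-line CSV held in memory; not the real taxi file.
csv_text = "pickup_year,pickup_month\n2016,1\n2016,6\n"

# Without skip_header, the header row cannot be parsed as floats and becomes NaN.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# skip_header=1 drops the first line before parsing.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header.shape)               # (3, 2)
print(without_header.shape)            # (2, 2)
print(np.isnan(with_header[0]).all())  # True
```

The row count drops by one once the header is skipped, which matches the change in `taxi.shape` between the two reads of the real file.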

In [5]:
# Reuse csv_path from the first cell so the file is read from the same location
taxi = np.genfromtxt(csv_path, delimiter=',', skip_header=1)

In [None]:
taxi.shape

# 3. Boolean Arrays

* A boolean array, as the name suggests, is an array of boolean values. Boolean arrays are sometimes called boolean vectors or boolean masks.

In [None]:
a = np.array([1, 2, 3, 4, 5])
a_bool=a<3
a_bool

In [None]:
print(a[a_bool])

# 4. Boolean Indexing with 1D ndarrays

**The boolean array acts as a filter, so that the values corresponding to True become part of the result and the values corresponding to False are removed.**
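
A minimal sketch of the filtering behaviour, assuming a small made-up array (`months` is hypothetical and stands in for the `pickup_month` column). It also shows that summing the mask is another way to count matches, since `True` counts as 1.

```python
import numpy as np

# Hypothetical month values standing in for the pickup_month column.
months = np.array([1.0, 2.0, 1.0, 6.0, 1.0])

# Element-wise comparison produces one boolean per value.
january_mask = months == 1

# Indexing with the mask keeps only the positions that are True.
january_values = months[january_mask]

print(january_values.shape[0])  # 3
print(january_mask.sum())       # 3
```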

**(1) Use boolean indexing to confirm the number of taxi rides in our data set from the month of January.**

In [None]:
pickup_month=taxi[:,1]
print(pickup_month[:10])

In [16]:
january_bool=pickup_month==1

In [None]:
january_pickups=pickup_month[january_bool]
print(january_pickups[:15])

**Number of rides in January**

In [None]:
january_pickups.shape

**13481 rides in January**

**(2) Use boolean indexing to confirm the number of taxi rides in our data set from the month of December.**

In [None]:
december_bool=pickup_month==float(12)
december_rides=pickup_month[december_bool]
print(december_rides[:4])

In [None]:
december_rides.shape

**(3) Use boolean indexing to confirm the number of taxi rides in our data set from the month of June.**

In [None]:
june_bool=pickup_month==6
june_rides=pickup_month[june_bool]
print(june_rides[:5])

In [None]:
june_rides.shape

**15739 rides in June**

# 5. Boolean Indexing with 2D ndarrays

* When working with 2D ndarrays, you can use boolean indexing in combination with any of the other indexing methods. **The only limitation is that the boolean array must have the same length as the dimension you're indexing.**
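
The length rule can be illustrated on a toy array. This is a sketch under assumed values (`arr`, `row_mask` and `col_mask` are hypothetical names, not part of the taxi data): a mask used on the row axis needs one entry per row, and a mask used on the column axis needs one entry per column.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns

row_mask = np.array([True, False, True])          # length 3: matches the rows
col_mask = np.array([False, True, True, False])   # length 4: matches the columns

rows = arr[row_mask]         # rows 0 and 2, all columns
cols = arr[:, col_mask]      # all rows, columns 1 and 2
mixed = arr[row_mask, 1:3]   # rows 0 and 2, combined with a column slice

try:
    arr[col_mask]            # a length-4 mask applied to the 3-row axis
except IndexError as err:
    print("IndexError:", err)
```

The last indexing attempt fails because the mask length (4) does not match the size of the row axis (3).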

In [None]:
# calculate the average speed
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
print(trip_mph.max())

In [None]:
# create a boolean array for trips with average
# speeds greater than 20,000 mph
trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for
# those trips, and the pickup_location_code,
# dropoff_location_code, trip_distance, and
# trip_length columns
trips_over_20000_mph = taxi[trip_mph_bool,5:9]

print(trips_over_20000_mph)

* The last column shows that these rides are implausibly short: all have trip_length values of 4 seconds or less, which cannot be reconciled with the trip distances, all of which are more than 16 miles.
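
The speed calculation divides miles by hours, so the trip length in seconds is first divided by 3600. A small sketch with made-up numbers (the arrays below are hypothetical, not rows from the data set) shows how a few seconds of recorded duration produces an absurd speed:

```python
import numpy as np

# Hypothetical values: distance in miles, duration in seconds.
trip_distance = np.array([10.0, 16.2, 3.0])
trip_length = np.array([1800.0, 3.0, 600.0])

# seconds / 3600 = hours, so miles / hours = miles per hour.
trip_mph = trip_distance / (trip_length / 3600)

# 10 miles in half an hour is 20 mph; 16.2 miles in 3 seconds is about 19,440 mph.
suspicious = trip_mph > 100
print(trip_mph)
print(trip_length[suspicious])
```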

**(2) Examine the rows that have the highest values for the tip_amount column.**

In [30]:
tip_amount = taxi[:,12]
tip_bool=tip_amount>50
top_tips=taxi[tip_bool,5:14]

In [None]:
top_tips[:3]

# 6. Assigning Values in ndarrays

`ndarray[location_of_values] = new_value`

### TO DO:
* (1) The value at column index 5 (pickup_location) of row index 28214 is incorrect. Use assignment to change this value to 1 in the taxi_modified ndarray.
* (2) The first column (index 0) contains year values as four-digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the taxi_modified ndarray.
* (3) The values at column index 7 (trip_distance) of rows index 1800 and 1801 are incorrect. Use assignment to change these values in the taxi_modified ndarray to the mean value for that column.

In [33]:
taxi_modified=taxi.copy()

In [34]:
# updating specific item

taxi_modified[28214,5]=1

In [35]:
# updating whole column

taxi_modified[:,0]=16

In [36]:
# updating slice of data

taxi_modified[1800:1802,7]=taxi_modified[:,7].mean()

# 7. Assignment Using Boolean Arrays

* Boolean arrays become very powerful when we use them for assignment.

**The boolean array controls the values that the assignment applies to, and the other values remain unchanged.**
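
A minimal sketch of boolean assignment on a toy array (`data` is hypothetical; its second column plays the role of an amount column). The point to notice is the column index: with it, only that column changes in the matching rows.

```python
import numpy as np

# Hypothetical 2D data: column 1 plays the role of an amount column.
data = np.array([[1.0, -5.0],
                 [2.0, 12.5],
                 [3.0, -0.5]])

negative = data[:, 1] < 0

# Supplying the column index limits the assignment to that column.
data[negative, 1] = 0
print(data)

# By contrast, data[negative] = 0 (no column index) would overwrite
# every column in the matching rows, not just column 1.
```

After the assignment, rows 0 and 2 keep their first-column values (1.0 and 3.0), and the row that did not match the mask is untouched.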

### TO DO:
* select the fourteenth column (index 13) in taxi_copy. Assign it to a variable named total_amount.
* For rows where the value of total_amount is less than 0, use assignment to change the value to 0

In [39]:
taxi_copy=taxi.copy()

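total_amount = taxi_copy[:, 13]
# Restrict the assignment to column 13; without the column index,
# every column in the matching rows would be set to 0.
taxi_copy[total_amount < 0, 13] = 0

# short way
taxi_copy[taxi_copy[:, 13] < 0, 13] = 0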

# 8. Assignment Using Boolean Arrays Continued

`bool = array[:, column_for_comparison] == value_for_comparison`

`array[bool, column_for_assignment] = new_value`

#### In one line
`array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value`

## TO DO:
We have created a new copy of our taxi dataset, taxi_modified, with an additional column containing the value 0 for every row.

* In our new column at index 15, assign the value 1 if the pickup_location_code (column index 5) corresponds to an airport location, leaving the value as 0 otherwise by performing these three operations:
* For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
* For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
* For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.

In [44]:
# Create a new column filled with 0 and append it to the taxi data.
# np.concatenate returns a new array, so taxi itself is left unchanged
# and no separate copy is needed.
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)

In [None]:
print(taxi_modified)

In [46]:
taxi_modified[taxi_modified[:,5]==2,15]=1
taxi_modified[taxi_modified[:,5]==3,15]=1
taxi_modified[taxi_modified[:,5]==5,15]=1
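
As an alternative to the three separate assignments above, `np.isin` can test all the airport codes at once. This is only a sketch on a toy array (`toy` is hypothetical: column 0 holds a location code and column 1 the flag), not a replacement for the exercise's intended solution.

```python
import numpy as np

# Hypothetical stand-in for taxi_modified: column 0 holds a location code,
# column 1 is the airport flag, initially 0 for every row.
codes = np.array([2.0, 4.0, 3.0, 0.0, 5.0])
toy = np.column_stack([codes, np.zeros(codes.shape[0])])

airport_codes = [2, 3, 5]   # JFK, LaGuardia, Newark

# np.isin returns True wherever the code matches any listed value.
toy[np.isin(toy[:, 0], airport_codes), 1] = 1
print(toy[:, 1])   # [1. 0. 1. 0. 1.]
```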

# 9. Challenge: Which is the most popular airport?

### To complete this task, we'll need to check if the dropoff_location_code column (column index 6) is equal to one of the following values:

* 2: JFK Airport
* 3: LaGuardia Airport
* 5: Newark Airport

In [47]:
jfk=taxi[taxi[:,6]==2]
LaG=taxi[taxi[:,6]==3]
Newark=taxi[taxi[:,6]==5]

In [None]:
jfk.shape[0]

In [None]:
LaG.shape[0]

In [None]:
Newark.shape[0]

### Observation:
LaGuardia Airport is the most popular drop-off airport.
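
The comparison can also be done in one pass by counting matches per code. A sketch on made-up drop-off codes (`dropoff` is hypothetical and stands in for `taxi[:, 6]`, so the counts below say nothing about the real data):

```python
import numpy as np

# Hypothetical drop-off codes standing in for taxi[:, 6].
dropoff = np.array([2.0, 3.0, 3.0, 5.0, 0.0, 3.0, 2.0])
airports = {2: "JFK", 3: "LaGuardia", 5: "Newark"}

# Summing a boolean mask counts the rows that match each code.
counts = {name: int((dropoff == code).sum()) for code, name in airports.items()}
busiest = max(counts, key=counts.get)

print(counts)   # {'JFK': 2, 'LaGuardia': 3, 'Newark': 1}
print(busiest)  # LaGuardia
```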

# 10. Challenge: Calculating Statistics for Trips on Clean Data

#### The columns we're interested in are:

* trip_distance, at column index 7
* trip_length, at column index 8
* total_amount, at column index 13

In [53]:
# Calculate the average trip speed in miles per hour
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

### TO DO:
* Create a new ndarray, cleaned_taxi, containing only rows for which the values of trip_mph are less than 100.
* Calculate the mean of the trip_distance column of cleaned_taxi. Assign the result to mean_distance.
* Calculate the mean of the trip_length column of cleaned_taxi. Assign the result to mean_length.
* Calculate the mean of the total_amount column of cleaned_taxi. Assign the result to mean_total_amount.

In [None]:
cleaned_taxi=taxi[trip_mph<100]

mean_distance=cleaned_taxi[:,7].mean()
mean_distance

In [None]:
mean_length=cleaned_taxi[:,8].mean()
mean_length

In [None]:
mean_total_amount=cleaned_taxi[:,13].mean()
mean_total_amount
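
The three means can also be computed in a single call by averaging along axis 0. A minimal sketch on made-up rows (`toy` is hypothetical and has only the three columns of interest, in the order trip_distance, trip_length, total_amount):

```python
import numpy as np

# Hypothetical rows: trip_distance (miles), trip_length (seconds), total_amount.
toy = np.array([[2.0,   600.0, 10.0],
                [4.0,  1200.0, 20.0],
                [16.0,    3.0, 60.0]])   # implausibly fast trip

mph = toy[:, 0] / (toy[:, 1] / 3600)

# Keep only rows with a plausible speed, as in the cleaned_taxi step.
cleaned = toy[mph < 100]

# axis=0 averages down the rows, giving one mean per column.
means = cleaned.mean(axis=0)
print(means)   # means of 3.0, 900.0 and 15.0
```

The third row is dropped by the speed filter, so it does not distort any of the column means.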