# Chapter 03 - Importing and exporting data

## Exercise 15 • Weird taxi rides

For this exercise, I want you to create a data frame from the CSV data for January 2019:

1. Load the CSV file into a data frame using only the four columns mentioned earlier: passenger_count, trip_distance, payment_type, and total_amount.

2. How many taxi rides had more than eight passengers?

3. How many taxi rides had zero passengers?

4. How many taxi rides were paid for in cash and cost over $1,000?

5. How many rides cost less than $0?

6. How many rides traveled a below-average distance but cost an above-average amount?

In [1]:
import pandas as pd

In [4]:
df = pd.read_csv('data/nyc_taxi_2019-01.csv', usecols=[
    'passenger_count', 'trip_distance', 'total_amount', 'payment_type'
])
df

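| | passenger_count | trip_distance | payment_type | total_amount |
|---|---|---|---|---|
| 0 | 1 | 1.50 | 1 | 9.95 |
| 1 | 1 | 2.60 | 1 | 16.30 |
| 2 | 3 | 0.00 | 1 | 5.80 |
| 3 | 5 | 0.00 | 2 | 7.55 |
| 4 | 5 | 0.00 | 2 | 55.55 |
| ... | ... | ... | ... | ... |
| 7667787 | 1 | 4.79 | 1 | 23.16 |
| 7667788 | 1 | 0.00 | 1 | 0.00 |
| 7667789 | 1 | 0.00 | 1 | 0.00 |
| 7667790 | 1 | 0.00 | 1 | 0.00 |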


In [16]:
print('Rides with more than 8 passengers: ',
      df.loc[df['passenger_count'] > 8, 'passenger_count'].count())
print('Rides with 0 passengers: ',
      df.loc[df['passenger_count'] == 0, 'passenger_count'].count())
print('Rides paid for in cash that cost over $1,000: ',
      df.loc[(df['payment_type'] == 2) & (df['total_amount'] > 1000), 'passenger_count'].count())
print('Rides that cost less than $0: ',
      df.loc[df['total_amount'] < 0, 'total_amount'].count())
# "Above-average amount" means total_amount > mean, so the second comparison uses >.
print('Rides with a below-average distance but an above-average cost: ',
      df.loc[(df['trip_distance'] < df['trip_distance'].mean()) &
             (df['total_amount'] > df['total_amount'].mean()), 'trip_distance'].count())
# The same filter written with query
print(df.query('(trip_distance < trip_distance.mean()) & (total_amount > total_amount.mean())')['trip_distance'].count())

Rides with more than 8 passengers:  9
Rides with 0 passengers:  117381
Rides paid for in cash that cost over $1,000:  5
Rides that cost less than $0:  7131
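
The counts in this cell can also be taken straight from the boolean mask, because summing a mask counts its `True` values. The sketch below shows this for question 6 on a small, made-up frame (the `toy` data and variable names are hypothetical, not the taxi file):

```python
import pandas as pd

# Hypothetical toy data standing in for the January 2019 taxi file.
toy = pd.DataFrame({
    'trip_distance': [0.5, 1.0, 2.0, 10.0],
    'total_amount': [50.0, 5.0, 8.0, 40.0],
})

# For this toy frame the mean distance is 3.375 and the mean amount is 25.75.
short_trip = toy['trip_distance'] < toy['trip_distance'].mean()
pricey = toy['total_amount'] > toy['total_amount'].mean()

# Summing a boolean mask counts the rows where it is True.
n_short_but_pricey = int((short_trip & pricey).sum())
print(n_short_but_pricey)
```

One difference to keep in mind: `.count()` on a selected column skips missing values in that column, while summing the mask counts every matching row.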


In [17]:
# find what percentage normally pays in cash versus a credit card.
df['payment_type'].value_counts(normalize=True)[[1,2]]

payment_type
1    0.715464
2    0.278752
Name: proportion, dtype: float64
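
The numeric codes are easier to read with labels attached. Here is a minimal sketch on made-up data, assuming code 1 means credit card and code 2 means cash, as the exercise text implies; the `toy_payments` series is hypothetical:

```python
import pandas as pd

# Hypothetical payment codes; 1 = credit card and 2 = cash is an assumption
# taken from the exercise text, and 3 stands in for any other code.
toy_payments = pd.Series([1, 1, 2, 1, 3], name='payment_type')

# Normalized counts for the two codes of interest, relabeled for display.
shares = (
    toy_payments.value_counts(normalize=True)
    .loc[[1, 2]]
    .rename(index={1: 'credit card', 2: 'cash'})
)
print(shares)
```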

## Exercise 16 • Pandemic taxis
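
Load the NYC taxi data for July 2019 and July 2020 into a single data frame, keeping the passenger_count, total_amount, and payment_type columns and tagging each row with its year. With that data in hand, I want you to answer a few questions (each "2019" and "2020" below refers to July of that year):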

* How many rides were taken in 2019 and 2020, and what is the difference between these two figures?

* How much money (in total) was collected in 2019 and 2020, and what was the difference between these two figures?

* Did the proportion of trips with more than one passenger change dramatically?

* Did people use cash (i.e., payment_type of 2) less in 2020 than in 2019?

In [32]:
df_2019_jul = pd.read_csv('data/nyc_taxi_2019-07.csv', usecols=[
    'passenger_count', 'total_amount', 'payment_type'
])
df_2019_jul['year'] = 2019

df_2020_jul = pd.read_csv('data/nyc_taxi_2020-07.csv', usecols=[
    'passenger_count', 'total_amount', 'payment_type'
])
df_2020_jul['year'] = 2020

df = pd.concat([df_2019_jul, df_2020_jul])
df

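| | passenger_count | payment_type | total_amount | year |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 4.94 | 2019 |
| 1 | 1.0 | 2.0 | 20.30 | 2019 |
| 2 | 1.0 | 1.0 | 70.67 | 2019 |
| 3 | 1.0 | 1.0 | 66.36 | 2019 |
| 4 | 0.0 | 1.0 | 15.30 | 2019 |
| ... | ... | ... | ... | ... |
| 800407 | NaN | NaN | 83.50 | 2020 |
| 800408 | NaN | NaN | 19.78 | 2020 |
| 800409 | NaN | NaN | 38.45 | 2020 |
| 800410 | NaN | NaN | 29.77 | 2020 |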


In [33]:
print('How many rides were taken in 2019 and 2020, and what is the difference between these two figures?')
count_2019 = df.loc[df['year'] == 2019, 'year'].count()
count_2020 = df.loc[df['year'] == 2020, 'year'].count()
print('2019 rides: ', count_2019)
print('2020 rides: ', count_2020)
print('difference: ', abs(count_2019 - count_2020))

How many rides were taken in 2019 and 2020, and what is the difference between these two figures?
2019 rides:  6310419
2020 rides:  800412
difference:  5510007


In [35]:
print('How much money (in total) was collected in 2019 and 2020, and what was the difference between these two figures?')
money_2019 = df.loc[df['year'] == 2019, 'total_amount'].sum()
money_2020 = df.loc[df['year'] == 2020, 'total_amount'].sum()
print('Total money collected in 2019: ', round(money_2019, 2))
print('Total money collected in 2020: ', round(money_2020, 2))
print(f'Difference between money collected in 2019 and 2020: {round(abs(money_2019 - money_2020), 2)}')

How much money (in total) was collected in 2019 and 2020, and what was the difference between these two figures?
Total money collected in 2019:  123761823.33
Total money collected in 2020:  14912844.09
Difference between money collected in 2019 and 2020: 108848979.24
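
The ride counts and money totals can also be produced in one pass by grouping on `year`. This is a sketch on a small, made-up frame (the `toy` values are hypothetical), not a replacement for the cells above:

```python
import pandas as pd

# Hypothetical rows shaped like the combined 2019/2020 frame.
toy = pd.DataFrame({
    'total_amount': [10.0, 20.0, 30.0, 5.0],
    'year': [2019, 2019, 2019, 2020],
})

# One row per year: number of non-missing amounts and their sum.
summary = toy.groupby('year')['total_amount'].agg(['count', 'sum'])

# Year-over-year difference for both figures at once.
change = summary.loc[2019] - summary.loc[2020]
print(summary)
print(change)
```

Note that `'count'` ignores missing `total_amount` values, so it matches the row count only when that column has no gaps.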


In [38]:
print('Did the proportion of trips with more than one passenger change dramatically?')
df.loc[(df['year'] == 2019) & (df['passenger_count'] > 1), 'passenger_count'].count() / df.loc[(df['year'] == 2019), 'payment_type'].count()

Did the proportion of trips with more than one passenger change dramatically?


np.float64(0.2833900000955953)

In [39]:
df.loc[(df['year'] == 2020) & (df['passenger_count'] > 1), 'passenger_count'].count() / df.loc[(df['year'] == 2020), 'payment_type'].count()

np.float64(0.2061513222563435)
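
Both proportions can be computed together by taking the mean of a boolean series per year. The sketch below uses made-up data and drops rows with a missing `passenger_count` first, which is an assumption about how missing values should be treated:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the NaN mimics the missing 2020 values seen earlier.
toy = pd.DataFrame({
    'passenger_count': [1.0, 2.0, 3.0, 1.0, 1.0, 2.0, np.nan, 1.0],
    'year': [2019] * 4 + [2020] * 4,
})

# Keep only rides whose passenger count is known (an assumption, see above).
known = toy.dropna(subset=['passenger_count'])

# The mean of a boolean series is the share of True values in each group.
share_multi = (known['passenger_count'] > 1).groupby(known['year']).mean()
print(share_multi)
```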

In [44]:
print('Did people use cash (i.e., payment_type of 2) less in 2020 than in 2019?')
cash_2019_percentage = df.loc[(df['year'] == 2019) & 
                              (df['payment_type'] == 2), 'payment_type'].count() / df.loc[df['year'] == 2019, 'payment_type'].count()
print(f'Percentage of cash usage in 2019: {round(cash_2019_percentage * 100, 2)} %')
cash_2020_percentage = df.loc[(df['year'] == 2020) & 
                              (df['payment_type'] == 2), 'payment_type'].count() / df.loc[df['year'] == 2020, 'payment_type'].count()
print(f'Percentage of cash usage in 2020: {round(cash_2020_percentage * 100, 2)} %')

Did people use cash (i.e., payment_type of 2) less in 2020 than in 2019?
Percentage of cash usage in 2019: 28.71 %
Percentage of cash usage in 2020: 32.06 %
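
A cross-tabulation gives the share of every payment type per year in a single table. This is a sketch on made-up data (the `toy` frame is hypothetical); rows with a missing `payment_type` are dropped explicitly so the shares are taken over known payment types only:

```python
import pandas as pd

# Hypothetical rows; None becomes NaN, like the missing 2020 payment types.
toy = pd.DataFrame({
    'payment_type': [1.0, 1.0, 2.0, 2.0, 1.0, 2.0, 2.0, None],
    'year': [2019] * 4 + [2020] * 4,
})

known = toy.dropna(subset=['payment_type'])

# normalize='index' makes each row (year) sum to 1.
shares = pd.crosstab(known['year'], known['payment_type'], normalize='index')
print(shares)
```

The cash column is `shares[2.0]`, with one value per year.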


## Exercise 18 • passwd to df

In this exercise, load a Unix passwd file into a data frame. Specifically, do the following:

1. Create a data frame based on linux-etc-passwd.txt. Notice that this file contains comment lines (starting with #) and blank lines (which you should ignore). The field separator is :.

2. Add column names: username, password, userid, groupid, name, homedir, and shell.

3. Make the username column the data frame’s index.

In [47]:
df = pd.read_csv('data/linux-etc-passwd.txt',
                 sep=':', comment='#', header=None,
                 names=['username', 'password', 'userid', 'groupid', 'name', 'homedir', 'shell'],
                 index_col='username')  # step 3: use username as the index
df.head(3)

| username | password | userid | groupid | name | homedir | shell |
|---|---|---|---|---|---|---|
| root | x | 0 | 0 | root | /root | /bin/bash |
| daemon | x | 1 | 1 | daemon | /usr/sbin | /usr/sbin/nologin |
| bin | x | 2 | 2 | bin | /bin | /usr/sbin/nologin |
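
To see how `read_csv` treats comment lines and blank lines without needing the file on disk, the following sketch parses a short in-memory sample. The sample text is hypothetical and only imitates the passwd format:

```python
import io

import pandas as pd

# Hypothetical sample in passwd format, with a comment line and a blank line.
sample = """# this line is a comment
root:x:0:0:root:/root:/bin/bash

# another comment between records
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
"""

passwd = pd.read_csv(
    io.StringIO(sample),
    sep=':',            # fields are separated by colons
    comment='#',        # everything from '#' to the end of a line is ignored
    header=None,        # the data has no header row
    names=['username', 'password', 'userid', 'groupid', 'name', 'homedir', 'shell'],
    index_col='username',
)
print(passwd)
```

Blank lines are skipped by default. Because `comment='#'` applies anywhere in a line, a `#` inside a field would also cut that line short.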


In [49]:
# Ignore the password and groupid fields, such that they don't appear in the data frame.
df = pd.read_csv('data/linux-etc-passwd.txt',
                 sep=':', comment='#', header=None,
                 usecols=['username', 'userid', 'name', 'homedir', 'shell'],
                 names=['username', 'password', 'userid', 'groupid', 'name', 'homedir', 'shell'])
df.head(3)

| | username | userid | name | homedir | shell |
|---|---|---|---|---|---|
| 0 | root | 0 | root | /root | /bin/bash |
| 1 | daemon | 1 | daemon | /usr/sbin | /usr/sbin/nologin |
| 2 | bin | 2 | bin | /bin | /usr/sbin/nologin |


In [50]:
# Immediately after logging into a Unix system, a command interpreter, known as a "shell," fires up. What are the different shells in this file?
df['shell'].drop_duplicates()

0             /bin/bash
1     /usr/sbin/nologin
4             /bin/sync
18           /bin/false
31              /bin/sh
42         /bin/nologin
Name: shell, dtype: object
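
`drop_duplicates` keeps the first occurrence of each value along with its original index. Two related calls are sketched below on a made-up series (the `shells` data is hypothetical): `unique` returns just the distinct values, and `value_counts` also shows how many users have each shell.

```python
import pandas as pd

# Hypothetical shell column with repeated values.
shells = pd.Series(
    ['/bin/bash', '/usr/sbin/nologin', '/usr/sbin/nologin', '/bin/sync', '/bin/bash'],
    name='shell',
)

# Distinct values in order of first appearance, as a NumPy array.
distinct = shells.unique()

# Number of rows per shell.
per_shell = shells.value_counts()

print(distinct)
print(per_shell)
```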