# Multiple Tables in Pandas

In [1]:
import pandas as pd

#### 1.

Inspect the DataFrames using `print` and `head`:

- `visits` lists all of the users who have visited the website
- `cart` lists all of the users who have added a t-shirt to their cart
- `checkout` lists all of the users who have started the checkout
- `purchase` lists all of the users who have purchased a t-shirt

In [2]:
visits = pd.read_csv("visit.csv")
cart = pd.read_csv("cart.csv")
checkout = pd.read_csv("checkout.csv")
purchase = pd.read_csv("purchase.csv")
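
As an optional variation on the loading step, the timestamp column can be parsed while the file is read. The sketch below uses an in-memory string as a stand-in for a file such as `visit.csv`; the two-column layout and the name `toy_visits` are assumptions for illustration, and the real files may be laid out differently.

```python
import io
import pandas as pd

# Stand-in for a CSV file; the layout is assumed, the two rows are sample values
csv_text = """user_id,visit_time
943647ef-3682-4750-a2e1-918ba6f16188,2017-04-07 15:14:00
0c3a3dd0-fb64-4eac-bf84-ba069ce409f2,2017-01-26 14:24:00
"""

# parse_dates asks read_csv to convert the listed columns to datetimes
toy_visits = pd.read_csv(io.StringIO(csv_text), parse_dates=["visit_time"])
print(toy_visits.dtypes)
```

Parsing at load time is one way to avoid a separate conversion step before doing date arithmetic.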

In [3]:
visits.head()

Unnamed: 0,user_id,visit_time
0,943647ef-3682-4750-a2e1-918ba6f16188,2017-04-07 15:14:00
1,0c3a3dd0-fb64-4eac-bf84-ba069ce409f2,2017-01-26 14:24:00
2,6e0b2d60-4027-4d9a-babd-0e7d40859fb1,2017-08-20 08:23:00
3,6879527e-c5a6-4d14-b2da-50b85212b0ab,2017-11-04 18:15:00
4,a84327ff-5daa-4ba1-b789-d5b4caf81e96,2017-02-27 11:25:00


In [4]:
cart.head()

Unnamed: 0,user_id,cart_time
0,2be90e7c-9cca-44e0-bcc5-124b945ff168,2017-11-07 20:45:00
1,4397f73f-1da3-4ab3-91af-762792e25973,2017-05-27 01:35:00
2,a9db3d4b-0a0a-4398-a55a-ebb2c7adf663,2017-03-04 10:38:00
3,b594862a-36c5-47d5-b818-6e9512b939b3,2017-09-27 08:22:00
4,a68a16e2-94f0-4ce8-8ce3-784af0bbb974,2017-07-26 15:48:00


In [5]:
checkout.head()

Unnamed: 0,user_id,checkout_time
0,d33bdc47-4afa-45bc-b4e4-dbe948e34c0d,2017-06-25 09:29:00
1,4ac186f0-9954-4fea-8a27-c081e428e34e,2017-04-07 20:11:00
2,3c9c78a7-124a-4b77-8d2e-e1926e011e7d,2017-07-13 11:38:00
3,89fe330a-8966-4756-8f7c-3bdbcd47279a,2017-04-20 16:15:00
4,3ccdaf69-2d30-40de-b083-51372881aedd,2017-01-08 20:52:00


In [6]:
purchase.head()

Unnamed: 0,user_id,purchase_time
0,4b44ace4-2721-47a0-b24b-15fbfa2abf85,2017-05-11 04:25:00
1,02e684ae-a448-408f-a9ff-dcb4a5c99aac,2017-09-05 08:45:00
2,4b4bc391-749e-4b90-ab8f-4f6e3c84d6dc,2017-11-20 20:49:00
3,a5dbb25f-3c36-4103-9030-9f7c6241cd8d,2017-01-22 15:18:00
4,46a3186d-7f5a-4ab9-87af-84d05bfd4867,2017-06-11 11:32:00


#### 2.

Combine `visits` and `cart` using a left merge.

In [7]:
visits_cart_left = pd.merge(visits, cart, how="left")
visits_cart_left.head()

Unnamed: 0,user_id,visit_time,cart_time
0,943647ef-3682-4750-a2e1-918ba6f16188,2017-04-07 15:14:00,
1,0c3a3dd0-fb64-4eac-bf84-ba069ce409f2,2017-01-26 14:24:00,2017-01-26 14:44:00
2,6e0b2d60-4027-4d9a-babd-0e7d40859fb1,2017-08-20 08:23:00,2017-08-20 08:31:00
3,6879527e-c5a6-4d14-b2da-50b85212b0ab,2017-11-04 18:15:00,
4,a84327ff-5daa-4ba1-b789-d5b4caf81e96,2017-02-27 11:25:00,
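
To see what the left merge does on a smaller scale, here is a minimal sketch with made-up data. The `toy_` DataFrames and their values are hypothetical, and `on="user_id"` is spelled out here even though the cell above relies on the default of joining on shared column names.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the CSV files)
toy_visits = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "visit_time": ["2017-01-01 10:00:00", "2017-01-02 11:00:00", "2017-01-03 12:00:00"],
})
toy_cart = pd.DataFrame({
    "user_id": ["b"],
    "cart_time": ["2017-01-02 11:05:00"],
})

# A left merge keeps every row of the left table;
# rows without a match get NaN in the right table's columns
toy_left = pd.merge(toy_visits, toy_cart, how="left", on="user_id")
print(toy_left)
```

Users `a` and `c` have no cart entry, so their `cart_time` is missing after the merge.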


#### 3.

How long is your merged DataFrame?

<details><summary>Solution</summary>
2000 rows x 3 columns
</details>

In [8]:
visits_cart_left.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     2000 non-null   object
 1   visit_time  2000 non-null   object
 2   cart_time   348 non-null    object
dtypes: object(3)
memory usage: 47.0+ KB


#### 4.


How many of the timestamps are `null` for the column `cart_time`?

What do these null rows mean?

<details><summary>Solution</summary>

1652 rows are null. These are the users who visited the website but did not add a t-shirt to their cart.

</details>

In [9]:
len_visits_cart_left_is_null = len(visits_cart_left[visits_cart_left.cart_time.isnull()])
len_visits_cart_left_is_null

1652
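
Another way to count unmatched rows is the `indicator` parameter of `pd.merge`, which adds a `_merge` column describing where each row came from. This is a sketch on hypothetical toy data, not the notebook's actual tables.

```python
import pandas as pd

# Toy data (hypothetical)
toy_visits = pd.DataFrame({"user_id": ["a", "b", "c", "d"]})
toy_cart = pd.DataFrame({
    "user_id": ["b", "d"],
    "cart_time": ["2017-01-02 11:05:00", "2017-01-04 09:30:00"],
})

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both"
merged = pd.merge(toy_visits, toy_cart, how="left", on="user_id", indicator=True)

# Two ways to count visitors with no cart entry
null_count = merged["cart_time"].isnull().sum()
left_only_count = (merged["_merge"] == "left_only").sum()
print(null_count, left_only_count)
```

The two counts agree only when the right-hand column has no missing values of its own; the `_merge` column does not depend on that.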

In [10]:
visits_cart_left.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     2000 non-null   object
 1   visit_time  2000 non-null   object
 2   cart_time   348 non-null    object
dtypes: object(3)
memory usage: 47.0+ KB


#### 5.

What percent of users who visited Cool T-Shirts Inc. ended up *not* placing a t-shirt in their cart?

**Note:** In Python 3, the `/` operator always performs true division, so dividing two integers already returns a *float* and no conversion is needed. In Python 2, `/` on two integers truncates the result, so there you would first convert either the numerator or the denominator with `float()`.
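
As a quick check of how division behaves, here is a small sketch that assumes Python 3 and reuses the two counts reported earlier (1652 null rows out of 2000). The variable names are illustrative only.

```python
not_in_cart = 1652   # rows with a null cart_time, as reported above
total_rows = 2000    # rows in the merged DataFrame

share = not_in_cart / total_rows      # "/" is true division in Python 3
floored = not_in_cart // total_rows   # "//" is floor division and drops the fraction

# The ".2%" format multiplies by 100 and appends a percent sign
print(share, floored, f"{share:.2%}")
```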

In [13]:
percentage_visits_cart_left_is_null = len_visits_cart_left_is_null / len(visits_cart_left)
percentage_visits_cart_left_is_null, f"{percentage_visits_cart_left_is_null:.2%}"

(0.826, '82.60%')

#### 6.

Repeat the left merge for `cart` and `checkout` and count `null` values. What percentage of users put items in their cart, but did not proceed to checkout?

In [15]:
cart_checkout_left = pd.merge(cart, checkout, how="left")
cart_checkout_left.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   user_id        482 non-null    object
 1   cart_time      482 non-null    object
 2   checkout_time  360 non-null    object
dtypes: object(3)
memory usage: 11.4+ KB


In [16]:
len_cart_checkout_left_is_null = len(cart_checkout_left[cart_checkout_left.checkout_time.isnull()])
len_cart_checkout_left_is_null

122

In [17]:
percentage_cart_checkout_left_is_null = len_cart_checkout_left_is_null / len(cart_checkout_left)
percentage_cart_checkout_left_is_null, f"{percentage_cart_checkout_left_is_null:.2%}"

(0.25311203319502074, '25.31%')
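
Since the same merge, count and divide pattern repeats for each funnel step, it can be wrapped in a small helper. `drop_off_rate` is a hypothetical function written for this sketch, and the toy tables are made up.

```python
import pandas as pd

def drop_off_rate(left, right, time_col, on="user_id"):
    """Share of rows in `left` that have no match in `right` (hypothetical helper)."""
    merged = pd.merge(left, right, how="left", on=on)
    # The mean of a boolean Series is the fraction of True values
    return merged[time_col].isnull().mean()

# Toy data (hypothetical)
toy_cart = pd.DataFrame({
    "user_id": ["a", "b", "c", "d"],
    "cart_time": ["10:00", "10:05", "10:10", "10:15"],
})
toy_checkout = pd.DataFrame({
    "user_id": ["a", "c", "d"],
    "checkout_time": ["10:02", "10:12", "10:20"],
})

rate = drop_off_rate(toy_cart, toy_checkout, "checkout_time")
print(f"{rate:.2%}")
```

Taking the mean of `isnull()` gives the same value as dividing the null count by the number of rows.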

#### 7.

Merge all four steps of the funnel, in order, using a series of *left merges*. Save the results to the variable `all_data`.

Examine the result using `print` and `head`.

In [18]:
all_data = visits_cart_left.merge(checkout, how="left").merge(purchase, how="left")
all_data.head()

Unnamed: 0,user_id,visit_time,cart_time,checkout_time,purchase_time
0,943647ef-3682-4750-a2e1-918ba6f16188,2017-04-07 15:14:00,,,
1,0c3a3dd0-fb64-4eac-bf84-ba069ce409f2,2017-01-26 14:24:00,2017-01-26 14:44:00,2017-01-26 14:54:00,2017-01-26 15:08:00
2,6e0b2d60-4027-4d9a-babd-0e7d40859fb1,2017-08-20 08:23:00,2017-08-20 08:31:00,,
3,6879527e-c5a6-4d14-b2da-50b85212b0ab,2017-11-04 18:15:00,,,
4,a84327ff-5daa-4ba1-b789-d5b4caf81e96,2017-02-27 11:25:00,,,
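
The chain of left merges can also be written as a fold over a list of tables. The following is a sketch with hypothetical toy data, using `functools.reduce` from the standard library; it is an alternative spelling, not the approach taken in the cell above.

```python
from functools import reduce
import pandas as pd

# Toy funnel tables (hypothetical data)
toy_visits = pd.DataFrame({"user_id": ["a", "b", "c"], "visit_time": ["10:00", "11:00", "12:00"]})
toy_cart = pd.DataFrame({"user_id": ["a", "b"], "cart_time": ["10:05", "11:10"]})
toy_checkout = pd.DataFrame({"user_id": ["a"], "checkout_time": ["10:09"]})
toy_purchase = pd.DataFrame({"user_id": ["a"], "purchase_time": ["10:15"]})

# Left-merge each table onto the running result, in funnel order
steps = [toy_visits, toy_cart, toy_checkout, toy_purchase]
toy_all = reduce(lambda left, right: left.merge(right, how="left", on="user_id"), steps)
print(toy_all)
```

Only user `a` completes every step, so only that row has all four timestamps filled in.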


#### 8.

What percentage of users proceeded to checkout, but did not purchase a t-shirt?

In [None]:
# Keep only the rows where checkout_time is not null, then count them
reached_checkout = len(all_data[all_data.checkout_time.notnull()])
reached_checkout

598

In [20]:
reached_checkout_not_purchased = len(all_data[all_data.checkout_time.notnull() & all_data.purchase_time.isnull()])
reached_checkout_not_purchased

101

In [21]:
percentage_reached_checkout_not_purchased = reached_checkout_not_purchased / reached_checkout
percentage_reached_checkout_not_purchased, f"{percentage_reached_checkout_not_purchased:.2%}"

(0.1688963210702341, '16.89%')
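
The counts above are row counts. The `info()` outputs in this notebook report 2000 rows after the first merge and 2372 rows in `all_data`, so some `user_id` values apparently occupy more than one row. One possible cause, which is an assumption and has not been checked against the CSV files, is a user appearing more than once in a right-hand table. The sketch below reproduces that effect on hypothetical toy data and shows how to count distinct users instead of rows.

```python
import pandas as pd

# Toy data (hypothetical); user "b" appears twice in the right-hand table
toy_visits = pd.DataFrame({
    "user_id": ["a", "b"],
    "visit_time": ["2017-01-01 10:00:00", "2017-01-02 11:00:00"],
})
toy_cart = pd.DataFrame({
    "user_id": ["b", "b"],
    "cart_time": ["2017-01-02 11:05:00", "2017-01-02 11:09:00"],
})

merged = toy_visits.merge(toy_cart, how="left", on="user_id")

rows_before = len(toy_visits)
rows_after = len(merged)                       # one row per matching pair
distinct_users = merged["user_id"].nunique()   # unaffected by the duplication
print(rows_before, rows_after, distinct_users)
```

If per-user percentages are wanted, counting with `nunique()` or dropping duplicate keys before merging are two options worth considering.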

#### 9.

Which step of the funnel is weakest (i.e., has the highest percentage of users not completing it)?

How might Cool T-Shirts Inc. change their website to fix this problem?

In [24]:
print("{} percent of users who visited the page did not add a t-shirt to their cart".format(round(percentage_visits_cart_left_is_null*100, 2)))
print("{} percent of users who added a t-shirt to their cart did not check out".format(round(percentage_cart_checkout_left_is_null*100, 2)))
print("{} percent of users who made it to checkout did not purchase a shirt".format(round(percentage_reached_checkout_not_purchased*100, 2)))

82.6 percent of users who visited the page did not add a t-shirt to their cart
25.31 percent of users who added a t-shirt to their cart did not check out
16.89 percent of users who made it to checkout did not purchase a shirt
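
To answer the first question programmatically, the three drop-off rates can be compared directly. This sketch copies the rates reported in the earlier cells into a dictionary; the step labels are illustrative names.

```python
# Drop-off rates copied from the outputs of the earlier cells
drop_off = {
    "visit -> cart": 0.826,
    "cart -> checkout": 0.25311203319502074,
    "checkout -> purchase": 0.1688963210702341,
}

# The weakest step is the one with the highest drop-off rate
weakest_step = max(drop_off, key=drop_off.get)
print(f"Weakest step: {weakest_step} ({drop_off[weakest_step]:.2%} drop-off)")
```

By this measure, the step from visiting the site to adding a t-shirt to the cart loses the largest share of users.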


#### 10.

Using the merged DataFrame `all_data` that you created, let’s calculate the average time from initial visit to final purchase. Add a column that is the difference between `purchase_time` and `visit_time`.

In [32]:
all_data.purchase_time = all_data.purchase_time.astype("datetime64[ns]")
all_data.checkout_time = all_data.checkout_time.astype("datetime64[ns]")
all_data.cart_time = all_data.cart_time.astype("datetime64[ns]")
all_data.visit_time = all_data.visit_time.astype("datetime64[ns]")

all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2372 entries, 0 to 2371
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   user_id        2372 non-null   object        
 1   visit_time     2372 non-null   datetime64[ns]
 2   cart_time      720 non-null    datetime64[ns]
 3   checkout_time  598 non-null    datetime64[ns]
 4   purchase_time  497 non-null    datetime64[ns]
dtypes: datetime64[ns](4), object(1)
memory usage: 92.8+ KB
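
An alternative to `astype("datetime64[ns]")` is `pd.to_datetime`, which is pandas' general-purpose converter for date strings. The sketch below applies it column by column to a hypothetical toy DataFrame.

```python
import pandas as pd

# Toy data (hypothetical); None stands for a user who never purchased
toy = pd.DataFrame({
    "visit_time": ["2017-01-26 14:24:00", "2017-04-07 15:14:00"],
    "purchase_time": ["2017-01-26 15:08:00", None],
})

# Convert each timestamp column; missing values become NaT
for col in ["visit_time", "purchase_time"]:
    toy[col] = pd.to_datetime(toy[col])

print(toy.dtypes)
```

Looping over the column names avoids repeating the same conversion line once per column.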


In [34]:
all_data['time_to_purchase'] = all_data.purchase_time - all_data.visit_time

#### 11.

Examine the results by printing the new column to the screen.

In [35]:
all_data.time_to_purchase

0                  NaT
1      0 days 00:44:00
2                  NaT
3                  NaT
4                  NaT
             ...      
2367               NaT
2368               NaT
2369               NaT
2370               NaT
2371               NaT
Name: time_to_purchase, Length: 2372, dtype: timedelta64[ns]

#### 12.

Calculate the average time to purchase by calling the `.mean()` method on your new column.

In [36]:
all_data.time_to_purchase.mean()

Timedelta('0 days 00:43:53.360160965')
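
The mean of a timedelta column is a `Timedelta`, which can be converted to a plain number when a single unit is easier to report. Here is a minimal sketch on hypothetical toy data, using `Timedelta.total_seconds()`.

```python
import pandas as pd

# Toy data (hypothetical); the third user never purchased
toy = pd.DataFrame({
    "visit_time": pd.to_datetime(
        ["2017-01-26 14:24:00", "2017-02-01 09:00:00", "2017-03-01 12:00:00"]
    ),
    "purchase_time": pd.to_datetime(
        ["2017-01-26 15:08:00", "2017-02-01 09:40:00", None]
    ),
})

toy["time_to_purchase"] = toy["purchase_time"] - toy["visit_time"]

# mean() skips NaT rows by default, so only completed purchases are averaged
mean_delta = toy["time_to_purchase"].mean()
mean_minutes = mean_delta.total_seconds() / 60
print(mean_delta, mean_minutes)
```

Because missing values are skipped, the average describes only users who actually purchased, not all visitors.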