## Exercises: Explore the dataset

In [73]:
import pandas as pd
import seaborn as sns
taxis = sns.load_dataset("taxis")
taxis.head(3)

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan


**Explore the "taxis" dataset to answer the following questions:**

**Q1:** How many rows and columns are in the dataset?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Rows:</b> 6433
&nbsp;&nbsp;&nbsp;<b>Columns:</b> 14
</details>

In [77]:
taxis.shape

(6433, 14)

**Q2:** Which data type is the most common in the dataset?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;object (6 columns)
</details>

In [None]:
taxis.info()  # object is the most common dtype (6 columns)

**Q3:** What is the average number of passengers in a taxi?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;1.54
</details>

In [None]:
taxis["passengers"].mean() # 1.539250738380227


**Q4:** What is the most common number of passengers in a taxi?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;1
</details>

In [None]:
taxis["passengers"].value_counts() # 1

**Q5:** What is the most common payment method?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;credit card
</details>

In [None]:
taxis["payment"].value_counts() # credit card


**Q6:** Which of the categorical features has the most categories?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;dropoff_zone (203 categories)
</details>
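
One possible way to approach Q6 is to count the distinct values in every non-numeric, non-datetime column with `nunique()`. The sketch below uses a small, hypothetical stand-in frame called `toy` (not part of the exercise) so that it runs without downloading anything; the same two lines should work on `taxis`.

```python
import pandas as pd

# Hypothetical stand-in for `taxis`, built in memory for illustration only.
toy = pd.DataFrame({
    "color": ["yellow", "green", "yellow", "yellow"],
    "payment": ["cash", "credit card", "cash", None],
    "pickup_zone": ["Alphabet City", "Lenox Hill West", "West Village", "Astoria"],
    "fare": [7.0, 5.0, 7.5, 6.0],
})

# Keep only the text-like columns, then count distinct values per column.
# nunique() ignores missing values by default.
n_categories = toy.select_dtypes(exclude=["number", "datetime"]).nunique()
print(n_categories.sort_values(ascending=False))
```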

**Q7:** What percentage of the rides in the dataset were made in yellow taxis?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;84.7%
</details>

In [None]:
# Boolean mask: True for every row where the taxi is yellow
filt_yellow = taxis["color"] == "yellow"
# The mean of a boolean mask is the share of True values
pct_yellow = filt_yellow.mean() * 100

print(f'{pct_yellow:.1f}%')

**Q8:** Which dropoff borough is most common? Which one is least common?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Most common:</b> Manhattan (5206)<br>
&nbsp;&nbsp;&nbsp;<b>Least common:</b> Staten Island (2)<br>
</details>
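
A minimal sketch of how Q8 could be answered with `value_counts()`, shown on a hypothetical seven-value series rather than the real column. On the real data, `taxis["dropoff_borough"]` would take the place of `boroughs`.

```python
import pandas as pd

# Hypothetical stand-in for taxis["dropoff_borough"], including one missing value.
boroughs = pd.Series(
    ["Manhattan", "Manhattan", "Queens", "Manhattan", "Brooklyn", "Queens", None],
    name="dropoff_borough",
)

# value_counts() drops missing values and sorts by count, largest first.
counts = boroughs.value_counts()
print("Most common: ", counts.idxmax(), counts.max())
print("Least common:", counts.idxmin(), counts.min())
```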

**Q9:** Which column has the most missing values? How many?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<i>dropoff_zone</i> and <i>dropoff_borough</i> both have 45 missing values.
</details>
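
Missing values per column can be counted by chaining `isna()` and `sum()`. The sketch below does this on a hypothetical four-row frame and keeps every column that ties for the maximum, since (as in the answer above) more than one column may share it.

```python
import pandas as pd

# Hypothetical stand-in for `taxis` with a few missing entries.
toy = pd.DataFrame({
    "fare": [7.0, 5.0, 7.5, 6.0],
    "payment": ["cash", None, "credit card", "cash"],
    "dropoff_zone": ["West Village", None, None, "Astoria"],
    "dropoff_borough": ["Manhattan", None, None, "Queens"],
})

# isna() marks missing cells as True; summing counts them per column.
missing = toy.isna().sum()

# Keep all columns that share the highest count, not just the first one.
most_missing = missing[missing == missing.max()]
print(most_missing)
```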

### Memory usage
``` taxis.info(memory_usage="deep") ``` gives you the total memory usage of the dataframe.

``` taxis.memory_usage(deep=True) ``` gives you the memory usage of each column, in bytes.

**Answer the following questions:**

**Q10:** What is the total memory usage of the dataframe?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;2.9 MB
</details>

**Q11:** Which column takes up the most memory? How many kilobytes?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;pickup_zone (470 KB)
</details>
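
As a rough sketch for Q10 and Q11, the per-column result of `memory_usage(deep=True)` can be summed for the total and searched for the largest entry. The frame `toy` is a hypothetical stand-in, so the printed sizes will not match the answers above.

```python
import pandas as pd

# Hypothetical stand-in for `taxis` (1,000 rows), built in memory.
toy = pd.DataFrame({
    "distance": [1.6, 0.79] * 500,
    "color": ["yellow", "green"] * 500,
    "pickup_zone": ["Upper West Side South", "Lenox Hill West"] * 500,
})

# Bytes per column; the result also contains an "Index" entry for the row index.
mem = toy.memory_usage(deep=True)

total_kb = mem.sum() / 1024
largest = mem.drop("Index").idxmax()
print(f"total: {total_kb:.1f} KB")
print(f"largest column: {largest} ({mem[largest] / 1024:.1f} KB)")
```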

**Q12:** Why do the numeric columns all take up exactly 51464 bytes?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;They all use 64 bit datatypes. 64 bits = 8 bytes. 6433 entries * 8 bytes = 51464 bytes.
</details>
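
The arithmetic in this answer can be checked without the dataset: a 64-bit column with 6433 entries occupies 6433 * 8 bytes. The series below is filled with zeros purely for illustration.

```python
import numpy as np
import pandas as pd

n_rows = 6433  # number of rows in the dataset, from Q1

# Any float64 column of this length has the same size, whatever its values.
col = pd.Series(np.zeros(n_rows, dtype="float64"))

bytes_per_value = col.dtype.itemsize          # 8 bytes = 64 bits
col_bytes = col.memory_usage(index=False)     # data only, without the index
print(bytes_per_value, col_bytes)
```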

**Q13:** What is the total memory usage after converting all *object* columns to *category*?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;494.0 KB
</details>
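
One way the conversion in Q13 might be done is with `astype` and a dictionary that maps each text column to `"category"`. This is a sketch on a hypothetical stand-in frame; on the real data, `taxis` would replace `toy`, and the datetime columns would need to be excluded from the selection as well.

```python
import pandas as pd

# Hypothetical stand-in for `taxis`: two repetitive text columns and one float column.
toy = pd.DataFrame({
    "color": ["yellow", "green"] * 500,
    "payment": ["credit card", "cash"] * 500,
    "fare": [7.0, 5.0] * 500,
})
before = toy.memory_usage(deep=True).sum()

# Convert every non-numeric column to the category dtype.
text_cols = toy.select_dtypes(exclude="number").columns
compact = toy.astype({col: "category" for col in text_cols})
after = compact.memory_usage(deep=True).sum()

print(f"{before / 1024:.1f} KB -> {after / 1024:.1f} KB")
```

The saving comes from storing each distinct string once and keeping only a small integer code per row, so it is largest for columns with few distinct values.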

**Q14:** ... and after also converting *float64* to *float32*?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;368.4 KB
</details>
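
The float conversion in Q14 could follow the same pattern. The sketch below, again on a hypothetical frame, halves the size of each float column; note that *float32* keeps only about 7 significant decimal digits, which is an assumption the data has to tolerate.

```python
import pandas as pd

# Hypothetical stand-in for the float and integer columns of `taxis`.
toy = pd.DataFrame({
    "fare": [7.0, 5.0, 7.5],
    "tip": [2.15, 0.0, 2.36],
    "passengers": [1, 1, 1],
})

# Convert only the float64 columns; integer columns are left untouched.
float_cols = toy.select_dtypes(include="float64").columns
compact = toy.astype({col: "float32" for col in float_cols})

fare_before = toy["fare"].memory_usage(index=False)      # 3 values * 8 bytes
fare_after = compact["fare"].memory_usage(index=False)   # 3 values * 4 bytes
print(fare_before, "->", fare_after)
```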

**Q15:** What is the smallest datatype we can convert passengers to? What is the total memory usage after converting passengers to the new type?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;The maximum number of passengers in the dataset is 6,<br>
&nbsp;&nbsp;&nbsp;and therefore the values easily fit into the <i>int8</i> type (8 bit integer).<br>
<br>
&nbsp;&nbsp;&nbsp;New size: 324.4 KB
</details>
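
Before downcasting an integer column it is worth checking that its values fit the target range. The sketch below does that for *int8* on a hypothetical handful of passenger counts, and also shows `pd.to_numeric(..., downcast="integer")`, which picks the smallest signed integer type that fits.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for taxis["passengers"]; the real maximum is 6.
passengers = pd.Series([1, 1, 2, 6, 0, 3], name="passengers")

# int8 can hold values from -128 to 127.
limits = np.iinfo("int8")
fits = limits.min <= passengers.min() and passengers.max() <= limits.max

# Explicit conversion, and automatic selection of the smallest signed type.
small = passengers.astype("int8")
auto = pd.to_numeric(passengers, downcast="integer")
print(fits, small.dtype, auto.dtype, small.memory_usage(index=False))
```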

**Q16:** What percentage of the original data size is the new dataset after converting all the types as above?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;11.0 %
</details>

### Final note:
Just to be clear: if we want to limit memory usage by choosing data types with a smaller footprint, it makes more sense to do so when loading the dataset into pandas than to change the types afterwards (as in the exercises above).

The most common ways of loading data into pandas (such as pd.read_csv and pd.read_json) provide an optional dtype parameter for setting the data types as the files are read into dataframes.
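
As a small illustration of setting types at load time, the sketch below passes a `dtype` mapping to `pd.read_csv`. The CSV text and its three columns are made up for the example and read from memory via `io.StringIO`; with a real file, the path would go in its place.

```python
import io
import pandas as pd

# Hypothetical CSV content, kept in memory so the example is self-contained.
csv_text = (
    "passengers,fare,color\n"
    "1,7.0,yellow\n"
    "2,5.0,green\n"
    "1,7.5,yellow\n"
)

# The dtype mapping is applied while parsing, so no later conversion is needed.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"passengers": "int8", "fare": "float32", "color": "category"},
)
print(df.dtypes)
```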

Also note that this is really only a concern when working with huge datasets. For smaller datasets, like the one in these exercises, it doesn't really matter, and optimizing might just be unnecessary work. The exercises above simply serve as examples to better understand data types and their memory footprints.