### Exam 3

The downloadable course materials contain a folder named `Exam Materials` which contains the following file:

```
widget_sales.csv
```

This CSV file contains widget sales data in the following columns:
- `widget`: the name of the widget (string)
- `date_sold`: a Unix epoch timestamp indicating the date and time the sale occurred (integer)
- `quantity_sold`: the number of widgets sold (integer)
- `unit_price`: the unit price the widget was sold for (float)
- `tax`: the tax that was added to the sale price (float)
- `discount`: the discount that was subtracted from the sale price (float)

All the questions on this exam relate to the data contained in this file.

#### Q1

The first step is to load the data into a Pandas dataframe.

We do this by running the following code:

In [5]:
import pandas as pd
df = pd.read_csv('widget_sales.csv')
df.loc[:10]

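|   | widget | date_sold | quantity_sold | unit_price | tax | discount |
|---|--------|-----------|---------------|------------|-----|----------|
| 0 | DDD | 1579626005 | 2.0 | 9.99 | 0.26 | 0.51 |
| 1 | NaN | 1579134185 | 16.0 | 9.99 | 2.86 | 7.13 |
| 2 | BBB | 1579978810 | NaN | 15.0 | 1.06 | 2.83 |
| 3 | EEE | 1578912699 | 18.0 | 199.99 | 176.89 | 145.83 |
| 4 | BBB | 1579162853 | 4.0 | 15.0 | 2.19 | 2.7 |
| 5 | FFF | 1579246947 | 16.0 | 250.0 | 111.96 | 70.76 |
| 6 | CCC | 1580424245 | 7.0 | 22.99 | 7.78 | 3.84 |
| 7 | EEE | 1578954583 | 2.0 | 199.99 | 16.1 | 10.97 |
| 8 | AAA | 1578253210 | 13.0 | 10.75 | 4.96 | 5.49 |
| 9 | FFF | 1577866793 | 16.0 | 250.0 | 165.61 | 66.63 |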


The expected data type for the `quantity_sold` column is an integer. What are the expected and actual NumPy data types for that column in the loaded dataframe?

- a. expected = `int64`, actual = `int64`
- b. expected = `int64`, actual = `float64`
- c. expected = `float64`, actual = `object`
- d. expected = `float64`, actual = `float64`


#### Q2

Why are the expected and actual data types for `quantity_sold` not the same?

- a. The CSV data contains some float values for that column, not just integers
- b. There's a bug in Pandas
- c. That column has some missing values
- d. The expected and actual data types are the same - nothing to see here, move along!
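
As a minimal sketch of how pandas infers a column's data type, the example below reads two tiny in-memory CSVs. The data is hypothetical and is not taken from `widget_sales.csv`; the only difference between the two inputs is one blank `quantity_sold` cell. NumPy's `int64` has no way to represent a missing value, so pandas stores such a column as `float64` and uses `NaN` for the gap.

```python
import io

import pandas as pd

# Hypothetical two-row CSVs; only the second has a blank quantity_sold cell.
complete = io.StringIO("widget,quantity_sold\nAAA,2\nBBB,5\n")
with_gap = io.StringIO("widget,quantity_sold\nAAA,2\nBBB,\n")

dtype_complete = pd.read_csv(complete)["quantity_sold"].dtype
dtype_with_gap = pd.read_csv(with_gap)["quantity_sold"].dtype

print(dtype_complete)  # int64
print(dtype_with_gap)  # float64
```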


#### Q3

In [11]:
nan_columns = set(df.columns[df.isna().any()])
nan_columns

{'quantity_sold', 'widget'}

Inspect the data frame to determine which columns have missing (null) values.

- a. `widget`, `quantity_sold` only
- b. `date_sold`, `quantity_sold` only
- c. `quantity_sold` only
- d. there are no missing values anywhere


#### Q4

In [15]:
df_not_null = df.dropna()
df_not_null

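|   | widget | date_sold | quantity_sold | unit_price | tax | discount |
|---|--------|-----------|---------------|------------|-----|----------|
| 0 | DDD | 1579626005 | 2.0 | 9.99 | 0.26 | 0.51 |
| 3 | EEE | 1578912699 | 18.0 | 199.99 | 176.89 | 145.83 |
| 4 | BBB | 1579162853 | 4.0 | 15.00 | 2.19 | 2.70 |
| 5 | FFF | 1579246947 | 16.0 | 250.00 | 111.96 | 70.76 |
| 6 | CCC | 1580424245 | 7.0 | 22.99 | 7.78 | 3.84 |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | DDD | 1580265826 | 3.0 | 9.99 | 0.52 | 0.68 |
| 9996 | DDD | 1579719495 | 2.0 | 9.99 | 0.19 | 0.51 |
| 9997 | BBB | 1579872262 | 11.0 | 15.00 | 6.96 | 2.57 |
| 9998 | EEE | 1578057662 | 16.0 | 199.99 | 42.43 | 14.63 |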


We want to create a new dataframe (named `df_not_null`) that does not have any missing values.
We want to do this by removing all **rows** that contain null values.

Which of the following expressions will achieve this?

I.
```
df_temp = df[pd.notnull(df['widget'])]
df_not_null = df_temp[pd.notnull(df_temp['quantity_sold'])]
```

II.
```
df_not_null = df.dropna(axis=0)
```

III.
```
df_not_null = df.dropna()
```

- a. I only
- b. II and III only
- c. none of them
- d. all of them
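
The three expressions can be compared on a small frame. This is only a sketch: `toy` is a hypothetical dataframe with nulls in the same two columns as the exam data, and the variable names are illustrative. Note that the column-by-column filter matches `dropna` here only because `widget` and `quantity_sold` are the only columns that contain nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with nulls in the same two columns as the exam data.
toy = pd.DataFrame({
    "widget": ["AAA", None, "BBB", "CCC"],
    "quantity_sold": [2.0, 16.0, np.nan, 7.0],
    "unit_price": [10.75, 9.99, 15.0, 22.99],
})

# Expression I: filter one column at a time with pd.notnull.
temp = toy[pd.notnull(toy["widget"])]
by_mask = temp[pd.notnull(temp["quantity_sold"])]

# Expressions II and III: dropna removes rows by default (axis=0).
by_axis = toy.dropna(axis=0)
by_default = toy.dropna()

all_equal = by_mask.equals(by_axis) and by_axis.equals(by_default)
print(all_equal)               # True
print(list(by_default.index))  # [0, 3]
```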


#### Q5

In [16]:
df_not_null.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9998 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   widget         9998 non-null   object 
 1   date_sold      9998 non-null   int64  
 2   quantity_sold  9998 non-null   float64
 3   unit_price     9998 non-null   float64
 4   tax            9998 non-null   float64
 5   discount       9998 non-null   float64
dtypes: float64(4), int64(1), object(1)
memory usage: 546.8+ KB


Assume that `df_not_null` is the result of correctly removing any rows in the original data frame that contained any null values.

Inspect this new data frame - what is the data type of `quantity_sold`?

- a. `object`
- b. `int64`
- c. `float64`
- d. `uint64`


#### Q6

The data type for `quantity_sold` can be changed to be an integer since we expect all non-null values to be positive integers.

Assuming `df_not_null` is a dataframe that contains no null values (derived from our original dataframe `df`), which of the following code snippets results in `data` being a dataframe that contains the `quantity_sold` column as an `int64` data type?

I.
```
quantity_sold = df_not_null['quantity_sold'].astype(int)
data = pd.concat(
    [
        df_not_null[['widget', 'date_sold', 'unit_price', 'tax', 'discount']], 
        quantity_sold
    ],
    axis=1
)
```

II.
```
quantity_sold = df_not_null['quantity_sold'].astype(int)
data = df_not_null.drop('quantity_sold', axis=1)
data = pd.concat([data, quantity_sold], axis=1, join='inner')
```

III.
```
quantity_sold = df['quantity_sold'].dropna().astype(int)
data = df.drop('quantity_sold', axis=1)
data = pd.concat([data, quantity_sold], axis=1, join='inner')
```

- a. I only
- b. II only
- c. III only
- d. I and II only
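
The following sketch illustrates the general pattern of casting a column and re-attaching it with `pd.concat`. The frames `toy` and `toy_not_null` are hypothetical stand-ins, not the exam data. It requests `"int64"` explicitly rather than `int`, on the assumption that `int` may map to a narrower integer type on some platforms and library versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the exam data.
toy = pd.DataFrame({
    "widget": ["AAA", "BBB", "CCC"],
    "quantity_sold": [2.0, np.nan, 7.0],
})

# Casting a column that still contains NaN to an integer type fails.
try:
    toy["quantity_sold"].astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Once the null rows are gone, the cast succeeds.
toy_not_null = toy.dropna()
quantity_sold = toy_not_null["quantity_sold"].astype("int64")

# Re-attach the cast column; concat aligns rows on the index.
data = pd.concat(
    [toy_not_null.drop("quantity_sold", axis=1), quantity_sold],
    axis=1,
    join="inner",
)

print(cast_failed)                  # True
print(data["quantity_sold"].dtype)  # int64
```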

#### Q7

In [27]:
# Apply the net sale formula to each row, then total the per-row results
net_sale = (
    df_not_null['quantity_sold'] * df_not_null['unit_price']
    + df_not_null['tax']
    - df_not_null['discount']
)
round(net_sale.sum(), 2)

The net sale for each row is given by:

```quantity_sold * unit_price + tax - discount```

Calculate the total net sales of the data contained in the dataframe that has all rows with null values removed, rounded to 2 decimal places.

The answer is:
- a. `8383874.86`
- b. `7963979.82`
- c. `8380560.82`
- d. `99355.0`
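
Because the formula is defined per row, the order of operations matters: the total is the sum of the per-row results, which is not the same as applying the formula to the column sums. A minimal sketch on a hypothetical two-row frame (`toy` is not the exam data) shows the difference.

```python
import pandas as pd

# Hypothetical two-row frame, used only to show the order of operations.
toy = pd.DataFrame({
    "quantity_sold": [2, 3],
    "unit_price": [10.0, 20.0],
    "tax": [1.0, 2.0],
    "discount": [0.5, 0.5],
})

# Per-row net sale, then the total: (20.5) + (61.5)
per_row = toy["quantity_sold"] * toy["unit_price"] + toy["tax"] - toy["discount"]
total = round(per_row.sum(), 2)

# Applying the formula to the column sums instead: 5 * 30 + 3 - 1
sums_first = (
    toy["quantity_sold"].sum() * toy["unit_price"].sum()
    + toy["tax"].sum()
    - toy["discount"].sum()
)

print(total)       # 82.0
print(sums_first)  # 152.0
```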


#### Q8

In [34]:
aaa = df_not_null[df_not_null['widget'] == 'AAA']
aaa

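|   | widget | date_sold | quantity_sold | unit_price | tax | discount |
|---|--------|-----------|---------------|------------|-----|----------|
| 8 | AAA | 1578253210 | 13.0 | 10.75 | 4.96 | 5.49 |
| 28 | AAA | 1578014639 | 1.0 | 10.75 | 0.10 | 0.52 |
| 33 | AAA | 1579503131 | 7.0 | 10.75 | 0.98 | 3.40 |
| 35 | AAA | 1578525575 | 6.0 | 10.75 | 1.10 | 0.81 |
| 50 | AAA | 1578409447 | 1.0 | 10.75 | 0.22 | 0.36 |
| ... | ... | ... | ... | ... | ... | ... |
| 9974 | AAA | 1580535761 | 17.0 | 10.75 | 1.69 | 7.92 |
| 9975 | AAA | 1579846939 | 1.0 | 10.75 | 0.39 | 0.27 |
| 9980 | AAA | 1578959008 | 15.0 | 10.75 | 2.72 | 3.89 |
| 9985 | AAA | 1579604013 | 15.0 | 10.75 | 0.18 | 0.57 |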


In [49]:
df_qsold = aaa['quantity_sold']
df_uprice = aaa['unit_price']
df_tax = aaa['tax']
df_discount = aaa['discount']
net_sale = df_qsold * df_uprice + df_tax - df_discount
second = net_sale.sort_values(ascending=True)
second.iloc[-2]

212.56

Identify the date on which the **second** highest net sale for widget `AAA` occurred.

Just as before, the net sale formula is given by:
```quantity_sold * unit_price + tax - discount```

This sale happened on this date:
- a. 2020-01-21T16:22:07
- b. 2020-01-21T21:39:52
- c. 2020-01-26T00:30:51
- d. 2020-01-23T21:31:26
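
Going from a net sale value back to a date takes two more steps: find the row label of the second-highest value, then convert that row's `date_sold` epoch to a timestamp. The sketch below uses only four of the `AAA` rows shown in the preview above, so its result is not the answer for the full data set. The name `aaa_sample` is illustrative, and interpreting the epoch as seconds in UTC is an assumption.

```python
import pandas as pd

# Four AAA rows copied from the preview; NOT the full data set.
aaa_sample = pd.DataFrame(
    {
        "date_sold": [1578253210, 1580535761, 1578959008, 1579604013],
        "quantity_sold": [13.0, 17.0, 15.0, 15.0],
        "unit_price": [10.75, 10.75, 10.75, 10.75],
        "tax": [4.96, 1.69, 2.72, 0.18],
        "discount": [5.49, 7.92, 3.89, 0.57],
    },
    index=[8, 9974, 9980, 9985],
)

net = (
    aaa_sample["quantity_sold"] * aaa_sample["unit_price"]
    + aaa_sample["tax"]
    - aaa_sample["discount"]
)

# nlargest(2) keeps the two highest values; the last one is the second highest.
second_label = net.nlargest(2).index[-1]

# Convert that row's epoch (assumed to be seconds, UTC) to a timestamp.
sold_at = pd.to_datetime(aaa_sample.loc[second_label, "date_sold"], unit="s")

print(second_label, sold_at.isoformat())  # 9985 2020-01-21T10:53:33
```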


#### Q9

In [58]:
counts = df_not_null['widget'].value_counts()
counts

widget
BBB    1704
DDD    1667
EEE    1667
FFF    1661
CCC    1652
AAA    1647
Name: count, dtype: int64

Calculate the number of rows (sales) that each widget has generated (limit your dataframe to rows that contain no null values).

Represented as a dictionary, the result is:

a. 
```
{'AAA': 1646, 'BBB': 1705, 'CCC': 1653, 'DDD': 1666, 'EEE': 1668, 'FFF': 1661}
```

b.
```
{'AAA': 1647, 'BBB': 1704, 'CCC': 1652, 'DDD': 1667, 'EEE': 1667, 'FFF': 1661}
```

c.
```
{'AAA': 1666, 'BBB': 1666, 'CCC': 1666, 'DDD': 1666, 'EEE': 1666, 'FFF': 1670}
```

d.
```
{'AAA': 1647, 'BBB': 1705, 'CCC': 1652, 'DDD': 1667, 'EEE': 1667, 'FFF': 1661, NAN: 1}
```
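
One way to turn per-widget row counts into a dictionary is to chain `value_counts` with `to_dict`. The sketch below runs on a hypothetical single-column frame (`toy`), not the exam data, and sorts by widget name so the keys appear in alphabetical order.

```python
import pandas as pd

# Hypothetical data with one missing widget name.
toy = pd.DataFrame({"widget": ["BBB", "AAA", "BBB", None, "CCC", "AAA", "BBB"]})

# Drop rows with nulls, count rows per widget, and order the keys by name.
counts = toy.dropna()["widget"].value_counts().sort_index().to_dict()

print(counts)
```

On this toy input the counts are 2 for `AAA`, 3 for `BBB`, and 1 for `CCC`; the row with the missing name is not counted.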


#### Q10

Limiting your dataframe to rows with non-null values only, calculate the average percentage discount (rounded to 2 digits after the decimal point) of each widget.

The percentage discount for a specific row in the dataframe is given by:
```
discount / (quantity_sold * unit_price) * 100
```

Represented as a dictionary, the result is:

a. 
```
{'AAA': 2.44, 'BBB': 2.5, 'CCC': 2.49, 'DDD': 2.45, 'EEE': 2.48, 'FFF': 2.46}
```

b.
```
{'AAA': 2.35, 'BBB': 2.49, 'CCC': 2.42, 'DDD': 2.39, 'EEE': 2.44, 'FFF': 2.47}
```

c.
```
{'AAA': 2.14, 'BBB': 2.23, 'CCC': 2.2, 'DDD': 2.14, 'EEE': 2.24, 'FFF': 2.2}
```

d.
```
{'AAA': 2.65, 'BBB': 2.76, 'CCC': 2.71, 'DDD': 2.63, 'EEE': 2.72, 'FFF': 2.7}
```
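
A possible approach is to compute the percentage discount for each row first and then average those values within each widget group. The sketch below assumes a hypothetical three-row frame (`toy`), not the exam data. Averaging the per-row percentages is not the same as dividing a group's total discount by its total sales.

```python
import pandas as pd

# Hypothetical data: two AAA rows and one BBB row.
toy = pd.DataFrame({
    "widget": ["AAA", "AAA", "BBB"],
    "quantity_sold": [2, 4, 5],
    "unit_price": [10.0, 10.0, 20.0],
    "discount": [1.0, 1.0, 2.5],
})

# Percentage discount for each row.
pct = toy["discount"] / (toy["quantity_sold"] * toy["unit_price"]) * 100

# Average the per-row percentages within each widget, rounded to 2 digits.
avg = pct.groupby(toy["widget"]).mean().round(2).to_dict()

print(avg)
```

For this toy input the per-row percentages are 5.0, 2.5, and 2.5, so the averages are 3.75 for `AAA` and 2.5 for `BBB`.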
