# Analyzing Retail Purchases with Pandas 

In this assignment, you are a marketing analyst for a popular e-commerce site. The company is putting together a marketing presentation that highlights its last decade in business. You have specifically been tasked with pulling figures for the year 2011. 

You have been provided with a CSV file of sales data that you will first need to load and clean. In order to thoroughly clean  the data, you must:
* 		Convert columns that contain dates to a `datetime` type
* 		Convert any columns that contain negative values to positive ones
* 		Remove rows containing `none` values
* 		Remove rows where the unit price of an item is equal to 0
* 		Select only 2011 sales data 

Once your dataset is clean, you will analyze it to collect the following figures for 2011:
* 		Total items sold
* 		Total revenue
* 		Total number of unique items sold 
* 		Average Number of Orders per Customer
* 		Average Value of Each Invoice

---
### Getting Started
To get started, download the following files:
- `Unit 20 - Technical - Unsolved.ipynb` (_this notebook_)
- `Transactions.csv`

Place these together in to a dedicated directory on your hard drive. We recommend creating a folder in your `Documents` directory for this week of class, as follows:

```
Documents/
  Term III/
    Week 20/
      Unit 20 - Technical - Unsolved.ipynb
      Transactions.csv
```

Then, start Jupyter Notebook in the `Week20` directory, and open `Unit 20 - Technical - Unsolved.ipynb` in your browser. Make sure the `Transactions.csv` file lives in the same directory.

---

### Problem Structure
Each problem will be accompanied by:
- **Instructions**
  - Each problem features a markdown cell explaining the problem.
- **Unfinished Code Cells**
  - Each problem has unfinished code cells, where you will write code to solve the problem.
  - Cells will contain either starter code for you to finish, or a comment explaining what your code should do.
- **Expected Output**. 
  - Many unfinished code cells will have output below them. You will be expected to write code that produces the same output.
  - Some unfinished code cells do _not_ have output below them. This is simply because not all code will generate output. Your solutions for these cells should _not_ print anything.

---
  
### Deliverables
To receive credit for this assignment, you must submit the following files:
- Your completed Jupyter Notebook

Your completed Jupyter Notebook will be this file, but with all of the problems solved. This is the only file you will need to submit.

When you're done with the assignment, run all cells to verify that your code executes as expected. Then, save and submit this notebook.

Good luck!

---

# Part 1: Loading & Exploring Data
All data analysis projects start with the same steps: Loading the data, inspecting its types, converting data types and fixing erroneous or missing values. Only when this is finished can you safely proceed to analysis.

In Part 1, you will perform all of these steps on the data in `Transactions.csv`. 

### Problem 1: Load Transactions Data
`Transactions.csv` contains a record of purchases at a retail gift shop. In this problem, you must:
- Load `Transactions.csv` with `pandas` into a DataFrame called `transactions`
- Print the first 5 rows of the DataFrame

Printing the first 5 rows of the DataFrame will reveal the column names, as well as what kind of data each column contains.

---

Your code should print the following:

```
InvoiceNo	InvoiceDate	CustomerID	StockCode	UnitPrice	QuantitySold	Description
0	536365	12/1/2010 8:26	17850.0	85123A	2.55	6	WHITE HANGING HEART T-LIGHT HOLDER
1	536365	12/1/2010 8:26	17850.0	71053	3.39	6	WHITE METAL LANTERN
2	536365	12/1/2010 8:26	17850.0	84406B	2.75	8	CREAM CUPID HEARTS COAT HANGER
3	536365	12/1/2010 8:26	17850.0	84029G	3.39	6	KNITTED UNION FLAG HOT WATER BOTTLE
4	536365	12/1/2010 8:26	17850.0	84029E	3.39	6	RED WOOLLY HOTTIE WHITE HEART.

InvoiceNo             object
InvoiceDate           object
CustomerID           float64
StockCode             object
UnitPrice            float64
Quantity               int64
Description           object
dtype: object
```

<hr>

**Hints**
- Recall that `dataframe.head(n)` prints the first `n` rows of `dataframe`. 

In [1]:
# TODO: Provided Data -- Do NOT Edit!
filename = 'Transactions.csv'

In [2]:
# TODO: Load `Transactions.csv` into `transactions` DataFrame
import pandas as pd
transactions = pd.read_csv('Transactions.csv', sep =',')

In [3]:
# TODO: Print first 5 rows of `transactions`
transactions.head(5)

Unnamed: 0,InvoiceNo,InvoiceDate,CustomerID,StockCode,UnitPrice,QuantitySold,Description,QuantityRemaining
0,536365,12/1/2010 8:26,17850.0,85123A,2.55,6,WHITE HANGING HEART T-LIGHT HOLDER,3
1,536365,12/1/2010 8:26,17850.0,71053,3.39,6,WHITE METAL LANTERN,3
2,536365,12/1/2010 8:26,17850.0,84406B,2.75,8,CREAM CUPID HEARTS COAT HANGER,4
3,536365,12/1/2010 8:26,17850.0,84029G,3.39,6,KNITTED UNION FLAG HOT WATER BOTTLE,3
4,536365,12/1/2010 8:26,17850.0,84029E,3.39,6,RED WOOLLY HOTTIE WHITE HEART.,3


In [4]:
#transactions.dtypes
transactions.dtypes

InvoiceNo             object
InvoiceDate           object
CustomerID           float64
StockCode             object
UnitPrice            float64
QuantitySold           int64
Description           object
QuantityRemaining      int64
dtype: object

### Problem 2: Convert Type of `InvoiceDate` Column
Note that the `InvoiceDate` column has been imported as an `object`. In this problem, you will convert `InvoiceDate` into a `datetime` column.

When you're done, print the DataFrame's `dtypes` to verify that your changes have taken effect.

---

Your code should print the following:

```
InvoiceNo              object
InvoiceDate    datetime64[ns]
CustomerID            float64
StockCode              object
UnitPrice             float64
Quantity                int64
Description            object
dtype: object
```

---

**Hints**
- Pandas has a `to_datetime` method that can convert a Series of date-formatted strings to actual datetime objects. It is called as follows: `pd.to_datetime(dataframe.ColumnWithDateFormattedStrings)`.

In [5]:
# TODO: Convert `InvoiceDate` to datetime values
transactions['InvoiceDate'] = pd.to_datetime(transactions['InvoiceDate'])

In [6]:
# TODO: Print datatypes of `transactions` DataFrame
transactions.dtypes

InvoiceNo                    object
InvoiceDate          datetime64[ns]
CustomerID                  float64
StockCode                    object
UnitPrice                   float64
QuantitySold                  int64
Description                  object
QuantityRemaining             int64
dtype: object

In [7]:
transactions.head()

Unnamed: 0,InvoiceNo,InvoiceDate,CustomerID,StockCode,UnitPrice,QuantitySold,Description,QuantityRemaining
0,536365,2010-12-01 08:26:00,17850.0,85123A,2.55,6,WHITE HANGING HEART T-LIGHT HOLDER,3
1,536365,2010-12-01 08:26:00,17850.0,71053,3.39,6,WHITE METAL LANTERN,3
2,536365,2010-12-01 08:26:00,17850.0,84406B,2.75,8,CREAM CUPID HEARTS COAT HANGER,4
3,536365,2010-12-01 08:26:00,17850.0,84029G,3.39,6,KNITTED UNION FLAG HOT WATER BOTTLE,3
4,536365,2010-12-01 08:26:00,17850.0,84029E,3.39,6,RED WOOLLY HOTTIE WHITE HEART.,3


### Problem 3: Fix Negative `QuantitySold` Values
Next, you will correct erroneously imported data by verifying that the `QuantitySold` column contain only _positive_ values. Follow the steps below:
- Count the number of values in `QuantitySold` that are less than zero
- Flip each negative number in `QuantitySold` to a positive number
  - E.g., if a row contains a `QuantitySold` of `-15`, it should be "flipped" to `15`.


---

Your code should print the following:

```
InvoiceNo      8905
InvoiceDate    8905
CustomerID     8905
StockCode      8905
UnitPrice      8905
Quantity       8905
Description    8905
dtype: int64

InvoiceNo      0
InvoiceDate    0
CustomerID     0
StockCode      0
UnitPrice      0
Quantity       0
Description    0
dtype: int64
```

In [8]:
transactions[transactions['QuantitySold'] < 0].count()

InvoiceNo            10624
InvoiceDate          10624
CustomerID            8905
StockCode            10624
UnitPrice            10624
QuantitySold         10624
Description           9762
QuantityRemaining    10624
dtype: int64

In [9]:
# TODO: Flip every negative element in `transactions.Quantity`
transactions.loc[transactions.QuantitySold < 0, 'QuantitySold'] = -transactions.loc[transactions.QuantitySold < 0, 'QuantitySold']

In [10]:
transactions[transactions['QuantitySold'] < 0].count()

InvoiceNo            0
InvoiceDate          0
CustomerID           0
StockCode            0
UnitPrice            0
QuantitySold         0
Description          0
QuantityRemaining    0
dtype: int64

### Problem 4: Handling Missing Values
Now that your data types are correct, you must remove any rows containing `None` values. Follow the steps below:
- Count the number of `None` values in each column
  - This time, _some_ columns should have `None` values
- If any column contains `None` values, drop the corresponding rows
- Count the number of `None` values in each column
  - This time, no column should have any `None` values

Modify your DataFrame in-place when you drop `None` values.

<hr>

Your code should print the following:

```
InvoiceNo                 0
InvoiceDate               0
CustomerID           119449
StockCode                 0
UnitPrice                 0
Quantity                  0
Description            1329
dtype: int64

InvoiceNo            0
InvoiceDate          0
CustomerID           0
StockCode            0
UnitPrice            0
Quantity             0
Description          0
dtype: int64
```

<hr>

**Hints**
- Use the `isna` and `sum` methods to count the number of `None` values in each column.
- Use `dropna` with the `inplace` argument to drop corrupt rows from your DataFrame.
- Make sure to drop along the `row` axis.

In [11]:
# TODO: Use `isna` and `sum` to determine if columns contain `None` values
transactions.isna().sum()

InvoiceNo                 0
InvoiceDate               0
CustomerID           135080
StockCode                 0
UnitPrice                 0
QuantitySold              0
Description            1454
QuantityRemaining         0
dtype: int64

In [12]:
# TODO: Drop `any` rows with null values on the `rows` axis `inplace`
transactions.dropna(axis = 'rows',how = 'any', inplace=True)

In [13]:
# TODO: Use `isna` and `sum` to verify columns no longer contain `None` values
transactions.isna().sum()

InvoiceNo            0
InvoiceDate          0
CustomerID           0
StockCode            0
UnitPrice            0
QuantitySold         0
Description          0
QuantityRemaining    0
dtype: int64

### Problem 5: Removing Rows with `UnitPrice` of `0`
Finally, you will ensure that the `UnitPrice` column contains only _positive_ values. Follow the steps below:
- Count the number of values in `UnitPrice` that are equal to _or_ less than zero
- Set `transactions` equal to the subset of rows with a `UnitPrice` _greater_ than zero
- Count the number of values in `UnitPrice` that are equal to _or_ less than zero to verify that your code worked

Your code should print the following:

```
40

0
```

In [14]:
transactions.loc[transactions['UnitPrice'] <= 0, 'UnitPrice'].count()

40

In [15]:
# TODO: Filter out all rows whose `UnitPrice` is GREATER than 0
positive_filter = transactions['UnitPrice'] > 0

In [16]:
# TODO: Set `positive_transactions` equal to a `copy` of the rows in `transactions` that match the `positive_filter` condition
positive_transactions = transactions.loc[positive_filter].copy()

In [17]:
# TODO: Filter out all values in `UnitPrice` that are less than or equal to 0
positive_transactions.loc[positive_transactions.UnitPrice <= 0, 'UnitPrice'].count()

0

### Problem 6: Selecting Only Data from 2011
Next, you will filter for _only_ data from 2011. Use a filter to set `transactions` equal to the subset of rows whose `InvoiceDate` occurs in 2011.

In [18]:
# TODO: Select rows from year `2011`
transactions[transactions['InvoiceDate'].dt.year == 2011]

Unnamed: 0,InvoiceNo,InvoiceDate,CustomerID,StockCode,UnitPrice,QuantitySold,Description,QuantityRemaining
42481,539993,2011-01-04 10:00:00,13313.0,22386,1.95,10,JUMBO BAG PINK POLKADOT,5
42482,539993,2011-01-04 10:00:00,13313.0,21499,0.42,25,BLUE POLKADOT WRAP,12
42483,539993,2011-01-04 10:00:00,13313.0,21498,0.42,25,RED RETROSPOT WRAP,12
42484,539993,2011-01-04 10:00:00,13313.0,22379,2.10,5,RECYCLING BAG RETROSPOT,2
42485,539993,2011-01-04 10:00:00,13313.0,20718,1.25,10,RED RETROSPOT SHOPPER BAG,5
...,...,...,...,...,...,...,...,...
541904,581587,2011-12-09 12:50:00,12680.0,22613,0.85,12,PACK OF 20 SPACEBOY NAPKINS,6
541905,581587,2011-12-09 12:50:00,12680.0,22899,2.10,6,CHILDREN'S APRON DOLLY GIRL,3
541906,581587,2011-12-09 12:50:00,12680.0,23254,4.15,4,CHILDRENS CUTLERY DOLLY GIRL,2
541907,581587,2011-12-09 12:50:00,12680.0,23255,4.15,4,CHILDRENS CUTLERY CIRCUS PARADE,2


# Part 2: Simple Analysis


### Problem 1: Compute Total Items Sold
Next, compute the total `QuantitySold` of items sold.

Your code should print the following:

```
5442620
```

In [19]:
# TODO: Find sum of `QuantitySold` column
transactions['QuantitySold'].sum()

5456504

### Problem 2: Compute Total Revenue
Now, compute the total amount of money generated by the transactions in this data set.

You will solve this problem in two parts:
- Add a new to your DataFrame column, called `TotalPrice`, using the formula below:
  - `TotalPrice = Quantity X UnitPrice`
- Take the sum of the `TotalPrice` column to compute total revenue.

Your code should print the following:
```
9522749.994000005
```

In [20]:
# TODO: Add `TotalPrice` column
transactions['TotalPrice'] = transactions['QuantitySold'] * transactions['UnitPrice']

In [21]:
# Compute sum of `TotalPrice`
total_revenue = transactions['TotalPrice'].sum()
print(total_revenue)

9522749.994000003


### Problem 3: Number of Unique Items Sold
Next, you will determine how many _unique_ items appear in the dataset using the `StockCode` column. 

Your code should print the following:

```
3612
```

In [22]:
# TODO: Count unique `StockCode` entries
transactions['StockCode'].nunique()

3684

### Problem 4: Average Number of Orders per Customer
Next, you will determine how orders each customer made on average.

Follow the steps below:
- Save the number of unique customers to a variable, called `unique_customers`
- Save the number of unique invoices to a variable, called `unique_invoices`
- Compute the average number of invocies per customer

Your code should print the following:

```
4.826302144708932
```

In [23]:
# TODO: Count number of unique customers
unique_customers = transactions['CustomerID'].nunique()

In [24]:
# TODO: Count number of unique invoices 
unique_invoices = transactions['InvoiceNo'].nunique()

In [25]:
# TODO: Compute average number of invoices per customer
unique_invoices/unique_customers


5.07548032936871

### Problem 5: Average Value of Each Invoice
Finally, compute the average value of a customer invoice.

Follow the steps below:
- Count the number of unique invoices
- Compute the total sales value
- Use the above two values to compute the average value of an invoice 

Your code should print the following:
```
436.1718055474166
```

In [26]:
# TODO: Count number of unique invoices 
unique_invoices = transactions['InvoiceNo'].nunique()

In [27]:
# TODO: Total sales value
total_revenue

9522749.994000003

In [28]:
# TODO: Average value of each invoice
average_value_of_invoice = total_revenue/unique_invoices
print(average_value_of_invoice)

429.1460114465977
