---

### Investigating sales data

- In this notebook, you will investigate data about sales representatives in four regions
- To answer the questions in the notebook, you need to carefully investigate the structure of the dataframe
- Use tools like `df.columns` and `df.head()` to get a sense of the structure of the data before attempting each question
- Be sure you understand what the data represents. That is almost always the first step in a data analysis problem
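
As a rough illustration of that first look, here is a minimal sketch on a toy dataframe. The column names `name`, `city`, and `score` are made up for this example and are not the columns in `sales.csv`.

```python
import pandas as pd

# Toy data with hypothetical columns; the real sales.csv will look different
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "city": ["Boston", "Atlanta", None],
    "score": [10, 20, 30],
})

column_names = list(toy.columns)        # which columns exist
first_rows = toy.head(2)                # the first two rows
missing_per_column = toy.isna().sum()   # number of missing values in each column

print(column_names)
print(first_rows)
print(missing_per_column)
```

Counting missing values per column is one quick way to spot which fields may need to be filled in later.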

### Sales to region

- We will get started by writing a helper function you will use throughout the problem set
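
One possible way to build a lookup dictionary from two columns is sketched below on toy data. The columns `item` and `category` are hypothetical, and this is only one approach among several; you still need to examine the sales data to decide how it applies there.

```python
import pandas as pd

# Toy data with hypothetical columns; some categories are missing
toy = pd.DataFrame({
    "item": ["apple", "carrot", "apple", "pear", "carrot"],
    "category": ["fruit", "vegetable", None, "fruit", None],
})

# Keep only rows where the category is known, one row per item
known = toy.dropna(subset=["category"]).drop_duplicates(subset=["item"])

# Pair each item with its category to form a plain dictionary
item_to_category = dict(zip(known["item"], known["category"]))
print(item_to_category)
```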

In [None]:
import pandas as pd
from pandas import DataFrame
df = pd.read_csv("sales.csv")

In [None]:
def get_city_to_region(input_df: DataFrame) -> dict:
    '''
    Create a dictionary that maps a city to a region
    
    We will use this to fill in missing data later in the problem set
    
    For instance, if a sales rep is located in Boston you know they are in the 
    northeast region. If a sales rep is located in Atlanta you know they are in
    the southeast region. Be sure to look carefully at the dataset to 
    understand how to construct this mapping. It will be clear from
    context if you examine the data
    '''
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This is a test

df = pd.read_csv("sales.csv")
assert type(get_city_to_region(df)) == dict, "The return type should be a dictionary"

In [None]:
# This is a test

df = pd.read_csv("sales.csv")
assert get_city_to_region(df)["Atlanta"] == "SE", "The region for Atlanta should be SE"
assert get_city_to_region(df)["Boston"] == "NE", "The region for Boston should be NE"

In [None]:
# this is a hidden test of get_city_to_region


In [None]:
def fill_in_regions(input_df: DataFrame) -> DataFrame:
    '''
    After some initial analysis, you realize that
    some data is missing for some sales reps. The sales 
    rep's city is included in the data, but the region
    is sometimes left out. 
    
    Use your get_city_to_region function to fill in any
    missing regions in the dataset. Then return the same 
    dataset with any missing regions filled in.

    Again, be sure to look carefully at the data to understand how to 
    correctly fill out this function. Examine the output of
    this function. Does it make sense?
    '''
    
    city2region = get_city_to_region(input_df)
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This is a test

df = pd.read_csv("sales.csv")
assert type(fill_in_regions(df)) == DataFrame, "The return type should be a data frame"

In [None]:
# This is a test

df = pd.read_csv("sales.csv")
filled = fill_in_regions(df)
assert set(filled[filled["home_city"] == "Boston"]["sales_region"]) == {"NE"}, "Sales reps from Boston should be in the NE region"

In [None]:
# This is a test

df = pd.read_csv("sales.csv")
filled = fill_in_regions(df)
assert filled["sales_region"].isna().sum() == 0, "There should not be any null regions in your answer"

In [None]:
def mean_sales_by_region(input_df: DataFrame) -> DataFrame:
    '''
    Return the mean sales by region for the dataframe
    
    Use the region as the index for the output dataframe. If you pass the next
    test, you have done that
    '''
    input_df = fill_in_regions(input_df)
    input_df = input_df.drop("sales_rep_id_number", axis=1)
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# This is a test
assert int(mean_sales_by_region(df).loc["NE"]["sales"]) == 2546

In [None]:
# This is a test

total = mean_sales_by_region(df).reset_index()["sales"].sum()
assert round(total) == 10013

In [None]:
# This is a hidden test


### Challenge problem 
- The challenge problem is hard! If you get everything but the challenge problem you will earn 21/25 on the assignment
- Say you have a Series of numerical values. One value will be the biggest, one will be the smallest.
- There will be some value at the midpoint of the series, meaning it is bigger than half of the elements of the series and smaller than the other half.
- We can describe this value as the 50th percentile
- Similarly, if we have some value that is bigger than 10% of a series, we can call it the 10th percentile.
- To get percentiles in pandas, use the [quantile](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html) function.
- This returns a value that is higher than a fraction `q` of the values in the series
- For instance, if `q=.6` then the quantile function will return a value that is bigger than 60% of the other values in the series.
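
More than one value can satisfy that description, so the result depends on how pandas interpolates between neighboring elements. The sketch below compares the default `linear` interpolation with the `lower` and `higher` options on the integers 0 through 999; the variable names are chosen just for this illustration.

```python
import pandas as pd

values = pd.Series(range(1000))  # the integers 0, 1, ..., 999

# Default behavior: interpolate linearly between the two nearest elements
linear = values.quantile(q=0.6)

# Alternatives: snap to the nearest element below or above
lower = values.quantile(q=0.6, interpolation="lower")
higher = values.quantile(q=0.6, interpolation="higher")

# Fraction of the series that is strictly smaller than the linear result
fraction_below = (values < linear).mean()

print(linear, lower, higher, fraction_below)
```

Each of the three results sits between the same pair of neighboring elements, so each is larger than the same 60% of the series.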

In [None]:
# here is an example of the quantile function
example = []
for n in range(1000):
    example.append({"value": n})
dfmini = pd.DataFrame(example)

dfmini['value'].quantile(q=.6) # note this value is ambiguous, 599.8 would also work

For the challenge problem, add a new column to the dataframe called `bonus` which sets a Boolean to True for all workers who sell more than 98% of the other sales reps **in their region**. The `bonus` column should be False otherwise.

**Important:** your code must run without any pandas warnings to get full credit for the assignment. You may have to do a little research online and a little thinking to find out how to resolve any warnings. Understanding how to resolve pandas warnings is an important programming skill. A warning will appear as a little box with a warning message below your code. If you don't see a box like this, you don't have any warnings.

_Hint_: One possible warning you might see when coding will arise if you try to set values on a subset of a dataframe. Think about ways to avoid this, perhaps using [copy](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html).
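
To illustrate the hint, here is a minimal sketch on toy data with hypothetical columns `team` and `score`. Calling `copy` on the filtered rows gives an independent dataframe, so adding a column to it does not touch the original.

```python
import pandas as pd

# Toy data with hypothetical columns
toy = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 5, 3]})

# Take an explicit copy of the filtered rows before modifying them
subset = toy[toy["team"] == "a"].copy()

# Adding a column to the copy leaves the original dataframe unchanged
subset["high"] = subset["score"] > 2

print(subset)
print(toy.columns.tolist())
```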

In [None]:
def bonus_workers(input_df: DataFrame) -> DataFrame:
    '''
    Return the same data frame with an 
    additional boolean column called bonus, set to True 
    for all sales reps who had sales that were
    higher than 98% of other reps in their **region**
    and False otherwise
    '''
    # YOUR CODE HERE
    raise NotImplementedError()

df = pd.read_csv("sales.csv")
df2 = bonus_workers(df)

In [None]:
# this is a hidden test
