# LESSON 6: PANDAS PRACTICE
<img src="../../images/pd_logo.png" width="400px"/>

### PROBLEM:
You are provided with daily historical sales data. The dataset consists of five `.csv` files; the files and their data fields are described below:
<br>

#### File descriptions
1. `sales.csv`: Daily historical data from January 2013 to October 2015.
2. `items_list_1.csv`: Supplemental information about the items/products in list 1.
3. `items_list_2.csv`: Supplemental information about the items/products in list 2.
4. `item_categories.csv`: Supplemental information about the item categories.
5. `shops.csv`: Supplemental information about the shops.
<br>

#### Data fields
1. `shop_id`: Unique identifier of a shop.
2. `item_id`: Unique identifier of a product.
3. `item_category_id`: Unique identifier of an item category.
4. `item_cnt_day`: Number of products sold on that day.
5. `item_price`: Current price of an item.
6. `date`: Date in format dd.mm.yyyy.
7. `date_block_num`: A consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33.
8. `item_name`: Name of the item.
9. `shop_name`: Name of the shop.
10. `item_category_name`: Name of the item category.
<br>

#### Question
1. How many **items** are there **in list 1**? **In list 2**? In **only list 1** (list their names)? In **only list 2** (list their names)? In **both** lists (list their names)? Create a new csv file that **contains only the unique items** from the two lists (name the file `items.csv`).
2. How many **items** are there in `items.csv`? How many of them contain **digits in the name**? How many of them are FIFA football game items (contain **"FIFA" in the name**)?
3. How many **item categories** are there in the dataset? Which item category contains the **highest number of items**? The **lowest number of items**? List all items in each item category. Calculate the average number of items per category.
4. Which **item** has the **highest price** in each year? Which **item** has the **lowest price** in each year? Calculate the **average price** of each item in each year.
5. Which **item** has the **highest sales** in each year? Which **item** has the **lowest sales** in each year? Calculate the **average sales** of each item in each year.


### SOLUTION:
#### Preparation
Import the library and read the data files.

In [None]:
import pandas as pd

In [None]:
item_list_1_df = pd.read_csv('data/predict_future_sales/items_list_1.csv')
item_list_1_df

In [None]:
item_list_2_df = pd.read_csv('data/predict_future_sales/items_list_2.csv')
item_list_2_df

In [None]:
item_categories_df = pd.read_csv('data/predict_future_sales/item_categories.csv')
item_categories_df.head()

In [None]:
sales_df = pd.read_csv('data/predict_future_sales/sales.csv')
sales_df.head()

#### Question 1:
How many **items** are there **in list 1**? **In list 2**? In **only list 1** (list their names)? In **only list 2** (list their names)? In **both** lists (list their names)? Create a new csv file that **contains only the unique items** from the two lists (name the file `items.csv`).

In [None]:
print(f'There are {len(item_list_1_df)} items in list 1')
print(f'There are {len(item_list_2_df)} items in list 2')

In [None]:
merged_df = pd.merge(item_list_1_df, item_list_2_df, on='item_id', how='outer', indicator=True)
merged_df

In [None]:
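def process_df(_df):
    # Drop the merge indicator and every column that is entirely empty for this subset
    # (for example, the *_y columns of rows that exist only in list 1).
    _df = _df.drop(columns='_merge').dropna(axis=1, how='all')

    # Strip the '_x' / '_y' suffixes that pd.merge adds to overlapping column names.
    rename_dict = dict()
    for col_name in _df.columns:
        if col_name.endswith(('_x', '_y')):
            rename_dict[col_name] = col_name[:-2]

    new_df = _df.rename(columns=rename_dict)
    return new_df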

In [None]:
item_only_list_1_df = merged_df[merged_df._merge == 'left_only']
item_only_list_1_df = process_df(item_only_list_1_df)
item_only_list_1_df

In [None]:
item_only_list_2_df = merged_df[merged_df._merge == 'right_only']
item_only_list_2_df = process_df(item_only_list_2_df)
item_only_list_2_df

In [None]:
item_both_list_df = merged_df[merged_df._merge == 'both']
item_both_list_df = process_df(item_both_list_df)
item_both_list_df = item_both_list_df.loc[:,~item_both_list_df.columns.duplicated()]
item_both_list_df

In [None]:
print(f'There are {len(item_only_list_1_df)} items in only list 1')
print(f'There are {len(item_only_list_2_df)} items in only list 2')
print(f'There are {len(item_both_list_df)} items in both list 1 and list 2')

In [None]:
items_df = pd.concat([item_only_list_1_df, item_only_list_2_df, item_both_list_df])
items_df.to_csv('items.csv', index=False)
items_df
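
The three groups and the unique list can also be obtained without an outer merge. The following is only a minimal sketch: the two toy frames are hypothetical stand-ins for the real files, and it assumes both lists share the same columns and that an item present in both lists carries identical values in each.

```python
import pandas as pd

# Hypothetical stand-ins for items_list_1.csv and items_list_2.csv.
list_1 = pd.DataFrame({
    'item_id': [1, 2, 3],
    'item_name': ['A', 'B', 'C'],
    'item_category_id': [10, 10, 20],
})
list_2 = pd.DataFrame({
    'item_id': [2, 3, 4],
    'item_name': ['B', 'C', 'D'],
    'item_category_id': [10, 20, 30],
})

# Set operations on the key column give the three groups directly.
ids_1, ids_2 = set(list_1.item_id), set(list_2.item_id)
only_1 = list_1[list_1.item_id.isin(ids_1 - ids_2)]
only_2 = list_2[list_2.item_id.isin(ids_2 - ids_1)]
both = list_1[list_1.item_id.isin(ids_1 & ids_2)]

# Stack the two lists and keep the first occurrence of every item_id.
unique_items = (
    pd.concat([list_1, list_2], ignore_index=True)
    .drop_duplicates(subset='item_id')
    .sort_values('item_id')
    .reset_index(drop=True)
)

print(len(only_1), len(only_2), len(both), len(unique_items))
```

Because no merge takes place, no `_x` / `_y` suffixes appear and the integer columns keep their dtype.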

#### Question 2:
How many **items** are there in `items.csv`? How many of them contain **digits in the name**? How many of them are FIFA football game items (contain **"FIFA" in the name**)?

In [None]:
print(f'There are {len(items_df)} items in items.csv')

In [None]:
def check_digits(name):
    # True if at least one character of the name is a digit
    return any(char.isdigit() for char in name)

def check_fifa(name):
    # Case-insensitive search for "FIFA"
    return 'FIFA' in name.upper()

In [None]:
items_df['is_digits_in_name'] = items_df.item_name.apply(check_digits)
items_df

In [None]:
item_name_with_digits_df = items_df.loc[items_df.is_digits_in_name]

print(f'There are {len(item_name_with_digits_df)} items with digits in item_name')
item_name_with_digits_df

In [None]:
items_df['is_fifa_in_name'] = items_df.item_name.apply(check_fifa)
item_name_with_fifa_df = items_df[items_df.is_fifa_in_name]

print(f'There are {len(item_name_with_fifa_df)} items with FIFA in item_name')
item_name_with_fifa_df
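
The two checks can also be written with pandas' vectorized string methods instead of `apply`. This is a sketch on made-up item names; for ordinary names it should give the same counts as the helper functions, and `na=False` is an added assumption that missing names count as "no match".

```python
import pandas as pd

# Hypothetical item names, used only for illustration.
items = pd.DataFrame(
    {'item_name': ['FIFA 14', 'Fifa Street', 'Chess', 'Need for Speed 2']}
)

# regex=True lets \d match any digit; na=False treats missing names as no match.
has_digits = items.item_name.str.contains(r'\d', regex=True, na=False)

# case=False ignores letter case; regex=False searches for the literal text.
has_fifa = items.item_name.str.contains('FIFA', case=False, regex=False, na=False)

print(int(has_digits.sum()), int(has_fifa.sum()))
```

Both results are boolean Series, so they can be used directly as row filters, for example `items[has_fifa]`.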

#### Question 3:
How many **item categories** are there in the dataset? Which item category contains the **highest number of items**? The **lowest number of items**? List all items in each item category. Calculate the average number of items per category.

In [None]:
import numpy as np

In [None]:
print(f'There are {len(item_categories_df)} categories in the dataset')

In [None]:
merged_item_cat_df = pd.merge(items_df, item_categories_df, how='left', on='item_category_id')
grouped_merged_item_cat_df = merged_item_cat_df.groupby(
    by=['item_category_id', 'item_category_name']).agg({'item_id': list}).reset_index()
grouped_merged_item_cat_df['num_of_items'] = grouped_merged_item_cat_df.item_id.apply(len)
grouped_merged_item_cat_df

In [None]:
print(f'Average number of items per category: {np.average(grouped_merged_item_cat_df.num_of_items)}')

In [None]:
max_num_of_items_df = grouped_merged_item_cat_df[
    grouped_merged_item_cat_df.num_of_items == np.max(grouped_merged_item_cat_df.num_of_items)]
max_num_of_items_df

In [None]:
min_num_of_items_df = grouped_merged_item_cat_df[
    grouped_merged_item_cat_df.num_of_items == np.min(grouped_merged_item_cat_df.num_of_items)]
min_num_of_items_df
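
If only the counts are needed, and not the list of item ids, `groupby(...).size()` is a shorter route. A sketch on a hypothetical items table:

```python
import pandas as pd

# Hypothetical items table; only item_category_id matters here.
items = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5, 6],
    'item_category_id': [10, 10, 10, 20, 20, 30],
})

# Number of items per category, indexed by item_category_id.
items_per_category = items.groupby('item_category_id').size()

largest = items_per_category.idxmax()   # category with the most items
smallest = items_per_category.idxmin()  # category with the fewest items
average = items_per_category.mean()     # average number of items per category

print(largest, smallest, average)
```

Note that `idxmax` and `idxmin` return only the first category when several are tied, whereas the boolean filter used in the cells above keeps every tied category. In both approaches a category that has no items at all does not show up in the counts.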

#### Question 4:
Which **item** has the **highest price** in each year? Which **item** has the **lowest price** in each year? Calculate the **average price** of each item in each year.

In [None]:
sales_item_name_df = pd.merge(sales_df, items_df, on='item_id', how='left')
# Keep only the year: the date strings are split on '.', so they are expected to look like dd.mm.yyyy
sales_item_name_df['date'] = sales_item_name_df['date'].str.split('.').str[-1]
sales_item_name_df
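
Splitting the string is enough to get the year, but parsing the column as real dates also gives access to months and weekdays. A sketch with `pd.to_datetime`, assuming the dates are written day.month.year with dots, as the split on `'.'` implies; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample in the assumed dd.mm.yyyy layout.
sales = pd.DataFrame({'date': ['02.01.2013', '15.06.2014', '31.10.2015']})

# An explicit format avoids day/month ambiguity and raises on malformed rows.
parsed = pd.to_datetime(sales.date, format='%d.%m.%Y')
sales['year'] = parsed.dt.year
sales['month'] = parsed.dt.month

print(sales.year.tolist())
```

With this approach the year is an integer, whereas the cells in this notebook keep it as a string, which is why lookups such as `price['2013']` use string keys.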

In [None]:
def find_highest_lowest_price(df, year):
    sales_year_df = df[df.date == year]
    # Rows whose price equals the yearly maximum / minimum (ties are kept)
    highest_df = sales_year_df[sales_year_df.item_price == np.max(sales_year_df.item_price)]
    lowest_df = sales_year_df[sales_year_df.item_price == np.min(sales_year_df.item_price)]
    # Average price of each item within the year, as the question asks
    average_price = sales_year_df.groupby('item_id').item_price.mean().reset_index()
    return highest_df, lowest_df, average_price

In [None]:
year_list = sales_item_name_df.date.unique()

price = dict()
for year in year_list:
    highest_df, lowest_df, average_price = find_highest_lowest_price(sales_item_name_df, year)
    price[year] = dict()
    price[year]['highest'] = highest_df
    price[year]['lowest'] = lowest_df
    price[year]['average_price'] = average_price


In [None]:
price['2013']['highest']

In [None]:
price['2013']['lowest']

In [None]:
price['2013']['average_price']

In [None]:
price['2014']['highest']

In [None]:
price['2014']['lowest']

In [None]:
price['2014']['average_price']

In [None]:
price['2015']['highest']

In [None]:
price['2015']['lowest']

In [None]:
price['2015']['average_price']
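
The per-year loop can also be replaced by a single `groupby`. The sketch below runs on hypothetical sales rows and assumes the year has already been extracted into its own column, here called `year`.

```python
import pandas as pd

# Hypothetical sales rows; `year` is assumed to be extracted already.
sales = pd.DataFrame({
    'year': ['2013', '2013', '2013', '2014', '2014'],
    'item_id': [1, 1, 2, 1, 3],
    'item_price': [100.0, 120.0, 50.0, 90.0, 300.0],
})

# idxmax / idxmin return the row label of the extreme price in each year.
highest = sales.loc[sales.groupby('year')['item_price'].idxmax()]
lowest = sales.loc[sales.groupby('year')['item_price'].idxmin()]

# Mean price of every item within every year.
avg_price = (
    sales.groupby(['year', 'item_id'])['item_price']
    .mean()
    .reset_index(name='avg_price')
)

print(highest, lowest, avg_price, sep='\n\n')
```

As with any `idxmax`-based lookup, only one row per year is returned, so ties for the highest or lowest price are not all listed.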

#### Question 5:
Which **item** has the **highest sales** in each year? Which **item** has the **lowest sales** in each year? Calculate the **average sales** of each item in each year.

In [None]:
sales_item_name_df

In [None]:
def find_highest_lowest_sales(df, year):
    sales_year_df = df[df.date == year]
    # Total units sold per item in the year; the string 'sum' selects pandas' own aggregation
    grouped_df = sales_year_df.groupby(by='item_id').agg({'item_cnt_day': 'sum'}).reset_index()

    highest_df = grouped_df[grouped_df.item_cnt_day == np.max(grouped_df.item_cnt_day)]
    lowest_df = grouped_df[grouped_df.item_cnt_day == np.min(grouped_df.item_cnt_day)]
    # Mean of the per-item yearly totals
    average_sales = np.average(grouped_df.item_cnt_day)
    return highest_df, lowest_df, average_sales

In [None]:
sales = dict()
for year in year_list:
    highest_df, lowest_df, average_sales = find_highest_lowest_sales(sales_item_name_df, year)
    sales[year] = dict()
    sales[year]['highest'] = highest_df
    sales[year]['lowest'] = lowest_df
    sales[year]['average_sales'] = average_sales

In [None]:
sales['2013']['highest']

In [None]:
sales['2013']['lowest']

In [None]:
sales['2013']['average_sales']

In [None]:
sales['2014']['highest']

In [None]:
sales['2014']['lowest']

In [None]:
sales['2014']['average_sales']

In [None]:
sales['2015']['highest']

In [None]:
sales['2015']['lowest']

In [None]:
sales['2015']['average_sales']
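
The yearly totals for all years can be computed in one pass by grouping on year and item together. A sketch on hypothetical data; the column names `year` and `total_sold` are illustrative and do not appear in the dataset.

```python
import pandas as pd

# Hypothetical daily sales; `year` is assumed to be extracted already.
sales = pd.DataFrame({
    'year': ['2013', '2013', '2013', '2014', '2014', '2014'],
    'item_id': [1, 2, 1, 2, 3, 3],
    'item_cnt_day': [5, 1, 2, 4, 1, 1],
})

# Units sold per item per year.
yearly = (
    sales.groupby(['year', 'item_id'])['item_cnt_day']
    .sum()
    .reset_index(name='total_sold')
)

# Best and worst selling item of each year (first one in case of ties).
best = yearly.loc[yearly.groupby('year')['total_sold'].idxmax()]
worst = yearly.loc[yearly.groupby('year')['total_sold'].idxmin()]

# Average of the per-item yearly totals, one value per year.
avg_per_year = yearly.groupby('year')['total_sold'].mean()

print(best, worst, avg_per_year, sep='\n\n')
```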