# Assignment - Pandas
**<div style="text-align: right"> [TOTAL SCORE: 20]</div>**

In this assignment, you will practice using pandas on datasets similar to those you will encounter in real life. You will manipulate dataframes in several ways, such as merging two dataframes and creating new columns in them. If you need a refresher on pandas, go through the resources in this chapter. [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) is also a good introduction.

## Exercise 1: Importing Modules
**<div style="text-align: right"> [SCORE: 1]</div>**

**Task:** Import `numpy` and `pandas`.

In [1]:
# YOUR CODE HERE
import numpy as np
import pandas as pd

In [2]:
### INTENTIONALLY LEFT BLANK

## Exercise 2: Loading Datasets

**<div style="text-align: right"> [SCORE: 3]</div>**

You are given two CSV files at the URLs `URL1` and `URL2`.

**Task:** Load the two CSV files into the dataframes `sales1` and `sales2` and investigate the data in them. Note that you can use these URLs directly instead of file paths when importing the CSV files using pandas.
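
As a rough illustration of the note above, `pd.read_csv` accepts a URL, a local path, or a file-like object in the same way. The sketch below avoids the network by reading from an in-memory buffer; the CSV text and the name `demo` are made up for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a remote file.
csv_text = "Order ID,Units Sold\n1,10\n2,25\n"

# read_csv treats a file-like object the same way as a URL or a path.
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.shape)  # (number of rows, number of columns)
```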

In [3]:
URL1 = "https://storage.googleapis.com/codehub-data/1-A-2-4-sales1.csv"
URL2 = "https://storage.googleapis.com/codehub-data/1-A-2-4-sales2.csv"

sales1 = pd.read_csv(URL1)
sales2 = pd.read_csv(URL2)

In [4]:
assert sales1 is not None
assert sales2 is not None


## Exercise 3: Investigating Dataframes
**<div style="text-align: right"> [UNGRADED]</div>**

Take a look at the data in the two dataframes and get to know the variables in them.
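
A few common ways to get a first look at a dataframe are sketched below. The toy frame `demo` is an assumption for illustration; its column names only mimic the sales data.

```python
import pandas as pd

# Hypothetical toy frame; not the assignment data.
demo = pd.DataFrame({
    "Item Type": ["Fruits", "Meat", "Fruits"],
    "Units Sold": [10, 4, 7],
})

print(demo.head())      # first rows
print(demo.dtypes)      # type of each column
print(demo.describe())  # summary statistics for the numeric columns
n_rows, n_cols = demo.shape
```

The same calls work on `sales1` and `sales2` once they are loaded.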

In [5]:
# YOUR CODE HERE
print(sales1.columns)
print(sales2.columns)

Index(['Item Type', 'Order Date', 'Order ID', 'Ship Date', 'Units Sold',
       'Unit Price', 'Unit Cost'],
      dtype='object')
Index(['Order ID', 'Region', 'Country', 'Order Priority', 'Total Revenue',
       'Total Cost'],
      dtype='object')


## Exercise 4: Merging Two Dataframes
**<div style="text-align: right"> [SCORE: 4]</div>**
**Task:** Merge the two dataframes into a single dataframe `sales` on their `Order ID` column. Do not discard any rows from either dataframe. Use the function [`pd.merge`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.merge.html).
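
To see how the join type affects which rows survive, here is a minimal sketch on two made-up frames (`left` and `right` are hypothetical, not the assignment data). The `indicator=True` argument of `pd.merge` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames that share only some Order IDs.
left = pd.DataFrame({"Order ID": [1, 2, 3],
                     "Item Type": ["Fruits", "Meat", "Cereal"]})
right = pd.DataFrame({"Order ID": [2, 3, 4],
                      "Country": ["Libya", "Chad", "Peru"]})

# An outer join keeps every key from both sides and fills the gaps with NaN.
merged = pd.merge(left, right, on="Order ID", how="outer", indicator=True)
# An inner join keeps only the keys present in both frames.
inner = pd.merge(left, right, on="Order ID", how="inner")
print(merged)
```

Rows that appear in only one input end up with missing values in the columns contributed by the other input.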

In [6]:
# YOUR CODE HERE
# An outer join keeps rows whose Order ID appears in only one dataframe
sales = pd.merge(sales1, sales2, on="Order ID", how="outer")

In [7]:
assert(sales[sales["Order ID"]==686800706]["Country"].values == "Libya")
assert(sales[sales["Order ID"]==686800706]["Item Type"].values == "Cosmetics")

## Exercise 5: Searching for Missing Values

**<div style="text-align: right"> [SCORE: 2]</div>**

**Task:** Investigate the `sales` dataframe to find any missing values. Count the missing values in each column and store the resulting counts in the variable `NA_count`.
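
A small sketch of counting missing values, on a made-up frame (`demo` is hypothetical): `isnull()` yields a boolean frame, and summing it gives one count per column. Summing once more gives a single total.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps.
demo = pd.DataFrame({
    "Region": ["Asia", None, "Europe"],
    "Units Sold": [10.0, np.nan, np.nan],
})

per_column = demo.isnull().sum()  # Series with one count per column
total = int(per_column.sum())     # single number for the whole frame
print(per_column)
```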

In [8]:
NA_count = sales.isnull().sum()
NA_count


Item Type          0
Order Date         0
Order ID           0
Ship Date          0
Units Sold         6
Unit Price         6
Unit Cost          6
Region            24
Country           24
Order Priority    20
Total Revenue     23
Total Cost        23
dtype: int64

In [9]:
assert NA_count is not None

## Exercise 6: Filling in Missing Values

**<div style="text-align: right"> [SCORE: 3]</div>**


When solving the previous exercise, you must have noticed that several columns contain missing values.

**Task:** Fill in the missing values in the `sales` dataframe following the strategy below. 
* If any values in the `Region` and `Country` columns are missing, fill in the fields with the string `unknown`.
* If any values in the `Order Priority` column are missing, fill in the fields with the mode of the values in the column.
* If any values in the rest of the columns are missing, fill in the fields using the mean of the respective columns.


You can use any of the functions in the following references for this section:
- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)
- [Python Lambda Expressions](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions)
- [Pandas Apply Function](https://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.Series.apply.html)
- [Pandas FillNa](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.fillna.html)
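
The three fill strategies above can be sketched on a toy frame as follows. This is one possible approach under stated assumptions, not the only one; `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column.
demo = pd.DataFrame({
    "Region": ["Asia", None, "Europe", "Asia"],
    "Order Priority": ["H", "L", None, "H"],
    "Units Sold": [10.0, np.nan, 20.0, 30.0],
})

# Constant fill for a text column.
demo["Region"] = demo["Region"].fillna("unknown")
# mode() returns a Series (ties are possible), so take its first entry.
demo["Order Priority"] = demo["Order Priority"].fillna(demo["Order Priority"].mode()[0])
# numeric_only=True restricts the means to numeric columns.
demo = demo.fillna(demo.mean(numeric_only=True))
print(demo)
```

Passing a Series of column means to `fillna` fills each column with its own mean, so only columns present in that Series are touched.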

In [10]:
# YOUR CODE HERE
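sales[['Region', 'Country']] = sales[['Region', 'Country']].fillna("unknown")
# mode() returns a Series, so take its first entry
sales['Order Priority'] = sales['Order Priority'].fillna(sales['Order Priority'].mode()[0])
# numeric_only=True skips the text columns, which would otherwise raise a TypeError in pandas 2.x
sales = sales.fillna(sales.mean(numeric_only=True))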


In [11]:
assert(sales[sales["Order ID"] == 267066323]["Region"].values == "unknown")

## Exercise 7: Grouping and Analysis 
**<div style="text-align: right"> [SCORE: 3]</div>**
**Task:** For each `Item Type`, find the count, mean, and standard deviation of the `Unit Cost` column. Store the result in the variable `summary`.

References:
- [Pandas Indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html)
- [Groupby](https://pandas.pydata.org/pandas-docs/stable/groupby.html)
- [Loc](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.loc.html)
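
As a quick illustration of grouped aggregation, the sketch below computes several statistics per group in one call. The frame `demo` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical data: two item types with a handful of unit costs.
demo = pd.DataFrame({
    "Item Type": ["Fruits", "Fruits", "Meat", "Meat", "Meat"],
    "Unit Cost": [6.0, 8.0, 300.0, 360.0, 420.0],
})

# Group by one column, select another, and aggregate with a list of functions.
stats = demo.groupby("Item Type")["Unit Cost"].agg(["count", "mean", "std"])
print(stats)
```

The result is indexed by the group labels, with one column per statistic. Note that `std` is the sample standard deviation (it divides by n - 1).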

In [12]:
summary = sales.groupby('Item Type')['Unit Cost'].agg(['count','mean','std'])
summary

| Item Type | count | mean | std |
|---|---|---|---|
| *Cereal | 79 | 117.970017 | 7.643994 |
| Baby Food | 87 | 159.714613 | 2.747965 |
| Beverages | 101 | 31.79 | 0.0 |
| Clothes | 78 | 35.84 | 0.0 |
| Cosmetics | 75 | 263.33 | 0.0 |
| Fruits | 70 | 6.92 | 0.0 |
| Household | 77 | 498.41677 | 36.181194 |
| Meat | 78 | 364.69 | 0.0 |
| Office Supplies | 89 | 524.96 | 0.0 |
| Personal Care | 87 | 58.145647 | 13.763921 |


In [13]:
assert(summary.loc["Fruits","count"]==70)

## Exercise 8: Getting the Statistics for a Variable
**<div style="text-align: right"> [SCORE: 1]</div>**
**Task:** Extract the count, mean and standard deviation for "Fruits", "Vegetables", "Baby Food", and "Meat" from `summary`. Assign the result to the variable `kitchen`.
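
Selecting several rows by label can be sketched with `.loc` and a list of index labels. The frame `stats` below is a hypothetical stand-in for a grouped summary.

```python
import pandas as pd

# Hypothetical summary table indexed by item type.
stats = pd.DataFrame(
    {"count": [2, 3, 4], "mean": [7.0, 360.0, 90.0]},
    index=pd.Index(["Fruits", "Meat", "Vegetables"], name="Item Type"),
)

# Passing a list of labels returns those rows in the requested order.
subset = stats.loc[["Vegetables", "Fruits"]]
print(subset)
```

In pandas 1.0 and later, a label that is absent from the index raises a `KeyError`.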



In [14]:
kitchen = summary.loc[['Fruits','Vegetables','Baby Food', 'Meat']]
kitchen

| Item Type | count | mean | std |
|---|---|---|---|
| Fruits | 70 | 6.92 | 0.0 |
| Vegetables | 97 | 91.900323 | 9.556571 |
| Baby Food | 87 | 159.714613 | 2.747965 |
| Meat | 78 | 364.69 | 0.0 |


In [15]:
assert(kitchen.loc["Vegetables","count"]==97)

## Exercise 9: Creating New Columns
**<div style="text-align: right"> [SCORE: 2]</div>**
**Task:** Generate a new column `Total Profit` for each order (Total Profit = Total Revenue - Total Cost).

In [16]:
# YOUR CODE HERE
sales['Total Profit'] = sales['Total Revenue'] - sales['Total Cost']

In [17]:
assert(sales[sales["Order ID"]==246222341]["Total Profit"].values == 145419.62)

## Exercise 10: A Simple Insight
**<div style="text-align: right"> [SCORE: ]</div>**
**Task:** How many instances (rows) have a profit greater than 60%, taking profit as a percentage of `Total Cost`? Store this count in the variable `high_profit`.
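
When counting rows that satisfy a condition, it helps to keep in mind how boolean Series aggregate: `sum()` adds up the `True` values, while `count()` counts every non-null entry regardless of its value. A minimal sketch on made-up numbers (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical profits and costs.
demo = pd.DataFrame({
    "Total Profit": [70.0, 10.0, 90.0, 30.0],
    "Total Cost": [100.0, 100.0, 100.0, 100.0],
})

margin_pct = demo["Total Profit"] / demo["Total Cost"] * 100
mask = margin_pct > 60

n_true = int(mask.sum())   # True counts as 1, False as 0
n_all = int(mask.count())  # counts every non-null entry, True or False
print(n_true, n_all)
```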

In [18]:
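# Profit as a percentage of cost; .sum() counts the rows where the condition is True.
# (.count() would count every row, giving 1000 regardless of the condition.)
high_profit = ((sales["Total Profit"] / sales["Total Cost"]) * 100 > 60).sum()
high_profit
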

In [19]:
assert high_profit is not None