# Questions
1. Find top 10 customers in the last 2 years
2. Find top 10 customers in the last 5 years
3. Find top 10 customers with highest 5 year average amount spent
4. Find top 10 customers with highest average transaction amount
5. Export the top 10 high value loyal customers (create new DF and drop columns example)

# Basic Operations in Pandas DataFrames

In [15]:
#Import Packages
import pandas as pd

In [16]:
#Import Dataset
customers = pd.read_csv('Retail_Data_Customers_Summary.csv')

## Find top 10 customers in the last 2 years
**Concepts covered:**
1. Creating New Calculated Columns - Addition (+) operator
2. Dropping columns using .drop()
3. Introduction to (axis = 1) argument

### Adding using '+' Operator

In [17]:
#We can add columns using the '+' operator. It adds element-wise, one row at a time
customers['tran_amount_2_yr'] = customers['tran_amount_2014'] + customers['tran_amount_2015']

In [39]:
#We will shortlist the columns we need to get to the answer
cols_2 = ['customer_id','tran_amount_2011','tran_amount_2012','tran_amount_2013','tran_amount_2014','tran_amount_2015','tran_amount_2_yr']

In [40]:
#Note: where either input column is NaN, the calculated column is NaN as well
customers[cols_2].head()

Unnamed: 0,customer_id,tran_amount_2011,tran_amount_2012,tran_amount_2013,tran_amount_2014,tran_amount_2015,tran_amount_2_yr
0,CS2945,153.0,516.0,173.0,1029.0,40.0,1069.0
1,CS4074,269.0,429.0,737.0,1027.0,,
2,CS4798,153.0,536.0,414.0,1001.0,47.0,1048.0
3,CS4424,547.0,380.0,921.0,984.0,101.0,1085.0
4,CS5057,290.0,235.0,509.0,974.0,,
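The NaN behaviour noted above can be reproduced on a tiny frame. This is a minimal sketch: the column names and values are copied from the first two rows of the output above, while `df`, `plus_total` and `add_total` are names made up for the illustration. It also shows `Series.add(fill_value=0)`, one possible alternative that treats a missing operand as 0 for that single addition.

```python
import pandas as pd

# Two rows taken from the output above; the second has no 2015 amount.
df = pd.DataFrame({
    "tran_amount_2014": [1029.0, 1027.0],
    "tran_amount_2015": [40.0, None],
})

# '+' propagates NaN: if either operand is missing, the result is missing.
plus_total = df["tran_amount_2014"] + df["tran_amount_2015"]

# .add(fill_value=0) substitutes 0 for a missing operand in this operation only;
# the underlying columns are left unchanged.
add_total = df["tran_amount_2014"].add(df["tran_amount_2015"], fill_value=0)

print(plus_total.tolist())
print(add_total.tolist())
```

Whether a missing year should count as zero spend is a judgement about the data; the notebook takes that view in the next step.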


### .fillna() to replace missing values 

In [42]:
#We have to replace NaN with zeros using the .fillna() method across all the columns
customers.fillna(0, inplace = True)
customers[cols_2].head()
#Note that tran_amount_2_yr now shows 0 for those rows: the column was calculated before the fill,
#so its NaN was itself replaced with 0, which is not the correct total

Unnamed: 0,customer_id,tran_amount_2011,tran_amount_2012,tran_amount_2013,tran_amount_2014,tran_amount_2015,tran_amount_2_yr
0,CS2945,153.0,516.0,173.0,1029.0,40.0,1069.0
1,CS4074,269.0,429.0,737.0,1027.0,0.0,0.0
2,CS4798,153.0,536.0,414.0,1001.0,47.0,1048.0
3,CS4424,547.0,380.0,921.0,984.0,101.0,1085.0
4,CS5057,290.0,235.0,509.0,974.0,0.0,0.0
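The stale total above comes from the order of operations: the sum was computed first and filled afterwards. A minimal sketch of the opposite order, assuming we only want to fill the yearly amount columns (the frame `df` and the list `year_cols` are illustrative names, not part of the notebook):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["CS2945", "CS4074"],
    "tran_amount_2014": [1029.0, 1027.0],
    "tran_amount_2015": [40.0, None],
})
year_cols = ["tran_amount_2014", "tran_amount_2015"]

# Fill only the input columns first ...
df[year_cols] = df[year_cols].fillna(0)

# ... then derive the total, so there is never a NaN total to be overwritten with 0.
df["tran_amount_2_yr"] = df["tran_amount_2014"] + df["tran_amount_2015"]

print(df)
```

Filling a chosen subset also avoids writing 0 into unrelated columns, such as dates, if any of them happen to be missing.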


### (axis = 1) argument

In [43]:
#Let's drop the column so we can re-create it with the correct addition
customers.drop('tran_amount_2_yr', axis = 1, inplace = True)
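As a small aside on the `axis = 1` argument used in the drop above: `.drop()` also accepts a `columns=` keyword, which expresses the same intent without the axis number. The sketch below uses a one-row stand-in frame (`df` is a made-up name) and leaves `inplace` unset, so both calls return new frames.

```python
import pandas as pd

df = pd.DataFrame({"customer_id": ["CS2945"], "tran_amount_2_yr": [1069.0]})

# axis=1 tells drop() that the label is a column name rather than a row label.
by_axis = df.drop("tran_amount_2_yr", axis=1)

# columns= says the same thing by keyword.
by_keyword = df.drop(columns="tran_amount_2_yr")

# Without inplace=True the original frame still has both columns.
print(list(by_axis.columns), list(by_keyword.columns), list(df.columns))
```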

In [44]:
#Now let's do the addition again with missing values (NaN) replaced with 0
customers['tran_amount_2_yr'] = customers['tran_amount_2014'] + customers['tran_amount_2015']

In [45]:
#Now we can find the top 10 customers based on last 2 years transaction amount
customers[['customer_id', 'tran_amount_2_yr']].sort_values('tran_amount_2_yr', ascending = False).head(10)

Unnamed: 0,customer_id,tran_amount_2_yr
29,CS5244,1162.0
6,CS2647,1158.0
20,CS3270,1145.0
3,CS4424,1085.0
0,CS2945,1069.0
48,CS2331,1053.0
2,CS4798,1048.0
100,CS3592,1041.0
9,CS1141,1036.0
40,CS2663,1035.0
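The sort-then-head pattern above can also be written with `DataFrame.nlargest`. This is a sketch on four rows taken from the output above, deliberately shuffled; `df`, `top_sorted` and `top_nlargest` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["CS4424", "CS5244", "CS3270", "CS2647"],
    "tran_amount_2_yr": [1085.0, 1162.0, 1145.0, 1158.0],
})

# The notebook's approach: sort descending, then keep the first n rows.
top_sorted = df.sort_values("tran_amount_2_yr", ascending=False).head(2)

# nlargest(n, column) returns the n rows with the largest values, already ordered.
top_nlargest = df.nlargest(2, "tran_amount_2_yr")

print(top_nlargest)
```

Both keep the original row index, which is why the index values in the output above are not 0 to 9.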


## Find top 10 customers in the last 5 years
**Concepts covered:**
1. Creating New Calculated Columns - Intermediate addition using .sum()
2. Usage of (axis = 1) argument for operating across columns (one result per row)

### Using (.sum()) to add columns

In [46]:
#Let's store the columns we want to sum in a variable
cols_to_sum = ['tran_amount_2011','tran_amount_2012','tran_amount_2013','tran_amount_2014','tran_amount_2015']

In [47]:
#We can sum the 5 columns using the .sum()
customers[cols_to_sum].sum()

tran_amount_2011    1340339.0
tran_amount_2012    2116599.0
tran_amount_2013    2137368.0
tran_amount_2014    2094508.0
tran_amount_2015     435175.0
dtype: float64

### Using (axis = 1) to add columns

In [49]:
#Notice that it sums each column from top to bottom and returns 5 totals, hence we use the argument axis = 1
#This sums across the columns instead, giving one total per row (customer)
customers['tran_amount_5_yr'] = customers[cols_to_sum].sum(axis = 1)
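The difference between the two `.sum()` calls above is easiest to see side by side. A minimal sketch on three rows from the earlier output, with two of the five yearly columns for brevity (`df`, `per_column` and `per_row` are made-up names):

```python
import pandas as pd

df = pd.DataFrame({
    "tran_amount_2014": [1029.0, 1027.0, 1001.0],
    "tran_amount_2015": [40.0, 0.0, 47.0],
})

# Default (axis=0): collapse the rows, giving one total per column.
per_column = df.sum()

# axis=1: collapse the columns, giving one total per row.
per_row = df.sum(axis=1)

print(per_column.to_dict())
print(per_row.tolist())
```

`axis="columns"` is accepted as a spelling of `axis=1`, which some readers find easier to remember.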

In [52]:
#Let's check the data to find our 2 newly added columns
customers.head()

Unnamed: 0,customer_id,tran_amount_2011,tran_amount_2012,tran_amount_2013,tran_amount_2014,tran_amount_2015,transactions_2011,transactions_2012,transactions_2013,transactions_2014,transactions_2015,First_Transaction,Latest_Transaction,tran_amount_2_yr,tran_amount_5_yr
0,CS2945,153.0,516.0,173.0,1029.0,40.0,2.0,7.0,3.0,13.0,1.0,18-May-11,08-Mar-15,1069.0,1911.0
1,CS4074,269.0,429.0,737.0,1027.0,0.0,3.0,6.0,10.0,15.0,0.0,29-May-11,05-Dec-14,1027.0,2462.0
2,CS4798,153.0,536.0,414.0,1001.0,47.0,2.0,7.0,6.0,11.0,1.0,01-Jul-11,21-Feb-15,1048.0,2151.0
3,CS4424,547.0,380.0,921.0,984.0,101.0,7.0,5.0,13.0,13.0,1.0,25-May-11,19-Jan-15,1085.0,2933.0
4,CS5057,290.0,235.0,509.0,974.0,0.0,4.0,4.0,6.0,12.0,0.0,10-Jul-11,02-Dec-14,974.0,2008.0


In [54]:
#To find the top 10 customers, we need to sort by tran_amount_5_yr in descending order
customers[['customer_id', 'tran_amount_5_yr']].sort_values(by = ['tran_amount_5_yr'], ascending = False).head(10)

Unnamed: 0,customer_id,tran_amount_5_yr
3,CS4424,2933.0
998,CS4320,2647.0
106,CS5752,2612.0
873,CS4660,2527.0
16,CS3799,2513.0
674,CS5109,2506.0
1,CS4074,2462.0
163,CS3805,2453.0
1845,CS4608,2449.0
388,CS5555,2439.0


## Find top 10 customers with highest 5 year average amount spent
**Concepts covered:**
1. Creating New Calculated Columns - Divide (/) operation

### Using divide (/) operator 

In [55]:
#We can divide the 5 year transaction amount by 5 to get the average
customers['tran_amount_5_yr_avg'] = customers['tran_amount_5_yr'] / 5

In [57]:
#Then get the top 10 customers that meets the criteria for the question
customers[['customer_id','tran_amount_5_yr','tran_amount_5_yr_avg']].sort_values('tran_amount_5_yr_avg', ascending=False).head(10)

Unnamed: 0,customer_id,tran_amount_5_yr,tran_amount_5_yr_avg
3,CS4424,2933.0,586.6
998,CS4320,2647.0,529.4
106,CS5752,2612.0,522.4
873,CS4660,2527.0,505.4
16,CS3799,2513.0,502.6
674,CS5109,2506.0,501.2
1,CS4074,2462.0,492.4
163,CS3805,2453.0,490.6
1845,CS4608,2449.0,489.8
388,CS5555,2439.0,487.8
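Dividing by a fixed 5 counts a year with no recorded amount as zero spend, which matches the earlier `.fillna(0)` step. For comparison, `.mean(axis = 1)` skips NaN by default and so divides by the number of years actually present. The sketch below contrasts the two on two rows from the earlier output, using two yearly columns only; all variable names are illustrative, and which definition of "average" is wanted depends on what a missing year means in this dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "tran_amount_2014": [984.0, 974.0],
    "tran_amount_2015": [101.0, None],
})

# Fixed divisor after filling: a missing year is treated as 0 spend.
fixed_divisor = df.fillna(0).sum(axis=1) / 2

# Row-wise mean: NaN is skipped, so the second row is averaged over one year.
mean_skipna = df.mean(axis=1)

print(fixed_divisor.tolist())
print(mean_skipna.tolist())
```

Once missing values have been filled with 0, as in this notebook, both approaches give the same result.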


## Find top 10 customers with highest average transaction amount
**Concepts covered:**
1. Adding Columns - Using continuous selection (:) and (.loc[])

### Adding Columns - (:) and (.loc[])

In [58]:
#Let's first add up the 5 years of transaction counts
customers['transactions_5_yr'] = customers.loc[:,'transactions_2011':'transactions_2015'].sum(axis = 1)
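A label slice in `.loc` includes both endpoints and selects every column that sits between them in the frame's current column order. The sketch below shows this on a cut-down frame with three of the yearly columns; `df`, `sliced` and `total` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["CS2945"],
    "transactions_2011": [2.0],
    "transactions_2012": [7.0],
    "transactions_2013": [3.0],
    "First_Transaction": ["18-May-11"],
})

# ':' keeps all rows; the column slice runs from the first label to the last, inclusive.
sliced = df.loc[:, "transactions_2011":"transactions_2013"]
total = sliced.sum(axis=1)

print(list(sliced.columns))
print(total.tolist())
```

Because the slice follows column order, it relies on the yearly columns being adjacent; an explicit list of names, as used for `cols_to_sum`, does not have that dependency.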

In [59]:
#Then we can create a new column by dividing the amount by the number of transactions
customers['tran_amount_avg'] = customers['tran_amount_5_yr'] / customers['transactions_5_yr']

In [61]:
#With the right sort we can achieve the answer for the question
customers[['customer_id','tran_amount_5_yr','transactions_5_yr','tran_amount_avg','tran_amount_5_yr_avg']].sort_values('tran_amount_avg', ascending=False).head(10)

Unnamed: 0,customer_id,tran_amount_5_yr,transactions_5_yr,tran_amount_avg,tran_amount_5_yr_avg
5958,CS3053,1311.0,15.0,87.4,262.2
523,CS2593,1199.0,14.0,85.642857,239.8
3209,CS3654,1627.0,19.0,85.631579,325.4
6090,CS5910,1712.0,20.0,85.6,342.4
4052,CS5045,1019.0,12.0,84.916667,203.8
715,CS1312,1353.0,16.0,84.5625,270.6
2236,CS2504,927.0,11.0,84.272727,185.4
3568,CS5864,1432.0,17.0,84.235294,286.4
1473,CS1142,1847.0,22.0,83.954545,369.4
5241,CS1377,1505.0,18.0,83.611111,301.0
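One thing to watch with the division above: after `.fillna(0)`, a customer could in principle have a transaction count of 0. pandas does not raise in that case; a positive amount divided by 0 gives `inf`, which would sort to the top of a descending ranking. The output above shows no such rows, so the sketch below is only a precaution on a hypothetical second row, with `raw` and `safe` as made-up names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tran_amount_5_yr": [1311.0, 50.0],   # second row is hypothetical
    "transactions_5_yr": [15.0, 0.0],
})

# Plain division: a zero count yields inf rather than an error.
raw = df["tran_amount_5_yr"] / df["transactions_5_yr"]

# Replacing 0 with NaN in the divisor yields NaN instead, which sort_values places last.
safe = df["tran_amount_5_yr"] / df["transactions_5_yr"].replace(0, np.nan)

print(raw.tolist())
print(safe.tolist())
```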


# END
**Pandas Concepts Covered:**
1. Using (+) and (/) operator to create new columns
2. Argument (axis = 1)
3. Using (.sum()) to Add columns
4. Alternate methods of selecting columns - using variable
5. (.fillna()) to replace missing values
6. Alternate methods of selecting columns - using (:) and (.loc[])