# Summary
The objective is to identify the 100 customers that Supplier Ltd should target for Coca-Cola.
Steps:
1) Total each customer's Sales Amount in the 'Sugary Drinks' category
2) Sort the customers in descending order of that total and keep the top 100

In [1]:
import pandas as pd

# Load csv into pandas and summarise data
file_path = "Data.csv"
df = pd.read_csv(file_path)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116338 entries, 0 to 116337
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Transaction Date  116338 non-null  object 
 1   Customer Id       116338 non-null  int64  
 2   Sales Agent ID    116338 non-null  object 
 3   Invoice ID        116338 non-null  object 
 4   Credit Term       116338 non-null  int64  
 5   Customer Name     116338 non-null  object 
 6   Order ID          116338 non-null  int64  
 7   SKU Name          116338 non-null  object 
 8   SKU Code          116338 non-null  object 
 9   Brand             116338 non-null  object 
 10  Category          116338 non-null  object 
 11  Quantity          116338 non-null  int64  
 12  Sales Amount      116338 non-null  float64
dtypes: float64(1), int64(4), object(8)
memory usage: 11.5+ MB


In [2]:
df.head()

,Transaction Date,Customer Id,Sales Agent ID,Invoice ID,Credit Term,Customer Name,Order ID,SKU Name,SKU Code,Brand,Category,Quantity,Sales Amount
0,4/1/2018,101336,21LW8,310189386,14,RUDIN,326875,100 PLUS ORIGINAL 1.5LPET (G),1170835,100PLUS,SUGARY DRINKS,1,34.66
1,4/1/2018,101336,21LW8,310189386,14,RUDIN,326875,F&N I/CSODA 1.5LPET,1170027,F&N,CULINARY,1,33.6
2,4/1/2018,101336,21LW8,310189386,14,RUDIN,326875,F&N ORG 1.5LPET,1170025,F&N,CULINARY,1,33.6
3,4/1/2018,101336,21LW8,310189386,14,RUDIN,326875,100+ REGULAR 500MLPET,1170113,100PLUS,SUGARY DRINKS,1,34.34
4,30/1/2018,101336,21LW8,310190647,14,RUDIN,326876,GOLD COIN EVAP CREAMER 390GX48,1111965,GOLD,DAIRY,2,226.84


The initial summary shows 116,338 rows across 13 columns, with no missing values in any column. The columns are stored as object (string), int64 and float64 dtypes.
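
The no-missing-data observation can also be checked directly instead of being read off the `df.info()` output. A minimal sketch, using a small made-up frame in place of `Data.csv`; the rows and the helper name `missing_report` are illustrative assumptions, not part of the dataset or the notebook:

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return frame.isna().sum()

# Hypothetical stand-in rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Customer Id": [101336, 101336, 100907],
    "Category": ["SUGARY DRINKS", "CULINARY", None],
    "Sales Amount": [34.66, 33.60, 12.50],
})

report = missing_report(sample)
total_missing = int(report.sum())

print(report)          # per-column counts of missing values
print(total_missing)   # total number of missing cells
```

On the real data, `missing_report(df)` should return zero for every column if the summary above holds.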

Because the task concerns a single product type, the most useful way to narrow this data is to filter it by Category.

In [3]:
df['Category'].unique() # Find unique categories

array(['SUGARY DRINKS', 'CULINARY', 'DAIRY', 'MINERAL WATER',
       'ALCHOHOLIC BEVERAGE', 'ACCESSORIES', 'REBATE ITEMS',
       'ENERGY DRINKS'], dtype=object)

The category that best fits Coca-Cola is 'SUGARY DRINKS'. The customers to target for Coca-Cola are the 100 customers that buy the most products in that category. Two variables measure this: Quantity and Sales Amount. Sales Amount is used here, since the objective is stated in terms of sales value.
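
To illustrate why the choice of variable matters, both totals can be computed per customer in one pass and compared. This is a sketch on made-up rows: the customer IDs and values are hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: customer 3 buys many cheap units, customer 2 buys few expensive ones.
sample = pd.DataFrame({
    "Customer Id": [1, 1, 2, 2, 3],
    "Category": ["SUGARY DRINKS"] * 5,
    "Quantity": [10, 5, 2, 1, 20],
    "Sales Amount": [50.0, 25.0, 200.0, 100.0, 60.0],
})

# Sum both measures per customer within the category.
totals = (
    sample[sample["Category"] == "SUGARY DRINKS"]
    .groupby("Customer Id")[["Quantity", "Sales Amount"]]
    .sum()
)

top_by_quantity = totals["Quantity"].idxmax()    # customer with the most units
top_by_sales = totals["Sales Amount"].idxmax()   # customer with the most sales value

print(totals)
print(top_by_quantity, top_by_sales)
```

In this toy data the two measures pick different customers, so the ranking depends on which one is chosen.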

In [4]:
trimmed_df = df[['Customer Id','Customer Name','Category','Sales Amount']] # Remove unnecessary columns
trimmed_df = trimmed_df[trimmed_df['Category'] == 'SUGARY DRINKS'] # Only keep rows in the 'SUGARY DRINKS' category
trimmed_df.head()


,Customer Id,Customer Name,Category,Sales Amount
0,101336,RUDIN,SUGARY DRINKS,34.66
3,101336,RUDIN,SUGARY DRINKS,34.34
8,101336,RUDIN,SUGARY DRINKS,134.41
9,101336,RUDIN,SUGARY DRINKS,169.6
15,101336,RUDIN,SUGARY DRINKS,67.21


In [5]:
# Create new df for most valuable customers based on trimmed_df, only keep customer Id and Sales Amount
Most_Valuable_Customers_df = trimmed_df.drop(['Customer Name','Category'], axis=1) # Only keep 'Customer Id' and 'Sales Amount'
Most_Valuable_Customers_df = Most_Valuable_Customers_df.groupby('Customer Id',as_index=False).sum() # Add up 'Sales Amount' for each unique 'Customer Id'
Most_Valuable_Customers_df = Most_Valuable_Customers_df.sort_values(by='Sales Amount', ascending=False) # Sort by most 'Sales Amount'
Most_Valuable_Customers_df = Most_Valuable_Customers_df.head(100) # Keep first 100 rows, discard the rest
Most_Valuable_Customers_df['Customer Id'].unique()

array([101858, 100907, 100977, 116468,  99954, 102114, 100835, 100173,
        99294, 100659, 100280, 100916, 100816, 100257, 101904, 100947,
       100909, 101844,  99857, 100530,  98612, 100231,  98668, 102249,
       100926,  99742, 101394, 101519, 100904,  99530, 100275, 101283,
       100441, 100911,  99540, 100249,  99695, 178648, 101274,  98531,
        99938, 101768, 100253, 115964, 101370, 101229,  98817, 102308,
       100942, 101828, 100828, 101803, 100934, 102370, 101890, 118909,
       100195, 100423,  98805, 102380, 101516, 101366, 100264, 101056,
       100933, 155619, 101148, 100260,  98570, 100301, 100147, 100912,
       100248, 102184, 100518, 101122, 101022, 102182, 100533,  98949,
       100849, 100796,  98619, 102348, 117478, 100185, 102135, 101871,
       101126,  99369,  99555, 101257, 101289, 100198, 101373, 100196,
       101499, 100245, 101093, 100969])
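
The filter, group, sort and truncate steps above can also be written as one short function. The following is a sketch of an alternative formulation using `Series.nlargest`, not the notebook's own code; the function name `top_customers` and the sample rows are hypothetical.

```python
import pandas as pd

def top_customers(frame: pd.DataFrame, category: str, n: int) -> pd.Series:
    """Total 'Sales Amount' per 'Customer Id' within one category, largest first."""
    in_category = frame[frame["Category"] == category]
    totals = in_category.groupby("Customer Id")["Sales Amount"].sum()
    return totals.nlargest(n)

# Hypothetical rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Customer Id": [1, 1, 2, 3, 3],
    "Category": ["SUGARY DRINKS", "DAIRY", "SUGARY DRINKS",
                 "SUGARY DRINKS", "SUGARY DRINKS"],
    "Sales Amount": [10.0, 99.0, 30.0, 5.0, 10.0],
})

top2 = top_customers(sample, "SUGARY DRINKS", n=2)
print(top2)
```

The result is a Series indexed by 'Customer Id'; calling `top2.reset_index()` turns it back into a two-column DataFrame if one is needed for a later merge.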

In [6]:
# Add 'Customer Name' back into dataset
Customer_Names_df = df[['Customer Id','Customer Name']] # Create df with just 'Customer Id' and 'Customer Name'
Customer_Names_df = Customer_Names_df.drop_duplicates() # Remove duplicate (Id, Name) pairs
Most_Valuable_Customers_df = Most_Valuable_Customers_df.merge(Customer_Names_df, on='Customer Id', how='left') # Merge the two dfs on 'Customer Id'
column_order = ['Customer Id', 'Customer Name', 'Sales Amount'] # Change column order
Most_Valuable_Customers_df = Most_Valuable_Customers_df[column_order] # Reorder df
Most_Valuable_Customers_df.head(100) # Display the updated df

,Customer Id,Customer Name,Sales Amount
0,101858,TAN,407960.44
1,100907,FOONG,299797.14
2,100977,PAM,287333.44
3,116468,YEE,254431.60
4,99954,PEI,219450.06
...,...,...,...
95,100196,HOONG,4401.78
96,101499,RAMAN,4338.36
97,100245,HAMAT,4252.69
98,101093,ELY,4154.56
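
One thing worth checking before a merge like this is whether any 'Customer Id' maps to more than one 'Customer Name', because `drop_duplicates()` on the (Id, Name) pairs would keep both rows and the left merge would then repeat that customer. The data shown here does not indicate whether this happens, so the following is only a precautionary sketch on made-up names; keeping the first name seen per ID is one possible resolution, assumed for illustration.

```python
import pandas as pd

# Hypothetical lookup table in which ID 1 appears under two different names.
names = pd.DataFrame({
    "Customer Id": [1, 1, 2],
    "Customer Name": ["TAN", "TAN SDN", "FOONG"],
})

# Count distinct names per ID and list the IDs that have more than one.
names_per_id = names.groupby("Customer Id")["Customer Name"].nunique()
ambiguous_ids = names_per_id[names_per_id > 1].index.tolist()

# Assumed resolution: keep the first name seen for each ID.
one_name_per_id = names.drop_duplicates(subset="Customer Id", keep="first")

totals = pd.DataFrame({"Customer Id": [1, 2], "Sales Amount": [100.0, 50.0]})

# validate="one_to_one" makes pandas raise MergeError if either side repeats an ID.
merged = totals.merge(one_name_per_id, on="Customer Id", how="left",
                      validate="one_to_one")

print(ambiguous_ids)
print(merged)
```

If `ambiguous_ids` is empty on the real data, the merge above already yields exactly one row per customer and no extra handling is needed.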


In [None]:
# Write Most_Valuable_Customers_df to a CSV file, without the index column
Most_Valuable_Customers_df.to_csv("Most_Valuable_Customers.csv", index=False)