# RFM Data Preparation

This project segments customers to identify which ones deserve the most attention. **RFM analysis** is used for the segmentation: each customer is evaluated on three measures, Recency, Frequency, and Monetary.
- **Recency**: how much time has passed since the customer's last purchase
- **Frequency**: how often the customer purchases
- **Monetary**: how much money the customer spends
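
As a rough illustration of the three measures, the sketch below computes them on a tiny made-up transaction table. The data and the names `toy`, `snapshot`, and `rfm_toy` are assumptions for illustration only; they are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy transactions (not from Cleaned_Data.csv)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-01-01', '2011-01-10', '2011-01-05']),
    'Quantity': [2, 1, 3],
    'UnitPrice': [5.0, 10.0, 2.0],
})
toy['Total_Spend'] = toy['Quantity'] * toy['UnitPrice']

# Reference date: the latest transaction in the toy data
snapshot = toy['InvoiceDate'].max()

rfm_toy = toy.groupby('CustomerID').agg(
    # days between the reference date and the customer's last purchase
    Recency=('InvoiceDate', lambda s: (snapshot - s.max()).days),
    # number of transaction rows for the customer
    Frequency=('InvoiceNo', 'count'),
    # total amount spent by the customer
    Monetary=('Total_Spend', 'sum'),
).reset_index()
```

Here customer 1 bought most recently, so its Recency is the smallest, while its two rows give it the higher Frequency.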

This file has six parts, which together prepare the data set used for clustering.

1. <a href='#customerid_section'><b>CustomerID</b></a>
2. <a href='#recency_section'><b>Recency</b></a>
3. <a href='#frequency_section'><b>Frequency</b></a>
4. <a href='#monetary_section'><b>Monetary</b></a>
5. <a href='#country_section'><b>Country</b></a>
6. <a href='#data_storing_section'><b>Data Storing</b></a>

---

## Data Read

In [1]:
import pandas as pd

In [2]:
cleaned_data = pd.read_csv('Cleaned_Data.csv')

In [3]:
cleaned_data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
406824,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
406825,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
406826,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
406827,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


---

In [4]:
# empty clustering data frame
rfm_clusters = pd.DataFrame()

---

<a id='customerid_section'></a>
## 1. CustomerID

The unique customer IDs are sorted in ascending order and assigned to the rfm_clusters data frame.

In [5]:
rfm_clusters['CustomerID'] = cleaned_data['CustomerID'].unique()

In [6]:
rfm_clusters['CustomerID'] = rfm_clusters.loc[:,'CustomerID'].sort_values(ascending=True).reset_index(drop=True)

In [7]:
rfm_clusters.head()

Unnamed: 0,CustomerID
0,12346.0
1,12347.0
2,12348.0
3,12349.0
4,12350.0
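
The two steps above (collect the unique IDs, then sort them) could also be done in one expression. A minimal sketch, assuming a small hypothetical series `ids` in place of `cleaned_data['CustomerID']`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cleaned_data['CustomerID']
ids = pd.Series([17850.0, 12347.0, 17850.0, 12346.0])

# unique() drops the repeats; np.sort returns the values in ascending order
sorted_ids = pd.DataFrame({'CustomerID': np.sort(ids.unique())})
```

Building the frame from the already sorted array avoids the separate `sort_values` and `reset_index` calls.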


<a id='recency_section'></a>
## 2. Recency

In this part, the time elapsed since the last transaction is determined for each customer.

In [8]:
import datetime as dt

In [9]:
# After the data cleaning step, InvoiceDate no longer has a datetime type (it is read as object).
# The column is therefore cast back to datetime.
cleaned_data['InvoiceDate'] = pd.to_datetime(cleaned_data['InvoiceDate'])
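
An alternative to converting after loading is to let `pd.read_csv` parse the column directly through its `parse_dates` parameter. The sketch below uses a hypothetical two-row CSV string in place of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for Cleaned_Data.csv
csv_text = (
    "InvoiceNo,InvoiceDate\n"
    "536365,2010-12-01 08:26:00\n"
    "581587,2011-12-09 12:50:00\n"
)

# Without parse_dates the column is read as plain strings, not datetimes
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetime while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['InvoiceDate'])
```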

In [10]:
# Add a Time_Difference column to cleaned_data: the time between each transaction and the latest transaction in the data.
cleaned_data['Time_Difference'] = cleaned_data['InvoiceDate'].max() - cleaned_data['InvoiceDate']

In [11]:
# The minimum time difference per customer is the time elapsed since that customer's last transaction.
# It is merged into the rfm_clusters data frame as the Time_Difference column.
time_difference = cleaned_data.groupby('CustomerID')[['Time_Difference']].min()
rfm_clusters = pd.merge(rfm_clusters, time_difference, how='inner', on='CustomerID')

In [12]:
# Time_Difference column in the rfm_clusters data frame is renamed as Recency
rfm_clusters.rename(columns={'Time_Difference':'Recency'}, inplace=True)

In [13]:
# Keep only the whole-day component of the time difference
rfm_clusters['Recency'] = rfm_clusters['Recency'].dt.days

In [14]:
rfm_clusters.head()

Unnamed: 0,CustomerID,Recency
0,12346.0,325
1,12347.0,1
2,12348.0,74
3,12349.0,18
4,12350.0,309
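
Since the smallest difference to the reference date belongs to the customer's latest purchase, Recency can also be computed from each customer's last `InvoiceDate` directly. A minimal sketch on hypothetical data (the `toy` frame and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy transactions
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceDate': pd.to_datetime(
        ['2011-12-01 09:00', '2011-12-09 12:50', '2011-11-30 10:00']),
})

# Reference date: the latest transaction in the data
snapshot = toy['InvoiceDate'].max()

# Last purchase per customer, then elapsed whole days up to the reference date
last_purchase = toy.groupby('CustomerID')['InvoiceDate'].max()
recency = (snapshot - last_purchase).dt.days.rename('Recency').reset_index()
```

Note that `.dt.days` keeps only whole days, so the 9 days and a few hours elapsed for customer 2 are reported as 9.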


<a id='frequency_section'></a>
## 3. Frequency

In this part, how often each customer purchased is determined.



In [15]:
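# Count the InvoiceNo entries (one per transaction row) for each customer and merge the count into rfm_clusters.
# Note: count() counts rows, so an invoice with several line items is counted once per item.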
invoice_count = cleaned_data.groupby('CustomerID')[['InvoiceNo']].count()
rfm_clusters = pd.merge(rfm_clusters, invoice_count, how='inner', on='CustomerID')

In [16]:
# Rename the merged column to Frequency
rfm_clusters.rename(columns={'InvoiceNo':'Frequency'}, inplace=True)

In [17]:
rfm_clusters.head()

Unnamed: 0,CustomerID,Recency,Frequency
0,12346.0,325,2
1,12347.0,1,182
2,12348.0,74,31
3,12349.0,18,73
4,12350.0,309,17
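
The count used above is taken per row, so an invoice containing several products contributes several times. If Frequency were meant as the number of distinct invoices instead, `nunique()` would be one way to get it. The sketch below contrasts the two on hypothetical data; which definition suits the clustering is a modelling choice that this example does not settle.

```python
import pandas as pd

# Hypothetical toy data: customer 1 has two invoices, one with two line items
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
})

per_customer = toy.groupby('CustomerID')['InvoiceNo']
line_items = per_customer.count()    # rows per customer
invoices = per_customer.nunique()    # distinct invoices per customer
```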


<a id='monetary_section'></a>
## 4. Monetary

In this part, the total spend of each customer is determined.

In [18]:
# Compute the spend of each row (UnitPrice * Quantity) and store it in a new Total_Spend column of cleaned_data
cleaned_data['Total_Spend'] = cleaned_data['UnitPrice'] * cleaned_data['Quantity']

In [19]:
# Sum the total spend per customer and merge it into the rfm_clusters data frame
total_spend = cleaned_data.groupby('CustomerID')[['Total_Spend']].sum()
rfm_clusters = pd.merge(rfm_clusters, total_spend, how='inner', on='CustomerID')

In [20]:
# Rename the merged column to Monetary
rfm_clusters.rename(columns={'Total_Spend':'Monetary'}, inplace=True)

In [21]:
rfm_clusters.head()

Unnamed: 0,CustomerID,Recency,Frequency,Monetary
0,12346.0,325,2,0.0
1,12347.0,1,182,4310.0
2,12348.0,74,31,1797.24
3,12349.0,18,73,1757.55
4,12350.0,309,17,334.4
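
Because Monetary is a plain sum of `UnitPrice * Quantity`, rows with a negative quantity reduce the total. The sketch below shows this on hypothetical data. Whether the cleaned data actually keeps returns as negative quantities is an assumption here; it would be one way to end up with a Monetary of 0.0 next to a Frequency of 2, as in the first row above.

```python
import pandas as pd

# Hypothetical toy data: customer 1 buys 10 units and later returns them
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'Quantity': [10, -10, 4],
    'UnitPrice': [1.5, 1.5, 2.0],
})

# Row-level spend, then the per-customer total
toy['Total_Spend'] = toy['UnitPrice'] * toy['Quantity']
monetary = toy.groupby('CustomerID')['Total_Spend'].sum()
```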


<a id='country_section'></a>
## 5. Country

In this part, the country of each customer's transactions is determined. If a customer appears with more than one country, the first one found is used.

In [22]:
# For each customer, take the first country that appears in their transactions
for i in range(len(rfm_clusters)):
    cust_id = rfm_clusters.loc[i, 'CustomerID']
    country = cleaned_data.loc[cleaned_data['CustomerID'] == cust_id, 'Country'].unique()[0]
    rfm_clusters.loc[i, 'Country'] = country

In [23]:
rfm_clusters.head()

Unnamed: 0,CustomerID,Recency,Frequency,Monetary,Country
0,12346.0,325,2,0.0,United Kingdom
1,12347.0,1,182,4310.0,Iceland
2,12348.0,74,31,1797.24,Finland
3,12349.0,18,73,1757.55,Italy
4,12350.0,309,17,334.4,Norway
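
The row-by-row loop filters the full transaction table once per customer, which can be slow on large data. A vectorized sketch of the same idea (first country seen per customer) is shown below on hypothetical data; `toy`, `rfm_toy`, and `multi_country` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: customer 2 appears with two countries
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2],
    'Country': ['France', 'France', 'Norway', 'Italy'],
})
rfm_toy = pd.DataFrame({'CustomerID': [1, 2]})

# Keep the first row of each customer, i.e. the first country seen
first_country = toy.drop_duplicates(subset='CustomerID')[['CustomerID', 'Country']]
rfm_toy = rfm_toy.merge(first_country, how='left', on='CustomerID')

# Optional check: customers that appear with more than one country
n_countries = toy.groupby('CustomerID')['Country'].nunique()
multi_country = n_countries[n_countries > 1].index.tolist()
```

The `multi_country` list makes it visible for which customers the "first country" rule actually discards information.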


<a id='data_storing_section'></a>
## 6. Data Storing

In this part, the prepared rfm_clusters data set is saved as a CSV file.

In [24]:
rfm_clusters.to_csv('RFM_Clusters.csv', index=False)
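
A quick round trip can confirm that the saved file reads back with the expected columns. The sketch below writes a two-row frame, using the first two rows shown earlier, to a temporary directory; the temporary path and the `rfm_toy` and `restored` names are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Two rows copied from the rfm_clusters preview above
rfm_toy = pd.DataFrame({
    'CustomerID': [12346.0, 12347.0],
    'Recency': [325, 1],
    'Frequency': [2, 182],
    'Monetary': [0.0, 4310.0],
    'Country': ['United Kingdom', 'Iceland'],
})

# Write to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rfm_toy.csv')
    rfm_toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

same_columns = list(restored.columns) == list(rfm_toy.columns)
same_shape = restored.shape == rfm_toy.shape
```

Passing `index=False` keeps the row index out of the file, so no extra `Unnamed: 0` column appears when the CSV is read again.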

---