![Value Inc](https://finch-groundhog-9245.squarespace.com/s/Value-Inc-Logo.png "Value Inc")

> **Project Title:** Sales Analysis<br>
> **Project Owner:** Berlinda Anaman <br>
> **Email:** Berlana.d@gmail.com <br>
> **Github Profile:** https://github.com/berl-cloud<br>
> **LinkedIn Profile:** http://www.linkedin.com/in/berlinda-anaman<br>
> **Tableau Profile:** https://public.tableau.com/app/profile/berlinda.anaman

## Table of Contents<a id='mu'></a>

* [Business Problem Understanding](#bpu)
    * [Problem Statement](#ps)
    * [Project Goal](#pg)
    * [Information Needed](#in)
    * [Methodology](#my)
* [Data Preparation](#dp)
    * [Data Quality Assessment](#dqa)
       * [Season Data](#sd)
       * [Transaction Data](#td)
    * [Data Cleaning and Preprocessing](#dcp)
* [Data Visualisation](#dv)
* [References](#r)

## 1. Business Problem Understanding<a id='bpu'></a>

[Move Up](#mu)

<p style="text-align:justify;">The first step in approaching a data science problem is problem understanding. This step is very important since it allows us to know the kind of decisions we want to make, the information or data that will be needed to inform those decisions and finally, the kind of analysis that will be used to arrive at those decisions. In a nutshell, developing a mental model of the problem allows us to properly structure potentially relevant information needed to solve the problem.</p>

### 1.1 Problem Statement <a id='ps'></a>

The Sales Manager at Value Inc lacks sales reporting and requires a comprehensive understanding of the current sales. They are unaware of monthly costs, monthly sales and profits. The project aims to address these challenges by analyzing the provided sales data and presenting the information through a dashboard.

### 1.2 Project Goal <a id='pg'></a>

In this project, we seek to achieve two main goals:

* Analyze dataset and draw unique insights for sales reporting.
* Build a dashboard to track monthly cost, profit and top selling products.

### 1.3 Information Needed <a id='in'></a>

To build a sales analytics dashboard, we first need to acquire the data for the analysis. This will enable us to draw insights and perform calculations that may be useful for the dashboard.

The data needed is: 

**`Sales transactional data`** - which should record every transaction performed on items bought and sold.

### 1.4 Methodology <a id='my'></a>

<p style="text-align:justify;">The methodology used for our project will largely depend on the goals we set out to achieve. The methodology framework below gives us a comprehensive guide to the approach that will help us achieve our goals.</p>
<br>
<p style="text-align:center;font-weight:bold;font-size:20px"> Methodology Framework</p>
<br>
<img src='https://artofdatablog.files.wordpress.com/2017/10/methodology-map.jpg' style="float:center;width:700px;">

Once we have the data, we would need to clean it and perform the analysis to generate insights.

## 2. Data Preparation <a id='dp'></a>

[Move Up](#mu)

An understanding of the data, coupled with an understanding of the problem, will help us clean and prepare our data for analysis. It is rare to acquire ready-to-use data for any analysis without some level of preparation. To prepare our data, we normally assess its quality, then cleanse, format, blend and sample it, since we may encounter various issues with the columns. These issues may include:

* **`Missing values:`** meaning column values are incomplete
* **`Incorrect data:`** meaning values do not match what the column name suggests
* **`Inconsistent values:`** meaning some values fall outside the expected range
* **`Duplicate values:`** meaning the same record appears more than once
* **`Inconsistent data type:`** meaning values are stored with a data type that does not suit the column
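
As a rough illustration, most of the checks listed above can be gathered into one small helper. The function name `quality_report` and the sample frame are hypothetical and are not part of the project data; this is a sketch, not the report format used in the project.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing values and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Hypothetical sample with one missing value and one duplicated row
sample = pd.DataFrame({
    "Month": ["Jan", "Feb", "Feb", None],
    "Sales": [10.0, 12.5, 12.5, 8.0],
})

report = quality_report(sample)
duplicate_rows = int(sample.duplicated().sum())

print(report)
print("duplicate rows:", duplicate_rows)
```

Range and plausibility checks depend on the meaning of each column, so they are left out of this generic helper.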

To properly prepare our data for analysis, we will perform two important tasks:

* Part I: Data Quality Assessment
* Part II: Data Cleaning and Preprocessing 

### 2.1 Data Quality Assessment <a id='dqa'></a>

<p style="text-align:justify;">The first task under the data preparation step is an initial assessment of the quality of the data, which will allow us to clean it properly. We will use this section to write any code necessary for inspecting the dataset. Once completed, we will leave our report in the Data Quality Report Document.

At the end of our inspection, we will provide a summary of all of our findings.</p>

In [1]:
# import libraries needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings


warnings.filterwarnings('ignore')



In [2]:
# load data using pandas

data = pd.read_csv("../Data/transaction.csv", sep=";")
data1 = pd.read_csv("../Data/value_inc_seasons.csv", sep=";")

#### 2.1.1 Inspecting Seasons Data <a id='sd'></a>

In this step we will inspect the seasons data (`data1`) to get an overview of what it contains, assess its quality and make recommendations for cleaning it.

In [3]:
# have a glimpse of the data1

data1.head()

Unnamed: 0,Month,Season
0,Jan,High
1,Feb,Mid
2,Mar,Low
3,Apr,Low
4,May,Low


In [4]:
#checking the shape of the data

data1.shape

(12, 2)

> We can see from the above result that the data has `12` observations and `2` columns.

In [5]:
#inspecting data type

data1.info()

# display the full table (12 rows)
print(data1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Month   12 non-null     object
 1   Season  12 non-null     object
dtypes: object(2)
memory usage: 320.0+ bytes


   Month Season
0    Jan   High
1    Feb    Mid
2    Mar    Low
3    Apr    Low
4    May    Low
5    Jun   High
6    Jul   High
7    Aug   High
8    Sep    Mid
9    Oct    Low
10   Nov    Low
11   Dec   High

> We can see from the above result that the data is complete. There are no missing values, and no duplicates. We will move on to assessing the transactional data.

#### Data Quality Summary:

* As stated earlier, the data is made up of 2 columns:
    * **`Month:`**  This column holds the abbreviated name of the month.
    * **`Season:`**  This column holds the season (`High`, `Mid` or `Low`) assigned to that month.


* Also, the data type of each column is appropriate.


* Finally, the data does not contain any missing values. 
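
The summary above can also be confirmed programmatically. The sketch below re-types the twelve rows from the printed output into a frame named `seasons` (a name chosen for this example) and counts missing values and duplicates. Checking that `Month` is unique is an extra step added here because a repeated month would multiply rows in a later merge.

```python
import pandas as pd

# Seasons table re-typed from the output shown above
seasons = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "Season": ["High", "Mid", "Low", "Low", "Low", "High",
               "High", "High", "Mid", "Low", "Low", "High"],
})

missing_total = int(seasons.isna().sum().sum())   # missing cells in the table
duplicate_rows = int(seasons.duplicated().sum())  # fully repeated rows
months_unique = seasons["Month"].is_unique        # one row per month

print(missing_total, duplicate_rows, months_unique)
```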

#### 2.1.2 Inspecting the transaction Data <a id='td'></a>

In this step we will inspect the transaction data (`data`) to get an overview of what it contains, assess its quality and make recommendations for cleaning it.

In [6]:
# have a glimpse of the data

data.head(4)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,SellingPricePerItem,Country,"ClientKeywords,,"
0,278166,6355745.0,2019.0,Feb,2.0,12:50:00,465549.0,FAMILY ALBUM WHITE PICTURE FRAME,6.0,11.73,21.11,United Kingdom,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie..."
1,337701,6283376.0,2018.0,Dec,26.0,09:06:00,482370.0,LONDON BUS COFFEE MUG,3.0,3.52,3.87,United Kingdom,"['Middle Age', 'Corporation', '2-5 Year Client']"
2,267099,6385599.0,2019.0,Feb,15.0,09:45:00,490728.0,SET 12 COLOUR PENCILS DOLLY GIRL,72.0,0.9,1.62,France,"['Middle Age', 'Corporation', '2-5 Year Client']"
3,380478,6044973.0,2018.0,Jun,22.0,07:14:00,459186.0,UNION JACK FLAG LUGGAGE TAG,3.0,1.73,1.9,United Kingdom,"['Middle Age', 'Small Business', 'New Client']"


In [7]:
# inspect the shape of the dataframe

data.shape

(1047588, 13)

> We can see from the above results that we have `1047588` observations and `13` columns. The data is rich enough to help us perform our analysis. However, we will move on to assess the quality of the data and make all the possible recommendations for cleaning before setting out to achieve our goals.

In [8]:
#getting data information

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1047588 entries, 0 to 1047587
Data columns (total 13 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   UserId                  1047588 non-null  object 
 1   TransactionId           1045208 non-null  float64
 2   Year                    1045208 non-null  float64
 3   Month                   1045208 non-null  object 
 4   Day                     1045208 non-null  float64
 5   Time                    1045208 non-null  object 
 6   ItemCode                1045208 non-null  float64
 7   ItemDescription         1042417 non-null  object 
 8   NumberOfItemsPurchased  1045208 non-null  float64
 9   CostPerItem             1045208 non-null  float64
 10  SellingPricePerItem     1045208 non-null  float64
 11  Country                 1045208 non-null  object 
 12  ClientKeywords,,        1045208 non-null  object 
dtypes: float64(7), object(6)
memory usage: 103.9+ MB


In [9]:
# viewing the columns

data.columns

Index(['UserId', 'TransactionId', 'Year', 'Month', 'Day', 'Time', 'ItemCode',
       'ItemDescription', 'NumberOfItemsPurchased', 'CostPerItem',
       'SellingPricePerItem', 'Country', 'ClientKeywords,,'],
      dtype='object')

Next, we will check for missing values. The non-null counts reported by `info()` already suggest that several columns are incomplete, so we count the missing values in each column to confirm.

In [10]:
#checking for total missing values in each column

data.isnull().sum()

UserId                       0
TransactionId             2380
Year                      2380
Month                     2380
Day                       2380
Time                      2380
ItemCode                  2380
ItemDescription           5171
NumberOfItemsPurchased    2380
CostPerItem               2380
SellingPricePerItem       2380
Country                   2380
ClientKeywords,,          2380
dtype: int64

#### Data Quality Summary:

This data presents a lot of opportunity for data cleaning. This is because most of the features have issues that have to be resolved. Also, we have to determine which columns will not be needed for various reasons and then consequently drop them.

Based on the outputs displayed above, here is a summary of the data quality issues to fix during the data cleaning process:<br>

* Since few rows have missing values (at most 5,171 of 1,047,588 in any column), we will drop those rows so that calculations on the data run without errors. The `UserId` column is the only one with no null values.


* We will need to rename some columns. The column to be renamed is:
  * `ClientKeywords,,` We will eliminate ',,' at the end.


* We also need to properly format the data types for some of the columns which include:
    * **UserId**  (format to int data type)
    * **TransactionId** (format to int data type)
    * **Year** (format to int data type)
    * **Day** (format to int data type)
    * **ItemCode** (format to int data type)
    * **NumberOfItemsPurchased** (format to int data type)


* Again, we will format the date by combining the day, month and year to make it more readable. We will also split the `ClientKeywords` column into 3 separate columns to make it easy to filter for information on the dashboard.


* Finally, there are some columns that will not be needed after merging the two dataframes for the analysis we seek to perform and dashboard  building. These columns are:
    * **Day**  (can be found in the date column created)
    * **Year** (can be found in the date column created)
    * **Month** (can be found in the date column created)
    * **ClientKeywords** (has been split into separate columns)

### 2.2 Data Cleaning and Preprocessing<a id='dcp'></a>

#### Data Cleaning - Season data

From the data quality assessment we stated that this data is clean and does not require any further cleansing. However, we will merge it with the transaction data in order to perform any suitable analysis.

####  Data Cleaning - Transaction data

<p style="text-align:justify">The preprocessing step (usually an iterative process) is carried out to clean the data based on data quality issues identified. During the data quality assessment, we identified various data quality issues including missing values, incorrect data, inconsistent values, etc. 

In this task we will perform all the initial data cleaning and preprocessing needed on the transaction data.</p>

#### Handling Missing Values

In this step, we will drop every row that contains a missing value.

In [11]:
#drop missing values

data = data.dropna()

#inspect data for missing values
data.isnull().sum()

UserId                    0
TransactionId             0
Year                      0
Month                     0
Day                       0
Time                      0
ItemCode                  0
ItemDescription           0
NumberOfItemsPurchased    0
CostPerItem               0
SellingPricePerItem       0
Country                   0
ClientKeywords,,          0
dtype: int64

#### Renaming Columns

Here, we will rename the `ClientKeywords,,` column by eliminating the `,,` at the end so that the column is easier to read and use later.

In [12]:
data.rename(columns={'ClientKeywords,,':'ClientKeywords'}, inplace=True)
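
If several column names carried stray trailing commas, a more general option would be to strip them from every name at once. This is only a sketch on a hypothetical empty frame. The explicit `rename` above is sufficient for this dataset.

```python
import pandas as pd

# Hypothetical frame that only carries column names
sample = pd.DataFrame(columns=["UserId", "Country", "ClientKeywords,,"])

# Remove trailing commas from every column name
sample.columns = sample.columns.str.rstrip(",")

print(list(sample.columns))
```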

#### Data Type Formatting

In this step we will format the columns to the correct data types which will make our analysis easier.

In [13]:
#inspecting the columns data type

data.dtypes

UserId                     object
TransactionId             float64
Year                      float64
Month                      object
Day                       float64
Time                       object
ItemCode                  float64
ItemDescription            object
NumberOfItemsPurchased    float64
CostPerItem               float64
SellingPricePerItem       float64
Country                    object
ClientKeywords             object
dtype: object

In [14]:
# formatting data types

data[['Year', 'Day', 'ItemCode', 'TransactionId', 'UserId']] = data[['Year', 'Day', 'ItemCode', 'TransactionId', 'UserId']].astype(int)

data['NumberOfItemsPurchased'] = data['NumberOfItemsPurchased'].astype(int)

In [15]:
# inspect the columns data types after conversion

data.dtypes

UserId                      int32
TransactionId               int32
Year                        int32
Month                      object
Day                         int32
Time                       object
ItemCode                    int32
ItemDescription            object
NumberOfItemsPurchased      int32
CostPerItem               float64
SellingPricePerItem       float64
Country                    object
ClientKeywords             object
dtype: object
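
`astype(int)` raises an error if a column contains a value that cannot be parsed, and the resulting integer width (`int32` above) can depend on the platform. A more defensive variant is sketched below, assuming a hypothetical sample in which `UserId` is stored as text with one non-numeric entry.

```python
import pandas as pd

# Hypothetical sample: UserId stored as text, with one non-numeric entry
sample = pd.DataFrame({
    "UserId": ["278166", "-1", "unknown"],
    "Year": [2019.0, 2018.0, 2019.0],
})

# errors="coerce" turns unparseable values into NaN instead of raising
user_id = pd.to_numeric(sample["UserId"], errors="coerce")
bad_rows = int(user_id.isna().sum())

# The nullable "Int64" dtype keeps integers even when values are missing
sample["UserId"] = user_id.astype("Int64")
# An explicit width avoids platform-dependent defaults
sample["Year"] = sample["Year"].astype("int64")

print(bad_rows)
print(sample.dtypes)
```

Counting `bad_rows` before converting makes it visible how many values would otherwise be lost silently.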

#### Combining Date Columns

In this step we will combine the `Day`, `Month` and `Year` columns into a new column called `Date`. This will make the date a lot more readable.

In [16]:
# combine Day,Year and Month columns

day = data['Day'].astype(str)
year = data['Year'].astype(str)

data['Date'] = day +'-' + data['Month'] +'-' + year

In [17]:
#have a glimpse of the data

data.head(3)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,SellingPricePerItem,Country,ClientKeywords,Date
0,278166,6355745,2019,Feb,2,12:50:00,465549,FAMILY ALBUM WHITE PICTURE FRAME,6,11.73,21.11,United Kingdom,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie...",2-Feb-2019
1,337701,6283376,2018,Dec,26,09:06:00,482370,LONDON BUS COFFEE MUG,3,3.52,3.87,United Kingdom,"['Middle Age', 'Corporation', '2-5 Year Client']",26-Dec-2018
2,267099,6385599,2019,Feb,15,09:45:00,490728,SET 12 COLOUR PENCILS DOLLY GIRL,72,0.9,1.62,France,"['Middle Age', 'Corporation', '2-5 Year Client']",15-Feb-2019
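
The `Date` column built above is plain text. If a true date type is wanted, for example to sort or group by month in pandas, the same string can be parsed with `pd.to_datetime`. This is a sketch using three rows from the output above. It assumes English month abbreviations, and it is an optional addition to the notebook's approach.

```python
import pandas as pd

# Sample rows taken from the output above
sample = pd.DataFrame({
    "Day": [2, 26, 15],
    "Month": ["Feb", "Dec", "Feb"],
    "Year": [2019, 2018, 2019],
})

date_text = (sample["Day"].astype(str) + "-" + sample["Month"]
             + "-" + sample["Year"].astype(str))

# %d is the day, %b the abbreviated month name, %Y the four-digit year
sample["Date"] = pd.to_datetime(date_text, format="%d-%b-%Y")

print(sample["Date"].dt.strftime("%Y-%m-%d").tolist())
```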


#### Splitting Clientkeywords Column

In this step we will split the values in the `ClientKeywords` column and assign them to age, business type and length of contract.

In [18]:
#splitting and formatting ClientKeywords column

# split once and reuse the result instead of splitting three times
keywords = data['ClientKeywords'].str.split(',', expand=True)

data['ClientAge'] = keywords[0]
data['ClientType'] = keywords[1]
data['LengthOfContract'] = keywords[2]

# regex=False matches the brackets literally rather than as regex syntax
data['ClientAge'] = data['ClientAge'].str.replace('[', '', regex=False)
data['LengthOfContract'] = data['LengthOfContract'].str.replace(']', '', regex=False)

In [19]:
#inspecting data after splitting

data.head(3)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,SellingPricePerItem,Country,ClientKeywords,Date,ClientAge,ClientType,LengthOfContract
0,278166,6355745,2019,Feb,2,12:50:00,465549,FAMILY ALBUM WHITE PICTURE FRAME,6,11.73,21.11,United Kingdom,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie...",2-Feb-2019,'Senior','Solo Entrepreneur','2-5 Year Client'
1,337701,6283376,2018,Dec,26,09:06:00,482370,LONDON BUS COFFEE MUG,3,3.52,3.87,United Kingdom,"['Middle Age', 'Corporation', '2-5 Year Client']",26-Dec-2018,'Middle Age','Corporation','2-5 Year Client'
2,267099,6385599,2019,Feb,15,09:45:00,490728,SET 12 COLOUR PENCILS DOLLY GIRL,72,0.9,1.62,France,"['Middle Age', 'Corporation', '2-5 Year Client']",15-Feb-2019,'Middle Age','Corporation','2-5 Year Client'
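
The split columns above still carry the quote characters from the original text. One alternative, sketched here on two values copied from the output, is to parse each entry as a Python list with `ast.literal_eval`. This assumes every entry is a well-formed list of exactly three strings, which the notebook does not verify.

```python
import ast

import pandas as pd

# Sample values taken from the output above
sample = pd.DataFrame({
    "ClientKeywords": [
        "['Senior', 'Solo Entrepreneur', '2-5 Year Client']",
        "['Middle Age', 'Corporation', '2-5 Year Client']",
    ]
})

# Parse each string as a list literal, then spread it over three columns
parsed = sample["ClientKeywords"].apply(ast.literal_eval)
parts = pd.DataFrame(
    parsed.tolist(),
    index=sample.index,
    columns=["ClientAge", "ClientType", "LengthOfContract"],
)
sample = sample.join(parts)

print(sample[["ClientAge", "ClientType", "LengthOfContract"]])
```

The resulting values contain no brackets, quotes or leading spaces, which may make dashboard filters easier to read.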


#### Formatting ItemDescription Column

In this step, we will convert the text in the `ItemDescription` column from upper case to lower case. This will ensure consistency in the data.

In [20]:
data['ItemDescription'] = data['ItemDescription'].str.lower() 
data.head(5)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,SellingPricePerItem,Country,ClientKeywords,Date,ClientAge,ClientType,LengthOfContract
0,278166,6355745,2019,Feb,2,12:50:00,465549,family album white picture frame,6,11.73,21.11,United Kingdom,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie...",2-Feb-2019,'Senior','Solo Entrepreneur','2-5 Year Client'
1,337701,6283376,2018,Dec,26,09:06:00,482370,london bus coffee mug,3,3.52,3.87,United Kingdom,"['Middle Age', 'Corporation', '2-5 Year Client']",26-Dec-2018,'Middle Age','Corporation','2-5 Year Client'
2,267099,6385599,2019,Feb,15,09:45:00,490728,set 12 colour pencils dolly girl,72,0.9,1.62,France,"['Middle Age', 'Corporation', '2-5 Year Client']",15-Feb-2019,'Middle Age','Corporation','2-5 Year Client'
3,380478,6044973,2018,Jun,22,07:14:00,459186,union jack flag luggage tag,3,1.73,1.9,United Kingdom,"['Middle Age', 'Small Business', 'New Client']",22-Jun-2018,'Middle Age','Small Business','New Client'
4,-1,6143225,2018,Sep,10,11:58:00,1733592,washroom metal sign,3,3.4,5.78,United Kingdom,"['Middle Age', 'Solo Entrepreneur', '2-5 Year ...",10-Sep-2018,'Middle Age','Solo Entrepreneur','2-5 Year Client'


#### Mathematical Computations for Tableau

In this step, we will perform some mathematical computations and add new columns to the data which will help us achieve our goal with the dashboard. The new columns to be added are:

* **`CostPerTransaction`** this is calculated as: `CostPerTransaction = CostPerItem * NumberOfItemsPurchased`
* **`SalesPerTransaction`** this is the selling price per transaction, calculated as: `SalesPerTransaction = SellingPricePerItem * NumberOfItemsPurchased`
* **`ProfitPerTransaction`** this is calculated as: `ProfitPerTransaction = SalesPerTransaction - CostPerTransaction`
* **`Markup`** this is calculated as: `Markup = ProfitPerTransaction / CostPerTransaction`
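
To make the arithmetic concrete, the formulas can be worked through for the first transaction shown earlier (6 picture frames at a cost of 11.73 and a selling price of 21.11 each). The variable names are chosen for this example only.

```python
# Values for the first transaction shown earlier
cost_per_item = 11.73
selling_price_per_item = 21.11
items_purchased = 6

cost = cost_per_item * items_purchased            # CostPerTransaction
sales = selling_price_per_item * items_purchased  # SalesPerTransaction
profit = sales - cost                             # ProfitPerTransaction
markup = profit / cost                            # Markup

# Round for display, since floating-point products are not exact
print(round(cost, 2), round(sales, 2), round(profit, 2), round(markup, 6))
```

A markup of roughly 0.8 means the profit on this transaction is about 80% of its cost.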


#### Working with Calculations

We will perform the mathematical computations and add the new columns needed for building our dashboard in Tableau.

##### Adding new columns to the dataframe

In [21]:
#Cost per transaction

data['CostPerTransaction'] = data['CostPerItem'] * data['NumberOfItemsPurchased']

In [22]:
#Sales per transaction

data['SalesPerTransaction'] = data['SellingPricePerItem'] * data['NumberOfItemsPurchased']


In [23]:
#Profit per transaction

data['ProfitPerTransaction'] = data['SalesPerTransaction'] - data['CostPerTransaction']


In [24]:
#Markup calculation = (sales-cost)/cost = profit/cost

data['Markup'] = data['ProfitPerTransaction']/data['CostPerTransaction']
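
Dividing by `CostPerTransaction` yields `inf` when a cost is zero. The notebook does not show whether such rows exist, so the guard below is only a precaution, sketched on a hypothetical two-row sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second row has a zero cost
sample = pd.DataFrame({
    "CostPerTransaction": [70.38, 0.0],
    "ProfitPerTransaction": [56.28, 5.00],
})

# Replace a zero cost with NaN so the division yields NaN rather than inf
safe_cost = sample["CostPerTransaction"].replace(0, np.nan)
sample["Markup"] = (sample["ProfitPerTransaction"] / safe_cost).round(2)

print(sample)
```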


In [25]:
# having a glimpse of the data

data.head(2)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,...,Country,ClientKeywords,Date,ClientAge,ClientType,LengthOfContract,CostPerTransaction,SalesPerTransaction,ProfitPerTransaction,Markup
0,278166,6355745,2019,Feb,2,12:50:00,465549,family album white picture frame,6,11.73,...,United Kingdom,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie...",2-Feb-2019,'Senior','Solo Entrepreneur','2-5 Year Client',70.38,126.66,56.28,0.799659
1,337701,6283376,2018,Dec,26,09:06:00,482370,london bus coffee mug,3,3.52,...,United Kingdom,"['Middle Age', 'Corporation', '2-5 Year Client']",26-Dec-2018,'Middle Age','Corporation','2-5 Year Client',10.56,11.61,1.05,0.099432


> The Markup values have too many decimal places. Hence we will round them to two decimal places to simplify further use.

In [26]:
# round markup values to two decimal places (nearest, not always up)
data['Markup'] = data['Markup'].round(2)

In [27]:
# having a glimpse of the data

data.head(2)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,...,Country,ClientKeywords,Date,ClientAge,ClientType,LengthOfContract,CostPerTransaction,SalesPerTransaction,ProfitPerTransaction,Markup
0,278166,6355745,2019,Feb,2,12:50:00,465549,family album white picture frame,6,11.73,...,United Kingdom,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie...",2-Feb-2019,'Senior','Solo Entrepreneur','2-5 Year Client',70.38,126.66,56.28,0.8
1,337701,6283376,2018,Dec,26,09:06:00,482370,london bus coffee mug,3,3.52,...,United Kingdom,"['Middle Age', 'Corporation', '2-5 Year Client']",26-Dec-2018,'Middle Age','Corporation','2-5 Year Client',10.56,11.61,1.05,0.1


In [28]:
#final inspection of data 
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1042417 entries, 0 to 1047587
Data columns (total 21 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   UserId                  1042417 non-null  int32  
 1   TransactionId           1042417 non-null  int32  
 2   Year                    1042417 non-null  int32  
 3   Month                   1042417 non-null  object 
 4   Day                     1042417 non-null  int32  
 5   Time                    1042417 non-null  object 
 6   ItemCode                1042417 non-null  int32  
 7   ItemDescription         1042417 non-null  object 
 8   NumberOfItemsPurchased  1042417 non-null  int32  
 9   CostPerItem             1042417 non-null  float64
 10  SellingPricePerItem     1042417 non-null  float64
 11  Country                 1042417 non-null  object 
 12  ClientKeywords          1042417 non-null  object 
 13  Date                    1042417 non-null  object 
 14  Cl

#### Merging Datasets

At this point, we will merge the transaction data with the season data to obtain one dataset for our dashboard building.

In [29]:
#having a glimpse of data

data.head(4)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,...,Country,ClientKeywords,Date,ClientAge,ClientType,LengthOfContract,CostPerTransaction,SalesPerTransaction,ProfitPerTransaction,Markup
0,278166,6355745,2019,Feb,2,12:50:00,465549,family album white picture frame,6,11.73,...,United Kingdom,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie...",2-Feb-2019,'Senior','Solo Entrepreneur','2-5 Year Client',70.38,126.66,56.28,0.8
1,337701,6283376,2018,Dec,26,09:06:00,482370,london bus coffee mug,3,3.52,...,United Kingdom,"['Middle Age', 'Corporation', '2-5 Year Client']",26-Dec-2018,'Middle Age','Corporation','2-5 Year Client',10.56,11.61,1.05,0.1
2,267099,6385599,2019,Feb,15,09:45:00,490728,set 12 colour pencils dolly girl,72,0.9,...,France,"['Middle Age', 'Corporation', '2-5 Year Client']",15-Feb-2019,'Middle Age','Corporation','2-5 Year Client',64.8,116.64,51.84,0.8
3,380478,6044973,2018,Jun,22,07:14:00,459186,union jack flag luggage tag,3,1.73,...,United Kingdom,"['Middle Age', 'Small Business', 'New Client']",22-Jun-2018,'Middle Age','Small Business','New Client',5.19,5.7,0.51,0.1


In [30]:
#having a glimpse of data1

data1.head(4)

Unnamed: 0,Month,Season
0,Jan,High
1,Feb,Mid
2,Mar,Low
3,Apr,Low


In [31]:
#merging the two dataframes

data_final = data.merge(data1,on='Month')
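
By default `merge` performs an inner join, so a transaction whose `Month` has no match in the seasons table would be dropped without notice. A more defensive variant is sketched below on hypothetical frames, where `"Xyz"` is a deliberately unmatched month. The `validate` and `indicator` arguments are standard pandas options that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical transactions; "Xyz" is a deliberately unmatched month
transactions = pd.DataFrame({
    "TransactionId": [1, 2, 3],
    "Month": ["Feb", "Dec", "Xyz"],
})
seasons = pd.DataFrame({
    "Month": ["Feb", "Dec"],
    "Season": ["Mid", "High"],
})

# how="left" keeps every transaction; validate raises if a month
# appears more than once in the seasons table
merged = transactions.merge(
    seasons, on="Month", how="left", validate="many_to_one", indicator=True
)
unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print("unmatched rows:", unmatched)
```

Comparing the row count before and after the merge is a quick additional check that nothing was lost or duplicated.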

In [32]:
data_final.head(5)

Unnamed: 0,UserId,TransactionId,Year,Month,Day,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,...,ClientKeywords,Date,ClientAge,ClientType,LengthOfContract,CostPerTransaction,SalesPerTransaction,ProfitPerTransaction,Markup,Season
0,278166,6355745,2019,Feb,2,12:50:00,465549,family album white picture frame,6,11.73,...,"['Senior', 'Solo Entrepreneur', '2-5 Year Clie...",2-Feb-2019,'Senior','Solo Entrepreneur','2-5 Year Client',70.38,126.66,56.28,0.8,Mid
1,267099,6385599,2019,Feb,15,09:45:00,490728,set 12 colour pencils dolly girl,72,0.9,...,"['Middle Age', 'Corporation', '2-5 Year Client']",15-Feb-2019,'Middle Age','Corporation','2-5 Year Client',64.8,116.64,51.84,0.8,Mid
2,328440,6387425,2019,Feb,16,10:35:00,494802,set of 6 ribbons perfectly pretty,36,3.99,...,"['Adult', 'Small Business', 'New Client']",16-Feb-2019,'Adult','Small Business','New Client',143.64,201.24,57.6,0.4,Mid
3,364791,6358242,2019,Feb,3,09:25:00,486276,set of 5 mini grocery magnets,3,2.88,...,"['Young Adult', 'Small Business', 'Loyal Client']",3-Feb-2019,'Young Adult','Small Business','Loyal Client',8.64,11.22,2.58,0.3,Mid
4,-1,6388019,2019,Feb,16,13:24:00,490329,roll wrap vintage christmas,30,3.4,...,"['Middle Age', 'Solo Entrepreneur', '2-5 Year ...",16-Feb-2019,'Middle Age','Solo Entrepreneur','2-5 Year Client',102.0,132.6,30.6,0.3,Mid


#### Dropping Unneeded Columns

We highlighted during the data quality assessment that there are some columns that won't be needed for both our analysis and dashboard building. Therefore, we will have to drop those columns entirely.

In [33]:
# We are dropping irrelevant columns.

data_final = data_final.drop(columns=['ClientKeywords', 'Month', 'Year', 'Day'])

In [34]:
data_final.head(5)

Unnamed: 0,UserId,TransactionId,Time,ItemCode,ItemDescription,NumberOfItemsPurchased,CostPerItem,SellingPricePerItem,Country,Date,ClientAge,ClientType,LengthOfContract,CostPerTransaction,SalesPerTransaction,ProfitPerTransaction,Markup,Season
0,278166,6355745,12:50:00,465549,family album white picture frame,6,11.73,21.11,United Kingdom,2-Feb-2019,'Senior','Solo Entrepreneur','2-5 Year Client',70.38,126.66,56.28,0.8,Mid
1,267099,6385599,09:45:00,490728,set 12 colour pencils dolly girl,72,0.9,1.62,France,15-Feb-2019,'Middle Age','Corporation','2-5 Year Client',64.8,116.64,51.84,0.8,Mid
2,328440,6387425,10:35:00,494802,set of 6 ribbons perfectly pretty,36,3.99,5.59,United Kingdom,16-Feb-2019,'Adult','Small Business','New Client',143.64,201.24,57.6,0.4,Mid
3,364791,6358242,09:25:00,486276,set of 5 mini grocery magnets,3,2.88,3.74,United Kingdom,3-Feb-2019,'Young Adult','Small Business','Loyal Client',8.64,11.22,2.58,0.3,Mid
4,-1,6388019,13:24:00,490329,roll wrap vintage christmas,30,3.4,4.42,United Kingdom,16-Feb-2019,'Middle Age','Solo Entrepreneur','2-5 Year Client',102.0,132.6,30.6,0.3,Mid


## 3. Data Visualisation<a id='dv'></a>

[Move Up](#mu)

As stated earlier, we are going to build a dashboard to track the sales KPIs. We will export the processed data and load it into Tableau to build the dashboard.

In [35]:
#export data to csv

data_final.to_csv('ValueInc_Sales_Cleaned.csv', index=False)
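
Before building charts, it can help to compute the headline figures in pandas so the dashboard totals can be cross-checked. The sketch below uses three rows with values from the outputs above. The short item names and the frame name `cleaned` are placeholders for this example.

```python
import pandas as pd

# Three rows with values from the outputs above; item names are shortened
cleaned = pd.DataFrame({
    "Date": ["2-Feb-2019", "15-Feb-2019", "26-Dec-2018"],
    "ItemDescription": ["frame", "pencils", "mug"],
    "CostPerTransaction": [70.38, 64.80, 10.56],
    "SalesPerTransaction": [126.66, 116.64, 11.61],
    "ProfitPerTransaction": [56.28, 51.84, 1.05],
})
cleaned["Date"] = pd.to_datetime(cleaned["Date"], format="%d-%b-%Y")

# Monthly totals for cost, sales and profit
monthly = (
    cleaned.groupby(cleaned["Date"].dt.to_period("M"))[
        ["CostPerTransaction", "SalesPerTransaction", "ProfitPerTransaction"]
    ]
    .sum()
    .round(2)
)

# Top-selling products by total sales
top_items = cleaned.groupby("ItemDescription")["SalesPerTransaction"].sum().nlargest(2)

print(monthly)
print(top_items)
```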

## 4. References<a id='r'></a>

[Move Up](#mu)

* [Data File: transactions.csv](https://drive.google.com/file/d/1i6MQZmXUuqyqGjSGbsPrNKV-eJPAhx-U/view?usp=sharing)
* [Value Inc Seasons Dataset](https://finch-groundhog-9245.squarespace.com/s/value_inc_seasons.csv)
* [Value Inc. Logo](https://finch-groundhog-9245.squarespace.com/s/Value-Inc-Logo.png)

Link to dashboard on Tableau: [Value Inc. Sales Dashboard](https://public.tableau.com/views/ValueInc_SalesDashboard_16894308550170/Dashboard1?:language=en-US&:display_count=n&:origin=viz_share_link)
