# Data Analyst Professional Practical Exam Submission


# Analysis & Report Overview

Pens and Printers is looking to launch a new product range in an evolving market where customers now shop online more than they previously did in-store. To adjust to these market changes, the company has decided to investigate new sales methods for its new product line launch. The sales team has to present the sales approaches to the executive team and has asked the analytics department for data and analysis to support data-driven decisions.

The questions they need answered are as follows:

- How many customers were there for each approach?
- What does the spread of the revenue look like overall? And for each method?
- Was there any difference in revenue over time for each of the methods?
- Based on the data, which method would you recommend we continue to use?

Overall, the goal of the project is to analyse which sales method is the most effective, so that the business can decide which method to use for the new product launch.

# Data Cleaning and Validation

The dataset contains 15000 rows and 8 columns before cleaning and validation. I validated every column against the criteria in the dataset table:

- week: Numerical data with a range of 1 to 6, denoting the week number. The values match the description. No cleaning is needed.
- sales_method: There should be only three unique values in this column ('Email', 'Call' and 'Email + Call'). However, there were 5. The other two were misspellings of the correct values, so I replaced them with the correct values and then converted the column from object to the Pandas Categorical data type.
- customer_id: Object/string values without missing or duplicate data; the unique identifier for the records in the table. 13926 unique values remain once the rows with missing revenue are removed. The column matches the description. No cleaning is needed.
- nb_sold: Numerical data without missing values, matching the description. No cleaning is needed.
- revenue: Numerical data, and the only column in the dataset that contained missing data. In total, 1074 records were missing revenue, accounting for 7.16% of the dataset. I found no relationship between the missing revenue values and any of the other observed values, so I classed the missing data as MCAR (Missing Completely at Random), which also assumes no dependence on unobserved values, and removed the affected rows.
- years_as_customer: Numerical data without missing values. The values match the description. No cleaning is needed.
- nb_site_visits: Numerical data without missing values. The values match the description. No cleaning is needed.
- state: There are 50 possible unique values without missing values, matching the description. While no cleaning was necessary for this column, I did convert it from an object type to the Pandas Categorical data type for the sake of speed and memory.
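The `sales_method` and `state` steps described above could look roughly like the sketch below. It runs on a toy frame rather than the real data, and the two misspelled labels (`em + call`, `email`) are hypothetical stand-ins, since the report does not list the actual misspellings. It also assumes the three valid labels are `Email`, `Call` and `Email + Call`.

```python
import pandas as pd

# Toy frame standing in for the real data. The misspelled labels
# ("em + call", "email") are hypothetical examples, not the actual values.
df = pd.DataFrame({
    "sales_method": ["Email", "Call", "Email + Call", "em + call", "email"],
    "state": ["Texas", "Ohio", "Texas", "Utah", "Ohio"],
})

# Map each misspelling onto its canonical label; other values pass through.
corrections = {"em + call": "Email + Call", "email": "Email"}
df["sales_method"] = df["sales_method"].replace(corrections)

# Fail loudly if anything other than the three expected labels remains.
expected = {"Email", "Call", "Email + Call"}
unexpected = set(df["sales_method"].unique()) - expected
assert not unexpected, f"Unexpected sales methods: {unexpected}"

# Convert the low-cardinality text columns to the categorical dtype.
df["sales_method"] = df["sales_method"].astype("category")
df["state"] = df["state"].astype("category")
```

Checking the remaining labels before the conversion means a new misspelling in future data raises an error instead of silently becoming a fourth category.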

![Missing Data Matrix](Missing_Data_Matrix.png)

The figure above shows the graph used to explore the missing data: a matrix created with the Missingno library. I also experimented with ordering the rows by other observed values to investigate whether there was any relationship or pattern in the missing data. There was none, which is how I inferred that the data is MCAR.

After the data validation, the dataset contains 13926 rows and 8 columns without missing values.
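A minimal sketch of the missing-revenue check, on toy data and assuming the incomplete rows were simply dropped (which the row count after validation suggests). Comparing the missing rate across sales methods is one simple way to look for a pattern. Similar rates are consistent with MCAR but do not prove it.

```python
import pandas as pd

# Toy data: revenue is missing for some rows.
df = pd.DataFrame({
    "sales_method": ["Email", "Email", "Call", "Call", "Email + Call", "Email + Call"],
    "revenue": [95.0, None, 50.0, None, 180.0, 185.0],
})

# Count and share of rows with missing revenue.
n_missing = int(df["revenue"].isna().sum())
pct_missing = round(100 * n_missing / len(df), 2)

# Missing rate per sales method; large differences between groups
# would argue against treating the data as missing completely at random.
missing_rate_by_method = df["revenue"].isna().groupby(df["sales_method"]).mean()

# Drop the incomplete rows (one simple strategy among several).
clean = df.dropna(subset=["revenue"])
```

Applied to the figures reported above, 1074 / 15000 gives 7.16%, and 15000 - 1074 leaves the 13926 rows stated after validation.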

# Exploratory Analysis

## How Many Customers in each Sales Method Group?

![Number of Customers in each Sales Group](Number_of_Customer_per_Sales_Method.png)

There was some disparity in the distribution of customers between the 3 sales method groups. The email group was the largest, with 7466 customers, or 49.8% of all 15000 customers. Second was the group that received only a call, with 4962 customers. Those who received both an email and a call were the minority with only 2572 customers, accounting for 17.1% of all customers (marginally more than half the number of call customers).

However, the number of customers per sales group was not consistent throughout the six-week period:

![Total Count of Customers Per Week](Count_of_Customers_Per_Week.png)

You can see that there was a large decline in customers after the first week. The number of customers then remains stable from the second week until it declines heavily again in the final week. This variation in the number of customers can distort the picture that total revenue gives.

I used a bar chart for both of these graphs because I was comparing a single numerical calculation, a count, across categorical variables. A bar chart is the simplest and most common choice for this scenario.
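The counts behind both bar charts can be sketched in pandas as follows. The frame is toy data, and the sketch assumes one row per customer, as the validation of `customer_id` indicates.

```python
import pandas as pd

# Toy data: one row per customer.
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 3],
    "sales_method": ["Email", "Email", "Call", "Email", "Email + Call", "Call"],
    "customer_id": ["a", "b", "c", "d", "e", "f"],
})

# Customers per sales method, and each method's share of all customers.
counts = df["sales_method"].value_counts()
shares = (100 * counts / counts.sum()).round(1)

# Customers per week and method, one column per method (0 where none).
weekly = (
    df.groupby(["week", "sales_method"])["customer_id"]
    .nunique()
    .unstack(fill_value=0)
)
```

With matplotlib installed, `counts.plot.bar()` and `weekly.plot.bar()` would draw charts of this kind directly from these tables.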

## What does the spread of the revenue look like overall?

![Boxplot of Revenue per Sales Method](Distribution_of_Revenue_per_Sales_Method.png)

Those who only received a call have the narrowest spread of revenue (as depicted by the edges of the box). But they also have the lowest revenue amongst the three groups. In fact, the maximum revenue for a customer in the call group is still lower than the minimum for either of the other groups.

The Email group yields a wider spread of revenue. However, its distribution is roughly symmetric, as shown by the median sitting in the middle of the box.

The Email and Call group has by far the widest spread of revenue, but also the highest revenue. The minimum revenue for a customer in the email and call group is just below the maximum for the email group. The revenue distribution for this group is, however, negatively skewed, as the median is closer to the upper quartile. This indicates that higher revenue values occur more frequently in this group.

In this case, I used a box plot because it depicts the variability of the data through the IQR, shown by the box. It also shows how skewed the data is through the position of the median relative to the middle of the box.
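The quantities a box plot draws can also be computed directly. The sketch below uses toy revenue values, and `median_position` is a hypothetical helper column introduced here to express where the median sits inside the box.

```python
import pandas as pd

# Toy revenue values for two of the sales methods.
df = pd.DataFrame({
    "sales_method": ["Call"] * 5 + ["Email + Call"] * 5,
    "revenue": [40.0, 45.0, 50.0, 55.0, 60.0, 120.0, 170.0, 185.0, 190.0, 195.0],
})

# Quartiles per sales method: the three lines that make up each box.
q = df.groupby("sales_method")["revenue"].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ["q1", "median", "q3"]

# Interquartile range: the height of the box, i.e. the spread of the middle 50%.
q["iqr"] = q["q3"] - q["q1"]

# Position of the median inside the box: 0.5 means centred (roughly symmetric),
# above 0.5 means closer to the upper quartile (suggesting negative skew).
q["median_position"] = (q["median"] - q["q1"]) / q["iqr"]
```

In this toy example the call group has a median position of 0.5 and the email and call group 0.75, mirroring the symmetric and negatively skewed shapes described above.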


## Was there any difference in revenue over time for each of the methods?

![Sum Of Revenue Per Week](Sum_of_Revenue_Per_Week.png)

This graph shows the sum of revenue per week. We can see there is a major dip in the second week and a slight dip in the 3rd week. This could be due to the drastic decline in the number of customers for the email group and the minor decline for the call group. However, there is a sharp rise in revenue from the 5th week into the 6th.

![Cumulative Sum of Revenue Per Week](CumSum_of_Revenue_Per_Week.png)

However, as you can see above, cumulative revenue still increases every week, regardless of the dips in weekly revenue and customers.

![Sales Over Time Per Sales Category](Average_Revenue_over_Time_Per_Sales_Method.png)

The line graph shows the average revenue per week per sales group. I chose to show the average revenue as opposed to the sum because total revenue depends on the number of customers, and the number of customers per sales group varies over the 6-week period. It shows there is a general upward trend in revenue.

When we separate the trend in revenue by sales method, we can see that the email and call group consistently yields the highest revenue per customer, while calls consistently yield the least.

For the call group and the email group, the gradient of revenue is quite flat, whereas the gradient for the email and call group is noticeably steeper.

I chose a line graph because it displays the rate of change of revenue over time.
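The three views of revenue over time (weekly sum, cumulative sum and weekly average per method) can be sketched like this on toy data. The week-on-week difference is used here only as a rough stand-in for the gradient of each line.

```python
import pandas as pd

# Toy data: one row per customer sale.
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2],
    "sales_method": ["Call", "Call", "Email + Call", "Call", "Email + Call", "Email + Call"],
    "revenue": [40.0, 50.0, 150.0, 50.0, 170.0, 190.0],
})

# Total revenue per week, and its running (cumulative) total.
weekly_total = df.groupby("week")["revenue"].sum()
cumulative_total = weekly_total.cumsum()

# Mean revenue per customer for each week and method; unlike the sum,
# this is not inflated by how many customers a method reached that week.
weekly_mean = df.groupby(["week", "sales_method"])["revenue"].mean().unstack()

# Week-on-week change in the mean, per method.
weekly_change = weekly_mean.diff()
```

Because revenue is never negative, the cumulative total can only rise or stay flat, so the weekly mean is the more informative series for comparing methods.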



# Business Metrics

The key business metric should be revenue per customer. As the aim is to discover the most effective sales strategy, it follows that revenue is an important metric. But revenue depends on many factors, one of the biggest being the number of customers a given method was used on. As this varies widely throughout the six-week period, total revenue gives a misleading picture. To standardise this, the best business metric is the average revenue generated per customer.

At the end of the six-week period, considering all sales methods combined, the average revenue per customer is \$93.93. For the email and call method on its own, however, the revenue per customer is \$183.66.

The Email and Call method generated just over 31% of the total revenue, despite only being used with 15% of the customers!

![Average Revenue and Number Sold per Customer](Average_of_Revenue_and_Number_Sold_per_Customer.png)
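As a sketch of how the metric could be monitored, the snippet below computes revenue per customer for each method, alongside each method's share of customers and of revenue. It uses toy values and assumes one row per customer; the column names in `summary` are illustrative.

```python
import pandas as pd

# Toy data: one row per customer.
df = pd.DataFrame({
    "sales_method": ["Email", "Email", "Email", "Call", "Call", "Email + Call"],
    "revenue": [90.0, 100.0, 110.0, 40.0, 60.0, 200.0],
})

# Number of customers and total revenue per method.
summary = df.groupby("sales_method")["revenue"].agg(
    customers="count", total_revenue="sum"
)

# The metric: revenue divided by number of customers, per method.
summary["revenue_per_customer"] = summary["total_revenue"] / summary["customers"]

# Each method's share of customers versus its share of revenue.
summary["customer_share_pct"] = 100 * summary["customers"] / summary["customers"].sum()
summary["revenue_share_pct"] = 100 * summary["total_revenue"] / summary["total_revenue"].sum()

# The same metric across all methods combined.
overall_revenue_per_customer = df["revenue"].sum() / len(df)
```

A method whose revenue share is well above its customer share, as in the email and call row of this toy table, is outperforming on the metric.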

# Recommendations

In evaluating my work, I have answered all the questions the sales team needs in order to present to the executive team. I went through the necessary stages of the data analysis cycle, including data cleaning/validation, analysis and visualisation, to identify patterns and explore the data before reaching a conclusion.

For the following weeks, I would recommend that Pens and Printers use the email and call method when trying to generate leads for their new product line.

- Measure and monitor whether the length of the email or the length of the phone call has a further impact on the revenue generated from that customer.
- Only use email and call in combination with each other, as the data shows that in combination they are roughly twice as effective as either by itself.

![Site Visits against Revenue Regression Plot](Site_Vists_against_Revenue_Regression_Plot.png)

There is also a relationship between the number of site visits and revenue. For those who received the email and call combined, the number of site visits predicts revenue more strongly. So I would also recommend focusing on the number of site visits and on bringing customers onto the site.
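One way to quantify how strongly site visits track revenue within each method is a per-group Pearson correlation. This is a sketch on toy data, not necessarily the calculation behind the regression plot above.

```python
import pandas as pd

# Toy data: site visits and revenue for two sales methods.
df = pd.DataFrame({
    "sales_method": ["Email"] * 4 + ["Email + Call"] * 4,
    "nb_site_visits": [20, 22, 24, 26, 20, 22, 24, 26],
    "revenue": [95.0, 99.0, 94.0, 98.0, 150.0, 160.0, 170.0, 180.0],
})

# Pearson correlation between site visits and revenue within each method.
corr_by_method = {
    method: group["nb_site_visits"].corr(group["revenue"])
    for method, group in df.groupby("sales_method")
}
```

A correlation closer to 1 for the email and call group than for the others would support the observation that site visits are a stronger predictor of revenue for that group, though correlation alone does not show that more visits cause more revenue.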

The greater return when both email and call were used together suggests that customers are more receptive and more likely to buy when they feel that the company genuinely cares about them, values them and is building a relationship with them.

As an extension of that, Pens and Printers could also research email marketing campaigns and sequences, as well as offering customers discounts, for example as a gift for being a customer, to strengthen the brand-customer relationship.

For a deeper analysis, they could also include more detailed data. For example, they could collect the specific time and date of a customer order to investigate any seasonality or peak sale times.

If possible, they should also include more demographic data to explore whether there are any other factors relating to the customers themselves that influence revenue or sales, as opposed to purely the marketing/sales method. For example, does a certain age group tend to buy more? Are married couples with children more likely to buy than single adults without a family?
