# Data Science Soft Skills

Soft skills are probably the most underrated part of data science. Most data science curricula ignore them completely. Communication is key: can you write well? Can you listen well? Can you explain "statistically significant" to a business person (and do they even need to know)?

## A Data Science Framework

Kaggle's Use Cases page describes a number of questions you should ask yourself as you get involved in any data science project (although "revenue" will not always be a relevant metric, you should find one that is). Max Shron in his book [Thinking with Data](http://www.amazon.com/Thinking-Data-Turn-Information-Insights/dp/1449362931/ref=sr_1_1?ie=UTF8&qid=1440971384&sr=8-1&keywords=max+shron&pebp=1440971386106&perid=0S9QFGFM14D9X5V76D7N) describes a larger framework in greater detail. Additionally, I quite like the idea of data science as "thinking with data", the title of the book. Unlike most books on data science, Shron's deals *entirely* with the soft skills around data science.

There are different views about this softer side of data science, though. In the first camp are those who think that you should just unleash your data science team upon your data and come back in three months. In this view, data scientists play with the data and algorithms until they come up with a data product. I'm not sure this sort of data science crucible approach can or should be repeated at other organizations, at least not right out of the gate. I have had former students who were handed a data set by their manager with the instructions, "find something out". For reasons we will go into later, that isn't likely to work out all that well.

For the typical organization, it is probably a better idea to make sure that whatever the data scientists are working on has a direct impact on the organization's goals. If it's an NGO, the data scientists should be helping with the mission of the NGO. If it's a company, the data scientists should be helping with the goals of the company. The goals of the data science work should be developed with the input of all stakeholders. Shron provides a decent way of talking about this approach, so we'll follow his lead and have a "CoNVO".

### CoNVO

"CoNVO" is a mnemonic for Context, Need, Vision and Outcome. For any data science project, you should have a CoNVO.

* **Context** - What is the context of the need? Who are the stakeholders and other interested parties?
* **Need** - What is the organizational need which requires fixing with data?
* **Vision** - What is going to be required and what does success look like?
* **Outcome** - How will the result work itself back into the organization?
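The four elements above can be written down as a simple record so that a half-finished CoNVO is easy to spot. This is only a minimal sketch: the `Convo` class and its `missing_parts` helper are hypothetical names invented for illustration and do not come from Shron's book.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Convo:
    """Hypothetical record for a project's CoNVO (names are illustrative)."""
    context: str = ""
    need: str = ""
    vision: str = ""
    outcome: str = ""

    def missing_parts(self) -> List[str]:
        """Return the names of the elements that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# A draft with only the first two elements filled in.
draft = Convo(
    context="Marketing department of a shoe manufacturer; the VP of marketing decides.",
    need="There is no systematic way to select cities to advertise in.",
)
print(draft.missing_parts())  # ['vision', 'outcome']
```

A check like this only confirms that something was written for each element. Whether the Need is a real need rather than a disguised solution still takes a conversation with the stakeholders.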

Shron spends the first chapter of the book discussing these elements. For example, Need is *never* "the decision maker lacks a dashboard". That is a potential solution, not a need. The decision maker themself should be able to express the need in their own language, not in technobabble. He goes on, "A data science need is a problem that can be solved with knowledge, not a lack of a particular tool." The solution is never "the CEO needs Tableau".

With regard to data, Shron takes the opposite position from most people. Most people, I believe, would claim that if you think about your problem ahead of time and *then* get the data, you will limit your options. You should let the data talk. You should "play" with the data. Shron argues that because we start with a need (and not the solution), we are not constrained by the data when we identify a vision (a potential solution). Of course, our vision may be too grandiose at the start, especially once we find out what data we actually have or what data we can actually afford.

Vision is conveyed through mockups and argument sketches. A mockup is a low-level idealization of the final result of all our work. The mockup can be prose, a chart or charts, or a hypothetical model. Having a good mental library of examples (of mockups and argument sketches) is critical to coming up with a vision. The examples can be acquired by reading widely and experimenting.

Shron gives a few examples which I will go through in greater detail. In each case, what is the result of the CoNVO?

#### Refugee Non-Profit

* **Context** - A nonprofit reunites families that have been separated by conflict. It collects information from refugees in host countries. It visits refugee camps and works with informal networks in host countries. It has built a tool for helping refugees find each other. The decision makers are the CEO and CTO.
* **Need** - The non-profit does not have a good way to measure success. It is prohibitively expensive to follow up with every individual to see if they have contacted their families. By knowing when individuals are doing well or poorly, the non-profit will be able to judge the effectiveness of changes to its strategy.
* **Vision** - The non-profit that is trying to measure its successes will get an email of key performance metrics on a regular basis. The email will consist of graphs and automatically generated text.
    * **Mockup** - After making a change to our marketing, we hit an enrollment goal this week that we've never hit before, but it isn't being reflected in our success measures.
    * **Argument Sketch** - The nonprofit is doing well (poorly) because it has high (low) values for key performance indicators. After seeing the key performance indicators, the reader will have a good sense of the state of the non-profit's activities and will be able to make appropriate adjustments.
* **Outcome** - The metrics email for the nonprofit needs to be set up, verified and tweaked. The sysadmin at the nonprofit needs to be briefed on how to keep the email system running. The CTO and CEO need to be trained to read the metrics emails; the training will consist of a document written to explain them.

#### Marketing Department

* **Context** - A department in a large company handles marketing for a large shoe manufacturer with an online presence. The department's goal is to convince new customers to try its shoes and to convince existing customers to return. The final decision maker is the VP of marketing.
* **Need** - The marketing department does not have a smart way to select cities to advertise in. Right now it selects targets based on intuition, but it thinks there is a better way. With a better way of selecting cities, the department expects sales to go up.
* **Vision** - The marketing department will get a spreadsheet that can be dropped into the existing workflow. It will fill in some of the characteristics of a city and the spreadsheet will indicate what the estimated value would be.
    * **Mockup** - By inputting gender, age skew and performance results for 20 cities, an estimated return on investment is placed next to each potential new market. Austin, Texas is a good place to target based on gender, age skew, performance in similar cities and its total market size.
    * **Argument Sketch** - The department should focus on city X because it is most likely to bring in high value. The definition of high value that we use is substantiated for the following reasons.
* **Outcome** - The marketing team needs to be trained in using the model (or software) so that it can guide their decisions, and the success of the model needs to be gauged by its effect on sales. If the result ends up being a report instead, it will be delivered to the VP of Marketing, who will decide, based on the report's recommendations, which cities will be targeted and relay the instructions to the staff. To make sure everything is clear, there will be follow-up meetings two weeks and then two months after the delivery.

#### Media Organization

* **Context** - This news organization produces stories and editorials for a wide audience. It makes money through advertising and through premium subscriptions to its content. The main decision maker for this project is the head of online business.
* **Need** - The media organization does not have the right way to define an engaged reader. The standard web metric of unique daily users doesn't really capture what it means to be a reader of an online newspaper. When it comes to optimizing revenue, growth and promoting subscriptions, 30 different people visiting on 30 different days means something different than 1 person visiting for 30 days in a row. What is the right way to measure engagement that respects these goals?
* **Vision** - The media organization trying to define user engagement will get a report outlining why a particular user engagement metric is the best one, with supporting examples; models that connect that metric to revenue, growth and subscriptions; and a comparison against other metrics.
    * **Mockup** - Users who score highly on engagement metric A are more likely to be readers at one, three and six months than users who score highly on engagement metrics B or C. Engagement metric A is also more correlated with lifetime value than other metrics.
    * **Argument Sketch** - The media organization should use this particular engagement metric going forward because it is predictive of other valuable outcomes.
* **Outcome** - The report going to the media organization about engagement metrics will go to the head of online business. If she signs off on its findings, the selected user engagement metric will be incorporated by the business analysts into the performance metrics across the entire organization. Funding for existing and future initiatives will be based in part on how they affect the new engagement metric. A follow-up study will be conducted in six months to verify that the new metric is successfully predicting revenue.

Compare that to this:

> We will create a logistic regression of web log data using SAS to find patterns in reader behavior. We will predict the probability that someone comes back after visiting the site once.
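The Need in the media organization example rests on a concrete observation: a count of unique daily users cannot tell 30 different one-time visitors apart from one reader who returns 30 days in a row. The sketch below illustrates that with made-up visit records. The function names and the `(user, day)` tuple format are assumptions chosen for illustration, and "active days per user" is just one possible alternative, not the metric the organization would necessarily choose.

```python
from collections import defaultdict


def daily_unique_users(visits):
    """Map each day to the number of distinct users seen on that day."""
    users_by_day = defaultdict(set)
    for user, day in visits:
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}


def active_days_per_user(visits):
    """Map each user to the number of distinct days on which they visited."""
    days_by_user = defaultdict(set)
    for user, day in visits:
        days_by_user[user].add(day)
    return {user: len(days) for user, days in days_by_user.items()}


# Scenario A: 30 different people, each visiting on a different day.
scenario_a = [(f"user_{day}", day) for day in range(1, 31)]
# Scenario B: one person visiting 30 days in a row.
scenario_b = [("user_1", day) for day in range(1, 31)]

# Unique daily users looks identical in both scenarios: one user per day.
print(daily_unique_users(scenario_a) == daily_unique_users(scenario_b))  # True

# Active days per user separates them.
print(max(active_days_per_user(scenario_a).values()))  # 1
print(max(active_days_per_user(scenario_b).values()))  # 30
```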

#### Advocacy Group

* **Context** - This advocacy group specializes in ferreting out and publicizing corruption in politics. It is a small operation with several staff members who serve multiple roles. They are working with a software development team to improve their technology for tracking evidence of corrupt politicians.
* **Need** - The advocacy group doesn't have a good way to automatically collect and collate media mentions of politicians. With an automated system for collecting media attention, it will spend less time and money keeping up with the news and more time writing it.
* **Vision** - The developers working on the corruption project will get a piece of software that takes in feeds of media sources and rates the chances that a particular politician is being talked about. The staff will set a list of names and affiliations to watch for. The results will be fed into a database, which will feed a dashboard and email alert system.
    * **Mockup** - A typical alert is that politician X, who was identified based on campaign contributions as a target to watch, has suddenly shown up on 10 news talk shows.
    * **Argument Sketch** - We have correctly kept tabs on politicians of interest, and so the people running the anti-corruption project can trust this service to do the work of following names for them.
* **Outcome** - The media mention finder needs to be integrated with the existing mention database. The staff needs to be trained to use the dashboard. The IT person needs to be informed of the existence of the tool and taught how to maintain it. Periodic updates to the system will be needed to keep it correctly parsing new sources as bugs are uncovered. The developers who are doing the integration will be in charge of that. Three months after the delivery, we will follow up to check on how well the system is working.

*Thinking with Data* goes on to further explain Shron's framework.
