## X04. Approaching an AI Project

I've come up with the following approach to solving a particular business problem using Data Science. It draws on a few resources (mentioned above) and on lessons learned from our previous project.

Note that the steps below are presented in a linear fashion, but in practice this is likely to be an iterative process as you learn things, receive new data, go up blind alleys etc.

#### Engagement

Talk to people who understand 'things' better than you do. Building a rapport early is important. Spend a day sat with them in their area if necessary. The 'things' you need to understand are:

* **The Requirement:** User want / need. Deadline. Expectations. Is the Requirement documented? If not, document it. What other people do they recommend you speak to? Who has been working on this previously?
* **Domain Knowledge:** What area of the business is the requirement from? What's their wider role? Good domain knowledge is also important when it comes to feature selection / engineering. Read and listen lots!
* **The Current Solution:** How do they currently deal with this issue? Who created it? How does it work? What political considerations are there to upgrading this solution?
* **The Existing Data:** What data are you using? Is there a schema, descriptions or metadata for it? How was it sourced? Is it a sample or a full population? What pre-processing has the data undergone? How is it currently being used? What assumptions have been made in it? How big is it? Will it require platforming? Is it secure or sensitive? Will this cause issues? Consider anonymisation wherever possible.
* **Scope for New Data Sources:** Is there any scope for acquiring and using new data sources? Is this acceptable to the product owner? What barriers are there to acquiring these data sources? Does the wider business have any other data sources, or historical data that might be useful? Start the ball rolling on these early. These things can take time.
* **What work has been done previously:** What have they tried before? Is there a record of this? How did it work? Have any other areas of the business done something similar which might be useful?
* **What wider work / packages could support a solution?** What wider work has been done in this area? Are there any packages to support it? What can you leverage?  
<br/>

Some ideas of people to talk to:

* The **Product Owner** (obviously).
* The **Product User**.
* Any **Data Holders**.
* **Senior Leaders** in the Product Owner's area.
* **Analysts** in the Product Owner's area.
* **Other Data Scientists** inside and outside of your team and organisation.
* **At least one 'expert' data scientist:** You should always be trying to leverage your network for better business outcomes! How would they approach the problem and what solutions would they recommend? If you don't know an expert data scientist, find one!
<br/>

#### Data Analysis

* **Basic Analysis:** What's the quality of the data like? What's the quality of the labels, if indeed there are any? Are there any correlations? Is there much missing data?
* **Visualisation:** What patterns are there when visualising the data? What's the degree of variability, cardinality, skew etc.?
* **Dimensionality Reduction:** Which variables are highly correlated and can be removed / reduced?
* **Eyeball Test:** Take some samples and eyeball the data in csv format. Are there any patterns to missing data? Is there anything that sticks out (e.g. survey bias)?
<br/>
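
A minimal sketch of what this first pass could look like, assuming pandas is available. The function names, column names and values are made up for illustration, and the 0.9 correlation threshold is an arbitrary starting point rather than a recommendation.

```python
import pandas as pd


def basic_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column: dtype, share of missing values, cardinality, skew."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_frac": df.isna().mean(),
            "n_unique": df.nunique(),
            "skew": numeric.skew(),  # stays NaN for non-numeric columns
        }
    )


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Hypothetical toy data: the same height recorded in two units, plus a category.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0, None],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
        "region": ["N", "S", "S", None, "N"],
    }
)

profile = basic_profile(df)
pairs = highly_correlated_pairs(df)
print(profile)
print(pairs)
```

Here `height_cm` and `height_in` carry the same information, so they show up as a highly correlated pair and one of them is a candidate for removal.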

#### Plan a solution

*"Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t"* - Google

* **Keep it Simple:** What's the simplest approach to solving this problem, ideally not involving Machine Learning? If it's NLP can you create a simple solution with regular expressions? Can you build a simple if / else classifier? What can you build and track with the current system?
* **Create Metrics:** Create simple metrics in the data. These may come in handy later.
* **What's palatable to the business?:** Will they be happy with a deep neural network where you might not be quite sure about how it works? Consider explainability of potential solutions. What scope for innovation is there?
* **Be ethical:** Always consider your actions. 'Could' does not mean 'Should'. Refer to and consider the [GDS Data Science Ethical Framework](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/524298/Data_science_ethics_framework_v1.0_for_publication__1_.pdf).    
<br/>
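
As a rough illustration of the "Keep it Simple" point, here is a sketch of a regular-expression classifier for short free-text messages. The labels, patterns and the `classify` function are all invented for this example; real rules would come out of the engagement and eyeballing steps.

```python
import re

# Hypothetical rules for routing free-text messages. Order matters:
# the first matching rule wins.
RULES = [
    ("refund", re.compile(r"\b(refund|money back|reimburse\w*)\b", re.IGNORECASE)),
    ("delivery", re.compile(r"\b(deliver\w*|parcel|courier)\b", re.IGNORECASE)),
]


def classify(text: str, default: str = "other") -> str:
    """Return the label of the first rule whose pattern matches the text."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return default


for message in ["I want my money back", "Where is my parcel?", "Hello there"]:
    print(message, "->", classify(message))
```

Something this small is easy to explain to the business, and it gives you a first set of predictions to track against whatever comes later.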

#### Data Stuff

* **Aim to write 'Enterprise Level' ETL / Engineering / Pipelining code from the outset:** Not to be confused with your ML pipeline!! The ETL pipeline is about taking the data from source and creating a dataset that's cleaner and fit to start the ML pipelining process from. This will likely be a massive time saver. As you get new data, receive updated data, receive a backseries, accidentally delete your datasets etc. you'll not want to be performing a manual load each time. Think of this as an investment that will save you time later on.
* **Datasets should be immutable:** If you transform a dataset, don't overwrite the existing one. Doing so can cause errors to go unnoticed.
* **Clean up the Dataset:** Get rid of blank rows or useless variables. Create succinct and meaningful variable names and categorical variables. Impress this upon your product owner and the people working with you also! Get them to buy into it.
* **Simplify the Data wherever possible:** Remove highly correlated variables, perform dimensionality reduction, combine variables to create averages etc.
* **Consider Platforming the data:** csv's may be fine, but cut your cloth according to the solution. If you're building a system rather than a piece of analysis you might want to consider SQL / NoSQL structures.  
* **Adopt Standards:** If the current data doesn't have standards then make some. Wherever possible adopt industry standards (e.g. consistency of variable names across datasets and versions, no spaces in variable names, rounded variable values etc.)
* **Document Early and update often:** Ensure that the data and ETL process are documented thoroughly to enable others to pick them up and also to remember what you did six months ago. The README.md is your friend!  
<br/>
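
To make a few of these points concrete, here is a small sketch of a repeatable ETL step, assuming pandas and a CSV source. The function names, the `_clean_<version>` file naming and the snake_case convention are my own illustrative choices, not an established standard.

```python
import re
from pathlib import Path

import pandas as pd


def to_snake_case(name: str) -> str:
    """Turn 'Total Sales (GBP)' into 'total_sales_gbp'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    df = raw.dropna(how="all")                  # drop fully blank rows
    df = df.dropna(axis="columns", how="all")   # drop fully empty columns
    return df.rename(columns=to_snake_case)     # no spaces in variable names


def run_etl(raw_path: Path, out_dir: Path, version: str) -> Path:
    """Read the raw CSV, clean it, and write a new versioned file.

    The raw file is never modified, and an existing output is never
    overwritten, so every dataset stays immutable.
    """
    out_path = out_dir / f"{raw_path.stem}_clean_{version}.csv"
    if out_path.exists():
        raise FileExistsError(f"{out_path} already exists; pick a new version")
    out_dir.mkdir(parents=True, exist_ok=True)
    clean(pd.read_csv(raw_path)).to_csv(out_path, index=False)
    return out_path
```

A call such as `run_etl(Path("data/raw/sales.csv"), Path("data/clean"), "v1")` (hypothetical paths) would then be all you need to rerun when updated data arrives.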


#### Create an MVP

*"If you're not at least a little bit ashamed of it, it's probably not an MVP"* - Tom Ewing (me)

* **Come up with an MVP:** Create the simplest MVP possible. This could be a simple algorithm, if/then statements instead of a classifier, regex instead of NLP etc. This is vital as it will serve as a useful benchmark for later on.
* **Plan for the future:** You should be ashamed of your MVP, but not of things like your documentation or data pipeline (see the Data Stuff section).
* **Bring people with you:** Keep your Product Owner updated on development and if possible involve them in the solution. This will help get buy-in to your product at an early stage and they can act as a champion for the product in their area of the business.
* **Don't be afraid to iterate:** If your MVP shows promise, feel free to iterate on it. The goal is to solve a problem, not to do 'Data Science' stuff for the sake of it.  
<br/>  
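
A sketch of how an if/then MVP can double as a benchmark. Everything here is hypothetical: the churn scenario, the records, and the thresholds in `mvp_rule` are invented to show the mechanics of scoring a rule against an even simpler "always predict the majority label" baseline.

```python
from collections import Counter

# Hypothetical labelled examples:
# (days_since_last_order, has_open_complaint, churned)
records = [
    (5, False, False),
    (40, False, False),
    (95, False, True),
    (120, True, True),
    (10, True, True),
    (200, False, True),
    (15, False, False),
    (60, False, True),
]


def mvp_rule(days_since_last_order: int, has_open_complaint: bool) -> bool:
    """Hand-written if/then rule standing in for a trained classifier."""
    if has_open_complaint:
        return True
    return days_since_last_order > 90


def accuracy(predictions, labels) -> float:
    """Share of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


labels = [churned for _, _, churned in records]
majority_label = Counter(labels).most_common(1)[0][0]

majority_accuracy = accuracy([majority_label] * len(labels), labels)
mvp_accuracy = accuracy([mvp_rule(d, c) for d, c, _ in records], labels)

print(f"majority baseline: {majority_accuracy:.3f}")
print(f"if/then MVP:       {mvp_accuracy:.3f}")
```

Any later model then has two numbers to beat, and if it can't beat the if/then rule by a meaningful margin, the extra complexity is hard to justify.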

#### Apply AI  

If the MVP isn't good enough, it may be time to build an AI-based solution.

<br/>  

#### Pipelining  

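1. Deal with missing values (scikit-learn's `SimpleImputer`, which replaced the older `Imputer`)
2. Encode categorical variables (scikit-learn's `OneHotEncoder` or `OrdinalEncoder` for features; `LabelEncoder` is intended for target labels)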
3. Consider Feature Scaling if the variables have vastly different scales  
4. Consider adding hyperparameter options based upon the variables in the dataset.
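
The steps above can be sketched as a single scikit-learn pipeline. This is one possible arrangement, not the only one: it uses `SimpleImputer` (the current scikit-learn replacement for the older `Imputer` class), and the column names, toy data and choice of `LogisticRegression` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names.
numeric_cols = ["age", "income"]
categorical_cols = ["region"]

# Steps 1 and 3: impute missing numbers, then put them on a common scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Steps 1 and 2: impute missing categories, then one-hot encode them.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Toy data with gaps in every column.
X = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [20_000, 35_000, 40_000, np.nan, 80_000, 52_000],
    "region": ["north", "south", np.nan, "north", "south", "south"],
})
y = [0, 0, 0, 1, 1, 1]

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because every step lives inside the pipeline, preprocessing choices become tunable parameters too. For example, the numeric imputation strategy is addressable as `preprocess__num__impute__strategy`, which is one way to read step 4.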

#### Building / Running the model  

1. What's the error?
2. Is the model overfitting? 
3. Is the model underfitting?
4. Consider trying different model types.
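
One simple way to answer questions 2 and 3 is to compare the score on the training data with the score on held-out validation data. The sketch below uses synthetic data and decision trees of different depths as stand-ins; the dataset parameters and depths are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset; flip_y adds label noise.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def train_val_scores(model):
    """Fit on the training split; return (train accuracy, validation accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)


scores = {}
for depth in (1, 4, None):  # None lets the tree grow until its leaves are pure
    scores[depth] = train_val_scores(
        DecisionTreeClassifier(max_depth=depth, random_state=0)
    )
    train_acc, val_acc = scores[depth]
    print(f"max_depth={depth}: train={train_acc:.2f} validation={val_acc:.2f}")
```

As a rule of thumb, a training score far above the validation score suggests overfitting, while two similarly poor scores suggest underfitting and a model that is too simple for the data.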

#### Fine Tuning  

1. Consider Grid Search, Randomised Search or Ensemble Methods
2. Analyse the best models and their errors. What variables can we remove?
3. Evaluate on the test set.
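
A sketch of these three steps with scikit-learn, assuming a random forest (itself an ensemble method) and a deliberately tiny parameter grid. The data is synthetic and the grid values are placeholders; `RandomizedSearchCV` could be swapped in for larger search spaces.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
# The test set is held back and not touched during the search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: cross-validated grid search on the training data only.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# Step 2: inspect the best model; low importances hint at removable variables.
importances = search.best_estimator_.feature_importances_
least_useful = importances.argsort()[:3]

# Step 3: evaluate the refitted best model on the test set.
test_accuracy = search.score(X_test, y_test)

print("best params:", search.best_params_)
print("least useful feature indices:", least_useful)
print(f"test accuracy: {test_accuracy:.2f}")
```

Keeping the test set out of the search means the final score is an honest estimate, so it is worth evaluating on it only once you have stopped tuning.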






#### Sources & Further Reading

[Seven Practical Ideas for Beginner Data Scientists](https://medium.com/nulogy/seven-practical-ideas-for-beginner-data-scientists-9af97aeb88ab)  
[Google's Rules of ML](https://developers.google.com/machine-learning/guides/rules-of-ml/)  
[Doing Data Science Right](http://firstround.com/review/doing-data-science-right-your-most-common-questions-answered/)  
[GDS Data Science Ethical Framework](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/524298/Data_science_ethics_framework_v1.0_for_publication__1_.pdf) 