## Scientific Thinking for Business

Data science involves lots of investigation via trial and error. The investigations are based on evidence and this is one of the strongest reasons why data science is considered a "real" science.

**Science is a process and the route to solving problems is not always direct**

Pulling in data and jumping right into exploratory data analysis can make your work prone to exactly the types of negative issues that plague data science today.

=> At the heart of this problem is the process of communicating results to leadership. It should begin with a *meaningful and well-articulated business opportunity*. If that opportunity is stated too simply, say as "increasing overall revenue", then the central talking point is too vague to be meaningful from the data side.

`Important`: The business scenario needs to be communicated in a couple of ways:

1. Stated in a testable way in terms of data
2. Stated in a clear way that minimizes the influence of confounding factors

### Testable hypotheses

There is no one single best way to articulate a business opportunity as a testable hypothesis. In some cases the statement will be intuitive, but in other cases there will be some back and forth with stakeholders and domain experts.

#### Guidelines for creating testable hypotheses

1. **Become a scientist of the business**
    Spend a little bit less time learning new algorithms and Python packages and more time learning the levers that make your specific business go up or down and the variables that impact those levers.

2. **Make an effort to understand how data are produced**

    If the data come from a database you should ask about the process by which the data are stored. If the data are compiled by another person then dig into the details and find out about the compiling process as well as the details of what happened before the data arrived on their desk.

3. **Make yourself part of the business**
    Do not under any circumstances become siloed. Proactively get involved with the business unit as a partner, not a support function.

4. **Think about how to measure success**
    When thinking about what course of action might be most appropriate, keep at the forefront of your mind how you will measure business value when said action is complete.


### Thinking scientifically about the business scenario

A major goal of this process is to make the *business objectives clear to leadership*. Some of these individuals are technical and some are not, so as a rule of thumb, get in the habit of articulating the business problem at a level that everyone can understand. *Stakeholders and leadership need to know what you are trying to accomplish* before you begin work. They also need to be aware from the start of what *success would look like*.

### The Scientific Method

`=>` The process by which science is carried out. The general idea is to build on previous knowledge in order to improve our understanding of a given topic.

1. Formulate the **Question**
2. Generate a **Hypothesis** to address the question
3. Make a **Prediction**
4. Conduct an **Experiment**
5. **Analyze** the data and draw conclusions

**Question**

The question can be open-ended and generally it summarizes your business opportunity. Let’s say you work for a small business that manufactures sleds and other winter gear and you are not sure in which cities to build your next retail locations. You have heard that Utah, Colorado and Vermont are all states with high snowfall, but it is unclear which one gets the most.

**Hypothesis**

Because the Rocky Mountains are higher in elevation and are well known for fresh powder on their ski slopes, you hypothesize that both Utah and Colorado have more snow than Vermont.

**Prediction**

If you were to run a hypothesis test, you would find that Vermont has significantly less snowfall than Colorado or Utah.

**Experiment**

```python
import pandas as pd

df = pd.read_csv('snowfall.csv')

# subset data to focus on states of interest
df1 = df[df['state'].isin(['CO', 'UT', 'VT'])]

# create a pivot on relevant summary data
df1_pivot = pd.pivot_table(df1, values='snowfall', index='state',
                           aggfunc=['count', 'mean', 'max'])

print(df1_pivot)
```

```
         count     mean      max
      snowfall snowfall snowfall
state                           
CO           5    37.76     59.6
UT           2    51.65     58.2
VT           1    80.90     80.9
```


**Analyze**

There are not enough data to run a meaningful 1-way ANOVA. The experiment is not a failure, though; it still yields a few pieces of information.

1. There are not enough data
2. There is a small possibility that VT gets more snow on average than either CO or UT
3. Our degree of belief in the conclusion drawn from (2) is very small because of (1)
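
As a rough sketch of why the sample is too thin, the group counts from the pivot table above can be turned into the degrees of freedom a 1-way ANOVA would have. The function names and the `min_per_group` threshold are assumptions made for illustration; the threshold is a rule of thumb, not something the analysis above specifies.

```python
# Group sizes taken from the "count" column of the pivot table above.
counts = {"CO": 5, "UT": 2, "VT": 1}


def anova_degrees_of_freedom(group_sizes):
    """Return (between, within) degrees of freedom for a one-way ANOVA."""
    k = len(group_sizes)            # number of groups
    n = sum(group_sizes.values())   # total number of observations
    return k - 1, n - k


def undersampled_groups(group_sizes, min_per_group=3):
    """List groups with fewer observations than an assumed minimum."""
    return sorted(g for g, size in group_sizes.items() if size < min_per_group)


df_between, df_within = anova_degrees_of_freedom(counts)
print(df_between, df_within)        # 2 5
print(undersampled_groups(counts))  # ['UT', 'VT']
```

With a single VT observation there is no way to estimate the spread within that group, so any comparison involving VT rests on one data point.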

The notion of degree of belief is central to scientific thinking. In science, the degree of belief in a hypothesis is proportional to the evidence that supports it.

Evidence is derived from the process described above and if we have none then we are stuck at the question stage and a proper scientific hypothesis cannot be made.

The other important side to degree of belief is that it never caps out at `100 percent certainty`. Some hypotheses have become laws like `Newton’s Law of Gravitation`, but most natural phenomena in the world outside of physics cannot be explained as a law.
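
One common way to make degree of belief concrete is Bayesian updating. This is an illustration chosen here, not a method these notes prescribe. The sketch below updates a belief using Bayes' rule in odds form; the starting belief of 0.5 and the likelihood ratio of 3 are made-up values.

```python
def update_belief(prior, likelihood_ratio):
    """Apply Bayes' rule in odds form and return the posterior probability."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


belief = 0.5                      # hypothetical starting belief
history = [belief]
for _ in range(10):               # ten hypothetical pieces of supporting evidence
    belief = update_belief(belief, likelihood_ratio=3.0)
    history.append(belief)

# Each observation raises the belief, but it stays below 1.
print([round(b, 4) for b in history])
```

Each piece of evidence moves the belief upward, yet after ten updates it is still short of certainty, which mirrors the point that belief tracks evidence without ever reaching 100 percent.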

A hypothesis is the simplest explanation of a phenomenon. *A scientific theory is an in-depth explanation of the observed phenomenon.*

There are additional factors like *external peer review* that help ensure the integrity of the scientific method. When implementing a model for a specific business task, this could mean assigning reviewers for a pull request or simply asking other qualified individuals to check over your work.

## Gathering Data
### Documenting Data

Documenting data has the potential to:

1. Streamline the modeling process
2. Ensure that all future data come in improved form

### ETL

The process of gathering data is often referred to as `Extract`, `Transform` and `Load`.

To ensure that projects are completed in a reasonable amount of time, the initial pass at **ETL** should use a simple format like CSV. A more complex system can be built out once you have delivered the `Minimum Viable Product (MVP)`.
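
A first-pass ETL over CSV can be as small as three functions. The sketch below is one possible layout using only the standard library; the function names are hypothetical, the `state` and `snowfall` columns are borrowed from the earlier example, and the values are placeholders rather than real measurements.

```python
import csv
import io

# Hypothetical raw input; the values are placeholders, not real measurements.
RAW_CSV = """state,snowfall
CO,10.0
UT,
VT,30.0
"""


def extract(text):
    """Extract: read delimited text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop rows with missing snowfall and cast values to float."""
    return [
        {"state": r["state"], "snowfall": float(r["snowfall"])}
        for r in records
        if r["snowfall"].strip()
    ]


def load(records):
    """Load: write the cleaned records back out as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["state", "snowfall"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


cleaned = load(transform(extract(RAW_CSV)))
print(cleaned)
```

Keeping the three stages separate means the `load` step can later be swapped for a database writer without touching the extraction or cleaning logic.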

### Common methods of gathering data

1. Plain text files
2. Delimited files
3. JSON files
4. Relational Databases
5. NoSQL databases
6. Web scraping and APIs
7. Streaming data

    Data streams become important when the data of a project or company becomes mature and the AI pipeline is connected to it.
8. Apache Hadoop Distributed File System (HDFS)
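
Whatever the source, the usual first goal is to get records into one common shape. The sketch below shows this for two of the sources listed above, a JSON payload and a relational table, using only the standard library. The table name, column names, and values are hypothetical placeholders.

```python
import json
import sqlite3

# Hypothetical JSON payload with placeholder values.
json_text = '[{"state": "CO", "snowfall": 10.0}, {"state": "VT", "snowfall": 30.0}]'
json_records = json.loads(json_text)

# Hypothetical in-memory relational table with the same columns.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # rows can then be converted to dicts
conn.execute("CREATE TABLE snowfall (state TEXT, snowfall REAL)")
conn.execute("INSERT INTO snowfall VALUES (?, ?)", ("UT", 20.0))
db_records = [dict(row) for row in conn.execute("SELECT state, snowfall FROM snowfall")]
conn.close()

# Both sources now share one shape: a list of dicts.
records = json_records + db_records
print(records)
```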

`HINTS:`

1.  A best practice when loading data from plain text or delimited files is to separate the code for parsing into its own script. Because the files are read line by line in a separate Python call, it is more memory efficient and this separation of tasks helps with automation and maintenance.
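
A minimal sketch of such a parsing module is shown below, assuming a delimited file with `state` and `snowfall` columns. The function names are hypothetical, and the demo file holds placeholder values. The generator yields one record at a time, so the whole file never has to sit in memory.

```python
import csv
import os
import tempfile


def iter_records(path):
    """Yield one parsed record at a time instead of loading the whole file."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield {"state": row["state"], "snowfall": float(row["snowfall"])}


def mean_snowfall(path):
    """Compute a running mean so only one row is held in memory at a time."""
    total, count = 0.0, 0
    for record in iter_records(path):
        total += record["snowfall"]
        count += 1
    return total / count if count else float("nan")


# Demo with a throwaway file holding placeholder values.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("state,snowfall\nCO,10.0\nUT,20.0\nVT,30.0\n")

demo_mean = mean_snowfall(tmp.name)
os.remove(tmp.name)
print(demo_mean)
```

Because the parsing logic lives in its own functions, a scheduler or another script can import and reuse it without pulling in any analysis code.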

`Question:`
When embarking on a data science project, why do you ultimately want to format your data so that it can be housed in something like a Pandas DataFrame or NumPy Array?

`Answer:`
Because most data science projects have a modeling component and most models depend on mathematical algorithms where a single observation is represented by a vector and a set of observations is represented by a matrix. Vectors and matrices are a central part of NumPy functionality, and Pandas Series and DataFrames extend this functionality when working with more heterogeneous data.
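
The vector and matrix view can be illustrated with a few lines of NumPy. In this sketch the feature values and the `weights` coefficients are made up; the point is only that a single observation is a row vector and that a model can score every observation with one matrix-vector product.

```python
import numpy as np

# Hypothetical observations: each row is one observation, each column one feature.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])
weights = np.array([0.5, 0.25])   # made-up model coefficients

single_observation = X[0]                      # a vector of shape (2,)
one_prediction = single_observation @ weights  # dot product for one observation
all_predictions = X @ weights                  # one product scores every row

print(X.shape, single_observation.shape)
print(all_predictions)
```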