# Data Science Workflow
<hr>

![workflow](./img/ds-workflow.PNG)

## Step 1: Acquire
<hr>

### Explore Problem
- **Get the right question:** What problem am I trying to solve?

### Identify Data
- The data is either provided by the client or obtained online.

### Import Data
- Load data into a pandas DataFrame from sources such as `Excel`, `CSV`, `Database`, `Parquet`, or web scraping.
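
As a minimal sketch of importing data, the example below reads a CSV and queries a database with pandas. The in-memory CSV text, the table name `sales`, and the column names are made up for illustration; in practice you would pass a file path or a real connection. `pd.read_excel`, `pd.read_parquet`, and `pd.read_html` follow the same pattern but need optional dependencies installed.

```python
import io
import sqlite3

import pandas as pd

# CSV: an in-memory string stands in for a file path such as "data.csv".
csv_text = "id,city,sales\n1,Oslo,100\n2,Lagos,250\n3,Lima,175\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Database: an in-memory SQLite database stands in for a real connection.
conn = sqlite3.connect(":memory:")
df_csv.to_sql("sales", conn, index=False)  # hypothetical table name
df_db = pd.read_sql("SELECT city, sales FROM sales WHERE sales > 150", conn)
conn.close()

print(df_csv.shape)
print(df_db)
```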

### Combine Data
- It is often necessary to combine data from different sources.

Use: `concat()`,`join()`, or `merge()`
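
The three functions cover different situations. The sketch below uses small made-up tables (`customers`, `orders_q1`, `orders_q2`) to show one typical use of each.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ada", "Ben", "Cy"]})
orders_q1 = pd.DataFrame({"customer_id": [1, 2], "amount": [50, 20]})
orders_q2 = pd.DataFrame({"customer_id": [2, 4], "amount": [70, 10]})

# concat(): stack frames that share the same columns.
orders = pd.concat([orders_q1, orders_q2], ignore_index=True)

# merge(): SQL-style join on a shared key column.
merged = customers.merge(orders, on="customer_id", how="left")

# join(): combine on the index instead of a column.
totals = orders.groupby("customer_id")["amount"].sum().rename("total")
joined = customers.set_index("customer_id").join(totals)

print(merged)
print(joined)
```

With `how="left"`, customers without orders are kept and get `NaN` in the order columns.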

## Step 2: Prepare
<hr>

### Explore Data

**Simple Exploration:** `head()`,`.shape`,`.dtypes`,`info()`,`describe()`,`isna()`
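
A quick sketch of these calls on a small, made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lagos", "Lima", "Oslo"],
    "sales": [100.0, 250.0, np.nan, 175.0],
})

print(df.head(2))       # first rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
df.info()               # column types and non-null counts
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # number of missing values per column
```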

**Groupby, Counts and Statistics**
- Group by the non-numeric (categorical) columns.
- Count the rows in each group to see how much each group contributes to the results.
- Check the mean of the values in each group.
- Standard deviation: a measure of how dispersed (spread out) the data is in relation to the mean.
- Box plot: a visual alternative to `describe()`.
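
The grouping steps above can be sketched as follows. The `region` and `sales` columns are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Count, mean, and (sample) standard deviation of sales per region.
stats = df.groupby("region")["sales"].agg(["count", "mean", "std"])
print(stats)
```

Both regions have the same mean here, but the standard deviation shows that sales in `south` are more spread out.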

### Visualize Data
- Line plot
- Scatter plot
- Pie chart
- Histogram
- Bar chart
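
A minimal sketch of the chart types above, assuming matplotlib is installed and using pandas' built-in plotting. The data and column names are invented, and the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "sales": [10, 12, 9, 15, 18, 17],
    "ads": [1, 2, 1, 3, 4, 4],
})

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
df.plot(x="month", y="sales", ax=axes[0, 0], title="Line plot")
df.plot.scatter(x="ads", y="sales", ax=axes[0, 1], title="Scatter plot")
df["sales"].plot.hist(bins=4, ax=axes[0, 2], title="Histogram")
df.plot.bar(x="month", y="sales", ax=axes[1, 0], title="Bar chart")
df.set_index("month")["sales"].plot.pie(ax=axes[1, 1], title="Pie chart")
axes[1, 2].axis("off")  # unused panel
fig.tight_layout()
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` or `plt.show()` would display it.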

### Cleaning Data
- `dropna()`: Remove missing values.
- `fillna()`: Fill NA/NaN values using specified method.
- `drop_duplicates()`: Return DataFrame with duplicate rows removed.
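
A short sketch of the three cleaning calls on made-up data. Filling with the column mean is just one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Ben", "Ben", "Cy"],
    "age": [36.0, np.nan, np.nan, 41.0],
})

# The second "Ben" row is an exact duplicate and is removed.
deduped = df.drop_duplicates()

# Option 1: remove rows that contain any missing value.
dropped = deduped.dropna()

# Option 2: keep the rows and fill the gap, here with the column mean.
filled = deduped.fillna({"age": deduped["age"].mean()})

print(dropped)
print(filled)
```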

**Working with Time Series**
- `reindex()`: Conform Series/DataFrame to new index with optional filling logic
- `interpolate()`: Fill NaN using an interpolation method.
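
As an illustration, the sketch below reindexes a made-up daily series that is missing two days and then fills the gaps. The dates and values are invented.

```python
import pandas as pd

# Daily readings with 3 and 4 January missing from the index.
s = pd.Series(
    [10.0, 12.0, 18.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
)

full_index = pd.date_range("2024-01-01", "2024-01-05", freq="D")
reindexed = s.reindex(full_index)        # missing days appear as NaN
interpolated = reindexed.interpolate()   # linear interpolation by default

print(interpolated)
```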

## Step 3: Analyze

<hr>

### Split into Train and Test
- Assign the independent features (inputs) to `X`
- Assign the dependent variable (target) to `y`
- Divide into training and test sets
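
These three steps can be sketched as follows. The housing-style columns (`area`, `rooms`, `price`) are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "area": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    "rooms": [1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
    "price": [150, 180, 200, 240, 260, 300, 320, 350, 400, 420],
})

X = df[["area", "rooms"]]  # independent features
y = df["price"]            # dependent variable (target)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```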

### Feature Scaling
- **Normalization (MinMaxScaler)**
```Python
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)
```

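**N.B:** Fit the scaler on the training data only, then use it to transform both the training and the test set. Fitting on the test data would leak information about it into the model.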

- **Standardization (StandardScaler)**
```Python
from sklearn.preprocessing import StandardScaler
stand = StandardScaler().fit(X_train)
X_train_stand = stand.transform(X_train)
X_test_stand = stand.transform(X_test)
```
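
To make clear what the two scalers compute, the sketch below reproduces their output by hand on a tiny made-up array. It also shows a consequence of fitting only on the training data: a test value outside the training range is scaled beyond 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[6.0]])

norm = MinMaxScaler().fit(X_train)
stand = StandardScaler().fit(X_train)

# MinMaxScaler: (x - min) / (max - min), using the training min and max.
train_min, train_max = X_train.min(axis=0), X_train.max(axis=0)
manual_norm = (X_test - train_min) / (train_max - train_min)

# StandardScaler: (x - mean) / std, using the training mean and population std.
manual_stand = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(norm.transform(X_test), manual_norm)
print(stand.transform(X_test), manual_stand)
```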

### Feature Selection

Feature selection aims for higher accuracy and simpler models, which reduces the risk of overfitting.

* **Filter Methods**

*Examples: Chi square, Information gain, Correlation Score, Correlation matrix with Heatmap.*

* **Wrapper Methods**

*Examples: Best-first search, Random hill-climbing algorithm, forward selection, backward elimination.*

* **Embedded Methods**

*Examples: LASSO, Elastic Net, Ridge Regression*
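
As a rough sketch, the example below shows one filter method (chi-square via `SelectKBest`) and one embedded method (LASSO) with scikit-learn. The datasets, `k=2`, and `alpha=1.0` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import Lasso

# Filter method: keep the 2 features with the highest chi-square score.
X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2).fit(X, y)
X_filtered = selector.transform(X)
print(selector.get_support())  # boolean mask of the selected columns

# Embedded method: LASSO can shrink coefficients to exactly zero,
# which removes those features from the model.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
lasso = Lasso(alpha=1.0).fit(X_reg, y_reg)
kept = [i for i, coef in enumerate(lasso.coef_) if coef != 0]
print(kept)  # indices of features with non-zero coefficients
```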


### Model Selection
* The process of selecting a final model from a collection of candidate machine learning models.

**Model Selection Techniques**
- **Probabilistic Measures:** Score each model on both its performance and its complexity.
- **Resampling Methods:** Repeatedly split the data into sub-train and sub-test sets and compare models by the mean score over the repeated runs.
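
A resampling approach can be sketched with k-fold cross-validation. The two candidate models and the iris dataset are picked only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svc": SVC(),
}

# 5-fold cross-validation: train on 4 folds, score on the 5th,
# repeat for every fold, then average the 5 scores.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best = max(mean_scores, key=mean_scores.get)
print(mean_scores, best)
```

The model with the highest mean score is a reasonable first pick, but the difference between candidates should be weighed against their complexity.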

* `LinearRegression()`
```Python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# X (features) and y (numeric target) are assumed to be defined already.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin = LinearRegression()
lin.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = lin.predict(X_test)
r2_score(y_test, y_pred)
```

* `SVC()`: Support Vector Classification
```Python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X (features) and y (class labels) are assumed to be defined already.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svc = SVC()
svc.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)
```

### Analyze Result
This is the main **check-point** of your analysis.

- Review the **Problem** and **Data Science problem** you started with.
    - The analysis should add value to the **Data Science Problem**
    - Sometimes our focus drifts - we need to ensure alignment with original **Problem**.
    - Go back to the **Exploration** of the **Problem**: does the result add value to the **Data Science Problem** and to the initial **Problem** that shaped it?
    - *Example:* As Data Scientists we often find the research itself valuable, but a business is usually interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
    
- Did we learn anything?
    - Do the **Data-Driven Insights** add value?
    - *Example:* Does it add value to have evidence that wealthy people buy more expensive cars?
        - Confirming this hypothesis might be valuable to you, but does it add any value for a car manufacturer?
        
- Can we make any valuable insights from our analysis?
    - Do we need more/better/different data?
    - Can we give any actionable, data-driven insights?
    - It is always tempting to ask for more, better, and more accurate data.
    
- Do we have the right features?
    - Do we need to eliminate features?
    - Is the data cleaning appropriate?
    - Is data quality as expected?
    
- Do we need to try different models?
    - Data Analysis is an iterative process
    - Simpler models are often more powerful
    
- Can the result be inconclusive?
    - Can we still give recommendations?

#### Quote
> *“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”* 
> - Sherlock Holmes
 
#### Iterative Research Process

- **Observation/Question**: Starting point (could be iterative)
- **Hypothesis/Claim/Assumption**: Something we believe could be true
- **Test/Data collection**: We need to gather relevant data
- **Analyze/Evidence**: Based on the data collected, did we get evidence?
    - Can our model predict? (a model only becomes useful once it can predict)
- **Conclude**: *Warning!* E.g.: We can conclude a correlation (this does not mean A causes B)
    - Example: Based on the collected data we can see a correlation between A and B
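
To illustrate the warning about conclusions, the sketch below builds two synthetic variables that are both driven by a third one. All names and numbers are invented. The two variables are strongly correlated even though neither causes the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical common cause that drives both variables.
temperature = rng.normal(25, 5, size=200)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 2, size=200)
sunburn_cases = 0.5 * temperature + rng.normal(0, 1, size=200)

df = pd.DataFrame({
    "ice_cream_sales": ice_cream_sales,
    "sunburn_cases": sunburn_cases,
})

# Pearson correlation between the two variables.
corr = df["ice_cream_sales"].corr(df["sunburn_cases"])
print(round(corr, 2))
```

From this data alone we can conclude that A and B are correlated, not that ice cream sales cause sunburn.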

## Step 4: Report

<hr>

### Present Findings

- You need to *sell* or *tell* a story with the findings.
- Who is your **audience**?
    - Focus on technical level and interest of your audience
    - Speak their language
    - Story should make sense to audience
    - Examples
        - **Team manager**: Might be technical, but often busy and only interested in high-level status and key findings.
        - **Data engineer/science team**: Interested in the technical exploration and shares similar interests with you
        - **Business stakeholders**: These might be end-customers or collaborators in other business units.
- When presenting
    - **Goal**: Communicate actionable insights to key stakeholders
    - Outline (inspiration):
        - **TL;DR** (Too long; didn’t read) - a clear and concise summary of the content (often one line) that frames key insights in the context of impact on key business metrics.
        - Start with your understanding of the business problem
        - How it translates into a Data Science Problem
        - How we will measure impact - which business metrics indicate results
        - What data is available and used
        - Present the research hypothesis
        - A visual presentation of the insights (model/analysis/key findings)
            - This is where you present the evidence for the insights
        - How to use the insights and turn them into actions
        - Follow-up and continuous learning to increase value

### Visualize Results
- Telling a story with the data
- This is where you convince the audience that the findings/insights are correct
- The right visualization is important
    - Example: A correlation matrix might give a Data Engineer insight into how findings were discovered, but confuse business partners.

#### Resources for visualization
- [Seaborn](https://seaborn.pydata.org) Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- [Plotly](https://plotly.com) An open-source library for interactive charts and analytic apps in Python.
- [Folium](http://python-visualization.github.io/folium/) makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map.

### Credibility Counts
- This is the checkpoint for whether your research is valid
    - Are you hiding findings you did not like (not supporting your hypothesis)?
    - Remember it is the long-term relationship that counts
- Don't leave out results
    - We learn from data and find hidden patterns, to make data-driven decisions, with a long-term perspective

## Step 5: Actions

<hr>

### Use Insights
- How do we follow up on the presented **Insights**?
- **No one-size-fits-all**: It depends on the **Insights** and **Problem**
- *Examples:*
    1. **Problem**: Which customers are most likely to cancel their subscription?
        - Say we have insufficient knowledge of the customers and need more, so we recommend gathering more insights
        - But you should still try to add value
    2. **Problem**: Here is our data - find valuable insights!
        - This is a challenge as there is no given focus
        - An iterative process that involves the customer helps avoid surprises

### Measure Impact
- If the customer cannot measure the impact of your work, they do not know what they are paying for.
    - If you cannot measure it, you cannot know whether your hypotheses are correct.
    - A model only becomes valuable when it can be used to predict with some certainty
- There should be identified metrics/indicators to evaluate in the report
- This can evolve - we learn along the way - or we could be wrong.
- How long before we expect to see impact on identified business metrics?
- What if we do not see expected impact?
- Understanding of metrics
    - The metrics we measure are indicators that our hypothesis is correct
    - Other aspects can have impact on the result - but you need to identify that
    
### Main Goal
- Your success as a Data Scientist is measured by the valuable, actionable insights you create

#### A great way to think
- Any business/organisation can be thought of as a complex system
    - Nobody understands it perfectly and it evolves organically
- Data describes some aspect of it
- It can be thought of as a black-box
- Any insights you can bring is like a window that sheds light on what happens inside

## General Advice
Advice from the course Tutor...
<hr>

- **Expectations**
    - When I started my PhD (researcher) journey I expected to solve big problems - change the world to a better place
    - Reality was different - small incremental contributions
    - Start with simple interesting problems - do not expect to find insights that will change the world from day one.
- **Learning**
    - This is a new field - but like any research field, it evolves and we learn new techniques and get new tools
    - This course gave you a solid basis, but there is a lot more to learn
    - Don't expect your learning to end
- **Long-term focus**
    - Be clear on your goal: Become a Data Scientist
    - This will help you when things seem difficult - everyone has times of struggle
    - Don't get discouraged by seeing someone else present some awesome work - learn from it
- **Curiosity**
    - I always say **keep it playful**
    - You need to enjoy what you do
    - Most people are curious - so let your curiosity guide you on your Data Science journey