# Data Science Workflow

![workflow](./img/ds-workflow.PNG)

## Step 1: Acquire
<hr>

### Explore Problem
- **Get the right question:** What problem am I trying to solve?

### Identify Data
- The data is either provided by the client or obtained online.

### Import Data
- Import data into a pandas DataFrame from sources such as `Excel`, `CSV`, `Database`, `Parquet`, and web scraping.
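
As a minimal sketch of the import step, the snippet below reads CSV text into a DataFrame. The column names and values are made up for illustration, and `io.StringIO` stands in for a real file path so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice, pass a file path such as "data.csv".
csv_text = "city,temperature\nOslo,4.5\nCairo,30.1\n"

# read_csv() parses the text and infers a dtype for each column.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (2, 2)
print(df.dtypes)
```

pandas offers similar readers for the other sources, for example `pd.read_excel()`, `pd.read_sql()`, and `pd.read_parquet()`.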

### Combine Data
- It is often necessary to combine data from different sources.

Use: `concat()`,`join()`, or `merge()`
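
The three functions differ mainly in how rows are aligned. The sketch below uses small, made-up tables (`sales_q1`, `sales_q2`, and `stores` are hypothetical names) to show each one.

```python
import pandas as pd

# Hypothetical example tables.
sales_q1 = pd.DataFrame({"store_id": [1, 2], "revenue": [100, 150]})
sales_q2 = pd.DataFrame({"store_id": [1, 3], "revenue": [120, 90]})
stores = pd.DataFrame({"store_id": [1, 2, 3], "city": ["Oslo", "Cairo", "Lima"]})

# concat(): stack tables with the same columns on top of each other.
all_sales = pd.concat([sales_q1, sales_q2], ignore_index=True)

# merge(): SQL-style join on a shared key column.
sales_with_city = all_sales.merge(stores, on="store_id", how="left")

# join(): match a column of the left table against the index of the right table.
sales_joined = all_sales.join(stores.set_index("store_id"), on="store_id")

print(sales_with_city)
```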

## Step 2: Prepare
<hr>

### Explore Data

**Simple Exploration:** `head()`,`.shape`,`.dtypes`,`info()`,`describe()`,`isna()`
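
A quick sketch of these calls on a tiny, made-up DataFrame (the column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, np.nan, 3.0]})

print(df.head())      # first rows
print(df.shape)       # (rows, columns)
print(df.dtypes)      # data type of each column
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for numeric columns

# isna() returns a boolean mask; summing it counts missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```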

**Groupby, Counts and Statistics**
- Group by the non-numeric (categorical) columns.
- Count the rows in each group to see how much weight each group carries in the results.
- Check the mean of the values in each group.
- Standard deviation: a measure of how dispersed (spread out) the data is relative to the mean.
- Box plot: a visual alternative to `describe()`.
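
The grouping steps above can be sketched as follows. The `species` and `weight_kg` columns are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({
    "species": ["cat", "cat", "dog", "dog", "dog"],
    "weight_kg": [4.0, 5.0, 20.0, 25.0, 30.0],
})

# Number of rows per group.
counts = df.groupby("species")["weight_kg"].count()

# Mean and (sample) standard deviation per group.
stats = df.groupby("species")["weight_kg"].agg(["mean", "std"])

print(counts)
print(stats)
```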

### Visualize Data
- Plot
- Scatter plot
- Pie chart
- Histogram
- Bar chart

### Cleaning Data
- `dropna()`: Remove missing values.
- `fillna()`: Fill NA/NaN values using specified method.
- `drop_duplicates()`: Return DataFrame with duplicate rows removed.
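
A minimal sketch of the three cleaning calls on made-up data. Filling with the median is just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicated row and one missing value.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": [10.0, 20.0, 20.0, np.nan],
})

# Remove the repeated (2, 20.0) row.
deduplicated = df.drop_duplicates()

# Option 1: drop rows that contain missing values.
without_missing = deduplicated.dropna()

# Option 2: fill missing values, here with the column median.
filled = deduplicated.fillna({"value": deduplicated["value"].median()})

print(filled)
```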

**Working with Time Series**
- `reindex()`: Conform Series/DataFrame to new index with optional filling logic
- `interpolate()`: Fill NaN using an interpolation method.
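
As a sketch of how the two fit together, the example below takes a made-up series with gaps in its dates, reindexes it to a full daily index, and fills the new `NaN` entries by linear interpolation.

```python
import pandas as pd

# Hypothetical observations on non-consecutive days.
observed = pd.Series(
    [1.0, 3.0, 7.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-05"]),
)

# reindex(): conform to a complete daily index; missing days become NaN.
full_index = pd.date_range("2024-01-01", "2024-01-05", freq="D")
daily = observed.reindex(full_index)

# interpolate(): fill the NaN values between known points.
filled = daily.interpolate(method="linear")

print(filled)
```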

## Step 3: Analyze

<hr>

### Split into Train and Test
- Assign the independent features (predictors) to `X`
- Assign the dependent variable (target) to `y`
- Divide into training and test sets
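
The three steps above can be sketched as follows. The housing-style columns are made up, and the 80/20 split ratio is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: two features and one target column.
df = pd.DataFrame({
    "area_m2": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    "rooms": [1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
    "price": [150, 180, 200, 240, 260, 300, 320, 350, 400, 420],
})

X = df[["area_m2", "rooms"]]  # independent features
y = df["price"]               # dependent variable (target)

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```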

### Feature Scaling
- **Normalization (MinMaxScaler)**
```Python
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)
```

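**N.B.:** Fit the scaler on the training data only, then use it to transform both sets. Fitting on the test data would leak information about it into the model.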

- **Standardization (StandardScaler)**
```Python
from sklearn.preprocessing import StandardScaler
stand = StandardScaler().fit(X_train)
X_train_stand = stand.transform(X_train)
X_test_stand = stand.transform(X_test)
```

### Feature Selection

Feature selection aims for higher accuracy and simpler models, which reduces the risk of overfitting.

* **Filter Methods**

*Examples: Chi-square test, information gain, correlation score, correlation matrix with heatmap.*

* **Wrapper Methods**

*Examples: Best-first search, Random hill-climbing algorithm, forward selection, backward elimination.*

* **Embedded Methods**

*Examples: LASSO, Elastic Net, Ridge Regression*
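
As one illustration of a filter method, the sketch below scores each feature by its absolute correlation with the target and keeps those above a cutoff. The synthetic data, the column names, and the `0.5` threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature drives the target, the other is pure noise.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
df = pd.DataFrame({
    "informative": signal,
    "noise": rng.normal(size=n),
    "target": 3.0 * signal + rng.normal(scale=0.1, size=n),
})

# Absolute Pearson correlation of each feature with the target.
correlations = df.drop(columns="target").corrwith(df["target"]).abs()

# Keep the features whose score exceeds a hypothetical threshold.
threshold = 0.5
selected = correlations[correlations > threshold].index.tolist()

print(correlations)
print(selected)
```

A correlation filter only captures linear relationships with the target, so it is best treated as a quick first pass.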


### Model Selection
* The process of choosing the best model from a collection of candidate machine learning models.

**Model Selection Techniques**
- **Probabilistic Measures:** Score a model on both its performance and its complexity.
- **Resampling Methods:** Repeatedly split the data into sub-train and sub-test sets and take the mean score over the repeated runs.
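
K-fold cross-validation is one common resampling method. The sketch below runs it with scikit-learn's `cross_val_score` on synthetic data; the data, the 5 folds, and the R² scoring are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Fit and score the model on 5 different train/test splits.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

# The mean over the folds is the model's estimated performance.
mean_score = scores.mean()
print(scores, mean_score)
```

Running the same procedure for each candidate model and comparing the mean scores gives a fairer comparison than a single train/test split.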

* `LinearRegression()`
```Python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
lin = LinearRegression()
lin.fit(X_train,y_train)
y_pred = lin.predict(X_test)
r2_score(y_test,y_pred) 
```

* `SVC()`: Support Vector Classification
```Python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
# Fit the classifier on the training set only
svc = SVC()
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
# Fraction of test samples classified correctly
accuracy_score(y_test,y_pred)
```

### Analyze Result
This is the main **check-point** of your analysis.

- Review the **Problem** and **Data Science problem** you started with.
    - The analysis should add value to the **Data Science Problem**
    - Sometimes our focus drifts - we need to ensure alignment with original **Problem**.
    - Go back to the **Exploration** of the **Problem**: does the result add value to the **Data Science Problem** and to the initial **Problem** (which formed the **Data Science Problem**)?
    - *Example:* As data scientists, we often find the research itself valuable, but a business is usually interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
    
- Did we learn anything?
    - Do the **Data-Driven Insights** add value?
    - *Example:* Does it add value to have evidence that wealthy people buy more expensive cars?
        - Confirming this hypothesis might be valuable to you, but does it add any value for a car manufacturer?
        
- Can we draw any valuable insights from our analysis?
    - Do we need more/better/different data?
    - Can we give any Actionable Data-Driven Insights?
    - It is always tempting to ask for better, more accurate, higher-quality data.
    
- Do we have the right features?
    - Do we need to eliminate features?
    - Is the data cleaning appropriate?
    - Is data quality as expected?
    
- Do we need to try different models?
    - Data Analysis is an iterative process
    - Simpler models are often more powerful
    
- Can the result be inconclusive?
    - Can we still give recommendations?

#### Quote
> *“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”* 
> - Sherlock Holmes
 
#### Iterative Research Process

- **Observation/Question**: Starting point (could be iterative)
- **Hypothesis/Claim/Assumption**: Something we believe could be true
- **Test/Data collection**: We need to gather relevant data
- **Analyze/Evidence**: Based on the collected data, did we get evidence?
    - Can our model predict? (a model only becomes useful once it can predict)
- **Conclude**: *Warning!* E.g.: We can conclude that there is a correlation (this does not mean A causes B)
    - Example: Based on the collected data we can see a correlation between A and B
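
To illustrate the warning about correlation, the sketch below builds two synthetic variables, `a` and `b`, that never influence each other but share a hidden common cause. All names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# A hidden factor drives both A and B; neither affects the other.
hidden_cause = rng.normal(size=n)
a = hidden_cause + rng.normal(scale=0.5, size=n)
b = hidden_cause + rng.normal(scale=0.5, size=n)

# Pearson correlation between A and B.
correlation = np.corrcoef(a, b)[0, 1]
print(round(correlation, 2))
```

The correlation comes out strong even though A does not cause B, which is why a correlation alone supports only the cautious conclusion in the example above.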