# Data Science Workflow

![workflow](./img/ds-workflow.PNG)

## Step 1: Acquire
<hr>

### Explore Problem
- **Get the right question:** What problem am I trying to solve?

### Identify Data
- The data is either provided by the client or obtained online.

### Import Data
- Import data into a pandas DataFrame from sources such as `Excel`, `CSV`, `Database`, `Parquet`, or web scraping.
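
As a minimal sketch of the import step, the example below reads CSV text into a DataFrame. The inline `csv_text` and its column names are made up for illustration and stand in for a real file; the commented-out readers use placeholder paths and a placeholder `connection` object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file such as "data.csv".
csv_text = "id,city,sales\n1,Lagos,100\n2,Accra,150\n3,Nairobi,\n"

# read_csv accepts a file path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Other sources follow the same pattern (all names below are placeholders):
# pd.read_excel("data.xlsx")
# pd.read_parquet("data.parquet")
# pd.read_sql("SELECT * FROM sales", connection)
# pd.read_html("https://example.com/page")  # returns a list of DataFrames
```

The empty `sales` field in the last row is read as `NaN`, which is the kind of gap the cleaning step later deals with.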

### Combine Data
- It is often necessary to combine data from different sources.

Use: `concat()`,`join()`, or `merge()`
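
The three functions can be sketched on small made-up tables as follows. The frames `q1`, `q2`, and `customers` and their columns are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sales data split across two sources, plus a lookup table.
q1 = pd.DataFrame({"customer_id": [1, 2], "sales": [100, 150]})
q2 = pd.DataFrame({"customer_id": [3, 4], "sales": [200, 50]})
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "city": ["Lagos", "Accra", "Nairobi"]}
)

# concat(): stack frames that share the same columns.
sales = pd.concat([q1, q2], ignore_index=True)

# merge(): SQL-style join on a key column; how="left" keeps every sales row.
merged = sales.merge(customers, on="customer_id", how="left")

# join(): combine on the index instead of a column (left join by default).
joined = sales.set_index("customer_id").join(customers.set_index("customer_id"))

print(merged)
```

Customer 4 has no match in `customers`, so its `city` is `NaN` after the left merge. Checking for such gaps is a quick way to see whether the keys of two sources line up.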

## Step 2: Prepare
<hr>

### Explore Data

**Simple Exploration:** `head()`,`.shape`,`.dtypes`,`info()`,`describe()`,`isna()`
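
A short sketch of these calls on a small made-up DataFrame (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame(
    {
        "city": ["Lagos", "Accra", "Nairobi", "Lagos"],
        "sales": [100.0, 150.0, np.nan, 120.0],
    }
)

print(df.head())        # first rows (5 by default)
print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
df.info()               # prints column types and non-null counts itself
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # number of missing values per column
```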

**Groupby, Counts and Statistics**
- Group by non-numeric (categorical) columns.
- Count the rows in each group to see how much weight each group's results carry.
- Check the mean of the values in each group.
- Standard deviation: a measure of how dispersed (spread out) the data is relative to the mean.
- Box plot: a visual alternative to `describe()`.
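
The count, mean, and standard deviation per group can be computed in one pass, as in this sketch. The `city` and `sales` columns are made-up example data.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame(
    {
        "city": ["Lagos", "Lagos", "Accra", "Accra", "Accra"],
        "sales": [100, 120, 150, 170, 130],
    }
)

# Group by the categorical column, then aggregate the numeric one.
summary = df.groupby("city")["sales"].agg(["count", "mean", "std"])
print(summary)
```

Here `std` is the sample standard deviation (pandas uses `ddof=1` by default), so a group with a single row would get `NaN`.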

### Visualize Data
- Line plot
- Scatter plot
- Pie chart
- Histogram
- Bar chart
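
The chart types above can be sketched with the pandas plotting API, which relies on matplotlib being installed. The data is made up, and the `Agg` backend is chosen here only so the script runs without a display; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend (an assumption for scripts)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly sales data.
df = pd.DataFrame(
    {
        "month": [1, 2, 3, 4, 5, 6],
        "sales": [100, 120, 90, 150, 170, 160],
        "region": ["north", "south", "north", "south", "north", "south"],
    }
)

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
df.plot(x="month", y="sales", ax=axes[0, 0], title="Line plot")
df.plot.scatter(x="month", y="sales", ax=axes[0, 1], title="Scatter plot")
df["region"].value_counts().plot.pie(ax=axes[0, 2], title="Pie chart")
df["sales"].plot.hist(bins=5, ax=axes[1, 0], title="Histogram")
df.groupby("region")["sales"].sum().plot.bar(ax=axes[1, 1], title="Bar chart")
df.boxplot(column="sales", ax=axes[1, 2])  # visual counterpart of describe()
fig.tight_layout()
# fig.savefig("plots.png")  # uncomment to write the figure to disk
```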

### Cleaning Data
- `dropna()`: Remove missing values.
- `fillna()`: Fill NA/NaN values with a specified value.
- `drop_duplicates()`: Return DataFrame with duplicate rows removed.
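
A minimal sketch of the three cleaning calls on made-up data. Filling with the column mean is just one possible choice, shown here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicate row and one missing value.
df = pd.DataFrame(
    {
        "city": ["Lagos", "Accra", "Accra", "Nairobi"],
        "sales": [100.0, 150.0, 150.0, np.nan],
    }
)

# Remove exact duplicate rows (keeps the first occurrence).
deduped = df.drop_duplicates()

# Option 1: drop every row that contains a missing value.
dropped = deduped.dropna()

# Option 2: keep the row and fill the gap, here with the column mean.
filled = deduped.fillna({"sales": deduped["sales"].mean()})

print(filled)
```

All three methods return a new DataFrame and leave `df` unchanged, so the result has to be assigned to a variable.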

**Working with Time Series**
- `reindex()`: Conform Series/DataFrame to new index with optional filling logic
- `interpolate()`: Fill NaN using an interpolation method.
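
The two calls often work together: `reindex()` exposes the missing dates, and `interpolate()` fills them. A sketch on a made-up daily series with a two-day gap:

```python
import pandas as pd

# Hypothetical series with readings missing for Jan 3 and Jan 4.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
s = pd.Series([10.0, 20.0, 50.0], index=idx)

# Conform to a complete daily index; the new dates become NaN.
full_idx = pd.date_range("2024-01-01", "2024-01-05", freq="D")
reindexed = s.reindex(full_idx)

# Optional filling logic during reindex: carry the last known value forward.
ffilled = s.reindex(full_idx, method="ffill")

# Fill the gaps by linear interpolation (the default method).
interpolated = reindexed.interpolate()

print(interpolated)
```

Forward fill repeats the last observation (20.0 on both missing days), while linear interpolation draws a straight line between the neighbors (30.0 and 40.0). Which one fits depends on how the underlying quantity behaves.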

## Step 3: Analyze
<hr>

### Split into Train and Test
- Assign the independent features to `X`
- Assign the dependent variable (target) to `y`
- Divide into training and test sets
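
The steps above can be sketched with scikit-learn's `train_test_split`. The housing-style columns, the 80/20 split, and `random_state=42` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two features and one target column.
df = pd.DataFrame(
    {
        "rooms": [2, 3, 4, 3, 5, 2, 4, 3, 5, 6],
        "area": [50, 70, 90, 65, 120, 45, 95, 75, 110, 140],
        "price": [100, 150, 200, 140, 260, 95, 210, 160, 250, 300],
    }
)

X = df[["rooms", "area"]]  # independent features
y = df["price"]            # dependent variable (target)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```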

### Feature Scaling
- Normalization (MinMaxScaler)
```Python
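from sklearn.preprocessing import MinMaxScaler

# Learn each feature's min and max from the training set only
norm = MinMaxScaler().fit(X_train)
# Apply the same scaling to both sets
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)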
```
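**N.B:** Fit the scaler on the training data only, then use it to transform both sets, so that no information from the test set leaks into training.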

- Standardization (StandardScaler)