## The document is in progress. Feel free to add more methods / links



### Define Business goal

### EDA
    
#### Note: some of these methods may need preprocessing first (e.g. filling in missing values)
    
- Take a look at pandas_profiling.ProfileReport() (the package is now published as ydata-profiling), df.describe(), value_counts() and so on
- Look at distribution of features (sns.histplot, which replaces the deprecated sns.distplot; sns.violinplot)
- Look at correlation between features (sns.heatmap(df.corr()), plt.scatter())
- Look at duplicates (df.duplicated() to find them, df.drop_duplicates() to remove them)
- Look at outliers (3-sigma, quantiles, histograms, scatter, DBSCAN)
- Look at missing values
- Try PCA/t-SNE
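A minimal sketch of the non-graphical checks from the list above. The DataFrame and its column names (`price`, `rooms`, `city`) are made up for illustration; swap in your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": np.append(rng.normal(100, 10, 99), 1000.0),  # last value is an obvious outlier
    "rooms": rng.integers(1, 5, 100).astype(float),
    "city": rng.choice(["A", "B", "C"], 100),
})
df.loc[[3, 7], "rooms"] = np.nan                       # inject two missing values
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)  # inject one duplicate row

summary = df.describe()                         # count/mean/std/quantiles of numeric columns
city_counts = df["city"].value_counts()         # frequency of each category
n_duplicates = int(df.duplicated().sum())       # rows identical to an earlier row
missing_per_column = df.isna().sum()            # missing values per column
correlations = df[["price", "rooms"]].corr()    # pairwise Pearson correlation

# 3-sigma rule: flag values more than 3 standard deviations from the mean.
z_scores = (df["price"] - df["price"].mean()) / df["price"].std()
three_sigma_outliers = df.index[z_scores.abs() > 3].tolist()

# Quantile (IQR) rule: flag values far outside the interquartile range.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df.index[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)].tolist()
```

The two outlier rules can disagree: the IQR rule is usually stricter on roughly normal data, so treat both as candidates for a closer look rather than a verdict.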


### Preprocessing

#### Dealing with missing values
- Just drop rows with missing values
- Drop the whole column if there are too many missing values
- Impute with mean/median/most popular value
- Impute with average value of nearest neighbours by other features
- Train some model which predicts missing value based on other features
- For categorical data, a missing value can be treated as a separate class
- The very fact that a value is missing can carry information too!
- For time series, a missing value can be estimated from previous values
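The options above can be sketched roughly as follows. The toy columns (`age`, `income`, `city`) and the `"missing"` placeholder label are assumptions made for the example; the imputers are the ones from `sklearn.impute`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan, 50.0],
    "income": [30.0, 32.0, 80.0, 60.0, 58.0, 100.0],
    "city": ["A", None, "B", "A", "B", None],
})

# Drop every row that has at least one missing value.
dropped = df.dropna()

# Keep the fact that the value was missing as a separate binary feature.
df["age_was_missing"] = df["age"].isna().astype(int)

# Impute with the median of the observed values.
age_median = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Impute with the average of the nearest neighbours, found via the other feature.
age_knn = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])[:, 0]

# Categorical data: use a separate class for missing values.
df["city_filled"] = df["city"].fillna("missing")

# Time series: carry the previous observed value forward.
series = pd.Series([1.0, np.nan, np.nan, 4.0])
forward_filled = series.ffill()
```

Creating the `age_was_missing` flag before imputing matters: once the gaps are filled, that information is gone.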

#### Dealing with categorical variables

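- sklearn.preprocessing.LabelEncoder 
  encodes labels as integers between 0 and n_classes - 1. Might be used for algorithms like decision trees, where it keeps the data compact. Note that scikit-learn intends LabelEncoder for the target; sklearn.preprocessing.OrdinalEncoder does the same for input features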

- sklearn.preprocessing.OneHotEncoder transforms to vectors [0,..0,1,0,...0]

  might blow up the dimensionality. Tip: use PCA after it

- sklearn.feature_extraction.FeatureHasher

  the hashing trick is good for features with high cardinality

- Replace categorical feature with its count

- Replace categorical feature with average value of target variable for regression task (or relative frequencies for classification tasks)

  NOTE: it makes sense to compute this encoding with cross-validation (out-of-fold), so that a row's own target value does not leak into its feature 
  (see discussion here https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36136#201638)

  Some options are implemented in these libraries:

  https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/

  http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html
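A rough sketch of one-hot, count, and out-of-fold target encoding for a regression target. The columns (`city`, `price`), the number of folds, and the fallback to the global mean are illustrative assumptions, not a fixed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature and a numeric target.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "price": [10.0, 20.0, 12.0, 30.0, 22.0, 14.0, 28.0, 18.0],
})

# One-hot encoding: one 0/1 column per category.
one_hot = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# Count encoding: replace each category with how often it occurs.
df["city_count"] = df["city"].map(df["city"].value_counts())

# Out-of-fold target encoding: the mean target for a row's category is
# computed only from rows in the other folds.
global_mean = df["price"].mean()
df["city_target_enc"] = np.nan
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kfold.split(df):
    fold_means = df.iloc[fit_idx].groupby("city")["price"].mean()
    # Categories unseen in the fitting folds fall back to the global mean.
    encoded = df.iloc[enc_idx]["city"].map(fold_means).fillna(global_mean)
    df.loc[df.index[enc_idx], "city_target_enc"] = encoded.to_numpy()
```

For a classification task the same loop applies with the mean of a 0/1 target, which gives the relative frequency of the positive class.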
  


#### Dealing with outliers 

- Drop rows with outliers
- Treat outliers as missing values and apply the same methods as above
- Leave as is if you use methods that are robust to outliers, such as decision trees 
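As an illustration of the first two options, here is a small sketch that flags outliers with the 1.5 * IQR rule (one possible choice of detector, assumed here) and then either drops them or treats them as missing and imputes the median. The `value` column is made up.

```python
import pandas as pd

# Hypothetical toy data with one extreme value.
df = pd.DataFrame({"value": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 500.0]})

# Flag values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR].
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ~df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Option 1: drop rows with outliers.
dropped = df[~is_outlier]

# Option 2: treat outliers as missing, then impute (here with the median).
as_missing = df["value"].mask(is_outlier)
imputed = as_missing.fillna(as_missing.median())
```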


#### Dealing with skewed data

- Apply mathematical transformations like log or Box-Cox (scipy.stats.boxcox)
- Binning

  Split data into bins and use bin number as a categorical feature. Quantiles might be used (pd.qcut)
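A short sketch of both ideas on synthetic right-skewed data. The `income` series is generated for the example; note that `scipy.stats.boxcox` requires strictly positive input.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed feature (log-normal draws are strictly positive).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# Log transform; log1p also copes with zeros.
log_income = np.log1p(income)

# Box-Cox estimates the power parameter lambda from the data.
boxcox_income, fitted_lambda = stats.boxcox(income)

# Quantile binning: four bins with (roughly) equal numbers of rows.
income_bin = pd.qcut(income, q=4, labels=False)

skew_before = stats.skew(income)
skew_after_log = stats.skew(log_income)
skew_after_boxcox = stats.skew(boxcox_income)
```

Comparing `skew_before` with the two transformed values gives a quick check of whether the transformation actually helped.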
  
  
#### Standardization / Normalization

###### Matters for most models (tree-based methods are an exception). Standardization makes sense before PCA as well

- Standardization: subtract the mean and divide by the standard deviation

  sklearn.preprocessing.StandardScaler()

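- Normalization: rescale each sample (row) to unit norm
  
  sklearn.preprocessing.normalize (default norm is l2; norm='max' divides each row by its maximum absolute value). To divide each feature (column) by its maximum absolute value, use sklearn.preprocessing.MaxAbsScaler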
  
- Min-max scaling: subtract min value and divide by (max-min)

  sklearn.preprocessing.MinMaxScaler
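To make the difference between these scalers concrete, here is a small sketch on a made-up 3x2 matrix. The scalers work per column, while `normalize` works per row.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, normalize

# Hypothetical data: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

standardized = StandardScaler().fit_transform(X)  # per column: zero mean, unit variance
min_max = MinMaxScaler().fit_transform(X)         # per column: (x - min) / (max - min)
max_abs = MaxAbsScaler().fit_transform(X)         # per column: x / max(|x|)
row_unit = normalize(X, norm="l2")                # per row: divide by the row's L2 norm
```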


### Feature engineering:

- Binary features 

  e.g. whether a value is higher than the mean, or whether a count is greater than zero
  
  thresholding is implemented in sklearn.preprocessing.Binarizer

- Roundings

  transform fractions to percentages with a loss of precision (0.8531 -> 85). The result can be used as a categorical feature
  
- Extract day/month/year from dates; add an is-holiday flag

- Aggregations:
    
    Calculate counts of an attribute (e.g. count the appearances of each neighborhood in the dataset to estimate its size/density)
    
    Calculate averages of an attribute (e.g. calculate the average house price for each neighborhood)
    
- Adding polynomial features for linear models    

  sklearn.preprocessing.PolynomialFeatures

    
- Apply PCA for highly correlated features

- Clustering based features 
  
  Do some clustering (perhaps on a subset of the attributes or on some nested data) and use the cluster number as a new categorical feature, or the cluster's average target value as a new numerical feature

- Features based on domain knowledge (important!)

  e.g. extract a person's age from birth date and current date, get distance to metro from house address
  

- Text data (haven't worked with it but here are just some basics)

  sklearn.feature_extraction.text.CountVectorizer
  
  sklearn.feature_extraction.text.TfidfVectorizer
  
  word2vec/glove
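Several of the ideas above, sketched on a made-up housing table. All column names, the holiday list, and the example texts are hypothetical; real features depend on the data and the domain.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data.
df = pd.DataFrame({
    "neighborhood": ["north", "north", "south", "south", "south"],
    "price": [100.0, 120.0, 80.0, 90.0, 70.0],
    "share": [0.8531, 0.1249, 0.5000, 0.9999, 0.3333],
    "sold_at": pd.to_datetime(
        ["2021-01-01", "2021-03-15", "2021-07-04", "2021-12-25", "2022-02-28"]
    ),
})

# Binary feature: higher than the mean or not.
df["above_mean_price"] = (df["price"] > df["price"].mean()).astype(int)

# Rounding: fraction -> whole percent, truncating the rest (0.8531 -> 85).
df["share_pct"] = (df["share"] * 100).astype(int)

# Date parts and an is-holiday flag (the holiday list is an assumption).
df["year"] = df["sold_at"].dt.year
df["month"] = df["sold_at"].dt.month
df["day"] = df["sold_at"].dt.day
holidays = pd.to_datetime(["2021-01-01", "2021-12-25"])
df["is_holiday"] = df["sold_at"].isin(holidays).astype(int)

# Aggregations per neighborhood, broadcast back to every row.
by_neighborhood = df.groupby("neighborhood")["price"]
df["neighborhood_count"] = by_neighborhood.transform("count")
df["neighborhood_avg_price"] = by_neighborhood.transform("mean")

# Polynomial features for linear models: x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["price", "share"]])

# Clustering-based feature: cluster number as a new categorical feature.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["price", "share"]])

# Text basics: TF-IDF matrix with one row per document, one column per word.
docs = ["cheap flat near metro", "large house with garden", "flat with garden"]
tfidf = TfidfVectorizer().fit_transform(docs)
```

If the averaged attribute is the target itself, compute the aggregate out-of-fold, for the same leakage reason as with target encoding.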



### Modelling:

#### Simple model

  - Linear models / Decision trees
  - Grid search / Random Search / Cross-validation
  - Metrics
  - Feature importance
  - Describe model
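One possible shape for the simple-model step: a scikit-learn Pipeline with a linear model, tuned by grid search with cross-validation. The dataset (scikit-learn's built-in breast cancer data), the parameter grid, and ROC AUC as the metric are arbitrary choices for the sketch.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling lives inside the pipeline, so it is re-fit on each CV training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search over the regularization strength with 5-fold cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Metric on held-out data.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

# Rough feature importance: absolute coefficients on standardized inputs.
coefficients = pd.Series(
    search.best_estimator_.named_steps["model"].coef_[0], index=X.columns
)
top_features = coefficients.abs().sort_values(ascending=False).head(5)
```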

#### Complicated model

  - Boostings
  - hyperopt
  - Metrics
  - Feature importance
  - Describe model
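A sketch of the boosting step. To stay within scikit-learn, it uses GradientBoostingClassifier and RandomizedSearchCV in place of hyperopt; the search space, dataset, and metric are assumptions, and a hyperopt-driven search would follow the same fit/score structure.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Randomized search samples a few combinations instead of trying all of them.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [2, 3, 4],
        "subsample": [0.7, 1.0],
    },
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)

# Metric on held-out data.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

# Impurity-based feature importances of the best model (they sum to 1).
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)
```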
  
  
### Useful tools

http://scikit-learn.org/stable/modules/pipeline.html

http://hyperopt.github.io/hyperopt/
