# Women in Analytics 2019 Workshop
## March 22, 2019
## Author: Jaya Zenchenko, @datanerd_jaya
## https://www.linkedin.com/in/jayazenchenko/
## Machine Learning with Python

Agenda:
* Intro
* What is machine learning?
* Python package highlights
* Quickest intro to numpy, scipy
* Exploratory Data Analysis (EDA) with Pandas
* Unsupervised Learning, Supervised Learning
* Model evaluation, cross-validation
* Wrap Up


Fork this notebook!

Exciting times! Data sources, tools, compute resources readily available to get started!

Free Data Sources:
* Too many to list!
* Caution: Data sources vs. Machine learning data - structured/unstructured data vs labeled data
* Caution: Check out everyone's licensing before using it for your enterprise needs.


Quick intro to open source data science and analytics - 

Compute resources to use:
Free (or free trial):
* https://data.world/community/open-community/
* https://colab.research.google.com/notebooks/welcome.ipynb
* https://aws.amazon.com/sagemaker/pricing/
* https://cloud.google.com/products/ai/
* https://datastudio.google.com/navigation/reporting
* https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/
* https://www.dominodatalab.com/domino-for-good/
* https://www.dominodatalab.com/domino-for-good/for-students/
* https://www.kaggle.com/sigma23/women-in-analytics-2019-workshop/edit
* https://medium.com/@jamsawamsa/running-a-google-cloud-gpu-for-fast-ai-for-free-5f89c707bae6

Caution: Make sure you know how to shut them down to not rack up a huge bill!

Generally free tools 
* RStudio
* Anaconda (Jupyter, Spyder, Orange)
* Weka
* KNIME
* https://www.h2o.ai/products/h2o/#how-it-works
* https://public.tableau.com/en-us/s/
* https://plot.ly/create/#/


Great Resources:
Free Code!
* https://github.com/amueller/scipy-2018-sklearn
* https://jakevdp.github.io/PythonDataScienceHandbook
* https://github.com/rasbt/python-machine-learning-book-2nd-edition
* https://github.com/josephmisiti/awesome-machine-learning
* https://github.com/lazyprogrammer/machine_learning_examples
* https://github.com/scikit-learn/scikit-learn

Free Courses:
Coursera, edx, classcentral.com

Soapbox: Free means people have dedicated time and resources to creating and maintaining these things.  Be a part of the open source community by contributing!





In [None]:
from IPython.display import Image


# What is Machine Learning?

Machine Learning

Great definition: https://emerj.com/ai-glossary-terms/what-is-machine-learning/



In [None]:
Image("../input/images2/images/images/what_is_ml.png")

In [None]:
Image("../input/images2/images/images/types_oh_ml.png")

In [None]:
Image("../input/images2/images/images/types_of_ml.png")

Source:

- https://medium.com/deep-math-machine-learning-ai/introduction-of-machine-learning-why-how-what-84c881c70763
- https://medium.com/deep-math-machine-learning-ai/different-types-of-machine-learning-and-their-types-34760b9128a2 (pic) 

## Let's get started! Import packages!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sklearn
import scipy 

import xgboost
import matplotlib.pyplot as plt
import seaborn as sns


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Need latest package for category_encoders:

In [None]:
!pip install --upgrade git+https://github.com/scikit-learn-contrib/categorical-encoding

In [None]:
import category_encoders

Caution: Keep track of python packages using docker (kaggle does this), conda environments, virtualenv, or with watermark:  https://github.com/rasbt/watermark#installation-and-updating

Caution: Important to remember reproducible code and research!

Caution: Check internet connection for pip install

In [None]:
!pip install watermark

In [None]:
%load_ext watermark

In [None]:
%watermark --iversions

Caution: VERY important to keep track of the versions used - open source packages change frequently and the code may not work with new package versions.  Use Docker or virtual environments to keep track and test all your code anytime you want to update a package in your environment.

I like to print it out in the notebook if I'm just sharing my notebook.


## Quickest intro to numpy/scipy:
* NumPy (1995 as numeric, 2006 as NumPy)
    * Array data types and basic operations (with some overlap with scipy)
* SciPy (Scientific Python) - created 2001
    * Fully featured versions of the linear algebra and numerical algorithms
    
    
* Fortran/C/C++ under the hood - fast! Don't rewrite these methods!
* Incredible SciPy conference held yearly in Austin, TX! Meet many of the scikit-learn and other python open source contributors! https://conference.scipy.org/ Watch all talks for free: https://www.youtube.com/user/EnthoughtMedia


In [None]:
## Basic numpy:
# 1d:
a = np.array([0, 1, 2, 3])
print("a = ", a)

In [None]:
# 2x3 array:
b = np.array([[0, 1, 2], [3, 4, 5]])    # 2 x 3 array
print("b = ", b)
# https://scipy-lectures.org/intro/numpy/array_object.html#what-are-numpy-and-numpy-arrays
    

In [None]:
a.shape

In [None]:
b.shape

#### Most common: 
* np.reshape()
* np.arange()
* np.linspace()
* np.zeros()
* np.sum()
* np.mean()
* np.argmax()
* np.argmin()
* np.array()
* np.sort()
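
A small sketch of several of these functions on made-up arrays (the variable names are illustrative and not part of the workshop data):

```python
import numpy as np

a = np.arange(6)                         # array([0, 1, 2, 3, 4, 5])
m = a.reshape(2, 3)                      # same data viewed as 2 rows x 3 columns
z = np.zeros((2, 3))                     # 2 x 3 array filled with 0.0

total = np.sum(m)                        # sum of every element
col_means = np.mean(m, axis=0)           # mean of each column
biggest = np.argmax(a)                   # position of the largest value
ordered = np.sort(np.array([3, 1, 2]))   # returns a sorted copy

print(m.shape, total, col_means, biggest, ordered)
```

Passing `axis=0` collapses the rows and gives one value per column; `axis=1` would give one value per row.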

#### Cheat Sheets!
* https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
* https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf
* https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf
* https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf

#### Additional Resources:
* https://towardsdatascience.com/lets-talk-about-numpy-for-datascience-beginners-b8088722309f


In [None]:
a.reshape(4,1)

In [None]:
a

In [None]:
a.shape

In [None]:
a = a.reshape(4,1)

In [None]:
a

In [None]:
a = np.arange(10) # evenly spaced
print(a)

In [None]:
np.linspace(0, 1, 10) # Start, end, Number of points

In [None]:
?np.linspace # Get help on how to call the function!

In [None]:
Image("../input/images2/images/images/numpy_slicing.png")

In [None]:
# Indexing
a = np.arange(10) # evenly spaced
a[2:9:3] # [start:end:step]

In [None]:
a[:4] # last index is not included.  Default start is 0, end is last, step is 1

In [None]:
a[::-1] # Can easily reverse!

In [None]:
Image("../input/images2/images/images/numpy_fancy_indexing.png")

## Pandas
- Part of NumFOCUS
- Used for python data analysis

https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

https://numfocus.org/sponsored-projects

In [None]:
Image("../input/images2/images/images/pandas_cheetsheet.png")

Let's dig into our data!

**Import Student Dataset:**

https://www.kaggle.com/aljarah/xAPI-Edu-Data

- Data already attached to this kernel so no need to import :)

Citation:
* Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.

* Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student's performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.

In [None]:
school_data = pd.read_csv('../input/xAPI-Edu-Data/xAPI-Edu-Data.csv') # pandas has lots of ways to read in data!

In [None]:
school_data.head() #shows top 5 lines

In [None]:
Image("../input/images2/images/images/xapi_data_attributes.png")

In [None]:
Image("../input/images2/images/images/xapi_class_outputs.png")

In [None]:
type(school_data) # get the type of data

In [None]:
type(school_data.gender)

In [None]:
school_data.iloc[0,3:5] #just like numpy indexing

Caution: Pandas has 2 basic methods of indexing, "loc" and "iloc" - 
* loc - gets rows/columns with particular labels from the index
* iloc - gets rows/columns with positional index
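
To make the difference concrete, here is a minimal sketch on a hypothetical toy frame whose index labels are deliberately not 0, 1, 2 (the frame and its columns are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; the index labels are 10, 20, 30 rather than positions
toy = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "score": [90, 75, 82]},
                   index=[10, 20, 30])

by_label = toy.loc[20, "score"]      # row with index label 20
by_position = toy.iloc[1, 1]         # second row, second column
label_slice = toy.loc[10:20]         # label slices include the end label
position_slice = toy.iloc[0:1]       # position slices exclude the end, like numpy

print(by_label, by_position, len(label_slice), len(position_slice))
```

This matters after filtering: a filtered frame keeps its original index labels, so `loc` and `iloc` can return different rows for the same numbers.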

In [None]:
school_data.columns

In [None]:
print(school_data.StageID[0])
print(school_data['StageID'][0])
print(school_data.loc[0, 'StageID'])
print(school_data.iloc[0, 3])

print("Caution: Indexing slightly differently will return a Series: school_data.iloc[0, 3:4]")
print(school_data.iloc[0, 3:4])

In [None]:
type(school_data.iloc[0, 3])

In [None]:
type(school_data.iloc[0, 3:4])

In [None]:
# Filter the data:
school_data[(school_data.gender=='F')].head() 

In [None]:
# Multiple filters
school_data[(school_data.gender=='F') & (school_data.Topic=='IT')].head() 

In [None]:
school_data[(school_data.gender=='F')].loc[5:9,:]

In [None]:
school_data[(school_data.gender=='F')].iloc[0:3,:]

In [None]:
# Count the values in a column:
school_data.gender.value_counts()

In [None]:
# Group by and count:
school_data.groupby(['StageID', 'Topic']).count()


In [None]:
# Since we are just counting the number of rows, select one of them and sort 
school_data.groupby(['StageID', 'Topic']).gender.count().sort_values(ascending=False)


In [None]:
school_data.columns

In [None]:
# Get statistics on numeric column and sort by the mean of raisedhands
school_data.groupby(['StageID', 'Topic']).raisedhands.describe().sort_values(by='mean', ascending=False)

- Exercise: For all the kids in middle school (school_data.StageID=='MiddleSchool'), which subject (Topic) is taken the most?
- Exercise: Do boys or girls in middle school have the higher average Discussion?

In [None]:
# Exercise:
school_data[(school_data.StageID=='MiddleSchool')].groupby('Topic').count()
school_data[(school_data.StageID=='MiddleSchool')].Topic.value_counts()

### Explore the data more with pandas and seaborn:

In [None]:
school_data.info(verbose=True, null_counts=True) # get information about the dataframe: non-null counts and datatypes (pandas >= 1.2 renames null_counts to show_counts)

In [None]:
school_data.columns

In [None]:
## My method of taking an initial glance at the columns:
for each_column in school_data.columns:
    print("Column = ", each_column)
    print(school_data[each_column].describe()) #describe gives summary stats for numeric columns and count/unique/top value for strings
    if school_data[each_column].nunique() < 50: # nunique get the number of unique values for the column
        print("Counts of {} :".format(each_column)) # can use format to format the string
        print(school_data[each_column].value_counts())
    print()

In [None]:
## New packages, pandas-profiling and pandas_summary

In [None]:
school_data.shape # 480 rows and 17 columns

Exercise: Identify the maximum value for all the numerical columns. Which categorical column has the most unique values? 

In [None]:
# Exercise:
school_data.max()

Caution: When looking at your dataset, check to see how big it is and select methods that are appropriate for the size.  Beware of overfitting when working with small datasets.

Resources: 
- https://medium.com/rants-on-machine-learning/what-to-do-with-small-data-d253254d1a89
- https://datascience.stackexchange.com/questions/19925/what-are-the-most-suitable-machine-learning-algorithms-according-to-type-of-data

Alternative ways to explore data !  Additional profiling, and visualization with pandas_summary and pandas_profiling.


In [None]:
from pandas_summary import DataFrameSummary # https://github.com/mouradmourafiq/pandas-summary

In [None]:
school_data.columns

In [None]:
## Explore with pandas

In [None]:
dfs = DataFrameSummary(school_data)

In [None]:
dfs.columns_types #one step further than pandas.info()

In [None]:
dfs.columns_stats

In [None]:
import pandas_profiling

In [None]:
pandas_profiling.ProfileReport(school_data)

Caution: Generally we would want to deal with duplicate rows. Here we assume that the duplicates are 2 different students who happen to have the same data. The data set would be better if it had some kind of studentID.

Resource: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
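
If we did decide to remove duplicates, a minimal sketch on a made-up frame (not the student data) could look like this:

```python
import pandas as pd

# Made-up rows: the third row repeats the first one exactly
toy = pd.DataFrame({"gender": ["F", "M", "F"], "raisedhands": [10, 20, 10]})

n_dupes = toy.duplicated().sum()     # rows identical to an earlier row
deduped = toy.drop_duplicates()      # keeps the first occurrence by default

print(n_dupes, len(deduped))
```

Counting with `duplicated()` first is a cheap way to see how much data a `drop_duplicates()` call would remove before committing to it.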

### Visualize the data:

Caution: Always important to visualize and not just rely on sample statistics! 

Resources: 
- https://seaborn.pydata.org/examples/anscombes_quartet.html
- HOT OFF THE PRESS: https://medium.com/@plotlygraphs/introducing-plotly-express-808df010143d - Plotly express!


In [None]:
%matplotlib inline

In [None]:
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe") #same sample statistics, and regression line for 4 different datasets

# Show the results of a linear regression within each dataset
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})

In [None]:
# Seaborn : http://seaborn.pydata.org/index.html
sns.set(style="ticks")
sns.pairplot(school_data, hue="Class")
plt.show()

In [None]:
school_data.info()

In [None]:
for each in school_data.columns:
    if school_data[each].dtype=='object':
        sns.catplot(x=each, y="raisedhands", hue="Class", kind="swarm", data=school_data)
        plt.show()

Exercise: Look at the data with respect to another numeric variable (e.g. y='VisITedResources').  Any interesting insights?

In [None]:
# Exercise:

In [None]:
school_data.columns

In [None]:
school_data.gender.value_counts().plot(kind='bar') # Quick view of the data:

In [None]:
school_data.AnnouncementsView.plot(kind='hist')

In [None]:
school_data.AnnouncementsView.hist() # Another way to quickly plot

## Machine Learning:

In [None]:
Image("../input/images2/images/images/types_of_ml.png")



Resource: https://www.researchgate.net/figure/Examples-of-real-life-problems-in-the-context-of-supervised-and-unsupervised-learning_fig8_319093376


No free lunch! 
* "All models are wrong but some are useful"
* Bias-Variance tradeoff
* Curse of dimensionality

        

In [None]:
Image("../input/images2/images/images/bias_variance_tradeoff_reg.png")
# Source: http://scott.fortmann-roe.com/docs/BiasVariance.html

In [None]:
Image("../input/images2/images/images/overfitting_under_classification.png")

## Data Preprocessing:
* Models require numeric data (most of the time)
* Potential Steps:
    * Clean up outliers
    * Decide how to deal with missing values (e.g. impute missing values, remove rows or columns) 
    * Identify multicollinearity (and remove if necessary)
    * Scale or normalize (if necessary)
    * Encode categoricals into numeric (if necessary, many ways to do this)

Caution: Important to note the assumptions of your algorithm so you preprocess correctly!

Resources:
- https://hackernoon.com/what-steps-should-one-take-while-doing-data-preprocessing-502c993e1caa
- https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9

#### Missing Values

In [None]:
school_data.info()

Caution: We do not have any missing data. If we did, we would need to clean it up by imputing values or by dropping rows/columns.

Resource: Preprocessing - https://scikit-learn.org/stable/modules/preprocessing.html
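
Since our data has no gaps, here is a minimal sketch on a made-up frame showing the two basic options, dropping and imputing (the values and the "Unknown" placeholder are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing number and one missing category
toy = pd.DataFrame({"raisedhands": [10.0, np.nan, 30.0],
                    "Topic": ["IT", "Math", None]})

missing_per_column = toy.isnull().sum()      # how many nulls in each column

dropped = toy.dropna()                       # option 1: drop incomplete rows

imputed = toy.copy()                         # option 2: fill the gaps
imputed["raisedhands"] = imputed["raisedhands"].fillna(imputed["raisedhands"].mean())
imputed["Topic"] = imputed["Topic"].fillna("Unknown")

print(missing_per_column, len(dropped), imputed, sep="\n")
```

Dropping is simple but throws data away, which hurts on a small data set; imputing keeps the rows but adds values that were never observed.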

#### Convert categoricals into numeric
Quick way to convert categoricals into numeric - "get dummies"

In [None]:
school_data.GradeID.head(10)

In [None]:
pd.get_dummies(school_data['GradeID']).head(10)

In [None]:
school_data.GradeID.head(10)

In [None]:
school_data_dummies_df = pd.get_dummies(school_data)

In [None]:
school_data_dummies_df.head()

In [None]:
school_data_dummies_df.corr()

In [None]:
corr = school_data_dummies_df.corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
import seaborn as sns
corr = school_data_dummies_df.corr()
sns.set(rc={'figure.figsize':(18,13)})
#sns.heatmap(corr, 
 #           xticklabels=corr.columns.values,
 #           yticklabels=corr.columns.values) # original didnt show colors that made it easy to see the correlations


# Draw the heatmap with a diverging colormap and square cells
sns.heatmap(corr, cmap='coolwarm', vmax=1, center=0,
            square=True, linewidths=.5,  xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

# https://seaborn.pydata.org/examples/many_pairwise_correlations.html

Exercise: What columns look redundant?

## scikit-learn (sklearn)
* Considered the "gold standard" interface for transforming data and fitting models
* Spark's MLlib interface is modeled after it
* fit_transform, fit_predict
* Can create pipelines to transform, preprocess, clean, and fit models



#### Example of preprocessing data with sklearn
Primarily done with "fit_transform"


In [None]:
from sklearn import preprocessing

In [None]:
ohe = sklearn.preprocessing.OneHotEncoder() # create a one hot encoding object

In [None]:
ohe.get_feature_names() # raises NotFittedError here: the encoder has not been fitted yet

In [None]:
ohe_data = ohe.fit_transform(school_data) # fit_transform, fit will modify the ohe object

In [None]:
ohe

In [None]:
# Get the feature names:
ohe.get_feature_names()

In [None]:
ohe_data.shape

In [None]:
ohe.get_feature_names()

Notice that OneHotEncoder treated every column as categorical, including the numeric ones: each distinct value in a column became its own feature.


In [None]:
school_data.columns

In [None]:
school_data.head()

Caution: Calling "fit" or "fit_transform" on object "ohe" again will refit it and overwrite what it learned from the earlier data!
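
A minimal sketch of this behaviour, using StandardScaler on made-up numbers because its learned state (`mean_`) is easy to read; the same pattern applies to any sklearn transformer:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # made-up training column
new = np.array([[10.0], [20.0]])          # made-up later data

scaler = StandardScaler()
scaler.fit(train)
mean_after_train = scaler.mean_.copy()    # learned from train
scaled_new = scaler.transform(new)        # transform reuses the train statistics

scaler.fit(new)                           # refitting overwrites what was learned
mean_after_refit = scaler.mean_.copy()

print(mean_after_train, mean_after_refit)
```

The usual pattern is to call `fit` (or `fit_transform`) once on the training data and only `transform` on anything that comes later.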

# Unsupervised Learning

Clustering and dimensionality reduction are 2 common approaches to unsupervised learning.
Unsupervised learning can be a part of exploratory data analysis.

* Find meaningful relationships
* Low dimensional representations for visualization or compression

Resource: https://web.stanford.edu/class/stats202/content/lec2.pdf

## Clustering
* Clustering is to group the data
* Some algorithms are: KMeans, hierarchical clustering, DBSCAN, etc
* Downside is that it is difficult to evaluate the model

Caution: Make sure to look at the details of your algorithm to identify the assumptions of the expected input data.  

Resource: https://www.r-bloggers.com/k-means-clustering-is-not-a-free-lunch/

https://web.stanford.edu/class/stats202/content/lec2.pdf



In [None]:
Image("../input/images2/images/images/wht_is_a_cluster.png")

## KMeans Clustering
* Simple clustering algorithm
* Can handle very large datasets

Caution: Make sure to look at the details of your algorithm to identify the assumptions of the expected input data

* Hands on visualization of how it works: http://web.stanford.edu/class/ee103/visualizations/kmeans/kmeans.html

Let's look at a subset of our data to dig into clustering.  To one hot encode only some columns, similar to pandas get_dummies, we need to use ColumnTransformer.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [None]:
subset_features = ['GradeID','StudentAbsenceDays', 'raisedhands',
       'VisITedResources', 'AnnouncementsView', 'Discussion']

numeric_features = ['raisedhands',  'VisITedResources', 'AnnouncementsView', 'Discussion']

categorical_features = ['GradeID', 'StudentAbsenceDays']

preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features), ('nothing', 'passthrough', numeric_features)])


In [None]:
ohe_data = preprocessor.fit_transform(school_data[subset_features])

In [None]:
ohe_data.shape

In [None]:
preprocessor.named_transformers_['onehot'].get_feature_names() #Get the feature names of the transformer called 'onehot'

In [None]:
ohe_features = list(preprocessor.named_transformers_['onehot'].get_feature_names()) + list(numeric_features)

In [None]:
ohe_features

In [None]:
ohe_data.shape

In [None]:
ohe_data

KMeans groups points together in the 16-dimensional space using Euclidean distance - anything concerning about this data set?

Let's look back at our data set and pull out a different subset:

In [None]:
numeric_school_data = school_data[numeric_features]

In [None]:
numeric_school_data.mean(axis=0)

## KMeans Clustering

Caution: Need to scale the data for kmeans to work!

Caution: Make sure to look at the details of your algorithm to identify the assumptions of the expected input data

KMeans: 
- spherical clusters
- evenly sized clusters
- need to provide K
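
Why scaling matters for a distance-based method can be sketched with four hypothetical students. The income column is made up for illustration and is not in the xAPI data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical students: columns are [raisedhands, household income in dollars]
X = np.array([[10.0, 50000.0],
              [90.0, 50100.0],
              [12.0, 50400.0],
              [50.0, 52000.0]])

def nearest_to_first(data):
    """Row index (1, 2, ...) of the point closest to row 0 in Euclidean distance."""
    distances = np.sqrt(((data[1:] - data[0]) ** 2).sum(axis=1))
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(X)                                     # dollars dominate
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(X))  # both columns count

print(raw_nearest, scaled_nearest)
```

On the raw numbers, student 0 looks closest to student 1 even though their raisedhands differ by 80, because a few hundred dollars outweighs everything else. After scaling, student 2 (almost identical raisedhands) becomes the nearest neighbor.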

Resources:
- https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html
- https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
- https://stats.stackexchange.com/questions/89809/is-it-important-to-scale-data-before-clustering


pic from http://bioinformaticsinstitute.ru/sites/default/files/preprocessing_unsupervised.pdf
pics from https://www.r-bloggers.com/k-means-clustering-is-not-a-free-lunch/




In [None]:
Image("../input/images2/images/images/unevenly_sized_kmeans.png")

In [None]:
Image("../input/images2/images/images/kmeans_nonspherical.png")


In [None]:
from sklearn import cluster

In [None]:
from sklearn import decomposition

In [None]:
ss = sklearn.preprocessing.StandardScaler()

In [None]:
ss_data = ss.fit_transform(numeric_school_data)

In [None]:
numeric_school_data.shape

In [None]:
ss.mean_

In [None]:
ss_data.mean(axis=0)

In [None]:
kmeans_scaled = sklearn.cluster.KMeans(n_clusters=3, random_state=42, n_jobs=-1)

Caution: To ensure reproducibility, check to see if your function accepts a random_state, and make sure to set it manually otherwise your results will change each time you re-run the function!

In [None]:
kclusters_scaled_only = kmeans_scaled.fit_predict(ss_data)

In [None]:
kmeans_scaled.cluster_centers_.shape #cluster_centers_ attribute has values for the cluster centers, 3 clusters x 4 features

In [None]:
pd.Series(kclusters_scaled_only).value_counts() # How many in each cluster

In [None]:
pd.Series(kmeans_scaled.labels_).value_counts() # Alternative way to find cluster labels

In [None]:
school_data[kclusters_scaled_only==0][numeric_features].head(20)

In [None]:
dict(zip(numeric_features, kmeans_scaled.cluster_centers_[0]))

What does this mean?

In [None]:
# Need to look at what the data looked like before it was scaled
dict(zip(numeric_features, ss.inverse_transform(kmeans_scaled.cluster_centers_[0]))) 


In [None]:
dict(zip(numeric_features, ss.inverse_transform(kmeans_scaled.cluster_centers_[1]))) 

In [None]:
dict(zip(numeric_features, ss.inverse_transform(kmeans_scaled.cluster_centers_[2]))) 

Difficult to evaluate, but there are clustering evaluation metrics out there.
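
One such metric is the silhouette score. Here is a minimal sketch on synthetic blobs (stand-in data, not the student data) that compares a few values of K:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
# Synthetic stand-in data: two well-separated blobs in 2 dimensions
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # between -1 and 1, higher is better

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score needs no true labels, but it favors compact, well-separated clusters, so it shares some of the KMeans assumptions.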

In [None]:
Image("../input/images2/images/images/clustering_metrics.png")

In [None]:
Image("../input/images2/images/images/clustering_algos.png")

Our dataset is mostly categorical - to leverage all of the data, we could use kmodes.

Resource: https://github.com/nicodv/kmodes

Using all the data means we would have 17 dimensions.  Not huge, but it's bigger than the 4 we used.

### NO FREE LUNCH - Curse of Dimensionality
Remember, KMeans measures 'distance' as Euclidean distance in an n-dimensional space.  Points become sparse in high-dimensional spaces, so the distances between them become less informative.
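
A rough numerical sketch of this effect on uniformly random points (synthetic data; `relative_contrast` is a helper made up for this example):

```python
import numpy as np

rng = np.random.RandomState(42)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random point to the rest."""
    points = rng.uniform(size=(n_points, n_dims))
    distances = np.sqrt(((points[1:] - points[0]) ** 2).sum(axis=1))
    return (distances.max() - distances.min()) / distances.min()

contrast_low = relative_contrast(2)       # 2 dimensions
contrast_high = relative_contrast(1000)   # 1000 dimensions

print(contrast_low, contrast_high)
```

In 2 dimensions the farthest point is many times farther away than the nearest one. In 1000 dimensions the nearest and farthest points are almost the same distance away, so "closest cluster center" carries much less meaning.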

In [None]:
Image("../input/images2/images/images/curse_of_dimensionality.png")

Resource: https://www.kdnuggets.com/2017/04/must-know-curse-dimensionality.html

## Dimensionality Reduction
* Reduce the dimensions of your input data
* Remove multi-collinearity
* Some approaches are: PCA, LDA, SVD, t-SNE, etc

Caution: Some ML algorithms need the data to be non-collinear (e.g. generalized linear models). Make sure to check for and remove multi-collinearity!  
Caution: Some dimensionality reduction techniques find linear relationships and others find non-linear relationships

Resources:
- http://setosa.io/ev/principal-component-analysis/
- https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
- https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/

## PCA/SVD

Relationship - PCA(X) = SVD(X-mean(X))
- Sklearn calculates PCA using SVD
- Sklearn's PCA doesn't support sparse matrix
- Use TruncatedSVD for sparse matrices
- PCA and SVD create new features which are linear transformations of original features
- Basis of Latent Semantic Indexing Topic Modeling 
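
The relationship PCA(X) = SVD(X-mean(X)) can be checked numerically. This is a minimal sketch on synthetic data, with `svd_solver="full"` chosen here so that sklearn takes the plain SVD route:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Synthetic correlated features: random data mixed by a random 4 x 4 matrix
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

# PCA "by hand": center each column, then take the SVD
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = S ** 2 / (X.shape[0] - 1)
manual_ratio = explained_variance / explained_variance.sum()

pca = PCA(n_components=4, svd_solver="full").fit(X)

print(manual_ratio)
print(pca.explained_variance_ratio_)
```

The rows of `Vt` match `pca.components_` up to sign, which is why the comparison of components is usually done on absolute values.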


Resources:
- CLICK ME: http://setosa.io/ev/principal-component-analysis/
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- Theoretical: https://arxiv.org/pdf/1404.1100.pdf
- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
- https://medium.com/data-design/how-to-not-be-dumb-at-applying-principal-component-analysis-pca-6c14de5b3c9d



In [None]:
Image("../input/images2/images/images/setosa_pca_pic.png")

In [None]:
svd = sklearn.decomposition.TruncatedSVD(n_components=2, random_state=42)

In [None]:
ss_data.shape

In [None]:
svdschool = svd.fit_transform(ss_data)

In [None]:
svd.explained_variance_ratio_.sum() # With our 2 components, we have accounted for 83% of the variability

In [None]:
svdschool.shape # number of data points x number of dimensions in our lower dim space

In [None]:
svd.components_.shape # each new component as a linear combination of the original space 

Let's cluster using our new data:

In [None]:
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=42, n_jobs=-1)

In [None]:
kclusters_dimreduced = kmeans.fit_predict(svdschool)

In [None]:
pd.Series(kclusters_dimreduced).value_counts()


In [None]:
school_data[kclusters_dimreduced==0][numeric_features].head()

In [None]:
kmeans.cluster_centers_ # cluster centers, 3 clusters x 2 dimensions in SVD space

In [None]:
svd_centroid = svd.inverse_transform(kmeans.cluster_centers_) # Just like before, we want to know what the center is in our original feature space

In [None]:
svd_centroid.shape

In [None]:
dict(zip(numeric_features, ss.inverse_transform(svd_centroid[0])))

In [None]:
dict(zip(numeric_features, ss.inverse_transform(svd_centroid[1])))

In [None]:
dict(zip(numeric_features, ss.inverse_transform(svd_centroid[2])))

#### Reduce dimensions for visualization:
https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

In [None]:
ss_data.shape

In [None]:
svdschool.shape

In [None]:
pd.Series(kclusters_scaled_only).nunique()

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(svdschool[:,0], svdschool[:,1], c=kclusters_scaled_only.astype(float), edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('rainbow', pd.Series(kclusters_scaled_only).nunique()))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.title("2d Visualization of Numeric School Data - KMEANS on Scaled Data (4 Dimensions )")
plt.colorbar()
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(svdschool[:,0], svdschool[:,1], c=kclusters_dimreduced.astype(float), edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('rainbow', pd.Series(kclusters_dimreduced).nunique()))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.title("2d Visualization of Numeric School Data - KMEANS on Reduced Dimensional Data (2 Dimensions)")
plt.colorbar()
plt.show()

How different do these clusters look?

Let's see how the same 2d projection looks when colored by the true Class (L, M, H).

In [None]:
school_data.Class.values

In [None]:
class_vals = school_data.Class.map({'L': 0, 'M': 1, 'H': 2})
# Alternative: OrdinalEncoder takes one list of categories per column, so the order goes in a nested list
#oe = sklearn.preprocessing.OrdinalEncoder(categories=[['L', 'M', 'H']])
#class_vals = oe.fit_transform(school_data.Class.values.reshape(-1, 1)).ravel()

# LabelEncoder sorts the labels alphabetically, so used basic pandas to set the order instead :)


In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(svdschool[:,0], svdschool[:,1], c=class_vals.astype(float), edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('rainbow', school_data.Class.nunique()))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.title("2d Visualization of Numeric Data Colored by True Class Labels")
plt.colorbar()
plt.show()

Resources: https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/


Maybe our clustering would have been better if we used all our data and not just 4 numeric features.

Just like KMeans has KModes for categorical, PCA on categoricals can be done with MCA (Multiple Correspondence Analysis)

Resource: 
https://github.com/MaxHalford/Prince (sklearn compatible package for categorical analysis)

In [None]:
Image("../input/images2/images/images/clustering_metrics.png")

Clustering metric exercise in extra credit section.

Resources:
- https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
- https://scikit-learn.org/stable/modules/classes.html#clustering-metrics




### Let's get fancy!
### Convert our categorical data to numeric using target encoding:
* Also called "mean encoding"
* This is advanced feature engineering
* Easy to overfit
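
The core idea can be sketched by hand with pandas on a made-up frame. This is plain mean encoding only; as far as I know, the category_encoders TargetEncoder additionally smooths each category mean toward the overall mean, which this sketch leaves out:

```python
import pandas as pd

# Made-up data: target stands for Class mapped to L=0, M=1, H=2
toy = pd.DataFrame({
    "Topic":  ["IT", "IT", "Math", "Math", "Math"],
    "target": [2, 0, 1, 1, 2],
})

# Replace each category by the mean target of the rows in that category
category_means = toy.groupby("Topic")["target"].mean()
toy["Topic_encoded"] = toy["Topic"].map(category_means)

print(toy)
```

Because the encoded value is built from the target itself, the means should be learned on the training data only and then applied to validation and test data; otherwise the target leaks into the features.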

In [None]:
Image("../input/images2/images/images/categorical_encoders.png")

In [None]:
Image("../input/images2/images/images/mean_encoding.png")


Resource: 
- https://github.com/scikit-learn-contrib/categorical-encoding
- Video: https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv

In [None]:
ce = category_encoders.TargetEncoder()

In [None]:
# Go through the kmeans clustering again with the target encoding

In [None]:
school_data_target_encoded_df = ce.fit_transform(school_data.drop('Class', axis=1), class_vals)

In [None]:
school_data_target_encoded_df.gender.nunique()

In [None]:
school_data.head(20)

In [None]:
school_data_target_encoded_df.head(20)

Exercise: Go through and redo the exercise of reducing the dimensions of this new data and doing clustering - how do the clusters look? 

## Supervised Learning:
* Learning with labels provided.
* Many classification methods


In [None]:
Image("../input/images2/images/images/classifier_comparison.png")

## Getting started building a classifier:
* Assuming data cleaning has been done (outliers, missing, duplicates, etc.)
* Split data into training, validation, and test
* Encode/scale/preprocess data as necessary
* Build baseline dummy classifier
* Evaluate dummy classifier
* Try other classifiers (make sure to preprocess the data as expected by the classifier)
* Validate and evaluate other classifiers
* Choose a classifier
* Evaluate on the test set
* Stop

Caution: Split the data right away into training and test. Make sure to truly separate training and test data - don't have dirty data or data leakage 🙂

Caution: If data from the same student appears in both the training and test sets, this will make our test set evaluation look more optimistic.  Be careful how the data is split. Think about data availability at the time of a prediction.



#### Split the data

In [None]:
from sklearn import model_selection

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(school_data.drop('Class', axis=1), 
                                                            class_vals, test_size=0.25, random_state = 42, stratify=class_vals)


Caution: Stratify keeps the class proportions the same in the training and test splits (the split is still random within each class).  Make sure to look at the class imbalance and deal with it appropriately based on the algorithm!
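
A minimal sketch of what `stratify` does, using synthetic imbalanced labels (80 of class 0 and 20 of class 1, made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

train_share = y_tr.mean()   # fraction of class 1 in the training split
test_share = y_te.mean()    # fraction of class 1 in the test split

print(train_share, test_share)
```

Both splits keep the 20% share of class 1. Without `stratify`, a small test set could end up with noticeably more or fewer minority examples just by chance.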

Resource: https://github.com/scikit-learn-contrib/imbalanced-learn 



In [None]:
pd.Series(y_train).value_counts() #Check class imbalance

In [None]:
ce = category_encoders.TargetEncoder() # Create TargetEncoder object

In [None]:
X_train_numeric = ce.fit_transform(X_train, y_train) # Transform our categorical data

In [None]:
X_train_numeric.head()

In [None]:
ss = preprocessing.StandardScaler()

In [None]:
X_train_scaled = ss.fit_transform(X_train_numeric) # Scale our data

In [None]:
X_train_scaled.shape 

Get a baseline and then let's create more models to compare it to. 

Topics:
- Model evaluation - how do we evaluate our model
- Pipelines - make it easy to preprocess training and test data the same way
- Cross-validation - how do we choose hyperparameters and final model

In [None]:
from sklearn import dummy # Make dummy classifier as baseline

In [None]:
dummy = sklearn.dummy.DummyClassifier(random_state=42)

In [None]:
dummy.fit(X_train_scaled, y_train) # Fit the model

In [None]:
dummy_predictions = dummy.predict(X_train_scaled) # Make predictions

In [None]:
dummy.score(X_train_scaled, y_train) # Score our training data set for now

What does this score mean?

Caution: The score on the training data set is usually more optimistic than the score on a validation or test set.
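
To make the baseline score concrete, here is a small sketch on made-up labels. It sets `strategy="most_frequent"` explicitly, since the default strategy of `DummyClassifier` depends on the scikit-learn version. With this strategy the accuracy is simply the share of the majority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 60% class 0, 30% class 1, 10% class 2
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = np.zeros((100, 1))  # The dummy classifier ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Always predicts class 0, so the accuracy equals the share of class 0
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any real model should beat this number; if it does not, it has learned nothing beyond the class frequencies.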

Caution: Model evaluation - how to evaluate? Accuracy? Never rely on accuracy alone, and remember: “if it seems too good to be true, it probably is”.

* e-book on Evaluating Machine Learning Models: https://www.oreilly.com/ideas/evaluating-machine-learning-models

Resource:
https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/


Source: https://en.wikipedia.org/wiki/Precision_and_recall

In [None]:
Image("../input/images2/images/images/truepos_trueneg_wiki.png")

In [None]:
Image("../input/images2/images/images/precision_recall.png")

In [None]:
Image("../input/images2/images/images/roc_curve_pic.png")


roc pic: https://machinelearningmastery.com/assessing-comparing-classifier-performance-roc-curves-2/

Source: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html

Source: https://scikit-learn.org/stable/modules/model_evaluation.html

- Multiclass metrics:
"weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.

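A short sketch of what "weighted" averaging means, using made-up labels: the weighted F1 score is the per-class F1 scores averaged with each class's number of true samples as the weight.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 2, 1, 1, 0, 2])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # One F1 score per class
support = np.bincount(y_true)                          # True samples per class

manual_weighted = np.average(per_class_f1, weights=support)
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")
macro = f1_score(y_true, y_pred, average="macro")      # Unweighted mean, for comparison

print(per_class_f1, manual_weighted, sklearn_weighted, macro)
```
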
Caution:
- Many classifiers predict probabilities along with class labels. Use ROC or precision-recall curves to find the best threshold for your classifier based on your problem!
- Your model is always wrong, spend time thinking about what type of wrong you are comfortable with.
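
A minimal sketch of picking a threshold from a precision-recall curve. It uses a synthetic binary problem, and it assumes that maximizing F1 is the goal; in practice the criterion should come from which kind of error is more costly for your problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 80% / 20%)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_probs = model.predict_proba(X_val)[:, 1]  # Probability of the positive class

# precision and recall have one more entry than thresholds, so drop the last point
precision, recall, thresholds = precision_recall_curve(y_val, val_probs)
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-12)

best_threshold = thresholds[np.argmax(f1)]
val_predictions = (val_probs >= best_threshold).astype(int)
print("Threshold that maximizes F1 on the validation set:", best_threshold)
```

The threshold is chosen on a validation split, not on the test set, so the test set stays untouched for the final evaluation.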

In [None]:
from sklearn import metrics

In [None]:
print(metrics.classification_report(y_train, dummy_predictions)) # Argument order is (y_true, y_pred)

In [None]:
metrics.confusion_matrix(y_train, dummy_predictions) # Rows are true classes, columns are predicted classes

In [None]:
metrics.accuracy_score(y_train, dummy_predictions)

Great - now we have a baseline model - let's go beat it!


### Model Selection
* Can run through many ML algorithms
* Loop through hyperparameters
* Where's my validation set? 

Resource: https://machinelearningmastery.com/difference-test-validation-datasets/

* Even better - cross-validation!

Caution:  The purpose of cross-validation is model selection, not model building!

### Pipelines

https://scikit-learn.org/stable/modules/cross_validation.html
- Pipelines make it easy to store "steps" - preprocessing steps, encoding, model itself - easy to call the pipeline to transform our validation set in the same way
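* Caution: Preprocessing all the training data once prior to doing cross-validation is a form of 'leakage' - statistics from the held-out folds (e.g. target encodings or the means used for scaling) leak into the data the model is trained on. Putting the preprocessing inside the pipeline refits it on each training fold.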
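
A minimal sketch of the leakage-free pattern on synthetic data (a plain `StandardScaler` stands in for the workshop's encoder and scaler): because the scaler is a pipeline step, each fold fits it on that fold's training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)

# The scaler lives inside the pipeline, so for each fold it is fit on the
# training portion only and then applied to the held-out portion.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
print(fold_scores.mean(), fold_scores.std())
```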




In [None]:
Image("../input/images2/images/images/pipelines.png")

In [None]:
Image("../input/images2/images/images/cross_validation.png")

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
preprocessing_pipeline = Pipeline([('cat_encoder', category_encoders.TargetEncoder(return_df=False)), 
                                   ('scaler', preprocessing.StandardScaler())])

In [None]:
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.metrics import accuracy_score, f1_score 

In [None]:
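sklearn.metrics.get_scorer_names() # Available scoring names (scikit-learn >= 1.0; older versions use sklearn.metrics.SCORERS)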

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
Image("../input/images2/images/images/stratified_k_fold.png")

Resource: 
https://www.analyticsvidhya.com/blog/2018/05/improve-model-performance-cross-validation-in-python-r/
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate

#### Logistic Regression:
* Good one to start with
* Fast to train and score

In [None]:
preprocessing_pipeline = Pipeline([('cat_encoder', category_encoders.TargetEncoder(return_df=False)), 
                                   ('scaler', preprocessing.StandardScaler())])

In [None]:
lr_pipeline = make_pipeline(preprocessing_pipeline,
                         sklearn.linear_model.LogisticRegression(class_weight='balanced', random_state=42, solver='sag', multi_class='multinomial', n_jobs=1))
scoring = ['balanced_accuracy', 'f1_weighted']
lr_cv_scores = cross_validate(lr_pipeline, X_train, y_train, scoring=scoring, cv=10, return_train_score=False)

In [None]:
lr_cv_scores

In [None]:
for key in lr_cv_scores.keys():
    print("%s: %0.2f (+/- %0.2f)" % (key, lr_cv_scores[key].mean(), lr_cv_scores[key].std() * 2))

Since we want to try out a bunch of models, let's make a function:

In [None]:
def cross_validate_and_print(preprocessing_pipeline, model_type):
    cv=10
    pipeline = make_pipeline(preprocessing_pipeline, model_type)
    print("Preprocessing Steps : ", preprocessing_pipeline)
    print("Model Type : ", model_type)
    scoring = ['balanced_accuracy', 'f1_weighted']
    cv_scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv, return_train_score=False)
    print("Running {} Fold Cross Validation :".format(cv))
    for key in cv_scores.keys():
        print("%s: %0.2f (+/- %0.2f)" % (key, cv_scores[key].mean(), cv_scores[key].std() * 2))

In [None]:
cross_validate_and_print(preprocessing_pipeline=preprocessing_pipeline, 
        model_type=sklearn.linear_model.LogisticRegression(class_weight='balanced', random_state=42, 
                                                           solver='sag', multi_class='multinomial', n_jobs=1))

Exercise: Try also reducing the dimensions to 10 and see how LR changes.  (Hint: add TruncatedSVD to preprocessing_pipeline)

In [None]:
#Exercise:

## Deep dive - Trees
- Decision Tree
- Random Forest
- Gradient Boosted Trees (GBT)
- One implementation of GBT - XGBoost - first released in 2014 (paper published in 2016) - won a lot of Kaggle competitions when it first came out.
- Trees can handle multi-collinearity
- Trees can handle categoricals
- Caution: Impurity-based feature importance in tree-based methods tends to give more importance to variables with greater cardinality
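
One way to check for the cardinality bias is to compare the built-in importance with permutation importance on held-out data. This is a sketch on synthetic data, and permutation importance is a technique swapped in here for comparison, not something the workshop itself uses. The last column is pure noise with many distinct values, so compare how the two measures rank it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)

# Append a pure-noise column with many distinct values (high cardinality)
noise_id = rng.randint(0, 500, size=(500, 1))
X_noisy = np.hstack([X, noise_id])
X_tr, X_val, y_tr, y_val = train_test_split(X_noisy, y, random_state=42, stratify=y)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Built-in (impurity-based) importance is computed from the training data
impurity_importance = forest.feature_importances_
# Permutation importance is measured on held-out data
perm = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=42)

for i in range(X_noisy.shape[1]):
    print("feature %d: impurity=%.3f permutation=%.3f"
          % (i, impurity_importance[i], perm.importances_mean[i]))
```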

Resources:
- http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
- https://www.displayr.com/gradient-boosting-the-coolest-kid-on-the-machine-learning-block/
- http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
- https://stackoverflow.com/questions/51601122/xgboost-minimize-influence-of-continuous-linear-features-as-opposed-to-categori
- Brief history - http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
- Categorical Boosting - https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db

CLICK ME! http://www.r2d3.us/visual-intro-to-machine-learning-part-1/



### Tree Terminology:

In [None]:
Image("../input/images2/images/images/tree_terminology.png")

Source:
https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/

### Tree Creation
- Maximum class separation
- Pruning
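
"Maximum class separation" is usually measured with an impurity score. The sketch below uses Gini impurity (entropy is a common alternative); the helper names `gini` and `split_gain` are made up for illustration. A tree picks the split that decreases impurity the most:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity decrease achieved by splitting parent into left and right."""
    n = len(parent)
    weighted_child = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_child

parent = ["a", "a", "a", "b", "b", "b"]
perfect_gain = split_gain(parent, ["a", "a", "a"], ["b", "b", "b"])  # Classes fully separated
weak_gain = split_gain(parent, ["a", "b", "a"], ["b", "a", "b"])     # Classes still mixed
print(perfect_gain, weak_gain)
```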

Resources:
- https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
- https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1


In [None]:
Image("../input/images2/images/images/tree_pruning.png")

### Decision Trees
- One tree
https://en.wikipedia.org/wiki/Decision_tree_learning

Resources:
- https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
- https://bricaud.github.io/personal-blog/entropy-in-decision-trees/
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


In [None]:
Image("../input/images2/images/images/decision_tree_wiki.png")

In [None]:
Image("../input/images2/images/images/dtc_variables.png")

What do you think could be the pros and cons of using a decision tree?

### Random Forest
- Forest of trees!
- Also called 'ensemble' method - ensembling multiple decision trees
- Randomly selects points to build each tree (can specify with or without replacement)
- Can use a subset of the features to reduce overfitting
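
The bullets above map onto `RandomForestClassifier` parameters roughly as follows. This is a sketch on synthetic data with illustrative parameter values, not tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=12, n_informative=5, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # Each tree sees rows sampled with replacement
    max_features="sqrt",   # Each split considers a random subset of the features
    oob_score=True,        # Score each row using only trees that did not see it
    random_state=42,
)
forest.fit(X, y)

print("Number of trees:", len(forest.estimators_))
print("Out-of-bag accuracy:", forest.oob_score_)
```

The out-of-bag score is a by-product of bootstrapping: rows left out of a tree's sample act as a small validation set for that tree.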



In [None]:
Image("../input/images2/images/images/random_forest_simplified.png")

Resource: https://github.com/andosa/treeinterpreter
https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

In [None]:
Image("../input/images2/images/images/linreg_vs_xgboost.png")

Resource:
http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

### Gradient Boosted Trees
* Also builds many trees, but each new tree learns from the mistakes of the previous trees
* Can use a subset of the features and subset of the data to reduce overfitting
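
A from-scratch sketch of the "learn from the previous mistakes" idea. For simplicity it assumes a regression problem with squared error, where the mistakes are just the residuals; classification (as in this workshop) uses the same loop with a different loss. The learning rate, depth, and number of trees are illustrative values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # Start from a constant prediction
mse_history = [np.mean((y - prediction) ** 2)]

trees = []
for _ in range(50):
    residuals = y - prediction                     # Mistakes of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # Nudge predictions toward the target
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting: %.3f, after: %.3f" % (mse_history[0], mse_history[-1]))
```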

CLICK ME: http://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html

https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

Resource: Excellent explanation of each parameter for XGBoost: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

In [None]:
Image("../input/images2/images/images/ensembling_pic.png")

In [None]:
Image("../input/images2/images/images/bagging_vs_boosting.png")

Resource:
https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

In [None]:
from sklearn import tree

In [None]:
preprocessing_pipeline = Pipeline([('cat_encoder', category_encoders.TargetEncoder(return_df=False)), 
                                   ('scaler', preprocessing.StandardScaler())])

In [None]:
cross_validate_and_print(preprocessing_pipeline, tree.DecisionTreeClassifier(random_state=42, class_weight='balanced'))

In [None]:
from sklearn import ensemble

In [None]:
forest = sklearn.ensemble.RandomForestClassifier(n_estimators=100, max_depth=10, 
                                                 n_jobs=-1, random_state=42, class_weight='balanced_subsample')


In [None]:
X_train.shape

In [None]:
X_train_ss = preprocessing_pipeline.fit_transform(X_train, y_train)

In [None]:
forest.fit(X_train_ss, y_train)

In [None]:
forest.feature_importances_

In [None]:
forest.estimators_[0] # Can access each of the decision tree estimators - we have 100 - can use this for visualization

In [None]:
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X_train_ss.shape[1]):
    print("%d. feature %d = %s (%f)" % (f + 1, indices[f], X_train.columns[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train_ss.shape[1]), importances[indices],
       color="r", align="center")
plt.xticks(range(X_train_ss.shape[1]), indices )
plt.xlim([-1, X_train_ss.shape[1]])
plt.show()

In [None]:
print(metrics.classification_report(y_train, forest.predict(X_train_ss)))  # What an awesome classifier! ?

In [None]:
rfc = ensemble.RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1, random_state=42, class_weight='balanced_subsample')

In [None]:
cross_validate_and_print(preprocessing_pipeline, rfc)

And this is why we cross validate! So easy to overfit! 

Use min_samples_leaf, min_samples_split, and max_features to avoid overfitting.
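
A sketch of how to see the overfitting gap directly: `cross_validate` with `return_train_score=True` reports the score on the training folds next to the score on the held-out folds. The data is synthetic with deliberately noisy labels, and the two parameter settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data with 20% of the labels flipped, so a perfect fit means memorized noise
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=42)

# Compare fully grown trees with shallower, more constrained trees
for params in [{"max_depth": None}, {"max_depth": 3, "min_samples_leaf": 5}]:
    forest = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    scores = cross_validate(forest, X, y, cv=5, scoring="balanced_accuracy",
                            return_train_score=True)
    print(params,
          "train: %.2f" % scores["train_score"].mean(),
          "validation: %.2f" % scores["test_score"].mean())
```

A large difference between the train and validation numbers is the signal to constrain the trees further.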

In [None]:
rfc = ensemble.RandomForestClassifier(n_estimators=100, max_depth=3, n_jobs=-1, random_state=42, min_samples_split=5, class_weight='balanced_subsample')
cross_validate_and_print(preprocessing_pipeline, rfc)

In [None]:
import xgboost as xgb

XGBoost has a scikit-learn API wrapper so we can call it the same way we call other classification algorithms.

In [None]:
xgb.XGBClassifier()

Need to change our objective because we have a multiclass problem.  
Resource:
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/


In [None]:
Image("../input/images2/images/images/xgboost_learning_task_parameter.png")

In [None]:
# Fit the model
xgb_model = xgb.XGBClassifier(max_depth=5, random_state=42, 
                              subsample=0.8, colsample_bytree=0.8, n_jobs=-1, objective='multi:softmax').fit(X_train_ss, y_train)

In [None]:
# feature importance
print(xgb_model.feature_importances_)

In [None]:
from xgboost import plot_importance

In [None]:
plot_importance(xgb_model)

In [None]:
sorted_idx = np.argsort(xgb_model.feature_importances_)[::-1]
for index in sorted_idx:
    print([X_train.columns[index], xgb_model.feature_importances_[index]]) 


Note that 'Relation' moved up here in feature importance compared to RandomForest model.

In [None]:
xgbc = xgb.XGBClassifier(n_estimators=20, max_depth=3, random_state=42, subsample=0.8, colsample_bytree=0.8, 
                                           n_jobs=-1, objective='multi:softmax')

In [None]:
cross_validate_and_print(preprocessing_pipeline, xgbc)

Note how close the evaluation metrics are to Random Forest - how many estimators did we use? Go back and compare the score_times.

Exercise: Try without scaling, does this change the cross-validation accuracy of GBT? 



STOP - After trying all the different parameters, identify a model to select, train it with all the training data, and show the test set accuracy.  DO NOT GO BACK AND CHOOSE A DIFFERENT MODEL IF THE TEST SET ACCURACY DOESN'T MAKE YOU HAPPY.

Note: We would use a combination of cross validation and grid search against all the parameters - called "hyperparameter tuning" to pick our final model.  Look at resources section at the end which has python packages that will do these in a 'smarter' way rather than brute force.
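
A minimal sketch of the brute-force version with `GridSearchCV`, on synthetic data. The step names `scaler` and `model` and the small parameter grid are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", RandomForestClassifier(n_estimators=50, random_state=42))])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>
param_grid = {"model__max_depth": [3, 5], "model__min_samples_leaf": [1, 5]}

# Every combination in the grid is scored with 5-fold cross-validation
search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Run the search on the training data only, so the test set is still used exactly once at the end.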


In [None]:
final_model = xgb.XGBClassifier(n_estimators=20, max_depth=3, random_state=42, subsample=0.8, colsample_bytree=0.8, 
                                           n_jobs=-1, objective='multi:softmax')

In [None]:
final_model_pipeline = make_pipeline(preprocessing_pipeline, final_model)

In [None]:
final_model_pipeline.fit(X_train, y_train)

In [None]:
final_model_pipeline.score(X_test, y_test)

In [None]:
y_test_predictions = final_model_pipeline.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, y_test_predictions)) # Argument order is (y_true, y_pred)

Congratulations - you have now gone through a data set, built many models, and selected one.  To continue your journey in improving this model, you can try additional things like hyperparameter tuning, feature engineering, ensembling linear and tree-based models, etc!  For example, start digging into the misclassifications to gain more insight about your model and data.

Caution: Digging into misclassification is an excellent way to find gaps in your modeling approach - especially when putting models in production! 
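
A small sketch of one way to start, using a made-up results table (the column names and values are assumptions for illustration): keep the rows where the prediction is wrong, then count which pairs of classes get confused.

```python
import pandas as pd

# Made-up true labels, predictions, and one feature for illustration
results = pd.DataFrame({
    "feature": [5, 80, 42, 17, 63, 9],
    "actual": ["low", "high", "mid", "low", "high", "mid"],
    "predicted": ["low", "high", "high", "mid", "high", "mid"],
})

# Keep only the rows where the model got it wrong
misclassified = results[results["actual"] != results["predicted"]]
print(misclassified)

# Which (actual, predicted) pairs are confused most often?
confusions = misclassified.groupby(["actual", "predicted"]).size().sort_values(ascending=False)
print(confusions)
```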

## Wrap Up:
- scipy, numpy, pandas
- sklearn, pipelines
- model evaluation
- supervised, unsupervised approaches
- model selection using cross validation

## Extra Credit Assignments

Extra Credit Cluster:
* Since we have Class labels for our data - go through different clustering algorithms and evaluation metrics to find an approach that clusters our data well
* Rerun Kmeans and evaluate, better with 5 clusters? better with 10 clusters? Evaluation of your choice
* Use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html#sklearn.metrics.adjusted_mutual_info_score metric to evaluate which approach gives us a better score, kmeans with pca or kmeans without scaling?
Resources:
https://scikit-learn.org/stable/modules/clustering.html
https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

Extra Credit LDA:
* Use LDA algorithm to reduce the dimension of the data using the Class labels and see how the performance of the tree based models changes with this transformed data
Resources:
https://stackabuse.com/implementing-lda-in-python-with-scikit-learn/

Extra Credit Trees:
* Try building a decision tree, random forest, and GBT and reduce the dimensions to 5 using TruncatedSVD, how does this change the cross-validation accuracy?
* How does removing scaling change the cross validation accuracy of Random Forest? or Decision Tree?


## Additional Packages (My favorites)

Python packages used:
numpy, scipy, pandas, sklearn, xgboost

Highlights of others:
h2o - huge set of algos including AutoML (newly opensourced)
LightGBM
catboost

text:
gensim
nltk
spacy

Graphs:
networkx

Feature engineering:
https://www.featuretools.com/
https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
Encoding - https://github.com/scikit-learn-contrib/categorical-encoding

Imbalanced Learning:
https://imbalanced-learn.org/en/stable/index.html

Python Visualization:
matplotlib
seaborn
altair
holoviews
bokeh
plot.ly / Dash

Clustering:
HDBSCAN

Hyperparameter Optimization:
Hyperopt
tpot
spearmint

General:
https://github.com/scikit-learn-contrib



## Additional resources 

Explore the data with Pandas:
https://github.com/pandas-profiling/pandas-profiling

PCA: 
https://www.utdallas.edu/~herve/abdi-awPCA2010.pdf
https://www.kaggle.com/merckel/preliminary-investigation-pca-boosting
https://medium.com/data-design/how-to-not-be-dumb-at-applying-principal-component-analysis-pca-6c14de5b3c9d

Clustering:
https://courses.cs.washington.edu/courses/cse546/08sp/slides/cdr.pdf
http://www.cbs.dtu.dk/chipcourse/Lectures/ClusteringPCA_2010.pdf
https://www.youtube.com/watch?v=EUQY3hL38cw
https://github.com/nicodv/kmodes
https://pypi.org/project/pyclustering/

Visualizing Multidimensional Data :
http://www.apnorton.com/blog/2016/12/19/Visualizing-Multidimensional-Data-in-Python/
https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b


LDA can be used for dimensionality reduction as well as classification.
https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html
https://scikit-learn.org/stable/modules/lda_qda.html
https://scikit-learn.org/stable/auto_examples/classification/plot_lda.html#sphx-glr-auto-examples-classification-plot-lda-py


Pitfalls:  
http://www.cs.colorado.edu/~mozer/Research/Selected%20Publications/white-paper3.html
http://danielnee.com/2015/01/common-pitfalls-in-machine-learning/

Data Set Size:
https://medium.com/rants-on-machine-learning/what-to-do-with-small-data-d253254d1a89
https://datascience.stackexchange.com/questions/19925/what-are-the-most-suitable-machine-learning-algorithms-according-to-type-of-data
https://blog.myyellowroad.com/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512
https://www.datacamp.com/community/tutorials/categorical-data

Other good things:
https://heartbeat.fritz.ai/how-to-make-your-machine-learning-models-robust-to-outliers-44d404067d07
https://github.com/scikit-learn-contrib/sklearn-pandas
https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62
https://machinelearningmastery.com/the-model-performance-mismatch-problem/
https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365





