This is meant to be a simple scikit-learn (sklearn) tutorial/cheatsheet notebook, written by Drace Zhan of NYCDSA for student/public use.
Note that the mathematics behind the models is not covered here, and the examples are purely illustrative, so there is no preprocessing, train-test splitting, verification of assumptions, etc.

In [None]:
#data manipulation tools
import numpy as np
import pandas as pd

#data visualization tools
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
#loading datasets used for this notebook

from sklearn import datasets

In [None]:
#loading iris from dictionary to dataframe
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

In [None]:
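#loading boston dataset from dictionary to dataframe
#note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
boston = datasets.load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)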
boston_df['target'] = boston.target

In [None]:
#Linear Regression & Logistic Regression
import sklearn
from sklearn.linear_model import LinearRegression, LogisticRegression

In [None]:
boston_df.head()

Note: Model objects in sklearn have various parameters you can pass in as you create them. I highly recommend reading the documentation to understand what you can adjust. One of the more useful ones is "n_jobs", which many (but not all) models accept. This argument is the number of CPU cores used when running the model. Setting it to -1, as in "n_jobs = -1", will use ALL your cores. Your laptop will get quite hot during this.
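
As a minimal sketch of inspecting and changing parameters, here is a random forest created with n_jobs set. The model choice, the variable name rf_parallel, and the parameter values are my own assumptions for illustration, not part of the notebook's workflow.

```python
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 asks scikit-learn to use every available core for this model.
rf_parallel = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=0)

# get_params() returns a dict of every constructor parameter and its current value.
params = rf_parallel.get_params()
print(params["n_jobs"], params["n_estimators"])

# set_params() changes a parameter after the object has been created.
rf_parallel.set_params(n_jobs=1)
print(rf_parallel.get_params()["n_jobs"])
```

Calling get_params() on any model is a quick way to see what can be tuned before you open the documentation.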

In [None]:
#creating a Linear regression in SKlearn
linreg_1 = LinearRegression()

Specifying X and y in Python can feel a bit strange at first. The thing to keep in mind is that you pass in the values of your features (a DataFrame or array), not the column names themselves. I recommend first creating a list of your feature column names and then using it to select X, as in the example below.

In [None]:
#creating X, y for the Boston dataset; for this toy example, we'll use every column except the target as a feature
b_feat_list = boston_df.columns[0:-1]
X = boston_df[b_feat_list]
y = boston_df.target

Once you create the model object, you can fit it to your data, which is what trains the model. After training finishes, you can call on the various attributes of the now-trained model. This is elaborated further below.

In [None]:
#fitting the LinearRegression object to your data
linreg_1.fit(X, y)

Now that the object has been fitted, you'll note it has several new attributes you can call upon. Feel free to experiment with them! I will list some of the more useful ones below.

In [None]:
#getting the coefficients of your LinearRegression object
linreg_1.coef_

In [None]:
#getting the intercept
linreg_1.intercept_

In [None]:
#Getting the R2 score of your model
linreg_1.score(X, y)

In [None]:
#Predicting with your model
linreg_1.predict(X)

In [None]:
#passing in another set for your model to predict
X_test = X[0:30]
linreg_1.predict(X_test)
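
To see the fit-then-inspect workflow on data where the right answer is known, here is a minimal sketch on synthetic data. The data, the true coefficients (3, -2, intercept 5), and the names X_demo, y_demo, and model are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 5, no noise.
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 2)
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

model = LinearRegression()
# Fitted attributes end in an underscore and do not exist before fit().
print(hasattr(model, "coef_"))

model.fit(X_demo, y_demo)
print(model.coef_)        # should be close to [3, -2]
print(model.intercept_)   # should be close to 5
print(model.score(X_demo, y_demo))
```

The trailing underscore is sklearn's convention for anything learned from data, so it is a handy way to tell fitted attributes apart from the parameters you set yourself.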

Creating a Logistic Regression object works much like creating a Linear Regression object. Some useful arguments include "class_weight" to handle imbalanced classes. In addition, the Logistic Regression object applies regularization automatically (ridge/L2 by default, but you can switch to lasso/L1 with a solver that supports it, such as "liblinear" or "saga"). C is the inverse of the regularization strength, so you can effectively turn regularization off by setting C to a very large number.
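
Here is a rough sketch of what C and class_weight do in practice. The synthetic imbalanced dataset and the specific C values (0.01 and 100) are assumptions picked for illustration. A larger C means weaker regularization, so the coefficients are allowed to grow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced two-class data (roughly 90% / 10%), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], flip_y=0.05, random_state=0,
)

# Small C = strong regularization; large C = weak regularization.
strong_reg = LogisticRegression(C=0.01, max_iter=5000).fit(X_demo, y_demo)
weak_reg = LogisticRegression(C=100.0, max_iter=5000).fit(X_demo, y_demo)
print(np.linalg.norm(strong_reg.coef_), np.linalg.norm(weak_reg.coef_))

# class_weight="balanced" reweights samples inversely to class frequency.
balanced = LogisticRegression(class_weight="balanced", max_iter=5000).fit(X_demo, y_demo)
print(balanced.predict(X_demo).sum(), strong_reg.predict(X_demo).sum())
```

The last line prints how many rows each model assigns to the minority class, which is a quick way to see whether reweighting changed anything on your data.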

In [None]:
logit_1 = LogisticRegression()

In [None]:
#similar steps to the Boston example above: build the feature list, X, and y for iris
iris_feat_list = iris_df.columns[0:-1]
iris_X = iris_df[iris_feat_list]
iris_y = iris_df.target

In [None]:
logit_1.fit(iris_X, iris_y)

Much like LinearRegression, the LogisticRegression model has many attributes and methods you can call upon. In addition to standard prediction, it can also show the probabilities behind its predictions, as well as the log of those probabilities.

In [None]:
#Predictions
logit_1.predict(iris_X)

In [None]:
#Prediction probabilities
iris_predict_data = pd.DataFrame(logit_1.predict_proba(iris_X))
iris_predict_data

In [None]:
#Log of the prediction probabilities (note: these are log probabilities, not log odds)
iris_log_predict = pd.DataFrame(logit_1.predict_log_proba(iris_X))
iris_log_predict
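
As a small sanity check on how the three prediction methods relate, here is a sketch on the iris data. The variable names and max_iter=1000 (set only to avoid convergence warnings) are my own choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)          # shape (n_samples, n_classes)
log_proba = clf.predict_log_proba(X_demo)  # natural log of the same values

# Each row of probabilities sums to 1.
print(proba.sum(axis=1)[:5])

# predict() returns the class with the highest probability.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print((labels_from_proba == clf.predict(X_demo)).all())
```

The columns of predict_proba follow the order of clf.classes_, which matters once your labels are not simply 0, 1, 2.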


Decision trees and tree-based models are also in sklearn. While decision trees and random forests are often taught together, remember that most tree-based models are ensembles, so they are imported from a different module (sklearn.ensemble) than the single decision tree (sklearn.tree).
Since tree-based models have lots of parameters that rely on the theory behind trees, I won't go too in-depth into them here except to highlight some parameters that are commonly tuned.
Since I'm using a classifier here, splits can be chosen according to "gini" or "entropy". It's worth evaluating which criterion gives you better performance, even though both try to do similar things. Regressor models have their own criteria, such as "mse" (named "squared_error" from scikit-learn 1.0 onward). Standard settings such as max_depth, min_samples_split, min_samples_leaf, etc. can all be tuned as well. Note that max_features defaults to None (all features are considered at every split), whereas the "standard" practice for classification is to use the square root of the number of features. Setting it to "sqrt" does this; older scikit-learn versions also accepted "auto" for the same behavior in classifiers, but that option has since been removed.
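
A minimal sketch of setting these parameters and comparing the two split criteria with cross-validation follows. The specific values (max_depth=3 and so on) are arbitrary assumptions for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Compare the two split criteria with otherwise identical, arbitrary settings.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=3,            # cap how deep the tree may grow
        min_samples_split=4,    # need at least 4 samples to split a node
        min_samples_leaf=2,     # every leaf keeps at least 2 samples
        max_features="sqrt",    # consider sqrt(n_features) features per split
        random_state=0,
    )
    scores = cross_val_score(tree, X_demo, y_demo, cv=5)
    print(criterion, scores.mean())

# Fit the last tree (entropy) and confirm the depth cap was respected.
tree.fit(X_demo, y_demo)
print(tree.get_depth())
```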

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
iris_tree = DecisionTreeClassifier()

In [None]:
iris_tree.fit(iris_X, iris_y)

In [None]:
#Examining feature importances of your tree model
iris_tree.feature_importances_

In [None]:
#A neat way to visualize your feature importance!
pd.Series(index = iris_feat_list, data = iris_tree.feature_importances_).sort_values().plot(kind = 'bar')

In [None]:
#Counting the number of classes in your target (use iris_tree.classes_ for the labels themselves)
iris_tree.n_classes_

In [None]:
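#counting how many features are in your model
#(n_features_ was removed in scikit-learn 1.2; on newer versions use iris_tree.n_features_in_ instead)
iris_tree.n_features_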

As mentioned, sklearn's Random Forest and Gradient Boosting Machine both live under the ensemble module. Most of their parameters are the same as those of the decision tree itself, so I'll only go over some ensemble-specific parameters. Note that you can also use VotingClassifier from this module as a simple ensemble object if you want to play with model ensembling a bit on your own.
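
Since VotingClassifier is not demoed elsewhere in this notebook, here is a minimal sketch of it. The two member models, their labels ("logit" and "tree"), and voting="soft" are my own assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Each member is a (name, model) pair; "soft" voting averages predicted probabilities.
voter = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X_demo, y_demo)

print(voter.score(X_demo, y_demo))
# Fitted members can be retrieved by name.
print(voter.named_estimators_["tree"].feature_importances_)
```

With voting="hard" the ensemble takes a majority vote on the predicted labels instead, which also works for models that lack predict_proba.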

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier

One thing to note is that since these are ensembles of decision trees, one new parameter is "n_estimators", the number of trees in your ensemble. A larger number such as 1000-2000 is often recommended, but we'll use the default for this toy example (10 in older scikit-learn versions, 100 since version 0.22). Another thing worth noting is max_features: for the random forest classifier the default considers the square root of the total features at each split, but for the regressor used here the default considers all features, so set max_features="sqrt" yourself if you want the square-root behavior.

In [None]:
boston_rf_1 = RandomForestRegressor()

In [None]:
boston_rf_1.fit(X, y)

In [None]:
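#This will show you the base estimator of your ensemble, in this case, a decision tree regressor
#(this attribute was renamed in scikit-learn 1.2; on newer versions use boston_rf_1.estimator instead)
boston_rf_1.base_estimator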

In [None]:
#This will show you a list of ALL the estimators in your model
boston_rf_1.estimators_

In [None]:
#Predicting with model
boston_rf_1.predict(X)
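
To make n_estimators and max_features concrete, here is a minimal sketch on synthetic regression data. The dataset and the values (50 trees, "sqrt") are assumptions for illustration. It also checks that the forest's prediction is simply the average over its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=50,        # number of trees in the ensemble
    max_features="sqrt",    # consider sqrt(n_features) features at each split
    random_state=0,
)
forest.fit(X_demo, y_demo)

# One fitted tree per estimator.
print(len(forest.estimators_))

# The forest's prediction is the average of the individual trees' predictions.
tree_preds = np.stack([t.predict(X_demo[:5]) for t in forest.estimators_])
print(np.allclose(tree_preds.mean(axis=0), forest.predict(X_demo[:5])))
```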

GBM is the last model I'd like to discuss in this tutorial. There are a few new parameters to take into account here. One is the loss function you wish to optimize and the other is the learning rate. I generally don't touch these, but learning_rate can be thought of as having a reciprocal relationship with n_estimators: with more trees you can afford a smaller learning rate, and with fewer trees you need a larger one. Tuning the two together can lead to better results. Subsample is another parameter that I often tune. Setting it below the default of 1.0 fits each tree on a random fraction of the rows (stochastic gradient boosting), which tends to reduce variance at the cost of some bias. General good practice is to leave it at .8 or so (depending on the size of your data as well!). Once again, for simplicity's sake, I will leave most parameters at their defaults here.

In [None]:
iris_gbm = GradientBoostingClassifier()

In [None]:
iris_gbm.fit(iris_X, iris_y)
iris_gbm.predict(iris_X)
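
For reference, here is a minimal sketch of a GBM with the parameters discussed above set explicitly. The values (200 trees, learning rate 0.05, subsample 0.8) are assumptions for illustration rather than tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = load_iris(return_X_y=True)

gbm = GradientBoostingClassifier(
    n_estimators=200,     # more boosting stages...
    learning_rate=0.05,   # ...paired with a smaller step per stage
    subsample=0.8,        # each stage sees a random 80% of the rows
    random_state=0,
)
gbm.fit(X_demo, y_demo)

# For multiclass problems there is one regression tree per class per stage.
print(gbm.estimators_.shape)
print(gbm.score(X_demo, y_demo))
```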

Finally, there's another really useful module in sklearn: metrics. I don't have time to go into it in depth here, but it's worth looking into. It has lots of different metrics you can use to evaluate your models, such as RMSE, accuracy, and the like. I'll show a standard convention:

In [None]:
import sklearn.metrics as metrics

To use the various metrics, simply call "metrics.FUNCTION(true_y, predicted_y)" and it will usually give you the proper score. If you have issues with this, feel free to find me. :)

In [None]:
#example: metrics.accuracy_score(true_y, predicted_y)
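
Here is a small worked sketch of the convention on hand-made values, so the numbers can be checked by eye. The label and value lists are made up for this illustration. RMSE is computed here as the square root of the mean squared error.

```python
import numpy as np
from sklearn import metrics

# Classification: 4 of the 5 labels match, so accuracy is 0.8.
true_labels = [0, 1, 2, 2, 1]
pred_labels = [0, 1, 2, 1, 1]
accuracy = metrics.accuracy_score(true_labels, pred_labels)
print(accuracy)
print(metrics.confusion_matrix(true_labels, pred_labels))

# Regression: errors are -1, 0, 2, 0, so MSE = (1 + 0 + 4 + 0) / 4 = 1.25.
true_values = [3.0, 5.0, 2.0, 7.0]
pred_values = [2.0, 5.0, 4.0, 7.0]
mse = metrics.mean_squared_error(true_values, pred_values)
rmse = np.sqrt(mse)
print(mse, rmse)
```

The argument order (true values first, predictions second) matters for asymmetric outputs such as the confusion matrix.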

If you want to cross-validate your model, you'll need to import the tools from the model_selection module in sklearn (older sklearn versions kept them in the now-removed cross_validation module). You can also use the 

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict, cross_validate, train_test_split

Here we will 5-fold cross-validate the linear regression model from earlier and store the results in the scores variable. You can adjust the scoring metric as well, but I'm leaving it at the default for now. Note that the cross_validate function is imported here too: it lets you cross-validate against several scoring metrics at once and returns fit and score times along with the scores. For simplicity's sake, it won't be demoed in the cell below, but it's useful to know about once you feel more comfortable with sklearn.

In [None]:
scores = cross_val_score(linreg_1, X, y, cv = 5)
scores
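
For anyone curious about cross_validate, here is a minimal sketch with two scoring metrics on synthetic data. The dataset and the choice of "r2" and "neg_mean_squared_error" are my own assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# Passing a list of scorers returns one "test_<name>" array per metric.
results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=5,
    scoring=["r2", "neg_mean_squared_error"],
)
print(sorted(results.keys()))
print(results["test_r2"].mean())
```

Error metrics are negated ("neg_" prefix) so that a higher score is always better, which is why the MSE values come back as negative numbers.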

We can also create a holdout set using the train_test_split function from model_selection.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3)
#fit() returns the model itself, so clf and linreg_1 are the same (now refitted) object
clf = linreg_1.fit(X_train, y_train)
clf.score(X_test, y_test)
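
Two optional arguments are often worth adding to train_test_split: random_state makes the split reproducible, and stratify keeps class proportions the same in both halves. Here is a minimal sketch on iris. The variable names and the value 42 are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class balance in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Train on the training portion only, then score on the untouched holdout.
holdout_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(len(X_tr), len(X_te))
print(holdout_model.score(X_te, y_te))
```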

Here's a demo of making cross-validated predictions: each row's prediction comes from a model trained on the folds that did not contain that row, so every prediction is effectively a hold-out prediction. You can then use the metrics functions mentioned previously on the results.

In [None]:
predictions = cross_val_predict(linreg_1, X, y, cv = 5)
metrics.r2_score(y, predictions)
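
To show what cross_val_predict does under the hood, here is a sketch that rebuilds its output with an explicit KFold loop. It assumes a regressor with an integer cv, in which case sklearn uses unshuffled KFold. The synthetic data and variable names are my own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = LinearRegression()
cv_preds = cross_val_predict(model, X_demo, y_demo, cv=5)

# Manual version: fit on four folds, predict the fifth, and fill in those rows.
manual_preds = np.empty_like(y_demo)
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_preds[test_idx] = fold_model.predict(X_demo[test_idx])

print(np.allclose(cv_preds, manual_preds))
```

Because each row is predicted by a different fold's model, these predictions are good for diagnostics and plots, but the resulting score is not the same thing as the per-fold average from cross_val_score.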