# Functions in `sklearn` {.unnumbered}


## Functions in `matplotlib` and `seaborn`

## Functions and methods in `pandas`
We assume basic knowledge of `pandas` in this course, e.g., the functions and methods listed below.



| Name                          | Description           | Examples  | Lecture   |
| ----------------------------- | -------------------   | -------   | -------   |
| `pd.read_csv()`               |                       |           | 8         |
| `.head()`                     |                       |           | 8         |
| `.info()`                     |                       |           | 8         |
| `.dtypes`                     |                       |           | 8         |
| `.describe()`                 |                       |           | 8         |
| `.value_counts()`             |                       |           | 8         |
| `.astype()`                   |                       |           | 8         |
| `.sort_values()`              |                       |           | 8         |
| `.reset_index()`              |                       |           | 8         |
| Indexing with `[]`            |                       |           | 9         |
| Filtering with bools          |                       |           | 9         |
| `.loc`                        |                       |           | 9         |
| `.iloc`                       |                       |           | 9         |
| `pd.concat()` / `.merge()`    |                       |           | 9         |
| `.isna()`                     |                       |           | 10        |
| `.dropna()`                   |                       |           | 10        |
| `.fillna()`                   |                       |           | 10        |
| `.sum()`                      |                       |           | 10        |
| `pd.cut()`                    |                       |           | 10        |
| `.groupby()`                  |                       |           | 10        |
| `.pivot_table()`              |                       |           | 10        |
| `pd.crosstab()`               |                       |           | 10        |
| `.plot()`                     |                       |           | 11        |
| Axes and subplots             |                       |           | 12        |
| `sns.lineplot`                |                       |           | 12        |
| `sns.countplot`               |                       |           | 12        |
| `sns.barplot`                 |                       |           | 12        |
| `sns.heatmap`                 |                       |           | 12        |
| `sns.histplot`                |                       |           | 12        |
| `sns.jointplot`               |                       |           | 12        | 
| `sns.pairplot`                |                       |           | 12        |
| `sns.FacetGrid`               |                       |           | 12        |

### `pandas` recap

#### Lecture 8
* Series, their values and indices.
* Dataframe, creation from dictionaries, with indices.
* `read_csv`, `head`, and `info`.
* Show datatypes with `dtypes` for the dataframe and `dtype` for a series.
* `describe()`.
* `value_counts()`.
* `astype`.
* Sorting with `sort_values`.
* Using `inplace=True`.
* `reset_index`.
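
The Lecture 8 items can be tried out on a small data frame. The sketch below is only an illustration: the data and column names are made up, and in practice the frame would usually come from `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
df = pd.DataFrame(
    {"city": ["Oslo", "Bergen", "Oslo", "Trondheim"], "sales": ["10", "7", "3", "5"]},
    index=["a", "b", "c", "d"],
)

print(df.head(2))           # first two rows
print(df.dtypes)            # data types for the whole data frame
print(df["sales"].dtype)    # data type for a single series

df["sales"] = df["sales"].astype(int)   # convert strings to integers
counts = df["city"].value_counts()      # how often each city occurs

# Sort by sales (largest first) and replace the old index with 0, 1, 2, ...
sorted_df = df.sort_values("sales", ascending=False).reset_index(drop=True)
print(sorted_df)
```

Here `sort_values` returns a new data frame; passing `inplace=True` would modify `df` directly instead.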

#### Lecture 9
* Selection by index with []; selecting multiple indices.
* Filtering by boolean arrays.
* Logical operators in selection.
* Selection with `.loc`.
* Selection with `.iloc`.
* Concatenation with `concat`.
* Merging data frames with `merge`.

**Attributes:** `df.index`, `df.columns`.
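
A compact sketch of the Lecture 9 selection tools follows. The data frames and their column names are hypothetical and chosen only to keep the output easy to check by hand.

```python
import pandas as pd

# Hypothetical example data with a non-default index.
df = pd.DataFrame(
    {"name": ["Ann", "Bo", "Cy"], "age": [31, 25, 47], "dept": ["x", "y", "x"]},
    index=[10, 20, 30],
)

ages = df["age"]                  # one column gives a Series
subset = df[["name", "age"]]      # a list of columns gives a DataFrame

# Boolean filtering; combine conditions with & (and), | (or), ~ (not).
older_x = df[(df["age"] > 30) & (df["dept"] == "x")]

by_label = df.loc[20, "name"]     # .loc uses labels: row label 20
by_position = df.iloc[0, 1]       # .iloc uses positions: first row, second column

# Stack rows from two data frames with the same columns.
extra = pd.DataFrame({"name": ["Di"], "age": [52], "dept": ["y"]}, index=[40])
stacked = pd.concat([df, extra])

# Join on a shared column.
depts = pd.DataFrame({"dept": ["x", "y"], "floor": [1, 2]})
merged = df.merge(depts, on="dept")
```

Note the parentheses around each condition in the filter: `&` binds more tightly than `>` and `==`, so leaving them out raises an error or gives the wrong result.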

#### Lecture 10
* `.isna()`
* `.dropna()`
* `.fillna()`
* `.sum()`
* `pd.cut()`
* `.groupby()`
* `.pivot_table()`
* `pd.crosstab()`
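
The Lecture 10 functions are often used together. The following is a minimal sketch on made-up data (the columns `group`, `sex`, and `age` are assumptions for illustration), showing missing-value handling followed by binning and aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b", "b"],
        "sex": ["f", "m", "f", "m", "f"],
        "age": [23, 41, np.nan, 35, 67],
    }
)

n_missing = df["age"].isna().sum()                # number of missing ages
dropped = df.dropna()                             # drop rows with any missing value
filled = df.fillna({"age": df["age"].median()})   # or impute with the median

# pd.cut and pd.crosstab are top-level pandas functions, not methods.
filled["age_bin"] = pd.cut(
    filled["age"], bins=[0, 30, 50, 100], labels=["young", "mid", "old"]
)

group_means = filled.groupby("group")["age"].mean()
pivot = filled.pivot_table(values="age", index="group", columns="sex", aggfunc="mean")
table = pd.crosstab(filled["group"], filled["sex"])
```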

#### Lecture 11
* `plt.plot`: Line plot, scatter plot, bar plot.
* `sns.lineplot`
* `sns.countplot`
* `sns.barplot`
* `sns.heatmap`
* `sns.histplot`
* `sns.pairplot`
* `sns.FacetGrid`


### Data cleaning and missing values

Data cleaning:

* Split columns.

Resources:

* <https://datascientyst.com/exploratory-data-analysis-pandas-examples/>
* <https://bookdown.org/rdpeng/exdata/exploratory-data-analysis-checklist.html>

### Outliers
* Use visual explanation of box plots in lecture.
* Show how to find the outliers.
* Mention there is a lot of stuff about this.
* Have some exercises where it matters.
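
One common way to find outliers is the rule used for box plot whiskers: flag points more than 1.5 times the interquartile range outside the quartiles. The sketch below assumes this rule and uses made-up numbers; other definitions of an outlier exist.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 21, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1                                   # interquartile range

# Whisker limits as drawn by a standard box plot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Whether a flagged point should be removed depends on the domain: it may be a data entry error or a genuine extreme observation.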

### Imbalanced data

### Data leakage
* Find a good source on this at the correct level.

* Having the target as a feature.

#### Invalid processing
* The example from Elements of Statistical Learning.
* All data processing must be done using only the training data.
* Exploratory data analysis is usually fine, but not always.
* Separate into train and test in the beginning.
* Do all modelling steps in cross-validation:
    * Feature scaling.
    * Feature selection.
    * And so on.
* https://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
    * (i) An "account number" feature, for the problem of predicting whether a potential customer would open an account at a bank. Obviously, assignment of such an account number is only done after an account has been opened. 
    * (ii) An "interviewer name" feature, in a cellular company churn prediction problem.
    * **Anachronisms.**
        * Was feature $x$ registered before or after the event $y$?
            * IBM example.
    * Find some data sets with data leakage.
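
The point about doing all modelling steps inside cross-validation can be illustrated with pure noise, in the spirit of the feature-screening example mentioned above. This is a sketch under assumptions: the data are simulated, and `SelectKBest` with `f_classif` is swapped in as a fast feature selector even though it is not on the course's function list. Since the labels are independent of the features, an honest accuracy estimate should be close to 50%.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))    # pure noise features
y = rng.integers(0, 2, size=100)    # labels independent of X

# Leaky: features are selected using all rows, including the future test folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_selected, y, cv=5)

# Honest: scaling and selection are refit on the training part of every fold.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=20)),
        ("clf", LogisticRegression()),
    ]
)
honest = cross_val_score(pipe, X, y, cv=5)

print(f"leaky: {leaky.mean():.2f}, honest: {honest.mean():.2f}")
```

The leaky estimate looks far better than chance only because the selected features were chosen with knowledge of the test labels.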

### Embarrassing mistakes
* Understanding your domain.

### Formatting and styling

* `style.highlight_between()`:
    * Highlight negative values.
* `style.highlight_max()`, `style.highlight_min()`.


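```python
import numpy as np
import pandas as pd

# Build a 10 x 5 data frame: column A is 1..10, columns B-E are standard normal draws.
np.random.seed(24)
df = pd.DataFrame({"A": np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list("BCDE"))], axis=1)
# Insert two missing values.
df.iloc[3, 3] = np.nan
df.iloc[0, 2] = np.nan
```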

## New functions


We use several functions in `pandas`, `matplotlib`, and `seaborn` that haven't been covered, in addition to several `numpy` functions and some `statsmodels` functions.

| Name                            | Library               | Description                                                                          | Examples |
| ------------------------------- | --------------------- | ------------------------------------------------------------------------------------ | -------- |
| `.corr()`                       | `pandas`              | Correlation matrix.                                                                  |          |
| `.dtype`                        | `pandas`              | Data type of a series.                                                               |          |
| `.min()` / `.max()`             | `pandas` / `numpy`    | Find the minimal / maximal value.                                                    |          |
| `.argmin()` / `.argmax()`       | `pandas` / `numpy`    | Find the position of the minimal / maximal value.                                    |          |
| `.nonzero()`                    | `numpy`               | Indices of the non-zero elements.                                                    |          |
| `.all()` / `.any()`             | `pandas` / `numpy`    | Check whether all / any elements are `True`.                                         |          |
| `.replace()`                    | `pandas`              | Replace values in data frame or series. Often used to replace strings with numbers.  |          |
| `.to_numpy()`                   | `pandas`              | Convert a data frame or series to a `numpy` array.                                   |          |
| `sns.lmplot()`                  | `seaborn`             | Scatter plot with a fitted regression line.                                          |          |
| `.style.format()`               | `pandas`              | Format data frames for visual inspection.                                            |          |
| `.std()` / `.var()`             | `pandas` / `numpy`    | Standard deviation and variance.                                                     |          |
| `.mean()` / `.median()`         | `pandas`              | Mean and median.                                                                     |          |
| `.style.background_gradient()`  | `pandas`              | Styling of data frames with colors. Important for correlation matrices.              |          |
| `.query()`                      | `pandas`              | Select rows from a data set using a specialized query syntax.                        |          |
| `sns.boxplot()`                 | `seaborn`             | Box plot.                                                                            |          |
| `sns.violinplot()`              | `seaborn`             | Violin plot.                                                                         |          |
| `.sort_index()`                 | `pandas`              | Sort by index labels.                                                                |          |
| `np.log()`                      | `numpy`               | Natural logarithm.                                                                   |          |
| `np.log1p()`                    | `numpy`               | Natural logarithm of $1+x$.                                                          |          |
| `.sample()`                     | `pandas`              | Draw a random sample of rows.                                                        |          |
| `.shape`                        | `pandas` / `numpy`    | Dimensions of a data frame or array.                                                 |          |
| `.describe(include='object')`   | `pandas`              | Summary statistics for string (object) columns.                                      |          |
| `.nunique()` / `.unique()`      | `pandas`              | Number of distinct values / the distinct values themselves.                          |          |
| `sns.catplot()`                 | `seaborn`             | Figure-level categorical plots.                                                      |          |
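
A few of the functions in the table are shown together below. This is a small sketch on hypothetical data; the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; smoker is coded 0 / 1.
df = pd.DataFrame(
    {
        "income": [0.0, 10.0, 100.0, 1000.0],
        "age": [20, 30, 40, 50],
        "smoker": [0, 1, 0, 1],
    }
)

df["log_income"] = np.log1p(df["income"])    # log(1 + x) is defined at x = 0

corr = df.corr()                             # pairwise correlation matrix
rich_smokers = df.query("income > 50 and smoker == 1")

oldest = df["age"].argmax()                  # position of the largest age
n_levels = df["smoker"].nunique()            # number of distinct values
all_adults = (df["age"] >= 18).all()         # True if every element is True
any_zero_income = (df["income"] == 0).any()  # True if at least one element is True

arr = df.to_numpy()                          # plain numpy array
print(df.shape, arr.shape)
```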


## Functions in `sklearn`
[`sklearn`](https://scikit-learn.org/stable/index.html), also known as scikit-learn, is the most important Python library for fitting machine learning models. It has strong support for basic functionality such as metrics, cross-validation, fitting algorithms, and feature manipulation. That said, it is a big library, and we only use parts of it.

The following is a complete list of `sklearn` functions used in this course, and we expect some familiarity with all of them. No exercise requires you to use any other function than those on this list, and all of them have been covered in the lectures. In the same way, the exercises on the exam will only use functions on this list.



### Model selection and evaluation

| Name                                                                                                                                                        | Module            | Usage                                                                                 | Examples |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|---------------------------------------------------------------------------------------|----------|
| [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) | `model_selection` | Cross-validation for a specified score.                                                           |          |
| [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)          | `model_selection` | Cross-validation over a grid of values. Used to find the hyperparameters for a model.                                |          |
| [`confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)                                               | `metrics`         | Calculate the confusion matrix of a classifier.                                       |          |
| [`r2_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)                                      | `metrics`         | The $R^2$ score. Default score for most regression methods. Usually available through `reg.score()`.                           |          |
| [`mean_absolute_error`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error)     | `metrics`         | The mean absolute error score in regression, $\text{MAE}=\frac{1}{n}\sum_{i=1}^n\lvert\hat{y}_i-y_i\rvert$. |          |
| [`mean_squared_error`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)        | `metrics`         | The mean squared error score in regression, $\text{MSE}=\frac{1}{n}\sum_{i=1}^n(\hat{y}_i-y_i)^2$. |          |
| [`log_loss`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss)                                      | `metrics`         | The log loss, or cross-entropy loss, for classification.                              |          |
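| [`auc`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc)                                                     | `metrics`         | The area under a curve, computed with the trapezoidal rule. Applied to the points of a ROC curve, it gives the area under the ROC curve used in classification. |          |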
| [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) | `metrics` | Accuracy classification score. Default score for most classification methods. Usually available through `clf.score()`. | |
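
The sketch below shows how the model selection tools and regression metrics might fit together. The data are simulated with `make_regression`, and `train_test_split` is used for the hold-out split; both are assumptions made only to get a self-contained example, and the grid of `alpha` values is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Simulated regression data, split into training and test sets.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validated R^2 for one fixed hyperparameter value.
scores = cross_val_score(Ridge(alpha=1.0), X_train, y_train, cv=5, scoring="r2")

# Cross-validation over a grid of penalties; the best model is refit on all training data.
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Evaluate the selected model on the held-out test set.
y_pred = grid.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)          # same value as grid.score(X_test, y_test)

print(grid.best_params_, f"MAE={mae:.2f}, MSE={mse:.2f}, R2={r2:.3f}")
```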

### Visualization

| Name                                                                                                                                                             | Module                   | Usage                                                                     | Examples |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|---------------------------------------------------------------------------|----------|
| [`DecisionBoundaryDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html)                                   | `inspection`             | Plot the decision boundary for a classifier taking two features as input. |          |
| [`RocCurveDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html#sklearn.metrics.RocCurveDisplay)                      | `metrics`                | Display the ROC curve of a classifier.                                    |          |
| [`ConfusionMatrixDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay) | `metrics`                | Display the confusion matrix of a classifier.                             |          |
| [`plot_tree`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree) | `tree` | Display a fitted decision tree.                             |          |

### Models

| Function                                                                                                                                                                                                                                                                                                                          | Module         | Usage                                            | Examples |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|--------------------------------------------------|----------|
| [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) / [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV)                                                                               | `linear_model` | Fit a LASSO model with or without CV.            |          |
| [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) / [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV)                                                                               | `linear_model` | Fit a Ridge model with or without CV.            |          |
| [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) / [ElasticNetCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html#sklearn.linear_model.ElasticNetCV)                                                 | `linear_model` | Fit an elastic net model with or without CV.     |          |
| [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)                                                                                                                                                                            | `linear_model` | Fit a linear regression model.                   |          |
| [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) / [LogisticRegressionCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV) | `linear_model` | Fit a logistic regression model.                 |          |
| [LinearDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis)                                                                                                                            | `discriminant_analysis` | Fit an LDA model.                                |          |
| [QuadraticDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html#sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis)                                                                                                                   | `discriminant_analysis` | Fit a QDA model.                                 |          |
| [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)                                                                                                                                                                                                | `naive_bayes`  | Fit a Gaussian Naive Bayes model.                |          |
| [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) / [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)                  | `tree`         | Fit a decision tree for classification or regression. |          |
| [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) / [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)          | `neighbors`    | Fit a $k$ nearest neighbors classifier or regressor. |          |
| [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier) / [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)          | `ensemble`    | Gradient boosting for regression or classification.              |          |
| [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) / [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)          | `ensemble`    | Random forests for regression or classification.              |          |
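
All models in the table share the same interface (`fit`, `predict`, `score`), so they can be swapped freely. As an illustration, the sketch below compares a few classifiers with 5-fold cross-validation. The iris data set (loaded with `load_iris`) and the hyperparameter values are assumptions chosen for convenience.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every estimator exposes the same fit / predict / score interface.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean cross-validated accuracy for each model.
results = {
    name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```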

### Preprocessing

| Function                                                                                                                                                                                    | Module              | Usage                                                      | Examples |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|------------------------------------------------------------|----------|
| [`FeatureUnion`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)                                                                                      | `pipeline`          | Merge two or more sets of features.                        |          |
| [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)                                                                                        | `impute`            | Impute missing data using descriptive statistics.          |          |
| [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)                                                                                              | `pipeline`          | Chain methods into a pipeline.                             |          |
| [SequentialFeatureSelector](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector) | `feature_selection` | Forward and backward selection of features.                |          |
| [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer)                                             | `compose`           | Apply different transformers to different feature columns. |          |
| [TransformedTargetRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html#sklearn.compose.TransformedTargetRegressor)                  | `compose`           | Transform target.                                          |          |
| [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer)                           | `preprocessing`     | Transform features using custom function.                  |          |
| [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)                                          | `preprocessing`     | Scale features to unit variance and zero mean.             |          |
| [SplineTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.SplineTransformer.html#sklearn.preprocessing.SplineTransformer)                                 | `preprocessing`     | Construct B-splines.                                       |          |
| [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)                                             | `preprocessing`     | Encode categorical variables as one-hot (dummy) features.  |          |
| [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures)                              | `preprocessing`     | Generate polynomial and interaction features.              |          |
| [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer)                                    | `preprocessing`     | Bin continuous features into intervals (step functions).   |          |
| [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)                                                                           | `decomposition`     | Construct new features using principal component analysis. |          |
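
Several of the preprocessing tools are typically combined into one model object. The following is a minimal sketch on a tiny made-up data set (the column names `age`, `income`, and `city` are assumptions): numeric columns are imputed and scaled, the categorical column is one-hot encoded, and the result is fed to a classifier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with missing values and one categorical feature.
X = pd.DataFrame(
    {
        "age": [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
        "income": [20.0, 45.0, 38.0, np.nan, 52.0, 31.0],
        "city": ["Oslo", "Bergen", "Oslo", "Bergen", "Oslo", "Bergen"],
    }
)
y = np.array([0, 1, 0, 1, 1, 0])

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)

# Different transformers for different columns.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# Preprocessing and classifier chained into a single estimator.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)

# Two scaled numeric columns plus two one-hot columns.
features = model.named_steps["preprocess"].transform(X)
predictions = model.predict(X)
```

Because the whole chain is one estimator, it can be passed directly to `cross_val_score` or `GridSearchCV`, which keeps the preprocessing inside each cross-validation fold.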