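This project aims to implement a scikit-learn-like mini machine learning library using numpy. The relevant theory is explained in accompanying blog posts, and for some features there are also notebooks that analyze the implementation details. The project was started to give people learning these algorithms a one-stop path from theory to implementation.
Because my knowledge is limited and I have no prior Python development experience, the library is currently a rather loose collection of code. If you find any mistake or bug in the blog posts, notebooks, or code, or even just a passage that reads awkwardly, feel free to contact me or open an issue on the project page. Thank you.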
QQ: 435248055 | WeChat: QQ435248055 | Blog
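Click an algorithm name to open the corresponding blog post on its theory. The notebook walks through implementing the algorithm step by step, and code is the modular source file.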
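Note: unless stated otherwise, every model accepts data as `numpy.ndarray`; some also accept a `List` or a nested `List`. Other data formats are not guaranteed to work. At the time of writing, Python type hints did not support numpy, so the code does not annotate these types (thanks to WeChat user @Stream for pointing this out).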
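
The input convention above can be sketched as follows. This is a minimal illustration, not code from this library: the helper name `as_ndarray` is hypothetical, and the `numpy.typing.ArrayLike` annotation assumes NumPy 1.20 or later, where the `numpy.typing` module is available.

```python
import numpy as np
from numpy.typing import ArrayLike  # assumption: NumPy >= 1.20


def as_ndarray(X: ArrayLike) -> np.ndarray:
    """Hypothetical helper: normalize ndarray / list / nested list input."""
    # np.asarray avoids a copy when X is already a float64 ndarray.
    arr = np.asarray(X, dtype=np.float64)
    if arr.ndim == 0:
        raise ValueError("expected an array-like with at least one dimension")
    return arr


X_list = [[1, 2], [3, 4]]   # nested List input
X_arr = as_ndarray(X_list)  # converted to a 2-D numpy.ndarray
```

Calling such a helper at the top of `fit` and `predict` would let a model accept lists while working on `numpy.ndarray` internally.
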
| Class | Algorithm | Implementation | Code |
|---|---|---|---|
| Generalized Linear Models | Linear Regression | notebook | code |
| | Logistic Regression | notebook | code |
| Nearest Neighbors | Nearest Neighbors Classification | notebook | code |
| Naive Bayes | Gaussian Naive Bayes | notebook | code |
| Support Vector Machine | SVC | notebook | code |
| Decision Trees | ID3 Classification | notebook | code |
| | ID3 Regression | notebook | code |
| | CART Classification | notebook | code |
| | CART Regression | notebook | code |
| Ensemble methods | Random Forests Classification | notebook | code |
| | Random Forests Regression | notebook | code |
| | AdaBoosting Classification | notebook | code |
| Class | Algorithm | Implementation | Code |
|---|---|---|---|
| Gaussian mixture models | Gaussian Mixture | notebook | code |
| Clustering | K-means | notebook | code |
| | DBSCAN | notebook | code |
| Association Rules | Apriori | notebook | |
| Collaborative Filtering | User-based | notebook | |
| | Item-based | notebook | |
| | LFM | notebook | |
| Class | Approach | Code |
|---|---|---|
| Model Selection | Dataset Split | code |
| | K-Fold | code |
| | Stratified K-Fold | code |
| Metrics | Accuracy | code |
| | Log loss | code |
| | F1-score | code |
| | AUC | code |
| | Explained Variance | code |
| | Mean Absolute Error | code |
| | Mean Squared Error | code |
| | R Square | code |
| | Euclidean Distances | code |
| Class | Algorithm | Implementation | Code |
|---|---|---|---|
| Feature Scaling | StandardScaler | | code |
| | MinMaxScaler | | code |
| Unsupervised dimensionality reduction | PCA | notebook | code |
| | SVD | notebook | code |
| Supervised dimensionality reduction | Linear Discriminant Analysis | notebook | code |
| Text Feature | Count Feature | | code |
| | TF-IDF | | code |
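- Code reuse across the library is low.
- Random forest is not parallelized.
- The LDA code is missing some functionality.
- The K-Fold code uses `np.append()`, which is inefficient.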
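
On the `np.append()` point: each call copies the whole array, so appending inside a loop does quadratic work. Below is a minimal sketch of one alternative, not this library's actual K-Fold code. It splits an index array once with `np.array_split` and joins the training folds with a single `np.concatenate`. The function name `k_fold_indices` and its parameters are hypothetical.

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=5, shuffle=False, seed=None):
    """Yield (train_idx, test_idx) pairs without growing arrays in a loop."""
    if n_splits < 2 or n_splits > n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = np.arange(n_samples)
    if shuffle:
        # Shuffle in place with a seeded generator for reproducibility.
        np.random.default_rng(seed).shuffle(indices)
    # One split up front; fold sizes differ by at most one sample.
    folds = np.array_split(indices, n_splits)
    for i, test_idx in enumerate(folds):
        # A single concatenate per fold instead of repeated np.append calls.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(10, n_splits=3):
    print(len(train_idx), len(test_idx))
```

Yielding index arrays instead of data copies keeps the splitter independent of the shape of `X` and `y`; callers index with `X[train_idx]` and `X[test_idx]`.
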