Models built while progressing through IBM open courseware, along with other personal projects.
Course info can be found here.
- Student's Final Grade Predictor
Uses Student Performance Data Set from UCI ML Repository.
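A minimal sketch of how such a predictor could look, assuming the semicolon-separated student-mat.csv file from that data set, a plain linear regression on the numeric columns, and G3 (the final grade) as the target; the actual notebook may use a different model or feature set:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# UCI Student Performance data is semicolon-separated; G3 is the final grade.
df = pd.read_csv("student-mat.csv", sep=";")

# Keep only the numeric columns for this sketch and predict G3 from the rest.
num = df.select_dtypes("number")
X, y = num.drop(columns="G3"), num["G3"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on the held-out split:", r2_score(y_test, model.predict(X_test)))
```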
- GDP Predictor
Uses yearly GDP data. Can be obtained from various sources such as The World Bank.
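One possible approach, sketched below, is a non-linear (sigmoid) curve fit of GDP against year with scipy's curve_fit; the file name, column names and model choice are assumptions, not necessarily what the notebook does:

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def sigmoid(x, beta1, beta2):
    # Logistic growth curve, a common shape for long-run GDP trends.
    return 1.0 / (1.0 + np.exp(-beta1 * (x - beta2)))

df = pd.read_csv("gdp_yearly.csv")              # hypothetical columns: Year, Value

# Normalize both axes so the optimizer converges easily.
x = df["Year"].to_numpy() / df["Year"].max()
y = df["Value"].to_numpy() / df["Value"].max()

popt, _ = curve_fit(sigmoid, x, y)
print("Fitted parameters (beta1, beta2):", popt)
print("Normalized GDP predicted for the final year:", sigmoid(x[-1], *popt))
```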
- Iris Species Decision Tree
Uses the Iris dataset.
Model Accuracy: 0.977
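A minimal sketch of the workflow (file name, columns and split size are assumptions; the 0.977 score above comes from the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = pd.read_csv("Iris.csv")                               # assumed CSV with a Species column
X = iris.drop(columns=["Id", "Species"], errors="ignore")    # feature columns only
y = iris["Species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```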
- Iris Species Identification using KNN
Uses the Iris dataset.
The best accuracy was 1.0, obtained with k = 8.
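A sketch of the k sweep (the split and the k range are assumptions; the notebook's best result is quoted above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = pd.read_csv("Iris.csv")
X = iris.drop(columns=["Id", "Species"], errors="ignore")
y = iris["Species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Try a range of k values and keep the one with the best test accuracy.
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, knn.predict(X_test))

best_k = max(scores, key=scores.get)
print(f"Best k = {best_k} with accuracy {scores[best_k]:.3f}")
```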
- Iris Species Identification using Logistic Regression
Uses the Iris dataset.
Prediction using Raw Data
Logloss = 0.863
Classification Report

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-setosa | 1.00 | 1.00 | 1.00 | 11 |
| Iris-versicolor | 0.00 | 0.00 | 0.00 | 13 |
| Iris-virginica | 0.32 | 1.00 | 0.48 | 6 |
| accuracy | | | 0.57 | 30 |
| macro avg | 0.44 | 0.67 | 0.49 | 30 |
| weighted avg | 0.43 | 0.57 | 0.46 | 30 |
Prediction using Normalized Data
Logloss = 0.855
Classification Report

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-setosa | 1.00 | 1.00 | 1.00 | 11 |
| Iris-versicolor | 1.00 | 0.23 | 0.38 | 13 |
| Iris-virginica | 0.38 | 1.00 | 0.55 | 6 |
| accuracy | | | 0.67 | 30 |
| macro avg | 0.79 | 0.74 | 0.64 | 30 |
| weighted avg | 0.88 | 0.67 | 0.64 | 30 |
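A sketch of the raw-versus-normalized comparison, assuming StandardScaler for the normalization step (the original notebook may scale differently, which would change the scores):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, classification_report

iris = pd.read_csv("Iris.csv")
X = iris.drop(columns=["Id", "Species"], errors="ignore")
y = iris["Species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def fit_and_report(X_tr, X_te, label):
    clf = LogisticRegression(max_iter=200).fit(X_tr, y_train)
    print(label, "log loss:", log_loss(y_test, clf.predict_proba(X_te), labels=clf.classes_))
    print(classification_report(y_test, clf.predict(X_te), zero_division=0))

fit_and_report(X_train, X_test, "Raw")                        # unscaled features

scaler = StandardScaler().fit(X_train)                        # fit on the training split only
fit_and_report(scaler.transform(X_train), scaler.transform(X_test), "Normalized")
```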
- Iris Species Identification using various SVM kernels
Uses the Iris dataset.
Best accuracy was obtained with the linear kernel (accuracy score 0.85).
Linear Classification Report

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-Versicolor | 0.82 | 0.90 | 0.86 | 10 |
| Iris-Virginica | 0.89 | 0.80 | 0.84 | 10 |
| accuracy | | | 0.85 | 20 |
| macro avg | 0.85 | 0.85 | 0.85 | 20 |
| weighted avg | 0.85 | 0.85 | 0.85 | 20 |
Poly Classification Report

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-Versicolor | 0.75 | 0.90 | 0.82 | 10 |
| Iris-Virginica | 0.88 | 0.70 | 0.78 | 10 |
| accuracy | | | 0.80 | 20 |
| macro avg | 0.81 | 0.80 | 0.80 | 20 |
| weighted avg | 0.81 | 0.80 | 0.80 | 20 |
RBF Classification Report

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-Versicolor | 0.75 | 0.90 | 0.82 | 10 |
| Iris-Virginica | 0.88 | 0.70 | 0.78 | 10 |
| accuracy | | | 0.80 | 20 |
| macro avg | 0.81 | 0.80 | 0.80 | 20 |
| weighted avg | 0.81 | 0.80 | 0.80 | 20 |
Sigmoid Classification Report

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-Versicolor | 0.50 | 1.00 | 0.67 | 10 |
| Iris-Virginica | 0.00 | 0.00 | 0.00 | 10 |
| accuracy | | | 0.50 | 20 |
| macro avg | 0.25 | 0.50 | 0.33 | 20 |
| weighted avg | 0.25 | 0.50 | 0.33 | 20 |
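A sketch of looping over the four kernels; since the reports above only contain Versicolor and Virginica, the sketch assumes Setosa was dropped beforehand (file name and split are also assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

iris = pd.read_csv("Iris.csv")
iris = iris[iris["Species"] != "Iris-setosa"]                 # keep only the two harder classes

X = iris.drop(columns=["Id", "Species"], errors="ignore")
y = iris["Species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(kernel, "accuracy:", accuracy_score(y_test, pred))
    print(classification_report(y_test, pred, zero_division=0))
```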
- Spam/Ham Classification (.py, .ipynb)
Uses SMS Spam Collection Data Set from UCI (Original source, dataset raw file)
Wordclouds
Evaluation: Classification Report, Log loss, Matthews Correlation Coefficient and Confusion Matrix

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| ham | 0.97 | 1.00 | 0.99 | 976 |
| spam | 0.99 | 0.81 | 0.89 | 139 |
| accuracy | | | 0.98 | 1115 |
| macro avg | 0.98 | 0.91 | 0.94 | 1115 |
| weighted avg | 0.98 | 0.98 | 0.97 | 1115 |
Log loss: 0.836
Matthews Correlation Coefficient: 0.885
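A sketch of one possible pipeline, assuming a CountVectorizer bag-of-words plus Multinomial Naive Bayes (the model actually used in the notebook may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (classification_report, log_loss,
                             matthews_corrcoef, confusion_matrix)

# SMSSpamCollection is tab-separated: label ("ham"/"spam") then the message text.
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=1)

vec = CountVectorizer(stop_words="english")
clf = MultinomialNB().fit(vec.fit_transform(X_train), y_train)

pred = clf.predict(vec.transform(X_test))
proba = clf.predict_proba(vec.transform(X_test))
print(classification_report(y_test, pred))
print("Log loss:", log_loss(y_test, proba, labels=clf.classes_))
print("Matthews Correlation Coefficient:", matthews_corrcoef(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
```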
- Iris Species Clustering
Uses the Iris dataset.
k = 3 clusters gives the best Rand index score, 0.73.
This evaluation is possible because the original labels (the Species column) were retained as `label_true`, and `label_pred` was compared against `label_true` using the Rand index.
Optimization with the elbow method was also performed using both distortion and inertia; both confirm that k = 3 is the best number of clusters.
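A sketch of the K-Means evaluation loop, assuming sklearn's Rand-index scorers were used; it tracks inertia for the elbow step (the notebook also uses distortion):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, adjusted_rand_score

iris = pd.read_csv("Iris.csv")
X = iris.drop(columns=["Id", "Species"], errors="ignore")
label_true = iris["Species"]

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    label_pred = km.labels_
    inertias[k] = km.inertia_                       # collected for the elbow plot
    print(k, "Rand index:", round(rand_score(label_true, label_pred), 3),
          "adjusted:", round(adjusted_rand_score(label_true, label_pred), 3))

print("Inertia by k (look for the elbow):", inertias)
```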
- Compressing Image Color
Uses K-Means clustering to reduce the image's original color palette to a predefined number of clusters.
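A sketch of the K-Means color quantization idea (file names and the cluster count are placeholders):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

N_COLORS = 16                                   # target size of the reduced palette
img = np.asarray(Image.open("input.jpg").convert("RGB"))
h, w, _ = img.shape

# Cluster all pixels in RGB space, then replace each pixel by its cluster centre.
pixels = img.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=N_COLORS, n_init=10, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].astype(np.uint8).reshape(h, w, 3)

Image.fromarray(quantized).save("compressed.png")
```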
- Iris Species Clustering
Uses the Iris dataset.
`iris.groupby(['cluster_', 'Species'])[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]].mean()`

| cluster_ | Species | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
| --- | --- | --- | --- | --- | --- |
| 0 | Iris-setosa | 5.006000 | 3.418000 | 1.464000 | 0.244000 |
| 1 | Iris-versicolor | 6.700000 | 3.000000 | 5.000000 | 1.700000 |
| 1 | Iris-virginica | 6.893939 | 3.118182 | 5.806061 | 2.133333 |
| 2 | Iris-versicolor | 5.920408 | 2.765306 | 4.244898 | 1.318367 |
| 2 | Iris-virginica | 5.994118 | 2.694118 | 5.058824 | 1.817647 |
Evaluation using the Species column as ground truth:
Homogeneity Score: 0.744
Adjusted Mutual Info Score: 0.753
Normalized Mutual Info Score: 0.756
V-measure Score: 0.756
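The specific clustering algorithm is not stated above, so the sketch below uses agglomerative clustering with three clusters purely as a placeholder; it reproduces the groupby summary and the external scores:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (homogeneity_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score, v_measure_score)

iris = pd.read_csv("Iris.csv")
X = iris.drop(columns=["Id", "Species"], errors="ignore")
label_true = iris["Species"]
feature_cols = X.columns.tolist()

# Placeholder algorithm: the notebook's actual clustering method may differ.
iris["cluster_"] = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Per-cluster feature means, grouped by cluster label and true species.
print(iris.groupby(["cluster_", "Species"])[feature_cols].mean())

# External evaluation against the Species column.
label_pred = iris["cluster_"]
print("Homogeneity Score:", homogeneity_score(label_true, label_pred))
print("Adjusted Mutual Info Score:", adjusted_mutual_info_score(label_true, label_pred))
print("Normalized Mutual Info Score:", normalized_mutual_info_score(label_true, label_pred))
print("V-measure Score:", v_measure_score(label_true, label_pred))
```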
- Iris Species Clustering
Uses the Iris dataset.
Evaluation using the Species column as ground truth:
Estimated number of clusters: 2
Estimated number of noise points: 3
Homogeneity: 0.576
Completeness: 0.877
V-measure: 0.696
Adjusted Rand Index: 0.554
Adjusted Mutual Info: 0.690
Silhouette Coefficient: 0.555
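The noise-point count above suggests a density-based method such as DBSCAN; a sketch under that assumption, with placeholder eps/min_samples values and standardized features:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (homogeneity_score, completeness_score, v_measure_score,
                             adjusted_rand_score, adjusted_mutual_info_score,
                             silhouette_score)

iris = pd.read_csv("Iris.csv")
X = StandardScaler().fit_transform(iris.drop(columns=["Id", "Species"], errors="ignore"))
label_true = iris["Species"]

db = DBSCAN(eps=0.8, min_samples=5).fit(X)      # eps/min_samples are placeholder values
labels = db.labels_                             # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters:", n_clusters)
print("Estimated number of noise points:", int(np.sum(labels == -1)))
print("Homogeneity:", homogeneity_score(label_true, labels))
print("Completeness:", completeness_score(label_true, labels))
print("V-measure:", v_measure_score(label_true, labels))
print("Adjusted Rand Index:", adjusted_rand_score(label_true, labels))
print("Adjusted Mutual Info:", adjusted_mutual_info_score(label_true, labels))
print("Silhouette Coefficient:", silhouette_score(X, labels))
```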
- Recommending restaurants based on a user's past rating history
- The selected feature is the cuisine type
- Datasets as provided here. (Click here to navigate to original source)
- Future improvements:
  - using KNN to classify restaurants by cuisine type and using the result as ground truth for evaluation
  - incorporating other rating criteria to build a more solid user profile (only food_rating was considered in the existing model)
- Somehow, none of the recommended placeID values obtained from the model appear in the geospatial2.csv file provided by the source (which I had assumed contained all of the restaurant info). Still unsure if this is a bug.
- Output: a df containing the top-N recommended placeID values and their weighted recommendation scores for the specified userID, as in the example below.

    get_recommendation("U1138")

             Rcuisine  total_by_place
    placeID
    132774   7
    135099   6
    135098   4
    135103   4
    135097   4
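The exact aggregation is not shown here, so the code below is only a rough sketch of a cuisine-profile recommender over the UCI restaurant data; the file names (rating_final.csv, chefmozcuisine.csv) and this implementation of get_recommendation are assumptions:

```python
import pandas as pd

# UCI "Restaurant & consumer data": per-user food ratings and per-restaurant cuisine tags.
ratings = pd.read_csv("rating_final.csv")        # userID, placeID, rating, food_rating, ...
cuisine = pd.read_csv("chefmozcuisine.csv")      # placeID, Rcuisine

def get_recommendation(user_id, top_n=5):
    """Weight each cuisine by the user's past food_rating, then score unrated restaurants."""
    user = ratings[ratings["userID"] == user_id]

    # Build the user's cuisine profile from the cuisines of the places they have rated.
    profile = (user.merge(cuisine, on="placeID")
                   .groupby("Rcuisine")["food_rating"].sum())

    # Score every restaurant by the summed weight of its cuisines, dropping already-rated places.
    scores = (cuisine.assign(weight=cuisine["Rcuisine"].map(profile).fillna(0))
                     .groupby("placeID")["weight"].sum()
                     .drop(user["placeID"].unique(), errors="ignore")
                     .sort_values(ascending=False))
    return scores.head(top_n)

print(get_recommendation("U1138"))
```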
Full credit belongs to the original sources. Thank you, IBM, for providing free education.