ML models made while progressing through the Machine Learning with Python course by IBM & various personal projects

SarahHannes/ml

Models made while progressing through IBM open courseware and other personal projects.
Course info can be found here.

Table of Contents

Supervised Learning

Regression

Simple Linear Regression

Non-Linear Regression

Classification

Decision Tree

K-Nearest Neighbors

Logistic Regression


Prediction using Raw Data

Logloss = 0.863

Classification Report
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.00      0.00      0.00        13
 Iris-virginica       0.32      1.00      0.48         6

       accuracy                           0.57        30
      macro avg       0.44      0.67      0.49        30
   weighted avg       0.43      0.57      0.46        30

Confusion Matrix


Prediction using Normalized Data

Logloss = 0.855

Classification Report
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.23      0.38        13
 Iris-virginica       0.38      1.00      0.55         6

       accuracy                           0.67        30
      macro avg       0.79      0.74      0.64        30
   weighted avg       0.88      0.67      0.64        30

Confusion Matrix
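
The raw-versus-normalized comparison above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn's bundled iris data, a 20% hold-out (30 test rows), `StandardScaler` as the normalization step, and an arbitrary `random_state`, so the numbers it prints will not match the reports above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Assumption: a 20% hold-out, which leaves 30 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def fit_and_score(X_tr, X_te):
    """Fit a logistic regression and return its log loss on the test split."""
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    return log_loss(y_test, model.predict_proba(X_te), labels=model.classes_)


raw_loss = fit_and_score(X_train, X_test)

# Fit the scaler on the training split only, so no test statistics leak in.
scaler = StandardScaler().fit(X_train)
scaled_loss = fit_and_score(scaler.transform(X_train), scaler.transform(X_test))

print(f"log loss (raw): {raw_loss:.3f}")
print(f"log loss (standardized): {scaled_loss:.3f}")
```

Log loss is computed from predicted probabilities rather than hard labels, which is why `predict_proba` is used here instead of `predict`.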

Support Vector Machine

linear Classification Report

                 precision    recall  f1-score   support

Iris-Versicolor       0.82      0.90      0.86        10
 Iris-Virginica       0.89      0.80      0.84        10

       accuracy                           0.85        20
      macro avg       0.85      0.85      0.85        20
   weighted avg       0.85      0.85      0.85        20

poly Classification Report

                 precision    recall  f1-score   support

Iris-Versicolor       0.75      0.90      0.82        10
 Iris-Virginica       0.88      0.70      0.78        10

       accuracy                           0.80        20
      macro avg       0.81      0.80      0.80        20
   weighted avg       0.81      0.80      0.80        20

rbf Classification Report

                 precision    recall  f1-score   support

Iris-Versicolor       0.75      0.90      0.82        10
 Iris-Virginica       0.88      0.70      0.78        10

       accuracy                           0.80        20
      macro avg       0.81      0.80      0.80        20
   weighted avg       0.81      0.80      0.80        20

Sigmoid Classification Report

                 precision    recall  f1-score   support

Iris-Versicolor       0.50      1.00      0.67        10
 Iris-Virginica       0.00      0.00      0.00        10

       accuracy                           0.50        20
      macro avg       0.25      0.50      0.33        20
   weighted avg       0.25      0.50      0.33        20

Confusion Matrix
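
A kernel comparison like the one reported above could be set up along these lines. This is a sketch under assumptions: it uses scikit-learn's bundled iris data restricted to versicolor and virginica, a stratified 20% hold-out (20 test rows), and default `SVC` hyperparameters, none of which are confirmed by the source.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Keep only versicolor (label 1) and virginica (label 2).
mask = y != 0
X, y = X[mask], y[mask]

# Assumption: stratified 20% hold-out, giving 10 test rows per class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

accuracy = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    accuracy[kernel] = accuracy_score(y_test, clf.predict(X_test))

for kernel, score in accuracy.items():
    print(f"{kernel}: accuracy = {score:.2f}")
```

Swapping `accuracy_score` for `sklearn.metrics.classification_report` would produce per-class tables in the same layout as the ones above.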

Naive Bayes

  • Spam/ Ham Classification (py, ipynb)
    Uses SMS Spam Collection Data Set from UCI (Original source, dataset raw file)
    Wordclouds

    Evaluation: Classification Report, Log loss, Matthews Correlation Coefficient and Confusion Matrix
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99       976
        spam       0.99      0.81      0.89       139

    accuracy                           0.98      1115
   macro avg       0.98      0.91      0.94      1115
weighted avg       0.98      0.98      0.97      1115
Log loss: 0.836
Matthews Correlation Coefficient: 0.885

Confusion Matrix
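
For reference, the Matthews Correlation Coefficient can be computed directly from the four cells of a binary confusion matrix. The helper below is a plain-Python sketch; the function name `mcc_from_counts` and the example counts are hypothetical and are not taken from the notebook.

```python
from math import sqrt


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
print(mcc_from_counts(tp=5, tn=5, fp=0, fn=0))  # perfect agreement
print(mcc_from_counts(tp=0, tn=0, fp=5, fn=5))  # perfect disagreement
```

Unlike accuracy, this score stays informative when the classes are imbalanced, as they are in the ham/spam split above. With label arrays at hand, `sklearn.metrics.matthews_corrcoef(y_true, y_pred)` computes the same quantity.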

Unsupervised Learning

Clustering

K-Means - Iris clustering

  • Iris Species Clustering
    Uses iris dataset
    k = 3 clusters gives the best Rand Index score at 0.73
    This evaluation method is possible because the original label (Species column) was retained as label_true, and label_pred was compared against label_true using the Rand index.

Optimization using the elbow method was also performed, using both distortion and inertia.
Both metrics confirm that the best number of clusters is k = 3.
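
The elbow search and the label-based evaluation described above might look roughly like this. It is a sketch under assumptions: raw iris features from scikit-learn, `n_init=10`, a fixed `random_state`, and distortion defined as the mean distance from each point to its nearest centroid. The source does not say whether the plain or the adjusted Rand index was used, so both are computed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, rand_score

X, label_true = load_iris(return_X_y=True)

inertia = {}
distortion = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia: sum of squared distances to the nearest centroid.
    inertia[k] = km.inertia_
    # Distortion (assumed definition): mean distance to the nearest centroid.
    distortion[k] = km.transform(X).min(axis=1).mean()

# Compare predicted clusters with the retained Species labels for k = 3.
label_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ri = rand_score(label_true, label_pred)
ari = adjusted_rand_score(label_true, label_pred)

for k in inertia:
    print(f"k={k}: inertia={inertia[k]:.1f}, distortion={distortion[k]:.3f}")
print(f"Rand index: {ri:.2f}, adjusted Rand index: {ari:.2f}")
```

Plotting `inertia` or `distortion` against k and looking for the point where the curve flattens gives the elbow.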

K-Means - Compressing Image Color

Agglomerative Hierarchical

iris.groupby(['cluster_', 'Species'])[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]].mean()

                           SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
cluster_  Species
0         Iris-setosa           5.006000      3.418000       1.464000      0.244000
1         Iris-versicolor       6.700000      3.000000       5.000000      1.700000
          Iris-virginica        6.893939      3.118182       5.806061      2.133333
2         Iris-versicolor       5.920408      2.765306       4.244898      1.318367
          Iris-virginica        5.994118      2.694118       5.058824      1.817647

Evaluation using Species column as ground truth:
Homogeneity Score: 0.744
Adjusted Mutual Info Score: 0.753
Normalized Mutual Info Score: 0.756
V-measure Score: 0.756
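
The four scores above can be obtained from scikit-learn's metrics module. The sketch below assumes the bundled iris data, three clusters, and the default Ward linkage; the linkage and preprocessing used in the notebook are not stated, so the printed values may differ.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import (
    adjusted_mutual_info_score,
    homogeneity_score,
    normalized_mutual_info_score,
    v_measure_score,
)

X, species = load_iris(return_X_y=True)

# Assumption: 3 clusters with the default (Ward) linkage.
clusters = AgglomerativeClustering(n_clusters=3).fit_predict(X)

scores = {
    "homogeneity": homogeneity_score(species, clusters),
    "adjusted_mutual_info": adjusted_mutual_info_score(species, clusters),
    "normalized_mutual_info": normalized_mutual_info_score(species, clusters),
    "v_measure": v_measure_score(species, clusters),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With scikit-learn's default arithmetic averaging, normalized mutual information is identical to the V-measure, which is consistent with the two equal values reported above.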

DBSCAN

Evaluation using Species column as ground truth:
Estimated number of clusters: 2
Estimated number of noise points: 3
Homogeneity: 0.576
Completeness: 0.877
V-measure: 0.696
Adjusted Rand Index: 0.554
Adjusted Mutual Info: 0.690
Silhouette Coefficient: 0.555
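
The cluster and noise counts above come from DBSCAN's label array, in which noise points are labelled -1. A minimal sketch follows; the standardized features, `eps=0.5`, and `min_samples=5` are assumptions rather than the notebook's settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import completeness_score, homogeneity_score
from sklearn.preprocessing import StandardScaler

X, species = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Assumed hyperparameters; tune eps and min_samples for real use.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# Noise points carry the label -1 and do not count as a cluster.
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f"Estimated number of clusters: {n_clusters}")
print(f"Estimated number of noise points: {n_noise}")
print(f"Homogeneity: {homogeneity_score(species, labels):.3f}")
print(f"Completeness: {completeness_score(species, labels):.3f}")
```

Because the number of clusters is discovered rather than specified, it need not match the three species.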

Recommender Systems

Content-based

  • Recommending restaurants based on the user's past rating history
  • Selected feature is the cuisine type
  • Datasets as provided here. (Click here to navigate to original source)
  • Future improvement:
    • using KNN to classify restaurants by cuisine type and using the result as ground truth for evaluation
    • incorporating other rating criteria to build a more solid user profile (only food_rating is considered in the current model)
    • None of the recommended placeIDs returned by the model appear in the geospatial2.csv file provided by the source (which I had assumed contained information on all of the restaurants). It is still unclear whether this is a bug.
  • Output: df containing the topN recommended placeIDs and their weighted recommendation scores for the specified userID
get_recommendation("U1138")
Rcuisine 	total_by_place
placeID 	
132774 	7
135099 	6
135098 	4
135103 	4
135097 	4
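
One way to build a cuisine-based recommender of this kind is sketched below. It is an illustration only: the toy tables, the `recommend` function, and the scoring rule (a rating-weighted cuisine profile dotted with each unrated place's cuisines) are assumptions and may differ from how `get_recommendation` is implemented in the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per (placeID, cuisine) pair.
place_cuisine = pd.DataFrame({
    "placeID": [1, 1, 2, 3, 4, 4, 5],
    "Rcuisine": ["Mexican", "Bar", "Italian", "Mexican",
                 "Bar", "Mexican", "Italian"],
})

# Hypothetical ratings given by a single user.
user_ratings = pd.DataFrame({"placeID": [1, 2], "food_rating": [2, 1]})


def recommend(user_ratings, place_cuisine, top_n=5):
    """Return the top_n unrated places ranked by cuisine-profile score."""
    # One-hot cuisine matrix: rows are places, columns are cuisine types.
    onehot = pd.crosstab(place_cuisine["placeID"], place_cuisine["Rcuisine"])
    rated = user_ratings.set_index("placeID")["food_rating"]
    # User profile: rating-weighted sum of the cuisines of rated places.
    profile = onehot.loc[rated.index].mul(rated, axis=0).sum()
    # Score each unrated place by how strongly it matches the profile.
    candidates = onehot.drop(index=rated.index)
    scores = candidates.dot(profile).sort_values(ascending=False)
    return scores.head(top_n)


top = recommend(user_ratings, place_cuisine)
print(top)
```

In this toy example place 4 ranks first because it serves both cuisines the user rated most highly.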

Full credit belongs to its source. Thank you IBM for providing free education.
