### COSC-247 Ensemble Assignment

Mia (Bomi) Jung

Includes code from *Python Machine Learning 3rd Edition* by Sebastian Raschka, Packt Publishing Ltd. 2019

Prepared for use in COSC-247 Machine Learning at Amherst College, Fall 2022, by Lee Spector (lspector@amherst.edu).

### Description of Data
The following dataset includes details of various applications scraped directly from Google Play.

It includes the name of each app, its category, its overall user rating, its number of user reviews, its number of user downloads/installs (all as of the time of scraping), and more.

The data was obtained from this web page:
https://www.kaggle.com/datasets/lava18/google-play-store-apps?resource=download

In [1]:
import pandas as pd

# Load the dataset from a local CSV file in the working directory.
path = 'googleplaystore.csv'

try:
    data = pd.read_csv(path, encoding='utf-8')
except FileNotFoundError:
    # Report which path was tried, then re-raise so the failure is visible.
    print('Could not find local file:', path)
    raise

data.tail()

Unnamed: 0,House party - live chat,DATING,1,1.1,9.2M,10.00,0,Mature 17+,Dating,31-Jul-18,3.52,4.0.3 and up
9360,Master E.K,FAMILY,5.0,90,Varies with device,1000.0,0,Everyone,Education,11-Aug-17,1.5.0,4.4 and up
9361,Barisal University App-BU Face,FAMILY,5.0,100,10M,1000.0,0,Everyone,Education,6-May-18,3.1.1,4.0.3 and up
9362,Oración CX,LIFESTYLE,5.0,103,3.8M,5000.0,0,Everyone,Lifestyle,12-Sep-17,5.1.10,4.1 and up
9363,"FD Calculator (EMI, SIP, RD & Loan Eligilibility)",FINANCE,5.0,104,2.3M,1000.0,0,Everyone,Finance,7-Aug-18,2.1.0,4.1 and up
9364,Ríos de Fe,LIFESTYLE,5.0,141,15M,1000.0,0,Everyone,Lifestyle,24-Mar-18,1.8,4.1 and up


### Description of Classification task

I will be training machine learning models to classify the data into two categories: apps that have fewer than 1 million downloads (category 1), and apps that have 1 million or more downloads (category 2). In the code, these categories are labeled 0 and 1, respectively.

The models will use two features, the user rating and the number of user reviews, to predict whether a given app falls into category 1 or 2.

Since this is real-world data, some pre-processing had to be done, including removing thousands separators (commas) and + signs from the number of downloads, converting strings to floats, and removing entries that had NULL values for user ratings.
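
The pre-processing steps described above could look roughly like the following. This is only a minimal sketch on made-up rows, not the code that was actually used to prepare the file; the helper `parse_installs` and the sample values are hypothetical.

```python
def parse_installs(text):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(text.replace(",", "").replace("+", ""))

# Hypothetical raw rows: (rating, reviews, installs); None marks a missing rating.
raw_rows = [
    (4.1, "159", "10,000+"),
    (None, "967", "500,000+"),
    (4.7, "87510", "5,000,000+"),
]

# Keep only rows that have a rating, and convert the text fields to floats.
cleaned_rows = [
    (rating, float(reviews), parse_installs(installs))
    for rating, reviews, installs in raw_rows
    if rating is not None
]
print(cleaned_rows)
```

With a pandas DataFrame, the same helper could be applied to a whole column, for example with `Series.map`.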

In [2]:
import numpy as np
# Extract the number of installations from the data; note that the values may be stored as strings.
yStr = data.iloc[0:10841,5].values
# Convert the number of installations from String to float.
yNums = [float(numString) for numString in yStr]

# Apps that have 1,000,000 (1 million) installs or more will be classified as 1. Otherwise, 0.
y=[]
for num in yNums:
    if (num>=1000000):
        y.append(1)
    else:
        y.append(0)
        
# User rating and reviews (in columns 2 and 3) will be used as the predictor variables.
X = data.iloc[0:, [2, 3]].values 

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

### Descriptions of models

For my pipelines, I first used a logistic regression model, which is a linear classification model that uses the sigmoid function (an activation function whose output is the probability of an example belonging to class 1) to make predictions.
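
To make the role of the sigmoid function concrete, here is a small sketch of how a logistic regression model turns a weighted sum of features into a class prediction. The weights, bias, and input values are made up for illustration and are not the ones learned by the pipeline.

```python
import numpy as np

def sigmoid(z):
    """Map a net input z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for two standardized features (rating, reviews).
w = np.array([0.5, 2.0])
b = -0.25
x = np.array([0.1, 1.2])

z = np.dot(w, x) + b      # net input: a linear combination of the features
p = sigmoid(z)            # estimated probability that x belongs to class 1
label = int(p >= 0.5)     # predict class 1 when the probability is at least 0.5

print("net input:", z, "probability:", p, "label:", label)
```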

I also used a Decision Tree Classifier, which is a model that breaks down the data by making a decision based on a series of questions. Based on features of the training data, this model learns a series of questions to infer the class labels of the examples, and outputs axis-parallel decision boundaries.
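
One way a tree can choose its "questions" is to pick the split with the largest information gain, measured here with entropy (the criterion used in my pipeline). The sketch below evaluates a single candidate split on toy data; the data, the threshold, and the helper names are assumptions for illustration only.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical toy data: number of reviews and the class label of each app.
reviews = np.array([10, 50, 200, 5000, 80000, 300000])
labels = np.array([0, 0, 0, 1, 1, 1])

# Candidate question: "does the app have more than 1000 reviews?"
threshold = 1000
left = labels[reviews <= threshold]
right = labels[reviews > threshold]

gain = information_gain(labels, left, right)
print("information gain:", gain)
```

Because each question compares a single feature against a threshold, the resulting decision boundaries are parallel to the feature axes.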

The third model I used was the k-nearest neighbors (KNN) classifier, which simply memorizes the training data as its "learning" process (also known as "lazy learning"), and assigns a class label by majority vote among the k nearest neighbors of the example we want to classify.
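
The KNN prediction rule can be sketched in a few lines. This is a simplified illustration on toy data, not scikit-learn's implementation; the function `knn_predict` and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, p=2):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Minkowski distance of order p from x_new to every training example.
    dists = np.sum(np.abs(X_train - x_new) ** p, axis=1) ** (1.0 / p)
    # Indices of the k closest training examples.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with two well-separated classes.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.8, 2.0]), k=3, p=3))
print(knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3, p=3))
```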

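I used StandardScaler to standardize the features in the logistic regression and KNN pipelines (the logistic regression pipeline also applies PCA with two components). The decision tree pipeline uses the raw features, since decision trees do not require feature scaling.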
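
Standardization rescales each feature to zero mean and unit variance, using statistics estimated from the training data only. The sketch below does this by hand with NumPy on made-up values; as far as I know, StandardScaler performs the equivalent computation (using the population standard deviation) when it is fit inside a pipeline.

```python
import numpy as np

# Hypothetical toy features: (rating, number of reviews).
X_train_toy = np.array([[4.0, 100.0],
                        [4.5, 10000.0],
                        [3.5, 1000000.0]])
X_test_toy = np.array([[5.0, 500.0]])

# Estimate the per-feature mean and standard deviation on the training data only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)

# Apply the same transformation to both the training and the test data.
X_train_std = (X_train_toy - mu) / sigma
X_test_std = (X_test_toy - mu) / sigma

print(X_train_std)
print(X_test_std)
```

Scaling matters most for KNN here, because the number of reviews is orders of magnitude larger than the rating and would otherwise dominate the distance computation.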

In [4]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

pipe1 = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())

pipe2 = make_pipeline(DecisionTreeClassifier(max_depth=6,
                                             criterion='entropy',
                                             random_state=0))

pipe3 = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20,
                                                             p=3,
                                                             metric='minkowski'))

clf_labels = ['LogisticRegression', 'Decision tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, pipe2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

10-fold cross validation:

Accuracy: 0.77 Stdev: 0.014 [LogisticRegression]
Accuracy: 0.94 Stdev: 0.009 [Decision tree]
Accuracy: 0.94 Stdev: 0.011 [KNN]


### Description of the ensemble method

I am using the majority voting ensemble method, which combines my three classification algorithms
to build a stronger meta-classifier that balances out the individual classifiers' weaknesses.
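
The idea behind majority (hard) voting can be illustrated with a small sketch: each classifier predicts a label for every example, and the ensemble outputs the most frequent label. The predictions below are made up, and `majority_vote` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per example.
predictions = np.array([
    [0, 1, 1, 0, 1],   # e.g. logistic regression
    [0, 1, 0, 0, 1],   # e.g. decision tree
    [1, 1, 0, 0, 0],   # e.g. KNN
])

def majority_vote(preds):
    """Return, for each example (column), the label predicted most often."""
    return np.apply_along_axis(
        lambda col: np.bincount(col, minlength=2).argmax(),
        axis=0,
        arr=preds,
    )

ensemble = majority_vote(predictions)
print(ensemble)
```

With three classifiers and two classes, ties cannot occur, so the ensemble label is always the one chosen by at least two of the three models.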

In [5]:
from sklearn.ensemble import VotingClassifier

mv_clf = VotingClassifier(estimators=[('p', pipe1), ('dt', pipe2), ('kn', pipe3)])

clf_labels += ['Majority voting']
all_clf = [pipe1, pipe2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

Accuracy: 0.77 Stdev: 0.014 [LogisticRegression]
Accuracy: 0.94 Stdev: 0.009 [Decision tree]
Accuracy: 0.94 Stdev: 0.011 [KNN]
Accuracy: 0.94 Stdev: 0.01 [Majority voting]


### Cross-validation result summary

The cross-validation results show that the logistic regression model had an accuracy of 0.77, with a standard deviation of 0.014. The decision tree and k-nearest neighbors models both had an accuracy of 0.94, with standard deviations of 0.009 and 0.011, respectively.

The majority voting ensemble had an accuracy of 0.94 with a standard deviation of 0.01, so it matched the two strongest individual models but did not improve on them. Of all the models, logistic regression had the lowest accuracy.


In [6]:
pipe1.fit(X_train, y_train)

y_pred = pipe1.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', np.array(y_test).shape[0])
print('Accuracy:', pipe1.score(X_test, y_test))

Misclassified test set examples: 644
Out of a total of: 2810
Accuracy: 0.7708185053380783


In [7]:
pipe2.fit(X_train, y_train)

y_pred = pipe2.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', np.array(y_test).shape[0])
print('Accuracy:', pipe2.score(X_test, y_test))

Misclassified test set examples: 170
Out of a total of: 2810
Accuracy: 0.9395017793594306


In [8]:
pipe3.fit(X_train, y_train)

y_pred = pipe3.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', np.array(y_test).shape[0])
print('Accuracy:', pipe3.score(X_test, y_test))

Misclassified test set examples: 165
Out of a total of: 2810
Accuracy: 0.9412811387900356


In [9]:
mv_clf.fit(X_train, y_train)

y_pred = mv_clf.predict(X_test)
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', np.array(y_test).shape[0])
print('Accuracy:', mv_clf.score(X_test, y_test))

Misclassified test set examples: 173
Out of a total of: 2810
Accuracy: 0.9384341637010676


### Summary of results from testing on the testing data

The results obtained on the test data are consistent with the cross-validation results: the test accuracies were about 0.771 (logistic regression), 0.940 (decision tree), 0.941 (KNN), and 0.938 (majority voting).

The logistic regression model, which had the lowest accuracy in cross-validation, also had the highest number of misclassified test set examples: 644 out of 2810, compared with 165 to 173 misclassified examples for each of the other models.
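
As a quick consistency check, the reported test accuracies can be recomputed from the misclassification counts, since accuracy = 1 - (misclassified / total). The sketch below uses only the counts printed in the cells above; the dictionary names are my own.

```python
n_test = 2810  # total number of test set examples

# Misclassified test set examples reported for each model.
misclassified = {
    "Logistic regression": 644,
    "Decision tree": 170,
    "KNN": 165,
    "Majority voting": 173,
}

# Accuracy is the fraction of test examples that were classified correctly.
accuracy = {name: 1 - m / n_test for name, m in misclassified.items()}

for name, acc in accuracy.items():
    print(f"{name}: {acc:.4f}")
```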