![image.png](pear.png)

***

# Welcome to Pear Inc. 

Hi there! 
My name is Robert! You can call me Bob 😉. <br>
I'm the communications officer (fancy title ha!) in our glorious company 😇.
My job is to help facilitate product development and market penetration 🤓. <br>
I spend endless hours talking to engineers, product managers, and customers 😱  

Since we are a 20-person start-up (all of us have fancy names 😂), I also do some recruiting from time to time 💪 <br>
We are looking for **brave souls who are not afraid of a challenge and will help us** with our new product line of smart t-shirts! 🧐 <br> 

(Our CEO believes that smart t-shirts are the right direction for some reason 😅 I guess if you make something nobody needs, you won't have to sell it 🤓) <br>

Let me tell you a little bit more about our problem that you can help us with:<br>
We are creating a life-changing smart t-shirt which has Bluetooth and connects to your phone 🥳. The t-shirts will be customizable through downloaded applications. Our smart t-shirt will be developed with Google Wear OS, which is a version of Google's Android operating system designed for smartwatches and other wearables. So users will be able to install custom programs through the Google Play Store 🤭. <br> And we will sell them for $999.9 apiece 💰💰💰<br>
But our engineers wanted to ensure that only Pear Inc. approved programs can be installed on our t-shirts, because
market analysis showed that potential customers are afraid of ransomware that will break their "*premium*" t-shirts 🤦. So we need an antivirus for approving apps on the fly! <br>However, we don't want to install an off-the-shelf antivirus on our t-shirts 🤫, because BIG profit margins matter 🏦!

##### Enough chit-chat!
Let's get down to the business of why I contacted you: <br>
Our bright engineers came up with an algorithm that creates compressed signatures for the apps in the Google Play Store. It is called '*manifold averaging generally intelligent compressor*' or as we call it 'MAGIC'. <br>
The engineers told us that the outputs of MAGIC reflect the statistical properties of the uncompressed apps (whatever that may mean! 🤦). <br> MAGIC takes a Google Play Store app as an input and outputs a 4-dimensional numerical signature (they called it a vector, but calling it a vector is not fancy enough for marketing! 🤪).   

Now, since these signatures are just numbers, an off-the-shelf antivirus can't work with them (even if it could, we can't install an off-the-shelf antivirus on our t-shirts -- too much computing power and space is needed). Therefore **we need a lightweight proof of concept that takes these signatures as inputs and outputs labels (virus or not) for them.** We eventually want to install your program on our smart t-shirts, where it will scan a Google Play Store app (its signature, to be precise!) and stop the app's execution if it thinks the app is a virus! But we are not going that far just yet, so you only need to create the pipeline that takes the signatures and outputs labels for them. Don't worry about the rest, it is just a proof of concept after all 😉. We are providing the dataset for you to develop your model.

In a nutshell: 
- There are 4-dimensional (4-feature) numerical inputs (signatures) with labels!
- We need a simple model that takes these inputs and labels them (Virus, Not a Virus)
- We also need you to evaluate your model. Choose any metric you want, but don't forget to explain why, since I don't know much about this field (that is why we need your help!)

Things to keep in mind:
- There are fewer 'Virus' samples in the dataset than 'Not a Virus' samples. (Naturally!)
- While we call it MAGIC, it still sometimes doesn't work well 🤦, so there are signatures with missing features (missing values).
- I don't know much about these things so please show your work, your thinking process and please make it as clear as possible, otherwise I get confused 😵. (Visualizations of the data and comments in your code would be great!)

***
##### Let me describe the dataset, and you are ready to get to work!

It is a CSV file. Each row represents a signature for an app. The first 4 columns from left to right represent dimensions (features) and the last column is the label (isVirus: True or False). 

- Visualize the data (so that people like me can understand!)
- Clean up the data (balance it out, impute missing values and so on... depending on the method you are going to use!)
- Visualize the cleaned data (so that people like me can understand the effect of the cleaning process!)
- Create a simple model that performs reasonably well. (If it doesn't perform well, comment on why and how to improve it!)
- Evaluate the model with a test set you will create from the dataset. (Pretty plots make things easier to understand)
- Upload your code to a private GitHub repo you can share with us, and invite us (https://github.com/tarikkranda, https://github.com/ltc0060 and https://github.com/ahmetkoklu) as collaborators so only we can see our super-secret project. 

And you are done! (Don't forget to comment, and show your work please 🤓)


### SOLUTION :


In [1]:
# Your code here!
import pandas as pd
import math
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
import random


In [2]:
data = pd.read_csv("dataset.csv")
data.head(10)

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,isVirus
0,-0.233467,0.308799,2.484015,1.732721,False
1,1.519003,1.238482,3.34445,0.783744,False
2,0.40064,1.916748,3.291096,-0.918519,False
3,-1.616474,0.209703,1.461544,-0.291837,False
4,1.480515,5.299829,2.64067,1.867559,True
5,1.239941,5.36427,1.279281,0.938585,True
6,0.003583,-0.027756,3.04873,,True
7,-0.286887,0.905702,1.924124,,True
8,-0.898322,-1.198319,0.694305,0.802052,True
9,-1.084037,0.509091,2.26816,0.35178,True


In [3]:
leng = len(data)
leng

1999

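We will use the number of rows to work out how many random samples must be drawn for each feature: the row count minus the number of non-null values is the number of missing entries to fill from that feature's one-$\sigma$ range.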

In [4]:
check_for_nan_1 = data['feature_1'].isnull().values.any()
print(check_for_nan_1)

True


In [5]:
check_for_nan_2 = data['feature_2'].isnull().values.any()
print(check_for_nan_2)

True


In [6]:
check_for_nan_3 = data['feature_3'].isnull().values.any()
print(check_for_nan_3)

True


In [7]:
check_for_nan_4 = data['feature_4'].isnull().values.any()
print(check_for_nan_4)

True


These four checks show that every feature contains NaN values that we need to eliminate. We will do that in the following cells.
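
The four separate checks can also be collapsed into a single call that reports how many values are missing per column. A minimal sketch, assuming a small made-up frame `toy` standing in for `data` (the values are illustrative and not taken from dataset.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "feature_1": [0.1, np.nan, 0.3, 0.4],
    "feature_2": [1.0, 2.0, np.nan, np.nan],
    "isVirus": [False, True, False, True],
})

# Number of missing values per column, computed in one pass.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Boolean per column: does it contain any NaN at all?
has_missing = toy.isnull().any()
print(has_missing)
```

The counts are also exactly the number of random values that have to be generated for each feature.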

#### Let's Start

##### feature_1

In [8]:
data.loc[data["feature_1"].notnull(),"feature_1"].head(5)

0   -0.233467
1    1.519003
2    0.400640
3   -1.616474
4    1.480515
Name: feature_1, dtype: float64

In [9]:
i_cout_f_1 = len(data.loc[data["feature_1"].notnull(),"feature_1"])
i_cout_f_1

1897

In [10]:
feature_1_non_null_array = []
for i in data.loc[data["feature_1"].notnull(),"feature_1"]:
    feature_1_non_null_array.append(i)
feature_1_non_null_array2 = pd.DataFrame(feature_1_non_null_array)
feature_1_non_null_array2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,-0.233467
1,1.519003
2,0.40064
3,-1.616474
4,1.480515


In [11]:
feature_1_non_null_array3 = np.array(feature_1_non_null_array)
f_1_field = az.hdi(feature_1_non_null_array3, hdi_prob=.68) # HDI 68%
f_1_field

array([-1.23482582,  2.00932907])

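This gives the 68% highest density interval (HDI) of feature_1, which corresponds to roughly a one-$\sigma$ (one standard deviation) range when the values are approximately normally distributed. We can draw enough random values from this range and insert them at the null positions.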
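
To see why a 68% interval is treated as a one-$\sigma$ range, the two can be compared on synthetic normal data. The sketch below is an assumption-laden illustration: `narrowest_interval` is a hypothetical from-scratch helper that searches for the shortest window holding 68% of the points, not the `az.hdi` implementation, and the sample is generated rather than read from the dataset.

```python
import random
import statistics

def narrowest_interval(values, prob=0.68):
    """Return the shortest interval that contains `prob` of the values."""
    ordered = sorted(values)
    n = len(ordered)
    window = max(1, int(round(prob * n)))  # number of points inside the interval
    # Slide a fixed-size window over the sorted values and keep the narrowest one.
    best_start = min(
        range(n - window + 1),
        key=lambda s: ordered[s + window - 1] - ordered[s],
    )
    return ordered[best_start], ordered[best_start + window - 1]

random.seed(0)
# Synthetic standard-normal sample (illustrative only).
sample = [random.gauss(0.0, 1.0) for _ in range(20000)]

low, high = narrowest_interval(sample, prob=0.68)
mean = statistics.fmean(sample)
std = statistics.pstdev(sample)

print("68% narrowest interval:", (round(low, 2), round(high, 2)))
print("mean - std, mean + std:", (round(mean - std, 2), round(mean + std, 2)))
```

For skewed or multi-modal features the two intervals can differ noticeably, so the one-$\sigma$ reading is only an approximation.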

In [12]:
new_values_feature_1 = []
for i in range(0,(leng-i_cout_f_1)):
    new_values_feature_1.append(random.uniform(min(f_1_field),max(f_1_field)))
new_values_feature_1_2 = pd.DataFrame(new_values_feature_1)
new_values_feature_1_2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,0.659527
1,-0.639666
2,0.127207
3,-1.097138
4,1.746139


In [13]:
list_1 = data.loc[data["feature_1"].isnull(),"feature_1"].index.tolist()
i_cout = 0
for i in list_1:
    # .loc writes into `data` directly and avoids chained assignment
    data.loc[i, "feature_1"] = new_values_feature_1[i_cout]
    i_cout += 1
i_cout = 0

data['feature_1'].isnull().values.any()


False

The check returns False, so every NaN in feature_1 has been replaced by a value drawn from the one-$\sigma$ range, which keeps the imputed values within the bulk of the observed data.
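
The same fill can be written without an explicit Python loop and reused for every feature. A minimal sketch, assuming a hypothetical helper `impute_uniform` and a made-up frame `toy`; the bounds reuse the interval printed for feature_1, and NumPy's generator is swapped in for the `random` module:

```python
import numpy as np
import pandas as pd

def impute_uniform(frame, column, low, high, rng):
    """Fill NaNs in `column` with uniform draws from [low, high], in place."""
    mask = frame[column].isnull()
    # One draw per missing entry, written back with a single .loc assignment.
    frame.loc[mask, column] = rng.uniform(low, high, size=int(mask.sum()))
    return frame

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({"feature_1": [0.5, np.nan, -0.2, np.nan, 1.1]})
rng = np.random.default_rng(42)  # seeded so the fill is reproducible
impute_uniform(toy, "feature_1", low=-1.23482582, high=2.00932907, rng=rng)
print(toy)
```

Calling the helper once per feature, each with its own interval, would replace the four near-identical sections of this notebook.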

Let's also look at the head and length of feature_1.

In [14]:
data['feature_1'].head(5)

0   -0.233467
1    1.519003
2    0.400640
3   -1.616474
4    1.480515
Name: feature_1, dtype: float64

In [15]:
len(data['feature_1'])

1999

The column has 1999 entries, matching the total number of rows, so together with the null check above this confirms that no row was dropped or left empty.

##### feature_2

In [16]:
data.loc[data["feature_2"].notnull(),"feature_2"].head(5)

0    0.308799
1    1.238482
2    1.916748
3    0.209703
4    5.299829
Name: feature_2, dtype: float64

In [17]:
i_cout_f_2 = len(data.loc[data["feature_2"].notnull(),"feature_2"])
i_cout_f_2

1899

In [18]:
feature_2_non_null_array = []
for i in data.loc[data["feature_2"].notnull(),"feature_2"]:
    feature_2_non_null_array.append(i)
feature_2_non_null_array2 = pd.DataFrame(feature_2_non_null_array)
feature_2_non_null_array2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,0.308799
1,1.238482
2,1.916748
3,0.209703
4,5.299829


In [19]:
feature_2_non_null_array3 = np.array(feature_2_non_null_array)
f_2_field = az.hdi(feature_2_non_null_array3, hdi_prob=.68) # HDI 68%
f_2_field

array([0.86193778, 3.58874241])

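This gives the 68% highest density interval (HDI) of feature_2, which corresponds to roughly a one-$\sigma$ (one standard deviation) range when the values are approximately normally distributed. We can draw enough random values from this range and insert them at the null positions.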

In [20]:
new_values_feature_2 = []
for i in range(0,(leng-i_cout_f_2)):
    new_values_feature_2.append(random.uniform(min(f_2_field),max(f_2_field)))
new_values_feature_2_2 = pd.DataFrame(new_values_feature_2)
new_values_feature_2_2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,1.553896
1,2.041037
2,2.576541
3,1.189264
4,2.871631


In [21]:
list_2 = data.loc[data["feature_2"].isnull(),"feature_2"].index.tolist()
i_cout = 0
for i in list_2:
    # .loc writes into `data` directly and avoids chained assignment
    data.loc[i, "feature_2"] = new_values_feature_2[i_cout]
    i_cout += 1
i_cout = 0

data['feature_2'].isnull().values.any()


False

The check returns False, so every NaN in feature_2 has been replaced by a value drawn from the one-$\sigma$ range, which keeps the imputed values within the bulk of the observed data.

Let's also look at the head and length of feature_2.

In [22]:
data['feature_2'].head(5)

0    0.308799
1    1.238482
2    1.916748
3    0.209703
4    5.299829
Name: feature_2, dtype: float64

In [23]:
len(data['feature_2'])

1999

The column has 1999 entries, matching the total number of rows, so together with the null check above this confirms that no row was dropped or left empty.

##### feature_3

In [24]:
data.loc[data["feature_3"].notnull(),"feature_3"].head(5)

0    2.484015
1    3.344450
2    3.291096
3    1.461544
4    2.640670
Name: feature_3, dtype: float64

In [25]:
i_cout_f_3 = len(data.loc[data["feature_3"].notnull(),"feature_3"])
i_cout_f_3

1893

In [26]:
feature_3_non_null_array = []
for i in data.loc[data["feature_3"].notnull(),"feature_3"]:
    feature_3_non_null_array.append(i)
feature_3_non_null_array2 = pd.DataFrame(feature_3_non_null_array)
feature_3_non_null_array2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,2.484015
1,3.34445
2,3.291096
3,1.461544
4,2.64067


In [27]:
feature_3_non_null_array3 = np.array(feature_3_non_null_array)
f_3_field = az.hdi(feature_3_non_null_array3, hdi_prob=.68) # HDI 68%
f_3_field 

array([1.16997145, 3.75770094])

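This gives the 68% highest density interval (HDI) of feature_3, which corresponds to roughly a one-$\sigma$ (one standard deviation) range when the values are approximately normally distributed. We can draw enough random values from this range and insert them at the null positions.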

In [28]:
new_values_feature_3 = []
for i in range(0,(leng-i_cout_f_3)):
    new_values_feature_3.append(random.uniform(min(f_3_field),max(f_3_field)))
new_values_feature_3_2 = pd.DataFrame(new_values_feature_3)
new_values_feature_3_2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,2.61791
1,2.488602
2,1.23523
3,2.190789
4,2.02502


In [29]:
list_3 = data.loc[data["feature_3"].isnull(),"feature_3"].index.tolist()
i_cout = 0
for i in list_3:
    # .loc writes into `data` directly and avoids chained assignment
    data.loc[i, "feature_3"] = new_values_feature_3[i_cout]
    i_cout += 1
i_cout = 0

data['feature_3'].isnull().values.any()


False

The check returns False, so every NaN in feature_3 has been replaced by a value drawn from the one-$\sigma$ range, which keeps the imputed values within the bulk of the observed data.

Let's also look at the head and length of feature_3.

In [30]:
data['feature_3'].head(5)

0    2.484015
1    3.344450
2    3.291096
3    1.461544
4    2.640670
Name: feature_3, dtype: float64

In [31]:
len(data['feature_3'])

1999

The column has 1999 entries, matching the total number of rows, so together with the null check above this confirms that no row was dropped or left empty.

##### feature_4

In [32]:
data.loc[data["feature_4"].notnull(),"feature_4"].head(5)

0    1.732721
1    0.783744
2   -0.918519
3   -0.291837
4    1.867559
Name: feature_4, dtype: float64

In [33]:
i_cout_f_4 = len(data.loc[data["feature_4"].notnull(),"feature_4"])
i_cout_f_4

1897

In [34]:
feature_4_non_null_array = []
for i in data.loc[data["feature_4"].notnull(),"feature_4"]:
    feature_4_non_null_array.append(i)
feature_4_non_null_array2 = pd.DataFrame(feature_4_non_null_array)
feature_4_non_null_array2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,1.732721
1,0.783744
2,-0.918519
3,-0.291837
4,1.867559


In [35]:
feature_4_non_null_array3 = np.array(feature_4_non_null_array)
f_4_field = az.hdi(feature_4_non_null_array3, hdi_prob=.68) # HDI 68%
f_4_field

array([-1.0585719,  2.2004065])

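This gives the 68% highest density interval (HDI) of feature_4, which corresponds to roughly a one-$\sigma$ (one standard deviation) range when the values are approximately normally distributed. We can draw enough random values from this range and insert them at the null positions.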

In [36]:
new_values_feature_4 = []
for i in range(0,(leng-i_cout_f_4)):
    new_values_feature_4.append(random.uniform(min(f_4_field),max(f_4_field)))
new_values_feature_4_2 = pd.DataFrame(new_values_feature_4)
new_values_feature_4_2.head(5) # just using this to make visualization better.

Unnamed: 0,0
0,1.764913
1,-1.015146
2,-0.254404
3,-0.198851
4,0.103418


In [37]:
list_4 = data.loc[data["feature_4"].isnull(),"feature_4"].index.tolist()
i_cout = 0
for i in list_4:
    # .loc writes into `data` directly and avoids chained assignment
    data.loc[i, "feature_4"] = new_values_feature_4[i_cout]
    i_cout += 1
i_cout = 0

data['feature_4'].isnull().values.any()


False

The check returns False, so every NaN in feature_4 has been replaced by a value drawn from the one-$\sigma$ range, which keeps the imputed values within the bulk of the observed data.

Let's also look at the head and length of feature_4.

In [38]:
data['feature_4'].head(5)

0    1.732721
1    0.783744
2   -0.918519
3   -0.291837
4    1.867559
Name: feature_4, dtype: float64

In [39]:
len(data['feature_4'])

1999

The column has 1999 entries, matching the total number of rows, so together with the null check above this confirms that no row was dropped or left empty.

#### General Comment

In [40]:
data.isnull().values.any() 

False

So all the missing values have now been filled in. Let's look at the first rows of the dataset.

In [41]:
data.head(20)

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,isVirus
0,-0.233467,0.308799,2.484015,1.732721,False
1,1.519003,1.238482,3.34445,0.783744,False
2,0.40064,1.916748,3.291096,-0.918519,False
3,-1.616474,0.209703,1.461544,-0.291837,False
4,1.480515,5.299829,2.64067,1.867559,True
5,1.239941,5.36427,1.279281,0.938585,True
6,0.003583,-0.027756,3.04873,1.764913,True
7,-0.286887,0.905702,1.924124,-1.015146,True
8,-0.898322,-1.198319,0.694305,0.802052,True
9,-1.084037,0.509091,2.26816,0.35178,True


Now let's move on with the predictive model training and creation.

#### Model Creation

In this part we select a model to predict whether a signature is a virus (isVirus). I prefer Naive Bayes because it is easy to build and its probabilistic approach is well understood. Logistic regression or a perceptron would also be reasonable choices for linearly separable data, but for most cases Naive Bayes feels more natural from a statistics point of view.

So I will use a Naive Bayes model and evaluate it with cross-validation, training and testing on different folds of the same dataset.
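
To make the choice less of a black box, here is roughly what a Gaussian Naive Bayes classifier computes: per-class priors, per-class feature means and variances, and then the class with the highest log posterior. This is a from-scratch sketch on synthetic data, not scikit-learn's implementation; `fit_gaussian_nb`, `predict_gaussian_nb` and the toy arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small fraction of the largest feature variance is added for stability.
    eps = var_smoothing * X.var(axis=0).max()
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class with the highest log prior plus log likelihood."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]          # shape (samples, classes, features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik      # shape (samples, classes)
    return classes[np.argmax(log_post, axis=1)]

# Synthetic, well-separated 4-feature data (illustrative only).
rng = np.random.default_rng(0)
X_clean = rng.normal(0.0, 1.0, size=(200, 4))
X_virus = rng.normal(6.0, 1.0, size=(50, 4))
X_toy = np.vstack([X_clean, X_virus])
y_toy = np.array([False] * 200 + [True] * 50)

nb = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(nb, X_toy)
print("training accuracy on toy data:", np.mean(pred == y_toy))
```

The model stores only a prior, a mean and a variance per class and feature, which is why it suits a device with little memory and compute.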

In [42]:
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection # train_test_split, KFold ,cross_val_score, GridSearchCV

In [43]:
model = GaussianNB()

In [44]:
df = data.copy()
df.head(5)

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,isVirus
0,-0.233467,0.308799,2.484015,1.732721,False
1,1.519003,1.238482,3.34445,0.783744,False
2,0.40064,1.916748,3.291096,-0.918519,False
3,-1.616474,0.209703,1.461544,-0.291837,False
4,1.480515,5.299829,2.64067,1.867559,True


In [45]:
#df.iloc[:, :-1]
#df.iloc[:, -1] 

In [46]:
X = df.iloc[:, :-1] # choose all rows and first 4 columns (excluding the last)
y = df.iloc[:, -1] # choose all rows and only last column

#### Model

In [47]:
# simple train test split
X_train, X_test,y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)

GaussianNB()

In [48]:
preds = model.predict(X_test)

In [49]:
# Stratified K-fold cross-validation to evaluate the model
cv = model_selection.StratifiedKFold(n_splits=5, shuffle = True) # 5 stratified, shuffled folds
scores = model_selection.cross_val_score(model, X, y, cv=cv, scoring='accuracy') # fits and scores the model once per fold
# and returns an array of 5 accuracy scores

In [50]:
scores1 = model_selection.cross_val_score(model, X, y, cv=cv, scoring='f1') # fits and scores the model once per fold
# and returns an array of 5 F1 scores

In [51]:
# hyperparameter tuning for GaussianNB
param_dic = {'var_smoothing': (1e-8, 1e-9, 1e-10)}
gridsearch = model_selection.GridSearchCV(estimator=model, param_grid=param_dic, cv=5, scoring='f1') 
# param_grid maps each hyperparameter to the values to try
# cv=5 evaluates every combination with 5-fold cross-validation, scored by F1

In [52]:
gridsearch.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=GaussianNB(),
             param_grid={'var_smoothing': (1e-08, 1e-09, 1e-10)}, scoring='f1')

In [53]:
gridsearch.best_params_ # the combination that yields the best F1 score

{'var_smoothing': 1e-08}

In [54]:
# cross-validation accuracy scores for the five folds
for i in scores:
    print("acc: {0:.2f}%".format(i*100))

acc: 66.25%
acc: 63.25%
acc: 70.00%
acc: 67.00%
acc: 74.94%


In [55]:
# cross-validation F1 scores for the five folds
for i in scores1:
    print("f1: {0:.2f}".format(i))

f1: 0.64
f1: 0.61
f1: 0.65
f1: 0.62
f1: 0.66
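
Accuracy and F1 are reported separately because, with fewer 'Virus' than 'Not a Virus' samples, accuracy alone can look good even when viruses are missed. A minimal sketch with made-up labels (not taken from the dataset); `binary_metrics` is a hypothetical helper that treats `True` (virus) as the positive class:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive (True) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)               # viruses correctly flagged
    tn = sum((not t) and (not p) for t, p in pairs)   # clean apps correctly passed
    fp = sum((not t) and p for t, p in pairs)         # clean apps wrongly flagged
    fn = sum(t and (not p) for t, p in pairs)         # viruses that slipped through
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up example: 8 clean apps and 2 viruses.
y_true = [False] * 8 + [True] * 2
always_clean = [False] * 10            # a classifier that never flags anything
one_caught = [False] * 8 + [True, False]  # a classifier that catches one virus

print(binary_metrics(y_true, always_clean))
print(binary_metrics(y_true, one_caught))
```

The classifier that never flags anything still reaches 80% accuracy on these labels while its F1 is zero, which is the failure mode F1 is meant to expose.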


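As a final comment: the missing values were imputed uniformly within the 68% HDI (roughly one $\sigma$) of each feature. The five-fold cross-validated accuracy averages about 68% (63.25% to 74.94%) and the F1 score about 0.64 (0.61 to 0.66). This accuracy is an empirical result and does not follow from the 68% interval used for imputation. Two changes might improve the scores: dropping the rows with missing values instead of imputing them, or widening the uniform imputation range to roughly 2$\sigma$ or 3$\sigma$ so that the filled-in values cover more of each feature's spread.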